Logstash and Bonsai and bots, oh my!

Sometimes we're asked if we support Logstash. The relationships between Elasticsearch, Logstash and Kibana (often referred to as the ELK stack) can sometimes foster a confusing mental model, and a simple "yes" from our support staff probably isn't enough to help users get up and running quickly. In this post, we'd like to unravel some of the mystery around the ELK stack and show why it is so popular.

We whipped up a quick example showing how to set up and use Logstash with Bonsai Elasticsearch and a Kibana 4 instance. This example takes a real world use case and walks the reader through the process of setting up Logstash to pass a particular type of event to Bonsai for indexing, where it can then be analyzed in Kibana. The intent is to help new users get up to speed quickly, without needing to do a lot of Googling for answers.

So what is Logstash?

Logstash is an awesome tool for log parsing. It can accept inputs from a number of different sources, apply a series of filters to the incoming data, then pass the results to a data store. Logstash can do many other things, but that is its primary function. In combination with Elasticsearch (for data storage, aggregation and retrieval) and Kibana (for analysis and visualization), it gives users a very nice high-level view of a data set.

A real world example

In this example, we'll look at the SSH daemon running on an Ubuntu server. SSH logs, particularly failed login attempts, can be extremely useful in investigating break-in attempts, identifying unusual traffic patterns, and performing regular security audits. It's part of a proactive security policy, and critical if you're a sysadmin in charge of dozens or hundreds of servers.

We could use other input contexts, but since the 'net is rife with automated attack traffic, we think security is particularly interesting. If you have a machine connected to the Internet, you will eventually be hit by someone. Hackers use bots to scan the IPv4 space looking for particular ports — 21 for FTP, 22 for SSH, 23 for Telnet, and so on — and when a server responds to a request on a given port, the bot will try everything it can to log in to the system.

Generally, the bot doesn't know anything about the target, so it just tries a bunch of random usernames and passwords. By studying this behavior, we can perform some counterintelligence gathering on the hacker community and use the ELK stack to aid in our reconnaissance. For another fun read, see how the developer of Elastichoney uses a similar approach to learn how hackers try to exploit Groovy scripting vulnerabilities in Elasticsearch.

The Game Plan

We'll pass the system logs into Logstash, run some filters to make sure we're only seeing what we want, send the results along to Bonsai, and finally examine the data using Kibana. For the purposes of this example, we'll make a few assumptions:

SSH activity is being logged to /var/log/auth.log. This is the default setting for Amazon's Ubuntu images, which have not (yet) switched over to systemd, although you can use systemd with Logstash if you like.
You have the ability to install Logstash on the server(s), and have already done so. There are ways to do it with a package manager, and Elastic offers Logstash as a .deb on their downloads page, so it can be pretty easily installed. Read their documentation if you need more help getting it installed wherever you're trying to use it. In preparing this example, I downloaded the .deb to an Ubuntu machine, so the instructions are based on that. If you use a package manager or another OS, some files might end up in other places and some commands may need minor tweaks.
You already have a Bonsai Elasticsearch cluster provisioned. In this example, the URL of that cluster will be http://user123:pass456@logstash-12345.us-east-1.bonsai.io
You have a Kibana 4 instance set up somewhere. Need more info on this? We've got you covered.

Laying the groundwork

Logstash can accept streams from a large number of inputs. They also have an ever-expanding list of input plugins to support more streams and file types. One requirement that all inputs have in common is permissions; that is, Logstash must have permission to access the stream in order to read from it. This probably sounds trivial, but it is a pretty common "gotcha," especially when dealing with daemon log files.

When Logstash is installed from the .deb, the package creates a user called logstash. This is the user that will need to be able to access the log file. We can see the permissions of the auth.log file with ls -lAh /var/log/auth.log:

-rw-r----- 1 syslog adm 6.4M Jan 20 17:24 /var/log/auth.log

We don't want to loosen these permissions: the file records authentication activity, and making it readable by everyone would be poor practice and potentially dangerous. What we can do instead is add the logstash user to the adm group. The adm group typically grants members read access to log files and nothing else. Joining this group gives the Logstash process the ability to read the file, but not modify it in any way. This is the minimum access needed to do the job, and it doesn't expose security details to other programs or users on the server, so it's the route we'll take (the -a flag appends adm to the user's existing groups instead of replacing them):

$ sudo usermod -a -G adm logstash
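
To see why group membership is enough, here is a minimal Python sketch of the read check implied by the ls output above. The function name can_read is hypothetical, and the model is deliberately simplified: it ignores root, ACLs, and directory permissions.

```python
import stat

def can_read(mode, owner, group, user, user_groups):
    """Simplified Unix read check: owner bits, then group bits, then other bits."""
    if user == owner:
        return bool(mode & stat.S_IRUSR)
    if group in user_groups:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)

# -rw-r----- owned by syslog:adm, as in the ls output above
AUTH_LOG_MODE = 0o640

# Before joining adm, the logstash user falls through to the "other" bits.
print(can_read(AUTH_LOG_MODE, "syslog", "adm", "logstash", {"logstash"}))         # False
# After joining adm, the group read bit applies.
print(can_read(AUTH_LOG_MODE, "syslog", "adm", "logstash", {"logstash", "adm"}))  # True
```

Note that a running Logstash process only picks up the new group after it is restarted.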

The last thing we need to do is create an index on our Bonsai cluster for Logstash to send data to. We'll call this index logstash:

$ curl -XPUT http://user123:pass456@logstash-12345.us-east-1.bonsai.io/logstash

Configuring Logstash

Now that we know Logstash will be able to read the input stream, we'll need to configure it. Go ahead and edit /etc/logstash/conf.d/logstash.conf:

$ sudo nano /etc/logstash/conf.d/logstash.conf

Copy/paste the settings below (making sure to substitute the correct values for your cluster URL):

input {     
    file {         
        path => "/var/log/auth.log"         
        start_position => beginning         
        sincedb_path => "/dev/null"     
    } 
}  

output {     
    elasticsearch {         
        hosts     => ["logstash-12345.us-east-1.bonsai.io:443"]         
        user      => "user123"         
        password  => "pass456"         
        ssl       => true         
        index     => "logstash"     
    } 
}
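
If you're unsure which part of your cluster URL goes where, the following Python sketch splits a URL into the values used in the elasticsearch output above. The helper name logstash_output_settings is hypothetical, and defaulting to port 443 is an assumption that simply mirrors the config in this post.

```python
from urllib.parse import urlsplit

def logstash_output_settings(cluster_url, index="logstash"):
    """Split a cluster URL into the values used by the elasticsearch output block."""
    parts = urlsplit(cluster_url)
    port = parts.port or 443  # assumption: use HTTPS port unless the URL names one
    return {
        "hosts": ["{}:{}".format(parts.hostname, port)],
        "user": parts.username,
        "password": parts.password,
        "ssl": True,
        "index": index,
    }

settings = logstash_output_settings(
    "http://user123:pass456@logstash-12345.us-east-1.bonsai.io"
)
for key, value in settings.items():
    print(key, "=>", value)
```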

Finally, make sure that Logstash has permission to read its own config file:

$ sudo chown logstash /etc/logstash/conf.d/logstash.conf

Test it out!

We've just completed the bare minimum necessary to get Logstash up and running. We've given Logstash access to a log file, defined it as an input stream, created a place in Elasticsearch for the stream to go, and defined that index as an output. Now all we need to do is run it and make sure it works. When I installed the .deb for Logstash, the one thing it didn't do was symlink the binary, so I'll just use the full path. YMMV:

$ /opt/logstash/bin/logstash agent -f /etc/logstash/conf.d/logstash.conf
Logstash startup completed

Now, if everything worked as expected, I should be able to see the data in my cluster:

$ curl "http://user123:pass456@logstash-12345.us-east-1.bonsai.io/logstash/_search?pretty&size=1"
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1975,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "logstash",
      "_type" : "logs",
      "_id" : "AVJdzwbLfbt0kfwq_Tzs",
      "_score" : 1.0,
      "_source":{"message":"Jan 17 04:30:52 ip-XXX-XXX-XXX-XXX sshd[12170]: Invalid user cms from XXX.XXX.XXX.XXX","@version":"1","@timestamp":"2016-01-17T11:30:52.000Z","host":"localhost","path":"/var/log/auth.log"}
    } ]
  }
}

Neat! Let's look at this in Kibana:

So, a partial success. We can see data in the index, and it is coming out of the auth.log file as expected. But the timestamp associated with each event is not when it occurred in real time, but when it was indexed, so Kibana piles everything into a single bar. Not very helpful.

This is a pretty common gotcha as well. I included it here to underscore it as a pitfall (and maybe save everyone a few support tickets). It's easily addressed; we just need to do a bit more work.

A more complex setup

Let's go ahead and delete the logstash index on our cluster:

$ curl -XDELETE http://user123:pass456@logstash-12345.us-east-1.bonsai.io/logstash

With that out of the way, we can focus on refining our Logstash configuration. We'll use some of the more popular filters to extract the most important details, and add those to fields that we can later query and display with Kibana.

Before we do that, let's think about what our goal is here. We want to look for failed login attempts over SSH; we're not interested in cron tasks or sudo sessions, or anything else that commonly gets logged to auth.log. Let's take a look at a sample line:

Jan 17 04:30:52 ip-XXX-XXX-XXX-XXX sshd[12170]: Invalid user cms from XXX.XXX.XXX.XXX

What kind of data do we have there? I see a date (in a non-standard format, boo!), a private IP address (useful for identifying what server is being hit), as well as a probe at a user name ("cms," really?) and an IP address from where the attempt originated.

So we want to disregard lines that don't match this pattern, and extract some useful information from lines that do match the pattern. We'll also have to do something about that date: it will need to be in a standard format to be useful in Kibana. It would also be cool if we could perform some further analysis on the originating IP address, to see if there are any interesting trends. It would be good to know who's hitting us and where they're coming from.

Fortunately, Logstash has filters that allow us to do all of this:

Grok. The grok filter allows you to use regular expressions to parse log entries. It has a number of built-in patterns, and allows you to arbitrarily define your own. This is how we'll extract the lines of interest.
Date. The date filter allows you to extract timestamps from your logs, then use those as the Logstash timestamp for the event. This will fix that "single bar" problem in Kibana.
GeoIP. The geoip filter allows you to take an IP address, run it through the GeoLiteCity database, and save geographic information to a new field. We can use this info in Kibana to make a heat map of where the nefarious traffic is originating.
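
To get a feel for what the grok step does before touching the Logstash config, here is a rough Python equivalent using named capture groups. It is only a sketch: the field names match the ones that show up in the indexed documents later in this post, the hostname and IP addresses in the sample lines are made-up placeholders, and the IPv4-only expression is a simplification (grok's %{IP} also accepts IPv6).

```python
import re

# Rough stand-in for the grok expression; not the pattern Logstash itself uses.
FAILED_LOGIN = re.compile(
    r"(?P<logdate>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<private_ip>ip-\d{1,3}-\d{1,3}-\d{1,3}-\d{1,3}) "
    r"sshd\[\d+\]: "
    r"(?P<sshd_message>Invalid user (?P<invalid_user>[a-z0-9]+) "
    r"from (?P<invalid_ip>\d{1,3}(?:\.\d{1,3}){3}))"
)

def parse_line(line):
    """Return the extracted fields, or None for lines we would drop."""
    match = FAILED_LOGIN.search(line)
    return match.groupdict() if match else None

# Placeholder host and addresses, for illustration only.
hit = "Jan 17 04:30:52 ip-10-0-0-5 sshd[12170]: Invalid user cms from 203.0.113.7"
noise = "Jan 17 04:31:01 ip-10-0-0-5 CRON[12180]: pam_unix(cron:session): session opened for user root"

print(parse_line(hit))
print(parse_line(noise))  # None, the equivalent of a _grokparsefailure
```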

Update the Logstash configuration file to this, again changing the Elasticsearch output to match your Bonsai URL:

input {     
    file {         
        path => "/var/log/auth.log"         
        start_position => beginning     
    } 
}  

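filter {
    # Keep only failed SSH logins and capture the interesting parts as named fields.
    grok {
        match => { "message" => "(?<logdate>%{MONTH} +%{MONTHDAY} %{TIME}) (?<private_ip>ip-[0-9]{1,3}-[0-9]{1,3}-[0-9]{1,3}-[0-9]{1,3}) sshd\[[0-9]{1,}\]: (?<sshd_message>Invalid user (?<invalid_user>[a-z0-9]{1,}) from (?<invalid_ip>%{IP}))"
        }
    }
    # Anything that did not match the pattern above is noise for our purposes.
    if "_grokparsefailure" in [tags] {
        drop { }
    }
    # Use the time from the log line, not the time of indexing, as the event timestamp.
    # The second format covers single-digit days, which syslog pads with a space.
    date {
        match => [ "logdate", "MMM dd HH:mm:ss", "MMM  d HH:mm:ss" ]
    }
    # Look up the attacker's address and store the result under the geoip field.
    geoip {
        source => "invalid_ip"
        target => "geoip"
        database => "/etc/logstash/conf.d/GeoLiteCity.dat"
    }
}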

output {     
    elasticsearch {         
        hosts     => ["logstash-12345.us-east-1.bonsai.io:443"]         
        user      => "user123"         
        password  => "pass456"         
        ssl       => true         
        index     => "logstash"     
    } 
}
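
The date step is worth a closer look, because syslog timestamps carry neither a year nor a time zone, so something has to fill those in. The Python sketch below shows the same conversion by hand. The year 2016 and the UTC-7 offset are assumptions inferred from the sample output in this post (04:30:52 in the log line, 11:30:52Z in @timestamp), and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def syslog_to_utc(logdate, year, utc_offset_hours):
    """Turn a syslog-style 'Jan 17 04:30:52' into an aware UTC datetime."""
    # %b expects English month abbreviations under the default C locale.
    naive = datetime.strptime("{} {}".format(year, logdate), "%Y %b %d %H:%M:%S")
    local = naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return local.astimezone(timezone.utc)

# Assumed year and offset; neither appears in the log line itself.
print(syslog_to_utc("Jan 17 04:30:52", 2016, -7).isoformat())
# 2016-01-17T11:30:52+00:00
```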

We'll need to make sure that the GeoLiteCity.dat file is in its proper place:

$ cd /etc/logstash/conf.d/ 
$ sudo curl -O "http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz" 
$ sudo gunzip GeoLiteCity.dat.gz 
$ sudo chown logstash GeoLiteCity.dat

Finally, we need to make one little tweak to our index before running Logstash. As of this writing, there is a bug where Elasticsearch will dynamically map the geoip.location as a float instead of a geo_point, which makes it impossible for Kibana to create a tile map with the information. We just need to explicitly set this field up first. We'll do that when we recreate the index:

$ curl -XPUT http://user123:pass456@logstash-12345.us-east-1.bonsai.io/logstash -d '{"mappings":{"logs":{"properties":{"geoip":{"properties":{"location":{"type":"geo_point"}}}}}}}'
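
Counting braces in a one-line JSON body is error-prone, so it can help to generate it. This Python sketch builds the same mapping as the curl command above; the helper name geo_point_mapping is hypothetical, and the "logs" type and "geoip" field are simply the names used in this post.

```python
import json

def geo_point_mapping(doc_type="logs", field="geoip"):
    """Build an index body that maps <field>.location as a geo_point."""
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    field: {
                        "properties": {
                            "location": {"type": "geo_point"}
                        }
                    }
                }
            }
        }
    }

# Compact separators give a body suitable for pasting after curl's -d flag.
body = json.dumps(geo_point_mapping(), separators=(",", ":"))
print(body)
```

When a geo_point is supplied as an array, as the geoip filter does here, Elasticsearch reads it as [longitude, latitude].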

Let's spin up Logstash again and see what happens:

$ /opt/logstash/bin/logstash agent -f /etc/logstash/conf.d/logstash.conf
Settings: Default filter workers: 2
Logstash startup completed

If we check on the data again, we see:

$ curl "http://user123:pass456@logstash-12345.us-east-1.bonsai.io/logstash/_search?pretty&size=1"
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 144,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "logstash",
      "_type" : "logs",
      "_id" : "AVJggJNEfbt0kfwq_U0P",
      "_score" : 1.0,
      "_source":{"message":"Jan 17 04:30:52 ip-XXX-XXX-XXX-XXX sshd[12170]: Invalid user cms from XXX.XXX.XXX.XXX","@version":"1","@timestamp":"2016-01-17T11:30:52.000Z","host":"localhost","path":"/var/log/auth.log","logdate":"Jan 17 04:30:52","private_ip":"ip-XXX-XXX-XXX-XXX","sshd_message":"Invalid user cms from XXX.XXX.XXX.XXX","invalid_user":"cms","invalid_ip":"XXX.XXX.XXX.XXX","geoip":{"ip":"XXX.XXX.XXX.XXX","country_code2":"CN","country_code3":"CHN","country_name":"China","continent_code":"AS","region_name":"12","city_name":"Wuhan","latitude":30.580099999999987,"longitude":114.27339999999998,"timezone":"Asia/Shanghai","real_region_name":"Hubei","location":[114.27339999999998,30.580099999999987]}}
    } ]
  }
}

Nice! So our filters worked. We've dropped from almost 2K lines to just 144. That's a big reduction in noise. Let's look in Kibana:

Great, so the timestamps are correct as well. We can see a big spike in activity at one point. From a security standpoint, that would be something to investigate. Ideally, it would also be something we could wire up to an alarm and page a sysadmin to let them know there might be an intrusion attempt.
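
As a rough idea of what such an alarm could look at, this Python sketch buckets event timestamps by hour and flags busy hours. The function names, the sample timestamps, and the threshold are all made up for illustration; a real alert would more likely be driven by a query against the cluster.

```python
from collections import Counter
from datetime import datetime

def hourly_counts(timestamps):
    """Count events per hour from strings shaped like Logstash's @timestamp."""
    counts = Counter()
    for ts in timestamps:
        parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
        counts[parsed.strftime("%Y-%m-%dT%H:00Z")] += 1
    return counts

def spikes(counts, threshold):
    """Return the hours whose event count reaches the threshold."""
    return sorted(hour for hour, n in counts.items() if n >= threshold)

# Hypothetical sample data
SAMPLE = [
    "2016-01-17T11:30:52.000Z",
    "2016-01-17T11:31:10.000Z",
    "2016-01-17T11:45:00.000Z",
    "2016-01-17T13:02:41.000Z",
]

print(spikes(hourly_counts(SAMPLE), threshold=3))
```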

But let's take it further. We can create a whole dashboard around this. We can create a Tile Map using the geoip.location field and see where these requests are originating. We can also look at stats like the user account that was probed; this might help us to determine if these attacks are some random bot or script kiddie, or if there is a more sophisticated adversary targeting specific systems.
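
The "which accounts are being probed" question boils down to a frequency count over the invalid_user field. Here is a small Python sketch of that idea, run over a hypothetical list of event documents rather than a live cluster; the helper name top_usernames is also made up.

```python
from collections import Counter

def top_usernames(events, n=3):
    """Return the n most frequently probed usernames with their counts."""
    counts = Counter(e["invalid_user"] for e in events if "invalid_user" in e)
    return counts.most_common(n)

# Hypothetical events, shaped like the _source documents shown earlier
EVENTS = [
    {"invalid_user": "cms"},
    {"invalid_user": "admin"},
    {"invalid_user": "admin"},
    {"invalid_user": "test"},
    {"invalid_user": "admin"},
    {"invalid_user": "cms"},
]

print(top_usernames(EVENTS, n=2))
```

A long tail of generic names like these would point to an untargeted bot, while repeated probes of accounts that really exist on your systems would deserve a closer look.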

I'm sure you could think of even more ways to use this data. For example, imagine that the source data is coming from multiple servers in your network. You could see which ones are being hit hardest and take corrective action. You could set up an alert that notifies you if someone logs in from an IP address overseas — perhaps they've been compromised. There is a wealth of intelligence that can be gleaned from looking at a setup like this.

Wrapping up

People use Logstash for all sorts of things. It really is an amazing tool, and we wish it had a larger following. If you're using Logstash and want to host your Elasticsearch cluster with Bonsai, we'd love to work with you. We fully support both Logstash and Kibana, so…

Give us a shout!

Got a cool use case you'd like to share? Maybe an anecdote or useful tip? Did you find this article tedious and boring? We'd like to hear about that too. Hit us up at [email protected] and share. You can also ping us at @bonsaisearch on Twitter if that's something you're into.
