Introduction

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Being highly configurable and very extensible means that there are many options and thus many decisions that need to be made by an operator. This document is a "cookbook" with "recipes" for getting Flume up and running quickly without being overwhelmed with all the details.

This document starts with some basic tips on experimenting with Flume nodes, then presents some stories about setting up agents for different kinds of sources. Finally, it covers a quick setup for collectors and how to troubleshoot distributed deployments.

  • Basic debugging with one-shot Flume nodes

  • Flume agents for syslog data

  • Flume agents for scribe logging

Trying out Flume sources and sinks

First, this section describes some basic tips for testing sources, sinks, and logical nodes.

Testing sources

Testing sources is a straightforward process. The flume script has a dump command that allows you to test a data source configuration by displaying its data at the console.

flume dump source
Note
source needs to be a single command line argument, so you may need to wrap it in 'single quotes' or "double quotes" if it contains quotes or parentheses. Using single quotes allows you to use unescaped double quotes inside the configuration argument. (ex: 'text("/tmp/foo")' or "text(\"/tmp/foo\")").

Here are some simple examples to try:

$ flume dump console
$ flume dump 'text("/path/to/file")'
$ flume dump 'file("/path/to/file")'
$ flume dump syslogTcp
$ flume dump 'syslogUdp(5140)'
$ flume dump 'tail("/path/to/file")'
$ flume dump 'tailDir("/path/to/dir", "fileregex")'
$ flume dump 'rpcSource(12346)'

Under the covers, this dump command is actually running the flume node_nowatch command with some extra command line parameters.

flume node_nowatch -1 -s -n dump -c "dump: $1 | console;"
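
For example, under this translation, flume dump 'syslogUdp(5140)' is equivalent to:

flume node_nowatch -1 -s -n dump -c "dump: syslogUdp(5140) | console;"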

Here’s a summary of what the options mean.

-1

One-shot execution. The node instance does not use the heartbeating mechanism to get a configuration from the master.

-s

Starts the Flume node without starting the HTTP status web server.

-c "node:src|snk;"

Starts the node with the given configuration definition. NOTE: If not using -1, this will be invalidated upon the first heartbeat to the master.

-n node

Gives the node the physical name node.

You can get info on all of the Flume node commands by using this command:

$ flume node -h

Testing sinks and decorators

Now that you can test sources, only one more step is necessary to test arbitrary sinks and decorators from the command line. A sink requires data to consume, so common sources used to generate test data include synthetic datasets (asciisynth), the console (console), and files (text).

For example, you can use a synthetic source to generate 1000 events, each of 100 "random" bytes, and write them to the console. You could use a text source to read a file like /etc/services, or you could use the console as a source and interactively enter lines of text as events:

$ flume node_nowatch -1 -s -n dump -c 'dump: asciisynth(1000,100) | console;'
$ flume node_nowatch -1 -s -n dump -c 'dump: text("/etc/services") | console;'
$ flume node_nowatch -1 -s -n dump -c 'dump: console | console;'

You can also use decorators on the sinks you specify. For example, you could rate-limit the amount of data pushed from the source to the sink by inserting a delay decorator. In this case, the delay decorator waits 100 ms before sending each synthesized event to the console.

$ flume node_nowatch -1 -s -n dump -c 'dump: asciisynth(1000,100) | { delay(100) => console};'

Using the command line, you can send events via the direct best-effort (BE) or disk-failover (DFO) agents. The example below uses the console source so you can interactively generate data to send in BE and DFO mode to collectors.

Note
flume node_nowatch must be used when piping data into a Flume node’s console source. The watchdog program does not forward stdin.
$ flume node_nowatch -1 -s -n dump -c 'dump: console | agentBESink("collectorHost");'
$ flume node_nowatch -1 -s -n dump -c 'dump: console | agentDFOSink("collectorHost");'
Warning
Since these nodes are executed with configurations entered at the command line and never contact the master, they cannot use the automatic chains or logical node translations. Currently, the acknowledgements used in E2E mode go through the master piggy-backed on node-to-master heartbeats. Since this mode does not heartbeat, E2E mode should not be used.

Console sources are useful because we can pipe data into Flume directly. The next example pipes data from a program into Flume which then delivers it.

$ <external process> | flume node_nowatch -1 -n foo -c 'foo:console|agentBESink("collector");'

Ideally, you could write data to a named pipe and have Flume read it directly using text or tail. Unfortunately, this version of Flume’s text and tail sources are not currently compatible with named pipes in a Linux environment. However, you can pipe data to a Flume node listening on the stdin console:

$ tail -f namedpipe | flume node_nowatch -1 -n foo -c 'foo:console|agentBESink;'

Or you can use the exec source to get its output data:

$ flume node_nowatch -1 -n bar -c 'bar:exec("cat pipe")|agentBESink;'

Monitoring nodes

While outputting data to a console or to a logfile is an effective way to verify data transmission, Flume nodes provide a way to monitor the state of sources and sinks. This can be done by looking at the node’s report page. By default, you can navigate your web browser to the node’s TCP port 35862. (http://node:35862/)

This page shows counter information about all of the logical nodes on the physical node as well as some basic machine metric information.
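
If you prefer the command line, you can fetch the same report page with curl (assuming a node running on the local machine with the default port):

$ curl http://localhost:35862/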

Tip
If you have multiple physical nodes on a single machine, there may be a port conflict. If you have the auto-find port option on (flume.node.http.autofindport), the physical node will increment the port number until it finds a free port it can bind to.
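
As a sketch, assuming the standard Hadoop-style XML configuration file, the option could be enabled by adding a property to conf/flume-site.xml:

<property>
  <name>flume.node.http.autofindport</name>
  <value>true</value>
</property>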

Flume Agents for Syslog data

syslog is the standard single-machine logging service on Unix systems. Events are generally emitted as lines with a time stamp, "facility" type, priority, and message. Syslog can be configured to send data to remote destinations. The default syslog remote delivery was originally designed to provide best-effort delivery. Today, there are several more advanced syslog services that deliver messages with improved reliability (TCP connections with memory buffering on failure). These reliability guarantees, however, cover only one hop and are weaker than Flume’s more reliable delivery mechanisms.

This section describes two methods for collecting syslog data. The first part describes a file tailing approach. The latter parts give syslog system configuration guidance for feeding Flume’s syslog* sources directly.

Tailing files

The quickest way to record syslog messages is to tail the syslog-generated log files. These files generally live in /var/log.

Some examples include:

/var/log/auth.log
/var/log/messages
/var/log/syslog
/var/log/user.log

These files could be tailed by Flume nodes with tail sources:

tail("/var/log/auth.log")
tail("/var/log/messages")
tail("/var/log/syslog")
tail("/var/log/user.log")

Depending on your system configuration, there may be permissions issues when accessing these files from the Flume node process.

Note
Red Hat/CentOS systems default to writing log files owned by root, in group root, and with 0600 (-rw-------) permissions. Flume could be run as root, but this is not advised because Flume can be remotely configured to execute arbitrary programs.
Note
Ubuntu systems default to writing log files owned by syslog, in group adm, and with 0640 (-rw-r-----) permissions. By adding the user "flume" to group "adm", a Flume node running as user "flume" should be able to read the syslog-generated files.
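
For example, on an Ubuntu system you could grant this access with the standard usermod command (assuming the node runs as user "flume"):

$ sudo usermod -a -G adm flume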
Note
When tailing files, the time when the event is read is used as the time stamp.

Delivering Syslog events via sockets

The original syslog listens to the /dev/log named pipe and can be configured to listen on UDP port 514 (http://tools.ietf.org/search/rfc5424). More advanced versions (rsyslog, syslog-ng) can send and receive over TCP and may do in-memory queuing/buffering. For example, syslog-ng and rsyslog can optionally use the default UDP port 514 or use TCP port 514 for better recovery options.

Note
By default, only superusers can listen on UDP/TCP port 514; Unix systems usually allow only superusers to bind ports below 1024. While Flume can run as superuser, this is not advised from a security stance. The examples below instead route data to the user-bindable port 5140.

For debugging syslog configurations, you can just use flume dump with syslog sources. This command outputs received syslog data to the console. To test whether syslog data is arriving on the proper port, you can run this command from the command line:

$ flume dump 'syslogUdp(5140)'

This will dump all incoming events to the console.

If you are satisfied with your connection, you can run a Flume node on the machine and configure its sink for the reliability level you desire.
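
For example, a minimal sketch that swaps the debugging console sink for an end-to-end agent sink (the collector hostname is a placeholder):

host-syslog : syslogUdp(5140) | agentE2ESink("collectorHost") ;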

Using a syslog* Flume source will save the entire line of event data, use the time stamp found in the original data, extract a host, and attempt to extract a service from the syslog line. All of these map to a Flume event’s fields except for the service, which is added as an extra metadata field on each event (this follows the syslog conventions defined in the RFC).

So, a syslog entry whose body is this:

Sep 14 07:57:24 blitzwing dhclient: bound to 192.168.126.212 -- renewal in 710 seconds.

will have the Flume event body:

Sep 14 07:57:24 blitzwing dhclient: bound to 192.168.126.212 -- renewal in 710 seconds.

The event will also have the "Sep 14 07:57:24" date+time data translated so that it is bucketable. Since this date does not include a year, the current year is assumed, and since it has no time zone, the local time zone is assumed. The host field should be "blitzwing", and the optional "service" metadata field will contain "dhclient".

Configuring syslogd

The original syslog is syslogd. It is configured by an /etc/syslog.conf file. Its format is fairly simple.

Syslog receives messages and then sends them out to different facilities that have associated names (http://tools.ietf.org/search/rfc5424#section-6.2).

The /etc/syslog.conf file essentially contains lists of facilities and "actions". These "actions" are destinations such as regular files, but can also be named pipes, consoles, or remote machines. A remote machine is specified by prefixing the destination host with an @ symbol. If no port is specified, events are sent via UDP to port 514.

The example below specifies delivery to machine localhost on port 5140.

user.*     @localhost:5140

A Flume node daemon running on this machine would have a syslogUdp source listening for new log data.

host-syslog : syslogUdp(5140) | autoE2EChain ;
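
To verify the hookup, you can emit a test message that matches the user.* selector with the standard logger utility:

$ logger -p user.info "hello flume"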

Configuring rsyslog

rsyslog is a more advanced drop-in replacement for syslog and the default syslog system used by Ubuntu systems. It supports basic filtering, best effort delivery, and queuing for handling one-hop downstream failures.

rsyslog actually extends the syslog configuration file format. Similar to regular syslogd, you can send data to a remote machine listening on UDP port 514 (the standard syslog port).

*.*   @remotehost

Moreover, rsyslog also allows you to use the more reliable TCP protocol to send data to a remote host listening on TCP port 514. In rsyslog configurations, an @@ prefix dictates the use of TCP.

*.*  @@remotehost

Similarly, you can append a port number to have it deliver to a particular port. In this example, events are delivered to TCP port 5140 on localhost.

*.*  @@localhost:5140

Assuming you have a Flume node daemon running on the local host, you can capture syslog data by adding a logical node with the following configuration:

host-syslog : syslogTcp(5140) | autoE2EChain ;
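
Remember to restart rsyslog after editing its configuration so the forwarding rule takes effect; on many Linux distributions this is:

$ sudo service rsyslog restart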

Configuring syslog-ng

Syslog-ng is another common replacement for the default syslog logging system. Syslog-ng has a different configuration file format but essentially gives the operator the ability to send syslog data from different facilities to different remote destinations. TCP or UDP can be used.

Here is an example of modifications to a syslog-ng.conf file (often found in /etc/syslog-ng/).

## set up logging to loghost (which is flume)
destination loghost {
        tcp("localhost" port(5140));
};

# send everything to loghost, too
log {
        source(src);
        destination(loghost);
};
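
The log statement above assumes a source named src is already defined. Many distribution-provided configurations declare one; a minimal sketch (the exact name and drivers vary by system) looks like:

source src {
        unix-stream("/dev/log");
        internal();
};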

Assuming you have a Flume node daemon running on the local host, you can capture syslog data by adding a logical node with the following configuration:

host-syslog : syslogTcp(5140) | autoE2EChain ;

Logging Scribe Events to a Flume Agent

Flume can emulate a downstream node for applications that log using Scribe. Scribe is Facebook’s open source log aggregation framework. It has a simple API and uses Thrift as its core network transport mechanism. Flume uses the same Thrift IDL file and can listen for data provided by Scribe sources.

Scribe comes with an example application called scribe_cat. It acts like the unix cat program but sends data to a downstream host with a specified "category". Scribe by default uses TCP port 1463.

You can configure a Flume node to listen for incoming Scribe traffic by creating a logical node that uses the scribe source. We can then assign an arbitrary sink to the node. In the example below, the scribe node receives events and sends them to both the console and an automatically-assigned end-to-end agent, which delivers the events downstream to its collector pools.

scribe: scribe | [console, autoE2EChain];
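
One way to install this mapping is through the flume shell; the sketch below assumes a master running on localhost:

$ flume shell -c localhost -e "exec config scribe 'scribe' '[console, autoE2EChain]'"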

Assuming that this node communicates with the master, and that the node is on localhost, we can use the scribe_cat program to send data to Flume.

$ echo "foo" | scribe_cat localhost testcategory

When the Scribe source accepts the Scribe events, it converts Scribe’s category information into a new Flume event metadata entry, and then delivers the event to its sinks. Since Scribe does not include time metadata, the timestamp of the created Flume event will be the arrival time of the Scribe event.
