
Flume Cookbook

flume-dev@cloudera.org

Revision History
Revision 0.9.4-cdh4.1.3, January 26, 2013

Table of Contents

1. Introduction
2. Trying out Flume sources and sinks
2.1. Testing sources
2.2. Testing sinks and decorators
2.3. Monitoring nodes
3. Using Flume Agents for Apache 2.x Web Server Logging
3.1. If you are using CentOS / Red Hat Apache servers
3.2. If you are using Ubuntu Apache servers
3.3. Getting log entries from Piped Log files
3.4. Using piped logs
4. Flume Agents for Syslog data
4.1. Tailing files
4.2. Delivering Syslog events via sockets
4.2.1. Configuring syslogd
4.2.2. Configuring rsyslog
4.2.3. Configuring syslog-ng
5. Logging Scribe Events to a Flume Agent

1. Introduction

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Being highly configurable and very extensible means that there are many options and thus many decisions that need to be made by an operator. This document is a "cookbook" with "recipes" for getting Flume up and running quickly without being overwhelmed with all the details.

This document starts with some basic tips on experimenting with Flume nodes, followed by three walkthroughs of setting up agents for different kinds of sources. Finally, we cover a quick collector setup and how to troubleshoot distributed deployments.

  • Basic debugging with one-shot Flume nodes
  • Flume agents for Apache web server logging
  • Flume agents for syslog / syslog-ng logging
  • Flume agents for scribe logging

2. Trying out Flume sources and sinks

First, this section describes some basic tips for testing sources, sinks, and logical nodes.

2.1. Testing sources

Testing sources is a straightforward process. The flume script has a dump command that allows you to test a data source configuration by displaying data at the console.

flume dump source
[Note]Note

source needs to be a single command line argument, so you may need to wrap the argument in 'single quotes' or "double quotes" if it contains quotes or parentheses. Using single quotes allows you to use unescaped double quotes in the configuration argument. (ex: 'text("/tmp/foo")' or "text(\"/tmp/foo\")").

Here are some simple examples to try:

$ flume dump console
$ flume dump 'text("/path/to/file")'
$ flume dump 'file("/path/to/file")'
$ flume dump syslogTcp
$ flume dump 'syslogUdp(5140)'
$ flume dump 'tail("/path/to/file")'
$ flume dump 'tailDir("path/to/dir", "fileregex")'
$ flume dump 'rpcSource(12346)'

Under the covers, this dump command is actually running the flume node_nowatch command with some extra command line parameters.

flume node_nowatch -1 -s -n dump -c "dump: $1 | console;"

Here’s a summary of what the options mean.

-1

one-shot execution. This makes the node instance not use the heartbeating mechanism to get a configuration from the master.

-s

starts the Flume node without starting the node’s http status web server. If the status web server is started, a Flume node’s status server will keep the process alive even if in one-shot mode. If the -s flag is specified along with one-shot mode (-1), the Flume node will exit after all logical nodes complete.

-c "node:src|snk;"

Starts the node with the given configuration definition. NOTE: If not using -1, this will be invalidated upon the first heartbeat to the master.

-n node

gives the node the physical name node.

You can get info on all of the Flume node commands by using this command:

$ flume node -h

2.2. Testing sinks and decorators

Now that you can test sources, there is only one more step necessary to test arbitrary sinks and decorators from the command line. A sink requires data to consume, so common sources used to generate test data include synthetic datasets (asciisynth), the console (console), and files (text).

For example, you can use a synthetic source to generate 1000 events, each containing 100 bytes of "random" data, and write them to the console. You could use a text source to read a file such as /etc/services, or you could use the console as a source and interactively enter lines of text as events:

$ flume node_nowatch -1 -s -n dump -c 'dump: asciisynth(1000,100) | console;'
$ flume node_nowatch -1 -s -n dump -c 'dump: text("/etc/services") | console;'
$ flume node_nowatch -1 -s -n dump -c 'dump: console | console;'

You can also use decorators on the sinks you specify. For example, you could rate limit the amount of data that is pushed from the source to the sink by inserting a delay decorator. In this case, the delay decorator waits 100ms before sending each synthesized event to the console.

$ flume node_nowatch -1 -s -n dump -c 'dump: asciisynth(1000,100) | { delay(100) => console};'

Using the command line, you can send events via the direct best-effort (BE) or disk-failover (DFO) agents. The example below uses the console source so you can interactively generate data to send in BE and DFO mode to collectors.

[Note]Note

flume node_nowatch must be used when piping data into a Flume node’s console source. The watchdog program does not forward stdin.

$ flume node_nowatch -1 -s -n dump -c 'dump: console | agentBESink("collectorHost");'
$ flume node_nowatch -1 -s -n dump -c 'dump: console | agentDFOSink("collectorHost");'
[Warning]Warning

Since these nodes are executed with configurations entered at the command line and never contact the master, they cannot use the automatic chains or logical node translations. Currently, the acknowledgements used in E2E mode go through the master piggy-backed on node-to-master heartbeats. Since this mode does not heartbeat, E2E mode should not be used.

Console sources are useful because we can pipe data into Flume directly. The next example pipes data from a program into Flume, which then delivers it.

$ <external process> | flume node_nowatch -1 -s -n foo -c 'foo:console|agentBESink("collector");'

Ideally, you could write data to a named pipe and just have Flume read from it using text or tail. Unfortunately, this version of Flume’s text and tail are not currently compatible with named pipes in a Linux environment. However, you could pipe data to a Flume node listening on the stdin console:

$ tail -f namedpipe | flume node_nowatch -1 -s -n foo -c 'foo:console|agentBESink;'

Or you can use the exec source to get its output data:

$ flume node_nowatch -1 -s -n bar -c 'bar:exec("cat pipe")|agentBESink;'

2.3. Monitoring nodes

While outputting data to a console or to a logfile is an effective way to verify data transmission, Flume nodes provide a way to monitor the state of sources and sinks. This can be done by looking at the node’s report page. By default, you can navigate your web browser to the node’s TCP port 35862 (http://node:35862/).

This page shows counter information about all of the logical nodes on the physical node as well as some basic machine metric information.
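
For example, assuming a node running on the local machine with the default status port, you can fetch the same report page from the command line:

$ curl http://localhost:35862/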

[Tip]Tip

If you have multiple physical nodes on a single machine, there may be a port conflict. If you have the auto-find port option on (flume.node.http.autofindport), the physical node will increment the port number until it finds a free port it can bind to.
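
As a sketch, assuming the usual Hadoop-style flume-site.xml override file in Flume’s conf/ directory, the option could be enabled like this:

<!-- conf/flume-site.xml: let each physical node search for a free status port -->
<property>
  <name>flume.node.http.autofindport</name>
  <value>true</value>
</property>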

3. Using Flume Agents for Apache 2.x Web Server Logging

To connect Flume to Apache 2.x servers, you will need to

  1. Configure web log file permissions

  2. Tail the web logs or use piped logs to enable Flume to get data from the web server.

This section will step through basic setup on default Ubuntu Lucid and default CentOS 5.5 installations. Then it will describe various ways of integrating Flume.

3.1. If you are using CentOS / Red Hat Apache servers

By default, CentOS’s Apache writes weblogs to files owned by root, in group adm, with 0644 (rw-r--r--) permissions. Flume runs as the flume user, but because the files are world-readable the Flume node is able to read the logs.

Apache on CentOS/Red Hat servers defaults to writing logs to two files:

/var/log/httpd/access_log
/var/log/httpd/error_log

The simplest way to gather data from these files is to tail the files by configuring Flume nodes to use Flume’s tail source:

tail("/var/log/httpd/access_log")
tail("/var/log/httpd/error_log")

3.2. If you are using Ubuntu Apache servers

By default, Ubuntu writes weblogs to files owned by root, in group adm, with 0640 (rw-r-----) permissions. Flume runs as the flume user and by default will not be able to read the files. One approach to allow the flume user to read the files is to add it to the adm group.
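
For example, the following command (a sketch using the default user and group names described above) adds the flume user to the adm group; restart the Flume node so it picks up the new group membership:

$ sudo usermod -a -G adm flume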

Apache on Ubuntu defaults to writing logs to three files:

/var/log/apache2/access.log
/var/log/apache2/error.log
/var/log/apache2/other_vhosts_access.log

The simplest way to gather data from these files is by configuring Flume nodes to use Flume’s tail source:

tail("/var/log/apache2/access.log")
tail("/var/log/apache2/error.log")
tail("/var/log/apache2/other_vhosts_access.log")

3.3. Getting log entries from Piped Log files

The Apache 2.x documentation (http://httpd.apache.org/docs/2.2/logs.html) describes using piped log files with the CustomLog directive. Their example uses rotatelogs to periodically write data to new files with a given prefix. Here are some example directives that could be in the httpd.conf/apache2.conf file.

LogFormat "%h %l %u %t \"%r\" %>s %b" common
CustomLog "|/usr/sbin/rotatelogs /var/log/apache2/foo_access_log 3600" common
[Tip]Tip

In Ubuntu Lucid these directives are in /etc/apache2/sites-available/default. In CentOS 5.5, these directives are in /etc/httpd/conf/httpd.conf.

These directives configure Apache to write log files in /var/log/apache2/foo_access_log.xxxxx every hour (3600 seconds) using the "common" log format.
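
Because rotatelogs appends the epoch timestamp of each rotation interval when given a plain prefix (rather than a strftime pattern), the directory fills with files named along these lines (hypothetical timestamps, one hour apart):

/var/log/apache2/foo_access_log.1358208000
/var/log/apache2/foo_access_log.1358211600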

You can use Flume’s tailDir source to read all files without modifying the Apache settings:

tailDir("/var/log/apache2/", "foo_access_log.*")

The first argument is the directory, and the second is a regex that is matched against file names. tailDir watches the directory and tails all files whose names match the regex.

3.4. Using piped logs

Instead of writing data to disk and then having Flume read it, you can have Flume ingest data directly from Apache. To do so, modify the web server’s parameters and use its piped log feature by adding some directives to Apache’s configuration:

CustomLog "|flume node_nowatch -1 -s -n apache -c \'apache:console|agentBESink(\"collector\");\'" common
CustomLog "|flume node_nowatch -1 -s -n apache -c \'apache:console|agentDFOSink(\"collector\");\'" common
[Warning]Warning

By default, CentOS does not have the Java required by the Flume node in user root's path. You can use alternatives to create a managed symlink in /usr/bin/ for the java executable.
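
For example, on CentOS you could register the JVM with alternatives (a sketch; the Java path is only an example and depends on where your JDK is installed):

$ sudo alternatives --install /usr/bin/java java /usr/java/default/bin/java 1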

Using piped logs can be more efficient than tailing, but it is riskier because Flume can deliver messages without ever saving them on disk, which increases the probability of event loss. From a security point of view, this Flume node instance runs as Apache’s user, which is often root according to the Apache manual.

[Note]Note

You could configure the one-shot mode node to deliver data directly to a collector. This can only be done at the best effort or disk-failover level.

The prior examples use Flume nodes in one-shot mode, which runs without contacting a master. Unfortunately, this means that one-shot mode cannot directly use the automatic chains or the end-to-end (E2E) reliability mode. This is because the automatic chains are generated by the master and because E2E mode currently delivers acknowledgements through the master.

However, you can have a one-shot Flume node deliver data to a Flume local node daemon where the reliable E2E mode can be used. In this setup we would have the following Apache directive:

CustomLog "|flume node_nowatch -1 -s -n apache -c \'apache:console|agentBESink(\"localhost\", 12345);\'" common

Then you can have a Flume node set up to listen with the following configuration:

node : rpcSource(12345) | agentE2ESink("collector");

Since this daemon node is connected to the master, it can use the auto*Chains.

node : rpcSource(12345) | autoE2EChain;
[Note]Note

End-to-end mode attempts to ensure the delivery of data that enters the E2E sink. In this one-shot-node to reliable-node scenario, data is not safe until it gets to the E2E sink. However, since this is a local connection, it should only fail when the machine or the processes fail. The one-shot node can be set to disk-failover (DFO) mode in order to reduce the chance of message loss if the daemon node’s configuration changes.

4. Flume Agents for Syslog data

syslog is the standard unix single-machine logging service. Events are generally emitted as lines with a time stamp, "facility" type, priority, and message. Syslog can be configured to send data to remote destinations. The default syslog remote delivery mechanism was originally designed to provide best-effort delivery. Today, there are several more advanced syslog services that deliver messages with improved reliability (TCP connections with memory buffering on failure). Those reliability guarantees, however, are only one hop and thus weaker than Flume’s more reliable delivery mechanisms.

This section describes collecting syslog data using two methods. The first part describes a file tailing approach. The latter parts describe syslog system configuration guidance that enables directly feeding Flume’s syslog* sources.

4.1. Tailing files

The quickest way to record syslog messages is to tail syslog generated log files. These files generally live in /var/log.

Some examples include:

/var/log/auth.log
/var/log/messages
/var/log/syslog
/var/log/user.log

These files could be tailed by Flume nodes with tail sources:

tail("/var/log/auth.log")
tail("/var/log/messages")
tail("/var/log/syslog")
tail("/var/log/user.log")

Depending on your system configuration, there may be permissions issues when accessing these files from the Flume node process.

[Note]Note

Red Hat/CentOS systems default to writing log files owned by root, in group root, and with 0600 (-rw-------) permissions. Flume could be run as root, but this is not advised because Flume can be remotely configured to execute arbitrary programs.

[Note]Note

Ubuntu systems default to writing log files owned by syslog, in group adm, and with 0640 (-rw-r-----) permissions. By adding the user "flume" to group "adm", a Flume node running as user "flume" should be able to read the syslog generated files.

[Note]Note

When tailing files, the time when the event is read is used as the time stamp.

4.2. Delivering Syslog events via sockets

The original syslog listens to the /dev/log named pipe, and can be configured to listen on UDP port 514 (http://tools.ietf.org/search/rfc5424). More advanced versions (rsyslog, syslog-ng) can send and receive over TCP and may do in-memory queuing/buffering. For example, syslog-ng and rsyslog can optionally use the default UDP port 514 or use TCP port 514 for better recovery options.

[Note]Note

By default, only superusers can listen on UDP/TCP port 514. Unix systems usually only allow ports below 1024 to be bound by superusers. While Flume can run as superuser, from a security stance this is not advised. The examples below instead route data to the user-bindable port 5140.

For debugging syslog configurations, you can just use flume dump with syslog sources. This command outputs received syslog data to the console. To test whether syslog data is coming in on the proper port, you can run this command from the command line:

$ flume dump 'syslogUdp(5140)'

This will dump all incoming events to the console.

If you are satisfied with your connection, you can run a Flume node on the machine and configure its sink for the reliability level you desire.

A syslog* Flume source saves the entire line of event data, uses the timestamp found in the original data, extracts a host, and attempts to extract a service from the syslog line. All of these map to a Flume event’s fields except the service, which is added as an extra metadata field on each event (following the syslog conventions defined in the RFC).

So, a syslog entry whose body is this:

Sep 14 07:57:24 blitzwing dhclient: bound to 192.168.126.212 -- renewal in 710 seconds.

will have the Flume event body:

Sep 14 07:57:24 blitzwing dhclient: bound to 192.168.126.212 -- renewal in 710 seconds.

The event will also have the "Sep 14 07:57:24" date+time data translated so that it is bucketable. Since this date does not have a year, the current year is assumed, and since it has no timezone, the local timezone is assumed. The host field should be "blitzwing", and the optional "service" metadata field will contain "dhclient".

4.2.1. Configuring syslogd

The original syslog is syslogd. It is configured by an /etc/syslog.conf file. Its format is fairly simple.

Syslog receives messages and then sends them out to different facilities that have associated names (http://tools.ietf.org/search/rfc5424#section-6.2).

The /etc/syslog.conf file essentially contains lists of facilities and "actions". These "actions" are destinations such as regular files, but can also be named pipes, consoles, or remote machines. One can specify a remote machine by prefixing an @ symbol in front of the destination host. If no port is specified, events are sent via UDP to port 514.

The example below specifies delivery to machine localhost on port 5140.

user.*     @localhost:5140

A Flume node daemon running on this machine would have a syslogUdp source listening for new log data.

host-syslog : syslogUdp(5140) | autoE2EChain ;

4.2.2. Configuring rsyslog

rsyslog is a more advanced drop-in replacement for syslog and the default syslog system used by Ubuntu systems. It supports basic filtering, best effort delivery, and queuing for handling one-hop downstream failures.

rsyslog actually extends the syslog configuration file format. As with regular syslogd, you can send data to a remote machine listening on UDP port 514 (the standard syslog port).

*.*   @remotehost

Moreover, rsyslog also allows you to use the more reliable TCP protocol to send data to a remote host listening on TCP port 514. In rsyslog configurations, an @@ prefix dictates the use of TCP.

*.*  @@remotehost

Similarly, you can also append a suffix port number to have it deliver to a particular port. In this example, events are delivered to localhost TCP port 5140.

*.*  @@localhost:5140

Assuming you have a Flume node daemon running on the local host, you can capture syslog data by adding a logical node with the following configuration:

host-syslog : syslogTcp(5140) | autoE2EChain ;

4.2.3. Configuring syslog-ng

Syslog-ng is another common replacement for the default syslog logging system. Syslog-ng has a different configuration file format but essentially gives the operator the ability to send syslog data from different facilities to different remote destinations. TCP or UDP can be used.

Here is an example of modifications to a syslog-ng.conf (often found in /etc/syslog-ng/) file.

## set up logging to loghost (which is flume)
destination loghost {
        tcp("localhost" port(5140));
};

# send everything to loghost, too
log {
        source(src);
        destination(loghost);
};

Assuming you have a Flume node daemon running on the local host, you can capture syslog data by adding a logical node with the following configuration:

host-syslog : syslogTcp(5140) | autoE2EChain ;

5. Logging Scribe Events to a Flume Agent

Flume can emulate a downstream node for applications that log using Scribe. Scribe is Facebook’s open source log aggregation framework. It has a simple API and uses Thrift as its core network transport mechanism. Flume uses the same Thrift IDL file and can listen for data provided by Scribe sources.

Scribe comes with an example application called scribe_cat. It acts like the unix cat program but sends data to a downstream host with a specified "category". Scribe by default uses TCP port 1463.

You can configure a Flume node to listen for incoming Scribe traffic by creating a logical node that uses the scribe source. We can then assign an arbitrary sink to the node. In the example below, the scribe node receives events and sends them to both the console and an automatically-assigned end-to-end agent, which delivers the events downstream to its collector pools.

scribe: scribe | [console, autoE2EChain];
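
If your Scribe clients deliver to a port other than the default 1463, the scribe source can also be given a port explicitly (a sketch assuming the source’s optional port argument; 1464 is a hypothetical port):

scribe: scribe(1464) | [console, autoE2EChain];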

Assuming that this node communicates with the master, and that the node is on localhost, we can use the scribe_cat program to send data to Flume.

$ echo "foo" | scribe_cat localhost testcategory

When the Scribe source accepts the Scribe events, it converts Scribe’s category information into a new Flume event metadata entry, and then delivers the event to its sinks. Since Scribe does not include time metadata, the timestamp of the created Flume event will be the arrival time of the Scribe event.
