Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Being highly configurable and very extensible means that there are many options and thus many decisions that need to be made by an operator. This document is a "cookbook" with "recipes" for getting Flume up and running quickly without being overwhelmed with all the details.
This document starts with some basic tips on experimenting with Flume nodes, then presents three stories about setting up agents for different kinds of sources. Finally, we'll walk through a quick setup for collectors and describe how to troubleshoot distributed deployments.
First this section will describe some basic tips for testing sources, sinks and logical nodes.
Testing sources is a straightforward process. The flume script has a dump command that allows you to test a data source configuration by displaying data at the console.
flume dump source
Here are some simple examples to try:
$ flume dump console
$ flume dump 'text("/path/to/file")'
$ flume dump 'file("/path/to/file")'
$ flume dump syslogTcp
$ flume dump 'syslogUdp(5140)'
$ flume dump 'tail("/path/to/file")'
$ flume dump 'tailDir("path/to/dir", "fileregex")'
$ flume dump 'rpcSource(12346)'
Under the covers, this dump command is actually running the flume node_nowatch command with some extra command line parameters.
flume node_nowatch -1 -s -n dump -c "dump: $1 | console;"
Here’s a summary of what the options mean.
-1 : one-shot execution. This makes the node instance not use the heartbeating mechanism to get a config.

-s : starts the Flume node without starting the node's http status web server. If the status web server is started, a Flume node's status server will keep the process alive even in one-shot mode. If the -s flag is specified along with one-shot mode (-1), the Flume node will exit after all logical nodes complete.

-c "dump: $1 | console;" : starts the node with the given configuration definition. NOTE: If not using -1, this will be invalidated upon the first heartbeat to the master.

-n dump : gives the node the physical name dump.
You can get info on all of the Flume node commands by using this command:
$ flume node -h
Now that you can test sources, there is only one more step necessary to test arbitrary sinks and decorators from the command line. A sink requires data to consume, so some common sources used to generate test data include synthetic datasets (asciisynth), the console (console), or files (text).
For example, you can use a synthetic source to generate 1000 events, each of 100 "random" bytes, and send them to the console. You could use a text source to read a file like /etc/services, or you could use the console as a source and interactively enter lines of text as events:
$ flume node_nowatch -1 -s -n dump -c 'dump: asciisynth(1000,100) | console;'
$ flume node_nowatch -1 -s -n dump -c 'dump: text("/etc/services") | console;'
$ flume node_nowatch -1 -s -n dump -c 'dump: console | console;'
You can also use decorators on the sinks you specify. For example, you could rate-limit the amount of data pushed from the source to the sink by inserting a delay decorator. In this case, the delay decorator waits 100 ms before sending each synthesized event to the console.
$ flume node_nowatch -1 -s -n dump -c 'dump: asciisynth(1000,100) | { delay(100) => console};'
Using the command line, you can send events via the direct best-effort (BE) or disk-failover (DFO) agents. The example below uses the console source so you can interactively generate data to send in BE and DFO mode to collectors.
Note: flume node_nowatch must be used when piping data into a Flume node's console source. The watchdog program does not forward stdin.
$ flume node_nowatch -1 -s -n dump -c 'dump: console | agentBESink("collectorHost");'
$ flume node_nowatch -1 -s -n dump -c 'dump: console | agentDFOSink("collectorHost");'
Warning: Since these nodes are executed with configurations entered at the command line and never contact the master, they cannot use the automatic chains or logical node translations. Currently, the acknowledgements used in E2E mode go through the master, piggy-backed on node-to-master heartbeats. Since this mode does not heartbeat, E2E mode should not be used.
Console sources are useful because we can pipe data into Flume directly. The next example pipes data from a program into Flume which then delivers it.
$ <external process> | flume node_nowatch -1 -s -n foo -c 'foo:console|agentBESink("collector");'
Ideally, you could write data to a named pipe and just have Flume read data from the named pipe using text or tail. Unfortunately, this version of Flume's text and tail are not currently compatible with named pipes in a Linux environment. However, you could pipe data to a Flume node listening on the stdin console:
$ tail -f namedpipe | flume node_nowatch -1 -s -n foo -c 'foo:console|agentBESink;'
Or you can use the exec source to get its output data:
$ flume node_nowatch -1 -s -n bar -c 'bar:exec("cat pipe")|agentBESink;'
While outputting data to a console or to a logfile is an effective way to verify data transmission, Flume nodes provide a way to monitor the state of sources and sinks. This can be done by looking at the node's report page. By default, you can navigate your web browser to the node's TCP port 35862 (http://node:35862/).
This page shows counter information about all of the logical nodes on the physical node as well as some basic machine metric information.
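If you would rather check from the command line, the same status page can be fetched with any HTTP client. This is a quick sketch that assumes the node is running locally on its default status port:

$ curl http://localhost:35862/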
Tip: If you have multiple physical nodes on a single machine, there may be a port conflict. If the auto-find port option is on, the physical node will increment the port number until it finds a free port it can bind to.
To connect Flume to Apache 2.x servers, you will need to:

1. Configure web log file permissions.

2. Tail the web logs or use piped logs to enable Flume to get data from the web server.
This section will step through basic setup on default Ubuntu Lucid and default CentOS 5.5 installations. Then it will describe various ways of integrating Flume.
By default, CentOS's Apache writes weblogs to files owned by root and in group adm in 0644 (rw-r--r--) mode. Flume is run as the flume user, so the Flume node is able to read the logs.
Apache on CentOS/Red Hat servers defaults to writing logs to two files:
/var/log/httpd/access_log
/var/log/httpd/error_log

The simplest way to gather data from these files is to tail them by configuring Flume nodes to use Flume's tail source:

tail("/var/log/httpd/access_log")
tail("/var/log/httpd/error_log")
By default, Ubuntu writes weblogs to files owned by root and in group adm in 0640 (rw-r-----) mode. Flume is run as the flume user and by default will not be able to read the files. One approach to allow the flume user to read the files is to add it to the adm group.
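As a quick sketch, assuming the Flume node runs as the flume user and is restarted afterwards so the new group membership takes effect, the user can be added to the adm group with usermod:

$ sudo usermod -a -G adm flume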
Apache on Ubuntu defaults to writing logs to three files:
/var/log/apache2/access.log
/var/log/apache2/error.log
/var/log/apache2/other_vhosts_access.log
The simplest way to gather data from these files is by configuring Flume nodes to use Flume's tail source:

tail("/var/log/apache2/access.log")
tail("/var/log/apache2/error.log")
tail("/var/log/apache2/other_vhosts_access.log")
The Apache 2.x documentation (http://httpd.apache.org/docs/2.2/logs.html) describes using piped logfiles with the CustomLog directive. Their example uses rotatelogs to periodically write data to new files with a given prefix. Here are some example directives that could be in the httpd.conf/apache2.conf file.
LogFormat "%h %l %u %t \"%r\" %>s %b" common
CustomLog "|/usr/sbin/rotatelogs /var/log/apache2/foo_access_log 3600" common
Tip: In Ubuntu Lucid these directives live in the enabled site configuration under /etc/apache2/ rather than in apache2.conf itself.
These directives configure Apache to write log files in /var/log/apache2/foo_access_log.xxxxx every hour (3600 seconds) using the "common" log format.
You can use Flume's tailDir source to read all files without modifying the Apache settings:
tailDir("/var/log/apache2/", "foo_access_log.*")
The first argument is the directory, and the second is a regex that should match against the file names. tailDir will watch the directory and tail all files with matching names.
Instead of writing data to disk and then having Flume read it, you can have Flume ingest data directly from Apache. To do so, modify the web server’s parameters and use its piped log feature by adding some directives to Apache’s configuration:
CustomLog "|flume node_nowatch -1 -s -n apache -c \'apache:console|agentBESink(\"collector\");\'" common
CustomLog "|flume node_nowatch -1 -s -n apache -c \'apache:console|agentDFOSink(\"collector\");\'" common
Warning: By default, CentOS does not have the Java required by the Flume node in user root's path. Make sure the java executable is on the path of the user that runs the piped-log process.
Using piped logs can be more efficient, but it is riskier because Flume can deliver messages without saving them on disk, which increases the probability of event loss. From a security point of view, this Flume node instance runs as Apache's user, which is often root according to the Apache manual.
Note: You could configure the one-shot mode node to deliver data directly to a collector. This can only be done at the best-effort or disk-failover level.
The prior examples use Flume nodes in one-shot mode, which runs without contacting a master. Unfortunately, this means that one-shot mode cannot directly use the automatic chains or the end-to-end (E2E) reliability mode. This is because the automatic chains are generated by the master and because E2E mode currently delivers acknowledgements through the master.
However, you can have a one-shot Flume node deliver data to a Flume local node daemon where the reliable E2E mode can be used. In this setup we would have the following Apache directive:
CustomLog "|flume node_nowatch -1 -s -n apache -c \'apache:console|agentBESink(\"localhost\", 12345);\'" common
Then you can have a Flume node setup to listen with the following configuration:
node : rpcSource(12345) | agentE2ESink("collector");
Since this daemon node is connected to the master, it can use the auto*Chains.
node : rpcSource(12345) | autoE2EChain;
Note: End-to-end mode attempts to ensure the delivery of data that enters the E2E sink. In this one-shot-node to reliable-node scenario, data is not safe until it gets to the E2E sink. However, since this is a local connection, it should only fail when the machine or processes fail. The one-shot node can be set to disk-failover (DFO) mode in order to reduce the chance of message loss if the daemon node's configuration changes.
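As a sketch of that combination, reusing the hypothetical localhost port 12345 from the directive above, the Apache configuration would simply swap the best-effort sink for the disk-failover one:

CustomLog "|flume node_nowatch -1 -s -n apache -c \'apache:console|agentDFOSink(\"localhost\", 12345);\'" common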
syslog is the standard Unix single-machine logging service. Events are generally emitted as lines with a time stamp, "facility" type, priority, and message. Syslog can be configured to send data to remote destinations. The default syslog remote delivery was originally designed to provide a best-effort delivery service. Today, there are several more advanced syslog services that deliver messages with improved reliability (TCP connections with memory buffering on failure). These reliability guarantees, however, are one hop and weaker than Flume's more reliable delivery mechanism.
This section describes collecting syslog data using two methods. The first part describes a file tailing approach. The latter parts describe syslog system configuration guidance that enables directly feeding Flume's syslog* sources.
The quickest way to record syslog messages is to tail syslog-generated log files. These files generally live in /var/log.
Some examples include:
/var/log/auth.log
/var/log/messages
/var/log/syslog
/var/log/user.log
These files could be tailed by Flume nodes with tail sources:
tail("/var/log/auth.log") tail("/var/log/messages") tail("/var/log/syslog") tail("/var/log/user.log")
Depending on your system configuration, there may be permissions issues when accessing these files from the Flume node process.
Note: Red Hat/CentOS systems default to writing log files owned by root, in group root, and with 0600 (-rw-------) permissions. Flume could be run as root, but this is not advised because Flume can be remotely configured to execute arbitrary programs.
Note: Ubuntu systems default to writing log files owned by syslog, in group adm, and with 0640 (-rw-r-----) permissions. By adding the user "flume" to group "adm", a Flume node running as user "flume" should be able to read the syslog-generated files.
Note: When tailing files, the time when the event is read is used as the time stamp.
The original syslog listens to the /dev/log named pipe, and can be configured to listen on UDP port 514 (http://tools.ietf.org/search/rfc5424). More advanced versions (rsyslog, syslog-ng) can send and receive over TCP and may do in-memory queuing/buffering. For example, syslog-ng and rsyslog can optionally use the default UDP port 514 or use TCP port 514 for better recovery options.
Note: By default, only superusers can listen on UDP/TCP port 514; Unix systems usually only allow ports below 1024 to be bound by superusers. While Flume can run as superuser, this is not advised from a security standpoint. The examples provide directions to route to the user-bindable port 5140.
For debugging syslog configurations, you can just use flume dump with syslog sources. This command outputs received syslog data to the console. To test if syslog data is coming in on the proper port, you can run this command from the command line:
$ flume dump 'syslogUdp(5140)'
This will dump all incoming events to the console.
If you are satisfied with your connection, you can run a Flume node on the machine and configure its sink for the reliability level you desire.
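For instance, a minimal sketch of the daemon node's configuration, assuming a logical node named host-syslog and a collector host named collector, would pick one of the agent sinks depending on the desired level (best effort, disk failover, or end to end):

host-syslog : syslogUdp(5140) | agentBESink("collector");
host-syslog : syslogUdp(5140) | agentDFOSink("collector");
host-syslog : syslogUdp(5140) | agentE2ESink("collector");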
Using a syslog* Flume source will save the entire line of event data, use the timestamp found in the original data, extract a host, and attempt to extract a service from the syslog line. All of these map to a Flume event's fields except for service, which is added as an extra metadata field on each event (this is a convention with syslog defined in the RFC).
So, a syslog entry whose body is this:
Sep 14 07:57:24 blitzwing dhclient: bound to 192.168.126.212 -- renewal in 710 seconds.
will have the Flume event body:
Sep 14 07:57:24 blitzwing dhclient: bound to 192.168.126.212 -- renewal in 710 seconds.
The event will also have the "Sep 14 07:57:24" date+time data translated so that it is bucketable. Since this date does not have a year, the current year is assumed, and since it has no time zone, the local time zone is assumed. The host field should be "blitzwing", and the optional "service" metadata field will contain "dhclient".
The original syslog is syslogd. It is configured by an /etc/syslog.conf file. Its format is fairly simple.
Syslog receives messages and then sends them out to different facilities that have associated names (http://tools.ietf.org/search/rfc5424#section-6.2).
The /etc/syslog.conf file essentially contains lists of facilities and "actions". These "actions" are destinations such as regular files, but can also be named pipes, consoles, or remote machines. One can specify a remote machine by prefixing an @ symbol in front of the destination host machine. If no port is specified, events are sent via UDP to port 514.
The example below specifies delivery to machine localhost on port 5140.
user.* @localhost:5140
A Flume node daemon running on this machine would have a syslogUdp source listening for new log data.
host-syslog : syslogUdp(5140) | autoE2EChain ;
rsyslog is a more advanced drop-in replacement for syslog and the default syslog system used by Ubuntu systems. It supports basic filtering, best-effort delivery, and queuing for handling one-hop downstream failures.

rsyslog actually extends the syslog configuration file format. Similar to regular syslogd, you can send data to a remote machine listening on UDP port 514 (the standard syslog port).
*.* @remotehost
Moreover, rsyslog also allows you to use the more reliable TCP protocol to send data to a remote host listening on TCP port 514. In rsyslog configurations, an @@ prefix dictates the use of TCP.
*.* @@remotehost
Similarly, you can also append a suffix port number to have it deliver to a particular port. In this example, events are delivered to localhost TCP port 5140.
*.* @@localhost:5140
Assuming you have a Flume node daemon running on the local host, you can capture syslog data by adding a logical node with the following configuration:
host-syslog : syslogTcp(5140) | autoE2EChain ;
Syslog-ng is another common replacement for the default syslog logging system. Syslog-ng has a different configuration file format but essentially gives the operator the ability to send syslog data from different facilities to different remote destinations. TCP or UDP can be used.
Here is an example of modifications to a syslog-ng.conf file (often found in /etc/syslog-ng/).
## set up logging to loghost (which is flume)
destination loghost {
        tcp("localhost" port(5140));
};

# send everything to loghost, too
log {
        source(src);
        destination(loghost);
};
Assuming you have a Flume node daemon running on the local host, you can capture syslog data by adding a logical node with the following configuration:
host-syslog : syslogTcp(5140) | autoE2EChain ;
Flume can emulate a downstream node for applications that log using Scribe. Scribe is Facebook’s open source log aggregation framework. It has a simple API and uses Thrift as its core network transport mechanism. Flume uses the same Thrift IDL file and can listen for data provided by Scribe sources.
Scribe comes with an example application called scribe_cat. It acts like the Unix cat program but sends data to a downstream host with a specified "category". Scribe by default uses TCP port 1463.
You can configure a Flume node to listen for incoming Scribe traffic by creating a logical node that uses the scribe source. We can then assign an arbitrary sink to the node. In the example below, the Scribe node receives events and sends them to both the console and an automatically-assigned end-to-end agent, which delivers the events downstream to its collector pools.
scribe: scribe | [console, autoE2EChain];
Assuming that this node communicates with the master, and that the node is on localhost, we can use the scribe_cat program to send data to Flume.
$ echo "foo" | scribe_cat localhost testcategory
When the Scribe source accepts the Scribe events, it converts Scribe’s category information into a new Flume event metadata entry, and then delivers the event to its sinks. Since Scribe does not include time metadata, the timestamp of the created Flume event will be the arrival time of the Scribe event.