HBase includes several methods of loading data into tables. The most straightforward method is to either use the TableOutputFormat class from a MapReduce job, or use the normal client APIs; however, these are not always the most efficient methods.
This document describes HBase's bulk load functionality. The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the generated files into a running cluster. Bulk loading uses less CPU and network resources than loading the same data through the HBase API.
The HBase bulk load process consists of two main steps.
The first step of a bulk load is to generate HBase data files from a MapReduce job using HFileOutputFormat. This output format writes out data in HBase's internal storage format so that it can later be loaded into the cluster very efficiently.
To function efficiently, HFileOutputFormat must be configured so that each output HFile fits within a single region. To accomplish this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table.
HFileOutputFormat includes a convenience function, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the current region boundaries of a table.
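As an illustration, a preparation job might be put together roughly as follows. This is only a sketch: the class names, paths, table name ("mytable"), input layout (tab-separated row key and value), and the column family and qualifier ("cf", "val") are assumptions made for the example, not requirements of the API.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HFilePrepareJob {

  /** Parses lines of the form "rowkey<TAB>value" and emits Puts keyed by row. */
  static class PrepareMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    private static final byte[] FAMILY = Bytes.toBytes("cf");     // assumed column family
    private static final byte[] QUALIFIER = Bytes.toBytes("val"); // assumed qualifier

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(fields[0]);
      Put put = new Put(row);
      put.add(FAMILY, QUALIFIER, Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "prepare-hfiles");
    job.setJarByClass(HFilePrepareJob.class);

    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // raw TSV input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile output directory

    job.setMapperClass(PrepareMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);

    // Reads the current region boundaries of the target table and configures the
    // output format, the TotalOrderPartitioner, and a reducer that sorts the output.
    HTable table = new HTable(conf, "mytable");              // assumed table name
    HFileOutputFormat.configureIncrementalLoad(job, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that configureIncrementalLoad() inspects the map output value class (Put here) to choose an appropriate sort reducer, so the map output classes should be set before it is called.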
After the data has been prepared using HFileOutputFormat, it is loaded into the cluster using completebulkload. This command-line tool iterates through the prepared data files and, for each one, determines the region the file belongs to. It then contacts the appropriate RegionServer, which adopts the HFile, moving it into its storage directory and making the data available to clients.
If the region boundaries have changed during the course of bulk load preparation, or between the preparation and completion steps, the completebulkload utility will automatically split the data files into pieces corresponding to the new boundaries. This process is not optimally efficient, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through other means.
After a data import has been prepared, either by using the importtsv tool with the "importtsv.bulk.output" option or by some other MapReduce job using HFileOutputFormat, the completebulkload tool is used to import the data into the running cluster.
The completebulkload tool simply takes the output path where importtsv or your MapReduce job put its results, and the table name to import into. For example:
$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable
The -c config-file option can be used to specify a file containing the appropriate HBase parameters (e.g., hbase-site.xml) if they are not already supplied on the CLASSPATH. (In addition, the CLASSPATH must contain the directory that has the ZooKeeper configuration file if ZooKeeper is not managed by HBase.)
Note: If the target table does not already exist in HBase, this tool will create the table automatically.
This tool will run quickly, after which point the new data will be visible in the cluster.
HBase ships with a command-line tool called importtsv which, when given files containing data in TSV form, can prepare this data for bulk import into HBase. By default, this tool uses the HBase put API to insert data into HBase one row at a time, but when the "importtsv.bulk.output" option is used, importtsv will instead generate files using HFileOutputFormat which can subsequently be bulk-loaded into HBase using the completebulkload tool described above. This tool is available by running "hadoop jar /path/to/hbase-VERSION.jar importtsv". Running this command with no arguments prints brief usage information:
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
Imports the given input directory of TSV data into the specified table.
The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key, and you must specify a column name for every column that exists in the
input data.
By default importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
-Dimporttsv.bulk.output=/path/for/output
Note: if you do not use this option, then the target table must already exist in HBase
Other options that may be specified with -D include:
-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
'-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
-Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
-Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
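For example, assuming TSV input under /user/example/input whose first field is the row key and whose second field is a value for a hypothetical column family cf1, HFiles could be prepared for bulk loading with an invocation along these lines (paths, table name, and column names are placeholders):
$ hadoop jar /path/to/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,cf1:val -Dimporttsv.bulk.output=/user/example/output mytable /user/example/input
The files written under /user/example/output can then be handed to the completebulkload tool described above.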
Although the importtsv tool is useful in many cases, advanced users may want to generate data programmatically, or import data from other formats. To get started doing so, dig into ImportTsv.java and check the JavaDoc for HFileOutputFormat.
The import step of the bulk load can also be done programmatically. See the LoadIncrementalHFiles class for more information.
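As a rough sketch, a programmatic load, assuming the HFiles were written to /user/example/output by a previous job and that the target table already exists, might look like the following (the path and table name are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class ProgrammaticBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Directory produced by importtsv -Dimporttsv.bulk.output=... or by an
    // HFileOutputFormat-based MapReduce job (placeholder path).
    Path hfileDir = new Path("/user/example/output");

    HTable table = new HTable(conf, "mytable");   // placeholder table name
    try {
      // Walks the column family subdirectories under hfileDir and asks the
      // RegionServer owning each key range to adopt the corresponding HFile.
      LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
      loader.doBulkLoad(hfileDir, table);
    } finally {
      table.close();
    }
  }
}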