Repo      Name   Release Date   Tarballs       Apt       Yum
Stable    CDH1   March 2009     /cdh/stable    /debian   /redhat/cdh/stable
Testing   CDH2   August 2009    /cdh/testing   /debian   /redhat/cdh/testing
1.3.1.5. Installing Hadoop (Pseudo-Distributed Mode)

A standalone Hadoop installation cannot take advantage of multiple CPUs or cores. Let's take our installation to the next level by starting all the Hadoop services in a pseudo-distributed configuration. This setup lets us distribute Hadoop processing over all the cores/processors on a single machine. In pseudo-distributed mode, all files are written to the Hadoop Distributed File System (HDFS), and all the services/daemons communicate with each other over local TCP sockets.

Warning

While you can install hadoop-0.18 and hadoop-0.20 side by side, you cannot run them at the same time because of TCP port conflicts, which will be resolved in a future update of CDH2. Because of this limitation, we're going to walk through an installation of hadoop-0.20-conf-pseudo alone. If you are interested in running hadoop-0.18 in pseudo-distributed mode, substitute 0.18 for 0.20 in the commands below.

Let's update our Hadoop configuration to run in pseudo-distributed mode by running:

Example 17. Installation on Debian-based systems

# apt-get -y install hadoop-0.20-conf-pseudo

Example 18. Installation on Redhat-based systems

# yum install hadoop-0.20-conf-pseudo -y

Let's look more closely at how the hadoop-0.20-conf-pseudo package changes your system.

Example 19. Viewing the files on Debian-based systems

$ dpkg -L hadoop-0.20-conf-pseudo

Example 20. Viewing the files on Redhat-based systems

$ rpm -ql hadoop-0.20-conf-pseudo

Example 21. Files in the hadoop-0.20-conf-pseudo package

/etc/hadoop-0.20/conf.pseudo
/etc/hadoop-0.20/conf.pseudo/configuration.xsl
/etc/hadoop-0.20/conf.pseudo/capacity-scheduler.xml
/etc/hadoop-0.20/conf.pseudo/ssl-server.xml.example
/etc/hadoop-0.20/conf.pseudo/fair-scheduler.xml
/etc/hadoop-0.20/conf.pseudo/hdfs-site.xml
/etc/hadoop-0.20/conf.pseudo/log4j.properties
/etc/hadoop-0.20/conf.pseudo/mapred-site.xml
/etc/hadoop-0.20/conf.pseudo/hadoop-policy.xml
/etc/hadoop-0.20/conf.pseudo/hadoop-metrics.properties
/etc/hadoop-0.20/conf.pseudo/README
/etc/hadoop-0.20/conf.pseudo/core-site.xml
/etc/hadoop-0.20/conf.pseudo/ssl-client.xml.example
/etc/hadoop-0.20/conf.pseudo/hadoop-env.sh
/etc/hadoop-0.20/conf.pseudo/masters
/etc/hadoop-0.20/conf.pseudo/slaves

All the new configuration is nicely self-contained in the /etc/hadoop-0.20/conf.pseudo directory.

Important

The hadoop-0.20-conf-pseudo package automatically formats HDFS on installation if (and only if) the filesystem has not already been formatted.

This HDFS formatting only initializes files in your /var/lib/hadoop-0.20 directory and will not affect any other filesystems on your machine.

The Cloudera packages use the alternatives framework to manage which Hadoop configuration is active. All Hadoop components look for the Hadoop configuration in /etc/hadoop-0.20/conf. More on that later.
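
One way to see which configuration directory is currently active is to follow the symlinks behind /etc/hadoop-0.20/conf. This is a minimal sketch, assuming that the alternatives framework implements that path as a chain of symlinks and that your readlink supports the -f option (GNU coreutils does). The function name active_conf is made up for this example and is not part of the Cloudera packages.

```shell
# Print the real directory behind the Hadoop configuration path.
# Assumption: the path is a symlink chain managed by the alternatives framework.
# active_conf is a hypothetical helper name, not a Cloudera tool.
active_conf() {
    readlink -f "${1:-/etc/hadoop-0.20/conf}"
}
```

After installing hadoop-0.20-conf-pseudo, running active_conf with no arguments would be expected to print a path ending in conf.pseudo.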

We need to start all the Hadoop services in order for the pseudo-distributed configuration to be functional. The hadoop-0.20 package provides all the scripts that you need for managing Hadoop services/daemons.

Let's start the services now.

Example 22. Starting all services on Debian- and Redhat-based systems

for service in /etc/init.d/hadoop-0.20-*
do
    sudo "$service" start
done
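
The same loop can be wrapped in a small function so that one command starts or stops every service. This is only a sketch: the function name hadoop_services and the INIT_DIR and RUNNER variables are hypothetical conveniences introduced here, and it assumes the init scripts accept a stop action in addition to start.

```shell
# Run one action (for example start or stop) against every Hadoop init script.
# INIT_DIR and RUNNER are hypothetical overrides added for this sketch;
# setting RUNNER=echo previews the commands instead of running them.
hadoop_services() {
    action="$1"
    for service in "${INIT_DIR:-/etc/init.d}"/hadoop-0.20-*
    do
        # Skip the unexpanded pattern when no init scripts are installed.
        [ -e "$service" ] || continue
        ${RUNNER:-sudo} "$service" "$action"
    done
}
```

For example, "hadoop_services start" behaves like the loop in Example 22, and "RUNNER=echo hadoop_services stop" prints the stop commands without executing them.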

The NameNode provides a web console at http://localhost:50070/ for viewing your Distributed File System (DFS) capacity, the number of DataNodes, and logs. In the pseudo-distributed configuration, you should see one live DataNode named localhost.

The JobTracker provides a web console at http://localhost:50030/ for viewing running, completed, and failed jobs along with their logs.
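
If you are working over SSH without a browser, a quick way to confirm that both consoles are answering is to request them from the command line. The helper below is a sketch that assumes curl is installed; console_status is a name invented for this example, not a Hadoop command.

```shell
# Report whether a web console answers HTTP requests.
# Assumption: curl is available. console_status is a hypothetical helper.
console_status() {
    if curl -sf -o /dev/null --max-time 5 "$1"
    then
        echo "up: $1"
    else
        echo "down: $1"
    fi
}
```

Call it once per console, for example "console_status http://localhost:50070/" for the NameNode and "console_status http://localhost:50030/" for the JobTracker. A console can take a few seconds to come up after its service starts.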

Let's run the same example we ran in standalone mode in our new pseudo-distributed mode.

First, let's make a directory in HDFS called input and copy some XML files into it:

Example 23. Creating an input directory in pseudo-distributed mode

$ hadoop-0.20 fs -mkdir input
$ hadoop-0.20 fs -put /etc/hadoop-0.20/conf/*.xml input
$ hadoop-0.20 fs -ls input
Found 6 items
-rw-r--r--   1 matt supergroup       6275 2009-08-18 18:36 /user/matt/input/capacity-scheduler.xml
-rw-r--r--   1 matt supergroup        338 2009-08-18 18:36 /user/matt/input/core-site.xml
-rw-r--r--   1 matt supergroup       3032 2009-08-18 18:36 /user/matt/input/fair-scheduler.xml
-rw-r--r--   1 matt supergroup       4190 2009-08-18 18:36 /user/matt/input/hadoop-policy.xml
-rw-r--r--   1 matt supergroup        496 2009-08-18 18:36 /user/matt/input/hdfs-site.xml
-rw-r--r--   1 matt supergroup        213 2009-08-18 18:36 /user/matt/input/mapred-site.xml
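
Notice that the relative name input shows up in the listing as /user/matt/input: HDFS resolves relative paths against the current user's home directory, /user/<username>. The sketch below mimics that rule locally for illustration. The function name hdfs_abs_path and the HDFS_USER override are hypothetical, and the function never contacts HDFS.

```shell
# Show how a relative HDFS path maps to an absolute one.
# Absolute paths are returned unchanged; relative paths are placed under
# /user/<username>. HDFS_USER is a hypothetical override for this sketch.
hdfs_abs_path() {
    case "$1" in
        /*) echo "$1" ;;
        *)  echo "/user/${HDFS_USER:-$(id -un)}/$1" ;;
    esac
}
```

For the user matt, "hdfs_abs_path input" prints /user/matt/input, which matches the paths in the listing above.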

Now, let's run an example Hadoop job to grep for a regex in our input data.

Example 24. Running the example grep job

$ hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

Once the job completes, you will find the results in the HDFS directory named output, since that is the output directory we passed to Hadoop.
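
Keep in mind that Hadoop refuses to start a job whose output directory already exists, so a second run needs the old directory removed first. The following sketch wraps both steps. The function name rerun_grep_job and the HADOOP variable are hypothetical additions for this example; the commands themselves are the ones used in this section plus fs -rmr, which deletes a directory recursively.

```shell
# Remove the previous output directory, then run the example grep job again.
# HADOOP is a hypothetical override; HADOOP=echo previews the commands.
# Warning: fs -rmr permanently deletes the HDFS directory named output.
rerun_grep_job() {
    hadoop="${HADOOP:-hadoop-0.20}"
    "$hadoop" fs -rmr output
    "$hadoop" jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
}
```

Running "HADOOP=echo rerun_grep_job" prints the two commands without touching HDFS, which is a safe way to check them before a real run.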

Let's check the results.

Example 25. Listing the files in HDFS after the job completes

$ hadoop-0.20 fs -ls
Found 2 items
drwxr-xr-x   - matt supergroup          0 2009-08-18 18:36 /user/matt/input
drwxr-xr-x   - matt supergroup          0 2009-08-18 18:38 /user/matt/output

Sure enough, there is a new directory called output.

Example 26. Listing the output files

$ hadoop-0.20 fs -ls output
Found 2 items
drwxr-xr-x   - matt supergroup          0 2009-02-25 10:33 /user/matt/output/_logs
-rw-r--r--   1 matt supergroup       1068 2009-02-25 10:33 /user/matt/output/part-00000

Let's read the results.

Example 27. Reading the results in the output file

$ hadoop-0.20 fs -cat output/part-00000 | head
1       dfs.name.dir
1       dfs.permissions
1       dfs.replication
1       dfsadmin
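
Each output line is a count followed by the matched string. To cross-check the job on a small data set, you can approximate it with ordinary command-line tools on the local copies of the input files. This is a sketch, not the job's actual implementation: local_grep_count is a name invented here, it assumes a grep that supports the -o and -h options (GNU and BSD grep do), and the order of lines with equal counts may differ from Hadoop's output.

```shell
# Local approximation of the example grep job: count every match of a regex
# across the given files and print "count<TAB>match", most frequent first.
# local_grep_count is a hypothetical helper, not part of Hadoop.
local_grep_count() {
    regex="$1"
    shift
    grep -ohE -- "$regex" "$@" \
        | sort \
        | uniq -c \
        | sort -k1,1nr -k2 \
        | awk '{ printf "%s\t%s\n", $1, $2 }'
}
```

For example, "local_grep_count 'dfs[a-z.]+' /etc/hadoop-0.20/conf/*.xml" scans the same files that were copied into the input directory earlier.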