| Repo | Name | Release Date | Tarballs | Apt | Yum |
|---|---|---|---|---|---|
| Stable | CDH1 | March 2009 | /cdh/stable | /debian | /redhat/cdh/stable |
| Testing | CDH2 | August 2009 | /cdh/testing | /debian | /redhat/cdh/testing |

A standalone Hadoop installation cannot harness the power of multiple
CPUs or cores. Let's take our installation to the next level by starting
all our Hadoop services in a pseudo-distributed configuration.
This setup lets us distribute Hadoop processing over all the
cores and processors on a single machine. In pseudo-distributed mode, all files are
written to the Hadoop Distributed File System (HDFS), and all services and daemons
communicate with each other over local TCP sockets.
| Warning |
|---|
| While you can install |
Let's update our Hadoop configuration to run in pseudo-distributed mode by running:
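A minimal sketch of that step, assuming the package in question is the hadoop-0.20-conf-pseudo package discussed next and that it is installed through the system package manager (apt-get on Debian-based systems, yum on Red Hat-based ones). The script only prints the command so that you can review it before running it yourself:

```shell
#!/bin/sh
# Print the install command for the pseudo-distributed configuration package.
# Assumption: the package is named hadoop-0.20-conf-pseudo, as in the text.
package=hadoop-0.20-conf-pseudo

if command -v apt-get >/dev/null 2>&1; then
    install_cmd="sudo apt-get install $package"   # Debian and Ubuntu
elif command -v yum >/dev/null 2>&1; then
    install_cmd="sudo yum install $package"       # Red Hat and CentOS
else
    install_cmd=
fi

if [ -n "$install_cmd" ]; then
    echo "Run: $install_cmd"
else
    echo "Neither apt-get nor yum was found on this system"
fi
```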
Let's look more closely at how the hadoop-0.20-conf-pseudo package changes your system.
Example 21. Files in the hadoop-0.20-conf-pseudo package

    /etc/hadoop-0.20/conf.pseudo
    /etc/hadoop-0.20/conf.pseudo/configuration.xsl
    /etc/hadoop-0.20/conf.pseudo/capacity-scheduler.xml
    /etc/hadoop-0.20/conf.pseudo/ssl-server.xml.example
    /etc/hadoop-0.20/conf.pseudo/fair-scheduler.xml
    /etc/hadoop-0.20/conf.pseudo/hdfs-site.xml
    /etc/hadoop-0.20/conf.pseudo/log4j.properties
    /etc/hadoop-0.20/conf.pseudo/mapred-site.xml
    /etc/hadoop-0.20/conf.pseudo/hadoop-policy.xml
    /etc/hadoop-0.20/conf.pseudo/hadoop-metrics.properties
    /etc/hadoop-0.20/conf.pseudo/README
    /etc/hadoop-0.20/conf.pseudo/core-site.xml
    /etc/hadoop-0.20/conf.pseudo/ssl-client.xml.example
    /etc/hadoop-0.20/conf.pseudo/hadoop-env.sh
    /etc/hadoop-0.20/conf.pseudo/masters
    /etc/hadoop-0.20/conf.pseudo/slaves

All the new configuration is nicely self-contained in the /etc/hadoop-0.20/conf.pseudo directory.
| Important |
|---|
| This HDFS formatting only initializes files in your /var/lib/hadoop-0.20 directory and will not affect any other filesystems on your machine. |
The Cloudera packages use the alternatives framework to manage which
Hadoop configuration is active. All Hadoop components search for the Hadoop
configuration in /etc/hadoop-0.20/conf. More on that later.
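To see which configuration is currently active, you can ask the alternatives system directly. The following is a minimal sketch, assuming the alternative is registered under the name hadoop-0.20-conf (an assumption; the text does not give the name) and that the tool is update-alternatives on Debian-based systems or alternatives on Red Hat-based ones:

```shell
#!/bin/sh
# Show which configuration directory /etc/hadoop-0.20/conf currently points to.
# Assumption: the alternative is registered under the name "hadoop-0.20-conf".
alt_name=hadoop-0.20-conf

if command -v update-alternatives >/dev/null 2>&1; then
    alt_tool=update-alternatives      # Debian and Ubuntu
elif command -v alternatives >/dev/null 2>&1; then
    alt_tool=alternatives             # Red Hat and CentOS
else
    alt_tool=
fi

if [ -n "$alt_tool" ]; then
    # --display lists every registered candidate and marks the active one.
    "$alt_tool" --display "$alt_name" ||
        echo "no alternative named $alt_name is registered"
else
    echo "no alternatives tool found on this system"
fi
```

If the name differs on your system, listing /etc/alternatives should reveal the one your packages registered.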
We need to start all the Hadoop services in order for the pseudo-distributed
configuration to be functional. The hadoop-0.20 package provides all the scripts
that you need for managing Hadoop services/daemons.
Let's start the services now.
Example 22. Starting all services on Debian- and Red Hat-based systems

    for service in /etc/init.d/hadoop-0.20-*
    do
        sudo "$service" start
    done

The NameNode provides a web console at http://localhost:50070/ for viewing
your Distributed File System (DFS) capacity, the number of DataNodes, and logs.
In the pseudo-distributed configuration, you should see one live DataNode
named localhost.
The JobTracker provides a web console at http://localhost:50030/ for viewing
running, completed, and failed jobs along with their logs.
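If you would rather check from a terminal that both consoles are answering, a small probe like the one below can help. It is only a sketch: it assumes curl is installed, and the helper name check_console is made up for this illustration. The URLs are the two given above.

```shell
#!/bin/sh
# Probe the two web consoles and report whether each one responds.
# Assumption: curl is installed; check_console is a hypothetical helper.
check_console() {
    name=$1
    url=$2
    # --fail makes curl return non-zero on HTTP errors (status 400 and above).
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
        echo "$name console is up at $url"
    else
        echo "$name console is not reachable at $url"
    fi
}

check_console NameNode http://localhost:50070/
check_console JobTracker http://localhost:50030/
```

A console that is not reachable right after startup may simply need a few more seconds; the daemon logs are the next place to look if it stays down.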
Let's run the same example we ran in standalone mode in our new pseudo-distributed mode.
First, let's make a directory in HDFS called input and copy some xml files
into it:
Example 23. Creating an input directory in pseudo-distributed mode

    $ hadoop-0.20 fs -mkdir input
    $ hadoop-0.20 fs -put /etc/hadoop-0.20/conf/*.xml input
    $ hadoop-0.20 fs -ls input
    Found 6 items
    -rw-r--r-- 1 matt supergroup 6275 2009-08-18 18:36 /user/matt/input/capacity-scheduler.xml
    -rw-r--r-- 1 matt supergroup 338 2009-08-18 18:36 /user/matt/input/core-site.xml
    -rw-r--r-- 1 matt supergroup 3032 2009-08-18 18:36 /user/matt/input/fair-scheduler.xml
    -rw-r--r-- 1 matt supergroup 4190 2009-08-18 18:36 /user/matt/input/hadoop-policy.xml
    -rw-r--r-- 1 matt supergroup 496 2009-08-18 18:36 /user/matt/input/hdfs-site.xml
    -rw-r--r-- 1 matt supergroup 213 2009-08-18 18:36 /user/matt/input/mapred-site.xml

Now, let's run an example Hadoop job to grep for a regex in our input data.
Example 24. Running the example grep job
$ hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
Once the job completes, you will find the output in the HDFS directory named output
(since we passed in that output directory to Hadoop).
Let's check the results.
Example 25. Listing the files in HDFS after the job completes

    $ hadoop-0.20 fs -ls
    Found 2 items
    drwxr-xr-x - matt supergroup 0 2009-08-18 18:36 /user/matt/input
    drwxr-xr-x - matt supergroup 0 2009-08-18 18:38 /user/matt/output

Sure enough, there is a new directory called output…
Example 26. Listing the output files

    $ hadoop-0.20 fs -ls output
    Found 2 items
    drwxr-xr-x - matt supergroup 0 2009-02-25 10:33 /user/matt/output/_logs
    -rw-r--r-- 1 matt supergroup 1068 2009-02-25 10:33 /user/matt/output/part-00000

Let's read the results.
Example 27. Reading the results in the output file

    $ hadoop-0.20 fs -cat output/part-00000 | head
    1 dfs.name.dir
    1 dfs.permissions
    1 dfs.replication
    1 dfsadmin

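One practical note if you want to run the grep job again: Hadoop refuses to start a job whose output directory already exists, so the old output directory has to be removed first. The following sketch assumes the hadoop-0.20 command is on your PATH, as in the examples above, and uses the -rmr option of the fs shell to delete the directory recursively:

```shell
#!/bin/sh
# Remove a previous output directory so the grep job can be rerun.
# Assumption: hadoop-0.20 is on the PATH, as in the examples above.
output_dir=output

if command -v hadoop-0.20 >/dev/null 2>&1; then
    # -rmr deletes the HDFS directory and everything beneath it.
    hadoop-0.20 fs -rmr "$output_dir" ||
        echo "could not remove $output_dir (it may not exist yet)"
else
    echo "hadoop-0.20 not found on PATH"
fi
```

The path is relative, so it resolves against your HDFS home directory (/user/matt in the listings above) and leaves the input directory untouched.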