commit cfc3233ece0769b11af9add328261295aaf4d1ad
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:56:30 2010 -0800

    CLOUDERA-BUILD. Fix ivy xml after rebase. Removed a redundant </dependencies> closing tag.
    
    Author: Matt Massie

commit 54e1aefdd7a25a539831cac2c9b1bc3597f119ea
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:56:07 2010 -0800

    CLOUDERA-BUILD. Small tweaks and fixes to Cloudera styling:
    
    Description:
        - Fixes trivial CSS bug for missing table cell borders in Chrome
        - Fixes footer to read "Distribution for Hadoop" instead of "Distribution of Hadoop"
    
    Author: Todd Lipcon

commit ea83036b3838fa97c673e73145d52867b8ace6ac
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:55:30 2010 -0800

    HDFS-1013. Miscellaneous improvements to HTML markup for web UIs
    
    Description: The Web UIs have various bits of bad markup (e.g., missing <head> sections, some pages missing CSS links, inconsistent td vs. th for table headings). We should fix this up.

        Improve markup and add Cloudera styling to Web UIs
    
        This adds a favicon and a number of HTML/CSS improvements to make the
        pages more space-efficient and easy on the eyes.
    
        This may be an incompatible change for users who are scraping the HTML
        output of the web UIs. Those users are encouraged to access the data
        programmatically rather than through scraping.
    
        The non-Cloudera-specific improvements will be contributed upstream
        as HDFS-1013 and MAPREDUCE-1544.
    Reason: User experience improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 90ba5543e4c3176343e23943131a34d666c23d89
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:54:58 2010 -0800

    MAPREDUCE-1436. Deadlock in preemption code in fair scheduler
    
    Description: In testing the fair scheduler with preemption, I found a deadlock between updatePreemptionVariables and some code in the JobTracker. This was found while testing a backport of the fair scheduler to Hadoop 0.20, but it looks like it could also happen in trunk and 0.21. Details are in a comment below.

    The fair scheduler introduces a potential jobtracker deadlock which
    was fixed on trunk by MAPREDUCE-870. This patch adjusts the locking
    in 0.20-based MapReduce to prevent this condition.
    
    Reason: bugfix (deadlock)
    Author: Matei Zaharia
    Ref: UNKNOWN

commit 6f04e94feee3f40a73449cc6fbe7b4e3c48f1fc4
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:54:13 2010 -0800

    HDFS-696. Java assertion failures triggered by tests
    
    Description: Re-purposing as catch-all ticket for assertion failures when running tests with java asserts enabled. Running with the attached patch on trunk@823732 the following tests all trigger assertion failures:
    
        TestAccessTokenWithDFS
        TestInterDatanodeProtocol
        TestBackupNode
        TestBlockUnderConstruction
        TestCheckpoint
        TestNameEditsConfigs
        TestStartup
        TestStorageRestore

        Disable failing asserts (see HDFS-696).
    
        Disabled asserts in HDFS that cause unit tests to fail.
        These will be re-enabled at a later date when the underlying cause is fixed
        upstream. In the meantime, these are disabled to keep our CI server returning
        only new failures. Issue HDFS-696 lists the failing tests and tracks their
        progress.
    Reason: Test harness improvement
    Author: Eli Collins
    Ref: UNKNOWN

commit 74b80b9c9490bba1a1120f3a9376d2f21f3763b6
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:53:38 2010 -0800

    MAPREDUCE-1093. Java assertion failures triggered by tests
    
    Description:
        Removes failing asserts from the CDH build until they are fixed in trunk.
        Tracking MAPREDUCE-1506 to include a fix for this assertion failure.
    Reason: Test harness improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit b4be440cd928976544bcbeb7e10566fc523dbd0c
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:53:13 2010 -0800

    MAPREDUCE-1092. Enable asserts for tests by default
    
    Description: See HADOOP-6309 (http://issues.apache.org/jira/browse/HADOOP-6309, "Enable asserts for tests by default"). Let's make the tests run with Java asserts by default.
    Reason: Test coverage improvement
    Author: Eli Collins
    Ref: UNKNOWN

commit 5e7fb9843f99f5e1023f2723210f26ac0c33323b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:52:45 2010 -0800

    MAPREDUCE-1375. TestFileArgs fails intermittently
    
    Description: TestFileArgs failed once for me with the following error:

        expected:<[job.jar
        sidefile
        tmp
        ]> but was:<[]>
        sidefile
        tmp
        ]> but was:<[]>
                at org.apache.hadoop.streaming.TestStreaming.checkOutput(TestStreaming.java:107)
                at org.apache.hadoop.streaming.TestStreaming.testCommandLine(TestStreaming.java:123)
    
        This test was flaky due to trying to write some data into /bin/ls.
        Depending on the speed of the test run, this sometimes resulted
        in a Broken Pipe on flush() which caused the test to fail.
    
    Reason: Bugfix (race condition in test)
    Author: Todd Lipcon
    Ref: UNKNOWN

commit ae699cda01c093097ae723224553773247577aa2
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:52:32 2010 -0800

    HDFS-961. dfs_readdir incorrectly parses paths
    
    Description: fuse-dfs dfs_readdir assumes that DistributedFileSystem#listStatus returns Paths with the same scheme/authority as the dfs.name.dir used to connect. If the NameNode.DEFAULT_PORT port is used, listStatus returns Paths that have authorities without the port (see HDFS-960, http://issues.apache.org/jira/browse/HDFS-960), which breaks the following code.

        // hack city: todo fix the below to something nicer and more maintainable but
        // with good performance
        // strip off the path but be careful if the path is solely '/'
        // NOTE - this API started returning filenames as full dfs uris
        const char *const str = info[i].mName + dfs->dfs_uri_len + path_len + ((path_len == 1 && *path == '/') ? 0 : 1);

    Let's make the path parsing here more robust. listStatus returns normalized paths, so we can find the start of the path by searching for the 3rd slash. A more long-term solution is to have hdfsFileInfo maintain a path object or at least pointers to the relevant URI components.
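
    A minimal sketch of the third-slash search described above, written in
    Java for illustration only (the actual fix lives in the fuse-dfs C code).
    The class and method names are hypothetical, and treating a URI with no
    path component as "/" is an assumption of this sketch.

```java
final class DfsPathSketch {
    /**
     * Returns the path part of a normalized URI string such as
     * "hdfs://host:8020/user/foo" by locating the third '/'.
     * Hypothetical helper; not the fuse-dfs implementation.
     */
    static String pathAfterAuthority(String uri) {
        int idx = -1;
        for (int i = 0; i < 3; i++) {
            idx = uri.indexOf('/', idx + 1);
            if (idx < 0) {
                // Fewer than three slashes (e.g. "hdfs://host"): assume root.
                return "/";
            }
        }
        // The third slash is the first character of the path.
        return uri.substring(idx);
    }
}
```

    Because the search does not depend on the length of the authority, it
    gives the same path whether or not the URI carries an explicit port.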
    Reason: bugfix
    Author: Eli Collins
    Ref: UNKNOWN

commit 7f9f42b27b109eff6fafc6ee24526fcadaf68d69
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:52:23 2010 -0800

    MAPREDUCE-1467. Add a --verbose flag to Sqoop
    
    Description: Need a --verbose flag that sets the log4j level to DEBUG.
    Reason: Logging improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit db680058f5796fc41d61242d60bc86b1b25facf9
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:52:07 2010 -0800

    MAPREDUCE-1469. Sqoop should disable speculative execution in export
    
    Description: Concurrent writers of the same output shard may cause the database to try to insert duplicate primary keys concurrently. Not a good situation. Speculative execution should be forced off for this operation.
    Reason: Bugfix (race condition)
    Author: Aaron Kimball
    Ref: UNKNOWN

commit a5ccc56a79fc53de5ff16c6cb996f41a4216c28d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:51:29 2010 -0800

    MAPREDUCE-1341. Sqoop should have an option to create hive tables and skip the table import step
    
    Description: In case the client only needs to create tables in Hive, it would be helpful if Sqoop had an optional parameter:

        --hive-create-only

    which would omit the time-consuming table import step, generate Hive CREATE TABLE statements, and run them.

    Also adds a --hive-overwrite flag, which allows overwriting an existing table definition.
    
    Reason: New feature
    Author: Leonid Furman
    Ref: UNKNOWN

commit bdf576aa69eeb56a954416f7c2fcbe0136f421bd
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:51:16 2010 -0800

    HADOOP-4012. Providing splitting support for bzip2 compressed files
    
    Description: Hadoop assumes that compressed input data cannot be split (mainly because many codecs need the whole input stream to decompress successfully). In such a case, Hadoop prepares only one split per compressed file, where the lower split limit is 0 and the upper limit is the end of the file. The consequence is that one compressed file goes to a single mapper. Although this circumvents the codec limitation mentioned above, it substantially reduces the parallelism that splitting would otherwise allow.

    BZip2 is a compression/decompression algorithm that compresses data in blocks, and these compressed blocks can later be decompressed independently of each other. This creates an opportunity: instead of one BZip2-compressed file going to one mapper, we can process chunks of the file in parallel. The correctness criterion for such processing is that, for a bzip2-compressed file, each compressed block is processed by only one mapper and ultimately all the blocks of the file are processed. (By processing we mean the actual use of the uncompressed data coming out of the codec in a mapper.)

    We are writing the code to implement this functionality. Although we have used bzip2 as an example, we have tried to extend Hadoop's compression interfaces so that any other codec with the same capability as bzip2 could easily use the splitting support. The details of these changes will be posted when we submit the code.
    Reason: New feature
    Author: Abdul Qadeer
    Ref: UNKNOWN

commit 8e47288583fcdbdf649ddf3486bf201788e79202
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:50:51 2010 -0800

    MAPREDUCE-707. Provide a jobconf property for explicitly assigning a job to a pool
    
    Description: A common use case of the fair scheduler is to have one pool per user, but then to define some special pools for various production jobs, import jobs, etc. Therefore, it would be nice if jobs went by default to the pool of the user who submitted them, but there was a setting to explicitly place a job in another pool. Today, this can be achieved through a sort of trick in the JobConf:
    
        <property>
          <name>mapred.fairscheduler.poolnameproperty</name>
          <value>pool.name</value>
        </property>

        <property>
          <name>pool.name</name>
          <value>${user.name}</value>
        </property>

    This JIRA proposes to add a property called mapred.fairscheduler.pool that allows a job to be placed directly into a pool, avoiding the need for this trick.
    Reason: Configuration improvement
    Author: Alan Heirich
    Ref: UNKNOWN

commit 96e17e1e593b818a888c8dfc177b8fb36e514e8f
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:50:18 2010 -0800

    MAPREDUCE-967. (version 2) TaskTracker does not need to fully unjar job jars
    
    Description:
        This is a performance improvement for jobs that contain a large number of
        classes. The unpacking of these jars consumes a large amount of time, as
        does the resulting cleanup. This patch changes the classpath to simply
        include the jar itself, and only unpacks the lib/ directory out of the
        jar in order to add those dependencies to the classpath.
    
        Users who previously depended on this functionality for shipping non-code
        dependencies can use the undocumented configuration parameter
        "mapreduce.job.jar.unpack.pattern" to cause specific jar contents to be unpacked.
    
        This new patch version fixes a streaming regression where the "-file" argument
        no longer worked. It includes a new unit test, TestFileArgs, to protect
        against this regression.
    Author: Todd Lipcon
    Ref: UNKNOWN

commit cf08a128b87bbfae90babd61795599b3645d37a3
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:48:40 2010 -0800

    HDFS-455, MAPREDUCE-1441, HADOOP-6534. Allow spaces in between comma-separated elements in directory list configurations.
    
    Description: Make the NN and DN handle comma-separated configuration strings in an intuitive way.

    The following configuration causes problems:

        <property>
          <name>dfs.data.dir</name>
          <value>/mnt/hstore2/hdfs, /home/foo/dfs</value>
        </property>

    The problem is that the space after the comma causes the second storage directory to be " /home/foo/dfs", which is a directory named <SPACE> containing a sub-dir named "home" in the Hadoop datanode's default directory. This will typically cause the user's home partition to fill, and it will be very hard for the user to diagnose, since a directory whose name is whitespace is hard to spot.

    (ripped from HADOOP-2366, http://issues.apache.org/jira/browse/HADOOP-2366)

    This fixes any configuration consisting of a comma-separated list of directories
    (e.g., dfs.data.dir, dfs.name.dir, fs.checkpoint.dir, mapred.local.dir, etc) so that
    the elements may also contain separating whitespace. Without this patch,
    setting mapred.local.dir to "/disk1, /disk2" would create a directory by the name
    " " in the user's home directory, or fail outright. The patch trims the
    directory names as they are fetched from the configuration.
    
    Reason: Configuration improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 65a04ab8197a8db21a97d279ca881b5cd45a5365
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:48:03 2010 -0800

    HADOOP-2366. Space in the value for dfs.data.dir can cause great problems
    
    Description: The following configuration causes problems:

        <property>
          <name>dfs.data.dir</name>
          <value>/mnt/hstore2/hdfs, /home/foo/dfs</value>
          <description>
          Determines where on the local filesystem an DFS data node should store its
          blocks.  If this is a comma-delimited list of directories, then data will be
          stored in all named directories, typically on different devices.  Directories
          that do not exist are ignored.
          </description>
        </property>

    The problem is that the space after the comma causes the second storage directory to be " /home/foo/dfs", which is a directory named <SPACE> containing a sub-dir named "home" in the Hadoop datanode's default directory. This will typically cause the user's home partition to fill, and it will be very hard for the user to diagnose, since a directory whose name is whitespace is hard to spot.

    My proposed solution would be to trimLeft all path names from this and similar properties after splitting on comma. This still allows spaces in file and directory names but avoids this problem.

        This provides support in Configuration to get comma-separated string lists in such
        a way that whitespace in between elements is ignored. This patch is required for
        later patches which fix mapred.local.dir, dfs.data.dir, etc to support spaces
        in between elements.
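
        A minimal sketch of the whitespace-tolerant splitting described above.
        This is not the patch itself: the class and method names are
        hypothetical, and dropping empty elements is an assumption of this
        sketch. Spaces inside an element are preserved.

```java
import java.util.ArrayList;
import java.util.List;

final class TrimmedListSketch {
    /**
     * Splits a comma-separated value and trims whitespace around each
     * element. Hypothetical helper for illustration only.
     */
    static List<String> splitTrimmed(String value) {
        List<String> out = new ArrayList<>();
        if (value == null) {
            return out;
        }
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            // Assumption: elements that are empty after trimming are skipped.
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }
}
```

        With this approach, "/mnt/hstore2/hdfs, /home/foo/dfs" yields the two
        directories "/mnt/hstore2/hdfs" and "/home/foo/dfs", with no leading
        space on the second one.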
    
        Test plan: unit tested in TestStringUtils
    Reason: Configuration improvement
    Author: Michele (@pirroh) Catasta
    Ref: UNKNOWN

commit 8d4807322a42509726b376b37a89739acd6cbd7d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:47:55 2010 -0800

    MAPREDUCE-1356. Allow user-specified hive table name in sqoop
    
    Description: The table name used in a hive-destination import is currently pegged to the input table name. This should be user-configurable.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 8bf3439ff69762a33967dca4abb15c0cd2bb8417
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:47:45 2010 -0800

    MAPREDUCE-1395. Sqoop does not check return value of Job.waitForCompletion()
    
    Description: Old code depended on JobClient.runJob() throwing IOException on failure. Job.waitForCompletion can fail in that manner, or it can fail by returning false. Sqoop needs to check for this condition.
    Reason: bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit bd4e81234dd12fa9534577f0caa0db5c3d0a99fc
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:47:30 2010 -0800

    CLOUDERA-BUILD. Set HADOOP_PID_DIR to something smarter than /tmp
    
    Author: Chad Metcalf

commit 2466310d0e2a426e848860e9a8411b8ea14e1bb1
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:47:07 2010 -0800

    HADOOP-6453. Hadoop wrapper script shouldn't ignore an existing JAVA_LIBRARY_PATH
    
    Description: Currently the hadoop wrapper script assumes it's the only place that uses JAVA_LIBRARY_PATH and initializes it to an empty string:

        JAVA_LIBRARY_PATH=''

    This prevents anyone from setting it outside of the hadoop wrapper (say, in hadoop-config.sh) for their own native libraries.

    The fix is pretty simple: don't initialize it to '' and append the native libs like normal.
    Reason: Bugfix (environment)
    Author: Chad Metcalf
    Ref: UNKNOWN

commit a67b4b1c361c26e002da64953a7f8bc068d29b98
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:46:42 2010 -0800

    MAPREDUCE-1327. Oracle database import via sqoop fails when a table contains the column types such as TIMESTAMP(6) WITH LOCAL TIME ZONE and TIMESTAMP(6) WITH TIME ZONE
    
    Description: When an Oracle table contains the columns "TIMESTAMP(6) WITH LOCAL TIME ZONE" and "TIMESTAMP(6) WITH TIME ZONE", Sqoop fails to map values for those columns to valid Java data types, resulting in the following exception:

        ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException
        java.lang.NullPointerException
                at org.apache.hadoop.sqoop.orm.ClassWriter.generateFields(ClassWriter.java:253)
                at org.apache.hadoop.sqoop.orm.ClassWriter.generateClassForColumns(ClassWriter.java:701)
                at org.apache.hadoop.sqoop.orm.ClassWriter.generate(ClassWriter.java:597)
                at org.apache.hadoop.sqoop.Sqoop.generateORM(Sqoop.java:75)
                at org.apache.hadoop.sqoop.Sqoop.importTable(Sqoop.java:87)
                at org.apache.hadoop.sqoop.Sqoop.run(Sqoop.java:175)
                at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
                at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
                at org.apache.hadoop.sqoop.Sqoop.main(Sqoop.java:201)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    
    Reason: Compatibility improvement
    Author: Leonid Furman
    Ref: UNKNOWN

commit a937ba2b9b6132883d727f856911ae31d22ad619
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:46:26 2010 -0800

    MAPREDUCE-1394. Sqoop generates incorrect URIs in paths sent to Hive
    
    Description: Hive used to require a ':8020' in HDFS URIs used with LOAD DATA statements, even though the normalized form of such a URI does not contain an explicit port number (since 8020 is the default port). Sqoop matched this by hacking the URI strings it forwarded to Hive.
    
    Hive fixed this bug a while ago; Sqoop should catch up.
    Reason: bugfix (compatibility)
    Author: Aaron Kimball
    Ref: UNKNOWN

commit c5c9b8bf0bf83637589a809b3c376cf74a2fb464
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:45:54 2010 -0800

    MAPREDUCE-1313. NPE in FieldFormatter if escape character is set and field is null
    
    Description: Performing an import with the --escaped-by character set on a table with a null field will cause a NullPointerException in FieldFormatter.
    Reason: bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 1c6dd471832946929928801dd9c9e4b79259ad9d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:45:38 2010 -0800

    HADOOP-6460. Namenode runs out of memory due to memory leak in ipc Server

    Description: Namenode heap usage grows disproportionately to the number of objects it supports (files, directories and blocks). Based on heap dump analysis, this is due to large growth in ByteArrayOutputStream allocated in o.a.h.ipc.Server.Handler.run().
    Reason: Bugfix (Scalability)
    Author: Suresh Srinivas
    Ref: UNKNOWN

commit d190a8067827ce09cdcb7741d588cce0e0e7aa02
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:45:23 2010 -0800

    HADOOP-5687. Hadoop NameNode throws NPE if fs.default.name is the default value
    
    Description: Throwing an NPE is confusing; an exception with a useful string description should be thrown instead.
    Reason: Logging improvement
    Author: Philip Zeyliger
    Ref: UNKNOWN

commit 7604c6f69076effbb0c9793e114946d679f5912d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:45:02 2010 -0800

    HADOOP-6505. sed in build.xml fails
    
    Description: I'm not sure whether this is a Solaris thing or an ant 1.7.1 thing, but it definitely doesn't do what it is supposed to.  Instead of getting SunOS-x86-32 (or whatever) I get -x86-32.
    
    This patch replaces the sed call with tr.
    Reason: OS compatibility improvement
    Author: Allen Wittenauer
    Ref: UNKNOWN

commit ca662cbba6044be216b586e7359d9fc2f1dd4e4f
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:44:00 2010 -0800

    HDFS-908. (version 2) TestDistributedFileSystem fails with Wrong FS on weird hosts
    
    Description: On the same host where I experienced HDFS-874 (http://issues.apache.org/jira/browse/HDFS-874, "TestHDFSFileContextMainOperations fails on weirdly configured DNS hosts"), I also experience this failure for TestDistributedFileSystem:

        Testcase: testFileChecksum took 0.492 sec
          Caused an ERROR
        Wrong FS: hftp://localhost.localdomain:59782/filechecksum/foo0, expected: hftp://127.0.0.1:59782
        java.lang.IllegalArgumentException: Wrong FS: hftp://localhost.localdomain:59782/filechecksum/foo0, expected: hftp://127.0.0.1:59782
          at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
          at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:222)
          at org.apache.hadoop.hdfs.HftpFileSystem.getFileChecksum(HftpFileSystem.java:318)
          at org.apache.hadoop.hdfs.TestDistributedFileSystem.testFileChecksum(TestDistributedFileSystem.java:166)

    Doesn't appear to occur on trunk or branch-0.21.

    This is version two of this patch. The previous patch fixed some systems
    but broke others.
    Reason: Bugfix
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 7fafe032223921ad194c69b16ab451b4aade87fa
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:43:41 2010 -0800

    HADOOP-4368. Superuser privileges required to do "df"
    
    Description: Superuser privileges are required in DFS in order to get the file system statistics (FSNamesystem.java, getStats method). This means that when HDFS is mounted via fuse-dfs as a non-root user, "df" is going to return 16 exabytes total and 0 free instead of the correct amount.

    As far as I can tell, there's no need to require superuser privileges to see the file system size (and historically in Unix, this is not required).

    To fix this, simply comment out the privilege check in the getStats method.
    Reason: Usability improvement
    Author: Craig Macdonald
    Ref: UNKNOWN

commit 6129c87f5dd1fdb7375c80285534b8b91fbcd392
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:43:25 2010 -0800

    HDFS-412. Hadoop JMX usage makes Nagios monitoring impossible
    
    Description: When Hadoop reports Datanode information to JMX, the bean uses the name "DataNode-" + storageid. The storage ID incorporates a random number and is unpredictable.

    This prevents me from monitoring DFS datanodes through Hadoop using the JMX interface; in order to do that, you must be able to specify the bean name on the command line.

    The fix is simple; a patch will be coming momentarily. However, there was probably a reason for giving the datanodes all unique names which I'm unaware of, so it'd be nice to hear from the metrics maintainer.
    Reason: Monitoring improvement
    Author: Brian Bockelman
    Ref: UNKNOWN

commit 5dfcc6d2d7806636c6237996e1b28a00ba075b4b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:43:05 2010 -0800

    HADOOP-6503. contrib projects should pull in the ivy-fetched libs from the root project
    
    Description: On branch-20 currently, I get an error just running "ant contrib -Dtestcase=TestHdfsProxy". In a full "ant test" build sometimes this doesn't appear to be an issue. The problem is that the contrib projects don't automatically pull in the dependencies of the "Hadoop" ivy project. Thus, they each have to declare all of the common dependencies like commons-cli, etc. Some are missing and this causes test failures.
    Reason: Build system improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit be70b10f11445f4a71807405718bfeebd38ad924
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:42:51 2010 -0800

    MAPREDUCE-1155. Streaming tests swallow exceptions
    
    Description: Many of the streaming tests (including TestMultipleArchiveFiles) catch exceptions and print their stack trace rather than failing the job. This means that tests do not fail even when the job fails.
    Reason: Test coverage improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit f84830ae5e6c862cd0e2b8ebea57880e54c8a082
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:42:33 2010 -0800

    HADOOP-5647. TestJobHistory fails if /tmp/_logs is not writable to. Testcase should not depend on /tmp
    
    Description: TestJobHistory sets /tmp as hadoop.job.history.user.location to check if the history file is created in that directory or not. If /tmp/_logs is already created by some other user, this test will fail because of not having write permission.
    Reason: Bugfix in test harness
    Author: Ravi Gummadi
    Ref: UNKNOWN

commit 669b65f14d78ffd1cf0304cf459d1abbae3412ae
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:42:15 2010 -0800

    CLOUDERA-BUILD. Fix javadoc warnings shown by test-patch, and update eclipse classpath to match current CDH.
    
    Author: Todd Lipcon

commit 51804fd45d3a527a130a373c591a17c185102a0c
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:41:40 2010 -0800

    Revert "HDFS-127: DFSClient block read failures cause open DFSInputStream to become unusable"
    
    Description: This is being reverted as it causes infinite retries when there are no valid replicas.
    Reason: bugfix
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 623bfc0c18087274315dfbd41d025a8a775abe80
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:40:30 2010 -0800

    HDFS-877. Client-driven block verification not functioning
    
    Description: This is actually the reason for HDFS-734 (http://issues.apache.org/jira/browse/HDFS-734; TestDatanodeBlockScanner timing out). The issue is that DFSInputStream relies on readChunk being called one last time at the end of the file in order to receive the lastPacketInBlock=true packet from the DN. However, DFSInputStream.read checks pos < getFileLength() before issuing the read. Thus gotEOS never shifts to true and checksumOk() is never called.

    This is a simpler patch than the one on 0.21/0.22, because those fix a further
    regression introduced since 0.20.
    
    Reason: bugfix
    Author: Todd Lipcon
    Ref: UNKNOWN

commit b332fe77255047409da701dfb97df1bddb5b10cb
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:40:05 2010 -0800

    CLOUDERA-BUILD. Add mockito to 0.20 branch for easier unit testing of HDFS stability patches.
    
    Reason: Test coverage improvement
    Author: Todd Lipcon

commit 44a6c559de056b35c6eb2e2d53798c88d8c779e6
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:39:09 2010 -0800

    HDFS-630. In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
    
    Description: Created from HDFS-200.

    If, during a write, the DFSClient sees that a block replica location for a newly allocated block is not connectable, it re-requests the NN to get a fresh set of replica locations for the block. It tries this dfs.client.block.write.retries times (default 3), sleeping 6 seconds between each retry (see DFSClient.nextBlockOutputStream).

    This setting works well when you have a reasonably sized cluster; if you have few datanodes in the cluster, every retry may pick the dead datanode and the above logic bails out.

    Our solution: when getting block locations from the namenode, we give the NN the excluded datanodes. The list of dead datanodes applies to only one block allocation.
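
    A minimal sketch of the exclusion idea, under stated assumptions. The
    Allocator and Connector interfaces are hypothetical stand-ins for the
    namenode allocation call and the datanode connection attempt; they are
    not the HDFS API. The sleep between retries is omitted, and the excluded
    set is local to one call, matching "only one block allocation".

```java
import java.util.HashSet;
import java.util.Set;

final class ExcludeRetrySketch {
    /** Hypothetical stand-in for asking the namenode for a target node. */
    interface Allocator {
        String allocate(Set<String> excluded);
    }

    /** Hypothetical stand-in for trying to connect to a datanode. */
    interface Connector {
        boolean canConnect(String node);
    }

    /** Returns a connectable node, or null if the retries are exhausted. */
    static String pickNode(Allocator nn, Connector conn, int retries) {
        // The excluded set lives only for this one allocation.
        Set<String> excluded = new HashSet<>();
        for (int attempt = 0; attempt <= retries; attempt++) {
            String node = nn.allocate(excluded);
            if (node == null) {
                break; // nothing left to offer
            }
            if (conn.canConnect(node)) {
                return node;
            }
            // Remember the failed node so the next request avoids it.
            excluded.add(node);
        }
        return null;
    }
}
```

    Without the excluded set, a small cluster could hand back the same dead
    node on every retry; with it, each retry is guaranteed to try a node that
    has not already failed during this allocation.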
    Reason: bugfix (Fault tolerance improvement)
    Author: Cosmin Lehene (modified by Cloudera to not break compatibility)
    Ref: UNKNOWN

commit 47c404e0cf10ceb31336d2a77d53e0a971348102
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:37:37 2010 -0800

    HDFS-908. TestDistributedFileSystem fails with Wrong FS on weird hosts
    
    Description: On the same host where I experienced HDFS-874 (http://issues.apache.org/jira/browse/HDFS-874, "TestHDFSFileContextMainOperations fails on weirdly configured DNS hosts"), I also experience this failure for TestDistributedFileSystem:

        Testcase: testFileChecksum took 0.492 sec
          Caused an ERROR
        Wrong FS: hftp://localhost.localdomain:59782/filechecksum/foo0, expected: hftp://127.0.0.1:59782
        java.lang.IllegalArgumentException: Wrong FS: hftp://localhost.localdomain:59782/filechecksum/foo0, expected: hftp://127.0.0.1:59782
          at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
          at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:222)
          at org.apache.hadoop.hdfs.HftpFileSystem.getFileChecksum(HftpFileSystem.java:318)
          at org.apache.hadoop.hdfs.TestDistributedFileSystem.testFileChecksum(TestDistributedFileSystem.java:166)

    Doesn't appear to occur on trunk or branch-0.21.
    Reason: bugfix
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 7c2a791f0a397d924a623e45bf823c238374c42c
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:37:19 2010 -0800

    MAPREDUCE-1258. Fair scheduler event log not logging job info
    
    Description: The <a href="http://issues.apache.org/jira/browse/MAPREDUCE-706" title="Support for FIFO pools in the fair scheduler"><del>MAPREDUCE-706</del></a> patch seems to have left an unfinished TODO in the Fair Scheduler - namely, in the dump() function for periodically dumping scheduler state to the event log, the part that dumps information about jobs is commented out. This makes the event log less useful than it was before.
    
    <p>It should be fairly easy to update this part to use the new scheduler data structures (Schedulable etc) and print the data.</p>
    Reason: Logging improvement
    Author: Matei Zaharia
    Ref: UNKNOWN

commit 353f7813bf7dfb0bca1362f9370f6a080256a345
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:36:58 2010 -0800

    MAPREDUCE-1198. Alternatively schedule different types of tasks in fair share scheduler
    
    Description: Matei mentioned in <a href="http://issues.apache.org/jira/browse/MAPREDUCE-961" title="ResourceAwareLoadManager to dynamically decide new tasks based on current CPU/memory load on TaskTracker(s)">MAPREDUCE-961</a> that the current scheduler first tries to launch map tasks until canLaunchTask() returns false, and only then looks for reduce tasks. This might starve reduce tasks. He also mentioned that scheduling the different task types alternately can solve this problem.
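    
    A simplified sketch of alternating between task types, assuming a single shared pool of free slots (the real scheduler tracks map and reduce slots separately). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the fair scheduler's actual assignment loop.
class AlternatingSchedulerSketch {
    /**
     * Returns the order in which free slots are filled. Task types alternate while
     * both have pending work; once one type runs out, the other takes the rest.
     */
    static List<String> assign(int pendingMaps, int pendingReduces, int freeSlots) {
        List<String> assigned = new ArrayList<>();
        boolean mapTurn = true;
        while (assigned.size() < freeSlots && (pendingMaps > 0 || pendingReduces > 0)) {
            if ((mapTurn && pendingMaps > 0) || pendingReduces == 0) {
                assigned.add("MAP");
                pendingMaps--;
            } else {
                assigned.add("REDUCE");
                pendingReduces--;
            }
            mapTurn = !mapTurn;
        }
        return assigned;
    }
}
```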
    Reason: bugfix
    Author: Scott Chen
    Ref: UNKNOWN

commit ef449fb7832055951e2364cf12a73717b2add3ce
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:36:50 2010 -0800

    MAPREDUCE-698. Per-pool task limits for the fair scheduler
    
    Description: The fair scheduler could use a way to cap the share of a given pool similar to <a href="http://issues.apache.org/jira/browse/MAPREDUCE-532" title="Allow admins of the Capacity Scheduler to set a hard-limit on the capacity of a queue"><del>MAPREDUCE-532</del></a>.
    Reason: New feature
    Author: Kevin Peterson
    Ref: UNKNOWN

commit a1e25ec70e677db322b2cce43c6381f865eb3f79
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:36:42 2010 -0800

    HDFS-464. Memory leaks in libhdfs
    
    Description: hdfsExists does not call destroyLocalReference for jPath anytime,<br/>
    hdfsDelete does not call it when it fails, and<br/>
    hdfsRename does not call it for jOldPath and jNewPath when it fails
    Reason: bugfix
    Author: Christian Kunz
    Ref: UNKNOWN

commit d93dad715d3c702d15c2a32c85d586c708e70857
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:36:23 2010 -0800

    CLOUDERA-BUILD. Add test ivy configurations to additional projects.
    
    Author: Aaron Kimball
    Reason: Build system improvement

commit 5d0c8f82b87e7cbb541ace9e4f22abfad2799e56
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:35:08 2010 -0800

    CLOUDERA-BUILD. Sqoop bin script now includes jars from contrib/sqoop/lib/ on classpath.
    
    Author: Aaron Kimball

commit 7e009a29c0806537cd50972df90ec87b617eb78f
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:34:54 2010 -0800

    MAPREDUCE-1212. Mapreduce contrib project ivy dependencies are not included in binary target
    
    Description: As in <a href="http://issues.apache.org/jira/browse/HADOOP-6370" title="Contrib project ivy dependencies are not included in binary target">HADOOP-6370</a>, only Hadoop's own library dependencies are promoted to ${build.dir}/lib; any libraries required by contribs are not redistributed.
    Reason: Build system (packaging) improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 8d289f97d6b66cd435f755a4acae9f138de934d6
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:34:43 2010 -0800

    CLOUDERA-BUILD. Update cloud script version to cdh-0.20.1
    
    Author: Tom White

commit ac7eacd44af059d7a859b8d6773a82cd84ba4c9b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:34:35 2010 -0800

    HADOOP-6466. Add a ZooKeeper service to the cloud scripts
    
    Description: It would be good to add other Hadoop services to the cloud scripts.
    Reason: New feature
    Author: Tom White
    Ref: UNKNOWN

commit 06ceb079693292a41085af795c5b2bbc3fd10af2
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:34:24 2010 -0800

    HADOOP-6454. Create setup.py for EC2 cloud scripts
    
    Description: This would make it easier to install the scripts.
    Reason: Installation improvement
    Author: Tom White
    Ref: UNKNOWN

commit 23c45791bbc3a23d69c77f3518b5d1a1a4702ccc
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:34:11 2010 -0800

    HADOOP-6462. contrib/cloud failing, target "compile" does not exist
    
    Description: I'm not seeing this mentioned in hudson or other bugreports, which confuses me. With the addition of a src/contrib/cloud/build.xml from <a href="http://issues.apache.org/jira/browse/HADOOP-6426" title="Create ant build for running EC2 unit tests"><del>HADOOP-6426</del></a>, contrib/build.xml no longer builds: <br/>
    hadoop-common/src/contrib/build.xml:30: The following error occurred while executing this line:<br/>
    Target "compile" does not exist in the project "hadoop-cloud".
    
    <p>What is odd is this: the final patch of <a href="http://issues.apache.org/jira/browse/HADOOP-6426" title="Create ant build for running EC2 unit tests"><del>HADOOP-6426</del></a> does include the stub &lt;target&gt; files needed, yet they aren't in SVN_HEAD. Which implies that a different version may have gone in than intended. </p>
    Reason: Build system bugfix
    Author: Tom White
    Ref: UNKNOWN

commit 083a6a1cfb2a5198243aa82a020681ad62da5938
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:33:58 2010 -0800

    HADOOP-6444. Support additional security group option in hadoop-ec2 script
    
    Description: When deploying a hadoop cluster on ec2 alongside other services it is very useful to be able to specify additional (pre-existing) security groups to facilitate access control.  For example one could use this feature to add a cluster to a generic "hadoop" group, which authorizes hdfs access from instances outside the cluster.  Without such an option the access control for the security groups created by the script needs to be manually updated after cluster launch.
    Reason: Security improvement
    Author: Paul Egan
    Ref: UNKNOWN

commit 63152ce4ba3c0cf2006016cc825fc72b0bd23d2d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:33:49 2010 -0800

    HADOOP-6426. Create ant build for running EC2 unit tests
    
    Description: There is no easy way currently to run the Python unit tests for the cloud contrib.
    Reason: Test coverage improvement
    Author: Tom White
    Ref: UNKNOWN

commit a20069b2adfafa59e0001fe5e5685d36d9eb7fee
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:33:15 2010 -0800

    HADOOP-6392. Run namenode and jobtracker on separate EC2 instances
    
    Description: Replace concept of "master" with that of "namenode" and "jobtracker". Still need to be able to run both on one node, of course.
    Reason: Scalability improvement
    Author: Tom White
    Ref: UNKNOWN

commit 361221a2a082d0ab7a87ba0226dbe05938440738
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:33:07 2010 -0800

    HADOOP-6108. Add support for EBS storage on EC2
    
    Description: By using EBS for namenode and datanode storage we can have persistent, restartable Hadoop clusters running on EC2.
    Reason: New feature
    Author: Tom White
    Ref: UNKNOWN

commit 4ca1c78e1b257eefa10b5ed94479df8a6473d3e9
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:32:50 2010 -0800

    HDFS-861. fuse-dfs does not support O_RDWR
    
    Description: Some applications (for us, the big one is rsync) open a file in read-write mode when they really intend only to read or only to write (not both).  fuse-dfs should try not to fail until the application actually tries to write to a pre-existing file or read from a newly created file.
    Reason: bugfix
    Author: Brian Bockelman
    Ref: UNKNOWN

commit 00f6976093cc20ea825a35f6831f645dc5f61637
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:32:17 2010 -0800

    HDFS-860. fuse-dfs truncate behavior causes issues with scp
    
    Description: For whatever reason, scp issues a "truncate" once it's written a file to truncate the file to the # of bytes it has written (i.e., if a file is X bytes, it calls truncate(X)).
    
    <p>This fails on the current fuse-dfs.</p>
    Reason: bugfix (tool compatibility)
    Author: Brian Bockelman
    Ref: UNKNOWN

commit 46d2b6d6b27887375c44d691d776f70e89e4b81b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:31:58 2010 -0800

    HDFS-859. fuse-dfs utime behavior causes issues with tar
    
    Description: When trying to untar files onto fuse-dfs, tar will try to set the utime on all the files and directories.  However, setting the utime on a directory in libhdfs causes an error.
    
    <p>We should silently ignore the failure of setting a utime on a directory; this will allow tar to complete successfully.</p>
    Reason: bugfix (tool compatibility)
    Author: Brian Bockelman
    Ref: UNKNOWN

commit 9a38b9c423aca358307aa6455977432f34aef990
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:31:45 2010 -0800

    HDFS-858. Incorrect return codes for fuse-dfs
    
    Description: fuse-dfs doesn't pass proper error codes from libhdfs; places I'd like to correct are hdfsFileOpen (which can result in permission denied or quota violations) and hdfsWrite (which can result in quota violations).
    
    <p>By returning the correct error codes, command line utilities return much better error messages - especially for quota violations, which can be a devil to debug.</p>
    Reason: bugfix
    Author: Brian Bockelman
    Ref: UNKNOWN

commit 84afb26bb0e42eda1e26b07e3aac016695f5ad87
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:31:37 2010 -0800

    HDFS-857. Incorrect type for fuse-dfs capacity can cause "df" to return negative values on 32-bit machines
    
    Description: On sufficiently large HDFS installs, the casting of hdfsGetCapacity to a long may cause "df" to return negative values.  tOffset should be used instead.
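    
    A small illustration of the overflow, written in Java rather than C: narrowing a 64-bit byte count to 32 bits mimics what a cast to C's long does on a typical 32-bit platform. The class and method names are hypothetical, and tOffset is assumed here to be 64 bits wide.

```java
// Hypothetical illustration only; fuse-dfs itself is written in C.
class CapacityOverflowSketch {
    /** Narrows a byte count to a 32-bit signed value, as a 32-bit C long would hold it. */
    static int asThirtyTwoBit(long capacityBytes) {
        return (int) capacityBytes;
    }

    /** True when the capacity survives the narrowing unchanged. */
    static boolean fitsInThirtyTwoBits(long capacityBytes) {
        return asThirtyTwoBit(capacityBytes) == capacityBytes;
    }
}
```

    Any capacity of 2 GiB or more no longer fits, which is why a 64-bit type is needed for the value reported to "df".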
    Reason: bugfix
    Author: Brian Bockelman
    Ref: UNKNOWN

commit a4cf3e8e86cbd42bef25eb3aab7e464ac86e3068
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:31:19 2010 -0800

    HDFS-856. Hardcoded replication level for new files in fuse-dfs
    
    Description: In fuse-dfs, the number of replicas is always hardcoded to 3 in the arguments to hdfsOpenFile.  We should use the setting in the hadoop configuration instead.
    Reason: Configuration improvement
    Author: Brian Bockelman
    Ref: UNKNOWN

commit e9f3ec90e57b383faf49e6a6eb8cc91e5182d31e
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:31:08 2010 -0800

    HADOOP-5625. Add I/O duration time in client trace
    
    Description: Add I/O duration information into client trace log for analyzing performance.
    
    Reason: Logging improvement
    Author: Lei Xu
    Ref: UNKNOWN

commit 42eeb4540850278563e76841f0c6b369933d5b70
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:30:43 2010 -0800

    HADOOP-5222. Add offset in client trace
    
    Description: Adding the offset to the client trace lets the trace provide more accurate information about I/O.<br/>
    This is useful for performance analysis.
    
    <p>Since there is no random write yet, the write offset is always zero.</p>
    Reason: Logging improvement
    Author: Lei Xu
    Ref: UNKNOWN

commit 5880960fb32ae0fc2c16bac1f333dbb237c3448f
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:30:27 2010 -0800

    CLOUDERA-BUILD. Solaris do-release-build fix
    
    Author: Eli Collins
    Ref: CDH-531

commit 35f87aef6d7cd4030644a1d454da2e0a6e2969c0
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:30:18 2010 -0800

    MAPREDUCE-1310. CREATE TABLE statements for Hive do not correctly specify delimiters
    
    Description: Imports to HDFS via Sqoop that also inject metadata into Hive do not correctly specify delimiters; using Hive to access the data results in rows being parsed as NULL characters. See <a href="http://getsatisfaction.com/cloudera/topics/sqoop_hive_import_giving_null_query_values">http://getsatisfaction.com/cloudera/topics/sqoop_hive_import_giving_null_query_values</a> for an example bug report.
    Reason: Bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 60784d712cdd5781ceff262bb67e2d484fde428b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:29:56 2010 -0800

    MAPREDUCE-1235. java.io.IOException: Cannot convert value '0000-00-00 00:00:00' from column 6 to TIMESTAMP.
    
    Description: java.io.IOException is thrown when trying to import a table to HDFS using Sqoop. The table has a "0" value in a field of type datetime. <br/>
    <b>Full Exception</b>: java.io.IOException: Cannot convert value '0000-00-00 00:00:00' from column 6 to TIMESTAMP. <br/>
    <b>Original question</b>: <a href="http://getsatisfaction.com/cloudera/topics/cant_import_table?utm_content=reply_link&amp;utm_medium=email&amp;utm_source=reply_notification">http://getsatisfaction.com/cloudera/topics/cant_import_table?utm_content=reply_link&amp;utm_medium=email&amp;utm_source=reply_notification</a>
    Reason: Bugfix (compatibility)
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 23c116b6ab5615bdb846e22b61a41e92ca287bdf
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:29:47 2010 -0800

    MAPREDUCE-1174. Sqoop improperly handles table/column names which are reserved sql words
    
    Description: In some databases it is legal to name tables and columns with terms that overlap SQL reserved keywords (e.g., <tt>CREATE</tt>, <tt>table</tt>, etc.). In such cases, the database allows you to escape the table and column names. We should always escape table and column names when possible.
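    
    A minimal sketch of identifier escaping, assuming the quote character is supplied per database (for example a double quote for standard SQL or a backtick for MySQL). The class and method names are hypothetical and are not Sqoop's API.

```java
// Hypothetical illustration only; not Sqoop's actual escaping code.
class IdentifierEscapeSketch {
    /** Wraps an identifier in the given quote character, doubling any embedded quote. */
    static String escape(String identifier, char quote) {
        String q = String.valueOf(quote);
        return q + identifier.replace(q, q + q) + q;
    }

    /** Builds a query that stays valid even when the table name is a reserved word. */
    static String selectAll(String table, char quote) {
        return "SELECT * FROM " + escape(table, quote);
    }
}
```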
    Reason: Bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit d4b3b7592c94aa1f4608245829b5de202ed1b148
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:29:39 2010 -0800

    MAPREDUCE-1168. Export data to databases via Sqoop
    
    Description: Sqoop can import from a database into HDFS. It's high time it works in reverse too.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit b29023803d1136bf7d4de45853a2d4481fb36d3c
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:29:24 2010 -0800

    MAPREDUCE-1169. Improvements to mysqldump use in Sqoop
    
    Description: Improve Sqoop's integration with mysqldump
    Reason: Feature/performance improvements
    Author: Aaron Kimball
    Ref: UNKNOWN
    
    commit c6b956630e327ddabf674f8e06de02408e603155
    Author: Aaron Kimball <aaron@cloudera.com>
    Date:   Wed Jan 6 16:05:05 2010 -0800
    
        MAPREDUCE-1169. Improvements to mysqldump use in Sqoop

commit 26ba4fd749755a3df79eaa27792662e5b7e3da80
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:29:15 2010 -0800

    MAPREDUCE-1036. An API Specification for Sqoop
    
    Description: Over the last several months, Sqoop has evolved to a state that is functional and has room for extensions. Developing extensions requires a stable API and documentation. I am attaching to this ticket a description of Sqoop's design and internal APIs, which include some open questions. I would like to solicit input on the design regarding these open questions and standardize the API.
    Reason: Documentation
    Author: Aaron Kimball
    Ref: UNKNOWN

commit e8c47124bb2ada5de0cfdf49150dd7296a41df71
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:29:04 2010 -0800

    MAPREDUCE-1069. Implement Sqoop API refactoring
    
    Description: Implement refactoring decisions outlined in <a href="http://issues.apache.org/jira/browse/MAPREDUCE-1036" title="An API Specification for Sqoop"><del>MAPREDUCE-1036</del></a>
    Reason: API compatibility
    Author: Aaron Kimball
    Ref: UNKNOWN

commit b73cab8083c1594c0328a565eef05951a17f998a
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:28:46 2010 -0800

    MAPREDUCE-1146. Sqoop dependencies break Eclipse build on Linux
    
    Description: Under Linux, the following error appears in the Eclipse "Problems" view:
    <div class="preformatted panel" style="border-width: 1px;"><div class="preformattedContent panelContent">
    <pre>- "com.sun.tools cannot be resolved" at line 166 of  org.apache.hadoop.sqoop.orm.CompilationManager
    </pre>
    </div></div>
    <p>The problem doesn't appear on MacOS though</p>
    Reason: bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 0629ac30abb5e58fb80be56a385867ac7360de22
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:28:37 2010 -0800

    MAPREDUCE-1148. SQL identifiers are a superset of Java identifiers
    
    Description: SQL identifiers can contain arbitrary characters, can start with numbers, can be words like <tt>class</tt> which are reserved in Java, etc. If Sqoop uses these names literally for class and field names then compilation errors can occur in auto-generated classes. SQL identifiers need to be cleansed to map onto Java identifiers.
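    
    One possible cleansing rule, sketched under the assumption that invalid characters become underscores and that a leading underscore is enough to avoid reserved words. The class and method names are hypothetical; the actual mapping used by Sqoop may differ.

```java
import javax.lang.model.SourceVersion;

// Hypothetical illustration only; not Sqoop's actual identifier mapping.
class JavaIdentifierSketch {
    /** Maps an arbitrary SQL identifier onto a string usable as a Java identifier. */
    static String toJavaIdentifier(String sqlName) {
        StringBuilder sb = new StringBuilder();
        for (char c : sqlName.toCharArray()) {
            // Replace anything that cannot appear inside a Java identifier.
            sb.append(Character.isJavaIdentifierPart(c) ? c : '_');
        }
        // Identifiers may not be empty or start with a digit.
        if (sb.length() == 0 || !Character.isJavaIdentifierStart(sb.charAt(0))) {
            sb.insert(0, '_');
        }
        String candidate = sb.toString();
        // Words such as "class" are valid in SQL but reserved in Java.
        return SourceVersion.isKeyword(candidate) ? "_" + candidate : candidate;
    }
}
```

    Note that two different SQL names can map to the same Java name under this rule, so a real implementation would also need to handle collisions.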
    Reason: bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit dec4c616921b547e5a332a254254d77efc3a7d5e
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:28:25 2010 -0800

    MAPREDUCE-1224. Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
    
    Description: SqlManager uses the query "SELECT t.* from &lt;table&gt; AS t" to get the table spec, which is too expensive for big tables, and the query was issued twice to generate column names and types.  For tables that are big enough to be map-reduced, this cost is too high for Sqoop to be useful.
    Reason: Performance improvement
    Author: Spencer Ho
    Ref: UNKNOWN

commit 1198ef1375387ba107d46f0ab5e9a7c6a7645931
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:28:15 2010 -0800

    MAPREDUCE-706. Support for FIFO pools in the fair scheduler
    
    Description: The fair scheduler should support making the internal scheduling algorithm for some pools be FIFO instead of fair sharing in order to work better for batch workloads. FIFO pools will behave exactly like the current default scheduler, sorting jobs by priority and then submission time. Pools will have their scheduling algorithm set through the pools config file, and it will be changeable at runtime.
    
    <p>To support this feature, I'm also changing the internal logic of the fair scheduler to no longer use deficits. Instead, for fair sharing, we will assign tasks to the job farthest below its share as a ratio of its share. This is easier to combine with other scheduling algorithms and leads to a more stable sharing situation, avoiding unfairness issues brought up in <a href="http://issues.apache.org/jira/browse/MAPREDUCE-543" title="large pending jobs hog resources"><del>MAPREDUCE-543</del></a> and <a href="http://issues.apache.org/jira/browse/MAPREDUCE-544" title="deficit computation is biased by historical load">MAPREDUCE-544</a> that happen when some jobs have long tasks. The new preemption (<a href="http://issues.apache.org/jira/browse/MAPREDUCE-551" title="Add preemption to the fair scheduler"><del>MAPREDUCE-551</del></a>) will ensure that critical jobs can gain their fair share within a bounded amount of time.</p>
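    
    The ratio-based choice described above can be sketched as follows. The Job class, its fields, and pickNext() are hypothetical simplifications, and every fair share is assumed to be greater than zero.

```java
import java.util.List;

// Hypothetical illustration only; not the fair scheduler's Schedulable classes.
class FairShareSketch {
    static final class Job {
        final String name;
        final int runningTasks;
        final double fairShare; // assumed to be > 0

        Job(String name, int runningTasks, double fairShare) {
            this.name = name;
            this.runningTasks = runningTasks;
            this.fairShare = fairShare;
        }
    }

    /** Fraction of its fair share that a job is currently using. */
    static double ratio(Job job) {
        return job.runningTasks / job.fairShare;
    }

    /** Returns the job farthest below its share as a ratio of that share, or null if none. */
    static Job pickNext(List<Job> jobs) {
        Job best = null;
        for (Job job : jobs) {
            if (best == null || ratio(job) < ratio(best)) {
                best = job;
            }
        }
        return best;
    }
}
```

    Because the decision depends only on current usage and not on an accumulated deficit, a job with long-running tasks cannot build up a large historical claim.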
    Reason: New feature
    Author: Matei Zaharia
    Ref: UNKNOWN

commit 5699f5483e2a9ee9debd0f0154c6506ee5dc87e2
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:28:03 2010 -0800

    MAPREDUCE-1285. DistCp cannot handle -delete if destination is local filesystem
    
    Description: The following exception is thrown:
    <div class="code panel" style="border-width: 1px;"><div class="codeContent panelContent">
    <pre class="code-java">Copy failed: java.io.IOException: wrong value class: org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus is not class org.apache.hadoop.fs.FileStatus
    	at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:988)
    	at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:977)
    	at org.apache.hadoop.tools.DistCp.deleteNonexisting(DistCp.java:1226)
    	at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1134)
    	at org.apache.hadoop.tools.DistCp.copy(DistCp.java:650)
    	at org.apache.hadoop.tools.DistCp.run(DistCp.java:857)
    	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)</pre>
    </div></div>
    Reason: bugfix
    Author: Peter Romianowski
    Ref: UNKNOWN

commit 34bb813a5884aeb05909c2ce2cc541882ca3eda1
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:27:53 2010 -0800

    MAPREDUCE-764. TypedBytesInput's readRaw() does not preserve custom type codes
    
    Description: The typed bytes format supports byte sequences of the form <tt>&lt;custom type code&gt; &lt;length&gt; &lt;bytes&gt;</tt>. When reading such a sequence via <tt>TypedBytesInput</tt>'s <tt>readRaw()</tt> method, however, the returned sequence currently is <tt>0 &lt;length&gt; &lt;bytes&gt;</tt> (0 is the type code for a bytes array), which leads to bugs such as the one described <a href="http://dumbo.assembla.com/spaces/dumbo/tickets/54">here</a>.
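    
    A sketch of the intended behavior, assuming a one-byte type code followed by a four-byte big-endian length. The class and its readRaw(byte[]) signature are hypothetical and differ from the real TypedBytesInput.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not the actual TypedBytesInput implementation.
class TypedBytesRawSketch {
    /** Reads one <type code> <length> <bytes> record and returns it verbatim, code included. */
    static byte[] readRaw(byte[] input) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(input));
            int code = in.readUnsignedByte();
            int length = in.readInt();
            byte[] payload = new byte[length];
            in.readFully(payload);

            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeByte(code); // keep the custom code instead of substituting 0
            out.writeInt(length);
            out.write(payload);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```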
    Reason: bugfix
    Author: Klaas Bosteels
    Ref: UNKNOWN

commit 7fd2cb371354219abd108fda35087f08dc481b35
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:27:31 2010 -0800

    HADOOP-6400. Log errors getting Unix UGI
    
    Description: For various reasons, the calls out to `whoami` and `id` can fail when trying to get the unix UGI information. Currently it silently ignores failures and uses the default DrWho/Tardis ugi. This is extremely confusing for users - we should log the exception at warn level when the shell execs fail.
    Reason: Debug logging improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit d6dc22fecc058e12695a481fa354078d9b012089
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:27:21 2010 -0800

    MAPREDUCE-1293. AutoInputFormat doesn't work with non-default FileSystems
    
    Description: AutoInputFormat uses the wrong FileSystem.get() method when getting a reference to a FileSystem object. AutoInputFormat gets the default FileSystem, so this method breaks if the InputSplit's path is pointing to a different FileSystem.
    Reason: bugfix
    Author: Andrew Hitchcock
    Ref: UNKNOWN

commit 25a4ea86b0b085e3afd6f2f040201594155b3de1
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:27:09 2010 -0800

    MAPREDUCE-1131. Using profilers other than hprof can cause JobClient to report job failure
    
    Description: If task profiling is enabled, the JobClient will download the <tt>profile.out</tt> file created by the tasks under profile. If this causes an IOException, the job is reported as a failure to the client, even though all the tasks themselves may complete successfully. The expected result files are assumed to be generated by hprof. Using the profiling system with other profilers will cause job failure.
    Reason: compatibility bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit ab98123c7114752945452af0b96c8de04af9ba93
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:26:02 2010 -0800

    MAPREDUCE-370. Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api.
    
    Description: Ports the MultipleOutputs OutputFormat to the new context-based API.
    Reason: API compatibility improvement.
    Author: Amareshwari Sriramadasu
    Ref: UNKNOWN

commit 50726d13750f3f71d2fc5d3a012ce81aa2adb26d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:24:46 2010 -0800

    CLOUDERA-BUILD. Backport MapReduceTestUtil to Hadoop 0.20
    
    Description: MapReduceTestUtil is required for unit tests in subsequent
    patches, but this class itself was not created in one clean JIRA. Therefore
    it was backported "As-is" from the trunk and not in a patch-wise fashion.
    This class is only used in the JUnit tests for Hadoop.
    Author: Aaron Kimball
    Reason: Testing improvement
    Ref: UNKNOWN

commit d713dc1063afc4967381b6583ec424d2850bac63
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:24:30 2010 -0800

    MAPREDUCE-1059. distcp can generate uneven map task assignments
    
    Description: distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.
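    
    One way to even out the load is to cut splits by accumulated bytes rather than by position in the file list. The sketch below illustrates that idea only; it is not taken from the patch, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not DistCp's actual split logic.
class DistCpSplitSketch {
    /** Groups consecutive file sizes into splits, closing a split once it reaches targetBytes. */
    static List<List<Long>> split(List<Long> fileSizes, long targetBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long accumulated = 0;
        for (long size : fileSizes) {
            current.add(size);
            accumulated += size;
            if (accumulated >= targetBytes) {
                splits.add(current);
                current = new ArrayList<>();
                accumulated = 0;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current); // leftover files form the final, smaller split
        }
        return splits;
    }
}
```

    No split is ever empty under this rule, and a single large file gets a split of its own.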
    Reason: Improvement for load balancing
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 855b0bf3718f2c397ef79967475468e4153f120a
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:24:20 2010 -0800

    MAPREDUCE-1128. MRUnit Allows Iteration Twice
    
    Description: MRUnit allows one to iterate over a collection of values twice, i.e.:
    
    <p>reduce(Key key, Iterable&lt;Value&gt; values, Context context) {
       for (Value v : values) /* iterate once */;
       for (Value v : values) /* iterate again */;
    }</p>
    
    <p>Hadoop will allow this as well, however the second iterator will be empty. MRUnit should either match hadoop's behavior or warn the user that their code is likely flawed.</p>
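    
    The single-pass behavior can be mimicked with an Iterable that always hands back the same underlying iterator. This is a hypothetical sketch of that behavior, not MRUnit's or Hadoop's implementation.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not MRUnit's or Hadoop's value iterator.
class OneShotIterableSketch<T> implements Iterable<T> {
    private final Iterator<T> source;

    OneShotIterableSketch(List<T> values) {
        this.source = values.iterator();
    }

    /** Returns the same iterator on every call, so a second loop finds it exhausted. */
    @Override
    public Iterator<T> iterator() {
        return source;
    }

    /** Counts the elements seen in one pass over the given values. */
    static <T> int count(Iterable<T> values) {
        int n = 0;
        for (T ignored : values) {
            n++;
        }
        return n;
    }
}
```

    A reducer that needs the values twice has to copy them into its own collection during the first pass.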
    Reason: bugfix (API compatibility)
    Author: Aaron Kimball
    Ref: UNKNOWN

commit c9d77f6e1fdbb24b45675e363e3bd5111533893a
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:24:10 2010 -0800

    HDFS-464. Memory leaks in libhdfs
    
    Description: hdfsExists does not call destroyLocalReference for jPath anytime,<br/>
    hdfsDelete does not call it when it fails, and<br/>
    hdfsRename does not call it for jOldPath and jNewPath when it fails
    Reason: bugfix
    Author: Christian Kunz
    Ref: UNKNOWN

commit c7996c5e2fbb9260740fec369550551d6320762a
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:23:51 2010 -0800

    HDFS-423. Unbreak FUSE build and fuse_dfs_wrapper.sh
    
    Description: fuse-dfs depends on libhdfs, and the fuse-dfs build.xml still points to the libhfds/libhdfs.so location, but libhdfs is now built in a different location. <br/>
    Please see this bug for the location details:
    
    <p><a href="https://issues.apache.org/jira/browse/HADOOP-3344">https://issues.apache.org/jira/browse/HADOOP-3344</a></p>
    
    <p>Thanks,<br/>
    Giri</p>
    Reason: Build system bugfix
    Author: Eli Collins
    Ref: UNKNOWN

commit 72b0b791cd347e760807a44f5197599f57afde03
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:23:39 2010 -0800

    CLOUDERA-BUILD. Make bin/hadoop-config.sh work with dev builds
    
    Author: Eli Collins

commit a9466041ccfcdb07f4f0dd34a57c9e9bdd6a3e70
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:23:06 2010 -0800

    HDFS-727. bug setting block size hdfsOpenFile
    
    Description: In hdfsOpenFile in libhdfs invokeMethod needs to cast the block size argument to a jlong so a full 8 bytes are passed (rather than 4 plus some garbage which causes writes to fail due to a bogus block size).
    
    Reason: Bugfix
    Author: Eli Collins
    Ref: UNKNOWN

commit 4e7d205daa86d904614252101bb422664ab6d203
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:22:47 2010 -0800

    Revert MAPREDUCE-967. TaskTracker does not need to fully unjar job jars
    
    Author: Todd Lipcon
    Ref: UNKNOWN

commit d5f0c77a6c81e9e56da81976645614280247f7a2
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:22:18 2010 -0800

    HADOOP-5640. Allow ServicePlugins to hook callbacks into key service events
    
    Description: <a href="http://issues.apache.org/jira/browse/HADOOP-5257" title="Export namenode/datanode functionality through a pluggable RPC layer"><del>HADOOP-5257</del></a> added the ability for NameNode and DataNode to start and stop ServicePlugin implementations at NN/DN start/stop. However, this is insufficient integration for some common use cases.
    
    <p>We should add some functionality for Plugins to subscribe to events generated by the service they're plugging into. Some potential hook points are:</p>
    
    <p>NameNode:</p>
    <ul class="alternate" type="square">
    	<li>new datanode registered</li>
    	<li>datanode has died</li>
    	<li>exception caught</li>
    	<li>etc?</li>
    </ul>
    
    <p>DataNode:</p>
    <ul class="alternate" type="square">
    	<li>startup</li>
    	<li>initial registration with NN complete (this is important for HADOOP-4707 to sync up datanode.dnRegistration.name with the NN-side registration)</li>
    	<li>namenode reconnect</li>
    	<li>some block transfer hooks?</li>
    	<li>exception caught</li>
    </ul>
    
    <p>I see two potential routes for implementation:</p>
    
    <p>1) We make an enum for the types of hookpoints and have a general function in the ServicePlugin interface. Something like:</p>
    
    <div class="code panel" style="border-width: 1px;"><div class="codeContent panelContent">
    <pre class="code-java"><span class="code-keyword">enum</span> HookPoint {
      DN_STARTUP,
      DN_RECEIVED_NEW_BLOCK,
      DN_CAUGHT_EXCEPTION,
     ...
    }
    
    void runHook(HookPoint hp, <span class="code-object">Object</span> value);</pre>
    </div></div>
    
    <p>2) We make classes specific to each "pluggable" as was originally suggested in HADOOP-5257. Something like:</p>
    
    <div class="code panel" style="border-width: 1px;"><div class="codeContent panelContent">
    <pre class="code-java">class DataNodePlugin {
      void datanodeStarted() {}
      void receivedNewBlock(block info, etc) {}
      void caughtException(Exception e) {}
      ...
    }</pre>
    </div></div>
    
    <p>I personally prefer option (2) since we can ensure plugin API compatibility at compile-time, and we avoid an ugly switch statement in a runHook() function.</p>
    
    <p>Interested to hear what people's thoughts are here.</p>
    
    HADOOP-5640 puts this in the new test dir. It needs to be in the old one.
    
    Reason: Improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit e9b04609d88ed5d1af442ee950aa5dcd6646e830
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:22:08 2010 -0800

    MAPREDUCE-1017. Compression and output splitting for Sqoop
    
    Description: Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 8c9b473e1af036a3e2cc9036a945a4567277db8a
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:21:14 2010 -0800

    HADOOP-6312. Configuration sends too much data to log4j
    
    Description: Configuration objects send a DEBUG-level log message every time they're instantiated, which includes a full stack trace. This is more appropriate for TRACE-level logging, as it renders other debug logs very hard to read.
    Reason: Logging improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 698fe169f31e54111d30e4420cd1c1c5eaeecdec
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:21:03 2010 -0800

    HDFS-686. NullPointerException is thrown while merging edit log and image
    
    Description: Our secondary name node fails to start with a NullPointerException:<br/>
    ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.lang.NullPointerException<br/>
            at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1232)<br/>
            at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1221)<br/>
            at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:776)<br/>
            at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)<br/>
            at<br/>
    org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:590)<br/>
            at<br/>
    org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:473)<br/>
            at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:350)<br/>
            at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:314)<br/>
            at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:225)<br/>
            at java.lang.Thread.run(Thread.java:619)
    
    <p>This was caused by setting access time on a non-existent file.</p>
    Reason: bugfix
    Author: Hairong Kuang
    Ref: UNKNOWN

commit b2cc8e02f37a1604bb076acefff0ebf016c249d5
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:20:40 2010 -0800

    MAPREDUCE-112. Reduce Input Records and Reduce Output Records counters are not being set when using the new Mapreduce reducer API
    
    Description: After running the examples/wordcount (which uses the new API), the reduce input and output record counters always show 0. This is because these counters are not updated in the new API.
    This adds counters for reduce input and output records to the new API.
    Reason: Bugfix
    Author: Jothi Padmanabhan
    Ref: UNKNOWN

commit 3e62477434542dc3de89fd43fd9b19abaf76f0de
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:20:00 2010 -0800

    MAPREDUCE-768. Configuration information should generate dump in a standard format.
    
    Description: We need to generate the configuration dump in a standard format.
    This adds the 'hadoop jobtracker -dumpConfiguration' command.
    This is modified from the original patch in that it does not dump QueueManager configuration,
    because we have not backported HADOOP-5396.
    
    Reason: New feature
    Author: V.V.Chaitanya Krishna
    Ref: UNKNOWN

commit 4d9333b00772455a1ca7a365fa5b5b2f6872abd7
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:19:46 2010 -0800

    HADOOP-6184. Provide a configuration dump in json format.
    
    Description: Configuration dump in json format.
    Reason: New feature
    Author: V.V.Chaitanya Krishna
    Ref: UNKNOWN

commit 96244c3e7d6735f450b618fdcbdbbf9a81436ba3
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:19:27 2010 -0800

    CLOUDERA-BUILD. Duplicated effort. FULL_VERSION already set in package.mk
    
    Description: Revert "Need to pass in FULL_VERSION"
    Author: Chad Metcalf

commit 604d3a71334b9340a6219e3b88bf563b79f5d083
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:19:11 2010 -0800

    CLOUDERA-BUILD. Copy the sqoop manpage to the expected version number
    
    Author: Chad Metcalf

commit 6d428f70591a92a90dca5256968c62a510659240
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:18:58 2010 -0800

    CLOUDERA-BUILD. Bump jdiff stable to 0.20.1
    
    Author: Chad Metcalf

commit 46ffc9aa9260a96bdf67fbaee9a2acd76cfcf675
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:18:44 2010 -0800

    CLOUDERA-BUILD. Need to pass in FULL_VERSION
    
    Author: Chad Metcalf

commit aa7ae9d9826866f94ecfe5629d087ef68e4b5c54
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:18:29 2010 -0800

    MAPREDUCE-999. Improve Sqoop test speed and refactor tests
    
    Description: Sqoop's tests take a long time to run, but this can be improved (by a factor of 2 or more) by taking advantage of <tt>jobclient.completion.poll.interval</tt>.
    Reason: Testing performance improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 084c390ed5fcb03c456121c8497759b40a74f809
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:18:13 2010 -0800

    MAPREDUCE-1089. Fair Scheduler preemption triggers NPE when tasks are scheduled but not running
    
    Description: We see exceptions like this when preemption runs when a task has been scheduled on a TT but has not yet started running.
    
    <p>2009-10-09 14:30:53,989 INFO org.apache.hadoop.mapred.FairScheduler: Should preempt 2 MAP tasks for job_200910091420_0006: tasksDueToMinShare = 2, tasksDueToFairShare = 0<br/>
    2009-10-09 14:30:54,036 ERROR org.apache.hadoop.mapred.FairScheduler: Exception in fair scheduler UpdateThread<br/>
    java.lang.NullPointerException<br/>
            at org.apache.hadoop.mapred.FairScheduler$2.compare(FairScheduler.java:1015)<br/>
            at org.apache.hadoop.mapred.FairScheduler$2.compare(FairScheduler.java:1013)<br/>
            at java.util.Arrays.mergeSort(Arrays.java:1270)<br/>
            at java.util.Arrays.sort(Arrays.java:1210)<br/>
            at java.util.Collections.sort(Collections.java:159)<br/>
            at org.apache.hadoop.mapred.FairScheduler.preemptTasks(FairScheduler.java:1013)<br/>
            at org.apache.hadoop.mapred.FairScheduler.preemptTasksIfNecessary(FairScheduler.java:911)<br/>
            at org.apache.hadoop.mapred.FairScheduler$UpdateThread.run(FairScheduler.java:286)</p>
    Reason: Bugfix
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 34ca2a5547398f9435a5d3d22603d0f7da420226
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:17:48 2010 -0800

    MAPREDUCE-551. Add preemption to the fair scheduler
    
    Description: Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (<a href="http://issues.apache.org/jira/browse/HADOOP-3746" title="A fair sharing job scheduler"><del>HADOOP-3746</del></a>) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:<br/>
    1) A job isn't getting its <em>guaranteed</em> share of the cluster for at least T1 seconds.<br/>
    2) A job is getting significantly less than its <em>fair</em> share for T2 seconds (e.g. less than half its share).
    
    <p>T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.</p>
    
    <p>When deciding which tasks to kill to make room for the job, we will use the following heuristics:</p>
    <ul class="alternate" type="square">
    	<li>Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).</li>
    	<li>For maps: kill tasks that have run for the least amount of time (limiting wasted time).</li>
    	<li>For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).</li>
    </ul>
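
    The two preemption conditions and the map-task heuristic above can be
    sketched as follows. This is an illustration under stated assumptions,
    not the scheduler's code: PreemptionSketch, JobState, Task and their
    fields are hypothetical, times are plain seconds, and the candidate list
    is assumed to contain only tasks from jobs above their fair share.

    ```java
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class PreemptionSketch {
        /** Hypothetical per-job bookkeeping; times are in seconds. */
        public static class JobState {
            public long secondsBelowMinShare;      // time spent under the guaranteed share
            public long secondsBelowHalfFairShare; // time spent under half the fair share

            public JobState(long secondsBelowMinShare, long secondsBelowHalfFairShare) {
                this.secondsBelowMinShare = secondsBelowMinShare;
                this.secondsBelowHalfFairShare = secondsBelowHalfFairShare;
            }
        }

        /** Hypothetical running task with its elapsed run time in seconds. */
        public static class Task {
            public final String id;
            public final long runTimeSeconds;

            public Task(String id, long runTimeSeconds) {
                this.id = id;
                this.runTimeSeconds = runTimeSeconds;
            }
        }

        /** Condition 1 uses timeout t1, condition 2 uses timeout t2 (t1 < t2). */
        public static boolean shouldPreempt(JobState job, long t1, long t2) {
            return job.secondsBelowMinShare >= t1 || job.secondsBelowHalfFairShare >= t2;
        }

        /** Picks the tasks that have run for the least time, limiting wasted work. */
        public static List<Task> pickVictims(List<Task> candidates, int count) {
            List<Task> sorted = new ArrayList<>(candidates);
            sorted.sort(Comparator.comparingLong(t -> t.runTimeSeconds));
            return new ArrayList<>(sorted.subList(0, Math.min(count, sorted.size())));
        }

        public static void main(String[] args) {
            JobState starved = new JobState(30, 0);
            System.out.println("preempt: " + shouldPreempt(starved, 20, 60));

            List<Task> running = new ArrayList<>();
            running.add(new Task("m_000001", 100));
            running.add(new Task("m_000002", 5));
            running.add(new Task("m_000003", 50));
            for (Task victim : pickVictims(running, 2)) {
                System.out.println("kill " + victim.id);
            }
        }
    }
    ```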
    
    This fixes an error in the previous backport where the
    EagerTaskInitializationListener wasn't properly passed the
    TaskTrackerManager before starting.
    
    Reason: New feature
    Author: Matei Zaharia
    Ref: UNKNOWN

commit a3e29eff0b9337a1007ec1b90ccb832dca5c1d20
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:17:33 2010 -0800

    CLOUDERA-BUILD. Fix hadoop wrapper to properly pass through multiword quoted arguments
    
    Author: Todd Lipcon

commit 975647b6c3a6644cabbd48bf14e074a0efda2cb9
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:17:15 2010 -0800

    CLOUDERA-BUILD. Sqoop documentation is now part of the generated tarball. Updated the install script to reflect that change.
    
    Author: Matt Massie

commit 19c038a6af07e3999e83a2178d2328535e00dedb
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:16:55 2010 -0800

    CLOUDERA-BUILD. Generate the sqoop documentation and ensure that it's in the release tarball
    
    Author: Matt Massie

commit 6957626991875302f33bb73630f4f376412f9711
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:16:43 2010 -0800

    CLOUDERA-BUILD. More changes to get debs building correctly
    
    Author: Chad Metcalf

commit 67d1c732cea0eebf59de512301ae8f2a1cb2f349
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:16:30 2010 -0800

    CLOUDERA-BUILD. Reformatted Sqoop manpage asciidoc for CDH build process
    
    Author: Aaron Kimball

commit af158d6aa7ffe72d931bc4763ace7d4a299d077b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:16:14 2010 -0800

    CLOUDERA-BUILD. Only rerun libtoolize if version 2.2 is installed
    
    Author: Todd Lipcon

commit 586992381042e1b4ec8c9ece069561ad2e4dfcc0
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:15:42 2010 -0800

    HADOOP-6279. Add JVM memory usage to JvmMetrics
    
    Description: The JvmMetrics currently publish memory usage from the MemoryMXBean. This is useful, but doesn't include the total heap size (e.g. as displayed in the JT Web UI).
    
    <p>It would be nice to expose Runtime.getRuntime().maxMemory() as part of JvmMetrics.</p>
    
    <p>It seems that Runtime.getRuntime().totalMemory() (used by the JT for "memory used") is the same as the 'memHeapCommittedM' which already exists.</p>
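
    A small sketch of the three values discussed above, using only standard
    JDK calls. The class name HeapMetricsSketch and the megabyte rounding are
    assumptions for illustration; how the value would be named inside
    JvmMetrics is not shown here.

    ```java
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    public class HeapMetricsSketch {
        private static final long MB = 1024L * 1024L;

        /** Upper bound the JVM will try to use for the heap. */
        public static long maxHeapMb() {
            return Runtime.getRuntime().maxMemory() / MB;
        }

        /** Heap currently reserved by the JVM, as reported by Runtime. */
        public static long totalHeapMb() {
            return Runtime.getRuntime().totalMemory() / MB;
        }

        /** Committed heap as reported by the MemoryMXBean. */
        public static long committedHeapMb() {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            return heap.getCommitted() / MB;
        }

        public static void main(String[] args) {
            System.out.println("maxMemory (MB):       " + maxHeapMb());
            System.out.println("totalMemory (MB):     " + totalHeapMb());
            System.out.println("heap committed (MB):  " + committedHeapMb());
        }
    }
    ```

    Running it on a given JVM makes it easy to compare totalMemory() with
    the committed heap figure side by side.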
    Reason: Metrics improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 7c168a8a2613d93e19508a91e7c4db3b3cfb503b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:15:26 2010 -0800

    HADOOP-6269. Missing synchronization for defaultResources in Configuration.addResource
    
    Description: Configuration.defaultResources is a simple ArrayList. In two places in Configuration it is accessed without appropriate synchronization, which we've seen to occasionally result in ConcurrentModificationExceptions.
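
    The failure mode can be reproduced deterministically in a single thread,
    as a stand-in for the real race between two threads. This is a sketch
    only: DefaultResourcesSketch and the resource name "extra-default.xml"
    are hypothetical, and CopyOnWriteArrayList is shown as one possible
    remedy, not necessarily the change made by the patch.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.ConcurrentModificationException;
    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    public class DefaultResourcesSketch {
        /**
         * Iterates over the list and appends to it mid-iteration, standing in for
         * one thread reading the resource list while another adds a resource.
         * Returns true if the iteration completed without an exception.
         */
        public static boolean iterateWhileAdding(List<String> resources) {
            try {
                for (String name : resources) {
                    if (name.equals("core-default.xml")) {
                        resources.add("extra-default.xml"); // hypothetical resource name
                    }
                }
                return true;
            } catch (ConcurrentModificationException e) {
                return false;
            }
        }

        public static void main(String[] args) {
            List<String> plain =
                new ArrayList<>(Arrays.asList("core-default.xml", "core-site.xml"));
            List<String> copyOnWrite =
                new CopyOnWriteArrayList<>(Arrays.asList("core-default.xml", "core-site.xml"));

            System.out.println("ArrayList survived: " + iterateWhileAdding(plain));
            System.out.println("CopyOnWriteArrayList survived: " + iterateWhileAdding(copyOnWrite));
        }
    }
    ```

    A copy-on-write list iterates over a snapshot, so readers never observe
    a concurrent add; explicit synchronization around every access is the
    other obvious option.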
    Reason: bugfix (race condition)
    Author: Sreekanth Ramakrishnan
    Ref: UNKNOWN

commit 8bf845170decdcb12254bc1dc98ccbf0fda7d233
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:15:01 2010 -0800

    CLOUDERA-BUILD. Recreate c++ configure files during build if we have the right build dependencies
    
    Author: Todd Lipcon

commit e7e9812fa7a6a256652f2f6bbb269334f883c53b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:14:43 2010 -0800

    CLOUDERA-BUILD. Package sqoop docs w/o requiring asciidoc
    
    Author: Chad Metcalf
    Ref: UNKNOWN

commit 7171eabfad501d635b1da9e0287f50e025b4a83f
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:13:39 2010 -0800

    CLOUDERA-BUILD. Revert "Package sqoop docs."
    
    Description: This reverts packaging of sqoop documentation in preparation
    for including MAPREDUCE-906 properly after it has been committed
    to Apache.
    Author: Chad Metcalf
    Ref: UNKNOWN

commit 4bd437c9d70f2c0d68047e0376a7af21cc4a70e0
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:13:17 2010 -0800

    HADOOP-5891. If dfs.http.address is default, SecondaryNameNode can't find NameNode
    
    Description: As detailed in this blog post:<br/>
    <span class="nobr"><a href="http://www.cloudera.com/blog/2009/02/10/multi-host-secondarynamenode-configuration/">http://www.cloudera.com/blog/2009/02/10/multi-host-secondarynamenode-configuration/</a></span><br/>
    if dfs.http.address is not configured, and the 2NN is a different machine from the NN, the 2NN fails to connect.
    
    <p>In SecondaryNameNode.getInfoServer, the 2NN should notice a "0.0.0.0" dfs.http.address and, in that case, pull the hostname out of fs.default.name. This would fix the default configuration to work properly for most users.</p>
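
    The proposed fallback can be sketched with plain string handling and
    java.net.URI. This is not the SecondaryNameNode code: InfoServerSketch
    and resolveInfoServer are hypothetical names, the address is assumed to
    be in host:port form, and the host names and ports in main() are
    example inputs.

    ```java
    import java.net.URI;

    public class InfoServerSketch {
        /**
         * Returns the HTTP address to contact. If the configured address uses
         * the wildcard host 0.0.0.0, the host is taken from the default
         * filesystem URI instead, while the HTTP port is kept.
         */
        public static String resolveInfoServer(String httpAddress, String defaultFsName) {
            int colon = httpAddress.lastIndexOf(':');
            String host = httpAddress.substring(0, colon);
            String port = httpAddress.substring(colon + 1);
            if ("0.0.0.0".equals(host)) {
                host = URI.create(defaultFsName).getHost();
            }
            return host + ":" + port;
        }

        public static void main(String[] args) {
            // Wildcard address: fall back to the NameNode host from the filesystem URI.
            System.out.println(resolveInfoServer("0.0.0.0:50070", "hdfs://nn.example.com:8020"));
            // Explicit address: left untouched.
            System.out.println(resolveInfoServer("nn-http.example.com:50070", "hdfs://nn.example.com:8020"));
        }
    }
    ```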
    Reason: Configuration improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 74e10e4a137b2aa60ab39186115350b5e82464fc
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:11:50 2010 -0800

    HDFS-127. DFSClient block read failures cause open DFSInputStream to become unusable
    
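    Description: We are using some Lucene indexes directly from HDFS and for quite a long time we were using Hadoop version 0.15.3.
    
    <p>When we tried to upgrade to Hadoop 0.19, index searches started to fail with exceptions like:<br/>
    2008-11-13 16:50:20,314 WARN <span class="error">&#91;Listener-4&#93;</span> [] DFSClient : DFS Read: java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis<br/>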
    2008-11-13 16:50:20,314 WARN <span class="error">&#91;Listener-4&#93;</span> [] DFSClient : DFS Read: java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis<br/>
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)<br/>
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)<br/>
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)<br/>
    at java.io.DataInputStream.read(DataInputStream.java:132)<br/>
    at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174)<br/>
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)<br/>
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)<br/>
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)<br/>
    at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)<br/>
    at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)<br/>
    at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)<br/>
    at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)<br/>
    at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)<br/>
    at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54) <br/>
    ...</p>
    
    <p>The investigation showed that the root cause was that we had exceeded the number of xcievers in the data nodes; that was fixed by changing the configuration setting to 2k.<br/>
    However, one thing that bothered me was that even after the datanodes recovered from the overload and most of the client servers had been shut down, we still observed errors in the logs of the running servers.<br/>
    Further investigation showed that the fix for <a href="http://issues.apache.org/jira/browse/HADOOP-1911" title="infinite loop in dfs -cat command."><del>HADOOP-1911</del></a> introduced another problem: the DFSInputStream instance might become unusable once the number of failures over the lifetime of the instance exceeds the configured threshold.</p>
    
    <p>The fix for this specific issue seems to be trivial: just reset the failure counter before reading the next block (patch will be attached shortly).</p>
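
    The difference between a lifetime failure counter and one that is reset
    per block can be shown with a toy model. This is not DFSClient code:
    FailureCounterSketch, BlockReader and the simulated failure counts are
    hypothetical, and a real reader would retry against datanodes rather
    than count a number passed in by the caller.

    ```java
    public class FailureCounterSketch {
        /** Toy reader that gives up once its failure counter exceeds a threshold. */
        public static class BlockReader {
            private final int maxFailures;
            private final boolean resetPerBlock;
            private int failures = 0;

            public BlockReader(int maxFailures, boolean resetPerBlock) {
                this.maxFailures = maxFailures;
                this.resetPerBlock = resetPerBlock;
            }

            /**
             * Simulates reading one block that fails transiently the given number
             * of times before it would succeed. Returns true if the block is read.
             */
            public boolean readBlock(int transientFailures) {
                if (resetPerBlock) {
                    failures = 0; // the proposed fix: start each block with a clean slate
                }
                for (int i = 0; i < transientFailures; i++) {
                    failures++;
                    if (failures > maxFailures) {
                        return false;
                    }
                }
                return failures <= maxFailures;
            }
        }

        public static void main(String[] args) {
            BlockReader lifetime = new BlockReader(3, false);
            BlockReader perBlock = new BlockReader(3, true);
            int[] failuresPerBlock = {2, 2, 0}; // the last block has no failures at all
            for (int f : failuresPerBlock) {
                System.out.println("lifetime counter: " + lifetime.readBlock(f)
                    + ", per-block counter: " + perBlock.readBlock(f));
            }
        }
    }
    ```

    With the lifetime counter, the third block fails even though nothing
    went wrong while reading it, which matches the "unusable stream"
    symptom described above.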
    
    <p>This also seems to be related to HADOOP-3185, but I'm not sure I really understand the necessity of keeping track of failed block accesses in the DFS client.</p>
    
        HADOOP-4681: Also referenced
    
        This as-yet-uncommitted patch is recommended by HBase people.
        Applied patch "4681.patch" attached to the JIRA on 2008-11-18.
    
    Reason: Bugfix
    Author: Igor Bolotin
    Ref: UNKNOWN

commit ca547d89042fff3a38c0c93b6e0ece78e74ae064
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:11:10 2010 -0800

    HADOOP-4655. FileSystem.CACHE should be ref-counted
    
    Description: FileSystem.CACHE is not ref-counted, and could lead to resource leakage.
    Adds new method FileSystem.newInstance() that always returns a newly allocated
    FileSystem object.
    Reason: Bugfix
    Author: dhruba borthakur
    Ref: UNKNOWN

commit 15660507606b32c3c6c2878f8ed69fe106119bc9
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:10:51 2010 -0800

    MAPREDUCE-967. TaskTracker does not need to fully unjar job jars
    
    Description: In practice we have seen some users submitting job jars that consist of 10,000+ classes. Unpacking these jars into mapred.local.dir and then cleaning up after them has a significant cost (both in wall clock and in unnecessary heavy disk utilization). This cost can be easily avoided.
    Reason: Performance improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 648e30e074a16de837fb4c604a198bc780c2e6c5
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:10:34 2010 -0800

    MAPREDUCE-968. NPE in distcp encountered when placing _logs directory on S3FileSystem
    
    Description: If distcp is pointed to an empty S3 bucket as the destination for an s3:// filesystem transfer, it will fail with the following exception:
    
    <p>Copy failed: java.lang.NullPointerException<br/>
    at org.apache.hadoop.fs.s3.S3FileSystem.makeAbsolute(S3FileSystem.java:121)<br/>
    at org.apache.hadoop.fs.s3.S3FileSystem.getFileStatus(S3FileSystem.java:332)<br/>
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:633)<br/>
    at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1005)<br/>
    at org.apache.hadoop.tools.DistCp.copy(DistCp.java:650)<br/>
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:857)<br/>
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)<br/>
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)<br/>
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:884) </p>
    Reason: Bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit a61718b87c36dbeddcc6f9917438f81ebdda0214
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:10:22 2010 -0800

    HADOOP-6133. ReflectionUtils performance regression
    
    Description: <a href="http://issues.apache.org/jira/browse/HADOOP-4187" title="Create a MapReduce-specific ReflectionUtils that handles JobConf and JobConfigurable"><del>HADOOP-4187</del></a> introduced extra calls to Class.forName in ReflectionUtils.setConf. This caused a fairly large performance regression. Attached is a microbenchmark that shows the following timings for 100M constructions of new instances:
    
    <p>Explicit construction (new Test): ~1.6sec<br/>
    Using Test.class.newInstance: ~2.6sec<br/>
    ReflectionUtils on 0.18.3: ~8.0sec<br/>
    ReflectionUtils on 0.20.0: ~200sec</p>
    
    <p>This illustrates the ~80x slowdown caused by <a href="http://issues.apache.org/jira/browse/HADOOP-4187" title="Create a MapReduce-specific ReflectionUtils that handles JobConf and JobConfigurable"><del>HADOOP-4187</del></a>.</p>
    Reason: Performance improvement
    Author: Todd Lipcon
    Ref: UNKNOWN
    
    commit 5e299f831420ed52569eefc5ba815359a0ebc64e
    Author: Chad Metcalf <chad@cloudera.com>
    Date:   Tue Sep 15 22:21:42 2009 -0700
    
        HADOOP-6133: ReflectionUtils performance regression

commit b6f790774d34ed34bb7c649142dc770c25121ac3
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:10:13 2010 -0800

    HADOOP-5981. HADOOP-2838 doesnt work as expected
    
    Description: The substitution feature, i.e. X=$X:/tmp, doesn't work as expected.
    
    <p>This issue completes the feature mentioned in <a href="http://issues.apache.org/jira/browse/HADOOP-2838" title="Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni"><del>HADOOP-2838</del></a>. <a href="http://issues.apache.org/jira/browse/HADOOP-2838" title="Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni"><del>HADOOP-2838</del></a> provided a way to set env variables in the child process. This issue provides a way to inherit the tasktracker's env variables and append to or reset them. So now <br/>
    X=$X:y will inherit X (if present) and append y to it.</p>
    Reason: Bugfix
    Author: Amar Kamat
    Ref: UNKNOWN
    
    commit eb635e4de3a8b2b5bd9f34225770f24be42dcd83
    Author: Chad Metcalf <chad@cloudera.com>
    Date:   Tue Sep 15 22:29:50 2009 -0700
    
        HADOOP-5981: HADOOP-2838 doesnt work as expected

commit 5d4e93d8e0df3c445f56c5eb51965eef92bebd78
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:09:46 2010 -0800

    HADOOP-2838. Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni
    
    Description: Currently there is no way to configure Hadoop to use external JNI directories. I propose we add a new variable like HADOOP_CLASS_PATH that is added to the JAVA_LIBRARY_PATH before the process is run.
    
    <p>Users can now set environment variables using mapred.child.env. They can do the following:<br/>
    X=Y : set X to Y<br/>
    X=$X:Y : Append Y to X (which should be taken from the tasktracker)</p>
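
    The two forms above can be illustrated with a small parser. This is a
    sketch, not the TaskTracker implementation: ChildEnvSketch and
    applyChildEnv are hypothetical names, the comma-separated list format is
    an assumption, and the simple textual replacement of "$X" does not guard
    against variables whose names share a prefix.

    ```java
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class ChildEnvSketch {
        /**
         * Builds the child's variables from a spec such as "X=Y,Z=$Z:W".
         * A reference to $NAME on the right-hand side is replaced with the
         * parent's value of NAME, or with the empty string if it is unset.
         */
        public static Map<String, String> applyChildEnv(String spec, Map<String, String> parentEnv) {
            Map<String, String> child = new LinkedHashMap<>();
            for (String assignment : spec.split(",")) {
                String[] kv = assignment.split("=", 2);
                String name = kv[0].trim();
                String inherited = parentEnv.getOrDefault(name, "");
                child.put(name, kv[1].replace("$" + name, inherited));
            }
            return child;
        }

        public static void main(String[] args) {
            Map<String, String> parent = new LinkedHashMap<>();
            parent.put("LD_LIBRARY_PATH", "/usr/lib");

            Map<String, String> child =
                applyChildEnv("LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/native,FOO=bar", parent);
            for (Map.Entry<String, String> e : child.entrySet()) {
                System.out.println(e.getKey() + "=" + e.getValue());
            }
        }
    }
    ```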
    Reason: Improves job launch flexibility
    Author: Amar Kamat
    Ref: UNKNOWN
    
    commit 9b3fc32fa793b338dc700a7f6c437402f80d6b7f
    Author: Chad Metcalf <chad@cloudera.com>
    Date:   Tue Sep 15 22:09:57 2009 -0700
    
        HADOOP-2838: Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni

commit 877429c3f94a1e937fbe29b4cbe8da573831d802
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:09:31 2010 -0800

    MAPREDUCE-814. Move completed Job history files to HDFS
    
    Description: Currently completed job history files remain on the jobtracker node. Having the files available on HDFS will enable clients to access these files more easily.
    Reason: New feature
    Author: Sharad Agarwal
    Ref: UNKNOWN
    
    commit c0575c0908fee4ec01f5bc0abbd7f4b2254dd38e
    Author: Chad Metcalf <chad@cloudera.com>
    Date:   Tue Sep 15 18:15:17 2009 -0700
    
        MAPREDUCE-814: Move completed Job history files to HDFS

commit a8bf06eac5312ede0982118801e4495285a442fe
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:08:12 2010 -0800

    MAPREDUCE-693. Conf files not moved to "done" subdirectory after JT restart
    
    Description: After <a href="http://issues.apache.org/jira/browse/MAPREDUCE-516" title="Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs"><del>MAPREDUCE-516</del></a>, when a job is submitted and the JT is restarted (before job files have been written) and the job is killed after recovery, the conf files fail to be moved to the "done" subdirectory.<br/>
    The exact scenario to reproduce this issue is:
    <ul>
    	<li>Submit a job</li>
    	<li>Restart JT before anything is written to the job files</li>
    	<li>Kill the job</li>
    	<li>The old conf files remain in the history folder and fail to be moved to "done" subdirectory</li>
    </ul>
    
    Reason: bugfix
    Author: Amar Kamat
    Ref: UNKNOWN

commit cc22e9f92db6470d244fb17f57601b93bab6db80
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:07:55 2010 -0800

    MAPREDUCE-683. TestJobTrackerRestart fails with Map task completion events ordering mismatch
    
    Description: <tt>TestJobTrackerRestart</tt> fails consistently with Map task completion events ordering mismatch error.
    Reason: bugfix
    Author: Amar Kamat
    Ref: UNKNOWN

commit 57a67dff5d15e3833c7968254df076e440de2765
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:07:39 2010 -0800

    MAPREDUCE-416. Move the completed jobs' history files to a DONE subdirectory inside the configured history directory
    
    Description: Whenever a job completes, the history file can be moved to a directory called DONE. That would make the management of job history files easier (for example, administrators can move the history files from that directory to some other place, delete them, archive them, etc.).
    Reason: System management improvement
    Author: Amar Kamat
    Ref: UNKNOWN

commit 99dfdb9a98e1ebd643f47877be3541962c32dcd0
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:07:18 2010 -0800

    HADOOP-5733. Add map/reduce slot capacity and lost map/reduce slot capacity to JobTracker metrics
    
    Description: It would be nice to have the actual map/reduce slot capacity and the lost map/reduce slot capacity (# of blacklisted nodes * map-slot-per-node or reduce-slot-per-node). This information can be used to calculate a JT view of slot utilization.
    Reason: Metrics improvement
    Author: Sreekanth Ramakrishnan
    Ref: UNKNOWN

commit 955fe9433b13f21079f92e4035393b683486ad07
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:05:59 2010 -0800

    HADOOP-5738. Split waiting tasks field in JobTracker metrics to individual tasks
    
    Description: Currently, job tracker metrics report waiting tasks as a single field. It would be better if we could split waiting tasks into maps and reduces.
    Reason: User experience improvement
    Author: Sreekanth Ramakrishnan
    Ref: UNKNOWN

commit 3b8f77cd452c1098c6af5907b787bf9167df806b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:05:48 2010 -0800

    HADOOP-5442. The job history display needs to be paged
    
    Description: Currently the list of job history will try to render the entire list of jobs that have run. That doesn't scale up as more and more jobs run on a job tracker.
    Reason: Scalability improvement
    Author: Amar Kamat
    Ref: UNKNOWN

commit dfac0482267aaf0fabac97c163e0015306ec5b16
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:05:16 2010 -0800

    HADOOP-4842. Streaming combiner should allow command, not just JavaClass
    
    Description: Streaming jobs are way slower than Java jobs for many reasons, but certainly stopping the shell-only programmer from using the combiner feature won't help. Right now, the streaming usage says:
    
    <blockquote>
    <p>  -mapper   &lt;cmd|JavaClassName&gt;      The streaming command to run<br/>
      -combiner &lt;JavaClassName&gt; Combiner has to be a Java class<br/>
      -reducer  &lt;cmd|JavaClassName&gt;      The streaming command to run</p></blockquote>
    Reason: Usability improvement
    Author: Amareshwari Sriramadasu
    Ref: UNKNOWN

commit 33e4f0a87effa466914e292488c47977245edc96
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:04:06 2010 -0800

    MAPREDUCE-987. Exposing MiniDFS and MiniMR clusters as a single process command-line
    
    Description: It's hard to test non-Java programs that rely on significant mapreduce functionality.  The patch I'm proposing shortly will let you just type "bin/hadoop jar hadoop-hdfs-hdfswithmr-test.jar minicluster" to start a cluster (internally, it's using Mini{MR,HDFS}Cluster) with a specified number of daemons, etc.  A test that checks how some external process interacts with Hadoop might start minicluster as a subprocess, run through its thing, and then simply kill the java subprocess.
    
    <p>I've been using just such a system for a couple of weeks, and I like it.  It's significantly easier than developing a lot of scripts to start a pseudo-distributed cluster, and then clean up after it.  I figure others might find it useful as well.</p>
    
    <p>I'm at a bit of a loss as to where to put it in 0.21.  hdfs-with-mr tests have all the required libraries, so I've put it there.  I could conceivably split this into "minimr" and "minihdfs", but it's specifically the fact that they're configured to talk to each other that I like about having them together.  And one JVM is better than two for my test programs.</p>
    Reason: Testing feature
    Author: Philip Zeyliger
    Ref: UNKNOWN

commit 39ff7e5ee285df97c765a73271066df718be0e30
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:03:23 2010 -0800

    HADOOP-6267. build-contrib.xml unnecessarily enforces that contrib projects be located in contrib/ dir
    
    Description: build-contrib.xml currently sets hadoop.root to ${basedir}/../../../. This path is relative to the contrib project which is assumed to be inside src/contrib/. We occasionally work on contrib projects in other repositories until they're ready to contribute. We can use the &lt;dirname&gt; ant task to do this more correctly.
    Reason: Build system improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 139bea6660193cc73852832e03fe570437343e96
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 15:02:55 2010 -0800

    HDFS-528. Add ability for safemode to wait for a minimum number of live datanodes
    
    Description: When starting up a fresh cluster programmatically, users often want to wait until DFS is "writable" before continuing in a script. "dfsadmin -safemode wait" doesn't quite work for this on a completely fresh cluster, since when there are 0 blocks on the system, 100% of them are accounted for before any DNs have reported.
    
    <p>This JIRA is to add a command which waits until a certain number of DNs have reported as alive to the NN.</p>
    Reason: New feature
    Author: Todd Lipcon
    Ref: UNKNOWN

commit b301746d45bde2759535549f87c6485f4ee577b2
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 15:02:38 2010 -0800

    HADOOP-4936. Improvements to TestSafeMode
    
    Description: TestSafeMode
    <ul class="alternate" type="square">
    	<li>needs a detailed description of the test case</li>
    	<li>should not call the name-node directly, but rather call <tt>DistributedFileSystem</tt> methods.</li>
    </ul>
    
    Reason: Test coverage improvement
    Author: Konstantin Shvachko
    Ref: UNKNOWN

commit f04a321596a513e71354f2a6829b44e474077507
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 15:02:22 2010 -0800

    HADOOP-5650. Namenode log that indicates why it is not leaving safemode may be confusing
    
    Description: A namenode with a large number of datablocks is set up with dfs.safemode.threshold.pct set to 1.0. With a small number of unreported blocks, namenode prints the following as the reason for not leaving safe mode:<br/>
    <tt>The ratio of reported blocks 1.0000 has not reached the threshold 1.0000</tt>
    
    <p>With a large number of blocks, precision used for printing the log may not indicate the difference between the actual ratio of safe blocks to total blocks and the configured threshold. Printing number of blocks instead of ratio will improve the clarity.</p>
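
    The rounding problem is easy to reproduce. In this sketch the class
    SafeModeMessageSketch and the wording of countMessage() are hypothetical;
    only the ratio message text is taken from the log line quoted above, and
    the block counts are made-up example inputs.

    ```java
    import java.util.Locale;

    public class SafeModeMessageSketch {
        /** Ratio-based message: four decimals hide small numbers of missing blocks. */
        public static String ratioMessage(long safeBlocks, long totalBlocks, double threshold) {
            double ratio = totalBlocks == 0 ? 1.0 : (double) safeBlocks / totalBlocks;
            return String.format(Locale.ROOT,
                "The ratio of reported blocks %.4f has not reached the threshold %.4f",
                ratio, threshold);
        }

        /** Count-based message (hypothetical wording): states how many blocks are missing. */
        public static String countMessage(long safeBlocks, long totalBlocks, double threshold) {
            long needed = (long) Math.ceil(threshold * totalBlocks);
            return String.format(Locale.ROOT,
                "Reported blocks %d needs additional %d blocks to reach the threshold %.4f of total blocks %d",
                safeBlocks, needed - safeBlocks, threshold, totalBlocks);
        }

        public static void main(String[] args) {
            long total = 10_000_000L;
            long safe = total - 10; // ten blocks still unreported
            System.out.println(ratioMessage(safe, total, 1.0));
            System.out.println(countMessage(safe, total, 1.0));
        }
    }
    ```

    With ten of ten million blocks unreported, the ratio 0.999999 rounds to
    1.0000 and the first message appears to contradict itself, while the
    count-based variant shows the ten missing blocks directly.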
    Reason: User experience improvement
    Author: Suresh Srinivas
    Ref: UNKNOWN
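
The precision problem in HADOOP-5650 is easy to reproduce. In the sketch below, the ratio message uses the wording quoted in the description; the count-based message is a hypothetical wording chosen for illustration, not the text the patch actually logs.

```java
import java.util.Locale;

// Illustrative only: the class and method names are assumptions.
final class SafeModeMessage {
    // Four decimal places round 0.9999999 up to 1.0000, hiding the shortfall.
    static String ratioMessage(long safe, long total, double threshold) {
        return String.format(Locale.ROOT,
                "The ratio of reported blocks %.4f has not reached the threshold %.4f",
                (double) safe / total, threshold);
    }

    // Hypothetical alternative wording: exact block counts cannot round away.
    static String countMessage(long safe, long total, double threshold) {
        long needed = (long) Math.ceil(total * threshold);
        return String.format(Locale.ROOT,
                "Reported blocks %d has not reached the required %d of %d total blocks",
                safe, needed, total);
    }
}
```

With 9,999,999 of 10,000,000 blocks reported, the ratio message shows `1.0000` on both sides, while the count message shows that one block is still missing.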

commit 13e35e654c51a5b1cfe809ef1e2c4d2ca46ed612
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 15:01:52 2010 -0800

    HADOOP-4675. Current Ganglia metrics implementation is incompatible with Ganglia 3.1
    
    Description: Ganglia changed its wire protocol in the 3.1.x series; the current implementation only works for 3.0.x.
    
    Patched using
    https://issues.apache.org/jira/secure/attachment/12407207/HADOOP-4675-v7.patch
    
    Reason: Compatibility improvement
    Author: Brian Bockelman
    Ref: UNKNOWN
    
    commit dcf76896b1c8a7b891995b1546eef6ea3018e7ca
    Author: Philip Zeyliger <philip@cloudera.com>
    Date:   Tue Jul 28 15:28:18 2009 -0700
    
        HADOOP-4675. Current Ganglia metrics implementation is incompatible with Ganglia 3.1
    
        Patched using
        https://issues.apache.org/jira/secure/attachment/12407207/HADOOP-4675-v7.patch

commit 4305750d026b895b3afbd0d4a4ee4b3b42596016
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 15:01:29 2010 -0800

    HADOOP-6269. Missing synchronization for defaultResources in Configuration.addResource
    
    Description: Configuration.defaultResources is a simple ArrayList. In two places in Configuration it is accessed without appropriate synchronization, which we've seen to occasionally result in ConcurrentModificationExceptions.
    Reason: Bugfix (race condition)
    Author: Sreekanth Ramakrishnan
    Ref: UNKNOWN
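
One common way to guard a shared list like the one described in HADOOP-6269 is sketched below. It is an assumption about the general technique (synchronized mutation plus snapshot reads), not a copy of the actual patch, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a class-level list of default resource names.
final class DefaultResources {
    private static final List<String> RESOURCES = new ArrayList<>();

    // All mutation goes through the class lock.
    static synchronized void add(String name) {
        if (!RESOURCES.contains(name)) {
            RESOURCES.add(name);
        }
    }

    // Readers iterate over a private copy, so a concurrent add() cannot
    // trigger ConcurrentModificationException during iteration.
    static synchronized List<String> snapshot() {
        return new ArrayList<>(RESOURCES);
    }
}
```

Iterating over `snapshot()` while calling `add()` is safe because the loop never touches the shared list itself.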

commit 90f9c40df18fe464383de52e3d3952638a393e34
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 15:01:08 2010 -0800

    CLOUDERA-BUILD. Make some JT methods and classes public for use from within contrib plugins
    
    Author: Henry Robinson

commit f8e0599a434e1ce94158384f575e912e9f988229
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:59:40 2010 -0800

    MAPREDUCE-461. Enable ServicePlugins for the JobTracker
    
    Description: Allow ServicePlugins (see <a href="http://issues.apache.org/jira/browse/HADOOP-5257" title="Export namenode/datanode functionality through a pluggable RPC layer"><del>HADOOP-5257</del></a>) for the JobTracker.
    (Relies on HADOOP-5640)
    Reason: API Improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit c58318cfa6e26b7dbacd4093d646fc8b66f9eda6
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:58:23 2010 -0800

    HADOOP-5640. Allow ServicePlugins to hook callbacks into key service events
    
    Description: <a href="http://issues.apache.org/jira/browse/HADOOP-5257" title="Export namenode/datanode functionality through a pluggable RPC layer"><del>HADOOP-5257</del></a> added the ability for NameNode and DataNode to start and stop ServicePlugin implementations at NN/DN start/stop. However, this is insufficient integration for some common use cases.
    
    <p>We should add some functionality for Plugins to subscribe to events generated by the service they're plugging into. Some potential hook points are:</p>
    
    <p>NameNode:</p>
    <ul class="alternate" type="square">
    	<li>new datanode registered</li>
    	<li>datanode has died</li>
    	<li>exception caught</li>
    	<li>etc?</li>
    </ul>
    
    <p>DataNode:</p>
    <ul class="alternate" type="square">
    	<li>startup</li>
    	<li>initial registration with NN complete (this is important for HADOOP-4707 to sync up datanode.dnRegistration.name with the NN-side registration)</li>
    	<li>namenode reconnect</li>
    	<li>some block transfer hooks?</li>
    	<li>exception caught</li>
    </ul>
    
    <p>I see two potential routes for implementation:</p>
    
    <p>1) We make an enum for the types of hookpoints and have a general function in the ServicePlugin interface. Something like:</p>
    
    <div class="code panel" style="border-width: 1px;"><div class="codeContent panelContent">
    <pre class="code-java"><span class="code-keyword">enum</span> HookPoint {
      DN_STARTUP,
      DN_RECEIVED_NEW_BLOCK,
      DN_CAUGHT_EXCEPTION,
     ...
    }
    
    void runHook(HookPoint hp, <span class="code-object">Object</span> value);</pre>
    </div></div>
    
    <p>2) We make classes specific to each "pluggable" as was originally suggested in HADOOP-5257. Something like:</p>
    
    <div class="code panel" style="border-width: 1px;"><div class="codeContent panelContent">
    <pre class="code-java">class DataNodePlugin {
      void datanodeStarted() {}
      void receivedNewBlock(block info, etc) {}
      void caughtException(Exception e) {}
      ...
    }</pre>
    </div></div>
    
    <p>I personally prefer option (2) since we can ensure plugin API compatibility at compile-time, and we avoid an ugly switch statement in a runHook() function.</p>
    
    <p>Interested to hear what people's thoughts are here.</p>
    Reason: API Improvement
    Author: Todd Lipcon
    Ref: UNKNOWN
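
Option (2) from HADOOP-5640 can be sketched in compilable form. The class and method names below are hypothetical and only loosely follow the description; the point is the compile-time argument made there: each hook is an ordinary method with an empty default body, so a plugin overrides only what it needs, and `@Override` makes the compiler reject a hook name that does not exist in the base class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class: every hook has a no-op default implementation.
abstract class DataNodeHooks {
    void datanodeStarted() {}
    void caughtException(Exception e) {}
}

// A plugin that records the events it sees.
final class RecordingHooks extends DataNodeHooks {
    final List<String> events = new ArrayList<>();

    @Override
    void datanodeStarted() {
        events.add("started");
    }

    @Override
    void caughtException(Exception e) {
        events.add("exception: " + e.getMessage());
    }
}
```

The service calls the hooks through the base type, so no switch over an enum of hook points is needed.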

commit 137999a0b48a81bed10a5f30868dbfe6d176956b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:58:09 2010 -0800

    HADOOP-5257. Export namenode/datanode functionality through a pluggable RPC layer
    
    Description: Adding support for pluggable components would allow exporting DFS functionality using arbitrary protocols, like Thrift or Protocol Buffers. I'm opening this issue on Dhruba's suggestion in HADOOP-4707.
    
    <p>Plug-in implementations would extend this base class:</p>
    
    <div class="code panel" style="border-width: 1px;"><div class="codeContent panelContent">
    <pre class="code-java"><span class="code-keyword">abstract</span> class Plugin {
    
        <span class="code-keyword">public</span> <span class="code-keyword">abstract</span> void datanodeStarted(DataNode datanode);
    
        <span class="code-keyword">public</span> <span class="code-keyword">abstract</span> void datanodeStopping();
    
        <span class="code-keyword">public</span> <span class="code-keyword">abstract</span> void namenodeStarted(NameNode namenode);
    
        <span class="code-keyword">public</span> <span class="code-keyword">abstract</span> void namenodeStopping();
    }</pre>
    </div></div>
    
    <p>Name node instances would then start the plug-ins according to a configuration object, and would also shut them down when the node goes down:</p>
    
    <div class="code panel" style="border-width: 1px;"><div class="codeContent panelContent">
    <pre class="code-java"><span class="code-keyword">public</span> class NameNode {
    
        <span class="code-comment">// [..]
    </span>
        <span class="code-keyword">private</span> void initialize(Configuration conf) {
            <span class="code-comment">// [...]
    </span>        <span class="code-keyword">for</span> (Plugin p: PluginManager.loadPlugins(conf))
              p.namenodeStarted(<span class="code-keyword">this</span>);
        }
    
        <span class="code-comment">// [..]
    </span>
        <span class="code-keyword">public</span> void stop() {
            <span class="code-keyword">if</span> (stopRequested)
                <span class="code-keyword">return</span>;
            stopRequested = <span class="code-keyword">true</span>;
            <span class="code-keyword">for</span> (Plugin p: plugins)
                p.namenodeStopping();
            <span class="code-comment">// [..]
    </span>    }
    
        <span class="code-comment">// [..]
    </span>}</pre>
    </div></div>
    
    <p>Data nodes would do a similar thing in <tt>DataNode.startDatanode()</tt> and <tt>DataNode.shutdown()</tt>.</p>
    Reason: MISSING: Reason for inclusion
    Author: Carlos Valiente
    Ref: UNKNOWN

commit 155394ca5eed2e2a6151a5c9d9452e9cfbb30a11
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:57:58 2010 -0800

    MAPREDUCE-971. distcp does not always remove distcp.tmp.dir
    
    Description: Sometimes distcp leaves behind its tmpdir when the target filesystem is s3n.
    Reason: Bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 7575b83ba0cab30394bad0943ff906ab0609dc40
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:57:49 2010 -0800

    CLOUDERA-BUILD. Package sqoop docs.

commit 9321b18352e55d4d37c25335b578151b18f938f2
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:57:32 2010 -0800

    MAPREDUCE-923. Sqoop's ORM uses URLDecoder on a file, which replaces plus signs in a jar file name with spaces
    
    Description: In findThisJar, sqoop runs URLDecoder.decode on the resulting jar, which has the effect of replacing any + signs in the path with a space.  This obviously breaks the classpath variable that it's trying to set, and the sqoop-generated code fails to compile.  Ironically, Cloudera's hadoop distro is the one that puts + characters in jar files, and so exhibits the bug.  Here is an example from running sqoop with log4j at debug level.  Note the space in the very last term, which should read hadoop-0.20.0+61-sqoop.jar rather than hadoop-0.20.0 61-sqoop.jar.
    
    <p>09/08/27 18:00:07 DEBUG orm.CompilationManager: Invoking javac with args: -sourcepath ./ -d /tmp/sqoop/compile/ -classpath /usr/lib/hadoop-0.20/conf:/usr/java/jdk1.6.0_06/lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-0.20.0+61-core.jar:/usr/lib/hadoop-0.20/lib/commons-cli-2.0-SNAPSHOT.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.3.jar:/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/hadoop-0.20.0+61-fairscheduler.jar:/usr/lib/hadoop-0.20/lib/hadoop-0.20.0+61-scribe-log4j.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/hsqldb.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.14.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.14.jar:/usr/lib/hadoop-0.20/lib/junit-3.8.1.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.20/lib/libfb303.jar:/usr/lib/hadoop-0.20/lib/libthrift.jar:/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mysql-connector-java-5.0.8-bin.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar:/usr/local/hadoop/lib/hadoop-gpl-compression.jar:/usr/lib/hadoop-0.20/hadoop-0.20.0+61-core.jar:/usr/lib/hadoop-0.20/contrib/sqoop/hadoop-0.20.0 61-sqoop.jar</p>
    Reason: Bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN
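
The behavior behind MAPREDUCE-923 can be reproduced with the JDK alone. `URLDecoder` implements HTML form decoding, in which `+` means a space, so it is the wrong tool for file paths. The sketch below contrasts it with `java.net.URI`, which decodes only `%XX` escapes. The helper class is hypothetical, and using `URI` is shown as one possible alternative, not necessarily the fix applied in the patch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Hypothetical helper for illustration.
final class JarPathDecoding {
    // Form-style decoding: '+' becomes a space.
    static String viaUrlDecoder(String path) {
        try {
            return URLDecoder.decode(path, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // URI parsing decodes %XX escapes only; '+' is left untouched.
    static String viaUri(String fileUrl) {
        return URI.create(fileUrl).getPath();
    }
}
```

Applied to the jar from the log above, `viaUrlDecoder` yields `hadoop-0.20.0 61-sqoop.jar`, while `viaUri` keeps `hadoop-0.20.0+61-sqoop.jar`.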

commit e97883c5b9c389f82a6447e4cb1678c0a0ed83ba
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:57:19 2010 -0800

    CLOUDERA-BUILD. Sqoop asciidoc syntax error
    
    Author: Aaron Kimball

commit 520bda2edcb90dfe9461e16b96aa4a048d33ed7b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:57:11 2010 -0800

    HADOOP-5450. Add support for application-specific typecodes to typed bytes
    
    Description: For serializing objects of types that are not supported by typed bytes serialization, applications might want to use a custom serialization format. Right now, typecode 0 has to be used for the bytes resulting from this custom serialization, which could lead to problems when deserializing the objects because the application cannot know if a byte sequence following typecode 0 is a customly serialized object or just a raw sequence of bytes. Therefore, a range of typecodes that are treated as aliases for 0 should be added, such that different typecodes can be used for application-specific purposes.
    Reason: New feature
    Author: Klaas Bosteels
    Ref: UNKNOWN

commit b30fc99332c4a444d275731dac4b4245115d65b2
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:56:59 2010 -0800

    HADOOP-1722. Make streaming to handle non-utf8 byte array
    
    Description: Right now, the streaming framework expects the output of the streaming process (mapper or reducer) to be line-oriented UTF-8 text. This limit makes it impossible to use programs whose output may be non-UTF-8 (international encodings, or even binary data). Streaming can overcome this limit by introducing a simple encoding protocol. For example, it can allow the mapper/reducer to hex-encode its keys/values, and the framework decodes them on the Java side. This way, as long as the mapper/reducer executables follow this encoding protocol, they can output arbitrary byte arrays and the streaming framework can handle them.
    Reason: New feature
    Author: Klaas Bosteels
    Ref: UNKNOWN
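
The hex-encoding idea mentioned in the HADOOP-1722 description can be sketched as below. This illustrates only the encoding step on the Java side, under the assumption of lowercase hex with two digits per byte; it is not the protocol the patch ultimately adopted, and the class name is hypothetical.

```java
// Hypothetical codec: two lowercase hex digits per byte.
final class HexCodec {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    static String encode(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(DIGITS[(b >> 4) & 0xF]).append(DIGITS[b & 0xF]);
        }
        return sb.toString();
    }

    static byte[] decode(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string has odd length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit at byte " + i);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }
}
```

Because the encoded form contains only ASCII letters and digits, bytes such as tab, newline, or 0xFF survive a line-oriented text channel.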

commit 921c135653736bcc279700435358058762bc8f78
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:56:43 2010 -0800

    CLOUDERA-BUILD. More Sqoop documentation updates
    
    Author: Aaron Kimball

commit be7f1dc031e17dc4f53ebe76d27c1b9242105785
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:56:26 2010 -0800

    MAPREDUCE-840. DBInputFormat leaves open transaction
    
    Description: (Reapplied after HADOOP-4687)
    Reason: MISSING: Reason for inclusion
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 89a96d8fff80ac809dbda9582044a7c6b3986d16
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:56:07 2010 -0800

    MAPREDUCE-906. Updated Sqoop documentation
    
    Description: Provides the latest documentation for Sqoop, in both user-guide and manpage form. Built with asciidoc.
    Reason: Documentation
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 51f867aea0667d0191b730ea3abf114e75cafa4b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:55:54 2010 -0800

    MAPREDUCE-907. Sqoop should use more intelligent splits
    
    Description: Sqoop should use the new split generation / InputFormat in <a href="http://issues.apache.org/jira/browse/MAPREDUCE-885" title="More efficient SQL queries for DBInputFormat"><del>MAPREDUCE-885</del></a>
    Reason: Performance / scalability improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 239df04415dba8d12c7d3fbf33c580d473202e94
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:55:28 2010 -0800

    MAPREDUCE-885. More efficient SQL queries for DBInputFormat
    
    Description: DBInputFormat generates InputSplits by counting the available rows in a table, and selecting subsections of the table via the "LIMIT" and "OFFSET" SQL keywords. These are only meaningful in an ordered context, so the query also includes an "ORDER BY" clause on an index column. The resulting queries are often inefficient and require full table scans. Actually using multiple mappers with these queries can lead to O(n^2) behavior in the database, where n is the number of splits. Attempting to use parallelism with these queries is counter-productive.
    
    <p>A better mechanism is to organize splits based on data values themselves, which can be performed in the WHERE clause, allowing for index range scans of tables, and can better exploit parallelism in the database.</p>
    Reason: Performance and scalability improvement
    Author: Aaron Kimball
    Ref: UNKNOWN
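
A value-based split of the kind MAPREDUCE-885 describes can be sketched for an integer column. This is a simplified illustration with hypothetical names; it assumes the minimum and maximum column values are already known (for example from a `MIN`/`MAX` query) and ignores skewed data and non-integer types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: one WHERE clause per split, each covering a half-open range.
final class RangeSplits {
    static List<String> whereClauses(String column, long min, long max, int numSplits) {
        List<String> clauses = new ArrayList<>();
        long span = max - min + 1;
        // Ceiling division so that the ranges cover every value in [min, max].
        long step = Math.max(1, (span + numSplits - 1) / numSplits);
        for (long lo = min; lo <= max; lo += step) {
            long hi = Math.min(lo + step, max + 1);
            clauses.add(column + " >= " + lo + " AND " + column + " < " + hi);
        }
        return clauses;
    }
}
```

Each clause can be answered by an index range scan on the split column, so no split needs `ORDER BY`, `LIMIT`, or `OFFSET`.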

commit 23a0d1882c797160cc7b6fae99fc5e686aa30191
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:55:16 2010 -0800

    MAPREDUCE-938. Postgresql support for Sqoop
    
    Description: Sqoop should be able to import from postgresql databases.
    Reason: Compatibility improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 7b89feb34fafd2365f75ab744db9cb07a5443046
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:55:05 2010 -0800

    MAPREDUCE-876. Sqoop import of large tables can time out
    
    Description: Related to <a href="http://issues.apache.org/jira/browse/MAPREDUCE-875" title="Make DBRecordReader execute queries lazily"><del>MAPREDUCE-875</del></a>, Sqoop should use a background thread to ensure that progress is being reported while a database does external work for the MapReduce task.
    Reason: Scalability improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 61d4ef5175dca1859a1320f9e7cad1caeab5d982
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:54:49 2010 -0800

    MAPREDUCE-918. Test hsqldb server should be memory-only.
    
    Description: Sqoop launches a standalone hsqldb server for unit tests, but it currently writes its database to disk and uses a connect string of <tt>//localhost</tt>. If multiple test instances are running concurrently, one test server may serve to the other instance of the unit tests, causing race conditions.
    Reason: Bugfix in test harness
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 1fc17ad34e8288b54503eeb15f788eb4e6a070dc
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:54:37 2010 -0800

    MAPREDUCE-875. Make DBRecordReader execute queries lazily
    
    Description: DBInputFormat's DBRecordReader executes the user's SQL query in the constructor. If the query is long-running, this can cause task timeout. The user is unable to spawn a background thread (e.g., in a MapRunnable) to inform Hadoop of on-going progress.
    Reason: Scalability improvement
    Author: Aaron Kimball
    Ref: UNKNOWN
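
The lazy-execution idea in MAPREDUCE-875 can be sketched without a database. The class below is hypothetical and stands in for a record reader: the `Supplier` represents the long-running query, which runs on the first call to `next()` and not in the constructor.

```java
import java.util.Iterator;
import java.util.function.Supplier;

// Hypothetical reader: the query is deferred until the first record is requested.
final class LazyRecordReader<T> {
    private final Supplier<Iterator<T>> query;
    private Iterator<T> results; // null until the query has been executed

    LazyRecordReader(Supplier<Iterator<T>> query) {
        this.query = query; // nothing is executed here
    }

    // Returns the next record, or null when the results are exhausted.
    T next() {
        if (results == null) {
            results = query.get(); // first use: run the query now
        }
        return results.hasNext() ? results.next() : null;
    }
}
```

Since construction returns immediately, the caller has a chance to start a progress-reporting thread before the query begins.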

commit 21fdb7a7fd501fd63e1a540c2b55cf410d057301
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:54:27 2010 -0800

    MAPREDUCE-825. JobClient completion poll interval of 5s causes slow tests in local mode
    
    Description: The JobClient.NetworkedJob.waitForCompletion() method polls for job completion every 5 seconds. When running a set of short tests in pseudo-distributed mode, this is unnecessarily slow and causes lots of wasted time. When bandwidth is not scarce, setting the poll interval to 100 ms results in a 4x speedup in some tests.  This interval should be parametrized to allow users to control the interval for testing purposes.
    Reason: Test performance improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit f996b8a019bffefff183d7d688ccf95b8cb73de5
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:54:15 2010 -0800

    MAPREDUCE-750. Extensible ConnManager factory API
    
    Description: Sqoop uses the ConnFactory class to instantiate a ConnManager implementation based on the connect string and other arguments supplied by the user. This allows per-database logic to be encapsulated in different ConnManager instances, and dynamically chosen based on which database the user is actually importing from. But adding new ConnManager implementations requires modifying the source of a common ConnFactory class. An indirection layer should be used to delegate instantiation to a number of factory implementations which can be specified in the static configuration or at runtime.
    Reason: API flexibility improvement
    Author: Aaron Kimball
    Ref: UNKNOWN
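
The indirection layer proposed in MAPREDUCE-750 can be sketched as a chain of factories. All type names below are hypothetical simplifications, not Sqoop's actual classes: each factory either returns a manager for the connect string or returns null to decline, and the first factory that accepts wins.

```java
import java.util.List;

// Hypothetical, simplified types for illustration.
interface Manager {
    String name();
}

interface ManagerFactory {
    // Returns a manager for this connect string, or null if unsupported.
    Manager accept(String connectString);
}

final class FactoryChain {
    private final List<ManagerFactory> factories;

    FactoryChain(List<ManagerFactory> factories) {
        this.factories = factories;
    }

    Manager getManager(String connectString) {
        for (ManagerFactory factory : factories) {
            Manager manager = factory.accept(connectString);
            if (manager != null) {
                return manager; // first factory that accepts wins
            }
        }
        throw new IllegalArgumentException("No manager for connect string: " + connectString);
    }
}
```

Supporting a new database then means adding a factory to the list (which could be read from configuration) without editing the common class.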

commit 39bdff7bd3b83359884c90ae857d3f3144a94803
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:54:04 2010 -0800

    MAPREDUCE-749. Make Sqoop unit tests more Hudson-friendly
    
    Description: Hudson servers (other than Apache's) need to be able to run the sqoop unit tests which depend on thirdparty JDBC drivers / database implementations. The build.xml needs some refactoring to make this happen.
    Reason: Test coverage improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 0ca54f2722206685d9e36fcbb2656d0ac1957311
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:53:47 2010 -0800

    MAPREDUCE-792. javac warnings in DBInputFormat
    
    Description: <a href="http://issues.apache.org/jira/browse/MAPREDUCE-716" title="org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle"><del>MAPREDUCE-716</del></a> introduces javac warnings
    Reason: Technical debt
    Author: Aaron Kimball
    Ref: UNKNOWN

commit e39ae9d017e89e4df193b1f8075184320230499b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:52:45 2010 -0800

    MAPREDUCE-716. org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle
    
    Description: Applied "trunk" version of the patch after incorporating
    HADOOP-4687's move of DBInputFormat-related files. (Prior patch was 0.20-branch
    specific)
    Reason: Branch compatibility improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 074e824f5d3d2f6ab862083e6eb4b0df8c881bfc
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:52:27 2010 -0800

    MAPREDUCE-910. MRUnit should support counters
    
    Description: incrCounter() is currently a dummy stub method in MRUnit that does nothing. Would be good for the mock reporter/context implementations to support counters.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit b4b7c5d9b4cba84bc47f4a48074fd295d060ab35
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:52:17 2010 -0800

    MAPREDUCE-798. MRUnit should be able to test a succession of MapReduce passes
    
    Description: MRUnit can currently test that the inputs to a given (mapper, reducer) "job" produce certain outputs at the end of the reducer. It would be good to support more end-to-end tests of a series of MapReduce jobs that form a longer pipeline surrounding some data.
    Reason: New Feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 59677d22261974560117fa82e74d9a7f80f804d5
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:52:06 2010 -0800

    MAPREDUCE-800. MRUnit should support the new API
    
    Description: MRUnit's TestDriver implementations use the old org.apache.hadoop.mapred-based classes. TestDrivers and associated mock object implementations are required for org.apache.hadoop.mapreduce-based code.
    Reason: New feature (API Compatibility)
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 7fda23b419b1c98e84eea43a0f35191d41032e18
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:51:53 2010 -0800

    MAPREDUCE-799. Some of MRUnit's self-tests were not being run
    
    Description: Due to method naming issues, some test cases were not being executed.
    Reason: Bugfix; test coverage
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 20d5bf205e9f2864f3da53d30408ba97763a46e9
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:51:40 2010 -0800

    MAPREDUCE-797. MRUnit MapReduceDriver should support combiners
    
    Description: The MapReduceDriver allows you to specify a mapper and a reducer class with a simple sort/"shuffle" between the passes. It would be nice to also support another Reducer implementation being used as a combiner in the middle.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 5c873336b3380e6c8f07ca28230ede9d41e4e840
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:50:05 2010 -0800

    Integrate with 0.21-branch versions of DBInputFormat
    
    Description: In 0.21 there is now a DBInputFormat in the mapred/lib/ package
    as well as mapreduce/lib/db. This patch backports the new API edition of
    DBInputFormat to CDH
    Reason: Cross-branch compatibility improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 51b650554e3bc8054e8ca966f5f552c522f7483d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:49:52 2010 -0800

    HADOOP-5170. Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide
    
    Description: There are a number of use cases for being able to do this.  The focus of this jira should be on finding what would be the simplest to implement that would satisfy the most use cases.
    
    <p>This could be implemented as either a per-node maximum or a cluster-wide maximum.  It seems that for most uses, the former is preferable; however, either would fulfill the requirements of this jira.</p>
    
    <p>Some of the reasons for allowing this feature (mine and from others on list):</p>
    <ul class="alternate" type="square">
    	<li>I have some very large CPU-bound jobs.  I am forced to keep the max map/node limit at 2 or 3 (on a 4 core node) so that I do not starve the Datanode and Regionserver.  I have other jobs that are network latency bound and would like to be able to run high numbers of them concurrently on each node.  Though I can thread some jobs, there are some use cases that are difficult to thread (scanning from hbase) and there's significant complexity added to the job rather than letting hadoop handle the concurrency.</li>
    	<li>Poor assignment of tasks to nodes creates some situations where you have multiple reducers on a single node but other nodes that received none.  A limit of 1 reducer per node for that job would prevent that from happening. (only works with per-node limit)</li>
    	<li>Poor man's MR job virtualization.  Since we can limit a job's resources, this gives much more control in allocating and dividing up resources of a large cluster.  (makes most sense w/ cluster-wide limit)</li>
    </ul>
    
    Reason: Configuration improvement
    Author: Matei Zaharia
    Ref: UNKNOWN

commit 99e25a93542251debd248ed71cb380858ca8c9bd
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:49:40 2010 -0800

    HADOOP-6166. Improve PureJavaCrc32
    
    Description: Got some ideas to improve CRC32 calculation.
    Reason: Performance Improvement
    Author: Tsz Wo (Nicholas), SZE
    Ref: UNKNOWN

commit 2d0a97cefa559ab9059d976bda66f9dbcf051e79
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:49:28 2010 -0800

    MAPREDUCE-782. Use PureJavaCrc32 in mapreduce spills
    
    Description: <a href="http://issues.apache.org/jira/browse/HADOOP-6148" title="Implement a pure Java CRC32 calculator"><del>HADOOP-6148</del></a> implemented a Pure Java implementation of CRC32 which performs better than the built-in one. This issue is to make use of it in the mapred package
    Reason: Performance improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit bb65cb649c2924b5a20f06deb9ecd66fc219eeeb
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:49:12 2010 -0800

    HDFS-496. Use PureJavaCrc32 in HDFS
    
    Description: Common now has a pure java CRC32 implementation which is more efficient than java.util.zip.CRC32. This issue is to make use of it.
    Reason: Performance improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit ac73e6d51d5ad1df993097349602e5f3199b952a
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:48:40 2010 -0800

    HADOOP-6148. Implement a pure Java CRC32 calculator
    
    Description: We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.
    
    This outperforms java.util.zip.CRC32.
    Reason: Performance improvement
    Author: Scott Carey and Todd Lipcon
    Ref: UNKNOWN
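
For readers unfamiliar with how CRC32 is computed without native code, here is a minimal table-driven version in pure Java. It is the textbook byte-at-a-time algorithm and not the optimized PureJavaCrc32 from HADOOP-6148; the class name is hypothetical, and no performance claim is made for this sketch.

```java
// Textbook CRC-32 (reflected polynomial 0xEDB88320), one table lookup per byte.
final class TableCrc32 {
    private static final int[] TABLE = new int[256];

    static {
        for (int n = 0; n < 256; n++) {
            int c = n;
            for (int k = 0; k < 8; k++) {
                c = (c & 1) != 0 ? (c >>> 1) ^ 0xEDB88320 : c >>> 1;
            }
            TABLE[n] = c;
        }
    }

    // Returns the checksum as an unsigned 32-bit value held in a long.
    static long of(byte[] data) {
        int crc = 0xFFFFFFFF;
        for (byte b : data) {
            crc = (crc >>> 8) ^ TABLE[(crc ^ b) & 0xFF];
        }
        return (~crc) & 0xFFFFFFFFL;
    }
}
```

The result matches `java.util.zip.CRC32` for the same input, which gives a convenient correctness check for any pure Java replacement.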

commit e7430c8cbd2d182716ac7efb08cb2187c1edab95
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:48:08 2010 -0800

    Updated Sqoop documentation for MAPREDUCE-816, MAPREDUCE-789.
    
    Reason: Documentation improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit aa75ab7f749604c354dcdb0b806aca9cd140f504
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:47:58 2010 -0800

    MAPREDUCE-789. Oracle support for Sqoop
    
    Description: A separate ConnManager is needed for Oracle to support its slightly different syntax and configuration
    Reason: Compatibility improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 6f017db468a82e336a28f451c7d90bc225130094
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:47:33 2010 -0800

    MAPREDUCE-840. DBInputFormat leaves open transaction
    
    Description: DBInputFormat.getSplits() does not call connection.commit() after the COUNT query. This can leave an open transaction against the database which interferes with other connections to the same table.
    Reason: bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 84b622a5f6f5bd145f19f4c08b6263759ac51756
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:47:15 2010 -0800

    MAPREDUCE-816. Rename "local" mysql import to "direct"
    
    Description: A mysqldump-based fast path known as "local mode" is used in sqoop when users pass the argument <tt>--local</tt>. The restriction that this only import from localhost was based on an implementation technique that was later abandoned in favor of a more general one, which can support remote hosts as well. Thus, <tt>--local</tt> is a poor name for the flag. <tt>--direct</tt> is more general and more descriptive. This should be used instead.
    Reason: Interface clarification
    Author: Aaron Kimball
    Ref: UNKNOWN

commit ce75318a484615dc7b161a41710884f34db50c86
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:46:34 2010 -0800

    MAPREDUCE-716. org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle
    
    Description: <p>Out of the box, Hadoop's DBInputFormat works properly with mysql/hsqldb, but NOT with oracle.<br/>
    The reason is that DBInputFormat is implemented with mysql/hsqldb-specific query constructs such as "LIMIT" and "OFFSET".</p>
    
    <p>FIX:<br/>
    Build database-provider-specific logic based on the database provider name (which we can get from the connection).</p>
    
    Reason: Compatibility improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 338de775796c2102ce680eaa983b719b50e9f3ee
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:46:18 2010 -0800

    HADOOP-5469. Exposing Hadoop metrics via HTTP
    
    Description: Implement a "/metrics" URL on the HTTP server of Hadoop daemons, to expose metrics data to users via their web browsers, in plain-text and JSON.
    Reason: New feature
    Author: Philip Zeyliger
    Ref: UNKNOWN

commit cad421ec1c51382f81714ccafb96a6bb8bcc8aec
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:46:11 2010 -0800

    HADOOP-5469. Exposing Hadoop metrics via HTTP
    
    Description: Implement a "/metrics" URL on the HTTP server of Hadoop daemons, to expose metrics data to users via their web browsers, in plain-text and JSON.
    Reason: MISSING: Reason for inclusion
    Author: Philip Zeyliger
    Ref: UNKNOWN

commit 8b09839047997a4b5461703650b5779ec86c1844
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:45:49 2010 -0800

    CLOUDERA-BUILD. Added Sqoop documentation to installation script
    
    Author: Todd Lipcon

commit 7e77c6b13f06dec9c742bf76c81e2ec02d81c7cb
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:45:35 2010 -0800

    CLOUDERA-BUILD. Fix the hadoop/sqoop wrapper scripts
    
    Author: Matt Massie

commit 0caaf80f3a569b91f482de0dcb87f826967f5c7c
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:45:16 2010 -0800

    CLOUDERA-BUILD. Fix a bug in the hadoop/sqoop wrapper generation
    
    Author: Matt Massie
    Ref: UNKNOWN

commit bd8ddae402a876fe78cbb1482362935780b57d84
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:44:59 2010 -0800

    CLOUDERA-BUILD. Update the install hadoop script
    
    Author: Matt Massie
    Ref: UNKNOWN

commit 80cf01124877a5aebd742142b10fda45910f0328
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:44:42 2010 -0800

    CLOUDERA-BUILD. Rename the hadoop man page to be hadoop-0.20
    
    Author: Matt Massie
    Ref: UNKNOWN

commit 78cb9f21a3ddf04f8cef9e37a94f657448d0d111
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:43:51 2010 -0800

    HADOOP-5745. Allow setting the default value of maxRunningJobs for all pools
    
    Description: The <pool> element allows setting the maxRunningJobs for that pool. It would be nice to be able to set a default value for all pools.
    
    In our configuration, pools are autocreated: every new user gets their own pool. We would like to allow each user to run at most 5 jobs at a time. For the etl pool, this limit will be set to a greater value.
    Reason: Improved configuration flexibility
    Author: dhruba borthakur
    Ref: UNKNOWN

commit 3c39e1fa8c3c89fc8f11f1faff46397fa82d5116
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:43:13 2010 -0800

    MAPREDUCE-906. Updated Sqoop documentation.
    
    Description: Update Sqoop documentation with user guide and manpage.
    Reason: Documentation improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 79a2645bc81894331721ef94c255992075ccf195
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:42:14 2010 -0800

    CLOUDERA-BUILD. Added MySQL Connector/J library for Sqoop.
    
    Description: We can ship MySQL Connector/J with CDH because the licenses
    are compatible. However, the public Apache project will not include this
    library in their source repository due to stricter licensing concerns.
    Reason: Simplifies deployment of Sqoop for mysql users
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 4a097b35bf1264a0606f2ebe410c45f16f900f03
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:42:05 2010 -0800

    MAPREDUCE-705. User-configurable quote and delimiter characters for Sqoop records and record reparsing
    
    Description: Sqoop needs a mechanism for users to govern how fields are quoted and what delimiter characters separate fields and records. With delimiters providing an unambiguous format, a parse method can reconstitute the generated record data object from a text-based representation of the same record.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 58e23056af0e99ef611ac258719207cc9459a849
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:41:47 2010 -0800

    MAPREDUCE-710. Sqoop should read and transmit passwords in a more secure manner
    
    Description: Sqoop's current support for passwords involves reading passwords from the command line "--password foo", which makes the password visible to other users via 'ps'. An invisible-console approach should be taken.
    
    Relatedly, Sqoop transmits passwords to mysqldump in the same fashion, which is also insecure.
    Reason: Security improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit a67a0f77729fb9005b0c47872d6ba677f6434b41
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:41:34 2010 -0800

    MAPREDUCE-713. Sqoop has some superfluous imports
    
    Description: Some classes have vestigial imports that should be removed
    Reason: Code cleanup
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 0a4dab2eac0ba8b6da5190bc53a9ce8e4344a336
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:41:01 2010 -0800

    MAPREDUCE-685. Sqoop will fail with OutOfMemory on large tables using mysql
    
    Description: The default MySQL JDBC client behavior is to buffer the entire ResultSet in the client before allowing the user to use the ResultSet object. On large SELECTs, this can cause OutOfMemory exceptions, even when the client intends to close the ResultSet after reading only a few rows. The MySQL ConnManager should configure its connection to use row-at-a-time delivery of results to the client.
    Reason: bugfix / scalability improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 499aa76b500136a0e8996898f468b088ca5d7ed3
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:40:50 2010 -0800

    MAPREDUCE-674. Sqoop should allow a "where" clause to avoid having to export entire tables
    
    Description: Sqoop currently only exports at the granularity of a table.  This doesn't work well on systems with large tables, where the overhead of performing a full dump each time is significant.  Allowing the user to specify a where clause is a relatively simple task which will give Sqoop a lot more flexibility.
    Reason: New feature
    Author: Kevin Weil
    Ref: UNKNOWN

commit ed4ba254d7708f363f5f1b4708e9e35061ad936c
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:40:37 2010 -0800

    MAPREDUCE-675. Sqoop should allow user-defined class and package names
    
    Description: Currently Sqoop generates a class for each table to be imported; the class names are equal to the table names and they are not part of any package.
    
    This adds --class-name and --package-name parameters to Sqoop, allowing these aspects of code generation to be controlled.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 16e0ca8119b99b244c9eeafd78bb9eb43e4ba639
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:40:20 2010 -0800

    MAPREDUCE-703. Sqoop requires dependency on hsqldb in ivy
    
    Description: Sqoop builds crash without explicit dependency on hsqldb.
    Reason: build system bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit b8e54791e990328db983f070e9a04952301eda35
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:40:04 2010 -0800

    MAPREDUCE-692. Make Hudson run Sqoop unit tests
    
    Description: Running 'ant test-contrib' didn't test Sqoop because it wasn't explicitly listed in the build.xml file in src/contrib/
    Reason: Test coverage
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 8a3b6472ae00542dadf7f7d60991ec0f21b38177
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:39:40 2010 -0800

    HADOOP-5968. Sqoop should only print a warning about mysql import speed once
    
    Description: After HADOOP-5844 ("Use mysqldump when connecting to local mysql instance in Sqoop", http://issues.apache.org/jira/browse/HADOOP-5844), Sqoop can use mysqldump as an alternative to JDBC for importing from MySQL. If you use the JDBC mechanism, it prints a warning if you could have enabled the mysqldump path instead. But the warning is printed multiple times (every time the LocalMySQLManager is instantiated), and also when the MySQL manager is used for informational queries (e.g., listing tables) rather than true imports.
    
    It should only emit the warning once per session, and only then if it's actually doing an import.
    Reason: User experience improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 86211e3714dc5b1dbcb7a3c328336277f6657de7
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:38:44 2010 -0800

    HADOOP-5967. Sqoop should only use a single map task
    
    Description: The current DBInputFormat implementation uses SELECT ... LIMIT ... OFFSET statements to
    read from a database table. This actually results in several queries all accessing the same table at
    the same time. Most database implementations will actually use a full table scan for each such
    query, starting at row 1 and scanning down until the OFFSET is reached before emitting data to the
    client. The upshot of this is that we see O(n^2) performance in the size of the table when using a
    large number of mappers, when a single mapper would read through the table in O(n) time in the number of rows.
    
    This patch sets the number of map tasks to 1 in the MapReduce job Sqoop launches.
    Reason: Performance improvement
    Author: Aaron Kimball
    Ref: UNKNOWN
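    
    The cost argument in the description can be sketched as follows. This is a
    minimal model, not Sqoop or DBInputFormat code: it assumes a query with
    OFFSET k scans k rows before emitting any data, and the function name
    rows_scanned is hypothetical.

```python
def rows_scanned(total_rows, num_mappers):
    """Rows the database walks when each mapper issues
    SELECT ... LIMIT split OFFSET i*split and every query scans from row 1."""
    split = total_rows // num_mappers
    scanned = 0
    for i in range(num_mappers):
        offset = i * split
        # The last mapper also picks up any remainder rows.
        limit = total_rows - offset if i == num_mappers - 1 else split
        # Assumed cost model: skip `offset` rows, then read `limit` rows.
        scanned += offset + limit
    return scanned


if __name__ == "__main__":
    for mappers in (1, 4, 10):
        print(mappers, rows_scanned(1000, mappers))
```

    Under this model, m mappers over n rows scan about n(m+1)/2 rows in total,
    which is quadratic when m grows with n, while a single mapper scans n rows.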
    
    commit 410db7130a8e85ceed46850f73e74f480d45994e
    Author: Aaron Kimball <aaron@cloudera.com>
    Date:   Thu Jul 23 16:10:21 2009 -0700
    
        HADOOP-5967: Sqoop should only use a single map task

commit b8f5d1d3a30a7461936f3f92bd9f007ed2db43e8
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:38:23 2010 -0800

    HADOOP-5887. Sqoop should create tables in Hive metastore after importing to HDFS
    
    Description: Sqoop (HADOOP-5815, "Sqoop: A database import tool for Hadoop", http://issues.apache.org/jira/browse/HADOOP-5815) imports tables into HDFS; it is a straightforward enhancement to then generate a Hive DDL statement to recreate the table definition in the Hive metastore and move the imported table into the Hive warehouse directory from its upload target.
    
    This feature enhancement makes this process automatic. An import is performed with sqoop in the usual way; providing the argument "--hive-import" will cause it to then issue a CREATE TABLE .. LOAD DATA INTO statement to a Hive shell. It generates a script file and then attempts to run "$HIVE_HOME/bin/hive" on it, or failing that, any "hive" on the $PATH; $HIVE_HOME can be overridden with --hive-home. As a result, no direct linking against Hive is necessary.
    
    The unit tests provided with this enhancement use a mock implementation of 'bin/hive' that compares the script it's fed with one from a directory full of "expected" scripts. The exact script file referenced is controlled via an environment variable. It doesn't actually load into a proper Hive metastore, but manual testing has shown that this process works in practice, so the mock implementation is a reasonable unit testing tool.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 50993494fdc7b2284837562b500e2840106bb3bb
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:37:48 2010 -0800

    CLOUDERA-BUILD. Address issue where docs were not properly copied through to release tarball
    
    Description:
        This was caused by some cleanup in build.xml early on in the CDH 0.20
        branch
    Reason: bugfix
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 3ecb9c07279302d18f7367d49bcd98c4391cbb68
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:37:27 2010 -0800

    CLOUDERA-BUILD. Decrease build time by only rebuilding the native code for each platform
    
    Reason: build system improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit f0c6a810ba7237ec7cc570ecad8a8665768b3d06
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:37:07 2010 -0800

    CLOUDERA-BUILD. Run jdiff against vanilla Hadoop during Cloudera release build
    
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 9cf8f0cb6ed744439d8e90e3ba376edb5d9521f3
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:36:22 2010 -0800

    MAPREDUCE-415. JobControl Job does always has an unassigned name
    
    Description: When creating and adding org.apache.hadoop.mapred.jobcontrol.Job(s) they don't use the names specified in their respective JobConf files.  Instead it's just hardcoded to "unassigned".
    Reason: bugfix
    Author: Xavier Stevens
    Ref: UNKNOWN

commit 330f009bae260ac990426a988fc56913897a50ca
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:35:03 2010 -0800

    HADOOP-5805. problem using top level s3 buckets as input/output directories
    
    Description: When I specify top level s3 buckets as input or output directories, I get the following exception.
    
    hadoop jar subject-map-reduce.jar s3n://infocloud-input s3n://infocloud-output
    
    java.lang.IllegalArgumentException: Path must be absolute: s3n://infocloud-output
            at org.apache.hadoop.fs.s3native.NativeS3FileSystem.pathToKey(NativeS3FileSystem.java:246)
            at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:319)
            at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
            at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:109)
            at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:738)
            at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1026)
            at com.evri.infocloud.prototype.subjectmapreduce.SubjectMRDriver.run(SubjectMRDriver.java:63)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
            at com.evri.infocloud.prototype.subjectmapreduce.SubjectMRDriver.main(SubjectMRDriver.java:25)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            at java.lang.reflect.Method.invoke(Method.java:597)
            at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
            at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
            at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
    
    The workaround is to specify input/output buckets with sub-directories:
    
    hadoop jar subject-map-reduce.jar s3n://infocloud-input/input-subdir s3n://infocloud-output/output-subdir
    
    
    Reason: bugfix
    Author: Ian Nowland
    Ref: UNKNOWN
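    
    One plausible reading of the "Path must be absolute" message, not confirmed
    by this commit message, is that a bare bucket URI has an empty path
    component. The Python sketch below only illustrates that URI property; the
    helper names are hypothetical and it does not reproduce the Hadoop Path or
    NativeS3FileSystem logic.

```python
from urllib.parse import urlparse


def path_component(uri):
    """Return the path part of a URI (everything after the authority)."""
    return urlparse(uri).path


def is_absolute(uri):
    """Treat a URI path as absolute only if it starts with '/'."""
    return path_component(uri).startswith("/")


if __name__ == "__main__":
    for uri in ("s3n://infocloud-output",
                "s3n://infocloud-output/output-subdir"):
        print(uri, repr(path_component(uri)), is_absolute(uri))
```

    The bucket-only URI yields an empty path, while the sub-directory form used
    in the workaround yields a path that starts with '/'.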

commit 35fa82b5c743e34d62449e0f4abffd885e0dfe4c
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:34:42 2010 -0800

    HADOOP-5656. Counter for S3N Read Bytes does not work
    
    Description: Counter for S3N Read Bytes does not work on trunk. On the 0.18 branch, neither the read nor the write byte counter works.
    Reason: Bugfix
    Author: Ian Nowland
    Ref: UNKNOWN

commit a6670de0a1c4b03c293ae47d1595e8c33764aaa5
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:33:43 2010 -0800

    HADOOP-5613. change S3Exception to checked exception
    
    Description: Currently the S3 filesystems can throw unchecked exceptions (S3Exception) which are not declared in the interface of FileSystem. These aren't caught by the various callers and can cause unpredictable behavior. IOException is caught by most users of FileSystem since it is declared in the interface, and hence is handled better.
    
    S3Exception now extends IOException.
    Reason: Improved error-checking at compile time for user applications.
    Author: Andrew Hitchcock
    Ref: UNKNOWN
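    
    The effect of moving S3Exception under IOException can be sketched with a
    Python analogy. Python has no checked exceptions, so this shows only the
    class-hierarchy side of the change: a caller that already handles I/O
    errors starts catching the S3 error once it derives from the I/O error
    type. All class and function names below are hypothetical.

```python
class UncheckedS3Exception(RuntimeError):
    """Stand-in for the old S3Exception, outside the I/O error hierarchy."""


class S3Exception(IOError):
    """Stand-in for the new S3Exception, which is an I/O error."""


def fail_with(exc):
    """Return an operation that raises the given exception when called."""
    def operation():
        raise exc
    return operation


def read_with_fallback(operation):
    """A typical caller that only anticipates I/O errors."""
    try:
        return operation()
    except IOError:
        return "fallback"


if __name__ == "__main__":
    print(read_with_fallback(fail_with(S3Exception("bucket unavailable"))))
```

    Passing fail_with(UncheckedS3Exception(...)) to the same caller lets the
    error escape, which is the unpredictable behavior the description refers to.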

commit 1f11b63a42ae441eb8d0693ed0e4e01aca553e42
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:33:09 2010 -0800

    HADOOP-5528. Binary partitioner
    
    Description: It would be useful to have a BinaryPartitioner that partitions BinaryComparable keys by hashing a configurable part of the byte array corresponding to each key.
    Reason: New feature
    Author: Klaas Bosteels
    Ref: UNKNOWN
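    
    The idea of hashing only part of a key can be sketched as below. This is
    an illustration, not the BinaryPartitioner implementation: the start/end
    parameters, the function name, and the use of CRC32 as the hash are all
    assumptions made for the example.

```python
import zlib


def binary_partition(key, num_partitions, start=0, end=None):
    """Pick a partition from a hash of key[start:end] only.

    `key` is a bytes object; `start` and `end` select the hashed slice.
    CRC32 is used here because it is deterministic across processes.
    """
    selected = key[start:end]
    return zlib.crc32(selected) % num_partitions


if __name__ == "__main__":
    # Keys sharing the first 6 bytes land in the same partition.
    print(binary_partition(b"user42:click", 8, 0, 6))
    print(binary_partition(b"user42:view", 8, 0, 6))
```

    Hashing a prefix like this sends all keys with the same leading bytes to
    the same reducer, regardless of what follows in the key.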

commit 716d3598e5a4a18cdfcfcf0dc800e263ef7c7685
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:32:47 2010 -0800

    HADOOP-5240. 'ant javadoc' does not check whether outputs are up to date and always rebuilds
    
    Description: Running 'ant javadoc' twice in a row calls the javadoc program both times; it doesn't check to see whether this is redundant work.
    Reason: Build system improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 2bb607d29d9080a7ca3bce72739ccef654d5392d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:30:46 2010 -0800

    HADOOP-5175. Option to prohibit jars unpacking
    
    Description: The task tracker moves all unpacked jars into
    ${hadoop.tmp.dir}/mapred/local/taskTracker. When using a lot of external
    libraries via -libjars, this results in several thousand unpacked files.
    The amount of time needed to `du` these directories can increase to the point
    where tasks time out before starting. This patch provides an option to
    suppress jar unpacking.
    Reason: Scalability improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 349281bfa0243f5adbbd459266f4a9ac7ac8c1cc
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:30:16 2010 -0800

    CLOUDERA-BUILD. Fix scribe-log4j's ivy.xml to properly get log4j on the compile classpath
    
    Author: Todd Lipcon
    Reason: bugfix to build system
    Ref: UNKNOWN

commit b07aec5129e618bfeda8ba753fb5138e612b1a8b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:29:33 2010 -0800

    HADOOP-4829. Allow FileSystem shutdown hook to be disabled
    
    Description: FileSystem sets a JVM shutdown hook so that it can clean up the FileSystem cache. This is great behavior when you are writing a client application, but when you're writing a server application, like the Collector or an HBase RegionServer, you need to control the shutdown of the application and HDFS much more closely. If you set your own shutdown hook, there's no guarantee that your hook will run before the HDFS one, preventing you from taking some shutdown actions.
    Reason: Integration improvement.
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 154c6a6474b02e68c3418fddf9a8ee5d476a8b7d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:28:14 2010 -0800

    HADOOP-3327. Shuffling fetchers waited too long between map output fetch re-tries
    
    Description: Improves handling of READ_TIMEOUT during map output copying.
    Author: Amareshwari Sriramadasu
    Reason: bugfix
    Ref: UNKNOWN
    
    commit 8a6293fc5c3733035dde8e4d3a68c414a1f800f8
    Author: Devaraj Das <ddas@apache.org>
    Date:   Thu Feb 5 05:35:09 2009 +0000
    
        HADOOP-3327. Improves handling of READ_TIMEOUT during map output copying. Contributed by Amareshwari Sriramadasu.
    
        git-svn-id: https://svn.apache.org/repos/asf/hadoop/core/trunk@741009 13f79535-47bb-0310-9956-ffa450edef68

commit 4ee0ecf4760d7adb3e1a81e018a3b5cd6d2e9775
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:27:44 2010 -0800

    MAPREDUCE-680. Reuse of Writable objects is improperly handled by MRUnit
    
    Description: As written, MRUnit's MockOutputCollector simply stores references to the objects passed in to its collect() method. Thus if the same Text (or other Writable) object is reused as an output container multiple times with different values, these separate values will not all be collected. MockOutputCollector needs to properly use io.serializations to deep copy the objects sent in.
    Reason: Bugfix; see description.
    Author: Aaron Kimball
    Ref: UNKNOWN
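    
    The aliasing problem described above can be sketched in Python. This is
    not MRUnit code: the Text class and both collectors are hypothetical
    stand-ins, and copy.deepcopy is used in place of the io.serializations
    round trip the description calls for.

```python
import copy


class Text:
    """Hypothetical stand-in for a mutable, reusable Writable."""

    def __init__(self, value=""):
        self.value = value

    def set(self, value):
        self.value = value


class ReferenceCollector:
    """Stores references, so later mutations change earlier outputs."""

    def __init__(self):
        self.outputs = []

    def collect(self, key, value):
        self.outputs.append((key, value))


class CopyingCollector(ReferenceCollector):
    """Stores deep copies, so each collected pair keeps its own value."""

    def collect(self, key, value):
        self.outputs.append((copy.deepcopy(key), copy.deepcopy(value)))


def emit(collector):
    """Emit two records while reusing one key object, as mappers often do."""
    key = Text()
    for word in ("a", "b"):
        key.set(word)
        collector.collect(key, 1)
    return [k.value for k, _ in collector.outputs]


if __name__ == "__main__":
    print(emit(ReferenceCollector()))
    print(emit(CopyingCollector()))
```

    The reference-storing collector reports the last value for every record,
    while the copying collector keeps each value as it was when collected.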
    
    commit 51bdfdcf947bc8447aa36d68ae802f154516b0b6
    Author: Aaron Kimball <aaron@cloudera.com>
    Date:   Wed Jul 15 10:40:47 2009 -0700
    
        MAPREDUCE-680. Reuse of Writable objects is improperly handled by MRUnit.

commit c2026460d4cf7049c67da65d3a2db2e9bcd9c848
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:27:14 2010 -0800

    HADOOP-5518. MRUnit unit test library
    
    Description: MRUnit is a tool to help authors of MapReduce programs write unit tests.
    
    Testing map() and reduce() methods requires some repeated work to mock the inputs and outputs of a Mapper or Reducer class, and ensure that the correct values are emitted to the OutputCollector based on inputs. Also, testing a mapper and reducer together requires running them with the sorted ordering guarantees made by the shuffle process.
    
    This library provides the above functionality to authors of maps and reduces; it allows you to test maps, reduces, and map-reduce pairs without needing to perform all the setup and teardown work associated with running a job.
    
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 6991a0eb635953bf3729bce330c426ed7d8b996a
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:26:29 2010 -0800

    CLOUDERA-BUILD. Add sqoop wrapper to bin
    
    Description: Adds a '/usr/bin/sqoop' wrapper script for users
    Reason: User-experience improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit c365162d7db1ee70c8607ad84a11e4aa594224e7
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:25:56 2010 -0800

    HADOOP-5844. Use mysqldump when connecting to local mysql instance in Sqoop
    
    Description: Sqoop uses MapReduce + DBInputFormat to read the contents of a table into HDFS. On many databases, this implementation is O(N^2) in the number of rows. Also, the use of multiple mappers has low value in terms of throughput, because the database itself is inherently single-threaded. While DBInputFormat/JDBC provides a useful fallback mechanism for importing from databases, db-specific dump utilities will nearly always provide faster throughput, and should be selected when available. This patch allows users to use mysqldump to read from local mysql instances instead of the MapReduce-based input.
    Reason: Performance improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit eddbfbca420bfb81a3a565e4324f6189bfd97e41
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:24:58 2010 -0800

    HADOOP-5815. Sqoop: A database import tool for Hadoop
    
    Description:
    Sqoop is a tool designed to help users import existing relational databases into their Hadoop clusters. Sqoop uses JDBC to connect to a database, examine the schema for tables, and auto-generate the necessary classes to import data into HDFS. It then instantiates a MapReduce job to read the table from the database via the DBInputFormat (JDBC-based InputFormat). The table is read into a set of files loaded into HDFS. Both SequenceFile and text-based targets are supported.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit b33265ff77c71af61899a4b3add1e82cc195fdb7
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:23:53 2010 -0800

    MAPREDUCE-714. JobConf.findContainingJar unescapes unnecessarily on Linux
    
    Description: In JobConf.findContainingJar, the path name is decoded using URLDecoder.decode(...). This was done by Doug in r381794 (commit msg "Un-escape containing jar's path, which is URL-encoded.  This fixes things primarily on Windows, where paths are likely to contain spaces.") Unfortunately, jar paths do not appear to be URL encoded on Linux. If you try to use "hadoop jar" on a jar with a "+" in it, this function decodes it to a space and then the job cannot be submitted.
    Reason: Cloudera-based packages include a '+' in the filename; Hadoop's URL escaper will not
    properly handle jar filenames with a '+' without this patch.
    Author: Todd Lipcon
    Ref: UNKNOWN
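    
    The decoding behavior described above can be reproduced with a Python
    analogy. Python's unquote_plus applies the same form-decoding rule for '+'
    as java.net.URLDecoder.decode; the jar path below is hypothetical, and the
    sketch does not show how the patch itself changes findContainingJar.

```python
from urllib.parse import unquote, unquote_plus

# Hypothetical jar path with a '+' in the file name.
JAR_PATH = "/usr/lib/hadoop/hadoop-0.20.1+152-core.jar"


def form_decode(path):
    """Form-style decoding: '+' becomes a space and %XX is unescaped."""
    return unquote_plus(path)


def percent_decode(path):
    """Percent-decoding only: %XX is unescaped and '+' is left alone."""
    return unquote(path)


if __name__ == "__main__":
    print(form_decode(JAR_PATH))
    print(percent_decode(JAR_PATH))
```

    Form decoding turns the '+' into a space, so the decoded name no longer
    matches the file on disk; percent-decoding alone still handles "%20" but
    leaves the '+' intact.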
    
    commit d9767d2cefab288e581732f71779f3ce8e3267e4
    Author: Todd Lipcon <todd@cloudera.com>
    Date:   Mon Jul 6 19:36:11 2009 -0700
    
        MAPREDUCE-714: Fix JobConf.findContainingJars to work with jars with + in the name

commit aaeb69f8dda72a2e7aecacd622e99c00bc961efa
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:23:23 2010 -0800

    CLOUDERA-BUILD. Add dependency libraries for Scribe/log4j
    
    Author: Todd Lipcon

commit cb7a3677942c1d2f9e0d2a75dbffa09fa6125e61
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:22:41 2010 -0800

    CLOUDERA-BUILD. Apply Scribe patches to Hadoop
    
    Description:
        scribe_hadoop_trunk.patch
        Also, add empty ivy infrastructure for scribe-log4j
    Author: Todd Lipcon

commit d5ead434b221076fb830308d2d112d53aa6dc59f
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:22:26 2010 -0800

    CLOUDERA-BUILD. Use cloudera's versioning info from cloudera.hash in saveVersion.sh
    
    Description:
        This should make the "hadoop version" output far more useful for
        determining exactly what code is running. The cloudera.hash property is
        set by cloudera/build.properties which is generated during the build
        process.

commit bf10e46e425395145dcc4b85db66d45cbf9797b0
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:21:45 2010 -0800

    CLOUDERA-BUILD. Move saveVersion.sh in build.xml to ensure build
    
    Description:
        This error is due to ant 1.7.1 not compiling package-info.java if the
        timestamp of the output class directory is newer than the package-info
        file itself. Since other compiles were happening after package-info.java
        was generated, the build dir was newer and compilation was being
        skipped.
    
        Move cloudera hooks inside the package task of build.xml
    
        Fixes an issue where the fair scheduler jar was not built before the
        hooks were run, and therefore was not included in the target lib/
        directory.
    
    Ref: CLOUDERA-436

commit 5359a3bbd2b09644825be99fdd354ff3276a5d59
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:21:36 2010 -0800

    CLOUDERA-BUILD. New versions of cloudera packaging scripts

commit ee255f3909b9938b1023be6a2c59a8429227c766
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:21:27 2010 -0800

    CLOUDERA-BUILD. Change paths to point to hadoop-0.20 where necessary

commit a2d051bcf456fde45c0a0c3aa512872ce6059a97
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:21:08 2010 -0800

    CLOUDERA-BUILD. Add Hadoop manpage to Hadoop 0.20 repository

commit 9600765ec5d6c3cef9ab34ecb573cbb876acf7ee
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:21:01 2010 -0800

    CLOUDERA-BUILD. Move install_hadoop.sh into hadoop repo

commit 77ac6923ad6e63874a429e7dd13c4a084b6a9556
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:20:52 2010 -0800

    CLOUDERA-BUILD. Add example-confs directory for storing configuration of conf.pseudo

commit 14256386d4cb155fea0f5745dd6c49fba74ff40f
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:20:43 2010 -0800

    CLOUDERA-BUILD. Replace hadoop-config.sh with Cloudera version

commit f7d0a20e0d74f1aac1fb96f3c08ce31e9b9ca5d9
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:20:25 2010 -0800

    CLOUDERA-BUILD. Remove redundant code in build.xml between package and bin-package

commit 0fa65091ecd9dd150d6afb93845d3fb10d80e115
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:16:59 2010 -0800

    CLOUDERA-BUILD. Hook build.xml to enable contrib modules
