commit 9b72d268a0b590b4fd7d13aca17c1c453f8bc957
Author: Eli Collins <eli@cloudera.com>
Date:   Sun Jun 27 18:42:45 2010 -0700

    CLOUDERA-BUILD. Make symlinks so old hadoop jar names are preserved (CDH-1543).

commit 4c50269dda2038d202ddb890ffde38dc3fb2ead2
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Thu Jun 24 18:25:09 2010 -0700

    MAPREDUCE-1887. MRAsyncDiskService does not properly absolutize volume root paths.
    
    Description: In MRAsyncDiskService, volume names are sometimes specified as
    relative paths, which are not converted to absolute paths. This can cause
    errors of the form "cannot delete </full/path/to/foo> since it is outside of
    <relative/volume/root>" even though the actual path is inside the root.
    Reason: Bug
    Author: Aaron Kimball
    Ref: CDH-1509
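
    Illustration (not part of the patch): a minimal Python sketch of the failure
    mode described above, assuming a simple prefix-style containment check. The
    function names are hypothetical; this is not the MRAsyncDiskService code.

```python
import os

def naive_is_inside(path, volume_root):
    """Buggy check: compares an absolute path against a root that may be relative."""
    return path.startswith(volume_root)

def is_inside(path, volume_root):
    """Absolutize both sides first, then test containment."""
    abs_path = os.path.abspath(path)
    abs_root = os.path.abspath(volume_root)
    return os.path.commonpath([abs_path, abs_root]) == abs_root

# A file that really lives under the relative volume root "vol".
full_path = os.path.join(os.getcwd(), "vol", "foo")
```

    With the relative root "vol", the naive check rejects full_path even though
    the file is inside the volume; the absolutized check accepts it.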

commit 43ccf90369692c4d8b7d13a7f04b0864c55f615a
Author: Todd Lipcon <todd@cloudera.com>
Date:   Wed Jun 23 17:35:08 2010 -0700

    HDFS-1266. Add Apache License Notice to several places where it was missing
    
    Description: Adds license headers to source code
    Reason: Apache policy
    Author: Todd Lipcon
    Ref: CDH-1495

commit bf08bde983501e3ce8ebf6197049262518580611
Author: Todd Lipcon <todd@cloudera.com>
Date:   Wed Jun 23 16:14:50 2010 -0700

    HDFS-1260. tryUpdateBlock should do validation before renaming meta file
    
    Description: Solves bug where block became inaccessible in certain failure
                 conditions (particularly network partitions). Observed under
                 HBase workload at user site.
    Reason: Potential loss of synced data when write pipeline fails
    Author: Todd Lipcon
    Ref: CDH-659

commit 7243001d5511922f293f0641cb8dbc0af4850dae
Author: Todd Lipcon <todd@cloudera.com>
Date:   Fri Jun 18 16:13:45 2010 -0700

    HDFS-1254. Enable append feature by default
    
    Description: Changes dfs.support.append to "true" in hdfs-default.xml
    Reason: Append/sync have been tested in CDH3b2 and are safe to use.
    Author: Dhruba Borthakur
    Ref: CDH-659

commit 0e1d71c08923bb4c4172ef043b0b2d82f95b92fa
Author: Todd Lipcon <todd@cloudera.com>
Date:   Sat Jun 19 16:26:39 2010 -0700

    HDFS-1252. Updates to TestDFSConcurrentFileOperations (test was previously broken)
    
    Description: Fixes TestDFSConcurrentFileOperations to test the correct
                 semantics for sync feature
    Reason: Test was previously flaky
    Author: Todd Lipcon
    Ref: CDH-659

commit 829497f4867a0e92da712faf02f83c7087df07ce
Author: Eli Collins <eli@cloudera.com>
Date:   Fri Jun 18 19:31:58 2010 -0700

    CLOUDERA-BUILD. Remove Sqoop from the build.

commit 298fda37c4c25434a15886ee9c261e566d595dff
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Jun 18 18:42:37 2010 -0700

    HADOOP-5203. TT's version build is too restrictive.
    
    Description: Use the md5sum checksum of the source for determining version compatibility.
    Reason: Improvement
    Author: Rick Cox (0.20 backport by Bill Au)
    Ref: CDH-1139

commit f07b2df591b91c7de50e8dbb526cf11b27a32a6f
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Jun 18 17:58:53 2010 -0700

    MAPREDUCE-679. XML-based metrics as JSP servlet for JobTracker
    
    Description: A simple XML translation of the existing JobTracker status page
    which provides the same metrics (including the tables of
    running/completed/failed jobs) as the human-readable page. This is a
    relatively lightweight addition to provide some machine-understandable metrics
    reporting.
    Reason: Improvement
    Author: Aaron Kimball
    Ref: CDH-651

commit d8dc8dad821a02619afdbfc3d1cb978b86cb071b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Jun 18 17:24:07 2010 -0700

    MAPREDUCE-1372. ConcurrentModificationException in JobInProgress
    
    Description: Fixes a ConcurrentModificationException in JobInProgress
    Reason: Bug
    Author: Dick King
    Ref: CDH-546

commit e212ca0b0abbd78cdea4596fe9f3c6dbbaa57258
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Jun 18 16:20:01 2010 -0700

    MAPREDUCE-1378. Args in job details links on jobhistory.jsp are not URL encoded
    
    Description: The logFile argument in the job links on the JT jobhistory.jsp
    page is not properly URL-encoded, leading to links that result in 500 errors.
    Reason: Bug
    Author: Eric Sammer
    Ref: CDH-645
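
    Illustration (not part of the patch): a small Python sketch of why the
    logFile argument needs encoding. The page name "details.jsp" and the sample
    log path are made up for the example; only the logFile parameter name comes
    from the description above.

```python
from urllib.parse import quote

def job_link(page, log_file):
    """Build a link whose logFile argument is safe to embed in a query string."""
    # safe="" also escapes "/", which quote() leaves unescaped by default.
    return page + "?logFile=" + quote(log_file, safe="")

link = job_link("details.jsp", "file:/logs/job 1&x")
```

    Without the encoding step, the space and the "&" in the sample path would
    be interpreted as URL syntax rather than as part of the argument.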

commit 23e68e669a118d34e265af5e8ffda3615c2666f9
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Jun 18 15:52:15 2010 -0700

    MAPREDUCE-1570. Shuffle stage - Key and Group Comparators
    
    Description: Shuffle method in org.apache.hadoop.mrunit.MapReduceDriverBase
    doesn't currently allow the use of custom GroupingComparator and
    SortComparator. This patch adds these features.
    Reason: Improvement
    Author: Chris White
    Ref: CDH-958

commit 4601521a9793255e8b5881d64ff1a921451bc951
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Jun 18 15:48:41 2010 -0700

    MAPREDUCE-739. Allow relative paths to be created inside archives.
    
    Description: Allow creating archives with relative paths with a -p option on
    the command line.  Archives currently store the full path from the input
    sources, since multiple sources and regular expressions are allowed as inputs.
    So the created archives have the full path of the input sources.  This is
    unintuitive and a hassle for users. We should get rid of it and allow users to say
    that the created archive should be relative to some absolute path and throw an
    exception if the input does not conform to the relative absolute path.
    Reason: Improvement
    Author: Mahadev konar
    Ref: CDH-501
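
    Illustration (not part of the patch): a Python sketch of the "relative to a
    parent path, else fail" rule described above. The function name and sample
    paths are hypothetical, and this does not reproduce the archive tool's code.

```python
from pathlib import PurePosixPath

def archive_entry_name(parent, source):
    """Return the name stored in the archive for `source`, relative to `parent`.

    Raises ValueError if `source` does not lie under `parent`.
    """
    return str(PurePosixPath(source).relative_to(PurePosixPath(parent)))

name = archive_entry_name("/user/alice", "/user/alice/logs/a.txt")
```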

commit 1d4e15f0f8b749981d62bfca9849e0d0493afdad
Author: Todd Lipcon <todd@lipcon.org>
Date:   Thu Jun 17 20:02:51 2010 -0700

    HDFS-1247. Improvements to HDFS-1204 test
    
    Reason: Fixes compile warnings
    Author: Todd Lipcon
    Ref: CDH-659

commit 1fab52d87c29bc7117eb7324d1a152d8d889f62b
Author: Todd Lipcon <todd@lipcon.org>
Date:   Wed Jun 2 18:25:11 2010 -0700

    HDFS-1246. Manual tool to test sync on a cluster
    
    Description: Tool for automated testing that sync maintains every edit after kill -9
    Reason: Cluster Testing of Sync support for CDH3
    Author: Todd Lipcon
    Ref: CDH-659

commit b9259a145f516a01ba37a33b3803c88824fd55e5
Author: Todd Lipcon <todd@cloudera.com>
Date:   Thu Jun 17 09:55:31 2010 -0700

    HDFS-1240. Fix failing TestDFSShell due to HDFS-909 backport on branch-20
    
    Reason: Fix red build
    Author: Todd Lipcon
    Ref: CDH-659

commit 7276208c2789f2c3961c6dc9fa1d2757774971b1
Author: Todd Lipcon <todd@cloudera.com>
Date:   Wed Jun 16 12:16:25 2010 -0700

    HDFS-1243. Replication tests in TestFileAppend4 should wait for a second for replication to occur
    
    Reason: Test error - fix sporadic failure of TestFileAppend4
    Author: Todd Lipcon
    Ref: CDH-659

commit dc1797ec8380b07117bbc6d662e2f1f56b25e6bd
Author: Todd Lipcon <todd@cloudera.com>
Date:   Tue Jun 15 17:56:43 2010 -0700

    HDFS-1207. stallReplicationWork should be marked volatile in FSNamesystem
    
    Description: Small bug fix for code used by tests only
    Reason: Fix sporadic failure of TestFileAppend4
    Author: Todd Lipcon
    Ref: CDH-659

commit a960eea40dbd6a4e87072bdf73ac3b62e772f70a
Author: Todd Lipcon <todd@lipcon.org>
Date:   Sun Jun 13 23:02:38 2010 -0700

    HDFS-1197. Received blocks should not be added to block map prematurely for under construction files
    
    Description: Fixes a possible dataloss scenario when using append() on
                 real-life clusters. Also augments unit tests to uncover
                 similar bugs in the future by simulating latency when
                 reporting blocks received by datanodes.
    Reason: Append support dataloss bug
    Author: Todd Lipcon
    Ref: CDH-659

commit 3cc1405289ac4ec6616a5ba9da18ff421a93678e
Author: Todd Lipcon <todd@lipcon.org>
Date:   Mon Jun 14 01:43:18 2010 -0700

    HDFS-1209. Add parameter dfs.client.block.recovery.retries to determine how many times to try to recover block
    
    Reason: Used by append tests
    Author: Todd Lipcon
    Ref: CDH-659

commit 128395ae4d317204fe8fb118333270826adf96d5
Author: Todd Lipcon <todd@cloudera.com>
Date:   Sun Jun 6 16:38:21 2010 -0400

    HDFS-1118. DFSOutputStream socket leak when can't connect to DN
    
    Reason: Fixes DFS Client socket leaks in an error condition
    Author: Zheng Shao
    Ref: CDH-659

commit 4ba384d2b9f92f7300ce06b35a967e4edc3ba671
Author: Todd Lipcon <todd@cloudera.com>
Date:   Fri Jun 4 15:10:00 2010 -0700

    HADOOP-6762. Interrupting a thread performing an RPC should not hang that thread.
    
    Description: Moves the sending of parameters for RPC calls to a separate
                 thread, such that interrupting a thread that is making
                 an RPC call does not negatively affect the shared RPC channel.
    Reason: Fixes occasional hangs of HBase under heavy load during failure
            testing.
    Author: Sam Rash
    Ref: CDH-659, CDH-1084

commit 6e99c7e2a12eea782629337f5fb5734e8e5e5865
Author: Todd Lipcon <todd@lipcon.org>
Date:   Wed Jun 2 22:32:45 2010 -0700

    HDFS-1210. DFSClient should print IOE that caused recovery failure
    
    Description: Adds an extra WARN message during DFS client error recovery
    Reason: Makes it easier to debug/diagnose recovery issues
    Author: Todd Lipcon
    Ref: CDH-659

commit 1b8d8c3de261c8334d6eac4f5d3fd42cad894e81
Author: Todd Lipcon <todd@lipcon.org>
Date:   Wed Jun 2 21:53:01 2010 -0700

    HDFS-1186. Writers should be interrupted when recovery is started, not when it's completed.
    
    Description: When the write pipeline recovery process is initiated, this
                 interrupts any concurrent writers to the block under recovery.
                 This prevents a case where some edits may be lost if the
                 writer has lost its lease but continues to write (eg due to
                 a garbage collection pause)
    Reason: Fixes a potential dataloss bug
    Author: Todd Lipcon
    Ref: CDH-659

commit 2ec4301341b249acd0c0cac1792aaa6a6dabab8e
Author: Todd Lipcon <todd@lipcon.org>
Date:   Thu May 20 00:23:20 2010 -0700

    HDFS-915. Write pipeline hangs for too long when ResponseProcessor hits timeout
    
    Description: Previously, the write pipeline would hang for the entire write
                 timeout when it encountered a read timeout (eg due to a
                 network connectivity issue). This patch interrupts the writing
                 thread when a read error occurs.
    Reason: Faster recovery from pipeline failure for HBase and other
            interactive applications.
    Author: Todd Lipcon
    Ref: CDH-659

commit 641090318603c47bfd55e1eea2b039f37e5b723a
Author: Todd Lipcon <todd@cloudera.com>
Date:   Fri May 14 19:20:10 2010 -0700

    HDFS-1218. Replicas that are recovered during DN startup should not be allowed to truncate better replicas.
    
    Description: If a datanode loses power and then recovers, its replicas
                 may be truncated due to the recovery of the local FS
                 journal. This patch ensures that a replica truncated by
                 a power loss does not truncate the block on HDFS.
    Reason: Potential dataloss bug uncovered by power failure simulation
    Author: Todd Lipcon
    Ref: CDH-659

commit 46f2b3ad578ea1d2ee2cca4e6467ba2daa57df0e
Author: Todd Lipcon <todd@cloudera.com>
Date:   Fri May 14 19:34:09 2010 -0700

    HDFS-445. pread should refetch block locations when necessary
    
    Description: The positional read API in DFSInputStream was previously
                 missing any retry logic. This patch adds this logic.
    Reason: HBase and other applications depend on the pread API.
    Author: Kan Zhang
    Ref: CDH-659

commit aea067a20e16345f307de7efe80935dd7addbe6b
Author: Todd Lipcon <todd@cloudera.com>
Date:   Fri May 14 19:19:56 2010 -0700

    HDFS-1204. LeaseManager expiring leases should only expire the single file, not entire lease
    
    Reason: Logic bug in lease recovery could cause incorrectly interrupted
            writers
    Author: Sam Rash
    Ref: CDH-659

commit 10e5944da20d851a847cb2ef422383507d070085
Author: Todd Lipcon <todd@cloudera.com>
Date:   Thu May 13 16:33:15 2010 -0700

    HDFS-1242. Add unit test for the appendFile race condition / synchronization bug fixed in HDFS-142
    
    Reason: Test coverage for previously applied patch.
    Author: Todd Lipcon
    Ref: CDH-659

commit 18174a2abc5a91105ae1adc2bda026d90c41a60b
Author: Todd Lipcon <todd@cloudera.com>
Date:   Wed May 12 20:06:33 2010 -0700

    HDFS-1202. Don't try to update block scan status if block scanner is not initialized yet
    
    Reason: Fixes NPE seen at DataNode startup
    Author: Todd Lipcon
    Ref: CDH-659

commit ca9e1b3c59b05de9dc4fafa19f24dca80110bcc0
Author: Todd Lipcon <todd@cloudera.com>
Date:   Wed May 12 19:28:56 2010 -0700

    HDFS-1205. Make async disk service threads nameable
    
    Description: HDFS-611 moved some datanode operations to a separate thread
                 pool. This patch ensures that these worker threads have
                 clear names.
    Reason: Aids debugging/diagnosing of issues
    Author: Todd Lipcon
    Ref: CDH-659

commit 1b8316d403ac542772c0745159a7397c798a5698
Author: Todd Lipcon <todd@cloudera.com>
Date:   Tue May 11 16:47:47 2010 -0700

    HDFS-606. Avoid ConcurrentModification in replica invalidation
    
    Description: Replica invalidation iterated over a collection that it
                 also modified, causing a CME. This patch makes a copy
                 before iteration. Performance should be unaffected
                 as this is a rare code path.
    Reason: Avoid runtime exception in namenode
    Author: Konstantin Shvachko
    Ref: CDH-659
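
    Illustration (not part of the patch): the copy-before-iteration idea in a
    few lines of Python, where mutating a set during iteration raises
    RuntimeError (the rough analogue of Java's ConcurrentModificationException).
    The function names and the is_stale predicate are hypothetical.

```python
def invalidate_unsafe(replicas, is_stale):
    """Iterates the live set while removing from it; fails once an element is removed."""
    for replica in replicas:
        if is_stale(replica):
            replicas.remove(replica)

def invalidate_safe(replicas, is_stale):
    """Iterates over a snapshot, so removals from the live set are harmless."""
    for replica in list(replicas):
        if is_stale(replica):
            replicas.remove(replica)
```

    The snapshot costs one extra copy per call, which matches the note above
    that this is acceptable on a rarely used code path.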

commit b7f908bc77d9344c36dcc409bbfe92709b98cf88
Author: Todd Lipcon <todd@cloudera.com>
Date:   Thu May 6 08:52:18 2010 -0700

    HDFS-1244. Misc improvements to TestFileAppend2
    
    Description: Improvements made to a test case to enable it to be run
                 from the command line, with the various test parameters
                 available in arguments.
    Reason: Enable long-running stress tests of append functionality.
    Author: Todd Lipcon
    Ref: CDH-659

commit 370c9a1e75cc5d5e93cec066006ada0485139fb8
Author: Todd Lipcon <todd@lipcon.org>
Date:   Tue Jun 15 18:48:58 2010 -0700

    HDFS-1141. completeFile should check lease holder
    
    Description: Fixes a bug where a writer could finalize an in-progress
                 file after it had already lost its lease. This could occur
                 for example if the writer entered a GC pause after finishing
                 the last block but before finalizing the file.
    Reason: Potential dataloss bug with append/sync
    Author: Todd Lipcon
    Ref: CDH-659

commit 7f0d67fa52b9c58360b06e851bf77bc2f909f65f
Author: Todd Lipcon <todd@cloudera.com>
Date:   Wed May 5 14:43:40 2010 -0700

    HDFS-1215. Fix TestNodeCount to not infinite loop after HDFS-409 MiniCluster changes
    
    Description: Fixes a test to work properly after some test infrastructure
                 was changed by HDFS-142 in branch-0.20-append.
    Reason: Fixes failing test.
    Author: Todd Lipcon
    Ref: CDH-659

commit 77ac4f46fb5c011b5ac7c5fedb4c51b31580c9ba
Author: Todd Lipcon <todd@lipcon.org>
Date:   Tue Jun 15 18:33:58 2010 -0700

    HDFS-1248. Miscellaneous cleanup and improvements on 0.20 append branch
    
    Description: Miscellaneous code cleanup and logging changes, including:
     - Slight cleanup to recoverFile() function in TestFileAppend4
     - Improve error messages on OP_READ_BLOCK
     - Some comment cleanup in FSNamesystem
     - Remove toInodeUnderConstruction (was not used)
     - Add some checks for null blocks in FSNamesystem to avoid a possible NPE
     - Only log "inconsistent size" warnings at WARN level for non-under-construction blocks.
     - Redundant addStoredBlock calls are also not worthy of WARN level
     - Add some extra information to a warning in ReplicationTargetChooser
    Reason: Improves diagnosis of error cases and clarity of code
    Author: Todd Lipcon
    Ref: CDH-659

commit 46e6199d8819538d96c3f4c5dbbfba163382b2a9
Author: Todd Lipcon <todd@cloudera.com>
Date:   Mon May 3 15:02:32 2010 -0700

    HDFS-1122. Don't allow client verification to prematurely add inprogress blocks to DataBlockScanner
    
    Description: When a client reads a block that is also open for writing,
                 it should not add it to the datanode block scanner.
                 If it does, the block scanner can incorrectly mark the
                 block as corrupt, causing data loss.
    Reason: Potential dataloss with concurrent writer-reader case.
    Author: Sam Rash
    Ref: CDH-659

commit 07711a4ea3edd1a504eb9bbb13c93d5573620d34
Author: Todd Lipcon <todd@cloudera.com>
Date:   Mon May 3 12:04:49 2010 -0700

    HDFS-1057. Fixes for concurrent readers behind an appended file
    
    Description: Allows a client to read a file while it is still being
                 written by a writer, so long as the writer has called
                 sync().
    Reason: Used by HBase replication, and useful for other "tail"-like
            applications.
    Author: Sam Rash
    Ref: CDH-659

commit 587de668e43486f7109a885f617b9b757d7a649e
Author: Todd Lipcon <todd@cloudera.com>
Date:   Sat Apr 24 17:33:34 2010 -0700

    HADOOP-6722. Workaround a TCP spec quirk by not allowing NetUtils.connect to connect to itself
    
    Description: TCP's ephemeral port assignment results in the possibility
                 that a client can connect back to its own outgoing socket,
                 resulting in failed RPCs or datanode transfers.
    Reason: Fixes intermittent errors in cluster testing with ephemeral
            IPC/transceiver ports on datanodes.
    Author: Todd Lipcon
    Ref: CDH-659

commit 7a93fcc8c22b7cff87221ec0a8bf8f6689f12b82
Author: Todd Lipcon <todd@cloudera.com>
Date:   Thu Apr 22 10:24:59 2010 -0700

    HDFS-1203. Add small sleep to prevent DN flooding NN in error cases
    
    Description: If the datanode experiences an error in sending its block
                 reports to the name node, it previously would loop retrying
                 with no delay between attempts. In the case that the DN
                 is sending an invalid report, this will flood the NN with
                 RPCs. This patch adds a short sleep before the retry.
    Reason: Prevents possible flood of RPCs to the NameNode in DN error
            conditions.
    Author: Todd Lipcon
    Ref: CDH-659
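
    Illustration (not part of the patch): a Python sketch of retrying with a
    short pause between attempts. The attempt limit, the delay value, and the
    injectable sleep function are assumptions made for the example, not
    details of the datanode code.

```python
import time

def send_with_retry(send, max_attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call send() until it succeeds, pausing between failed attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except IOError:
            if attempt == max_attempts:
                raise
            # Back off instead of retrying immediately, so a persistently
            # failing sender does not flood the receiver with requests.
            sleep(delay_seconds)
```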

commit a30c033c1eed744948ddfddb82b81b06e12bba46
Author: Todd Lipcon <todd@cloudera.com>
Date:   Fri Apr 16 15:19:08 2010 -0700

    HDFS-561. Fix read timeouts in write pipeline to stage correctly
    
    Description: Previously, the read timeout on the write pipeline was
                 incorrectly calculated. This caused the client to detect
                 the wrong failed datanode when a datanode's network
                 failed or froze for another reason.
    Reason: Fix recovery behavior for frozen datanodes
    Author: Kan Zhang
    Ref: CDH-659

commit 02ab12541a004d67a96428055a58a3b726c1c4b6
Author: Todd Lipcon <todd@cloudera.com>
Date:   Thu Apr 15 01:04:43 2010 -0700

    HDFS-895. Allow hflush/sync to operate in parallel with other writers
    
    Description: Modifies synchronization of the DFSOutputStream sync feature
                 such that multiple threads can sync the same stream
                 concurrently and each will wait only the minimal amount
                 of time. Also allows further writes to continue past the
                 sync point while the sync waits.
    Reason: Substantial performance improvement for durable HBase
    Author: Todd Lipcon
    Ref: CDH-659

commit d1c4359e1abc3f3e5e4fa16ee1c83a3d7f015da3
Author: Todd Lipcon <todd@cloudera.com>
Date:   Wed Apr 14 14:59:39 2010 -0700

    HDFS-1211. BlockReceiver logs too much at INFO level when using sync()
    
    Description: Reduces the log level from INFO to DEBUG for a common message
                 in the datanode log when using the sync feature.
    Reason: Substantially reduces DN log chattiness for syncing clients.
    Author: Todd Lipcon
    Ref: CDH-659

commit 23cfa9e8263ad1d92814b5829e2f50bb37d57857
Author: todd <todd@monster01.sf.cloudera.com>
Date:   Sun Mar 21 16:25:48 2010 -0700

    HDFS-1056. Fix possible multinode deadlocks during block recovery when using ephemeral dataxceiver ports
    
    Description: Fixes the logic by which datanodes identify local RPC targets
                 during block recovery for the case when the datanode
                 is configured with an ephemeral data transceiver port.
    Reason: Potential internode deadlock for clusters using ephemeral ports
    Author: Todd Lipcon
    Ref: CDH-659

commit 08cbce1e413e98d0aaeceeaca26a60c3d9a50b29
Author: todd <todd@monster01.sf.cloudera.com>
Date:   Sun Mar 21 14:56:56 2010 -0700

    HDFS-611. Move block deletions to an async thread. Applying this to make the HDFS-142 patch apply cleanly
    
    Description: Moves the deletion of blocks in the datanode into a thread
                 pool. Substantially improves datanode heartbeat consistency
                 for workloads with heavy deletes and/or lots of disks.
    Reason: Substantially reduces frequency of "could not complete block"
            errors and needless re-replication on clusters with lots of disks
            or heavy deletes.
    Author: Zheng Shao
    Ref: CDH-659

commit 57783d0683f0d675423369e0a0f9f5dd520c17f2
Author: todd <todd@monster01.sf.cloudera.com>
Date:   Sun Mar 21 03:36:45 2010 -0700

    HDFS-1055. Improve thread naming in DN Xceiver
    
    Description: Names the threads created by the DataNode based on the action
                 they are performing.
    Reason: Eases diagnosis of datanode performance/lock contention issues.
    Author: Todd Lipcon
    Ref: CDH-659

commit fddb2bd057e88506a1bb94232426053d1640a34b
Author: todd <todd@monster01.sf.cloudera.com>
Date:   Sun Mar 21 03:36:29 2010 -0700

    HDFS-894. Fix ipcPort tracking in Datanode registration. TODO: add the test case from JIRA
    
    Description: Fixes the NameNode to properly reregister datanodes when they
                 crash and restart with a different IPC port (eg when IPC port
                 is configured to be ephemeral)
    Reason: Fixes errors on clusters with ephemeral ports.
    Author: Todd Lipcon
    Ref: CDH-659

commit bc5217543eccc2cfd8a182cdbb051b39d2abf3e7
Author: Dhruba Borthakur <dhruba@apache.org>
Date:   Fri Jun 11 23:37:38 2010 +0000

    HDFS-1054. remove sleep before retry for allocating a block.
    
    Description: When the write pipeline fails to allocate a new block,
                 it previously slept for a hard-coded 6 seconds before
                 retrying. This sleep has little reasoning behind it,
                 so is removed.
    Reason: Improve failure recovery performance for interactive applications
            like HBase.
    Author: Todd Lipcon
    Ref: CDH-931

commit 870c7526a3e6a632eb23cf14f9011f279181a759
Author: Dhruba Borthakur <dhruba@apache.org>
Date:   Thu Jun 10 22:25:39 2010 +0000

    HDFS-142. Blocks that are being written by a client are stored in the blocksBeingWritten directory.
    
    git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append@953482 13f79535-47bb-0310-9956-ffa450edef68
    
    Description: Moves blocks being written by clients into a different
                 directory in dfs.data.dir. Also fixes several other bugs
                 in the datanode and namenode to support various error
                 conditions related to append and sync.
    Reason: Necessary for proper recovery of synced data in several error conditions.
    Author: Dhruba Borthakur, Nicolas Spiegelberg, Todd Lipcon
    Ref: CDH-659

commit 8e888717294496caae825d7f3f609d0661e7997a
Author: Dhruba Borthakur <dhruba@apache.org>
Date:   Thu Jun 10 18:46:03 2010 +0000

    HDFS-826. Allow a mechanism for an application to detect that datanode(s) have died in the write pipeline. (dhruba)
    
    Description: Adds an API in DFSOutputStream to determine the current length
                 of the write pipeline.
    Reason: Necessary for better reliability of HBase write-ahead logs.
    Author: Dhruba Borthakur
    Ref: CDH-931

commit 8fcb419648160efaed6fdd467875c3b1743d2bee
Author: Dhruba Borthakur <dhruba@apache.org>
Date:   Wed Jun 9 23:12:21 2010 +0000

    HDFS-988. Fix bug where saveNamespace can corrupt edits log.
    
    Description: Fixes several synchronization errors in the NameNode and ensures
                 that all edits have been synced to the edits log before
                 the namespace is saved.
    Reason: Fixes potential data corruption bug.
    Author: Todd Lipcon
    Ref: CDH-1436

commit f5ace5f920bc16fd202a6e4a53fe0ffe0cb5045e
Author: Todd Lipcon <todd@lipcon.org>
Date:   Thu May 20 01:23:15 2010 -0700

    HDFS-101. Datanodes should continue to forward acks until client stops pipeline.
    
    Description: When one node in the pipeline dies, the datanodes in between the client
                 and the dead node should stay alive and continue to forward acks until
                 the client stops the pipeline. This fixes an issue where the client
                 would incorrectly determine that the local DN had failed when in fact
                 another DN in the pipeline was at fault.
    Reason: Common source of failed pipeline recovery in cluster fault testing
    Author: Hairong Kuang, Todd Lipcon
    Ref: CDH-693

commit 132ef7c852847e9d2c1e7879f2fca26652bb77ef
Author: Dhruba Borthakur <dhruba@apache.org>
Date:   Fri Jun 4 07:20:10 2010 +0000

    HDFS-200. Support append and sync for hadoop 0.20 branch.
    
    Description: Provides basic support for append and sync on 0.20
    Reason: Append and sync required for durable HBase and many other
            applications.
    Author: Dhruba Borthakur
    Ref: CDH-659

commit 092bcd174dbf609f5002078490c357462e0ce8b1
Author: Konstantin Shvachko <shv@apache.org>
Date:   Wed Apr 21 03:05:45 2010 +0000

    HDFS-909. Fix race in edit log rolling
    
    Description: Fixes a race condition when rolling edit logs that can corrupt
                 the logs.
    Reason: Potential namenode metadata corruption bug.
    Author: Todd Lipcon
    Ref: CDH-1174

commit e2a78f767d26b838bf67354a4b85235ddd731038
Author: Eli Collins <eli@cloudera.com>
Date:   Fri Jun 18 14:41:14 2010 -0700

    CLOUDERA-BUILD. Update hadoop-config.sh to reflect new jar version.

commit 1756e97a35451bbc01a493e843f1ec0885c99792
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Jun 18 11:37:22 2010 -0700

    MAPREDUCE-1644. Remove Sqoop from Apache Hadoop (moving to github)
    
    Description: Sqoop is moving to github! All code for sqoop is already live at
    http://github.com/cloudera/sqoop - this issue removes the duplicate code from the Apache Hadoop
    repository. CDH users should install the separate 'sqoop' package for this functionality.
    Reason: Moving to a separate package
    Author: Aaron Kimball
    Ref: CDH-1404

commit e0afb34b89a013419fca4bdcda5f2cf0401f93ca
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Thu Jun 17 19:06:50 2010 -0700

    MAPREDUCE-1302. TrackerDistributedCacheManager can delete file asynchronously
    
    Description: With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to delete
    files from the distributed cache asynchronously. That will help make task initialization faster,
    because task initialization calls the code that localizes files into the cache and may delete
    some other files. The deletion can slow down task initialization.
    Reason: Performance improvement
    Author: Zheng Shao
    Ref: CDH-495

commit 456821d6934fd769ab317c2290a4ff53b075269e
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Thu Jun 17 19:04:31 2010 -0700

    HADOOP-6433. Add AsyncDiskService that is used in both hdfs and mapreduce
    
    Description: create a thread pool per disk volume, and use that for scheduling async disk
    operations.
    Reason: Improvement
    Author: Zheng Shao
    Ref: CDH-495
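
    Illustration (not part of the patch): a Python sketch of the "one thread
    pool per disk volume" idea. The class name, method names, and the
    threads_per_volume parameter are hypothetical; the real AsyncDiskService
    API is not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor

class AsyncDiskServiceSketch:
    """Keeps a separate worker pool for each volume so one slow disk
    does not hold up operations queued for the others."""

    def __init__(self, volumes, threads_per_volume=1):
        self._pools = {
            volume: ThreadPoolExecutor(max_workers=threads_per_volume)
            for volume in volumes
        }

    def execute(self, volume, task, *args):
        """Schedule task(*args) on the pool that belongs to `volume`."""
        return self._pools[volume].submit(task, *args)

    def shutdown(self):
        """Wait for queued work on every volume, then stop the pools."""
        for pool in self._pools.values():
            pool.shutdown(wait=True)
```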

commit 6e467c42d62aafd00fd2f38269806680427631c8
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Thu Jun 17 18:50:47 2010 -0700

    MAPREDUCE-1213. TaskTrackers restart is very slow because it deletes distributed cache directory synchronously
    
    Description: We are seeing that when we restart a tasktracker, it tries to recursively delete all
    the files in the distributed cache. It invokes FileUtil.fullyDelete(), which is very slow. This
    means that the TaskTracker cannot join the cluster for an extended period of time (up to 2 hours for
    us). The problem is acute if the number of files in the distributed cache is a few thousand.
    Reason: Performance
    Author: Zheng Shao
    Ref: CDH-495

commit 5626a0e301557dbc93ad5084aa9ef4527316db7b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Thu Jun 17 18:45:58 2010 -0700

    MAPREDUCE-1443. DBInputFormat can leak connections
    
    Description: The DBInputFormat creates a Connection to use when enumerating splits, but never closes
    it. This can leak connections to the database which are not cleaned up for a long time.
    Reason: Bug
    Author: Aaron Kimball
    Ref: CDH-1435
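
    Illustration (not part of the patch): a Python sketch of closing the
    connection used to enumerate splits. It uses sqlite3 purely as a stand-in
    database; the function name, the row-count query, and the even chunking
    are assumptions for the example, not DBInputFormat's logic.

```python
import sqlite3
from contextlib import closing

def enumerate_splits(connect, table, num_splits):
    """Count rows over a short-lived connection, then derive (start, end) ranges."""
    # closing() guarantees conn.close() runs even if the query raises.
    with closing(connect()) as conn:
        (count,) = conn.execute("SELECT COUNT(*) FROM " + table).fetchone()
    chunk = max(1, -(-count // num_splits))  # ceiling division, at least 1
    return [(start, min(start + chunk, count)) for start in range(0, count, chunk)]
```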

commit 912eed1c5d50066e68700d2143b775914d7f8e54
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Thu Jun 17 16:00:49 2010 -0700

    MAPREDUCE-1489. DataDrivenDBInputFormat should not query the database when generating only one split
    
    Description: DataDrivenDBInputFormat runs a query to establish bounding values for each split it
    generates; but if it's going to generate only one split (mapreduce.job.maps == 1), then there's no
    reason to do this. This will remove overhead associated with a single-threaded import of a
    non-indexed table since it avoids a full table scan.
    Reason: Improvement
    Author: Aaron Kimball
    Ref: CDH-1431
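
    Illustration (not part of the patch): a Python sketch of skipping the
    bounding query when only one split is requested. The "1=1" catch-all
    clause, the "id" column, and the query_bounds callback are assumptions
    for the example.

```python
def generate_splits(num_maps, query_bounds):
    """Return one WHERE clause per split; query the bounds only when needed."""
    if num_maps == 1:
        # A single split covers the whole table, so no bounding query runs.
        return ["1=1"]
    low, high = query_bounds()  # e.g. a MIN/MAX query on the split column
    step = (high - low) / num_maps
    clauses = []
    for i in range(num_maps):
        lower = low + i * step
        upper = low + (i + 1) * step
        # The last split includes the upper bound so no row is missed.
        op = "<=" if i == num_maps - 1 else "<"
        clauses.append("id >= %s AND id %s %s" % (lower, op, upper))
    return clauses
```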

commit 1c3fc82063212196fd2fac7f55df8eb323e8f601
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Tue Apr 27 11:44:29 2010 -0700

    MAPREDUCE-1728. Oracle timezone strings do not match Java
    
    Description: OracleDBRecordReader sets the session timezone based on the toString representation of
    the current java.util.TimeZone. This is incorrect; Oracle manages a separate database of acceptable
    timezone strings, whose string representations are different from the timezone representations
    recognized by Java.
    Reason: Bug
    Author: Aaron Kimball
    Ref: CDH-961

commit 11bc9be1ff2fd994046acd660afa7631f9203cfb
Author: Eli Collins <eli@cloudera.com>
Date:   Thu May 27 17:44:00 2010 -0700

    HADOOP-6714. FsShell 'hadoop fs -text' does not support compression codecs.
    
    Currently, 'hadoop fs -text myfile' looks at the first few magic bytes
    of a file to determine whether it is gzip compressed or a sequence
    file. This means 'fs -text' cannot properly decode .deflate or .bz2
    files (or other codecs specified via configuration).
    
    Reason: Improvement
    Author: Eli Collins
    Ref: CDH-1136

commit e95781032b5d886aa6583cab1306025fe372babf
Author: Eli Collins <eli@cloudera.com>
Date:   Tue May 25 13:20:00 2010 -0700

    HADOOP-1849. IPC server max queue size should be configurable.
    
    Description: Currently max queue size for IPC server is set to (100 *
    handlers). Usually when RPC failures are observed (e.g. HADOOP-1763),
    we increase number of handlers and the problem goes away. I think a
    big part of such a fix is increase in max queue size. I think we
    should make maxQsize per handler configurable (with a bigger default
    than 100). There are other improvements also (HADOOP-1841).  Server
    keeps reading RPC requests from clients. When the number of in-flight
    RPCs is larger than maxQsize, the earliest RPCs are deleted. This is
    the main feedback Server has for the client. I have often heard from
    users that Hadoop doesn't handle bursty traffic.
    
    Say handler count is 10 (default) and Server can handle 1000 RPCs a
    sec (quite conservative/low for a typical server), it implies that an
    RPC can wait for only 1 sec before it is dropped. If there are 3000
    clients and all of them send RPCs around the same time (not very rare,
    with heartbeats etc), 2000 will be dropped. Instead of dropping the
    earliest RPCs, if the server delays reading new RPCs, the feedback to
    clients would be much smoother. I will file another jira regarding queue
    management.
    
    For this jira I propose to make queue size per handler configurable,
    with a larger default (maybe 500).
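
    The arithmetic in this description can be checked with a small Python
    sketch. The function names are hypothetical, and the model (a burst that
    arrives all at once, a fixed service rate) is a simplifying assumption
    rather than a description of the actual server code.

```python
def max_queue_size(handlers, per_handler):
    """Total queue capacity: per-handler queue size times handler count."""
    return handlers * per_handler


def dropped_in_burst(burst, queue_size):
    """RPCs that do not fit when `burst` requests arrive at the same time."""
    return max(0, burst - queue_size)


def max_wait_seconds(queue_size, rpcs_per_second):
    """Longest time a queued RPC waits if the queue is full."""
    return queue_size / rpcs_per_second


current = max_queue_size(handlers=10, per_handler=100)   # 1000
proposed = max_queue_size(handlers=10, per_handler=500)  # 5000
```

    With the current sizing, a burst of 3000 RPCs overflows the queue by 2000;
    with 500 per handler the same burst fits, at the cost of a longer worst-case
    wait.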
    
    Reason: Improvement
    Author: Eli Collins
    Ref: CDH-1133

commit 776a20d37142534751178b060285d2813cc66c1c
Author: Eli Collins <eli@cloudera.com>
Date:   Tue May 25 13:09:30 2010 -0700

    HADOOP-6724. IPC doesn't properly handle IOEs thrown by socket factory.
    
    Description: If the socket factory throws an IOE inside
    setupIOStreams, then handleConnectionFailure will be called with
    socket still null, and thus generate an NPE on socket.close(). This
    ends up orphaning clients, etc.
    
    Reason: Bug fix
    Author: Eli Collins
    Ref: CDH-1132

commit 1864359f4ef32974ed41a1278e640e1ee246ef9b
Author: Eli Collins <eli@cloudera.com>
Date:   Tue May 25 13:05:38 2010 -0700

    HADOOP-6723. Unchecked exceptions thrown in IPC connection should not orphan clients.
    
    Description: If the server sends back some malformed data, for
    example, receiveResponse() can end up with an incorrect call ID. Then,
    when it tries to find it in the calls map, it will end up with null
    and throw NPE in receiveResponse. This isn't caught anywhere, so the
    original IPC client ends up hanging forever instead of catching an
    exception. Another example is if the writable implementation itself
    throws an unchecked exception or OOME.
    
    We should catch Throwable in Connection.run() and shut down the
    connection if we catch one.
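
    A minimal Python sketch of the proposed behavior: the receive loop catches
    any exception and fails every pending call, so callers are woken up instead
    of waiting forever. PendingCall and run_connection are hypothetical names;
    the real fix is in the Java IPC client, and Java's Throwable is broader than
    Python's Exception.

```python
import threading


class PendingCall:
    """A caller waiting for a response (hypothetical stand-in for an IPC call)."""

    def __init__(self):
        self.done = threading.Event()
        self.value = None
        self.error = None

    def complete(self, value):
        self.value = value
        self.done.set()

    def fail(self, error):
        self.error = error
        self.done.set()


def run_connection(read_response, calls):
    """Deliver responses to pending calls; on any error, fail them all."""
    try:
        while calls:
            call_id, value = read_response()
            # An unknown call id raises KeyError here, standing in for the
            # NPE described above.
            calls.pop(call_id).complete(value)
    except Exception as exc:
        # Shut the connection down: wake every waiter with the error.
        for call in list(calls.values()):
            call.fail(exc)
        calls.clear()
```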
    
    Reason: Bug fix
    Author: Eli Collins
    Ref: CDH-1131

commit 95d64157f05d467dad3e1190a5cba2a3f89b0925
Author: Eli Collins <eli@cloudera.com>
Date:   Thu May 20 17:15:13 2010 -0700

    CLOUDERA-BUILD. Rename the fuse_dfs wrapper.
    
    Description: Rename the fuse_dfs wrapper to hadoop-fuse-dfs.
    
    Reason: Improvement
    Author: Alex Newman
    Ref: CDH-1103

commit d8c973d9c6f650032c88915d9fef6f4a568d37a5
Author: Chad Metcalf <chad@cloudera.com>
Date:   Wed May 19 15:38:14 2010 -0700

    CLOUDERA-BUILD. Fixes for the fuse_dfs wrapper.
    
    Description: The wrapper uses bash syntax (i.e., +=) so we should use
    bash. We need to modprobe fuse explicitly on Ubuntu. Since this is
    installed by install_hadoop.sh we know HADOOP_HOME and should use it
    directly. Lastly, there is more robust JAVA_HOME checking in
    hadoop-config.sh so we should use that.
    
    Reason: Fuse currently broken on Ubuntu
    Author: Chad Metcalf
    Ref: CDH-1089

commit e810911445859693ee0b868c2a5d8bc18360cdb9
Author: Eli Collins <eli@cloudera.com>
Date:   Tue May 18 14:30:04 2010 -0700

    HDFS-1161. Make DN minimum valid volumes configurable
    
    Description: This change adds a dfs.datanode.failed.volumes.tolerated parameter so that users can configure the number of volumes that are allowed to fail before a datanode stops offering service. By default any volume failure will cause a datanode to shut down.
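
    The rule described here can be sketched in a few lines of Python. The
    function name is hypothetical, and the default of 0 reflects the statement
    above that any volume failure stops the datanode by default.

```python
def should_shut_down(failed_volumes, tolerated=0):
    """Return True once more volumes have failed than the configured tolerance.

    `tolerated` plays the role of dfs.datanode.failed.volumes.tolerated.
    """
    return failed_volumes > tolerated
```

    For example, with the tolerance set to 1, a datanode keeps serving after
    one failed volume and stops after the second.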
    
    Reason: Improvement
    Author: Eli Collins
    Ref: CDH-1081

commit baa77bdde4fd971877418391a4fe491c2d4c2501
Author: Eli Collins <eli@cloudera.com>
Date:   Mon May 17 19:49:44 2010 -0700

    HDFS-1160. Improve some FSDataset warnings and comments.
    
    Description: Cleans up HDFS-547 warnings.
    
    Reason: Improvement
    Author: Eli Collins
    Ref: CDH-1080

commit 90f5a4bf77d17adcabb834a3cc2e02becb9f012d
Author: Eli Collins <eli@cloudera.com>
Date:   Mon May 17 18:53:50 2010 -0700

    HDFS-612. FSDataset should not use org.mortbay.log.Log.
    
    Description: Cleans up HDFS-547 logging.
    
    Reason: Improvement
    Author: Eli Collins
    Ref: CDH-1079

commit 4a925fe53a2015e504cd8c8796e0e590d22019c4
Author: Eli Collins <eli@cloudera.com>
Date:   Thu Apr 22 14:41:08 2010 -0700

    HDFS-457. Better handling of volume failure in Data Node storage.
    
    Description: Current implementation shuts DataNode down completely when one of the configured volumes of the storage fails. This is rather wasteful behavior because it decreases utilization (good storage becomes unavailable) and imposes extra load on the system (replication of the blocks from the good volumes). These problems will become even more prominent when we move to mixed (heterogeneous) clusters with many more volumes per Data Node.
    
    Reason: Improvement
    Author: Eli Collins
    Ref: CDH-472

commit 3af9533ee6f260373f302ff4a16dd04eb75e0616
Author: Chad Metcalf <chad@cloudera.com>
Date:   Mon Mar 1 15:28:19 2010 -0800

    CLOUDERA-BUILD. hadoop-config runs before hadoop-env.sh
    
        conf/hadoop-env.sh says you can update JAVA_HOME there, but it gets
        sourced after hadoop-config.sh, which errors out if JAVA_HOME is not
        set. This patch changes the flow so hadoop-env is always sourced by
        hadoop-config after the --config flag is processed. This will allow
        JAVA_HOME to be set in hadoop-env and still allow for trying to find a valid
        JAVA_HOME.

commit c9295d4ac2848403362e5dbaa78aa7be4ce4254e
Author: Eli Collins <eli@cloudera.com>
Date:   Sat May 15 13:39:08 2010 -0700

    HADOOP-3659. Fix hadoop native to compile on Mac OS X.
    
    Description: This patch makes the autoconf script work on Mac OS X. LZO needs to be installed (including the optional shared libraries) for the compile to succeed. You'll want to regenerate the configure script using autoconf after applying this patch.
    
    Reason: Bug fix
    Author: Eli Collins
    Ref: CDH-825

commit cc035175e1cf1ddef878cba6aa93725f832d0327
Author: Eli Collins <eli@cloudera.com>
Date:   Sat May 15 12:55:06 2010 -0700

    MAPREDUCE-1785. Add streaming config option for not emitting the key.
    
    Description: PipeMapper currently does not emit the key when using TextInputFormat. If you switch to other input formats (e.g. LzoTextInputFormat) the key will be emitted. We should add an option so users can explicitly make streaming not emit the key so they can change input formats without breaking or having to modify their existing programs.
    
    Reason: Improvement
    Author: Eli Collins
    Ref: CDH-856

commit 590a82c257842be51170619deafd15cc2988541e
Author: Eli Collins <eli@cloudera.com>
Date:   Thu May 13 21:25:53 2010 -0700

    HADOOP-4885. Try to restore failed replicas of Name Node storage (at checkpoint time).
    
    Description: If one of the replicas of the NameNode storage fails for whatever reason (for example, a temporary failure of NFS) this Storage object is removed from the list of storage objects forever. It can be added back only on restart of the NameNode. We propose to check the status of a failed storage on every checkpoint and, if it becomes valid, try to restore the edits and fsimage.
    
    Reason: Improvement
    Author: Eli Collins
    Ref: CDH-473

commit 0f2f19e1bd5725f6163998ae86d9103c0d552de3
Author: Eli Collins <eli@cloudera.com>
Date:   Thu May 13 20:07:02 2010 -0700

    HDFS-1024. SecondaryNamenode fails to checkpoint because namenode fails with CancelledKeyException.
    
    Description: The secondary namenode fails to retrieve the entire fsimage from the Namenode. It fetches a part of the fsimage but believes that it has fetched the entire fsimage file and proceeds ahead with the checkpointing.
    
    Reason: Bug fix
    Author: Eli Collins
    Ref: CDH-891

commit 0ec1d6ed85a30327c657c2418932728d0e4e98df
Author: Todd Lipcon <todd@lipcon.org>
Date:   Wed May 12 21:33:45 2010 -0700

    HADOOP-6254. Slow reads cause s3n to fail with SocketTimeoutException
    
    Reason: Bug fix for users of s3n:// file system
    Author: Andrew Hitchcock
    Ref: CDH-1035

commit d64943401780c3dd1dc498419f33ded8222c3210
Author: Eli Collins <eli@cloudera.com>
Date:   Wed May 12 12:05:26 2010 -0700

    HADOOP-6667. RPC.waitForProxy should retry through NoRouteToHostException.
    
    Description: RPC.waitForProxy already loops through ConnectExceptions, but NoRouteToHostException is not a subclass of ConnectException. In the case that the NN is on a VIP, the No Route To Host error is reasonably common during a failover, so we should retry through it just the same as the other connection errors.
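
    A Python sketch of the retry rule, under the assumption that "no route to
    host" surfaces as an OSError with errno EHOSTUNREACH (the Python analogue of
    Java's NoRouteToHostException). wait_for_proxy and is_retryable are
    hypothetical names and do not mirror the Java method's signature.

```python
import errno
import time


def is_retryable(exc):
    """Connection refused and no-route-to-host are both worth retrying."""
    if isinstance(exc, ConnectionRefusedError):
        return True
    return isinstance(exc, OSError) and exc.errno == errno.EHOSTUNREACH


def wait_for_proxy(connect, attempts=5, delay=0.0):
    """Call connect() until it succeeds, retrying through retryable errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return connect()
        except OSError as exc:
            if not is_retryable(exc):
                raise  # Anything else is reported immediately.
            last_error = exc
            time.sleep(delay)
    raise last_error
```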
    
    Reason: Improvement
    Author: Eli Collins
    Ref: CDH-907

commit a5fb4a8c8bf9d6a3a96c3a06eb3a46febaf21a0f
Author: Todd Lipcon <todd@cloudera.com>
Date:   Fri May 7 15:36:14 2010 -0700

    MAPREDUCE-1375. TestFileArgs fails intermittently
    
    Description: Fixes an error in a test case without modifying code. This is an amendment to the prior fix which did not address the issue properly.
    Reason: Should fix flaky tests.
    Author: Todd Lipcon
    Ref: CDH-657

commit 148d291aa14a4481dc206d2fc9a8527eb6761488
Author: newalex <newalex@ubuntu64-build01.(none)>
Date:   Fri Apr 16 15:48:14 2010 -0700

    CLOUDERA-BUILD. Add a fuse manpage
    
    Description: Adding a fuse_dfs manpage and adding a manpage to the build.
    Reason: New Feature
    Author: Alex Newman
    Ref: CDH-927

commit 9acfd39492f85c92bc45d47d6dcfb309e3826c64
Author: newalex <newalex@centos64-build01.sf.cloudera.com>
Date:   Thu Apr 8 10:35:19 2010 -0700

    CLOUDERA-BUILD. Build script changes to build DEB packages
    
    Description: The required changes to the cloudera hadoop building scripts for pulling the fuse files out and cleaning up its mess with respect to DEBs.
    Reason: Building packages
    Author: Alex Newman
    Ref: CDH-929

commit d144085817496eecc57c510022d66d0540b4511d
Author: newalex <newalex@centos64-build01.sf.cloudera.com>
Date:   Tue Apr 6 14:05:29 2010 -0700

    CLOUDERA-BUILD. Added an RPM for fuse
    
    Description: The required changes to the cloudera hadoop building scripts for pulling the fuse files out and cleaning up its mess.
    Reason: Building packages
    Author: Alex Newman
    Ref: CDH-928

commit 56648efe291503249fec22a242917ec4dddc6214
Author: Eli Collins <eli@cloudera.com>
Date:   Tue Mar 30 15:17:50 2010 -0700

    HADOOP-6522. Fix decoding of codepoint zero in UTF8.
    
    Description: TestUTF8 is actually flaky. It generates 10 random strings to run the test on. If you change this number to 100000 it fails every time. The problem is that the null character (codepoint zero) was correctly encoded but incorrectly decoded. I've attached a patch that fixes this and increases the size of the tests so that problems like this will likely be discovered sooner.
    
    Reason: Bugfix to UTF8
    Author: Eli Collins
    Ref: CDH-718

commit 936a67ba3b34dc8c8efd3df92d9e50309fafb8f6
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Mon Mar 29 23:50:14 2010 -0700

    MAPREDUCE-1460. Oracle support in DataDrivenDBInputFormat
    
    Description: DataDrivenDBInputFormat does not work with Oracle due to various SQL syntax issues.
    Reason: Required for Sqoop/Oracle integration
    Author: Aaron Kimball
    Ref: CDH-888

commit c08f94a6927f9c8b0dfaeb674835afdd3fdd1d08
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Mon Mar 29 17:15:53 2010 -0700

    MAPREDUCE-1569. Mock Contexts & Configurations
    
    Description: Currently the library creates a new Configuration object in the MockMapContext and
    MockReduceContext constructors, rather than allowing the developer to configure and pass their own
    Reason: Feature improvement for MRUnit
    Author: Chris White
    Ref: CDH-838

commit 27cfda1de80048bf2b46d74d78b61275ecc79be1
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Mon Mar 29 16:43:49 2010 -0700

    MAPREDUCE-1536. DataDrivenDBInputFormat does not split date columns correctly.
    
    Description: The DateSplitter does not properly split a range of (min, max) dates.
    Reason: Bugfix to DateSplitter
    Author: Aaron Kimball
    Ref: CDH-813

commit 7fc6e48e296c30f0afa8ae8da668bddbc9f422bf
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Mon Mar 29 16:11:22 2010 -0700

    MAPREDUCE-1480. CombineFileRecordReader does not properly initialize child RecordReader
    
    Description: CombineFileRecordReader instantiates child RecordReader instances but never calls their initialize() method to give them the proper TaskAttemptContext.
    Reason: Bug in CombineFileInputFormat prevents proper use.
    Author: Aaron Kimball
    Ref: CDH-811

commit 32330fbadb4aed16627397979b90d52f2474ef38
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Mon Mar 29 15:50:20 2010 -0700

    MAPREDUCE-1423. Improve performance of CombineFileInputFormat when multiple pools are configured
    
    Description: I have a map-reduce job that is using CombineFileInputFormat. It has configured 10000
    pools and 30000 files. The time to create the splits takes more than an hour. The reason is that
    CombineFileInputFormat.getSplits() converts the same path from String to Path object multiple times,
    once for each instance of a pool. Similarly, it calls Path.toUri() multiple times. This code can be
    optimized.
    
    Reason: Improves CombineFileInputFormat performance (used by Sqoop); needed to apply MAPREDUCE-1480 cleanly
    Author: Dhruba Borthakur
    Ref: CDH-811

commit 6906389e07244931a108f2930544b9feada3a487
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Mon Mar 29 15:41:38 2010 -0700

    MAPREDUCE-364. Change org.apache.hadoop.examples.MultiFileWordCount to use new mapreduce api.
    
    Description: Updates MultiFileWordCount example to use the new API in
    org.apache.hadoop.mapreduce instead of the deprecated API of
    org.apache.hadoop.mapred.
    
    This incorporates MAPREDUCE-367: Change org.apache.hadoop.mapred.lib.CombineFileInputFormat
    to use the new api.
    
    This solves duplicate issue MAPREDUCE-1112: Fix CombineFileInputFormat for hadoop 0.20
    
    Reason: CombineFileInputFormat required for many clients of the new API, including Sqoop.
    Author: Amareshwari Sriramadasu
    Ref: CDH-811

commit 4b592cf8cb44c018f86abe529d71434d5106ce1e
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Mon Mar 29 13:07:15 2010 -0700

    HADOOP-6382. Publish hadoop jars to apache mvn repo.
    
    Description: This provides an 'ant mvn-install' command that will
    install Hadoop core, streaming, examples, etc. jars in a maven repository.
    
    Uses the maven ant task to publish hadoop 20 jars to the apache maven repo.
    Reason: Required for cross-distribution dependency management in downstream projects (e.g., sqoop)
    Author: Giridharan Kesavan
    Ref: CDH-402

commit 8424e32eb866d677f40a9446f9c4cf74972b751e
Author: Chad Metcalf <chad@cloudera.com>
Date:   Thu Mar 18 17:05:47 2010 -0700

    HADOOP-6643. Set executable bit for python cloud scripts in the distribution
    
    Description: This needs to be set in the tar target.
    Reason: Required for the EC2 scripts.
    Author: Tom White
    Ref: CDH-821

commit cfc3233ece0769b11af9add328261295aaf4d1ad
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:56:30 2010 -0800

    CLOUDERA-BUILD. Fix ivy xml after rebase. Removed a redundant </dependencies> closing tag.
    
    Author: Matt Massie

commit 54e1aefdd7a25a539831cac2c9b1bc3597f119ea
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:56:07 2010 -0800

    CLOUDERA-BUILD. Small tweaks and fixes to Cloudera styling:
    
    Description:
        - Fixes trivial CSS bug for missing table cell borders in Chrome
        - Fixes footer to read "Distribution for Hadoop" instead of "Distribution of Hadoop"
    
    Author: Todd Lipcon

commit ea83036b3838fa97c673e73145d52867b8ace6ac
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:55:30 2010 -0800

    HDFS-1013. Miscellaneous improvements to HTML markup for web UIs
    
    Description: The Web UIs have various bits of bad markup (e.g. missing <head> sections, some pages missing CSS links, inconsistent td vs th for table headings). We should fix this up.

        Improve markup and add Cloudera styling to Web UIs
    
        This adds a favicon and a number of HTML/CSS improvements to make the
        pages more space-efficient and easy on the eyes.
    
        This may be an incompatible change for users who are scraping the HTML
        output of the web UIs. Those users are encouraged to access the data
        programmatically rather than through scraping.
    
        The non-Cloudera-specific improvements will be contributed upstream
        as HDFS-1013 and MAPREDUCE-1544.
    Reason: User experience improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 90ba5543e4c3176343e23943131a34d666c23d89
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:54:58 2010 -0800

    MAPREDUCE-1436. Deadlock in preemption code in fair scheduler
    
    Description: In testing the fair scheduler with preemption, I found a deadlock between updatePreemptionVariables and some code in the JobTracker. This was found while testing a backport of the fair scheduler to Hadoop 0.20, but it looks like it could also happen in trunk and 0.21. Details are in a comment below.

    The fair scheduler introduces a potential jobtracker deadlock which
    was fixed on trunk by MAPREDUCE-870. This patch adjusts the locking
    in 0.20-based MapReduce to prevent this condition.
    
    Reason: bugfix (deadlock)
    Author: Matei Zaharia
    Ref: UNKNOWN

commit 6f04e94feee3f40a73449cc6fbe7b4e3c48f1fc4
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:54:13 2010 -0800

    HDFS-696. Java assertion failures triggered by tests
    
    Description: Re-purposing as catch-all ticket for assertion failures when running tests with java asserts enabled. Running with the attached patch on trunk@823732 the following tests all trigger assertion failures:

    TestAccessTokenWithDFS
    TestInterDatanodeProtocol
    TestBackupNode
    TestBlockUnderConstruction
    TestCheckpoint
    TestNameEditsConfigs
    TestStartup
    TestStorageRestore

        Disable failing asserts (see HDFS-696).
    
        Disabled asserts in HDFS that cause unit tests to fail.
        These will be re-enabled at a later date when the underlying cause is fixed
        upstream. In the meantime, these are disabled to keep our CI server returning
        only new failures. Issue HDFS-696 lists the failing tests and tracks their
        progress.
    Reason: Test harness improvement
    Author: Eli Collins
    Ref: UNKNOWN

commit 74b80b9c9490bba1a1120f3a9376d2f21f3763b6
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:53:38 2010 -0800

    MAPREDUCE-1093. Java assertion failures triggered by tests
    
    Description:
        Removes failing asserts from the CDH build until they are fixed in trunk.
        Tracking MAPREDUCE-1506 to include a fix for this assertion failure.
    Reason: Test harness improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit b4be440cd928976544bcbeb7e10566fc523dbd0c
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:53:13 2010 -0800

    MAPREDUCE-1092. Enable asserts for tests by default
    
    Description: See HADOOP-6309, "Enable asserts for tests by default" (http://issues.apache.org/jira/browse/HADOOP-6309). Let's make the tests run with java asserts by default.
    Reason: Test coverage improvement
    Author: Eli Collins
    Ref: UNKNOWN

commit 5e7fb9843f99f5e1023f2723210f26ac0c33323b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:52:45 2010 -0800

    MAPREDUCE-1375. TestFileArgs fails intermittently
    
    Description: TestFileArgs failed once for me with the following error

    expected:<[job.jar
    sidefile
    tmp
    ]> but was:<[]>
    sidefile
    tmp
    ]> but was:<[]>
            at org.apache.hadoop.streaming.TestStreaming.checkOutput(TestStreaming.java:107)
            at org.apache.hadoop.streaming.TestStreaming.testCommandLine(TestStreaming.java:123)
    
        This test was flaky due to trying to write some data into /bin/ls.
        Depending on the speed of the test run, this sometimes resulted
        in a Broken Pipe on flush() which caused the test to fail.
    
    Reason: Bugfix (race condition in test)
    Author: Todd Lipcon
    Ref: UNKNOWN

commit ae699cda01c093097ae723224553773247577aa2
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:52:32 2010 -0800

    HDFS-961. dfs_readdir incorrectly parses paths
    
    Description: fuse-dfs dfs_readdir assumes that DistributedFileSystem#listStatus returns Paths with the same scheme/authority as the dfs.name.dir used to connect. If NameNode.DEFAULT_PORT port is used listStatus returns Paths that have authorities without the port (see HDFS-960, "DistributedFileSystem#makeQualified port inconsistency", http://issues.apache.org/jira/browse/HDFS-960), which breaks the following code.

    // hack city: todo fix the below to something nicer and more maintainable but
    // with good performance
    // strip off the path but be careful if the path is solely '/'
    // NOTE - this API started returning filenames as full dfs uris
    const char *const str = info[i].mName + dfs->dfs_uri_len + path_len + ((path_len == 1 && *path == '/') ? 0 : 1);

    Let's make the path parsing here more robust. listStatus returns normalized paths so we can find the start of the path by searching for the 3rd slash. A more long term solution is to have hdfsFileInfo maintain a path object or at least pointers to the relevant URI components.
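
    The "search for the 3rd slash" approach can be sketched in Python as
    follows. The function name is hypothetical and the actual fix is in the
    fuse-dfs C code; the sketch assumes a normalized URI of the form
    scheme://authority/path.

```python
def path_after_authority(uri):
    """Return the path part of a normalized URI by locating the third slash.

    The first two slashes belong to "scheme://"; the third one starts the path,
    whether or not the authority includes a port.
    """
    index = -1
    for _ in range(3):
        index = uri.find("/", index + 1)
        if index == -1:
            raise ValueError("expected scheme://authority/path, got %r" % uri)
    return uri[index:]
```

    Because the result does not depend on the length of the authority, a URI
    with an explicit port and one without it yield the same path.
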
    Reason: bugfix
    Author: Eli Collins
    Ref: UNKNOWN

commit 7f9f42b27b109eff6fafc6ee24526fcadaf68d69
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:52:23 2010 -0800

    MAPREDUCE-1467. Add a --verbose flag to Sqoop
    
    Description: Need a --verbose flag that sets the log4j level to DEBUG.
    Reason: Logging improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit db680058f5796fc41d61242d60bc86b1b25facf9
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:52:07 2010 -0800

    MAPREDUCE-1469. Sqoop should disable speculative execution in export
    
    Description: Concurrent writers of the same output shard may cause the database to try to insert duplicate primary keys concurrently. Not a good situation. Speculative execution should be forced off for this operation.
    Reason: Bugfix (race condition)
    Author: Aaron Kimball
    Ref: UNKNOWN

commit a5ccc56a79fc53de5ff16c6cb996f41a4216c28d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:51:29 2010 -0800

    MAPREDUCE-1341. Sqoop should have an option to create hive tables and skip the table import step
    
    Description: In case the client only needs to create tables in hive, it would be helpful if Sqoop had an optional parameter:

    --hive-create-only

    which would omit the time-consuming table import step, generate hive create table statements and run them.

    Also adds a --hive-overwrite flag, which allows overwriting an existing table definition.
    
    Reason: New feature
    Author: Leonid Furman
    Ref: UNKNOWN

commit bdf576aa69eeb56a954416f7c2fcbe0136f421bd
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:51:16 2010 -0800

    HADOOP-4012. Providing splitting support for bzip2 compressed files
    
    Description: Hadoop assumes that if the input data is compressed, it cannot be split (mainly because many codecs need the whole input stream to decompress successfully). So in such a case, Hadoop prepares only one split per compressed file, where the lower split limit is at 0 while the upper limit is the end of the file. The consequence of this decision is that one compressed file goes to a single mapper. Although this circumvents the limitation of codecs (as mentioned above), it substantially reduces the parallelism that splitting would otherwise allow.

    BZip2 is a compression/decompression algorithm that compresses data in blocks, and these compressed blocks can later be decompressed independently of each other. This is an opportunity: instead of one BZip2 compressed file going to one mapper, we can process chunks of the file in parallel. The correctness criterion for such processing is that, for a bzip2 compressed file, each compressed block should be processed by only one mapper and ultimately all the blocks of the file should be processed. (By processing we mean the actual use of the uncompressed data coming out of the codecs in a mapper.)

    We are writing the code to implement this suggested functionality. Although we have used bzip2 as an example, we have tried to extend Hadoop's compression interfaces so that any other codec with the same capability as bzip2 can easily use the splitting support. The details of these changes will be posted when we submit the code.
    Reason: New feature
    Author: Abdul Qadeer
    Ref: UNKNOWN

commit 8e47288583fcdbdf649ddf3486bf201788e79202
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:50:51 2010 -0800

    MAPREDUCE-707. Provide a jobconf property for explicitly assigning a job to a pool
    
    Description: A common use case of the fair scheduler is to have one pool per user, but then to define some special pools for various production jobs, import jobs, etc. Therefore, it would be nice if jobs went by default to the pool of the user who submitted them, but there was a setting to explicitly place a job in another pool. Today, this can be achieved through a sort of trick in the JobConf:

    <property>
      <name>mapred.fairscheduler.poolnameproperty</name>
      <value>pool.name</value>
    </property>

    <property>
      <name>pool.name</name>
      <value>${user.name}</value>
    </property>

    This JIRA proposes to add a property called mapred.fairscheduler.pool that allows a job to be placed directly into a pool, avoiding the need for this trick.
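
    A Python sketch of the intended lookup, assuming the explicit
    mapred.fairscheduler.pool property takes precedence and the submitting
    user's name is the fallback. resolve_pool is a hypothetical name, the job
    configuration is modeled as a plain dict, and the precedence order is an
    assumption drawn from the description above rather than from the scheduler
    source.

```python
EXPLICIT_POOL_PROPERTY = "mapred.fairscheduler.pool"


def resolve_pool(job_conf, submitting_user):
    """Pick the pool for a job: explicit property first, else the user's pool."""
    explicit = job_conf.get(EXPLICIT_POOL_PROPERTY)
    if explicit:
        return explicit
    return submitting_user
```

    A production job could then set mapred.fairscheduler.pool to a dedicated
    pool, while ordinary jobs that leave it unset land in the submitter's pool.
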
    Reason: Configuration improvement
    Author: Alan Heirich
    Ref: UNKNOWN

commit 96e17e1e593b818a888c8dfc177b8fb36e514e8f
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:50:18 2010 -0800

    MAPREDUCE-967. (version 2) TaskTracker does not need to fully unjar job jars
    
    Description:
        This is a performance improvement for jobs that contain a large number of
        classes. The unpacking of these jars consumes a large amount of time, as
        does the resulting cleanup. This patch changes the classpath to simply
        include the jar itself, and only unpacks the lib/ directory out of the
        jar in order to add those dependencies to the classpath.
    
        Users who previously depended on this functionality for shipping non-code
        dependencies can use the undocumented configuration parameter
        "mapreduce.job.jar.unpack.pattern" to cause specific jar contents to be unpacked.
    
        This new patch version fixes a streaming regression where the "-file" argument
        no longer worked. It includes a new unit test, TestFileArgs, to protect
        against this regression.
    Author: Todd Lipcon
    Ref: UNKNOWN

commit cf08a128b87bbfae90babd61795599b3645d37a3
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:48:40 2010 -0800

    HDFS-455, MAPREDUCE-1441, HADOOP-6534. Allow spaces in between comma-separated elements in directory list configurations.
    
    Description: Make NN and DN handle comma-separated configuration strings in an intuitive way.

    The following configuration causes problems:
    <property>
    <name>dfs.data.dir</name>
    <value>/mnt/hstore2/hdfs, /home/foo/dfs</value>
    </property>

    The problem is that the space after the comma causes the second directory for storage to be " /home/foo/dfs" which is in a directory named <SPACE> which contains a sub-dir named "home" in the hadoop datanodes default directory. This will typically cause the user's home partition to fill, but will be very hard for the user to understand since a directory with a whitespace name is hard to understand.

    (ripped from HADOOP-2366, "Space in the value for dfs.data.dir can cause great problems", http://issues.apache.org/jira/browse/HADOOP-2366)

    This fixes any configuration consisting of a comma-separated list of directories
    (e.g., dfs.data.dir, dfs.name.dir, fs.checkpoint.dir, mapred.local.dir, etc) so that
    the elements may also contain separating whitespace. Without this patch,
    setting mapred.local.dir to "/disk1, /disk2" would create a directory by the name
    " " in the user's home directory, or fail outright. The patch trims the
    directory names as they are fetched from the configuration.
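
    The trimming step can be sketched in Python as below. split_trimmed is a
    hypothetical helper, not the Java method added by the patch; it only
    illustrates splitting on commas and stripping the whitespace around each
    element while leaving spaces inside a name alone.

```python
def split_trimmed(value):
    """Split a comma-separated setting and strip whitespace around each item."""
    if value is None or not value.strip():
        return []
    return [item.strip() for item in value.split(",")]
```

    With this rule, "/disk1, /disk2" yields "/disk1" and "/disk2" rather than a
    second entry that begins with a space.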
    
    Reason: Configuration improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 65a04ab8197a8db21a97d279ca881b5cd45a5365
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:48:03 2010 -0800

    HADOOP-2366. Space in the value for dfs.data.dir can cause great problems
    
    Description: The following configuration causes problems:

    <property>
      <name>dfs.data.dir</name>
      <value>/mnt/hstore2/hdfs, /home/foo/dfs</value>
      <description>
      Determines where on the local filesystem an DFS data node should store its
      blocks. If this is a comma-delimited list of directories, then data will be
      stored in all named directories, typically on different devices. Directories
      that do not exist are ignored.
      </description>
    </property>

    The problem is that the space after the comma causes the second directory for storage to be " /home/foo/dfs" which is in a directory named <SPACE> which contains a sub-dir named "home" in the hadoop datanodes default directory.  This will typically cause the user's home partition to fill, but will be very hard for the user to understand since a directory with a whitespace name is hard to understand.

    My proposed solution would be to trimLeft all path names from this and similar property after splitting on comma.  This still allows spaces in file and directory names but avoids this problem.

        This provides support in Configuration to get comma-separated string lists in such
        a way that whitespace in between elements is ignored. This patch is required for
        later patches which fix mapred.local.dir, dfs.data.dir, etc to support spaces
        in between elements.
    
        Test plan: unit tested in TestStringUtils
    Reason: Configuration improvement
    Author: Michele (@pirroh) Catasta
    Ref: UNKNOWN
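    
    The whitespace handling described above can be sketched roughly as follows.
    This is a minimal illustration, not the code from the patch: the class name
    TrimmedList and the method split are hypothetical, and skipping empty
    elements is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not the actual Hadoop Configuration or StringUtils code. */
final class TrimmedList {
    /** Splits a comma-separated value and trims whitespace around each element. */
    static List<String> split(String value) {
        List<String> out = new ArrayList<>();
        if (value == null) {
            return out;
        }
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {   // assumption: drop empty elements such as "a,,b"
                out.add(trimmed);
            }
        }
        return out;
    }
}
```

    Only the ends of each element are trimmed, so a directory name that
    contains an interior space is left unchanged.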

commit 8d4807322a42509726b376b37a89739acd6cbd7d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:47:55 2010 -0800

    MAPREDUCE-1356. Allow user-specified hive table name in sqoop
    
    Description: The table name used in a hive-destination import is currently pegged to the input table name. This should be user-configurable.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 8bf3439ff69762a33967dca4abb15c0cd2bb8417
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:47:45 2010 -0800

    MAPREDUCE-1395. Sqoop does not check return value of Job.waitForCompletion()
    
    Description: Old code depended on JobClient.runJob() throwing IOException on failure. Job.waitForCompletion can fail in that manner, or it can fail by returning false. Sqoop needs to check for this condition.
    Reason: bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN
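    
    A minimal sketch of the check described above, assuming a simplified
    stand-in for Job: the CompletableJob interface and the JobRunner class are
    hypothetical, and the throws clause is reduced to IOException for brevity.

```java
import java.io.IOException;

/** Hypothetical stand-in for the one Job method this sketch needs. */
interface CompletableJob {
    boolean waitForCompletion(boolean verbose) throws IOException;
}

final class JobRunner {
    /** Treats a false return value as a failure, the same as a thrown IOException. */
    static void runOrFail(CompletableJob job) throws IOException {
        boolean success = job.waitForCompletion(true);
        if (!success) {
            // Without this check, a job that returns false would look successful.
            throw new IOException("Job failed: waitForCompletion returned false");
        }
    }
}
```

    Converting the false return into an exception keeps callers that were
    written against the old JobClient.runJob() behavior working unchanged.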

commit bd4e81234dd12fa9534577f0caa0db5c3d0a99fc
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:47:30 2010 -0800

    CLOUDERA-BUILD. Set HADOOP_PID_DIR to something smarter than /tmp
    
    Author: Chad Metcalf

commit 2466310d0e2a426e848860e9a8411b8ea14e1bb1
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:47:07 2010 -0800

    HADOOP-6453. Hadoop wrapper script shouldn't ignore an existing JAVA_LIBRARY_PATH
    
    Description: Currently the hadoop wrapper script assumes it's the only place that uses JAVA_LIBRARY_PATH and initializes it to an empty string.
    
    JAVA_LIBRARY_PATH=''
    
    This prevents anyone from setting this outside of the hadoop wrapper (say hadoop-config.sh) for their own native libraries.
    
    The fix is pretty simple. Don't initialize it to '' and append the native libs like normal.
    Reason: Bugfix (environment)
    Author: Chad Metcalf
    Ref: UNKNOWN

commit a67b4b1c361c26e002da64953a7f8bc068d29b98
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:46:42 2010 -0800

    MAPREDUCE-1327. Oracle database import via sqoop fails when a table contains the column types such as TIMESTAMP(6) WITH LOCAL TIME ZONE and TIMESTAMP(6) WITH TIME ZONE
    
    Description: When an Oracle table contains the columns "TIMESTAMP(6) WITH LOCAL TIME ZONE" and "TIMESTAMP(6) WITH TIME ZONE", Sqoop fails to map values for those columns to valid Java data types, resulting in the following exception:
    
    ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException
    java.lang.NullPointerException
            at org.apache.hadoop.sqoop.orm.ClassWriter.generateFields(ClassWriter.java:253)
            at org.apache.hadoop.sqoop.orm.ClassWriter.generateClassForColumns(ClassWriter.java:701)
            at org.apache.hadoop.sqoop.orm.ClassWriter.generate(ClassWriter.java:597)
            at org.apache.hadoop.sqoop.Sqoop.generateORM(Sqoop.java:75)
            at org.apache.hadoop.sqoop.Sqoop.importTable(Sqoop.java:87)
            at org.apache.hadoop.sqoop.Sqoop.run(Sqoop.java:175)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
            at org.apache.hadoop.sqoop.Sqoop.main(Sqoop.java:201)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    
    Reason: Compatibility improvement
    Author: Leonid Furman
    Ref: UNKNOWN

commit a937ba2b9b6132883d727f856911ae31d22ad619
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:46:26 2010 -0800

    MAPREDUCE-1394. Sqoop generates incorrect URIs in paths sent to Hive
    
    Description: Hive used to require a ':8020' in HDFS URIs used with LOAD DATA statements, even though the normalized form of such a URI does not contain an explicit port number (since 8020 is the default port). Sqoop matched this by hacking the URI strings it forwarded to Hive.
    
    Hive fixed this bug a while ago; Sqoop should catch up.
    Reason: bugfix (compatibility)
    Author: Aaron Kimball
    Ref: UNKNOWN

commit c5c9b8bf0bf83637589a809b3c376cf74a2fb464
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:45:54 2010 -0800

    MAPREDUCE-1313. NPE in FieldFormatter if escape character is set and field is null
    
    Description: Performing an import with the --escaped-by character set on a table with a null field will cause a NullPointerException in FieldFormatter.
    Reason: bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN
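    
    A null-safe escaping routine of the kind this fix calls for might look like
    the sketch below. It is not Sqoop's FieldFormatter: the FieldEscaper class
    is hypothetical, and rendering a null field as the literal text "null" is an
    assumption made only for illustration.

```java
/** Hypothetical illustration of null-safe field escaping. */
final class FieldEscaper {
    /** Prefixes the escape character and the delimiter with the escape character. */
    static String escape(String field, char escapeChar, char delimiter) {
        if (field == null) {
            // Guard first: calling field.length() on null is what would raise the NPE.
            return "null";
        }
        StringBuilder sb = new StringBuilder(field.length());
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            if (c == escapeChar || c == delimiter) {
                sb.append(escapeChar);
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```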

commit 1c6dd471832946929928801dd9c9e4b79259ad9d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:45:38 2010 -0800

    HADOOP-6460. Namenode runs out of memory due to memory leak in ipc Server
    
    Description: Namenode heap usage grows disproportionately to the number of objects it supports (files, directories and blocks). Based on heap dump analysis, this is due to large growth in ByteArrayOutputStream allocated in o.a.h.ipc.Server.Handler.run().
    Reason: Bugfix (Scalability)
    Author: Suresh Srinivas
    Ref: UNKNOWN

commit d190a8067827ce09cdcb7741d588cce0e0e7aa02
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:45:23 2010 -0800

    HADOOP-5687. Hadoop NameNode throws NPE if fs.default.name is the default value
    
    Description: Throwing an NPE is confusing; an exception with a useful string description could be thrown instead.
    Reason: Logging improvement
    Author: Philip Zeyliger
    Ref: UNKNOWN

commit 7604c6f69076effbb0c9793e114946d679f5912d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:45:02 2010 -0800

    HADOOP-6505. sed in build.xml fails
    
    Description: I'm not sure whether this is a Solaris thing or an ant 1.7.1 thing, but it definitely doesn't do what it is supposed to.  Instead of getting SunOS-x86-32 (or whatever) I get -x86-32.
    
    This patch replaces the sed call with tr.
    Reason: OS compatibility improvement
    Author: Allen Wittenauer
    Ref: UNKNOWN

commit ca662cbba6044be216b586e7359d9fc2f1dd4e4f
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:44:00 2010 -0800

    HDFS-908. (version 2) TestDistributedFileSystem fails with Wrong FS on weird hosts
    
    Description: On the same host where I experienced HDFS-874 (TestHDFSFileContextMainOperations fails on weirdly configured DNS hosts), I also experience this failure for TestDistributedFileSystem:
    
    Testcase: testFileChecksum took 0.492 sec
      Caused an ERROR
    Wrong FS: hftp://localhost.localdomain:59782/filechecksum/foo0, expected: hftp://127.0.0.1:59782
    java.lang.IllegalArgumentException: Wrong FS: hftp://localhost.localdomain:59782/filechecksum/foo0, expected: hftp://127.0.0.1:59782
      at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
      at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:222)
      at org.apache.hadoop.hdfs.HftpFileSystem.getFileChecksum(HftpFileSystem.java:318)
      at org.apache.hadoop.hdfs.TestDistributedFileSystem.testFileChecksum(TestDistributedFileSystem.java:166)
    
    Doesn't appear to occur on trunk or branch-0.21.
    
    This is version two of this patch. The previous patch fixed some systems
    but broke others.
    Reason: Bugfix
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 7fafe032223921ad194c69b16ab451b4aade87fa
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:43:41 2010 -0800

    HADOOP-4368. Superuser privileges required to do "df"
    
    Description: Superuser privileges are required in DFS in order to get the file system statistics (FSNamesystem.java, getStats method).  This means that when HDFS is mounted via fuse-dfs as a non-root user, "df" is going to return 16 exabytes total and 0 free instead of the correct amount.
    
    As far as I can tell, there's no need to require superuser privileges to see the file system size (and historically in Unix, this is not required).
    
    To fix this, simply comment out the privilege check in the getStats method.
    Reason: Usability improvement
    Author: Craig Macdonald
    Ref: UNKNOWN

commit 6129c87f5dd1fdb7375c80285534b8b91fbcd392
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:43:25 2010 -0800

    HDFS-412. Hadoop JMX usage makes Nagios monitoring impossible
    
    Description: When Hadoop reports Datanode information to JMX, the bean uses the name "DataNode-" + storageid.  The storage ID incorporates a random number and is unpredictable.
    
    This prevents me from monitoring DFS datanodes through Hadoop using the JMX interface; in order to do that, you must be able to specify the bean name on the command line.
    
    The fix is simple, patch will be coming momentarily.  However, there was probably a reason for making the datanodes all unique names which I'm unaware of, so it'd be nice to hear from the metrics maintainer.
    Reason: Monitoring improvement
    Author: Brian Bockelman
    Ref: UNKNOWN

commit 5dfcc6d2d7806636c6237996e1b28a00ba075b4b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:43:05 2010 -0800

    HADOOP-6503. contrib projects should pull in the ivy-fetched libs from the root project
    
    Description: On branch-20 currently, I get an error just running "ant contrib -Dtestcase=TestHdfsProxy". In a full "ant test" build sometimes this doesn't appear to be an issue. The problem is that the contrib projects don't automatically pull in the dependencies of the "Hadoop" ivy project. Thus, they each have to declare all of the common dependencies like commons-cli, etc. Some are missing and this causes test failures.
    Reason: Build system improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit be70b10f11445f4a71807405718bfeebd38ad924
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:42:51 2010 -0800

    MAPREDUCE-1155. Streaming tests swallow exceptions
    
    Description: Many of the streaming tests (including TestMultipleArchiveFiles) catch exceptions and print their stack trace rather than failing the job. This means that tests do not fail even when the job fails.
    Reason: Test coverage improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit f84830ae5e6c862cd0e2b8ebea57880e54c8a082
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:42:33 2010 -0800

    HADOOP-5647. TestJobHistory fails if /tmp/_logs is not writable to. Testcase should not depend on /tmp
    
    Description: TestJobHistory sets /tmp as hadoop.job.history.user.location to check if the history file is created in that directory or not. If /tmp/_logs is already created by some other user, this test will fail because of not having write permission.
    Reason: Bugfix in test harness
    Author: Ravi Gummadi
    Ref: UNKNOWN

commit 669b65f14d78ffd1cf0304cf459d1abbae3412ae
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:42:15 2010 -0800

    CLOUDERA-BUILD. Fix javadoc warnings shown by test-patch, and update eclipse classpath to match current CDH.
    
    Author: Todd Lipcon

commit 51804fd45d3a527a130a373c591a17c185102a0c
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:41:40 2010 -0800

    Revert "HDFS-127: DFSClient block read failures cause open DFSInputStream to become unusable"
    
    Description: This is being reverted as it causes infinite retries when there are no valid replicas.
    Reason: bugfix
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 623bfc0c18087274315dfbd41d025a8a775abe80
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:40:30 2010 -0800

    HDFS-877. Client-driven block verification not functioning
    
    Description: This is actually the reason for HDFS-734 (TestDatanodeBlockScanner timing out). The issue is that DFSInputStream relies on readChunk being called one last time at the end of the file in order to receive the lastPacketInBlock=true packet from the DN. However, DFSInputStream.read checks pos < getFileLength() before issuing the read. Thus gotEOS never shifts to true and checksumOk() is never called.
    
    This is a simpler patch than the one on 0.21/0.22 since those fix a further regression
    since 0.20.
    
    Reason: bugfix
    Author: Todd Lipcon
    Ref: UNKNOWN

commit b332fe77255047409da701dfb97df1bddb5b10cb
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:40:05 2010 -0800

    CLOUDERA-BUILD. Add mockito to 0.20 branch for easier unit testing of HDFS stability patches.
    
    Reason: Test coverage improvement
    Author: Todd Lipcon

commit 44a6c559de056b35c6eb2e2d53798c88d8c779e6
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:39:09 2010 -0800

    HDFS-630. In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
    
    Description: Created from HDFS-200.
    
    If during a write, the dfsclient sees that a block replica location for a newly allocated block is not connectable, it re-requests the NN to get a fresh set of replica locations of the block. It tries this dfs.client.block.write.retries times (default 3), sleeping 6 seconds between each retry (see DFSClient.nextBlockOutputStream).
    
    This setting works well when you have a reasonably sized cluster; if you have few datanodes in the cluster, every retry may pick the dead datanode and the above logic bails out.
    
    Our solution: when getting block locations from the namenode, we give the NN the excluded datanodes. The list of dead datanodes is only for one block allocation.
    Reason: bugfix (Fault tolerance improvement)
    Author: Cosmin Lehene (modified by Cloudera to not break compatibility)
    Ref: UNKNOWN
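    
    The exclusion idea can be sketched as below. This is a simplified model, not
    the DFSClient code: ExcludingAllocator, pickReachable and the canConnect
    predicate are hypothetical, the namenode's placement policy is replaced by
    "first node not yet excluded", and the sleep between retries is omitted.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical model of retrying a block allocation with excluded datanodes. */
final class ExcludingAllocator {
    /** Makes one attempt plus up to maxRetries retries; returns null if all fail. */
    static String pickReachable(List<String> datanodes,
                                Predicate<String> canConnect,
                                int maxRetries) {
        Set<String> excluded = new HashSet<>();   // kept for this one block allocation only
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            String candidate = null;
            for (String dn : datanodes) {          // stand-in for the namenode's choice
                if (!excluded.contains(dn)) {
                    candidate = dn;
                    break;
                }
            }
            if (candidate == null) {
                return null;                       // every datanode has been excluded
            }
            if (canConnect.test(candidate)) {
                return candidate;
            }
            excluded.add(candidate);               // never offer this node again
        }
        return null;
    }
}
```

    Because a failed node is excluded, each retry is guaranteed to try a
    different datanode, which is what matters on a small cluster.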

commit 47c404e0cf10ceb31336d2a77d53e0a971348102
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:37:37 2010 -0800

    HDFS-908. TestDistributedFileSystem fails with Wrong FS on weird hosts
    
    Description: On the same host where I experienced HDFS-874 (TestHDFSFileContextMainOperations fails on weirdly configured DNS hosts), I also experience this failure for TestDistributedFileSystem:
    
    Testcase: testFileChecksum took 0.492 sec
      Caused an ERROR
    Wrong FS: hftp://localhost.localdomain:59782/filechecksum/foo0, expected: hftp://127.0.0.1:59782
    java.lang.IllegalArgumentException: Wrong FS: hftp://localhost.localdomain:59782/filechecksum/foo0, expected: hftp://127.0.0.1:59782
      at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
      at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:222)
      at org.apache.hadoop.hdfs.HftpFileSystem.getFileChecksum(HftpFileSystem.java:318)
      at org.apache.hadoop.hdfs.TestDistributedFileSystem.testFileChecksum(TestDistributedFileSystem.java:166)
    
    Doesn't appear to occur on trunk or branch-0.21.
    Reason: bugfix
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 7c2a791f0a397d924a623e45bf823c238374c42c
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:37:19 2010 -0800

    MAPREDUCE-1258. Fair scheduler event log not logging job info
    
    Description: The MAPREDUCE-706 patch (Support for FIFO pools in the fair scheduler) seems to have left an unfinished TODO in the Fair Scheduler - namely, in the dump() function for periodically dumping scheduler state to the event log, the part that dumps information about jobs is commented out. This makes the event log less useful than it was before.
    
    It should be fairly easy to update this part to use the new scheduler data structures (Schedulable etc) and print the data.
    Reason: Logging improvement
    Author: Matei Zaharia
    Ref: UNKNOWN

commit 353f7813bf7dfb0bca1362f9370f6a080256a345
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:36:58 2010 -0800

    MAPREDUCE-1198. Alternatively schedule different types of tasks in fair share scheduler
    
    Description: Matei has mentioned in MAPREDUCE-961 (ResourceAwareLoadManager to dynamically decide new tasks based on current CPU/memory load on TaskTracker(s)) that the current scheduler will first try to launch map tasks until canLaunchTask() returns false, and only then look for reduce tasks. This might starve reduce tasks. He also mentions that alternately scheduling the different types of tasks can solve this problem.
    Reason: bugfix
    Author: Scott Chen
    Ref: UNKNOWN
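    
    The alternating policy can be illustrated with the sketch below. It is not
    the fair scheduler's code: AlternatingAssigner is hypothetical, and a single
    integer capacity stands in for the load manager deciding whether another
    task may be launched.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of alternating between map and reduce tasks. */
final class AlternatingAssigner {
    /** Launches up to capacity tasks, switching task type after every launch. */
    static List<String> assign(int pendingMaps, int pendingReduces, int capacity) {
        List<String> launched = new ArrayList<>();
        boolean mapTurn = true;
        while (launched.size() < capacity && (pendingMaps > 0 || pendingReduces > 0)) {
            // Launch a map on its turn, or whenever no reduce is left to launch.
            boolean launchMap = (mapTurn && pendingMaps > 0) || pendingReduces == 0;
            if (launchMap) {
                launched.add("MAP");
                pendingMaps--;
            } else {
                launched.add("REDUCE");
                pendingReduces--;
            }
            mapTurn = !mapTurn;
        }
        return launched;
    }
}
```

    With five pending maps, five pending reduces and a capacity of four, a
    maps-first policy would launch four maps, while this sketch launches two of
    each.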

commit ef449fb7832055951e2364cf12a73717b2add3ce
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:36:50 2010 -0800

    MAPREDUCE-698. Per-pool task limits for the fair scheduler
    
    Description: The fair scheduler could use a way to cap the share of a given pool similar to MAPREDUCE-532 (Allow admins of the Capacity Scheduler to set a hard-limit on the capacity of a queue).
    Reason: New feature
    Author: Kevin Peterson
    Ref: UNKNOWN

commit a1e25ec70e677db322b2cce43c6381f865eb3f79
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:36:42 2010 -0800

    HDFS-464. Memory leaks in libhdfs
    
    Description: hdfsExists never calls destroyLocalReference for jPath,
    hdfsDelete does not call it when it fails, and
    hdfsRename does not call it for jOldPath and jNewPath when it fails.
    Reason: bugfix
    Author: Christian Kunz
    Ref: UNKNOWN

commit d93dad715d3c702d15c2a32c85d586c708e70857
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:36:23 2010 -0800

    CLOUDERA-BUILD. Add test ivy configurations to additional projects.
    
    Author: Aaron Kimball
    Reason: Build system improvement

commit 5d0c8f82b87e7cbb541ace9e4f22abfad2799e56
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:35:08 2010 -0800

    CLOUDERA-BUILD. Sqoop bin script now includes jars from contrib/sqoop/lib/ on classpath.
    
    Author: Aaron Kimball

commit 7e009a29c0806537cd50972df90ec87b617eb78f
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:34:54 2010 -0800

    MAPREDUCE-1212. Mapreduce contrib project ivy dependencies are not included in binary target
    
    Description: As in HADOOP-6370 (Contrib project ivy dependencies are not included in binary target), only Hadoop's own library dependencies are promoted to ${build.dir}/lib; any libraries required by contribs are not redistributed.
    Reason: Build system (packaging) improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 8d289f97d6b66cd435f755a4acae9f138de934d6
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:34:43 2010 -0800

    CLOUDERA-BUILD. Update cloud script version to cdh-0.20.1
    
    Author: Tom White

commit ac7eacd44af059d7a859b8d6773a82cd84ba4c9b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:34:35 2010 -0800

    HADOOP-6466. Add a ZooKeeper service to the cloud scripts
    
    Description: It would be good to add other Hadoop services to the cloud scripts.
    Reason: New feature
    Author: Tom White
    Ref: UNKNOWN

commit 06ceb079693292a41085af795c5b2bbc3fd10af2
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:34:24 2010 -0800

    HADOOP-6454. Create setup.py for EC2 cloud scripts
    
    Description: This would make it easier to install the scripts.
    Reason: Installation improvement
    Author: Tom White
    Ref: UNKNOWN

commit 23c45791bbc3a23d69c77f3518b5d1a1a4702ccc
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:34:11 2010 -0800

    HADOOP-6462. contrib/cloud failing, target "compile" does not exist
    
    Description: I'm not seeing this mentioned in hudson or other bug reports, which confuses me. With the addition of a src/contrib/cloud/build.xml from HADOOP-6426 (Create ant build for running EC2 unit tests), contrib/build.xml no longer builds:
    hadoop-common/src/contrib/build.xml:30: The following error occurred while executing this line:
    Target "compile" does not exist in the project "hadoop-cloud".
    
    What is odd is this: the final patch of HADOOP-6426 does include the stub <target> files needed, yet they aren't in SVN_HEAD. This implies that a different version may have gone in than intended.
    Reason: Build system bugfix
    Author: Tom White
    Ref: UNKNOWN

commit 083a6a1cfb2a5198243aa82a020681ad62da5938
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:33:58 2010 -0800

    HADOOP-6444. Support additional security group option in hadoop-ec2 script
    
    Description: When deploying a hadoop cluster on ec2 alongside other services it is very useful to be able to specify additional (pre-existing) security groups to facilitate access control.  For example one could use this feature to add a cluster to a generic "hadoop" group, which authorizes hdfs access from instances outside the cluster.  Without such an option the access control for the security groups created by the script needs to be manually updated after cluster launch.
    Reason: Security improvement
    Author: Paul Egan
    Ref: UNKNOWN

commit 63152ce4ba3c0cf2006016cc825fc72b0bd23d2d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:33:49 2010 -0800

    HADOOP-6426. Create ant build for running EC2 unit tests
    
    Description: There is no easy way currently to run the Python unit tests for the cloud contrib.
    Reason: Test coverage improvement
    Author: Tom White
    Ref: UNKNOWN

commit a20069b2adfafa59e0001fe5e5685d36d9eb7fee
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:33:15 2010 -0800

    HADOOP-6392. Run namenode and jobtracker on separate EC2 instances
    
    Description: Replace concept of "master" with that of "namenode" and "jobtracker". Still need to be able to run both on one node, of course.
    Reason: Scalability improvement
    Author: Tom White
    Ref: UNKNOWN

commit 361221a2a082d0ab7a87ba0226dbe05938440738
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:33:07 2010 -0800

    HADOOP-6108. Add support for EBS storage on EC2
    
    Description: By using EBS for namenode and datanode storage we can have persistent, restartable Hadoop clusters running on EC2.
    Reason: New feature
    Author: Tom White
    Ref: UNKNOWN

commit 4ca1c78e1b257eefa10b5ed94479df8a6473d3e9
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:32:50 2010 -0800

    HDFS-861. fuse-dfs does not support O_RDWR
    
    Description: Some applications (for us, the big one is rsync) will open a file in read-write mode when they really only intend to read xor write (not both).  fuse-dfs should try to not fail until the application actually tries to write to a pre-existing file or read from a newly created file.
    Reason: bugfix
    Author: Brian Bockelman
    Ref: UNKNOWN

commit 00f6976093cc20ea825a35f6831f645dc5f61637
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:32:17 2010 -0800

    HDFS-860. fuse-dfs truncate behavior causes issues with scp
    
    Description: For whatever reason, scp issues a "truncate" once it's written a file to truncate the file to the # of bytes it has written (i.e., if a file is X bytes, it calls truncate(X)).
    
    <p>This fails on the current fuse-dfs.</p>
    Reason: bugfix (tool compatibility)
    Author: Brian Bockelman
    Ref: UNKNOWN

commit 46d2b6d6b27887375c44d691d776f70e89e4b81b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:31:58 2010 -0800

    HDFS-859. fuse-dfs utime behavior causes issues with tar
    
    Description: When trying to untar files onto fuse-dfs, tar will try to set the utime on all the files and directories.  However, setting the utime on a directory in libhdfs causes an error.
    
    <p>We should silently ignore the failure of setting a utime on a directory; this will allow tar to complete successfully.</p>
    Reason: bugfix (tool compatibility)
    Author: Brian Bockelman
    Ref: UNKNOWN

commit 9a38b9c423aca358307aa6455977432f34aef990
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:31:45 2010 -0800

    HDFS-858. Incorrect return codes for fuse-dfs
    
    Description: fuse-dfs doesn't pass proper error codes from libhdfs; places I'd like to correct are hdfsFileOpen (which can result in permission denied or quota violations) and hdfsWrite (which can result in quota violations).
    
    <p>By returning the correct error codes, command line utilities return much better error messages - especially for quota violations, which can be a devil to debug.</p>
    Reason: bugfix
    Author: Brian Bockelman
    Ref: UNKNOWN

commit 84afb26bb0e42eda1e26b07e3aac016695f5ad87
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:31:37 2010 -0800

    HDFS-857. Incorrect type for fuse-dfs capacity can cause "df" to return negative values on 32-bit machines
    
    Description: On sufficiently large HDFS installs, the casting of hdfsGetCapacity to a long may cause "df" to return negative values.  tOffset should be used instead.
    Reason: bugfix
    Author: Brian Bockelman
    Ref: UNKNOWN
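    
    The overflow can be reproduced with a small worked example. The fuse-dfs
    code is C; the sketch below uses a Java int as a stand-in for a 32-bit
    signed C long, and the CapacityCast class and the 3,000,000,000-byte
    capacity are made up for illustration.

```java
/** Hypothetical demonstration of a 64-bit byte count truncated to 32 bits. */
final class CapacityCast {
    /** Keeps only the low 32 bits, as a 32-bit signed long would. */
    static int truncateTo32Bits(long bytes) {
        return (int) bytes;
    }

    public static void main(String[] args) {
        long capacity = 3_000_000_000L;   // just under 3 GB, larger than 2^31 - 1
        // 3,000,000,000 - 2^32 = -1,294,967,296
        System.out.println(truncateTo32Bits(capacity));   // prints -1294967296
    }
}
```

    Keeping the value in a 64-bit type from end to end, as the description
    suggests with tOffset, avoids the wraparound.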

commit a4cf3e8e86cbd42bef25eb3aab7e464ac86e3068
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:31:19 2010 -0800

    HDFS-856. Hardcoded replication level for new files in fuse-dfs
    
    Description: In fuse-dfs, the number of replicas is always hardcoded to 3 in the arguments to hdfsOpenFile.  We should use the setting in the hadoop configuration instead.
    Reason: Configuration improvement
    Author: Brian Bockelman
    Ref: UNKNOWN

commit e9f3ec90e57b383faf49e6a6eb8cc91e5182d31e
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:31:08 2010 -0800

    HADOOP-5625. Add I/O duration time in client trace
    
    Description: Add I/O duration information into client trace log for analyzing performance.
    
    Reason: Logging improvement
    Author: Lei Xu
    Ref: UNKNOWN

commit 42eeb4540850278563e76841f0c6b369933d5b70
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:30:43 2010 -0800

    HADOOP-5222. Add offset in client trace
    
    Description: By adding the offset to the client trace, the client trace information can provide more accurate information about I/O.
    It is useful for performance analysis.
    
    Since there is no random write now, the offset of writing is always zero.
    Reason: Logging improvement
    Author: Lei Xu
    Ref: UNKNOWN

commit 5880960fb32ae0fc2c16bac1f333dbb237c3448f
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:30:27 2010 -0800

    CLOUDERA-BUILD. Solaris do-release-build fix
    
    Author: Eli Collins
    Ref: CDH-531

commit 35f87aef6d7cd4030644a1d454da2e0a6e2969c0
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:30:18 2010 -0800

    MAPREDUCE-1310. CREATE TABLE statements for Hive do not correctly specify delimiters
    
    Description: Imports to HDFS via Sqoop that also inject metadata into Hive do not correctly specify delimiters; using Hive to access the data results in rows being parsed as NULL characters. See http://getsatisfaction.com/cloudera/topics/sqoop_hive_import_giving_null_query_values for an example bug report.
    Reason: Bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 60784d712cdd5781ceff262bb67e2d484fde428b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:29:56 2010 -0800

    MAPREDUCE-1235. java.io.IOException: Cannot convert value '0000-00-00 00:00:00' from column 6 to TIMESTAMP.
    
    Description: java.io.IOException is thrown when trying to import a table to HDFS using Sqoop. The table has a "0" value in a field of type datetime.
    Full Exception: java.io.IOException: Cannot convert value '0000-00-00 00:00:00' from column 6 to TIMESTAMP.
    Original question: http://getsatisfaction.com/cloudera/topics/cant_import_table?utm_content=reply_link&utm_medium=email&utm_source=reply_notification
    Reason: Bugfix (compatibility)
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 23c116b6ab5615bdb846e22b61a41e92ca287bdf
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:29:47 2010 -0800

    MAPREDUCE-1174. Sqoop improperly handles table/column names which are reserved sql words
    
    Description: In some databases it is legal to name tables and columns with terms that overlap SQL reserved keywords (e.g., CREATE, table, etc.). In such cases, the database allows you to escape the table and column names. We should always escape table and column names when possible.
    Reason: Bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit d4b3b7592c94aa1f4608245829b5de202ed1b148
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:29:39 2010 -0800

    MAPREDUCE-1168. Export data to databases via Sqoop
    
    Description: Sqoop can import from a database into HDFS. It's high time it works in reverse too.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit b29023803d1136bf7d4de45853a2d4481fb36d3c
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:29:24 2010 -0800

    MAPREDUCE-1169. Improvements to mysqldump use in Sqoop
    
    Description: Improve Sqoop's integration with mysqldump
    Reason: Feature/performance improvements
    Author: Aaron Kimball
    Ref: UNKNOWN
    
    commit c6b956630e327ddabf674f8e06de02408e603155
    Author: Aaron Kimball <aaron@cloudera.com>
    Date:   Wed Jan 6 16:05:05 2010 -0800
    
        MAPREDUCE-1169. Improvements to mysqldump use in Sqoop

commit 26ba4fd749755a3df79eaa27792662e5b7e3da80
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:29:15 2010 -0800

    MAPREDUCE-1036. An API Specification for Sqoop
    
    Description: Over the last several months, Sqoop has evolved to a state that is functional and has room for extensions. Developing extensions requires a stable API and documentation. I am attaching to this ticket a description of Sqoop's design and internal APIs, which include some open questions. I would like to solicit input on the design regarding these open questions and standardize the API.
    Reason: Documentation
    Author: Aaron Kimball
    Ref: UNKNOWN

commit e8c47124bb2ada5de0cfdf49150dd7296a41df71
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:29:04 2010 -0800

    MAPREDUCE-1069. Implement Sqoop API refactoring
    
    Description: Implement refactoring decisions outlined in MAPREDUCE-1036 (An API Specification for Sqoop).
    Reason: API compatibility
    Author: Aaron Kimball
    Ref: UNKNOWN

commit b73cab8083c1594c0328a565eef05951a17f998a
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:28:46 2010 -0800

    MAPREDUCE-1146. Sqoop dependencies break Eclipse build on Linux
    
    Description: Under Linux, the Eclipse "Problems" view shows the error:
    
        "com.sun.tools cannot be resolved" at line 166 of org.apache.hadoop.sqoop.orm.CompilationManager
    
    The problem doesn't appear on MacOS, though.
    Reason: bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 0629ac30abb5e58fb80be56a385867ac7360de22
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:28:37 2010 -0800

    MAPREDUCE-1148. SQL identifiers are a superset of Java identifiers
    
    Description: SQL identifiers can contain arbitrary characters, can start with numbers, can be words like "class" which are reserved in Java, etc. If Sqoop uses these names literally for class and field names then compilation errors can occur in auto-generated classes. SQL identifiers need to be cleansed to map onto Java identifiers.
    Reason: bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN
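    
    A minimal sketch of such cleansing, under stated assumptions: the class name, the underscore-based replacement scheme, and the (deliberately incomplete) reserved-word list are all hypothetical and are not taken from the Sqoop patch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class JavaIdentifiers {
    // A small, deliberately incomplete sample of Java reserved words.
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
        "class", "int", "public", "static", "void", "new", "package", "import"));

    private JavaIdentifiers() {}

    /** Maps an arbitrary SQL name onto a string that is legal as a Java identifier. */
    static String toJavaIdentifier(String sqlName) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < sqlName.length(); i++) {
            char c = sqlName.charAt(i);
            // Replace anything Java would reject inside an identifier.
            sb.append(Character.isJavaIdentifierPart(c) ? c : '_');
        }
        // Prefix names that are empty, start with an illegal character, or are reserved.
        if (sb.length() == 0
                || !Character.isJavaIdentifierStart(sb.charAt(0))
                || RESERVED.contains(sb.toString())) {
            sb.insert(0, '_');
        }
        return sb.toString();
    }
}
```

    Note that a mapping like this is not injective: two distinct SQL names (for example "a b" and "a-b") map to the same Java name, so a real implementation would also need to detect collisions.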

commit dec4c616921b547e5a332a254254d77efc3a7d5e
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:28:25 2010 -0800

    MAPREDUCE-1224. Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
    
    Description: The query "SELECT t.* from <table> AS t", which the SqlManager uses to get the table spec, is too expensive for big tables, and it is called twice to generate column names and types. For tables that are big enough to be map-reduced, this cost is too high for Sqoop to be useful.
    Reason: Performance improvement
    Author: Spencer Ho
    Ref: UNKNOWN

commit 1198ef1375387ba107d46f0ab5e9a7c6a7645931
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:28:15 2010 -0800

    MAPREDUCE-706. Support for FIFO pools in the fair scheduler
    
    Description: The fair scheduler should support making the internal scheduling algorithm for some pools be FIFO instead of fair sharing in order to work better for batch workloads. FIFO pools will behave exactly like the current default scheduler, sorting jobs by priority and then submission time. Pools will have their scheduling algorithm set through the pools config file, and it will be changeable at runtime.
    
    To support this feature, I'm also changing the internal logic of the fair scheduler to no longer use deficits. Instead, for fair sharing, we will assign tasks to the job farthest below its share as a ratio of its share. This is easier to combine with other scheduling algorithms and leads to a more stable sharing situation, avoiding unfairness issues brought up in MAPREDUCE-543 (large pending jobs hog resources) and MAPREDUCE-544 (deficit computation is biased by historical load) that happen when some jobs have long tasks. The new preemption (MAPREDUCE-551, Add preemption to the fair scheduler) will ensure that critical jobs can gain their fair share within a bounded amount of time.
    Reason: New feature
    Author: Matei Zaharia
    Ref: UNKNOWN
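    
    The two orderings described above can be sketched as comparators. This is an illustration only: the Job class, its fields, and the convention that a lower priority value means higher priority are assumptions made for the example, not the fair scheduler's real types.

```java
import java.util.Comparator;

final class PoolOrdering {
    /** Hypothetical stand-in for the scheduler's view of a job. */
    static final class Job {
        final String name;
        final int priority;      // assumption: lower value = more urgent
        final long submitTime;   // milliseconds since epoch
        final int runningTasks;
        final double fairShare;  // slots the job is entitled to

        Job(String name, int priority, long submitTime, int runningTasks, double fairShare) {
            this.name = name;
            this.priority = priority;
            this.submitTime = submitTime;
            this.runningTasks = runningTasks;
            this.fairShare = fairShare;
        }
    }

    private PoolOrdering() {}

    /** Running tasks as a fraction of the job's share; jobs with no share sort last. */
    static double shareRatio(Job j) {
        return j.fairShare > 0 ? j.runningTasks / j.fairShare : Double.POSITIVE_INFINITY;
    }

    /** FIFO pools: priority first, then submission time. */
    static final Comparator<Job> FIFO =
        Comparator.comparingInt((Job j) -> j.priority).thenComparingLong(j -> j.submitTime);

    /** Fair pools: the job farthest below its share, as a ratio of that share, goes first. */
    static final Comparator<Job> FAIR = Comparator.comparingDouble(PoolOrdering::shareRatio);
}
```

    Because both policies reduce to a comparator over jobs, a pool can switch between them at runtime by swapping the comparator, which is one reason a ratio-based ordering is easier to combine with other algorithms than accumulated deficits.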

commit 5699f5483e2a9ee9debd0f0154c6506ee5dc87e2
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:28:03 2010 -0800

    MAPREDUCE-1285. DistCp cannot handle -delete if destination is local filesystem
    
    Description: The following exception is thrown:
    
    Copy failed: java.io.IOException: wrong value class: org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus is not class org.apache.hadoop.fs.FileStatus
    	at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:988)
    	at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:977)
    	at org.apache.hadoop.tools.DistCp.deleteNonexisting(DistCp.java:1226)
    	at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1134)
    	at org.apache.hadoop.tools.DistCp.copy(DistCp.java:650)
    	at org.apache.hadoop.tools.DistCp.run(DistCp.java:857)
    	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    
    Reason: bugfix
    Author: Peter Romianowski
    Ref: UNKNOWN

commit 34bb813a5884aeb05909c2ce2cc541882ca3eda1
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:27:53 2010 -0800

    MAPREDUCE-764. TypedBytesInput's readRaw() does not preserve custom type codes
    
    Description: The typed bytes format supports byte sequences of the form "<custom type code> <length> <bytes>". When reading such a sequence via TypedBytesInput's readRaw() method, however, the returned sequence currently is "0 <length> <bytes>" (0 is the type code for a bytes array), which leads to bugs such as the one described at http://dumbo.assembla.com/spaces/dumbo/tickets/54.
    Reason: bugfix
    Author: Klaas Bosteels
    Ref: UNKNOWN
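    
    A minimal sketch of a raw read that keeps the original type code. It assumes the layout named in the description, with a 1-byte type code and a 4-byte big-endian length; the class below is hypothetical and is not the TypedBytesInput implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class RawTypedBytes {
    private RawTypedBytes() {}

    /** Reads one "<code> <length> <bytes>" record and returns it byte for byte. */
    static byte[] readRaw(byte[] input) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(input));
            int code = in.readUnsignedByte();
            int length = in.readInt();
            byte[] payload = new byte[length];
            in.readFully(payload);

            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeByte(code);   // keep the custom code instead of hard-coding 0
            out.writeInt(length);
            out.write(payload);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    The point of the sketch is the writeByte(code) line: the raw form is only useful to downstream consumers if it round-trips the code that was read.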

commit 7fd2cb371354219abd108fda35087f08dc481b35
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:27:31 2010 -0800

    HADOOP-6400. Log errors getting Unix UGI
    
    Description: For various reasons, the calls out to `whoami` and `id` can fail when trying to get the unix UGI information. Currently it silently ignores failures and uses the default DrWho/Tardis ugi. This is extremely confusing for users - we should log the exception at warn level when the shell execs fail.
    Reason: Debug logging improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit d6dc22fecc058e12695a481fa354078d9b012089
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:27:21 2010 -0800

    MAPREDUCE-1293. AutoInputFormat doesn't work with non-default FileSystems
    
    Description: AutoInputFormat uses the wrong FileSystem.get() method when getting a reference to a FileSystem object. AutoInputFormat gets the default FileSystem, so this method breaks if the InputSplit's path is pointing to a different FileSystem.
    Reason: bugfix
    Author: Andrew Hitchcock
    Ref: UNKNOWN

commit 25a4ea86b0b085e3afd6f2f040201594155b3de1
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:27:09 2010 -0800

    MAPREDUCE-1131. Using profilers other than hprof can cause JobClient to report job failure
    
    Description: If task profiling is enabled, the JobClient will download the profile.out file created by the tasks under profile. If this causes an IOException, the job is reported as a failure to the client, even though all the tasks themselves may complete successfully. The expected result files are assumed to be generated by hprof. Using the profiling system with other profilers will cause job failure.
    Reason: compatibility bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit ab98123c7114752945452af0b96c8de04af9ba93
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:26:02 2010 -0800

    MAPREDUCE-370. Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api.
    
    Description: Ports the MultipleOutputs OutputFormat to the new context-based API.
    Reason: API compatibility improvement.
    Author: Amareshwari Sriramadasu
    Ref: UNKNOWN

commit 50726d13750f3f71d2fc5d3a012ce81aa2adb26d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:24:46 2010 -0800

    CLOUDERA-BUILD. Backport MapReduceTestUtil to Hadoop 0.20
    
    Description: MapReduceTestUtil is required for unit tests in subsequent
    patches, but this class itself was not created in one clean JIRA. Therefore
    it was backported "As-is" from the trunk and not in a patch-wise fashion.
    This class is only used in the JUnit tests for Hadoop.
    Author: Aaron Kimball
    Reason: Testing improvement
    Ref: UNKNOWN

commit d713dc1063afc4967381b6583ec424d2850bac63
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:24:30 2010 -0800

    MAPREDUCE-1059. distcp can generate uneven map task assignments
    
    Description: distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.
    Reason: Improvement for load balancing
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 855b0bf3718f2c397ef79967475468e4153f120a
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:24:20 2010 -0800

    MAPREDUCE-1128. MRUnit Allows Iteration Twice
    
    Description: MRUnit allows one to iterate over a collection of values twice, i.e.:
    
        reduce(Key key, Iterable<Value> values, Context context) {
          for (Value v : values) /* iterate once */;
          for (Value v : values) /* iterate again */;
        }
    
    Hadoop will allow this as well; however, the second iterator will be empty. MRUnit should either match Hadoop's behavior or warn the user that their code is likely flawed.
    Reason: bugfix (API compatibility)
    Author: Aaron Kimball
    Ref: UNKNOWN
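    
    The single-pass behavior described above can be reproduced in a test harness with a wrapper like the one below. This is a sketch under assumptions: SinglePassIterable is a hypothetical name and is not the class MRUnit uses.

```java
import java.util.Iterator;

/** An Iterable whose contents can only be consumed once, like reducer values in Hadoop. */
final class SinglePassIterable<T> implements Iterable<T> {
    private final Iterator<T> source;

    SinglePassIterable(Iterable<T> values) {
        // Grab one iterator up front; every caller shares it.
        this.source = values.iterator();
    }

    @Override
    public Iterator<T> iterator() {
        // Returning the same iterator means a second loop starts where the
        // first one stopped, which is at the end.
        return source;
    }
}
```

    Passing reducer inputs through a wrapper like this makes a test fail in the same way the job would on a cluster, instead of silently passing on a list that can be re-read.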

commit c9d77f6e1fdbb24b45675e363e3bd5111533893a
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:24:10 2010 -0800

    HDFS-464. Memory leaks in libhdfs
    
    Description: hdfsExists never calls destroyLocalReference for jPath,
    hdfsDelete does not call it when it fails, and
    hdfsRename does not call it for jOldPath and jNewPath when it fails.
    Reason: bugfix
    Author: Christian Kunz
    Ref: UNKNOWN

commit c7996c5e2fbb9260740fec369550551d6320762a
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:23:51 2010 -0800

    HDFS-423. Unbreak FUSE build and fuse_dfs_wrapper.sh
    
    Description: fuse-dfs depends on libhdfs, and the fuse-dfs build.xml still points to the libhfds/libhdfs.so location, but libhdfs is now built in a different location.
    Please take a look at this bug for the location details:
    
    https://issues.apache.org/jira/browse/HADOOP-3344
    
    Thanks,
    Giri
    Reason: Build system bugfix
    Author: Eli Collins
    Ref: UNKNOWN

commit 72b0b791cd347e760807a44f5197599f57afde03
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:23:39 2010 -0800

    CLOUDERA-BUILD. Make bin/hadoop-config.sh work with dev builds
    
    Author: Eli Collins

commit a9466041ccfcdb07f4f0dd34a57c9e9bdd6a3e70
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:23:06 2010 -0800

    HDFS-727. bug setting block size hdfsOpenFile
    
    Description: In hdfsOpenFile in libhdfs invokeMethod needs to cast the block size argument to a jlong so a full 8 bytes are passed (rather than 4 plus some garbage which causes writes to fail due to a bogus block size).
    
    Reason: Bugfix
    Author: Eli Collins
    Ref: UNKNOWN

commit 4e7d205daa86d904614252101bb422664ab6d203
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:22:47 2010 -0800

    Revert MAPREDUCE-967. TaskTracker does not need to fully unjar job jars
    
    Author: Todd Lipcon
    Ref: UNKNOWN

commit d5f0c77a6c81e9e56da81976645614280247f7a2
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:22:18 2010 -0800

    HADOOP-5640. Allow ServicePlugins to hook callbacks into key service events
    
    Description: HADOOP-5257 (Export namenode/datanode functionality through a pluggable RPC layer) added the ability for NameNode and DataNode to start and stop ServicePlugin implementations at NN/DN start/stop. However, this is insufficient integration for some common use cases.
    
    We should add some functionality for Plugins to subscribe to events generated by the service they're plugging into. Some potential hook points are:
    
    NameNode:
      * new datanode registered
      * datanode has died
      * exception caught
      * etc?
    
    DataNode:
      * startup
      * initial registration with NN complete (this is important for HADOOP-4707 to sync up datanode.dnRegistration.name with the NN-side registration)
      * namenode reconnect
      * some block transfer hooks?
      * exception caught
    
    I see two potential routes for implementation:
    
    1) We make an enum for the types of hookpoints and have a general function in the ServicePlugin interface. Something like:
    
        enum HookPoint {
          DN_STARTUP,
          DN_RECEIVED_NEW_BLOCK,
          DN_CAUGHT_EXCEPTION,
          ...
        }
    
        void runHook(HookPoint hp, Object value);
    
    2) We make classes specific to each "pluggable" as was originally suggested in HADOOP-5257. Something like:
    
        class DataNodePlugin {
          void datanodeStarted() {}
          void receivedNewBlock(block info, etc) {}
          void caughtException(Exception e) {}
          ...
        }
    
    I personally prefer option (2) since we can ensure plugin API compatibility at compile-time, and we avoid an ugly switch statement in a runHook() function.
    
    Interested to hear what people's thoughts are here.
    
    HADOOP-5640 puts this in the new test dir. It needs to be in the old one.
    
    Reason: Improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit e9b04609d88ed5d1af442ee950aa5dcd6646e830
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:22:08 2010 -0800

    MAPREDUCE-1017. Compression and output splitting for Sqoop
    
    Description: Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 8c9b473e1af036a3e2cc9036a945a4567277db8a
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:21:14 2010 -0800

    HADOOP-6312. Configuration sends too much data to log4j
    
    Description: Configuration objects send a DEBUG-level log message every time they're instantiated, which includes a full stack trace. This is more appropriate for TRACE-level logging, as it renders other debug logs very hard to read.
    Reason: Logging improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 698fe169f31e54111d30e4420cd1c1c5eaeecdec
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:21:03 2010 -0800

    HDFS-686. NullPointerException is thrown while merging edit log and image
    
    Description: Our secondary name node fails to start because of a NullPointerException:
    ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.lang.NullPointerException
            at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1232)
            at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1221)
            at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:776)
            at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
            at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:590)
            at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:473)
            at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:350)
            at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:314)
            at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:225)
            at java.lang.Thread.run(Thread.java:619)
    
    This was caused by setting access time on a non-existent file.
    Reason: bugfix
    Author: Hairong Kuang
    Ref: UNKNOWN

commit b2cc8e02f37a1604bb076acefff0ebf016c249d5
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:20:40 2010 -0800

    MAPREDUCE-112. Reduce Input Records and Reduce Output Records counters are not being set when using the new Mapreduce reducer API
    
    Description: After running the examples/wordcount (which uses the new API), the reduce input and output record counters always show 0. This is because these counters are not getting updated in the new API.
    This adds counters for reduce input and output records to the new API.
    Reason: Bugfix
    Author: Jothi Padmanabhan
    Ref: UNKNOWN

commit 3e62477434542dc3de89fd43fd9b19abaf76f0de
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:20:00 2010 -0800

    MAPREDUCE-768. Configuration information should generate dump in a standard format.
    
    Description: We need to generate the configuration dump in a standard format.
    This adds the 'hadoop jobtracker -dumpConfiguration' command.
    This is modified from the original patch in that it does not dump QueueManager configuration.
    This is because we have not backported HADOOP-5396.
    
    Reason: New feature
    Author: V.V.Chaitanya Krishna
    Ref: UNKNOWN

commit 4d9333b00772455a1ca7a365fa5b5b2f6872abd7
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:19:46 2010 -0800

    HADOOP-6184. Provide a configuration dump in json format.
    
    Description: Configuration dump in json format.
    Reason: New feature
    Author: V.V.Chaitanya Krishna
    Ref: UNKNOWN

commit 96244c3e7d6735f450b618fdcbdbbf9a81436ba3
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:19:27 2010 -0800

    CLOUDERA-BUILD. Duplicated effort. FULL_VERSION already set in package.mk
    
    Description: Revert "Need to pass in FULL_VERSION"
    Author: Chad Metcalf

commit 604d3a71334b9340a6219e3b88bf563b79f5d083
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:19:11 2010 -0800

    CLOUDERA-BUILD. Copy the sqoop manpage to the expected version number
    
    Author: Chad Metcalf

commit 6d428f70591a92a90dca5256968c62a510659240
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:18:58 2010 -0800

    CLOUDERA-BUILD. Bump jdiff stable to 0.20.1
    
    Author: Chad Metcalf

commit 46ffc9aa9260a96bdf67fbaee9a2acd76cfcf675
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:18:44 2010 -0800

    CLOUDERA-BUILD. Need to pass in FULL_VERSION
    
    Author: Chad Metcalf

commit aa7ae9d9826866f94ecfe5629d087ef68e4b5c54
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:18:29 2010 -0800

    MAPREDUCE-999. Improve Sqoop test speed and refactor tests
    
    Description: Sqoop's tests take a long time to run, but this can be improved (by a factor of 2 or more) by taking advantage of jobclient.completion.poll.interval.
    Reason: Testing performance improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 084c390ed5fcb03c456121c8497759b40a74f809
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:18:13 2010 -0800

    MAPREDUCE-1089. Fair Scheduler preemption triggers NPE when tasks are scheduled but not running
    
    Description: We see exceptions like this when preemption runs when a task has been scheduled on a TT but has not yet started running.
    
    2009-10-09 14:30:53,989 INFO org.apache.hadoop.mapred.FairScheduler: Should preempt 2 MAP tasks for job_200910091420_0006: tasksDueToMinShare = 2, tasksDueToFairShare = 0
    2009-10-09 14:30:54,036 ERROR org.apache.hadoop.mapred.FairScheduler: Exception in fair scheduler UpdateThread
    java.lang.NullPointerException
            at org.apache.hadoop.mapred.FairScheduler$2.compare(FairScheduler.java:1015)
            at org.apache.hadoop.mapred.FairScheduler$2.compare(FairScheduler.java:1013)
            at java.util.Arrays.mergeSort(Arrays.java:1270)
            at java.util.Arrays.sort(Arrays.java:1210)
            at java.util.Collections.sort(Collections.java:159)
            at org.apache.hadoop.mapred.FairScheduler.preemptTasks(FairScheduler.java:1013)
            at org.apache.hadoop.mapred.FairScheduler.preemptTasksIfNecessary(FairScheduler.java:911)
            at org.apache.hadoop.mapred.FairScheduler$UpdateThread.run(FairScheduler.java:286)
    Reason: Bugfix
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 34ca2a5547398f9435a5d3d22603d0f7da420226
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:17:48 2010 -0800

    MAPREDUCE-551. Add preemption to the fair scheduler
    
    Description: Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746, A fair sharing job scheduler) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
    1) A job isn't getting its guaranteed share of the cluster for at least T1 seconds.
    2) A job is getting significantly less than its fair share for T2 seconds (e.g. less than half its share).
    
    T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
    
    When deciding which tasks to kill to make room for the job, we will use the following heuristics:
      * Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
      * For maps: kill tasks that have run for the least amount of time (limiting wasted time).
      * For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).
    
    This fixes an error in the previous backport where the
    EagerTaskInitializationListener wasn't properly passed the
    TaskTrackerManager before starting.
    
    Reason: New feature
    Author: Matei Zaharia
    Ref: UNKNOWN
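    
    The two preemption conditions (T1 for the guaranteed share, T2 for the fair share) can be sketched as a single predicate. The bookkeeping here is an assumption made for illustration: the sketch supposes the scheduler records the last time each job was at or above its guaranteed share and the last time it was at or above half its fair share. None of these names come from the patch.

```java
final class PreemptionCheck {
    private PreemptionCheck() {}

    /**
     * @param nowMs                 current time
     * @param lastAtMinShareMs      last time the job had its guaranteed share
     * @param lastAtHalfFairShareMs last time the job had at least half its fair share
     * @param minShareTimeoutMs     T1, expected to be the smaller timeout
     * @param fairShareTimeoutMs    T2, the last-resort timeout
     */
    static boolean shouldPreempt(long nowMs,
                                 long lastAtMinShareMs,
                                 long lastAtHalfFairShareMs,
                                 long minShareTimeoutMs,
                                 long fairShareTimeoutMs) {
        boolean starvedOfMinShare = nowMs - lastAtMinShareMs >= minShareTimeoutMs;
        boolean starvedOfFairShare = nowMs - lastAtHalfFairShareMs >= fairShareTimeoutMs;
        return starvedOfMinShare || starvedOfFairShare;
    }
}
```

    Keeping the two timeouts separate is what lets T1 be tuned per queue for quick guarantees while T2 stays a cluster-wide safety net.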

commit a3e29eff0b9337a1007ec1b90ccb832dca5c1d20
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:17:33 2010 -0800

    CLOUDERA-BUILD. Fix hadoop wrapper to properly pass through multiword quoted arguments
    
    Author: Todd Lipcon

commit 975647b6c3a6644cabbd48bf14e074a0efda2cb9
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:17:15 2010 -0800

    CLOUDERA-BUILD. Sqoop documentation is now part of the generated tarball. Updated the install script to reflect that change.
    
    Author: Matt Massie

commit 19c038a6af07e3999e83a2178d2328535e00dedb
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:16:55 2010 -0800

    CLOUDERA-BUILD. Generate the sqoop documentation and ensure that it's in the release tarball
    
    Author: Matt Massie

commit 6957626991875302f33bb73630f4f376412f9711
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:16:43 2010 -0800

    CLOUDERA-BUILD. More changes to get debs building correctly
    
    Author: Chad Metcalf

commit 67d1c732cea0eebf59de512301ae8f2a1cb2f349
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:16:30 2010 -0800

    CLOUDERA-BUILD. Reformatted Sqoop manpage asciidoc for CDH build process
    
    Author: Aaron Kimball

commit af158d6aa7ffe72d931bc4763ace7d4a299d077b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:16:14 2010 -0800

    CLOUDERA-BUILD. Only rerun libtoolize if version 2.2 is installed
    
    Author: Todd Lipcon

commit 586992381042e1b4ec8c9ece069561ad2e4dfcc0
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:15:42 2010 -0800

    HADOOP-6279. Add JVM memory usage to JvmMetrics
    
    Description: The JvmMetrics currently publish memory usage from the MemoryMXBean. This is useful, but doesn't include the total heap size (e.g., as displayed in the JT Web UI).
    
    It would be nice to expose Runtime.getRuntime().maxMemory() as part of JvmMetrics.
    
    It seems that Runtime.getRuntime().totalMemory() (used by the JT for "memory used") is the same as the 'memHeapCommittedM' which already exists.
    Reason: Metrics improvement
    Author: Todd Lipcon
    Ref: UNKNOWN
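    
    The three values discussed above can be read with standard JDK calls, as in this small sketch. The HeapReport class is hypothetical; only Runtime and MemoryMXBean are real APIs, and the sketch makes no claim about how JvmMetrics publishes them.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class HeapReport {
    private static final double MB = 1024.0 * 1024.0;

    private HeapReport() {}

    /** Upper bound the JVM will try to use for the heap, in megabytes. */
    static double maxMemoryMb() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    /** Memory currently reserved by the JVM, as reported by Runtime. */
    static double totalMemoryMb() {
        return Runtime.getRuntime().totalMemory() / MB;
    }

    /** Committed heap as reported by the MemoryMXBean. */
    static double committedHeapMb() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getCommitted() / MB;
    }
}
```

    Printing totalMemoryMb() next to committedHeapMb() on a given JVM is a quick way to check the description's observation that the two appear to track the same quantity.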

commit 7c168a8a2613d93e19508a91e7c4db3b3cfb503b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:15:26 2010 -0800

    HADOOP-6269. Missing synchronization for defaultResources in Configuration.addResource
    
    Description: Configuration.defaultResources is a simple ArrayList. In two places in Configuration it is accessed without appropriate synchronization, which we've seen to occasionally result in ConcurrentModificationExceptions.
    Reason: bugfix (race condition)
    Author: Sreekanth Ramakrishnan
    Ref: UNKNOWN
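    
    One way to make a shared resource list safe to iterate while another thread (or the loop itself) adds to it is a copy-on-write list, sketched below. This is an alternative technique shown for illustration; the description says the fix concerns missing synchronization, and this sketch does not claim to match the patch. The class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class DefaultResources {
    // Writers copy the backing array; readers iterate over a stable snapshot.
    private final CopyOnWriteArrayList<String> resources = new CopyOnWriteArrayList<>();

    /** Adds a resource name once; duplicates are ignored. */
    void add(String name) {
        resources.addIfAbsent(name);
    }

    /** Read-only view; iteration never throws ConcurrentModificationException. */
    List<String> view() {
        return Collections.unmodifiableList(resources);
    }
}
```

    With a plain ArrayList, adding to the list from inside a loop over it (or from another thread) is what produces the ConcurrentModificationException mentioned in the description. Copy-on-write suits this case because default resources are added rarely and read often.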

commit 8bf845170decdcb12254bc1dc98ccbf0fda7d233
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:15:01 2010 -0800

    CLOUDERA-BUILD. Recreate c++ configure files during build if we have the right build dependencies
    
    Author: Todd Lipcon

commit e7e9812fa7a6a256652f2f6bbb269334f883c53b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:14:43 2010 -0800

    CLOUDERA-BUILD. Package sqoop docs w/o requiring asciidoc
    
    Author: Chad Metcalf
    Ref: UNKNOWN

commit 7171eabfad501d635b1da9e0287f50e025b4a83f
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:13:39 2010 -0800

    CLOUDERA-BUILD. Revert "Package sqoop docs."
    
    Description: This reverts packaging of sqoop documentation in preparation
    for including MAPREDUCE-906 properly after it has been committed
    to Apache.
    Author: Chad Metcalf
    Ref: UNKNOWN

commit 4bd437c9d70f2c0d68047e0376a7af21cc4a70e0
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:13:17 2010 -0800

    HADOOP-5891. If dfs.http.address is default, SecondaryNameNode can't find NameNode
    
    Description: As detailed in this blog post:
    http://www.cloudera.com/blog/2009/02/10/multi-host-secondarynamenode-configuration/
    if dfs.http.address is not configured, and the 2NN is a different machine from the NN, the 2NN fails to connect.
    
    In SecondaryNameNode.getInfoServer, the 2NN should notice a "0.0.0.0" dfs.http.address and, in that case, pull the hostname out of fs.default.name. This would fix the default configuration to work properly for most users.
    Reason: Configuration improvement
    Author: Todd Lipcon
    Ref: UNKNOWN
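    
    The fallback described above can be sketched as a pure function over the two configuration values. This is an illustration, not the body of SecondaryNameNode.getInfoServer: the class name is hypothetical, and the host names and port numbers in the usage note are example values.

```java
import java.net.URI;

final class InfoServerAddress {
    private InfoServerAddress() {}

    /**
     * @param httpAddress  value of dfs.http.address, in host:port form
     * @param defaultFsUri value of fs.default.name, e.g. an hdfs:// URI
     * @return host:port with a wildcard host replaced by the filesystem's host
     */
    static String resolve(String httpAddress, String defaultFsUri) {
        int colon = httpAddress.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected host:port but got " + httpAddress);
        }
        String host = httpAddress.substring(0, colon);
        String port = httpAddress.substring(colon + 1);
        if ("0.0.0.0".equals(host)) {
            // The wildcard only makes sense on the NameNode itself, so borrow
            // the host that clients already use to reach the filesystem.
            host = URI.create(defaultFsUri).getHost();
        }
        return host + ":" + port;
    }
}
```

    For example, resolve("0.0.0.0:50070", "hdfs://nn.example.com:8020") yields "nn.example.com:50070", while an explicitly configured address is returned unchanged.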

commit 74e10e4a137b2aa60ab39186115350b5e82464fc
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:11:50 2010 -0800

    HDFS-127. DFSClient block read failures cause open DFSInputStream to become unusable
    
    Description: We are using some Lucene indexes directly from HDFS and for quite long time we were using Hadoop version 0.15.3.
    
    <p>When tried to upgrade to Hadoop 0.19 - index searches started to fail with exceptions like:<br/>
    2008-11-13 16:50:20,314 WARN <span class="error">&#91;Listener-4&#93;</span> [] DFSClient : DFS Read: java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis<br/>
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)<br/>
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)<br/>
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)<br/>
    at java.io.DataInputStream.read(DataInputStream.java:132)<br/>
    at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174)<br/>
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)<br/>
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)<br/>
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)<br/>
    at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)<br/>
    at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)<br/>
    at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)<br/>
    at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)<br/>
    at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)<br/>
    at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54) <br/>
    ...</p>
    
    <p>The investigation showed that the root cause was that we exceeded the number of xcievers in the datanodes; that was fixed by raising the configured limit to 2k.<br/>
    However, one thing that bothered me was that even after the datanodes recovered from the overload and most of the client servers had been shut down, we still observed errors in the logs of the running servers.<br/>
    Further investigation showed that the fix for <a href="http://issues.apache.org/jira/browse/HADOOP-1911" title="infinite loop in dfs -cat command."><del>HADOOP-1911</del></a> introduced another problem: a DFSInputStream instance may become unusable once the number of failures over the lifetime of the instance exceeds the configured threshold.</p>
    
    <p>The fix for this specific issue seems trivial: just reset the failure counter before reading the next block (patch will be attached shortly).</p>
    
    <p>This also seems to be related to HADOOP-3185, but I'm not sure I really understand the necessity of keeping track of failed block accesses in the DFS client.</p>
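
    The difference between a lifetime failure counter and a per-block counter can be sketched as follows. This is a minimal illustration, not the DFSClient code: the class BlockReaderSketch, its fields, and the maxFailures threshold are hypothetical names.

```java
// Hypothetical sketch; names do not correspond to the real DFSClient API.
final class BlockReaderSketch {
    private final int maxFailures;        // failures tolerated before giving up
    private final boolean resetPerBlock;  // true models the proposed fix
    private int failures = 0;

    BlockReaderSketch(int maxFailures, boolean resetPerBlock) {
        this.maxFailures = maxFailures;
        this.resetPerBlock = resetPerBlock;
    }

    /**
     * Simulates reading one block that fails transientFailures times
     * before succeeding. Returns false once the failure budget is exceeded.
     */
    boolean readBlock(int transientFailures) {
        if (resetPerBlock) {
            failures = 0; // each block starts with a fresh budget
        }
        for (int i = 0; i < transientFailures; i++) {
            failures++;
            if (failures > maxFailures) {
                return false; // models "Could not obtain block"
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BlockReaderSketch lifetime = new BlockReaderSketch(3, false);
        BlockReaderSketch perBlock = new BlockReaderSketch(3, true);
        for (int block = 0; block < 4; block++) {
            System.out.println("block " + block
                + " lifetime=" + lifetime.readBlock(1)
                + " perBlock=" + perBlock.readBlock(1));
        }
    }
}
```

    With one transient failure per block, the lifetime counter runs out of budget on the fourth block, while the per-block counter keeps succeeding.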
    
        HADOOP-4681: Also referenced
    
        This as-yet-uncommitted patch is recommended by HBase people.
        Applied patch "4681.patch" attached to the JIRA on 2008-11-18.
    
    Reason: Bugfix
    Author: Igor Bolotin
    Ref: UNKNOWN

commit ca547d89042fff3a38c0c93b6e0ece78e74ae064
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:11:10 2010 -0800

    HADOOP-4655. FileSystem.CACHE should be ref-counted
    
    Description: FileSystem.CACHE is not ref-counted, and could lead to resource leakage.
    Adds new method FileSystem.newInstance() that always returns a newly allocated
    FileSystem object.
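
    The contrast between a shared cached object and a freshly allocated one might look like the sketch below. It is not Hadoop's FileSystem implementation; HandleCacheSketch, Handle, get, and this newInstance are hypothetical stand-ins used only to show why a caller may want an instance it owns outright.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; this is not Hadoop's FileSystem or its cache.
final class HandleCacheSketch {
    /** Stand-in for a FileSystem-like resource that can be closed. */
    static final class Handle {
        final String uri;
        boolean closed = false;
        Handle(String uri) { this.uri = uri; }
        void close() { closed = true; }
    }

    private final Map<String, Handle> cache = new HashMap<>();

    /** Returns the shared cached handle for a URI, creating it on first use. */
    synchronized Handle get(String uri) {
        Handle h = cache.get(uri);
        if (h == null) {
            h = new Handle(uri);
            cache.put(uri, h);
        }
        return h;
    }

    /** Always returns a new handle that the caller owns and may close safely. */
    Handle newInstance(String uri) {
        return new Handle(uri);
    }

    public static void main(String[] args) {
        HandleCacheSketch fs = new HandleCacheSketch();
        Handle a = fs.get("hdfs://nn/");
        Handle b = fs.get("hdfs://nn/");
        a.close(); // also closes b, because both names refer to one object
        System.out.println("shared handle closed for b: " + b.closed);
        Handle c = fs.newInstance("hdfs://nn/");
        Handle d = fs.newInstance("hdfs://nn/");
        c.close(); // d is unaffected
        System.out.println("private handle d closed: " + d.closed);
    }
}
```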
    Reason: Bugfix
    Author: dhruba borthakur
    Ref: UNKNOWN

commit 15660507606b32c3c6c2878f8ed69fe106119bc9
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:10:51 2010 -0800

    MAPREDUCE-967. TaskTracker does not need to fully unjar job jars
    
    Description: In practice we have seen some users submitting job jars that consist of 10,000+ classes. Unpacking these jars into mapred.local.dir and then cleaning up after them has a significant cost (both in wall-clock time and in unnecessarily heavy disk utilization). This cost can be easily avoided.
    Reason: Performance improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 648e30e074a16de837fb4c604a198bc780c2e6c5
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:10:34 2010 -0800

    MAPREDUCE-968. NPE in distcp encountered when placing _logs directory on S3FileSystem
    
    Description: If distcp is pointed to an empty S3 bucket as the destination for an s3:// filesystem transfer, it will fail with the following exception:
    
    <p>Copy failed: java.lang.NullPointerException<br/>
    at org.apache.hadoop.fs.s3.S3FileSystem.makeAbsolute(S3FileSystem.java:121)<br/>
    at org.apache.hadoop.fs.s3.S3FileSystem.getFileStatus(S3FileSystem.java:332)<br/>
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:633)<br/>
    at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1005)<br/>
    at org.apache.hadoop.tools.DistCp.copy(DistCp.java:650)<br/>
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:857)<br/>
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)<br/>
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)<br/>
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:884) </p>
    Reason: Bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit a61718b87c36dbeddcc6f9917438f81ebdda0214
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:10:22 2010 -0800

    HADOOP-6133. ReflectionUtils performance regression
    
    Description: <a href="http://issues.apache.org/jira/browse/HADOOP-4187" title="Create a MapReduce-specific ReflectionUtils that handles JobConf and JobConfigurable"><del>HADOOP-4187</del></a> introduced extra calls to Class.forName in ReflectionUtils.setConf. This caused a fairly large performance regression. Attached is a microbenchmark that shows the following timings for 100M constructions of new instances:
    
    <p>Explicit construction (new Test): ~1.6sec<br/>
    Using Test.class.newInstance: ~2.6sec<br/>
    ReflectionUtils on 0.18.3: ~8.0sec<br/>
    ReflectionUtils on 0.20.0: ~200sec</p>
    
    <p>This illustrates the ~80x slowdown caused by <a href="http://issues.apache.org/jira/browse/HADOOP-4187" title="Create a MapReduce-specific ReflectionUtils that handles JobConf and JobConfigurable"><del>HADOOP-4187</del></a>.</p>
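
    One common way to keep repeated Class.forName calls off a hot path is to cache the lookup result. The sketch below shows that general technique only; it is an assumption for illustration and not the ReflectionUtils code or the committed patch. The class ClassLookupCacheSketch and its methods are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not the ReflectionUtils code or the committed patch.
final class ClassLookupCacheSketch {
    private static final ConcurrentMap<String, Class<?>> CACHE = new ConcurrentHashMap<>();
    private static final AtomicInteger SLOW_LOOKUPS = new AtomicInteger();

    /** Resolves a class by name, calling Class.forName only on a cache miss. */
    static Class<?> lookup(String name) {
        Class<?> cached = CACHE.get(name);
        if (cached != null) {
            return cached;
        }
        try {
            SLOW_LOOKUPS.incrementAndGet();
            Class<?> loaded = Class.forName(name);
            CACHE.putIfAbsent(name, loaded);
            return loaded;
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("unknown class: " + name, e);
        }
    }

    /** Number of times Class.forName was actually invoked. */
    static int slowLookups() {
        return SLOW_LOOKUPS.get();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            lookup("java.util.ArrayList");
        }
        System.out.println("Class.forName calls: " + slowLookups());
    }
}
```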
    Reason: Performance improvement
    Author: Todd Lipcon
    Ref: UNKNOWN
    
    commit 5e299f831420ed52569eefc5ba815359a0ebc64e
    Author: Chad Metcalf <chad@cloudera.com>
    Date:   Tue Sep 15 22:21:42 2009 -0700
    
        HADOOP-6133: ReflectionUtils performance regression

commit b6f790774d34ed34bb7c649142dc770c25121ac3
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:10:13 2010 -0800

    HADOOP-5981. HADOOP-2838 doesnt work as expected
    
    Description: The substitution feature, i.e. X=$X:/tmp, does not work as expected.
    
    <p>This issue completes the feature mentioned in <a href="http://issues.apache.org/jira/browse/HADOOP-2838" title="Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni"><del>HADOOP-2838</del></a>. <a href="http://issues.apache.org/jira/browse/HADOOP-2838" title="Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni"><del>HADOOP-2838</del></a> provided a way to set env variables in the child process. This issue provides a way to inherit the tasktracker's env variables and append to or reset them. So now<br/>
    X=$X:y will inherit X (if set) and append y to it.</p>
    Reason: Bugfix
    Author: Amar Kamat
    Ref: UNKNOWN
    
    commit eb635e4de3a8b2b5bd9f34225770f24be42dcd83
    Author: Chad Metcalf <chad@cloudera.com>
    Date:   Tue Sep 15 22:29:50 2009 -0700
    
        HADOOP-5981: HADOOP-2838 doesnt work as expected

commit 5d4e93d8e0df3c445f56c5eb51965eef92bebd78
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:09:46 2010 -0800

    HADOOP-2838. Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni
    
    Description: Currently there is no way to configure Hadoop to use external JNI directories. I propose we add a new variable like HADOOP_CLASS_PATH that is added to the JAVA_LIBRARY_PATH before the process is run.
    
    <p>Now the users can set environment variables using mapred.child.env. They can do the following <br/>
    X=Y : set X to Y<br/>
    X=$X:Y : Append Y to X (which should be taken from the tasktracker)</p>
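
    The two forms above can be sketched as a small resolver. This is an illustration under stated assumptions, not the TaskTracker's parsing code: ChildEnvSketch and resolve are hypothetical names, and expanding an undefined variable to the empty string is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the TaskTracker's real parsing code may differ.
final class ChildEnvSketch {
    /**
     * Resolves one "NAME=VALUE" setting against the parent (tasktracker)
     * environment. A "$NAME" reference inside VALUE is replaced by the
     * parent's value of NAME, or by the empty string if the parent does
     * not define it (an assumption of this sketch).
     */
    static String resolve(String setting, Map<String, String> parentEnv) {
        int eq = setting.indexOf('=');
        if (eq <= 0) {
            throw new IllegalArgumentException("expected NAME=VALUE: " + setting);
        }
        String name = setting.substring(0, eq);
        String value = setting.substring(eq + 1);
        String inherited = parentEnv.get(name);
        return value.replace("$" + name, inherited == null ? "" : inherited);
    }

    public static void main(String[] args) {
        Map<String, String> parent = new HashMap<>();
        parent.put("LD_LIBRARY_PATH", "/usr/lib");
        // Append form: inherits the parent's value and appends to it.
        System.out.println(resolve("LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/tmp", parent));
        // Plain form: sets the variable outright.
        System.out.println(resolve("X=Y", parent));
    }
}
```

    The plain textual replacement would also match a longer variable name that starts with the same prefix; a stricter parser would need to tokenize the value.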
    Reason: Improves job launch flexibility
    Author: Amar Kamat
    Ref: UNKNOWN
    
    commit 9b3fc32fa793b338dc700a7f6c437402f80d6b7f
    Author: Chad Metcalf <chad@cloudera.com>
    Date:   Tue Sep 15 22:09:57 2009 -0700
    
        HADOOP-2838: Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni

commit 877429c3f94a1e937fbe29b4cbe8da573831d802
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:09:31 2010 -0800

    MAPREDUCE-814. Move completed Job history files to HDFS
    
    Description: Currently completed job history files remain on the jobtracker node. Having the files available on HDFS will enable clients to access these files more easily.
    Reason: New feature
    Author: Sharad Agarwal
    Ref: UNKNOWN
    
    commit c0575c0908fee4ec01f5bc0abbd7f4b2254dd38e
    Author: Chad Metcalf <chad@cloudera.com>
    Date:   Tue Sep 15 18:15:17 2009 -0700
    
        MAPREDUCE-814: Move completed Job history files to HDFS

commit a8bf06eac5312ede0982118801e4495285a442fe
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:08:12 2010 -0800

    MAPREDUCE-693. Conf files not moved to "done" subdirectory after JT restart
    
    Description: After <a href="http://issues.apache.org/jira/browse/MAPREDUCE-516" title="Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs"><del>MAPREDUCE-516</del></a>, when a job is submitted and the JT is restarted (before job files have been written) and the job is killed after recovery, the conf files fail to be moved to the "done" subdirectory.<br/>
    The exact scenario to reproduce this issue is:
    <ul>
    	<li>Submit a job</li>
    	<li>Restart JT before anything is written to the job files</li>
    	<li>Kill the job</li>
    	<li>The old conf files remain in the history folder and fail to be moved to "done" subdirectory</li>
    </ul>
    
    Reason: Bugfix
    Author: Amar Kamat
    Ref: UNKNOWN

commit cc22e9f92db6470d244fb17f57601b93bab6db80
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:07:55 2010 -0800

    MAPREDUCE-683. TestJobTrackerRestart fails with Map task completion events ordering mismatch
    
    Description: <tt>TestJobTrackerRestart</tt> fails consistently with Map task completion events ordering mismatch error.
    Reason: Bugfix
    Author: Amar Kamat
    Ref: UNKNOWN

commit 57a67dff5d15e3833c7968254df076e440de2765
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:07:39 2010 -0800

    MAPREDUCE-416. Move the completed jobs' history files to a DONE subdirectory inside the configured history directory
    
    Description: Whenever a job completes, the history file can be moved to a directory called DONE. That would make the management of job history files easier (for example, administrators can move the history files from that directory to some other place, delete them, archive them, etc.).
    Reason: System management improvement
    Author: Amar Kamat
    Ref: UNKNOWN

commit 99dfdb9a98e1ebd643f47877be3541962c32dcd0
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:07:18 2010 -0800

    HADOOP-5733. Add map/reduce slot capacity and lost map/reduce slot capacity to JobTracker metrics
    
    Description: It would be nice to have the actual map/reduce slot capacity and the lost map/reduce slot capacity (# of blacklisted nodes * map-slot-per-node or reduce-slot-per-node). This information can be used to calculate a JT view of slot utilization.
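
    The arithmetic described above can be written out as a short sketch. The names SlotCapacitySketch, lostSlots, and utilization are hypothetical, and treating utilization as occupied slots divided by total minus lost slots is an assumption about how the two metrics would be combined.

```java
// Hypothetical sketch of the arithmetic; names are illustrative only.
final class SlotCapacitySketch {
    /** Lost capacity = number of blacklisted nodes * slots per node. */
    static int lostSlots(int blacklistedNodes, int slotsPerNode) {
        return blacklistedNodes * slotsPerNode;
    }

    /** Fraction of usable slots (total minus lost) that are occupied. */
    static double utilization(int occupiedSlots, int totalSlots, int lostSlots) {
        int usable = totalSlots - lostSlots;
        return usable <= 0 ? 0.0 : (double) occupiedSlots / usable;
    }

    public static void main(String[] args) {
        int total = 100 * 4;           // 100 nodes with 4 map slots each
        int lost = lostSlots(5, 4);    // 5 blacklisted nodes
        System.out.println("lost map slots: " + lost);
        System.out.println("map slot utilization: " + utilization(190, total, lost));
    }
}
```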
    Reason: Metrics improvement
    Author: Sreekanth Ramakrishnan
    Ref: UNKNOWN

commit 955fe9433b13f21079f92e4035393b683486ad07
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:05:59 2010 -0800

    HADOOP-5738. Split waiting tasks field in JobTracker metrics to individual tasks
    
    Description: Currently, the job tracker metrics report waiting tasks as a single field. It would be better to split waiting tasks into maps and reduces.
    Reason: User experience improvement
    Author: Sreekanth Ramakrishnan
    Ref: UNKNOWN

commit 3b8f77cd452c1098c6af5907b787bf9167df806b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:05:48 2010 -0800

    HADOOP-5442. The job history display needs to be paged
    
    Description: Currently the job history page tries to render the entire list of jobs that have run. That doesn't scale as more and more jobs run on a job tracker.
    Reason: Scalability improvement
    Author: Amar Kamat
    Ref: UNKNOWN

commit dfac0482267aaf0fabac97c163e0015306ec5b16
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:05:16 2010 -0800

    HADOOP-4842. Streaming combiner should allow command, not just JavaClass
    
    Description: Streaming jobs are way slower than Java jobs for many reasons, but certainly stopping the shell-only programmer from using the combiner feature won't help. Right now, the streaming usage says:
    
    <blockquote>
    <p>  -mapper   &lt;cmd|JavaClassName&gt;      The streaming command to run<br/>
      -combiner &lt;JavaClassName&gt; Combiner has to be a Java class<br/>
      -reducer  &lt;cmd|JavaClassName&gt;      The streaming command to run</p></blockquote>
    Reason: Usability improvement
    Author: Amareshwari Sriramadasu
    Ref: UNKNOWN

commit 33e4f0a87effa466914e292488c47977245edc96
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:04:06 2010 -0800

    MAPREDUCE-987. Exposing MiniDFS and MiniMR clusters as a single process command-line
    
    Description: It's hard to test non-Java programs that rely on significant mapreduce functionality.  The patch I'm proposing shortly will let you just type "bin/hadoop jar hadoop-hdfs-hdfswithmr-test.jar minicluster" to start a cluster (internally, it's using Mini{MR,HDFS}Cluster) with a specified number of daemons, etc.  A test that checks how some external process interacts with Hadoop might start minicluster as a subprocess, run through its thing, and then simply kill the java subprocess.
    
    <p>I've been using just such a system for a couple of weeks, and I like it.  It's significantly easier than developing a lot of scripts to start a pseudo-distributed cluster, and then clean up after it.  I figure others might find it useful as well.</p>
    
    <p>I'm at a bit of a loss as to where to put it in 0.21.  hdfs-with-mr tests have all the required libraries, so I've put it there.  I could conceivably split this into "minimr" and "minihdfs", but it's specifically the fact that they're configured to talk to each other that I like about having them together.  And one JVM is better than two for my test programs.</p>
    Reason: Testing feature
    Author: Philip Zeyliger
    Ref: UNKNOWN

commit 39ff7e5ee285df97c765a73271066df718be0e30
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 17:03:23 2010 -0800

    HADOOP-6267. build-contrib.xml unnecessarily enforces that contrib projects be located in contrib/ dir
    
    Description: build-contrib.xml currently sets hadoop.root to ${basedir}/../../../. This path is relative to the contrib project which is assumed to be inside src/contrib/. We occasionally work on contrib projects in other repositories until they're ready to contribute. We can use the &lt;dirname&gt; ant task to do this more correctly.
    Reason: Build system improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 139bea6660193cc73852832e03fe570437343e96
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 15:02:55 2010 -0800

    HDFS-528. Add ability for safemode to wait for a minimum number of live datanodes
    
    Description: When starting up a fresh cluster programmatically, users often want to wait until DFS is "writable" before continuing in a script. "dfsadmin -safemode wait" doesn't quite work for this on a completely fresh cluster, since when there are 0 blocks on the system, 100% of them are accounted for before any DNs have reported.
    
    <p>This JIRA is to add a command which waits until a certain number of DNs have reported as alive to the NN.</p>
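
    Why the block-ratio check alone is not enough on an empty cluster can be sketched as follows. SafeModeSketch and its methods are hypothetical names, and combining the two conditions in one canLeave check is an assumption made for illustration, not the NameNode's logic.

```java
// Hypothetical sketch; not the NameNode's safe mode implementation.
final class SafeModeSketch {
    /** Block-ratio check: with zero total blocks the threshold is trivially met. */
    static boolean blockThresholdReached(long safeBlocks, long totalBlocks, double thresholdPct) {
        if (totalBlocks == 0) {
            return true; // 100% of zero blocks are accounted for
        }
        return (double) safeBlocks / totalBlocks >= thresholdPct;
    }

    /** Adds the extra condition: a minimum number of live datanodes must have reported. */
    static boolean canLeave(long safeBlocks, long totalBlocks, double thresholdPct,
                            int liveDatanodes, int minDatanodes) {
        return blockThresholdReached(safeBlocks, totalBlocks, thresholdPct)
            && liveDatanodes >= minDatanodes;
    }

    public static void main(String[] args) {
        // Fresh cluster: no blocks yet and no datanodes have reported.
        System.out.println("ratio check only: " + blockThresholdReached(0, 0, 0.999));
        System.out.println("with min datanodes: " + canLeave(0, 0, 0.999, 0, 1));
    }
}
```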
    Reason: New feature
    Author: Todd Lipcon
    Ref: UNKNOWN

commit b301746d45bde2759535549f87c6485f4ee577b2
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 15:02:38 2010 -0800

    HADOOP-4936. Improvements to TestSafeMode
    
    Description: TestSafeMode
    <ul class="alternate" type="square">
    	<li>needs a detailed description of the test case</li>
    	<li>should not call the name-node directly, but rather call <tt>DistributedFileSystem</tt> methods.</li>
    </ul>
    
    Reason: Test coverage improvement
    Author: Konstantin Shvachko
    Ref: UNKNOWN

commit f04a321596a513e71354f2a6829b44e474077507
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 15:02:22 2010 -0800

    HADOOP-5650. Namenode log that indicates why it is not leaving safemode may be confusing
    
    Description: A namenode with a large number of datablocks is set up with dfs.safemode.threshold.pct set to 1.0. With a small number of unreported blocks, the namenode prints the following as the reason for not leaving safe mode:<br/>
    <tt>The ratio of reported blocks 1.0000 has not reached the threshold 1.0000</tt>
    
    <p>With a large number of blocks, the precision used when printing the log may hide the difference between the actual ratio of safe blocks to total blocks and the configured threshold. Printing the number of blocks instead of the ratio will improve clarity.</p>
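
    The rounding effect is easy to reproduce. In the sketch below, SafeModeLogSketch and the wording of the count-based message are hypothetical; only the ratio-based message text is taken from the description above.

```java
import java.util.Locale;

// Hypothetical sketch; the count-based message wording is illustrative only.
final class SafeModeLogSketch {
    /** Ratio-based message: four decimal places hide a handful of missing blocks. */
    static String ratioMessage(long safeBlocks, long totalBlocks, double threshold) {
        double ratio = (double) safeBlocks / totalBlocks;
        return String.format(Locale.ROOT,
            "The ratio of reported blocks %.4f has not reached the threshold %.4f",
            ratio, threshold);
    }

    /** Count-based message: the shortfall is visible regardless of cluster size. */
    static String countMessage(long safeBlocks, long totalBlocks, double threshold) {
        long needed = (long) Math.ceil(totalBlocks * threshold);
        return String.format(Locale.ROOT,
            "Reported blocks %d has not reached the required %d of %d total blocks",
            safeBlocks, needed, totalBlocks);
    }

    public static void main(String[] args) {
        // 10 of 10,000,000 blocks are still unreported.
        System.out.println(ratioMessage(9_999_990L, 10_000_000L, 1.0));
        System.out.println(countMessage(9_999_990L, 10_000_000L, 1.0));
    }
}
```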
    Reason: User experience improvement
    Author: Suresh Srinivas
    Ref: UNKNOWN

commit 13e35e654c51a5b1cfe809ef1e2c4d2ca46ed612
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 15:01:52 2010 -0800

    HADOOP-4675. Current Ganglia metrics implementation is incompatible with Ganglia 3.1
    
    Description: Ganglia changed its wire protocol in the 3.1.x series; the current implementation only works for 3.0.x.
    
    Patched using
    https://issues.apache.org/jira/secure/attachment/12407207/HADOOP-4675-v7.patch
    
    Reason: Compatibility improvement
    Author: Brian Bockelman
    Ref: UNKNOWN
    
    commit dcf76896b1c8a7b891995b1546eef6ea3018e7ca
    Author: Philip Zeyliger <philip@cloudera.com>
    Date:   Tue Jul 28 15:28:18 2009 -0700
    
        HADOOP-4675. Current Ganglia metrics implementation is incompatible with Ganglia 3.1
    
        Patched using
        https://issues.apache.org/jira/secure/attachment/12407207/HADOOP-4675-v7.patch

commit 4305750d026b895b3afbd0d4a4ee4b3b42596016
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 15:01:29 2010 -0800

    HADOOP-6269. Missing synchronization for defaultResources in Configuration.addResource
    
    Description: Configuration.defaultResources is a simple ArrayList. In two places in Configuration it is accessed without appropriate synchronization, which we've seen to occasionally result in ConcurrentModificationExceptions.
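
    A minimal sketch of guarding a plain ArrayList with a single lock follows. It is not the Configuration code: DefaultResourcesSketch and its methods are hypothetical, and handing out a copy for iteration is one possible approach rather than necessarily what the patch does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the actual Configuration implementation.
final class DefaultResourcesSketch {
    private final List<String> resources = new ArrayList<>();

    /** Mutation holds the same lock as every read of the list. */
    synchronized void addResource(String name) {
        if (!resources.contains(name)) {
            resources.add(name);
        }
    }

    /** Returns a copy made under the lock, so callers can iterate without racing writers. */
    synchronized List<String> snapshot() {
        return new ArrayList<>(resources);
    }

    public static void main(String[] args) throws InterruptedException {
        DefaultResourcesSketch conf = new DefaultResourcesSketch();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                conf.addResource("resource-" + i);
            }
        });
        writer.start();
        // Iterating snapshots while the writer runs never throws
        // ConcurrentModificationException, because each copy is private.
        long seen = 0;
        for (int i = 0; i < 100; i++) {
            for (String ignored : conf.snapshot()) {
                seen++;
            }
        }
        writer.join();
        System.out.println("entries visited while writing: " + seen);
        System.out.println("final size: " + conf.snapshot().size());
    }
}
```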
    Reason: Bugfix (race condition)
    Author: Sreekanth Ramakrishnan
    Ref: UNKNOWN

commit 90f9c40df18fe464383de52e3d3952638a393e34
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 15:01:08 2010 -0800

    CLOUDERA-BUILD. Make some JT methods and classes public for use from within contrib plugins
    
    Author: Henry Robinson

commit f8e0599a434e1ce94158384f575e912e9f988229
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:59:40 2010 -0800

    MAPREDUCE-461. Enable ServicePlugins for the JobTracker
    
    Description: Allow ServicePlugins (see <a href="http://issues.apache.org/jira/browse/HADOOP-5257" title="Export namenode/datanode functionality through a pluggable RPC layer"><del>HADOOP-5257</del></a>) for the JobTracker.
    (Relies on HADOOP-5640)
    Reason: API Improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit c58318cfa6e26b7dbacd4093d646fc8b66f9eda6
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:58:23 2010 -0800

    HADOOP-5640. Allow ServicePlugins to hook callbacks into key service events
    
    Description: <a href="http://issues.apache.org/jira/browse/HADOOP-5257" title="Export namenode/datanode functionality through a pluggable RPC layer"><del>HADOOP-5257</del></a> added the ability for NameNode and DataNode to start and stop ServicePlugin implementations at NN/DN start/stop. However, this is insufficient integration for some common use cases.
    
    <p>We should add some functionality for Plugins to subscribe to events generated by the service they're plugging into. Some potential hook points are:</p>
    
    <p>NameNode:</p>
    <ul class="alternate" type="square">
    	<li>new datanode registered</li>
    	<li>datanode has died</li>
    	<li>exception caught</li>
    	<li>etc?</li>
    </ul>
    
    <p>DataNode:</p>
    <ul class="alternate" type="square">
    	<li>startup</li>
    	<li>initial registration with NN complete (this is important for HADOOP-4707 to sync up datanode.dnRegistration.name with the NN-side registration)</li>
    	<li>namenode reconnect</li>
    	<li>some block transfer hooks?</li>
    	<li>exception caught</li>
    </ul>
    
    <p>I see two potential routes for implementation:</p>
    
    <p>1) We make an enum for the types of hookpoints and have a general function in the ServicePlugin interface. Something like:</p>
    
    <div class="code panel" style="border-width: 1px;"><div class="codeContent panelContent">
    <pre class="code-java"><span class="code-keyword">enum</span> HookPoint {
      DN_STARTUP,
      DN_RECEIVED_NEW_BLOCK,
      DN_CAUGHT_EXCEPTION,
     ...
    }
    
    void runHook(HookPoint hp, <span class="code-object">Object</span> value);</pre>
    </div></div>
    
    <p>2) We make classes specific to each "pluggable" as was originally suggested in HADOOP-5257. Something like:</p>
    
    <div class="code panel" style="border-width: 1px;"><div class="codeContent panelContent">
    <pre class="code-java">class DataNodePlugin {
      void datanodeStarted() {}
      void receivedNewBlock(block info, etc) {}
      void caughtException(Exception e) {}
      ...
    }</pre>
    </div></div>
    
    <p>I personally prefer option (2) since we can ensure plugin API compatibility at compile-time, and we avoid an ugly switch statement in a runHook() function.</p>
    
    <p>Interested to hear what people's thoughts are here.</p>
    Reason: API Improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 137999a0b48a81bed10a5f30868dbfe6d176956b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:58:09 2010 -0800

    HADOOP-5257. Export namenode/datanode functionality through a pluggable RPC layer
    
    Description: Adding support for pluggable components would allow exporting DFS functionality using arbitrary protocols, like Thrift or Protocol Buffers. I'm opening this issue on Dhruba's suggestion in HADOOP-4707.
    
    <p>Plug-in implementations would extend this base class:</p>
    
    <div class="code panel" style="border-width: 1px;"><div class="codeContent panelContent">
    <pre class="code-java"><span class="code-keyword">abstract</span> class Plugin {
    
        <span class="code-keyword">public</span> <span class="code-keyword">abstract</span> void datanodeStarted(DataNode datanode);
    
        <span class="code-keyword">public</span> <span class="code-keyword">abstract</span> void datanodeStopping();
    
        <span class="code-keyword">public</span> <span class="code-keyword">abstract</span> void namenodeStarted(NameNode namenode);
    
        <span class="code-keyword">public</span> <span class="code-keyword">abstract</span> void namenodeStopping();
    }</pre>
    </div></div>
    
    <p>Name node instances would then start the plug-ins according to a configuration object, and would also shut them down when the node goes down:</p>
    
    <div class="code panel" style="border-width: 1px;"><div class="codeContent panelContent">
    <pre class="code-java"><span class="code-keyword">public</span> class NameNode {
    
        <span class="code-comment">// [..]
    </span>
        <span class="code-keyword">private</span> void initialize(Configuration conf) {
            <span class="code-comment">// [...]
    </span>        <span class="code-keyword">for</span> (Plugin p: PluginManager.loadPlugins(conf))
              p.namenodeStarted(<span class="code-keyword">this</span>);
        }
    
        <span class="code-comment">// [..]
    </span>
        <span class="code-keyword">public</span> void stop() {
            <span class="code-keyword">if</span> (stopRequested)
                <span class="code-keyword">return</span>;
            stopRequested = <span class="code-keyword">true</span>;
            <span class="code-keyword">for</span> (Plugin p: plugins)
                p.namenodeStopping();
            <span class="code-comment">// [..]
    </span>    }
    
        <span class="code-comment">// [..]
    </span>}</pre>
    </div></div>
    
    <p>Data nodes would do a similar thing in <tt>DataNode.startDatanode()</tt> and <tt>DataNode.shutdown()</tt>.</p>
    Reason: MISSING: Reason for inclusion
    Author: Carlos Valiente
    Ref: UNKNOWN

commit 155394ca5eed2e2a6151a5c9d9452e9cfbb30a11
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:57:58 2010 -0800

    MAPREDUCE-971. distcp does not always remove distcp.tmp.dir
    
    Description: Sometimes distcp leaves behind its tmpdir when the target filesystem is s3n.
    Reason: Bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 7575b83ba0cab30394bad0943ff906ab0609dc40
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:57:49 2010 -0800

    CLOUDERA-BUILD. Package sqoop docs.

commit 9321b18352e55d4d37c25335b578151b18f938f2
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:57:32 2010 -0800

    MAPREDUCE-923. Sqoop's ORM uses URLDecoder on a file, which replaces plus signs in a jar file name with spaces
    
    Description: In findThisJar, sqoop runs URLDecoder.decode on the resulting jar, which has the effect of replacing any + signs in the path with a space.  This obviously breaks the classpath variable that it's trying to set, and the sqoop-generated code fails to compile.  Ironically, Cloudera's hadoop distro is the one that puts + characters in jar files, and so exhibits the bug.  Here is an example from running sqoop with log4j at debug level.  Note the space in the very last term, which should read hadoop-0.20.0+61-sqoop.jar rather than hadoop-0.20.0 61-sqoop.jar.
    
    <p>09/08/27 18:00:07 DEBUG orm.CompilationManager: Invoking javac with args: -sourcepath ./ -d /tmp/sqoop/compile/ -classpath /usr/lib/hadoop-0.20/conf:/usr/java/jdk1.6.0_06/lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-0.20.0+61-core.jar:/usr/lib/hadoop-0.20/lib/commons-cli-2.0-SNAPSHOT.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.3.jar:/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/hadoop-0.20.0+61-fairscheduler.jar:/usr/lib/hadoop-0.20/lib/hadoop-0.20.0+61-scribe-log4j.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/hsqldb.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.14.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.14.jar:/usr/lib/hadoop-0.20/lib/junit-3.8.1.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.20/lib/libfb303.jar:/usr/lib/hadoop-0.20/lib/libthrift.jar:/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mysql-connector-java-5.0.8-bin.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar:/usr/local/hadoop/lib/hadoop-gpl-compression.jar:/usr/lib/hadoop-0.20/hadoop-0.20.0+61-core.jar:/usr/lib/hadoop-0.20/contrib/sqoop/hadoop-0.20.0 61-sqoop.jar</p>
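
    The behavior is easy to reproduce with java.net.URLDecoder, which performs form-style decoding and therefore turns "+" into a space. In the sketch below, JarPathDecodeSketch and plusSafeDecode are hypothetical names, and escaping "+" before decoding is one possible workaround, not necessarily what the patch does.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical sketch; not Sqoop's findThisJar code.
final class JarPathDecodeSketch {
    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Form-style decoding: a literal '+' in the path becomes a space. */
    static String naiveDecode(String path) {
        return decode(path);
    }

    /** Protects literal '+' by percent-encoding it before decoding. */
    static String plusSafeDecode(String path) {
        return decode(path.replace("+", "%2B"));
    }

    public static void main(String[] args) {
        String jar = "/usr/lib/hadoop-0.20/contrib/sqoop/hadoop-0.20.0+61-sqoop.jar";
        System.out.println(naiveDecode(jar));    // ends in "hadoop-0.20.0 61-sqoop.jar"
        System.out.println(plusSafeDecode(jar)); // keeps the '+'
    }
}
```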
    Reason: Bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit e97883c5b9c389f82a6447e4cb1678c0a0ed83ba
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:57:19 2010 -0800

    CLOUDERA-BUILD. Sqoop asciidoc syntax error
    
    Author: Aaron Kimball

commit 520bda2edcb90dfe9461e16b96aa4a048d33ed7b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:57:11 2010 -0800

    HADOOP-5450. Add support for application-specific typecodes to typed bytes
    
    Description: For serializing objects of types that are not supported by typed bytes serialization, applications might want to use a custom serialization format. Right now, typecode 0 has to be used for the bytes resulting from this custom serialization, which could lead to problems when deserializing the objects because the application cannot know if a byte sequence following typecode 0 is a custom-serialized object or just a raw sequence of bytes. Therefore, a range of typecodes that are treated as aliases for 0 should be added, such that different typecodes can be used for application-specific purposes.
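
    The idea of alias typecodes can be sketched as below. Everything here is an assumption made for illustration: the alias range 50 to 200, the record layout (one typecode byte, a 4-byte length, then the payload), and the names TypecodeSketch, encode, typecodeOf, and payloadOf are not taken from this commit.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; the range and record layout are assumptions.
final class TypecodeSketch {
    static final int RAW_BYTES = 0;
    static final int APP_MIN = 50;   // assumed lower bound of the alias range
    static final int APP_MAX = 200;  // assumed upper bound of the alias range

    /** True if the code is an application-specific alias for raw bytes. */
    static boolean isApplicationSpecific(int typecode) {
        return typecode >= APP_MIN && typecode <= APP_MAX;
    }

    /** Writes typecode, payload length, and payload into one record. */
    static byte[] encode(int typecode, byte[] payload) {
        if (typecode != RAW_BYTES && !isApplicationSpecific(typecode)) {
            throw new IllegalArgumentException("not a bytes typecode: " + typecode);
        }
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + payload.length);
        buf.put((byte) typecode).putInt(payload.length).put(payload);
        return buf.array();
    }

    /** Reads the typecode back as an unsigned value. */
    static int typecodeOf(byte[] record) {
        return record[0] & 0xFF;
    }

    /** Extracts the payload; the framing is identical for code 0 and its aliases. */
    static byte[] payloadOf(byte[] record) {
        ByteBuffer buf = ByteBuffer.wrap(record);
        buf.get(); // skip the typecode
        byte[] payload = new byte[buf.getInt()];
        buf.get(payload);
        return payload;
    }

    public static void main(String[] args) {
        byte[] record = encode(100, new byte[] {1, 2, 3});
        System.out.println("typecode: " + typecodeOf(record)
            + ", application-specific: " + isApplicationSpecific(typecodeOf(record))
            + ", payload length: " + payloadOf(record).length);
    }
}
```

    A reader that does not know the application's codes can still treat the record as raw bytes, while the application can dispatch on the typecode to pick the right deserializer.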
    Reason: New feature
    Author: Klaas Bosteels
    Ref: UNKNOWN

commit b30fc99332c4a444d275731dac4b4245115d65b2
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:56:59 2010 -0800

    HADOOP-1722. Make streaming to handle non-utf8 byte array
    
    Description: Right now, the streaming framework expects the outputs of the stream process (mapper or reducer) to be line-oriented UTF-8 text. This limit makes it impossible to use programs whose outputs may be non-UTF-8 (international encodings, or maybe even binary data). Streaming can overcome this limit by introducing a simple encoding protocol. For example, it can allow the mapper/reducer to hex-encode its keys/values, which the framework then decodes on the Java side. This way, as long as the mapper/reducer executables follow this encoding protocol, they can output arbitrary byte arrays and the streaming framework can handle them.
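
    The hex-encoding idea mentioned above can be sketched as a pair of helpers. HexCodecSketch is a hypothetical name, and this is an illustration of the protocol idea rather than the streaming framework's code.

```java
// Hypothetical sketch of a hex encoding protocol; not the streaming code.
final class HexCodecSketch {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    /** Encodes arbitrary bytes as lowercase hex, which is safe in line-oriented text. */
    static String encode(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(DIGITS[(b >> 4) & 0xF]).append(DIGITS[b & 0xF]);
        }
        return sb.toString();
    }

    /** Decodes a hex string back into the original bytes. */
    static byte[] decode(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd-length hex string");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("invalid hex digit at byte " + i);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        // A key containing a newline and a non-UTF-8 byte survives the round trip.
        byte[] key = {0x00, (byte) 0xFF, 0x0A};
        String line = encode(key);
        System.out.println(line + " -> " + decode(line).length + " bytes");
    }
}
```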
    Reason: New feature
    Author: Klaas Bosteels
    Ref: UNKNOWN

commit 921c135653736bcc279700435358058762bc8f78
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:56:43 2010 -0800

    CLOUDERA-BUILD. More Sqoop documentation updates
    
    Author: Aaron Kimball

commit be7f1dc031e17dc4f53ebe76d27c1b9242105785
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:56:26 2010 -0800

    MAPREDUCE-840. DBInputFormat leaves open transaction
    
    Description: (Reapplied after HADOOP-4687)
    Reason: MISSING: Reason for inclusion
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 89a96d8fff80ac809dbda9582044a7c6b3986d16
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:56:07 2010 -0800

    MAPREDUCE-906. Updated Sqoop documentation
    
    Description: Provides the latest documentation for Sqoop, in both user-guide and manpage form. Built with asciidoc.
    Reason: Documentation
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 51f867aea0667d0191b730ea3abf114e75cafa4b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:55:54 2010 -0800

    MAPREDUCE-907. Sqoop should use more intelligent splits
    
    Description: Sqoop should use the new split generation / InputFormat in <a href="http://issues.apache.org/jira/browse/MAPREDUCE-885" title="More efficient SQL queries for DBInputFormat"><del>MAPREDUCE-885</del></a>
    Reason: Performance / scalability improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 239df04415dba8d12c7d3fbf33c580d473202e94
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:55:28 2010 -0800

    MAPREDUCE-885. More efficient SQL queries for DBInputFormat
    
    Description: DBInputFormat generates InputSplits by counting the available rows in a table, and selecting subsections of the table via the "LIMIT" and "OFFSET" SQL keywords. These are only meaningful in an ordered context, so the query also includes an "ORDER BY" clause on an index column. The resulting queries are often inefficient and require full table scans. Actually using multiple mappers with these queries can lead to O(n^2) behavior in the database, where n is the number of splits. Attempting to use parallelism with these queries is counter-productive.
    
    <p>A better mechanism is to organize splits based on data values themselves, which can be performed in the WHERE clause, allowing for index range scans of tables, and can better exploit parallelism in the database.</p>
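
    Value-based splitting can be sketched as generating one WHERE clause per range of an integer column. This is a simplified illustration, not the committed InputFormat: RangeSplitSketch and whereClauses are hypothetical names, and the sketch assumes an integer split column with known minimum and maximum values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the actual DBInputFormat split logic.
final class RangeSplitSketch {
    /** Builds WHERE clauses that together cover [min, max] in at most numSplits ranges. */
    static List<String> whereClauses(String column, long min, long max, int numSplits) {
        if (numSplits < 1 || max < min) {
            throw new IllegalArgumentException("need numSplits >= 1 and max >= min");
        }
        List<String> clauses = new ArrayList<>();
        long span = max - min + 1;
        long step = Math.max(1, (span + numSplits - 1) / numSplits); // ceiling division
        for (long lo = min; lo <= max; lo += step) {
            long hi = Math.min(lo + step - 1, max);
            clauses.add(column + " >= " + lo + " AND " + column + " <= " + hi);
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Each mapper would run: SELECT ... FROM t WHERE <clause>
        for (String clause : whereClauses("id", 1, 100, 4)) {
            System.out.println(clause);
        }
    }
}
```

    Unlike LIMIT and OFFSET, each range can be served by an index range scan on the split column, so no mapper has to skip over rows that belong to earlier splits.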
    Reason: Performance and scalability improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 23a0d1882c797160cc7b6fae99fc5e686aa30191
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:55:16 2010 -0800

    MAPREDUCE-938. Postgresql support for Sqoop
    
    Description: Sqoop should be able to import from postgresql databases.
    Reason: Compatibility improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 7b89feb34fafd2365f75ab744db9cb07a5443046
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:55:05 2010 -0800

    MAPREDUCE-876. Sqoop import of large tables can time out
    
    Description: Related to <a href="http://issues.apache.org/jira/browse/MAPREDUCE-875" title="Make DBRecordReader execute queries lazily"><del>MAPREDUCE-875</del></a>, Sqoop should use a background thread to ensure that progress is being reported while a database does external work for the MapReduce task.
    Reason: Scalability improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 61d4ef5175dca1859a1320f9e7cad1caeab5d982
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:54:49 2010 -0800

    MAPREDUCE-918. Test hsqldb server should be memory-only.
    
    Description: Sqoop launches a standalone hsqldb server for unit tests, but it currently writes its database to disk and uses a connect string of "//localhost". If multiple test instances are running concurrently, one test server may end up serving another instance of the unit tests, causing race conditions.
    Reason: Bugfix in test harness
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 1fc17ad34e8288b54503eeb15f788eb4e6a070dc
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:54:37 2010 -0800

    MAPREDUCE-875. Make DBRecordReader execute queries lazily
    
    Description: DBInputFormat's DBRecordReader executes the user's SQL query in the constructor. If the query is long-running, this can cause task timeout. The user is unable to spawn a background thread (e.g., in a MapRunnable) to inform Hadoop of on-going progress.
    Reason: Scalability improvement
    Author: Aaron Kimball
    Ref: UNKNOWN
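
    A minimal sketch of lazy execution, using Python's sqlite3 module rather than Hadoop's DBRecordReader: the constructor only stores the query, and the query runs on the first request for a record. The class and method names are assumptions made for the example.

```python
import sqlite3


class LazyRecordReader:
    """Defers query execution until the first record is requested."""

    def __init__(self, connection, query):
        # Only remember the query; do not touch the database yet.
        self._connection = connection
        self._query = query
        self._cursor = None

    @property
    def started(self):
        """True once the query has actually been sent to the database."""
        return self._cursor is not None

    def next_record(self):
        """Return the next row, or None when the result set is exhausted."""
        if self._cursor is None:
            self._cursor = self._connection.execute(self._query)
        return self._cursor.fetchone()
```

    Because construction is cheap, the caller has a chance to start a progress-reporting thread before the long-running query begins.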

commit 21fdb7a7fd501fd63e1a540c2b55cf410d057301
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:54:27 2010 -0800

    MAPREDUCE-825. JobClient completion poll interval of 5s causes slow tests in local mode
    
    Description: The JobClient.NetworkedJob.waitForCompletion() method polls for job completion every 5 seconds. When running a set of short tests in pseudo-distributed mode, this is unnecessarily slow and causes lots of wasted time. When bandwidth is not scarce, setting the poll interval to 100 ms results in a 4x speedup in some tests.  This interval should be parametrized to allow users to control the interval for testing purposes.
    Reason: Test performance improvement
    Author: Aaron Kimball
    Ref: UNKNOWN
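
    The effect of a configurable poll interval can be sketched as below. This is a Python illustration, not the JobClient code; wait_for_completion and its parameters are hypothetical, and the sleep function is injectable only so the example can be exercised without real delays.

```python
import time


def wait_for_completion(is_complete, poll_interval_s=5.0, sleep=time.sleep):
    """Poll is_complete() until it returns True; return the number of polls made."""
    polls = 0
    while True:
        polls += 1
        if is_complete():
            return polls
        # A shorter interval notices completion sooner at the cost of more polls.
        sleep(poll_interval_s)
```

    With the 5-second default, a job that finishes in under a second can still hold the caller for up to 5 seconds; a test harness can pass a much smaller interval.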

commit f996b8a019bffefff183d7d688ccf95b8cb73de5
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:54:15 2010 -0800

    MAPREDUCE-750. Extensible ConnManager factory API
    
    Description: Sqoop uses the ConnFactory class to instantiate a ConnManager implementation based on the connect string and other arguments supplied by the user. This allows per-database logic to be encapsulated in different ConnManager instances, and dynamically chosen based on which database the user is actually importing from. But adding new ConnManager implementations requires modifying the source of a common ConnFactory class. An indirection layer should be used to delegate instantiation to a number of factory implementations which can be specified in the static configuration or at runtime.
    Reason: API flexibility improvement
    Author: Aaron Kimball
    Ref: UNKNOWN
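
    The indirection layer described above might look like the following sketch. It is Python, not Sqoop's Java API; ManagerFactory, PrefixFactory, and ConnFactoryChain are hypothetical names, and managers are represented by plain strings to keep the example short.

```python
class ManagerFactory:
    """Interface: return a manager for a connect string, or None to decline."""

    def accept(self, connect_string):
        raise NotImplementedError


class PrefixFactory(ManagerFactory):
    """Accepts connect strings that start with a given prefix."""

    def __init__(self, prefix, manager_name):
        self.prefix = prefix
        self.manager_name = manager_name

    def accept(self, connect_string):
        if connect_string.startswith(self.prefix):
            return self.manager_name
        return None


class ConnFactoryChain:
    """Delegates to a configurable list of factories; the first match wins."""

    def __init__(self, factories):
        self.factories = list(factories)

    def get_manager(self, connect_string):
        for factory in self.factories:
            manager = factory.accept(connect_string)
            if manager is not None:
                return manager
        raise ValueError(f"No manager for connect string: {connect_string}")
```

    Supporting a new database then means adding a factory to the list (for example from configuration) rather than editing the common class.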

commit 39bdff7bd3b83359884c90ae857d3f3144a94803
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:54:04 2010 -0800

    MAPREDUCE-749. Make Sqoop unit tests more Hudson-friendly
    
    Description: Hudson servers (other than Apache's) need to be able to run the Sqoop unit tests, which depend on third-party JDBC drivers / database implementations. The build.xml needs some refactoring to make this happen.
    Reason: Test coverage improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 0ca54f2722206685d9e36fcbb2656d0ac1957311
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:53:47 2010 -0800

    MAPREDUCE-792. javac warnings in DBInputFormat
    
    Description: MAPREDUCE-716 ("org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle", http://issues.apache.org/jira/browse/MAPREDUCE-716) introduces javac warnings
    Reason: Technical debt
    Author: Aaron Kimball
    Ref: UNKNOWN

commit e39ae9d017e89e4df193b1f8075184320230499b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:52:45 2010 -0800

    MAPREDUCE-716. org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle
    
    Description: Applied "trunk" version of the patch after incorporating
    HADOOP-4687's move of DBInputFormat-related files. (Prior patch was 0.20-branch
    specific)
    Reason: Branch compatibility improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 074e824f5d3d2f6ab862083e6eb4b0df8c881bfc
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:52:27 2010 -0800

    MAPREDUCE-910. MRUnit should support counters
    
    Description: incrCounter() is currently a dummy stub method in MRUnit that does nothing. Would be good for the mock reporter/context implementations to support counters.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit b4b7c5d9b4cba84bc47f4a48074fd295d060ab35
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:52:17 2010 -0800

    MAPREDUCE-798. MRUnit should be able to test a succession of MapReduce passes
    
    Description: MRUnit can currently test that the inputs to a given (mapper, reducer) "job" produce certain outputs at the end of the reducer. It would be good to support more end-to-end tests of a series of MapReduce jobs that form a longer pipeline surrounding some data.
    Reason: New Feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 59677d22261974560117fa82e74d9a7f80f804d5
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:52:06 2010 -0800

    MAPREDUCE-800. MRUnit should support the new API
    
    Description: MRUnit's TestDriver implementations use the old org.apache.hadoop.mapred-based classes. TestDrivers and associated mock object implementations are required for org.apache.hadoop.mapreduce-based code.
    Reason: New feature (API Compatibility)
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 7fda23b419b1c98e84eea43a0f35191d41032e18
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:51:53 2010 -0800

    MAPREDUCE-799. Some of MRUnit's self-tests were not being run
    
    Description: Due to method naming issues, some test cases were not being executed.
    Reason: Bugfix; test coverage
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 20d5bf205e9f2864f3da53d30408ba97763a46e9
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:51:40 2010 -0800

    MAPREDUCE-797. MRUnit MapReduceDriver should support combiners
    
    Description: The MapReduceDriver allows you to specify a mapper and a reducer class with a simple sort/"shuffle" between the passes. It would be nice to also support another Reducer implementation being used as a combiner in the middle.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 5c873336b3380e6c8f07ca28230ede9d41e4e840
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:50:05 2010 -0800

    Integrate with 0.21-branch versions of DBInputFormat
    
    Description: In 0.21 there is now a DBInputFormat in the mapred/lib/ package
    as well as mapreduce/lib/db. This patch backports the new API edition of
    DBInputFormat to CDH
    Reason: Cross-branch compatibility improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 51b650554e3bc8054e8ca966f5f552c522f7483d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:49:52 2010 -0800

    HADOOP-5170. Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide
    
    Description: There are a number of use cases for being able to do this.  The focus of this jira should be on finding what would be the simplest to implement that would satisfy the most use cases.
    
    This could be implemented as either a per-node maximum or a cluster-wide maximum.  It seems that for most uses the former is preferable; however, either would fulfill the requirements of this jira.
    
    Some of the reasons for allowing this feature (mine and from others on list):
    
    - I have some very large CPU-bound jobs.  I am forced to keep the max map/node limit at 2 or 3 (on a 4 core node) so that I do not starve the Datanode and Regionserver.  I have other jobs that are network latency bound and would like to be able to run high numbers of them concurrently on each node.  Though I can thread some jobs, there are some use cases that are difficult to thread (scanning from hbase) and there's significant complexity added to the job rather than letting hadoop handle the concurrency.
    - Poor assignment of tasks to nodes creates some situations where you have multiple reducers on a single node but other nodes that received none.  A limit of 1 reducer per node for that job would prevent that from happening. (only works with per-node limit)
    - Poor man's MR job virtualization.  Since we can limit a job's resources, this gives much more control in allocating and dividing up resources of a large cluster.  (makes most sense w/ cluster-wide limit)
    
    Reason: Configuration improvement
    Author: Matei Zaharia
    Ref: UNKNOWN

commit 99e25a93542251debd248ed71cb380858ca8c9bd
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:49:40 2010 -0800

    HADOOP-6166. Improve PureJavaCrc32
    
    Description: Got some ideas to improve CRC32 calculation.
    Reason: Performance Improvement
    Author: Tsz Wo (Nicholas), SZE
    Ref: UNKNOWN

commit 2d0a97cefa559ab9059d976bda66f9dbcf051e79
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:49:28 2010 -0800

    MAPREDUCE-782. Use PureJavaCrc32 in mapreduce spills
    
    Description: HADOOP-6148 ("Implement a pure Java CRC32 calculator", http://issues.apache.org/jira/browse/HADOOP-6148) implemented a Pure Java implementation of CRC32 which performs better than the built-in one. This issue is to make use of it in the mapred package
    Reason: Performance improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit bb65cb649c2924b5a20f06deb9ecd66fc219eeeb
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:49:12 2010 -0800

    HDFS-496. Use PureJavaCrc32 in HDFS
    
    Description: Common now has a pure java CRC32 implementation which is more efficient than java.util.zip.CRC32. This issue is to make use of it.
    Reason: Performance improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit ac73e6d51d5ad1df993097349602e5f3199b952a
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:48:40 2010 -0800

    HADOOP-6148. Implement a pure Java CRC32 calculator
    
    Description: We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.
    
    This outperforms java.util.zip.CRC32.
    Reason: Performance improvement
    Author: Scott Carey and Todd Lipcon
    Ref: UNKNOWN
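
    The commit message does not spell out the algorithm. As a rough illustration of how a CRC32 can be computed without crossing into native code, the sketch below uses the classic single-table, byte-at-a-time method in Python. It is not the PureJavaCrc32 source, which may use a different table layout; the function names are assumptions.

```python
import zlib  # used only in the usage note to cross-check results


def _build_table():
    """Precompute the 256-entry table for the reflected polynomial 0xEDB88320."""
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        table.append(c)
    return table


_TABLE = _build_table()


def crc32(data, crc=0):
    """Return the CRC32 of `data`, continuing from a previous checksum `crc`."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        # One table lookup per input byte replaces eight bitwise steps.
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF
```

    The result can be cross-checked against the built-in implementation, e.g. `crc32(b"hello world") == zlib.crc32(b"hello world")`.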

commit e7430c8cbd2d182716ac7efb08cb2187c1edab95
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:48:08 2010 -0800

    Updated Sqoop documentation for MAPREDUCE-816, MAPREDUCE-789.
    
    Reason: Documentation improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit aa75ab7f749604c354dcdb0b806aca9cd140f504
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:47:58 2010 -0800

    MAPREDUCE-789. Oracle support for Sqoop
    
    Description: A separate ConnManager is needed for Oracle to support its slightly different syntax and configuration
    Reason: Compatibility improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 6f017db468a82e336a28f451c7d90bc225130094
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:47:33 2010 -0800

    MAPREDUCE-840. DBInputFormat leaves open transaction
    
    Description: DBInputFormat.getSplits() does not call connection.commit() after the COUNT query. This can leave an open transaction against the database which interferes with other connections to the same table.
    Reason: bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 84b622a5f6f5bd145f19f4c08b6263759ac51756
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:47:15 2010 -0800

    MAPREDUCE-816. Rename "local" mysql import to "direct"
    
    Description: A mysqldump-based fast path known as "local mode" is used in sqoop when users pass the argument --local. The restriction that this only import from localhost was based on an implementation technique that was later abandoned in favor of a more general one, which can support remote hosts as well. Thus, --local is a poor name for the flag. --direct is more general and more descriptive. This should be used instead.
    Reason: Interface clarification
    Author: Aaron Kimball
    Ref: UNKNOWN

commit ce75318a484615dc7b161a41710884f34db50c86
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:46:34 2010 -0800

    MAPREDUCE-716. org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle
    
    Description: The out-of-the-box implementation in Hadoop works properly with mysql/hsqldb, but NOT with oracle.
    The reason is that DBInputFormat is implemented with mysql/hsqldb-specific query constructs like "LIMIT" and "OFFSET".
    
    FIX:
    Build database-provider-specific logic based on the database provider name (which we can get from the connection).
    
    Reason: Compatibility improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 338de775796c2102ce680eaa983b719b50e9f3ee
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:46:18 2010 -0800

    HADOOP-5469. Exposing Hadoop metrics via HTTP
    
    Description: Implement a "/metrics" URL on the HTTP server of Hadoop daemons, to expose metrics data to users via their web browsers, in plain-text and JSON.
    Reason: New feature
    Author: Philip Zeyliger
    Ref: UNKNOWN

commit cad421ec1c51382f81714ccafb96a6bb8bcc8aec
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:46:11 2010 -0800

    HADOOP-5469. Exposing Hadoop metrics via HTTP
    
    Description: Implement a "/metrics" URL on the HTTP server of Hadoop daemons, to expose metrics data to users via their web browsers, in plain-text and JSON.
    Reason: MISSING: Reason for inclusion
    Author: Philip Zeyliger
    Ref: UNKNOWN

commit 8b09839047997a4b5461703650b5779ec86c1844
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:45:49 2010 -0800

    CLOUDERA-BUILD. Added Sqoop documentation to installation script
    
    Author: Todd Lipcon

commit 7e77c6b13f06dec9c742bf76c81e2ec02d81c7cb
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:45:35 2010 -0800

    CLOUDERA-BUILD. Fix the hadoop/sqoop wrapper scripts
    
    Author: Matt Massie

commit 0caaf80f3a569b91f482de0dcb87f826967f5c7c
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:45:16 2010 -0800

    CLOUDERA-BUILD. Fix a bug in the hadoop/sqoop wrapper generation
    
    Author: Matt Massie
    Ref: UNKNOWN

commit bd8ddae402a876fe78cbb1482362935780b57d84
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:44:59 2010 -0800

    CLOUDERA-BUILD. Update the install hadoop script
    
    Author: Matt Massie
    Ref: UNKNOWN

commit 80cf01124877a5aebd742142b10fda45910f0328
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:44:42 2010 -0800

    CLOUDERA-BUILD. Rename the hadoop man page to be hadoop-0.20
    
    Author: Matt Massie
    Ref: UNKNOWN

commit 78cb9f21a3ddf04f8cef9e37a94f657448d0d111
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:43:51 2010 -0800

    HADOOP-5745. Allow setting the default value of maxRunningJobs for all pools
    
    Description: The <pool> element allows setting the maxRunningJobs for that pool. It would be nice to be able to set a default value for all pools.
    
    In our configuration, pools are autocreated: every new user gets his own pool. We would like to allow each user to be able to run a max of 5 jobs at a time. For the etl pool, this limit will be set to a greater value.
    Reason: Improved configuration flexibility
    Author: dhruba borthakur
    Ref: UNKNOWN

commit 3c39e1fa8c3c89fc8f11f1faff46397fa82d5116
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:43:13 2010 -0800

    MAPREDUCE-906. Updated Sqoop documentation.
    
    Description: Update Sqoop documentation with user guide and manpage.
    Reason: Documentation improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 79a2645bc81894331721ef94c255992075ccf195
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:42:14 2010 -0800

    CLOUDERA-BUILD. Added MySQL Connector/J library for Sqoop.
    
    Description: We can ship MySQL Connector/J with CDH because the licenses
    are compatible. However, the public Apache project will not include this
    library in their source repository due to stricter licensing concerns.
    Reason: Simplifies deployment of Sqoop for mysql users
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 4a097b35bf1264a0606f2ebe410c45f16f900f03
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:42:05 2010 -0800

    MAPREDUCE-705. User-configurable quote and delimiter characters for Sqoop records and record reparsing
    
    Description: Sqoop needs a mechanism for users to govern how fields are quoted and what delimiter characters separate fields and records. With delimiters providing an unambiguous format, a parse method can reconstitute the generated record data object from a text-based representation of the same record.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN
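
    To illustrate why an unambiguous delimiter/quote scheme makes records reparseable, here is a small round-trip sketch using Python's csv module. It does not reproduce Sqoop's generated code or its command-line flags; format_record and parse_record are hypothetical helpers.

```python
import csv
import io


def format_record(fields, delimiter=",", quotechar='"'):
    """Serialize one record, quoting only fields that contain special characters."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter=delimiter,
        quotechar=quotechar,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="",  # the caller decides how records are separated
    )
    writer.writerow(fields)
    return buf.getvalue()


def parse_record(line, delimiter=",", quotechar='"'):
    """Reconstitute the list of fields from its text representation."""
    return next(csv.reader([line], delimiter=delimiter, quotechar=quotechar))
```

    A field that contains the delimiter, such as "Smith, John", survives the round trip because it is enclosed in the quote character on output.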

commit 58e23056af0e99ef611ac258719207cc9459a849
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:41:47 2010 -0800

    MAPREDUCE-710. Sqoop should read and transmit passwords in a more secure manner
    
    Description: Sqoop's current support for passwords involves reading passwords from the command line "--password foo", which makes the password visible to other users via 'ps'. An invisible-console approach should be taken.
    
    Related, Sqoop transmits passwords to mysqldump in the same fashion, which is also insecure.
    Reason: Security improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit a67a0f77729fb9005b0c47872d6ba677f6434b41
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:41:34 2010 -0800

    MAPREDUCE-713. Sqoop has some superfluous imports
    
    Description: Some classes have vestigial imports that should be removed
    Reason: Code cleanup
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 0a4dab2eac0ba8b6da5190bc53a9ce8e4344a336
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:41:01 2010 -0800

    MAPREDUCE-685. Sqoop will fail with OutOfMemory on large tables using mysql
    
    Description: The default MySQL JDBC client behavior is to buffer the entire ResultSet in the client before allowing the user to use the ResultSet object. On large SELECTs, this can cause OutOfMemory exceptions, even when the client intends to close the ResultSet after reading only a few rows. The MySQL ConnManager should configure its connection to use row-at-a-time delivery of results to the client.
    Reason: bugfix / scalability improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 499aa76b500136a0e8996898f468b088ca5d7ed3
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:40:50 2010 -0800

    MAPREDUCE-674. Sqoop should allow a "where" clause to avoid having to export entire tables
    
    Description: Sqoop currently only exports at the granularity of a table.  This doesn't work well on systems with large tables, where the overhead of performing a full dump each time is significant.  Allowing the user to specify a where clause is a relatively simple task which will give Sqoop a lot more flexibility.
    Reason: New feature
    Author: Kevin Weil
    Ref: UNKNOWN

commit ed4ba254d7708f363f5f1b4708e9e35061ad936c
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:40:37 2010 -0800

    MAPREDUCE-675. Sqoop should allow user-defined class and package names
    
    Description: Currently Sqoop generates a class for each table to be imported; the class names are equal to the table names and they are not part of any package.
    
    This adds --class-name and --package-name parameters to Sqoop, allowing these aspects of code generation to be controlled.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 16e0ca8119b99b244c9eeafd78bb9eb43e4ba639
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:40:20 2010 -0800

    MAPREDUCE-703. Sqoop requires dependency on hsqldb in ivy
    
    Description: Sqoop builds crash without explicit dependency on hsqldb.
    Reason: build system bugfix
    Author: Aaron Kimball
    Ref: UNKNOWN

commit b8e54791e990328db983f070e9a04952301eda35
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:40:04 2010 -0800

    MAPREDUCE-692. Make Hudson run Sqoop unit tests
    
    Description: Running 'ant test-contrib' didn't test Sqoop because it wasn't explicitly listed in the build.xml file in src/contrib/
    Reason: Test coverage
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 8a3b6472ae00542dadf7f7d60991ec0f21b38177
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:39:40 2010 -0800

    HADOOP-5968. Sqoop should only print a warning about mysql import speed once
    
    Description: After HADOOP-5844 ("Use mysqldump when connecting to local mysql instance in Sqoop", http://issues.apache.org/jira/browse/HADOOP-5844), Sqoop can use mysqldump as an alternative to JDBC for importing from MySQL. If you use the JDBC mechanism, it prints a warning if you could have enabled the mysqldump path instead. But the warning is printed multiple times (every time the LocalMySQLManager is instantiated), and also when the MySQL manager is used for informational queries (e.g., listing tables) rather than true imports.
    
    It should only emit the warning once per session, and only then if it's actually doing an import.
    Reason: User experience improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 86211e3714dc5b1dbcb7a3c328336277f6657de7
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:38:44 2010 -0800

    HADOOP-5967. Sqoop should only use a single map task
    
    Description: The current DBInputFormat implementation uses SELECT ... LIMIT ... OFFSET statements to
    read from a database table. This actually results in several queries all accessing the same table at
    the same time. Most database implementations will actually use a full table scan for each such
    query, starting at row 1 and scanning down until the OFFSET is reached before emitting data to the
    client. The upshot of this is that we see O(n^2) performance in the size of the table when using a
    large number of mappers, when a single mapper would read through the table in O(n) time in the number of rows.
    
    This patch sets the number of map tasks to 1 in the MapReduce job sqoop launches.
    Reason: Performance improvement
    Author: Aaron Kimball
    Ref: UNKNOWN
    
    commit 410db7130a8e85ceed46850f73e74f480d45994e
    Author: Aaron Kimball <aaron@cloudera.com>
    Date:   Thu Jul 23 16:10:21 2009 -0700
    
        HADOOP-5967: Sqoop should only use a single map task
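
    The cost argument in the description can be checked with a small calculation. The sketch below (Python, illustrative only) assumes the behavior stated above: a LIMIT/OFFSET query scans from row 1 through OFFSET + LIMIT. The function name is hypothetical.

```python
def rows_scanned_with_offset(total_rows, num_mappers):
    """Total rows the database touches when each mapper issues LIMIT/OFFSET.

    Assumes every query scans from the first row up to the end of its split.
    """
    per_split = total_rows // num_mappers
    scanned = 0
    for i in range(num_mappers):
        offset = i * per_split
        # The last split also takes any remainder rows.
        limit = total_rows - offset if i == num_mappers - 1 else per_split
        scanned += offset + limit
    return scanned
```

    For a 1000-row table, one mapper scans 1000 rows in total, while ten mappers scan 5500 rows between them, so the extra mappers add database work instead of removing it.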

commit b8f5d1d3a30a7461936f3f92bd9f007ed2db43e8
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:38:23 2010 -0800

    HADOOP-5887. Sqoop should create tables in Hive metastore after importing to HDFS
    
    Description: Sqoop (HADOOP-5815, "Sqoop: A database import tool for Hadoop", http://issues.apache.org/jira/browse/HADOOP-5815) imports tables into HDFS; it is a straightforward enhancement to then generate a Hive DDL statement to recreate the table definition in the Hive metastore and move the imported table into the Hive warehouse directory from its upload target.
    
    This feature enhancement makes this process automatic. An import is performed with sqoop in the usual way; providing the argument "--hive-import" will cause it to then issue a CREATE TABLE .. LOAD DATA INTO statement to a Hive shell. It generates a script file and then attempts to run "$HIVE_HOME/bin/hive" on it, or failing that, any "hive" on the $PATH; $HIVE_HOME can be overridden with --hive-home. As a result, no direct linking against Hive is necessary.
    
    The unit tests provided with this enhancement use a mock implementation of 'bin/hive' that compares the script it's fed with one from a directory full of "expected" scripts. The exact script file referenced is controlled via an environment variable. It doesn't actually load into a proper Hive metastore, but manual testing has shown that this process works in practice, so the mock implementation is a reasonable unit testing tool.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 50993494fdc7b2284837562b500e2840106bb3bb
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:37:48 2010 -0800

    CLOUDERA-BUILD. Address issue where docs were not properly copied through to release tarball
    
    Description:
        This was caused by some cleanup in build.xml early on in the CDH 0.20
        branch
    Reason: bugfix
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 3ecb9c07279302d18f7367d49bcd98c4391cbb68
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:37:27 2010 -0800

    CLOUDERA-BUILD. Decrease build time by only rebuilding the native code for each platform
    
    Reason: build system improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit f0c6a810ba7237ec7cc570ecad8a8665768b3d06
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:37:07 2010 -0800

    CLOUDERA-BUILD. Run jdiff against vanilla Hadoop during Cloudera release build
    
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 9cf8f0cb6ed744439d8e90e3ba376edb5d9521f3
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:36:22 2010 -0800

    MAPREDUCE-415. JobControl Job always has an unassigned name
    
    Description: When creating and adding org.apache.hadoop.mapred.jobcontrol.Job(s) they don't use the names specified in their respective JobConf files.  Instead it's just hardcoded to "unassigned".
    Reason: bugfix
    Author: Xavier Stevens
    Ref: UNKNOWN

commit 330f009bae260ac990426a988fc56913897a50ca
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:35:03 2010 -0800

    HADOOP-5805. problem using top level s3 buckets as input/output directories
    
    Description: When I specify top level s3 buckets as input or output directories, I get the following exception.
    
    hadoop jar subject-map-reduce.jar s3n://infocloud-input s3n://infocloud-output
    
    java.lang.IllegalArgumentException: Path must be absolute: s3n://infocloud-output
            at org.apache.hadoop.fs.s3native.NativeS3FileSystem.pathToKey(NativeS3FileSystem.java:246)
            at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:319)
            at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
            at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:109)
            at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:738)
            at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1026)
            at com.evri.infocloud.prototype.subjectmapreduce.SubjectMRDriver.run(SubjectMRDriver.java:63)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
            at com.evri.infocloud.prototype.subjectmapreduce.SubjectMRDriver.main(SubjectMRDriver.java:25)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            at java.lang.reflect.Method.invoke(Method.java:597)
            at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
            at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
            at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
    
    The workaround is to specify input/output buckets with sub-directories:
    
    hadoop jar subject-map-reduce.jar s3n://infocloud-input/input-subdir  s3n://infocloud-output/output-subdir
    
    Reason: bugfix
    Author: Ian Nowland
    Ref: UNKNOWN

commit 35fa82b5c743e34d62449e0f4abffd885e0dfe4c
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:34:42 2010 -0800

    HADOOP-5656. Counter for S3N Read Bytes does not work
    
    Description: Counter for S3N Read Bytes does not work on trunk. On 0.18 branch neither read nor write byte counters work.
    Reason: Bugfix
    Author: Ian Nowland
    Ref: UNKNOWN

commit a6670de0a1c4b03c293ae47d1595e8c33764aaa5
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:33:43 2010 -0800

    HADOOP-5613. change S3Exception to checked exception
    
    Description: Currently the S3 filesystems can throw unchecked exceptions (S3Exception) which are not declared in the interface of FileSystem. These aren't caught by the various callers and can cause unpredictable behavior. IOExceptions are caught by most users of FileSystem since it is declared in the interface and hence is handled better.
    
    S3Exception now extends IOException.
    Reason: Improved error-checking at compile time for user applications.
    Author: Andrew Hitchcock
    Ref: UNKNOWN

commit 1f11b63a42ae441eb8d0693ed0e4e01aca553e42
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:33:09 2010 -0800

    HADOOP-5528. Binary partitioner
    
    Description: It would be useful to have a BinaryPartitioner that partitions BinaryComparable keys by hashing a configurable part of the bytes array corresponding to each key.
    Reason: New feature
    Author: Klaas Bosteels
    Ref: UNKNOWN
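
    Hashing a configurable part of the key bytes can be sketched as follows. This is a Python illustration, not the BinaryPartitioner source: the left/right offset parameters, the convention that negative offsets count from the end of the array, and the use of CRC32 as the hash are all assumptions made for the example.

```python
import zlib


def binary_partition(key_bytes, num_partitions, left=0, right=-1):
    """Pick a partition by hashing key_bytes[left..right], both ends inclusive."""
    length = len(key_bytes)
    # Negative offsets count from the end, so right=-1 means the last byte.
    start = left if left >= 0 else length + left
    end = right if right >= 0 else length + right
    part = key_bytes[start:end + 1]
    # CRC32 is used here only because it is deterministic across runs.
    return zlib.crc32(part) % num_partitions
```

    With left=0 and right=4, the keys b"user1-2010" and b"user1-2011" hash on the shared prefix b"user1" and therefore land in the same partition.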

commit 716d3598e5a4a18cdfcfcf0dc800e263ef7c7685
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:32:47 2010 -0800

    HADOOP-5240. 'ant javadoc' does not check whether outputs are up to date and always rebuilds
    
    Description: Running 'ant javadoc' twice in a row calls the javadoc program both times; it doesn't check to see whether this is redundant work.
    Reason: Build system improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 2bb607d29d9080a7ca3bce72739ccef654d5392d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:30:46 2010 -0800

    HADOOP-5175. Option to prohibit jars unpacking
    
    Description: The task tracker moves all unpacked jars into
    ${hadoop.tmp.dir}/mapred/local/taskTracker. When using a lot of external
    libraries via -libjars, this results in several thousand unpacked files.
    The amount of time needed to `du` these directories can increase to the point
    where tasks time out before starting. This patch provides an option to
    suppress jar unpacking.
    Reason: Scalability improvement
    Author: Todd Lipcon
    Ref: UNKNOWN

commit 349281bfa0243f5adbbd459266f4a9ac7ac8c1cc
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:30:16 2010 -0800

    CLOUDERA-BUILD. Fix scribe-log4j's ivy.xml to properly get log4j on the compile classpath
    
    Author: Todd Lipcon
    Reason: bugfix to build system
    Ref: UNKNOWN

commit b07aec5129e618bfeda8ba753fb5138e612b1a8b
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:29:33 2010 -0800

    HADOOP-4829. Allow FileSystem shutdown hook to be disabled
    
    Description: FileSystem sets a JVM shutdown hook so that it can clean up the FileSystem cache. This is great behavior when you are writing a client application, but when you're writing a server application, like the Collector or an HBase RegionServer, you need to control the shutdown of the application and HDFS much more closely. If you set your own shutdown hook, there's no guarantee that your hook will run before the HDFS one, preventing you from taking some shutdown actions.
    Reason: Integration improvement.
    Author: Todd Lipcon
    Ref: UNKNOWN
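    
    A rough sketch of the pattern this enables, using only java.lang.Runtime. The class, the method name, and the autoCleanup flag are hypothetical and stand in for whatever option the patch actually adds; they are not Hadoop APIs.

```java
class ShutdownHookSketch {
    // Register the automatic cleanup hook only when the (hypothetical)
    // autoCleanup flag is set. Returns the hook thread, or null if disabled.
    static Thread maybeRegisterCleanupHook(boolean autoCleanup, Runnable cleanup) {
        if (!autoCleanup) {
            return null;
        }
        Thread hook = new Thread(cleanup, "fs-cache-cleanup");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        Runnable closeAll = () -> System.out.println("closing cached file systems");

        // A server application disables the automatic hook...
        Thread hook = maybeRegisterCleanupHook(false, closeAll);

        // ...and runs the cleanup itself, after its own shutdown work,
        // so the ordering no longer depends on the JVM's hook scheduling.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            System.out.println("flushing server state");
            closeAll.run();
        }));

        System.out.println(hook == null ? "automatic hook disabled" : "automatic hook registered");
    }
}
```

    The JVM starts registered shutdown hooks in an unspecified order, which is why the server has to own the whole sequence in a single hook.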

commit 154c6a6474b02e68c3418fddf9a8ee5d476a8b7d
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:28:14 2010 -0800

    HADOOP-3327. Shuffling fetchers waited too long between map output fetch re-tries
    
    Description: Improves handling of READ_TIMEOUT during map output copying.
    Author: Amareshwari Sriramadasu
    Reason: bugfix
    Ref: UNKNOWN
    
    commit 8a6293fc5c3733035dde8e4d3a68c414a1f800f8
    Author: Devaraj Das <ddas@apache.org>
    Date:   Thu Feb 5 05:35:09 2009 +0000
    
        HADOOP-3327. Improves handling of READ_TIMEOUT during map output copying. Contributed by Amareshwari Sriramadasu.
    
        git-svn-id: https://svn.apache.org/repos/asf/hadoop/core/trunk@741009 13f79535-47bb-0310-9956-ffa450edef68

commit 4ee0ecf4760d7adb3e1a81e018a3b5cd6d2e9775
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:27:44 2010 -0800

    MAPREDUCE-680. Reuse of Writable objects is improperly handled by MRUnit
    
    Description: As written, MRUnit's MockOutputCollector simply stores references to the objects passed in to its collect() method. Thus if the same Text (or other Writable) object is reused as an output container multiple times with different values, these separate values will not all be collected. MockOutputCollector needs to properly use io.serializations to deep copy the objects sent in.
    Reason: Bugfix; see description.
    Author: Aaron Kimball
    Ref: UNKNOWN
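    
    The failure mode can be reproduced without Hadoop. In the sketch below, MutableText is a hypothetical stand-in for a reusable Writable such as Text, and copy() stands in for the serialization-based deep copy that the description calls for; neither is MRUnit code.

```java
import java.util.ArrayList;
import java.util.List;

class CollectorReuseSketch {
    /** Hypothetical stand-in for a reusable Writable such as Text. */
    static class MutableText {
        private String value = "";
        void set(String v) { value = v; }
        String get() { return value; }
        MutableText copy() {
            MutableText c = new MutableText();
            c.set(value);
            return c;
        }
    }

    // Emit every input through one reused object, the way a mapper often does,
    // and return what the collector ends up holding.
    static List<String> collect(String[] inputs, boolean deepCopy) {
        List<MutableText> collected = new ArrayList<>();
        MutableText reused = new MutableText();
        for (String in : inputs) {
            reused.set(in);
            // Storing the reference means every entry aliases the same object.
            collected.add(deepCopy ? reused.copy() : reused);
        }
        List<String> out = new ArrayList<>();
        for (MutableText t : collected) {
            out.add(t.get());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] inputs = {"a", "b", "c"};
        System.out.println(collect(inputs, false)); // prints [c, c, c]
        System.out.println(collect(inputs, true));  // prints [a, b, c]
    }
}
```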
    
    commit 51bdfdcf947bc8447aa36d68ae802f154516b0b6
    Author: Aaron Kimball <aaron@cloudera.com>
    Date:   Wed Jul 15 10:40:47 2009 -0700
    
        MAPREDUCE-680. Reuse of Writable objects is improperly handled by MRUnit.

commit c2026460d4cf7049c67da65d3a2db2e9bcd9c848
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:27:14 2010 -0800

    HADOOP-5518. MRUnit unit test library
    
    Description: MRUnit is a tool to help authors of MapReduce programs write unit tests.
    
    Testing map() and reduce() methods requires some repeated work to mock the inputs and outputs of a Mapper or Reducer class, and ensure that the correct values are emitted to the OutputCollector based on inputs. Also, testing a mapper and reducer together requires running them with the sorted ordering guarantees made by the shuffle process.
    
    This library provides the above functionality to authors of maps and reduces; it allows you to test maps, reduces, and map-reduce pairs without needing to perform all the setup and teardown work associated with running a job.
    
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit 6991a0eb635953bf3729bce330c426ed7d8b996a
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:26:29 2010 -0800

    CLOUDERA-BUILD. Add sqoop wrapper to bin
    
    Description: Adds a '/usr/bin/sqoop' wrapper script for users
    Reason: User-experience improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit c365162d7db1ee70c8607ad84a11e4aa594224e7
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:25:56 2010 -0800

    HADOOP-5844. Use mysqldump when connecting to local mysql instance in Sqoop
    
    Description: Sqoop uses MapReduce + DBInputFormat to read the contents of a table into HDFS. On many databases, this implementation is O(N^2) in the number of rows. Also, the use of multiple mappers has low value in terms of throughput, because the database itself is inherently single-threaded. While DBInputFormat/JDBC provides a useful fallback mechanism for importing from databases, db-specific dump utilities will nearly always provide faster throughput, and should be selected when available. This patch allows users to use mysqldump to read from local mysql instances instead of the MapReduce-based input.
    Reason: Performance improvement
    Author: Aaron Kimball
    Ref: UNKNOWN

commit eddbfbca420bfb81a3a565e4324f6189bfd97e41
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:24:58 2010 -0800

    HADOOP-5815. Sqoop: A database import tool for Hadoop
    
    Description:
    Sqoop is a tool designed to help users import existing relational databases into their Hadoop clusters. Sqoop uses JDBC to connect to a database, examine the schema for tables, and auto-generate the necessary classes to import data into HDFS. It then instantiates a MapReduce job to read the table from the database via the DBInputFormat (JDBC-based InputFormat). The table is read into a set of files loaded into HDFS. Both SequenceFile and text-based targets are supported.
    Reason: New feature
    Author: Aaron Kimball
    Ref: UNKNOWN

commit b33265ff77c71af61899a4b3add1e82cc195fdb7
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:23:53 2010 -0800

    MAPREDUCE-714. JobConf.findContainingJar unescapes unnecessarily on Linux
    
    Description: In JobConf.findContainingJar, the path name is decoded using URLDecoder.decode(...). This was done by Doug in r381794 (commit msg "Un-escape containing jar's path, which is URL-encoded.  This fixes things primarily on Windows, where paths are likely to contain spaces.") Unfortunately, jar paths do not appear to be URL encoded on Linux. If you try to use "hadoop jar" on a jar with a "+" in it, this function decodes it to a space and then the job cannot be submitted.
    Reason: Cloudera-based packages include a '+' in the filename; Hadoop's URL escaper will not
    properly handle jar filenames with a '+' without this patch.
    Author: Todd Lipcon
    Ref: UNKNOWN
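    
    The decoding behaviour is easy to demonstrate with java.net.URLDecoder alone. The jar paths below are made up, and plusSafeDecode() shows one possible workaround (escaping '+' as %2B before decoding); it is a sketch, not necessarily what the patch does.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class JarPathDecodeSketch {
    // Plain decoding: URLDecoder treats '+' as an encoded space.
    static String naiveDecode(String path) {
        try {
            return URLDecoder.decode(path, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Possible workaround: escape literal '+' first so it survives decoding,
    // while real escapes such as %20 are still decoded.
    static String plusSafeDecode(String path) {
        return naiveDecode(path.replace("+", "%2B"));
    }

    public static void main(String[] args) {
        String jar = "/usr/lib/hadoop/example-0.20+1.jar"; // hypothetical path
        System.out.println(naiveDecode(jar));    // prints /usr/lib/hadoop/example-0.20 1.jar
        System.out.println(plusSafeDecode(jar)); // prints /usr/lib/hadoop/example-0.20+1.jar
    }
}
```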
    
    commit d9767d2cefab288e581732f71779f3ce8e3267e4
    Author: Todd Lipcon <todd@cloudera.com>
    Date:   Mon Jul 6 19:36:11 2009 -0700
    
        MAPREDUCE-714: Fix JobConf.findContainingJar to work with jars with + in the name

commit aaeb69f8dda72a2e7aecacd622e99c00bc961efa
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:23:23 2010 -0800

    CLOUDERA-BUILD. Add dependency libraries for Scribe/log4j
    
    Author: Todd Lipcon

commit cb7a3677942c1d2f9e0d2a75dbffa09fa6125e61
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:22:41 2010 -0800

    CLOUDERA-BUILD. Apply Scribe patches to Hadoop
    
    Description:
        scribe_hadoop_trunk.patch
        Also, add empty ivy infrastructure for scribe-log4j
    Author: Todd Lipcon

commit d5ead434b221076fb830308d2d112d53aa6dc59f
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:22:26 2010 -0800

    CLOUDERA-BUILD. Use cloudera's versioning info from cloudera.hash in saveVersion.sh
    
    Description:
        This should make the "hadoop version" output far more useful for
        determining exactly what code is running. The cloudera.hash property is
        set by cloudera/build.properties which is generated during the build
        process.

commit bf10e46e425395145dcc4b85db66d45cbf9797b0
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:21:45 2010 -0800

    CLOUDERA-BUILD. Move saveVersion.sh in build.xml to ensure build
    
    Description:
        This error is due to ant 1.7.1 not compiling package-info.java if the
        timestamp of the output class directory is newer than the package-info
        file itself. Since other compiles were happening after package-info.java
        was generated, the build dir was newer and compilation was being
        skipped.
    
        Move cloudera hooks inside the package task of build.xml
    
        Fixes an issue where the fair scheduler jar was not built before the
        hooks were run, and therefore was not included in the target lib/
        directory.
    
    Ref: CLOUDERA-436

commit 5359a3bbd2b09644825be99fdd354ff3276a5d59
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:21:36 2010 -0800

    CLOUDERA-BUILD. New versions of cloudera packaging scripts

commit ee255f3909b9938b1023be6a2c59a8429227c766
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:21:27 2010 -0800

    CLOUDERA-BUILD. Change paths to point to hadoop-0.20 where necessary

commit a2d051bcf456fde45c0a0c3aa512872ce6059a97
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:21:08 2010 -0800

    CLOUDERA-BUILD. Add Hadoop manpage to Hadoop 0.20 repository

commit 9600765ec5d6c3cef9ab34ecb573cbb876acf7ee
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:21:01 2010 -0800

    CLOUDERA-BUILD. Move install_hadoop.sh into hadoop repo

commit 77ac6923ad6e63874a429e7dd13c4a084b6a9556
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:20:52 2010 -0800

    CLOUDERA-BUILD. Add example-confs directory for storing configuration of conf.pseudo

commit 14256386d4cb155fea0f5745dd6c49fba74ff40f
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:20:43 2010 -0800

    CLOUDERA-BUILD. Replace hadoop-config.sh with Cloudera version

commit f7d0a20e0d74f1aac1fb96f3c08ce31e9b9ca5d9
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:20:25 2010 -0800

    CLOUDERA-BUILD. Remove redundant code in build.xml between package and bin-package

commit 0fa65091ecd9dd150d6afb93845d3fb10d80e115
Author: Aaron Kimball <aaron@cloudera.com>
Date:   Fri Mar 12 14:16:59 2010 -0800

    CLOUDERA-BUILD. Hook build.xml to enable contrib modules
