commit 9b72d268a0b590b4fd7d13aca17c1c453f8bc957
Author: Eli Collins
Date:   Sun Jun 27 18:42:45 2010 -0700

    CLOUDERA-BUILD. Make symlinks so old hadoop jar names are preserved (CDH-1543).

commit 4c50269dda2038d202ddb890ffde38dc3fb2ead2
Author: Aaron Kimball
Date:   Thu Jun 24 18:25:09 2010 -0700

    MAPREDUCE-1887. MRAsyncDiskService does not properly absolutize volume root paths.

    Description: In MRAsyncDiskService, volume names are sometimes specified as relative paths, which are not converted to absolute paths. This can cause errors of the form "cannot delete </full/path/to/foo> since it is outside of <relative/volume/root>" even though the actual path is inside the root.
    Reason: Bug
    Author: Aaron Kimball
    Ref: CDH-1509

commit 43ccf90369692c4d8b7d13a7f04b0864c55f615a
Author: Todd Lipcon
Date:   Wed Jun 23 17:35:08 2010 -0700

    HDFS-1266. Add Apache License Notice to several places where it was missing

    Description: Adds license headers to source code
    Reason: Apache policy
    Author: Todd Lipcon
    Ref: CDH-1495

commit bf08bde983501e3ce8ebf6197049262518580611
Author: Todd Lipcon
Date:   Wed Jun 23 16:14:50 2010 -0700

    HDFS-1260. tryUpdateBlock should do validation before renaming meta file

    Description: Solves bug where block became inaccessible in certain failure conditions (particularly network partitions). Observed under HBase workload at user site.
    Reason: Potential loss of synced data when write pipeline fails
    Author: Todd Lipcon
    Ref: CDH-659

commit 7243001d5511922f293f0641cb8dbc0af4850dae
Author: Todd Lipcon
Date:   Fri Jun 18 16:13:45 2010 -0700

    HDFS-1254. Enable append feature by default

    Description: Changes dfs.support.append to "true" in hdfs-default.xml
    Reason: Append/sync have been tested in CDH3b2 and are safe to use.
    Author: Dhruba Borthakur
    Ref: CDH-659

commit 0e1d71c08923bb4c4172ef043b0b2d82f95b92fa
Author: Todd Lipcon
Date:   Sat Jun 19 16:26:39 2010 -0700

    HDFS-1252.
    Updates to TestDFSConcurrentFileOperations (test was previously broken)

    Description: Fixes TestDFSConcurrentFileOperations to test the correct semantics for sync feature
    Reason: Test was previously flaky
    Author: Todd Lipcon
    Ref: CDH-659

commit 829497f4867a0e92da712faf02f83c7087df07ce
Author: Eli Collins
Date:   Fri Jun 18 19:31:58 2010 -0700

    CLOUDERA-BUILD. Remove Sqoop from the build.

commit 298fda37c4c25434a15886ee9c261e566d595dff
Author: Aaron Kimball
Date:   Fri Jun 18 18:42:37 2010 -0700

    HADOOP-5203. TT's version build is too restrictive.

    Description: Use the md5sum checksum of the source for determining version compatibility.
    Reason: Improvement
    Author: Rick Cox (0.20 backport by Bill Au)
    Ref: CDH-1139

commit f07b2df591b91c7de50e8dbb526cf11b27a32a6f
Author: Aaron Kimball
Date:   Fri Jun 18 17:58:53 2010 -0700

    MAPREDUCE-679. XML-based metrics as JSP servlet for JobTracker

    Description: A simple XML translation of the existing JobTracker status page which provides the same metrics (including the tables of running/completed/failed jobs) as the human-readable page. This is a relatively lightweight addition to provide some machine-understandable metrics reporting.
    Reason: Improvement
    Author: Aaron Kimball
    Ref: CDH-651

commit d8dc8dad821a02619afdbfc3d1cb978b86cb071b
Author: Aaron Kimball
Date:   Fri Jun 18 17:24:07 2010 -0700

    MAPREDUCE-1372. ConcurrentModificationException in JobInProgress

    Description: Fixes a ConcurrentModificationException in JobInProgress
    Reason: Bug
    Author: Dick King
    Ref: CDH-546

commit e212ca0b0abbd78cdea4596fe9f3c6dbbaa57258
Author: Aaron Kimball
Date:   Fri Jun 18 16:20:01 2010 -0700

    MAPREDUCE-1378. Args in job details links on jobhistory.jsp are not URL encoded

    Description: The logFile argument in the job links on the JT jobhistory.jsp page is not properly URL encoded leading to links that result in 500 errors.
    Reason: Bug
    Author: Eric Sammer
    Ref: CDH-645

commit 23e68e669a118d34e265af5e8ffda3615c2666f9
Author: Aaron Kimball
Date:   Fri Jun 18 15:52:15 2010 -0700

    MAPREDUCE-1570. Shuffle stage - Key and Group Comparators

    Description: Shuffle method in org.apache.hadoop.mrunit.MapReduceDriverBase doesn't currently allow the use of custom GroupingComparator and SortComparator. This patch adds these features.
    Reason: Improvement
    Author: Chris White
    Ref: CDH-958

commit 4601521a9793255e8b5881d64ff1a921451bc951
Author: Aaron Kimball
Date:   Fri Jun 18 15:48:41 2010 -0700

    MAPREDUCE-739. Allow relative paths to be created inside archives.

    Description: Allow creating archives with relative paths with a -p option on the command line. Archives currently store the full path from the input sources, since they allow multiple sources and regular expressions as inputs. So the created archives have the full path of the input sources. This is unintuitive and a user hassle. We should get rid of it and allow users to say that the created archive should be relative to some absolute path, and throw an exception if the input does not conform to the relative absolute path.
    Reason: Improvement
    Author: Mahadev Konar
    Ref: CDH-501

commit 1d4e15f0f8b749981d62bfca9849e0d0493afdad
Author: Todd Lipcon
Date:   Thu Jun 17 20:02:51 2010 -0700

    HDFS-1247. Improvements to HDFS-1204 test

    Reason: Fixes compile warnings
    Author: Todd Lipcon
    Ref: CDH-659

commit 1fab52d87c29bc7117eb7324d1a152d8d889f62b
Author: Todd Lipcon
Date:   Wed Jun 2 18:25:11 2010 -0700

    HDFS-1246. Manual tool to test sync on a cluster

    Description: Tool for automated testing that sync maintains every edit after kill -9
    Reason: Cluster Testing of Sync support for CDH3
    Author: Todd Lipcon
    Ref: CDH-659

commit b9259a145f516a01ba37a33b3803c88824fd55e5
Author: Todd Lipcon
Date:   Thu Jun 17 09:55:31 2010 -0700

    HDFS-1240.
    Fix failing TestDFSShell due to HDFS-909 backport on branch-20

    Reason: Fix red build
    Author: Todd Lipcon
    Ref: CDH-659

commit 7276208c2789f2c3961c6dc9fa1d2757774971b1
Author: Todd Lipcon
Date:   Wed Jun 16 12:16:25 2010 -0700

    HDFS-1243. Replication tests in TestFileAppend4 should wait for a second for replication to occur

    Reason: Test error - fix sporadic failure of TestFileAppend4
    Author: Todd Lipcon
    Ref: CDH-659

commit dc1797ec8380b07117bbc6d662e2f1f56b25e6bd
Author: Todd Lipcon
Date:   Tue Jun 15 17:56:43 2010 -0700

    HDFS-1207. stallReplicationWork should be marked volatile in FSNamesystem

    Description: Small bug fix for code used by tests only
    Reason: Fix sporadic failure of TestFileAppend4
    Author: Todd Lipcon
    Ref: CDH-659

commit a960eea40dbd6a4e87072bdf73ac3b62e772f70a
Author: Todd Lipcon
Date:   Sun Jun 13 23:02:38 2010 -0700

    HDFS-1197. Received blocks should not be added to block map prematurely for under construction files

    Description: Fixes a possible dataloss scenario when using append() on real-life clusters. Also augments unit tests to uncover similar bugs in the future by simulating latency when reporting blocks received by datanodes.
    Reason: Append support dataloss bug
    Author: Todd Lipcon
    Ref: CDH-659

commit 3cc1405289ac4ec6616a5ba9da18ff421a93678e
Author: Todd Lipcon
Date:   Mon Jun 14 01:43:18 2010 -0700

    HDFS-1209. Add parameter dfs.client.block.recovery.retries to determine how many times to try to recover block

    Reason: Used by append tests
    Author: Todd Lipcon
    Ref: CDH-659

commit 128395ae4d317204fe8fb118333270826adf96d5
Author: Todd Lipcon
Date:   Sun Jun 6 16:38:21 2010 -0400

    HDFS-1118. DFSOutputStream socket leak when can't connect to DN

    Reason: Fixes DFS Client socket leaks in an error condition
    Author: Zheng Shao
    Ref: CDH-659

commit 4ba384d2b9f92f7300ce06b35a967e4edc3ba671
Author: Todd Lipcon
Date:   Fri Jun 4 15:10:00 2010 -0700

    HADOOP-6762. Interrupting a thread performing an RPC should not hang that thread.
    Description: Moves the sending of parameters for RPC calls to a separate thread, such that interrupting a thread that is making an RPC call does not negatively affect the shared RPC channel.
    Reason: Fixes occasional hangs of HBase under heavy load during failure testing.
    Author: Sam Rash
    Ref: CDH-659, CDH-1084

commit 6e99c7e2a12eea782629337f5fb5734e8e5e5865
Author: Todd Lipcon
Date:   Wed Jun 2 22:32:45 2010 -0700

    HDFS-1210. DFSClient should print IOE that caused recovery failure

    Description: Adds an extra WARN message during DFS client error recovery
    Reason: Makes it easier to debug/diagnose recovery issues
    Author: Todd Lipcon
    Ref: CDH-659

commit 1b8d8c3de261c8334d6eac4f5d3fd42cad894e81
Author: Todd Lipcon
Date:   Wed Jun 2 21:53:01 2010 -0700

    HDFS-1186. Writers should be interrupted when recovery is started, not when it's completed.

    Description: When the write pipeline recovery process is initiated, this interrupts any concurrent writers to the block under recovery. This prevents a case where some edits may be lost if the writer has lost its lease but continues to write (eg due to a garbage collection pause)
    Reason: Fixes a potential dataloss bug
    Author: Todd Lipcon
    Ref: CDH-659

commit 2ec4301341b249acd0c0cac1792aaa6a6dabab8e
Author: Todd Lipcon
Date:   Thu May 20 00:23:20 2010 -0700

    HDFS-915. Write pipeline hangs for too long when ResponseProcessor hits timeout

    Description: Previously, the write pipeline would hang for the entire write timeout when it encountered a read timeout (eg due to a network connectivity issue). This patch interrupts the writing thread when a read error occurs.
    Reason: Faster recovery from pipeline failure for HBase and other interactive applications.
    Author: Todd Lipcon
    Ref: CDH-659

commit 641090318603c47bfd55e1eea2b039f37e5b723a
Author: Todd Lipcon
Date:   Fri May 14 19:20:10 2010 -0700

    HDFS-1218. Replicas that are recovered during DN startup should not be allowed to truncate better replicas.
    Description: If a datanode loses power and then recovers, its replicas may be truncated due to the recovery of the local FS journal. This patch ensures that a replica truncated by a power loss does not truncate the block on HDFS.
    Reason: Potential dataloss bug uncovered by power failure simulation
    Author: Todd Lipcon
    Ref: CDH-659

commit 46f2b3ad578ea1d2ee2cca4e6467ba2daa57df0e
Author: Todd Lipcon
Date:   Fri May 14 19:34:09 2010 -0700

    HDFS-445. pread should refetch block locations when necessary

    Description: The positional read API in DFSInputStream was previously missing any retry logic. This patch adds this logic.
    Reason: HBase and other applications depend on the pread API.
    Author: Kan Zhang
    Ref: CDH-659

commit aea067a20e16345f307de7efe80935dd7addbe6b
Author: Todd Lipcon
Date:   Fri May 14 19:19:56 2010 -0700

    HDFS-1204. LeaseManager expiring leases should only expire the single file, not entire lease

    Reason: Logic bug in lease recovery could cause incorrectly interrupted writers
    Author: Sam Rash
    Ref: CDH-659

commit 10e5944da20d851a847cb2ef422383507d070085
Author: Todd Lipcon
Date:   Thu May 13 16:33:15 2010 -0700

    HDFS-1242. Add unit test for the appendFile race condition / synchronization bug fixed in HDFS-142

    Reason: Test coverage for previously applied patch.
    Author: Todd Lipcon
    Ref: CDH-659

commit 18174a2abc5a91105ae1adc2bda026d90c41a60b
Author: Todd Lipcon
Date:   Wed May 12 20:06:33 2010 -0700

    HDFS-1202. Don't try to update block scan status if block scanner is not initialized yet

    Reason: Fixes NPE seen at DataNode startup
    Author: Todd Lipcon
    Ref: CDH-659

commit ca9e1b3c59b05de9dc4fafa19f24dca80110bcc0
Author: Todd Lipcon
Date:   Wed May 12 19:28:56 2010 -0700

    HDFS-1205. Make async disk service threads nameable

    Description: HDFS-611 moved some datanode operations to a separate thread pool. This patch ensures that these worker threads have clear names.
    Reason: Aids debugging/diagnosing of issues
    Author: Todd Lipcon
    Ref: CDH-659

commit 1b8316d403ac542772c0745159a7397c798a5698
Author: Todd Lipcon
Date:   Tue May 11 16:47:47 2010 -0700

    HDFS-606. Avoid ConcurrentModification in replica invalidation

    Description: Replica invalidation iterated over a collection that it also modified, causing a CME. This patch makes a copy before iteration. Performance should be unaffected as this is a rare code path.
    Reason: Avoid runtime exception in namenode
    Author: Konstantin Shvachko
    Ref: CDH-659

commit b7f908bc77d9344c36dcc409bbfe92709b98cf88
Author: Todd Lipcon
Date:   Thu May 6 08:52:18 2010 -0700

    HDFS-1244. Misc improvements to TestFileAppend2

    Description: Improvements made to a test case to enable it to be run from the command line, with the various test parameters available in arguments.
    Reason: Enable long-running stress tests of append functionality.
    Author: Todd Lipcon
    Ref: CDH-659

commit 370c9a1e75cc5d5e93cec066006ada0485139fb8
Author: Todd Lipcon
Date:   Tue Jun 15 18:48:58 2010 -0700

    HDFS-1141. completeFile should check lease holder

    Description: Fixes a bug where a writer could finalize an in-progress file after it had already lost its lease. This could occur for example if the writer entered a GC pause after finishing the last block but before finalizing the file.
    Reason: Potential dataloss bug with append/sync
    Author: Todd Lipcon
    Ref: CDH-659

commit 7f0d67fa52b9c58360b06e851bf77bc2f909f65f
Author: Todd Lipcon
Date:   Wed May 5 14:43:40 2010 -0700

    HDFS-1215. Fix TestNodeCount to not infinite loop after HDFS-409 MiniCluster changes

    Description: Fixes a test to work properly after some test infrastructure was changed by HDFS-142 in branch-0.20-append.
    Reason: Fixes failing test.
    Author: Todd Lipcon
    Ref: CDH-659

commit 77ac4f46fb5c011b5ac7c5fedb4c51b31580c9ba
Author: Todd Lipcon
Date:   Tue Jun 15 18:33:58 2010 -0700

    HDFS-1248.
    Miscellaneous cleanup and improvements on 0.20 append branch

    Description: Miscellaneous code cleanup and logging changes, including:
     - Slight cleanup to recoverFile() function in TestFileAppend4
     - Improve error messages on OP_READ_BLOCK
     - Some comment cleanup in FSNamesystem
     - Remove toInodeUnderConstruction (was not used)
     - Add some checks for null blocks in FSNamesystem to avoid a possible NPE
     - Only log "inconsistent size" warnings at WARN level for non-under-construction blocks.
     - Redundant addStoredBlock calls are also not worthy of WARN level
     - Add some extra information to a warning in ReplicationTargetChooser
    Reason: Improves diagnosis of error cases and clarity of code
    Author: Todd Lipcon
    Ref: CDH-659

commit 46e6199d8819538d96c3f4c5dbbfba163382b2a9
Author: Todd Lipcon
Date:   Mon May 3 15:02:32 2010 -0700

    HDFS-1122. Don't allow client verification to prematurely add inprogress blocks to DataBlockScanner

    Description: When a client reads a block that is also open for writing, it should not add it to the datanode block scanner. If it does, the block scanner can incorrectly mark the block as corrupt, causing data loss.
    Reason: Potential dataloss with concurrent writer-reader case.
    Author: Sam Rash
    Ref: CDH-659

commit 07711a4ea3edd1a504eb9bbb13c93d5573620d34
Author: Todd Lipcon
Date:   Mon May 3 12:04:49 2010 -0700

    HDFS-1057. Fixes for concurrent readers behind an appended file

    Description: Allows a client to read a file while it is still being written by a writer, so long as the writer has called sync().
    Reason: Used by HBase replication, and useful for other "tail"-like applications.
    Author: Sam Rash
    Ref: CDH-659

commit 587de668e43486f7109a885f617b9b757d7a649e
Author: Todd Lipcon
Date:   Sat Apr 24 17:33:34 2010 -0700

    HADOOP-6722.
    Workaround a TCP spec quirk by not allowing NetUtils.connect to connect to itself

    Description: TCP's ephemeral port assignment results in the possibility that a client can connect back to its own outgoing socket, resulting in failed RPCs or datanode transfers.
    Reason: Fixes intermittent errors in cluster testing with ephemeral IPC/transceiver ports on datanodes.
    Author: Todd Lipcon
    Ref: CDH-659

commit 7a93fcc8c22b7cff87221ec0a8bf8f6689f12b82
Author: Todd Lipcon
Date:   Thu Apr 22 10:24:59 2010 -0700

    HDFS-1203. Add small sleep to prevent DN flooding NN in error cases

    Description: If the datanode experiences an error in sending its block reports to the name node, it previously would loop retrying with no delay between attempts. In the case that the DN is sending an invalid report, this will flood the NN with RPCs. This patch adds a short sleep before the retry.
    Reason: Prevents possible flood of RPCs to the NameNode in DN error conditions.
    Author: Todd Lipcon
    Ref: CDH-659

commit a30c033c1eed744948ddfddb82b81b06e12bba46
Author: Todd Lipcon
Date:   Fri Apr 16 15:19:08 2010 -0700

    HDFS-561. Fix read timeouts in write pipeline to stage correctly

    Description: Previously, the read timeout on the write pipeline was incorrectly calculated. This caused the client to detect the wrong failed datanode when a datanode's network failed or froze for another reason.
    Reason: Fix recovery behavior for frozen datanodes
    Author: Kan Zhang
    Ref: CDH-659

commit 02ab12541a004d67a96428055a58a3b726c1c4b6
Author: Todd Lipcon
Date:   Thu Apr 15 01:04:43 2010 -0700

    HDFS-895. Allow hflush/sync to operate in parallel with other writers

    Description: Modifies synchronization of the DFSOutputStream sync feature such that multiple threads can sync the same stream concurrently and each will wait only the minimal amount of time. Also allows further writes to continue past the sync point while the sync waits.
    Reason: Substantial performance improvement for durable HBase
    Author: Todd Lipcon
    Ref: CDH-659

commit d1c4359e1abc3f3e5e4fa16ee1c83a3d7f015da3
Author: Todd Lipcon
Date:   Wed Apr 14 14:59:39 2010 -0700

    HDFS-1211. BlockReceiver logs too much at INFO level when using sync()

    Description: Reduces the log level from INFO to DEBUG for a common message in the datanode log when using the sync feature.
    Reason: Substantially reduces DN log chattiness for syncing clients.
    Author: Todd Lipcon
    Ref: CDH-659

commit 23cfa9e8263ad1d92814b5829e2f50bb37d57857
Author: todd
Date:   Sun Mar 21 16:25:48 2010 -0700

    HDFS-1056. Fix possible multinode deadlocks during block recovery when using ephemeral dataxceiver ports

    Description: Fixes the logic by which datanodes identify local RPC targets during block recovery for the case when the datanode is configured with an ephemeral data transceiver port.
    Reason: Potential internode deadlock for clusters using ephemeral ports
    Author: Todd Lipcon
    Ref: CDH-659

commit 08cbce1e413e98d0aaeceeaca26a60c3d9a50b29
Author: todd
Date:   Sun Mar 21 14:56:56 2010 -0700

    HDFS-611. Move block deletions to an async thread. Applying this to make the HDFS-142 patch apply cleanly

    Description: Moves the deletion of blocks in the datanode into a thread pool. Substantially improves datanode heartbeat consistency for workloads with heavy deletes and/or lots of disks.
    Reason: Substantially reduces frequency of "could not complete block" errors and needless re-replication on clusters with lots of disks or heavy deletes.
    Author: Zheng Shao
    Ref: CDH-659

commit 57783d0683f0d675423369e0a0f9f5dd520c17f2
Author: todd
Date:   Sun Mar 21 03:36:45 2010 -0700

    HDFS-1055. Improve thread naming in DN Xceiver

    Description: Names the threads created by the DataNode based on the action they are performing.
    Reason: Eases diagnosis of datanode performance/lock contention issues.
    Author: Todd Lipcon
    Ref: CDH-659

commit fddb2bd057e88506a1bb94232426053d1640a34b
Author: todd
Date:   Sun Mar 21 03:36:29 2010 -0700

    HDFS-894. Fix ipcPort tracking in Datanode registration. TODO: add the test case from JIRA

    Description: Fixes the NameNode to properly reregister datanodes when they crash and restart with a different IPC port (eg when IPC port is configured to be ephemeral)
    Reason: Fixes errors on clusters with ephemeral ports.
    Author: Todd Lipcon
    Ref: CDH-659

commit bc5217543eccc2cfd8a182cdbb051b39d2abf3e7
Author: Dhruba Borthakur
Date:   Fri Jun 11 23:37:38 2010 +0000

    HDFS-1054. remove sleep before retry for allocating a block.

    Description: When the write pipeline fails to allocate a new block, it previously slept for a hard-coded 6 seconds before retrying. This sleep has little reasoning behind it, so it is removed.
    Reason: Improve failure recovery performance for interactive applications like HBase.
    Author: Todd Lipcon
    Ref: CDH-931

commit 870c7526a3e6a632eb23cf14f9011f279181a759
Author: Dhruba Borthakur
Date:   Thu Jun 10 22:25:39 2010 +0000

    HDFS-142. Blocks that are being written by a client are stored in the blocksBeingWritten directory.

    git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append@953482 13f79535-47bb-0310-9956-ffa450edef68

    Description: Moves blocks being written by clients into a different directory in dfs.data.dir. Also fixes several other bugs in the datanode and namenode to support various error conditions related to append and sync.
    Reason: Necessary for proper recovery of synced data in several error conditions.
    Author: Dhruba Borthakur, Nicolas Spiegelberg, Todd Lipcon
    Ref: CDH-659

commit 8e888717294496caae825d7f3f609d0661e7997a
Author: Dhruba Borthakur
Date:   Thu Jun 10 18:46:03 2010 +0000

    HDFS-826. Allow a mechanism for an application to detect that datanode(s) have died in the write pipeline. (dhruba)

    Description: Adds an API in DFSOutputStream to determine the current length of the write pipeline.
    Reason: Necessary for better reliability of HBase write-ahead logs.
    Author: Dhruba Borthakur
    Ref: CDH-931

commit 8fcb419648160efaed6fdd467875c3b1743d2bee
Author: Dhruba Borthakur
Date:   Wed Jun 9 23:12:21 2010 +0000

    HDFS-988. Fix bug where saveNamespace can corrupt edits log.

    Description: Fixes several synchronization errors in the NameNode and ensures that all edits have been synced to the edits log before the namespace is saved.
    Reason: Fixes potential data corruption bug.
    Author: Todd Lipcon
    Ref: CDH-1436

commit f5ace5f920bc16fd202a6e4a53fe0ffe0cb5045e
Author: Todd Lipcon
Date:   Thu May 20 01:23:15 2010 -0700

    HDFS-101. Datanodes should continue to forward acks until client stops pipeline.

    Description: When one node in the pipeline dies, the datanodes in between the client and the dead node should stay alive and continue to forward acks until the client stops the pipeline. This fixes an issue where the client would incorrectly determine that the local DN had failed when in fact another DN in the pipeline was at fault.
    Reason: Common source of failed pipeline recovery in cluster fault testing
    Author: Hairong Kuang, Todd Lipcon
    Ref: CDH-693

commit 132ef7c852847e9d2c1e7879f2fca26652bb77ef
Author: Dhruba Borthakur
Date:   Fri Jun 4 07:20:10 2010 +0000

    HDFS-200. Support append and sync for hadoop 0.20 branch.

    Description: Provides basic support for append and sync on 0.20
    Reason: Append and sync required for durable HBase and many other applications.
    Author: Dhruba Borthakur
    Ref: CDH-659

commit 092bcd174dbf609f5002078490c357462e0ce8b1
Author: Konstantin Shvachko
Date:   Wed Apr 21 03:05:45 2010 +0000

    HDFS-909. Fix race in edit log rolling

    Description: Fixes a race condition when rolling edit logs that can corrupt the logs.
    Reason: Potential namenode metadata corruption bug.
    Author: Todd Lipcon
    Ref: CDH-1174

commit e2a78f767d26b838bf67354a4b85235ddd731038
Author: Eli Collins
Date:   Fri Jun 18 14:41:14 2010 -0700

    CLOUDERA-BUILD.
    Update hadoop-config.sh to reflect new jar version.

commit 1756e97a35451bbc01a493e843f1ec0885c99792
Author: Aaron Kimball
Date:   Fri Jun 18 11:37:22 2010 -0700

    MAPREDUCE-1644. Remove Sqoop from Apache Hadoop (moving to github)

    Description: Sqoop is moving to github! All code for sqoop is already live at http://github.com/cloudera/sqoop - this issue removes the duplicate code from the Apache Hadoop repository. CDH users should install the separate 'sqoop' package for this functionality.
    Reason: Moving to a separate package
    Author: Aaron Kimball
    Ref: CDH-1404

commit e0afb34b89a013419fca4bdcda5f2cf0401f93ca
Author: Aaron Kimball
Date:   Thu Jun 17 19:06:50 2010 -0700

    MAPREDUCE-1302. TrackerDistributedCacheManager can delete file asynchronously

    Description: With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to delete files from distributed cache asynchronously. That will help make task initialization faster, because task initialization calls the code that localizes files into the cache and may delete some other files. The deletion can slow down task initialization.
    Reason: Performance improvement
    Author: Zheng Shao
    Ref: CDH-495

commit 456821d6934fd769ab317c2290a4ff53b075269e
Author: Aaron Kimball
Date:   Thu Jun 17 19:04:31 2010 -0700

    HADOOP-6433. Add AsyncDiskService that is used in both hdfs and mapreduce

    Description: Create a thread pool per disk volume, and use that for scheduling async disk operations.
    Reason: Improvement
    Author: Zheng Shao
    Ref: CDH-495

commit 6e467c42d62aafd00fd2f38269806680427631c8
Author: Aaron Kimball
Date:   Thu Jun 17 18:50:47 2010 -0700

    MAPREDUCE-1213. TaskTrackers restart is very slow because it deletes distributed cache directory synchronously

    Description: We are seeing that when we restart a tasktracker, it tries to recursively delete all the files in the distributed cache. It invokes FileUtil.fullyDelete(), which is very slow.
    This means that the TaskTracker cannot join the cluster for an extended period of time (up to 2 hours for us). The problem is acute if the number of files in a distributed cache is a few thousand.
    Reason: Performance
    Author: Zheng Zhao
    Ref: CDH-495

commit 5626a0e301557dbc93ad5084aa9ef4527316db7b
Author: Aaron Kimball
Date:   Thu Jun 17 18:45:58 2010 -0700

    MAPREDUCE-1443. DBInputFormat can leak connections

    Description: The DBInputFormat creates a Connection to use when enumerating splits, but never closes it. This can leak connections to the database which are not cleaned up for a long time.
    Reason: Bug
    Author: Aaron Kimball
    Ref: CDH-1435

commit 912eed1c5d50066e68700d2143b775914d7f8e54
Author: Aaron Kimball
Date:   Thu Jun 17 16:00:49 2010 -0700

    MAPREDUCE-1489. DataDrivenDBInputFormat should not query the database when generating only one split

    Description: DataDrivenDBInputFormat runs a query to establish bounding values for each split it generates; but if it's going to generate only one split (mapreduce.job.maps == 1), then there's no reason to do this. This will remove overhead associated with a single-threaded import of a non-indexed table since it avoids a full table scan.
    Reason: Improvement
    Author: Aaron Kimball
    Ref: CDH-1431

commit 1c3fc82063212196fd2fac7f55df8eb323e8f601
Author: Aaron Kimball
Date:   Tue Apr 27 11:44:29 2010 -0700

    MAPREDUCE-1728. Oracle timezone strings do not match Java

    Description: OracleDBRecordReader sets the session timezone based on the toString representation of the current java.util.TimeZone. This is incorrect; Oracle manages a separate database of acceptable timezone strings, whose string representations are different than the timezone representations recognized by Java.
    Reason: Bug
    Author: Aaron Kimball
    Ref: CDH-961

commit 11bc9be1ff2fd994046acd660afa7631f9203cfb
Author: Eli Collins
Date:   Thu May 27 17:44:00 2010 -0700

    HADOOP-6714. FsShell 'hadoop fs -text' does not support compression codecs.
    Currently, 'hadoop fs -text myfile' looks at the first few magic bytes of a file to determine whether it is gzip compressed or a sequence file. This means 'fs -text' cannot properly decode .deflate or .bz2 files (or other codecs specified via configuration).
    Reason: Improvement
    Author: Eli Collins
    Ref: CDH-1136

commit e95781032b5d886aa6583cab1306025fe372babf
Author: Eli Collins
Date:   Tue May 25 13:20:00 2010 -0700

    HADOOP-1849. IPC server max queue size should be configurable.

    Description: Currently max queue size for IPC server is set to (100 * handlers). Usually when RPC failures are observed (e.g. HADOOP-1763), we increase number of handlers and the problem goes away. I think a big part of such a fix is the increase in max queue size. I think we should make maxQsize per handler configurable (with a bigger default than 100). There are other improvements also (HADOOP-1841).

    Server keeps reading RPC requests from clients. When the number of in-flight RPCs is larger than maxQsize, the earliest RPCs are deleted. This is the main feedback Server has for the client. I have often heard from users that Hadoop doesn't handle bursty traffic.

    Say handler count is 10 (default) and Server can handle 1000 RPCs a sec (quite conservative/low for a typical server); it implies that an RPC can wait for only 1 sec before it is dropped. If there are 3000 clients and all of them send RPCs around the same time (not very rare, with heartbeats etc), 2000 will be dropped. Instead of dropping the earliest RPCs, if the server delays reading new RPCs, the feedback to clients would be much smoother. I will file another jira regarding queue management.

    For this jira I propose to make queue size per handler configurable, with a larger default (maybe 500).
    Reason: Improvement
    Author: Eli Collins
    Ref: CDH-1133

commit 776a20d37142534751178b060285d2813cc66c1c
Author: Eli Collins
Date:   Tue May 25 13:09:30 2010 -0700

    HADOOP-6724. IPC doesn't properly handle IOEs thrown by socket factory.
    Description: If the socket factory throws an IOE inside setupIOStreams, then handleConnectionFailure will be called with socket still null, and thus generate an NPE on socket.close(). This ends up orphaning clients, etc.
    Reason: Bug fix
    Author: Eli Collins
    Ref: CDH-1132

commit 1864359f4ef32974ed41a1278e640e1ee246ef9b
Author: Eli Collins
Date:   Tue May 25 13:05:38 2010 -0700

    HADOOP-6723. Unchecked exceptions thrown in IPC connection should not orphan clients.

    Description: If the server sends back some malformed data, for example, receiveResponse() can end up with an incorrect call ID. Then, when it tries to find it in the calls map, it will end up with null and throw NPE in receiveResponse. This isn't caught anywhere, so the original IPC client ends up hanging forever instead of catching an exception. Another example is if the writable implementation itself throws an unchecked exception or OOME. We should catch Throwable in Connection.run() and shut down the connection if we catch one.
    Reason: Bug fix
    Author: Eli Collins
    Ref: CDH-1131

commit 95d64157f05d467dad3e1190a5cba2a3f89b0925
Author: Eli Collins
Date:   Thu May 20 17:15:13 2010 -0700

    CLOUDERA-BUILD. Rename the fuse_dfs wrapper.

    Description: Rename the fuse_dfs wrapper to hadoop-fuse-dfs.
    Reason: Improvement
    Author: Alex Newman
    Ref: CDH-1103

commit d8c973d9c6f650032c88915d9fef6f4a568d37a5
Author: Chad Metcalf
Date:   Wed May 19 15:38:14 2010 -0700

    CLOUDERA-BUILD. Fixes for the fuse_dfs wrapper.

    Description: The wrapper uses bash syntax (i.e., +=) so we should use bash. We need to modprobe fuse explicitly on Ubuntu. Since this is installed by install_hadoop.sh we know HADOOP_HOME and should use it directly. Lastly, there is more robust JAVA_HOME checking in hadoop-config.sh so we should use that.
    Reason: Fuse currently broken on Ubuntu
    Author: Chad Metcalf
    Ref: CDH-1089

commit e810911445859693ee0b868c2a5d8bc18360cdb9
Author: Eli Collins
Date:   Tue May 18 14:30:04 2010 -0700

    HDFS-1161.
    Make DN minimum valid volumes configurable

    Description: This change adds a dfs.datanode.failed.volumes.tolerated parameter so that users can configure the number of volumes that are allowed to fail before a datanode stops offering service. By default any volume failure will cause a datanode to shut down.
    Reason: Improvement
    Author: Eli Collins
    Ref: CDH-1081

commit baa77bdde4fd971877418391a4fe491c2d4c2501
Author: Eli Collins
Date:   Mon May 17 19:49:44 2010 -0700

    HDFS-1160. Improve some FSDataset warnings and comments.

    Description: Cleans up HDFS-547 warnings.
    Reason: Improvement
    Author: Eli Collins
    Ref: CDH-1080

commit 90f5a4bf77d17adcabb834a3cc2e02becb9f012d
Author: Eli Collins
Date:   Mon May 17 18:53:50 2010 -0700

    HDFS-612. FSDataset should not use org.mortbay.log.Log.

    Description: Cleans up HDFS-547 logging.
    Reason: Improvement
    Author: Eli Collins
    Ref: CDH-1079

commit 4a925fe53a2015e504cd8c8796e0e590d22019c4
Author: Eli Collins
Date:   Thu Apr 22 14:41:08 2010 -0700

    HDFS-457. Better handling of volume failure in Data Node storage.

    Description: Current implementation shuts DataNode down completely when one of the configured volumes of the storage fails. This is rather wasteful behavior because it decreases utilization (good storage becomes unavailable) and imposes extra load on the system (replication of the blocks from the good volumes). These problems will become even more prominent when we move to mixed (heterogeneous) clusters with many more volumes per Data Node.
    Reason: Improvement
    Author: Eli Collins
    Ref: CDH-472

commit 3af9533ee6f260373f302ff4a16dd04eb75e0616
Author: Chad Metcalf
Date:   Mon Mar 1 15:28:19 2010 -0800

    CLOUDERA-BUILD. hadoop-config runs before hadoop-env.sh

    conf/hadoop-env.sh says you can update JAVA_HOME there, but it gets sourced after hadoop-config.sh, which errors out if JAVA_HOME is not set. This patch changes the flow so hadoop-env is always sourced by hadoop-config after the --config flag is processed.
This will allow JAVA_HOME to be set in hadoop-env and still allow for trying to find a valid JAVA_HOME. commit c9295d4ac2848403362e5dbaa78aa7be4ce4254e Author: Eli Collins Date: Sat May 15 13:39:08 2010 -0700 HADOOP-3659. Fix hadoop native to compile on Mac OS X. Description: This patch makes the autoconf script work on Mac OS X. LZO needs to be installed (including the optional shared libraries) for the compile to succeed. You'll want to regenerate the configure script using autoconf after applying this patch. Reason: Bug fix Author: Eli Collins Ref: CDH-825 commit cc035175e1cf1ddef878cba6aa93725f832d0327 Author: Eli Collins Date: Sat May 15 12:55:06 2010 -0700 MAPREDUCE-1785. Add streaming config option for not emitting the key. Description: PipeMapper currently does not emit the key when using TextInputFormat. If you switch to input formats (eg LzoTextInputFormat) the key will be emitted. We should add an option so users can explicitly make streaming not emit the key so they can change input formats without breaking or having to modify their existing programs. Reason: Improvement Author: Eli Collins Ref: CDH-856 commit 590a82c257842be51170619deafd15cc2988541e Author: Eli Collins Date: Thu May 13 21:25:53 2010 -0700 HADOOP-4885. Try to restore failed replicas of Name Node storage (at checkpoint time). Description: If one of the replicas of the NameNode storage fails for whatever reason (for example temporarily failure of NFS) this Storage object is removed from the list of storage objects forever. It can be added back only on restart of the NameNode. We propose to check the status of a failed storage on every checkpoint and if it becomes valid - try to restore the edits and fsimage. Reason: Improvement Author: Eli Collins Ref: CDH-473 commit 0f2f19e1bd5725f6163998ae86d9103c0d552de3 Author: Eli Collins Date: Thu May 13 20:07:02 2010 -0700 HDFS-1024. SecondaryNamenode fails to checkpoint because namenode fails with CancelledKeyException. 
Description: The secondary namenode fails to retrieve the entire fsimage from the Namenode. It fetches a part of the fsimage but believes that it has fetched the entire fsimage file and proceeds ahead with the checkpointing. Reason: Bug fix Author: Eli Collins Ref: CDH-891 commit 0ec1d6ed85a30327c657c2418932728d0e4e98df Author: Todd Lipcon Date: Wed May 12 21:33:45 2010 -0700 HADOOP-6254. Slow reads cause s3n to fail with SocketTimeoutException Reason: Bug fix for users of s3n:// file system Author: Andrew Hitchcock Ref: CDH-1035 commit d64943401780c3dd1dc498419f33ded8222c3210 Author: Eli Collins Date: Wed May 12 12:05:26 2010 -0700 HADOOP-6667. RPC.waitForProxy should retry through NoRouteToHostException. Description: RPC.waitForProxy already loops through ConnectExceptions, but NoRouteToHostException is not a subclass of ConnectException. In the case that the NN is on a VIP, the No Route To Host error is reasonably common during a failover, so we should retry through it just the same as the other connection errors. Reason: Improvement Author: Eli Collins Ref: CDH-907 commit a5fb4a8c8bf9d6a3a96c3a06eb3a46febaf21a0f Author: Todd Lipcon Date: Fri May 7 15:36:14 2010 -0700 MAPREDUCE-1375. TestFileArgs fails intermittently Description: Fixes an error in a test case without modifying code. This is an amendment to the prior fix which did not address the issue properly. Reason: Should fix flaky tests. Author: Todd Lipcon Ref: CDH-657 commit 148d291aa14a4481dc206d2fc9a8527eb6761488 Author: newalex Date: Fri Apr 16 15:48:14 2010 -0700 CLOUDERA-BUILD. Add a fuse manpage Description: Adding a fuse_dfs manpage and adding a manpage to the build. Reason: New Feature Author: Alex Newman Ref: CDH-927 commit 9acfd39492f85c92bc45d47d6dcfb309e3826c64 Author: newalex Date: Thu Apr 8 10:35:19 2010 -0700 CLOUDERA-BUILD. 
Build script changes to build DEB packages Description: The required changes to the cloudera hadoop building scripts for pulling the fuse files out and cleaning up its mess v.v. DEBs. Reason: Building packages Author: Alex Newman Ref: CDH-929 commit d144085817496eecc57c510022d66d0540b4511d Author: newalex Date: Tue Apr 6 14:05:29 2010 -0700 CLOUDERA-BUILD. Added an RPM for fuse Description: The required changes to the cloudera hadoop building scripts for pulling the fuse files out and cleaning up its mess. Reason: Building packages Author: Alex Newman Ref: CDH-928 commit 56648efe291503249fec22a242917ec4dddc6214 Author: Eli Collins Date: Tue Mar 30 15:17:50 2010 -0700 HADOOP-6522. Fix decoding of codepoint zero in UTF8. Description: TestUTF8 is actually flaky. It generates 10 random strings to run the test on. If you change this number to 100000 it fails every time. The problem is that the null character (codepoint zero) was correctly encoded but incorrectly decoded. I've attached a patch that fixes this and increases the size of the tests so that problems like this will likely be discovered sooner. Reason: Bugfix to UTF8 Author: Eli Collins Ref: CDH-718 commit 936a67ba3b34dc8c8efd3df92d9e50309fafb8f6 Author: Aaron Kimball Date: Mon Mar 29 23:50:14 2010 -0700 MAPREDUCE-1460. Oracle support in DataDrivenDBInputFormat Description: DataDrivenDBInputFormat does not work with Oracle due to various SQL syntax issues. Reason: Required for Sqoop/Oracle integration Author: Aaron Kimball Ref: CDH-888 commit c08f94a6927f9c8b0dfaeb674835afdd3fdd1d08 Author: Aaron Kimball Date: Mon Mar 29 17:15:53 2010 -0700 MAPREDUCE-1569. 
Mock Contexts & Configurations Description: Currently the library creates a new Configuration object in the MockMapContext and MockReduceContext constructors, rather than allowing the developer to configure and pass their own. Reason: Feature improvement for MRUnit Author: Chris White Ref: CDH-838 commit 27cfda1de80048bf2b46d74d78b61275ecc79be1 Author: Aaron Kimball Date: Mon Mar 29 16:43:49 2010 -0700 MAPREDUCE-1536. DataDrivenDBInputFormat does not split date columns correctly. Description: The DateSplitter does not properly split a range of (min, max) dates. Reason: Bugfix to DateSplitter Author: Aaron Kimball Ref: CDH-813 commit 7fc6e48e296c30f0afa8ae8da668bddbc9f422bf Author: Aaron Kimball Date: Mon Mar 29 16:11:22 2010 -0700 MAPREDUCE-1480. CombineFileRecordReader does not properly initialize child RecordReader Description: CombineFileRecordReader instantiates child RecordReader instances but never calls their initialize() method to give them the proper TaskAttemptContext. Reason: Bug in CombineFileInputFormat prevents proper use. Author: Aaron Kimball Ref: CDH-811 commit 32330fbadb4aed16627397979b90d52f2474ef38 Author: Aaron Kimball Date: Mon Mar 29 15:50:20 2010 -0700 MAPREDUCE-1423. Improve performance of CombineFileInputFormat when multiple pools are configured Description: I have a map-reduce job that is using CombineFileInputFormat. It has configured 10000 pools and 30000 files. Creating the splits takes more than an hour. The reason is that CombineFileInputFormat.getSplits() converts the same path from String to Path object multiple times, once for each instance of a pool. Similarly, it calls Path.toUri() multiple times. This code can be optimized. Reason: Improves CombineFileInputFormat performance (used by Sqoop); needed to apply MAPREDUCE-1480 cleanly Author: Dhruba Borthakur Ref: CDH-811 commit 6906389e07244931a108f2930544b9feada3a487 Author: Aaron Kimball Date: Mon Mar 29 15:41:38 2010 -0700 MAPREDUCE-364.
Change org.apache.hadoop.examples.MultiFileWordCount to use new mapreduce api. Description: Updates MultiFileWordCount example to use the new API in org.apache.hadoop.mapreduce instead of the deprecated API of org.apache.hadoop.mapred. This incorporates MAPREDUCE-367: Change org.apache.hadoop.mapred.lib.CombineFileInputFormat to use the new api. This solves duplicate issue MAPREDUCE-1112: Fix CombineFileInputFormat for hadoop 0.20 Reason: CombineFileInputFormat required for many clients of the new API, including Sqoop. Author: Amareshwari Sriramadasu Ref: CDH-811 commit 4b592cf8cb44c018f86abe529d71434d5106ce1e Author: Aaron Kimball Date: Mon Mar 29 13:07:15 2010 -0700 HADOOP-6382. Publish hadoop jars to apache mvn repo. Description: This provides an 'ant mvn-install' command that will install Hadoop core, streaming, examples, etc. jars in a maven repository. Uses the maven ant task to publish hadoop 20 jars to the apache maven repo. Reason: Required for cross-distribution dependency management in downstream projects (e.g., sqoop) Author: Giridharan Kesavan Ref: CDH-402 commit 8424e32eb866d677f40a9446f9c4cf74972b751e Author: Chad Metcalf Date: Thu Mar 18 17:05:47 2010 -0700 HADOOP-6643. Set executable bit for python cloud scripts in the distribution Description: This needs to be set in the tar target. Reason: Required for the EC2 scripts. Author: Tom White Ref: CDH-821 commit cfc3233ece0769b11af9add328261295aaf4d1ad Author: Aaron Kimball Date: Fri Mar 12 17:56:30 2010 -0800 CLOUDERA-BUILD. Fix ivy xml after rebase. Removed a redundant closing tag. Author: Matt Massie commit 54e1aefdd7a25a539831cac2c9b1bc3597f119ea Author: Aaron Kimball Date: Fri Mar 12 17:56:07 2010 -0800 CLOUDERA-BUILD. 
Small tweaks and fixes to Cloudera styling: Description: - Fixes trivial CSS bug for missing table cell borders in Chrome - Fixes footer to read "Distribution for Hadoop" instead of "Distribution of Hadoop" Author: Todd Lipcon commit ea83036b3838fa97c673e73145d52867b8ace6ac Author: Aaron Kimball Date: Fri Mar 12 17:55:30 2010 -0800 HDFS-1013. Miscellaneous improvements to HTML markup for web UIs Description: The Web UIs have various bits of bad markup (eg missing <head> sections, some pages missing CSS links, inconsistent td vs th for table headings). We should fix this up.
Improve markup and add Cloudera styling to Web UIs This adds a favicon and a number of HTML/CSS improvements to make the pages more space-efficient and easy on the eyes. This may be an incompatible change for users who are scraping the HTML output of the web UIs. Those users are encouraged to access the data programmatically rather than through scraping. The non-Cloudera-specific improvements will be contributed upstream as HDFS-1013 and MAPREDUCE-1544. Reason: User experience improvement Author: Todd Lipcon Ref: UNKNOWN commit 90ba5543e4c3176343e23943131a34d666c23d89 Author: Aaron Kimball Date: Fri Mar 12 17:54:58 2010 -0800 MAPREDUCE-1436. Deadlock in preemption code in fair scheduler Description: In testing the fair scheduler with preemption, I found a deadlock between updatePreemptionVariables and some code in the JobTracker. This was found while testing a backport of the fair scheduler to Hadoop 0.20, but it looks like it could also happen in trunk and 0.21. Details are in a comment below.
The fair scheduler introduces a potential jobtracker deadlock which was fixed on trunk by MAPREDUCE-870. This patch adjusts the locking in 0.20-based MapReduce to prevent this condition. Reason: bugfix (deadlock) Author: Matei Zaharia Ref: UNKNOWN commit 6f04e94feee3f40a73449cc6fbe7b4e3c48f1fc4 Author: Aaron Kimball Date: Fri Mar 12 17:54:13 2010 -0800 HDFS-696. Java assertion failures triggered by tests Description: Re-purposing as catch-all ticket for assertion failures when running tests with java asserts enabled. Running with the attached patch on trunk@823732 the following tests all trigger assertion failures:

TestAccessTokenWithDFS
TestInterDatanodeProtocol
TestBackupNode
TestBlockUnderConstruction
TestCheckpoint
TestNameEditsConfigs
TestStartup
TestStorageRestore


Disable failing asserts (see HDFS-696). Disabled asserts in HDFS that cause unit tests to fail. These will be re-enabled at a later date when the underlying cause is fixed upstream. In the meantime, these are disabled to keep our CI server returning only new failures. Issue HDFS-696 lists the failing tests and tracks their progress. Reason: Test harness improvement Author: Eli Collins Ref: UNKNOWN commit 74b80b9c9490bba1a1120f3a9376d2f21f3763b6 Author: Aaron Kimball Date: Fri Mar 12 17:53:38 2010 -0800 MAPREDUCE-1093. Java assertion failures triggered by tests Description: Removes failing asserts from the CDH build until they are fixed in trunk. Tracking MAPREDUCE-1506 to include a fix for this assertion failure. Reason: Test harness improvement Author: Aaron Kimball Ref: UNKNOWN commit b4be440cd928976544bcbeb7e10566fc523dbd0c Author: Aaron Kimball Date: Fri Mar 12 17:53:13 2010 -0800 MAPREDUCE-1092. Enable asserts for tests by default Description: See HADOOP-6309. Let's make the tests run with java asserts by default. Reason: Test coverage improvement Author: Eli Collins Ref: UNKNOWN commit 5e7fb9843f99f5e1023f2723210f26ac0c33323b Author: Aaron Kimball Date: Fri Mar 12 17:52:45 2010 -0800 MAPREDUCE-1375. TestFileArgs fails intermittently Description: TestFileArgs failed once for me with the following error
expected:<[job.jar
    sidefile
    tmp
    ]> but was:<[]>
            at org.apache.hadoop.streaming.TestStreaming.checkOutput(TestStreaming.java:107)
            at org.apache.hadoop.streaming.TestStreaming.testCommandLine(TestStreaming.java:123)
This test was flaky due to trying to write some data into /bin/ls. Depending on the speed of the test run, this sometimes resulted in a Broken Pipe on flush() which caused the test to fail. Reason: Bugfix (race condition in test) Author: Todd Lipcon Ref: UNKNOWN commit ae699cda01c093097ae723224553773247577aa2 Author: Aaron Kimball Date: Fri Mar 12 17:52:32 2010 -0800 HDFS-961. dfs_readdir incorrectly parses paths Description: fuse-dfs dfs_readdir assumes that DistributedFileSystem#listStatus returns Paths with the same scheme/authority as the dfs.name.dir used to connect. If NameNode.DEFAULT_PORT port is used listStatus returns Paths that have authorities without the port (see HDFS-960), which breaks the following code.
// hack city: todo fix the below to something nicer and more maintainable but
    // with good performance
    // strip off the path but be careful if the path is solely '/'
    // NOTE - this API started returning filenames as full dfs uris
    const char *const str = info[i].mName + dfs->dfs_uri_len + path_len + ((path_len == 1 && *path == '/') ? 0 : 1);

Let's make the path parsing here more robust. listStatus returns normalized paths so we can find the start of the path by searching for the 3rd slash. A more long term solution is to have hdfsFileInfo maintain a path object or at least pointers to the relevant URI components.
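
The third-slash idea above can be sketched as follows. This is an illustration in Java, not the actual fuse-dfs patch (which is C code); the class and method names are hypothetical, and it assumes the input is a normalized URI of the form scheme://authority/path.

```java
/** Illustrative only: locate the path part of a normalized URI string. */
class ThirdSlash {
    /**
     * Returns the substring starting at the third '/' of the URI, or "/"
     * when the URI contains fewer than three slashes (no path component).
     */
    static String pathOf(String uri) {
        int idx = -1;
        // Walk forward slash by slash: the first two belong to "://",
        // the third one starts the path.
        for (int seen = 0; seen < 3; seen++) {
            idx = uri.indexOf('/', idx + 1);
            if (idx < 0) {
                return "/";
            }
        }
        return uri.substring(idx);
    }

    public static void main(String[] args) {
        // The result does not depend on whether the authority carries a port.
        System.out.println(pathOf("hdfs://namenode:8020/user/foo"));
        System.out.println(pathOf("hdfs://namenode/user/foo"));
    }
}
```

Because the search never looks at the length of the scheme or authority, it gives the same answer whether or not the port appears in the returned paths.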

Reason: bugfix Author: Eli Collins Ref: UNKNOWN commit 7f9f42b27b109eff6fafc6ee24526fcadaf68d69 Author: Aaron Kimball Date: Fri Mar 12 17:52:23 2010 -0800 MAPREDUCE-1467. Add a --verbose flag to Sqoop Description: Need a --verbose flag that sets the log4j level to DEBUG. Reason: Logging improvement Author: Aaron Kimball Ref: UNKNOWN commit db680058f5796fc41d61242d60bc86b1b25facf9 Author: Aaron Kimball Date: Fri Mar 12 17:52:07 2010 -0800 MAPREDUCE-1469. Sqoop should disable speculative execution in export Description: Concurrent writers of the same output shard may cause the database to try to insert duplicate primary keys concurrently. Not a good situation. Speculative execution should be forced off for this operation. Reason: Bugfix (race condition) Author: Aaron Kimball Ref: UNKNOWN commit a5ccc56a79fc53de5ff16c6cb996f41a4216c28d Author: Aaron Kimball Date: Fri Mar 12 17:51:29 2010 -0800 MAPREDUCE-1341. Sqoop should have an option to create hive tables and skip the table import step Description: In case the client only needs to create tables in hive, it would be helpful if Sqoop had an optional parameter:

--hive-create-only

which would omit the time consuming table import step, generate hive create table statements and run them.

Also adds a --hive-overwrite flag, which allows overwriting an existing table definition. Reason: New feature Author: Leonid Furman Ref: UNKNOWN commit bdf576aa69eeb56a954416f7c2fcbe0136f421bd Author: Aaron Kimball Date: Fri Mar 12 17:51:16 2010 -0800 HADOOP-4012. Providing splitting support for bzip2 compressed files Description: Hadoop assumes that if the input data is compressed, it cannot be split (mainly because many codecs need the whole input stream to decompress successfully). In such a case, Hadoop prepares only one split per compressed file, where the lower split limit is 0 and the upper limit is the end of the file. The consequence of this decision is that one compressed file goes to a single mapper. This circumvents the limitation of the codecs (as mentioned above) but substantially reduces the parallelism that splitting would otherwise allow.

BZip2 is a compression/decompression algorithm that compresses data in blocks, and these compressed blocks can later be decompressed independently of each other. This creates an opportunity: instead of one BZip2-compressed file going to one mapper, we can process chunks of the file in parallel. The correctness criterion for such processing is that, for a bzip2-compressed file, each compressed block is processed by only one mapper and ultimately all the blocks of the file are processed. (By processing we mean the actual use of the uncompressed data coming out of the codec in a mapper.)
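
One way to picture the correctness criterion is an ownership rule: a split processes exactly those compressed blocks whose start offset falls inside its byte range. The sketch below illustrates that rule only; it is not the HADOOP-4012 implementation, the class and method names are hypothetical, and it assumes the block start offsets are already known.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: assign compressed blocks to splits by start offset. */
class BlockOwnership {
    /**
     * Returns the start offsets of the blocks owned by the split covering
     * [splitStart, splitEnd). A block that begins inside the split is owned
     * by it even if the block's data runs past splitEnd.
     */
    static List<Long> blocksForSplit(long[] blockStarts, long splitStart, long splitEnd) {
        List<Long> owned = new ArrayList<>();
        for (long start : blockStarts) {
            if (start >= splitStart && start < splitEnd) {
                owned.add(start);
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        long[] blockStarts = {0L, 900L, 1800L, 2700L};
        // Three contiguous splits cover the file; each block lands in exactly one.
        System.out.println(blocksForSplit(blockStarts, 0L, 1000L));
        System.out.println(blocksForSplit(blockStarts, 1000L, 2000L));
        System.out.println(blocksForSplit(blockStarts, 2000L, 3000L));
    }
}
```

As long as the splits are contiguous and non-overlapping, every block start falls in exactly one range, so each block is processed once and none is skipped.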

We are writing the code to implement this suggested functionality. Although we have used bzip2 as an example, we have tried to extend Hadoop's compression interfaces so that any other codec with the same capability as bzip2 could easily use the splitting support. The details of these changes will be posted when we submit the code.

Reason: New feature Author: Abdul Qadeer Ref: UNKNOWN commit 8e47288583fcdbdf649ddf3486bf201788e79202 Author: Aaron Kimball Date: Fri Mar 12 17:50:51 2010 -0800 MAPREDUCE-707. Provide a jobconf property for explicitly assigning a job to a pool Description: A common use case of the fair scheduler is to have one pool per user, but then to define some special pools for various production jobs, import jobs, etc. Therefore, it would be nice if jobs went by default to the pool of the user who submitted them, but there was a setting to explicitly place a job in another pool. Today, this can be achieved through a sort of trick in the JobConf:
<property>
      <name>mapred.fairscheduler.poolnameproperty</name>
      <value>pool.name</value>
    </property>
    
    <property>
      <name>pool.name</name>
      <value>${user.name}</value>
    </property>

This JIRA proposes to add a property called mapred.fairscheduler.pool that allows a job to be placed directly into a pool, avoiding the need for this trick.
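
The lookup order this proposal implies might look like the sketch below. It is not the fair scheduler's code: the class and method names are hypothetical, the job configuration is modeled as a plain map, and the fallbacks to "user.name" and to a pool called "default" are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: pick a pool name for a job from its configuration. */
class PoolResolver {
    static String resolvePool(Map<String, String> jobConf) {
        // 1. An explicit pool assignment wins (the property proposed here).
        String explicit = jobConf.get("mapred.fairscheduler.pool");
        if (explicit != null && !explicit.trim().isEmpty()) {
            return explicit.trim();
        }
        // 2. Otherwise read the property named by poolnameproperty
        //    (assumed to fall back to "user.name" when unset).
        String nameProperty =
            jobConf.getOrDefault("mapred.fairscheduler.poolnameproperty", "user.name");
        // 3. Assumed last resort: a pool called "default".
        return jobConf.getOrDefault(nameProperty, "default");
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("user.name", "alice");
        System.out.println(resolvePool(conf));
        conf.put("mapred.fairscheduler.pool", "production");
        System.out.println(resolvePool(conf));
    }
}
```

With this ordering, jobs still land in the submitting user's pool unless the job explicitly names another one.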

Reason: Configuration improvement Author: Alan Heirich Ref: UNKNOWN commit 96e17e1e593b818a888c8dfc177b8fb36e514e8f Author: Aaron Kimball Date: Fri Mar 12 17:50:18 2010 -0800 MAPREDUCE-967. (version 2) TaskTracker does not need to fully unjar job jars Description: This is a performance improvement for jobs that contain a large number of classes. The unpacking of these jars consumes a large amount of time, as does the resulting cleanup. This patch changes the classpath to simply include the jar itself, and only unpacks the lib/ directory out of the jar in order to add those dependencies to the classpath. Users who previously depended on this functionality for shipping non-code dependencies can use the undocumented configuration parameter "mapreduce.job.jar.unpack.pattern" to cause specific jar contents to be unpacked This new patch version fixes a streaming regression where the "-file" argument no longer worked. It includes a new unit test, TestFileArgs, to protect against this regression. Author: Todd Lipcon Ref: UNKNOWN commit cf08a128b87bbfae90babd61795599b3645d37a3 Author: Aaron Kimball Date: Fri Mar 12 17:48:40 2010 -0800 HDFS-455, MAPREDUCE-1441, HADOOP-6534. Allow spaces in between comma-separated elements in directory list configurations. Description: Make NN and DN handle in a intuitive way comma-separated configuration strings The following configuration causes problems:
<property>
<name>dfs.data.dir</name>
<value>/mnt/hstore2/hdfs, /home/foo/dfs</value>
</property>

The problem is that the space after the comma makes the second storage directory " /home/foo/dfs", that is, a directory named <SPACE> containing a sub-directory named "home", created in the Hadoop datanode's default directory. This will typically cause the user's home partition to fill, and the cause is very hard for the user to diagnose because a directory whose name is a single space is hard to notice.

(ripped from HADOOP-2366)


This fixes any configuration consisting of a comma-separated list of directories (e.g., dfs.data.dir, dfs.name.dir, fs.checkpoint.dir, mapred.local.dir, etc) so that the elements may also contain separating whitespace. Without this patch, setting mapred.local.dir to "/disk1, /disk2" would create a directory by the name " " in the user's home directory, or fail outright. The patch trims the directory names as they are fetched from the configuration. Reason: Configuration improvement Author: Todd Lipcon Ref: UNKNOWN commit 65a04ab8197a8db21a97d279ca881b5cd45a5365 Author: Aaron Kimball Date: Fri Mar 12 17:48:03 2010 -0800 HADOOP-2366. Space in the value for dfs.data.dir can cause great problems Description: The following configuration causes problems:

<property>
<name>dfs.data.dir</name>
<value>/mnt/hstore2/hdfs, /home/foo/dfs</value>
<description>
Determines where on the local filesystem an DFS data node should store its
blocks. If this is a comma-delimited list of directories, then data will be
stored in all named directories, typically on different devices. Directories
that do not exist are ignored.
</description>
</property>

The problem is that the space after the comma makes the second storage directory " /home/foo/dfs", that is, a directory named <SPACE> containing a sub-directory named "home", created in the Hadoop datanode's default directory. This will typically cause the user's home partition to fill, and the cause is very hard for the user to diagnose because a directory whose name is a single space is hard to notice.

My proposed solution would be to trimLeft all path names from this and similar property after splitting on comma. This still allows spaces in file and directory names but avoids this problem.
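
A minimal sketch of the split-then-trim idea follows. It is not the patch itself: the class and method names are hypothetical, and unlike the "trimLeft" wording above it trims whitespace on both sides of each element and drops empty elements.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: parse a comma-separated directory list, ignoring padding. */
class TrimmedDirs {
    static List<String> split(String value) {
        List<String> dirs = new ArrayList<>();
        if (value == null) {
            return dirs;
        }
        for (String part : value.split(",")) {
            // Trim only the ends, so spaces inside a directory name survive.
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                dirs.add(trimmed);
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        System.out.println(split("/mnt/hstore2/hdfs, /home/foo/dfs"));
    }
}
```

Applied to the dfs.data.dir value shown above, the second element becomes "/home/foo/dfs" rather than " /home/foo/dfs".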


This provides support in Configuration to get comma-separated string lists in such a way that whitespace in between elements is ignored. This patch is required for later patches which fix mapred.local.dir, dfs.data.dir, etc. to support spaces in between elements. Test plan: unit tested in TestStringUtils Reason: Configuration improvement Author: Michele (@pirroh) Catasta Ref: UNKNOWN commit 8d4807322a42509726b376b37a89739acd6cbd7d Author: Aaron Kimball Date: Fri Mar 12 17:47:55 2010 -0800 MAPREDUCE-1356. Allow user-specified hive table name in sqoop Description: The table name used in a hive-destination import is currently pegged to the input table name. This should be user-configurable. Reason: New feature Author: Aaron Kimball Ref: UNKNOWN commit 8bf3439ff69762a33967dca4abb15c0cd2bb8417 Author: Aaron Kimball Date: Fri Mar 12 17:47:45 2010 -0800 MAPREDUCE-1395. Sqoop does not check return value of Job.waitForCompletion() Description: Old code depended on JobClient.runJob() throwing IOException on failure. Job.waitForCompletion can fail in that manner, or it can fail by returning false. Sqoop needs to check for this condition. Reason: bugfix Author: Aaron Kimball Ref: UNKNOWN commit bd4e81234dd12fa9534577f0caa0db5c3d0a99fc Author: Aaron Kimball Date: Fri Mar 12 17:47:30 2010 -0800 CLOUDERA-BUILD. Set HADOOP_PID_DIR to something smarter than /tmp Author: Chad Metcalf commit 2466310d0e2a426e848860e9a8411b8ea14e1bb1 Author: Aaron Kimball Date: Fri Mar 12 17:47:07 2010 -0800 HADOOP-6453. Hadoop wrapper script shouldn't ignore an existing JAVA_LIBRARY_PATH Description: Currently the hadoop wrapper script assumes it's the only place that uses JAVA_LIBRARY_PATH and initializes it to an empty string.

JAVA_LIBRARY_PATH=''

This prevents anyone from setting this outside of the hadoop wrapper (say hadoop-config.sh) for their own native libraries.

The fix is pretty simple. Don't initialize it to '' and append the native libs like normal.

Reason: Bugfix (environment) Author: Chad Metcalf Ref: UNKNOWN commit a67b4b1c361c26e002da64953a7f8bc068d29b98 Author: Aaron Kimball Date: Fri Mar 12 17:46:42 2010 -0800 MAPREDUCE-1327. Oracle database import via sqoop fails when a table contains the column types such as TIMESTAMP(6) WITH LOCAL TIME ZONE and TIMESTAMP(6) WITH TIME ZONE Description: When Oracle table contains the columns "TIMESTAMP(6) WITH LOCAL TIME ZONE" and "TIMESTAMP(6) WITH TIME ZONE", Sqoop fails to map values for those columns to valid Java data types, resulting in the following exception:

ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.hadoop.sqoop.orm.ClassWriter.generateFields(ClassWriter.java:253)
at org.apache.hadoop.sqoop.orm.ClassWriter.generateClassForColumns(ClassWriter.java:701)
at org.apache.hadoop.sqoop.orm.ClassWriter.generate(ClassWriter.java:597)
at org.apache.hadoop.sqoop.Sqoop.generateORM(Sqoop.java:75)
at org.apache.hadoop.sqoop.Sqoop.importTable(Sqoop.java:87)
at org.apache.hadoop.sqoop.Sqoop.run(Sqoop.java:175)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.sqoop.Sqoop.main(Sqoop.java:201)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

Reason: Compatibility improvement Author: Leonid Furman Ref: UNKNOWN commit a937ba2b9b6132883d727f856911ae31d22ad619 Author: Aaron Kimball Date: Fri Mar 12 17:46:26 2010 -0800 MAPREDUCE-1394. Sqoop generates incorrect URIs in paths sent to Hive Description: Hive used to require a ':8020' in HDFS URIs used with LOAD DATA statements, even though the normalized form of such a URI does not contain an explicit port number (since 8020 is the default port). Sqoop matched this by hacking the URI strings it forwarded to Hive.

Hive fixed this bug a while ago – Sqoop should catch up.

Reason: bugfix (compatibility) Author: Aaron Kimball Ref: UNKNOWN commit c5c9b8bf0bf83637589a809b3c376cf74a2fb464 Author: Aaron Kimball Date: Fri Mar 12 17:45:54 2010 -0800 MAPREDUCE-1313. NPE in FieldFormatter if escape character is set and field is null Description: Performing an import with the --escaped-by character set on a table with a null field will cause a NullPointerException in FieldFormatter. Reason: bugfix Author: Aaron Kimball Ref: UNKNOWN commit 1c6dd471832946929928801dd9c9e4b79259ad9d Author: Aaron Kimball Date: Fri Mar 12 17:45:38 2010 -0800 HADOOP-6460. Namenode runs out of memory due to memory leak in ipc Server Description: Namenode heap usage grows disproportionately to the number of objects it supports (files, directories and blocks). Based on heap dump analysis, this is due to large growth in ByteArrayOutputStream allocated in o.a.h.ipc.Server.Handler.run(). Reason: Bugfix (Scalability) Author: Suresh Srinivas Ref: UNKNOWN commit d190a8067827ce09cdcb7741d588cce0e0e7aa02 Author: Aaron Kimball Date: Fri Mar 12 17:45:23 2010 -0800 HADOOP-5687. Hadoop NameNode throws NPE if fs.default.name is the default value Description: Throwing NPE is confusing; an exception with a useful string description could be thrown instead. Reason: Logging improvement Author: Philip Zeyliger Ref: UNKNOWN commit 7604c6f69076effbb0c9793e114946d679f5912d Author: Aaron Kimball Date: Fri Mar 12 17:45:02 2010 -0800 HADOOP-6505. sed in build.xml fails Description: I'm not sure whether this is a Solaris thing or an ant 1.7.1 thing, but it definitely doesn't do what it is supposed to. Instead of getting SunOS-x86-32 (or whatever) I get -x86-32.

This patch replaces the sed call with tr.

Reason: OS compatibility improvement Author: Allen Wittenauer Ref: UNKNOWN commit ca662cbba6044be216b586e7359d9fc2f1dd4e4f Author: Aaron Kimball Date: Fri Mar 12 17:44:00 2010 -0800 HDFS-908. (version 2) TestDistributedFileSystem fails with Wrong FS on weird hosts Description: On the same host where I experienced HDFS-874, I also experience this failure for TestDistributedFileSystem:

Testcase: testFileChecksum took 0.492 sec
Caused an ERROR
Wrong FS: hftp://localhost.localdomain:59782/filechecksum/foo0, expected: hftp://127.0.0.1:59782
java.lang.IllegalArgumentException: Wrong FS: hftp://localhost.localdomain:59782/filechecksum/foo0, expected: hftp://127.0.0.1:59782
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:222)
at org.apache.hadoop.hdfs.HftpFileSystem.getFileChecksum(HftpFileSystem.java:318)
at org.apache.hadoop.hdfs.TestDistributedFileSystem.testFileChecksum(TestDistributedFileSystem.java:166)

Doesn't appear to occur on trunk or branch-0.21.

This is version two of this patch. The previous patch fixed some systems but broke others. Reason: Bugfix Author: Todd Lipcon Ref: UNKNOWN commit 7fafe032223921ad194c69b16ab451b4aade87fa Author: Aaron Kimball Date: Fri Mar 12 17:43:41 2010 -0800 HADOOP-4368. Superuser privileges required to do "df" Description: Superuser privileges are required in DFS in order to get the file system statistics (FSNamesystem.java, getStats method). This means that when HDFS is mounted via fuse-dfs as a non-root user, "df" is going to return 16 exabytes total and 0 free instead of the correct amount.

As far as I can tell, there's no need to require super user privileges to see the file system size (and historically in Unix, this is not required).

To fix this, simply comment out the privilege check in the getStats method.

Reason: Usability improvement Author: Craig Macdonald Ref: UNKNOWN commit 6129c87f5dd1fdb7375c80285534b8b91fbcd392 Author: Aaron Kimball Date: Fri Mar 12 17:43:25 2010 -0800 HDFS-412. Hadoop JMX usage makes Nagios monitoring impossible Description: When Hadoop reports Datanode information to JMX, the bean uses the name "DataNode-" + storageid. The storage ID incorporates a random number and is unpredictable.

This prevents me from monitoring DFS datanodes through Hadoop using the JMX interface; in order to do that, you must be able to specify the bean name on the command line.

The fix is simple, patch will be coming momentarily. However, there was probably a reason for making the datanodes all unique names which I'm unaware of, so it'd be nice to hear from the metrics maintainer.

Reason: Monitoring improvement Author: Brian Bockelman Ref: UNKNOWN commit 5dfcc6d2d7806636c6237996e1b28a00ba075b4b Author: Aaron Kimball Date: Fri Mar 12 17:43:05 2010 -0800 HADOOP-6503. contrib projects should pull in the ivy-fetched libs from the root project Description: On branch-20 currently, I get an error just running "ant contrib -Dtestcase=TestHdfsProxy". In a full "ant test" build sometimes this doesn't appear to be an issue. The problem is that the contrib projects don't automatically pull in the dependencies of the "Hadoop" ivy project. Thus, they each have to declare all of the common dependencies like commons-cli, etc. Some are missing and this causes test failures. Reason: Build system improvement Author: Todd Lipcon Ref: UNKNOWN commit be70b10f11445f4a71807405718bfeebd38ad924 Author: Aaron Kimball Date: Fri Mar 12 17:42:51 2010 -0800 MAPREDUCE-1155. Streaming tests swallow exceptions Description: Many of the streaming tests (including TestMultipleArchiveFiles) catch exceptions and print their stack trace rather than failing the job. This means that tests do not fail even when the job fails. Reason: Test coverage improvement Author: Todd Lipcon Ref: UNKNOWN commit f84830ae5e6c862cd0e2b8ebea57880e54c8a082 Author: Aaron Kimball Date: Fri Mar 12 17:42:33 2010 -0800 HADOOP-5647. TestJobHistory fails if /tmp/_logs is not writable to. Testcase should not depend on /tmp Description: TestJobHistory sets /tmp as hadoop.job.history.user.location to check if the history file is created in that directory or not. If /tmp/_logs is already created by some other user, this test will fail because of not having write permission. Reason: Bugfix in test harness Author: Ravi Gummadi Ref: UNKNOWN commit 669b65f14d78ffd1cf0304cf459d1abbae3412ae Author: Aaron Kimball Date: Fri Mar 12 17:42:15 2010 -0800 CLOUDERA-BUILD. Fix javadoc warnings shown by test-patch, and update eclipse classpath to match current CDH. 
Author: Todd Lipcon commit 51804fd45d3a527a130a373c591a17c185102a0c Author: Aaron Kimball Date: Fri Mar 12 17:41:40 2010 -0800 Revert "HDFS-127: DFSClient block read failures cause open DFSInputStream to become unusable" Description: This is being reverted as it causes infinite retries when there are no valid replicas. Reason: bugfix Author: Todd Lipcon Ref: UNKNOWN commit 623bfc0c18087274315dfbd41d025a8a775abe80 Author: Aaron Kimball Date: Fri Mar 12 17:40:30 2010 -0800 HDFS-877. Client-driven block verification not functioning Description: This is actually the reason for HDFS-734 (TestDatanodeBlockScanner timing out). The issue is that DFSInputStream relies on readChunk being called one last time at the end of the file in order to receive the lastPacketInBlock=true packet from the DN. However, DFSInputStream.read checks pos < getFileLength() before issuing the read. Thus gotEOS never shifts to true and checksumOk() is never called. This is a simpler patch than the one on 0.21/0.22 since those fix a further regression since 0.20. Reason: bugfix Author: Todd Lipcon Ref: UNKNOWN commit b332fe77255047409da701dfb97df1bddb5b10cb Author: Aaron Kimball Date: Fri Mar 12 17:40:05 2010 -0800 CLOUDERA-BUILD. Add mockito to 0.20 branch for easier unit testing of HDFS stability patches. Reason: Test coverage improvement Author: Todd Lipcon commit 44a6c559de056b35c6eb2e2d53798c88d8c779e6 Author: Aaron Kimball Date: Fri Mar 12 17:39:09 2010 -0800 HDFS-630. In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block. Description: created from hdfs-200.

If, during a write, the DFS client sees that a replica location for a newly allocated block is not connectable, it asks the NN again for a fresh set of replica locations for the block. It tries this dfs.client.block.write.retries times (default 3), sleeping 6 seconds between retries (see DFSClient.nextBlockOutputStream).

This setting works well on a reasonably sized cluster; if you have only a few datanodes, every retry may pick the dead datanode, and the above logic bails out.

Our solution: when requesting block locations from the namenode, the client passes the NN the datanodes to exclude. The list of dead datanodes applies to a single block allocation only.
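
The exclusion idea described above can be sketched roughly as follows. This is an illustration only, not the DFSClient or NameNode code: the class ExcludedNodesSketch, the methods allocate and pickWritableNode, and the "dead" set that simulates unreachable datanodes are all hypothetical, the 6-second sleep is omitted, and the loop assumes one initial attempt plus maxRetries retries.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of per-block datanode exclusion; this is not the DFSClient code.
class ExcludedNodesSketch {

    // Stand-in for the namenode: returns the first datanode that is not excluded, or null.
    static String allocate(List<String> datanodes, Set<String> excluded) {
        for (String dn : datanodes) {
            if (!excluded.contains(dn)) {
                return dn;
            }
        }
        return null;
    }

    // Makes one initial attempt plus up to maxRetries retries, excluding each node that fails.
    // The "dead" set simulates datanodes the client cannot connect to.
    static String pickWritableNode(List<String> datanodes, Set<String> dead, int maxRetries) {
        Set<String> excluded = new HashSet<>(); // scoped to this single block allocation
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            String candidate = allocate(datanodes, excluded);
            if (candidate == null) {
                return null; // every datanode has been excluded
            }
            if (!dead.contains(candidate)) {
                return candidate; // the connection would succeed
            }
            excluded.add(candidate); // never offered again for this block
        }
        return null; // retries exhausted
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("dn1", "dn2", "dn3");
        Set<String> dead = new HashSet<>(Arrays.asList("dn1", "dn2"));
        System.out.println(pickWritableNode(nodes, dead, 3));
    }
}
```

Because the excluded set is created inside pickWritableNode, a node that failed for one block is eligible again for the next one, which matches the "only for one block allocation" remark above.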

Reason: bugfix (Fault tolerance improvement) Author: Cosmin Lehene (modified by Cloudera to not break compatibility) Ref: UNKNOWN commit 47c404e0cf10ceb31336d2a77d53e0a971348102 Author: Aaron Kimball Date: Fri Mar 12 17:37:37 2010 -0800 HDFS-908. TestDistributedFileSystem fails with Wrong FS on weird hosts Description: On the same host where I experienced HDFS-874, I also experience this failure for TestDistributedFileSystem:

Testcase: testFileChecksum took 0.492 sec
Caused an ERROR
Wrong FS: hftp://localhost.localdomain:59782/filechecksum/foo0, expected: hftp://127.0.0.1:59782
java.lang.IllegalArgumentException: Wrong FS: hftp://localhost.localdomain:59782/filechecksum/foo0, expected: hftp://127.0.0.1:59782
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:222)
at org.apache.hadoop.hdfs.HftpFileSystem.getFileChecksum(HftpFileSystem.java:318)
at org.apache.hadoop.hdfs.TestDistributedFileSystem.testFileChecksum(TestDistributedFileSystem.java:166)

Doesn't appear to occur on trunk or branch-0.21.

Reason: bugfix Author: Todd Lipcon Ref: UNKNOWN commit 7c2a791f0a397d924a623e45bf823c238374c42c Author: Aaron Kimball Date: Fri Mar 12 17:37:19 2010 -0800 MAPREDUCE-1258. Fair scheduler event log not logging job info Description: The MAPREDUCE-706 patch seems to have left an unfinished TODO in the Fair Scheduler - namely, in the dump() function for periodically dumping scheduler state to the event log, the part that dumps information about jobs is commented out. This makes the event log less useful than it was before.

It should be fairly easy to update this part to use the new scheduler data structures (Schedulable etc) and print the data.

Reason: Logging improvement Author: Matei Zaharia Ref: UNKNOWN commit 353f7813bf7dfb0bca1362f9370f6a080256a345 Author: Aaron Kimball Date: Fri Mar 12 17:36:58 2010 -0800 MAPREDUCE-1198. Alternatively schedule different types of tasks in fair share scheduler Description: Matei mentioned in MAPREDUCE-961 that the current scheduler first tries to launch map tasks until canLaunchTask() returns false, and only then looks for reduce tasks. This might starve reduce tasks. He also mentioned that alternating between the different task types when scheduling can solve this problem. Reason: bugfix Author: Scott Chen Ref: UNKNOWN commit ef449fb7832055951e2364cf12a73717b2add3ce Author: Aaron Kimball Date: Fri Mar 12 17:36:50 2010 -0800 MAPREDUCE-698. Per-pool task limits for the fair scheduler Description: The fair scheduler could use a way to cap the share of a given pool similar to MAPREDUCE-532. Reason: New feature Author: Kevin Peterson Ref: UNKNOWN commit a1e25ec70e677db322b2cce43c6381f865eb3f79 Author: Aaron Kimball Date: Fri Mar 12 17:36:42 2010 -0800 HDFS-464. Memory leaks in libhdfs Description: hdfsExists does not call destroyLocalReference for jPath anytime,
hdfsDelete does not call it when it fails, and
hdfsRename does not call it for jOldPath and jNewPath when it fails Reason: bugfix Author: Christian Kunz Ref: UNKNOWN commit d93dad715d3c702d15c2a32c85d586c708e70857 Author: Aaron Kimball Date: Fri Mar 12 17:36:23 2010 -0800 CLOUDERA-BUILD. Add test ivy configurations to additional projects. Author: Aaron Kimball Reason: Build system improvement commit 5d0c8f82b87e7cbb541ace9e4f22abfad2799e56 Author: Aaron Kimball Date: Fri Mar 12 17:35:08 2010 -0800 CLOUDERA-BUILD. Sqoop bin script now includes jars from contrib/sqoop/lib/ on classpath. Author: Aaron Kimball commit 7e009a29c0806537cd50972df90ec87b617eb78f Author: Aaron Kimball Date: Fri Mar 12 17:34:54 2010 -0800 MAPREDUCE-1212. Mapreduce contrib project ivy dependencies are not included in binary target Description: As in HADOOP-6370, only Hadoop's own library dependencies are promoted to ${build.dir}/lib; any libraries required by contribs are not redistributed. Reason: Build system (packaging) improvement Author: Aaron Kimball Ref: UNKNOWN commit 8d289f97d6b66cd435f755a4acae9f138de934d6 Author: Aaron Kimball Date: Fri Mar 12 17:34:43 2010 -0800 CLOUDERA-BUILD. Update cloud script version to cdh-0.20.1 Author: Tom White commit ac7eacd44af059d7a859b8d6773a82cd84ba4c9b Author: Aaron Kimball Date: Fri Mar 12 17:34:35 2010 -0800 HADOOP-6466. Add a ZooKeeper service to the cloud scripts Description: It would be good to add other Hadoop services to the cloud scripts. Reason: New feature Author: Tom White Ref: UNKNOWN commit 06ceb079693292a41085af795c5b2bbc3fd10af2 Author: Aaron Kimball Date: Fri Mar 12 17:34:24 2010 -0800 HADOOP-6454. Create setup.py for EC2 cloud scripts Description: This would make it easier to install the scripts. Reason: Installation improvement Author: Tom White Ref: UNKNOWN commit 23c45791bbc3a23d69c77f3518b5d1a1a4702ccc Author: Aaron Kimball Date: Fri Mar 12 17:34:11 2010 -0800 HADOOP-6462. 
contrib/cloud failing, target "compile" does not exist Description: I'm not seeing this mentioned in Hudson or other bug reports, which confuses me. With the addition of a src/contrib/cloud/build.xml from HADOOP-6426, contrib/build.xml no longer builds:
hadoop-common/src/contrib/build.xml:30: The following error occurred while executing this line:
Target "compile" does not exist in the project "hadoop-cloud".

What is odd is this: the final patch of HADOOP-6426 does include the stub <target> files needed, yet they aren't in SVN_HEAD. Which implies that a different version may have gone in than intended.

Reason: Build system bugfix Author: Tom White Ref: UNKNOWN commit 083a6a1cfb2a5198243aa82a020681ad62da5938 Author: Aaron Kimball Date: Fri Mar 12 17:33:58 2010 -0800 HADOOP-6444. Support additional security group option in hadoop-ec2 script Description: When deploying a hadoop cluster on ec2 alongside other services it is very useful to be able to specify additional (pre-existing) security groups to facilitate access control. For example one could use this feature to add a cluster to a generic "hadoop" group, which authorizes hdfs access from instances outside the cluster. Without such an option the access control for the security groups created by the script need to manually updated after cluster launch. Reason: Security improvement Author: Paul Egan Ref: UNKNOWN commit 63152ce4ba3c0cf2006016cc825fc72b0bd23d2d Author: Aaron Kimball Date: Fri Mar 12 17:33:49 2010 -0800 HADOOP-6426. Create ant build for running EC2 unit tests Description: There is no easy way currently to run the Python unit tests for the cloud contrib. Reason: Test coverage improvement Author: Tom White Ref: UNKNOWN commit a20069b2adfafa59e0001fe5e5685d36d9eb7fee Author: Aaron Kimball Date: Fri Mar 12 17:33:15 2010 -0800 HADOOP-6392. Run namenode and jobtracker on separate EC2 instances Description: Replace concept of "master" with that of "namenode" and "jobtracker". Still need to be able to run both on one node, of course. Reason: Scalability improvement Author: Tom White Ref: UNKNOWN commit 361221a2a082d0ab7a87ba0226dbe05938440738 Author: Aaron Kimball Date: Fri Mar 12 17:33:07 2010 -0800 HADOOP-6108. Add support for EBS storage on EC2 Description: By using EBS for namenode and datanode storage we can have persistent, restartable Hadoop clusters running on EC2. Reason: New feature Author: Tom White Ref: UNKNOWN commit 4ca1c78e1b257eefa10b5ed94479df8a6473d3e9 Author: Aaron Kimball Date: Fri Mar 12 17:32:50 2010 -0800 HDFS-861. 
fuse-dfs does not support O_RDWR Description: Some applications (for us, the big one is rsync) will open a file in read-write mode when it really only intends to read xor write (not both). fuse-dfs should try to not fail until the application actually tries to write to a pre-existing file or read from a newly created file. Reason: bugfix Author: Brian Bockelman Ref: UNKNOWN commit 00f6976093cc20ea825a35f6831f645dc5f61637 Author: Aaron Kimball Date: Fri Mar 12 17:32:17 2010 -0800 HDFS-860. fuse-dfs truncate behavior causes issues with scp Description: For whatever reason, scp issues a "truncate" once it's written a file to truncate the file to the # of bytes it has written (i.e., if a file is X bytes, it calls truncate(X)).

This fails on the current fuse-dfs.

Reason: bugfix (tool compatibility) Author: Brian Bockelman Ref: UNKNOWN commit 46d2b6d6b27887375c44d691d776f70e89e4b81b Author: Aaron Kimball Date: Fri Mar 12 17:31:58 2010 -0800 HDFS-859. fuse-dfs utime behavior causes issues with tar Description: When trying to untar files onto fuse-dfs, tar will try to set the utime on all the files and directories. However, setting the utime on a directory in libhdfs causes an error.

We should silently ignore the failure of setting a utime on a directory; this will allow tar to complete successfully.

Reason: bugfix (tool compatibility) Author: Brian Bockelman Ref: UNKNOWN commit 9a38b9c423aca358307aa6455977432f34aef990 Author: Aaron Kimball Date: Fri Mar 12 17:31:45 2010 -0800 HDFS-858. Incorrect return codes for fuse-dfs Description: fuse-dfs doesn't pass proper error codes from libhdfs; places I'd like to correct are hdfsFileOpen (which can result in permission denied or quota violations) and hdfsWrite (which can result in quota violations).

By returning the correct error codes, command line utilities return much better error messages - especially for quota violations, which can be a devil to debug.

Reason: bugfix Author: Brian Bockelman Ref: UNKNOWN commit 84afb26bb0e42eda1e26b07e3aac016695f5ad87 Author: Aaron Kimball Date: Fri Mar 12 17:31:37 2010 -0800 HDFS-857. Incorrect type for fuse-dfs capacity can cause "df" to return negative values on 32-bit machines Description: On sufficiently large HDFS installs, the casting of hdfsGetCapacity to a long may cause "df" to return negative values. tOffset should be used instead. Reason: bugfix Author: Brian Bockelman Ref: UNKNOWN commit a4cf3e8e86cbd42bef25eb3aab7e464ac86e3068 Author: Aaron Kimball Date: Fri Mar 12 17:31:19 2010 -0800 HDFS-856. Hardcoded replication level for new files in fuse-dfs Description: In fuse-dfs, the number of replicas is always hardcoded to 3 in the arguments to hdfsOpenFile. We should use the setting in the hadoop configuration instead. Reason: Configuration improvement Author: Brian Bockelman Ref: UNKNOWN commit e9f3ec90e57b383faf49e6a6eb8cc91e5182d31e Author: Aaron Kimball Date: Fri Mar 12 17:31:08 2010 -0800 HADOOP-5625. Add I/O duration time in client trace Description: Add I/O duration information into client trace log for analyzing performance. Reason: Logging improvement Author: Lei Xu Ref: UNKNOWN commit 42eeb4540850278563e76841f0c6b369933d5b70 Author: Aaron Kimball Date: Fri Mar 12 17:30:43 2010 -0800 HADOOP-5222. Add offset in client trace Description: By adding offset in client trace, the client trace information can provide more accurately information about I/O.
This is useful for performance analysis.

Since there is no random write now, the offset of writing is always zero.

Reason: Logging improvement Author: Lei Xu Ref: UNKNOWN commit 5880960fb32ae0fc2c16bac1f333dbb237c3448f Author: Aaron Kimball Date: Fri Mar 12 17:30:27 2010 -0800 CLOUDERA-BUILD. Solaris do-release-build fix Author: Eli Collins Ref: CDH-531 commit 35f87aef6d7cd4030644a1d454da2e0a6e2969c0 Author: Aaron Kimball Date: Fri Mar 12 17:30:18 2010 -0800 MAPREDUCE-1310. CREATE TABLE statements for Hive do not correctly specify delimiters Description: Imports to HDFS via Sqoop that also inject metadata into Hive do not correctly specify delimiters; using Hive to access the data results in rows being parsed as NULL characters. See http://getsatisfaction.com/cloudera/topics/sqoop_hive_import_giving_null_query_values for an example bug report Reason: Bugfix Author: Aaron Kimball Ref: UNKNOWN commit 60784d712cdd5781ceff262bb67e2d484fde428b Author: Aaron Kimball Date: Fri Mar 12 17:29:56 2010 -0800 MAPREDUCE-1235. java.io.IOException: Cannot convert value '0000-00-00 00:00:00' from column 6 to TIMESTAMP. Description: java.io.IOException is thrown when trying to import a table to HDFS using Sqoop. The table has a "0" value in a field of type datetime.
Full Exception: java.io.IOException: Cannot convert value '0000-00-00 00:00:00' from column 6 to TIMESTAMP.
Original question: http://getsatisfaction.com/cloudera/topics/cant_import_table?utm_content=reply_link&utm_medium=email&utm_source=reply_notification Reason: Bugfix (compatibility) Author: Aaron Kimball Ref: UNKNOWN commit 23c116b6ab5615bdb846e22b61a41e92ca287bdf Author: Aaron Kimball Date: Fri Mar 12 17:29:47 2010 -0800 MAPREDUCE-1174. Sqoop improperly handles table/column names which are reserved sql words Description: In some databases it is legal to name tables and columns with terms that overlap SQL reserved keywords (e.g., CREATE, table, etc.). In such cases, the database allows you to escape the table and column names. We should always escape table and column names when possible. Reason: Bugfix Author: Aaron Kimball Ref: UNKNOWN commit d4b3b7592c94aa1f4608245829b5de202ed1b148 Author: Aaron Kimball Date: Fri Mar 12 17:29:39 2010 -0800 MAPREDUCE-1168. Export data to databases via Sqoop Description: Sqoop can import from a database into HDFS. It's high time it works in reverse too. Reason: New feature Author: Aaron Kimball Ref: UNKNOWN commit b29023803d1136bf7d4de45853a2d4481fb36d3c Author: Aaron Kimball Date: Fri Mar 12 17:29:24 2010 -0800 MAPREDUCE-1169. Improvements to mysqldump use in Sqoop Description: Improve Sqoop's integration with mysqldump Reason: Feature/performance improvements Author: Aaron Kimball Ref: UNKNOWN commit c6b956630e327ddabf674f8e06de02408e603155 Author: Aaron Kimball Date: Wed Jan 6 16:05:05 2010 -0800 MAPREDUCE-1169. Improvements to mysqldump use in Sqoop commit 26ba4fd749755a3df79eaa27792662e5b7e3da80 Author: Aaron Kimball Date: Fri Mar 12 17:29:15 2010 -0800 MAPREDUCE-1036. An API Specification for Sqoop Description: Over the last several months, Sqoop has evolved to a state that is functional and has room for extensions. Developing extensions requires a stable API and documentation. I am attaching to this ticket a description of Sqoop's design and internal APIs, which include some open questions. 
I would like to solicit input on the design regarding these open questions and standardize the API. Reason: Documentation Author: Aaron Kimball Ref: UNKNOWN commit e8c47124bb2ada5de0cfdf49150dd7296a41df71 Author: Aaron Kimball Date: Fri Mar 12 17:29:04 2010 -0800 MAPREDUCE-1069. Implement Sqoop API refactoring Description: Implement refactoring decisions outlined in MAPREDUCE-1036 Reason: API compatibility Author: Aaron Kimball Ref: UNKNOWN commit b73cab8083c1594c0328a565eef05951a17f998a Author: Aaron Kimball Date: Fri Mar 12 17:28:46 2010 -0800 MAPREDUCE-1146. Sqoop dependencies break Eclipse build on Linux Description: Under Linux there's the error in the Eclipse "Problems" view:
- "com.sun.tools cannot be resolved" at line 166 of  org.apache.hadoop.sqoop.orm.CompilationManager
    

The problem does not appear on Mac OS, though.

Reason: bugfix Author: Aaron Kimball Ref: UNKNOWN commit 0629ac30abb5e58fb80be56a385867ac7360de22 Author: Aaron Kimball Date: Fri Mar 12 17:28:37 2010 -0800 MAPREDUCE-1148. SQL identifiers are a superset of Java identifiers Description: SQL identifiers can contain arbitrary characters, can start with numbers, can be words like class which are reserved in Java, etc. If Sqoop uses these names literally for class and field names then compilation errors can occur in auto-generated classes. SQL identifiers need to be cleansed to map onto Java identifiers. Reason: bugfix Author: Aaron Kimball Ref: UNKNOWN commit dec4c616921b547e5a332a254254d77efc3a7d5e Author: Aaron Kimball Date: Fri Mar 12 17:28:25 2010 -0800 MAPREDUCE-1224. Calling "SELECT t.* from AS t" to get meta information is too expensive for big tables Description: The SqlManager uses the query, "SELECT t.* from <table> AS t" to get table spec is too expensive for big tables, and it was called twice to generate column names and types. For tables that are big enough to be map-reduced, this is too expensive to make sqoop useful. Reason: Performance improvement Author: Spencer Ho Ref: UNKNOWN commit 1198ef1375387ba107d46f0ab5e9a7c6a7645931 Author: Aaron Kimball Date: Fri Mar 12 17:28:15 2010 -0800 MAPREDUCE-706. Support for FIFO pools in the fair scheduler Description: The fair scheduler should support making the internal scheduling algorithm for some pools be FIFO instead of fair sharing in order to work better for batch workloads. FIFO pools will behave exactly like the current default scheduler, sorting jobs by priority and then submission time. Pools will have their scheduling algorithm set through the pools config file, and it will be changeable at runtime.

To support this feature, I'm also changing the internal logic of the fair scheduler to no longer use deficits. Instead, for fair sharing, we will assign tasks to the job farthest below its share as a ratio of its share. This is easier to combine with other scheduling algorithms and leads to a more stable sharing situation, avoiding unfairness issues brought up in MAPREDUCE-543 and MAPREDUCE-544 that happen when some jobs have long tasks. The new preemption (MAPREDUCE-551) will ensure that critical jobs can gain their fair share within a bounded amount of time.
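
A rough sketch of the selection rule described above ("the job farthest below its share as a ratio of its share") might look like the following. It is not the FairScheduler implementation: the class FairShareRatioSketch, the Job fields, and farthestBelowShare are hypothetical names, and a strictly positive fair share is assumed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of ratio-based fair sharing; this is not the FairScheduler code.
class FairShareRatioSketch {

    static final class Job {
        final String name;
        final int runningTasks;
        final double fairShare; // assumed to be strictly positive

        Job(String name, int runningTasks, double fairShare) {
            this.name = name;
            this.runningTasks = runningTasks;
            this.fairShare = fairShare;
        }

        // Fraction of its fair share the job currently holds; lower means more starved.
        double shareRatio() {
            return runningTasks / fairShare;
        }
    }

    // Returns the job with the lowest running/share ratio, or null for an empty list.
    static Job farthestBelowShare(List<Job> jobs) {
        Job best = null;
        for (Job job : jobs) {
            if (best == null || job.shareRatio() < best.shareRatio()) {
                best = job;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Job big = new Job("big", 6, 10.0);    // holds 60% of its share
        Job small = new Job("small", 0, 2.0); // holds 0% of its share
        System.out.println(farthestBelowShare(Arrays.asList(big, small)).name);
    }
}
```

In this example "big" is four tasks short of its share and "small" only two, yet the ratio rule picks "small" because it holds none of its share; comparing ratios rather than absolute gaps is the point of the sketch.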

Reason: New feature Author: Matei Zaharia Ref: UNKNOWN commit 5699f5483e2a9ee9debd0f0154c6506ee5dc87e2 Author: Aaron Kimball Date: Fri Mar 12 17:28:03 2010 -0800 MAPREDUCE-1285. DistCp cannot handle -delete if destination is local filesystem Description: The following exception is thrown:
Copy failed: java.io.IOException: wrong value class: org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus is not class org.apache.hadoop.fs.FileStatus
    	at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:988)
    	at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:977)
    	at org.apache.hadoop.tools.DistCp.deleteNonexisting(DistCp.java:1226)
    	at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1134)
    	at org.apache.hadoop.tools.DistCp.copy(DistCp.java:650)
    	at org.apache.hadoop.tools.DistCp.run(DistCp.java:857)
    	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
Reason: bugfix Author: Peter Romianowski Ref: UNKNOWN commit 34bb813a5884aeb05909c2ce2cc541882ca3eda1 Author: Aaron Kimball Date: Fri Mar 12 17:27:53 2010 -0800 MAPREDUCE-764. TypedBytesInput's readRaw() does not preserve custom type codes Description: The typed bytes format supports byte sequences of the form <custom type code> <length> <bytes>. When reading such a sequence via TypedBytesInput's readRaw() method, however, the returned sequence currently is 0 <length> <bytes> (0 is the type code for a bytes array), which leads to bugs such as the one described here. Reason: bugfix Author: Klaas Bosteels Ref: UNKNOWN commit 7fd2cb371354219abd108fda35087f08dc481b35 Author: Aaron Kimball Date: Fri Mar 12 17:27:31 2010 -0800 HADOOP-6400. Log errors getting Unix UGI Description: For various reasons, the calls out to `whoami` and `id` can fail when trying to get the unix UGI information. Currently it silently ignores failures and uses the default DrWho/Tardis ugi. This is extremely confusing for users - we should log the exception at warn level when the shell execs fail. Reason: Debug logging improvement Author: Todd Lipcon Ref: UNKNOWN commit d6dc22fecc058e12695a481fa354078d9b012089 Author: Aaron Kimball Date: Fri Mar 12 17:27:21 2010 -0800 MAPREDUCE-1293. AutoInputFormat doesn't work with non-default FileSystems Description: AutoInputFormat uses the wrong FileSystem.get() method when getting a reference to a FileSystem object. AutoInputFormat gets the default FileSystem, so this method breaks if the InputSplit's path is pointing to a different FileSystem. Reason: bugfix Author: Andrew Hitchcock Ref: UNKNOWN commit 25a4ea86b0b085e3afd6f2f040201594155b3de1 Author: Aaron Kimball Date: Fri Mar 12 17:27:09 2010 -0800 MAPREDUCE-1131. Using profilers other than hprof can cause JobClient to report job failure Description: If task profiling is enabled, the JobClient will download the profile.out file created by the tasks under profile. 
If this causes an IOException, the job is reported as a failure to the client, even though all the tasks themselves may complete successfully. The expected result files are assumed to be generated by hprof. Using the profiling system with other profilers will cause job failure. Reason: compatibility bugfix Author: Aaron Kimball Ref: UNKNOWN commit ab98123c7114752945452af0b96c8de04af9ba93 Author: Aaron Kimball Date: Fri Mar 12 17:26:02 2010 -0800 MAPREDUCE-370. Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api. Description: Ports the MultipleOutputs OutputFormat to the new context-based API. Reason: API compatibility improvement. Author: Amareshwari Sriramadasu Ref: UNKNOWN commit 50726d13750f3f71d2fc5d3a012ce81aa2adb26d Author: Aaron Kimball Date: Fri Mar 12 17:24:46 2010 -0800 CLOUDERA-BUILD. Backport MapReduceTestUtil to Hadoop 0.20 Description: MapReduceTestUtil is required for unit tests in subsequent patches, but this class itself was not created in one clean JIRA. Therefore it was backported "As-is" from the trunk and not in a patch-wise fashion. This class is only used in the JUnit tests for Hadoop. Author: Aaron Kimball Reason: Testing improvement Ref: UNKNOWN commit d713dc1063afc4967381b6583ec424d2850bac63 Author: Aaron Kimball Date: Fri Mar 12 17:24:30 2010 -0800 MAPREDUCE-1059. distcp can generate uneven map task assignments Description: distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work. Reason: Improvement for load balancing Author: Aaron Kimball Ref: UNKNOWN commit 855b0bf3718f2c397ef79967475468e4153f120a Author: Aaron Kimball Date: Fri Mar 12 17:24:20 2010 -0800 MAPREDUCE-1128. MRUnit Allows Iteration Twice Description: MRUnit allows one to iterate over a collection of values twice (ie.

    reduce(Key key, Iterable<Value> values, Context context) {
      for (Value v : values) { /* iterate once */ }
      for (Value v : values) { /* iterate again */ }
    }

Hadoop allows this as well; however, the second iterator will be empty. MRUnit should either match Hadoop's behavior or warn users that their code is likely flawed.
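
The single-pass behavior described above can be imitated with a small wrapper. This is a sketch under assumptions, not how Hadoop or MRUnit implement their value iterables: SinglePassIterable and count are hypothetical names, and the wrapper simply hands out the same underlying iterator on every call.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical single-pass Iterable: every call to iterator() returns the same
// underlying iterator, so a second for-each loop sees no elements.
class SinglePassIterable<T> implements Iterable<T> {
    private final Iterator<T> source;

    SinglePassIterable(Iterable<T> values) {
        this.source = values.iterator();
    }

    @Override
    public Iterator<T> iterator() {
        return source; // shared and already advanced by earlier loops
    }

    // Counts the elements seen by one for-each pass over the given values.
    static <E> int count(Iterable<E> values) {
        int n = 0;
        for (E ignored : values) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        SinglePassIterable<String> values =
            new SinglePassIterable<>(Arrays.asList("a", "b", "c"));
        System.out.println(count(values)); // first pass consumes the values
        System.out.println(count(values)); // second pass finds nothing left
    }
}
```

A test harness that wraps reducer input in something like this would surface code that silently depends on a second pass.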

Reason: bugfix (API compatibility) Author: Aaron Kimball Ref: UNKNOWN commit c9d77f6e1fdbb24b45675e363e3bd5111533893a Author: Aaron Kimball Date: Fri Mar 12 17:24:10 2010 -0800 HDFS-464. Memory leaks in libhdfs Description: hdfsExists does not call destroyLocalReference for jPath anytime,
hdfsDelete does not call it when it fails, and
hdfsRename does not call it for jOldPath and jNewPath when it fails Reason: bugfix Author: Christian Kunz Ref: UNKNOWN commit c7996c5e2fbb9260740fec369550551d6320762a Author: Aaron Kimball Date: Fri Mar 12 17:23:51 2010 -0800 HDFS-423. Unbreak FUSE build and fuse_dfs_wrapper.sh Description: fuse-dfs depends on libhdfs, and the fuse-dfs build.xml still points to the libhfds/libhdfs.so location, but libhdfs is now built in a different location.
Please see this bug for the location details:

https://issues.apache.org/jira/browse/HADOOP-3344

Reason: Build system bugfix Author: Eli Collins Ref: UNKNOWN commit 72b0b791cd347e760807a44f5197599f57afde03 Author: Aaron Kimball Date: Fri Mar 12 17:23:39 2010 -0800 CLOUDERA-BUILD. Make bin/hadoop-config.sh work with dev builds Author: Eli Collins commit a9466041ccfcdb07f4f0dd34a57c9e9bdd6a3e70 Author: Aaron Kimball Date: Fri Mar 12 17:23:06 2010 -0800 HDFS-727. bug setting block size hdfsOpenFile Description: In hdfsOpenFile in libhdfs invokeMethod needs to cast the block size argument to a jlong so a full 8 bytes are passed (rather than 4 plus some garbage which causes writes to fail due to a bogus block size). Reason: Bugfix Author: Eli Collins Ref: UNKNOWN commit 4e7d205daa86d904614252101bb422664ab6d203 Author: Aaron Kimball Date: Fri Mar 12 17:22:47 2010 -0800 Revert MAPREDUCE-967. TaskTracker does not need to fully unjar job jars Author: Todd Lipcon Ref: UNKNOWN commit d5f0c77a6c81e9e56da81976645614280247f7a2 Author: Aaron Kimball Date: Fri Mar 12 17:22:18 2010 -0800 HADOOP-5640. Allow ServicePlugins to hook callbacks into key service events Description: HADOOP-5257 added the ability for NameNode and DataNode to start and stop ServicePlugin implementations at NN/DN start/stop. However, this is insufficient integration for some common use cases.

We should add some functionality for Plugins to subscribe to events generated by the service they're plugging into. Some potential hook points are:

NameNode:

  • new datanode registered
  • datanode has died
  • exception caught
  • etc?

DataNode:

  • startup
  • initial registration with NN complete (this is important for HADOOP-4707 to sync up datanode.dnRegistration.name with the NN-side registration)
  • namenode reconnect
  • some block transfer hooks?
  • exception caught

I see two potential routes for implementation:

1) We make an enum for the types of hookpoints and have a general function in the ServicePlugin interface. Something like:

    enum HookPoint {
      DN_STARTUP,
      DN_RECEIVED_NEW_BLOCK,
      DN_CAUGHT_EXCEPTION,
      ...
    }

    void runHook(HookPoint hp, Object value);

2) We make classes specific to each "pluggable" as was originally suggested in HADOOP-5257. Something like:

    class DataNodePlugin {
      void datanodeStarted() {}
      void receivedNewBlock(block info, etc) {}
      void caughtException(Exception e) {}
      ...
    }

I personally prefer option (2) since we can ensure plugin API compatibility at compile-time, and we avoid an ugly switch statement in a runHook() function.
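
To make the compile-time argument for option (2) concrete, here is a minimal sketch. It is not the ServicePlugin API: DataNodePluginSketch, RecordingPlugin, fireStarted and fireException are hypothetical names, and only two of the callbacks listed above are modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of option (2); this is not the actual ServicePlugin API.
class DataNodePluginSketch {

    // One no-op method per event, so plugins override only what they need.
    static class DataNodePlugin {
        void datanodeStarted() {}
        void caughtException(Exception e) {}
    }

    // Example plugin that reacts to startup only; other events fall through to the no-ops.
    static class RecordingPlugin extends DataNodePlugin {
        final List<String> events = new ArrayList<>();

        @Override
        void datanodeStarted() { // a misspelled name here would be rejected by the compiler
            events.add("started");
        }
    }

    // The service calls a typed method directly; no switch over hook points is needed.
    static void fireStarted(List<? extends DataNodePlugin> plugins) {
        for (DataNodePlugin p : plugins) {
            p.datanodeStarted();
        }
    }

    static void fireException(List<? extends DataNodePlugin> plugins, Exception e) {
        for (DataNodePlugin p : plugins) {
            p.caughtException(e);
        }
    }

    public static void main(String[] args) {
        RecordingPlugin plugin = new RecordingPlugin();
        List<DataNodePlugin> plugins = Arrays.<DataNodePlugin>asList(plugin);
        fireStarted(plugins);
        fireException(plugins, new RuntimeException("boom"));
        System.out.println(plugin.events);
    }
}
```

With the enum approach in option (1), the same plugin would receive an untyped Object and have to branch on the HookPoint value at runtime, which is the trade-off the paragraph above refers to.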

Interested to hear what people's thoughts are here.

HADOOP-5640 puts this in the new test dir. It needs to be in the old one. Reason: Improvement Author: Todd Lipcon Ref: UNKNOWN commit e9b04609d88ed5d1af442ee950aa5dcd6646e830 Author: Aaron Kimball Date: Fri Mar 12 17:22:08 2010 -0800 MAPREDUCE-1017. Compression and output splitting for Sqoop Description: Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time. Reason: New feature Author: Aaron Kimball Ref: UNKNOWN commit 8c9b473e1af036a3e2cc9036a945a4567277db8a Author: Aaron Kimball Date: Fri Mar 12 17:21:14 2010 -0800 HADOOP-6312. Configuration sends too much data to log4j Description: Configuration objects send a DEBUG-level log message every time they're instantiated, which include a full stack trace. This is more appropriate for TRACE-level logging, as it renders other debug logs very hard to read. Reason: Logging improvement Author: Aaron Kimball Ref: UNKNOWN commit 698fe169f31e54111d30e4420cd1c1c5eaeecdec Author: Aaron Kimball Date: Fri Mar 12 17:21:03 2010 -0800 HDFS-686. NullPointerException is thrown while merging edit log and image Description: Our secondary name node is not able to start on NullPointerException:
ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1232)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1221)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:776)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:590)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:473)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:350)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:314)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:225)
at java.lang.Thread.run(Thread.java:619)

This was caused by setting access time on a non-existent file.

Reason: bugfix Author: Hairong Kuang Ref: UNKNOWN commit b2cc8e02f37a1604bb076acefff0ebf016c249d5 Author: Aaron Kimball Date: Fri Mar 12 17:20:40 2010 -0800 MAPREDUCE-112. Reduce Input Records and Reduce Output Records counters are not being set when using the new Mapreduce reducer API Description: After running the examples/wordcount (which uses the new API), the reduce input and output record counters always show 0. This is because these counters are not getting updated in the new API This adds counters for reduce input, output records to the new API. Reason: Bugfix Author: Jothi Padmanabhan Ref: UNKNOWN commit 3e62477434542dc3de89fd43fd9b19abaf76f0de Author: Aaron Kimball Date: Fri Mar 12 17:20:00 2010 -0800 MAPREDUCE-768. Configuration information should generate dump in a standard format. Description: We need to generate the configuration dump in a standard format . This adds the 'hadoop jobtracker -dumpConfiguration' command. This is modified from the original patch in that it does not dump QueueManager configuration. This is because we have not backported HADOOP-5396 Reason: New feature Author: V.V.Chaitanya Krishna Ref: UNKNOWN commit 4d9333b00772455a1ca7a365fa5b5b2f6872abd7 Author: Aaron Kimball Date: Fri Mar 12 17:19:46 2010 -0800 HADOOP-6184. Provide a configuration dump in json format. Description: Configuration dump in json format. Reason: New feature Author: V.V.Chaitanya Krishna Ref: UNKNOWN commit 96244c3e7d6735f450b618fdcbdbbf9a81436ba3 Author: Aaron Kimball Date: Fri Mar 12 17:19:27 2010 -0800 CLOUDERA-BUILD. Duplicated effort. FULL_VERSION already set in package.mk Description: Revert "Need to pass in FULL_VERSION" Author: Chad Metcalf commit 604d3a71334b9340a6219e3b88bf563b79f5d083 Author: Aaron Kimball Date: Fri Mar 12 17:19:11 2010 -0800 CLOUDERA-BUILD. 
Copy the sqoop manpage to the expected version number Author: Chad Metcalf commit 6d428f70591a92a90dca5256968c62a510659240 Author: Aaron Kimball Date: Fri Mar 12 17:18:58 2010 -0800 CLOUDERA-BUILD. Bump jdiff stable to 0.20.1 Author: Chad Metcalf commit 46ffc9aa9260a96bdf67fbaee9a2acd76cfcf675 Author: Aaron Kimball Date: Fri Mar 12 17:18:44 2010 -0800 CLOUDERA-BUILD. Need to pass in FULL_VERSION Author: Chad Metcalf commit aa7ae9d9826866f94ecfe5629d087ef68e4b5c54 Author: Aaron Kimball Date: Fri Mar 12 17:18:29 2010 -0800 MAPREDUCE-999. Improve Sqoop test speed and refactor tests Description: Sqoop's tests take a long time to run, but this can be improved (by a factor of 2 or more) by taking advantage of jobclient.completion.poll.interval. Reason: Testing performance improvement Author: Aaron Kimball Ref: UNKNOWN commit 084c390ed5fcb03c456121c8497759b40a74f809 Author: Aaron Kimball Date: Fri Mar 12 17:18:13 2010 -0800 MAPREDUCE-1089. Fair Scheduler preemption triggers NPE when tasks are scheduled but not running Description: We see exceptions like this when preemption runs when a task has been scheduled on a TT but has not yet started running.

2009-10-09 14:30:53,989 INFO org.apache.hadoop.mapred.FairScheduler: Should preempt 2 MAP tasks for job_200910091420_0006: tasksDueToMinShare = 2, tasksDueToFairShare = 0
2009-10-09 14:30:54,036 ERROR org.apache.hadoop.mapred.FairScheduler: Exception in fair scheduler UpdateThread
java.lang.NullPointerException
at org.apache.hadoop.mapred.FairScheduler$2.compare(FairScheduler.java:1015)
at org.apache.hadoop.mapred.FairScheduler$2.compare(FairScheduler.java:1013)
at java.util.Arrays.mergeSort(Arrays.java:1270)
at java.util.Arrays.sort(Arrays.java:1210)
at java.util.Collections.sort(Collections.java:159)
at org.apache.hadoop.mapred.FairScheduler.preemptTasks(FairScheduler.java:1013)
at org.apache.hadoop.mapred.FairScheduler.preemptTasksIfNecessary(FairScheduler.java:911)
at org.apache.hadoop.mapred.FairScheduler$UpdateThread.run(FairScheduler.java:286)

Reason: Bugfix Author: Todd Lipcon Ref: UNKNOWN commit 34ca2a5547398f9435a5d3d22603d0f7da420226 Author: Aaron Kimball Date: Fri Mar 12 17:17:48 2010 -0800 MAPREDUCE-551. Add preemption to the fair scheduler Description: Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
1) A job isn't getting its guaranteed share of the cluster for at least T1 seconds.
2) A job is getting significantly less than its fair share for T2 seconds (e.g. less than half its share).

T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.
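
The two trigger conditions above can be sketched as a single check. This is only an illustration and not the FairScheduler code: the class name, method name, and parameters are hypothetical, and "significantly less than its fair share" is assumed to mean "less than half", as in the example given in condition 2.

```java
/** Hypothetical sketch of the two preemption triggers; not the actual FairScheduler code. */
final class PreemptionCheck {
    /**
     * @param now                 current time in milliseconds
     * @param lastAtMinShare      last time the job held at least its guaranteed share
     * @param lastAtHalfFairShare last time the job held at least half of its fair share
     * @param t1Millis            guaranteed-share timeout (T1)
     * @param t2Millis            fair-share timeout (T2), expected to be larger than T1
     * @return true if either starvation condition has lasted long enough
     */
    static boolean shouldPreempt(long now, long lastAtMinShare, long lastAtHalfFairShare,
                                 long t1Millis, long t2Millis) {
        // Condition 1: below the guaranteed share for at least T1.
        boolean starvedOfMinShare = now - lastAtMinShare >= t1Millis;
        // Condition 2: below half of the fair share for at least T2.
        boolean starvedOfFairShare = now - lastAtHalfFairShare >= t2Millis;
        return starvedOfMinShare || starvedOfFairShare;
    }

    public static void main(String[] args) {
        long now = 100_000L;
        // Illustrative timeouts only: T1 = 60s, T2 = 300s.
        System.out.println(shouldPreempt(now, 30_000L, now, 60_000L, 300_000L));
    }
}
```

Keeping T1 and T2 as separate parameters mirrors the description: T1 can be tuned per queue, while T2 acts as the last-resort bound.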

When deciding which tasks to kill to make room for the job, we will use the following heuristics:

  • Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
  • For maps: kill tasks that have run for the least amount of time (limiting wasted time).
  • For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).
This fixes an error in the previous backport where the EagerTaskInitializationListener wasn't properly passed the TaskTrackerManager before starting. Reason: New feature Author: Matei Zaharia Ref: UNKNOWN commit a3e29eff0b9337a1007ec1b90ccb832dca5c1d20 Author: Aaron Kimball Date: Fri Mar 12 17:17:33 2010 -0800 CLOUDERA-BUILD. Fix hadoop wrapper to properly pass through multiword quoted arguments Author: Todd Lipcon commit 975647b6c3a6644cabbd48bf14e074a0efda2cb9 Author: Aaron Kimball Date: Fri Mar 12 17:17:15 2010 -0800 CLOUDERA-BUILD. Sqoop documentation is now part of the generated tarball. Updated the install script to reflect that change. Author: Matt Massie commit 19c038a6af07e3999e83a2178d2328535e00dedb Author: Aaron Kimball Date: Fri Mar 12 17:16:55 2010 -0800 CLOUDERA-BUILD. Generate the sqoop documentation and ensure that it's in the release tarball Author: Matt Massie commit 6957626991875302f33bb73630f4f376412f9711 Author: Aaron Kimball Date: Fri Mar 12 17:16:43 2010 -0800 CLOUDERA-BUILD. More changes to get debs building correctly Author: Chad Metcalf commit 67d1c732cea0eebf59de512301ae8f2a1cb2f349 Author: Aaron Kimball Date: Fri Mar 12 17:16:30 2010 -0800 CLOUDERA-BUILD. Reformatted Sqoop manpage asciidoc for CDH build process Author: Aaron Kimball commit af158d6aa7ffe72d931bc4763ace7d4a299d077b Author: Aaron Kimball Date: Fri Mar 12 17:16:14 2010 -0800 CLOUDERA-BUILD. Only rerun libtoolize if version 2.2 is installed Author: Todd Lipcon commit 586992381042e1b4ec8c9ece069561ad2e4dfcc0 Author: Aaron Kimball Date: Fri Mar 12 17:15:42 2010 -0800 HADOOP-6279. Add JVM memory usage to JvmMetrics Description: The JvmMetrics currently publish memory usage from the MemoryMXBean. This is useful, but doesn't include the total heap size (eg as displayed in the JT Web UI).

It would be nice to expose Runtime.getRuntime().maxMemory() as part of JvmMetrics.

It seems that Runtime.getRuntime().totalMemory() (used by the JT for "memory used") is the same as the 'memHeapCommittedM' which already exists.
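
For reference, the values being compared can be read side by side with standard JDK calls. This is a minimal sketch: the class and method names are hypothetical, and it does not show how JvmMetrics publishes them.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper that reads the heap figures discussed above, in megabytes. */
final class HeapReport {
    private static final double MB = 1024.0 * 1024.0;

    /** Maximum heap the JVM will attempt to use. */
    static double maxMemoryMb() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    /** Heap currently reserved by the JVM, as reported by Runtime. */
    static double totalMemoryMb() {
        return Runtime.getRuntime().totalMemory() / MB;
    }

    /** Committed heap as reported by the MemoryMXBean. */
    static double heapCommittedMb() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getCommitted() / MB;
    }

    public static void main(String[] args) {
        System.out.printf("max=%.1fM total=%.1fM committed=%.1fM%n",
                maxMemoryMb(), totalMemoryMb(), heapCommittedMb());
    }
}
```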

Reason: Metrics improvement Author: Todd Lipcon Ref: UNKNOWN commit 7c168a8a2613d93e19508a91e7c4db3b3cfb503b Author: Aaron Kimball Date: Fri Mar 12 17:15:26 2010 -0800 HADOOP-6269. Missing synchronization for defaultResources in Configuration.addResource Description: Configuration.defaultResources is a simple ArrayList. In two places in Configuration it is accessed without appropriate synchronization, which we've seen to occasionally result in ConcurrentModificationExceptions. Reason: bugfix (race condition) Author: Sreekanth Ramakrishnan Ref: UNKNOWN commit 8bf845170decdcb12254bc1dc98ccbf0fda7d233 Author: Aaron Kimball Date: Fri Mar 12 17:15:01 2010 -0800 CLOUDERA-BUILD. Recreate c++ configure files during build if we have the right build dependencies Author: Todd Lipcon commit e7e9812fa7a6a256652f2f6bbb269334f883c53b Author: Aaron Kimball Date: Fri Mar 12 17:14:43 2010 -0800 CLOUDERA-BUILD. Package sqoop docs w/o requiring asciidoc Author: Chad Metcalf Ref: UNKNOWN commit 7171eabfad501d635b1da9e0287f50e025b4a83f Author: Aaron Kimball Date: Fri Mar 12 17:13:39 2010 -0800 CLOUDERA-BUILD. Revert "Package sqoop docs." Description: This reverts packaging of sqoop documentation in preparation for including MAPREDUCE-906 properly after it has been committed to Apache. Author: Chad Metcalf Ref: UNKNOWN commit 4bd437c9d70f2c0d68047e0376a7af21cc4a70e0 Author: Aaron Kimball Date: Fri Mar 12 17:13:17 2010 -0800 HADOOP-5891. If dfs.http.address is default, SecondaryNameNode can't find NameNode Description: As detailed in this blog post:
http://www.cloudera.com/blog/2009/02/10/multi-host-secondarynamenode-configuration/
if dfs.http.address is not configured, and the 2NN is a different machine from the NN, the 2NN fails to connect.

In SecondaryNameNode.getInfoServer, the 2NN should notice a "0.0.0.0" dfs.http.address and, in that case, pull the hostname out of fs.default.name. This would fix the default configuration to work properly for most users.
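
The substitution described above can be sketched as follows. This is an assumption-laden illustration, not the SecondaryNameNode code: the class and method names are hypothetical, and the addresses used in `main` are example values.

```java
import java.net.URI;

/** Hypothetical sketch: replace a wildcard dfs.http.address host with the fs.default.name host. */
final class InfoServerAddress {
    static String resolve(String dfsHttpAddress, String fsDefaultName) {
        int colon = dfsHttpAddress.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected host:port, got " + dfsHttpAddress);
        }
        String host = dfsHttpAddress.substring(0, colon);
        String port = dfsHttpAddress.substring(colon + 1);
        if ("0.0.0.0".equals(host)) {
            // The wildcard address is useless to a remote 2NN, so use the NN host instead.
            host = URI.create(fsDefaultName).getHost();
        }
        return host + ":" + port;
    }

    public static void main(String[] args) {
        // Example values only.
        System.out.println(resolve("0.0.0.0:50070", "hdfs://nn.example.com:8020"));
    }
}
```

An explicitly configured dfs.http.address is returned unchanged, so only the default configuration is affected.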

Reason: Configuration improvement Author: Todd Lipcon Ref: UNKNOWN commit 74e10e4a137b2aa60ab39186115350b5e82464fc Author: Aaron Kimball Date: Fri Mar 12 17:11:50 2010 -0800 HDFS-127. DFSClient block read failures cause open DFSInputStream to become unusable Description: We are using some Lucene indexes directly from HDFS and for quite long time we were using Hadoop version 0.15.3.

When we tried to upgrade to Hadoop 0.19, index searches started to fail with exceptions like:
2008-11-13 16:50:20,314 WARN [Listener-4] [] DFSClient : DFS Read: java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
at java.io.DataInputStream.read(DataInputStream.java:132)
at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174)
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54)
...

Investigation showed that the root cause was that we exceeded the number of xcievers in the data nodes; that was fixed by raising the configured limit to 2k.
However, one thing that bothered me was that even after the datanodes recovered from overload and most of the client servers had been shut down, we still observed errors in the logs of running servers.
Further investigation showed that the fix for HADOOP-1911 introduced another problem: the DFSInputStream instance might become unusable once the number of failures over the lifetime of the instance exceeds the configured threshold.

The fix for this specific issue seems to be trivial - just reset failure counter before reading next block (patch will be attached shortly).

This seems to be also related to HADOOP-3185, but I'm not sure I really understand necessity of keeping track of failed block accesses in the DFS client.

HADOOP-4681: Also referenced This as-yet-uncommitted patch is recommended by HBase people. Applied patch "4681.patch" attached to the JIRA on 2008-11-18. Reason: Bugfix Author: Igor Bolotin Ref: UNKNOWN commit ca547d89042fff3a38c0c93b6e0ece78e74ae064 Author: Aaron Kimball Date: Fri Mar 12 17:11:10 2010 -0800 HADOOP-4655. FileSystem.CACHE should be ref-counted Description: FileSystem.CACHE is not ref-counted, and could lead to resource leakage. Adds new method FileSystem.newInstance() that always returns a newly allocated FileSystem object. Reason: Bugfix Author: dhruba borthakur Ref: UNKNOWN commit 15660507606b32c3c6c2878f8ed69fe106119bc9 Author: Aaron Kimball Date: Fri Mar 12 17:10:51 2010 -0800 MAPREDUCE-967. TaskTracker does not need to fully unjar job jars Description: In practice we have seen some users submitting job jars that consist of 10,000+ classes. Unpacking these jars into mapred.local.dir and then cleaning up after them has a significant cost (both in wall clock and in unnecessary heavy disk utilization). This cost can be easily avoided Reason: Performance improvement Author: Todd Lipcon Ref: UNKNOWN commit 648e30e074a16de837fb4c604a198bc780c2e6c5 Author: Aaron Kimball Date: Fri Mar 12 17:10:34 2010 -0800 MAPREDUCE-968. NPE in distcp encountered when placing _logs directory on S3FileSystem Description: If distcp is pointed to an empty S3 bucket as the destination for an s3:// filesystem transfer, it will fail with the following exception

Copy failed: java.lang.NullPointerException
at org.apache.hadoop.fs.s3.S3FileSystem.makeAbsolute(S3FileSystem.java:121)
at org.apache.hadoop.fs.s3.S3FileSystem.getFileStatus(S3FileSystem.java:332)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:633)
at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1005)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:650)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:857)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:884)

Reason: Bugfix Author: Aaron Kimball Ref: UNKNOWN commit a61718b87c36dbeddcc6f9917438f81ebdda0214 Author: Aaron Kimball Date: Fri Mar 12 17:10:22 2010 -0800 HADOOP-6133. ReflectionUtils performance regression Description: HADOOP-4187 introduced extra calls to Class.forName in ReflectionUtils.setConf. This caused a fairly large performance regression. Attached is a microbenchmark that shows the following timings (ms) for 100M constructions of new instances:

Explicit construction (new Test): ~1.6sec
Using Test.class.newInstance: ~2.6sec
ReflectionUtils on 0.18.3: ~8.0sec
ReflectionUtils on 0.20.0: ~200sec

This illustrates the ~80x slowdown caused by HADOOP-4187.

Reason: Performance improvement Author: Todd Lipcon Ref: UNKNOWN commit 5e299f831420ed52569eefc5ba815359a0ebc64e Author: Chad Metcalf Date: Tue Sep 15 22:21:42 2009 -0700 HADOOP-6133: ReflectionUtils performance regression commit b6f790774d34ed34bb7c649142dc770c25121ac3 Author: Aaron Kimball Date: Fri Mar 12 17:10:13 2010 -0800 HADOOP-5981. HADOOP-2838 doesnt work as expected Description: The substitution feature i.e X=$X:/tmp doesnt work as expected.

This issue completes the feature mentioned in HADOOP-2838. HADOOP-2838 provided a way to set env variables in child process. This issue provides a way to inherit tt's env variables and append or reset it. So now
X=$X:y will inherit X (if there) and append y to it.

Reason: Bugfix Author: Amar Kamat Ref: UNKNOWN commit eb635e4de3a8b2b5bd9f34225770f24be42dcd83 Author: Chad Metcalf Date: Tue Sep 15 22:29:50 2009 -0700 HADOOP-5981: HADOOP-2838 doesnt work as expected commit 5d4e93d8e0df3c445f56c5eb51965eef92bebd78 Author: Aaron Kimball Date: Fri Mar 12 17:09:46 2010 -0800 HADOOP-2838. Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni Description: Currently there is no way to configure Hadoop to use external JNI directories. I propose we add a new variable like HADOOP_CLASS_PATH that is added to the JAVA_LIBRARY_PATH before the process is run.

Now the users can set environment variables using mapred.child.env. They can do the following
X=Y : set X to Y
X=$X:Y : Append Y to X (which should be taken from the tasktracker)
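
The two forms above can be illustrated with a small sketch. It assumes the setting is a comma-separated list of NAME=VALUE entries and that an unset inherited variable expands to the empty string; the class and method names are hypothetical and this is not the TaskRunner implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of applying mapred.child.env style entries to an inherited environment. */
final class ChildEnv {
    static Map<String, String> apply(String spec, Map<String, String> parentEnv) {
        Map<String, String> env = new HashMap<>(parentEnv);
        for (String entry : spec.split(",")) {
            int eq = entry.indexOf('=');
            if (eq <= 0) {
                continue; // skip malformed entries
            }
            String name = entry.substring(0, eq).trim();
            String value = entry.substring(eq + 1).trim();
            // "$NAME" inside the value refers to the value inherited from the parent process.
            String inherited = parentEnv.getOrDefault(name, "");
            env.put(name, value.replace("$" + name, inherited));
        }
        return env;
    }

    public static void main(String[] args) {
        Map<String, String> parent = new HashMap<>();
        parent.put("X", "/usr/lib");
        System.out.println(apply("X=$X:/tmp,Y=1", parent));
    }
}
```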

Reason: Improves job launch flexibility Author: Amar Kamat Ref: UNKNOWN commit 9b3fc32fa793b338dc700a7f6c437402f80d6b7f Author: Chad Metcalf Date: Tue Sep 15 22:09:57 2009 -0700 HADOOP-2838: Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni commit 877429c3f94a1e937fbe29b4cbe8da573831d802 Author: Aaron Kimball Date: Fri Mar 12 17:09:31 2010 -0800 MAPREDUCE-814. Move completed Job history files to HDFS Description: Currently completed job history files remain on the jobtracker node. Having the files available on HDFS will enable clients to access these files more easily. Reason: New feature Author: Sharad Agarwal Ref: UNKNOWN commit c0575c0908fee4ec01f5bc0abbd7f4b2254dd38e Author: Chad Metcalf Date: Tue Sep 15 18:15:17 2009 -0700 MAPREDUCE-814: Move completed Job history files to HDFS commit a8bf06eac5312ede0982118801e4495285a442fe Author: Aaron Kimball Date: Fri Mar 12 17:08:12 2010 -0800 MAPREDUCE-693. Conf files not moved to "done" subdirectory after JT restart Description: After MAPREDUCE-516, when a job is submitted and the JT is restarted (before job files have been written) and the job is killed after recovery, the conf files fail to be moved to the "done" subdirectory.
The exact scenario to reproduce this issue is:
  • Submit a job
  • Restart JT before anything is written to the job files
  • Kill the job
  • The old conf files remain in the history folder and fail to be moved to "done" subdirectory
Reason: bugfix Author: Amar Kamat Ref: UNKNOWN commit cc22e9f92db6470d244fb17f57601b93bab6db80 Author: Aaron Kimball Date: Fri Mar 12 17:07:55 2010 -0800 MAPREDUCE-683. TestJobTrackerRestart fails with Map task completion events ordering mismatch Description: TestJobTrackerRestart fails consistently with Map task completion events ordering mismatch error. Reason: bugfix Author: Amar Kamat Ref: UNKNOWN commit 57a67dff5d15e3833c7968254df076e440de2765 Author: Aaron Kimball Date: Fri Mar 12 17:07:39 2010 -0800 MAPREDUCE-416. Move the completed jobs' history files to a DONE subdirectory inside the configured history directory Description: Whenever a job completes, the history file can be moved to a directory called DONE. That would make the management of job history files easier (for example, administrators can move the history files from that directory to some other place, delete them, archive them, etc.). Reason: System management improvement Author: Amar Kamat Ref: UNKNOWN commit 99dfdb9a98e1ebd643f47877be3541962c32dcd0 Author: Aaron Kimball Date: Fri Mar 12 17:07:18 2010 -0800 HADOOP-5733. Add map/reduce slot capacity and lost map/reduce slot capacity to JobTracker metrics Description: It would be nice to have the actual map/reduce slot capacity and the lost map/reduce slot capacity (# of blacklisted nodes * map-slot-per-node or reduce-slot-per-node). This information can be used to calculate a JT view of slot utilization. Reason: Metrics improvement Author: Sreekanth Ramakrishnan Ref: UNKNOWN commit 955fe9433b13f21079f92e4035393b683486ad07 Author: Aaron Kimball Date: Fri Mar 12 17:05:59 2010 -0800 HADOOP-5738. Split waiting tasks field in JobTracker metrics to individual tasks Description: Currently, job tracker metrics reports waiting tasks as a single field in metrics. It would be better if we can split waiting tasks into maps and reduces. 
Reason: User experience improvement Author: Sreekanth Ramakrishnan Ref: UNKNOWN commit 3b8f77cd452c1098c6af5907b787bf9167df806b Author: Aaron Kimball Date: Fri Mar 12 17:05:48 2010 -0800 HADOOP-5442. The job history display needs to be paged Description: Currently the list of job history will try to render the entire list of jobs that have run. That doesn't scale up as more and more jobs run on a job tracker. Reason: Scalability improvement Author: Amar Kamat Ref: UNKNOWN commit dfac0482267aaf0fabac97c163e0015306ec5b16 Author: Aaron Kimball Date: Fri Mar 12 17:05:16 2010 -0800 HADOOP-4842. Streaming combiner should allow command, not just JavaClass Description: Streaming jobs are way slower than Java jobs for many reasons, but certainly stopping the shell-only programmer from using the combiner feature won't help. Right now, the streaming usage says:

-mapper <cmd|JavaClassName> The streaming command to run
-combiner <JavaClassName> Combiner has to be a Java class
-reducer <cmd|JavaClassName> The streaming command to run

Reason: Usability improvement Author: Amareshwari Sriramadasu Ref: UNKNOWN commit 33e4f0a87effa466914e292488c47977245edc96 Author: Aaron Kimball Date: Fri Mar 12 17:04:06 2010 -0800 MAPREDUCE-987. Exposing MiniDFS and MiniMR clusters as a single process command-line Description: It's hard to test non-Java programs that rely on significant mapreduce functionality. The patch I'm proposing shortly will let you just type "bin/hadoop jar hadoop-hdfs-hdfswithmr-test.jar minicluster" to start a cluster (internally, it's using Mini{MR,HDFS}Cluster) with a specified number of daemons, etc. A test that checks how some external process interacts with Hadoop might start minicluster as a subprocess, run through its thing, and then simply kill the java subprocess.

I've been using just such a system for a couple of weeks, and I like it. It's significantly easier than developing a lot of scripts to start a pseudo-distributed cluster, and then clean up after it. I figure others might find it useful as well.

I'm at a bit of a loss as to where to put it in 0.21. hdfs-with-mr tests have all the required libraries, so I've put it there. I could conceivably split this into "minimr" and "minihdfs", but it's specifically the fact that they're configured to talk to each other that I like about having them together. And one JVM is better than two for my test programs.

Reason: Testing feature Author: Philip Zeyliger Ref: UNKNOWN commit 39ff7e5ee285df97c765a73271066df718be0e30 Author: Aaron Kimball Date: Fri Mar 12 17:03:23 2010 -0800 HADOOP-6267. build-contrib.xml unnecessarily enforces that contrib projects be located in contrib/ dir Description: build-contrib.xml currently sets hadoop.root to ${basedir}/../../../. This path is relative to the contrib project which is assumed to be inside src/contrib/. We occasionally work on contrib projects in other repositories until they're ready to contribute. We can use the <dirname> ant task to do this more correctly. Reason: Build system improvement Author: Todd Lipcon Ref: UNKNOWN commit 139bea6660193cc73852832e03fe570437343e96 Author: Aaron Kimball Date: Fri Mar 12 15:02:55 2010 -0800 HDFS-528. Add ability for safemode to wait for a minimum number of live datanodes Description: When starting up a fresh cluster programatically, users often want to wait until DFS is "writable" before continuing in a script. "dfsadmin -safemode wait" doesn't quite work for this on a completely fresh cluster, since when there are 0 blocks on the system, 100% of them are accounted for before any DNs have reported.

This JIRA is to add a command which waits until a certain number of DNs have reported as alive to the NN.

Reason: New feature Author: Todd Lipcon Ref: UNKNOWN commit b301746d45bde2759535549f87c6485f4ee577b2 Author: Aaron Kimball Date: Fri Mar 12 15:02:38 2010 -0800 HADOOP-4936. Improvements to TestSafeMode Description: TestSafeMode
  • needs a detailed description of the test case
  • should not call the name-node directly, but rather call DistributedFileSystem methods.
Reason: Test coverage improvement Author: Konstantin Shvachko Ref: UNKNOWN commit f04a321596a513e71354f2a6829b44e474077507 Author: Aaron Kimball Date: Fri Mar 12 15:02:22 2010 -0800 HADOOP-5650. Namenode log that indicates why it is not leaving safemode may be confusing Description: A namenode with a large number of datablocks is setup with dfs.safemode.threshold.pct set to 1.0. With a small number of unreported blocks, namenode prints the following as the reason for not leaving safe mode:
The ratio of reported blocks 1.0000 has not reached the threshold 1.0000

With a large number of blocks, precision used for printing the log may not indicate the difference between the actual ratio of safe blocks to total blocks and the configured threshold. Printing number of blocks instead of ratio will improve the clarity.
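
The rounding problem is easy to reproduce. In the sketch below, the ratio message copies the wording quoted above, while the count-based message, the class name, and the method names are hypothetical and only show the idea of printing block counts.

```java
import java.util.Locale;

/** Hypothetical sketch comparing ratio-based and count-based safe mode messages. */
final class SafeModeMessage {
    static String asRatio(long safeBlocks, long totalBlocks, double threshold) {
        double ratio = (double) safeBlocks / totalBlocks;
        // Four decimal places cannot distinguish 0.999999 from 1.0.
        return String.format(Locale.ROOT,
                "The ratio of reported blocks %.4f has not reached the threshold %.4f",
                ratio, threshold);
    }

    static String asCounts(long safeBlocks, long totalBlocks, double threshold) {
        long needed = (long) Math.ceil(totalBlocks * threshold);
        // Illustrative wording only; exact counts make the shortfall visible.
        return String.format(Locale.ROOT,
                "Reported blocks %d has not reached the required %d of %d total blocks",
                safeBlocks, needed, totalBlocks);
    }

    public static void main(String[] args) {
        System.out.println(asRatio(9_999_990L, 10_000_000L, 1.0));
        System.out.println(asCounts(9_999_990L, 10_000_000L, 1.0));
    }
}
```

With 10 blocks missing out of 10,000,000, the ratio rounds to 1.0000 and the first message looks contradictory; the second message shows the gap directly.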

Reason: User experience improvement Author: Suresh Srinivas Ref: UNKNOWN commit 13e35e654c51a5b1cfe809ef1e2c4d2ca46ed612 Author: Aaron Kimball Date: Fri Mar 12 15:01:52 2010 -0800 HADOOP-4675. Current Ganglia metrics implementation is incompatible with Ganglia 3.1 Description: Ganglia changed its wire protocol in the 3.1.x series; the current implementation only works for 3.0.x. Patched using https://issues.apache.org/jira/secure/attachment/12407207/HADOOP-4675-v7.patch Reason: Compatibility improvement Author: Brian Bockelman Ref: UNKNOWN commit dcf76896b1c8a7b891995b1546eef6ea3018e7ca Author: Philip Zeyliger Date: Tue Jul 28 15:28:18 2009 -0700 HADOOP-4675. Current Ganglia metrics implementation is incompatible with Ganglia 3.1 Patched using https://issues.apache.org/jira/secure/attachment/12407207/HADOOP-4675-v7.patch commit 4305750d026b895b3afbd0d4a4ee4b3b42596016 Author: Aaron Kimball Date: Fri Mar 12 15:01:29 2010 -0800 HADOOP-6269. Missing synchronization for defaultResources in Configuration.addResource Description: Configuration.defaultResources is a simple ArrayList. In two places in Configuration it is accessed without appropriate synchronization, which we've seen to occasionally result in ConcurrentModificationExceptions. Reason: Bugfix (race condition) Author: Sreekanth Ramakrishnan Ref: UNKNOWN commit 90f9c40df18fe464383de52e3d3952638a393e34 Author: Aaron Kimball Date: Fri Mar 12 15:01:08 2010 -0800 CLOUDERA-BUILD. Make some JT methods and classes public for use from within contrib plugins Author: Henry Robinson commit f8e0599a434e1ce94158384f575e912e9f988229 Author: Aaron Kimball Date: Fri Mar 12 14:59:40 2010 -0800 MAPREDUCE-461. Enable ServicePlugins for the JobTracker Description: Allow ServicePlugins (see HADOOP-5257) for the JobTracker. (Relies on HADOOP-5640) Reason: API Improvement Author: Todd Lipcon Ref: UNKNOWN commit c58318cfa6e26b7dbacd4093d646fc8b66f9eda6 Author: Aaron Kimball Date: Fri Mar 12 14:58:23 2010 -0800 HADOOP-5640. 
Allow ServicePlugins to hook callbacks into key service events Description: HADOOP-5257 added the ability for NameNode and DataNode to start and stop ServicePlugin implementations at NN/DN start/stop. However, this is insufficient integration for some common use cases.

We should add some functionality for Plugins to subscribe to events generated by the service they're plugging into. Some potential hook points are:

NameNode:

  • new datanode registered
  • datanode has died
  • exception caught
  • etc?

DataNode:

  • startup
  • initial registration with NN complete (this is important for HADOOP-4707 to sync up datanode.dnRegistration.name with the NN-side registration)
  • namenode reconnect
  • some block transfer hooks?
  • exception caught

I see two potential routes for implementation:

1) We make an enum for the types of hookpoints and have a general function in the ServicePlugin interface. Something like:

enum HookPoint {
      DN_STARTUP,
      DN_RECEIVED_NEW_BLOCK,
      DN_CAUGHT_EXCEPTION,
     ...
    }
    
    void runHook(HookPoint hp, Object value);

2) We make classes specific to each "pluggable" as was originally suggested in HADOOP-5257. Something like:

class DataNodePlugin {
      void datanodeStarted() {}
      void receivedNewBlock(block info, etc) {}
      void caughtException(Exception e) {}
      ...
    }

I personally prefer option (2) since we can ensure plugin API compatibility at compile-time, and we avoid an ugly switch statement in a runHook() function.

Interested to hear what people's thoughts are here.

Reason: API Improvement Author: Todd Lipcon Ref: UNKNOWN commit 137999a0b48a81bed10a5f30868dbfe6d176956b Author: Aaron Kimball Date: Fri Mar 12 14:58:09 2010 -0800 HADOOP-5257. Export namenode/datanode functionality through a pluggable RPC layer Description: Adding support for pluggable components would allow exporting DFS functionality using arbitrary protocols, like Thrift or Protocol Buffers. I'm opening this issue on Dhruba's suggestion in HADOOP-4707.

Plug-in implementations would extend this base class:

abstract class Plugin {

        public abstract void datanodeStarted(DataNode datanode);

        public abstract void datanodeStopping();

        public abstract void namenodeStarted(NameNode namenode);

        public abstract void namenodeStopping();
    }

Name node instances would then start the plug-ins according to a configuration object, and would also shut them down when the node goes down:

public class NameNode {
    
        // [..]
    
        private void initialize(Configuration conf) {
            // [...]
            for (Plugin p: PluginManager.loadPlugins(conf))
              p.namenodeStarted(this);
        }
    
        // [..]
    
        public void stop() {
            if (stopRequested)
                return;
            stopRequested = true;
            for (Plugin p: plugins)
                p.namenodeStopping();
            // [..]
        }
    
        // [..]
    }

Data nodes would do a similar thing in DataNode.startDatanode() and DataNode.shutdown

Reason: MISSING: Reason for inclusion Author: Carlos Valiente Ref: UNKNOWN commit 155394ca5eed2e2a6151a5c9d9452e9cfbb30a11 Author: Aaron Kimball Date: Fri Mar 12 14:57:58 2010 -0800 MAPREDUCE-971. distcp does not always remove distcp.tmp.dir Description: Sometimes distcp leaves behind its tmpdir when the target filesystem is s3n. Reason: Bugfix Author: Aaron Kimball Ref: UNKNOWN commit 7575b83ba0cab30394bad0943ff906ab0609dc40 Author: Aaron Kimball Date: Fri Mar 12 14:57:49 2010 -0800 CLOUDERA-BUILD. Package sqoop docs. commit 9321b18352e55d4d37c25335b578151b18f938f2 Author: Aaron Kimball Date: Fri Mar 12 14:57:32 2010 -0800 MAPREDUCE-923. Sqoop's ORM uses URLDecoder on a file, which replaces plus signs in a jar file name with spaces Description: In findThisJar, sqoop runs URLDecoder.decode on the resulting jar, which has the effect of replacing any + signs in the path with a space. This obviously breaks the classpath variable that it's trying to set, and the sqoop-generated code fails to compile. Ironically, Cloudera's hadoop distro is the one that puts + characters in jar files, and so exhibits the bug. Here is an example from running sqoop with log4j at debug level. Note the space in the very last term, which should read hadoop-0.20.0+61-sqoop.jar rather than hadoop-0.20.0 61-sqoop.jar.

09/08/27 18:00:07 DEBUG orm.CompilationManager: Invoking javac with args: -sourcepath ./ -d /tmp/sqoop/compile/ -classpath /usr/lib/hadoop-0.20/conf:/usr/java/jdk1.6.0_06/lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-0.20.0+61-core.jar:/usr/lib/hadoop-0.20/lib/commons-cli-2.0-SNAPSHOT.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.3.jar:/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/hadoop-0.20.0+61-fairscheduler.jar:/usr/lib/hadoop-0.20/lib/hadoop-0.20.0+61-scribe-log4j.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/hsqldb.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.14.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.14.jar:/usr/lib/hadoop-0.20/lib/junit-3.8.1.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.20/lib/libfb303.jar:/usr/lib/hadoop-0.20/lib/libthrift.jar:/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mysql-connector-java-5.0.8-bin.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar:/usr/local/hadoop/lib/hadoop-gpl-compression.jar:/usr/lib/hadoop-0.20/hadoop-0.20.0+61-core.jar:/usr/lib/hadoop-0.20/contrib/sqoop/hadoop-0.20.0 61-sqoop.jar
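
The effect is reproducible with the JDK alone. The sketch below contrasts `URLDecoder.decode`, which applies form decoding and turns `+` into a space, with `URI.getPath()`, which only decodes percent-escapes. The class and method names are hypothetical, and the `URI` variant is shown as one possible alternative, not necessarily the fix that was committed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

/** Hypothetical sketch of the '+' handling difference between URLDecoder and URI. */
final class JarPathDecoding {
    /** Form decoding: '+' becomes a space, which corrupts jar names containing '+'. */
    static String withUrlDecoder(String path) {
        try {
            return URLDecoder.decode(path, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** URI path decoding: only percent-escapes are decoded, '+' is left alone. */
    static String withUri(String fileUri) {
        return URI.create(fileUri).getPath();
    }

    public static void main(String[] args) {
        String path = "/usr/lib/hadoop-0.20/contrib/sqoop/hadoop-0.20.0+61-sqoop.jar";
        System.out.println(withUrlDecoder(path));
        System.out.println(withUri("file:" + path));
    }
}
```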

Reason: Bugfix Author: Aaron Kimball Ref: UNKNOWN commit e97883c5b9c389f82a6447e4cb1678c0a0ed83ba Author: Aaron Kimball Date: Fri Mar 12 14:57:19 2010 -0800 CLOUDERA-BUILD. Sqoop asciidoc syntax error Author: Aaron Kimball commit 520bda2edcb90dfe9461e16b96aa4a048d33ed7b Author: Aaron Kimball Date: Fri Mar 12 14:57:11 2010 -0800 HADOOP-5450. Add support for application-specific typecodes to typed bytes Description: For serializing objects of types that are not supported by typed bytes serialization, applications might want to use a custom serialization format. Right now, typecode 0 has to be used for the bytes resulting from this custom serialization, which could lead to problems when deserializing the objects because the application cannot know if a byte sequence following typecode 0 is a custom-serialized object or just a raw sequence of bytes. Therefore, a range of typecodes that are treated as aliases for 0 should be added, such that different typecodes can be used for application-specific purposes. Reason: New feature Author: Klaas Bosteels Ref: UNKNOWN commit b30fc99332c4a444d275731dac4b4245115d65b2 Author: Aaron Kimball Date: Fri Mar 12 14:56:59 2010 -0800 HADOOP-1722. Make streaming to handle non-utf8 byte array Description: Right now, the streaming framework expects the outputs of the streaming process (mapper or reducer) to be line
oriented UTF-8 text. This limit makes it impossible to use those programs whose outputs may be non-UTF-8
(international encoding, or maybe even binary data). Streaming can overcome this limit by introducing a simple
encoding protocol. For example, it can allow the mapper/reducer to hex-encode its keys/values,
which the framework then decodes on the Java side.
This way, as long as the mapper/reducer executables follow this encoding protocol,
they can output arbitrary byte arrays and the streaming framework can handle them. Reason: New feature Author: Klaas Bosteels Ref: UNKNOWN commit 921c135653736bcc279700435358058762bc8f78 Author: Aaron Kimball Date: Fri Mar 12 14:56:43 2010 -0800 CLOUDERA-BUILD. More Sqoop documentation updates Author: Aaron Kimball commit be7f1dc031e17dc4f53ebe76d27c1b9242105785 Author: Aaron Kimball Date: Fri Mar 12 14:56:26 2010 -0800 MAPREDUCE-840. DBInputFormat leaves open transaction Description: (Reapplied after HADOOP-4687) Reason: MISSING: Reason for inclusion Author: Aaron Kimball Ref: UNKNOWN commit 89a96d8fff80ac809dbda9582044a7c6b3986d16 Author: Aaron Kimball Date: Fri Mar 12 14:56:07 2010 -0800 MAPREDUCE-906. Updated Sqoop documentation Description: Provides the latest documentation for Sqoop, in both user-guide and manpage form. Built with asciidoc. Reason: Documentation Author: Aaron Kimball Ref: UNKNOWN commit 51f867aea0667d0191b730ea3abf114e75cafa4b Author: Aaron Kimball Date: Fri Mar 12 14:55:54 2010 -0800 MAPREDUCE-907. Sqoop should use more intelligent splits Description: Sqoop should use the new split generation / InputFormat in MAPREDUCE-885 Reason: Performance / scalability improvement Author: Aaron Kimball Ref: UNKNOWN commit 239df04415dba8d12c7d3fbf33c580d473202e94 Author: Aaron Kimball Date: Fri Mar 12 14:55:28 2010 -0800 MAPREDUCE-885. More efficient SQL queries for DBInputFormat Description: DBInputFormat generates InputSplits by counting the available rows in a table, and selecting subsections of the table via the "LIMIT" and "OFFSET" SQL keywords. These are only meaningful in an ordered context, so the query also includes an "ORDER BY" clause on an index column. The resulting queries are often inefficient and require full table scans. Actually using multiple mappers with these queries can lead to O(n^2) behavior in the database, where n is the number of splits. Attempting to use parallelism with these queries is counter-productive.

A better mechanism is to organize splits based on data values themselves, which can be performed in the WHERE clause, allowing for index range scans of tables, and can better exploit parallelism in the database.
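As a rough illustration of value-based splitting, the sketch below divides the range of an integer column into contiguous intervals and emits one WHERE condition per split. The class name `RangeSplits`, the method `whereClauses`, and the table and column names are hypothetical and are not taken from the patch; the actual split logic in MAPREDUCE-885 may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one WHERE condition per split over an integer column.
class RangeSplits {
    /** Splits the inclusive range [min, max] into up to numSplits half-open intervals. */
    static List<String> whereClauses(String column, long min, long max, int numSplits) {
        List<String> clauses = new ArrayList<String>();
        long span = max - min + 1; // number of distinct values covered
        for (int i = 0; i < numSplits; i++) {
            long lo = min + span * i / numSplits;
            long hi = min + span * (i + 1) / numSplits;
            if (lo == hi) {
                continue; // more splits requested than values available
            }
            clauses.add(column + " >= " + lo + " AND " + column + " < " + hi);
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Each mapper would run one of these queries; the database can answer
        // each with an index range scan instead of scanning from the first row.
        for (String clause : whereClauses("id", 1, 100, 4)) {
            System.out.println("SELECT * FROM employees WHERE " + clause);
        }
    }
}
```

Because the intervals are half-open and contiguous, every value in the range falls in exactly one split, which is the property LIMIT/OFFSET only provides when combined with ORDER BY.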

Reason: Performance and scalability improvement Author: Aaron Kimball Ref: UNKNOWN commit 23a0d1882c797160cc7b6fae99fc5e686aa30191 Author: Aaron Kimball Date: Fri Mar 12 14:55:16 2010 -0800 MAPREDUCE-938. Postgresql support for Sqoop Description: Sqoop should be able to import from postgresql databases. Reason: Compatability improvement Author: Aaron Kimball Ref: UNKNOWN commit 7b89feb34fafd2365f75ab744db9cb07a5443046 Author: Aaron Kimball Date: Fri Mar 12 14:55:05 2010 -0800 MAPREDUCE-876. Sqoop import of large tables can time out Description: Related to MAPREDUCE-875, Sqoop should use a background thread to ensure that progress is being reported while a database does external work for the MapReduce task. Reason: Scalability improvement Author: Aaron Kimball Ref: UNKNOWN commit 61d4ef5175dca1859a1320f9e7cad1caeab5d982 Author: Aaron Kimball Date: Fri Mar 12 14:54:49 2010 -0800 MAPREDUCE-918. Test hsqldb server should be memory-only. Description: Sqoop launches a standalone hsqldb server for unit tests, but it currently writes its database to disk and uses a connect string of //localhost. If multiple test instances are running concurrently, one test server may serve to the other instance of the unit tests, causing race conditions. Reason: Bugfix in test harness Author: Aaron Kimball Ref: UNKNOWN commit 1fc17ad34e8288b54503eeb15f788eb4e6a070dc Author: Aaron Kimball Date: Fri Mar 12 14:54:37 2010 -0800 MAPREDUCE-875. Make DBRecordReader execute queries lazily Description: DBInputFormat's DBRecordReader executes the user's SQL query in the constructor. If the query is long-running, this can cause task timeout. The user is unable to spawn a background thread (e.g., in a MapRunnable) to inform Hadoop of on-going progress. Reason: Scalability improvement Author: Aaron Kimball Ref: UNKNOWN commit 21fdb7a7fd501fd63e1a540c2b55cf410d057301 Author: Aaron Kimball Date: Fri Mar 12 14:54:27 2010 -0800 MAPREDUCE-825. 
JobClient completion poll interval of 5s causes slow tests in local mode Description: The JobClient.NetworkedJob.waitForCompletion() method polls for job completion every 5 seconds. When running a set of short tests in pseudo-distributed mode, this is unnecessarily slow and causes lots of wasted time. When bandwidth is not scarce, setting the poll interval to 100 ms results in a 4x speedup in some tests. This interval should be parametrized to allow users to control the interval for testing purposes. Reason: Test performance improvement Author: Aaron Kimball Ref: UNKNOWN commit f996b8a019bffefff183d7d688ccf95b8cb73de5 Author: Aaron Kimball Date: Fri Mar 12 14:54:15 2010 -0800 MAPREDUCE-750. Extensible ConnManager factory API Description: Sqoop uses the ConnFactory class to instantiate a ConnManager implementation based on the connect string and other arguments supplied by the user. This allows per-database logic to be encapsulated in different ConnManager instances, and dynamically chosen based on which database the user is actually importing from. But adding new ConnManager implementations requires modifying the source of a common ConnFactory class. An indirection layer should be used to delegate instantiation to a number of factory implementations which can be specified in the static configuration or at runtime. Reason: API flexibility improvement Author: Aaron Kimball Ref: UNKNOWN commit 39bdff7bd3b83359884c90ae857d3f3144a94803 Author: Aaron Kimball Date: Fri Mar 12 14:54:04 2010 -0800 MAPREDUCE-749. Make Sqoop unit tests more Hudson-friendly Description: Hudson servers (other than Apache's) need to be able to run the sqoop unit tests which depend on thirdparty JDBC drivers / database implementations. The build.xml needs some refactoring to make this happen. Reason: Test coverage improvement Author: Aaron Kimball Ref: UNKNOWN commit 0ca54f2722206685d9e36fcbb2656d0ac1957311 Author: Aaron Kimball Date: Fri Mar 12 14:53:47 2010 -0800 MAPREDUCE-792. 
javac warnings in DBInputFormat Description: MAPREDUCE-716 introduces javac warnings Reason: Technical debt Author: Aaron Kimball Ref: UNKNOWN commit e39ae9d017e89e4df193b1f8075184320230499b Author: Aaron Kimball Date: Fri Mar 12 14:52:45 2010 -0800 MAPREDUCE-716. org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle Description: Applied "trunk" version of the patch after incorporating HADOOP-4687's move of DBInputFormat-related files. (Prior patch was 0.20-branch specific) Reason: Branch compatibility improvement Author: Aaron Kimball Ref: UNKNOWN commit 074e824f5d3d2f6ab862083e6eb4b0df8c881bfc Author: Aaron Kimball Date: Fri Mar 12 14:52:27 2010 -0800 MAPREDUCE-910. MRUnit should support counters Description: incrCounter() is currently a dummy stub method in MRUnit that does nothing. Would be good for the mock reporter/context implementations to support counters. Reason: New feature Author: Aaron Kimball Ref: UNKNOWN commit b4b7c5d9b4cba84bc47f4a48074fd295d060ab35 Author: Aaron Kimball Date: Fri Mar 12 14:52:17 2010 -0800 MAPREDUCE-798. MRUnit should be able to test a succession of MapReduce passes Description: MRUnit can currently test that the inputs to a given (mapper, reducer) "job" produce certain outputs at the end of the reducer. It would be good to support more end-to-end tests of a series of MapReduce jobs that form a longer pipeline surrounding some data. Reason: New Feature Author: Aaron Kimball Ref: UNKNOWN commit 59677d22261974560117fa82e74d9a7f80f804d5 Author: Aaron Kimball Date: Fri Mar 12 14:52:06 2010 -0800 MAPREDUCE-800. MRUnit should support the new API Description: MRUnit's TestDriver implementations use the old org.apache.hadoop.mapred-based classes. TestDrivers and associated mock object implementations are required for org.apache.hadoop.mapreduce-based code. 
Reason: New feature (API Compatibility) Author: Aaron Kimball Ref: UNKNOWN commit 7fda23b419b1c98e84eea43a0f35191d41032e18 Author: Aaron Kimball Date: Fri Mar 12 14:51:53 2010 -0800 MAPREDUCE-799. Some of MRUnit's self-tests were not being run Description: Due to method naming issues, some test cases were not being executed. Reason: Bugfix; test coverage Author: Aaron Kimball Ref: UNKNOWN commit 20d5bf205e9f2864f3da53d30408ba97763a46e9 Author: Aaron Kimball Date: Fri Mar 12 14:51:40 2010 -0800 MAPREDUCE-797. MRUnit MapReduceDriver should support combiners Description: The MapReduceDriver allows you to specify a mapper and a reducer class with a simple sort/"shuffle" between the passes. It would be nice to also support another Reducer implementation being used as a combiner in the middle. Reason: New feature Author: Aaron Kimball Ref: UNKNOWN commit 5c873336b3380e6c8f07ca28230ede9d41e4e840 Author: Aaron Kimball Date: Fri Mar 12 14:50:05 2010 -0800 Integrate with 0.21-branch versions of DBInputFormat Description: In 0.21 there is now a DBInputFormat in the mapred/lib/ package as well as mapreduce/lib/db. This patch backports the new API edition of DBInputFormat to CDH Reason: Cross-branch compatibility improvement Author: Aaron Kimball Ref: UNKNOWN commit 51b650554e3bc8054e8ca966f5f552c522f7483d Author: Aaron Kimball Date: Fri Mar 12 14:49:52 2010 -0800 HADOOP-5170. Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide Description: There are a number of use cases for being able to do this. The focus of this jira should be on finding what would be the simplest to implement that would satisfy the most use cases.

This could be implemented as either a per-node maximum or a cluster-wide maximum. It seems that for most uses the former is preferable; however, either would fulfill the requirements of this jira.

Some of the reasons for allowing this feature (mine and from others on list):

  • I have some very large CPU-bound jobs. I am forced to keep the max map/node limit at 2 or 3 (on a 4 core node) so that I do not starve the Datanode and Regionserver. I have other jobs that are network latency bound and would like to be able to run high numbers of them concurrently on each node. Though I can thread some jobs, there are some use cases that are difficult to thread (scanning from hbase) and there's significant complexity added to the job rather than letting hadoop handle the concurrency.
  • Poor assignment of tasks to nodes creates some situations where you have multiple reducers on a single node but other nodes that received none. A limit of 1 reducer per node for that job would prevent that from happening. (only works with per-node limit)
  • Poor man's MR job virtualization. Since we can limit a job's resources, this gives much more control in allocating and dividing up resources of a large cluster. (makes most sense w/ cluster-wide limit)
Reason: Configuration improvement Author: Matei Zaharia Ref: UNKNOWN commit 99e25a93542251debd248ed71cb380858ca8c9bd Author: Aaron Kimball Date: Fri Mar 12 14:49:40 2010 -0800 HADOOP-6166. Improve PureJavaCrc32 Description: Got some ideas to improve CRC32 calculation. Reason: Performance Improvement Author: Tsz Wo (Nicholas), SZE Ref: UNKNOWN commit 2d0a97cefa559ab9059d976bda66f9dbcf051e79 Author: Aaron Kimball Date: Fri Mar 12 14:49:28 2010 -0800 MAPREDUCE-782. Use PureJavaCrc32 in mapreduce spills Description: HADOOP-6148 implemented a Pure Java implementation of CRC32 which performs better than the built-in one. This issue is to make use of it in the mapred package Reason: Performance improvement Author: Todd Lipcon Ref: UNKNOWN commit bb65cb649c2924b5a20f06deb9ecd66fc219eeeb Author: Aaron Kimball Date: Fri Mar 12 14:49:12 2010 -0800 HDFS-496. Use PureJavaCrc32 in HDFS Description: Common now has a pure java CRC32 implementation which is more efficient than java.util.zip.CRC32. This issue is to make use of it. Reason: Performance improvement Author: Todd Lipcon Ref: UNKNOWN commit ac73e6d51d5ad1df993097349602e5f3199b952a Author: Aaron Kimball Date: Fri Mar 12 14:48:40 2010 -0800 HADOOP-6148. Implement a pure Java CRC32 calculator Description: We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief. This outperforms java.util.zip.CRC32. Reason: Performance improvement Author: Scott Carey and Todd Lipcon Ref: UNKNOWN commit e7430c8cbd2d182716ac7efb08cb2187c1edab95 Author: Aaron Kimball Date: Fri Mar 12 14:48:08 2010 -0800 Updated Sqoop documentation for MAPREDUCE-816, MAPREDUCE-789. 
Reason: Documentation improvement Author: Aaron Kimball Ref: UNKNOWN commit aa75ab7f749604c354dcdb0b806aca9cd140f504 Author: Aaron Kimball Date: Fri Mar 12 14:47:58 2010 -0800 MAPREDUCE-789. Oracle support for Sqoop Description: A separate ConnManager is needed for Oracle to support its slightly different syntax and configuration Reason: Compatibility improvement Author: Aaron Kimball Ref: UNKNOWN commit 6f017db468a82e336a28f451c7d90bc225130094 Author: Aaron Kimball Date: Fri Mar 12 14:47:33 2010 -0800 MAPREDUCE-840. DBInputFormat leaves open transaction Description: DBInputFormat.getSplits() does not call connection.commit() after the COUNT query. This can leave an open transaction against the database which interferes with other connections to the same table. Reason: bugfix Author: Aaron Kimball Ref: UNKNOWN commit 84b622a5f6f5bd145f19f4c08b6263759ac51756 Author: Aaron Kimball Date: Fri Mar 12 14:47:15 2010 -0800 MAPREDUCE-816. Rename "local" mysql import to "direct" Description: A mysqldump-based fast path known as "local mode" is used in sqoop when users pass the argument -local. The restriction that this only import from localhost was based on an implementation technique that was later abandoned in favor of a more general one, which can support remote hosts as well. Thus, local is a poor name for the flag. -direct is more general and more descriptive. This should be used instead. Reason: Interface clarification Author: Aaron Kimball Ref: UNKNOWN commit ce75318a484615dc7b161a41710884f34db50c86 Author: Aaron Kimball Date: Fri Mar 12 14:46:34 2010 -0800 MAPREDUCE-716. org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle Description:

The out-of-the-box implementation in Hadoop works properly with mysql/hsqldb, but NOT with Oracle.
The reason is that DBInputFormat is implemented with mysql/hsqldb-specific query constructs like "LIMIT" and "OFFSET".

FIX:
Build database-provider-specific logic based on the database provider name (which we can get from the connection).
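A minimal sketch of provider-specific query construction follows. The class `PagedQuery` and method `forProduct` are hypothetical names for illustration, and the Oracle form shown is one common ROWNUM-based pagination idiom, not necessarily the exact SQL the patch generates. In real code the product name would come from the standard JDBC call `connection.getMetaData().getDatabaseProductName()`; here it is passed in as a string so the example runs without a database.

```java
import java.util.Locale;

// Hypothetical sketch: choose pagination syntax from the database product name.
class PagedQuery {
    /**
     * Wraps a base query so it returns `limit` rows starting after `offset`.
     * productName is assumed to be the value reported by
     * DatabaseMetaData.getDatabaseProductName().
     */
    static String forProduct(String productName, String query, long offset, long limit) {
        String db = productName.toUpperCase(Locale.ROOT);
        if (db.startsWith("ORACLE")) {
            // Oracle (at the time of this patch) has no LIMIT/OFFSET;
            // page with ROWNUM in nested subqueries instead.
            long end = offset + limit;
            return "SELECT * FROM (SELECT a.*, ROWNUM rno FROM ( " + query
                    + " ) a WHERE ROWNUM <= " + end + " ) WHERE rno > " + offset;
        }
        // Default: the mysql/hsqldb-style construct used by the original code.
        return query + " LIMIT " + limit + " OFFSET " + offset;
    }

    public static void main(String[] args) {
        String base = "SELECT id FROM t ORDER BY id";
        System.out.println(forProduct("MySQL", base, 20, 10));
        System.out.println(forProduct("Oracle", base, 20, 10));
    }
}
```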

Reason: Compatibility improvement Author: Aaron Kimball Ref: UNKNOWN commit 338de775796c2102ce680eaa983b719b50e9f3ee Author: Aaron Kimball Date: Fri Mar 12 14:46:18 2010 -0800 HADOOP-5469. Exposing Hadoop metrics via HTTP Description: Implement a "/metrics" URL on the HTTP server of Hadoop daemons, to expose metrics data to users via their web browsers, in plain-text and JSON. Reason: New feature Author: Philip Zeyliger Ref: UNKNOWN commit cad421ec1c51382f81714ccafb96a6bb8bcc8aec Author: Aaron Kimball Date: Fri Mar 12 14:46:11 2010 -0800 HADOOP-5469. Exposing Hadoop metrics via HTTP Description: Implement a "/metrics" URL on the HTTP server of Hadoop daemons, to expose metrics data to users via their web browsers, in plain-text and JSON. Reason: MISSING: Reason for inclusion Author: Philip Zeyliger Ref: UNKNOWN commit 8b09839047997a4b5461703650b5779ec86c1844 Author: Aaron Kimball Date: Fri Mar 12 14:45:49 2010 -0800 CLOUDERA-BUILD. Added Sqoop documentation to installation script Author: Todd Lipcon commit 7e77c6b13f06dec9c742bf76c81e2ec02d81c7cb Author: Aaron Kimball Date: Fri Mar 12 14:45:35 2010 -0800 CLOUDERA-BUILD. Fix the hadoop/sqoop wrapper scripts Author: Matt Massie commit 0caaf80f3a569b91f482de0dcb87f826967f5c7c Author: Aaron Kimball Date: Fri Mar 12 14:45:16 2010 -0800 CLOUDERA-BUILD. Fix a bug in the hadoop/sqoop wrapper generation Author: Matt Massie Ref: UNKNOWN commit bd8ddae402a876fe78cbb1482362935780b57d84 Author: Aaron Kimball Date: Fri Mar 12 14:44:59 2010 -0800 CLOUDERA-BUILD. Update the install hadoop script Author: Matt Massie Ref: UNKNOWN commit 80cf01124877a5aebd742142b10fda45910f0328 Author: Aaron Kimball Date: Fri Mar 12 14:44:42 2010 -0800 CLOUDERA-BUILD. Rename the hadoop man page to be hadoop-0.20 Author: Matt Massie Ref: UNKNOWN commit 78cb9f21a3ddf04f8cef9e37a94f657448d0d111 Author: Aaron Kimball Date: Fri Mar 12 14:43:51 2010 -0800 HADOOP-5745. 
Allow setting the default value of maxRunningJobs for all pools Description: The <pool> element allows setting the maxRunningJobs for that pool. It would be nice to be able to set a default value for all pools.

In our configuration, pools are auto-created: every new user gets his own pool. We would like to allow each user to be able to run a max of 5 jobs at a time. For the etl pool, this limit will be set to a greater value.

Reason: Improved configuration flexibility Author: dhruba borthakur Ref: UNKNOWN commit 3c39e1fa8c3c89fc8f11f1faff46397fa82d5116 Author: Aaron Kimball Date: Fri Mar 12 14:43:13 2010 -0800 MAPREDUCE-906. Updated Sqoop documentation. Description: Update Sqoop documentation with user guide and manpage. Reason: Documentation improvement Author: Aaron Kimball Ref: UNKNOWN commit 79a2645bc81894331721ef94c255992075ccf195 Author: Aaron Kimball Date: Fri Mar 12 14:42:14 2010 -0800 CLOUDERA-BUILD. Added MySQL Connector/J library for Sqoop. Description: We can ship MySQL Connector/J with CDH because the licenses are compatible. However, the public Apache project will not include this library in their source repository due to stricter licensing concerns. Reason: Simplifies deployment of Sqoop for mysql users Author: Aaron Kimball Ref: UNKNOWN commit 4a097b35bf1264a0606f2ebe410c45f16f900f03 Author: Aaron Kimball Date: Fri Mar 12 14:42:05 2010 -0800 MAPREDUCE-705. User-configurable quote and delimiter characters for Sqoop records and record reparsing Description: Sqoop needs a mechanism for users to govern how fields are quoted and what delimiter characters separate fields and records. With delimiters providing an unambiguous format, a parse method can reconstitute the generated record data object from a text-based representation of the same record. Reason: New feature Author: Aaron Kimball Ref: UNKNOWN commit 58e23056af0e99ef611ac258719207cc9459a849 Author: Aaron Kimball Date: Fri Mar 12 14:41:47 2010 -0800 MAPREDUCE-710. Sqoop should read and transmit passwords in a more secure manner Description: Sqoop's current support for passwords involves reading passwords from the command line "--password foo", which makes the password visible to other users via 'ps'. An invisible-console approach should be taken.

Related, Sqoop transmits passwords to mysqldump in the same fashion, which is also insecure.

Reason: Security improvement Author: Aaron Kimball Ref: UNKNOWN commit a67a0f77729fb9005b0c47872d6ba677f6434b41 Author: Aaron Kimball Date: Fri Mar 12 14:41:34 2010 -0800 MAPREDUCE-713. Sqoop has some superfluous imports Description: Some classes have vestigial imports that should be removed Reason: Code cleanup Author: Aaron Kimball Ref: UNKNOWN commit 0a4dab2eac0ba8b6da5190bc53a9ce8e4344a336 Author: Aaron Kimball Date: Fri Mar 12 14:41:01 2010 -0800 MAPREDUCE-685. Sqoop will fail with OutOfMemory on large tables using mysql Description: The default MySQL JDBC client behavior is to buffer the entire ResultSet in the client before allowing the user to use the ResultSet object. On large SELECTs, this can cause OutOfMemory exceptions, even when the client intends to close the ResultSet after reading only a few rows. The MySQL ConnManager should configure its connection to use row-at-a-time delivery of results to the client. Reason: bugfix / scalability improvement Author: Aaron Kimball Ref: UNKNOWN commit 499aa76b500136a0e8996898f468b088ca5d7ed3 Author: Aaron Kimball Date: Fri Mar 12 14:40:50 2010 -0800 MAPREDUCE-674. Sqoop should allow a "where" clause to avoid having to export entire tables Description: Sqoop currently only exports at the granularity of a table. This doesn't work well on systems with large tables, where the overhead of performing a full dump each time is significant. Allowing the user to specify a where clause is a relatively simple task which will give Sqoop a lot more flexibility. Reason: New feature Author: Kevin Weil Ref: UNKNOWN commit ed4ba254d7708f363f5f1b4708e9e35061ad936c Author: Aaron Kimball Date: Fri Mar 12 14:40:37 2010 -0800 MAPREDUCE-675. Sqoop should allow user-defined class and package names Description: Currently Sqoop generates a class for each table to be imported; the class names are equal to the table names and they are not part of any package.

This adds --class-name and --package-name parameters to Sqoop, allowing these aspects of code generation to be controlled.

Reason: New feature Author: Aaron Kimball Ref: UNKNOWN commit 16e0ca8119b99b244c9eeafd78bb9eb43e4ba639 Author: Aaron Kimball Date: Fri Mar 12 14:40:20 2010 -0800 MAPREDUCE-703. Sqoop requires dependency on hsqldb in ivy Description: Sqoop builds crash without explicit dependency on hsqldb. Reason: build system bugfix Author: Aaron Kimball Ref: UNKNOWN commit b8e54791e990328db983f070e9a04952301eda35 Author: Aaron Kimball Date: Fri Mar 12 14:40:04 2010 -0800 MAPREDUCE-692. Make Hudson run Sqoop unit tests Description: Running 'ant test-contrib' didn't test Sqoop because it wasn't explicitly listed in the build.xml file in src/contrib/ Reason: Test coverage Author: Aaron Kimball Ref: UNKNOWN commit 8a3b6472ae00542dadf7f7d60991ec0f21b38177 Author: Aaron Kimball Date: Fri Mar 12 14:39:40 2010 -0800 HADOOP-5968. Sqoop should only print a warning about mysql import speed once Description: After HADOOP-5844, Sqoop can use mysqldump as an alternative to JDBC for importing from MySQL. If you use the JDBC mechanism, it prints a warning if you could have enabled the mysqldump path instead. But the warning is printed multiple times (every time the LocalMySQLManager is instantiated), and also when the MySQL manager is used for informational queries (e.g., listing tables) rather than true imports.

It should only emit the warning once per session, and only then if it's actually doing an import.

Reason: User experience improvement Author: Aaron Kimball Ref: UNKNOWN commit 86211e3714dc5b1dbcb7a3c328336277f6657de7 Author: Aaron Kimball Date: Fri Mar 12 14:38:44 2010 -0800 HADOOP-5967. Sqoop should only use a single map task Description: The current DBInputFormat implementation uses SELECT ... LIMIT ... OFFSET statements to read from a database table. This actually results in several queries all accessing the same table at the same time. Most database implementations will actually use a full table scan for each such query, starting at row 1 and scanning down until the OFFSET is reached before emitting data to the client. The upshot of this is that we see O(n^2) performance in the size of the table when using a large number of mappers, when a single mapper would read through the table in O(n) time in the number of rows.

This patch sets the number of map tasks to 1 in the MapReduce job sqoop launches.
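The cost argument can be made concrete with a small calculation. The sketch below assumes, as the description states, that the database scans from row 1 up to the end of each requested window for every LIMIT/OFFSET query; the class and method names are hypothetical, and the row count is assumed to divide evenly among the splits.

```java
// Hypothetical sketch: total rows the database touches under LIMIT/OFFSET splitting.
class OffsetScanCost {
    /** Rows scanned when `rows` rows are read by `splits` LIMIT/OFFSET queries. */
    static long rowsScanned(long rows, int splits) {
        long perSplit = rows / splits; // assumes rows is divisible by splits
        long total = 0;
        for (int i = 0; i < splits; i++) {
            long offset = (long) i * perSplit;
            total += offset + perSplit; // skip `offset` rows, then read the window
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(rowsScanned(1000000L, 1));  // single mapper
        System.out.println(rowsScanned(1000000L, 10)); // ten mappers
    }
}
```

Under these assumptions a single mapper touches each row once, while ten mappers together touch 5.5 times as many rows, and the ratio keeps growing with the number of splits.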

Reason: Performance improvement Author: Aaron Kimball Ref: UNKNOWN commit 410db7130a8e85ceed46850f73e74f480d45994e Author: Aaron Kimball Date: Thu Jul 23 16:10:21 2009 -0700 HADOOP-5967: Sqoop should only use a single map task commit b8f5d1d3a30a7461936f3f92bd9f007ed2db43e8 Author: Aaron Kimball Date: Fri Mar 12 14:38:23 2010 -0800 HADOOP-5887. Sqoop should create tables in Hive metastore after importing to HDFS Description: Sqoop (HADOOP-5815) imports tables into HDFS; it is a straightforward enhancement to then generate a Hive DDL statement to recreate the table definition in the Hive metastore and move the imported table into the Hive warehouse directory from its upload target.

This feature enhancement makes this process automatic. An import is performed with sqoop in the usual way; providing the argument "--hive-import" will cause it to then issue a CREATE TABLE .. LOAD DATA INTO statement to a Hive shell. It generates a script file and then attempts to run "$HIVE_HOME/bin/hive" on it, or failing that, any "hive" on the $PATH; $HIVE_HOME can be overridden with --hive-home. As a result, no direct linking against Hive is necessary.

The unit tests provided with this enhancement use a mock implementation of 'bin/hive' that compares the script it's fed with one from a directory full of "expected" scripts. The exact script file referenced is controlled via an environment variable. It doesn't actually load into a proper Hive metastore, but manual testing has shown that this process works in practice, so the mock implementation is a reasonable unit testing tool.

Reason: New feature Author: Aaron Kimball Ref: UNKNOWN commit 50993494fdc7b2284837562b500e2840106bb3bb Author: Aaron Kimball Date: Fri Mar 12 14:37:48 2010 -0800 CLOUDERA-BUILD. Address issue where docs were not properly copied through to release tarball Description: This was caused by some cleanup in build.xml early on in the CDH 0.20 branch Reason: bugfix Author: Todd Lipcon Ref: UNKNOWN commit 3ecb9c07279302d18f7367d49bcd98c4391cbb68 Author: Aaron Kimball Date: Fri Mar 12 14:37:27 2010 -0800 CLOUDERA-BUILD. Decrease build time by only rebuilding the native code for each platform Reason: build system improvement Author: Todd Lipcon Ref: UNKNOWN commit f0c6a810ba7237ec7cc570ecad8a8665768b3d06 Author: Aaron Kimball Date: Fri Mar 12 14:37:07 2010 -0800 CLOUDERA-BUILD. Run jdiff against vanilla Hadoop during Cloudera release build Author: Todd Lipcon Ref: UNKNOWN commit 9cf8f0cb6ed744439d8e90e3ba376edb5d9521f3 Author: Aaron Kimball Date: Fri Mar 12 14:36:22 2010 -0800 MAPREDUCE-415. JobControl Job does always has an unassigned name Description: When creating and adding org.apache.hadoop.mapred.jobcontrol.Job(s) they don't use the names specified in their respective JobConf files. Instead it's just hardcoded to "unassigned". Reason: bugfix Author: Xavier Stevens Ref: UNKNOWN commit 330f009bae260ac990426a988fc56913897a50ca Author: Aaron Kimball Date: Fri Mar 12 14:35:03 2010 -0800 HADOOP-5805. problem using top level s3 buckets as input/output directories Description: When I specify top level s3 buckets as input or output directories, I get the following exception.

hadoop jar subject-map-reduce.jar s3n://infocloud-input s3n://infocloud-output

java.lang.IllegalArgumentException: Path must be absolute: s3n://infocloud-output
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.pathToKey(NativeS3FileSystem.java:246)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:319)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:109)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:738)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1026)
at com.evri.infocloud.prototype.subjectmapreduce.SubjectMRDriver.run(SubjectMRDriver.java:63)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.evri.infocloud.prototype.subjectmapreduce.SubjectMRDriver.main(SubjectMRDriver.java:25)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

The workaround is to specify input/output buckets with sub-directories:

hadoop jar subject-map-reduce.jar s3n://infocloud-input/input-subdir s3n://infocloud-output/output-subdir

Reason: bugfix Author: Ian Nowland Ref: UNKNOWN commit 35fa82b5c743e34d62449e0f4abffd885e0dfe4c Author: Aaron Kimball Date: Fri Mar 12 14:34:42 2010 -0800 HADOOP-5656. Counter for S3N Read Bytes does not work Description: Counter for S3N Read Bytes does not work on trunk. On 0.18 branch neither read nor write byte counters work. Reason: Bugfix Author: Ian Nowland Ref: UNKNOWN commit a6670de0a1c4b03c293ae47d1595e8c33764aaa5 Author: Aaron Kimball Date: Fri Mar 12 14:33:43 2010 -0800 HADOOP-5613. change S3Exception to checked exception Description: Currently the S3 filesystems can throw unchecked exceptions (S3Exception) which are not declared in the interface of FileSystem. These aren't caught by the various callers and can cause unpredictable behavior. IOExceptions are caught by most users of FileSystem since it is declared in the interface and hence is handled better. S3Exception now extends IOException. Reason: Improved error-checking at compile time for user applications. Author: Andrew Hitchcock Ref: UNKNOWN commit 1f11b63a42ae441eb8d0693ed0e4e01aca553e42 Author: Aaron Kimball Date: Fri Mar 12 14:33:09 2010 -0800 HADOOP-5528. Binary partitioner Description: It would be useful to have a BinaryPartitioner that partitions BinaryComparable keys by hashing a configurable part of the bytes array corresponding to each key. Reason: New feature Author: Klaas Bosteels Ref: UNKNOWN commit 716d3598e5a4a18cdfcfcf0dc800e263ef7c7685 Author: Aaron Kimball Date: Fri Mar 12 14:32:47 2010 -0800 HADOOP-5240. 'ant javadoc' does not check whether outputs are up to date and always rebuilds Description: Running 'ant javadoc' twice in a row calls the javadoc program both times; it doesn't check to see whether this is redundant work. Reason: Build system improvement Author: Aaron Kimball Ref: UNKNOWN commit 2bb607d29d9080a7ca3bce72739ccef654d5392d Author: Aaron Kimball Date: Fri Mar 12 14:30:46 2010 -0800 HADOOP-5175. 
Option to prohibit jars unpacking Description: The task tracker moves all unpacked jars into ${hadoop.tmp.dir}/mapred/local/taskTracker. When using a lot of external libraries via -libjars, this results in several thousand unpacked files. The amount of time needed to `du` these directories can increase to the point where tasks time out before starting. This patch provides an option to suppress jar unpacking. Reason: Scalability improvement Author: Todd Lipcon Ref: UNKNOWN commit 349281bfa0243f5adbbd459266f4a9ac7ac8c1cc Author: Aaron Kimball Date: Fri Mar 12 14:30:16 2010 -0800 CLOUDERA-BUILD. Fix scribe-log4j's ivy.xml to properly get log4j on the compile classpath Author: Todd Lipcon Reason: bugfix to build system Ref: UNKNOWN commit b07aec5129e618bfeda8ba753fb5138e612b1a8b Author: Aaron Kimball Date: Fri Mar 12 14:29:33 2010 -0800 HADOOP-4829. Allow FileSystem shutdown hook to be disabled Description: FileSystem sets a JVM shutdown hook so that it can clean up the FileSystem cache. This is great behavior when you are writing a client application, but when you're writing a server application, like the Collector or an HBase RegionServer, you need to control the shutdown of the application and HDFS much more closely. If you set your own shutdown hook, there's no guarantee that your hook will run before the HDFS one, preventing you from taking some shutdown actions. Reason: Integration improvement. Author: Todd Lipcon Ref: UNKNOWN commit 154c6a6474b02e68c3418fddf9a8ee5d476a8b7d Author: Aaron Kimball Date: Fri Mar 12 14:28:14 2010 -0800 HADOOP-3327. Shuffling fetchers waited too long between map output fetch re-tries Description: Improves handling of READ_TIMEOUT during map output copying. Author: Amareshwari Sriramadasu Reason: bugfix Ref: UNKNOWN commit 8a6293fc5c3733035dde8e4d3a68c414a1f800f8 Author: Devaraj Das Date: Thu Feb 5 05:35:09 2009 +0000 HADOOP-3327. Improves handling of READ_TIMEOUT during map output copying. Contributed by Amareshwari Sriramadasu. 
git-svn-id: https://svn.apache.org/repos/asf/hadoop/core/trunk@741009 13f79535-47bb-0310-9956-ffa450edef68 commit 4ee0ecf4760d7adb3e1a81e018a3b5cd6d2e9775 Author: Aaron Kimball Date: Fri Mar 12 14:27:44 2010 -0800 MAPREDUCE-680. Reuse of Writable objects is improperly handled by MRUnit Description: As written, MRUnit's MockOutputCollector simply stores references to the objects passed in to its collect() method. Thus if the same Text (or other Writable) object is reused as an output container multiple times with different values, these separate values will not all be collected. MockOutputCollector needs to properly use io.serializations to deep copy the objects sent in. Reason: Bugfix; see description. Author: Aaron Kimball Ref: UNKNOWN commit 51bdfdcf947bc8447aa36d68ae802f154516b0b6 Author: Aaron Kimball Date: Wed Jul 15 10:40:47 2009 -0700 MAPREDUCE-680. Reuse of Writable objects is improperly handled by MRUnit. commit c2026460d4cf7049c67da65d3a2db2e9bcd9c848 Author: Aaron Kimball Date: Fri Mar 12 14:27:14 2010 -0800 HADOOP-5518. MRUnit unit test library Description: MRUnit is a tool to help authors of MapReduce programs write unit tests. Testing map() and reduce() methods requires some repeated work to mock the inputs and outputs of a Mapper or Reducer class, and ensure that the correct values are emitted to the OutputCollector based on inputs. Also, testing a mapper and reducer together requires running them with the sorted ordering guarantees made by the shuffle process. This library provides the above functionality to authors of maps and reduces; it allows you to test maps, reduces, and map-reduce pairs without needing to perform all the setup and teardown work associated with running a job. Reason: New feature Author: Aaron Kimball Ref: UNKNOWN commit 6991a0eb635953bf3729bce330c426ed7d8b996a Author: Aaron Kimball Date: Fri Mar 12 14:26:29 2010 -0800 CLOUDERA-BUILD.
Add sqoop wrapper to bin Description: Adds a '/usr/bin/sqoop' wrapper script for users Reason: User-experience improvement Author: Aaron Kimball Ref: UNKNOWN commit c365162d7db1ee70c8607ad84a11e4aa594224e7 Author: Aaron Kimball Date: Fri Mar 12 14:25:56 2010 -0800 HADOOP-5844. Use mysqldump when connecting to local mysql instance in Sqoop Description: Sqoop uses MapReduce + DBInputFormat to read the contents of a table into HDFS. On many databases, this implementation is O(N^2) in the number of rows. Also, the use of multiple mappers has low value in terms of throughput, because the database itself is inherently single-threaded. While DBInputFormat/JDBC provides a useful fallback mechanism for importing from databases, db-specific dump utilities will nearly always provide faster throughput, and should be selected when available. This patch allows users to use mysqldump to read from local mysql instances instead of the MapReduce-based input. Reason: Performance improvement Author: Aaron Kimball Ref: UNKNOWN commit eddbfbca420bfb81a3a565e4324f6189bfd97e41 Author: Aaron Kimball Date: Fri Mar 12 14:24:58 2010 -0800 HADOOP-5815. Sqoop: A database import tool for Hadoop Description: Sqoop is a tool designed to help users import existing relational databases into their Hadoop clusters. Sqoop uses JDBC to connect to a database, examine the schema for tables, and auto-generate the necessary classes to import data into HDFS. It then instantiates a MapReduce job to read the table from the database via the DBInputFormat (JDBC-based InputFormat). The table is read into a set of files loaded into HDFS. Both SequenceFile and text-based targets are supported. Reason: New feature Author: Aaron Kimball Ref: UNKNOWN commit b33265ff77c71af61899a4b3add1e82cc195fdb7 Author: Aaron Kimball Date: Fri Mar 12 14:23:53 2010 -0800 MAPREDUCE-714.
JobConf.findContainingJar unescapes unnecessarily on Linux Description: In JobConf.findContainingJar, the path name is decoded using URLDecoder.decode(...). This was done by Doug in r381794 (commit msg "Un-escape containing jar's path, which is URL-encoded. This fixes things primarily on Windows, where paths are likely to contain spaces.") Unfortunately, jar paths do not appear to be URL encoded on Linux. If you try to use "hadoop jar" on a jar with a "+" in it, this function decodes it to a space and then the job cannot be submitted. Reason: Cloudera-based packages include a '+' in the filename; Hadoop's URL escaper will not properly handle jar filenames with a '+' without this patch. Author: Todd Lipcon Ref: UNKNOWN commit d9767d2cefab288e581732f71779f3ce8e3267e4 Author: Todd Lipcon Date: Mon Jul 6 19:36:11 2009 -0700 MAPREDUCE-714: Fix JobConf.findContainingJars to work with jars with + in the name commit aaeb69f8dda72a2e7aecacd622e99c00bc961efa Author: Aaron Kimball Date: Fri Mar 12 14:23:23 2010 -0800 CLOUDERA-BUILD. Add dependency libraries for Scribe/log4j Author: Todd Lipcon commit cb7a3677942c1d2f9e0d2a75dbffa09fa6125e61 Author: Aaron Kimball Date: Fri Mar 12 14:22:41 2010 -0800 CLOUDERA-BUILD. Apply Scribe patches to Hadoop Description: scribe_hadoop_trunk.patch Also, add empty ivy infrastructure for scribe-log4j Author: Todd Lipcon commit d5ead434b221076fb830308d2d112d53aa6dc59f Author: Aaron Kimball Date: Fri Mar 12 14:22:26 2010 -0800 CLOUDERA-BUILD. Use cloudera's versioning info from cloudera.hash in saveVersion.sh Description: This should make the "hadoop version" output far more useful for determining exactly what code is running. The cloudera.hash property is set by cloudera/build.properties which is generated during the build process. commit bf10e46e425395145dcc4b85db66d45cbf9797b0 Author: Aaron Kimball Date: Fri Mar 12 14:21:45 2010 -0800 CLOUDERA-BUILD.
Move saveVersion.sh in build.xml to ensure build Description: This error is due to ant 1.7.1 not compiling package-info.java if the timestamp of the output class directory is newer than the package-info file itself. Since other compiles were happening after package-info.java was generated, the build dir was newer and compilation was being skipped. Move cloudera hooks inside the package task of build.xml. Fixes an issue where the fair scheduler jar was not built before the hooks were run, and therefore was not included in the target lib/ directory. Ref: CLOUDERA-436 commit 5359a3bbd2b09644825be99fdd354ff3276a5d59 Author: Aaron Kimball Date: Fri Mar 12 14:21:36 2010 -0800 CLOUDERA-BUILD. New versions of cloudera packaging scripts commit ee255f3909b9938b1023be6a2c59a8429227c766 Author: Aaron Kimball Date: Fri Mar 12 14:21:27 2010 -0800 CLOUDERA-BUILD. Change paths to point to hadoop-0.20 where necessary commit a2d051bcf456fde45c0a0c3aa512872ce6059a97 Author: Aaron Kimball Date: Fri Mar 12 14:21:08 2010 -0800 CLOUDERA-BUILD. Add Hadoop manpage to Hadoop 0.20 repository commit 9600765ec5d6c3cef9ab34ecb573cbb876acf7ee Author: Aaron Kimball Date: Fri Mar 12 14:21:01 2010 -0800 CLOUDERA-BUILD. Move install_hadoop.sh into hadoop repo commit 77ac6923ad6e63874a429e7dd13c4a084b6a9556 Author: Aaron Kimball Date: Fri Mar 12 14:20:52 2010 -0800 CLOUDERA-BUILD. Add example-confs directory for storing configuration of conf.pseudo commit 14256386d4cb155fea0f5745dd6c49fba74ff40f Author: Aaron Kimball Date: Fri Mar 12 14:20:43 2010 -0800 CLOUDERA-BUILD. Replace hadoop-config.sh with Cloudera version commit f7d0a20e0d74f1aac1fb96f3c08ce31e9b9ca5d9 Author: Aaron Kimball Date: Fri Mar 12 14:20:25 2010 -0800 CLOUDERA-BUILD. Remove redundant code in build.xml between package and bin-package commit 0fa65091ecd9dd150d6afb93845d3fb10d80e115 Author: Aaron Kimball Date: Fri Mar 12 14:16:59 2010 -0800 CLOUDERA-BUILD. Hook build.xml to enable contrib modules