CDH 3 Release Notes
The following lists all Apache Hadoop JIRA issues included in CDH 3
that are not included in the Apache Hadoop base version 0.20.2. The
hadoop-0.20.2+320.CHANGES.txt file lists all changes included in CDH 3.
The patch for each change can be found in the cloudera/patches directory
in the release tarball.
Changes Not In Hadoop 0.20.2
Common
Bug
- [HADOOP-5203] - TT's version build is too restrictive
- [HADOOP-6762] - exception while doing RPC I/O closes channel
- [HADOOP-6722] - NetUtils.connect should check that it hasn't connected a socket to itself
- [HADOOP-6724] - IPC doesn't properly handle IOEs thrown by socket factory
- [HADOOP-6723] - unchecked exceptions thrown in IPC Connection orphan clients
- [HADOOP-6254] - s3n fails with SocketTimeoutException
- [HADOOP-6522] - TestUTF8 fails
- [HADOOP-6643] - Set executable bit for python cloud scripts in the distribution
- [HADOOP-2366] - Space in the value for dfs.data.dir can cause great problems
- [HADOOP-6453] - Hadoop wrapper script shouldn't ignore an existing JAVA_LIBRARY_PATH
- [HADOOP-6460] - Namenode runs out of memory due to memory leak in ipc Server
- [HADOOP-6505] - sed in build.xml fails
- [HADOOP-6503] - contrib projects should pull in the ivy-fetched libs from the root project
- [HADOOP-5647] - TestJobHistory fails if /tmp/_logs is not writable. Testcase should not depend on /tmp
- [HADOOP-6462] - contrib/cloud failing, target "compile" does not exist
- [HADOOP-6184] - Provide a configuration dump in json format.
- [HADOOP-6269] - Missing synchronization for defaultResources in Configuration.addResource
- [HADOOP-5891] - If dfs.http.address is default, SecondaryNameNode can't find NameNode
- [HADOOP-4655] - FileSystem.CACHE should be ref-counted
- [HADOOP-5981] - HADOOP-2838 doesn't work as expected
- [HADOOP-5738] - Split waiting tasks field in JobTracker metrics to individual tasks
- [HADOOP-5442] - The job history display needs to be paged
- [HADOOP-5650] - Namenode log that indicates why it is not leaving safemode may be confusing
- [HADOOP-5805] - problem using top level s3 buckets as input/output directories
- [HADOOP-5656] - Counter for S3N Read Bytes does not work
- [HADOOP-3327] - Shuffling fetchers waited too long between map output fetch re-tries
Improvement
- [HADOOP-6714] - FsShell 'hadoop fs -text' does not support compression codecs
- [HADOOP-1849] - IPC server max queue size should be configurable
- [HADOOP-3659] - Patch to allow hadoop native to compile on Mac OS X
- [HADOOP-4885] - Try to restore failed replicas of Name Node storage (at checkpoint time)
- [HADOOP-6667] - RPC.waitForProxy should retry through NoRouteToHostException
- [HADOOP-5687] - Hadoop NameNode throws NPE if fs.default.name is the default value
- [HADOOP-6454] - Create setup.py for EC2 cloud scripts
- [HADOOP-6444] - Support additional security group option in hadoop-ec2 script
- [HADOOP-6426] - Create ant build for running EC2 unit tests
- [HADOOP-5625] - Add I/O duration time in client trace
- [HADOOP-5222] - Add offset in client trace
- [HADOOP-6400] - Log errors getting Unix UGI
- [HADOOP-5640] - Allow ServicePlugins to hook callbacks into key service events
- [HADOOP-6312] - Configuration sends too much data to log4j
- [HADOOP-6279] - Add JVM memory usage to JvmMetrics
- [HADOOP-6133] - ReflectionUtils performance regression
- [HADOOP-2838] - Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni
- [HADOOP-5733] - Add map/reduce slot capacity and lost map/reduce slot capacity to JobTracker metrics
- [HADOOP-4842] - Streaming combiner should allow command, not just JavaClass
- [HADOOP-6267] - build-contrib.xml unnecessarily enforces that contrib projects be located in contrib/ dir
- [HADOOP-4936] - Improvements to TestSafeMode
- [HADOOP-4675] - Current Ganglia metrics implementation is incompatible with Ganglia 3.1
- [HADOOP-5450] - Add support for application-specific typecodes to typed bytes
- [HADOOP-1722] - Make streaming handle non-utf8 byte arrays
- [HADOOP-6166] - Improve PureJavaCrc32
- [HADOOP-6148] - Implement a pure Java CRC32 calculator
- [HADOOP-5968] - Sqoop should only print a warning about mysql import speed once
- [HADOOP-5967] - Sqoop should only use a single map task
- [HADOOP-5613] - change S3Exception to checked exception
- [HADOOP-5240] - 'ant javadoc' does not check whether outputs are up to date and always rebuilds
New Feature
- [HADOOP-6433] - Add AsyncDiskService that is used in both hdfs and mapreduce
- [HADOOP-6382] - publish hadoop jars to apache mvn repo.
- [HADOOP-4012] - Providing splitting support for bzip2 compressed files
- [HADOOP-4368] - Superuser privileges required to do "df"
- [HADOOP-6466] - Add a ZooKeeper service to the cloud scripts
- [HADOOP-6392] - Run namenode and jobtracker on separate EC2 instances
- [HADOOP-6108] - Add support for EBS storage on EC2
- [HADOOP-5257] - Export namenode/datanode functionality through a pluggable RPC layer
- [HADOOP-5170] - Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide
- [HADOOP-5469] - Exposing Hadoop metrics via HTTP
- [HADOOP-5745] - Allow setting the default value of maxRunningJobs for all pools
- [HADOOP-5887] - Sqoop should create tables in Hive metastore after importing to HDFS
- [HADOOP-5528] - Binary partitioner
- [HADOOP-5175] - Option to prohibit jars unpacking
- [HADOOP-4829] - Allow FileSystem shutdown hook to be disabled
- [HADOOP-5518] - MRUnit unit test library
- [HADOOP-5844] - Use mysqldump when connecting to local mysql instance in Sqoop
- [HADOOP-5815] - Sqoop: A database import tool for Hadoop
HDFS
Bug
- [HDFS-1260] - 0.20: Block lost when multiple DNs trying to recover it to different genstamps
- [HDFS-1254] - 0.20: mark dfs.support.append to be true by default for the 0.20-append branch
- [HDFS-1240] - TestDFSShell failing in branch-20
- [HDFS-1207] - 0.20-append: stallReplicationWork should be volatile
- [HDFS-1197] - Blocks are considered "complete" prematurely after commitBlockSynchronization or DN restart
- [HDFS-1118] - DFSOutputStream socket leak when cannot connect to DataNode
- [HDFS-1186] - 0.20: DNs should interrupt writers at start of recovery
- [HDFS-915] - Hung DN stalls write pipeline for far longer than its timeout
- [HDFS-1218] - 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization
- [HDFS-445] - pread() fails when cached block locations are no longer valid
- [HDFS-1204] - 0.20: Lease expiration should recover single files, not entire lease holder
- [HDFS-1202] - DataBlockScanner throws NPE when updated before initialized
- [HDFS-606] - ConcurrentModificationException in invalidateCorruptReplicas()
- [HDFS-1141] - completeFile does not check lease ownership
- [HDFS-1215] - TestNodeCount infinite loops on branch-20-append
- [HDFS-1122] - client block verification may result in blocks in DataBlockScanner prematurely
- [HDFS-1057] - Concurrent readers hit ChecksumExceptions if following a writer to very end of file
- [HDFS-1203] - DataNode should sleep before reentering service loop after an exception
- [HDFS-561] - Fix write pipeline READ_TIMEOUT
- [HDFS-611] - Heartbeat times from Datanodes increase when there are plenty of blocks to delete
- [HDFS-894] - DatanodeID.ipcPort is not updated when existing node re-registers
- [HDFS-142] - In 0.20, move blocks being written into a blocksBeingWritten directory
- [HDFS-988] - saveNamespace can corrupt edits log
- [HDFS-101] - DFS write pipeline : DFSClient sometimes does not detect second datanode failure
- [HDFS-909] - Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
- [HDFS-612] - FSDataset should not use org.mortbay.log.Log
- [HDFS-1024] - SecondaryNamenode fails to checkpoint because namenode fails with CancelledKeyException
- [HDFS-961] - dfs_readdir incorrectly parses paths
- [HDFS-908] - TestDistributedFileSystem fails with Wrong FS on weird hosts
- [HDFS-877] - Client-driven block verification not functioning
- [HDFS-464] - Memory leaks in libhdfs
- [HDFS-861] - fuse-dfs does not support O_RDWR
- [HDFS-860] - fuse-dfs truncate behavior causes issues with scp
- [HDFS-859] - fuse-dfs utime behavior causes issues with tar
- [HDFS-858] - Incorrect return codes for fuse-dfs
- [HDFS-857] - Incorrect type for fuse-dfs capacity can cause "df" to return negative values on 32-bit machines
- [HDFS-856] - Hardcoded replication level for new files in fuse-dfs
- [HDFS-423] - Unbreak FUSE build and fuse_dfs_wrapper.sh
- [HDFS-727] - bug setting block size hdfsOpenFile
- [HDFS-686] - NullPointerException is thrown while merging edit log and image
- [HDFS-127] - DFSClient block read failures cause open DFSInputStream to become unusable
Improvement
- [HDFS-1209] - Add conf dfs.client.block.recovery.retries to configure number of block recovery attempts
- [HDFS-1210] - DFSClient should log exception when block recovery fails
- [HDFS-1205] - FSDatasetAsyncDiskService should name its threads
- [HDFS-1248] - Misc cleanup/logging improvements for branch-20-append
- [HDFS-895] - Allow hflush/sync to occur in parallel with new writes to the file
- [HDFS-1211] - 0.20 append: Block receiver should not log "rewind" packets at INFO level
- [HDFS-1056] - Multi-node RPC deadlocks during block recovery
- [HDFS-1055] - Improve thread naming for DataXceivers
- [HDFS-1054] - Remove unnecessary sleep after failure in nextBlockOutputStream
- [HDFS-826] - Allow a mechanism for an application to detect that datanode(s) have died in the write pipeline
- [HDFS-1161] - Make DN minimum valid volumes configurable
- [HDFS-1160] - Improve some FSDataset warnings and comments
- [HDFS-457] - better handling of volume failure in Data Node storage
- [HDFS-1013] - Miscellaneous improvements to HTML markup for web UIs
- [HDFS-455] - Make NN and DN handle comma-separated configuration strings in an intuitive way
- [HDFS-412] - Hadoop JMX usage makes Nagios monitoring impossible
- [HDFS-630] - In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
- [HDFS-496] - Use PureJavaCrc32 in HDFS
New Feature
- [HDFS-200] - In HDFS, sync() does not yet guarantee that data is available to new readers
- [HDFS-528] - Add ability for safemode to wait for a minimum number of live datanodes
Task
- [HDFS-1266] - Missing license headers in branch-20-append
Test
- [HDFS-1252] - TestDFSConcurrentFileOperations broken in 0.20-append
- [HDFS-1247] - Improvements to HDFS-1204 test
- [HDFS-1246] - Manual tool to test sync against a real cluster
- [HDFS-1243] - 0.20 append: Replication tests in TestFileAppend4 should not expect immediate replication
- [HDFS-1242] - 0.20 append: Add test for appendFile() race solved in HDFS-142
- [HDFS-1244] - Misc improvements to TestFileAppend2
- [HDFS-696] - Java assertion failures triggered by tests
MapReduce
Bug
- [MAPREDUCE-1887] - MRAsyncDiskService does not properly absolutize volume root paths
- [MAPREDUCE-1372] - ConcurrentModificationException in JobInProgress
- [MAPREDUCE-1378] - Args in job details links on jobhistory.jsp are not URL encoded
- [MAPREDUCE-1213] - TaskTrackers restart is very slow because it deletes distributed cache directory synchronously
- [MAPREDUCE-1443] - DBInputFormat can leak connections
- [MAPREDUCE-1728] - Oracle timezone strings do not match Java
- [MAPREDUCE-1375] - TestFileArgs fails intermittently
- [MAPREDUCE-1536] - DataDrivenDBInputFormat does not split date columns correctly.
- [MAPREDUCE-1480] - CombineFileRecordReader does not properly initialize child RecordReader
- [MAPREDUCE-1436] - Deadlock in preemption code in fair scheduler
- [MAPREDUCE-1469] - Sqoop should disable speculative execution in export
- [MAPREDUCE-1395] - Sqoop does not check return value of Job.waitForCompletion()
- [MAPREDUCE-1327] - Oracle database import via sqoop fails when a table contains the column types such as TIMESTAMP(6) WITH LOCAL TIME ZONE and TIMESTAMP(6) WITH TIME ZONE
- [MAPREDUCE-1394] - Sqoop generates incorrect URIs in paths sent to Hive
- [MAPREDUCE-1313] - NPE in FieldFormatter if escape character is set and field is null
- [MAPREDUCE-1155] - Streaming tests swallow exceptions
- [MAPREDUCE-1258] - Fair scheduler event log not logging job info
- [MAPREDUCE-1212] - Mapreduce contrib project ivy dependencies are not included in binary target
- [MAPREDUCE-1310] - CREATE TABLE statements for Hive do not correctly specify delimiters
- [MAPREDUCE-1235] - java.io.IOException: Cannot convert value '0000-00-00 00:00:00' from column 6 to TIMESTAMP.
- [MAPREDUCE-1174] - Sqoop improperly handles table/column names which are reserved sql words
- [MAPREDUCE-1146] - Sqoop dependencies break Eclipse build on Linux
- [MAPREDUCE-1148] - SQL identifiers are a superset of Java identifiers
- [MAPREDUCE-1285] - DistCp cannot handle -delete if destination is local filesystem
- [MAPREDUCE-764] - TypedBytesInput's readRaw() does not preserve custom type codes
- [MAPREDUCE-1293] - AutoInputFormat doesn't work with non-default FileSystems
- [MAPREDUCE-1131] - Using profilers other than hprof can cause JobClient to report job failure
- [MAPREDUCE-1059] - distcp can generate uneven map task assignments
- [MAPREDUCE-1128] - MRUnit Allows Iteration Twice
- [MAPREDUCE-112] - Reduce Input Records and Reduce Output Records counters are not being set when using the new Mapreduce reducer API
- [MAPREDUCE-1089] - Fair Scheduler preemption triggers NPE when tasks are scheduled but not running
- [MAPREDUCE-968] - NPE in distcp encountered when placing _logs directory on S3FileSystem
- [MAPREDUCE-693] - Conf files not moved to "done" subdirectory after JT restart
- [MAPREDUCE-683] - TestJobTrackerRestart fails with Map task completion events ordering mismatch
- [MAPREDUCE-416] - Move the completed jobs' history files to a DONE subdirectory inside the configured history directory
- [MAPREDUCE-971] - distcp does not always remove distcp.tmp.dir
- [MAPREDUCE-923] - Sqoop's ORM uses URLDecoder on a file, which replaces plus signs in a jar file name with spaces
- [MAPREDUCE-840] - DBInputFormat leaves open transaction
- [MAPREDUCE-825] - JobClient completion poll interval of 5s causes slow tests in local mode
- [MAPREDUCE-792] - javac warnings in DBInputFormat
- [MAPREDUCE-716] - org.apache.hadoop.mapred.lib.db.DBInputFormat not working with oracle
- [MAPREDUCE-799] - Some of MRUnit's self-tests were not being run
- [MAPREDUCE-685] - Sqoop will fail with OutOfMemory on large tables using mysql
- [MAPREDUCE-703] - Sqoop requires dependency on hsqldb in ivy
- [MAPREDUCE-415] - JobControl Job always has an unassigned name
- [MAPREDUCE-680] - Reuse of Writable objects is improperly handled by MRUnit
- [MAPREDUCE-714] - JobConf.findContainingJar unescapes unnecessarily on Linux
Improvement
- [MAPREDUCE-1570] - Shuffle stage - Key and Group Comparators
- [MAPREDUCE-739] - Allow relative paths to be created inside archives.
- [MAPREDUCE-1302] - TrackerDistributedCacheManager can delete file asynchronously
- [MAPREDUCE-1489] - DataDrivenDBInputFormat should not query the database when generating only one split
- [MAPREDUCE-1785] - Add streaming config option for not emitting the key
- [MAPREDUCE-1460] - Oracle support in DataDrivenDBInputFormat
- [MAPREDUCE-1569] - Mock Contexts & Configurations
- [MAPREDUCE-1423] - Improve performance of CombineFileInputFormat when multiple pools are configured
- [MAPREDUCE-364] - Change org.apache.hadoop.examples.MultiFileWordCount to use new mapreduce api.
- [MAPREDUCE-1467] - Add a --verbose flag to Sqoop
- [MAPREDUCE-967] - TaskTracker does not need to fully unjar job jars
- [MAPREDUCE-1356] - Allow user-specified hive table name in sqoop
- [MAPREDUCE-1198] - Alternatively schedule different types of tasks in fair share scheduler
- [MAPREDUCE-1169] - Improvements to mysqldump use in Sqoop
- [MAPREDUCE-1224] - Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables
- [MAPREDUCE-370] - Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api.
- [MAPREDUCE-999] - Improve Sqoop test speed and refactor tests
- [MAPREDUCE-814] - Move completed Job history files to HDFS
- [MAPREDUCE-906] - Updated Sqoop documentation
- [MAPREDUCE-907] - Sqoop should use more intelligent splits
- [MAPREDUCE-885] - More efficient SQL queries for DBInputFormat
- [MAPREDUCE-876] - Sqoop import of large tables can time out
- [MAPREDUCE-918] - Test hsqldb server should be memory-only.
- [MAPREDUCE-875] - Make DBRecordReader execute queries lazily
- [MAPREDUCE-750] - Extensible ConnManager factory API
- [MAPREDUCE-749] - Make Sqoop unit tests more Hudson-friendly
- [MAPREDUCE-910] - MRUnit should support counters
- [MAPREDUCE-797] - MRUnit MapReduceDriver should support combiners
- [MAPREDUCE-782] - Use PureJavaCrc32 in mapreduce spills
- [MAPREDUCE-789] - Oracle support for Sqoop
- [MAPREDUCE-816] - Rename "local" mysql import to "direct"
- [MAPREDUCE-710] - Sqoop should read and transmit passwords in a more secure manner
- [MAPREDUCE-713] - Sqoop has some superfluous imports
- [MAPREDUCE-674] - Sqoop should allow a "where" clause to avoid having to export entire tables
- [MAPREDUCE-675] - Sqoop should allow user-defined class and package names
- [MAPREDUCE-692] - Make Hudson run Sqoop unit tests
New Feature
- [MAPREDUCE-679] - XML-based metrics as JSP servlet for JobTracker
- [MAPREDUCE-1341] - Sqoop should have an option to create hive tables and skip the table import step
- [MAPREDUCE-707] - Provide a jobconf property for explicitly assigning a job to a pool
- [MAPREDUCE-698] - Per-pool task limits for the fair scheduler
- [MAPREDUCE-1168] - Export data to databases via Sqoop
- [MAPREDUCE-706] - Support for FIFO pools in the fair scheduler
- [MAPREDUCE-1017] - Compression and output splitting for Sqoop
- [MAPREDUCE-768] - Configuration information should generate dump in a standard format.
- [MAPREDUCE-551] - Add preemption to the fair scheduler
- [MAPREDUCE-987] - Exposing MiniDFS and MiniMR clusters as a single process command-line
- [MAPREDUCE-461] - Enable ServicePlugins for the JobTracker
- [MAPREDUCE-938] - Postgresql support for Sqoop
- [MAPREDUCE-798] - MRUnit should be able to test a succession of MapReduce passes
- [MAPREDUCE-800] - MRUnit should support the new API
- [MAPREDUCE-705] - User-configurable quote and delimiter characters for Sqoop records and record reparsing