CDH 5.3.10 Release Notes
The following lists all Apache Spark Jiras included in CDH 5.3.10
that are not included in the Apache Spark base version 1.2.0. The
spark-1.2.0-cdh5.3.10.CHANGES.txt
file lists all changes included in CDH 5.3.10. The patch for each
change can be found in the cloudera/patches directory in the release tarball.
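As a rough illustration of how a patch for a given Jira might be located once the release tarball has been unpacked, the sketch below scans a patches directory for files whose name or contents mention an issue ID. The function name `find_patches` and the idea that patch files mention the Jira ID in their name or text are assumptions made for this example, not something these notes specify.

```python
import re
from pathlib import Path


def find_patches(patches_dir, jira_id):
    """Return sorted paths of files under patches_dir that mention jira_id.

    Hypothetical helper: it assumes each patch mentions its Jira ID either
    in the file name or somewhere in the file text.
    """
    # Match the ID as a whole token so SPARK-461 does not match SPARK-4617.
    pattern = re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(jira_id) + r"(?!\d)",
        re.IGNORECASE,
    )
    matches = []
    for path in sorted(Path(patches_dir).rglob("*")):
        if not path.is_file():
            continue
        # Check the file name first, then fall back to the file contents.
        if pattern.search(path.name):
            matches.append(path)
            continue
        text = path.read_text(errors="replace")
        if pattern.search(text):
            matches.append(path)
    return matches
```

For example, `find_patches("cloudera/patches", "SPARK-12617")` would return any matching files, assuming the tarball was extracted into the current directory.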
Changes Not In Apache Spark 1.2.0
Spark
Bug
- [SPARK-12617] - socket descriptor leak killing streaming app
- [SPARK-11652] - Remote code execution with InvokerTransformer
- [SPARK-11484] - Giving precedence to proxyBase set by spark instead of env
- [SPARK-6880] - Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD
- [SPARK-6480] - histogram() bucket function is wrong in some simple edge cases
- [SPARK-8606] - Exceptions in RDD.getPreferredLocations() and getPartitions() should not be able to crash DAGScheduler
- [SPARK-6578] - Outbound channel in network library is not thread-safe, can lead to fetch failures
- [SPARK-3778] - newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn
- [SPARK-4835] - Streaming saveAs*HadoopFiles() methods may throw FileAlreadyExistsException during checkpoint recovery
- [SPARK-4606] - SparkSubmitDriverBootstrapper does not propagate EOF to child JVM
- [SPARK-4805] - BlockTransferMessage.toByteArray() trips assertion
- [SPARK-4785] - When called with arguments referring column fields, PMOD throws NPE
- [SPARK-4769] - CTAS does not work when reading from temporary tables
- [SPARK-4770] - spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN
- [SPARK-4774] - Make HiveFromSpark example more portable
- [SPARK-4761] - With JDBC server, set Kryo as default serializer and disable reference tracking
- [SPARK-4753] - Parquet2 does not prune based on OR filters on partition columns
- [SPARK-2624] - Datanucleus jars not accessible in yarn-cluster mode
- [SPARK-4464] - Description about configuration options need to be modified in docs.
- [SPARK-4421] - Wrong link in spark-standalone.html
- [SPARK-4459] - JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
- [SPARK-4745] - get_existing_cluster() doesn't work with additional security groups
- [SPARK-4253] - Ignore spark.driver.host in yarn-cluster and standalone-cluster mode
- [SPARK-4085] - Job will fail if a shuffle file that's read locally gets deleted
- [SPARK-4498] - Standalone Master can fail to recognize completed/failed applications
- [SPARK-4552] - query for empty parquet table in spark sql hive get IllegalArgumentException
- [SPARK-4715] - ShuffleMemoryManager.tryToAcquire may return a negative value
- [SPARK-4701] - Typo in sbt/sbt
- [SPARK-4672] - Cut off the super long serialization chain in GraphX to avoid the StackOverflow error
- [SPARK-4670] - bitwise NOT has a wrong `toString` output
- [SPARK-4593] - sum(1/0) would produce a very large number
- [SPARK-4676] - JavaSchemaRDD.schema may throw NullType MatchError if sql has null
- [SPARK-4663] - close() function is not surrounded by finally in ParquetTableOperations.scala
- [SPARK-4536] - Add sqrt and abs to Spark SQL DSL
- [SPARK-4686] - Link to "allowed master URLs" is broken in configuration documentation
- [SPARK-4658] - Code documentation issue in DDL of datasource
- [SPARK-4650] - Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL
- [SPARK-4258] - NPE with new Parquet Filters
- [SPARK-2192] - Examples Data Not in Binary Distribution
- [SPARK-4656] - Typo in Programming Guide markdown
- [SPARK-4597] - Use proper exception and reset variable in Utils.createTempDir() method
- [SPARK-4584] - 2x Performance regression for Spark-on-YARN
- [SPARK-4193] - Disable doclint in Java 8 to prevent from build error.
- [SPARK-4645] - Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver
- [SPARK-4308] - SQL operation state is not properly set when exception is thrown
- [SPARK-4619] - Double "ms" in ShuffleBlockFetcherIterator log
- [SPARK-4626] - NoSuchElementException in CoarseGrainedSchedulerBackend
- [SPARK-3628] - Don't apply accumulator updates multiple times for tasks in result stages
- [SPARK-4516] - Netty off-heap memory use causes executors to be killed by OS
- [SPARK-4471] - blockManagerIdFromJson function throws exception while BlockManagerId be null in MetadataFetchFailedException
- [SPARK-4546] - Improve HistoryServer first time user experience
- [SPARK-4592] - "Worker registration failed: Duplicate worker ID" error during Master failover
- [SPARK-4602] - saveAsNewAPIHadoopFiles by default does not use SparkContext's hadoop configuration
- [SPARK-4601] - Call site of jobs generated by streaming incorrect in Spark UI
- [SPARK-4535] - Fix the error in comments
- [SPARK-4525] - MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers
- [SPARK-4266] - Avoid expensive JavaScript for StagePages with huge numbers of tasks
- [SPARK-4578] - Row.asDict() should keep the type of values
- [SPARK-4519] - Filestream does not use hadoop configuration set within sparkContext.hadoopConfiguration
- [SPARK-4487] - Fix attribute reference resolution error when using ORDER BY.
- [SPARK-4479] - Avoid unnecessary defensive copies when Sort based shuffle is on
- [SPARK-4532] - make-distribution in Spark 1.2 does not correctly detect whether Hive is enabled
- [SPARK-4522] - Failure to read parquet schema with missing metadata.
- [SPARK-4244] - ConstantFolding has to be done before initialize the Generic UDF
- [SPARK-4318] - Fix empty sum distinct.
- [SPARK-4513] - Support relational operator '<=>' in Spark SQL
- [SPARK-4446] - MetadataCleaner schedule task with a wrong param for delay time.
- [SPARK-4480] - Avoid many small spills in external data structures
- [SPARK-4478] - totalRegisteredExecutors not updated properly
- [SPARK-4495] - Memory leak in JobProgressListener due to `spark.ui.retainedJobs` not being used
- [SPARK-4384] - Too many open files during sort in pyspark
- [SPARK-4429] - Build for Scala 2.11 using sbt fails.
- [SPARK-3962] - Mark spark dependency as "provided" in external libraries
- [SPARK-4482] - ReceivedBlockTracker's write ahead log is enabled by default
- [SPARK-4467] - Number of elements read is never reset in ExternalSorter
- [SPARK-4455] - Exclude dependency on hbase-annotations module
- [SPARK-4441] - Close Tachyon client when TachyonBlockManager is shut down
- [SPARK-4468] - Wrong Parquet filters are created for all inequality predicates with literals on the left hand side
- [SPARK-4433] - Racing condition in zipWithIndex
- [SPARK-3721] - Broadcast Variables above 2GB break in PySpark
- [SPARK-4404] - SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends
- [SPARK-4434] - spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
- [SPARK-4213] - SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators
- [SPARK-4448] - Support ConstantObjectInspector for unwrapping data
- [SPARK-4443] - Statistics bug for external table in spark sql hive
- [SPARK-4407] - Thrift server for 0.13.1 doesn't deserialize complex types properly
- [SPARK-4425] - Handle NaN or Infinity cast to Timestamp correctly
- [SPARK-4420] - Change nullability of Cast from DoubleType/FloatType to DecimalType.
- [SPARK-4180] - SparkContext constructor should throw exception if another SparkContext is already running
- [SPARK-4075] - Jar url validation is not enough for Jar file
- [SPARK-4445] - Don't display storage level in toDebugString unless RDD is persisted
- [SPARK-4422] - In some cases, Vectors.fromBreeze get wrong results.
- [SPARK-4426] - The symbol of BitwiseOr is wrong, should not be '&'
- [SPARK-4260] - Httpbroadcast should set connection timeout.
- [SPARK-4415] - Driver did not exit after python driver had exited.
- [SPARK-4412] - Parquet logger cannot be configured
- [SPARK-4322] - Struct fields can't be used as sub-expression of grouping fields
- [SPARK-4391] - Parquet Filter pushdown flag should be set with SQLConf
- [SPARK-4390] - Bad casts to decimal throw instead of returning null
- [SPARK-4333] - Correctly log number of iterations in RuleExecutor
- [SPARK-4375] - Assembly built with Maven is missing most of repl classes
- [SPARK-4245] - Fix containsNull of the result ArrayType of CreateArray expression.
- [SPARK-4313] - "Thread Dump" link is broken in yarn-cluster mode
- [SPARK-4310] - "Submitted" column in Stage page doesn't sort by time
- [SPARK-4372] - Make LR and SVM's default parameters consistent in Scala and Python
- [SPARK-4326] - unidoc is broken on master
- [SPARK-4348] - pyspark.mllib.random conflicts with random module
- [SPARK-4256] - MLLib BinaryClassificationMetricComputers try to divide by zero
- [SPARK-4370] - Limit cores used by Netty transfer service based on executor size
- [SPARK-4373] - MLlib unit tests failed maven test
- [SPARK-4369] - TreeModel.predict does not work with RDD
- [SPARK-4281] - Yarn shuffle service jars need to include dependencies
- [SPARK-4355] - OnlineSummarizer doesn't merge mean correctly
- [SPARK-3936] - Incorrect result in GraphX BytecodeUtils with closures + class/object methods
- [SPARK-2269] - Clean up and add unit tests for resourceOffers in MesosSchedulerBackend
- [SPARK-4282] - Stopping flag in YarnClientSchedulerBackend should be volatile
- [SPARK-4305] - yarn-alpha profile won't build due to network/yarn module
- [SPARK-4295] - [External]Exception throws in SparkSinkSuite although all test cases pass
- [SPARK-3649] - ClassCastException in GraphX custom serializers when sort-based shuffle spills
- [SPARK-4274] - NPE in printing the details of query plan
- [SPARK-4250] - Create constant null value for Hive Inspectors
- [SPARK-4230] - Doc for spark.default.parallelism is incorrect
- [SPARK-4312] - bash can't `die`
- [SPARK-2548] - JavaRecoverableWordCount is missing
- [SPARK-4169] - [Core] Locale dependent code
- [SPARK-1209] - SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop
- [SPARK-1344] - Scala API docs for top methods
- [SPARK-4301] - StreamingContext should not allow start() to be called after calling stop()
- [SPARK-4291] - Drop "Code" from network module names
- [SPARK-4304] - sortByKey() will fail on empty RDD
- [SPARK-4292] - incorrect result set in JDBC/ODBC
- [SPARK-4203] - Partition directories in random order when inserting into hive table
- [SPARK-4270] - Fix Cast from DateType to DecimalType.
- [SPARK-4225] - jdbc/odbc error when using maven build spark
- [SPARK-4204] - Utils.exceptionString only return the information for the outermost exception
- [SPARK-4236] - External shuffle service must cleanup its shuffle files
- [SPARK-4277] - Support external shuffle service on Worker
- [SPARK-4249] - A problem of EdgePartitionBuilder in Graphx
- [SPARK-4264] - SQL HashJoin induces "refCnt = 0" error in ShuffleBlockFetcherIterator
- [SPARK-4255] - Table striping is incorrect on page load
- [SPARK-4137] - Relative paths don't get handled correctly by spark-ec2
- [SPARK-4254] - MovieLensALS example fails from including Params in closure
- [SPARK-4158] - Spark throws exception when Mesos resources are missing
- [SPARK-3223] - runAsSparkUser cannot change HDFS write permission properly in mesos cluster mode
- [SPARK-4222] - FixedLengthBinaryRecordReader should readFully
- [SPARK-3983] - Scheduler delay (shown in the UI) is incorrect
- [SPARK-4242] - Add SASL to external shuffle service
Documentation
- [SPARK-4652] - Add docs about spark-git-repo option
- [SPARK-4711] - MLlib optimization: docs should suggest how to choose optimizer
- [SPARK-4344] - spark.yarn.user.classpath.first is undocumented
- [SPARK-4481] - Some comments for `updateStateByKey` are wrong
- [SPARK-4363] - The Broadcast example is out of date
- [SPARK-3663] - Document SPARK_LOG_DIR and SPARK_PID_DIR
- [SPARK-4040] - Update spark documentation for local mode and spark-streaming.
Improvement
- [SPARK-4048] - Enhance and extend hadoop-provided profile
- [SPARK-4740] - Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
- [SPARK-4567] - Make SparkJobInfo and SparkStageInfo serializable
- [SPARK-4765] - Add GC back to default metrics
- [SPARK-4620] - Add unpersist in Graph/GraphImpl
- [SPARK-4646] - Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark
- [SPARK-3623] - Graph should support the checkpoint operation
- [SPARK-4575] - Documentation for the pipeline features
- [SPARK-4610] - Standardize API for DecisionTree: numClasses vs numClassesForClassification
- [SPARK-4642] - Documents about running-on-YARN needs update
- [SPARK-4717] - Optimize BLAS library to avoid de-reference multiple times in loop
- [SPARK-4708] - Make k-mean runs two/three times faster with dense/sparse sample
- [SPARK-4710] - Fix MLlib compilation warnings
- [SPARK-4695] - Get result using executeCollect in spark sql
- [SPARK-4611] - Implement the efficient vector norm
- [SPARK-4358] - Parsing NumericLit with more specified types
- [SPARK-4661] - Minor code and docs cleanup
- [SPARK-2143] - Display Spark version on Driver web page
- [SPARK-4613] - Make JdbcRDD easier to use from Java
- [SPARK-4583] - GradientBoostedTrees error logging should use loss being minimized
- [SPARK-4614] - Slight API changes in Matrix and Matrices
- [SPARK-4604] - Make MatrixFactorizationModel constructor public
- [SPARK-4612] - Configuration object gets created for every task even if not new file/jar is added
- [SPARK-4581] - Refactorize StandardScaler to improve the transformation performance
- [SPARK-4381] - User should get warned when set spark.master with local in Spark Streaming
- [SPARK-4526] - Gradient should be added batch computing interface
- [SPARK-4596] - Refactorize Normalizer to make code cleaner
- [SPARK-4517] - Improve memory efficiency for python broadcast
- [SPARK-4562] - GLM testing time regressions from Spark 1.1
- [SPARK-4457] - Document how to build for Hadoop versions greater than 2.4
- [SPARK-4431] - Implement efficient activeIterator for dense and sparse vector
- [SPARK-4531] - Cache serialized java objects instead of serialized python objects in MLlib
- [SPARK-4472] - Print "Spark context available as sc." only when SparkContext is created successfully
- [SPARK-4413] - Parquet support through datasource API
- [SPARK-2918] - EXPLAIN doesn't support the CTAS
- [SPARK-3938] - Set RDD name to table name during cache operations
- [SPARK-4486] - Improve GradientBoosting APIs and doc
- [SPARK-4294] - UnionDStream stream should express the requirements in the same way as TransformedDStream
- [SPARK-4470] - SparkContext accepts local[0] as a master URL
- [SPARK-4463] - Add (de)select all button for additional metrics in webUI
- [SPARK-4466] - Provide support for publishing Scala 2.11 artifacts to Maven
- [SPARK-4444] - Drop VD type parameter from EdgeRDD
- [SPARK-4410] - Support for external sort
- [SPARK-4393] - Memory leak in connection manager timeout thread
- [SPARK-4419] - Upgrade Snappy Java to 1.1.1.6
- [SPARK-2321] - Design a proper progress reporting & event listener API
- [SPARK-4379] - RDD.checkpoint throws a general Exception (should be SparkException)
- [SPARK-4214] - With dynamic allocation, avoid outstanding requests for more executors than pending tasks need
- [SPARK-4365] - Remove unnecessary filter call on records returned from parquet library
- [SPARK-4386] - Parquet file write performance improvement
- [SPARK-4062] - Improve KafkaReceiver to prevent data loss
- [SPARK-4380] - Executor full of log "spilling in-memory map of 0 MB to disk"
- [SPARK-4398] - Specialize rdd.parallelize for xrange
- [SPARK-2703] - Make Tachyon related unit tests execute without deploying a Tachyon system locally.
- [SPARK-4394] - Allow datasources to support IN and sizeInBytes
- [SPARK-4378] - Make ALS more Java-friendly
- [SPARK-2672] - Support compression in wholeFile()
- [SPARK-3666] - Extract interfaces for EdgeRDD and VertexRDD
- [SPARK-4347] - GradientBoostingSuite takes more than 1 minute to finish
- [SPARK-2492] - KafkaReceiver minor changes to align with Kafka 0.8
- [SPARK-4307] - Initialize FileDescriptor lazily in FileRegion
- [SPARK-4324] - Support numpy/scipy in all Python API of MLlib
- [SPARK-4330] - Link to proper URL for YARN overview
- [SPARK-3954] - Optimization to FileInputDStream
- [SPARK-4047] - Generate runtime warning for naive implementation examples for algorithms implemented in MLlib/graphx
- [SPARK-3179] - Add task OutputMetrics
- [SPARK-971] - Link to Confluence wiki from project website / documentation
- [SPARK-4221] - Allow access to nonnegative ALS from python
- [SPARK-4272] - Add more unwrap functions for primitive type in TableReader
- [SPARK-4187] - External shuffle service should not use Java serializer
- [SPARK-4188] - Shuffle fetches should be retried at a lower level
- [SPARK-3797] - Run the shuffle service inside the YARN NodeManager as an AuxiliaryService
- [SPARK-4262] - Add .schemaRDD to JavaSchemaRDD
- [SPARK-4166] - Display the executor ID in the Web UI when ExecutorLostFailure happens
- [SPARK-4163] - When fetching blocks unsuccessfully, Web UI only displays "Fetch failure"
- [SPARK-4168] - Completed Stages Number are misleading webUI when stages are more than 1000
- [SPARK-2938] - Support SASL authentication in Netty network module
New Feature
- [SPARK-2805] - Update akka to version 2.3.4
- [SPARK-4683] - Add a beeline.cmd to run on Windows
- [SPARK-4685] - Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
- [SPARK-4529] - support view with column alias specified
- [SPARK-4145] - Create jobs overview and job details pages on the web UI
- [SPARK-4477] - remove numpy from RDDSampler of PySpark
- [SPARK-4439] - Expose RandomForest in Python
- [SPARK-4228] - Save a SchemaRDD in JSON format
- [SPARK-4327] - Python API for RDD.randomSplit()
- [SPARK-4306] - LogisticRegressionWithLBFGS support for PySpark MLlib
- [SPARK-4017] - Progress bar in console
- [SPARK-4396] - Support lookup by index in Rating
- [SPARK-4435] - Add setThreshold in Python LogisticRegressionModel and SVMModel
- [SPARK-2811] - update algebird to 0.8.1
- [SPARK-4239] - support view in HiveQL
- [SPARK-3530] - Pipeline and Parameters
- [SPARK-4149] - ISO 8601 support for json date time strings
- [SPARK-4186] - Support binaryFiles and binaryRecords API in Python
- [SPARK-611] - Allow JStack to be run from web UI
Task
Test
- [SPARK-4319] - Enable an ignored test "null count".
Hive
Improvement
- [HIVE-6024] - Load data local inpath unnecessarily creates a copy task