commit 8e2e5bd695bc5c64828fb17a74548d702b91b2ef
Author: Jenkins slave
Date:   Thu Jan 22 15:17:25 2015 -0800

    Preparing for CDH5.2.3 release

commit d5c348b40ab6704925bac0c5a3f5f91c685573e1
Author: Jenkins slave
Date:   Tue Jan 13 12:45:13 2015 -0800

    Preparing for CDH5.2.2 release

commit af825b5ad7288fcd7178a75dd53c21055b1ba25d
Author: Marcelo Vanzin
Date:   Fri Jan 9 10:32:49 2015 -0800

    CLOUDERA-BUILD. CDH-22159. Use CDH version of Snappy in Spark.

    The CDH version runs on older distros and has the fixes Spark needs
    (as we've already verified in the CDH 5.3 release).

commit 3d63c76fe298dfe12916453656894ac1820cc275
Author: Jenkins slave
Date:   Wed Nov 12 09:43:52 2014 -0800

    Preparing for CDH5.2.2 development

commit 99dd0bdae38eacc99c481654b1b46aa4c68918b0
Author: Jenkins slave
Date:   Wed Nov 5 16:05:39 2014 -0800

    Preparing for CDH5.2.1 release

commit 283f5ad205a470709f631adcbefcc51ee9cda778
Author: Marcelo Vanzin
Date:   Wed Oct 8 08:51:17 2014 -0500

    [SPARK-3788] [yarn] Fix compareFs to do the right thing for HDFS namespaces (1.1 version).

    HA and viewfs use namespaces instead of host names, so you can't
    resolve them; resolution will fail. So be smarter and avoid doing
    unnecessary work.

    Author: Marcelo Vanzin

    Closes #2650 from vanzin/SPARK-3788-1.1 and squashes the following commits:

    174bf71 [Marcelo Vanzin] Update comment.
    0e36be7 [Marcelo Vanzin] Use Objects.equal() instead of ==.
    772aead [Marcelo Vanzin] [SPARK-3788] [yarn] Fix compareFs to do the right thing for HA, federation (1.1 version).

    (cherry picked from commit a44af7302f814204fdbcc7ad620bc6984b376468)

commit c08afa3b82ead4054958b325e26ccc5cdd383212
Author: Marcelo Vanzin
Date:   Fri Oct 31 15:59:04 2014 -0700

    CLOUDERA-BUILD. Increase timeout for flaky test.

commit faa5dce34a0952a03534ed71a25c94129889f722
Author: Andrew Or
Date:   Thu Oct 30 15:44:29 2014 -0700

    [SPARK-3661] Respect spark.*.memory in cluster mode

    This also includes minor re-organization of the code. Tested locally
    in both client and deploy modes.

    Author: Andrew Or
    Author: Andrew Or

    Closes #2697 from andrewor14/memory-cluster-mode and squashes the following commits:

    01d78bc [Andrew Or] Merge branch 'master' of github.com:apache/spark into memory-cluster-mode
    ccd468b [Andrew Or] Add some comments per Patrick
    c956577 [Andrew Or] Tweak wording
    2b4afa0 [Andrew Or] Unused import
    47a5a88 [Andrew Or] Correct Spark properties precedence order
    bf64717 [Andrew Or] Merge branch 'master' of github.com:apache/spark into memory-cluster-mode
    dd452d0 [Andrew Or] Respect spark.*.memory in cluster mode

    (cherry picked from commit 2f54543815c0905dc958d444ad638c23a29507c6)

    Conflicts:
        core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
        core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala

commit abaa09d9e19d800970044f00ada0e1262bd16a9c
Author: Marcelo Vanzin
Date:   Fri Oct 17 13:45:10 2014 -0500

    [SPARK-3979] [yarn] Use fs's default replication.

    This avoids issues when HDFS is configured in a way that would not
    allow the hardcoded default replication of "3".

    Note: getDefaultReplication(Path) was added in 0.23.3, and the oldest
    one available on Maven Central is 0.23.7, so I chose to not add code
    to access that method via reflection.

    Author: Marcelo Vanzin

    Closes #2831 from vanzin/SPARK-3979 and squashes the following commits:

    b0e3a97 [Marcelo Vanzin] [SPARK-3979] [yarn] Use fs's default replication.

    (cherry picked from commit 803e7f087797bae643754f8db88848a17282ca6e)

commit e2e6186d6639be5fc245a50f39343cecefd14f38
Author: GuoQiang Li
Date:   Wed Oct 29 23:02:58 2014 -0700

    [SPARK-1720][SPARK-1719] use LD_LIBRARY_PATH instead of -Djava.library.path

    - [X] Standalone
    - [X] YARN
    - [X] Mesos
    - [X] Mac OS X
    - [X] Linux
    - [ ] Windows

    This is another implementation of #1031

    Author: GuoQiang Li

    Closes #2711 from witgo/SPARK-1719 and squashes the following commits:

    c7b26f6 [GuoQiang Li] review commits
    4488e41 [GuoQiang Li] Refactoring CommandUtils
    a444094 [GuoQiang Li] review commits
    40c0b4a [GuoQiang Li] Add buildLocalCommand method
    c1a0ddd [GuoQiang Li] fix comments
    156ce88 [GuoQiang Li] review commit
    38aa377 [GuoQiang Li] Refactor CommandUtils.scala
    4269e00 [GuoQiang Li] Refactor SparkSubmitDriverBootstrapper.scala
    7a1d634 [GuoQiang Li] use LD_LIBRARY_PATH instead of -Djava.library.path

    (cherry picked from commit cd739bd756875bd52e9bd8ae801e0ae10a1f6937)

    Conflicts:
        core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala
        core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala
        core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala
        core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala
        core/src/main/scala/org/apache/spark/util/Utils.scala
        core/src/test/scala/org/apache/spark/deploy/worker/ExecutorRunnerTest.scala
        yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
        yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala

commit 4e77c86d098b3a07e30689839c104e56e8f9d95f
Author: Jenkins slave
Date:   Thu Oct 9 13:57:58 2014 -0700

    Preparing for CDH5.2.1 development

commit c98c1de1e176f7146167e73fae302c70ea9f1eac
Author: Jenkins slave
Date:   Thu Oct 9 13:44:06 2014 -0700

    Preparing for CDH5.2.0 release

commit 511df9d1a9490d5168ea1e54990b7f5d2b5a70d5
Author: Marcelo Vanzin
Date:   Fri Oct 3 09:42:45 2014 -0700

    CLOUDERA-BUILD. CDH-21689. Fix fetch failures.

    Revert "SPARK-2711. Create a ShuffleMemoryManager to track memory for all spilling collections"

    This reverts commit 4fde28c2063f673ec7f51d514ba62a73321960a1.

    Conflicts:
        core/src/main/scala/org/apache/spark/SparkEnv.scala
        core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala

    (cherry picked from commit 2c5266de2b73138f6704fb9326d21e64b3d44014)

commit a903478d28cf7ed0e3fd22fac97f0cd460b52d7a
Author: Marcelo Vanzin
Date:   Thu Oct 2 12:53:28 2014 -0700

    CLOUDERA-BUILD. Exclude hadoop-aws dependency.

    It packages 3rd-party libraries that conflict with other Spark
    dependencies, and is otherwise unused by Spark. Also, simplify the
    way -Phadoop-provided works.

    (cherry picked from commit 4a3eb0d1f860a9a95e1e53b09643562376bff67f)

commit 1ddb74471271a165bb3bb81bea08e99a8087bea2
Author: Marcelo Vanzin
Date:   Thu Oct 2 12:19:54 2014 -0700

    CLOUDERA-BUILD. Try to work around dependency issues with hive-exec.

    hive-exec includes guava classes of an older version, and if they show
    up in the classpath before the right guava, things blow up. This
    change modifies the way the hadoop-provided profile works to not add
    the dependencies where they're not needed, so that in the end only
    sql/hive includes hive-exec.

    (cherry picked from commit 48524b157937a845fdc0e4f4b8b0b47b4eb8c40e)

commit a7cf6e46bb5b93f558da5b3545665aa0daf7d57f
Author: Marcelo Vanzin
Date:   Fri Sep 19 16:40:43 2014 -0700

    CLOUDERA-BUILD. Preview of SPARK-3606.

    [SPARK-3606] [yarn] Correctly configure AmIpFilter for Yarn HA (1.1 version).
    This is a backport of SPARK-3606 to branch-1.1. Some of the code had
    to be duplicated since branch-1.1 doesn't have the cleanup work that
    was done to the Yarn codebase.

    I don't know whether the version issue in yarn/alpha/pom.xml was
    intentional, but I couldn't compile the code without fixing it.

    (cherry picked from commit db67825813aab8236cdc435ef3bae1fa131a3119)

commit fc8350bb3abc2307054bcdc6118597383ef9b317
Author: Marcelo Vanzin
Date:   Fri May 30 10:25:48 2014 -0700

    CLOUDERA-BUILD. Increase timeout in test.

    Hopefully this will make the test less flaky on our build machines.
    Long term, we need to look at a way to make this less dependent on
    timeouts.

    (cherry picked from commit 29a1495d14602ac0517b835f787eefcd203d1938)

    Conflicts:
        core/src/test/scala/org/apache/spark/network/ConnectionManagerSuite.scala

    (cherry picked from commit 15a0530292de314d14dfe7545e844ba30d564011)

commit 250acb6fc3a5c62161183dd7c7f98f637de168cc
Author: Marcelo Vanzin
Date:   Mon Sep 29 10:55:29 2014 -0700

    CLOUDERA-BUILD. Disable flaky tests.

    Until I get some time to debug them, so that our builds are more
    reliable.

    (cherry picked from commit 009ad6fbe2a5443fe565df2a24a78cb61fa7dfe7)

commit 7e86292c6125dd0fb89ff269e840569199c75950
Author: Marcelo Vanzin
Date:   Fri Sep 26 16:38:37 2014 -0700

    CLOUDERA-BUILD. Fix HiveContext for Hive 0.13.

    I haven't looked at the API history here, but on Hive 0.13
    Driver.destroy() does not allow the same instance to be reused; so
    this code just does not destroy the Driver after it's used, and avoids
    reinitializing it (it's already done by CommandProcessorFactory).

    Another solution would be to call CommandProcessorFactory.clean()
    after the method is done with the driver, but that would have
    thread-safety issues.

    (cherry picked from commit df49f6001606cadc97f431d1180f60d177ae46fc)

commit 5940717eaae17913cea7c823e9f8155d1d9f76b4
Author: Marcelo Vanzin
Date:   Fri Sep 26 14:57:33 2014 -0700

    CLOUDERA-BUILD. Exclude more jars from the assembly.

    The assembly jar is big; so big that it's currently over 65k files,
    the limit where some tools start breaking when trying to read the
    file. Even some JDK tools seem not to like that.

    So add a few more exclusions when "-Phadoop-provided" is defined.
    These are parquet and hive libraries already shipped with other CDH
    packages. The only end-user effect is that someone using Spark SQL +
    Hive will have to manually add Hive jars to the classpath, since
    those are not included in the output of "hadoop classpath".

    (cherry picked from commit 2037c2316aeb839280e79fe7aff76d8002fddd20)

commit 91e7e58f8158a73e12fad416032ed24fc9452a04
Author: Marcelo Vanzin
Date:   Fri Sep 26 14:19:03 2014 -0700

    CLOUDERA-BUILD. Fix typo.

    (cherry picked from commit 390929d9ba1c2c7205047c8a031419969a201f34)

commit 181b82053342331bb2c2bd3b9f97647fbcf839ac
Author: Jenkins slave
Date:   Fri Sep 26 10:50:42 2014 -0700

    CLOUDERA-BUILD: Fixing version to 5.2.0-SNAPSHOT

commit 4beb6c9ddff29e7c92a9ebf8ae92b797a1540ea0
Author: Jenkins slave
Date:   Fri Sep 26 09:26:30 2014 -0700

    Preparing for CDH5.3.0 development

commit a4d7ee7a8d958e951d1fcbd45fb84910d9e3309c
Author: Marcelo Vanzin
Date:   Thu Sep 25 13:12:25 2014 -0700

    CLOUDERA-BUILD. Fix bad conflict resolution (wrong indentation).

commit a34a486a7577cfea35ff266ff9a226a36a851157
Author: Marcelo Vanzin
Date:   Mon Sep 22 13:52:31 2014 -0700

    Revert "[SPARK-2848] Shade Guava in uber-jars."

    This reverts commit ca7b275db8b8022e4217eddd7ab014be2a95fa60.
commit 06139ef9d1af2953850fe444a4fbd5595ef2e54a
Author: Victsm
Date:   Thu Sep 18 15:58:14 2014 -0700

    [SPARK-3560] Fixed setting spark.jars system property in yarn-cluster mode

    Author: Victsm
    Author: Min Shen

    Closes #2449 from Victsm/SPARK-3560 and squashes the following commits:

    918405a [Victsm] Removed the additional space
    4502a2a [Min Shen] [SPARK-3560] Fixed setting spark.jars system property in yarn-cluster mode.

    (cherry picked from commit 832dff64ddb1240a4c8e22fcdc0e993cc8c808de)
    Signed-off-by: Andrew Or

    (cherry picked from commit b3ed37e5bad15d56db90c2b25fe11c1f758d3a97)

commit 8b0ae03a32c2641eb139759f7ef900beecb67d65
Author: Bertrand Bossy
Date:   Sun Sep 14 21:10:17 2014 -0700

    SPARK-3039: Allow spark to be built using avro-mapred for hadoop2

    SPARK-3039: Adds the maven property "avro.mapred.classifier" to build
    spark-assembly with avro-mapred with support for the new Hadoop API.
    Sets this property to hadoop2 for Hadoop 2 profiles.

    I am not very familiar with maven, nor do I know whether this
    potentially breaks something in the hive part of spark. There might
    be a more elegant way of doing this.

    Author: Bertrand Bossy

    Closes #1945 from bbossy/SPARK-3039 and squashes the following commits:

    c32ce59 [Bertrand Bossy] SPARK-3039: Allow spark to be built using avro-mapred for hadoop2

    (cherry picked from commit c243b21a8ba2610266702e00d7d4b5443cb1f687)

    Conflicts:
        pom.xml

commit ff718bb29ec8fef624a021df362be43c809d3c7c
Author: Sandy Ryza
Date:   Fri Sep 12 16:48:28 2014 -0500

    SPARK-3014. Log more informative messages in a couple of failure scenarios

    Author: Sandy Ryza

    Closes #1934 from sryza/sandy-spark-3014 and squashes the following commits:

    ae19cc1 [Sandy Ryza] SPARK-3014. Log more informative messages in a couple of failure scenarios

    (cherry picked from commit 1d767967e925f1d727957c2d43383ef6ad2c5d5e)

    Conflicts:
        yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala

commit 19c8fd68ad66e277de1bd4c3cbb9fb133a6bc257
Author: Benoy Antony
Date:   Wed Sep 10 11:59:39 2014 -0500

    [SPARK-3286] - Cannot view ApplicationMaster UI when Yarn’s url scheme is https

    Author: Benoy Antony

    Closes #2276 from benoyantony/SPARK-3286 and squashes the following commits:

    c3d51ee [Benoy Antony] Use address with scheme, but Alpha version removes the scheme
    e82f94e [Benoy Antony] Use address with scheme, but Alpha version removes the scheme
    92127c9 [Benoy Antony] rebasing from master
    450c536 [Benoy Antony] [SPARK-3286] - Cannot view ApplicationMaster UI when Yarn’s url scheme is https
    f060c02 [Benoy Antony] [SPARK-3286] - Cannot view ApplicationMaster UI when Yarn’s url scheme is https

    (cherry picked from commit 6f7a76838f15687583e3b0ab43309a3c079368c4)

    Conflicts:
        yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClientImpl.scala
        yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala

commit 3a43784c650e1bab895e41cccd5e8ef1ace42e50
Author: Prashant Sharma
Date:   Mon Sep 8 10:24:15 2014 -0700

    SPARK-3337 Paranoid quoting in shell to allow install dirs with spaces within.

    Tested! TBH, it isn't a great idea to have a directory with spaces in
    its name, because emacs doesn't like it, then hadoop doesn't like it,
    and so on...

    Author: Prashant Sharma

    Closes #2229 from ScrapCodes/SPARK-3337/quoting-shell-scripts and squashes the following commits:

    d4ad660 [Prashant Sharma] SPARK-3337 Paranoid quoting in shell to allow install dirs with spaces within.

    (cherry picked from commit e16a8e7db5a3b1065b14baf89cb723a59b99226b)

    Conflicts:
        bin/load-spark-env.sh
        bin/spark-class
        dev/check-license
        dev/lint-python
        sbt/sbt-launch-lib.bash

commit 3f674bb9982926711d7215c46cb8c77a9bd09678
Author: Thomas Graves
Date:   Fri Sep 5 09:54:40 2014 -0500

    [SPARK-3260] yarn - pass acls along with executor launch

    Pass along the acl settings when we launch a container so that they
    can be applied to viewing the logs on a running NodeManager.

    Author: Thomas Graves

    Closes #2185 from tgravescs/SPARK-3260 and squashes the following commits:

    6f94b5a [Thomas Graves] make unit test more robust
    28b9dd3 [Thomas Graves] yarn - pass acls along with executor launch

    (cherry picked from commit 51b53a758c85f2e20ad9bd73ed815fcfa9c7180b)

    Conflicts:
        yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
        yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClientImpl.scala
        yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
        yarn/common/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
        yarn/common/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala
        yarn/common/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala
        yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtilSuite.scala
        yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
        yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClientImpl.scala

commit 5a613e574bf919ad4658098aa27354ff34848631
Author: Prashant Sharma
Date:   Sun Sep 14 21:17:29 2014 -0700

    [SPARK-3452] Maven build should skip publishing artifacts people shouldn't depend on

    "Publish local" in maven terms is `install`, and publishing otherwise
    is `deploy`, so both are disabled for the following projects.

    Author: Prashant Sharma

    Closes #2329 from ScrapCodes/SPARK-3452/maven-skip-install and squashes the following commits:

    257b79a [Prashant Sharma] [SPARK-3452] Maven build should skip publishing artifacts people shouldn't depend on

    (cherry picked from commit f493f7982b50e3c99e78b649e7c6c5b4313c5ffa)

commit b87ae1f018cd69c16a01100e86b21ef830c2ebbe
Author: Davies Liu
Date:   Thu Sep 11 18:53:26 2014 -0700

    [SPARK-3465] fix task metrics aggregation in local mode

    Before overwriting t.taskMetrics, take a deep copy of it.

    Author: Davies Liu

    Closes #2338 from davies/fix_metric and squashes the following commits:

    a5cdb63 [Davies Liu] Merge branch 'master' into fix_metric
    7c879e0 [Davies Liu] add more comments
    754b5b8 [Davies Liu] copy taskMetrics only when isLocal is true
    5ca26dc [Davies Liu] fix task metrics aggregation in local mode

    (cherry picked from commit 42904b8d013e71d03e301c3da62e33b4cc2eb54e)

commit 85eca6b3db3f98c421ebff5da51162c9fa0eb861
Author: Andrew Ash
Date:   Thu Sep 11 17:28:36 2014 -0700

    [SPARK-3429] Don't include the empty string "" as a defaultAclUser

    Changes logging from

    ```
    14/09/05 02:01:08 INFO SecurityManager: Changing view acls to: aash,
    14/09/05 02:01:08 INFO SecurityManager: Changing modify acls to: aash,
    14/09/05 02:01:08 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(aash, ); users with modify permissions: Set(aash, )
    ```

    to

    ```
    14/09/05 02:28:28 INFO SecurityManager: Changing view acls to: aash
    14/09/05 02:28:28 INFO SecurityManager: Changing modify acls to: aash
    14/09/05 02:28:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(aash); users with modify permissions: Set(aash)
    ```

    Note that the first set of logs has a Set of size 2 containing "aash"
    and the empty string "".

    cc tgravescs

    Author: Andrew Ash

    Closes #2286 from ash211/empty-default-acl and squashes the following commits:

    18cc612 [Andrew Ash] Use .isEmpty instead of ==""
    cf973a1 [Andrew Ash] Don't include the empty string "" as a defaultAclUser

    (cherry picked from commit ce59725b8703d18988e495dbaaf86ddde4bdfc5a)

commit bb653b2633b40c8deda9c07bf92c9373c93381ac
Author: Chris Cope
Date:   Thu Sep 11 08:13:07 2014 -0500

    [SPARK-2140] Updating heap memory calculation for YARN stable and alpha.

    Updated pull request, reflecting YARN stable and alpha states. I am
    getting intermittent test failures on my own test infrastructure. Is
    that tracked anywhere yet?

    Author: Chris Cope

    Closes #2253 from copester/master and squashes the following commits:

    5ad89da [Chris Cope] [SPARK-2140] Removing calculateAMMemory functions since they are no longer needed.
    52b4e45 [Chris Cope] [SPARK-2140] Updating heap memory calculation for YARN stable and alpha.

    (cherry picked from commit ed1980ffa9ccb87d76694ba910ef22df034bca49)

commit 9a8223e7b4b4b4ba31111bf8e7719668032e1b34
Author: scwf
Date:   Tue Sep 9 11:57:01 2014 -0700

    [SPARK-3193] output error info when Process exit code is not zero in test suite

    https://issues.apache.org/jira/browse/SPARK-3193

    I noticed that sometimes pr tests failed due to the Process exit code
    being != 0; refer to:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18688/consoleFull
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19118/consoleFull

    [info] SparkSubmitSuite:
    [info] - prints usage on empty input
    [info] - prints usage with only --help
    [info] - prints error with unrecognized options
    [info] - handle binary specified but not class
    [info] - handles arguments with --key=val
    [info] - handles arguments to user program
    [info] - handles arguments to user program with name collision
    [info] - handles YARN cluster mode
    [info] - handles YARN client mode
    [info] - handles standalone cluster mode
    [info] - handles standalone client mode
    [info] - handles mesos client mode
    [info] - handles confs with flag equivalents
    [info] - launch simple application with spark-submit *** FAILED ***
    [info]   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1408854098404-0/testJar-1408854098404.jar) exited with code 1
    [info]   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:872)
    [info]   at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
    [info]   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
    [info]   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
    [info]   at org.apac...
    Spark assembly has been built with Hive, including Datanucleus jars on classpath

    This PR outputs the process error info when the process fails; it can
    be helpful for diagnosis.

    Author: scwf

    Closes #2108 from scwf/output-test-error-info and squashes the following commits:

    0c48082 [scwf] minor fix according to comments
    563fde1 [scwf] output error info when Process exitcode not zero

    (cherry picked from commit 26862337c97ce14794178d6378fb4155dd24acb9)

commit dfc96c0978bed0256e8f907ad7656a6d3786f129
Author: Mark Hamstra
Date:   Mon Sep 8 20:51:56 2014 -0700

    SPARK-2425 Don't kill a still-running Application because of some misbehaving Executors

    Introduces a LOADING -> RUNNING ApplicationState transition and
    prevents Master from removing an Application with RUNNING Executors.

    Two basic changes: 1) Instead of allowing MAX_NUM_RETRY abnormal
    Executor exits over the entire lifetime of the Application, allow that
    many since any Executor successfully began running the Application;
    2) Don't remove the Application while Master still thinks that there
    are RUNNING Executors.

    This should be fine as long as the ApplicationInfo doesn't believe any
    Executors are forever RUNNING when they are not. I think that any
    non-RUNNING Executors will eventually no longer be RUNNING in Master's
    accounting, but another set of eyes should confirm that. This PR also
    doesn't try to detect which nodes have gone rogue or to kill off bad
    Workers, so repeatedly failing Executors will continue to fail and
    fill up log files with failure reports as long as the Application
    keeps running.

    Author: Mark Hamstra

    Closes #1360 from markhamstra/SPARK-2425 and squashes the following commits:

    f099c0b [Mark Hamstra] Reuse appInfo
    b2b7b25 [Mark Hamstra] Moved 'Application failed' logging
    bdd0928 [Mark Hamstra] switched to string interpolation
    1dd591b [Mark Hamstra] SPARK-2425 introduce LOADING -> RUNNING ApplicationState transition and prevent Master from removing Application with RUNNING Executors

    (cherry picked from commit 092e2f152fb674e7200cc8a2cb99a8fe0a9b2b33)

commit 24ced81bed4134fe2b3bbb143480732b83a451e5
Author: Eric Liang
Date:   Sun Sep 7 17:57:59 2014 -0700

    [SPARK-3394] [SQL] Fix crash in TakeOrdered when limit is 0

    This resolves https://issues.apache.org/jira/browse/SPARK-3394

    Author: Eric Liang

    Closes #2264 from ericl/spark-3394 and squashes the following commits:

    c87355b [Eric Liang] refactor
    bfb6140 [Eric Liang] change RDD takeOrdered instead
    7a51528 [Eric Liang] fix takeordered when limit = 0

    (cherry picked from commit 6754570d83044c4fbaf0d2ac2378a0e081a93629)

commit 4b65954a151909a806412d2ad0776464445c9caa
Author: Tathagata Das
Date:   Sat Sep 6 14:46:43 2014 -0700

    [SPARK-2419][Streaming][Docs] More updates to the streaming programming guide

    - Improvements to the kinesis integration guide from @cfregly
    - More information about unified input dstreams in main guide

    Author: Tathagata Das
    Author: Chris Fregly

    Closes #2307 from tdas/streaming-doc-fix1 and squashes the following commits:

    ec40b5d [Tathagata Das] Updated figure with kinesis
    fdb9c5e [Tathagata Das] Fixed style issues with kinesis guide
    036d219 [Chris Fregly] updated kinesis docs and added an arch diagram
    24f622a [Tathagata Das] More modifications.

    (cherry picked from commit baff7e936101635d9bd4245e45335878bafb75e0)

commit 97954370c482b8657103fa70cf709f8013c15ca6
Author: Andrew Ash
Date:   Fri Sep 5 18:52:05 2014 -0700

    SPARK-3211 .take() is OOM-prone with empty partitions

    Instead of jumping straight from 1 partition to all partitions, do
    exponential growth and double the number of partitions to attempt each
    time instead.

    Fix proposed by Paul Nepywoda

    Author: Andrew Ash

    Closes #2117 from ash211/SPARK-3211 and squashes the following commits:

    8b2299a [Andrew Ash] Quadruple instead of double for a minor speedup
    e5f7e4d [Andrew Ash] Update comment to better reflect what we're doing
    09a27f7 [Andrew Ash] Update PySpark to be less OOM-prone as well
    3a156b8 [Andrew Ash] SPARK-3211 .take() is OOM-prone with empty partitions

    (cherry picked from commit ba5bcaddecd54811d45c5fc79a013b3857d4c633)

commit 3c59d7f807fe60f1f1ed10863d7e931c86eba336
Author: Kousuke Saruta
Date:   Thu Sep 4 10:29:11 2014 -0700

    [SPARK-3401][PySpark] Wrong usage of tee command in python/run-tests

    Author: Kousuke Saruta

    Closes #2272 from sarutak/SPARK-3401 and squashes the following commits:

    2b35a59 [Kousuke Saruta] Modified wrong usage of tee command in python/run-tests

    (cherry picked from commit 4feb46c5feca8d48ec340dc9c8d0eccbcd41f505)

commit b754c14ab6778f25aaa33b5f015c7095e849953e
Author: Kousuke Saruta
Date:   Wed Sep 3 18:42:01 2014 -0700

    [SPARK-3233] Executor never stops its SparkEnv, BlockManager, ConnectionManager etc.

    Author: Kousuke Saruta

    Closes #2138 from sarutak/SPARK-3233 and squashes the following commits:

    c0205b7 [Kousuke Saruta] Merge branch 'SPARK-3233' of github.com:sarutak/spark into SPARK-3233
    064679d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3233
    d3005fd [Kousuke Saruta] Modified Class definition format of BlockManagerMaster
    039b747 [Kousuke Saruta] Modified style
    889e2d1 [Kousuke Saruta] Modified BlockManagerMaster to be able to be passed the isDriver flag
    4da8535 [Kousuke Saruta] Modified BlockManagerMaster#stop to send StopBlockManagerMaster message when sender is Driver
    6518c3a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3233
    d5ab19a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3233
    6bce25c [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3233
    6058a58 [Kousuke Saruta] Modified Executor not to invoke SparkEnv#stop in local mode
    e5ad9d3 [Kousuke Saruta] Modified Executor to stop SparkEnv at the end of itself

    (cherry picked from commit 4bba10c41acaf84a1c4a8e2db467c22f5ab7cbb9)

commit 0ad0e5f2b43f8c8d41910c54db65b0618450f8dc
Author: Tathagata Das
Date:   Wed Sep 3 17:38:01 2014 -0700

    [SPARK-2419][Streaming][Docs] Updates to the streaming programming guide

    Updated the main streaming programming guide, and also added
    source-specific guides for Kafka, Flume, Kinesis.

    Author: Tathagata Das
    Author: Jacek Laskowski

    Closes #2254 from tdas/streaming-doc-fix and squashes the following commits:

    e45c6d7 [Jacek Laskowski] More fixes from an old PR
    5125316 [Tathagata Das] Fixed links
    dc02f26 [Tathagata Das] Refactored streaming kinesis guide and made many other changes.
    acbc3e3 [Tathagata Das] Fixed links between streaming guides.
    cb7007f [Tathagata Das] Added Streaming + Flume integration guide.
    9bd9407 [Tathagata Das] Updated streaming programming guide with additional information from SPARK-2419.

    (cherry picked from commit a5224079286d1777864cf9fa77330aadae10cd7b)

commit d7f7340c50343ba4cf2dc5d3f7cd0780bbb08441
Author: Davies Liu
Date:   Tue Sep 2 15:47:47 2014 -0700

    [SPARK-2871] [PySpark] add countApproxDistinct() API

    RDD.countApproxDistinct(relativeSD=0.05):

        :: Experimental ::

        Return approximate number of distinct elements in the RDD.

        The algorithm used is based on streamlib's implementation of
        "HyperLogLog in Practice: Algorithmic Engineering of a State of
        The Art Cardinality Estimation Algorithm", available here.

        This supports all the types of objects supported by Pyrolite,
        which is nearly all builtin types.

        :param relativeSD: Relative accuracy. Smaller values create
            counters that require more space. It must be greater than
            0.000017.
        >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
        >>> 950 < n < 1050
        True
        >>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
        >>> 18 < n < 22
        True

    Author: Davies Liu

    Closes #2142 from davies/countApproxDistinct and squashes the following commits:

    e20da47 [Davies Liu] remove the correction in Python
    c38c4e4 [Davies Liu] fix doc tests
    2ab157c [Davies Liu] fix doc tests
    9d2565f [Davies Liu] add comments and link for hash collision correction
    d306492 [Davies Liu] change range of hash of tuple to [0, maxint]
    ded624f [Davies Liu] calculate hash in Python
    4cba98f [Davies Liu] add more tests
    a85a8c6 [Davies Liu] Merge branch 'master' into countApproxDistinct
    e97e342 [Davies Liu] add countApproxDistinct()

    (cherry picked from commit e2c901b4c72b247bb422dd5acf057bc583e639ab)

commit 8fc0c08d4db591280b68b463696cdfb9724a5411
Author: Sandy Ryza
Date:   Tue Sep 2 11:34:55 2014 -0700

    SPARK-3052. Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop

    Author: Sandy Ryza

    Closes #1956 from sryza/sandy-spark-3052 and squashes the following commits:

    815813a [Sandy Ryza] SPARK-3052. Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop

    (cherry picked from commit 81b9d5b628229ed69aa9dae45ec4c94068dcd71e)

commit 3c55bf3858b002f8d291745c31b37d0e18770374
Author: Raymond Liu
Date:   Fri Aug 29 23:05:18 2014 -0700

    [SPARK-2288] Hide ShuffleBlockManager behind ShuffleManager

    By hiding the ShuffleBlockManager behind the ShuffleManager, we
    decouple the shuffle data's block mapping management work from
    DiskBlockManager. This gives a clearer interface and makes it easier
    for other shuffle managers to implement their own block management
    logic. The jira ticket has more details.

    Author: Raymond Liu

    Closes #1241 from colorant/shuffle and squashes the following commits:

    0e01ae3 [Raymond Liu] Move ShuffleBlockmanager behind shuffleManager

    (cherry picked from commit acea92806c91535162a9fdcb1cce579e7f1f91c7)

commit 5c75c0958f6fb054c410b15ffb504e85f084d37e
Author: Reynold Xin
Date:   Thu Aug 28 19:00:40 2014 -0700

    [SPARK-1912] Lazily initialize buffers for local shuffle blocks.

    This is a simplified fix for SPARK-1912.

    Author: Reynold Xin

    Closes #2179 from rxin/SPARK-1912 and squashes the following commits:

    b2f0e9e [Reynold Xin] Fix unit tests.
    a8eddfe [Reynold Xin] [SPARK-1912] Lazily initialize buffers for local shuffle blocks.

    (cherry picked from commit 665e71d14debb8a7fc1547c614867a8c3b1f806a)

commit 97cfea85c184cc6f38049ac40546a5045d9631da
Author: Davies Liu
Date:   Wed Aug 27 13:18:33 2014 -0700

    [SPARK-2871] [PySpark] add RDD.lookup(key)

    RDD.lookup(key):

        Return the list of values in the RDD for key `key`. This
        operation is done efficiently if the RDD has a known partitioner,
        by only searching the partition that the key maps to.

        >>> l = range(1000)
        >>> rdd = sc.parallelize(zip(l, l), 10)
        >>> rdd.lookup(42)  # slow
        [42]
        >>> sorted = rdd.sortByKey()
        >>> sorted.lookup(42)  # fast
        [42]

    It also cleans up the code in RDD.py and fixes several bugs (related
    to preservesPartitioning).

    Author: Davies Liu

    Closes #2093 from davies/lookup and squashes the following commits:

    1789cd4 [Davies Liu] `f` in foreach could be generator or not.
    2871b80 [Davies Liu] Merge branch 'master' into lookup
    c6390ea [Davies Liu] address all comments
    0f1bce8 [Davies Liu] add test case for lookup()
    be0e8ba [Davies Liu] fix preservesPartitioning
    eb1305d [Davies Liu] add RDD.lookup(key)

    (cherry picked from commit 4fa2fda88fc7beebb579ba808e400113b512533b)

commit 6150b526c33866c625cec523c4afbe8d1ffa7349
Author: Chip Senkbeil
Date:   Wed Aug 27 13:01:11 2014 -0700

    [SPARK-3256] Added support for :cp that was broken in Scala 2.10.x for REPL

    As seen with [SI-6502](https://issues.scala-lang.org/browse/SI-6502)
    of Scala, the _:cp_ command was broken in Scala 2.10.x. As the Spark
    shell is a friendly wrapper on top of the Scala REPL, it is also
    affected by this problem.

    My solution was to alter the internal classpath and invalidate any new
    entries. I also had to add the ability to add new entries to the
    parent classloader of the interpreter (SparkIMain's global). The
    advantage of this versus wiping the interpreter and replaying all of
    the commands is that you don't have to worry about rerunning heavy
    Spark-related commands (going to the cluster) or potentially reloading
    data that might have changed. Instead, you get to work from where you
    left off.

    Until this is fixed upstream for 2.10.x, I had to use reflection to
    alter the internal compiler classpath. The solution now looks like
    this:

    ![screen shot 2014-08-13 at 3 46 02 pm](https://cloud.githubusercontent.com/assets/2481802/3912625/f02b1440-232c-11e4-9bf6-bafb3e352d14.png)

    Author: Chip Senkbeil

    Closes #1929 from rcsenkbeil/FixReplClasspathSupport and squashes the following commits:

    f420cbf [Chip Senkbeil] Added SparkContext.addJar calls to support executing code on remote clusters
    a826795 [Chip Senkbeil] Updated AddUrlsToClasspath to use 'new Run' suggestion over hackish compiler error
    2ff1d86 [Chip Senkbeil] Added compilation failure on symbols hack to get Scala classes to load correctly
    a220639 [Chip Senkbeil] Added support for :cp that was broken in Scala 2.10.x for REPL

    (cherry picked from commit 191d7cf2a655d032f160b9fa181730364681d0e7)

commit 7b5e1b343c876157de1477f214588a936488a6d4
Author: Davies Liu
Date:   Tue Aug 26 16:57:40 2014 -0700

    [SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey()

    Use external sort to support sorting large datasets in the reduce
    stage.

    Author: Davies Liu

    Closes #1978 from davies/sort and squashes the following commits:

    bbcd9ba [Davies Liu] check spilled bytes in tests
    b125d2f [Davies Liu] add test for external sort in rdd
    eae0176 [Davies Liu] choose different disks from different processes and instances
    1f075ed [Davies Liu] Merge branch 'master' into sort
    eb53ca6 [Davies Liu] Merge branch 'master' into sort
    644abaf [Davies Liu] add license in LICENSE
    19f7873 [Davies Liu] improve tests
    55602ee [Davies Liu] use external sort in sortBy() and sortByKey()

    (cherry picked from commit f1e71d4c3ba678fc108effb05cf2d6101dadc0ce)

    Conflicts:
        python/pyspark/shuffle.py

commit 5af5984d28b45258582e888db250a9048ad0f385
Author: Raymond Liu
Date:   Sat Aug 23 19:47:14 2014 -0700

    Clean unused code in SortShuffleWriter

    Just clean up unused code which has been moved into ExternalSorter.

    Author: Raymond Liu

    Closes #1882 from colorant/sortShuffleWriter and squashes the following commits:

    e6337be [Raymond Liu] Clean unused code in SortShuffleWriter

    (cherry picked from commit 8861cdf11288f7597809e9e0e1cad66fb85dd946)

commit 8a3b8739e3dd3844770bed887205d25438115585
Author: Davies Liu
Date:   Sat Aug 23 19:33:34 2014 -0700

    [SPARK-2871] [PySpark] add approx API for RDD

    RDD.countApprox(self, timeout, confidence=0.95):

        :: Experimental ::

        Approximate version of count() that returns a potentially
        incomplete result within a timeout, even if not all tasks have
        finished.

        >>> rdd = sc.parallelize(range(1000), 10)
        >>> rdd.countApprox(1000, 1.0)
        1000

    RDD.sumApprox(self, timeout, confidence=0.95):

        Approximate operation to return the sum within a timeout or meet
        the confidence.

        >>> rdd = sc.parallelize(range(1000), 10)
        >>> r = sum(xrange(1000))
        >>> (rdd.sumApprox(1000) - r) / r < 0.05
        True

    RDD.meanApprox(self, timeout, confidence=0.95):

        :: Experimental ::

        Approximate operation to return the mean within a timeout or meet
        the confidence.

        >>> rdd = sc.parallelize(range(1000), 10)
        >>> r = sum(xrange(1000)) / 1000.0
        >>> (rdd.meanApprox(1000) - r) / r < 0.05
        True

    Author: Davies Liu

    Closes #2095 from davies/approx and squashes the following commits:

    e8c252b [Davies Liu] add approx API for RDD

    (cherry picked from commit 8df4dad4951ca6e687df1288331909878922a55f)

commit 51726af80b943d7c20cdcbcbf697eddf9967a74e
Author: Davies Liu
Date:   Sat Aug 23 18:55:13 2014 -0700

    [SPARK-2871] [PySpark] add `key` argument for max(), min() and top(n)

    RDD.max(key=None):

        :param key: A function used to generate key for comparing

        >>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0])
        >>> rdd.max()
        43.0
        >>> rdd.max(key=str)
        5.0

    RDD.min(key=None):

        Find the minimum item in this RDD.

        :param key: A function used to generate key for comparing

        >>> rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0])
        >>> rdd.min()
        2.0
        >>> rdd.min(key=str)
        10.0

    RDD.top(num, key=None):

        Get the top N elements from an RDD.

        Note: It returns the list sorted in descending order.

        >>> sc.parallelize([10, 4, 2, 12, 3]).top(1)
        [12]
        >>> sc.parallelize([2, 3, 4, 5, 6], 2).top(2)
        [6, 5]
        >>> sc.parallelize([10, 4, 2, 12, 3]).top(3, key=str)
        [4, 3, 2]

    Author: Davies Liu

    Closes #2094 from davies/cmp and squashes the following commits:

    ccbaf25 [Davies Liu] add `key` to top()
    ad7e374 [Davies Liu] fix tests
    2f63512 [Davies Liu] change `comp` to `key` in min/max
    dd91e08 [Davies Liu] add `comp` argument for RDD.max() and RDD.min()

    (cherry picked from commit db436e36c4e20893de708a0bc07a5a8877c49563)

commit ca7b275db8b8022e4217eddd7ab014be2a95fa60
Author: Marcelo Vanzin
Date:   Wed Aug 20 16:23:10 2014 -0700

    [SPARK-2848] Shade Guava in uber-jars.

    For further discussion, please check the JIRA entry.

    This change moves Guava classes to a different package so that they
    don't conflict with the user-provided Guava (or the Hadoop-provided
    one). Since one class (Optional) was exposed through Spark's public
    API, that class was forked from Guava at the current dependency
    version (14.0.1) so that it can be kept going forward (until the API
    is cleaned up).

    Note this change has a few implications:
    - *all* classes in the final jars will reference the relocated
      classes. If Hadoop classes are included (i.e. "-Phadoop-provided"
      is not activated), those will also reference the Guava 14 classes
      (instead of the Guava 11 classes from the Hadoop classpath).
    - if the Guava version in Spark is ever changed, the new Guava will
      still reference the forked Optional class; this may or may not be
      a problem, but in the long term it's better to think about removing
      Optional from the public API.

    For the end user, there are two visible implications:
    - Guava is not provided as a transitive dependency anymore (since
      it's "provided" in Spark)
    - At runtime, unless they provide their own, they'll either have no
      Guava or Hadoop's version of Guava (11), depending on how they set
      up their classpath.

    Note that this patch does not change the sbt deliverables; those will
    still contain guava in its original package, and provide guava as a
    compile-time dependency. This assumes that maven is the canonical
    build, and sbt-built artifacts are not (officially) published.

    Author: Marcelo Vanzin

    Closes #1813 from vanzin/SPARK-2848 and squashes the following commits:

    9bdffb0 [Marcelo Vanzin] Undo sbt build changes.
    819b445 [Marcelo Vanzin] Review feedback.
    05e0a3d [Marcelo Vanzin] Merge branch 'master' into SPARK-2848
    fef4370 [Marcelo Vanzin] Unfork Optional.java.
    d3ea8e1 [Marcelo Vanzin] Exclude asm classes from final jar.
    637189b [Marcelo Vanzin] Add hacky filter to prefer Spark's copy of Optional.
    2fec990 [Marcelo Vanzin] Shade Guava in the sbt build.
    616998e [Marcelo Vanzin] Shade Guava in the maven build, fork Guava's Optional.java.

    (cherry picked from commit c9f743957fa963bc1dbed7a44a346ffce1a45cf2)

commit 7c0fa39d44dd090acf37af7af20a4ce85aad1d57
Author: bc Wong
Date:   Sat Sep 13 08:13:05 2014 -0700

    CLOUDERA-BUILD. Add mvn profile 'hadoop-2.5'

commit ee1a737610ce481b1a1440b9a152e015fd79a399
Author: Kostas Sakellis
Date:   Thu Sep 11 17:46:22 2014 -0700

    CLOUDERA-BUILD. Disable a missed NetworkReceiverSuite test.

commit 4c3f491bc28f59469e6aac54a891446d9f471d1e
Author: Sean Owen
Date:   Tue Sep 9 10:24:00 2014 -0700

    SPARK-3404 [BUILD] SparkSubmitSuite fails with "spark-submit exits with code 1"

    This fixes the `SparkSubmitSuite` failure by setting `spark.ui.port=0`
    in the Maven build, to match the SBT build. This avoids a port
    conflict which causes failures.

    (This also updates the `scalatest` plugin off of a release candidate,
    to the identical final release.)

    Author: Sean Owen

    Closes #2328 from srowen/SPARK-3404 and squashes the following commits:

    512d782 [Sean Owen] Set spark.ui.port=0 in Maven scalatest config to match SBT build and avoid SparkSubmitSuite failure due to port conflict

    (cherry picked from commit f0f1ba09b195f23f0c89af6fa040c9e01dfa8951)

commit cc7ec6c02c05b959aafd3804c42f90b626450c88
Author: Marcelo Vanzin
Date:   Thu May 15 09:38:36 2014 -0700

    CLOUDERA-BUILD. Fix running executors with hadoop-provided

    Fix running executors with the hadoop-provided profile.

commit 3d93bf347b45a399921b5ef5173a4a3617e22932
Author: Marcelo Vanzin
Date:   Fri May 30 11:26:40 2014 -0700

    CLOUDERA-BUILD. Disable NetworkReceiverSuite tests.

    These tests rely on very short timeouts which are easily triggered on
    our slow build clusters. They also have very fishy-looking checks that
    seem very thread-unsafe. So just disable them until we have time to
    look at them properly.

commit 39e1d67a2cf23e8ded136b25ed381662d1724db6
Author: Marcelo Vanzin
Date:   Thu May 29 18:24:11 2014 -0700

    CLOUDERA-BUILD. Forcefully disable tests for the sql/ project.

    These tests take a long time to run and are currently flaky. Disable
    them by default while we find time to investigate the issues.

commit c2cae5356d923342277ac4fe9a2be6ecd0105913
Author: Marcelo Vanzin
Date:   Tue May 13 17:08:38 2014 -0700

    CLOUDERA-BUILD. Fix netty dependency exclusion.

commit 56235ec399f7fadb2310c3ac7186038d3f2bb254
Author: Kostas Sakellis
Date:   Wed Sep 10 22:20:51 2014 -0700

    CLOUDERA-BUILD. Removed hive thrift server

    Removed the hive thrift server from the build and assembly.

commit ca3dae51ad212649a2cc6989f8ffa62f93e046ec
Author: Kostas Sakellis
Date:   Wed Sep 10 17:49:07 2014 -0700

    CLOUDERA-BUILD. Fix Spark SQL Hive build issues

    Fixes SparkSQL issues due to us using a newer version of Hive to
    build this project. Changes:
    - Update Hive TableDesc API (hive.git 4059a32f)
    - Updated Hive HiveDecimal API (hive.git a64d7d5e)
    - Updated Hive CommandProcessorFactory API (hive.git ead55008a)
    - Inline constants added in newer Parquet lib
    - Isolated sql/ project under a profile so that we can ignore it.
    - HIVE-3959: Update Partition Statistics (hive.git 27fd9ee7)
    - HIVE-5489: use metastore statistics to optimize max/min/etc. (hive.git e149b8dc)
    - HIVE-4113: Optimize select count(1) (hive.git 5be54f9d)
    - HIVE-6171: Use Paths consistently (hive.git 2666ac60)

commit e621a2ba2db1d036f52c31c6d503c74572e8b6ce
Author: Kostas Sakellis
Date:   Wed Sep 10 14:19:28 2014 -0700

    CLOUDERA-BUILD. Use CDH version of Hive

commit 996783a026bdd774f9725015318e934222988d12
Author: Marcelo Vanzin
Date:   Mon May 19 15:38:41 2014 -0700

    CLOUDERA-BUILD. Add Cloudera repo to pom.xml.

commit 58e285aa0d8407feefe8e40cc46c1f26b64cba72
Author: Kostas Sakellis
Date:   Wed Sep 10 12:41:13 2014 -0700

    CLOUDERA-BUILD. Use CDH versions for dependencies.

    Also fix the name of an hbase dependency.

commit 59cf8c46a32172af10ea33519fa62d545de8f081
Author: Kostas Sakellis
Date:   Wed Sep 10 12:21:50 2014 -0700

    CLOUDERA-BUILD. Bumping version to 1.1.0-cdh5.2.0-SNAPSHOT