commit 4957f681366a4834afbf26e64821e5916744fcad
Author: Kostas Sakellis
Date:   Mon Dec 15 11:31:14 2014 -0800

    CLOUDERA-BUILD. Remove SNAPSHOT from package.scala

commit ee84e7f0892ded94946113d0059865770f8fa4eb
Author: Jenkins slave
Date:   Thu Dec 11 10:22:34 2014 -0800

    Preparing for CDH5.3.0 release

commit b23b27d6c99787956235a6fb02c92293ab2c60e2
Author: Kostas Sakellis
Date:   Wed Dec 10 13:20:23 2014 -0800

    CLOUDERA-BUILD. CDH-23802 SparkSQL and Hive Fix

    Fixing kryo import since Hive shades the class.

commit 20ea55c8dd8c69a05b0c276061b4cd6f15f8e627
Author: Marcelo Vanzin
Date:   Tue Nov 25 12:10:02 2014 -0800

    CLOUDERA-BUILD. Preview of SPARK-4606.

    [SPARK-4606] Send EOF to child JVM when there's no more data to read.

commit 31a409bbba8a86b5bdc82e124c237ff4c3ddc80e
Author: Marcelo Vanzin
Date:   Tue Nov 18 10:53:27 2014 -0800

    CLOUDERA-BUILD. Comment out usage of new Parquet API.

    Comment out code that depends on PARQUET-84 since CDH doesn't have that feature yet. Replace input format with one that uses the old API, but works around SPARK-3536.

commit bb0204a384c3310ec156f8503e5b282c2e5963e1
Author: Marcelo Vanzin
Date:   Mon Nov 17 12:54:00 2014 -0800

    CLOUDERA-BUILD. Fix Spark SQL session manager.

    At least with CDH's Hive, the Spark wrappers are not correctly initializing the Hive classes. This fixes one instance that was causing test failures.

commit 6a785cca2aa1a46c8bc991f34b28bf0d594d7d15
Author: Marcelo Vanzin
Date:   Sat Nov 15 18:27:07 2014 -0800

    CLOUDERA-BUILD: Disable failing Hive tests.

    While we figure out why they're failing.

commit f267a4ac060314cbd0063be5a96da021d260c33b
Author: Marcelo Vanzin
Date:   Wed Nov 12 17:00:49 2014 -0800

    CLOUDERA-BUILD. Exclude hadoop-aws dependency.

    It's not used by Spark and it brings in an older com.fasterxml.jackson dependency which conflicts with the version used by other Spark dependencies.

commit 46132109d5de946134baeb45afa60c42f0fd6a16
Author: Marcelo Vanzin
Date:   Thu Nov 6 12:55:59 2014 -0800

    CLOUDERA-BUILD. Revert "[SPARK-2805] Upgrade Akka to 2.3.4"

    This reverts commit 411cf29fff011561f0093bb6101af87842828369.

    Conflicts:
        pom.xml

commit 97dfd0755414c7aad9029ff00fee1cc5100e229a
Author: Marcelo Vanzin
Date:   Wed Nov 5 16:55:24 2014 -0800

    CLOUDERA-BUILD. Disable some flaky tests.

    These tests rely too heavily on timeouts and often fail on loaded build machines.

commit 2a44bcdaf25e270445df28bf4efa27ac94c658c9
Author: Marcelo Vanzin
Date:   Wed Nov 5 14:33:02 2014 -0800

    CLOUDERA-BUILD. Bumping version to 1.2.0-cdh5.3.0-SNAPSHOT

commit a8d99a5343e8e6cfacf4614a65a5b56ec3598f29
Author: Marcelo Vanzin
Date:   Wed Nov 5 11:45:46 2014 -0800

    CLOUDERA-BUILD. Changes for CDH build.

    Adjusts dependency versions, adds Cloudera repos, and triggers all needed profiles based on the "cdh.build" property.

commit f4d4f637cdc62efe0de3e8497b010f3112559267
Author: Marcelo Vanzin
Date:   Tue Nov 18 10:48:42 2014 -0800

    CLOUDERA-BUILD. Preview of SPARK-4048.

    [SPARK-4048] Enhance and extend hadoop-provided profile.

    This change does a few things to make the hadoop-provided profile more useful:
    - Create new profiles for other libraries / services that might be provided by the infrastructure
    - Simplify and fix the poms so that the profiles are only activated while building assemblies.
    - Fix tests so that they're able to run when the profiles are activated
    - Add a new env variable to be used by distributions that use these profiles to provide the runtime classpath for Spark jobs and daemons.
commit ff6f59b2a94bf536e529c45ddf1e24b73096f2fe
Author: Josh Rosen
Date:   Tue Dec 9 23:47:05 2014 -0800

    [Minor] Use <sup> tag for help icon in web UI page header

    This small commit makes the `(?)` web UI help link into a superscript, which should address feedback that the current design makes it look like an error occurred or like information is missing.

    Before: ![image](https://cloud.githubusercontent.com/assets/50748/5370611/a3ed0034-7fd9-11e4-870f-05bd9faad5b9.png)
    After: ![image](https://cloud.githubusercontent.com/assets/50748/5370602/6c5ca8d6-7fd9-11e4-8d1a-568d71290aa7.png)

    Author: Josh Rosen

    Closes #3659 from JoshRosen/webui-help-sup and squashes the following commits:
    bd72899 [Josh Rosen] Use <sup> tag for help icon in web UI page header.

    (cherry picked from commit f79c1cfc997c1a7ddee480ca3d46f5341b69d3b7)
    Signed-off-by: Josh Rosen

commit 5e5d8f469a1bea9bbe606f772ccdcab7c184c651
Author: Reynold Xin
Date:   Tue Dec 9 19:29:09 2014 -0800

    Config updates for the new shuffle transport.

    Author: Reynold Xin

    Closes #3657 from rxin/conf-update and squashes the following commits:
    7370eab [Reynold Xin] Config updates for the new shuffle transport.

    (cherry picked from commit 9bd9334f588dbb44d01554f9f4ca68a153a48993)
    Signed-off-by: Aaron Davidson

commit 441ec3451730c7ae3dbef8952e313071d6147ab6
Author: Reynold Xin
Date:   Tue Dec 9 17:49:59 2014 -0800

    [SPARK-4740] Create multiple concurrent connections between two peer nodes in Netty.

    It's been reported that when the number of disks is large and the number of nodes is small, Netty network throughput is low compared with NIO. We suspect the problem is that only a small number of disks are utilized to serve shuffle files at any given point, due to connection reuse. This patch adds a new config parameter to specify the number of concurrent connections between two peer nodes, defaulting to 2.

    Author: Reynold Xin

    Closes #3625 from rxin/SPARK-4740 and squashes the following commits:
    ad4241a [Reynold Xin] Updated javadoc.
    f33c72b [Reynold Xin] Code review feedback.
    0fefabb [Reynold Xin] Use double check in synchronization.
    41dfcb2 [Reynold Xin] Added test case.
    9076b4a [Reynold Xin] Fixed two NPEs.
    3e1306c [Reynold Xin] Minor style fix.
    4f21673 [Reynold Xin] [SPARK-4740] Create multiple concurrent connections between two peer nodes in Netty.

    (cherry picked from commit 2b9b72682e587909a84d3ace214c22cec830eeaf)
    Signed-off-by: Reynold Xin
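As a rough illustration of the SPARK-4740 change above: the number of Netty connections kept between a pair of nodes is a transport configuration property. A minimal sketch, assuming the property name `spark.shuffle.io.numConnectionsPerPeer` and a local-mode session (the app name and job are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits on the 1.x line

// Sketch only: asks the shuffle transport layer to keep several
// concurrent connections per peer instead of reusing a single one,
// so more disks can serve shuffle blocks at the same time.
val conf = new SparkConf()
  .setAppName("shuffle-connections-demo")
  .setMaster("local[4]")
  .set("spark.shuffle.io.numConnectionsPerPeer", "2")

val sc = new SparkContext(conf)
// Any shuffle-heavy job would exercise these connections.
val groups = sc.parallelize(1 to 1000000).map(i => (i % 100, 1)).reduceByKey(_ + _).count()
println(groups)
sc.stop()
```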
commit b0d64e57255e5ca545c90f18bd9d10a07ae43759
Author: Sean Owen
Date:   Tue Dec 9 16:38:27 2014 -0800

    SPARK-4805 [CORE] BlockTransferMessage.toByteArray() trips assertion

    Allocate enough room for type byte as well as message, to avoid tripping assertion about capacity of the buffer

    Author: Sean Owen

    Closes #3650 from srowen/SPARK-4805 and squashes the following commits:
    9e1d502 [Sean Owen] Allocate enough room for type byte as well as message, to avoid tripping assertion about capacity of the buffer

    (cherry picked from commit d8f84f26e388055ca7459810e001d05ab60af15b)
    Signed-off-by: Aaron Davidson

commit 51da2c557b98aec8309db01ecf8dd0f39c494d28
Author: Sandy Ryza
Date:   Tue Dec 9 16:26:07 2014 -0800

    SPARK-4567. Make SparkJobInfo and SparkStageInfo serializable

    Author: Sandy Ryza

    Closes #3426 from sryza/sandy-spark-4567 and squashes the following commits:
    cb4b8d2 [Sandy Ryza] SPARK-4567. Make SparkJobInfo and SparkStageInfo serializable

    (cherry picked from commit 5e4c06f8e54265a4024857f5978ec54c936aeea2)
    Signed-off-by: Josh Rosen

commit 5a3a3cc1739e4d5004bc7117bd6afadf3142ec9b
Author: Kay Ousterhout
Date:   Tue Dec 9 15:10:36 2014 -0800

    [SPARK-4765] Make GC time always shown in UI.

    This commit removes the GC time for each task from the set of optional, additional metrics, and instead always shows it for each task. cc pwendell

    Author: Kay Ousterhout

    Closes #3622 from kayousterhout/gc_time and squashes the following commits:
    15ac242 [Kay Ousterhout] Make TaskDetailsClassNames private[spark]
    e71d893 [Kay Ousterhout] [SPARK-4765] Make GC time always shown in UI.

    (cherry picked from commit 1f5110630c1abb13a357b463c805a39772923b82)
    Signed-off-by: Kay Ousterhout

commit e68674200ad95ca88a8427f7b2253b97a03c4337
Author: Cheng Hao
Date:   Tue Dec 9 10:28:15 2014 -0800

    [SPARK-4785][SQL] Initialize Hive UDFs on the driver and serialize them with a wrapper

    Different from Hive 0.12.0, in Hive 0.13.1 UDF/UDAF/UDTF (aka Hive function) objects should only be initialized once on the driver side and then serialized to executors. However, not all function objects are serializable (e.g. GenericUDF doesn't implement Serializable). Hive 0.13.1 solves this issue with Kryo or XML serializer. Several utility ser/de methods are provided in class o.a.h.h.q.e.Utilities for this purpose. In this PR we chose Kryo for efficiency. The Kryo serializer used here is created in Hive. Spark Kryo serializer wasn't used because there's no available SparkConf instance.

    Author: Cheng Hao
    Author: Cheng Lian

    Closes #3640 from chenghao-intel/udf_serde and squashes the following commits:
    8e13756 [Cheng Hao] Update the comment
    74466a3 [Cheng Hao] refactor as feedbacks
    396c0e1 [Cheng Hao] avoid Simple UDF to be serialized
    e9c3212 [Cheng Hao] update the comment
    19cbd46 [Cheng Hao] support udf instance ser/de after initialization

    (cherry picked from commit 383c5555c9f26c080bc9e3a463aab21dd5b3797f)
    Signed-off-by: Michael Armbrust

commit 31a6d4fede28d46cd379f788678cc33b0b982d14
Author: Cheng Hao
Date:   Mon Dec 8 17:39:12 2014 -0800

    [SPARK-4769] [SQL] CTAS does not work when reading from temporary tables

    This is the code refactor and follow ups for #2570

    Author: Cheng Hao

    Closes #3336 from chenghao-intel/createtbl and squashes the following commits:
    3563142 [Cheng Hao] remove the unused variable
    e215187 [Cheng Hao] eliminate the compiling warning
    4f97f14 [Cheng Hao] fix bug in unittest
    5d58812 [Cheng Hao] revert the API changes
    b85b620 [Cheng Hao] fix the regression of temp tabl not found in CTAS

    (cherry picked from commit 51b1fe1426ffecac6c4644523633ea1562ff9a4e)
    Signed-off-by: Michael Armbrust

commit f4160324c55b4d168421af5473ce306bc03a77bb
Author: Sandy Ryza
Date:   Mon Dec 8 16:28:36 2014 -0800

    SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN

    Author: Sandy Ryza

    Closes #3624 from sryza/sandy-spark-4770 and squashes the following commits:
    bd81a3a [Sandy Ryza] SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN

    (cherry picked from commit cda94d15ea2a70ed3f0651ba2766b1e2f80308c1)
    Signed-off-by: Josh Rosen

commit 9ed5641a5a4425278283896928efa4e382fb74d8
Author: Kostas Sakellis
Date:   Mon Dec 8 15:44:18 2014 -0800

    [SPARK-4774] [SQL] Makes HiveFromSpark more portable

    HiveFromSpark read the kv1.txt file from SPARK_HOME/examples/src/main/resources/kv1.txt which assumed you had a source tree checked out.
    Now we copy the kv1.txt file to a temporary file and delete it when the jvm shuts down. This allows us to run this example outside of a spark source tree.

    Author: Kostas Sakellis

    Closes #3628 from ksakellis/kostas-spark-4774 and squashes the following commits:
    6770f83 [Kostas Sakellis] [SPARK-4774] [SQL] Makes HiveFromSpark more portable

    (cherry picked from commit d6a972b3e4dc35a2d95df47d256462b325f4bda6)
    Signed-off-by: Michael Armbrust

commit 6b9e8b081655f71f7ff2c4238254f7aaa110723c
Author: Takeshi Yamamuro
Date:   Sun Dec 7 19:42:02 2014 -0800

    [SPARK-4620] Add unpersist in Graph and GraphImpl

    Add an IF to uncache both vertices and edges of Graph/GraphImpl. This IF is useful when iterative graph operations build a new graph in each iteration, and the vertices and edges of previous iterations are no longer needed for following iterations.

    Author: Takeshi Yamamuro

    This patch had conflicts when merged, resolved by
    Committer: Ankur Dave

    Closes #3476 from maropu/UnpersistInGraphSpike and squashes the following commits:
    77a006a [Takeshi Yamamuro] Add unpersist in Graph and GraphImpl

    (cherry picked from commit 8817fc7fe8785d7b11138ca744f22f7e70f1f0a0)
    Signed-off-by: Ankur Dave
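A sketch of the iterative-use pattern the SPARK-4620 entry above targets; the update function and round count are illustrative, and `Graph.unpersist` is the method the commit adds:

```scala
import org.apache.spark.graphx.{Graph, VertexId}

// Illustrative iterative computation: each round builds a new graph,
// so the previous round's cached vertices and edges can be dropped.
def iterate(initial: Graph[Double, Double], rounds: Int): Graph[Double, Double] = {
  var g = initial.cache()
  for (_ <- 1 to rounds) {
    val next = g.mapVertices((_: VertexId, attr: Double) => attr * 0.85).cache()
    next.edges.count()             // materialize the new graph first
    g.unpersist(blocking = false)  // then release the old one (SPARK-4620)
    g = next
  }
  g
}
```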
commit a4ae7c8b533b3998484879439c0982170c3c38a7
Author: Takeshi Yamamuro
Date:   Sun Dec 7 19:36:08 2014 -0800

    [SPARK-4646] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark

    This patch just replaces a native quick sorter with Sorter(TimSort) in Spark. It could get performance gains by ~8% in my quick experiments.

    Author: Takeshi Yamamuro

    Closes #3507 from maropu/TimSortInEdgePartitionBuilderSpike and squashes the following commits:
    8d4e5d2 [Takeshi Yamamuro] Remove a wildcard import
    3527e00 [Takeshi Yamamuro] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark

    (cherry picked from commit 2e6b736b0e6e5920d0523533c87832a53211db42)
    Signed-off-by: Ankur Dave

commit 27d9f13af2df3bd7af029cf7ac48443ba6f4d6e0
Author: GuoQiang Li
Date:   Sat Dec 6 00:56:51 2014 -0800

    [SPARK-3623][GraphX] GraphX should support the checkpoint operation

    Author: GuoQiang Li

    Closes #2631 from witgo/SPARK-3623 and squashes the following commits:
    a70c500 [GuoQiang Li] Remove java related
    4d1e249 [GuoQiang Li] Add comments
    e682724 [GuoQiang Li] Graph should support the checkpoint operation

    (cherry picked from commit e895e0cbecbbec1b412ff21321e57826d2d0a982)
    Signed-off-by: Ankur Dave

commit 11446a6488fa95aca75e94f8fbecea80dc8f5331
Author: CrazyJvm
Date:   Fri Dec 5 13:42:13 2014 -0800

    Streaming doc : do you mean inadvertently?

    Author: CrazyJvm

    Closes #3620 from CrazyJvm/streaming-foreachRDD and squashes the following commits:
    b72886b [CrazyJvm] do you mean inadvertently?

    (cherry picked from commit 6eb1b6f6204ea3c8083af3fb9cd990d9f3dac89d)
    Signed-off-by: Reynold Xin

commit e8d8077bfc3e667a61dc261d2bee80d2a9f1eed3
Author: Cheng Lian
Date:   Fri Dec 5 10:27:40 2014 -0800

    [SPARK-4761][SQL] Enables Kryo by default in Spark SQL Thrift server

    Enables Kryo and disables reference tracking by default in Spark SQL Thrift server. Configurations explicitly defined by users in `spark-defaults.conf` are respected (the Thrift server is started by `spark-submit`, which handles configuration properties properly).

    [Review on Reviewable](https://reviewable.io/reviews/apache/spark/3621)

    Author: Cheng Lian

    Closes #3621 from liancheng/kryo-by-default and squashes the following commits:
    70c2775 [Cheng Lian] Enables Kryo by default in Spark SQL Thrift server

    (cherry picked from commit 6f61e1f961826a6c9e98a66d10b271b7e3c7dd55)
    Signed-off-by: Patrick Wendell

commit d12ea49f56e9ffa9576a94cda99c066910c1425d
Author: Michael Armbrust
Date:   Thu Dec 4 22:25:21 2014 -0800

    [SPARK-4753][SQL] Use catalyst for partition pruning in newParquet.

    Author: Michael Armbrust

    Closes #3613 from marmbrus/parquetPartitionPruning and squashes the following commits:
    4f138f8 [Michael Armbrust] Use catalyst for partition pruning in newParquet.

    (cherry picked from commit f5801e813f3c2573ebaf1af839341489ddd3ec78)
    Signed-off-by: Patrick Wendell

commit a8d8077dc01088a49452602f2f2be9cefbce6b4b
Author: Andrew Or
Date:   Thu Dec 4 21:54:48 2014 -0800

    Revert "SPARK-2624 add datanucleus jars to the container in yarn-cluster"

    This reverts commit a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53.

commit 325babe8a3c1ab8cc10cc7cee5b6a53757774154
Author: Andrew Or
Date:   Thu Dec 4 21:54:37 2014 -0800

    Revert "[HOT FIX] [YARN] Check whether `/lib` exists before listing its files"

    This reverts commit 38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce.

commit 6c436317881f384bbd760d84a0063d39e96229da
Author: Masayoshi TSUZUKI
Date:   Thu Dec 4 19:33:02 2014 -0800

    [SPARK-4464] Description about configuration options need to be modified in docs.

    Added description about -h and -host. Modified description about -i and -ip which are now deprecated. Added description about --properties-file.

    Author: Masayoshi TSUZUKI

    Closes #3329 from tsudukim/feature/SPARK-4464 and squashes the following commits:
    6c07caf [Masayoshi TSUZUKI] [SPARK-4464] Description about configuration options need to be modified in docs.

    (cherry picked from commit ca379039f701e423fa07933db4e063cb85d0236a)
    Signed-off-by: Josh Rosen

commit 63b1bc14ae131ee68959ff9c98a768a19cd6b5ba
Author: Andy Konwinski
Date:   Thu Dec 4 18:27:02 2014 -0800

    Fix typo in Spark SQL docs.

    Author: Andy Konwinski

    Closes #3611 from andyk/patch-3 and squashes the following commits:
    7bab333 [Andy Konwinski] Fix typo in Spark SQL docs.

    (cherry picked from commit 15cf3b0125fe238dea2ce13e703034ba7cef477f)
    Signed-off-by: Josh Rosen

commit b905e114e2084535dd78f29627b762505438e254
Author: Masayoshi TSUZUKI
Date:   Thu Dec 4 18:14:36 2014 -0800

    [SPARK-4421] Wrong link in spark-standalone.html

    Modified the link of building Spark.

    Author: Masayoshi TSUZUKI

    Closes #3279 from tsudukim/feature/SPARK-4421 and squashes the following commits:
    56e31c1 [Masayoshi TSUZUKI] Modified the link of building Spark.

    (cherry picked from commit ddfc09c36381a0880dfa6778be2ca0bc7d80febf)
    Signed-off-by: Josh Rosen

commit f5c5647b863b593eae88fe5b6ba580413584ed06
Author: lewuathe
Date:   Thu Dec 4 15:14:36 2014 -0800

    [SPARK-4652][DOCS] Add docs about spark-git-repo option

    There might be some cases when a WIP Spark version needs to be run on an EC2 cluster. In order to setup this type of cluster more easily, add --spark-git-repo option description to ec2 documentation.
    Author: lewuathe
    Author: Josh Rosen

    Closes #3513 from Lewuathe/doc-for-development-spark-cluster and squashes the following commits:
    6dae8ee [lewuathe] Wrap consistent with other descriptions
    cfaf9be [lewuathe] Add docs about spark-git-repo option (Editing / cleanup by Josh Rosen)

    (cherry picked from commit ab8177da2defab1ecd8bc0cd5a21f07be5b8d2c5)
    Signed-off-by: Josh Rosen

commit 0d159de39a21ab82d4ecdfa2d88fa525339daee1
Author: Saldanha
Date:   Thu Dec 4 14:22:09 2014 -0800

    [SPARK-4459] Change groupBy type parameter from K to U

    Please see https://issues.apache.org/jira/browse/SPARK-4459

    Author: Saldanha

    Closes #3327 from alokito/master and squashes the following commits:
    54b1095 [Saldanha] [SPARK-4459] changed type parameter for keyBy from K to U
    d5f73c3 [Saldanha] [SPARK-4459] added keyBy test
    316ad77 [Saldanha] SPARK-4459 changed type parameter for groupBy from K to U.
    62ddd4b [Saldanha] SPARK-4459 added failing unit test

    (cherry picked from commit 743a889d2778f797aabc3b1e8146e7aa32b62a48)
    Signed-off-by: Josh Rosen

commit a00d0aa6e8cb64f00656fdf4d46ea7842b884e5e
Author: alexdebrie
Date:   Thu Dec 4 14:13:59 2014 -0800

    [SPARK-4745] Fix get_existing_cluster() function with multiple security groups

    The current get_existing_cluster() function would only find an instance belonging to a cluster if the instance's security groups == cluster_name + "-master" (or "-slaves"). This fix allows for multiple security groups by checking if the cluster_name + "-master" security group is in the list of groups for a particular instance.

    Author: alexdebrie

    Closes #3596 from alexdebrie/master and squashes the following commits:
    9d51232 [alexdebrie] Fix get_existing_cluster() function with multiple security groups

    (cherry picked from commit 794f3aec24acb578e258532ad0590554d07958ba)
    Signed-off-by: Josh Rosen

commit bc05df8a23ba7ad485f6844f28f96551b13ba461
Author: Patrick Wendell
Date:   Thu Dec 4 20:15:15 2014 +0000

    Preparing development version 1.2.1-SNAPSHOT

commit 2b72c569a674cccf79ebbe8d067b8dbaaf78007f
Author: Patrick Wendell
Date:   Thu Dec 4 20:15:15 2014 +0000

    Preparing Spark release v1.2.0-rc2

commit ead01b6d5730c7cf238811b19fa42336236ec7dc
Author: Patrick Wendell
Date:   Thu Dec 4 12:11:41 2014 -0800

    [HOTFIX] Fixing two issues with the release script.

    1. The version replacement was still producing some false changes.
    2. Uploads to the staging repo specifically.

    Author: Patrick Wendell

    Closes #3608 from pwendell/release-script and squashes the following commits:
    3c63294 [Patrick Wendell] Fixing two issues with the release script:

    (cherry picked from commit 8dae26f83818ee0f5ce8e5b083625170d2e901c5)
    Signed-off-by: Patrick Wendell

commit d9aee07fe1f5381e5c0ceae5a3e7d96d945f4288
Author: WangTaoTheTonic
Date:   Thu Dec 4 11:52:47 2014 -0800

    [SPARK-4253] Ignore spark.driver.host in yarn-cluster and standalone-cluster modes

    In yarn-cluster and standalone-cluster modes, we don't know where the driver will run until it is launched. If the `spark.driver.host` property is set on the submitting machine and propagated to the driver through SparkConf then this will lead to errors when the driver launches. This patch fixes this issue by dropping the `spark.driver.host` property in SparkSubmit when running in a cluster deploy mode.
    Author: WangTaoTheTonic
    Author: WangTao

    Closes #3112 from WangTaoTheTonic/SPARK4253 and squashes the following commits:
    ed1a25c [WangTaoTheTonic] revert unrelated formatting issue
    02c4e49 [WangTao] add comment
    32a3f3f [WangTaoTheTonic] ingore it in SparkSubmit instead of SparkContext
    667cf24 [WangTaoTheTonic] document fix
    ff8d5f7 [WangTaoTheTonic] also ignore it in standalone cluster mode
    2286e6b [WangTao] ignore spark.driver.host in yarn-cluster mode

    (cherry picked from commit 8106b1e36b2c2b9f5dc5d7252540e48cc3fc96d5)
    Signed-off-by: Josh Rosen

commit 078894c7f25e139a91da4fac5b4875b738c67443
Author: Patrick Wendell
Date:   Thu Dec 4 11:22:25 2014 -0800

    Revert "Preparing Spark release v1.2.0-rc1"

    This reverts commit 1056e9ec13203d0c51564265e94d77a054498fdb.

commit 701019bf259c3b270b2aeedc4a16caf0f221b8b4
Author: Patrick Wendell
Date:   Thu Dec 4 11:22:22 2014 -0800

    Revert "Preparing development version 1.2.1-SNAPSHOT"

    This reverts commit 00316cc87983b844f6603f351a8f0b84fe1f6035.

commit 2c6e2876b3f57aff9ed88626d95bd84e4f25098f
Author: Patrick Wendell
Date:   Thu Dec 4 11:22:19 2014 -0800

    Revert "HOTFIX: Rolling back incorrect version change"

    This reverts commit 3a4609eada2ee0bfbcce0f4127b6a5363ae528e5.

commit 2fbe488a0cd7814fbd4f88041c01e68d2796258c
Author: Cheng Lian
Date:   Thu Dec 4 10:21:03 2014 -0800

    [SPARK-4683][SQL] Add a beeline.cmd to run on Windows

    Tested locally with a Win7 VM. Connected to a Spark SQL Thrift server instance running on Mac OS X with the following command line:

    ```
    bin\beeline.cmd -u jdbc:hive2://10.0.2.2:10000 -n lian
    ```

    [Review on Reviewable](https://reviewable.io/reviews/apache/spark/3599)

    Author: Cheng Lian

    Closes #3599 from liancheng/beeline.cmd and squashes the following commits:
    79092e7 [Cheng Lian] Windows script for BeeLine

    (cherry picked from commit 28c7acacef974fdabd2b9ecc20d0d6cf6c58728f)
    Signed-off-by: Patrick Wendell

commit 34fdca0a55ba636d6ebcc7a588df81e042a07827
Author: Xiangrui Meng
Date:   Thu Dec 4 20:16:35 2014 +0800

    [FIX][DOC] Fix broken links in ml-guide.md and some minor changes in ScalaDoc.

    Author: Xiangrui Meng

    Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits:
    c559768 [Xiangrui Meng] minor code update
    ce94da8 [Xiangrui Meng] Java Bean -> JavaBean
    0b5c182 [Xiangrui Meng] fix links in ml-guide

    (cherry picked from commit 7e758d709286e73d2c878d4a2d2b4606386142c7)
    Signed-off-by: Xiangrui Meng

commit 266a81492d48cb4f7c2ada9d490e1919fdc506aa
Author: Joseph K. Bradley
Date:   Thu Dec 4 17:00:06 2014 +0800

    [SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes

    Documentation:
    * Added ml-guide.md, linked from mllib-guide.md
    * Updated mllib-guide.md with small section pointing to ml-guide.md

    Examples:
    * CrossValidatorExample
    * SimpleParamsExample
    * (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)

    Bug fixes:
    * PipelineModel: did not use ParamMaps correctly
    * UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!)

    CC: mengxr shivaram etrain

    Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.

    Author: Joseph K. Bradley
    Author: jkbradley
    Author: Xiangrui Meng

    Closes #3588 from jkbradley/ml-package-docs and squashes the following commits:
    d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit). updated examples for CV and Params for spark.ml
    c38469c [Joseph K. Bradley] Updated ml-guide with CV examples
    99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params. Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
    ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs
    3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype
    41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version. CrossValidatorExample not working yet. Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.

    (cherry picked from commit 469a6e5f3bdd5593b3254bc916be8236e7c6cb74)
    Signed-off-by: Xiangrui Meng

commit bf720ef98f49bcc49b9a3b1a281b2373bf8d739a
Author: Joseph K. Bradley
Date:   Thu Dec 4 00:59:32 2014 -0800

    [docs] Fix outdated comment in tuning guide

    When you use the SPARK_JAVA_OPTS env variable, Spark complains:

    ```
    SPARK_JAVA_OPTS was detected (set to ' -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps ').
    This is deprecated in Spark 1.0+.
    Please instead use:
     - ./spark-submit with conf/spark-defaults.conf to set defaults for an application
     - ./spark-submit with --driver-java-options to set -X options for a driver
     - spark.executor.extraJavaOptions to set -X options for executors
     - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
    ```

    This updates the docs to redirect the user to the relevant part of the configuration docs.

    CC: mengxr but please CC someone else as needed

    Author: Joseph K. Bradley

    Closes #3592 from jkbradley/tuning-doc and squashes the following commits:
    0760ce1 [Joseph K. Bradley] fixed outdated comment in tuning guide

    (cherry picked from commit 529439bd506949f272a2b6f099ea549b097428f3)
    Signed-off-by: Reynold Xin

commit dec838bcbd6e3ba5844173036f5caae3e67eb490
Author: Aaron Davidson
Date:   Thu Dec 4 00:58:42 2014 -0800

    [SQL] Minor: Avoid calling Seq#size in a loop

    Just found this instance while doing some jstack-based profiling of a Spark SQL job. It is very unlikely that this is causing much of a perf issue anywhere, but it is unnecessarily suboptimal.

    Author: Aaron Davidson

    Closes #3593 from aarondav/seq-opt and squashes the following commits:
    962cdfc [Aaron Davidson] [SQL] Minor: Avoid calling Seq#size in a loop

    (cherry picked from commit c6c7165e7ecf1690027d6bd4e0620012cd0d2310)
    Signed-off-by: Reynold Xin
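The Seq#size entry above reflects a general Scala pitfall: for linear sequences such as `List`, `size` is O(n), so re-evaluating it in a loop condition turns the loop quadratic. A small illustrative sketch, not the actual Spark SQL code the commit touched:

```scala
// Wasteful: for a linear Seq (e.g. List), xs.size walks the whole
// sequence on every loop-condition check.
def sumSlow(xs: Seq[Int]): Int = {
  var i = 0; var acc = 0
  while (i < xs.size) { acc += xs(i); i += 1 }
  acc
}

// Better: evaluate the size once and reuse it (indexing is O(1)
// here because the parameter is an IndexedSeq).
def sumFast(xs: IndexedSeq[Int]): Int = {
  val n = xs.size            // hoisted out of the loop
  var i = 0; var acc = 0
  while (i < n) { acc += xs(i); i += 1 }
  acc
}
```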
commit 2605acb043fd6693cbade67809f7bbe64e7c1b61
Author: lewuathe
Date:   Thu Dec 4 16:51:41 2014 +0800

    [SPARK-4685] Include all spark.ml and spark.mllib packages in JavaDoc's MLlib group

    This is #3554 from Lewuathe except that I put both `spark.ml` and `spark.mllib` in the group `MLlib`. Closes #3554 jkbradley

    Author: lewuathe
    Author: Xiangrui Meng

    Closes #3598 from mengxr/Lewuathe-modify-javadoc-setting and squashes the following commits:
    184609a [Xiangrui Meng] merge spark.ml and spark.mllib into the same group in javadoc
    f7535e6 [lewuathe] [SPARK-4685] Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

    (cherry picked from commit 20bfea4ab7c0923e8d3f039d0c5098669db4d5b0)
    Signed-off-by: Xiangrui Meng

commit f9e1f89b2500287ff284317fe4504bd32d3b8e1a
Author: Andrew Or
Date:   Wed Dec 3 19:08:29 2014 -0800

    [Release] Correctly translate contributors name in release notes

    This commit involves three main changes:

    (1) It separates the translation of contributor names from the generation of the contributors list. This is largely motivated by the Github API limit; even if we exceed this limit, we should at least be able to proceed manually as before. This is why the translation logic is abstracted into its own script translate-contributors.py.

    (2) When we look for candidate replacements for invalid author names, we should look for the assignees of the associated JIRAs too. As a result, the intermediate file must keep track of these.

    (3) This provides an interactive mode with which the user can sit at the terminal and manually pick the candidate replacement that he/she thinks makes the most sense. As before, there is a non-interactive mode that picks the first candidate that the script considers "valid."

    TODO: We should have a known_contributors file that stores known mappings so we don't have to go through all of this translation every time. This is also valuable because some contributors simply cannot be automatically translated.

    Conflicts:
        .gitignore

commit 9880bb481943b45cb5ad981809cf5cbd7b0639bb
Author: Joseph K. Bradley
Date:   Thu Dec 4 09:57:50 2014 +0800

    [SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix

    Major changes:
    * Added programming guide sections for tree ensembles
    * Added examples for tree ensembles
    * Updated DecisionTree programming guide with more info on parameters
    * **API change**: Standardized the tree parameter for the number of classes (for classification)

    Minor changes:
    * Updated decision tree documentation
    * Updated existing tree and tree ensemble examples
    * Use train/test split, and compute test error instead of training error.
    * Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)

    Note: I know this is a lot of lines, but most is covered by:
    * Programming guide sections for gradient boosting and random forests. (The changes are probably best viewed by generating the docs locally.)
    * New examples (which were copied from the programming guide)
    * The "numClasses" renaming

    I have run all examples and relevant unit tests.

    CC: mengxr manishamde codedeft

    Author: Joseph K. Bradley
    Author: Joseph K. Bradley

    Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:
    70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
    d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
    8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
    6fab846 [Joseph K. Bradley] small fixes based on review
    b9f8576 [Joseph K. Bradley] updated decision tree doc
    375204c [Joseph K. Bradley] fixed python style
    2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file. added header. Fixed small bug in same example in the programming guide.
    706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
    c76c823 [Joseph K. Bradley] added migration guide for mllib
    abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
    07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
    cdfdfbc [Joseph K. Bradley] added examples for GBT
    6372a2b [Joseph K. Bradley] updated decision tree examples to use random split. tested all of them.
    ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide. still need to update their examples

    (cherry picked from commit 657a88835d8bf22488b53d50f75281d7dc32442e)
    Signed-off-by: Xiangrui Meng

commit 4259ca8dd1217e135a1b2656307c33f2d48f6f50
Author: Joseph K. Bradley
Date:   Thu Dec 4 08:58:03 2014 +0800

    [SPARK-4711] [mllib] [docs] Programming guide advice on choosing optimizer

    I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include a brief statement about this (so the user does not have to read the whole optimization section).

    CC: mengxr

    Author: Joseph K. Bradley

    Closes #3569 from jkbradley/lr-doc and squashes the following commits:
    654aeb5 [Joseph K. Bradley] updated section header for mllib-optimization
    5035ad0 [Joseph K. Bradley] updated based on review
    94f6dec [Joseph K. Bradley] Updated linear methods and optimization docs with quick advice on choosing an optimization method

    (cherry picked from commit 27ab0b8a03b711e8d86b6167df833f012205ccc7)
    Signed-off-by: Xiangrui Meng

commit fe28ee2d13e0799120136419deec094752d2a370
Author: Reynold Xin
Date:   Wed Dec 3 16:28:24 2014 -0800

    [SPARK-4085] Propagate FetchFailedException when Spark fails to read local shuffle file.

    cc aarondav kayousterhout pwendell This should go into 1.2?

    Author: Reynold Xin

    Closes #3579 from rxin/SPARK-4085 and squashes the following commits:
    255b4fd [Reynold Xin] Updated test.
    f9814d9 [Reynold Xin] Code review feedback.
    2afaf35 [Reynold Xin] [SPARK-4085] Propagate FetchFailedException when Spark fails to read local shuffle file.

    (cherry picked from commit 1826372d0a1bc80db9015106dd5d2d155ada33f5)
    Signed-off-by: Patrick Wendell

commit 6b6b7791d544376f8010b20e839c1627a71c69cb
Author: Mark Hamstra
Date:   Wed Dec 3 15:08:01 2014 -0800

    [SPARK-4498][core] Don't transition ExecutorInfo to RUNNING until Driver adds Executor

    The ExecutorInfo only reaches the RUNNING state if the Driver is alive to send the ExecutorStateChanged message to master. Else, appInfo.resetRetryCount() is never called and failing Executors will eventually exceed ApplicationState.MAX_NUM_RETRY, resulting in the application being removed from the master's accounting.

    Author: Mark Hamstra

    Closes #3550 from markhamstra/SPARK-4498 and squashes the following commits:
    8f543b1 [Mark Hamstra] Don't transition ExecutorInfo to RUNNING until Executor is added by Driver

commit 47931975eaffaf6f4c2a9b65d56a2f25806a2e12
Author: Michael Armbrust
Date:   Wed Dec 3 14:13:35 2014 -0800

    [SPARK-4552][SQL] Avoid exception when reading empty parquet data through Hive

    This is a very small fix that catches one specific exception and returns an empty table. #3441 will address this in a more principled way.
    Author: Michael Armbrust

    Closes #3586 from marmbrus/fixEmptyParquet and squashes the following commits:
    2781d9f [Michael Armbrust] Handle empty lists for newParquet
    04dd376 [Michael Armbrust] Avoid exception when reading empty parquet data through Hive

    (cherry picked from commit 513ef82e85661552e596d0b483b645ac24e86d4d)
    Signed-off-by: Michael Armbrust

commit 38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce
Author: Andrew Or
Date:   Wed Dec 3 13:56:23 2014 -0800

    [HOT FIX] [YARN] Check whether `/lib` exists before listing its files

    This is caused by a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53

    Author: Andrew Or

    Closes #3589 from andrewor14/yarn-hot-fix and squashes the following commits:
    a4fad5f [Andrew Or] Check whether lib directory exists before listing its files

    (cherry picked from commit 90ec643e9af4c8bbb9000edca08c07afb17939c7)
    Signed-off-by: Andrew Or

commit 4a71e08534b92710fd8d1eb17b077c6c7b78e55d
Author: Masayoshi TSUZUKI
Date:   Wed Dec 3 13:16:24 2014 -0800

    [SPARK-4642] Add description about spark.yarn.queue to running-on-YARN document.

    Added descriptions about these parameters:
    - spark.yarn.queue

    Modified description about the default value of this parameter:
    - spark.yarn.submit.file.replication

    Author: Masayoshi TSUZUKI

    Closes #3500 from tsudukim/feature/SPARK-4642 and squashes the following commits:
    ce99655 [Masayoshi TSUZUKI] better gramatically.
    21cf624 [Masayoshi TSUZUKI] Removed intentionally undocumented properties.
    88cac9b [Masayoshi TSUZUKI] [SPARK-4642] Documents about running-on-YARN needs update

    (cherry picked from commit 692f49378f7d384d5c9c5ab7451a1c1e66f91c50)
    Signed-off-by: Andrew Or

commit 1ee65b4f98f5db397f447047acda2179fee6c7c0
Author: zsxwing
Date:   Wed Dec 3 12:19:40 2014 -0800

    [SPARK-4715][Core] Make sure tryToAcquire won't return a negative value

    ShuffleMemoryManager.tryToAcquire may return a negative value. The unit test demonstrates this bug. It will output `0 did not equal -200 granted is negative`.

    Author: zsxwing

    Closes #3575 from zsxwing/SPARK-4715 and squashes the following commits:
    a193ae6 [zsxwing] Make sure tryToAcquire won't return a negative value

    (cherry picked from commit edd3cd477c9d6016bd977c2fa692fdeff5a6e198)
    Signed-off-by: Andrew Or

commit 614e68636c56dbadf3ec1b7e16ee1d9bf5f8948a
Author: Masayoshi TSUZUKI
Date:   Wed Dec 3 12:08:00 2014 -0800

    [SPARK-4701] Typo in sbt/sbt

    Modified typo.

    Author: Masayoshi TSUZUKI

    Closes #3560 from tsudukim/feature/SPARK-4701 and squashes the following commits:
    ed2a3f1 [Masayoshi TSUZUKI] Another whitespace position error.
    1af3a35 [Masayoshi TSUZUKI] [SPARK-4701] Typo in sbt/sbt

    (cherry picked from commit 96786e3ee53a13a57463b74bec0e77b172f719a3)
    Signed-off-by: Andrew Or

commit 163fd785a0ee41209ecdccc8c28c9f458a4d34d1
Author: Jim Lim
Date:   Wed Dec 3 11:16:02 2014 -0800

    SPARK-2624 add datanucleus jars to the container in yarn-cluster

    If `spark-submit` finds the datanucleus jars, it adds them to the driver's classpath, but does not add it to the container. This patch modifies the yarn deployment class to copy all `datanucleus-*` jars found in `[spark-home]/libs` to the container.
    Author: Jim Lim

    Closes #3238 from jimjh/SPARK-2624 and squashes the following commits:
    3633071 [Jim Lim] SPARK-2624 update documentation and comments
    fe95125 [Jim Lim] SPARK-2624 keep java imports together
    6c31fe0 [Jim Lim] SPARK-2624 update documentation
    6690fbf [Jim Lim] SPARK-2624 add tests
    d28d8e9 [Jim Lim] SPARK-2624 add spark.yarn.datanucleus.dir option
    84e6cba [Jim Lim] SPARK-2624 add datanucleus jars to the container in yarn-cluster

commit b63e94175f1f1c4fe44f78b9b82dd3d8d2d81f5a
Author: DB Tsai
Date:   Wed Dec 3 22:31:39 2014 +0800

    [SPARK-4717][MLlib] Optimize BLAS library to avoid de-reference multiple times in loop

    Have a local reference to `values` and `indices` array in the `Vector` object so JVM can locate the value with one operation call. See `SPARK-4581` for similar optimization, and the bytecode analysis.

    Author: DB Tsai

    Closes #3577 from dbtsai/blasopt and squashes the following commits:
    62d38c4 [DB Tsai] formating
    0316cef [DB Tsai] first commit

    (cherry picked from commit d00542987ed80635782dcc826fc0bdbf434fff10)
    Signed-off-by: Xiangrui Meng
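The SPARK-4717 pattern above is worth spelling out: every field read through an object reference inside a hot loop costs an extra dereference, while hoisting the field into a local lets the JIT keep it in a register. A generic sketch with a hypothetical class, not the actual MLlib BLAS code:

```scala
// Hypothetical sparse vector, used only to illustrate the optimization.
class SparseVec(val indices: Array[Int], val values: Array[Double])

// Each v.values / v.indices access goes through the object reference.
def dotSlow(v: SparseVec, dense: Array[Double]): Double = {
  var acc = 0.0; var i = 0
  while (i < v.values.length) { acc += v.values(i) * dense(v.indices(i)); i += 1 }
  acc
}

// Hoist the arrays into locals once, as the SPARK-4717 change does.
def dotFast(v: SparseVec, dense: Array[Double]): Double = {
  val values = v.values    // single dereference
  val indices = v.indices
  val n = values.length
  var acc = 0.0; var i = 0
  while (i < n) { acc += values(i) * dense(indices(i)); i += 1 }
  acc
}
```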
commit 8ff7a286d76d1b93729539649c8f2264c98c072e
Author: DB Tsai
Date:   Wed Dec 3 19:01:56 2014 +0800

    [SPARK-4708][MLLib] Make k-mean runs two/three times faster with dense/sparse sample

    Note that the usage of `breezeSquaredDistance` in `org.apache.spark.mllib.util.MLUtils.fastSquaredDistance` is in the critical path, and `breezeSquaredDistance` is slow. We should replace it with our own implementation.

    Here is the benchmark against mnist8m dataset.

    Before: DenseVector: 70.04secs, SparseVector: 59.05secs
    With this PR: DenseVector: 30.58secs, SparseVector: 21.14secs

    Author: DB Tsai

    Closes #3565 from dbtsai/kmean and squashes the following commits:
    08bc068 [DB Tsai] restyle
    de24662 [DB Tsai] address feedback
    b185a77 [DB Tsai] cleanup
    4554ddd [DB Tsai] first commit

    (cherry picked from commit 7fc49ed91168999d24ae7b4cc46fbb4ec87febc1)
    Signed-off-by: Xiangrui Meng

commit fb14bfdd9e0668bc02dc48b2106710db9a0e3cce
Author: Joseph K. Bradley
Date:   Wed Dec 3 18:50:03 2014 +0800

    [SPARK-4710] [mllib] Eliminate MLlib compilation warnings

    Renamed StreamingKMeans to StreamingKMeansExample to avoid warning about name conflict with StreamingKMeans class. Added import to DecisionTreeRunner to eliminate warning.

    CC: mengxr

    Author: Joseph K. Bradley

    Closes #3568 from jkbradley/ml-compilation-warnings and squashes the following commits:
    64d6bc4 [Joseph K. Bradley] Updated DecisionTreeRunner.scala and StreamingKMeans.scala to eliminate compilation warnings, including renaming StreamingKMeans to StreamingKMeansExample.

    (cherry picked from commit 4ac21511547dc6227d05bf61821cd2d9ab5ede74)
    Signed-off-by: Xiangrui Meng

commit 667f7ff440dea9b83dbf3910f26d8dbf82d343a5
Author: JerryLead
Date:   Tue Dec 2 23:53:29 2014 -0800

    [SPARK-4672][Core]Checkpoint() should clear f to shorten the serialization chain

    The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672

    The f closure of `PartitionsRDD(ZippedPartitionsRDD2)` contains a `$outer` that references EdgeRDD/VertexRDD, which causes the task's serialization chain to become very long in iterative GraphX applications. As a result, StackOverflow error will occur. If we set "f = null" in `clearDependencies()`, checkpoint() can cut off the long serialization chain. More details and explanation can be found in the JIRA.

    Author: JerryLead
    Author: Lijie Xu

    Closes #3545 from JerryLead/my_core and squashes the following commits:
    f7faea5 [JerryLead] checkpoint() should clear the f to avoid StackOverflow error
    c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
    52799e3 [Lijie Xu] Merge pull request #1 from apache/master

    (cherry picked from commit 77be8b986fd21b7bbe28aa8db1042cb22bc74fe7)
    Signed-off-by: Ankur Dave

commit 528cce8bca950488a55d5c991bcdb692fe8a883c
Author: JerryLead
Date:   Tue Dec 2 17:14:11 2014 -0800

    [SPARK-4672][GraphX]Non-transient PartitionsRDDs will lead to StackOverflow error

    The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672

    In a nutshell, if `val partitionsRDD` in EdgeRDDImpl and VertexRDDImpl are non-transient, the serialization chain can become very long in iterative algorithms and finally lead to the StackOverflow error. More details and explanation can be found in the JIRA.

    Author: JerryLead
    Author: Lijie Xu

    Closes #3544 from JerryLead/my_graphX and squashes the following commits:
    628f33c [JerryLead] set PartitionsRDD to be transient in EdgeRDDImpl and VertexRDDImpl
    c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
    52799e3 [Lijie Xu] Merge pull request #1 from apache/master

    (cherry picked from commit 17c162f6682520e6e2790626e37da3a074471793)
    Signed-off-by: Ankur Dave

commit f1859fc189d9657381fbe82795420de34cad4025
Author: JerryLead
Date:   Tue Dec 2 17:08:02 2014 -0800

    [SPARK-4672][GraphX]Perform checkpoint() on PartitionsRDD to shorten the lineage

    The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672

    Iterative GraphX applications always have long lineage, while checkpoint() on EdgeRDD and VertexRDD themselves cannot shorten the lineage. In contrast, if we perform checkpoint() on their PartitionsRDD, the long lineage can be cut off. Moreover, the existing operations such as cache() in this code is performed on the PartitionsRDD, so checkpoint() should do the same way. More details and explanation can be found in the JIRA.

    Author: JerryLead
    Author: Lijie Xu

    Closes #3549 from JerryLead/my_graphX_checkpoint and squashes the following commits:
    d1aa8d8 [JerryLead] Perform checkpoint() on PartitionsRDD not VertexRDD and EdgeRDD themselves
    ff08ed4 [JerryLead] Merge branch 'master' of https://github.com/apache/spark
    c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
    52799e3 [Lijie Xu] Merge pull request #1 from apache/master

    (cherry picked from commit fc0a1475ef7c8b33363d88adfe8e8f28def5afc7)
    Signed-off-by: Ankur Dave
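The three SPARK-4672 entries above share one theme: in iterative jobs the RDD lineage (and any closure chain hanging off it) grows with every round, and only checkpointing truncates it. A generic sketch of the pattern, assuming a plain iterative Spark job rather than the GraphX internals the commits patch; the checkpoint directory and iteration counts are placeholders:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lineage-demo").setMaster("local[2]"))
sc.setCheckpointDir("/tmp/spark-checkpoints")  // illustrative path

var data: RDD[Double] = sc.parallelize(Seq.fill(1000)(1.0))
for (i <- 1 to 100) {
  data = data.map(_ * 0.99).cache()
  if (i % 10 == 0) {
    data.checkpoint()  // truncate the lineage every 10 iterations
    data.count()       // force materialization so the checkpoint happens
  }
}
```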
commit 5e026a3e647f077cf9aef15d80cd1fdfa5aad3cd
Author: Andrew Or
Date:   Tue Dec 2 16:36:12 2014 -0800

    [Release] Translate unknown author names automatically

commit 658fe8f1a911e080c9a63e67c9185492152c966e
Author: wangfei
Date:   Tue Dec 2 14:30:44 2014 -0800

    [SPARK-4695][SQL] Get result using executeCollect

    Using `executeCollect` to collect the result, because executeCollect is a custom implementation of collect in Spark SQL which is better than the RDD's collect.

    Author: wangfei

    Closes #3547 from scwf/executeCollect and squashes the following commits:
    a5ab68e [wangfei] Revert "adding debug info"
    a60d680 [wangfei] fix test failure
    0db7ce8 [wangfei] adding debug info
    184c594 [wangfei] using executeCollect instead collect

    (cherry picked from commit 3ae0cda83c5106136e90d59c20e61db345a5085f)
    Signed-off-by: Michael Armbrust

commit adc5d6f09edfc366f2ae151c2c3c13e07821d386
Author: Daoyuan Wang
Date:   Tue Dec 2 14:25:12 2014 -0800

    [SPARK-4670] [SQL] wrong symbol for bitwise not

    We should use `~` instead of `-` for bitwise NOT.

    Author: Daoyuan Wang

    Closes #3528 from adrian-wang/symbol and squashes the following commits:
    affd4ad [Daoyuan Wang] fix code gen test case
    56efb79 [Daoyuan Wang] ensure bitwise NOT over byte and short persist data type
    f55fbae [Daoyuan Wang] wrong symbol for bitwise not

    (cherry picked from commit 1f5ddf17e831ad9717f0f4b60a727a3381fad4f9)
    Signed-off-by: Michael Armbrust

commit 97dc2384ad4cb555200bbe994b5470f81fe4671f
Author: Daoyuan Wang
Date:   Tue Dec 2 14:21:12 2014 -0800

    [SPARK-4593][SQL] Return null when denominator is 0

    SELECT max(1/0) FROM src would return a very large number, which is obviously not right. For hive-0.12, hive would return `Infinity` for 1/0, while for hive-0.13.1, it is `NULL` for 1/0. I think it is better to keep our behavior with newer Hive version. This PR ensures that when the divider is 0, the result of the expression should be NULL, same with hive-0.13.1.

    Author: Daoyuan Wang

    Closes #3443 from adrian-wang/div and squashes the following commits:
    2e98677 [Daoyuan Wang] fix code gen for divide 0
    85c28ba [Daoyuan Wang] temp
    36236a5 [Daoyuan Wang] add test cases
    6f5716f [Daoyuan Wang] fix comments
    cee92bd [Daoyuan Wang] avoid evaluation 2 times
    22ecd9a [Daoyuan Wang] fix style
    cf28c58 [Daoyuan Wang] divide fix
    2dfe50f [Daoyuan Wang] return null when divider is 0 of Double type

    (cherry picked from commit f6df609dcc4f4a18c0f1c74b1ae0800cf09fa7ae)
    Signed-off-by: Michael Armbrust
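A quick sketch of the two SQL semantics fixes above (SPARK-4670 and SPARK-4593), assuming a HiveContext named `sqlContext` and an existing table `src`:

```scala
// Bitwise NOT now uses `~` (SPARK-4670); in two's complement, ~1 == -2.
sqlContext.sql("SELECT ~1 FROM src LIMIT 1").collect()
// expected: Array([-2])

// Division by zero yields NULL, matching Hive 0.13.1 (SPARK-4593),
// instead of Infinity or a huge number.
sqlContext.sql("SELECT 1 / 0 FROM src LIMIT 1").collect()
// expected: Array([null])
```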
commit 06129cde4dc035b31fcd8e5870a2030be2f2a8b7
Author: YanTangZhai
Date:   Tue Dec 2 14:12:48 2014 -0800

    [SPARK-4676][SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null

    val jsc = new org.apache.spark.api.java.JavaSparkContext(sc)
    val jhc = new org.apache.spark.sql.hive.api.java.JavaHiveContext(jsc)
    val nrdd = jhc.hql("select null from spark_test.for_test")
    println(nrdd.schema)

    Then the error is thrown as follows:
    scala.MatchError: NullType (of class org.apache.spark.sql.catalyst.types.NullType$)
    at org.apache.spark.sql.types.util.DataTypeConversions$.asJavaDataType(DataTypeConversions.scala:43)

    Author: YanTangZhai
    Author: yantangzhai
    Author: Michael Armbrust

    Closes #3538 from YanTangZhai/MatchNullType and squashes the following commits:
    e052dff [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
    4b4bb34 [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
    896c7b7 [yantangzhai] fix NullType MatchError in JavaSchemaRDD when sql has null
    6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
    e249846 [YanTangZhai] Merge pull request #10 from apache/master
    d26d982 [YanTangZhai] Merge pull request #9 from apache/master
    76d4027 [YanTangZhai] Merge pull request #8 from apache/master
    03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
    8a00106 [YanTangZhai] Merge pull request #6 from apache/master
    cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
    cdef539 [YanTangZhai] Merge pull request #1 from apache/master

    (cherry picked from commit 10664276007beca3843638e558f504cad44b1fb3)
    Signed-off-by: Michael Armbrust

commit aa3d369a6bf77a00939da020d823ab90c9fe3cab
Author: baishuo
Date:   Tue Dec 2 12:12:03 2014 -0800

    [SPARK-4663][sql]add finally to avoid resource leak

    Author: baishuo

    Closes #3526 from baishuo/master-trycatch and squashes the following commits:
    d446e14 [baishuo] correct the code style
    b36bf96 [baishuo] correct the code style
    ae0e447 [baishuo] add finally to avoid resource leak

    (cherry picked from commit 69b6fed206565ecb0173d3757bcb5110422887c3)
    Signed-off-by: Michael Armbrust

commit 1850d90b9bbfb973e13c2c2334ba817e623de46b
Author: Kousuke Saruta
Date:   Tue Dec 2 12:07:52 2014 -0800

    [SPARK-4536][SQL] Add sqrt and abs to Spark SQL DSL

    Spark SQL has embedded sqrt and abs but DSL doesn't support those functions.
    Author: Kousuke Saruta

    Closes #3401 from sarutak/dsl-missing-operator and squashes the following commits:
    07700cf [Kousuke Saruta] Modified Literal(null, NullType) to Literal(null) in DslQuerySuite
    8f366f8 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into dsl-missing-operator
    1b88e2e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into dsl-missing-operator
    0396f89 [Kousuke Saruta] Added sqrt and abs to Spark SQL DSL

    (cherry picked from commit e75e04f980281389b881df76f59ba1adc6338629)
    Signed-off-by: Michael Armbrust

commit b97c27ff257f77422ba17903d4e568be738265fb
Author: Kay Ousterhout
Date:   Tue Dec 2 09:06:02 2014 -0800

    [SPARK-4686] Link to allowed master URLs is broken

    The link points to the old scala programming guide; it should point to the submitting applications page. This should be backported to 1.1.2 (it's been broken as of 1.0).

    Author: Kay Ousterhout

    Closes #3542 from kayousterhout/SPARK-4686 and squashes the following commits:
    a8fc43b [Kay Ousterhout] [SPARK-4686] Link to allowed master URLs is broken

    (cherry picked from commit d9a148ba6a67a01e4bf77c35c41dd4cbc8918c82)
    Signed-off-by: Kay Ousterhout

commit 3783e15f0dc36f966b449227668e232707d6696b
Author: DB Tsai
Date:   Tue Dec 2 11:40:43 2014 +0800

    [SPARK-4611][MLlib] Implement the efficient vector norm

    The vector norm in breeze is implemented by `activeIterator` which is known to be very slow. In this PR, an efficient vector norm is implemented, and with this API, `Normalizer` and `k-means` have big performance improvement.

    Here is the benchmark against mnist8m dataset.

    a) `Normalizer`
    Before: DenseVector: 68.25secs, SparseVector: 17.01secs
    With this PR: DenseVector: 12.71secs, SparseVector: 2.73secs

    b) `k-means`
    Before: DenseVector: 83.46secs, SparseVector: 61.60secs
    With this PR: DenseVector: 70.04secs, SparseVector: 59.05secs

    Author: DB Tsai

    Closes #3462 from dbtsai/norm and squashes the following commits:
    63c7165 [DB Tsai] typo
    0c3637f [DB Tsai] add import org.apache.spark.SparkContext._ back
    6fa616c [DB Tsai] address feedback
    9b7cb56 [DB Tsai] move norm to static method
    0b632e6 [DB Tsai] kmeans
    dbed124 [DB Tsai] style
    c1a877c [DB Tsai] first commit

    (cherry picked from commit 64f3175bf976f5a28e691cedc7a4b333709e0c58)
    Signed-off-by: Xiangrui Meng

commit 445fc9550863bb8616acd6675d57077789177c03
Author: Daoyuan Wang
Date:   Mon Dec 1 16:08:51 2014 -0800

    [SPARK-4529] [SQL] support view with column alias

    Support view definition like CREATE VIEW view3(valoo) TBLPROPERTIES ("fear" = "factor") AS SELECT upper(value) FROM src WHERE key=86; [valoo as the alias of upper(value)]. This is the missing part of SPARK-4239, for full view support.
    Author: Daoyuan Wang

    Closes #3396 from adrian-wang/viewcolumn and squashes the following commits:
    4d001d0 [Daoyuan Wang] support view with column alias

    (cherry picked from commit 4df60a8cbc58f2877787245c2a83b2de85579c82)
    Signed-off-by: Michael Armbrust

commit e66f8166334026cd5506a9b05ab52b73a96fd7f3
Author: Daoyuan Wang
Date:   Mon Dec 1 14:03:57 2014 -0800

    [SQL][DOC] Date type in SQL programming guide

    Author: Daoyuan Wang

    Closes #3535 from adrian-wang/datedoc and squashes the following commits:
    18ff1ed [Daoyuan Wang] [DOC] Date type

    (cherry picked from commit 5edbcbfb61703398a24ce5162a74aba04e365b0c)
    Signed-off-by: Michael Armbrust

commit 31cf51bfaa0e332b903cb5d7f511dfa76d36bdc5
Author: wangfei
Date:   Mon Dec 1 14:02:02 2014 -0800

    [SQL] Minor fix for doc and comment

    Author: wangfei

    Closes #3533 from scwf/sql-doc1 and squashes the following commits:
    962910b [wangfei] doc and comment fix

    (cherry picked from commit 7b79957879db4dfcc7c3601cb40ac4fd576259a5)
    Signed-off-by: Michael Armbrust

commit b39cfee0620ccd9c4e966a7d9bbd6017e35023cd
Author: ravipesala
Date:   Mon Dec 1 13:31:27 2014 -0800

    [SPARK-4658][SQL] Code documentation issue in DDL of datasource API

    Author: ravipesala

    Closes #3516 from ravipesala/ddl_doc and squashes the following commits:
    d101fdf [ravipesala] Style issues fixed
    d2238cd [ravipesala] Corrected documentation

    (cherry picked from commit bc353819cc86c3b0ad75caf81b47744bfc2aeeb3)
    Signed-off-by: Michael Armbrust

commit 5006aab9d6f8dd4ce3dd11d388f96790c04cf25c
Author: ravipesala
Date:   Mon Dec 1 13:26:44 2014 -0800

    [SPARK-4650][SQL] Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL

    Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL

    Author: ravipesala
    Author: Michael Armbrust

    Closes #3511 from ravipesala/countdistinct and squashes the following commits:
    cc4dbb1 [ravipesala] style
    070e12a [ravipesala] Supporting multi column support in count(distinct c1,c2..) in Spark SQL

    (cherry picked from commit 6a9ff19dc06745144d5b311d4f87073c81d53a8f)
    Signed-off-by: Michael Armbrust
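A sketch of the multi-column distinct count from SPARK-4650 above, assuming a SQLContext named `sqlContext` and a registered table `people` with columns `city` and `age` (both hypothetical):

```scala
// Counts distinct (city, age) pairs; before SPARK-4650 the parser only
// accepted a single column inside count(distinct ...).
val distinctPairs = sqlContext.sql(
  "SELECT COUNT(DISTINCT city, age) FROM people").collect()
```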
commit f2bb90a29defab0b9c8ad795c0cb786de275145b
Author: Liang-Chi Hsieh
Date:   Mon Dec 1 13:17:56 2014 -0800

    [SPARK-4358][SQL] Let BigDecimal do checking type compatibility

    Remove hardcoding max and min values for types. Let BigDecimal do checking type compatibility.

    Author: Liang-Chi Hsieh

    Closes #3208 from viirya/more_numericLit and squashes the following commits:
    e9834b4 [Liang-Chi Hsieh] Remove byte and short types for number literal.
    1bd1825 [Liang-Chi Hsieh] Fix Indentation and make the modification clearer.
    cf1a997 [Liang-Chi Hsieh] Modified for comment to add a rule of analysis that adds a cast.
    91fe489 [Liang-Chi Hsieh] add Byte and Short.
    1bdc69d [Liang-Chi Hsieh] Let BigDecimal do checking type compatibility.

    (cherry picked from commit b57365a1ec89e31470f424ff37d5ebc7c90a39d8)
    Signed-off-by: Michael Armbrust

commit e0a6d36bc96df63fb8cc5c3b4e516ef1011849ef
Author: Jacky Li
Date:   Mon Dec 1 13:12:30 2014 -0800

    [SQL] add @group tab in limit() and count()

    group tab is missing for scaladoc

    Author: Jacky Li

    Closes #3458 from jackylk/patch-7 and squashes the following commits:
    0121a70 [Jacky Li] add @group tab in limit() and count()

    (cherry picked from commit bafee67ebad01f7aea2cd393a70b57eb8345eeb0)
    Signed-off-by: Michael Armbrust

commit 9c9b4bd1e4ac40c4abf4b5d1113c3056732e2c25
Author: Cheng Lian
Date:   Mon Dec 1 13:09:51 2014 -0800

    [SPARK-4258][SQL][DOC] Documents spark.sql.parquet.filterPushdown

    Documents `spark.sql.parquet.filterPushdown`, explains why it's turned off by default and when it's safe to be turned on.

    [Review on Reviewable](https://reviewable.io/reviews/apache/spark/3440)

    Author: Cheng Lian

    Closes #3440 from liancheng/parquet-filter-pushdown-doc and squashes the following commits:
    2104311 [Cheng Lian] Documents spark.sql.parquet.filterPushdown

    (cherry picked from commit 5db8dcaf494e0dffed4fc22f19b0334d95ab6bfb)
    Signed-off-by: Michael Armbrust

commit 35bc338c04022354654435427bb310acdcb9904a
Author: Madhu Siddalingaiah
Date:   Mon Dec 1 08:45:34 2014 -0800

    Documentation: add description for repartitionAndSortWithinPartitions

    Author: Madhu Siddalingaiah

    Closes #3390 from msiddalingaiah/master and squashes the following commits:
    cbccbfe [Madhu Siddalingaiah] Documentation: replace with (again)
    332f7a2 [Madhu Siddalingaiah] Documentation: replace with
    cd2b05a [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
    0fc12d7 [Madhu Siddalingaiah] Documentation: add description for repartitionAndSortWithinPartitions

    (cherry picked from commit 2b233f5fc4beb2c6ed4bc142e923e96f8bad3ec4)
    Signed-off-by: Josh Rosen

commit 67a2c138c0932ba15617b05c41ee5a8807244790
Author: zsxwing
Date:   Mon Dec 1 00:35:01 2014 -0800

    [SPARK-4661][Core] Minor code and docs cleanup

    Author: zsxwing

    Closes #3521 from zsxwing/SPARK-4661 and squashes the following commits:
    03cbe3f [zsxwing] Minor code and docs cleanup

    (cherry picked from commit 30a86acdefd5428af6d6264f59a037e0eefd74b4)
    Signed-off-by: Reynold Xin

commit 9b8a769187e30f8521cecad92a3a6c7f490d507b
Author: Sean Owen
Date:   Mon Dec 1 16:31:04 2014 +0800

    SPARK-2192 [BUILD] Examples Data Not in Binary Distribution

    Simply, add data/ to distributions. This adds about 291KB (compressed) to the tarball, FYI.
    Author: Sean Owen

    Closes #3480 from srowen/SPARK-2192 and squashes the following commits:
    47688f1 [Sean Owen] Add data/ to distributions

    (cherry picked from commit 6384f42ab2e5c2b3e767ab4a428cda20a8ddcbe1)
    Signed-off-by: Xiangrui Meng

commit 0f4dad43e30301b9ad8f078ef44f8b8c05c29a25
Author: Cheng Lian
Date:   Sun Nov 30 19:04:07 2014 -0800

    [DOC] Fixes formatting typo in SQL programming guide

    [Review on Reviewable](https://reviewable.io/reviews/apache/spark/3498)

    Author: Cheng Lian

    Closes #3498 from liancheng/fix-sql-doc-typo and squashes the following commits:
    865ecd7 [Cheng Lian] Fixes formatting typo in SQL programming guide

    (cherry picked from commit 2a4d389f70b2066b1ac32b081bef44e61fefb03c)
    Signed-off-by: Josh Rosen

commit c899f0355497021b8bdcef44b13fcd013d54e984
Author: lewuathe
Date:   Sun Nov 30 17:18:50 2014 -0800

    [SPARK-4656][Doc] Typo in Programming Guide markdown

    Grammatical error in Programming Guide document

    Author: lewuathe

    Closes #3412 from Lewuathe/typo-programming-guide and squashes the following commits:
    a3e2f00 [lewuathe] Typo in Programming Guide markdown

    (cherry picked from commit a217ec5fd5cd7addc69e538d6ec6dd64956cc8ed)
    Signed-off-by: Josh Rosen

commit d3247284d6a5e0009c917185ade866c9d06a5b37
Author: Sean Owen
Date:   Sun Nov 30 11:40:08 2014 -0800

    SPARK-2143 [WEB UI] Add Spark version to UI footer

    This PR adds the Spark version number to the UI footer; this is how it looks:

    ![screen shot 2014-11-21 at 22 58 40](https://cloud.githubusercontent.com/assets/822522/5157738/f4822094-7316-11e4-98f1-333a535fdcfa.png)

    Author: Sean Owen

    Closes #3410 from srowen/SPARK-2143 and squashes the following commits:
    e9b3a7a [Sean Owen] Add Spark version to footer

commit e07dbd8f5014dcaf50aa5c16f4732c4c95476228
Author: Takuya UESHIN
Date:   Sun Nov 30 00:10:31 2014 -0500

    [DOCS][BUILD] Add instruction to use change-version-to-2.11.sh in 'Building for Scala 2.11'.

    To build with Scala 2.11, we have to execute `change-version-to-2.11.sh` before running Maven, otherwise inter-module dependencies are broken.

    Author: Takuya UESHIN

    Closes #3361 from ueshin/docs/building-spark_2.11 and squashes the following commits:
    1d29126 [Takuya UESHIN] Add instruction to use change-version-to-2.11.sh in 'Building for Scala 2.11'.

    (cherry picked from commit 0fcd24cc542040ff3555290eec7b021062e7e6ac)
    Signed-off-by: Patrick Wendell

commit 854fade2dcd6e8d0fe4bdc2ffe7d650cedfca47c
Author: Liang-Chi Hsieh
Date:   Fri Nov 28 18:04:05 2014 -0800

    [SPARK-4597] Use proper exception and reset variable in Utils.createTempDir()

    `File.exists()` and `File.mkdirs()` only throw `SecurityException` instead of `IOException`. Then, when an exception is thrown, `dir` should be reset too.

    Author: Liang-Chi Hsieh

    Closes #3449 from viirya/fix_createtempdir and squashes the following commits:
    36cacbd [Liang-Chi Hsieh] Use proper exception and reset variable.
(cherry picked from commit 49fe8797e64f10c574e0790b32a8c3fdc7e594a0) Signed-off-by: Josh Rosen commit 3a4609eada2ee0bfbcce0f4127b6a5363ae528e5 Author: Patrick Wendell Date: Fri Nov 28 17:13:18 2014 -0500 HOTFIX: Rolling back incorrect version change commit 00316cc87983b844f6603f351a8f0b84fe1f6035 Author: Patrick Wendell Date: Fri Nov 28 21:57:43 2014 +0000 Preparing development version 1.2.1-SNAPSHOT commit 1056e9ec13203d0c51564265e94d77a054498fdb Author: Patrick Wendell Date: Fri Nov 28 21:57:43 2014 +0000 Preparing Spark release v1.2.0-rc1 commit eb4d457a870f7a281dc0267db72715cd00245e82 Author: Patrick Wendell Date: Fri Nov 28 16:55:13 2014 -0500 Updating version in package.scala commit 88f1a6abb6bc89ba805f18a8c220f3dd2df88fd1 Author: Patrick Wendell Date: Fri Nov 28 16:54:43 2014 -0500 Revert "Preparing Spark release v1.2.0-rc1" This reverts commit 39c7d1c1f9a7785285cf4c20dfbffd96f72d5634. commit 6e0269c9295d9faba9a9259eb5023c5a78e5895f Author: Patrick Wendell Date: Fri Nov 28 16:54:39 2014 -0500 Revert "Preparing development version 1.2.1-SNAPSHOT" This reverts commit fc7bff00ac731d2632213a98cd92dc5e84ce7dcd. commit fc7bff00ac731d2632213a98cd92dc5e84ce7dcd Author: Patrick Wendell Date: Fri Nov 28 20:22:31 2014 +0000 Preparing development version 1.2.1-SNAPSHOT commit 39c7d1c1f9a7785285cf4c20dfbffd96f72d5634 Author: Patrick Wendell Date: Fri Nov 28 20:22:31 2014 +0000 Preparing Spark release v1.2.0-rc1 commit 8cec4312e990beb648969a40688f3cba5e3473db Author: Marcelo Vanzin Date: Fri Nov 28 15:15:30 2014 -0500 [SPARK-4584] [yarn] Remove security manager from Yarn AM. The security manager adds a lot of overhead to the runtime of the app, and causes a severe performance regression. Even stubbing out all unneeded methods (all except checkExit()) does not help. So, instead, penalize users who do an explicit System.exit() by leaving them in "undefined behavior" territory: if they do that, the Yarn backend won't be able to report the final app status to the RM. The result is that the final status of the application might not match the user's expectations. One side-effect of the change is that users who do an explicit System.exit() will lose the AM retry functionality. Since there is no way to know if the exit was because of success or failure, the AM right now errs on the side of it being a successful exit. Author: Marcelo Vanzin Closes #3484 from vanzin/SPARK-4584 and squashes the following commits: 21f2502 [Marcelo Vanzin] Do not retry apps that use System.exit(). 4198b3b [Marcelo Vanzin] [SPARK-4584] [yarn] Remove security manager from Yarn AM. (cherry picked from commit 915f8eeb3a493a0bb4b8d05d795ddd21f373d2ff) Signed-off-by: Patrick Wendell commit 32198347ffb71f72f37e4bded262da80452a5aea Author: Takuya UESHIN Date: Fri Nov 28 13:00:15 2014 -0500 [SPARK-4193][BUILD] Disable doclint in Java 8 to prevent from build error. Author: Takuya UESHIN Closes #3058 from ueshin/issues/SPARK-4193 and squashes the following commits: e096bb1 [Takuya UESHIN] Add a plugin declaration to pluginManagement. 6762ec2 [Takuya UESHIN] Fix usage of -Xdoclint javadoc option. fdb280a [Takuya UESHIN] Fix Javadoc errors. 4745f3c [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4193 923e2f0 [Takuya UESHIN] Use doclint option `-missing` instead of `none`. 30d6718 [Takuya UESHIN] Fix Javadoc errors. b548017 [Takuya UESHIN] Disable doclint in Java 8 to prevent from build error. 
(cherry picked from commit e464f0ac2d7210a4bf715478885fe7a8d397fe89) Signed-off-by: Patrick Wendell commit 8cf12279969afe5099c66ad16897db366e7234ed Author: Cheng Lian Date: Fri Nov 28 11:42:40 2014 -0500 [SPARK-4645][SQL] Disables asynchronous execution in Hive 0.13.1 HiveThriftServer2 This PR disables HiveThriftServer2 asynchronous execution by setting the `runInBackground` argument in `ExecuteStatementOperation` to `false`, and reverting `SparkExecuteStatementOperation.run` in the Hive 13 shim to the Hive 12 version. This change makes Simba ODBC driver v1.0.0.1000 work. [Review on Reviewable](https://reviewable.io/reviews/apache/spark/3506) Author: Cheng Lian Closes #3506 from liancheng/disable-async-exec and squashes the following commits: 593804d [Cheng Lian] Disables asynchronous execution in Hive 0.13.1 HiveThriftServer2 commit 7fa5fff29881421ef5da0ac3c254611b2318be00 Author: Cheng Lian Date: Mon Nov 10 16:56:36 2014 -0800 [SPARK-4308][SQL] Sets SQL operation state to ERROR when exception is thrown In `HiveThriftServer2`, when an exception is thrown during SQL execution, the SQL operation state should be set to `ERROR`, but currently it remains `RUNNING`. This affects the result of the `GetOperationStatus` Thrift API. Author: Cheng Lian Closes #3175 from liancheng/fix-op-state and squashes the following commits: 6d4c1fe [Cheng Lian] Sets SQL operation state to ERROR when exception is thrown commit e9244263c97b61560e30dcb997df4bf074299085 Author: maji2014 Date: Fri Nov 28 00:36:22 2014 -0800 [SPARK-4619][Storage] Delete redundant time suffix The time suffix already exists in `Utils.getUsedTimeMs(startTime)`, so there is no need to append it again; delete it. Author: maji2014 Closes #3475 from maji2014/SPARK-4619 and squashes the following commits: df0da4e [maji2014] delete redundant time suffix (cherry picked from commit ceb628197099e6c598cde1564ed9c1c3681ea955) Signed-off-by: Reynold Xin commit 092800435c27c97bf445de934826a1316666dfba Author: Cheng Lian Date: Thu Nov 27 18:01:14 2014 -0800 [SPARK-4613][Core] Java API for JdbcRDD This PR introduces a set of Java APIs for using `JdbcRDD`: 1. Trait (interface) `JdbcRDD.ConnectionFactory`: equivalent to the `getConnection: () => Connection` parameter in the `JdbcRDD` constructor. 2. Two overloaded versions of `JdbcRDD.create`: used to create a `JavaRDD` that wraps a `JdbcRDD`.
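For reference, a hedged sketch of the underlying Scala `JdbcRDD` that the new Java API wraps (the Derby URL, table, and query are placeholders; the query must contain exactly two `?` placeholders, bound to each partition's lower and upper bounds):

```scala
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

def loadData(sc: SparkContext): JdbcRDD[String] = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:derby:memory:demo;create=true"),
  "SELECT ID, DATA FROM FOO WHERE ID >= ? AND ID <= ?",
  lowerBound = 1, upperBound = 100, numPartitions = 3,
  mapRow = (r: ResultSet) => r.getString("DATA"))
```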
[Review on Reviewable](https://reviewable.io/reviews/apache/spark/3478) Author: Cheng Lian Closes #3478 from liancheng/japi-jdbc-rdd and squashes the following commits: 9a54625 [Cheng Lian] Only shutdowns a single DB rather than the whole Derby driver d4cedc5 [Cheng Lian] Moves Java JdbcRDD test case to a separate test suite ffcdf2e [Cheng Lian] Java API for JdbcRDD (cherry picked from commit 120a350240f58196eafcb038ca3a353636d89239) Signed-off-by: Matei Zaharia commit bfba8bf602074a346e31917b97a6db205d62df69 Author: roxchkplusony Date: Thu Nov 27 15:54:40 2014 -0800 [SPARK-4626] Kill a task only if the executorId is (still) registered with the scheduler Author: roxchkplusony Closes #3483 from roxchkplusony/bugfix/4626 and squashes the following commits: aba9184 [roxchkplusony] replace warning message per review 5e7fdea [roxchkplusony] [SPARK-4626] Kill a task only if the executorId is (still) registered with the scheduler (cherry picked from commit 84376d31392858f7df215ddb3f05419181152e68) Signed-off-by: Reynold Xin commit a0aa07baaab10fe6e491a06171fe42e0f102c7a6 Author: Andrew Or Date: Wed Nov 26 23:16:23 2014 -0800 [Release] Automate generation of contributors list This commit provides a script that computes the contributors list by linking the github commits with JIRA issues. Automatically translating github usernames remains a TODO at this point. commit 66cc2431462a5354bb50c196a59da0ffc258c466 Author: CodingCat Date: Wed Nov 26 16:52:04 2014 -0800 [SPARK-732][SPARK-3628][CORE][RESUBMIT] eliminate duplicate update on accumulator https://issues.apache.org/jira/browse/SPARK-3628 In the current implementation, the accumulator is updated for every successfully finished task, even when the task comes from a resubmitted stage, which makes accumulator values counter-intuitive. This patch changes the way the DAGScheduler updates accumulators: the DAGScheduler maintains a hash table mapping each stage id to the (accumulator_id, value) pairs it has received, and it records a pair only when the task is the first finished task of a new stage or the stage is running its first attempt. Only when the stage becomes independent (no job needs it any more) are the buffered values actually accumulated. Author: CodingCat Closes #2524 from CodingCat/SPARK-732-1 and squashes the following commits: 701a1e8 [CodingCat] roll back change on Accumulator.scala 1433e6f [CodingCat] make MIMA happy b233737 [CodingCat] address Matei's comments 02261b8 [CodingCat] rollback some changes 6b0aff9 [CodingCat] update document 2b2e8cf [CodingCat] updateAccumulator 83b75f8 [CodingCat] style fix 84570d2 [CodingCat] re-enable the bad accumulator guard 1e9e14d [CodingCat] add NPE guard 21b6840 [CodingCat] simplify the patch 88d1f03 [CodingCat] fix rebase error f74266b [CodingCat] add test case for resubmitted result stage 5cf586f [CodingCat] de-duplicate on task level 138f9b3 [CodingCat] make MIMA happy 67593d2 [CodingCat] make if allowing duplicate update as an option of accumulator (cherry picked from commit 5af53ada65f62e6b5987eada288fb48e9211ef9d) Signed-off-by: Matei Zaharia commit 69550f761c53da80343ae982db38780cd2ad956f Author: Joseph K. Bradley Date: Wed Nov 26 13:34:18 2014 -0800 [BRANCH-1.2][SPARK-4583][MLLIB] LogLoss for GradientBoostedTrees fix + doc updates We reverted #3439 in branch-1.2 due to missing `import o.a.s.SparkContext._`, which is no longer needed in master (#3262). This PR adds #3439 back to branch-1.2 with correct imports.
Github is out-of-sync now. The real changes are the last two commits. Author: Joseph K. Bradley Author: Xiangrui Meng Closes #3474 from mengxr/SPARK-4583-1.2 and squashes the following commits: aca2abb [Xiangrui Meng] add import o.a.s.SparkContext._ for v1.2 6b5564a [Joseph K. Bradley] [SPARK-4583] [mllib] LogLoss for GradientBoostedTrees fix + doc updates commit 8fc19e5289bccbbe21f663e6d263f816e4701aa8 Author: Xiangrui Meng Date: Wed Nov 26 11:35:44 2014 -0800 [BRANCH-1.2][SPARK-4614][MLLIB] Slight API changes in Matrix and Matrices This is #3468 for branch-1.2, same content except mima excludes. Author: Xiangrui Meng Closes #3482 from mengxr/SPARK-4614-1.2 and squashes the following commits: ea4f08d [Xiangrui Meng] hide transposeMultiply; add rng to rand and randn; add unit tests commit 9b6390092213715347bfe5934c6ca6560c101dcb Author: Xiangrui Meng Date: Wed Nov 26 08:19:03 2014 -0800 [BRANCH-1.2][SPARK-4604][MLLIB] make MatrixFactorizationModel public We reverted #3459 in branch-1.2 due to missing `import o.a.s.SparkContext._`, which is no longer needed in master (#3262). This PR adds #3459 back to branch-1.2 with correct imports. Github is out-of-sync now. The real changes are the last two commits. Author: Xiangrui Meng Closes #3473 from mengxr/SPARK-4604-1.2 and squashes the following commits: a7638a5 [Xiangrui Meng] add import o.a.s.SparkContext._ for v1.2 b749000 [Xiangrui Meng] [SPARK-4604][MLLIB] make MatrixFactorizationModel public commit 9f3b159a5b71bc3aba54a14f5e3af46c87396e79 Author: Joseph E. Gonzalez Date: Wed Nov 26 00:55:28 2014 -0800 Removing confusing TripletFields After additional discussion with rxin, I think having all the possible `TripletFields` options is confusing. This pull request reduces the triplet fields to:

```java
/**
 * None of the triplet fields are exposed.
 */
public static final TripletFields None = new TripletFields(false, false, false);

/**
 * Expose only the edge field and not the source or destination field.
 */
public static final TripletFields EdgeOnly = new TripletFields(false, false, true);

/**
 * Expose the source and edge fields but not the destination field. (Same as Src)
 */
public static final TripletFields Src = new TripletFields(true, false, true);

/**
 * Expose the destination and edge fields but not the source field. (Same as Dst)
 */
public static final TripletFields Dst = new TripletFields(false, true, true);

/**
 * Expose all the fields (source, edge, and destination).
 */
public static final TripletFields All = new TripletFields(true, true, true);
```

Author: Joseph E. Gonzalez Closes #3472 from jegonzal/SimplifyTripletFields and squashes the following commits: 91796b5 [Joseph E. Gonzalez] removing confusing triplet fields (cherry picked from commit 288ce583b05004a8c71dcd836fab23caff5d4ba7) Signed-off-by: Reynold Xin commit e8669729af4b49423a7514830436b2cb4ee6a08a Author: Tathagata Das Date: Tue Nov 25 23:15:58 2014 -0800 [SPARK-4612] Reduce task latency and increase scheduling throughput by making configuration initialization lazy https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L337 creates a configuration object for every task that is launched, even if there is no new dependent file/JAR to update. This is a heavy-weight creation that should be avoided if there is no new file/JAR to update. This PR makes that creation lazy. A quick local test in the spark-perf scheduling throughput tests gives the following numbers in local standalone scheduler mode.
1 job with 10000 tasks: before 7.8395 seconds, after 2.6415 seconds = 3x increase in task scheduling throughput pwendell JoshRosen Author: Tathagata Das Closes #3463 from tdas/lazy-config and squashes the following commits: c791c1e [Tathagata Das] Reduce task latency by making configuration initialization lazy (cherry picked from commit e7f4d2534bb3361ec4b7af0d42bc798a7a425226) Signed-off-by: Reynold Xin commit 69d021b0becdffe225a1c8859d8c6adeb1a94f4a Author: Xiangrui Meng Date: Tue Nov 25 22:29:56 2014 -0800 Revert "[SPARK-4604][MLLIB] make MatrixFactorizationModel public" This reverts commit 2756d0de91d996f80c0b0883cad1d2fab336ed84. commit 17a4b8e597391af3a258f8f4f9c910e341ba39c3 Author: Patrick Wendell Date: Wed Nov 26 00:42:01 2014 -0500 Revert "[SPARK-4583] [mllib] LogLoss for GradientBoostedTrees fix + doc updates" This reverts commit 6880b467f66a4906161cbc343e70d975056a4f5f. commit 8f5ebcb63c28254abf60cce87c3706ccdee3c91a Author: Patrick Wendell Date: Wed Nov 26 00:36:43 2014 -0500 Revert "Preparing Spark release v1.2.0-rc1" This reverts commit cc2c05e4ee81d2f34873a2ebb9a5272867cb65c2. commit 537d699a53b1fe227d570635e3b4a33abf2d72ab Author: Patrick Wendell Date: Wed Nov 26 00:36:35 2014 -0500 Revert "Preparing development version 1.2.1-SNAPSHOT" This reverts commit 380eba5f49eca1dbd4084e6c84e19866fffd4efa. commit c7185f0c08e2a42e2595466e2d8ac394cbf66f5b Author: Aaron Davidson Date: Wed Nov 26 00:32:45 2014 -0500 [SPARK-4516] Avoid allocating Netty PooledByteBufAllocators unnecessarily Turns out we are allocating an allocator pool for every TransportClient (which means that the number increases with the number of nodes in the cluster), when really we should just reuse one for all clients. This patch, as expected, greatly decreases off-heap memory allocation, and appears to make allocation only proportional to the number of cores. Author: Aaron Davidson Closes #3465 from aarondav/fewer-pools and squashes the following commits: 36c49da [Aaron Davidson] [SPARK-4516] Avoid allocating unnecessarily Netty PooledByteBufAllocators (cherry picked from commit 346bc17a2ec8fc9e6eaff90733aa1e8b6b46883e) Signed-off-by: Patrick Wendell commit 380eba5f49eca1dbd4084e6c84e19866fffd4efa Author: Patrick Wendell Date: Wed Nov 26 05:17:09 2014 +0000 Preparing development version 1.2.1-SNAPSHOT commit cc2c05e4ee81d2f34873a2ebb9a5272867cb65c2 Author: Patrick Wendell Date: Wed Nov 26 05:17:08 2014 +0000 Preparing Spark release v1.2.0-rc1 commit dfb8c65b730fdf60540e91cd74fbaa2764a2a2bc Author: Patrick Wendell Date: Wed Nov 26 00:16:20 2014 -0500 HOTFIX: Updating additional version data commit de8029b39142be5e91714a9d5240bcdb90f66886 Author: Patrick Wendell Date: Wed Nov 26 00:11:58 2014 -0500 Revert "Preparing Spark release v1.2.0-rc1" This reverts commit 5247dd859b95a440baa562b9827bdeb26aa6530e. commit 37bc7a830e862d47776b85767ba599d61ef13e01 Author: Patrick Wendell Date: Wed Nov 26 00:11:49 2014 -0500 Revert "Preparing development version 1.2.1-SNAPSHOT" This reverts commit 79df6b43ae762263a8120f423ddb4a0811dd4b6f. 
commit 79df6b43ae762263a8120f423ddb4a0811dd4b6f Author: Patrick Wendell Date: Wed Nov 26 05:10:29 2014 +0000 Preparing development version 1.2.1-SNAPSHOT commit 5247dd859b95a440baa562b9827bdeb26aa6530e Author: Patrick Wendell Date: Wed Nov 26 05:10:29 2014 +0000 Preparing Spark release v1.2.0-rc1 commit ce6200b265e63979483e0cccecff391faa159903 Author: Patrick Wendell Date: Wed Nov 26 00:09:01 2014 -0500 Revert "Preparing Spark release v1.2.0-rc1" This reverts commit db7f4a898af22a02b36428507f8ef2b429d78dc1. commit 68a217cd1a792ca3486442e9aa63ca0258e88762 Author: Patrick Wendell Date: Wed Nov 26 00:08:57 2014 -0500 Revert "Preparing development version 1.2.1-SNAPSHOT" This reverts commit d7b1ecb25676d228deb6fe05efdb4e2ab9c3e30b. commit d7b1ecb25676d228deb6fe05efdb4e2ab9c3e30b Author: Ubuntu Date: Wed Nov 26 05:07:50 2014 +0000 Preparing development version 1.2.1-SNAPSHOT commit db7f4a898af22a02b36428507f8ef2b429d78dc1 Author: Ubuntu Date: Wed Nov 26 05:07:50 2014 +0000 Preparing Spark release v1.2.0-rc1 commit 01271786e67bdf8441824fb4dd9ed6e9fd95eaaa Author: Patrick Wendell Date: Wed Nov 26 00:06:16 2014 -0500 Revert "Preparing Spark release v1.2.0-snapshot1" This reverts commit 38c1fbd9694430cefd962c90bc36b0d108c6124b. commit b028aaff161ad749e4723f5821ed000320a6665e Author: Patrick Wendell Date: Wed Nov 26 00:06:14 2014 -0500 Revert "Preparing development version 1.2.1-SNAPSHOT" This reverts commit d7ac6013483e83caff8ea54c228f37aeca159db8. commit 1e12f594be277f6b390c998b1a1e5581ecebdcb0 Author: Aaron Davidson Date: Tue Nov 25 23:57:04 2014 -0500 [SPARK-4516] Cap default number of Netty threads at 8 In practice, only 2-4 cores should be required to transfer roughly 10 Gb/s, and each core that we use will have an initial overhead of roughly 32 MB of off-heap memory, which comes at a premium. Thus, this value should still retain maximum throughput and reduce wasted off-heap memory allocation. It can be overridden by setting the number of serverThreads and clientThreads manually in Spark's configuration. Author: Aaron Davidson Closes #3469 from aarondav/fewer-pools2 and squashes the following commits: 087c59f [Aaron Davidson] [SPARK-4516] Cap default number of Netty threads at 8 (cherry picked from commit f5f2d27385c243959f03a9d78a149d5f405b2f50) Signed-off-by: Patrick Wendell commit 2756d0de91d996f80c0b0883cad1d2fab336ed84 Author: Xiangrui Meng Date: Tue Nov 25 20:11:40 2014 -0800 [SPARK-4604][MLLIB] make MatrixFactorizationModel public User could construct an MF model directly. I added a note about the performance. Author: Xiangrui Meng Closes #3459 from mengxr/SPARK-4604 and squashes the following commits: f64bcd3 [Xiangrui Meng] organize imports ed08214 [Xiangrui Meng] check preconditions and unit tests a624c12 [Xiangrui Meng] make MatrixFactorizationModel public (cherry picked from commit b5fb1410c5eed1156decb4f9fcc22436a658ce4d) Signed-off-by: Xiangrui Meng commit 37d58aaac20b9ab34ea50c9e62905c7f80fe5036 Author: Patrick Wendell Date: Tue Nov 25 23:10:19 2014 -0500 [HOTFIX]: Adding back without-hive dist commit 6880b467f66a4906161cbc343e70d975056a4f5f Author: Joseph K. Bradley Date: Tue Nov 25 20:10:15 2014 -0800 [SPARK-4583] [mllib] LogLoss for GradientBoostedTrees fix + doc updates Currently, the LogLoss used by GradientBoostedTrees has 2 issues: * the gradient (and therefore loss) does not match that used by Friedman (1999) * the error computation uses 0/1 accuracy, not log loss This PR updates LogLoss. It also adds some doc for boosting and forests. 
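For orientation, the loss/gradient pair the SPARK-4583 fix aligns with (Friedman 1999), written here as a sketch rather than the exact MLlib code, with labels y ∈ {−1, +1} and model margin F(x); the factor of 2 follows MLlib's convention and is not essential:

```latex
L(y, F) = 2\log\left(1 + e^{-2yF}\right), \qquad
\frac{\partial L}{\partial F} = \frac{-4y}{1 + e^{2yF}}
```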
I tested it on sample data and made sure the log loss is monotonically decreasing with each boosting iteration. CC: mengxr manishamde codedeft Author: Joseph K. Bradley Closes #3439 from jkbradley/gbt-loss-fix and squashes the following commits: cfec17e [Joseph K. Bradley] removed forgotten temp comments a27eb6d [Joseph K. Bradley] corrections to last log loss commit ed5da2c [Joseph K. Bradley] updated LogLoss (boosting) for numerical stability 5e52bff [Joseph K. Bradley] * Removed the 1/2 from SquaredError. This also required updating the test suite since it effectively doubles the gradient and loss. * Added doc for developers within RandomForest. * Small cleanup in test suite (generating data only once) e57897a [Joseph K. Bradley] Fixed LogLoss for GradientBoostedTrees, and updated doc for losses, forests, and boosting (cherry picked from commit c251fd7405db57d3ab2686c38712601fd8f13ccd) Signed-off-by: Xiangrui Meng commit a48ea3cef22687694a4471705fb707bd1e8fa592 Author: Xiangrui Meng Date: Tue Nov 25 16:07:09 2014 -0800 [Spark-4509] Revert EC2 tag-based cluster membership patch This PR reverts changes related to tag-based cluster membership. As discussed in SPARK-3332, we didn't figure out a safe strategy to use tags to determine cluster membership, because tagging is not atomic. The following changes are reverted: SPARK-2333: 94053a7b766788bb62e2dbbf352ccbcc75f71fc0 SPARK-3213: 7faf755ae4f0cf510048e432340260a6e609066d SPARK-3608: 78d4220fa0bf2f9ee663e34bbf3544a5313b02f0. I tested launch, login, and destroy. It is easy to check the diff by comparing it to Josh's patch for branch-1.1: https://github.com/apache/spark/pull/2225/files JoshRosen I sent the PR to master. It might be easier for us to keep master and branch-1.2 the same at this time. We can always re-apply the patch once we figure out a stable solution. Author: Xiangrui Meng Closes #3453 from mengxr/SPARK-4509 and squashes the following commits: f0b708b [Xiangrui Meng] revert 94053a7b766788bb62e2dbbf352ccbcc75f71fc0 4298ea5 [Xiangrui Meng] revert 7faf755ae4f0cf510048e432340260a6e609066d 35963a1 [Xiangrui Meng] Revert "SPARK-3608 Break if the instance tag naming succeeds" (cherry picked from commit 7eba0fbe456c451122d7a2353ff0beca00f15223) Signed-off-by: Andrew Or commit 93b914df1566c6359d8f1546ab7344823dc4341f Author: hushan[胡珊] Date: Tue Nov 25 15:51:08 2014 -0800 Fix SPARK-4471: blockManagerIdFromJson function throws exception while B... Fix [SPARK-4471](https://issues.apache.org/jira/browse/SPARK-4471): blockManagerIdFromJson function throws exception while BlockManagerId be null in MetadataFetchFailedException Author: hushan[胡珊] Closes #3340 from suyanNone/fix-blockmanagerId-jnothing-2 and squashes the following commits: 159f9a3 [hushan[胡珊]] Refine test code for blockmanager is null 4380d73 [hushan[胡珊]] remove useless blank line 3ccf651 [hushan[胡珊]] Fix SPARK-4471: blockManagerIdFromJson function throws exception while metadata fetch failed (cherry picked from commit 9bdf5da59036c0b052df756fc4a28d64677072e7) Signed-off-by: Andrew Or commit 58c840dde8776efefd5e180d95379598fd061172 Author: Andrew Or Date: Tue Nov 25 15:48:02 2014 -0800 [SPARK-4546] Improve HistoryServer first time user experience The documentation points the user to run the following ``` sbin/start-history-server.sh ``` The first thing this does is throw an exception that complains a log directory is not specified. The exception message itself does not say anything about what to set. Instead we should have a default and a landing page with a better message. 
The new default log directory is `file:/tmp/spark-events`. This is what it looks like as of this PR: ![after](https://issues.apache.org/jira/secure/attachment/12682985/after.png) Author: Andrew Or Closes #3411 from andrewor14/minor-history-improvements and squashes the following commits: f33d6b3 [Andrew Or] Point user to set config if default log dir does not exist fc4c17a [Andrew Or] Improve HistoryServer UX (cherry picked from commit 9afcbe494a3535a9bf7958429b72e989972f82d9) Signed-off-by: Andrew Or commit ee0317509ee1dfd9c5807890412f9ac5ebf16eb3 Author: Andrew Or Date: Tue Nov 25 15:46:26 2014 -0800 [SPARK-4592] Avoid duplicate worker registrations in standalone mode **Summary.** On failover, the Master may receive duplicate registrations from the same worker, causing the worker to exit. This is caused by this commit https://github.com/apache/spark/commit/4afe9a4852ebeb4cc77322a14225cd3dec165f3f, which adds logic for the worker to re-register with the master in case of failures. However, the following race condition may occur: (1) Master A fails and Worker attempts to reconnect to all masters (2) Master B takes over and notifies Worker (3) Worker responds by registering with Master B (4) Meanwhile, Worker's previous reconnection attempt reaches Master B, causing the same Worker to register with Master B twice **Fix.** Instead of attempting to register with all known masters, the worker should re-register with only the one that it has been communicating with. This is safe because the fact that a failover has occurred means the old master must have died. Then, when the worker is finally notified of a new master, it gives up on the old one in favor of the new one. **Caveat.** Even this fix is subject to more obscure race conditions. For instance, if Master B fails and Master A recovers immediately, then Master A may still observe duplicate worker registrations. However, this and other potential race conditions summarized in [SPARK-4592](https://issues.apache.org/jira/browse/SPARK-4592), are much, much less likely than the one described above, which is deterministically reproducible. Author: Andrew Or Closes #3447 from andrewor14/standalone-failover and squashes the following commits: 0d9716c [Andrew Or] Move re-registration logic to actor for thread-safety 79286dc [Andrew Or] Preserve old behavior for initial retries 83b321c [Andrew Or] Tweak wording 1fce6a9 [Andrew Or] Active master actor could be null in the beginning b6f269e [Andrew Or] Avoid duplicate worker registrations (cherry picked from commit 1b2ab1cd1b7cab9076f3c511188a610eda935701) Signed-off-by: Andrew Or commit a2c01ae5e3489b6c21a4c7bcc1ec615069ff4829 Author: Tathagata Das Date: Tue Nov 25 15:27:20 2014 -0800 [HOTFIX] Fixing broken build due to missing imports. commit a9944c809017cc61c9c2e38efe9d709dfb0a94cd Author: Tathagata Das Date: Tue Nov 25 14:16:27 2014 -0800 [SPARK-4196][SPARK-4602][Streaming] Fix serialization issue in PairDStreamFunctions.saveAsNewAPIHadoopFiles Solves two JIRAs in one shot - Makes the ForechDStream created by saveAsNewAPIHadoopFiles serializable for checkpoints - Makes the default configuration object used saveAsNewAPIHadoopFiles be the Spark's hadoop configuration Author: Tathagata Das Closes #3457 from tdas/savefiles-fix and squashes the following commits: bb4729a [Tathagata Das] Same treatment for saveAsHadoopFiles b382ea9 [Tathagata Das] Fix serialization issue in PairDStreamFunctions.saveAsNewAPIHadoopFiles. 
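A minimal usage sketch for the `saveAsNewAPIHadoopFiles` path fixed above (the socket source and output location are hypothetical; after the fix, the Hadoop `Configuration` defaults to the SparkContext's, so the checkpointed closure stays serializable):

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair DStream ops (Spark 1.2)

val ssc = new StreamingContext(sc, Seconds(10))
val wordCounts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _)

// One output directory per batch: <prefix>-<batch time in ms>.<suffix>
wordCounts
  .map { case (w, c) => (new Text(w), new Text(c.toString)) }
  .saveAsNewAPIHadoopFiles[TextOutputFormat[Text, Text]](
    "hdfs:///out/wordcounts", "txt")
```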
(cherry picked from commit 8838ad7c135a585cde015dc38b5cb23314502dd9) Signed-off-by: Tathagata Das commit 1e356a8fa26f287212df0ab5bd3b2aa9fd1d388a Author: DB Tsai Date: Tue Nov 25 11:07:11 2014 -0800 [SPARK-4581][MLlib] Refactorize StandardScaler to improve the transformation performance The following optimizations are done to improve the StandardScaler model transformation performance. 1) Convert the Breeze dense vector to a primitive vector to reduce the overhead. 2) Since the mean can potentially be a sparse vector, we explicitly convert it to a dense primitive vector. 3) Keep local references to the `shift` and `factor` arrays so the JVM can locate the values with a single operation. 4) In the pattern matching part, we use the mllib SparseVector/DenseVector instead of Breeze's vectors to make the codebase cleaner. Benchmark with the mnist8m dataset:

Before:
DenseVector withMean and withStd: 50.97secs
DenseVector withMean and withoutStd: 42.11secs
DenseVector withoutMean and withStd: 8.75secs
SparseVector withoutMean and withStd: 5.437secs

With this PR:
DenseVector withMean and withStd: 5.76secs
DenseVector withMean and withoutStd: 5.28secs
DenseVector withoutMean and withStd: 5.30secs
SparseVector withoutMean and withStd: 1.27secs

Note that without the local reference copies of the `factor` and `shift` arrays, the runtime is almost three times slower:

DenseVector withMean and withStd: 18.15secs
DenseVector withMean and withoutStd: 18.05secs
DenseVector withoutMean and withStd: 18.54secs
SparseVector withoutMean and withStd: 2.01secs

The following code,

```scala
while (i < size) {
  values(i) = (values(i) - shift(i)) * factor(i)
  i += 1
}
```

will generate the bytecode

```
L13
 LINENUMBER 106 L13
 FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I] []
 ILOAD 7
 ILOAD 6
 IF_ICMPGE L14
L15
 LINENUMBER 107 L15
 ALOAD 5
 ILOAD 7
 ALOAD 5
 ILOAD 7
 DALOAD
 ALOAD 0
 INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.shift ()[D
 ILOAD 7
 DALOAD
 DSUB
 ALOAD 0
 INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
 ILOAD 7
 DALOAD
 DMUL
 DASTORE
L16
 LINENUMBER 108 L16
 ILOAD 7
 ICONST_1
 IADD
 ISTORE 7
 GOTO L13
```

while with local references to the `shift` and `factor` arrays, the bytecode will be

```
L14
 LINENUMBER 107 L14
 ALOAD 0
 INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
 ASTORE 9
L15
 LINENUMBER 108 L15
 FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector [D org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I [D] []
 ILOAD 8
 ILOAD 7
 IF_ICMPGE L16
L17
 LINENUMBER 109 L17
 ALOAD 6
 ILOAD 8
 ALOAD 6
 ILOAD 8
 DALOAD
 ALOAD 2
 ILOAD 8
 DALOAD
 DSUB
 ALOAD 9
 ILOAD 8
 DALOAD
 DMUL
 DASTORE
L18
 LINENUMBER 110 L18
 ILOAD 8
 ICONST_1
 IADD
 ISTORE 8
 GOTO L15
```

You can see that with the local references, both arrays are on the stack, so the JVM can access the values without calling `INVOKESPECIAL`. Author: DB Tsai Closes #3435 from dbtsai/standardscaler and squashes the following commits: 85885a9 [DB Tsai] revert to have lazy in shift array.
daf2b06 [DB Tsai] Address the feedback cdb5cef [DB Tsai] small change 9c51eef [DB Tsai] style fc795e4 [DB Tsai] update 5bffd3d [DB Tsai] first commit (cherry picked from commit bf1a6aaac577757a293a573fe8eae9669697310a) Signed-off-by: Xiangrui Meng commit 96f76fc405d1da181ed9edc733a897437ee0a6e0 Author: Tathagata Das Date: Tue Nov 25 06:50:36 2014 -0800 [SPARK-4601][Streaming] Set correct call site for streaming jobs so that it is displayed correctly on the Spark UI When running the NetworkWordCount, the description of the word count jobs is set to "getCallsite at DStream:xxx". It should instead be set to the line of the streaming application containing the output operation that led to the job being created. The callsite is incorrectly set in the thread launching the jobs; this PR fixes that. Author: Tathagata Das Closes #3455 from tdas/streaming-callsite-fix and squashes the following commits: 69fc26f [Tathagata Das] Set correct call site for streaming jobs so that it is displayed correctly on the Spark UI (cherry picked from commit 69cd53eae205eb10d52eaf38466db58a23b6ae81) Signed-off-by: Tathagata Das commit a689ab98d944dbe4b239449897841543c0450450 Author: arahuja Date: Tue Nov 25 08:23:41 2014 -0600 [SPARK-4344][DOCS] adding documentation on spark.yarn.user.classpath.first The documentation for the two parameters is the same, with a pointer from the standalone parameter to the YARN parameter. Author: arahuja Closes #3209 from arahuja/yarn-classpath-first-param and squashes the following commits: 51cb9b2 [arahuja] [SPARK-4344][DOCS] adding documentation for YARN on userClassPathFirst (cherry picked from commit d240760191f692ee7b88dfc82f06a31a340a88a2) Signed-off-by: Thomas Graves commit b026546e3a2195a7e6106af3a5b7370cdb850052 Author: jerryshao Date: Tue Nov 25 05:36:29 2014 -0800 [SPARK-4381][Streaming] Add warning log when user sets spark.master to local in Spark Streaming and no job is executed Author: jerryshao Closes #3244 from jerryshao/SPARK-4381 and squashes the following commits: d2486c7 [jerryshao] Improve the warning log d726e85 [jerryshao] Add local[1] to the filter condition eca428b [jerryshao] Add warning log (cherry picked from commit fef27b29431c2adadc17580f26c23afa6a3bd1d2) Signed-off-by: Tathagata Das commit 42b9d0d31eae8d992301bcd36665d01ef1a00a06 Author: q00251598 Date: Tue Nov 25 04:01:56 2014 -0800 [SPARK-4535][Streaming] Fix errors in comments: change `NetworkInputDStream` to `ReceiverInputDStream` and `ReceiverInputTracker` to `ReceiverTracker` Author: q00251598 Closes #3400 from watermen/fix-comments and squashes the following commits: 75d795c [q00251598] change 'NetworkInputDStream' to 'ReceiverInputDStream' && change 'ReceiverInputTracker' to 'ReceiverTracker' (cherry picked from commit a51118a34a4617c07373480c4b021e53124c3c00) Signed-off-by: Tathagata Das Conflicts: examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala commit d117f8fa44a4cf2f51c0fb1a1a6bac65527a63b0 Author: GuoQiang Li Date: Tue Nov 25 02:01:19 2014 -0800 [SPARK-4526][MLLIB] GradientDescent gets a wrong gradient value according to the gradient formula. This is caused by the miniBatchSize parameter: the number of elements `RDD.sample` returns is not fixed.
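The variability described above is easy to observe; a hedged sketch (the counts shown are illustrative):

```scala
// sample() draws each element independently, so the returned count varies
// around fraction * n instead of being exact -- the root cause noted in
// SPARK-4526 when the gradient is normalized by the intended miniBatchSize.
val rdd = sc.parallelize(1 to 10000)
val sizes = (1 to 5).map(_ => rdd.sample(withReplacement = false, 0.1).count())
// e.g. Vector(1014, 983, 1002, 995, 1021) -- not exactly 1000 each time
```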
cc mengxr Author: GuoQiang Li Closes #3399 from witgo/GradientDescent and squashes the following commits: 13cb228 [GuoQiang Li] review commit 668ab66 [GuoQiang Li] Double to Long b6aa11a [GuoQiang Li] Check miniBatchSize is greater than 0 0b5c3e3 [GuoQiang Li] Minor fix 12e7424 [GuoQiang Li] GradientDescent get a wrong gradient value according to the gradient formula, which is caused by the miniBatchSize parameter. (cherry picked from commit f515f9432b05f7e090b651c5536aa706d1cde487) Signed-off-by: Xiangrui Meng commit 74571991b894a3b1ec47644d850a64276252b3fb Author: DB Tsai Date: Tue Nov 25 01:57:34 2014 -0800 [SPARK-4596][MLLib] Refactorize Normalizer to make code cleaner In this refactoring, the performance will be slightly increased due to removing the overhead from breeze vector. The bottleneck is still in breeze norm which is implemented by activeIterator. This inefficiency of breeze norm will be addressed in next PR. At least, this PR makes the code more consistent in the codebase. Author: DB Tsai Closes #3446 from dbtsai/normalizer and squashes the following commits: e20a2b9 [DB Tsai] first commit (cherry picked from commit 89f912264603741c7d980135c26102d63e11791f) Signed-off-by: Xiangrui Meng commit 1f4d1ac4bc782f888757073ee2becf59a5251774 Author: wangfei Date: Mon Nov 24 22:32:39 2014 -0800 [DOC][Build] Wrong cmd for build spark with apache hadoop 2.4.X and hive 12 Author: wangfei Closes #3335 from scwf/patch-10 and squashes the following commits: d343113 [wangfei] add '-Phive' 60d595e [wangfei] [DOC] Wrong cmd for build spark with apache hadoop 2.4.X and Hive 12 support (cherry picked from commit 0fe54cff19759dad2dc2a0950bd6c1d31c95e858) Signed-off-by: Patrick Wendell commit 259cb26fcc6bbd3519cc126d8bb882ac3e58e840 Author: w00228970 Date: Mon Nov 24 21:17:24 2014 -0800 [SQL] Compute timeTaken correctly ```timeTaken``` should not count the time of printing result. Author: w00228970 Closes #3423 from scwf/time-taken-bug and squashes the following commits: da7e102 [w00228970] compute time taken correctly (cherry picked from commit 723be60e233d0f85944d948efd06845ef546c9f5) Signed-off-by: Reynold Xin commit 10e433919a9a3520007099a3876b47f74c046f12 Author: Jongyoul Lee Date: Mon Nov 24 19:14:14 2014 -0800 [SPARK-4525] Mesos should decline unused offers Functionally, this is just a small change on top of #3393 (by jongyoul). The issue being addressed is discussed in the comments there. I have not yet added a test for the bug there. I will add one shortly. I've also done some minor renaming/clean-up of variables in this class and tests. Author: Patrick Wendell Author: Jongyoul Lee Closes #3436 from pwendell/mesos-issue and squashes the following commits: 58c35b5 [Patrick Wendell] Adding unit test for this situation c4f0697 [Patrick Wendell] Additional clean-up and fixes on top of existing fix f20f1b3 [Jongyoul Lee] [SPARK-4525] MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers - Added code for declining unused offers among acceptedOffers - Edited testCase for checking declining unused offers (cherry picked from commit b043c27424d05e3200e7ba99a1a65656b57fa2f0) Signed-off-by: Patrick Wendell commit e7b8bf067a2606e381f2081db95d9c613391afef Author: Patrick Wendell Date: Mon Nov 24 19:20:09 2014 -0800 Revert "[SPARK-4525] Mesos should decline unused offers" This reverts commit 4b4797309457b9301710b6e98550817337005eca. I accidentally committed this using my own authorship credential. 
However, I should have given authorship to the original author: Jongyoul Lee. commit 4b4797309457b9301710b6e98550817337005eca Author: Patrick Wendell Date: Mon Nov 24 19:14:14 2014 -0800 [SPARK-4525] Mesos should decline unused offers Functionally, this is just a small change on top of #3393 (by jongyoul). The issue being addressed is discussed in the comments there. I have not yet added a test for the bug there. I will add one shortly. I've also done some minor renaming/clean-up of variables in this class and tests. Author: Patrick Wendell Author: Jongyoul Lee Closes #3436 from pwendell/mesos-issue and squashes the following commits: 58c35b5 [Patrick Wendell] Adding unit test for this situation c4f0697 [Patrick Wendell] Additional clean-up and fixes on top of existing fix f20f1b3 [Jongyoul Lee] [SPARK-4525] MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers - Added code for declining unused offers among acceptedOffers - Edited testCase for checking declining unused offers (cherry picked from commit b043c27424d05e3200e7ba99a1a65656b57fa2f0) Signed-off-by: Patrick Wendell commit 47d4fceffe90905fa8f50551e53c8d2e5b246cae Author: Kay Ousterhout Date: Mon Nov 24 18:03:10 2014 -0800 [SPARK-4266] [Web-UI] Reduce stage page load time. The commit changes the JavaScript used to show/hide additional metrics in order to reduce page load time. SPARK-4016 significantly increased page load time for the stage page when stages had a lot (thousands or tens of thousands) of tasks, due to the additional JavaScript to hide some metrics by default and stripe the tables. This commit reduces page load time in two ways: (1) Now, all of the metrics that are hidden by default are hidden by setting "display: none;" using CSS for the page, rather than hiding them using JavaScript after the page loads. Without this change, for stages with thousands of tasks, there was a few-second delay after page load, where first the additional metrics were shown, and then after a delay were hidden once the relevant JS finished running. (2) CSS is used to stripe all of the tables except for the summary table. The summary table needs JavaScript to do the striping because some rows are hidden, but the JavaScript striping is slower, which again resulted in a delay when it was used for the task table (where for a few seconds after page load, all of the rows in the task table would be white, while the browser finished running the JS to stripe the table). cc pwendell This change is intended to be backported to 1.2 to avoid a regression in UI performance when users run large jobs. Author: Kay Ousterhout Closes #3328 from kayousterhout/SPARK-4266 and squashes the following commits: f964091 [Kay Ousterhout] [SPARK-4266] [Web-UI] Reduce stage page load time. (cherry picked from commit d24d5bf064572a2319627736b1fbf112b4a78edf) Signed-off-by: Kay Ousterhout commit 841f247a55df8b7f7252ab1b8067a1ea9aa45633 Author: Davies Liu Date: Mon Nov 24 17:17:03 2014 -0800 [SPARK-4548] [SPARK-4517] improve performance of python broadcast Re-implement the Python broadcast using files: 1) serialize the Python object using cPickle and write it to disk; 2) create a wrapper in the JVM (for the dumped file) that reads the data from disk during serialization; 3) use TorrentBroadcast or HttpBroadcast to transfer the data (compressed) to the executors; 4) during deserialization, write the data back to disk; 5) pass the path to the Python worker, which reads the data from disk and unpickles it into a Python object lazily on first access.
It fixes the performance regression introduced in #2659, has performance similar to 1.1, supports objects larger than 2G, and also improves memory efficiency (only one compressed copy in the driver and executor). Testing with a 500M broadcast and 4 tasks (excluding the benefit from reused workers in 1.2):

name | 1.1 | 1.2 with this patch | improvement
---------|--------|---------|--------
python-broadcast-w-bytes | 25.20 | 9.33 | 170.13%
python-broadcast-w-set | 4.13 | 4.50 | -8.35%

Testing with 100 tasks (16 CPUs):

name | 1.1 | 1.2 with this patch | improvement
---------|--------|---------|--------
python-broadcast-w-bytes | 38.16 | 8.40 | 353.98%
python-broadcast-w-set | 23.29 | 9.59 | 142.80%

Author: Davies Liu Closes #3417 from davies/pybroadcast and squashes the following commits: 50a58e0 [Davies Liu] address comments b98de1d [Davies Liu] disable gc while unpickle e5ee6b9 [Davies Liu] support large string 09303b8 [Davies Liu] read all data into memory dde02dd [Davies Liu] improve performance of python broadcast (cherry picked from commit 6cf507685efd01df77d663145ae08e48c7f92948) Signed-off-by: Josh Rosen commit 8371bc20821c39ee6d8116a867577e5c0fcd08ab Author: Davies Liu Date: Mon Nov 24 16:41:23 2014 -0800 [SPARK-4578] fix asDict() with nested Row() The Row object is created on the fly once a field is accessed, so we should access fields via getattr() in asDict(). Author: Davies Liu Closes #3434 from davies/fix_asDict and squashes the following commits: b20f1e7 [Davies Liu] fix asDict() with nested Row() (cherry picked from commit 050616b408c60eae02256913ceb645912dbff62e) Signed-off-by: Patrick Wendell commit 2acbd2884f73c4503d753bb96e0acf75cd237536 Author: tkaessmann Date: Mon Nov 24 16:40:19 2014 -0800 get raw vectors for further processing in Word2Vec, e.g. clustering Author: tkaessmann Closes #3309 from tkaessmann/branch-1.2 and squashes the following commits: e3a3142 [tkaessmann] changes the comment for getVectors 58d3d83 [tkaessmann] removes sign from comment a5be213 [tkaessmann] fixes getVectors to fit code guidelines 3782fa9 [tkaessmann] get raw vectors for further processing commit 9ea67fc1ddd2aca70f6e2da38ebaf7ebc2398981 Author: Davies Liu Date: Mon Nov 24 16:37:14 2014 -0800 [SPARK-4562] [MLlib] speedup vector This PR changes the underlying array of DenseVector to numpy.ndarray to avoid the conversion, because most users will be using numpy.array. It also improves the serialization of DenseVector. Before this change:

trial | trainingTime | testTime
-------|--------|--------
0 | 5.126 | 1.786
1 | 2.698 | 1.693

After the change:

trial | trainingTime | testTime
-------|--------|--------
0 | 4.692 | 0.554
1 | 2.307 | 0.525

This could partially fix the performance regression observed during tests.
Author: Davies Liu Closes #3420 from davies/ser2 and squashes the following commits: 0e1e6f3 [Davies Liu] fix tests 426f5db [Davies Liu] impove toArray() 44707ec [Davies Liu] add name for ISO-8859-1 fa7d791 [Davies Liu] address comments 1cfb137 [Davies Liu] handle zero sparse vector 2548ee2 [Davies Liu] fix tests 9e6389d [Davies Liu] bugfix 470f702 [Davies Liu] speed up DenseMatrix f0d3c40 [Davies Liu] speedup SparseVector ef6ce70 [Davies Liu] speed up dense vector (cherry picked from commit b660de7a9cbdea3df4a37fbcf60c1c33c71782b8) Signed-off-by: Xiangrui Meng commit 6fa3e415d419ee9b2f3d14106a714b627e251e7d Author: Tathagata Das Date: Mon Nov 24 13:50:20 2014 -0800 [SPARK-4518][SPARK-4519][Streaming] Refactored file stream to prevent files from being processed multiple times Because of a corner case, a file already selected for batch t can get considered again for batch t+2. This refactoring fixes it by remembering all the files selected in the last 1 minute, so that this corner case does not arise. Also uses spark context's hadoop configuration to access the file system API for listing directories. pwendell Please take look. I still have not run long-running integration tests, so I cannot say for sure whether this has indeed solved the issue. You could do a first pass on this in the meantime. Author: Tathagata Das Closes #3419 from tdas/filestream-fix2 and squashes the following commits: c19dd8a [Tathagata Das] Addressed PR comments. 513b608 [Tathagata Das] Updated docs. d364faf [Tathagata Das] Added the current time condition back 5526222 [Tathagata Das] Removed unnecessary imports. 38bb736 [Tathagata Das] Fix long line. 203bbc7 [Tathagata Das] Un-ignore tests. eaef4e1 [Tathagata Das] Fixed SPARK-4519 9dbd40a [Tathagata Das] Refactored FileInputDStream to remember last few batches. (cherry picked from commit cb0e9b0980f38befe88bf52aa037fe33262730f7) Signed-off-by: Tathagata Das commit 2d35cc0852e5ce426b143b51d03a71f16ad06c11 Author: Josh Rosen Date: Mon Nov 24 13:18:14 2014 -0800 [SPARK-4145] Web UI job pages This PR adds two new pages to the Spark Web UI: - A jobs overview page, which shows details on running / completed / failed jobs. - A job details page, which displays information on an individual job's stages. The jobs overview page is now the default UI homepage; the old homepage is still accessible at `/stages`. ### Screenshots #### New UI homepage ![image](https://cloud.githubusercontent.com/assets/50748/5119035/fd0a69e6-701f-11e4-89cb-db7e9705714f.png) #### Job details page (This is effectively a per-job version of the stages page that can be extended later with other things, such as DAG visualizations) ![image](https://cloud.githubusercontent.com/assets/50748/5134910/50b340d4-70c7-11e4-88e1-6b73237ea7c8.png) ### Key changes in this PR - Rename `JobProgressPage` to `AllStagesPage` - Expose `StageInfo` objects in the ``SparkListenerJobStart` event; add backwards-compatibility tests to JsonProtocol. - Add additional data structures to `JobProgressListener` to map from stages to jobs. - Add several fields to `JobUIData`. I also added ~150 lines of Selenium tests as I uncovered UI issues while developing this patch. ### Limitations If a job contains stages that aren't run, then its overall job progress bar may be an underestimate of the total job progress; in other words, a completed job may appear to have a progress bar that's not at 100%. If stages or tasks fail, then the progress bar will not go backwards to reflect the true amount of remaining work. 
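One way to consume the `StageInfo`s newly attached to `SparkListenerJobStart` by this PR, sketched against the Spark 1.2 listener API (the listener name is hypothetical):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

class JobTaskCounter extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // stageInfos covers all stages the job *may* run; skipped stages are why
    // a completed job's progress bar can end below 100%.
    val numTasks = jobStart.stageInfos.map(_.numTasks).sum
    println(s"Job ${jobStart.jobId}: up to $numTasks tasks " +
      s"across ${jobStart.stageInfos.size} stages")
  }
}
// Usage (a DeveloperApi method): sc.addSparkListener(new JobTaskCounter)
```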
Author: Josh Rosen Closes #3009 from JoshRosen/job-page and squashes the following commits: eb05e90 [Josh Rosen] Disable kill button in completed stages tables. f00c851 [Josh Rosen] Fix JsonProtocol compatibility b89c258 [Josh Rosen] More JSON protocol backwards-compatibility fixes. ff804cd [Josh Rosen] Don't write "Stage Ids" field in JobStartEvent JSON. 6f17f3f [Josh Rosen] Only store StageInfos in SparkListenerJobStart event. 2bbf41a [Josh Rosen] Update job progress bar to reflect skipped tasks/stages. 61c265a [Josh Rosen] Add “skipped stages” table; only display non-empty tables. 1f45d44 [Josh Rosen] Incorporate a bunch of minor review feedback. 0b77e3e [Josh Rosen] More bug fixes for phantom stages. 034aa8d [Josh Rosen] Use `.max()` to find result stage for job. eebdc2c [Josh Rosen] Don’t display pending stages for completed jobs. 67080ba [Josh Rosen] Ensure that "phantom stages" don't cause memory leaks. 7d10b97 [Josh Rosen] Merge remote-tracking branch 'apache/master' into job-page d69c775 [Josh Rosen] Fix table sorting on all jobs page. 5eb39dc [Josh Rosen] Add pending stages table to job page. f2a15da [Josh Rosen] Add status field to job details page. 171b53c [Josh Rosen] Move `startTime` to the start of SparkContext. e2f2c43 [Josh Rosen] Fix sorting of stages in job details page. 8955f4c [Josh Rosen] Display information for pending stages on jobs page. 8ab6c28 [Josh Rosen] Compute numTasks from job start stage infos. 5884f91 [Josh Rosen] Add StageInfos to SparkListenerJobStart event. 79793cd [Josh Rosen] Track indices of completed stage to avoid overcounting when failures occur. d62ea7b [Josh Rosen] Add failing Selenium test for stage overcounting issue. 1145c60 [Josh Rosen] Display text instead of progress bar for stages. 3d0a007 [Josh Rosen] Merge remote-tracking branch 'origin/master' into job-page 8a2351b [Josh Rosen] Add help tooltip to Spark Jobs page. b7bf30e [Josh Rosen] Add stages progress bar; fix bug where active stages show as completed. 4846ce4 [Josh Rosen] Hide "(Job Group") if no jobs were submitted in job groups. 4d58e55 [Josh Rosen] Change label to "Tasks (for all stages)" 85e9c85 [Josh Rosen] Extract startTime into separate variable. 1cf4987 [Josh Rosen] Fix broken kill links; add Selenium test to avoid future regressions. 56701fa [Josh Rosen] Move last stage name / description logic out of markup. a475ea1 [Josh Rosen] Add progress bars to jobs page. 45343b8 [Josh Rosen] More comments 4b206fb [Josh Rosen] Merge remote-tracking branch 'origin/master' into job-page bfce2b9 [Josh Rosen] Address review comments, except for progress bar. 4487dcb [Josh Rosen] [SPARK-4145] Web UI job pages 2568a6c [Josh Rosen] Rename JobProgressPage to AllStagesPage: (cherry picked from commit 4a90276ab22d6989dffb2ee2d8118d9253365646) Signed-off-by: Patrick Wendell commit 97b7eb4d99613944d39f1421dccc2724c4165c9e Author: Kousuke Saruta Date: Mon Nov 24 12:54:37 2014 -0800 [SPARK-4487][SQL] Fix attribute reference resolution error when using ORDER BY. When we use ORDER BY clause, at first, attributes referenced by projection are resolved (1). And then, attributes referenced at ORDER BY clause are resolved (2). But when resolving attributes referenced at ORDER BY clause, the resolution result generated in (1) is discarded so for example, following query fails. SELECT c1 + c2 FROM mytable ORDER BY c1; The query above fails because when resolving the attribute reference 'c1', the resolution result of 'c2' is discarded. 
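A self-contained reproduction sketch of the failing pattern above (class and table names hypothetical; uses the Spark 1.2 `createSchemaRDD` implicit):

```scala
import org.apache.spark.sql.SQLContext

case class Rec(c1: Int, c2: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD conversion

sc.parallelize(Seq(Rec(1, 2), Rec(3, 4))).registerTempTable("mytable")
// Before the fix, resolving `c1` for ORDER BY discarded the projection's
// resolution of `c2`, so this query failed to analyze.
sqlContext.sql("SELECT c1 + c2 FROM mytable ORDER BY c1").collect()
```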
Author: Kousuke Saruta Closes #3363 from sarutak/SPARK-4487 and squashes the following commits: fd314f3 [Kousuke Saruta] Fixed attribute resolution logic in Analyzer 6e60c20 [Kousuke Saruta] Fixed conflicts cb5b7e9 [Kousuke Saruta] Added test case for SPARK-4487 282d529 [Kousuke Saruta] Fixed attributes reference resolution error b6123e6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into concat-feature 317b7fb [Kousuke Saruta] WIP (cherry picked from commit dd1c9cb36cde8202cede8014b5641ae8a0197812) Signed-off-by: Michael Armbrust commit 0e7fa7f632ebe4db60938f2087c1f1a4d614ab32 Author: scwf Date: Mon Nov 24 12:49:08 2014 -0800 [SQL] Fix path in HiveFromSpark Currently ```HiveFromSpark``` must be run from a specific directory because it uses a relative path; this leads to a ```run-example``` error (http://apache-spark-developers-list.1001551.n3.nabble.com/src-main-resources-kv1-txt-not-found-in-example-of-HiveFromSpark-td9100.html). Author: scwf Closes #3415 from scwf/HiveFromSpark and squashes the following commits: ed3d6c9 [scwf] revert no need change b00e20c [scwf] fix path usring spark_home dbd321b [scwf] fix path in hivefromspark (cherry picked from commit b384119304617459592b7ba435368dd6fcc3273e) Signed-off-by: Michael Armbrust commit 1e3d22b9fd2c0a87330283c5097b2b7ec95a5715 Author: Daniel Darabos Date: Mon Nov 24 12:45:07 2014 -0800 [SQL] Fix comment in HiveShim This file is for Hive 0.13.1, I think. Author: Daniel Darabos Closes #3432 from darabos/patch-2 and squashes the following commits: 4fd22ed [Daniel Darabos] Fix comment. This file is for Hive 0.13.1. (cherry picked from commit d5834f0732b586731034a7df5402c25454770fc5) Signed-off-by: Michael Armbrust commit ee1bc892a32bb969b051b3bc3eaaf9a54af1c7a3 Author: Cheng Lian Date: Mon Nov 24 12:43:45 2014 -0800 [SPARK-4479][SQL] Avoids unnecessary defensive copies when sort based shuffle is on This PR is a workaround for SPARK-4479. Two changes are introduced: when merge sort is bypassed in `ExternalSorter`, 1. also bypass RDD element buffering, since buffering is the reason that `MutableRow`-backed row objects must be copied, and 2. avoid defensive copies in the `Exchange` operator. [Review on Reviewable](https://reviewable.io/reviews/apache/spark/3422) Author: Cheng Lian Closes #3422 from liancheng/avoids-defensive-copies and squashes the following commits: 591f2e9 [Cheng Lian] Passes all shuffle suites 0c3c91e [Cheng Lian] Fixes shuffle write metrics when merge sort is bypassed ed5df3c [Cheng Lian] Fixes styling changes f75089b [Cheng Lian] Avoids unnecessary defensive copies when sort based shuffle is on (cherry picked from commit a6d7b61f92dc7c1f9632cecb232afa8040ab2b4d) Signed-off-by: Michael Armbrust commit 1a12ca339cf038c44f5d7402d63851f48a055b35 Author: Sandy Ryza Date: Mon Nov 24 13:28:48 2014 -0600 SPARK-4457. Document how to build for Hadoop versions greater than 2.4 Author: Sandy Ryza Closes #3322 from sryza/sandy-spark-4457 and squashes the following commits: 5e72b77 [Sandy Ryza] Feedback 0cf05c1 [Sandy Ryza] Caveat be8084b [Sandy Ryza] SPARK-4457. Document how to build for Hadoop versions greater than 2.4 (cherry picked from commit 29372b63185a4a170178b6ec2362d7112f389852) Signed-off-by: Thomas Graves commit 4b68cabf5894643deb99042268fb5b343e8d31f3 Author: DB Tsai Date: Fri Nov 21 18:15:07 2014 -0800 [SPARK-4431][MLlib] Implement efficient foreachActive for dense and sparse vector Previously, we were using Breeze's activeIterator to access the non-zero elements in dense/sparse vectors.
Due to the overhead, we switched back to a native `while` loop in #SPARK-4129. However, #SPARK-4129 requires de-referencing dv.values/sv.values on each access to a value, which is very expensive. Also, in MultivariateOnlineSummarizer, we're using Breeze's dense vector to store the partial stats, and this is very expensive compared with using a primitive Scala array. In this PR, an efficient foreachActive is implemented to unify the code path for dense and sparse vector operations, which makes the codebase easier to maintain. The Breeze dense vector is replaced by a primitive array to reduce the overhead further. Benchmarked with the mnist8m dataset on a single JVM, with the first 200 samples loaded in memory and the operation repeated 5000 times.

Before the change:
Sparse Vector - 30.02
Dense Vector - 38.27

With this PR:
Sparse Vector - 6.29
Dense Vector - 11.72

Author: DB Tsai Closes #3288 from dbtsai/activeIterator and squashes the following commits: 844b0e6 [DB Tsai] formating 03dd693 [DB Tsai] futher performance tunning. 1907ae1 [DB Tsai] address feedback 98448bb [DB Tsai] Made the override final, and had a local copy of variables which made the accessing a single step operation. c0cbd5a [DB Tsai] fix a bug 6441f92 [DB Tsai] Finished SPARK-4431 (cherry picked from commit b5d17ef10e2509d9886c660945449a89750c8116) Signed-off-by: Xiangrui Meng commit 9309ddfc3b9cca3780555fb3ac52d96343cb9545 Author: Davies Liu Date: Fri Nov 21 15:02:31 2014 -0800 [SPARK-4531] [MLlib] cache serialized java object Pyrolite is pretty slow (compared to the ad-hoc serializer in 1.1), and it causes a large performance regression in 1.2, because we cache the serialized Python objects in the JVM and deserialize them into Java objects at each step. This PR changes the code to cache the deserialized JavaRDD instead of the PythonRDD, to avoid the Pyrolite deserialization. It should have similar memory usage as before, but be much faster. Author: Davies Liu Closes #3397 from davies/cache and squashes the following commits: 7f6e6ce [Davies Liu] Update -> Updater 4b52edd [Davies Liu] using named argument 63b984e [Davies Liu] fix 7da0332 [Davies Liu] add unpersist() dff33e1 [Davies Liu] address comments c2bdfc2 [Davies Liu] refactor d572f00 [Davies Liu] Merge branch 'master' into cache f1063e1 [Davies Liu] cache serialized java object (cherry picked from commit ce95bd8e130b2c7688b94be40683bdd90d86012d) Signed-off-by: Xiangrui Meng commit 6a01689a913a1a223fad66848c4fc17ab2931f22 Author: Patrick Wendell Date: Fri Nov 21 12:10:04 2014 -0800 SPARK-4532: Fix bug in detection of Hive in Spark 1.2 Because the Hive profile is no longer defined in the root pom, we need to check specifically in the sql/hive pom when we perform the check in make-distribution.sh. Author: Patrick Wendell Closes #3398 from pwendell/make-distribution and squashes the following commits: 8a58279 [Patrick Wendell] Fix bug in detection of Hive in Spark 1.2 (cherry picked from commit a81918c5a66fc6040f9796fc1a9d4e0bfb8d0cbe) Signed-off-by: Patrick Wendell commit 6f70e0295572e3037660004797040e026e440dbd Author: zsxwing Date: Fri Nov 21 00:42:43 2014 -0800 [SPARK-4472][Shell] Print "Spark context available as sc." only when SparkContext is created... ... successfully It's odd to print "Spark context available as sc" when SparkContext creation has failed. Author: zsxwing Closes #3341 from zsxwing/SPARK-4472 and squashes the following commits: 4850093 [zsxwing] Print "Spark context available as sc."
only when SparkContext is created successfully (cherry picked from commit f1069b84b82b932751604bc20d5c2e451d57c455) Signed-off-by: Reynold Xin commit 668643b8de0958094766fa62e7e2a7a0909f11da Author: Michael Armbrust Date: Thu Nov 20 20:34:43 2014 -0800 [SPARK-4522][SQL] Parse schema with missing metadata. This is just a quick fix for 1.2. SPARK-4523 describes a more complete solution. Author: Michael Armbrust Closes #3392 from marmbrus/parquetMetadata and squashes the following commits: bcc6626 [Michael Armbrust] Parse schema with missing metadata. (cherry picked from commit 90a6a46bd11030672597f015dd443d954107123a) Signed-off-by: Michael Armbrust commit e445d3ce4e4fb9ee3c2feddb9734d541b61c6c01 Author: Davies Liu Date: Thu Nov 20 19:12:45 2014 -0800 add Sphinx as a dependency for building docs Author: Davies Liu Closes #3388 from davies/doc_readme and squashes the following commits: daa1482 [Davies Liu] add Sphinx dependency (cherry picked from commit 8cd6eea6298fc8e811dece38c2875e94ff863948) Signed-off-by: Patrick Wendell commit 64b30be7e4cb86059bbfeb3e2f8f47f41d015862 Author: Michael Armbrust Date: Thu Nov 20 18:31:02 2014 -0800 [SPARK-4413][SQL] Parquet support through datasource API Goals: - Support for accessing parquet using SQL but not requiring Hive (thus allowing support of parquet tables with decimal columns) - Support for folder based partitioning with automatic discovery of available partitions - Caching of file metadata See the scaladoc of `ParquetRelation2` for more details. Author: Michael Armbrust Closes #3269 from marmbrus/newParquet and squashes the following commits: 1dd75f1 [Michael Armbrust] Pass all paths for FileInputFormat at once. 645768b [Michael Armbrust] Review comments. abd8e2f [Michael Armbrust] Alternative implementation of parquet based on the datasources API. 938019e [Michael Armbrust] Add an experimental interface to data sources that exposes catalyst expressions. e9d2641 [Michael Armbrust] logging / formatting improvements. (cherry picked from commit 02ec058efe24348cdd3691b55942e6f0ef138732) Signed-off-by: Michael Armbrust commit 0f6a2eeaf20363061f9ed2d9062f3a7022c2c8ba Author: Cheng Hao Date: Thu Nov 20 16:50:59 2014 -0800 [SPARK-4244] [SQL] Support Hive Generic UDFs with constant object inspector parameters The query `SELECT named_struct(lower("AA"), "12", lower("Bb"), "13") FROM src LIMIT 1` will throw an exception: some Hive Generic UDFs/UDAFs require the input object inspector to be a `ConstantObjectInspector`, but we won't have that until expression optimization (constant folding) has executed. This PR is a workaround to fix this. (Ideally, the `output` of a LogicalPlan should be identical before and after optimization.) Author: Cheng Hao Closes #3109 from chenghao-intel/optimized and squashes the following commits: 487ff79 [Cheng Hao] rebase to the latest master & update the unittest (cherry picked from commit 84d79ee9ec47465269f7b0a7971176da93c96f3f) Signed-off-by: Michael Armbrust commit 5153aa041fd4ca8b2a4df4d635598090280655c6 Author: Davies Liu Date: Thu Nov 20 16:40:25 2014 -0800 [SPARK-4477] [PySpark] remove numpy from RDDSampler In RDDSampler, we tried using numpy to gain better performance for poisson(), but the number of calls to random() is only (1 + fraction) * N in the pure Python implementation of poisson(), so there is not much performance gain from numpy. numpy is not a dependency of pyspark, so it may introduce problems, such as numpy being installed on the master but not on the slaves, as reported in SPARK-927.
commit 5153aa041fd4ca8b2a4df4d635598090280655c6 Author: Davies Liu Date: Thu Nov 20 16:40:25 2014 -0800 [SPARK-4477] [PySpark] remove numpy from RDDSampler RDDSampler tried to use numpy to gain better performance for poisson(), but the number of calls to random() is only (1 + fraction) * N in the pure Python implementation of poisson(), so there is not much performance gain from numpy. numpy is not a dependency of pyspark, so it may introduce problems, such as numpy being installed on the master but not on the slaves, as reported in SPARK-927. It also complicates the code a lot, so we should remove numpy from RDDSampler. I also did some benchmarking to verify that:
```
>>> from pyspark.mllib.random import RandomRDDs
>>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
>>> rdd.count()  # cache it
>>> rdd.sample(True, 0.9).count()  # measure this line
```
The results:

| withReplacement | random | numpy.random |
| --------------- | ------ | ------------ |
| True            | 1.5 s  | 1.4 s        |
| False           | 0.6 s  | 0.8 s        |

closes #2313 Note: this patch includes some commits that are not mirrored to GitHub; it will be OK after the mirror catches up. Author: Davies Liu Author: Xiangrui Meng Closes #3351 from davies/numpy and squashes the following commits: 5c438d7 [Davies Liu] fix comment c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477 98eb31b [Xiangrui Meng] make poisson sampling slightly faster ee17d78 [Davies Liu] remove = for float 13f7b05 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into numpy f583023 [Davies Liu] fix tests 51649f5 [Davies Liu] remove numpy in RDDSampler 78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain f5fdf63 [Davies Liu] fix bug with int in weights 4dfa2cd [Davies Liu] refactor f866bcf [Davies Liu] remove unneeded change c7a2007 [Davies Liu] switch to python implementation 95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit 0d9b256 [Davies Liu] refactor 1715ee3 [Davies Liu] address comments 41fce54 [Davies Liu] randomSplit() (cherry picked from commit d39f2e9c683a4ab78b29eb3c5668325bf8568e8c) Signed-off-by: Xiangrui Meng
commit 69e28046b5ebc1ec3afb678b4c81c69e48c02aa8 Author: Jacky Li Date: Thu Nov 20 15:48:36 2014 -0800 [SQL] fix function description mistake The sample code in the description of SchemaRDD.where is not correct. Author: Jacky Li Closes #3344 from jackylk/patch-6 and squashes the following commits: 62cd126 [Jacky Li] [SQL] fix function description mistake (cherry picked from commit ad5f1f3ca240473261162c06ffc5aa70d15a5991) Signed-off-by: Michael Armbrust
commit 29e8d50773c40abe949d6b3284e0e89a0acb45af Author: Cheng Hao Date: Thu Nov 20 15:46:00 2014 -0800 [SPARK-2918] [SQL] Support the CTAS in EXPLAIN command Hive supports `EXPLAIN` on CTAS statements, which Spark SQL previously supported as well; however, it seems this was reverted during the HiveQL code refactoring. Author: Cheng Hao Closes #3357 from chenghao-intel/explain and squashes the following commits: 7aace63 [Cheng Hao] Support the CTAS in EXPLAIN command (cherry picked from commit 6aa0fc9f4d95f09383cbcb5f79166c60697e6683) Signed-off-by: Michael Armbrust
commit 1d7ee2b79b23f08f73a6d53f41ac8fa140b91c19 Author: Takuya UESHIN Date: Thu Nov 20 15:41:24 2014 -0800 [SPARK-4318][SQL] Fix empty sum distinct. Executing sum distinct on an empty table throws `java.lang.UnsupportedOperationException: empty.reduceLeft`. Author: Takuya UESHIN Closes #3184 from ueshin/issues/SPARK-4318 and squashes the following commits: 8168c42 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4318 66fdb0a [Takuya UESHIN] Re-refine aggregate functions. 6186eb4 [Takuya UESHIN] Fix Sum of GeneratedAggregate. d2975f6 [Takuya UESHIN] Refine Sum and Average of GeneratedAggregate. 1bba675 [Takuya UESHIN] Refine Sum, SumDistinct and Average functions. 917e533 [Takuya UESHIN] Use aggregate instead of groupBy(). 1a5f874 [Takuya UESHIN] Add tests to be executed as non-partial aggregation. a5a57d2 [Takuya UESHIN] Fix empty Average. 22799dc [Takuya UESHIN] Fix empty Sum and SumDistinct. 65b7dd2 [Takuya UESHIN] Fix empty sum distinct. (cherry picked from commit 2c2e7a44db2ebe44121226f3eac924a0668b991a) Signed-off-by: Michael Armbrust
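To make the failure mode concrete, a hedged sketch; the table name, case class, and the `sqlContext`/`sc` bindings are assumptions, not from the patch:

```scala
// SPARK-4318 above: SUM(DISTINCT ...) over an empty relation used to throw
// empty.reduceLeft instead of returning a null aggregate.
case class Record(value: Int)
val empty = sqlContext.createSchemaRDD(sc.parallelize(Seq.empty[Record]))
empty.registerTempTable("emptyTable")

// Before the fix: java.lang.UnsupportedOperationException: empty.reduceLeft
// After the fix: a single row containing null.
sqlContext.sql("SELECT SUM(DISTINCT value) FROM emptyTable").collect()
```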
commit 8608ff59881b3cfa6c4cd407ba2c0af7a78e88a9 Author: ravipesala Date: Thu Nov 20 15:34:03 2014 -0800 [SPARK-4513][SQL] Support relational operator '<=>' in Spark SQL The relational operator '<=>' was not working in Spark SQL, while the same operator works in Spark HiveQL. Author: ravipesala Closes #3387 from ravipesala/<=> and squashes the following commits: 7198e90 [ravipesala] Supporting relational operator '<=>' in Spark SQL (cherry picked from commit 98e9419784a9ad5096cfd563fa9a433786a90bd4) Signed-off-by: Michael Armbrust
commit 72f5ba1fc152fa5dee11740f6193d5cd95bcdce3 Author: Davies Liu Date: Thu Nov 20 15:31:28 2014 -0800 [SPARK-4439] [MLlib] add python api for random forest
```
class RandomForestModel
| A model trained by RandomForest
|
| numTrees(self)
| Get number of trees in forest.
|
| predict(self, x)
| Predict values for a single data point or an RDD of points using the model trained.
|
| toDebugString(self)
| Full model
|
| totalNumNodes(self)
| Get total number of nodes, summed over all trees in the forest.

class RandomForest
| trainClassifier(cls, data, numClassesForClassification, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None):
| Method to train a random forest model for binary or multiclass classification.
|
| :param data: Training dataset: RDD of LabeledPoint. Labels should take values {0, 1, ..., numClasses-1}.
| :param numClassesForClassification: number of classes for classification.
| :param categoricalFeaturesInfo: Map storing arity of categorical features. E.g., an entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.
| :param numTrees: Number of trees in the random forest.
| :param featureSubsetStrategy: Number of features to consider for splits at each node. Supported: "auto" (default), "all", "sqrt", "log2", "onethird". If "auto" is set, this parameter is set based on numTrees: if numTrees == 1, set to "all"; if numTrees > 1 (forest) set to "sqrt".
| :param impurity: Criterion used for information gain calculation. Supported values: "gini" (recommended) or "entropy".
| :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (default: 4)
| :param maxBins: maximum number of bins used for splitting features (default: 32)
| :param seed: Random seed for bootstrapping and choosing feature subsets.
| :return: RandomForestModel that can be used for prediction
|
| trainRegressor(cls, data, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='variance', maxDepth=4, maxBins=32, seed=None):
| Method to train a random forest model for regression.
|
| :param data: Training dataset: RDD of LabeledPoint. Labels are real numbers.
| :param categoricalFeaturesInfo: Map storing arity of categorical features. E.g., an entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.
| :param numTrees: Number of trees in the random forest.
| :param featureSubsetStrategy: Number of features to consider for splits at each node. Supported: "auto" (default), "all", "sqrt", "log2", "onethird". If "auto" is set, this parameter is set based on numTrees: if numTrees == 1, set to "all"; if numTrees > 1 (forest) set to "onethird".
| :param impurity: Criterion used for information gain calculation. Supported values: "variance".
| :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (default: 4)
| :param maxBins: maximum number of bins used for splitting features (default: 32)
| :param seed: Random seed for bootstrapping and choosing feature subsets.
| :return: RandomForestModel that can be used for prediction
```
Author: Davies Liu Closes #3320 from davies/forest and squashes the following commits: 8003dfc [Davies Liu] reorder 53cf510 [Davies Liu] fix docs 4ca593d [Davies Liu] fix docs e0df852 [Davies Liu] fix docs 0431746 [Davies Liu] rebased 2b6f239 [Davies Liu] Merge branch 'master' of github.com:apache/spark into forest 885abee [Davies Liu] address comments dae7fc0 [Davies Liu] address comments 89a000f [Davies Liu] fix docs 565d476 [Davies Liu] add python api for random forest (cherry picked from commit 1c53a5db993193122bfa79574d2540149fe2cc08) Signed-off-by: Xiangrui Meng
commit 21f582f12b4d00017b990bcc232dcbf546b5dbe7 Author: Dan McClary Date: Thu Nov 20 13:36:50 2014 -0800 [SPARK-4228][SQL] SchemaRDD to JSON Here's a simple fix for SchemaRDD to JSON (a usage sketch follows below). Author: Dan McClary Closes #3213 from dwmclary/SPARK-4228 and squashes the following commits: d714e1d [Dan McClary] fixed PEP 8 error cac2879 [Dan McClary] move pyspark comment and doctest to correct location f9471d3 [Dan McClary] added pyspark doc and doctest 6598cee [Dan McClary] adding complex type queries 1a5fd30 [Dan McClary] removing SPARK-4228 from SQLQuerySuite 4a651f0 [Dan McClary] cleaned PEP and Scala style failures. Moved tests to JsonSuite 47ceff6 [Dan McClary] cleaned up scala style issues 2ee1e70 [Dan McClary] moved rowToJSON to JsonRDD 4387dd5 [Dan McClary] Added UserDefinedType, cleaned up case formatting 8f7bfb6 [Dan McClary] Map type added to SchemaRDD.toJSON 1b11980 [Dan McClary] Map and UserDefinedTypes partially done 11d2016 [Dan McClary] formatting and unicode deserialization default fixed 6af72d1 [Dan McClary] deleted extaneous comment 4d11c0c [Dan McClary] JsonFactory rewrite of toJSON for SchemaRDD 149dafd [Dan McClary] wrapped scala toJSON in sql.py 5e5eb1b [Dan McClary] switched to Jackson for JSON processing 6c94a54 [Dan McClary] added toJSON to pyspark SchemaRDD aaeba58 [Dan McClary] added toJSON to pyspark SchemaRDD 1d171aa [Dan McClary] upated missing brace on if statement 319e3ba [Dan McClary] updated to upstream master with merged SPARK-4228 424f130 [Dan McClary] tests pass, ready for pull and PR 626a5b1 [Dan McClary] added toJSON to SchemaRDD f7d166a [Dan McClary] added toJSON method 5d34e37 [Dan McClary] merge resolved d6d19e9 [Dan McClary] pr example (cherry picked from commit b8e6886fb8ff8f667fb7e600cd727d8649cad1d1) Signed-off-by: Michael Armbrust
commit 2fb683c585d8f30a7b19027b941812c922e7d99a Author: Cheng Lian Date: Thu Nov 20 13:12:24 2014 -0800 [SPARK-3938][SQL] Names in-memory columnar RDD with corresponding table name This PR enables the Web UI storage tab to show the in-memory table name instead of the mysterious query plan string as the name of the in-memory columnar RDD. Note that after #2501, a single columnar RDD can be shared by multiple in-memory tables, as long as their query results are the same. In this case, only the first cached table name is shown. For example:
```sql
CACHE TABLE first AS SELECT * FROM src;
CACHE TABLE second AS SELECT * FROM src;
```
The Web UI only shows "In-memory table first". [Review on Reviewable](https://reviewable.io/reviews/apache/spark/3383) Author: Cheng Lian Closes #3383 from liancheng/columnar-rdd-name and squashes the following commits: 071907f [Cheng Lian] Fixes tests 12ddfa6 [Cheng Lian] Names in-memory columnar RDD with corresponding table name (cherry picked from commit abf29187f0342b607fcefe269391d4db58d2a957) Signed-off-by: Michael Armbrust
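The SchemaRDD.toJSON sketch promised above; a hedged example assuming a SQLContext bound to `sqlContext` and a SparkContext bound to `sc`:

```scala
// SPARK-4228: serialize a SchemaRDD's rows back to JSON strings.
val people = sqlContext.jsonRDD(
  sc.parallelize("""{"name":"Alice","age":30}""" :: Nil))
people.registerTempTable("people")

// toJSON returns an RDD[String], one JSON document per row.
sqlContext.sql("SELECT name FROM people").toJSON.collect()
// e.g. Array({"name":"Alice"})
```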
commit b676d9ad347e296929361809b0001c0f5c700514 Author: zsxwing Date: Thu Nov 20 01:13:36 2014 -0800 [SPARK-4481][Streaming][Doc] Fix the wrong description of updateFunc (backport for branch-1.2) backport for branch-1.2 as per #3356 Author: zsxwing Closes #3376 from zsxwing/SPARK-4481-branch-1.2 and squashes the following commits: 53b94e8 [zsxwing] Fix the wrong description of updateFunc
commit e958132a80d202b70976632a51c7e8e4b58d9c4e Author: Xiangrui Meng Date: Thu Nov 20 00:48:59 2014 -0800 [SPARK-4486][MLLIB] Improve GradientBoosting APIs and doc There are some inconsistencies in the gradient boosting APIs. The target is a general boosting meta-algorithm, but the implementation is attached to trees. This was partially due to the delay of SPARK-1856. But for the 1.2 release, we should make the APIs consistent.
1. WeightedEnsembleModel -> private[tree] TreeEnsembleModel and renamed members accordingly.
2. GradientBoosting -> GradientBoostedTrees
3. Add RandomForestModel and GradientBoostedTreesModel and hide CombiningStrategy
4. Slightly refactored TreeEnsembleModel (Vote takes weights into consideration.)
5. Remove `trainClassifier` and `trainRegressor` from `GradientBoostedTrees` because they are the same as `train`
6. Rename class `train` method to `run` because it hides the static methods with the same name in Java. Deprecated `DecisionTree.train` class method.
7. Simplify BoostingStrategy and make sure the input strategy is not modified. Users should put algo and numClasses in treeStrategy. We create ensembleStrategy inside boosting.
8. Fix a bug in GradientBoostedTreesSuite with AbsoluteError
9. doc updates
manishamde jkbradley Author: Xiangrui Meng Closes #3374 from mengxr/SPARK-4486 and squashes the following commits: 7097251 [Xiangrui Meng] address joseph's comments 98dea09 [Xiangrui Meng] address manish's comments 4aae3b7 [Xiangrui Meng] add RandomForestModel and GradientBoostedTreesModel, hide CombiningStrategy ea4c467 [Xiangrui Meng] fix unit tests 751da4e [Xiangrui Meng] rename class method train -> run 19030a5 [Xiangrui Meng] update boosting public APIs (cherry picked from commit 15cacc81240eed8834b4730c5c6dc3238f003465) Signed-off-by: Xiangrui Meng
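A hedged sketch of the renamed entry points; the parameter values are illustrative only:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

// SPARK-4486 above: GradientBoosting is now GradientBoostedTrees, and the
// instance method is `run`; the static `train` helper remains for convenience.
val trainingData = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0))))

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 10

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
```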
commit 83d24efb074f9b9f9aacc1e486c994a6799e981d Author: Leolh Date: Wed Nov 19 18:18:55 2014 -0800 [SPARK-4446] [SPARK CORE] MetadataCleaner schedule task with a wrong param for delay time. Author: Leolh Closes #3306 from Leolh/master and squashes the following commits: 4a21f4e [Leolh] Update MetadataCleaner.scala (cherry picked from commit e216ffaead983274428052caa992b20760b2c5e0) Signed-off-by: Andrew Or
commit 4a5c3d21b4df8fa506fe0365a0718c94bbc1cd1b Author: Andrew Or Date: Wed Nov 19 18:07:27 2014 -0800 [SPARK-4480] Avoid many small spills in external data structures **Summary.** Currently, we may spill many small files in `ExternalAppendOnlyMap` and `ExternalSorter`. The underlying root cause of this is summarized in [SPARK-4452](https://issues.apache.org/jira/browse/SPARK-4452). This PR does not address this root cause, but simply provides the guarantee that we never spill the in-memory data structure if its size is less than a configurable threshold of 5MB. This config is not documented because we don't want users to set it themselves, and it is not hard-coded because we need to change it in tests. **Symptom.** Each spill is orders of magnitude smaller than 1MB, and there are many spills. In environments where the ulimit is set, this frequently causes "too many open files" exceptions observed in [SPARK-3633](https://issues.apache.org/jira/browse/SPARK-3633).
```
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4792 B to disk (292769 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4760 B to disk (292770 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4520 B to disk (292771 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4560 B to disk (292772 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4792 B to disk (292773 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4784 B to disk (292774 spills so far)
```
**Reproduction.** I ran the following on a small 4-node cluster with 512MB executors. Note that the back-to-back shuffle here is necessary for reasons described in [SPARK-4452](https://issues.apache.org/jira/browse/SPARK-4452). The second shuffle is a `reduceByKey` because it performs a map-side combine.
```
sc.parallelize(1 to 100000000, 100)
  .map { i => (i, i) }
  .groupByKey()
  .reduceByKey(_ ++ _)
  .count()
```
Before the change, I noticed that each thread may spill up to 1000 times, and the size of each spill is on the order of 10KB. After the change, each thread spills only up to 20 times in the worst case, and the size of each spill is on the order of 1MB. Author: Andrew Or Closes #3353 from andrewor14/avoid-small-spills and squashes the following commits: 49f380f [Andrew Or] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark into avoid-small-spills 27d6966 [Andrew Or] Merge branch 'master' of github.com:apache/spark into avoid-small-spills f4736e3 [Andrew Or] Fix tests a919776 [Andrew Or] Avoid many small spills (cherry picked from commit 0eb4a7fb0fa1fa56677488cbd74eb39e65317621) Signed-off-by: Andrew Or
commit f21e550e35e77363d2804fe22ad3f879a66498f1 Author: Nishkam Ravi Date: Wed Nov 19 17:23:42 2014 -0800 [Spark-4484] Treat maxResultSize as unlimited when set to 0; improve error message The check for maxResultSize > 0 is missing, which results in failures. Also, the error message needs to be improved so that developers know there is a new parameter to be configured. Author: Nishkam Ravi Author: nravi Author: nishkamravi2 Closes #3360 from nishkamravi2/master_nravi and squashes the following commits: 5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala 535295a [nishkamravi2] Update TaskSetManager.scala 3e1b616 [Nishkam Ravi] Modify test for maxResultSize 9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0) 5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi 636a9ff [nishkamravi2] Update YarnAllocator.scala 8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead 35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead 5ac2ec1 [Nishkam Ravi] Remove out dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue 42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue 362da5e [Nishkam Ravi] Additional changes for yarn memory overhead c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead 1cf2d1e [nishkamravi2] Update YarnAllocator.scala ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts) 2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark 2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark 3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark 5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456) 6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed) 5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456) 681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles (cherry picked from commit 73fedf5a6e662b640dfe29936753721988bff6ea) Signed-off-by: Andrew Or
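A hedged configuration sketch for the behavior above; the values are illustrative:

```scala
import org.apache.spark.SparkConf

// Spark-4484 above: spark.driver.maxResultSize = 0 is treated as "unlimited";
// a positive value caps the total size of serialized task results.
val conf = new SparkConf()
  .set("spark.driver.maxResultSize", "0")   // unlimited
// .set("spark.driver.maxResultSize", "1g") // or an explicit cap
```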
commit d68b40bfca70cfbcee052dd6fea4f39602bf9dcf Author: Akshat Aranya Date: Wed Nov 19 17:20:20 2014 -0800 [SPARK-4478] Keep totalRegisteredExecutors up-to-date This rebases PR 3368. This commit fixes the totalRegisteredExecutors update [SPARK-4478], so that we can correctly keep track of the number of registered executors. Author: Akshat Aranya Closes #3373 from coolfrood/topic/SPARK-4478 and squashes the following commits: 8a4d1e4 [Akshat Aranya] Added comment 150ae93 [Akshat Aranya] [SPARK-4478] Keep totalRegisteredExecutors up-to-date (cherry picked from commit 9ccc53c72c5bcffcc121291710754e1e2d659341) Signed-off-by: Andrew Or
commit 8786ddd48166a3c7da20bf37ab894053d882e078 Author: Joseph E. Gonzalez Date: Wed Nov 19 16:53:33 2014 -0800 Updating GraphX programming guide and documentation This pull request revises the programming guide to reflect changes in the GraphX API as well as the deprecated mapReduceTriplets operator. Author: Joseph E. Gonzalez Closes #3359 from jegonzal/GraphXProgrammingGuide and squashes the following commits: 4421964 [Joseph E. Gonzalez] updating documentation for graphx (cherry picked from commit 377b06820934cab6d67f3a9182528c7f417a7d98) Signed-off-by: Reynold Xin
commit a7c64cc8f939b6c777e296f775d68fb7088a7530 Author: Josh Rosen Date: Wed Nov 19 16:50:21 2014 -0800 [SPARK-4495] Fix memory leak in JobProgressListener This commit fixes a memory leak in JobProgressListener that I introduced in SPARK-2321 and adds a testing framework to ensure that it’s very difficult to inadvertently introduce new memory leaks. This solution might be overkill, but the main idea is to partition JobProgressListener's state into three buckets: collections that should be empty once Spark is idle, collections that must obey some hard size limit, and collections that have a soft size limit (they can grow arbitrarily large when Spark is active but must shrink to fit within some bound after Spark becomes idle). Based on this, we can write fairly generic tests that run workloads that submit more than `spark.ui.retainedStages` stages and `spark.ui.retainedJobs` jobs, then check that these various collections' sizes obey their contracts. Author: Josh Rosen Closes #3372 from JoshRosen/SPARK-4495 and squashes the following commits: c73fab5 [Josh Rosen] "data structures" -> collections be72e81 [Josh Rosen] [SPARK-4495] Fix memory leaks in JobProgressListener (cherry picked from commit 04d462f648aba7b18fc293b7189b86af70e421bc) Signed-off-by: Josh Rosen
commit a250ca369208b23503d7fff1cf9ee52e2e1ba3e2 Author: Yadong Qi Date: Wed Nov 19 15:53:06 2014 -0800 [SPARK-4294][Streaming] UnionDStream stream should express the requirements in the same way as TransformedDStream In class TransformedDStream:
```scala
require(parents.length > 0, "List of DStreams to transform is empty")
require(parents.map(_.ssc).distinct.size == 1, "Some of the DStreams have different contexts")
require(parents.map(_.slideDuration).distinct.size == 1, "Some of the DStreams have different slide durations")
```
In class UnionDStream:
```scala
if (parents.length == 0) {
  throw new IllegalArgumentException("Empty array of parents")
}
if (parents.map(_.ssc).distinct.size > 1) {
  throw new IllegalArgumentException("Array of parents have different StreamingContexts")
}
if (parents.map(_.slideDuration).distinct.size > 1) {
  throw new IllegalArgumentException("Array of parents have different slide times")
}
```
The function is the same, but the realization is not. I think they should be the same. Author: Yadong Qi Closes #3152 from watermen/bug-fix1 and squashes the following commits: ed66db6 [Yadong Qi] Change transform to union b6b3b8b [Yadong Qi] The same function should have the same realization. (cherry picked from commit c3002c4a61c4fc5b966aa384c41c3cba33de0aa6) Signed-off-by: Tathagata Das
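A hedged sketch of what the unified checks might look like; this mirrors code that lives inside the streaming package (DStream.ssc is package-private), so the helper below is hypothetical:

```scala
import org.apache.spark.streaming.dstream.DStream

// SPARK-4294 above: express UnionDStream's preconditions with require(...),
// mirroring TransformedDStream, instead of hand-rolled if/throw blocks.
def validateParents(parents: Seq[DStream[_]]): Unit = {
  require(parents.length > 0, "List of DStreams to union is empty")
  require(parents.map(_.ssc).distinct.size == 1,
    "Some of the DStreams have different contexts")
  require(parents.map(_.slideDuration).distinct.size == 1,
    "Some of the DStreams have different slide durations")
}
```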
commit c4abb2eb4f6a2875bbe22b12c246d8ae1773ece2 Author: Ken Takagiwa Date: Wed Nov 19 14:23:18 2014 -0800 [DOC][PySpark][Streaming] Fix docstring for sphinx This commit should be merged for the 1.2 release. cc tdas Author: Ken Takagiwa Closes #3311 from giwa/patch-3 and squashes the following commits: ab474a8 [Ken Takagiwa] [DOC][PySpark][Streaming] Fix docstring for sphinx (cherry picked from commit 9b7bbcef8863ecd69e7511825ef9c93d8632dac2) Signed-off-by: Tathagata Das
commit 8ecabf4b7678d788faba6a202e883855be0c9f99 Author: Davies Liu Date: Wed Nov 19 15:45:37 2014 -0800 [SPARK-4384] [PySpark] improve sort spilling If there are some big broadcasts (or other objects) in a Python worker, the free memory that could be used for sorting will be too small, so it will keep spilling small files to disk and finally fail with too many open files. This PR tries to delay spilling until the used memory goes over the limit and has started to increase since the last spill; this increases the size of the spill files and improves stability and performance in these cases. (We also do this in ExternalAggregator.) Author: Davies Liu Closes #3252 from davies/sort and squashes the following commits: 711fb6c [Davies Liu] improve sort spilling (cherry picked from commit 73c8ea84a668f443eb18ce15ba97023da041d808) Signed-off-by: Josh Rosen
commit 633d67cb73840225ec5deb5563de53e1f43532a5 Author: Takuya UESHIN Date: Wed Nov 19 14:40:21 2014 -0800 [SPARK-4429][BUILD] Build for Scala 2.11 using sbt fails. I tried to build for Scala 2.11 using sbt with the following command:
```
$ sbt/sbt -Dscala-2.11 assembly
```
but it ends with the following error messages:
```
[error] (streaming-kafka/*:update) sbt.ResolveException: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.0: not found
[error] (catalyst/*:update) sbt.ResolveException: unresolved dependency: org.scalamacros#quasiquotes_2.11;2.0.1: not found
```
The reason is: if the system property `-Dscala-2.11` (without a value) is set, `SparkBuild.scala` adds the `scala-2.11` profile, but `sbt-pom-reader` activates the `scala-2.10` profile instead of the `scala-2.11` profile, because the activator `PropertyProfileActivator` used by `sbt-pom-reader` internally checks whether the property value is empty. If the value is set to a non-empty value, there is no need to add profiles in `SparkBuild.scala`, because `sbt-pom-reader` can handle it as expected. Author: Takuya UESHIN Closes #3342 from ueshin/issues/SPARK-4429 and squashes the following commits: 14d86e8 [Takuya UESHIN] Add a comment. 4eef52b [Takuya UESHIN] Remove unneeded condition. ce98d0f [Takuya UESHIN] Set non-empty value to system property "scala-2.11" if the property exists instead of adding profile. (cherry picked from commit f9adda9afb63bfdb722be95304f991a3b38a54b3) Signed-off-by: Patrick Wendell
commit fc73171d580ca5f5f88d9cdbdfaf1ebf1c4557d9 Author: Prashant Sharma Date: Wed Nov 19 14:18:10 2014 -0800 SPARK-3962 Marked scope as provided for external projects. Somehow the maven shade plugin gets caught in an infinite loop of creating the effective pom. Author: Prashant Sharma Author: Prashant Sharma Closes #2959 from ScrapCodes/SPARK-3962/scope-provided and squashes the following commits: 994d1d3 [Prashant Sharma] Fixed failing flume tests 270b4fb [Prashant Sharma] Removed most of the unused code. bb3bbfd [Prashant Sharma] SPARK-3962 Marked scope as provided for external. (cherry picked from commit 1c938413ba5579034675f1b4ea3b8fd0e47dd8d6) Signed-off-by: Patrick Wendell
commit ce5ea0fd611ce560f6e1fac83562469bdb97091e Author: Tathagata Das Date: Wed Nov 19 13:06:48 2014 -0800 [SPARK-4482][Streaming] Disable ReceivedBlockTracker's write ahead log by default The write ahead log of ReceivedBlockTracker gets enabled as soon as the checkpoint directory is set. This should not happen, as the WAL should be enabled only if the WAL is enabled in the Spark configuration. Author: Tathagata Das Closes #3358 from tdas/SPARK-4482 and squashes the following commits: b740136 [Tathagata Das] Fixed bug in ReceivedBlockTracker (cherry picked from commit 22fc4e751c0a2f0ff39e42aa0a8fb9459d7412ec) Signed-off-by: Tathagata Das
commit 2fb40e1aa758a0c305198befb1884b81ac22ae79 Author: Kenichi Maehashi Date: Wed Nov 19 12:11:09 2014 -0800 [SPARK-4470] Validate number of threads in local mode When running Spark locally, if the number of threads is specified as 0 (e.g., `spark-submit --master local[0] ...`), the job gets stuck and does not run at all. I think it's better to validate the parameter. Fix for [SPARK-4470](https://issues.apache.org/jira/browse/SPARK-4470). Author: Kenichi Maehashi Closes #3337 from kmaehashi/spark-4470 and squashes the following commits: 3ad76f3 [Kenichi Maehashi] fix code style 7716734 [Kenichi Maehashi] SPARK-4470: Validate number of threads in local mode (cherry picked from commit eacc788346ccae232bd530dd880f801475a49734) Signed-off-by: Andrew Or
commit 9da71f8651101eed93090829aa01501367284d09 Author: Tianshuo Deng Date: Wed Nov 19 10:01:09 2014 -0800 [SPARK-4467] fix elements read count for ExternalSorter The elementsRead variable should be reset to 0 after each spill. Author: Tianshuo Deng Closes #3302 from tsdeng/fix_external_sorter_record_count and squashes the following commits: 7b56ca0 [Tianshuo Deng] fix method signature 782c7de [Tianshuo Deng] make elementsRead private, fix comment bb7ff28 [Tianshuo Deng] update elemetsRead through addElementsRead method 74ca246 [Tianshuo Deng] fix elements read count (cherry picked from commit d75579d09912cfb1eeac0589d625ea0452701fa0) Signed-off-by: Andrew Or
commit 1d0fa7fb073e5760a9d7d7d3bfa13f3e7ea48e1a Author: tedyu Date: Wed Nov 19 00:55:39 2014 -0800 SPARK-4455 Exclude dependency on hbase-annotations module pwendell Please take a look Author: tedyu Closes #3286 from tedyu/master and squashes the following commits: e61e610 [tedyu] SPARK-4455 Exclude dependency on hbase-annotations module 7e3a57a [tedyu] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark 2f28b08 [tedyu] Exclude dependency on hbase-annotations module (cherry picked from commit 5f5ac2dafaf849d2375c81d699d82874ac462b49) Signed-off-by: Patrick Wendell
commit e0a20994f4deead62b4c038500bb1a98992f9974 Author: Mingfei Date: Tue Nov 18 22:17:06 2014 -0800 [Spark-4432] close InStream after the block is accessed The InStream is not closed after data is read from Tachyon, which leaves the blocks in Tachyon locked after they are accessed. Author: Mingfei Closes #3290 from shimingfei/lockFix and squashes the following commits: fffe345 [Mingfei] close InStream after the block is accessed
commit d1d6de630faad23f5f88f6c5a254720546d97c72 Author: Mingfei Date: Tue Nov 18 22:16:36 2014 -0800 [SPARK-4441] Close Tachyon client when TachyonBlockManager is shutdown Currently the Tachyon client is not closed when TachyonBlockManager is shut down, which causes some resources in Tachyon not to be reclaimed. Author: Mingfei Closes #3299 from shimingfei/closeClient and squashes the following commits: 0913fbd [Mingfei] close Tachyon client when TachyonBlockManager is shutdown (cherry picked from commit 67e9876b3e457b151c123fdb5ac2d8e8371e6acf) Signed-off-by: Patrick Wendell
commit 790c8741e70032b2852125ea509f7dc85e9faea8 Author: Cheng Lian Date: Tue Nov 18 17:41:54 2014 -0800 [SPARK-4468][SQL] Fixes Parquet filter creation for inequality predicates with literals on the left hand side For expressions like `10 < someVar`, we should create an `Operators.Gt` filter, but right now an `Operators.Lt` is created. This issue affects all inequality predicates with literals on the left hand side. (This bug existed before #3317 and affects branch-1.1. #3338 was opened to backport this to branch-1.1.) [Review on Reviewable](https://reviewable.io/reviews/apache/spark/3334) Author: Cheng Lian Closes #3334 from liancheng/fix-parquet-comp-filter and squashes the following commits: 0130897 [Cheng Lian] Fixes Parquet comparison filter generation (cherry picked from commit 423baea953996a66dde671ff6db2fb1f32fbe8cb) Signed-off-by: Michael Armbrust
commit 70d9e3871f852ec9e8bfaa436bc02bc22fc62dfd Author: Davies Liu Date: Tue Nov 18 16:37:35 2014 -0800 [SPARK-4327] [PySpark] Python API for RDD.randomSplit()
```
pyspark.RDD.randomSplit(self, weights, seed=None)
    Randomly splits this RDD with the provided weights.

    :param weights: weights for splits, will be normalized if they don't sum to 1
    :param seed: random seed
    :return: split RDDs in a list

>>> rdd = sc.parallelize(range(10), 1)
>>> rdd1, rdd2, rdd3 = rdd.randomSplit([0.4, 0.6, 1.0], 11)
>>> rdd1.collect()
[3, 6]
>>> rdd2.collect()
[0, 5, 7]
>>> rdd3.collect()
[1, 2, 4, 8, 9]
```
Author: Davies Liu Closes #3193 from davies/randomSplit and squashes the following commits: 78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain f5fdf63 [Davies Liu] fix bug with int in weights 4dfa2cd [Davies Liu] refactor f866bcf [Davies Liu] remove unneeded change c7a2007 [Davies Liu] switch to python implementation 95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit 0d9b256 [Davies Liu] refactor 1715ee3 [Davies Liu] address comments 41fce54 [Davies Liu] randomSplit() (cherry picked from commit 7f22fa81ebd5e501fcb0e1da5506d1d4fb9250cf) Signed-off-by: Xiangrui Meng
commit bf76164f1090892544983f753d4b7b16903a6135 Author: Xiangrui Meng Date: Tue Nov 18 16:25:44 2014 -0800 [SPARK-4433] fix a race condition in zipWithIndex Spark hangs with the following code:
~~~
sc.parallelize(1 to 10).zipWithIndex.repartition(10).count()
~~~
This is because ZippedWithIndexRDD triggers a job in getPartitions, and it causes a deadlock in DAGScheduler.getPreferredLocs (synced). The fix is to compute `startIndices` during construction. This should be applied to branch-1.0, branch-1.1, and branch-1.2. pwendell Author: Xiangrui Meng Closes #3291 from mengxr/SPARK-4433 and squashes the following commits: c284d9f [Xiangrui Meng] fix a racing condition in zipWithIndex (cherry picked from commit bb46046154a438df4db30a0e1fd557bd3399ee7b) Signed-off-by: Xiangrui Meng
commit bb7a173d95094b63981724c381f68a885e514cd4 Author: Davies Liu Date: Tue Nov 18 16:17:51 2014 -0800 [SPARK-3721] [PySpark] broadcast objects larger than 2G This patch will bring support for broadcasting objects larger than 2G. pickle, zlib, FrameSerializer, and Array[Byte] all cannot support objects larger than 2G, so this patch introduces LargeObjectSerializer to serialize broadcast objects; the object is serialized and compressed into small chunks. It also changes the type Broadcast[Array[Byte]] into Broadcast[Array[Array[Byte]]]. Testing broadcasts of objects larger than 2G is slow and memory hungry, so this was tested manually; it could be added to SparkPerf. Author: Davies Liu Author: Davies Liu Closes #2659 from davies/huge and squashes the following commits: 7b57a14 [Davies Liu] add more tests for broadcast 28acff9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge a2f6a02 [Davies Liu] bug fix 4820613 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge 5875c73 [Davies Liu] address comments 10a349b [Davies Liu] address comments 0c33016 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge 6182c8f [Davies Liu] Merge branch 'master' into huge d94b68f [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge 2514848 [Davies Liu] address comments fda395b [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge 1c2d928 [Davies Liu] fix scala style 091b107 [Davies Liu] broadcast objects larger than 2G (cherry picked from commit 4a377aff2d36b64a65b54192a987aba44b8f78e0) Signed-off-by: Josh Rosen
commit 4ae78abe66e593ac8bf9de37eca80413730c431b Author: Davies Liu Date: Tue Nov 18 15:57:33 2014 -0800 [SPARK-4306] [MLlib] Python API for LogisticRegressionWithLBFGS
```
class LogisticRegressionWithLBFGS
| train(cls, data, iterations=100, initialWeights=None, corrections=10, tolerance=0.0001, regParam=0.01, intercept=False)
| Train a logistic regression model on the given data.
|
| :param data: The training data, an RDD of LabeledPoint.
| :param iterations: The number of iterations (default: 100).
| :param initialWeights: The initial weights (default: None).
| :param regParam: The regularizer parameter (default: 0.01).
| :param regType: The type of regularizer used for training our model.
| :Allowed values:
| - "l1" for using L1 regularization
| - "l2" for using L2 regularization
| - None for no regularization
| (default: "l2")
| :param intercept: Boolean parameter which indicates the use or not of the augmented representation for training data (i.e. whether bias features are activated or not).
| :param corrections: The number of corrections used in the LBFGS update (default: 10).
| :param tolerance: The convergence tolerance of iterations for L-BFGS (default: 1e-4).
|
| >>> data = [
| ...     LabeledPoint(0.0, [0.0, 1.0]),
| ...     LabeledPoint(1.0, [1.0, 0.0]),
| ... ]
| >>> lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(data))
| >>> lrm.predict([1.0, 0.0])
| 1
| >>> lrm.predict([0.0, 1.0])
| 0
| >>> lrm.predict(sc.parallelize([[1.0, 0.0], [0.0, 1.0]])).collect()
| [1, 0]
```
Author: Davies Liu Closes #3307 from davies/lbfgs and squashes the following commits: 34bd986 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into lbfgs 5a945a6 [Davies Liu] address comments 941061b [Davies Liu] Merge branch 'master' of github.com:apache/spark into lbfgs 03e5543 [Davies Liu] add it to docs ed2f9a8 [Davies Liu] add regType 76cd1b6 [Davies Liu] reorder arguments 4429a74 [Davies Liu] Update classification.py 9252783 [Davies Liu] python api for LogisticRegressionWithLBFGS (cherry picked from commit d2e29516f2064f93f3a9070c91fc7460706e0b0a) Signed-off-by: Xiangrui Meng
commit a93d64c8c677f7121599b21883e1671e1226ec0b Author: Kay Ousterhout Date: Tue Nov 18 15:01:06 2014 -0800 [SPARK-4463] Add (de)select all button for add'l metrics. This commit removes the behavior where, when a user clicks "Show additional metrics" on the stage page, all of the additional metrics are automatically selected; now, collapsing and expanding the additional metrics has no effect on which options are selected. Instead, there's a "(De)select All" box at the top; checking this box checks all additional metrics (and similarly, unchecking it unchecks all additional metrics). This commit is intended to be backported to 1.2, so that the additional metrics behavior is not confusing to users. Now when a user clicks the "Show additional metrics" menu, this is what it looks like: ![image](https://cloud.githubusercontent.com/assets/1108612/5094347/1541ead6-6f15-11e4-8e8c-25a65ddbdfb2.png) Author: Kay Ousterhout Closes #3331 from kayousterhout/SPARK-4463 and squashes the following commits: 9e17cea [Kay Ousterhout] Added italics b731230 [Kay Ousterhout] [SPARK-4463] Add (de)select all button for add'l metrics. (cherry picked from commit 010bc86e40a0e54b6850b75abd6105e70eb1af10) Signed-off-by: Andrew Or
commit 04b1bdbae31c3039125100e703121daf7d9dabf5 Author: Davies Liu Date: Tue Nov 18 13:37:21 2014 -0800 [SPARK-4017] show progress bar in console The progress bar will look like this: ![1___spark_job__85_250_finished__4_are_running___java_](https://cloud.githubusercontent.com/assets/40902/4854813/a02f44ac-6099-11e4-9060-7c73a73151d6.png) In the right corner, the numbers are: finished tasks, running tasks, total tasks. After the stage has finished, it will disappear. The progress bar is only shown if the logging level is WARN or higher (progress in the title is still shown); it can be turned off by spark.driver.showConsoleProgress. Author: Davies Liu Closes #3029 from davies/progress and squashes the following commits: 95336d5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress fc49ac8 [Davies Liu] address commentse 2e90f75 [Davies Liu] show multiple stages in same time 0081bcc [Davies Liu] address comments 38c42f1 [Davies Liu] fix tests ab87958 [Davies Liu] disable progress bar during tests 30ac852 [Davies Liu] re-implement progress bar b3f34e5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress 6fd30ff [Davies Liu] show progress bar if no task finished in 500ms e4e7344 [Davies Liu] refactor e1f524d [Davies Liu] revert unnecessary change a60477c [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress 5cae3f2 [Davies Liu] fix style ea49fe0 [Davies Liu] address comments bc53d99 [Davies Liu] refactor e6bb189 [Davies Liu] fix logging in sparkshell 7e7d4e7 [Davies Liu] address commments 5df26bb [Davies Liu] fix style 9e42208 [Davies Liu] show progress bar in console and title (cherry picked from commit e34f38ff1a0dfbb0ffa4bd11071e03b1a58de998) Signed-off-by: Patrick Wendell
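A hedged sketch of turning the bar off, using the property name as given in the commit message above (the property name in released Spark may differ):

```scala
import org.apache.spark.SparkConf

// SPARK-4017 above: opt out of the console progress bar.
val conf = new SparkConf()
  .set("spark.driver.showConsoleProgress", "false") // name per the commit message
```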
commit 2d26c6248240978c6f69bf765113ced50cc70043 Author: Davies Liu Date: Tue Nov 18 13:11:38 2014 -0800 [SPARK-4404] remove sys.exit() in shutdown hook If SparkSubmit dies first, the bootstrapper will be blocked by the shutdown hook; sys.exit() in a shutdown hook can cause a kind of deadlock. cc andrewor14 Author: Davies Liu Closes #3289 from davies/fix_bootstraper and squashes the following commits: ea5cdd1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_bootstraper e04b690 [Davies Liu] remove sys.exit in hook 4d11366 [Davies Liu] remove shutdown hook if subprocess die fist (cherry picked from commit 80f31778820586a93d73fa15279a204611cc3c60) Signed-off-by: Andrew Or
commit 9e91118455acc074635822d55738866f6cfa7715 Author: Kousuke Saruta Date: Tue Nov 18 12:17:33 2014 -0800 [SPARK-4075][SPARK-4434] Fix the URI validation logic for Application Jar name. This PR adds a regression test for SPARK-4434.
Author: Kousuke Saruta Closes #3326 from sarutak/add-triple-slash-testcase and squashes the following commits: 82bc9cc [Kousuke Saruta] Fixed wrong grammar in comment 9149027 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into add-triple-slash-testcase c1c80ca [Kousuke Saruta] Fixed style 4f30210 [Kousuke Saruta] Modified comments 9e09da2 [Kousuke Saruta] Fixed URI validation for jar file d4b99ef [Kousuke Saruta] [SPARK-4075] [Deploy] Jar url validation is not enough for Jar file ac79906 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into add-triple-slash-testcase 6d4f47e [Kousuke Saruta] Added a test case as a regression check for SPARK-4434 (cherry picked from commit bfebfd8b28eeb7e75292333f7885aa0830fcb5fe) Signed-off-by: Andrew Or
commit 047b45800654b4b2605d1347cace28faaa25f521 Author: Michael Armbrust Date: Tue Nov 18 12:13:23 2014 -0800 [SQL] Support partitioned parquet tables that have the key in both the directory and the file Author: Michael Armbrust Closes #3272 from marmbrus/keyInPartitionedTable and squashes the following commits: 447f08c [Michael Armbrust] Support partitioned parquet tables that have the key in both the directory and the file (cherry picked from commit 90d72ec8502f7ec11d2fe42f08c884ad2159266f) Signed-off-by: Michael Armbrust
commit 48d601f0bac33583c345b2ceebd30a639a20db4e Author: Xiangrui Meng Date: Tue Nov 18 10:35:29 2014 -0800 [SPARK-4396] allow lookup by index in Python's Rating In PySpark, ALS can take an RDD of (user, product, rating) tuples as input. However, model.predict outputs an RDD of Rating. So on the input side, users can use r[0], r[1], r[2], while on the output side, users have to use r.user, r.product, r.rating. We should allow lookup by index in Rating by making Rating a namedtuple. davies [Review on Reviewable](https://reviewable.io/reviews/apache/spark/3261) Author: Xiangrui Meng Closes #3261 from mengxr/SPARK-4396 and squashes the following commits: 543aef0 [Xiangrui Meng] use named tuple to implement ALS 0b61bae [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4396 d3bd7d4 [Xiangrui Meng] allow lookup by index in Python's Rating (cherry picked from commit b54c6ab3c54e65238d6766832ea1f3fcd694f2fd) Signed-off-by: Xiangrui Meng
commit a28902f25fc2a685c4a5663e976c1d735265ecb0 Author: Davies Liu Date: Tue Nov 18 10:11:13 2014 -0800 [SPARK-4435] [MLlib] [PySpark] improve classification This PR adds setThreshold() and clearThreshold() to LogisticRegressionModel and SVMModel, and also supports RDDs of vectors in LogisticRegressionModel.predict(), SVMModel.predict() and NaiveBayes.predict(). Author: Davies Liu Closes #3305 from davies/setThreshold and squashes the following commits: d0b835f [Davies Liu] Merge branch 'master' of github.com:apache/spark into setThreshold e4acd76 [Davies Liu] address comments 2231a5f [Davies Liu] bugfix 7bd9009 [Davies Liu] address comments 0b0a8a7 [Davies Liu] address comments c1e5573 [Davies Liu] improve classification (cherry picked from commit 8fbf72b7903b5bbec8d949151aa4693b4af26ff5) Signed-off-by: Xiangrui Meng
commit 4f0477d6f94c85c4777a2f5d587faa539780cded Author: Felix Maximilian Möller Date: Tue Nov 18 10:08:24 2014 -0800 ALS implicit: added missing parameter alpha in doc string Author: Felix Maximilian Möller Closes #3343 from felixmaximilian/fix-documentation and squashes the following commits: 43dcdfb [Felix Maximilian Möller] Removed the information about the switch implicitPrefs.
The parameter implicitPrefs cannot be set in this context because it is inherently true when calling the trainImplicit method. 7d172ba [Felix Maximilian Möller] added missing parameter alpha in doc string. (cherry picked from commit cedc3b5aa43a16e2da62f12a36317f00aa1002cc) Signed-off-by: Xiangrui Meng
commit 2d3a5a50446483a75496866fa3e5d037e9be2ee7 Author: Patrick Wendell Date: Mon Nov 17 21:07:50 2014 -0800 SPARK-4466: Provide support for publishing Scala 2.11 artifacts to Maven The maven release plug-in does not have support for publishing two separate sets of artifacts for a single release. Because of the way that Scala 2.11 support in Spark works, we have to write some customized code to do this. The good news is that the Maven release API is just a thin wrapper around doing git commits and pushing artifacts to the HTTP API of Apache's Sonatype server, and this might overall make our deployment easier to understand. This was already used for the 1.2 snapshot, so I think it is working well. One other nice thing is this could be pretty easily extended to publish nightly snapshots. Author: Patrick Wendell Closes #3332 from pwendell/releases and squashes the following commits: 2fedaed [Patrick Wendell] Automate the opening and closing of Sonatype repos e2a24bb [Patrick Wendell] Fixing issue where we overrode non-spark version numbers 9df3a50 [Patrick Wendell] Adding TODO 1cc1749 [Patrick Wendell] Don't build the thriftserver for 2.11 933201a [Patrick Wendell] Make tagging of release commit eager d0388a6 [Patrick Wendell] Support Scala 2.11 build 4f4dc62 [Patrick Wendell] Change to 2.11 should not be included when committing new patch bf742e1 [Patrick Wendell] Minor fixes ffa1df2 [Patrick Wendell] Adding a Scala 2.11 package to test it 9ac4381 [Patrick Wendell] Addressing TODO b3105ff [Patrick Wendell] Removing commented out code d906803 [Patrick Wendell] Small fix 3f4d985 [Patrick Wendell] More work fcd54c2 [Patrick Wendell] Consolidating use of keys df2af30 [Patrick Wendell] Changes to release stuff (cherry picked from commit c6e0c2ab1c29c184a9302d23ad75e4ccd8060242) Signed-off-by: Patrick Wendell
commit 0458b80547f05b92a02891729aa1ef00be06957f Author: Cheng Lian Date: Mon Nov 17 16:55:12 2014 -0800 [SPARK-4453][SPARK-4213][SQL] Simplifies Parquet filter generation code While reviewing PR #3083 and #3161, I noticed that Parquet record filter generation code can be simplified significantly according to the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4213). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification. While generating the `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generate the record filter, and then call `findExpression` to traverse the generated filter to find out all pushed down predicates [[1](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L213-L228)]. In this way, we have to introduce the `CatalystFilter` class hierarchy to bind the Catalyst predicates together with their generated Parquet filter, complicating the code base a lot. The basic idea of this PR is that we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type. [Review on Reviewable](https://reviewable.io/reviews/apache/spark/3317) Author: Cheng Lian Closes #3317 from liancheng/simplify-parquet-filters and squashes the following commits: d6a9499 [Cheng Lian] Fixes import styling issue 43760e8 [Cheng Lian] Simplifies Parquet filter generation logic (cherry picked from commit 36b0956a3eadc7343ed0d25c79a6ce0496eaaccd) Signed-off-by: Michael Armbrust
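A toy model of the design described above, using made-up types rather than Spark's actual Catalyst and Parquet classes: a predicate counts as pushed down exactly when filter generation succeeds.

```scala
// Toy predicate ADT standing in for Catalyst expressions.
sealed trait Pred
case class GreaterThan(column: String, value: Int) extends Pred
case class StartsWith(column: String, prefix: String) extends Pred // unsupported below

// Returning None for unsupported predicates removes the need to re-discover
// pushed-down expressions after the fact (the old findExpression step).
def toParquetFilter(p: Pred): Option[String] = p match {
  case GreaterThan(c, v) => Some(s"gt($c, $v)")
  case _                 => None
}

val predicates = Seq(GreaterThan("age", 21), StartsWith("name", "A"))
val (pushed, remaining) = predicates.partition(toParquetFilter(_).isDefined)
// pushed -> handed to Parquet; remaining -> evaluated by Spark.
```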
commit 68e1ce1aa4ca4db224c94122c9b0157426285ff9 Author: Cheng Hao Date: Mon Nov 17 16:35:49 2014 -0800 [SPARK-4448] [SQL] unwrap for the ConstantObjectInspector Author: Cheng Hao Closes #3308 from chenghao-intel/unwrap_constant_oi and squashes the following commits: 156b500 [Cheng Hao] rebase the master c5b20ab [Cheng Hao] unwrap for the ConstantObjectInspector (cherry picked from commit ef7c464effa1510b24bd8e665e4df6c4839b0c87) Signed-off-by: Michael Armbrust
commit 060d62194b15de3fe7aa15d053794103def13405 Author: w00228970 Date: Mon Nov 17 16:33:50 2014 -0800 [SPARK-4443][SQL] Fix statistics for external table in spark sql hive The `totalSize` of an external table is always zero, which influences the join strategy (a broadcast join is always used for external tables). Author: w00228970 Closes #3304 from scwf/statistics and squashes the following commits: 568f321 [w00228970] fix statistics for external table (cherry picked from commit 42389b1780311d90499b4ce2315ceabf5b6ab384) Signed-off-by: Michael Armbrust
commit ff2fe56004209ffe8eb150a56cbd5dccfb8d774b Author: Cheng Lian Date: Mon Nov 17 16:31:05 2014 -0800 [SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types This PR is exactly the same as #3178 except it reverts the `FileStatus.isDir` to `FileStatus.isDirectory` change, since it doesn't compile with Hadoop 1. [Review on Reviewable](https://reviewable.io/reviews/apache/spark/3298) Author: Cheng Lian Closes #3298 from liancheng/date-for-thriftserver and squashes the following commits: 866037e [Cheng Lian] Revers isDirectory to isDir (it breaks Hadoop 1 profile) 6f71d0b [Cheng Lian] Makes toHiveString static 26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim a92882a [Cheng Lian] Updates HiveShim for 0.13.1 73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0) (cherry picked from commit 6b7f2f753d16ff038881772f1958e3f4fd5597a7) Signed-off-by: Michael Armbrust
commit 7d0442652ed090783af6f2614c37a9522c46dc95 Author: Cheng Hao Date: Mon Nov 17 16:29:52 2014 -0800 [SQL] Construct the MutableRow from an Array Author: Cheng Hao Closes #3217 from chenghao-intel/mutablerow and squashes the following commits: e8a10bd [Cheng Hao] revert the change of Row object 4681aea [Cheng Hao] Add toMutableRow method in object Row a751838 [Cheng Hao] Construct the MutableRow from an existed row (cherry picked from commit 69e858cc7748b6babadd0cbe20e65f3982161cbf) Signed-off-by: Michael Armbrust
commit 1a650e7d863b72025625c3140b038ab12ec86eca Author: Takuya UESHIN Date: Mon Nov 17 16:28:07 2014 -0800 [SPARK-4425][SQL] Handle NaN or Infinity cast to Timestamp correctly. `Cast` from `NaN` or `Infinity` of `Double` or `Float` to `TimestampType` throws `NumberFormatException`. Author: Takuya UESHIN Closes #3283 from ueshin/issues/SPARK-4425 and squashes the following commits: 14def0c [Takuya UESHIN] Fix Cast to be able to handle NaN or Infinity to TimestampType.
(cherry picked from commit 566c791931645bfaaaf57ee5a15b9ffad534f81e) Signed-off-by: Michael Armbrust
commit 1ca39b723fa1d9c3d3525f1e32e0a19770563d4e Author: Takuya UESHIN Date: Mon Nov 17 16:26:48 2014 -0800 [SPARK-4420][SQL] Change nullability of Cast from DoubleType/FloatType to DecimalType. This is a follow-up of [SPARK-4390](https://issues.apache.org/jira/browse/SPARK-4390) (#3256). Author: Takuya UESHIN Closes #3278 from ueshin/issues/SPARK-4420 and squashes the following commits: 7fea558 [Takuya UESHIN] Add some tests. cb2301a [Takuya UESHIN] Fix tests. 133bad5 [Takuya UESHIN] Change nullability of Cast from DoubleType/FloatType to DecimalType. (cherry picked from commit 3a81a1c9e0963173534d96850f3c0b7a16350838) Signed-off-by: Michael Armbrust
commit eb9c5bae78e4123cd7d1dfa3758d0880df90ed14 Author: Cheng Lian Date: Mon Nov 17 15:33:13 2014 -0800 [SQL] Makes conjunction pushdown more aggressive for in-memory table This is inspired by the [Parquet record filter generation code](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetFilters.scala#L387-L400). [Review on Reviewable](https://reviewable.io/reviews/apache/spark/3318) Author: Cheng Lian Closes #3318 from liancheng/aggresive-conj-pushdown and squashes the following commits: 78b69d2 [Cheng Lian] Makes conjunction pushdown more aggressive (cherry picked from commit 5ce7dae859dc273b0fc532c9456b5960b1eca399) Signed-off-by: Michael Armbrust
commit 7162d85fad47f8967985f19e86362e9991128944 Author: Josh Rosen Date: Mon Nov 17 12:48:18 2014 -0800 [SPARK-4180] [Core] Prevent creation of multiple active SparkContexts This patch adds error-detection logic to throw an exception when attempting to create multiple active SparkContexts in the same JVM, since this is currently unsupported and has been known to cause confusing behavior (see SPARK-2243 for more details). **The solution implemented here is only a partial fix.** A complete fix would have the following properties:
1. Only one SparkContext may ever be under construction at any given time.
2. Once a SparkContext has been successfully constructed, any subsequent construction attempts should fail until the active SparkContext is stopped.
3. If the SparkContext constructor throws an exception, then all resources created in the constructor should be cleaned up (SPARK-4194).
4. If a user attempts to create a SparkContext but the creation fails, then the user should be able to create new SparkContexts.
This PR only provides 2) and 4); we should be able to provide all of these properties, but the correct fix will involve larger changes to SparkContext's construction / initialization, so we'll target it for a different Spark release. ### The correct solution: I think that the correct way to do this would be to move the construction of SparkContext's dependencies into a static method in the SparkContext companion object. Specifically, we could make the default SparkContext constructor `private` and change it to accept a `SparkContextDependencies` object that contains all of SparkContext's dependencies (e.g. DAGScheduler, ContextCleaner, etc.). Secondary constructors could call a method on the SparkContext companion object to create the `SparkContextDependencies` and pass the result to the primary SparkContext constructor.
For example:
```scala
class SparkContext private (deps: SparkContextDependencies) {
  def this(conf: SparkConf) {
    this(SparkContext.getDeps(conf))
  }
}

object SparkContext {
  private[spark] def getDeps(conf: SparkConf): SparkContextDependencies = synchronized {
    if (anotherSparkContextIsActive) { throw Exception(...) }
    var dagScheduler: DAGScheduler = null
    try {
      dagScheduler = new DAGScheduler(...)
      [...]
    } catch {
      case e: Exception =>
        Option(dagScheduler).foreach(_.stop())
        [...]
    }
    SparkContextDependencies(dagScheduler, ....)
  }
}
```
This gives us mutual exclusion and ensures that any resources created during the failed SparkContext initialization are properly cleaned up. This indirection is necessary to maintain binary compatibility. In retrospect, it would have been nice if SparkContext had no private constructors and could only be created through builder / factory methods on its companion object, since this buys us lots of flexibility and makes dependency injection easier. ### Alternative solutions: As an alternative solution, we could refactor SparkContext's primary constructor to perform all object creation in a giant `try-finally` block. Unfortunately, this will require us to turn a bunch of `vals` into `vars` so that they can be assigned from the `try` block. If we still want `vals`, we could wrap each `val` in its own `try` block (since the try block can return a value), but this will lead to extremely messy code and won't guard against the introduction of future code which doesn't properly handle failures. The more complex approach outlined above gives us some nice dependency injection benefits, so I think that might be preferable to a `var`-ification. ### This PR's solution:
- At the start of the constructor, check whether some other SparkContext is active; if so, throw an exception.
- If another SparkContext might be under construction (or has thrown an exception during construction), allow the new SparkContext to begin construction but log a warning (since resources might have been leaked from a failed creation attempt).
- At the end of the SparkContext constructor, check whether some other SparkContext constructor has raced and successfully created an active context. If so, throw an exception.
This guarantees that no two SparkContexts will ever be active and exposed to users (since we check at the very end of the constructor). If two threads race to construct SparkContexts, then one of them will win and another will throw an exception. This exception can be turned into a warning by setting `spark.driver.allowMultipleContexts = true`. The exception is disabled in unit tests, since there are some suites (such as Hive) that may require more significant refactoring to clean up their SparkContexts. I've made a few changes to other suites' test fixtures to properly clean up SparkContexts so that the unit test logs contain fewer warnings. Author: Josh Rosen Closes #3121 from JoshRosen/SPARK-4180 and squashes the following commits: 23c7123 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180 d38251b [Josh Rosen] Address latest round of feedback. c0987d3 [Josh Rosen] Accept boolean instead of SparkConf in methods. 85a424a [Josh Rosen] Incorporate more review feedback. 372d0d3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180 f5bb78c [Josh Rosen] Update mvn build, too. d809cb4 [Josh Rosen] Improve handling of failed SparkContext creation attempts. 79a7e6f [Josh Rosen] Fix commented out test a1cba65 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180 7ba6db8 [Josh Rosen] Add utility to set system properties in tests. 4629d5c [Josh Rosen] Set spark.driver.allowMultipleContexts=true in tests. ed17e14 [Josh Rosen] Address review feedback; expose hack workaround for existing unit tests. 1c66070 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180 06c5c54 [Josh Rosen] Add / improve SparkContext cleanup in streaming BasicOperationsSuite d0437eb [Josh Rosen] StreamingContext.stop() should stop SparkContext even if StreamingContext has not been started yet. c4d35a2 [Josh Rosen] Log long form of creation site to aid debugging. 918e878 [Josh Rosen] Document "one SparkContext per JVM" limitation. afaa7e3 [Josh Rosen] [SPARK-4180] Prevent creations of multiple active SparkContexts. (cherry picked from commit 0f3ceb56c78e7260725a09fba0e10aa193cbda4b) Signed-off-by: Patrick Wendell
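A hedged sketch of the escape hatch mentioned above for legacy suites:

```scala
import org.apache.spark.SparkConf

// SPARK-4180 above: demote the "multiple active SparkContexts" error to a
// warning, e.g. for test suites that have not yet been cleaned up.
val conf = new SparkConf()
  .set("spark.driver.allowMultipleContexts", "true")
```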
79a7e6f [Josh Rosen] Fix commented out test a1cba65 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180 7ba6db8 [Josh Rosen] Add utility to set system properties in tests. 4629d5c [Josh Rosen] Set spark.driver.allowMultipleContexts=true in tests. ed17e14 [Josh Rosen] Address review feedback; expose hack workaround for existing unit tests. 1c66070 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180 06c5c54 [Josh Rosen] Add / improve SparkContext cleanup in streaming BasicOperationsSuite d0437eb [Josh Rosen] StreamingContext.stop() should stop SparkContext even if StreamingContext has not been started yet. c4d35a2 [Josh Rosen] Log long form of creation site to aid debugging. 918e878 [Josh Rosen] Document "one SparkContext per JVM" limitation. afaa7e3 [Josh Rosen] [SPARK-4180] Prevent creations of multiple active SparkContexts. (cherry picked from commit 0f3ceb56c78e7260725a09fba0e10aa193cbda4b) Signed-off-by: Patrick Wendell commit 202627fda6a48453c3ba853cf1361ef84ba47c63 Author: Andy Konwinski Date: Mon Nov 17 11:52:23 2014 -0800 [DOCS][SQL] Fix broken link to Row class scaladoc Author: Andy Konwinski Closes #3323 from andyk/patch-2 and squashes the following commits: 4699fdc [Andy Konwinski] Fix broken link to Row class scaladoc (cherry picked from commit cec1116b4b80c36b36a8a13338b948e4d6ade377) Signed-off-by: Michael Armbrust commit 84cc0a20978dedc0dcff62e1b48a4aabc0792787 Author: Patrick Wendell Date: Mon Nov 17 11:42:25 2014 -0800 HOTFIX: Error in release script that updates wrong version commit 6cc18e6e1945702638e2d41cef8fe2d97dfb1f16 Author: Andrew Or Date: Mon Nov 17 11:24:57 2014 -0800 Revert "[SPARK-4075] [Deploy] Jar url validation is not enough for Jar file" This reverts commit 098f83c7ccd7dad9f9228596da69fe5f55711a52. commit 98ad8a14483755e6931402db9712c86100339340 Author: Ankur Dave Date: Mon Nov 17 11:06:31 2014 -0800 [SPARK-4444] Drop VD type parameter from EdgeRDD Due to vertex attribute caching, EdgeRDD previously took two type parameters: ED and VD. However, this is an implementation detail that should not be exposed in the interface, so this PR drops the VD type parameter. This requires removing the `filter` method from the EdgeRDD interface, because it depends on vertex attribute caching. Author: Ankur Dave Closes #3303 from ankurdave/edgerdd-drop-tparam and squashes the following commits: 38dca9b [Ankur Dave] Leave EdgeRDD.fromEdges public fafeb51 [Ankur Dave] Drop VD type parameter from EdgeRDD (cherry picked from commit 9ac2bb18ede2e9f73c255fa33445af89aaf8a000) Signed-off-by: Reynold Xin commit e0ab1c4766e1af384213a853588f6e69acd3b780 Author: Adam Pingel Date: Mon Nov 17 10:47:29 2014 -0800 SPARK-2811 upgrade algebird to 0.8.1 Author: Adam Pingel Closes #3282 from adampingel/master and squashes the following commits: 70c8d3c [Adam Pingel] relocate the algebird example back to example/src 7a9d8be [Adam Pingel] SPARK-2811 upgrade algebird to 0.8.1 (cherry picked from commit e7690ed20a2734b7ca88e78a60a8e75ba19e9d8b) Signed-off-by: Patrick Wendell commit d9d36a53dfeb51e4e070803e26187d436fd1f747 Author: Prashant Sharma Date: Mon Nov 17 10:40:33 2014 -0800 SPARK-4445, Don't display storage level in toDebugString unless RDD is persisted. 
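A quick illustration of the intended behavior (a sketch; assumes an active `sc`, and the output comments are illustrative rather than exact):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100).map(_ * 2)
// Unpersisted: toDebugString should no longer print a storage level here.
println(rdd.toDebugString)

rdd.persist(StorageLevel.MEMORY_ONLY)
// Persisted: the storage level now shows up in the debug string.
println(rdd.toDebugString)
```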
Author: Prashant Sharma Closes #3310 from ScrapCodes/SPARK-4445/rddDebugStringFix and squashes the following commits: 4e57c52 [Prashant Sharma] SPARK-4445, Don't display storage level in toDebugString unless RDD is persisted (cherry picked from commit 5c92d47ad2e3414f2ae089cb47f3c6daccba8d90) Signed-off-by: Patrick Wendell commit d7ac6013483e83caff8ea54c228f37aeca159db8 Author: Ubuntu Date: Mon Nov 17 06:37:44 2014 +0000 Preparing development version 1.2.1-SNAPSHOT commit 38c1fbd9694430cefd962c90bc36b0d108c6124b Author: Ubuntu Date: Mon Nov 17 06:37:44 2014 +0000 Preparing Spark release v1.2.0-snapshot1 commit e1339daec59ff57cdcbccd9073e9dd5f0ac9d3df Author: Patrick Wendell Date: Sun Nov 16 22:13:40 2014 -0800 Revert "Preparing Spark release v1.2.0-snapshot0" This reverts commit bc09875799aa373f4320d38b02618173ffa4c96f. commit c3fd9aef99134f3f649285c5f013f81b3e8e697e Author: Patrick Wendell Date: Sun Nov 16 22:13:29 2014 -0800 Revert "Preparing development version 1.2.1-SNAPSHOT" This reverts commit 6c6fd218c83a049c874b8a0ea737333c1899c94a. commit 8305e803e23808507b68fa9a6876ee455e58ac27 Author: Michael Armbrust Date: Sun Nov 16 21:55:57 2014 -0800 [SPARK-4410][SQL] Add support for external sort Adds a new operator that uses Spark's `ExternalSort` class. It is off by default now, but we might consider making it the default if benchmarks show that it does not regress performance. Author: Michael Armbrust Closes #3268 from marmbrus/externalSort and squashes the following commits: 48b9726 [Michael Armbrust] comments b98799d [Michael Armbrust] Add test afd7562 [Michael Armbrust] Add support for external sort. (cherry picked from commit 64c6b9bad559c21f25cd9fbe37c8813cdab939f2) Signed-off-by: Reynold Xin commit f3b93c1bac292fccb05bf16d1da4b1863b3031fd Author: GuoQiang Li Date: Sun Nov 16 21:31:51 2014 -0800 [SPARK-4422][MLLIB]In some cases, Vectors.fromBreeze get wrong results. cc mengxr Author: GuoQiang Li Closes #3281 from witgo/SPARK-4422 and squashes the following commits: 5f1fa5e [GuoQiang Li] import order 50783bd [GuoQiang Li] review commits 7a10123 [GuoQiang Li] In some cases, Vectors.fromBreeze get wrong results. (cherry picked from commit 5168c6ca9f0008027d688661bae57c28cf386b54) Signed-off-by: Xiangrui Meng commit 6c6fd218c83a049c874b8a0ea737333c1899c94a Author: Ubuntu Date: Mon Nov 17 03:09:19 2014 +0000 Preparing development version 1.2.1-SNAPSHOT commit bc09875799aa373f4320d38b02618173ffa4c96f Author: Ubuntu Date: Mon Nov 17 02:10:59 2014 +0000 Preparing Spark release v1.2.0-snapshot0 commit 70d0371683a56059a7b4c4ebdab6e2fe055b9a76 Author: Michael Armbrust Date: Sun Nov 16 15:05:04 2014 -0800 Revert "[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types" Author: Michael Armbrust Closes #3292 from marmbrus/revert4309 and squashes the following commits: 808e96e [Michael Armbrust] Revert "[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types" (cherry picked from commit 45ce3273cb618d14ec4d20c4c95699634b951086) Signed-off-by: Michael Armbrust commit 8b83a34fa310f4e6802c5ef32dcc737f6fb4903f Author: Cheng Lian Date: Sun Nov 16 14:26:41 2014 -0800 [SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types SPARK-4407 was detected while working on SPARK-4309. Merged these two into a single PR since 1.2.0 RC is approaching. 
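A hedged sketch of what this enables from a JDBC client (host, port, and credentials are placeholders; assumes the Hive JDBC driver is on the classpath):

```scala
import java.sql.DriverManager

// Connect to HiveThriftServer2 and read a DATE value end to end.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
try {
  val rs = conn.createStatement().executeQuery("SELECT CAST('2014-11-16' AS DATE)")
  while (rs.next()) {
    println(rs.getDate(1)) // comes back as java.sql.Date instead of failing to deserialize
  }
} finally {
  conn.close()
}
```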
[Review on Reviewable](https://reviewable.io/reviews/apache/spark/3178) Author: Cheng Lian Closes #3178 from liancheng/date-for-thriftserver and squashes the following commits: 6f71d0b [Cheng Lian] Makes toHiveString static 26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim a92882a [Cheng Lian] Updates HiveShim for 0.13.1 73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0) (cherry picked from commit cb6bd83a91d9b4a227dc6467255231869c1820e2) Signed-off-by: Michael Armbrust commit 2200de6352fdc1000908554003912303edc3d160 Author: Josh Rosen Date: Sun Nov 16 00:44:15 2014 -0800 [SPARK-4393] Fix memory leak in ConnectionManager ACK timeout TimerTasks; use HashedWheelTimer This patch is intended to fix a subtle memory leak in ConnectionManager's ACK timeout TimerTasks: in the old code, each TimerTask held a reference to the message being sent, and a cancelled TimerTask won't necessarily be garbage-collected until it's scheduled to run, so this caused huge buildups of messages that weren't garbage collected until their timeouts expired, leading to OOMs. This patch addresses this problem by capturing only the message ID in the TimerTask instead of the whole message, and by keeping a WeakReference to the promise in the TimerTask. I've also modified this code to use Netty's HashedWheelTimer, whose performance characteristics should be better for this use-case. Thanks to cristianopris for narrowing down this issue! Author: Josh Rosen Closes #3259 from JoshRosen/connection-manager-timeout-bugfix and squashes the following commits: afcc8d6 [Josh Rosen] Address rxin's review feedback. 2a2e92d [Josh Rosen] Keep only WeakReference to promise in TimerTask; 0f0913b [Josh Rosen] Spelling fix: timout => timeout 3200c33 [Josh Rosen] Use Netty HashedWheelTimer f847dd4 [Josh Rosen] Don't capture entire message in ACK timeout task. (cherry picked from commit 7850e0c707affd5eafd570fb43716753396cf479) Signed-off-by: Reynold Xin commit 24287014f6ae7fe6f4b3090d24fe42d8c70f1084 Author: Kousuke Saruta Date: Sat Nov 15 22:23:47 2014 -0800 [SPARK-4426][SQL][Minor] The symbol of BitwiseOr is wrong, should not be '&' The symbol of BitwiseOr is defined as '&' but I think it's wrong. It should be '|'. Author: Kousuke Saruta Closes #3284 from sarutak/bitwise-or-symbol-fix and squashes the following commits: aff4be5 [Kousuke Saruta] Fixed symbol of BitwiseOr (cherry picked from commit 84468b2e2031d646dcf035cb18947170ba326ccd) Signed-off-by: Reynold Xin commit 06c29bccb68f116effe85e3ab1f605a2dfa36a31 Author: Josh Rosen Date: Sat Nov 15 22:22:34 2014 -0800 [SPARK-4419] Upgrade snappy-java to 1.1.1.6 This upgrades snappy-java to 1.1.1.6, which includes a patch that improves error messages when attempting to deserialize empty inputs using SnappyInputStream (see xerial/snappy-java#89). We previously tried to upgrade to 1.1.1.5 in #2911 but reverted that patch after discovering a memory leak in snappy-java. This leak should have been fixed in 1.1.1.6, though (see xerial/snappy-java#92). Author: Josh Rosen Closes #3287 from JoshRosen/SPARK-4419 and squashes the following commits: 5d6f4cc [Josh Rosen] [SPARK-4419] Upgrade snappy-java to 1.1.1.6. (cherry picked from commit 7d8e152eecc7e822b7b1e40b791267a8911e01cf) Signed-off-by: Reynold Xin commit 9eac5fee64def9a18d8961069f631a176f339a5b Author: Josh Rosen Date: Fri Nov 14 23:46:25 2014 -0800 [SPARK-2321] Several progress API improvements / refactorings This PR refactors / extends the status API introduced in #2696.
- Change StatusAPI from a mixin trait to a class. Before, the new status API methods were directly accessible through SparkContext, whereas now they're accessed through a `sc.statusAPI` field. As long as we were going to add these methods directly to SparkContext, the mixin trait seemed like a good idea, but this might be simpler to reason about and may avoid pitfalls that I've run into while attempting to refactor other parts of SparkContext to use mixins (see #3071, for example). - Change the name from SparkStatusAPI to SparkStatusTracker. - Make `getJobIdsForGroup(null)` return ids for jobs that aren't associated with any job group. - Add `getActiveStageIds()` and `getActiveJobIds()` methods that return the ids of whatever's currently active in this SparkContext. This should simplify davies's progress bar code. Author: Josh Rosen Closes #3197 from JoshRosen/progress-api-improvements and squashes the following commits: 30b0afa [Josh Rosen] Rename SparkStatusAPI to SparkStatusTracker. d1b08d8 [Josh Rosen] Add missing newlines 2cc7353 [Josh Rosen] Add missing file. d5eab1f [Josh Rosen] Add getActive[Stage|Job]Ids() methods. a227984 [Josh Rosen] getJobIdsForGroup(null) should return jobs for default group c47e294 [Josh Rosen] Remove StatusAPI mixin trait. (cherry picked from commit 40eb8b6ef3a67e36d0d9492c044981a1da76351d) Signed-off-by: Reynold Xin commit c044e124115cc8e9ffb44d12c2744f33362f366f Author: kai Date: Fri Nov 14 23:44:23 2014 -0800 Added contains(key) to Metadata Add contains(key) to org.apache.spark.sql.catalyst.util.Metadata to test the existence of a key. Otherwise, Class Metadata's get methods may throw NoSuchElement exception if the key does not exist. Testcases are added to MetadataSuite as well. Author: kai Closes #3273 from kai-zeng/metadata-fix and squashes the following commits: 74b3d03 [kai] Added contains(key) to Metadata (cherry picked from commit cbddac23696d89b672dce380cc7360a873e27b3b) Signed-off-by: Reynold Xin commit 37716b7953cd737564d5f5ffd5bac7619f94a278 Author: Kousuke Saruta Date: Fri Nov 14 22:36:56 2014 -0800 [SPARK-4260] Httpbroadcast should set connection timeout. Httpbroadcast sets read timeout but doesn't set connection timeout. Author: Kousuke Saruta Closes #3122 from sarutak/httpbroadcast-timeout and squashes the following commits: c7f3a56 [Kousuke Saruta] Added Connection timeout for Http Connection to HttpBroadcast.scala (cherry picked from commit 60969b0336930449a826821a48f83f65337e8856) Signed-off-by: Reynold Xin commit 29a6da37257d8a165967392af6f452a404e445cd Author: zsxwing Date: Fri Nov 14 22:28:48 2014 -0800 [SPARK-4363][Doc] Update the Broadcast example Author: zsxwing Closes #3226 from zsxwing/SPARK-4363 and squashes the following commits: 8109914 [zsxwing] Update the Broadcast example (cherry picked from commit 861223ee5bea8e434a9ebb0d53f436ce23809f9c) Signed-off-by: Reynold Xin commit e27fa40ed16c1b1d04911e0bdd803a4d43eb9a10 Author: zsxwing Date: Fri Nov 14 22:25:41 2014 -0800 [SPARK-4379][Core] Change Exception to SparkException in checkpoint It's better to change to SparkException. However, it's a breaking change since it will change the exception type. 
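A minimal sketch of the kind of change involved (the method name and message here are illustrative, not the actual checkpoint code):

```scala
import org.apache.spark.SparkException

def verifyCheckpointPartitions(original: Int, checkpointed: Int): Unit = {
  if (original != checkpointed) {
    // Previously a plain java.lang.Exception; a SparkException lets callers
    // catch Spark-specific failures without swallowing everything else.
    throw new SparkException(
      s"Checkpoint RDD has a different number of partitions ($checkpointed) " +
        s"than the original RDD ($original)")
  }
}
```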
Author: zsxwing Closes #3241 from zsxwing/SPARK-4379 and squashes the following commits: 409f3af [zsxwing] Change Exception to SparkException in checkpoint (cherry picked from commit dba14058230194122a715c219e35ab8eaa786321) Signed-off-by: Reynold Xin commit 306e68cf00e6ec6b10f1a29eb7434f3f3ea27752 Author: Davies Liu Date: Fri Nov 14 20:13:46 2014 -0800 [SPARK-4415] [PySpark] JVM should exit after Python exit When the JVM is started in a Python process, it should exit once the stdin is closed. test: add spark.driver.memory in conf/spark-defaults.conf

```
daviesdm:~/work/spark$ cat conf/spark-defaults.conf
spark.driver.memory 8g
daviesdm:~/work/spark$ bin/pyspark
>>> quit
daviesdm:~/work/spark$ jps
4931 Jps
286
daviesdm:~/work/spark$ python wc.py
943738 0.719928026199
daviesdm:~/work/spark$ jps
286
4990 Jps
```

Author: Davies Liu Closes #3274 from davies/exit and squashes the following commits: df0e524 [Davies Liu] address comments ce8599c [Davies Liu] address comments 050651f [Davies Liu] JVM should exit after Python exit (cherry picked from commit 7fe08b43c78bf9e8515f671e72aa03a83ea782f8) Signed-off-by: Andrew Or commit 118c89c28d1c3c048a5bd0335db4a0c65d71a4aa Author: WangTao Date: Fri Nov 14 20:11:51 2014 -0800 [SPARK-4404] SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-proc... ...ess ends https://issues.apache.org/jira/browse/SPARK-4404 When we have spark.driver.extra* or spark.driver.memory in SPARK_SUBMIT_PROPERTIES_FILE, spark-class will use SparkSubmitDriverBootstrapper to launch the driver. If we get the process id of SparkSubmitDriverBootstrapper and want to kill it while it is running, we expect its SparkSubmit sub-process to stop as well. Author: WangTao Author: WangTaoTheTonic Closes #3266 from WangTaoTheTonic/killsubmit and squashes the following commits: e03eba5 [WangTaoTheTonic] add comments 57b5ca1 [WangTao] SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends (cherry picked from commit 303a4e4d23e5cd93b541480cf88d5badb9cf9622) Signed-off-by: Andrew Or commit c425e31ad0132ddb0a817b26fe1e5d11a7ef7a63 Author: Sandy Ryza Date: Fri Nov 14 15:51:05 2014 -0800 SPARK-4214. With dynamic allocation, avoid outstanding requests for more... ... executors than pending tasks need. WIP. Still need to add and fix tests. Author: Sandy Ryza Closes #3204 from sryza/sandy-spark-4214 and squashes the following commits: 35cf0e0 [Sandy Ryza] Add comment 13b53df [Sandy Ryza] Review feedback 067465f [Sandy Ryza] Whitespace fix 6ae080c [Sandy Ryza] Add tests and get num pending tasks from ExecutorAllocationListener 531e2b6 [Sandy Ryza] SPARK-4214. With dynamic allocation, avoid outstanding requests for more executors than pending tasks need. (cherry picked from commit ad42b283246b93654c5fd731cd618fee74d8c4da) Signed-off-by: Andrew Or commit ef39ec419a97ad9e8cfcb39f8141ca255e04c4aa Author: Jim Carroll Date: Fri Nov 14 15:33:21 2014 -0800 [SPARK-4412][SQL] Fix Spark's control of Parquet logging. The Spark ParquetRelation.scala code makes the assumption that the parquet.Log class has already been loaded. If ParquetRelation.enableLogForwarding executes prior to the parquet.Log class being loaded, then the code in enableLogForwarding has no effect. ParquetRelation.scala attempts to override the parquet logger but, at least currently (and if your application simply reads a parquet file before it does anything else with Parquet), the parquet.Log class hasn't been loaded yet. Therefore the code in ParquetRelation.enableLogForwarding has no effect.
If you look at the code in parquet.Log, there's a static initializer that needs to be called prior to enableLogForwarding, or whatever enableLogForwarding does gets undone by this static initializer. The "fix" would be to force the static initializer to get called in parquet.Log as part of enableLogForwarding. Author: Jim Carroll Closes #3271 from jimfcarroll/parquet-logging and squashes the following commits: 37bdff7 [Jim Carroll] Fix Spark's control of Parquet logging. (cherry picked from commit 37482ce5a7b875f17d32a5e8c561cc8e9772c9b3) Signed-off-by: Michael Armbrust commit aa5d8e57c63d045b291a5c1fc99e782a0f191854 Author: Yash Datta Date: Fri Nov 14 15:16:36 2014 -0800 [SPARK-4365][SQL] Remove unnecessary filter call on records returned from parquet library Since the parquet library has been updated, we no longer need to filter the records returned from the parquet library for null records, as the library now skips those. From parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java:

```java
public boolean nextKeyValue() throws IOException, InterruptedException {
  boolean recordFound = false;
  while (!recordFound) {
    // no more records left
    if (current >= total) { return false; }
    try {
      checkRead();
      currentValue = recordReader.read();
      current ++;
      if (recordReader.shouldSkipCurrentRecord()) {
        // this record is being filtered via the filter2 package
        if (DEBUG) LOG.debug("skipping record");
        continue;
      }
      if (currentValue == null) {
        // only happens with FilteredRecordReader at end of block
        current = totalCountLoadedSoFar;
        if (DEBUG) LOG.debug("filtered record reader reached end of block");
        continue;
      }
      recordFound = true;
      if (DEBUG) LOG.debug("read value: " + currentValue);
    } catch (RuntimeException e) {
      throw new ParquetDecodingException(format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e);
    }
  }
  return true;
}
```

Author: Yash Datta Closes #3229 from saucam/remove_filter and squashes the following commits: 8909ae9 [Yash Datta] SPARK-4365: Remove unnecessary filter call on records returned from parquet library (cherry picked from commit 63ca3af66f9680fd12adee82fb4d342caae5cea4) Signed-off-by: Michael Armbrust commit 7f242dc2911bbc821e90fed81421af9b8d6dcd9a Author: Jim Carroll Date: Fri Nov 14 15:11:53 2014 -0800 [SPARK-4386] Improve performance when writing Parquet files. If you profile the writing of a Parquet file, the single worst time-consuming call inside of org.apache.spark.sql.parquet.MutableRowWriteSupport.write is actually in the scala.collection.AbstractSequence.size call. This is because the size call actually ends up COUNTING the elements in a scala.collection.LinearSeqOptimized.length ("optimized?"). This doesn't need to be done: "size" was called repeatedly wherever needed rather than being called once at the top of the method and stored in a 'val'. Author: Jim Carroll Closes #3254 from jimfcarroll/parquet-perf and squashes the following commits: 30cc0b5 [Jim Carroll] Improve performance when writing Parquet files. (cherry picked from commit f76b9683706232c3d4e8e6e61627b8188dcb79dc) Signed-off-by: Michael Armbrust commit 1cac30083b97c98c3663e2d2cd057124f033eb34 Author: Cheng Lian Date: Fri Nov 14 15:09:36 2014 -0800 [SPARK-4322][SQL] Enables struct fields as sub expressions of grouping fields While resolving struct fields, the resulting `GetField` expression is wrapped with an `Alias` to make it a named expression. Assume `a` is a struct instance with a field `b`, then `"a.b"` will be resolved as `Alias(GetField(a, "b"), "b")`.
Thus, for the following SQL query:

```sql
SELECT a.b + 1 FROM t GROUP BY a.b + 1
```

the grouping expression is

```scala
Add(GetField(a, "b"), Literal(1, IntegerType))
```

while the aggregation expression is

```scala
Add(Alias(GetField(a, "b"), "b"), Literal(1, IntegerType))
```

This mismatch makes the above SQL query fail during both the analysis and execution phases. This PR fixes this issue by removing the alias when substituting aggregation expressions. [Review on Reviewable](https://reviewable.io/reviews/apache/spark/3248) Author: Cheng Lian Closes #3248 from liancheng/spark-4322 and squashes the following commits: 23a46ea [Cheng Lian] Code simplification dd20a79 [Cheng Lian] Should only trim aliases around `GetField`s 7f46532 [Cheng Lian] Enables struct fields as sub expressions of grouping fields (cherry picked from commit 0c7b66bd449093bb5d2dafaf91d54e63e601e320) Signed-off-by: Michael Armbrust commit 680bc06195ecdc6ff2390c55adeb637649f2c8f3 Author: Michael Armbrust Date: Fri Nov 14 15:03:23 2014 -0800 [SQL] Don't shuffle code generated rows When sort-based shuffle and code gen are on, we were trying to ship the code-generated rows during a shuffle. This doesn't work because the classes don't exist on the other side. Instead we now copy into a generic row before shipping. Author: Michael Armbrust Closes #3263 from marmbrus/aggCodeGen and squashes the following commits: f6ba8cf [Michael Armbrust] fix and test (cherry picked from commit 4b4b50c9e596673c1534df97effad50d107a8007) Signed-off-by: Michael Armbrust commit e35672e7edeb7f68bece12d3d656419d3e610e95 Author: Michael Armbrust Date: Fri Nov 14 15:00:42 2014 -0800 [SQL] Minor cleanup of comments, errors and override. Author: Michael Armbrust Closes #3257 from marmbrus/minorCleanup and squashes the following commits: d8b5abc [Michael Armbrust] Use interpolation. 2fdf903 [Michael Armbrust] Better error message when coalesce can't be resolved. f9fa6cf [Michael Armbrust] Methods in a final class do not also need to be final, use override. 199fd98 [Michael Armbrust] Fix typo (cherry picked from commit f805025e8efe9cd522e8875141ec27df8d16bbe0) Signed-off-by: Michael Armbrust commit 576688aa2a19bd4ba239a2b93af7947f983e5124 Author: Michael Armbrust Date: Fri Nov 14 14:59:35 2014 -0800 [SPARK-4391][SQL] Configure parquet filters using SQLConf This is more uniform with the rest of SQL configuration and allows it to be turned on and off without restarting the SparkContext. In this PR I also turn off filter pushdown by default due to a number of outstanding issues (in particular SPARK-4258). When those are fixed we should turn it back on by default. Author: Michael Armbrust Closes #3258 from marmbrus/parquetFilters and squashes the following commits: 5655bfe [Michael Armbrust] Remove extra line.
15e9a98 [Michael Armbrust] Enable filters for tests 75afd39 [Michael Armbrust] Fix comments 78fa02d [Michael Armbrust] off by default e7f9e16 [Michael Armbrust] First draft of correctly configuring parquet filter pushdown (cherry picked from commit e47c38763914aaf89a7a851c5f41b7549a75615b) Signed-off-by: Michael Armbrust commit 0dd9241783b01815b68059067c72f36b8d05dddf Author: Michael Armbrust Date: Fri Nov 14 14:56:57 2014 -0800 [SPARK-4390][SQL] Handle NaN cast to decimal correctly Author: Michael Armbrust Closes #3256 from marmbrus/NanDecimal and squashes the following commits: 4c3ba46 [Michael Armbrust] fix style d360f83 [Michael Armbrust] Handle NaN cast to decimal (cherry picked from commit a0300ea32a9d92bd51c72930bc3979087b0082b2) Signed-off-by: Michael Armbrust commit 5b63158ac2100627ae4a77f3a89ae038e5b6be90 Author: jerryshao Date: Fri Nov 14 14:33:37 2014 -0800 [SPARK-4062][Streaming] Add ReliableKafkaReceiver in Spark Streaming Kafka connector Add ReliableKafkaReceiver in the Kafka connector to prevent data loss if WAL in Spark Streaming is enabled. Details and design doc can be seen in [SPARK-4062](https://issues.apache.org/jira/browse/SPARK-4062). Author: jerryshao Author: Tathagata Das Author: Saisai Shao Closes #2991 from jerryshao/kafka-refactor and squashes the following commits: 5461f1c [Saisai Shao] Merge pull request #8 from tdas/kafka-refactor3 eae4ad6 [Tathagata Das] Refectored KafkaStreamSuiteBased to eliminate KafkaTestUtils and made Java more robust. fab14c7 [Tathagata Das] minor update. 149948b [Tathagata Das] Fixed mistake 14630aa [Tathagata Das] Minor updates. d9a452c [Tathagata Das] Minor updates. ec2e95e [Tathagata Das] Removed the receiver's locks and essentially reverted to Saisai's original design. 2a20a01 [jerryshao] Address some comments 9f636b3 [Saisai Shao] Merge pull request #5 from tdas/kafka-refactor b2b2f84 [Tathagata Das] Refactored Kafka receiver logic and Kafka testsuites e501b3c [jerryshao] Add Mima excludes b798535 [jerryshao] Fix the missed issue e5e21c1 [jerryshao] Change to while loop ea873e4 [jerryshao] Further address the comments 98f3d07 [jerryshao] Fix comment style 4854ee9 [jerryshao] Address all the comments 96c7a1d [jerryshao] Update the ReliableKafkaReceiver unit test 8135d31 [jerryshao] Fix flaky test a949741 [jerryshao] Address the comments 16bfe78 [jerryshao] Change the ordering of imports 0894aef [jerryshao] Add some comments 77c3e50 [jerryshao] Code refactor and add some unit tests dd9aeeb [jerryshao] Initial commit for reliable Kafka receiver (cherry picked from commit 5930f64bf0d2516304b21bd49eac361a54caabdd) Signed-off-by: Tathagata Das commit f8810b6a572f314ab0b88899172d8fa2b78e014f Author: DoingDone9 <799203320@qq.com> Date: Fri Nov 14 14:28:06 2014 -0800 [SPARK-4333][SQL] Correctly log number of iterations in RuleExecutor When the iteration loop of RuleExecutor breaks, the logged number of iterations should be (iteration - 1), not (iteration), because the log reads like "Fixed point reached for batch ${batch.name} after 3 iterations." when it really did only 2 iterations! Author: DoingDone9 <799203320@qq.com> Closes #3180 from DoingDone9/issue_01 and squashes the following commits: 571e2ed [DoingDone9] Update RuleExecutor.scala 46514b6 [DoingDone9] When iterator of RuleExecutor breaks, the num of iterator should be iteration - 1 not iteration. (cherry picked from commit 0cbdb01e1c817e71c4f80de05c4e5bb11510b368) Signed-off-by: Michael Armbrust commit d90ddf12b6bea2162e982e800c96d2c94f66b347 Author: Sandy Ryza Date: Fri Nov 14 14:21:57 2014 -0800 SPARK-4375.
no longer require -Pscala-2.10 It seems like the winds might have moved away from this approach, but I wanted to post the PR anyway because I got it working and to show what it would look like. Author: Sandy Ryza Closes #3239 from sryza/sandy-spark-4375 and squashes the following commits: 0ffbe95 [Sandy Ryza] Enable -Dscala-2.11 in sbt cd42d94 [Sandy Ryza] Update doc f6644c3 [Sandy Ryza] SPARK-4375 take 2 (cherry picked from commit f5f757e4ed80759dc5668c63d5663651689f8da8) Signed-off-by: Patrick Wendell commit 4bdeeb7d25453b9b50c7dc23a5c7f588754f0e52 Author: Takuya UESHIN Date: Fri Nov 14 14:21:16 2014 -0800 [SPARK-4245][SQL] Fix containsNull of the result ArrayType of CreateArray expression. The `containsNull` of the result `ArrayType` of `CreateArray` should be `true` only if the children list is empty or there exists a nullable child. Author: Takuya UESHIN Closes #3110 from ueshin/issues/SPARK-4245 and squashes the following commits: 6f64746 [Takuya UESHIN] Move equalsIgnoreNullability method into DataType. 5a90e02 [Takuya UESHIN] Refine InsertIntoHiveType and add some comments. cbecba8 [Takuya UESHIN] Fix a test title. 884ec37 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4245 3c5274b [Takuya UESHIN] Add tests to insert data of types ArrayType / MapType / StructType with nullability is false into Hive table. 41a94a9 [Takuya UESHIN] Replace InsertIntoTable with InsertIntoHiveTable if data types ignoring nullability are same. 43e6ef5 [Takuya UESHIN] Fix containsNull for empty array. 778e997 [Takuya UESHIN] Fix containsNull of the result ArrayType of CreateArray expression. (cherry picked from commit bbd8f5bee81d5788c356977c173dd1edc42c77a3) Signed-off-by: Michael Armbrust commit 51b053a314c121463161b5fa99d37020a4816a1e Author: Daoyuan Wang Date: Fri Nov 14 13:51:20 2014 -0800 [SPARK-4239] [SQL] support view in HiveQl We currently still do not support views like CREATE VIEW view3(valoo) TBLPROPERTIES ("fear" = "factor") AS SELECT upper(value) FROM src WHERE key=86; because the text in the metastore for this view is like select \`_c0\` as \`valoo\` from (select upper(\`src\`.\`value\`) from \`default\`.\`src\` where ...) \`view3\` while catalyst cannot resolve \`_c0\` for this query. For views without a column name definition in parentheses, it works fine. Author: Daoyuan Wang Closes #3131 from adrian-wang/view and squashes the following commits: 8a56fd6 [Daoyuan Wang] michael's comments e46c056 [Daoyuan Wang] add some golden file 079290a [Daoyuan Wang] remove useless import 88afcad [Daoyuan Wang] support view in HiveQl (cherry picked from commit ade72c436276237f305d6a6aa4b594d43bcc4743) Signed-off-by: Michael Armbrust commit e7f957437ad013d16992a7ab12da58fa8eb6a880 Author: Jeff Hammerbacher Date: Fri Nov 14 13:37:48 2014 -0800 Update failed assert text to match code in SizeEstimatorSuite Author: Jeff Hammerbacher Closes #3242 from hammer/patch-1 and squashes the following commits: f88d635 [Jeff Hammerbacher] Update failed assert text to match code in SizeEstimatorSuite (cherry picked from commit c258db9ed4104b6eefe9f55f3e3959a3c46c2900) Signed-off-by: Andrew Or commit 88278241e9d9ca17db2f7c20d4434c32b7deff92 Author: zsxwing Date: Fri Nov 14 13:36:13 2014 -0800 [SPARK-4313][WebUI][Yarn] Fix link issue of the executor thread dump page in yarn-cluster mode In yarn-cluster mode, the Web UI is running behind a yarn proxy server. Some features (or bugs?) of the yarn proxy server will break the links for thread dump. 1.
The Yarn proxy server will do an http redirect internally, so if opening `http://example.com:8088/cluster/app/application_1415344371838_0012/executors`, it will fetch `http://example.com:8088/cluster/app/application_1415344371838_0012/executors/` and return the content but won't change the link in the browser. Then when a user clicks `Thread Dump`, it will jump to `http://example.com:8088/proxy/application_1415344371838_0012/threadDump/?executorId=2`. This is the wrong link. The correct link should be `http://example.com:8088/proxy/application_1415344371838_0012/executors/threadDump/?executorId=2`. Adding "/" to the tab links will fix it. 2. The Yarn proxy server has a bug in its URL encode/decode handling. When a user accesses `http://example.com:8088/proxy/application_1415344371838_0006/executors/threadDump/?executorId=%3Cdriver%3E`, the yarn proxy server will request `http://example.com:36429/executors/threadDump/?executorId=%25253Cdriver%25253E`. But the Spark web server expects `http://example.com:36429/executors/threadDump/?executorId=%3Cdriver%3E`. Related to [YARN-2844](https://issues.apache.org/jira/browse/YARN-2844). For now, this is a tricky approach to bypass the yarn bug. ![threaddump](https://cloud.githubusercontent.com/assets/1000778/4972567/d1ccba64-68ad-11e4-983e-257530cef35a.png) Author: zsxwing Closes #3183 from zsxwing/SPARK-4313 and squashes the following commits: 3379ca8 [zsxwing] Encode the executor id in the thread dump link and update the comment abfa063 [zsxwing] Fix link issue of the executor thread dump page in yarn-cluster mode (cherry picked from commit 156cf3333dcd93304eb5240f5a6466a3a0311957) Signed-off-by: Andrew Or commit 204eaf1653b2bdd0befe364392baa32c31ce0d3e Author: Andrew Ash Date: Fri Nov 14 13:33:35 2014 -0800 SPARK-3663 Document SPARK_LOG_DIR and SPARK_PID_DIR These descriptions are from the header of spark-daemon.sh Author: Andrew Ash Closes #2518 from ash211/SPARK-3663 and squashes the following commits: 058b257 [Andrew Ash] Complete hanging clause in SPARK_PID_DIR description a17cb4b [Andrew Ash] Update docs for default locations per SPARK-4110 af89096 [Andrew Ash] SPARK-3663 Document SPARK_LOG_DIR and SPARK_PID_DIR (cherry picked from commit 5c265ccde0c5594899ec61f9c1ea100ddff52da7) Signed-off-by: Andrew Or commit d579b39891f9adbabaaa2a4061042c490f94ee40 Author: Hong Shen Date: Fri Nov 14 13:29:41 2014 -0800 [Spark Core] SPARK-4380 Edit spilling log from MB to B https://issues.apache.org/jira/browse/SPARK-4380 Author: Hong Shen Closes #3243 from shenh062326/spark_change and squashes the following commits: 4653378 [Hong Shen] Edit spilling log from MB to B 21ee960 [Hong Shen] Edit spilling log from MB to B e9145e8 [Hong Shen] Edit spilling log from MB to B da761c2 [Hong Shen] Edit spilling log from MB to B 946351c [Hong Shen] Edit spilling log from MB to B (cherry picked from commit 0c56a039a9c5b871422f0fc55ff4394bc077fb34) Signed-off-by: Andrew Or commit 3014803ead0aac31f36f4387c919174877525ff4 Author: Xiangrui Meng Date: Fri Nov 14 12:43:17 2014 -0800 [SPARK-4398][PySpark] specialize sc.parallelize(xrange) `sc.parallelize(range(1 << 20), 1).count()` may take 15 seconds to finish and the rdd object stores the entire list, making task size very large. This PR adds a specialized version for xrange.
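For comparison, the Scala API already avoids this cost for ranges, since `ParallelCollectionRDD` slices a `Range` into sub-ranges instead of materializing its elements; the snippet below is a sketch of that behavior (assumes an active `sc`):

```scala
// A Range is a tiny object, so the task closures shipped to executors stay
// small and no 2^20-element list is ever materialized on the driver.
val rdd = sc.parallelize(1 to (1 << 20), 8)
println(rdd.count()) // 1048576
```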
JoshRosen davies Author: Xiangrui Meng Closes #3264 from mengxr/SPARK-4398 and squashes the following commits: 8953c41 [Xiangrui Meng] follow davies' suggestion cbd58e3 [Xiangrui Meng] specialize sc.parallelize(xrange) (cherry picked from commit abd581752f9314791a688690c07ad1bb68cc09fe) Signed-off-by: Xiangrui Meng commit 3219271f403091d4d3af4cddd08121ba538a459b Author: Patrick Wendell Date: Fri Nov 14 12:34:21 2014 -0800 Revert "[SPARK-2703][Core]Make Tachyon related unit tests execute without deploying a Tachyon system locally." This reverts commit c127ff8c87fc4f3aa6f09697928832dc6d37cc0f. commit 39257ca1bc920352e89ccb519a7e8b5d90710b68 Author: Michael Armbrust Date: Fri Nov 14 12:00:08 2014 -0800 [SPARK-4394][SQL] Data Sources API Improvements This PR adds two features to the data sources API: - Support for pushing down `IN` filters - The ability for relations to optionally provide information about their `sizeInBytes`. Author: Michael Armbrust Closes #3260 from marmbrus/sourcesImprovements and squashes the following commits: 9a5e171 [Michael Armbrust] Use method instead of configuration directly 99c0e6b [Michael Armbrust] Add support for sizeInBytes. 416f167 [Michael Armbrust] Support for IN in data sources API. 2a04ab3 [Michael Armbrust] Simplify implementation of InSet. (cherry picked from commit 77e845ca7726ffee2d6f8e33ea56ec005dde3874) Signed-off-by: Reynold Xin commit f1e7d1c2c02ded1f66ff2a3cff9a6e46bb10c5d3 Author: zsxwing Date: Thu Nov 13 14:37:04 2014 -0800 [SPARK-4310][WebUI] Sort 'Submitted' column in Stage page by time Author: zsxwing Closes #3179 from zsxwing/SPARK-4310 and squashes the following commits: b0d29f5 [zsxwing] Sort 'Submitted' column in Stage page by time (cherry picked from commit 825709a0b8f9b4bfb2718ecca8efc32be96c5a57) Signed-off-by: Andrew Or commit 5de97fc4384a8671f859cf8e2808324d0337216f Author: Xiangrui Meng Date: Thu Nov 13 13:54:16 2014 -0800 [SPARK-4372][MLLIB] Make LR and SVM's default parameters consistent in Scala and Python The current default regParam is 1.0 and regType is claimed to be none in Python (but actually it is l2), while regParam = 0.0 and regType is L2 in Scala. We should make the default values consistent. This PR sets the default regType to L2 and regParam to 0.01. Note that the default regParam value in LIBLINEAR (and hence scikit-learn) is 1.0. However, we use average loss instead of total loss in our formulation. Hence regParam=1.0 is definitely too heavy. In LinearRegression, we set regParam=0.0 and regType=None, because we have separate classes for Lasso and Ridge, both of which use regParam=0.01 as the default. davies atalwalkar Author: Xiangrui Meng Closes #3232 from mengxr/SPARK-4372 and squashes the following commits: 9979837 [Xiangrui Meng] update Ridge/Lasso to use default regParam 0.01 cast input arguments d3ba096 [Xiangrui Meng] change 'none' back to None 1909a6e [Xiangrui Meng] change default regParam to 0.01 and regType to L2 in LR and SVM (cherry picked from commit 32218307edc6de2b08d5f7a0db6d566081d27197) Signed-off-by: Xiangrui Meng commit d993a44de2bf91e93c5ad3f84d35ff4e55f4b2fb Author: Xiangrui Meng Date: Thu Nov 13 13:16:20 2014 -0800 [SPARK-4326] fix unidoc There are two issues: 1. specifying guava 11.0.2 will cause hashInt not found in unidoc (any reason to force the version here?) 2. 
unidoc doesn't recognize static classes defined in a base class. aarondav srowen vanzin Author: Xiangrui Meng Closes #3253 from mengxr/SPARK-4326 and squashes the following commits: 53967bf [Xiangrui Meng] fix unidoc (cherry picked from commit 4b0c1edfdf457cde0e39083c47961184059efded) Signed-off-by: Aaron Davidson commit c07592e4050d7cc7c7288a4b9909cc28cd5467a3 Author: Andrew Or Date: Thu Nov 13 11:54:45 2014 -0800 [HOT FIX] make-distribution.sh fails if Yarn shuffle jar DNE This is introduced in #3147 and is failing builds without the `-Pyarn` profile. Author: Andrew Or Closes #3250 from andrewor14/fix-yarn-shuffle-build and squashes the following commits: 42b3d37 [Andrew Or] Do not fail fast if Yarn shuffle jar does not exist (cherry picked from commit a0fa1ba704355a82e168aa9c16ecfed30128ade0) Signed-off-by: Andrew Or commit ff94283205d6b9774b700974fac0d4dfc33ef3e3 Author: Xiangrui Meng Date: Thu Nov 13 11:42:27 2014 -0800 [SPARK-4378][MLLIB] make ALS more Java-friendly Add Java-friendly versions of `run` and `predict`, and use bulk prediction in Java unit tests. The user guide update will come later (though we may not save many lines of code there). srowen Author: Xiangrui Meng Closes #3240 from mengxr/SPARK-4378 and squashes the following commits: 6581503 [Xiangrui Meng] check number of predictions 6c8bbd1 [Xiangrui Meng] make ALS more Java-friendly (cherry picked from commit ca26a212fda39a15fde09dfdb2fbe69580a717f6) Signed-off-by: Xiangrui Meng commit c502e08e89dce0ab3b1ffd530361efe4038a77b8 Author: Davies Liu Date: Thu Nov 13 10:24:54 2014 -0800 [SPARK-4348] [PySpark] [MLlib] rename random.py to rand.py This PR renames random.py to rand.py to avoid the side effects of conflicting with the random module, while still keeping the same interface as before.

```
>>> from pyspark.mllib.random import RandomRDDs
```

```
$ pydoc pyspark.mllib.random
Help on module random in pyspark.mllib:

NAME
    random - Python package for random data generation.

FILE
    /Users/davies/work/spark/python/pyspark/mllib/rand.py

CLASSES
    __builtin__.object
        pyspark.mllib.random.RandomRDDs

    class RandomRDDs(__builtin__.object)
     |  Generator methods for creating RDDs comprised of i.i.d samples from
     |  some distribution.
     |
     |  Static methods defined here:
     |
     |  normalRDD(sc, size, numPartitions=None, seed=None)
```

cc mengxr reference link: http://xion.org.pl/2012/05/06/hacking-python-imports/ Author: Davies Liu Closes #3216 from davies/random and squashes the following commits: 7ac4e8b [Davies Liu] rename random.py to rand.py (cherry picked from commit ce0333f9a008348692bb9a200449d2d992e7825e) Signed-off-by: Xiangrui Meng commit ad872a5ba490b4d3f8970e4876490ddf2b18f891 Author: Andrew Bullen Date: Wed Nov 12 22:14:44 2014 -0800 [SPARK-4256] Make Binary Evaluation Metrics functions defined in cases where there ar... ...e 0 positive or 0 negative examples. Author: Andrew Bullen Closes #3118 from abull/master and squashes the following commits: c2bf2b1 [Andrew Bullen] [SPARK-4256] Update Code formatting for BinaryClassificationMetricsSpec 36b0533 [Andrew Bullen] [SYMAN-4256] Extract BinaryClassificationMetricsSuite assertions into private method 4d2f79a [Andrew Bullen] [SPARK-4256] Refactor classification metrics tests - extract comparison functions in test f411e70 [Andrew Bullen] [SPARK-4256] Define precision as 1.0 when there are no positive examples; update code formatting per pull request comments d9a09ef [Andrew Bullen] Make Binary Evaluation Metrics functions defined in cases where there are 0 positive or 0 negative examples.
(cherry picked from commit 484fecbf1402c25f310be0b0a5ec15c11cbd65c3) Signed-off-by: Xiangrui Meng commit 70efcd88694f20d14085d1ec895a8d38f38784fb Author: Aaron Davidson Date: Wed Nov 12 18:46:37 2014 -0800 [SPARK-4370] [Core] Limit number of Netty cores based on executor size Author: Aaron Davidson Closes #3155 from aarondav/conf and squashes the following commits: 7045e77 [Aaron Davidson] Add mesos comment 4770f6e [Aaron Davidson] [SPARK-4370] [Core] Limit number of Netty cores based on executor size (cherry picked from commit b9e1c2eb9b6f7fb609718ef20048a8da452d881b) Signed-off-by: Reynold Xin commit 5f14cdeaa9bfaa05f01a9f9fe77386c46f511805 Author: Xiangrui Meng Date: Wed Nov 12 18:15:14 2014 -0800 [SPARK-4373][MLLIB] fix MLlib maven tests We want to make sure there is at most one SparkContext inside the same JVM. JoshRosen Author: Xiangrui Meng Closes #3235 from mengxr/SPARK-4373 and squashes the following commits: 6574b69 [Xiangrui Meng] rename LocalSparkContext to MLlibTestSparkContext 913d48d [Xiangrui Meng] make sure there is at most one spark context inside the same jvm (cherry picked from commit 23f5bdf06a388e08ea5a69e848f0ecd5165aa481) Signed-off-by: Josh Rosen commit 675df2afd496a4bd1a48b77c0aaef2e461d5b145 Author: Andrew Or Date: Thu Nov 13 00:30:58 2014 +0000 [Release] Bring audit scripts up-to-date This involves a few main changes: - Log all output messages to the log file. Previously the log file was not useful because it did not indicate progress. - Remove hive-site.xml in sbt_hive_app to avoid interference - Add the appropriate repositories for new dependencies commit 44f67ac7d0b9bd92c6516320fdfada8f3a7856bd Author: Davies Liu Date: Wed Nov 12 15:58:12 2014 -0800 [SPARK-2672] support compressed file in wholeTextFile The wholeTextFile() method cannot read compressed files; it should be able to, just like textFile(). Author: Davies Liu Closes #3005 from davies/whole and squashes the following commits: a43fcfb [Davies Liu] remove semicolon c83571a [Davies Liu] remove = if return type is Unit 83c844f [Davies Liu] Merge branch 'master' of github.com:apache/spark into whole 22e8b3e [Davies Liu] support compressed file in wholeTextFile (cherry picked from commit d7d54a44e3ada0e50febe64e9b037dc2c8f6ff61) Signed-off-by: Josh Rosen commit 16da988c5cdae935151e307a66a5385bac5167c3 Author: Davies Liu Date: Wed Nov 12 13:56:41 2014 -0800 [SPARK-4369] [MLLib] fix TreeModel.predict() with RDD Fix TreeModel.predict() with RDD, added tests for it. (Also checked that other models don't have this issue) Author: Davies Liu Closes #3230 from davies/predict and squashes the following commits: 81172aa [Davies Liu] fix predict (cherry picked from commit bd86118c4e980f94916f892c76fb808fd4c8bd85) Signed-off-by: Xiangrui Meng commit 127c19b449315bdeba758e48371291c61abf0952 Author: Ankur Dave Date: Wed Nov 12 13:49:20 2014 -0800 [SPARK-3666] Extract interfaces for EdgeRDD and VertexRDD This discourages users from calling the VertexRDD and EdgeRDD constructors and makes it easier for future changes to ensure backward compatibility.
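A simplified, hypothetical sketch of the pattern (not the actual GraphX code): the public abstract type hides the implementation, whose constructor is package-private so instances can only be built through the companion factory.

```scala
package org.apache.spark.graphx.example

// Public interface users program against.
abstract class Edges {
  def count: Long
}

// Implementation detail; the constructor is hidden from user code.
private[example] class EdgesImpl(data: Array[Long]) extends Edges {
  override def count: Long = data.length.toLong
}

object Edges {
  // The only supported way to construct an Edges instance.
  def fromArray(data: Array[Long]): Edges = new EdgesImpl(data)
}
```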
Author: Ankur Dave Closes #2530 from ankurdave/SPARK-3666 and squashes the following commits: d681f45 [Ankur Dave] Define getPartitions and compute in abstract class for MIMA 1472390 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into SPARK-3666 24201d4 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into SPARK-3666 cbe15f2 [Ankur Dave] Remove specialized annotation from VertexRDD and EdgeRDD 931b587 [Ankur Dave] Use abstract class instead of trait for binary compatibility 9ba4ec4 [Ankur Dave] Mark (Vertex|Edge)RDDImpl constructors package-private 620e603 [Ankur Dave] Extract VertexRDD interface and move implementation to VertexRDDImpl 55b6398 [Ankur Dave] Extract EdgeRDD interface and move implementation to EdgeRDDImpl (cherry picked from commit a5ef58113667ff73562ce6db381cff96a0b354b0) Signed-off-by: Reynold Xin commit dbac77ebda28e1aa7b74831601fabd615b539358 Author: Andrew Or Date: Wed Nov 12 13:46:26 2014 -0800 [Release] Correct make-distribution.sh log path commit 5d5c8fd5346e420859b11c825fa1ff1decd72d09 Author: Ankur Dave Date: Wed Nov 12 13:44:49 2014 -0800 Internal cleanup for aggregateMessages 1. Add EdgeActiveness enum to represent activeness criteria more cleanly than using booleans. 2. Comments and whitespace. Author: Ankur Dave Closes #3231 from ankurdave/aggregateMessages-followup and squashes the following commits: 3d485c3 [Ankur Dave] Internal cleanup for aggregateMessages (cherry picked from commit 0402be90f7af82c8404cafbca79f5f9fb8e2bbed) Signed-off-by: Reynold Xin commit f50c0881be943d8df98a88cc73d163b16169874e Author: Andrew Or Date: Wed Nov 12 13:39:45 2014 -0800 [SPARK-4281][Build] Package Yarn shuffle service into its own jar This is another addendum to #3082, which added the Yarn shuffle service to run inside the NM. This PR makes the feature much more usable by packaging enough dependencies into the jar to run the service inside an NM. After these changes, the user can run `./make-distribution.sh` and find a `spark-network-yarn*.jar` in their `lib` directory. The equivalent change is done in SBT by making the `network-yarn` module an assembly project. Author: Andrew Or Closes #3147 from andrewor14/yarn-shuffle-build and squashes the following commits: bda58d0 [Andrew Or] Fix line too long 81e9705 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-build fb7f398 [Andrew Or] Rename jar to spark-{VERSION}-yarn-shuffle.jar 65db822 [Andrew Or] Actually mark slf4j as provided abcefd1 [Andrew Or] Do the same for SBT c653028 [Andrew Or] Package network-yarn and its dependencies (cherry picked from commit aa43a8da012cf0dac7c7fcccde5f028a942599f0) Signed-off-by: Andrew Or commit 233f0377aaf1dafb8f7e0fb53fc6c09ea65743c3 Author: Andrew Or Date: Wed Nov 12 13:35:48 2014 -0800 [Test] Better exception message from SparkSubmitSuite Before: ``` Exception in thread "main" java.lang.Exception: Could not load user defined classes inside of executors at org.apache.spark.deploy.JarCreationTest$.main(SparkSubmitSuite.scala:471) at org.apache.spark.deploy.JarCreationTest.main(SparkSubmitSuite.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ``` After: ``` Exception in thread "main" java.lang.Exception: Could not load user class from jar: java.lang.UnsupportedClassVersionError: SparkSubmitClassA : Unsupported major.minor version 51.0 java.lang.ClassLoader.defineClass1(Native Method) java.lang.ClassLoader.defineClass(ClassLoader.java:643) ... 
at org.apache.spark.deploy.JarCreationTest$.main(SparkSubmitSuite.scala:472) at org.apache.spark.deploy.JarCreationTest.main(SparkSubmitSuite.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ``` Author: Andrew Or Closes #3212 from andrewor14/submit-suite-message and squashes the following commits: 7779248 [Andrew Or] Format exception 8fe6719 [Andrew Or] Better exception message from failed test (cherry picked from commit 6e3c5a296c90a551be5e6c7292a66f2e65338240) Signed-off-by: Andrew Or commit 38f9f2e1cec3fd0526a00010382698d80e8025d9 Author: Xiangrui Meng Date: Wed Nov 12 10:38:57 2014 -0800 [SPARK-3530][MLLIB] pipeline and parameters with examples This PR adds package "org.apache.spark.ml" with pipeline and parameters, as discussed on the JIRA. This is a joint work of jkbradley etrain shivaram and many others who helped on the design, also with help from marmbrus and liancheng on the Spark SQL side. The design doc can be found at: https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing **org.apache.spark.ml** This is a new package with a new set of ML APIs that address practical machine learning pipelines. (Sorry for taking so long!) It will be an alpha component, so this is definitely not something set in stone. The new set of APIs, inspired by the MLI project from AMPLab and scikit-learn, leverages Spark SQL's schema support and execution plan optimization. It introduces the following components that help build a practical pipeline: 1. Transformer, which transforms a dataset into another 2. Estimator, which fits models to data, where models are transformers 3. Evaluator, which evaluates model output and returns a scalar metric 4. Pipeline, a simple pipeline that consists of transformers and estimators Parameters could be supplied at fit/transform or embedded with components. 1. Param: a strongly-typed parameter key with self-contained doc 2. ParamMap: a param -> value map 3. Params: trait for components with parameters For any component that implements `Params`, users can easily check the doc by calling `explainParams`:

~~~
> val lr = new LogisticRegression
> lr.explainParams
maxIter: max number of iterations (default: 100)
regParam: regularization constant (default: 0.1)
labelCol: label column name (default: label)
featuresCol: features column name (default: features)
~~~

or check an individual param:

~~~
> lr.maxIter
maxIter: max number of iterations (default: 100)
~~~

**Please start with the example code in test suites and under `org.apache.spark.examples.ml`, where I put several examples:** 1. run a simple logistic regression job

~~~
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(1.0)
val model = lr.fit(dataset)
model.transform(dataset, model.threshold -> 0.8) // overwrite threshold
  .select('label, 'score, 'prediction).collect()
  .foreach(println)
~~~

2. run logistic regression with cross-validation and grid search using areaUnderROC (default) as the metric

~~~
val lr = new LogisticRegression
val lrParamMaps = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 100.0))
  .addGrid(lr.maxIter, Array(0, 5))
  .build()
val eval = new BinaryClassificationEvaluator
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEstimatorParamMaps(lrParamMaps)
  .setEvaluator(eval)
  .setNumFolds(3)
val bestModel = cv.fit(dataset)
~~~

3.
run a pipeline that consists of a standard scaler and a logistic regression component

~~~
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
val lr = new LogisticRegression()
  .setFeaturesCol(scaler.getOutputCol)
val pipeline = new Pipeline()
  .setStages(Array(scaler, lr))
val model = pipeline.fit(dataset)
val predictions = model.transform(dataset)
  .select('label, 'score, 'prediction)
  .collect()
  .foreach(println)
~~~

4. a simple text classification pipeline, which recognizes "spark":

~~~
val training = sparkContext.parallelize(Seq(
  LabeledDocument(0L, "a b c d e spark", 1.0),
  LabeledDocument(1L, "b d", 0.0),
  LabeledDocument(2L, "spark f g h", 1.0),
  LabeledDocument(3L, "hadoop mapreduce", 0.0)))
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
val test = sparkContext.parallelize(Seq(
  Document(4L, "spark i j k"),
  Document(5L, "l m"),
  Document(6L, "mapreduce spark"),
  Document(7L, "apache hadoop")))
model.transform(test)
  .select('id, 'text, 'prediction, 'score)
  .collect()
  .foreach(println)
~~~

Java examples are very similar. I put example code that creates a simple text classification pipeline in Scala and Java, where a simple tokenizer is defined as a transformer outside `org.apache.spark.ml`. **What is missing now and will be added soon:** 1. ~~Runtime check of schemas. So before we touch the data, we will go through the schema and make sure column names and types match the input parameters.~~ 2. ~~Java examples.~~ 3. ~~Store training parameters in trained models.~~ 4. (later) Serialization and Python API.
Author: Xiangrui Meng Closes #3099 from mengxr/SPARK-3530 and squashes the following commits: 2cc93fd [Xiangrui Meng] hide APIs as much as I can 34319ba [Xiangrui Meng] use local instead local[2] for unit tests 2524251 [Xiangrui Meng] rename PipelineStage.transform to transformSchema c9daab4 [Xiangrui Meng] remove mockito version 1397ab5 [Xiangrui Meng] use sqlContext from LocalSparkContext instead of TestSQLContext 6ffc389 [Xiangrui Meng] try to fix unit test a59d8b7 [Xiangrui Meng] doc updates 977fd9d [Xiangrui Meng] add scala ml package object 6d97fe6 [Xiangrui Meng] add AlphaComponent annotation 731f0e4 [Xiangrui Meng] update package doc 0435076 [Xiangrui Meng] remove ;this from setters fa21d9b [Xiangrui Meng] update extends indentation f1091b3 [Xiangrui Meng] typo 228a9f4 [Xiangrui Meng] do not persist before calling binary classification metrics f51cd27 [Xiangrui Meng] rename default to defaultValue b3be094 [Xiangrui Meng] refactor schema transform in lr 8791e8e [Xiangrui Meng] rename copyValues to inheritValues and make it do the right thing 51f1c06 [Xiangrui Meng] remove leftover code in Transformer 494b632 [Xiangrui Meng] compure score once ad678e9 [Xiangrui Meng] more doc for Transformer 4306ed4 [Xiangrui Meng] org imports in text pipeline 6e7c1c7 [Xiangrui Meng] update pipeline 4f9e34f [Xiangrui Meng] more doc for pipeline aa5dbd4 [Xiangrui Meng] fix typo 11be383 [Xiangrui Meng] fix unit tests 3df7952 [Xiangrui Meng] clean up 986593e [Xiangrui Meng] re-org java test suites 2b11211 [Xiangrui Meng] remove external data deps 9fd4933 [Xiangrui Meng] add unit test for pipeline 2a0df46 [Xiangrui Meng] update tests 2d52e4d [Xiangrui Meng] add @AlphaComponent to package-info 27582a4 [Xiangrui Meng] doc changes 73a000b [Xiangrui Meng] add schema transformation layer 6736e87 [Xiangrui Meng] more doc / remove HasMetricName trait 80a8b5e [Xiangrui Meng] rename SimpleTransformer to UnaryTransformer 62ca2bb [Xiangrui Meng] check param parent in set/get 1622349 [Xiangrui Meng] add getModel to PipelineModel a0e0054 [Xiangrui Meng] update StandardScaler to use SimpleTransformer d0faa04 [Xiangrui Meng] remove implicit mapping from ParamMap c7f6921 [Xiangrui Meng] move ParamGridBuilder test to ParamGridBuilderSuite e246f29 [Xiangrui Meng] re-org: 7772430 [Xiangrui Meng] remove modelParams add a simple text classification pipeline b95c408 [Xiangrui Meng] remove implicits add unit tests to params bab3e5b [Xiangrui Meng] update params fe0ee92 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530 6e86d98 [Xiangrui Meng] some code clean-up 2d040b3 [Xiangrui Meng] implement setters inside each class, add Params.copyValues [ci skip] fd751fc [Xiangrui Meng] add java-friendly versions of fit and tranform 3f810cd [Xiangrui Meng] use multi-model training api in cv 5b8f413 [Xiangrui Meng] rename model to modelParams 9d2d35d [Xiangrui Meng] test varargs and chain model params f46e927 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530 1ef26e0 [Xiangrui Meng] specialize methods/types for Java df293ed [Xiangrui Meng] switch to setter/getter 376db0a [Xiangrui Meng] pipeline and parameters (cherry picked from commit 4b736dbab3e177e5265439d37063bb501657d830) Signed-off-by: Xiangrui Meng commit 14b933fb67c1b979c72a155c5caa36c786fc6a1a Author: Xiangrui Meng Date: Wed Nov 12 01:50:11 2014 -0800 [SPARK-4355][MLLIB] fix OnlineSummarizer.merge when other.mean is zero See inline comment about the bug. I also did some code clean-up. 
dbtsai I moved `update` to a private method of `MultivariateOnlineSummarizer`. I don't think it will cause performance regression, but it would be great if you have some time to test. Author: Xiangrui Meng Closes #3220 from mengxr/SPARK-4355 and squashes the following commits: 5ef601f [Xiangrui Meng] fix OnlineSummarizer.merge when other.mean is zero and some code clean-up (cherry picked from commit 84324fbcb987db6e10e435f463eacace1bae43e2) Signed-off-by: Xiangrui Meng commit 4a4cc7e9199cb8cc48b4b073578ed5a22d8093f3 Author: Ankur Dave Date: Tue Nov 11 23:38:27 2014 -0800 [SPARK-3936] Add aggregateMessages, which supersedes mapReduceTriplets aggregateMessages enables neighborhood computation similarly to mapReduceTriplets, but it introduces two API improvements: 1. Messages are sent using an imperative interface based on EdgeContext rather than by returning an iterator of messages. 2. Rather than attempting bytecode inspection, the required triplet fields must be explicitly specified by the user by passing a TripletFields object. This fixes SPARK-3936. Additionally, this PR includes the following optimizations for aggregateMessages and EdgePartition: 1. EdgePartition now stores local vertex ids instead of global ids. This avoids hash lookups when looking up vertex attributes and aggregating messages. 2. Internal iterators in aggregateMessages are inlined into a while loop. In total, these optimizations were tested to provide a 37% speedup on PageRank (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge machines, sped up from 513 s to 322 s). Subsumes apache/spark#2815. Also fixes SPARK-4173. Author: Ankur Dave Closes #3100 from ankurdave/aggregateMessages and squashes the following commits: f5b65d0 [Ankur Dave] Address @rxin comments on apache/spark#3054 and apache/spark#3100 1e80aca [Ankur Dave] Add aggregateMessages, which supersedes mapReduceTriplets 194a2df [Ankur Dave] Test triplet iterator in EdgePartition serialization test e0f8ecc [Ankur Dave] Take activeSet in ExistingEdgePartitionBuilder c85076d [Ankur Dave] Readability improvements b567be2 [Ankur Dave] iter.foreach -> while loop 4a566dc [Ankur Dave] Optimizations for mapReduceTriplets and EdgePartition (cherry picked from commit faeb41de215d3ac567ce72a43ab242ad433ca93e) Signed-off-by: Reynold Xin commit c9bb5e459b53709e124fa1d45e14b30ca4fe4f79 Author: Manish Amde Date: Tue Nov 11 22:47:53 2014 -0800 [MLLIB] SPARK-4347: Reducing GradientBoostingSuite run time. 
Before: [info] GradientBoostingSuite: [info] - Regression with continuous features: SquaredError (22 seconds, 115 milliseconds) [info] - Regression with continuous features: Absolute Error (19 seconds, 330 milliseconds) [info] - Binary classification with continuous features: Log Loss (19 seconds, 17 milliseconds) After: [info] - Regression with continuous features: SquaredError (7 seconds, 69 milliseconds) [info] - Regression with continuous features: Absolute Error (4 seconds, 617 milliseconds) [info] - Binary classification with continuous features: Log Loss (4 seconds, 658 milliseconds) cc: mengxr, jkbradley Author: Manish Amde Closes #3214 from manishamde/gbt_test_speedup and squashes the following commits: 8994552 [Manish Amde] reducing gbt test run times (cherry picked from commit 2ef016b130a48869cf81fe6cf147ef2b1e79d674) Signed-off-by: Xiangrui Meng commit 12f56334bb308c19d1c6c017fe1ec10808bde12a Author: Prashant Sharma Date: Tue Nov 11 21:36:48 2014 -0800 Support cross building for Scala 2.11 Let's give this another go using a version of Hive that shades its JLine dependency. Author: Prashant Sharma Author: Patrick Wendell Closes #3159 from pwendell/scala-2.11-prashant and squashes the following commits: e93aa3e [Patrick Wendell] Restoring -Phive-thriftserver profile and cleaning up build script. f65d17d [Patrick Wendell] Fixing build issue due to merge conflict a8c41eb [Patrick Wendell] Reverting dev/run-tests back to master state. 7a6eb18 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into scala-2.11-prashant 583aa07 [Prashant Sharma] REVERT ME: removed hive thriftserver 3680e58 [Prashant Sharma] Revert "REVERT ME: Temporarily removing some Cli tests." 935fb47 [Prashant Sharma] Revert "Fixed by disabling a few tests temporarily." 925e90f [Prashant Sharma] Fixed by disabling a few tests temporarily. 2fffed3 [Prashant Sharma] Exclude groovy from sbt build, and also provide a way for such instances in future. 8bd4e40 [Prashant Sharma] Switched to gmaven plus, it fixes random failures observed with its predecessor gmaven. 5272ce5 [Prashant Sharma] SPARK_SCALA_VERSION related bugs. 2121071 [Patrick Wendell] Migrating version detection to PySpark b1ed44d [Patrick Wendell] REVERT ME: Temporarily removing some Cli tests. 1743a73 [Patrick Wendell] Removing decimal test that doesn't work with Scala 2.11 f5cad4e [Patrick Wendell] Add Scala 2.11 docs 210d7e1 [Patrick Wendell] Revert "Testing new Hive version with shaded jline" 48518ce [Patrick Wendell] Remove association of Hive and Thriftserver profiles. e9d0a06 [Patrick Wendell] Revert "Enable thriftserver for Scala 2.10 only" 67ec364 [Patrick Wendell] Guard building of thriftserver around Scala 2.10 check 8502c23 [Patrick Wendell] Enable thriftserver for Scala 2.10 only e22b104 [Patrick Wendell] Small fix in pom file ec402ab [Patrick Wendell] Various fixes 0be5a9d [Patrick Wendell] Testing new Hive version with shaded jline 4eaec65 [Prashant Sharma] Changed scripts to ignore target. 5167bea [Prashant Sharma] small correction a4fcac6 [Prashant Sharma] Run against scala 2.11 on jenkins. 80285f4 [Prashant Sharma] Maven equivalent of setting spark.executor.extraClasspath during tests. 034b369 [Prashant Sharma] Setting test jars on executor classpath during tests from sbt. d4874cb [Prashant Sharma] Fixed Python Runner suite. null check should be first case in scala 2.11. 6f50f13 [Prashant Sharma] Fixed build after rebasing with master.
We should use ${scala.binary.version} instead of just 2.10 e56ca9d [Prashant Sharma] Print an error if build for 2.10 and 2.11 is spotted. 937c0b8 [Prashant Sharma] SCALA_VERSION -> SPARK_SCALA_VERSION cb059b0 [Prashant Sharma] Code review 0476e5e [Prashant Sharma] Scala 2.11 support with repl and all build changes. (cherry picked from commit daaca14c16dc2c1abc98f15ab8c6f7c14761b627) Signed-off-by: Patrick Wendell commit 307b69d73c37b5a580a1079843b13aeac1f6f6f4 Author: Andrew Or Date: Tue Nov 11 18:02:59 2014 -0800 [Release] Log build output for each distribution commit 6a7ddf4ce10e540ecc389235a7e4d994e225b9e6 Author: Timothy Chen Date: Tue Nov 11 14:29:18 2014 -0800 SPARK-2269 Refactor mesos scheduler resourceOffers and add unit test Author: Timothy Chen Closes #1487 from tnachen/resource_offer_refactor and squashes the following commits: 4ea5dec [Timothy Chen] Rebase from master and address comments 9ccab09 [Timothy Chen] Address review comments e6494dc [Timothy Chen] Refactor class loading 8207428 [Timothy Chen] Refactor mesos scheduler resourceOffers and add unit test (cherry picked from commit a878660d2d7bb7ad9b5818a674e1e7c651077e78) Signed-off-by: Andrew Or commit ec0d89bc93f3a69a844d4b133bf185ee24048726 Author: Kousuke Saruta Date: Tue Nov 11 12:33:53 2014 -0600 [SPARK-4282][YARN] Stopping flag in YarnClientSchedulerBackend should be volatile In YarnClientSchedulerBackend, a variable "stopping" is used as a flag and it's accessed by some threads so it should be volatile. Author: Kousuke Saruta Closes #3143 from sarutak/stopping-flag-volatile and squashes the following commits: 58fdcc9 [Kousuke Saruta] Marked stopping flag as volatile (cherry picked from commit 7f3718842cc4025bb2ee2f5a3ec12efd100f6589) Signed-off-by: Thomas Graves commit 8f7e80f30bd34897963334d0245c0ea6fccd6182 Author: Sean Owen Date: Tue Nov 11 12:30:35 2014 -0600 SPARK-4305 [BUILD] yarn-alpha profile won't build due to network/yarn module SPARK-3797 introduced the `network/yarn` module, but its YARN code depends on YARN APIs not present in older versions covered by the `yarn-alpha` profile. As a result builds like `mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package` fail. The solution is just to not build `network/yarn` with profile `yarn-alpha`. Author: Sean Owen Closes #3167 from srowen/SPARK-4305 and squashes the following commits: 88938cb [Sean Owen] Don't build network/yarn in yarn-alpha profile as it won't compile (cherry picked from commit f820b563d88f6a972c219d9340fe95110493fb87) Signed-off-by: Thomas Graves commit cc1f3a0d6bfc5299e9db1d8ca50e33d2411d7cd9 Author: huangzhaowei Date: Tue Nov 11 03:02:12 2014 -0800 [Streaming][Minor]Replace some 'if-else' in Clock Replace some 'if-else' statements with math.min and math.max in Clock.scala Author: huangzhaowei Closes #3088 from SaintBacchus/StreamingClock and squashes the following commits: 7b7f8e7 [huangzhaowei] [Streaming][Minor]Replace some 'if-else' in Clock (cherry picked from commit 6e03de304e0294017d832763fd71e642736f8c33) Signed-off-by: Tathagata Das commit 7710b7156e0c82445783c3709a4a793d820627b2 Author: jerryshao Date: Tue Nov 11 02:22:23 2014 -0800 [SPARK-2492][Streaming] kafkaReceiver minor changes to align with Kafka 0.8 Update the KafkaReceiver's behavior when auto.offset.reset is set. In Kafka 0.8, `auto.offset.reset` is a hint for out-of-range offsets to seek to the beginning or end of the partition.
In the previous code, however, `auto.offset.reset` forced an immediate seek to the beginning or end, which differs from Kafka 0.8's defined behavior. Also, deleting existing ZK metadata in the Receiver when multiple consumers are launched introduces the issue mentioned in [SPARK-2383](https://issues.apache.org/jira/browse/SPARK-2383). So here we change the code to offer users an API to explicitly reset offsets before creating the Kafka stream, while keeping the same behavior as Kafka 0.8 for the parameter `auto.offset.reset`. @tdas, would you please review this PR? Thanks a lot. Author: jerryshao Closes #1420 from jerryshao/kafka-fix and squashes the following commits: d6ae94d [jerryshao] Address the comment to remove the resetOffset() function de3a4c8 [jerryshao] Fix compile error 4a1c3f9 [jerryshao] Doc changes b2c1430 [jerryshao] Move offset reset to a helper function to let user explicitly delete ZK metadata by calling this API fac8fd6 [jerryshao] Changes to align with Kafka 0.8 (cherry picked from commit c8850a3d6d948f9dd9ee026ee350428968d3c21b) Signed-off-by: Tathagata Das commit fe8a1cd292ff067aabf78dd009204a4500d0cf75 Author: maji2014 Date: Tue Nov 11 02:18:27 2014 -0800 [SPARK-4295][External]Fix exception in SparkSinkSuite Handle exception in SparkSinkSuite, please refer to [SPARK-4295] Author: maji2014 Closes #3177 from maji2014/spark-4295 and squashes the following commits: 312620a [maji2014] change a new statement for spark-4295 24c3d21 [maji2014] add log4j.properties for SparkSinkSuite and spark-4295 c807bf6 [maji2014] Fix exception in SparkSinkSuite (cherry picked from commit f8811a5695af2dfe156f07431288db7b8cd97159) Signed-off-by: Tathagata Das commit e9d009dc348bc06198ed2c9e03f1ba870401e6df Author: Reynold Xin Date: Tue Nov 11 00:25:31 2014 -0800 [SPARK-4307] Initialize FileDescriptor lazily in FileRegion. Netty's DefaultFileRegion requires a FileDescriptor in its constructor, which means we need to have an open file handle. In super large workloads, this could lead to too many open files due to the way these file descriptors are cleaned. This pull request creates a new LazyFileRegion that initializes the FileDescriptor when we are sending data for the first time. Author: Reynold Xin Author: Reynold Xin Closes #3172 from rxin/lazyFD and squashes the following commits: 0bdcdc6 [Reynold Xin] Added reference to Netty's DefaultFileRegion d4564ae [Reynold Xin] Added SparkConf to the ctor argument of IndexShuffleBlockManager. 6ed369e [Reynold Xin] Code review feedback. 04cddc8 [Reynold Xin] [SPARK-4307] Initialize FileDescriptor lazily in FileRegion. (cherry picked from commit ef29a9a9aa85468869eb67ca67b66c65f508d0ee) Signed-off-by: Aaron Davidson commit df8242c9b6307c085d4c1a7ec446b1701a7e7cde Author: Davies Liu Date: Mon Nov 10 22:26:16 2014 -0800 [SPARK-4324] [PySpark] [MLlib] support numpy.array for all MLlib API This PR checks all of the existing Python MLlib APIs to make sure that numpy.array is supported as Vector (also RDD of numpy.array). It also improves some docstrings and doctests.
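Returning to the SPARK-4307 commit above, a hedged sketch of the lazy-initialization idea it describes (a simplified, hypothetical class, not the actual LazyFileRegion code): defer opening the file until data is first transferred.
```
import java.io.{File, RandomAccessFile}
import java.nio.channels.WritableByteChannel

// Simplified sketch: the file handle is opened on first use, not in the
// constructor, so queued transfers do not hold open file descriptors.
class LazilyOpenedRegion(file: File, offset: Long, length: Long) {
  private var raf: RandomAccessFile = _

  private def channel = {
    if (raf == null) raf = new RandomAccessFile(file, "r")
    raf.getChannel
  }

  def transferTo(target: WritableByteChannel, position: Long): Long =
    channel.transferTo(offset + position, length - position, target)
}
```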
cc mateiz mengxr Author: Davies Liu Closes #3189 from davies/numpy and squashes the following commits: d5057c4 [Davies Liu] fix tests 6987611 [Davies Liu] support numpy.array for all MLlib API (cherry picked from commit 65083e93ddd552b7d3e4eb09f87c091ef2ae83a2) Signed-off-by: Xiangrui Meng commit 4eeaf3395a885b0a9ef79c31b720969155b0b7af Author: Kousuke Saruta Date: Mon Nov 10 22:18:00 2014 -0800 [SPARK-4330][Doc] Link to proper URL for YARN overview running-on-yarn.md contains a link to the YARN overview, but the URL points to the YARN alpha docs; it should point to the stable docs. Author: Kousuke Saruta Closes #3196 from sarutak/SPARK-4330 and squashes the following commits: 30baa21 [Kousuke Saruta] Fixed running-on-yarn.md to point to the proper URL for YARN (cherry picked from commit 3c07b8f08240bafcdff5d174989fb433f4bc80b6) Signed-off-by: Matei Zaharia commit e725cab66441a5de4f32630c865d0fcb25f8aed2 Author: Ankur Dave Date: Mon Nov 10 19:31:52 2014 -0800 [SPARK-3649] Remove GraphX custom serializers As [reported][1] on the mailing list, GraphX throws ``` java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) ``` when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle [violates this assumption][2]. GraphX uses the custom serializers to compress vertex ID keys using variable-length integer encoding. However, since the serializer can no longer rely on the key and value being serialized and deserialized together, performing such encoding would either require writing a tag byte (costly) or maintaining state in the serializer and assuming that serialization calls will alternate between key and value (fragile). Instead, this PR simply removes the custom serializers. This causes a **10% slowdown** (494 s to 543 s) and **16% increase in per-iteration communication** (2176 MB to 2518 MB) for PageRank (averages across 3 trials, 10 iterations per trial, uk-2007-05 graph, 16 r3.2xlarge nodes).
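A hedged sketch of the assumption the removed serializers made (a hypothetical shape, not the GraphX source): writeObject expected a whole (key, value) pair, so the spill path, which writes keys and values separately, produced the ClassCastException quoted above.
```
class PairOnlySerializer {
  // Assumes obj is always the full (vertexId, message) pair; when sort-based
  // shuffle spills and passes just the Long key, this cast throws
  // ClassCastException, as in the reported stack trace.
  def writeObject(obj: Any): Unit = {
    val (vid, msg) = obj.asInstanceOf[(Long, Any)]
    // ... variable-length encode vid, then write msg ...
  }
}
```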
[1]: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501 [2]: https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329 Author: Ankur Dave Closes #2503 from ankurdave/SPARK-3649 and squashes the following commits: a49c2ad [Ankur Dave] [SPARK-3649] Remove GraphX custom serializers (cherry picked from commit 300887bd76c5018bfe396c5d47443be251368359) Signed-off-by: Reynold Xin commit 50c02d68a7fbc9e91c01fea4997846f46f7ea910 Author: Cheng Hao Date: Mon Nov 10 17:46:05 2014 -0800 [SPARK-4274] [SQL] Fix NPE in printing the details of the query plan Author: Cheng Hao Closes #3139 from chenghao-intel/comparison_test and squashes the following commits: f5d7146 [Cheng Hao] avoid exception in printing the codegen enabled (cherry picked from commit c764d0ac1c6410ca2dd2558cb6bcbe8ad5f02481) Signed-off-by: Michael Armbrust commit 07ba50f7eff3db68f120d979a5f0ca37cb2a886e Author: surq Date: Mon Nov 10 17:37:16 2014 -0800 [SPARK-3954][Streaming] Optimization to FileInputDStream when converting files to RDDs The Spark source iterates over the files sequence in three loops: 1. files.map(...) 2. files.zip(fileRDDs) 3. files-size.foreach. This is very time-consuming when there are lots of files, so this change collapses the three loops over the files sequence into one. Author: surq Closes #2811 from surq/SPARK-3954 and squashes the following commits: 321bbe8 [surq] updated the code style. The style from [for...yield] to [files.map(file=>{})] 88a2c20 [surq] Merge branch 'master' of https://github.com/apache/spark into SPARK-3954 178066f [surq] modify code's style. [Exceeds 100 columns] 626ef97 [surq] remove redundant import(ArrayBuffer) 739341f [surq] improve the speed of converting files to RDDs (cherry picked from commit ce6ed2abd14de26b9ceaa415e9a42fbb1338f5fa) Signed-off-by: Tathagata Das commit f0eb0a79cc68c0f254ddf1a1bba672321c84d341 Author: Daoyuan Wang Date: Mon Nov 10 17:26:03 2014 -0800 [SPARK-4149][SQL] ISO 8601 support for json date time strings This implements the feature davies mentioned in https://github.com/apache/spark/pull/2901#discussion-diff-19313312 Author: Daoyuan Wang Closes #3012 from adrian-wang/iso8601 and squashes the following commits: 50df6e7 [Daoyuan Wang] json data timestamp ISO8601 support (cherry picked from commit a1fc059b69c9ed150bf8a284404cc149ddaa27d6) Signed-off-by: Michael Armbrust commit ff071e35173224546a879685c2febdd9ea0ab630 Author: Cheng Hao Date: Mon Nov 10 17:22:57 2014 -0800 [SPARK-4250] [SQL] Fix bug of constant null value mapping to ConstantObjectInspector Author: Cheng Hao Closes #3114 from chenghao-intel/constant_null_oi and squashes the following commits: e603bda [Cheng Hao] fix the bug of null value for primitive types 50a13ba [Cheng Hao] fix the timezone issue f54f369 [Cheng Hao] fix bug of constant null value for ObjectInspector (cherry picked from commit fa777833b52b6f339cdc335e8e3935cfe9a2a7eb) Signed-off-by: Michael Armbrust commit 1ed1c68c0aa8f4da517cd4ac5c4ab117d2cee839 Author: Xiangrui Meng Date: Mon Nov 10 17:20:52 2014 -0800 [SQL] remove a decimal case branch that has no effect at runtime it generates warnings at compile time marmbrus Author: Xiangrui Meng Closes #3192 from mengxr/dtc-decimal and squashes the following commits: 955e9fb [Xiangrui Meng] remove a decimal case branch that has no effect (cherry picked from commit
d793d80c8084923ea04dcf7d268eec8ede490127) Signed-off-by: Michael Armbrust commit 0089a4f64d90f923dc02aee45bcda4be726d740a Author: Takuya UESHIN Date: Mon Nov 10 15:55:15 2014 -0800 [SPARK-4319][SQL] Enable an ignored test "null count". Author: Takuya UESHIN Closes #3185 from ueshin/issues/SPARK-4319 and squashes the following commits: a44a38e [Takuya UESHIN] Enable an ignored test "null count". (cherry picked from commit dbf10588de03e8ea993fff687a78727eff55db1f) Signed-off-by: Michael Armbrust commit 19dcb5714ba326c272981e6e7e547ff7990648b9 Author: Varadharajan Mukundan Date: Mon Nov 10 14:32:29 2014 -0800 [SPARK-4047] - Generate runtime warnings for example implementation of PageRank Based on SPARK-2434, this PR generates runtime warnings for example implementations (Python, Scala) of PageRank. Author: Varadharajan Mukundan Closes #2894 from varadharajan/SPARK-4047 and squashes the following commits: 5f9406b [Varadharajan Mukundan] [SPARK-4047] - Point users to LogisticRegressionWithSGD and LogisticRegressionWithLBFGS instead of LogisticRegressionModel 252f595 [Varadharajan Mukundan] a. Generate runtime warnings for 05a018b [Varadharajan Mukundan] Fix PageRank implementation's package reference 5c2bf54 [Varadharajan Mukundan] [SPARK-4047] - Generate runtime warnings for example implementation of PageRank (cherry picked from commit 974d334cf06a84317234a6c8e2e9ecca8271fa41) Signed-off-by: Xiangrui Meng commit dd1b2a0a92979562c0fccf3065587ba9a9fd9cc0 Author: tedyu Date: Mon Nov 10 13:23:33 2014 -0800 SPARK-1297 Upgrade HBase dependency to 0.98 pwendell rxin Please take a look Author: tedyu Closes #3115 from tedyu/master and squashes the following commits: 2b079c8 [tedyu] SPARK-1297 Upgrade HBase dependency to 0.98 (cherry picked from commit b32734e12d5197bad26c080e529edd875604c6fb) Signed-off-by: Patrick Wendell commit 04a79b616686380e63385259f6fd9e0c1dfa235f Author: Sandy Ryza Date: Mon Nov 10 12:40:41 2014 -0800 SPARK-4230. Doc for spark.default.parallelism is incorrect Author: Sandy Ryza Closes #3107 from sryza/sandy-spark-4230 and squashes the following commits: 37a1d19 [Sandy Ryza] Clear up a couple things 34d53de [Sandy Ryza] SPARK-4230. Doc for spark.default.parallelism is incorrect (cherry picked from commit c6f4e704214097f17d2d6abfbfef4bb208e4339f) Signed-off-by: Patrick Wendell commit 7917f27a3944e1abb9c85e17dba14adc35ef1ff9 Author: Jey Kottalam Date: Mon Nov 10 12:37:56 2014 -0800 [SPARK-4312] bash doesn't have "die" sbt-launch-lib.bash includes a `die` command, but it's not a valid command on Linux, Mac OS X, or Windows. Closes #2898 Author: Jey Kottalam Closes #3182 from sarutak/SPARK-4312 and squashes the following commits: 24c6677 [Jey Kottalam] bash doesn't have "die" (cherry picked from commit c5db8e2c07e442654f3d368608108e714e080184) Signed-off-by: Patrick Wendell commit ca3fe8c127d1153fd575c44b950f7620e5db8737 Author: Sean Owen Date: Mon Nov 10 11:47:27 2014 -0800 SPARK-2548 [STREAMING] JavaRecoverableWordCount is missing Here's my attempt to re-port `RecoverableNetworkWordCount` to Java, following the example of its Scala and Java siblings. I fixed a few minor doc/formatting issues along the way, I believe.
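For readers unfamiliar with the pattern being ported in SPARK-2548, here is a hedged Scala sketch of the checkpoint-recovery flow the example demonstrates (paths, host, port, and batch interval are placeholders):
```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(checkpointDir: String): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoverableNetworkWordCount")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  val counts = ssc.socketTextStream("localhost", 9999)
    .flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
  counts.print()
  ssc
}

// On restart, getOrCreate rebuilds the context from the checkpoint if one exists.
val ssc = StreamingContext.getOrCreate("/tmp/cp", () => createContext("/tmp/cp"))
ssc.start()
ssc.awaitTermination()
```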
Author: Sean Owen Closes #2564 from srowen/SPARK-2548 and squashes the following commits: 0d0bf29 [Sean Owen] Update checkpoint call as in https://github.com/apache/spark/pull/2735 35f23e3 [Sean Owen] Remove old comment about running in standalone mode 179b3c2 [Sean Owen] Re-port RecoverableNetworkWordCount to Java example, and touch up doc / formatting in related examples (cherry picked from commit 3a02d416cd82a7a942fd6ff4a0e05ff070eb218a) Signed-off-by: Tathagata Das commit 69dd2997fb84375fc57a597e3ac43e717b40011c Author: Niklas Wilcke <1wilcke@informatik.uni-hamburg.de> Date: Mon Nov 10 11:37:38 2014 -0800 [SPARK-4169] [Core] Accommodate non-English Locales in unit tests For me the core tests failed because there are two locale-dependent parts in the code. Look at the Jira ticket for details. Why is it necessary to check the exception message in isBindCollision in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1686 ? Author: Niklas Wilcke <1wilcke@informatik.uni-hamburg.de> Closes #3036 from numbnut/core-test-fix and squashes the following commits: 1fb0d04 [Niklas Wilcke] Fixing locale-dependent code and tests (cherry picked from commit ed8bf1eac548577c4bbad7ce3f7f301a2f52ef17) Signed-off-by: Andrew Or commit 9b781c8160b369931db78b10bc0ada272edb0af8 Author: Xiangrui Meng Date: Mon Nov 10 11:04:12 2014 -0800 [SQL] support udt to hive types conversion (hive->udt is not supported) marmbrus Author: Xiangrui Meng Closes #3164 from mengxr/hive-udt and squashes the following commits: 57c7519 [Xiangrui Meng] support udt->hive types (hive->udt is not supported) (cherry picked from commit 894a7245c379b2e823ae7d81cc9228e60ba47c78) Signed-off-by: Michael Armbrust commit c127ff8c87fc4f3aa6f09697928832dc6d37cc0f Author: RongGu Date: Sun Nov 9 23:48:15 2014 -0800 [SPARK-2703][Core]Make Tachyon-related unit tests execute without deploying a Tachyon system locally. Make Tachyon-related unit tests execute without deploying a Tachyon system locally. Author: RongGu Closes #3030 from RongGu/SPARK-2703 and squashes the following commits: ad08827 [RongGu] Make Tachyon related unit tests execute without deploying a Tachyon system locally (cherry picked from commit bd86cb1738800a0aa4c88b9afdba2f97ac6cbf25) Signed-off-by: Patrick Wendell commit fb36cf9ea8d55dfe3119f6f5d8bd3e98ce68ce21 Author: Sandy Ryza Date: Sun Nov 9 22:29:03 2014 -0800 SPARK-3179. Add task OutputMetrics. Author: Sandy Ryza This patch had conflicts when merged, resolved by Committer: Kay Ousterhout Closes #2968 from sryza/sandy-spark-3179 and squashes the following commits: dce4784 [Sandy Ryza] More review feedback 8d350d1 [Sandy Ryza] Fix test against Hadoop 2.5+ e7c74d0 [Sandy Ryza] More review feedback 6cff9c4 [Sandy Ryza] Review feedback fb2dde0 [Sandy Ryza] SPARK-3179 (cherry picked from commit 3c2cff4b9464f8d7535564fcd194631a8e5bb0a5) Signed-off-by: Kay Ousterhout commit 42d19aec13a290984def7287411262c434cb6a69 Author: Sean Owen Date: Sun Nov 9 22:11:20 2014 -0800 SPARK-1209 [CORE] (Take 2) SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop andrewor14 Another try at SPARK-1209, to address https://github.com/apache/spark/pull/2814#issuecomment-61197619 I successfully tested with `mvn -Dhadoop.version=1.0.4 -DskipTests clean package; mvn -Dhadoop.version=1.0.4 test` I assume that is what failed Jenkins last time. I also tried `-Dhadoop.version=1.2.1` and `-Phadoop-2.4 -Pyarn -Phive` for more coverage.
So this is why the class was put in `org.apache.hadoop` to begin with, I assume. One option is to leave this as-is for now and move it only when Hadoop 1.0.x support goes away. This is the other option, which adds a call to force the constructor to be public at run-time. It's probably less surprising than putting Spark code in `org.apache.hadoop`, but it does involve reflection. A `SecurityManager` might forbid this, but it would forbid a lot of stuff Spark does. This would also only affect Hadoop 1.0.x, it seems. Author: Sean Owen Closes #3048 from srowen/SPARK-1209 and squashes the following commits: 0d48f4b [Sean Owen] For Hadoop 1.0.x, make certain constructors public, which were public in later versions 466e179 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though? eb61820 [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark (cherry picked from commit f8e5732307dcb1482d9bcf1162a1090ef9a7b913) Signed-off-by: Patrick Wendell commit a9debe8fe19fc980d860a41d77f53ac21fb49d0c Author: Sean Owen Date: Sun Nov 9 17:42:08 2014 -0800 SPARK-1344 [DOCS] Scala API docs for top methods Use "k" in javadoc of top and takeOrdered to avoid confusion with type K in pair RDDs. I think this resolves the discussion in SPARK-1344. Author: Sean Owen Closes #3168 from srowen/SPARK-1344 and squashes the following commits: 6963fcc [Sean Owen] Use "k" in javadoc of top and takeOrdered to avoid confusion with type K in pair RDDs (cherry picked from commit d1362659ef5d62db2c9ff0d2a24639abcef4e118) Signed-off-by: Patrick Wendell commit 6824af0c3a29aa2d11606495c4a95915233ba96e Author: Sean Owen Date: Sun Nov 9 17:40:48 2014 -0800 SPARK-971 [DOCS] Link to Confluence wiki from project website / documentation This is a trivial change to add links to the wiki from `README.md` and the main docs page. It is already linked to from spark.apache.org. Author: Sean Owen Closes #3169 from srowen/SPARK-971 and squashes the following commits: dcb84d0 [Sean Owen] Add link to wiki from README, docs home page (cherry picked from commit 8c99a47a4f0369ff3c1ecaeb860fa61ee789e987) Signed-off-by: Patrick Wendell commit 21b9ac062f9b9c4db7596195f8b3731596a16c9f Author: Josh Rosen Date: Sat Nov 8 18:10:23 2014 -0800 [SPARK-4301] StreamingContext should not allow start() to be called after calling stop() In Spark 1.0.0+, calling `stop()` on a StreamingContext that has not been started is a no-op which has no side-effects. This allows users to call `stop()` on a fresh StreamingContext followed by `start()`. I believe that this almost always indicates an error and is not behavior that we should support. Since we don't allow `start() stop() start()`, I don't think it makes sense to allow `stop() start()`. The current behavior can lead to resource leaks when StreamingContext constructs its own SparkContext: if I call `stop(stopSparkContext=True)`, then I expect StreamingContext's underlying SparkContext to be stopped irrespective of whether the StreamingContext has been started. This is useful when writing unit test fixtures. Prior discussions: - https://github.com/apache/spark/pull/3053#discussion-diff-19710333R490 - https://github.com/apache/spark/pull/3121#issuecomment-61927353 Author: Josh Rosen Closes #3160 from JoshRosen/SPARK-4301 and squashes the following commits: dbcc929 [Josh Rosen] Address more review comments bdbe5da [Josh Rosen] Stop SparkContext after stopping scheduler, not before.
03e9c40 [Josh Rosen] Always stop SparkContext, even if stop(false) has already been called. 832a7f4 [Josh Rosen] Address review comment 5142517 [Josh Rosen] Add tests; improve Scaladoc. 813e471 [Josh Rosen] Revert workaround added in https://github.com/apache/spark/pull/3053/files#diff-e144dbee130ed84f9465853ddce65f8eR49 5558e70 [Josh Rosen] StreamingContext.stop() should stop SparkContext even if StreamingContext has not been started yet. (cherry picked from commit 7b41b17f3296eea3282efbdceb6b28baf128287d) Signed-off-by: Tathagata Das commit 05bffcc023989fb09281e59cbc094f6990527c51 Author: Aaron Davidson Date: Sat Nov 8 13:03:51 2014 -0800 [Minor] [Core] Don't NPE on closeQuietly(null) Author: Aaron Davidson Closes #3166 from aarondav/closeQuietlyer and squashes the following commits: 78096b5 [Aaron Davidson] Don't NPE on closeQuietly(null) (cherry picked from commit 4af5c7e24455246c61c1f3c22225507e720d721d) Signed-off-by: Reynold Xin commit fc51de3395f25983052ae9d3c5c17891f6e6b8a7 Author: Andrew Or Date: Fri Nov 7 23:16:13 2014 -0800 [SPARK-4291][Build] Rename network module projects The names of the recently introduced network modules are inconsistent with those of the other modules in the project. We should just drop the "Code" suffix since it doesn't sacrifice any meaning, especially before they get into an official release. ``` [INFO] Reactor Build Order: [INFO] [INFO] Spark Project Parent POM [INFO] Spark Project Common Network Code [INFO] Spark Project Shuffle Streaming Service Code [INFO] Spark Project Core [INFO] Spark Project Bagel [INFO] Spark Project GraphX [INFO] Spark Project Streaming [INFO] Spark Project Catalyst [INFO] Spark Project SQL [INFO] Spark Project ML Library [INFO] Spark Project Tools [INFO] Spark Project Hive [INFO] Spark Project REPL [INFO] Spark Project YARN Parent POM [INFO] Spark Project YARN Stable API [INFO] Spark Project Assembly [INFO] Spark Project External Twitter [INFO] Spark Project External Kafka [INFO] Spark Project External Flume Sink [INFO] Spark Project External Flume [INFO] Spark Project External ZeroMQ [INFO] Spark Project External MQTT [INFO] Spark Project Examples [INFO] Spark Project Yarn Shuffle Service Code ``` Author: Andrew Or Closes #3148 from andrewor14/build-drop-code and squashes the following commits: eac839b [Andrew Or] Network -> Networking d01ad47 [Andrew Or] Rename network module project names (cherry picked from commit 7afc8564f33eb2868f458f85046f59a51b516ed6) Signed-off-by: Patrick Wendell commit 427d7911f527e00e75dec0498b4bbdbe164db7ca Author: Michelangelo D'Agostino Date: Fri Nov 7 22:53:01 2014 -0800 [MLLIB] [PYTHON] SPARK-4221: Expose nonnegative ALS in the python API SPARK-1553 added alternating nonnegative least squares to MLLib, however it's not possible to access it via the python API. This pull request resolves that. Author: Michelangelo D'Agostino Closes #3095 from mdagost/python_nmf and squashes the following commits: a6743ad [Michelangelo D'Agostino] Use setters instead of static methods in PythonMLLibAPI. Remove the new static methods I added. Set seed in tests. Change ratings to ratingsRDD in both train and trainImplicit for consistency. 7cffd39 [Michelangelo D'Agostino] Swapped nonnegative and seed in a few more places. 3fdc851 [Michelangelo D'Agostino] Moved seed to the end of the python parameter list. bdcc154 [Michelangelo D'Agostino] Change seed type to java.lang.Long so that it can handle null. 
cedf043 [Michelangelo D'Agostino] Added in ability to set the seed from python and made that play nice with the nonnegative changes. Also made the python ALS tests more exact. a72fdc9 [Michelangelo D'Agostino] Expose nonnegative ALS in the python API. (cherry picked from commit 7e9d975676d56ace0e84c2200137e4cd4eba074a) Signed-off-by: Xiangrui Meng commit 3b07c483aa98965ac9dc8fdcc40e593e4edb97fd Author: Davies Liu Date: Fri Nov 7 20:53:03 2014 -0800 [SPARK-4304] [PySpark] Fix sort on empty RDD This PR fixes sortBy()/sortByKey() on an empty RDD. This should be back-ported into 1.1/1.2 Author: Davies Liu Closes #3162 from davies/fix_sort and squashes the following commits: 84f64b7 [Davies Liu] add tests 52995b5 [Davies Liu] fix sortByKey() on empty RDD (cherry picked from commit 7779109796c90d789464ab0be35917f963bbe867) Signed-off-by: Josh Rosen commit 8cefb63c122e7c7cf4af959f9606f4491148d9f4 Author: xiao321 <1042460381@qq.com> Date: Fri Nov 7 12:56:49 2014 -0800 Update JavaCustomReceiver.java Fix an array index out of bounds Author: xiao321 <1042460381@qq.com> Closes #3153 from xiao321/patch-1 and squashes the following commits: 0ed17b5 [xiao321] Update JavaCustomReceiver.java (cherry picked from commit 7c9ec529a3483fab48f728481dd1d3663369e50a) Signed-off-by: Tathagata Das commit 47bd8f3020149a009f605e8390c2c28f3f835191 Author: wangfei Date: Fri Nov 7 12:55:11 2014 -0800 [SPARK-4292][SQL] Result set iterator bug in JDBC/ODBC Running select * from src gets the wrong result set, as follows: ``` ... | 309 | val_309 | | 309 | val_309 | | 309 | val_309 | | 309 | val_309 | | 309 | val_309 | | 309 | val_309 | | 309 | val_309 | | 309 | val_309 | | 309 | val_309 | | 309 | val_309 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | ... ``` Author: wangfei Closes #3149 from scwf/SPARK-4292 and squashes the following commits: 1574a43 [wangfei] using result.collect 8b2d845 [wangfei] adding test f64eddf [wangfei] result set iter bug (cherry picked from commit d6e55524437026c0c76addeba8f99249a8316716) Signed-off-by: Michael Armbrust commit c96da3676c32579d0f97347d35d95353b1d2ef07 Author: Matthew Taylor Date: Fri Nov 7 12:53:08 2014 -0800 [SPARK-4203][SQL] Partition directories in random order when inserting into hive table When doing an insert into a hive table with partitions, the folders written to the file system are in a random order instead of the order defined in table creation. It seems that the loadPartition method in Hive.java has a Map parameter but expects to be called with a map that has a defined ordering, such as LinkedHashMap. Working on a test but having IntelliJ problems Author: Matthew Taylor Closes #3076 from tbfenet/partition_dir_order_problem and squashes the following commits: f1b9a52 [Matthew Taylor] Comment format fix bca709f [Matthew Taylor] review changes 0e50f6b [Matthew Taylor] test fix 99f1a31 [Matthew Taylor] partition ordering fix 369e618 [Matthew Taylor] partition ordering fix (cherry picked from commit ac70c972a51952f801fd02dd5962c0a0c1aba8f8) Signed-off-by: Michael Armbrust commit 684d1f0ecd77d639557b4ca3c26ced950c9ab9fc Author: Takuya UESHIN Date: Fri Nov 7 12:30:47 2014 -0800 [SPARK-4270][SQL] Fix Cast from DateType to DecimalType. `Cast` from `DateType` to `DecimalType` throws `NullPointerException`. Author: Takuya UESHIN Closes #3134 from ueshin/issues/SPARK-4270 and squashes the following commits: 7394e4b [Takuya UESHIN] Fix Cast from DateType to DecimalType.
(cherry picked from commit a6405c5ddcda112f8efd7d50d8e5f44f78a0fa41) Signed-off-by: Michael Armbrust commit ff1a0825637690b3fce780d4dcaad68dce382fb9 Author: Cheng Hao Date: Fri Nov 7 12:15:53 2014 -0800 [SPARK-4272] [SQL] Add more unwrapper functions for primitive type in TableReader Currently, the data "unwrap" only supports a couple of primitive types, not all of them; this will not cause an exception, but it costs some performance in table scanning for types like binary, date, timestamp, decimal, etc. Author: Cheng Hao Closes #3136 from chenghao-intel/table_reader and squashes the following commits: fffb729 [Cheng Hao] fix bug for retrieving the timestamp object e9c97a4 [Cheng Hao] Add more unwrapper functions for primitive type in TableReader (cherry picked from commit 60ab80f501b8384ddf48a9ac0ba0c2b9eb548b28) Signed-off-by: Michael Armbrust commit d530c3952131b29fd4d7a3e54496bfe634517af1 Author: Kousuke Saruta Date: Fri Nov 7 11:56:40 2014 -0800 [SPARK-4213][SQL] ParquetFilters - No support for LT, LTE, GT, GTE operators The following description is quoted from JIRA: When I issue a hql query against a HiveContext where my predicate uses a column of string type with one of LT, LTE, GT, or GTE operator, I get the following error: scala.MatchError: StringType (of class org.apache.spark.sql.catalyst.types.StringType$) Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType is absent from the corresponding functions for creating these filters. To reproduce, in a Hive 0.13.1 shell, I created the following table (at a specified DB): create table sparkbug ( id int, event string ) stored as parquet; Insert some sample data: insert into table sparkbug select 1, '2011-06-18' from limit 1; insert into table sparkbug select 2, '2012-01-01' from limit 1; Launch a spark shell and create a HiveContext to the metastore where the table above is located. import org.apache.spark.sql._ import org.apache.spark.sql.SQLContext import org.apache.spark.sql.hive.HiveContext val hc = new HiveContext(sc) hc.setConf("spark.sql.shuffle.partitions", "10") hc.setConf("spark.sql.hive.convertMetastoreParquet", "true") hc.setConf("spark.sql.parquet.compression.codec", "snappy") import hc._ hc.hql("select * from .sparkbug where event >= '2011-12-01'") A scala.MatchError will appear in the output.
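The failure mode is an ordinary non-exhaustive pattern match; a hedged, self-contained sketch with stand-in types (not the real ParquetFilters code):
```
sealed trait DataType
case object IntegerType extends DataType
case object StringType  extends DataType

// Before the fix, the filter-construction match had no StringType branch,
// so string predicates like event >= '2011-12-01' threw scala.MatchError.
def comparatorFor(dt: DataType): String = dt match {
  case IntegerType => "compare as int"
  case StringType  => "compare as UTF-8 binary" // the previously missing branch
}
```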
Author: Kousuke Saruta Closes #3083 from sarutak/SPARK-4213 and squashes the following commits: 4ab6e56 [Kousuke Saruta] WIP b6890c6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4213 9a1fae7 [Kousuke Saruta] Fixed ParquetFilters so that it can compare Strings (cherry picked from commit 14c54f1876fcf91b5c10e80be2df5421c7328557) Signed-off-by: Michael Armbrust commit 51ef8ab8eca15addc476f47e04ecc578e6e9682c Author: Jacky Li Date: Fri Nov 7 11:52:08 2014 -0800 [SQL] Modify keyword val location according to ordering 'DOUBLE' should be moved before 'ELSE' according to the ordering convention Author: Jacky Li Closes #3080 from jackylk/patch-5 and squashes the following commits: 3c11df7 [Jacky Li] [SQL] Modify keyword val location according to ordering (cherry picked from commit 68609c51ad1ab2def302df3c4a1c0bc1ec6e1075) Signed-off-by: Michael Armbrust commit f1f1ae418031957256e7dac896e29d64c81bf1a4 Author: Michael Armbrust Date: Fri Nov 7 11:51:20 2014 -0800 [SQL] Support ScalaReflection of schema in different universes Author: Michael Armbrust Closes #3096 from marmbrus/reflectionContext and squashes the following commits: adc221f [Michael Armbrust] Support ScalaReflection of schema in different universes (cherry picked from commit 8154ed7df6c5407e638f465d3bd86b43f36216ef) Signed-off-by: Michael Armbrust commit 2cd8e3e2b00c6191bccfb70743df7a4c9ffd98b2 Author: Cheng Lian Date: Fri Nov 7 11:45:25 2014 -0800 [SPARK-4225][SQL] Resorts to SparkContext.version to inspect Spark version This PR resorts to `SparkContext.version` rather than META-INF/MANIFEST.MF in the assembly jar to inspect Spark version. Currently, when built with Maven, the MANIFEST.MF file in the assembly jar is incorrectly replaced by Guava 15.0 MANIFEST.MF, probably because of the assembly/shading tricks. Another related PR is #3103, which tries to fix the MANIFEST issue. Author: Cheng Lian Closes #3105 from liancheng/spark-4225 and squashes the following commits: d9585e1 [Cheng Lian] Resorts to SparkContext.version to inspect Spark version (cherry picked from commit 86e9eaa3f0ec23cb38bce67585adb2d5f484f4ee) Signed-off-by: Michael Armbrust commit e5b8cea7ef219be33df1db77a0921885833a4254 Author: wangfei Date: Fri Nov 7 11:43:35 2014 -0800 [SQL][DOC][Minor] Spark SQL Hive now supports dynamic partitioning Author: wangfei Closes #3127 from scwf/patch-9 and squashes the following commits: e39a560 [wangfei] now supports dynamic partitioning (cherry picked from commit 636d7bcc96b912f5b5caa91110cd55b55fa38ad8) Signed-off-by: Michael Armbrust commit d6262fa05b9b7ffde00e6659810a3436e53df6b8 Author: Aaron Davidson Date: Fri Nov 7 09:42:21 2014 -0800 [SPARK-4187] [Core] Switch to binary protocol for external shuffle service messages This PR eliminates the network package's usage of the Java serializer and replaces it with Encodable, which is a lightweight binary protocol. Each message is preceded by a type id, which will allow us to change messages (by only adding new ones), or to change the format entirely by switching to a special id (such as -1). This protocol has the advantage over Java that we can guarantee that messages will remain compatible across compiled versions and JVMs, though it does not provide a clean way to do schema migration. In the future, it may be good to use a more heavy-weight serialization format like protobuf, thrift, or avro, but these all add several dependencies which are unnecessary at the present time.
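A hedged sketch of the type-id framing just described (hypothetical names, not the actual network/shuffle classes):
```
import java.nio.ByteBuffer

trait Encodable {
  def typeId: Byte          // one-byte tag written before the body
  def encodedLength: Int
  def encode(buf: ByteBuffer): Unit

  def toByteBuffer: ByteBuffer = {
    val buf = ByteBuffer.allocate(1 + encodedLength) // room for tag + body
    buf.put(typeId)
    encode(buf)
    buf.flip()
    buf
  }
}
// A decoder reads the first byte and dispatches on it, so new message types
// can be added without breaking existing ones.
```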
Additionally this unifies the RPC messages of NettyBlockTransferService and ExternalShuffleClient. Author: Aaron Davidson Closes #3146 from aarondav/free and squashes the following commits: ed1102a [Aaron Davidson] Remove some unused imports b8e2a49 [Aaron Davidson] Add appId to test 538f2a3 [Aaron Davidson] [SPARK-4187] [Core] Switch to binary protocol for external shuffle service messages (cherry picked from commit d4fa04e50d299e9cad349b3781772956453a696b) Signed-off-by: Reynold Xin commit 7f86c350c946ac0c44e5e70acc8b7e51bace90a4 Author: zsxwing Date: Thu Nov 6 21:52:12 2014 -0800 [SPARK-4204][Core][WebUI] Change Utils.exceptionString to contain the inner exceptions and make the error information in Web UI more friendly This PR fixed `Utils.exceptionString` to output the full exception information. However, the stack trace may become very huge, so I also updated the Web UI to collapse the error information by default (display the first line and clicking `+detail` will display the full info). Here are the screenshots: Stages: ![stages](https://cloud.githubusercontent.com/assets/1000778/4882441/66d8cc68-6356-11e4-8346-6318677d9470.png) Details for one stage: ![stage](https://cloud.githubusercontent.com/assets/1000778/4882513/1311043c-6357-11e4-8804-ca14240a9145.png) The full information in the gray text field is: ```Java org.apache.spark.shuffle.FetchFailedException: Connection reset by peer at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:189) at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198) at sun.nio.ch.IOUtil.read(IOUtil.java:166) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ... 1 more ``` /cc aarondav Author: zsxwing Closes #3073 from zsxwing/SPARK-4204 and squashes the following commits: 176d1e3 [zsxwing] Add comments to explain the stack trace difference ca509d3 [zsxwing] Add fullStackTrace to the constructor of ExceptionFailure a07057b [zsxwing] Core style fix dfb0032 [zsxwing] Backward compatibility for old history server 1e50f71 [zsxwing] Update as per review and increase the max height of the stack trace details 94f2566 [zsxwing] Change Utils.exceptionString to contain the inner exceptions and make the error information in Web UI more friendly (cherry picked from commit 3abdb1b24aa48f21e7eed1232c01d3933873688c) Signed-off-by: Andrew Or commit f92e6d74910b41c5dc43285cb122b908a97e82c6 Author: Aaron Davidson Date: Thu Nov 6 19:54:32 2014 -0800 [SPARK-4236] Cleanup removed applications' files in shuffle service This relies on a hook from whoever is hosting the shuffle service to invoke removeApplication() when the application is completed. Once invoked, we will clean up all the executors' shuffle directories we know about. Author: Aaron Davidson Closes #3126 from aarondav/cleanup and squashes the following commits: 33a64a9 [Aaron Davidson] Missing brace e6e428f [Aaron Davidson] Address comments 16a0d27 [Aaron Davidson] Cleanup e4df3e7 [Aaron Davidson] [SPARK-4236] Cleanup removed applications' files in shuffle service (cherry picked from commit 48a19a6dba896f7d0b637f84e114b7efbb814e51) Signed-off-by: Andrew Or commit c1ea5c542f3267c0b23a7775887e3a6ece793fe3 Author: Aaron Davidson Date: Thu Nov 6 18:39:14 2014 -0800 [SPARK-4188] [Core] Perform network-level retry of shuffle file fetches This adds a RetryingBlockFetcher to the NettyBlockTransferService which is wrapped around our typical OneForOneBlockFetcher, adding retry logic in the event of an IOException. This sort of retry allows us to avoid marking an entire executor as failed due to garbage collection or high network load. 
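A hedged sketch of the retry idea (a hypothetical synchronous helper; the real RetryingBlockFetcher is asynchronous and more involved):
```
import java.io.IOException

// Retry a fetch on IOException instead of failing the executor outright.
def fetchWithRetry[T](maxRetries: Int, waitMs: Long)(fetch: => T): T = {
  var attempt = 0
  while (true) {
    try {
      return fetch
    } catch {
      case _: IOException if attempt < maxRetries =>
        attempt += 1
        Thread.sleep(waitMs) // back off before retrying the transient failure
    }
  }
  throw new IllegalStateException("unreachable")
}
```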
TODO: - [x] unit tests - [x] put in ExternalShuffleClient too Author: Aaron Davidson Closes #3101 from aarondav/retry and squashes the following commits: 72a2a32 [Aaron Davidson] Add that we should remove the condition around the retry thingy c7fd107 [Aaron Davidson] Fix unit tests e80e4c2 [Aaron Davidson] Address initial comments 6f594cd [Aaron Davidson] Fix unit test 05ff43c [Aaron Davidson] Add to external shuffle client and add unit test 66e5a24 [Aaron Davidson] [SPARK-4238] [Core] Perform network-level retry of shuffle file fetches (cherry picked from commit f165b2bbf5d4acf34d826fa55b900f5bbc295654) Signed-off-by: Reynold Xin commit cbe9a6c8a822beaea5a79e4155759c39d078ea2c Author: Aaron Davidson Date: Thu Nov 6 17:20:46 2014 -0800 [SPARK-4277] Support external shuffle service on Standalone Worker Author: Aaron Davidson Closes #3142 from aarondav/worker and squashes the following commits: 3780bd7 [Aaron Davidson] Address comments 2dcdfc1 [Aaron Davidson] Add private[worker] 47f49d3 [Aaron Davidson] NettyBlockTransferService shouldn't care about app ids (it's only b/t executors) 258417c [Aaron Davidson] [SPARK-4277] Support external shuffle service on executor (cherry picked from commit 6e9ef10fd7446a11f37446c961916ba2a8e02cb8) Signed-off-by: Andrew Or commit 6508953a4b8622312c1f0ae4b4b4275b5a2c2bd6 Author: Andrew Or Date: Thu Nov 6 17:18:49 2014 -0800 [SPARK-3797] Minor addendum to Yarn shuffle service I did not realize there was a `network.util.JavaUtils` when I wrote this code. This PR moves the `ByteBuffer` string conversion to the appropriate place. I tested the changes on a stable yarn cluster. Author: Andrew Or Closes #3144 from andrewor14/yarn-shuffle-util and squashes the following commits: b6c08bf [Andrew Or] Remove unused import 94e205c [Andrew Or] Use netty Unpooled 85202a5 [Andrew Or] Use guava Charsets 057135b [Andrew Or] Reword comment adf186d [Andrew Or] Move byte buffer String conversion logic to JavaUtils (cherry picked from commit 96136f222abd4f3abd10cb78a4ebecdb21f3bde7) Signed-off-by: Andrew Or commit 9ea0fac0eafd7264a30f36c0d20863700245991f Author: Andrew Or Date: Thu Nov 6 15:31:07 2014 -0800 [HOT FIX] Make distribution fails This was added by me in https://github.com/apache/spark/commit/61a5cced049a8056292ba94f23fa7bd040f50685. The real fix will be added in [SPARK-4281](https://issues.apache.org/jira/browse/SPARK-4281). Author: Andrew Or Closes #3145 from andrewor14/fix-make-distribution and squashes the following commits: c78be61 [Andrew Or] Hot fix make distribution (cherry picked from commit 470881b24a503c9edcaed159c29bafa446ab0e9a) Signed-off-by: Andrew Or commit 9061bc4e127abb0c44e37f1b8b7706883d451bc7 Author: lianhuiwang Date: Thu Nov 6 10:46:45 2014 -0800 [SPARK-4249][GraphX] fix a problem of EdgePartitionBuilder in GraphX At first, srcIds is not initialized and all entries are 0,
so we use edgeArray(0).srcId to initialize currSrcId. Author: lianhuiwang Closes #3138 from lianhuiwang/SPARK-4249 and squashes the following commits: 3f4e503 [lianhuiwang] fix a problem of EdgePartitionBuilder in Graphx (cherry picked from commit d15c6e9dc2860bbe56e31ddf71218ccc6d5c841d) Signed-off-by: Ankur Dave commit aaaeaf93902a1954df11fa4982b1c6c7e29f5b8d Author: Aaron Davidson Date: Thu Nov 6 10:45:46 2014 -0800 [SPARK-4264] Completion iterator should only invoke callback once Author: Aaron Davidson Closes #3128 from aarondav/compiter and squashes the following commits: 698e4be [Aaron Davidson] [SPARK-4264] Completion iterator should only invoke callback once (cherry picked from commit 23eaf0e12ff221dcca40a79e61b6cc5e7c846cb5) Signed-off-by: Aaron Davidson commit 01484455c4ee4ee8e848be56f395d38841fbf86a Author: Davies Liu Date: Thu Nov 6 00:22:19 2014 -0800 [SPARK-4186] add binaryFiles and binaryRecords in Python add binaryFiles() and binaryRecords() in Python ``` binaryFiles(self, path, minPartitions=None): :: Developer API :: Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file. Note: Small files are preferred, large file is also allowable, but may cause bad performance. binaryRecords(self, path, recordLength): Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant. :param path: Directory to the input data files :param recordLength: The length at which to split the records ``` Author: Davies Liu Closes #3078 from davies/binary and squashes the following commits: cd0bdbd [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary 3aa349b [Davies Liu] add experimental notes 24e84b6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary 5ceaa8a [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary 1900085 [Davies Liu] bugfix bb22442 [Davies Liu] add binaryFiles and binaryRecords in Python (cherry picked from commit b41a39e24038876359aeb7ce2bbbb4de2234e5f3) Signed-off-by: Matei Zaharia commit 2c84178b8283269512b1c968b9995a7bdedd7aa5 Author: Kay Ousterhout Date: Thu Nov 6 00:03:03 2014 -0800 [SPARK-4255] Fix incorrect table striping This commit stripes table rows after hiding some rows, to ensure that rows are correctly striped to alternate white and grey even when rows are hidden by default. Author: Kay Ousterhout Closes #3117 from kayousterhout/striping and squashes the following commits: be6e10a [Kay Ousterhout] [SPARK-4255] Fix incorrect table striping (cherry picked from commit 5f27ae16d5b016fae4afeb0f2ad779fd3130b390) Signed-off-by: Kay Ousterhout commit 70f6f36e03f97847cd2f3e4fe2902bb8459ca6a3 Author: Nicholas Chammas Date: Wed Nov 5 20:45:35 2014 -0800 [SPARK-4137] [EC2] Don't change working dir on user This issue was uncovered after [this discussion](https://issues.apache.org/jira/browse/SPARK-3398?focusedCommentId=14187471&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14187471). Don't change the working directory on the user. This breaks relative paths the user may pass in, e.g., for the SSH identity file. ``` ./ec2/spark-ec2 -i ../my.pem ``` This patch will preserve the user's current working directory and allow calls like the one above to work.
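A hedged Scala rendering of the general fix pattern (the actual change is in the Python spark-ec2 script): resolve user-supplied paths against the directory the tool was invoked from, captured before any directory change.
```
import java.nio.file.Paths

// Capture the invocation directory before any chdir happens.
val invocationDir = Paths.get(sys.props("user.dir"))

// Relative arguments like ../my.pem stay anchored to where the user ran the command.
def resolveUserPath(p: String) = invocationDir.resolve(p).normalize()
```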
Author: Nicholas Chammas Closes #2988 from nchammas/spark-ec2-cwd and squashes the following commits: f3850b5 [Nicholas Chammas] pep8 fix fbc20c7 [Nicholas Chammas] revert to old commenting style 752f958 [Nicholas Chammas] specify deploy.generic path absolutely bcdf6a5 [Nicholas Chammas] fix typo 77871a2 [Nicholas Chammas] add clarifying comment ce071fc [Nicholas Chammas] don't change working dir (cherry picked from commit db45f5ad0368760dbeaa618a04f66ae9b2bed656) Signed-off-by: Shivaram Venkataraman commit 7e0da9f6b423842adc9fed2db2d4a80cab541351 Author: Xiangrui Meng Date: Wed Nov 5 19:56:16 2014 -0800 [SPARK-4262][SQL] add .schemaRDD to JavaSchemaRDD marmbrus Author: Xiangrui Meng Closes #3125 from mengxr/SPARK-4262 and squashes the following commits: 307695e [Xiangrui Meng] add .schemaRDD to JavaSchemaRDD (cherry picked from commit 3d2b5bc5bb979d8b0b71e06bc0f4548376fdbb98) Signed-off-by: Xiangrui Meng commit ff84a8ae258083423529885d85bf1d939a62d899 Author: Joseph K. Bradley Date: Wed Nov 5 19:51:18 2014 -0800 [SPARK-4254] [mllib] MovieLensALS bug fix Changed code so it does not try to serialize Params. CC: mengxr debasish83 srowen Author: Joseph K. Bradley Closes #3116 from jkbradley/als-bugfix and squashes the following commits: e575bd8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into als-bugfix 9401b16 [Joseph K. Bradley] changed implicitPrefs so it is not serialized to fix MovieLensALS example bug (cherry picked from commit c315d1316cb2372e90ae3a12f72d5b3304435a6b) Signed-off-by: Xiangrui Meng commit 9ac5c517b64606db7d6b8ac3b823c3d5a45e0ed0 Author: Brenden Matthews Date: Wed Nov 5 16:02:44 2014 -0800 [SPARK-4158] Fix for missing resources. Mesos offers may not contain all resources, and Spark needs to check to ensure they are present and sufficient. Spark may throw an erroneous exception when resources aren't present. Author: Brenden Matthews Closes #3024 from brndnmtthws/fix-mesos-resource-misuse and squashes the following commits: e5f9580 [Brenden Matthews] [SPARK-4158] Fix for missing resources. (cherry picked from commit cb0eae3b78d7f6f56c0b9521ee48564a4967d3de) Signed-off-by: Andrew Or commit 0e16d3a3dde7a0988dfd8eff05922a1ac917fe28 Author: Jongyoul Lee Date: Wed Nov 5 15:49:42 2014 -0800 SPARK-3223 runAsSparkUser cannot change HDFS write permission properly in mesos cluster mode - change master newer Author: Jongyoul Lee Closes #3034 from jongyoul/SPARK-3223 and squashes the following commits: 42b2ed3 [Jongyoul Lee] SPARK-3223 runAsSparkUser cannot change HDFS write permission properly in mesos cluster mode - change master newer (cherry picked from commit f7ac8c2b1de96151231617846b7468d23379c74a) Signed-off-by: Andrew Or commit fe4ead2995ab8529602090ed21941b6005a07c9d Author: jay@apache.org Date: Wed Nov 5 15:45:34 2014 -0800 SPARK-4040. Update documentation to exemplify use of local (n) value, fo... This is a minor docs update which helps to clarify the way local[n] is used for streaming apps. Author: jay@apache.org Closes #2964 from jayunit100/SPARK-4040 and squashes the following commits: 35b5a5e [jay@apache.org] SPARK-4040: Update documentation to exemplify use of local (n) value. (cherry picked from commit 868cd4c3ca11e6ecc4425b972d9a20c360b52425) Signed-off-by: Matei Zaharia commit cf2f676f93807bc504b77409b6c3d66f0d5e38ab Author: Andrew Or Date: Wed Nov 5 15:42:05 2014 -0800 [SPARK-3797] Run external shuffle service in Yarn NM This creates a new module `network/yarn` that depends on `network/shuffle` recently created in #3001.
This PR introduces a custom Yarn auxiliary service that runs the external shuffle service. As of the changes here, this shuffle service is required for using dynamic allocation with Spark. This is still WIP mainly because it doesn't handle security yet. I have tested this on a stable Yarn cluster. Author: Andrew Or Closes #3082 from andrewor14/yarn-shuffle-service and squashes the following commits: ef3ddae [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service 0ee67a2 [Andrew Or] Minor wording suggestions 1c66046 [Andrew Or] Remove unused provided dependencies 0eb6233 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service 6489db5 [Andrew Or] Try catch at the right places 7b71d8f [Andrew Or] Add detailed java docs + reword a few comments d1124e4 [Andrew Or] Add security to shuffle service (INCOMPLETE) 5f8a96f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service 9b6e058 [Andrew Or] Address various feedback f48b20c [Andrew Or] Fix tests again f39daa6 [Andrew Or] Do not make network-yarn an assembly module 761f58a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service 15a5b37 [Andrew Or] Fix build for Hadoop 1.x baff916 [Andrew Or] Fix tests 5bf9b7e [Andrew Or] Address a few minor comments 5b419b8 [Andrew Or] Add missing license header 804e7ff [Andrew Or] Include the Yarn shuffle service jar in the distribution cd076a4 [Andrew Or] Require external shuffle service for dynamic allocation ea764e0 [Andrew Or] Connect to Yarn shuffle service only if it's enabled 1bf5109 [Andrew Or] Use the shuffle service port specified through hadoop config b4b1f0c [Andrew Or] 4 tabs -> 2 tabs 43dcb96 [Andrew Or] First cut integration of shuffle service with Yarn aux service b54a0c4 [Andrew Or] Initial skeleton for Yarn shuffle service (cherry picked from commit 61a5cced049a8056292ba94f23fa7bd040f50685) Signed-off-by: Andrew Or commit 6844e7a8219ac78790a422ffd5054924e7d2bea1 Author: industrial-sloth Date: Wed Nov 5 15:38:48 2014 -0800 SPARK-4222 [CORE] use readFully in FixedLengthBinaryRecordReader replaces the existing read() call with readFully(). Author: industrial-sloth Closes #3093 from industrial-sloth/branch-1.2-fixedLenRecRdr and squashes the following commits: a245c8a [industrial-sloth] use readFully in FixedLengthBinaryRecordReader commit f4beb77f083e477845b90b5049186095d2002f49 Author: Kay Ousterhout Date: Wed Nov 5 15:30:31 2014 -0800 [SPARK-3984] [SPARK-3983] Fix incorrect scheduler delay and display task deserialization time in UI This commit fixes the scheduler delay in the UI (which previously included things that are not scheduler delay, like time to deserialize the task and serialize the result), and also adds information about time to deserialize tasks to the optional additional metrics. Time to deserialize the task can be large relative to task time for short jobs, and understanding when it is high can help developers realize that they should try to reduce closure size (e.g., by including less data in the task description). cc shivaram etrain Author: Kay Ousterhout Closes #2832 from kayousterhout/SPARK-3983 and squashes the following commits: 0c1398e [Kay Ousterhout] Fixed ordering 531575d [Kay Ousterhout] Removed executor launch time 1f13afe [Kay Ousterhout] Minor spacing fixes 335be4b [Kay Ousterhout] Made metrics hideable 5bc3cba [Kay Ousterhout] [SPARK-3984] [SPARK-3983] Improve UI task metrics.
commit b27d7dcaaad0bf04d341660ffbeb742cd4eecfd3
Author: Nicholas Chammas
Date: Mon Nov 3 09:02:35 2014 -0800

[EC2] Factor out Mesos spark-ec2 branch

We reference a specific branch in two places. This patch makes it one place.

Author: Nicholas Chammas

Closes #3008 from nchammas/mesos-spark-ec2-branch and squashes the following commits:

10a6089 [Nicholas Chammas] factor out mesos spark-ec2 branch

commit 68be37b823516dbeda066776bb060bf894db4e95
Author: zsxwing
Date: Mon Nov 3 22:47:45 2014 -0800

[SPARK-4166][Core] Add a backward compatibility test for ExecutorLostFailure

Author: zsxwing

Closes #3085 from zsxwing/SPARK-4166-back-comp and squashes the following commits:

89329f4 [zsxwing] Add a backward compatibility test for ExecutorLostFailure

commit e0a043b79c250515a680485f0dc7b1a149835445
Author: zsxwing
Date: Mon Nov 3 22:40:43 2014 -0800

[SPARK-4163][Core] Add a backward compatibility test for FetchFailed

/cc aarondav

Author: zsxwing

Closes #3086 from zsxwing/SPARK-4163-back-comp and squashes the following commits:

21cb2a8 [zsxwing] Add a backward compatibility test for FetchFailed

commit 7517c37aee373c8bd3ccbf1eae079b0fc6b89c91
Author: Zhang, Liye
Date: Mon Nov 3 18:17:32 2014 -0800

[SPARK-4168][WebUI] web stages number should show correctly when stages are more than 1000

The numbers of completed and failed stages shown on the web UI are always less than 1000, which is misleading when thousands of stages have already completed or failed. The counts should stay correct even when only a subset of stages is listed on the web UI (stage info is removed once too many stages are retained).

Author: Zhang, Liye

Closes #3035 from liyezhang556520/webStageNum and squashes the following commits:

d9e29fb [Zhang, Liye] add detailed comments for variables
4ea8fd1 [Zhang, Liye] change variable name according to comments
f4c404d [Zhang, Liye] [SPARK-4168][WebUI] web stages number should show correctly when stages are more than 1000
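A sketch of the counting pattern behind the SPARK-4168 fix, with hypothetical names rather than the actual listener code: the UI trims old stage records once a retention limit is exceeded, so the size of the retained collection under-reports the true total, while an explicit counter incremented on every completion event stays correct after trimming.

```scala
import scala.collection.mutable

// Hypothetical, simplified version of the counting pattern. The UI keeps
// at most `retainedStages` records, so completedStages.size under-reports
// once trimming kicks in; numCompletedStages does not.
class StageCounter(retainedStages: Int = 1000) {
  private val completedStages = mutable.ListBuffer[Int]() // stand-in for StageInfo
  var numCompletedStages = 0                              // survives trimming

  def onStageCompleted(stageId: Int): Unit = {
    completedStages += stageId
    numCompletedStages += 1
    if (completedStages.size > retainedStages) {
      // Drop the oldest records but keep the running total intact.
      completedStages.remove(0, completedStages.size - retainedStages)
    }
  }
}
```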
commit 866c7bbe56f9c7fd96d3f4afe8a76405dc877a6e
Author: Josh Rosen
Date: Mon Nov 3 18:18:47 2014 -0800

[SPARK-611] Display executor thread dumps in web UI

This patch allows executor thread dumps to be collected on demand and viewed in the Spark web UI. The thread dumps are collected using Thread.getAllStackTraces(). To allow remote thread dumps to be triggered from the web UI, I added a new `ExecutorActor` that runs inside of the Executor actor system and responds to RPCs from the driver. The driver's mechanism for obtaining a reference to this actor is a little bit hacky: it uses the block manager master actor to determine the host/port of the executor actor systems in order to construct ActorRefs to ExecutorActor. Unfortunately, I couldn't find a much cleaner way to do this without a big refactoring of the executor -> driver communication.

Screenshots:

![image](https://cloud.githubusercontent.com/assets/50748/4781793/7e7a0776-5cbf-11e4-874d-a91cd04620bd.png)
![image](https://cloud.githubusercontent.com/assets/50748/4781794/8bce76aa-5cbf-11e4-8d13-8477748c9f7e.png)
![image](https://cloud.githubusercontent.com/assets/50748/4781797/bd11a8b8-5cbf-11e4-9ad7-a7459467ec8e.png)

Author: Josh Rosen

Closes #2944 from JoshRosen/jstack-in-web-ui and squashes the following commits:

3c21a5d [Josh Rosen] Address review comments:
880f7f7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
f719266 [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
19707b0 [Josh Rosen] Add one comment.
127a130 [Josh Rosen] Update to use SparkContext.DRIVER_IDENTIFIER
b8e69aa [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
3dfc2d4 [Josh Rosen] Add missing file.
bc1e675 [Josh Rosen] Undo some leftover changes from the earlier approach.
f4ac1c1 [Josh Rosen] Switch to on-demand collection of thread dumps
dfec08b [Josh Rosen] Add option to disable thread dumps in UI.
4c87d7f [Josh Rosen] Use separate RPC for sending thread dumps.
2b8bdf3 [Josh Rosen] Enable thread dumps from the driver when running in non-local mode.
cc3e6b3 [Josh Rosen] Fix test code in DAGSchedulerSuite.
87b8b65 [Josh Rosen] Add new listener event for thread dumps.
8c10216 [Josh Rosen] Add missing file.
0f198ac [Josh Rosen] [SPARK-611] Display executor thread dumps in web UI

commit e7f735637ad2f681b454d1297f6fdcc433feebbc
Author: Aaron Davidson
Date: Wed Nov 5 14:38:43 2014 -0800

[SPARK-4242] [Core] Add SASL to external shuffle service

Does three things: (1) adds SASL to ExternalShuffleClient, (2) puts SecurityManager in BlockManager's constructor, and (3) adds a unit test.

Author: Aaron Davidson

Closes #3108 from aarondav/sasl-client and squashes the following commits:

48b622d [Aaron Davidson] Screw it, let's just get LimitedInputStream
3543b70 [Aaron Davidson] Back out of pom change due to unknown test issue?
b58518a [Aaron Davidson] ByteStreams.limit() not available :(
cbe451a [Aaron Davidson] Address comments
2bf2908 [Aaron Davidson] [SPARK-4242] [Core] Add SASL to external shuffle service

commit 236434033fe452e70dbd0236935a49693712e130
Author: Aaron Davidson
Date: Tue Nov 4 16:15:38 2014 -0800

[SPARK-2938] Support SASL authentication in NettyBlockTransferService

Also lays the groundwork for supporting it inside the external shuffle service.

Author: Aaron Davidson

Closes #3087 from aarondav/sasl and squashes the following commits:

3481718 [Aaron Davidson] Delete rogue println
44f8410 [Aaron Davidson] Delete documentation - muahaha!
eb9f065 [Aaron Davidson] Improve documentation and add end-to-end test at Spark-level
a6b95f1 [Aaron Davidson] Address comments
785bbde [Aaron Davidson] Cleanup
79973cb [Aaron Davidson] Remove unused file
151b3c5 [Aaron Davidson] Add docs, timeout config, better failure handling
f6177d7 [Aaron Davidson] Cleanup SASL state upon connection termination
7b42adb [Aaron Davidson] Add unit tests
8191bcb [Aaron Davidson] [SPARK-2938] Support SASL authentication in NettyBlockTransferService
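A minimal sketch of enabling the SASL authentication that the SPARK-2938 and SPARK-4242 commits above wire into the Netty transport and the external shuffle service: spark.authenticate, spark.authenticate.secret, and spark.shuffle.service.enabled are standard Spark settings, and the secret value here is a placeholder that would be distributed out of band.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Both the application and the external shuffle service must agree on
// spark.authenticate and share the same secret for the SASL handshake
// to succeed. "change-me" is a placeholder, not a real secret.
val conf = new SparkConf()
  .setAppName("SaslAuthExample")
  .set("spark.authenticate", "true")
  .set("spark.authenticate.secret", "change-me")
  .set("spark.shuffle.service.enabled", "true") // authenticate to the shuffle service too
val sc = new SparkContext(conf)
```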