I have used this script to launch, destroy, start, and stop clusters successfully.
Author: meawoppl <meawoppl@gmail.com>
Closes #6336 from meawoppl/py3ec2spark and squashes the following commits:
2e87046 [meawoppl] Py3 compat fixes.
(cherry picked from commit 8dbe777703)
Signed-off-by: Davies Liu <davies@databricks.com>
In PySpark we get the memory used before and after a spill, then use the difference of these two values as memorySpilled. But if the before value is smaller than the after value, we get a negative value; in that scenario a value of 0 is more reasonable.
Below are the results we observed in the HistoryServer:
```
Index  ID  Attempt  Status   Locality Level  Executor ID / Host  Launch Time          Duration  GC Time  Input Size / Records      Write Time  Shuffle Write Size / Records  Shuffle Spill (Memory)  Shuffle Spill (Disk)  Errors
0      0   0        SUCCESS  NODE_LOCAL      3 / vm119           2015/05/04 17:31:06  21 s      0.1 s    128.1 MB (hadoop) / 3237  70 ms       10.1 MB / 2529                0.0 B                   5.7 MB
2      2   0        SUCCESS  NODE_LOCAL      1 / vm118           2015/05/04 17:31:06  22 s      89 ms    128.1 MB (hadoop) / 3205  0.1 s       10.1 MB / 2529                -1048576.0 B            5.9 MB
1      1   0        SUCCESS  NODE_LOCAL      2 / vm117           2015/05/04 17:31:06  22 s      0.1 s    128.1 MB (hadoop) / 3271  68 ms       10.1 MB / 2529                -1048576.0 B            5.6 MB
4      4   0        SUCCESS  NODE_LOCAL      2 / vm117           2015/05/04 17:31:06  22 s      0.1 s    128.1 MB (hadoop) / 3192  51 ms       10.1 MB / 2529                -1048576.0 B            5.9 MB
3      3   0        SUCCESS  NODE_LOCAL      3 / vm119           2015/05/04 17:31:06  22 s      0.1 s    128.1 MB (hadoop) / 3262  51 ms       10.1 MB / 2529                1024.0 KB               5.8 MB
5      5   0        SUCCESS  NODE_LOCAL      1 / vm118           2015/05/04 17:31:06  22 s      89 ms    128.1 MB (hadoop) / 3256  93 ms       10.1 MB / 2529                -1048576.0 B            5.7 MB
```
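The fix clamps the reported value. A minimal Scala sketch of the idea (the actual change is in PySpark's shuffle accounting):
```
// Clamp the spill delta so a smaller "before" reading never produces
// a negative MemoryBytesSpilled value.
def memorySpilled(usedBefore: Long, usedAfter: Long): Long =
  math.max(usedBefore - usedAfter, 0L)
```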
/cc davies
Author: linweizhong <linweizhong@huawei.com>
Closes #5887 from Sephiroth-Lin/spark-7339 and squashes the following commits:
9186c81 [linweizhong] Use max function to get a nonnegative value
d41672b [linweizhong] Update MemoryBytesSpilled when memorySpilled > 0
(cherry picked from commit 8948ad3fb5)
Signed-off-by: Davies Liu <davies@databricks.com>
```
sbt.ForkMain$ForkError: 1424424077190 was not equal to 1424474477190
at org.scalatest.MatchersHelper$.newTestFailedException(MatchersHelper.scala:160)
at org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6231)
at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6265)
at org.apache.spark.status.api.v1.SimpleDateParamTest$$anonfun$1.apply$mcV$sp(SimpleDateParamTest.scala:25)
at org.apache.spark.status.api.v1.SimpleDateParamTest$$anonfun$1.apply(SimpleDateParamTest.scala:23)
at org.apache.spark.status.api.v1.SimpleDateParamTest$$anonfun$1.apply(SimpleDateParamTest.scala:23)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.scalatest.Suite$class.withFixture(Suite.scala:
```
Set timezone to fix SimpleDateParamTest
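The fix pins the time zone so the expected epoch value no longer depends on the build machine's zone. A hypothetical sketch of the idea (format and zone here are illustrative, not the suite's exact code):
```
// Parse the date against an explicit zone rather than the JVM default.
import java.text.SimpleDateFormat
import java.util.TimeZone

val fmt = new SimpleDateFormat("yyyy-MM-dd")
fmt.setTimeZone(TimeZone.getTimeZone("GMT"))
val expectedMillis = fmt.parse("2015-02-20").getTime
```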
Author: scwf <wangfei1@huawei.com>
Author: Fei Wang <wangfei1@huawei.com>
Closes #6377 from scwf/fix-SimpleDateParamTest and squashes the following commits:
b8df1e5 [Fei Wang] Update SimpleDateParamSuite.scala
8bb74f0 [scwf] fix SimpleDateParamSuite
(cherry picked from commit bf49c22130)
Signed-off-by: Imran Rashid <irashid@cloudera.com>
The Catalyst DSL is no longer used as a public-facing API. This pull request removes the UDF and writeToFile features from it, since they are not used in unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes #6350 from rxin/unused-logical-dsl and squashes the following commits:
90b3de6 [Reynold Xin] [SQL][minor] Removed unused Catalyst logical plan DSL.
(cherry picked from commit c9adcad81a)
Signed-off-by: Reynold Xin <rxin@databricks.com>
https://issues.apache.org/jira/browse/SPARK-7832
Author: Yin Huai <yhuai@databricks.com>
Closes #6385 from yhuai/runSQLTests and squashes the following commits:
3d399bc [Yin Huai] Always run SQL tests in master build.
(cherry picked from commit f38e619c41)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Adds a section to the RDD persistence part of the programming guide detailing Spark-Tachyon version compatibility, as discussed in [SPARK-6391](https://issues.apache.org/jira/browse/SPARK-6391).
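For reference, a hedged sketch of the storage level the new section documents (in Spark 1.x, OFF_HEAP blocks are stored in Tachyon; `rdd` is assumed to be in scope):
```
import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.OFF_HEAP) // off-heap blocks live in Tachyon in 1.x
```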
Author: Calvin Jia <jia.calvin@gmail.com>
Closes #6382 from calvinjia/spark-6391 and squashes the following commits:
113e863 [Calvin Jia] Move compatibility info to the offheap storage level section.
7942dc5 [Calvin Jia] Add a section in the programming-guide docs for Tachyon compatibility.
(cherry picked from commit ce0051d6f7)
Signed-off-by: Reynold Xin <rxin@databricks.com>
When committing/aborting a write task issued in `InsertIntoHadoopFsRelation`, if an exception is thrown from `OutputWriter.close()`, the committing/aborting process is interrupted and messy stuff is left behind (e.g., the `_temporary` directory created by `FileOutputCommitter`).
This PR makes these two processes more robust by catching potential exceptions and falling back to a normal task commit/abort.
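A minimal sketch of the hardening pattern (method names hypothetical, not the PR's actual code):
```
// Never let a failure in close() skip the abort path, which is what
// leaves the _temporary directory behind.
def commitOrAbort(writer: java.io.Closeable, commit: () => Unit, abort: () => Unit): Unit = {
  try {
    writer.close()
    commit()
  } catch {
    case cause: Throwable =>
      abort() // fall back to a normal task abort
      throw new RuntimeException("Failed to commit task", cause)
  }
}
```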
Author: Cheng Lian <lian@databricks.com>
Closes #6378 from liancheng/spark-7838 and squashes the following commits:
f18253a [Cheng Lian] Makes task committing/aborting in InsertIntoHadoopFsRelation more robust
(cherry picked from commit 8af1bf10b7)
Signed-off-by: Cheng Lian <lian@databricks.com>
The "Database does not exist" error reported in SPARK-7684 was caused by `HiveContext.newTemporaryConfiguration()`, which always creates a new temporary metastore directory and returns a metastore configuration pointing that directory. This makes `TestHive.reset()` always replaces old temporary metastore with an empty new one.
Author: Cheng Lian <lian@databricks.com>
Closes #6359 from liancheng/spark-7684 and squashes the following commits:
95d2eb8 [Cheng Lian] Addresses @marmbrust's comment
042769d [Cheng Lian] Don't create new temp directory in HiveContext.newTemporaryConfiguration()
(cherry picked from commit bfeedc69a2)
Signed-off-by: Cheng Lian <lian@databricks.com>
https://issues.apache.org/jira/browse/SPARK-7805
Because `sql/hive`'s tests depend on the test jar of `sql/core`, we do not need to store `SQLTestUtils` and `ParquetTest` in `src/main`. We should only add stuff that will be needed by `sql/console` or Python tests (for Python, we need it in `src/main`, right? davies).
Author: Yin Huai <yhuai@databricks.com>
Closes #6334 from yhuai/SPARK-7805 and squashes the following commits:
af6d0c9 [Yin Huai] mima
b86746a [Yin Huai] Move SQLTestUtils.scala and ParquetTest.scala to src/test.
(cherry picked from commit ed21476bc0)
Signed-off-by: Yin Huai <yhuai@databricks.com>
https://issues.apache.org/jira/browse/SPARK-7845
Author: Yin Huai <yhuai@databricks.com>
Closes #6384 from yhuai/hadoop1Test and squashes the following commits:
82fcea8 [Yin Huai] Use hadoop 1.2.1 (a stable version) for hadoop 1 test.
(cherry picked from commit bfbc0df729)
Signed-off-by: Yin Huai <yhuai@databricks.com>
This fixes an issue reported in #6373 where the `cp` would fail if `-Psparkr` was not used in the build.
cc dragos pwendell
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #6379 from shivaram/make-distribution-hotfix and squashes the following commits:
08eb7e4 [Shivaram Venkataraman] Copy SparkR lib if it exists in make-distribution
(cherry picked from commit b231baa248)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
This one continues the work of https://github.com/apache/spark/pull/6216.
Author: Yin Huai <yhuai@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes #6366 from yhuai/insert and squashes the following commits:
3d717fb [Yin Huai] Use insertInto to handle the case when table exists and Append is used for saveAsTable.
56d2540 [Yin Huai] Add PreWriteCheck to HiveContext's analyzer.
c636e35 [Yin Huai] Remove unnecessary empty lines.
cf83837 [Yin Huai] Move insertInto to write. Also, remove the partition columns from InsertIntoHadoopFsRelation.
0841a54 [Reynold Xin] Removed experimental tag for deprecated methods.
33ed8ef [Reynold Xin] [SPARK-7654][SQL] Move insertInto into reader/writer interface.
(cherry picked from commit 2b7e63585d)
Signed-off-by: Yin Huai <yhuai@databricks.com>
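A hedged sketch of the 1.4-era call pattern these changes converge on (`df` is an assumed DataFrame, "events" a hypothetical table; with `SaveMode.Append` on an existing table, `saveAsTable` now routes through `insertInto`):
```
import org.apache.spark.sql.SaveMode

// Append into an existing table via the writer interface.
df.write.mode(SaveMode.Append).insertInto("events")
```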
Add tests later.
Author: Davies Liu <davies@databricks.com>
Closes #6375 from davies/insertInto and squashes the following commits:
826423e [Davies Liu] add insertInto() to Writer
(cherry picked from commit be47af1bdb)
Signed-off-by: Davies Liu <davies@databricks.com>
In the old implementation, if a batch had no blocks, `areWALRecordHandlesPresent` would still be `true` and a `WriteAheadLogBackedBlockRDD` would be returned.
This PR handles this case by returning a `WriteAheadLogBackedBlockRDD` or a `BlockRDD` according to the configuration.
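The subtlety is that `forall` on an empty collection is vacuously true. A minimal logic sketch (names hypothetical):
```
// An empty batch must not count as "all blocks have WAL record handles".
def walHandlesPresent(handles: Seq[Option[String]]): Boolean =
  handles.nonEmpty && handles.forall(_.isDefined)
```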
Author: zsxwing <zsxwing@gmail.com>
Closes #6372 from zsxwing/SPARK-7777 and squashes the following commits:
788f895 [zsxwing] Handle the case when there is no block in a batch
(cherry picked from commit ad0badba14)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
This change also removes native libraries from SparkR to make sure our distribution works across platforms.
Tested by building on Mac and running on Amazon Linux (CentOS) and a Windows VM, and vice versa (built on Linux, run on Mac).
I will also test this with YARN soon and update this PR.
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #6373 from shivaram/sparkr-binary and squashes the following commits:
ae41b5c [Shivaram Venkataraman] Remove native libraries from SparkR Also include the built SparkR package in make-distribution.sh
(cherry picked from commit a40bca0111)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
sqlCtx -> sqlContext
You can check the docs by:
```
$ cd docs
$ SKIP_SCALADOC=1 jekyll serve
```
cc shivaram
Author: Davies Liu <davies@databricks.com>
Closes #5442 from davies/r_docs and squashes the following commits:
7a12ec6 [Davies Liu] remove rdd in R docs
8496b26 [Davies Liu] remove the docs related to RDD
e23b9d6 [Davies Liu] delete R docs for RDD API
222e4ff [Davies Liu] Merge branch 'master' into r_docs
89684ce [Davies Liu] Merge branch 'r_docs' of github.com:davies/spark into r_docs
f0a10e1 [Davies Liu] address comments from @shivaram
f61de71 [Davies Liu] Update pairRDD.R
3ef7cf3 [Davies Liu] use + instead of function(a,b) a+b
2f10a77 [Davies Liu] address comments from @cafreeman
9c2a062 [Davies Liu] mention R api together with Python API
23f751a [Davies Liu] Fill in SparkR examples in programming guide
(cherry picked from commit 7af3818c6b)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #6369 from tdas/SPARK-7838 and squashes the following commits:
87d1c7f [Tathagata Das] Addressed comment
37775d8 [Tathagata Das] set scope for kinesis stream
(cherry picked from commit baa89838cc)
Signed-off-by: Andrew Or <andrew@databricks.com>
Enables the SparkR profiles for all the binary builds we create
cc pwendell
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #6371 from shivaram/sparkr-create-release and squashes the following commits:
ca5a0b2 [Shivaram Venkataraman] Add -Psparkr to create-release.sh
(cherry picked from commit 017b3404a5)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
Added logistic regression to the list of Multiclass Classification Supported Methods in the MLlib Classification and Regression documentation, as it was missing.
Author: Mike Dusenberry <dusenberrymw@gmail.com>
Closes #6357 from dusenberrymw/Add_LR_To_List_Of_Multiclass_Classification_Methods and squashes the following commits:
7918650 [Mike Dusenberry] Updating broken link due to the "Binary Classification" section on the Linear Methods page being renamed to "Classification".
3005dc2 [Mike Dusenberry] Adding logistic regression to the list of Multiclass Classification Supported Methods in the MLlib Classification and Regression documentation, as it was missing.
(cherry picked from commit 63a5ce75ea)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
The previous PR for SPARK-7224 (#5790) broke JDK 6, because it used `java.nio.file.Path`, which exists in JDK 7 but not in JDK 6. This PR uses Guava's `Files` to handle directory creation, etc.
The description from the previous PR:
> This patch contains an `IvyTestUtils` file, which dynamically generates jars and pom files to test the `--packages` feature without having to rely on the internet, and Maven Central.
cc pwendell
I also ran the flaky test about 20 times locally; it didn't fail a single time, but I think it may fail about once every 100 builds. I still haven't figured out the cause yet, but the test before it, `--jars`, was also failing after we turned off the `--packages` test in `SparkSubmitSuite`. It may be related to the launch of SparkSubmit.
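A sketch of the JDK 6-friendly directory creation the PR switches to, assuming Guava is on the classpath (the variable name is a hypothetical use inside `IvyTestUtils`):
```
import com.google.common.io.Files

// Guava's Files works on JDK 6, unlike java.nio.file (JDK 7+).
val repoDir = Files.createTempDir()
repoDir.deleteOnExit()
```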
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5892 from brkyvz/maven-utils and squashes the following commits:
e9b1903 [Burak Yavuz] fix merge conflict
68214e0 [Burak Yavuz] remove ignore for test(neglect spark dependencies)
e632381 [Burak Yavuz] fix ignore
9ef1408 [Burak Yavuz] re-enable --packages test
22eea62 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into maven-utils
05cd0de [Burak Yavuz] added mock repository generator
(cherry picked from commit 8014e1f6bb)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
KinesisReceiver calls worker.run(), which is a blocking call (a while loop) per the source code of the kinesis-client library: https://github.com/awslabs/amazon-kinesis-client/blob/v1.2.1/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/Worker.java.
This results in an infinite loop when calling sparkStreamingContext.stop(stopSparkContext = false, stopGracefully = true), perhaps because the ReceiverTracker is never able to register the receiver (its receiverInfo field is an empty map), causing it to be stuck in an infinite loop while waiting for the running flag to be set to false.
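The fix is the standard pattern of moving the blocking loop off the caller's thread so `onStart()` returns and the receiver can register. A generic sketch (helper name hypothetical):
```
// Run a blocking body on its own named daemon thread and return immediately.
def runInBackground(threadName: String)(blockingBody: => Unit): Thread = {
  val t = new Thread(threadName) { override def run(): Unit = blockingBody }
  t.setDaemon(true)
  t.start()
  t
}
```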
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #6348 from tdas/SPARK-7788 and squashes the following commits:
2584683 [Tathagata Das] Added receiver id in thread name
6cf1cd4 [Tathagata Das] Made KinesisReceiver.onStart non-blocking
(cherry picked from commit 1c388a9985)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
The default add time of 5s is still too slow for small jobs. Also, the current default remove time of 10 minutes seems rather high. This patch lowers both and rephrases a few log messages.
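For reference, a hedged example of setting these timeouts explicitly (these are the standard dynamic-allocation keys; the values are illustrative, not the new defaults):
```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s") // backlog wait before adding executors
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")    // idle wait before removing executors
```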
Author: Andrew Or <andrew@databricks.com>
Closes #6301 from andrewor14/da-minor and squashes the following commits:
6d614a6 [Andrew Or] Lower log level
2811492 [Andrew Or] Log information when requests are canceled
5fcd3eb [Andrew Or] Fix tests
3320710 [Andrew Or] Lower timeouts + rephrase a few log messages
(cherry picked from commit 3d8760d76e)
Signed-off-by: Andrew Or <andrew@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #6363 from marmbrus/windowErrors and squashes the following commits:
516b02d [Michael Armbrust] [SPARK-7834] [SQL] Better window error messages
(cherry picked from commit 3c1305107a)
Signed-off-by: Michael Armbrust <michael@databricks.com>
JIRA: https://issues.apache.org/jira/browse/SPARK-7270
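A hedged illustration of the statement shape involved (table and column names hypothetical; `sqlContext` is assumed to be a `HiveContext` with dynamic partitioning enabled):
```
// dt is left unvalued in the PARTITION clause, so each row's dt value
// decides its partition at write time.
sqlContext.sql(
  """INSERT OVERWRITE TABLE logs PARTITION (dt)
    |SELECT user, action, dt FROM staging_logs""".stripMargin)
```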
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #5864 from viirya/dyn_partition_insert and squashes the following commits:
b5627df [Liang-Chi Hsieh] For comments.
3b21e4b [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into dyn_partition_insert
8a4352d [Liang-Chi Hsieh] Consider dynamic partition when inserting into hive table.
(cherry picked from commit 126d7235de)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Author: Santiago M. Mola <santi@mola.io>
Closes #6327 from smola/feature/catalyst-dsl-set-ops and squashes the following commits:
11db778 [Santiago M. Mola] [SPARK-7724] [SQL] Support Intersect/Except in Catalyst DSL.
(cherry picked from commit e4aef91fe7)
Signed-off-by: Michael Armbrust <michael@databricks.com>
https://issues.apache.org/jira/browse/SPARK-7758
When initializing `executionHive`, we only mask `javax.jdo.option.ConnectionURL` to override the metastore location. However, other properties that relate to the actual Hive metastore data source are not masked. For example, when using Spark SQL with a PostgreSQL-backed Hive metastore, `executionHive` actually tries to use settings read from `hive-site.xml`, which point to PostgreSQL, to connect to the temporary Derby metastore, thus causing errors.
To fix this, we need to mask all metastore data source properties. Specifically, according to the code of the [Hive `ObjectStore.getDataSourceProps()` method][1], all properties whose names mention "jdo" or "datanucleus" must be included.
[1]: https://github.com/apache/hive/blob/release-0.13.1/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L288
Tested using PostgreSQL as the metastore; it worked fine.
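A minimal sketch of the widened masking predicate (name hypothetical):
```
// Any property whose name mentions "jdo" or "datanucleus" belongs to the
// metastore data source and must be overridden, not just the connection URL.
def isMetastoreDataSourceProp(name: String): Boolean = {
  val n = name.toLowerCase
  n.contains("jdo") || n.contains("datanucleus")
}
```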
Author: WangTaoTheTonic <wangtao111@huawei.com>
Closes #6314 from WangTaoTheTonic/SPARK-7758 and squashes the following commits:
ca7ae7c [WangTaoTheTonic] add comments
86caf2c [WangTaoTheTonic] delete unused import
e4f0feb [WangTaoTheTonic] block more data source related property
92a81fa [WangTaoTheTonic] fix style check
e3e683d [WangTaoTheTonic] override more configs to avoid failure connecting to PostgreSQL
(cherry picked from commit 31d5d463e7)
Signed-off-by: Michael Armbrust <michael@databricks.com>
SPARK-3386 / #5606 modified the shuffle write path to re-use serializer instances across multiple calls to DiskBlockObjectWriter. It turns out that this introduced a very rare bug when using `KryoSerializer`: if auto-reset is disabled and reference-tracking is enabled, then we'll end up re-using the same serializer instance to write multiple output streams without calling `reset()` between write calls, which can lead to cases where objects in one file may contain references to objects that are in previous files, causing errors during deserialization.
This patch fixes this bug by calling `reset()` at the start of `serialize()` and `serializeStream()`. I also added a regression test which demonstrates that this problem only occurs when auto-reset is disabled and reference-tracking is enabled.
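A direct-Kryo sketch of the failure mode and the fix (Spark wraps this inside its serializer instance; the code below is illustrative, not the patch itself):
```
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

val kryo = new Kryo()
kryo.setAutoReset(false)  // the problematic configuration:
kryo.setReferences(true)  // back-references may span output streams

def writeTopLevel(out: Output, obj: AnyRef): Unit = {
  kryo.reset() // the fix: clear the reference table before each stream
  kryo.writeClassAndObject(out, obj)
}
```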
Author: Josh Rosen <joshrosen@databricks.com>
Closes #6293 from JoshRosen/kryo-instance-reuse-bug and squashes the following commits:
e19726d [Josh Rosen] Add fix for SPARK-7766.
71845e3 [Josh Rosen] Add failing regression test to trigger Kryo re-use bug
(cherry picked from commit eac00691da)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
Includes the Iris dataset (after shuffling and relabeling 3 -> 0 to conform to the 0 -> numClasses-1 labeling). Could not find an existing dataset in data/mllib for multiclass classification.
Author: Ram Sriharsha <rsriharsha@hw11853.local>
Closes #6296 from harsha2010/SPARK-7574 and squashes the following commits:
645427c [Ram Sriharsha] cleanup
46c41b1 [Ram Sriharsha] cleanup
2f76295 [Ram Sriharsha] Code Review Fixes
ebdf103 [Ram Sriharsha] Java Example
c026613 [Ram Sriharsha] Code Review fixes
4b7d1a6 [Ram Sriharsha] minor cleanup
13bed9c [Ram Sriharsha] add wikipedia link
bb9dbfa [Ram Sriharsha] Clean up naming
6f90db1 [Ram Sriharsha] [SPARK-7574][ml][doc] User guide for OneVsRest
(cherry picked from commit 509d55ab41)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #6165 from marmbrus/wrongColumn and squashes the following commits:
4fad158 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into wrongColumn
aad7eab [Michael Armbrust] rxins comments
f1e8df1 [Michael Armbrust] [SPARK-6743][SQL] Fix empty projections of cached data
(cherry picked from commit 3b68cb0430)
Signed-off-by: Michael Armbrust <michael@databricks.com>
This Selenium test case has been flaky for a while and has led to frequent Jenkins build failures. Let's disable it temporarily until we figure out a proper solution.
Author: Cheng Lian <lian@databricks.com>
Closes #6345 from liancheng/ignore-selenium-test and squashes the following commits:
09996fe [Cheng Lian] Ignores Thrift server UISeleniumSuite
(cherry picked from commit 4e5220c317)
Signed-off-by: Cheng Lian <lian@databricks.com>
This closes #6104.
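A hedged sketch of the API this adds (1.4 names; `df` is an assumed DataFrame, and `rowNumber` was later renamed `row_number`):
```
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

// Rank rows within each department by salary.
val w = Window.partitionBy("dept").orderBy("salary")
val ranked = df.withColumn("rank", rowNumber.over(w))
```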
Author: Cheng Hao <hao.cheng@intel.com>
Author: Reynold Xin <rxin@databricks.com>
Closes #6343 from rxin/window-df and squashes the following commits:
026d587 [Reynold Xin] Address code review feedback.
dc448fe [Reynold Xin] Fixed Hive tests.
9794d9d [Reynold Xin] Moved Java test package.
9331605 [Reynold Xin] Refactored API.
3313e2a [Reynold Xin] Merge pull request #6104 from chenghao-intel/df_window
d625a64 [Cheng Hao] Update the dataframe window API as suggested
c141fb1 [Cheng Hao] hide all of properties of the WindowFunctionDefinition
3b1865f [Cheng Hao] scaladoc typos
f3fd2d0 [Cheng Hao] polish the unit test
6847825 [Cheng Hao] Add additional analytics functions
57e3bc0 [Cheng Hao] typos
24a08ec [Cheng Hao] scaladoc
28222ed [Cheng Hao] fix bug of range/row Frame
1d91865 [Cheng Hao] style issue
53f89f2 [Cheng Hao] remove the over from the functions.scala
964c013 [Cheng Hao] add more unit tests and window functions
64e18a7 [Cheng Hao] Add Window Function support for DataFrame
(cherry picked from commit f6f2eeb179)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Added user guide sections with code examples.
Also added small Java unit tests to test the Java example in the guide.
CC: mengxr
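A sketch in the spirit of the added sections (1.4 spark.ml; `df` is assumed to carry a vector-valued "features" column):
```
import org.apache.spark.ml.feature.StandardScaler

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
val scaled = scaler.fit(df).transform(df) // fit computes the column statistics
```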
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #6127 from jkbradley/feature-guide-2 and squashes the following commits:
cd47f4b [Joseph K. Bradley] Updated based on code review
f16bcec [Joseph K. Bradley] Fixed merge issues and update Python examples print calls for Python 3
0a862f9 [Joseph K. Bradley] Added Normalizer, StandardScaler to ml-features doc, plus small Java unit tests
a21c2d6 [Joseph K. Bradley] Updated ml-features.md with IDF
(cherry picked from commit 2728c3df66)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Just a small change: fixed a broken link in the MLlib Linear Methods documentation by removing a newline character between the link title and link address.
Author: Mike Dusenberry <dusenberrymw@gmail.com>
Closes #6340 from dusenberrymw/Fix_MLlib_Linear_Methods_link and squashes the following commits:
0a57818 [Mike Dusenberry] Fixing broken link in MLlib Linear Methods documentation.
(cherry picked from commit e4136ea6c4)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
This PR updates `HashingTF` to output ML attributes that record the number of features in the output column. We need to expand `UnaryTransformer` to support output metadata. A `def outputMetadata: Metadata` is not sufficient because the metadata may also depend on the input data. Though this is not true for `HashingTF`, I think it is reasonable to update `UnaryTransformer` in a separate PR. `checkParams` is added to verify common requirements for params. I will send a separate PR to use it in other test suites. jkbradley
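A hedged usage sketch: after this change, the output column's ML attributes record the number of features configured here (`df` is assumed to have a "words" column):
```
import org.apache.spark.ml.feature.HashingTF

val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("tf")
  .setNumFeatures(1 << 18) // recorded in the output column's metadata
val hashed = hashingTF.transform(df)
```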
Author: Xiangrui Meng <meng@databricks.com>
Closes #6308 from mengxr/SPARK-7219 and squashes the following commits:
9bd2922 [Xiangrui Meng] address comments
e82a68a [Xiangrui Meng] remove sqlContext from test suite
995535b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7219
2194703 [Xiangrui Meng] add test for attributes
178ae23 [Xiangrui Meng] update HashingTF with tests
91a6106 [Xiangrui Meng] WIP
(cherry picked from commit 85b96372cf)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley
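A sketch of restoring the previous behavior explicitly, for anyone who relied on it:
```
import org.apache.spark.ml.feature.RegexTokenizer

val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setGaps(false)                      // match tokens rather than split on gaps
  .setPattern("\\p{L}+|[^\\p{L}\\s]+") // the old default pattern
```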
Author: Xiangrui Meng <meng@databricks.com>
Closes #6330 from mengxr/SPARK-7794 and squashes the following commits:
5ee7cde [Xiangrui Meng] update RegexTokenizer default settings
(cherry picked from commit f5db4b416c)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
A shutdown hook to stop the SparkContext was added recently. This results in ugly errors when a streaming application is terminated by Ctrl-C.
```
Exception in thread "Thread-27" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:736)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:735)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:735)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1468)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1403)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1642)
at org.apache.spark.SparkContext$$anonfun$3.apply$mcV$sp(SparkContext.scala:559)
at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2266)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2236)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2236)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2236)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1764)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2236)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2236)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2236)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2236)
at org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2218)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
```
This is because Spark's shutdown hook stops the context, and the streaming jobs fail in the middle. The correct solution is to stop the streaming context before the Spark context. This PR adds a shutdown hook to do so, with a priority higher than that of the SparkContext's shutdown hook.
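A conceptual sketch using a plain JVM hook (`ssc` is an assumed StreamingContext; Spark's internal hook manager additionally orders hooks by priority, which is how the streaming hook is guaranteed to run first):
```
Runtime.getRuntime.addShutdownHook(new Thread("stop-streaming-context") {
  // Stop streaming first and leave the SparkContext to its own hook.
  override def run(): Unit =
    ssc.stop(stopSparkContext = false, stopGracefully = false)
})
```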
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #6307 from tdas/SPARK-7776 and squashes the following commits:
e3d5475 [Tathagata Das] Added conf to specify graceful shutdown
4c18652 [Tathagata Das] Added shutdown hook to StreamingContxt.
(cherry picked from commit d68ea24d60)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
Author: Davies Liu <davies@databricks.com>
Closes #6311 from davies/rollup and squashes the following commits:
0261db1 [Davies Liu] use @since
a51ca6b [Davies Liu] Merge branch 'master' of github.com:apache/spark into rollup
8ad5af4 [Davies Liu] Update dataframe.py
ade3841 [Davies Liu] add DataFrame.rollup/cube in Python
(cherry picked from commit 17791a5815)
Signed-off-by: Reynold Xin <rxin@databricks.com>
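For reference, the Scala counterparts of the methods this commit mirrors into Python (hedged; `df` and the column names are assumed):
```
// Hierarchical and full-cross aggregations over the grouping columns.
df.rollup("dept", "city").count()
df.cube("dept", "city").avg("salary")
```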