ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Herman van Hovell	18bf9d2b2d	[SPARK-17782][STREAMING][BUILD] Add Kafka 0.10 project to build modules ## What changes were proposed in this pull request? This PR adds the Kafka 0.10 subproject to the build infrastructure. This makes sure Kafka 0.10 tests are only triggers when it or of its dependencies change. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #15355 from hvanhovell/SPARK-17782.	2016-10-07 11:46:39 +01:00
Shixiong Zhu	9293734d35	[SPARK-17346][SQL] Add Kafka source for Structured Streaming ## What changes were proposed in this pull request? This PR adds a new project ` external/kafka-0-10-sql` for Structured Streaming Kafka source. It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing tdas did most of work and part of them was inspired by koeninger's work. ### Introduction The Kafka source is a structured streaming data source to poll data from Kafka. The schema of reading data is as follows: Column \| Type ---- \| ---- key \| binary value \| binary topic \| string partition \| int offset \| long timestamp \| long timestampType \| int The source can deal with deleting topics. However, the user should make sure there is no Spark job processing the data when deleting a topic. ### Configuration The user can use `DataStreamReader.option` to set the following configurations. Kafka Source's options \| value \| default \| meaning ------ \| ------- \| ------ \| ----- startingOffset \| ["earliest", "latest"] \| "latest" \| The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started, and that resuming will always pick up from where the query left off. failOnDataLost \| [true, false] \| true \| Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected. subscribe \| A comma-separated list of topics \| (none) \| The topic list to subscribe. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source. subscribePattern \| Java regex string \| (none) \| The pattern used to subscribe the topic. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source. kafka.consumer.poll.timeoutMs \| long \| 512 \| The timeout in milliseconds to poll data from Kafka in executors fetchOffset.numRetries \| int \| 3 \| Number of times to retry before giving up fatch Kafka latest offsets. fetchOffset.retryIntervalMs \| long \| 10 \| milliseconds to wait before retrying to fetch Kafka offsets Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")` ### Usage * Subscribe to 1 topic ```Scala spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "host:port") .option("subscribe", "topic1") .load() ``` * Subscribe to multiple topics ```Scala spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "host:port") .option("subscribe", "topic1,topic2") .load() ``` * Subscribe to a pattern ```Scala spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "host:port") .option("subscribePattern", "topic.*") .load() ``` ## How was this patch tested? The new unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Author: Tathagata Das <tathagata.das1565@gmail.com> Author: Shixiong Zhu <zsxwing@gmail.com> Author: cody koeninger <cody@koeninger.org> Closes #15102 from zsxwing/kafka-source.	2016-10-05 16:45:45 -07:00
Shivaram Venkataraman	7c382524a9	[SPARK-17651][SPARKR] Set R package version number along with mvn ## What changes were proposed in this pull request? This PR sets the R package version while tagging releases. Note that since R doesn't accept `-SNAPSHOT` in version number field, we remove that while setting the next version ## How was this patch tested? Tested manually by running locally Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #15223 from shivaram/sparkr-version-change.	2016-09-23 14:35:18 -07:00
hyukjinkwon	25a020be99	[SPARK-17583][SQL] Remove uesless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV ## What changes were proposed in this pull request? This PR includes the changes below: 1. Upgrade Univocity library from 2.1.1 to 2.2.1 This includes some performance improvement and also enabling auto-extending buffer in `maxCharsPerColumn` option in CSV. Please refer the [release notes](https://github.com/uniVocity/univocity-parsers/releases). 2. Remove useless `rowSeparator` variable existing in `CSVOptions` We have this unused variable in [CSVOptions.scala#L127](`29952ed096/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala (L127)`) but it seems possibly causing confusion that it actually does not care of `\r\n`. For example, we have an issue open about this, [SPARK-17227](https://issues.apache.org/jira/browse/SPARK-17227), describing this variable. This variable is virtually not being used because we rely on `LineRecordReader` in Hadoop which deals with only both `\n` and `\r\n`. 3. Set the default value of `maxCharsPerColumn` to auto-expending. We are setting 1000000 for the length of each column. It'd be more sensible we allow auto-expending rather than fixed length by default. To make sure, using `-1` is being described in the release note, [2.2.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.2.0). ## How was this patch tested? N/A Author: hyukjinkwon <gurwls223@gmail.com> Closes #15138 from HyukjinKwon/SPARK-17583.	2016-09-21 10:35:29 +01:00
Reynold Xin	dca771bec6	[SPARK-17558] Bump Hadoop 2.7 version from 2.7.2 to 2.7.3 ## What changes were proposed in this pull request? This patch bumps the Hadoop version in hadoop-2.7 profile from 2.7.2 to 2.7.3, which was recently released and contained a number of bug fixes. ## How was this patch tested? The change should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #15115 from rxin/SPARK-17558.	2016-09-16 11:24:26 -07:00
Adam Roberts	0ad8eeb4d3	[SPARK-17379][BUILD] Upgrade netty-all to 4.0.41 final for bug fixes ## What changes were proposed in this pull request? Upgrade netty-all to latest in the 4.0.x line which is 4.0.41, mentions several bug fixes and performance improvements we may find useful, see netty.io/news/2016/08/29/4-0-41-Final-4-1-5-Final.html. Initially tried to use 4.1.5 but noticed it's not backwards compatible. ## How was this patch tested? Existing unit tests against branch-1.6 and branch-2.0 using IBM Java 8 on Intel, Power and Z architectures Author: Adam Roberts <aroberts@uk.ibm.com> Closes #14961 from a-roberts/netty.	2016-09-15 10:40:10 -07:00
hyukjinkwon	78d5d4dd5c	[SPARK-17200][PROJECT INFRA][BUILD][SPARKR] Automate building and testing on Windows (currently SparkR only) ## What changes were proposed in this pull request? This PR adds the build automation on Windows with [AppVeyor](https://www.appveyor.com/) CI tool. Currently, this only runs the tests for SparkR as we have been having some issues with testing Windows-specific PRs (e.g. https://github.com/apache/spark/pull/14743 and https://github.com/apache/spark/pull/13165) and hard time to verify this. One concern is, this build is dependent on [steveloughran/winutils](https://github.com/steveloughran/winutils) for pre-built Hadoop bin package (who is a Hadoop PMC member). ## How was this patch tested? Manually, https://ci.appveyor.com/project/HyukjinKwon/spark/build/88-SPARK-17200-build-profile This takes roughly 40 mins. Some tests are already being failed and this was found in https://github.com/apache/spark/pull/14743#issuecomment-241405287. Author: hyukjinkwon <gurwls223@gmail.com> Closes #14859 from HyukjinKwon/SPARK-17200-build.	2016-09-08 08:26:59 -07:00
Adam Roberts	6c08dbf683	[SPARK-17378][BUILD] Upgrade snappy-java to 1.1.2.6 ## What changes were proposed in this pull request? Upgrades the Snappy version to 1.1.2.6 from 1.1.2.4, release notes: https://github.com/xerial/snappy-java/blob/master/Milestone.md mention "Fix a bug in SnappyInputStream when reading compressed data that happened to have the same first byte with the stream magic header (#142)" ## How was this patch tested? Existing unit tests using the latest IBM Java 8 on Intel, Power and Z architectures (little and big-endian) Author: Adam Roberts <aroberts@uk.ibm.com> Closes #14958 from a-roberts/master.	2016-09-06 22:13:25 +01:00
Sean Owen	536fa911c1	[SPARK-17329][BUILD] Don't build PRs with -Pyarn unless YARN code changed ## What changes were proposed in this pull request? Only build PRs with -Pyarn if YARN code was modified. ## How was this patch tested? Jenkins tests (will look to verify whether -Pyarn was included in the PR builder for this one.) Author: Sean Owen <sowen@cloudera.com> Closes #14892 from srowen/SPARK-17329.	2016-09-01 09:10:01 +01:00
Michael Gummelt	0611b3a2bf	[SPARK-17320] add build_profile_flags entry to mesos build module ## What changes were proposed in this pull request? add build_profile_flags entry to mesos build module ## How was this patch tested? unit tests Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14885 from mgummelt/mesos-profile.	2016-08-31 10:17:05 -07:00
Ferdinand Xu	4b4e329e49	[SPARK-5682][CORE] Add encrypted shuffle in spark This patch is using Apache Commons Crypto library to enable shuffle encryption support. Author: Ferdinand Xu <cheng.a.xu@intel.com> Author: kellyzly <kellyzly@126.com> Closes #8880 from winningsix/SPARK-10771.	2016-08-30 09:15:31 -07:00
frreiss	8fb445d9bd	[SPARK-17303] Added spark-warehouse to dev/.rat-excludes ## What changes were proposed in this pull request? Excludes the `spark-warehouse` directory from the Apache RAT checks that src/run-tests performs. `spark-warehouse` is created by some of the Spark SQL tests, as well as by `bin/spark-sql`. ## How was this patch tested? Ran src/run-tests twice. The second time, the script failed because the first iteration Made the change in this PR. Ran src/run-tests a third time; RAT checks succeeded. Author: frreiss <frreiss@us.ibm.com> Closes #14870 from frreiss/fred-17303.	2016-08-29 23:33:00 -07:00
Michael Gummelt	8e5475be3c	[SPARK-16967] move mesos to module ## What changes were proposed in this pull request? Move Mesos code into a mvn module ## How was this patch tested? unit tests manually submitting a client mode and cluster mode job spark/mesos integration test suite Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14637 from mgummelt/mesos-module.	2016-08-26 12:25:22 -07:00
Sean Owen	0b3a4be92c	[SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same java used in the spark environment ## What changes were proposed in this pull request? Update to py4j 0.10.3 to enable JAVA_HOME support ## How was this patch tested? Pyspark tests Author: Sean Owen <sowen@cloudera.com> Closes #14748 from srowen/SPARK-16781.	2016-08-24 20:04:09 +01:00
jerryshao	ab648c0004	[SPARK-14743][YARN] Add a configurable credential manager for Spark running on YARN ## What changes were proposed in this pull request? Add a configurable token manager for Spark on running on yarn. ### Current Problems ### 1. Supported token provider is hard-coded, currently only hdfs, hbase and hive are supported and it is impossible for user to add new token provider without code changes. 2. Also this problem exits in timely token renewer and updater. ### Changes In This Proposal ### In this proposal, to address the problems mentioned above and make the current code more cleaner and easier to understand, mainly has 3 changes: 1. Abstract a `ServiceTokenProvider` as well as `ServiceTokenRenewable` interface for token provider. Each service wants to communicate with Spark through token way needs to implement this interface. 2. Provide a `ConfigurableTokenManager` to manage all the register token providers, also token renewer and updater. Also this class offers the API for other modules to obtain tokens, get renewal interval and so on. 3. Implement 3 built-in token providers `HDFSTokenProvider`, `HiveTokenProvider` and `HBaseTokenProvider` to keep the same semantics as supported today. Whether to load in these built-in token providers is controlled by configuration "spark.yarn.security.tokens.${service}.enabled", by default for all the built-in token providers are loaded. ### Behavior Changes ### For the end user there's no behavior change, we still use the same configuration `spark.yarn.security.tokens.${service}.enabled` to decide which token provider is enabled (hbase or hive). For user implemented token provider (assume the name of token provider is "test") needs to add into this class should have two configurations: 1. `spark.yarn.security.tokens.test.enabled` to true 2. `spark.yarn.security.tokens.test.class` to the full qualified class name. So we still keep the same semantics as current code while add one new configuration. ### Current Status ### - [x] token provider interface and management framework. - [x] implement built-in token providers (hdfs, hbase, hive). - [x] Coverage of unit test. - [x] Integrated test with security cluster. ## How was this patch tested? Unit test and integrated test. Please suggest and review, any comment is greatly appreciated. Author: jerryshao <sshao@hortonworks.com> Closes #14065 from jerryshao/SPARK-16342.	2016-08-10 15:39:30 -07:00
Stefan Schulze	4775eb414f	[SPARK-16770][BUILD] Fix JLine dependency management and version (Sca… ## What changes were proposed in this pull request? As of Scala 2.11.x there is no longer a org.scala-lang:jline version aligned to the scala version itself. Scala console now uses the plain jline:jline module. Spark's dependency management did not reflect this change properly, causing Maven to pull in Jline via transitive dependency. Unfortunately Jline 2.12 contained a minor but very annoying bug rendering the shell almost useless for developers with german keyboard layout. This request contains the following chages: - Exclude transitive dependency 'jline:jline' from hive-exec module - Remove global properties 'jline.version' and 'jline.groupId' - Add both properties and dependency to 'scala-2.11' profile - Add explicit dependency on 'jline:jline' to module 'spark-repl' ## How was this patch tested? - Running mvn dependency:tree and checking for correct Jline version 2.12.1 - Running full builds with assembly and checking for jline-2.12.1.jar in 'lib' folder of generated tarball Author: Stefan Schulze <stefan.schulze@pentasys.de> Closes #14429 from stsc-pentasys/SPARK-16770.	2016-08-03 17:07:10 -07:00
Michael Gummelt	266b92faff	[SPARK-16637] Unified containerizer ## What changes were proposed in this pull request? New config var: spark.mesos.docker.containerizer={"mesos","docker" (default)} This adds support for running docker containers via the Mesos unified containerizer: http://mesos.apache.org/documentation/latest/container-image/ The benefit is losing the dependency on `dockerd`, and all the costs which it incurs. I've also updated the supported Mesos version to 0.28.2 for support of the required protobufs. This is blocked on: https://github.com/apache/spark/pull/14167 ## How was this patch tested? - manually testing jobs submitted with both "mesos" and "docker" settings for the new config var. - spark/mesos integration test suite Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14275 from mgummelt/unified-containerizer.	2016-07-29 05:50:47 -07:00
Adam Roberts	04a2c072d9	[SPARK-16751] Upgrade derby to 10.12.1.1 ## What changes were proposed in this pull request? Version of derby upgraded based on important security info at VersionEye. Test scope added so we don't include it in our final package anyway. NB: I think this should be backported to all previous releases as it is a security problem https://www.versioneye.com/java/org.apache.derby:derby/10.11.1.1 The CVE number is 2015-1832. I also suggest we add a SECURITY tag for JIRAs ## How was this patch tested? Existing tests with the change making sure that we see no new failures. I checked derby 10.12.x and not derby 10.11.x is downloaded to our ~/.m2 folder. I then used dev/make-distribution.sh and checked the dist/jars folder for Spark 2.0: no derby jar is present. I don't know if this would also remove it from the assembly jar in our 1.x branches. Author: Adam Roberts <aroberts@uk.ibm.com> Closes #14379 from a-roberts/patch-4.	2016-07-29 04:43:01 -07:00
Philipp Hoffmann	0869b3a5f0	[SPARK-15271][MESOS] Allow force pulling executor docker images ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Mesos agents by default will not pull docker images which are cached locally already. In order to run Spark executors from mutable tags like `:latest` this commit introduces a Spark setting (`spark.mesos.executor.docker.forcePullImage`). Setting this flag to true will tell the Mesos agent to force pull the docker image (default is `false` which is consistent with the previous implementation and Mesos' default behaviour). Author: Philipp Hoffmann <mail@philipphoffmann.de> Closes #14348 from philipphoffmann/force-pull-image.	2016-07-26 16:09:10 +01:00
Josh Rosen	fc17121d59	Revert "[SPARK-15271][MESOS] Allow force pulling executor docker images" This reverts commit `978cd5f125`.	2016-07-25 12:43:44 -07:00
Philipp Hoffmann	978cd5f125	[SPARK-15271][MESOS] Allow force pulling executor docker images ## What changes were proposed in this pull request? Mesos agents by default will not pull docker images which are cached locally already. In order to run Spark executors from mutable tags like `:latest` this commit introduces a Spark setting `spark.mesos.executor.docker.forcePullImage`. Setting this flag to true will tell the Mesos agent to force pull the docker image (default is `false` which is consistent with the previous implementation and Mesos' default behaviour). ## How was this patch tested? I ran a sample application including this change on a Mesos cluster and verified the correct behaviour for both, with and without, force pulling the executor image. As expected the image is being force pulled if the flag is set. Author: Philipp Hoffmann <mail@philipphoffmann.de> Closes #13051 from philipphoffmann/force-pull-image.	2016-07-25 20:14:47 +01:00
Reynold Xin	dd784a8822	[SPARK-16685] Remove audit-release scripts. ## What changes were proposed in this pull request? This patch removes dev/audit-release. It was initially created to do basic release auditing. They have been unused by for the last one year+. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #14342 from rxin/SPARK-16685.	2016-07-25 20:03:54 +01:00
Yanbo Liang	670891496a	[SPARK-16494][ML] Upgrade breeze version to 0.12 ## What changes were proposed in this pull request? breeze 0.12 has been released for more than half a year, and it brings lots of new features, performance improvement and bug fixes. One of the biggest features is ```LBFGS-B``` which is an implementation of ```LBFGS``` with box constraints and much faster for some special case. We would like to implement Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)) and it requires ```LBFGS-B``` as the optimization solver. So we should bump up the dependent breeze version to 0.12. For more features, improvements and bug fixes of breeze 0.12, you can refer the following link: https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c ## How was this patch tested? No new tests, should pass the existing ones. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14150 from yanboliang/spark-16494.	2016-07-19 12:31:04 +01:00
Shivaram Venkataraman	c33e4b0d96	[SPARK-16507][SPARKR] Add a CRAN checker, fix Rd aliases ## What changes were proposed in this pull request? Add a check-cran.sh script that runs `R CMD check` as CRAN. Also fixes a number of issues pointed out by the check. These include - Updating `DESCRIPTION` to be appropriate - Adding a .Rbuildignore to ignore lintr, src-native, html that are non-standard files / dirs - Adding aliases to all S4 methods in DataFrame, Column, GroupedData etc. This is required as stated in https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Documenting-S4-classes-and-methods - Other minor fixes ## How was this patch tested? SparkR unit tests, running the above mentioned script Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #14173 from shivaram/sparkr-cran-changes.	2016-07-16 17:06:44 -07:00
Kazuaki Ishizaki	f12a38b2db	[SPARK-15467][BUILD] update janino version to 3.0.0 ## What changes were proposed in this pull request? This PR updates version of Janino compiler from 2.7.8 to 3.0.0. This version fixes [an Janino issue](https://github.com/janino-compiler/janino/issues/1) that fixes [an issue](https://issues.apache.org/jira/browse/SPARK-15467), which throws Java exception, in Spark. ## How was this patch tested? Manually tested using a program in [the JIRA entry](https://issues.apache.org/jira/browse/SPARK-15467) Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #14127 from kiszk/SPARK-15467.	2016-07-10 17:58:27 -07:00
Yin Huai	60ba436b70	[SPARK-16453][BUILD] release-build.sh is missing hive-thriftserver for scala 2.10 ## What changes were proposed in this pull request? This PR adds hive-thriftserver profile to scala 2.10 build created by release-build.sh. Author: Yin Huai <yhuai@databricks.com> Closes #14108 from yhuai/SPARK-16453.	2016-07-08 15:56:46 -07:00
Josh Rosen	acef843f67	[SPARK-15975] Fix improper Popen retcode code handling in dev/run-tests In the `dev/run-tests.py` script we check a `Popen.retcode` for success using `retcode > 0`, but this is subtlety wrong because Popen's return code will be negative if the child process was terminated by a signal: https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode In order to properly handle signals, we should change this to check `retcode != 0` instead. Author: Josh Rosen <joshrosen@databricks.com> Closes #13692 from JoshRosen/dev-run-tests-return-code-handling.	2016-06-16 14:18:58 -07:00
Shixiong Zhu	0ee9fd9e52	[SPARK-15935][PYSPARK] Fix a wrong format tag in the error message ## What changes were proposed in this pull request? A follow up PR for #13655 to fix a wrong format tag. ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13665 from zsxwing/fix.	2016-06-14 19:45:11 -07:00
Adam Roberts	a431e3f1f8	[SPARK-15821][DOCS] Include parallel build info ## What changes were proposed in this pull request? We should mention that users can build Spark using multiple threads to decrease build times; either here or in "Building Spark" ## How was this patch tested? Built on machines with between one core to 192 cores using mvn -T 1C and observed faster build times with no loss in stability In response to the question here https://issues.apache.org/jira/browse/SPARK-15821 I think we should suggest this option as we know it works for Spark and can result in faster builds Author: Adam Roberts <aroberts@uk.ibm.com> Closes #13562 from a-roberts/patch-3.	2016-06-14 13:59:01 +01:00
Shixiong Zhu	96c3500c66	[SPARK-15935][PYSPARK] Enable test for sql/streaming.py and fix these tests ## What changes were proposed in this pull request? This PR just enables tests for sql/streaming.py and also fixes the failures. ## How was this patch tested? Existing unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13655 from zsxwing/python-streaming-test.	2016-06-14 02:12:29 -07:00
Adam Roberts	147c020823	[SPARK-15818][BUILD] Upgrade to Hadoop 2.7.2 ## What changes were proposed in this pull request? Updating the Hadoop version from 2.7.0 to 2.7.2 if we use the Hadoop-2.7 build profile ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Existing tests (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) I'd like us to use Hadoop 2.7.2 owing to the Hadoop release notes stating Hadoop 2.7.0 is not ready for production use https://hadoop.apache.org/docs/r2.7.0/ states "Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.6.0. This release is not yet ready for production use. Production users should use 2.7.1 release and beyond." Hadoop 2.7.1 release notes: "Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building upon the previous release 2.7.0. This is the next stable release after Apache Hadoop 2.6.x." And then Hadoop 2.7.2 release notes: "Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building upon the previous stable release 2.7.1." I've tested this is OK with Intel hardware and IBM Java 8 so let's test it with OpenJDK, ideally this will be pushed to branch-2.0 and master. Author: Adam Roberts <aroberts@uk.ibm.com> Closes #13556 from a-roberts/patch-2.	2016-06-09 10:34:01 +01:00
Josh Rosen	921fa40b14	[SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 cache This patch fixes a bug in `./dev/test-dependencies.sh` which caused spurious failures when the script was run on a machine with an empty `.m2` cache. The problem was that extra log output from the dependency download was conflicting with the grep / regex used to identify the classpath in the Maven output. This patch fixes this issue by adjusting the regex pattern. Tested manually with the following reproduction of the bug: ``` rm -rf ~/.m2/repository/org/apache/commons/ ./dev/test-dependencies.sh ``` Author: Josh Rosen <joshrosen@databricks.com> Closes #13568 from JoshRosen/SPARK-12712.	2016-06-09 00:51:24 -07:00
Sandeep Singh	f958c1c3e2	[MINOR] Fix Java Lint errors introduced by #13286 and #13280 ## What changes were proposed in this pull request? revived #13464 Fix Java Lint errors introduced by #13286 and #13280 Before: ``` Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0 Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]$'. [ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]$'. [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType. [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type. ``` ## How was this patch tested? ran `dev/lint-java` locally Author: Sandeep Singh <sandeep@techaddict.me> Closes #13559 from techaddict/minor-3.	2016-06-08 14:51:00 +01:00
Shixiong Zhu	9a74de18a1	Revert "[SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work ## What changes were proposed in this pull request? This reverts commit `c24b6b679c`. Sent a PR to run Jenkins tests due to the revert conflicts of `dev/deps/spark-deps-hadoop*`. ## How was this patch tested? Jenkins unit tests, integration tests, manual tests) Author: Shixiong Zhu <shixiong@databricks.com> Closes #13417 from zsxwing/revert-SPARK-11753.	2016-05-31 14:50:07 -07:00
Ryan Blue	776d183c82	[SPARK-9876][SQL] Update Parquet to 1.8.1. ## What changes were proposed in this pull request? This includes minimal changes to get Spark using the current release of Parquet, 1.8.1. ## How was this patch tested? This uses the existing Parquet tests. Author: Ryan Blue <blue@apache.org> Closes #13280 from rdblue/SPARK-9876-update-parquet.	2016-05-27 16:59:38 -07:00
Villu Ruusmann	6d506c9ae9	[SPARK-15523][ML][MLLIB] Update JPMML to 1.2.15 ## What changes were proposed in this pull request? See https://issues.apache.org/jira/browse/SPARK-15523 This PR replaces PR #13293. It's isolated to a new branch, and contains some more squashed changes. ## How was this patch tested? 1. Executed `mvn clean package` in `mllib` directory 2. Executed `dev/test-dependencies.sh --replace-manifest` in the root directory. Author: Villu Ruusmann <villu.ruusmann@gmail.com> Closes #13297 from vruusmann/update-jpmml.	2016-05-26 08:11:34 -05:00
Herman van Hovell	527499b624	[SPARK-15525][SQL][BUILD] Upgrade ANTLR4 SBT plugin ## What changes were proposed in this pull request? The ANTLR4 SBT plugin has been moved from its own repo to one on bintray. The version was also changed from `0.7.10` to `0.7.11`. The latter actually broke our build (ihji has fixed this by also adding `0.7.10` and others to the bin-tray repo). This PR upgrades the SBT-ANTLR4 plugin and ANTLR4 to their most recent versions (`0.7.11`/`4.5.3`). I have also removed a few obsolete build configurations. ## How was this patch tested? Manually running SBT/Maven builds. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #13299 from hvanhovell/SPARK-15525.	2016-05-25 15:35:38 -07:00
Jurriaan Pruis	c875d81a3d	[SPARK-15493][SQL] default QuoteEscapingEnabled flag to true when writing CSV ## What changes were proposed in this pull request? Default QuoteEscapingEnabled flag to true when writing CSV and add an escapeQuotes option to be able to change this. See `f3eb2af263/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java (L231-L247)` This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2) https://issues.apache.org/jira/browse/SPARK-15493 ## How was this patch tested? Added a test that verifies the output is quoted correctly. Author: Jurriaan Pruis <email@jurriaanpruis.nl> Closes #13267 from jurriaan/quote-escaping.	2016-05-25 12:40:16 -07:00
Liang-Chi Hsieh	c24b6b679c	[SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work ## What changes were proposed in this pull request? Jackson suppprts `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity", "INF". Currently used Jackson version (2.5.3) doesn't support it all. This patch upgrades the library and make the two ignored tests in `JsonParsingOptionsSuite` passed. ## How was this patch tested? `JsonParsingOptionsSuite`. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9759 from viirya/fix-json-nonnumric.	2016-05-24 09:43:39 -07:00
Reynold Xin	45b7557e61	[SPARK-15424][SPARK-15437][SPARK-14807][SQL] Revert Create a hivecontext-compatibility module ## What changes were proposed in this pull request? I initially asked to create a hivecontext-compatibility module to put the HiveContext there. But we are so close to Spark 2.0 release and there is only a single class in it. It seems overkill to have an entire package, which makes it more inconvenient, for a single class. ## How was this patch tested? Tests were moved. Author: Reynold Xin <rxin@databricks.com> Closes #13207 from rxin/SPARK-15424.	2016-05-20 22:01:55 -07:00
Sameer Agarwal	a78d6ce376	[SPARK-15078] [SQL] Add all TPCDS 1.4 benchmark queries for SparkSQL ## What changes were proposed in this pull request? Now that SparkSQL supports all TPC-DS queries, this patch adds all 99 benchmark queries inside SparkSQL. ## How was this patch tested? Benchmark only Author: Sameer Agarwal <sameer@databricks.com> Closes #13188 from sameeragarwal/tpcds-all.	2016-05-20 15:19:28 -07:00
DB Tsai	e2efe0529a	[SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based algorithms ## What changes were proposed in this pull request? Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis. ## How was this patch tested? Unit tests Author: DB Tsai <dbt@netflix.com> Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Xiangrui Meng <meng@databricks.com> Closes #12627 from dbtsai/SPARK-14615-NewML.	2016-05-17 12:51:07 -07:00
Sean Owen	122302cbf5	[SPARK-15290][BUILD] Move annotations, like @Since / @DeveloperApi, into spark-tags ## What changes were proposed in this pull request? (See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was the apparently problem last time.) Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags` ## How was this patch tested? Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #13074 from srowen/SPARK-15290.	2016-05-17 09:55:53 +01:00
Sean Owen	fabc8e5b12	[SPARK-12972][CORE][TEST-MAVEN][TEST-HADOOP2.2] Update org.apache.httpcomponents.httpclient, commons-io ## What changes were proposed in this pull request? This is sort of a hot-fix for https://github.com/apache/spark/pull/13117, but, the problem is limited to Hadoop 2.2. The change is to manage `commons-io` to 2.4 for all Hadoop builds, which is only a net change for Hadoop 2.2, which was using 2.1. ## How was this patch tested? Jenkins tests -- normal PR builder, then the `[test-hadoop2.2] [test-maven]` if successful. Author: Sean Owen <sowen@cloudera.com> Closes #13132 from srowen/SPARK-12972.3.	2016-05-16 16:27:04 +01:00
Sean Owen	f5576a052d	[SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient ## What changes were proposed in this pull request? (Retry of https://github.com/apache/spark/pull/13049) - update to httpclient 4.5 / httpcore 4.4 - remove some defunct exclusions - manage httpmime version to match - update selenium / httpunit to support 4.5 (possible now that Jetty 9 is used) ## How was this patch tested? Jenkins tests. Also, locally running the same test command of one Jenkins profile that failed: `mvn -Phadoop-2.6 -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl ...` Author: Sean Owen <sowen@cloudera.com> Closes #13117 from srowen/SPARK-12972.2.	2016-05-15 15:56:46 +01:00
Sean Owen	10a8389674	Revert "[SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient" This reverts commit `c74a6c3f23`.	2016-05-13 13:50:26 +01:00
Sean Owen	c74a6c3f23	[SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient ## What changes were proposed in this pull request? - update httpcore/httpclient to latest - centralize version management - remove excludes that are no longer relevant according to SBT/Maven dep graphs - also manage httpmime to match httpclient ## How was this patch tested? Jenkins tests, plus review of dependency graphs from SBT/Maven, and review of test-dependencies.sh output Author: Sean Owen <sowen@cloudera.com> Closes #13049 from srowen/SPARK-12972.	2016-05-13 09:00:50 +01:00
Holden Karau	382dbc12bb	[SPARK-15061][PYSPARK] Upgrade to Py4J 0.10.1 ## What changes were proposed in this pull request? This upgrades to Py4J 0.10.1 which reduces syscal overhead in Java gateway ( see https://github.com/bartdag/py4j/issues/201 ). Related https://issues.apache.org/jira/browse/SPARK-6728 . ## How was this patch tested? Existing doctests & unit tests pass Author: Holden Karau <holden@us.ibm.com> Closes #13064 from holdenk/SPARK-15061-upgrade-to-py4j-0.10.1.	2016-05-13 08:59:18 +01:00
bomeng	81bf870848	[SPARK-14897][SQL] upgrade to jetty 9.2.16 ## What changes were proposed in this pull request? Since Jetty 8 is EOL (end of life) and has critical security issue [http://www.securityweek.com/critical-vulnerability-found-jetty-web-server], I think upgrading to 9 is necessary. I am using latest 9.2 since 9.3 requires Java 8+. `javax.servlet` and `derby` were also upgraded since Jetty 9.2 needs corresponding version. ## How was this patch tested? Manual test and current test cases should cover it. Author: bomeng <bmeng@us.ibm.com> Closes #12916 from bomeng/SPARK-14897.	2016-05-12 20:07:44 +01:00
Sean Zhong	33c6eb5218	[SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView ## What changes were proposed in this pull request? Deprecates registerTempTable and add dataset.createTempView, dataset.createOrReplaceTempView. ## How was this patch tested? Unit tests. Author: Sean Zhong <seanzhong@databricks.com> Closes #12945 from clockfly/spark-15171.	2016-05-12 15:51:53 +08:00
Sandeep Singh	db573fc743	[SPARK-15072][SQL][PYSPARK] FollowUp: Remove SparkSession.withHiveSupport in PySpark ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/12851 Remove `SparkSession.withHiveSupport` in PySpark and instead use `SparkSession.builder. enableHiveSupport` ## How was this patch tested? Existing tests. Author: Sandeep Singh <sandeep@techaddict.me> Closes #13063 from techaddict/SPARK-15072-followup.	2016-05-11 17:44:00 -07:00
cody koeninger	89e67d6667	[SPARK-15085][STREAMING][KAFKA] Rename streaming-kafka artifact ## What changes were proposed in this pull request? Renaming the streaming-kafka artifact to include kafka version, in anticipation of needing a different artifact for later kafka versions ## How was this patch tested? Unit tests Author: cody koeninger <cody@koeninger.org> Closes #12946 from koeninger/SPARK-15085.	2016-05-11 12:15:41 -07:00
hyukjinkwon	ac12b35d31	[SPARK-15148][SQL] Upgrade Univocity library from 2.0.2 to 2.1.0 ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-15148 Mainly it improves the performance roughtly about 30%-40% according to the [release note](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.1.0). For the details of the purpose is described in the JIRA. This PR upgrades Univocity library from 2.0.2 to 2.1.0. ## How was this patch tested? Existing tests should cover this. Author: hyukjinkwon <gurwls223@gmail.com> Closes #12923 from HyukjinKwon/SPARK-15148.	2016-05-05 11:26:40 -07:00
mcheah	b7fdc23ccc	[SPARK-12154] Upgrade to Jersey 2 ## What changes were proposed in this pull request? Replace com.sun.jersey with org.glassfish.jersey. Changes to the Spark Web UI code were required to compile. The changes were relatively standard Jersey migration things. ## How was this patch tested? I did a manual test for the standalone web APIs. Although I didn't test the functionality of the security filter itself, the code that changed non-trivially is how we actually register the filter. I attached a debugger to the Spark master and verified that the SecurityFilter code is indeed invoked upon hitting /api/v1/applications. Author: mcheah <mcheah@palantir.com> Closes #12715 from mccheah/feature/upgrade-jersey.	2016-05-05 10:51:03 +01:00
Lining Sun	592fc45563	[SPARK-15123] upgrade org.json4s to 3.2.11 version ## What changes were proposed in this pull request? We had the issue when using snowplow in our Spark applications. Snowplow requires json4s version 3.2.11 while Spark still use a few years old version 3.2.10. The change is to upgrade json4s jar to 3.2.11. ## How was this patch tested? We built Spark jar and successfully ran our applications in local and cluster modes. Author: Lining Sun <lining@gmail.com> Closes #12901 from liningalex/master.	2016-05-05 10:47:39 +01:00
Dongjoon Hyun	a744457076	[SPARK-15053][BUILD] Fix Java Lint errors on Hive-Thriftserver module ## What changes were proposed in this pull request? This issue fixes or hides 181 Java linter errors introduced by SPARK-14987 which copied hive service code from Hive. We had better clean up these errors before releasing Spark 2.0. - Fix UnusedImports (15 lines), RedundantModifier (14 lines), SeparatorWrap (9 lines), MethodParamPad (6 lines), FileTabCharacter (5 lines), ArrayTypeStyle (3 lines), ModifierOrder (3 lines), RedundantImport (1 line), CommentsIndentation (1 line), UpperEll (1 line), FallThrough (1 line), OneStatementPerLine (1 line), NewlineAtEndOfFile (1 line) errors. - Ignore `LineLength` errors under `hive/service/*` (118 lines). - Ignore `MethodName` error in `PasswdAuthenticationProvider.java` (1 line). - Ignore `NoFinalizer` error in `ThreadWithGarbageCleanup.java` (1 line). ## How was this patch tested? After passing Jenkins building, run `dev/lint-java` manually. ```bash $ dev/lint-java Checkstyle checks passed. ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12831 from dongjoon-hyun/SPARK-15053.	2016-05-03 12:39:37 +01:00
Andrew Or	a7d0fedc94	[SPARK-14988][PYTHON] SparkSession catalog and conf API ## What changes were proposed in this pull request? The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the python API. ## How was this patch tested? Python tests. Author: Andrew Or <andrew@databricks.com> Closes #12765 from andrewor14/python-spark-session-more.	2016-04-29 09:34:10 -07:00
Davies Liu	7feeb82cb7	[SPARK-14987][SQL] inline hive-service (cli) into sql/hive-thriftserver ## What changes were proposed in this pull request? This PR copy the thrift-server from hive-service-1.2 (including TCLIService.thrift and generated Java source code) into sql/hive-thriftserver, so we can do further cleanup and improvements. ## How was this patch tested? Existing tests. Author: Davies Liu <davies@databricks.com> Closes #12764 from davies/thrift_server.	2016-04-29 09:32:42 -07:00
Yin Huai	9c7c42bc6a	Revert "[SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local" This reverts commit `dae538a4d7`.	2016-04-28 19:57:41 -07:00
Pravin Gadakh	dae538a4d7	[SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local ## What changes were proposed in this pull request? This PR adds `since` tag into the matrix and vector classes in spark-mllib-local. ## How was this patch tested? Scala-style checks passed. Author: Pravin Gadakh <prgadakh@in.ibm.com> Closes #12416 from pravingadakh/SPARK-14613.	2016-04-28 15:59:18 -07:00
Dongjoon Hyun	f405de87c8	[SPARK-14867][BUILD] Remove `--force` option in `build/mvn` ## What changes were proposed in this pull request? Currently, `build/mvn` provides a convenient option, `--force`, in order to use the recommended version of maven without changing PATH environment variable. However, there were two problems. - `dev/lint-java` does not use the newly installed maven. ```bash $ ./build/mvn --force clean $ ./dev/lint-java Using `mvn` from path: /usr/local/bin/mvn ``` - It's not easy to type `--force` option always. If '--force' option is used once, we had better prefer the installed maven recommended by Spark. This PR makes `build/mvn` check the existence of maven installed by `--force` option first. According to the comments, this PR aims to the followings: - Detect the maven version from `pom.xml`. - Install maven if there is no or old maven. - Remove `--force` option. ## How was this patch tested? Manual. ```bash $ ./build/mvn --force clean $ ./dev/lint-java Using `mvn` from path: /Users/dongjoon/spark/build/apache-maven-3.3.9/bin/mvn ... $ rm -rf ./build/apache-maven-3.3.9/ $ ./dev/lint-java Using `mvn` from path: /usr/local/bin/mvn ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12631 from dongjoon-hyun/SPARK-14867.	2016-04-27 20:56:23 +01:00
Dongjoon Hyun	c5443560b7	[MINOR][BUILD] Enable RAT checking on `LZ4BlockInputStream.java`. ## What changes were proposed in this pull request? Since `LZ4BlockInputStream.java` is not licensed to Apache Software Foundation (ASF), the Apache License header of that file is not monitored until now. This PR aims to enable RAT checking on `LZ4BlockInputStream.java` by excluding from `dev/.rat-excludes`. This will prevent accidental removal of Apache License header from that file. ## How was this patch tested? Pass the Jenkins tests (Specifically, RAT check stage). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12677 from dongjoon-hyun/minor_rat_exclusion_file.	2016-04-27 09:15:06 +01:00
Andrew Or	3c5e65c339	[SPARK-14721][SQL] Remove HiveContext (part 2) ## What changes were proposed in this pull request? This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class. Note: A couple of things will break after this patch. These will be fixed separately. - the python HiveContext - all the documentation / comments referencing HiveContext - there will be no more HiveContext in the REPL (fixed by #12589) ## How was this patch tested? No change in functionality. Author: Andrew Or <andrew@databricks.com> Closes #12585 from andrewor14/delete-hive-context.	2016-04-25 13:23:05 -07:00
Dongjoon Hyun	d34d650378	[SPARK-14868][BUILD] Enable NewLineAtEofChecker in checkstyle and fix lint-java errors ## What changes were proposed in this pull request? Spark uses `NewLineAtEofChecker` rule in Scala by ScalaStyle. And, most Java code also comply with the rule. This PR aims to enforce the same rule `NewlineAtEndOfFile` by CheckStyle explicitly. Also, this fixes lint-java errors since SPARK-14465. The followings are the items. - Adds a new line at the end of the files (19 files) - Fixes 25 lint-java errors (12 RedundantModifier, 6 ArrayTypeStyle, 2 LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder) ## How was this patch tested? After the Jenkins test succeeds, `dev/lint-java` should pass. (Currently, Jenkins dose not run lint-java.) ```bash $ dev/lint-java Using `mvn` from path: /usr/local/bin/mvn Checkstyle checks passed. ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12632 from dongjoon-hyun/SPARK-14868.	2016-04-24 20:40:03 -07:00
Yin Huai	7dde1da949	[SPARK-14807] Create a compatibility module ## What changes were proposed in this pull request? This PR creates a compatibility module in sql (called `hive-1-x-compatibility`), which will host HiveContext in Spark 2.0 (moving HiveContext to here will be done separately). This module is not included in assembly because only users who still want to access HiveContext need it. ## How was this patch tested? I manually tested `sbt/sbt -Phive package` and `mvn -Phive package -DskipTests`. Author: Yin Huai <yhuai@databricks.com> Closes #12580 from yhuai/compatibility.	2016-04-22 17:50:24 -07:00
hyukjinkwon	ec2a276022	[SPARK-14787][SQL] Upgrade Joda-Time library from 2.9 to 2.9.3 ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14787 The possible problems are described in the JIRA above. Please refer this if you are wondering the purpose of this PR. This PR upgrades Joda-Time library from 2.9 to 2.9.3. ## How was this patch tested? `sbt scalastyle` and Jenkins tests in this PR. closes #11847 Author: hyukjinkwon <gurwls223@gmail.com> Closes #12552 from HyukjinKwon/SPARK-14787.	2016-04-21 11:32:27 +01:00
Hemant Bhanawat	af1f4da762	[SPARK-13904][SCHEDULER] Add support for pluggable cluster manager ## What changes were proposed in this pull request? This commit adds support for pluggable cluster manager. And also allows a cluster manager to clean up tasks without taking the parent process down. To plug a new external cluster manager, ExternalClusterManager trait should be implemented. It returns task scheduler and backend scheduler that will be used by SparkContext to schedule tasks. An external cluster manager is registered using the java.util.ServiceLoader mechanism (This mechanism is also being used to register data sources like parquet, json, jdbc etc.). This allows auto-loading implementations of ExternalClusterManager interface. Currently, when a driver fails, executors exit using system.exit. This does not bode well for cluster managers that would like to reuse the parent process of an executor. Hence, 1. Moving system.exit to a function that can be overriden in subclasses of CoarseGrainedExecutorBackend. 2. Added functionality of killing all the running tasks in an executor. ## How was this patch tested? ExternalClusterManagerSuite.scala was added to test this patch. Author: Hemant Bhanawat <hemant@snappydata.io> Closes #11723 from hbhanawat/pluggableScheduler.	2016-04-16 23:43:32 -07:00
DB Tsai	efaf7d1820	[SPARK-14462][ML][MLLIB] Add the mllib-local build to maven pom ## What changes were proposed in this pull request? In order to separate the linear algebra, and vector matrix classes into a standalone jar, we need to setup the build first. This PR will create a new jar called mllib-local with minimal dependencies. The previous PR was failing the build because of `spark-core:test` dependency, and that was reverted. In this PR, `FunSuite` with `// scalastyle:ignore funsuite` in mllib-local test was used, similar to sketch. Thanks. ## How was this patch tested? Unit tests mengxr tedyu holdenk Author: DB Tsai <dbt@netflix.com> Closes #12298 from dbtsai/dbtsai-mllib-local-build-fix.	2016-04-11 09:35:47 -07:00
Xiangrui Meng	415446cc9b	Revert "[SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom" This reverts commit `1598d11bb0`.	2016-04-09 14:03:03 -07:00
DB Tsai	1598d11bb0	[SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom ## What changes were proposed in this pull request? In order to separate the linear algebra, and vector matrix classes into a standalone jar, we need to setup the build first. This PR will create a new jar called mllib-local with minimal dependencies. The test scope will still depend on spark-core and spark-core-test in order to use the common utilities, but the runtime will avoid any platform dependency. Couple platform independent classes will be moved to this package to demonstrate how this work. ## How was this patch tested? Unit tests Author: DB Tsai <dbt@netflix.com> Closes #12241 from dbtsai/dbtsai-mllib-local-build.	2016-04-09 09:21:12 -07:00
Josh Rosen	906eef4c7a	[SPARK-11416][BUILD] Update to Chill 0.8.0 & Kryo 3.0.3 This patch upgrades Chill to 0.8.0 and Kryo to 3.0.3. While we'll likely need to bump these dependencies again before Spark 2.0 (due to SPARK-14221 / https://github.com/twitter/chill/issues/252), I wanted to get the bulk of the Kryo 2 -> Kryo 3 migration done now in order to figure out whether there are any unexpected surprises. Author: Josh Rosen <joshrosen@databricks.com> Closes #12076 from JoshRosen/kryo3.	2016-04-08 16:35:30 -07:00
hyukjinkwon	725b860e2b	[SPARK-14103][SQL] Parse unescaped quotes in CSV data source. ## What changes were proposed in this pull request? This PR resolves the problem during parsing unescaped quotes in input data. For example, currently the data below: ``` "a"b,ccc,ddd e,f,g ``` produces a data below: - Before ```bash ["a"b,ccc,ddd[\n]e,f,g] <- as a value. ``` - After ```bash ["a"b], [ccc], [ddd] [e], [f], [g] ``` This PR bumps up the Univocity parser's version. This was fixed in `2.0.2`, https://github.com/uniVocity/univocity-parsers/issues/60. ## How was this patch tested? Unit tests in `CSVSuite` and `sbt/sbt scalastyle`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #12226 from HyukjinKwon/SPARK-14103-quote.	2016-04-08 00:28:59 -07:00
Marcelo Vanzin	24d7d2e453	[SPARK-13579][BUILD] Stop building the main Spark assembly. This change modifies the "assembly/" module to just copy needed dependencies to its build directory, and modifies the packaging script to pick those up (and remove duplicate jars packages in the examples module). I also made some minor adjustments to dependencies to remove some test jars from the final packaging, and remove jars that conflict with each other when packaged separately (e.g. servlet api). Also note that this change restores guava in applications' classpaths, even though it's still shaded inside Spark. This is now needed for the Hadoop libraries that are packaged with Spark, which now are not processed by the shade plugin. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11796 from vanzin/SPARK-13579.	2016-04-04 16:52:22 -07:00
Jacek Laskowski	c16a396886	[SPARK-13825][CORE] Upgrade to Scala 2.11.8 ## What changes were proposed in this pull request? Upgrade to 2.11.8 (from the current 2.11.7) ## How was this patch tested? A manual build Author: Jacek Laskowski <jacek@japila.pl> Closes #11681 from jaceklaskowski/SPARK-13825-scala-2_11_8.	2016-04-01 15:21:29 -07:00
Sital Kedia	8de201baed	[SPARK-14277][CORE] Upgrade Snappy Java to 1.1.2.4 ## What changes were proposed in this pull request? Upgrade snappy to 1.1.2.4 to improve snappy read/write performance. ## How was this patch tested? Tested by running a job on the cluster and saw 7.5% cpu savings after this change. Author: Sital Kedia <skedia@fb.com> Closes #12096 from sitalkedia/snappyRelease.	2016-03-31 16:06:44 -07:00
Herman van Hovell	a9b93e0739	[SPARK-14211][SQL] Remove ANTLR3 based parser ### What changes were proposed in this pull request? This PR removes the ANTLR3 based parser, and moves the new ANTLR4 based parser into the `org.apache.spark.sql.catalyst.parser package`. ### How was this patch tested? Existing unit tests. cc rxin andrewor14 yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12071 from hvanhovell/SPARK-14211.	2016-03-31 09:25:09 -07:00
Herman van Hovell	600c0b69ca	[SPARK-13713][SQL] Migrate parser from ANTLR3 to ANTLR4 ### What changes were proposed in this pull request? The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4. This parser is based on the [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQl DDL and some of the DML functionality is currently missing, the plan is to add this in follow-up PRs. This PR is a work in progress, and work needs to be done in the following area's: - [x] Error handling should be improved. - [x] Documentation should be improved. - [x] Multi-Insert needs to be tested. - [ ] Naming and package locations. ### How was this patch tested? Catalyst and SQL unit tests. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #11557 from hvanhovell/ngParser.	2016-03-28 12:31:12 -07:00
Shixiong Zhu	24587ce433	[SPARK-14073][STREAMING][TEST-MAVEN] Move flume back to Spark ## What changes were proposed in this pull request? This PR moves flume back to Spark as per the discussion in the dev mail-list. ## How was this patch tested? Existing Jenkins tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #11895 from zsxwing/move-flume-back.	2016-03-25 17:37:16 -07:00
Holden Karau	55a605763d	[SPARK-13887][PYTHON][TRIVIAL][BUILD] Make lint-python script fail fast ## What changes were proposed in this pull request? Change lint python script to stop on first error rather than building them up so its clearer why we failed (requested by rxin). Also while in the file, remove the commented out code. ## How was this patch tested? Manually ran lint-python script with & without pep8 errors locally and verified expected results. Author: Holden Karau <holden@us.ibm.com> Closes #11898 from holdenk/SPARK-13887-pylint-fast-fail.	2016-03-25 12:53:34 +00:00
Sun Rui	7d1175011c	[SPARK-14074][SPARKR] Specify commit sha1 ID when using install_github to install intr package. ## What changes were proposed in this pull request? In dev/lint-r.R, `install_github` makes our builds depend on a unstable source. This may cause un-expected test failures and then build break. This PR adds a specified commit sha1 ID to `install_github` to get a stable source. ## How was this patch tested? dev/lint-r Author: Sun Rui <rui.sun@intel.com> Closes #11913 from sun-rui/SPARK-14074.	2016-03-23 07:57:03 -07:00
Dongjoon Hyun	20fd254101	[SPARK-14011][CORE][SQL] Enable `LineLength` Java checkstyle rule ## What changes were proposed in this pull request? [Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has 100-character limit on lines, but it's disabled for Java since 11/09/15. This PR enables LineLength checkstyle again. To help that, this also introduces RedundantImport and RedundantModifier, too. The following is the diff on `checkstyle.xml`. ```xml - <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places --> - <!-- <module name="LineLength"> <property name="max" value="100"/> <property name="ignorePattern" value="^package.\|^import.\|a href\|href\|http://\|https://\|ftp://"/> </module> - --> <module name="NoLineWrap"/> <module name="EmptyBlock"> <property name="option" value="TEXT"/> -167,5 +164,7 </module> <module name="CommentsIndentation"/> <module name="UnusedImports"/> + <module name="RedundantImport"/> + <module name="RedundantModifier"/> ``` ## How was this patch tested? Currently, `lint-java` is disabled in Jenkins. It needs a manual test. After passing the Jenkins tests, `dev/lint-java` should passes locally. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11831 from dongjoon-hyun/SPARK-14011.	2016-03-21 07:58:57 +00:00
Josh Rosen	82066a1667	[SPARK-13948] MiMa check should catch if the visibility changes to private MiMa excludes are currently generated using both the current Spark version's classes and Spark 1.2.0's classes, but this doesn't make sense: we should only be ignoring classes which were `private` in the previous Spark version, not classes which became private in the current version. This patch updates `dev/mima` to only generate excludes with respect to the previous artifacts that MiMa checks against. It also updates `MimaBuild` so that `excludeClass` only applies directly to the class being excluded and not to its companion object (since a class and its companion object can have different accessibility). Author: Josh Rosen <joshrosen@databricks.com> Closes #11774 from JoshRosen/SPARK-13948.	2016-03-16 23:02:25 -07:00
Marcelo Vanzin	48978abfa4	[SPARK-13576][BUILD] Don't create assembly for examples. As part of the goal to stop creating assemblies in Spark, this change modifies the mvn and sbt builds to not create an assembly for examples. Instead, dependencies are copied to the build directory (under target/scala-xx/jars), and in the final archive, into the "examples/jars" directory. To avoid having to deal too much with Windows batch files, I made examples run through the launcher library; the spark-submit launcher now has a special mode to run examples, which adds all the necessary jars to the spark-submit command line, and replaces the bash and batch scripts that were used to run examples. The scripts are now just a thin wrapper around spark-submit; another advantage is that now all spark-submit options are supported. There are a few glitches; in the mvn build, a lot of duplicated dependencies get copied, because they are promoted to "compile" scope due to extra dependencies in the examples module (such as HBase). In the sbt build, all dependencies are copied, because there doesn't seem to be an easy way to filter things. I plan to clean some of this up when the rest of the tasks are finished. When the main assembly is replaced with jars, we can remove duplicate jars from the examples directory during packaging. Tested by running SparkPi in: maven build, sbt build, dist created by make-distribution.sh. Finally: note that running the "assembly" target in sbt doesn't build the examples anymore. You need to run "package" for that. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11452 from vanzin/SPARK-13576.	2016-03-15 09:44:51 -07:00
Shixiong Zhu	06dec37455	[SPARK-13843][STREAMING] Remove streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, streaming-twitter to Spark packages ## What changes were proposed in this pull request? Currently there are a few sub-projects, each for integrating with different external sources for Streaming. Now that we have better ability to include external libraries (spark packages) and with Spark 2.0 coming up, we can move the following projects out of Spark to https://github.com/spark-packages - streaming-flume - streaming-akka - streaming-mqtt - streaming-zeromq - streaming-twitter They are just some ancillary packages and considering the overhead of maintenance, running tests and PR failures, it's better to maintain them out of Spark. In addition, these projects can have their different release cycles and we can release them faster. I have already copied these projects to https://github.com/spark-packages ## How was this patch tested? Jenkins tests Author: Shixiong Zhu <shixiong@databricks.com> Closes #11672 from zsxwing/remove-external-pkg.	2016-03-14 16:56:04 -07:00
Josh Rosen	07cb323e7a	[SPARK-13848][SPARK-5185] Update to Py4J 0.9.2 in order to fix classloading issue This patch upgrades Py4J from 0.9.1 to 0.9.2 in order to include a patch which modifies Py4J to use the current thread's ContextClassLoader when performing reflection / class loading. This is necessary in order to fix [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185), a longstanding issue affecting the use of `--jars` and `--packages` in PySpark. In order to demonstrate that the fix works, I removed the workarounds which were added as part of [SPARK-6027](https://issues.apache.org/jira/browse/SPARK-6027) / #4779 and other patches. Py4J diff: https://github.com/bartdag/py4j/compare/0.9.1...0.9.2 /cc zsxwing tdas davies brkyvz Author: Josh Rosen <joshrosen@databricks.com> Closes #11687 from JoshRosen/py4j-0.9.2.	2016-03-14 12:22:02 -07:00
Dongjoon Hyun	473263f959	[SPARK-13834][BUILD] Update sbt and sbt plugins for 2.x. ## What changes were proposed in this pull request? For 2.0.0, we had better make sbt and sbt plugins up-to-date. This PR checks the status of each plugins and bumps the followings. * sbt: 0.13.9 --> 0.13.11 * sbteclipse-plugin: 2.2.0 --> 4.0.0 * sbt-dependency-graph: 0.7.4 --> 0.8.2 * sbt-mima-plugin: 0.1.6 --> 0.1.9 * sbt-revolver: 0.7.2 --> 0.8.0 All other plugins are up-to-date. (Note that `sbt-avro` seems to be change from 0.3.2 to 1.0.1, but it's not published in the repository.) During upgrade, this PR also updated the following MiMa error. Note that the related excluding filter is already registered correctly. It seems due to the change of MiMa exception result. ``` // SPARK-12896 Send only accumulator updates to driver, not TaskMetrics ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulable.this"), -ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulator.this"), +ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.Accumulator.this"), ``` ## How was this patch tested? Pass the Jenkins build. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11669 from dongjoon-hyun/update_mima.	2016-03-13 18:47:04 -07:00
Cheng Lian	6d37e1eb90	[SPARK-13817][BUILD][SQL] Re-enable MiMA and removes object DataFrame ## What changes were proposed in this pull request? PR #11443 temporarily disabled MiMA check, this PR re-enables it. One extra change is that `object DataFrame` is also removed. The only purpose of introducing `object DataFrame` was to use it as an internal factory for creating `Dataset[Row]`. By replacing this internal factory with `Dataset.newDataFrame`, both `DataFrame` and `DataFrame$` are entirely removed from the API, so that we can simply put a `MissingClassProblem` filter in `MimaExcludes.scala` for most DataFrame API changes. ## How was this patch tested? Tested by MiMA check triggered by Jenkins. Author: Cheng Lian <lian@databricks.com> Closes #11656 from liancheng/re-enable-mima.	2016-03-11 22:17:50 +08:00
Josh Rosen	6ca990fb36	[SPARK-13294][PROJECT INFRA] Remove MiMa's dependency on spark-class / Spark assembly This patch removes the need to build a full Spark assembly before running the `dev/mima` script. - I modified the `tools` project to remove a direct dependency on Spark, so `sbt/sbt tools/fullClasspath` will now return the classpath for the `GenerateMIMAIgnore` class itself plus its own dependencies. - This required me to delete two classes full of dead code that we don't use anymore - `GenerateMIMAIgnore` now uses [ClassUtil](http://software.clapper.org/classutil/) to find all of the Spark classes rather than our homemade JAR traversal code. The problem in our own code was that it didn't handle folders of classes properly, which is necessary in order to generate excludes with an assembly-free Spark build. - `./dev/mima` no longer runs through `spark-class`, eliminating the need to reason about classpath ordering between `SPARK_CLASSPATH` and the assembly. Author: Josh Rosen <joshrosen@databricks.com> Closes #11178 from JoshRosen/remove-assembly-in-run-tests.	2016-03-10 23:28:34 -08:00
Cheng Lian	1d542785b9	[SPARK-13244][SQL] Migrates DataFrame to Dataset ## What changes were proposed in this pull request? This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and make `DataFrame` a type alias of `Dataset[Row]`. Most Scala code changes are source compatible, but Java API is broken as Java knows nothing about Scala type alias (mostly replacing `DataFrame` with `Dataset<Row>`). There are several noticeable API changes related to those returning arrays: 1. `collect`/`take` - Old APIs in class `DataFrame`: ```scala def collect(): Array[Row] def take(n: Int): Array[Row] ``` - New APIs in class `Dataset[T]`: ```scala def collect(): Array[T] def take(n: Int): Array[T] def collectRows(): Array[Row] def takeRows(n: Int): Array[Row] ``` Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `DataFrame.collect(): Array[T]` actually returns `Object` instead of `Array<T>` from Java side. Normally, Java users may fall back to `collectAsList` and `takeAsList`. The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here). 1. `randomSplit` - Old APIs in class `DataFrame`: ```scala def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] def randomSplit(weights: Array[Double]): Array[DataFrame] ``` - New APIs in class `Dataset[T]`: ```scala def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]] def randomSplit(weights: Array[Double]): Array[Dataset[T]] ``` Similar problem as above, but hasn't been addressed for Java API yet. We can probably add `randomSplitAsList` to fix this one. 1. `groupBy` Some original `DataFrame.groupBy` methods have conflicting signature with original `Dataset.groupBy` methods. To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`. Other noticeable changes: 1. Dataset always do eager analysis now We used to support disabling DataFrame eager analysis to help reporting partially analyzed malformed logical plan on analysis failure. However, Dataset encoders requires eager analysi during Dataset construction. To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures. This plan is passed by `QueryExecution.assertAnalyzed`. ## How was this patch tested? Existing tests do the work. ## TODO - [ ] Fix all tests - [ ] Re-enable MiMA check - [ ] Update ScalaDoc (`since`, `group`, and example code) Author: Cheng Lian <lian@databricks.com> Author: Yin Huai <yhuai@databricks.com> Author: Wenchen Fan <wenchen@databricks.com> Author: Cheng Lian <liancheng@users.noreply.github.com> Closes #11443 from liancheng/ds-to-df.	2016-03-10 17:00:17 -08:00
Sean Owen	927e22eff8	[SPARK-13663][CORE] Upgrade Snappy Java to 1.1.2.1 ## What changes were proposed in this pull request? Update snappy to 1.1.2.1 to pull in a single fix -- the OOM fix we already worked around. Supersedes https://github.com/apache/spark/pull/11524 ## How was this patch tested? Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #11631 from srowen/SPARK-13663.	2016-03-10 15:17:37 +00:00
Sean Owen	256704c771	[SPARK-13595][BUILD] Move docker, extras modules into external ## What changes were proposed in this pull request? Move `docker` dirs out of top level into `external/`; move `extras/*` into `external/` ## How was this patch tested? This is tested with Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #11523 from srowen/SPARK-13595.	2016-03-09 18:27:44 +00:00
Dongjoon Hyun	7771c7314f	[HOT-FIX][BUILD] Use the new location of `checkstyle-suppressions.xml` ## What changes were proposed in this pull request? This PR fixes `dev/lint-java` and `mvn checkstyle:check` failures due the recent file location change. The following is the error message of current master. ``` Checkstyle checks failed at following occurrences: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check (default-cli) on project spark-parent_2.11: Failed during checkstyle configuration: cannot initialize module SuppressionFilter - Cannot set property 'file' to 'checkstyle-suppressions.xml' in module SuppressionFilter: InvocationTargetException: Unable to find: checkstyle-suppressions.xml -> [Help 1] ``` ## How was this patch tested? Manual. The following command should run correctly. ``` ./dev/lint-java mvn checkstyle:check ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11567 from dongjoon-hyun/hotfix_checkstyle_suppression.	2016-03-08 10:27:52 +00:00
Sean Owen	0eea12a3d9	[SPARK-13596][BUILD] Move misc top-level build files into appropriate subdirs ## What changes were proposed in this pull request? Move many top-level files in dev/ or other appropriate directory. In particular, put `make-distribution.sh` in `dev` and update docs accordingly. Remove deprecated `sbt/sbt`. I was (so far) unable to figure out how to move `tox.ini`. `scalastyle-config.xml` should be movable but edits to the project `.sbt` files didn't work; config file location is updatable for compile but not test scope. ## How was this patch tested? `./dev/run-tests` to verify RAT and checkstyle work. Jenkins tests for the rest. Author: Sean Owen <sowen@cloudera.com> Closes #11522 from srowen/SPARK-13596.	2016-03-07 14:48:02 -08:00
Dongjoon Hyun	941b270b70	[MINOR] Fix typos in comments and testcase name of code ## What changes were proposed in this pull request? This PR fixes typos in comments and testcase name of code. ## How was this patch tested? manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.	2016-03-03 22:42:12 +00:00
Steve Loughran	9a48c656ee	[SPARK-13599][BUILD] remove transitive groovy dependencies from Hive ## What changes were proposed in this pull request? Modifies the dependency declarations of the all the hive artifacts, to explicitly exclude the groovy-all JAR. This stops the groovy classes and everything else in that uber-JAR from getting into spark-assembly JAR. ## How was this patch tested? 1. Pre-patch build was made: `mvn clean install -Pyarn,hive,hive-thriftserver` 1. spark-assembly expanded, observed to have the org.codehaus.groovy packages and JARs 1. A maven dependency tree was created `mvn dependency:tree -Pyarn,hive,hive-thriftserver -Dverbose > target/dependencies.txt` 1. This text file examined to confirm that groovy was being imported as a dependency of `org.spark-project.hive` 1. Patch applied 1. Repeated step1: clean build of project with ` -Pyarn,hive,hive-thriftserver` set 1. Examined created spark-assembly, verified no org.codehaus packages 1. Verified that the maven dependency tree no longer references groovy Note also that the size of the assembly JAR was 181628646 bytes before this patch, 166318515 after —15MB smaller. That's a good metric of things being excluded Author: Steve Loughran <stevel@hortonworks.com> Closes #11449 from steveloughran/fixes/SPARK-13599-groovy-dependency.	2016-03-03 09:35:49 -08:00
Wojciech Jurczyk	75e618def1	Fix run-tests.py typos ## What changes were proposed in this pull request? The PR fixes typos in an error message in dev/run-tests.py. Author: Wojciech Jurczyk <wojciech.jurczyk@codilime.com> Closes #11467 from wjur/wjur/typos_run_tests.	2016-03-02 15:32:32 +00:00
jerryshao	b4d096ded6	[BUILD][MINOR] Fix SBT build error with network-yarn module ## What changes were proposed in this pull request? ``` error] Expected ID character [error] Not a valid command: common (similar: completions) [error] Expected project ID [error] Expected configuration [error] Expected ':' (if selecting a configuration) [error] Expected key [error] Not a valid key: common (similar: commands) [error] common/network-yarn/test ``` `common/network-yarn` is not a valid sbt project, we should change to `network-yarn`. ## How was this patch tested? Locally run the the unit-test. CC rxin , we should either change here, or change the sbt project name. Author: jerryshao <sshao@hortonworks.com> Closes #11456 from jerryshao/build-fix.	2016-03-01 21:28:30 -08:00
Reynold Xin	9e01dcc644	[SPARK-13529][BUILD] Move network/* modules into common/network-* ## What changes were proposed in this pull request? As the title says, this moves the three modules currently in network/ into common/network-*. This removes one top level, non-user-facing folder. ## How was this patch tested? Compilation and existing tests. We should run both SBT and Maven. Author: Reynold Xin <rxin@databricks.com> Closes #11409 from rxin/SPARK-13529.	2016-02-28 17:25:07 -08:00
mark800	ec0cc75e15	[SPARK-7483][MLLIB] Upgrade Chill to 0.7.2 to support Kryo with FPGrowth It registers more Scala classes, including ListBuffer to support Kryo with FPGrowth. See https://github.com/twitter/chill/releases for Chill's change log. Author: mark800 <yky800@126.com> Closes #11041 from mark800/master.	2016-02-27 13:50:37 +00:00
Josh Rosen	f77dc4e1e2	[SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifacts to home.apache.org Due to the people.apache.org -> home.apache.org migration, we need to update our packaging scripts to publish artifacts to the new server. Because the new server only supports sftp instead of ssh, we need to update the scripts to use lftp instead of ssh + rsync. Author: Josh Rosen <joshrosen@databricks.com> Closes #11350 from JoshRosen/update-release-scripts-for-apache-home.	2016-02-26 18:40:00 -08:00
Sean Owen	b84404865b	[SPARK-13324][CORE][BUILD] Update plugin, test, example dependencies for 2.x Phase 1: update plugin versions, test dependencies, some example and third-party versions Author: Sean Owen <sowen@cloudera.com> Closes #11206 from srowen/SPARK-13324.	2016-02-17 19:03:29 -08:00
Holden Karau	64515e5fbf	[SPARK-13154][PYTHON] Add linting for pydocs We should have lint rules using sphinx to automatically catch the pydoc issues that are sometimes introduced. Right now ./dev/lint-python will skip building the docs if sphinx isn't present - but it might make sense to fail hard - just a matter of if we want to insist all PySpark developers have sphinx present. Author: Holden Karau <holden@us.ibm.com> Closes #11109 from holdenk/SPARK-13154-add-pydoc-lint-for-docs.	2016-02-12 02:13:06 -08:00
Luciano Resende	2dbb916440	[SPARK-13189] Cleanup build references to Scala 2.10 Author: Luciano Resende <lresende@apache.org> Closes #11092 from lresende/SPARK-13189.	2016-02-09 11:56:25 -08:00
Josh Rosen	289373b28c	[SPARK-6363][BUILD] Make Scala 2.11 the default Scala version This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds). The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance). After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break. Author: Josh Rosen <joshrosen@databricks.com> Closes #10608 from JoshRosen/SPARK-6363.	2016-01-30 00:20:28 -08:00
Josh Rosen	41f0c85f9b	[SPARK-13023][PROJECT INFRA] Fix handling of root module in modules_to_test() There's a minor bug in how we handle the `root` module in the `modules_to_test()` function in `dev/run-tests.py`: since `root` now depends on `build` (since every test needs to run on any build test), we now need to check for the presence of root in `modules_to_test` instead of `changed_modules`. Author: Josh Rosen <joshrosen@databricks.com> Closes #10933 from JoshRosen/build-module-fix.	2016-01-27 08:32:13 -08:00
Josh Rosen	ee74498de3	[SPARK-8725][PROJECT-INFRA] Test modules in topologically-sorted order in dev/run-tests This patch improves our `dev/run-tests` script to test modules in a topologically-sorted order based on modules' dependencies. This will help to ensure that bugs in upstream projects are not misattributed to downstream projects because those projects' tests were the first ones to exhibit the failure Topological sorting is also useful for shortening the feedback loop when testing pull requests: if I make a change in SQL then the SQL tests should run before MLlib, not after. In addition, this patch also updates our test module definitions to split `sql` into `catalyst`, `sql`, and `hive` in order to allow more tests to be skipped when changing only `hive/` files. Author: Josh Rosen <joshrosen@databricks.com> Closes #10885 from JoshRosen/SPARK-8725.	2016-01-26 14:20:11 -08:00
Holden Karau	a83400135d	[SPARK-10498][TOOLS][BUILD] Add requirements.txt file for dev python tools Minor since so few people use them, but it would probably be good to have a requirements file for our python release tools for easier setup (also version pinning). cc JoshRosen who looked at the original JIRA. Author: Holden Karau <holden@us.ibm.com> Closes #10871 from holdenk/SPARK-10498-add-requirements-file-for-dev-python-tools.	2016-01-24 11:48:28 -08:00
Cheng Lian	1c690ddafa	[SPARK-12933][SQL] Initial implementation of Count-Min sketch This PR adds an initial implementation of count min sketch, contained in a new module spark-sketch under `common/sketch`. The implementation is based on the [`CountMinSketch` class in stream-lib][1]. As required by the [design doc][2], spark-sketch should have no external dependency. Two classes, `Murmur3_x86_32` and `Platform` are copied to spark-sketch from spark-unsafe for hashing facilities. They'll also be used in the upcoming bloom filter implementation. The following features will be added in future follow-up PRs: - Serialization support - DataFrame API integration [1]: `aac6b4d23a/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java` [2]: https://issues.apache.org/jira/secure/attachment/12782378/BloomFilterandCount-MinSketchinSpark2.0.pdf Author: Cheng Lian <lian@databricks.com> Closes #10851 from liancheng/count-min-sketch.	2016-01-23 00:34:55 -08:00
Shixiong Zhu	bc1babd63d	[SPARK-7997][CORE] Remove Akka from Spark Core and Streaming - Remove Akka dependency from core. Note: the streaming-akka project still uses Akka. - Remove HttpFileServer - Remove Akka configs from SparkConf and SSLOptions - Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth to keep this config because using `DirectTaskResult` or `IndirectTaskResult` depends on it. - Update comments and docs Author: Shixiong Zhu <shixiong@databricks.com> Closes #10854 from zsxwing/remove-akka.	2016-01-22 21:20:04 -08:00
Shixiong Zhu	b7d74a602f	[SPARK-7799][SPARK-12786][STREAMING] Add "streaming-akka" project Include the following changes: 1. Add "streaming-akka" project and org.apache.spark.streaming.akka.AkkaUtils for creating an actorStream 2. Remove "StreamingContext.actorStream" and "JavaStreamingContext.actorStream" 3. Update the ActorWordCount example and add the JavaActorWordCount example 4. Make "streaming-zeromq" depend on "streaming-akka" and update the codes accordingly Author: Shixiong Zhu <shixiong@databricks.com> Closes #10744 from zsxwing/streaming-akka-2.	2016-01-20 13:55:41 -08:00
Shixiong Zhu	4bcea1b859	Revert "[SPARK-12829] Turn Java style checker on" This reverts commit `591c88c9e2`. `lint-java` doesn't work on a machine with a clean Maven cache.	2016-01-18 16:26:52 -08:00
Josh Rosen	8dbbf3e75e	[SPARK-12842][TEST-HADOOP2.7] Add Hadoop 2.7 build profile This patch adds a Hadoop 2.7 build profile in order to let us automate tests against that version. /cc rxin srowen Author: Josh Rosen <joshrosen@databricks.com> Closes #10775 from JoshRosen/add-hadoop-2.7-profile.	2016-01-15 17:07:24 -08:00
Reynold Xin	ad1503f92e	[SPARK-12667] Remove block manager's internal "external block store" API This pull request removes the external block store API. This is rarely used, and the file system interface is actually a better, more standard way to interact with external storage systems. There are some other things to remove also, as pointed out by JoshRosen. We will do those as follow-up pull requests. Author: Reynold Xin <rxin@databricks.com> Closes #10752 from rxin/remove-offheap.	2016-01-15 12:03:28 -08:00
Hossein	5f83c6991c	[SPARK-12833][SQL] Initial import of spark-csv CSV is the most common data format in the "small data" world. It is often the first format people want to try when they see Spark on a single node. Having to rely on a 3rd party component for this leads to poor user experience for new users. This PR merges the popular spark-csv data source package (https://github.com/databricks/spark-csv) with SparkSQL. This is a first PR to bring the functionality to spark 2.0 master. We will complete items outlines in the design document (see JIRA attachment) in follow up pull requests. Author: Hossein <hossein@databricks.com> Author: Reynold Xin <rxin@databricks.com> Closes #10766 from rxin/csv.	2016-01-15 11:46:46 -08:00
Reynold Xin	591c88c9e2	[SPARK-12829] Turn Java style checker on It was previously turned off because there was a problem with a pull request. We should turn it on now. Author: Reynold Xin <rxin@databricks.com> Closes #10763 from rxin/SPARK-12829.	2016-01-14 21:02:18 -08:00
Kousuke Saruta	bcc7373f67	[SPARK-12821][BUILD] Style checker should run when some configuration files for style are modified but any source files are not. When running the `run-tests` script, style checkers run only when any source files are modified but they should run when configuration files related to style are modified. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10754 from sarutak/SPARK-12821.	2016-01-14 10:43:39 -08:00
Josh Rosen	97e0c7c5af	[SPARK-9383][PROJECT-INFRA] PR merge script should reset back to previous branch when possible This patch modifies our PR merge script to reset back to a named branch when restoring the original checkout upon exit. When the committer is originally checked out to a detached head, then they will be restored back to that same ref (the same as today's behavior). This is a slightly updated version of #7569, with an extra fix to handle the detached head corner-case. Author: Josh Rosen <joshrosen@databricks.com> Closes #10709 from JoshRosen/SPARK-9383.	2016-01-13 11:56:30 -08:00
Shixiong Zhu	4f60651cbe	[SPARK-12652][PYSPARK] Upgrade Py4J to 0.9.1 - [x] Upgrade Py4J to 0.9.1 - [x] SPARK-12657: Revert SPARK-12617 - [x] SPARK-12658: Revert SPARK-12511 - Still keep the change that only reading checkpoint once. This is a manual change and worth to take a look carefully. `bfd4b5c040` - [x] Verify no leak any more after reverting our workarounds Author: Shixiong Zhu <shixiong@databricks.com> Closes #10692 from zsxwing/py4j-0.9.1.	2016-01-12 14:27:05 -08:00
Josh Rosen	a44991453a	[SPARK-12734][HOTFIX] Build changes must trigger all tests; clean after install in dep tests This patch fixes a build/test issue caused by the combination of #10672 and a latent issue in the original `dev/test-dependencies` script. First, changes which _only_ touched build files were not triggering full Jenkins runs, making it possible for a build change to be merged even though it could cause failures in other tests. The `root` build module now depends on `build`, so all tests will now be run whenever a build-related file is changed. I also added a `clean` step to the Maven install step in `dev/test-dependencies` in order to address an issue where the dummy JARs stuck around and caused "multiple assembly JARs found" errors in tests. /cc zsxwing Author: Josh Rosen <joshrosen@databricks.com> Closes #10704 from JoshRosen/fix-build-test-problems.	2016-01-11 12:56:43 -08:00
BrianLondon	8fe928b4fe	[SPARK-12269][STREAMING][KINESIS] Update aws-java-sdk version The current Spark Streaming kinesis connector references a quite old version 1.9.40 of the AWS Java SDK (1.10.40 is current). Numerous AWS features including Kinesis Firehose are unavailable in 1.9. Those two versions of the AWS SDK in turn require conflicting versions of Jackson (2.4.4 and 2.5.3 respectively) such that one cannot include the current AWS SDK in a project that also uses the Spark Streaming Kinesis ASL. Author: BrianLondon <brian@seatgeek.com> Closes #10256 from BrianLondon/master.	2016-01-11 09:32:06 +00:00
Josh Rosen	f13c7f8f7d	[SPARK-12734][HOTFIX][TEST-MAVEN] Fix bug in Netty exclusions This is a hotfix for a build bug introduced by the Netty exclusion changes in #10672. We can't exclude `io.netty:netty` because Akka depends on it. There's not a direct conflict between `io.netty:netty` and `io.netty:netty-all`, because the former puts classes in the `org.jboss.netty` namespace while the latter uses the `io.netty` namespace. However, there still is a conflict between `org.jboss.netty:netty` and `io.netty:netty`, so we need to continue to exclude the JBoss version of that artifact. While the diff here looks somewhat large, note that this is only a revert of a some of the changes from #10672. You can see the net changes in pom.xml at `3119206b71...5211ab8 (diff-600376dffeb79835ede4a0b285078036)` Author: Josh Rosen <joshrosen@databricks.com> Closes #10693 from JoshRosen/netty-hotfix.	2016-01-11 00:31:29 -08:00
Josh Rosen	3ab0138b0f	[SPARK-12734][BUILD] Fix Netty exclusion and use Maven Enforcer to prevent future bugs Netty classes are published under multiple artifacts with different names, so our build needs to exclude the `io.netty:netty` and `org.jboss.netty:netty` versions of the Netty artifact. However, our existing exclusions were incomplete, leading to situations where duplicate Netty classes would wind up on the classpath and cause compile errors (or worse). This patch fixes the exclusion issue by adding more exclusions and uses Maven Enforcer's [banned dependencies](https://maven.apache.org/enforcer/enforcer-rules/bannedDependencies.html) rule to prevent these classes from accidentally being reintroduced. I also updated `dev/test-dependencies.sh` to run `mvn validate` so that the enforcer rules can run as part of pull request builds. /cc rxin srowen pwendell. I'd like to backport at least the exclusion portion of this fix to `branch-1.5` in order to fix the documentation publishing job, which fails nondeterministically due to incompatible versions of Netty classes taking precedence on the compile-time classpath. Author: Josh Rosen <rosenville@gmail.com> Author: Josh Rosen <joshrosen@databricks.com> Closes #10672 from JoshRosen/enforce-netty-exclusions.	2016-01-10 19:59:01 -08:00
Reynold Xin	5b0d544339	[SPARK-12735] Consolidate & move spark-ec2 to AMPLab managed repository. Author: Reynold Xin <rxin@databricks.com> Closes #10673 from rxin/SPARK-12735.	2016-01-09 20:28:20 -08:00
Herman van Hovell	ea489f14f1	[SPARK-12573][SPARK-12574][SQL] Move SQL Parser from Hive to Catalyst This PR moves a major part of the new SQL parser to Catalyst. This is a prelude to start using this parser for all of our SQL parsing. The following key changes have been made: The ANTLR Parser & Supporting classes have been moved to the Catalyst project. They are now part of the ```org.apache.spark.sql.catalyst.parser``` package. These classes contained quite a bit of code that was originally from the Hive project, I have added aknowledgements whenever this applied. All Hive dependencies have been factored out. I have also taken this chance to clean-up the ```ASTNode``` class, and to improve the error handling. The HiveQl object that provides the functionality to convert an AST into a LogicalPlan has been refactored into three different classes, one for every SQL sub-project: - ```CatalystQl```: This implements Query and Expression parsing functionality. - ```SparkQl```: This is a subclass of CatalystQL and provides SQL/Core only functionality such as Explain and Describe. - ```HiveQl```: This is a subclass of ```SparkQl``` and this adds Hive-only functionality to the parser such as Analyze, Drop, Views, CTAS & Transforms. This class still depends on Hive. cc rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10583 from hvanhovell/SPARK-12575.	2016-01-06 11:16:53 -08:00
felixcheung	cc4d5229c9	[SPARK-12625][SPARKR][SQL] replace R usage of Spark SQL deprecated API rxin davies shivaram Took save mode from my PR #10480, and move everything to writer methods. This is related to PR #10559 - [x] it seems jsonRDD() is broken, need to investigate - this is not a public API though; will look into some more tonight. (fixed) Author: felixcheung <felixcheung_m@hotmail.com> Closes #10584 from felixcheung/rremovedeprecated.	2016-01-04 22:32:07 -08:00
Reynold Xin	77ab49b857	[SPARK-12600][SQL] Remove deprecated methods in Spark SQL Author: Reynold Xin <rxin@databricks.com> Closes #10559 from rxin/remove-deprecated-sql.	2016-01-04 18:02:38 -08:00
Josh Rosen	9fd7a2f024	[SPARK-10359][PROJECT-INFRA] Use more random number in dev/test-dependencies.sh; fix version switching This patch aims to fix another potential source of flakiness in the `dev/test-dependencies.sh` script. pwendell's original patch and my version used `$(date +%s \| tail -c6)` to generate a suffix to use when installing temporary Spark versions into the local Maven cache, but this value only changes once per second and thus is highly collision-prone when concurrent builds launch on AMPLab Jenkins. In order to reduce the potential for conflicts, this patch updates the script to call Python's random number generator instead. I also fixed a bug in how we captured the original project version; the bug was causing the exit handler code to fail. Author: Josh Rosen <joshrosen@databricks.com> Closes #10558 from JoshRosen/build-dep-tests-round-3.	2016-01-04 01:04:29 -08:00
Josh Rosen	0d165ec205	[SPARK-12612][PROJECT-INFRA] Add missing Hadoop profiles to dev/run-tests-.py scripts and dev/deps There are a couple of places in the `dev/run-tests-.py` scripts which deal with Hadoop profiles, but the set of profiles that they handle does not include all Hadoop profiles defined in our POM. Similarly, the `hadoop-2.2` and `hadoop-2.6` profiles were missing from `dev/deps`. This patch updates these scripts to include all four Hadoop profiles defined in our POM. Author: Josh Rosen <joshrosen@databricks.com> Closes #10565 from JoshRosen/add-missing-hadoop-profiles-in-test-scripts.	2016-01-03 22:05:02 -08:00
Reynold Xin	6c20b3c087	Disable test-dependencies.sh.	2016-01-01 13:31:25 -08:00
Josh Rosen	5adec63a92	[SPARK-10359][PROJECT-INFRA] Multiple fixes to dev/test-dependencies.sh script This patch includes multiple fixes for the `dev/test-dependencies.sh` script (which was introduced in #10461): - Use `build/mvn --force` instead of `mvn` in one additional place. - Explicitly set a zero exit code on success. - Set `LC_ALL=C` to make `sort` results agree across machines (see https://stackoverflow.com/questions/28881/). - Set `should_run_build_tests=True` for `build` module (this somehow got lost). Author: Josh Rosen <joshrosen@databricks.com> Closes #10543 from JoshRosen/dep-script-fixes.	2015-12-31 20:23:19 -08:00
Josh Rosen	27a42c7108	[SPARK-10359] Enumerate dependencies in a file and diff against it for new pull requests This patch adds a new build check which enumerates Spark's resolved runtime classpath and saves it to a file, then diffs against that file to detect whether pull requests have introduced dependency changes. The aim of this check is to make it simpler to reason about whether pull request which modify the build have introduced new dependencies or changed transitive dependencies in a way that affects the final classpath. This supplants the checks added in SPARK-4123 / #5093, which are currently disabled due to bugs. This patch is based on pwendell's work in #8531. Closes #8531. Author: Josh Rosen <joshrosen@databricks.com> Author: Patrick Wendell <patrick@databricks.com> Closes #10461 from JoshRosen/SPARK-10359.	2015-12-30 12:47:42 -08:00
Josh Rosen	ab6bedd85d	[SPARK-12508][PROJECT-INFRA] Fix minor bugs in dev/tests/pr_public_classes.sh script This patch fixes a handful of minor bugs in the `dev/tests/pr_public_classes.sh` script, which is used by the `run_tests_jenkins` script to detect the addition of new public classes: - Account for differences between BSD and GNU `sed` in order to allow the script to run on OS X. - Diff `$ghprbActualCommit^...$ghprbActualCommit ` instead of `master...$ghprbActualCommit`: since `ghprbActualCommit` is a merge commit which results from merging the PR into the target branch, this will give us the desired diff and will avoid certain race-conditions which could lead to false-positives. - Use `echo -e` instead of `echo` so that newline characters are handled correctly in output. This should fix a formatting glitch which caused the output to appear on a single line in the GitHub comment (see [the SC2028 page](https://github.com/koalaman/shellcheck/wiki/SC2028) on the Shellcheck wiki for more details). Author: Josh Rosen <joshrosen@databricks.com> Closes #10455 from JoshRosen/fix-pr-public-classes-test.	2015-12-28 10:40:03 -08:00
Kazuaki Ishizaki	9e85bb71ad	[SPARK-12502][BUILD][PYTHON] Script /dev/run-tests fails when IBM Java is used fix an exception with IBM JDK by removing update field from a JavaVersion tuple. This is because IBM JDK does not have information on update '_xx' Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #10463 from kiszk/SPARK-12502.	2015-12-24 21:27:55 +09:00
Reynold Xin	0a38637d05	[SPARK-11807] Remove support for Hadoop < 2.2 i.e. Hadoop 1 and Hadoop 2.0 Author: Reynold Xin <rxin@databricks.com> Closes #10404 from rxin/SPARK-11807.	2015-12-21 22:15:52 -08:00
Reynold Xin	284e29a870	[SPARK-11808] Remove Bagel. Author: Reynold Xin <rxin@databricks.com> Closes #10395 from rxin/SPARK-11808.	2015-12-19 22:40:35 -08:00
Reynold Xin	0c4d6ad873	HOTFIX for the previous hot fix.	2015-12-19 16:55:25 -08:00
Reynold Xin	6ad31e79bf	HOTFIX: Disable Java style test.	2015-12-19 15:30:31 -08:00
Josh Rosen	80a824d36e	[SPARK-12152][PROJECT-INFRA] Speed up Scalastyle checks by only invoking SBT once Currently, `dev/scalastyle` invokes SBT four times, but these invocations can be replaced with a single invocation, saving about one minute of build time. Author: Josh Rosen <joshrosen@databricks.com> Closes #10151 from JoshRosen/speed-up-scalastyle.	2015-12-06 17:35:01 -08:00
Dmitry Erastov	d0d8222778	[SPARK-6990][BUILD] Add Java linting script; fix minor warnings This replaces https://github.com/apache/spark/pull/9696 Invoke Checkstyle and print any errors to the console, failing the step. Use Google's style rules modified according to https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide Some important checks are disabled (see TODOs in `checkstyle.xml`) due to multiple violations being present in the codebase. Suggest fixing those TODOs in a separate PR(s). More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/). Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I run the build twice with different profiles): > Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause. > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions. > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause. > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions. > [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1 Also fix some of the minor violations that didn't require sweeping changes. Apologies for the previous botched PRs - I finally figured out the issue. cr: JoshRosen, pwendell > I state that the contribution is my original work, and I license the work to the project under the project's open source license. Author: Dmitry Erastov <derastov@gmail.com> Closes #9867 from dskrvk/master.	2015-12-04 12:03:45 -08:00
Yin Huai	b9921524d9	[SPARK-12020][TESTS][TEST-HADOOP2.0] PR builder cannot trigger hadoop 2.0 test https://issues.apache.org/jira/browse/SPARK-12020 Author: Yin Huai <yhuai@databricks.com> Closes #10010 from yhuai/SPARK-12020.	2015-11-27 15:11:13 -08:00
Josh Rosen	689386b1c6	[SPARK-7841][BUILD] Stop using retrieveManaged to retrieve dependencies in SBT This patch modifies Spark's SBT build so that it no longer uses `retrieveManaged` / `lib_managed` to store its dependencies. The motivations for this change are nicely described on the JIRA ticket ([SPARK-7841](https://issues.apache.org/jira/browse/SPARK-7841)); my personal interest in doing this stems from the fact that `lib_managed` has caused me some pain while debugging dependency issues in another PR of mine. Removing our use of `lib_managed` would be trivial except for one snag: the Datanucleus JARs, required by Spark SQL's Hive integration, cannot be included in assembly JARs due to problems with merging OSGI `plugin.xml` files. As a result, several places in the packaging and deployment pipeline assume that these Datanucleus JARs are copied to `lib_managed/jars`. In the interest of maintaining compatibility, I have chosen to retain the `lib_managed/jars` directory _only_ for these Datanucleus JARs and have added custom code to `SparkBuild.scala` to automatically copy those JARs to that folder as part of the `assembly` task. `dev/mima` also depended on `lib_managed` in a hacky way in order to set classpaths when generating MiMa excludes; I've updated this to obtain the classpaths directly from SBT instead. /cc dragos marmbrus pwendell srowen Author: Josh Rosen <joshrosen@databricks.com> Closes #9575 from JoshRosen/SPARK-7841.	2015-11-10 10:14:19 -08:00
Josh Rosen	ce5e6a2849	[SPARK-11491] Update build to use Scala 2.10.5 Spark should build against Scala 2.10.5, since that includes a fix for Scaladoc that will fix doc snapshot publishing: https://issues.scala-lang.org/browse/SI-8479 Author: Josh Rosen <joshrosen@databricks.com> Closes #9450 from JoshRosen/upgrade-to-scala-2.10.5.	2015-11-04 16:58:38 -08:00
Jeff Zhang	729f983e66	[SPARK-11342][TESTS] Allow to set hadoop profile when running dev/ru… …n_tests Author: Jeff Zhang <zjffdu@apache.org> Closes #9295 from zjffdu/SPARK-11342.	2015-10-30 18:50:12 +00:00
Brennon York	d3180c25d8	[SPARK-7018][BUILD] Refactor dev/run-tests-jenkins into Python This commit refactors the `run-tests-jenkins` script into Python. This refactoring was done by brennonyork in #7401; this PR contains a few minor edits from joshrosen in order to bring it up to date with other recent changes. From the original PR description (by brennonyork): Currently a few things are left out that, could and I think should, be smaller JIRA's after this. 1. There are still a few areas where we use environment variables where we don't need to (like `CURRENT_BLOCK`). I might get around to fixing this one in lieu of everything else, but wanted to point that out. 2. The PR tests are still written in bash. I opted to not change those and just rewrite the runner into Python. This is a great follow-on JIRA IMO. 3. All of the linting scripts are still in bash as well and would likely do to just add those in as follow-on JIRA's as well. Closes #7401. Author: Brennon York <brennon.york@capitalone.com> Closes #9161 from JoshRosen/run-tests-jenkins-refactoring.	2015-10-18 22:45:27 -07:00
Reynold Xin	0480d6ca83	[SPARK-11169] Remove the extra spaces in merge script Our merge script now turns ``` [SPARK-1234][SPARK-1235][SPARK-1236][SQL] description ``` into ``` [SPARK-1234] [SPARK-1235] [SPARK-1236] [SQL] description ``` The extra spaces are more annoying in git since the first line of a git commit is supposed to be very short. Doctest passes with the following command: ``` python -m doctest merge_spark_pr.py ``` Author: Reynold Xin <rxin@databricks.com> Closes #9156 from rxin/SPARK-11169.	2015-10-18 09:54:38 -07:00
Jakob Odersky	08698ee1d6	[SPARK-11094] Strip extra strings from Java version in test runner Removes any extra strings from the Java version, fixing subsequent integer parsing. This is required since some OpenJDK versions (specifically in Debian testing), append an extra "-internal" string to the version field. Author: Jakob Odersky <jodersky@gmail.com> Closes #9111 from jodersky/fixtestrunner.	2015-10-16 14:26:34 +01:00
Josh Rosen	d0482f6af3	[SPARK-10932] [PROJECT INFRA] Port two minor changes to release-build.sh from scripts' old repo Spark's release packaging scripts used to live in a separate repository. Although these scripts are now part of the Spark repo, there are some minor patches made against the old repos that are missing in Spark's copy of the script. This PR ports those changes. /cc shivaram, who originally submitted these changes against https://github.com/rxin/spark-utils Author: Josh Rosen <joshrosen@databricks.com> Closes #8986 from JoshRosen/port-release-build-fixes-from-rxin-repo.	2015-10-13 15:18:20 -07:00
Marcelo Vanzin	94fc57afdf	[SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8775 from vanzin/SPARK-10300.	2015-10-07 14:11:21 -07:00
Josh Rosen	f1c911552c	[SPARK-10657] Remove SCP-based Jenkins log archiving As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to use our custom SCP-based mechanism for archiving Jenkins logs on the master machine; this has been superseded by the use of a Jenkins plugin which archives the logs and provides public links to view them. Per shaneknapp, we should remove this log syncing mechanism if it is no longer necessary; removing the need to SCP from the Jenkins workers to the masters is a desired step as part of some larger Jenkins infra refactoring. Author: Josh Rosen <joshrosen@databricks.com> Closes #8793 from JoshRosen/remove-jenkins-ssh-to-master.	2015-09-17 11:40:24 -07:00
Luciano Resende	1894653edc	[SPARK-10511] [BUILD] Reset git repository before packaging source distro The calculation of Spark version is downloading Scala and Zinc in the build directory which is inflating the size of the source distribution. Reseting the repo before packaging the source distribution fix this issue. Author: Luciano Resende <lresende@apache.org> Closes #8774 from lresende/spark-10511.	2015-09-16 10:47:30 +01:00
Marcelo Vanzin	b42059d2ef	Revert "[SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py." This reverts commit `8abef21dac`.	2015-09-15 13:03:38 -07:00
Marcelo Vanzin	8abef21dac	[SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py. This change does two things: - tag a few tests and adds the mechanism in the build to be able to disable those tags, both in maven and sbt, for both junit and scalatest suites. - add some logic to run-tests.py to disable some tags depending on what files have changed; that's used to disable expensive tests when a module hasn't explicitly been changed, to speed up testing for changes that don't directly affect those modules. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8437 from vanzin/test-tags.	2015-09-15 10:45:02 -07:00
Holden Karau	48817cc111	[SPARK-10497] [BUILD] [TRIVIAL] Handle both locations for JIRAError with python-jira Location of JIRAError has moved between old and new versions of python-jira package. Longer term it probably makes sense to pin to specific versions (as mentioned in https://issues.apache.org/jira/browse/SPARK-10498 ) but for now, making release tools works with both new and old versions of python-jira. Author: Holden Karau <holden@pigscanfly.ca> Closes #8661 from holdenk/SPARK-10497-release-utils-does-not-work-with-new-jira-python.	2015-09-10 16:42:12 +02:00
Reynold Xin	ae74c3fa84	[RELEASE] Add more contributors & only show names in release notes. Author: Reynold Xin <rxin@databricks.com> Closes #8660 from rxin/contrib.	2015-09-08 17:36:00 -07:00
Patrick Wendell	35e896a79b	SPARK-9545, SPARK-9547: Use Maven in PRB if title contains "[test-maven]" This is just some small glue code to actually make use of the AMPLAB_JENKINS_BUILD_TOOL switch. As far as I can tell, we actually don't currently use the Maven support in the tool even though it exists. This patch switches to Maven when the PR title contains "test-maven". There are a few small other pieces of cleanup in the patch as well. Author: Patrick Wendell <patrick@databricks.com> Closes #7878 from pwendell/maven-tests.	2015-08-30 21:39:16 -07:00
Shivaram Venkataraman	2f99c37273	[SPARK-10328] [SPARKR] Fix generic for na.omit S3 function is at https://stat.ethz.ch/R-manual/R-patched/library/stats/html/na.fail.html Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com> Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8495 from shivaram/na-omit-fix.	2015-08-28 00:37:50 -07:00
Yu ISHIKAWA	1f90c5e219	[SPARK-8505] [SPARKR] Add settings to kick `lint-r` from `./dev/run-test.py` JoshRosen we'd like to check the SparkR source code with the `dev/lint-r` script on the Jenkins. I tried to incorporate the script into `dev/run-test.py`. Could you review it when you have time? shivaram I modified `dev/lint-r` and `dev/lint-r.R` to install lintr package into a local directory(`R/lib/`) and to exit with a lint status. Could you review it? - [[SPARK-8505] Add settings to kick `lint-r` from `./dev/run-test.py` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8505) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #7883 from yu-iskw/SPARK-8505.	2015-08-27 19:38:53 -07:00
Patrick Wendell	de7209c256	HOTFIX: Increase PRB timeout	2015-08-26 12:19:36 -07:00
Josh Rosen	12de348332	[SPARK-10126] [PROJECT INFRA] Fix typo in release-build.sh which broke snapshot publishing for Scala 2.11 The current `release-build.sh` has a typo which breaks snapshot publication for Scala 2.11. We should change the Scala version to 2.11 and clean before building a 2.11 snapshot. Author: Josh Rosen <joshrosen@databricks.com> Closes #8325 from JoshRosen/fix-2.11-snapshots.	2015-08-20 11:31:03 -07:00
Patrick Wendell	3ef0f32928	[SPARK-1517] Refactor release scripts to facilitate nightly publishing This update contains some code changes to the release scripts that allow easier nightly publishing. I've been using these new scripts on Jenkins for cutting and publishing nightly snapshots for the last month or so, and it has been going well. I'd like to get them merged back upstream so this can be maintained by the community. The main changes are: 1. Separates the release tagging from various build possibilities for an already tagged release (`release-tag.sh` and `release-build.sh`). 2. Allow for injecting credentials through the environment, including GPG keys. This is then paired with secure key injection in Jenkins. 3. Support for copying build results to a remote directory, and also "rotating" results, e.g. the ability to keep the last N copies of binary or doc builds. I'm happy if anyone wants to take a look at this - it's not user facing but an internal utility used for generating releases. Author: Patrick Wendell <patrick@databricks.com> Closes #7411 from pwendell/release-script-updates and squashes the following commits: 74f9beb [Patrick Wendell] Moving maven build command to a variable 233ce85 [Patrick Wendell] [SPARK-1517] Refactor release scripts to facilitate nightly publishing	2015-08-11 21:16:48 -07:00
Tathagata Das	600031ebe2	[SPARK-9727] [STREAMING] [BUILD] Updated streaming kinesis SBT project name to be more consistent Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8092 from tdas/SPARK-9727 and squashes the following commits: b1b01fd [Tathagata Das] Updated streaming kinesis project name	2015-08-11 02:41:03 -07:00
Reynold Xin	55752d8832	[SPARK-9810] [BUILD] Remove individual commit messages from the squash commit message For more information, please see the JIRA ticket and the associated dev list discussion. https://issues.apache.org/jira/browse/SPARK-9810 http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-Removing-individual-commit-messages-from-the-squash-commit-message-td13295.html Author: Reynold Xin <rxin@databricks.com> Closes #8091 from rxin/SPARK-9810.	2015-08-11 01:08:30 -07:00
Prabeesh K	853809e948	[SPARK-5155] [PYSPARK] [STREAMING] Mqtt streaming support in Python This PR is based on #4229, thanks prabeesh. Closes #4229 Author: Prabeesh K <prabsmails@gmail.com> Author: zsxwing <zsxwing@gmail.com> Author: prabs <prabsmails@gmail.com> Author: Prabeesh K <prabeesh.k@namshi.com> Closes #7833 from zsxwing/pr4229 and squashes the following commits: 9570bec [zsxwing] Fix the variable name and check null in finally 4a9c79e [zsxwing] Fix pom.xml indentation abf5f18 [zsxwing] Merge branch 'master' into pr4229 935615c [zsxwing] Fix the flaky MQTT tests 47278c5 [zsxwing] Include the project class files 478f844 [zsxwing] Add unpack 5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests 734db99 [zsxwing] Merge branch 'master' into pr4229 126608a [Prabeesh K] address the comments b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229 d07f454 [zsxwing] Register StreamingListerner before starting StreamingContext; Revert unncessary changes; fix the python unit test a6747cb [Prabeesh K] wait for starting the receiver before publishing data 87fc677 [Prabeesh K] address the comments: 97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt 80474d1 [Prabeesh K] fix 1f0cfe9 [Prabeesh K] python style fix e1ee016 [Prabeesh K] scala style fix a5a8f9f [Prabeesh K] added Python test 9767d82 [Prabeesh K] implemented Python-friendly class a11968b [Prabeesh K] fixed python style 795ec27 [Prabeesh K] address comments ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly 3f4df12 [Prabeesh K] updated version b34c3c1 [prabs] adress comments 3aa7fff [prabs] Added Python streaming mqtt word count example b7d42ff [prabs] Mqtt streaming support in Python	2015-08-10 16:33:23 -07:00
Mike Dusenberry	571d5b5363	[SPARK-6485] [MLLIB] [PYTHON] Add CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark. This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix distributed matrices to PySpark. Each distributed matrix class acts as a wrapper around the Scala/Java counterpart by maintaining a reference to the Java object. New distributed matrices can be created using factory methods added to DistributedMatrices, which creates the Java distributed matrix and then wraps it with the corresponding PySpark class. This design allows for simple conversion between the various distributed matrices, and lets us re-use the Scala code. Serialization between Python and Java is implemented using DataFrames as needed for IndexedRowMatrix and CoordinateMatrix for simplicity. Associated documentation and unit-tests have also been added. To facilitate code review, this PR implements access to the rows/entries as RDDs, the number of rows & columns, and conversions between the various distributed matrices (not including BlockMatrix), and does not implement the other linear algebra functions of the matrices, although this will be very simple to add now. Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes #7554 from dusenberrymw/SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark and squashes the following commits: bb039cb [Mike Dusenberry] Minor documentation update. b887c18 [Mike Dusenberry] Updating the matrix conversion logic again to make it even cleaner. Now, we allow the 'rows' parameter in the constructors to be either an RDD or the Java matrix object. If 'rows' is an RDD, we create a Java matrix object, wrap it, and then store that. If 'rows' is a Java matrix object of the correct type, we just wrap and store that directly. This is only for internal usage, and publicly, we still require 'rows' to be an RDD. We no longer store the 'rows' RDD, and instead just compute it from the Java object when needed. The point of this is that when we do matrix conversions, we do the conversion on the Scala/Java side, which returns a Java object, so we should use that directly, but exposing 'java_matrix' parameter in the public API is not ideal. This non-public feature of allowing 'rows' to be a Java matrix object is documented in the '__init__' constructor docstrings, which are not part of the generated public API, and doctests are also included. 7f0dcb6 [Mike Dusenberry] Updating module docstring. cfc1be5 [Mike Dusenberry] Use 'new SQLContext(matrix.rows.sparkContext)' rather than 'SQLContext.getOrCreate', as the later doesn't guarantee that the SparkContext will be the same as for the matrix.rows data. 687e345 [Mike Dusenberry] Improving conversion performance. This adds an optional 'java_matrix' parameter to the constructors, and pulls the conversion logic out into a '_create_from_java' function. Now, if the constructors are given a valid Java distributed matrix object as 'java_matrix', they will store those internally, rather than create a new one on the Scala/Java side. 3e50b6e [Mike Dusenberry] Moving the distributed matrices to pyspark.mllib.linalg.distributed. 308f197 [Mike Dusenberry] Using properties for better documentation. 1633f86 [Mike Dusenberry] Minor documentation cleanup. f0c13a7 [Mike Dusenberry] CoordinateMatrix should inherit from DistributedMatrix. ffdd724 [Mike Dusenberry] Updating doctests to make documentation cleaner. 3fd4016 [Mike Dusenberry] Updating docstrings. 27cd5f6 [Mike Dusenberry] Simplifying input conversions in the constructors for each distributed matrix. a409cf5 [Mike Dusenberry] Updating doctests to be less verbose by using lists instead of DenseVectors explicitly. d19b0ba [Mike Dusenberry] Updating code and documentation to note that a vector-like object (numpy array, list, etc.) can be used in place of explicit Vector object, and adding conversions when necessary to RowMatrix construction. 4bd756d [Mike Dusenberry] Adding param documentation to IndexedRow and MatrixEntry. c6bded5 [Mike Dusenberry] Move conversion logic from tuples to IndexedRow or MatrixEntry types from within the IndexedRowMatrix and CoordinateMatrix constructors to separate _convert_to_indexed_row and _convert_to_matrix_entry functions. 329638b [Mike Dusenberry] Moving the Experimental tag to the top of each docstring. 0be6826 [Mike Dusenberry] Simplifying doctests by removing duplicated rows/entries RDDs within the various tests. c0900df [Mike Dusenberry] Adding the colons that were accidentally not inserted. 4ad6819 [Mike Dusenberry] Documenting the and parameters. 3b854b9 [Mike Dusenberry] Minor updates to documentation. 10046e8 [Mike Dusenberry] Updating documentation to use class constructors instead of the removed DistributedMatrices factory methods. 119018d [Mike Dusenberry] Adding static methods to each of the distributed matrix classes to consolidate conversion logic. 4d7af86 [Mike Dusenberry] Adding type checks to the constructors. Although it is slightly verbose, it is better for the user to have a good error message than a cryptic stacktrace. 93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out of this pull request. f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out of this pull request. 6a3ecb7 [Mike Dusenberry] Updating pattern matching. 08f287b [Mike Dusenberry] Slight reformatting of the documentation. a245dc0 [Mike Dusenberry] Updating Python doctests for compatability between Python 2 & 3. Since Python 3 removed the idea of a separate 'long' type, all values that would have been outputted as a 'long' (ex: '4L') will now be treated as an 'int' and outputed as one (ex: '4'). The doctests now explicitly convert to ints so that both Python 2 and 3 will have the same output. This is fine since the values are all small, and thus can be easily represented as ints. 4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines. 7e3ca16 [Mike Dusenberry] Fixing long lines. f721ead [Mike Dusenberry] Updating documentation for each of the distributed matrices. ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful. dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the various distributed matrices. Added logic to be able to access the rows/entries of the distributed matrices, which requires serialization through DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests. 0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark, following the idea of the IndexedRowMatrix API, including using DataFrames for serialization. 3c369cb [Mike Dusenberry] Updating the architecture a bit to make conversions between the various distributed matrix types easier. The different distributed matrix classes are now only wrappers around the Java objects, and take the Java object as an argument during construction. This way, we can call for example on an , which returns a reference to a Java RowMatrix object, and then construct a PySpark RowMatrix object wrapped around the Java object. This is analogous to the behavior of PySpark RDDs and DataFrames. We now delegate creation of the various distributed matrices from scratch in PySpark to the factory methods on . 4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark, following the idea of the RowMatrix API. Note that for the IndexedRowMatrix, we use DataFrames to serialize the data between Python and Scala/Java, so we accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on the Scala/Java side before constructing the IndexedRowMatrix. 23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix. Inserting newline above doctest so that it renders properly in API docs. b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix simply create and keep a reference to a wrapper over a Java RowMatrix. Updating DistributedMatrices factory methods to accept numRows and numCols with default values. Updating PySpark DistributedMatrices factory method to simply create a PySpark RowMatrix. Adding additional doctests for numRows and numCols parameters. bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods. d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing the following: Added a DistributedMatrices class to contain factory methods for creating the various distributed matrices. Added a factory method for creating a RowMatrix from an RDD of Vectors. Added a createRowMatrix function to the PythonMLlibAPI to interface with the factory method. Added DistributedMatrix, DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg api.	2015-08-04 16:30:03 -07:00
Steve Loughran	a2409d1c8e	[SPARK-8064] [SQL] Build against Hive 1.2.1 Cherry picked the parts of the initial SPARK-8064 WiP branch needed to get sql/hive to compile against hive 1.2.1. That's the ASF release packaged under org.apache.hive, not any fork. Tests not run yet: that's what the machines are for Author: Steve Loughran <stevel@hortonworks.com> Author: Cheng Lian <lian@databricks.com> Author: Michael Armbrust <michael@databricks.com> Author: Patrick Wendell <patrick@databricks.com> Closes #7191 from steveloughran/stevel/feature/SPARK-8064-hive-1.2-002 and squashes the following commits: 7556d85 [Cheng Lian] Updates .q files and corresponding golden files ef4af62 [Steve Loughran] Merge commit '6a92bb09f46a04d6cd8c41bdba3ecb727ebb9030' into stevel/feature/SPARK-8064-hive-1.2-002 6a92bb0 [Cheng Lian] Overrides HiveConf time vars dcbb391 [Cheng Lian] Adds com.twitter:parquet-hadoop-bundle:1.6.0 for Hive Parquet SerDe 0bbe475 [Steve Loughran] SPARK-8064 scalastyle rejects the standard Hadoop ASF license header... fdf759b [Steve Loughran] SPARK-8064 classpath dependency suite to be in sync with shading in final (?) hive-exec spark 7a6c727 [Steve Loughran] SPARK-8064 switch to second staging repo of the spark-hive artifacts. This one has the protobuf-shaded hive-exec jar 376c003 [Steve Loughran] SPARK-8064 purge duplicate protobuf declaration 2c74697 [Steve Loughran] SPARK-8064 switch to the protobuf shaded hive-exec jar with tests to chase it down cc44020 [Steve Loughran] SPARK-8064 remove hadoop.version from runtest.py, as profile will fix that automatically. 6901fa9 [Steve Loughran] SPARK-8064 explicit protobuf import da310dc [Michael Armbrust] Fixes for Hive tests. a775a75 [Steve Loughran] SPARK-8064 cherry-pick-incomplete 7404f34 [Patrick Wendell] Add spark-hive staging repo 832c164 [Steve Loughran] SPARK-8064 try to supress compiler warnings on Complex.java pasted-thrift-code 312c0d4 [Steve Loughran] SPARK-8064 maven/ivy dependency purge; calcite declaration needed fa5ae7b [Steve Loughran] HIVE-8064 fix up hive-thriftserver dependencies and cut back on evicted references in the hive- packages; this keeps mvn and ivy resolution compatible, as the reconciliation policy is "by hand" c188048 [Steve Loughran] SPARK-8064 manage the Hive depencencies to that -things that aren't needed are excluded -sql/hive built with ivy is in sync with the maven reconciliation policy, rather than latest-first 4c8be8d [Cheng Lian] WIP: Partial fix for Thrift server and CLI tests 314eb3c [Steve Loughran] SPARK-8064 deprecation warning noise in one of the tests 17b0341 [Steve Loughran] SPARK-8064 IDE-hinted cleanups of Complex.java to reduce compiler warnings. It's all autogenerated code, so still ugly. d029b92 [Steve Loughran] SPARK-8064 rely on unescaping to have already taken place, so go straight to map of serde options 23eca7e [Steve Loughran] HIVE-8064 handle raw and escaped property tokens 54d9b06 [Steve Loughran] SPARK-8064 fix compilation regression surfacing from rebase 0b12d5f [Steve Loughran] HIVE-8064 use subset of hive complex type whose types deserialize fce73b6 [Steve Loughran] SPARK-8064 poms rely implicitly on the version of kryo chill provides fd3aa5d [Steve Loughran] SPARK-8064 version of hive to d/l from ivy is 1.2.1 dc73ece [Steve Loughran] SPARK-8064 revert to master's determinstic pushdown strategy d3c1e4a [Steve Loughran] SPARK-8064 purge UnionType 051cc21 [Steve Loughran] SPARK-8064 switch to an unshaded version of hive-exec-core, which must have been built with Kryo 2.21. This currently looks for a (locally built) version 1.2.1.spark 6684c60 [Steve Loughran] SPARK-8064 ignore RTE raised in blocking process.exitValue() call e6121e5 [Steve Loughran] SPARK-8064 address review comments aa43dc6 [Steve Loughran] SPARK-8064 more robust teardown on JavaMetastoreDatasourcesSuite f2bff01 [Steve Loughran] SPARK-8064 better takeup of asynchronously caught error text 8b1ef38 [Steve Loughran] SPARK-8064: on failures executing spark-submit in HiveSparkSubmitSuite, print command line and all logged output. 5a9ce6b [Steve Loughran] SPARK-8064 add explicit reason for kv split failure, rather than array OOB. does not address the issue 642b63a [Steve Loughran] SPARK-8064 reinstate something cut briefly during rebasing 97194dc [Steve Loughran] SPARK-8064 add extra logging to the YarnClusterSuite classpath test. There should be no reason why this is failing on jenkins, but as it is (and presumably its CP-related), improve the logging including any exception raised. 335357f [Steve Loughran] SPARK-8064 fail fast on thrive process spawning tests on exit codes and/or error string patterns seen in log. 3ed872f [Steve Loughran] SPARK-8064 rename field double to dbl bca55e5 [Steve Loughran] SPARK-8064 missed one of the `date` escapes 41d6479 [Steve Loughran] SPARK-8064 wrap tests with withTable() calls to avoid table-exists exceptions 2bc29a4 [Steve Loughran] SPARK-8064 ParquetSuites to escape `date` field name 1ab9bc4 [Steve Loughran] SPARK-8064 TestHive to use sered2.thrift.test.Complex bf3a249 [Steve Loughran] SPARK-8064: more resubmit than fix; tighten startup timeout to 60s. Still no obvious reason why jersey server code in spark-assembly isn't being picked up -it hasn't been shaded c829b8f [Steve Loughran] SPARK-8064: reinstate yarn-rm-server dependencies to hive-exec to ensure that jersey server is on classpath on hadoop versions < 2.6 0b0f738 [Steve Loughran] SPARK-8064: thrift server startup to fail fast on any exception in the main thread 13abaf1 [Steve Loughran] SPARK-8064 Hive compatibilty tests sin sync with explain/show output from Hive 1.2.1 d14d5ea [Steve Loughran] SPARK-8064: DATE is now a predicate; you can't use it as a field in select ops 26eef1c [Steve Loughran] SPARK-8064: HIVE-9039 renamed TOK_UNION => TOK_UNIONALL while adding TOK_UNIONDISTINCT 3d64523 [Steve Loughran] SPARK-8064 improve diagns on uknown token; fix scalastyle failure d0360f6 [Steve Loughran] SPARK-8064: delicate merge in of the branch vanzin/hive-1.1 1126e5a [Steve Loughran] SPARK-8064: name of unrecognized file format wasn't appearing in error text 8cb09c4 [Steve Loughran] SPARK-8064: test resilience/assertion improvements. Independent of the rest of the work; can be backported to earlier versions dec12cb [Steve Loughran] SPARK-8064: when a CLI suite test fails include the full output text in the raised exception; this ensures that the stdout/stderr is included in jenkins reports, so it becomes possible to diagnose the cause. 463a670 [Steve Loughran] SPARK-8064 run-tests.py adds a hadoop-2.6 profile, and changes info messages to say "w/Hive 1.2.1" in console output 2531099 [Steve Loughran] SPARK-8064 successful attempt to get rid of pentaho as a transitive dependency of hive-exec 1d59100 [Steve Loughran] SPARK-8064 (unsuccessful) attempt to get rid of pentaho as a transitive dependency of hive-exec 75733fc [Steve Loughran] SPARK-8064 change thrift binary startup message to "Starting ThriftBinaryCLIService on port" 3ebc279 [Steve Loughran] SPARK-8064 move strings used to check for http/bin thrift services up into constants c80979d [Steve Loughran] SPARK-8064: SparkSQLCLIDriver drops remote mode support. CLISuite Tests pass instead of timing out: undetected regression? 27e8370 [Steve Loughran] SPARK-8064 fix some style & IDE warnings 00e50d6 [Steve Loughran] SPARK-8064 stop excluding hive shims from dependency (commented out , for now) cb4f142 [Steve Loughran] SPARK-8054 cut pentaho dependency from calcite f7aa9cb [Steve Loughran] SPARK-8064 everything compiles with some commenting and moving of classes into a hive package 6c310b4 [Steve Loughran] SPARK-8064 subclass Hive ServerOptionsProcessor to make it public again f61a675 [Steve Loughran] SPARK-8064 thrift server switched to Hive 1.2.1, though it doesn't compile everywhere 4890b9d [Steve Loughran] SPARK-8064, build against Hive 1.2.1	2015-08-03 15:24:42 -07:00
Sean Owen	6e5fd613ea	[SPARK-9507] [BUILD] Remove dependency reduced POM hack now that shade plugin is updated Update to shade plugin 2.4.1, which removes the need for the dependency-reduced-POM workaround and the 'release' profile. Fix management of shade plugin version so children inherit it; bump assembly plugin version while here See https://issues.apache.org/jira/browse/SPARK-8819 I verified that `mvn clean package -DskipTests` works with Maven 3.3.3. pwendell are you up for trying this for the 1.5.0 release? Author: Sean Owen <sowen@cloudera.com> Closes #7826 from srowen/SPARK-9507 and squashes the following commits: e0b0fd2 [Sean Owen] Update to shade plugin 2.4.1, which removes the need for the dependency-reduced-POM workaround and the 'release' profile. Fix management of shade plugin version so children inherit it; bump assembly plugin version while here	2015-07-31 21:51:55 +01:00
zsxwing	3afc1de89c	[SPARK-8564] [STREAMING] Add the Python API for Kinesis This PR adds the Python API for Kinesis, including a Python example and a simple unit test. Author: zsxwing <zsxwing@gmail.com> Closes #6955 from zsxwing/kinesis-python and squashes the following commits: e42e471 [zsxwing] Merge branch 'master' into kinesis-python 455f7ea [zsxwing] Remove streaming_kinesis_asl_assembly module and simply add the source folder to streaming_kinesis_asl module 32e6451 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python 5082d28 [zsxwing] Fix the syntax error for Python 2.6 fca416b [zsxwing] Fix wrong comparison 96670ff [zsxwing] Fix the compilation error after merging master 756a128 [zsxwing] Merge branch 'master' into kinesis-python 6c37395 [zsxwing] Print stack trace for debug 7c5cfb0 [zsxwing] RUN_KINESIS_TESTS -> ENABLE_KINESIS_TESTS cc9d071 [zsxwing] Fix the python test errors 466b425 [zsxwing] Add python tests for Kinesis e33d505 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python 3da2601 [zsxwing] Fix the kinesis folder 687446b [zsxwing] Fix the error message and the maven output path add2beb [zsxwing] Merge branch 'master' into kinesis-python 4957c0b [zsxwing] Add the Python API for Kinesis	2015-07-31 12:09:48 -07:00
Xiangrui Meng	ca71cc8c8b	[SPARK-9408] [PYSPARK] [MLLIB] Refactor linalg.py to /linalg This is based on MechCoder 's PR https://github.com/apache/spark/pull/7731. Hopefully it could pass tests. MechCoder I tried to make minimal changes. If this passes Jenkins, we can merge this one first and then try to move `__init__.py` to `local.py` in a separate PR. Closes #7731 Author: Xiangrui Meng <meng@databricks.com> Closes #7746 from mengxr/SPARK-9408 and squashes the following commits: 0e05a3b [Xiangrui Meng] merge master 1135551 [Xiangrui Meng] add a comment for str(...) c48cae0 [Xiangrui Meng] update tests 173a805 [Xiangrui Meng] move linalg.py to linalg/__init__.py	2015-07-30 16:57:38 -07:00
zsxwing	76f2e393a5	[SPARK-9335] [TESTS] Enable Kinesis tests only when files in extras/kinesis-asl are changed Author: zsxwing <zsxwing@gmail.com> Closes #7711 from zsxwing/SPARK-9335-test and squashes the following commits: c13ec2f [zsxwing] environs -> environ 69c2865 [zsxwing] Merge remote-tracking branch 'origin/master' into SPARK-9335-test ef84a08 [zsxwing] Revert "Modify the Kinesis project to trigger ENABLE_KINESIS_TESTS" f691028 [zsxwing] Modify the Kinesis project to trigger ENABLE_KINESIS_TESTS 7618205 [zsxwing] Enable Kinesis tests only when files in extras/kinesis-asl are changed	2015-07-30 00:46:36 -07:00
Yin Huai	dafe8d857d	[SPARK-9385] [PYSPARK] Enable PEP8 but disable installing pylint. Instead of disabling all python style check, we should enable PEP8. So, this PR just comments out the part installing pylint. Author: Yin Huai <yhuai@databricks.com> Closes #7704 from yhuai/SPARK-9385 and squashes the following commits: 0056359 [Yin Huai] Enable PEP8 but disable installing pylint.	2015-07-27 15:49:42 -07:00
Yin Huai	2104931d7d	[SPARK-9385] [HOT-FIX] [PYSPARK] Comment out Python style check https://issues.apache.org/jira/browse/SPARK-9385 Comment out Python style check because of error shown in https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3088/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/console Author: Yin Huai <yhuai@databricks.com> Closes #7702 from yhuai/SPARK-9385 and squashes the following commits: 146e6ef [Yin Huai] Comment out Python style check because of error shown in https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3088/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/console	2015-07-27 15:18:48 -07:00
Reynold Xin	85a50a6352	[HOTFIX] Disable pylint since it is failing master.	2015-07-27 12:25:34 -07:00
Sean Owen	c980e20cf1	[SPARK-9304] [BUILD] Improve backwards compatibility of SPARK-8401 Add back change-version-to-X.sh scripts, as wrappers for new script, for backwards compatibility Author: Sean Owen <sowen@cloudera.com> Closes #7639 from srowen/SPARK-9304 and squashes the following commits: 9ab2681 [Sean Owen] Add deprecation message to wrappers 3c8c202 [Sean Owen] Add back change-version-to-X.sh scripts, as wrappers for new script, for backwards compatibility	2015-07-25 11:05:08 +01:00
François Garillot	428cde5d1c	[SPARK-9250] Make change-scala-version more helpful w.r.t. valid Scala versions Author: François Garillot <francois@garillot.net> Closes #7595 from huitseeker/issue/SPARK-9250 and squashes the following commits: 80a0218 [François Garillot] [SPARK-9250] Make change-scala-version's usage more explicit, introduce a -h\|--help option.	2015-07-24 17:09:33 +01:00
Yu ISHIKAWA	63f4bcc73f	[SPARK-9121] [SPARKR] Get rid of the warnings about `no visible global function definition` in SparkR [[SPARK-9121] Get rid of the warnings about `no visible global function definition` in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9121) ## The Result of `dev/lint-r` [The result of lint-r for SPARK-9121 at the revision:1ddd0f2f1688560f88470e312b72af04364e2d49 when I have sent a PR](https://gist.github.com/yu-iskw/6f55953425901725edf6) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #7567 from yu-iskw/SPARK-9121 and squashes the following commits: c8cfd63 [Yu ISHIKAWA] Fix the typo b1f19ed [Yu ISHIKAWA] Add a validate statement for local SparkR 1a03987 [Yu ISHIKAWA] Load the `testthat` package in `dev/lint-r.R`, instead of using the full path of function. 3a5e0ab [Yu ISHIKAWA] [SPARK-9121][SparkR] Get rid of the warnings about `no visible global function definition` in SparkR	2015-07-21 22:50:27 -07:00
Michael Allman	f5b6dc5e3e	[SPARK-8401] [BUILD] Scala version switching build enhancements These commits address a few minor issues in the Scala cross-version support in the build: 1. Correct two missing `${scala.binary.version}` pom file substitutions. 2. Don't update `scala.binary.version` in parent POM. This property is set through profiles. 3. Update the source of the generated scaladocs in `docs/_plugins/copy_api_dirs.rb`. 4. Factor common code out of `dev/change-version-to-.sh` and add some validation. We also test `sed` to see if it's GNU sed and try `gsed` as an alternative if not. This prevents the script from running with a non-GNU sed. This is my original work and I license this work to the Spark project under the Apache License. Author: Michael Allman <michael@videoamp.com> Closes #6832 from mallman/scala-versions and squashes the following commits: cde2f17 [Michael Allman] Delete dev/change-version-to-.sh, replacing them with single dev/change-scala-version.sh script that takes a version as argument 02296f2 [Michael Allman] Make the scala version change scripts cross-platform by restricting ourselves to POSIX sed syntax instead of looking for GNU sed ad9b40a [Michael Allman] Factor change-scala-version.sh out of change-version-to-*.sh, adding command line argument validation and testing for GNU sed bdd20bf [Michael Allman] Update source of scaladocs when changing Scala version 475088e [Michael Allman] Replace jackson-module-scala_2.10 with jackson-module-scala_${scala.binary.version}	2015-07-21 11:14:31 +01:00
Shivaram Venkataraman	228ab65a4e	[SPARK-9179] [BUILD] Use default primary author if unspecified Fixes feature introduced in #7508 to use the default value if nothing is specified in command line cc liancheng rxin pwendell Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #7558 from shivaram/merge-script-fix and squashes the following commits: 7092141 [Shivaram Venkataraman] Use default primary author if unspecified	2015-07-20 23:31:08 -07:00
Cheng Lian	bc24289f5d	[SPARK-9179] [BUILD] Allows committers to specify primary author of the PR to be merged It's a common case that some contributor contributes an initial version of a feature/bugfix, and later on some other people (mostly committers) fork and add more improvements. When merging these PRs, we probably want to specify the original author as the primary author. Currently we can only do this by running ``` $ git commit --amend --author="name <email>" ``` manually right before the merge script pushes to Apache Git repo. It would be nice if the script accepts user specified primary author information. Author: Cheng Lian <lian@databricks.com> Closes #7508 from liancheng/spark-9179 and squashes the following commits: 218d88e [Cheng Lian] Allows committers to specify primary author of the PR to be merged	2015-07-19 17:37:25 +08:00
Yu ISHIKAWA	34a889db85	[SPARK-7879] [MLLIB] KMeans API for spark.ml Pipelines I Implemented the KMeans API for spark.ml Pipelines. But it doesn't include clustering abstractions for spark.ml (SPARK-7610). It would fit for another issues. And I'll try it later, since we are trying to add the hierarchical clustering algorithms in another issue. Thanks. [SPARK-7879] KMeans API for spark.ml Pipelines - ASF JIRA https://issues.apache.org/jira/browse/SPARK-7879 Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #6756 from yu-iskw/SPARK-7879 and squashes the following commits: be752de [Yu ISHIKAWA] Add assertions a14939b [Yu ISHIKAWA] Fix the dashed line's length in pyspark.ml.rst 4c61693 [Yu ISHIKAWA] Remove the test about whether "features" and "prediction" columns exist or not in Python fb2417c [Yu ISHIKAWA] Use getInt, instead of get f397be4 [Yu ISHIKAWA] Switch the comparisons. ca78b7d [Yu ISHIKAWA] Add the Scala docs about the constraints of each parameter. effc650 [Yu ISHIKAWA] Using expertSetParam and expertGetParam c8dc6e6 [Yu ISHIKAWA] Remove an unnecessary test 19a9d63 [Yu ISHIKAWA] Include spark.ml.clustering to python tests 1abb19c [Yu ISHIKAWA] Add the statements about spark.ml.clustering into pyspark.ml.rst f8338bc [Yu ISHIKAWA] Add the placeholders in Python 4a03003 [Yu ISHIKAWA] Test for contains in Python 6566c8b [Yu ISHIKAWA] Use `get`, instead of `apply` 288e8d5 [Yu ISHIKAWA] Using `contains` to check the column names 5a7d574 [Yu ISHIKAWA] Renamce `validateInitializationMode` to `validateInitMode` and remove throwing exception 97cfae3 [Yu ISHIKAWA] Fix the type of return value of `KMeans.copy` e933723 [Yu ISHIKAWA] Remove the default value of seed from the Model class 978ee2c [Yu ISHIKAWA] Modify the docs of KMeans, according to mllib's KMeans 2ec80bc [Yu ISHIKAWA] Fit on 1 line e186be1 [Yu ISHIKAWA] Make a few variables, setters and getters be expert ones b2c205c [Yu ISHIKAWA] Rename the method `getInitializationSteps` to `getInitSteps` and `setInitializationSteps` to `setInitSteps` in Scala and Python f43f5b4 [Yu ISHIKAWA] Rename the method `getInitializationMode` to `getInitMode` and `setInitializationMode` to `setInitMode` in Scala and Python 3cb5ba4 [Yu ISHIKAWA] Modify the description about epsilon and the validation 4fa409b [Yu ISHIKAWA] Add a comment about the default value of epsilon 2f392e1 [Yu ISHIKAWA] Make some variables `final` and Use `IntParam` and `DoubleParam` 19326f8 [Yu ISHIKAWA] Use `udf`, instead of callUDF 4d2ad1e [Yu ISHIKAWA] Modify the indentations 0ae422f [Yu ISHIKAWA] Add a test for `setParams` 4ff7913 [Yu ISHIKAWA] Add "ml.clustering" to `javacOptions` in SparkBuild.scala 11ffdf1 [Yu ISHIKAWA] Use `===` and the variable 220a176 [Yu ISHIKAWA] Set a random seed in the unit testing 92c3efc [Yu ISHIKAWA] Make the points for a test be fewer c758692 [Yu ISHIKAWA] Modify the parameters of KMeans in Python 6aca147 [Yu ISHIKAWA] Add some unit testings to validate the setter methods 687cacc [Yu ISHIKAWA] Alias mllib.KMeans as MLlibKMeans in KMeansSuite.scala a4dfbef [Yu ISHIKAWA] Modify the last brace and indentations 5bedc51 [Yu ISHIKAWA] Remve an extra new line 444c289 [Yu ISHIKAWA] Add the validation for `runs` e41989c [Yu ISHIKAWA] Modify how to validate `initStep` 7ea133a [Yu ISHIKAWA] Change how to validate `initMode` 7991e15 [Yu ISHIKAWA] Add a validation for `k` c2df35d [Yu ISHIKAWA] Make `predict` private 93aa2ff [Yu ISHIKAWA] Use `withColumn` in `transform` d3a79f7 [Yu ISHIKAWA] Remove the inhefited docs e9532e1 [Yu ISHIKAWA] make `parentModel` of KMeansModel private 8559772 [Yu ISHIKAWA] Remove the `paramMap` parameter of KMeans 6684850 [Yu ISHIKAWA] Rename `initializationSteps` to `initSteps` 99b1b96 [Yu ISHIKAWA] Rename `initializationMode` to `initMode` 79ea82b [Yu ISHIKAWA] Modify the parameters of KMeans docs 6569bcd [Yu ISHIKAWA] Change how to set the default values with `setDefault` 20a795a [Yu ISHIKAWA] Change how to set the default values with `setDefault` 11c2a12 [Yu ISHIKAWA] Limit the imports badb481 [Yu ISHIKAWA] Alias spark.mllib.{KMeans, KMeansModel} f80319a [Yu ISHIKAWA] Rebase mater branch and add copy methods 85d92b1 [Yu ISHIKAWA] Add `KMeans.setPredictionCol` aa9469d [Yu ISHIKAWA] Fix a python test suite error caused by python 3.x c2d6bcb [Yu ISHIKAWA] ADD Java test suites of the KMeans API for spark.ml Pipeline 598ed2e [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Python 63ad785 [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Scala	2015-07-17 18:30:04 -07:00
MechCoder	20bb10f864	[SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark This adds Pylint checks to PySpark. For now this lazy installs using easy_install to /dev/pylint (similar to the pep8 script). We still need to figure out what rules to be allowed. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #7241 from MechCoder/pylint and squashes the following commits: 2fc7291 [MechCoder] Remove pylint test fail 6d883a2 [MechCoder] Silence warnings and make pylint tests fail to check if it works in jenkins f3a5e17 [MechCoder] undefined-variable ca8b749 [MechCoder] Minor changes 71629f8 [MechCoder] remove trailing whitespace 8498ff9 [MechCoder] Remove blacklisted arguments and pointless statements check 1dbd094 [MechCoder] Disable all checks for now 8b8aa8a [MechCoder] Add pylint configuration file 7871bb1 [MechCoder] [SPARK-8706] [PySpark] [Project infra] Add pylint checks to PySpark	2015-07-15 08:25:53 -07:00
Patrick Wendell	5572fd0c51	[HOTFIX] Adding new names to known contributors	2015-07-14 21:44:47 -07:00
Davies Liu	79c35826e6	Revert "[SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark" This reverts commit `9b62e9375f`.	2015-07-13 11:30:36 -07:00
MechCoder	9b62e9375f	[SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark This adds Pylint checks to PySpark. For now this lazy installs using easy_install to /dev/pylint (similar to the pep8 script). We still need to figure out what rules to be allowed. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #7241 from MechCoder/pylint and squashes the following commits: 8496834 [MechCoder] Silence warnings and make pylint tests fail to check if it works in jenkins 57393a3 [MechCoder] undefined-variable a8e2547 [MechCoder] Minor changes 7753810 [MechCoder] remove trailing whitespace 75c5d2b [MechCoder] Remove blacklisted arguments and pointless statements check 6bde250 [MechCoder] Disable all checks for now 3464666 [MechCoder] Add pylint configuration file d28109f [MechCoder] [SPARK-8706] [PySpark] [Project infra] Add pylint checks to PySpark	2015-07-13 09:47:53 -07:00
Jonathan Alter	e14b545d2d	[SPARK-7977] [BUILD] Disallowing println Author: Jonathan Alter <jonalter@users.noreply.github.com> Closes #7093 from jonalter/SPARK-7977 and squashes the following commits: ccd44cc [Jonathan Alter] Changed println to log in ThreadingSuite 7fcac3e [Jonathan Alter] Reverting to println in ThreadingSuite 10724b6 [Jonathan Alter] Changing some printlns to logs in tests eeec1e7 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 0b1dcb4 [Jonathan Alter] More println cleanup aedaf80 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 925fd98 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 0c16fa3 [Jonathan Alter] Replacing some printlns with logs 45c7e05 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 5c8e283 [Jonathan Alter] Allowing println in audit-release examples 5b50da1 [Jonathan Alter] Allowing printlns in example files ca4b477 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 83ab635 [Jonathan Alter] Fixing new printlns 54b131f [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 1cd8a81 [Jonathan Alter] Removing some unnecessary comments and printlns b837c3a [Jonathan Alter] Disallowing println	2015-07-10 11:34:01 +01:00
Patrick Wendell	1cb2629f1a	[HOTFIX] Rename release-profile to release when publishing releases. We named it as 'release-profile' because that is the Maven convention. However, it turns out this special name causes several other things to kick-in when we are creating releases that are not desirable. For instance, it triggers the javadoc plugin to run, which actually fails in our current build set-up. The fix is just to rename this to a different profile to have no collateral damage associated with its use.	2015-07-06 22:17:30 -07:00
Andrew Or	9eae5fa642	[SPARK-8819] Fix build for maven 3.3.x This is a workaround for MSHADE-148, which leads to an infinite loop when building Spark with maven 3.3.x. This was originally caused by #6441, which added a bunch of test dependencies on the spark-core test module. Recently, it was revealed by #7193. This patch adds a `-Prelease` profile. If present, it will set `createDependencyReducedPom` to true. The consequences are: - If you are releasing Spark with this profile, you are fine as long as you use maven 3.2.x or before. - If you are releasing Spark without this profile, you will run into SPARK-8781. - If you are not releasing Spark but you are using this profile, you may run into SPARK-8819. - If you are not releasing Spark and you did not include this profile, you are fine. This is all documented in `pom.xml` and tested locally with both versions of maven. Author: Andrew Or <andrew@databricks.com> Closes #7219 from andrewor14/fix-maven-build and squashes the following commits: 1d37e87 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-maven-build 3574ae4 [Andrew Or] Review comments f39199c [Andrew Or] Create a -Prelease profile that flags `createDependencyReducedPom`	2015-07-06 19:22:30 -07:00
Josh Rosen	377ff4c9e8	[SPARK-8740] [PROJECT INFRA] Support GitHub OAuth tokens in dev/merge_spark_pr.py This commit allows `dev/merge_spark_pr.py` to use personal GitHub OAuth tokens in order to make authenticated requests. This is necessary to work around per-IP rate limiting issues. To use a token, just set the `GITHUB_OAUTH_KEY` environment variable. You can create a personal token at https://github.com/settings/tokens; we only require `public_repo` scope. If the script fails due to a rate-limit issue, it now logs a useful message directing the user to the OAuth token instructions. Author: Josh Rosen <joshrosen@databricks.com> Closes #7136 from JoshRosen/pr-merge-script-oauth-authentication and squashes the following commits: 4d011bd [Josh Rosen] Fix error message 23d92ff [Josh Rosen] Support GitHub OAuth tokens in dev/merge_spark_pr.py	2015-07-01 23:06:52 -07:00
zsxwing	75b9fe4c5f	[SPARK-8378] [STREAMING] Add the Python API for Flume Author: zsxwing <zsxwing@gmail.com> Closes #6830 from zsxwing/flume-python and squashes the following commits: 78dfdac [zsxwing] Fix the compile error in the test code f1bf3c0 [zsxwing] Address TD's comments 0449723 [zsxwing] Add sbt goal streaming-flume-assembly/assembly e93736b [zsxwing] Fix the test case for determine_modules_to_test 9d5821e [zsxwing] Fix pyspark_core dependencies f9ee681 [zsxwing] Merge branch 'master' into flume-python 7a55837 [zsxwing] Add streaming_flume_assembly to run-tests.py b96b0de [zsxwing] Merge branch 'master' into flume-python ce85e83 [zsxwing] Fix incompatible issues for Python 3 01cbb3d [zsxwing] Add import sys 152364c [zsxwing] Fix the issue that StringIO doesn't work in Python 3 14ba0ff [zsxwing] Add flume-assembly for sbt building b8d5551 [zsxwing] Merge branch 'master' into flume-python 4762c34 [zsxwing] Fix the doc 0336579 [zsxwing] Refactor Flume unit tests and also add tests for Python API 9f33873 [zsxwing] Add the Python API for Flume	2015-07-01 11:59:24 -07:00
Josh Rosen	7bbbe380c5	[SPARK-5161] Parallelize Python test execution This commit parallelizes the Python unit test execution, significantly reducing Jenkins build times. Parallelism is now configurable by passing the `-p` or `--parallelism` flags to either `dev/run-tests` or `python/run-tests` (the default parallelism is 4, but I've successfully tested with higher parallelism). To avoid flakiness, I've disabled the Spark Web UI for the Python tests, similar to what we've done for the JVM tests. Author: Josh Rosen <joshrosen@databricks.com> Closes #7031 from JoshRosen/parallelize-python-tests and squashes the following commits: feb3763 [Josh Rosen] Re-enable other tests f87ea81 [Josh Rosen] Only log output from failed tests d4ded73 [Josh Rosen] Logging improvements a2717e1 [Josh Rosen] Make parallelism configurable via dev/run-tests 1bacf1b [Josh Rosen] Merge remote-tracking branch 'origin/master' into parallelize-python-tests 110cd9d [Josh Rosen] Fix universal_newlines for Python 3 cd13db8 [Josh Rosen] Also log python_implementation 9e31127 [Josh Rosen] Log Python --version output for each executable. a2b9094 [Josh Rosen] Bump up parallelism. 5552380 [Josh Rosen] Python 3 fix 866b5b9 [Josh Rosen] Fix lazy logging warnings in Prospector checks 87cb988 [Josh Rosen] Skip MLLib tests for PyPy 8309bfe [Josh Rosen] Temporarily disable parallelism to debug a failure 9129027 [Josh Rosen] Disable Spark UI in Python tests 037b686 [Josh Rosen] Temporarily disable JVM tests so we can test Python speedup in Jenkins. af4cef4 [Josh Rosen] Initial attempt at parallelizing Python test execution	2015-06-29 21:32:40 -07:00
Brennon York	5c796d576e	[SPARK-8693] [PROJECT INFRA] profiles and goals are not printed in a nice way Hotfix to correct formatting errors of print statements within the dev and jenkins builds. Error looks like: ``` -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive-thriftserver[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: package[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: assembly/assembly[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: streaming-kafka-assembly/assembly ``` Author: Brennon York <brennon.york@capitalone.com> Closes #7085 from brennonyork/SPARK-8693 and squashes the following commits: c5575f1 [Brennon York] added commas to end of print statements for proper printing	2015-06-29 08:55:06 -07:00
Cheng Lian	00a9d22bd6	[SPARK-7845] [BUILD] Bumping default Hadoop version used in profile hadoop-1 to 1.2.1 PR #5694 reverted PR #6384 while refactoring `dev/run-tests` to `dev/run-tests.py`. Also, PR #6384 didn't bump Hadoop 1 version defined in POM. Author: Cheng Lian <lian@databricks.com> Closes #7062 from liancheng/spark-7845 and squashes the following commits: c088b72 [Cheng Lian] Bumping default Hadoop version used in profile hadoop-1 to 1.2.1	2015-06-28 19:34:59 -07:00
Josh Rosen	42db3a1c2f	[HOTFIX] Fix pull request builder bug in #6967	2015-06-27 23:07:20 -07:00
Josh Rosen	40648c56cd	[SPARK-8583] [SPARK-5482] [BUILD] Refactor python/run-tests to integrate with dev/run-tests module system This patch refactors the `python/run-tests` script: - It's now written in Python instead of Bash. - The descriptions of the tests to run are now stored in `dev/run-tests`'s modules. This allows the pull request builder to skip Python tests suites that were not affected by the pull request's changes. For example, we can now skip the PySpark Streaming test cases when only SQL files are changed. - `python/run-tests` now supports command-line flags to make it easier to run individual test suites (this addresses SPARK-5482): ``` Usage: run-tests [options] Options: -h, --help show this help message and exit --python-executables=PYTHON_EXECUTABLES A comma-separated list of Python executables to test against (default: python2.6,python3.4,pypy) --modules=MODULES A comma-separated list of Python modules to test (default: pyspark-core,pyspark-ml,pyspark-mllib ,pyspark-sql,pyspark-streaming) ``` - `dev/run-tests` has been split into multiple files: the module definitions and test utility functions are now stored inside of a `dev/sparktestsupport` Python module, allowing them to be re-used from the Python test runner script. Author: Josh Rosen <joshrosen@databricks.com> Closes #6967 from JoshRosen/run-tests-python-modules and squashes the following commits: f578d6d [Josh Rosen] Fix print for Python 2.x 8233d61 [Josh Rosen] Add python/run-tests.py to Python lint checks 34c98d2 [Josh Rosen] Fix universal_newlines for Python 3 8f65ed0 [Josh Rosen] Fix handling of module in python/run-tests 37aff00 [Josh Rosen] Python 3 fix 27a389f [Josh Rosen] Skip MLLib tests for PyPy c364ccf [Josh Rosen] Use which() to convert PYSPARK_PYTHON to an absolute path before shelling out to run tests 568a3fd [Josh Rosen] Fix hashbang 3b852ae [Josh Rosen] Fall back to PYSPARK_PYTHON when sys.executable is None (fixes a test) f53db55 [Josh Rosen] Remove python2 flag, since the test runner script also works fine under Python 3 9c80469 [Josh Rosen] Fix passing of PYSPARK_PYTHON d33e525 [Josh Rosen] Merge remote-tracking branch 'origin/master' into run-tests-python-modules 4f8902c [Josh Rosen] Python lint fixes. 8f3244c [Josh Rosen] Use universal_newlines to fix dev/run-tests doctest failures on Python 3. f542ac5 [Josh Rosen] Fix lint check for Python 3 fff4d09 [Josh Rosen] Add dev/sparktestsupport to pep8 checks 2efd594 [Josh Rosen] Update dev/run-tests to use new Python test runner flags b2ab027 [Josh Rosen] Add command-line options for running individual suites in python/run-tests caeb040 [Josh Rosen] Fixes to PySpark test module definitions d6a77d3 [Josh Rosen] Fix the tests of dev/run-tests def2d8a [Josh Rosen] Two minor fixes aec0b8f [Josh Rosen] Actually get the Kafka stuff to run properly 04015b9 [Josh Rosen] First attempt at getting PySpark Kafka test to work in new runner script 4c97136 [Josh Rosen] PYTHONPATH fixes dcc9c09 [Josh Rosen] Fix time division 32660fc [Josh Rosen] Initial cut at Python test runner refactoring 311c6a9 [Josh Rosen] Move shell utility functions to own module. 1bdeb87 [Josh Rosen] Move module definitions to separate file.	2015-06-27 20:24:34 -07:00
Josh Rosen	41afa16500	[SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() This patch addresses a critical issue in the PySpark tests: Several of our Python modules' `__main__` methods call `doctest.testmod()` in order to run doctests but forget to check and handle its return value. As a result, some PySpark test failures can go unnoticed because they will not fail the build. Fortunately, there was only one test failure which was masked by this bug: a `pyspark.profiler` doctest was failing due to changes in RDD pipelining. Author: Josh Rosen <joshrosen@databricks.com> Closes #7032 from JoshRosen/testmod-fix and squashes the following commits: 60dbdc0 [Josh Rosen] Account for int vs. long formatting change in Python 3 8b8d80a [Josh Rosen] Fix failing test. e6423f9 [Josh Rosen] Check return code for all uses of doctest.testmod().	2015-06-26 08:12:22 -07:00
fe2s	dca21a83ac	[SPARK-8558] [BUILD] Script /dev/run-tests fails when _JAVA_OPTIONS env var set Author: fe2s <aka.fe2s@gmail.com> Author: Oleksiy Dyagilev <oleksiy_dyagilev@epam.com> Closes #6956 from fe2s/fix-run-tests and squashes the following commits: 31b6edc [fe2s] str is a built-in function, so using it as a variable name will lead to spurious warnings in some Python linters 7d781a0 [fe2s] fixing for openjdk/IBM, seems like they have slightly different wording, but all have 'version' word. Surrounding with spaces for the case if version word appears in _JAVA_OPTIONS cd455ef [fe2s] address comment, looking for java version string rather than expecting to have on a certain line number ad577d7 [Oleksiy Dyagilev] [SPARK-8558][BUILD] Script /dev/run-tests fails when _JAVA_OPTIONS env var set	2015-06-24 15:12:23 -07:00
Andrew Or	1dfb0f7b2a	[HOTFIX] [TESTS] Typo mqqt -> mqtt This was introduced in #6866.	2015-06-22 16:16:26 -07:00
Yu ISHIKAWA	004f57374b	[SPARK-8495] [SPARKR] Add a `.lintr` file to validate the SparkR files and the `lint-r` script Thank Shivaram Venkataraman for your support. This is a prototype script to validate the R files. Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #6922 from yu-iskw/SPARK-6813 and squashes the following commits: c1ffe6b [Yu ISHIKAWA] Modify to save result to a log file and add a rule to validate 5520806 [Yu ISHIKAWA] Exclude the .lintr file not to check Apache lincence 8f94680 [Yu ISHIKAWA] [SPARK-8495][SparkR] Add a `.lintr` file to validate the SparkR files and the `lint-r` script	2015-06-20 16:10:14 -07:00
Josh Rosen	7a3c424ecf	[SPARK-8422] [BUILD] [PROJECT INFRA] Add a module abstraction to dev/run-tests This patch builds upon #5694 to add a 'module' abstraction to the `dev/run-tests` script which groups together the per-module test logic, including the mapping from file paths to modules, the mapping from modules to test goals and build profiles, and the dependencies / relationships between modules. This refactoring makes it much easier to increase the granularity of test modules, which will let us skip even more tests. It's also a prerequisite for other changes that will reduce test time, such as running subsets of the Python tests based on which files / modules have changed. This patch also adds doctests for the new graph traversal / change mapping code. Author: Josh Rosen <joshrosen@databricks.com> Closes #6866 from JoshRosen/more-dev-run-tests-refactoring and squashes the following commits: 75de450 [Josh Rosen] Use module system to determine which build profiles to enable. 4224da5 [Josh Rosen] Add documentation to Module. a86a953 [Josh Rosen] Clean up modules; add new modules for streaming external projects e46539f [Josh Rosen] Fix camel-cased endswith() 35a3052 [Josh Rosen] Enable Hive tests when running all tests df10e23 [Josh Rosen] update to reflect fact that no module depends on root 3670d50 [Josh Rosen] mllib should depend on streaming dc6f1c6 [Josh Rosen] Use changed files' extensions to decide whether to run style checks 7092d3e [Josh Rosen] Skip SBT tests if no test goals are specified 43a0ced [Josh Rosen] Minor fixes 3371441 [Josh Rosen] Test everything if nothing has changed (needed for non-PRB builds) 37f3fb3 [Josh Rosen] Remove doc profiles option, since it's not actually needed (see #6865) f53864b [Josh Rosen] Finish integrating module changes f0249bd [Josh Rosen] WIP	2015-06-20 16:05:54 -07:00
Josh Rosen	165f52f2f9	[HOTFIX] [PROJECT-INFRA] Fix bug in dev/run-tests for MLlib-only PRs	2015-06-17 19:03:41 -07:00
Brennon York	50a0496a43	[SPARK-7017] [BUILD] [PROJECT INFRA] Refactor dev/run-tests into Python All, this is a first attempt at refactoring `dev/run-tests` into Python. Initially I merely converted all Bash calls over to Python, then moved to a much more modular approach (more functions, moved the calls around, etc.). What is here is the initial culmination and should provide a great base to various downstream issues (e.g. SPARK-7016, modularize / parallelize testing, etc.). Would love comments / suggestions for this initial first step! /cc srowen pwendell nchammas Author: Brennon York <brennon.york@capitalone.com> Closes #5694 from brennonyork/SPARK-7017 and squashes the following commits: 154ed73 [Brennon York] updated finding java binary if JAVA_HOME not set 3922a85 [Brennon York] removed necessary passed in variable f9fbe54 [Brennon York] reverted doc test change 8135518 [Brennon York] removed the test check for documentation changes until jenkins can get updated 05d435b [Brennon York] added check for jekyll install 22edb78 [Brennon York] add check if jekyll isn't installed on the path 2dff136 [Brennon York] fixed pep8 whitespace errors 767a668 [Brennon York] fixed path joining issues, ensured docs actually build on doc changes c42cf9a [Brennon York] unpack set operations with splat (*) fb85a41 [Brennon York] fixed minor set bug 0379833 [Brennon York] minor doc addition to print the changed modules aa03d9e [Brennon York] added documentation builds as a top level test component, altered high level project changes to properly execute core tests only when necessary, changed variable names for simplicity ec1ae78 [Brennon York] minor name changes, bug fixes b7c72b9 [Brennon York] reverting streaming context 03fdd7b [Brennon York] fixed the tuple () wraps around example lambda 705d12e [Brennon York] changed example to comply with pep3113 supporting python3 60b3d51 [Brennon York] prepend rather than append onto PATH 7d2f5e2 [Brennon York] updated python tests to remove unused variable 2898717 [Brennon York] added a change to streaming test to check if it only runs streaming tests eb684b6 [Brennon York] fixed sbt_test_goals reference error db7ae6f [Brennon York] reverted SPARK_HOME from start of command 1ecca26 [Brennon York] fixed merge conflicts 2fcdfc0 [Brennon York] testing targte branch dump on jenkins 1f607b1 [Brennon York] finalizing revisions to modular tests 8afbe93 [Brennon York] made error codes a global 0629de8 [Brennon York] updated to refactor and remove various small bugs, removed pep8 complaints d90ab2d [Brennon York] fixed merge conflicts, ensured that for regular builds both core and sql tests always run b1248dc [Brennon York] exec python rather than running python and exiting with return code f9deba1 [Brennon York] python to python2 and removed newline 6d0a052 [Brennon York] incorporated merge conflicts with SPARK-7249 f950010 [Brennon York] removed building hive-0.12.0 per SPARK-6908 703f095 [Brennon York] fixed merge conflicts b1ca593 [Brennon York] reverted the sparkR test afeb093 [Brennon York] updated to make sparkR test fail 1dada6b [Brennon York] reverted pyspark test failure 9a592ec [Brennon York] reverted mima exclude issue, added pyspark test failure d825aa4 [Brennon York] revert build break, add mima break f041d8a [Brennon York] added space from commented import to now test build breaking 983f2a2 [Brennon York] comment out import to fail build test 2386785 [Brennon York] Merge remote-tracking branch 'upstream/master' into SPARK-7017 76335fb [Brennon York] reverted rat license issue for sparkconf e4a96cc [Brennon York] removed the import error and added license error, fixed the way run-tests and run-tests.py report their error codes 56d3cb9 [Brennon York] changed test back and commented out import to break compile b37328c [Brennon York] fixed typo and added default return is no error block was found in the environment 7613558 [Brennon York] updated to return the proper env variable for return codes a5bd445 [Brennon York] reverted license, changed test in shuffle to fail 803143a [Brennon York] removed license file for SparkContext b0b2604 [Brennon York] comment out import to see if build fails and returns properly 83e80ef [Brennon York] attempt at better python output when called from bash c095fa6 [Brennon York] removed another wait() call 26e18e8 [Brennon York] removed unnecessary wait() 07210a9 [Brennon York] minor doc string change for java version with namedtuple update ec03bf3 [Brennon York] added namedtuple for java version to add readability 2cb413b [Brennon York] upcased global variables, changes various calling methods from check_output to check_call 639f1e9 [Brennon York] updated with pep8 rules, fixed minor bugs, added run-tests file in bash to call the run-tests.py script 3c53a1a [Brennon York] uncomment the scala tests :) 6126c4f [Brennon York] refactored run-tests into python	2015-06-17 12:00:34 -07:00

... 2 3 4 5 6 ...

542 commits