ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Dongjoon Hyun	dde0d2fcad	[SPARK-29991][INFRA] Support `test-hive1.2` in PR Builder ### What changes were proposed in this pull request? Currently, Apache Spark PR Builder using `hive-1.2` for `hadoop-2.7` and `hive-2.3` for `hadoop-3.2`. This PR aims to support `[test-hive1.2]` in PR Builder in order to cut the correlation between `hive-1.2/2.3` to `hadoop-2.7/3.2`. After this PR, the PR Builder will use `hive-2.3` by default for all profiles (if there is no `test-hive1.2`.) ### Why are the changes needed? This new tag allows us more flexibility. ### Does this PR introduce any user-facing change? No. (This is a dev-only change.) ### How was this patch tested? Check the Jenkins triggers in this PR. BEFORE ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark using SBT with these arguments: -Phadoop-2.7 -Phive-1.2 -Pyarn -Pkubernetes -Phive -Phadoop-cloud -Pspark-ganglia-lgpl -Phive-thriftserver -Pkinesis-asl -Pmesos test:package streaming-kinesis-asl-assembly/assembly ``` AFTER 1. Title: [[SPARK-29991][INFRA][test-hive1.2] Support `test-hive1.2` in PR Builder](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114550/testReport) ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark using SBT with these arguments: -Phadoop-2.7 -Phive-1.2 -Pkinesis-asl -Phadoop-cloud -Pyarn -Phive -Pmesos -Pspark-ganglia-lgpl -Pkubernetes -Phive-thriftserver test:package streaming-kinesis-asl-assembly/assembly ``` 2. Title: [[SPARK-29991][INFRA] Support `test hive1.2` in PR Builder](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114551/testReport) - Note that I removed the hyphen intentionally from `test-hive1.2`. ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark using SBT with these arguments: -Phadoop-2.7 -Phive-thriftserver -Pkubernetes -Pspark-ganglia-lgpl -Phadoop-cloud -Phive -Pmesos -Pyarn -Pkinesis-asl test:package streaming-kinesis-asl-assembly/assembly ``` Closes #26695 from dongjoon-hyun/SPARK-29991. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-27 21:14:40 -08:00
Dongjoon Hyun	c98e5eb339	[SPARK-29981][BUILD] Add hive-1.2/2.3 profiles ### What changes were proposed in this pull request? This PR aims the followings. - Add two profiles, `hive-1.2` and `hive-2.3` (default) - Validate if we keep the existing combination at least. (Hadoop-2.7 + Hive 1.2 / Hadoop-3.2 + Hive 2.3). For now, we assumes that `hive-1.2` is explicitly used with `hadoop-2.7` and `hive-2.3` with `hadoop-3.2`. The followings are beyond the scope of this PR. - SPARK-29988 Adjust Jenkins jobs for `hive-1.2/2.3` combination - SPARK-29989 Update release-script for `hive-1.2/2.3` combination - SPARK-29991 Support `hive-1.2/2.3` in PR Builder ### Why are the changes needed? This will help to switch our dependencies to update the exposed dependencies. ### Does this PR introduce any user-facing change? This is a dev-only change that the build profile combinations are changed. - `-Phadoop-2.7` => `-Phadoop-2.7 -Phive-1.2` - `-Phadoop-3.2` => `-Phadoop-3.2 -Phive-2.3` ### How was this patch tested? Pass the Jenkins with the dependency check and tests to make it sure we don't change anything for now. - [Jenkins (-Phadoop-2.7 -Phive-1.2)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114192/consoleFull) - [Jenkins (-Phadoop-3.2 -Phive-2.3)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114192/consoleFull) Also, from now, GitHub Action validates the following combinations. ![gha](https://user-images.githubusercontent.com/9700541/69355365-822d5e00-0c36-11ea-93f7-e00e5459e1d0.png) Closes #26619 from dongjoon-hyun/SPARK-29981. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-23 10:02:22 -08:00
Dongjoon Hyun	d0470d6394	[MINOR][TESTS] Ignore GitHub Action and AppVeyor file changes in testing ### What changes were proposed in this pull request? This PR aims to ignore `GitHub Action` and `AppVeyor` file changes. When we touch these files, Jenkins job should not trigger a full testing. ### Why are the changes needed? Currently, these files are categorized to `root` and trigger the full testing and ends up wasting the Jenkins resources. - https://github.com/apache/spark/pull/26555 ``` [info] Using build tool sbt with Hadoop profile hadoop2.7 under environment amplab_jenkins From https://github.com/apache/spark * [new branch] master -> master [info] Found the following changed modules: sparkr, root [info] Setup the following environment variables for tests: ``` ### Does this PR introduce any user-facing change? No. (Jenkins testing only). ### How was this patch tested? Manually. ``` $ dev/run-tests.py -h -v ... Trying: [x.name for x in determine_modules_for_files([".github/workflows/master.yml", "appveyor.xml"])] Expecting: [] ... ``` Closes #26556 from dongjoon-hyun/SPARK-IGNORE-APPVEYOR. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-16 09:26:01 -08:00
shane knapp	04e99c1e1b	[SPARK-29672][PYSPARK] update spark testing framework to use python3 ### What changes were proposed in this pull request? remove python2.7 tests and test infra for 3.0+ ### Why are the changes needed? because python2.7 is finally going the way of the dodo. ### Does this PR introduce any user-facing change? newp. ### How was this patch tested? the build system will test this Closes #26330 from shaneknapp/remove-py27-tests. Lead-authored-by: shane knapp <incomplete@gmail.com> Co-authored-by: shane <incomplete@gmail.com> Signed-off-by: shane knapp <incomplete@gmail.com>	2019-11-14 10:18:55 -08:00
shane knapp	84d4f94596	[SPARK-28701][INFRA][FOLLOWUP] Fix the key error when looking in os.environ ### What changes were proposed in this pull request? i broke run-tests.py for non-PRB builds in this PR: https://github.com/apache/spark/pull/25423 ### Why are the changes needed? to fix what i broke ### Does this PR introduce any user-facing change? no ### How was this patch tested? the build system will test this Closes #25585 from shaneknapp/fix-run-tests. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: shane knapp <incomplete@gmail.com>	2019-08-26 12:40:31 -07:00
shane knapp	13fd32c9a9	[SPARK-28701][TEST-HADOOP3.2][TEST-JAVA11][K8S] adding java11 support for pull request builds ## What changes were proposed in this pull request? we need to add the ability to test PRBs against java11. see comments here: https://github.com/apache/spark/pull/25405 ## How was this patch tested? the build system will test this. Closes #25423 from shaneknapp/spark-prb-java11. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-27 00:48:01 +09:00
WeichenXu	f21bc1874a	[SPARK-27889][INFRA] Make development scripts under dev/ support Python 3 ## What changes were proposed in this pull request? I made an audit and update all dev scripts to support python3. (except `merge_spark_pr.py` which already updated) ## How was this patch tested? Manual. Closes #25289 from WeichenXu123/dev_py3. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-09 18:55:48 +09:00
Hyukjin Kwon	1d36b892ab	[SPARK-7721][INFRA][FOLLOW-UP] Remove cloned coverage repo after posting HTMLs ## What changes were proposed in this pull request? This PR proposes to remove cloned `pyspark-coverage-site` repo. it doesn't looks a problem in PR builder but somehow it's problematic in `spark-master-test-sbt-hadoop-2.7`. ## How was this patch tested? Jenkins. Closes #23729 from HyukjinKwon/followup-coverage. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-06-25 09:18:32 +09:00
Dongjoon Hyun	e5d95117e4	[SPARK-27979][BUILD][test-maven] Remove deprecated `--force` option in `build/mvn` and `run-tests.py` ## What changes were proposed in this pull request? This is a second try of #24824. Since Apache Spark 2.0.0, SPARK-14867 deprecated `--force` option and made it ignored. This PR cleans up the related code completely at 3.0.0. BEFORE (Jenkins) ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark using Maven with these arguments: -Phadoop-2.7 -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos clean package -DskipTests WARNING: '--force' is deprecated and ignored. ... ======================================================================== Running Spark unit tests ======================================================================== [info] Running Spark tests using Maven with these arguments: -Phadoop-2.7 -Phive-thriftserver -Phive -Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest,org.apache.spark.tags.ExtendedYarnTest test --fail-at-end WARNING: '--force' is deprecated and ignored. ``` AFTER (Jenkins) ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark using Maven with these arguments: -Phadoop-2.7 -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos clean package -DskipTests ... ======================================================================== Running Spark unit tests ======================================================================== [info] Running Spark tests using Maven with these arguments: -Phadoop-2.7 -Pkubernetes -Phive-thriftserver -Pyarn -Pspark-ganglia-lgpl -Phive -Pkinesis-asl -Pmesos -Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest,org.apache.spark.tags.ExtendedYarnTest test --fail-at-end ``` ## How was this patch tested? Manually check the Jenkins logs. Closes #24833 from dongjoon-hyun/SPARK-FORCE-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-10 18:40:46 -07:00
Dongjoon Hyun	742f805177	Revert "[SPARK-27979][BUILD][test-maven] Remove deprecated `--force` option in `build/mvn` and `run-tests.py`" This reverts commit `354ec254c5`.	2019-06-09 08:33:21 -07:00
Dongjoon Hyun	354ec254c5	[SPARK-27979][BUILD][test-maven] Remove deprecated `--force` option in `build/mvn` and `run-tests.py` ## What changes were proposed in this pull request? Since Apache Spark 2.0.0, SPARK-14867 deprecated `--force` option and made it ignored. This PR cleans up the related code completely at 3.0.0. BEFORE (Jenkins) ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark using Maven with these arguments: -Phadoop-2.7 -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos clean package -DskipTests WARNING: '--force' is deprecated and ignored. ... ======================================================================== Running Spark unit tests ======================================================================== [info] Running Spark tests using Maven with these arguments: -Phadoop-2.7 -Phive-thriftserver -Phive -Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest,org.apache.spark.tags.ExtendedYarnTest test --fail-at-end WARNING: '--force' is deprecated and ignored. ``` AFTER (Jenkins) ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark using Maven with these arguments: -Phadoop-2.7 -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos clean package -DskipTests ... ======================================================================== Running Spark unit tests ======================================================================== [info] Running Spark tests using Maven with these arguments: -Phadoop-2.7 -Pkubernetes -Phive-thriftserver -Pyarn -Pspark-ganglia-lgpl -Phive -Pkinesis-asl -Pmesos -Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest,org.apache.spark.tags.ExtendedYarnTest test --fail-at-end ``` ## How was this patch tested? Manually check the Jenkins logs. Closes #24824 from dongjoon-hyun/SPARK-27979. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-08 08:17:12 -07:00
Yuming Wang	3f102a8229	[SPARK-27749][SQL] hadoop-3.2 support hive-thriftserver ## What changes were proposed in this pull request? This PR mainly makes the following changes to make `hadoop-3.2` support `sql/hive-thriftserver`: 1. Upgrade [`TCLIService.thrift`](https://github.com/apache/hive/blob/rel/release-2.3.5/service-rpc/if/TCLIService.thrift) and related code to Hive 2.3.5 because of [HIVE-12442](https://issues.apache.org/jira/browse/HIVE-12442)(Note that we only migrate code without adding features, such as [HIVE-4924](https://issues.apache.org/jira/browse/HIVE-4924) and [HIVE-15473](https://issues.apache.org/jira/browse/HIVE-15473)). 2. Use slf4j as logging facade because of [HIVE-12237](https://issues.apache.org/jira/browse/HIVE-12237). 3. Port [HIVE-13169](https://issues.apache.org/jira/browse/HIVE-13169) to compatible with Hive 2.3. ## How was this patch tested? Exiting test Closes #24628 from wangyum/SPARK-27749. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-06-05 08:40:05 -07:00
HyukjinKwon	b7bf4fd123	[SPARK-27402][INFRA][FOLLOW-UP] Exclude 'hive-thriftserver' in modules to test for hadoop3.2 for now ## What changes were proposed in this pull request? This PR excludes 'hive-thriftserver' in modules to test for hadoop3.2 for now as well ## How was this patch tested? Manually tested via `run-tests.py` Closes #24644 from HyukjinKwon/SPARK-27402. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-20 07:53:19 -07:00
Yuming Wang	f82ed5e8e0	[MINOR][TEST] Remove out-dated hive version in run-tests.py ## What changes were proposed in this pull request? ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-3.2 -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos test:package streaming-kinesis-asl-assembly/assembly ``` `(w/Hive 1.2.1)` is incorrect when testing hadoop-3.2, It's should be (w/Hive 2.3.4). This pr removes `(w/Hive 1.2.1)` in run-tests.py. ## How was this patch tested? N/A Closes #24451 from wangyum/run-tests-invalid-info. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-24 21:22:15 -07:00
shane knapp	e1ece6a319	[SPARK-25079][PYTHON] update python3 executable to 3.6.x ## What changes were proposed in this pull request? have jenkins test against python3.6 (instead of 3.4). ## How was this patch tested? extensive testing on both the centos and ubuntu jenkins workers. NOTE: this will need to be backported to all active branches. Closes #24266 from shaneknapp/updating-python3-executable. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-19 10:03:50 +09:00
Yuming Wang	9c0af746e5	[SPARK-27175][BUILD] Upgrade hadoop-3 to 3.2.0 ## What changes were proposed in this pull request? This PR upgrade `hadoop-3` to `3.2.0` to workaround [HADOOP-16086](https://issues.apache.org/jira/browse/HADOOP-16086). Otherwise some test case will throw IllegalArgumentException: ```java 02:44:34.707 ERROR org.apache.hadoop.hive.ql.exec.Task: Job Submission failed with exception 'java.io.IOException(Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.)' java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:116) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:109) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:102) at org.apache.hadoop.mapred.JobClient.init(JobClient.java:475) at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:454) at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:369) at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:151) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:730) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:283) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266) at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:719) at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:709) at org.apache.spark.sql.hive.StatisticsSuite.createNonPartitionedTable(StatisticsSuite.scala:719) at org.apache.spark.sql.hive.StatisticsSuite.$anonfun$testAlterTableProperties$2(StatisticsSuite.scala:822) ``` ## How was this patch tested? manual tests Closes #24106 from wangyum/SPARK-27175. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-16 19:42:05 -05:00
Yuming Wang	f0b6245ea4	[SPARK-27158][BUILD] dev/mima and dev/scalastyle support dynamic profiles ## What changes were proposed in this pull request? `dev/mima` and `dev/scalastyle` support dynamic reading profiles from `modules.py`. ## How was this patch tested? manual tests Closes #24089 from wangyum/SPARK-27158. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-15 08:20:42 +09:00
Yuming Wang	dccf6615c3	[SPARK-27130][BUILD] Automatically select profile when executing sbt-checkstyle ## What changes were proposed in this pull request? This PR makes it automatically select profile when executing `sbt-checkstyle`. The reason for this is that `hadoop-2.7` and `hadoop-3.1` may have different `hive-thriftserver` module in the future. ## How was this patch tested? manual tests: ``` Update AbstractService.java file. export HADOOP_PROFILE=hadoop2.7 ./dev/run-tests ``` The result: ![image](https://user-images.githubusercontent.com/5399861/54197992-5337e780-4500-11e9-930c-722982cdcd45.png) Closes #24065 from wangyum/SPARK-27130. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-13 08:03:46 +09:00
Yuming Wang	8ab13065f6	[SPARK-23807][FOLLOW-UP][BUILD][TEST-HADOOP3.1] Add test-hadoop3.1 phrase ## What changes were proposed in this pull request? Add `test-hadoop3.1` phrase to test Spark against Spark’s Hadoop 3.1 profile. ## How was this patch tested? Tested on jenkins. This is output: ``` [info] Using build tool sbt with Hadoop profile hadoop3.1 under environment amplab_jenkins ... [info] Building Spark (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-3.1 -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos test:package streaming-kinesis-asl-assembly/assembly ``` https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/103282/console Closes #24045 from wangyum/SPARK-23807. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-11 11:15:58 +09:00
cchung100m	dc46fb77ba	[SPARK-26822] Upgrade the deprecated module 'optparse' Follow the [official document](https://docs.python.org/2/library/argparse.html#upgrading-optparse-code) to upgrade the deprecated module 'optparse' to 'argparse'. ## What changes were proposed in this pull request? This PR proposes to replace 'optparse' module with 'argparse' module. ## How was this patch tested? Follow the [previous testing](`7e3eb3cd20`), manually tested and negative tests were also done. My [test results](https://gist.github.com/cchung100m/1661e7df6e8b66940a6e52a20861f61d) Closes #23730 from cchung100m/solve_deprecated_module_optparse. Authored-by: cchung100m <cchung100m@cs.ccu.edu.tw> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-10 00:36:22 -06:00
Hyukjin Kwon	cdd694c52b	[SPARK-7721][INFRA] Run and generate test coverage report from Python via Jenkins ## What changes were proposed in this pull request? ### Background For the current status, the test script that generates coverage information was merged into Spark, https://github.com/apache/spark/pull/20204 So, we can generate the coverage report and site by, for example: ``` run-tests-with-coverage --python-executables=python3 --modules=pyspark-sql ``` like `run-tests` script in `./python`. ### Proposed change The next step is to host this coverage report via `github.io` automatically by Jenkins (see https://spark-test.github.io/pyspark-coverage-site/). This uses my testing account for Spark, spark-test, which is shared to Felix and Shivaram a long time ago for testing purpose including AppVeyor. To cut this short, this PR targets to run the coverage in [spark-master-test-sbt-hadoop-2.7](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/) In the specific job, it will clone the page, and rebase the up-to-date PySpark test coverage from the latest commit. For instance as below: ```bash # Clone PySpark coverage site. git clone https://github.com/spark-test/pyspark-coverage-site.git # Remove existing HTMLs. rm -fr pyspark-coverage-site/* # Copy generated coverage HTMLs. cp -r .../python/test_coverage/htmlcov/* pyspark-coverage-site/ # Check out to a temporary branch. git symbolic-ref HEAD refs/heads/latest_branch # Add all the files. git add -A # Commit current HTMLs. git commit -am "Coverage report at latest commit in Apache Spark" # Delete the old branch. git branch -D gh-pages # Rename the temporary branch to master. git branch -m gh-pages # Finally, force update to our repository. git push -f origin gh-pages ``` So, it is a one single up-to-date coverage can be shown in the `github-io` page. The commands above were manually tested. ### TODOs - [x] Write a draft HyukjinKwon - [x] `pip install coverage` to all python implementations (pypy, python2, python3) in Jenkins workers - shaneknapp - [x] Set hidden `SPARK_TEST_KEY` for spark-test's password in Jenkins via Jenkins's feature This should be set in both PR builder and `spark-master-test-sbt-hadoop-2.7` so that later other PRs can test and fix the bugs - shaneknapp - [x] Set an environment variable that indicates `spark-master-test-sbt-hadoop-2.7` so that that specific build can report and update the coverage site - shaneknapp - [x] Make PR builder's test passed HyukjinKwon - [x] Fix flaky test related with coverage HyukjinKwon - 6 consecutive passes out of 7 runs This PR will be co-authored with me and shaneknapp ## How was this patch tested? It will be tested via Jenkins. Closes #23117 from HyukjinKwon/SPARK-7721. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: hyukjinkwon <gurwls223@apache.org> Co-authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-02-01 10:18:08 +08:00
Sean Owen	c2d0d700b5	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis ## What changes were proposed in this pull request? Misc code cleanup from lgtm.com analysis. See comments below for details. ## How was this patch tested? Existing tests. Closes #23571 from srowen/SPARK-26640. Lead-authored-by: Sean Owen <sean.owen@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-17 19:40:39 -06:00
hyukjinkwon	41d5aaec84	[SPARK-26148][PYTHON][TESTS] Increases default parallelism in PySpark tests to speed up ## What changes were proposed in this pull request? This PR proposes to increase parallelism in PySpark tests to speed up from 4 to 8. It decreases the elapsed time from https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99163/consoleFull Tests passed in 1770 seconds to https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99186/testReport/ Tests passed in 1027 seconds ## How was this patch tested? Jenkins tests Closes #23111 from HyukjinKwon/parallelism. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-26 00:26:24 +09:00
Wenchen Fan	130121711c	fix security issue of zinc(update run-tests.py)	2018-10-20 00:23:16 +08:00
Sean Owen	703e6da1ec	[SPARK-25705][BUILD][STREAMING][TEST-MAVEN] Remove Kafka 0.8 integration ## What changes were proposed in this pull request? Remove Kafka 0.8 integration ## How was this patch tested? Existing tests, build scripts Closes #22703 from srowen/SPARK-25705. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-16 09:10:24 -05:00
Sean Owen	a001814189	[SPARK-25598][STREAMING][BUILD][TEST-MAVEN] Remove flume connector in Spark 3 ## What changes were proposed in this pull request? Removes all vestiges of Flume in the build, for Spark 3. I don't think this needs Jenkins config changes. ## How was this patch tested? Existing tests. Closes #22692 from srowen/SPARK-25598. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-11 14:28:06 -07:00
Sean Owen	80813e1980	[SPARK-25016][BUILD][CORE] Remove support for Hadoop 2.6 ## What changes were proposed in this pull request? Remove Hadoop 2.6 references and make 2.7 the default. Obviously, this is for master/3.0.0 only. After this we can also get rid of the separate test jobs for Hadoop 2.6. ## How was this patch tested? Existing tests Closes #22615 from srowen/SPARK-25016. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-10 12:07:53 -07:00
Sean Owen	08c76b5d39	[SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4 (This change is a subset of the changes needed for the JIRA; see https://github.com/apache/spark/pull/22231) ## What changes were proposed in this pull request? Use raw strings and simpler regex syntax consistently in Python, which also avoids warnings from pycodestyle about accidentally relying Python's non-escaping of non-reserved chars in normal strings. Also, fix a few long lines. ## How was this patch tested? Existing tests, and some manual double-checking of the behavior of regexes in Python 2/3 to be sure. Closes #22400 from srowen/SPARK-25238.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-09-13 11:19:43 +08:00
Gengliang Wang	395860a986	[SPARK-24768][SQL] Have a built-in AVRO data source implementation ## What changes were proposed in this pull request? Apache Avro (https://avro.apache.org) is a popular data serialization format. It is widely used in the Spark and Hadoop ecosystem, especially for Kafka-based data pipelines. Using the external package https://github.com/databricks/spark-avro, Spark SQL can read and write the avro data. Making spark-Avro built-in can provide a better experience for first-time users of Spark SQL and structured streaming. We expect the built-in Avro data source can further improve the adoption of structured streaming. The proposal is to inline code from spark-avro package (https://github.com/databricks/spark-avro). The target release is Spark 2.4. [Built-in AVRO Data Source In Spark 2.4.pdf](https://github.com/apache/spark/files/2181511/Built-in.AVRO.Data.Source.In.Spark.2.4.pdf) ## How was this patch tested? Unit test Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #21742 from gengliangwang/export_avro.	2018-07-12 13:55:25 -07:00
hyukjinkwon	b0a9352559	[SPARK-24573][INFRA] Runs SBT checkstyle after the build to work around a side-effect ## What changes were proposed in this pull request? Seems checkstyle affects the build in the PR builder in Jenkins. I can't reproduce in my local and seems it can only be reproduced in the PR builder. I was checking the places it goes through and this is just a speculation that checkstyle's compilation in SBT has a side effect to the assembly build. This PR proposes to run the SBT checkstyle after the build. ## How was this patch tested? Jenkins tests. Author: hyukjinkwon <gurwls223@apache.org> Closes #21579 from HyukjinKwon/investigate-javastyle.	2018-06-18 15:32:34 +08:00
hyukjinkwon	4a14dc0aff	[SPARK-22269][BUILD] Run Java linter via SBT for Jenkins ## What changes were proposed in this pull request? This PR proposes to check Java lint via SBT for Jenkins. It uses the SBT wrapper for checkstyle. I manually tested. If we build the codes once, running this script takes 2 mins at maximum in my local: Test codes: ``` Checkstyle failed at following occurrences: [error] Checkstyle error found in /.../spark/core/src/test/java/test/org/apache/spark/JavaAPISuite.java:82: Line is longer than 100 characters (found 103). [error] 1 issue(s) found in Checkstyle report: /.../spark/core/target/checkstyle-test-report.xml [error] Checkstyle error found in /.../spark/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:84: Line is longer than 100 characters (found 115). [error] 1 issue(s) found in Checkstyle report: /.../spark/sql/hive/target/checkstyle-test-report.xml ... ``` Main codes: ``` Checkstyle failed at following occurrences: [error] Checkstyle error found in /.../spark/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartition.java:39: Line is longer than 100 characters (found 104). [error] Checkstyle error found in /.../spark/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartitionReader.java:26: Line is longer than 100 characters (found 110). [error] Checkstyle error found in /.../spark/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartitionReader.java:30: Line is longer than 100 characters (found 104). ... ``` ## How was this patch tested? Manually tested. Jenkins build should test this. Author: hyukjinkwon <gurwls223@apache.org> Closes #21399 from HyukjinKwon/SPARK-22269.	2018-05-24 14:19:32 +08:00
Benjamin Peterson	7013eea11c	[SPARK-23522][PYTHON] always use sys.exit over builtin exit The exit() builtin is only for interactive use. applications should use sys.exit(). ## What changes were proposed in this pull request? All usage of the builtin `exit()` function is replaced by `sys.exit()`. ## How was this patch tested? I ran `python/run-tests`. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Benjamin Peterson <benjamin@python.org> Closes #20682 from benjaminp/sys-exit.	2018-03-08 20:38:34 +09:00
Rekha Joshi	7af1a325da	[SPARK-23174][BUILD][PYTHON] python code style checker update ## What changes were proposed in this pull request? Referencing latest python code style checking from PyPi/pycodestyle Removed pending TODO For now, in tox.ini excluded the additional style error discovered on existing python due to latest style checker (will fallback on review comment to finalize exclusion or fix py) Any further code styling requirement needs to be part of pycodestyle, not in SPARK. ## How was this patch tested? ./dev/run-tests Author: Rekha Joshi <rekhajoshm@gmail.com> Author: rjoshi2 <rekhajoshm@gmail.com> Closes #20338 from rekhajoshm/SPARK-11222.	2018-01-24 21:13:47 +09:00
hyukjinkwon	12faae295e	[SPARK-23169][INFRA][R] Run lintr on the changes of lint-r script and .lintr configuration ## What changes were proposed in this pull request? When running the `run-tests` script, seems we don't run lintr on the changes of `lint-r` script and `.lintr` configuration. ## How was this patch tested? Jenkins builds Author: hyukjinkwon <gurwls223@gmail.com> Closes #20339 from HyukjinKwon/check-r-changed.	2018-01-22 09:45:27 +09:00
Kazuaki Ishizaki	3a07eff5af	[SPARK-22813][BUILD] Use lsof or /usr/sbin/lsof in run-tests.py ## What changes were proposed in this pull request? In [the environment where `/usr/sbin/lsof` does not exist](https://github.com/apache/spark/pull/19695#issuecomment-342865001), `./dev/run-tests.py` for `maven` causes the following error. This is because the current `./dev/run-tests.py` checks existence of only `/usr/sbin/lsof` and aborts immediately if it does not exist. This PR changes to check whether `lsof` or `/usr/sbin/lsof` exists. ``` /bin/sh: 1: /usr/sbin/lsof: not found Usage: kill [options] <pid> [...] Options: <pid> [...] send signal to every <pid> listed -<signal>, -s, --signal <signal> specify the <signal> to be sent -l, --list=[<signal>] list all signal names, or convert one to a name -L, --table list all signal names in a nice table -h, --help display this help and exit -V, --version output version information and exit For more details see kill(1). Traceback (most recent call last): File "./dev/run-tests.py", line 626, in <module> main() File "./dev/run-tests.py", line 597, in main build_apache_spark(build_tool, hadoop_version) File "./dev/run-tests.py", line 389, in build_apache_spark build_spark_maven(hadoop_version) File "./dev/run-tests.py", line 329, in build_spark_maven exec_maven(profiles_and_goals) File "./dev/run-tests.py", line 270, in exec_maven kill_zinc_on_port(zinc_port) File "./dev/run-tests.py", line 258, in kill_zinc_on_port subprocess.check_call(cmd, shell=True) File "/usr/lib/python2.7/subprocess.py", line 541, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '/usr/sbin/lsof -P \|grep 3156 \| grep LISTEN \| awk '{ print $2; }' \| xargs kill' returned non-zero exit status 123 ``` ## How was this patch tested? manually tested Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19998 from kiszk/SPARK-22813.	2017-12-19 07:35:03 +09:00
hyukjinkwon	160a540610	[SPARK-22376][TESTS] Makes dev/run-tests.py script compatible with Python 3 ## What changes were proposed in this pull request? This PR proposes to fix `dev/run-tests.py` script to support Python 3. Here are some backgrounds. Up to my knowledge, In Python 2, - `unicode` is NOT `str` in Python 2 (`type("foo") != type(u"foo")`). - `str` has an alias, `bytes` in Python 2 (`type("foo") == type(b"foo")`). In Python 3, - `unicode` was (roughly) replaced by `str` in Python 3 (`type("foo") == type(u"foo")`). - `str` is NOT `bytes` in Python 3 (`type("foo") != type(b"foo")`). So, this PR fixes: 1. Use `b''` instead of `''` so that both `str` in Python 2 and `bytes` in Python 3 can be hanlded. `sbt_proc.stdout.readline()` returns `str` (which has an alias, `bytes`) in Python 2 and `bytes` in Python 3 2. Similarily, use `b''` instead of `''` so that both `str` in Python 2 and `bytes` in Python 3 can be hanlded. `re.compile` with `str` pattern does not seem supporting to match `bytes` in Python 3: Actually, this change is recommended up to my knowledge - https://docs.python.org/3/howto/pyporting.html#text-versus-binary-data: > Mark all binary literals with a b prefix, textual literals with a u prefix ## How was this patch tested? I manually tested this via Python 3 with few additional changes to reduce the elapsed time. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19665 from HyukjinKwon/SPARK-22376.	2017-11-07 19:45:34 +09:00
Wenchen Fan	864d94fe87	[SPARK-20974][BUILD] we should run REPL tests if SQL module has code changes ## What changes were proposed in this pull request? REPL module depends on SQL module, so we should run REPL tests if SQL module has code changes. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #18191 from cloud-fan/test.	2017-06-02 21:59:52 -07:00
hyukjinkwon	35378766ad	[SPARK-20343][BUILD] Avoid Unidoc build only if Hadoop 2.6 is explicitly set in SBT build ## What changes were proposed in this pull request? This PR proposes two things as below: - Avoid Unidoc build only if Hadoop 2.6 is explicitly set in SBT build Due to a different dependency resolution in SBT & Unidoc by an unknown reason, the documentation build fails on a specific machine & environment in Jenkins but it was unable to reproduce. So, this PR just checks an environment variable `AMPLAB_JENKINS_BUILD_PROFILE` that is set in Hadoop 2.6 SBT build against branches on Jenkins, and then disables Unidoc build. Note that PR builder will still build it with Hadoop 2.6 & SBT. ``` ======================================================================== Building Unidoc API Documentation ======================================================================== [info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive unidoc Using /usr/java/jdk1.8.0_60 as default JAVA_HOME. ... ``` I checked the environment variables from the logs (first bit) as below: - spark-master-test-sbt-hadoop-2.6 (this one is being failed) - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/lastBuild/consoleFull ``` JAVA_HOME=/usr/java/jdk1.8.0_60 JAVA_7_HOME=/usr/java/jdk1.7.0_79 SPARK_BRANCH=master AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.6 <- I use this variable AMPLAB_JENKINS="true" ``` - spark-master-test-sbt-hadoop-2.7 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/lastBuild/consoleFull ``` JAVA_HOME=/usr/java/jdk1.8.0_60 JAVA_7_HOME=/usr/java/jdk1.7.0_79 SPARK_BRANCH=master AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.7 AMPLAB_JENKINS="true" ``` - spark-master-test-maven-hadoop-2.6 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/lastBuild/consoleFull ``` JAVA_HOME=/usr/java/jdk1.8.0_60 JAVA_7_HOME=/usr/java/jdk1.7.0_79 HADOOP_PROFILE=hadoop-2.6 HADOOP_VERSION= SPARK_BRANCH=master AMPLAB_JENKINS="true" ``` - spark-master-test-maven-hadoop-2.7 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastBuild/consoleFull ``` JAVA_HOME=/usr/java/jdk1.8.0_60 JAVA_7_HOME=/usr/java/jdk1.7.0_79 HADOOP_PROFILE=hadoop-2.7 HADOOP_VERSION= SPARK_BRANCH=master AMPLAB_JENKINS="true" ``` - PR builder - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75843/consoleFull ``` JENKINS_MASTER_HOSTNAME=amp-jenkins-master JAVA_HOME=/usr/java/jdk1.8.0_60 JAVA_7_HOME=/usr/java/jdk1.7.0_79 ``` Assuming from other logs in branch-2.1 - SBT & Hadoop 2.6 against branch-2.1 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.1-test-sbt-hadoop-2.6/lastBuild/consoleFull ``` JAVA_HOME=/usr/java/jdk1.8.0_60 JAVA_7_HOME=/usr/java/jdk1.7.0_79 SPARK_BRANCH=branch-2.1 AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.6 AMPLAB_JENKINS="true" ``` - Maven & Hadoop 2.6 against branch-2.1 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.1-test-maven-hadoop-2.6/lastBuild/consoleFull ``` JAVA_HOME=/usr/java/jdk1.8.0_60 JAVA_7_HOME=/usr/java/jdk1.7.0_79 HADOOP_PROFILE=hadoop-2.6 HADOOP_VERSION= SPARK_BRANCH=branch-2.1 AMPLAB_JENKINS="true" ``` We have been using the same convention for those variables. These are actually being used in `run-tests.py` script - here https://github.com/apache/spark/blob/master/dev/run-tests.py#L519-L520 - Revert the previous try After https://github.com/apache/spark/pull/17651, it seems the build still fails on SBT Hadoop 2.6 master. I am unable to reproduce this - https://github.com/apache/spark/pull/17477#issuecomment-294094092 and the reviewer was too. So, this got merged as it looks the only way to verify this is to merge it currently (as no one seems able to reproduce this). ## How was this patch tested? I only checked `is_hadoop_version_2_6 = os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6"` is working fine as expected as below: ```python >>> import collections >>> os = collections.namedtuple('os', 'environ')(environ={"AMPLAB_JENKINS_BUILD_PROFILE": "hadoop2.6"}) >>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6") False >>> os = collections.namedtuple('os', 'environ')(environ={"AMPLAB_JENKINS_BUILD_PROFILE": "hadoop2.7"}) >>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6") True >>> os = collections.namedtuple('os', 'environ')(environ={}) >>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6") True ``` I tried many ways but I was unable to reproduce this in my local. Sean also tried the way I did but he was also unable to reproduce this. Please refer the comments in https://github.com/apache/spark/pull/17477#issuecomment-294094092 Author: hyukjinkwon <gurwls223@gmail.com> Closes #17669 from HyukjinKwon/revert-SPARK-20343.	2017-04-19 12:18:54 +01:00
hyukjinkwon	ceaf77ae43	[SPARK-18692][BUILD][DOCS] Test Java 8 unidoc build on Jenkins ## What changes were proposed in this pull request? This PR proposes to run Spark unidoc to test Javadoc 8 build as Javadoc 8 is easily re-breakable. There are several problems with it: - It introduces little extra bit of time to run the tests. In my case, it took 1.5 mins more (`Elapsed :[94.8746569157]`). How it was tested is described in "How was this patch tested?". - > One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up. (see joshrosen's [comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627)) To complete this automated build, It also suggests to fix existing Javadoc breaks / ones introduced by test codes as described above. There fixes are similar instances that previously fixed. Please refer https://github.com/apache/spark/pull/15999 and https://github.com/apache/spark/pull/16013 Note that this only fixes errors not warnings. Please see my observation https://github.com/apache/spark/pull/17389#issuecomment-288438704 for spurious errors by warnings. ## How was this patch tested? Manually via `jekyll build` for building tests. Also, tested via running `./dev/run-tests`. This was tested via manually adding `time.time()` as below: ```diff profiles_and_goals = build_profiles + sbt_goals print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ", " ".join(profiles_and_goals)) + import time + st = time.time() exec_sbt(profiles_and_goals) + print("Elapsed :[%s]" % str(time.time() - st)) ``` produces ``` ... ======================================================================== Building Unidoc API Documentation ======================================================================== ... [info] Main Java API documentation successful. ... Elapsed :[94.8746569157] ... Author: hyukjinkwon <gurwls223@gmail.com> Closes #17477 from HyukjinKwon/SPARK-18692.	2017-04-12 12:38:48 +01:00
Sean Owen	0e2405490f	[SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support - Move external/java8-tests tests into core, streaming, sql and remove - Remove MaxPermGen and related options - Fix some reflection / TODOs around Java 8+ methods - Update doc references to 1.7/1.8 differences - Remove Java 7/8 related build profiles - Update some plugins for better Java 8 compatibility - Fix a few Java-related warnings For the future: - Update Java 8 examples to fully use Java 8 - Update Java tests to use lambdas for simplicity - Update Java internal implementations to use lambdas ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #16871 from srowen/SPARK-19493.	2017-02-16 12:32:45 +00:00
Dongjoon Hyun	c618ccdbe9	[SPARK-19464][BUILD][HOTFIX] run-tests should use hadoop2.6 ## What changes were proposed in this pull request? After SPARK-19464, SparkPullRequestBuilder fails because it still tries to use hadoop2.3. BEFORE https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console ``` ======================================================================== Building Spark ======================================================================== [error] Could not find hadoop2.3 in the list. Valid options are ['hadoop2.6', 'hadoop2.7'] Attempting to post to Github... > Post successful. ``` AFTER https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive test:package streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly streaming-kinesis-asl-assembly/assembly Using /usr/java/jdk1.8.0_60 as default JAVA_HOME. Note, this will be overridden by -java-home if it is set. ``` ## How was this patch tested? Pass the existing test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16858 from dongjoon-hyun/hotfix_run-tests.	2017-02-08 21:28:04 +00:00
Sean Owen	e8d3fca450	[SPARK-19464][CORE][YARN][TEST-HADOOP2.6] Remove support for Hadoop 2.5 and earlier ## What changes were proposed in this pull request? - Remove support for Hadoop 2.5 and earlier - Remove reflection and code constructs only needed to support multiple versions at once - Update docs to reflect newer versions - Remove older versions' builds and profiles. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #16810 from srowen/SPARK-19464.	2017-02-08 12:20:07 +00:00
Holden Karau	a36a76ac43	[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed ## What changes were proposed in this pull request? This PR aims to provide a pip installable PySpark package. This does a bunch of work to copy the jars over and package them with the Python code (to prevent challenges from trying to use different versions of the Python code with different versions of the JAR). It does not currently publish to PyPI but that is the natural follow up (SPARK-18129). Done: - pip installable on conda [manual tested] - setup.py installed on a non-pip managed system (RHEL) with YARN [manual tested] - Automated testing of this (virtualenv) - packaging and signing with release-build* Possible follow up work: - release-build update to publish to PyPI (SPARK-18128) - figure out who owns the pyspark package name on prod PyPI (is it someone with in the project or should we ask PyPI or should we choose a different name to publish with like ApachePySpark?) - Windows support and or testing ( SPARK-18136 ) - investigate details of wheel caching and see if we can avoid cleaning the wheel cache during our test - consider how we want to number our dev/snapshot versions Explicitly out of scope: - Using pip installed PySpark to start a standalone cluster - Using pip installed PySpark for non-Python Spark programs *I've done some work to test release-build locally but as a non-committer I've just done local testing. ## How was this patch tested? Automated testing with virtualenv, manual testing with conda, a system wide install, and YARN integration. release-build changes tested locally as a non-committer (no testing of upload artifacts to Apache staging websites) Author: Holden Karau <holden@us.ibm.com> Author: Juliet Hougland <juliet@cloudera.com> Author: Juliet Hougland <not@myemail.com> Closes #15659 from holdenk/SPARK-1267-pip-install-pyspark.	2016-11-16 14:22:15 -08:00
Shixiong Zhu	9293734d35	[SPARK-17346][SQL] Add Kafka source for Structured Streaming ## What changes were proposed in this pull request? This PR adds a new project ` external/kafka-0-10-sql` for Structured Streaming Kafka source. It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing tdas did most of work and part of them was inspired by koeninger's work. ### Introduction The Kafka source is a structured streaming data source to poll data from Kafka. The schema of reading data is as follows: Column \| Type ---- \| ---- key \| binary value \| binary topic \| string partition \| int offset \| long timestamp \| long timestampType \| int The source can deal with deleting topics. However, the user should make sure there is no Spark job processing the data when deleting a topic. ### Configuration The user can use `DataStreamReader.option` to set the following configurations. Kafka Source's options \| value \| default \| meaning ------ \| ------- \| ------ \| ----- startingOffset \| ["earliest", "latest"] \| "latest" \| The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started, and that resuming will always pick up from where the query left off. failOnDataLost \| [true, false] \| true \| Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected. subscribe \| A comma-separated list of topics \| (none) \| The topic list to subscribe. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source. subscribePattern \| Java regex string \| (none) \| The pattern used to subscribe the topic. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source. kafka.consumer.poll.timeoutMs \| long \| 512 \| The timeout in milliseconds to poll data from Kafka in executors fetchOffset.numRetries \| int \| 3 \| Number of times to retry before giving up fatch Kafka latest offsets. fetchOffset.retryIntervalMs \| long \| 10 \| milliseconds to wait before retrying to fetch Kafka offsets Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")` ### Usage * Subscribe to 1 topic ```Scala spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "host:port") .option("subscribe", "topic1") .load() ``` * Subscribe to multiple topics ```Scala spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "host:port") .option("subscribe", "topic1,topic2") .load() ``` * Subscribe to a pattern ```Scala spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "host:port") .option("subscribePattern", "topic.*") .load() ``` ## How was this patch tested? The new unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Author: Tathagata Das <tathagata.das1565@gmail.com> Author: Shixiong Zhu <zsxwing@gmail.com> Author: cody koeninger <cody@koeninger.org> Closes #15102 from zsxwing/kafka-source.	2016-10-05 16:45:45 -07:00
Sean Owen	536fa911c1	[SPARK-17329][BUILD] Don't build PRs with -Pyarn unless YARN code changed ## What changes were proposed in this pull request? Only build PRs with -Pyarn if YARN code was modified. ## How was this patch tested? Jenkins tests (will look to verify whether -Pyarn was included in the PR builder for this one.) Author: Sean Owen <sowen@cloudera.com> Closes #14892 from srowen/SPARK-17329.	2016-09-01 09:10:01 +01:00
Josh Rosen	acef843f67	[SPARK-15975] Fix improper Popen retcode code handling in dev/run-tests In the `dev/run-tests.py` script we check a `Popen.retcode` for success using `retcode > 0`, but this is subtlety wrong because Popen's return code will be negative if the child process was terminated by a signal: https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode In order to properly handle signals, we should change this to check `retcode != 0` instead. Author: Josh Rosen <joshrosen@databricks.com> Closes #13692 from JoshRosen/dev-run-tests-return-code-handling.	2016-06-16 14:18:58 -07:00
Shixiong Zhu	0ee9fd9e52	[SPARK-15935][PYSPARK] Fix a wrong format tag in the error message ## What changes were proposed in this pull request? A follow up PR for #13655 to fix a wrong format tag. ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13665 from zsxwing/fix.	2016-06-14 19:45:11 -07:00
Reynold Xin	45b7557e61	[SPARK-15424][SPARK-15437][SPARK-14807][SQL] Revert Create a hivecontext-compatibility module ## What changes were proposed in this pull request? I initially asked to create a hivecontext-compatibility module to put the HiveContext there. But we are so close to Spark 2.0 release and there is only a single class in it. It seems overkill to have an entire package, which makes it more inconvenient, for a single class. ## How was this patch tested? Tests were moved. Author: Reynold Xin <rxin@databricks.com> Closes #13207 from rxin/SPARK-15424.	2016-05-20 22:01:55 -07:00
cody koeninger	89e67d6667	[SPARK-15085][STREAMING][KAFKA] Rename streaming-kafka artifact ## What changes were proposed in this pull request? Renaming the streaming-kafka artifact to include kafka version, in anticipation of needing a different artifact for later kafka versions ## How was this patch tested? Unit tests Author: cody koeninger <cody@koeninger.org> Closes #12946 from koeninger/SPARK-15085.	2016-05-11 12:15:41 -07:00
Yin Huai	7dde1da949	[SPARK-14807] Create a compatibility module ## What changes were proposed in this pull request? This PR creates a compatibility module in sql (called `hive-1-x-compatibility`), which will host HiveContext in Spark 2.0 (moving HiveContext to here will be done separately). This module is not included in assembly because only users who still want to access HiveContext need it. ## How was this patch tested? I manually tested `sbt/sbt -Phive package` and `mvn -Phive package -DskipTests`. Author: Yin Huai <yhuai@databricks.com> Closes #12580 from yhuai/compatibility.	2016-04-22 17:50:24 -07:00

1 2

100 commits