Commit graph

647 commits

Author SHA1 Message Date
DB Tsai b6375097bc [SPARK-27026][BUILD] Upgrade Docker image for release build to Ubuntu 18.04 LTS
## What changes were proposed in this pull request?

Upgrade Docker image for release build to Ubuntu 18.04 LTS.

## How was this patch tested?

Manually tested.

Closes #23932 from dbtsai/ubuntu18.04.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-06 13:58:21 -08:00
Yanbo Liang 7857c6d633 [SPARK-27051][CORE] Bump Jackson version to 2.9.8
## What changes were proposed in this pull request?
Fasterxml Jackson versions before 2.9.8 are affected by multiple [CVEs](https://github.com/FasterXML/jackson-databind/issues/2186), so we need to bump the Jackson dependency to 2.9.8.

## How was this patch tested?
Existing tests and offline benchmark.
I have run ```SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.json.JSONBenchmark"``` to check that there is no performance degradation from this upgrade.

Closes #23965 from yanboliang/SPARK-27051.

Authored-by: Yanbo Liang <ybliang8@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-05 11:46:51 +09:00
LantaoJin e5c502c596 [SPARK-25865][CORE] Add GC information to ExecutorMetrics
## What changes were proposed in this pull request?

Memory usage alone, without GC information, is not enough to determine the proper memory settings. We also need GC metrics such as the frequency of major and minor GC. For example, consider two executors, each configured with 10GB of memory and each using close to 10GB: should we increase or decrease their configured memory? These metrics can help answer that. We can increase the configured memory for the first executor if it has very frequent major GC, and decrease it for the second if it has only occasional minor GC and no major GC.
GC metrics are only meaningful over the entire lifetime of an executor, not per stage.
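
For context, the JVM already exposes per-collector totals through JMX. Below is a minimal Scala sketch of aggregating them into minor/major buckets (an illustration of the underlying JVM API, not the exact `ExecutorMetrics` code; the collector-name set is an assumption and not exhaustive):

```scala
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

// Names of young-generation ("minor") collectors; illustrative, not exhaustive.
val youngGenCollectors = Set("Copy", "PS Scavenge", "ParNew", "G1 Young Generation")

val (minorBeans, majorBeans) = ManagementFactory.getGarbageCollectorMXBeans.asScala
  .partition(bean => youngGenCollectors.contains(bean.getName))

val minorGcCount = minorBeans.map(_.getCollectionCount).sum
val minorGcTime  = minorBeans.map(_.getCollectionTime).sum  // milliseconds
val majorGcCount = majorBeans.map(_.getCollectionCount).sum
val majorGcTime  = majorBeans.map(_.getCollectionTime).sum  // milliseconds
```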

## How was this patch tested?

Added unit tests.

Closes #22874 from LantaoJin/SPARK-25865.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2019-03-04 14:26:02 -06:00
Sean Owen d8754df2bf [SPARK-27029][BUILD] Update Thrift to 0.12.0
## What changes were proposed in this pull request?

Update Thrift to 0.12.0 to pick up bug and security fixes.
Changes: https://github.com/apache/thrift/blob/master/CHANGES.md
The important one is for https://issues.apache.org/jira/browse/THRIFT-4506

## How was this patch tested?

Existing tests. A quick local test suggests this works.

Closes #23935 from srowen/SPARK-27029.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-02 17:28:37 -08:00
Marcelo Vanzin d00eca75b3 [SPARK-26048][BUILD] Enable flume profile when creating 2.x releases.
Closes #23931 from vanzin/SPARK-26048.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-02 08:14:06 -08:00
Sean Owen 131b464d0c [SPARK-26986][ML][FOLLOWUP] Add JAXB reference impl to build for Java 9+
## What changes were proposed in this pull request?

Remove a few new JAXB dependencies that shouldn't be necessary now.
See https://github.com/apache/spark/pull/23890#issuecomment-468299922

## How was this patch tested?

Existing tests

Closes #23923 from srowen/SPARK-26986.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-01 11:23:40 -06:00
Sean Owen 9c283662c6 [SPARK-26986][ML] Add JAXB reference impl to build for Java 9+
## What changes were proposed in this pull request?

Add the reference JAXB impl for Java 9+ from Glassfish. Right now it's apparently only necessary in MLlib, but this can be expanded later.

## How was this patch tested?

Existing tests particularly PMML-related ones, which use JAXB.
This works on Java 11.

Closes #23890 from srowen/SPARK-26986.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-26 18:26:49 -06:00
Marcelo Vanzin afbff6446f Revert "[SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2"
This reverts commit a3192d966a.
2019-02-26 13:42:07 -08:00
Jungtaek Lim (HeartSaVioR) c5de804093 [MINOR][BUILD] Update all checkstyle dtd to use "https://checkstyle.org"
## What changes were proposed in this pull request?

The build below failed the Java checkstyle check, but instead of a style violation it reported FileNotFound for the DTD file.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102751/

It looks like the DTD link `http://www.puppycrawl.com/dtds/configuration_1_3.dtd` is dead.

This patch updates the DTD links to "https://checkstyle.org/dtds/", since the checkstyle repository has also updated its URL path.
https://github.com/checkstyle/checkstyle/issues/5601

## How was this patch tested?

Checked the new links.

Closes #23887 from HeartSaVioR/java-checkstyle-dtd-change-url.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-02-25 11:25:53 -08:00
Jiaxin Shan a3192d966a [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2
## What changes were proposed in this pull request?
Changed the `kubernetes-client` version to 4.1.2. The latest version fixes an error with exec credentials (used by AWS EKS) when talking to the Kubernetes API server. With this patch, users can now submit Spark jobs to an EKS API endpoint.

## How was this patch tested?
Unit tests and manual tests.

Closes #23814 from Jeffwan/update_k8s_sdk.

Authored-by: Jiaxin Shan <seedjeffwan@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-25 04:56:04 -06:00
Holden Karau 6b3c832dac [SPARK-26882] Check the Kubernetes integration tests scalastyle
## What changes were proposed in this pull request?

Add the kubernetes integration tests to the scalastyle profiles.

## How was this patch tested?

Ran `./dev/scalastyle` manually with a bad change.

## Follow on work

See SPARK-26898 to add scalastyle for k8s integration to the CI

Closes #23792 from holdenk/SPARK-26882-check-k8s-integration-tests-when-linting.

Authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-02-19 13:49:47 -08:00
cchung100m dc46fb77ba [SPARK-26822] Upgrade the deprecated module 'optparse'
Follows the [official document](https://docs.python.org/2/library/argparse.html#upgrading-optparse-code) to upgrade the deprecated 'optparse' module to 'argparse'.

## What changes were proposed in this pull request?

This PR proposes to replace the 'optparse' module with the 'argparse' module.

## How was this patch tested?

Following the [previous testing](7e3eb3cd20), I tested manually, and negative tests were also done. My [test results](https://gist.github.com/cchung100m/1661e7df6e8b66940a6e52a20861f61d)

Closes #23730 from cchung100m/solve_deprecated_module_optparse.

Authored-by: cchung100m <cchung100m@cs.ccu.edu.tw>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-10 00:36:22 -06:00
Ryan Blue f72d217788
[SPARK-26677][BUILD] Update Parquet to 1.10.1 with notEq pushdown fix.
## What changes were proposed in this pull request?

Update to Parquet Java 1.10.1.

## How was this patch tested?

Added a test from HyukjinKwon that validates the notEq case from SPARK-26677.

Closes #23704 from rdblue/SPARK-26677-fix-noteq-parquet-bug.

Lead-authored-by: Ryan Blue <blue@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Ryan Blue <rdblue@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-02-02 09:17:52 -08:00
Hyukjin Kwon cdd694c52b [SPARK-7721][INFRA] Run and generate test coverage report from Python via Jenkins
## What changes were proposed in this pull request?

### Background

For the current status, the test script that generates coverage information has been merged into Spark: https://github.com/apache/spark/pull/20204

So, we can generate the coverage report and site by, for example:

```
run-tests-with-coverage --python-executables=python3 --modules=pyspark-sql
```

like `run-tests` script in `./python`.

### Proposed change

The next step is to have Jenkins automatically host this coverage report via `github.io` (see https://spark-test.github.io/pyspark-coverage-site/).

This uses my testing account for Spark, spark-test, which was shared with Felix and Shivaram a long time ago for testing purposes, including AppVeyor.

To cut this short, this PR targets running the coverage in [spark-master-test-sbt-hadoop-2.7](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/).

In that specific job, it will clone the site and rebase the up-to-date PySpark test coverage from the latest commit, for instance as below:

```bash
# Clone PySpark coverage site.
git clone https://github.com/spark-test/pyspark-coverage-site.git

# Remove existing HTMLs.
rm -fr pyspark-coverage-site/*

# Copy generated coverage HTMLs.
cp -r .../python/test_coverage/htmlcov/* pyspark-coverage-site/

# Check out to a temporary branch.
git symbolic-ref HEAD refs/heads/latest_branch

# Add all the files.
git add -A

# Commit current HTMLs.
git commit -am "Coverage report at latest commit in Apache Spark"

# Delete the old branch.
git branch -D gh-pages

# Rename the temporary branch to master.
git branch -m gh-pages

# Finally, force update to our repository.
git push -f origin gh-pages
```

So, a single up-to-date coverage report can be shown on the `github.io` page. The commands above were manually tested.

### TODOs

- [x] Write a draft HyukjinKwon
- [x] `pip install coverage` to all python implementations (pypy, python2, python3) in Jenkins workers  - shaneknapp
- [x] Set hidden `SPARK_TEST_KEY` for spark-test's password in Jenkins via Jenkins's feature
 This should be set in both PR builder and `spark-master-test-sbt-hadoop-2.7` so that later other PRs can test and fix the bugs - shaneknapp
- [x] Set an environment variable that indicates `spark-master-test-sbt-hadoop-2.7` so that that specific build can report and update the coverage site - shaneknapp
- [x] Make PR builder's test passed HyukjinKwon
- [x] Fix flaky test related with coverage HyukjinKwon
  -  6 consecutive passes out of 7 runs

This PR will be co-authored by me and shaneknapp.

## How was this patch tested?

It will be tested via Jenkins.

Closes #23117 from HyukjinKwon/SPARK-7721.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-02-01 10:18:08 +08:00
Bryan Cutler 16990f9299 [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0
## What changes were proposed in this pull request?

Upgrade Apache Arrow to version 0.12.0. This includes the Java artifacts and fixes to enable usage with pyarrow 0.12.0

Version 0.12.0 includes the following selected fixes/improvements relevant to Spark users:

* Safe cast fails from numpy float64 array with nans to integer, ARROW-4258
* Java, Reduce heap usage for variable width vectors, ARROW-4147
* Binary identity cast not implemented, ARROW-4101
* pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
* conversion to date object no longer needed, ARROW-3910
* Error reading IPC file with no record batches, ARROW-3894
* Signed to unsigned integer cast yields incorrect results when type sizes are the same, ARROW-3790
* from_pandas gives incorrect results when converting floating point to bool, ARROW-3428
* Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / libboost issue), ARROW-3048
* Java update to official Flatbuffers version 1.9.0, ARROW-3175

complete list [here](https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0)

PySpark requires the following fixes to work with PyArrow 0.12.0

* Encrypted pyspark worker fails due to ChunkedStream missing closed property
* pyarrow now converts dates as objects by default, which causes error because type is assumed datetime64
* ArrowTests fails due to difference in raised error message
* pyarrow.open_stream deprecated
* tests fail because groupby adds index column with duplicate name

## How was this patch tested?

Ran unit tests with pyarrow versions 0.8.0, 0.10.0, 0.11.1, 0.12.0

Closes #23657 from BryanCutler/arrow-upgrade-012.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-29 14:18:45 +08:00
Sean Owen c2d0d700b5 [SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis
## What changes were proposed in this pull request?

Misc code cleanup from lgtm.com analysis. See comments below for details.

## How was this patch tested?

Existing tests.

Closes #23571 from srowen/SPARK-26640.

Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-17 19:40:39 -06:00
wright 4915cb3adf [MINOR][BUILD] ensure call to translate_component has correct number of arguments
## What changes were proposed in this pull request?

The call to `translate_component` only supplied 2 out of the 3 required arguments. I added a default empty list for the missing argument to avoid a run-time error.

I work for Semmle, and noticed the bug with our LGTM code analyzer:
0655f1624f/files/dev/create-release/releaseutils.py?sort=name&dir=ASC&mode=heatmap#x1434915b6576fb40:1

## How was this patch tested?

I checked that `./dev/run-tests` passes OK.

Closes #23567 from ipwright/wrong-number-of-arguments-fix.

Authored-by: wright <wright@semmle.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-16 21:00:58 -06:00
Takeshi Yamamuro abc937b247 [MINOR][BUILD] Remove binary license/notice files in a source release for branch-2.4+ only
## What changes were proposed in this pull request?
To skip the steps that remove binary license/notice files in a source release for branch-2.3 (these files only exist in master/branch-2.4 now), this PR checks the Spark release version in `dev/create-release/release-build.sh`.

## How was this patch tested?
Manually checked.

Closes #23538 from maropu/FixReleaseScript.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-14 19:17:39 -06:00
Dongjoon Hyun 6f35ede31c
[SPARK-26554][BUILD][FOLLOWUP] Use GitHub instead of GitBox to check HEADER
## What changes were proposed in this pull request?

This PR uses the GitHub repository instead of GitBox because the GitHub repo returns the HTTP status header correctly.

## How was this patch tested?

Manual.

```
$ ./do-release-docker.sh -d /tmp/test -n
Branch [branch-2.4]:
Current branch version is 2.4.1-SNAPSHOT.
Release [2.4.1]:
RC # [1]:
This is a dry run. Please confirm the ref that will be built for testing.
Ref [v2.4.1-rc1]:
```

Closes #23482 from dongjoon-hyun/SPARK-26554-2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-07 17:54:05 -08:00
Dongjoon Hyun 468d25ec74
[MINOR][BUILD] Fix script name in release-tag.sh usage message
## What changes were proposed in this pull request?

This PR fixes the old script name in `release-tag.sh`.

    $ ./release-tag.sh --help | head -n1
    usage: tag-release.sh

## How was this patch tested?

Manual.

    $ ./release-tag.sh --help | head -n1
    usage: release-tag.sh

Closes #23477 from dongjoon-hyun/SPARK-RELEASE-TAG.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-06 22:45:18 -08:00
Dongjoon Hyun fe039faddf
[SPARK-26554][BUILD] Update release-util.sh to avoid GitBox fake 200 headers
## What changes were proposed in this pull request?

Unlike the previous Apache Git repository, the new GitBox repository returns a fake HTTP 200 header instead of a `404 Not Found` header. This breaks the release scripts. This PR aims to fix them by handling the HTML body message instead of the fake HTTP headers. This is a release blocker.

```bash
$ curl -s --head --fail "https://gitbox.apache.org/repos/asf?p=spark.git;a=commit;h=v3.0.0"
HTTP/1.1 200 OK
Date: Sun, 06 Jan 2019 22:42:39 GMT
Server: Apache/2.4.18 (Ubuntu)
Vary: Accept-Encoding
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: POST, GET, OPTIONS
Access-Control-Allow-Headers: X-PINGOTHER
Access-Control-Max-Age: 1728000
Content-Type: text/html; charset=utf-8
```

**BEFORE**
```bash
$ ./do-release-docker.sh -d /tmp/test -n
Branch [branch-2.4]:
Current branch version is 2.4.1-SNAPSHOT.
Release [2.4.1]:
RC # [1]:
v2.4.1-rc1 already exists. Continue anyway [y/n]?
```

**AFTER**
```bash
$ ./do-release-docker.sh -d /tmp/test -n
Branch [branch-2.4]:
Current branch version is 2.4.1-SNAPSHOT.
Release [2.4.1]:
RC # [1]:
This is a dry run. Please confirm the ref that will be built for testing.
Ref [v2.4.1-rc1]:
```

## How was this patch tested?

Manual.

Closes #23476 from dongjoon-hyun/SPARK-26554.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-06 19:59:31 -08:00
Dongjoon Hyun 5969b8a2ed
[SPARK-26541][BUILD] Add -Pdocker-integration-tests to dev/scalastyle
## What changes were proposed in this pull request?

This PR makes `scalastyle` additionally check the `docker-integration-tests` module and fixes one error.

## How was this patch tested?

Pass the Jenkins with the updated Scalastyle.
```
========================================================================
Running Scala style checks
========================================================================
Scalastyle checks passed.
```

Closes #23459 from dongjoon-hyun/SPARK-26541.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-05 00:55:17 -08:00
shane knapp bccb8602d7
[SPARK-26537][BUILD] change git-wip-us to gitbox
## What changes were proposed in this pull request?

Due to Apache recently moving from git-wip-us.apache.org to gitbox.apache.org, we need to update the packaging scripts to point to the new repo location.

This will also need to be backported to 2.4, 2.3, 2.1, 2.0 and 1.6.

## How was this patch tested?

The build system will test this.

Closes #23454 from shaneknapp/update-apache-repo.

Authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-04 18:27:26 -08:00
Dongjoon Hyun 81addaa6b7
[SPARK-26427][BUILD] Upgrade Apache ORC to 1.5.4
## What changes were proposed in this pull request?

This PR aims to update the Apache ORC dependency to the latest version, 1.5.4, released on Dec. 20. ([Release Notes](https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320&version=12344187))
```
[ORC-237] OrcFile.mergeFiles Specified block size is less than configured minimum value
[ORC-409] Changes for extending MemoryManagerImpl
[ORC-410] Fix a locale-dependent test in TestCsvReader
[ORC-416] Avoid opening data reader when there is no stripe
[ORC-417] Use dynamic Apache Maven mirror link
[ORC-419] Ensure to call `close` at RecordReaderImpl constructor exception
[ORC-432] openjdk 8 has a bug that prevents surefire from working
[ORC-435] Ability to read stripes that are greater than 2GB
[ORC-437] Make acid schema checks case insensitive
[ORC-411] Update build to work with Java 10.
[ORC-418] Fix broken docker build script
```

## How was this patch tested?

Build and pass Jenkins.

Closes #23364 from dongjoon-hyun/SPARK-26427.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2018-12-22 00:41:21 -08:00
Reza Safi 90c77ea313 [SPARK-24958][CORE] Add memory from procfs to executor metrics.
This adds the entire memory used by Spark's executor (as measured by procfs) to the executor metrics. The memory usage is collected from the entire process tree under the executor. The metrics are subdivided into memory used by Java, by Python, and by other processes, to aid users in diagnosing the source of high memory usage.
The additional metrics are sent to the driver in heartbeats, using the mechanism introduced by SPARK-23429.  This also slightly extends that approach to allow one ExecutorMetricType to collect multiple metrics.
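
For illustration, here is a simplified Scala sketch (an assumption about the approach, not the actual procfs-reading code in this PR) of obtaining one process's resident set size from procfs; the real metrics sum such per-process figures over the executor's whole process tree:

```scala
import scala.io.Source

// Read the resident set size (RSS) of a single process from /proc/<pid>/statm.
// The second field is the number of resident pages; the 4096-byte page size is
// an assumed default, the real value should come from the OS.
def rssBytes(pid: Int, pageSizeBytes: Long = 4096L): Long = {
  val statm = Source.fromFile(s"/proc/$pid/statm")
  try {
    statm.mkString.trim.split("\\s+")(1).toLong * pageSizeBytes
  } finally {
    statm.close()
  }
}

// Child processes can likewise be discovered from procfs and their RSS bucketed
// into the java / python / other totals by inspecting each /proc/<pid>/cmdline.
```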

Added unit tests and also tested on a live cluster.

Closes #22612 from rezasafi/ptreememory2.

Authored-by: Reza Safi <rezasafi@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2018-12-10 11:14:11 -06:00
Sean Owen 2ea9792fde [SPARK-26266][BUILD] Update to Scala 2.12.8
## What changes were proposed in this pull request?

Update to Scala 2.12.8

## How was this patch tested?

Existing tests.

Closes #23218 from srowen/SPARK-26266.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-12-08 05:59:53 -06:00
Dongjoon Hyun 4772265203
[SPARK-26298][BUILD] Upgrade Janino to 3.0.11
## What changes were proposed in this pull request?

This PR aims to upgrade the Janino compiler to the latest version, 3.0.11. The following are the changes from the [release note](http://janino-compiler.github.io/janino/changelog.html).

- Script with many "helper" variables.
- Java 9+ compatibility
- Compilation Error Messages Generated by JDK.
- Added experimental support for the "StackMapFrame" attribute; not active yet.
- Make Unparser more flexible.
- Fixed NPEs in various "toString()" methods.
- Optimize static method invocation with rvalue target expression.
- Added all missing "ClassFile.getConstant*Info()" methods, removing the necessity for many type casts.

## How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #23250 from dongjoon-hyun/SPARK-26298.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2018-12-06 20:50:57 -08:00
cody koeninger 5e5b9f2ee0 [SPARK-26177] Config change followup to [] Automated formatting for Scala code
Let's keep this open for a while to see if other configuration tweaks are suggested

## What changes were proposed in this pull request?

Formatting configuration changes following up
https://github.com/apache/spark/pull/23148

## How was this patch tested?

./dev/scalafmt

Closes #23182 from koeninger/scalafmt-config.

Authored-by: cody koeninger <cody@koeninger.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-12-03 10:03:51 -06:00
Kazuaki Ishizaki 1abfbda7eb [SPARK-26212][BUILD][TEST-MAVEN] Upgrade maven version to 3.6.0
## What changes were proposed in this pull request?

This PR updates maven version from 3.5.4 to 3.6.0. The release note of the 3.6.0 is [here](https://maven.apache.org/docs/3.6.0/release-notes.html).

From [the release note of 3.6.0](https://maven.apache.org/docs/3.6.0/release-notes.html), the following are notable changes:
1. There had been issues related to project discovery time, which had increased in a previous version and affected some of our users.
1. The output in the reactor summary has been improved.
1. There was an issue related to classpath ordering.

## How was this patch tested?

Existing tests

Closes #23177 from kiszk/SPARK-26212.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-12-01 07:06:18 -06:00
cody koeninger 9a09e91a3e [SPARK-26177] Automated formatting for Scala code
## What changes were proposed in this pull request?

Add a maven plugin and wrapper script to use scalafmt to format files that differ from git master.

The intention is for contributors to be able to use this to automate fixing code style, not to include it in the build pipeline yet.

If this PR is accepted, I'd make a different PR to update the code style section of https://spark.apache.org/contributing.html to mention the script

## How was this patch tested?

Manually tested by modifying a few files and running ./dev/scalafmt then checking that ./dev/scalastyle still passed.

Closes #23148 from koeninger/scalafmt.

Authored-by: cody koeninger <cody@koeninger.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-11-29 08:54:31 -06:00
hyukjinkwon 41d5aaec84 [SPARK-26148][PYTHON][TESTS] Increases default parallelism in PySpark tests to speed up
## What changes were proposed in this pull request?

This PR proposes to increase the parallelism in PySpark tests from 4 to 8 to speed them up.

It decreases the elapsed time from

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99163/consoleFull
Tests passed in 1770 seconds

to

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99186/testReport/
Tests passed in 1027 seconds

## How was this patch tested?

Jenkins tests

Closes #23111 from HyukjinKwon/parallelism.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-11-26 00:26:24 +09:00
Takanobu Asanuma 15c0384977
[SPARK-26134][CORE] Upgrading Hadoop to 2.7.4 to fix java.version problem
## What changes were proposed in this pull request?

When I ran spark-shell on JDK 11+28 (2018-09-25), it failed with the error below.

```
Exception in thread "main" java.lang.ExceptionInInitializerError
	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
	at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
	at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
	at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
	at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
	at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
	at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2427)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2427)
	at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
	at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
	at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:359)
	at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$9(SparkSubmit.scala:367)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:367)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
	at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
	at java.base/java.lang.String.substring(String.java:1874)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
```
This is a Hadoop issue: it fails to parse some java.version values. It has been fixed since Hadoop 2.7.4 (see [HADOOP-14586](https://issues.apache.org/jira/browse/HADOOP-14586)).

Note that Hadoop 2.7.5 or higher has another problem with Spark ([SPARK-25330](https://issues.apache.org/jira/browse/SPARK-25330)), so upgrading to 2.7.4 is fine for now.
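
For reference, a tiny Scala sketch of the failure mode (an illustration of HADOOP-14586, not Hadoop's or Spark's actual code): pre-2.7.4 Hadoop effectively took the first three characters of `java.version`, which breaks on JDK 11's short version string.

```scala
// Works on Java 8, where java.version looks like "1.8.0_181":
println("1.8.0_181".substring(0, 3))  // "1.8"

// On JDK 11 the property is just "11", so the same call throws
// StringIndexOutOfBoundsException: begin 0, end 3, length 2
// "11".substring(0, 3)

// A length-safe parse of the major version (illustrative only, not the Hadoop fix):
def majorJavaVersion(v: String): Int = v.split("[._]") match {
  case Array("1", minor, _*) => minor.toInt  // "1.8.0_181" -> 8
  case Array(major, _*)      => major.toInt  // "11"        -> 11
}
println(majorJavaVersion("1.8.0_181"))  // 8
println(majorJavaVersion("11"))         // 11
```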

## How was this patch tested?
Existing tests.

Closes #23101 from tasanuma/SPARK-26134.

Authored-by: Takanobu Asanuma <tasanuma@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2018-11-21 23:09:57 -08:00
shane knapp 42c48387c0 [BUILD] refactor dev/lint-python in to something readable
## What changes were proposed in this pull request?

`dev/lint-python` is a mess of nearly unreadable bash. I would like to fix that as best I can.

## How was this patch tested?

The build system will test this.

Closes #22994 from shaneknapp/lint-python-refactor.

Authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: shane knapp <incomplete@gmail.com>
2018-11-20 12:38:40 -08:00
Bryan Cutler 034ae305c3 [SPARK-26033][PYTHON][TESTS] Break large ml/tests.py file into smaller files
## What changes were proposed in this pull request?

This PR breaks down the large ml/tests.py file that contains all Python ML unit tests into several smaller test files to be easier to read and maintain.

The tests are broken down as follows:
```
pyspark
├── __init__.py
...
├── ml
│   ├── __init__.py
...
│   ├── tests
│   │   ├── __init__.py
│   │   ├── test_algorithms.py
│   │   ├── test_base.py
│   │   ├── test_evaluation.py
│   │   ├── test_feature.py
│   │   ├── test_image.py
│   │   ├── test_linalg.py
│   │   ├── test_param.py
│   │   ├── test_persistence.py
│   │   ├── test_pipeline.py
│   │   ├── test_stat.py
│   │   ├── test_training_summary.py
│   │   ├── test_tuning.py
│   │   └── test_wrapper.py
...
├── testing
...
│   ├── mlutils.py
...
```

## How was this patch tested?

Ran tests manually by module to ensure test count was the same, and ran `python/run-tests --modules=pyspark-ml` to verify all passing with Python 2.7 and Python 3.6.

Closes #23063 from BryanCutler/python-test-breakup-ml-SPARK-26033.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-11-18 16:02:15 +08:00
Marcelo Vanzin d2792046a1 [SPARK-26095][BUILD] Disable parallelization in make-distribution.sh.
It makes the build slower, but at least it doesn't hang. It seems that maven-shade-plugin has some issue with parallelization.

Closes #23061 from vanzin/SPARK-26095.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2018-11-16 15:57:38 -08:00
Bryan Cutler a2fc48c28c [SPARK-26034][PYTHON][TESTS] Break large mllib/tests.py file into smaller files
## What changes were proposed in this pull request?

This PR breaks down the large mllib/tests.py file that contains all Python MLlib unit tests into several smaller test files to be easier to read and maintain.

The tests are broken down as follows:
```
pyspark
├── __init__.py
...
├── mllib
│   ├── __init__.py
...
│   ├── tests
│   │   ├── __init__.py
│   │   ├── test_algorithms.py
│   │   ├── test_feature.py
│   │   ├── test_linalg.py
│   │   ├── test_stat.py
│   │   ├── test_streaming_algorithms.py
│   │   └── test_util.py
...
├── testing
...
│   ├── mllibutils.py
...
```

## How was this patch tested?

Ran tests manually by module to ensure test count was the same, and ran `python/run-tests --modules=pyspark-mllib` to verify all passing with Python 2.7 and Python 3.6. Also installed scipy to include optional tests in test_linalg.

Closes #23056 from BryanCutler/python-test-breakup-mllib-SPARK-26034.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-11-17 00:12:17 +08:00
hyukjinkwon 3649fe599f [SPARK-26035][PYTHON] Break large streaming/tests.py files into smaller files
## What changes were proposed in this pull request?

This PR continues to break down a large file into smaller files. See https://github.com/apache/spark/pull/23021. It aims to follow https://github.com/numpy/numpy/tree/master/numpy.

Basically this PR proposes to break down `pyspark/streaming/tests.py` into ...:

```
pyspark
├── __init__.py
...
├── streaming
│   ├── __init__.py
...
│   ├── tests
│   │   ├── __init__.py
│   │   ├── test_context.py
│   │   ├── test_dstream.py
│   │   ├── test_kinesis.py
│   │   └── test_listener.py
...
├── testing
...
│   ├── streamingutils.py
...
```

## How was this patch tested?

Existing tests should cover.

`cd python` and `./run-tests-with-coverage`. Manually checked they are actually being run.

Each test can (unofficially) be run via:

```bash
SPARK_TESTING=1 ./bin/pyspark pyspark.tests.test_context
```

Note that if you're using Mac and Python 3, you might have to set `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`.

Closes #23034 from HyukjinKwon/SPARK-26035.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-11-16 07:58:09 +08:00
hyukjinkwon 03306a6df3 [SPARK-26036][PYTHON] Break large tests.py files into smaller files
## What changes were proposed in this pull request?

This PR continues to break down a large file into smaller files. See https://github.com/apache/spark/pull/23021. It aims to follow https://github.com/numpy/numpy/tree/master/numpy.

Basically this PR proposes to break down `pyspark/tests.py` into ...:

```
pyspark
...
├── testing
...
│   └── utils.py
├── tests
│   ├── __init__.py
│   ├── test_appsubmit.py
│   ├── test_broadcast.py
│   ├── test_conf.py
│   ├── test_context.py
│   ├── test_daemon.py
│   ├── test_join.py
│   ├── test_profiler.py
│   ├── test_rdd.py
│   ├── test_readwrite.py
│   ├── test_serializers.py
│   ├── test_shuffle.py
│   ├── test_taskcontext.py
│   ├── test_util.py
│   └── test_worker.py
...
```

## How was this patch tested?

Existing tests should cover.

`cd python` and `./run-tests-with-coverage`. Manually checked they are actually being run.

Each test can (unofficially) be run via:

```bash
SPARK_TESTING=1 ./bin/pyspark pyspark.tests.test_context
```

Note that if you're using Mac and Python 3, you might have to set `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`.

Closes #23033 from HyukjinKwon/SPARK-26036.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-11-15 12:30:52 +08:00
DB Tsai ad853c5678
[SPARK-25956] Make Scala 2.12 as default Scala version in Spark 3.0
## What changes were proposed in this pull request?

This PR makes Scala 2.12 Spark's default Scala version, with Scala 2.11 as the alternative version. This implies that Scala 2.12 will be used by our CI builds, including pull request builds.

We'll update Jenkins to include a new compile-only job for Scala 2.11 to ensure the code can still be compiled with Scala 2.11.

## How was this patch tested?

existing tests

Closes #22967 from dbtsai/scala2.12.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2018-11-14 16:22:23 -08:00
Yuanjian Li 2977e2312d [SPARK-25986][BUILD] Add rules to ban throw Errors in application code
## What changes were proposed in this pull request?

Add Scala and Java lint check rules to ban the usage of `throw new XxxError` and fix up all existing instances, following https://github.com/apache/spark/pull/22989#issuecomment-437939830. See more details in https://github.com/apache/spark/pull/22969.
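
As an illustration of the kind of pattern the new rules flag (an assumed example, not code from the PR), application code should throw exception types rather than `java.lang.Error` subclasses:

```scala
def allocateBuffer(size: Long): Array[Byte] = {
  if (size > Int.MaxValue) {
    // Flagged by the new lint rule: throwing an Error from application code.
    // throw new OutOfMemoryError(s"requested buffer of $size bytes is too large")

    // Preferred: throw an exception instead.
    throw new IllegalArgumentException(s"requested buffer of $size bytes is too large")
  }
  new Array[Byte](size.toInt)
}
```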

## How was this patch tested?

Local test with lint-scala and lint-java.

Closes #22989 from xuanyuanking/SPARK-25986.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-11-14 13:05:18 -08:00
Sean Owen 722369ee55 [SPARK-24421][BUILD][CORE] Accessing sun.misc.Cleaner in JDK11
…. Other related changes to get JDK 11 working, to test

## What changes were proposed in this pull request?

- Access `sun.misc.Cleaner` (Java 8) and `jdk.internal.ref.Cleaner` (JDK 9+) by reflection (note: the latter only works if illegal reflective access is allowed)
- Access `sun.misc.Unsafe.invokeCleaner` in Java 9+ instead of `sun.misc.Cleaner` (Java 8)

In order to test anything on JDK 11, I also fixed a few small things, which I include here:

- Fix minor JDK 11 compile issues
- Update scala plugin, Jetty for JDK 11, to facilitate tests too

This doesn't mean JDK 11 tests all pass now, but lots do. Note also that the JDK 9+ solution for the Cleaner has a big caveat.
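
A rough Scala sketch of the reflective approach described above (an approximation, not Spark's exact code; `theUnsafe`, `invokeCleaner`, `cleaner()` and `clean()` are JDK-provided names):

```scala
import java.nio.ByteBuffer

// Free a direct ByteBuffer's off-heap memory without referencing sun.misc classes
// at compile time. On Java 9+ this requires illegal reflective access to be allowed.
def freeDirectBuffer(buffer: ByteBuffer): Unit = {
  val isJava9Plus = !System.getProperty("java.version").startsWith("1.")
  if (isJava9Plus) {
    // Java 9+: sun.misc.Unsafe.invokeCleaner(ByteBuffer)
    val unsafeClass = Class.forName("sun.misc.Unsafe")
    val theUnsafe = unsafeClass.getDeclaredField("theUnsafe")
    theUnsafe.setAccessible(true)
    val invokeCleaner = unsafeClass.getMethod("invokeCleaner", classOf[ByteBuffer])
    invokeCleaner.invoke(theUnsafe.get(null), buffer)
  } else {
    // Java 8: DirectByteBuffer.cleaner().clean() via sun.misc.Cleaner
    val cleanerMethod = buffer.getClass.getMethod("cleaner")
    cleanerMethod.setAccessible(true)
    val cleaner = cleanerMethod.invoke(buffer)
    cleaner.getClass.getMethod("clean").invoke(cleaner)
  }
}
```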

## How was this patch tested?

Existing tests. Manually tested JDK 11 build and tests, and tests covering this change appear to pass. All Java 8 tests should still pass, but this change alone does not achieve full JDK 11 compatibility.

Closes #22993 from srowen/SPARK-24421.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-11-14 12:52:54 -08:00
hyukjinkwon a7a331df6e [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files
## What changes were proposed in this pull request?

This is the first official attempt to break up the huge single `tests.py` file - I did it locally a few times before and gave up for various reasons. Currently it really makes the unittests super hard to read and difficult to check. It even bothers me to scroll through the big file. It's one single 7000-line file!

This is not only a readability issue. Since one big test file takes most of the test time, the tests don't run fully in parallel - although there is a cost to starting and stopping the context.

We could pick one example and follow it. Based on my investigation, the proposed style is close to the NumPy structure and looks easy to follow. Please see https://github.com/numpy/numpy/tree/master/numpy.

Basically this PR proposes to break down `pyspark/sql/tests.py` into ...:

```bash
pyspark
...
├── sql
...
│   ├── tests  # Includes all tests broken down from 'pyspark/sql/tests.py'
│   │   │      # Each matchs to module in 'pyspark/sql'. Additionally, some logical group can
│   │   │      # be added. For instance, 'test_arrow.py', 'test_datasources.py' ...
│   │   ├── __init__.py
│   │   ├── test_appsubmit.py
│   │   ├── test_arrow.py
│   │   ├── test_catalog.py
│   │   ├── test_column.py
│   │   ├── test_conf.py
│   │   ├── test_context.py
│   │   ├── test_dataframe.py
│   │   ├── test_datasources.py
│   │   ├── test_functions.py
│   │   ├── test_group.py
│   │   ├── test_pandas_udf.py
│   │   ├── test_pandas_udf_grouped_agg.py
│   │   ├── test_pandas_udf_grouped_map.py
│   │   ├── test_pandas_udf_scalar.py
│   │   ├── test_pandas_udf_window.py
│   │   ├── test_readwriter.py
│   │   ├── test_serde.py
│   │   ├── test_session.py
│   │   ├── test_streaming.py
│   │   ├── test_types.py
│   │   ├── test_udf.py
│   │   └── test_utils.py
...
├── testing  # Includes testing utils that can be used in unittests.
│   ├── __init__.py
│   └── sqlutils.py
...
```

## How was this patch tested?

Existing tests should cover.

`cd python` and `./run-tests-with-coverage`. Manually checked they are actually being run.

Each test can (unofficially) be run via:

```
SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests.test_pandas_udf_scalar
```

Note that if you're using Mac and Python 3, you might have to set `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`.

Closes #23021 from HyukjinKwon/SPARK-25344.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-11-14 14:51:11 +08:00
hyukjinkwon f9ff75653f [SPARK-26013][R][BUILD] Upgrade R tools version from 3.4.0 to 3.5.1 in AppVeyor build
## What changes were proposed in this pull request?

R tools 3.5.1 was released a few months ago. Spark currently uses 3.4.0. We had better upgrade it in AppVeyor.

## How was this patch tested?

AppVeyor builds.

Closes #23011 from HyukjinKwon/SPARK-26013.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-11-13 01:21:03 +08:00
gatorsmile 0ba9715c7d [SPARK-26005][SQL] Upgrade ANTLR from 4.7 to 4.7.1
## What changes were proposed in this pull request?
Based on the release description of ANTLR 4.7.1 (https://github.com/antlr/antlr4/releases), let us upgrade our parser to 4.7.1.

## How was this patch tested?
N/A

Closes #23005 from gatorsmile/upgradeAntlr4.7.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2018-11-11 23:21:47 -08:00
hyukjinkwon a8e1c9815f [SPARK-25962][BUILD][PYTHON] Specify minimum versions for both pydocstyle and flake8 in 'lint-python' script
## What changes were proposed in this pull request?

This PR explicitly specifies `flake8` and `pydocstyle` versions.

- It checks flake8 binary executable
- flake8 version check >= 3.5.0
- pydocstyle >= 3.0.0 (previously it was == 3.0.0)

## How was this patch tested?

Manually tested.

Closes #22963 from HyukjinKwon/SPARK-25962.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-11-08 12:26:21 +08:00
Wenchen Fan a241a150d5 [MINOR] update known_translations
## What changes were proposed in this pull request?

Update known_translations after running `translate-contributors.py` during the 2.4.0 release.

## How was this patch tested?

N/A

Closes #22949 from cloud-fan/contributors.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2018-11-06 14:52:02 -08:00
DB Tsai 3ed91c9b89
[SPARK-25946][BUILD] Upgrade ASM to 7.x to support JDK11
## What changes were proposed in this pull request?

Upgrade ASM to 7.x to support JDK11

## How was this patch tested?

Existing tests.

Closes #22953 from dbtsai/asm7.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2018-11-06 05:38:59 +00:00
hyukjinkwon 486acda8c5
[SPARK-25944][R][BUILD] AppVeyor change to latest R version (3.5.1)
## What changes were proposed in this pull request?

R 3.5.1 was released on 2018-07-02. This PR changes the R version from 3.4.1 to 3.5.1.

## How was this patch tested?

AppVeyor

Closes #22948 from HyukjinKwon/SPARK-25944.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2018-11-05 14:26:22 -08:00
Dongjoon Hyun e4cb42ad89
[SPARK-25891][PYTHON] Upgrade to Py4J 0.10.8.1
## What changes were proposed in this pull request?

Py4J 0.10.8.1 was released on October 21st and is the first release of Py4J to officially support Python 3.7. We had better adopt it to get that official support. Also, there are some patches related to garbage collection.

https://www.py4j.org/changelog.html#py4j-0-10-8-and-py4j-0-10-8-1

## How was this patch tested?

Pass the Jenkins.

Closes #22901 from dongjoon-hyun/SPARK-25891.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2018-10-31 09:55:03 -07:00
Wenchen Fan 327456b482 [BUILD][MINOR] release script should not interrupt by svn
## What changes were proposed in this pull request?

When running the release script, you will be interrupted unexpectedly
```
ATTENTION!  Your password for authentication realm:

   <https://dist.apache.org:443> ASF Committers

can only be stored to disk unencrypted!  You are advised to configure
your system so that Subversion can store passwords encrypted, if
possible.  See the documentation for details.

You can avoid future appearances of this warning by setting the value
of the 'store-plaintext-passwords' option to either 'yes' or 'no' in
'/home/spark-rm/.subversion/servers'.
-----------------------------------------------------------------------
Store password unencrypted (yes/no)?
```

We can avoid it by adding `--no-auth-cache` when running svn commands.

## How was this patch tested?

manually verified with 2.4.0 RC5

Closes #22885 from cloud-fan/svn.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-10-30 21:17:40 +08:00