ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Hyukjin Kwon	310cd8eef1	[SPARK-36092][INFRA][BUILD][PYTHON] Migrate to GitHub Actions with Codecov from Jenkins This PR proposes to migrate the coverage report from Jenkins to GitHub Actions by setting a daily cron job. For some background, currently PySpark code coverage is being reported in this specific Jenkins job: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/ Because of the security issue between [Codecov service](https://app.codecov.io/gh/) and Jenkins machines, we had to work around by manually hosting a coverage site via GitHub Pages, see also https://spark-test.github.io/pyspark-coverage-site/ by spark-test account (which is shared with only a subset of PMC members). Since we now run the build via GitHub Actions, we can leverage [Codecov plugin](https://github.com/codecov/codecov-action), and remove the workaround we used. Virtually no. Coverage site (UI) might change but the information it holds should be virtually the same. I manually tested: - Scheduled run: https://github.com/HyukjinKwon/spark/actions/runs/1082261484 - Coverage report: `73f0291a7d/python/pyspark` - Run against a PR: https://github.com/HyukjinKwon/spark/actions/runs/1082367175 Closes #33591 from HyukjinKwon/SPARK-36092. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `c0d1860f25`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-01 21:38:39 +09:00
itholic	a9c5b1a5c8	[SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI ### What changes were proposed in this pull request? This PR proposes adding the Python packages `mlflow` and `sklearn` to enable the MLflow test in pandas API on Spark. ### Why are the changes needed? To enable the MLflow test in pandas API on Spark. ### Does this PR introduce _any_ user-facing change? No, it's test-only. ### How was this patch tested? Manually tested locally with `python/run-tests --testnames pyspark.pandas.mlflow`. Closes #33567 from itholic/SPARK-36254. Lead-authored-by: itholic <haejoon.lee@databricks.com> Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `abce61f3fd`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-07-30 00:04:59 -07:00
Yuanjian Li	e8462a584c	[SPARK-36347][SS] Upgrade the RocksDB version to 6.20.3 ### What changes were proposed in this pull request? As discussed in https://github.com/apache/spark/pull/32928/files#r654049392, after confirming the compatibility, we can use a newer RocksDB version for the state store implementation. ### Why are the changes needed? For further ARM support and to leverage the bug fixes in the newer version. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #33578 from xuanyuanking/SPARK-36347. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit `4cd5fa96d8`) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-07-29 11:09:10 -07:00
William Hyun	dfa5c4dadc	[SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark/SparkR/Docker GHA job This PR aims to skip MiMa in PySpark/SparkR/Docker GHA job. This will save GHA resource because MiMa is irrelevant to Python. No. Pass the GHA. Closes #33532 from williamhyun/mima. Lead-authored-by: William Hyun <william@apache.org> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `674202e7b6`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-27 16:50:10 +09:00
Sean Owen	c7d246ba4e	[SPARK-35310][MLLIB] Update to breeze 1.2 Update to the latest breeze 1.2 Minor bug fixes No. Existing tests Closes #33449 from srowen/SPARK-35310. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-07-24 08:20:25 -05:00
Takuya UESHIN	c1434b1928	[SPARK-36279][INFRA][PYTHON] Fix lint-python to work with Python 3.9 ### What changes were proposed in this pull request? Fix `lint-python` to pick `PYTHON_EXECUTABLE` from the environment variable first to switch the Python and explicitly specify `PYTHON_EXECUTABLE` to use `python3.9` in CI. ### Why are the changes needed? Currently `lint-python` uses `python3`, but it's not the one we expect in CI. As a result, `black` check is not working. ``` The python3 -m black command was not found. Skipping black checks for now. ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The `black` check in `lint-python` should work. Closes #33507 from ueshin/issues/SPARK-36279/lint-python. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `663cbdfbe5`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-24 16:49:51 +09:00
Liang-Chi Hsieh	a6418a3463	[SPARK-36270][BUILD] Change memory settings for enabling GA ### What changes were proposed in this pull request? Trying to adjust build memory settings and serial execution to re-enable GA. ### Why are the changes needed? GA tests have been failing recently with return code 137. We need to adjust build settings to make GA work. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? GA Closes #33447 from viirya/test-ga. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `fd36ed4550`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-23 19:11:09 +09:00
Hyukjin Kwon	f169f056b4	[SPARK-36268][PYTHON] Set the lowerbound of mypy version to 0.910 ### What changes were proposed in this pull request? This PR proposes to set the lowerbound of mypy version to use in the testing script. ### Why are the changes needed? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141519/console ``` python/pyspark/mllib/tree.pyi:29: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/tree.pyi:38: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/feature.pyi:34: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/feature.pyi:42: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/feature.pyi:48: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/feature.pyi:54: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/feature.pyi:76: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/feature.pyi:124: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/feature.pyi:165: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/clustering.pyi:45: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/clustering.pyi:72: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/classification.pyi:39: error: Overloaded function signatures 1 and 2 overlap with incompatible return types python/pyspark/mllib/classification.pyi:52: error: Overloaded function signatures 1 and 2 overlap with incompatible return types Found 13 errors in 4 files (checked 314 source 
files) 1 ``` Jenkins installed mypy in SPARK-32797, but the installed version seems to differ from the one used in GitHub Actions. It seems difficult to make the codebase compatible with multiple mypy versions. Therefore, this PR sets the lower bound. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Jenkins job in this PR should test it out. Also manually tested: Without mypy: ``` ... flake8 checks passed. The mypy command was not found. Skipping for now. ``` With mypy 0.812: ``` ... flake8 checks passed. The minimum mypy version needs to be 0.910. Your current version is mypy 0.812. Skipping for now. ``` With mypy 0.910: ``` ... flake8 checks passed. starting mypy test... mypy checks passed. all lint-python tests passed! ``` Closes #33487 from HyukjinKwon/SPARK-36268. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `d6bc8cd681`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-23 12:28:28 +09:00
Dongjoon Hyun	60566f9d8e	[SPARK-36262][BUILD] Upgrade ZSTD-JNI to 1.5.0-4 ### What changes were proposed in this pull request? This PR aims to upgrade ZSTD-JNI to 1.5.0-4. ### Why are the changes needed? ZSTD-JNI 1.5.0-3 has a packaging issue. 1.5.0-4 is recommended to be used instead. - https://github.com/luben/zstd-jni/issues/181#issuecomment-885138495 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #33483 from dongjoon-hyun/SPARK-36262. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `a1a197403b`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-07-22 14:04:14 -07:00
Hyukjin Kwon	d01e53208b	[SPARK-36251][INFRA][BUILD][3.2] Cover GitHub Actions runs without SHA in testing script ### What changes were proposed in this pull request? This PR partially backports the fix in the script at https://github.com/apache/spark/pull/33410 to make the branch-3.2 build pass at https://github.com/apache/spark/actions/workflows/build_and_test.yml?query=event%3Aschedule ### Why are the changes needed? To make the Scala 2.13 periodical job pass ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? It is a logically non-conflicting backport. Closes #33472 from HyukjinKwon/SPARK-36251. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-22 11:47:36 +09:00
Kousuke Saruta	fef7bf9fcc	[SPARK-36244][BUILD] Upgrade zstd-jni to 1.5.0-3 to avoid a bug about buffer size calculation ### What changes were proposed in this pull request? This PR upgrades `zstd-jni` from `1.5.0-2` to `1.5.0-3`. `1.5.0-3` was released a few days ago. This release resolves an issue about buffer size calculation, which can affect usage in Spark. https://github.com/luben/zstd-jni/releases/tag/v1.5.0-3 ### Why are the changes needed? A skip length greater than `2^31 - 1` may be a corner case, but it can affect Spark. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI. Closes #33464 from sarutak/upgrade-zstd-jni-1.5.0-3. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `dcb7db5370`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-07-21 19:37:18 -07:00
Kousuke Saruta	57794d3ec9	[SPARK-36166][TESTS][FOLLOWUP] Add BLOCK_SCALA_VERSION to sparktestssupport/__init__.py ### What changes were proposed in this pull request? This is a followup PR for SPARK-36166 (#33411), which adds `BLOCK_SCALA_VERSION` to `sparktestsupport/__init__.py`. ### Why are the changes needed? The following command fails because the definition is missing. ``` SCALA_PROFILE=scala2.12 dev/run-tests.py ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The command shown above works. Closes #33421 from sarutak/followup-SPARK-36166. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `c7ccc602db`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-19 22:47:14 +09:00
Hyukjin Kwon	2ae77574dc	[SPARK-36166][TESTS][FOLLOW-UP] Add Scala version change logic into testing script ### What changes were proposed in this pull request? This PR is a simple followup from https://github.com/apache/spark/pull/33376: - It simplifies a bit by removing the default Scala version in the testing script (so we don't have to change here in the future when we change the Scala default version). - Call `change-scala-version.sh` script (when `SCALA_PROFILE` is explicitly specified) ### Why are the changes needed? More refactoring. In addition, this change will be used at https://github.com/apache/spark/pull/33410 ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? CI in this PR should test it out. Closes #33411 from HyukjinKwon/SPARK-36166. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `8ee199ef42`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-19 18:01:14 +09:00
William Hyun	d5cec45c0b	[SPARK-36198][TESTS] Skip UNIDOC generation in PySpark GHA job ### What changes were proposed in this pull request? This PR aims to skip UNIDOC generation in PySpark GHA job. ### Why are the changes needed? PySpark GHA jobs do not need to generate Java/Scala doc. This will save about 13 minutes in total. -https://github.com/apache/spark/runs/3098268973?check_suite_focus=true ``` ... ======================================================================== Building Unidoc API Documentation ======================================================================== [info] Building Spark unidoc using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Pscala-2.12 -Phive-thriftserver -Pmesos -Pdocker-integration-tests -Phive -Pkinesis-asl -Pspark-ganglia-lgpl -Pkubernetes -Phadoop-cloud -Pyarn unidoc ... [info] Main Java API documentation successful. [success] Total time: 192 s (03:12), completed Jul 18, 2021 6:08:40 PM ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the GHA. Closes #33407 from williamhyun/SKIP_UNIDOC. Authored-by: William Hyun <william@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `c336f73ccd`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-07-18 17:52:40 -07:00
Dongjoon Hyun	12c8c89693	[SPARK-36166][TESTS] Support Scala 2.13 test in `dev/run-tests.py` ### What changes were proposed in this pull request? For Apache Spark 3.2, this PR aims to support Scala 2.13 test in `dev/run-tests.py` by adding `SCALA_PROFILE` and in `dev/run-tests-jenkins.py` by adding `AMPLAB_JENKINS_BUILD_SCALA_PROFILE`. In addition, `test-dependencies.sh` is skipped for Scala 2.13 because we currently don't maintain the dependency manifests yet. This will be handled after Apache Spark 3.2.0 release. ### Why are the changes needed? To test Scala 2.13 with `dev/run-tests.py`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual. The following is the result. Note that this PR aims to run Scala 2.13 tests instead of passing them. We will have daily GitHub Action job via #33358 and will fix UT failures if exists. ``` $ dev/change-scala-version.sh 2.13 $ SCALA_PROFILE=scala2.13 dev/run-tests.py ... ======================================================================== Running Scala style checks ======================================================================== [info] Checking Scala style using SBT with these profiles: -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pkubernetes -Phadoop-cloud -Phive -Phive-thriftserver -Pyarn -Pmesos -Pdocker-integration-tests -Pkinesis-asl -Pspark-ganglia-lgpl ... ======================================================================== Building Spark ======================================================================== [info] Building Spark using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud test:package streaming-kinesis-asl-assembly/assembly ... 
[info] Building Spark assembly using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud assembly/package ... ======================================================================== Running Java style checks ======================================================================== [info] Checking Java style using SBT with these profiles: -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud ... ======================================================================== Building Unidoc API Documentation ======================================================================== [info] Building Spark unidoc using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud unidoc ... ======================================================================== Running Spark unit tests ======================================================================== [info] Running Spark tests using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Pscala-2.13 -Pspark-ganglia-lgpl -Pmesos -Pyarn -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pdocker-integration-tests -Phive -Phadoop-cloud test ... ``` Closes #33376 from dongjoon-hyun/SPARK-36166. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `f66153de78`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-07-15 19:26:20 -07:00
Dongjoon Hyun	0520497d60	[SPARK-36164][INFRA][FOLLOWUP] Add empty string check back ### What changes were proposed in this pull request? This is a follow-up of #33371. At the branch commit GitHub run, we have an empty environment variable. This PR adds back the empty string check logic. ### Why are the changes needed? Currently, the failure happens when we use `--modules` in GitHub Action. ``` $ GITHUB_ACTIONS=1 APACHE_SPARK_REF= dev/run-tests.py --modules core [info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment github_actions fatal: ambiguous argument '': unknown revision or path not in the working tree. Use '--' to separate paths from revisions, like this: 'git <command> [<revision>...] -- [<file>...]' Traceback (most recent call last): File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 785, in <module> main() File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 663, in main changed_files = identify_changed_files_from_git_commits( File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 91, in identify_changed_files_from_git_commits raw_output = subprocess.check_output(['git', 'diff', '--name-only', patch_sha, diff_target], File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/subprocess.py", line 424, in check_output return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/subprocess.py", line 528, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['git', 'diff', '--name-only', 'HEAD', '']' returned non-zero exit status 128. ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually. The following failure is correct in local environment because it passed `identify_changed_files_from_git_commits` already. 
``` $ GITHUB_ACTIONS=1 APACHE_SPARK_REF= dev/run-tests.py --modules core [info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment github_actions Traceback (most recent call last): File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 785, in <module> main() File "/Users/dongjoon/APACHE/spark-merge/dev/run-tests.py", line 668, in main os.environ["GITHUB_SHA"], target_ref=os.environ["GITHUB_PREV_SHA"]) File "/Users/dongjoon/.pyenv/versions/3.9.5/lib/python3.9/os.py", line 679, in __getitem__ raise KeyError(key) from None KeyError: 'GITHUB_SHA' ``` Closes #33374 from dongjoon-hyun/SPARK-36164. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `5f41a2752f`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-07-15 13:44:27 -07:00
William Hyun	ebc830f14e	[SPARK-36164][INFRA] run-test.py should not fail when APACHE_SPARK_REF is not defined ### What changes were proposed in this pull request? This PR aims to change `run-tests.py` so that it does not fail when `os.environ["APACHE_SPARK_REF"]` is not defined. ### Why are the changes needed? Currently, `run-tests.py` ends with an error in that case. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #33371 from williamhyun/SPARK-36164. Authored-by: William Hyun <william@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `c8a3c22628`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-07-15 11:43:48 -07:00
Hyukjin Kwon	a87a6df2d1	[SPARK-36159][BUILD] Replace 'python' to 'python3' in dev/test-dependencies.sh ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/26330. There is one last place to fix, in `dev/test-dependencies.sh`. ### Why are the changes needed? To stick to Python 3 instead of mistakenly using Python 2. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested. Closes #33368 from HyukjinKwon/change-python-3. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `6bd385f1e3`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-07-15 07:58:27 -07:00
Dongjoon Hyun	384bee3663	[SPARK-36150][INFRA][TESTS] Disable MiMa for Scala 2.13 artifacts ### What changes were proposed in this pull request? This PR aims to disable MiMa check for Scala 2.13 artifacts. ### Why are the changes needed? Apache Spark doesn't have Scala 2.13 Maven artifacts yet. SPARK-36151 will enable this after Apache Spark 3.2.0 release. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual. The following should succeed without real testing. ``` $ dev/mima -Pscala-2.13 ``` Closes #33355 from dongjoon-hyun/SPARK-36150. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `5acfecbf97`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-07-15 00:34:43 -07:00
Kousuke Saruta	ca8d2670b7	[SPARK-36129][BUILD] Upgrade commons-compress to 1.21 to deal with CVEs ### What changes were proposed in this pull request? This PR upgrades `commons-compress` from `1.20` to `1.21` to deal with CVEs. ### Why are the changes needed? Some CVEs which affect `commons-compress 1.20` are reported and fixed in `1.21`. https://commons.apache.org/proper/commons-compress/security-reports.html * CVE-2021-35515 * CVE-2021-35516 * CVE-2021-35517 * CVE-2021-36090 The severities are reported as low for all the CVEs but it would be better to deal with them just in case. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI. Closes #33333 from sarutak/upgrade-commons-compress-1.21. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `fd06cc211d`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-07-13 22:53:22 -07:00
Wenchen Fan	c1d8ccfb64	Revert "[SPARK-35253][SPARK-35398][SQL][BUILD] Bump up the janino version to v3.1.4" ### What changes were proposed in this pull request? This PR reverts https://github.com/apache/spark/pull/32455 and its followup https://github.com/apache/spark/pull/32536 , because the new janino version has a bug that is not fixed yet: https://github.com/janino-compiler/janino/pull/148 ### Why are the changes needed? avoid regressions ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #33302 from cloud-fan/revert. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `ae6199af44`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-13 12:14:21 +09:00
Yikun Jiang	fd277dc036	[SPARK-36002][PYTHON] Consolidate tests for data-type-based operations of decimal Series ### What changes were proposed in this pull request? Merge test_decimal_ops into test_num_ops - merge test_isnull() into test_num_ops.test_isnull() - remove test_datatype_ops(), which is already covered in `11fcbc73cb/python/pyspark/pandas/tests/data_type_ops/test_base.py (L58-L59)` ### Why are the changes needed? Tests for data-type-based operations of decimal Series are in two places: - python/pyspark/pandas/tests/data_type_ops/test_decimal_ops.py - python/pyspark/pandas/tests/data_type_ops/test_num_ops.py We'd better merge test_decimal_ops into test_num_ops. See also [SPARK-36002](https://issues.apache.org/jira/browse/SPARK-36002). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? unittests passed Closes #33206 from Yikun/SPARK-36002. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `fdc50f4452`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-09 14:08:23 +09:00
Dongjoon Hyun	d7990943c3	[SPARK-35992][BUILD] Upgrade ORC to 1.6.9 ### What changes were proposed in this pull request? This PR aims to upgrade Apache ORC to 1.6.9. ### Why are the changes needed? This is required to bring ORC-804 in order to fix ORC encryption masking bug. ### Does this PR introduce _any_ user-facing change? No. This is not released yet. ### How was this patch tested? Pass the newly added test case. Closes #33189 from dongjoon-hyun/SPARK-35992. Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `c55b9fd1e0`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-07-02 09:50:00 -07:00
shane knapp	2c94fbc71e	initial commit for skeleton ansible for jenkins worker config ### What changes were proposed in this pull request? this is the skeleton of the ansible used to configure jenkins workers in the riselab/apache spark build system ### Why are the changes needed? they are not needed, but will help the community understand how to build systems to test multiple versions of spark, as well as propose changes that i can integrate in to the "production" riselab repo. since we're sunsetting jenkins by EOY 2021, this will potentially be useful for migrating the build system. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ansible-lint and much wailing and gnashing of teeth. Closes #32178 from shaneknapp/initial-ansible-commit. Lead-authored-by: shane knapp <incomplete@gmail.com> Co-authored-by: shane <incomplete@gmail.com> Signed-off-by: shane knapp <incomplete@gmail.com>	2021-06-30 10:05:27 -07:00
Dongjoon Hyun	b218cc90cf	[SPARK-35948][INFRA] Simplify release scripts by removing Spark 2.4/Java7 parts ### What changes were proposed in this pull request? This PR aims to clean up Spark 2.4 and Java7 code path from the release scripts. ### Why are the changes needed? To simplify the logic. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #33150 from dongjoon-hyun/SPARK-35948. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-30 16:24:03 +09:00
Dongjoon Hyun	5312008cca	[SPARK-35947][INFRA] Increase JVM stack size in release-build.sh ### What changes were proposed in this pull request? Like SPARK-35825, this PR aims to increase JVM stack size via `MAVEN_OPTS` in release-build.sh. ### Why are the changes needed? This will mitigate the failure in publishing snapshot GitHub Action job and during the release. - https://github.com/apache/spark/actions/workflows/publish_snapshot.yml (3-day consecutive failures) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #33149 from dongjoon-hyun/SPARK-35947. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-30 16:23:13 +09:00
Yuanjian Li	3257a30e53	[SPARK-35784][SS] Implementation for RocksDB instance ### What changes were proposed in this pull request? The implementation for the RocksDB instance, which is used in the RocksDB state store. It plays a role as a handler for the RocksDB instance and RocksDBFileManager. ### Why are the changes needed? Part of the RocksDB state store implementation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New UT added. Closes #32928 from xuanyuanking/SPARK-35784. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-06-29 17:46:45 -07:00
Takuya UESHIN	1f6e2f55d7	Revert "[SPARK-35721][PYTHON] Path level discover for python unittests" This reverts commit `5db51efa1a`.	2021-06-29 12:08:09 -07:00
Dongjoon Hyun	7e7028282c	[SPARK-35928][BUILD] Upgrade ASM to 9.1 ### What changes were proposed in this pull request? This PR aims to upgrade ASM to 9.1 ### Why are the changes needed? The latest `xbean-asm9-shaded` is built with ASM 9.1. - https://mvnrepository.com/artifact/org.apache.xbean/xbean-asm9-shaded/4.20 - `5e0e3c0c64/pom.xml (L67)` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #33130 from dongjoon-hyun/SPARK-35928. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-29 10:27:51 -07:00
Yikun Jiang	5db51efa1a	[SPARK-35721][PYTHON] Path level discover for python unittests ### What changes were proposed in this pull request? Add path level discover for python unittests. ### Why are the changes needed? Currently, we need to specify the Python test cases manually when we add a new test case. Sometimes we forget to add the test case to the module list, and then the test case is not executed. For example: - pyspark-core pyspark.tests.test_pin_thread Thus we need an auto-discovery mechanism that finds all test cases, rather than specifying every case manually. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add the code below at the end of `dev/sparktestsupport/modules.py` ```python for m in sorted(all_modules): for g in sorted(m.python_test_goals): print(m.name, g) ``` Compare the result before and after: https://www.diffchecker.com/iO3FvhKL Closes #32867 from Yikun/SPARK_DISCOVER_TEST. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-29 17:56:13 +09:00
Dongjoon Hyun	0a7a6f750c	[SPARK-35483][FOLLOWUP][TESTS] Update run-tests.py doctest ### What changes were proposed in this pull request? This PR updates the doctests in `run-tests.py`. ### Why are the changes needed? This should be consistent with the `modules.py` behavior. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the GitHub Action. I checked manually. ``` $ python dev/run-tests.py Cannot install SparkR as R was not found in PATH [info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment local [info] Found the following changed modules: root [info] Setup the following environment variables for tests: ======================================================================== Running Apache RAT checks ======================================================================== RAT checks passed. ``` Closes #33127 from dongjoon-hyun/SPARK-35483-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-28 23:14:47 -07:00
Dongjoon Hyun	57896e662e	[SPARK-35483][FOLLOWUP][TESTS] Enable docker_integration_tests for catalyst/sql module changes too ### What changes were proposed in this pull request? This PR aims to enable `docker_integration_tests` when `catalyst` and `sql` module changes additionally. ### Why are the changes needed? Currently, `catalyst` and `sql` module changes do not trigger the JDBC integration test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #33125 from dongjoon-hyun/SPARK-35483. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-28 22:59:56 -07:00
Dongjoon Hyun	b999e6bd90	[SPARK-35920][BUILD] Upgrade to Chill 0.10.0 ### What changes were proposed in this pull request? This PR aims to upgrade Chill to 0.10.0. ### Why are the changes needed? This is a maintenance release having cross-compilation to 2.12.14 and 2.13.6 . - https://github.com/twitter/chill/releases/tag/v0.10.0 ### Does this PR introduce _any_ user-facing change? No, this is a dependency change. ### How was this patch tested? Pass the CIs. Closes #33119 from dongjoon-hyun/SPARK-35920. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-28 22:06:41 -07:00
Xinrong Meng	5f0113e3a6	[SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark ### What changes were proposed in this pull request? The PR is proposed to support creating a Column of numpy literal value in pandas-on-Spark. It consists of three changes mainly: - Enable the `lit` function defined in `pyspark.pandas.spark.functions` to support numpy literals input. ```py >>> from pyspark.pandas.spark import functions as SF >>> SF.lit(np.int64(1)) Column<'CAST(1 AS BIGINT)'> >>> SF.lit(np.int32(1)) Column<'CAST(1 AS INT)'> >>> SF.lit(np.int8(1)) Column<'CAST(1 AS TINYINT)'> >>> SF.lit(np.byte(1)) Column<'CAST(1 AS TINYINT)'> >>> SF.lit(np.float32(1)) Column<'CAST(1.0 AS FLOAT)'> ``` - Substitute `F.lit` by `SF.lit`, that is, use `lit` function defined in `pyspark.pandas.spark.functions` rather than `lit` function defined in `pyspark.sql.functions` to allow creating columns out of numpy literals. - Enable numpy literals input in `isin` method Non-goal: - Some pandas-on-Spark APIs use PySpark column-related APIs internally, and these column-related APIs don't support numpy literals, thus numpy literals are disallowed as input (e.g. `to_replace` parameter in `replace` API). This PR doesn't aim to adjust all of them. This PR adjusts `isin` only, because the PR is inspired by that (as https://github.com/databricks/koalas/issues/2161). - To complete mappings between all kinds of numpy literals and Spark data types should be a followup task. ### Why are the changes needed? Spark (`lit` function defined in `pyspark.sql.functions`) doesn't support creating a Column out of numpy literal value. So `lit` function defined in `pyspark.pandas.spark.functions` is adjusted in order to support that in pandas-on-Spark. ### Does this PR introduce _any_ user-facing change? Yes. Before: ```py >>> a = ps.DataFrame({'source': [1,2,3,4,5]}) >>> a.source.isin([np.int64(1), np.int64(2)]) Traceback (most recent call last): ... 
AttributeError: 'numpy.int64' object has no attribute '_get_object_id' ``` After: ```py >>> a = ps.DataFrame({'source': [1,2,3,4,5]}) >>> a.source.isin([np.int64(1), np.int64(2)]) 0 True 1 True 2 False 3 False 4 False Name: source, dtype: bool ``` ### How was this patch tested? Unit tests. Closes #32955 from xinrong-databricks/datatypeops_literal. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-28 19:03:42 -07:00
Adam Binford	939ea3d5da	[SPARK-35863][BUILD] Update Ivy to 2.5.0 ### What changes were proposed in this pull request? Update Ivy from 2.4.0 to 2.5.0. - https://ant.apache.org/ivy/history/2.5.0/release-notes.html ### Why are the changes needed? This brings various improvements and bug fixes. Most notably, the adding of `ivy.maven.lookup.sources` and `ivy.maven.lookup.javadoc` configs can significantly speed up module resolution time if these are turned off, especially behind a proxy. These could arguably be turned off by default, because when submitting jobs you probably don't care about the sources or javadoc jars. I didn't include that here but happy to look into if it's desired. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT and build passes Closes #33088 from Kimahriman/feature/ivy-update. Authored-by: Adam Binford <adamq43@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-25 07:37:36 -07:00
Wenchen Fan	95ba000279	[SPARK-35872][INFRA] Automatize some steps to finalize the release ### What changes were proposed in this pull request? After the RC vote, the release manager still needs to do a lot of work to finalize the release. This PR updates the script to automate some steps: 1. create the final git tag 2. publish to pypi 3. publish docs to spark-website 4. move the release binaries from dev directory to release directory. 5. update the KEYS file ### Why are the changes needed? To ease the work of the release manager. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? tested with the recent 3.0.3. Closes #33055 from cloud-fan/release. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-24 13:25:41 -07:00
yi.wu	1cdc56c70d	[SPARK-35869][INFRA] Fix "Cannot run program python" error from do-release-docker.sh ### What changes were proposed in this pull request? Add `python-is-python3` to `create-release/spark-rm/Dockerfile` ### Why are the changes needed? Systems that use python3 by default should explicitly indicate the python version is 3. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested during Apache 3.0.3 release. Closes #33048 from Ngone51/fix-release-script. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-24 12:47:28 +09:00
Dongjoon Hyun	0f25cabbc2	[SPARK-35844][INFRA] Add hadoop-cloud profile to PUBLISH_PROFILES ### What changes were proposed in this pull request? This PR aims to add `hadoop-cloud` profile to `PUBLISH_PROFILES` in order to publish `hadoop-cloud` module. Note that this doesn't change `BASE_RELEASE_PROFILES` and there is no change in the binary distributions. ### Why are the changes needed? This is discussed here. - https://lists.apache.org/thread.html/rf87d755460d5ed85c7b6ac0edad48f53c929a2cd287f30be24afd2ad%40%3Cuser.spark.apache.org%3E ### Does this PR introduce _any_ user-facing change? Yes, this will provide `hadoop-cloud` module in Maven Central. ### How was this patch tested? N/A (After merging this, we can check the daily snapshot result) Closes #33003 from dongjoon-hyun/SPARK-35844. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-21 11:56:21 -07:00
Yikun Jiang	b7df75a777	[SPARK-35708][PYTHON][TEST] Add BaseTest for DataTypeOps ### What changes were proposed in this pull request? This patch adds a DataTypeOps test to check that the ops are loaded as expected. ### Why are the changes needed? While completing https://github.com/apache/spark/pull/32821, I found there is no test for DataTypeOps. A lot of logic runs when DataTypeOps is loaded, so it is better to add a test to keep the interface stable. ### Does this PR introduce _any_ user-facing change? No, test only ### How was this patch tested? test passed. Closes #32859 from Yikun/SPARK-XXXXX1. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-18 18:54:50 -07:00
Yikun Jiang	f84a720fe3	[SPARK-35342][PYTHON] Introduce DecimalOps and make `isnull` method data-type-based ### What changes were proposed in this pull request? - Introduce a DecimalOps for DecimalType - Make `isnull` method data-type-based ### Why are the changes needed? Now DecimalType, DoubleType, and FloatType data share the FractionalOps class, but DecimalType behaves differently from FloatType and DoubleType (as https://github.com/apache/spark/blob/master/python/pyspark/pandas/base.py#L987-L990), so we propose to introduce DecimalOps. The behavior difference here is caused by DecimalType could not have NaN. https://issues.apache.org/jira/browse/SPARK-35342 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - New added DecimalOpsTest passed - Existing NumOpsTest passed Closes #32821 from Yikun/SPARK-35342. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-18 10:44:35 -07:00
David Christle	7fcb127674	[SPARK-35670][BUILD] Upgrade ZSTD-JNI to 1.5.0-2 ### What changes were proposed in this pull request? This PR aims to upgrade `zstd-jni` to 1.5.0-2, which uses `zstd` version 1.5.0. ### Why are the changes needed? Major improvements to Zstd support are targeted for the upcoming 3.2.0 release of Spark. Zstd 1.5.0 introduces significant compression (+25% to 140%) and decompression (~15%) speed improvements in benchmarks described in more detail on the releases page: - https://github.com/facebook/zstd/releases/tag/v1.5.0 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Build passes build tests, but the benchmark tests seem flaky. I am unsure if this change is responsible. The error is: ``` Running org.apache.spark.rdd.CoalescedRDDBenchmark: 21/06/08 18:53:10 ERROR SparkContext: Failed to add file:/home/runner/work/spark/spark/./core/target/scala-2.12/spark-core_2.12-3.2.0-SNAPSHOT-tests.jar to Spark environment java.lang.IllegalArgumentException: requirement failed: File spark-core_2.12-3.2.0-SNAPSHOT-tests.jar was already registered with a different path (old path = /home/runner/work/spark/spark/core/target/scala-2.12/spark-core_2.12-3.2.0-SNAPSHOT-tests.jar, new path = /home/runner/work/spark/spark/./core/target/scala-2.12/spark-core_2.12-3.2.0-SNAPSHOT-tests.jar ``` https://github.com/dchristle/spark/runs/2776123749?check_suite_focus=true cc: dongjoon-hyun Closes #32826 from dchristle/ZSTD150. Lead-authored-by: David Christle <dchristle@squareup.com> Co-authored-by: David Christle <dchristle@users.noreply.github.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-17 11:06:50 -07:00
Chao Sun	506ef9aad7	[SPARK-29250][BUILD] Upgrade to Hadoop 3.3.1 ### What changes were proposed in this pull request? This upgrades the default Hadoop version from 3.2.1 to 3.3.1. The changes here simply update the version number and dependency file. ### Why are the changes needed? Hadoop 3.3.1 just came out, which comes with many client-side improvements such as for S3A/ABFS (20% faster when accessing S3). These are important for users who want to use Spark in a cloud environment. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Existing unit tests in Spark - Manually tested using my S3 bucket for event log dir: ``` bin/spark-shell \ -c spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \ -c spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \ -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://<my-bucket> ``` - Manually tested against docker-based YARN dev cluster, by running `SparkPi`. Closes #30135 from sunchao/SPARK-29250. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-16 13:28:07 -07:00
Sumeet Gajjar	864ff67746	[SPARK-35429][CORE] Remove commons-httpclient from Hadoop-3.2 profile due to EOL and CVEs ### What changes were proposed in this pull request? Remove commons-httpclient as a direct dependency for Hadoop-3.2 profile. Hadoop-2.7 profile distribution still has it, hadoop-client has a compile dependency on commons-httpclient, thus we cannot remove it for Hadoop-2.7 profile. ``` [INFO] +- org.apache.hadoop:hadoop-client:jar:2.7.4:compile [INFO] \| +- org.apache.hadoop:hadoop-common:jar:2.7.4:compile [INFO] \| \| +- commons-cli:commons-cli:jar:1.2:compile [INFO] \| \| +- xmlenc:xmlenc:jar:0.52:compile [INFO] \| \| +- commons-httpclient:commons-httpclient:jar:3.1:compile ``` ### Why are the changes needed? Spark is pulling in commons-httpclient as a dependency directly. commons-httpclient went EOL years ago and there are most likely CVEs not being reported against it, thus we should remove it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Existing unittests - Checked the dependency tree before and after introducing the changes Before: ``` ./build/mvn dependency:tree -Phadoop-3.2 \| grep -i "commons-httpclient" Using `mvn` from path: /usr/bin/mvn [INFO] +- commons-httpclient:commons-httpclient:jar:3.1:compile [INFO] \| +- commons-httpclient:commons-httpclient:jar:3.1:provided ``` After ``` ./build/mvn dependency:tree \| grep -i "commons-httpclient" Using `mvn` from path: /Users/sumeet.gajjar/cloudera/upstream-spark/build/apache-maven-3.6.3/bin/mvn ``` P.S. Reopening this since [spark upgraded](`463daabd5a`) its `hive.version` to `2.3.9` which does not have a dependency on `commons-httpclient`. Closes #32912 from sumeetgajjar/SPARK-35429. Authored-by: Sumeet Gajjar <sumeetgajjar93@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-15 14:43:30 -07:00
Yuming Wang	463daabd5a	[SPARK-34512][BUILD][SQL] Upgrade built-in Hive to 2.3.9 ### What changes were proposed in this pull request? This pr upgrades built-in Hive to 2.3.9. Hive 2.3.9 changes: - [HIVE-17155] - findConfFile() in HiveConf.java has some issues with the conf path - [HIVE-24797] - Disable validate default values when parsing Avro schemas - [HIVE-24608] - Switch back to get_table in HMS client for Hive 2.3.x - [HIVE-21200] - Vectorization: date column throwing java.lang.UnsupportedOperationException for parquet - [HIVE-21563] - Improve Table#getEmptyTable performance by disabling registerAllFunctionsOnce - [HIVE-19228] - Remove commons-httpclient 3.x usage ### Why are the changes needed? Fix regression caused by AVRO-2035. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #32750 from wangyum/SPARK-34512. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-10 20:44:35 -07:00
Xinrong Meng	04a8d2cbcf	[SPARK-35343][PYTHON] Make the conversion from/to pandas data-type-based for non-ExtensionDtypes ### What changes were proposed in this pull request? Make the conversion from/to pandas (for non-ExtensionDtype) data-type-based. NOTE: Ops class per ExtensionDtype and its data-type-based from/to pandas will be implemented in a separate PR as https://issues.apache.org/jira/browse/SPARK-35614. ### Why are the changes needed? The conversion from/to pandas includes logic for checking data types and behaving accordingly. That makes code hard to change or maintain. Since we have introduced the Ops class per non-ExtensionDtype data type, we ought to make the conversion from/to pandas data-type-based for non-ExtensionDtypes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #32592 from xinrong-databricks/datatypeop_pd_conversion. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-07 13:12:12 -07:00
Dongjoon Hyun	6f2ffccb5e	[SPARK-35660][BUILD][K8S] Upgrade kubernetes-client to 5.4.1 ### What changes were proposed in this pull request? This PR aims to upgrade kubernetes-client to 5.4.1. ### Why are the changes needed? This will bring a few bug fixes. - https://github.com/fabric8io/kubernetes-client/releases/tag/v5.4.1 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #32798 from dongjoon-hyun/SPARK-35660. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-06 22:27:08 -07:00
itholic	b8740a1d1e	[SPARK-35499][PYTHON] Apply black to pandas API on Spark codes ### What changes were proposed in this pull request? This PR proposes applying `black` to pandas API on Spark codes, for improving static analysis. By executing `./dev/reformat-python` in the Spark home directory, all the code of the pandas API on Spark is fixed according to the static analysis rules. ### Why are the changes needed? This can reduce the cost of static analysis during development. It has been used continuously for about a year in the Koalas project and its convenience has been proven. ### Does this PR introduce _any_ user-facing change? No, it's dev-only. ### How was this patch tested? Manually reformatted the pandas API on Spark codes by running `./dev/reformat-python`, and checked that `./dev/lint-python` passes. Closes #32779 from itholic/SPARK-35499. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-06-06 17:30:07 -07:00
Hyukjin Kwon	807b4006ca	[SPARK-35648][PYTHON] Refine and add dependencies needed for dev in dev/requirement.txt ### What changes were proposed in this pull request? This PR proposes to update `dev/requirement.txt` file. NOTE that: - This file isn't used anywhere in Apache Spark CI. It's just for convenience - To minimize the overhead of maintenance, I removed all lowerbounds of dependencies, which means that using the latest versions of them should work in the clean environment (e.g., you can reinstall all of them). ### Why are the changes needed? To note the dependencies needed for Spark dev, and for easier env setting up. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Logically derived from setup.py, and other places like CI Closes #32780 from HyukjinKwon/SPARK-35648. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-04 15:02:29 +09:00
Takuya UESHIN	221553c204	[SPARK-35642][INFRA] Split pyspark-pandas tests to rebalance the test duration ### What changes were proposed in this pull request? Splits some tests in `pyspark-pandas` module as slot tests to rebalance the test duration. Picked the top 12 tests from the previous runs and the total times are almost even. ### Why are the changes needed? Currently `pyspark-pandas` module tests take long time, so we should rebalance the tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32778 from ueshin/issues/SPARK-35642/split-pandas-on-spark-tests. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-04 12:52:52 +09:00
Hyukjin Kwon	3d158f9c91	[SPARK-35587][PYTHON][DOCS] Initial porting of Koalas documentation ### What changes were proposed in this pull request? This PR proposes to port Koalas documentation to PySpark documentation as its initial step. It ports almost as is except these differences: - Renamed import from `databricks.koalas` to `pyspark.pandas`. - Renamed `to_koalas` -> `to_pandas_on_spark` - Renamed `(Series\|DataFrame).koalas` -> `(Series\|DataFrame).pandas_on_spark` - Added a `ps_` prefix in the RST file names of Koalas documentation Other than that, - Excluded `python/docs/build/html` in linter - Fixed GA dependency installation ### Why are the changes needed? To document pandas APIs on Spark. ### Does this PR introduce _any_ user-facing change? Yes, it adds new documentations. ### How was this patch tested? Manually built the docs and checked the output. Closes #32726 from HyukjinKwon/SPARK-35587. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-04 11:11:09 +09:00
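
The path-level test discovery described in SPARK-35721 (commit `5db51efa1a`) can be illustrated with a minimal sketch. This is not the code from that commit: the function name `discover_test_goals`, the `test_*.py` naming rule, and the demo file layout are all assumptions made for illustration. It only shows the general idea of walking a package directory and deriving dotted test module names, so that a newly added test file is picked up without being listed by hand.

```python
import os
import tempfile


def discover_test_goals(root, package_dir):
    """Hypothetical helper: return sorted dotted module names for every
    test_*.py file found under root/package_dir."""
    goals = []
    base = os.path.join(root, package_dir)
    for dirpath, _dirnames, filenames in os.walk(base):
        for filename in filenames:
            # Assumed convention: test modules are named test_*.py.
            if filename.startswith("test_") and filename.endswith(".py"):
                rel = os.path.relpath(os.path.join(dirpath, filename), root)
                # Drop the ".py" suffix and turn path separators into dots.
                goals.append(rel[: -len(".py")].replace(os.sep, "."))
    return sorted(goals)


# Build a throwaway tree that mimics a package layout (demo paths are made up,
# apart from test_pin_thread, which the commit message mentions).
with tempfile.TemporaryDirectory() as root:
    for rel in (
        "pyspark/tests/test_pin_thread.py",
        "pyspark/tests/test_rdd.py",
        "pyspark/tests/helpers.py",  # not a test module, should be skipped
        "pyspark/sql/tests/test_column.py",
    ):
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    demo_goals = discover_test_goals(root, "pyspark")

print(demo_goals)
```

With discovery of this kind, a file such as `pyspark/tests/test_pin_thread.py` appears in the goal list as soon as it exists on disk, which is the failure mode the commit message says manual listing was prone to.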

1091 commits