Commit graph

15 commits

Hyukjin Kwon be9089731a [SPARK-35588][PYTHON][DOCS] Merge Binder integration and quickstart notebook for pandas API on Spark
### What changes were proposed in this pull request?

This PR proposes to:
- fix the Binder integration of pandas API on Spark, and merge it together with the existing PySpark one.
- update the quickstart of pandas API on Spark, and make it work.

The notebooks can be easily reviewed here:

https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-35588-3?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb

Original page in Koalas: https://koalas.readthedocs.io/en/latest/getting_started/10min.html

### Why are the changes needed?

- To show working quickstart examples to end users.
- To allow users to easily try out the examples without installation.

### Does this PR introduce _any_ user-facing change?

No to end users, because the existing quickstart of pandas API on Spark has not been released yet.

### How was this patch tested?

I manually tested it by uploading a built Spark distribution to Binder. See 3bc15310a0

Closes #33041 from HyukjinKwon/SPARK-35588-2.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 10:17:22 +09:00
Hyukjin Kwon 27046582e4 [SPARK-35645][PYTHON][DOCS] Merge contents and remove obsolete pages in Getting Started section
### What changes were proposed in this pull request?

This PR revises the installation guide to describe `pip install pyspark[pandas_on_spark]`, and removes the separate pandas-on-Spark installation page and the videos/blog posts.
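
For reference, the documented installation command is along these lines; a minimal sketch, with the extras name taken from this PR:

```bash
# Install PySpark together with the optional dependencies required by
# the pandas API on Spark (quoted to protect the brackets from shell
# globbing).
pip install "pyspark[pandas_on_spark]"
```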

### Why are the changes needed?

The pandas-on-Spark installation is merged into the PySpark installation pages. As for the videos/blog posts: the project is now named pandas API on Spark, so the old Koalas blog posts and videos are obsolete.

### Does this PR introduce _any_ user-facing change?

No to end users, because the docs have not been released yet.

### How was this patch tested?

I manually built the docs and checked the output.

Closes #33018 from HyukjinKwon/SPARK-35645.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-22 09:36:27 -07:00
Hyukjin Kwon 95f36e76c6 [SPARK-35750][PYTHON][DOCS] Rename "pandas APIs on Spark" to "pandas API on Spark"
### What changes were proposed in this pull request?

This PR proposes to rename "pandas APIs on Spark" to "pandas API on Spark", which is more natural (since API stands for Application Programming Interface).

### Why are the changes needed?

To make it sound more natural.

### Does this PR introduce _any_ user-facing change?

It fixes a typo in the unreleased changes.

### How was this patch tested?

N/A

Closes #32903 from HyukjinKwon/SPARK-34885.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-15 10:01:04 +09:00
Takuya UESHIN ef7545b788 [SPARK-35759][PYTHON] Remove the upperbound for numpy for pandas-on-Spark
### What changes were proposed in this pull request?

Removes the upper bound on NumPy for pandas-on-Spark.

### Why are the changes needed?

We can remove the upper bound on NumPy for pandas-on-Spark because it currently works well on the CI with NumPy 1.20.3.
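
As an illustration, the CI-verified version can be installed directly; a hedged sketch, since only 1.20.3 is explicitly confirmed here:

```bash
# NumPy 1.20.3 is the version the CI verified after removing the
# upper bound; newer releases are expected to work as well.
pip install "numpy==1.20.3"
```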

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32908 from ueshin/issues/SPARK-35759/numpy.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-15 09:59:05 +09:00
Xinrong Meng 5ecb112410 [SPARK-35300][PYTHON][DOCS] Standardize module names in install.rst
### What changes were proposed in this pull request?

Use full names of modules in `install.rst` when specifying dependencies.
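
For context, these are the kinds of dependency extras `install.rst` covers; a sketch assuming the extras names defined in PySpark's `setup.py`:

```bash
# Each extras name maps to a module's optional dependencies.
pip install "pyspark[sql]"     # Spark SQL
pip install "pyspark[ml]"      # MLlib (DataFrame-based API)
pip install "pyspark[mllib]"   # MLlib (RDD-based API)
```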

### Why are the changes needed?

Using full names makes the docs clearer.
In addition, `pandas APIs on Spark`, as a new module, can start to be recognized by more people.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual verification.

Closes #32427 from xinrong-databricks/nameDoc.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 11:02:57 +09:00
Xinrong Meng 120c389b00 [SPARK-34887][PYTHON] Port Koalas dependencies into PySpark
### What changes were proposed in this pull request?

Ports the Koalas dependencies appropriately into the PySpark dependencies.

### Why are the changes needed?

pandas-on-Spark has its own required and optional dependencies.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #32386 from xinrong-databricks/portDeps.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 09:04:23 +09:00
Yikun Jiang 0769049ee1 [SPARK-34979][PYTHON][DOC] Add PyArrow installation note for PySpark aarch64 user
### What changes were proposed in this pull request?

This patch adds a note telling aarch64 users to install pyarrow>=4.0.0 specifically.
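
A minimal sketch of the note's instruction:

```bash
# On aarch64, PyArrow support starts at 4.0.0, so pin the lower bound.
pip install "pyarrow>=4.0.0"
```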

### Why are the changes needed?

PyArrow's aarch64 support was [introduced](https://github.com/apache/arrow/pull/9285) in [PyArrow 4.0.0](https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.0), which was published on Apr 27, 2021.

See more in [SPARK-34979](https://issues.apache.org/jira/browse/SPARK-34979).

### Does this PR introduce _any_ user-facing change?
Yes, this doc can help users install PyArrow on aarch64.

### How was this patch tested?
The doc test passed.

Closes #32363 from Yikun/SPARK-34979.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-28 09:56:17 +09:00
wankunde 60e324aa9f [SPARK-34688][PYTHON] Upgrade to Py4J 0.10.9.2
### What changes were proposed in this pull request?
This PR upgrades Py4J from 0.10.9.1 to 0.10.9.2, which contains some bug fixes and improvements.

* exposes the `shell` parameter in `Popen` inside `launch_gateway` ([bartdag/py4j220efc3](220efc3716))
* fixes Flake8 errors ([bartdag/py4j6c6ee9a](6c6ee9aedc))

### Why are the changes needed?
To leverage fixes from the upstream in Py4J.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Jenkins build and GitHub Actions will test it out.

Closes #31796 from wankunde/py4j.

Authored-by: wankunde <wankunde@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-11 09:51:41 -06:00
HyukjinKwon 329850c667 [SPARK-32017][PYTHON][FOLLOW-UP] Rename HADOOP_VERSION to PYSPARK_HADOOP_VERSION in pip installation option
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/29703.
It renames the `HADOOP_VERSION` environment variable to `PYSPARK_HADOOP_VERSION` in case `HADOOP_VERSION` is already being used elsewhere. Arguably `HADOOP_VERSION` is a pretty common name; I see it here and there:
- https://www.ibm.com/support/knowledgecenter/SSZUMP_7.2.1/install_grid_sym/understanding_advanced_edition.html
- https://cwiki.apache.org/confluence/display/ARROW/HDFS+Filesystem+Support
- http://crs4.github.io/pydoop/_pydoop1/installation.html
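
A hedged before/after illustration of the rename, based on the pip installation usage shown in the other commits here:

```bash
# Before this change:
HADOOP_VERSION=3.2 pip install pyspark

# After this change (prefixed to avoid clashing with other tools):
PYSPARK_HADOOP_VERSION=3.2 pip install pyspark
```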

### Why are the changes needed?

To avoid unexpected conflicts with the environment variable.

### Does this PR introduce _any_ user-facing change?

It renames the environment variable, but the variable has not been released yet.

### How was this patch tested?

Existing unit tests will cover it.

Closes #31028 from HyukjinKwon/SPARK-32017-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-05 17:21:32 +09:00
Josh Soref 13fd272cd3 Spelling r common dev mlib external project streaming resource managers python
### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules:
* `R`
* `common`
* `dev`
* `mlib`
* `external`
* `project`
* `streaming`
* `resource-managers`
* `python`

Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618

NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)

### Why are the changes needed?

Misspelled words make content harder to read and understand.

### Does this PR introduce _any_ user-facing change?

There are various fixes to documentation, etc.

### How was this patch tested?

No testing was performed.

Closes #30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-11-27 10:22:45 -06:00
HyukjinKwon 02d410a18c [MINOR][DOCS] Document 'without' value for HADOOP_VERSION in pip installation
### What changes were proposed in this pull request?

I believe it's self-descriptive.
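
A minimal sketch of the documented value, assuming the usage from the SPARK-32017 commit below:

```bash
# 'without' selects the Hadoop-free ("without Hadoop") Spark
# distribution when installing PySpark from pip.
HADOOP_VERSION=without pip install pyspark
```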

### Why are the changes needed?

To document supported features.

### Does this PR introduce _any_ user-facing change?

Yes, the docs are updated. It's master only.

### How was this patch tested?

Manually built the docs via `cd python/docs` and `make clean html`:

![Screen Shot 2020-11-20 at 10 59 07 AM](https://user-images.githubusercontent.com/6477701/99748225-7ad9b280-2b1f-11eb-86fd-165012b1bb7c.png)

Closes #30436 from HyukjinKwon/minor-doc-fix.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-20 13:14:20 +09:00
HyukjinKwon 688d016c7a [SPARK-32982][BUILD] Remove hive-1.2 profiles in PIP installation option
### What changes were proposed in this pull request?

This PR removes the Hive 1.2 option (and therefore the `HIVE_VERSION` environment variable as well).
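
A hedged sketch of the effect on the pip installation options, consistent with the test commands below:

```bash
# No longer supported after this PR:
#   HIVE_VERSION=1.2 pip install pyspark
# Still supported:
HADOOP_VERSION=3.2 pip install pyspark
```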

### Why are the changes needed?

Hive 1.2 is a forked version. We shouldn't encourage users to use it.

### Does this PR introduce _any_ user-facing change?

No; `HIVE_VERSION` and Hive 1.2 are removed, but this is a new experimental feature in master only.

### How was this patch tested?

Manually tested:

```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz -v
SPARK_VERSION=3.0.1 HADOOP_VERSION=2.7 pip install pyspark-3.1.0.dev0.tar.gz -v
SPARK_VERSION=3.0.1 HADOOP_VERSION=invalid pip install pyspark-3.1.0.dev0.tar.gz -v
```

Closes #29858 from HyukjinKwon/SPARK-32981.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-24 14:49:58 +09:00
HyukjinKwon 942f577b6e [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI
### What changes were proposed in this pull request?

This PR proposes to add a way to select Hadoop and Hive versions in pip installation.
Users can select Hive or Hadoop versions as below:

```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```

When the environment variables are set, the installation internally downloads the corresponding Spark version and then sets the Spark home to it. This PR also exposes a mirror that can be set via an environment variable, `PYSPARK_RELEASE_MIRROR`.
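
A sketch of the mirror variable in use; the mirror URL below is a placeholder, not an officially documented endpoint:

```bash
# Download the Spark distribution from a custom mirror during the
# pip installation (hypothetical mirror URL).
PYSPARK_RELEASE_MIRROR=https://mirror.example.org/spark \
  HADOOP_VERSION=3.2 pip install pyspark
```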

**Please NOTE that:**
- We cannot currently leverage pip's native installation option, for example:

    ```bash
    pip install pyspark --install-option="hadoop3.2"
    ```

    This is because of a limitation and bug in pip itself. Once they fix this issue, we can switch from the environment variables to the proper installation options, see SPARK-32837.

    It IS possible to work around, but it is very ugly or hacky and requires a big change. See [this PR](https://github.com/microsoft/nni/pull/139/files) as an example.

- In the pip installation, we pack the relevant jars together. This PR _does not touch the existing packaging way_ in order to prevent any behaviour changes.

  Once this experimental way is proven to be safe, we can avoid packing the relevant jars together (and keep only the relevant Python scripts), and download the Spark distribution instead, as this PR proposes.

- This way is sort of consistent with SparkR:

  SparkR provides a method `SparkR::install.spark` to support CRAN installation. This is fine because SparkR is provided purely as an R library. For example, the `sparkr` script is not packed together.

  PySpark cannot take this approach because the PySpark packaging ships the relevant executable scripts together, e.g., the `pyspark` shell.

  If PySpark had a method such as `pyspark.install_spark`, users could not call it in `pyspark` because `pyspark` already assumes the relevant Spark is installed, the JVM is launched, etc.

- There looks to be no way to publish a release that contains a different Hadoop or Hive to PyPI due to [the version semantics](https://www.python.org/dev/peps/pep-0440/), so this is not an option.

  Given my investigation, the usual way is either `--install-option` above with hacks, or environment variables.

### Why are the changes needed?

To provide users the options to select Hadoop and Hive versions.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to select Hive and Hadoop version as below when they install it from `pip`;

```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```

### How was this patch tested?

Unit tests were added. I also manually tested on Mac and Windows (after building Spark to produce `python/dist/pyspark-3.1.0.dev0.tar.gz`):

```bash
./build/mvn -DskipTests -Phive-thriftserver clean package
```

Mac:

```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz
```

Windows:

```bash
set HADOOP_VERSION=3.2
set SPARK_VERSION=3.0.1
pip install pyspark-3.1.0.dev0.tar.gz
```

Closes #29703 from HyukjinKwon/SPARK-32017.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-23 09:30:51 +09:00
Takuya UESHIN 5440ea84ee [SPARK-32312][DOC][FOLLOWUP] Fix the minimum version of PyArrow in the installation guide
### What changes were proposed in this pull request?

Now that the minimum version of PyArrow is `1.0.0`, we should update the version in the installation guide.
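
So the installation guide's constraint becomes, as a minimal sketch:

```bash
# PyArrow at or above the newly documented minimum.
pip install "pyarrow>=1.0.0"
```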

### Why are the changes needed?

The minimum version of PyArrow was upgraded to `1.0.0`.

### Does this PR introduce _any_ user-facing change?

Yes; users see the correct minimum version in the installation guide.

### How was this patch tested?

N/A

Closes #29829 from ueshin/issues/SPARK-32312/doc.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-22 11:04:14 +09:00
HyukjinKwon f893a19c4c [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide
### What changes were proposed in this pull request?

This PR:
- rephrases some wordings in the installation guide to avoid terms that can be potentially ambiguous, such as "different flavors"
- documents the extra dependency installation `pip install pyspark[sql]` (see the sketch after this list)
- uses the link that corresponds to the released version, e.g. https://spark.apache.org/docs/latest/building-spark.html vs. https://spark.apache.org/docs/3.0.0/building-spark.html
- adds some more details
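
The extra-dependency installation mentioned in the list above, as a hedged sketch:

```bash
# Pulls in the optional packages used by Spark SQL features
# (e.g. pandas and PyArrow for Arrow-based conversions).
pip install "pyspark[sql]"
```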

I built it on Read the Docs to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/getting_started/install.html

### Why are the changes needed?

To improve installation guide.

### Does this PR introduce _any_ user-facing change?

Yes, it updates the user-facing installation guide.

### How was this patch tested?

Manually built the doc and tested.

Closes #29779 from HyukjinKwon/SPARK-32180.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-20 10:58:17 +09:00