Commit graph

17 commits

Author SHA1 Message Date
Xinrong Meng 120c389b00 [SPARK-34887][PYTHON] Port Koalas dependencies into PySpark
### What changes were proposed in this pull request?

Port the Koalas dependencies into PySpark's dependencies appropriately.

### Why are the changes needed?

pandas-on-Spark has its own required and optional dependencies, which need to be reflected in PySpark's packaging.
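
For illustration, a sketch of the kind of dependencies being ported (the package names come from Koalas; the version bounds here are assumptions):

```bash
# Required dependency of the pandas-on-Spark API (bound is illustrative)
pip install "pandas>=0.23.2"
# Optional dependency for Arrow-based conversions (bound is illustrative)
pip install "pyarrow>=1.0.0"
```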

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #32386 from xinrong-databricks/portDeps.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 09:04:23 +09:00
Yikun Jiang 0769049ee1 [SPARK-34979][PYTHON][DOC] Add PyArrow installation note for PySpark aarch64 user
### What changes were proposed in this pull request?

This patch adds a note for aarch64 users to install pyarrow>=4.0.0 specifically.

### Why are the changes needed?

PyArrow aarch64 support was [introduced](https://github.com/apache/arrow/pull/9285) in [PyArrow 4.0.0](https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.0), which was published on 27 April 2021.

See more in [SPARK-34979](https://issues.apache.org/jira/browse/SPARK-34979).
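
A minimal sketch of what the added note instructs aarch64 users to do:

```bash
# aarch64 wheels are only published from PyArrow 4.0.0 onwards
pip install "pyarrow>=4.0.0"
```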

### Does this PR introduce _any_ user-facing change?
Yes, this doc helps users install PyArrow on aarch64.

### How was this patch tested?
The documentation build passed.

Closes #32363 from Yikun/SPARK-34979.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-28 09:56:17 +09:00
wankunde 60e324aa9f [SPARK-34688][PYTHON] Upgrade to Py4J 0.10.9.2
### What changes were proposed in this pull request?
This PR upgrades Py4J from 0.10.9.1 to 0.10.9.2, which contains some bug fixes and improvements:

* Expose the `shell` parameter in `Popen` inside `launch_gateway` ([bartdag/py4j220efc3](220efc3716))
* Fix Flake8 errors ([bartdag/py4j6c6ee9a](6c6ee9aedc))
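
A quick way to confirm which Py4J version a PySpark environment actually picked up (not part of this PR; shown only as a hedged sketch):

```bash
# Should report "Version: 0.10.9.2" after this upgrade
pip show py4j
```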

### Why are the changes needed?
To leverage upstream fixes in Py4J.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Jenkins build and GitHub Actions will test it out.

Closes #31796 from wankunde/py4j.

Authored-by: wankunde <wankunde@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-11 09:51:41 -06:00
HyukjinKwon b5470ae294 [MINOR][DOCS] Replace http to https when possible in PySpark documentation
### What changes were proposed in this pull request?

This PR proposes:
- Change http to https for better security
- Change http://apache-spark-developers-list.1001551.n3.nabble.com/ to official mailing list link (https://mail-archives.apache.org/mod_mbox/spark-dev/)
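
As an illustrative sketch only (not necessarily the command used), remaining plain-HTTP links can be located with:

```bash
# List every remaining http:// occurrence in the PySpark docs sources
grep -rn "http://" python/docs/source --include="*.rst"
```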

### Why are the changes needed?

For better security, and to use the official link.

### Does this PR introduce _any_ user-facing change?

Yes, it exposes more secure and correct links to the PySpark end users in the PySpark documentation.

### How was this patch tested?

I manually checked that each link works.

Closes #31616 from HyukjinKwon/minor-https.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-23 11:18:47 +09:00
HyukjinKwon aa388cf3d0 [SPARK-34041][PYTHON][DOCS] Miscellaneous cleanup for new PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to:
- Add a link of quick start in PySpark docs into "Programming Guides" in Spark main docs
- `ML` / `MLlib` -> `MLlib (DataFrame-based)` / `MLlib (RDD-based)` in the API reference page
- Mention other user guides as well, because there are other guides such as [ML](http://spark.apache.org/docs/latest/ml-guide.html) and [SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html).
- Mention other migration guides as well, because PySpark can be affected by them.

### Why are the changes needed?

For better documentation.

### Does this PR introduce _any_ user-facing change?

It fixes user-facing docs; however, they have not been released yet.

### How was this patch tested?

Manually tested by running:

```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```

Closes #31082 from HyukjinKwon/SPARK-34041.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-08 09:28:31 +09:00
HyukjinKwon 329850c667 [SPARK-32017][PYTHON][FOLLOW-UP] Rename HADOOP_VERSION to PYSPARK_HADOOP_VERSION in pip installation option
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/29703.
It renames the `HADOOP_VERSION` environment variable to `PYSPARK_HADOOP_VERSION` in case `HADOOP_VERSION` is already being used somewhere. Arguably `HADOOP_VERSION` is a pretty common name; I see it here and there (an illustrative before/after follows this list):
- https://www.ibm.com/support/knowledgecenter/SSZUMP_7.2.1/install_grid_sym/understanding_advanced_edition.html
- https://cwiki.apache.org/confluence/display/ARROW/HDFS+Filesystem+Support
- http://crs4.github.io/pydoop/_pydoop1/installation.html
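
An illustrative before/after of the rename (the version value is just an example):

```bash
# Before this PR
HADOOP_VERSION=3.2 pip install pyspark
# After this PR
PYSPARK_HADOOP_VERSION=3.2 pip install pyspark
```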

### Why are the changes needed?

To avoid unexpected conflicts with existing environment variables.

### Does this PR introduce _any_ user-facing change?

It renames the environment variable, but the feature has not been released yet.

### How was this patch tested?

Existing unit tests will cover this.

Closes #31028 from HyukjinKwon/SPARK-32017-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-05 17:21:32 +09:00
Josh Soref 13fd272cd3 Spelling r common dev mlib external project streaming resource managers python
### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules:
* `R`
* `common`
* `dev`
* `mlib`
* `external`
* `project`
* `streaming`
* `resource-managers`
* `python`

Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618

NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)

### Why are the changes needed?

Misspelled words make it harder to read / understand content.

### Does this PR introduce _any_ user-facing change?

There are various fixes to documentation, etc...

### How was this patch tested?

No testing was performed.

Closes #30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-11-27 10:22:45 -06:00
HyukjinKwon 02d410a18c [MINOR][DOCS] Document 'without' value for HADOOP_VERSION in pip installation
### What changes were proposed in this pull request?

This PR documents the `without` value for `HADOOP_VERSION` in the pip installation guide. I believe it's self-descriptive.
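
A sketch of the newly documented value (the package file name is illustrative):

```bash
# Install PySpark without a bundled Hadoop; bring your own Hadoop classpath
HADOOP_VERSION=without pip install pyspark-3.1.0.dev0.tar.gz
```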

### Why are the changes needed?

To document supported features.

### Does this PR introduce _any_ user-facing change?

Yes, the docs are updated. It's master only.

### How was this patch tested?

Manually built the docs via `cd python/docs` and `make clean html`:

![Screen Shot 2020-11-20 at 10 59 07 AM](https://user-images.githubusercontent.com/6477701/99748225-7ad9b280-2b1f-11eb-86fd-165012b1bb7c.png)

Closes #30436 from HyukjinKwon/minor-doc-fix.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-20 13:14:20 +09:00
HyukjinKwon 688d016c7a [SPARK-32982][BUILD] Remove hive-1.2 profiles in PIP installation option
### What changes were proposed in this pull request?

This PR removes Hive 1.2 option (and therefore `HIVE_VERSION` environment variable as well).

### Why are the changes needed?

Hive 1.2 is a forked version. We shouldn't encourage users to use it.

### Does this PR introduce _any_ user-facing change?

Nope. `HIVE_VERSION` and Hive 1.2 support are removed, but this is a new experimental feature in master only.

### How was this patch tested?

Manually tested:

```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz -v
SPARK_VERSION=3.0.1 HADOOP_VERSION=2.7 pip install pyspark-3.1.0.dev0.tar.gz -v
SPARK_VERSION=3.0.1 HADOOP_VERSION=invalid pip install pyspark-3.1.0.dev0.tar.gz -v
```

Closes #29858 from HyukjinKwon/SPARK-32981.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-24 14:49:58 +09:00
HyukjinKwon 942f577b6e [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI
### What changes were proposed in this pull request?

This PR proposes to add a way to select Hadoop and Hive versions in pip installation.
Users can select Hive or Hadoop versions as below:

```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```

When these environment variables are set, the installation internally downloads the corresponding Spark version and then sets the Spark home to it. This PR also exposes an environment variable, `PYSPARK_RELEASE_MIRROR`, to set a download mirror.
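
A sketch of combining the mirror variable with version selection (the mirror URL is a placeholder, not a real mirror):

```bash
# Download the underlying Spark distribution from a custom mirror
PYSPARK_RELEASE_MIRROR=https://mirror.example.org/spark HADOOP_VERSION=2.7 pip install pyspark
```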

**Please NOTE that:**
- We cannot currently leverage pip's native installation option, for example:

    ```bash
    pip install pyspark --install-option="hadoop3.2"
    ```

    This is because of a limitation and bug in pip itself. Once they fix this issue, we can switch from the environment variables to the proper installation options; see SPARK-32837.

    It IS possible to work around this, but it would be very ugly and hacky, requiring a big change. See [this PR](https://github.com/microsoft/nni/pull/139/files) as an example.

- In pip installation, we pack the relevant jars together. This PR _does not touch the existing packaging way_ in order to prevent any behaviour changes.

  Once this experimental way is proven to be safe, we can avoid packing the relevant jars together (and keep only the relevant Python scripts), and download the Spark distribution as this PR proposes.

- This way is sort of consistent with SparkR:

  SparkR provides a method `SparkR::install.spark` to support CRAN installation. This is fine because SparkR is provided purely as an R library. For example, the `sparkr` script is not packed together.

  PySpark cannot take this approach because the PySpark packaging ships the relevant executable scripts together, e.g., the `pyspark` shell.

  If PySpark had a method such as `pyspark.install_spark`, users could not call it in `pyspark` because `pyspark` already assumes the relevant Spark is installed, the JVM is launched, etc.

- There looks to be no way to publish releases containing a different Hadoop or Hive to PyPI due to [the version semantics](https://www.python.org/dev/peps/pep-0440/). This is not an option.

  Given my investigation, the usual way is either `--install-option` above with hacks, or environment variables.

### Why are the changes needed?

To provide users the options to select Hadoop and Hive versions.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to select the Hive and Hadoop versions as below when they install it from `pip`:

```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```

### How was this patch tested?

Unit tests were added. I also manually tested on Mac and Windows (after building Spark and using `python/dist/pyspark-3.1.0.dev0.tar.gz`):

```bash
./build/mvn -DskipTests -Phive-thriftserver clean package
```

Mac:

```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz
```

Windows:

```bash
set HADOOP_VERSION=3.2
set SPARK_VERSION=3.0.1
pip install pyspark-3.1.0.dev0.tar.gz
```

Closes #29703 from HyukjinKwon/SPARK-32017.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-23 09:30:51 +09:00
Takuya UESHIN 5440ea84ee [SPARK-32312][DOC][FOLLOWUP] Fix the minimum version of PyArrow in the installation guide
### What changes were proposed in this pull request?

Now that the minimum version of PyArrow is `1.0.0`, we should update the version in the installation guide.
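
What the corrected guide implies for users (a minimal sketch):

```bash
# The guide should now state 1.0.0 as the minimum PyArrow version
pip install "pyarrow>=1.0.0"
```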

### Why are the changes needed?

The minimum version of PyArrow was upgraded to `1.0.0`.

### Does this PR introduce _any_ user-facing change?

Users see the correct minimum version in the installation guide.

### How was this patch tested?

N/A

Closes #29829 from ueshin/issues/SPARK-32312/doc.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-22 11:04:14 +09:00
HyukjinKwon f893a19c4c [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide
### What changes were proposed in this pull request?

This PR:
- rephrases some wording in the installation guide to avoid terms that can be potentially ambiguous, such as "different flavors"
- documents the extra dependency installation `pip install pyspark[sql]` (see the sketch after this list)
- uses the links that correspond to the released version, e.g., https://spark.apache.org/docs/latest/building-spark.html vs. https://spark.apache.org/docs/3.0.0/building-spark.html
- adds some more details
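
A sketch of the extra-dependency installation now documented (quoting guards against shell globbing in some shells; the extra name is taken from the PR description above):

```bash
# Install PySpark together with the optional dependencies for Spark SQL features
pip install "pyspark[sql]"
```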

I built it on Read the Docs to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/getting_started/install.html

### Why are the changes needed?

To improve installation guide.

### Does this PR introduce _any_ user-facing change?

Yes, it updates the user-facing installation guide.

### How was this patch tested?

Manually built the doc and tested.

Closes #29779 from HyukjinKwon/SPARK-32180.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-20 10:58:17 +09:00
Sean Owen ce566bed17 [SPARK-32180][FOLLOWUP] Fix .rst error in new Pyspark installation guide
This simply fixes an .rst generation error in https://github.com/apache/spark/pull/29640

Closes #29735 from srowen/SPARK-32180.2.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2020-09-11 20:08:22 -07:00
Rohit.Mishra f6322d1cb1 [SPARK-32180][PYTHON][DOCS] Installation page of Getting Started in PySpark documentation
### What changes were proposed in this pull request?
This PR proposes to add the Getting Started "Installation" page to the new PySpark docs.

### Why are the changes needed?
Better documentation.

### Does this PR introduce _any_ user-facing change?
No. Documentation only.

### How was this patch tested?
Generating documents locally.

Closes #29640 from rohitmishr1484/SPARK-32180-Getting-Started-Installation.

Authored-by: Rohit.Mishra <rohit.mishra@utopusinsights.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-11 10:38:01 -05:00
HyukjinKwon eaaf783148 [MINOR][DOCS] Fix the Binder link to point the quickstart notebook correctly
### What changes were proposed in this pull request?

This PR fixes the link of Binder in Quickstart notebook and documentation.

From:

https://mybinder.org/v2/gh/databricks/apache/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb

To:

https://mybinder.org/v2/gh/apache/spark/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb

This link is the same as the one in RST files:

b54103016a/python/docs/source/conf.py (L57)

### Why are the changes needed?

The link was wrong and pointed to a non-existent file and repo.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the link so users can correctly try Binder.

### How was this patch tested?

Manually tested by building the documentation.

Closes #29597 from HyukjinKwon/minor-link-quickstart.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-31 22:00:56 +09:00
HyukjinKwon b54103016a [SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to:
- add a notebook with a Binder integration which allows users to try PySpark in a live notebook. Please [try this here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- reuse this notebook as a quickstart guide in PySpark documentation.

Note that Binder turns a Git repo into a collection of interactive notebooks. It works based on a Docker image; once somebody builds it, other people can reuse the image against a specific commit.
Therefore, if we run Binder with images based on released tags in Spark, virtually all users can instantly launch the Jupyter notebooks.

<br/>

I made a simple demo to make it easier to review. Please see:
- [Main page](https://hyukjin-spark.readthedocs.io/en/stable/). Note that the link ("Live Notebook") in the main page wouldn't work since this PR is not merged yet.
- [Quickstart page](https://hyukjin-spark.readthedocs.io/en/stable/getting_started/quickstart.html)

<br/>

When reviewing the notebook file itself, please give me direct feedback, which I will appreciate and address.
Another way might be:
- open [here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- edit / change / update the notebook. Please feel free to change whatever you want. I can apply it as-is, or update it slightly more, when I apply it to this PR.
- download it as a `.ipynb` file:
    ![Screen Shot 2020-08-20 at 10 12 19 PM](https://user-images.githubusercontent.com/6477701/90774311-3e38c800-e332-11ea-8476-699a653984db.png)
- upload the `.ipynb` file here in a GitHub comment. Then, I will push a commit with that file, with correct crediting, of course.
- alternatively, push a commit into this PR right away if that's easier for you (if you're a committer).

References:
- https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
- https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html - my own blog post .. :-) and https://koalas.readthedocs.io/en/latest/getting_started/10min.html

### Why are the changes needed?

To improve PySpark's usability. The current quickstart for Python users is not very friendly.

### Does this PR introduce _any_ user-facing change?

Yes, it will add a documentation page, and expose a live notebook to PySpark users.

### How was this patch tested?

Manually tested, and GitHub Actions builds will test.

Closes #29491 from HyukjinKwon/SPARK-32204.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-26 12:23:24 +09:00
HyukjinKwon 6ab29b37cf [SPARK-32179][SPARK-32188][PYTHON][DOCS] Replace and redesign the documentation base
### What changes were proposed in this pull request?

This PR proposes to redesign the PySpark documentation.

I made a demo site to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/reference/index.html.

Here is the initial draft for the final PySpark docs shape: https://hyukjin-spark.readthedocs.io/en/latest/index.html.

In more detail, this PR proposes:
1. Use the [pydata_sphinx_theme](https://github.com/pandas-dev/pydata-sphinx-theme) theme; [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/) use this theme. The CSS overrides are ported from Koalas. The colours in the CSS were actually chosen by designers for use in Spark.
2. Use the Sphinx option to separate the `source` and `build` directories, as the documentation pages will likely grow (see the build sketch after this list).
3. Port current API documentation into the new style. It mimics Koalas and pandas to use the theme most effectively.

    One disadvantage of this approach is that you have to list APIs or classes manually; however, I think this isn't a big issue in PySpark since we're being conservative about adding APIs. I also intentionally listed classes only, instead of functions, in ML and MLlib to make it relatively easier to manage.
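
A minimal sketch of a build using the separated layout from item 2 (directory names follow the common Sphinx convention; the project's actual Makefile targets may differ):

```bash
cd python/docs
# Read sources from ./source and write HTML output under ./build
sphinx-build -b html source build/html
```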

### Why are the changes needed?

Often I hear complaints from users that the current PySpark documentation is pretty messy to read (https://spark.apache.org/docs/latest/api/python/index.html) compared to other projects such as [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/).

It would be nicer if we could make it more organised, instead of just listing all classes, methods and attributes, to make it easier to navigate.

Also, the documentation has been there from almost the very first version of PySpark. Maybe it's time to update it.

### Does this PR introduce _any_ user-facing change?

Yes, PySpark API documentation will be redesigned.

### How was this patch tested?

Manually tested, and the demo site was made to show.

Closes #29188 from HyukjinKwon/SPARK-32179.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-27 17:49:21 +09:00