Commit graph

92 commits

Author SHA1 Message Date
itholic 4e1ded67f8 [SPARK-32189][DOCS][PYTHON][FOLLOW-UP] Fixed broken link and typo in PySpark docs
### What changes were proposed in this pull request?

This PR is a follow-up of #29781 to fix a broken link and a typo.

<img width="638" alt="Screen Shot 2020-10-07 at 3 56 28 PM" src="https://user-images.githubusercontent.com/44108233/95297583-aa0ccb00-08b5-11eb-85db-89022c76d7e1.png">

<img width="734" alt="Screen Shot 2020-10-07 at 3 55 36 PM" src="https://user-images.githubusercontent.com/44108233/95297508-8ba6cf80-08b5-11eb-9caa-0b52a2482ada.png">

### Why are the changes needed?

The current link is not working properly because of a wrong path.

### Does this PR introduce _any_ user-facing change?

Yes, the link is working properly now.

### How was this patch tested?

Manually built the doc.

Closes #29963 from itholic/SPARK-32189-FOLLOWUP.

Authored-by: itholic <haejoon309@naver.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-07 16:39:25 +09:00
HyukjinKwon 5ce321dc80 [SPARK-33017][PYTHON][DOCS][FOLLOW-UP] Add getCheckpointDir into API documentation
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/29918. We should add it to the API documentation as well.

### Why are the changes needed?

To show users new APIs.

### Does this PR introduce _any_ user-facing change?

Yes, `SparkContext.getCheckpointDir` will be documented.
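
For context, a minimal sketch of the newly documented API in use (the directory path is illustrative):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
sc.setCheckpointDir("/tmp/spark-checkpoints")
# getCheckpointDir returns the checkpoint directory, or None if none is set.
print(sc.getCheckpointDir())
```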

### How was this patch tested?

Manually built the PySpark documentation:

```bash
cd python/docs
make clean html
cd build/html
open index.html
```

Closes #29960 from HyukjinKwon/SPARK-33017.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-07 13:00:59 +09:00
HyukjinKwon 6868b40517 [SPARK-33020][PYTHON] Add nth_value as a PySpark function
### What changes were proposed in this pull request?

`nth_value` was added at SPARK-27951. This PR adds the corresponding PySpark API.
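
For illustration, a minimal sketch of the new function over a window (the toy DataFrame is made up for this example):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])
w = Window.partitionBy("key").orderBy("value")
# nth_value returns the value of the nth row (1-based) within the window frame.
df.withColumn("second", F.nth_value("value", 2).over(w)).show()
```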

### Why are the changes needed?

To keep the APIs consistent with the other languages.

### Does this PR introduce _any_ user-facing change?

Yes, it introduces a new PySpark function API.

### How was this patch tested?

A unit test was added.

Closes #29899 from HyukjinKwon/SPARK-33020.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-28 22:14:28 -07:00
Fabian Höring a7f84a0b45 [SPARK-32187][PYTHON][DOCS] Doc on Python packaging
### What changes were proposed in this pull request?

This PR proposes to document PySpark specific packaging guidelines.

### Why are the changes needed?

To have a single place for PySpark users, and better documentation.

### Does this PR introduce _any_ user-facing change?

Yes

### How was this patch tested?

```
cd python/docs
make clean html
```

Closes #29806 from fhoering/add_doc_python_packaging.

Lead-authored-by: Fabian Höring <f.horing@criteo.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-28 12:30:28 +09:00
HyukjinKwon 688d016c7a [SPARK-32982][BUILD] Remove hive-1.2 profiles in PIP installation option
### What changes were proposed in this pull request?

This PR removes Hive 1.2 option (and therefore `HIVE_VERSION` environment variable as well).

### Why are the changes needed?

Hive 1.2 is a fork version. We shouldn't encourage users to use it.

### Does this PR introduce _any_ user-facing change?

Nope, `HIVE_VERSION` and Hive 1.2 are removed, but this is a new experimental feature in master only.

### How was this patch tested?

Manually tested:

```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz -v
SPARK_VERSION=3.0.1 HADOOP_VERSION=2.7 pip install pyspark-3.1.0.dev0.tar.gz -v
SPARK_VERSION=3.0.1 HADOOP_VERSION=invalid pip install pyspark-3.1.0.dev0.tar.gz -v
```

Closes #29858 from HyukjinKwon/SPARK-32981.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-24 14:49:58 +09:00
HyukjinKwon 942f577b6e [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI
### What changes were proposed in this pull request?

This PR proposes to add a way to select Hadoop and Hive versions in pip installation.
Users can select Hive or Hadoop versions as below:

```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```

When the environment variables are set, internally it downloads the corresponding Spark version and then sets the Spark home to it. Also this PR exposes a mirror to set as an environment variable, `PYSPARK_RELEASE_MIRROR`.
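
As a rough, hypothetical sketch of that selection logic (the names, defaults, and URL layout here are assumptions for illustration, not the actual `setup.py` code):

```python
import os

spark_version = "3.0.1"
hadoop_version = os.environ.get("HADOOP_VERSION", "2.7")
mirror = os.environ.get("PYSPARK_RELEASE_MIRROR",
                        "https://archive.apache.org/dist/spark")
# The distribution that would be downloaded and used as the Spark home.
url = (f"{mirror}/spark-{spark_version}/"
       f"spark-{spark_version}-bin-hadoop{hadoop_version}.tgz")
print(url)
```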

**Please NOTE that:**
- We cannot currently leverage pip's native installation option, for example:

    ```bash
    pip install pyspark --install-option="hadoop3.2"
    ```

    This is because of a limitation and bug in pip itself. Once they fix this issue, we can switch from the environment variables to the proper installation options, see SPARK-32837.

    It IS possible to work around this, but only in a very ugly or hacky way with a big change. See [this PR](https://github.com/microsoft/nni/pull/139/files) as an example.

- In the pip installation, we pack the relevant jars together. This PR _does not touch the existing packaging way_ in order to prevent any behaviour changes.

  Once this experimental way is proven to be safe, we can avoid packing the relevant jars together (and keep only the relevant Python scripts), and download the Spark distribution instead, as this PR proposes.

- This way is sort of consistent with SparkR:

  SparkR provides a method `SparkR::install.spark` to support CRAN installation. This is fine because SparkR is provided purely as an R library. For example, the `sparkr` script is not packed together.

  PySpark cannot take this approach because PySpark packaging ships the relevant executable scripts together, e.g. the `pyspark` shell.

  If PySpark had a method such as `pyspark.install_spark`, users could not call it in `pyspark` because `pyspark` already assumes the relevant Spark is installed, the JVM is launched, etc.

- There looks to be no way to publish releases that contain different Hadoop or Hive versions to PyPI due to [the version semantics](https://www.python.org/dev/peps/pep-0440/). This is not an option.

  Given my investigation, the usual way looks to be either the `--install-option` approach above with hacks, or environment variables.

### Why are the changes needed?

To provide users the options to select Hadoop and Hive versions.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to select the Hive and Hadoop versions as below when they install it from `pip`:

```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```

### How was this patch tested?

Unit tests were added. I also manually tested on Mac and Windows (after building Spark and producing `python/dist/pyspark-3.1.0.dev0.tar.gz`):

```bash
./build/mvn -DskipTests -Phive-thriftserver clean package
```

Mac:

```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz
```

Windows:

```bash
set HADOOP_VERSION=3.2
set SPARK_VERSION=3.0.1
pip install pyspark-3.1.0.dev0.tar.gz
```

Closes #29703 from HyukjinKwon/SPARK-32017.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-23 09:30:51 +09:00
Takuya UESHIN 5440ea84ee [SPARK-32312][DOC][FOLLOWUP] Fix the minimum version of PyArrow in the installation guide
### What changes were proposed in this pull request?

Now that the minimum version of PyArrow is `1.0.0`, we should update the version in the installation guide.

### Why are the changes needed?

The minimum version of PyArrow was upgraded to `1.0.0`.

### Does this PR introduce _any_ user-facing change?

Users see the correct minimum version in the installation guide.

### How was this patch tested?

N/A

Closes #29829 from ueshin/issues/SPARK-32312/doc.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-22 11:04:14 +09:00
itholic 9c653c957f [SPARK-32189][DOCS][PYTHON] Development - Setting up IDEs
### What changes were proposed in this pull request?

This PR proposes to document the way of setting up IDEs.

![Screenshot 2020-09-21 at 10 43 12 AM](https://user-images.githubusercontent.com/44108233/93727715-5c2a6e80-fbf7-11ea-821b-555723b00bc8.png)
![Screenshot 2020-09-21 at 10 43 45 AM](https://user-images.githubusercontent.com/44108233/93727716-5f255f00-fbf7-11ea-9c6c-7b8a973bc511.png)

### Why are the changes needed?

To let users know how to set up IDEs.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new page in the documentation about setting up IDEs.

### How was this patch tested?

Manually built the doc.

Closes #29781 from itholic/SPARK-32189.

Authored-by: itholic <haejoon309@naver.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-21 12:29:17 +09:00
HyukjinKwon f893a19c4c [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide
### What changes were proposed in this pull request?

This PR:
- rephrases some wording in the installation guide to avoid potentially ambiguous terms such as "different flavors"
- documents extra dependency installation `pip install pyspark[sql]`
- uses links that correspond to the released version, e.g. https://spark.apache.org/docs/latest/building-spark.html vs. https://spark.apache.org/docs/3.0.0/building-spark.html
- adds some more details

I built it on Read the Docs to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/getting_started/install.html

### Why are the changes needed?

To improve the installation guide.

### Does this PR introduce _any_ user-facing change?

Yes, it updates the user-facing installation guide.

### How was this patch tested?

Manually built the doc and tested.

Closes #29779 from HyukjinKwon/SPARK-32180.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-20 10:58:17 +09:00
Sean Owen ce566bed17 [SPARK-32180][FOLLOWUP] Fix .rst error in new Pyspark installation guide
This simply fixes an .rst generation error in https://github.com/apache/spark/pull/29640

Closes #29735 from srowen/SPARK-32180.2.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2020-09-11 20:08:22 -07:00
Rohit.Mishra f6322d1cb1 [SPARK-32180][PYTHON][DOCS] Installation page of Getting Started in PySpark documentation
### What changes were proposed in this pull request?
This PR proposes to add the Getting Started - Installation page to the new PySpark docs.

### Why are the changes needed?
Better documentation.

### Does this PR introduce _any_ user-facing change?
No. Documentation only.

### How was this patch tested?
Generated the documentation locally.

Closes #29640 from rohitmishr1484/SPARK-32180-Getting-Started-Installation.

Authored-by: Rohit.Mishra <rohit.mishra@utopusinsights.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-11 10:38:01 -05:00
Bryan Cutler e0538bd38c [SPARK-32312][SQL][PYTHON][TEST-JAVA11] Upgrade Apache Arrow to version 1.0.1
### What changes were proposed in this pull request?

Upgrade Apache Arrow to version 1.0.1 for the Java dependency and increase minimum version of PyArrow to 1.0.0.

This release marks a transition to binary stability of the columnar format (which was already informally backward-compatible going back to December 2017) and a transition to Semantic Versioning for the Arrow software libraries. Also note that the Java arrow-memory artifact has been split to separate dependence on netty-buffer and allow users to select an allocator. Spark will continue to use `arrow-memory-netty` to maintain performance benefits.

Versions 1.0.0 - 1.0.1 include the following selected fixes/improvements relevant to Spark users:

ARROW-9300 - [Java] Separate Netty Memory to its own module
ARROW-9272 - [C++][Python] Reduce complexity in python to arrow conversion
ARROW-9016 - [Java] Remove direct references to Netty/Unsafe Allocators
ARROW-8664 - [Java] Add skip null check to all Vector types
ARROW-8485 - [Integration][Java] Implement extension types integration
ARROW-8434 - [C++] Ipc RecordBatchFileReader deserializes the Schema multiple times
ARROW-8314 - [Python] Provide a method to select a subset of columns of a Table
ARROW-8230 - [Java] Move Netty memory manager into a separate module
ARROW-8229 - [Java] Move ArrowBuf into the Arrow package
ARROW-7955 - [Java] Support large buffer for file/stream IPC
ARROW-7831 - [Java] unnecessary buffer allocation when calling splitAndTransferTo on variable width vectors
ARROW-6111 - [Java] Support LargeVarChar and LargeBinary types and add integration test with C++
ARROW-6110 - [Java] Support LargeList Type and add integration test with C++
ARROW-5760 - [C++] Optimize Take implementation
ARROW-300 - [Format] Add body buffer compression option to IPC message protocol using LZ4 or ZSTD
ARROW-9098 - RecordBatch::ToStructArray cannot handle record batches with 0 column
ARROW-9066 - [Python] Raise correct error in isnull()
ARROW-9223 - [Python] Fix to_pandas() export for timestamps within structs
ARROW-9195 - [Java] Wrong usage of Unsafe.get from bytearray in ByteFunctionsHelper class
ARROW-7610 - [Java] Finish support for 64 bit int allocations
ARROW-8115 - [Python] Conversion when mixing NaT and datetime objects not working
ARROW-8392 - [Java] Fix overflow related corner cases for vector value comparison
ARROW-8537 - [C++] Performance regression from ARROW-8523
ARROW-8803 - [Java] Row count should be set before loading buffers in VectorLoader
ARROW-8911 - [C++] Slicing a ChunkedArray with zero chunks segfaults

View release notes here:
https://arrow.apache.org/release/1.0.1.html
https://arrow.apache.org/release/1.0.0.html

### Why are the changes needed?

The upgrade brings fixes, improvements, and stability guarantees.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests with pyarrow 1.0.0 and 1.0.1

Closes #29686 from BryanCutler/arrow-upgrade-100-SPARK-32312.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-10 14:16:19 +09:00
HyukjinKwon c336ae39cd [SPARK-32186][DOCS][PYTHON] Development - Debugging
### What changes were proposed in this pull request?

This PR proposes to document the way of debugging PySpark. It's pretty much self-descriptive.

I made a demo site to review it more effectively: https://hyukjin-spark.readthedocs.io/en/stable/development/debugging.html

### Why are the changes needed?

To let users know how to debug PySpark applications.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new page in the documentation about debugging PySpark.

### How was this patch tested?

Manually built the doc.

Closes #29639 from HyukjinKwon/SPARK-32186.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-08 10:32:22 +09:00
HyukjinKwon 32d87c2b59 [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark
### What changes were proposed in this pull request?

This PR proposes to add a page describing how to test PySpark. Note that it avoids duplicating https://spark.apache.org/developer-tools.html and aims more to put the relevant links together.

I made a demo site to make reviewing easier: https://hyukjin-spark.readthedocs.io/en/stable/development/testing.html

### Why are the changes needed?

To help PySpark developers test easily.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new documentation page.

### How was this patch tested?

Manually tested.

Closes #29634 from HyukjinKwon/SPARK-32783.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-04 18:06:25 +09:00
HyukjinKwon d80c85c2e3 [SPARK-32191][FOLLOW-UP][PYTHON][DOCS] Indent the table and reword the main page in migration guide
### What changes were proposed in this pull request?

This PR is a minor followup that:

1. Slightly rewords the main page.

2. Fixes the indentation of the table in the migration guide;

    from

    ![Screen Shot 2020-09-01 at 1 53 40 PM](https://user-images.githubusercontent.com/6477701/91796204-91781800-ec5a-11ea-9f57-d7a9f4207ba0.png)

    to

    ![Screen Shot 2020-09-01 at 1 53 26 PM](https://user-images.githubusercontent.com/6477701/91796202-9046eb00-ec5a-11ea-9db2-815139ddfdb9.png)

### Why are the changes needed?

To make the migration guide look pretty.

### Does this PR introduce _any_ user-facing change?

Yes, this is a change to user-facing documentation.

### How was this patch tested?

Manually built the documentation.

Closes #29606 from HyukjinKwon/SPARK-32191.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-01 15:08:03 +09:00
HyukjinKwon 86ca90ccd7 [SPARK-32190][PYTHON][DOCS] Development - Contribution Guide in PySpark
### What changes were proposed in this pull request?

This PR proposes to document PySpark specific contribution guides at "Development" section.

Here is a demo to make reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/development/contributing.html

### Why are the changes needed?

To have a single place for PySpark users, and better documentation.

### Does this PR introduce _any_ user-facing change?

Yes, it adds new documentation. See the demo linked above.

### How was this patch tested?

```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```

and

```bash
cd python/docs
make clean html
```

Closes #29596 from HyukjinKwon/SPARK-32190.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-01 14:20:07 +09:00
HyukjinKwon eaaf783148 [MINOR][DOCS] Fix the Binder link to point the quickstart notebook correctly
### What changes were proposed in this pull request?

This PR fixes the link of Binder in Quickstart notebook and documentation.

From:

https://mybinder.org/v2/gh/databricks/apache/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb

To:

https://mybinder.org/v2/gh/apache/spark/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb

This link is the same as the one in RST files:

b54103016a/python/docs/source/conf.py (L57)

### Why are the changes needed?

The link was wrong and pointed to a non-existent file and repo.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the link so users can correctly try Binder.

### How was this patch tested?

Manually tested by building the documentation.

Closes #29597 from HyukjinKwon/minor-link-quickstart.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-31 22:00:56 +09:00
HyukjinKwon c154629171 [SPARK-32183][DOCS][PYTHON] User Guide - PySpark Usage Guide for Pandas with Apache Arrow
### What changes were proposed in this pull request?

This PR proposes to move Arrow usage guide from Spark documentation site to PySpark documentation site (at "User Guide").

Here is a demo to make reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/arrow_pandas.html

### Why are the changes needed?

To have a single place for PySpark users, and better documentation.

### Does this PR introduce _any_ user-facing change?

Yes, it will move https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html to our PySpark documentation.

### How was this patch tested?

```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```

and

```bash
cd python/docs
make clean html
```

Closes #29548 from HyukjinKwon/SPARK-32183.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-28 15:09:06 +09:00
HyukjinKwon b54103016a [SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to:
- add a notebook with a Binder integration which allows users to try PySpark in a live notebook. Please [try this here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- reuse this notebook as a quickstart guide in PySpark documentation.

Note that Binder turns a Git repo into a collection of interactive notebooks. It works based on Docker images: once somebody builds one, other people can reuse the image against a specific commit.
Therefore, if we run Binder with the images based on released tags in Spark, virtually all users can instantly launch the Jupyter notebooks.

<br/>

I made a simple demo to make it easier to review. Please see:
- [Main page](https://hyukjin-spark.readthedocs.io/en/stable/). Note that the link ("Live Notebook") in the main page wouldn't work since this PR is not merged yet.
- [Quickstart page](https://hyukjin-spark.readthedocs.io/en/stable/getting_started/quickstart.html)

<br/>

When reviewing the notebook file itself, please give me direct feedback, which I will appreciate and address.
Another way might be:
- open [here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- edit / change / update the notebook. Please feel free to change whatever you want. I can apply it as-is or update it slightly more when I apply it to this PR.
- download it as a `.ipynb` file:
    ![Screen Shot 2020-08-20 at 10 12 19 PM](https://user-images.githubusercontent.com/6477701/90774311-3e38c800-e332-11ea-8476-699a653984db.png)
- upload the `.ipynb` file here in a GitHub comment. Then, I will push a commit with that file, crediting you correctly, of course.
- alternatively, push a commit into this PR right away if that's easier for you (if you're a committer).

References:
- https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
- https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html - my own blog post .. :-) and https://koalas.readthedocs.io/en/latest/getting_started/10min.html

### Why are the changes needed?

To improve PySpark's usability. The current quickstart for Python users is not very friendly.

### Does this PR introduce _any_ user-facing change?

Yes, it will add a documentation page, and expose a live notebook to PySpark users.

### How was this patch tested?

Manually tested, and GitHub Actions builds will test.

Closes #29491 from HyukjinKwon/SPARK-32204.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-26 12:23:24 +09:00
Sean Owen 891c5e661a [MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation
### What changes were proposed in this pull request?

The class `KMeansSummary` in pyspark is not included in `clustering.py`'s `__all__` declaration. It isn't included in the docs as a result.

`InheritableThread` and `KMeansSummary` should be added to the corresponding RST files for documentation.

### Why are the changes needed?

It seems like an oversight not to include this, as all similar "summary" classes are included.
`InheritableThread` should also be documented.

### Does this PR introduce _any_ user-facing change?

I don't believe there are functional changes. It should make these public classes appear in the docs.

### How was this patch tested?

Existing tests / N/A.

Closes #29470 from srowen/KMeansSummary.

Lead-authored-by: Sean Owen <srowen@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-19 14:30:07 +09:00
Dongjoon Hyun b421bf0196 [SPARK-32517][CORE] Add StorageLevel.DISK_ONLY_3
### What changes were proposed in this pull request?

This PR aims to add `StorageLevel.DISK_ONLY_3` as a built-in `StorageLevel`.

### Why are the changes needed?

In a YARN cluster, HDFS usually provides storage with a replication factor of 3, so we can technically get `StorageLevel.DISK_ONLY_3` by saving the result to HDFS. However, disaggregated clusters and clusters without storage services are on the rise. Previously, in that situation, users could use the similar `MEMORY_AND_DISK_2` or a user-created `StorageLevel`. This PR aims to support those use cases officially for better UX.

### Does this PR introduce _any_ user-facing change?

Yes. This provides a new built-in option.
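
A minimal PySpark sketch, assuming the Python `StorageLevel` constant mirrors the new built-in level:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(100))
# Store partitions on disk only, replicated across three nodes.
rdd.persist(StorageLevel.DISK_ONLY_3)
```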

### How was this patch tested?

Pass the GitHub Action or Jenkins with the revised test cases.

Closes #29331 from dongjoon-hyun/SPARK-32517.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-10 07:33:06 -07:00
Liang-Chi Hsieh f9f992e9a4 [SPARK-32191][PYTHON][DOCS] Port migration guide for PySpark docs
### What changes were proposed in this pull request?

This proposes to port the old PySpark migration guide to the new PySpark docs.

### Why are the changes needed?

Better documentation.

### Does this PR introduce _any_ user-facing change?

No. Documentation only.

### How was this patch tested?

Generated document locally.

<img width="1521" alt="Screen Shot 2020-08-07 at 1 53 20 PM" src="https://user-images.githubusercontent.com/68855/89687618-672e7700-d8b5-11ea-8f29-67a9ab271fa8.png">

Closes #29385 from viirya/SPARK-32191.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-10 15:41:32 +09:00
HyukjinKwon 15b73339d9 [SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to write the main page of PySpark documentation. The base work is finished at https://github.com/apache/spark/pull/29188.

### Why are the changes needed?

For better usability and readability in PySpark documentation.

### Does this PR introduce _any_ user-facing change?

Yes, it creates a new main page as below:

![Screen Shot 2020-07-31 at 10 02 44 PM](https://user-images.githubusercontent.com/6477701/89037618-d2d68880-d379-11ea-9a44-562f2aa0e3fd.png)

### How was this patch tested?

Manually built the PySpark documentation.

```bash
cd python/docs
make clean html
```

Closes #29320 from HyukjinKwon/SPARK-32507.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-05 11:14:14 +09:00
Huaxin Gao 40e6a5bbb0 [SPARK-32449][ML][PYSPARK] Add summary to MultilayerPerceptronClassificationModel
### What changes were proposed in this pull request?
Add a training summary to MultilayerPerceptronClassificationModel...

### Why are the changes needed?
So that users can get the training process status, such as the loss value of each iteration and the total number of iterations.

### Does this PR introduce _any_ user-facing change?
Yes:
- `MultilayerPerceptronClassificationModel.summary`
- `MultilayerPerceptronClassificationModel.evaluate`
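
A minimal sketch of the new summary in use; the attribute names here are assumed to match the other classification training summaries:

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import MultilayerPerceptronClassifier

spark = SparkSession.builder.getOrCreate()
train_df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 0.0)), (1.0, Vectors.dense(1.0, 1.0))],
    ["label", "features"])
mlp = MultilayerPerceptronClassifier(layers=[2, 4, 2], maxIter=20)
model = mlp.fit(train_df)
print(model.summary.objectiveHistory)  # loss value of each iteration
print(model.summary.totalIterations)   # total number of iterations
```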

### How was this patch tested?
new tests

Closes #29250 from huaxingao/mlp_summary.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-07-29 09:58:25 -05:00
HyukjinKwon 6ab29b37cf [SPARK-32179][SPARK-32188][PYTHON][DOCS] Replace and redesign the documentation base
### What changes were proposed in this pull request?

This PR proposes to redesign the PySpark documentation.

I made a demo site to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/reference/index.html.

Here is the initial draft for the final PySpark docs shape: https://hyukjin-spark.readthedocs.io/en/latest/index.html.

In more detail, this PR proposes:
1. Use [pydata_sphinx_theme](https://github.com/pandas-dev/pydata-sphinx-theme) theme - [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/) use this theme. The CSS overwrite is ported from Koalas. The colours in the CSS were actually chosen by designers to use in Spark.
2. Use the Sphinx option to separate `source` and `build` directories as the documentation pages will likely grow.
3. Port current API documentation into the new style. It mimics Koalas and pandas to use the theme most effectively.

    One disadvantage of this approach is that you have to list APIs or classes explicitly; however, I think this isn't a big issue in PySpark since we're being conservative about adding APIs. I also intentionally listed classes only, instead of functions, in ML and MLlib to make them relatively easier to manage.

### Why are the changes needed?

Often I hear complaints from users that the current PySpark documentation is pretty messy to read - https://spark.apache.org/docs/latest/api/python/index.html - compared to other projects such as [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/).

It would be nicer if we could make it more organised, instead of just listing all classes, methods and attributes, to make it easier to navigate.

Also, the documentation has been there from almost the very first version of PySpark. Maybe it's time to update it.

### Does this PR introduce _any_ user-facing change?

Yes, PySpark API documentation will be redesigned.

### How was this patch tested?

Manually tested, and the demo site was made to show.

Closes #29188 from HyukjinKwon/SPARK-32179.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-27 17:49:21 +09:00
HyukjinKwon 6fb22aa42d
[SPARK-31748][PYTHON] Document resource module in PySpark doc and rename/move classes
### What changes were proposed in this pull request?

This PR is kind of a followup for SPARK-29641 and SPARK-28234. This PR proposes:

1. Document the new `pyspark.resource` module, introduced at 95aec091e4, in the PySpark API docs.

2. Move classes into fewer and simpler modules:

Before:

```
pyspark
├── resource
│   ├── executorrequests.py
│   │   ├── class ExecutorResourceRequest
│   │   └── class ExecutorResourceRequests
│   ├── taskrequests.py
│   │   ├── class TaskResourceRequest
│   │   └── class TaskResourceRequests
│   ├── resourceprofilebuilder.py
│   │   └── class ResourceProfileBuilder
│   ├── resourceprofile.py
│   │   └── class ResourceProfile
└── resourceinformation
    └── class ResourceInformation
```

After:

```
pyspark
└── resource
    ├── requests.py
    │   ├── class ExecutorResourceRequest
    │   ├── class ExecutorResourceRequests
    │   ├── class TaskResourceRequest
    │   └── class TaskResourceRequests
    ├── profile.py
    │   ├── class ResourceProfileBuilder
    │   └── class ResourceProfile
    └── information.py
        └── class ResourceInformation
```

3. Minor docstring fixes, e.g.:

```diff
-     param name the name of the resource
-     param addresses an array of strings describing the addresses of the resource
+     :param name: the name of the resource
+     :param addresses: an array of strings describing the addresses of the resource
+
+     .. versionadded:: 3.0.0
```

### Why are the changes needed?

To document APIs, and move Python modules to fewer and simpler modules.

### Does this PR introduce _any_ user-facing change?

No, the changes are in unreleased branches.

### How was this patch tested?

Manually tested via:

```bash
cd python
./run-tests --python-executables=python3 --modules=pyspark-core
./run-tests --python-executables=python3 --modules=pyspark-resource
```

Closes #28569 from HyukjinKwon/SPARK-28234-SPARK-29641-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-19 17:09:37 -07:00
Nicholas Chammas 4d5166f840 [SPARK-30880][DOCS] Delete Sphinx Makefile cruft
Split off from #27534.

### What changes were proposed in this pull request?

This PR deletes some unused cruft from the Sphinx Makefiles.

### Why are the changes needed?

To remove dead code and the associated maintenance burden.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Tested locally by building the Python docs and reviewing them in my browser.

Closes #27625 from nchammas/SPARK-30731-makefile-cleanup.

Lead-authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Co-authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-21 14:53:40 +09:00
Dongjoon Hyun fc4e56a54c [SPARK-30884][PYSPARK] Upgrade to Py4J 0.10.9
This PR aims to upgrade Py4J to `0.10.9` for better Python 3.7 support in Apache Spark 3.0.0 (master/branch-3.0). This is not for `branch-2.4`.

- Apache Spark 3.0.0 is using `Py4J 0.10.8.1` (released on 2018-10-21) because `0.10.8.1` was the first official release to support Python 3.7.
    - https://www.py4j.org/changelog.html#py4j-0-10-8-and-py4j-0-10-8-1
- `Py4J 0.10.9` was released on January 25th, 2020 with better Python 3.7 support and a `magic_member` bug fix.
    - https://github.com/bartdag/py4j/releases/tag/0.10.9
    - https://www.py4j.org/changelog.html#py4j-0-10-9

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #27641 from dongjoon-hyun/SPARK-30884.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-20 09:09:30 -08:00
David Toneian b2134ee73c [SPARK-30823][PYTHON][DOCS] Set %PYTHONPATH% when building PySpark documentation on Windows
This commit is published into the public domain.

### What changes were proposed in this pull request?
In analogy to `python/docs/Makefile`, which has
> export PYTHONPATH=$(realpath ..):$(realpath ../lib/py4j-0.10.8.1-src.zip)

on line 10, this PR adds
> set PYTHONPATH=..;..\lib\py4j-0.10.8.1-src.zip

to `make2.bat`.

Since there is no `realpath` in default installations of Windows, I left the relative paths unresolved. Per the instructions on how to build docs, `make.bat` is supposed to be run from `python/docs` as the working directory, so this should probably not cause issues (`%BUILDDIR%` is a relative path as well.)

### Why are the changes needed?
When building the PySpark documentation on Windows, by changing directory to `python/docs` and running `make.bat` (which runs `make2.bat`), the majority of the documentation may not be built if pyspark is not in the default `%PYTHONPATH%`. Sphinx then reports that `pyspark` (and possibly dependencies) cannot be imported.

If `pyspark` is in the default `%PYTHONPATH%`, I suppose it is that version of `pyspark` – as opposed to the version found above the `python/docs` directory – that is considered when building the documentation, which may result in documentation that does not correspond to the development version one is trying to build.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Manual tests on my Windows 10 machine. Additional tests with other environments very welcome!

Closes #27569 from DavidToneian/SPARK-30823.

Authored-by: David Toneian <david@toneian.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-14 13:49:11 +09:00
HyukjinKwon ee8d661058 [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package
### What changes were proposed in this pull request?

This PR proposes to move pandas-related functionality into a pandas sub-package. Namely:

```bash
pyspark/sql/pandas
├── __init__.py
├── conversion.py  # Conversion between pandas <> PySpark DataFrames
├── functions.py   # pandas_udf
├── group_ops.py   # Grouped UDF / Cogrouped UDF + groupby.apply, groupby.cogroup.apply
├── map_ops.py     # Map Iter UDF + mapInPandas
├── serializers.py # pandas <> PyArrow serializers
├── types.py       # Type utils between pandas <> PyArrow
└── utils.py       # Version requirement checks
```

In order to separately locate `groupby.apply`, `groupby.cogroup.apply`, `mapInPandas`, `toPandas`, and `createDataFrame(pdf)` under the `pandas` sub-package, I had to use a mix-in approach, which the Scala side often uses via `trait`s, and which pandas itself also uses (see `IndexOpsMixin` as an example) to group related functionalities. Currently, you can think of it like Scala's self-typed trait. See the structure below:

```python
class PandasMapOpsMixin(object):
    def mapInPandas(self, ...):
        ...
        return ...

    # other Pandas <> PySpark APIs
```

```python
class DataFrame(PandasMapOpsMixin):

    # other DataFrame APIs equivalent to Scala side.

```

This is a big PR, but the changes are mostly just moving code around, except for one case, `createDataFrame`, where I had to split the methods.

### Why are the changes needed?

There are pandas functionalities here and there, and I myself get lost tracking where they are. Also, when you have to make a change common to all the pandas-related features, it's almost impossible now.

Also, after this change, `DataFrame` and `SparkSession` become more consistent with the Scala side since pandas is specific to Python, and this change separates the pandas-specific APIs away from `DataFrame` and `SparkSession`.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests should cover this. Also, I manually built the PySpark API documentation and checked it.

Closes #27109 from HyukjinKwon/pandas-refactoring.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-09 10:22:50 +09:00
WeichenXu 88542bc3d9 [SPARK-30154][ML] PySpark UDF to convert MLlib vectors to dense arrays
### What changes were proposed in this pull request?

PySpark UDF to convert MLlib vectors to dense arrays.
Example:
```python
from pyspark.sql.functions import col
from pyspark.ml.functions import vector_to_array

df.select(vector_to_array(col("features")))
```

### Why are the changes needed?
If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient approach is to do it in the JVM. However, that requires the PySpark user to write Scala code and register it as a UDF, which is often infeasible for a pure Python project.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #26910 from WeichenXu123/vector_to_array.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2020-01-06 16:18:51 -08:00
Yuming Wang e1ee3fb72f [SPARK-30216][INFRA] Use python3 in Docker release image
### What changes were proposed in this pull request?

- Reverts commits 1f94bf4 and d6be46e
- Switches python to python3 in Docker release image.

### Why are the changes needed?
`dev/make-distribution.sh` and `python/setup.py` use python3:
https://github.com/apache/spark/pull/26844/files#diff-ba2c046d92a1d2b5b417788bfb5cb5f8L236
https://github.com/apache/spark/pull/26330/files#diff-8cf6167d58ce775a08acafcfe6f40966

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

manual test:
```
yumwangubuntu-3513086:~/spark$ dev/create-release/do-release-docker.sh -n -d /home/yumwang/spark-release
Output directory already exists. Overwrite and continue? [y/n] y
Branch [branch-2.4]: master
Current branch version is 3.0.0-SNAPSHOT.
Release [3.0.0]: 3.0.0-preview2
RC # [1]:
This is a dry run. Please confirm the ref that will be built for testing.
Ref [master]:
ASF user [yumwang]:
Full name [Yuming Wang]:
GPG key [yumwangapache.org]: DBD447010C1B4F7DAD3F7DFD6E1B4122F6A3A338
================
Release details:
BRANCH:     master
VERSION:    3.0.0-preview2
TAG:        v3.0.0-preview2-rc1
NEXT:       3.0.1-SNAPSHOT

ASF USER:   yumwang
GPG KEY:    DBD447010C1B4F7DAD3F7DFD6E1B4122F6A3A338
FULL NAME:  Yuming Wang
E-MAIL:     yumwangapache.org
================
Is this info correct [y/n]? y
GPG passphrase:

========================
= Building spark-rm image with tag latest...
Command: docker build -t spark-rm:latest --build-arg UID=110302528 /home/yumwang/spark/dev/create-release/spark-rm
Log file: docker-build.log
Building v3.0.0-preview2-rc1; output will be at /home/yumwang/spark-release/output

gpg: directory '/home/spark-rm/.gnupg' created
gpg: keybox '/home/spark-rm/.gnupg/pubring.kbx' created
gpg: /home/spark-rm/.gnupg/trustdb.gpg: trustdb created
gpg: key 6E1B4122F6A3A338: public key "Yuming Wang <yumwangapache.org>" imported
gpg: key 6E1B4122F6A3A338: secret key imported
gpg: Total number processed: 1
gpg:               imported: 1
gpg:       secret keys read: 1
gpg:   secret keys imported: 1
========================
= Creating release tag v3.0.0-preview2-rc1...
Command: /opt/spark-rm/release-tag.sh
Log file: tag.log
It may take some time for the tag to be synchronized to github.
Press enter when you've verified that the new tag (v3.0.0-preview2-rc1) is available.
========================
= Building Spark...
Command: /opt/spark-rm/release-build.sh package
Log file: build.log
========================
= Building documentation...
Command: /opt/spark-rm/release-build.sh docs
Log file: docs.log
========================
= Publishing release
Command: /opt/spark-rm/release-build.sh publish-release
Log file: publish.log
```
Generated doc:
![image](https://user-images.githubusercontent.com/5399861/70693075-a7723100-1cf7-11ea-9f88-9356a02349a1.png)

Closes #26848 from wangyum/SPARK-30216.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-12-13 11:31:31 -08:00
HyukjinKwon fe75ff8bea [SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation
## What changes were proposed in this pull request?

It seems we used to generate the PySpark API documentation with Epydoc almost from the very beginning (see 85b8f2c64f).

This fixes an actual issue:

Before:

![Screen Shot 2019-07-05 at 8 20 01 PM](https://user-images.githubusercontent.com/6477701/60720491-e9879180-9f65-11e9-9562-100830a456cd.png)

After:

![Screen Shot 2019-07-05 at 8 20 05 PM](https://user-images.githubusercontent.com/6477701/60720495-ec828200-9f65-11e9-8277-8f689e292cb0.png)

It is apparently a bug within the `epytext` plugin during the conversion between `param` and `:param` syntax. See also the [Epydoc syntax](http://epydoc.sourceforge.net/manual-epytext.html).

Actually, Epydoc syntax violates [PEP-257](https://www.python.org/dev/peps/pep-0257/) IIRC and blocks us from enabling some rules for the doctest linter as well.

We should remove this legacy, and I guess Spark 3 is a good time to do it.

## How was this patch tested?

Manually built the doc and checked each page.

I had to manually find the Epydoc syntax by `git grep -r "{L"`, for instance.

Closes #25060 from HyukjinKwon/SPARK-28206.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2019-07-05 10:08:22 -07:00
sangramga 8375103568 [SPARK-27557][DOC] Add copy button to Python API docs for easier copying of code-blocks
## What changes were proposed in this pull request?

Add a non-intrusive button to the Python API documentation which removes ">>>" prompts and code outputs, for easier copying of code.

For example, the below code snippet in the document is difficult to copy due to the ">>>" prompts:
```
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1='Alice', _2=1)]

```
After the copy button in the corner of the code block is pressed, it becomes this, which is easier to copy:
```
l = [('Alice', 1)]
spark.createDataFrame(l).collect()
```

![image](https://user-images.githubusercontent.com/9406431/56715817-560c3600-6756-11e9-8bae-58a3d2d57df3.png)

## File changes
Made changes to python/docs/conf.py and copybutton.js, thus only modifying the Sphinx frontend; no changes were made to the documentation itself. The build process for the documentation remains the same.

copybutton.js -> This JS snippet was taken from the official python.org documentation site.

## How was this patch tested?
NA

Closes #24456 from sangramga/copybutton.

Authored-by: sangramga <sangramga@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-05-01 11:26:18 -05:00
Gabor Somogyi 3729efb4d0 [SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs
## What changes were proposed in this pull request?

Avro has been a built-in but external data source module since Spark 2.4, but the `from_avro` and `to_avro` APIs are not yet supported in pyspark.

In this PR I've made them available from pyspark.
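
For illustration, a minimal sketch of the new APIs, assuming the external spark-avro package is on the classpath (e.g. via `--packages`) and `df` has a long-typed `value` column:

```python
from pyspark.sql.functions import col
from pyspark.sql.avro.functions import from_avro, to_avro

# Encode a column to Avro binary, then decode it back with a JSON-format schema.
avro_df = df.select(to_avro(col("value")).alias("avro"))
decoded = avro_df.select(from_avro("avro", '{"type": "long"}').alias("value"))
```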

## How was this patch tested?

Please see the python API examples I've added.

```bash
cd docs/
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build
```

Manual webpage check.

Closes #23797 from gaborgsomogyi/SPARK-26856.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-11 10:15:07 +09:00
Hyukjin Kwon e92088de4d [MINOR][PYTHON] Fix SQLContext to SparkSession in Python API main page
## What changes were proposed in this pull request?

This PR proposes to fix the deprecated `SQLContext` to `SparkSession` in the Python API main page.

**Before:**

![screen shot 2019-01-16 at 5 30 19 pm](https://user-images.githubusercontent.com/6477701/51239583-bac82f80-19b4-11e9-9129-8dae2c23ec79.png)

**After:**

![screen shot 2019-01-16 at 5 29 54 pm](https://user-images.githubusercontent.com/6477701/51239577-b734a880-19b4-11e9-8539-592cb772168d.png)

## How was this patch tested?

Manually checked the doc after building it.
I also checked by `grep -r "SQLContext"`, and it looks like this is the only instance left.

Closes #23565 from HyukjinKwon/minor-doc-change.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-16 23:23:36 +08:00
Dongjoon Hyun e4cb42ad89
[SPARK-25891][PYTHON] Upgrade to Py4J 0.10.8.1
## What changes were proposed in this pull request?

Py4J 0.10.8.1 was released on October 21st and is the first release of Py4J to support Python 3.7 officially. We had better have it to get the official support. Also, there are some patches related to garbage collection.

https://www.py4j.org/changelog.html#py4j-0-10-8-and-py4j-0-10-8-1

## How was this patch tested?

Pass the Jenkins.

Closes #22901 from dongjoon-hyun/SPARK-25891.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2018-10-31 09:55:03 -07:00
Sean Owen 703e6da1ec [SPARK-25705][BUILD][STREAMING][TEST-MAVEN] Remove Kafka 0.8 integration
## What changes were proposed in this pull request?

Remove Kafka 0.8 integration

## How was this patch tested?

Existing tests, build scripts

Closes #22703 from srowen/SPARK-25705.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-10-16 09:10:24 -05:00
Sean Owen a001814189 [SPARK-25598][STREAMING][BUILD][TEST-MAVEN] Remove flume connector in Spark 3
## What changes were proposed in this pull request?

Removes all vestiges of Flume in the build, for Spark 3.
I don't think this needs Jenkins config changes.

## How was this patch tested?

Existing tests.

Closes #22692 from srowen/SPARK-25598.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-10-11 14:28:06 -07:00
hyukjinkwon 1f94bf492c [SPARK-24530][PYTHON] Add a control to force Python version in Sphinx via environment variable, SPHINXPYTHON
## What changes were proposed in this pull request?

This PR proposes to add `SPHINXPYTHON` environment variable to control the Python version used by Sphinx.

The motivation for this environment variable is that some signatures in the Python documentation do not seem to render properly when Python 2 is used by Sphinx. See the JIRA's case. Using Python 3 should be encouraged, but it looks like we will probably live with this problem for a long while in any event.

For the default case of `make html`, it keeps the previous behaviour and uses `SPHINXBUILD` as before. If `SPHINXPYTHON` is set, then it forces Sphinx to use that specific Python version.

```
$ SPHINXPYTHON=python3 make html
python3 -msphinx -b html -d _build/doctrees   . _build/html
Running Sphinx v1.7.5
...
```

1. If `SPHINXPYTHON` is set, use that Python. If `SPHINXBUILD` is set, use sphinx-build.
2. If both are set, `SPHINXBUILD` has a higher priority over `SPHINXPYTHON`
3. By default, `SPHINXBUILD` is used as 'sphinx-build'.

We could probably work around this somehow by explicitly setting `SPHINXBUILD`, but `sphinx-build` can't be easily distinguished since (at least in my environment and to my knowledge) it doesn't get replaced when a newer Sphinx is installed under a different Python version. It is confusing and doesn't warn about its Python version.

## How was this patch tested?

Manually tested:

**`python` (Python 2.7) in the path with Sphinx:**

```
$ make html
sphinx-build -b html -d _build/doctrees   . _build/html
Running Sphinx v1.7.5
...
```

**`python` (Python 2.7) in the path without Sphinx:**

```
$ make html
Makefile:8: *** The 'sphinx-build' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the 'sphinx-build' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/.  Stop.
```

**`SPHINXPYTHON` set `python` (Python 2.7)  with Sphinx:**

```
$ SPHINXPYTHON=python make html
Makefile:35: *** Note that Python 3 is required to generate PySpark documentation correctly for now. Current Python executable was less than Python 3. See SPARK-24530. To force Sphinx to use a specific Python executable, please set SPHINXPYTHON to point to the Python 3 executable..  Stop.
```

**`SPHINXPYTHON` set `python` (Python 2.7)  without Sphinx:**

```
$ SPHINXPYTHON=python make html
Makefile:35: *** Note that Python 3 is required to generate PySpark documentation correctly for now. Current Python executable was less than Python 3. See SPARK-24530. To force Sphinx to use a specific Python executable, please set SPHINXPYTHON to point to the Python 3 executable..  Stop.
```

**`SPHINXPYTHON` set `python3` with Sphinx:**

```
$ SPHINXPYTHON=python3 make html
python3 -msphinx -b html -d _build/doctrees   . _build/html
Running Sphinx v1.7.5
...
```

**`SPHINXPYTHON` set `python3` without Sphinx:**

```
$ SPHINXPYTHON=python3 make html
Makefile:39: *** Python executable 'python3' did not have Sphinx installed. Make sure you have Sphinx installed, then set the SPHINXPYTHON environment variable to point to the Python executable having Sphinx installed. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/.  Stop.
```

**`SPHINXBUILD` set:**

```
$ SPHINXBUILD=sphinx-build make html
sphinx-build -b html -d _build/doctrees   . _build/html
Running Sphinx v1.7.5
...
```

**Both `SPHINXPYTHON` and `SPHINXBUILD` are set:**

```
$ SPHINXBUILD=sphinx-build SPHINXPYTHON=python make html
sphinx-build -b html -d _build/doctrees   . _build/html
Running Sphinx v1.7.5
...
```

Author: hyukjinkwon <gurwls223@apache.org>

Closes #21659 from HyukjinKwon/SPARK-24530.
2018-07-11 10:10:07 +08:00
Marcelo Vanzin cc613b552e [PYSPARK] Update py4j to version 0.10.7. 2018-05-09 10:47:35 -07:00
Ilya Matiach 1edb3175d8 [SPARK-21866][ML][PYSPARK] Adding spark image reader
## What changes were proposed in this pull request?
Adding a Spark image reader: an implementation of a schema for representing images in Spark DataFrames.

The code is taken from the spark package located here:
(https://github.com/Microsoft/spark-images)

Please see the JIRA for more information (https://issues.apache.org/jira/browse/SPARK-21866)

Please see mailing list for SPIP vote and approval information:
(http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-SPARK-21866-Image-support-in-Apache-Spark-td22510.html)

# Background and motivation
As Apache Spark is being used more and more in the industry, some new use cases are emerging for different data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep Learning applications commonly deal with image processing. A number of projects add some Deep Learning capabilities to Spark (see list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts for representing images in Spark by defining a representation that caters to the most common needs of users and library developers.
This SPIP proposes a specification to represent images in Spark DataFrames and Datasets (based on existing industrial standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on. Several packages already offer various processing facilities for transforming images or doing more complex operations, and each has various design tradeoffs that make them better as standalone solutions.
This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open source packages: MMLSpark and Deep Learning Pipelines.
The proposed image format is an in-memory, decompressed representation that targets low-level applications. It is significantly more liberal in memory usage than compressed image representations such as JPEG, PNG, etc., but it allows easy communication with popular image processing libraries and has no decoding overhead.
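
As a small illustration of the loading interface, a sketch assuming a local directory of images (`readImages` is part of the `pyspark.ml.image` module added here):

```python
from pyspark.ml.image import ImageSchema

image_df = ImageSchema.readImages("data/mllib/images")
# Each row holds an image struct: origin, height, width, nChannels, mode, data.
image_df.printSchema()
```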

## How was this patch tested?

Unit tests in Scala (ImageSchemaSuite) and unit tests in Python.

Author: Ilya Matiach <ilmat@microsoft.com>
Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19439 from imatiach-msft/ilmat/spark-images.
2017-11-22 15:45:45 -08:00
Dongjoon Hyun aa88b8dbbb [SPARK-22490][DOC] Add PySpark doc for SparkSession.builder
## What changes were proposed in this pull request?

In the PySpark API documentation, [SparkSession.builder](http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html) is not documented and only shows a default value description:
```
SparkSession.builder = <pyspark.sql.session.Builder object ...
```

This PR adds the doc.
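
For reference, a typical usage of the attribute being documented:

```python
from pyspark.sql import SparkSession

# SparkSession.builder is a class attribute used to construct SparkSession instances.
spark = SparkSession.builder.appName("example").getOrCreate()
```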

![screen](https://user-images.githubusercontent.com/9700541/32705514-1bdcafaa-c7ca-11e7-88bf-05566fea42de.png)

The following is the diff of the generated result.

```
$ diff old.html new.html
95a96,101
> <dl class="attribute">
> <dt id="pyspark.sql.SparkSession.builder">
> <code class="descname">builder</code><a class="headerlink" href="#pyspark.sql.SparkSession.builder" title="Permalink to this definition">¶</a></dt>
> <dd><p>A class attribute having a <a class="reference internal" href="#pyspark.sql.SparkSession.Builder" title="pyspark.sql.SparkSession.Builder"><code class="xref py py-class docutils literal"><span class="pre">Builder</span></code></a> to construct <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal"><span class="pre">SparkSession</span></code></a> instances</p>
> </dd></dl>
>
212,216d217
< <dt id="pyspark.sql.SparkSession.builder">
< <code class="descname">builder</code><em class="property"> = &lt;pyspark.sql.session.SparkSession.Builder object&gt;</em><a class="headerlink" href="#pyspark.sql.SparkSession.builder" title="Permalink to this definition">¶</a></dt>
< <dd></dd></dl>
<
< <dl class="attribute">
```

## How was this patch tested?

Manual.

```
cd python/docs
make html
open _build/html/pyspark.sql.html
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #19726 from dongjoon-hyun/SPARK-22490.
2017-11-15 08:59:29 -08:00
Dongjoon Hyun c8d0aba198 [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6
## What changes were proposed in this pull request?

This PR aims to bump Py4J in order to fix the following float/double bug.
Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272) and the latest Py4J is 0.10.6.

**BEFORE**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+--------------------+
|(id + 17.1335742042)|
+--------------------+
|       17.1335742042|
+--------------------+
```

**AFTER**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+-------------------------+
|(id + 17.133574204226083)|
+-------------------------+
|       17.133574204226083|
+-------------------------+
```

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #18546 from dongjoon-hyun/SPARK-21278.
2017-07-05 16:33:23 -07:00
Yanbo Liang d4022d4951 [SPARK-20707][ML] ML deprecated APIs should be removed in major release.
## What changes were proposed in this pull request?
Before 2.2, MLlib used to remove APIs deprecated in the last feature/minor release. But from Spark 2.2, we decided to remove deprecated APIs in a major release, so we need to change the corresponding annotations to tell users those will be removed in 3.0.
Meanwhile, this fixes bugs in the ML documents. The original ML docs couldn't show deprecation annotations in the ```MLWriter```- and ```MLReader```-related classes; we correct that in this PR.

Before:
![image](https://cloud.githubusercontent.com/assets/1962026/25939889/f8c55f20-3666-11e7-9fa2-0605bfb3ed06.png)

After:
![image](https://cloud.githubusercontent.com/assets/1962026/25939870/e9b0d5be-3666-11e7-9765-5e04885e4b32.png)

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17946 from yanboliang/spark-20707.
2017-05-16 10:08:23 +08:00
Bago Amirbekian a5c87707ea [SPARK-20040][ML][PYTHON] pyspark wrapper for ChiSquareTest
## What changes were proposed in this pull request?

A pyspark wrapper for spark.ml.stat.ChiSquareTest
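
A minimal usage sketch of the wrapper (the toy dataset is made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.5, 10.0)),
     (0.0, Vectors.dense(1.5, 20.0)),
     (1.0, Vectors.dense(1.5, 30.0))],
    ["label", "features"])
# Returns a single-row DataFrame with pValues, degreesOfFreedom and statistics.
result = ChiSquareTest.test(df, "features", "label").head()
print(result.pValues)
```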

## How was this patch tested?

unit tests
doctests

Author: Bago Amirbekian <bago@databricks.com>

Closes #17421 from MrBago/chiSquareTestWrapper.
2017-03-28 19:19:16 -07:00
zero323 0bc8847aa2 [SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrowth
## What changes were proposed in this pull request?

- Add `HasSupport` and `HasConfidence` `Params`.
- Add new module `pyspark.ml.fpm`.
- Add `FPGrowth` / `FPGrowthModel` wrappers (see the usage sketch after this list).
- Provide tests for new features.
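
A minimal usage sketch of the new wrapper (toy data, for illustration only):

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, [1, 2, 5]), (1, [1, 2, 3, 5]), (2, [1, 2])],
    ["id", "items"])
fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(df)
model.freqItemsets.show()      # frequent itemsets and their frequencies
model.associationRules.show()  # rules meeting the confidence threshold
```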

## How was this patch tested?

Unit tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17218 from zero323/SPARK-19281.
2017-03-26 16:49:27 -07:00
hyukjinkwon 46b2126024
[SPARK-19002][BUILD][PYTHON] Check pep8 against all Python scripts
## What changes were proposed in this pull request?

This PR proposes to check pep8 against all other Python scripts and fix the errors as below:

```bash
./dev/create-release/generate-contributors.py
./dev/create-release/releaseutils.py
./dev/create-release/translate-contributors.py
./dev/lint-python
./python/docs/epytext.py
./examples/src/main/python/mllib/decision_tree_classification_example.py
./examples/src/main/python/mllib/decision_tree_regression_example.py
./examples/src/main/python/mllib/gradient_boosting_classification_example.py
./examples/src/main/python/mllib/gradient_boosting_regression_example.py
./examples/src/main/python/mllib/linear_regression_with_sgd_example.py
./examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py
./examples/src/main/python/mllib/naive_bayes_example.py
./examples/src/main/python/mllib/random_forest_classification_example.py
./examples/src/main/python/mllib/random_forest_regression_example.py
./examples/src/main/python/mllib/svm_with_sgd_example.py
./examples/src/main/python/streaming/network_wordjoinsentiments.py
./sql/hive/src/test/resources/data/scripts/cat.py
./sql/hive/src/test/resources/data/scripts/cat_error.py
./sql/hive/src/test/resources/data/scripts/doubleescapedtab.py
./sql/hive/src/test/resources/data/scripts/dumpdata_script.py
./sql/hive/src/test/resources/data/scripts/escapedcarriagereturn.py
./sql/hive/src/test/resources/data/scripts/escapednewline.py
./sql/hive/src/test/resources/data/scripts/escapedtab.py
./sql/hive/src/test/resources/data/scripts/input20_script.py
./sql/hive/src/test/resources/data/scripts/newline.py
```

## How was this patch tested?

- `./python/docs/epytext.py`

  ```bash
  cd ./python/docs && make html
  ```

- pep8 check (Python 2.7 / Python 3.3.6)

  ```
  ./dev/lint-python
  ```

- `./dev/merge_spark_pr.py` (Python 2.7 only / Python 3.3.6 not working)

  ```bash
  python -m doctest -v ./dev/merge_spark_pr.py
  ```

- `./dev/create-release/releaseutils.py` `./dev/create-release/generate-contributors.py` `./dev/create-release/translate-contributors.py` (Python 2.7 only / Python 3.3.6 not working)

  ```bash
  python generate-contributors.py
  python translate-contributors.py
  ```

- Examples (Python 2.7 / Python 3.3.6)

  ```bash
  ./bin/spark-submit examples/src/main/python/mllib/decision_tree_classification_example.py
  ./bin/spark-submit examples/src/main/python/mllib/decision_tree_regression_example.py
  ./bin/spark-submit examples/src/main/python/mllib/gradient_boosting_classification_example.py
  ./bin/spark-submit examples/src/main/python/mllib/gradient_boosting_regression_example.py
  ./bin/spark-submit examples/src/main/python/mllib/random_forest_classification_example.py
  ./bin/spark-submit examples/src/main/python/mllib/random_forest_regression_example.py
  ```

- Examples (Python 2.7 only / Python 3.3.6 not working)
  ```
  ./bin/spark-submit examples/src/main/python/mllib/linear_regression_with_sgd_example.py
  ./bin/spark-submit examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py
  ./bin/spark-submit examples/src/main/python/mllib/naive_bayes_example.py
  ./bin/spark-submit examples/src/main/python/mllib/svm_with_sgd_example.py
  ```

- `sql/hive/src/test/resources/data/scripts/*.py` (Python 2.7 / Python 3.3.6 within suggested changes)

  Manually tested only changed ones.

- `./dev/github_jira_sync.py` (Python 2.7 only / Python 3.3.6 not working)

  Manually tested this after disabling actually adding comments and links.

And also via Jenkins tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16405 from HyukjinKwon/minor-pep8.
2017-01-02 15:23:19 +00:00
Jagadeesan 595893d33a
[SPARK-17960][PYSPARK][UPGRADE TO PY4J 0.10.4]
## What changes were proposed in this pull request?

1) Upgrade the Py4J version on the Java side
2) Update the py4j src zip file we bundle with Spark

## How was this patch tested?

Existing doctests & unit tests pass

Author: Jagadeesan <as2@us.ibm.com>

Closes #15514 from jagadeesanas2/SPARK-17960.
2016-10-21 09:48:24 +01:00
Sean Owen 0b3a4be92c [SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same java used in the spark environment
## What changes were proposed in this pull request?

Update to py4j 0.10.3 to enable JAVA_HOME support

## How was this patch tested?

Pyspark tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14748 from srowen/SPARK-16781.
2016-08-24 20:04:09 +01:00