ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Huaxin Gao	37690dea10	[SPARK-29565][ML][PYTHON] OneHotEncoder should support single-column input/output ### What changes were proposed in this pull request? add single-column input/output support in OneHotEncoder ### Why are the changes needed? Currently, OneHotEncoder only has multi-column support. It makes sense to support single column as well. ### Does this PR introduce any user-facing change? Yes ```OneHotEncoder.setInputCol``` ```OneHotEncoder.setOutputCol``` ### How was this patch tested? Unit test Closes #26265 from huaxingao/spark-29565. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2019-10-28 23:20:21 -07:00
Huaxin Gao	c137acbf65	[SPARK-29566][ML] Imputer should support single-column input/output ### What changes were proposed in this pull request? add single-column input/output support in Imputer ### Why are the changes needed? Currently, Imputer only has multi-column support. This PR adds single-column input/output support. ### Does this PR introduce any user-facing change? Yes. add single-column input/output support in Imputer ```Imputer.setInputCol``` ```Imputer.setOutputCol``` ### How was this patch tested? add unit tests Closes #26247 from huaxingao/spark-29566. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-10-29 11:11:41 +08:00
Huaxin Gao	b19fd487df	[SPARK-29093][PYTHON][ML] Remove automatically generated param setters in _shared_params_code_gen.py ### What changes were proposed in this pull request? Remove automatically generated param setters in _shared_params_code_gen.py ### Why are the changes needed? To keep parity between scala and python ### Does this PR introduce any user-facing change? Yes Add some setters in Python ML XXXModels ### How was this patch tested? unit tests Closes #26232 from huaxingao/spark-29093. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-10-28 11:36:10 +08:00
stczwd	dcf5eaf1a6	[SPARK-29444][FOLLOWUP] add doc and python parameter for ignoreNullFields in json generating ### What changes were proposed in this pull request? Add description for ignoreNullFields, which is committed in #26098, in DataFrameWriter and readwriter.py. Enable users to use ignoreNullFields in pyspark. ### Does this PR introduce any user-facing change? No ### How was this patch tested? run unit tests Closes #26227 from stczwd/json-generator-doc. Authored-by: stczwd <qcsd2011@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-24 10:25:04 -07:00
Xianyang Liu	0a7095156b	[SPARK-29499][CORE][PYSPARK] Add mapPartitionsWithIndex for RDDBarrier ### What changes were proposed in this pull request? Add mapPartitionsWithIndex for RDDBarrier. ### Why are the changes needed? There is only one method in `RDDBarrier`. We often use the partition index as a label for the current partition. We need to get the index from `TaskContext` index in the method of `mapPartitions` which is not convenient. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New UT. Closes #26148 from ConeyLiu/barrier-index. Authored-by: Xianyang Liu <xianyang.liu@intel.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>	2019-10-23 13:46:09 +02:00
HyukjinKwon	811d563fbf	[SPARK-29536][PYTHON] Upgrade cloudpickle to 1.1.1 to support Python 3.8 ### What changes were proposed in this pull request? Inline cloudpickle in PySpark to cloudpickle 1.1.1. See https://github.com/cloudpipe/cloudpickle/blob/v1.1.1/cloudpickle/cloudpickle.py https://github.com/cloudpipe/cloudpickle/pull/269 was added for Python 3.8 support (fixed from 1.1.0). Using 1.2.2 seems to break PyPy 2 due to cloudpipe/cloudpickle#278, so this PR currently uses 1.1.1. Once we drop Python 2, we can switch to the highest version. ### Why are the changes needed? Positional-only arguments were newly introduced in Python 3.8 (see https://docs.python.org/3/whatsnew/3.8.html#positional-only-parameters) Particularly the newly added argument to `types.CodeType` was the problem (https://docs.python.org/3/whatsnew/3.8.html#changes-in-the-python-api): > `types.CodeType` has a new parameter in the second position of the constructor (posonlyargcount) to support positional-only arguments defined in PEP 570. The first argument (argcount) now represents the total number of positional arguments (including positional-only arguments). The new `replace()` method of `types.CodeType` can be used to make the code future-proof. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested. Note that the optional dependency PyArrow does not yet appear to support Python 3.8; therefore, it was not tested. See "Details" below. <details> <p> ```bash cd python ./run-tests --python-executables=python3.8 ``` ``` Running PySpark tests. 
Output is in /Users/hyukjin.kwon/workspace/forked/spark/python/unit-tests.log Will test against the following Python executables: ['python3.8'] Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming'] Starting test(python3.8): pyspark.ml.tests.test_algorithms Starting test(python3.8): pyspark.ml.tests.test_feature Starting test(python3.8): pyspark.ml.tests.test_base Starting test(python3.8): pyspark.ml.tests.test_evaluation Finished test(python3.8): pyspark.ml.tests.test_base (12s) Starting test(python3.8): pyspark.ml.tests.test_image Finished test(python3.8): pyspark.ml.tests.test_evaluation (14s) Starting test(python3.8): pyspark.ml.tests.test_linalg Finished test(python3.8): pyspark.ml.tests.test_feature (23s) Starting test(python3.8): pyspark.ml.tests.test_param Finished test(python3.8): pyspark.ml.tests.test_image (22s) Starting test(python3.8): pyspark.ml.tests.test_persistence Finished test(python3.8): pyspark.ml.tests.test_param (25s) Starting test(python3.8): pyspark.ml.tests.test_pipeline Finished test(python3.8): pyspark.ml.tests.test_linalg (37s) Starting test(python3.8): pyspark.ml.tests.test_stat Finished test(python3.8): pyspark.ml.tests.test_pipeline (7s) Starting test(python3.8): pyspark.ml.tests.test_training_summary Finished test(python3.8): pyspark.ml.tests.test_stat (21s) Starting test(python3.8): pyspark.ml.tests.test_tuning Finished test(python3.8): pyspark.ml.tests.test_persistence (45s) Starting test(python3.8): pyspark.ml.tests.test_wrapper Finished test(python3.8): pyspark.ml.tests.test_algorithms (83s) Starting test(python3.8): pyspark.mllib.tests.test_algorithms Finished test(python3.8): pyspark.ml.tests.test_training_summary (32s) Starting test(python3.8): pyspark.mllib.tests.test_feature Finished test(python3.8): pyspark.ml.tests.test_wrapper (20s) Starting test(python3.8): pyspark.mllib.tests.test_linalg Finished test(python3.8): pyspark.mllib.tests.test_feature (32s) 
Starting test(python3.8): pyspark.mllib.tests.test_stat Finished test(python3.8): pyspark.mllib.tests.test_algorithms (70s) Starting test(python3.8): pyspark.mllib.tests.test_streaming_algorithms Finished test(python3.8): pyspark.mllib.tests.test_stat (37s) Starting test(python3.8): pyspark.mllib.tests.test_util Finished test(python3.8): pyspark.mllib.tests.test_linalg (70s) Starting test(python3.8): pyspark.sql.tests.test_arrow Finished test(python3.8): pyspark.sql.tests.test_arrow (1s) ... 53 tests were skipped Starting test(python3.8): pyspark.sql.tests.test_catalog Finished test(python3.8): pyspark.mllib.tests.test_util (15s) Starting test(python3.8): pyspark.sql.tests.test_column Finished test(python3.8): pyspark.sql.tests.test_catalog (24s) Starting test(python3.8): pyspark.sql.tests.test_conf Finished test(python3.8): pyspark.sql.tests.test_column (21s) Starting test(python3.8): pyspark.sql.tests.test_context Finished test(python3.8): pyspark.ml.tests.test_tuning (125s) Starting test(python3.8): pyspark.sql.tests.test_dataframe Finished test(python3.8): pyspark.sql.tests.test_conf (9s) Starting test(python3.8): pyspark.sql.tests.test_datasources Finished test(python3.8): pyspark.sql.tests.test_context (29s) Starting test(python3.8): pyspark.sql.tests.test_functions Finished test(python3.8): pyspark.sql.tests.test_datasources (32s) Starting test(python3.8): pyspark.sql.tests.test_group Finished test(python3.8): pyspark.sql.tests.test_dataframe (39s) ... 3 tests were skipped Starting test(python3.8): pyspark.sql.tests.test_pandas_udf Finished test(python3.8): pyspark.sql.tests.test_pandas_udf (1s) ... 6 tests were skipped Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_cogrouped_map Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_cogrouped_map (0s) ... 14 tests were skipped Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_agg Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_agg (1s) ... 
15 tests were skipped Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_map Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_map (1s) ... 20 tests were skipped Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_scalar Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_scalar (1s) ... 49 tests were skipped Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_window Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_window (1s) ... 14 tests were skipped Starting test(python3.8): pyspark.sql.tests.test_readwriter Finished test(python3.8): pyspark.sql.tests.test_functions (29s) Starting test(python3.8): pyspark.sql.tests.test_serde Finished test(python3.8): pyspark.sql.tests.test_group (20s) Starting test(python3.8): pyspark.sql.tests.test_session Finished test(python3.8): pyspark.mllib.tests.test_streaming_algorithms (126s) Starting test(python3.8): pyspark.sql.tests.test_streaming Finished test(python3.8): pyspark.sql.tests.test_serde (25s) Starting test(python3.8): pyspark.sql.tests.test_types Finished test(python3.8): pyspark.sql.tests.test_readwriter (38s) Starting test(python3.8): pyspark.sql.tests.test_udf Finished test(python3.8): pyspark.sql.tests.test_session (32s) Starting test(python3.8): pyspark.sql.tests.test_utils Finished test(python3.8): pyspark.sql.tests.test_utils (17s) Starting test(python3.8): pyspark.streaming.tests.test_context Finished test(python3.8): pyspark.sql.tests.test_types (45s) Starting test(python3.8): pyspark.streaming.tests.test_dstream Finished test(python3.8): pyspark.sql.tests.test_udf (44s) Starting test(python3.8): pyspark.streaming.tests.test_kinesis Finished test(python3.8): pyspark.streaming.tests.test_kinesis (0s) ... 
2 tests were skipped Starting test(python3.8): pyspark.streaming.tests.test_listener Finished test(python3.8): pyspark.streaming.tests.test_context (28s) Starting test(python3.8): pyspark.tests.test_appsubmit Finished test(python3.8): pyspark.sql.tests.test_streaming (60s) Starting test(python3.8): pyspark.tests.test_broadcast Finished test(python3.8): pyspark.streaming.tests.test_listener (11s) Starting test(python3.8): pyspark.tests.test_conf Finished test(python3.8): pyspark.tests.test_conf (17s) Starting test(python3.8): pyspark.tests.test_context Finished test(python3.8): pyspark.tests.test_broadcast (39s) Starting test(python3.8): pyspark.tests.test_daemon Finished test(python3.8): pyspark.tests.test_daemon (5s) Starting test(python3.8): pyspark.tests.test_join Finished test(python3.8): pyspark.tests.test_context (31s) Starting test(python3.8): pyspark.tests.test_profiler Finished test(python3.8): pyspark.tests.test_join (9s) Starting test(python3.8): pyspark.tests.test_rdd Finished test(python3.8): pyspark.tests.test_profiler (12s) Starting test(python3.8): pyspark.tests.test_readwrite Finished test(python3.8): pyspark.tests.test_readwrite (23s) ... 
3 tests were skipped Starting test(python3.8): pyspark.tests.test_serializers Finished test(python3.8): pyspark.tests.test_appsubmit (94s) Starting test(python3.8): pyspark.tests.test_shuffle Finished test(python3.8): pyspark.streaming.tests.test_dstream (110s) Starting test(python3.8): pyspark.tests.test_taskcontext Finished test(python3.8): pyspark.tests.test_rdd (42s) Starting test(python3.8): pyspark.tests.test_util Finished test(python3.8): pyspark.tests.test_serializers (11s) Starting test(python3.8): pyspark.tests.test_worker Finished test(python3.8): pyspark.tests.test_shuffle (12s) Starting test(python3.8): pyspark.accumulators Finished test(python3.8): pyspark.tests.test_util (7s) Starting test(python3.8): pyspark.broadcast Finished test(python3.8): pyspark.accumulators (8s) Starting test(python3.8): pyspark.conf Finished test(python3.8): pyspark.broadcast (8s) Starting test(python3.8): pyspark.context Finished test(python3.8): pyspark.tests.test_worker (19s) Starting test(python3.8): pyspark.ml.classification Finished test(python3.8): pyspark.conf (4s) Starting test(python3.8): pyspark.ml.clustering Finished test(python3.8): pyspark.context (22s) Starting test(python3.8): pyspark.ml.evaluation Finished test(python3.8): pyspark.tests.test_taskcontext (49s) Starting test(python3.8): pyspark.ml.feature Finished test(python3.8): pyspark.ml.clustering (43s) Starting test(python3.8): pyspark.ml.fpm Finished test(python3.8): pyspark.ml.evaluation (27s) Starting test(python3.8): pyspark.ml.image Finished test(python3.8): pyspark.ml.image (8s) Starting test(python3.8): pyspark.ml.linalg.__init__ Finished test(python3.8): pyspark.ml.linalg.__init__ (0s) Starting test(python3.8): pyspark.ml.recommendation Finished test(python3.8): pyspark.ml.classification (63s) Starting test(python3.8): pyspark.ml.regression Finished test(python3.8): pyspark.ml.fpm (23s) Starting test(python3.8): pyspark.ml.stat Finished test(python3.8): pyspark.ml.stat (30s) Starting 
test(python3.8): pyspark.ml.tuning Finished test(python3.8): pyspark.ml.regression (51s) Starting test(python3.8): pyspark.mllib.classification Finished test(python3.8): pyspark.ml.feature (93s) Starting test(python3.8): pyspark.mllib.clustering Finished test(python3.8): pyspark.ml.tuning (39s) Starting test(python3.8): pyspark.mllib.evaluation Finished test(python3.8): pyspark.mllib.classification (38s) Starting test(python3.8): pyspark.mllib.feature Finished test(python3.8): pyspark.mllib.evaluation (25s) Starting test(python3.8): pyspark.mllib.fpm Finished test(python3.8): pyspark.mllib.clustering (64s) Starting test(python3.8): pyspark.mllib.linalg.__init__ Finished test(python3.8): pyspark.ml.recommendation (131s) Starting test(python3.8): pyspark.mllib.linalg.distributed Finished test(python3.8): pyspark.mllib.linalg.__init__ (0s) Starting test(python3.8): pyspark.mllib.random Finished test(python3.8): pyspark.mllib.feature (36s) Starting test(python3.8): pyspark.mllib.recommendation Finished test(python3.8): pyspark.mllib.fpm (31s) Starting test(python3.8): pyspark.mllib.regression Finished test(python3.8): pyspark.mllib.random (16s) Starting test(python3.8): pyspark.mllib.stat.KernelDensity Finished test(python3.8): pyspark.mllib.stat.KernelDensity (1s) Starting test(python3.8): pyspark.mllib.stat._statistics Finished test(python3.8): pyspark.mllib.stat._statistics (25s) Starting test(python3.8): pyspark.mllib.tree Finished test(python3.8): pyspark.mllib.regression (44s) Starting test(python3.8): pyspark.mllib.util Finished test(python3.8): pyspark.mllib.recommendation (49s) Starting test(python3.8): pyspark.profiler Finished test(python3.8): pyspark.mllib.linalg.distributed (53s) Starting test(python3.8): pyspark.rdd Finished test(python3.8): pyspark.profiler (14s) Starting test(python3.8): pyspark.serializers Finished test(python3.8): pyspark.mllib.tree (30s) Starting test(python3.8): pyspark.shuffle Finished test(python3.8): pyspark.shuffle (2s) Starting 
test(python3.8): pyspark.sql.avro.functions Finished test(python3.8): pyspark.mllib.util (30s) Starting test(python3.8): pyspark.sql.catalog Finished test(python3.8): pyspark.serializers (17s) Starting test(python3.8): pyspark.sql.column Finished test(python3.8): pyspark.rdd (31s) Starting test(python3.8): pyspark.sql.conf Finished test(python3.8): pyspark.sql.conf (7s) Starting test(python3.8): pyspark.sql.context Finished test(python3.8): pyspark.sql.avro.functions (19s) Starting test(python3.8): pyspark.sql.dataframe Finished test(python3.8): pyspark.sql.catalog (16s) Starting test(python3.8): pyspark.sql.functions Finished test(python3.8): pyspark.sql.column (27s) Starting test(python3.8): pyspark.sql.group Finished test(python3.8): pyspark.sql.context (26s) Starting test(python3.8): pyspark.sql.readwriter Finished test(python3.8): pyspark.sql.group (52s) Starting test(python3.8): pyspark.sql.session Finished test(python3.8): pyspark.sql.dataframe (73s) Starting test(python3.8): pyspark.sql.streaming Finished test(python3.8): pyspark.sql.functions (75s) Starting test(python3.8): pyspark.sql.types Finished test(python3.8): pyspark.sql.readwriter (57s) Starting test(python3.8): pyspark.sql.udf Finished test(python3.8): pyspark.sql.types (13s) Starting test(python3.8): pyspark.sql.window Finished test(python3.8): pyspark.sql.session (32s) Starting test(python3.8): pyspark.streaming.util Finished test(python3.8): pyspark.streaming.util (1s) Starting test(python3.8): pyspark.util Finished test(python3.8): pyspark.util (0s) Finished test(python3.8): pyspark.sql.streaming (30s) Finished test(python3.8): pyspark.sql.udf (27s) Finished test(python3.8): pyspark.sql.window (22s) Tests passed in 855 seconds ``` </p> </details> Closes #26194 from HyukjinKwon/SPARK-29536. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-22 16:18:34 +09:00
shahid	4a6005c795	[SPARK-29235][ML][PYSPARK] Support avgMetrics in read/write of CrossValidatorModel ### What changes were proposed in this pull request? Currently pyspark doesn't write/read `avgMetrics` in `CrossValidatorModel`, whereas scala supports it. ### Why are the changes needed? Test step to reproduce it: ``` dataset = spark.createDataFrame([(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) lr = LogisticRegression() grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator,parallelism=2) cvModel = cv.fit(dataset) cvModel.write().save("/tmp/model") cvModel2 = CrossValidatorModel.read().load("/tmp/model") print(cvModel.avgMetrics) # prints non empty result as expected print(cvModel2.avgMetrics) # Bug: prints an empty result. ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually tested Before patch: ``` >>> cvModel.write().save("/tmp/model_0") >>> cvModel2 = CrossValidatorModel.read().load("/tmp/model_0") >>> print(cvModel2.avgMetrics) [] ``` After patch: ``` >>> cvModel2 = CrossValidatorModel.read().load("/tmp/model_2") >>> print(cvModel2.avgMetrics[0]) 0.5 ``` Closes #26038 from shahidki31/avgMetrics. Authored-by: shahid <shahidki31@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-19 15:23:57 -05:00
zhengruifeng	dba673f0e3	[SPARK-29489][ML][PYSPARK] ml.evaluation support log-loss ### What changes were proposed in this pull request? `ml.MulticlassClassificationEvaluator` & `mllib.MulticlassMetrics` support log-loss ### Why are the changes needed? log-loss is an important classification metric and is widely used in practice ### Does this PR introduce any user-facing change? Yes, add new option ("logloss") and a related param `eps` ### How was this patch tested? added test suites & local tests referring to sklearn Closes #26135 from zhengruifeng/logloss. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-10-18 17:57:13 +08:00
Huaxin Gao	6f8c001c8d	[SPARK-29381][FOLLOWUP][PYTHON][ML] Add 'private' _XXXParams classes for classification & regression ### What changes were proposed in this pull request? Add private _XXXParams classes for classification & regression ### Why are the changes needed? To keep parity between scala and python ### Does this PR introduce any user-facing change? Yes. Add getters/setters for the following Model classes ``` LinearSVCModel: get/setRegParam get/setMaxIter get/setFitIntercept get/setTol get/setStandardization get/setWeightCol get/setAggregationDepth get/setThreshold LogisticRegressionModel: get/setRegParam get/setElasticNetParam get/setMaxIter get/setFitIntercept get/setTol get/setStandardization get/setWeightCol get/setAggregationDepth get/setThreshold NaiveBayesModel: get/setWeightCol LinearRegressionModel: get/setRegParam get/setElasticNetParam get/setMaxIter get/setTol get/setFitIntercept get/setStandardization get/setWeightCol get/setSolver get/setAggregationDepth get/setLoss GeneralizedLinearRegressionModel: get/setFitIntercept get/setMaxIter get/setTol get/setRegParam get/setWeightCol get/setSolver ``` ### How was this patch tested? Add a few doctest Closes #26142 from huaxingao/spark-29381. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-10-18 17:26:54 +08:00
Huaxin Gao	901ff92969	[SPARK-29464][PYTHON][ML] PySpark ML should expose Params.clear() to unset a user supplied Param ### What changes were proposed in this pull request? change PySpark ml ```Params._clear``` to ```Params.clear``` ### Why are the changes needed? PySpark ML currently has a private _clear() method that will unset a param. This should be made public to match the Scala API and give users a way to unset a user supplied param. ### Does this PR introduce any user-facing change? Yes. PySpark ml ```Params._clear``` ---> ```Params.clear``` ### How was this patch tested? Add test. Closes #26130 from huaxingao/spark-29464. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-10-17 17:02:31 -07:00
zhengruifeng	9dacdd38b3	[SPARK-23578][ML][PYSPARK] Binarizer support multi-column ### What changes were proposed in this pull request? Binarizer support multi-column by extending `HasInputCols`/`HasOutputCols`/`HasThreshold`/`HasThresholds` ### Why are the changes needed? similar algs in `ml.feature` already support multi-column, like `Bucketizer`/`StringIndexer`/`QuantileDiscretizer` ### Does this PR introduce any user-facing change? yes, add setter/getter of `thresholds`/`inputCols`/`outputCols` ### How was this patch tested? added suites Closes #26064 from zhengruifeng/binarizer_multicols. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-10-16 18:32:07 +08:00
Jeff Evans	95de93b24e	[SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters Moving univocity-parsers version to spark-parent pom dependencyManagement section Adding new utility method to build multi-char delimiter string, which delegates to existing one Adding tests for multiple character delimited CSV ### What changes were proposed in this pull request? Adds support for parsing CSV data using multiple-character delimiters. Existing logic for converting the input delimiter string to characters was kept and invoked in a loop. Project dependencies were updated to remove redundant declaration of `univocity-parsers` version, and also to change that version to the latest. ### Why are the changes needed? It is quite common for people to have delimited data, where the delimiter is not a single character, but rather a sequence of characters. Currently, it is difficult to handle such data in Spark (typically needs pre-processing). ### Does this PR introduce any user-facing change? Yes. Specifying the "delimiter" option for the DataFrame read, and providing more than one character, will no longer result in an exception. Instead, it will be converted as before and passed to the underlying library (Univocity), which has accepted multiple character delimiters since 2.8.0. ### How was this patch tested? The `CSVSuite` tests were confirmed passing (including new methods), and `sbt` tests for `sql` were executed. Closes #26027 from jeff303/SPARK-24540. Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-15 15:44:51 -05:00
Huaxin Gao	cfcaf528cd	[SPARK-29381][PYTHON][ML] Add _ before the XXXParams classes ### What changes were proposed in this pull request? Add _ before XXXParams classes to indicate internal usage ### Why are the changes needed? Follow the PEP 8 convention to use _single_leading_underscore to indicate internal use ### Does this PR introduce any user-facing change? No ### How was this patch tested? use existing tests Closes #26103 from huaxingao/spark-29381. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-14 10:52:23 -05:00
Huaxin Gao	67e1360bad	[SPARK-29377][PYTHON][ML] Parity between Scala ML tuning and Python ML tuning ### What changes were proposed in this pull request? Follow Scala ml tuning implementation - put leading underscore before python ```ValidatorParams``` to indicate private - add ```_CrossValidatorParams``` and ```_TrainValidationSplitParams``` - separate the getters and setters. Put getters in _XXXParams and setters in the Classes. ### Why are the changes needed? Keep parity between scala and python ### Does this PR introduce any user-facing change? add ```CrossValidatorModel.getNumFolds``` and ```TrainValidationSplitModel.getTrainRatio()``` ### How was this patch tested? Add doctest Closes #26057 from huaxingao/spark-tuning. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-10-14 14:28:31 +08:00
Huaxin Gao	81362956a7	[SPARK-29116][PYTHON][ML] Refactor py classes related to DecisionTree ### What changes were proposed in this pull request? - Move tree related classes to a separate file ```tree.py``` - add method ```predictLeaf``` in ```DecisionTreeModel```& ```TreeEnsembleModel``` ### Why are the changes needed? - keep parity between scala and python - easy code maintenance ### Does this PR introduce any user-facing change? Yes add method ```predictLeaf``` in ```DecisionTreeModel```& ```TreeEnsembleModel``` add ```setMinWeightFractionPerNode``` in ```DecisionTreeClassifier``` and ```DecisionTreeRegressor``` ### How was this patch tested? add some doc tests Closes #25929 from huaxingao/spark_29116. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-10-12 22:13:50 +08:00
Bryan Cutler	beb8d2f8ad	[SPARK-29402][PYTHON][TESTS] Added tests for grouped map pandas_udf with window ### What changes were proposed in this pull request? Added tests for grouped map pandas_udf using a window. ### Why are the changes needed? Current tests for grouped map do not use a window and this had previously caused an error due to the window range being a struct column, which was not yet supported. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New tests added. Closes #26063 from BryanCutler/pyspark-pandas_udf-group-with-window-tests-SPARK-29402. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-10-11 16:19:13 -07:00
Huaxin Gao	ffddfc8584	[SPARK-29269][PYTHON][ML] Pyspark ALSModel support getters/setters ### What changes were proposed in this pull request? Add getters/setters in Pyspark ALSModel. ### Why are the changes needed? To keep parity between python and scala. ### Does this PR introduce any user-facing change? Yes. add the following getters/setters to ALSModel ``` get/setUserCol get/setItemCol get/setColdStartStrategy get/setPredictionCol ``` ### How was this patch tested? add doctest Closes #25947 from huaxingao/spark-29269. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-10-08 14:05:09 +08:00
Huaxin Gao	2399134456	[SPARK-29143][PYTHON][ML] Pyspark feature models support column setters/getters ### What changes were proposed in this pull request? add column setters/getters support in Pyspark feature models ### Why are the changes needed? keep parity between Pyspark and Scala ### Does this PR introduce any user-facing change? Yes. After the change, Pyspark feature models have column setters/getters support. ### How was this patch tested? Add some doctests Closes #25908 from huaxingao/spark-29143. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-07 10:55:48 -05:00
Huaxin Gao	bd213a0850	[SPARK-29360][PYTHON][ML] PySpark FPGrowthModel supports getter/setter ### What changes were proposed in this pull request? ### Why are the changes needed? Keep parity between Scala and Python ### Does this PR introduce any user-facing change? add the following getters/setter to FPGrowthModel ``` getMinSupport getNumPartitions getMinConfidence getItemsCol getPredictionCol setItemsCol setMinConfidence setPredictionCol ``` add following getters/setters to PrefixSpan ``` set/getMinSupport set/getMaxPatternLength set/getMaxLocalProjDBSize set/getSequenceCol ``` ### How was this patch tested? add doctest Closes #26035 from huaxingao/spark-29360. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-07 10:53:59 -05:00
zero323	8556710409	[SPARK-28985][PYTHON][ML][FOLLOW-UP] Add _IsotonicRegressionBase ### What changes were proposed in this pull request? Adds ```python class _IsotonicRegressionBase(HasFeaturesCol, HasLabelCol, HasPredictionCol, HasWeightCol): ... ``` with related `Params` and uses it to replace `JavaPredictor` and `HasWeightCol` in `IsotonicRegression` base classes and `JavaPredictionModel` in `IsotonicRegressionModel` base classes. ### Why are the changes needed? Previous work (#25776) on [SPARK-28985](https://issues.apache.org/jira/browse/SPARK-28985) replaced `JavaEstimator`, `HasFeaturesCol`, `HasLabelCol`, `HasPredictionCol` in `IsotonicRegression` and `JavaModel` in `IsotonicRegressionModel` with newly added `JavaPredictor`: `e97b55d322/python/pyspark/ml/wrapper.py (L377)` and `JavaPredictionModel` `e97b55d322/python/pyspark/ml/wrapper.py (L405)` respectively. This however is inconsistent with Scala counterpart where both classes extend private `IsotonicRegressionBase` `3cb1b57809/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala (L42-L43)` This preserves some of the existing inconsistencies (`model` as defined in [the official example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/isotonic_regression_example.py)), i.e. ```python from pyspark.ml.regression import IsotonicRegressionModel from pyspark.ml.param.shared import HasWeightCol issubclass(IsotonicRegressionModel, HasWeightCol) # False hasattr(model, "weightCol") # True ``` as well as introduces a bug, by adding unsupported `predict` method: ```python import inspect hasattr(model, "predict") # True inspect.getfullargspec(IsotonicRegressionModel.predict) # FullArgSpec(args=['self', 'value'], varargs=None, varkw=None, defaults=None, kwonlyargs=[], kwonlydefaults=None, annotations={}) IsotonicRegressionModel.predict.__doc__ # Predict label for the given features.\n\n .. 
versionadded:: 3.0.0' model.predict(dataset.first().features) # Py4JError: An error occurred while calling o49.predict. Trace: # py4j.Py4JException: Method predict([class org.apache.spark.ml.linalg.SparseVector]) does not exist # ... ``` Furthermore, the existing implementation can cause further problems in the future, if `Predictor` / `PredictionModel` API changes. ### Does this PR introduce any user-facing change? Yes. It: - Removes invalid `IsotonicRegressionModel.predict` method. - Adds `HasWeightCol` to `IsotonicRegressionModel`. However, the faulty implementation hasn't been released yet, and the proposed additions have negligible potential for breaking existing code (and none, compared to changes already made in #25776). ### How was this patch tested? - Existing unit tests. - Manual testing. CC huaxingao, zhengruifeng Closes #26023 from zero323/SPARK-28985-FOLLOW-UP-isotonic-regression. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-04 18:06:10 -05:00
zero323	df22535bbd	[SPARK-28985][PYTHON][ML][FOLLOW-UP] Add _AFTSurvivalRegressionParams ### What changes were proposed in this pull request? Adds ```python _AFTSurvivalRegressionParams(HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter, HasTol, HasFitIntercept, HasAggregationDepth): ... ``` with related Params and uses it to replace `HasFitIntercept`, `HasMaxIter`, `HasTol` and `HasAggregationDepth` in `AFTSurvivalRegression` base classes and `JavaPredictionModel,` in `AFTSurvivalRegressionModel` base classes. ### Why are the changes needed? Previous work (#25776) on [SPARK-28985](https://issues.apache.org/jira/browse/SPARK-28985) replaced `JavaEstimator`, `HasFeaturesCol`, `HasLabelCol`, `HasPredictionCol` in `AFTSurvivalRegression` and `JavaModel` in `AFTSurvivalRegressionModel` with newly added `JavaPredictor`: `e97b55d322/python/pyspark/ml/wrapper.py (L377)` and `JavaPredictionModel` `e97b55d322/python/pyspark/ml/wrapper.py (L405)` respectively. This however is inconsistent with Scala counterpart where both classes extend private `AFTSurvivalRegressionBase` `eb037a8180/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala (L48-L50)` This preserves some of the existing inconsistencies (variables as defined in [the official example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/aft_survival_regression.p)) ``` from pyspark.ml.regression import AFTSurvivalRegression, AFTSurvivalRegressionModel from pyspark.ml.param.shared import HasMaxIter, HasTol, HasFitIntercept, HasAggregationDepth from pyspark.ml.param import Param issubclass(AFTSurvivalRegressionModel, HasMaxIter) # False hasattr(model, "maxIter") and isinstance(model.maxIter, Param) # True issubclass(AFTSurvivalRegressionModel, HasTol) # False hasattr(model, "tol") and isinstance(model.tol, Param) # True ``` and can cause problems in the future, if Predictor / PredictionModel API changes (unlike 
[`IsotonicRegression`](https://github.com/apache/spark/pull/26023), current implementation is technically speaking correct, though incomplete). ### Does this PR introduce any user-facing change? Yes, it adds a number of base classes to `AFTSurvivalRegressionModel`. These changes are purely additive and have negligible potential for breaking existing code (and none, compared to changes already made in #25776). Additionally, the affected API hasn't been released in its current form yet. ### How was this patch tested? - Existing unit tests. - Manual testing. CC huaxingao, zhengruifeng Closes #26024 from zero323/SPARK-28985-FOLLOW-UP-aftsurival-regression. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-04 18:04:21 -05:00
HyukjinKwon	20ee2f5dcb	[SPARK-29286][PYTHON][TESTS] Uses UTF-8 with 'replace' on errors at Python testing script ### What changes were proposed in this pull request? This PR proposes to let Python 2 use UTF-8 instead of ASCII in the Python testing script, permissively replacing non-UTF-8 bytes rather than failing on them. ### Why are the changes needed? When Python 2 is used to run the Python testing script with `decode(encoding='ascii')`, it fails whenever non-ASCII characters are printed out. ### Does this PR introduce any user-facing change? For developers, it enables printing non-ASCII characters. ### How was this patch tested? Jenkins will test it against our existing test code. Also, manually tested with UTF-8 output. Closes #26021 from HyukjinKwon/SPARK-29286. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-04 10:04:28 -07:00
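The difference between strict ASCII decoding and UTF-8 decoding with the `replace` error handler, as described in the entry above, can be sketched in plain Python 3. This is only an illustration, not the testing script itself; the byte string `raw` is made up for the example.

```python
# Bytes that are valid UTF-8 for "中" followed by one byte (0xff) that is
# never valid in UTF-8. The value is invented for this illustration.
raw = b"\xe4\xb8\xad\xff"

# Strict ASCII decoding fails as soon as a non-ASCII byte appears.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# UTF-8 with errors="replace" never raises: each undecodable byte becomes
# the Unicode replacement character U+FFFD.
decoded = raw.decode("utf-8", errors="replace")

# ascii() keeps the printout safe even on an ASCII-only terminal.
print(ascii_failed, ascii(decoded))
```

The trade-off is that `replace` silently loses the offending bytes, which is usually acceptable for test log output but not for data that must round-trip.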
Liang-Chi Hsieh	2bc3fff13b	[SPARK-29341][PYTHON] Upgrade cloudpickle to 1.0.0 ### What changes were proposed in this pull request? This patch upgrades cloudpickle to 1.0.0 version. Main changes: 1. cleanup unused functions: `936f16fac8` 2. Fix relative imports inside function body: `31ecdd6f57` 3. Write kw only arguments to pickle: `6cb4718528` ### Why are the changes needed? We should include new bug fixes such as `6cb4718528`, because users might use such Python functions in PySpark. ```python >>> def f(a, *, b=1): ... return a + b ... >>> rdd = sc.parallelize([1, 2, 3]) >>> rdd.map(f).collect() [Stage 0:> (0 + 12) / 12]19/10/03 00:42:24 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/spark/python/lib/pyspark.zip/pyspark/worker.py", line 598, in main process() File "/spark/python/lib/pyspark.zip/pyspark/worker.py", line 590, in process serializer.dump_stream(out_iter, outfile) File "/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 513, in dump_stream vs = list(itertools.islice(iterator, batch)) File "/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper return f(*args, **kwargs) TypeError: f() missing 1 required keyword-only argument: 'b' ``` After: ```python >>> def f(a, *, b=1): ... return a + b ... >>> rdd = sc.parallelize([1, 2, 3]) >>> rdd.map(f).collect() [2, 3, 4] ``` ### Does this PR introduce any user-facing change? Yes. This fixes two bugs when pickling Python functions. ### How was this patch tested? Existing tests. Closes #26009 from viirya/upgrade-cloudpickle. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-03 19:20:51 +09:00
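The class of bug behind the `TypeError` in the entry above can be reproduced without Spark or cloudpickle. The sketch below is not cloudpickle's code: the helper `rebuild` is hypothetical and only mimics what any function serializer has to do, namely reconstruct a function from its parts. If the keyword-only defaults (`__kwdefaults__`) are not carried over, the rebuilt function loses the default for `b`.

```python
import types


def f(a, *, b=1):
    """Function with a keyword-only argument that has a default."""
    return a + b


def rebuild(func, keep_kwdefaults):
    """Hypothetical helper: recreate `func` from its parts.

    types.FunctionType accepts positional defaults but has no parameter for
    keyword-only defaults, so they must be copied separately.
    """
    new = types.FunctionType(
        func.__code__,
        func.__globals__,
        func.__name__,
        func.__defaults__,
        func.__closure__,
    )
    if keep_kwdefaults:
        new.__kwdefaults__ = func.__kwdefaults__
    return new


broken = rebuild(f, keep_kwdefaults=False)
fixed = rebuild(f, keep_kwdefaults=True)

try:
    broken(1)
    error = None
except TypeError as exc:
    # Same kind of failure as in the traceback: the default for `b` is gone.
    error = str(exc)

result = [fixed(x) for x in [1, 2, 3]]
print(error)
print(result)
```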
zero323	858bf76e35	[SPARK-29142][PYTHON][ML][FOLLOWUP][DOC] Replace incorrect :py:attr: applications ### What changes were proposed in this pull request? This PR replaces some references with correct ones (`:py:class:`). ### Why are the changes needed? Newly added mixins from the original PR incorrectly reference classes with `:py:attr:`. While these classes are marked as internal, and not rendered in the standard documentation, it still makes sense to use correct roles. ### Does this PR introduce any user-facing change? No. The changed part is not a part of generated PySpark documents. ### How was this patch tested? Since this PR is a kind of typo fix, manually checking the patch. We can build document for compilation test although there is no UI change. Closes #26004 from zero323/SPARK-29142-FOLLOWUP. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-03 02:45:44 -07:00
Chris Martin	76791b89f5	[SPARK-27463][PYTHON][FOLLOW-UP] Miscellaneous documentation and code cleanup of cogroup pandas UDF Follow up from https://github.com/apache/spark/pull/24981 incorporating some comments from HyukjinKwon. Specifically: - Adding `CoGroupedData` to `pyspark/sql/__init__.py __all__` so that documentation is generated. - Added pydoc, including example, for the use case whereby the user supplies a cogrouping function including a key. - Added the boilerplate for doctests to cogroup.py. Note that cogroup.py only contains the apply() function which has doctests disabled as per the other Pandas Udfs. - Restricted the newly exposed RelationalGroupedDataset constructor parameters to access only by the sql package. - Some minor formatting tweaks. This was tested by running the appropriate unit tests. I'm unsure as to how to check that my change will cause the documentation to be generated correctly, but if someone can describe how I can do this I'd be happy to check. Closes #25939 from d80tb7/SPARK-27463-fixes. Authored-by: Chris Martin <chris@cmartinit.co.uk> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-30 22:25:35 +09:00
HyukjinKwon	fda0e6e48d	[SPARK-29240][PYTHON] Pass Py4J column instance to support PySpark column in element_at function ### What changes were proposed in this pull request? This PR makes `element_at` in PySpark able to take PySpark `Column` instances. ### Why are the changes needed? To match with Scala side. Seems it was intended but not working correctly as a bug. ### Does this PR introduce any user-facing change? Yes. See below: ```python from pyspark.sql import functions as F x = spark.createDataFrame([([1,2,3],1),([4,5,6],2),([7,8,9],3)],['list','num']) x.withColumn('aa',F.element_at('list',x.num.cast('int'))).show() ``` Before: ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/functions.py", line 2059, in element_at return Column(sc._jvm.functions.element_at(_to_java_column(col), extraction)) File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1277, in __call__ File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1241, in _build_args File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1228, in _get_args File "/.../forked/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_collections.py", line 500, in convert File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__ raise TypeError("Column is not iterable") TypeError: Column is not iterable ``` After: ``` +---------+---+---+ \| list\|num\| aa\| +---------+---+---+ \|[1, 2, 3]\| 1\| 1\| \|[4, 5, 6]\| 2\| 5\| \|[7, 8, 9]\| 3\| 9\| +---------+---+---+ ``` ### How was this patch tested? Manually tested against literal, Python native types, and PySpark column. Closes #25950 from HyukjinKwon/SPARK-29240. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-27 11:04:55 -07:00
zhengruifeng	aed7ff36f7	[SPARK-29258][ML][PYSPARK] parity between ml.evaluator and mllib.metrics ### What changes were proposed in this pull request? 1, expose `BinaryClassificationMetrics.numBins` in `BinaryClassificationEvaluator` 2, expose `RegressionMetrics.throughOrigin` in `RegressionEvaluator` 3, add metric `explainedVariance` in `RegressionEvaluator` ### Why are the changes needed? Existing functionality in mllib.metrics should also be exposed in ml. ### Does this PR introduce any user-facing change? Yes, this PR adds two expert params and one metric option. ### How was this patch tested? existing and added tests Closes #25940 from zhengruifeng/evaluator_add_param. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-09-27 13:30:03 +08:00
Huaxin Gao	bdc4943b9e	[SPARK-29142][PYTHON][ML] Pyspark clustering models support column setters/getters/predict ### What changes were proposed in this pull request? Add the following Params classes in Pyspark clustering ```GaussianMixtureParams``` ```KMeansParams``` ```BisectingKMeansParams``` ```LDAParams``` ```PowerIterationClusteringParams``` ### Why are the changes needed? To be consistent with scala side ### Does this PR introduce any user-facing change? Yes. Add the following changes: ``` GaussianMixtureModel - get/setMaxIter - get/setFeaturesCol - get/setSeed - get/setPredictionCol - get/setProbabilityCol - get/setTol - predict ``` ``` KMeansModel - get/setMaxIter - get/setFeaturesCol - get/setSeed - get/setPredictionCol - get/setDistanceMeasure - get/setTol - predict ``` ``` BisectingKMeansModel - get/setMaxIter - get/setFeaturesCol - get/setSeed - get/setPredictionCol - get/setDistanceMeasure - predict ``` ``` LDAModel(HasMaxIter, HasFeaturesCol, HasSeed, HasCheckpointInterval): - get/setMaxIter - get/setFeaturesCol - get/setSeed - get/setCheckpointInterval ``` ### How was this patch tested? Add doctests Closes #25859 from huaxingao/spark-29142. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-09-27 11:19:02 +08:00
sheepstop	81de9d3c29	[SPARK-28678][DOC] Specify that array indices start at 1 for function slice in R Scala Python ### What changes were proposed in this pull request? Added the note "array indices start at 1" to the documentation of the function slice in the R, Scala, and Python components, to make its usage clear. ### Why are the changes needed? The function throws an exception if the start value is 0, whereas array indices start at 0 in most other contexts. ### Does this PR introduce any user-facing change? Yes, more info provided to user. ### How was this patch tested? No tests added, only doc change. Closes #25704 from sheepstop/master. Authored-by: sheepstop <yangting617@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-24 18:57:54 +09:00
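The 1-based convention noted in the entry above can be sketched in plain Python. The helper `slice_one_based` is hypothetical and only models the documented behaviour (indices start at 1, a start of 0 is an error). Counting from the end for a negative start and returning an empty list for an out-of-range start are assumptions made for this sketch, not something the commit message states.

```python
def slice_one_based(values, start, length):
    """Hypothetical model of a 1-based slice over a Python list.

    start > 0 counts from the front (1 is the first element);
    start < 0 counts from the end (assumption for this sketch);
    start == 0 is rejected, mirroring the exception the entry mentions.
    """
    if start == 0:
        raise ValueError("start must not be 0: array indices start at 1")
    if length < 0:
        raise ValueError("length must be non-negative")
    # Translate the 1-based (or negative) start into a 0-based offset.
    begin = start - 1 if start > 0 else len(values) + start
    if begin < 0 or begin >= len(values):
        return []  # assumption: an out-of-range start yields an empty result
    return values[begin:begin + length]


front = slice_one_based([1, 2, 3, 4], 2, 2)
back = slice_one_based([1, 2, 3, 4], -2, 2)
print(front, back)
```

A caller used to 0-based indexing would write `start=0` for the first element, which is exactly the case the documentation note is meant to prevent.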
Xianjin YE	8c8016a152	[SPARK-21045][PYTHON] Allow non-ascii string as an exception message from python execution in Python 2 ### What changes were proposed in this pull request? This PR allows non-ascii string as an exception message in Python 2 by explicitly en/decoding in case of `str` in Python 2. ### Why are the changes needed? Previously PySpark will hang when the `UnicodeDecodeError` occurs and the real exception cannot be passed to the JVM side. See the reproducer as below: ```python def f(): raise Exception("中") spark = SparkSession.builder.master('local').getOrCreate() spark.sparkContext.parallelize([1]).map(lambda x: f()).count() ``` ### Does this PR introduce any user-facing change? User may not observe hanging for the similar cases. ### How was this patch tested? Added a new test and manually checking. This pr is based on #18324, credits should also go to dataknocker. To make lint-python happy for python3, it also includes a followup fix for #25814 Closes #25847 from advancedxy/python_exception_19926_and_21045. Authored-by: Xianjin YE <advancedxy@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-21 08:09:19 +09:00
Holden Karau	42050c3f4f	[SPARK-27659][PYTHON] Allow PySpark to prefetch during toLocalIterator ### What changes were proposed in this pull request? This PR allows Python toLocalIterator to prefetch the next partition while the first partition is being collected. The PR also adds a demo micro benchmark in the examples directory; we may wish to keep this or not. ### Why are the changes needed? In https://issues.apache.org/jira/browse/SPARK-23961 / `5e79ae3b40` we changed PySpark to only pull one partition at a time. This is memory efficient, but if partitions take time to compute this can mean we're spending more time blocking. ### Does this PR introduce any user-facing change? A new param is added to toLocalIterator ### How was this patch tested? A new unit test inside of `test_rdd.py` checks the time at which the elements are evaluated. Another test, checking that the results remain the same, is added to `test_dataframe.py`. I also ran a micro benchmark in the examples directory `prefetch.py` which shows an improvement of ~40% in this specific use case. > > 19/08/16 17:11:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). > Running timers: > > [Stage 32:> (0 + 1) / 1] > Results: > > Prefetch time: > > 100.228110831 > > > Regular time: > > 188.341721614 > > > Closes #25515 from holdenk/SPARK-27659-allow-pyspark-tolocalitr-to-prefetch. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: Holden Karau <hkarau@apple.com>	2019-09-20 09:59:31 -07:00
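The prefetching idea in the entry above (start computing the next partition while the current one is being consumed) can be sketched without Spark. Everything here is an assumption for illustration: `local_iterator` is a hypothetical helper, partitions are modelled as zero-argument callables returning lists, and a single background thread stands in for the job that computes the next partition. It is not how PySpark implements the feature.

```python
from concurrent.futures import ThreadPoolExecutor


def local_iterator(partition_fns, prefetch=False):
    """Yield items partition by partition, optionally prefetching one ahead.

    partition_fns: list of zero-argument callables, each returning the items
    of one "partition" (a stand-in for a computation that takes time).
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for i, compute in enumerate(partition_fns):
            # Use the prefetched result if there is one, else compute now.
            current = pending.result() if pending is not None else compute()
            pending = None
            if prefetch and i + 1 < len(partition_fns):
                # Start the next partition in the background before yielding.
                pending = pool.submit(partition_fns[i + 1])
            for item in current:
                yield item


partitions = [lambda: [1, 2], lambda: [3], lambda: [4, 5, 6]]
regular = list(local_iterator(partitions, prefetch=False))
prefetched = list(local_iterator(partitions, prefetch=True))
print(regular, prefetched)
```

Prefetching keeps at most two partitions in memory at once, which is the memory-versus-blocking trade-off the entry describes.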
Huaxin Gao	e97b55d322	[SPARK-28985][PYTHON][ML] Add common classes (JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier) in PYTHON ### What changes were proposed in this pull request? Add some common classes in Python to make it have the same structure as Scala 1. Scala has ClassifierParams/Classifier/ClassificationModel: ``` trait ClassifierParams extends PredictorParams with HasRawPredictionCol abstract class Classifier extends Predictor with ClassifierParams { def setRawPredictionCol } abstract class ClassificationModel extends PredictionModel with ClassifierParams { def setRawPredictionCol } ``` This PR makes Python has the following: ``` class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams): pass class JavaClassifier(JavaPredictor, JavaClassifierParams): def setRawPredictionCol class JavaClassificationModel(JavaPredictionModel, JavaClassifierParams): def setRawPredictionCol ``` 2. Scala has ProbabilisticClassifierParams/ProbabilisticClassifier/ProbabilisticClassificationModel: ``` trait ProbabilisticClassifierParams extends ClassifierParams with HasProbabilityCol with HasThresholds abstract class ProbabilisticClassifier extends Classifier with ProbabilisticClassifierParams { def setProbabilityCol def setThresholds } abstract class ProbabilisticClassificationModel extends ClassificationModel with ProbabilisticClassifierParams { def setProbabilityCol def setThresholds } ``` This PR makes Python have the following: ``` class JavaProbabilisticClassifierParams(HasProbabilityCol, HasThresholds, JavaClassifierParams): pass class JavaProbabilisticClassifier(JavaClassifier, JavaProbabilisticClassifierParams): def setProbabilityCol def setThresholds class JavaProbabilisticClassificationModel(JavaClassificationModel, JavaProbabilisticClassifierParams): def setProbabilityCol def setThresholds ``` 3. 
Scala has PredictorParams/Predictor/PredictionModel: ``` trait PredictorParams extends Params with HasLabelCol with HasFeaturesCol with HasPredictionCol abstract class Predictor extends Estimator with PredictorParams { def setLabelCol def setFeaturesCol def setPredictionCol } abstract class PredictionModel extends Model with PredictorParams { def setFeaturesCol def setPredictionCol def numFeatures def predict } ``` This PR makes Python have the following: ``` class JavaPredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol): pass class JavaPredictor(JavaEstimator, JavaPredictorParams): def setLabelCol def setFeaturesCol def setPredictionCol class JavaPredictionModel(JavaModel, JavaPredictorParams): def setFeaturesCol def setPredictionCol def numFeatures def predict ``` ### Why are the changes needed? Have parity between Python and Scala ML ### Does this PR introduce any user-facing change? Yes. Add the following changes: ``` LinearSVCModel - get/setFeatureCol - get/setPredictionCol - get/setLabelCol - get/setRawPredictionCol - predict ``` ``` LogisticRegressionModel DecisionTreeClassificationModel RandomForestClassificationModel GBTClassificationModel NaiveBayesModel MultilayerPerceptronClassificationModel - get/setFeatureCol - get/setPredictionCol - get/setLabelCol - get/setRawPredictionCol - get/setProbabilityCol - predict ``` ``` LinearRegressionModel IsotonicRegressionModel DecisionTreeRegressionModel RandomForestRegressionModel GBTRegressionModel AFTSurvivalRegressionModel GeneralizedLinearRegressionModel - get/setFeatureCol - get/setPredictionCol - get/setLabelCol - predict ``` ### How was this patch tested? Add a few doc tests. Closes #25776 from huaxingao/spark-28985. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-19 08:17:25 -05:00
Huaxin Gao	db9e0fda6b	[SPARK-22796][PYTHON][ML] Add multiple columns support to PySpark QuantileDiscretizer ### What changes were proposed in this pull request? Add multiple columns support to PySpark QuantileDiscretizer ### Why are the changes needed? Multiple columns support for QuantileDiscretizer was in scala side a while ago. We need to add multiple columns support to python too. ### Does this PR introduce any user-facing change? Yes. New Python is added ### How was this patch tested? Add doctest Closes #25812 from huaxingao/spark-22796. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2019-09-18 12:16:06 -07:00
Xianjin YE	203bf9e569	[SPARK-19926][PYSPARK] make captured exception from JVM side user friendly ### What changes were proposed in this pull request? The str of `CapturedException` is now returned by str(self.desc) rather than repr(self.desc), which is more user-friendly. It also handles unicode under python2 specially. ### Why are the changes needed? This is an improvement that makes exceptions more human-readable on the Python side. ### Does this PR introduce any user-facing change? Before this PR, selecting `中文字段` throws an exception like the one below: ``` Traceback (most recent call last): File "/Users/advancedxy/code_workspace/github/spark/python/pyspark/sql/tests/test_utils.py", line 34, in test_capture_user_friendly_exception raise e AnalysisException: u"cannot resolve '`\u4e2d\u6587\u5b57\u6bb5`' given input columns: []; line 1 pos 7;\n'Project ['\u4e2d\u6587\u5b57\u6bb5]\n+- OneRowRelation\n" ``` After this PR: ``` Traceback (most recent call last): File "/Users/advancedxy/code_workspace/github/spark/python/pyspark/sql/tests/test_utils.py", line 34, in test_capture_user_friendly_exception raise e AnalysisException: cannot resolve '`中文字段`' given input columns: []; line 1 pos 7; 'Project ['中文字段] +- OneRowRelation ``` ### How was this patch tested? Added a new test to verify that unicode is correctly converted, plus manual checks of thrown exceptions. This PR's credits should go to uncleGen and is based on https://github.com/apache/spark/pull/17267 Closes #25814 from advancedxy/python_exception_19926_and_21045. Authored-by: Xianjin YE <advancedxy@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-18 23:32:10 +09:00
Liang-Chi Hsieh	12e1583093	[SPARK-28927][ML] Rethrow block mismatch exception in ALS when input data is nondeterministic ### What changes were proposed in this pull request? Fitting an ALS model can fail due to nondeterministic input data. Currently the failure surfaces as an ArrayIndexOutOfBoundsException, which does not explain to end users what went wrong during fitting. This patch catches this exception and rethrows a more explanatory one when the input data is nondeterministic. Because we may not know exactly the output deterministic level of RDDs produced by user code, this patch also adds a note to the Scala/Python/R ALS documentation about the deterministic level of the training data. ### Why are the changes needed? An ArrayIndexOutOfBoundsException was observed while fitting an ALS model. It was caused by a mismatch between in/out user/item blocks while computing ratings. If the training RDD output is nondeterministic, then when a fetch failure happens, rerunning part of the training RDD can produce inconsistent user/item blocks. This patch is needed to notify users that ALS is being fitted on nondeterministic input. ### Does this PR introduce any user-facing change? Yes. When fitting an ALS model on nondeterministic input data, previously if a rerun happened, users would see an ArrayIndexOutOfBoundsException caused by a mismatch between In/Out user/item blocks. After this patch, a SparkException with a clearer message is thrown, wrapping the original ArrayIndexOutOfBoundsException. ### How was this patch tested? Tested on a development cluster. Closes #25789 from viirya/als-indeterminate-input. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-18 09:22:13 -05:00
Chris Martin	05988b256e	[SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs ### What changes were proposed in this pull request? Adds a new cogroup Pandas UDF. This allows two grouped dataframes to be cogrouped together and apply a (pandas.DataFrame, pandas.DataFrame) -> pandas.DataFrame UDF to each cogroup. Example usage ``` import pandas as pd from pyspark.sql.functions import pandas_udf, PandasUDFType df1 = spark.createDataFrame( [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)], ("time", "id", "v1")) df2 = spark.createDataFrame( [(20000101, 1, "x"), (20000101, 2, "y")], ("time", "id", "v2")) @pandas_udf("time int, id int, v1 double, v2 string", PandasUDFType.COGROUPED_MAP) def asof_join(l, r): return pd.merge_asof(l, r, on="time", by="id") df1.groupby("id").cogroup(df2.groupby("id")).apply(asof_join).show() ``` +--------+---+---+---+ \| time\| id\| v1\| v2\| +--------+---+---+---+ \|20000101\| 1\|1.0\| x\| \|20000102\| 1\|3.0\| x\| \|20000101\| 2\|2.0\| y\| \|20000102\| 2\|4.0\| y\| +--------+---+---+---+ ### How was this patch tested? Added unit test test_pandas_udf_cogrouped_map Closes #24981 from d80tb7/SPARK-27463-poc-arrow-stream. Authored-by: Chris Martin <chris@cmartinit.co.uk> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-09-17 17:13:50 -07:00
zhengruifeng	4d27a25908	[SPARK-22797][ML][PYTHON] Bucketizer support multi-column ### What changes were proposed in this pull request? Bucketizer supports multi-column on the Python side. ### Why are the changes needed? Bucketizer should support multi-column like the Scala side. ### Does this PR introduce any user-facing change? Yes, this PR adds a new Python API. ### How was this patch tested? Added test suites. Closes #25801 from zhengruifeng/20542_py. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-09-17 11:52:20 +08:00
Huaxin Gao	77e9b58d4f	[SPARK-28969][PYTHON][ML] OneVsRestParams parity between scala and python ### What changes were proposed in this pull request? Follow the scala ```OneVsRestParams``` implementation, move ```setClassifier``` from ```OneVsRestParams``` to ```OneVsRest``` in Pyspark ### Why are the changes needed? 1. Maintain the parity between scala and python code. 2. ```Classifier``` can only be set in the estimator. ### Does this PR introduce any user-facing change? Yes. Previous behavior: ```OneVsRestModel``` has method ```setClassifier``` Current behavior: ```setClassifier``` is removed from ```OneVsRestModel```. ```classifier``` can only be set in ```OneVsRest```. ### How was this patch tested? Use existing tests Closes #25715 from huaxingao/spark-28969. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-13 12:29:19 -05:00
Wenchen Fan	053dd858d3	[SPARK-28998][SQL] reorganize the packages of DS v2 interfaces/classes ### What changes were proposed in this pull request? reorganize the packages of DS v2 interfaces/classes: 1. `org.spark.sql.connector.catalog`: put `TableCatalog`, `Table` and other related interfaces/classes 2. `org.spark.sql.connector.expression`: put `Expression`, `Transform` and other related interfaces/classes 3. `org.spark.sql.connector.read`: put `ScanBuilder`, `Scan` and other related interfaces/classes 4. `org.spark.sql.connector.write`: put `WriteBuilder`, `BatchWrite` and other related interfaces/classes ### Why are the changes needed? Data Source V2 has evolved a lot. It's a bit weird that `Expression` is in `org.spark.sql.catalog.v2` and `Table` is in `org.spark.sql.sources.v2`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing tests Closes #25700 from cloud-fan/package. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-12 19:59:34 +08:00
dengziming	8f632d7045	[MINOR][DOCS] Fix few typos in the java docs JIRA :https://issues.apache.org/jira/browse/SPARK-29050 'a hdfs' change into 'an hdfs' 'an unique' change into 'a unique' 'an url' change into 'a url' 'a error' change into 'an error' Closes #25756 from dengziming/feature_fix_typos. Authored-by: dengziming <dengziming@growingio.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-12 09:30:03 +09:00
HyukjinKwon	7ce0f2b499	[SPARK-29041][PYTHON] Allows createDataFrame to accept bytes as binary type ### What changes were proposed in this pull request? This PR proposes to allow `bytes` as an acceptable type for binary type for `createDataFrame`. ### Why are the changes needed? `bytes` is a standard type for binary in Python. This should be respected in PySpark side. ### Does this PR introduce any user-facing change? Yes, _when specified type is binary_, we will allow `bytes` as a binary type. Previously this was not allowed in both Python 2 and Python 3 as below: ```python spark.createDataFrame([[b"abcd"]], "col binary") ``` in Python 3 ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 787, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, data), schema) File "/.../spark/python/pyspark/sql/session.py", line 442, in _createFromLocal data = list(data) File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare verify_func(obj) File "/.../forked/spark/python/pyspark/sql/types.py", line 1403, in verify verify_value(obj) File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct verifier(v) File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify verify_value(obj) File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default verify_acceptable_types(obj) File "/.../spark/python/pyspark/sql/types.py", line 1282, in verify_acceptable_types % (dataType, obj, type(obj)))) TypeError: field col: BinaryType can not accept object b'abcd' in type <class 'bytes'> ``` in Python 2: ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 787, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, data), schema) File "/.../spark/python/pyspark/sql/session.py", line 442, in _createFromLocal data = list(data) File 
"/.../spark/python/pyspark/sql/session.py", line 769, in prepare verify_func(obj) File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify verify_value(obj) File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct verifier(v) File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify verify_value(obj) File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default verify_acceptable_types(obj) File "/.../spark/python/pyspark/sql/types.py", line 1282, in verify_acceptable_types % (dataType, obj, type(obj)))) TypeError: field col: BinaryType can not accept object 'abcd' in type <type 'str'> ``` So, it won't break anything. ### How was this patch tested? Unittests were added and also manually tested as below. ```bash ./run-tests --python-executables=python2,python3 --testnames "pyspark.sql.tests.test_serde" ``` Closes #25749 from HyukjinKwon/SPARK-29041. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-12 08:52:25 +09:00
Sean Owen	6378d4bc06	[SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3 ### What changes were proposed in this pull request? - Remove SQLContext.createExternalTable and Catalog.createExternalTable, deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods - Remove HiveContext, deprecated in 2.0.0, in favor of `SparkSession.builder.enableHiveSupport` - Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated methods, deprecate in 2.2.0 - Remove deprecated MLlib (not Spark ML) linear method support, mostly utility constructors and 'train' methods, and associated docs. This includes methods in LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been deprecated since 2.0.0 - Remove deprecated Pyspark MLlib linear method support, including LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD - Remove 'runs' argument in KMeans.train() method, which has been a no-op since 2.0.0 - Remove deprecated ChiSqSelector isSorted protected method - Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor of 'yarn' and deploy mode 'cluster', etc Notes: - I was not able to remove deprecated DataFrameReader.json(RDD) in favor of DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is still needed to support Pyspark's .json() method, which can't use a Dataset. - Looks like SQLContext.createExternalTable was not actually deprecated in Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable was. - I afterwards noted that the toDegrees, toRadians functions were almost removed fully in SPARK-25908, but Felix suggested keeping just the R version as they hadn't been technically deprecated. I'd like to revisit that. Do we really want the inconsistency? I'm not against reverting it again, but then that implies leaving SQLContext.createExternalTable just in Pyspark too, which seems weird. 
- I kept LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully removed in Scala. Maybe should not have been deprecated. ### Why are the changes needed? Deprecated items are easiest to remove in a major release, so we should do so as much as possible for Spark 3. This does not target items deprecated 'recently' as of Spark 2.3, which is still 18 months old. ### Does this PR introduce any user-facing change? Yes, in that deprecated items are removed from some public APIs. ### How was this patch tested? Existing tests. Closes #25684 from srowen/SPARK-28980. Lead-authored-by: Sean Owen <sean.owen@databricks.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-09 10:19:40 -05:00
zhengruifeng	4664a082c2	[SPARK-28968][ML] Add HasNumFeatures in the scala side ### What changes were proposed in this pull request? Add HasNumFeatures in the scala side, with `1<<18` as the default value ### Why are the changes needed? HasNumFeatures is already added in the py side, it is reasonable to keep them in sync. I don't find other similar place. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing testsuites Closes #25671 from zhengruifeng/add_HasNumFeatures. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-09-06 11:50:45 +08:00
Sean Owen	36559b6525	[SPARK-28977][DOCS][SQL] Fix DataFrameReader.json docs to doc that partition column can be numeric, date or timestamp type ### What changes were proposed in this pull request? `DataFrameReader.json()` accepts a partition column that is of numeric, date or timestamp type, according to the implementation in `JDBCRelation.scala`. Update the scaladoc accordingly, to match the documentation in `sql-data-sources-jdbc.md` too. ### Why are the changes needed? scaladoc is incorrect. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #25687 from srowen/SPARK-28977. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-05 18:32:45 +09:00
Sean Owen	eb037a8180	[SPARK-28855][CORE][ML][SQL][STREAMING] Remove outdated usages of Experimental, Evolving annotations ### What changes were proposed in this pull request? The Experimental and Evolving annotations are both (like Unstable) used to express that an API may change. However there are many things in the code that have been marked that way since even Spark 1.x. Per the dev thread, anything introduced at or before Spark 2.3.0 is pretty much 'stable' in that it would not change without a deprecation cycle. Therefore I'd like to remove most of these annotations. And, remove the `:: Experimental ::` scaladoc tag too. And likewise for Python, R. The changes below can be summarized as: - Generally, anything introduced at or before Spark 2.3.0 is no longer marked as Evolving or Experimental - Obviously experimental items like DSv2, Barrier mode, ExperimentalMethods are untouched - I _did_ unmark a few MLlib classes introduced in 2.4, as I am quite confident they're not going to change (e.g. KolmogorovSmirnovTest, PowerIterationClustering) It's a big change to review, so I'd suggest scanning the list of _files_ changed to see if any area seems like it should remain partly experimental and examine those. ### Why are the changes needed? Many of these annotations are incorrect; the APIs are de facto stable. Leaving them also makes legitimate usages of the annotations less meaningful. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #25558 from srowen/SPARK-28855. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-01 10:15:00 -05:00
HyukjinKwon	8848af2635	[SPARK-28881][PYTHON][TESTS][FOLLOW-UP] Use SparkSession(SparkContext(...)) to prevent for Spark conf to affect other tests ### What changes were proposed in this pull request? This PR proposes to match the test with branch-2.4. See https://github.com/apache/spark/pull/25593#discussion_r318109047 Seems using `SparkSession.builder` with Spark conf possibly affects other tests. ### Why are the changes needed? To match with branch-2.4 and to make easier to backport. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Test was fixed. Closes #25603 from HyukjinKwon/SPARK-28881-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-28 10:39:21 +09:00
WeichenXu	7f605f5559	[SPARK-28621][SQL] Make spark.sql.crossJoin.enabled default value true ### What changes were proposed in this pull request? Make `spark.sql.crossJoin.enabled` default value true ### Why are the changes needed? For implicit cross join, we can set up a watchdog to cancel it if running for a long time. When "spark.sql.crossJoin.enabled" is false, because `CheckCartesianProducts` is implemented in the logical plan stage, it may raise inconsistent errors that confuse end users: * it's done in the logical phase, so we may fail queries that can be executed via broadcast join, which is very fast. * if we move the check to the physical phase, then a query may succeed at the beginning, and begin to fail when the table size gets larger (other people insert data to the table). This can be quite confusing. * the CROSS JOIN syntax doesn't work well if join reorder happens. * some non-equi-joins will generate a plan using cartesian product, but `CheckCartesianProducts` does not detect it or raise an error. To address this in a simpler way, we can turn off showing this cross-join error by default. For reference, I list some cases raising inconsistent errors here: Providing: ``` spark.range(2).createOrReplaceTempView("sm1") // can be broadcast spark.range(50000000).createOrReplaceTempView("bg1") // cannot be broadcast spark.range(60000000).createOrReplaceTempView("bg2") // cannot be broadcast ``` 1) Some joins could be converted to broadcast nested loop join, but CheckCartesianProducts raises an error. e.g. ``` select sm1.id, bg1.id from bg1 join sm1 where sm1.id < bg1.id ``` 2) Some joins will run by CartesianJoin but CheckCartesianProducts DOES NOT raise an error. e.g. ``` select bg1.id, bg2.id from bg1 join bg2 where bg1.id < bg2.id ``` ### Does this PR introduce any user-facing change? ### How was this patch tested? Closes #25520 from WeichenXu123/SPARK-28621. 
Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 21:53:37 +08:00
HyukjinKwon	00cb2f99cc	[SPARK-28881][PYTHON][TESTS] Add a test to make sure toPandas with Arrow optimization throws an exception per maxResultSize ### What changes were proposed in this pull request? This PR proposes to add a test case for: ```bash ./bin/pyspark --conf spark.driver.maxResultSize=1m spark.conf.set("spark.sql.execution.arrow.enabled",True) ``` ```python spark.range(10000000).toPandas() ``` ``` Empty DataFrame Columns: [id] Index: [] ``` which can result in partial results (see https://github.com/apache/spark/pull/25593#issuecomment-525153808). This regression was found between Spark 2.3 and Spark 2.4, and accidentally fixed. ### Why are the changes needed? To prevent the same regression in the future. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Test was added. Closes #25594 from HyukjinKwon/SPARK-28881. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-27 17:30:06 +09:00
zhengruifeng	573b1cb835	[SPARK-28858][ML][PYSPARK] add tree-based transformation in the py side ### What changes were proposed in this pull request? expose the newly added tree-based transformation in the py side ### Why are the changes needed? function parity ### Does this PR introduce any user-facing change? yes, add `setLeafCol` & `getLeafCol` in the py side ### How was this patch tested? added tests & local tests Closes #25566 from zhengruifeng/py_tree_path. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-08-23 15:18:35 -07:00
heleny	fb1f868d4f	[SPARK-28776][ML] SparkML Writer gets hadoop conf from session state ### What changes were proposed in this pull request? SparkML writer gets hadoop conf from session state, instead of the spark context. ### Why are the changes needed? Allow for multiple sessions in the same context that have different hadoop configurations. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Tested in pyspark.ml.tests.test_persistence.PersistenceTest test_default_read_write Closes #25505 from helenyugithub/SPARK-28776. Authored-by: heleny <heleny@palantir.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-22 09:27:05 -05:00
darrentirto	a787bc2884	[SPARK-28777][PYTHON][DOCS] Fix format_string doc string with the correct parameters ### What changes were proposed in this pull request? The parameters doc string of the function format_string was changed from _col_, _d_ to _format_, _cols_ which is what the actual function declaration states ### Why are the changes needed? The parameters stated by the documentation was inaccurate ### Does this PR introduce any user-facing change? Yes. BEFORE ![before](https://user-images.githubusercontent.com/9700541/63310013-e21a0e80-c2ad-11e9-806b-1d272c5cde12.png) AFTER ![after](https://user-images.githubusercontent.com/9700541/63315812-6b870c00-c2c1-11e9-8165-82782628cd1a.png) ### How was this patch tested? N/A: documentation only Closes #25506 from darrentirto/SPARK-28777. Authored-by: darrentirto <darrentirto@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-19 20:44:46 -07:00
HyukjinKwon	ef142371e7	[SPARK-28736][SPARK-28735][PYTHON][ML] Fix PySpark ML tests to pass in JDK 11 ### What changes were proposed in this pull request? This PR proposes to fix both tests below: ``` ====================================================================== FAIL: test_raw_and_probability_prediction (pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/dongjoon/APACHE/spark-master/python/pyspark/ml/tests/test_algorithms.py", line 89, in test_raw_and_probability_prediction self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) AssertionError: False is not true ``` ``` File "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", line 386, in __main__.GaussianMixtureModel Failed example: abs(softPredicted[0] - 1.0) < 0.001 Expected: True Got: False ********************************************************************** File "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", line 388, in __main__.GaussianMixtureModel Failed example: abs(softPredicted[1] - 0.0) < 0.001 Expected: True Got: False ``` to pass in JDK 11. The root cause seems to be different float values being understood via Py4J. This issue also was found in https://github.com/apache/spark/pull/25132 before. When floats are transferred from Python to JVM, the values are sent as is. Python floats are not "precise" due to their own limitation - https://docs.python.org/3/tutorial/floatingpoint.html. For some reason, the floats from Python on JDK 8 and JDK 11 are different, which is already explicitly not guaranteed. This seems to be why only some tests in PySpark with floats are failing. So, this PR fixes it by increasing tolerance in identified test cases in PySpark. ### Why are the changes needed? To fully support JDK 11. 
See, for instance, https://github.com/apache/spark/pull/25443 and https://github.com/apache/spark/pull/25423 for ongoing efforts. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested as described in JIRAs: ``` $ build/sbt -Phadoop-3.2 test:package $ python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' --python-executables python ``` ``` $ build/sbt -Phadoop-3.2 test:package $ python/run-tests --testnames 'pyspark.mllib.clustering' --python-executables python ``` Closes #25475 from HyukjinKwon/SPARK-28735. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-16 19:47:29 +09:00
Huaxin Gao	ba5ee27706	[SPARK-28243][PYSPARK][ML][FOLLOW-UP] Move Python DecisionTreeParams to regression.py ## What changes were proposed in this pull request? Leave ```shared.py``` untouched. Move Python ```DecisionTreeParams``` to ```regression.py``` ## How was this patch tested? Use existing tests Closes #25406 from huaxingao/spark-28243. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-15 10:21:26 -05:00
Liang-Chi Hsieh	e6a0385289	[SPARK-28422][SQL][PYTHON] GROUPED_AGG pandas_udf should work without group by clause ## What changes were proposed in this pull request? A GROUPED_AGG pandas python udf does not work without a group by clause, as in `select udf(id) from table`. This is inconsistent with aggregate functions like sum, count..., and also with the dataset API, like `df.agg(udf(df['id']))`. When we parse a udf (or an aggregate function) like that from SQL syntax, it is known as a function in a project. The `GlobalAggregates` rule in analysis turns such a project into an aggregate by looking for aggregate expressions. With this change, it also looks for GROUPED_AGG pandas python udfs. ## How was this patch tested? Added tests. Closes #25352 from viirya/SPARK-28422. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-14 00:32:33 +09:00
Gengliang Wang	48adc91057	[SPARK-28698][SQL] Support user-specified output schema in `to_avro` ## What changes were proposed in this pull request? The mapping of Spark schema to Avro schema is many-to-many. (See https://spark.apache.org/docs/latest/sql-data-sources-avro.html#supported-types-for-spark-sql---avro-conversion) The default schema mapping might not be exactly what users want. For example, by default, a "string" column is always written as "string" Avro type, but users might want to output the column as "enum" Avro type. With PR https://github.com/apache/spark/pull/21847, Spark supports user-specified schema in the batch writer. For the function `to_avro`, we should support user-specified output schema as well. ## How was this patch tested? Unit test. Closes #25419 from gengliangwang/to_avro. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-13 20:52:16 +08:00
wuyi	cbad616d4c	[SPARK-27371][CORE] Support GPU-aware resources scheduling in Standalone ## What changes were proposed in this pull request? In this PR, we implement a complete process of GPU-aware resources scheduling in Standalone. The whole process looks like: Worker sets up isolated resources when it starts up and registers to master along with its resources. And, Master picks up usable workers according to driver/executor's resource requirements to launch driver/executor on them. Then, Worker launches the driver/executor after preparing resources file, which is created under driver/executor's working directory, with specified resource addresses (told by master). When driver/executor finishes, their resources can be recycled to worker. Finally, if a worker stops, it should always release its resources first. For the case of Workers and Drivers in client mode running on the same host, we introduce a config option named `spark.resources.coordinate.enable` (default true) to indicate whether Spark should coordinate resources for the user. If `spark.resources.coordinate.enable=false`, the user is responsible for configuring different resources for Workers and Drivers when using resourcesFile or discovery script. If true, Spark assigns different resources to Workers and Drivers on the user's behalf. The solution for Spark to coordinate resources among Workers and Drivers is: Generally, use a shared file named ____allocated_resources____.json to sync allocated resources info among Workers and Drivers on the same host. After a Worker or Driver has found all resources using the configured resourcesFile and/or discovery script during launching, it should filter out available resources by excluding resources already allocated in ____allocated_resources____.json and acquire resources from available resources according to its own requirement. After that, it should write its allocated resources along with its process id (pid) into ____allocated_resources____.json. 
Pid (proposed by tgravescs) is used here to check whether the allocated resources are still valid in case a Worker or Driver crashes without releasing its resources properly. And when a Worker or Driver finishes, normally, it would always clean up its own allocated resources in ____allocated_resources____.json. Note that we'll always get a file lock before any access to file ____allocated_resources____.json and release the lock finally. Furthermore, we appended resources info in `WorkerSchedulerStateResponse` to work around master change behaviour in HA mode. ## How was this patch tested? Added unit tests in WorkerSuite, MasterSuite, SparkContextSuite. Manually tested with client/cluster mode (e.g. multiple workers) in a single node Standalone. Closes #25047 from Ngone51/SPARK-27371. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-08-09 07:49:03 -05:00
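The coordination scheme described in SPARK-27371 (a shared JSON file, a file lock around every access, and pids used to detect stale entries) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Spark's Scala implementation: the function names `pid_alive` and `acquire`, the JSON layout (pid mapped to a list of addresses), and the use of `fcntl.flock` are all hypothetical choices, and the sketch is POSIX-only.

```python
import fcntl
import json
import os


def pid_alive(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def acquire(path, discovered, amount, pid):
    """Reserve `amount` addresses from `discovered` and record them under `pid`.

    Hypothetical helper: `path` plays the role of the shared allocation file.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # hold the lock for the whole update
        try:
            text = f.read()
            allocated = json.loads(text) if text else {}
            # Drop entries whose owner died without cleaning up.
            allocated = {p: a for p, a in allocated.items() if pid_alive(int(p))}
            taken = {addr for addrs in allocated.values() for addr in addrs}
            free = [addr for addr in discovered if addr not in taken]
            if len(free) < amount:
                raise RuntimeError("not enough free resources")
            mine = free[:amount]
            allocated[str(pid)] = mine
            f.seek(0)
            f.truncate()
            json.dump(allocated, f)
            f.flush()  # make the write visible before the lock is released
            return mine
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Flushing before unlocking matters here: the file object is buffered, so without it another process could take the lock and read the old contents.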
Shixiong Zhu	5bb69945e4	[SPARK-28651][SS] Force the schema of Streaming file source to be nullable ## What changes were proposed in this pull request? Right now, batch DataFrame always changes the schema to nullable automatically (See this line: `325bc8e9c6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L399)`). But streaming file source is missing this. This PR updates the streaming file source schema to force it be nullable. I also added a flag `spark.sql.streaming.fileSource.schema.forceNullable` to disable this change since some users may rely on the old behavior. ## How was this patch tested? The new unit test. Closes #25382 from zsxwing/SPARK-28651. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-09 18:54:55 +09:00
Anton Yanchenko	bda5b51576	[SPARK-28454][PYTHON] Validate LongType in `createDataFrame(verifySchema=True)` ## What changes were proposed in this pull request? Add missing validation for `LongType` in `pyspark.sql.types._make_type_verifier`. ## How was this patch tested? Doctests / unittests / manual tests. Unpatched version: ``` In [23]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect() Out[23]: [Row(x=None)] ``` Patched: ``` In [5]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect() --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-5-c1740fcadbf9> in <module> ----> 1 s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect() /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema) 689 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio) 690 else: --> 691 rdd, schema = self._createFromLocal(map(prepare, data), schema) 692 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd()) 693 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json()) /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in _createFromLocal(self, data, schema) 405 # make sure data could consumed multiple times 406 if not isinstance(data, list): --> 407 data = list(data) 408 409 if schema is None or isinstance(schema, (list, tuple)): /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in prepare(obj) 671 672 def prepare(obj): --> 673 verify_func(obj) 674 return obj 675 elif isinstance(schema, DataType): /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj) 1427 def verify(obj): 1428 if not verify_nullability(obj): -> 1429 verify_value(obj) 1430 1431 return verify /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_struct(obj) 
1397 if isinstance(obj, dict): 1398 for f, verifier in verifiers: -> 1399 verifier(obj.get(f)) 1400 elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False): 1401 # the order in obj could be different than dataType.fields /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj) 1427 def verify(obj): 1428 if not verify_nullability(obj): -> 1429 verify_value(obj) 1430 1431 return verify /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_long(obj) 1356 if obj < -9223372036854775808 or obj > 9223372036854775807: 1357 raise ValueError( -> 1358 new_msg("object of LongType out of range, got: %s" % obj)) 1359 1360 verify_value = verify_long ValueError: field x: object of LongType out of range, got: 18446744073709551616 ``` Closes #25117 from simplylizz/master. Authored-by: Anton Yanchenko <simplylizz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-08 11:47:25 +09:00
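The range check visible in the SPARK-28454 traceback can be reproduced as a standalone function. The following is a minimal sketch, not the code in `pyspark.sql.types`: `verify_long`, `LONG_MIN`, `LONG_MAX`, and the `name` parameter are hypothetical names, and only the bounds and the message text are taken from the traceback.

```python
# Bounds of a signed 64-bit integer, as used in the range check shown in the traceback.
LONG_MIN = -9223372036854775808
LONG_MAX = 9223372036854775807


def verify_long(obj, name=None):
    """Raise ValueError if obj does not fit in a signed 64-bit integer.

    `name` is a hypothetical field label used only to prefix the message.
    """
    if obj < LONG_MIN or obj > LONG_MAX:
        msg = "object of LongType out of range, got: %s" % obj
        raise ValueError("field %s: %s" % (name, msg) if name else msg)
    return obj
```

Calling `verify_long(1 << 64, "x")` raises a `ValueError`, whereas the unpatched behaviour reported in the commit let the value through and it came back as `None`.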
wuyi	94499af6f0	[SPARK-28486][CORE][PYTHON] Map PythonBroadcast's data file to a BroadcastBlock to avoid delete by GC ## What changes were proposed in this pull request? Currently, PythonBroadcast may delete its data file while a python worker still needs it. This happens because PythonBroadcast overrides the `finalize()` method to delete its data file. So, when GC happens and there are no references to the broadcast variable, it may trigger `finalize()` to delete the data file. That also means the data under a Python broadcast variable can't be deleted when `unpersist()`/`destroy()` is called, but relies on GC. In this PR, we removed the `finalize()` method, and map the PythonBroadcast data file to a BroadcastBlock (which has the same broadcast id as the broadcast variable that wraps this PythonBroadcast) when PythonBroadcast is deserialized. As a result, the data file can be deleted just like other pieces of the Broadcast variable when `unpersist()`/`destroy()` is called and no longer relies on GC. ## How was this patch tested? Added a Python test, and tested manually (verified create/delete the broadcast block). Closes #25262 from Ngone51/SPARK-28486. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-05 20:18:53 +09:00
WeichenXu	b3394db193	[SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 ## What changes were proposed in this pull request? This PR picks up https://github.com/apache/spark/pull/25315 back after removing `Popen.wait` usage which exists in Python 3 only. I saw the last test results wrongly and thought it was passed. Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7. I add a sleep after the test connection to daemon. ## How was this patch tested? Run test ``` python/run-tests --python-executables=python3.7 --testname "pyspark.tests.test_daemon DaemonTests" ``` Before Fail on test "test_termination_sigterm". And we can see daemon process do not exit. After Test passed Closes #25343 from HyukjinKwon/SPARK-28582. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-03 10:31:15 +09:00
Dongjoon Hyun	8ae032d78d	Revert "[SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7" This reverts commit `fbeee0c5bc`.	2019-08-02 10:14:20 -07:00
Huaxin Gao	660423d717	[SPARK-23469][ML] HashingTF should use corrected MurmurHash3 implementation ## What changes were proposed in this pull request? Update HashingTF to use new implementation of MurmurHash3 Make HashingTF use the old MurmurHash3 when a model from pre 3.0 is loaded ## How was this patch tested? Change existing unit tests. Also add one unit test to make sure HashingTF use the old MurmurHash3 when a model from pre 3.0 is loaded Closes #25303 from huaxingao/spark-23469. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-02 10:53:36 -05:00
WeichenXu	fbeee0c5bc	[SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 ## What changes were proposed in this pull request? Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7. I add a sleep after the test connection to daemon. ## How was this patch tested? Run test ``` python/run-tests --python-executables=python3.7 --testname "pyspark.tests.test_daemon DaemonTests" ``` Before Fail on test "test_termination_sigterm". And we can see daemon process do not exit. After Test passed Closes #25315 from WeichenXu123/fix_py37_daemon. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-02 22:07:06 +09:00
HyukjinKwon	b8e13b0aea	[SPARK-28153][PYTHON] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF) ## What changes were proposed in this pull request? This PR proposes to use `AtomicReference` so that parent and child threads can access to the same file block holder. Python UDF expressions are turned to a plan and then it launches a separate thread to consume the input iterator. In the separate child thread, the iterator sets `InputFileBlockHolder.set` before the parent does which the parent thread is unable to read later. 1. In this separate child thread, if it happens to call `InputFileBlockHolder.set` first without initialization of the parent's thread local (which is done when the `ThreadLocal.get()` is first called), the child thread seems calling its own `initialValue` to initialize. 2. After that, the parent calls its own `initialValue` to initializes at the first call of `ThreadLocal.get()`. 3. Both now have two different references. Updating at child isn't reflected to parent. This PR fixes it via initializing parent's thread local with `AtomicReference` for file status so that they can be used in each task, and children thread's update is reflected. I also tried to explain this a bit more at https://github.com/apache/spark/pull/24958#discussion_r297203041. ## How was this patch tested? Manually tested and unittest was added. Closes #24958 from HyukjinKwon/SPARK-28153. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-07-31 22:40:01 +08:00
WeichenXu	a745381b9d	[SPARK-25382][SQL][PYSPARK] Remove ImageSchema.readImages in 3.0 ## What changes were proposed in this pull request? I remove the deprecated `ImageSchema.readImages`. Move some useful methods from class `ImageSchema` into class `ImageFileFormat`. In pyspark, I rename `ImageSchema` class to be `ImageUtils`, and keep some useful python methods in it. ## How was this patch tested? UT. Closes #25245 from WeichenXu123/remove_image_schema. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-31 14:26:18 +09:00
WeichenXu	3b14088541	[SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon ## What changes were proposed in this pull request? The PySpark worker daemon reads the worker PIDs to kill from stdin. `1bb60ab839/python/pyspark/daemon.py (L127)` However, the worker process is forked from the worker daemon process, and we didn't close stdin on the child after the fork. This means the child and the user program can read stdin as well, which blocks the daemon from receiving the PID to kill. This can cause issues because the task reaper might detect that the task was not terminated and eventually kill the JVM. This PR fixes this by redirecting the standard input of the forked child to devnull. ## How was this patch tested? Manually tested. In `pyspark`, run: ``` import subprocess def task(_): subprocess.check_output(["cat"]) sc.parallelize(range(1), 1).mapPartitions(task).count() ``` Before: The job gets stuck; pressing Ctrl+C exits the job, but the python worker process does not exit. After: The job finishes correctly. The "cat" prints nothing (because the dummy stdin is "/dev/null"). The python worker process exits normally. Please review https://spark.apache.org/contributing.html before opening a pull request. Closes #25138 from WeichenXu123/SPARK-26175. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-31 09:10:24 +09:00
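The stdin redirection described in SPARK-26175 can be sketched in a few lines. This is an illustration under assumptions (POSIX only, because it uses `os.fork`), not the code in `daemon.py`; `child_stdin_report` is a hypothetical helper name.

```python
import os


def child_stdin_report():
    """Fork a child, point its stdin at os.devnull, and report what it could read."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: replace file descriptor 0 with /dev/null.
        try:
            os.close(read_end)
            devnull = os.open(os.devnull, os.O_RDONLY)
            os.dup2(devnull, 0)
            os.close(devnull)
            data = os.read(0, 1024)  # returns immediately with EOF
            os.write(write_end, b"child read %d bytes" % len(data))
        finally:
            os._exit(0)  # never fall through into the parent's code path
    # Parent process: collect the child's report and reap the child.
    os.close(write_end)
    with os.fdopen(read_end, "rb") as pipe:
        report = pipe.read()
    os.waitpid(pid, 0)
    return report.decode()
```

Because the child's stdin is `/dev/null`, a read returns end-of-file at once instead of competing with the parent for its input.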
zhengruifeng	44c28d7515	[SPARK-28399][ML][PYTHON] implement RobustScaler ## What changes were proposed in this pull request? Implement `RobustScaler` Since the transformation is quite similar to `StandardScaler`, I refactor the transform function so that it can be reused in both scalers. ## How was this patch tested? existing and added tests Closes #25160 from zhengruifeng/robust_scaler. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-07-30 10:24:33 -05:00
Maxim Gekk	a5a5da78cf	[SPARK-28471][SQL] Replace `yyyy` by `uuuu` in date-timestamp patterns without era ## What changes were proposed in this pull request? In the PR, I propose to use `uuuu` for years instead of `yyyy` in date/timestamp patterns without the era pattern `G` (https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html). Parsing/formatting of positive years (current era) will be the same. The difference is in the formatting of negative years, which belong to the previous era, BC (Before Christ). I replaced the `yyyy` pattern by `uuuu` everywhere except: 1. Test, Suite & Benchmark. Existing tests must work as is. 2. `SimpleDateFormat` because it doesn't support the `uuuu` pattern. 3. Comments and examples (except comments related to already replaced patterns). Before the changes, the common-era year `100` and the BC-era year `-99` were both shown as `100`. After the changes, negative years will be formatted with the `-` sign. Before: ```Scala scala> Seq(java.time.LocalDate.of(-99, 1, 1)).toDF().show +----------+ \| value\| +----------+ \|0100-01-01\| +----------+ ``` After: ```Scala scala> Seq(java.time.LocalDate.of(-99, 1, 1)).toDF().show +-----------+ \| value\| +-----------+ \|-0099-01-01\| +-----------+ ``` ## How was this patch tested? By existing test suites, and added tests for negative years to `DateFormatterSuite` and `TimestampFormatterSuite`. Closes #25230 from MaxGekk/year-pattern-uuuu. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-28 20:36:36 -07:00
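The difference between the two year patterns in SPARK-28471 comes down to year-of-era versus proleptic year. The sketch below reproduces the arithmetic in plain Python; the two function names are hypothetical and nothing here calls `DateTimeFormatter`.

```python
def format_year_of_era(year):
    """Mimic `yyyy` without an era: proleptic year 0 is 1 BC, -99 is 100 BC."""
    year_of_era = year if year >= 1 else 1 - year
    return f"{year_of_era:04d}"


def format_proleptic_year(year):
    """Mimic `uuuu`: keep the sign, pad the digits to four places."""
    return f"{year:05d}" if year < 0 else f"{year:04d}"


# The two patterns agree for the current era and diverge for negative years.
before = format_year_of_era(-99) + "-01-01"
after = format_proleptic_year(-99) + "-01-01"
```

Here `before` and `after` correspond to the two `show` outputs in the commit message.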
Huaxin Gao	70f82fd298	[SPARK-21481][ML] Add indexOf method in ml.feature.HashingTF ## What changes were proposed in this pull request? Add indexOf method for ml.feature.HashingTF. ## How was this patch tested? Add Unit test. Closes #25250 from huaxingao/spark-21481. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-07-28 08:32:43 -05:00
wangguangxin.cn	fbaa177d2a	[MINOR][PYTHON] Use `_memory_limit` to get worker memory conf in `rdd.py` ## What changes were proposed in this pull request? Replace duplicate code by function `_memory_limit` ## How was this patch tested? Existing UTs Closes #25273 from WangGuangxin/python_memory_limit. Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-27 11:58:50 -07:00
Huaxin Gao	3de4e1b9b4	[SPARK-28507][ML][PYSPARK] Remove deprecated API context(self, sqlContext) from pyspark/ml/util.py ## What changes were proposed in this pull request? remove deprecated ``` def context(self, sqlContext)``` from ```pyspark/ml/util.py``` ## How was this patch tested? test with existing ml PySpark test suites Closes #25246 from huaxingao/spark-28507. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-07-26 12:12:11 -05:00
Huaxin Gao	72c80ee81c	[SPARK-28243][PYSPARK][ML] Remove setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams ## What changes were proposed in this pull request? Remove deprecated setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams ## How was this patch tested? Use existing tests. Closes #25046 from huaxingao/spark-28243. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-07-20 10:44:33 -05:00
zero323	a0c2fa63ab	[SPARK-28439][PYTHON][SQL] Add support for count: Column in array_repeat ## What changes were proposed in this pull request? This adds a simple check for the `count` argument: - If it is a `Column`, we apply `_to_java_column` before invoking the JVM counterpart - Otherwise we proceed as before. ## How was this patch tested? Manual testing. Closes #25193 from zero323/SPARK-28278. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-18 12:58:48 -07:00
Huaxin Gao	971e832e0e	[SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored ## What changes were proposed in this pull request? In the following python code ``` df.write.mode("overwrite").insertInto("table") ``` ```insertInto``` ignores ```mode("overwrite")``` and appends by default. ## How was this patch tested? Add Unit test. Closes #25175 from huaxingao/spark-28411. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-18 13:37:59 +09:00
Maxim Gekk	70073b19eb	[SPARK-27609][PYTHON] Convert values of function options to strings ## What changes were proposed in this pull request? In the PR, I propose to convert options values to strings by using `to_str()` for the following functions: `from_csv()`, `to_csv()`, `from_json()`, `to_json()`, `schema_of_csv()` and `schema_of_json()`. This will make the handling of function options consistent with option handling in `DataFrameReader`/`DataFrameWriter`. For example: ```Python df.select(from_csv(df.value, "s string", {'ignoreLeadingWhiteSpace': True})) ``` ## How was this patch tested? Added an example for `from_csv()` which was tested by: ```Shell ./python/run-tests --testnames pyspark.sql.functions ``` Closes #25182 from MaxGekk/options_to_str. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-18 13:37:03 +09:00
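As a rough illustration of the option conversion in SPARK-27609, the sketch below shows one way such a helper might behave. It is written from scratch for this note: the lower-casing of booleans and the pass-through of `None` are assumptions about the intended behavior, and `normalize_options` is a hypothetical name.

```python
def to_str(value):
    """Convert an option value to a string; booleans become 'true'/'false'."""
    if isinstance(value, bool):
        return str(value).lower()
    if value is None:
        return value  # keep None instead of producing the string "None"
    return str(value)


def normalize_options(options):
    """Apply to_str to every value of an options dict (hypothetical helper)."""
    return {key: to_str(value) for key, value in options.items()}
```

With this, `{'ignoreLeadingWhiteSpace': True}` from the example above becomes `{'ignoreLeadingWhiteSpace': 'true'}`.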
HyukjinKwon	66179fa842	[SPARK-28418][PYTHON][SQL] Wait for event process in 'test_query_execution_listener_on_collect' ## What changes were proposed in this pull request? It fixes a flaky test: ``` ERROR [0.164s]: test_query_execution_listener_on_collect (pyspark.sql.tests.test_dataframe.QueryExecutionListenerTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/jenkins/python/pyspark/sql/tests/test_dataframe.py", line 758, in test_query_execution_listener_on_collect "The callback from the query execution listener should be called after 'collect'") AssertionError: The callback from the query execution listener should be called after 'collect' ``` It seems the test can fail because the event was somehow delayed and the check ran first. ## How was this patch tested? Manually. Closes #25177 from HyukjinKwon/SPARK-28418. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-17 18:44:11 +09:00
Liang-Chi Hsieh	591de42351	[SPARK-28381][PYSPARK] Upgraded version of Pyrolite to 4.30 ## What changes were proposed in this pull request? This upgrades to a newer version of Pyrolite. Most updates [1] in the newer version are for dotnet. For Java, it includes a bug fix to Unpickler regarding cleaning up the Unpickler memo, and support for protocol 5. After upgrading, we can remove the fix at SPARK-27629 for the bug in Unpickler. [1] https://github.com/irmen/Pyrolite/compare/pyrolite-4.23...master ## How was this patch tested? Manually tested locally on Python 3.6 with existing tests. Closes #25143 from viirya/upgrade-pyrolite. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-15 12:29:58 +09:00
Liang-Chi Hsieh	707411f479	[SPARK-28378][PYTHON] Remove usage of cgi.escape ## What changes were proposed in this pull request? `cgi.escape` is deprecated [1] and removed in 3.8 [2]. We had better replace it. [1] https://docs.python.org/3/library/cgi.html#cgi.escape. [2] https://docs.python.org/3.8/whatsnew/3.8.html#api-and-feature-removals ## How was this patch tested? Existing tests. Closes #25142 from viirya/remove-cgi-escape. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-14 15:26:00 +09:00
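The commit message for SPARK-28378 does not say which function replaced `cgi.escape`. The Python documentation points to `html.escape`, so the sketch below assumes that replacement; `escape_for_html` is a hypothetical wrapper name.

```python
import html


def escape_for_html(text):
    """Escape &, < and > for HTML output.

    quote=False mirrors the old cgi.escape default, which left quotes alone;
    html.escape escapes quotes unless told otherwise.
    """
    return html.escape(text, quote=False)
```

Note the differing defaults: a call site that relied on quotes being left untouched needs the explicit `quote=False`.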
Jesse Cai	79e2047703	[SPARK-28355][CORE][PYTHON] Use Spark conf for threshold at which command is compressed by broadcast ## What changes were proposed in this pull request? The `_prepare_for_python_RDD` method currently broadcasts a pickled command if its length is greater than the hardcoded value `1 << 20` (1M). This change sets this value as a Spark conf instead. ## How was this patch tested? Unit tests, manual tests. Closes #25123 from jessecai/SPARK-28355. Authored-by: Jesse Cai <jesse.cai@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-07-13 08:44:16 -07:00
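The size check described in SPARK-28355 can be pictured as follows. This is only a sketch: the commit message does not name the new Spark conf, so the threshold is passed as a plain argument here, and `needs_broadcast` is a hypothetical name.

```python
# The previously hardcoded limit from the commit message: 1 << 20 bytes (1M).
DEFAULT_BROADCAST_THRESHOLD = 1 << 20


def needs_broadcast(pickled_command, threshold=DEFAULT_BROADCAST_THRESHOLD):
    """Return True when the pickled command is larger than the threshold."""
    return len(pickled_command) > threshold
```

Making the threshold a parameter is what lets a configuration value replace the constant without touching the comparison itself.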
Thomas Graves	f84cca2d84	[SPARK-28234][CORE][PYTHON] Add python and JavaSparkContext support to get resources ## What changes were proposed in this pull request? Add python api support and JavaSparkContext support for resources(). I needed the JavaSparkContext support for it to properly translate into python with the py4j stuff. ## How was this patch tested? Unit tests added and manually tested in local cluster mode and on yarn. Closes #25087 from tgravescs/SPARK-28234-python. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-11 09:32:58 +09:00
Liang-Chi Hsieh	7858e534d3	[SPARK-28323][SQL][PYTHON] PythonUDF should be able to use in join condition ## What changes were proposed in this pull request? There is a bug in `ExtractPythonUDFs` that produces wrong result attributes. It causes a failure when using `PythonUDF`s among multiple child plans, e.g., join. An example is using `PythonUDF`s in join condition. ```python >>> left = spark.createDataFrame([Row(a=1, a1=1, a2=1), Row(a=2, a1=2, a2=2)]) >>> right = spark.createDataFrame([Row(b=1, b1=1, b2=1), Row(b=1, b1=3, b2=1)]) >>> f = udf(lambda a: a, IntegerType()) >>> df = left.join(right, [f("a") == f("b"), left.a1 == right.b1]) >>> df.collect() 19/07/10 12:20:49 ERROR Executor: Exception in task 5.0 in stage 0.0 (TID 5) java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:201) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getAs(rows.scala:35) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.isNullAt(rows.scala:36) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.isNullAt$(rows.scala:36) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.isNullAt(rows.scala:195) at org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:70) ... ``` ## How was this patch tested? Added test. Closes #25091 from viirya/SPARK-28323. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-07-10 16:29:58 -07:00
Henry D	a32c92c0cd	[SPARK-28140][MLLIB][PYTHON] Accept DataFrames in RowMatrix and IndexedRowMatrix constructors ## What changes were proposed in this pull request? In both cases, the input `DataFrame` schema must contain only the information that's required for the matrix object, so a vector column in the case of `RowMatrix` and long and vector columns for `IndexedRowMatrix`. ## How was this patch tested? Unit tests that verify: - `RowMatrix` and `IndexedRowMatrix` can be created from `DataFrame`s - If the schema does not match expectations, we throw an `IllegalArgumentException` Please review https://spark.apache.org/contributing.html before opening a pull request. Closes #24953 from henrydavidge/row-matrix-df. Authored-by: Henry D <henrydavidge@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-07-09 16:39:21 -05:00
HyukjinKwon	fe3e34dda6	[SPARK-28273][SQL][PYTHON] Convert and port 'pgSQL/case.sql' into UDF test base ## What changes were proposed in this pull request? This PR adds some tests converted from `pgSQL/case.sql'` to test UDFs. Please see contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921). This PR also contains two minor fixes: 1. Change name of Scala UDF from `UDF:name(...)` to `name(...)` to be consistent with Python' 2. Fix Scala UDF at `IntegratedUDFTestUtils.scala ` to handle `null` in strings. <details><summary>Diff comparing to 'pgSQL/case.sql'</summary> <p> ```diff diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/case.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-case.sql.out index fa078d16d6d..55bef64338f 100644 --- a/sql/core/src/test/resources/sql-tests/results/pgSQL/case.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-case.sql.out -115,7 +115,7 struct<> -- !query 13 SELECT '3' AS `One`, CASE - WHEN 1 < 2 THEN 3 + WHEN CAST(udf(1 < 2) AS boolean) THEN 3 END AS `Simple WHEN` -- !query 13 schema struct<One:string,Simple WHEN:int> -126,10 +126,10 struct<One:string,Simple WHEN:int> -- !query 14 SELECT '<NULL>' AS `One`, CASE - WHEN 1 > 2 THEN 3 + WHEN 1 > 2 THEN udf(3) END AS `Simple default` -- !query 14 schema -struct<One:string,Simple default:int> +struct<One:string,Simple default:string> -- !query 14 output <NULL> NULL -137,17 +137,17 struct<One:string,Simple default:int> -- !query 15 SELECT '3' AS `One`, CASE - WHEN 1 < 2 THEN 3 - ELSE 4 + WHEN udf(1) < 2 THEN udf(3) + ELSE udf(4) END AS `Simple ELSE` -- !query 15 schema -struct<One:string,Simple ELSE:int> +struct<One:string,Simple ELSE:string> -- !query 15 output 3 3 -- !query 16 -SELECT '4' AS `One`, +SELECT udf('4') AS `One`, CASE WHEN 1 > 2 THEN 3 ELSE 4 -159,10 +159,10 struct<One:string,ELSE default:int> -- !query 17 -SELECT '6' AS `One`, +SELECT udf('6') AS `One`, CASE - WHEN 
1 > 2 THEN 3 - WHEN 4 < 5 THEN 6 + WHEN CAST(udf(1 > 2) AS boolean) THEN 3 + WHEN udf(4) < 5 THEN 6 ELSE 7 END AS `Two WHEN with default` -- !query 17 schema -173,7 +173,7 struct<One:string,Two WHEN with default:int> -- !query 18 SELECT '7' AS `None`, - CASE WHEN rand() < 0 THEN 1 + CASE WHEN rand() < udf(0) THEN 1 END AS `NULL on no matches` -- !query 18 schema struct<None:string,NULL on no matches:int> -182,36 +182,36 struct<None:string,NULL on no matches:int> -- !query 19 -SELECT CASE WHEN 1=0 THEN 1/0 WHEN 1=1 THEN 1 ELSE 2/0 END +SELECT CASE WHEN CAST(udf(1=0) AS boolean) THEN 1/0 WHEN 1=1 THEN 1 ELSE 2/0 END -- !query 19 schema -struct<CASE WHEN (1 = 0) THEN (CAST(1 AS DOUBLE) / CAST(0 AS DOUBLE)) WHEN (1 = 1) THEN CAST(1 AS DOUBLE) ELSE (CAST(2 AS DOUBLE) / CAST(0 AS DOUBLE)) END:double> +struct<CASE WHEN CAST(udf((1 = 0)) AS BOOLEAN) THEN (CAST(1 AS DOUBLE) / CAST(0 AS DOUBLE)) WHEN (1 = 1) THEN CAST(1 AS DOUBLE) ELSE (CAST(2 AS DOUBLE) / CAST(0 AS DOUBLE)) END:double> -- !query 19 output 1.0 -- !query 20 -SELECT CASE 1 WHEN 0 THEN 1/0 WHEN 1 THEN 1 ELSE 2/0 END +SELECT CASE 1 WHEN 0 THEN 1/udf(0) WHEN 1 THEN 1 ELSE 2/0 END -- !query 20 schema -struct<CASE WHEN (1 = 0) THEN (CAST(1 AS DOUBLE) / CAST(0 AS DOUBLE)) WHEN (1 = 1) THEN CAST(1 AS DOUBLE) ELSE (CAST(2 AS DOUBLE) / CAST(0 AS DOUBLE)) END:double> +struct<CASE WHEN (1 = 0) THEN (CAST(1 AS DOUBLE) / CAST(CAST(udf(0) AS DOUBLE) AS DOUBLE)) WHEN (1 = 1) THEN CAST(1 AS DOUBLE) ELSE (CAST(2 AS DOUBLE) / CAST(0 AS DOUBLE)) END:double> -- !query 20 output 1.0 -- !query 21 -SELECT CASE WHEN i > 100 THEN 1/0 ELSE 0 END FROM case_tbl +SELECT CASE WHEN i > 100 THEN udf(1/0) ELSE udf(0) END FROM case_tbl -- !query 21 schema -struct<CASE WHEN (i > 100) THEN (CAST(1 AS DOUBLE) / CAST(0 AS DOUBLE)) ELSE CAST(0 AS DOUBLE) END:double> +struct<CASE WHEN (i > 100) THEN udf((cast(1 as double) / cast(0 as double))) ELSE udf(0) END:string> -- !query 21 output -0.0 -0.0 -0.0 -0.0 +0 +0 +0 +0 -- !query 22 -SELECT CASE 'a' 
WHEN 'a' THEN 1 ELSE 2 END +SELECT CASE 'a' WHEN 'a' THEN udf(1) ELSE udf(2) END -- !query 22 schema -struct<CASE WHEN (a = a) THEN 1 ELSE 2 END:int> +struct<CASE WHEN (a = a) THEN udf(1) ELSE udf(2) END:string> -- !query 22 output 1 -283,7 +283,7 big -- !query 27 -SELECT * FROM CASE_TBL WHERE COALESCE(f,i) = 4 +SELECT * FROM CASE_TBL WHERE udf(COALESCE(f,i)) = 4 -- !query 27 schema struct<i:int,f:double> -- !query 27 output -291,7 +291,7 struct<i:int,f:double> -- !query 28 -SELECT * FROM CASE_TBL WHERE NULLIF(f,i) = 2 +SELECT * FROM CASE_TBL WHERE udf(NULLIF(f,i)) = 2 -- !query 28 schema struct<i:int,f:double> -- !query 28 output -299,10 +299,10 struct<i:int,f:double> -- !query 29 -SELECT COALESCE(a.f, b.i, b.j) +SELECT udf(COALESCE(a.f, b.i, b.j)) FROM CASE_TBL a, CASE2_TBL b -- !query 29 schema -struct<coalesce(f, CAST(i AS DOUBLE), CAST(j AS DOUBLE)):double> +struct<udf(coalesce(f, cast(i as double), cast(j as double))):string> -- !query 29 output -30.3 -30.3 -332,8 +332,8 struct<coalesce(f, CAST(i AS DOUBLE), CAST(j AS DOUBLE)):double> -- !query 30 SELECT * - FROM CASE_TBL a, CASE2_TBL b - WHERE COALESCE(a.f, b.i, b.j) = 2 + FROM CASE_TBL a, CASE2_TBL b + WHERE udf(COALESCE(a.f, b.i, b.j)) = 2 -- !query 30 schema struct<i:int,f:double,i:int,j:int> -- !query 30 output -342,7 +342,7 struct<i:int,f:double,i:int,j:int> -- !query 31 -SELECT '' AS Five, NULLIF(a.i,b.i) AS `NULLIF(a.i,b.i)`, +SELECT udf('') AS Five, NULLIF(a.i,b.i) AS `NULLIF(a.i,b.i)`, NULLIF(b.i, 4) AS `NULLIF(b.i,4)` FROM CASE_TBL a, CASE2_TBL b -- !query 31 schema -377,7 +377,7 struct<Five:string,NULLIF(a.i,b.i):int,NULLIF(b.i,4):int> -- !query 32 SELECT '' AS `Two`, * FROM CASE_TBL a, CASE2_TBL b - WHERE COALESCE(f,b.i) = 2 + WHERE CAST(udf(COALESCE(f,b.i) = 2) AS boolean) -- !query 32 schema struct<Two:string,i:int,f:double,i:int,j:int> -- !query 32 output -388,15 +388,15 struct<Two:string,i:int,f:double,i:int,j:int> -- !query 33 SELECT CASE (CASE vol('bar') - WHEN 'foo' THEN 'it was foo!' 
- WHEN vol(null) THEN 'null input' + WHEN udf('foo') THEN 'it was foo!' + WHEN udf(vol(null)) THEN 'null input' WHEN 'bar' THEN 'it was bar!' END ) - WHEN 'it was foo!' THEN 'foo recognized' - WHEN 'it was bar!' THEN 'bar recognized' - ELSE 'unrecognized' END + WHEN udf('it was foo!') THEN 'foo recognized' + WHEN 'it was bar!' THEN udf('bar recognized') + ELSE 'unrecognized' END AS col -- !query 33 schema -struct<CASE WHEN (CASE WHEN (UDF:vol(bar) = foo) THEN it was foo! WHEN (UDF:vol(bar) = UDF:vol(null)) THEN null input WHEN (UDF:vol(bar) = bar) THEN it was bar! END = it was foo!) THEN foo recognized WHEN (CASE WHEN (UDF:vol(bar) = foo) THEN it was foo! WHEN (UDF:vol(bar) = UDF:vol(null)) THEN null input WHEN (UDF:vol(bar) = bar) THEN it was bar! END = it was bar!) THEN bar recognized ELSE unrecognized END:string> +struct<col:string> -- !query 33 output bar recognized ``` </p> </details> https://github.com/apache/spark/pull/25069 contains the same minor fixes as it's required to write the tests. ## How was this patch tested? Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921). Closes #25070 from HyukjinKwon/SPARK-28273. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-07-09 10:50:07 +08:00
HyukjinKwon	cdbc30213b	[SPARK-28226][PYTHON] Document Pandas UDF mapInPandas ## What changes were proposed in this pull request? This PR proposes to document `MAP_ITER` with `mapInPandas`. ## How was this patch tested? Manually checked the documentation. ![Screen Shot 2019-07-05 at 1 52 30 PM](https://user-images.githubusercontent.com/6477701/60698812-26cf2d80-9f2c-11e9-8295-9c00c28f5569.png) ![Screen Shot 2019-07-05 at 1 48 53 PM](https://user-images.githubusercontent.com/6477701/60698710-ac061280-9f2b-11e9-8521-a4f361207e06.png) Closes #25025 from HyukjinKwon/SPARK-28226. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-07 09:07:52 +09:00
HyukjinKwon	fe75ff8bea	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation ## What changes were proposed in this pull request? It seems we used to generate the PySpark API documentation with Epydoc from almost the very beginning (see `85b8f2c64f`). This fixes an actual issue: Before: ![Screen Shot 2019-07-05 at 8 20 01 PM](https://user-images.githubusercontent.com/6477701/60720491-e9879180-9f65-11e9-9562-100830a456cd.png) After: ![Screen Shot 2019-07-05 at 8 20 05 PM](https://user-images.githubusercontent.com/6477701/60720495-ec828200-9f65-11e9-8277-8f689e292cb0.png) It is apparently a bug in the `epytext` plugin during the conversion between `param` and `:param` syntax. See also [Epydoc syntax](http://epydoc.sourceforge.net/manual-epytext.html). Actually, Epydoc syntax violates [PEP-257](https://www.python.org/dev/peps/pep-0257/) IIRC and blocks us from enabling some rules for the doctest linter as well. We should remove this legacy, and I guess Spark 3 is a good time to do it. ## How was this patch tested? Manually built the doc and checked each. I had to manually find the Epydoc syntax by `git grep -r "{L"`, for instance. Closes #25060 from HyukjinKwon/SPARK-28206. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-07-05 10:08:22 -07:00
HyukjinKwon	5c55812400	[SPARK-28198][PYTHON][FOLLOW-UP] Rename mapPartitionsInPandas to mapInPandas with a separate evaluation type ## What changes were proposed in this pull request? This PR proposes to rename `mapPartitionsInPandas` to `mapInPandas` with a separate evaluation type. Had an offline discussion with rxin, mengxr and cloud-fan. The reasons are basically: 1. `SCALAR_ITER` doesn't make sense with `mapPartitionsInPandas`. 2. It cannot share the same Pandas UDF, for instance, at `select` and `mapPartitionsInPandas` unlike `GROUPED_AGG` because the iterator's return type is different. 3. `mapPartitionsInPandas` -> `mapInPandas` - see https://github.com/apache/spark/pull/25044#issuecomment-508298552 and https://github.com/apache/spark/pull/25044#issuecomment-508299764 Renaming `SCALAR_ITER` as `MAP_ITER` is abandoned due to reason 2. For `XXX_ITER`, it might have to have a different interface in the future if we happen to add other versions of them. But this is orthogonal to `mapPartitionsInPandas`. ## How was this patch tested? Existing tests should cover this. Closes #25044 from HyukjinKwon/SPARK-28198. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-05 09:22:41 +09:00
HyukjinKwon	5f7aceb9df	[SPARK-28240][PYTHON] Fix Arrow tests to pass with Python 2.7 and latest PyArrow and Pandas in PySpark ## What changes were proposed in this pull request? In Python 2.7 with latest PyArrow and Pandas, the error message seems a bit different with Python 3. This PR simply fixes the test. ``` ====================================================================== FAIL: test_createDataFrame_with_incorrect_schema (pyspark.sql.tests.test_arrow.ArrowTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/.../spark/python/pyspark/sql/tests/test_arrow.py", line 275, in test_createDataFrame_with_incorrect_schema self.spark.createDataFrame(pdf, schema=wrong_schema) AssertionError: "integer.required.got.str" does not match "('Exception thrown when converting pandas.Series (object) to Arrow Array (int32). It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled by using SQL config `spark.sql.execution.pandas.arrowSafeTypeConversion`.', ArrowTypeError('an integer is required',))" ====================================================================== FAIL: test_createDataFrame_with_incorrect_schema (pyspark.sql.tests.test_arrow.EncryptionArrowTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/.../spark/python/pyspark/sql/tests/test_arrow.py", line 275, in test_createDataFrame_with_incorrect_schema self.spark.createDataFrame(pdf, schema=wrong_schema) AssertionError: "integer.required.got.str" does not match "('Exception thrown when converting pandas.Series (object) to Arrow Array (int32). It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled by using SQL config `spark.sql.execution.pandas.arrowSafeTypeConversion`.', ArrowTypeError('an integer is required',))" ``` ## How was this patch tested? Manually tested. 
``` cd python ./run-tests --python-executables=python --modules pyspark-sql ``` Closes #25042 from HyukjinKwon/SPARK-28240. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-03 17:46:31 +09:00
HyukjinKwon	02f4763286	[SPARK-28198][PYTHON] Add mapPartitionsInPandas to allow an iterator of DataFrames ## What changes were proposed in this pull request? This PR proposes to add `mapPartitionsInPandas` API to DataFrame by using existing `SCALAR_ITER` as below: 1. Filtering via setting the column ```python from pyspark.sql.functions import pandas_udf, PandasUDFType df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age")) @pandas_udf(df.schema, PandasUDFType.SCALAR_ITER) def filter_func(iterator): for pdf in iterator: yield pdf[pdf.id == 1] df.mapPartitionsInPandas(filter_func).show() ``` ``` +---+---+ \| id\|age\| +---+---+ \| 1\| 21\| +---+---+ ``` 2. `DataFrame.loc` ```python from pyspark.sql.functions import pandas_udf, PandasUDFType import pandas as pd df = spark.createDataFrame([['aa'], ['bb'], ['cc'], ['aa'], ['aa'], ['aa']], ["value"]) @pandas_udf(df.schema, PandasUDFType.SCALAR_ITER) def filter_func(iterator): for pdf in iterator: yield pdf.loc[pdf.value.str.contains('^a'), :] df.mapPartitionsInPandas(filter_func).show() ``` ``` +-----+ \|value\| +-----+ \| aa\| \| aa\| \| aa\| \| aa\| +-----+ ``` 3. 
`pandas.melt` ```python from pyspark.sql.functions import pandas_udf, PandasUDFType import pandas as pd df = spark.createDataFrame( pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'}, 'B': {0: 1, 1: 3, 2: 5}, 'C': {0: 2, 1: 4, 2: 6}})) @pandas_udf("A string, variable string, value long", PandasUDFType.SCALAR_ITER) def filter_func(iterator): for pdf in iterator: import pandas as pd yield pd.melt(pdf, id_vars=['A'], value_vars=['B', 'C']) df.mapPartitionsInPandas(filter_func).show() ``` ``` +---+--------+-----+ \| A\|variable\|value\| +---+--------+-----+ \| a\| B\| 1\| \| a\| C\| 2\| \| b\| B\| 3\| \| b\| C\| 4\| \| c\| B\| 5\| \| c\| C\| 6\| +---+--------+-----+ ``` The current limitation of `SCALAR_ITER` is that it doesn't allow a result of a different length, which is pretty critical in practice - for instance, we cannot simply filter by using Pandas APIs but can merely map N to N. This PR allows mapping N to M, like flatMap. This API mimics `mapPartitions` but keeps the API shape of `SCALAR_ITER` while allowing results of a different length. ### How does this PR implement? This PR mimics both `dapply` with Arrow optimization and Grouped Map Pandas UDF. At the Python execution side, it reuses the existing `SCALAR_ITER` code path. Therefore, externally, we don't introduce any new type of Pandas UDF, but internally we use another evaluation type code `205` (`SQL_MAP_PANDAS_ITER_UDF`). This approach is similar to the window function implementation with Grouped Aggregation Pandas UDFs - internally we have `203` (`SQL_WINDOW_AGG_PANDAS_UDF`) but externally we just share the same `GROUPED_AGG`. ## How was this patch tested? Manually tested and unittests were added. Closes #24997 from HyukjinKwon/scalar-udf-iter. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-02 10:54:16 +09:00
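The "map N to M" idea in SPARK-28198 can be shown without Spark or pandas. In the sketch below, batches are plain lists of dicts instead of pandas DataFrames, and `map_in_batches` is a hypothetical stand-in for the DataFrame method; it only illustrates that the user function receives an iterator of batches and may yield batches of a different length.

```python
def map_in_batches(batches, func):
    """Feed an iterator of batches to func and yield whatever batches it produces."""
    yield from func(iter(batches))


def keep_id_one(iterator):
    """Filtering: each output batch may be shorter than its input batch."""
    for batch in iterator:
        yield [row for row in batch if row["id"] == 1]


def melt_b_and_c(iterator):
    """Melt-like expansion: each input row becomes two output rows."""
    for batch in iterator:
        yield [
            {"A": row["A"], "variable": name, "value": row[name]}
            for row in batch
            for name in ("B", "C")
        ]


people = [[{"id": 1, "age": 21}, {"id": 2, "age": 30}]]
filtered = list(map_in_batches(people, keep_id_one))

wide = [[{"A": "a", "B": 1, "C": 2}]]
melted = list(map_in_batches(wide, melt_b_and_c))
```

The first call shrinks a two-row batch to one row and the second grows a one-row batch to two, which a strictly N-to-N mapping could not express.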
Marco Gaido	048224ce9a	[SPARK-28170][ML][PYTHON] Uniform Vectors and Matrix documentation ## What changes were proposed in this pull request? The documentation in `linalg.py` is not consistent. This PR makes the documentation uniform. ## How was this patch tested? NA Closes #25011 from mgaido91/SPARK-28170. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-01 11:40:12 +09:00
Xiangrui Meng	8299600575	[SPARK-28056][.2][PYTHON][SQL] add docstring/doctest for SCALAR_ITER Pandas UDF ## What changes were proposed in this pull request? Add docstring/doctest for `SCALAR_ITER` Pandas UDF. I explicitly mentioned that per-partition execution is an implementation detail, not guaranteed. I will submit another PR to add the same to user guide, just to keep this PR minimal. I didn't add "doctest: +SKIP" in the first commit so it is easy to test locally. cc: HyukjinKwon gatorsmile icexelloss BryanCutler WeichenXu123 ![Screen Shot 2019-06-28 at 9 52 41 AM](https://user-images.githubusercontent.com/829644/60358349-b0aa5400-998a-11e9-9ebf-8481dfd555b5.png) ![Screen Shot 2019-06-28 at 9 53 19 AM](https://user-images.githubusercontent.com/829644/60358355-b1db8100-998a-11e9-8f6f-00a11bdbdc4d.png) ## How was this patch tested? doctest Closes #25005 from mengxr/SPARK-28056.2. Authored-by: Xiangrui Meng <meng@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-06-28 15:09:57 -07:00
WeichenXu	31e7c37354	[SPARK-28185][PYTHON][SQL] Closes the generator when Python UDFs stop early ## What changes were proposed in this pull request? Closes the generator when Python UDFs stop early. ### Manual verification on pandas iterator UDF and mapPartitions ```python from pyspark.sql import SparkSession from pyspark.sql.functions import pandas_udf, PandasUDFType from pyspark.sql.functions import col, udf from pyspark.taskcontext import TaskContext import time import os spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', '1') spark.conf.set('spark.sql.pandas.udf.buffer.size', '4') @pandas_udf("int", PandasUDFType.SCALAR_ITER) def fi1(it): try: for batch in it: yield batch + 100 time.sleep(1.0) except BaseException as be: print("Debug: exception raised: " + str(type(be))) raise be finally: open("/tmp/000001.tmp", "a").close() df1 = spark.range(10).select(col('id').alias('a')).repartition(1) # will see log Debug: exception raised: <class 'GeneratorExit'> # and file "/tmp/000001.tmp" generated. df1.select(col('a'), fi1('a')).limit(2).collect() def mapper(it): try: for batch in it: yield batch except BaseException as be: print("Debug: exception raised: " + str(type(be))) raise be finally: open("/tmp/000002.tmp", "a").close() df2 = spark.range(10000000).repartition(1) # will see log Debug: exception raised: <class 'GeneratorExit'> # and file "/tmp/000002.tmp" generated. df2.rdd.mapPartitions(mapper).take(2) ``` ## How was this patch tested? Unit test added. Please review https://spark.apache.org/contributing.html before opening a pull request. Closes #24986 from WeichenXu123/pandas_iter_udf_limit. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-06-28 17:10:25 +09:00
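The generator behavior that SPARK-28185 relies on can be reproduced in plain Python. This sketch uses no Spark; `batches` and `take` are hypothetical names, and the log list stands in for the marker files in the manual verification above.

```python
import itertools


def batches(log):
    """A generator that records how it was shut down."""
    try:
        for i in range(10):
            yield i + 100
    except GeneratorExit:
        log.append("GeneratorExit")  # raised inside the generator by close()
        raise
    finally:
        log.append("cleanup")  # runs whether or not the generator finished


def take(gen, n):
    """Consume only the first n items, then close the generator explicitly."""
    try:
        return list(itertools.islice(gen, n))
    finally:
        gen.close()


log = []
first_two = take(batches(log), 2)
```

Without the explicit `close()`, the cleanup in `finally` would only run whenever the abandoned generator happened to be garbage collected.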
Bryan Cutler	c277afb12b	[SPARK-27992][PYTHON] Allow Python to join with connection thread to propagate errors ## What changes were proposed in this pull request? Currently with `toLocalIterator()` and `toPandas()` with Arrow enabled, if the Spark job being run in the background serving thread errors, it will be caught and sent to Python through the PySpark serializer. This is not the ideal solution because it only catches a SparkException, it won't handle an error that occurs in the serializer, and each method has to have its own special handling to propagate the error. This PR instead returns the Python Server object along with the serving port and authentication info, so that it allows the Python caller to join with the serving thread. During the call to join, the serving thread Future is completed either successfully or with an exception. In the latter case, the exception will be propagated to Python through the Py4j call. ## How was this patch tested? Existing tests Closes #24834 from BryanCutler/pyspark-propagate-server-error-SPARK-27992. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-06-26 13:05:41 -07:00
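The "join with the serving thread" idea in SPARK-27992 can be illustrated with a plain-Python analogy. This is not the Spark or Py4j implementation: it only shows, using the standard `concurrent.futures` module, how waiting on a background task's future surfaces an exception raised in that task. The names `serve` and `run_and_join` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def serve(fail):
    """Stand-in for the background serving job; may raise any exception type."""
    if fail:
        raise RuntimeError("job failed in serving thread")
    return "done"


def run_and_join(fail):
    """Run `serve` in a background thread and join on its future."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(serve, fail)
        # result() blocks until the task finishes and re-raises its exception,
        # so the caller needs no per-method error handling.
        return future.result()
```

Calling `run_and_join(True)` raises the `RuntimeError` in the caller's thread, whatever the exception type, which mirrors the motivation stated in the commit message.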
zhengruifeng	c397b06924	[SPARK-28045][ML][PYTHON] add missing RankingEvaluator ## What changes were proposed in this pull request? add missing RankingEvaluator ## How was this patch tested? added testsuites Closes #24869 from zhengruifeng/ranking_eval. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-06-25 06:44:06 -05:00
Li Jin	d0fbc4da3b	[SPARK-28003][PYTHON] Allow NaT values when creating Spark dataframe from pandas with Arrow ## What changes were proposed in this pull request? This patch removes `fillna(0)` when creating ArrowBatch from a pandas Series. With `fillna(0)`, the original code would turn a timestamp type into object type, which pyarrow will complain later: ``` >>> s = pd.Series([pd.NaT, pd.Timestamp('2015-01-01')]) >>> s.dtypes dtype('<M8[ns]') >>> s.fillna(0) 0 0 1 2015-01-01 00:00:00 dtype: object ``` ## How was this patch tested? Added `test_timestamp_nat` Closes #24844 from icexelloss/SPARK-28003-arrow-nat. Authored-by: Li Jin <ice.xelloss@gmail.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-06-24 11:15:21 -07:00
HyukjinKwon	7c05f61514	[SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark ## What changes were proposed in this pull request? Currently, the pretty skipped-test message mechanism added by `f7435bec6a` apparently does not work when xmlrunner is installed. This PR fixes two things: 1. When `xmlrunner` is installed, `xmlrunner` does not seem to respect the `verbosity` level in unittests (default is level 1). So the output looks as below ``` Running tests... ---------------------------------------------------------------------- SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS ---------------------------------------------------------------------- ``` So it is not caught by our message detection mechanism. 2. If we manually set the `verbosity` level for `xmlrunner`, it prints messages as below: ``` test_mixed_udf (pyspark.sql.tests.test_pandas_udf_scalar.ScalarPandasUDFTests) ... SKIP (0.000s) test_mixed_udf_and_sql (pyspark.sql.tests.test_pandas_udf_scalar.ScalarPandasUDFTests) ... SKIP (0.000s) ... ``` This is different from the output on our Jenkins machine: ``` test_createDataFrame_column_name_encoding (pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.23.2 must be installed; however, it was not found.' test_createDataFrame_does_not_modify_input (pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.23.2 must be installed; however, it was not found.' ... ``` Note that the last `SKIP` is different. This PR fixes the regular expression to catch the `SKIP` case as well. ## How was this patch tested? Manually tested. Before: ``` Starting test(python2.7): pyspark.... Finished test(python2.7): pyspark.... (0s) ... Tests passed in 562 seconds ======================================================================== ... ``` After: ``` Starting test(python2.7): pyspark.... Finished test(python2.7): pyspark.... (48s) ... 93 tests were skipped ... Tests passed in 560 seconds Skipped tests pyspark.... with python2.7: pyspark...(...) ... 
SKIP (0.000s) ... ======================================================================== ... ``` Closes #24927 from HyukjinKwon/SPARK-28130. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-06-24 09:58:17 +09:00
Bryan Cutler	5ad1053f3e	[SPARK-28128][PYTHON][SQL] Pandas Grouped UDFs skip empty partitions ## What changes were proposed in this pull request? When running FlatMapGroupsInPandasExec or AggregateInPandasExec the shuffle uses a default number of partitions of 200 in "spark.sql.shuffle.partitions". If the data is small, e.g. in testing, many of the partitions will be empty but are treated just the same. This PR checks the `mapPartitionsInternal` iterator to be non-empty before calling `ArrowPythonRunner` to start computation on the iterator. ## How was this patch tested? Existing tests. Ran the following benchmark, a simple example where most partitions are empty: ```python from pyspark.sql.functions import pandas_udf, PandasUDFType from pyspark.sql.types import * df = spark.createDataFrame( [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v")) @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP) def normalize(pdf): v = pdf.v return pdf.assign(v=(v - v.mean()) / v.std()) df.groupby("id").apply(normalize).count() ``` Before ``` In [4]: %timeit df.groupby("id").apply(normalize).count() 1.58 s ± 62.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [5]: %timeit df.groupby("id").apply(normalize).count() 1.52 s ± 29.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [6]: %timeit df.groupby("id").apply(normalize).count() 1.52 s ± 37.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) ``` After this Change ``` In [2]: %timeit df.groupby("id").apply(normalize).count() 646 ms ± 89.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [3]: %timeit df.groupby("id").apply(normalize).count() 408 ms ± 84.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [4]: %timeit df.groupby("id").apply(normalize).count() 381 ms ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) ``` Closes #24926 from BryanCutler/pyspark-pandas_udf-map-agg-skip-empty-parts-SPARK-28128. 
Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-06-22 11:20:35 +09:00
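The "check that the partition iterator is non-empty before starting the runner" pattern described in SPARK-28128 can be sketched in plain Python. This is an illustration of the general technique only, not the Scala code in the patch; `run_if_non_empty` and `compute` are hypothetical names, with `compute` standing in for the expensive step (starting a Python worker).

```python
from itertools import chain


def run_if_non_empty(partition, compute):
    """Call `compute` only when `partition` yields at least one element."""
    it = iter(partition)
    try:
        first = next(it)  # peek at the first element
    except StopIteration:
        return iter(())   # empty partition: skip the expensive work entirely
    # Put the peeked element back in front of the remaining ones.
    return compute(chain([first], it))
```

With many empty shuffle partitions, the early return avoids paying the fixed startup cost once per partition, which is consistent with the speedup reported in the benchmark above.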
HyukjinKwon	113f8c8d13	[SPARK-28132][PYTHON] Update document type conversion for Pandas UDFs (pyarrow 0.13.0, pandas 0.24.2, Python 3.7) ## What changes were proposed in this pull request? This PR updates the chart generated at SPARK-25666. We deprecated Python 2. It's better to use Python 3. We don't have to test `unicode` and `long` anymore in Python 3. So it was removed. Use this code to generate the chart: ```python from pyspark.sql.types import * from pyspark.sql.functions import pandas_udf columns = [ ('none', 'object(NoneType)'), ('bool', 'bool'), ('int8', 'int8'), ('int16', 'int16'), ('int32', 'int32'), ('int64', 'int64'), ('uint8', 'uint8'), ('uint16', 'uint16'), ('uint32', 'uint32'), ('uint64', 'uint64'), ('float64', 'float16'), ('float64', 'float32'), ('float64', 'float64'), ('date', 'datetime64[ns]'), ('tz_aware_dates', 'datetime64[ns, US/Eastern]'), ('string', 'object(string)'), ('decimal', 'object(Decimal)'), ('array', 'object(array[int32])'), ('float128', 'float128'), ('complex64', 'complex64'), ('complex128', 'complex128'), ('category', 'category'), ('tdeltas', 'timedelta64[ns]'), ] def create_dataframe(): import pandas as pd import numpy as np import decimal pdf = pd.DataFrame({ 'none': [None, None], 'bool': [True, False], 'int8': np.arange(1, 3).astype('int8'), 'int16': np.arange(1, 3).astype('int16'), 'int32': np.arange(1, 3).astype('int32'), 'int64': np.arange(1, 3).astype('int64'), 'uint8': np.arange(1, 3).astype('uint8'), 'uint16': np.arange(1, 3).astype('uint16'), 'uint32': np.arange(1, 3).astype('uint32'), 'uint64': np.arange(1, 3).astype('uint64'), 'float16': np.arange(1, 3).astype('float16'), 'float32': np.arange(1, 3).astype('float32'), 'float64': np.arange(1, 3).astype('float64'), 'float128': np.arange(1, 3).astype('float128'), 'complex64': np.arange(1, 3).astype('complex64'), 'complex128': np.arange(1, 3).astype('complex128'), 'string': list('ab'), 'array': pd.Series([np.array([1, 2, 3], dtype=np.int32), np.array([1, 2, 3], 
dtype=np.int32)]), 'decimal': pd.Series([decimal.Decimal('1'), decimal.Decimal('2')]), 'date': pd.date_range('19700101', periods=2).values, 'category': pd.Series(list("AB")).astype('category')}) pdf['tdeltas'] = [pdf.date.diff()[1], pdf.date.diff()[0]] pdf['tz_aware_dates'] = pd.date_range('19700101', periods=2, tz='US/Eastern') return pdf types = [ BooleanType(), ByteType(), ShortType(), IntegerType(), LongType(), FloatType(), DoubleType(), DateType(), TimestampType(), StringType(), DecimalType(10, 0), ArrayType(IntegerType()), MapType(StringType(), IntegerType()), StructType([StructField("_1", IntegerType())]), BinaryType(), ] df = spark.range(2).repartition(1) results = [] count = 0 total = len(types) * len(columns) values = [] spark.sparkContext.setLogLevel("FATAL") for t in types: result = [] for column, pandas_t in columns: v = create_dataframe()[column][0] values.append(v) try: row = df.select(pandas_udf(lambda _: create_dataframe()[column], t)(df.id)).first() ret_str = repr(row[0]) except Exception: ret_str = "X" result.append(ret_str) progress = "SQL Type: [%s]\n Pandas Value(Type): %s(%s)]\n Result Python Value: [%s]" % ( t.simpleString(), v, pandas_t, ret_str) count += 1 print("%s/%s:\n %s" % (count, total, progress)) results.append([t.simpleString()] + list(map(str, result))) schema = ["SQL Type \\ Pandas Value(Type)"] + list(map(lambda values_column: "%s(%s)" % (values_column[0], values_column[1][1]), zip(values, columns))) strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, False) print("\n".join(map(lambda line: " # %s # noqa" % line, strings.strip().split("\n")))) ``` ## How was this patch tested? Manually. Closes #24930 from HyukjinKwon/SPARK-28132. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-06-21 10:47:54 -07:00
HyukjinKwon	9b9d81b821	[SPARK-28131][PYTHON] Update document type conversion between Python data and SQL types in normal UDFs (Python 3.7) ## What changes were proposed in this pull request? This PR updates the chart generated at SPARK-25666. We deprecated Python 2. It's better to use Python 3. We don't have to test `unicode` and `long` anymore in Python 3. So it was removed. Use this code to generate the chart: ```python import sys import array import datetime from decimal import Decimal from pyspark.sql import Row from pyspark.sql.types import * from pyspark.sql.functions import udf data = [ None, True, 1, "a", datetime.date(1970, 1, 1), datetime.datetime(1970, 1, 1, 0, 0), 1.0, array.array("i", [1]), [1], (1,), bytearray([65, 66, 67]), Decimal(1), {"a": 1}, Row(kwargs=1), Row("namedtuple")(1), ] types = [ BooleanType(), ByteType(), ShortType(), IntegerType(), LongType(), StringType(), DateType(), TimestampType(), FloatType(), DoubleType(), ArrayType(IntegerType()), BinaryType(), DecimalType(10, 0), MapType(StringType(), IntegerType()), StructType([StructField("_1", IntegerType())]), ] df = spark.range(1) results = [] count = 0 total = len(types) * len(data) spark.sparkContext.setLogLevel("FATAL") for t in types: result = [] for v in data: try: row = df.select(udf(lambda: v, t)()).first() ret_str = repr(row[0]) except Exception: ret_str = "X" result.append(ret_str) progress = "SQL Type: [%s]\n Python Value: [%s(%s)]\n Result Python Value: [%s]" % ( t.simpleString(), str(v), type(v).__name__, ret_str) count += 1 print("%s/%s:\n %s" % (count, total, progress)) results.append([t.simpleString()] + list(map(str, result))) schema = ["SQL Type \\ Python Value(Type)"] + list(map(lambda v: "%s(%s)" % (str(v), type(v).__name__), data)) strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, False) print("\n".join(map(lambda line: " # %s # noqa" % line, strings.strip().split("\n")))) ``` ## How was this patch tested? Manually. 
Closes #24929 from HyukjinKwon/SPARK-28131. Lead-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-06-21 10:27:18 -07:00
tools4origins	25c5d57883	[MINOR][DOC] Fix python variance() documentation ## What changes were proposed in this pull request? The Python documentation incorrectly says that `variance()` acts as `var_pop` whereas it acts like `var_samp` here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.variance It was not the case in Spark 1.6 doc but it is in Spark 2.0 doc: https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html The Scala documentation is correct: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#variance-org.apache.spark.sql.Column- The alias is set on this line: https://github.com/apache/spark/blob/v2.4.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L786 ## How was this patch tested? Using variance() in pyspark 2.4.3 returns: ``` >>> spark.createDataFrame([(1, ), (2, ), (3, )], "a: int").select(variance("a")).show() +-----------+ |var_samp(a)| +-----------+ | 1.0| +-----------+ ``` Closes #24895 from tools4origins/patch-1. Authored-by: tools4origins <tools4origins@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-20 08:10:19 -07:00
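The difference between the sample and population variance mentioned in the commit above can be checked on the same three values with Python's standard `statistics` module. This does not call Spark: it only shows that the `1.0` in the output above matches the sample variance (divide by n - 1), not the population variance (divide by n).

```python
import statistics

data = [1.0, 2.0, 3.0]

# Sample variance: sum of squared deviations divided by (n - 1), like var_samp.
sample_var = statistics.variance(data)

# Population variance: sum of squared deviations divided by n, like var_pop.
population_var = statistics.pvariance(data)
```

Here the squared deviations from the mean 2.0 sum to 2.0, so the sample variance is 2.0 / 2 and the population variance is 2.0 / 3.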
zhengruifeng	9ec049601a	[SPARK-28044][ML][PYTHON] MulticlassClassificationEvaluator support more metrics ## What changes were proposed in this pull request? expose more metrics in evaluator: weightedTruePositiveRate/weightedFalsePositiveRate/weightedFMeasure/truePositiveRateByLabel/falsePositiveRateByLabel/precisionByLabel/recallByLabel/fMeasureByLabel ## How was this patch tested? existing cases and add cases Closes #24868 from zhengruifeng/multi_class_support_bylabel. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-06-19 08:56:15 -05:00

2336 commits