ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Alex Favaro	96c1a4401d	[SPARK-30856][SQL][PYSPARK] Fix SQLContext.getOrCreate() when SparkContext is restarted ### What changes were proposed in this pull request? As discussed on the Jira ticket, this change clears the SQLContext._instantiatedContext class attribute when the SparkSession is stopped. That way, the attribute will be reset with a new, usable SQLContext when a new SparkSession is started. ### Why are the changes needed? When the underlying SQLContext is instantiated for a SparkSession, the instance is saved as a class attribute and returned from subsequent calls to SQLContext.getOrCreate(). If the SparkContext is stopped and a new one started, the SQLContext class attribute is never cleared so any code which calls SQLContext.getOrCreate() will get a SQLContext with a reference to the old, unusable SparkContext. A similar issue was identified and fixed for SparkSession in [SPARK-19055](https://issues.apache.org/jira/browse/SPARK-19055), but the fix did not change SQLContext as well. I ran into this because mllib still [uses](https://github.com/apache/spark/blob/master/python/pyspark/mllib/common.py#L105) SQLContext.getOrCreate() under the hood. ### Does this PR introduce any user-facing change? No ### How was this patch tested? A new test was added. I verified that the test fails without the included change. Closes #27610 from afavaro/restart-sqlcontext. Authored-by: Alex Favaro <alex.favaro@affirm.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-20 12:21:24 +09:00
Xingbo Jiang	e32411eb07	Revert "[SPARK-30667][CORE] Add allGather method to BarrierTaskContext" This reverts commit `af63971cb7`.	2020-02-19 17:04:47 -08:00
sarthfrey-db	af63971cb7	[SPARK-30667][CORE] Add allGather method to BarrierTaskContext ### What changes were proposed in this pull request? The `allGather` method is added to the `BarrierTaskContext`. This method contains the same functionality as the `BarrierTaskContext.barrier` method; it blocks the task until all tasks make the call, at which time they may continue execution. In addition, the `allGather` method takes an input message. Upon returning from the `allGather` the task receives a list of all the messages sent by all the tasks that made the `allGather` call. ### Why are the changes needed? There are many situations where having the tasks communicate in a synchronized way is useful. One simple example is if each task needs to start a server to serve requests from one another; first the tasks must find a free port (the result of which is undetermined beforehand) and then start making requests, but to do so they each must know the port chosen by the other task. An `allGather` method would allow them to inform each other of the port they will run on. ### Does this PR introduce any user-facing change? Yes, a `BarrierTaskContext.allGather` method will be available through the Scala, Java, and Python APIs. ### How was this patch tested? Most of the code path is already covered by tests to the `barrier` method, since this PR includes a refactor so that much code is shared by the `barrier` and `allGather` methods. However, a test is added to assert that an all gather on each task's partition ID will return a list of every partition ID. An example through the Python API: ```python >>> from pyspark import BarrierTaskContext >>> >>> def f(iterator): ... context = BarrierTaskContext.get() ... return [context.allGather('{}'.format(context.partitionId()))] ... >>> sc.parallelize(range(4), 4).barrier().mapPartitions(f).collect()[0] [u'3', u'1', u'0', u'2'] ``` Closes #27395 from sarthfrey/master. 
Lead-authored-by: sarthfrey-db <sarth.frey@databricks.com> Co-authored-by: sarthfrey <sarth.frey@gmail.com> Signed-off-by: Xiangrui Meng <meng@databricks.com> (cherry picked from commit `57254c9719`) Signed-off-by: Xiangrui Meng <meng@databricks.com>	2020-02-19 12:10:51 -08:00
HyukjinKwon	e065e22e5e	[SPARK-30861][PYTHON][SQL] Deprecate constructor of SQLContext and getOrCreate in SQLContext at PySpark ### What changes were proposed in this pull request? This PR proposes to deprecate the APIs at `SQLContext` removed in SPARK-25908. We should remove equivalent APIs; however, it seems we missed deprecating them. While I am here, I fix one more issue. After SPARK-25908, `sc._jvm.SQLContext.getOrCreate` does not exist anymore. So, ```python from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext.getOrCreate() SQLContext.getOrCreate(sc).range(10).show() ``` throws an exception as below: ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/context.py", line 110, in getOrCreate jsqlContext = sc._jvm.SQLContext.getOrCreate(sc._jsc.sc()) File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1516, in __getattr__ py4j.protocol.Py4JError: org.apache.spark.sql.SQLContext.getOrCreate does not exist in the JVM ``` After this PR: ``` /.../spark/python/pyspark/sql/context.py:113: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead. DeprecationWarning) +---+ \| id\| +---+ \| 0\| \| 1\| \| 2\| \| 3\| \| 4\| \| 5\| \| 6\| \| 7\| \| 8\| \| 9\| +---+ ``` In case of the constructor of `SQLContext`, after this PR: ```python from pyspark.sql import SQLContext sc = SparkContext.getOrCreate() SQLContext(sc) ``` ``` /.../spark/python/pyspark/sql/context.py:77: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead. DeprecationWarning) ``` ### Why are the changes needed? To promote the use of SparkSession, and keep API parity with the Scala side. ### Does this PR introduce any user-facing change? Yes, it will show deprecation warning to users. ### How was this patch tested? Manually tested as described above. Unittests were also added. Closes #27614 from HyukjinKwon/SPARK-30861. 
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-19 11:17:47 +09:00
yi.wu	68d7edf949	[SPARK-30812][SQL][CORE] Revise boolean config name to comply with new config naming policy ### What changes were proposed in this pull request? Revise below config names to comply with [new config naming policy](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-naming-policy-of-Spark-configs-td28875.html): SQL: * spark.sql.execution.subquery.reuse.enabled / [SPARK-27083](https://issues.apache.org/jira/browse/SPARK-27083) * spark.sql.legacy.allowNegativeScaleOfDecimal.enabled / [SPARK-30252](https://issues.apache.org/jira/browse/SPARK-30252) * spark.sql.adaptive.optimizeSkewedJoin.enabled / [SPARK-29544](https://issues.apache.org/jira/browse/SPARK-29544) * spark.sql.legacy.property.nonReserved / [SPARK-30183](https://issues.apache.org/jira/browse/SPARK-30183) * spark.sql.streaming.forceDeleteTempCheckpointLocation.enabled / [SPARK-26389](https://issues.apache.org/jira/browse/SPARK-26389) * spark.sql.analyzer.failAmbiguousSelfJoin.enabled / [SPARK-28344](https://issues.apache.org/jira/browse/SPARK-28344) * spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled / [SPARK-30074](https://issues.apache.org/jira/browse/SPARK-30074) * spark.sql.execution.pandas.arrowSafeTypeConversion / [SPARK-25811](https://issues.apache.org/jira/browse/SPARK-25811) * spark.sql.legacy.looseUpcast / [SPARK-24586](https://issues.apache.org/jira/browse/SPARK-24586) * spark.sql.legacy.arrayExistsFollowsThreeValuedLogic / [SPARK-28052](https://issues.apache.org/jira/browse/SPARK-28052) * spark.sql.sources.ignoreDataLocality.enabled / [SPARK-29189](https://issues.apache.org/jira/browse/SPARK-29189) * spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled / [SPARK-9853](https://issues.apache.org/jira/browse/SPARK-9853) CORE: * spark.eventLog.erasureCoding.enabled / [SPARK-25855](https://issues.apache.org/jira/browse/SPARK-25855) * spark.shuffle.readHostLocalDisk.enabled / [SPARK-30235](https://issues.apache.org/jira/browse/SPARK-30235) * 
spark.scheduler.listenerbus.logSlowEvent.enabled / [SPARK-29001](https://issues.apache.org/jira/browse/SPARK-29001) * spark.resources.coordinate.enable / [SPARK-27371](https://issues.apache.org/jira/browse/SPARK-27371) * spark.eventLog.logStageExecutorMetrics.enabled / [SPARK-23429](https://issues.apache.org/jira/browse/SPARK-23429) ### Why are the changes needed? To comply with the config naming policy. ### Does this PR introduce any user-facing change? No. Configurations listed above are all newly added in Spark 3.0. ### How was this patch tested? Pass Jenkins. Closes #27563 from Ngone51/revise_boolean_conf_name. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-18 20:39:50 +08:00
David Toneian	504b5135d0	[SPARK-30859][PYSPARK][DOCS][MINOR] Fixed docstring syntax issues preventing proper compilation of documentation This commit is published into the public domain. ### What changes were proposed in this pull request? Some syntax issues in docstrings have been fixed. ### Why are the changes needed? In some places, the documentation did not render as intended, e.g. parameter documentations were not formatted as such. ### Does this PR introduce any user-facing change? Slight improvements in documentation. ### How was this patch tested? Manual testing. No new Sphinx warnings arise due to this change. Closes #27613 from DavidToneian/SPARK-30859. Authored-by: David Toneian <david@toneian.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-18 16:46:45 +09:00
Liang Zhang	d8c0599e54	[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'semanticHash' methods in Dataset ### What changes were proposed in this pull request? This PR added two DeveloperApis to the Dataset[T] class. Both methods are just exposing lower-level methods to the Dataset[T] class. ### Why are the changes needed? They are useful for checking whether two dataframes are the same when implementing dataframe caching in python, and also get a unique ID. It's easier to use if we wrap the lower-level APIs. ### Does this PR introduce any user-facing change? ``` scala> val df1 = Seq((1,2),(4,5)).toDF("col1", "col2") df1: org.apache.spark.sql.DataFrame = [col1: int, col2: int] scala> val df2 = Seq((1,2),(4,5)).toDF("col1", "col2") df2: org.apache.spark.sql.DataFrame = [col1: int, col2: int] scala> val df3 = Seq((0,2),(4,5)).toDF("col1", "col2") df3: org.apache.spark.sql.DataFrame = [col1: int, col2: int] scala> val df4 = Seq((0,2),(4,5)).toDF("col0", "col2") df4: org.apache.spark.sql.DataFrame = [col0: int, col2: int] scala> df1.semanticHash res0: Int = 594427822 scala> df2.semanticHash res1: Int = 594427822 scala> df1.sameSemantics(df2) res2: Boolean = true scala> df1.sameSemantics(df3) res3: Boolean = false scala> df3.semanticHash res4: Int = -1592702048 scala> df4.semanticHash res5: Int = -1592702048 scala> df4.sameSemantics(df3) res6: Boolean = true ``` ### How was this patch tested? Unit test in scala and doctest in python. Note: comments are copied from the corresponding lower-level APIs. Note: There are some issues to be fixed that would improve the hash collision rate: https://github.com/apache/spark/pull/27565#discussion_r379881028 Closes #27565 from liangz1/df-same-result. Authored-by: Liang Zhang <liang.zhang@databricks.com> Signed-off-by: WeichenXu <weichen.xu@databricks.com>	2020-02-18 09:22:26 +08:00
Yuanjian Li	ab186e3659	[SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior ### What changes were proposed in this pull request? This is a follow-up for #23124, add a new config `spark.sql.legacy.allowDuplicatedMapKeys` to control the behavior of removing duplicated map keys in built-in functions. With the default value `false`, Spark will throw a RuntimeException when duplicated keys are found. ### Why are the changes needed? Prevent silent behavior changes. ### Does this PR introduce any user-facing change? Yes, new config added and the default behavior for duplicated map keys changed to RuntimeException thrown. ### How was this patch tested? Modify existing UT. Closes #27478 from xuanyuanking/SPARK-25892-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-17 22:06:58 +08:00
David Toneian	25db8c71a2	[PYSPARK][DOCS][MINOR] Changed `:func:` to `:attr:` Sphinx roles, fixed links in documentation of `Data{Frame,Stream}{Reader,Writer}` This commit is published into the public domain. ### What changes were proposed in this pull request? This PR fixes the documentation of `DataFrameReader`, `DataFrameWriter`, `DataStreamReader`, and `DataStreamWriter`, where attributes of other classes were misrepresented as functions. Additionally, creation of hyperlinks across modules was fixed in these instances. ### Why are the changes needed? The old state produced documentation that suggested invalid usage of PySpark objects (accessing attributes as though they were callable.) ### Does this PR introduce any user-facing change? No, except for improved documentation. ### How was this patch tested? No test added; documentation build runs through. Closes #27553 from DavidToneian/docfix-DataFrameReader-DataFrameWriter-DataStreamReader-DataStreamWriter. Authored-by: David Toneian <david@toneian.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-14 11:00:35 +09:00
Xingbo Jiang	fa3517cdb1	Revert "[SPARK-30667][CORE] Add allGather method to BarrierTaskContext" This reverts commit `57254c9719`.	2020-02-13 17:43:55 -08:00
sarthfrey-db	57254c9719	[SPARK-30667][CORE] Add allGather method to BarrierTaskContext ### What changes were proposed in this pull request? The `allGather` method is added to the `BarrierTaskContext`. This method contains the same functionality as the `BarrierTaskContext.barrier` method; it blocks the task until all tasks make the call, at which time they may continue execution. In addition, the `allGather` method takes an input message. Upon returning from the `allGather` the task receives a list of all the messages sent by all the tasks that made the `allGather` call. ### Why are the changes needed? There are many situations where having the tasks communicate in a synchronized way is useful. One simple example is if each task needs to start a server to serve requests from one another; first the tasks must find a free port (the result of which is undetermined beforehand) and then start making requests, but to do so they each must know the port chosen by the other task. An `allGather` method would allow them to inform each other of the port they will run on. ### Does this PR introduce any user-facing change? Yes, a `BarrierTaskContext.allGather` method will be available through the Scala, Java, and Python APIs. ### How was this patch tested? Most of the code path is already covered by tests to the `barrier` method, since this PR includes a refactor so that much code is shared by the `barrier` and `allGather` methods. However, a test is added to assert that an all gather on each task's partition ID will return a list of every partition ID. An example through the Python API: ```python >>> from pyspark import BarrierTaskContext >>> >>> def f(iterator): ... context = BarrierTaskContext.get() ... return [context.allGather('{}'.format(context.partitionId()))] ... >>> sc.parallelize(range(4), 4).barrier().mapPartitions(f).collect()[0] [u'3', u'1', u'0', u'2'] ``` Closes #27395 from sarthfrey/master. 
Lead-authored-by: sarthfrey-db <sarth.frey@databricks.com> Co-authored-by: sarthfrey <sarth.frey@gmail.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2020-02-13 16:15:00 -08:00
Liang Zhang	82d0aa37ae	[SPARK-30762] Add dtype=float32 support to vector_to_array UDF ### What changes were proposed in this pull request? In this PR, we add a parameter in the python function vector_to_array(col) that allows converting to a column of arrays of Float (32bits) in scala, which would be mapped to a numpy array of dtype=float32. ### Why are the changes needed? In the downstream ML training, using float32 instead of float64 (default) would allow a larger batch size, i.e., allow more data to fit in the memory. ### Does this PR introduce any user-facing change? Yes. Old: `vector_to_array()` only takes one param ``` df.select(vector_to_array("colA"), ...) ``` New: `vector_to_array()` can take an additional optional param: `dtype` = "float32" (or "float64") ``` df.select(vector_to_array("colA", "float32"), ...) ``` ### How was this patch tested? Unit test in scala. doctest in python. Closes #27522 from liangz1/udf-float32. Authored-by: Liang Zhang <liang.zhang@databricks.com> Signed-off-by: WeichenXu <weichen.xu@databricks.com>	2020-02-13 23:55:13 +08:00
Thomas Graves	496f6ac860	[SPARK-29148][CORE] Add stage level scheduling dynamic allocation and scheduler backend changes ### What changes were proposed in this pull request? This is another PR for stage level scheduling. In particular this adds changes to the dynamic allocation manager and the scheduler backend to be able to track what executors are needed per ResourceProfile. Note the api is still private to Spark until the entire feature gets in, so this functionality will be there but only usable by tests for profiles other than the DefaultProfile. The main changes here are simply tracking things on a ResourceProfile basis as well as sending the executor requests to the scheduler backend for all ResourceProfiles. I introduce a ResourceProfileManager in this PR that will track all the actual ResourceProfile objects so that we can keep them all in a single place and just pass around and use in data structures the resource profile id. The resource profile id can be used with the ResourceProfileManager to get the actual ResourceProfile contents. There are various places in the code that use executor "slots" for things. The ResourceProfile adds functionality to keep that calculation in it. This logic is more complex than it should be due to standalone mode and mesos coarse grained not setting the executor cores config. They default to all cores on the worker, so calculating slots is harder there. This PR keeps the functionality to make the cores the limiting resource because the scheduler still uses that for "slots" for a few things. This PR does also add the resource profile id to the Stage and stage info classes to be able to test things more easily. That full set of changes will come with the scheduler PR that will be after this one. 
The PR stops at the scheduler backend pieces for the cluster manager and the real YARN support hasn't been added in this PR, that again will be in a separate PR, so this has a few of the API changes up to the cluster manager and then just uses the default profile requests to continue. The code for the entire feature is here for reference: https://github.com/apache/spark/pull/27053/files although it needs to be upmerged again as well. ### Why are the changes needed? Needed for stage level scheduling feature. ### Does this PR introduce any user-facing change? No user facing api changes added here. ### How was this patch tested? Lots of unit tests and manually testing. I tested on yarn, k8s, standalone, local modes. Ran both failure and success cases. Closes #27313 from tgravescs/SPARK-29148. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2020-02-12 16:45:42 -06:00
HyukjinKwon	aa6a60530e	[SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints ### What changes were proposed in this pull request? This PR aims to document the Pandas UDF redesign with type hints introduced at SPARK-28264. Mostly self-describing; however, there are a few things to note for reviewers. 1. This PR replaces the existing documentation of pandas UDFs with the newer redesign to promote the Python type hints. I added a note that Spark 3.0 still keeps compatibility, though. 2. This PR proposes to name non-pandas UDFs as "Pandas Function API" 3. SCALAR_ITER becomes two separate sections to reduce confusion: - `Iterator[pd.Series]` -> `Iterator[pd.Series]` - `Iterator[Tuple[pd.Series, ...]]` -> `Iterator[pd.Series]` 4. I removed some examples that look like overkill to me. 5. I also removed some information in the doc that seems duplicated or excessive. ### Why are the changes needed? To document the new redesign of pandas UDFs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests should cover. Closes #27466 from HyukjinKwon/SPARK-30722. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-12 10:49:46 +09:00
Bryan Cutler	07a9885f27	[SPARK-30777][PYTHON][TESTS] Fix test failures for Pandas >= 1.0.0 ### What changes were proposed in this pull request? Fix PySpark test failures for using Pandas >= 1.0.0. ### Why are the changes needed? Pandas 1.0.0 has recently been released and has API changes that result in PySpark test failures, this PR fixes the broken tests. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually tested with Pandas 1.0.1 and PyArrow 0.16.0 Closes #27529 from BryanCutler/pandas-fix-tests-1.0-SPARK-30777. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-11 10:03:01 +09:00
Huaxin Gao	a7ae77a8d8	[SPARK-30662][ML][PYSPARK] Put back the API changes for HasBlockSize in ALS/MLP ### What changes were proposed in this pull request? Add ```HasBlockSize``` in shared Params in both Scala and Python. Make ALS/MLP extend ```HasBlockSize``` ### Why are the changes needed? Add ```HasBlockSize ``` in ALS, so user can specify the blockSize. Make ```HasBlockSize``` a shared param so both ALS and MLP can use it. ### Does this PR introduce any user-facing change? Yes ```ALS.setBlockSize/getBlockSize``` ```ALSModel.setBlockSize/getBlockSize``` ### How was this patch tested? Manually tested. Also added doctest. Closes #27501 from huaxingao/spark_30662. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-02-09 13:14:30 +08:00
zhengruifeng	12e1bbaddb	Revert "[SPARK-30642][SPARK-30659][SPARK-30660][SPARK-30662]" ### What changes were proposed in this pull request? Revert #27360 #27396 #27374 #27389 ### Why are the changes needed? The BLAS changes need more performance tests, especially on sparse datasets. A performance test of LogisticRegression (https://github.com/apache/spark/pull/27374) on a sparse dataset shows that blockifying vectors into matrices and using BLAS causes a performance regression. LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them as well to make sure there is no regression. ### Does this PR introduce any user-facing change? remove newly added param blockSize ### How was this patch tested? reverted testsuites Closes #27487 from zhengruifeng/revert_blockify_ii. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-02-08 08:46:16 +08:00
sharif ahmad	dd2f4431f5	[MINOR][DOCS] Fix typos at python/pyspark/sql/types.py ### What changes were proposed in this pull request? This PR fixes some typos in `python/pyspark/sql/types.py` file. ### Why are the changes needed? To deliver correct wording in documentation and codes. ### Does this PR introduce any user-facing change? Yes, it fixes some typos in user-facing API documentation. ### How was this patch tested? Locally tested the linter. Closes #27475 from sharifahmad2061/master. Lead-authored-by: sharif ahmad <sharifahmad2061@gmail.com> Co-authored-by: Sharif ahmad <sharifahmad2061@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-07 18:42:16 +09:00
HyukjinKwon	692e3ddb4e	[SPARK-27870][PYTHON][FOLLOW-UP] Rename spark.sql.pandas.udf.buffer.size to spark.sql.execution.pandas.udf.buffer.size ### What changes were proposed in this pull request? This PR renames `spark.sql.pandas.udf.buffer.size` to `spark.sql.execution.pandas.udf.buffer.size` to be more consistent with other pandas configuration prefixes, given: - `spark.sql.execution.pandas.arrowSafeTypeConversion` - `spark.sql.execution.pandas.respectSessionTimeZone` - `spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName` - other configurations like `spark.sql.execution.arrow.*`. ### Why are the changes needed? To make configuration names consistent. ### Does this PR introduce any user-facing change? No because this configuration was not released yet. ### How was this patch tested? Existing tests should cover. Closes #27450 from HyukjinKwon/SPARK-27870-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-05 11:38:33 +09:00
Dongjoon Hyun	534f5d409a	[SPARK-29138][PYTHON][TEST] Increase timeout of StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy ### What changes were proposed in this pull request? This PR aims to increase the timeout of `StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy` from 30s (default) to 60s. In this PR, before increasing the timeout, 1. I verified that this is not a JDK11 environmental issue by repeating 3 times first. 2. I reproduced the accuracy failure by reducing the timeout in Jenkins (https://github.com/apache/spark/pull/27424#issuecomment-580981262) Then, the final commit passed the Jenkins. ### Why are the changes needed? This seems to happen when the Jenkins environment is congested and the jobs slow down. The streaming job seems unable to complete the designed `numIteration=25` iterations in 30 seconds. Since the error decreases with each iteration, the failure occurs when too few iterations complete. By reducing the timeout, we can reproduce a similar issue locally. ```python - eventually(condition, catch_assertions=True) + eventually(condition, timeout=10.0, catch_assertions=True) ``` ``` $ python/run-tests --testname 'pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy' --python-executables=python ... 
====================================================================== FAIL: test_parameter_accuracy (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/dongjoon/PRS/SPARK-TEST/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 229, in test_parameter_accuracy eventually(condition, timeout=10.0, catch_assertions=True) File "/Users/dongjoon/PRS/SPARK-TEST/python/pyspark/testing/utils.py", line 86, in eventually raise lastValue Reproduce the error File "/Users/dongjoon/PRS/SPARK-TEST/python/pyspark/testing/utils.py", line 77, in eventually lastValue = condition() File "/Users/dongjoon/PRS/SPARK-TEST/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 226, in condition self.assertAlmostEqual(rel, 0.1, 1) AssertionError: 0.25749106949322637 != 0.1 within 1 places (0.15749106949322636 difference) ---------------------------------------------------------------------- Ran 1 test in 14.814s FAILED (failures=1) ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins (and manual check by reducing the timeout). Since this is a flakiness issue depending on the Jenkins job situation, it's difficult to reproduce there. Closes #27424 from dongjoon-hyun/SPARK-TEST. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-01 15:38:16 +09:00
zhengruifeng	d0c3e9f1f7	[SPARK-30660][ML][PYSPARK] LinearRegression blockify input vectors ### What changes were proposed in this pull request? 1, use blocks instead of vectors for performance improvement 2, use Level-2 BLAS 3, move standardization of input vectors outside of gradient computation ### Why are the changes needed? 1, less RAM to persist training data; (save ~40%) 2, faster than existing impl; (30% ~ 102%) ### Does this PR introduce any user-facing change? add a new expert param `blockSize` ### How was this patch tested? updated testsuites Closes #27396 from zhengruifeng/blockify_lireg. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-31 21:04:26 -06:00
Huaxin Gao	6fac411076	[SPARK-29093][ML][PYSPARK][FOLLOW-UP] Remove duplicate setter ### What changes were proposed in this pull request? remove duplicate setter in ```BucketedRandomProjectionLSH``` ### Why are the changes needed? Remove the duplicate ```setInputCol/setOutputCol``` in ```BucketedRandomProjectionLSH``` because these two setters are already in the superclass ```LSH``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually checked. Closes #27397 from huaxingao/spark-29093. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-30 23:36:39 -08:00
Huaxin Gao	f59685acaa	[SPARK-30662][ML][PYSPARK] ALS/MLP extend HasBlockSize ### What changes were proposed in this pull request? Make ALS/MLP extend ```HasBlockSize``` ### Why are the changes needed? Currently, MLP has its own ```blockSize``` param, we should make MLP extend ```HasBlockSize``` since ```HasBlockSize``` was added in ```sharedParams.scala``` recently. ALS doesn't have ```blockSize``` param now, we can make it extend ```HasBlockSize```, so user can specify the ```blockSize```. ### Does this PR introduce any user-facing change? Yes ```ALS.setBlockSize``` and ```ALS.getBlockSize``` ```ALSModel.setBlockSize``` and ```ALSModel.getBlockSize``` ### How was this patch tested? Manually tested. Also added doctest. Closes #27389 from huaxingao/spark-30662. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-30 13:13:10 -06:00
zhengruifeng	073ce12543	[SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors ### What changes were proposed in this pull request? 1, use blocks instead of vectors 2, use Level-2 BLAS for binary, use Level-3 BLAS for multinomial ### Why are the changes needed? 1, less RAM to persist training data; (save ~40%) 2, faster than existing impl; (40% ~ 92%) ### Does this PR introduce any user-facing change? add a new expert param `blockSize` ### How was this patch tested? updated testsuites Closes #27374 from zhengruifeng/blockify_lor. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-30 10:52:07 -06:00
zhengruifeng	96d27274f5	[SPARK-30642][ML][PYSPARK] LinearSVC blockify input vectors ### What changes were proposed in this pull request? 1, stack input vectors to blocks (like ALS/MLP); 2, add new param `blockSize`; 3, add a new class `InstanceBlock` 4, standardize the input outside of optimization procedure; ### Why are the changes needed? 1, reduce RAM to persist training dataset; (save ~40% in test) 2, use Level-2 BLAS routines; (12% ~ 28% faster, without native BLAS) ### Does this PR introduce any user-facing change? a new param `blockSize` ### How was this patch tested? existing and updated testsuites Closes #27360 from zhengruifeng/blockify_svc. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-01-28 20:55:21 +08:00
Bryan Cutler	43d9c7e7e5	[SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion ### What changes were proposed in this pull request? Prevent unnecessary copies of data during conversion from Arrow to Pandas. ### Why are the changes needed? During conversion of pyarrow data to Pandas, columns are checked for timestamp types and then modified to correct for local timezone. If the data contains no timestamp types, then unnecessary copies of the data can be made. This is most prevalent when checking columns of a pandas DataFrame where each series is assigned back to the DataFrame, regardless if it had timestamps. See https://www.mail-archive.com/devarrow.apache.org/msg17008.html and ARROW-7596 for discussion. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests Closes #27358 from BryanCutler/pyspark-pandas-timestamp-copy-fix-SPARK-30640. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2020-01-26 15:21:06 -08:00
Xiao Li	d69ed9afdf	Revert "[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp" This reverts commit `1d20d13149`. Closes #27351 from gatorsmile/revertSPARK25496. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-25 21:34:12 -08:00
Deepyaman Datta	53fd83a8c5	[MINOR][DOCS] Fix src/dest type documentation for `to_timestamp` ### What changes were proposed in this pull request? Minor documentation fix ### Why are the changes needed? ### Does this PR introduce any user-facing change? ### How was this patch tested? Manually; consider adding tests? Closes #27295 from deepyaman/patch-2. Authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-24 13:10:09 +09:00
zhengruifeng	f35f352096	[SPARK-30543][ML][PYSPARK][R] RandomForest add Param bootstrap to control sampling method ### What changes were proposed in this pull request? add a param `bootstrap` to control whether bootstrap samples are used. ### Why are the changes needed? Current RF with numTrees=1 will directly build a tree using the original dataset, while with numTrees>1 it will use bootstrap samples to build trees. This design is for training a DecisionTreeModel by the impl of RandomForest, however, it is somewhat strange. In Scikit-Learn, there is a param [bootstrap](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) to control whether bootstrap samples are used. ### Does this PR introduce any user-facing change? Yes, new param is added ### How was this patch tested? existing testsuites Closes #27254 from zhengruifeng/add_bootstrap. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-01-23 16:44:13 +08:00
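A rough sketch of what a `bootstrap` switch controls, in plain Python. The helper name `tree_training_samples` and its arguments are hypothetical, and Spark's own sampling is more involved than this; the point is only the difference between resampling with replacement and reusing the original dataset for every tree.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def tree_training_samples(dataset: Sequence[T], num_trees: int,
                          bootstrap: bool, seed: int = 0) -> List[List[T]]:
    """Return one training sample per tree."""
    rng = random.Random(seed)
    n = len(dataset)
    samples = []
    for _ in range(num_trees):
        if bootstrap:
            # Draw n rows with replacement, so some rows repeat and others are left out.
            samples.append([dataset[rng.randrange(n)] for _ in range(n)])
        else:
            # Every tree sees the original dataset unchanged.
            samples.append(list(dataset))
    return samples
```

With an explicit flag, the sampling behaviour no longer depends implicitly on whether `numTrees` is 1 or larger.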
zero323	2330a5682d	[SPARK-30607][SQL][PYSPARK][SPARKR] Add overlay wrappers for SparkR and PySpark ### What changes were proposed in this pull request? This PR adds: - `pyspark.sql.functions.overlay` function to PySpark - `overlay` function to SparkR ### Why are the changes needed? Feature parity. At the moment R and Python users can access this function only using SQL or `expr` / `selectExpr`. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New unit tests. Closes #27325 from zero323/SPARK-30607. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-23 16:16:47 +09:00
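For readers unfamiliar with `overlay`, the string semantics can be sketched in plain Python. This is an illustration of the SQL `OVERLAY` behaviour for a 1-based `pos >= 1`, not the code added by the PR; the function below is a hypothetical local helper, and edge cases such as non-positive positions are not modelled.

```python
def overlay(src: str, replace: str, pos: int, length: int = -1) -> str:
    """Replace `length` characters of `src`, starting at 1-based `pos`, with `replace`.

    A negative `length` means "use the length of `replace`".
    """
    if pos < 1:
        raise ValueError("this sketch only handles pos >= 1")
    if length < 0:
        length = len(replace)
    start = pos - 1  # convert the 1-based SQL position to a 0-based index
    return src[:start] + replace + src[start + length:]
```

For example, `overlay("SPARK_SQL", "CORE", 7)` keeps the first six characters and replaces the rest, while a `length` of 0 turns the call into a pure insertion.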
HyukjinKwon	ab0890bdb1	[SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types ### What changes were proposed in this pull request? This PR proposes to redesign pandas UDFs as described in [the proposal](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing). ```python from pyspark.sql.functions import pandas_udf import pandas as pd @pandas_udf("long") def plug_one(s: pd.Series) -> pd.Series: return s + 1 spark.range(10).select(plug_one("id")).show() ``` ``` +------------+ |plug_one(id)| +------------+ | 1| | 2| | 3| | 4| | 5| | 6| | 7| | 8| | 9| | 10| +------------+ ``` Note that, this PR address one of the future improvements described [here](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit#heading=h.h3ncjpk6ujqu), "A couple of less-intuitive pandas UDF types" (by zero323) together. In short, - Adds new way with type hints as an alternative and experimental way. ```python @pandas_udf(schema='...') def func(c1: Series, c2: Series) -> DataFrame: pass ``` - Replace and/or add an alias for three types below from UDF, and make them as separate standalone APIs. So, `pandas_udf` is now consistent with regular `udf`s and other expressions. `df.mapInPandas(udf)` -replace-> `df.mapInPandas(f, schema)` `df.groupby.apply(udf)` -alias-> `df.groupby.applyInPandas(f, schema)` `df.groupby.cogroup.apply(udf)` -replace-> `df.groupby.cogroup.applyInPandas(f, schema)` `df.groupby.apply` was added from 2.3 while the other were added in the master only. - No deprecation for the existing ways for now. ```python @pandas_udf(schema='...', functionType=PandasUDFType.SCALAR) def func(c1, c2): pass ``` If users are happy with this, I plan to deprecate the existing way and declare using type hints is not experimental anymore. 
One design goal in this PR was to avoid touching the internals (since we didn't deprecate the old ways for now), but to support type hints with minimal changes only at the interface. - Once we deprecate or remove the old ways, I think it requires another refactoring for the internal in the future. At the very least, we should rename internal pandas evaluation types. - If users find this experimental type hints isn't quite helpful, we should simply revert the changes at the interface level. ### Why are the changes needed? In order to address old design issues. Please see [the proposal](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing). ### Does this PR introduce any user-facing change? For behaviour changes, No. It adds new ways to use pandas UDFs by using type hints. See below. SCALAR: ```python @pandas_udf(schema='...') def func(c1: Series, c2: DataFrame) -> Series: pass # DataFrame represents a struct column ``` SCALAR_ITER: ```python @pandas_udf(schema='...') def func(iter: Iterator[Tuple[Series, DataFrame, ...]]) -> Iterator[Series]: pass # Same as SCALAR but wrapped by Iterator ``` GROUPED_AGG: ```python @pandas_udf(schema='...') def func(c1: Series, c2: DataFrame) -> int: pass # DataFrame represents a struct column ``` GROUPED_MAP: This was added in Spark 2.3 as of SPARK-20396. As described above, it keeps the existing behaviour. Additionally, we now have a new alias `groupby.applyInPandas` for `groupby.apply`. See the example below: ```python def func(pdf): return pdf df.groupby("...").applyInPandas(func, schema=df.schema) ``` MAP_ITER: this is not a pandas UDF anymore This was added in Spark 3.0 as of SPARK-28198; and this PR replaces the usages. See the example below: ```python def func(iter): for df in iter: yield df df.mapInPandas(func, df.schema) ``` COGROUPED_MAP*: this is not a pandas UDF anymore This was added in Spark 3.0 as of SPARK-27463; and this PR replaces the usages. 
See the example below: ```python def asof_join(left, right): return pd.merge_asof(left, right, on="...", by="...") df1.groupby("...").cogroup(df2.groupby("...")).applyInPandas(asof_join, schema="...") ``` ### How was this patch tested? Unittests added and tested against Python 2.7, 3.6 and 3.7. Closes #27165 from HyukjinKwon/revisit-pandas. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-22 15:32:58 +09:00
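The SCALAR / SCALAR_ITER / GROUPED_AGG signatures listed in SPARK-28264 differ only in their annotations, so a UDF kind can in principle be read off the type hints. The sketch below shows that idea with the standard `typing` module. It is an assumption-laden illustration, not PySpark's actual inference logic: `Series` and `DataFrame` are empty stand-in classes (so pandas is not required), and `infer_udf_kind` is a hypothetical helper.

```python
import collections.abc
import typing
from typing import Iterator, Tuple


class Series:
    """Stand-in for pandas.Series (assumption: only used as an annotation)."""


class DataFrame:
    """Stand-in for pandas.DataFrame (assumption: only used as an annotation)."""


def infer_udf_kind(func) -> str:
    """Classify a function by its annotations, following the shapes listed in the commit."""
    hints = typing.get_type_hints(func)
    ret = hints.pop("return", None)
    if ret is None or not hints:
        raise ValueError("all parameters and the return value need type hints")
    if typing.get_origin(ret) is collections.abc.Iterator:
        return "SCALAR_ITER"   # Iterator[...] -> Iterator[Series]
    if ret in (Series, DataFrame):
        return "SCALAR"        # Series, ... -> Series or DataFrame
    return "GROUPED_AGG"       # Series, ... -> scalar such as int


def scalar(c1: Series, c2: DataFrame) -> Series: ...
def scalar_iter(it: Iterator[Tuple[Series, DataFrame]]) -> Iterator[Series]: ...
def grouped_agg(c1: Series, c2: DataFrame) -> int: ...
```

Keeping the classification at the interface level like this matches the stated design goal of leaving the internal evaluation types untouched.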
yi.wu	ff39c9271c	[SPARK-30252][SQL] Disallow negative scale of Decimal ### What changes were proposed in this pull request? This PR propose to disallow negative `scale` of `Decimal` in Spark. And this PR brings two behavior changes: 1) for literals like `1.23E4BD` or `1.23E4`(with `spark.sql.legacy.exponentLiteralAsDecimal.enabled`=true, see [SPARK-29956](https://issues.apache.org/jira/browse/SPARK-29956)), we set its `(precision, scale)` to (5, 0) rather than (3, -2); 2) add negative `scale` check inside the decimal method if it exposes to set `scale` explicitly. If check fails, `AnalysisException` throws. And user could still use `spark.sql.legacy.allowNegativeScaleOfDecimal.enabled` to restore the previous behavior. ### Why are the changes needed? According to SQL standard, > 4.4.2 Characteristics of numbers An exact numeric type has a precision P and a scale S. P is a positive integer that determines the number of significant digits in a particular radix R, where R is either 2 or 10. S is a non-negative integer. scale of Decimal should always be non-negative. And other mainstream databases, like Presto, PostgreSQL, also don't allow negative scale. Presto: ``` presto:default> create table t (i decimal(2, -1)); Query 20191213_081238_00017_i448h failed: line 1:30: mismatched input '-'. 
Expecting: <integer>, <type> create table t (i decimal(2, -1)) ``` PostgreSQL: ``` postgres=# create table t(i decimal(2, -1)); ERROR: NUMERIC scale -1 must be between 0 and precision 2 LINE 1: create table t(i decimal(2, -1)); ^ ``` And, actually, Spark itself already doesn't allow creating a table with negative-scale decimal types using SQL: ``` scala> spark.sql("create table t(i decimal(2, -1))"); org.apache.spark.sql.catalyst.parser.ParseException: no viable alternative at input 'create table t(i decimal(2, -'(line 1, pos 28) == SQL == create table t(i decimal(2, -1)) ----------------------------^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 35 elided ``` However, it is still possible to create such a table or `DataFrame` using the Spark SQL programming API: ``` scala> val tb = CatalogTable( TableIdentifier("test", None), CatalogTableType.MANAGED, CatalogStorageFormat.empty, StructType(StructField("i", DecimalType(2, -1) ) :: Nil)) ``` ``` scala> spark.sql("SELECT 1.23E4BD") res2: org.apache.spark.sql.DataFrame = [1.23E+4: decimal(3,-2)] ``` These two different behaviors could confuse users. On the other hand, even if a user creates such a table or `DataFrame` with a negative-scale decimal type, the data can't be written out in formats like `parquet` or `orc`, because these formats have their own check for negative scale and fail on it. 
``` scala> spark.sql("SELECT 1.23E4BD").write.saveAsTable("parquet") 19/12/13 17:37:04 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalArgumentException: Invalid DECIMAL scale: -2 at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:53) at org.apache.parquet.schema.Types$BasePrimitiveBuilder.decimalMetadata(Types.java:495) at org.apache.parquet.schema.Types$BasePrimitiveBuilder.build(Types.java:403) at org.apache.parquet.schema.Types$BasePrimitiveBuilder.build(Types.java:309) at org.apache.parquet.schema.Types$Builder.named(Types.java:290) at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:428) at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:334) at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.$anonfun$convert$2(ParquetSchemaConverter.scala:326) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at org.apache.spark.sql.types.StructType.map(StructType.scala:99) at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convert(ParquetSchemaConverter.scala:326) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:97) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388) at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:150) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:109) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:264) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:441) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:444) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` So, I think it would be better to disallow negative scale totally and make behaviors above be consistent. ### Does this PR introduce any user-facing change? Yes, if `spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=false`, user couldn't create Decimal value with negative scale anymore. ### How was this patch tested? Added new tests in `ExpressionParserSuite` and `DecimalSuite`; Updated `SQLQueryTestSuite`. Closes #26881 from Ngone51/nonnegative-scale. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-21 21:09:48 +08:00
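The `(3, -2)` versus `(5, 0)` distinction in SPARK-30252 can be reproduced with Python's standard `decimal` module. This is only an analogy for how precision and scale relate, not Spark's `Decimal` class; both helper names are hypothetical and the sketch assumes finite values.

```python
from decimal import Decimal
from typing import Tuple


def precision_and_scale(value: Decimal) -> Tuple[int, int]:
    """Precision is the digit count; scale is the negated exponent (finite values only)."""
    parts = value.as_tuple()
    return len(parts.digits), -parts.exponent


def with_non_negative_scale(value: Decimal) -> Decimal:
    """Rewrite a value with a positive exponent so that its scale becomes 0."""
    if value.as_tuple().exponent > 0:
        # Quantizing to exponent 0 spells out the trailing zeros: 1.23E+4 -> 12300.
        return value.quantize(Decimal(1))
    return value


literal = Decimal("1.23E4")
legacy = precision_and_scale(literal)                              # negative scale
adjusted = precision_and_scale(with_non_negative_scale(literal))   # scale 0
```

The numeric value is unchanged by the rewrite; only its representation moves from three significant digits with scale -2 to five digits with scale 0.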
HyukjinKwon	a6bdea3ad4	[SPARK-30539][PYTHON][SQL] Add DataFrame.tail in PySpark ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/26809 added `Dataset.tail` API. It should be good to have it in PySpark API as well. ### Why are the changes needed? To support consistent APIs. ### Does this PR introduce any user-facing change? No. It adds a new API. ### How was this patch tested? Manually tested and doctest was added. Closes #27251 from HyukjinKwon/SPARK-30539. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-18 00:18:12 -08:00
zero323	3228732fd5	[SPARK-30533][ML][PYSPARK] Add classes to represent Java Regressors and RegressionModels ### What changes were proposed in this pull request? This PR adds: - `pyspark.ml.regression.JavaRegressor` - `pyspark.ml.regression.JavaRegressionModel` classes and replaces `JavaPredictor` and `JavaPredictionModel` in - `LinearRegression` / `LinearRegressionModel` - `DecisionTreeRegressor` / `DecisionTreeRegressionModel` (just addition as `JavaPredictionModel` hasn't been used) - `RandomForestRegressor` / `RandomForestRegressionModel` (just addition as `JavaPredictionModel` hasn't been used) - `GBTRegressor` / `GBTRegressionModel` (just addition as `JavaPredictionModel` hasn't been used) - `AFTSurvivalRegression` / `AFTSurvivalRegressionModel` - `GeneralizedLinearRegression` / `GeneralizedLinearRegressionModel` - `FMRegressor` / `FMRegressionModel` ### Why are the changes needed? - Internal PySpark consistency. - Feature parity with Scala. - Intermediate step towards implementing [SPARK-29212](https://issues.apache.org/jira/browse/SPARK-29212) ### Does this PR introduce any user-facing change? It adds new base classes, so it will affect `mro`. Otherwise interfaces should stay intact. ### How was this patch tested? Existing tests. Closes #27241 from zero323/SPARK-30533. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-17 19:34:30 -06:00
HyukjinKwon	1881caa95e	[SPARK-29188][PYTHON][FOLLOW-UP] Explicitly disable Arrow execution for all test of toPandas empty types ### What changes were proposed in this pull request? Another followup of `4398dfa709` I missed two more tests added: ``` ====================================================================== ERROR [0.133s]: test_to_pandas_from_mixed_dataframe (pyspark.sql.tests.test_dataframe.DataFrameTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/jenkins/python/pyspark/sql/tests/test_dataframe.py", line 617, in test_to_pandas_from_mixed_dataframe self.assertTrue(np.all(pdf_with_only_nulls.dtypes == pdf_with_some_nulls.dtypes)) AssertionError: False is not true ====================================================================== ERROR [0.061s]: test_to_pandas_from_null_dataframe (pyspark.sql.tests.test_dataframe.DataFrameTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/jenkins/python/pyspark/sql/tests/test_dataframe.py", line 588, in test_to_pandas_from_null_dataframe self.assertEqual(types[0], np.float64) AssertionError: dtype('O') != <class 'numpy.float64'> ---------------------------------------------------------------------- ``` ### Why are the changes needed? To make the test independent of default values of configuration. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested and Jenkins should test. Closes #27250 from HyukjinKwon/SPARK-29188-followup2. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-17 15:00:18 +09:00
HyukjinKwon	4398dfa709	[SPARK-29188][PYTHON][FOLLOW-UP] Explicitly disable Arrow execution for the test of toPandas empty types ### What changes were proposed in this pull request? This PR proposes to explicitly disable Arrow execution for the test of toPandas empty types. If `spark.sql.execution.arrow.pyspark.enabled` is enabled by default, this test alone fails as below: ``` ====================================================================== ERROR [0.205s]: test_to_pandas_from_empty_dataframe (pyspark.sql.tests.test_dataframe.DataFrameTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/.../pyspark/sql/tests/test_dataframe.py", line 568, in test_to_pandas_from_empty_dataframe self.assertTrue(np.all(dtypes_when_empty_df == dtypes_when_nonempty_df)) AssertionError: False is not true ---------------------------------------------------------------------- ``` it should be best to explicitly disable for the test that only works when it's disabled. ### Why are the changes needed? To make the test independent of default values of configuration. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested and Jenkins should test. Closes #27247 from HyukjinKwon/SPARK-29188-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-16 19:27:30 -08:00
Maxim Gekk	1a9de8c31f	[SPARK-30499][SQL] Remove SQL config spark.sql.execution.pandas.respectSessionTimeZone ### What changes were proposed in this pull request? In the PR, I propose to remove the SQL config `spark.sql.execution.pandas.respectSessionTimeZone` which has been deprecated since Spark 2.3. ### Why are the changes needed? To improve code maintainability. ### Does this PR introduce any user-facing change? Yes. ### How was this patch tested? by running python tests, https://spark.apache.org/docs/latest/building-spark.html#pyspark-tests-with-maven-or-sbt Closes #27218 from MaxGekk/remove-respectSessionTimeZone. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-17 11:44:49 +09:00
Huaxin Gao	92dd7c9d2a	[MINOR][ML] Change DecisionTreeClassifier to FMClassifier in OneVsRest setWeightCol test ### What changes were proposed in this pull request? Change ```DecisionTreeClassifier``` to ```FMClassifier``` in ```OneVsRest``` setWeightCol test ### Why are the changes needed? In ```OneVsRest```, if the classifier doesn't support instance weight, ```OneVsRest``` weightCol will be ignored, so unit test has tested one classifier(```LogisticRegression```) that support instance weight, and one classifier (```DecisionTreeClassifier```) that doesn't support instance weight. Since ```DecisionTreeClassifier``` now supports instance weight, we need to change it to the classifier that doesn't have weight support. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing test Closes #27204 from huaxingao/spark-ovr-minor. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-01-17 10:04:41 +08:00
Huaxin Gao	1ef1d6caf2	[SPARK-29565][FOLLOWUP] add setInputCol/setOutputCol in OHEModel ### What changes were proposed in this pull request? add setInputCol/setOutputCol in OHEModel ### Why are the changes needed? setInputCol/setOutputCol should be in OHEModel too. ### Does this PR introduce any user-facing change? Yes. ```OHEModel.setInputCol``` ```OHEModel.setOutputCol``` ### How was this patch tested? Manually tested. Closes #27228 from huaxingao/spark-29565. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-01-16 19:23:10 +08:00
HyukjinKwon	0a95eb0800	[SPARK-30434][FOLLOW-UP][PYTHON][SQL] Make the parameter list consistent in createDataFrame ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/27109. It should match the parameter lists in `createDataFrame`. ### Why are the changes needed? To pass parameters supposed to pass. ### Does this PR introduce any user-facing change? No (it's only in master) ### How was this patch tested? Manually tested and existing tests should cover. Closes #27225 from HyukjinKwon/SPARK-30434-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-16 12:39:44 +09:00
zero323	990a2be27f	[SPARK-30378][ML][PYSPARK][FOLLOWUP] Remove Param fields provided by _FactorizationMachinesParams ### What changes were proposed in this pull request? Removal of following `Param` fields: - `factorSize` - `fitLinear` - `miniBatchFraction` - `initStd` - `solver` from `FMClassifier` and `FMRegressor` ### Why are the changes needed? This `Param` members are already provided by `_FactorizationMachinesParams` `0f3d744c3f/python/pyspark/ml/regression.py (L2303-L2318)` which is mixed into `FMRegressor`: `0f3d744c3f/python/pyspark/ml/regression.py (L2350)` and `FMClassifier`: `0f3d744c3f/python/pyspark/ml/classification.py (L2793)` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manual testing. Closes #27205 from zero323/SPARK-30378-FOLLOWUP. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-15 08:43:36 -06:00
zero323	525c5695f8	[SPARK-30504][PYTHON][ML] Set weightCol in OneVsRest(Model) _to_java and _from_java ### What changes were proposed in this pull request? This PR adjusts `_to_java` and `_from_java` of `OneVsRest` and `OneVsRestModel` to preserve `weightCol`. ### Why are the changes needed? Currently both `Params` don't preserve `weightCol` `Params` when data is saved / loaded: ```python from pyspark.ml.classification import LogisticRegression, OneVsRest, OneVsRestModel from pyspark.ml.linalg import DenseVector df = spark.createDataFrame([(0, 1, DenseVector([1.0, 0.0])), (0, 1, DenseVector([1.0, 0.0]))], ("label", "w", "features")) ovr = OneVsRest(classifier=LogisticRegression()).setWeightCol("w") ovrm = ovr.fit(df) ovr.getWeightCol() ## 'w' ovrm.getWeightCol() ## 'w' ovr.write().overwrite().save("/tmp/ovr") ovr_ = OneVsRest.load("/tmp/ovr") ovr_.getWeightCol() ## KeyError ## ... ## KeyError: Param(parent='OneVsRest_5145d56b6bd1', name='weightCol', doc='weight column name. ...) ovrm.write().overwrite().save("/tmp/ovrm") ovrm_ = OneVsRestModel.load("/tmp/ovrm") ovrm_ .getWeightCol() ## KeyError ## ... ## KeyError: Param(parent='OneVsRestModel_598c6d900fad', name='weightCol', doc='weight column name ... ``` ### Does this PR introduce any user-facing change? After this PR is merged, loaded objects will have `weightCol` `Param` set. ### How was this patch tested? - Manual testing. - Extension of existing persistence tests. Closes #27190 from zero323/SPARK-30504. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-15 08:42:24 -06:00
zero323	3668291e6b	[SPARK-30452][ML][PYSPARK][FOLLOWUP] Change IsotonicRegressionModel.numFeatures to property ### What changes were proposed in this pull request? Change `IsotonicRegressionModel.numFeatures` from plain method to property. ### Why are the changes needed? Consistency. Right now we use `numFeatures` in two other places in `pyspark.ml` `0f3d744c3f/python/pyspark/ml/feature.py (L4289-L4291)` `0f3d744c3f/python/pyspark/ml/wrapper.py (L437-L439)` and one in `pyspark.mllib` `0f3d744c3f/python/pyspark/mllib/classification.py (L177-L179)` each time as a property. Additionally all similar values in `ml` are exposed as properties, for example `0f3d744c3f/python/pyspark/ml/regression.py (L451-L453)` ### Does this PR introduce any user-facing change? Yes, but current API hasn't been released yet. ### How was this patch tested? Existing doctests. Closes #27206 from zero323/SPARK-30452-FOLLOWUP. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-01-15 12:29:23 +08:00
zhengruifeng	93200115d7	[SPARK-9478][ML][PYSPARK] Add sample weights to Random Forest ### What changes were proposed in this pull request? 1, change `convertToBaggedRDDSamplingWithReplacement` to attach instance weights 2, make RF support weights ### Why are the changes needed? `weightCol` is already exposed, while RF does not support weights. ### Does this PR introduce any user-facing change? Yes, new setters ### How was this patch tested? added testsuites Closes #27097 from zhengruifeng/rf_support_weight. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-14 08:25:51 -06:00
Huaxin Gao	2688faeea5	[SPARK-30498][ML][PYSPARK] Fix some ml parity issues between python and scala ### What changes were proposed in this pull request? There are some parity issues between python and scala ### Why are the changes needed? keep parity between python and scala ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? existing tests Closes #27196 from huaxingao/spark-30498. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-01-14 17:24:17 +08:00
jiake	b389b8c5f0	[SPARK-30188][SQL] Resolve the failed unit tests when enable AQE ### What changes were proposed in this pull request? Fix all the failed tests when enable AQE. ### Why are the changes needed? Run more tests with AQE to catch bugs, and make it easier to enable AQE by default in the future. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing unit tests Closes #26813 from JkSelf/enableAQEDefault. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-13 22:55:19 +08:00
Huaxin Gao	f77dcfc55a	[SPARK-30351][ML][PYSPARK] BisectingKMeans support instance weighting ### What changes were proposed in this pull request? add weight support in BisectingKMeans ### Why are the changes needed? BisectingKMeans should support instance weighting ### Does this PR introduce any user-facing change? Yes. BisectingKMeans.setWeight ### How was this patch tested? Unit test Closes #27035 from huaxingao/spark_30351. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-13 08:24:49 -06:00
Huaxin Gao	d6e28f2922	[SPARK-30377][ML] Make Regressors extend abstract class Regressor ### What changes were proposed in this pull request? Make Regressors extend abstract class Regressor: ```AFTSurvivalRegression extends Estimator => extends Regressor``` ```DecisionTreeRegressor extends Predictor => extends Regressor``` ```FMRegressor extends Predictor => extends Regressor``` ```GBTRegressor extends Predictor => extends Regressor``` ```RandomForestRegressor extends Predictor => extends Regressor``` We will not make ```IsotonicRegression``` extend ```Regressor``` because it is tricky to handle both DoubleType and VectorType. ### Why are the changes needed? Make class hierarchy consistent for all Regressors ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing tests Closes #27168 from huaxingao/spark-30377. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-13 08:22:20 -06:00
zero323	6502c66025	[SPARK-30493][PYTHON][ML] Remove OneVsRestModel setClassifier, setLabelCol and setWeightCol methods ### What changes were proposed in this pull request? Removal of `OneVsRestModel.setClassifier`, `OneVsRestModel.setLabelCol` and `OneVsRestModel.setWeightCol` methods. ### Why are the changes needed? Aforementioned methods shouldn't be included by [SPARK-29093](https://issues.apache.org/jira/browse/SPARK-29093), as they're not present in Scala `OneVsRestModel` and have no practical application. ### Does this PR introduce any user-facing change? Not beyond scope of SPARK-29093. ### How was this patch tested? Existing tests. CC huaxingao zhengruifeng Closes #27181 from zero323/SPARK-30493. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-01-13 19:03:32 +08:00
HyukjinKwon	0823aec463	[SPARK-30480][PYTHON][TESTS] Increases the memory limit being tested in 'WorkerMemoryTest.test_memory_limit' ### What changes were proposed in this pull request? This PR proposes to increase the memory in `WorkerMemoryTest.test_memory_limit` in order to make the test pass with PyPy. The test is currently failed only in PyPy as below in some PRs unexpectedly: ``` Current mem limits: 18446744073709551615 of max 18446744073709551615 Setting mem limits to 1048576 of max 1048576 RPython traceback: File "pypy_module_pypyjit_interp_jit.c", line 289, in portal_5 File "pypy_interpreter_pyopcode.c", line 3468, in handle_bytecode__AccessDirect_None File "pypy_interpreter_pyopcode.c", line 5558, in dispatch_bytecode__AccessDirect_None out of memory: couldn't allocate the next arena ERROR ``` It seems related to how PyPy allocates the memory and GC works PyPy-specifically. There seems nothing wrong in this configuration implementation itself in PySpark side. I roughly tested in higher PyPy versions on Ubuntu (PyPy v7.3.0) and this test seems passing fine so I suspect this might be an issue in old PyPy behaviours. The change only increases the limit so it would not affect actual memory allocations. It just needs to test if the limit is properly set in worker sides. For clarification, the memory is unlimited in the machine if not set. ### Why are the changes needed? To make the tests pass and unblock other PRs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually and Jenkins should test it out. Closes #27186 from HyukjinKwon/SPARK-30480. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-13 18:47:15 +09:00
Bryan Cutler	f372d1cf4f	[SPARK-29748][PYTHON][SQL] Remove Row field sorting in PySpark for version 3.6+ ### What changes were proposed in this pull request? Removing the sorting of PySpark SQL Row fields that were previously sorted by name alphabetically for Python versions 3.6 and above. Field order will now match that as entered. Rows will be used like tuples and are applied to schema by position. For Python versions < 3.6, the order of kwargs is not guaranteed and therefore will be sorted automatically as in previous versions of Spark. ### Why are the changes needed? This caused inconsistent behavior in that local Rows could be applied to a schema by matching names, but once serialized the Row could only be used by position and the fields were possibly in a different order. ### Does this PR introduce any user-facing change? Yes, Row fields are no longer sorted alphabetically but will be in the order entered. For Python < 3.6 `kwargs` can not guarantee the order as entered, so `Row`s will be automatically sorted. An environment variable "PYSPARK_ROW_FIELD_SORTING_ENABLED" can be set that will override construction of `Row` to maintain compatibility with Spark 2.x. ### How was this patch tested? Existing tests are run with PYSPARK_ROW_FIELD_SORTING_ENABLED=true and added new test with unsorted fields for Python 3.6+ Closes #26496 from BryanCutler/pyspark-remove-Row-sorting-SPARK-29748. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2020-01-10 14:37:59 -08:00
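The ordering change in SPARK-29748 can be pictured with a small plain-Python sketch. `make_row` is a hypothetical helper, not `pyspark.sql.Row`; its `sorting_enabled` argument plays the role that the `PYSPARK_ROW_FIELD_SORTING_ENABLED` environment variable plays in the commit message, and the sketch assumes Python 3.6+ where keyword arguments keep the order in which they were entered.

```python
from typing import Any, Tuple


def make_row(sorting_enabled: bool = False, **fields: Any) -> Tuple[Tuple[str, ...], Tuple[Any, ...]]:
    """Return (field names, values) either in entry order or sorted by name."""
    # Iterating a dict yields keys in insertion order on Python 3.6+.
    names = sorted(fields) if sorting_enabled else list(fields)
    return tuple(names), tuple(fields[name] for name in names)


# Entry order is preserved by default (the new behaviour described above).
new_style = make_row(name="Alice", age=11)

# With sorting enabled the fields are reordered alphabetically (Spark 2.x compatibility).
legacy_style = make_row(True, name="Alice", age=11)
```

Because values are applied to a schema by position, the two variants line up with different schemas: `(name, age)` for the first and `(age, name)` for the second.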
HyukjinKwon	d0983af38f	Revert "[SPARK-30480][PYSPARK][TESTS] Fix 'test_memory_limit' on pyspark test" This reverts commit `afd70a0f6f`.	2020-01-10 22:35:54 +09:00
Jungtaek Lim (HeartSaVioR)	afd70a0f6f	[SPARK-30480][PYSPARK][TESTS] Fix 'test_memory_limit' on pyspark test ### What changes were proposed in this pull request? This patch increases the memory limit in the test 'test_memory_limit' from 1m to 8m. Credit to srowen and HyukjinKwon to provide the idea of suspicion and guide how to fix. ### Why are the changes needed? We observed consistent Pyspark test failures on multiple PRs (#26955, #26201, #27064) which block the PR builds whenever the test is included. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Jenkins builds passed in WIP PR (#27159) Closes #27162 from HeartSaVioR/SPARK-30480. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-10 15:30:54 +09:00
Huaxin Gao	c88124a246	[SPARK-30452][ML][PYSPARK] Add predict and numFeatures in Python IsotonicRegressionModel ### What changes were proposed in this pull request? Add ```predict``` and ```numFeatures``` in Python ```IsotonicRegressionModel``` ### Why are the changes needed? ```IsotonicRegressionModel``` doesn't extend ```JavaPredictionModel```, so it doesn't get ```predict``` and ```numFeatures``` from the super class. ### Does this PR introduce any user-facing change? Yes. Python version of ``` IsotonicRegressionModel.predict IsotonicRegressionModel.numFeatures ``` ### How was this patch tested? doctest Closes #27122 from huaxingao/spark-30452. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-09 09:23:10 -06:00
HyukjinKwon	92a0877ee1	[SPARK-30464][PYTHON][DOCS] Explicitly note that we don't add "pandas compatible" aliases ### What changes were proposed in this pull request? This PR adds a note that we're not adding "pandas compatible" aliases anymore. ### Why are the changes needed? We added "pandas compatible" aliases as of https://github.com/apache/spark/pull/5544 and https://github.com/apache/spark/pull/6066 . There are too many differences and I don't think it makes sense to add such aliases anymore at this moment. I was even considering deprecating them but decided to take a more conservative approach by just documenting it. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests should cover. Closes #27142 from HyukjinKwon/SPARK-30464. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-09 11:42:52 +09:00
HyukjinKwon	ee8d661058	[SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package ### What changes were proposed in this pull request? This PR proposes to move pandas related functionalities into pandas package. Namely: ```bash pyspark/sql/pandas ├── __init__.py ├── conversion.py # Conversion between pandas <> PySpark DataFrames ├── functions.py # pandas_udf ├── group_ops.py # Grouped UDF / Cogrouped UDF + groupby.apply, groupby.cogroup.apply ├── map_ops.py # Map Iter UDF + mapInPandas ├── serializers.py # pandas <> PyArrow serializers ├── types.py # Type utils between pandas <> PyArrow └── utils.py # Version requirement checks ``` In order to separately locate `groupby.apply`, `groupby.cogroup.apply`, `mapInPandas`, `toPandas`, and `createDataFrame(pdf)` under `pandas` sub-package, I had to use a mix-in approach which Scala side uses often by `trait`, and also pandas itself uses this approach (see `IndexOpsMixin` as an example) to group related functionalities. Currently, you can think it's like Scala's self typed trait. See the structure below: ```python class PandasMapOpsMixin(object): def mapInPandas(self, ...): ... return ... # other Pandas <> PySpark APIs ``` ```python class DataFrame(PandasMapOpsMixin): # other DataFrame APIs equivalent to Scala side. ``` Yes, This is a big PR but they are mostly just moving around except one case `createDataFrame` which I had to split the methods. ### Why are the changes needed? There are pandas functionalities here and there and I myself gets lost where it was. Also, when you have to make a change commonly for all of pandas related features, it's almost impossible now. Also, after this change, `DataFrame` and `SparkSession` become more consistent with Scala side since pandas is specific to Python, and this change separates pandas-specific APIs away from `DataFrame` or `SparkSession`. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? 
Existing tests should cover. Also, I manually built the PySpark API documentation and checked. Closes #27109 from HyukjinKwon/pandas-refactoring. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-09 10:22:50 +09:00
HyukjinKwon	866b7df348	[SPARK-30335][SQL][DOCS] Add a note first, last, collect_list and collect_set can be non-deterministic in SQL function docs as well ### What changes were proposed in this pull request? This PR adds a note that first and last can be non-deterministic in SQL function docs as well. This is already documented in `functions.scala`. ### Why are the changes needed? Some people read only the SQL docs. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Jenkins will test. Closes #27099 from HyukjinKwon/SPARK-30335. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-07 14:31:59 +09:00
HyukjinKwon	3ba175ef9a	[SPARK-30430][PYTHON][DOCS] Add a note that UserDefinedFunction's constructor is private ### What changes were proposed in this pull request? This PR adds a note that UserDefinedFunction's constructor is private. ### Why are the changes needed? To match with Scala side. Scala side does not have it at all. ### Does this PR introduce any user-facing change? Doc only changes but it declares UserDefinedFunction's constructor is private explicitly. ### How was this patch tested? Jenkins Closes #27101 from HyukjinKwon/SPARK-30430. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-07 10:13:40 +09:00
WeichenXu	88542bc3d9	[SPARK-30154][ML] PySpark UDF to convert MLlib vectors to dense arrays ### What changes were proposed in this pull request? PySpark UDF to convert MLlib vectors to dense arrays. Example: ``` from pyspark.ml.functions import vector_to_array df.select(vector_to_array(col("features"))) ``` ### Why are the changes needed? If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient approach is to do that in the JVM. However, it requires the PySpark user to write Scala code and register it as a UDF. Often this is infeasible for a pure Python project. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? UT. Closes #26910 from WeichenXu123/vector_to_array. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2020-01-06 16:18:51 -08:00
Huaxin Gao	d32ed25f0d	[SPARK-30144][ML][PYSPARK] Make MultilayerPerceptronClassificationModel extend MultilayerPerceptronParams ### What changes were proposed in this pull request? Make ```MultilayerPerceptronClassificationModel``` extend ```MultilayerPerceptronParams``` ### Why are the changes needed? Make ```MultilayerPerceptronClassificationModel``` extend ```MultilayerPerceptronParams``` to expose the training params, so user can see these params when calling ```extractParamMap``` ### Does this PR introduce any user-facing change? Yes. The ```MultilayerPerceptronParams``` such as ```seed```, ```maxIter``` ... are available in ```MultilayerPerceptronClassificationModel``` now ### How was this patch tested? Manually tested ```MultilayerPerceptronClassificationModel.extractParamMap()``` to verify all the new params are there. Closes #26838 from huaxingao/spark-30144. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-03 12:01:11 -06:00
Huaxin Gao	6196c20ee0	[SPARK-30358][ML][PYSPARK][FOLLOWUP] ML expose predictRaw and predictProbability on Python side ### What changes were proposed in this pull request? expose predictRaw and predictProbability on Python side ### Why are the changes needed? to keep parity between scala and python ### Does this PR introduce any user-facing change? Yes. Expose python ```predictRaw``` and ```predictProbability``` ### How was this patch tested? doctest Closes #27082 from huaxingao/spark-30358. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-01-03 11:42:56 -06:00
Huaxin Gao	9ee8da298d	[SPARK-30378][ML][PYSPARK] Add getter/setter in Python FM ### What changes were proposed in this pull request? add getter/setter in Python FM ### Why are the changes needed? to be consistent with other algorithms ### Does this PR introduce any user-facing change? Yes. add getter/setter in Python FMRegressor/FMRegressionModel/FMClassifier/FMClassificationModel ### How was this patch tested? doctest Closes #27044 from huaxingao/spark-30378. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-12-31 12:56:19 +08:00
Gengliang Wang	07593d362f	[SPARK-27506][SQL][FOLLOWUP] Use option `avroSchema` to specify an evolved schema in `from_avro` ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/26780 In https://github.com/apache/spark/pull/26780, a new Avro data source option `actualSchema` is introduced for setting the original Avro schema in function `from_avro`, while the expected schema is supposed to be set in the parameter `jsonFormatSchema` of `from_avro`. However, there is another Avro data source option `avroSchema`. It is used for setting the expected schema in reading and writing. This PR uses the `avroSchema` option for reading Avro data with an evolved schema and removes the new one, `actualSchema`. ### Why are the changes needed? Unify and simplify the Avro data source options. ### Does this PR introduce any user-facing change? Yes. To deserialize Avro data with an evolved schema, before changes: ``` from_avro('col, expectedSchema, ("actualSchema" -> actualSchema)) ``` After changes: ``` from_avro('col, actualSchema, ("avroSchema" -> expectedSchema)) ``` The second parameter is always the actual Avro schema after changes. ### How was this patch tested? Update the existing tests in https://github.com/apache/spark/pull/26780 Closes #27045 from gengliangwang/renameAvroOption. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-30 18:14:21 +09:00
zhengruifeng	9c046dc808	[SPARK-30102][ML][PYSPARK] GMM supports instance weighting ### What changes were proposed in this pull request? supports instance weighting in GMM ### Why are the changes needed? ML should support instance weighting ### Does this PR introduce any user-facing change? yes, a new param `weightCol` is exposed ### How was this patch tested? added test suites Closes #26735 from zhengruifeng/gmm_support_weight. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-12-27 13:32:57 +08:00
Huaxin Gao	a3cf9c564e	[SPARK-30247][PYSPARK][FOLLOWUP] Add Python class MultivariateGaussian ### What changes were proposed in this pull request? add a corresponding class MultivariateGaussian containing a vector and a matrix on the py side, so gaussian can be used on the py side. ### Does this PR introduce any user-facing change? add Python class ```MultivariateGaussian``` ### How was this patch tested? doctest Closes #27020 from huaxingao/spark-30247. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-12-27 13:30:18 +08:00
zhanjf	8d3eed33ee	[SPARK-29224][ML] Implement Factorization Machines as a ml-pipeline component ### What changes were proposed in this pull request? Implement Factorization Machines as a ml-pipeline component 1. loss function supports: logloss, mse 2. optimizer: GD, adamW ### Why are the changes needed? Factorization Machines is widely used in advertising and recommendation system to estimate CTR(click-through rate). Advertising and recommendation system usually has a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are common ml model to estimate CTR. References: 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010. https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf ### Does this PR introduce any user-facing change? No ### How was this patch tested? run unit tests Closes #27000 from mob-ai/ml/fm. Authored-by: zhanjf <zhanjf@mob.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-26 11:39:53 -06:00
zhengruifeng	8f07839e74	[SPARK-30178][ML] RobustScaler support large numFeatures ### What changes were proposed in this pull request? compute the medians/ranges in a more distributed way ### Why are the changes needed? It is a bottleneck to collect the whole Array[QuantileSummaries] from executors, since a QuantileSummaries is a large object, which maintains arrays of large sizes 10k(`defaultCompressThreshold`)/50k(`defaultHeadSize`). In Spark-Shell with default params, I processed a dataset with numFeatures=69,200, and the existing implementation failed due to OOM. After this PR, it will successfully fit the model. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing test suites Closes #26803 from zhengruifeng/robust_high_dim. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-12-25 09:44:19 +08:00
Wenchen Fan	ba3f6330dd	Revert "[SPARK-29224][ML] Implement Factorization Machines as a ml-pipeline component" This reverts commit `c6ab7165dd`.	2019-12-24 14:01:27 +08:00
zhanjf	c6ab7165dd	[SPARK-29224][ML] Implement Factorization Machines as a ml-pipeline component ### What changes were proposed in this pull request? Implement Factorization Machines as a ml-pipeline component 1. loss function supports: logloss, mse 2. optimizer: GD, adamW ### Why are the changes needed? Factorization Machines is widely used in advertising and recommendation system to estimate CTR(click-through rate). Advertising and recommendation system usually has a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are common ml model to estimate CTR. References: 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010. https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf ### Does this PR introduce any user-facing change? No ### How was this patch tested? run unit tests Closes #26124 from mob-ai/ml/fm. Authored-by: zhanjf <zhanjf@mob.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-23 10:11:09 -06:00
HyukjinKwon	e5abbab0ed	[SPARK-30128][DOCS][PYTHON][SQL] Document/promote 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC ### What changes were proposed in this pull request? This PR adds and exposes the options, 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC, into documentation. - `recursiveFileLookup` at file sources: https://github.com/apache/spark/pull/24830 ([SPARK-27627](https://issues.apache.org/jira/browse/SPARK-27627)) - `pathGlobFilter` at file sources: https://github.com/apache/spark/pull/24518 ([SPARK-27990](https://issues.apache.org/jira/browse/SPARK-27990)) - `mergeSchema` at ORC: https://github.com/apache/spark/pull/24043 ([SPARK-11412](https://issues.apache.org/jira/browse/SPARK-11412)) Note that `timeZone` option was not moved from `DataFrameReader.options` as I assume it will likely affect other datasources as well once DSv2 is complete. ### Why are the changes needed? To document available options in sources properly. ### Does this PR introduce any user-facing change? In PySpark, `pathGlobFilter` can be set via `DataFrameReader.(text\|orc\|parquet\|json\|csv)` and `DataStreamReader.(text\|orc\|parquet\|json\|csv)`. ### How was this patch tested? Manually built the doc and checked the output. Option setting in PySpark is rather a logical change. I manually tested one only: ```bash $ ls -al tmp ... -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 aa -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ab -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ac -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 cc ``` ```python >>> spark.read.text("tmp", pathGlobFilter="*c").show() ``` ``` +-----+ \|value\| +-----+ \| ac\| \| cc\| +-----+ ``` Closes #26958 from HyukjinKwon/doc-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-23 09:57:42 +09:00
Yuming Wang	696288f623	[INFRA] Reverts commit `56dcd79` and `c216ef1` ### What changes were proposed in this pull request? 1. Revert "Preparing development version 3.0.1-SNAPSHOT": `56dcd79` 2. Revert "Preparing Spark release v3.0.0-preview2-rc2": `c216ef1` ### Why are the changes needed? Shouldn't change master. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? manual test: https://github.com/apache/spark/compare/5de5e46..wangyum:revert-master Closes #26915 from wangyum/revert-master. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-12-16 19:57:44 -07:00
Yuming Wang	56dcd79992	Preparing development version 3.0.1-SNAPSHOT	2019-12-17 01:57:27 +00:00
Yuming Wang	c216ef1d03	Preparing Spark release v3.0.0-preview2-rc2	2019-12-17 01:57:21 +00:00
Huaxin Gao	5ed72a1940	[SPARK-30247][PYSPARK] GaussianMixtureModel in py side should expose gaussian ### What changes were proposed in this pull request? expose gaussian in PySpark ### Why are the changes needed? A ```GaussianMixtureModel``` contains two parts of coefficients: ```weights``` & ```gaussians```. However, ```gaussians``` is not exposed on Python side. ### Does this PR introduce any user-facing change? Yes. ```GaussianMixtureModel.gaussians``` is exposed in PySpark. ### How was this patch tested? add doctest Closes #26882 from huaxingao/spark-30247. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-16 18:15:40 -06:00
Boris Boutkov	3bf5498b4a	[MINOR][DOCS] Fix documentation for slide function ### What changes were proposed in this pull request? This PR proposes to fix documentation for slide function. Fixed the spacing issue and added some parameter related info. ### Why are the changes needed? Documentation improvement ### Does this PR introduce any user-facing change? No (doc-only change). ### How was this patch tested? Manually tested by documentation build. Closes #26896 from bboutkov/pyspark_doc_fix. Authored-by: Boris Boutkov <boris.boutkov@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-16 16:29:09 +09:00
HyukjinKwon	0a2afcec7d	[SPARK-30200][SQL][FOLLOW-UP] Expose only explain(mode: String) in Scala side, and clean up related codes ### What changes were proposed in this pull request? This PR mainly targets: 1. Expose only explain(mode: String) in Scala side 2. Clean up related codes - Hide `ExplainMode` under private `execution` package. No particular reason but just because `ExplainUtils` exists there - Use `case object` + `trait` pattern in `ExplainMode` to look after `ParseMode`. - Move `Dataset.toExplainString` to `QueryExecution.explainString` to look after `QueryExecution.simpleString`, and deduplicate the codes at `ExplainCommand`. - Use `ExplainMode` in `ExplainCommand` too. - Add `explainString` to `PythonSQLUtils` to avoid unexpected test failure of PySpark during refactoring Scala codes side. ### Why are the changes needed? To minimize exposed APIs, deduplicate, and clean up. ### Does this PR introduce any user-facing change? `Dataset.explain(mode: ExplainMode)` will be removed (which only exists in master). ### How was this patch tested? Manually tested and existing tests should cover. Closes #26898 from HyukjinKwon/SPARK-30200-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-16 14:42:35 +09:00
Takeshi Yamamuro	f483a13d4a	[SPARK-30231][SQL][PYTHON][FOLLOWUP] Make error messages clear in PySpark df.explain ### What changes were proposed in this pull request? This pr is a followup of #26861 to address minor comments from viirya. ### Why are the changes needed? For better error messages. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested. Closes #26886 from maropu/SPARK-30231-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-14 14:26:50 -08:00
Takeshi Yamamuro	64c7b94d64	[SPARK-30231][SQL][PYTHON] Support explain mode in PySpark df.explain ### What changes were proposed in this pull request? This pr intends to support explain modes implemented in #26829 for PySpark. ### Why are the changes needed? For better debugging info. in PySpark dataframes. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UTs. Closes #26861 from maropu/ExplainModeInPython. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-13 17:44:23 +09:00
David	8e9bfea107	[SPARK-29188][PYTHON] toPandas (without Arrow) gets wrong dtypes when applied on empty DF ### What changes were proposed in this pull request? An empty Spark DataFrame converted to a Pandas DataFrame wouldn't have the right column types. Several type mappings were missing. ### Why are the changes needed? Empty Spark DataFrames can be used to write unit tests, and verified by converting them to Pandas first. But this can fail when the column types are wrong. ### Does this PR introduce any user-facing change? Yes; the error reported in the JIRA issue should not happen anymore. ### How was this patch tested? Through unit tests in `pyspark.sql.tests.test_dataframe.DataFrameTests#test_to_pandas_from_empty_dataframe` Closes #26747 from dlindelof/SPARK-29188. Authored-by: David <dlindelof@expediagroup.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-12 20:49:10 +09:00
Fokko Driesprong	99ea324b6f	[SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas Follow up of https://github.com/apache/spark/pull/24405 ### What changes were proposed in this pull request? The current implementation of _from_avro_ and _AvroDataToCatalyst_ doesn't allow doing schema evolution since it requires the deserialization of an Avro record with the exact same schema with which it was serialized. The proposed change is to add a new option `actualSchema` to allow passing the schema used to serialize the records. This allows using a different compatible schema for reading by passing both schemas to _GenericDatumReader_. If no writer's schema is provided, nothing changes from before. ### Why are the changes needed? Consider the following example. ``` // schema ID: 1 val schema1 = """ { "type": "record", "name": "MySchema", "fields": [ {"name": "col1", "type": "int"}, {"name": "col2", "type": "string"} ] } """ // schema ID: 2 val schema2 = """ { "type": "record", "name": "MySchema", "fields": [ {"name": "col1", "type": "int"}, {"name": "col2", "type": "string"}, {"name": "col3", "type": "string", "default": ""} ] } """ ``` The two schemas are compatible - i.e. you can use `schema2` to deserialize events serialized with `schema1`, in which case there will be the field `col3` with the default value. Now imagine that you have two dataframes (read from batch or streaming), one with Avro events from schema1 and the other with events from schema2. We want to combine them into one dataframe for storing or further processing. With the current `from_avro` function we can only decode each of them with the corresponding schema: ``` scalaval df1 = ... // Avro events created with schema1 df1: org.apache.spark.sql.DataFrame = [eventBytes: binary] scalaval decodedDf1 = df1.select(from_avro('eventBytes, schema1) as "decoded") decodedDf1: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string>] scalaval df2= ... 
// Avro events created with schema2 df2: org.apache.spark.sql.DataFrame = [eventBytes: binary] scalaval decodedDf2 = df2.select(from_avro('eventBytes, schema2) as "decoded") decodedDf2: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string, col3: string>] ``` but then `decodedDf1` and `decodedDf2` have different Spark schemas and we can't union them. Instead, with the proposed change we can decode `df1` in the following way: ``` scalaimport scala.collection.JavaConverters._ scalaval decodedDf1 = df1.select(from_avro(data = 'eventBytes, jsonFormatSchema = schema2, options = Map("actualSchema" -> schema1).asJava) as "decoded") decodedDf1: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string, col3: string>] ``` so that both dataframes have the same schemas and can be merged. ### Does this PR introduce any user-facing change? This PR allows users to pass a new configuration but it doesn't affect current code. ### How was this patch tested? A new unit test was added. Closes #26780 from Fokko/SPARK-27506. Lead-authored-by: Fokko Driesprong <fokko@apache.org> Co-authored-by: Gianluca Amori <gianluca.amori@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2019-12-11 01:26:29 -08:00
Karthikeyan Singaravelan	aec1d95f3b	[SPARK-30205][PYSPARK] Import ABCs from collections.abc to remove deprecation warnings ### What changes were proposed in this pull request? This PR aims to remove deprecation warnings by importing ABCs from `collections.abc` instead of `collections`. - https://github.com/python/cpython/pull/10596 ### Why are the changes needed? This will remove deprecation warnings in Python 3.7 and 3.8. ``` $ python -V Python 3.7.5 $ python python/pyspark/resultiterable.py python/pyspark/resultiterable.py:23: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working class ResultIterable(collections.Iterable): ``` ### Does this PR introduce any user-facing change? No, this doesn't introduce user-facing change ### How was this patch tested? Manually because this is about deprecation warning messages. Closes #26835 from tirkarthi/spark-30205-fix-abc-warnings. Authored-by: Karthikeyan Singaravelan <tir.karthi@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-12-10 11:08:13 -08:00
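The import change described in the commit above can be sketched with the standard library alone. `SimpleResultIterable` below is a hypothetical, simplified stand-in used for illustration, not PySpark's actual `ResultIterable` source.

```python
# Import the ABC from collections.abc rather than from collections;
# the latter form triggers the DeprecationWarning quoted in the commit message.
from collections.abc import Iterable


class SimpleResultIterable(Iterable):
    """Hypothetical stand-in: wraps a list and exposes it as an iterable."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return iter(self.data)

    def __len__(self):
        return len(self.data)


result = SimpleResultIterable([1, 2, 3])
```

Subclassing `collections.abc.Iterable` only requires `__iter__`; `__len__` is included here as a convenience.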
Huaxin Gao	1cac9b2cc6	[SPARK-29967][ML][PYTHON] KMeans support instance weighting ### What changes were proposed in this pull request? add weight support in KMeans ### Why are the changes needed? KMeans should support weighting ### Does this PR introduce any user-facing change? Yes. ```KMeans.setWeightCol``` ### How was this patch tested? Unit Tests Closes #26739 from huaxingao/spark-29967. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-10 09:33:06 -06:00
Huaxin Gao	8a9cccf1f3	[SPARK-30146][ML][PYSPARK] Add setWeightCol to GBTs in PySpark ### What changes were proposed in this pull request? add ```setWeightCol``` and ```setMinWeightFractionPerNode``` in Python side of ```GBTClassifier``` and ```GBTRegressor``` ### Why are the changes needed? https://github.com/apache/spark/pull/25926 added ```setWeightCol``` and ```setMinWeightFractionPerNode``` in GBTs on scala side. This PR will add ```setWeightCol``` and ```setMinWeightFractionPerNode``` in GBTs on python side ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? doc test Closes #26774 from huaxingao/spark-30146. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2019-12-09 13:39:33 -06:00
Nicholas Chammas	c8922d9145	[SPARK-30113][SQL][PYTHON] Expose mergeSchema option in PySpark's ORC APIs ### What changes were proposed in this pull request? This PR is a follow-up to #24043 and cousin of #26730. It exposes the `mergeSchema` option directly in the ORC APIs. ### Why are the changes needed? So the Python API matches the Scala API. ### Does this PR introduce any user-facing change? Yes, it adds a new option directly in the ORC reader method signatures. ### How was this patch tested? I tested this manually as follows: ``` >>> spark.range(3).write.orc('test-orc') >>> spark.range(3).withColumnRenamed('id', 'name').write.orc('test-orc/nested') >>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=True) DataFrame[id: bigint, name: bigint] >>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False) DataFrame[id: bigint] >>> spark.conf.set('spark.sql.orc.mergeSchema', True) >>> spark.read.orc('test-orc', recursiveFileLookup=True) DataFrame[id: bigint, name: bigint] >>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False) DataFrame[id: bigint] ``` Closes #26755 from nchammas/SPARK-30113-ORC-mergeSchema. Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-04 11:44:24 +09:00
Nicholas Chammas	e766a323bc	[SPARK-30091][SQL][PYTHON] Document mergeSchema option directly in the PySpark Parquet APIs ### What changes were proposed in this pull request? This change properly documents the `mergeSchema` option directly in the Python APIs for reading Parquet data. ### Why are the changes needed? The docstring for `DataFrameReader.parquet()` mentions `mergeSchema` but doesn't show it in the API. It seems like a simple oversight. Before this PR, you'd have to do this to use `mergeSchema`: ```python spark.read.option('mergeSchema', True).parquet('test-parquet').show() ``` After this PR, you can use the option as (I believe) it was intended to be used: ```python spark.read.parquet('test-parquet', mergeSchema=True).show() ``` ### Does this PR introduce any user-facing change? Yes, this PR changes the signatures of `DataFrameReader.parquet()` and `DataStreamReader.parquet()` to match their docstrings. ### How was this patch tested? Testing the `mergeSchema` option directly seems to be left to the Scala side of the codebase. I tested my change manually to confirm the API works. I also confirmed that setting `spark.sql.parquet.mergeSchema` at the session does not get overridden by leaving `mergeSchema` at its default when calling `parquet()`: ``` >>> spark.conf.set('spark.sql.parquet.mergeSchema', True) >>> spark.range(3).write.parquet('test-parquet/id') >>> spark.range(3).withColumnRenamed('id', 'name').write.parquet('test-parquet/name') >>> spark.read.option('recursiveFileLookup', True).parquet('test-parquet').show() +----+----+ \| id\|name\| +----+----+ \|null\| 1\| \|null\| 2\| \|null\| 0\| \| 1\|null\| \| 2\|null\| \| 0\|null\| +----+----+ >>> spark.read.option('recursiveFileLookup', True).parquet('test-parquet', mergeSchema=False).show() +----+ \| id\| +----+ \|null\| \|null\| \|null\| \| 1\| \| 2\| \| 0\| +----+ ``` Closes #26730 from nchammas/parquet-merge-schema. 
Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-04 11:31:57 +09:00
Nicholas Chammas	3dd3a623f2	[SPARK-27990][SPARK-29903][PYTHON] Add recursiveFileLookup option to Python DataFrameReader ### What changes were proposed in this pull request? As a follow-up to #24830, this PR adds the `recursiveFileLookup` option to the Python DataFrameReader API. ### Why are the changes needed? This PR maintains Python feature parity with Scala. ### Does this PR introduce any user-facing change? Yes. Before this PR, you'd only be able to use this option as follows: ```python spark.read.option("recursiveFileLookup", True).text("test-data").show() ``` With this PR, you can reference the option from within the format-specific method: ```python spark.read.text("test-data", recursiveFileLookup=True).show() ``` This option now also shows up in the Python API docs. ### How was this patch tested? I tested this manually by creating the following directories with dummy data: ``` test-data ├── 1.txt └── nested └── 2.txt test-parquet ├── nested │ ├── _SUCCESS │ ├── part-00000-...-.parquet ├── _SUCCESS ├── part-00000-...-.parquet ``` I then ran the following tests and confirmed the output looked good: ```python spark.read.parquet("test-parquet", recursiveFileLookup=True).show() spark.read.text("test-data", recursiveFileLookup=True).show() spark.read.csv("test-data", recursiveFileLookup=True).show() ``` `python/pyspark/sql/tests/test_readwriter.py` seems pretty sparse. I'm happy to add my tests there, though it seems we have been deferring testing like this to the Scala side of things. Closes #26718 from nchammas/SPARK-27990-recursiveFileLookup-python. Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-04 10:10:30 +09:00
zhengruifeng	4021354b73	[SPARK-30044][ML] MNB/CNB/BNB use empty sigma matrix instead of null ### What changes were proposed in this pull request? MNB/CNB/BNB use an empty sigma matrix instead of null. ### Why are the changes needed? 1. Using an empty sigma matrix simplifies the implementation. 2. I am reviewing the FM implementation these days, and FMModels have optional bias and linear parts. It seems more reasonable to set an optional part to an empty vector/matrix or a zero value than to `null`. ### Does this PR introduce any user-facing change? Yes, sigma changes from `null` to an empty matrix. ### How was this patch tested? Updated test suites. Closes #26679 from zhengruifeng/nb_use_empty_sigma. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-12-03 10:02:23 +08:00
zhengruifeng	03ac1b799c	[SPARK-29959][ML][PYSPARK] Summarizer support more metrics ### What changes were proposed in this pull request? Summarizer supports more metrics: sum, std. ### Why are the changes needed? These metrics are widely used, and obtaining them directly is more convenient than deriving them from other metrics. In `NaiveBayes`, we want the sum of vectors, but mean and weightSum need to be computed and then multiplied. In `StandardScaler`, `AFTSurvivalRegression`, `LinearRegression`, `LinearSVC`, and `LogisticRegression`, we need to obtain `variance` and then take its square root to get std. ### Does this PR introduce any user-facing change? Yes, new metrics are exposed to end users. ### How was this patch tested? Added test suites. Closes #26596 from zhengruifeng/summarizer_add_metrics. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-12-02 14:44:31 +08:00
zhengruifeng	0f40d2a6ee	[SPARK-29960][ML][PYSPARK] MulticlassClassificationEvaluator support hammingLoss ### What changes were proposed in this pull request? MulticlassClassificationEvaluator supports hammingLoss. ### Why are the changes needed? 1. hammingLoss is easy to compute from the confusion matrix. 2. scikit-learn supports it. ### Does this PR introduce any user-facing change? Yes. ### How was this patch tested? Added test suites. Closes #26597 from zhengruifeng/multi_class_hamming_loss. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-11-21 18:32:28 +08:00
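For the SPARK-29960 entry above, here is a minimal sketch of how a Hamming loss can be read off a confusion matrix in the single-label multiclass case, where it equals the fraction of misclassified samples. The function name `hamming_loss_from_confusion` and the list-of-lists matrix layout are assumptions made for illustration; this is not the code from the commit.

```python
def hamming_loss_from_confusion(confusion):
    """Return the fraction of misclassified samples.

    `confusion` is a square list of lists (hypothetical layout) where
    confusion[i][j] counts samples with true label i predicted as label j.
    """
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix contains no samples")
    # Correct predictions sit on the diagonal.
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total


# Two classes, six samples, one of which is misclassified.
loss = hamming_loss_from_confusion([[2, 1], [0, 3]])
```

In this single-label setting the value is simply one minus the accuracy, which is why it comes almost for free once the confusion matrix is available.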
zhengruifeng	297cbab98e	[SPARK-29942][ML] Impl Complement Naive Bayes Classifier ### What changes were proposed in this pull request? Implement the Complement Naive Bayes Classifier as a `modelType` option in `NaiveBayes`. ### Why are the changes needed? 1. It is a better choice for text classification: [scikit-learn](https://scikit-learn.org/stable/modules/naive_bayes.html#complement-naive-bayes) states that 'CNB regularly outperforms MNB (often by a considerable margin) on text classification tasks.' 2. CNB is highly similar to the existing MNB; only a small part of the existing MNB needs to be changed, so supporting CNB is an easy win. ### Does this PR introduce any user-facing change? Yes, a new `modelType` is supported. ### How was this patch tested? Added test suites. Closes #26575 from zhengruifeng/cnb. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-11-21 18:22:05 +08:00
HyukjinKwon	74cb1ffd68	[SPARK-22340][PYTHON][FOLLOW-UP] Add a better message and improve documentation for pinned thread mode ### What changes were proposed in this pull request? This PR proposes to show different warning message when the pinned thread mode is enabled: When enabled: > PYSPARK_PIN_THREAD feature is enabled. However, note that it cannot inherit the local properties from the parent thread although it isolates each thread on PVM and JVM with its own local properties. > To work around this, you should manually copy and set the local properties from the parent thread to the child thread when you create another thread. When disabled: > Currently, 'setLocalProperty' (set to local properties) with multiple threads does not properly work. > Internally threads on PVM and JVM are not synced, and JVM thread can be reused for multiple threads on PVM, which fails to isolate local properties for each thread on PVM. > To work around this, you can set PYSPARK_PIN_THREAD to true (see SPARK-22340). However, note that it cannot inherit the local properties from the parent thread although it isolates each thread on PVM and JVM with its own local properties. > To work around this, you should manually copy and set the local properties from the parent thread to the child thread when you create another thread. ### Why are the changes needed? Currently, it shows the same warning message regardless of PYSPARK_PIN_THREAD being set. In the warning message it says "you can set PYSPARK_PIN_THREAD to true ..." which is confusing. ### Does this PR introduce any user-facing change? Documentation and warning message as shown above. ### How was this patch tested? Manually tested. ```bash $ PYSPARK_PIN_THREAD=true ./bin/pyspark ``` ```python sc.setJobGroup("a", "b") ``` ``` .../pyspark/util.py:141: UserWarning: PYSPARK_PIN_THREAD feature is enabled. 
However, note that it cannot inherit the local properties from the parent thread although it isolates each thread on PVM and JVM with its own local properties. To work around this, you should manually copy and set the local properties from the parent thread to the child thread when you create another thread. warnings.warn(msg, UserWarning) ``` ```bash $ ./bin/pyspark ``` ```python sc.setJobGroup("a", "b") ``` ``` .../pyspark/util.py:141: UserWarning: Currently, 'setJobGroup' (set to local properties) with multiple threads does not properly work. Internally threads on PVM and JVM are not synced, and JVM thread can be reused for multiple threads on PVM, which fails to isolate local properties for each thread on PVM. To work around this, you can set PYSPARK_PIN_THREAD to true (see SPARK-22340). However, note that it cannot inherit the local properties from the parent thread although it isolates each thread on PVM and JVM with its own local properties. To work around this, you should manually copy and set the local properties from the parent thread to the child thread when you create another thread. warnings.warn(msg, UserWarning) ``` Closes #26588 from HyukjinKwon/SPARK-22340. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-21 10:54:01 +09:00
John Bauer	e804ed5e33	[SPARK-29691][ML][PYTHON] ensure Param objects are valid in fit, transform modify Param._copyValues to check valid Param objects supplied as extra ### What changes were proposed in this pull request? Estimator.fit() and Model.transform() accept a dictionary of extra parameters whose values are used to overwrite those supplied at initialization or by default. Additionally, the ParamGridBuilder.addGrid accepts a parameter and list of values. The keys are presumed to be valid Param objects. This change adds a check that only Param objects are supplied as keys. ### Why are the changes needed? Param objects are created by and bound to an instance of Params (Estimator, Model, or Transformer). They may be obtained from their parent as attributes, or by name through getParam. The documentation does not state that keys must be valid Param objects, nor describe how one may be obtained. The current behavior is to silently ignore keys which are not valid Param objects. ### Does this PR introduce any user-facing change? If the user does not pass in a Param object as required for keys in `extra` for Estimator.fit() and Model.transform(), and `param` for ParamGridBuilder.addGrid, an error will be raised indicating it is an invalid object. ### How was this patch tested? Added method test_copy_param_extras_check to test_param.py. Tested with Python 3.7 Closes #26527 from JohnHBauer/paramExtra. Authored-by: John Bauer <john.h.bauer@gmail.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-11-19 14:15:00 -08:00
zhengruifeng	c5f644c6eb	[SPARK-16872][ML][PYSPARK] Impl Gaussian Naive Bayes Classifier ### What changes were proposed in this pull request? support `modelType` `gaussian` ### Why are the changes needed? current modelTypes do not support continuous data ### Does this PR introduce any user-facing change? yes, add a `modelType` option ### How was this patch tested? existing testsuites and added ones Closes #26413 from zhengruifeng/gnb. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-11-18 10:05:42 +08:00
Huaxin Gao	1112fc6029	[SPARK-29867][ML][PYTHON] Add __repr__ in Python ML Models ### What changes were proposed in this pull request? Add ```__repr__``` in Python ML Models ### Why are the changes needed? In Python ML Models, some of them have ```__repr__```, others don't. In the doctest, when calling Model.setXXX, some of the Models print out the xxxModel... correctly, some of them can't because of lacking the ```__repr__``` method. For example: ``` >>> gm = GaussianMixture(k=3, tol=0.0001, seed=10) >>> model = gm.fit(df) >>> model.setPredictionCol("newPrediction") GaussianMixture... ``` After the change, the above code will become the following: ``` >>> gm = GaussianMixture(k=3, tol=0.0001, seed=10) >>> model = gm.fit(df) >>> model.setPredictionCol("newPrediction") GaussianMixtureModel... ``` ### Does this PR introduce any user-facing change? Yes. ### How was this patch tested? doctest Closes #26489 from huaxingao/spark-29876. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-15 21:44:39 -08:00
Bryan Cutler	65a189c7a1	[SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.1 ### What changes were proposed in this pull request? Upgrade Apache Arrow to version 0.15.1. This includes Java artifacts and increases the minimum required version of PyArrow also. Version 0.12.0 to 0.15.1 includes the following selected fixes/improvements relevant to Spark users: * ARROW-6898 - [Java] Fix potential memory leak in ArrowWriter and several test classes * ARROW-6874 - [Python] Memory leak in Table.to_pandas() when conversion to object dtype * ARROW-5579 - [Java] shade flatbuffer dependency * ARROW-5843 - [Java] Improve the readability and performance of BitVectorHelper#getNullCount * ARROW-5881 - [Java] Provide functionalities to efficiently determine if a validity buffer has completely 1 bits/0 bits * ARROW-5893 - [C++] Remove arrow::Column class from C++ library * ARROW-5970 - [Java] Provide pointer to Arrow buffer * ARROW-6070 - [Java] Avoid creating new schema before IPC sending * ARROW-6279 - [Python] Add Table.slice method or allow slices in \_\_getitem\_\_ * ARROW-6313 - [Format] Tracking for ensuring flatbuffer serialized values are aligned in stream/files. * ARROW-6557 - [Python] Always return pandas.Series from Array/ChunkedArray.to_pandas, propagate field names to Series from RecordBatch, Table * ARROW-2015 - [Java] Use Java Time and Date APIs instead of JodaTime * ARROW-1261 - [Java] Add container type for Map logical type * ARROW-1207 - [C++] Implement Map logical type Changelog can be seen at https://arrow.apache.org/release/0.15.0.html ### Why are the changes needed? Upgrade to get bug fixes, improvements, and maintain compatibility with future versions of PyArrow. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests, manually tested with Python 3.7, 3.8 Closes #26133 from BryanCutler/arrow-upgrade-015-SPARK-29376. 
Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-15 13:27:30 +09:00
shane knapp	04e99c1e1b	[SPARK-29672][PYSPARK] update spark testing framework to use python3 ### What changes were proposed in this pull request? remove python2.7 tests and test infra for 3.0+ ### Why are the changes needed? because python2.7 is finally going the way of the dodo. ### Does this PR introduce any user-facing change? newp. ### How was this patch tested? the build system will test this Closes #26330 from shaneknapp/remove-py27-tests. Lead-authored-by: shane knapp <incomplete@gmail.com> Co-authored-by: shane <incomplete@gmail.com> Signed-off-by: shane knapp <incomplete@gmail.com>	2019-11-14 10:18:55 -08:00
Huaxin Gao	1f4075d29e	[SPARK-29808][ML][PYTHON] StopWordsRemover should support multi-cols ### What changes were proposed in this pull request? Add multi-cols support in StopWordsRemover ### Why are the changes needed? As a basic Transformer, StopWordsRemover should support multi-cols. Param stopWords can be applied across all columns. ### Does this PR introduce any user-facing change? ```StopWordsRemover.setInputCols``` ```StopWordsRemover.setOutputCols``` ### How was this patch tested? Unit tests Closes #26480 from huaxingao/spark-29808. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-11-13 08:18:23 -06:00
zhengruifeng	76e5294bb6	[SPARK-29801][ML] ML models unify toString method ### What changes were proposed in this pull request? 1. ML models should override the toString method to expose basic information. Currently some algorithms (GBT/RF/LoR) already do this, while others do not yet. 2. Add `val numFeatures` in `BisectingKMeansModel`/`GaussianMixtureModel`/`KMeansModel`/`AFTSurvivalRegressionModel`/`IsotonicRegressionModel`. ### Why are the changes needed? ML models should override the toString method to expose basic information. ### Does this PR introduce any user-facing change? Yes. ### How was this patch tested? Existing test suites. Closes #26439 from zhengruifeng/models_toString. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-11 11:03:26 -08:00
Bago Amirbekian	8152a87235	[SPARK-28978][ ] Support > 256 args to python udf ### What changes were proposed in this pull request? On the worker we express lambda functions as strings and then eval them to create a "mapper" function. This makes the code hard to read and limits the number of arguments a udf can support to 255 for Python <= 3.6. This PR rewrites the mapper functions as nested functions instead of "lambda strings" and allows passing in more than 255 args. ### Why are the changes needed? The Jira ticket associated with this issue describes how MLflow uses udfs to consume columns as features. This pattern isn't unique, and a limit of 255 features is quite low. ### Does this PR introduce any user-facing change? Users can now pass more than 255 columns to a udf. ### How was this patch tested? Added a unit test for passing in > 255 args to a udf. Closes #26442 from MrBago/replace-lambdas-on-worker. Authored-by: Bago Amirbekian <bago@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-11-08 19:19:14 -08:00
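The SPARK-28978 entry above contrasts "lambda strings" with nested functions. The sketch below illustrates the general idea only; the helper names `make_mapper_from_string` and `make_mapper` are hypothetical and do not reproduce the PySpark worker code. The string-based variant spells out every argument in generated source, which is the pattern that hit the argument limit on older Python versions, while the nested function unpacks a list at call time.

```python
def make_mapper_from_string(func, arg_offsets):
    # Old style (illustrative): generate lambda source with one explicit
    # argument per column, then eval it.
    args = ", ".join("row[%d]" % offset for offset in arg_offsets)
    return eval("lambda row: func(%s)" % args, {"func": func})


def make_mapper(func, arg_offsets):
    # New style (illustrative): a nested function that selects the
    # columns and unpacks them, with no generated source code.
    def mapper(row):
        return func(*[row[offset] for offset in arg_offsets])
    return mapper


def total(*values):
    return sum(values)


row = list(range(300))
offsets = list(range(300))
result = make_mapper(total, offsets)(row)
```

Besides lifting the limit, the nested function is ordinary code that can be read, stepped through, and tested directly.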
HyukjinKwon	7fc9db0853	[SPARK-29798][PYTHON][SQL] Infers bytes as binary type in createDataFrame in Python 3 at PySpark ### What changes were proposed in this pull request? This PR proposes to infer bytes as binary types in Python 3. See https://github.com/apache/spark/pull/25749 for discussions. I have also checked that Arrow considers `bytes` as binary type, and PySpark UDF can also accepts `bytes` as a binary type. Since `bytes` is not a `str` anymore in Python 3, it's clear to call it `BinaryType` in Python 3. ### Why are the changes needed? To respect Python 3's `bytes` type and support Python's primitive types. ### Does this PR introduce any user-facing change? Yes. Before: ```python >>> spark.createDataFrame([[b"abc"]]) Traceback (most recent call last): File "/.../spark/python/pyspark/sql/types.py", line 1036, in _infer_type return _infer_schema(obj) File "/.../spark/python/pyspark/sql/types.py", line 1062, in _infer_schema raise TypeError("Can not infer schema for type: %s" % type(row)) TypeError: Can not infer schema for type: <class 'bytes'> During handling of the above exception, another exception occurred: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 787, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, data), schema) File "/.../spark/python/pyspark/sql/session.py", line 445, in _createFromLocal struct = self._inferSchemaFromList(data, names=schema) File "/.../spark/python/pyspark/sql/session.py", line 377, in _inferSchemaFromList schema = reduce(_merge_type, (_infer_schema(row, names) for row in data)) File "/.../spark/python/pyspark/sql/session.py", line 377, in <genexpr> schema = reduce(_merge_type, (_infer_schema(row, names) for row in data)) File "/.../spark/python/pyspark/sql/types.py", line 1064, in _infer_schema fields = [StructField(k, _infer_type(v), True) for k, v in items] File "/.../spark/python/pyspark/sql/types.py", line 1064, in <listcomp> 
fields = [StructField(k, _infer_type(v), True) for k, v in items] File "/.../spark/python/pyspark/sql/types.py", line 1038, in _infer_type raise TypeError("not supported type: %s" % type(obj)) TypeError: not supported type: <class 'bytes'> ``` After: ```python >>> spark.createDataFrame([[b"abc"]]) DataFrame[_1: binary] ``` ### How was this patch tested? Unittest was added and manually tested. Closes #26432 from HyukjinKwon/SPARK-29798. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-11-08 12:10:39 -08:00
2356 commits