ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
William Hyun	2ab82fae57	[SPARK-31963][PYSPARK][SQL] Support both pandas 0.23 and 1.0 in serializers.py ### What changes were proposed in this pull request? This PR aims to support both pandas 0.23 and 1.0. ### Why are the changes needed? ``` $ pip install pandas==0.23.2 $ python -c "import pandas.CategoricalDtype" Traceback (most recent call last): File "<string>", line 1, in <module> ModuleNotFoundError: No module named 'pandas.CategoricalDtype' $ python -c "from pandas.api.types import CategoricalDtype" ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins. ``` $ pip freeze \| grep pandas pandas==0.23.2 $ python/run-tests.py --python-executables python --modules pyspark-sql ... Tests passed in 359 seconds ``` Closes #28789 from williamhyun/williamhyun-patch-2. Authored-by: William Hyun <williamhyun3@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-06-10 14:42:45 -07:00
Takuya UESHIN	032d17933b	[SPARK-31945][SQL][PYSPARK] Enable cache for the same Python function ### What changes were proposed in this pull request? This PR proposes to make `PythonFunction` holds `Seq[Byte]` instead of `Array[Byte]` to be able to compare if the byte array has the same values for the cache manager. ### Why are the changes needed? Currently the cache manager doesn't use the cache for `udf` if the `udf` is created again even if the functions is the same. ```py >>> func = lambda x: x >>> df = spark.range(1) >>> df.select(udf(func)("id")).cache() ``` ```py >>> df.select(udf(func)("id")).explain() == Physical Plan == (2) Project [pythonUDF0#14 AS <lambda>(id)#12] +- BatchEvalPython [<lambda>(id#0L)], [pythonUDF0#14] +- (1) Range (0, 1, step=1, splits=12) ``` This is because `PythonFunction` holds `Array[Byte]`, and `equals` method of array equals only when the both array is the same instance. ### Does this PR introduce _any_ user-facing change? Yes, if the user reuse the Python function for the UDF, the cache manager will detect the same function and use the cache for it. ### How was this patch tested? I added a test case and manually. ```py >>> df.select(udf(func)("id")).explain() == Physical Plan == InMemoryTableScan [<lambda>(id)#12] +- InMemoryRelation [<lambda>(id)#12], StorageLevel(disk, memory, deserialized, 1 replicas) +- (2) Project [pythonUDF0#5 AS <lambda>(id)#3] +- BatchEvalPython [<lambda>(id#0L)], [pythonUDF0#5] +- (1) Range (0, 1, step=1, splits=12) ``` Closes #28774 from ueshin/issues/SPARK-31945/udf_cache. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-06-10 16:38:59 +09:00
HyukjinKwon	e28914095a	[SPARK-31849][PYTHON][SQL][FOLLOW-UP] More correct error message in Python UDF exception message ### What changes were proposed in this pull request? This PR proposes to fix wordings in the Python UDF exception error message from: From: > An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace. To: > An exception was thrown from the Python worker. Please see the stack trace below. It removes "executor" because Python worker is technically a separate process, and remove the duplicated wording "Python worker" . ### Why are the changes needed? To give users better exception messages. ### Does this PR introduce _any_ user-facing change? No, it's in unreleased branches only. If RC3 passes, yes, it will change the exception message. ### How was this patch tested? Manually tested. Closes #28762 from HyukjinKwon/SPARK-31849-followup-2. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-06-09 10:24:34 +09:00
HyukjinKwon	a42af81706	[SPARK-31849][PYTHON][SQL][FOLLOW-UP] Deduplicate and reuse Utils.exceptionString in Python exception handling ### What changes were proposed in this pull request? This PR proposes to use existing util `org.apache.spark.util.Utils.exceptionString` for the same codes at: ```python jwriter = jvm.java.io.StringWriter() e.printStackTrace(jvm.java.io.PrintWriter(jwriter)) stacktrace = jwriter.toString() ``` ### Why are the changes needed? To deduplicate codes. Plus, less communication between JVM and Py4j. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tested. Closes #28749 from HyukjinKwon/SPARK-31849-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-06-08 15:18:42 +09:00
HyukjinKwon	53ce58da34	[MINOR][PYTHON] Add one more newline between JVM and Python tracebacks ### What changes were proposed in this pull request? This PR proposes to add one more newline to clearly separate JVM and Python tracebacks: Before: ``` Traceback (most recent call last): ... pyspark.sql.utils.AnalysisException: Reference 'column' is ambiguous, could be: column, column.; JVM stacktrace: org.apache.spark.sql.AnalysisException: Reference 'column' is ambiguous, could be: column, column.; ... ``` After: ``` Traceback (most recent call last): ... pyspark.sql.utils.AnalysisException: Reference 'column' is ambiguous, could be: column, column.; JVM stacktrace: org.apache.spark.sql.AnalysisException: Reference 'column' is ambiguous, could be: column, column.; ... ``` This is kind of a followup of `e69466056f` (SPARK-31849). ### Why are the changes needed? To make it easier to read. ### Does this PR introduce _any_ user-facing change? It's in the unreleased branches. ### How was this patch tested? Manually tested. Closes #28732 from HyukjinKwon/python-minor. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-06-05 13:31:35 +09:00
HyukjinKwon	e1d5201140	[SPARK-31895][PYTHON][SQL] Support DataFrame.explain(extended: str) case to be consistent with Scala side ### What changes were proposed in this pull request? Scala: ```scala scala> spark.range(10).explain("cost") ``` ``` == Optimized Logical Plan == Range (0, 10, step=1, splits=Some(12)), Statistics(sizeInBytes=80.0 B) == Physical Plan == (1) Range (0, 10, step=1, splits=12) ``` PySpark: ```python >>> spark.range(10).explain("cost") ``` ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/dataframe.py", line 333, in explain raise TypeError(err_msg) TypeError: extended (optional) should be provided as bool, got <class 'str'> ``` In addition, it is consistent with other codes too, for example, `DataFrame.sample` also can support `DataFrame.sample(1.0)` and `DataFrame.sample(False)`. ### Why are the changes needed? To provide the consistent API support across APIs. ### Does this PR introduce _any_ user-facing change? Nope, it's only changes in unreleased branches. If this lands to master only, yes, users will be able to set `mode` as `df.explain("...")` in Spark 3.1. After this PR: ```python >>> spark.range(10).explain("cost") ``` ``` == Optimized Logical Plan == Range (0, 10, step=1, splits=Some(12)), Statistics(sizeInBytes=80.0 B) == Physical Plan == (1) Range (0, 10, step=1, splits=12) ``` ### How was this patch tested? Unittest was added and manually tested as well to make sure: ```python spark.range(10).explain(True) spark.range(10).explain(False) spark.range(10).explain("cost") spark.range(10).explain(extended="cost") spark.range(10).explain(mode="cost") spark.range(10).explain() spark.range(10).explain(True, "cost") spark.range(10).explain(1.0) ``` Closes #28711 from HyukjinKwon/SPARK-31895. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-06-03 12:07:05 +09:00
HyukjinKwon	e69466056f	[SPARK-31849][PYTHON][SQL] Make PySpark SQL exceptions more Pythonic ### What changes were proposed in this pull request? This PR proposes to make PySpark exception more Pythonic by hiding JVM stacktrace by default. It can be enabled by turning on `spark.sql.pyspark.jvmStacktrace.enabled` configuration. ``` Traceback (most recent call last): ... pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace. Traceback (most recent call last): ... ``` If this `spark.sql.pyspark.jvmStacktrace.enabled` is enabled, it appends: ``` JVM stacktrace: org.apache.spark.Exception: ... ... ``` For example, the codes below: ```python from pyspark.sql.functions import udf udf def divide_by_zero(v): raise v / 0 spark.range(1).select(divide_by_zero("id")).show() ``` will show an error messages that looks like Python exception thrown from the local. <details> <summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is off (default)</summary> ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace. Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda a: f(a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(args, kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero ``` </details> <details> <summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is on</summary> ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 137, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace. Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda a: f(a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(args, *kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero JVM stacktrace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 192.168.35.193, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda a: f(a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(args, *kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2117) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2066) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2065) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2065) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1021) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1021) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1021) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2297) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2246) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2235) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:823) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2108) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2129) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2148) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3653) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) at org.apache.spark.sql.Dataset.head(Dataset.scala:2695) at org.apache.spark.sql.Dataset.take(Dataset.scala:2902) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300) at org.apache.spark.sql.Dataset.showString(Dataset.scala:337) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda a: f(a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(args, *kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ... 1 more ``` </details> <details> <summary>Python exception message without this change</summary> ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 98, in deco return f(a, *kw) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o160.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 5.0 failed 4 times, most recent failure: Lost task 10.3 in stage 5.0 (TID 37, 192.168.35.193, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda a: f(a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(args, *kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2117) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2066) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2065) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2065) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1021) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1021) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1021) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2297) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2246) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2235) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:823) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2108) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2129) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2148) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3653) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) at org.apache.spark.sql.Dataset.head(Dataset.scala:2695) at org.apache.spark.sql.Dataset.take(Dataset.scala:2902) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300) at org.apache.spark.sql.Dataset.showString(Dataset.scala:337) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda a: f(a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(args, *kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ... 1 more ``` </details> <br/> Another example with Python 3.7: ```python sql("a") ``` <details> <summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is off (default)</summary> ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 646, in sql return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.ParseException: mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == a ^^^ ``` </details> <details> <summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is on</summary> ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 646, in sql return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.ParseException: mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == a ^^^ JVM stacktrace: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == a ^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:133) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:49) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81) at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:604) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:604) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) ``` </details> <details> <summary>Python exception message without this change</summary> ``` Traceback (most recent call last): File "/.../spark/python/pyspark/sql/utils.py", line 98, in deco return f(a, **kw) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o26.sql. : org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == a ^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:133) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:49) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81) at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:604) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:604) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 646, in sql return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 102, in deco raise converted pyspark.sql.utils.ParseException: mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == a ^^^ ``` </details> ### Why are the changes needed? Currently, PySpark exceptions are very unfriendly to Python users with causing a bunch of JVM stacktrace. See "Python exception message without this change" above. ### Does this PR introduce _any_ user-facing change? Yes, it will change the exception message. See the examples above. ### How was this patch tested? Manually tested by ```bash ./bin/pyspark --conf spark.sql.pyspark.jvmStacktrace.enabled=true ``` and running the examples above. Closes #28661 from HyukjinKwon/python-debug. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-06-01 09:45:21 +09:00
Bryan Cutler	8bbb666622	[SPARK-25351][PYTHON][TEST][FOLLOWUP] Fix test assertions to be consistent ### What changes were proposed in this pull request? Followup to make assertions from recent test consistent with the rest of the module ### Why are the changes needed? Better to use assertions from `unittest` and be consistent ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #28659 from BryanCutler/arrow-category-test-fix-SPARK-25351. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-28 10:27:15 +09:00
iRakson	2f92ea0df4	[SPARK-31763][PYSPARK] Add `inputFiles` method in PySpark DataFrame Class ### What changes were proposed in this pull request? Adds `inputFiles()` method to PySpark `DataFrame`. Using this, PySpark users can list all files constituting a `DataFrame`. Before changes: ``` >>> spark.read.load("examples/src/main/resources/people.json", format="json").inputFiles() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/*//spark/python/pyspark/sql/dataframe.py", line 1388, in __getattr__ "'%s' object has no attribute '%s'" % (self.__class__.__name__, name)) AttributeError: 'DataFrame' object has no attribute 'inputFiles' ``` After changes:* ``` >>> spark.read.load("examples/src/main/resources/people.json", format="json").inputFiles() [u'file:///*/*/spark/examples/src/main/resources/people.json'] ``` ### Why are the changes needed? This method is already supported for spark with scala and java. ### Does this PR introduce _any_ user-facing change? Yes, Now users can list all files of a DataFrame using `inputFiles()` ### How was this patch tested? UT added. Closes #28652 from iRakson/SPARK-31763. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-28 09:52:08 +09:00
Jalpan Randeri	339b0ecadb	[SPARK-25351][SQL][PYTHON] Handle Pandas category type when converting from Python with Arrow Handle Pandas category type while converting from python with Arrow enabled. The category column will be converted to whatever type the category elements are as is the case with Arrow disabled. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New unit tests were added for `createDataFrame` and scalar `pandas_udf` Closes #26585 from jalpan-randeri/feature-pyarrow-dictionary-type. Authored-by: Jalpan Randeri <randerij@amazon.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2020-05-27 17:27:29 -07:00
David Toneian	acab558e55	[SPARK-31739][PYSPARK][DOCS][MINOR] Fix docstring syntax issues and misplaced space characters This commit is published into the public domain. ### What changes were proposed in this pull request? Some syntax issues in docstrings have been fixed. ### Why are the changes needed? In some places, the documentation did not render as intended, e.g. parameter documentations were not formatted as such. ### Does this PR introduce any user-facing change? Slight improvements in documentation. ### How was this patch tested? Manual testing and `dev/lint-python` run. No new Sphinx warnings arise due to this change. Closes #28559 from DavidToneian/SPARK-31739. Authored-by: David Toneian <david@toneian.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-18 20:25:02 +09:00
Kent Yao	b31ae7bb0b	[SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions ### What changes were proposed in this pull request? The RuntimeReplaceable ones are runtime replaceable, thus, their original parameters are not going to be resolved to PrettyAttribute and remain debug style string if we directly implement their `sql` methods with their parameters' `sql` methods. This PR is raised with suggestions by maropu and cloud-fan https://github.com/apache/spark/pull/28402/files#r417656589. In this PR, we re-implement the `sql` methods of the RuntimeReplaceable ones with toPettySQL ### Why are the changes needed? Consistency of schema output between RuntimeReplaceable expressions and normal ones. For example, `date_format` vs `to_timestamp`, before this PR, they output differently #### Before ```sql select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu') struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string> select to_timestamp("2019-10-06S10:11:12.12345", "yyyy-MM-dd'S'HH:mm:ss.SSSSSS") struct<to_timestamp('2019-10-06S10:11:12.12345', 'yyyy-MM-dd\'S\'HH:mm:ss.SSSSSS'):timestamp> ``` #### After ```sql select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu') struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string> select to_timestamp("2019-10-06T10:11:12'12", "yyyy-MM-dd'T'HH:mm:ss''SSSS") struct<to_timestamp(2019-10-06T10:11:12'12, yyyy-MM-dd'T'HH:mm:ss''SSSS):timestamp> ```` ### Does this PR introduce _any_ user-facing change? Yes, the schema output style changed for the runtime replaceable expressions as shown in the above example ### How was this patch tested? regenerate all related tests Closes #28420 from yaooqinn/SPARK-31615. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-05-07 14:40:26 +09:00
HyukjinKwon	5dd581c88a	[SPARK-29664][PYTHON][SQL][FOLLOW-UP] Add deprecation warnings for getItem instead ### What changes were proposed in this pull request? This PR proposes to use a different approach instead of breaking it per Micheal's rubric added at https://spark.apache.org/versioning-policy.html. It deprecates the behaviour for now. It will be gradually removed in the future releases. After this change, ```python import warnings warnings.simplefilter("always") from pyspark.sql.functions import * df = spark.range(2) map_col = create_map(lit(0), lit(100), lit(1), lit(200)) df.withColumn("mapped", map_col.getItem(col('id'))).show() ``` ``` /.../python/pyspark/sql/column.py:311: DeprecationWarning: A column as 'key' in getItem is deprecated as of Spark 3.0, and will not be supported in the future release. Use `column[key]` or `column.key` syntax instead. DeprecationWarning) ... ``` ```python import warnings warnings.simplefilter("always") from pyspark.sql.functions import * df = spark.range(2) struct_col = struct(lit(0), lit(100), lit(1), lit(200)) df.withColumn("struct", struct_col.getField(lit("col1"))).show() ``` ``` /.../spark/python/pyspark/sql/column.py:336: DeprecationWarning: A column as 'name' in getField is deprecated as of Spark 3.0, and will not be supported in the future release. Use `column[name]` or `column.name` syntax instead. DeprecationWarning) ``` ### Why are the changes needed? To prevent the radical behaviour change after the amended versioning policy. ### Does this PR introduce any user-facing change? Yes, it will show the deprecated warning message. ### How was this patch tested? Manually tested. Closes #28327 from HyukjinKwon/SPARK-29664. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-27 14:49:22 +09:00
Takuya UESHIN	87be3641eb	[SPARK-31441] Support duplicated column names for toPandas with arrow execution ### What changes were proposed in this pull request? This PR is adding support duplicated column names for `toPandas` with Arrow execution. ### Why are the changes needed? When we execute `toPandas()` with Arrow execution, it fails if the column names have duplicates. ```py >>> spark.sql("select 1 v, 1 v").toPandas() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2132, in toPandas pdf = table.to_pandas() File "pyarrow/array.pxi", line 441, in pyarrow.lib._PandasConvertible.to_pandas File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 653, in table_to_blockmanager columns = _deserialize_column_index(table, all_columns, column_indexes) File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 704, in _deserialize_column_index columns = _flatten_single_level_multiindex(columns) File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 937, in _flatten_single_level_multiindex raise ValueError('Found non-unique column index') ValueError: Found non-unique column index ``` ### Does this PR introduce any user-facing change? Yes, previously we will face an error above, but after this PR, we will see the result: ```py >>> spark.sql("select 1 v, 1 v").toPandas() v v 0 1 1 ``` ### How was this patch tested? Added and modified related tests. Closes #28210 from ueshin/issues/SPARK-31441/to_pandas. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-14 14:08:56 +09:00
Kent Yao	31b907748d	[SPARK-31414][SQL][DOCS][FOLLOWUP] Update default datetime pattern for json/csv APIs documentations ### What changes were proposed in this pull request? Update default datetime pattern from `yyyy-MM-dd'T'HH:mm:ss.SSSXXX ` to `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] ` for JSON/CSV APIs documentations ### Why are the changes needed? doc fix ### Does this PR introduce any user-facing change? Yes, the documentation will change ### How was this patch tested? Passing Jenkins Closes #28204 from yaooqinn/SPARK-31414-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-14 10:25:37 +09:00
HyukjinKwon	c279e6b091	[SPARK-30722][DOCS][FOLLOW-UP] Explicitly mention the same entire input/output length restriction of Series Iterator UDF ### What changes were proposed in this pull request? This PR explicitly mention that the requirement of Iterator of Series to Iterator of Series and Iterator of Multiple Series to Iterator of Series (previously Scalar Iterator pandas UDF). The actual limitation of this UDF is the same length of the _entire input and output_, instead of each series's length. Namely you can do something as below: ```python from typing import Iterator, Tuple import pandas as pd from pyspark.sql.functions import pandas_udf pandas_udf("long") def func( iterator: Iterator[pd.Series]) -> Iterator[pd.Series]: return iter([pd.concat(iterator)]) spark.range(100).select(func("id")).show() ``` This characteristic allows you to prefetch the data from the iterator to speed up, compared to the regular Scalar to Scalar (previously Scalar pandas UDF). ### Why are the changes needed? To document the correct restriction and characteristics of a feature. ### Does this PR introduce any user-facing change? Yes in the documentation but only in unreleased branches. ### How was this patch tested? Github Actions should test the documentation build Closes #28160 from HyukjinKwon/SPARK-30722-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-09 16:46:27 +09:00
HyukjinKwon	4fafdcd63b	[SPARK-26412][PYTHON][FOLLOW-UP] Improve error messages in Scala iterator pandas UDF ### What changes were proposed in this pull request? This PR proposes to improve the error message from Scalar iterator pandas UDF. ### Why are the changes needed? To show the correct error messages. ### Does this PR introduce any user-facing change? Yes, but only in unreleased branches. ```python import pandas as pd from pyspark.sql.functions import pandas_udf, PandasUDFType pandas_udf('long', PandasUDFType.SCALAR_ITER) def pandas_plus_one(iterator): for _ in iterator: yield pd.Series(1) spark.range(10).repartition(1).select(pandas_plus_one("id")).show() ``` ```python import pandas as pd from pyspark.sql.functions import pandas_udf, PandasUDFType pandas_udf('long', PandasUDFType.SCALAR_ITER) def pandas_plus_one(iterator): for _ in iterator: yield pd.Series(list(range(20))) spark.range(10).repartition(1).select(pandas_plus_one("id")).show() ``` Before: ``` RuntimeError: The number of output rows of pandas iterator UDF should be the same with input rows. The input rows number is 10 but the output rows number is 1. ``` ``` AssertionError: Pandas MAP_ITER UDF outputted more rows than input rows. ``` After: ``` RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was 1 and the length of input was 10. ``` ``` AssertionError: Pandas SCALAR_ITER UDF outputted more rows than input rows. ``` ### How was this patch tested? Unittests were fixed accordingly. Closes #28135 from HyukjinKwon/SPARK-26412-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-09 13:14:41 +09:00
Liang-Chi Hsieh	1f02871489	[SPARK-30921][PYSPARK] Predicates on python udf should not be pushdown through Aggregate ### What changes were proposed in this pull request? This patch proposed to skip predicates on PythonUDFs to be pushdown through Aggregate. ### Why are the changes needed? The predicates on PythonUDFs cannot be pushdown through Aggregate. Pushed down predicates cannot be evaluate because PythonUDFs cannot be evaluated on Filter and cause error like: ``` Caused by: java.lang.UnsupportedOperationException: Cannot generate code for expression: mean(input[1, struct<bar:bigint>, true].bar) at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:304) at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:303) at org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:52) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:146) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:141) at org.apache.spark.sql.catalyst.expressions.CastBase.doGenCode(Cast.scala:821) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:146) at scala.Option.getOrElse(Option.scala:189) ``` ### Does this PR introduce any user-facing change? Yes. Previously the predicates on PythonUDFs will be pushdown through Aggregate can cause error. After this change, the query can work. ### How was this patch tested? Unit test. Closes #28089 from viirya/SPARK-30921. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-06 09:36:20 +09:00
Ben Ryves	fa37856710	[SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound ### What changes were proposed in this pull request? A small documentation change to clarify that the `rand()` function produces values in `[0.0, 1.0)`. ### Why are the changes needed? `rand()` uses `Rand()` - which generates values in [0, 1) ([documented here](`a1dbcd13a3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala (L71)`)). The existing documentation suggests that 1.0 is a possible value returned by rand (i.e for a distribution written as `X ~ U(a, b)`, x can be a or b, so `U[0.0, 1.0]` suggests the value returned could include 1.0). ### Does this PR introduce any user-facing change? Only documentation changes. ### How was this patch tested? Documentation changes only. Closes #28071 from Smeb/master. Authored-by: Ben Ryves <benjamin.ryves@getyourguide.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-31 15:16:17 +09:00
Maxim Gekk	d2ff5c5bfb	[SPARK-31286][SQL][DOC] Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp ### What changes were proposed in this pull request? In the PR, I propose to update the doc for the `timeZone` option in JSON/CSV datasources and for the `tz` parameter of the `from_utc_timestamp()`/`to_utc_timestamp()` functions, and to restrict format of config's values to 2 forms: 1. Geographical regions, such as `America/Los_Angeles`. 2. Fixed offsets - a fully resolved offset from UTC. For example, `-08:00`. ### Why are the changes needed? Other formats such as three-letter time zone IDs are ambitious, and depend on the locale. For example, `CST` could be U.S. `Central Standard Time` and `China Standard Time`. Such formats have been already deprecated in JDK, see [Three-letter time zone IDs](https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html). ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running `./dev/scalastyle`, and manual testing. Closes #28051 from MaxGekk/doc-time-zone-option. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-30 12:20:11 +08:00
gatorsmile	3884455780	[SPARK-31087] [SQL] Add Back Multiple Removed APIs ### What changes were proposed in this pull request? Based on the discussion in the mailing list [[Proposal] Modification to Spark's Semantic Versioning Policy](http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-Modification-to-Spark-s-Semantic-Versioning-Policy-td28938.html) , this PR is to add back the following APIs whose maintenance cost are relatively small. - functions.toDegrees/toRadians - functions.approxCountDistinct - functions.monotonicallyIncreasingId - Column.!== - Dataset.explode - Dataset.registerTempTable - SQLContext.getOrCreate, setActive, clearActive, constructors Below is the other removed APIs in the original PR, but not added back in this PR [https://issues.apache.org/jira/browse/SPARK-25908]: - Remove some AccumulableInfo .apply() methods - Remove non-label-specific multiclass precision/recall/fScore in favor of accuracy - Remove unused Python StorageLevel constants - Remove unused multiclass option in libsvm parsing - Remove references to deprecated spark configs like spark.yarn.am.port - Remove TaskContext.isRunningLocally - Remove ShuffleMetrics.shuffle* methods - Remove BaseReadWrite.context in favor of session ### Why are the changes needed? Avoid breaking the APIs that are commonly used. ### Does this PR introduce any user-facing change? Adding back the APIs that were removed in 3.0 branch does not introduce the user-facing changes, because Spark 3.0 has not been released. ### How was this patch tested? Added a new test suite for these APIs. Author: gatorsmile <gatorsmile@gmail.com> Author: yi.wu <yi.wu@databricks.com> Closes #27821 from gatorsmile/addAPIBackV2.	2020-03-28 22:05:16 -07:00
HyukjinKwon	3165a95a04	[SPARK-31287][PYTHON][SQL] Ignore type hints in groupby.(cogroup.)applyInPandas and mapInPandas ### What changes were proposed in this pull request? This PR proposes to make pandas function APIs (`groupby.(cogroup.)applyInPandas` and `mapInPandas`) to ignore Python type hints. ### Why are the changes needed? Python type hints are optional. It shouldn't affect where pandas UDFs are not used. This is also a future work for them to support other type hints. We shouldn't at least throw an exception at this moment. ### Does this PR introduce any user-facing change? No, it's master-only change. ```python import pandas as pd def pandas_plus_one(pdf: pd.DataFrame) -> pd.DataFrame: return pdf + 1 spark.range(10).groupby('id').applyInPandas(pandas_plus_one, schema="id long").show() ``` ```python import pandas as pd def pandas_plus_one(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame: return left + 1 spark.range(10).groupby('id').cogroup(spark.range(10).groupby("id")).applyInPandas(pandas_plus_one, schema="id long").show() ``` ```python from typing import Iterator import pandas as pd def pandas_plus_one(iter: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]: return map(lambda v: v + 1, iter) spark.range(10).mapInPandas(pandas_plus_one, schema="id long").show() ``` Before: Exception After: ``` +---+ \| id\| +---+ \| 1\| \| 2\| \| 3\| \| 4\| \| 5\| \| 6\| \| 7\| \| 8\| \| 9\| \| 10\| +---+ ``` ### How was this patch tested? Closes #28052 from HyukjinKwon/SPARK-31287. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-29 13:59:18 +09:00
gatorsmile	b9eafcb526	[SPARK-31088][SQL] Add back HiveContext and createExternalTable ### What changes were proposed in this pull request? Based on the discussion in the mailing list [[Proposal] Modification to Spark's Semantic Versioning Policy](http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-Modification-to-Spark-s-Semantic-Versioning-Policy-td28938.html) , this PR is to add back the following APIs whose maintenance cost are relatively small. - HiveContext - createExternalTable APIs ### Why are the changes needed? Avoid breaking the APIs that are commonly used. ### Does this PR introduce any user-facing change? Adding back the APIs that were removed in 3.0 branch does not introduce the user-facing changes, because Spark 3.0 has not been released. ### How was this patch tested? add a new test suite for createExternalTable APIs. Closes #27815 from gatorsmile/addAPIsBack. Lead-authored-by: gatorsmile <gatorsmile@gmail.com> Co-authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2020-03-26 23:51:15 -07:00
Liang-Chi Hsieh	559d3e4051	[SPARK-31186][PYSPARK][SQL] toPandas should not fail on duplicate column names ### What changes were proposed in this pull request? When `toPandas` API works on duplicate column names produced from operators like join, we see the error like: ``` ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). ``` This patch fixes the error in `toPandas` API. ### Why are the changes needed? To make `toPandas` work on dataframe with duplicate column names. ### Does this PR introduce any user-facing change? Yes. Previously calling `toPandas` API on a dataframe with duplicate column names will fail. After this patch, it will produce correct result. ### How was this patch tested? Unit test. Closes #28025 from viirya/SPARK-31186. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-27 12:10:30 +09:00
Kent Yao	b024a8a69e	[MINOR][DOCS] Fix some links for python api doc ### What changes were proposed in this pull request? the link for `partition discovery` is malformed, because for releases, there will contains` /docs/<version>/` in the full URL. ### Why are the changes needed? fix doc ### Does this PR introduce any user-facing change? no ### How was this patch tested? `SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve` locally verified Closes #28017 from yaooqinn/doc. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-26 13:06:21 +09:00
Kent Yao	88ae6c4481	[SPARK-31189][SQL][DOCS] Fix errors and missing parts for datetime pattern document ### What changes were proposed in this pull request? Fix errors and missing parts for datetime pattern document 1. The pattern we use is similar to DateTimeFormatter and SimpleDateFormat but not identical. So we shouldn't use any of them in the API docs but use a link to the doc of our own. 2. Some pattern letters are missing 3. Some pattern letters are explicitly banned - Set('A', 'c', 'e', 'n', 'N') 4. the second fraction pattern different logic for parsing and formatting ### Why are the changes needed? fix and improve doc ### Does this PR introduce any user-facing change? yes, new and updated doc ### How was this patch tested? pass Jenkins viewed locally with `jekyll serve` ![image](https://user-images.githubusercontent.com/8326978/77044447-6bd3bb00-69fa-11ea-8d6f-7084166c5dea.png) Closes #27956 from yaooqinn/SPARK-31189. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-20 21:59:26 +08:00
zero323	01f20394ac	[SPARK-30569][SQL][PYSPARK][SPARKR] Add percentile_approx DSL functions ### What changes were proposed in this pull request? - Adds following overloaded variants to Scala `o.a.s.sql.functions`: - `percentile_approx(e: Column, percentage: Array[Double], accuracy: Long): Column` - `percentile_approx(columnName: String, percentage: Array[Double], accuracy: Long): Column` - `percentile_approx(e: Column, percentage: Double, accuracy: Long): Column` - `percentile_approx(columnName: String, percentage: Double, accuracy: Long): Column` - `percentile_approx(e: Column, percentage: Seq[Double], accuracy: Long): Column` (primarily for Python interop). - `percentile_approx(columnName: String, percentage: Seq[Double], accuracy: Long): Column` - Adds `percentile_approx` to `pyspark.sql.functions`. - Adds `percentile_approx` function to SparkR. ### Why are the changes needed? Currently we support `percentile_approx` only in SQL expression. It is inconvenient and makes this function relatively unknown. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New unit tests for SparkR an PySpark. As for now there are no additional tests in Scala API ‒ `ApproximatePercentile` is well tested and Python (including docstrings) and R tests provide additional tests, so it seems unnecessary. Closes #27278 from zero323/SPARK-30569. Lead-authored-by: zero323 <mszymkiewicz@gmail.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-17 10:44:21 +09:00
Yuanjian Li	3493162c78	[SPARK-31030][SQL] Backward Compatibility for Parsing and formatting Datetime ### What changes were proposed in this pull request? In Spark version 2.4 and earlier, datetime parsing, formatting and conversion are performed by using the hybrid calendar (Julian + Gregorian). Since the Proleptic Gregorian calendar is de-facto calendar worldwide, as well as the chosen one in ANSI SQL standard, Spark 3.0 switches to it by using Java 8 API classes (the java.time packages that are based on ISO chronology ). The switching job is completed in SPARK-26651. But after the switching, there are some patterns not compatible between Java 8 and Java 7, Spark needs its own definition on the patterns rather than depends on Java API. In this PR, we achieve this by writing the document and shadow the incompatible letters. See more details in [SPARK-31030](https://issues.apache.org/jira/browse/SPARK-31030) ### Why are the changes needed? For backward compatibility. ### Does this PR introduce any user-facing change? No. After we define our own datetime parsing and formatting patterns, it's same to old Spark version. ### How was this patch tested? Existing and new added UT. Locally document test: ![image](https://user-images.githubusercontent.com/4833765/76064100-f6acc280-5fc3-11ea-9ef7-82e7dc074205.png) Closes #27830 from xuanyuanking/SPARK-31030. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-11 14:11:13 +08:00
Liang-Chi Hsieh	d21aab403a	[SPARK-30941][PYSPARK] Add a note to asDict to document its behavior when there are duplicate fields ### What changes were proposed in this pull request? Adding a note to document `Row.asDict` behavior when there are duplicate fields. ### Why are the changes needed? When a row contains duplicate fields, `asDict` and `_get_item_` behaves differently. We should document it to let users know the difference explicitly. ### Does this PR introduce any user-facing change? No. Only document change. ### How was this patch tested? Existing test. Closes #27853 from viirya/SPARK-30941. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-09 11:06:45 -07:00
zero323	7de33f56e8	[SPARK-30681][PYSPARK][SQL] Add higher order functions API to PySpark ### What changes were proposed in this pull request? This PR add Python API for invoking following higher functions: - `transform` - `exists` - `forall` - `filter` - `aggregate` - `zip_with` - `transform_keys` - `transform_values` - `map_filter` - `map_zip_with` to `pyspark.sql`. Each of these accepts plain Python functions of one of the following types - `(Column) -> Column: ...` - `(Column, Column) -> Column: ...` - `(Column, Column, Column) -> Column: ...` Internally this proposal piggbacks on objects supporting Scala implementation ([SPARK-27297](https://issues.apache.org/jira/browse/SPARK-27297)) by: 1. Creating required `UnresolvedNamedLambdaVariables` exposing these as PySpark `Columns` 2. Invoking Python function with these columns as arguments. 3. Using the result, and underlying JVM objects from 1., to create `expressions.LambdaFunction` which is passed to desired expression, and repacked as Python `Column`. ### Why are the changes needed? Currently higher order functions are available only using SQL and Scala API and can use only SQL expressions ```python df.selectExpr("transform(values, x -> x + 1)") ``` This works reasonably well for simple functions, but can get really ugly with complex functions (complex functions, casts), resulting objects are somewhat verbose and we don't get any IDE support. Additionally DSL used, though very simple, is not documented. With changes propose here, above query could be rewritten as: ```python df.select(transform("values", lambda x: x + 1)) ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? - For positive cases this PR adds doctest strings covering possible usage patterns. - For negative cases (unsupported function types) this PR adds unit tests. ### Notes If approved, the same approach can be used in SparkR. Closes #27406 from zero323/SPARK-30681. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-28 12:59:39 +09:00
Eric Wu	1f0300fb16	[SPARK-30764][SQL] Improve the readability of EXPLAIN FORMATTED style ### What changes were proposed in this pull request? The style of `EXPLAIN FORMATTED` output needs to be improved. We’ve already got some observations/ideas in https://github.com/apache/spark/pull/27368#discussion_r376694496 https://github.com/apache/spark/pull/27368#discussion_r376927143 Observations/Ideas: 1. Using comma as the separator is not clear, especially commas are used inside the expressions too. 2. Show the column counts first? For example, `Results [4]: …` 3. Currently the attribute names are automatically generated, this need to refined. 4. Add arguments field in common implementations as `EXPLAIN EXTENDED` did by calling `argString` in `TreeNode.simpleString`. This will eliminate most existing minor differences between `EXPLAIN EXTENDED` and `EXPLAIN FORMATTED`. 5. Another improvement we can do is: the generated alias shouldn't include attribute id. collect_set(val, 0, 0)#123 looks clearer than collect_set(val#456, 0, 0)#123 This PR is currently addressing comments 2 & 4, and open for more discussions on improving readability. ### Why are the changes needed? The readability of `EXPLAIN FORMATTED` need to be improved, which will help user better understand the query plan. ### Does this PR introduce any user-facing change? Yes, `EXPLAIN FORMATTED` output style changed. ### How was this patch tested? Update expect results of test cases in explain.sql Closes #27509 from Eric5553/ExplainFormattedRefine. Authored-by: Eric Wu <492960551@qq.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-21 23:36:14 +08:00
Alex Favaro	96c1a4401d	[SPARK-30856][SQL][PYSPARK] Fix SQLContext.getOrCreate() when SparkContext is restarted ### What changes were proposed in this pull request? As discussed on the Jira ticket, this change clears the SQLContext._instantiatedContext class attribute when the SparkSession is stopped. That way, the attribute will be reset with a new, usable SQLContext when a new SparkSession is started. ### Why are the changes needed? When the underlying SQLContext is instantiated for a SparkSession, the instance is saved as a class attribute and returned from subsequent calls to SQLContext.getOrCreate(). If the SparkContext is stopped and a new one started, the SQLContext class attribute is never cleared so any code which calls SQLContext.getOrCreate() will get a SQLContext with a reference to the old, unusable SparkContext. A similar issue was identified and fixed for SparkSession in [SPARK-19055](https://issues.apache.org/jira/browse/SPARK-19055), but the fix did not change SQLContext as well. I ran into this because mllib still [uses](https://github.com/apache/spark/blob/master/python/pyspark/mllib/common.py#L105) SQLContext.getOrCreate() under the hood. ### Does this PR introduce any user-facing change? No ### How was this patch tested? A new test was added. I verified that the test fails without the included change. Closes #27610 from afavaro/restart-sqlcontext. Authored-by: Alex Favaro <alex.favaro@affirm.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-20 12:21:24 +09:00
HyukjinKwon	e065e22e5e	[SPARK-30861][PYTHON][SQL] Deprecate constructor of SQLContext and getOrCreate in SQLContext at PySpark ### What changes were proposed in this pull request? This PR proposes to deprecate the APIs at `SQLContext` removed in SPARK-25908. We should remove equivalent APIs; however, seems we missed to deprecate. While I am here, I fix one more issue. After SPARK-25908, `sc._jvm.SQLContext.getOrCreate` dose not exist anymore. So, ```python from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext.getOrCreate() SQLContext.getOrCreate(sc).range(10).show() ``` throws an exception as below: ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/context.py", line 110, in getOrCreate jsqlContext = sc._jvm.SQLContext.getOrCreate(sc._jsc.sc()) File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1516, in __getattr__ py4j.protocol.Py4JError: org.apache.spark.sql.SQLContext.getOrCreate does not exist in the JVM ``` After this PR: ``` /.../spark/python/pyspark/sql/context.py:113: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead. DeprecationWarning) +---+ \| id\| +---+ \| 0\| \| 1\| \| 2\| \| 3\| \| 4\| \| 5\| \| 6\| \| 7\| \| 8\| \| 9\| +---+ ``` In case of the constructor of `SQLContext`, after this PR: ```python from pyspark.sql import SQLContext sc = SparkContext.getOrCreate() SQLContext(sc) ``` ``` /.../spark/python/pyspark/sql/context.py:77: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead. DeprecationWarning) ``` ### Why are the changes needed? To promote to use SparkSession, and keep the API party consistent with Scala side. ### Does this PR introduce any user-facing change? Yes, it will show deprecation warning to users. ### How was this patch tested? Manually tested as described above. Unittests were also added. Closes #27614 from HyukjinKwon/SPARK-30861. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-19 11:17:47 +09:00
yi.wu	68d7edf949	[SPARK-30812][SQL][CORE] Revise boolean config name to comply with new config naming policy ### What changes were proposed in this pull request? Revise below config names to comply with [new config naming policy](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-naming-policy-of-Spark-configs-td28875.html): SQL: * spark.sql.execution.subquery.reuse.enabled / [SPARK-27083](https://issues.apache.org/jira/browse/SPARK-27083) * spark.sql.legacy.allowNegativeScaleOfDecimal.enabled / [SPARK-30252](https://issues.apache.org/jira/browse/SPARK-30252) * spark.sql.adaptive.optimizeSkewedJoin.enabled / [SPARK-29544](https://issues.apache.org/jira/browse/SPARK-29544) * spark.sql.legacy.property.nonReserved / [SPARK-30183](https://issues.apache.org/jira/browse/SPARK-30183) * spark.sql.streaming.forceDeleteTempCheckpointLocation.enabled / [SPARK-26389](https://issues.apache.org/jira/browse/SPARK-26389) * spark.sql.analyzer.failAmbiguousSelfJoin.enabled / [SPARK-28344](https://issues.apache.org/jira/browse/SPARK-28344) * spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled / [SPARK-30074](https://issues.apache.org/jira/browse/SPARK-30074) * spark.sql.execution.pandas.arrowSafeTypeConversion / [SPARK-25811](https://issues.apache.org/jira/browse/SPARK-25811) * spark.sql.legacy.looseUpcast / [SPARK-24586](https://issues.apache.org/jira/browse/SPARK-24586) * spark.sql.legacy.arrayExistsFollowsThreeValuedLogic / [SPARK-28052](https://issues.apache.org/jira/browse/SPARK-28052) * spark.sql.sources.ignoreDataLocality.enabled / [SPARK-29189](https://issues.apache.org/jira/browse/SPARK-29189) * spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled / [SPARK-9853](https://issues.apache.org/jira/browse/SPARK-9853) CORE: * spark.eventLog.erasureCoding.enabled / [SPARK-25855](https://issues.apache.org/jira/browse/SPARK-25855) * spark.shuffle.readHostLocalDisk.enabled / [SPARK-30235](https://issues.apache.org/jira/browse/SPARK-30235) * spark.scheduler.listenerbus.logSlowEvent.enabled / [SPARK-29001](https://issues.apache.org/jira/browse/SPARK-29001) * spark.resources.coordinate.enable / [SPARK-27371](https://issues.apache.org/jira/browse/SPARK-27371) * spark.eventLog.logStageExecutorMetrics.enabled / [SPARK-23429](https://issues.apache.org/jira/browse/SPARK-23429) ### Why are the changes needed? To comply with the config naming policy. ### Does this PR introduce any user-facing change? No. Configurations listed above are all newly added in Spark 3.0. ### How was this patch tested? Pass Jenkins. Closes #27563 from Ngone51/revise_boolean_conf_name. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-18 20:39:50 +08:00
David Toneian	504b5135d0	[SPARK-30859][PYSPARK][DOCS][MINOR] Fixed docstring syntax issues preventing proper compilation of documentation This commit is published into the public domain. ### What changes were proposed in this pull request? Some syntax issues in docstrings have been fixed. ### Why are the changes needed? In some places, the documentation did not render as intended, e.g. parameter documentations were not formatted as such. ### Does this PR introduce any user-facing change? Slight improvements in documentation. ### How was this patch tested? Manual testing. No new Sphinx warnings arise due to this change. Closes #27613 from DavidToneian/SPARK-30859. Authored-by: David Toneian <david@toneian.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-18 16:46:45 +09:00
Liang Zhang	d8c0599e54	[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset ### What changes were proposed in this pull request? This PR added two DeveloperApis to the Dataset[T] class. Both methods are just exposing lower-level methods to the Dataset[T] class. ### Why are the changes needed? They are useful for checking whether two dataframes are the same when implementing dataframe caching in python, and also get a unique ID. It's easier to use if we wrap the lower-level APIs. ### Does this PR introduce any user-facing change? ``` scala> val df1 = Seq((1,2),(4,5)).toDF("col1", "col2") df1: org.apache.spark.sql.DataFrame = [col1: int, col2: int] scala> val df2 = Seq((1,2),(4,5)).toDF("col1", "col2") df2: org.apache.spark.sql.DataFrame = [col1: int, col2: int] scala> val df3 = Seq((0,2),(4,5)).toDF("col1", "col2") df3: org.apache.spark.sql.DataFrame = [col1: int, col2: int] scala> val df4 = Seq((0,2),(4,5)).toDF("col0", "col2") df4: org.apache.spark.sql.DataFrame = [col0: int, col2: int] scala> df1.semanticHash res0: Int = 594427822 scala> df2.semanticHash res1: Int = 594427822 scala> df1.sameSemantics(df2) res2: Boolean = true scala> df1.sameSemantics(df3) res3: Boolean = false scala> df3.semanticHash res4: Int = -1592702048 scala> df4.semanticHash res5: Int = -1592702048 scala> df4.sameSemantics(df3) res6: Boolean = true ``` ### How was this patch tested? Unit test in scala and doctest in python. Note: comments are copied from the corresponding lower-level APIs. Note: There are some issues to be fixed that would improve the hash collision rate: https://github.com/apache/spark/pull/27565#discussion_r379881028 Closes #27565 from liangz1/df-same-result. Authored-by: Liang Zhang <liang.zhang@databricks.com> Signed-off-by: WeichenXu <weichen.xu@databricks.com>	2020-02-18 09:22:26 +08:00
Yuanjian Li	ab186e3659	[SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior ### What changes were proposed in this pull request? This is a follow-up for #23124, add a new config `spark.sql.legacy.allowDuplicatedMapKeys` to control the behavior of removing duplicated map keys in build-in functions. With the default value `false`, Spark will throw a RuntimeException while duplicated keys are found. ### Why are the changes needed? Prevent silent behavior changes. ### Does this PR introduce any user-facing change? Yes, new config added and the default behavior for duplicated map keys changed to RuntimeException thrown. ### How was this patch tested? Modify existing UT. Closes #27478 from xuanyuanking/SPARK-25892-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-17 22:06:58 +08:00
David Toneian	25db8c71a2	[PYSPARK][DOCS][MINOR] Changed `:func:` to `:attr:` Sphinx roles, fixed links in documentation of `Data{Frame,Stream}{Reader,Writer}` This commit is published into the public domain. ### What changes were proposed in this pull request? This PR fixes the documentation of `DataFrameReader`, `DataFrameWriter`, `DataStreamReader`, and `DataStreamWriter`, where attributes of other classes were misrepresented as functions. Additionally, creation of hyperlinks across modules was fixed in these instances. ### Why are the changes needed? The old state produced documentation that suggested invalid usage of PySpark objects (accessing attributes as though they were callable.) ### Does this PR introduce any user-facing change? No, except for improved documentation. ### How was this patch tested? No test added; documentation build runs through. Closes #27553 from DavidToneian/docfix-DataFrameReader-DataFrameWriter-DataStreamReader-DataStreamWriter. Authored-by: David Toneian <david@toneian.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-14 11:00:35 +09:00
HyukjinKwon	aa6a60530e	[SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints ### What changes were proposed in this pull request? This PR targets to document the Pandas UDF redesign with type hints introduced at SPARK-28264. Mostly self-describing; however, there are few things to note for reviewers. 1. This PR replace the existing documentation of pandas UDFs to the newer redesign to promote the Python type hints. I added some words that Spark 3.0 still keeps the compatibility though. 2. This PR proposes to name non-pandas UDFs as "Pandas Function API" 3. SCALAR_ITER become two separate sections to reduce confusion: - `Iterator[pd.Series]` -> `Iterator[pd.Series]` - `Iterator[Tuple[pd.Series, ...]]` -> `Iterator[pd.Series]` 4. I removed some examples that look overkill to me. 5. I also removed some information in the doc, that seems duplicating or too much. ### Why are the changes needed? To document new redesign in pandas UDF. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests should cover. Closes #27466 from HyukjinKwon/SPARK-30722. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-12 10:49:46 +09:00
Bryan Cutler	07a9885f27	[SPARK-30777][PYTHON][TESTS] Fix test failures for Pandas >= 1.0.0 ### What changes were proposed in this pull request? Fix PySpark test failures for using Pandas >= 1.0.0. ### Why are the changes needed? Pandas 1.0.0 has recently been released and has API changes that result in PySpark test failures, this PR fixes the broken tests. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually tested with Pandas 1.0.1 and PyArrow 0.16.0 Closes #27529 from BryanCutler/pandas-fix-tests-1.0-SPARK-30777. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-11 10:03:01 +09:00
sharif ahmad	dd2f4431f5	[MINOR][DOCS] Fix typos at python/pyspark/sql/types.py ### What changes were proposed in this pull request? This PR fixes some typos in `python/pyspark/sql/types.py` file. ### Why are the changes needed? To deliver correct wording in documentation and codes. ### Does this PR introduce any user-facing change? Yes, it fixes some typos in user-facing API documentation. ### How was this patch tested? Locally tested the linter. Closes #27475 from sharifahmad2061/master. Lead-authored-by: sharif ahmad <sharifahmad2061@gmail.com> Co-authored-by: Sharif ahmad <sharifahmad2061@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-07 18:42:16 +09:00
HyukjinKwon	692e3ddb4e	[SPARK-27870][PYTHON][FOLLOW-UP] Rename spark.sql.pandas.udf.buffer.size to spark.sql.execution.pandas.udf.buffer.size ### What changes were proposed in this pull request? This PR renames `spark.sql.pandas.udf.buffer.size` to `spark.sql.execution.pandas.udf.buffer.size` to be more consistent with other pandas configuration prefixes, given: - `spark.sql.execution.pandas.arrowSafeTypeConversion` - `spark.sql.execution.pandas.respectSessionTimeZone` - `spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName` - other configurations like `spark.sql.execution.arrow.*`. ### Why are the changes needed? To make configuration names consistent. ### Does this PR introduce any user-facing change? No because this configuration was not released yet. ### How was this patch tested? Existing tests should cover. Closes #27450 from HyukjinKwon/SPARK-27870-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-05 11:38:33 +09:00
Bryan Cutler	43d9c7e7e5	[SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion ### What changes were proposed in this pull request? Prevent unnecessary copies of data during conversion from Arrow to Pandas. ### Why are the changes needed? During conversion of pyarrow data to Pandas, columns are checked for timestamp types and then modified to correct for local timezone. If the data contains no timestamp types, then unnecessary copies of the data can be made. This is most prevalent when checking columns of a pandas DataFrame where each series is assigned back to the DataFrame, regardless if it had timestamps. See https://www.mail-archive.com/devarrow.apache.org/msg17008.html and ARROW-7596 for discussion. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests Closes #27358 from BryanCutler/pyspark-pandas-timestamp-copy-fix-SPARK-30640. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2020-01-26 15:21:06 -08:00
Xiao Li	d69ed9afdf	Revert "[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp" This reverts commit `1d20d13149`. Closes #27351 from gatorsmile/revertSPARK25496. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-25 21:34:12 -08:00
Deepyaman Datta	53fd83a8c5	[MINOR][DOCS] Fix src/dest type documentation for `to_timestamp` ### What changes were proposed in this pull request? Minor documentation fix ### Why are the changes needed? ### Does this PR introduce any user-facing change? ### How was this patch tested? Manually; consider adding tests? Closes #27295 from deepyaman/patch-2. Authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-24 13:10:09 +09:00
zero323	2330a5682d	[SPARK-30607][SQL][PYSPARK][SPARKR] Add overlay wrappers for SparkR and PySpark ### What changes were proposed in this pull request? This PR adds: - `pyspark.sql.functions.overlay` function to PySpark - `overlay` function to SparkR ### Why are the changes needed? Feature parity. At the moment R and Python users can access this function only using SQL or `expr` / `selectExpr`. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New unit tests. Closes #27325 from zero323/SPARK-30607. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-23 16:16:47 +09:00
HyukjinKwon	ab0890bdb1	[SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types ### What changes were proposed in this pull request? This PR proposes to redesign pandas UDFs as described in [the proposal](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing). ```python from pyspark.sql.functions import pandas_udf import pandas as pd pandas_udf("long") def plug_one(s: pd.Series) -> pd.Series: return s + 1 spark.range(10).select(plug_one("id")).show() ``` ``` +------------+ \|plug_one(id)\| +------------+ \| 1\| \| 2\| \| 3\| \| 4\| \| 5\| \| 6\| \| 7\| \| 8\| \| 9\| \| 10\| +------------+ ``` Note that, this PR address one of the future improvements described [here](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit#heading=h.h3ncjpk6ujqu), "A couple of less-intuitive pandas UDF types" (by zero323) together. In short, - Adds new way with type hints as an alternative and experimental way. ```python pandas_udf(schema='...') def func(c1: Series, c2: Series) -> DataFrame: pass ``` - Replace and/or add an alias for three types below from UDF, and make them as separate standalone APIs. So, `pandas_udf` is now consistent with regular `udf`s and other expressions. `df.mapInPandas(udf)` -replace-> `df.mapInPandas(f, schema)` `df.groupby.apply(udf)` -alias-> `df.groupby.applyInPandas(f, schema)` `df.groupby.cogroup.apply(udf)` -replace-> `df.groupby.cogroup.applyInPandas(f, schema)` `df.groupby.apply` was added from 2.3 while the other were added in the master only. - No deprecation for the existing ways for now. ```python pandas_udf(schema='...', functionType=PandasUDFType.SCALAR) def func(c1, c2): pass ``` If users are happy with this, I plan to deprecate the existing way and declare using type hints is not experimental anymore. One design goal in this PR was that, avoid touching the internal (since we didn't deprecate the old ways for now), but supports type hints with a minimised changes only at the interface. - Once we deprecate or remove the old ways, I think it requires another refactoring for the internal in the future. At the very least, we should rename internal pandas evaluation types. - If users find this experimental type hints isn't quite helpful, we should simply revert the changes at the interface level. ### Why are the changes needed? In order to address old design issues. Please see [the proposal](https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing). ### Does this PR introduce any user-facing change? For behaviour changes, No. It adds new ways to use pandas UDFs by using type hints. See below. SCALAR: ```python pandas_udf(schema='...') def func(c1: Series, c2: DataFrame) -> Series: pass # DataFrame represents a struct column ``` SCALAR_ITER: ```python pandas_udf(schema='...') def func(iter: Iterator[Tuple[Series, DataFrame, ...]]) -> Iterator[Series]: pass # Same as SCALAR but wrapped by Iterator ``` GROUPED_AGG: ```python pandas_udf(schema='...') def func(c1: Series, c2: DataFrame) -> int: pass # DataFrame represents a struct column ``` GROUPED_MAP: This was added in Spark 2.3 as of SPARK-20396. As described above, it keeps the existing behaviour. Additionally, we now have a new alias `groupby.applyInPandas` for `groupby.apply`. See the example below: ```python def func(pdf): return pdf df.groupby("...").applyInPandas(func, schema=df.schema) ``` MAP_ITER: this is not a pandas UDF anymore This was added in Spark 3.0 as of SPARK-28198; and this PR replaces the usages. See the example below: ```python def func(iter): for df in iter: yield df df.mapInPandas(func, df.schema) ``` COGROUPED_MAP*: this is not a pandas UDF anymore This was added in Spark 3.0 as of SPARK-27463; and this PR replaces the usages. See the example below: ```python def asof_join(left, right): return pd.merge_asof(left, right, on="...", by="...") df1.groupby("...").cogroup(df2.groupby("...")).applyInPandas(asof_join, schema="...") ``` ### How was this patch tested? Unittests added and tested against Python 2.7, 3.6 and 3.7. Closes #27165 from HyukjinKwon/revisit-pandas. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-22 15:32:58 +09:00
yi.wu	ff39c9271c	[SPARK-30252][SQL] Disallow negative scale of Decimal ### What changes were proposed in this pull request? This PR propose to disallow negative `scale` of `Decimal` in Spark. And this PR brings two behavior changes: 1) for literals like `1.23E4BD` or `1.23E4`(with `spark.sql.legacy.exponentLiteralAsDecimal.enabled`=true, see [SPARK-29956](https://issues.apache.org/jira/browse/SPARK-29956)), we set its `(precision, scale)` to (5, 0) rather than (3, -2); 2) add negative `scale` check inside the decimal method if it exposes to set `scale` explicitly. If check fails, `AnalysisException` throws. And user could still use `spark.sql.legacy.allowNegativeScaleOfDecimal.enabled` to restore the previous behavior. ### Why are the changes needed? According to SQL standard, > 4.4.2 Characteristics of numbers An exact numeric type has a precision P and a scale S. P is a positive integer that determines the number of significant digits in a particular radix R, where R is either 2 or 10. S is a non-negative integer. scale of Decimal should always be non-negative. And other mainstream databases, like Presto, PostgreSQL, also don't allow negative scale. Presto: ``` presto:default> create table t (i decimal(2, -1)); Query 20191213_081238_00017_i448h failed: line 1:30: mismatched input '-'. Expecting: <integer>, <type> create table t (i decimal(2, -1)) ``` PostgrelSQL: ``` postgres=# create table t(i decimal(2, -1)); ERROR: NUMERIC scale -1 must be between 0 and precision 2 LINE 1: create table t(i decimal(2, -1)); ^ ``` And, actually, Spark itself already doesn't allow to create table with negative decimal types using SQL: ``` scala> spark.sql("create table t(i decimal(2, -1))"); org.apache.spark.sql.catalyst.parser.ParseException: no viable alternative at input 'create table t(i decimal(2, -'(line 1, pos 28) == SQL == create table t(i decimal(2, -1)) ----------------------------^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 35 elided ``` However, it is still possible to create such table or `DatFrame` using Spark SQL programming API: ``` scala> val tb = CatalogTable( TableIdentifier("test", None), CatalogTableType.MANAGED, CatalogStorageFormat.empty, StructType(StructField("i", DecimalType(2, -1) ) :: Nil)) ``` ``` scala> spark.sql("SELECT 1.23E4BD") res2: org.apache.spark.sql.DataFrame = [1.23E+4: decimal(3,-2)] ``` while, these two different behavior could make user confused. On the other side, even if user creates such table or `DataFrame` with negative scale decimal type, it can't write data out if using format, like `parquet` or `orc`. Because these formats have their own check for negative scale and fail on it. ``` scala> spark.sql("SELECT 1.23E4BD").write.saveAsTable("parquet") 19/12/13 17:37:04 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalArgumentException: Invalid DECIMAL scale: -2 at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:53) at org.apache.parquet.schema.Types$BasePrimitiveBuilder.decimalMetadata(Types.java:495) at org.apache.parquet.schema.Types$BasePrimitiveBuilder.build(Types.java:403) at org.apache.parquet.schema.Types$BasePrimitiveBuilder.build(Types.java:309) at org.apache.parquet.schema.Types$Builder.named(Types.java:290) at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:428) at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:334) at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.$anonfun$convert$2(ParquetSchemaConverter.scala:326) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at org.apache.spark.sql.types.StructType.map(StructType.scala:99) at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convert(ParquetSchemaConverter.scala:326) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:97) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:150) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:109) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:264) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:441) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:444) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` So, I think it would be better to disallow negative scale totally and make behaviors above be consistent. ### Does this PR introduce any user-facing change? Yes, if `spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=false`, user couldn't create Decimal value with negative scale anymore. ### How was this patch tested? Added new tests in `ExpressionParserSuite` and `DecimalSuite`; Updated `SQLQueryTestSuite`. Closes #26881 from Ngone51/nonnegative-scale. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-21 21:09:48 +08:00
HyukjinKwon	a6bdea3ad4	[SPARK-30539][PYTHON][SQL] Add DataFrame.tail in PySpark ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/26809 added `Dataset.tail` API. It should be good to have it in PySpark API as well. ### Why are the changes needed? To support consistent APIs. ### Does this PR introduce any user-facing change? No. It adds a new API. ### How was this patch tested? Manually tested and doctest was added. Closes #27251 from HyukjinKwon/SPARK-30539. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-18 00:18:12 -08:00
HyukjinKwon	1881caa95e	[SPARK-29188][PYTHON][FOLLOW-UP] Explicitly disable Arrow execution for all test of toPandas empty types ### What changes were proposed in this pull request? Another followup of `4398dfa709` I missed two more tests added: ``` ====================================================================== ERROR [0.133s]: test_to_pandas_from_mixed_dataframe (pyspark.sql.tests.test_dataframe.DataFrameTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/jenkins/python/pyspark/sql/tests/test_dataframe.py", line 617, in test_to_pandas_from_mixed_dataframe self.assertTrue(np.all(pdf_with_only_nulls.dtypes == pdf_with_some_nulls.dtypes)) AssertionError: False is not true ====================================================================== ERROR [0.061s]: test_to_pandas_from_null_dataframe (pyspark.sql.tests.test_dataframe.DataFrameTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/jenkins/python/pyspark/sql/tests/test_dataframe.py", line 588, in test_to_pandas_from_null_dataframe self.assertEqual(types[0], np.float64) AssertionError: dtype('O') != <class 'numpy.float64'> ---------------------------------------------------------------------- ``` ### Why are the changes needed? To make the test independent of default values of configuration. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested and Jenkins should test. Closes #27250 from HyukjinKwon/SPARK-29188-followup2. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-17 15:00:18 +09:00

1 2 3 4 5 ...

966 commits