27915 commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Yuexin Zhang | e70df2cea4 |
[SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue
### What changes were proposed in this pull request? Improve the check logic for whether all node managers are really blacklisted. ### Why are the changes needed? I observed that when the AM is out of sync with the ResourceManager, or the RM has trouble reporting back the current number of available NMs, something like the following happens: ... 20/05/13 09:01:21 INFO RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "client.zyx.com/x.x.x.124"; destination host is: "rm.zyx.com":8030; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException, while invoking ApplicationMasterProtocolPBClientImpl.allocate over rm543. Trying to failover immediately. ... 20/05/13 09:01:28 WARN AMRMClientImpl: ApplicationMaster is out of sync with ResourceManager, hence resyncing. ... then the Spark job suddenly runs into the AllNodeBlacklisted state: ... 20/05/13 09:01:31 INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Due to executor failures all available nodes are blacklisted) ... but there are actually no blacklisted nodes in currentBlacklistedYarnNodes, and there is no blacklisting message from: https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L119 We should only return isAllNodeBlacklisted = true when numClusterNodes > 0 AND `currentBlacklistedYarnNodes.size >= numClusterNodes`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? A minor change; no changes to tests. Closes #28606 from cnZach/false_AllNodeBlacklisted_when_RM_is_having_issue. Authored-by: Yuexin Zhang <zach.yx.zhang@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> |
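A minimal sketch of the corrected check, using the names that appear in the description above (`numClusterNodes`, `currentBlacklistedYarnNodes`); this is illustrative only, not the actual `YarnAllocatorBlacklistTracker` code:

```scala
// Hedged sketch: only report that all nodes are blacklisted when the cluster size is
// known to be positive AND the blacklist really covers it, so an out-of-sync RM
// (numClusterNodes == 0) no longer fails the application.
def isAllNodeBlacklisted(numClusterNodes: Int, currentBlacklistedYarnNodes: Set[String]): Boolean =
  numClusterNodes > 0 && currentBlacklistedYarnNodes.size >= numClusterNodes
```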
||
yi.wu | bc24c99af4 |
[SPARK-31837][CORE] Shift to the new highest locality level if there is when recomputeLocality
### What changes were proposed in this pull request? This PR proposes to shift to the new most-local locality level if any new, more local locality levels are added during `TaskSetManager.recomputeLocality`. ### Why are the changes needed? There's a race condition between `resourceOffers` and `submitTasks`. If `submitTasks` happens before `resourceOffers`, especially when no executors have been added to `TaskSchedulerImpl` at all, the `TaskSetManager`'s `myLocalityLevels` will only have the ANY locality level, see: |
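A rough illustration of the race, assuming `computeValidLocalityLevels` collapses to `ANY` before any executor registers; the point of the fix is that `recomputeLocality` should restart from the new most-local level once better levels appear:

```scala
import org.apache.spark.scheduler.TaskLocality._

// Before any executor is registered, the valid levels collapse to ANY.
val levelsAtSubmit = Seq(ANY)
// After executors register and recomputeLocality runs, more local levels exist again.
val levelsAfterOffer = Seq(PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY)
// The fix: restart scheduling from the new highest (most local) level, i.e. index 0.
val newCurrentLevel = levelsAfterOffer.head
```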
||
Takeshi Yamamuro | b806fc4582 |
[SPARK-31854][SQL] Invoke in MapElementsExec should not propagate null
### What changes were proposed in this pull request?
This PR intends to fix the following bug in `Dataset.map` when whole-stage codegen is enabled:
```
scala> val ds = Seq(1.asInstanceOf[Integer], null.asInstanceOf[Integer]).toDS()
scala> sql("SET spark.sql.codegen.wholeStage=true")
scala> ds.map(v=>(v,v)).explain
== Physical Plan ==
*(1) SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1.intValue AS _1#69, assertnotnull(input[0, scala.Tuple2, true])._2.intValue AS _2#70]
+- *(1) MapElements <function1>, obj#68: scala.Tuple2
+- *(1) DeserializeToObject staticinvoke(class java.lang.Integer, ObjectType(class java.lang.Integer), valueOf, value#1, true, false), obj#67: java.lang.Integer
+- LocalTableScan [value#1]
// `AssertNotNull` in `SerializeFromObject` will fail;
scala> ds.map(v => (v, v)).show()
java.lang.NullPointerException: Null value appeared in non-nullable field:
top level Product input object
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
// When whole-stage codegen is disabled, the query works well;
scala> sql("SET spark.sql.codegen.wholeStage=false")
scala> ds.map(v=>(v,v)).show()
+----+----+
| _1| _2|
+----+----+
| 1| 1|
|null|null|
+----+----+
```
The root cause is that the `Invoke` used in `MapElementsExec` propagates input nulls, and then [AssertNotNull](
|
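To make the distinction concrete, here is an illustrative-only sketch (plain Scala, not the Catalyst `Invoke` expression) of null-propagating vs. non-null-propagating invocation; the fix amounts to using the latter so the user's lambda in `Dataset.map` actually receives the null input:

```scala
// Null-propagating call: the function is skipped entirely when the input is null,
// so downstream code sees a null Tuple2 and AssertNotNull fails, as in the error above.
def invokePropagatingNull[A >: Null, B >: Null](v: A)(f: A => B): B =
  if (v == null) null else f(v)

// Non-null-propagating call: the lambda sees the null and can still build (null, null).
def invokeAlways[A, B](v: A)(f: A => B): B = f(v)
```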
||
HyukjinKwon | e69466056f |
[SPARK-31849][PYTHON][SQL] Make PySpark SQL exceptions more Pythonic
### What changes were proposed in this pull request? This PR proposes to make PySpark exception more Pythonic by hiding JVM stacktrace by default. It can be enabled by turning on `spark.sql.pyspark.jvmStacktrace.enabled` configuration. ``` Traceback (most recent call last): ... pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace. Traceback (most recent call last): ... ``` If this `spark.sql.pyspark.jvmStacktrace.enabled` is enabled, it appends: ``` JVM stacktrace: org.apache.spark.Exception: ... ... ``` For example, the codes below: ```python from pyspark.sql.functions import udf udf def divide_by_zero(v): raise v / 0 spark.range(1).select(divide_by_zero("id")).show() ``` will show an error messages that looks like Python exception thrown from the local. <details> <summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is off (default)</summary> ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace. Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda *a: f(*a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(*args, **kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero ``` </details> <details> <summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is on</summary> ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 137, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace. 
Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda *a: f(*a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(*args, **kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero JVM stacktrace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 192.168.35.193, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda *a: f(*a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(*args, **kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2117) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2066) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2065) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2065) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1021) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1021) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1021) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2297) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2246) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2235) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:823) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2108) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2129) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2148) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3653) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) at org.apache.spark.sql.Dataset.head(Dataset.scala:2695) at org.apache.spark.sql.Dataset.take(Dataset.scala:2902) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300) at org.apache.spark.sql.Dataset.showString(Dataset.scala:337) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda *a: f(*a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(*args, **kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ... 1 more ``` </details> <details> <summary>Python exception message without this change</summary> ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 98, in deco return f(*a, **kw) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o160.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 5.0 failed 4 times, most recent failure: Lost task 10.3 in stage 5.0 (TID 37, 192.168.35.193, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda *a: f(*a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(*args, **kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489) at 
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2117) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2066) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2065) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2065) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1021) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1021) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1021) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2297) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2246) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2235) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:823) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2108) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2129) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2148) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3653) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) at org.apache.spark.sql.Dataset.head(Dataset.scala:2695) at org.apache.spark.sql.Dataset.take(Dataset.scala:2902) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300) at org.apache.spark.sql.Dataset.showString(Dataset.scala:337) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda *a: f(*a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(*args, **kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ... 1 more ``` </details> <br/> Another example with Python 3.7: ```python sql("a") ``` <details> <summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is off (default)</summary> ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 646, in sql return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.ParseException: mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == a ^^^ ``` </details> <details> <summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is on</summary> ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 646, in sql return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.ParseException: mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == a ^^^ JVM stacktrace: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 
'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == a ^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:133) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:49) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81) at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:604) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:604) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) ``` </details> <details> <summary>Python exception message without this change</summary> ``` Traceback (most recent call last): File "/.../spark/python/pyspark/sql/utils.py", line 98, in deco return f(*a, **kw) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o26.sql. 
: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == a ^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:133) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:49) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81) at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:604) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:604) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 646, in sql return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 102, in deco raise converted pyspark.sql.utils.ParseException: mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == a ^^^ ``` </details> ### Why are the changes needed? Currently, PySpark exceptions are very unfriendly to Python users with causing a bunch of JVM stacktrace. See "Python exception message without this change" above. ### Does this PR introduce _any_ user-facing change? Yes, it will change the exception message. See the examples above. ### How was this patch tested? Manually tested by ```bash ./bin/pyspark --conf spark.sql.pyspark.jvmStacktrace.enabled=true ``` and running the examples above. Closes #28661 from HyukjinKwon/python-debug. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> |
||
HyukjinKwon | 29c51d682b |
[SPARK-31788][CORE][DSTREAM][PYTHON] Recover the support of union for different types of RDD and DStreams
### What changes were proposed in this pull request? This PR manually specifies the class for the input array being used in `(SparkContext|StreamingContext).union`. It fixes a regression introduced from SPARK-25737. ```python rdd1 = sc.parallelize([1,2,3,4,5]) rdd2 = sc.parallelize([6,7,8,9,10]) pairRDD1 = rdd1.zip(rdd2) sc.union([pairRDD1, pairRDD1]).collect() ``` in the current master and `branch-3.0`: ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/context.py", line 870, in union jrdds[i] = rdds[i]._jrdd File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__ File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value py4j.protocol.Py4JError: An error occurred while calling None.None. Trace: py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) ``` which works in Spark 2.4.5: ``` [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10)] ``` It assumed the class of the input array is the same `JavaRDD` or `JavaDStream`; however, that can be different such as `JavaPairRDD`. This fix is based on redsanket's initial approach, and will be co-authored. ### Why are the changes needed? To fix a regression from Spark 2.4.5. ### Does this PR introduce _any_ user-facing change? No, it's only in unreleased branches. This is to fix a regression. ### How was this patch tested? Manually tested, and a unittest was added. Closes #28648 from HyukjinKwon/SPARK-31788. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> |
||
Huaxin Gao | 45cf5e9950 |
[SPARK-31840][ML] Add instance weight support in LogisticRegressionSummary
### What changes were proposed in this pull request? Add instance weight support in LogisticRegressionSummary ### Why are the changes needed? LogisticRegression, MulticlassClassificationEvaluator and BinaryClassificationEvaluator support instance weight. We should support instance weight in LogisticRegressionSummary too. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add new tests Closes #28657 from huaxingao/weighted_summary. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com> |
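A hedged usage sketch of how a weighted summary would be obtained; the column name `weight` and the training DataFrame `train` are assumed for illustration:

```scala
import org.apache.spark.ml.classification.LogisticRegression

// With a weight column set on the estimator, the training summary's metrics
// (accuracy, precision/recall, etc.) should now reflect instance weights.
val lr = new LogisticRegression().setWeightCol("weight")
val model = lr.fit(train)      // `train` is an assumed DataFrame with features, label, weight
val summary = model.summary    // LogisticRegressionTrainingSummary with weighted metrics
```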
||
Max Gekk | 47dc332258 |
[SPARK-31874][SQL] Use FastDateFormat as the legacy fractional formatter
### What changes were proposed in this pull request? 1. Replace `SimpleDateFormat` with `FastDateFormat` as the legacy formatter of `FractionTimestampFormatter`. 2. Optimise `LegacyFastTimestampFormatter` for `java.sql.Timestamp` w/o fractional part. ### Why are the changes needed? 1. By default `HiveResult`.`hiveResultString` retrieves timestamp values as instances of `java.sql.Timestamp`, and uses the legacy parser `SimpleDateFormat` to convert the timestamps to strings. After the fix https://github.com/apache/spark/pull/28024, the fractional formatter and its companion, the legacy formatter `SimpleDateFormat`, are created for every value. By switching from `LegacySimpleTimestampFormatter` to `LegacyFastTimestampFormatter`, we can utilize the internal cache of `FastDateFormat` and avoid parsing the default pattern `yyyy-MM-dd HH:mm:ss`. 2. The second change, in the method `def format(ts: Timestamp): String` of `LegacyFastTimestampFormatter`, is needed to optimize the formatter for patterns without the fractional part and to avoid conversions to microseconds. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing tests in `TimestampFormatter`. Closes #28678 from MaxGekk/fastdateformat-as-legacy-frac-formatter. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
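A minimal illustration, outside of Spark's internals, of why `FastDateFormat` is cheaper here: `getInstance` returns cached, thread-safe formatter instances, so a legacy formatter created per value avoids repeated `SimpleDateFormat` construction:

```scala
import org.apache.commons.lang3.time.FastDateFormat

// getInstance() caches formatters by (pattern, time zone, locale), and the returned
// instance is thread-safe, unlike SimpleDateFormat.
val fmt = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")
val s = fmt.format(java.sql.Timestamp.valueOf("2020-05-30 12:34:56"))
```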
||
Kent Yao | 547c5bf552 |
[SPARK-31867][SQL] Disable year type datetime patterns which are longer than 10
### What changes were proposed in this pull request? As mentioned in https://github.com/apache/spark/pull/28673 and suggested by cloud-fan at https://github.com/apache/spark/pull/28673#discussion_r432817075 In this PR, we disable datetime patterns in the form of `y..y` and `Y..Y` whose lengths are greater than 10, to avoid a sort of JDK bug described below. The new datetime formatter introduces a silent data change like: ```sql spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd'); NULL spark-sql> set spark.sql.legacy.timeParserPolicy=legacy; spark.sql.legacy.timeParserPolicy legacy spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd'); 00000001970-01-01 spark-sql> ``` For patterns that support `SignStyle.EXCEEDS_PAD`, e.g. `y..y`(len >=4), when using the `NumberPrinterParser` to format it ```java switch (signStyle) { case EXCEEDS_PAD: if (minWidth < 19 && value >= EXCEED_POINTS[minWidth]) { buf.append(decimalStyle.getPositiveSign()); } break; .... ``` here `minWidth` == `len(y..y)`, and `EXCEED_POINTS` is ```java /** * Array of 10 to the power of n. */ static final long[] EXCEED_POINTS = new long[] { 0L, 10L, 100L, 1000L, 10000L, 100000L, 1000000L, 10000000L, 100000000L, 1000000000L, 10000000000L, }; ``` So when `len(y..y)` is greater than 10, an `ArrayIndexOutOfBoundsException` will be raised. At the caller side, for `from_unixtime`, the exception is suppressed and a silent data change occurs; for `date_format`, the `ArrayIndexOutOfBoundsException` propagates. ### Why are the changes needed? Fix the silent data change. ### Does this PR introduce _any_ user-facing change? Yes, a SparkUpgradeException will take the place of the `null` result when the pattern contains 10 or more consecutive 'y' or 'Y' characters. ### How was this patch tested? New tests. Closes #28684 from yaooqinn/SPARK-31867-2. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
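A hedged sketch of the kind of up-front guard described above (the method name and message are assumed, not the actual implementation): reject `y`/`Y` runs longer than 10 before formatting, instead of hitting the `EXCEED_POINTS` bound at runtime:

```scala
// Illustrative only: fail fast on year fields wider than 10 pattern letters.
def checkYearPatternLength(pattern: String): Unit = {
  if ("y{11,}|Y{11,}".r.findFirstIn(pattern).isDefined) {
    throw new IllegalArgumentException(
      s"Pattern letter count for year must be 10 or less: '$pattern'")
  }
}
```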
||
Huaxin Gao | 1b780f364b |
[SPARK-31866][SQL][DOCS] Add COALESCE/REPARTITION/REPARTITION_BY_RANGE Hints to SQL Reference
### What changes were proposed in this pull request? Add Coalesce/Repartition/Repartition_By_Range Hints to SQL Reference ### Why are the changes needed? To make SQL reference complete ### Does this PR introduce _any_ user-facing change? <img width="1100" alt="Screen Shot 2020-05-29 at 6 46 38 PM" src="https://user-images.githubusercontent.com/13592258/83316782-d6fcf300-a1dc-11ea-87f6-e357b9c739fd.png"> <img width="1099" alt="Screen Shot 2020-05-29 at 6 43 30 PM" src="https://user-images.githubusercontent.com/13592258/83316784-d8c6b680-a1dc-11ea-95ea-10a1f75dcef9.png"> Only the the above pages are changed. The following two pages are the same as before. <img width="1100" alt="Screen Shot 2020-05-28 at 10 05 27 PM" src="https://user-images.githubusercontent.com/13592258/83223474-bfb3fc00-a12f-11ea-807a-824a618afa0b.png"> <img width="1099" alt="Screen Shot 2020-05-28 at 10 05 08 PM" src="https://user-images.githubusercontent.com/13592258/83223478-c2165600-a12f-11ea-806e-a1e57dc35ef4.png"> ### How was this patch tested? Manually build and check Closes #28672 from huaxingao/coalesce_hint. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com> |
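For reference, a hedged usage sketch of the three hints being documented; the table `t`, column `c`, and `spark` session are assumed:

```scala
// Each hint controls the number of output partitions of the query result.
spark.sql("SELECT /*+ COALESCE(3) */ * FROM t")
spark.sql("SELECT /*+ REPARTITION(3) */ * FROM t")
spark.sql("SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t")
```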
||
Maryann Xue | b9737c3c22 |
[SPARK-31864][SQL] Adjust AQE skew join trigger condition
### What changes were proposed in this pull request? This PR makes a minor change in deciding whether a partition is skewed by comparing the partition size to the median size of coalesced partitions instead of the median size of raw partitions before coalescing. ### Why are the changes needed? This change is in line with the target size criteria for splitting skew join partitions and can also cope with extra empty partitions caused by over-partitioning. This PR has also improved the skew join tests in the AQE tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated UTs. Closes #28669 from maryannxue/spark-31864. Authored-by: Maryann Xue <maryann.xue@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
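A rough sketch of the skew criterion, with assumed names for the factor/threshold settings; the substantive change in this PR is that the median size is now computed over coalesced partitions:

```scala
// Hedged illustration, not the actual OptimizeSkewedJoin code.
def isSkewed(partitionSize: Long, medianSizeOfCoalesced: Long,
             skewFactor: Double, skewThresholdInBytes: Long): Boolean =
  partitionSize > medianSizeOfCoalesced * skewFactor &&
    partitionSize > skewThresholdInBytes
```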
||
Juliusz Sompolski | af35691de4 |
[SPARK-31859][SPARK-31861][SPARK-31863] Fix Thriftserver session timezone issues
### What changes were proposed in this pull request? Timestamp literals in Spark are interpreted as timestamps in the local time zone spark.sql.session.timeZone. If the JDBC client is e.g. in time zone UTC-7, sets spark.sql.session.timeZone to PST, and sends the query "SELECT timestamp '2020-05-20 12:00:00'", and the JVM time zone of the Spark cluster is e.g. UTC+2, then what currently happens is: * The timestamp literal in the query is interpreted as 12:00:00 UTC-7, i.e. 19:00:00 UTC. * When it's returned from the query, it is collected as a java.sql.Timestamp object with Dataset.collect(), and put into a Thriftserver RowSet. * Before sending it over the wire, the Timestamp is converted to String. This happens explicitly in ColumnValue for RowBasedSet, and implicitly in ColumnBuffer for ColumnBasedSet (all non-primitive types are converted with toString() there). The toString conversion uses the JVM time zone, which results in a "21:00:00" (UTC+2) string representation. * The client JDBC application parses and gets a "21:00:00" Timestamp back (in its JVM time zone; if the JDBC application cares about the correct UTC internal value, it should set spark.sql.session.timeZone to be consistent with its JVM time zone) The problem is caused by the conversion happening in the Thriftserver RowSet with the generic toString() function, instead of using HiveResult.toHiveString(), which takes care of correct, time-zone-respecting conversions. This PR fixes it by converting the Timestamp values to String earlier, in SparkExecuteStatementOperation, using that function. This fixes SPARK-31861. Thriftserver also did not work with spark.sql.datetime.java8API.enabled, because the conversions in RowSet expected a Timestamp object instead of an Instant object. Using HiveResult.toHiveString() also fixes that. For this reason, we convert Date values in SparkExecuteStatementOperation as well, so that HiveResult.toHiveString() also handles LocalDate. This fixes SPARK-31859. Thriftserver also did not correctly set the active SparkSession. Because of that, configuration obtained using SQLConf.get was not the correct session configuration. This affected getting the correct spark.sql.session.timeZone. It is fixed by extending the use of SparkExecuteStatementOperation.withSchedulerPool to also set the correct active SparkSession. When the correct session is set, we also no longer need to maintain the pool mapping in a sessionToActivePool map. The scheduler pool can then be correctly retrieved from the session config. "withSchedulerPool" is renamed to "withLocalProperties" and moved into a mixin helper trait, because it should be applied with every operation. This fixes SPARK-31863. I used the opportunity to move some repetitive code from the operations to the mixin helper trait. Closes #28671 from juliuszsompolski/SPARK-31861. Authored-by: Juliusz Sompolski <julek@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
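A small illustration of the root problem (plain JVM behaviour, not Thriftserver code): `java.sql.Timestamp.toString` renders the instant in the JVM's default time zone rather than in `spark.sql.session.timeZone`, which is why converting via a generic `toString()` shifts the wall-clock value:

```scala
import java.time.Instant
import java.util.TimeZone

// 12:00:00 in UTC-7 is 19:00:00 UTC.
val ts = java.sql.Timestamp.from(Instant.parse("2020-05-20T19:00:00Z"))
TimeZone.setDefault(TimeZone.getTimeZone("GMT+02:00"))  // pretend the cluster JVM runs in UTC+2
println(ts)  // prints "2020-05-20 21:00:00.0", not the session-time-zone value
```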
||
Kousuke Saruta |
d3eba5bc8c
|
[SPARK-31756][WEBUI] Add real headless browser support for UI test
### What changes were proposed in this pull request? This PR mainly adds two things. 1. Real headless browser support for UI test 2. A test suite using headless Chrome as one instance of those browsers. Also, for environment where Chrome and Chrome driver is not installed, `ChromeUITest` tag is added to filter out the test suite. By default, test suites with `ChromeUITest` is disabled. ### Why are the changes needed? In the current master, there are two problems for UI test. 1. Lots of tests especially JavaScript related ones are done manually. Appearance is better to be confirmed by our eyes but logic should be tested by test cases ideally. 2. Compared to the real web browsers, HtmlUnit doesn't seem to support JavaScript enough. I added a JavaScript related test before for SPARK-31534 using HtmlUnit which is simple library based headless browser for test. The test I added works somehow but some JavaScript related error is shown in unit-tests.log. ``` ======= EXCEPTION START ======== Exception class=[net.sourceforge.htmlunit.corejs.javascript.JavaScriptException] com.gargoylesoftware.htmlunit.ScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:904) at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:515) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:835) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:807) at com.gargoylesoftware.htmlunit.InteractivePage.executeJavaScriptFunctionIfPossible(InteractivePage.java:216) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptFunctionJob.runJavaScript(JavaScriptFunctionJob.java:52) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutionJob.run(JavaScriptExecutionJob.java:102) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl.runSingleJob(JavaScriptJobManagerImpl.java:426) at com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor.run(DefaultJavaScriptExecutor.java:157) at java.lang.Thread.run(Thread.java:748) Caused by: net.sourceforge.htmlunit.corejs.javascript.JavaScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1009) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:800) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:413) at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:252) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3264) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:828) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:889) ... 
10 more JavaScriptException value = Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". == CALLING JAVASCRIPT == function () { throw e; } ======= EXCEPTION END ======== ``` I tried to upgrade HtmlUnit to 2.40.0 but what is worse, the test become not working even though it works on real browsers like Chrome, Safari and Firefox without error. ``` [info] UISeleniumSuite: [info] - SPARK-31534: text for tooltip should be escaped *** FAILED *** (17 seconds, 745 milliseconds) [info] The code passed to eventually never returned normally. Attempted 2 times over 12.910785232 seconds. Last failure message: com.gargoylesoftware.htmlunit.ScriptException: ReferenceError: Assignment to undefined "regeneratorRuntime" in strict mode (http://192.168.1.209:62132/static/vis-timeline-graph2d.min.js#52(Function)#1) ``` To resolve those problems, it's better to support headless browser for UI test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I tested with following patterns. Both Chrome and Chrome driver should be installed to test. 1. sbt / with default excluded tags (ChromeUISeleniumSuite is expected to be skipped and SQLQueryTestSuite is expected to succeed) `build/sbt -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite org.apache.spark.sql.SQLQueryTestSuite" 2. sbt / overwrite default excluded tags as empty string (Both suites are expected to succeed) `build/sbt -Dtest.default.exclude.tags= -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite org.apache.spark.sql.SQLQueryTestSuite" 3. sbt / set `test.exclude.tags` to `org.apache.spark.tags.ExtendedSQLTest` (Both suites are expected to be skipped) `build/sbt -Dtest.exclude.tags=org.apache.spark.tags.ExtendedSQLTest -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite org.apache.spark.sql.SQLQueryTestSuite" 4. Maven / with default excluded tags (ChromeUISeleniumSuite is expected to be skipped and SQLQueryTestSuite is expected to succeed) `build/mvn -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite,org.apache.spark.sql.SQLQueryTestSuite test` 5. Maven / overwrite default excluded tags as empty string (Both suites are expected to succeed) `build/mvn -Dtest.default.exclude.tags= -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite,org.apache.spark.sql.SQLQueryTestSuite test` 6. Maven / set `test.exclude.tags` to `org.apache.spark.tags.ExtendedSQLTest` (Both suites are expected to be skipped) `build/mvn -Dtest.exclude.tags=org.apache.spark.tags.ExtendedSQLTest -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite,org.apache.spark.sql.SQLQueryTestSuite test` Closes #28627 from sarutak/real-headless-browser-support-take2. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> |
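For context, a hedged sketch of how a real headless Chrome driver is typically constructed with Selenium; the option names follow Selenium's public API, while the driver path and UI address are environment-specific assumptions:

```scala
import org.openqa.selenium.chrome.{ChromeDriver, ChromeOptions}

// Requires Chrome plus a matching chromedriver binary, e.g. passed via
// -Dwebdriver.chrome.driver=/path/to/chromedriver (or the spark.test.* property above).
val options = new ChromeOptions()
options.addArguments("--headless", "--disable-gpu")
val driver = new ChromeDriver(options)
try {
  driver.get("http://localhost:4040")  // an assumed UI address
} finally {
  driver.quit()
}
```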
||
Dongjoon Hyun | 625abca9db |
[SPARK-31858][BUILD] Upgrade commons-io to 2.5 in Hadoop 3.2 profile
### What changes were proposed in this pull request? This PR aims to upgrade `commons-io` from 2.4 to 2.5 for Apache Spark 3.1. ### Why are the changes needed? Since Hadoop 3.1, `commons-io` 2.5 is used. - https://issues.apache.org/jira/browse/HADOOP-15261 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins with Hadoop-3.2 profile. Maven dependency is verified via `test-dependencies.sh` automatically. SBT dependency can be verified like the following manually. ``` build/sbt -Phadoop-3.2 "core/dependencyTree" | grep commons-io:commons-io | head -n1 [info] | | +-commons-io:commons-io:2.5 ``` Closes #28665 from dongjoon-hyun/SPARK-31858. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> |
||
Jungtaek Lim (HeartSaVioR) | fe1d1e24bc |
[SPARK-31214][BUILD] Upgrade Janino to 3.1.2
### What changes were proposed in this pull request? This PR proposes to upgrade Janino to 3.1.2, which was released recently. Most of the changes were refactoring, and there are also lots of commits with "no commit message". Below are the pairs of (commit title, commit) which seem to deal with bugs or specific improvements (not pure refactoring) after 3.0.15. * Issue #119: Guarantee executing popOperand() in popUninitializedVariableOperand() via moving popOperand() out of "assert" * Issue #116: replace operand to final target type if boxing conversion and widening reference conversion happen together * Merged pull request `#114` "Grow the code for relocatables, and do fixup, and relocate". * |
||
Yuming Wang | 91148f428b |
[SPARK-28481][SQL] More expressions should extend NullIntolerant
### What changes were proposed in this pull request? 1. Make more expressions extend `NullIntolerant`. 2. Add a checker (in `ExpressionInfoSuite`) to identify whether the expression is `NullIntolerant`. ### Why are the changes needed? This helps avoid skew joins when the join column has many null values, and can improve query performance. For example: ```sql CREATE TABLE t1(c1 string, c2 string) USING parquet; CREATE TABLE t2(c1 string, c2 string) USING parquet; EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON upper(t1.c1) = upper(t2.c1); ``` Before and after this PR: ```sql == Physical Plan == *(2) Project [c1#5, c2#6] +- *(2) BroadcastHashJoin [upper(c1#5)], [upper(c1#7)], Inner, BuildLeft :- BroadcastExchange HashedRelationBroadcastMode(List(upper(input[0, string, true]))), [id=#41] : +- *(1) ColumnarToRow : +- FileScan parquet default.t1[c1#5,c2#6] +- *(2) ColumnarToRow +- FileScan parquet default.t2[c1#7] == Physical Plan == *(2) Project [c1#5, c2#6] +- *(2) BroadcastHashJoin [upper(c1#5)], [upper(c1#7)], Inner, BuildRight :- *(2) Project [c1#5, c2#6] : +- *(2) Filter isnotnull(c1#5) : +- *(2) ColumnarToRow : +- FileScan parquet default.t1[c1#5,c2#6] +- BroadcastExchange HashedRelationBroadcastMode(List(upper(input[0, string, true]))), [id=#59] +- *(1) Project [c1#7] +- *(1) Filter isnotnull(c1#7) +- *(1) ColumnarToRow +- FileScan parquet default.t2[c1#7] ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #28626 from wangyum/SPARK-28481. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
||
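The idea behind `NullIntolerant` is that an expression which returns null whenever any input is null lets the optimizer add `IsNotNull` filters on the referenced columns, so null-heavy join keys are filtered out before the join. A minimal, self-contained sketch of that inference, using a toy expression tree rather than Catalyst's actual classes (all names here are illustrative):

```scala
// Toy expression tree; these types are illustrative, not Spark's internal API.
sealed trait Expr { def children: Seq[Expr]; def nullIntolerant: Boolean }
case class Attr(name: String) extends Expr {
  val children: Seq[Expr] = Nil
  val nullIntolerant = true            // a null attribute value propagates as null
}
case class Upper(child: Expr) extends Expr {
  val children: Seq[Expr] = Seq(child)
  val nullIntolerant = true            // null in => null out
}
case class Coalesce(children: Seq[Expr]) extends Expr {
  val nullIntolerant = false           // can survive null inputs, so nothing can be inferred
}

// Collect attributes reachable only through null-intolerant operators: a predicate built
// on such an expression cannot match rows where these attributes are null, so the
// optimizer may safely add IsNotNull filters for them below the join.
def notNullAttrs(e: Expr): Set[String] = e match {
  case Attr(n)               => Set(n)
  case _ if e.nullIntolerant => e.children.flatMap(notNullAttrs).toSet
  case _                     => Set.empty
}

// Example: upper(c1) = upper(c2) implies isnotnull(c1) and isnotnull(c2).
val inferred = notNullAttrs(Upper(Attr("c1"))) ++ notNullAttrs(Upper(Attr("c2")))
// inferred == Set("c1", "c2")
```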
Max Gekk | 37a1fb8d08 |
[SPARK-31855][SQL][TESTS] Check reading date/timestamp from Avro files w/ and w/o Spark version
### What changes were proposed in this pull request? 1. Add the following parquet files to the resource folder `external/avro/src/test/resources`: - Files saved by Spark 2.4.5 ( |
||
Maryann Xue | 45864faaf2 |
[SPARK-31862][SQL] Remove exception wrapping in AQE
### What changes were proposed in this pull request? This PR removes the excessive exception wrapping in AQE so that error messages are less verbose and mostly consistent with non-AQE execution. Exceptions from stage materialization are now only wrapped with `SparkException` if there are multiple stage failures. Also, stage cancelling errors will not be included as part of the exception thrown, but rather just be logged as errors. ### Why are the changes needed? This will make the AQE error reporting more readable and debuggable. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated existing tests. Closes #28668 from maryannxue/spark-31862. Authored-by: Maryann Xue <maryann.xue@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
||
Maryann Xue | 90302309a3 |
[SPARK-31865][SQL] Fix complex AQE query stage not reused
### What changes were proposed in this pull request? This PR fixes the issue of complex query stages that contain sub stages not being reused at times due to dynamic plan changes. This PR synchronizes the "finished" flag between all reused stages so that the runtime replanning would always produce the same sub plan for their potentially reusable parent stages. ### Why are the changes needed? Without this change, complex query stages that contain sub stages will sometimes not be reused due to dynamic plan changes and the status of their child query stages not being synced. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tested TPC-DS q47 and q57. Before this PR, the reuse of the biggest stage would happen with a 50/50 chance; and after this PR, it will happen 100% of the time. Closes #28670 from maryannxue/fix-aqe-reuse. Authored-by: Maryann Xue <maryann.xue@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
||
Kent Yao | fe1da296da |
[SPARK-31833][SQL][TEST-HIVE1.2] Set HiveThriftServer2 with actual port while configured 0
### What changes were proposed in this pull request? When I was developing some stuff based on the `DeveloperApi` `org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext`, I needed to use a random thrift port to avoid races on ports. But `org.apache.hive.service.cli.thrift.ThriftCLIService#getPortNumber` does not give me the actual bound port, only 0. The server log is not right either; after starting the server, it's hard to form the right JDBC connection string. ``` INFO ThriftCLIService: Starting ThriftBinaryCLIService on port 0 with 5...500 worker threads ``` Indeed, `53742` is the right port: ```shell lsof -nP -p `cat ./pid/spark-kentyao-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1.pid` | grep LISTEN java 18990 kentyao 288u IPv6 0x22858e3e60d6a0a7 0t0 TCP 10.242.189.214:53723 (LISTEN) java 18990 kentyao 290u IPv6 0x22858e3e60d68827 0t0 TCP *:4040 (LISTEN) java 18990 kentyao 366u IPv6 0x22858e3e60d66987 0t0 TCP 10.242.189.214:53724 (LISTEN) java 18990 kentyao 438u IPv6 0x22858e3e60d65d47 0t0 TCP *:53742 (LISTEN) ``` In this PR, when the port is configured as 0, `portNum` is set to the actually used port during the start process. Also use 0 in thrift-related tests to avoid potential flakiness. ### Why are the changes needed? 1. Fix the API bug. 2. Reduce test flakiness. ### Does this PR introduce _any_ user-facing change? Yes, `org.apache.hive.service.cli.thrift.ThriftCLIService#getPortNumber` will always give you the actual port when it is configured as 0. ### How was this patch tested? Modified unit tests. Closes #28651 from yaooqinn/SPARK-31833. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
||
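The underlying mechanism is standard: binding a server socket to port 0 asks the OS for an ephemeral port, and the service then has to read the real port back from the socket and store it, instead of keeping the configured 0. A small self-contained illustration (not the thrift server code itself):

```scala
import java.net.ServerSocket

// Ask the OS for any free port by binding to 0, then read back the real port.
val socket = new ServerSocket(0)
val actualPort = socket.getLocalPort // e.g. 53742, never 0
println(s"listening on port $actualPort")

// A server wrapper should update its own state with this bound port, so that
// methods like getPortNumber (and log messages / JDBC URLs) report the real value.
socket.close()
```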
shane knapp | 9e68affd13 |
[BUILD][INFRA] bump the timeout to match the jenkins PRB
### What changes were proposed in this pull request? bump the timeout to match what's set in jenkins ### Why are the changes needed? tests be timing out! ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? via jenkins Closes #28666 from shaneknapp/increase-jenkins-timeout. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: shane knapp <incomplete@gmail.com> |
||
Bryan Cutler | 8bbb666622 |
[SPARK-25351][PYTHON][TEST][FOLLOWUP] Fix test assertions to be consistent
### What changes were proposed in this pull request? Followup to make assertions from recent test consistent with the rest of the module ### Why are the changes needed? Better to use assertions from `unittest` and be consistent ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #28659 from BryanCutler/arrow-category-test-fix-SPARK-25351. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> |
||
GuoPhilipse | dfbc5edf20 |
[SPARK-31839][TESTS] Delete duplicate code in CastSuite
### What changes were proposed in this pull request? Delete duplicate code in CastSuite. ### Why are the changes needed? Keep Spark code clean. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No tests needed. Closes #28655 from GuoPhilipse/delete-duplicate-code-castsuit. Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Co-authored-by: GuoPhilipse <guofei_ok@126.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> |
||
iRakson | 2f92ea0df4 |
[SPARK-31763][PYSPARK] Add inputFiles method in PySpark DataFrame Class
### What changes were proposed in this pull request? Adds the `inputFiles()` method to PySpark `DataFrame`. Using this, PySpark users can list all files constituting a `DataFrame`. **Before changes:** ``` >>> spark.read.load("examples/src/main/resources/people.json", format="json").inputFiles() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/***/***/spark/python/pyspark/sql/dataframe.py", line 1388, in __getattr__ "'%s' object has no attribute '%s'" % (self.__class__.__name__, name)) AttributeError: 'DataFrame' object has no attribute 'inputFiles' ``` **After changes:** ``` >>> spark.read.load("examples/src/main/resources/people.json", format="json").inputFiles() [u'file:///***/***/spark/examples/src/main/resources/people.json'] ``` ### Why are the changes needed? This method is already supported in Spark for Scala and Java. ### Does this PR introduce _any_ user-facing change? Yes, now users can list all files of a DataFrame using `inputFiles()`. ### How was this patch tested? UT added. Closes #28652 from iRakson/SPARK-31763. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> |
||
Jalpan Randeri | 339b0ecadb |
[SPARK-25351][SQL][PYTHON] Handle Pandas category type when converting from Python with Arrow
Handle Pandas category type while converting from python with Arrow enabled. The category column will be converted to whatever type the category elements are as is the case with Arrow disabled. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New unit tests were added for `createDataFrame` and scalar `pandas_udf` Closes #26585 from jalpan-randeri/feature-pyarrow-dictionary-type. Authored-by: Jalpan Randeri <randerij@amazon.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com> |
||
Xingbo Jiang | efe7fd2b6b |
[SPARK-31730][CORE][TEST] Fix flaky tests in BarrierTaskContextSuite
### What changes were proposed in this pull request? To wait until all the executors have started before submitting any job. This could avoid the flakiness caused by waiting for executors coming up. ### How was this patch tested? Existing tests. Closes #28584 from jiangxb1987/barrierTest. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com> |
||
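A hedged sketch of that kind of guard: poll the status tracker until the expected number of executors is registered before submitting the barrier job. The helper name and the use of `statusTracker` here are illustrative, not necessarily what the suite itself uses.

```scala
import org.apache.spark.SparkContext

// Block until at least `numExecutors` executors have registered, or time out.
def waitForExecutors(sc: SparkContext, numExecutors: Int, timeoutMs: Long = 60000L): Unit = {
  val deadline = System.currentTimeMillis() + timeoutMs
  // getExecutorInfos also includes the driver, hence the "+ 1".
  while (sc.statusTracker.getExecutorInfos.length < numExecutors + 1) {
    if (System.currentTimeMillis() > deadline) {
      throw new IllegalStateException(s"Executors did not come up within $timeoutMs ms")
    }
    Thread.sleep(100)
  }
}
```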
Kousuke Saruta | d19b173b47 |
[SPARK-31764][CORE] JsonProtocol doesn't write RDDInfo#isBarrier
### What changes were proposed in this pull request? This PR changes JsonProtocol to write RDDInfo#isBarrier. ### Why are the changes needed? JsonProtocol reads RDDInfo#isBarrier but does not write it, so it's a bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a test case. Closes #28583 from sarutak/SPARK-31764. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com> |
||
Wenchen Fan | 1528fbced8 |
[SPARK-31827][SQL] fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form
### What changes were proposed in this pull request? If `LLL`/`qqq` is used in the datetime pattern string, and the current JDK in use has a bug for the stand-alone form (see https://bugs.openjdk.java.net/browse/JDK-8114833), throw an exception with a clear error message. ### Why are the changes needed? to keep backward compatibility with Spark 2.4 ### Does this PR introduce _any_ user-facing change? Yes Spark 2.4 ``` scala> sql("select date_format('1990-1-1', 'LLL')").show +---------------------------------------------+ |date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)| +---------------------------------------------+ | Jan| +---------------------------------------------+ ``` Spark 3.0 with Java 11 ``` scala> sql("select date_format('1990-1-1', 'LLL')").show +---------------------------------------------+ |date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)| +---------------------------------------------+ | Jan| +---------------------------------------------+ ``` Spark 3.0 with Java 8 ``` // before this PR +---------------------------------------------+ |date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)| +---------------------------------------------+ | 1| +---------------------------------------------+ // after this PR scala> sql("select date_format('1990-1-1', 'LLL')").show org.apache.spark.SparkUpgradeException ``` ### How was this patch tested? manual test with java 8 and 11 Closes #28646 from cloud-fan/format. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
||
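The underlying JDK behavior can be reproduced directly with `java.time`: on a JDK 8 build affected by JDK-8114833 the stand-alone month pattern falls back to the numeric value, which is exactly the condition the new check has to detect. For example:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

val jan = LocalDate.of(1990, 1, 1)

// 'MMM' (format style) is fine everywhere and prints "Jan".
DateTimeFormatter.ofPattern("MMM", Locale.US).format(jan)

// 'LLL' (stand-alone style) prints "1" on JDK 8 builds hit by JDK-8114833,
// and "Jan" on JDK 11 and later; comparing the two results exposes the bug.
DateTimeFormatter.ofPattern("LLL", Locale.US).format(jan)
```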
Max Gekk | b5eb0933ac |
[SPARK-31762][SQL][FOLLOWUP] Avoid double formatting in legacy fractional formatter
### What changes were proposed in this pull request? Currently, the legacy fractional formatter is based on the implementation from Spark 2.4 which formats the input timestamp twice: ``` val timestampString = ts.toString val formatted = legacyFormatter.format(ts) ``` to strip trailing zeros. This PR proposes to avoid the first formatting by forming the second fraction directly. ### Why are the changes needed? It makes legacy fractional formatter faster. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing test "format fraction of second" in `TimestampFormatterSuite` + added test for timestamps before 1970-01-01 00:00:00Z Closes #28643 from MaxGekk/optimize-legacy-fract-format. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
||
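A hedged sketch of the idea (illustrative code, not the actual `TimestampFormatter` implementation): build the second fraction straight from the nanosecond field and strip trailing zeros, instead of rendering the whole timestamp to a string a second time.

```scala
import java.sql.Timestamp

// Append "yyyy-MM-dd HH:mm:ss" with a fraction derived directly from the nanos field.
def formatWithFraction(ts: Timestamp, base: String): String = {
  val nanos = ts.getNanos
  if (nanos == 0) base // legacy behavior: no ".0" for a zero fraction
  else {
    // 9-digit zero-padded fraction with trailing zeros removed, e.g. 123000000 -> "123".
    val digits = f"$nanos%09d".reverse.dropWhile(_ == '0').reverse
    s"$base.$digits"
  }
}

val ts = Timestamp.valueOf("2020-05-25 12:39:17.123")
formatWithFraction(ts, "2020-05-25 12:39:17") // "2020-05-25 12:39:17.123"
```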
Kent Yao | 311fe6a880 |
[SPARK-31835][SQL][TESTS] Add zoneId to codegen related tests in DateExpressionsSuite
### What changes were proposed in this pull request? This PR modifies some codegen related tests to test escape characters for datetime functions which are time zone aware. If the timezone is absent, the formatter could result in `null` caused by `java.util.NoSuchElementException: None.get` and bypassing the real intention of those test cases. ### Why are the changes needed? fix tests ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? passing the modified test cases. Closes #28653 from yaooqinn/SPARK-31835. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
||
Ali Afroozeh | f6f1e51072 |
[SPARK-31719][SQL] Refactor JoinSelection
### What changes were proposed in this pull request? This PR extracts the logic for selecting the planned join type out of the `JoinSelection` rule and moves it to `JoinSelectionHelper` in Catalyst. ### Why are the changes needed? This change both cleans up the code in `JoinSelection` and allows the logic to be in one place and be used from other rules that need to make decision based on the join type before the planning time. ### Does this PR introduce _any_ user-facing change? `BuildSide`, `BuildLeft`, and `BuildRight` are moved from `org.apache.spark.sql.execution` to Catalyst in `org.apache.spark.sql.catalyst.optimizer`. ### How was this patch tested? This is a refactoring, passes existing tests. Closes #28540 from dbaliafroozeh/RefactorJoinSelection. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
||
Huaxin Gao | 50492c0bd3 |
[SPARK-31803][ML] Make sure instance weight is not negative
### What changes were proposed in this pull request? In the algorithms that support instance weight, add checks to make sure instance weight is not negative. ### Why are the changes needed? instance weight has to be >= 0.0 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested Closes #28621 from huaxingao/weight_check. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com> |
||
iRakson | 765105b6f1 |
[SPARK-31638][WEBUI] Clean Pagination code for all webUI pages
### What changes were proposed in this pull request? Pagination code across pages needs to be cleaned. I have tried to clear out these things : * Unused methods * Unused method arguments * remove redundant `if` expressions * fix indentation ### Why are the changes needed? This fix will make code more readable and remove unnecessary methods and variables. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually Closes #28448 from iRakson/refactorPagination. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> |
||
beliefer | 8f2b6f3a0b |
[SPARK-31393][SQL][FOLLOW-UP] Show the correct alias in schema for expression
### What changes were proposed in this pull request? Some expression aliases do not display correctly in the schema. This PR fixes them: - `ln` - `rint` - `lcase` - `position` ### Why are the changes needed? Improve the implementation of some expressions. ### Does this PR introduce _any_ user-facing change? Yes. This PR lets users see the correct alias in the schema. ### How was this patch tested? Jenkins tests. Closes #28551 from beliefer/show-correct-alias-in-schema. Lead-authored-by: beliefer <beliefer@163.com> Co-authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: HyukjinKwon <gurwls223@apache.org> |
||
Gabor Somogyi |
8f1d77488c
|
[SPARK-31821][BUILD] Remove mssql-jdbc dependencies from Hadoop 3.2 profile
### What changes were proposed in this pull request? There is an unnecessary dependency for `mssql-jdbc`. In this PR I've removed it. ### Why are the changes needed? Unnecessary dependency. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins with the following configuration. - [x] Pass the dependency test. - [x] SBT with Hadoop-3.2 (https://github.com/apache/spark/pull/28640#issuecomment-634192512) - [ ] Maven with Hadoop-3.2 Closes #28640 from gaborgsomogyi/SPARK-31821. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> |
||
HyukjinKwon | 7fb2275f00 |
Revert "[SPARK-31788][CORE][PYTHON] Fix UnionRDD of PairRDDs"
This reverts commit
|
||
Max Gekk | 87d34e6b96 |
[SPARK-31820][SQL][TESTS] Fix flaky JavaBeanDeserializationSuite
### What changes were proposed in this pull request?
Modified the formatting of expected timestamp strings in the test `JavaBeanDeserializationSuite`.`testSpark22000` to correctly format timestamps with a **zero** seconds fraction. The current implementation outputs `.0`, but it must be an empty string. From the SPARK-31820 failure:
- should be `2020-05-25 12:39:17`
- but incorrect expected string is `2020-05-25 12:39:17.0`
### Why are the changes needed?
To make `JavaBeanDeserializationSuite` stable, and avoid test failures like https://github.com/apache/spark/pull/28630#issuecomment-633695723
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
I changed
|
||
Dilip Biswal | b44acee953 |
[SPARK-31673][SQL] QueryExecution.debug.toFile() to take an additional explain mode param
### What changes were proposed in this pull request? Currently QueryExecution.debug.toFile dumps the query plan information in a fixed format. This PR adds an additional explain mode parameter that writes the debug information as per the user supplied format. ``` df.queryExecution.debug.toFile("/tmp/plan.txt", explainMode = ExplainMode.fromString("formatted")) ``` ``` == Physical Plan == * Filter (2) +- Scan hive default.s1 (1) (1) Scan hive default.s1 Output [2]: [c1#15, c2#16] Arguments: [c1#15, c2#16], HiveTableRelation `default`.`s1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#15, c2#16] (2) Filter [codegen id : 1] Input [2]: [c1#15, c2#16] Condition : (isnotnull(c1#15) AND (c1#15 > 0)) == Whole Stage Codegen == Found 1 WholeStageCodegen subtrees. == Subtree 1 / 1 (maxMethodCodeSize:220; maxConstantPoolSize:105(0.16% used); numInnerClasses:0) == *(1) Filter (isnotnull(c1#15) AND (c1#15 > 0)) +- Scan hive default.s1 [c1#15, c2#16], HiveTableRelation `default`.`s1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#15, c2#16] Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } /* 004 */ /* 005 */ // codegenStageId=1 /* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator { /* 007 */ private Object[] references; /* 008 */ private scala.collection.Iterator[] inputs; /* 009 */ private scala.collection.Iterator inputadapter_input_0; /* 010 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] filter_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1]; /* 011 */ /* 012 */ public GeneratedIteratorForCodegenStage1(Object[] references) { /* 013 */ this.references = references; /* 014 */ } /* 015 */ /* 016 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 017 */ partitionIndex = index; /* 018 */ this.inputs = inputs; /* 019 */ inputadapter_input_0 = inputs[0]; /* 020 */ filter_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(2, 0); /* 021 */ /* 022 */ } /* 023 */ /* 024 */ protected void processNext() throws java.io.IOException { /* 025 */ while ( inputadapter_input_0.hasNext()) { /* 026 */ InternalRow inputadapter_row_0 = (InternalRow) inputadapter_input_0.next(); /* 027 */ /* 028 */ do { /* 029 */ boolean inputadapter_isNull_0 = inputadapter_row_0.isNullAt(0); /* 030 */ int inputadapter_value_0 = inputadapter_isNull_0 ? /* 031 */ -1 : (inputadapter_row_0.getInt(0)); /* 032 */ /* 033 */ boolean filter_value_2 = !inputadapter_isNull_0; /* 034 */ if (!filter_value_2) continue; /* 035 */ /* 036 */ boolean filter_value_3 = false; /* 037 */ filter_value_3 = inputadapter_value_0 > 0; /* 038 */ if (!filter_value_3) continue; /* 039 */ /* 040 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); /* 041 */ /* 042 */ boolean inputadapter_isNull_1 = inputadapter_row_0.isNullAt(1); /* 043 */ int inputadapter_value_1 = inputadapter_isNull_1 ? 
/* 044 */ -1 : (inputadapter_row_0.getInt(1)); /* 045 */ filter_mutableStateArray_0[0].reset(); /* 046 */ /* 047 */ filter_mutableStateArray_0[0].zeroOutNullBytes(); /* 048 */ /* 049 */ filter_mutableStateArray_0[0].write(0, inputadapter_value_0); /* 050 */ /* 051 */ if (inputadapter_isNull_1) { /* 052 */ filter_mutableStateArray_0[0].setNullAt(1); /* 053 */ } else { /* 054 */ filter_mutableStateArray_0[0].write(1, inputadapter_value_1); /* 055 */ } /* 056 */ append((filter_mutableStateArray_0[0].getRow())); /* 057 */ /* 058 */ } while(false); /* 059 */ if (shouldStop()) return; /* 060 */ } /* 061 */ } /* 062 */ /* 063 */ } ``` ### Why are the changes needed? Hopefully enhances the usability of debug.toFile(..) ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a test in QueryExecutionSuite Closes #28493 from dilipbiswal/write_to_file. Authored-by: Dilip Biswal <dkbiswal@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> |
||
Xingcan Cui | 8ba2b47737 |
[SPARK-31792][SS][DOCS] Introduce the structured streaming UI in the Web UI doc
### What changes were proposed in this pull request? This PR adds the structured streaming UI introduction to the Web UI doc. ![image](https://user-images.githubusercontent.com/1452518/82642209-92b99380-9bdb-11ea-9a0d-cbb26040b0ef.png) ### Why are the changes needed? The structured streaming web UI introduced before was missing from the Web UI documentation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N.A. Closes #28609 from xccui/ss-ui-doc. Authored-by: Xingcan Cui <xccui@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> |
||
Max Gekk | 7e4f5bbd8a |
[SPARK-31806][SQL][TESTS] Check reading date/timestamp from legacy parquet: dictionary encoding, w/o Spark version
### What changes were proposed in this pull request? 1. Add the following parquet files to the resource folder `sql/core/src/test/resources/test-data`: - Files saved by Spark 2.4.5 ( |
||
Prakhar Jain | 452594f5a4 |
[SPARK-31810][TEST] Fix AlterTableRecoverPartitions test using incorrect api to modify RDD_PARALLEL_LISTING_THRESHOLD
### What changes were proposed in this pull request? Use the correct API in AlterTableRecoverPartition tests to modify the `RDD_PARALLEL_LISTING_THRESHOLD` conf. ### Why are the changes needed? The existing AlterTableRecoverPartitions tests modify the RDD_PARALLEL_LISTING_THRESHOLD as a SQLConf using the withSQLConf API. But since this is not a SQLConf, it is not overridden, and so the test doesn't end up testing the required behaviour. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This is a UT fix. UTs still pass after the fix. Closes #28634 from prakharjain09/SPARK-31810-fix-recover-partitions. Authored-by: Prakhar Jain <prakharjain09@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> |
||
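A hedged illustration of the distinction (the config key below is written out for clarity and should be treated as illustrative): `withSQLConf` only overrides `SQLConf` entries for the active session, while a core `SparkConf` setting such as the parallel-listing threshold has to be set when the session and its SparkContext are created.

```scala
import org.apache.spark.sql.SparkSession

// Setting a core (SparkConf) option at session construction time actually takes effect:
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("recover-partitions-test")
  .config("spark.rdd.parallelListingThreshold", "0") // illustrative key for the threshold
  .getOrCreate()

// By contrast, wrapping the same key in withSQLConf(...) inside a test only touches the
// SQL configuration map and leaves the already-created SparkContext's conf unchanged.
```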
HyukjinKwon |
df2a1fe131
|
[SPARK-31808][SQL] Makes struct function's output name and class name pretty
### What changes were proposed in this pull request? This PR proposes to set the alias, and class name in its `ExpressionInfo` for `struct`. - Class name in `ExpressionInfo` - from: `org.apache.spark.sql.catalyst.expressions.NamedStruct` - to:`org.apache.spark.sql.catalyst.expressions.CreateNamedStruct` - Alias name: `named_struct(col1, v, ...)` -> `struct(v, ...)` This PR takes over https://github.com/apache/spark/pull/28631 ### Why are the changes needed? To show the correct output name and class names to users. ### Does this PR introduce _any_ user-facing change? Yes. **Before:** ```scala scala> sql("DESC FUNCTION struct").show(false) +------------------------------------------------------------------------------------+ |function_desc | +------------------------------------------------------------------------------------+ |Function: struct | |Class: org.apache.spark.sql.catalyst.expressions.NamedStruct | |Usage: struct(col1, col2, col3, ...) - Creates a struct with the given field values.| +------------------------------------------------------------------------------------+ ``` ```scala scala> sql("SELECT struct(1, 2)").show(false) +------------------------------+ |named_struct(col1, 1, col2, 2)| +------------------------------+ |[1, 2] | +------------------------------+ ``` **After:** ```scala scala> sql("DESC FUNCTION struct").show(false) +------------------------------------------------------------------------------------+ |function_desc | +------------------------------------------------------------------------------------+ |Function: struct | |Class: org.apache.spark.sql.catalyst.expressions.CreateNamedStruct | |Usage: struct(col1, col2, col3, ...) - Creates a struct with the given field values.| +------------------------------------------------------------------------------------+ ``` ```scala scala> sql("SELECT struct(1, 2)").show(false) +------------+ |struct(1, 2)| +------------+ |[1, 2] | +------------+ ``` ### How was this patch tested? Manually tested, and Jenkins tests. Closes #28633 from HyukjinKwon/SPARK-31808. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> |
||
Max Gekk |
6c80ebbccb
|
[SPARK-31818][SQL] Fix pushing down filters with java.time.Instant values in ORC
### What changes were proposed in this pull request? Convert `java.time.Instant` to `java.sql.Timestamp` in filters pushed down to the ORC datasource when the Java 8 time API is enabled. ### Why are the changes needed? The changes fix the exception raised while pushing timestamp filters when `spark.sql.datetime.java8API.enabled` is set to `true`: ``` java.lang.IllegalArgumentException: Wrong value class java.time.Instant for TIMESTAMP.EQUALS leaf at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192) at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75) ``` ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? Added tests to `OrcFilterSuite`. Closes #28636 from MaxGekk/orc-timestamp-filter-pushdown. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> |
||
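A minimal sketch of the kind of conversion involved (an illustrative helper, not the exact `OrcFilters` code): the ORC `SearchArgument` builder rejects Java 8 time literal classes, so such values have to be rewritten to the `java.sql` classes before the predicate leaves are built.

```scala
import java.sql.{Date, Timestamp}
import java.time.{Instant, LocalDate}

// Rewrite Java 8 time literals into the java.sql classes ORC's SearchArgument accepts.
def toOrcLiteral(value: Any): Any = value match {
  case i: Instant   => Timestamp.from(i)
  case d: LocalDate => Date.valueOf(d)
  case other        => other
}

toOrcLiteral(Instant.parse("2020-05-26T00:00:00Z")) // becomes a java.sql.Timestamp
```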
Kent Yao | 695cb617d4 |
[SPARK-31771][SQL] Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'
### What changes were proposed in this pull request? Five continuous pattern characters among 'G/M/L/E/u/Q/q' mean Narrow-Text Style now that we use `java.time.DateTimeFormatterBuilder` since 3.0.0, which outputs only the leading single letter of the value, e.g. `December` would be `D`. In Spark 2.4 they mean Full-Text Style. In this PR, we explicitly disable Narrow-Text Style for these pattern characters. ### Why are the changes needed? Without this change, there will be a silent data change. ### Does this PR introduce _any_ user-facing change? Yes. Queries with datetime operations using datetime patterns, e.g. `G/M/L/E/u`, will fail if the pattern length is 5; other patterns, e.g. 'k', 'm', also only accept a certain number of letters. 1. Datetime patterns that are not supported by the new parser but are supported by the legacy one will get SparkUpgradeException, e.g. "GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "aa", "aaa". Two options are given to end-users: one is to use legacy mode, and the other is to follow the new online doc for correct datetime patterns. 2. Datetime patterns that are supported by neither the new parser nor the legacy one, e.g. "QQQQQ", "qqqqq", will get IllegalArgumentException, which is captured by Spark internally and results in NULL to end-users. ### How was this patch tested? Added unit tests. Closes #28592 from yaooqinn/SPARK-31771. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
||
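The difference is easy to reproduce with `java.time` directly: four letters select the full text form while five letters select the narrow (single-letter) form, which is the silent data change this PR guards against.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

val dec = LocalDate.of(2019, 12, 1)

DateTimeFormatter.ofPattern("MMMM", Locale.US).format(dec)  // "December" (full form)
DateTimeFormatter.ofPattern("MMMMM", Locale.US).format(dec) // "D"        (narrow form)
```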
Max Gekk | 92685c0148 |
[SPARK-31755][SQL][FOLLOWUP] Update date-time, CSV and JSON benchmark results
### What changes were proposed in this pull request? Re-generate results of: - DateTimeBenchmark - CSVBenchmark - JsonBenchmark in the environment: | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 | ### Why are the changes needed? 1. The PR https://github.com/apache/spark/pull/28576 changed date-time parser. The `DateTimeBenchmark` should confirm that the PR didn't slow down date/timestamp parsing. 2. CSV/JSON datasources are affected by the above PR too. This PR updates the benchmark results in the same environment as other benchmarks to have a base line for future optimizations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running benchmarks via the script: ```python #!/usr/bin/env python3 import os from sparktestsupport.shellutils import run_cmd benchmarks = [ ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark'] ] print('Set SPARK_GENERATE_BENCHMARK_FILES=1') os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1' for b in benchmarks: print("Run benchmark: %s" % b[1]) run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])]) ``` Closes #28613 from MaxGekk/missing-hour-year-benchmarks. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
||
Kent Yao | 0df8dd6073 |
[SPARK-30352][SQL] DataSourceV2: Add CURRENT_CATALOG function
### What changes were proposed in this pull request? As we support multiple catalogs with DataSourceV2, we may need the `CURRENT_CATALOG` value expression from the SQL standard. `CURRENT_CATALOG` is a general value specification in the SQL Standard, described as: > The value specified by CURRENT_CATALOG is the character string that represents the current default catalog name. ### Why are the changes needed? Improve catalog v2 support in line with the ANSI SQL standard. ### Does this PR introduce any user-facing change? Yes, adds a new function `current_catalog()` that points to the currently active catalog. ### How was this patch tested? Added a unit test. Closes #27006 from yaooqinn/SPARK-30352. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
||
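Usage is straightforward from a spark-shell session; with default settings the result is typically `spark_catalog`, the built-in session catalog, unless another catalog has been set as current.

```scala
// Query the currently active catalog; typically prints "spark_catalog" by default.
spark.sql("SELECT current_catalog()").show(false)
```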
Huaxin Gao | d4007776f2 |
[SPARK-31734][ML][PYSPARK] Add weight support in ClusteringEvaluator
### What changes were proposed in this pull request? Add weight support in ClusteringEvaluator ### Why are the changes needed? Currently, BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator support instance weight, but ClusteringEvaluator doesn't, so we will add instance weight support in ClusteringEvaluator. ### Does this PR introduce _any_ user-facing change? Yes. ClusteringEvaluator.setWeightCol ### How was this patch tested? add new unit test Closes #28553 from huaxingao/weight_evaluator. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com> |
||
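A hedged usage sketch (the data and column names are illustrative): the evaluator is configured like the other weighted evaluators, with the new `setWeightCol` picking up a per-row weight column from the predictions DataFrame.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.ml.linalg.Vectors

// A tiny weighted dataset with two obvious clusters (illustrative only).
val dataset = spark.createDataFrame(Seq(
  (Vectors.dense(0.0, 0.0), 1.0),
  (Vectors.dense(0.1, 0.1), 1.0),
  (Vectors.dense(9.0, 9.0), 2.0),
  (Vectors.dense(9.1, 9.1), 2.0)
)).toDF("features", "weight")

val model = new KMeans().setK(2).setFeaturesCol("features").fit(dataset)
val predictions = model.transform(dataset)

val evaluator = new ClusteringEvaluator()
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
  .setWeightCol("weight") // added by this change

val silhouette = evaluator.evaluate(predictions)
```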
Max Gekk | 7f36310500 |
[SPARK-31802][SQL] Format Java date-time types in Row.jsonValue directly
### What changes were proposed in this pull request? Use `format()` methods for Java date-time types in `Row.jsonValue`. The PR https://github.com/apache/spark/pull/28582 added the methods to avoid conversions to days and microseconds. ### Why are the changes needed? To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing tests in `RowJsonSuite`. Closes #28620 from MaxGekk/toJson-format-Java-datetime-types. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> |
||
rishi | b90e10c546 |
[SPARK-31377][SQL][TEST] Added unit tests to 'number of output rows metric' for some joins in SQLMetricSuite
### What changes were proposed in this pull request? Add unit tests for the 'number of output rows' metric for some join types in the SQLMetricSuite. The unit tests added are as follows: - ShuffledHashJoin: LeftOuter, RightOuter, LeftAnti, LeftSemi - BroadcastNestedLoopJoin: RightOuter - BroadcastHashJoin: LeftAnti ### Why are the changes needed? For some combinations of JoinType and join algorithm there is no test coverage for the 'number of output rows' metric. ### Does this PR introduce any user-facing change? No ### How was this patch tested? I added debug statements in the code to ensure the correct combination of JoinType and join algorithm is triggered. I further used the IntelliJ debugger to verify the same. Closes #28330 from sririshindra/SPARK-31377. Authored-by: rishi <spothireddi@cloudera.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> |
||
schintap | a61911c50c |
[SPARK-31788][CORE][PYTHON] Fix UnionRDD of PairRDDs
### What changes were proposed in this pull request? UnionRDD of PairRDDs was causing a bug. The fix is to check the instance type before proceeding. ### Why are the changes needed? Changes are needed to avoid users running into issues with the union RDD operation on any type other than JavaRDD. ### Does this PR introduce _any_ user-facing change? Yes Before: SparkSession available as 'spark'. >>> rdd1 = sc.parallelize([1,2,3,4,5]) >>> rdd2 = sc.parallelize([6,7,8,9,10]) >>> pairRDD1 = rdd1.zip(rdd2) >>> unionRDD1 = sc.union([pairRDD1, pairRDD1]) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] = rdds[i]._jrdd File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in setitem File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value py4j.protocol.Py4JError: An error occurred while calling None.None. Trace: py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) After: >>> rdd2 = sc.parallelize([6,7,8,9,10]) >>> pairRDD1 = rdd1.zip(rdd2) >>> unionRDD1 = sc.union([pairRDD1, pairRDD1]) >>> unionRDD1.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10)] ### How was this patch tested? Tested manually with the reproduction above. Closes #28603 from redsanket/SPARK-31788. Authored-by: schintap <schintap@verizonmedia.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> |
||
William Hyun | 753636e86b |
[SPARK-31807][INFRA] Use python 3 style in release-build.sh
### What changes were proposed in this pull request? This PR aims to use the style that is compatible with both python 2 and 3. ### Why are the changes needed? This will help python 3 migration. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual. Closes #28632 from williamhyun/use_python3_style. Authored-by: William Hyun <williamhyun3@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> |