## What changes were proposed in this pull request?
`OverwriteOptions` was introduced in https://github.com/apache/spark/pull/15705, to carry the information of static partitions. However, after further refactor, this information becomes duplicated and we can remove `OverwriteOptions`.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#15995 from cloud-fan/overwrite.
## What changes were proposed in this pull request?
Add implicit encoders for BigDecimal, timestamp and date.
## How was this patch tested?
Add an unit test. Pass build, unit tests, and some tests below .
Before:
```
scala> spark.createDataset(Seq(new java.math.BigDecimal(10)))
<console>:24: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
spark.createDataset(Seq(new java.math.BigDecimal(10)))
^
scala>
```
After:
```
scala> spark.createDataset(Seq(new java.math.BigDecimal(10)))
res0: org.apache.spark.sql.Dataset[java.math.BigDecimal] = [value: decimal(38,18)]
```
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes#16176 from weiqingy/SPARK-18746.
## What changes were proposed in this pull request?
- Changed `StreamingQueryProgress.watermark` to `StreamingQueryProgress.queryTimestamps` which is a `Map[String, String]` containing the following keys: "eventTime.max", "eventTime.min", "eventTime.avg", "processingTime", "watermark". All of them UTC formatted strings.
- Renamed `StreamingQuery.timestamp` to `StreamingQueryProgress.triggerTimestamp` to differentiate from `queryTimestamps`. It has the timestamp of when the trigger was started.
## How was this patch tested?
Updated tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#16258 from tdas/SPARK-18834.
## What changes were proposed in this pull request?
Change the statement `SHOW TABLES [EXTENDED] [(IN|FROM) database_name] [[LIKE] 'identifier_with_wildcards'] [PARTITION(partition_spec)]` to the following statements:
- SHOW TABLES [(IN|FROM) database_name] [[LIKE] 'identifier_with_wildcards']
- SHOW TABLE EXTENDED [(IN|FROM) database_name] LIKE 'identifier_with_wildcards' [PARTITION(partition_spec)]
After this change, the statements `SHOW TABLE/SHOW TABLES` have the same syntax with that HIVE has.
## How was this patch tested?
Modified the test sql file `show-tables.sql`;
Modified the test suite `DDLSuite`.
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes#16262 from jiangxb1987/show-table-extended.
## What changes were proposed in this pull request?
Some places in SQL may call `RpcEndpointRef.askWithRetry` (e.g., ParquetFileFormat.buildReader -> SparkContext.broadcast -> ... -> BlockManagerMaster.updateBlockInfo -> RpcEndpointRef.askWithRetry), which will finally call `Await.result`. It may cause `java.lang.IllegalArgumentException: spark.sql.execution.id is already set` when running in Scala ForkJoinPool.
This PR includes the following changes to fix this issue:
- Remove `ThreadUtils.awaitResult`
- Rename `ThreadUtils. awaitResultInForkJoinSafely` to `ThreadUtils.awaitResult`
- Replace `Await.result` in RpcTimeout with `ThreadUtils.awaitResult`.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16230 from zsxwing/fix-SPARK-13747.
## What changes were proposed in this pull request?
I **believe** that I _only_ removed duplicated code (that adds nothing but noise). I'm gonna remove the comment after Jenkins has built the changes with no issues and Spark devs has agreed to include the changes.
Remove explicit `RDD` and `Partition` overrides (that turn out code duplication)
## How was this patch tested?
Local build. Awaiting Jenkins.
…cation)
Author: Jacek Laskowski <jacek@japila.pl>
Closes#16145 from jaceklaskowski/rdd-overrides-removed.
## What changes were proposed in this pull request?
Fixes compile errors in generated code when user has case class with a `scala.collections.immutable.Map` instead of a `scala.collections.Map`. Since ArrayBasedMapData.toScalaMap returns the immutable version we can make it work with both.
## How was this patch tested?
Additional unit tests.
Author: Andrew Ray <ray.andrew@gmail.com>
Closes#16161 from aray/fix-map-codegen.
## What changes were proposed in this pull request?
Major change in this PR:
- Add `pendingQueryNames` and `pendingQueryIds` to track that are going to start but not yet put into `activeQueries` so that we don't need to hold a lock when starting a query.
Minor changes:
- Fix a potential NPE when the user sets `checkpointLocation` using SQLConf but doesn't specify a query name.
- Add missing docs in `StreamingQueryListener`
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16220 from zsxwing/SPARK-18796.
The value of the "isSrcLocal" parameter passed to Hive's loadTable and
loadPartition methods needs to be set according to the user query (e.g.
"LOAD DATA LOCAL"), and not the current code that tries to guess what
it should be.
For existing versions of Hive the current behavior is probably ok, but
some recent changes in the Hive code changed the semantics slightly,
making code that sets "isSrcLocal" to "true" incorrectly to do the
wrong thing. It would end up moving the parent directory of the files
into the final location, instead of the file themselves, resulting
in a table that cannot be read.
I modified HiveCommandSuite so that existing "LOAD DATA" tests are run
both in local and non-local mode, since the semantics are slightly different.
The tests include a few new checks to make sure the semantics follow
what Hive describes in its documentation.
Tested with existing unit tests and also ran some Hive integration tests
with a version of Hive containing the changes that surfaced the problem.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#16179 from vanzin/SPARK-18752.
The problem is if it is run with no fix throws an exception and causes the following error:
"Cannot specify a column width on data type bit."
The problem stems from the fact that the "java.sql.types.BIT" type is mapped as BIT[n] that really must be mapped as BIT.
This concerns the type Boolean.
As for the type String with maximum length of characters it must be mapped as VARCHAR (MAX) instead of TEXT which is a type deprecated in SQLServer.
Here is the list of mappings for SQL Server:
https://msdn.microsoft.com/en-us/library/ms378878(v=sql.110).aspxCloses#13944 from meknio/master.
## What changes were proposed in this pull request?
Instead of only keeping the minimum number of offsets around, we should keep enough information to allow us to roll back n batches and reexecute the stream starting from a given point. In particular, we should create a config in SQLConf, spark.sql.streaming.retainedBatches that defaults to 100 and ensure that we keep enough log files in the following places to roll back the specified number of batches:
the offsets that are present in each batch
versions of the state store
the files lists stored for the FileStreamSource
the metadata log stored by the FileStreamSink
marmbrus zsxwing
## How was this patch tested?
The following tests were added.
### StreamExecution offset metadata
Test added to StreamingQuerySuite that ensures offset metadata is garbage collected according to minBatchesRetain
### CompactibleFileStreamLog
Tests added in CompactibleFileStreamLogSuite to ensure that logs are purged starting before the first compaction file that proceeds the current batch id - minBatchesToRetain.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Tyson Condie <tcondie@gmail.com>
Closes#16219 from tcondie/offset_hist.
## What changes were proposed in this pull request?
During column stats collection, average and max length will be null if a column of string/binary type has only null values. To fix this, I use default size when avg/max length is null.
## How was this patch tested?
Add a test for handling null columns
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes#16243 from wzhfy/nullStats.
## What changes were proposed in this pull request?
This PR proposes to fix some tests being failed on Windows as below for several problems.
### Incorrect path handling
- FileSuite
```
[info] - binary file input as byte array *** FAILED *** (500 milliseconds)
[info] "file:/C:/projects/spark/target/tmp/spark-e7c3a3b8-0a4b-4a7f-9ebe-7c4883e48624/record-bytestream-00000.bin" did not contain "C:\projects\spark\target\tmp\spark-e7c3a3b8-0a4b-4a7f-9ebe-7c4883e48624\record-bytestream-00000.bin" (FileSuite.scala:258)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
...
```
```
[info] - Get input files via old Hadoop API *** FAILED *** (1 second, 94 milliseconds)
[info] Set("/C:/projects/spark/target/tmp/spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200/output/part-00000", "/C:/projects/spark/target/tmp/spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200/output/part-00001") did not equal Set("C:\projects\spark\target\tmp\spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200\output/part-00000", "C:\projects\spark\target\tmp\spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200\output/part-00001") (FileSuite.scala:535)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
...
```
```
[info] - Get input files via new Hadoop API *** FAILED *** (313 milliseconds)
[info] Set("/C:/projects/spark/target/tmp/spark-12bc1540-1111-4df6-9c4d-79e0e614407c/output/part-00000", "/C:/projects/spark/target/tmp/spark-12bc1540-1111-4df6-9c4d-79e0e614407c/output/part-00001") did not equal Set("C:\projects\spark\target\tmp\spark-12bc1540-1111-4df6-9c4d-79e0e614407c\output/part-00000", "C:\projects\spark\target\tmp\spark-12bc1540-1111-4df6-9c4d-79e0e614407c\output/part-00001") (FileSuite.scala:549)
[info] org.scalatest.exceptions.TestFailedException:
...
```
- TaskResultGetterSuite
```
[info] - handling results larger than max RPC message size *** FAILED *** (1 second, 579 milliseconds)
[info] 1 did not equal 0 Expect result to be removed from the block manager. (TaskResultGetterSuite.scala:129)
[info] org.scalatest.exceptions.TestFailedException:
[info] ...
[info] Cause: java.net.URISyntaxException: Illegal character in path at index 12: string:///C:\projects\spark\target\tmp\spark-93c485af-68da-440f-a907-aac7acd5fc25\repro\MyException.java
[info] at java.net.URI$Parser.fail(URI.java:2848)
[info] at java.net.URI$Parser.checkChars(URI.java:3021)
...
```
```
[info] - failed task deserialized with the correct classloader (SPARK-11195) *** FAILED *** (0 milliseconds)
[info] java.lang.IllegalArgumentException: Illegal character in path at index 12: string:///C:\projects\spark\target\tmp\spark-93c485af-68da-440f-a907-aac7acd5fc25\repro\MyException.java
[info] at java.net.URI.create(URI.java:852)
...
```
- SparkSubmitSuite
```
[info] java.lang.IllegalArgumentException: Illegal character in path at index 12: string:///C:\projects\spark\target\tmp\1481210831381-0\870903339\MyLib.java
[info] at java.net.URI.create(URI.java:852)
[info] at org.apache.spark.TestUtils$.org$apache$spark$TestUtils$$createURI(TestUtils.scala:112)
...
```
### Incorrect separate for JarEntry
After the path fix from above, then `TaskResultGetterSuite` throws another exception as below:
```
[info] - failed task deserialized with the correct classloader (SPARK-11195) *** FAILED *** (907 milliseconds)
[info] java.lang.ClassNotFoundException: repro.MyException
[info] at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
...
```
This is because `Paths.get` concatenates the given paths to an OS-specific path (Windows `\` and Linux `/`). However, for `JarEntry` we should comply ZIP specification meaning it should be always `/` according to ZIP specification.
See `4.4.17 file name: (Variable)` in https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
### Long path problem on Windows
Some tests in `ShuffleSuite` via `ShuffleNettySuite` were skipped due to the same reason with SPARK-18718
## How was this patch tested?
Manually via AppVeyor.
**Before**
- `FileSuite`, `TaskResultGetterSuite`,`SparkSubmitSuite`
https://ci.appveyor.com/project/spark-test/spark/build/164-tmp-windows-base (please grep each to check each)
- `ShuffleSuite`
https://ci.appveyor.com/project/spark-test/spark/build/157-tmp-windows-base
**After**
- `FileSuite`
https://ci.appveyor.com/project/spark-test/spark/build/166-FileSuite
- `TaskResultGetterSuite`
https://ci.appveyor.com/project/spark-test/spark/build/173-TaskResultGetterSuite
- `SparkSubmitSuite`
https://ci.appveyor.com/project/spark-test/spark/build/167-SparkSubmitSuite
- `ShuffleSuite`
https://ci.appveyor.com/project/spark-test/spark/build/176-ShuffleSuite
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#16234 from HyukjinKwon/test-errors-windows.
### What changes were proposed in this pull request?
Currently, when users use Python UDF in Filter, BatchEvalPython is always generated below FilterExec. However, not all the predicates need to be evaluated after Python UDF execution. Thus, this PR is to push down the determinisitc predicates through `BatchEvalPython`.
```Python
>>> df = spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"])
>>> from pyspark.sql.functions import udf, col
>>> from pyspark.sql.types import BooleanType
>>> my_filter = udf(lambda a: a < 2, BooleanType())
>>> sel = df.select(col("key"), col("value")).filter((my_filter(col("key"))) & (df.value < "2"))
>>> sel.explain(True)
```
Before the fix, the plan looks like
```
== Optimized Logical Plan ==
Filter ((isnotnull(value#1) && <lambda>(key#0L)) && (value#1 < 2))
+- LogicalRDD [key#0L, value#1]
== Physical Plan ==
*Project [key#0L, value#1]
+- *Filter ((isnotnull(value#1) && pythonUDF0#9) && (value#1 < 2))
+- BatchEvalPython [<lambda>(key#0L)], [key#0L, value#1, pythonUDF0#9]
+- Scan ExistingRDD[key#0L,value#1]
```
After the fix, the plan looks like
```
== Optimized Logical Plan ==
Filter ((isnotnull(value#1) && <lambda>(key#0L)) && (value#1 < 2))
+- LogicalRDD [key#0L, value#1]
== Physical Plan ==
*Project [key#0L, value#1]
+- *Filter pythonUDF0#9: boolean
+- BatchEvalPython [<lambda>(key#0L)], [key#0L, value#1, pythonUDF0#9]
+- *Filter (isnotnull(value#1) && (value#1 < 2))
+- Scan ExistingRDD[key#0L,value#1]
```
### How was this patch tested?
Added both unit test cases for `BatchEvalPythonExec` and also add an end-to-end test case in Python test suite.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#16193 from gatorsmile/pythonUDFPredicatePushDown.
## What changes were proposed in this pull request?
1. In SparkStrategies.canBroadcast, I will add the check plan.statistics.sizeInBytes >= 0
2. In LocalRelations.statistics, when calculate the statistics, I will change the size to BigInt so it won't overflow.
## How was this patch tested?
I will add a test case to make sure the statistics.sizeInBytes won't overflow.
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#16175 from huaxingao/spark-17460.
## What changes were proposed in this pull request?
When you start a stream, if we are trying to resolve the source of the stream, for example if we need to resolve partition columns, this could take a long time. This long execution time should not block the main thread where `query.start()` was called on. It should happen in the stream execution thread possibly before starting any triggers.
## How was this patch tested?
Unit test added. Made sure test fails with no code changes.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#16238 from brkyvz/SPARK-18811.
## What changes were proposed in this pull request?
This PR avoids that a result of a cast `toInt` is negative due to signed integer overflow (e.g. 0x0000_0000_1???????L.toInt < 0 ). This PR performs casts after we can ensure the value is within range of signed integer (the result of `max(array.length, ???)` is always integer).
## How was this patch tested?
Manually executed query68 of TPC-DS with 100TB
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#16235 from kiszk/SPARK-18745.
## What changes were proposed in this pull request?
* This PR changes `JVMObjectTracker` from `object` to `class` and let its instance associated with each RBackend. So we can manage the lifecycle of JVM objects when there are multiple `RBackend` sessions. `RBackend.close` will clear the object tracker explicitly.
* I assume that `SQLUtils` and `RRunner` do not need to track JVM instances, which could be wrong.
* Small refactor of `SerDe.sqlSerDe` to increase readability.
## How was this patch tested?
* Added unit tests for `JVMObjectTracker`.
* Wait for Jenkins to run full tests.
Author: Xiangrui Meng <meng@databricks.com>
Closes#16154 from mengxr/SPARK-17822.
## What changes were proposed in this pull request?
- Changed FileStreamSource to use new FileStreamSourceOffset rather than LongOffset. The field is named as `logOffset` to make it more clear that this is a offset in the file stream log.
- Fixed bug in FileStreamSourceLog, the field endId in the FileStreamSourceLog.get(startId, endId) was not being used at all. No test caught it earlier. Only my updated tests caught it.
Other minor changes
- Dont use batchId in the FileStreamSource, as calling it batch id is extremely miss leading. With multiple sources, it may happen that a new batch has no new data from a file source. So offset of FileStreamSource != batchId after that batch.
## How was this patch tested?
Updated unit test.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#16205 from tdas/SPARK-18776.
## What changes were proposed in this pull request?
This patch fixes the format specification in explain for file sources (Parquet and Text formats are the only two that are different from the rest):
Before:
```
scala> spark.read.text("test.text").explain()
== Physical Plan ==
*FileScan text [value#15] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormatxyz, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
```
After:
```
scala> spark.read.text("test.text").explain()
== Physical Plan ==
*FileScan text [value#15] Batched: false, Format: Text, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
```
Also closes#14680.
## How was this patch tested?
Verified in spark-shell.
Author: Reynold Xin <rxin@databricks.com>
Closes#16187 from rxin/SPARK-18760.
## What changes were proposed in this pull request?
`input_file_name` doesn't return filename when working with UDF in PySpark. An example shows the problem:
from pyspark.sql.functions import *
from pyspark.sql.types import *
def filename(path):
return path
sourceFile = udf(filename, StringType())
spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()
+---------------------------+
|filename(input_file_name())|
+---------------------------+
| |
+---------------------------+
The cause of this issue is, we group rows in `BatchEvalPythonExec` for batching processing of PythonUDF. Currently we group rows first and then evaluate expressions on the rows. If the data is less than the required number of rows for a group, the iterator will be consumed to the end before the evaluation. However, once the iterator reaches the end, we will unset input filename. So the input_file_name expression can't return correct filename.
This patch fixes the approach to group the batch of rows. We evaluate the expression first and then group evaluated results to batch.
## How was this patch tested?
Added unit test to PySpark.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#16115 from viirya/fix-py-udf-input-filename.
## What changes were proposed in this pull request?
There are some tests failed on Windows due to the wrong format of path and the limitation of path length as below:
This PR proposes both to fix the failed tests by fixing the path for the tests below:
- `InsertSuite`
```
Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.sources.InsertSuite *** ABORTED *** (12 seconds, 547 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-177945ef-9128-42b4-8c07-de31f78bbbd6;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
```
- `PathOptionSuite`
```
- path option also exist for write path *** FAILED *** (1 second, 93 milliseconds)
"C:[projectsspark arget mp]spark-5ab34a58-df8d-..." did not equal "C:[\projects\spark\target\tmp\]spark-5ab34a58-df8d-..." (PathOptionSuite.scala:93)
org.scalatest.exceptions.TestFailedException:
at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
...
```
- `UDFSuite`
```
- SPARK-8005 input_file_name *** FAILED *** (2 seconds, 234 milliseconds)
"file:///C:/projects/spark/target/tmp/spark-e4e5720a-2006-48f9-8b11-797bf59794bf/part-00001-26fb05e4-603d-471d-ae9d-b9549e0c7765.snappy.parquet" did not contain "C:\projects\spark\target\tmp\spark-e4e5720a-2006-48f9-8b11-797bf59794bf" (UDFSuite.scala:67)
org.scalatest.exceptions.TestFailedException:
at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
...
```
and to skip the tests belows which are being failed on Windows due to path length limitation.
- `SparkLauncherSuite`
```
Test org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher failed: java.lang.AssertionError: expected:<0> but was:<1>, took 0.062 sec
at org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:177)
...
```
The stderr from the process is `The filename or extension is too long` which is equivalent to the one below.
- `BroadcastJoinSuite`
```
04:09:40.882 ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error running executor
java.io.IOException: Cannot run program "C:\Progra~1\Java\jdk1.8.0\bin\java" (in directory "C:\projects\spark\work\app-20161205040542-0000\51658"): CreateProcess error=206, The filename or extension is too long
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:167)
at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)
Caused by: java.io.IOException: CreateProcess error=206, The filename or extension is too long
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(ProcessImpl.java:386)
at java.lang.ProcessImpl.start(ProcessImpl.java:137)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 2 more
04:09:40.929 ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error running executor
(appearently infinite same error messages)
...
```
## How was this patch tested?
Manually tested via AppVeyor.
**Before**
`InsertSuite`: https://ci.appveyor.com/project/spark-test/spark/build/148-InsertSuite-pr
`PathOptionSuite`: https://ci.appveyor.com/project/spark-test/spark/build/139-PathOptionSuite-pr
`UDFSuite`: https://ci.appveyor.com/project/spark-test/spark/build/143-UDFSuite-pr
`SparkLauncherSuite`: https://ci.appveyor.com/project/spark-test/spark/build/141-SparkLauncherSuite-pr
`BroadcastJoinSuite`: https://ci.appveyor.com/project/spark-test/spark/build/145-BroadcastJoinSuite-pr
**After**
`PathOptionSuite`: https://ci.appveyor.com/project/spark-test/spark/build/140-PathOptionSuite-pr
`SparkLauncherSuite`: https://ci.appveyor.com/project/spark-test/spark/build/142-SparkLauncherSuite-pr
`UDFSuite`: https://ci.appveyor.com/project/spark-test/spark/build/144-UDFSuite-pr
`InsertSuite`: https://ci.appveyor.com/project/spark-test/spark/build/147-InsertSuite-pr
`BroadcastJoinSuite`: https://ci.appveyor.com/project/spark-test/spark/build/149-BroadcastJoinSuite-pr
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#16147 from HyukjinKwon/fix-tests.
## What changes were proposed in this pull request?
When `ignoreCorruptFiles` is enabled, it's better to also ignore non-existing files.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16203 from zsxwing/ignore-file-not-found.
## What changes were proposed in this pull request?
Listeners added with `sparkSession.streams.addListener(l)` are added to a SparkSession. So events only from queries in the same session as a listener should be posted to the listener. Currently, all the events gets rerouted through the Spark's main listener bus, that is,
- StreamingQuery posts event to StreamingQueryListenerBus. Only the queries associated with the same session as the bus posts events to it.
- StreamingQueryListenerBus posts event to Spark's main LiveListenerBus as a SparkEvent.
- StreamingQueryListenerBus also subscribes to LiveListenerBus events thus getting back the posted event in a different thread.
- The received is posted to the registered listeners.
The problem is that *all StreamingQueryListenerBuses in all sessions* gets the events and posts them to their listeners. This is wrong.
In this PR, I solve it by making StreamingQueryListenerBus track active queries (by their runIds) when a query posts the QueryStarted event to the bus. This allows the rerouted events to be filtered using the tracked queries.
Note that this list needs to be maintained separately
from the `StreamingQueryManager.activeQueries` because a terminated query is cleared from
`StreamingQueryManager.activeQueries` as soon as it is stopped, but the this ListenerBus must
clear a query only after the termination event of that query has been posted lazily, much after the query has been terminated.
Credit goes to zsxwing for coming up with the initial idea.
## How was this patch tested?
Updated test harness code to use the correct session, and added new unit test.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#16186 from tdas/SPARK-18758.
## What changes were proposed in this pull request?
`makeRootConverter` is only called with a `StructType` value. By making this method less general we can remove pattern matches, which are never actually hit outside of the test suite.
## How was this patch tested?
The existing tests.
Author: Nathan Howell <nhowell@godaddy.com>
Closes#16084 from NathanHowell/SPARK-18654.
Based on an informal survey, users find this option easier to understand / remember.
Author: Michael Armbrust <michael@databricks.com>
Closes#16182 from marmbrus/renameRecentProgress.
## What changes were proposed in this pull request?
Fixed the following failures:
```
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 3745 times over 1.0000790851666665 minutes. Last failure message: assertion failed: failOnDataLoss-0 not deleted after timeout.
```
```
sbt.ForkMain$ForkError: org.apache.spark.sql.streaming.StreamingQueryException: Query query-66 terminated with exception: null
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:146)
Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null
at java.util.ArrayList.addAll(ArrayList.java:577)
at org.apache.kafka.clients.Metadata.getClusterForCurrentTopics(Metadata.java:257)
at org.apache.kafka.clients.Metadata.update(Metadata.java:177)
at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleResponse(NetworkClient.java:605)
at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeHandleCompletedReceive(NetworkClient.java:582)
at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:450)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:269)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitPendingRequests(ConsumerNetworkClient.java:260)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:222)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:366)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:978)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
at
...
```
## How was this patch tested?
Tested in #16048 by running many times.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16109 from zsxwing/fix-kafka-flaky-test.
## What changes were proposed in this pull request?
It's better to add a warning log when skipping a corrupted file. It will be helpful when we want to finish the job first, then find them in the log and fix these files.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16192 from zsxwing/SPARK-18764.
## What changes were proposed in this pull request?
Fixes AnalysisException for pivot queries that have group by columns that are expressions and not attributes by substituting the expressions output attribute in the second aggregation and final projection.
## How was this patch tested?
existing and additional unit tests
Author: Andrew Ray <ray.andrew@gmail.com>
Closes#16177 from aray/SPARK-17760.
## What changes were proposed in this pull request?
Maven compilation seem to not allow resource is sql/test to be easily referred to in kafka-0-10-sql tests. So moved the kafka-source-offset-version-2.1.0 from sql test resources to kafka-0-10-sql test resources.
## How was this patch tested?
Manually ran maven test
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#16183 from tdas/SPARK-18671-1.
## What changes were proposed in this pull request?
Easier to read while debugging as a formatted string (in ISO8601 format) than in millis
## How was this patch tested?
Updated unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#16166 from tdas/SPARK-18734.
## What changes were proposed in this pull request?
To be able to restart StreamingQueries across Spark version, we have already made the logs (offset log, file source log, file sink log) use json. We should added tests with actual json files in the Spark such that any incompatible changes in reading the logs is immediately caught. This PR add tests for FileStreamSourceLog, FileStreamSinkLog, and OffsetSeqLog.
## How was this patch tested?
new unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#16128 from tdas/SPARK-18671.
## What changes were proposed in this pull request?
Many Spark developers often want to test the runtime of some function in interactive debugging and testing. This patch adds a simple time function to SparkSession:
```
scala> spark.time { spark.range(1000).count() }
Time taken: 77 ms
res1: Long = 1000
```
## How was this patch tested?
I tested this interactively in spark-shell.
Author: Reynold Xin <rxin@databricks.com>
Closes#16140 from rxin/SPARK-18714.
## What changes were proposed in this pull request?
Right now ForeachSink creates a new physical plan, so StreamExecution cannot retrieval metrics and watermark.
This PR changes ForeachSink to manually convert InternalRows to objects without creating a new plan.
## How was this patch tested?
`test("foreach with watermark: append")`.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16160 from zsxwing/SPARK-18721.
(Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-18572)
## What changes were proposed in this pull request?
Currently Spark answers the `SHOW PARTITIONS` command by fetching all of the table's partition metadata from the external catalog and constructing partition names therefrom. The Hive client has a `getPartitionNames` method which is many times faster for this purpose, with the performance improvement scaling with the number of partitions in a table.
To test the performance impact of this PR, I ran the `SHOW PARTITIONS` command on two Hive tables with large numbers of partitions. One table has ~17,800 partitions, and the other has ~95,000 partitions. For the purposes of this PR, I'll call the former table `table1` and the latter table `table2`. I ran 5 trials for each table with before-and-after versions of this PR. The results are as follows:
Spark at bdc8153, `SHOW PARTITIONS table1`, times in seconds:
7.901
3.983
4.018
4.331
4.261
Spark at bdc8153, `SHOW PARTITIONS table2`
(Timed out after 10 minutes with a `SocketTimeoutException`.)
Spark at this PR, `SHOW PARTITIONS table1`, times in seconds:
3.801
0.449
0.395
0.348
0.336
Spark at this PR, `SHOW PARTITIONS table2`, times in seconds:
5.184
1.63
1.474
1.519
1.41
Taking the best times from each trial, we get a 12x performance improvement for a table with ~17,800 partitions and at least a 426x improvement for a table with ~95,000 partitions. More significantly, the latter command doesn't even complete with the current code in master.
This is actually a patch we've been using in-house at VideoAmp since Spark 1.1. It's made all the difference in the practical usability of our largest tables. Even with tables with about 1,000 partitions there's a performance improvement of about 2-3x.
## How was this patch tested?
I added a unit test to `VersionsSuite` which tests that the Hive client's `getPartitionNames` method returns the correct number of partitions.
Author: Michael Allman <michael@videoamp.com>
Closes#15998 from mallman/spark-18572-list_partition_names.
## What changes were proposed in this pull request?
Move no data rate limit from StreamExecution to ProgressReporter to make `recentProgresses` and listener events consistent.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16155 from zsxwing/SPARK-18722.
## What changes were proposed in this pull request?
DataSet.na.fill(0) used on a DataSet which has a long value column, it will change the original long value.
The reason is that the type of the function fill's param is Double, and the numeric columns are always cast to double(`fillCol[Double](f, value)`) .
```
def fill(value: Double, cols: Seq[String]): DataFrame = {
val columnEquals = df.sparkSession.sessionState.analyzer.resolver
val projections = df.schema.fields.map { f =>
// Only fill if the column is part of the cols list.
if (f.dataType.isInstanceOf[NumericType] && cols.exists(col => columnEquals(f.name, col))) {
fillCol[Double](f, value)
} else {
df.col(f.name)
}
}
df.select(projections : _*)
}
```
For example:
```
scala> val df = Seq[(Long, Long)]((1, 2), (-1, -2), (9123146099426677101L, 9123146560113991650L)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint]
scala> df.show
+-------------------+-------------------+
| a| b|
+-------------------+-------------------+
| 1| 2|
| -1| -2|
|9123146099426677101|9123146560113991650|
+-------------------+-------------------+
scala> df.na.fill(0).show
+-------------------+-------------------+
| a| b|
+-------------------+-------------------+
| 1| 2|
| -1| -2|
|9123146099426676736|9123146560113991680|
+-------------------+-------------------+
```
the original values changed [which is not we expected result]:
```
9123146099426677101 -> 9123146099426676736
9123146560113991650 -> 9123146560113991680
```
## How was this patch tested?
unit test added.
Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)>
Closes#15994 from windpiger/nafillMissupOriginalValue.
### What changes were proposed in this pull request?
Our existing withColumn for adding metadata can simply use the existing public withColumn API.
### How was this patch tested?
The existing test cases cover it.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#16152 from gatorsmile/withColumnRefactoring.
## What changes were proposed in this pull request?
Here are the major changes in this PR.
- Added the ability to recover `StreamingQuery.id` from checkpoint location, by writing the id to `checkpointLoc/metadata`.
- Added `StreamingQuery.runId` which is unique for every query started and does not persist across restarts. This is to identify each restart of a query separately (same as earlier behavior of `id`).
- Removed auto-generation of `StreamingQuery.name`. The purpose of name was to have the ability to define an identifier across restarts, but since id is precisely that, there is no need for a auto-generated name. This means name becomes purely cosmetic, and is null by default.
- Added `runId` to `StreamingQueryListener` events and `StreamingQueryProgress`.
Implementation details
- Renamed existing `StreamExecutionMetadata` to `OffsetSeqMetadata`, and moved it to the file `OffsetSeq.scala`, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of `.json` and `.getOrElse("{}")`).
- Added the `id` as the new `StreamMetadata`.
- When a StreamingQuery is created it gets or writes the `StreamMetadata` from `checkpointLoc/metadata`.
- All internal logging in `StreamExecution` uses `(name, id, runId)` instead of just `name`
TODO
- [x] Test handling of name=null in json generation of StreamingQueryProgress
- [x] Test handling of name=null in json generation of StreamingQueryListener events
- [x] Test python API of runId
## How was this patch tested?
Updated unit tests and new unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#16113 from tdas/SPARK-18657.
## What changes were proposed in this pull request?
Move DataFrame.collect out of synchronized block so that we can query content in MemorySink when `DataFrame.collect` is running.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16162 from zsxwing/SPARK-18729.
## What changes were proposed in this pull request?
As reported in the Jira, there are some weird issues with exploding Python UDFs in SparkSQL.
The following test code can reproduce it. Notice: the following test code is reported to return wrong results in the Jira. However, as I tested on master branch, it causes exception and so can't return any result.
>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import *
>>>
>>> df = spark.range(10)
>>>
>>> def return_range(value):
... return [(i, str(i)) for i in range(value - 1, value + 1)]
...
>>> range_udf = udf(return_range, ArrayType(StructType([StructField("integer_val", IntegerType()),
... StructField("string_val", StringType())])))
>>>
>>> df.select("id", explode(range_udf(df.id))).show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/spark/python/pyspark/sql/dataframe.py", line 318, in show
print(self._jdf.showString(n, 20))
File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o126.showString.: java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:120)
at org.apache.spark.sql.execution.GenerateExec.consume(GenerateExec.scala:57)
The cause of this issue is, in `ExtractPythonUDFs` we insert `BatchEvalPythonExec` to run PythonUDFs in batch. `BatchEvalPythonExec` will add extra outputs (e.g., `pythonUDF0`) to original plan. In above case, the original `Range` only has one output `id`. After `ExtractPythonUDFs`, the added `BatchEvalPythonExec` has two outputs `id` and `pythonUDF0`.
Because the output of `GenerateExec` is given after analysis phase, in above case, it is the combination of `id`, i.e., the output of `Range`, and `col`. But in planning phase, we change `GenerateExec`'s child plan to `BatchEvalPythonExec` with additional output attributes.
It will cause no problem in non wholestage codegen. Because when evaluating the additional attributes are projected out the final output of `GenerateExec`.
However, as `GenerateExec` now supports wholestage codegen, the framework will input all the outputs of the child plan to `GenerateExec`. Then when consuming `GenerateExec`'s output data (i.e., calling `consume`), the number of output attributes is different to the output variables in wholestage codegen.
To solve this issue, this patch only gives the generator's output to `GenerateExec` after analysis phase. `GenerateExec`'s output is the combination of its child plan's output and the generator's output. So when we change `GenerateExec`'s child, its output is still correct.
## How was this patch tested?
Added test cases to PySpark.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#16120 from viirya/fix-py-udf-with-generator.
## What changes were proposed in this pull request?
This is kind of a long-standing bug, it's hidden until https://github.com/apache/spark/pull/15780 , which may add `AssertNotNull` on top of `LambdaVariable` and thus enables subexpression elimination.
However, subexpression elimination will evaluate the common expressions at the beginning, which is invalid for `LambdaVariable`. `LambdaVariable` usually represents loop variable, which can't be evaluated ahead of the loop.
This PR skips expressions containing `LambdaVariable` when doing subexpression elimination.
## How was this patch tested?
updated test in `DatasetAggregatorSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#16143 from cloud-fan/aggregator.
## What changes were proposed in this pull request?
- Add StreamingQuery.explain and exception to Python.
- Fix StreamingQueryException to not expose `OffsetSeq`.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16125 from zsxwing/py-streaming-explain.
## What changes were proposed in this pull request?
We currently have function input_file_name to get the path of the input file, but don't have functions to get the block start offset and length. This patch introduces two functions:
1. input_file_block_start: returns the file block start offset, or -1 if not available.
2. input_file_block_length: returns the file block length, or -1 if not available.
## How was this patch tested?
Updated existing test cases in ColumnExpressionSuite that covered input_file_name to also cover the two new functions.
Author: Reynold Xin <rxin@databricks.com>
Closes#16133 from rxin/SPARK-18702.
## What changes were proposed in this pull request?
Even though in 2.1 creating a partitioned datasource table will not populate the partition data by default (until the user issues MSCK REPAIR TABLE), it seems we still scan the filesystem for no good reason.
We should avoid doing this when the user specifies a schema.
## How was this patch tested?
Perf stat tests.
Author: Eric Liang <ekl@databricks.com>
Closes#16090 from ericl/spark-18661.
## What changes were proposed in this pull request?
This fix puts an explicit list of operators that Spark supports for correlated subqueries.
## How was this patch tested?
Run sql/test, catalyst/test and add a new test case on Generate.
Author: Nattavut Sutyanyong <nsy.can@gmail.com>
Closes#16046 from nsyca/spark18455.0.
## What changes were proposed in this pull request?
This patch significantly improves the IO / file listing performance of schema inference in Spark's built-in CSV data source.
Previously, this data source used the legacy `SparkContext.hadoopFile` and `SparkContext.hadoopRDD` methods to read files during its schema inference step, causing huge file-listing bottlenecks on the driver.
This patch refactors this logic to use Spark SQL's `text` data source to read files during this step. The text data source still performs some unnecessary file listing (since in theory we already have resolved the table prior to schema inference and therefore should be able to scan without performing _any_ extra listing), but that listing is much faster and takes place in parallel. In one production workload operating over tens of thousands of files, this change managed to reduce schema inference time from 7 minutes to 2 minutes.
A similar problem also affects the JSON file format and this patch originally fixed that as well, but I've decided to split that change into a separate patch so as not to conflict with changes in another JSON PR.
## How was this patch tested?
Existing unit tests, plus manual benchmarking on a production workload.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#15813 from JoshRosen/use-text-data-source-in-csv-and-json.
## What changes were proposed in this pull request?
This PR adds a sql conf `spark.sql.streaming.noDataReportInterval` to control how long to wait before outputing the next StreamProgressEvent when there is no data.
## How was this patch tested?
The added unit test.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#16108 from zsxwing/SPARK-18670.
## What changes were proposed in this pull request?
Two bugs are addressed here
1. INSERT OVERWRITE TABLE sometime crashed when catalog partition management was enabled. This was because when dropping partitions after an overwrite operation, the Hive client will attempt to delete the partition files. If the entire partition directory was dropped, this would fail. The PR fixes this by adding a flag to control whether the Hive client should attempt to delete files.
2. The static partition spec for OVERWRITE TABLE was not correctly resolved to the case-sensitive original partition names. This resulted in the entire table being overwritten if you did not correctly capitalize your partition names.
cc yhuai cloud-fan
## How was this patch tested?
Unit tests. Surprisingly, the existing overwrite table tests did not catch these edge cases.
Author: Eric Liang <ekl@databricks.com>
Closes#16088 from ericl/spark-18659.