## What changes were proposed in this pull request?
This patch appends a message to suggest users running refresh table or reloading data frames when Spark sees a FileNotFoundException due to stale, cached metadata.
## How was this patch tested?
Added a unit test for this in MetadataCacheSuite.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14003 from petermaxlee/SPARK-16336.
## What changes were proposed in this pull request?
This PR implements `posexplode` table generating function. Currently, master branch raises the following exception for `map` argument. It's different from Hive.
**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```
**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
| 0| a| 1|
| 1| b| 2|
+---+---+-----+
```
For `array` argument, `after` is the same with `before`.
```
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
| 0| 1|
| 1| 2|
| 2| 3|
+---+---+
```
## How was this patch tested?
Pass the Jenkins tests with newly added testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13971 from dongjoon-hyun/SPARK-16289.
## What changes were proposed in this pull request?
Force the sorter to Spill when number of elements in the pointer array reach a certain size. This is to workaround the issue of timSort failing on large buffer size.
## How was this patch tested?
Tested by running a job which was failing without this change due to TimSort bug.
Author: Sital Kedia <skedia@fb.com>
Closes#13107 from sitalkedia/fix_TimSort.
## What changes were proposed in this pull request?
Add Catalog.refreshTable API into python interface for Spark-SQL.
## How was this patch tested?
Existing test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13558 from WeichenXu123/update_python_sql_interface_refreshTable.
## What changes were proposed in this pull request?
This PR adds 3 optimizer rules for typed filter:
1. push typed filter down through `SerializeFromObject` and eliminate the deserialization in filter condition.
2. pull typed filter up through `SerializeFromObject` and eliminate the deserialization in filter condition.
3. combine adjacent typed filters and share the deserialized object among all the condition expressions.
This PR also adds `TypedFilter` logical plan, to separate it from normal filter, so that the concept is more clear and it's easier to write optimizer rules.
## How was this patch tested?
`TypedFilterOptimizationSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13846 from cloud-fan/filter.
## What changes were proposed in this pull request?
This PR allows `emptyDataFrame.write` since the user didn't specify any partition columns.
**Before**
```scala
scala> spark.emptyDataFrame.write.parquet("/tmp/t1")
org.apache.spark.sql.AnalysisException: Cannot use all columns for partition columns;
scala> spark.emptyDataFrame.write.csv("/tmp/t1")
org.apache.spark.sql.AnalysisException: Cannot use all columns for partition columns;
```
After this PR, there occurs no exceptions and the created directory has only one file, `_SUCCESS`, as expected.
## How was this patch tested?
Pass the Jenkins tests including updated test cases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13730 from dongjoon-hyun/SPARK-16006.
## What changes were proposed in this pull request?
This PR removes meaningless `StringIteratorReader` for CSV data source.
In `CSVParser.scala`, there is an `Reader` wrapping `Iterator` but there are two problems by this.
Firstly, it was actually not faster than processing line by line with Iterator due to additional logics to wrap `Iterator` to `Reader`.
Secondly, this brought a bit of complexity because it needs additional logics to allow every line to be read bytes by bytes. So, it was pretty difficult to figure out issues about parsing, (eg. SPARK-14103).
A benchmark was performed manually and the results were below:
- Original codes with Reader wrapping Iterator
|End-to-end (ns) | Parse Time (ns) |
|-----------------------|------------------------|
|14116265034 |2008277960 |
- New codes with Iterator
|End-to-end (ns) | Parse Time (ns) |
|-----------------------|------------------------|
|13451699644 | 1549050564 |
For the details for the environment, dataset and methods, please refer the JIRA ticket.
## How was this patch tested?
Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13808 from HyukjinKwon/SPARK-14480-small.
#### What changes were proposed in this pull request?
Based on the previous discussion with cloud-fan hvanhovell in another related PR https://github.com/apache/spark/pull/13764#discussion_r67994276, it looks reasonable to add convenience methods for users to add `comment` when defining `StructField`.
Currently, the column-related `comment` attribute is stored in `Metadata` of `StructField`. For example, users can add the `comment` attribute using the following way:
```Scala
StructType(
StructField(
"cl1",
IntegerType,
nullable = false,
new MetadataBuilder().putString("comment", "test").build()) :: Nil)
```
This PR is to add more user friendly methods for the `comment` attribute when defining a `StructField`. After the changes, users are provided three different ways to do it:
```Scala
val struct = (new StructType)
.add("a", "int", true, "test1")
val struct = (new StructType)
.add("c", StringType, true, "test3")
val struct = (new StructType)
.add(StructField("d", StringType).withComment("test4"))
```
#### How was this patch tested?
Added test cases:
- `DataTypeSuite` is for testing three types of API changes,
- `DataFrameReaderWriterSuite` is for parquet, json and csv formats - using in-memory catalog
- `OrcQuerySuite.scala` is for orc format using Hive-metastore
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13860 from gatorsmile/newMethodForComment.
## What changes were proposed in this pull request?
Change the return type mentioned in the JavaDoc for `toJavaRDD` / `javaRDD` to match the actual return type & be consistent with the scala rdd return type.
## How was this patch tested?
Docs only change.
Author: Holden Karau <holden@us.ibm.com>
Closes#13954 from holdenk/trivial-streaming-tojavardd-doc-fix.
## What changes were proposed in this pull request?
Fixes a couple old references to `DataFrameWriter.startStream` to `DataStreamWriter.start
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#13952 from brkyvz/minor-doc-fix.
#### What changes were proposed in this pull request?
koertkuipers identified the PR https://github.com/apache/spark/pull/13727/ changed the behavior of `load` API. After the change, the `load` API does not add the value of `path` into the `options`. Thank you!
This PR is to add the option `path` back to `load()` API in `DataFrameReader`, if and only if users specify one and only one `path` in the `load` API. For example, users can see the `path` option after the following API call,
```Scala
spark.read
.format("parquet")
.load("/test")
```
#### How was this patch tested?
Added test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13933 from gatorsmile/optionPath.
## What changes were proposed in this pull request?
Allowing truncate to a specific number of character is convenient at times, especially while operating from the REPL. Sometimes those last few characters make all the difference, and showing everything brings in whole lot of noise.
## How was this patch tested?
Existing tests. + 1 new test in DataFrameSuite.
For SparkR and pyspark, existing tests and manual testing.
Author: Prashant Sharma <prashsh1@in.ibm.com>
Author: Prashant Sharma <prashant@apache.org>
Closes#13839 from ScrapCodes/add_truncateTo_DF.show.
#### What changes were proposed in this pull request?
The API description of `createRelation` in `CreatableRelationProvider` is misleading. The current description only expects users to return the relation.
```Scala
trait CreatableRelationProvider {
def createRelation(
sqlContext: SQLContext,
mode: SaveMode,
parameters: Map[String, String],
data: DataFrame): BaseRelation
}
```
However, the major goal of this API should also include saving the `DataFrame`.
Since this API is critical for Data Source API developers, this PR is to correct the description.
#### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13903 from gatorsmile/readUnderscoreFiles.
## What changes were proposed in this pull request?
[SPARK-8118](https://github.com/apache/spark/pull/8196) implements redirecting Parquet JUL logger via SLF4J, but it is currently applied only when READ operations occurs. If users use only WRITE operations, there occurs many Parquet logs.
This PR makes the redirection work on WRITE operations, too.
**Before**
```scala
scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Jun 26, 2016 9:04:38 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
............ about 70 lines Parquet Log .............
scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
............ about 70 lines Parquet Log .............
```
**After**
```scala
scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
```
This PR also fixes some typos.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13918 from dongjoon-hyun/SPARK-16221.
## What changes were proposed in this pull request?
Spark currently shows all functions when issue a `SHOW FUNCTIONS` command. This PR refines the `SHOW FUNCTIONS` command by allowing users to select all functions, user defined function or system functions. The following syntax can be used:
**ALL** (default)
```SHOW FUNCTIONS```
```SHOW ALL FUNCTIONS```
**SYSTEM**
```SHOW SYSTEM FUNCTIONS```
**USER**
```SHOW USER FUNCTIONS```
## How was this patch tested?
Updated tests and added tests to the DDLSuite
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#13929 from hvanhovell/SPARK-16220.
## What changes were proposed in this pull request?
Add `conf` method to get Runtime Config from SparkSession
## How was this patch tested?
unit tests, manual tests
This is how it works in sparkR shell:
```
SparkSession available as 'spark'.
> conf()
$hive.metastore.warehouse.dir
[1] "file:/opt/spark-2.0.0-bin-hadoop2.6/R/spark-warehouse"
$spark.app.id
[1] "local-1466749575523"
$spark.app.name
[1] "SparkR"
$spark.driver.host
[1] "10.0.2.1"
$spark.driver.port
[1] "45629"
$spark.executorEnv.LD_LIBRARY_PATH
[1] "$LD_LIBRARY_PATH:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/jre/lib/amd64/server"
$spark.executor.id
[1] "driver"
$spark.home
[1] "/opt/spark-2.0.0-bin-hadoop2.6"
$spark.master
[1] "local[*]"
$spark.sql.catalogImplementation
[1] "hive"
$spark.submit.deployMode
[1] "client"
> conf("spark.master")
$spark.master
[1] "local[*]"
```
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13885 from felixcheung/rconf.
## What changes were proposed in this pull request?
Currently the initial buffer size in the sorter is hard coded inside the code and is too small for large workload. As a result, the sorter spends significant time expanding the buffer size and copying the data. It would be useful to have it configurable.
## How was this patch tested?
Tested by running a job on the cluster.
Author: Sital Kedia <skedia@fb.com>
Closes#13699 from sitalkedia/config_sort_buffer_upstream.
## What changes were proposed in this pull request?
One of the most frequent usage patterns for Spark SQL is using **cached tables**. This PR improves `InMemoryTableScanExec` to handle `IN` predicate efficiently by pruning partition batches. Of course, the performance improvement varies over the queries and the datasets. But, for the following simple query, the query duration in Spark UI goes from 9 seconds to 50~90ms. It's about over 100 times faster.
**Before**
```scala
$ bin/spark-shell --driver-memory 6G
scala> val df = spark.range(2000000000)
scala> df.createOrReplaceTempView("t")
scala> spark.catalog.cacheTable("t")
scala> sql("select id from t where id = 1").collect() // About 2 mins
scala> sql("select id from t where id = 1").collect() // less than 90ms
scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
```
**After**
```scala
scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
```
This PR has impacts over 35 queries of TPC-DS if the tables are cached.
Note that this optimization is applied for `IN`. To apply `IN` predicate having more than 10 items, `spark.sql.optimizer.inSetConversionThreshold` option should be increased.
## How was this patch tested?
Pass the Jenkins tests (including new testcases).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13887 from dongjoon-hyun/SPARK-16186.
## What changes were proposed in this pull request?
Allow to specify empty over clause in window expressions through dataset API
In SQL, its allowed to specify an empty OVER clause in the window expression.
```SQL
select area, sum(product) over () as c from windowData
where product > 3 group by area, product
having avg(month) > 0 order by avg(month), product
```
In this case the analytic function sum is presented based on all the rows of the result set
Currently its not allowed through dataset API and is handled in this PR.
## How was this patch tested?
Added a new test in DataframeWindowSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#13897 from dilipbiswal/spark-empty-over.
## What changes were proposed in this pull request?
This PR fixes `DataFrame.describe()` by forcing materialization to make the `Seq` serializable. Currently, `describe()` of DataFrame throws `Task not serializable` Spark exceptions when joining in Scala 2.10.
## How was this patch tested?
Manual. (After building with Scala 2.10, test on `bin/spark-shell` and `bin/pyspark`.)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13900 from dongjoon-hyun/SPARK-16173.
## What changes were proposed in this pull request?
One of the most frequent usage patterns for Spark SQL is using **cached tables**. This PR improves `InMemoryTableScanExec` to handle `IN` predicate efficiently by pruning partition batches. Of course, the performance improvement varies over the queries and the datasets. But, for the following simple query, the query duration in Spark UI goes from 9 seconds to 50~90ms. It's about over 100 times faster.
**Before**
```scala
$ bin/spark-shell --driver-memory 6G
scala> val df = spark.range(2000000000)
scala> df.createOrReplaceTempView("t")
scala> spark.catalog.cacheTable("t")
scala> sql("select id from t where id = 1").collect() // About 2 mins
scala> sql("select id from t where id = 1").collect() // less than 90ms
scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
```
**After**
```scala
scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
```
This PR has impacts over 35 queries of TPC-DS if the tables are cached.
Note that this optimization is applied for `IN`. To apply `IN` predicate having more than 10 items, `spark.sql.optimizer.inSetConversionThreshold` option should be increased.
## How was this patch tested?
Pass the Jenkins tests (including new testcases).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13887 from dongjoon-hyun/SPARK-16186.
## What changes were proposed in this pull request?
This PR fix the bug when Python UDF is used in explode (generator), GenerateExec requires that all the attributes in expressions should be resolvable from children when creating, we should replace the children first, then replace it's expressions.
```
>>> df.select(explode(f(*df))).show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/vlad/dev/spark/python/pyspark/sql/dataframe.py", line 286, in show
print(self._jdf.showString(n, truncate))
File "/home/vlad/dev/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/home/vlad/dev/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/vlad/dev/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o52.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
Generate explode(<lambda>(_1#0L)), false, false, [col#15L]
+- Scan ExistingRDD[_1#0L]
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:69)
at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:45)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:177)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:153)
at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:95)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:95)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:95)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:85)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:85)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2557)
at org.apache.spark.sql.Dataset.head(Dataset.scala:1923)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2138)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$13.apply(TreeNode.scala:413)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$13.apply(TreeNode.scala:413)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:412)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:387)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
... 42 more
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#20
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
at org.apache.spark.sql.execution.GenerateExec.<init>(GenerateExec.scala:63)
... 52 more
Caused by: java.lang.RuntimeException: Couldn't find pythonUDF0#20 in [_1#0L]
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
... 67 more
```
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#13883 from davies/udf_in_generate.
## What changes were proposed in this pull request?
This is a small patch to rewrite the predicate filter translation in DataSourceStrategy. The original code used excessive functional constructs (e.g. unzip) and was very difficult to understand.
## How was this patch tested?
Should be covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#13889 from rxin/simplify-predicate-filter.
## What changes were proposed in this pull request?
Replace use of `commons-lang` in favor of `commons-lang3` and forbid the former via scalastyle; remove `NotImplementedException` from `comons-lang` in favor of JDK `UnsupportedOperationException`
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#13843 from srowen/SPARK-16129.
## What changes were proposed in this pull request?
It's weird that `ParserUtils.operationNotAllowed` returns an exception and the caller throw it.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13874 from cloud-fan/style.
## What changes were proposed in this pull request?
This patch fixes an overflow bug in vectorized parquet reader where both off-heap and on-heap variants of `ColumnVector.reserve()` can unfortunately overflow while reserving additional capacity during reads.
## How was this patch tested?
Manual Tests
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13832 from sameeragarwal/negative-array.
## What changes were proposed in this pull request?
Currently, `readBatches` accumulator of `InMemoryTableScanExec` is updated only when `spark.sql.inMemoryColumnarStorage.partitionPruning` is true. Although this metric is used for only testing purpose, we had better have correct metric without considering SQL options.
## How was this patch tested?
Pass the Jenkins tests (including a new testcase).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13870 from dongjoon-hyun/SPARK-16165.
## What changes were proposed in this pull request?
This calculation of statistics is not trivial anymore, it could be very slow on large query (for example, TPC-DS Q64 took several minutes to plan).
During the planning of a query, the statistics of any logical plan should not change (even InMemoryRelation), so we should use `lazy val` to cache the statistics.
For InMemoryRelation, the statistics could be updated after materialization, it's only useful when used in another query (before planning), because once we finished the planning, the statistics will not be used anymore.
## How was this patch tested?
Testsed with TPC-DS Q64, it could be planned in a second after the patch.
Author: Davies Liu <davies@databricks.com>
Closes#13871 from davies/fix_statistics.
## What changes were proposed in this pull request?
When the user uses `ConsoleSink`, we should use a temp location if `checkpointLocation` is not specified.
## How was this patch tested?
The added unit test.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13817 from zsxwing/console-checkpoint.
## What changes were proposed in this pull request?
When table is created with column name containing dot, distinct() will fail to run. For example,
```scala
val rowRDD = sparkContext.parallelize(Seq(Row(1), Row(1), Row(2)))
val schema = StructType(Array(StructField("column.with.dot", IntegerType, nullable = false)))
val df = spark.createDataFrame(rowRDD, schema)
```
running the following will have no problem:
```scala
df.select(new Column("`column.with.dot`"))
```
but running the query with additional distinct() will cause exception:
```scala
df.select(new Column("`column.with.dot`")).distinct()
```
The issue is that distinct() will try to resolve the column name, but the column name in the schema does not have backtick with it. So the solution is to add the backtick before passing the column name to resolve().
## How was this patch tested?
Added a new test case.
Author: bomeng <bmeng@us.ibm.com>
Closes#13140 from bomeng/SPARK-15230.
## What changes were proposed in this pull request?
We embed partitioning logic in FileSourceStrategy.apply, making the function very long. This is a small refactoring to move it into its own functions. Eventually we would be able to move the partitioning functions into a physical operator, rather than doing it in physical planning.
## How was this patch tested?
This is a simple code move.
Author: Reynold Xin <rxin@databricks.com>
Closes#13862 from rxin/SPARK-16159.
## What changes were proposed in this pull request?
Seems the fix of SPARK-14959 breaks the parallel partitioning discovery. This PR fixes the problem
## How was this patch tested?
Tested manually. (This PR also adds a proper test for SPARK-14959)
Author: Yin Huai <yhuai@databricks.com>
Closes#13830 from yhuai/SPARK-16121.
#### What changes were proposed in this pull request?
This PR is to use the latest `SparkSession` to replace the existing `SQLContext` in `MLlib`. `SQLContext` is removed from `MLlib`.
Also fix a test case issue in `BroadcastJoinSuite`.
BTW, `SQLContext` is not being used in the `MLlib` test suites.
#### How was this patch tested?
Existing test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13380 from gatorsmile/sqlContextML.
## What changes were proposed in this pull request?
This PR let `CsvWriter` object is not created for each time but able to be reused. This way was taken after from JSON data source.
Original `CsvWriter` was being created for each row but it was enhanced in https://github.com/apache/spark/pull/13229. However, it still creates `CsvWriter` object for each `flush()` in `LineCsvWriter`. It seems it does not have to close the object and re-create this for every flush.
It follows the original logic as it is but `CsvWriter` is reused by reseting `CharArrayWriter`.
## How was this patch tested?
Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13809 from HyukjinKwon/write-perf.
## What changes were proposed in this pull request?
Add a configuration to allow people to set a minimum polling delay when no new data arrives (default is 10ms). This PR also cleans up some INFO logs.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13718 from zsxwing/SPARK-16002.
## What changes were proposed in this pull request?
1. FORMATTED is actually supported, but partition is not supported;
2. Remove parenthesis as it is not necessary just like anywhere else.
## How was this patch tested?
Minor issue. I do not think it needs a test case!
Author: bomeng <bmeng@us.ibm.com>
Closes#13791 from bomeng/SPARK-16084.
#### What changes were proposed in this pull request?
This PR is to fix the following bugs:
**Issue 1: Wrong Results when lowerBound is larger than upperBound in Column Partitioning**
```scala
spark.read.jdbc(
url = urlWithUserAndPass,
table = "TEST.seq",
columnName = "id",
lowerBound = 4,
upperBound = 0,
numPartitions = 3,
connectionProperties = new Properties)
```
**Before code changes:**
The returned results are wrong and the generated partitions are wrong:
```
Part 0 id < 3 or id is null
Part 1 id >= 3 AND id < 2
Part 2 id >= 2
```
**After code changes:**
Issue an `IllegalArgumentException` exception:
```
Operation not allowed: the lower bound of partitioning column is larger than the upper bound. lowerBound: 5; higherBound: 1
```
**Issue 2: numPartitions is more than the number of key values between upper and lower bounds**
```scala
spark.read.jdbc(
url = urlWithUserAndPass,
table = "TEST.seq",
columnName = "id",
lowerBound = 1,
upperBound = 5,
numPartitions = 10,
connectionProperties = new Properties)
```
**Before code changes:**
Returned correct results but the generated partitions are very inefficient, like:
```
Partition 0: id < 1 or id is null
Partition 1: id >= 1 AND id < 1
Partition 2: id >= 1 AND id < 1
Partition 3: id >= 1 AND id < 1
Partition 4: id >= 1 AND id < 1
Partition 5: id >= 1 AND id < 1
Partition 6: id >= 1 AND id < 1
Partition 7: id >= 1 AND id < 1
Partition 8: id >= 1 AND id < 1
Partition 9: id >= 1
```
**After code changes:**
Adjust `numPartitions` and can return the correct answers:
```
Partition 0: id < 2 or id is null
Partition 1: id >= 2 AND id < 3
Partition 2: id >= 3 AND id < 4
Partition 3: id >= 4
```
**Issue 3: java.lang.ArithmeticException when numPartitions is zero**
```Scala
spark.read.jdbc(
url = urlWithUserAndPass,
table = "TEST.seq",
columnName = "id",
lowerBound = 0,
upperBound = 4,
numPartitions = 0,
connectionProperties = new Properties)
```
**Before code changes:**
Got the following exception:
```
java.lang.ArithmeticException: / by zero
```
**After code changes:**
Able to return a correct answer by disabling column partitioning when numPartitions is equal to or less than zero
#### How was this patch tested?
Added test cases to verify the results
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13773 from gatorsmile/jdbcPartitioning.
## What changes were proposed in this pull request?
This pull request adds a new option (maxMalformedLogPerPartition) in CSV reader to limit the maximum of logging message Spark generates per partition for malformed records.
The error log looks something like
```
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged.
```
Closes#12173
## How was this patch tested?
Manually tested.
Author: Reynold Xin <rxin@databricks.com>
Closes#13795 from rxin/SPARK-13792.
## What changes were proposed in this pull request?
The property spark.streaming.stateStore.maintenanceInterval should be renamed and harmonized with other properties related to Structured Streaming like spark.sql.streaming.stateStore.minDeltasForSnapshot.
## How was this patch tested?
Existing unit tests.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#13777 from sarutak/SPARK-16061.
## What changes were proposed in this pull request?
Issues with current reader behavior.
- `text()` without args returns an empty DF with no columns -> inconsistent, its expected that text will always return a DF with `value` string field,
- `textFile()` without args fails with exception because of the above reason, it expected the DF returned by `text()` to have a `value` field.
- `orc()` does not have var args, inconsistent with others
- `json(single-arg)` was removed, but that caused source compatibility issues - [SPARK-16009](https://issues.apache.org/jira/browse/SPARK-16009)
- user specified schema was not respected when `text/csv/...` were used with no args - [SPARK-16007](https://issues.apache.org/jira/browse/SPARK-16007)
The solution I am implementing is to do the following.
- For each format, there will be a single argument method, and a vararg method. For json, parquet, csv, text, this means adding json(string), etc.. For orc, this means adding orc(varargs).
- Remove the special handling of text(), csv(), etc. that returns empty dataframe with no fields. Rather pass on the empty sequence of paths to the datasource, and let each datasource handle it right. For e.g, text data source, should return empty DF with schema (value: string)
- Deduped docs and fixed their formatting.
## How was this patch tested?
Added new unit tests for Scala and Java tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13727 from tdas/SPARK-15982.
## What changes were proposed in this pull request?
This PR adds the static partition support to INSERT statement when the target table is a data source table.
## How was this patch tested?
New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite.
**Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change.**
Author: Yin Huai <yhuai@databricks.com>
Closes#13769 from yhuai/SPARK-16030-1.
## What changes were proposed in this pull request?
This patch adds a text-based socket source similar to the one in Spark Streaming for debugging and tutorials. The source is clearly marked as debug-only so that users don't try to run it in production applications, because this type of source cannot provide HA without storing a lot of state in Spark.
## How was this patch tested?
Unit tests and manual tests in spark-shell.
Author: Matei Zaharia <matei@databricks.com>
Closes#13748 from mateiz/socket-source.
## What changes were proposed in this pull request?
`DataFrameWriter` can be used to append data to existing data source tables. It becomes tricky when partition columns used in `DataFrameWriter.partitionBy(columns)` don't match the actual partition columns of the underlying table. This pull request enforces the check so that the partition columns of these two always match.
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13749 from clockfly/SPARK-16034.
## What changes were proposed in this pull request?
The current table insertion has some weird behaviours:
1. inserting into a partitioned table with mismatch columns has confusing error message for hive table, and wrong result for datasource table
2. inserting into a partitioned table without partition list has wrong result for hive table.
This PR fixes these 2 problems.
## How was this patch tested?
new test in hive `SQLQuerySuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13754 from cloud-fan/insert2.
## What changes were proposed in this pull request?
Improve readability of `InMemoryTableScanExec.scala`, which has too much stuff in it.
## How was this patch tested?
Jenkins
Author: Andrew Or <andrew@databricks.com>
Closes#13742 from andrewor14/move-inmemory-relation.
## What changes were proposed in this pull request?
We cannot use `limit` on DataFrame in ConsoleSink because it will use a wrong planner. This PR just collects `DataFrame` and calls `show` on a batch DataFrame based on the result. This is fine since ConsoleSink is only for debugging.
## How was this patch tested?
Manually confirmed ConsoleSink now works with complete mode aggregation.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13740 from zsxwing/complete-console.