## What changes were proposed in this pull request?
We should be able to resolve a nested view. The main advantage is that if you update an underlying view, the current view also gets updated.
The new approach should be compatible with older versions of Spark/Hive, meaning:
1. The new approach should be able to resolve views created by older versions of Spark/Hive;
2. The new approach should be able to resolve views that are currently supported by Spark SQL.
The new approach mainly brings in the following changes:
1. Add a new operator called `View` to keep track of the `CatalogTable` that describes the view, its output attributes, and its child;
2. Update the `ResolveRelations` rule to resolve both relations and views; note that a nested view should be resolved correctly;
3. Add a `viewDefaultDatabase` variable to `CatalogTable` to keep track of the default database name used to resolve a view; if the `CatalogTable` is not a view, the variable should be `None`;
4. Add `AnalysisContext` to enable us to still support a view created with a CTE/window query;
5. Enable view support without enabling Hive support (i.e., `enableHiveSupport`);
6. Fix a weird behavior: the result of a view query may have a different schema if the referenced table has been changed. After this PR, we try to cast the child's output attributes to those from the view schema, and throw an AnalysisException if the cast is not allowed.
Note this is compatible with views defined by older versions of Spark (before 2.2), which have an empty `defaultDatabase` and have the database part defined for all relations in `viewText`.
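For illustration, a minimal nested-view scenario that this change is meant to resolve (table/view names are hypothetical):
```scala
// Hypothetical names; after this change, v2 resolves through v1,
// so updating v1 is reflected when querying v2.
spark.sql("CREATE TABLE t (id INT) USING parquet")
spark.sql("CREATE VIEW v1 AS SELECT id FROM t")
spark.sql("CREATE VIEW v2 AS SELECT id FROM v1")
spark.sql("SELECT * FROM v2").show()
```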
## How was this patch tested?
1. Add new tests in `SessionCatalogSuite` to test the function `lookupRelation`;
2. Add a new test case in `SQLViewSuite` to test resolving a nested view.
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes #16233 from jiangxb1987/resolve-view.
## What changes were proposed in this pull request?
Currently we have two sets of statistics in `LogicalPlan`: simple stats and stats estimated by CBO. The computing logic and naming are quite confusing, so we need to unify these two sets of stats.
## How was this patch tested?
Just modify existing tests.
Author: wangzhenhua <wangzhenhua@huawei.com>
Author: Zhenhua Wang <wzh_zju@163.com>
Closes #16529 from wzhfy/unifyStats.
## What changes were proposed in this pull request?
This PR allows update mode for non-aggregation streaming queries. It behaves the same as append mode if a query has no aggregations.
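As a sketch (the source/sink choices are hypothetical), a map-only streaming query could now use update mode:
```scala
// No aggregation here, so update mode behaves the same as append mode.
val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", "9999").load()
val query = lines.select("value").writeStream
  .outputMode("update")
  .format("console")
  .start()
```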
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #16520 from zsxwing/update-without-agg.
## What changes were proposed in this pull request?
Support cardinality estimation of aggregate operator
## How was this patch tested?
Add test cases
Author: Zhenhua Wang <wzh_zju@163.com>
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes #16431 from wzhfy/aggEstimation.
## What changes were proposed in this pull request?
Support cardinality estimation for project operator.
## How was this patch tested?
Add a test suite and a base class in the catalyst package.
Author: Zhenhua Wang <wzh_zju@163.com>
Closes #16430 from wzhfy/projectEstimation.
## What changes were proposed in this pull request?
Added a `to` call at the end of the code generated by `ScalaReflection.deserializerFor` if the requested type is not a supertype of `WrappedArray[_]`; it uses `CanBuildFrom[_, _, _]` to convert the result into an arbitrary subtype of `Seq[_]`.
Care was taken to preserve the original deserialization where possible, to avoid the overhead of conversion in cases where it is not needed.
`ScalaReflection.serializerFor` could already serialize any `Seq[_]`, so it was not altered.
`SQLImplicits` had to be altered and new implicit encoders added to permit serialization of other sequence types.
Also fixes [SPARK-16815]: `Dataset[List[T]]` leads to `ArrayStoreException`.
## How was this patch tested?
```bash
./build/mvn -DskipTests clean package && ./dev/run-tests
```
Also manual execution of the following sets of commands in the Spark shell:
```scala
case class TestCC(key: Int, letters: List[String])
val ds1 = sc.makeRDD(Seq(
  List("D"),
  List("S", "H"),
  List("F", "H"),
  List("D", "L", "L")
)).map(x => (x.length, x)).toDF("key", "letters").as[TestCC]
val test1 = ds1.map(_.key)
test1.show
```
```scala
case class X(l: List[String])
spark.createDataset(Seq(List("A"))).map(X).show
```
```scala
spark.sqlContext.createDataset(sc.parallelize(List(1) :: Nil)).collect
```
After adding arbitrary sequence support, I also tested with the following commands:
```scala
case class QueueClass(q: scala.collection.immutable.Queue[Int])
spark.createDataset(Seq(List(1,2,3))).map(x => QueueClass(scala.collection.immutable.Queue(x: _*))).map(_.q.dequeue).collect
```
Author: Michal Senkyr <mike.senkyr@gmail.com>
Closes #16240 from michalsenkyr/sql-caseclass-list-fix.
## What changes were proposed in this pull request?
Today we have different syntaxes for creating data source and Hive serde tables; we should unify them to avoid confusing users and to take a step toward making Hive a data source.
Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for details.
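For context, the two flavors being unified look roughly like this (illustrative only; see the PDF above for the exact proposed grammar):
```scala
// A data source table vs. a Hive serde table (names hypothetical).
spark.sql("CREATE TABLE ds_table (id INT, data STRING) USING parquet")
spark.sql("CREATE TABLE hive_table (id INT, data STRING) STORED AS parquet")
```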
TODO (for follow-up PRs):
1. TBLPROPERTIES is not added to the new syntax; we should decide whether to add it later.
2. `SHOW CREATE TABLE` should be updated to use the new syntax.
3. We should decide whether to change the behavior of `SET LOCATION`.
## How was this patch tested?
new tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes #16296 from cloud-fan/create-table.
## What changes were proposed in this pull request?
There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.
## How was this patch tested?
N/A since only docs or comments were updated.
Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>
Closes #16455 from neurons/np.structure_streaming_doc.
## What changes were proposed in this pull request?
Now that all aggregation functions support partial aggregation, we can remove the `supportsPartial` flag in `AggregateFunction`.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #16461 from cloud-fan/partial.
## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/16402 we made a mistake: when a double/float is infinity, the `Literal` codegen outputs a boxed value and causes wrong results.
This PR fixes this by special-casing infinity so that no boxed value is output.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes #16469 from cloud-fan/literal.
### What changes were proposed in this pull request?
The data in a managed table should be deleted after the table is dropped. However, if the partition location is not under the location of the partitioned table, it is not deleted as expected. Users can specify any location for a partition when they add it.
This PR is to delete partition location when dropping managed partitioned tables stored in `InMemoryCatalog`.
### How was this patch tested?
Added test cases for both HiveExternalCatalog and InMemoryCatalog
Author: gatorsmile <gatorsmile@gmail.com>
Closes #16448 from gatorsmile/unsetSerdeProp.
## What changes were proposed in this pull request?
Currently the collect_set/collect_list aggregation expressions don't support partial aggregation. This patch enables partial aggregation for them.
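Illustrative usage of the two aggregates that benefit (the data is hypothetical):
```scala
import spark.implicits._
import org.apache.spark.sql.functions.{collect_list, collect_set}

// With partial aggregation enabled, these can be partially aggregated
// per partition before the final merge.
val df = Seq((1, "a"), (1, "b"), (2, "a")).toDF("key", "value")
df.groupBy("key").agg(collect_set("value"), collect_list("value")).show()
```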
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #16371 from viirya/collect-partial-support.
## What changes were proposed in this pull request?
We add a CBO configuration to switch between default stats and estimated stats.
We also define a new statistics method `planStats` in `LogicalPlan` with the conf as its parameter, in order to pass the CBO switch and other estimation-related configurations in the future. `planStats` is used on the caller side (i.e., in Optimizer and Strategies) to make transformation decisions based on stats.
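A minimal sketch of flipping the switch, assuming it is exposed as a SQL conf named `spark.sql.cbo.enabled` (key name is my assumption):
```scala
// Assumed conf key; toggles between default stats and CBO-estimated stats.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.enabled", "false")
```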
## How was this patch tested?
Add a test case using a dummy LogicalPlan.
Author: Zhenhua Wang <wzh_zju@163.com>
Closes #16401 from wzhfy/cboSwitch.
### What changes were proposed in this pull request?
Remove useless `databaseName` from `SimpleCatalogRelation`.
### How was this patch tested?
Existing test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #16438 from gatorsmile/removeDBFromSimpleCatalogRelation.
### What changes were proposed in this pull request?
Fixed non-thread-safe functions used in SessionCatalog:
- refreshTable
- lookupRelation
### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes #16437 from gatorsmile/addSyncToLookUpTable.
## What changes were proposed in this pull request?
This PR proposes to fix test failures due to the different format of paths on Windows.
The failed tests are as follows:
```
ColumnExpressionSuite:
- input_file_name, input_file_block_start, input_file_block_length - FileScanRDD *** FAILED *** (187 milliseconds)
"file:///C:/projects/spark/target/tmp/spark-0b21b963-6cfa-411c-8d6f-e6a5e1e73bce/part-00001-c083a03a-e55e-4b05-9073-451de352d006.snappy.parquet" did not contain "C:\projects\spark\target\tmp\spark-0b21b963-6cfa-411c-8d6f-e6a5e1e73bce" (ColumnExpressionSuite.scala:545)
- input_file_name, input_file_block_start, input_file_block_length - HadoopRDD *** FAILED *** (172 milliseconds)
"file:/C:/projects/spark/target/tmp/spark-5d0afa94-7c2f-463b-9db9-2e8403e2bc5f/part-00000-f6530138-9ad3-466d-ab46-0eeb6f85ed0b.txt" did not contain "C:\projects\spark\target\tmp\spark-5d0afa94-7c2f-463b-9db9-2e8403e2bc5f" (ColumnExpressionSuite.scala:569)
- input_file_name, input_file_block_start, input_file_block_length - NewHadoopRDD *** FAILED *** (156 milliseconds)
"file:/C:/projects/spark/target/tmp/spark-a894c7df-c74d-4d19-82a2-a04744cb3766/part-00000-29674e3f-3fcf-4327-9b04-4dab1d46338d.txt" did not contain "C:\projects\spark\target\tmp\spark-a894c7df-c74d-4d19-82a2-a04744cb3766" (ColumnExpressionSuite.scala:598)
```
```
DataStreamReaderWriterSuite:
- source metadataPath *** FAILED *** (62 milliseconds)
org.mockito.exceptions.verification.junit.ArgumentsAreDifferent: Argument(s) are different! Wanted:
streamSourceProvider.createSource(
org.apache.spark.sql.SQLContext3b04133b,
"C:\projects\spark\target\tmp\streaming.metadata-b05db6ae-c8dc-4ce4-b0d9-1eb8c84876c0/sources/0",
None,
"org.apache.spark.sql.streaming.test",
Map()
);
-> at org.apache.spark.sql.streaming.test.DataStreamReaderWriterSuite$$anonfun$12.apply$mcV$sp(DataStreamReaderWriterSuite.scala:374)
Actual invocation has different arguments:
streamSourceProvider.createSource(
org.apache.spark.sql.SQLContext3b04133b,
"/C:/projects/spark/target/tmp/streaming.metadata-b05db6ae-c8dc-4ce4-b0d9-1eb8c84876c0/sources/0",
None,
"org.apache.spark.sql.streaming.test",
Map()
);
```
```
GlobalTempViewSuite:
- CREATE GLOBAL TEMP VIEW USING *** FAILED *** (110 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-960398ba-a0a1-45f6-a59a-d98533f9f519;
```
```
CreateTableAsSelectSuite:
- CREATE TABLE USING AS SELECT *** FAILED *** (0 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty string
- create a table, drop it and create another one with the same name *** FAILED *** (16 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty string
- create table using as select - with partitioned by *** FAILED *** (0 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty string
- create table using as select - with non-zero buckets *** FAILED *** (0 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty string
```
```
HiveMetadataCacheSuite:
- partitioned table is cached when partition pruning is true *** FAILED *** (532 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- partitioned table is cached when partition pruning is false *** FAILED *** (297 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
```
```
MultiDatabaseSuite:
- createExternalTable() to non-default database - with USE *** FAILED *** (954 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-0839d9a7-5e29-467a-9e3e-3e4cd618ee09;
- createExternalTable() to non-default database - without USE *** FAILED *** (500 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-c7e24d73-1d8f-45e8-ab7d-53a83087aec3;
- invalid database name and table names *** FAILED *** (31 milliseconds)
"Path does not exist: file:/C:projectsspark arget mpspark-15a2a494-3483-4876-80e5-ec396e704b77;" did not contain "`t:a` is not a valid name for tables/databases. Valid names only contain alphabet characters, numbers and _." (MultiDatabaseSuite.scala:296)
```
```
OrcQuerySuite:
- SPARK-8501: Avoids discovery schema from empty ORC files *** FAILED *** (15 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- Verify the ORC conversion parameter: CONVERT_METASTORE_ORC *** FAILED *** (78 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- converted ORC table supports resolving mixed case field *** FAILED *** (297 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
```
```
HadoopFsRelationTest - JsonHadoopFsRelationSuite, OrcHadoopFsRelationSuite, ParquetHadoopFsRelationSuite, SimpleTextHadoopFsRelationSuite:
- Locality support for FileScanRDD *** FAILED *** (15 milliseconds)
java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-383d1f13-8783-47fd-964d-9c75e5eec50f, expected: file:///
```
```
HiveQuerySuite:
- CREATE TEMPORARY FUNCTION *** FAILED *** (0 milliseconds)
java.net.MalformedURLException: For input string: "%5Cprojects%5Cspark%5Csql%5Chive%5Ctarget%5Cscala-2.11%5Ctest-classes%5CTestUDTF.jar"
- ADD FILE command *** FAILED *** (500 milliseconds)
java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\sql\hive\target\scala-2.11\test-classes\data\files\v1.txt
- ADD JAR command 2 *** FAILED *** (110 milliseconds)
org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive argetscala-2.11 est-classesdatafilessample.json;
```
```
PruneFileSourcePartitionsSuite:
- PruneFileSourcePartitions should not change the output of LogicalRelation *** FAILED *** (15 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
```
```
HiveCommandSuite:
- LOAD DATA LOCAL *** FAILED *** (109 milliseconds)
org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive argetscala-2.11 est-classesdatafilesemployee.dat;
- LOAD DATA *** FAILED *** (93 milliseconds)
java.net.URISyntaxException: Illegal character in opaque part at index 15: C:projectsspark arget mpemployee.dat7496657117354281006.tmp
- Truncate Table *** FAILED *** (78 milliseconds)
org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive argetscala-2.11 est-classesdatafilesemployee.dat;
```
```
HiveExternalCatalogBackwardCompatibilitySuite:
- make sure we can read table created by old version of Spark *** FAILED *** (0 milliseconds)
"[/C:/projects/spark/target/tmp/]spark-0554d859-74e1-..." did not equal "[C:\projects\spark\target\tmp\]spark-0554d859-74e1-..." (HiveExternalCatalogBackwardCompatibilitySuite.scala:213)
org.scalatest.exceptions.TestFailedException
- make sure we can alter table location created by old version of Spark *** FAILED *** (110 milliseconds)
java.net.URISyntaxException: Illegal character in opaque part at index 15: C:projectsspark arget mpspark-0e9b2c5f-49a1-4e38-a32a-c0ab1813a79f
```
```
ExternalCatalogSuite:
- create/drop/rename partitions should create/delete/rename the directory *** FAILED *** (610 milliseconds)
java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-4c24f010-18df-437b-9fed-990c6f9adece
```
```
SQLQuerySuite:
- describe functions - temporary user defined functions *** FAILED *** (16 milliseconds)
java.net.URISyntaxException: Illegal character in opaque part at index 22: C:projectssparksqlhive argetscala-2.11 est-classesTestUDTF.jar
- specifying database name for a temporary table is not allowed *** FAILED *** (125 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-a34c9814-a483-43f2-be29-37f616b6df91;
```
```
PartitionProviderCompatibilitySuite:
- convert partition provider to hive with repair table *** FAILED *** (281 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-ee5fc96d-8c7d-4ebf-8571-a1d62736473e;
- when partition management is enabled, new tables have partition provider hive *** FAILED *** (187 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-803ad4d6-3e8c-498d-9ca5-5cda5d9b2a48;
- when partition management is disabled, new tables have no partition provider *** FAILED *** (172 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-c9fda9e2-4020-465f-8678-52cd72d0a58f;
- when partition management is disabled, we preserve the old behavior even for new tables *** FAILED *** (203 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget
mpspark-f4a518a6-c49d-43d3-b407-0ddd76948e13;
- insert overwrite partition of legacy datasource table *** FAILED *** (188 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-f4a518a6-c49d-43d3-b407-0ddd76948e79;
- insert overwrite partition of new datasource table overwrites just partition *** FAILED *** (219 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-6ba3a88d-6f6c-42c5-a9f4-6d924a0616ff;
- SPARK-18544 append with saveAsTable - partition management true *** FAILED *** (173 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-cd234a6d-9cb4-4d1d-9e51-854ae9543bbd;
- SPARK-18635 special chars in partition values - partition management true *** FAILED *** (2 seconds, 967 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- SPARK-18635 special chars in partition values - partition management false *** FAILED *** (62 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- SPARK-18659 insert overwrite table with lowercase - partition management true *** FAILED *** (63 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- SPARK-18544 append with saveAsTable - partition management false *** FAILED *** (266 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- SPARK-18659 insert overwrite table files - partition management false *** FAILED *** (63 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- SPARK-18659 insert overwrite table with lowercase - partition management false *** FAILED *** (78 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- sanity check table setup *** FAILED *** (31 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- insert into partial dynamic partitions *** FAILED *** (47 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- insert into fully dynamic partitions *** FAILED *** (62 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- insert into static partition *** FAILED *** (78 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- overwrite partial dynamic partitions *** FAILED *** (63 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- overwrite fully dynamic partitions *** FAILED *** (47 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- overwrite static partition *** FAILED *** (63 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
```
```
MetastoreDataSourcesSuite:
- check change without refresh *** FAILED *** (203 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-00713fe4-ca04-448c-bfc7-6c5e9a2ad2a1;
- drop, change, recreate *** FAILED *** (78 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-2030a21b-7d67-4385-a65b-bb5e2bed4861;
- SPARK-15269 external data source table creation *** FAILED *** (78 milliseconds)
org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-4d50fd4a-14bc-41d6-9232-9554dd233f86;
- CTAS *** FAILED *** (109 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty string
- CTAS with IF NOT EXISTS *** FAILED *** (109 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty string
- CTAS: persisted partitioned bucketed data source table *** FAILED *** (0 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty string
- SPARK-15025: create datasource table with path with select *** FAILED *** (16 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty string
- CTAS: persisted partitioned data source table *** FAILED *** (47 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty string
```
```
HiveMetastoreCatalogSuite:
- Persist non-partitioned parquet relation into metastore as managed table using CTAS *** FAILED *** (16 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty string
- Persist non-partitioned orc relation into metastore as managed table using CTAS *** FAILED *** (16 milliseconds)
java.lang.IllegalArgumentException: Can not create a Path from an empty string
```
```
HiveUDFSuite:
- SPARK-11522 select input_file_name from non-parquet table *** FAILED *** (16 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
```
```
QueryPartitionSuite:
- SPARK-13709: reading partitioned Avro table with nested schema *** FAILED *** (250 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
```
```
ParquetHiveCompatibilitySuite:
- simple primitives *** FAILED *** (16 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- SPARK-10177 timestamp *** FAILED *** (0 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- array *** FAILED *** (16 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- map *** FAILED *** (16 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- struct *** FAILED *** (0 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
- SPARK-16344: array of struct with a single field named 'array_element' *** FAILED *** (15 milliseconds)
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
```
## How was this patch tested?
Manually tested via AppVeyor.
```
ColumnExpressionSuite:
- input_file_name, input_file_block_start, input_file_block_length - FileScanRDD (234 milliseconds)
- input_file_name, input_file_block_start, input_file_block_length - HadoopRDD (235 milliseconds)
- input_file_name, input_file_block_start, input_file_block_length - NewHadoopRDD (203 milliseconds)
```
```
DataStreamReaderWriterSuite:
- source metadataPath (63 milliseconds)
```
```
GlobalTempViewSuite:
- CREATE GLOBAL TEMP VIEW USING (436 milliseconds)
```
```
CreateTableAsSelectSuite:
- CREATE TABLE USING AS SELECT (171 milliseconds)
- create a table, drop it and create another one with the same name (422 milliseconds)
- create table using as select - with partitioned by (141 milliseconds)
- create table using as select - with non-zero buckets (125 milliseconds)
```
```
HiveMetadataCacheSuite:
- partitioned table is cached when partition pruning is true (3 seconds, 211 milliseconds)
- partitioned table is cached when partition pruning is false (1 second, 781 milliseconds)
```
```
MultiDatabaseSuite:
- createExternalTable() to non-default database - with USE (797 milliseconds)
- createExternalTable() to non-default database - without USE (640 milliseconds)
- invalid database name and table names (62 milliseconds)
```
```
OrcQuerySuite:
- SPARK-8501: Avoids discovery schema from empty ORC files (703 milliseconds)
- Verify the ORC conversion parameter: CONVERT_METASTORE_ORC (750 milliseconds)
- converted ORC table supports resolving mixed case field (625 milliseconds)
```
```
HadoopFsRelationTest - JsonHadoopFsRelationSuite, OrcHadoopFsRelationSuite, ParquetHadoopFsRelationSuite, SimpleTextHadoopFsRelationSuite:
- Locality support for FileScanRDD (296 milliseconds)
```
```
HiveQuerySuite:
- CREATE TEMPORARY FUNCTION (125 milliseconds)
- ADD FILE command (250 milliseconds)
- ADD JAR command 2 (609 milliseconds)
```
```
PruneFileSourcePartitionsSuite:
- PruneFileSourcePartitions should not change the output of LogicalRelation (359 milliseconds)
```
```
HiveCommandSuite:
- LOAD DATA LOCAL (1 second, 829 milliseconds)
- LOAD DATA (1 second, 735 milliseconds)
- Truncate Table (1 second, 641 milliseconds)
```
```
HiveExternalCatalogBackwardCompatibilitySuite:
- make sure we can read table created by old version of Spark (32 milliseconds)
- make sure we can alter table location created by old version of Spark (125 milliseconds)
- make sure we can rename table created by old version of Spark (281 milliseconds)
```
```
ExternalCatalogSuite:
- create/drop/rename partitions should create/delete/rename the directory (625 milliseconds)
```
```
SQLQuerySuite:
- describe functions - temporary user defined functions (31 milliseconds)
- specifying database name for a temporary table is not allowed (390 milliseconds)
```
```
PartitionProviderCompatibilitySuite:
- convert partition provider to hive with repair table (813 milliseconds)
- when partition management is enabled, new tables have partition provider hive (562 milliseconds)
- when partition management is disabled, new tables have no partition provider (344 milliseconds)
- when partition management is disabled, we preserve the old behavior even for new tables (422 milliseconds)
- insert overwrite partition of legacy datasource table (750 milliseconds)
- SPARK-18544 append with saveAsTable - partition management true (985 milliseconds)
- SPARK-18635 special chars in partition values - partition management true (3 seconds, 328 milliseconds)
- SPARK-18635 special chars in partition values - partition management false (2 seconds, 891 milliseconds)
- SPARK-18659 insert overwrite table with lowercase - partition management true (750 milliseconds)
- SPARK-18544 append with saveAsTable - partition management false (656 milliseconds)
- SPARK-18659 insert overwrite table files - partition management false (922 milliseconds)
- SPARK-18659 insert overwrite table with lowercase - partition management false (469 milliseconds)
- sanity check table setup (937 milliseconds)
- insert into partial dynamic partitions (2 seconds, 985 milliseconds)
- insert into fully dynamic partitions (1 second, 937 milliseconds)
- insert into static partition (1 second, 578 milliseconds)
- overwrite partial dynamic partitions (7 seconds, 561 milliseconds)
- overwrite fully dynamic partitions (1 second, 766 milliseconds)
- overwrite static partition (1 second, 797 milliseconds)
```
```
MetastoreDataSourcesSuite:
- check change without refresh (610 milliseconds)
- drop, change, recreate (437 milliseconds)
- SPARK-15269 external data source table creation (297 milliseconds)
- CTAS with IF NOT EXISTS (437 milliseconds)
- CTAS: persisted partitioned bucketed data source table (422 milliseconds)
- SPARK-15025: create datasource table with path with select (265 milliseconds)
- CTAS (438 milliseconds)
- CTAS with IF NOT EXISTS (469 milliseconds)
- CTAS: persisted partitioned bucketed data source table (406 milliseconds)
```
```
HiveMetastoreCatalogSuite:
- Persist non-partitioned parquet relation into metastore as managed table using CTAS (406 milliseconds)
- Persist non-partitioned orc relation into metastore as managed table using CTAS (313 milliseconds)
```
```
HiveUDFSuite:
- SPARK-11522 select input_file_name from non-parquet table (3 seconds, 144 milliseconds)
```
```
QueryPartitionSuite:
- SPARK-13709: reading partitioned Avro table with nested schema (1 second, 67 milliseconds)
```
```
ParquetHiveCompatibilitySuite:
- simple primitives (745 milliseconds)
- SPARK-10177 timestamp (375 milliseconds)
- array (407 milliseconds)
- map (409 milliseconds)
- struct (437 milliseconds)
- SPARK-16344: array of struct with a single field named 'array_element' (391 milliseconds)
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #16397 from HyukjinKwon/SPARK-18922-paths.
## What changes were proposed in this pull request?
`Literal` can use `CodegenContext.addReferenceObj` to implement codegen, instead of `CodegenFallback`. This also simplifies the generated code a little: previously we would generate `((Expression) references[1]).eval(null)`; now it's just `references[1]`.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes #16402 from cloud-fan/minor.
## What changes were proposed in this pull request?
Currently we implement `Aggregator` with `DeclarativeAggregate`, which serializes/deserializes the buffer object every time we process an input.
This PR implements `Aggregator` with `TypedImperativeAggregate` and avoids serializing/deserializing the buffer object many times. The benchmark shows we get about a 2x speedup.
For simple buffer objects that don't need serialization, we still go with `DeclarativeAggregate`, to avoid performance regressions.
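A sketch of an `Aggregator` with a non-primitive buffer, the shape that benefits from the `TypedImperativeAggregate` path (names and logic are illustrative):
```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class AvgBuffer(var sum: Long, var count: Long)

// The buffer is a case class, so it needs (de)serialization; previously
// that happened per input row.
object TypedAvg extends Aggregator[Long, AvgBuffer, Double] {
  def zero: AvgBuffer = AvgBuffer(0L, 0L)
  def reduce(b: AvgBuffer, a: Long): AvgBuffer = { b.sum += a; b.count += 1; b }
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = {
    b1.sum += b2.sum; b1.count += b2.count; b1
  }
  def finish(b: AvgBuffer): Double = b.sum.toDouble / b.count
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
// Usage: ds.select(TypedAvg.toColumn)
```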
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes #16383 from cloud-fan/aggregator.
## What changes were proposed in this pull request?
Statistics in `LogicalPlan` should use attributes to refer to columns rather than column names, because two columns from two relations can have the same column name. But `CatalogTable` doesn't have the concepts of attributes or broadcast hints in its statistics. Therefore, putting `Statistics` in `CatalogTable` is confusing.
We define a different statistics structure in `CatalogTable`, which is only responsible for interacting with the metastore, and is converted to statistics in `LogicalPlan` when it is used.
## How was this patch tested?
add test cases
Author: wangzhenhua <wangzhenhua@huawei.com>
Author: Zhenhua Wang <wzh_zju@163.com>
Closes #16323 from wzhfy/nameToAttr.
## What changes were proposed in this pull request?
SortPartitions and RedistributeData logical operators are not actually used and can be removed. Note that we do have a Sort operator (with global flag false) that subsumed SortPartitions.
## How was this patch tested?
Also updated test cases to reflect the removal.
Author: Reynold Xin <rxin@databricks.com>
Closes #16381 from rxin/SPARK-18973.
## What changes were proposed in this pull request?
Made update mode public. As part of that, here are the changes:
- Updated `DataStreamWriter` to accept "update"
- Changed the package of `InternalOutputModes` from o.a.s.sql to o.a.s.sql.catalyst
- Added update-mode state removal with watermark to `StateStoreSaveExec`
## How was this patch tested?
Added new tests in changed modules
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #16360 from tdas/SPARK-18234.
## What changes were proposed in this pull request?
When we append data to an existing table with `DataFrameWriter.saveAsTable`, we do various checks to make sure the appended data is consistent with the existing data.
However, we get the information about the existing table by matching the table relation, instead of looking at the table metadata. This is error-prone: e.g. we only check the number of columns for `HadoopFsRelation`, and we forget to check bucketing, etc.
This PR refactors the error checking to look at the metadata of the existing table, and fixes several bugs (see the sketch after the list):
* SPARK-18899: We forgot to check whether the specified bucketing matches the existing table, which may lead to a problematic table that has different bucketing in different data files.
* SPARK-18912: We forgot to check the number of columns for non-file-based data source tables.
* SPARK-18913: We didn't support appending data to a table with special column names.
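A sketch of the bucketing-mismatch case (SPARK-18899) that should now fail fast (`df1`/`df2` and the table name are hypothetical):
```scala
// Table created with 4 buckets...
df1.write.bucketBy(4, "id").sortBy("id").saveAsTable("t")
// ...appending with different bucketing should now raise an AnalysisException.
df2.write.mode("append").bucketBy(8, "id").saveAsTable("t")
```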
## How was this patch tested?
new regression test.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #16313 from cloud-fan/bug1.
## What changes were proposed in this pull request?
Currently `ImplicitTypeCasts` doesn't handle casts between `ArrayType`s. This is not convenient; we should add a rule to enable casting from `ArrayType(InternalType)` to `ArrayType(newInternalType)` (see the example after the goals).
Goals:
1. Add a rule to `ImplicitTypeCasts` to enable casting between `ArrayType`s;
2. Simplify `Percentile` and `ApproximatePercentile`.
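A minimal example of the kind of implicit widening this enables (assuming equality comparison between arrays triggers the new rule):
```scala
// array<int> on the left is implicitly cast to array<bigint> to match the right side.
spark.sql("SELECT array(1, 2, 3) = array(1L, 2L, 3L)").show()
```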
## How was this patch tested?
Updated test cases in `TypeCoercionSuite`.
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes #16057 from jiangxb1987/implicit-cast-complex-types.
## What changes were proposed in this pull request?
percentile_approx is the name used in Hive, and approx_percentile is the name used in Presto. approx_percentile is actually more consistent with our approx_count_distinct. Given that the cost of aliasing SQL functions is low (a one-liner), it's better to just alias them so they are easier to use.
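Both spellings should work after this change (the column/table names are hypothetical):
```scala
spark.sql("SELECT percentile_approx(col, 0.5) FROM t")  // Hive-style name
spark.sql("SELECT approx_percentile(col, 0.5) FROM t")  // Presto-style alias
```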
## How was this patch tested?
Technically I could add an end-to-end test to verify this one-line change, but it seemed too trivial to me.
Author: Reynold Xin <rxin@databricks.com>
Closes #16300 from rxin/SPARK-18892.
## What changes were proposed in this pull request?
Check whether Aggregation operators on a streaming subplan have aggregate expressions with isDistinct = true.
## How was this patch tested?
Added unit test
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #16289 from tdas/SPARK-18870.
## What changes were proposed in this pull request?
Right now, once a user sets the comment of a column with the create table command, he/she cannot update the comment. It would be useful to provide a public interface (e.g. SQL) to do that.
This PR implements the following SQL statement:
```
ALTER TABLE table [PARTITION partition_spec]
CHANGE [COLUMN] column_old_name column_new_name column_dataType
[COMMENT column_comment]
[FIRST | AFTER column_name];
```
For further expansion, we could support altering the `name`/`dataType`/`index` of a column too.
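Illustrative usage, assuming only the comment is changed (the table/column names are hypothetical; name and type stay the same in this PR):
```scala
// Updates just the column comment, per the grammar above.
spark.sql("ALTER TABLE t CHANGE COLUMN id id INT COMMENT 'primary identifier'")
```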
## How was this patch tested?
Add new test cases in `ExternalCatalogSuite` and `SessionCatalogSuite`.
Add sql file test for `ALTER TABLE CHANGE COLUMN` statement.
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes #15717 from jiangxb1987/change-column.
## What changes were proposed in this pull request?
After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] rather than a more specific type. It would be easier for interactive debugging to introduce a function that returns the BaseType.
## How was this patch tested?
N/A - this is a developer only feature used for interactive debugging. As long as it compiles, it should be good to go. I tested this in spark-shell.
Author: Reynold Xin <rxin@databricks.com>
Closes #16288 from rxin/SPARK-18869.
## What changes were proposed in this pull request?
This is a bug introduced by subquery handling. numberedTreeString (which uses generateTreeString under the hood) numbers trees including innerChildren (used to print subqueries), but apply (which uses getNodeNumbered) ignores innerChildren. As a result, apply(i) would return the wrong plan node if there are subqueries.
This patch fixes the bug.
## How was this patch tested?
Added a test case in SubquerySuite.scala to test both the depth-first traversal of numbering as well as making sure the two methods are consistent.
Author: Reynold Xin <rxin@databricks.com>
Closes #16277 from rxin/SPARK-18854.
## What changes were proposed in this pull request?
This patch reduces the default number-of-elements estimation for arrays and maps from 100 to 1. The issue with the number 100 is that, when nested (e.g. an array of maps), 100 * 100 would be used as the default size. This sounds like just an overestimation, which doesn't seem that bad (since it is usually better to overestimate than underestimate). However, due to the way we derive the size output for Project (new estimated column size / old estimated column size), this overestimation can become an underestimation. In this case it is actually safer in general to assume a default of 1 element.
## How was this patch tested?
This should be covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes #16274 from rxin/SPARK-18853.
## What changes were proposed in this pull request?
Move the checking of the GROUP BY column in a correlated scalar subquery from CheckAnalysis to Analysis, to fix a regression caused by SPARK-18504.
This problem can be reproduced with a simple script now:
```scala
Seq((1,1)).toDF("pk","pv").createOrReplaceTempView("p")
Seq((1,1)).toDF("ck","cv").createOrReplaceTempView("c")
sql("select * from p,c where p.pk=c.ck and c.cv = (select avg(c1.cv) from c c1 where c1.ck = p.pk)").show
```
The requirements are:
1. We need to reference the same table twice in both the parent query and the subquery. Here it is the table c.
2. We need to have a correlated predicate, but to a different table. Here it is from c (as c1) in the subquery to p in the parent.
3. We then "deduplicate" c1.ck in the subquery to `ck#<n1>#<n2>` at the `Project` above the `Aggregate` of `avg`. When we compare `ck#<n1>#<n2>` and the original group-by column `ck#<n1>` by their canonicalized forms, `#<n2> != #<n1>`. That's how we trigger the exception added in SPARK-18504.
## How was this patch tested?
SubquerySuite and a simplified version of TPCDS-Q32
Author: Nattavut Sutyanyong <nsy.can@gmail.com>
Closes #16246 from nsyca/18814.
## What changes were proposed in this pull request?
`OverwriteOptions` was introduced in https://github.com/apache/spark/pull/15705 to carry the information of static partitions. However, after further refactoring, this information became duplicated and we can remove `OverwriteOptions`.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes #15995 from cloud-fan/overwrite.
## What changes were proposed in this pull request?
Change the statement `SHOW TABLES [EXTENDED] [(IN|FROM) database_name] [[LIKE] 'identifier_with_wildcards'] [PARTITION(partition_spec)]` to the following statements:
- SHOW TABLES [(IN|FROM) database_name] [[LIKE] 'identifier_with_wildcards']
- SHOW TABLE EXTENDED [(IN|FROM) database_name] LIKE 'identifier_with_wildcards' [PARTITION(partition_spec)]
After this change, the `SHOW TABLE`/`SHOW TABLES` statements have the same syntax as Hive.
## How was this patch tested?
Modified the test sql file `show-tables.sql`;
Modified the test suite `DDLSuite`.
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes #16262 from jiangxb1987/show-table-extended.
## What changes were proposed in this pull request?
Fixes compile errors in generated code when the user has a case class with a `scala.collection.immutable.Map` instead of a `scala.collection.Map`. Since `ArrayBasedMapData.toScalaMap` returns the immutable version, we can make it work with both.
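A sketch of the shape that used to fail to compile in generated code (the names are hypothetical):
```scala
import spark.implicits._

case class MapHolder(m: scala.collection.immutable.Map[String, Int])

// Deserializing MapHolder exercises ArrayBasedMapData.toScalaMap,
// which returns the immutable Map this fix makes assignable.
Seq(MapHolder(Map("a" -> 1))).toDS().map(_.m("a")).collect()
```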
## How was this patch tested?
Additional unit tests.
Author: Andrew Ray <ray.andrew@gmail.com>
Closes #16161 from aray/fix-map-codegen.
The value of the "isSrcLocal" parameter passed to Hive's loadTable and
loadPartition methods needs to be set according to the user query (e.g.
"LOAD DATA LOCAL"), and not the current code that tries to guess what
it should be.
For existing versions of Hive the current behavior is probably ok, but
some recent changes in the Hive code changed the semantics slightly,
making code that incorrectly sets "isSrcLocal" to "true" do the wrong
thing: it would end up moving the parent directory of the files into
the final location, instead of the files themselves, resulting in a
table that cannot be read.
I modified HiveCommandSuite so that existing "LOAD DATA" tests are run
both in local and non-local mode, since the semantics are slightly different.
The tests include a few new checks to make sure the semantics follow
what Hive describes in its documentation.
Tested with existing unit tests and also ran some Hive integration tests
with a version of Hive containing the changes that surfaced the problem.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #16179 from vanzin/SPARK-18752.
## What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/15620, all of the Maven-based 2.0 Jenkins jobs time out consistently. As I pointed out in https://github.com/apache/spark/pull/15620#discussion_r91829129, it seems that the regression test is overkill and may hit the constant pool size limitation, which is a known issue that hasn't been fixed yet.
Since #15620 only fixes the code size limitation problem, we can simplify the test to avoid hitting the constant pool size limitation.
## How was this patch tested?
test only change
Author: Wenchen Fan <wenchen@databricks.com>
Closes #16244 from cloud-fan/minor.
## What changes were proposed in this pull request?
During column stats collection, the average and max length will be null if a column of string/binary type has only null values. To fix this, I use the default size when the avg/max length is null.
## How was this patch tested?
Add a test for handling null columns
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes #16243 from wzhfy/nullStats.
## What changes were proposed in this pull request?
1. In SparkStrategies.canBroadcast, add the check `plan.statistics.sizeInBytes >= 0`.
2. In LocalRelation.statistics, when calculating the statistics, change the size to BigInt so it won't overflow.
## How was this patch tested?
Added a test case to make sure `statistics.sizeInBytes` won't overflow.
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes #16175 from huaxingao/spark-17460.
## What changes were proposed in this pull request?
Typo fixes
## How was this patch tested?
Local build. Awaiting the official build.
Author: Jacek Laskowski <jacek@japila.pl>
Closes #16144 from jaceklaskowski/typo-fixes.
## What changes were proposed in this pull request?
`makeRootConverter` is only called with a `StructType` value. By making this method less general we can remove pattern matches, which are never actually hit outside of the test suite.
## How was this patch tested?
The existing tests.
Author: Nathan Howell <nhowell@godaddy.com>
Closes #16084 from NathanHowell/SPARK-18654.
## What changes were proposed in this pull request?
Fixes an AnalysisException for pivot queries whose group-by columns are expressions and not attributes, by substituting the expression's output attribute in the second aggregation and the final projection.
## How was this patch tested?
existing and additional unit tests
Author: Andrew Ray <ray.andrew@gmail.com>
Closes #16177 from aray/SPARK-17760.
## What changes were proposed in this pull request?
I jumped the gun on merging https://github.com/apache/spark/pull/16120, and missed a tiny potential problem. This PR fixes that by changing a val into a def; this should prevent potential serialization/initialization weirdness from happening.
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes #16170 from hvanhovell/SPARK-18634.
(Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-18572)
## What changes were proposed in this pull request?
Currently Spark answers the `SHOW PARTITIONS` command by fetching all of the table's partition metadata from the external catalog and constructing partition names therefrom. The Hive client has a `getPartitionNames` method which is many times faster for this purpose, with the performance improvement scaling with the number of partitions in a table.
To test the performance impact of this PR, I ran the `SHOW PARTITIONS` command on two Hive tables with large numbers of partitions. One table has ~17,800 partitions, and the other has ~95,000 partitions. For the purposes of this PR, I'll call the former table `table1` and the latter table `table2`. I ran 5 trials for each table with before-and-after versions of this PR. The results are as follows:
| Configuration | Times (seconds) |
|---|---|
| Spark at bdc8153, `SHOW PARTITIONS table1` | 7.901, 3.983, 4.018, 4.331, 4.261 |
| Spark at bdc8153, `SHOW PARTITIONS table2` | timed out after 10 minutes with a `SocketTimeoutException` |
| Spark at this PR, `SHOW PARTITIONS table1` | 3.801, 0.449, 0.395, 0.348, 0.336 |
| Spark at this PR, `SHOW PARTITIONS table2` | 5.184, 1.63, 1.474, 1.519, 1.41 |
Taking the best times from each trial, we get a 12x performance improvement for a table with ~17,800 partitions and at least a 426x improvement for a table with ~95,000 partitions. More significantly, the latter command doesn't even complete with the current code in master.
This is actually a patch we've been using in-house at VideoAmp since Spark 1.1. It's made all the difference in the practical usability of our largest tables. Even with tables with about 1,000 partitions there's a performance improvement of about 2-3x.
## How was this patch tested?
I added a unit test to `VersionsSuite` which tests that the Hive client's `getPartitionNames` method returns the correct number of partitions.
Author: Michael Allman <michael@videoamp.com>
Closes #15998 from mallman/spark-18572-list_partition_names.
## What changes were proposed in this pull request?
As reported in the Jira, there are some weird issues with exploding Python UDFs in SparkSQL.
The following test code can reproduce it. Note: the test code is reported to return wrong results in the Jira; however, as I tested on the master branch, it causes an exception and so can't return any result.
>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import *
>>>
>>> df = spark.range(10)
>>>
>>> def return_range(value):
... return [(i, str(i)) for i in range(value - 1, value + 1)]
...
>>> range_udf = udf(return_range, ArrayType(StructType([StructField("integer_val", IntegerType()),
... StructField("string_val", StringType())])))
>>>
>>> df.select("id", explode(range_udf(df.id))).show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/spark/python/pyspark/sql/dataframe.py", line 318, in show
print(self._jdf.showString(n, 20))
File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o126.showString.: java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:120)
at org.apache.spark.sql.execution.GenerateExec.consume(GenerateExec.scala:57)
The cause of this issue is, in `ExtractPythonUDFs` we insert `BatchEvalPythonExec` to run PythonUDFs in batch. `BatchEvalPythonExec` will add extra outputs (e.g., `pythonUDF0`) to original plan. In above case, the original `Range` only has one output `id`. After `ExtractPythonUDFs`, the added `BatchEvalPythonExec` has two outputs `id` and `pythonUDF0`.
Because the output of `GenerateExec` is given after analysis phase, in above case, it is the combination of `id`, i.e., the output of `Range`, and `col`. But in planning phase, we change `GenerateExec`'s child plan to `BatchEvalPythonExec` with additional output attributes.
It causes no problem in non-wholestage codegen, because when evaluating, the additional attributes are projected out of the final output of `GenerateExec`.
However, as `GenerateExec` now supports wholestage codegen, the framework will feed all the outputs of the child plan into `GenerateExec`. Then when consuming `GenerateExec`'s output data (i.e., calling `consume`), the number of output attributes differs from the number of output variables in wholestage codegen.
To solve this issue, this patch only gives the generator's output to `GenerateExec` after analysis phase. `GenerateExec`'s output is the combination of its child plan's output and the generator's output. So when we change `GenerateExec`'s child, its output is still correct.
## How was this patch tested?
Added test cases to PySpark.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #16120 from viirya/fix-py-udf-with-generator.
## What changes were proposed in this pull request?
This is kind of a long-standing bug. It was hidden until https://github.com/apache/spark/pull/15780, which may add `AssertNotNull` on top of `LambdaVariable` and thus enable subexpression elimination.
However, subexpression elimination will evaluate the common expressions at the beginning, which is invalid for `LambdaVariable`. `LambdaVariable` usually represents a loop variable, which can't be evaluated ahead of the loop.
This PR skips expressions containing `LambdaVariable` when doing subexpression elimination.
## How was this patch tested?
updated test in `DatasetAggregatorSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes #16143 from cloud-fan/aggregator.
## What changes were proposed in this pull request?
We currently have the function input_file_name to get the path of the input file, but don't have functions to get the block start offset and length. This patch introduces two functions (illustrated below):
1. input_file_block_start: returns the file block start offset, or -1 if not available.
2. input_file_block_length: returns the file block length, or -1 if not available.
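Illustrative query combining the existing and new functions (the table name is hypothetical):
```scala
spark.sql("""
  SELECT input_file_name(), input_file_block_start(), input_file_block_length()
  FROM parquet_table
""").show()
```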
## How was this patch tested?
Updated existing test cases in ColumnExpressionSuite that covered input_file_name to also cover the two new functions.
Author: Reynold Xin <rxin@databricks.com>
Closes #16133 from rxin/SPARK-18702.
## What changes were proposed in this pull request?
Fix for SPARK-18091, a bug where large if expressions cause the generated SpecificUnsafeProjection code to exceed the JVM code size limit.
This PR changes if expressions' code generation to place the generated code for the predicate, true value, and false value expressions in separate methods in the codegen context, so as to never generate combined code that is too long.
## How was this patch tested?
Added a unit test and also tested manually with the application (having transformations similar to the unit test) which caused the issue to be identified in the first place.
Author: Kapil Singh <kapsingh@adobe.com>
Closes #15620 from kapilsingh5050/SPARK-18091-IfCodegenFix.
## What changes were proposed in this pull request?
This fix puts an explicit list of operators that Spark supports for correlated subqueries.
## How was this patch tested?
Run sql/test, catalyst/test and add a new test case on Generate.
Author: Nattavut Sutyanyong <nsy.can@gmail.com>
Closes #16046 from nsyca/spark18455.0.
## What changes were proposed in this pull request?
This fixes the parser rule for matching named expressions, which didn't work for two reasons (example below):
1. The name match is not coerced to a regular expression (missing .r)
2. The surrounding literals are incorrect and attempt to escape a single quote, which is unnecessary
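A bracket-syntax path with a quoted name containing a space, which this fix makes work (the values are illustrative):
```scala
spark.sql("""SELECT get_json_object('{"a b": 1}', "$['a b']")""").show()
```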
## How was this patch tested?
This adds test cases for named expressions using the bracket syntax, including one with quoted spaces.
Author: Ryan Blue <blue@apache.org>
Closes #16107 from rdblue/SPARK-18677-fix-json-path.
### What changes were proposed in this pull request?
Added a test case for using joins with nested fields.
### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes #16110 from gatorsmile/followup-18674.
## What changes were proposed in this pull request?
Two bugs are addressed here
1. INSERT OVERWRITE TABLE sometimes crashed when catalog partition management was enabled. This was because when dropping partitions after an overwrite operation, the Hive client would attempt to delete the partition files; if the entire partition directory had been dropped, this would fail. The PR fixes this by adding a flag to control whether the Hive client should attempt to delete files.
2. The static partition spec for OVERWRITE TABLE was not correctly resolved to the case-sensitive original partition names. This resulted in the entire table being overwritten if you did not correctly capitalize your partition names.
cc yhuai cloud-fan
## How was this patch tested?
Unit tests. Surprisingly, the existing overwrite table tests did not catch these edge cases.
Author: Eric Liang <ekl@databricks.com>
Closes #16088 from ericl/spark-18659.
## What changes were proposed in this pull request?
This replaces uses of `TextOutputFormat` with an `OutputStream`, which will either write directly to the filesystem or indirectly via a compressor (if so configured). This avoids intermediate buffering.
The inverse of this (reading directly from a stream) is necessary for streaming large JSON records (when `wholeFile` is enabled), so I wanted to keep the read and write paths symmetric.
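A minimal sketch of the write-path idea, assuming a Hadoop `FileSystem` and an optional compression codec (the method is illustrative, not the code in this patch):
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.CompressionCodec

// Write records straight to an OutputStream (optionally wrapped by a codec),
// avoiding TextOutputFormat's intermediate buffering.
def writeRecords(
    path: Path,
    conf: Configuration,
    codec: Option[CompressionCodec],
    records: Iterator[String]): Unit = {
  val fs = path.getFileSystem(conf)
  val raw = fs.create(path)
  val out = codec.map(_.createOutputStream(raw)).getOrElse(raw)
  try {
    records.foreach { r =>
      out.write(r.getBytes("UTF-8"))
      out.write('\n')
    }
  } finally {
    out.close()
  }
}
```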
## How was this patch tested?
Existing unit tests.
Author: Nathan Howell <nhowell@godaddy.com>
Closes#16089 from NathanHowell/SPARK-18658.
## What changes were proposed in this pull request?
SPARK-18429 introduced a count-min sketch aggregate function for SQL, but the implementation and testing are more complicated than needed. This simplifies the test cases and removes support for data types that don't have clear equality semantics:
1. Removed support for floating point and decimal types.
2. Removed the heavy randomized tests. The underlying CountMinSketch implementation already has pretty good test coverage through randomized tests, and the SPARK-18429 change just adds an aggregate function wrapper around CountMinSketch. There is no need for randomized tests at three different levels of the implementation.
## How was this patch tested?
A lot of the change is to simplify test cases.
Author: Reynold Xin <rxin@databricks.com>
Closes#16093 from rxin/SPARK-18663.
## What changes were proposed in this pull request?
This PR makes `ExpressionEncoder.serializer.nullable` `false` for a flat encoder of a primitive type. It is currently `true`, which is too conservative.
While `ExpressionEncoder.schema` has the correct information (e.g. `<IntegerType, false>`), the `serializer.head.nullable` of an `ExpressionEncoder` obtained from `encoderFor[T]` is always `true`, which is too conservative.
This is accomplished by checking whether the type is one of the primitive types; if it is, `nullable` is set to `false`.
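A minimal sketch of the kind of check involved (illustrative only, not the exact patch):
```scala
import org.apache.spark.sql.types._

// Primitive (AnyVal-backed) types can never be null at the top level of a
// flat encoder, so their serializer field need not be nullable.
def serializerNullable(dt: DataType): Boolean = dt match {
  case BooleanType | ByteType | ShortType | IntegerType |
       LongType | FloatType | DoubleType => false
  case _ => true
}
```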
## How was this patch tested?
Added new tests for encoder and dataframe
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#15780 from kiszk/SPARK-18284.
## What changes were proposed in this pull request?
The current error message of USING join is quite confusing, for example:
```
scala> val df1 = List(1,2,3).toDS.withColumnRenamed("value", "c1")
df1: org.apache.spark.sql.DataFrame = [c1: int]
scala> val df2 = List(1,2,3).toDS.withColumnRenamed("value", "c2")
df2: org.apache.spark.sql.DataFrame = [c2: int]
scala> df1.join(df2, usingColumn = "c1")
org.apache.spark.sql.AnalysisException: using columns ['c1] can not be resolved given input columns: [c1, c2] ;;
'Join UsingJoin(Inner,List('c1))
:- Project [value#1 AS c1#3]
: +- LocalRelation [value#1]
+- Project [value#7 AS c2#9]
+- LocalRelation [value#7]
```
After this PR, it becomes:
```
scala> val df1 = List(1,2,3).toDS.withColumnRenamed("value", "c1")
df1: org.apache.spark.sql.DataFrame = [c1: int]
scala> val df2 = List(1,2,3).toDS.withColumnRenamed("value", "c2")
df2: org.apache.spark.sql.DataFrame = [c2: int]
scala> df1.join(df2, usingColumn = "c1")
org.apache.spark.sql.AnalysisException: USING column `c1` can not be resolved with the right join side, the right output is: [c2];
```
## How was this patch tested?
updated tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#16100 from cloud-fan/natural.
## What changes were proposed in this pull request?
Due to confusion between URIs and paths, in certain cases we escape partition values too many times, which causes some Hive client operations to fail or write data to the wrong location. This PR fixes at least some of these cases.
To my understanding, this is how values, filesystem paths, and URIs interact:
- Hive stores raw (unescaped) partition values that are returned to you directly when you call listPartitions.
- Internally, we convert these raw values to filesystem paths via `ExternalCatalogUtils.[un]escapePathName`.
- In some circumstances we store URIs instead of filesystem paths. When a path is converted to a URI via `path.toURI`, the escaped partition values are further URI-encoded. This means that to get a path back from a URI, you must call `new Path(new URI(uriTxt))` in order to decode the URI-encoded string.
- In `CatalogStorageFormat` we store URIs as strings. This makes it easy to forget to URI-decode the value before converting it into a path.
- Finally, the Hive client itself uses mostly Paths for representing locations, and only URIs occasionally.
In the future we should probably clean this up, perhaps by dropping use of URIs when unnecessary. We should also try fixing escaping for partition names as well as values, though names are unlikely to contain special characters.
cc mallman cloud-fan yhuai
## How was this patch tested?
Unit tests.
Author: Eric Liang <ekl@databricks.com>
Closes#16071 from ericl/spark-18635.
## What changes were proposed in this pull request?
For an input object of non-flat type, we can't encode it to a row if it's null, as Spark SQL doesn't allow the entire row to be null; only its columns can be null. That's why we forbid users to use top-level null objects in https://github.com/apache/spark/pull/13469
However, if users wrap a non-flat type with `Option`, then we may still encode a top-level null object to a row, which is not allowed.
This PR fixes this case, and suggests users wrap their type with `Tuple1` if they really want top-level null objects.
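For example (a sketch of the intended usage, assuming a SparkSession `spark` is in scope):
```scala
import spark.implicits._

// Disallowed after this PR: a top-level null (None) for a non-flat type.
// Seq(Some((1, "a")), None).toDS()  // now fails with a clear error

// Suggested alternative: wrap with Tuple1, so the null lives in a column.
Seq(Tuple1(Some((1, "a"))), Tuple1(None)).toDS()
```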
## How was this patch tested?
new test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#15979 from cloud-fan/option.
## What changes were proposed in this pull request?
Currently `SHOW TABLE EXTENDED` is not implemented in Spark 2.0. This PR implements the statement.
Goals:
1. Support `SHOW TABLES EXTENDED LIKE 'identifier_with_wildcards'` (see the sketch after this list);
2. Explicitly output an unsupported error message for `SHOW TABLES [EXTENDED] ... PARTITION` statement;
3. Improve test cases for `SHOW TABLES` statement.
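A sketch of the statements involved (the table pattern and partition spec are illustrative):
```scala
// Supported after this PR:
spark.sql("SHOW TABLES EXTENDED LIKE 'show_t*'").show(truncate = false)

// Explicitly rejected with an informative unsupported-operation error:
// spark.sql("SHOW TABLES EXTENDED LIKE 'show_t*' PARTITION (c = 'Us', d = 1)")
```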
## How was this patch tested?
1. Add new test cases in file `show-tables.sql`.
2. Modify tests for `SHOW TABLES` in `DDLSuite`.
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes#15958 from jiangxb1987/show-table-extended.
### What changes were proposed in this pull request?
The `constraints` of an operator are the expressions that evaluate to `true` for all the rows produced; that is, the expression result is neither `false` nor unknown (NULL). Thus, we can infer `IsNotNull` on all the constraints, which are generated by the operator's own predicates or propagated from its children. A constraint can be a complex expression, so for better usage of these constraints we try to push `IsNotNull` down to the lowest-level expressions (i.e., `Attribute`s). `IsNotNull` can be pushed through an expression when the expression is null-intolerant. (A null-intolerant expression always evaluates to NULL when its input is NULL.)
Below is the existing code we have for `IsNotNull` pushdown.
```Scala
private def scanNullIntolerantExpr(expr: Expression): Seq[Attribute] = expr match {
  case a: Attribute => Seq(a)
  case _: NullIntolerant | IsNotNull(_: NullIntolerant) =>
    expr.children.flatMap(scanNullIntolerantExpr)
  case _ => Seq.empty[Attribute]
}
```
**`IsNotNull` itself is not null-intolerant.** It converts `null` to `false`. If the expression does not include any `Not`-like expression, the function above works; otherwise, it can generate a wrong result. This PR fixes the function by removing `IsNotNull` from the inference. After the fix, when a constraint contains an `IsNotNull` expression, we infer new attribute-specific `IsNotNull` constraints if and only if `IsNotNull` appears at the root.
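A sketch of the fix as described above (a hedged reconstruction; the actual patch may differ in detail):
```scala
// IsNotNull is no longer treated as null-intolerant itself; it is only
// unwrapped when it appears at the root of a constraint.
private def inferIsNotNullConstraints(constraint: Expression): Seq[Expression] =
  constraint match {
    case IsNotNull(expr) => scanNullIntolerantExpr(expr).map(IsNotNull(_))
    case _ => scanNullIntolerantExpr(constraint).map(IsNotNull(_))
  }

private def scanNullIntolerantExpr(expr: Expression): Seq[Attribute] = expr match {
  case a: Attribute => Seq(a)
  case _: NullIntolerant => expr.children.flatMap(scanNullIntolerantExpr)
  case _ => Seq.empty[Attribute]
}
```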
Without the fix, the following test case will return empty.
```Scala
val data = Seq[java.lang.Integer](1, null).toDF("key")
data.filter("not key is not null").show()
```
Before the fix, the optimized plan is like
```
== Optimized Logical Plan ==
Project [value#1 AS key#3]
+- Filter (isnotnull(value#1) && NOT isnotnull(value#1))
+- LocalRelation [value#1]
```
After the fix, the optimized plan is like
```
== Optimized Logical Plan ==
Project [value#1 AS key#3]
+- Filter NOT isnotnull(value#1)
+- LocalRelation [value#1]
```
### How was this patch tested?
Added a test
Author: gatorsmile <gatorsmile@gmail.com>
Closes#16067 from gatorsmile/isNotNull2.
## What changes were proposed in this pull request?
The result of a `sum` aggregate function is typically a Decimal, Double or Long. Currently the output dataType is based on the input's dataType.
The `FunctionArgumentConversion` rule makes sure that the input is promoted to the largest type, which also ensures that the output uses a (hopefully) sufficiently large output dataType. The issue is that `sum` is in a resolved state when we cast the input type; this means that rules which assume the dataType of the expression no longer changes could have been applied in the meantime. This is what happens if we apply `WidenSetOperationTypes` before applying the casts, and it breaks analysis.
The most straightforward and future-proof solution is to make `sum` always output the widest dataType in its class (Long for integral types, Decimal for decimal types & Double for FloatType and DoubleType). This PR implements that solution.
We should move expression specific type casting rules into the given Expression at some point.
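For illustration, the widening described above as a sketch over Spark SQL types (the decimal widening amount is illustrative, not the actual implementation):
```scala
import org.apache.spark.sql.types._

// Always output the widest dataType in the input's class.
def sumResultType(input: DataType): DataType = input match {
  case d: DecimalType =>
    DecimalType(math.min(d.precision + 10, DecimalType.MAX_PRECISION), d.scale)
  case ByteType | ShortType | IntegerType | LongType => LongType
  case FloatType | DoubleType => DoubleType
  case other => other
}
```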
## How was this patch tested?
Added (regression) tests to SQLQueryTestSuite's `union.sql`.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#16063 from hvanhovell/SPARK-18622.
## What changes were proposed in this pull request?
`AggregateFunction` currently implements `ImplicitCastInputTypes` (which enables implicit input type casting). There are actually quite a few situations in which we don't need this, or require more control over our input. A recent example is the aggregate for `CountMinSketch`, which should only take string, binary or integral type inputs.
This PR removes `ImplicitCastInputTypes` from the `AggregateFunction` and makes a case-by-case decision on what kind of input validation we should use.
## How was this patch tested?
Refactoring only. Existing tests.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#16066 from hvanhovell/SPARK-18632.
## What changes were proposed in this pull request?
ExistenceJoin should be treated the same as LeftOuter and LeftAnti, not InnerLike and LeftSemi. This is not currently exposed because the rewrite of [NOT] EXISTS OR ... to ExistenceJoin happens in rule RewritePredicateSubquery, which is in a separate rule set and placed after the rule PushPredicateThroughJoin. During the transformation in the rule PushPredicateThroughJoin, an ExistenceJoin never exists.
The semantics of ExistenceJoin say we need to preserve all the rows from the left table through the join operation, as in a regular LeftOuter join. The ExistenceJoin augments the LeftOuter operation with a new column called exists, set to true when the join condition in the ON clause is true and false otherwise. Any filtering of rows happens in the Filter operation above the ExistenceJoin.
Example:
```
A(c1, c2): { (1, 1), (1, 2) }
// B can be any value as it is irrelevant in this example
B(c1): { (NULL) }

select A.*
from A
where exists (select 1 from B where A.c1 = A.c2)
   or A.c2 = 2
```
In this example, the correct result is all the rows from A. If the pattern ExistenceJoin around line 935 in Optimizer.scala is indeed active, the code will push down the predicate A.c1 = A.c2 to be a Filter on relation A, which will incorrectly filter the row (1,2) from A.
## How was this patch tested?
Since this is not an exposed case, no new test case is added. The scenario was discovered via a code review of another PR and confirmed to be valid with a peer.
Author: Nattavut Sutyanyong <nsy.can@gmail.com>
Closes#16044 from nsyca/spark-18614.
## What changes were proposed in this pull request?
This PR implements a new aggregate function to generate a count-min sketch; it is a wrapper around CountMinSketch.
## How was this patch tested?
add test cases
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes#15877 from wzhfy/cms.
## What changes were proposed in this pull request?
This PR makes `sbt unidoc` complete with Java 8.
This PR roughly includes several fixes as below:
- Fix unrecognisable class and method links in javadoc by changing them from `[[..]]` to `` `...` ``
```diff
- * A column that will be computed based on the data in a [[DataFrame]].
+ * A column that will be computed based on the data in a `DataFrame`.
```
- Fix throws annotations so that they are recognisable in javadoc
- Fix URL links to `<a href="http..."></a>`.
```diff
- * [[http://en.wikipedia.org/wiki/Decision_tree_learning Decision tree]] model for regression.
+ * <a href="http://en.wikipedia.org/wiki/Decision_tree_learning">
+ * Decision tree (Wikipedia)</a> model for regression.
```
```diff
- * see http://en.wikipedia.org/wiki/Receiver_operating_characteristic
+ * see <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">
+ * Receiver operating characteristic (Wikipedia)</a>
```
- Fix `<` and `>` by rewriting them as
  - `greater than`/`greater than or equal to` or `less than`/`less than or equal to` where applicable, or
  - wrapping them with `{{{...}}}` so they print in javadoc, or using `{@code ...}` or `{@literal ...}`. Please refer to https://github.com/apache/spark/pull/16013#discussion_r89665558
- Fix `</p>` complaint
## How was this patch tested?
Manually tested by `jekyll build` with Java 7 and 8
```
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
```
```
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#16013 from HyukjinKwon/SPARK-3359-errors-more.
## What changes were proposed in this pull request?
For the following workflow:
1. I have a column called time which is at minute-level precision in a streaming DataFrame
2. I want to perform groupBy time, count
3. Then I want my MemorySink to only have the last 30 minutes of counts, and I perform this with
`.where('time >= current_timestamp().cast("long") - 30 * 60)`
What happens is that the `filter` gets pushed down below the aggregation, so the filter runs on the source data of the aggregation instead of on the result of the aggregation (where I actually want to filter).
I guess the main issue here is that `current_timestamp` is non-deterministic in the streaming context and the filter shouldn't be pushed down.
Whether this requires us to store the `current_timestamp` for each trigger of the streaming job is something to discuss.
Furthermore, we want to persist current batch timestamp and watermark timestamp to the offset log so that these values are consistent across multiple executions of the same batch.
brkyvz zsxwing tdas
## How was this patch tested?
A test was added to StreamingAggregationSuite ensuring the above use case is handled. The test injects a stream of time values (in seconds) to a query that runs in complete mode and only outputs the (count) aggregation results for the past 10 seconds.
Author: Tyson Condie <tcondie@gmail.com>
Closes#15949 from tcondie/SPARK-18339.
## What changes were proposed in this pull request?
This is absolutely minor. PR https://github.com/apache/spark/pull/15595 uses `dt1.asNullable == dt2.asNullable` expressions in a few places. It is however more efficient to call `dt1.sameType(dt2)`. I have replaced every instance of the first pattern with the second pattern (3/5 were introduced by #15595).
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#16041 from hvanhovell/SPARK-18058.
## What changes were proposed in this pull request?
In #15764 we added a mechanism to detect whether a function is temporary or not; Hive functions are treated as non-temporary. Of the three Hive functions, "percentile" has now been implemented natively and "hash" has been removed, so we should update the list.
## How was this patch tested?
Unit tests.
Author: Shuai Lin <linshuai2012@gmail.com>
Closes#16049 from lins05/update-temp-function-detect-hive-list.
## What changes were proposed in this pull request?
Implement the percentile SQL function. It computes the exact percentile(s) of `expr` at the given percentage(s) `pc`, each in the range [0, 1].
## How was this patch tested?
Add a new test suite `PercentileSuite` to test percentile directly.
Updated related testcases in `ExpressionToSQLSuite`.
Author: jiangxingbo <jiangxb1987@gmail.com>
Author: 蒋星博 <jiangxingbo@meituan.com>
Author: jiangxingbo <jiangxingbo@meituan.com>
Closes#14136 from jiangxb1987/percentile.
## What changes were proposed in this pull request?
https://github.com/apache/spark/pull/15704 will fail if we use int literal in `DROP PARTITION`, and we have reverted it in branch-2.1.
This PR reverts it in the master branch and adds a regression test for it, to make sure the master branch stays healthy.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#16036 from cloud-fan/revert.
## What changes were proposed in this pull request?
We currently push down join conditions of a Left Anti join to both sides of the join. This is similar to Inner, Left Semi and Existence (a specialized left semi) joins. The problem is that this changes the semantics of the join; a left anti join filters out rows that match the join condition.
This PR fixes this by only pushing down conditions to the left hand side of the join. This is similar to the behavior of left outer join.
## How was this patch tested?
Added tests to `FilterPushdownSuite.scala` and created a SQLQueryTestSuite file for left anti joins with a regression test.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#16026 from hvanhovell/SPARK-18597.
## What changes were proposed in this pull request?
The `CollapseWindow` optimizer rule changes the order of output attributes. This modifies the output of the plan, which the optimizer cannot do. This also breaks things like `collect()` for which we use a `RowEncoder` that assumes that the output attributes of the executed plan are equal to those outputted by the logical plan.
## How was this patch tested?
I have updated an incorrect test in `CollapseWindowSuite`.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#16027 from hvanhovell/SPARK-18604.
## What changes were proposed in this pull request?
Janino can optimize `true ? a : b` into `a` and `false ? a : b` into `b`, and likewise an if/else with a literal condition, so we should use a literal for `ev.isNull` whenever possible.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#16008 from ueshin/issues/SPARK-18585.
### What changes were proposed in this pull request?
Currently, the name validation checks are limited to table creation. They are enforced by the Analyzer rule `PreWriteCheck`.
However, table renaming and database creation have the same issues. It makes more sense to do the checks in `SessionCatalog`. This PR adds them to `SessionCatalog`.
### How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#16018 from gatorsmile/nameValidate.
## What changes were proposed in this pull request?
Currently, `OuterReference` is not a `NamedExpression`, so it raises a `ClassCastException` when it is used in the projection list of an IN correlated subquery. This PR supports that case by making `OuterReference` a `NamedExpression`, so that correct error messages can be shown.
```scala
scala> sql("CREATE TEMPORARY VIEW t1 AS SELECT * FROM VALUES 1, 2 AS t1(a)")
scala> sql("CREATE TEMPORARY VIEW t2 AS SELECT * FROM VALUES 1 AS t2(b)")
scala> sql("SELECT a FROM t1 WHERE a IN (SELECT a FROM t2)").show
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.OuterReference cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression
```
## How was this patch tested?
Pass the Jenkins test with new test cases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#16015 from dongjoon-hyun/SPARK-17251-2.
## What changes were proposed in this pull request?
The nullability of `InputFileName` should be `false`.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#16007 from ueshin/issues/SPARK-18583.
## What changes were proposed in this pull request?
The expression `in(empty seq)` is invalid in some data sources. Since `in(empty seq)` is always false, the optimizer should rewrite it to a false literal.
The SQL `SELECT * FROM t WHERE a IN ()` throws a `ParseException`, which is consistent with Hive; that behavior doesn't need to change.
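A minimal sketch of the rewrite (illustrative; the real change is exercised by `OptimizeInSuite`):
```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, In, Literal}

// An IN over an empty value list can never be true, so fold it to false.
def rewriteEmptyIn(e: Expression): Expression = e match {
  case In(_, Seq()) => Literal(false)
  case other => other
}
```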
## How was this patch tested?
Add new test case in `OptimizeInSuite`.
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes#15977 from jiangxb1987/isin-empty.
## What changes were proposed in this pull request?
In `HyperLogLogPlusPlus`, if the relative error is so small that p >= 19, an `ArrayIndexOutOfBoundsException` is thrown by `THRESHOLDS(p - 4)`. We should check `p` and, when p >= 19, fall back to the original HLL result and use its small-range correction.
The PR also fixes the upper bound in the log message of the `require()`.
The upper bound is computed by:
```
val relativeSD = 1.106d / Math.pow(Math.E, p * Math.log(2.0d) / 2.0d)
```
which is derived from the equation for computing `p`:
```
val p = 2.0d * Math.log(1.106d / relativeSD) / Math.log(2.0d)
```
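Plugging the maximum supported precision p = 18 into the first equation gives the corrected bound (a quick check, not code from the patch):
```scala
// Smallest relative error reachable without forcing p >= 19:
val minRelativeSD = 1.106d / math.pow(math.E, 18 * math.log(2.0d) / 2.0d)
// = 1.106 / 2^9 ≈ 0.00216; anything smaller would require p >= 19, where the
// fix falls back to the original HLL result and its small-range correction.
```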
## How was this patch tested?
Add test cases for:
1. checking the validity of the parameter `relativeSD`;
2. estimation with a smaller relative error, so that p >= 19.
Author: Zhenhua Wang <wzh_zju@163.com>
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes#15990 from wzhfy/hllppRsd.
## What changes were proposed in this pull request?
This PR only tries to fix things that look pretty straightforward and were already fixed elsewhere in previous PRs.
This PR roughly fixes several things as below:
- Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` ``
```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found
[error] * Loads text files and returns a {link DataFrame} whose schema starts with a string column named
```
- Fix an exception annotation and remove code backticks in `throws` annotation
Currently, sbt unidoc with Java 8 complains as below:
```
[error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text
[error] * throws StreamingQueryException, if <code>this</code> query has terminated with an exception.
```
`throws` should specify the correct class name from `StreamingQueryException,` to `StreamingQueryException` without backticks. (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)).
- Fix `[[http..]]` to `<a href="http..."></a>`.
```diff
- * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
- * blog page]].
+ * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
+ * Oracle blog page</a>.
```
`[[http...]]` link markdown in scaladoc is unrecognisable in javadoc.
- It seems a class can't have a `return` annotation, so two such cases were removed.
```
[error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return
[error] * return New instance of IsotonicRegression.
```
- Fix `<` to `&lt;` and `>` to `&gt;` according to HTML rules.
- Fix `</p>` complaint
- Exclude tags unrecognisable in javadoc: `constructor`, `todo` and `groupname`.
## How was this patch tested?
Manually tested by `jekyll build` with Java 7 and 8
```
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
```
```
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
```
Note: this does not yet make sbt unidoc succeed with Java 8, but it reduces the number of errors with Java 8.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#15999 from HyukjinKwon/SPARK-3359-errors.
## What changes were proposed in this pull request?
- Raise an AnalysisException when correlated predicates exist in the descendant operators of either operand of a full outer join in a subquery, as well as in the FOJ operator itself
- Raise an AnalysisException when correlated predicates exist in a Window operator (a side effect inadvertently introduced by SPARK-17348)
## How was this patch tested?
Ran sql/test and catalyst/test, and added new test cases to SubquerySuite showing the reported incorrect results.
Author: Nattavut Sutyanyong <nsy.can@gmail.com>
Closes#16005 from nsyca/FOJ-incorrect.1.
## What changes were proposed in this pull request?
The current implementation of column stats uses the base64 encoding of the internal UnsafeRow format to persist statistics (in table properties in the Hive metastore). This is an internal format that is not stable across different versions of Spark and should NOT be used for persistence. In addition, it would be better if the statistics stored in the catalog were human readable.
This pull request introduces the following changes:
1. Created a single ColumnStat class for all data types. All data types track the same set of statistics.
2. Updated the implementation for stats collection to get rid of the dependency on internal data structures (e.g. InternalRow, or storing DateType as an int32). For example, previously dates were stored as a single integer, but are now stored as java.sql.Date. When we implement the next steps of CBO, we can add code to convert those back into internal types again.
3. Documented clearly what JVM data types are being used to store what data.
4. Defined a simple Map[String, String] interface for serializing and deserializing column stats into/from the catalog (sketched after this list).
5. Rearranged the method/function structure so it is more clear what the supported data types are, and also moved how stats are generated into ColumnStat class so they are easy to find.
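For illustration, the human-readable shape this enables (the keys below are hypothetical, not the exact property names):
```scala
// Hypothetical serialized form of one column's stats in table properties.
val intColStats: Map[String, String] = Map(
  "distinctCount" -> "100",
  "min" -> "1",
  "max" -> "99",
  "nullCount" -> "0",
  "avgLen" -> "4",
  "maxLen" -> "4")
```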
## How was this patch tested?
Removed most of the original test cases created for column statistics, and added three very simple ones to cover all the cases. The three test cases validate:
1. Roundtrip serialization works.
2. Behavior when analyzing a non-existent column or a column of an unsupported data type.
3. Result for stats collection for all valid data types.
Also moved parser related tests into a parser test suite and added an explicit serialization test for the Hive external catalog.
Author: Reynold Xin <rxin@databricks.com>
Closes#15959 from rxin/SPARK-18522.
## What changes were proposed in this pull request?
In Spark SQL, some expressions may output safe-format values, e.g. `CreateArray`, `CreateStruct`, `Cast`, etc. When we compare two values, we should be able to compare safe and unsafe formats.
The `GreaterThan`, `LessThan`, etc. in Spark SQL already handles it, but the `EqualTo` doesn't. This PR fixes it.
## How was this patch tested?
new unit test and regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#15929 from cloud-fan/type-aware.
## What changes were proposed in this pull request?
This PR proposes throwing an `AnalysisException` with a proper message rather than a `NoSuchElementException` with the message `key not found: TimestampType` when unsupported types are given to the `reflect` and `java_method` functions.
```scala
spark.range(1).selectExpr("reflect('java.lang.String', 'valueOf', cast('1990-01-01' as timestamp))")
```
produces
**Before**
```
java.util.NoSuchElementException: key not found: TimestampType
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection$$anonfun$findMethod$1$$anonfun$apply$1.apply(CallMethodViaReflection.scala:159)
...
```
**After**
```
cannot resolve 'reflect('java.lang.String', 'valueOf', CAST('1990-01-01' AS TIMESTAMP))' due to data type mismatch: arguments from the third require boolean, byte, short, integer, long, float, double or string expressions; line 1 pos 0;
'Project [unresolvedalias(reflect(java.lang.String, valueOf, cast(1990-01-01 as timestamp)), Some(<function1>))]
+- Range (0, 1, step=1, splits=Some(2))
...
```
The added message is:
```
arguments from the third require boolean, byte, short, integer, long, float, double or string expressions
```
## How was this patch tested?
Tests added in `CallMethodViaReflection`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#15694 from HyukjinKwon/SPARK-18179.
## What changes were proposed in this pull request?
Fixes the inconsistency of the error raised between data source and Hive serde tables when a schema is specified in a CTAS scenario. In the process, the grammar for create table (data source) is simplified.
**before:**
``` SQL
spark-sql> create table t2 (c1 int, c2 int) using parquet as select * from t1;
Error in query:
mismatched input 'as' expecting {<EOF>, '.', 'OPTIONS', 'CLUSTERED', 'PARTITIONED'}(line 1, pos 64)
== SQL ==
create table t2 (c1 int, c2 int) using parquet as select * from t1
----------------------------------------------------------------^^^
```
**After:**
```SQL
spark-sql> create table t2 (c1 int, c2 int) using parquet as select * from t1
> ;
Error in query:
Operation not allowed: Schema may not be specified in a Create Table As Select (CTAS) statement(line 1, pos 0)
== SQL ==
create table t2 (c1 int, c2 int) using parquet as select * from t1
^^^
```
## How was this patch tested?
Added a new test in CreateTableAsSelectSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#15968 from dilipbiswal/ctas.
## What changes were proposed in this pull request?
While this behavior is debatable, consider the following use case:
```sql
UNCACHE TABLE foo;
CACHE TABLE foo AS
SELECT * FROM bar
```
The command above fails the first time you run it. But I want to run the command above over and over again, and I don't want to change my code just for the first run of it.
The issue is that subsequent `CACHE TABLE` commands do not overwrite the existing table.
Now we can do:
```sql
UNCACHE TABLE IF EXISTS foo;
CACHE TABLE foo AS
SELECT * FROM bar
```
## How was this patch tested?
Unit tests
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#15896 from brkyvz/uncache.
## What changes were proposed in this pull request?
This PR blocks an incorrect result scenario in scalar subqueries where there are GROUP BY column(s) that are not part of the correlated predicate(s).
Example:
```scala
// Incorrect result
Seq(1).toDF("c1").createOrReplaceTempView("t1")
Seq((1, 1), (1, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
sql("select (select sum(-1) from t2 where t1.c1 = t2.c1 group by t2.c2) from t1").show
// How can selecting a scalar subquery from a 1-row table return 2 rows?
```
## How was this patch tested?
sql/test, catalyst/test; a new test case covering the reported problem is added to SubquerySuite.scala.
Author: Nattavut Sutyanyong <nsy.can@gmail.com>
Closes#15936 from nsyca/scalarSubqueryIncorrect-1.
## What changes were proposed in this pull request?
Technically, map type is not orderable, but it could be used in equality comparison. However, due to limitations of the current implementation, map type can't be used in equality comparison, so it can't be a join key or a grouping key.
This PR makes this limitation explicit, to avoid wrong results.
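For example, the following are now rejected at analysis time instead of silently producing wrong results (a sketch; the exact error text may differ):
```scala
import org.apache.spark.sql.functions._

val df = spark.range(2).select(map(lit(1), lit("a")).as("m"))
// df.groupBy("m").count()                                // rejected: map as grouping key
// df.as("a").join(df.as("b"), col("a.m") === col("b.m")) // rejected: map as join key
```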
## How was this patch tested?
updated tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#15956 from cloud-fan/map-type.
## What changes were proposed in this pull request?
The nullabilities of `MapObjects` can be made stricter by relying on `inputObject.nullable` and `lambdaFunction.nullable`.
Also, `ExternalMapToCatalyst.dataType` can be made stricter by relying on `valueConverter.nullable`.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#15840 from ueshin/issues/SPARK-18398.
## What changes were proposed in this pull request?
This PR extracts a method for preparing arguments from `StaticInvoke`, `Invoke` and `NewInstance`, and modifies them to short-circuit if the arguments contain `null` when `propagateNull == true`.
The steps are as follows:
1. Introduce `InvokeLike` to extract the common argument-preparation logic from `StaticInvoke`, `Invoke` and `NewInstance`.
`StaticInvoke` and `Invoke` risked exceeding the 64KB JVM method-size limit when preparing arguments, but after this patch they can handle it because they share the argument-preparation code of `NewInstance`, which handles the limit well.
2. Remove unneeded null checking and fix the nullability of `NewInstance`.
This avoids some null checks that are not needed because the expression is not nullable.
3. Modify to short-circuit if the arguments contain `null` when `needNullCheck == true`.
If `needNullCheck == true`, preparing the arguments can be skipped once one of them is found to be `null`, so the code short-circuits in that case.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#15901 from ueshin/issues/SPARK-18467.
## What changes were proposed in this pull request?
This PR adds code generation to `Generate`. It supports two code paths:
- General `TraversableOnce` based iteration. This used for regular `Generator` (code generation supporting) expressions. This code path expects the expression to return a `TraversableOnce[InternalRow]` and it will iterate over the returned collection. This PR adds code generation for the `stack` generator.
- Specialized `ArrayData/MapData` based iteration. This is used for the `explode`, `posexplode` & `inline` functions and operates directly on the `ArrayData`/`MapData` result that the child of the generator returns.
### Benchmarks
I have added some benchmarks and it seems we can create a nice speedup for explode:
#### Environment
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
Intel(R) Core(TM) i7-4980HQ CPU 2.80GHz
```
#### Explode Array
##### Before
```
generate explode array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate explode array wholestage off 7377 / 7607 2.3 439.7 1.0X
generate explode array wholestage on 6055 / 6086 2.8 360.9 1.2X
```
##### After
```
generate explode array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate explode array wholestage off 7432 / 7696 2.3 443.0 1.0X
generate explode array wholestage on 631 / 646 26.6 37.6 11.8X
```
#### Explode Map
##### Before
```
generate explode map: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate explode map wholestage off 12792 / 12848 1.3 762.5 1.0X
generate explode map wholestage on 11181 / 11237 1.5 666.5 1.1X
```
##### After
```
generate explode map: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate explode map wholestage off 10949 / 10972 1.5 652.6 1.0X
generate explode map wholestage on 870 / 913 19.3 51.9 12.6X
```
#### Posexplode
##### Before
```
generate posexplode array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate posexplode array wholestage off 7547 / 7580 2.2 449.8 1.0X
generate posexplode array wholestage on 5786 / 5838 2.9 344.9 1.3X
```
##### After
```
generate posexplode array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate posexplode array wholestage off 7535 / 7548 2.2 449.1 1.0X
generate posexplode array wholestage on 620 / 624 27.1 37.0 12.1X
```
#### Inline
##### Before
```
generate inline array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate inline array wholestage off 6935 / 6978 2.4 413.3 1.0X
generate inline array wholestage on 6360 / 6400 2.6 379.1 1.1X
```
##### After
```
generate inline array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate inline array wholestage off 6940 / 6966 2.4 413.6 1.0X
generate inline array wholestage on 1002 / 1012 16.7 59.7 6.9X
```
#### Stack
##### Before
```
generate stack: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate stack wholestage off 12980 / 13104 1.3 773.7 1.0X
generate stack wholestage on 11566 / 11580 1.5 689.4 1.1X
```
##### After
```
generate stack: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate stack wholestage off 12875 / 12949 1.3 767.4 1.0X
generate stack wholestage on 840 / 845 20.0 50.0 15.3X
```
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@databricks.com>
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#13065 from hvanhovell/SPARK-15214.
## What changes were proposed in this pull request?
The previous documentation and example for DateDiff were wrong.
## How was this patch tested?
Doc only change.
Author: Reynold Xin <rxin@databricks.com>
Closes#15937 from rxin/datediff-doc.
## What changes were proposed in this pull request?
The nullability of `WrapOption` should be `false`.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#15887 from ueshin/issues/SPARK-18442.