## What changes were proposed in this pull request?
We allow users to specify hints (currently only "broadcast" is supported) in SQL and DataFrame. However, while SQL has a standard hint format (/*+ ... */), DataFrame doesn't have one and sometimes users are confused that they can't find how to apply a broadcast hint. This ticket adds a generic hint function on DataFrame that allows using the same hint on DataFrames as well as SQL.
As an example, after this patch, the following will apply a broadcast hint on a DataFrame using the new hint function:
```
df1.join(df2.hint("broadcast"))
```
## How was this patch tested?
Added a test case in DataFrameJoinSuite.
Author: Reynold Xin <rxin@databricks.com>
Closes#17839 from rxin/SPARK-20576.
## What changes were proposed in this pull request?
Within the same streaming query, when one `StreamingRelation` is referred multiple times – e.g. `df.union(df)` – we should transform it only to one `StreamingExecutionRelation`, instead of two or more different `StreamingExecutionRelation`s (each of which would have a separate set of source, source logs, ...).
## How was this patch tested?
Added two test cases, each of which would fail without this patch.
Author: Liwei Lin <lwlin7@gmail.com>
Closes#17735 from lw-lin/SPARK-20441.
## What changes were proposed in this pull request?
Fix build warnings primarily related to Breeze 0.13 operator changes, Java style problems
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#17803 from srowen/SPARK-20523.
It is not valid to eagerly bind with the child's output as this causes failures when we attempt to canonicalize the plan (replacing the attribute references with dummies).
Author: Michael Armbrust <michael@databricks.com>
Closes#17838 from marmbrus/fixBindExplode.
### What changes were proposed in this pull request?
This is a follow-up of enabling test cases in DDLSuite with Hive Metastore. It consists of the following remaining tasks:
- Run all the `alter table` and `drop table` DDL tests against data source tables when using Hive metastore.
- Do not run any `alter table` and `drop table` DDL test against Hive serde tables when using InMemoryCatalog.
- Reenable `alter table: set serde partition` and `alter table: set serde` tests for Hive serde tables.
### How was this patch tested?
N/A
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17524 from gatorsmile/cleanupDDLSuite.
## What changes were proposed in this pull request?
A fix for the same problem was made in #17693 but ignored `JsonToStructs`. This PR uses the same fix for `JsonToStructs`.
## How was this patch tested?
Regression test
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#17826 from brkyvz/SPARK-20549.
## What changes were proposed in this pull request?
As #17773 revealed `OnHeapColumnVector` may copy a part of the original storage.
`OffHeapColumnVector` reallocation also copies to the new storage data up to 'elementsAppended'. This variable is only updated when using the `ColumnVector.appendX` API, while `ColumnVector.putX` is more commonly used.
This PR copies the new storage data up to the previously-allocated size in`OffHeapColumnVector`.
## How was this patch tested?
Existing test suites
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#17811 from kiszk/SPARK-20537.
## What changes were proposed in this pull request?
Add support for the SQL standard distinct predicate to SPARK SQL.
```
<expression> IS [NOT] DISTINCT FROM <expression>
```
## How was this patch tested?
Tested using unit tests, integration tests, manual tests.
Author: ptkool <michael.styles@shopify.com>
Closes#17764 from ptkool/is_not_distinct_from.
## What changes were proposed in this pull request?
Avoid failing to initCause on JDBC exception with cause initialized to null
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#17800 from srowen/SPARK-20459.
## What changes were proposed in this pull request?
Generate exec does not produce `null` values if the generator for the input row is empty and the generate operates in outer mode without join. This is caused by the fact that the `join=false` code path is different from the `join=true` code path, and that the `join=false` code path did deal with outer properly. This PR addresses this issue.
## How was this patch tested?
Updated `outer*` tests in `GeneratorFunctionSuite`.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#17810 from hvanhovell/SPARK-20534.
## What changes were proposed in this pull request?
Currently, when the type string is invalid, it looks printing empty parentheses. This PR proposes a small improvement in an error message by removing it in the parse as below:
```scala
spark.range(1).select($"col".cast("aa"))
```
**Before**
```
org.apache.spark.sql.catalyst.parser.ParseException:
DataType aa() is not supported.(line 1, pos 0)
== SQL ==
aa
^^^
```
**After**
```
org.apache.spark.sql.catalyst.parser.ParseException:
DataType aa is not supported.(line 1, pos 0)
== SQL ==
aa
^^^
```
## How was this patch tested?
Unit tests in `DataTypeParserSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17784 from HyukjinKwon/SPARK-20492.
## What changes were proposed in this pull request?
This PR proposes to fill up the documentation with examples for `bitwiseOR`, `bitwiseAND`, `bitwiseXOR`. `contains`, `asc` and `desc` in `Column` API.
Also, this PR fixes minor typos in the documentation and matches some of the contents between Scala doc and Python doc.
Lastly, this PR suggests to use `spark` rather than `sc` in doc tests in `Column` for Python documentation.
## How was this patch tested?
Doc tests were added and manually tested with the commands below:
`./python/run-tests.py --module pyspark-sql`
`./python/run-tests.py --module pyspark-sql --python-executable python3`
`./dev/lint-python`
Output was checked via `make html` under `./python/docs`. The snapshots will be left on the codes with comments.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17737 from HyukjinKwon/SPARK-20442.
## What changes were proposed in this pull request?
It seems we are using `SQLUtils.getSQLDataType` for type string in structField. It looks we can replace this with `CatalystSqlParser.parseDataType`.
They look similar DDL-like type definitions as below:
```scala
scala> Seq(Tuple1(Tuple1("a"))).toDF.show()
```
```
+---+
| _1|
+---+
|[a]|
+---+
```
```scala
scala> Seq(Tuple1(Tuple1("a"))).toDF.select($"_1".cast("struct<_1:string>")).show()
```
```
+---+
| _1|
+---+
|[a]|
+---+
```
Such type strings looks identical when R’s one as below:
```R
> write.df(sql("SELECT named_struct('_1', 'a') as struct"), "/tmp/aa", "parquet")
> collect(read.df("/tmp/aa", "parquet", structType(structField("struct", "struct<_1:string>"))))
struct
1 a
```
R’s one is stricter because we are checking the types via regular expressions in R side ahead.
Actual logics there look a bit different but as we check it ahead in R side, it looks replacing it would not introduce (I think) no behaviour changes. To make this sure, the tests dedicated for it were added in SPARK-20105. (It looks `structField` is the only place that calls this method).
## How was this patch tested?
Existing tests - https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L143-L194 should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17785 from HyukjinKwon/SPARK-20493.
What changes were proposed in this pull request?
remove AggregateBenchmark testsuite warning:
such as '14:26:33.220 WARN org.apache.spark.sql.execution.aggregate.HashAggregateExec: Two level hashmap is disabled but vectorized hashmap is enabled.'
How was this patch tested?
unit tests: AggregateBenchmark
Modify the 'ignore function for 'test funtion
Author: caoxuewen <cao.xuewen@zte.com.cn>
Closes#17771 from heary-cao/AggregateBenchmark.
## What changes were proposed in this pull request?
This pr added a new rule in `Analyzer` to resolve aliases in `GROUP BY`.
The current master throws an exception if `GROUP BY` clauses have aliases in `SELECT`;
```
scala> spark.sql("select a a1, a1 + 1 as b, count(1) from t group by a1")
org.apache.spark.sql.AnalysisException: cannot resolve '`a1`' given input columns: [a]; line 1 pos 51;
'Aggregate ['a1], [a#83L AS a1#87L, ('a1 + 1) AS b#88, count(1) AS count(1)#90L]
+- SubqueryAlias t
+- Project [id#80L AS a#83L]
+- Range (0, 10, step=1, splits=Some(8))
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
```
## How was this patch tested?
Added tests in `SQLQuerySuite` and `SQLQueryTestSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#17191 from maropu/SPARK-14471.
### What changes were proposed in this pull request?
```SQL
hive> create table t1(`a,` string);
OK
Time taken: 1.399 seconds
hive> create table t2(`a,` string, b string);
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 3 elements while columns.types has 2 elements!)
hive> create table t2(`a,` string, b string) stored as parquet;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException: ParquetHiveSerde initialization failed. Number of column name and column type differs. columnNames = [a, , b], columnTypes = [string, string]
```
It has a bug in Hive metastore.
When users do not provide alias name in the SELECT query, we call `toPrettySQL` to generate the alias name. For example, the string `get_json_object(jstring, '$.f1')` will be the alias name for the function call in the statement
```SQL
SELECT key, get_json_object(jstring, '$.f1') FROM tempView
```
Above is not an issue for the SELECT query statements. However, for CTAS, we hit the issue due to a bug in Hive metastore. Hive metastore does not like the column names containing commas and returned a confusing error message, like:
```
17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
```
Thus, this PR is to block users to create a table in Hive metastore when the table table has a column containing commas in the name.
### How was this patch tested?
Added a test case
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17781 from gatorsmile/blockIllegalColumnNames.
## What changes were proposed in this pull request?
When sending accumulator updates back to driver, the network overhead is pretty big as there are a lot of accumulators, e.g. `TaskMetrics` will send about 20 accumulators everytime, there may be a lot of `SQLMetric` if the query plan is complicated.
Therefore, it's critical to reduce the size of serialized accumulator. A simple way is to not send the name of internal accumulators to executor side, as it's unnecessary. When executor sends accumulator updates back to driver, we can look up the accumulator name in `AccumulatorContext` easily. Note that, we still need to send names of normal accumulators, as the user code run at executor side may rely on accumulator names.
In the future, we should reimplement `TaskMetrics` to not rely on accumulators and use custom serialization.
Tried on the example in https://issues.apache.org/jira/browse/SPARK-12837, the size of serialized accumulator has been cut down by about 40%.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17596 from cloud-fan/oom.
## What changes were proposed in this pull request?
Relax the requirement that a `TimeZoneAwareExpression` has to have its `timeZoneId` set to be considered resolved.
With this change, a `Cast` (which is a `TimeZoneAwareExpression`) can be considered resolved if the `(fromType, toType)` combination doesn't require time zone information.
Also de-relaxed test cases in `CastSuite` so Casts in that test suite don't get a default`timeZoneId = Option("GMT")`.
## How was this patch tested?
Ran the de-relaxed`CastSuite` and it's passing. Also ran the SQL unit tests and they're passing too.
Author: Kris Mok <kris.mok@databricks.com>
Closes#17777 from rednaxelafx/fix-catalyst-cast-timezone.
## What changes were proposed in this pull request?
Spark 2.2 is going to be cut, it'll be great if SPARK-12868 can be resolved before that. There have been several PRs for this like [PR#16324](https://github.com/apache/spark/pull/16324) , but all of them are inactivity for a long time or have been closed.
This PR added a SparkUrlStreamHandlerFactory, which relies on 'protocol' to choose the appropriate
UrlStreamHandlerFactory like FsUrlStreamHandlerFactory to create URLStreamHandler.
## How was this patch tested?
1. Add a new unit test.
2. Check manually.
Before: throw an exception with " failed unknown protocol: hdfs"
<img width="914" alt="screen shot 2017-03-17 at 9 07 36 pm" src="https://cloud.githubusercontent.com/assets/8546874/24075277/5abe0a7c-0bd5-11e7-900e-ec3d3105da0b.png">
After:
<img width="1148" alt="screen shot 2017-03-18 at 11 42 18 am" src="https://cloud.githubusercontent.com/assets/8546874/24075283/69382a60-0bd5-11e7-8d30-d9405c3aaaba.png">
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes#17342 from weiqingy/SPARK-18910.
## What changes were proposed in this pull request?
OnHeapColumnVector reallocation copies to the new storage data up to 'elementsAppended'. This variable is only updated when using the ColumnVector.appendX API, while ColumnVector.putX is more commonly used.
## How was this patch tested?
Tested using existing unit tests.
Author: Michal Szafranski <michal@databricks.com>
Closes#17773 from michal-databricks/spark-20474.
## What changes were proposed in this pull request?
ColumnVector implementations originally did not support some Catalyst types (float, short, and boolean). Now that they do, those types should be also added to the ColumnVector.Array.
## How was this patch tested?
Tested using existing unit tests.
Author: Michal Szafranski <michal@databricks.com>
Closes#17772 from michal-databricks/spark-20473.
## What changes were proposed in this pull request?
change to using Jackson's `com.fasterxml.jackson.core.JsonFactory`
public JsonParser createParser(String content)
## How was this patch tested?
existing unit tests
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Eric Wasserman <ericw@sgn.com>
Closes#17693 from ewasserman/SPARK-20314.
## What changes were proposed in this pull request?
This patch adds support for customizing the spark session by injecting user-defined custom extensions. This allows a user to add custom analyzer rules/checks, optimizer rules, planning strategies or even a customized parser.
## How was this patch tested?
Unit Tests in SparkSessionExtensionSuite
Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
Closes#17724 from sameeragarwal/session-extensions.
## What changes were proposed in this pull request?
In `randomSplit`, It is possible that the underlying dataset doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized which could result in overlapping
splits.
To prevent this, as part of SPARK-12662, we explicitly sort each input partition to make the ordering deterministic. Given that `MapTypes` cannot be sorted this patch explicitly prunes them out from the sort order. Additionally, if the resulting sort order is empty, this patch then materializes the dataset to guarantee determinism.
## How was this patch tested?
Extended `randomSplit on reordered partitions` in `DataFrameStatSuite` to also test for dataframes with mapTypes nested mapTypes.
Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
Closes#17751 from sameeragarwal/randomsplit2.
### What changes were proposed in this pull request?
`spark.catalog.listTables` and `spark.catalog.getTable` does not work if we are unable to retrieve table metadata due to any reason (e.g., table serde class is not accessible or the table type is not accepted by Spark SQL). After this PR, the APIs still return the corresponding Table without the description and tableType)
### How was this patch tested?
Added a test case
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17730 from gatorsmile/listTables.
## What changes were proposed in this pull request?
This pr initialised `RangeExec` parameters in a driver side.
In the current master, a query below throws `NullPointerException`;
```
sql("SET spark.sql.codegen.wholeStage=false")
sql("SELECT * FROM range(1)").show
17/04/20 17:11:05 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:54)
at org.apache.spark.sql.execution.RangeExec.numSlices(basicPhysicalOperators.scala:343)
at org.apache.spark.sql.execution.RangeExec$$anonfun$20.apply(basicPhysicalOperators.scala:506)
at org.apache.spark.sql.execution.RangeExec$$anonfun$20.apply(basicPhysicalOperators.scala:505)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:320)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
```
## How was this patch tested?
Added a test in `DataFrameRangeSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#17717 from maropu/SPARK-20430.
## What changes were proposed in this pull request?
This PR avoids an exception in the case where `scala.math.BigInt` has a value that does not fit into long value range (e.g. `Long.MAX_VALUE+1`). When we run the following code by using the current Spark, the following exception is thrown.
This PR keeps the value using `BigDecimal` if we detect such an overflow case by catching `ArithmeticException`.
Sample program:
```
case class BigIntWrapper(value:scala.math.BigInt)```
spark.createDataset(BigIntWrapper(scala.math.BigInt("10000000000000000002"))::Nil).show
```
Exception:
```
Error while encoding: java.lang.ArithmeticException: BigInteger out of long range
staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), apply, assertnotnull(assertnotnull(input[0, org.apache.spark.sql.BigIntWrapper, true])).value, true) AS value#0
java.lang.RuntimeException: Error while encoding: java.lang.ArithmeticException: BigInteger out of long range
staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), apply, assertnotnull(assertnotnull(input[0, org.apache.spark.sql.BigIntWrapper, true])).value, true) AS value#0
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454)
at org.apache.spark.sql.Agg$$anonfun$18.apply$mcV$sp(MySuite.scala:192)
at org.apache.spark.sql.Agg$$anonfun$18.apply(MySuite.scala:192)
at org.apache.spark.sql.Agg$$anonfun$18.apply(MySuite.scala:192)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
...
Caused by: java.lang.ArithmeticException: BigInteger out of long range
at java.math.BigInteger.longValueExact(BigInteger.java:4531)
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:140)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:434)
at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
... 59 more
```
## How was this patch tested?
Add new test suite into `DecimalSuite`
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#17684 from kiszk/SPARK-20341.
## What changes were proposed in this pull request?
If a partitionSpec is supposed to not contain optional values, a ParseException should be thrown, and not nulls returned.
The nulls can later cause NullPointerExceptions in places not expecting them.
## How was this patch tested?
A query like "SHOW PARTITIONS tbl PARTITION(col1='val1', col2)" used to throw a NullPointerException.
Now it throws a ParseException.
Author: Juliusz Sompolski <julek@databricks.com>
Closes#17707 from juliuszsompolski/SPARK-20412.
## What changes were proposed in this pull request?
It is often useful to be able to track changes to the `ExternalCatalog`. This PR makes the `ExternalCatalog` emit events when a catalog object is changed. Events are fired before and after the change.
The following events are fired per object:
- Database
- CreateDatabasePreEvent: event fired before the database is created.
- CreateDatabaseEvent: event fired after the database has been created.
- DropDatabasePreEvent: event fired before the database is dropped.
- DropDatabaseEvent: event fired after the database has been dropped.
- Table
- CreateTablePreEvent: event fired before the table is created.
- CreateTableEvent: event fired after the table has been created.
- RenameTablePreEvent: event fired before the table is renamed.
- RenameTableEvent: event fired after the table has been renamed.
- DropTablePreEvent: event fired before the table is dropped.
- DropTableEvent: event fired after the table has been dropped.
- Function
- CreateFunctionPreEvent: event fired before the function is created.
- CreateFunctionEvent: event fired after the function has been created.
- RenameFunctionPreEvent: event fired before the function is renamed.
- RenameFunctionEvent: event fired after the function has been renamed.
- DropFunctionPreEvent: event fired before the function is dropped.
- DropFunctionPreEvent: event fired after the function has been dropped.
The current events currently only contain the names of the object modified. We add more events, and more details at a later point.
A user can monitor changes to the external catalog by adding a listener to the Spark listener bus checking for `ExternalCatalogEvent`s using the `SparkListener.onOtherEvent` hook. A more direct approach is add listener directly to the `ExternalCatalog`.
## How was this patch tested?
Added the `ExternalCatalogEventSuite`.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#17710 from hvanhovell/SPARK-20420.
## What changes were proposed in this pull request?
This pr modified code to print the identical `Range` parameters of SparkContext APIs and SQL in `explain` output. In the current master, they internally use `defaultParallelism` for `splits` by default though, they print different strings in explain output;
```
scala> spark.range(4).explain
== Physical Plan ==
*Range (0, 4, step=1, splits=Some(8))
scala> sql("select * from range(4)").explain
== Physical Plan ==
*Range (0, 4, step=1, splits=None)
```
## How was this patch tested?
Added tests in `SQLQuerySuite` and modified some results in the existing tests.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#17670 from maropu/SPARK-20281.
## What changes were proposed in this pull request?
A cast expression with a resolved time zone is not equal to a cast expression without a resolved time zone. The `ResolveAggregateFunction` assumed that these expression were the same, and would fail to resolve `HAVING` clauses which contain a `Cast` expression.
This is in essence caused by the fact that a `TimeZoneAwareExpression` can be resolved without a set time zone. This PR fixes this, and makes a `TimeZoneAwareExpression` unresolved as long as it has no TimeZone set.
## How was this patch tested?
Added a regression test to the `SQLQueryTestSuite.having` file.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#17641 from hvanhovell/SPARK-20329.
## What changes were proposed in this pull request?
When infering partitioning schema from paths, the column in parsePartitionColumn should be unescaped with unescapePathName, just like it is being done in e.g. parsePathFragmentAsSeq.
## How was this patch tested?
Added a test to FileIndexSuite.
Author: Juliusz Sompolski <julek@databricks.com>
Closes#17703 from juliuszsompolski/SPARK-20367.
## What changes were proposed in this pull request?
It is kind of annoying that `SharedSQLContext.sparkConf` is a val when overriding test cases, because you cannot call `super` on it. This PR makes it a function.
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#17705 from hvanhovell/SPARK-20410.
## What changes were proposed in this pull request?
Address a follow up in [comment](https://github.com/apache/spark/pull/16954#discussion_r105718880)
Currently subqueries with correlated predicates containing aggregate expression having mixture of outer references and local references generate a codegen error like following :
```SQL
SELECT t1a
FROM t1
GROUP BY 1
HAVING EXISTS (SELECT 1
FROM t2
WHERE t2a < min(t1a + t2a));
```
Exception snippet.
```
Cannot evaluate expression: min((input[0, int, false] + input[4, int, false]))
at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226)
at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
```
After this PR, a better error message is issued.
```
org.apache.spark.sql.AnalysisException
Error in query: Found an aggregate expression in a correlated
predicate that has both outer and local references, which is not supported yet.
Aggregate expression: min((t1.`t1a` + t2.`t2a`)),
Outer references: t1.`t1a`,
Local references: t2.`t2a`.;
```
## How was this patch tested?
Added tests in SQLQueryTestSuite.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#17636 from dilipbiswal/subquery_followup1.
## What changes were proposed in this pull request?
SharedSQLContext.afterEach now calls DebugFilesystem.assertNoOpenStreams inside eventually.
SQLTestUtils withTempDir calls waitForTasksToFinish before deleting the directory.
## How was this patch tested?
Added new test in ParquetQuerySuite based on the flaky test
Author: Bogdan Raducanu <bogdan@databricks.com>
Closes#17701 from bogdanrdc/SPARK-20407.
## What changes were proposed in this pull request?
It's illegal to have aggregate function in GROUP BY, and we should fail at analysis phase, if this happens.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17704 from cloud-fan/minor.
## What changes were proposed in this pull request?
Dataset.withNewExecutionId is only used in Dataset itself and should be private.
## How was this patch tested?
N/A - this is a simple visibility change.
Author: Reynold Xin <rxin@databricks.com>
Closes#17699 from rxin/SPARK-20405.
### What changes were proposed in this pull request?
Database and Table names conform the Hive standard ("[a-zA-z_0-9]+"), i.e. if this name only contains characters, numbers, and _.
When calling `toLowerCase` on the names, we should add `Locale.ROOT` to the `toLowerCase`for avoiding inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").
### How was this patch tested?
Added a test case
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17655 from gatorsmile/locale.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-19820 adds a reason field for why tasks were killed. However, for backwards compatibility it left the old TaskKilledException constructor which defaults to "unknown reason".
The range() operator should use the constructor that fills in the reason rather than dropping it on task kill.
## How was this patch tested?
Existing tests, and I tested this manually.
Author: Eric Liang <ekl@databricks.com>
Closes#17692 from ericl/fix-kill-reason-in-range.
## What changes were proposed in this pull request?
Also went through the same file to ensure other string concatenation are correct.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#17691 from zsxwing/fix-error-message.
## What changes were proposed in this pull request?
Apply Complementation Laws during boolean expression simplification.
## How was this patch tested?
Tested using unit tests, integration tests, and manual tests.
Author: ptkool <michael.styles@shopify.com>
Author: Michael Styles <michael.styles@shopify.com>
Closes#17650 from ptkool/apply_complementation_laws.
## What changes were proposed in this pull request?
The output of `InMemoryTableScanExec` can be pruned and mismatch with `InMemoryRelation` and its child plan's output. This causes wrong output partitioning and ordering.
## How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#17679 from viirya/SPARK-20356.
Avoid necessary execution that can lead to NPE in EliminateOuterJoin and add test in DataFrameSuite to confirm NPE is no longer thrown
## What changes were proposed in this pull request?
Change leftHasNonNullPredicate and rightHasNonNullPredicate to lazy so they are only executed when needed.
## How was this patch tested?
Added test in DataFrameSuite that failed before this fix and now succeeds. Note that a test in catalyst project would be better but i am unsure how to do this.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Koert Kuipers <koert@tresata.com>
Closes#17660 from koertkuipers/feat-catch-npe-in-eliminate-outer-join.
## What changes were proposed in this pull request?
If a plan has multi-level successive joins, e.g.:
```
Join
/ \
Union t5
/ \
Join t4
/ \
Join t3
/ \
t1 t2
```
Currently we fail to reorder the inside joins, i.e. t1, t2, t3.
In join reorder, we use `OrderedJoin` to indicate a join has been ordered, such that when transforming down the plan, these joins don't need to be rerodered again.
But there's a problem in the definition of `OrderedJoin`:
The real join node is a parameter, but not a child. This breaks the transform procedure because `mapChildren` applies transform function on parameters which should be children.
In this patch, we change `OrderedJoin` to a class having the same structure as a join node.
## How was this patch tested?
Add a corresponding test case.
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes#17668 from wzhfy/recursiveReorder.
## What changes were proposed in this pull request?
fix typo
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17663 from felixcheung/likedoctypo.
## What changes were proposed in this pull request?
Replace non-existent `repartitionBy` with `distribute` in `CollapseRepartitionSuite`.
## How was this patch tested?
local build and `catalyst/testOnly *CollapseRepartitionSuite`
Author: Jacek Laskowski <jacek@japila.pl>
Closes#17657 from jaceklaskowski/CollapseRepartitionSuite.
## What changes were proposed in this pull request?
This patch fixes a bug in the way LIKE patterns are translated to Java regexes. The bug causes any character following an escaped backslash to be escaped, i.e. there is double-escaping.
A concrete example is the following pattern:`'%\\%'`. The expected Java regex that this pattern should correspond to (according to the behavior described below) is `'.*\\.*'`, however the current situation leads to `'.*\\%'` instead.
---
Update: in light of the discussion that ensued, we should explicitly define the expected behaviour of LIKE expressions, especially in certain edge cases. With the help of gatorsmile, we put together a list of different RDBMS and their variations wrt to certain standard features.
| RDBMS\Features | Wildcards | Default escape [1] | Case sensitivity |
| --- | --- | --- | --- |
| [MS SQL Server](https://msdn.microsoft.com/en-us/library/ms179859.aspx) | _, %, [], [^] | none | no |
| [Oracle](https://docs.oracle.com/cd/B12037_01/server.101/b10759/conditions016.htm) | _, % | none | yes |
| [DB2 z/OS](http://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_likepredicate.html) | _, % | none | yes |
| [MySQL](http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html) | _, % | none | no |
| [PostreSQL](https://www.postgresql.org/docs/9.0/static/functions-matching.html) | _, % | \ | yes |
| [Hive](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) | _, % | none | yes |
| Current Spark | _, % | \ | yes |
[1] Default escape character: most systems do not have a default escape character, instead the user can specify one by calling a like expression with an escape argument [A] LIKE [B] ESCAPE [C]. This syntax is currently not supported by Spark, however I would volunteer to implement this feature in a separate ticket.
The specifications are often quite terse and certain scenarios are undocumented, so here is a list of scenarios that I am uncertain about and would appreciate any input. Specifically I am looking for feedback on whether or not Spark's current behavior should be changed.
1. [x] Ending a pattern with the escape sequence, e.g. `like 'a\'`.
PostreSQL gives an error: 'LIKE pattern must not end with escape character', which I personally find logical. Currently, Spark allows "non-terminated" escapes and simply ignores them as part of the pattern.
According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), ending a pattern in an escape character is invalid.
_Proposed new behaviour in Spark: throw AnalysisException_
2. [x] Empty input, e.g. `'' like ''`
Postgres and DB2 will match empty input only if the pattern is empty as well, any other combination of empty input will not match. Spark currently follows this rule.
3. [x] Escape before a non-special character, e.g. `'a' like '\a'`.
Escaping a non-wildcard character is not really documented but PostgreSQL just treats it verbatim, which I also find the least surprising behavior. Spark does the same.
According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), it is invalid to follow an escape character with anything other than an escape character, an underscore or a percent sign.
_Proposed new behaviour in Spark: throw AnalysisException_
The current specification is also described in the operator's source code in this patch.
## How was this patch tested?
Extra case in regex unit tests.
Author: Jakob Odersky <jakob@odersky.com>
This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@databricks.com>
Closes#15398 from jodersky/SPARK-17647.
### What changes were proposed in this pull request?
The session catalog caches some persistent functions in the `FunctionRegistry`, so there can be duplicates. Our Catalog API `listFunctions` does not handle it.
It would be better if `SessionCatalog` API can de-duplciate the records, instead of doing it by each API caller. In `FunctionRegistry`, our functions are identified by the unquoted string. Thus, this PR is try to parse it using our parser interface and then de-duplicate the names.
### How was this patch tested?
Added test cases.
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17646 from gatorsmile/showFunctions.
### What changes were proposed in this pull request?
```JAVA
/**
* Certain optimizations should not be applied if UDF is not deterministic.
* Deterministic UDF returns same result each time it is invoked with a
* particular input. This determinism just needs to hold within the context of
* a query.
*
* return true if the UDF is deterministic
*/
boolean deterministic() default true;
```
Based on the definition of [UDFType](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFType.java#L42-L50), when Hive UDF's children are non-deterministic, Hive UDF is also non-deterministic.
### How was this patch tested?
Added test cases.
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17635 from gatorsmile/udfDeterministic.
## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/17398 we introduced `UnresolvedMapObjects` as a placeholder of `MapObjects`. Unfortunately `UnresolvedMapObjects` is not serializable as its `function` may reference Scala `Type` which is not serializable.
Ideally this is fine, as we will never serialize and send unresolved expressions to executors. However users may accidentally do this, e.g. mistakenly reference an encoder instance when implementing `Aggregator`, we should fix it so that it's just a performance issue(more network traffic) and should not fail the query.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17639 from cloud-fan/minor.
## What changes were proposed in this pull request?
val and var should strictly follow the Scala syntax
## How was this patch tested?
manual test and exisiting test cases
Author: ouyangxiaochen <ou.yangxiaochen@zte.com.cn>
Closes#17628 from ouyangxiaochen/spark-413.
## What changes were proposed in this pull request?
Currently when estimating predicates like col > literal or col = literal, we will update min or max in column stats based on literal value. However, literal value is of Catalyst type (internal type), while min/max is of external type. Then for the next predicate, we again need to do type conversion to compare and update column stats. This is awkward and causes many unnecessary conversions in estimation.
To solve this, we use Catalyst type for min/max in `ColumnStat`. Note that the persistent format in metastore is still of external type, so there's no inconsistency for statistics in metastore.
This pr also fixes a bug for boolean type in `IN` condition.
## How was this patch tested?
The changes for ColumnStat are covered by existing tests.
For bug fix, a new test for boolean type in IN condition is added
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes#17630 from wzhfy/refactorColumnStat.
## What changes were proposed in this pull request?
have the`FileFormatWriter.ExecuteWriteTask.releaseResources()` implementations set `currentWriter=null` in a finally clause. This guarantees that if the first call to `currentWriter()` throws an exception, the second releaseResources() call made during the task cancel process will not trigger a second attempt to close the stream.
## How was this patch tested?
Tricky. I've been fixing the underlying cause when I saw the problem [HADOOP-14204](https://issues.apache.org/jira/browse/HADOOP-14204), but SPARK-10109 shows I'm not the first to have seen this. I can't replicate it locally any more, my code no longer being broken.
code review, however, should be straightforward
Author: Steve Loughran <stevel@hortonworks.com>
Closes#17364 from steveloughran/stevel/SPARK-20038-close.
## What changes were proposed in this pull request?
Some Structured Streaming tests show flakiness such as:
```
[info] - prune results by current_date, complete mode - 696 *** FAILED *** (10 seconds, 937 milliseconds)
[info] Timed out while stopping and waiting for microbatchthread to terminate.: The code passed to failAfter did not complete within 10 seconds.
```
This happens when we wait for the stream to stop, but it doesn't. The reason it doesn't stop is that we interrupt the microBatchThread, but Hadoop's `Shell.runCommand` swallows the interrupt exception, and the exception is not propagated upstream to the microBatchThread. Then this thread continues to run, only to start blocking on the `streamManualClock`.
## How was this patch tested?
Thousand retries locally and [Jenkins](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75720/testReport) of the flaky tests
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#17613 from brkyvz/flaky-stream-agg.
## What changes were proposed in this pull request?
AssertNotNull's toString/simpleString dumps the entire walkedTypePath. walkedTypePath is used for error message reporting and shouldn't be part of the output.
## How was this patch tested?
Manually tested.
Author: Reynold Xin <rxin@databricks.com>
Closes#17616 from rxin/SPARK-20304.
### What changes were proposed in this pull request?
Session catalog API `createTempFunction` is being used by Hive build-in functions, persistent functions, and temporary functions. Thus, the name is confusing. This PR is to rename it by `registerFunction`. Also we can move construction of `FunctionBuilder` and `ExpressionInfo` into the new `registerFunction`, instead of duplicating the logics everywhere.
In the next PRs, the remaining Function-related APIs also need cleanups.
### How was this patch tested?
Existing test cases.
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17615 from gatorsmile/cleanupCreateTempFunction.
## What changes were proposed in this pull request?
This PR proposes to run Spark unidoc to test Javadoc 8 build as Javadoc 8 is easily re-breakable.
There are several problems with it:
- It introduces little extra bit of time to run the tests. In my case, it took 1.5 mins more (`Elapsed :[94.8746569157]`). How it was tested is described in "How was this patch tested?".
- > One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up.
(see joshrosen's [comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627))
To complete this automated build, It also suggests to fix existing Javadoc breaks / ones introduced by test codes as described above.
There fixes are similar instances that previously fixed. Please refer https://github.com/apache/spark/pull/15999 and https://github.com/apache/spark/pull/16013
Note that this only fixes **errors** not **warnings**. Please see my observation https://github.com/apache/spark/pull/17389#issuecomment-288438704 for spurious errors by warnings.
## How was this patch tested?
Manually via `jekyll build` for building tests. Also, tested via running `./dev/run-tests`.
This was tested via manually adding `time.time()` as below:
```diff
profiles_and_goals = build_profiles + sbt_goals
print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ",
" ".join(profiles_and_goals))
+ import time
+ st = time.time()
exec_sbt(profiles_and_goals)
+ print("Elapsed :[%s]" % str(time.time() - st))
```
produces
```
...
========================================================================
Building Unidoc API Documentation
========================================================================
...
[info] Main Java API documentation successful.
...
Elapsed :[94.8746569157]
...
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17477 from HyukjinKwon/SPARK-18692.
## What changes were proposed in this pull request?
Update count distinct error message for streaming datasets/dataframes to match current behavior. These aggregations are not yet supported, regardless of whether the dataset/dataframe is aggregated.
Author: jtoka <jason.tokayer@gmail.com>
Closes#17609 from jtoka/master.
## What changes were proposed in this pull request?
When we perform a cast expression and the from and to types are structurally the same (having the same structure but different field names), we should be able to skip the actual cast.
## How was this patch tested?
Added unit tests for the newly introduced functions.
Author: Reynold Xin <rxin@databricks.com>
Closes#17614 from rxin/SPARK-20302.
## What changes were proposed in this pull request?
This PR proposes corrections related to JSON APIs as below:
- Rendering links in Python documentation
- Replacing `RDD` to `Dataset` in programing guide
- Adding missing description about JSON Lines consistently in `DataFrameReader.json` in Python API
- De-duplicating little bit of `DataFrameReader.json` in Scala/Java API
## How was this patch tested?
Manually build the documentation via `jekyll build`. Corresponding snapstops will be left on the codes.
Note that currently there are Javadoc8 breaks in several places. These are proposed to be handled in https://github.com/apache/spark/pull/17477. So, this PR does not fix those.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17602 from HyukjinKwon/minor-json-documentation.
## What changes were proposed in this pull request?
`NaNvl(float value, null)` will be converted into `NaNvl(float value, Cast(null, DoubleType))` and finally `NaNvl(Cast(float value, DoubleType), Cast(null, DoubleType))`.
This will cause mismatching in the output type when the input type is float.
By adding extra rule in TypeCoercion can resolve this issue.
## How was this patch tested?
unite tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: DB Tsai <dbt@netflix.com>
Closes#17606 from dbtsai/fixNaNvl.
## What changes were proposed in this pull request?
Dataset typed API currently uses NewInstance to box primitive types (i.e. calling the constructor). Instead, it'd be slightly more idiomatic in Java to use PrimitiveType.valueOf, which can be invoked using StaticInvoke expression.
## How was this patch tested?
The change should be covered by existing tests for Dataset encoders.
Author: Reynold Xin <rxin@databricks.com>
Closes#17604 from rxin/SPARK-20289.
## What changes were proposed in this pull request?
Similar to `ListQuery`, `Exists` should not be evaluated in `Join` operator too.
## How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#17491 from viirya/dont-push-exists-to-join.
## What changes were proposed in this pull request?
This is a regression caused by SPARK-19716.
Before SPARK-19716, we will cast an array field to the expected array type. However, after SPARK-19716, the cast is removed, but we forgot to push the cast to the element level.
## How was this patch tested?
new regression tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17587 from cloud-fan/array.
## What changes were proposed in this pull request?
We currently have postHocOptimizationBatches, but not preOptimizationBatches. This patch adds preOptimizationBatches so the optimizer debugging extensions are symmetric.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#17595 from rxin/SPARK-20283.
## What changes were proposed in this pull request?
This PR fixes the following failure:
```
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException:
Assert on query failed:
== Progress ==
AssertOnQuery(<condition>, )
StopStream
AddData to MemoryStream[value#30891]: 1,2
StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock35cdc93a,Map())
CheckAnswer: [6],[3]
StopStream
=> AssertOnQuery(<condition>, )
AssertOnQuery(<condition>, )
StartStream(OneTimeTrigger,org.apache.spark.util.SystemClockcdb247d,Map())
CheckAnswer: [6],[3]
StopStream
AddData to MemoryStream[value#30891]: 3
StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock55394e4d,Map())
CheckLastBatch: [2]
StopStream
AddData to MemoryStream[value#30891]: 0
StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock749aa997,Map())
ExpectFailure[org.apache.spark.SparkException, isFatalError: false]
AssertOnQuery(<condition>, )
AssertOnQuery(<condition>, incorrect start offset or end offset on exception)
== Stream ==
Output Mode: Append
Stream state: not started
Thread state: dead
== Sink ==
0: [6] [3]
== Plan ==
at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
at org.apache.spark.sql.streaming.StreamTest$class.failTest$1(StreamTest.scala:347)
at org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:318)
at org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:483)
at org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:357)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.sql.streaming.StreamTest$class.liftedTree1$1(StreamTest.scala:357)
at org.apache.spark.sql.streaming.StreamTest$class.testStream(StreamTest.scala:356)
at org.apache.spark.sql.streaming.StreamingQuerySuite.testStream(StreamingQuerySuite.scala:41)
at org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply$mcV$sp(StreamingQuerySuite.scala:166)
at org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
at org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
at org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:268)
at org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
at org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StreamingQuerySuite.scala:41)
at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
at org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingQuerySuite.scala:41)
at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
at org.apache.spark.sql.streaming.StreamingQuerySuite.runTest(StreamingQuerySuite.scala:41)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
at org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfter$$super$run(StreamingQuerySuite.scala:41)
at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
at org.apache.spark.sql.streaming.StreamingQuerySuite.run(StreamingQuerySuite.scala:41)
at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357)
at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:502)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
```
The failure is because `CheckAnswer` will run once `committedOffsets` is updated. Then writing the commit log may be interrupted by the following `StopStream`.
This PR just change the order to write the commit log first.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#17594 from zsxwing/SPARK-20282.
## What changes were proposed in this pull request?
Weigher.weigh needs to return Int but it is possible for an Array[FileStatus] to have size > Int.maxValue. To avoid this, the size is scaled down by a factor of 32. The maximumWeight of the cache is also scaled down by the same factor.
## How was this patch tested?
New test in FileIndexSuite
Author: Bogdan Raducanu <bogdan@databricks.com>
Closes#17591 from bogdanrdc/SPARK-20280.
## What changes were proposed in this pull request?
Add Locale.ROOT to internal calls to String `toLowerCase`, `toUpperCase`, to avoid inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").
The change looks large but it is just adding `Locale.ROOT` (the locale with no country or language specified) to every call to these methods.
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#17527 from srowen/SPARK-20156.
## What changes were proposed in this pull request?
```
sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b having r > 0.5").show()
```
We will get the following error:
```
Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor driver): java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
```
Filters could be pushed down to the join conditions by the optimizer rule `PushPredicateThroughJoin`. However, Analyzer [blocks users to add non-deterministics conditions](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L386-L395) (For details, see the PR https://github.com/apache/spark/pull/7535).
We should not push down non-deterministic conditions; otherwise, we need to explicitly initialize the non-deterministic expressions. This PR is to simply block it.
### How was this patch tested?
Added a test case
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17585 from gatorsmile/joinRandCondition.
## What changes were proposed in this pull request?
This PR proposes to add `IGNORE NULLS` keyword in `first`/`last` in Spark's parser likewise http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions057.htm. This simply maps the keywords to existing `ignoreNullsExpr`.
**Before**
```scala
scala> sql("select first('a' IGNORE NULLS)").show()
```
```
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input 'NULLS' expecting {')', ','}(line 1, pos 24)
== SQL ==
select first('a' IGNORE NULLS)
------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:210)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:112)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:66)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:622)
... 48 elided
```
**After**
```scala
scala> sql("select first('a' IGNORE NULLS)").show()
```
```
+--------------+
|first(a, true)|
+--------------+
| a|
+--------------+
```
## How was this patch tested?
Unit tests in `ExpressionParserSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17566 from HyukjinKwon/SPARK-19518.
## What changes were proposed in this pull request?
Like `Expression`, `QueryPlan` should also have a `semanticHash` method, then we can put plans to a hash map and look it up fast. This PR refactors `QueryPlan` to follow `Expression` and put all the normalization logic in `QueryPlan.canonicalized`, so that it's very natural to implement `semanticHash`.
follow-up: improve `CacheManager` to leverage this `semanticHash` and speed up plan lookup, instead of iterating all cached plans.
## How was this patch tested?
existing tests. Note that we don't need to test the `semanticHash` method, once the existing tests prove `sameResult` is correct, we are good.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17541 from cloud-fan/plan-semantic.
## What changes were proposed in this pull request?
This bug was partially addressed in SPARK-18555 https://github.com/apache/spark/pull/15994, but the root cause isn't completely solved. This bug is pretty critical since it changes the member id in Long in our application if the member id can not be represented by Double losslessly when the member id is very big.
Here is an example how this happens, with
```
Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), (9123146099426677101L, null),
(9123146560113991650L, 1.6), (null, null)).toDF("a", "b").na.fill(0.2),
```
the logical plan will be
```
== Analyzed Logical Plan ==
a: bigint, b: double
Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as double) AS b#241]
+- Project [_1#229L AS a#232L, _2#230 AS b#233]
+- LocalRelation [_1#229L, _2#230]
```
Note that even the value is not null, Spark will cast the Long into Double first. Then if it's not null, Spark will cast it back to Long which results in losing precision.
The behavior should be that the original value should not be changed if it's not null, but Spark will change the value which is wrong.
With the PR, the logical plan will be
```
== Analyzed Logical Plan ==
a: bigint, b: double
Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241]
+- Project [_1#229L AS a#232L, _2#230 AS b#233]
+- LocalRelation [_1#229L, _2#230]
```
which behaves correctly without changing the original Long values and also avoids extra cost of unnecessary casting.
## How was this patch tested?
unit test added.
+cc srowen rxin cloud-fan gatorsmile
Thanks.
Author: DB Tsai <dbt@netflix.com>
Closes#17577 from dbtsai/fixnafill.
## What changes were proposed in this pull request?
sq/core module currently declares asm as a test scope dependency. Transitively it should actually be a normal dependency since the actual core module defines it. This occasionally confuses IntelliJ.
## How was this patch tested?
N/A - This is a build change.
Author: Reynold Xin <rxin@databricks.com>
Closes#17574 from rxin/SPARK-20264.
## What changes were proposed in this pull request?
This error message doesn't get properly formatted because of a missing `s`. Currently the error looks like:
```
Caused by: java.lang.IllegalArgumentException: requirement failed: indices should be one-based and in ascending order; found current=$current, previous=$previous; line="$line"
```
(note the literal `$current` instead of the interpolated value)
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Vijay Ramesh <vramesh@demandbase.com>
Closes#17572 from vijaykramesh/master.
## What changes were proposed in this pull request?
AssertNotNull currently throws RuntimeException. It should throw NullPointerException, which is more specific.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#17573 from rxin/SPARK-20262.
## What changes were proposed in this pull request?
Similar to `Project`, when `Aggregate` has non-deterministic expressions, we should not push predicate down through it, as it will change the number of input rows and thus change the evaluation result of non-deterministic expressions in `Aggregate`.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17562 from cloud-fan/filter.
## What changes were proposed in this pull request
Trying to get a grip on the `FileIndex` hierarchy, I was confused by the following inconsistency:
On the one hand, `PartitioningAwareFileIndex` defines `leafFiles` and `leafDirToChildrenFiles` as abstract, but on the other it fully implements `listLeafFiles` which does all the listing of files. However, the latter is only used by `InMemoryFileIndex`.
I'm hereby proposing to move this method (and all its dependencies) to the implementation class that actually uses it, and thus unclutter the `PartitioningAwareFileIndex` interface.
## How was this patch tested?
`./build/sbt sql/test`
Author: Adrian Ionescu <adrian@databricks.com>
Closes#17570 from adrian-ionescu/list-leaf-files.
## What changes were proposed in this pull request?
Currently `LogicalRelation` has a `expectedOutputAttributes` parameter, which makes it hard to reason about what the actual output is. Like other leaf nodes, `LogicalRelation` should also take `output` as a parameter, to simplify the logic
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17552 from cloud-fan/minor.
## What changes were proposed in this pull request?
This is a tiny addendum to SPARK-19495 to remove the private visibility for copy, which is the only package private method in the entire file.
## How was this patch tested?
N/A - no semantic change.
Author: Reynold Xin <rxin@databricks.com>
Closes#17555 from rxin/SPARK-19495-2.
## What changes were proposed in this pull request?
Update doc to remove external for createTable, add refreshByPath in python
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#17512 from felixcheung/catalogdoc.
## What changes were proposed in this pull request?
This commit moves star schema code from ```join.scala``` to ```StarSchemaDetection.scala```. It also applies some minor fixes in ```StarJoinReorderSuite.scala```.
## How was this patch tested?
Run existing ```StarJoinReorderSuite.scala```.
Author: Ioana Delaney <ioanamdelaney@gmail.com>
Closes#17544 from ioana-delaney/starSchemaCBOv2.
## What changes were proposed in this pull request?
Make sure SESSION_LOCAL_TIMEZONE reflects the change in JVM's default timezone setting. Currently several timezone related tests fail as the change to default timezone is not picked up by SQLConf.
## How was this patch tested?
Added an unit test in ConfigEntrySuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#17537 from dilipbiswal/timezone_debug.
## What changes were proposed in this pull request?
- Fixed bug in Java API not passing timeout conf to scala API
- Updated markdown docs
- Updated scala docs
- Added scala and Java example
## How was this patch tested?
Manually ran examples.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#17539 from tdas/SPARK-20224.
## What changes were proposed in this pull request?
Fix typo in tpcds q77.sql
## How was this patch tested?
N/A
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes#17538 from wzhfy/typoQ77.
## What changes were proposed in this pull request?
For large trigger intervals (e.g. 10 minutes), if a batch takes 11 minutes, then it will wait for 9 mins before starting the next batch. This does not make sense. The processing time based trigger policy should be to do process batches as fast as possible, but no faster than 1 in every trigger interval. If batches are taking longer than trigger interval anyways, then no point waiting extra trigger interval.
In this PR, I modified the ProcessingTimeExecutor to do so. Another minor change I did was to extract our StreamManualClock into a separate class so that it can be used outside subclasses of StreamTest. For example, ProcessingTimeExecutorSuite does not need to create any context for testing, just needs the StreamManualClock.
## How was this patch tested?
Added new unit tests to comprehensively test this behavior.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#17525 from tdas/SPARK-20209.
## What changes were proposed in this pull request?
Previously when we construct deserializer expression for array type, we will first cast the corresponding field to expected array type and then apply `MapObjects`.
However, by doing that, we lose the opportunity to do by-name resolution for struct type inside array type. In this PR, I introduce a `UnresolvedMapObjects` to hold the lambda function and the input array expression. Then during analysis, after the input array expression is resolved, we get the actual array element type and apply by-name resolution. Then we don't need to add `Cast` for array type when constructing the deserializer expression, as the element type is determined later at analyzer.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17398 from cloud-fan/dataset.
## What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/17285 .
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17521 from cloud-fan/conf.
### What changes were proposed in this pull request?
Observed by felixcheung , in `SparkSession`.`Catalog` APIs, we have different conventions/rules for table/function identifiers/names. Most APIs accept the qualified name (i.e., `databaseName`.`tableName` or `databaseName`.`functionName`). However, the following five APIs do not accept it.
- def listColumns(tableName: String): Dataset[Column]
- def getTable(tableName: String): Table
- def getFunction(functionName: String): Function
- def tableExists(tableName: String): Boolean
- def functionExists(functionName: String): Boolean
To make them consistent with the other Catalog APIs, this PR does the changes, updates the function/API comments and adds the `params` to clarify the inputs we allow.
### How was this patch tested?
Added the test cases .
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17518 from gatorsmile/tableIdentifier.
### What changes were proposed in this pull request?
This PR is to unify and clean up the outputs of `DESC EXTENDED/FORMATTED` and `SHOW TABLE EXTENDED` by moving the logics into the Catalog interface. The output formats are improved. We also add the missing attributes. It impacts the DDL commands like `SHOW TABLE EXTENDED`, `DESC EXTENDED` and `DESC FORMATTED`.
In addition, by following what we did in Dataset API `printSchema`, we can use `treeString` to show the schema in the more readable way.
Below is the current way:
```
Schema: STRUCT<`a`: STRING (nullable = true), `b`: INT (nullable = true), `c`: STRING (nullable = true), `d`: STRING (nullable = true)>
```
After the change, it should look like
```
Schema: root
|-- a: string (nullable = true)
|-- b: integer (nullable = true)
|-- c: string (nullable = true)
|-- d: string (nullable = true)
```
### How was this patch tested?
`describe.sql` and `show-tables.sql`
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17394 from gatorsmile/descFollowUp.
## What changes were proposed in this pull request?
**Description** from JIRA
The TimestampType in Spark SQL is of microsecond precision. Ideally, we should convert Spark SQL timestamp values into Parquet TIMESTAMP_MICROS. But unfortunately parquet-mr hasn't supported it yet.
For the read path, we should be able to read TIMESTAMP_MILLIS Parquet values and pad a 0 microsecond part to read values.
For the write path, currently we are writing timestamps as INT96, similar to Impala and Hive. One alternative is that, we can have a separate SQL option to let users be able to write Spark SQL timestamp values as TIMESTAMP_MILLIS. Of course, in this way the microsecond part will be truncated.
## How was this patch tested?
Added new tests in ParquetQuerySuite and ParquetIOSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#15332 from dilipbiswal/parquet-time-millis.