## What changes were proposed in this pull request?
Method `SQLContext.parseDataType(dataTypeString: String)` could be removed, we should use `SparkSession.parseDataType(dataTypeString: String)` instead.
This require updating PySpark.
## How was this patch tested?
Existing test cases.
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes#14790 from jiangxb1987/parseDataType.
### What changes were proposed in this pull request?
Since `HiveClient` is used to interact with the Hive metastore, it should be hidden in `HiveExternalCatalog`. After moving `HiveClient` into `HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of `HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes straightforward. After removal of `HiveSharedState`, the reflection logic is directly applied on the choice of `ExternalCatalog` types, based on the configuration of `CATALOG_IMPLEMENTATION`.
~~`HiveClient` is also used/invoked by the other entities besides HiveExternalCatalog, we defines the following two APIs: getClient and getNewClient~~
### How was this patch tested?
The existing test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14757 from gatorsmile/removeHiveClient.
## What changes were proposed in this pull request?
Given that filters based on non-deterministic constraints shouldn't be pushed down in the query plan, unnecessarily inferring them is confusing and a source of potential bugs. This patch simplifies the inferring logic by simply ignoring them.
## How was this patch tested?
Added a new test in `ConstraintPropagationSuite`.
Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
Closes#14795 from sameeragarwal/deterministic-constraints.
## What changes were proposed in this pull request?
### Default - ISO 8601
Currently, CSV datasource is writing `Timestamp` and `Date` as numeric form and JSON datasource is writing both as below:
- CSV
```
// TimestampType
1414459800000000
// DateType
16673
```
- Json
```
// TimestampType
1970-01-01 11:46:40.0
// DateType
1970-01-01
```
So, for CSV we can't read back what we write and for JSON it becomes ambiguous because the timezone is being missed.
So, this PR make both **write** `Timestamp` and `Date` in ISO 8601 formatted string (please refer the [ISO 8601 specification](https://www.w3.org/TR/NOTE-datetime)).
- For `Timestamp` it becomes as below: (`yyyy-MM-dd'T'HH:mm:ss.SSSZZ`)
```
1970-01-01T02:00:01.000-01:00
```
- For `Date` it becomes as below (`yyyy-MM-dd`)
```
1970-01-01
```
### Custom date format option - `dateFormat`
This PR also adds the support to write and read dates and timestamps in a formatted string as below:
- **DateType**
- With `dateFormat` option (e.g. `yyyy/MM/dd`)
```
+----------+
| date|
+----------+
|2015/08/26|
|2014/10/27|
|2016/01/28|
+----------+
```
### Custom date format option - `timestampFormat`
- **TimestampType**
- With `dateFormat` option (e.g. `dd/MM/yyyy HH:mm`)
```
+----------------+
| date|
+----------------+
|2015/08/26 18:00|
|2014/10/27 18:30|
|2016/01/28 20:00|
+----------------+
```
## How was this patch tested?
Unit tests were added in `CSVSuite` and `JsonSuite`. For JSON, existing tests cover the default cases.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#14279 from HyukjinKwon/SPARK-16216-json-csv.
## What changes were proposed in this pull request?
Currently, two-word window functions like `row_number`, `dense_rank`, `percent_rank`, and `cume_dist` are expressed without `_` in error messages. We had better show the correct names.
**Before**
```scala
scala> sql("select row_number()").show
java.lang.UnsupportedOperationException: Cannot evaluate expression: rownumber()
```
**After**
```scala
scala> sql("select row_number()").show
java.lang.UnsupportedOperationException: Cannot evaluate expression: row_number()
```
## How was this patch tested?
Pass the Jenkins and manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14571 from dongjoon-hyun/SPARK-16983.
## What changes were proposed in this pull request?
Actually Spark SQL doesn't support index, the catalog table type `INDEX` is from Hive. However, most operations in Spark SQL can't handle index table, e.g. create table, alter table, etc.
Logically index table should be invisible to end users, and Hive also generates special table name for index table to avoid users accessing it directly. Hive has special SQL syntax to create/show/drop index tables.
At Spark SQL side, although we can describe index table directly, but the result is unreadable, we should use the dedicated SQL syntax to do it(e.g. `SHOW INDEX ON tbl`). Spark SQL can also read index table directly, but the result is always empty.(Can hive read index table directly?)
This PR remove the table type `INDEX`, to make it clear that Spark SQL doesn't support index currently.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14752 from cloud-fan/minor2.
## What changes were proposed in this pull request?
This PR removes implemented functions from comments of `HiveSessionCatalog.scala`: `java_method`, `posexplode`, `str_to_map`.
## How was this patch tested?
Manual.
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes#14769 from Sherry302/cleanComment.
When Spark emits SQL for a string literal, it should wrap the string in single quotes, not double quotes. Databases which adhere more strictly to the ANSI SQL standards, such as Postgres, allow only single-quotes to be used for denoting string literals (see http://stackoverflow.com/a/1992331/590203).
Author: Josh Rosen <joshrosen@databricks.com>
Closes#14763 from JoshRosen/SPARK-17194.
## What changes were proposed in this pull request?
Some JDBC driver (for example PostgreSQL) does not use the underlying exception as cause, but have another APIs (getNextException) to access that, so it it's included in the error logging, making us hard to find the root cause, especially in batch mode.
This PR will pull out the next exception and add it as cause (if it's different) or suppressed (if there is another different cause).
## How was this patch tested?
Can't reproduce this on the default JDBC driver, so did not add a regression test.
Author: Davies Liu <davies@databricks.com>
Closes#14722 from davies/keep_cause.
## What changes were proposed in this pull request?
Use `CatalystConf.resolver` consistently for case-sensitivity comparison (removed dups).
## How was this patch tested?
Local build. Waiting for Jenkins to ensure clean build and test.
Author: Jacek Laskowski <jacek@japila.pl>
Closes#14771 from jaceklaskowski/17199-catalystconf-resolver.
## What changes were proposed in this pull request?
This is a sub-task of [SPARK-16283](https://issues.apache.org/jira/browse/SPARK-16283) (Implement percentile_approx SQL function), which moves class QuantileSummaries to project catalyst so that it can be reused when implementing aggregation function `percentile_approx`.
## How was this patch tested?
This PR only does class relocation, class implementation is not changed.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#14754 from clockfly/move_QuantileSummaries_to_catalyst.
## What changes were proposed in this pull request?
This PR marks the abstract class `Collect` as non-deterministic since the results of `CollectList` and `CollectSet` depend on the actual order of input rows.
## How was this patch tested?
Existing test cases should be enough.
Author: Cheng Lian <lian@databricks.com>
Closes#14749 from liancheng/spark-17182-non-deterministic-collect.
## What changes were proposed in this pull request?
`CreateHiveTableAsSelectLogicalPlan` is a dead code after refactoring.
## How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14707 from gatorsmile/removeCreateHiveTable.
## What changes were proposed in this pull request?
The range operator previously didn't support SQL generation, which made it not possible to use in views.
## How was this patch tested?
Unit tests.
cc hvanhovell
Author: Eric Liang <ekl@databricks.com>
Closes#14724 from ericl/spark-17162.
## What changes were proposed in this pull request?
Fix some typos in comments and test hints
## How was this patch tested?
N/A.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#14755 from clockfly/fix_minor_typo.
## What changes were proposed in this pull request?
In 2.0, we change the threshold of splitting expressions from 16K to 64K, which cause very bad performance on wide table, because the generated method can't be JIT compiled by default (above the limit of 8K bytecode).
This PR will decrease it to 1K, based on the benchmark results for a wide table with 400 columns of LongType.
It also fix a bug around splitting expression in whole-stage codegen (it should not split them).
## How was this patch tested?
Added benchmark suite.
Author: Davies Liu <davies@databricks.com>
Closes#14692 from davies/split_exprs.
## What changes were proposed in this pull request?
Spark SQL doesn't have its own meta store yet, and use hive's currently. However, hive's meta store has some limitations(e.g. columns can't be too many, not case-preserving, bad decimal type support, etc.), so we have some hacks to successfully store data source table metadata into hive meta store, i.e. put all the information in table properties.
This PR moves these hacks to `HiveExternalCatalog`, tries to isolate hive specific logic in one place.
changes overview:
1. **before this PR**: we need to put metadata(schema, partition columns, etc.) of data source tables to table properties before saving it to external catalog, even the external catalog doesn't use hive metastore(e.g. `InMemoryCatalog`)
**after this PR**: the table properties tricks are only in `HiveExternalCatalog`, the caller side doesn't need to take care of it anymore.
2. **before this PR**: because the table properties tricks are done outside of external catalog, so we also need to revert these tricks when we read the table metadata from external catalog and use it. e.g. in `DescribeTableCommand` we will read schema and partition columns from table properties.
**after this PR**: The table metadata read from external catalog is exactly the same with what we saved to it.
bonus: now we can create data source table using `SessionCatalog`, if schema is specified.
breaks: `schemaStringLengthThreshold` is not configurable anymore. `hive.default.rcfile.serde` is not configurable anymore.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14155 from cloud-fan/catalog-table.
## What changes were proposed in this pull request?
Currently, `NullPropagation` optimizer replaces `COUNT` on null literals in a bottom-up fashion. During that, `WindowExpression` is not covered properly. This PR adds the missing propagation logic.
**Before**
```scala
scala> sql("SELECT COUNT(1 + NULL) OVER ()").show
java.lang.UnsupportedOperationException: Cannot evaluate expression: cast(0 as bigint) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
```
**After**
```scala
scala> sql("SELECT COUNT(1 + NULL) OVER ()").show
+----------------------------------------------------------------------------------------------+
|count((1 + CAST(NULL AS INT))) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)|
+----------------------------------------------------------------------------------------------+
| 0|
+----------------------------------------------------------------------------------------------+
```
## How was this patch tested?
Pass the Jenkins test with a new test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14689 from dongjoon-hyun/SPARK-17098.
## What changes were proposed in this pull request?
This patch fixes a longstanding issue with one of the RelationalGroupedDataset.agg function. Even though the signature accepts vararg of pairs, the underlying implementation turns the seq into a map, and thus not order preserving nor allowing multiple aggregates per column.
This change also allows users to use this function to run multiple different aggregations for a single column, e.g.
```
agg("age" -> "max", "age" -> "count")
```
## How was this patch tested?
Added a test case in DataFrameAggregateSuite.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14697 from petermaxlee/SPARK-17124.
## What changes were proposed in this pull request?
Currently `LogicalRelation.newInstance()` simply creates another `LogicalRelation` object with the same parameters. However, the `newInstance()` method inherited from `MultiInstanceRelation` should return a copy of object with unique expression ids. Current `LogicalRelation.newInstance()` can cause failure when doing self-join.
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#14682 from viirya/fix-localrelation.
## What changes were proposed in this pull request?
This patch adds support for SQL generation for inline tables. With this, it would be possible to create a view that depends on inline tables.
## How was this patch tested?
Added a test case in LogicalPlanToSQLSuite.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14709 from petermaxlee/SPARK-17150.
## What changes were proposed in this pull request?
Modifies error message for numeric literals to
Numeric literal <literal> does not fit in range [min, max] for type <T>
## How was this patch tested?
Fixed up the error messages for literals.sql in SqlQueryTestSuite and re-ran via sbt. Also fixed up error messages in ExpressionParserSuite
Author: Srinath Shankar <srinath@databricks.com>
Closes#14721 from srinathshankar/sc4296.
## What changes were proposed in this pull request?
This patch creates array.sql in SQLQueryTestSuite for testing array related functions, including:
- indexing
- array creation
- size
- array_contains
- sort_array
## How was this patch tested?
The patch itself is about adding tests.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14708 from petermaxlee/SPARK-17149.
## What changes were proposed in this pull request?
This patch changes predicate pushdown optimization rule (PushDownPredicate) from using a blacklist to a whitelist. That is to say, operators must be explicitly allowed. This approach is more future-proof: previously it was possible for us to introduce a new operator and then render the optimization rule incorrect.
This also fixes the bug that previously we allowed pushing filter beneath limit, which was incorrect. That is to say, before this patch, the optimizer would rewrite
```
select * from (select * from range(10) limit 5) where id > 3
to
select * from range(10) where id > 3 limit 5
```
## How was this patch tested?
- a unit test case in FilterPushdownSuite
- an end-to-end test in limit.sql
Author: Reynold Xin <rxin@databricks.com>
Closes#14713 from rxin/SPARK-16994.
## What changes were proposed in this pull request?
This patch improves inline table support with the following:
1. Support type coercion.
2. Support using foldable expressions. Previously only literals were supported.
3. Improve error message handling.
4. Improve test coverage.
## How was this patch tested?
Added a new unit test suite ResolveInlineTablesSuite and a new file-based end-to-end test inline-table.sql.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14676 from petermaxlee/SPARK-16947.
## What changes were proposed in this pull request?
This patch fixes the problem described in SPARK-17117, i.e. "SELECT 1 / NULL" throws an analysis exception:
```
org.apache.spark.sql.AnalysisException: cannot resolve '(1 / NULL)' due to data type mismatch: differing types in '(1 / NULL)' (int and null).
```
The problem is that division type coercion did not take null type into account.
## How was this patch tested?
A unit test for the type coercion, and a few end-to-end test cases using SQLQueryTestSuite.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14695 from petermaxlee/SPARK-17117.
## What changes were proposed in this pull request?
This adds analyzer rules for resolving table-valued functions, and adds one builtin implementation for range(). The arguments for range() are the same as those of `spark.range()`.
## How was this patch tested?
Unit tests.
cc hvanhovell
Author: Eric Liang <ekl@databricks.com>
Closes#14656 from ericl/sc-4309.
## What changes were proposed in this pull request?
The `Optimizer` rules `PushThroughSetOperations` and `PushDownPredicate` have a redundant rule to push down `Filter` through `Union`. We should remove it.
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#14687 from viirya/remove-extra-pushdown.
## What changes were proposed in this pull request?
This patch introduces a new private ReduceAggregator interface that is a subclass of Aggregator. ReduceAggregator only requires a single associative and commutative reduce function. ReduceAggregator is also used to implement KeyValueGroupedDataset.reduceGroups in order to support partial aggregation.
Note that the pull request was initially done by viirya.
## How was this patch tested?
Covered by original tests for reduceGroups, as well as a new test suite for ReduceAggregator.
Author: Reynold Xin <rxin@databricks.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#14576 from rxin/reduceAggregator.
## What changes were proposed in this pull request?
I was looking at the code for UnresolvedOrdinal and made a few small changes to make it slightly more clear:
1. Rename the rule to SubstituteUnresolvedOrdinals which is more consistent with other rules that start with verbs. Note that this is still inconsistent with CTESubstitution and WindowsSubstitution.
2. Broke the test suite down from a single test case to three test cases.
## How was this patch tested?
This is a minor cleanup.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14672 from petermaxlee/SPARK-17034.
## What changes were proposed in this pull request?
A TreeNodeException is thrown when executing the following minimal example in Spark 2.0.
import spark.implicits._
case class test (x: Int, q: Int)
val d = Seq(1).toDF("x")
d.withColumn("q", lit(0)).as[test].groupByKey(_.x).flatMapGroups{case (x, iter) => List[Int]()}.show
d.withColumn("q", expr("0")).as[test].groupByKey(_.x).flatMapGroups{case (x, iter) => List[Int]()}.show
The problem is at `FoldablePropagation`. The rule will do `transformExpressions` on `LogicalPlan`. The query above contains a `MapGroups` which has a parameter `dataAttributes:Seq[Attribute]`. One attributes in `dataAttributes` will be transformed to an `Alias(literal(0), _)` in `FoldablePropagation`. `Alias` is not an `Attribute` and causes the error.
We can't easily detect such type inconsistency during transforming expressions. A direct approach to this problem is to skip doing `FoldablePropagation` on object operators as they should not contain such expressions.
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#14648 from viirya/flat-mapping.
## What changes were proposed in this pull request?
Currently, the stackTrace (as `Array[StackTraceElements]`) reported through StreamingQueryListener.onQueryTerminated is useless as it has the stack trace of where StreamingQueryException is defined, not the stack trace of underlying exception. For example, if a streaming query fails because of a / by zero exception in a task, the `QueryTerminated.stackTrace` will have
```
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:211)
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:124)
```
This is basically useless, as it is location where the StreamingQueryException was defined. What we want is
Here is the right way to reason about what should be posted as through StreamingQueryListener.onQueryTerminated
- The actual exception could either be a SparkException, or an arbitrary exception.
- SparkException reports the relevant executor stack trace of a failed task as a string in the the exception message. The `Array[StackTraceElements]` returned by `SparkException.stackTrace()` is mostly irrelevant.
- For any arbitrary exception, the `Array[StackTraceElements]` returned by `exception.stackTrace()` may be relevant.
- When there is an error in a streaming query, it's hard to reason whether the `Array[StackTraceElements]` is useful or not. In fact, it is not clear whether it is even useful to report the stack trace as this array of Java objects. It may be sufficient to report the strack trace as a string, along with the message. This is how Spark reported executor stra
- Hence, this PR simplifies the API by removing the array `stackTrace` from `QueryTerminated`. Instead the `exception` returns a string containing the message and the stack trace of the actual underlying exception that failed the streaming query (i.e. not that of the StreamingQueryException). If anyone is interested in the actual stack trace as an array, can always access them through `streamingQuery.exception` which returns the exception object.
With this change, if a streaming query fails because of a / by zero exception in a task, the `QueryTerminated.exception` will be
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): java.lang.ArithmeticException: / by zero
at org.apache.spark.sql.streaming.StreamingQueryListenerSuite$$anonfun$5$$anonfun$apply$mcV$sp$4$$anonfun$apply$mcV$sp$5.apply$mcII$sp(StreamingQueryListenerSuite.scala:153)
at org.apache.spark.sql.streaming.StreamingQueryListenerSuite$$anonfun$5$$anonfun$apply$mcV$sp$4$$anonfun$apply$mcV$sp$5.apply(StreamingQueryListenerSuite.scala:153)
at org.apache.spark.sql.streaming.StreamingQueryListenerSuite$$anonfun$5$$anonfun$apply$mcV$sp$4$$anonfun$apply$mcV$sp$5.apply(StreamingQueryListenerSuite.scala:153)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:226)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1429)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1417)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1416)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1416)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
...
```
It contains the relevant executor stack trace. In a case non-SparkException, if the streaming source MemoryStream throws an exception, exception message will have the relevant stack trace.
```
java.lang.RuntimeException: this is the exception message
at org.apache.spark.sql.execution.streaming.MemoryStream.getBatch(memory.scala:103)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:316)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:313)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:197)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:43)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:187)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:124)
```
Note that this change in the public `QueryTerminated` class is okay as the APIs are still experimental.
## How was this patch tested?
Unit tests that test whether the right information is present in the exception message reported through QueryTerminated object.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#14675 from tdas/SPARK-17096.
A review of the code, working back from Hadoop's `FileSystem.exists()` and `FileSystem.isDirectory()` code, then removing uses of the calls when superfluous.
1. delete is harmless if called on a nonexistent path, so don't do any checks before deletes
1. any `FileSystem.exists()` check before `getFileStatus()` or `open()` is superfluous as the operation itself does the check. Instead the `FileNotFoundException` is caught and triggers the downgraded path. When a `FileNotFoundException` was thrown before, the code still creates a new FNFE with the error messages. Though now the inner exceptions are nested, for easier diagnostics.
Initially, relying on Jenkins test runs.
One troublespot here is that some of the codepaths are clearly error situations; it's not clear that they have coverage anyway. Trying to create the failure conditions in tests would be ideal, but it will also be hard.
Author: Steve Loughran <stevel@apache.org>
Closes#14371 from steveloughran/cloud/SPARK-16736-superfluous-fs-calls.
## What changes were proposed in this pull request?
We use reflection to convert `TreeNode` to json string, and currently don't support arbitrary object. `UserDefinedGenerator` takes a function object, so we should skip json format test for it, or the tests can be flacky, e.g. `DataFrameSuite.simple explode`, this test always fail with scala 2.10(branch 1.6 builds with scala 2.10 by default), but pass with scala 2.11(master branch builds with scala 2.11 by default).
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14679 from cloud-fan/json.
## What changes were proposed in this pull request?
The current subquery expression interface contains a little bit of technical debt in the form of a few different access paths to get and set the query contained by the expression. This is confusing to anyone who goes over this code.
This PR unifies these access paths.
## How was this patch tested?
(Existing tests)
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#14685 from hvanhovell/SPARK-17106.
## What changes were proposed in this pull request?
This PR splits the generated code for ```SafeProjection.apply``` by using ```ctx.splitExpressions()```. This is because the large code body for ```NewInstance``` may grow beyond 64KB bytecode size for ```apply()``` method.
Here is [the original PR](https://github.com/apache/spark/pull/13243) for SPARK-15285. However, it breaks a build with Scala 2.10 since Scala 2.10 does not a case class with large number of members. Thus, it was reverted by [this commit](fa244e5a90).
## How was this patch tested?
Added new tests by using `DefinedByConstructorParams` instead of case class for scala-2.10
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#14670 from kiszk/SPARK-15285-2.
## What changes were proposed in this pull request?
Currently methods in `ParserUtils` are tested indirectly, we should add test cases in `ParserUtilsSuite` to verify their integrity directly.
## How was this patch tested?
New test cases in `ParserUtilsSuite`
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes#14620 from jiangxb1987/parserUtils.
## What changes were proposed in this pull request?
This PR adds a field to subquery alias in order to make the usage of views in a resolved `LogicalPlan` more visible (and more understandable).
For example, the following view and query:
```sql
create view constants as select 1 as id union all select 1 union all select 42
select * from constants;
```
...now yields the following analyzed plan:
```
Project [id#39]
+- SubqueryAlias c, `default`.`constants`
+- Project [gen_attr_0#36 AS id#39]
+- SubqueryAlias gen_subquery_0
+- Union
:- Union
: :- Project [1 AS gen_attr_0#36]
: : +- OneRowRelation$
: +- Project [1 AS gen_attr_1#37]
: +- OneRowRelation$
+- Project [42 AS gen_attr_2#38]
+- OneRowRelation$
```
## How was this patch tested?
Added tests for the two code paths in `SessionCatalogSuite` (sql/core) and `HiveMetastoreCatalogSuite` (sql/hive)
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#14657 from hvanhovell/SPARK-17068.
## What changes were proposed in this pull request?
This PR renames `ParserUtils.assert` to `ParserUtils.validate`. This is done because this method is used to check requirements, and not to check if the program is in an invalid state.
## How was this patch tested?
Simple rename. Compilation should do.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#14665 from hvanhovell/SPARK-17084.
## What changes were proposed in this pull request?
This PR is a small follow-up to https://github.com/apache/spark/pull/14554. This also widens the visibility of a few (similar) Hive classes.
## How was this patch tested?
No test. Only a visibility change.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#14654 from hvanhovell/SPARK-16964-hive.
## What changes were proposed in this pull request?
This PR adds expression `UnresolvedOrdinal` to represent the ordinal in GROUP BY or ORDER BY, and fixes the rules when resolving ordinals.
Ordinals in GROUP BY or ORDER BY like `1` in `order by 1` or `group by 1` should be considered as unresolved before analysis. But in current code, it uses `Literal` expression to store the ordinal. This is inappropriate as `Literal` itself is a resolved expression, it gives the user a wrong message that the ordinals has already been resolved.
### Before this change
Ordinal is stored as `Literal` expression
```
scala> sc.setLogLevel("TRACE")
scala> sql("select a from t group by 1 order by 1")
...
'Sort [1 ASC], true
+- 'Aggregate [1], ['a]
+- 'UnresolvedRelation `t
```
For query:
```
scala> Seq(1).toDF("a").createOrReplaceTempView("t")
scala> sql("select count(a), a from t group by 2 having a > 0").show
```
During analysis, the intermediate plan before applying rule `ResolveAggregateFunctions` is:
```
'Filter ('a > 0)
+- Aggregate [2], [count(1) AS count(1)#83L, a#81]
+- LocalRelation [value#7 AS a#9]
```
Before this PR, rule `ResolveAggregateFunctions` believes all expressions of `Aggregate` have already been resolved, and tries to resolve the expressions in `Filter` directly. But this is wrong, as ordinal `2` in Aggregate is not really resolved!
### After this change
Ordinals are stored as `UnresolvedOrdinal`.
```
scala> sc.setLogLevel("TRACE")
scala> sql("select a from t group by 1 order by 1")
...
'Sort [unresolvedordinal(1) ASC], true
+- 'Aggregate [unresolvedordinal(1)], ['a]
+- 'UnresolvedRelation `t`
```
## How was this patch tested?
Unit tests.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#14616 from clockfly/spark-16955.
## What changes were proposed in this pull request?
`CatalogStorageFormat.properties` can be used in 2 ways:
1. for hive tables, it stores the serde properties.
2. for data source tables, it stores the data source options, e.g. `path`, `skipHiveMetadata`, etc.
however, both of them have nothing to do with data source properties, e.g. `spark.sql.sources.provider`, so they should not have limitations about data source properties.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14506 from cloud-fan/table-prop.
## What changes were proposed in this pull request?
Add an instruction to ask the user to remove or upgrade the incompatible DataSourceRegister in the error message.
## How was this patch tested?
Test command:
```
build/sbt -Dscala-2.10 package
SPARK_SCALA_VERSION=2.10 bin/spark-shell --packages ai.h2o:sparkling-water-core_2.10:1.6.5
scala> Seq(1).toDS().write.format("parquet").save("foo")
```
Before:
```
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.h2o.DefaultSource could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:232)
at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
...
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
...
```
After:
```
java.lang.ClassNotFoundException: Detected an incompatible DataSourceRegister. Please remove the incompatible library from classpath or upgrade it. Error: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.h2o.DefaultSource could not be instantiated
at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:178)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:79)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:79)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:441)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:213)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:196)
...
```
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#14651 from zsxwing/SPARK-17065.
Both core and sql have slightly different code that does variable substitution
of config values. This change refactors that code and encapsulates the logic
of reading config values and expading variables in a new helper class, which
can be configured so that both core and sql can use it without losing existing
functionality, and allows for easier testing and makes it easier to add more
features in the future.
Tested with existing and new unit tests, and by running spark-shell with
some configs referencing variables and making sure it behaved as expected.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#14468 from vanzin/SPARK-16671.
## What changes were proposed in this pull request?
Don't override app name specified in `SparkConf` with a random app name. Only set it if the conf has no app name even after options have been applied.
See also https://github.com/apache/spark/pull/14602
This is similar to Sherry302 's original proposal in https://github.com/apache/spark/pull/14556
## How was this patch tested?
Jenkins test, with new case reproducing the bug
Author: Sean Owen <sowen@cloudera.com>
Closes#14630 from srowen/SPARK-16966.2.
## What changes were proposed in this pull request?
In the PR, we just allow the user to add additional options when create a new table in JDBC writer.
The options can be table_options or partition_options.
E.g., "CREATE TABLE t (name string) ENGINE=InnoDB DEFAULT CHARSET=utf8"
Here is the usage example:
```
df.write.option("createTableOptions", "ENGINE=InnoDB DEFAULT CHARSET=utf8").jdbc(...)
```
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
will apply test result soon.
Author: GraceH <93113783@qq.com>
Closes#14559 from GraceH/jdbc_options.
## What changes were proposed in this pull request?
This PR changes the CTE resolving rule to use only **forward-declared** tables in order to prevent infinite loops. More specifically, new logic is like the following.
* Resolve CTEs in `WITH` clauses first before replacing the main SQL body.
* When resolving CTEs, only forward-declared CTEs or base tables are referenced.
- Self-referencing is not allowed any more.
- Cross-referencing is not allowed any more.
**Reported Error Scenarios**
```scala
scala> sql("WITH t AS (SELECT 1 FROM t) SELECT * FROM t")
java.lang.StackOverflowError
...
scala> sql("WITH t1 AS (SELECT * FROM t2), t2 AS (SELECT 2 FROM t1) SELECT * FROM t1, t2")
java.lang.StackOverflowError
...
```
Note that `t`, `t1`, and `t2` are not declared in database. Spark falls into infinite loops before resolving table names.
## How was this patch tested?
Pass the Jenkins tests with new two testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14397 from dongjoon-hyun/SPARK-16771-TREENODE.
#### What changes were proposed in this pull request?
So far, the test cases of `TableIdentifierParserSuite` do not cover the quoted cases. We should add one for avoiding regression.
#### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14244 from gatorsmile/quotedIdentifiers.
## What changes were proposed in this pull request?
This patch updates the SQL parser to parse negative numeric literals as numeric literals, instead of unary minus of positive literals.
This allows the parser to parse the minimal value for each data type, e.g. "-32768S".
## How was this patch tested?
Updated test cases.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14608 from petermaxlee/SPARK-17013.