## What changes were proposed in this pull request?
For the following workflow:
1. I have a column called time which is at minute level precision in a Streaming DataFrame
2. I want to perform groupBy time, count
3. Then I want my MemorySink to only have the last 30 minutes of counts and I perform this by
.where('time >= current_timestamp().cast("long") - 30 * 60)
what happens is that the `filter` gets pushed down before the aggregation, and the filter happens on the source data for the aggregation instead of the result of the aggregation (where I actually want to filter).
I guess the main issue here is that `current_timestamp` is non-deterministic in the streaming context and shouldn't be pushed down the filter.
Does this require us to store the `current_timestamp` for each trigger of the streaming job, that is something to discuss.
Furthermore, we want to persist current batch timestamp and watermark timestamp to the offset log so that these values are consistent across multiple executions of the same batch.
brkyvz zsxwing tdas
## How was this patch tested?
A test was added to StreamingAggregationSuite ensuring the above use case is handled. The test injects a stream of time values (in seconds) to a query that runs in complete mode and only outputs the (count) aggregation results for the past 10 seconds.
Author: Tyson Condie <tcondie@gmail.com>
Closes#15949 from tcondie/SPARK-18339.
## What changes were proposed in this pull request?
We failed to properly propagate table metadata for existing tables for the saveAsTable command. This caused a downstream component to think the table was MANAGED, writing data to the wrong location.
## How was this patch tested?
Unit test that fails before the patch.
Author: Eric Liang <ekl@databricks.com>
Closes#15983 from ericl/spark-18544.
## What changes were proposed in this pull request?
This is absolutely minor. PR https://github.com/apache/spark/pull/15595 uses `dt1.asNullable == dt2.asNullable` expressions in a few places. It is however more efficient to call `dt1.sameType(dt2)`. I have replaced every instance of the first pattern with the second pattern (3/5 were introduced by #15595).
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#16041 from hvanhovell/SPARK-18058.
## What changes were proposed in this pull request?
This PR fixes a random OOM issue occurred while running `ObjectHashAggregateSuite`.
This issue can be steadily reproduced under the following conditions:
1. The aggregation must be evaluated using `ObjectHashAggregateExec`;
2. There must be an input column whose data type involves `ArrayType` (an input column of `MapType` may even cause SIGSEGV);
3. Sort-based aggregation fallback must be triggered during evaluation.
The root cause is that while falling back to sort-based aggregation, we must sort and feed already evaluated partial aggregation buffers living in the hash map to the sort-based aggregator using an external sorter. However, the underlying mutable byte buffer of `UnsafeRow`s produced by the iterator of the external sorter is reused and may get overwritten when the iterator steps forward. After the last entry is consumed, the byte buffer points to a block of uninitialized memory filled by `5a`. Therefore, while reading an `UnsafeArrayData` out of the `UnsafeRow`, `5a5a5a5a` is treated as array size and triggers a memory allocation for a ridiculously large array and immediately blows up the JVM with an OOM.
To fix this issue, we only need to add `.copy()` accordingly.
## How was this patch tested?
New regression test case added in `ObjectHashAggregateSuite`.
Author: Cheng Lian <lian@databricks.com>
Closes#15976 from liancheng/investigate-oom.
## What changes were proposed in this pull request?
This pull request adds test cases for the following cases:
- keep all data types with null or without null
- access `CachedBatch` disabling whole stage codegen
- access only some columns in `CachedBatch`
This PR is a part of https://github.com/apache/spark/pull/15219. Here are motivations to add these tests. When https://github.com/apache/spark/pull/15219 is enabled, the first two cases are handled by specialized (generated) code. The third one is a pitfall.
In general, even for now, it would be helpful to increase test coverage.
## How was this patch tested?
added test suites itself
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#15462 from kiszk/columnartestsuites.
## What changes were proposed in this pull request?
`CatalogTable` has a parameter named `tracksPartitionsInCatalog`, and in `CatalogTable.toString` we use `"Partition Provider: Catalog"` to represent it. This PR fixes `DESC TABLE` to make it consistent with `CatalogTable.toString`.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#16035 from cloud-fan/minor.
## What changes were proposed in this pull request?
https://github.com/apache/spark/pull/15704 will fail if we use int literal in `DROP PARTITION`, and we have reverted it in branch-2.1.
This PR reverts it in master branch, and add a regression test for it, to make sure the master branch is healthy.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#16036 from cloud-fan/revert.
## What changes were proposed in this pull request?
We currently push down join conditions of a Left Anti join to both sides of the join. This is similar to Inner, Left Semi and Existence (a specialized left semi) join. The problem is that this changes the semantics of the join; a left anti join filters out rows that matches the join condition.
This PR fixes this by only pushing down conditions to the left hand side of the join. This is similar to the behavior of left outer join.
## How was this patch tested?
Added tests to `FilterPushdownSuite.scala` and created a SQLQueryTestSuite file for left anti joins with a regression test.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#16026 from hvanhovell/SPARK-18597.
### What changes were proposed in this pull request?
Currently, the name validation checks are limited to table creation. It is enfored by Analyzer rule: `PreWriteCheck`.
However, table renaming and database creation have the same issues. It makes more sense to do the checks in `SessionCatalog`. This PR is to add it into `SessionCatalog`.
### How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#16018 from gatorsmile/nameValidate.
## What changes were proposed in this pull request?
This PR is to fix incorrect `code` tag in `sql-programming-guide.md`
## How was this patch tested?
Manually.
Author: Weiqing Yang <yangweiqing001@gmail.com>
Closes#15941 from weiqingy/fixtag.
## What changes were proposed in this pull request?
The expression `in(empty seq)` is invalid in some data source. Since `in(empty seq)` is always false, we should generate `in(empty seq)` to false literal in optimizer.
The sql `SELECT * FROM t WHERE a IN ()` throws a `ParseException` which is consistent with Hive, don't need to change that behavior.
## How was this patch tested?
Add new test case in `OptimizeInSuite`.
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes#15977 from jiangxb1987/isin-empty.
## What changes were proposed in this pull request?
This is a follow-up PR of #15868 to merge `maxConnections` option into `numPartitions` options.
## How was this patch tested?
Pass the existing tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#15966 from dongjoon-hyun/SPARK-18413-2.
## What changes were proposed in this pull request?
This PR only tries to fix things that looks pretty straightforward and were fixed in other previous PRs before.
This PR roughly fixes several things as below:
- Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` ``
```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found
[error] * Loads text files and returns a {link DataFrame} whose schema starts with a string column named
```
- Fix an exception annotation and remove code backticks in `throws` annotation
Currently, sbt unidoc with Java 8 complains as below:
```
[error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text
[error] * throws StreamingQueryException, if <code>this</code> query has terminated with an exception.
```
`throws` should specify the correct class name from `StreamingQueryException,` to `StreamingQueryException` without backticks. (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)).
- Fix `[[http..]]` to `<a href="http..."></a>`.
```diff
- * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
- * blog page]].
+ * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
+ * Oracle blog page</a>.
```
`[[http...]]` link markdown in scaladoc is unrecognisable in javadoc.
- It seems class can't have `return` annotation. So, two cases of this were removed.
```
[error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return
[error] * return New instance of IsotonicRegression.
```
- Fix < to `<` and > to `>` according to HTML rules.
- Fix `</p>` complaint
- Exclude unrecognisable in javadoc, `constructor`, `todo` and `groupname`.
## How was this patch tested?
Manually tested by `jekyll build` with Java 7 and 8
```
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
```
```
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
```
Note: this does not yet make sbt unidoc suceed with Java 8 yet but it reduces the number of errors with Java 8.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#15999 from HyukjinKwon/SPARK-3359-errors.
## What changes were proposed in this pull request?
- Raise Analysis exception when correlated predicates exist in the descendant operators of either operand of a Full outer join in a subquery as well as in a FOJ operator itself
- Raise Analysis exception when correlated predicates exists in a Window operator (a side effect inadvertently introduced by SPARK-17348)
## How was this patch tested?
Run sql/test catalyst/test and new test cases, added to SubquerySuite, showing the reported incorrect results.
Author: Nattavut Sutyanyong <nsy.can@gmail.com>
Closes#16005 from nsyca/FOJ-incorrect.1.
## What changes were proposed in this pull request?
This PR addressed the rest comments in #15951.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#15997 from zsxwing/SPARK-18510-follow-up.
## What changes were proposed in this pull request?
### The Issue
If I specify my schema when doing
```scala
spark.read
.schema(someSchemaWherePartitionColumnsAreStrings)
```
but if the partition inference can infer it as IntegerType or I assume LongType or DoubleType (basically fixed size types), then once UnsafeRows are generated, your data will be corrupted.
### Proposed solution
The partition handling code path is kind of a mess. In my fix I'm probably adding to the mess, but at least trying to standardize the code path.
The real issue is that a user that uses the `spark.read` code path can never clearly specify what the partition columns are. If you try to specify the fields in `schema`, we practically ignore what the user provides, and fall back to our inferred data types. What happens in the end is data corruption.
My solution tries to fix this by always trying to infer partition columns the first time you specify the table. Once we find what the partition columns are, we try to find them in the user specified schema and use the dataType provided there, or fall back to the smallest common data type.
We will ALWAYS append partition columns to the user's schema, even if they didn't ask for it. We will only use the data type they provided if they specified it. While this is confusing, this has been the behavior since Spark 1.6, and I didn't want to change this behavior in the QA period of Spark 2.1. We may revisit this decision later.
A side effect of this PR is that we won't need https://github.com/apache/spark/pull/15942 if this PR goes in.
## How was this patch tested?
Regression tests
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#15951 from brkyvz/partition-corruption.
## What changes were proposed in this pull request?
When we try to create the default database, we ask hive to do nothing if it already exists. However, Hive will log an error message instead of doing nothing, and the error message is quite annoying and confusing.
In this PR, we only create default database if it doesn't exist.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#15993 from cloud-fan/default-db.
## What changes were proposed in this pull request?
The current implementation of column stats uses the base64 encoding of the internal UnsafeRow format to persist statistics (in table properties in Hive metastore). This is an internal format that is not stable across different versions of Spark and should NOT be used for persistence. In addition, it would be better if statistics stored in the catalog is human readable.
This pull request introduces the following changes:
1. Created a single ColumnStat class to for all data types. All data types track the same set of statistics.
2. Updated the implementation for stats collection to get rid of the dependency on internal data structures (e.g. InternalRow, or storing DateType as an int32). For example, previously dates were stored as a single integer, but are now stored as java.sql.Date. When we implement the next steps of CBO, we can add code to convert those back into internal types again.
3. Documented clearly what JVM data types are being used to store what data.
4. Defined a simple Map[String, String] interface for serializing and deserializing column stats into/from the catalog.
5. Rearranged the method/function structure so it is more clear what the supported data types are, and also moved how stats are generated into ColumnStat class so they are easy to find.
## How was this patch tested?
Removed most of the original test cases created for column statistics, and added three very simple ones to cover all the cases. The three test cases validate:
1. Roundtrip serialization works.
2. Behavior when analyzing non-existent column or unsupported data type column.
3. Result for stats collection for all valid data types.
Also moved parser related tests into a parser test suite and added an explicit serialization test for the Hive external catalog.
Author: Reynold Xin <rxin@databricks.com>
Closes#15959 from rxin/SPARK-18522.
## What changes were proposed in this pull request?
In Spark SQL, some expression may output safe format values, e.g. `CreateArray`, `CreateStruct`, `Cast`, etc. When we compare 2 values, we should be able to compare safe and unsafe formats.
The `GreaterThan`, `LessThan`, etc. in Spark SQL already handles it, but the `EqualTo` doesn't. This PR fixes it.
## How was this patch tested?
new unit test and regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#15929 from cloud-fan/type-aware.
## What changes were proposed in this pull request?
Updates links to the wiki to links to the new location of content on spark.apache.org.
## How was this patch tested?
Doc builds
Author: Sean Owen <sowen@cloudera.com>
Closes#15967 from srowen/SPARK-18073.1.
## What changes were proposed in this pull request?
Fixes the inconsistency of error raised between data source and hive serde
tables when schema is specified in CTAS scenario. In the process the grammar for
create table (datasource) is simplified.
**before:**
``` SQL
spark-sql> create table t2 (c1 int, c2 int) using parquet as select * from t1;
Error in query:
mismatched input 'as' expecting {<EOF>, '.', 'OPTIONS', 'CLUSTERED', 'PARTITIONED'}(line 1, pos 64)
== SQL ==
create table t2 (c1 int, c2 int) using parquet as select * from t1
----------------------------------------------------------------^^^
```
**After:**
```SQL
spark-sql> create table t2 (c1 int, c2 int) using parquet as select * from t1
> ;
Error in query:
Operation not allowed: Schema may not be specified in a Create Table As Select (CTAS) statement(line 1, pos 0)
== SQL ==
create table t2 (c1 int, c2 int) using parquet as select * from t1
^^^
```
## How was this patch tested?
Added a new test in CreateTableAsSelectSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#15968 from dilipbiswal/ctas.
### What changes were proposed in this pull request?
In Spark 2.0, `SaveAsTable` does not work when the target table is a Hive serde table, but Spark 1.6 works.
**Spark 1.6**
``` Scala
scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as key, 'abc' as value")
res2: org.apache.spark.sql.DataFrame = []
scala> val df = sql("select key, value as value from sample.sample")
df: org.apache.spark.sql.DataFrame = [key: int, value: string]
scala> df.write.mode("append").saveAsTable("sample.sample")
scala> sql("select * from sample.sample").show()
+---+-----+
|key|value|
+---+-----+
| 1| abc|
| 1| abc|
+---+-----+
```
**Spark 2.0**
``` Scala
scala> df.write.mode("append").saveAsTable("sample.sample")
org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation sample, sample
is not supported.;
```
So far, we do not plan to support it in Spark 2.1 due to the risk. Spark 1.6 works because it internally uses insertInto. But, if we change it back it will break the semantic of saveAsTable (this method uses by-name resolution instead of using by-position resolution used by insertInto). More extra changes are needed to support `hive` as a `format` in DataFrameWriter.
Instead, users should use insertInto API. This PR corrects the error messages. Users can understand how to bypass it before we support it in a separate PR.
### How was this patch tested?
Test cases are added
Author: gatorsmile <gatorsmile@gmail.com>
Closes#15926 from gatorsmile/saveAsTableFix5.
## What changes were proposed in this pull request?
While this behavior is debatable, consider the following use case:
```sql
UNCACHE TABLE foo;
CACHE TABLE foo AS
SELECT * FROM bar
```
The command above fails the first time you run it. But I want to run the command above over and over again, and I don't want to change my code just for the first run of it.
The issue is that subsequent `CACHE TABLE` commands do not overwrite the existing table.
Now we can do:
```sql
UNCACHE TABLE IF EXISTS foo;
CACHE TABLE foo AS
SELECT * FROM bar
```
## How was this patch tested?
Unit tests
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#15896 from brkyvz/uncache.
## What changes were proposed in this pull request?
This PR blocks an incorrect result scenario in scalar subquery where there are GROUP BY column(s)
that are not part of the correlated predicate(s).
Example:
// Incorrect result
Seq(1).toDF("c1").createOrReplaceTempView("t1")
Seq((1,1),(1,2)).toDF("c1","c2").createOrReplaceTempView("t2")
sql("select (select sum(-1) from t2 where t1.c1=t2.c1 group by t2.c2) from t1").show
// How can selecting a scalar subquery from a 1-row table return 2 rows?
## How was this patch tested?
sql/test, catalyst/test
new test case covering the reported problem is added to SubquerySuite.scala
Author: Nattavut Sutyanyong <nsy.can@gmail.com>
Closes#15936 from nsyca/scalarSubqueryIncorrect-1.
## What changes were proposed in this pull request?
Right now we are testing the most of `CompactibleFileStreamLog` in `FileStreamSinkLogSuite` (because `FileStreamSinkLog` once was the only subclass of `CompactibleFileStreamLog`, but now it's not the case any more).
Let's refactor the tests so that `CompactibleFileStreamLog` is directly tested, making future changes (like https://github.com/apache/spark/pull/15828, https://github.com/apache/spark/pull/15827) to `CompactibleFileStreamLog` much easier to test and much easier to review.
## How was this patch tested?
the PR itself is about tests
Author: Liwei Lin <lwlin7@gmail.com>
Closes#15870 from lw-lin/test-compact-1113.
## What changes were proposed in this pull request?
This PR adds two of the newly added methods of `Dataset`s to Python:
`withWatermark` and `checkpoint`
## How was this patch tested?
Doc tests
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#15921 from brkyvz/py-watermark.
## What changes were proposed in this pull request?
Currently, `DROP TABLE IF EXISTS` shows warning for non-existing tables. However, it had better be quiet for this case by definition of the command.
**BEFORE**
```scala
scala> sql("DROP TABLE IF EXISTS nonexist")
16/11/20 20:48:26 WARN DropTableCommand: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'nonexist' not found in database 'default';
```
**AFTER**
```scala
scala> sql("DROP TABLE IF EXISTS nonexist")
res0: org.apache.spark.sql.DataFrame = []
```
## How was this patch tested?
Manual because this is related to the warning messages instead of exceptions.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#15953 from dongjoon-hyun/SPARK-18517.
## What changes were proposed in this pull request?
This PR adds a new JDBCOption `maxConnections` which means the maximum number of simultaneous JDBC connections allowed. This option applies only to writing with coalesce operation if needed. It defaults to the number of partitions of RDD. Previously, SQL users cannot cannot control this while Scala/Java/Python users can use `coalesce` (or `repartition`) API.
**Reported Scenario**
For the following cases, the number of connections becomes 200 and database cannot handle all of them.
```sql
CREATE OR REPLACE TEMPORARY VIEW resultview
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:oracle:thin:10.129.10.111:1521:BKDB",
dbtable "result",
user "HIVE",
password "HIVE"
);
-- set spark.sql.shuffle.partitions=200
INSERT OVERWRITE TABLE resultview SELECT g, count(1) AS COUNT FROM tnet.DT_LIVE_INFO GROUP BY g
```
## How was this patch tested?
Manual. Do the followings and see Spark UI.
**Step 1 (MySQL)**
```
CREATE TABLE t1 (a INT);
CREATE TABLE data (a INT);
INSERT INTO data VALUES (1);
INSERT INTO data VALUES (2);
INSERT INTO data VALUES (3);
```
**Step 2 (Spark)**
```scala
SPARK_HOME=$PWD bin/spark-shell --driver-memory 4G --driver-class-path mysql-connector-java-5.1.40-bin.jar
scala> sql("SET spark.sql.shuffle.partitions=3")
scala> sql("CREATE OR REPLACE TEMPORARY VIEW data USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 'data', user 'root', password '')")
scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '1')")
scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '2')")
scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '3')")
scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '4')")
scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
```
![maxconnections](https://cloud.githubusercontent.com/assets/9700541/20287987/ed8409c2-aa84-11e6-8aab-ae28e63fe54d.png)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#15868 from dongjoon-hyun/SPARK-18413.
## What changes were proposed in this pull request?
This PR adds code generation to `Generate`. It supports two code paths:
- General `TraversableOnce` based iteration. This used for regular `Generator` (code generation supporting) expressions. This code path expects the expression to return a `TraversableOnce[InternalRow]` and it will iterate over the returned collection. This PR adds code generation for the `stack` generator.
- Specialized `ArrayData/MapData` based iteration. This is used for the `explode`, `posexplode` & `inline` functions and operates directly on the `ArrayData`/`MapData` result that the child of the generator returns.
### Benchmarks
I have added some benchmarks and it seems we can create a nice speedup for explode:
#### Environment
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
Intel(R) Core(TM) i7-4980HQ CPU 2.80GHz
```
#### Explode Array
##### Before
```
generate explode array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate explode array wholestage off 7377 / 7607 2.3 439.7 1.0X
generate explode array wholestage on 6055 / 6086 2.8 360.9 1.2X
```
##### After
```
generate explode array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate explode array wholestage off 7432 / 7696 2.3 443.0 1.0X
generate explode array wholestage on 631 / 646 26.6 37.6 11.8X
```
#### Explode Map
##### Before
```
generate explode map: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate explode map wholestage off 12792 / 12848 1.3 762.5 1.0X
generate explode map wholestage on 11181 / 11237 1.5 666.5 1.1X
```
##### After
```
generate explode map: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate explode map wholestage off 10949 / 10972 1.5 652.6 1.0X
generate explode map wholestage on 870 / 913 19.3 51.9 12.6X
```
#### Posexplode
##### Before
```
generate posexplode array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate posexplode array wholestage off 7547 / 7580 2.2 449.8 1.0X
generate posexplode array wholestage on 5786 / 5838 2.9 344.9 1.3X
```
##### After
```
generate posexplode array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate posexplode array wholestage off 7535 / 7548 2.2 449.1 1.0X
generate posexplode array wholestage on 620 / 624 27.1 37.0 12.1X
```
#### Inline
##### Before
```
generate inline array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate inline array wholestage off 6935 / 6978 2.4 413.3 1.0X
generate inline array wholestage on 6360 / 6400 2.6 379.1 1.1X
```
##### After
```
generate inline array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate inline array wholestage off 6940 / 6966 2.4 413.6 1.0X
generate inline array wholestage on 1002 / 1012 16.7 59.7 6.9X
```
#### Stack
##### Before
```
generate stack: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate stack wholestage off 12980 / 13104 1.3 773.7 1.0X
generate stack wholestage on 11566 / 11580 1.5 689.4 1.1X
```
##### After
```
generate stack: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
generate stack wholestage off 12875 / 12949 1.3 767.4 1.0X
generate stack wholestage on 840 / 845 20.0 50.0 15.3X
```
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@databricks.com>
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#13065 from hvanhovell/SPARK-15214.
## What changes were proposed in this pull request?
Fix since 2.1.0 on new SparkSession.close() method. I goofed in https://github.com/apache/spark/pull/15932 because it was back-ported to 2.1 instead of just master as originally planned.
Author: Sean Owen <sowen@cloudera.com>
Closes#15938 from srowen/SPARK-18448.2.
## What changes were proposed in this pull request?
Just adds `close()` + `Closeable` as a synonym for `stop()`. This makes it usable in Java in try-with-resources, as suggested by ash211 (`Closeable` extends `AutoCloseable` BTW)
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#15932 from srowen/SPARK-18448.
## What changes were proposed in this pull request?
The issue in ForeachSink is the new created DataSet still uses the old QueryExecution. When `foreachPartition` is called, `QueryExecution.toString` will be called and then fail because it doesn't know how to plan EventTimeWatermark.
This PR just replaces the QueryExecution with IncrementalExecution to fix the issue.
## How was this patch tested?
`test("foreach with watermark")`.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#15934 from zsxwing/SPARK-18497.
## What changes were proposed in this pull request?
I'm spending more time at the design & code level for cost-based optimizer now, and have found a number of issues related to maintainability and compatibility that I will like to address.
This is a small pull request to clean up AnalyzeColumnCommand:
1. Removed warning on duplicated columns. Warnings in log messages are useless since most users that run SQL don't see them.
2. Removed the nested updateStats function, by just inlining the function.
3. Renamed a few functions to better reflect what they do.
4. Removed the factory apply method for ColumnStatStruct. It is a bad pattern to use a apply method that returns an instantiation of a class that is not of the same type (ColumnStatStruct.apply used to return CreateNamedStruct).
5. Renamed ColumnStatStruct to just AnalyzeColumnCommand.
6. Added more documentation explaining some of the non-obvious return types and code blocks.
In follow-up pull requests, I'd like to address the following:
1. Get rid of the Map[String, ColumnStat] map, since internally we should be using Attribute to reference columns, rather than strings.
2. Decouple the fields exposed by ColumnStat and internals of Spark SQL's execution path. Currently the two are coupled because ColumnStat takes in an InternalRow.
3. Correctness: Remove code path that stores statistics in the catalog using the base64 encoding of the UnsafeRow format, which is not stable across Spark versions.
4. Clearly document the data representation stored in the catalog for statistics.
## How was this patch tested?
Affected test cases have been updated.
Author: Reynold Xin <rxin@databricks.com>
Closes#15933 from rxin/SPARK-18505.
## What changes were proposed in this pull request?
HDFS `write` may just hang until timeout if some network error happens. It's better to enable interrupts to allow stopping the query fast on HDFS.
This PR just changes the logic to only disable interrupts for local file system, as HADOOP-10622 only happens for local file system.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#15911 from zsxwing/interrupt-on-dfs.
## What changes were proposed in this pull request?
CompactibleFileStreamLog relys on "compactInterval" to detect a compaction batch. If the "compactInterval" is reset by user, CompactibleFileStreamLog will return wrong answer, resulting data loss. This PR procides a way to check the validity of 'compactInterval', and calculate an appropriate value.
## How was this patch tested?
When restart a stream, we change the 'spark.sql.streaming.fileSource.log.compactInterval' different with the former one.
The primary solution to this issue was given by uncleGen
Added extensions include an additional metadata field in OffsetSeq and CompactibleFileStreamLog APIs. zsxwing
Author: Tyson Condie <tcondie@gmail.com>
Author: genmao.ygm <genmao.ygm@genmaoygmdeMacBook-Air.local>
Closes#15852 from tcondie/spark-18187.
## What changes were proposed in this pull request?
This patch fixes a `ClassCastException: java.lang.Integer cannot be cast to java.lang.Long` error which could occur in the HistoryServer while trying to process a deserialized `SparkListenerDriverAccumUpdates` event.
The problem stems from how `jackson-module-scala` handles primitive type parameters (see https://github.com/FasterXML/jackson-module-scala/wiki/FAQ#deserializing-optionint-and-other-primitive-challenges for more details). This was causing a problem where our code expected a field to be deserialized as a `(Long, Long)` tuple but we got an `(Int, Int)` tuple instead.
This patch hacks around this issue by registering a custom `Converter` with Jackson in order to deserialize the tuples as `(Object, Object)` and perform the appropriate casting.
## How was this patch tested?
New regression tests in `SQLListenerSuite`.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#15922 from JoshRosen/SPARK-18462.
## What changes were proposed in this pull request?
In ShuffleExchange, the nodename's extraInfo are the same when exchangeCoordinator.isEstimated
is true or false.
Merge the two situation in the PR.
Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)>
Closes#15920 from windpiger/DupNodeNameShuffleExchange.
## What changes were proposed in this pull request?
I found the documentation for the sample method to be confusing, this adds more clarification across all languages.
- [x] Scala
- [x] Python
- [x] R
- [x] RDD Scala
- [ ] RDD Python with SEED
- [X] RDD Java
- [x] RDD Java with SEED
- [x] RDD Python
## How was this patch tested?
NA
Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <bill@databricks.com>
Closes#15815 from anabranch/SPARK-18365.
## What changes were proposed in this pull request?
Before Spark 2.1, users can create an external data source table without schema, and we will infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table was created, so that we don't need to infer it again and again at runtime.
This is a good improvement, but we should still respect and support old tables which doesn't store table schema in metastore.
## How was this patch tested?
regression test.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#15900 from cloud-fan/hive-catalog.
## What changes were proposed in this pull request?
SPARK-18459: triggerId seems like a number that should be increasing with each trigger, whether or not there is data in it. However, actually, triggerId increases only where there is a batch of data in a trigger. So its better to rename it to batchId.
SPARK-18460: triggerDetails was missing from json representation. Fixed it.
## How was this patch tested?
Updated existing unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#15895 from tdas/SPARK-18459.
### What changes were proposed in this pull request?
Currently, when CTE is used in RunnableCommand, the Analyzer does not replace the logical node `With`. The child plan of RunnableCommand is not resolved. Thus, the output of the `With` plan node looks very confusing.
For example,
```
sql(
"""
|CREATE VIEW cte_view AS
|WITH w AS (SELECT 1 AS n), cte1 (select 2), cte2 as (select 3)
|SELECT n FROM w
""".stripMargin).explain()
```
The output is like
```
ExecutedCommand
+- CreateViewCommand `cte_view`, WITH w AS (SELECT 1 AS n), cte1 (select 2), cte2 as (select 3)
SELECT n FROM w, false, false, PersistedView
+- 'With [(w,SubqueryAlias w
+- Project [1 AS n#16]
+- OneRowRelation$
), (cte1,'SubqueryAlias cte1
+- 'Project [unresolvedalias(2, None)]
+- OneRowRelation$
), (cte2,'SubqueryAlias cte2
+- 'Project [unresolvedalias(3, None)]
+- OneRowRelation$
)]
+- 'Project ['n]
+- 'UnresolvedRelation `w`
```
After the fix, the output is as shown below.
```
ExecutedCommand
+- CreateViewCommand `cte_view`, WITH w AS (SELECT 1 AS n), cte1 (select 2), cte2 as (select 3)
SELECT n FROM w, false, false, PersistedView
+- CTE [w, cte1, cte2]
: :- SubqueryAlias w
: : +- Project [1 AS n#16]
: : +- OneRowRelation$
: :- 'SubqueryAlias cte1
: : +- 'Project [unresolvedalias(2, None)]
: : +- OneRowRelation$
: +- 'SubqueryAlias cte2
: +- 'Project [unresolvedalias(3, None)]
: +- OneRowRelation$
+- 'Project ['n]
+- 'UnresolvedRelation `w`
```
BTW, this PR also fixes the output of the view type.
### How was this patch tested?
Manual
Author: gatorsmile <gatorsmile@gmail.com>
Closes#15854 from gatorsmile/cteName.
## What changes were proposed in this pull request?
This PR aims to improve DataSource option keys to be more case-insensitive
DataSource partially use CaseInsensitiveMap in code-path. For example, the following fails to find url.
```scala
val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
df.write.format("jdbc")
.option("UrL", url1)
.option("dbtable", "TEST.SAVETEST")
.options(properties.asScala)
.save()
```
This PR makes DataSource options to use CaseInsensitiveMap internally and also makes DataSource to use CaseInsensitiveMap generally except `InMemoryFileIndex` and `InsertIntoHadoopFsRelationCommand`. We can not pass them CaseInsensitiveMap because they creates new case-sensitive HadoopConfs by calling newHadoopConfWithOptions(options) inside.
## How was this patch tested?
Pass the Jenkins test with newly added test cases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#15884 from dongjoon-hyun/SPARK-18433.
## What changes were proposed in this pull request?
it's weird that every session can set its own warehouse path at runtime, we should forbid it and make it a static conf.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#15825 from cloud-fan/warehouse.
## What changes were proposed in this pull request?
SPARK-18012 refactored the file write path in FileStreamSink using FileFormatWriter which always uses the default non-streaming QueryExecution to perform the writes. This is wrong for FileStreamSink, because the streaming QueryExecution (i.e. IncrementalExecution) should be used for correctly incrementalizing aggregation. The addition of watermarks in SPARK-18124, file stream sink should logically supports aggregation + watermark + append mode. But actually it fails with
```
16:23:07.389 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: Query query-0 terminated with error
java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark timestamp#7: timestamp, interval 10 seconds
+- LocalRelation [timestamp#7]
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
```
This PR fixes it by passing the correct query execution.
## How was this patch tested?
New unit test
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#15885 from tdas/SPARK-18440.
## What changes were proposed in this pull request?
It would be nice if memory sinks can also recover from checkpoints. For correctness reasons, the only time we should support it is in `Complete` OutputMode. We can support this in CompleteMode, because the output of the StateStore is already persisted in the checkpoint directory.
## How was this patch tested?
Unit test
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#15801 from brkyvz/mem-stream.