## What changes were proposed in this pull request?
In order to support nested predicate subquery, this PR introduce an internal join type ExistenceJoin, which will emit all the rows from left, plus an additional column, which presents there are any rows matched from right or not (it's not null-aware right now). This additional column could be used to replace the subquery in Filter.
In theory, all the predicate subquery could use this join type, but it's slower than LeftSemi and LeftAnti, so it's only used for nested subquery (subquery inside OR).
For example, the following SQL:
```sql
SELECT a FROM t WHERE EXISTS (select 0) OR EXISTS (select 1)
```
This PR also fix a bug in predicate subquery push down through join (they should not).
Nested null-aware subquery is still not supported. For example, `a > 3 OR b NOT IN (select bb from t)`
After this, we could run TPCDS query Q10, Q35, Q45
## How was this patch tested?
Added unit tests.
Author: Davies Liu <davies@databricks.com>
Closes#12820 from davies/or_exists.
## What changes were proposed in this pull request?
#12339 didn't fix the race condition. MemorySinkSuite is still flaky: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.2/814/testReport/junit/org.apache.spark.sql.streaming/MemorySinkSuite/registering_as_a_table/
Here is an execution order to reproduce it.
| Time |Thread 1 | MicroBatchThread |
|:-------------:|:-------------:|:-----:|
| 1 | | `MemorySink.getOffset` |
| 2 | | availableOffsets ++= newData (availableOffsets is not changed here) |
| 3 | addData(newData) | |
| 4 | Set `noNewData` to `false` in processAllAvailable | |
| 5 | | `dataAvailable` returns `false` |
| 6 | | noNewData = true |
| 7 | `noNewData` is true so just return | |
| 8 | assert results and fail | |
| 9 | | `dataAvailable` returns true so process the new batch |
This PR expands the scope of `awaitBatchLock.synchronized` to eliminate the above race.
## How was this patch tested?
test("stress test"). It always failed before this patch. And it will pass after applying this patch. Ignore this test in the PR as it takes several minutes to finish.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12582 from zsxwing/SPARK-14579-2.
## What changes were proposed in this pull request?
The existing implementation of pivot translates into a single aggregation with one aggregate per distinct pivot value. When the number of distinct pivot values is large (say 1000+) this can get extremely slow since each input value gets evaluated on every aggregate even though it only affects the value of one of them.
I'm proposing an alternate strategy for when there are 10+ (somewhat arbitrary threshold) distinct pivot values. We do two phases of aggregation. In the first we group by the grouping columns plus the pivot column and perform the specified aggregations (one or sometimes more). In the second aggregation we group by the grouping columns and use the new (non public) PivotFirst aggregate that rearranges the outputs of the first aggregation into an array indexed by the pivot value. Finally we do a project to extract the array entries into the appropriate output column.
## How was this patch tested?
Additional unit tests in DataFramePivotSuite and manual larger scale testing.
Author: Andrew Ray <ray.andrew@gmail.com>
Closes#11583 from aray/fast-pivot.
## What changes were proposed in this pull request?
NewAccumulator isn't the best name if we ever come up with v3 of the API.
## How was this patch tested?
Updated tests to reflect the change.
Author: Reynold Xin <rxin@databricks.com>
Closes#12827 from rxin/SPARK-15049.
## What changes were proposed in this pull request?
This PR adds the explanation and documentation for CSV options for reading and writing.
## How was this patch tested?
Style tests with `./dev/run_tests` for documentation style.
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Closes#12817 from HyukjinKwon/SPARK-13425.
## What changes were proposed in this pull request?
This is caused by https://github.com/apache/spark/pull/12776, which removes the `synchronized` from all methods in `AccumulatorContext`.
However, a test in `CachedTableSuite` synchronize on `AccumulatorContext` and expecting no one else can change it, which is not true anymore.
This PR update that test to not require to lock on `AccumulatorContext`.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12811 from cloud-fan/flaky.
1. Adds the following options for parsing NaNs: nanValue
2. Adds the following options for parsing infinity: positiveInf, negativeInf.
`TypeCast.castTo` is unit tested and an end-to-end test is added to `CSVSuite`
Author: Hossein <hossein@databricks.com>
Closes#11947 from falaki/SPARK-14143.
This PR contains three changes:
1. We will use spark.sql.warehouse.dir set warehouse location. We will not use hive.metastore.warehouse.dir.
2. SessionCatalog needs to set the location to default db. Otherwise, when creating a table in SparkSession without hive support, the default db's path will be an empty string.
3. When we create a database, we need to make the path qualified.
Existing tests and new tests
Author: Yin Huai <yhuai@databricks.com>
Closes#12812 from yhuai/warehouse.
## What changes were proposed in this pull request?
This patch removes some code that are no longer relevant -- mainly HiveSessionState.setDefaultOverrideConfs.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#12806 from rxin/SPARK-15028.
## What changes were proposed in this pull request?
This PR adds the support to specify custom date format for `DateType` and `TimestampType`.
For `TimestampType`, this uses the given format to infer schema and also to convert the values
For `DateType`, this uses the given format to convert the values.
If the `dateFormat` is not given, then it works with `DateTimeUtils.stringToTime()` for backwords compatibility.
When it's given, then it uses `SimpleDateFormat` for parsing data.
In addition, `IntegerType`, `DoubleType` and `LongType` have a higher priority than `TimestampType` in type inference. This means even if the given format is `yyyy` or `yyyy.MM`, it will be inferred as `IntegerType` or `DoubleType`. Since it is type inference, I think it is okay to give such precedences.
In addition, I renamed `csv.CSVInferSchema` to `csv.InferSchema` as JSON datasource has `json.InferSchema`. Although they have the same names, I did this because I thought the parent package name can still differentiate each. Accordingly, the suite name was also changed from `CSVInferSchemaSuite` to `InferSchemaSuite`.
## How was this patch tested?
unit tests are used and `./dev/run_tests` for coding style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11550 from HyukjinKwon/SPARK-13667.
## What changes were proposed in this pull request?
CatalystSqlParser can parse data types. So, we do not need to have an individual DataTypeParser.
## How was this patch tested?
Existing tests
Author: Yin Huai <yhuai@databricks.com>
Closes#12796 from yhuai/removeDataTypeParser.
## What changes were proposed in this pull request?
1. Remove all the `spark.setConf` etc. Just expose `spark.conf`
2. Make `spark.conf` take in things set in the core `SparkConf` as well, otherwise users may get confused
This was done for both the Python and Scala APIs.
## How was this patch tested?
`SQLConfSuite`, python tests.
This one fixes the failed tests in #12787Closes#12787
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12798 from yhuai/conf-api.
The previous subquery PRs did not include support for pushing subqueries used in filters (`WHERE`/`HAVING`) down. This PR adds this support. For example :
```scala
range(0, 10).registerTempTable("a")
range(5, 15).registerTempTable("b")
range(7, 25).registerTempTable("c")
range(3, 12).registerTempTable("d")
val plan = sql("select * from a join b on a.id = b.id left join c on c.id = b.id where a.id in (select id from d)")
plan.explain(true)
```
Leads to the following Analyzed & Optimized plans:
```
== Parsed Logical Plan ==
...
== Analyzed Logical Plan ==
id: bigint, id: bigint, id: bigint
Project [id#0L,id#4L,id#8L]
+- Filter predicate-subquery#16 [(id#0L = id#12L)]
: +- SubqueryAlias predicate-subquery#16 [(id#0L = id#12L)]
: +- Project [id#12L]
: +- SubqueryAlias d
: +- Range 3, 12, 1, 8, [id#12L]
+- Join LeftOuter, Some((id#8L = id#4L))
:- Join Inner, Some((id#0L = id#4L))
: :- SubqueryAlias a
: : +- Range 0, 10, 1, 8, [id#0L]
: +- SubqueryAlias b
: +- Range 5, 15, 1, 8, [id#4L]
+- SubqueryAlias c
+- Range 7, 25, 1, 8, [id#8L]
== Optimized Logical Plan ==
Join LeftOuter, Some((id#8L = id#4L))
:- Join Inner, Some((id#0L = id#4L))
: :- Join LeftSemi, Some((id#0L = id#12L))
: : :- Range 0, 10, 1, 8, [id#0L]
: : +- Range 3, 12, 1, 8, [id#12L]
: +- Range 5, 15, 1, 8, [id#4L]
+- Range 7, 25, 1, 8, [id#8L]
== Physical Plan ==
...
```
I have also taken the opportunity to move quite a bit of code around:
- Rewriting subqueris and pulling out correlated predicated from subqueries has been moved into the analyzer. The analyzer transforms `Exists` and `InSubQuery` into `PredicateSubquery` expressions. A PredicateSubquery exposes the 'join' expressions and the proper references. This makes things like type coercion, optimization and planning easier to do.
- I have added support for `Aggregate` plans in subqueries. Any correlated expressions will be added to the grouping expressions. I have removed support for `Union` plans, since pulling in an outer reference from beneath a Union has no value (a filtered value could easily be part of another Union child).
- Resolution of subqueries is now done using `OuterReference`s. These are used to wrap any outer reference; this makes the identification of these references easier, and also makes dealing with duplicate attributes in the outer and inner plans easier. The resolution of subqueries initially used a resolution loop which would alternate between calling the analyzer and trying to resolve the outer references. We now use a dedicated analyzer which uses a special rule for outer reference resolution.
These changes are a stepping stone for enabling correlated scalar subqueries, enabling all Hive tests & allowing us to use predicate subqueries anywhere.
Current tests and added test cases in FilterPushdownSuite.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12720 from hvanhovell/SPARK-14858.
## What changes were proposed in this pull request?
Addresses comments in #12765.
## How was this patch tested?
Python tests.
Author: Andrew Or <andrew@databricks.com>
Closes#12784 from andrewor14/python-followup.
## What changes were proposed in this pull request?
dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.
The function signature is:
dapply(df, function(localDF) {}, schema = NULL)
R function input: local data.frame from the partition on local node
R function output: local data.frame
Schema specifies the Row format of the resulting DataFrame. It must match the R function's output.
If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such resulting DataFrame can be processed by successive calls to dapply().
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>
Closes#12493 from sun-rui/SPARK-12919.
## What changes were proposed in this pull request?
Currently Spark SQL doesn't support sorting columns in descending order. However, the parser accepts the syntax and silently drops sorting directions. This PR fixes this by throwing an exception if `DESC` is specified as sorting direction of a sorting column.
## How was this patch tested?
A test case is added to test the invalid sorting order by checking exception message.
Author: Cheng Lian <lian@databricks.com>
Closes#12759 from liancheng/spark-14981.
## What changes were proposed in this pull request?
The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the python API.
## How was this patch tested?
Python tests.
Author: Andrew Or <andrew@databricks.com>
Closes#12765 from andrewor14/python-spark-session-more.
## What changes were proposed in this pull request?
This patch removes executionHive from HiveSessionState and HiveSharedState.
## How was this patch tested?
Updated test cases.
Author: Reynold Xin <rxin@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12770 from rxin/SPARK-14994.
## What changes were proposed in this pull request?
This PR adds support for easily running and benchmarking a set of common TPCDS queries locally in SparkSQL.
## How was this patch tested?
N/A
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12771 from sameeragarwal/tpcds-2.
#### What changes were proposed in this pull request?
Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
```SQL
SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2
==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
```
Note:
1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL.
2. This rule has to be done after de-duplicating the attributes; otherwise, the enerated
join conditions will be incorrect.
This PR also corrects the existing behavior in Spark. Before this PR, the behavior is like
```SQL
test("except") {
val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id")
val df_right = Seq(1, 3).toDF("id")
checkAnswer(
df_left.except(df_right),
Row(2) :: Row(2) :: Row(4) :: Nil
)
}
```
After this PR, the result is corrected. We strictly follow the SQL compliance of `Except Distinct`.
#### How was this patch tested?
Modified and added a few test cases to verify the optimization rule and the results of operators.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12736 from gatorsmile/exceptByAntiJoin.
## What changes were proposed in this pull request?
Minor typo fixes
## How was this patch tested?
local build
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12755 from zhengruifeng/fix_doc_dataset.
## What changes were proposed in this pull request?
This patch removes HiveNativeCommand, so we can continue to remove the dependency on Hive. This pull request also removes the ability to generate golden result file using Hive.
## How was this patch tested?
Updated tests to reflect this.
Author: Reynold Xin <rxin@databricks.com>
Closes#12769 from rxin/SPARK-14991.
## What changes were proposed in this pull request?
`AccumulatorContext` is not thread-safe, that's why all of its methods are synchronized. However, there is one exception: the `AccumulatorContext.originals`. `NewAccumulator` use it to check if it's registered, which is wrong as it's not synchronized.
This PR mark `AccumulatorContext.originals` as `private` and now all access to `AccumulatorContext` is synchronized.
## How was this patch tested?
I verified it locally. To be safe, we can let jenkins test it many times to make sure this problem is gone.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12773 from cloud-fan/debug.
## What changes were proposed in this pull request?
The FileCatalog object gets created even if the user specifies schema, which means files in the directory is enumerated even thought its not necessary. For large directories this is very slow. User would want to specify schema in such scenarios of large dirs, and this defeats the purpose quite a bit.
## How was this patch tested?
Hard to test this with unit test.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12748 from tdas/SPARK-14970.
## What changes were proposed in this pull request?
Currently we use `SQLUserDefinedType` annotation to register UDTs for user classes. However, by doing this, we add Spark dependency to user classes.
For some user classes, it is unnecessary to add such dependency that will increase deployment difficulty.
We should provide alternative approach to register UDTs for user classes without `SQLUserDefinedType` annotation.
## How was this patch tested?
`UserDefinedTypeSuite`
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#12259 from viirya/improve-sql-usertype.
## What changes were proposed in this pull request?
This PR introduces a new accumulator API which is much simpler than before:
1. the type hierarchy is simplified, now we only have an `Accumulator` class
2. Combine `initialValue` and `zeroValue` concepts into just one concept: `zeroValue`
3. there in only one `register` method, the accumulator registration and cleanup registration are combined.
4. the `id`,`name` and `countFailedValues` are combined into an `AccumulatorMetadata`, and is provided during registration.
`SQLMetric` is a good example to show the simplicity of this new API.
What we break:
1. no `setValue` anymore. In the new API, the intermedia type can be different from the result type, it's very hard to implement a general `setValue`
2. accumulator can't be serialized before registered.
Problems need to be addressed in follow-ups:
1. with this new API, `AccumulatorInfo` doesn't make a lot of sense, the partial output is not partial updates, we need to expose the intermediate value.
2. `ExceptionFailure` should not carry the accumulator updates. Why do users care about accumulator updates for failed cases? It looks like we only use this feature to update the internal metrics, how about we sending a heartbeat to update internal metrics after the failure event?
3. the public event `SparkListenerTaskEnd` carries a `TaskMetrics`. Ideally this `TaskMetrics` don't need to carry external accumulators, as the only method of `TaskMetrics` that can access external accumulators is `private[spark]`. However, `SQLListener` use it to retrieve sql metrics.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12612 from cloud-fan/acc.
## What changes were proposed in this pull request?
Currently, LongToUnsafeRowMap use byte array as the underlying page, which can't be larger 1G.
This PR improves LongToUnsafeRowMap to scale up to 8G bytes by using array of Long instead of array of byte.
## How was this patch tested?
Manually ran a test to confirm that both UnsafeHashedRelation and LongHashedRelation could build a map that larger than 2G.
Author: Davies Liu <davies@databricks.com>
Closes#12740 from davies/larger_broadcast.
## What changes were proposed in this pull request?
`interfaces.scala` was getting big. This just moves the biggest class in there to a new file for cleanliness.
## How was this patch tested?
Just moving things around.
Author: Andrew Or <andrew@databricks.com>
Closes#12721 from andrewor14/move-external-catalog.
Currently, we can only create persisted partitioned and/or bucketed data source tables using the Dataset API but not using SQL DDL. This PR implements the following syntax to add partitioning and bucketing support to the SQL DDL:
```
CREATE TABLE <table-name>
USING <provider> [OPTIONS (<key1> <value1>, <key2> <value2>, ...)]
[PARTITIONED BY (col1, col2, ...)]
[CLUSTERED BY (col1, col2, ...) [SORTED BY (col1, col2, ...)] INTO <n> BUCKETS]
AS SELECT ...
```
Test cases are added in `MetastoreDataSourcesSuite` to check the newly added syntax.
Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12734 from liancheng/spark-14954.
## What changes were proposed in this pull request?
This PR aims to implement decimal aggregation optimization for window queries by improving existing `DecimalAggregates`. Historically, `DecimalAggregates` optimizer is designed to transform general `sum/avg(decimal)`, but it breaks recently added windows queries like the followings. The following queries work well without the current `DecimalAggregates` optimizer.
**Sum**
```scala
scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").head
java.lang.RuntimeException: Unsupported window function: MakeDecimal((sum(UnscaledValue(a#31)),mode=Complete,isDistinct=false),12,1)
scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#23]
: +- INPUT
+- Window [MakeDecimal((sum(UnscaledValue(a#21)),mode=Complete,isDistinct=false),12,1) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#23]
+- Exchange SinglePartition, None
+- Generate explode([1.0,2.0]), false, false, [a#21]
+- Scan OneRowRelation[]
```
**Average**
```scala
scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").head
java.lang.RuntimeException: Unsupported window function: cast(((avg(UnscaledValue(a#40)),mode=Complete,isDistinct=false) / 10.0) as decimal(6,5))
scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#44]
: +- INPUT
+- Window [cast(((avg(UnscaledValue(a#42)),mode=Complete,isDistinct=false) / 10.0) as decimal(6,5)) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#44]
+- Exchange SinglePartition, None
+- Generate explode([1.0,2.0]), false, false, [a#42]
+- Scan OneRowRelation[]
```
After this PR, those queries work fine and new optimized physical plans look like the followings.
**Sum**
```scala
scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#35]
: +- INPUT
+- Window [MakeDecimal((sum(UnscaledValue(a#33)),mode=Complete,isDistinct=false) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),12,1) AS sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#35]
+- Exchange SinglePartition, None
+- Generate explode([1.0,2.0]), false, false, [a#33]
+- Scan OneRowRelation[]
```
**Average**
```scala
scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#47]
: +- INPUT
+- Window [cast(((avg(UnscaledValue(a#45)),mode=Complete,isDistinct=false) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) / 10.0) as decimal(6,5)) AS avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#47]
+- Exchange SinglePartition, None
+- Generate explode([1.0,2.0]), false, false, [a#45]
+- Scan OneRowRelation[]
```
In this PR, *SUM over window* pattern matching is based on the code of hvanhovell ; he should be credited for the work he did.
## How was this patch tested?
Pass the Jenkins tests (with newly added testcases)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12421 from dongjoon-hyun/SPARK-14664.
## What changes were proposed in this pull request?
The `Batch` class, which had been used to indicate progress in a stream, was abandoned by [[SPARK-13985][SQL] Deterministic batches with ids](caea152145) and then became useless.
This patch:
- removes the `Batch` class
- ~~does some related renaming~~ (update: this has been reverted)
- fixes some related comments
## How was this patch tested?
N/A
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12638 from lw-lin/remove-batch.
### What changes were proposed in this pull request?
Anti-Joins using BroadcastHashJoin's unique key code path are broken; it currently returns Semi Join results . This PR fixes this bug.
### How was this patch tested?
Added tests cases to `ExistenceJoinSuite`.
cc davies gatorsmile
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12730 from hvanhovell/SPARK-14950.
## What changes were proposed in this pull request?
This PR will make Spark SQL not allow ALTER TABLE ADD/REPLACE/CHANGE COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands.
## How was this patch tested?
Existing tests. For those tests that I put in the blacklist, I am adding the useful parts back to SQLQuerySuite.
Author: Yin Huai <yhuai@databricks.com>
Closes#12714 from yhuai/banNativeCommand.
## What changes were proposed in this pull request?
We currently expose both Hadoop configuration and Spark SQL configuration in RuntimeConfig. I think we can remove the Hadoop configuration part, and simply generate Hadoop Configuration on the fly by passing all the SQL configurations into it. This way, there is a single interface (in Java/Scala/Python/SQL) for end-users.
As part of this patch, I also removed some config options deprecated in Spark 1.x.
## How was this patch tested?
Updated relevant tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12689 from rxin/SPARK-14913.
## What changes were proposed in this pull request?
#12625 exposed a new user-facing conf interface in `SparkSession`. This patch adds a catalog interface.
## How was this patch tested?
See `CatalogSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#12713 from andrewor14/user-facing-catalog.
## What changes were proposed in this pull request?
This PR adds Native execution of SHOW COLUMNS and SHOW PARTITION commands.
Command Syntax:
``` SQL
SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database]
```
``` SQL
SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)]
```
## How was this patch tested?
Added test cases in HiveCommandSuite to verify execution and DDLCommandSuite
to verify plans.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#12222 from dilipbiswal/dkb_show_columns.
## What changes were proposed in this pull request?
While the vectorized hash map in `TungstenAggregate` is currently supported for all primitive data types during partial aggregation, this patch only enables the hash map for a subset of cases that've been verified to show performance improvements on our benchmarks subject to an internal conf that sets an upper limit on the maximum length of the aggregate key/value schema. This list of supported use-cases should be expanded over time.
## How was this patch tested?
This is no new change in functionality so existing tests should suffice. Performance tests were done on TPCDS benchmarks.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12710 from sameeragarwal/vectorized-enable.
## What changes were proposed in this pull request?
This PR update SortMergeJoinExec to support LeftSemi/LeftAnti, so it could support all the join types, same as other three join implementations: BroadcastHashJoinExec, ShuffledHashJoinExec,and BroadcastNestedLoopJoinExec.
This PR also simplify the join selection in SparkStrategy.
## How was this patch tested?
Added new tests.
Author: Davies Liu <davies@databricks.com>
Closes#12668 from davies/smj_semi.
## What changes were proposed in this pull request?
That patch mistakenly widened the visibility from `private[x]` to `protected[x]`. This patch reverts those changes.
Author: Andrew Or <andrew@databricks.com>
Closes#12686 from andrewor14/visibility.
## What changes were proposed in this pull request?
We currently have no way for users to propagate options to the underlying library that rely in Hadoop configurations to work. For example, there are various options in parquet-mr that users might want to set, but the data source API does not expose a per-job way to set it. This patch propagates the user-specified options also into Hadoop Configuration.
## How was this patch tested?
Used a mock data source implementation to test both the read path and the write path.
Author: Reynold Xin <rxin@databricks.com>
Closes#12688 from rxin/SPARK-14912.
#### What changes were proposed in this pull request?
The existing `Describe Function` only support the function name in `identifier`. This is different from what Hive behaves. That is why many test cases `udf_abc` in `HiveCompatibilitySuite` are not using our native DDL support. For example,
- udf_not.q
- udf_bitwise_not.q
This PR is to resolve the issues. Now, we can support the command of `Describe Function` whose function names are in the following format:
- `qualifiedName` (e.g., `db.func1`)
- `STRING` (e.g., `'func1'`)
- `comparisonOperator` (e.g,. `<`)
- `arithmeticOperator` (e.g., `+`)
- `predicateOperator` (e.g., `or`)
Note, before this PR, we only have a native command support when the function name is in the format of `qualifiedName`.
#### How was this patch tested?
Added test cases in `DDLSuite.scala`. Also manually verified all the related test cases in `HiveCompatibilitySuite` passed.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12679 from gatorsmile/descFunction.
## What changes were proposed in this pull request?
Minor typo fixes (too minor to deserve separate a JIRA)
## How was this patch tested?
local build
Author: Jacek Laskowski <jacek@japila.pl>
Closes#12469 from jaceklaskowski/minor-typo-fixes.
## What changes were proposed in this pull request?
Use Long.parseLong which returns a primative.
Use a series of appends() reduces the creation of an extra StringBuilder type
## How was this patch tested?
Unit tests
Author: Azeem Jiva <azeemj@gmail.com>
Closes#12520 from javawithjiva/minor.
## What changes were proposed in this pull request?
In Spark 2.0, `SparkSession` is the new thing. Internally we should stop using `SQLContext` everywhere since that's supposed to be not the main user-facing API anymore.
In this patch I took care to not break any public APIs. The one place that's suspect is `o.a.s.ml.source.libsvm.DefaultSource`, but according to mengxr it's not supposed to be public so it's OK to change the underlying `FileFormat` trait.
**Reviewers**: This is a big patch that may be difficult to review but the changes are actually really straightforward. If you prefer I can break it up into a few smaller patches, but it will delay the progress of this issue a little.
## How was this patch tested?
No change in functionality intended.
Author: Andrew Or <andrew@databricks.com>
Closes#12625 from andrewor14/spark-session-refactor.
## What changes were proposed in this pull request?
Minor followup to https://github.com/apache/spark/pull/12651
## How was this patch tested?
Test-only change
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12674 from sameeragarwal/tpcds-fix-2.
## What changes were proposed in this pull request?
`RuntimeConfig` is the new user-facing API in 2.0 added in #11378. Until now, however, it's been dead code. This patch uses `RuntimeConfig` in `SessionState` and exposes that through the `SparkSession`.
## How was this patch tested?
New test in `SQLContextSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#12669 from andrewor14/use-runtime-conf.
## What changes were proposed in this pull request?
This patch changes UnresolvedFunction and UnresolvedGenerator to use a FunctionIdentifier rather than just a String for function name. Also changed SessionCatalog to accept FunctionIdentifier in lookupFunction.
## How was this patch tested?
Updated related unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12659 from rxin/SPARK-14888.
## What changes were proposed in this pull request?
```
Spark context available as 'sc' (master = local[*], app id = local-1461283768192).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sql("SHOW TABLES").collect()
16/04/21 17:09:39 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/21 17:09:39 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
res0: Array[org.apache.spark.sql.Row] = Array([src,false])
scala> sql("SHOW TABLES").collect()
res1: Array[org.apache.spark.sql.Row] = Array([src,false])
scala> spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3)))
res2: org.apache.spark.sql.DataFrame = [_1: int, _2: int]
```
Hive things are loaded lazily.
## How was this patch tested?
Manual.
Author: Andrew Or <andrew@databricks.com>
Closes#12589 from andrewor14/spark-session-repl.
#### What changes were proposed in this pull request?
For performance, predicates can be pushed through Window if and only if the following conditions are satisfied:
1. All the expressions are part of window partitioning key. The expressions can be compound.
2. Deterministic
#### How was this patch tested?
TODO:
- [X] DSL needs to be modified for window
- [X] more tests will be added.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#11635 from gatorsmile/pushPredicateThroughWindow.
## What changes were proposed in this pull request?
This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class.
Note: A couple of things will break after this patch. These will be fixed separately.
- the python HiveContext
- all the documentation / comments referencing HiveContext
- there will be no more HiveContext in the REPL (fixed by #12589)
## How was this patch tested?
No change in functionality.
Author: Andrew Or <andrew@databricks.com>
Closes#12585 from andrewor14/delete-hive-context.
## What changes were proposed in this pull request?
This method was accidentally made `private[sql]` in Spark 2.0. This PR makes it public again, since 3rd party data sources like spark-avro depend on it.
## How was this patch tested?
N/A
Author: Cheng Lian <lian@databricks.com>
Closes#12652 from liancheng/spark-14875.
## What changes were proposed in this pull request?
This PR fixes a bug in `TungstenAggregate` that manifests while aggregating by keys over nullable `BigDecimal` columns. This causes a null pointer exception while executing TPCDS q14a.
## How was this patch tested?
1. Added regression test in `DataFrameAggregateSuite`.
2. Verified that TPCDS q14a works
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12651 from sameeragarwal/tpcds-fix.
## What changes were proposed in this pull request?
Right now, the data type field of a CatalogColumn is using the string representation. When we create this string from a DataType object, there are places where we use simpleString instead of catalogString. Although catalogString is the same as simpleString right now, it is still good to use catalogString. So, we will not silently introduce issues when we change the semantic of simpleString or the implementation of catalogString.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#12654 from yhuai/useCatalogString.
## What changes were proposed in this pull request?
Spark uses `NewLineAtEofChecker` rule in Scala by ScalaStyle. And, most Java code also comply with the rule. This PR aims to enforce the same rule `NewlineAtEndOfFile` by CheckStyle explicitly. Also, this fixes lint-java errors since SPARK-14465. The followings are the items.
- Adds a new line at the end of the files (19 files)
- Fixes 25 lint-java errors (12 RedundantModifier, 6 **ArrayTypeStyle**, 2 LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder)
## How was this patch tested?
After the Jenkins test succeeds, `dev/lint-java` should pass. (Currently, Jenkins dose not run lint-java.)
```bash
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12632 from dongjoon-hyun/SPARK-14868.
## What changes were proposed in this pull request?
This patch changes SparkSession to be case insensitive by default, in order to match other database systems.
## How was this patch tested?
N/A - I'm sure some tests will fail and I will need to fix those.
Author: Reynold Xin <rxin@databricks.com>
Closes#12643 from rxin/SPARK-14876.
!< means not less than which is equivalent to >=
!> means not greater than which is equivalent to <=
I'd to create a PR to support these two operators.
I've added new test cases in: DataFrameSuite, ExpressionParserSuite, JDBCSuite, PlanParserSuite, SQLQuerySuite
dilipbiswal viirya gatorsmile
Author: jliwork <jiali@us.ibm.com>
Closes#12316 from jliwork/SPARK-14548.
#### What changes were proposed in this pull request?
So far, we are capturing each unsupported Alter Table in separate visit functions. They should be unified and issue the same ParseException instead.
This PR is to refactor the existing implementation and make error message consistent for Alter Table DDL.
#### How was this patch tested?
Updated the existing test cases and also added new test cases to ensure all the unsupported statements are covered.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12459 from gatorsmile/cleanAlterTable.
## What changes were proposed in this pull request?
CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect are not Hive-specific. So, this PR moves them from sql/hive to sql/core. Also, I am adding `Command` suffix to these two classes.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#12645 from yhuai/moveCreateDataSource.
## What changes were proposed in this pull request?
Current StreamTest allows testing of a streaming Dataset generated explicitly wraps a source. This is different from the actual production code path where the source object is dynamically created through a DataSource object every time a query is started. So all the fault-tolerance testing in FileSourceSuite and FileSourceStressSuite is not really testing the actual code path as they are just reusing the FileStreamSource object.
This PR fixes StreamTest and the FileSource***Suite to test this correctly. Instead of maintaining a mapping of source --> expected offset in StreamTest (which requires reuse of source object), it now maintains a mapping of source index --> offset, so that it is independent of the source object.
Summary of changes
- StreamTest refactored to keep track of offset by source index instead of source
- AddData, AddTextData and AddParquetData updated to find the FileStreamSource object from an active query, so that it can work with sources generated when query is started.
- Refactored unit tests in FileSource***Suite to test using DataFrame/Dataset generated with public, rather than reusing the same FileStreamSource. This correctly tests fault tolerance.
The refactoring changed a lot of indents in FileSourceSuite, so its recommended to hide whitespace changes with this - https://github.com/apache/spark/pull/12592/files?w=1
## How was this patch tested?
Refactored unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12592 from tdas/SPARK-14833.
## What changes were proposed in this pull request?
We have logical plans that produce domain objects which are `ObjectType`. As we can't estimate the size of `ObjectType`, we throw an `UnsupportedOperationException` if trying to do that. We should set a default size for `ObjectType` to avoid this failure.
## How was this patch tested?
`DatasetSuite`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#12599 from viirya/skip-broadcast-objectproducer.
## What changes were proposed in this pull request?
There was a typo in the message for second assertion in "returning batch for wide table" test
## How was this patch tested?
Existing tests.
Author: tedyu <yuzhihong@gmail.com>
Closes#12639 from tedyu/master.
## What changes were proposed in this pull request?
This patch improves error handling in view creation. CreateViewCommand itself will analyze the view SQL query first, and if it cannot successfully analyze it, throw an AnalysisException.
In addition, I also added the following two conservative guards for easier identification of Spark bugs:
1. If there is a bug and the generated view SQL cannot be analyzed, throw an exception at runtime. Note that this is not an AnalysisException because it is not caused by the user and more likely indicate a bug in Spark.
2. SQLBuilder when it gets an unresolved plan, it will also show the plan in the error message.
I also took the chance to simplify the internal implementation of CreateViewCommand, and *removed* a fallback path that would've masked an exception from before.
## How was this patch tested?
1. Added a unit test for the user facing error handling.
2. Manually introduced some bugs in Spark to test the internal defensive error handling.
3. Also added a test case to test nested views (not super relevant).
Author: Reynold Xin <rxin@databricks.com>
Closes#12633 from rxin/SPARK-14865.
## What changes were proposed in this pull request?
In order to support running SQL directly on files, we added some code in ResolveRelations to catch the exception thrown by catalog.lookupRelation and ignore it. This unfortunately masks all the exceptions. This patch changes the logic to simply test the table's existence.
## How was this patch tested?
I manually hacked some bugs into Spark and made sure the exceptions were being propagated up.
Author: Reynold Xin <rxin@databricks.com>
Closes#12634 from rxin/SPARK-14869.
## What changes were proposed in this pull request?
This patch restructures sql.execution.command package to break the commands into multiple files, in some logical organization: databases, tables, views, functions.
I also renamed basicOperators.scala to basicLogicalOperators.scala and basicPhysicalOperators.scala.
## How was this patch tested?
N/A - all I did was moving code around.
Author: Reynold Xin <rxin@databricks.com>
Closes#12636 from rxin/SPARK-14872.
## What changes were proposed in this pull request?
del unused imports in ML/MLLIB
## How was this patch tested?
unit tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12497 from zhengruifeng/del_unused_imports.
## What changes were proposed in this pull request?
Currently, the Parquet reader decide whether to return batch based on required schema or full schema, it's not consistent, this PR fix that.
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#12619 from davies/fix_return_batch.
## What changes were proposed in this pull request?
This patch re-implements view creation command in sql/core, based on the pre-existing view creation command in the Hive module. This consolidates the view creation logical command and physical command into a single one, called CreateViewCommand.
## How was this patch tested?
All the code should've been tested by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12615 from rxin/SPARK-14842-2.
## What changes were proposed in this pull request?
This patch adds "Exec" suffix to all physical operators. Before this patch, Spark's physical operators and logical operators are named the same (e.g. Project could be logical.Project or execution.Project), which caused small issues in code review and bigger issues in code refactoring.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#12617 from rxin/exec-node.
## What changes were proposed in this pull request?
When creating a file stream using sqlContext.write.stream(), existing files are scanned twice for finding the schema
- Once, when creating a DataSource + StreamingRelation in the DataFrameReader.stream()
- Again, when creating streaming Source from the DataSource, in DataSource.createSource()
Instead, the schema should be generated only once, at the time of creating the dataframe, and when the streaming source is created, it should just reuse that schema
The solution proposed in this PR is to add a lazy field in DataSource that caches the schema. Then streaming Source created by the DataSource can just reuse the schema.
## How was this patch tested?
Refactored unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12591 from tdas/SPARK-14832.
## What changes were proposed in this pull request?
This PR try to increase the parallelism for small table (a few of big files) to reduce the query time, by decrease the maxSplitBytes, the goal is to have at least one task per CPU in the cluster, if the total size of all files is bigger than openCostInBytes * 2 * nCPU.
For example, a small/medium table could be used as dimension table in huge query, this will be useful to reduce the time waiting for broadcast.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#12344 from davies/more_partition.
## What changes were proposed in this pull request?
Currently, `OptimizeIn` optimizer replaces `In` expression into `InSet` expression if the size of set is greater than a constant, 10.
This issue aims to make a configuration `spark.sql.optimizer.inSetConversionThreshold` for that.
After this PR, `OptimizerIn` is configurable.
```scala
scala> sql("select a in (1,2,3) from (select explode(array(1,2)) a) T").explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [a#7 IN (1,2,3) AS (a IN (1, 2, 3))#8]
: +- INPUT
+- Generate explode([1,2]), false, false, [a#7]
+- Scan OneRowRelation[]
scala> sqlContext.setConf("spark.sql.optimizer.inSetConversionThreshold", "2")
scala> sql("select a in (1,2,3) from (select explode(array(1,2)) a) T").explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [a#16 INSET (1,2,3) AS (a IN (1, 2, 3))#17]
: +- INPUT
+- Generate explode([1,2]), false, false, [a#16]
+- Scan OneRowRelation[]
```
## How was this patch tested?
Pass the Jenkins tests (with a new testcase)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12562 from dongjoon-hyun/SPARK-14796.
## What changes were proposed in this pull request?
1. Fix the "spill size" of TungstenAggregate and Sort
2. Rename "data size" to "peak memory" to match the actual meaning (also consistent with task metrics)
3. Added "data size" for ShuffleExchange and BroadcastExchange
4. Added some timing for Sort, Aggregate and BroadcastExchange (this requires another patch to work)
## How was this patch tested?
Existing tests.
![metrics](https://cloud.githubusercontent.com/assets/40902/14573908/21ad2f00-030d-11e6-9e2c-c544f30039ea.png)
Author: Davies Liu <davies@databricks.com>
Closes#12425 from davies/fix_metrics.
## What changes were proposed in this pull request?
SparkPlan.prepare() could be called in different threads (BroadcastExchange will call it in a thread pool), it only make sure that doPrepare() will only be called once, the second call to prepare() may return earlier before all the children had finished prepare(). Then some operator may call doProduce() before prepareSubqueries(), `null` will be used as the result of subquery, which is wrong. This cause TPCDS Q23B returns wrong answer sometimes.
This PR added synchronization for prepare(), make sure all the children had finished prepare() before return. Also call prepare() in produce() (similar to execute()).
Added checking for ScalarSubquery to make sure that the subquery has finished before using the result.
## How was this patch tested?
Manually tested with Q23B, no wrong answer anymore.
Author: Davies Liu <davies@databricks.com>
Closes#12600 from davies/fix_risk.
## What changes were proposed in this pull request?
Currently, a column could be resolved wrongly if there are columns from both outer table and subquery have the same name, we should only resolve the attributes that can't be resolved within subquery. They may have same exprId than other attributes in subquery, so we should create alias for them.
Also, the column in IN subquery could have same exprId, we should create alias for them.
## How was this patch tested?
Added regression tests. Manually tests TPCDS Q70 and Q95, work well after this patch.
Author: Davies Liu <davies@databricks.com>
Closes#12539 from davies/fix_subquery.
## What changes were proposed in this pull request?
This patch moves SQLBuilder into sql/core so we can in the future move view generation also into sql/core.
## How was this patch tested?
Also moved unit tests.
Author: Reynold Xin <rxin@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12602 from rxin/SPARK-14841.
## What changes were proposed in this pull request?
In Python, the `option` and `options` method of `DataFrameReader` and `DataFrameWriter` were sending the string "None" instead of `null` when passed `None`, therefore making it impossible to send an actual `null`. This fixes that problem.
This is based on #11305 from mathieulongtin.
## How was this patch tested?
Added test to readwriter.py.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: mathieu longtin <mathieu.longtin@nuance.com>
Closes#12494 from viirya/py-df-none-option.
## What changes were proposed in this pull request?
Change test to compare sets rather than sequence
## How was this patch tested?
Full test runs on little endian and big endian platforms
Author: Pete Robbins <robbinspg@gmail.com>
Closes#12610 from robbinspg/DatasetSuiteFix.
## What changes were proposed in this pull request?
Implement some `hashCode` and `equals` together in order to enable the scalastyle.
This is a first batch, I will continue to implement them but I wanted to know your thoughts.
Author: Joan <joan@goyeau.com>
Closes#12157 from joan38/SPARK-6429-HashCode-Equals.
## What changes were proposed in this pull request?
Add the native support for LOAD DATA DDL command that loads data into Hive table/partition.
## How was this patch tested?
`HiveDDLCommandSuite` and `HiveQuerySuite`. Besides, few Hive tests (`WindowQuerySuite`, `HiveTableScanSuite` and `HiveSerDeSuite`) also use `LOAD DATA` command.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#12412 from viirya/ddl-load-data.
## What changes were proposed in this pull request?
This patch removes HiveQueryExecution. As part of this, I consolidated all the describe commands into DescribeTableCommand.
## How was this patch tested?
Should be covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12588 from rxin/SPARK-14826.
(This PR is a rebased version of PR #12153.)
## What changes were proposed in this pull request?
This PR adds preliminary locality support for `FileFormat` data sources by overriding `FileScanRDD.preferredLocations()`. The strategy can be divided into two parts:
1. Block location lookup
Unlike `HadoopRDD` or `NewHadoopRDD`, `FileScanRDD` doesn't have access to the underlying `InputFormat` or `InputSplit`, and thus can't rely on `InputSplit.getLocations()` to gather locality information. Instead, this PR queries block locations using `FileSystem.getBlockLocations()` after listing all `FileStatus`es in `HDFSFileCatalog` and convert all `FileStatus`es into `LocatedFileStatus`es.
Note that although S3/S3A/S3N file systems don't provide valid locality information, their `getLocatedStatus()` implementations don't actually issue remote calls either. So there's no need to special case these file systems.
2. Selecting preferred locations
For each `FilePartition`, we pick up top 3 locations that containing the most data to be retrieved. This isn't necessarily the best algorithm out there. Further improvements may be brought up in follow-up PRs.
## How was this patch tested?
Tested by overriding default `FileSystem` implementation for `file:///` with a mocked one, which returns mocked block locations.
Author: Cheng Lian <lian@databricks.com>
Closes#12527 from liancheng/spark-14369-locality-rebased.
## What changes were proposed in this pull request?
This PR adds support for all primitive datatypes, decimal types and stringtypes in the VectorizedHashmap during aggregation.
## How was this patch tested?
Existing tests for group-by aggregates should already test for all these datatypes. Additionally, manually inspected the generated code for all supported datatypes (details below).
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12440 from sameeragarwal/all-datatypes.
## What changes were proposed in this pull request?
This patch moves analyze table parsing into SparkSqlAstBuilder and removes HiveSqlAstBuilder.
In order to avoid extensive refactoring, I created a common trait for CatalogRelation and MetastoreRelation, and match on that. In the future we should probably just consolidate the two into a single thing so we don't need this common trait.
## How was this patch tested?
Updated unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12584 from rxin/SPARK-14821.
## What changes were proposed in this pull request?
Spark currently uses TimSort for all in-memory sorts, including sorts done for shuffle. One low-hanging fruit is to use radix sort when possible (e.g. sorting by integer keys). This PR adds a radix sort implementation to the unsafe sort package and switches shuffles and sorts to use it when possible.
The current implementation does not have special support for null values, so we cannot radix-sort `LongType`. I will address this in a follow-up PR.
## How was this patch tested?
Unit tests, enabling radix sort on existing tests. Microbenchmark results:
```
Running benchmark: radix sort 25000000
Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Linux 3.13.0-44-generic
Intel(R) Core(TM) i7-4600U CPU 2.10GHz
radix sort 25000000: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
reference TimSort key prefix array 15546 / 15859 1.6 621.9 1.0X
reference Arrays.sort 2416 / 2446 10.3 96.6 6.4X
radix sort one byte 133 / 137 188.4 5.3 117.2X
radix sort two bytes 255 / 258 98.2 10.2 61.1X
radix sort eight bytes 991 / 997 25.2 39.6 15.7X
radix sort key prefix array 1540 / 1563 16.2 61.6 10.1X
```
I also ran a mix of the supported TPCDS queries and compared TimSort vs RadixSort metrics. The overall benchmark ran ~10% faster with radix sort on. In the breakdown below, the radix-enabled sort phases averaged about 20x faster than TimSort, however sorting is only a small fraction of the overall runtime. About half of the TPCDS queries were able to take advantage of radix sort.
```
TPCDS on master: 2499s real time, 8185s executor
- 1171s in TimSort, avg 267 MB/s
(note the /s accounting is weird here since dataSize counts the record sizes too)
TPCDS with radix enabled: 2294s real time, 7391s executor
- 596s in TimSort, avg 254 MB/s
- 26s in radix sort, avg 4.2 GB/s
```
cc davies rxin
Author: Eric Liang <ekl@databricks.com>
Closes#12490 from ericl/sort-benchmark.
## What changes were proposed in this pull request?
We recently made `ColumnarBatch.row` mutable and added a new `ColumnVector.putDecimal` method to support putting `Decimal` values in the `ColumnarBatch`. This unfortunately introduced a bug wherein we were not updating the vector with the proper unscaled values.
## How was this patch tested?
This codepath is hit only when the vectorized aggregate hashmap is enabled. https://github.com/apache/spark/pull/12440 makes sure that a number of regression tests/benchmarks test this bugfix.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12541 from sameeragarwal/fix-bigdecimal.
## What changes were proposed in this pull request?
This patch moves native command and script transformation into SparkSqlAstBuilder. This builds on #12561. See the last commit for diff.
## How was this patch tested?
Updated test cases to reflect this.
Author: Reynold Xin <rxin@databricks.com>
Closes#12564 from rxin/SPARK-14798.
## What changes were proposed in this pull request?
After removing most of `HiveContext` in 8fc267ab33 we can now move existing functionality in `SQLContext` to `SparkSession`. As of this PR `SQLContext` becomes a simple wrapper that has a `SparkSession` and delegates all functionality to it.
## How was this patch tested?
Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes#12553 from andrewor14/implement-spark-session.
## What changes were proposed in this pull request?
As we moved most parsing rules to `SparkSqlParser`, some tests expected to throw exception are not correct anymore.
## How was this patch tested?
`DDLCommandSuite`
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#12572 from viirya/hotfix-ddl.
## What changes were proposed in this pull request?
the `Accumulable.internal` flag is only used to avoid registering internal accumulators for 2 certain cases:
1. `TaskMetrics.createTempShuffleReadMetrics`: the accumulators in the temp shuffle read metrics should not be registered.
2. `TaskMetrics.fromAccumulatorUpdates`: the created task metrics is only used to post event, accumulators inside it should not be registered.
For 1, we can create a `TempShuffleReadMetrics` that don't create accumulators, just keep the data and merge it at last.
For 2, we can un-register these accumulators immediately.
TODO: remove `internal` flag in `AccumulableInfo` with followup PR
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12525 from cloud-fan/acc.
## What changes were proposed in this pull request?
This patch moves as many parsing rules as possible into SQL parser. There are only three more left after this patch: (1) run native command, (2) analyze, and (3) script IO. These 3 will be dealt with in a follow-up PR.
## How was this patch tested?
No test change. This simply moves code around.
Author: Reynold Xin <rxin@databricks.com>
Closes#12556 from rxin/SPARK-14792.
## What changes were proposed in this pull request?
The patch removes HiveConf dependency from HiveSqlAstBuilder. This is required in order to merge HiveSqlParser and SparkSqlAstBuilder, which would require getting rid of the Hive specific dependencies in HiveSqlParser.
This patch also accomplishes [SPARK-14778] Remove HiveSessionState.substitutor.
## How was this patch tested?
This should be covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12550 from rxin/SPARK-14782.
## What changes were proposed in this pull request?
In order to fully merge the Hive parser and the SQL parser, we'd need to support variable substitution in Spark. The implementation of the substitute algorithm is mostly copied from Hive, but I simplified the overall structure quite a bit and added more comprehensive test coverage.
Note that this pull request does not yet use this functionality anywhere.
## How was this patch tested?
Added VariableSubstitutionSuite for unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12538 from rxin/SPARK-14769.
## What changes were proposed in this pull request?
3 testcases namely,
```
"count is partially aggregated"
"count distinct is partially aggregated"
"mixed aggregates are partially aggregated"
```
were failing when running PlannerSuite individually.
The PR provides a fix for this.
## How was this patch tested?
unit tests
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: Subhobrata Dey <sbcd90@gmail.com>
Closes#12532 from sbcd90/plannersuitetestsfix.
## What changes were proposed in this pull request?
This PR adds a special log for FileStreamSink for two purposes:
- Versioning. A future Spark version should be able to read the metadata of an old FileStreamSink.
- Compaction. As reading from many small files is usually pretty slow, we should compact small metadata files into big files.
FileStreamSinkLog has a new log format instead of Java serialization format. It will write one log file for each batch. The first line of the log file is the version number, and there are multiple JSON lines following. Each JSON line is a JSON format of FileLog.
FileStreamSinkLog will compact log files every "spark.sql.sink.file.log.compactLen" batches into a big file. When doing a compact, it will read all history logs and merge them with the new batch. During the compaction, it will also delete the files that are deleted (marked by FileLog.action). When the reader uses allLogs to list all files, this method only returns the visible files (drops the deleted files).
## How was this patch tested?
FileStreamSinkLogSuite
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12435 from zsxwing/sink-log.
## What changes were proposed in this pull request?
This PR has two main changes.
1. Move Hive-specific methods from HiveContext to HiveSessionState, which help the work of removing HiveContext.
2. Create a SparkSession Class, which will later be the entry point of Spark SQL users.
## How was this patch tested?
Existing tests
This PR is trying to fix test failures of https://github.com/apache/spark/pull/12485.
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12522 from yhuai/spark-session.
## What changes were proposed in this pull request?
Consider the following directory structure
dir/col=X/some-files
If we create a text format streaming dataframe on `dir/col=X/` then it should not consider as partitioning in columns. Even though the streaming dataframe does not do so, the generated batch dataframes pick up col as a partitioning columns, causing mismatch streaming source schema and generated df schema. This leads to runtime failure:
```
18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: Query query-0 terminated with error
java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8
```
The reason is that the partition inferring code has no idea of a base path, above which it should not search of partitions. This PR makes sure that the batch DF is generated with the basePath set as the original path on which the file stream source is defined.
## How was this patch tested?
New unit test
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12517 from tdas/SPARK-14741.
## What changes were proposed in this pull request?
This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
- ContinuousQuery
- Trigger
- ProcessingTime
in pyspark under `pyspark.sql.streaming`.
In addition, it contains the new methods added under:
- `DataFrameWriter`
a) `startStream`
b) `trigger`
c) `queryName`
- `DataFrameReader`
a) `stream`
- `DataFrame`
a) `isStreaming`
This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
- `exception`
- `sourceStatuses`
- `sinkStatus`
They may be added in a follow up.
This PR also contains some very minor doc fixes in the Scala side.
## How was this patch tested?
Python doc tests
TODO:
- [ ] verify Python docs look good
Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>
Closes#12320 from brkyvz/stream-python.
## What changes were proposed in this pull request?
- replaced `FileSystem.get(conf)` calls with `path.getFileSystem(conf)`
## How was this patch tested?
N/A
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12450 from lw-lin/fix-fs-get.
`MutableProjection` is not thread-safe and we won't use it in multiple threads. I think the reason that we return `() => MutableProjection` is not about thread safety, but to save the costs of generating code when we need same but individual mutable projections.
However, I only found one place that use this [feature](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala#L122-L123), and comparing to the troubles it brings, I think we should generate `MutableProjection` directly instead of return a function.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#7373 from cloud-fan/project.
## What changes were proposed in this pull request?
Before this PR, we create accumulators at driver side(and register them) and send them to executor side, then we create `TaskMetrics` with these accumulators at executor side.
After this PR, we will create `TaskMetrics` at driver side and send it to executor side, so that we can create accumulators inside `TaskMetrics` directly, which is cleaner.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12472 from cloud-fan/acc.
## What changes were proposed in this pull request?
Change SubquerySuite to validate test results utilizing checkAnswer helper method
## How was this patch tested?
Existing tests
Author: Luciano Resende <lresende@apache.org>
Closes#12269 from lresende/SPARK-13419.
## What changes were proposed in this pull request?
Enable ScalaReflection and User Defined Types for plain Scala classes.
This involves the move of `schemaFor` from `ScalaReflection` trait (which is Runtime and Compile time (macros) reflection) to the `ScalaReflection` object (runtime reflection only) as I believe this code wouldn't work at compile time anyway as it manipulates `Class`'s that are not compiled yet.
## How was this patch tested?
Unit test
Author: Joan <joan@goyeau.com>
Closes#12149 from joan38/SPARK-13929-Scala-reflection.
## What changes were proposed in this pull request?
This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package.
Note that to avoid conflicts, this PR is based on #12153. Effective changes for this PR only consist of the last three commits. Will rebase after merging #12153.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Closes#12361 from liancheng/spark-14407-hide-hadoop-fs-relation.
### What changes were proposed in this pull request?
This PR adds support for in/exists predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms:
- `[NOT] EXISTS(subquery)`
- `[NOT] IN (subquery)`
This PR is (loosely) based on the work of davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/9055). They should be credited for the work they did.
### How was this patch tested?
Modified parsing unit tests.
Added tests to `org.apache.spark.sql.SQLQuerySuite`
cc rxin, davies & chenghao-intel
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12306 from hvanhovell/SPARK-4226.
## What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/12067, we now use expressions to do the aggregation in `TypedAggregateExpression`. To implement buffer merge, we produce a new buffer deserializer expression by replacing `AttributeReference` with right-side buffer attribute, like other `DeclarativeAggregate`s do, and finally combine the left and right buffer deserializer with `Invoke`.
However, after https://github.com/apache/spark/pull/12338, we will add loop variable to class members when codegen `MapObjects`. If the `Aggregator` buffer type is `Seq`, which is implemented by `MapObjects` expression, we will add the same loop variable to class members twice(by left and right buffer deserializer), which cause the `ClassFormatError`.
This PR fixes this issue by calling `distinct` before declare the class menbers.
## How was this patch tested?
new regression test in `DatasetAggregatorSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12468 from cloud-fan/bug.
When `Await.result` throws an exception which originated from a different thread, the resulting stacktrace doesn't include the path leading to the `Await.result` call itself, making it difficult to identify the impact of these exceptions. For example, I've seen cases where broadcast cleaning errors propagate to the main thread and crash it but the resulting stacktrace doesn't include any of the main thread's code, making it difficult to pinpoint which exception crashed that thread.
This patch addresses this issue by explicitly catching, wrapping, and re-throwing exceptions that are thrown by `Await.result`.
I tested this manually using 16b31c8251, a patch which reproduces an issue where an RPC exception which occurs while unpersisting RDDs manages to crash the main thread without any useful stacktrace, and verified that informative, full stacktraces were generated after applying the fix in this PR.
/cc rxin nongli yhuai anabranch
Author: Josh Rosen <joshrosen@databricks.com>
Closes#12433 from JoshRosen/wrap-and-rethrow-await-exceptions.
## What changes were proposed in this pull request?
This PR tries to separate the serialization and deserialization logic from object operators, so that it's easier to eliminate unnecessary serializations in optimizer.
Typed aggregate related operators are special, they will deserialize the input row to multiple objects and it's difficult to simply use a deserializer operator to abstract it, so we still mix the deserialization logic there.
## How was this patch tested?
existing tests and new test in `EliminateSerializationSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12260 from cloud-fan/encoder.
## What changes were proposed in this pull request?
These test suites were removed while refactoring `HadoopFsRelation` related API. This PR brings them back.
This PR also fixes two regressions:
- SPARK-14458, which causes runtime error when saving partitioned tables using `FileFormat` data sources that are not able to infer their own schemata. This bug wasn't detected by any built-in data sources because all of them happen to have schema inference feature.
- SPARK-14566, which happens to be covered by SPARK-14458 and causes wrong query result or runtime error when
- appending a Dataset `ds` to a persisted partitioned data source relation `t`, and
- partition columns in `ds` don't all appear after data columns
## How was this patch tested?
`CommitFailureTestRelationSuite` uses a testing relation that always fails when committing write tasks to test write job cleanup.
`SimpleTextHadoopFsRelationSuite` uses a testing relation to test general `HadoopFsRelation` and `FileFormat` interfaces.
The two regressions are both covered by existing test cases.
Author: Cheng Lian <lian@databricks.com>
Closes#12179 from liancheng/spark-13681-commit-failure-test.
## What changes were proposed in this pull request?
We currently disable codegen for `CaseWhen` if the number of branches is greater than 20 (in CaseWhen.MAX_NUM_CASES_FOR_CODEGEN). It would be better if this value is a non-public config defined in SQLConf.
## How was this patch tested?
Pass the Jenkins tests (including a new testcase `Support spark.sql.codegen.maxCaseBranches option`)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12353 from dongjoon-hyun/SPARK-14577.
## What changes were proposed in this pull request?
This is roughly based on the input metrics logic in `SqlNewHadoopRDD`
## How was this patch tested?
Not sure how to write a test, I manually verified it in Spark UI.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12352 from cloud-fan/metrics.
## What changes were proposed in this pull request?
Per rxin's suggestions, this patch renames `upstreams()` to `inputRDDs()` in `WholeStageCodegen` for better implied semantics
## How was this patch tested?
N/A
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12486 from sameeragarwal/codegen-cleanup.
## What changes were proposed in this pull request?
The `doGenCode` method currently takes in an `ExprCode`, mutates it and returns the java code to evaluate the given expression. It should instead just return a new `ExprCode` to avoid passing around mutable objects during code generation.
## How was this patch tested?
Existing Tests
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12483 from sameeragarwal/new-exprcode-2.
## What changes were proposed in this pull request?
The sort shuffle manager has been the default since Spark 1.2. It is time to remove the old hash shuffle manager.
## How was this patch tested?
Removed some tests related to the old manager.
Author: Reynold Xin <rxin@databricks.com>
Closes#12423 from rxin/SPARK-14667.
## What changes were proposed in this pull request?
Per rxin's suggestions, this patch renames `s/gen/genCode` and `s/genCode/doGenCode` to better reflect the semantics of these 2 function calls.
## How was this patch tested?
N/A (refactoring only)
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12475 from sameeragarwal/gencode.
## What changes were proposed in this pull request?
This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#12463 from yhuai/sharedState.
## What changes were proposed in this pull request?
There are many operations that are currently not supported in the streaming execution. For example:
- joining two streams
- unioning a stream and a batch source
- sorting
- window functions (not time windows)
- distinct aggregates
Furthermore, executing a query with a stream source as a batch query should also fail.
This patch add an additional step after analysis in the QueryExecution which will check that all the operations in the analyzed logical plan is supported or not.
## How was this patch tested?
unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12246 from tdas/SPARK-14473.
## What changes were proposed in this pull request?
This PR aims to add `bound` function (aka Banker's round) by extending current `round` implementation. [Hive supports `bround` since 1.3.0.](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF)
**Hive (1.3 ~ 2.0)**
```
hive> select round(2.5), bround(2.5);
OK
3.0 2.0
```
**After this PR**
```scala
scala> sql("select round(2.5), bround(2.5)").head
res0: org.apache.spark.sql.Row = [3,2]
```
## How was this patch tested?
Pass the Jenkins tests (with extended tests).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12376 from dongjoon-hyun/SPARK-14614.
## What changes were proposed in this pull request?
We currently only have implicit encoders for scala primitive types. We should also add implicit encoders for boxed primitives. Otherwise, the following code would not have an encoder:
```scala
sqlContext.range(1000).map { i => i }
```
## How was this patch tested?
Added a unit test case for this.
Author: Reynold Xin <rxin@databricks.com>
Closes#12466 from rxin/SPARK-14696.
## What changes were proposed in this pull request?
set the input encoder for `TypedColumn` in `RelationalGroupedDataset.agg`.
## How was this patch tested?
new tests in `DatasetAggregatorSuite`
close https://github.com/apache/spark/pull/11269
This PR brings https://github.com/apache/spark/pull/12359 up to date and fix the compile.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12451 from cloud-fan/agg.
## What changes were proposed in this pull request?
The patch fixes the issue with the randomSplit method which is not able to split dataframes which has maps in schema. The bug was introduced in spark 1.6.1.
## How was this patch tested?
Tested with unit tests.
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: Subhobrata Dey <sbcd90@gmail.com>
Closes#12438 from sbcd90/randomSplitIssue.
## What changes were proposed in this pull request?
This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future.
## How was this patch tested?
Existing tests.
Closes#12405
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12447 from yhuai/sharedState.
## What changes were proposed in this pull request?
This is a follow-up to make the max iteration number an internal config.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#12441 from rxin/maxIterConfInternal.
## What changes were proposed in this pull request?
This PR removes
- Inappropriate type notations
For example, from
```scala
words.foreachRDD { (rdd: RDD[String], time: Time) =>
...
```
to
```scala
words.foreachRDD { (rdd, time) =>
...
```
- Extra anonymous closure within functional transformations.
For example,
```scala
.map(item => {
...
})
```
which can be just simply as below:
```scala
.map { item =>
...
}
```
and corrects some obvious style nits.
## How was this patch tested?
This was tested after adding rules in `scalastyle-config.xml`, which ended up with not finding all perfectly.
The rules applied were below:
- For the first correction,
```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
<parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
</check>
```
```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
<parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
</check>
```
- For the second correction
```xml
<check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
<parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
</check>
```
**Those rules were not added**
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12413 from HyukjinKwon/SPARK-style.
## What changes were proposed in this pull request?
set the input encoder for `TypedColumn` in `RelationalGroupedDataset.agg`.
## How was this patch tested?
new tests in `DatasetAggregatorSuite`
close https://github.com/apache/spark/pull/11269
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12359 from cloud-fan/agg.
## What changes were proposed in this pull request?
We currently hard code the max number of optimizer/analyzer iterations to 100. This patch makes it configurable. While I'm at it, I also added the SessionCatalog to the optimizer, so we can use information there in optimization.
## How was this patch tested?
Updated unit tests to reflect the change.
Author: Reynold Xin <rxin@databricks.com>
Closes#12434 from rxin/SPARK-14677.
## What changes were proposed in this pull request?
This PR moves `CurrentDatabase` from sql/hive package to sql/catalyst. It also adds the function description, which looks like the following.
```
scala> sqlContext.sql("describe function extended current_database").collect.foreach(println)
[Function: current_database]
[Class: org.apache.spark.sql.execution.command.CurrentDatabase]
[Usage: current_database() - Returns the current database.]
[Extended Usage:
> SELECT current_database()]
```
## How was this patch tested?
Existing tests
Author: Yin Huai <yhuai@databricks.com>
Closes#12424 from yhuai/SPARK-14668.
## What changes were proposed in this pull request?
This PR uses a better hashing algorithm while probing the AggregateHashMap:
```java
long h = 0
h = (h ^ (0x9e3779b9)) + key_1 + (h << 6) + (h >>> 2);
h = (h ^ (0x9e3779b9)) + key_2 + (h << 6) + (h >>> 2);
h = (h ^ (0x9e3779b9)) + key_3 + (h << 6) + (h >>> 2);
...
h = (h ^ (0x9e3779b9)) + key_n + (h << 6) + (h >>> 2);
return h
```
Depends on: https://github.com/apache/spark/pull/12345
## How was this patch tested?
Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
Aggregate w keys: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
codegen = F 2417 / 2457 8.7 115.2 1.0X
codegen = T hashmap = F 1554 / 1581 13.5 74.1 1.6X
codegen = T hashmap = T 877 / 929 23.9 41.8 2.8X
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12379 from sameeragarwal/hash.
## What changes were proposed in this pull request?
`ExpressionEncoder` is just a container for serialization and deserialization expressions, we can use these expressions to build `TypedAggregateExpression` directly, so that it can fit in `DeclarativeAggregate`, which is more efficient.
One trick is, for each buffer serializer expression, it will reference to the result object of serialization and function call. To avoid re-calculating this result object, we can serialize the buffer object to a single struct field, so that we can use a special `Expression` to only evaluate result object once.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12067 from cloud-fan/typed_udaf.
## What changes were proposed in this pull request?
This patch speeds up group-by aggregates by around 3-5x by leveraging an in-memory `AggregateHashMap` (please see https://github.com/apache/spark/pull/12161), an append-only aggregate hash map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates (and fall back to the `BytesToBytesMap` if a given key isn't found).
Architecturally, it is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the key-value pairs. The index lookups in the array rely on linear probing (with a small number of maximum tries) and use an inexpensive hash function which makes it really efficient for a majority of lookups. However, using linear probing and an inexpensive hash function also makes it less robust as compared to the `BytesToBytesMap` (especially for a large number of keys or even for certain distribution of keys) and requires us to fall back on the latter for correctness.
## How was this patch tested?
Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
Aggregate w keys: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
codegen = F 2124 / 2204 9.9 101.3 1.0X
codegen = T hashmap = F 1198 / 1364 17.5 57.1 1.8X
codegen = T hashmap = T 369 / 600 56.8 17.6 5.8X
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12345 from sameeragarwal/tungsten-aggregate-integration.
## What changes were proposed in this pull request?
Removing references to assembly jar in documentation.
Adding an additional (previously undocumented) usage of spark-submit to run examples.
## How was this patch tested?
Ran spark-submit usage to ensure formatting was fine. Ran examples using SparkSubmit.
Author: Mark Grover <mark@apache.org>
Closes#12365 from markgrover/spark-14601.
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14592
This patch adds native support for DDL command `CREATE TABLE LIKE`.
The SQL syntax is like:
CREATE TABLE table_name LIKE existing_table
CREATE TABLE IF NOT EXISTS table_name LIKE existing_table
## How was this patch tested?
`HiveDDLCommandSuite`. `HiveQuerySuite` already tests `CREATE TABLE LIKE`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
This patch had conflicts when merged, resolved by
Committer: Andrew Or <andrew@databricks.com>
Closes#12362 from viirya/create-table-like.
## What changes were proposed in this pull request?
When there are multiple attempts for a stage, we currently only reset internal accumulator values if all the tasks are resubmitted. It would make more sense to reset the accumulator values for each stage attempt. This will allow us to eventually get rid of the internal flag in the Accumulator class. This is part of my bigger effort to simplify accumulators and task metrics.
## How was this patch tested?
Covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12378 from rxin/SPARK-14619.
## What changes were proposed in this pull request?
Currently many public abstract methods (in abstract classes as well as traits) don't declare return types explicitly, such as in [o.a.s.streaming.dstream.InputDStream](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/InputDStream.scala#L110):
```scala
def start() // should be: def start(): Unit
def stop() // should be: def stop(): Unit
```
These methods exist in core, sql, streaming; this PR fixes them.
## How was this patch tested?
N/A
## Which piece of scala style rule led to the changes?
the rule was added separately in https://github.com/apache/spark/pull/12396
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12389 from lw-lin/public-abstract-methods.
#### What changes were proposed in this pull request?
This PR is to provide a native DDL support for the following three Alter View commands:
Based on the Hive DDL document:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
##### 1. ALTER VIEW RENAME
**Syntax:**
```SQL
ALTER VIEW view_name RENAME TO new_view_name
```
- to change the name of a view to a different name
- not allowed to rename a view's name by ALTER TABLE
##### 2. ALTER VIEW SET TBLPROPERTIES
**Syntax:**
```SQL
ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment);
```
- to add metadata to a view
- not allowed to set views' properties by ALTER TABLE
- ignore it if trying to set a view's existing property key when the value is the same
- overwrite the value if trying to set a view's existing key to a different value
##### 3. ALTER VIEW UNSET TBLPROPERTIES
**Syntax:**
```SQL
ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
```
- to remove metadata from a view
- not allowed to unset views' properties by ALTER TABLE
- issue an exception if trying to unset a view's non-existent key
#### How was this patch tested?
Added test cases to verify if it works properly.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12324 from gatorsmile/alterView.
## What changes were proposed in this pull request?
This PR removes extra anonymous closure within functional transformations.
For example,
```scala
.map(item => {
...
})
```
which can be just simply as below:
```scala
.map { item =>
...
}
```
## How was this patch tested?
Related unit tests and `sbt scalastyle`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12382 from HyukjinKwon/minor-extra-closers.
## What changes were proposed in this pull request?
Old `HadoopFsRelation` API includes `buildInternalScan()` which uses `SqlNewHadoopRDD` in `ParquetRelation`.
Because now the old API is removed, `SqlNewHadoopRDD` is not used anymore.
So, this PR removes `SqlNewHadoopRDD` and several unused imports.
This was discussed in https://github.com/apache/spark/pull/12326.
## How was this patch tested?
Several related existing unit tests and `sbt scalastyle`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12354 from HyukjinKwon/SPARK-14596.
## What changes were proposed in this pull request?
When prune the partitions or push down predicates, case-sensitivity is not respected. In order to make it work with case-insensitive, this PR update the AttributeReference inside predicate to use the name from schema.
## How was this patch tested?
Add regression tests for case-insensitive.
Author: Davies Liu <davies@databricks.com>
Closes#12371 from davies/case_insensi.
## What changes were proposed in this pull request?
This patch implements the `CREATE TABLE` command using the `SessionCatalog`. Previously we handled only `CTAS` and `CREATE TABLE ... USING`. This requires us to refactor `CatalogTable` to accept various fields (e.g. bucket and skew columns) and pass them to Hive.
WIP: Note that I haven't verified whether this actually works yet! But I believe it does.
## How was this patch tested?
Tests will come in a future commit.
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12271 from andrewor14/create-table-ddl.
## What changes were proposed in this pull request?
It looks several recent commits for datasources (maybe while removing old `HadoopFsRelation` interface) missed removing some unused imports.
This PR removes some unused imports in datasources.
## How was this patch tested?
`sbt scalastyle` and some unit tests for them.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12326 from HyukjinKwon/minor-imports.
## What changes were proposed in this pull request?
There is a race condition in `StreamExecution.processAllAvailable`. Here is an execution order to reproduce it.
| Time |Thread 1 | MicroBatchThread |
|:-------------:|:-------------:|:-----:|
| 1 | | `dataAvailable in constructNextBatch` returns false |
| 2 | addData(newData) | |
| 3 | `noNewData = false` in processAllAvailable | |
| 4 | | noNewData = true |
| 5 | `noNewData` is true so just return | |
The root cause is that `checking dataAvailable and change noNewData to true` is not atomic. This PR puts these two actions into `synchronized` to make sure they are atomic.
In addition, this PR also has the following changes:
- Make `committedOffsets` and `availableOffsets` volatile to make sure they can be seen in other threads.
- Copy the reference of `availableOffsets` to a local variable so that `sourceStatuses` can use a snapshot of `availableOffsets`.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12339 from zsxwing/race-condition.
## What changes were proposed in this pull request?
The wide schema, the expression of fields will be splitted into multiple functions, but the variable for loopVar can't be accessed in splitted functions, this PR change them as class member.
## How was this patch tested?
Added regression test.
Author: Davies Liu <davies@databricks.com>
Closes#12338 from davies/nested_row.
## What changes were proposed in this pull request?
This PR improve the performance of SQL UI by:
1) remove the details column in all executions page (the first page in SQL tab). We can check the details by enter the execution page.
2) break-all is super slow in Chrome recently, so switch to break-word.
3) Using "display: none" to hide a block.
4) using one js closure for for all the executions, not one for each.
5) remove the height limitation of details, don't need to scroll it in the tiny window.
## How was this patch tested?
Exists tests.
![ui](https://cloud.githubusercontent.com/assets/40902/14445712/68d7b258-0004-11e6-9b48-5d329b05d165.png)
Author: Davies Liu <davies@databricks.com>
Closes#12311 from davies/ui_perf.
## What changes were proposed in this pull request?
Before we are using `AnalysisException`, `ParseException`, `NoSuchFunctionException` etc when a parsing error encounters. I am trying to make it consistent and also **minimum** code impact to the current implementation by changing the class hierarchy.
1. `NoSuchItemException` is removed, since it is an abstract class and it just simply takes a message string.
2. `NoSuchDatabaseException`, `NoSuchTableException`, `NoSuchPartitionException` and `NoSuchFunctionException` now extends `AnalysisException`, as well as `ParseException`, they are all under `AnalysisException` umbrella, but you can also determine how to use them in a granular way.
## How was this patch tested?
The existing test cases should cover this patch.
Author: bomeng <bmeng@us.ibm.com>
Closes#12314 from bomeng/SPARK-14414.
## What changes were proposed in this pull request?
- `StateStoreConf.**max**DeltasForSnapshot` was renamed to `StateStoreConf.**min**DeltasForSnapshot`
- some state switch checks were added
- improved consistency between method names and string literals
- other comments & typo fix
## How was this patch tested?
N/A
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12323 from lw-lin/streaming-state-clean-up.