## What changes were proposed in this pull request?
If there are many branches in a CaseWhen expression, the generated code could go above the 64K limit for single java method, will fail to compile. This PR change it to fallback to interpret mode if there are more than 20 branches.
This PR is based on #11243 and #11221, thanks to joehalliwell
Closes#11243Closes#11221
## How was this patch tested?
Add a test with 50 branches.
Author: Davies Liu <davies@databricks.com>
Closes#11592 from davies/fix_when.
## What changes were proposed in this pull request?
Analysis exception occurs while running the following query.
```
SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints`
```
```
Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`ints`' given input columns: [a, `ints`]; line 1 pos 7
'Project ['ints]
+- Generate explode(a#0.b), true, false, Some(a), [`ints`#8]
+- SubqueryAlias nestedarray
+- LocalRelation [a#0], [[[[1,2,3]]]]
```
## How was this patch tested?
Added new unit tests in SQLQuerySuite and HiveQlSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#11538 from dilipbiswal/SPARK-13698.
## What changes were proposed in this pull request?
In order to make `docs/examples` (and other related code) more simple/readable/user-friendly, this PR replaces existing codes like the followings by using `diamond` operator.
```
- final ArrayList<Product2<Object, Object>> dataToWrite =
- new ArrayList<Product2<Object, Object>>();
+ final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
```
Java 7 or higher supports **diamond** operator which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark Java code use mixed usage of this.
## How was this patch tested?
Manual.
Pass the existing tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11541 from dongjoon-hyun/SPARK-13702.
## What changes were proposed in this pull request?
`ScalaReflection.mirror` method should be synchronized when scala version is `2.10` because `universe.runtimeMirror` is not thread safe.
## How was this patch tested?
I added a test to check thread safety of `ScalaRefection.mirror` method in `ScalaReflectionSuite`, which will throw the following Exception in Scala `2.10` without this patch:
```
[info] - thread safety of mirror *** FAILED *** (49 milliseconds)
[info] java.lang.UnsupportedOperationException: tail of empty list
[info] at scala.collection.immutable.Nil$.tail(List.scala:339)
[info] at scala.collection.immutable.Nil$.tail(List.scala:334)
[info] at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172)
[info] at scala.reflect.internal.Symbols$Symbol.unsafeTypeParams(Symbols.scala:1477)
[info] at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2777)
[info] at scala.reflect.internal.Mirrors$RootsBase.init(Mirrors.scala:235)
[info] at scala.reflect.runtime.JavaMirrors$class.createMirror(JavaMirrors.scala:34)
[info] at scala.reflect.runtime.JavaMirrors$class.runtimeMirror(JavaMirrors.scala:61)
[info] at scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
[info] at scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
[info] at org.apache.spark.sql.catalyst.ScalaReflection$.mirror(ScalaReflection.scala:36)
[info] at org.apache.spark.sql.catalyst.ScalaReflectionSuite$$anonfun$12$$anonfun$apply$mcV$sp$1$$anonfun$apply$1$$anonfun$apply$2.apply(ScalaReflectionSuite.scala:256)
[info] at org.apache.spark.sql.catalyst.ScalaReflectionSuite$$anonfun$12$$anonfun$apply$mcV$sp$1$$anonfun$apply$1$$anonfun$apply$2.apply(ScalaReflectionSuite.scala:252)
[info] at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
[info] at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
[info] at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
[info] at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[info] at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[info] at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[info] at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
```
Notice that the test will pass when Scala version is `2.11`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#11487 from ueshin/issues/SPARK-13640.
## What changes were proposed in this pull request?
This issue fixes the following potential bugs and Java coding style detected by Coverity and Checkstyle.
- Implement both null and type checking in equals functions.
- Fix wrong type casting logic in SimpleJavaBean2.equals.
- Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
- Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
- Fix coding style: Add '{}' to single `for` statement in mllib examples.
- Remove unused imports in `ColumnarBatch` and `JavaKinesisStreamSuite`.
- Remove unused fields in `ChunkFetchIntegrationSuite`.
- Add `stop()` to prevent resource leak.
Please note that the last two checkstyle errors exist on newly added commits after [SPARK-13583](https://issues.apache.org/jira/browse/SPARK-13583).
## How was this patch tested?
manual via `./dev/lint-java` and Coverity site.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11530 from dongjoon-hyun/SPARK-13692.
This PR replaces #9925 which had issues with CI. **Please see the original PR for any previous discussions.**
## What changes were proposed in this pull request?
Deprecate the SparkSQL column operator !== and use =!= as an alternative.
Fixes subtle issues related to operator precedence (basically, !== does not have the same priority as its logical negation, ===).
## How was this patch tested?
All currently existing tests.
Author: Jakob Odersky <jodersky@gmail.com>
Closes#11588 from jodersky/SPARK-7286.
## Motivation
CSV data source was contributed by Databricks. It is the inlined version of https://github.com/databricks/spark-csv. The data source name was `com.databricks.spark.csv`. As a result there are many tables created on older versions of spark with that name as the source. For backwards compatibility we should keep the old name.
## Proposed changes
`com.databricks.spark.csv` was added to list of `backwardCompatibilityMap` in `ResolvedDataSource.scala`
## Tests
A unit test was added to `CSVSuite` to parse a csv file using the old name.
Author: Hossein <hossein@databricks.com>
Closes#11589 from falaki/SPARK-13754.
## What changes were proposed in this pull request?
This PR fix the sizeInBytes of HadoopFsRelation.
## How was this patch tested?
Added regression test for that.
Author: Davies Liu <davies@databricks.com>
Closes#11590 from davies/fix_sizeInBytes.
When generating Graphviz DOT files in the SQL query visualization we need to escape double-quotes inside node labels. This is a followup to #11309, which fixed a similar graph in Spark Core's DAG visualization.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11587 from JoshRosen/graphviz-escaping.
## What changes were proposed in this pull request?
If a filter predicate or a join condition consists of `IsNotNull` checks, we should reorder these checks such that these non-nullability checks are evaluated before the rest of the predicates.
For e.g., if a filter predicate is of the form `a > 5 && isNotNull(b)`, we should rewrite this as `isNotNull(b) && a > 5` during physical plan generation.
## How was this patch tested?
new unit tests that verify the physical plan for both filters and joins in `ReorderedPredicateSuite`
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11511 from sameeragarwal/reorder-isnotnull.
Follow-up to #11509, that simply refactors the interface that we use when resolving a pluggable `DataSource`.
- Multiple functions share the same set of arguments so we make this a case class, called `DataSource`. Actual resolution is now done by calling a function on this class.
- Instead of having multiple methods named `apply` (some of which do writing some of which do reading) we now explicitly have `resolveRelation()` and `write(mode, df)`.
- Get rid of `Array[String]` since this is an internal API and was forcing us to awkwardly call `toArray` in a bunch of places.
Author: Michael Armbrust <michael@databricks.com>
Closes#11572 from marmbrus/dataSourceResolution.
## What changes were proposed in this pull request?
This removes the remaining deprecated Octal escape literals. The followings are the warnings on those two lines.
```
LiteralExpressionSuite.scala:99: Octal escape literals are deprecated, use \u0000 instead.
HiveQlSuite.scala:74: Octal escape literals are deprecated, use \u002c instead.
```
## How was this patch tested?
Manual.
During building, there should be no warning on `Octal escape literals`.
```
mvn -DskipTests clean install
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11584 from dongjoon-hyun/SPARK-13400.
## What changes were proposed in this pull request?
This PR add SQL generation support for aggregate with multi-distinct, by simply moving the `DistinctAggregationRewriter` rule to optimizer.
More discussions are needed as this breaks an import contract: analyzed plan should be able to run without optimization. However, the `ComputeCurrentTime` rule has kind of broken it already, and I think maybe we should add a new phase for this kind of rules, because strictly speaking they don't belong to analysis and is coupled with the physical plan implementation.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11579 from cloud-fan/distinct.
## What changes were proposed in this pull request?
In order to avoid StackOverflow when parse a expression with hundreds of ORs, we should use loop instead of recursive functions to flatten the tree as list. This PR also build a balanced tree to reduce the depth of generated And/Or expression, to avoid StackOverflow in analyzer/optimizer.
## How was this patch tested?
Add new unit tests. Manually tested with TPCDS Q3 with hundreds predicates in it [1]. These predicates help to reduce the number of partitions, then the query time went from 60 seconds to 8 seconds.
[1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql
Author: Davies Liu <davies@databricks.com>
Closes#11501 from davies/long_or.
## What changes were proposed in this pull request?
The code in `Expand.apply` can be simplified by existing information:
* the `groupByExprs` parameter are all `Attribute`s
* the `child` parameter is a `Project` that append aliased group by expressions to its child's output
## How was this patch tested?
by existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11485 from cloud-fan/expand.
## What changes were proposed in this pull request?
This PR change the way how we generate the code for the output variables passing from a plan to it's parent.
Right now, they are generated before call consume() of it's parent. It's not efficient, if the parent is a Filter or Join, which could filter out most the rows, the time to access some of the columns that are not used by the Filter or Join are wasted.
This PR try to improve this by defering the access of columns until they are actually used by a plan. After this PR, a plan does not need to generate code to evaluate the variables for output, just passing the ExprCode to its parent by `consume()`. In `parent.consumeChild()`, it will check the output from child and `usedInputs`, generate the code for those columns that is part of `usedInputs` before calling `doConsume()`.
This PR also change the `if` from
```
if (cond) {
xxx
}
```
to
```
if (!cond) continue;
xxx
```
The new one could help to reduce the nested indents for multiple levels of Filter and BroadcastHashJoin.
It also added some comments for operators.
## How was the this patch tested?
Unit tests. Manually ran TPCDS Q55, this PR improve the performance about 30% (scale=10, from 2.56s to 1.96s)
Author: Davies Liu <davies@databricks.com>
Closes#11274 from davies/gen_defer.
## What changes were proposed in this pull request?
When we add more DDL parsing logic in the future, SparkQl will become very big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to parse alter table commands. However, these parser objects will need to access some helper methods that exist in CatalystQl. The proposal is to move those methods to an isolated ParserUtils object.
This is based on viirya's changes in #11048. It prefaces the bigger fix for SPARK-13139 to make the diff of that patch smaller.
## How was this patch tested?
No change in functionality, so just Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes#11529 from andrewor14/parser-utils.
## What changes were proposed in this pull request?
Adding the hive-cli classes to the classloader
## How was this patch tested?
The hive Versionssuite tests were run
This is my original work and I license the work to the project under the project's open source license.
Author: Tim Preece <tim.preece.in.oz@gmail.com>
Closes#11495 from preecet/master.
`HadoopFsRelation` is used for reading most files into Spark SQL. However today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data. As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency. This PR is a first cut at separating this into several components / interfaces that are each described below. Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, svmlib) have been ported to the new API `FileFormat`. External libraries, such as spark-avro will also need to be ported to work with Spark 2.0.
### HadoopFsRelation
A simple `case class` that acts as a container for all of the metadata required to read from a datasource. All discovery, resolution and merging logic for schemas and partitions has been removed. This an internal representation that no longer needs to be exposed to developers.
```scala
case class HadoopFsRelation(
sqlContext: SQLContext,
location: FileCatalog,
partitionSchema: StructType,
dataSchema: StructType,
bucketSpec: Option[BucketSpec],
fileFormat: FileFormat,
options: Map[String, String]) extends BaseRelation
```
### FileFormat
The primary interface that will be implemented by each different format including external libraries. Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`. A format can optionally return a schema that is inferred from a set of files.
```scala
trait FileFormat {
def inferSchema(
sqlContext: SQLContext,
options: Map[String, String],
files: Seq[FileStatus]): Option[StructType]
def prepareWrite(
sqlContext: SQLContext,
job: Job,
options: Map[String, String],
dataSchema: StructType): OutputWriterFactory
def buildInternalScan(
sqlContext: SQLContext,
dataSchema: StructType,
requiredColumns: Array[String],
filters: Array[Filter],
bucketSet: Option[BitSet],
inputFiles: Array[FileStatus],
broadcastedConf: Broadcast[SerializableConfiguration],
options: Map[String, String]): RDD[InternalRow]
}
```
The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner). Additionally, scans are still returning `RDD`s instead of iterators for single files. In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file.
### FileCatalog
This interface is used to list the files that make up a given relation, as well as handle directory based partitioning.
```scala
trait FileCatalog {
def paths: Seq[Path]
def partitionSpec(schema: Option[StructType]): PartitionSpec
def allFiles(): Seq[FileStatus]
def getStatus(path: Path): Array[FileStatus]
def refresh(): Unit
}
```
Currently there are two implementations:
- `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`. Infers partitioning by recursive listing and caches this data for performance
- `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore.
### ResolvedDataSource
Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore):
- `paths: Seq[String] = Nil`
- `userSpecifiedSchema: Option[StructType] = None`
- `partitionColumns: Array[String] = Array.empty`
- `bucketSpec: Option[BucketSpec] = None`
- `provider: String`
- `options: Map[String, String]`
This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file based ones). All reconciliation of partitions, buckets, schema from metastores or inference is done here.
### DataSourceAnalysis / DataSourceStrategy
Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including:
- pruning the files from partitions that will be read based on filters.
- appending partition columns*
- applying additional filters when a data source can not evaluate them internally.
- constructing an RDD that is bucketed correctly when required*
- sanity checking schema match-up and other analysis when writing.
*In the future we should do that following:
- Break out file handling into its own Strategy as its sufficiently complex / isolated.
- Push the appending of partition columns down in to `FileFormat` to avoid an extra copy / unvectorization.
- Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2`
Author: Michael Armbrust <michael@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11509 from marmbrus/fileDataSource.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13442
This PR adds the support for inferring `BooleanType` for schema.
It supports to infer case-insensitive `true` / `false` as `BooleanType`.
Unittests were added for `CSVInferSchemaSuite` and `CSVSuite` for end-to-end test.
## How was the this patch tested?
This was tested with unittests and with `dev/run_tests` for coding style
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11315 from HyukjinKwon/SPARK-13442.
#### What changes were proposed in this pull request?
Non-deterministic predicates should not be pushed through Generate.
#### How was this patch tested?
Added a test case in `FilterPushdownSuite.scala`
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11562 from gatorsmile/pushPredicateDownWindow.
## What changes were proposed in this pull request?
This PR adds an optimizer rule to eliminate reading (unnecessary) NULL values if they are not required for correctness by inserting `isNotNull` filters is the query plan. These filters are currently inserted beneath existing `Filter` and `Join` operators and are inferred based on their data constraints.
Note: While this optimization is applicable to all types of join, it primarily benefits `Inner` and `LeftSemi` joins.
## How was this patch tested?
1. Added a new `NullFilteringSuite` that tests for `IsNotNull` filters in the query plan for joins and filters. Also, tests interaction with the `CombineFilters` optimizer rules.
2. Test generated ExpressionTrees via `OrcFilterSuite`
3. Test filter source pushdown logic via `SimpleTextHadoopFsRelationSuite`
cc yhuai nongli
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11372 from sameeragarwal/gen-isnotnull.
## What changes were proposed in this pull request?
It's weird that expressions don't always have all the expressions in it. This PR marks `QueryPlan.expressions` final to forbid sub classes overriding it to exclude some expressions. Currently only `Generate` override it, we can use `producedAttributes` to fix the unresolved attribute problem for it.
Note that this PR doesn't fix the problem in #11497
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11532 from cloud-fan/generate.
## What changes were proposed in this pull request?
```
Seq(("id1", "value1")).toDF("key", "value").registerTempTable("src")
sqlContext.sql("SELECT t1.* FROM src LATERAL VIEW explode(map('key1', 100, 'key2', 200)) t1 AS key, value")
```
Results in following logical plan
```
Project [key#2,value#3]
+- Generate explode(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFMap(key1,100,key2,200)), true, false, Some(genoutput), [key#2,value#3]
+- SubqueryAlias src
+- Project [_1#0 AS key#2,_2#1 AS value#3]
+- LocalRelation [_1#0,_2#1], [[id1,value1]]
```
The above query fails with following runtime error.
```
java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:221)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:42)
at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:98)
at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:96)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
<stack-trace omitted.....>
```
In this case the generated outputs are wrongly resolved from its child (LocalRelation) due to
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L537-L548
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Added unit tests in hive/SQLQuerySuite and AnalysisSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#11497 from dilipbiswal/spark-13651.
## What changes were proposed in this pull request?
Today we have `analysis.Catalog` and `catalog.Catalog`. In the future the former will call the latter. When that happens, if both of them are still called `Catalog` it will be very confusing. This patch renames the latter `ExternalCatalog` because it is expected to talk to external systems.
## How was this patch tested?
Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes#11526 from andrewor14/rename-catalog.
This reverts commit f87ce0504e.
According to discussion in #11466, let's revert PR #11466 for safe.
Author: Cheng Lian <lian@databricks.com>
Closes#11539 from liancheng/revert-pr-11466.
#### What changes were proposed in this pull request?
This PR is for supporting SQL generation for cube, rollup and grouping sets.
For example, a query using rollup:
```SQL
SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP
```
Original logical plan:
```
Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46],
[(count(1),mode=Complete,isDistinct=false) AS cnt#43L,
(key#17L % cast(5 as bigint))#47L AS _c1#45L,
grouping__id#46 AS _c2#44]
+- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0),
List(key#17L, value#18, null, 1)],
[key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46]
+- Project [key#17L,
value#18,
(key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L]
+- Subquery t1
+- Relation[key#17L,value#18] ParquetRelation
```
Converted SQL:
```SQL
SELECT count( 1) AS `cnt`,
(`t1`.`key` % CAST(5 AS BIGINT)),
grouping_id() AS `_c2`
FROM `default`.`t1`
GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())
```
#### How was the this patch tested?
Added eight test cases in `LogicalPlanToSQLSuite`.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#11283 from gatorsmile/groupingSetsToSQL.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
Currently, the parquet reader returns rows one by one which is bad for performance. This patch
updates the reader to directly return ColumnarBatches. This is only enabled with whole stage
codegen, which is the only operator currently that is able to consume ColumnarBatches (instead
of rows). The current implementation is a bit of a hack to get this to work and we should do
more refactoring of these low level interfaces to make this work better.
## How was this patch tested?
```
Results:
TPCDS: Best/Avg Time(ms) Rate(M/s) Per Row(ns)
---------------------------------------------------------------------------------
q55 (before) 8897 / 9265 12.9 77.2
q55 5486 / 5753 21.0 47.6
```
Author: Nong Li <nong@databricks.com>
Closes#11435 from nongli/spark-13255.
## What changes were proposed in this pull request?
This patch simply moves things to existing package `o.a.s.sql.catalyst.parser` in an effort to reduce the size of the diff in #11048. This is conceptually the same as a recently merged patch #11482.
## How was this patch tested?
Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes#11506 from andrewor14/parser-package.
Earlier fix did not copy the bytes and it is possible for higher level to reuse Text object. This was causing issues. Proposed fix now copies the bytes from Text. This still avoids the expensive encoding/decoding
Author: Rajesh Balamohan <rbalamohan@apache.org>
Closes#11477 from rajeshbalamohan/SPARK-12925.2.
## What changes were proposed in this pull request?
This is support SQL generation for subquery expressions, which will be replaced to a SubqueryHolder inside SQLBuilder recursively.
## How was this patch tested?
Added unit tests.
Author: Davies Liu <davies@databricks.com>
Closes#11453 from davies/sql_subquery.
## What changes were proposed in this pull request?
A test suite added for the bug fix -SPARK 12941; for the mapping of the StringType to corresponding in Oracle
## How was this patch tested?
manual tests done
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: thomastechs <thomas.sebastian@tcs.com>
Author: THOMAS SEBASTIAN <thomas.sebastian@tcs.com>
Closes#11489 from thomastechs/thomastechs-12941-master-new.
## What changes were proposed in this pull request?
Fix race conditions when cleanup files.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#11507 from davies/flaky.
## What changes were proposed in this pull request?
This PR support visualization for subquery in SQL web UI, also improve the explain of subquery, especially when it's used together with whole stage codegen.
For example:
```python
>>> sqlContext.range(100).registerTempTable("range")
>>> sqlContext.sql("select id / (select sum(id) from range) from range where id > (select id from range limit 1)").explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias(('id / subquery#9), None)]
: +- 'SubqueryAlias subquery#9
: +- 'Project [unresolvedalias('sum('id), None)]
: +- 'UnresolvedRelation `range`, None
+- 'Filter ('id > subquery#8)
: +- 'SubqueryAlias subquery#8
: +- 'GlobalLimit 1
: +- 'LocalLimit 1
: +- 'Project [unresolvedalias('id, None)]
: +- 'UnresolvedRelation `range`, None
+- 'UnresolvedRelation `range`, None
== Analyzed Logical Plan ==
(id / scalarsubquery()): double
Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
: +- SubqueryAlias subquery#9
: +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS sum(id)#10L]
: +- SubqueryAlias range
: +- Range 0, 100, 1, 4, [id#0L]
+- Filter (id#0L > subquery#8)
: +- SubqueryAlias subquery#8
: +- GlobalLimit 1
: +- LocalLimit 1
: +- Project [id#0L]
: +- SubqueryAlias range
: +- Range 0, 100, 1, 4, [id#0L]
+- SubqueryAlias range
+- Range 0, 100, 1, 4, [id#0L]
== Optimized Logical Plan ==
Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
: +- SubqueryAlias subquery#9
: +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS sum(id)#10L]
: +- Range 0, 100, 1, 4, [id#0L]
+- Filter (id#0L > subquery#8)
: +- SubqueryAlias subquery#8
: +- GlobalLimit 1
: +- LocalLimit 1
: +- Project [id#0L]
: +- Range 0, 100, 1, 4, [id#0L]
+- Range 0, 100, 1, 4, [id#0L]
== Physical Plan ==
WholeStageCodegen
: +- Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
: : +- Subquery subquery#9
: : +- WholeStageCodegen
: : : +- TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Final,isDistinct=false)], output=[sum(id)#10L])
: : : +- INPUT
: : +- Exchange SinglePartition, None
: : +- WholeStageCodegen
: : : +- TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Partial,isDistinct=false)], output=[sum#14L])
: : : +- Range 0, 1, 4, 100, [id#0L]
: +- Filter (id#0L > subquery#8)
: : +- Subquery subquery#8
: : +- CollectLimit 1
: : +- WholeStageCodegen
: : : +- Project [id#0L]
: : : +- Range 0, 1, 4, 100, [id#0L]
: +- Range 0, 1, 4, 100, [id#0L]
```
The web UI looks like:
![subquery](https://cloud.githubusercontent.com/assets/40902/13377963/932bcbae-dda7-11e5-82f7-03c9be85d77c.png)
This PR also change the tree structure of WholeStageCodegen to make it consistent than others. Before this change, Both WholeStageCodegen and InputAdapter hold a references to the same plans, those could be updated without notify another, causing problems, this is discovered by #11403 .
## How was this patch tested?
Existing tests, also manual tests with the example query, check the explain and web UI.
Author: Davies Liu <davies@databricks.com>
Closes#11417 from davies/viz_subquery.
## What changes were proposed in this pull request?
Make ContinuousQueryManagerSuite not output logs to the console. The logs will still output to `unit-tests.log`.
I also updated `SQLListenerMemoryLeakSuite` to use `quietly` to avoid changing the log level which won't output logs to `unit-tests.log`.
## How was this patch tested?
Just check Jenkins output.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11439 from zsxwing/quietly-ContinuousQueryManagerSuite.
## What changes were proposed in this pull request?
This patch simply moves things to a new package in an effort to reduce the size of the diff in #11048. Currently the new package only has one file, but in the future we'll add many new commands in SPARK-13139.
## How was this patch tested?
Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes#11482 from andrewor14/commands-package.
## What changes were proposed in this pull request?
This PR fixes typos in comments and testcase name of code.
## How was this patch tested?
manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11481 from dongjoon-hyun/minor_fix_typos_in_code.
## What changes were proposed in this pull request?
This PR adds the support to specify compression codecs for both ORC and Parquet.
## How was this patch tested?
unittests within IDE and code style tests with `dev/run_tests`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11464 from HyukjinKwon/SPARK-13543.
## What changes were proposed in this pull request?
Fixes compile problem due to inadvertent use of `Option.contains`, only in Scala 2.11. The change should have been to replace `Option.exists(_ == x)` with `== Some(x)`. Replacing exists with contains only makes sense for collections. Replacing use of `Option.exists` still makes sense though as it's misleading.
## How was this patch tested?
Jenkins tests / compilation
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: Sean Owen <sowen@cloudera.com>
Closes#11493 from srowen/SPARK-13423.2.
## What changes were proposed in this pull request?
After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
This issue aims remove unused imports from Java/Scala code and add `UnusedImports` checkstyle rule to help developers.
## How was this patch tested?
```
./dev/lint-java
./build/sbt compile
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11438 from dongjoon-hyun/SPARK-13583.
## What changes were proposed in this pull request?
Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:
- Inner class should be static
- Mismatched hashCode/equals
- Overflow in compareTo
- Unchecked warnings
- Misuse of assert, vs junit.assert
- get(a) + getOrElse(b) -> getOrElse(a,b)
- Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
- Dead code
- tailrec
- exists(_ == ) -> contains find + nonEmpty -> exists filter + size -> count
- reduce(_+_) -> sum map + flatten -> map
The most controversial may be .size -> .length simply because of its size. It is intended to avoid implicits that might be expensive in some places.
## How was the this patch tested?
Existing Jenkins unit tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#11292 from srowen/SPARK-13423.
## What changes were proposed in this pull request?
#11479 [SPARK-13627] broke 2.10 compatibility: [2.10-Build](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-scala-2.10/292/console)
At this moment, we need to support both 2.10 and 2.11.
This PR recovers some deprecated methods which were replace by [SPARK-13627].
## How was this patch tested?
Jenkins build: Both 2.10, 2.11.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11488 from dongjoon-hyun/hotfix_compatibility_with_2.10.
JIRA: https://issues.apache.org/jira/browse/SPARK-13466
## What changes were proposed in this pull request?
With column pruning rule in optimizer, some Project operators will become redundant. We should remove these redundant Projects.
For an example query:
val input = LocalRelation('key.int, 'value.string)
val query =
Project(Seq($"x.key", $"y.key"),
Join(
SubqueryAlias("x", input),
BroadcastHint(SubqueryAlias("y", input)), Inner, None))
After the first run of column pruning, it would like:
Project(Seq($"x.key", $"y.key"),
Join(
Project(Seq($"x.key"), SubqueryAlias("x", input)),
Project(Seq($"y.key"), <-- inserted by the rule
BroadcastHint(SubqueryAlias("y", input))),
Inner, None))
Actually we don't need the outside Project now. This patch will remove it:
Join(
Project(Seq($"x.key"), SubqueryAlias("x", input)),
Project(Seq($"y.key"),
BroadcastHint(SubqueryAlias("y", input))),
Inner, None)
## How was the this patch tested?
Unit test is added into ColumnPruningSuite.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11341 from viirya/remove-redundant-project.
JIRA: https://issues.apache.org/jira/browse/SPARK-13635
## What changes were proposed in this pull request?
LimitPushdown optimizer rule has been disabled due to no whole-stage codegen for Limit. As we have whole-stage codegen for Limit now, we should enable it.
## How was this patch tested?
As we only re-enable LimitPushdown optimizer rule, no need to add new tests for it.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11483 from viirya/enable-limitpushdown.
JIRA: https://issues.apache.org/jira/browse/SPARK-13616
## What changes were proposed in this pull request?
It is possibly that a logical plan has been removed `Project` from the top of it. Or the plan doesn't has a top `Project` from the beginning because it is not necessary. Currently the `SQLBuilder` can't convert such plans back to SQL. This change is to add this feature.
## How was this patch tested?
A test is added to `LogicalPlanToSQLSuite`.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11466 from viirya/sqlbuilder-notopselect.
## What changes were proposed in this pull request?
This PR aims to fix the following deprecation warnings.
* MethodSymbolApi.paramss--> paramLists
* AnnotationApi.tpe -> tree.tpe
* BufferLike.readOnly -> toList.
* StandardNames.nme -> termNames
* scala.tools.nsc.interpreter.AbstractFileClassLoader -> scala.reflect.internal.util.AbstractFileClassLoader
* TypeApi.declarations-> decls
## How was this patch tested?
Check the compile build log and pass the tests.
```
./build/sbt
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11479 from dongjoon-hyun/SPARK-13627.
## What changes were proposed in this pull request?
The `trait GroupingAnalytics` only has one implementation, it's an unnecessary abstraction. This PR removes it, and does some code simplification when resolving `GroupingSet`.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11469 from cloud-fan/groupingset.
## What changes were proposed in this pull request?
This pr to make the short names of compression codecs in `ParquetRelation` consistent against other ones. This pr comes from #11324.
## How was this patch tested?
Add more tests in `TextSuite`.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#11408 from maropu/SPARK-13528.
## What changes were proposed in this pull request?
Also updated the other benchmarks when the default to use vectorized decode was flipped.
Author: Nong Li <nong@databricks.com>
Closes#11454 from nongli/benchmark.
## What changes were proposed in this pull request?
In order to tell OutputStream that the task has failed or not, we should call the failure callbacks BEFORE calling writer.close().
## How was this patch tested?
Added new unit tests.
Author: Davies Liu <davies@databricks.com>
Closes#11450 from davies/callback.
#### What changes were proposed in this pull request?
```SQL
FROM
(FROM test SELECT TRANSFORM(key, value) USING 'cat' AS (`thing1` int, thing2 string)) t
SELECT thing1 + 1
```
This query returns an analysis error, like:
```
Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`thing1`' given input columns: [`thing1`, thing2]; line 3 pos 7
'Project [unresolvedalias(('thing1 + 1), None)]
+- SubqueryAlias t
+- ScriptTransformation [key#2,value#3], cat, [`thing1`#6,thing2#7], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, )),List((field.delim, )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
+- SubqueryAlias test
+- Project [_1#0 AS key#2,_2#1 AS value#3]
+- LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3],[4,4],[5,5]]
```
The backpacks of \`thing1\` should be cleaned before entering Parser/Analyzer. This PR fixes this issue.
#### How was this patch tested?
Added a test case and modified an existing test case
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11415 from gatorsmile/scriptTransform.
#### What changes were proposed in this pull request?
This PR is to prune unnecessary columns when the operator is `MapPartitions`. The solution is to add an extra `Project` in the child node.
For the other two operators `AppendColumns` and `MapGroups`, it sounds doable. More discussions are required. The major reason is the current implementation of the `inputPlan` of `groupBy` is based on the child of `AppendColumns`. It might be a bug? Thus, will submit a separate PR.
#### How was this patch tested?
Added a test case in ColumnPruningSuite to verify the rule. Added another test case in DatasetSuite.scala to verify the data.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11460 from gatorsmile/datasetPruningNew.
## What changes were proposed in this pull request?
Change in class FormatNumber to make it work irrespective of locale.
## How was this patch tested?
Unit tests.
Author: lgieron <lgieron@gmail.com>
Closes#11396 from lgieron/SPARK-13515_Fix_Format_Number.
Rows with null values in partition column are not included in the results because none of the partition
where clause specify is null predicate on the partition column. This fix adds is null predicate on the partition column to the first JDBC partition where clause.
Example:
JDBCPartition(THEID < 1 or THEID is null, 0),JDBCPartition(THEID >= 1 AND THEID < 2,1),
JDBCPartition(THEID >= 2, 2)
Author: sureshthalamati <suresh.thalamati@gmail.com>
Closes#11063 from sureshthalamati/nullable_jdbc_part_col_spark-13167.
## What changes were proposed in this pull request?
Broadcast left semi join without joining keys is already supported in BroadcastNestedLoopJoin, it has the same implementation as LeftSemiJoinBNL, we should remove that.
## How was this patch tested?
Updated unit tests.
Author: Davies Liu <davies@databricks.com>
Closes#11448 from davies/remove_bnl.
## What changes were proposed in this pull request?
This PR defer the resolution from a id of dictionary to value until the column is actually accessed (inside getInt/getLong), this is very useful for those columns and rows that are filtered out. It's also useful for binary type, we will not need to copy all the byte arrays.
This PR also change the underlying type for small decimal that could be fit within a Int, in order to use getInt() to lookup the value from IntDictionary.
## How was this patch tested?
Manually test TPCDS Q7 with scale factor 10, saw about 30% improvements (after PR #11274).
Author: Davies Liu <davies@databricks.com>
Closes#11437 from davies/decode_dict.
JIRA: https://issues.apache.org/jira/browse/SPARK-13511
## What changes were proposed in this pull request?
Current limit operator doesn't support wholestage codegen. This is open to add support for it.
In the `doConsume` of `GlobalLimit` and `LocalLimit`, we use a count term to count the processed rows. Once the row numbers catches the limit number, we set the variable `stopEarly` of `BufferedRowIterator` newly added in this pr to `true` that indicates we want to stop processing remaining rows. Then when the wholestage codegen framework checks `shouldStop()`, it will stop the processing of the row iterator.
Before this, the executed plan for a query `sqlContext.range(N).limit(100).groupBy().sum()` is:
TungstenAggregate(key=[], functions=[(sum(id#5L),mode=Final,isDistinct=false)], output=[sum(id)#6L])
+- TungstenAggregate(key=[], functions=[(sum(id#5L),mode=Partial,isDistinct=false)], output=[sum#9L])
+- GlobalLimit 100
+- Exchange SinglePartition, None
+- LocalLimit 100
+- Range 0, 1, 1, 524288000, [id#5L]
After add wholestage codegen support:
WholeStageCodegen
: +- TungstenAggregate(key=[], functions=[(sum(id#40L),mode=Final,isDistinct=false)], output=[sum(id)#41L])
: +- TungstenAggregate(key=[], functions=[(sum(id#40L),mode=Partial,isDistinct=false)], output=[sum#44L])
: +- GlobalLimit 100
: +- INPUT
+- Exchange SinglePartition, None
+- WholeStageCodegen
: +- LocalLimit 100
: +- Range 0, 1, 1, 524288000, [id#40L]
## How was this patch tested?
A test is added into BenchmarkWholeStageCodegen.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11391 from viirya/wholestage-limit.
## What changes were proposed in this pull request?
This PR adds support for implementing whole state codegen for sort. Builds heaving on nongli 's PR: https://github.com/apache/spark/pull/11008 (which actually implements the feature), and adds the following changes on top:
- [x] Generated code updates peak execution memory metrics
- [x] Unit tests in `WholeStageCodegenSuite` and `SQLMetricsSuite`
## How was this patch tested?
New unit tests in `WholeStageCodegenSuite` and `SQLMetricsSuite`. Further, all existing sort tests should pass.
Author: Sameer Agarwal <sameer@databricks.com>
Author: Nong Li <nong@databricks.com>
Closes#11359 from sameeragarwal/sort-codegen.
#### What changes were proposed in this pull request?
After analysis by Analyzer, two operators could have alias. They are `Project` and `Aggregate`. So far, we only rewrite and propagate constraints if `Alias` is defined in `Project`. This PR is to resolve this issue in `Aggregate`.
#### How was this patch tested?
Added a test case for `Aggregate` in `ConstraintPropagationSuite`.
marmbrus sameeragarwal
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11422 from gatorsmile/validConstraintsInUnaryNodes.
https://issues.apache.org/jira/browse/SPARK-13507https://issues.apache.org/jira/browse/SPARK-13509
## What changes were proposed in this pull request?
This PR adds the support to write CSV data directly by a single call to the given path.
Several unitests were added for each functionality.
## How was this patch tested?
This was tested with unittests and with `dev/run_tests` for coding style
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Closes#11389 from HyukjinKwon/SPARK-13507-13509.
## What changes were proposed in this pull request?
Nested classes defined within Scala objects are translated into Java static nested classes. Unlike inner classes, they don't need outer scopes. But the analyzer still thinks that an outer scope is required.
This PR fixes this issue simply by checking whether a nested class is static before looking up its outer scope.
## How was this patch tested?
A test case is added to `DatasetSuite`. It checks contents of a Dataset whose element type is a nested class declared in a Scala object.
Author: Cheng Lian <lian@databricks.com>
Closes#11421 from liancheng/spark-13540-object-as-outer-scope.
JIRA: https://issues.apache.org/jira/browse/SPARK-13537
## What changes were proposed in this pull request?
In readBytes of VectorizedPlainValuesReader, we use buffer[offset] to access bytes in buffer. It is incorrect because offset is added with Platform.BYTE_ARRAY_OFFSET when initialization. We should fix it.
## How was this patch tested?
`ParquetHadoopFsRelationSuite` sometimes (depending on the randomly generated data) will be [failed](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52136/consoleFull) by this bug. After applying this, the test can be passed.
I added a test to `ParquetHadoopFsRelationSuite` with the data which will fail without this patch.
The error exception:
[info] ParquetHadoopFsRelationSuite:
[info] - test all data types - StringType (440 milliseconds)
[info] - test all data types - BinaryType (434 milliseconds)
[info] - test all data types - BooleanType (406 milliseconds)
20:59:38.618 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 2597.0 (TID 67966)
java.lang.ArrayIndexOutOfBoundsException: 46
at org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBytes(VectorizedPlainValuesReader.java:88)
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11418 from viirya/fix-readbytes.
## What changes were proposed in this pull request?
This creates a `SessionState`, which groups a few fields that existed in `SQLContext`. Because `HiveContext` extends `SQLContext` we also need to make changes there. This is mainly a cleanup task that will soon pave the way for merging the two contexts.
## How was this patch tested?
Existing unit tests; this patch introduces no change in behavior.
Author: Andrew Or <andrew@databricks.com>
Closes#11405 from andrewor14/refactor-session.
## What changes were proposed in this pull request?
Fix readBytes in VectorizedPlainValuesReader. This fixes a copy and paste issue.
## How was this patch tested?
Ran ParquetHadoopFsRelationSuite which failed before this.
Author: Nong Li <nong@databricks.com>
Closes#11414 from nongli/spark-13533.
JIRA: https://issues.apache.org/jira/browse/SPARK-13530
## What changes were proposed in this pull request?
By enabling vectorized parquet scanner by default, the unit test `ParquetHadoopFsRelationSuite` based on `HadoopFsRelationTest` will be failed due to the lack of short type support in `UnsafeRowParquetRecordReader`. We should fix it.
The error exception:
[info] ParquetHadoopFsRelationSuite:
[info] - test all data types - StringType (499 milliseconds)
[info] - test all data types - BinaryType (447 milliseconds)
[info] - test all data types - BooleanType (520 milliseconds)
[info] - test all data types - ByteType (418 milliseconds)
00:22:58.920 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 124.0 (TID 1949)
org.apache.commons.lang.NotImplementedException: Unimplemented type: ShortType
at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader$ColumnReader.readIntBatch(UnsafeRowParquetRecordReader.java:769)
at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader$ColumnReader.readBatch(UnsafeRowParquetRecordReader.java:640)
at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader$ColumnReader.access$000(UnsafeRowParquetRecordReader.java:461)
at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader.nextBatch(UnsafeRowParquetRecordReader.java:224)
## How was this patch tested?
The unit test `ParquetHadoopFsRelationSuite` based on `HadoopFsRelationTest` will be [failed](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52110/consoleFull) due to the lack of short type support in UnsafeRowParquetRecordReader. By adding this support, the test can be passed.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11412 from viirya/add-shorttype-support.
## What changes were proposed in this pull request?
Change the default of the flag to enable this feature now that the implementation is complete.
## How was this patch tested?
The new parquet reader should be a drop in, so will be exercised by the existing tests.
Author: Nong Li <nong@databricks.com>
Closes#11397 from nongli/spark-13518.
## What changes were proposed in this pull request?
This patch includes these performance fixes:
- Remove unnecessary setNotNull() calls. The NULL bits are cleared already.
- Speed up RLE group decoding
- Speed up dictionary decoding by decoding NULLs directly into the result.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
In addition to the updated benchmarks, on TPCDS, the result of these changes
running Q55 (sf40) is:
```
TPCDS: Best/Avg Time(ms) Rate(M/s) Per Row(ns)
---------------------------------------------------------------------------------
q55 (Before) 6398 / 6616 18.0 55.5
q55 (After) 4983 / 5189 23.1 43.3
```
Author: Nong Li <nong@databricks.com>
Closes#11375 from nongli/spark-13499.
## What changes were proposed in this pull request?
Currently, BroadcastNestedLoopJoin is implemented for worst case, it's too slow, very easy to hang forever. This PR will create fast path for some joinType and buildSide, also improve the worst case (will use much less memory than before).
Before this PR, one task requires O(N*K) + O(K) in worst cases, N is number of rows from one partition of streamed table, it could hang the job (because of GC).
In order to workaround this for InnerJoin, we have to disable auto-broadcast, switch to CartesianProduct: This could be workaround for InnerJoin, see https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html
In this PR, we will have fast path for these joins :
InnerJoin with BuildLeft or BuildRight
LeftOuterJoin with BuildRight
RightOuterJoin with BuildLeft
LeftSemi with BuildRight
These fast paths are all stream based (take one pass on streamed table), required O(1) memory.
All other join types and build types will take two pass on streamed table, one pass to find the matched rows that includes streamed part, which require O(1) memory, another pass to find the rows from build table that does not have a matched row from streamed table, which required O(K) memory, K is the number rows from build side, one bit per row, should be much smaller than the memory for broadcast. The following join types work in this way:
LeftOuterJoin with BuildLeft
RightOuterJoin with BuildRight
FullOuterJoin with BuildLeft or BuildRight
LeftSemi with BuildLeft
This PR also added tests for all the join types for BroadcastNestedLoopJoin.
After this PR, for InnerJoin with one small table, BroadcastNestedLoopJoin should be faster than CartesianProduct, we don't need that workaround anymore.
## How was the this patch tested?
Added unit tests.
Author: Davies Liu <davies@databricks.com>
Closes#11328 from davies/nested_loop.
## What changes were proposed in this pull request?
This is another try of PR #11323.
This PR removes DataFrame RDD operations except for `foreach` and `foreachPartitions` (they are actions rather than transformations). Original calls are now replaced by calls to methods of `DataFrame.rdd`.
PR #11323 was reverted because it introduced a regression: both `DataFrame.foreach` and `DataFrame.foreachPartitions` wrap underlying RDD operations with `withNewExecutionId` to track Spark jobs. But they are removed in #11323.
## How was the this patch tested?
No extra tests are added. Existing tests should do the work.
Author: Cheng Lian <lian@databricks.com>
Closes#11388 from liancheng/remove-df-rdd-ops.
## What changes were proposed in this pull request?
This patch creates the public API for runtime configuration and an implementation for it. The public runtime configuration includes configs for existing SQL, as well as Hadoop Configuration.
This new interface is currently dead code. It will be added to SQLContext and a session entry point to Spark when we add that.
## How was this patch tested?
a new unit test suite
Author: Reynold Xin <rxin@databricks.com>
Closes#11378 from rxin/SPARK-13487.
## What changes were proposed in this pull request?
This Pull request is used for the fix SPARK-12941, creating a data type mapping to Oracle for the corresponding data type"Stringtype" from dataframe. This PR is for the master branch fix, where as another PR is already tested with the branch 1.4
## How was the this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
This patch was tested using the Oracle docker .Created a new integration suite for the same.The oracle.jdbc jar was to be downloaded from the maven repository.Since there was no jdbc jar available in the maven repository, the jar was downloaded from oracle site manually and installed in the local; thus tested. So, for SparkQA test case run, the ojdbc jar might be manually placed in the local maven repository(com/oracle/ojdbc6/11.2.0.2.0) while Spark QA test run.
Author: thomastechs <thomas.sebastian@tcs.com>
Closes#11306 from thomastechs/master.
This pr added benchmark codes for Encoder#compress().
Also, it replaced the benchmark results with new ones because the output format of `Benchmark` changed.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#11236 from maropu/CompressionSpike.
## Motivation
As a pre-requisite to off-heap caching of blocks, we need a mechanism to prevent pages / blocks from being evicted while they are being read. With on-heap objects, evicting a block while it is being read merely leads to memory-accounting problems (because we assume that an evicted block is a candidate for garbage-collection, which will not be true during a read), but with off-heap memory this will lead to either data corruption or segmentation faults.
## Changes
### BlockInfoManager and reader/writer locks
This patch adds block-level read/write locks to the BlockManager. It introduces a new `BlockInfoManager` component, which is contained within the `BlockManager`, holds the `BlockInfo` objects that the `BlockManager` uses for tracking block metadata, and exposes APIs for locking blocks in either shared read or exclusive write modes.
`BlockManager`'s `get*()` and `put*()` methods now implicitly acquire the necessary locks. After a `get()` call successfully retrieves a block, that block is locked in a shared read mode. A `put()` call will block until it acquires an exclusive write lock. If the write succeeds, the write lock will be downgraded to a shared read lock before returning to the caller. This `put()` locking behavior allows us store a block and then immediately turn around and read it without having to worry about it having been evicted between the write and the read, which will allow us to significantly simplify `CacheManager` in the future (see #10748).
See `BlockInfoManagerSuite`'s test cases for a more detailed specification of the locking semantics.
### Auto-release of locks at the end of tasks
Our locking APIs support explicit release of locks (by calling `unlock()`), but it's not always possible to guarantee that locks will be released prior to the end of the task. One reason for this is our iterator interface: since our iterators don't support an explicit `close()` operator to signal that no more records will be consumed, operations like `take()` or `limit()` don't have a good means to release locks on their input iterators' blocks. Another example is broadcast variables, whose block locks can only be released at the end of the task.
To address this, `BlockInfoManager` uses a pair of maps to track the set of locks acquired by each task. Lock acquisitions automatically record the current task attempt id by obtaining it from `TaskContext`. When a task finishes, code in `Executor` calls `BlockInfoManager.unlockAllLocksForTask(taskAttemptId)` to free locks.
### Locking and the MemoryStore
In order to prevent in-memory blocks from being evicted while they are being read, the `MemoryStore`'s `evictBlocksToFreeSpace()` method acquires write locks on blocks which it is considering as candidates for eviction. These lock acquisitions are non-blocking, so a block which is being read will not be evicted. By holding write locks until the eviction is performed or skipped (in case evicting the blocks would not free enough memory), we avoid a race where a new reader starts to read a block after the block has been marked as an eviction candidate but before it has been removed.
### Locking and remote block transfer
This patch makes small changes to to block transfer and network layer code so that locks acquired by the BlockTransferService are released as soon as block transfer messages are consumed and released by Netty. This builds on top of #11193, a bug fix related to freeing of network layer ManagedBuffers.
## FAQ
- **Why not use Java's built-in [`ReadWriteLock`](https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReadWriteLock.html)?**
Our locks operate on a per-task rather than per-thread level. Under certain circumstances a task may consist of multiple threads, so using `ReadWriteLock` would mean that we might call `unlock()` from a thread which didn't hold the lock in question, an operation which has undefined semantics. If we could rely on Java 8 classes, we might be able to use [`StampedLock`](https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/StampedLock.html) to work around this issue.
- **Why not detect "leaked" locks in tests?**:
See above notes about `take()` and `limit`.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10705 from JoshRosen/pin-pages.
## What changes were proposed in this pull request?
This PR removes DataFrame RDD operations. Original calls are now replaced by calls to methods of `DataFrame.rdd`.
## How was the this patch tested?
No extra tests are added. Existing tests should do the work.
Author: Cheng Lian <lian@databricks.com>
Closes#11323 from liancheng/remove-df-rdd-ops.
## What changes were proposed in this pull request?
Predicates shouldn't be pushed through project with nondeterministic field(s).
See https://github.com/graphframes/graphframes/pull/23 and SPARK-13473 for more details.
This PR targets master, branch-1.6, and branch-1.5.
## How was this patch tested?
A test case is added in `FilterPushdownSuite`. It constructs a query plan where a filter is over a project with a nondeterministic field. Optimized query plan shouldn't change in this case.
Author: Cheng Lian <lian@databricks.com>
Closes#11348 from liancheng/spark-13473-no-ppd-through-nondeterministic-project-field.
## What changes were proposed in this pull request?
This patch moves SQLConf into org.apache.spark.sql.internal package to make it very explicit that it is internal. Soon I will also submit more API work that creates implementations of interfaces in this internal package.
## How was this patch tested?
If it compiles, then the refactoring should work.
Author: Reynold Xin <rxin@databricks.com>
Closes#11363 from rxin/SPARK-13486.
## What changes were proposed in this pull request?
This PR mostly rewrite the ColumnPruning rule to support most of the SQL logical plans (except those for Dataset).
This PR also fix a bug in Generate, it should always output UnsafeRow, added an regression test for that.
## How was this patch tested?
This is test by unit tests, also manually test with TPCDS Q78, which could prune all unused columns successfully, improved the performance by 78% (from 22s to 12s).
Author: Davies Liu <davies@databricks.com>
Closes#11354 from davies/fix_column_pruning.
## What changes were proposed in this pull request?
* Scala DataFrameStatFunctions: Added version of approxQuantile taking a List instead of an Array, for Python compatbility
* Python DataFrame and DataFrameStatFunctions: Added approxQuantile
## How was this patch tested?
* unit test in sql/tests.py
Documentation was copied from the existing approxQuantile exactly.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#11356 from jkbradley/approx-quantile-python.
This PR adds a new abstraction called an `ExpressionSet` which attempts to canonicalize expressions to remove cosmetic differences. Deterministic expressions that are in the set after canonicalization will always return the same answer given the same input (i.e. false positives should not be possible). However, it is possible that two canonical expressions that are not equal will in fact return the same answer given any input (i.e. false negatives are possible).
```scala
val set = AttributeSet('a + 1 :: 1 + 'a :: Nil)
set.iterator => Iterator('a + 1)
set.contains('a + 1) => true
set.contains(1 + 'a) => true
set.contains('a + 2) => false
```
Other relevant changes include:
- Since this concept overlaps with the existing `semanticEquals` and `semanticHash`, those functions are also ported to this new infrastructure.
- A memoized `canonicalized` version of the expression is added as a `lazy val` to `Expression` and is used by both `semanticEquals` and `ExpressionSet`.
- A set of unit tests for `ExpressionSet` are added
- Tests which expect `semanticEquals` to be less intelligent than it now is are updated.
As a followup, we should consider auditing the places where we do `O(n)` `semanticEquals` operations and replace them with `ExpressionSet`. We should also consider consolidating `AttributeSet` as a specialized factory for an `ExpressionSet.`
Author: Michael Armbrust <michael@databricks.com>
Closes#11338 from marmbrus/expressionSet.
Some parts of the engine rely on UnsafeRow which the vectorized parquet scanner does not want
to produce. This add a conversion in Physical RDD. In the case where codegen is used (and the
scan is the start of the pipeline), there is no requirement to use UnsafeRow. This patch adds
update PhysicallRDD to support codegen, which eliminates the need for the UnsafeRow conversion
in all cases.
The result of these changes for TPCDS-Q19 at the 10gb sf reduces the query time from 9.5 seconds
to 6.5 seconds.
Author: Nong Li <nong@databricks.com>
Closes#11141 from nongli/spark-13250.
## What changes were proposed in this pull request?
Reverting SPARK-13376 (d563c8fa01) affects the test added by SPARK-13383. So, I am fixing the test.
Author: Yin Huai <yhuai@databricks.com>
Closes#11355 from yhuai/SPARK-13383-fix-test.
## What changes were proposed in this pull request?
`HiveCompatibilitySuite` should still run in PR build even if a PR only changes sql/core. So, I am going to remove `ExtendedHiveTest` annotation from `HiveCompatibilitySuite`.
https://issues.apache.org/jira/browse/SPARK-13475
Author: Yin Huai <yhuai@databricks.com>
Closes#11351 from yhuai/SPARK-13475.
## What changes were proposed in this pull request?
Since "[SPARK-13321][SQL] Support nested UNION in parser" is reverted, we need to disable the test case that requires this PR. Thanks!
rxin yhuai marmbrus
## How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11352 from gatorsmile/disableTestCase.
## What changes were proposed in this pull request?
When we pass a Python function to JVM side, we also need to send its context, e.g. `envVars`, `pythonIncludes`, `pythonExec`, etc. However, it's annoying to pass around so many parameters at many places. This PR abstract python function along with its context, to simplify some pyspark code and make the logic more clear.
## How was the this patch tested?
by existing unit tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11342 from cloud-fan/python-clean.
JIRA: https://issues.apache.org/jira/browse/SPARK-13383
## What changes were proposed in this pull request?
When we do column pruning in Optimizer, we put additional Project on top of a logical plan. However, when we already wrap a BroadcastHint on a logical plan, the added Project will hide BroadcastHint after later execution.
We should take care of BroadcastHint when we do column pruning.
## How was the this patch tested?
Unit test is added.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11260 from viirya/keep-broadcasthint.
## What changes were proposed in this pull request?
This PR pull all the keywords (and some others) from ExpressionParser.g as KeywordParser.g, because ExpressionParser is too large to compile.
## How was the this patch tested?
unit test, maven build
Closes#11329
Author: Davies Liu <davies@databricks.com>
Closes#11331 from davies/split_expr.
## What changes were proposed in this pull request?
This PR mostly rewrite the ColumnPruning rule to support most of the SQL logical plans (except those for Dataset).
## How was the this patch tested?
This is test by unit tests, also manually test with TPCDS Q78, which could prune all unused columns successfully, improved the performance by 78% (from 22s to 12s).
Author: Davies Liu <davies@databricks.com>
Closes#11256 from davies/fix_column_pruning.
## What changes were proposed in this pull request?
This continues thunterdb 's work on `approxQuantile` API. It changes the signature of `approxQuantile` from `(col: String, quantile: Double, epsilon: Double): Double` to `(col: String, probabilities: Array[Double], relativeError: Double): Array[Double]` and update API doc. It also improves the error message in tests and simplifies the merge algorithm for summaries.
## How was the this patch tested?
Use the same unit tests as before.
Closes#11325
Author: Timothy Hunter <timhunter@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#11332 from mengxr/SPARK-6761.
## What changes were proposed in this pull request?
Generates code for SortMergeJoin.
## How was the this patch tested?
Unit tests and manually tested with TPCDS Q72, which showed 70% performance improvements (from 42s to 25s), but micro benchmark only show minor improvements, it may depends the distribution of data and number of columns.
Author: Davies Liu <davies@databricks.com>
Closes#11248 from davies/gen_smj.
The current implementation of statistics of UnaryNode does not considering output (for example, Project may product much less columns than it's child), we should considering it to have a better guess.
We usually only join with few columns from a parquet table, the size of projected plan could be much smaller than the original parquet files. Having a better guess of size help we choose between broadcast join or sort merge join.
After this PR, I saw a few queries choose broadcast join other than sort merge join without turning spark.sql.autoBroadcastJoinThreshold for every query, ended up with about 6-8X improvements on end-to-end time.
We use `defaultSize` of DataType to estimate the size of a column, currently For DecimalType/StringType/BinaryType and UDT, we are over-estimate too much (4096 Bytes), so this PR change them to some more reasonable values. Here are the new defaultSize for them:
DecimalType: 8 or 16 bytes, based on the precision
StringType: 20 bytes
BinaryType: 100 bytes
UDF: default size of SQL type
These numbers are not perfect (hard to have a perfect number for them), but should be better than 4096.
Author: Davies Liu <davies@databricks.com>
Closes#11210 from davies/statics.
The type checking functions of `If` and `UnwrapOption` are fixed to eliminate spurious failures. `UnwrapOption` was checking for an input of `ObjectType` but `ObjectType`'s accept function was hard coded to return `false`. `If`'s type check was returning a false negative in the case that the two options differed only by nullability.
Tests added:
- an end-to-end regression test is added to `DatasetSuite` for the reported failure.
- all the unit tests in `ExpressionEncoderSuite` are augmented to also confirm successful analysis. These tests are actually what pointed out the additional issues with `If` resolution.
Author: Michael Armbrust <michael@databricks.com>
Closes#11316 from marmbrus/datasetOptions.
JIRA: https://issues.apache.org/jira/browse/SPARK-6761
Compute approximate quantile based on the paper Greenwald, Michael and Khanna, Sanjeev, "Space-efficient Online Computation of Quantile Summaries," SIGMOD '01.
Author: Timothy Hunter <timhunter@databricks.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#6042 from viirya/approximate_quantile.
This PR is to implement SQL generation for the following three set operations:
- Union Distinct
- Intersect
- Except
liancheng Thanks!
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#11195 from gatorsmile/setOpSQLGen.
#### What changes were proposed in this pull request?
Ensure that all built-in expressions can be mapped to its SQL representation if there is one (e.g. ScalaUDF doesn't have a SQL representation). The function lists are from the expression list in `FunctionRegistry`.
window functions, grouping sets functions (`cube`, `rollup`, `grouping`, `grouping_id`), generator functions (`explode` and `json_tuple`) are covered by separate JIRA and PRs. Thus, this PR does not cover them. Except these functions, all the built-in expressions are covered. For details, see the list in `ExpressionToSQLSuite`.
Fixed a few issues. For example, the `prettyName` of `approx_count_distinct` is not right. The `sql` of `hash` function is not right, since the `hash` function does not accept `seed`.
Additionally, also correct the order of expressions in `FunctionRegistry` so that people are easier to find which functions are missing.
cc liancheng
#### How was the this patch tested?
Added two test cases in LogicalPlanToSQLSuite for covering `not like` and `not in`.
Added a new test suite `ExpressionToSQLSuite` to cover the functions:
1. misc non-aggregate functions + complex type creators + null expressions
2. math functions
3. aggregate functions
4. string functions
5. date time functions + calendar interval
6. collection functions
7. misc functions
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11314 from gatorsmile/expressionToSQL.
In SparkSQLCLI, we have created a `CliSessionState`, but then we call `SparkSQLEnv.init()`, which will start another `SessionState`. This would lead to exception because `processCmd` need to get the `CliSessionState` instance by calling `SessionState.get()`, but the return value would be a instance of `SessionState`. See the exception below.
spark-sql> !echo "test";
Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.hive.ql.session.SessionState cannot be cast to org.apache.hadoop.hive.cli.CliSessionState
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:112)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:301)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:242)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:691)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#9589 from adrian-wang/clicommand.
Use the HashedRelation which is a more optimized datastructure and reduce code complexity
Author: Xiu Guo <xguo27@gmail.com>
Closes#11291 from xguo27/SPARK-13422.
A common problem that users encounter with Spark 1.6.0 is that writing to a partitioned parquet table OOMs. The root cause is that parquet allocates a significant amount of memory that is not accounted for by our own mechanisms. As a workaround, we can ensure that only a single file is open per task unless the user explicitly asks for more.
Author: Michael Armbrust <michael@databricks.com>
Closes#11308 from marmbrus/parquetWriteOOM.
## What changes were proposed in this pull request?
This PR tries to fix all typos in all markdown files under `docs` module,
and fixes similar typos in other comments, too.
## How was the this patch tested?
manual tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11300 from dongjoon-hyun/minor_fix_typos.
https://issues.apache.org/jira/browse/SPARK-13381
This PR adds the support to load CSV data directly by a single call with given paths.
Also, I corrected this to refer all paths rather than the first path in schema inference, which JSON datasource dose.
Several unitests were added for each functionality.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11262 from HyukjinKwon/SPARK-13381.
JIRA: https://issues.apache.org/jira/browse/SPARK-13321
The following SQL can not be parsed with current parser:
SELECT `u_1`.`id` FROM (((SELECT `t0`.`id` FROM `default`.`t0`) UNION ALL (SELECT `t0`.`id` FROM `default`.`t0`)) UNION ALL (SELECT `t0`.`id` FROM `default`.`t0`)) AS u_1
We should fix it.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11204 from viirya/nested-union.
## What changes were proposed in this pull request?
This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations.
This was previously causing `"AnalysisException: u"unresolved operator 'Union;""` when trying to unionAll two dataframes with UDT columns as below.
```
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types
schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
c = a.unionAll(b)
```
## How was the this patch tested?
Tested using two unit tests in sql/test.py and the DataFrameSuite.
Additional information here : https://issues.apache.org/jira/browse/SPARK-13410
Author: Franklyn D'souza <franklynd@gmail.com>
Closes#11279 from damnMeddlingKid/udt-union-all.
## What changes were proposed in this pull request?
Fixed the test failure `org.apache.spark.sql.util.ContinuousQueryListenerSuite.event ordering`: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/202/testReport/junit/org.apache.spark.sql.util/ContinuousQueryListenerSuite/event_ordering/
```
org.scalatest.exceptions.TestFailedException:
Assert failed: : null equaled null onQueryTerminated called before onQueryStarted
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector$$anonfun$onQueryTerminated$1.apply$mcV$sp(ContinuousQueryListenerSuite.scala:204)
org.scalatest.concurrent.AsyncAssertions$Waiter.apply(AsyncAssertions.scala:349)
org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector.onQueryTerminated(ContinuousQueryListenerSuite.scala:203)
org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:67)
org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:32)
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.postToAll(ContinuousQueryListenerBus.scala:32)
```
In the previous codes, when the test `adding and removing listener` finishes, there may be still some QueryTerminated events in the listener bus queue. Then when `event ordering` starts to run, it may see these events and throw the above exception.
This PR just added `waitUntilEmpty` in `after` to make sure all events be consumed after each test.
## How was the this patch tested?
Jenkins tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11275 from zsxwing/SPARK-13405.
## What changes were proposed in this pull request?
This is a step towards merging `SQLContext` and `HiveContext`. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using `HiveClient`, an existing interface to Hive. It also extends `HiveClient` with additional calls to Hive that are needed to complete the catalog implementation.
*Where should I start reviewing?* The new catalog introduced is `HiveCatalog`. This class is relatively simple because it just calls `HiveClientImpl`, where most of the new logic is. I would not start with `HiveClient`, `HiveQl`, or `HiveMetastoreCatalog`, which are modified mainly because of a refactor.
*Why is this patch so big?* I had to refactor HiveClient to remove an intermediate representation of databases, tables, partitions etc. After this refactor `CatalogTable` convert directly to and from `HiveTable` (etc.). Otherwise we would have to first convert `CatalogTable` to the intermediate representation and then convert that to HiveTable, which is messy.
The new class hierarchy is as follows:
```
org.apache.spark.sql.catalyst.catalog.Catalog
- org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
- org.apache.spark.sql.hive.HiveCatalog
```
Note that, as of this patch, none of these classes are currently used anywhere yet. This will come in the future before the Spark 2.0 release.
## How was the this patch tested?
All existing unit tests, and HiveCatalogSuite that extends CatalogTestCases.
Author: Andrew Or <andrew@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#11293 from rxin/hive-catalog.
https://issues.apache.org/jira/browse/SPARK-13137
This PR adds a filter in schema inference so that it does not emit NullPointException.
Also, I removed `MAX_COMMENT_LINES_IN_HEADER `but instead used a monad chaining with `filter()` and `first()`.
Lastly, I simply added a newline rather than adding a new file for this so that this is covered with the original tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11023 from HyukjinKwon/SPARK-13137.
Quite a few Spark SQL join operators broadcast one side of the join to all nodes. The are a few problems with this:
- This conflates broadcasting (a data exchange) with joining. Data exchanges should be managed by a different operator.
- All these nodes implement their own (duplicate) broadcasting logic.
- Re-use of indices is quite hard.
This PR defines both a ```BroadcastDistribution``` and ```BroadcastPartitioning```, these contain a `BroadcastMode`. The `BroadcastMode` defines the way in which we transform the Array of `InternalRow`'s into an index. We currently support the following `BroadcastMode`'s:
- IdentityBroadcastMode: This broadcasts the rows in their original form.
- HashSetBroadcastMode: This applies a projection to the input rows, deduplicates these rows and broadcasts the resulting `Set`.
- HashedRelationBroadcastMode: This transforms the input rows into a `HashedRelation`, and broadcasts this index.
To match this distribution we implement a ```BroadcastExchange``` operator which will perform the broadcast for us, and have ```EnsureRequirements``` plan this operator. The old Exchange operator has been renamed into ShuffleExchange in order to clearly separate between Shuffled and Broadcasted exchanges. Finally the classes in Exchange.scala have been moved to a dedicated package.
cc rxin davies
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#11083 from hvanhovell/SPARK-13136.
## What changes were proposed in this pull request?
This pull request fixes some minor issues (documentation, test flakiness, test organization) with #11190, which was merged earlier tonight.
## How was the this patch tested?
unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#11285 from rxin/subquery.
## What changes were proposed in this pull request?
This patch renames logical.Subquery to logical.SubqueryAlias, which is a more appropriate name for this operator (versus subqueries as expressions).
## How was the this patch tested?
Unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#11288 from rxin/SPARK-13420.
This PR introduces several major changes:
1. Replacing `Expression.prettyString` with `Expression.sql`
The `prettyString` method is mostly an internal, developer faced facility for debugging purposes, and shouldn't be exposed to users.
1. Using SQL-like representation as column names for selected fields that are not named expression (back-ticks and double quotes should be removed)
Before, we were using `prettyString` as column names when possible, and sometimes the result column names can be weird. Here are several examples:
Expression | `prettyString` | `sql` | Note
------------------ | -------------- | ---------- | ---------------
`a && b` | `a && b` | `a AND b` |
`a.getField("f")` | `a[f]` | `a.f` | `a` is a struct
1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)
`NonSQLExpression.sql` may return an arbitrary user facing string representation of the expression.
Author: Cheng Lian <lian@databricks.com>
Closes#10757 from liancheng/spark-12799.simplify-expression-string-methods.
```scala
// case 1: missing sort columns are resolvable if join is true
sql("SELECT explode(a) AS val, b FROM data WHERE b < 2 order by val, c")
// case 2: missing sort columns are not resolvable if join is false. Thus, issue an error message in this case
sql("SELECT explode(a) AS val FROM data order by val, c")
```
When sort columns are not in `Generate`, we can resolve them when `join` is equal to `true`. Still trying to add more test cases for the other `UnaryNode` types.
Could you review the changes? davies cloud-fan Thanks!
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11198 from gatorsmile/missingInSort.
Conversion of outer joins, if the predicates in filter conditions can restrict the result sets so that all null-supplying rows are eliminated.
- `full outer` -> `inner` if both sides have such predicates
- `left outer` -> `inner` if the right side has such predicates
- `right outer` -> `inner` if the left side has such predicates
- `full outer` -> `left outer` if only the left side has such predicates
- `full outer` -> `right outer` if only the right side has such predicates
If applicable, this can greatly improve the performance, since outer join is much slower than inner join, full outer join is much slower than left/right outer join.
The original PR is https://github.com/apache/spark/pull/10542
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#10567 from gatorsmile/outerJoinEliminationByFilterCond.
This PR adds support for rewriting constraints if there are aliases in the query plan. For e.g., if there is a query of form `SELECT a, a AS b`, any constraints on `a` now also apply to `b`.
JIRA: https://issues.apache.org/jira/browse/SPARK-13091
cc marmbrus
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11144 from sameeragarwal/alias.
This patch expose `maxCharactersPerColumn` and `maxColumns` to user in CSV data source.
Author: Hossein <hossein@databricks.com>
Closes#11147 from falaki/SPARK-13261.
Fixes error `org.postgresql.util.PSQLException: Unable to find server array type for provided name decimal(38,18)`.
* Passes scale metadata to JDBC dialect for usage in type conversions.
* Removes unused length/scale/precision parameters from `createArrayOf` parameter `typeName` (for writing).
* Adds configurable precision and scale to Postgres `DecimalType` (for reading).
* Adds a new kind of test that verifies the schema written by `DataFrame.write.jdbc`.
Author: Brandon Bradley <bradleytastic@gmail.com>
Closes#10928 from blbradley/spark-12966.
JIRA: https://issues.apache.org/jira/browse/SPARK-13384
## What changes were proposed in this pull request?
When we de-duplicate attributes in Analyzer, we create new attributes. However, we don't keep original qualifiers. Some plans will be failed to analysed. We should keep original qualifiers in new attributes.
## How was the this patch tested?
Unit test is added.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11261 from viirya/keep-attr-qualifiers.
`rand` and `randn` functions with a `seed` argument are commonly used. Based on the common sense, the results of `rand` and `randn` should be deterministic if the `seed` parameter value is provided. For example, in MS SQL Server, it also has a function `rand`. Regarding the parameter `seed`, the description is like: ```Seed is an integer expression (tinyint, smallint, or int) that gives the seed value. If seed is not specified, the SQL Server Database Engine assigns a seed value at random. For a specified seed value, the result returned is always the same.```
Update: the current implementation is unable to generate deterministic results when the partitions are not fixed. This PR documents this issue in the function descriptions.
jkbradley hit an issue and provided an example in the following JIRA: https://issues.apache.org/jira/browse/SPARK-13333
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11232 from gatorsmile/randSeed.
This PR support codegen for broadcast outer join.
In order to reduce the duplicated codes, this PR merge HashJoin and HashOuterJoin together (also BroadcastHashJoin and BroadcastHashOuterJoin).
Author: Davies Liu <davies@databricks.com>
Closes#11130 from davies/gen_out.
Currently, the columns in projects of Expand that are not used by Aggregate are not pruned, this PR fix that.
Author: Davies Liu <davies@databricks.com>
Closes#11225 from davies/fix_pruning_expand.
Add local ivy repo to the SBT build file to fix this.
Scaladoc compile error is fixed.
Author: jerryshao <sshao@hortonworks.com>
Closes#11001 from jerryshao/SPARK-13109.
`TakeOrderedAndProjectNode` should use generated projection and ordering like other `LocalNode`s.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#11230 from ueshin/issues/SPARK-13357.
Add `LazilyGenerateOrdering` to support generated ordering for `RangePartitioner` of `Exchange` instead of `InterpretedOrdering`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#10894 from ueshin/issues/SPARK-12976.
Using GroupingSets will generate a wrong result when Aggregate Functions containing GroupBy columns.
This PR is to fix it. Since the code changes are very small. Maybe we also can merge it to 1.6
For example, the following query returns a wrong result:
```scala
sql("select course, sum(earnings) as sum from courseSales group by course, earnings" +
" grouping sets((), (course), (course, earnings))" +
" order by course, sum").show()
```
Before the fix, the results are like
```
[null,null]
[Java,null]
[Java,20000.0]
[Java,30000.0]
[dotNET,null]
[dotNET,5000.0]
[dotNET,10000.0]
[dotNET,48000.0]
```
After the fix, the results become correct:
```
[null,113000.0]
[Java,20000.0]
[Java,30000.0]
[Java,50000.0]
[dotNET,5000.0]
[dotNET,10000.0]
[dotNET,48000.0]
[dotNET,63000.0]
```
UPDATE: This PR also deprecated the external column: GROUPING__ID.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11100 from gatorsmile/groupingSets.
This patch adds a new optimizer rule for performing limit pushdown. Limits will now be pushed down in two cases:
- If a limit is on top of a `UNION ALL` operator, then a partition-local limit operator will be pushed to each of the union operator's children.
- If a limit is on top of an `OUTER JOIN` then a partition-local limit will be pushed to one side of the join. For `LEFT OUTER` and `RIGHT OUTER` joins, the limit will be pushed to the left and right side, respectively. For `FULL OUTER` join, we will only push limits when at most one of the inputs is already limited: if one input is limited we will push a smaller limit on top of it and if neither input is limited then we will limit the input which is estimated to be larger.
These optimizations were proposed previously by gatorsmile in #10451 and #10454, but those earlier PRs were closed and deferred for later because at that time Spark's physical `Limit` operator would trigger a full shuffle to perform global limits so there was a chance that pushdowns could actually harm performance by causing additional shuffles/stages. In #7334, we split the `Limit` operator into separate `LocalLimit` and `GlobalLimit` operators, so we can now push down only local limits (which don't require extra shuffles). This patch is based on both of gatorsmile's patches, with changes and simplifications due to partition-local-limiting.
When we push down the limit, we still keep the original limit in place, so we need a mechanism to ensure that the optimizer rule doesn't keep pattern-matching once the limit has been pushed down. In order to handle this, this patch adds a `maxRows` method to `SparkPlan` which returns the maximum number of rows that the plan can compute, then defines the pushdown rules to only push limits to children if the children's maxRows are greater than the limit's maxRows. This idea is carried over from #10451; see that patch for additional discussion.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11121 from JoshRosen/limit-pushdown-2.
The java `Calendar` object is expensive to create. I have a sub query like this `SELECT a, b, c FROM table UV WHERE (datediff(UV.visitDate, '1997-01-01')>=0 AND datediff(UV.visitDate, '2015-01-01')<=0))`
The table stores `visitDate` as String type and has 3 billion records. A `Calendar` object is created every time `DateTimeUtils.stringToDate` is called. By reusing the `Calendar` object, I saw about 20 seconds performance improvement for this stage.
Author: Carson Wang <carson.wang@intel.com>
Closes#11090 from carsonwang/SPARK-13185.
This pull request has the following changes:
1. Moved UserDefinedFunction into expressions package. This is more consistent with how we structure the packages for window functions and UDAFs.
2. Moved UserDefinedPythonFunction into execution.python package, so we don't have a random private class in the top level sql package.
3. Move everything in execution/python.scala into the newly created execution.python package.
Most of the diffs are just straight copy-paste.
Author: Reynold Xin <rxin@databricks.com>
Closes#11181 from rxin/SPARK-13296.
Expand suffer from create the UnsafeRow from same input multiple times, with codegen, it only need to copy some of the columns.
After this, we can see 3X improvements (from 43 seconds to 13 seconds) on a TPCDS query (Q67) that have eight columns in Rollup.
Ideally, we could mask some of the columns based on bitmask, I'd leave that in the future, because currently Aggregation (50 ns) is much slower than that just copy the variables (1-2 ns).
Author: Davies Liu <davies@databricks.com>
Closes#11177 from davies/gen_expand.
https://issues.apache.org/jira/browse/SPARK-13260
This is a quicky fix for `count(*)`.
When the `requiredColumns` is empty, currently it returns `sqlContext.sparkContext.emptyRDD[Row]` which does not have the count.
Just like JSON datasource, this PR lets the CSV datasource count the rows but do not parse each set of tokens.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11169 from HyukjinKwon/SPARK-13260.
Previously we were using Option[String] and None to indicate the case when Spark fails to generate SQL. It is easier to just use exceptions to propagate error cases, rather than having for comprehension everywhere. I also introduced a "build" function that simplifies string concatenation (i.e. no need to reason about whether we have an extra space or not).
Author: Reynold Xin <rxin@databricks.com>
Closes#11171 from rxin/SPARK-13282.
The current implementation of ResolveSortReferences can only push one missing attributes into it's child, it failed to analyze TPCDS Q98, because of there are two missing attributes in that (one from Window, another from Aggregate).
Author: Davies Liu <davies@databricks.com>
Closes#11153 from davies/resolve_sort.
Add the table name validation at the temp table creation
Author: jayadevanmurali <jayadevan.m@tcs.com>
Closes#11051 from jayadevanmurali/branch-0.2-SPARK-12982.
JIRA: https://issues.apache.org/jira/browse/SPARK-13277
There is an ANTLR warning during compilation:
warning(200): org/apache/spark/sql/catalyst/parser/SparkSqlParser.g:938:7:
Decision can match input such as "KW_USING Identifier" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input
This patch is to fix it.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11168 from viirya/fix-parser-using.
In spark-env.sh.template, there are multi-byte characters, this PR will remove it.
Author: Sasaki Toru <sasakitoa@nttdata.co.jp>
Closes#11149 from sasakitoa/remove_multibyte_in_sparkenv.
The parser currently parses the following strings without a hitch:
* Table Identifier:
* `a.b.c` should fail, but results in the following table identifier `a.b`
* `table!#` should fail, but results in the following table identifier `table`
* Expression
* `1+2 r+e` should fail, but results in the following expression `1 + 2`
This PR fixes this by adding terminated rules for both expression parsing and table identifier parsing.
cc cloud-fan (we discussed this in https://github.com/apache/spark/pull/10649) jayadevanmurali (this causes your PR https://github.com/apache/spark/pull/11051 to fail)
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#11159 from hvanhovell/SPARK-13276.