## What changes were proposed in this pull request?
fix typo in DataSourceRegister
## How was this patch tested?
found when going through latest code
Author: Jacky Li <jacky.likun@huawei.com>
Closes#11686 from jackylk/patch-12.
## What changes were proposed in this pull request?
This PR removes two methods, `collectRows()` and `takeRows()`, from `Dataset[T]`. These methods were added in PR #11443, and were later considered not useful.
## How was this patch tested?
Existing tests should do the work.
Author: Cheng Lian <lian@databricks.com>
Closes#11678 from liancheng/remove-collect-rows-and-take-rows.
PR #11443 added an extra `plan: Option[LogicalPlan]` argument to `AnalysisException` and attached partially analyzed plan to thrown `AnalysisException` in `QueryExecution.assertAnalyzed()`. However, the original stack trace wasn't properly inherited. This PR fixes this issue by inheriting the stack trace.
A test case is added to verify that the first entry of `AnalysisException` stack trace isn't from `QueryExecution`.
Author: Cheng Lian <lian@databricks.com>
Closes#11677 from liancheng/analysis-exception-stacktrace.
## What changes were proposed in this pull request?
This PR split the PhysicalRDD into two classes, PhysicalRDD and PhysicalScan. PhysicalRDD is used for DataFrames that is created from existing RDD. PhysicalScan is used for DataFrame that is created from data sources. This enable use to apply different optimization on both of them.
Also fix the problem for sameResult() on two DataSourceScan.
Also fix the equality check to toString for `In`. It's better to use Seq there, but we can't break this public API (sad).
## How was this patch tested?
Existing tests. Manually tested with TPCDS query Q59 and Q64, all those duplicated exchanges can be re-used now, also saw there are 40+% performance improvement (saving half of the scan).
Author: Davies Liu <davies@databricks.com>
Closes#11514 from davies/existing_rdd.
## What changes were proposed in this pull request?
This patch is ported over from viirya's changes in #11048. Currently for most DDLs we just pass the query text directly to Hive. Instead, we should parse these commands ourselves and in the future (not part of this patch) use the `HiveCatalog` to process these DDLs. This is a pretext to merging `SQLContext` and `HiveContext`.
Note: As of this patch we still pass the query text to Hive. The difference is that we now parse the commands ourselves so in the future we can just use our own catalog.
## How was this patch tested?
Jenkins, new `DDLCommandSuite`, which comprises of about 40% of the changes here.
Author: Andrew Or <andrew@databricks.com>
Closes#11573 from andrewor14/parser-plus-plus.
## What changes were proposed in this pull request?
PR #11443 temporarily disabled MiMA check, this PR re-enables it.
One extra change is that `object DataFrame` is also removed. The only purpose of introducing `object DataFrame` was to use it as an internal factory for creating `Dataset[Row]`. By replacing this internal factory with `Dataset.newDataFrame`, both `DataFrame` and `DataFrame$` are entirely removed from the API, so that we can simply put a `MissingClassProblem` filter in `MimaExcludes.scala` for most DataFrame API changes.
## How was this patch tested?
Tested by MiMA check triggered by Jenkins.
Author: Cheng Lian <lian@databricks.com>
Closes#11656 from liancheng/re-enable-mima.
#### What changes were proposed in this pull request?
`projectList` is useless. Its value is always the same as the child.output. Remove it from the class `Window`. Removal can simplify the codes in Analyzer and Optimizer.
This PR is based on the discussion started by cloud-fan in a separate PR:
https://github.com/apache/spark/pull/5604#discussion_r55140466
This PR also eliminates useless `Window`.
cloud-fan yhuai
#### How was this patch tested?
Existing test cases cover it.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#11565 from gatorsmile/removeProjListWindow.
## What changes were proposed in this pull request?
This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and make `DataFrame` a type alias of `Dataset[Row]`.
Most Scala code changes are source compatible, but Java API is broken as Java knows nothing about Scala type alias (mostly replacing `DataFrame` with `Dataset<Row>`).
There are several noticeable API changes related to those returning arrays:
1. `collect`/`take`
- Old APIs in class `DataFrame`:
```scala
def collect(): Array[Row]
def take(n: Int): Array[Row]
```
- New APIs in class `Dataset[T]`:
```scala
def collect(): Array[T]
def take(n: Int): Array[T]
def collectRows(): Array[Row]
def takeRows(n: Int): Array[Row]
```
Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `DataFrame.collect(): Array[T]` actually returns `Object` instead of `Array<T>` from Java side.
Normally, Java users may fall back to `collectAsList` and `takeAsList`. The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here).
1. `randomSplit`
- Old APIs in class `DataFrame`:
```scala
def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]
def randomSplit(weights: Array[Double]): Array[DataFrame]
```
- New APIs in class `Dataset[T]`:
```scala
def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
def randomSplit(weights: Array[Double]): Array[Dataset[T]]
```
Similar problem as above, but hasn't been addressed for Java API yet. We can probably add `randomSplitAsList` to fix this one.
1. `groupBy`
Some original `DataFrame.groupBy` methods have conflicting signature with original `Dataset.groupBy` methods. To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`.
Other noticeable changes:
1. Dataset always do eager analysis now
We used to support disabling DataFrame eager analysis to help reporting partially analyzed malformed logical plan on analysis failure. However, Dataset encoders requires eager analysi during Dataset construction. To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures. This plan is passed by `QueryExecution.assertAnalyzed`.
## How was this patch tested?
Existing tests do the work.
## TODO
- [ ] Fix all tests
- [ ] Re-enable MiMA check
- [ ] Update ScalaDoc (`since`, `group`, and example code)
Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <liancheng@users.noreply.github.com>
Closes#11443 from liancheng/ds-to-df.
## What changes were proposed in this pull request?
Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule to prevent '){' pattern for the following majority pattern and fixes the code accordingly. If we enforce this in ScalaStyle from now, it will improve the Scala code quality and reduce review time.
```
// Correct:
if (true) {
println("Wow!")
}
// Incorrect:
if (true){
println("Wow!")
}
```
IntelliJ also shows new warnings based on this.
## How was this patch tested?
Pass the Jenkins ScalaStyle test.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11637 from dongjoon-hyun/SPARK-3854.
## What changes were proposed in this pull request?
We should reuse an object similar to the other non-primitive type getters. For
a query that computes averages over decimal columns, this shows a 10% speedup
on overall query times.
## How was this patch tested?
Existing tests and this benchmark
```
TPCDS Snappy: Best/Avg Time(ms) Rate(M/s) Per Row(ns)
--------------------------------------------------------------------------------
q27-agg (master) 10627 / 11057 10.8 92.3
q27-agg (this patch) 9722 / 9832 11.8 84.4
```
Author: Nong Li <nong@databricks.com>
Closes#11624 from nongli/spark-13790.
JIRA: https://issues.apache.org/jira/browse/SPARK-13636
## What changes were proposed in this pull request?
As shown in the wholestage codegen verion of Sort operator, when Sort is top of Exchange (or other operator that produce UnsafeRow), we will create variables from UnsafeRow, than create another UnsafeRow using these variables. We should avoid the unnecessary unpack and pack variables from UnsafeRows.
## How was this patch tested?
All existing wholestage codegen tests should be passed.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11484 from viirya/direct-consume-unsaferow.
## What changes were proposed in this pull request?
According to #11627 , this PR replace `DataFrameWriter.stream()` with `startStream()` in comments of `ContinuousQueryListener.java`.
## How was this patch tested?
Manual. (It changes on comments.)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11629 from dongjoon-hyun/minor_rename.
## What changes were proposed in this pull request?
The new name makes it more obvious with the verb "start" that we are actually starting some execution.
## How was this patch tested?
This is just a rename. Existing unit tests should cover it.
Author: Reynold Xin <rxin@databricks.com>
Closes#11627 from rxin/SPARK-13794.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13766
This PR makes the file extensions (written by internal datasource) consistent.
**Before**
- TEXT, CSV and JSON
```
[.COMPRESSION_CODEC_NAME]
```
- Parquet
```
[.COMPRESSION_CODEC_NAME].parquet
```
- ORC
```
.orc
```
**After**
- TEXT, CSV and JSON
```
.txt[.COMPRESSION_CODEC_NAME]
.csv[.COMPRESSION_CODEC_NAME]
.json[.COMPRESSION_CODEC_NAME]
```
- Parquet
```
[.COMPRESSION_CODEC_NAME].parquet
```
- ORC
```
[.COMPRESSION_CODEC_NAME].orc
```
When the compression codec is set,
- For Parquet and ORC, each still stays in Parquet and ORC format but just have compressed data internally. So, I think it is okay to name `.parquet` and `.orc` at the end.
- For Text, CSV and JSON, each does not stays in each format but it has different data format according to compression codec. So, each has the names `.json`, `.csv` and `.txt` before the compression extension.
## How was this patch tested?
Unit tests are used and `./dev/run_tests` for coding style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11604 from HyukjinKwon/SPARK-13766.
#### What changes were proposed in this pull request?
Remove all the deterministic conditions in a [[Filter]] that are contained in the Child's Constraints.
For example, the first query can be simplified to the second one.
```scala
val queryWithUselessFilter = tr1
.where("tr1.a".attr > 10 || "tr1.c".attr < 10)
.join(tr2.where('d.attr < 100), Inner, Some("tr1.a".attr === "tr2.a".attr))
.where(
("tr1.a".attr > 10 || "tr1.c".attr < 10) &&
'd.attr < 100 &&
"tr2.a".attr === "tr1.a".attr)
```
```scala
val query = tr1
.where("tr1.a".attr > 10 || "tr1.c".attr < 10)
.join(tr2.where('d.attr < 100), Inner, Some("tr1.a".attr === "tr2.a".attr))
```
#### How was this patch tested?
Six test cases are added.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11406 from gatorsmile/FilterRemoval.
## What changes were proposed in this pull request?
It’s possible to have common parts in a query, for example, self join, it will be good to avoid the duplicated part to same CPUs and memory (Broadcast or cache).
Exchange will materialize the underlying RDD by shuffle or collect, it’s a great point to check duplicates and reuse them. Duplicated exchanges means they generate exactly the same result inside a query.
In order to find out the duplicated exchanges, we should be able to compare SparkPlan to check that they have same results or not. We already have that for LogicalPlan, so we should move that into QueryPlan to make it available for SparkPlan.
Once we can find the duplicated exchanges, we should replace all of them with same SparkPlan object (could be wrapped by ReusedExchage for explain), then the plan tree become a DAG. Since all the planner only work with tree, so this rule should be the last one for the entire planning.
After the rule, the plan will looks like:
```
WholeStageCodegen
: +- Project [id#0L]
: +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None
: :- Project [id#0L]
: : +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None
: : :- Range 0, 1, 4, 1024, [id#0L]
: : +- INPUT
: +- INPUT
:- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
: +- WholeStageCodegen
: : +- Range 0, 1, 4, 1024, [id#1L]
+- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
```
![bjoin](https://cloud.githubusercontent.com/assets/40902/13414787/209e8c5c-df0a-11e5-8a0f-edff69d89e83.png)
For three ways SortMergeJoin,
```
== Physical Plan ==
WholeStageCodegen
: +- Project [id#0L]
: +- SortMergeJoin [id#0L], [id#4L], None
: :- INPUT
: +- INPUT
:- WholeStageCodegen
: : +- Project [id#0L]
: : +- SortMergeJoin [id#0L], [id#3L], None
: : :- INPUT
: : +- INPUT
: :- WholeStageCodegen
: : : +- Sort [id#0L ASC], false, 0
: : : +- INPUT
: : +- Exchange hashpartitioning(id#0L, 200), None
: : +- WholeStageCodegen
: : : +- Range 0, 1, 4, 33554432, [id#0L]
: +- WholeStageCodegen
: : +- Sort [id#3L ASC], false, 0
: : +- INPUT
: +- ReusedExchange [id#3L], Exchange hashpartitioning(id#0L, 200), None
+- WholeStageCodegen
: +- Sort [id#4L ASC], false, 0
: +- INPUT
+- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200), None
```
![sjoin](https://cloud.githubusercontent.com/assets/40902/13414790/27aea61c-df0a-11e5-8cbf-fbc985c31d95.png)
If the same ShuffleExchange or BroadcastExchange, execute()/executeBroadcast() will be called by different parents, they should cached the RDD/Broadcast, return the same one for all the parents.
## How was this patch tested?
Added some unit tests for this. Had done some manual tests on TPCDS query Q59 and Q64, we can see some exchanges are re-used (this requires a change in PhysicalRDD to for sameResult, is be done in #11514 ).
Author: Davies Liu <davies@databricks.com>
Closes#11403 from davies/dedup.
## What changes were proposed in this pull request?
If there are many branches in a CaseWhen expression, the generated code could go above the 64K limit for single java method, will fail to compile. This PR change it to fallback to interpret mode if there are more than 20 branches.
This PR is based on #11243 and #11221, thanks to joehalliwell
Closes#11243Closes#11221
## How was this patch tested?
Add a test with 50 branches.
Author: Davies Liu <davies@databricks.com>
Closes#11592 from davies/fix_when.
## What changes were proposed in this pull request?
In order to make `docs/examples` (and other related code) more simple/readable/user-friendly, this PR replaces existing codes like the followings by using `diamond` operator.
```
- final ArrayList<Product2<Object, Object>> dataToWrite =
- new ArrayList<Product2<Object, Object>>();
+ final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
```
Java 7 or higher supports **diamond** operator which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark Java code use mixed usage of this.
## How was this patch tested?
Manual.
Pass the existing tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11541 from dongjoon-hyun/SPARK-13702.
## What changes were proposed in this pull request?
This issue fixes the following potential bugs and Java coding style detected by Coverity and Checkstyle.
- Implement both null and type checking in equals functions.
- Fix wrong type casting logic in SimpleJavaBean2.equals.
- Add `implement Cloneable` to `UTF8String` and `SortedIterator`.
- Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
- Fix coding style: Add '{}' to single `for` statement in mllib examples.
- Remove unused imports in `ColumnarBatch` and `JavaKinesisStreamSuite`.
- Remove unused fields in `ChunkFetchIntegrationSuite`.
- Add `stop()` to prevent resource leak.
Please note that the last two checkstyle errors exist on newly added commits after [SPARK-13583](https://issues.apache.org/jira/browse/SPARK-13583).
## How was this patch tested?
manual via `./dev/lint-java` and Coverity site.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11530 from dongjoon-hyun/SPARK-13692.
This PR replaces #9925 which had issues with CI. **Please see the original PR for any previous discussions.**
## What changes were proposed in this pull request?
Deprecate the SparkSQL column operator !== and use =!= as an alternative.
Fixes subtle issues related to operator precedence (basically, !== does not have the same priority as its logical negation, ===).
## How was this patch tested?
All currently existing tests.
Author: Jakob Odersky <jodersky@gmail.com>
Closes#11588 from jodersky/SPARK-7286.
## Motivation
CSV data source was contributed by Databricks. It is the inlined version of https://github.com/databricks/spark-csv. The data source name was `com.databricks.spark.csv`. As a result there are many tables created on older versions of spark with that name as the source. For backwards compatibility we should keep the old name.
## Proposed changes
`com.databricks.spark.csv` was added to list of `backwardCompatibilityMap` in `ResolvedDataSource.scala`
## Tests
A unit test was added to `CSVSuite` to parse a csv file using the old name.
Author: Hossein <hossein@databricks.com>
Closes#11589 from falaki/SPARK-13754.
## What changes were proposed in this pull request?
This PR fix the sizeInBytes of HadoopFsRelation.
## How was this patch tested?
Added regression test for that.
Author: Davies Liu <davies@databricks.com>
Closes#11590 from davies/fix_sizeInBytes.
When generating Graphviz DOT files in the SQL query visualization we need to escape double-quotes inside node labels. This is a followup to #11309, which fixed a similar graph in Spark Core's DAG visualization.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11587 from JoshRosen/graphviz-escaping.
## What changes were proposed in this pull request?
If a filter predicate or a join condition consists of `IsNotNull` checks, we should reorder these checks such that these non-nullability checks are evaluated before the rest of the predicates.
For e.g., if a filter predicate is of the form `a > 5 && isNotNull(b)`, we should rewrite this as `isNotNull(b) && a > 5` during physical plan generation.
## How was this patch tested?
new unit tests that verify the physical plan for both filters and joins in `ReorderedPredicateSuite`
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11511 from sameeragarwal/reorder-isnotnull.
Follow-up to #11509, that simply refactors the interface that we use when resolving a pluggable `DataSource`.
- Multiple functions share the same set of arguments so we make this a case class, called `DataSource`. Actual resolution is now done by calling a function on this class.
- Instead of having multiple methods named `apply` (some of which do writing some of which do reading) we now explicitly have `resolveRelation()` and `write(mode, df)`.
- Get rid of `Array[String]` since this is an internal API and was forcing us to awkwardly call `toArray` in a bunch of places.
Author: Michael Armbrust <michael@databricks.com>
Closes#11572 from marmbrus/dataSourceResolution.
## What changes were proposed in this pull request?
This PR change the way how we generate the code for the output variables passing from a plan to it's parent.
Right now, they are generated before call consume() of it's parent. It's not efficient, if the parent is a Filter or Join, which could filter out most the rows, the time to access some of the columns that are not used by the Filter or Join are wasted.
This PR try to improve this by defering the access of columns until they are actually used by a plan. After this PR, a plan does not need to generate code to evaluate the variables for output, just passing the ExprCode to its parent by `consume()`. In `parent.consumeChild()`, it will check the output from child and `usedInputs`, generate the code for those columns that is part of `usedInputs` before calling `doConsume()`.
This PR also change the `if` from
```
if (cond) {
xxx
}
```
to
```
if (!cond) continue;
xxx
```
The new one could help to reduce the nested indents for multiple levels of Filter and BroadcastHashJoin.
It also added some comments for operators.
## How was the this patch tested?
Unit tests. Manually ran TPCDS Q55, this PR improve the performance about 30% (scale=10, from 2.56s to 1.96s)
Author: Davies Liu <davies@databricks.com>
Closes#11274 from davies/gen_defer.
## What changes were proposed in this pull request?
When we add more DDL parsing logic in the future, SparkQl will become very big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to parse alter table commands. However, these parser objects will need to access some helper methods that exist in CatalystQl. The proposal is to move those methods to an isolated ParserUtils object.
This is based on viirya's changes in #11048. It prefaces the bigger fix for SPARK-13139 to make the diff of that patch smaller.
## How was this patch tested?
No change in functionality, so just Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes#11529 from andrewor14/parser-utils.
`HadoopFsRelation` is used for reading most files into Spark SQL. However today this class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data. As a result, many data sources are forced to reimplement the same functionality and the various layers have accumulated a fair bit of inefficiency. This PR is a first cut at separating this into several components / interfaces that are each described below. Additionally, all implementations inside of Spark (parquet, csv, json, text, orc, svmlib) have been ported to the new API `FileFormat`. External libraries, such as spark-avro will also need to be ported to work with Spark 2.0.
### HadoopFsRelation
A simple `case class` that acts as a container for all of the metadata required to read from a datasource. All discovery, resolution and merging logic for schemas and partitions has been removed. This an internal representation that no longer needs to be exposed to developers.
```scala
case class HadoopFsRelation(
sqlContext: SQLContext,
location: FileCatalog,
partitionSchema: StructType,
dataSchema: StructType,
bucketSpec: Option[BucketSpec],
fileFormat: FileFormat,
options: Map[String, String]) extends BaseRelation
```
### FileFormat
The primary interface that will be implemented by each different format including external libraries. Implementors are responsible for reading a given format and converting it into `InternalRow` as well as writing out an `InternalRow`. A format can optionally return a schema that is inferred from a set of files.
```scala
trait FileFormat {
def inferSchema(
sqlContext: SQLContext,
options: Map[String, String],
files: Seq[FileStatus]): Option[StructType]
def prepareWrite(
sqlContext: SQLContext,
job: Job,
options: Map[String, String],
dataSchema: StructType): OutputWriterFactory
def buildInternalScan(
sqlContext: SQLContext,
dataSchema: StructType,
requiredColumns: Array[String],
filters: Array[Filter],
bucketSet: Option[BitSet],
inputFiles: Array[FileStatus],
broadcastedConf: Broadcast[SerializableConfiguration],
options: Map[String, String]): RDD[InternalRow]
}
```
The current interface is based on what was required to get all the tests passing again, but still mixes a couple of concerns (i.e. `bucketSet` is passed down to the scan instead of being resolved by the planner). Additionally, scans are still returning `RDD`s instead of iterators for single files. In a future PR, bucketing should be removed from this interface and the scan should be isolated to a single file.
### FileCatalog
This interface is used to list the files that make up a given relation, as well as handle directory based partitioning.
```scala
trait FileCatalog {
def paths: Seq[Path]
def partitionSpec(schema: Option[StructType]): PartitionSpec
def allFiles(): Seq[FileStatus]
def getStatus(path: Path): Array[FileStatus]
def refresh(): Unit
}
```
Currently there are two implementations:
- `HDFSFileCatalog` - based on code from the old `HadoopFsRelation`. Infers partitioning by recursive listing and caches this data for performance
- `HiveFileCatalog` - based on the above, but it uses the partition spec from the Hive Metastore.
### ResolvedDataSource
Produces a logical plan given the following description of a Data Source (which can come from DataFrameReader or a metastore):
- `paths: Seq[String] = Nil`
- `userSpecifiedSchema: Option[StructType] = None`
- `partitionColumns: Array[String] = Array.empty`
- `bucketSpec: Option[BucketSpec] = None`
- `provider: String`
- `options: Map[String, String]`
This class is responsible for deciding which of the Data Source APIs a given provider is using (including the non-file based ones). All reconciliation of partitions, buckets, schema from metastores or inference is done here.
### DataSourceAnalysis / DataSourceStrategy
Responsible for analyzing and planning reading/writing of data using any of the Data Source APIs, including:
- pruning the files from partitions that will be read based on filters.
- appending partition columns*
- applying additional filters when a data source can not evaluate them internally.
- constructing an RDD that is bucketed correctly when required*
- sanity checking schema match-up and other analysis when writing.
*In the future we should do that following:
- Break out file handling into its own Strategy as its sufficiently complex / isolated.
- Push the appending of partition columns down in to `FileFormat` to avoid an extra copy / unvectorization.
- Use a custom RDD for scans instead of `SQLNewNewHadoopRDD2`
Author: Michael Armbrust <michael@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11509 from marmbrus/fileDataSource.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13442
This PR adds the support for inferring `BooleanType` for schema.
It supports to infer case-insensitive `true` / `false` as `BooleanType`.
Unittests were added for `CSVInferSchemaSuite` and `CSVSuite` for end-to-end test.
## How was the this patch tested?
This was tested with unittests and with `dev/run_tests` for coding style
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11315 from HyukjinKwon/SPARK-13442.
## What changes were proposed in this pull request?
It's weird that expressions don't always have all the expressions in it. This PR marks `QueryPlan.expressions` final to forbid sub classes overriding it to exclude some expressions. Currently only `Generate` override it, we can use `producedAttributes` to fix the unresolved attribute problem for it.
Note that this PR doesn't fix the problem in #11497
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11532 from cloud-fan/generate.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
Currently, the parquet reader returns rows one by one which is bad for performance. This patch
updates the reader to directly return ColumnarBatches. This is only enabled with whole stage
codegen, which is the only operator currently that is able to consume ColumnarBatches (instead
of rows). The current implementation is a bit of a hack to get this to work and we should do
more refactoring of these low level interfaces to make this work better.
## How was this patch tested?
```
Results:
TPCDS: Best/Avg Time(ms) Rate(M/s) Per Row(ns)
---------------------------------------------------------------------------------
q55 (before) 8897 / 9265 12.9 77.2
q55 5486 / 5753 21.0 47.6
```
Author: Nong Li <nong@databricks.com>
Closes#11435 from nongli/spark-13255.
## What changes were proposed in this pull request?
This patch simply moves things to existing package `o.a.s.sql.catalyst.parser` in an effort to reduce the size of the diff in #11048. This is conceptually the same as a recently merged patch #11482.
## How was this patch tested?
Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes#11506 from andrewor14/parser-package.
## What changes were proposed in this pull request?
This PR support visualization for subquery in SQL web UI, also improve the explain of subquery, especially when it's used together with whole stage codegen.
For example:
```python
>>> sqlContext.range(100).registerTempTable("range")
>>> sqlContext.sql("select id / (select sum(id) from range) from range where id > (select id from range limit 1)").explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias(('id / subquery#9), None)]
: +- 'SubqueryAlias subquery#9
: +- 'Project [unresolvedalias('sum('id), None)]
: +- 'UnresolvedRelation `range`, None
+- 'Filter ('id > subquery#8)
: +- 'SubqueryAlias subquery#8
: +- 'GlobalLimit 1
: +- 'LocalLimit 1
: +- 'Project [unresolvedalias('id, None)]
: +- 'UnresolvedRelation `range`, None
+- 'UnresolvedRelation `range`, None
== Analyzed Logical Plan ==
(id / scalarsubquery()): double
Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
: +- SubqueryAlias subquery#9
: +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS sum(id)#10L]
: +- SubqueryAlias range
: +- Range 0, 100, 1, 4, [id#0L]
+- Filter (id#0L > subquery#8)
: +- SubqueryAlias subquery#8
: +- GlobalLimit 1
: +- LocalLimit 1
: +- Project [id#0L]
: +- SubqueryAlias range
: +- Range 0, 100, 1, 4, [id#0L]
+- SubqueryAlias range
+- Range 0, 100, 1, 4, [id#0L]
== Optimized Logical Plan ==
Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
: +- SubqueryAlias subquery#9
: +- Aggregate [(sum(id#0L),mode=Complete,isDistinct=false) AS sum(id)#10L]
: +- Range 0, 100, 1, 4, [id#0L]
+- Filter (id#0L > subquery#8)
: +- SubqueryAlias subquery#8
: +- GlobalLimit 1
: +- LocalLimit 1
: +- Project [id#0L]
: +- Range 0, 100, 1, 4, [id#0L]
+- Range 0, 100, 1, 4, [id#0L]
== Physical Plan ==
WholeStageCodegen
: +- Project [(cast(id#0L as double) / cast(subquery#9 as double)) AS (id / scalarsubquery())#11]
: : +- Subquery subquery#9
: : +- WholeStageCodegen
: : : +- TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Final,isDistinct=false)], output=[sum(id)#10L])
: : : +- INPUT
: : +- Exchange SinglePartition, None
: : +- WholeStageCodegen
: : : +- TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Partial,isDistinct=false)], output=[sum#14L])
: : : +- Range 0, 1, 4, 100, [id#0L]
: +- Filter (id#0L > subquery#8)
: : +- Subquery subquery#8
: : +- CollectLimit 1
: : +- WholeStageCodegen
: : : +- Project [id#0L]
: : : +- Range 0, 1, 4, 100, [id#0L]
: +- Range 0, 1, 4, 100, [id#0L]
```
The web UI looks like:
![subquery](https://cloud.githubusercontent.com/assets/40902/13377963/932bcbae-dda7-11e5-82f7-03c9be85d77c.png)
This PR also change the tree structure of WholeStageCodegen to make it consistent than others. Before this change, Both WholeStageCodegen and InputAdapter hold a references to the same plans, those could be updated without notify another, causing problems, this is discovered by #11403 .
## How was this patch tested?
Existing tests, also manual tests with the example query, check the explain and web UI.
Author: Davies Liu <davies@databricks.com>
Closes#11417 from davies/viz_subquery.
## What changes were proposed in this pull request?
This patch simply moves things to a new package in an effort to reduce the size of the diff in #11048. Currently the new package only has one file, but in the future we'll add many new commands in SPARK-13139.
## How was this patch tested?
Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes#11482 from andrewor14/commands-package.
## What changes were proposed in this pull request?
This PR fixes typos in comments and testcase name of code.
## How was this patch tested?
manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11481 from dongjoon-hyun/minor_fix_typos_in_code.
## What changes were proposed in this pull request?
This PR adds the support to specify compression codecs for both ORC and Parquet.
## How was this patch tested?
unittests within IDE and code style tests with `dev/run_tests`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11464 from HyukjinKwon/SPARK-13543.
## What changes were proposed in this pull request?
Fixes compile problem due to inadvertent use of `Option.contains`, only in Scala 2.11. The change should have been to replace `Option.exists(_ == x)` with `== Some(x)`. Replacing exists with contains only makes sense for collections. Replacing use of `Option.exists` still makes sense though as it's misleading.
## How was this patch tested?
Jenkins tests / compilation
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: Sean Owen <sowen@cloudera.com>
Closes#11493 from srowen/SPARK-13423.2.
## What changes were proposed in this pull request?
After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
This issue aims remove unused imports from Java/Scala code and add `UnusedImports` checkstyle rule to help developers.
## How was this patch tested?
```
./dev/lint-java
./build/sbt compile
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11438 from dongjoon-hyun/SPARK-13583.
## What changes were proposed in this pull request?
Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:
- Inner class should be static
- Mismatched hashCode/equals
- Overflow in compareTo
- Unchecked warnings
- Misuse of assert, vs junit.assert
- get(a) + getOrElse(b) -> getOrElse(a,b)
- Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
- Dead code
- tailrec
- exists(_ == ) -> contains find + nonEmpty -> exists filter + size -> count
- reduce(_+_) -> sum map + flatten -> map
The most controversial may be .size -> .length simply because of its size. It is intended to avoid implicits that might be expensive in some places.
## How was the this patch tested?
Existing Jenkins unit tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#11292 from srowen/SPARK-13423.
## What changes were proposed in this pull request?
This pr to make the short names of compression codecs in `ParquetRelation` consistent against other ones. This pr comes from #11324.
## How was this patch tested?
Add more tests in `TextSuite`.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#11408 from maropu/SPARK-13528.
## What changes were proposed in this pull request?
In order to tell OutputStream that the task has failed or not, we should call the failure callbacks BEFORE calling writer.close().
## How was this patch tested?
Added new unit tests.
Author: Davies Liu <davies@databricks.com>
Closes#11450 from davies/callback.
Rows with null values in partition column are not included in the results because none of the partition
where clause specify is null predicate on the partition column. This fix adds is null predicate on the partition column to the first JDBC partition where clause.
Example:
JDBCPartition(THEID < 1 or THEID is null, 0),JDBCPartition(THEID >= 1 AND THEID < 2,1),
JDBCPartition(THEID >= 2, 2)
Author: sureshthalamati <suresh.thalamati@gmail.com>
Closes#11063 from sureshthalamati/nullable_jdbc_part_col_spark-13167.
## What changes were proposed in this pull request?
Broadcast left semi join without joining keys is already supported in BroadcastNestedLoopJoin, it has the same implementation as LeftSemiJoinBNL, we should remove that.
## How was this patch tested?
Updated unit tests.
Author: Davies Liu <davies@databricks.com>
Closes#11448 from davies/remove_bnl.
## What changes were proposed in this pull request?
This PR defer the resolution from a id of dictionary to value until the column is actually accessed (inside getInt/getLong), this is very useful for those columns and rows that are filtered out. It's also useful for binary type, we will not need to copy all the byte arrays.
This PR also change the underlying type for small decimal that could be fit within a Int, in order to use getInt() to lookup the value from IntDictionary.
## How was this patch tested?
Manually test TPCDS Q7 with scale factor 10, saw about 30% improvements (after PR #11274).
Author: Davies Liu <davies@databricks.com>
Closes#11437 from davies/decode_dict.
JIRA: https://issues.apache.org/jira/browse/SPARK-13511
## What changes were proposed in this pull request?
Current limit operator doesn't support wholestage codegen. This is open to add support for it.
In the `doConsume` of `GlobalLimit` and `LocalLimit`, we use a count term to count the processed rows. Once the row numbers catches the limit number, we set the variable `stopEarly` of `BufferedRowIterator` newly added in this pr to `true` that indicates we want to stop processing remaining rows. Then when the wholestage codegen framework checks `shouldStop()`, it will stop the processing of the row iterator.
Before this, the executed plan for a query `sqlContext.range(N).limit(100).groupBy().sum()` is:
TungstenAggregate(key=[], functions=[(sum(id#5L),mode=Final,isDistinct=false)], output=[sum(id)#6L])
+- TungstenAggregate(key=[], functions=[(sum(id#5L),mode=Partial,isDistinct=false)], output=[sum#9L])
+- GlobalLimit 100
+- Exchange SinglePartition, None
+- LocalLimit 100
+- Range 0, 1, 1, 524288000, [id#5L]
After add wholestage codegen support:
WholeStageCodegen
: +- TungstenAggregate(key=[], functions=[(sum(id#40L),mode=Final,isDistinct=false)], output=[sum(id)#41L])
: +- TungstenAggregate(key=[], functions=[(sum(id#40L),mode=Partial,isDistinct=false)], output=[sum#44L])
: +- GlobalLimit 100
: +- INPUT
+- Exchange SinglePartition, None
+- WholeStageCodegen
: +- LocalLimit 100
: +- Range 0, 1, 1, 524288000, [id#40L]
## How was this patch tested?
A test is added into BenchmarkWholeStageCodegen.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11391 from viirya/wholestage-limit.
## What changes were proposed in this pull request?
This PR adds support for implementing whole state codegen for sort. Builds heaving on nongli 's PR: https://github.com/apache/spark/pull/11008 (which actually implements the feature), and adds the following changes on top:
- [x] Generated code updates peak execution memory metrics
- [x] Unit tests in `WholeStageCodegenSuite` and `SQLMetricsSuite`
## How was this patch tested?
New unit tests in `WholeStageCodegenSuite` and `SQLMetricsSuite`. Further, all existing sort tests should pass.
Author: Sameer Agarwal <sameer@databricks.com>
Author: Nong Li <nong@databricks.com>
Closes#11359 from sameeragarwal/sort-codegen.
https://issues.apache.org/jira/browse/SPARK-13507https://issues.apache.org/jira/browse/SPARK-13509
## What changes were proposed in this pull request?
This PR adds the support to write CSV data directly by a single call to the given path.
Several unitests were added for each functionality.
## How was this patch tested?
This was tested with unittests and with `dev/run_tests` for coding style
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Closes#11389 from HyukjinKwon/SPARK-13507-13509.