#### What changes were proposed in this pull request?
This PR is to address a few existing issues in `EXPLAIN`:
- The `EXPLAIN` options `LOGICAL | FORMATTED | EXTENDED | CODEGEN` should not be 0 or more match. It should 0 or one match. Parser does not allow users to use more than one option in a single command.
- The option `LOGICAL` is not supported. Issue an exception when users specify this option in the command.
- The output of `EXPLAIN ` contains a weird empty line when the output of analyzed plan is empty. We should remove it. For example:
```
== Parsed Logical Plan ==
CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
== Analyzed Logical Plan ==
CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
== Optimized Logical Plan ==
CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
...
```
#### How was this patch tested?
Added and modified a few test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12991 from gatorsmile/explainCreateTable.
## What changes were proposed in this pull request?
Our case sensitivity support is different from what ANSI SQL standards support. Postgres' behavior is that if an identifier is quoted, then it is treated as case sensitive; otherwise it is folded to lowercase. We will likely need to revisit this in the future and change our behavior. For now, the safest change to do for Spark 2.0 is to make the case sensitive option internal and discourage users from turning it on, effectively making Spark always case insensitive.
## How was this patch tested?
N/A - a small config documentation change.
Author: Reynold Xin <rxin@databricks.com>
Closes#13011 from rxin/SPARK-15229.
## What changes were proposed in this pull request?
Before:
```
scala> spark.catalog.listDatabases.show()
+--------------------+-----------+-----------+
| name|description|locationUri|
+--------------------+-----------+-----------+
|Database[name='de...|
|Database[name='my...|
|Database[name='so...|
+--------------------+-----------+-----------+
```
After:
```
+-------+--------------------+--------------------+
| name| description| locationUri|
+-------+--------------------+--------------------+
|default|Default Hive data...|file:/user/hive/w...|
| my_db| This is a database|file:/Users/andre...|
|some_db| |file:/private/var...|
+-------+--------------------+--------------------+
```
## How was this patch tested?
New test in `CatalogSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#13015 from andrewor14/catalog-show.
## What changes were proposed in this pull request?
The issue is that when the user provides the path option with uppercase "PATH" key, `options` contains `PATH` key and will get into the non-external case in the following code in `createDataSourceTables.scala`, where a new key "path" is created with a default path.
```
val optionsWithPath =
if (!options.contains("path")) {
isExternal = false
options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
} else {
options
}
```
So before creating hive table, serdeInfo.parameters will contain both "PATH" and "path" keys and different directories. and Hive table's dataLocation contains the value of "path".
The fix in this PR is to convert `options` in the code above to `CaseInsensitiveMap` before checking for containing "path" key.
## How was this patch tested?
A testcase is added
Author: xin Wu <xinwu@us.ibm.com>
Closes#12804 from xwu0226/SPARK-15025.
This patch improves the performance of `InferSchema.compatibleType` and `inferField`. The net result of this patch is a 6x speedup in local benchmarks running against cached data with a massive nested schema.
The key idea is to remove unnecessary sorting in `compatibleType`'s `StructType` merging code. This code takes two structs, merges the fields with matching names, and copies over the unique fields, producing a new schema which is the union of the two structs' schemas. Previously, this code performed a very inefficient `groupBy()` to match up fields with the same name, but this is unnecessary because `inferField` already sorts structs' fields by name: since both lists of fields are sorted, we can simply merge them in a single pass.
This patch also speeds up the existing field sorting in `inferField`: the old sorting code allocated unnecessary intermediate collections, while the new code uses mutable collects and performs in-place sorting.
I rewrote inefficient `equals()` implementations in `StructType` and `Metadata`, significantly reducing object allocations in those methods.
Finally, I replaced a `treeAggregate` call with `fold`: I doubt that `treeAggregate` will benefit us very much because the schemas would have to be enormous to realize large savings in network traffic. Since most schemas are probably fairly small in serialized form, they should typically fit within a direct task result and therefore can be incrementally merged at the driver as individual tasks finish. This change eliminates an entire (short) scheduler stage.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#12750 from JoshRosen/schema-inference-speedups.
When we parse `CREATE TABLE USING`, we should build a `CreateTableUsing` plan with the `managedIfNoPath` set to true. Then we will add default table path to options when write it to hive.
new test in `SQLQuerySuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12949 from cloud-fan/bug.
## What changes were proposed in this pull request?
This also simplifies the code being moved.
## How was this patch tested?
Existing tests.
Author: Andrew Or <andrew@databricks.com>
Closes#12941 from andrewor14/move-code.
Enhance the exception message when `checkpointLocation` is not set, previously the message is:
```
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.sql.DataFrameWriter$$anonfun$8.apply(DataFrameWriter.scala:338)
at org.apache.spark.sql.DataFrameWriter$$anonfun$8.apply(DataFrameWriter.scala:338)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:337)
at org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:277)
... 48 elided
```
This is not so meaningful, so changing to make it more specific.
Local verified.
Author: jerryshao <sshao@hortonworks.com>
Closes#12998 from jerryshao/improve-exception-message.
## What changes were proposed in this pull request?
This is a follow-up of PR #12844. It makes the newly updated `DescribeTableCommand` to support data sources tables.
## How was this patch tested?
A test case is added to check `DESC [EXTENDED | FORMATTED] <table>` output.
Author: Cheng Lian <lian@databricks.com>
Closes#12934 from liancheng/spark-14127-desc-table-follow-up.
#### What changes were proposed in this pull request?
As Hive and the major RDBMS behave, the built-in functions are not allowed to drop. In the current implementation, users can drop the built-in functions. However, after dropping the built-in functions, users are unable to add them back.
#### How was this patch tested?
Added a test case.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12975 from gatorsmile/dropBuildInFunction.
## What changes were proposed in this pull request?
following operations have file system operation now:
1. CREATE DATABASE: create a dir
2. DROP DATABASE: delete the dir
3. CREATE TABLE: create a dir
4. DROP TABLE: delete the dir
5. RENAME TABLE: rename the dir
6. CREATE PARTITIONS: create a dir
7. RENAME PARTITIONS: rename the dir
8. DROP PARTITIONS: drop the dir
## How was this patch tested?
new tests in `ExternalCatalogSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12871 from cloud-fan/catalog.
## What changes were proposed in this pull request?
Currently when we create an alias against a TypedColumn from user-defined Aggregator(for example: agg(aggSum.toColumn as "a")), spark is using the alias' function from Column( as), the alias function will return a column contains a TypedAggregateExpression, which is unresolved because the inputDeserializer is not defined. Later the aggregator function (agg) will inject the inputDeserializer back to the TypedAggregateExpression, but only if the aggregate columns are TypedColumn, in the above case, the TypedAggregateExpression will remain unresolved because it is under column and caused the
problem reported by this jira [15051](https://issues.apache.org/jira/browse/SPARK-15051?jql=project%20%3D%20SPARK).
This PR propose to create an alias function for TypedColumn, it will return a TypedColumn. It is using the similar code path as Column's alia function.
For the spark build in aggregate function, like max, it is working with alias, for example
val df1 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j")
checkAnswer(df1.agg(max("j") as "b"), Row(3) :: Nil)
Thanks for comments.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Add test cases in DatasetAggregatorSuite.scala
run the sql related queries against this patch.
Author: Kevin Yu <qyu@us.ibm.com>
Closes#12893 from kevinyu98/spark-15051.
## What changes were proposed in this pull request?
Lets says there are json files in the following directories structure
```
xyz/file0.json
xyz/subdir1/file1.json
xyz/subdir2/file2.json
xyz/subdir1/subsubdir1/file3.json
```
`sqlContext.read.json("xyz")` should read only file0.json according to behavior in Spark 1.6.1. However in current master, all the 4 files are read.
The fix is to make FileCatalog return only the children files of the given path if there is not partitioning detected (instead of all the recursive list of files).
Closes#12774
## How was this patch tested?
unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12856 from tdas/SPARK-14997.
#### What changes were proposed in this pull request?
When Describe a UDTF, the command returns a wrong result. The command is unable to find the function, which has been created and cataloged in the catalog but not in the functionRegistry.
This PR is to correct it. If the function is not in the functionRegistry, we will check the catalog for collecting the information of the UDTF function.
#### How was this patch tested?
Added test cases to verify the results
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12885 from gatorsmile/showFunction.
## What changes were proposed in this pull request?
Minor doc and code style fixes
## How was this patch tested?
local build
Author: Jacek Laskowski <jacek@japila.pl>
Closes#12928 from jaceklaskowski/SPARK-15152.
## What changes were proposed in this pull request?
This issue addresses the comments in SPARK-15031 and also fix java-linter errors.
- Use multiline format in SparkSession builder patterns.
- Update `binary_classification_metrics_example.py` to use `SparkSession`.
- Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far)
## How was this patch tested?
After passing the Jenkins tests and run `dev/lint-java` manually.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12911 from dongjoon-hyun/SPARK-15134.
## What changes were proposed in this pull request?
Went through SparkSession and its members and fixed non-thread-safe classes used by SparkSession
## How was this patch tested?
Existing unit tests
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12915 from zsxwing/spark-session-thread-safe.
## What changes were proposed in this pull request?
Removing the `withHiveSupport` method of `SparkSession`, instead use `enableHiveSupport`
## How was this patch tested?
ran tests locally
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#12851 from techaddict/SPARK-15072.
#### What changes were proposed in this pull request?
First, a few test cases failed in mac OS X because the property value of `java.io.tmpdir` does not include a trailing slash on some platform. Hive always removes the last trailing slash. For example, what I got in the web:
```
Win NT --> C:\TEMP\
Win XP --> C:\TEMP
Solaris --> /var/tmp/
Linux --> /var/tmp
```
Second, a couple of test cases are added to verify if the commands work properly.
#### How was this patch tested?
Added a test case for it and correct the previous test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12081 from gatorsmile/mkdir.
## What changes were proposed in this pull request?
Implement repartitionByColumn on DataFrame.
This will allow us to run R functions on each partition identified by column groups with dapply() method.
## How was this patch tested?
Unit tests
Author: NarineK <narine.kokhlikyan@us.ibm.com>
Closes#12887 from NarineK/repartitionByColumns.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-15148
Mainly it improves the performance roughtly about 30%-40% according to the [release note](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.1.0). For the details of the purpose is described in the JIRA.
This PR upgrades Univocity library from 2.0.2 to 2.1.0.
## How was this patch tested?
Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12923 from HyukjinKwon/SPARK-15148.
## What changes were proposed in this pull request?
Similar to #11990, GenerateOrdering and GenerateColumnAccessor should print debug log for generated code with proper indentation.
## How was this patch tested?
Manually checked.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#12908 from sarutak/SPARK-15132.
## What changes were proposed in this pull request?
Make sure that whenever the StateStoreCoordinator cannot be contacted, assume that the SparkContext and RpcEnv on the driver has been shutdown, and therefore stop the StateStore management thread, and unload all loaded stores.
## How was this patch tested?
Updated unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12905 from tdas/SPARK-15131.
#### What changes were proposed in this pull request?
When we load a dataset, if we set the path to ```/path/a=1```, we will not take `a` as the partitioning column. However, if we set the path to ```/path/a=1/file.parquet```, we take `a` as the partitioning column and it shows up in the schema.
This PR is to fix the behavior inconsistency issue.
The base path contains a set of paths that are considered as the base dirs of the input datasets. The partitioning discovery logic will make sure it will stop when it reaches any base path.
By default, the paths of the dataset provided by users will be base paths. Below are three typical cases,
**Case 1**```sqlContext.read.parquet("/path/something=true/")```: the base path will be
`/path/something=true/`, and the returned DataFrame will not contain a column of `something`.
**Case 2**```sqlContext.read.parquet("/path/something=true/a.parquet")```: the base path will be
still `/path/something=true/`, and the returned DataFrame will also not contain a column of
`something`.
**Case 3**```sqlContext.read.parquet("/path/")```: the base path will be `/path/`, and the returned
DataFrame will have the column of `something`.
Users also can override the basePath by setting `basePath` in the options to pass the new base
path to the data source. For example,
```sqlContext.read.option("basePath", "/path/").parquet("/path/something=true/")```,
and the returned DataFrame will have the column of `something`.
The related PRs:
- https://github.com/apache/spark/pull/9651
- https://github.com/apache/spark/pull/10211
#### How was this patch tested?
Added a couple of test cases
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12828 from gatorsmile/readPartitionedTable.
## What changes were proposed in this pull request?
This PR support new SQL syntax CREATE TEMPORARY VIEW.
Like:
```
CREATE TEMPORARY VIEW viewName AS SELECT * from xx
CREATE OR REPLACE TEMPORARY VIEW viewName AS SELECT * from xx
CREATE TEMPORARY VIEW viewName (c1 COMMENT 'blabla', c2 COMMENT 'blabla') AS SELECT * FROM xx
```
## How was this patch tested?
Unit tests.
Author: Sean Zhong <clockfly@gmail.com>
Closes#12872 from clockfly/spark-6399.
## What changes were proposed in this pull request?
Typo fix
## How was this patch tested?
No tests
My apologies for the tiny PR, but I stumbled across this today and wanted to get it corrected for 2.0.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#12912 from sethah/csv_typo.
## What changes were proposed in this pull request?
Currently we return RuntimeConfig itself to facilitate chaining. However, it makes the output in interactive environments (e.g. notebooks, scala repl) weird because it'd show the response of calling set as a RuntimeConfig itself.
## How was this patch tested?
Updated unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12902 from rxin/SPARK-15126.
## What changes were proposed in this pull request?
File Stream Sink writes the list of written files in a metadata log. StreamFileCatalog reads the list of the files for processing. However StreamFileCatalog does not infer partitioning like HDFSFileCatalog.
This PR enables that by refactoring HDFSFileCatalog to create an abstract class PartitioningAwareFileCatalog, that has all the functionality to infer partitions from a list of leaf files.
- HDFSFileCatalog has been renamed to ListingFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from recursive directory scanning.
- StreamFileCatalog has been renamed to MetadataLogFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from the metadata log.
- The above two classes has been moved into their own files as they are not interfaces that should be in fileSourceInterfaces.scala.
## How was this patch tested?
- FileStreamSinkSuite was update to see if partitioning gets inferred, and on reading whether the partitions get pruned correctly based on the query.
- Other unit tests are unchanged and pass as expected.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12879 from tdas/SPARK-15103.
## What changes were proposed in this pull request?
We can support subexpression elimination in TungstenAggregate by using current `EquivalentExpressions` which is already used in subexpression elimination for expression codegen.
However, in wholestage codegen, we can't wrap the common expression's codes in functions as before, we simply generate the code snippets for common expressions. These code snippets are inserted before the common expressions are actually used in generated java codes.
For multiple `TypedAggregateExpression` used in aggregation operator, since their input type should be the same. So their `inputDeserializer` will be the same too. This patch can also reduce redundant input deserialization.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#12729 from viirya/subexpr-elimination-tungstenaggregate.
## What changes were proposed in this pull request?
This patch changes the join API in Dataset so they can accept any Dataset, rather than just DataFrames.
## How was this patch tested?
N/A.
Author: Reynold Xin <rxin@databricks.com>
Closes#12886 from rxin/SPARK-15109.
## What changes were proposed in this pull request?
Currently in `StreamTest`, we have a `StartStream` which will start a streaming query against trigger `ProcessTime(intervalMS = 0)` and `SystemClock`.
We also need to test cases against `ProcessTime(intervalMS > 0)`, which often requires `ManualClock`.
This patch:
- fixes an issue of `ProcessingTimeExecutor`, where for a batch it should run `batchRunner` only once but might run multiple times under certain conditions;
- adds support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `AdvanceManualClock`, by specifying them as fields for `StartStream`, and by adding an `AdvanceClock` action;
- adds a test, which takes advantage of the new `StartStream` and `AdvanceManualClock`, to test against [PR#[SPARK-14942] Reduce delay between batch construction and execution ](https://github.com/apache/spark/pull/12725).
## How was this patch tested?
N/A
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12797 from lw-lin/add-trigger-test-support.
## What changes were proposed in this pull request?
This PR improve the error message for `Generate` in 3 cases:
1. generator is nested in expressions, e.g. `SELECT explode(list) + 1 FROM tbl`
2. generator appears more than one time in SELECT, e.g. `SELECT explode(list), explode(list) FROM tbl`
3. generator appears in other operator which is not project, e.g. `SELECT * FROM tbl SORT BY explode(list)`
## How was this patch tested?
new tests in `AnalysisErrorSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12810 from cloud-fan/bug.
## What changes were proposed in this pull request?
Currently, various `FileFormat` data sources share approximately the same code for partition value appending. This PR tries to eliminate this duplication.
A new method `buildReaderWithPartitionValues()` is added to `FileFormat` with a default implementation that appends partition values to `InternalRow`s produced by the reader function returned by `buildReader()`.
Special data sources like Parquet, which implements partition value appending inside `buildReader()` because of the vectorized reader, and the Text data source, which doesn't support partitioning, override `buildReaderWithPartitionValues()` and simply delegate to `buildReader()`.
This PR brings two benefits:
1. Apparently, it de-duplicates partition value appending logic
2. Now the reader function returned by `buildReader()` is only required to produce `InternalRow`s rather than `UnsafeRow`s if the data source doesn't override `buildReaderWithPartitionValues()`.
Because the safe-to-unsafe conversion is also performed while appending partition values. This makes 3rd-party data sources (e.g. spark-avro) easier to implement since they no longer need to access private APIs involving `UnsafeRow`.
## How was this patch tested?
Existing tests should do the work.
Author: Cheng Lian <lian@databricks.com>
Closes#12866 from liancheng/spark-14237-simplify-partition-values-appending.
## What changes were proposed in this pull request?
Just a bunch of small tweaks on DDL exception messages.
## How was this patch tested?
`DDLCommandSuite` et al.
Author: Andrew Or <andrew@databricks.com>
Closes#12853 from andrewor14/make-exceptions-consistent.
## What changes were proposed in this pull request?
Make Dataset.sqlContext a lazy val so that its a stable identifier and can be used for imports.
Now this works again:
import someDataset.sqlContext.implicits._
## How was this patch tested?
Add unit test to DatasetSuite that uses the import show above.
Author: Koert Kuipers <koert@tresata.com>
Closes#12877 from koertkuipers/feat-sqlcontext-stable-import.
## What changes were proposed in this pull request?
Create a new API for handling Optional Configs in SQLConf.
Right now `getConf` for `OptionalConfigEntry[T]` returns value of type `T`, if doesn't exist throws an exception. Add new method `getOptionalConf`(suggestions on naming) which will now returns value of type `Option[T]`(so if doesn't exist it returns `None`).
## How was this patch tested?
Add test and ran tests locally.
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#12846 from techaddict/SPARK-14422.
## What changes were proposed in this pull request?
Users should use the builder pattern instead.
## How was this patch tested?
Jenks.
Author: Andrew Or <andrew@databricks.com>
Closes#12873 from andrewor14/spark-session-constructor.
## What changes were proposed in this pull request?
Observed stackOverflowError in Kryo when executing TPC-DS Query27. Spark thrift server disables kryo reference tracking (if not specified in conf). When "spark.kryo.referenceTracking" is set to true explicitly in spark-defaults.conf, query executes successfully. The root cause is that the TaskMemoryManager inside MemoryConsumer and LongToUnsafeRowMap were not transient and thus were serialized and broadcast around from within LongHashedRelation, which could potentially cause circular reference inside Kryo. But the TaskMemoryManager is per task and should not be passed around at the first place. This fix makes it transient.
## How was this patch tested?
core/test, hive/test, sql/test, catalyst/test, dev/lint-scala, org.apache.spark.sql.hive.execution.HiveCompatibilitySuite, dev/scalastyle,
manual test of TBC-DS Query 27 with 1GB data but without the "limit 100" which would cause a NPE due to SPARK-14752.
Author: yzhou2001 <yzhou_1999@yahoo.com>
Closes#12598 from yzhou2001/master.
## What changes were proposed in this pull request?
Remove AccumulatorV2.localValue and keep only value
## How was this patch tested?
existing tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#12865 from techaddict/SPARK-15087.
# What changes were proposed in this pull request?
Support partitioning in the file stream sink. This is implemented using a new, but simpler code path for writing parquet files - both unpartitioned and partitioned. This new code path does not use Output Committers, as we will eventually write the file names to the metadata log for "committing" them.
This patch duplicates < 100 LOC from the WriterContainer. But its far simpler that WriterContainer as it does not involve output committing. In addition, it introduces the new APIs in FileFormat and OutputWriterFactory in an attempt to simplify the APIs (not have Job in the `FileFormat` API, not have bucket and other stuff in the `OutputWriterFactory.newInstance()` ).
# Tests
- New unit tests to test the FileStreamSinkWriter for partitioned and unpartitioned files
- New unit test to partially test the FileStreamSink for partitioned files (does not test recovery of partition column data, as that requires change in the StreamFileCatalog, future PR).
- Updated FileStressSuite to test number of records read from partitioned output files.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12409 from tdas/streaming-partitioned-parquet.
## What changes were proposed in this pull request?
This patch removes SparkSqlSerializer. I believe this is now dead code.
## How was this patch tested?
Removed a test case related to it.
Author: Reynold Xin <rxin@databricks.com>
Closes#12864 from rxin/SPARK-15088.
## What changes were proposed in this pull request?
This patch moves AccumulatorV2 and subclasses into util package.
## How was this patch tested?
Updated relevant tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12863 from rxin/SPARK-15081.
## What changes were proposed in this pull request?
Right now `StreamExecution.awaitBatchLock` uses an unfair lock. `StreamExecution.awaitOffset` may run too long and fail some test because `StreamExecution.constructNextBatch` keeps getting the lock.
See: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/865/testReport/junit/org.apache.spark.sql.streaming/FileStreamSourceStressTestSuite/file_source_stress_test/
This PR uses a fair ReentrantLock to resolve the thread starvation issue.
## How was this patch tested?
Modified `FileStreamSourceStressTestSuite.test("file source stress test")` to run the test codes 100 times locally. It always fails because of timeout without this patch.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12852 from zsxwing/SPARK-15077.
## What changes were proposed in this pull request?
This PR addresses a few minor issues in SQL parser:
- Removes some unused rules and keywords in the grammar.
- Removes code path for fallback SQL parsing (was needed for Hive native parsing).
- Use `UnresolvedGenerator` instead of hard-coding `Explode` & `JsonTuple`.
- Adds a more generic way of creating error messages for unsupported Hive features.
- Use `visitFunctionName` as much as possible.
- Interpret a `CatalogColumn`'s `DataType` directly instead of parsing it again.
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12826 from hvanhovell/SPARK-15047.
The contribution is my original work and that I license the work to the project under the project's open source license.
Author: poolis <gmichalopoulos@gmail.com>
Author: Greg Michalopoulos <gmichalopoulos@gmail.com>
Closes#10899 from poolis/spark-12928.
## What changes were proposed in this pull request?
This patch creates a builder pattern for creating SparkSession. The new code is unused and mostly deadcode. I'm putting it up here for feedback.
There are a few TODOs that can be done as follow-up pull requests:
- [ ] Update tests to use this
- [ ] Update examples to use this
- [ ] Clean up SQLContext code w.r.t. this one (i.e. SparkSession shouldn't call into SQLContext.getOrCreate; it should be the other way around)
- [ ] Remove SparkSession.withHiveSupport
- [ ] Disable the old constructor (by making it private) so the only way to start a SparkSession is through this builder pattern
## How was this patch tested?
Part of the future pull request is to clean this up and switch existing tests to use this.
Author: Reynold Xin <rxin@databricks.com>
Closes#12830 from rxin/sparksession-builder.
## What changes were proposed in this pull request?
parquet datasource and ColumnarBatch tests fail on big-endian platforms This patch adds support for the little-endian byte arrays being correctly interpreted on a big-endian platform
## How was this patch tested?
Spark test builds ran on big endian z/Linux and regression build on little endian amd64
Author: Pete Robbins <robbinspg@gmail.com>
Closes#12397 from robbinspg/master.
## What changes were proposed in this pull request?
In order to support nested predicate subquery, this PR introduce an internal join type ExistenceJoin, which will emit all the rows from left, plus an additional column, which presents there are any rows matched from right or not (it's not null-aware right now). This additional column could be used to replace the subquery in Filter.
In theory, all the predicate subquery could use this join type, but it's slower than LeftSemi and LeftAnti, so it's only used for nested subquery (subquery inside OR).
For example, the following SQL:
```sql
SELECT a FROM t WHERE EXISTS (select 0) OR EXISTS (select 1)
```
This PR also fix a bug in predicate subquery push down through join (they should not).
Nested null-aware subquery is still not supported. For example, `a > 3 OR b NOT IN (select bb from t)`
After this, we could run TPCDS query Q10, Q35, Q45
## How was this patch tested?
Added unit tests.
Author: Davies Liu <davies@databricks.com>
Closes#12820 from davies/or_exists.
## What changes were proposed in this pull request?
#12339 didn't fix the race condition. MemorySinkSuite is still flaky: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.2/814/testReport/junit/org.apache.spark.sql.streaming/MemorySinkSuite/registering_as_a_table/
Here is an execution order to reproduce it.
| Time |Thread 1 | MicroBatchThread |
|:-------------:|:-------------:|:-----:|
| 1 | | `MemorySink.getOffset` |
| 2 | | availableOffsets ++= newData (availableOffsets is not changed here) |
| 3 | addData(newData) | |
| 4 | Set `noNewData` to `false` in processAllAvailable | |
| 5 | | `dataAvailable` returns `false` |
| 6 | | noNewData = true |
| 7 | `noNewData` is true so just return | |
| 8 | assert results and fail | |
| 9 | | `dataAvailable` returns true so process the new batch |
This PR expands the scope of `awaitBatchLock.synchronized` to eliminate the above race.
## How was this patch tested?
test("stress test"). It always failed before this patch. And it will pass after applying this patch. Ignore this test in the PR as it takes several minutes to finish.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12582 from zsxwing/SPARK-14579-2.
## What changes were proposed in this pull request?
NewAccumulator isn't the best name if we ever come up with v3 of the API.
## How was this patch tested?
Updated tests to reflect the change.
Author: Reynold Xin <rxin@databricks.com>
Closes#12827 from rxin/SPARK-15049.
## What changes were proposed in this pull request?
This PR adds the explanation and documentation for CSV options for reading and writing.
## How was this patch tested?
Style tests with `./dev/run_tests` for documentation style.
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Closes#12817 from HyukjinKwon/SPARK-13425.
## What changes were proposed in this pull request?
This is caused by https://github.com/apache/spark/pull/12776, which removes the `synchronized` from all methods in `AccumulatorContext`.
However, a test in `CachedTableSuite` synchronize on `AccumulatorContext` and expecting no one else can change it, which is not true anymore.
This PR update that test to not require to lock on `AccumulatorContext`.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12811 from cloud-fan/flaky.
1. Adds the following options for parsing NaNs: nanValue
2. Adds the following options for parsing infinity: positiveInf, negativeInf.
`TypeCast.castTo` is unit tested and an end-to-end test is added to `CSVSuite`
Author: Hossein <hossein@databricks.com>
Closes#11947 from falaki/SPARK-14143.
This PR contains three changes:
1. We will use spark.sql.warehouse.dir set warehouse location. We will not use hive.metastore.warehouse.dir.
2. SessionCatalog needs to set the location to default db. Otherwise, when creating a table in SparkSession without hive support, the default db's path will be an empty string.
3. When we create a database, we need to make the path qualified.
Existing tests and new tests
Author: Yin Huai <yhuai@databricks.com>
Closes#12812 from yhuai/warehouse.
## What changes were proposed in this pull request?
This patch removes some code that are no longer relevant -- mainly HiveSessionState.setDefaultOverrideConfs.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#12806 from rxin/SPARK-15028.
## What changes were proposed in this pull request?
This PR adds the support to specify custom date format for `DateType` and `TimestampType`.
For `TimestampType`, this uses the given format to infer schema and also to convert the values
For `DateType`, this uses the given format to convert the values.
If the `dateFormat` is not given, then it works with `DateTimeUtils.stringToTime()` for backwords compatibility.
When it's given, then it uses `SimpleDateFormat` for parsing data.
In addition, `IntegerType`, `DoubleType` and `LongType` have a higher priority than `TimestampType` in type inference. This means even if the given format is `yyyy` or `yyyy.MM`, it will be inferred as `IntegerType` or `DoubleType`. Since it is type inference, I think it is okay to give such precedences.
In addition, I renamed `csv.CSVInferSchema` to `csv.InferSchema` as JSON datasource has `json.InferSchema`. Although they have the same names, I did this because I thought the parent package name can still differentiate each. Accordingly, the suite name was also changed from `CSVInferSchemaSuite` to `InferSchemaSuite`.
## How was this patch tested?
unit tests are used and `./dev/run_tests` for coding style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11550 from HyukjinKwon/SPARK-13667.
## What changes were proposed in this pull request?
CatalystSqlParser can parse data types. So, we do not need to have an individual DataTypeParser.
## How was this patch tested?
Existing tests
Author: Yin Huai <yhuai@databricks.com>
Closes#12796 from yhuai/removeDataTypeParser.
## What changes were proposed in this pull request?
1. Remove all the `spark.setConf` etc. Just expose `spark.conf`
2. Make `spark.conf` take in things set in the core `SparkConf` as well, otherwise users may get confused
This was done for both the Python and Scala APIs.
## How was this patch tested?
`SQLConfSuite`, python tests.
This one fixes the failed tests in #12787Closes#12787
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12798 from yhuai/conf-api.
## What changes were proposed in this pull request?
Addresses comments in #12765.
## How was this patch tested?
Python tests.
Author: Andrew Or <andrew@databricks.com>
Closes#12784 from andrewor14/python-followup.
## What changes were proposed in this pull request?
dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.
The function signature is:
dapply(df, function(localDF) {}, schema = NULL)
R function input: local data.frame from the partition on local node
R function output: local data.frame
Schema specifies the Row format of the resulting DataFrame. It must match the R function's output.
If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such resulting DataFrame can be processed by successive calls to dapply().
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>
Closes#12493 from sun-rui/SPARK-12919.
## What changes were proposed in this pull request?
Currently Spark SQL doesn't support sorting columns in descending order. However, the parser accepts the syntax and silently drops sorting directions. This PR fixes this by throwing an exception if `DESC` is specified as sorting direction of a sorting column.
## How was this patch tested?
A test case is added to test the invalid sorting order by checking exception message.
Author: Cheng Lian <lian@databricks.com>
Closes#12759 from liancheng/spark-14981.
## What changes were proposed in this pull request?
The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the python API.
## How was this patch tested?
Python tests.
Author: Andrew Or <andrew@databricks.com>
Closes#12765 from andrewor14/python-spark-session-more.
## What changes were proposed in this pull request?
This patch removes executionHive from HiveSessionState and HiveSharedState.
## How was this patch tested?
Updated test cases.
Author: Reynold Xin <rxin@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12770 from rxin/SPARK-14994.
#### What changes were proposed in this pull request?
Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
```SQL
SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2
==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
```
Note:
1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL.
2. This rule has to be done after de-duplicating the attributes; otherwise, the enerated
join conditions will be incorrect.
This PR also corrects the existing behavior in Spark. Before this PR, the behavior is like
```SQL
test("except") {
val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id")
val df_right = Seq(1, 3).toDF("id")
checkAnswer(
df_left.except(df_right),
Row(2) :: Row(2) :: Row(4) :: Nil
)
}
```
After this PR, the result is corrected. We strictly follow the SQL compliance of `Except Distinct`.
#### How was this patch tested?
Modified and added a few test cases to verify the optimization rule and the results of operators.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12736 from gatorsmile/exceptByAntiJoin.
## What changes were proposed in this pull request?
Minor typo fixes
## How was this patch tested?
local build
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12755 from zhengruifeng/fix_doc_dataset.
## What changes were proposed in this pull request?
This patch removes HiveNativeCommand, so we can continue to remove the dependency on Hive. This pull request also removes the ability to generate golden result file using Hive.
## How was this patch tested?
Updated tests to reflect this.
Author: Reynold Xin <rxin@databricks.com>
Closes#12769 from rxin/SPARK-14991.
## What changes were proposed in this pull request?
The FileCatalog object gets created even if the user specifies schema, which means files in the directory is enumerated even thought its not necessary. For large directories this is very slow. User would want to specify schema in such scenarios of large dirs, and this defeats the purpose quite a bit.
## How was this patch tested?
Hard to test this with unit test.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12748 from tdas/SPARK-14970.
## What changes were proposed in this pull request?
This PR introduces a new accumulator API which is much simpler than before:
1. the type hierarchy is simplified, now we only have an `Accumulator` class
2. Combine `initialValue` and `zeroValue` concepts into just one concept: `zeroValue`
3. there in only one `register` method, the accumulator registration and cleanup registration are combined.
4. the `id`,`name` and `countFailedValues` are combined into an `AccumulatorMetadata`, and is provided during registration.
`SQLMetric` is a good example to show the simplicity of this new API.
What we break:
1. no `setValue` anymore. In the new API, the intermedia type can be different from the result type, it's very hard to implement a general `setValue`
2. accumulator can't be serialized before registered.
Problems need to be addressed in follow-ups:
1. with this new API, `AccumulatorInfo` doesn't make a lot of sense, the partial output is not partial updates, we need to expose the intermediate value.
2. `ExceptionFailure` should not carry the accumulator updates. Why do users care about accumulator updates for failed cases? It looks like we only use this feature to update the internal metrics, how about we sending a heartbeat to update internal metrics after the failure event?
3. the public event `SparkListenerTaskEnd` carries a `TaskMetrics`. Ideally this `TaskMetrics` don't need to carry external accumulators, as the only method of `TaskMetrics` that can access external accumulators is `private[spark]`. However, `SQLListener` use it to retrieve sql metrics.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12612 from cloud-fan/acc.
## What changes were proposed in this pull request?
Currently, LongToUnsafeRowMap use byte array as the underlying page, which can't be larger 1G.
This PR improves LongToUnsafeRowMap to scale up to 8G bytes by using array of Long instead of array of byte.
## How was this patch tested?
Manually ran a test to confirm that both UnsafeHashedRelation and LongHashedRelation could build a map that larger than 2G.
Author: Davies Liu <davies@databricks.com>
Closes#12740 from davies/larger_broadcast.
## What changes were proposed in this pull request?
`interfaces.scala` was getting big. This just moves the biggest class in there to a new file for cleanliness.
## How was this patch tested?
Just moving things around.
Author: Andrew Or <andrew@databricks.com>
Closes#12721 from andrewor14/move-external-catalog.
Currently, we can only create persisted partitioned and/or bucketed data source tables using the Dataset API but not using SQL DDL. This PR implements the following syntax to add partitioning and bucketing support to the SQL DDL:
```
CREATE TABLE <table-name>
USING <provider> [OPTIONS (<key1> <value1>, <key2> <value2>, ...)]
[PARTITIONED BY (col1, col2, ...)]
[CLUSTERED BY (col1, col2, ...) [SORTED BY (col1, col2, ...)] INTO <n> BUCKETS]
AS SELECT ...
```
Test cases are added in `MetastoreDataSourcesSuite` to check the newly added syntax.
Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12734 from liancheng/spark-14954.
## What changes were proposed in this pull request?
The `Batch` class, which had been used to indicate progress in a stream, was abandoned by [[SPARK-13985][SQL] Deterministic batches with ids](caea152145) and then became useless.
This patch:
- removes the `Batch` class
- ~~does some related renaming~~ (update: this has been reverted)
- fixes some related comments
## How was this patch tested?
N/A
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12638 from lw-lin/remove-batch.
### What changes were proposed in this pull request?
Anti-Joins using BroadcastHashJoin's unique key code path are broken; it currently returns Semi Join results . This PR fixes this bug.
### How was this patch tested?
Added tests cases to `ExistenceJoinSuite`.
cc davies gatorsmile
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12730 from hvanhovell/SPARK-14950.
## What changes were proposed in this pull request?
This PR will make Spark SQL not allow ALTER TABLE ADD/REPLACE/CHANGE COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands.
## How was this patch tested?
Existing tests. For those tests that I put in the blacklist, I am adding the useful parts back to SQLQuerySuite.
Author: Yin Huai <yhuai@databricks.com>
Closes#12714 from yhuai/banNativeCommand.
## What changes were proposed in this pull request?
We currently expose both Hadoop configuration and Spark SQL configuration in RuntimeConfig. I think we can remove the Hadoop configuration part, and simply generate Hadoop Configuration on the fly by passing all the SQL configurations into it. This way, there is a single interface (in Java/Scala/Python/SQL) for end-users.
As part of this patch, I also removed some config options deprecated in Spark 1.x.
## How was this patch tested?
Updated relevant tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12689 from rxin/SPARK-14913.
## What changes were proposed in this pull request?
#12625 exposed a new user-facing conf interface in `SparkSession`. This patch adds a catalog interface.
## How was this patch tested?
See `CatalogSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#12713 from andrewor14/user-facing-catalog.
## What changes were proposed in this pull request?
This PR adds Native execution of SHOW COLUMNS and SHOW PARTITION commands.
Command Syntax:
``` SQL
SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database]
```
``` SQL
SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)]
```
## How was this patch tested?
Added test cases in HiveCommandSuite to verify execution and DDLCommandSuite
to verify plans.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#12222 from dilipbiswal/dkb_show_columns.
## What changes were proposed in this pull request?
While the vectorized hash map in `TungstenAggregate` is currently supported for all primitive data types during partial aggregation, this patch only enables the hash map for a subset of cases that've been verified to show performance improvements on our benchmarks subject to an internal conf that sets an upper limit on the maximum length of the aggregate key/value schema. This list of supported use-cases should be expanded over time.
## How was this patch tested?
This is no new change in functionality so existing tests should suffice. Performance tests were done on TPCDS benchmarks.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12710 from sameeragarwal/vectorized-enable.
## What changes were proposed in this pull request?
This PR update SortMergeJoinExec to support LeftSemi/LeftAnti, so it could support all the join types, same as other three join implementations: BroadcastHashJoinExec, ShuffledHashJoinExec,and BroadcastNestedLoopJoinExec.
This PR also simplify the join selection in SparkStrategy.
## How was this patch tested?
Added new tests.
Author: Davies Liu <davies@databricks.com>
Closes#12668 from davies/smj_semi.
## What changes were proposed in this pull request?
That patch mistakenly widened the visibility from `private[x]` to `protected[x]`. This patch reverts those changes.
Author: Andrew Or <andrew@databricks.com>
Closes#12686 from andrewor14/visibility.
## What changes were proposed in this pull request?
We currently have no way for users to propagate options to the underlying library that rely in Hadoop configurations to work. For example, there are various options in parquet-mr that users might want to set, but the data source API does not expose a per-job way to set it. This patch propagates the user-specified options also into Hadoop Configuration.
## How was this patch tested?
Used a mock data source implementation to test both the read path and the write path.
Author: Reynold Xin <rxin@databricks.com>
Closes#12688 from rxin/SPARK-14912.
## What changes were proposed in this pull request?
Minor typo fixes (too minor to deserve separate a JIRA)
## How was this patch tested?
local build
Author: Jacek Laskowski <jacek@japila.pl>
Closes#12469 from jaceklaskowski/minor-typo-fixes.
## What changes were proposed in this pull request?
Use Long.parseLong which returns a primative.
Use a series of appends() reduces the creation of an extra StringBuilder type
## How was this patch tested?
Unit tests
Author: Azeem Jiva <azeemj@gmail.com>
Closes#12520 from javawithjiva/minor.
## What changes were proposed in this pull request?
In Spark 2.0, `SparkSession` is the new thing. Internally we should stop using `SQLContext` everywhere since that's supposed to be not the main user-facing API anymore.
In this patch I took care to not break any public APIs. The one place that's suspect is `o.a.s.ml.source.libsvm.DefaultSource`, but according to mengxr it's not supposed to be public so it's OK to change the underlying `FileFormat` trait.
**Reviewers**: This is a big patch that may be difficult to review but the changes are actually really straightforward. If you prefer I can break it up into a few smaller patches, but it will delay the progress of this issue a little.
## How was this patch tested?
No change in functionality intended.
Author: Andrew Or <andrew@databricks.com>
Closes#12625 from andrewor14/spark-session-refactor.
## What changes were proposed in this pull request?
`RuntimeConfig` is the new user-facing API in 2.0 added in #11378. Until now, however, it's been dead code. This patch uses `RuntimeConfig` in `SessionState` and exposes that through the `SparkSession`.
## How was this patch tested?
New test in `SQLContextSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#12669 from andrewor14/use-runtime-conf.
## What changes were proposed in this pull request?
This patch changes UnresolvedFunction and UnresolvedGenerator to use a FunctionIdentifier rather than just a String for function name. Also changed SessionCatalog to accept FunctionIdentifier in lookupFunction.
## How was this patch tested?
Updated related unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12659 from rxin/SPARK-14888.
## What changes were proposed in this pull request?
```
Spark context available as 'sc' (master = local[*], app id = local-1461283768192).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sql("SHOW TABLES").collect()
16/04/21 17:09:39 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/21 17:09:39 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
res0: Array[org.apache.spark.sql.Row] = Array([src,false])
scala> sql("SHOW TABLES").collect()
res1: Array[org.apache.spark.sql.Row] = Array([src,false])
scala> spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3)))
res2: org.apache.spark.sql.DataFrame = [_1: int, _2: int]
```
Hive things are loaded lazily.
## How was this patch tested?
Manual.
Author: Andrew Or <andrew@databricks.com>
Closes#12589 from andrewor14/spark-session-repl.
## What changes were proposed in this pull request?
This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class.
Note: A couple of things will break after this patch. These will be fixed separately.
- the python HiveContext
- all the documentation / comments referencing HiveContext
- there will be no more HiveContext in the REPL (fixed by #12589)
## How was this patch tested?
No change in functionality.
Author: Andrew Or <andrew@databricks.com>
Closes#12585 from andrewor14/delete-hive-context.
## What changes were proposed in this pull request?
This method was accidentally made `private[sql]` in Spark 2.0. This PR makes it public again, since 3rd party data sources like spark-avro depend on it.
## How was this patch tested?
N/A
Author: Cheng Lian <lian@databricks.com>
Closes#12652 from liancheng/spark-14875.
## What changes were proposed in this pull request?
This PR fixes a bug in `TungstenAggregate` that manifests while aggregating by keys over nullable `BigDecimal` columns. This causes a null pointer exception while executing TPCDS q14a.
## How was this patch tested?
1. Added regression test in `DataFrameAggregateSuite`.
2. Verified that TPCDS q14a works
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12651 from sameeragarwal/tpcds-fix.
## What changes were proposed in this pull request?
Right now, the data type field of a CatalogColumn is using the string representation. When we create this string from a DataType object, there are places where we use simpleString instead of catalogString. Although catalogString is the same as simpleString right now, it is still good to use catalogString. So, we will not silently introduce issues when we change the semantic of simpleString or the implementation of catalogString.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#12654 from yhuai/useCatalogString.
## What changes were proposed in this pull request?
Spark uses `NewLineAtEofChecker` rule in Scala by ScalaStyle. And, most Java code also comply with the rule. This PR aims to enforce the same rule `NewlineAtEndOfFile` by CheckStyle explicitly. Also, this fixes lint-java errors since SPARK-14465. The followings are the items.
- Adds a new line at the end of the files (19 files)
- Fixes 25 lint-java errors (12 RedundantModifier, 6 **ArrayTypeStyle**, 2 LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder)
## How was this patch tested?
After the Jenkins test succeeds, `dev/lint-java` should pass. (Currently, Jenkins dose not run lint-java.)
```bash
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12632 from dongjoon-hyun/SPARK-14868.
## What changes were proposed in this pull request?
This patch changes SparkSession to be case insensitive by default, in order to match other database systems.
## How was this patch tested?
N/A - I'm sure some tests will fail and I will need to fix those.
Author: Reynold Xin <rxin@databricks.com>
Closes#12643 from rxin/SPARK-14876.
#### What changes were proposed in this pull request?
So far, we are capturing each unsupported Alter Table in separate visit functions. They should be unified and issue the same ParseException instead.
This PR is to refactor the existing implementation and make error message consistent for Alter Table DDL.
#### How was this patch tested?
Updated the existing test cases and also added new test cases to ensure all the unsupported statements are covered.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12459 from gatorsmile/cleanAlterTable.
## What changes were proposed in this pull request?
CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect are not Hive-specific. So, this PR moves them from sql/hive to sql/core. Also, I am adding `Command` suffix to these two classes.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#12645 from yhuai/moveCreateDataSource.
## What changes were proposed in this pull request?
This patch improves error handling in view creation. CreateViewCommand itself will analyze the view SQL query first, and if it cannot successfully analyze it, throw an AnalysisException.
In addition, I also added the following two conservative guards for easier identification of Spark bugs:
1. If there is a bug and the generated view SQL cannot be analyzed, throw an exception at runtime. Note that this is not an AnalysisException because it is not caused by the user and more likely indicate a bug in Spark.
2. SQLBuilder when it gets an unresolved plan, it will also show the plan in the error message.
I also took the chance to simplify the internal implementation of CreateViewCommand, and *removed* a fallback path that would've masked an exception from before.
## How was this patch tested?
1. Added a unit test for the user facing error handling.
2. Manually introduced some bugs in Spark to test the internal defensive error handling.
3. Also added a test case to test nested views (not super relevant).
Author: Reynold Xin <rxin@databricks.com>
Closes#12633 from rxin/SPARK-14865.
## What changes were proposed in this pull request?
In order to support running SQL directly on files, we added some code in ResolveRelations to catch the exception thrown by catalog.lookupRelation and ignore it. This unfortunately masks all the exceptions. This patch changes the logic to simply test the table's existence.
## How was this patch tested?
I manually hacked some bugs into Spark and made sure the exceptions were being propagated up.
Author: Reynold Xin <rxin@databricks.com>
Closes#12634 from rxin/SPARK-14869.
## What changes were proposed in this pull request?
This patch restructures sql.execution.command package to break the commands into multiple files, in some logical organization: databases, tables, views, functions.
I also renamed basicOperators.scala to basicLogicalOperators.scala and basicPhysicalOperators.scala.
## How was this patch tested?
N/A - all I did was moving code around.
Author: Reynold Xin <rxin@databricks.com>
Closes#12636 from rxin/SPARK-14872.
## What changes were proposed in this pull request?
del unused imports in ML/MLLIB
## How was this patch tested?
unit tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12497 from zhengruifeng/del_unused_imports.
## What changes were proposed in this pull request?
Currently, the Parquet reader decide whether to return batch based on required schema or full schema, it's not consistent, this PR fix that.
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#12619 from davies/fix_return_batch.
## What changes were proposed in this pull request?
This patch re-implements view creation command in sql/core, based on the pre-existing view creation command in the Hive module. This consolidates the view creation logical command and physical command into a single one, called CreateViewCommand.
## How was this patch tested?
All the code should've been tested by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12615 from rxin/SPARK-14842-2.
## What changes were proposed in this pull request?
This patch adds "Exec" suffix to all physical operators. Before this patch, Spark's physical operators and logical operators are named the same (e.g. Project could be logical.Project or execution.Project), which caused small issues in code review and bigger issues in code refactoring.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#12617 from rxin/exec-node.
## What changes were proposed in this pull request?
When creating a file stream using sqlContext.write.stream(), existing files are scanned twice for finding the schema
- Once, when creating a DataSource + StreamingRelation in the DataFrameReader.stream()
- Again, when creating streaming Source from the DataSource, in DataSource.createSource()
Instead, the schema should be generated only once, at the time of creating the dataframe, and when the streaming source is created, it should just reuse that schema
The solution proposed in this PR is to add a lazy field in DataSource that caches the schema. Then streaming Source created by the DataSource can just reuse the schema.
## How was this patch tested?
Refactored unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12591 from tdas/SPARK-14832.
## What changes were proposed in this pull request?
This PR try to increase the parallelism for small table (a few of big files) to reduce the query time, by decrease the maxSplitBytes, the goal is to have at least one task per CPU in the cluster, if the total size of all files is bigger than openCostInBytes * 2 * nCPU.
For example, a small/medium table could be used as dimension table in huge query, this will be useful to reduce the time waiting for broadcast.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#12344 from davies/more_partition.
## What changes were proposed in this pull request?
Currently, `OptimizeIn` optimizer replaces `In` expression into `InSet` expression if the size of set is greater than a constant, 10.
This issue aims to make a configuration `spark.sql.optimizer.inSetConversionThreshold` for that.
After this PR, `OptimizerIn` is configurable.
```scala
scala> sql("select a in (1,2,3) from (select explode(array(1,2)) a) T").explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [a#7 IN (1,2,3) AS (a IN (1, 2, 3))#8]
: +- INPUT
+- Generate explode([1,2]), false, false, [a#7]
+- Scan OneRowRelation[]
scala> sqlContext.setConf("spark.sql.optimizer.inSetConversionThreshold", "2")
scala> sql("select a in (1,2,3) from (select explode(array(1,2)) a) T").explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [a#16 INSET (1,2,3) AS (a IN (1, 2, 3))#17]
: +- INPUT
+- Generate explode([1,2]), false, false, [a#16]
+- Scan OneRowRelation[]
```
## How was this patch tested?
Pass the Jenkins tests (with a new testcase)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12562 from dongjoon-hyun/SPARK-14796.
## What changes were proposed in this pull request?
1. Fix the "spill size" of TungstenAggregate and Sort
2. Rename "data size" to "peak memory" to match the actual meaning (also consistent with task metrics)
3. Added "data size" for ShuffleExchange and BroadcastExchange
4. Added some timing for Sort, Aggregate and BroadcastExchange (this requires another patch to work)
## How was this patch tested?
Existing tests.
![metrics](https://cloud.githubusercontent.com/assets/40902/14573908/21ad2f00-030d-11e6-9e2c-c544f30039ea.png)
Author: Davies Liu <davies@databricks.com>
Closes#12425 from davies/fix_metrics.
## What changes were proposed in this pull request?
SparkPlan.prepare() could be called in different threads (BroadcastExchange will call it in a thread pool), it only make sure that doPrepare() will only be called once, the second call to prepare() may return earlier before all the children had finished prepare(). Then some operator may call doProduce() before prepareSubqueries(), `null` will be used as the result of subquery, which is wrong. This cause TPCDS Q23B returns wrong answer sometimes.
This PR added synchronization for prepare(), make sure all the children had finished prepare() before return. Also call prepare() in produce() (similar to execute()).
Added checking for ScalarSubquery to make sure that the subquery has finished before using the result.
## How was this patch tested?
Manually tested with Q23B, no wrong answer anymore.
Author: Davies Liu <davies@databricks.com>
Closes#12600 from davies/fix_risk.
## What changes were proposed in this pull request?
This patch moves SQLBuilder into sql/core so we can in the future move view generation also into sql/core.
## How was this patch tested?
Also moved unit tests.
Author: Reynold Xin <rxin@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12602 from rxin/SPARK-14841.
## What changes were proposed in this pull request?
In Python, the `option` and `options` method of `DataFrameReader` and `DataFrameWriter` were sending the string "None" instead of `null` when passed `None`, therefore making it impossible to send an actual `null`. This fixes that problem.
This is based on #11305 from mathieulongtin.
## How was this patch tested?
Added test to readwriter.py.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: mathieu longtin <mathieu.longtin@nuance.com>
Closes#12494 from viirya/py-df-none-option.
## What changes were proposed in this pull request?
Implement some `hashCode` and `equals` together in order to enable the scalastyle.
This is a first batch, I will continue to implement them but I wanted to know your thoughts.
Author: Joan <joan@goyeau.com>
Closes#12157 from joan38/SPARK-6429-HashCode-Equals.
## What changes were proposed in this pull request?
Add the native support for LOAD DATA DDL command that loads data into Hive table/partition.
## How was this patch tested?
`HiveDDLCommandSuite` and `HiveQuerySuite`. Besides, few Hive tests (`WindowQuerySuite`, `HiveTableScanSuite` and `HiveSerDeSuite`) also use `LOAD DATA` command.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#12412 from viirya/ddl-load-data.
## What changes were proposed in this pull request?
This patch removes HiveQueryExecution. As part of this, I consolidated all the describe commands into DescribeTableCommand.
## How was this patch tested?
Should be covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12588 from rxin/SPARK-14826.
(This PR is a rebased version of PR #12153.)
## What changes were proposed in this pull request?
This PR adds preliminary locality support for `FileFormat` data sources by overriding `FileScanRDD.preferredLocations()`. The strategy can be divided into two parts:
1. Block location lookup
Unlike `HadoopRDD` or `NewHadoopRDD`, `FileScanRDD` doesn't have access to the underlying `InputFormat` or `InputSplit`, and thus can't rely on `InputSplit.getLocations()` to gather locality information. Instead, this PR queries block locations using `FileSystem.getBlockLocations()` after listing all `FileStatus`es in `HDFSFileCatalog` and convert all `FileStatus`es into `LocatedFileStatus`es.
Note that although S3/S3A/S3N file systems don't provide valid locality information, their `getLocatedStatus()` implementations don't actually issue remote calls either. So there's no need to special case these file systems.
2. Selecting preferred locations
For each `FilePartition`, we pick up top 3 locations that containing the most data to be retrieved. This isn't necessarily the best algorithm out there. Further improvements may be brought up in follow-up PRs.
## How was this patch tested?
Tested by overriding default `FileSystem` implementation for `file:///` with a mocked one, which returns mocked block locations.
Author: Cheng Lian <lian@databricks.com>
Closes#12527 from liancheng/spark-14369-locality-rebased.
## What changes were proposed in this pull request?
This PR adds support for all primitive datatypes, decimal types and stringtypes in the VectorizedHashmap during aggregation.
## How was this patch tested?
Existing tests for group-by aggregates should already test for all these datatypes. Additionally, manually inspected the generated code for all supported datatypes (details below).
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12440 from sameeragarwal/all-datatypes.
## What changes were proposed in this pull request?
This patch moves analyze table parsing into SparkSqlAstBuilder and removes HiveSqlAstBuilder.
In order to avoid extensive refactoring, I created a common trait for CatalogRelation and MetastoreRelation, and match on that. In the future we should probably just consolidate the two into a single thing so we don't need this common trait.
## How was this patch tested?
Updated unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12584 from rxin/SPARK-14821.
## What changes were proposed in this pull request?
Spark currently uses TimSort for all in-memory sorts, including sorts done for shuffle. One low-hanging fruit is to use radix sort when possible (e.g. sorting by integer keys). This PR adds a radix sort implementation to the unsafe sort package and switches shuffles and sorts to use it when possible.
The current implementation does not have special support for null values, so we cannot radix-sort `LongType`. I will address this in a follow-up PR.
## How was this patch tested?
Unit tests, enabling radix sort on existing tests. Microbenchmark results:
```
Running benchmark: radix sort 25000000
Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Linux 3.13.0-44-generic
Intel(R) Core(TM) i7-4600U CPU 2.10GHz
radix sort 25000000: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
reference TimSort key prefix array 15546 / 15859 1.6 621.9 1.0X
reference Arrays.sort 2416 / 2446 10.3 96.6 6.4X
radix sort one byte 133 / 137 188.4 5.3 117.2X
radix sort two bytes 255 / 258 98.2 10.2 61.1X
radix sort eight bytes 991 / 997 25.2 39.6 15.7X
radix sort key prefix array 1540 / 1563 16.2 61.6 10.1X
```
I also ran a mix of the supported TPCDS queries and compared TimSort vs RadixSort metrics. The overall benchmark ran ~10% faster with radix sort on. In the breakdown below, the radix-enabled sort phases averaged about 20x faster than TimSort, however sorting is only a small fraction of the overall runtime. About half of the TPCDS queries were able to take advantage of radix sort.
```
TPCDS on master: 2499s real time, 8185s executor
- 1171s in TimSort, avg 267 MB/s
(note the /s accounting is weird here since dataSize counts the record sizes too)
TPCDS with radix enabled: 2294s real time, 7391s executor
- 596s in TimSort, avg 254 MB/s
- 26s in radix sort, avg 4.2 GB/s
```
cc davies rxin
Author: Eric Liang <ekl@databricks.com>
Closes#12490 from ericl/sort-benchmark.
## What changes were proposed in this pull request?
We recently made `ColumnarBatch.row` mutable and added a new `ColumnVector.putDecimal` method to support putting `Decimal` values in the `ColumnarBatch`. This unfortunately introduced a bug wherein we were not updating the vector with the proper unscaled values.
## How was this patch tested?
This codepath is hit only when the vectorized aggregate hashmap is enabled. https://github.com/apache/spark/pull/12440 makes sure that a number of regression tests/benchmarks test this bugfix.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12541 from sameeragarwal/fix-bigdecimal.
## What changes were proposed in this pull request?
This patch moves native command and script transformation into SparkSqlAstBuilder. This builds on #12561. See the last commit for diff.
## How was this patch tested?
Updated test cases to reflect this.
Author: Reynold Xin <rxin@databricks.com>
Closes#12564 from rxin/SPARK-14798.
## What changes were proposed in this pull request?
After removing most of `HiveContext` in 8fc267ab33 we can now move existing functionality in `SQLContext` to `SparkSession`. As of this PR `SQLContext` becomes a simple wrapper that has a `SparkSession` and delegates all functionality to it.
## How was this patch tested?
Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes#12553 from andrewor14/implement-spark-session.
## What changes were proposed in this pull request?
the `Accumulable.internal` flag is only used to avoid registering internal accumulators for 2 certain cases:
1. `TaskMetrics.createTempShuffleReadMetrics`: the accumulators in the temp shuffle read metrics should not be registered.
2. `TaskMetrics.fromAccumulatorUpdates`: the created task metrics is only used to post event, accumulators inside it should not be registered.
For 1, we can create a `TempShuffleReadMetrics` that don't create accumulators, just keep the data and merge it at last.
For 2, we can un-register these accumulators immediately.
TODO: remove `internal` flag in `AccumulableInfo` with followup PR
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12525 from cloud-fan/acc.
## What changes were proposed in this pull request?
This patch moves as many parsing rules as possible into SQL parser. There are only three more left after this patch: (1) run native command, (2) analyze, and (3) script IO. These 3 will be dealt with in a follow-up PR.
## How was this patch tested?
No test change. This simply moves code around.
Author: Reynold Xin <rxin@databricks.com>
Closes#12556 from rxin/SPARK-14792.
## What changes were proposed in this pull request?
The patch removes HiveConf dependency from HiveSqlAstBuilder. This is required in order to merge HiveSqlParser and SparkSqlAstBuilder, which would require getting rid of the Hive specific dependencies in HiveSqlParser.
This patch also accomplishes [SPARK-14778] Remove HiveSessionState.substitutor.
## How was this patch tested?
This should be covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12550 from rxin/SPARK-14782.
## What changes were proposed in this pull request?
In order to fully merge the Hive parser and the SQL parser, we'd need to support variable substitution in Spark. The implementation of the substitute algorithm is mostly copied from Hive, but I simplified the overall structure quite a bit and added more comprehensive test coverage.
Note that this pull request does not yet use this functionality anywhere.
## How was this patch tested?
Added VariableSubstitutionSuite for unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12538 from rxin/SPARK-14769.
## What changes were proposed in this pull request?
3 testcases namely,
```
"count is partially aggregated"
"count distinct is partially aggregated"
"mixed aggregates are partially aggregated"
```
were failing when running PlannerSuite individually.
The PR provides a fix for this.
## How was this patch tested?
unit tests
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: Subhobrata Dey <sbcd90@gmail.com>
Closes#12532 from sbcd90/plannersuitetestsfix.
## What changes were proposed in this pull request?
This PR adds a special log for FileStreamSink for two purposes:
- Versioning. A future Spark version should be able to read the metadata of an old FileStreamSink.
- Compaction. As reading from many small files is usually pretty slow, we should compact small metadata files into big files.
FileStreamSinkLog has a new log format instead of Java serialization format. It will write one log file for each batch. The first line of the log file is the version number, and there are multiple JSON lines following. Each JSON line is a JSON format of FileLog.
FileStreamSinkLog will compact log files every "spark.sql.sink.file.log.compactLen" batches into a big file. When doing a compact, it will read all history logs and merge them with the new batch. During the compaction, it will also delete the files that are deleted (marked by FileLog.action). When the reader uses allLogs to list all files, this method only returns the visible files (drops the deleted files).
## How was this patch tested?
FileStreamSinkLogSuite
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12435 from zsxwing/sink-log.
## What changes were proposed in this pull request?
This PR has two main changes.
1. Move Hive-specific methods from HiveContext to HiveSessionState, which help the work of removing HiveContext.
2. Create a SparkSession Class, which will later be the entry point of Spark SQL users.
## How was this patch tested?
Existing tests
This PR is trying to fix test failures of https://github.com/apache/spark/pull/12485.
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12522 from yhuai/spark-session.
## What changes were proposed in this pull request?
Consider the following directory structure
dir/col=X/some-files
If we create a text format streaming dataframe on `dir/col=X/` then it should not consider as partitioning in columns. Even though the streaming dataframe does not do so, the generated batch dataframes pick up col as a partitioning columns, causing mismatch streaming source schema and generated df schema. This leads to runtime failure:
```
18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: Query query-0 terminated with error
java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8
```
The reason is that the partition inferring code has no idea of a base path, above which it should not search of partitions. This PR makes sure that the batch DF is generated with the basePath set as the original path on which the file stream source is defined.
## How was this patch tested?
New unit test
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12517 from tdas/SPARK-14741.
## What changes were proposed in this pull request?
This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
- ContinuousQuery
- Trigger
- ProcessingTime
in pyspark under `pyspark.sql.streaming`.
In addition, it contains the new methods added under:
- `DataFrameWriter`
a) `startStream`
b) `trigger`
c) `queryName`
- `DataFrameReader`
a) `stream`
- `DataFrame`
a) `isStreaming`
This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
- `exception`
- `sourceStatuses`
- `sinkStatus`
They may be added in a follow up.
This PR also contains some very minor doc fixes in the Scala side.
## How was this patch tested?
Python doc tests
TODO:
- [ ] verify Python docs look good
Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>
Closes#12320 from brkyvz/stream-python.
## What changes were proposed in this pull request?
- replaced `FileSystem.get(conf)` calls with `path.getFileSystem(conf)`
## How was this patch tested?
N/A
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12450 from lw-lin/fix-fs-get.
`MutableProjection` is not thread-safe and we won't use it in multiple threads. I think the reason that we return `() => MutableProjection` is not about thread safety, but to save the costs of generating code when we need same but individual mutable projections.
However, I only found one place that use this [feature](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala#L122-L123), and comparing to the troubles it brings, I think we should generate `MutableProjection` directly instead of return a function.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#7373 from cloud-fan/project.
## What changes were proposed in this pull request?
This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package.
Note that to avoid conflicts, this PR is based on #12153. Effective changes for this PR only consist of the last three commits. Will rebase after merging #12153.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Closes#12361 from liancheng/spark-14407-hide-hadoop-fs-relation.
### What changes were proposed in this pull request?
This PR adds support for in/exists predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms:
- `[NOT] EXISTS(subquery)`
- `[NOT] IN (subquery)`
This PR is (loosely) based on the work of davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/9055). They should be credited for the work they did.
### How was this patch tested?
Modified parsing unit tests.
Added tests to `org.apache.spark.sql.SQLQuerySuite`
cc rxin, davies & chenghao-intel
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12306 from hvanhovell/SPARK-4226.
When `Await.result` throws an exception which originated from a different thread, the resulting stacktrace doesn't include the path leading to the `Await.result` call itself, making it difficult to identify the impact of these exceptions. For example, I've seen cases where broadcast cleaning errors propagate to the main thread and crash it but the resulting stacktrace doesn't include any of the main thread's code, making it difficult to pinpoint which exception crashed that thread.
This patch addresses this issue by explicitly catching, wrapping, and re-throwing exceptions that are thrown by `Await.result`.
I tested this manually using 16b31c8251, a patch which reproduces an issue where an RPC exception which occurs while unpersisting RDDs manages to crash the main thread without any useful stacktrace, and verified that informative, full stacktraces were generated after applying the fix in this PR.
/cc rxin nongli yhuai anabranch
Author: Josh Rosen <joshrosen@databricks.com>
Closes#12433 from JoshRosen/wrap-and-rethrow-await-exceptions.
## What changes were proposed in this pull request?
This PR tries to separate the serialization and deserialization logic from object operators, so that it's easier to eliminate unnecessary serializations in optimizer.
Typed aggregate related operators are special, they will deserialize the input row to multiple objects and it's difficult to simply use a deserializer operator to abstract it, so we still mix the deserialization logic there.
## How was this patch tested?
existing tests and new test in `EliminateSerializationSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12260 from cloud-fan/encoder.
## What changes were proposed in this pull request?
These test suites were removed while refactoring `HadoopFsRelation` related API. This PR brings them back.
This PR also fixes two regressions:
- SPARK-14458, which causes runtime error when saving partitioned tables using `FileFormat` data sources that are not able to infer their own schemata. This bug wasn't detected by any built-in data sources because all of them happen to have schema inference feature.
- SPARK-14566, which happens to be covered by SPARK-14458 and causes wrong query result or runtime error when
- appending a Dataset `ds` to a persisted partitioned data source relation `t`, and
- partition columns in `ds` don't all appear after data columns
## How was this patch tested?
`CommitFailureTestRelationSuite` uses a testing relation that always fails when committing write tasks to test write job cleanup.
`SimpleTextHadoopFsRelationSuite` uses a testing relation to test general `HadoopFsRelation` and `FileFormat` interfaces.
The two regressions are both covered by existing test cases.
Author: Cheng Lian <lian@databricks.com>
Closes#12179 from liancheng/spark-13681-commit-failure-test.
## What changes were proposed in this pull request?
We currently disable codegen for `CaseWhen` if the number of branches is greater than 20 (in CaseWhen.MAX_NUM_CASES_FOR_CODEGEN). It would be better if this value is a non-public config defined in SQLConf.
## How was this patch tested?
Pass the Jenkins tests (including a new testcase `Support spark.sql.codegen.maxCaseBranches option`)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12353 from dongjoon-hyun/SPARK-14577.
## What changes were proposed in this pull request?
This is roughly based on the input metrics logic in `SqlNewHadoopRDD`
## How was this patch tested?
Not sure how to write a test, I manually verified it in Spark UI.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12352 from cloud-fan/metrics.
## What changes were proposed in this pull request?
Per rxin's suggestions, this patch renames `upstreams()` to `inputRDDs()` in `WholeStageCodegen` for better implied semantics
## How was this patch tested?
N/A
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12486 from sameeragarwal/codegen-cleanup.
## What changes were proposed in this pull request?
The `doGenCode` method currently takes in an `ExprCode`, mutates it and returns the java code to evaluate the given expression. It should instead just return a new `ExprCode` to avoid passing around mutable objects during code generation.
## How was this patch tested?
Existing Tests
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12483 from sameeragarwal/new-exprcode-2.
## What changes were proposed in this pull request?
The sort shuffle manager has been the default since Spark 1.2. It is time to remove the old hash shuffle manager.
## How was this patch tested?
Removed some tests related to the old manager.
Author: Reynold Xin <rxin@databricks.com>
Closes#12423 from rxin/SPARK-14667.
## What changes were proposed in this pull request?
Per rxin's suggestions, this patch renames `s/gen/genCode` and `s/genCode/doGenCode` to better reflect the semantics of these 2 function calls.
## How was this patch tested?
N/A (refactoring only)
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12475 from sameeragarwal/gencode.
## What changes were proposed in this pull request?
This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#12463 from yhuai/sharedState.
## What changes were proposed in this pull request?
There are many operations that are currently not supported in the streaming execution. For example:
- joining two streams
- unioning a stream and a batch source
- sorting
- window functions (not time windows)
- distinct aggregates
Furthermore, executing a query with a stream source as a batch query should also fail.
This patch add an additional step after analysis in the QueryExecution which will check that all the operations in the analyzed logical plan is supported or not.
## How was this patch tested?
unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12246 from tdas/SPARK-14473.
## What changes were proposed in this pull request?
This PR aims to add `bound` function (aka Banker's round) by extending current `round` implementation. [Hive supports `bround` since 1.3.0.](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF)
**Hive (1.3 ~ 2.0)**
```
hive> select round(2.5), bround(2.5);
OK
3.0 2.0
```
**After this PR**
```scala
scala> sql("select round(2.5), bround(2.5)").head
res0: org.apache.spark.sql.Row = [3,2]
```
## How was this patch tested?
Pass the Jenkins tests (with extended tests).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12376 from dongjoon-hyun/SPARK-14614.
## What changes were proposed in this pull request?
We currently only have implicit encoders for scala primitive types. We should also add implicit encoders for boxed primitives. Otherwise, the following code would not have an encoder:
```scala
sqlContext.range(1000).map { i => i }
```
## How was this patch tested?
Added a unit test case for this.
Author: Reynold Xin <rxin@databricks.com>
Closes#12466 from rxin/SPARK-14696.
## What changes were proposed in this pull request?
set the input encoder for `TypedColumn` in `RelationalGroupedDataset.agg`.
## How was this patch tested?
new tests in `DatasetAggregatorSuite`
close https://github.com/apache/spark/pull/11269
This PR brings https://github.com/apache/spark/pull/12359 up to date and fix the compile.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12451 from cloud-fan/agg.
## What changes were proposed in this pull request?
The patch fixes the issue with the randomSplit method which is not able to split dataframes which has maps in schema. The bug was introduced in spark 1.6.1.
## How was this patch tested?
Tested with unit tests.
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: Subhobrata Dey <sbcd90@gmail.com>
Closes#12438 from sbcd90/randomSplitIssue.
## What changes were proposed in this pull request?
This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future.
## How was this patch tested?
Existing tests.
Closes#12405
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12447 from yhuai/sharedState.
## What changes were proposed in this pull request?
This is a follow-up to make the max iteration number an internal config.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#12441 from rxin/maxIterConfInternal.
## What changes were proposed in this pull request?
This PR removes
- Inappropriate type notations
For example, from
```scala
words.foreachRDD { (rdd: RDD[String], time: Time) =>
...
```
to
```scala
words.foreachRDD { (rdd, time) =>
...
```
- Extra anonymous closure within functional transformations.
For example,
```scala
.map(item => {
...
})
```
which can be just simply as below:
```scala
.map { item =>
...
}
```
and corrects some obvious style nits.
## How was this patch tested?
This was tested after adding rules in `scalastyle-config.xml`, which ended up with not finding all perfectly.
The rules applied were below:
- For the first correction,
```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
<parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
</check>
```
```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
<parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
</check>
```
- For the second correction
```xml
<check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
<parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
</check>
```
**Those rules were not added**
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12413 from HyukjinKwon/SPARK-style.
## What changes were proposed in this pull request?
set the input encoder for `TypedColumn` in `RelationalGroupedDataset.agg`.
## How was this patch tested?
new tests in `DatasetAggregatorSuite`
close https://github.com/apache/spark/pull/11269
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12359 from cloud-fan/agg.
## What changes were proposed in this pull request?
We currently hard code the max number of optimizer/analyzer iterations to 100. This patch makes it configurable. While I'm at it, I also added the SessionCatalog to the optimizer, so we can use information there in optimization.
## How was this patch tested?
Updated unit tests to reflect the change.
Author: Reynold Xin <rxin@databricks.com>
Closes#12434 from rxin/SPARK-14677.
## What changes were proposed in this pull request?
This PR moves `CurrentDatabase` from sql/hive package to sql/catalyst. It also adds the function description, which looks like the following.
```
scala> sqlContext.sql("describe function extended current_database").collect.foreach(println)
[Function: current_database]
[Class: org.apache.spark.sql.execution.command.CurrentDatabase]
[Usage: current_database() - Returns the current database.]
[Extended Usage:
> SELECT current_database()]
```
## How was this patch tested?
Existing tests
Author: Yin Huai <yhuai@databricks.com>
Closes#12424 from yhuai/SPARK-14668.
## What changes were proposed in this pull request?
This PR uses a better hashing algorithm while probing the AggregateHashMap:
```java
long h = 0
h = (h ^ (0x9e3779b9)) + key_1 + (h << 6) + (h >>> 2);
h = (h ^ (0x9e3779b9)) + key_2 + (h << 6) + (h >>> 2);
h = (h ^ (0x9e3779b9)) + key_3 + (h << 6) + (h >>> 2);
...
h = (h ^ (0x9e3779b9)) + key_n + (h << 6) + (h >>> 2);
return h
```
Depends on: https://github.com/apache/spark/pull/12345
## How was this patch tested?
Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
Aggregate w keys: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
codegen = F 2417 / 2457 8.7 115.2 1.0X
codegen = T hashmap = F 1554 / 1581 13.5 74.1 1.6X
codegen = T hashmap = T 877 / 929 23.9 41.8 2.8X
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12379 from sameeragarwal/hash.
## What changes were proposed in this pull request?
`ExpressionEncoder` is just a container for serialization and deserialization expressions, we can use these expressions to build `TypedAggregateExpression` directly, so that it can fit in `DeclarativeAggregate`, which is more efficient.
One trick is, for each buffer serializer expression, it will reference to the result object of serialization and function call. To avoid re-calculating this result object, we can serialize the buffer object to a single struct field, so that we can use a special `Expression` to only evaluate result object once.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12067 from cloud-fan/typed_udaf.
## What changes were proposed in this pull request?
This patch speeds up group-by aggregates by around 3-5x by leveraging an in-memory `AggregateHashMap` (please see https://github.com/apache/spark/pull/12161), an append-only aggregate hash map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates (and fall back to the `BytesToBytesMap` if a given key isn't found).
Architecturally, it is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the key-value pairs. The index lookups in the array rely on linear probing (with a small number of maximum tries) and use an inexpensive hash function which makes it really efficient for a majority of lookups. However, using linear probing and an inexpensive hash function also makes it less robust as compared to the `BytesToBytesMap` (especially for a large number of keys or even for certain distribution of keys) and requires us to fall back on the latter for correctness.
## How was this patch tested?
Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
Aggregate w keys: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
codegen = F 2124 / 2204 9.9 101.3 1.0X
codegen = T hashmap = F 1198 / 1364 17.5 57.1 1.8X
codegen = T hashmap = T 369 / 600 56.8 17.6 5.8X
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12345 from sameeragarwal/tungsten-aggregate-integration.
## What changes were proposed in this pull request?
Removing references to assembly jar in documentation.
Adding an additional (previously undocumented) usage of spark-submit to run examples.
## How was this patch tested?
Ran spark-submit usage to ensure formatting was fine. Ran examples using SparkSubmit.
Author: Mark Grover <mark@apache.org>
Closes#12365 from markgrover/spark-14601.
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14592
This patch adds native support for DDL command `CREATE TABLE LIKE`.
The SQL syntax is like:
CREATE TABLE table_name LIKE existing_table
CREATE TABLE IF NOT EXISTS table_name LIKE existing_table
## How was this patch tested?
`HiveDDLCommandSuite`. `HiveQuerySuite` already tests `CREATE TABLE LIKE`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
This patch had conflicts when merged, resolved by
Committer: Andrew Or <andrew@databricks.com>
Closes#12362 from viirya/create-table-like.
## What changes were proposed in this pull request?
Currently many public abstract methods (in abstract classes as well as traits) don't declare return types explicitly, such as in [o.a.s.streaming.dstream.InputDStream](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/InputDStream.scala#L110):
```scala
def start() // should be: def start(): Unit
def stop() // should be: def stop(): Unit
```
These methods exist in core, sql, streaming; this PR fixes them.
## How was this patch tested?
N/A
## Which piece of scala style rule led to the changes?
the rule was added separately in https://github.com/apache/spark/pull/12396
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12389 from lw-lin/public-abstract-methods.
#### What changes were proposed in this pull request?
This PR is to provide a native DDL support for the following three Alter View commands:
Based on the Hive DDL document:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
##### 1. ALTER VIEW RENAME
**Syntax:**
```SQL
ALTER VIEW view_name RENAME TO new_view_name
```
- to change the name of a view to a different name
- not allowed to rename a view's name by ALTER TABLE
##### 2. ALTER VIEW SET TBLPROPERTIES
**Syntax:**
```SQL
ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment);
```
- to add metadata to a view
- not allowed to set views' properties by ALTER TABLE
- ignore it if trying to set a view's existing property key when the value is the same
- overwrite the value if trying to set a view's existing key to a different value
##### 3. ALTER VIEW UNSET TBLPROPERTIES
**Syntax:**
```SQL
ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
```
- to remove metadata from a view
- not allowed to unset views' properties by ALTER TABLE
- issue an exception if trying to unset a view's non-existent key
#### How was this patch tested?
Added test cases to verify if it works properly.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12324 from gatorsmile/alterView.
## What changes were proposed in this pull request?
This PR removes extra anonymous closure within functional transformations.
For example,
```scala
.map(item => {
...
})
```
which can be just simply as below:
```scala
.map { item =>
...
}
```
## How was this patch tested?
Related unit tests and `sbt scalastyle`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12382 from HyukjinKwon/minor-extra-closers.
## What changes were proposed in this pull request?
Old `HadoopFsRelation` API includes `buildInternalScan()` which uses `SqlNewHadoopRDD` in `ParquetRelation`.
Because now the old API is removed, `SqlNewHadoopRDD` is not used anymore.
So, this PR removes `SqlNewHadoopRDD` and several unused imports.
This was discussed in https://github.com/apache/spark/pull/12326.
## How was this patch tested?
Several related existing unit tests and `sbt scalastyle`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12354 from HyukjinKwon/SPARK-14596.
## What changes were proposed in this pull request?
When prune the partitions or push down predicates, case-sensitivity is not respected. In order to make it work with case-insensitive, this PR update the AttributeReference inside predicate to use the name from schema.
## How was this patch tested?
Add regression tests for case-insensitive.
Author: Davies Liu <davies@databricks.com>
Closes#12371 from davies/case_insensi.
## What changes were proposed in this pull request?
This patch implements the `CREATE TABLE` command using the `SessionCatalog`. Previously we handled only `CTAS` and `CREATE TABLE ... USING`. This requires us to refactor `CatalogTable` to accept various fields (e.g. bucket and skew columns) and pass them to Hive.
WIP: Note that I haven't verified whether this actually works yet! But I believe it does.
## How was this patch tested?
Tests will come in a future commit.
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12271 from andrewor14/create-table-ddl.
## What changes were proposed in this pull request?
It looks several recent commits for datasources (maybe while removing old `HadoopFsRelation` interface) missed removing some unused imports.
This PR removes some unused imports in datasources.
## How was this patch tested?
`sbt scalastyle` and some unit tests for them.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12326 from HyukjinKwon/minor-imports.
## What changes were proposed in this pull request?
There is a race condition in `StreamExecution.processAllAvailable`. Here is an execution order to reproduce it.
| Time |Thread 1 | MicroBatchThread |
|:-------------:|:-------------:|:-----:|
| 1 | | `dataAvailable in constructNextBatch` returns false |
| 2 | addData(newData) | |
| 3 | `noNewData = false` in processAllAvailable | |
| 4 | | noNewData = true |
| 5 | `noNewData` is true so just return | |
The root cause is that `checking dataAvailable and change noNewData to true` is not atomic. This PR puts these two actions into `synchronized` to make sure they are atomic.
In addition, this PR also has the following changes:
- Make `committedOffsets` and `availableOffsets` volatile to make sure they can be seen in other threads.
- Copy the reference of `availableOffsets` to a local variable so that `sourceStatuses` can use a snapshot of `availableOffsets`.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12339 from zsxwing/race-condition.
## What changes were proposed in this pull request?
This PR improve the performance of SQL UI by:
1) remove the details column in all executions page (the first page in SQL tab). We can check the details by enter the execution page.
2) break-all is super slow in Chrome recently, so switch to break-word.
3) Using "display: none" to hide a block.
4) using one js closure for for all the executions, not one for each.
5) remove the height limitation of details, don't need to scroll it in the tiny window.
## How was this patch tested?
Exists tests.
![ui](https://cloud.githubusercontent.com/assets/40902/14445712/68d7b258-0004-11e6-9b48-5d329b05d165.png)
Author: Davies Liu <davies@databricks.com>
Closes#12311 from davies/ui_perf.
## What changes were proposed in this pull request?
Before we are using `AnalysisException`, `ParseException`, `NoSuchFunctionException` etc when a parsing error encounters. I am trying to make it consistent and also **minimum** code impact to the current implementation by changing the class hierarchy.
1. `NoSuchItemException` is removed, since it is an abstract class and it just simply takes a message string.
2. `NoSuchDatabaseException`, `NoSuchTableException`, `NoSuchPartitionException` and `NoSuchFunctionException` now extends `AnalysisException`, as well as `ParseException`, they are all under `AnalysisException` umbrella, but you can also determine how to use them in a granular way.
## How was this patch tested?
The existing test cases should cover this patch.
Author: bomeng <bmeng@us.ibm.com>
Closes#12314 from bomeng/SPARK-14414.
## What changes were proposed in this pull request?
- `StateStoreConf.**max**DeltasForSnapshot` was renamed to `StateStoreConf.**min**DeltasForSnapshot`
- some state switch checks were added
- improved consistency between method names and string literals
- other comments & typo fix
## How was this patch tested?
N/A
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12323 from lw-lin/streaming-state-clean-up.
## What changes were proposed in this pull request?
Now that we have a single location for storing checkpointed state. This PR just propagates the checkpoint location into FileStreamSource so that we don't have one random log off on its own.
## How was this patch tested?
test("metadataPath should be in checkpointLocation")
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12247 from zsxwing/file-source-log-location.
## What changes were proposed in this pull request?
Now `HadoopFsRelation` with all kinds of file formats can be handled in `FileSourceStrategy`, we can remove the branches for `HadoopFsRelation` in `FileSourceStrategy` and the `buildInternalScan` API from `FileFormat`.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12300 from cloud-fan/remove.
## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/12047/files#diff-94a1f59bcc9b6758c4ca874652437634R529, we may split field expressions codes in `CreateExternalRow` to support wide table. However, the whole stage codegen framework doesn't support it, because the input for expressions is not always the input row, but can be `CodeGenContext.currentVars`, which doesn't work well with `CodeGenContext.splitExpressions`.
Actually we do have a check to guard against this cases, but it's incomplete, it only checks output fields.
This PR improves the whole stage codegen support check, to disable it if there are too many input fields, so that we can avoid splitting field expressions codes in `CreateExternalRow` for whole stage codegen.
TODO: Is it a better solution if we can make `CodeGenContext.currentVars` work well with `CodeGenContext.splitExpressions`?
## How was this patch tested?
new test in DatasetSuite.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12322 from cloud-fan/codegen.
#### What changes were proposed in this pull request?
In this PR, we are trying to address the comment in the original PR: dfce9665c4 (commitcomment-17057030)
In this PR, we checks if table/view exists at the beginning and then does not need to capture the exceptions, including `NoSuchTableException` and `InvalidTableException`. We still capture the NonFatal exception when doing `sqlContext.cacheManager.tryUncacheQuery`.
#### How was this patch tested?
The existing test cases should cover the code changes of this PR.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12321 from gatorsmile/dropViewFollowup.
## What changes were proposed in this pull request?
This implements a few alter table partition commands using the `SessionCatalog`. In particular:
```
ALTER TABLE ... ADD PARTITION ...
ALTER TABLE ... DROP PARTITION ...
ALTER TABLE ... RENAME PARTITION ... TO ...
```
The following operations are not supported, and an `AnalysisException` with a helpful error message will be thrown if the user tries to use them:
```
ALTER TABLE ... EXCHANGE PARTITION ...
ALTER TABLE ... ARCHIVE PARTITION ...
ALTER TABLE ... UNARCHIVE PARTITION ...
ALTER TABLE ... TOUCH ...
ALTER TABLE ... COMPACT ...
ALTER TABLE ... CONCATENATE
MSCK REPAIR TABLE ...
```
## How was this patch tested?
`DDLSuite`, `DDLCommandSuite` and `HiveDDLCommandSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#12220 from andrewor14/alter-partition-ddl.
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14520
`VectorizedParquetInputFormat` inherits `ParquetInputFormat` and overrides `createRecordReader`. However, its overridden `createRecordReader` returns a `ParquetRecordReader`. It should return a `RecordReader`. Otherwise, `ClassCastException` will be thrown.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#12292 from viirya/fix-vectorized-input-format.
## What changes were proposed in this pull request?
1.Added method randomSplitAsList() in Dataset for java
for https://issues.apache.org/jira/browse/SPARK-14372
## How was this patch tested?
TestSuite
Author: Rekha Joshi <rekhajoshm@gmail.com>
Author: Joshi <rekhajoshm@gmail.com>
Closes#12184 from rekhajoshm/SPARK-14372.
#### What changes were proposed in this pull request?
This PR is to address the comment: https://github.com/apache/spark/pull/12146#discussion-diff-59092238. It removes the function `isViewSupported` from `SessionCatalog`. After the removal, we still can capture the user errors if users try to drop a table using `DROP VIEW`.
#### How was this patch tested?
Modified the existing test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12284 from gatorsmile/followupDropTable.
## What changes were proposed in this pull request?
Making them more consistent.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#12289 from davies/cleanup_style.
## What changes were proposed in this pull request?
Currently, `checkstyle` is configured to check the files under `src/main/java`. However, Spark has Java files in `src/main/scala`, too. This PR fixes the following configuration in `pom.xml` and the unchecked-so-far violations on those files.
```xml
-<sourceDirectory>${basedir}/src/main/java</sourceDirectory>
+<sourceDirectories>${basedir}/src/main/java,${basedir}/src/main/scala</sourceDirectories>
```
## How was this patch tested?
After passing the Jenkins build and manually `dev/lint-java`. (Note that Jenkins does not run `lint-java`)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12242 from dongjoon-hyun/SPARK-14465.
## What changes were proposed in this pull request?
This PR is based on #12017
Currently, this causes batches where some values are dictionary encoded and some
which are not. The non-dictionary encoded values cause us to remove the dictionary
from the batch causing the first values to return garbage.
This patch fixes the issue by first decoding the dictionary for the values that are
already dictionary encoded before switching. A similar thing is done for the reverse
case where the initial values are not dictionary encoded.
## How was this patch tested?
This is difficult to test but replicated on a test cluster using a large tpcds data set.
Author: Nong Li <nong@databricks.com>
Author: Davies Liu <davies@databricks.com>
Closes#12279 from davies/fix_dict.
## What changes were proposed in this pull request?
Currently, we use java HashMap for HashedRelation if the key could fit within a Long. The java HashMap and CompactBuffer are not memory efficient, the memory used by them is also accounted accurately.
This PR introduce a LongToUnsafeRowMap (similar to BytesToBytesMap) for better memory efficiency and performance.
This PR reopen#12190 to fix bugs.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#12278 from davies/long_map3.
#### What changes were proposed in this pull request?
This PR is to provide a native support for DDL `DROP VIEW` and `DROP TABLE`. The PR includes native parsing and native analysis.
Based on the HIVE DDL document for [DROP_VIEW_WEB_LINK](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-
DropView
), `DROP VIEW` is defined as,
**Syntax:**
```SQL
DROP VIEW [IF EXISTS] [db_name.]view_name;
```
- to remove metadata for the specified view.
- illegal to use DROP TABLE on a view.
- illegal to use DROP VIEW on a table.
- this command only works in `HiveContext`. In `SQLContext`, we will get an exception.
This PR also handles `DROP TABLE`.
**Syntax:**
```SQL
DROP TABLE [IF EXISTS] table_name [PURGE];
```
- Previously, the `DROP TABLE` command only can drop Hive tables in `HiveContext`. Now, after this PR, this command also can drop temporary table, external table, external data source table in `SQLContext`.
- In `HiveContext`, we will not issue an exception if the to-be-dropped table does not exist and users did not specify `IF EXISTS`. Instead, we just log an error message. If `IF EXISTS` is specified, we will not issue any error message/exception.
- In `SQLContext`, we will issue an exception if the to-be-dropped table does not exist, unless `IF EXISTS` is specified.
- Data will not be deleted if the tables are `external`, unless table type is `managed_table`.
#### How was this patch tested?
For verifying command parsing, added test cases in `spark/sql/hive/HiveDDLCommandSuite.scala`
For verifying command analysis, added test cases in `spark/sql/hive/execution/HiveDDLSuite.scala`
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12146 from gatorsmile/dropView.
#### What changes were proposed in this pull request?
"Not good to slightly ignore all the un-supported options/clauses. We should either support it or throw an exception." A comment from yhuai in another PR https://github.com/apache/spark/pull/12146
- Can `Explain` be an exception? The `Formatted` clause is used in `HiveCompatibilitySuite`.
- Two unsupported clauses in `Drop Table` are handled in a separate PR: https://github.com/apache/spark/pull/12146
#### How was this patch tested?
Test cases are added to verify all the cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12255 from gatorsmile/warningToException.
## What changes were proposed in this pull request?
…because some of built-in functions are not in function registry.
This fix tries to fix issues in `describe function` command where some of the outputs
still shows Hive's function because some built-in functions are not in FunctionRegistry.
The following built-in functions have been added to FunctionRegistry:
```
-
!
*
/
&
%
^
+
<
<=
<=>
=
==
>
>=
|
~
and
in
like
not
or
rlike
when
```
The following listed functions are not added, but hard coded in `commands.scala` (hvanhovell):
```
!=
<>
between
case
```
Below are the existing result of the above functions that have not been added:
```
spark-sql> describe function `!=`;
Function: <>
Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNotEqual
Usage: a <> b - Returns TRUE if a is not equal to b
```
```
spark-sql> describe function `<>`;
Function: <>
Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNotEqual
Usage: a <> b - Returns TRUE if a is not equal to b
```
```
spark-sql> describe function `between`;
Function: between
Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFBetween
Usage: between a [NOT] BETWEEN b AND c - evaluate if a is [not] in between b and c
```
```
spark-sql> describe function `case`;
Function: case
Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFCase
Usage: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f
```
## How was this patch tested?
Existing tests passed. Additional test cases added.
Author: Yong Tang <yong.tang.github@outlook.com>
Closes#12128 from yongtang/SPARK-14335.
## What changes were proposed in this pull request?
Minor issues. Found 2 typos while browsing the code.
## How was this patch tested?
None.
Author: bomeng <bmeng@us.ibm.com>
Closes#12264 from bomeng/SPARK-14496.
## What changes were proposed in this pull request?
Currently, we use java HashMap for HashedRelation if the key could fit within a Long. The java HashMap and CompactBuffer are not memory efficient, the memory used by them is also accounted accurately.
This PR introduce a LongToUnsafeRowMap (similar to BytesToBytesMap) for better memory efficiency and performance.
## How was this patch tested?
Updated existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#12190 from davies/long_map2.
## What changes were proposed in this pull request?
When we first introduced Aggregators, we required the user of Aggregators to (implicitly) specify the encoders. It would actually make more sense to have the encoders be specified by the implementation of Aggregators, since each implementation should have the most state about how to encode its own data type.
Note that this simplifies the Java API because Java users no longer need to explicitly specify encoders for aggregators.
## How was this patch tested?
Updated unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12231 from rxin/SPARK-14451.
## What changes were proposed in this pull request?
Based on our tests, gzip decompression is very slow (< 100MB/s), making queries decompression bound. Snappy can decompress at ~ 500MB/s on a single core.
This patch changes the default compression codec for Parquet output from gzip to snappy, and also introduces a ParquetOptions class to be more consistent with other data sources (e.g. CSV, JSON).
## How was this patch tested?
Should be covered by existing unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12256 from rxin/SPARK-14482.
## What changes were proposed in this pull request?
Cleanups to documentation. No changes to code.
* GBT docs: Move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier,Regressor
* GLM regParam: needs doc saying it is for L2 only
* TrainValidationSplitModel: add .. versionadded:: 2.0.0
* Rename “_transformer_params_from_java” to “_transfer_params_from_java”
* LogReg Summary classes: “probability” col should not say “calibrated”
* LR summaries: coefficientStandardErrors —> document that intercept stderr comes last. Same for t,p-values
* approxCountDistinct: Document meaning of “rsd" argument.
* LDA: note which params are for online LDA only
## How was this patch tested?
Doc build
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12266 from jkbradley/ml-doc-cleanups.
## What changes were proposed in this pull request?
This patch adds support for better handling of exceptions inside catch blocks if the code within the block throws an exception. For instance here is the code in a catch block before this change in `WriterContainer.scala`:
```scala
logError("Aborting task.", cause)
// call failure callbacks first, so we could have a chance to cleanup the writer.
TaskContext.get().asInstanceOf[TaskContextImpl].markTaskFailed(cause)
if (currentWriter != null) {
currentWriter.close()
}
abortTask()
throw new SparkException("Task failed while writing rows.", cause)
```
If `markTaskFailed` or `currentWriter.close` throws an exception, we currently lose the original cause. This PR fixes this problem by implementing a utility function `Utils.tryWithSafeCatch` that suppresses (`Throwable.addSuppressed`) the exception that are thrown within the catch block and rethrowing the original exception.
## How was this patch tested?
No new functionality added
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12234 from sameeragarwal/fix-exception.
## What changes were proposed in this pull request?
In this PR, two changes are proposed for ColumnVector :
1. ColumnVector should be declared as implementing AutoCloseable - it already has close() method
2. In OnHeapColumnVector#reserveInternal(), we only need to allocate new array when existing array is null or the length of existing array is shorter than the newCapacity.
## How was this patch tested?
Existing unit tests.
Author: tedyu <yuzhihong@gmail.com>
Closes#12225 from tedyu/master.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14189
When inferred types in the same field during finding compatible `DataType`, are `IntegralType` and `DecimalType` but `DecimalType` is not capable of the given `IntegralType`, JSON data source simply fails to find a compatible type resulting in `StringType`.
This can be observed when `prefersDecimal` is enabled.
```scala
def mixedIntegerAndDoubleRecords: RDD[String] =
sqlContext.sparkContext.parallelize(
"""{"a": 3, "b": 1.1}""" ::
"""{"a": 3.1, "b": 1}""" :: Nil)
val jsonDF = sqlContext.read
.option("prefersDecimal", "true")
.json(mixedIntegerAndDoubleRecords)
.printSchema()
```
- **Before**
```
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
```
- **After**
```
root
|-- a: decimal(21, 1) (nullable = true)
|-- b: decimal(21, 1) (nullable = true)
```
(Note that integer is inferred as `LongType` which becomes `DecimalType(20, 0)`)
## How was this patch tested?
unit tests were used and style tests by `dev/run_tests`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11993 from HyukjinKwon/SPARK-14189.
## What changes were proposed in this pull request?
This PR resolves the problem during parsing unescaped quotes in input data. For example, currently the data below:
```
"a"b,ccc,ddd
e,f,g
```
produces a data below:
- **Before**
```bash
["a"b,ccc,ddd[\n]e,f,g] <- as a value.
```
- **After**
```bash
["a"b], [ccc], [ddd]
[e], [f], [g]
```
This PR bumps up the Univocity parser's version. This was fixed in `2.0.2`, https://github.com/uniVocity/univocity-parsers/issues/60.
## How was this patch tested?
Unit tests in `CSVSuite` and `sbt/sbt scalastyle`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12226 from HyukjinKwon/SPARK-14103-quote.
## What changes were proposed in this pull request?
We implement typed filter by `MapPartitions`, which doesn't work well with whole stage codegen. This PR use `Filter` to implement typed filter and we can get the whole stage codegen support for free.
This PR also introduced `DeserializeToObject` and `SerializeFromObject`, to seperate serialization logic from object operator, so that it's eaiser to write optimization rules for adjacent object operators.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12061 from cloud-fan/whole-stage-codegen.
## What changes were proposed in this pull request?
This is a followup to #12117 and addresses some of the TODOs introduced there. In particular, the resolution of database is now pushed into session catalog, which knows about the current database. Further, the logic for checking whether a function exists is pushed into the external catalog.
No change in functionality is expected.
## How was this patch tested?
`SessionCatalogSuite`, `DDLSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#12198 from andrewor14/function-exists.
## What changes were proposed in this pull request?
In DataSource#write method, the variables `dataSchema` and `equality`, and related logics are no longer used. Let's remove them.
## How was this patch tested?
Existing tests.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#12237 from sarutak/SPARK-14456.
## What changes were proposed in this pull request?
This patch removes DirectParquetOutputCommitter. This was initially created by Databricks as a faster way to write Parquet data to S3. However, given how the underlying S3 Hadoop implementation works, this committer only works when there are no failures. If there are multiple attempts of the same task (e.g. speculation or task failures or node failures), the output data can be corrupted. I don't think this performance optimization outweighs the correctness issue.
## How was this patch tested?
Removed the related tests also.
Author: Reynold Xin <rxin@databricks.com>
Closes#12229 from rxin/SPARK-10063.
## What changes were proposed in this pull request?
The Scala Dataset public API currently only allows users to specify encoders through SQLContext.implicits. This is OK but sometimes people want to explicitly get encoders without a SQLContext (e.g. Aggregator implementations). This patch adds public APIs to Encoders class for getting Scala encoders.
## How was this patch tested?
None - I will update test cases once https://github.com/apache/spark/pull/12231 is merged.
Author: Reynold Xin <rxin@databricks.com>
Closes#12232 from rxin/SPARK-14452.
### What changes were proposed in this pull request?
This PR adds support for `LEFT ANTI JOIN` to Spark SQL. A `LEFT ANTI JOIN` is the exact opposite of a `LEFT SEMI JOIN` and can be used to identify rows in one dataset that are not in another dataset. Note that `nulls` on the left side of the join cannot match a row on the right hand side of the join; the result is that left anti join will always select a row with a `null` in one or more of its keys.
We currently add support for the following SQL join syntax:
SELECT *
FROM tbl1 A
LEFT ANTI JOIN tbl2 B
ON A.Id = B.Id
Or using a dataframe:
tbl1.as("a").join(tbl2.as("b"), $"a.id" === $"b.id", "left_anti)
This PR provides serves as the basis for implementing `NOT EXISTS` and `NOT IN (...)` correlated sub-queries. It would also serve as good basis for implementing an more efficient `EXCEPT` operator.
The PR has been (losely) based on PR's by both davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/10563); credit should be given where credit is due.
This PR adds supports for `LEFT ANTI JOIN` to `BroadcastHashJoin` (including codegeneration), `ShuffledHashJoin` and `BroadcastNestedLoopJoin`.
### How was this patch tested?
Added tests to `JoinSuite` and ported `ExistenceJoinSuite` from https://github.com/apache/spark/pull/10563.
cc davies chenghao-intel rxin
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12214 from hvanhovell/SPARK-12610.
## What changes were proposed in this pull request?
According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the followings.
```
/** In Spark, we don't use the ScalaDoc style so this
* is not correct.
*/
```
## How was this patch tested?
Pass the Jenkins tests (including `lint-scala`).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12221 from dongjoon-hyun/SPARK-14444.
## What changes were proposed in this pull request?
1) fix the RowEncoder for wide table (many columns) by splitting the generate code into multiple functions.
2) Separate DataSourceScan as RowDataSourceScan and BatchedDataSourceScan
3) Disable the returning columnar batch in parquet reader if there are many columns.
4) Added a internal config for maximum number of fields (nested) columns supported by whole stage codegen.
Closes#12098
## How was this patch tested?
Add a tests for table with 1000 columns.
Author: Davies Liu <davies@databricks.com>
Closes#12047 from davies/many_columns.
## What changes were proposed in this pull request?
In order to leverage a data structure like `AggregateHashMap` (https://github.com/apache/spark/pull/12055) to speed up aggregates with keys, we need to make `ColumnarBatch.Row` mutable.
## How was this patch tested?
Unit test in `ColumnarBatchSuite`. Also, tested via `BenchmarkWholeStageCodegen`.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12103 from sameeragarwal/mutable-row.
This PR exposes the internal testing `MemorySink` though the data source API. This will allow users to easily test streaming applications in the Spark shell or other local tests.
Usage:
```scala
inputStream.write
.format("memory")
.queryName("memStream")
.startStream()
// Now you can query the result of the stream here.
sqlContext.table("memStream")
```
The most complicated part of the logic is choosing the checkpoint directory. There are a few requirements we are attempting to satisfy here:
- when working in the shell locally, it should just work with no extra configuration.
- when working on a cluster you should be able to make it easily create the checkpoint on a distributed file system so you can test aggregation (state checkpoints are also stored in this directory and must be accessible from workers).
- it should be clear that you can't resume since the data is just in memory.
The chosen algorithm proceeds as follows:
- the user gives a checkpoint directory, use it
- if the conf has a checkpoint location, use `$location/$queryName`
- if neither, create a local directory
- always check to make sure there are no offsets written to the directory
Author: Michael Armbrust <michael@databricks.com>
Closes#12119 from marmbrus/memorySink.
#### What changes were proposed in this pull request?
Because the concept of partitioning is associated with physical tables, we disable all the supports of partitioned views, which are defined in the following three commands in [Hive DDL Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
```
ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];
ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...;
```
An exception is thrown when users issue any of these three DDL commands.
#### How was this patch tested?
Added test cases for parsing create view and changed the existing test cases to verify if the exceptions are thrown.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12169 from gatorsmile/viewPartition.
## What changes were proposed in this pull request?
This is just a followup to #12121, which implemented the alter table DDLs using the `SessionCatalog`. Specially, this corrects the behavior of setting the location of a datasource table. For datasource tables, we need to set the `locationUri` in addition to the `path` entry in the serde properties. Additionally, changing the location of a datasource table partition is not allowed.
## How was this patch tested?
`DDLSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#12186 from andrewor14/alter-table-ddl-followup.
## What changes were proposed in this pull request?
This PR adds a new operator `MapElements` for `Dataset.map`, it's a 1-1 mapping and is easier to adapt to whole stage codegen framework.
## How was this patch tested?
new test in `WholeStageCodegenSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12087 from cloud-fan/map.
Because SQL keeps track of all known configs, some customization was
needed in SQLConf to allow that, since the core API does not have that
feature.
Tested via existing (and slightly updated) unit tests.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11570 from vanzin/SPARK-529-sql.
## What changes were proposed in this pull request?
onQueryProgress is asynchronous so the user may see some future status of `ContinuousQuery`. This PR just updated comments to warn it.
## How was this patch tested?
Only updated comments.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12180 from zsxwing/ContinuousQueryListener-doc.
## What changes were proposed in this pull request?
In Spark 2.0, we want to handle the most common `ALTER TABLE` commands ourselves instead of passing the entire query text to Hive. This is done using the new `SessionCatalog` API introduced recently.
The commands supported in this patch include:
```
ALTER TABLE ... RENAME TO ...
ALTER TABLE ... SET TBLPROPERTIES ...
ALTER TABLE ... UNSET TBLPROPERTIES ...
ALTER TABLE ... SET LOCATION ...
ALTER TABLE ... SET SERDE ...
```
The commands we explicitly do not support are:
```
ALTER TABLE ... CLUSTERED BY ...
ALTER TABLE ... SKEWED BY ...
ALTER TABLE ... NOT CLUSTERED
ALTER TABLE ... NOT SORTED
ALTER TABLE ... NOT SKEWED
ALTER TABLE ... NOT STORED AS DIRECTORIES
```
For these we throw exceptions complaining that they are not supported.
## How was this patch tested?
`DDLSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#12121 from andrewor14/alter-table-ddl.
## What changes were proposed in this pull request?
The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
This PR adds the Python, and SQL, API for this function.
With this PR, SQL, Java, and Scala will share the same APIs as in users can use:
- `window(timeColumn, windowDuration)`
- `window(timeColumn, windowDuration, slideDuration)`
- `window(timeColumn, windowDuration, slideDuration, startTime)`
In Python, users can access all APIs above, but in addition they can do
- In Python:
`window(timeColumn, windowDuration, startTime=...)`
that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows.
## How was this patch tested?
Unit tests + manual tests
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#12136 from brkyvz/python-windows.
## What changes were proposed in this pull request?
This PR implements CreateFunction and DropFunction commands. Besides implementing these two commands, we also change how to manage functions. Here are the main changes.
* `FunctionRegistry` will be a container to store all functions builders and it will not actively load any functions. Because of this change, we do not need to maintain a separate registry for HiveContext. So, `HiveFunctionRegistry` is deleted.
* SessionCatalog takes care the job of loading a function if this function is not in the `FunctionRegistry` but its metadata is stored in the external catalog. For this case, SessionCatalog will (1) load the metadata from the external catalog, (2) load all needed resources (i.e. jars and files), (3) create a function builder based on the function definition, (4) register the function builder in the `FunctionRegistry`.
* A `UnresolvedGenerator` is created. So, the parser will not need to call `FunctionRegistry` directly during parsing, which is not a good time to create a Hive UDTF. In the analysis phase, we will resolve `UnresolvedGenerator`.
This PR is based on viirya's https://github.com/apache/spark/pull/12036/
## How was this patch tested?
Existing tests and new tests.
## TODOs
[x] Self-review
[x] Cleanup
[x] More tests for create/drop functions (we need to more tests for permanent functions).
[ ] File JIRAs for all TODOs
[x] Standardize the error message when a function does not exist.
Author: Yin Huai <yhuai@databricks.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#12117 from yhuai/function.
## What changes were proposed in this pull request?
Make StreamingRelation store the closure to create the source in StreamExecution so that we can start multiple continuous queries from the same DataFrame.
## How was this patch tested?
`test("DataFrame reuse")`
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12049 from zsxwing/df-reuse.
## What changes were proposed in this pull request?
This PR adds Native execution of SHOW TBLPROPERTIES command.
Command Syntax:
``` SQL
SHOW TBLPROPERTIES table_name[(property_key_literal)]
```
## How was this patch tested?
Tests added in HiveComandSuiie and DDLCommandSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#12133 from dilipbiswal/dkb_show_tblproperties.
## What changes were proposed in this pull request?
This adds the corresponding Java static functions for built-in typed aggregates already exposed in Scala.
## How was this patch tested?
Unit tests.
rxin
Author: Eric Liang <ekl@databricks.com>
Closes#12168 from ericl/sc-2794.
With the addition of StreamExecution (ContinuousQuery) to Datasets, data will become unbounded. With unbounded data, the execution of some methods and operations will not make sense, e.g. `Dataset.count()`.
A simple API is required to check whether the data in a Dataset is bounded or unbounded. This will allow users to check whether their Dataset is in streaming mode or not. ML algorithms may check if the data is unbounded and throw an exception for example.
The implementation of this method is simple, however naming it is the challenge. Some possible names for this method are:
- isStreaming
- isContinuous
- isBounded
- isUnbounded
I've gone with `isStreaming` for now. We can change it before Spark 2.0 if we decide to come up with a different name. For that reason I've marked it as `Experimental`
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#12080 from brkyvz/is-streaming.
## What changes were proposed in this pull request?
This PR basically re-do the things in #12068 but with a different model, which should work better in case of small files with different sizes.
## How was this patch tested?
Updated existing tests.
Ran a query on thousands of partitioned small files locally, with all default settings (the cost to open a file should be over estimated), the durations of tasks become smaller and smaller, which is good (the last few tasks will be shortest).
Author: Davies Liu <davies@databricks.com>
Closes#12095 from davies/file_cost.
## What changes were proposed in this pull request?
RDD.toLocalIterator() could be used to fetch one partition at a time to reduce the memory usage. Right now, for Dataset/Dataframe we have to use df.rdd.toLocalIterator, which is super slow also requires lots of memory (because of the Java serializer or even Kyro serializer).
This PR introduce an optimized toLocalIterator for Dataset/DataFrame, which is much faster and requires much less memory. For a partition with 5 millions rows, `df.rdd.toIterator` took about 100 seconds, but df.toIterator took less than 7 seconds. For 10 millions row, rdd.toIterator will crash (not enough memory) with 4G heap, but df.toLocalIterator could finished in 12 seconds.
The JDBC server has been updated to use DataFrame.toIterator.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#12114 from davies/local_iterator.
## What changes were proposed in this pull request?
Currently we extract Python UDFs into a special logical plan EvaluatePython in analyzer, But EvaluatePython is not part of catalyst, many rules have no knowledge of it , which will break many things (for example, filter push down or column pruning).
We should treat Python UDFs as normal expressions, until we want to evaluate in physical plan, we could extract them in end of optimizer, or physical plan.
This PR extract Python UDFs in physical plan.
Closes#10935
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#12127 from davies/py_udf.
## What changes were proposed in this pull request?
Add a processing time trigger to control the batch processing speed
## How was this patch tested?
Unit tests
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11976 from zsxwing/trigger.
## What changes were proposed in this pull request?
This PR did a few cleanup on HashedRelation and HashJoin:
1) Merge HashedRelation and UniqueHashedRelation together
2) Return an iterator from HashedRelation, so we donot need a create many UnsafeRow objects.
3) Return a copy of HashedRelation for thread-safety in BroadcastJoin, so we can re-use the UnafeRow objects.
4) Cleanup HashJoin, share most of the code between BroadcastHashJoin and ShuffleHashJoin
5) Removed UniqueLongHashedRelation, which will be replaced by LongUnsafeMap (another PR).
6) Update benchmark, before this patch, the selectivity of joins are too high.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#12102 from davies/cleanup_hash.
## What changes were proposed in this pull request?
We recently added the ability to dump the generated code for a given query. However, the method is only available through an implicit after an import. It'd slightly simplify things if it can be called directly in queryExecution.
## How was this patch tested?
Manually tested in spark-shell.
Author: Reynold Xin <rxin@databricks.com>
Closes#12144 from rxin/SPARK-14360.
## What changes were proposed in this pull request?
Update DebugQuery to work on Datasets of any type, not just DataFrames.
## How was this patch tested?
Added unit tests, checked in spark-shell.
Author: Matei Zaharia <matei@databricks.com>
Closes#12140 from mateiz/debug-dataset.
## What changes were proposed in this pull request?
This PR contains the following 5 types of maintenance fix over 59 files (+94 lines, -93 lines).
- Fix typos(exception/log strings, testcase name, comments) in 44 lines.
- Fix lint-java errors (MaxLineLength) in 6 lines. (New codes after SPARK-14011)
- Use diamond operators in 40 lines. (New codes after SPARK-13702)
- Fix redundant semicolon in 5 lines.
- Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala.
## How was this patch tested?
Manual and pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12139 from dongjoon-hyun/SPARK-14355.
## What changes were proposed in this pull request?
EXPLAIN output should be in a single cell.
**Before**
```
scala> sql("explain select 1").collect()
res0: Array[org.apache.spark.sql.Row] = Array([== Physical Plan ==], [WholeStageCodegen], [: +- Project [1 AS 1#1]], [: +- INPUT], [+- Scan OneRowRelation[]])
```
**After**
```
scala> sql("explain select 1").collect()
res1: Array[org.apache.spark.sql.Row] =
Array([== Physical Plan ==
WholeStageCodegen
: +- Project [1 AS 1#4]
: +- INPUT
+- Scan OneRowRelation[]])
```
Or,
```
scala> sql("explain select 1").head
res1: org.apache.spark.sql.Row =
[== Physical Plan ==
WholeStageCodegen
: +- Project [1 AS 1#5]
: +- INPUT
+- Scan OneRowRelation[]]
```
Please note that `Spark-shell(Scala-shell)` trims long string output. So, you may need to use `println` to get full strings.
```
scala> println(sql("explain codegen select 'a' as a group by 1").head)
[Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 ==
WholeStageCodegen
...
/* 059 */ }
/* 060 */ }
]
```
## How was this patch tested?
Pass the Jenkins tests. (Testcases are updated.)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12137 from dongjoon-hyun/SPARK-14350.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14231
Currently, JSON data source supports to infer `DecimalType` for big numbers and `floatAsBigDecimal` option which reads floating-point values as `DecimalType`.
But there are few restrictions in Spark `DecimalType` below:
1. The precision cannot be bigger than 38.
2. scale cannot be bigger than precision.
Currently, both restrictions are not being handled.
This PR handles the cases by inferring them as `DoubleType`. Also, the option name was changed from `floatAsBigDecimal` to `prefersDecimal` as suggested [here](https://issues.apache.org/jira/browse/SPARK-14231?focusedCommentId=15215579&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15215579).
So, the codes below:
```scala
def doubleRecords: RDD[String] =
sqlContext.sparkContext.parallelize(
s"""{"a": 1${"0" * 38}, "b": 0.01}""" ::
s"""{"a": 2${"0" * 38}, "b": 0.02}""" :: Nil)
val jsonDF = sqlContext.read
.option("prefersDecimal", "true")
.json(doubleRecords)
jsonDF.printSchema()
```
produces below:
- **Before**
```scala
org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
at
...
```
- **After**
```scala
root
|-- a: double (nullable = true)
|-- b: double (nullable = true)
```
## How was this patch tested?
Unit tests were used and `./dev/run_tests` for coding style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12030 from HyukjinKwon/SPARK-14231.
## What changes were proposed in this pull request?
This PR aims to fix all Scala-Style multiline comments into Java-Style multiline comments in Scala codes.
(All comment-only changes over 77 files: +786 lines, −747 lines)
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12130 from dongjoon-hyun/use_multiine_javadoc_comments.
## What changes were proposed in this pull request?
Typo fixes. No functional changes.
## How was this patch tested?
Built the sources and ran with samples.
Author: Jacek Laskowski <jacek@japila.pl>
Closes#11802 from jaceklaskowski/typo-fixes.
## What changes were proposed in this pull request?
While trying to create a PR (which was not an issue at the end), I just corrected some style nits.
So, I removed the changes except for some coding style corrections.
- According to the [scala-style-guide#documentation-style](https://github.com/databricks/scala-style-guide#documentation-style), Scala style comments are discouraged.
>```scala
>/** This is a correct one-liner, short description. */
>
>/**
> * This is correct multi-line JavaDoc comment. And
> * this is my second line, and if I keep typing, this would be
> * my third line.
> */
>
>/** In Spark, we don't use the ScalaDoc style so this
> * is not correct.
> */
>```
- Double newlines between consecutive methods was removed. According to [scala-style-guide#blank-lines-vertical-whitespace](https://github.com/databricks/scala-style-guide#blank-lines-vertical-whitespace), single newline appears when
>Between consecutive members (or initializers) of a class: fields, constructors, methods, nested classes, static initializers, instance initializers.
- Remove uesless parentheses in tests
- Use `mapPartitions` instead of `mapPartitionsWithIndex()`.
## How was this patch tested?
Unit tests were used and `dev/run_tests` for style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12109 from HyukjinKwon/SPARK-14271.
## What changes were proposed in this pull request?
In the Dataset API, it is fairly difficult for users to perform simple aggregations in a type-safe way at the moment because there are no aggregators that have been implemented. This pull request adds a few common aggregate functions in expressions.scala.typed package, and also creates the expressions.java.typed package without implementation. The java implementation should probably come as a separate pull request. One challenge there is to resolve the type difference between Scala primitive types and Java boxed types.
## How was this patch tested?
Added unit tests for them.
Author: Reynold Xin <rxin@databricks.com>
Closes#12077 from rxin/SPARK-14285.
## What changes were proposed in this pull request?
This PR implements `EXPLAIN CODEGEN` SQL command which returns generated codes like `debugCodegen`. In `spark-shell`, we don't need to `import debug` module. In `spark-sql`, we can use this SQL command now.
**Before**
```
scala> import org.apache.spark.sql.execution.debug._
scala> sql("select 'a' as a group by 1").debugCodegen()
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 ==
...
Generated code:
...
== Subtree 2 / 2 ==
...
Generated code:
...
```
**After**
```
scala> sql("explain extended codegen select 'a' as a group by 1").collect().foreach(println)
[Found 2 WholeStageCodegen subtrees.]
[== Subtree 1 / 2 ==]
...
[]
[Generated code:]
...
[]
[== Subtree 2 / 2 ==]
...
[]
[Generated code:]
...
```
## How was this patch tested?
Pass the Jenkins tests (including new testcases)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12099 from dongjoon-hyun/SPARK-14251.
## What changes were proposed in this pull request?
This PR reduces Java byte code size of method in ```SpecificColumnarIterator``` by using a approach to make a group for lot of ```ColumnAccessor``` instantiations or method calls (more than 200) into a method
## How was this patch tested?
Added a new unit test, which includes large instantiations and method calls, to ```InMemoryColumnarQuerySuite```
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#12108 from kiszk/SPARK-14138-master.
## What changes were proposed in this pull request?
`SizeBasedWindowFunction.n` is a global singleton attribute created for evaluating size based aggregate window functions like `CUME_DIST`. However, this attribute gets different expression IDs when created on both driver side and executor side. This PR adds `withPartitionSize` method to `SizeBasedWindowFunction` so that we can easily rewrite `SizeBasedWindowFunction.n` on executor side.
## How was this patch tested?
A test case is added in `HiveSparkSubmitSuite`, which supports launching multi-process clusters.
Author: Cheng Lian <lian@databricks.com>
Closes#12040 from liancheng/spark-14244-fix-sized-window-function.
This PR adds the ability to perform aggregations inside of a `ContinuousQuery`. In order to implement this feature, the planning of aggregation has augmented with a new `StatefulAggregationStrategy`. Unlike batch aggregation, stateful-aggregation uses the `StateStore` (introduced in #11645) to persist the results of partial aggregation across different invocations. The resulting physical plan performs the aggregation using the following progression:
- Partial Aggregation
- Shuffle
- Partial Merge (now there is at most 1 tuple per group)
- StateStoreRestore (now there is 1 tuple from this batch + optionally one from the previous)
- Partial Merge (now there is at most 1 tuple per group)
- StateStoreSave (saves the tuple for the next batch)
- Complete (output the current result of the aggregation)
The following refactoring was also performed to allow us to plug into existing code:
- The get/put implementation is taken from #12013
- The logic for breaking down and de-duping the physical execution of aggregation has been move into a new pattern `PhysicalAggregation`
- The `AttributeReference` used to identify the result of an `AggregateFunction` as been moved into the `AggregateExpression` container. This change moves the reference into the same object as the other intermediate references used in aggregation and eliminates the need to pass around a `Map[(AggregateFunction, Boolean), Attribute]`. Further clean up (using a different aggregation container for logical/physical plans) is deferred to a followup.
- Some planning logic is moved from the `SessionState` into the `QueryExecution` to make it easier to override in the streaming case.
- The ability to write a `StreamTest` that checks only the output of the last batch has been added to simulate the future addition of output modes.
Author: Michael Armbrust <michael@databricks.com>
Closes#12048 from marmbrus/statefulAgg.
## What changes were proposed in this pull request?
RpcEndpoint is not thread safe and allows multiple messages to be processed at the same time. StateStoreCoordinator should use ThreadSafeRpcEndpoint.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12100 from zsxwing/fix-StateStoreCoordinator.
JIRA: https://issues.apache.org/jira/browse/SPARK-13674
## What changes were proposed in this pull request?
Sample operator doesn't support wholestage codegen now. This pr is to add support to it.
## How was this patch tested?
A test is added into `BenchmarkWholeStageCodegen`. Besides, all tests should be passed.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11517 from viirya/add-wholestage-sample.
## What changes were proposed in this pull request?
This PR adds the function `window` as a column expression.
`window` can be used to bucket rows into time windows given a time column. With this expression, performing time series analysis on batch data, as well as streaming data should become much more simpler.
### Usage
Assume the following schema:
`sensor_id, measurement, timestamp`
To average 5 minute data every 1 minute (window length of 5 minutes, slide duration of 1 minute), we will use:
```scala
df.groupBy(window("timestamp", “5 minutes”, “1 minute”), "sensor_id")
.agg(mean("measurement").as("avg_meas"))
```
This will generate windows such as:
```
09:00:00-09:05:00
09:01:00-09:06:00
09:02:00-09:07:00 ...
```
Intervals will start at every `slideDuration` starting at the unix epoch (1970-01-01 00:00:00 UTC).
To start intervals at a different point of time, e.g. 30 seconds after a minute, the `startTime` parameter can be used.
```scala
df.groupBy(window("timestamp", “5 minutes”, “1 minute”, "30 second"), "sensor_id")
.agg(mean("measurement").as("avg_meas"))
```
This will generate windows such as:
```
09:00:30-09:05:30
09:01:30-09:06:30
09:02:30-09:07:30 ...
```
Support for Python will be made in a follow up PR after this.
## How was this patch tested?
This patch has some basic unit tests for the `TimeWindow` expression testing that the parameters pass validation, and it also has some unit/integration tests testing the correctness of the windowing and usability in complex operations (multi-column grouping, multi-column projections, joins).
Author: Burak Yavuz <brkyvz@gmail.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#12008 from brkyvz/df-time-window.
## What changes were proposed in this pull request?
This PR updates the usage comments of `debug` according to the following commits.
- [SPARK-9754](https://issues.apache.org/jira/browse/SPARK-9754) removed `typeCheck`.
- [SPARK-14227](https://issues.apache.org/jira/browse/SPARK-14227) added `debugCodegen`.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12094 from dongjoon-hyun/minor_fix_debug_usage.
## What changes were proposed in this pull request?
This PR addresses the following
1. Supports native execution of SHOW DATABASES command
2. Fixes SHOW TABLES to apply the identifier_with_wildcards pattern if supplied.
SHOW TABLE syntax
```
SHOW TABLES [IN database_name] ['identifier_with_wildcards'];
```
SHOW DATABASES syntax
```
SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards'];
```
## How was this patch tested?
Tests added in SQLQuerySuite (both hive and sql contexts) and DDLCommandSuite
Note: Since the table name pattern was not working , tests are added in both SQLQuerySuite to
verify the application of the table pattern.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#11991 from dilipbiswal/dkb_show_database.
## What changes were proposed in this pull request?
This PR implements `FileFormat.buildReader()` for the LibSVM data source. Besides that, a new interface method `prepareRead()` is added to `FileFormat`:
```scala
def prepareRead(
sqlContext: SQLContext,
options: Map[String, String],
files: Seq[FileStatus]): Map[String, String] = options
```
After migrating from `buildInternalScan()` to `buildReader()`, we lost the opportunity to collect necessary global information, since `buildReader()` works in a per-partition manner. For example, LibSVM needs to infer the total number of features if the `numFeatures` data source option is not set. Any necessary collected global information should be returned using the data source options map. By default, this method just returns the original options untouched.
An alternative approach is to absorb `inferSchema()` into `prepareRead()`, since schema inference is also some kind of global information gathering. However, this approach wasn't chosen because schema inference is optional, while `prepareRead()` must be called whenever a `HadoopFsRelation` based data source relation is instantiated.
One unaddressed problem is that, when `numFeatures` is absent, now the input data will be scanned twice. The `buildInternalScan()` code path doesn't need to do this because it caches the raw parsed RDD in memory before computing the total number of features. However, with `FileScanRDD`, the raw parsed RDD is created in a different way (e.g. partitioning) from the final RDD.
## How was this patch tested?
Tested using existing test suites.
Author: Cheng Lian <lian@databricks.com>
Closes#12088 from liancheng/spark-14295-libsvm-build-reader.
## What changes were proposed in this pull request?
This PR support multiple Python UDFs within single batch, also improve the performance.
```python
>>> from pyspark.sql.types import IntegerType
>>> sqlContext.registerFunction("double", lambda x: x * 2, IntegerType())
>>> sqlContext.registerFunction("add", lambda x, y: x + y, IntegerType())
>>> sqlContext.sql("SELECT double(add(1, 2)), add(double(2), 1)").explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias('double('add(1, 2)), None),unresolvedalias('add('double(2), 1), None)]
+- OneRowRelation$
== Analyzed Logical Plan ==
double(add(1, 2)): int, add(double(2), 1): int
Project [double(add(1, 2))#14,add(double(2), 1)#15]
+- Project [double(add(1, 2))#14,add(double(2), 1)#15]
+- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
+- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
+- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
+- OneRowRelation$
== Optimized Logical Plan ==
Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
+- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
+- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
+- OneRowRelation$
== Physical Plan ==
WholeStageCodegen
: +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
: +- INPUT
+- !BatchPythonEvaluation [add(pythonUDF1#17, 1)], [pythonUDF0#16,pythonUDF1#17,pythonUDF0#18]
+- !BatchPythonEvaluation [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
+- Scan OneRowRelation[]
```
## How was this patch tested?
Added new tests.
Using the following script to benchmark 1, 2 and 3 udfs,
```
df = sqlContext.range(1, 1 << 23, 1, 4)
double = F.udf(lambda x: x * 2, LongType())
print df.select(double(df.id)).count()
print df.select(double(df.id), double(df.id + 1)).count()
print df.select(double(df.id), double(df.id + 1), double(df.id + 2)).count()
```
Here is the results:
N | Before | After | speed up
---- |------------ | -------------|------
1 | 22 s | 7 s | 3.1X
2 | 38 s | 13 s | 2.9X
3 | 58 s | 16 s | 3.6X
This benchmark ran locally with 4 CPUs. For 3 UDFs, it launched 12 Python before before this patch, 4 process after this patch. After this patch, it will use less memory for multiple UDFs than before (less buffering).
Author: Davies Liu <davies@databricks.com>
Closes#12057 from davies/multi_udfs.
This PR is to provide native parsing support for DDL commands: `Alter View`. Since its AST trees are highly similar to `Alter Table`. Thus, both implementation are integrated into the same one.
Based on the Hive DDL document:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL and https://cwiki.apache.org/confluence/display/Hive/PartitionedViews
**Syntax:**
```SQL
ALTER VIEW view_name RENAME TO new_view_name
```
- to change the name of a view to a different name
**Syntax:**
```SQL
ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment);
```
- to add metadata to a view
**Syntax:**
```SQL
ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
```
- to remove metadata from a view
**Syntax:**
```SQL
ALTER VIEW view_name ADD [IF NOT EXISTS] PARTITION spec1[, PARTITION spec2, ...]
```
- to add the partitioning metadata for a view.
- the syntax of partition spec in `ALTER VIEW` is identical to `ALTER TABLE`, **EXCEPT** that it is **ILLEGAL** to specify a `LOCATION` clause.
**Syntax:**
```SQL
ALTER VIEW view_name DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...]
```
- to drop the related partition metadata for a view.
Added the related test cases to `DDLCommandSuite`
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#11987 from gatorsmile/parseAlterView.
## What changes were proposed in this pull request?
Fixes a minor bug in the record reader constructor that was possibly introduced during refactoring.
## How was this patch tested?
N/A
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12070 from sameeragarwal/vectorized-rr.
## What changes were proposed in this pull request?
This PR proposes a new data-structure based on a vectorized hashmap that can be potentially _codegened_ in `TungstenAggregate` to speed up aggregates with group by. Micro-benchmarks show a 10x improvement over the current `BytesToBytes` aggregation map.
## How was this patch tested?
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
BytesToBytesMap: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
hash 108 / 119 96.9 10.3 1.0X
fast hash 63 / 70 166.2 6.0 1.7X
arrayEqual 70 / 73 150.8 6.6 1.6X
Java HashMap (Long) 141 / 200 74.3 13.5 0.8X
Java HashMap (two ints) 145 / 185 72.3 13.8 0.7X
Java HashMap (UnsafeRow) 499 / 524 21.0 47.6 0.2X
BytesToBytesMap (off Heap) 483 / 548 21.7 46.0 0.2X
BytesToBytesMap (on Heap) 485 / 562 21.6 46.2 0.2X
Vectorized Hashmap 54 / 60 193.7 5.2 2.0X
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12055 from sameeragarwal/vectorized-hashmap.
### What changes were proposed in this pull request?
This PR removes the ANTLR3 based parser, and moves the new ANTLR4 based parser into the `org.apache.spark.sql.catalyst.parser package`.
### How was this patch tested?
Existing unit tests.
cc rxin andrewor14 yhuai
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12071 from hvanhovell/SPARK-14211.
## What changes were proposed in this pull request?
Major changes:
1. Implement `FileFormat.buildReader()` for the CSV data source.
1. Add an extra argument to `FileFormat.buildReader()`, `physicalSchema`, which is basically the result of `FileFormat.inferSchema` or user specified schema.
This argument is necessary because the CSV data source needs to know all the columns of the underlying files to read the file.
## How was this patch tested?
Existing tests should do the work.
Author: Cheng Lian <lian@databricks.com>
Closes#12002 from liancheng/spark-14206-csv-build-reader.
## What changes were proposed in this pull request?
This change resolves an issue where `DataFrameNaFunctions.fill` changes a `FloatType` column to a `DoubleType`. We also clarify the contract that replacement values will be cast to the column data type, which may change the replacement value when casting to a lower precision type.
## How was this patch tested?
This patch has associated unit tests.
Author: Travis Crawford <travis@medium.com>
Closes#11967 from traviscrawford/SPARK-14081-dataframena.
## What changes were proposed in this pull request?
This pr is to add a config to control the maximum number of files as even small files have a non-trivial fixed cost. The current packing can put a lot of small files together which cases straggler tasks.
## How was this patch tested?
I added tests to check if many files get split into partitions in FileSourceStrategySuite.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#12068 from maropu/SPARK-14259.
## What changes were proposed in this pull request?
This PR implements buildReader for text data source and enable it in the new data source code path.
## How was this patch tested?
Existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11934 from cloud-fan/text.
#### What changes were proposed in this pull request?
This PR is to implement the following four Database-related DDL commands:
- `CREATE DATABASE|SCHEMA [IF NOT EXISTS] database_name`
- `DROP DATABASE [IF EXISTS] database_name [RESTRICT|CASCADE]`
- `DESCRIBE DATABASE [EXTENDED] db_name`
- `ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...)`
Another PR will be submitted to handle the unsupported commands. In the Database-related DDL commands, we will issue an error exception for `ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role`.
cc yhuai andrewor14 rxin Could you review the changes? Is it in the right direction? Thanks!
#### How was this patch tested?
Added a few test cases in `command/DDLSuite.scala` for testing DDL command execution in `SQLContext`. Since `HiveContext` also shares the same implementation, the existing test cases in `\hive` also verifies the correctness of these commands.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12009 from gatorsmile/dbDDL.
## What changes were proposed in this pull request?
This PR brings the support for chained Python UDFs, for example
```sql
select udf1(udf2(a))
select udf1(udf2(a) + 3)
select udf1(udf2(a) + udf3(b))
```
Also directly chained unary Python UDFs are put in single batch of Python UDFs, others may require multiple batches.
For example,
```python
>>> sqlContext.sql("select double(double(1))").explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [pythonUDF#10 AS double(double(1))#9]
: +- INPUT
+- !BatchPythonEvaluation double(double(1)), [pythonUDF#10]
+- Scan OneRowRelation[]
>>> sqlContext.sql("select double(double(1) + double(2))").explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [pythonUDF#19 AS double((double(1) + double(2)))#16]
: +- INPUT
+- !BatchPythonEvaluation double((pythonUDF#17 + pythonUDF#18)), [pythonUDF#17,pythonUDF#18,pythonUDF#19]
+- !BatchPythonEvaluation double(2), [pythonUDF#17,pythonUDF#18]
+- !BatchPythonEvaluation double(1), [pythonUDF#17]
+- Scan OneRowRelation[]
```
TODO: will support multiple unrelated Python UDFs in one batch (another PR).
## How was this patch tested?
Added new unit tests for chained UDFs.
Author: Davies Liu <davies@databricks.com>
Closes#12014 from davies/py_udfs.
## What changes were proposed in this pull request?
This PR is a simple fix for an exception message to print `string[]` content correctly.
```java
String[] colPath = requestedSchema.getPaths().get(i);
...
- throw new IOException("Required column is missing in data file. Col: " + colPath);
+ throw new IOException("Required column is missing in data file. Col: " + Arrays.toString(colPath));
```
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12041 from dongjoon-hyun/fix_exception_message_with_string_array.
## What changes were proposed in this pull request?
Renames SQL option `spark.sql.parquet.fileScan` since now all `HadoopFsRelation` based data sources are being migrated to `FileScanRDD` code path.
## How was this patch tested?
None.
Author: Cheng Lian <lian@databricks.com>
Closes#12003 from liancheng/spark-14208-option-renaming.
## What changes were proposed in this pull request?
This PR implements buildReader for json data source and enable it in the new data source code path.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11960 from cloud-fan/json.
## What changes were proposed in this pull request?
This adds a metric to parquet scans that measures the time in just the scan phase. This is
only possible when the scan returns ColumnarBatches, otherwise the overhead is too high.
This combined with the pipeline metric lets us easily see what percent of the time was
in the scan.
Author: Nong Li <nong@databricks.com>
Closes#12007 from nongli/spark-14210.
## What changes were proposed in this pull request?
This improves the Filter codegen for NULLs by deferring loading the values for IsNotNull.
Instead of generating code like:
boolean isNull = ...
int value = ...
if (isNull) continue;
we will generate:
boolean isNull = ...
if (isNull) continue;
int value = ...
This is useful since retrieving the values can be non-trivial (they can be dictionary encoded
among other things). This currently only works when the attribute comes from the column batch
but could be extended to other cases in the future.
## How was this patch tested?
On tpcds q55, this fixes the regression from introducing the IsNotNull predicates.
```
TPCDS Snappy: Best/Avg Time(ms) Rate(M/s) Per Row(ns)
--------------------------------------------------------------------------------
q55 4564 / 5036 25.2 39.6
q55 4064 / 4340 28.3 35.3
```
Author: Nong Li <nong@databricks.com>
Closes#11792 from nongli/spark-13981.
## What changes were proposed in this pull request?
After DataFrame and Dataset are merged, the trait `Queryable` becomes unnecessary as it has only one implementation. We should remove it.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12001 from cloud-fan/df-ds.
## What changes were proposed in this pull request?
Session catalog was added in #11750. However, it doesn't really support temporary functions properly; right now we only store the metadata in the form of `CatalogFunction`, but this doesn't make sense for temporary functions because there is no class name.
This patch moves the `FunctionRegistry` into the `SessionCatalog`. With this, the user can call `catalog.createTempFunction` and `catalog.lookupFunction` to use the function they registered previously. This is currently still dead code, however.
## How was this patch tested?
`SessionCatalogSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#11972 from andrewor14/temp-functions.
## What changes were proposed in this pull request?
Extract the workaround for HADOOP-10622 introduced by #11940 into UninterruptibleThread so that we can test and reuse it.
## How was this patch tested?
Unit tests
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11971 from zsxwing/uninterrupt.
## What changes were proposed in this pull request?
This patch addresses the remaining comments left in #11750 and #11918 after they are merged. For a full list of changes in this patch, just trace the commits.
## How was this patch tested?
`SessionCatalogSuite` and `CatalogTestCases`
Author: Andrew Or <andrew@databricks.com>
Closes#12006 from andrewor14/session-catalog-followup.
#### What changes were proposed in this pull request?
This PR adds all the current Spark SQL DDL commands to the new ANTLR 4 based SQL parser.
I have found a few inconsistencies in the current commands:
- Function has an alias field. This is actually the class name of the function.
- Partition specifications should contain nulls in some commands, and contain `None`s in others.
- `AlterTableSkewedLocation`: Should defines which columns have skewed values, and should allow us to define storage for each skewed combination of values. We currently only allow one value per field.
- `AlterTableSetFileFormat`: Should only have one file format, it currently supports both.
I have implemented all these comments like they were, and I propose to improve them in follow-up PRs.
#### How was this patch tested?
The existing DDLCommandSuite.
cc rxin andrewor14 yhuai
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12011 from hvanhovell/SPARK-14086.
## What changes were proposed in this pull request?
Currently, for the key that can not fit within a long, we build a hash map for UnsafeHashedRelation, it's converted to BytesToBytesMap after serialization and deserialization. We should build a BytesToBytesMap directly to have better memory efficiency.
In order to do that, BytesToBytesMap should support multiple (K,V) pair with the same K, Location.putNewKey() is renamed to Location.append(), which could append multiple values for the same key (same Location). `Location.newValue()` is added to find the next value for the same key.
## How was this patch tested?
Existing tests. Added benchmark for broadcast hash join with duplicated keys.
Author: Davies Liu <davies@databricks.com>
Closes#11870 from davies/map2.
### What changes were proposed in this pull request?
The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4.
This parser is based on the [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQl DDL and some of the DML functionality is currently missing, the plan is to add this in follow-up PRs.
This PR is a work in progress, and work needs to be done in the following area's:
- [x] Error handling should be improved.
- [x] Documentation should be improved.
- [x] Multi-Insert needs to be tested.
- [ ] Naming and package locations.
### How was this patch tested?
Catalyst and SQL unit tests.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#11557 from hvanhovell/ngParser.
#### What changes were proposed in this pull request?
This PR is to provide native parsing support for two DDL commands: ```Describe Database``` and ```Alter Database Set Properties```
Based on the Hive DDL document:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
##### 1. ALTER DATABASE
**Syntax:**
```SQL
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...)
```
- `ALTER DATABASE` is to add new (key, value) pairs into `DBPROPERTIES`
##### 2. DESCRIBE DATABASE
**Syntax:**
```SQL
DESCRIBE DATABASE [EXTENDED] db_name
```
- `DESCRIBE DATABASE` shows the name of the database, its comment (if one has been set), and its root location on the filesystem. When `extended` is true, it also shows the database's properties
#### How was this patch tested?
Added the related test cases to `DDLCommandSuite`
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
This patch had conflicts when merged, resolved by
Committer: Yin Huai <yhuai@databricks.com>
Closes#11977 from gatorsmile/parseAlterDatabase.
## What changes were proposed in this pull request?
This PR implements `FileFormat.buildReader()` for our ORC data source. It also fixed several minor styling issues related to `HadoopFsRelation` planning code path.
Note that `OrcNewInputFormat` doesn't rely on `OrcNewSplit` for creating `OrcRecordReader`s, plain `FileSplit` is just fine. That's why we can simply create the record reader with the help of `OrcNewInputFormat` and `FileSplit`.
## How was this patch tested?
Existing test cases should do the work
Author: Cheng Lian <lian@databricks.com>
Closes#11936 from liancheng/spark-14116-build-reader-for-orc.
### What changes were proposed in this pull request?
Based on the Hive DDL document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
The syntax of DDL command for Drop Database is
```SQL
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
```
- If `IF EXISTS` is not specified, the default behavior is to issue a warning message if `database_name` does't exist
- `RESTRICT` is the default behavior.
This PR is to provide a native parsing support for `DROP DATABASE`.
#### How was this patch tested?
Added a test case `DDLCommandSuite`
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11962 from gatorsmile/parseDropDatabase.
## What changes were proposed in this pull request?
1. merge consumeChild into consume()
2. always generate code for input variables and UnsafeRow, a plan can use eight of them.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#11975 from davies/gen_refactor.
## What changes were proposed in this pull request?
This PR fixes some newly added java-lint errors(unused-imports, line-lengsth).
## How was this patch tested?
Pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11968 from dongjoon-hyun/SPARK-14167.
## What changes were proposed in this pull request?
HDFSMetadataLog uses newer FileContext API to achieve atomic renaming. However, FileContext implementations may not exist for many scheme for which there may be FileSystem implementations. In those cases, rather than failing completely, we should fallback to the FileSystem based implementation, and log warning that there may be file consistency issues in case the log directory is concurrently modified.
In addition I have also added more tests to increase the code coverage.
## How was this patch tested?
Unit test.
Tested on cluster with custom file system.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#11925 from tdas/SPARK-14109.
## What changes were proposed in this pull request?
There is a potential dead-lock in Hadoop Shell.runCommand before 2.5.0 ([HADOOP-10622](https://issues.apache.org/jira/browse/HADOOP-10622)). If we interrupt some thread running Shell.runCommand, we may hit this issue.
This PR adds some protecion to prevent from interrupting the microBatchThread when we may run into Shell.runCommand. There are two places will call Shell.runCommand now:
- offsetLog.add
- FileStreamSource.getOffset
They will create a file using HDFS API and call Shell.runCommand to set the file permission.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11940 from zsxwing/workaround-for-HADOOP-10622.
## What changes were proposed in this pull request?
This PR is a minor cleanup task as part of https://issues.apache.org/jira/browse/SPARK-14008 to explicitly identify/catch the `UnsupportedOperationException` while initializing the vectorized parquet reader. Other exceptions will simply be thrown back to `SqlNewHadoopPartition`.
## How was this patch tested?
N/A (cleanup only; no new functionality added)
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11950 from sameeragarwal/parquet-cleanup.
## What changes were proposed in this pull request?
As we have `CreateArray` and `CreateStruct`, we should also have `CreateMap`. This PR adds the `CreateMap` expression, and the DataFrame API, and python API.
## How was this patch tested?
various new tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11879 from cloud-fan/create_map.
## What changes were proposed in this pull request?
We ran into a problem today debugging some class loading problem during deserialization, and JVM was masking the underlying exception which made it very difficult to debug. We can however log the exceptions using try/catch ourselves in serialization/deserialization. The good thing is that all these methods are already using Utils.tryOrIOException, so we can just put the try catch and logging in a single place.
## How was this patch tested?
A logging change with a manual test.
Author: Reynold Xin <rxin@databricks.com>
Closes#11951 from rxin/SPARK-14149.
## What changes were proposed in this pull request?
This reopens#11836, which was merged but promptly reverted because it introduced flaky Hive tests.
## How was this patch tested?
See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#11938 from andrewor14/session-catalog-again.
## What changes were proposed in this pull request?
Dataset has two variants of groupByKey, one for untyped and the other for typed. It actually doesn't make as much sense to have an untyped API here, since apps that want to use untyped APIs should just use the groupBy "DataFrame" API.
## How was this patch tested?
This patch removes a method, and removes the associated tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#11949 from rxin/SPARK-14145.
## What changes were proposed in this pull request?
unionAll has been deprecated in SPARK-14088.
## How was this patch tested?
Should be covered by all existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#11946 from rxin/SPARK-14142.
#### What changes were proposed in this pull request?
This PR is to support group by position in SQL. For example, when users input the following query
```SQL
select c1 as a, c2, c3, sum(*) from tbl group by 1, 3, c4
```
The ordinals are recognized as the positions in the select list. Thus, `Analyzer` converts it to
```SQL
select c1, c2, c3, sum(*) from tbl group by c1, c3, c4
```
This is controlled by the config option `spark.sql.groupByOrdinal`.
- When true, the ordinal numbers in group by clauses are treated as the position in the select list.
- When false, the ordinal numbers are ignored.
- Only convert integer literals (not foldable expressions). If found foldable expressions, ignore them.
- When the positions specified in the group by clauses correspond to the aggregate functions in select list, output an exception message.
- star is not allowed to use in the select list when users specify ordinals in group by
Note: This PR is taken from https://github.com/apache/spark/pull/10731. When merging this PR, please give the credit to zhichao-li
Also cc all the people who are involved in the previous discussion: rxin cloud-fan marmbrus yhuai hvanhovell adrian-wang chenghao-intel tejasapatil
#### How was this patch tested?
Added a few test cases for both positive and negative test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#11846 from gatorsmile/groupByOrdinal.
## What changes were proposed in this pull request?
`SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`.
As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:
- SPARK-14013: Properly implement temporary functions in `SessionCatalog`
- SPARK-13879: Decide which DDL/DML commands to support natively in Spark
- SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
- SPARK-?????: Merge SQL/HiveContext
## How was this patch tested?
This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#11836 from andrewor14/use-session-catalog.
This PR adds a new `Sink` implementation that writes out Parquet files. In order to correctly handle partial failures while maintaining exactly once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a parquet based `DataSource` is initialized for reading, we first check for this log directory and use it instead of file listing when present.
Unit tests are added, as well as a stress test that checks the answer after non-deterministic injected failures.
Author: Michael Armbrust <michael@databricks.com>
Closes#11897 from marmbrus/fileSink.
## What changes were proposed in this pull request?
In this PR, I am implementing a new abstraction for management of streaming state data - State Store. It is a key-value store for persisting running aggregates for aggregate operations in streaming dataframes. The motivation and design is discussed here.
https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254/edit#
## How was this patch tested?
- [x] Unit tests
- [x] Cluster tests
**Coverage from unit tests**
<img width="952" alt="screen shot 2016-03-21 at 3 09 40 pm" src="https://cloud.githubusercontent.com/assets/663212/13935872/fdc8ba86-ef76-11e5-93e8-9fa310472c7b.png">
## TODO
- [x] Fix updates() iterator to avoid duplicate updates for same key
- [x] Use Coordinator in ContinuousQueryManager
- [x] Plugging in hadoop conf and other confs
- [x] Unit tests
- [x] StateStore object lifecycle and methods
- [x] StateStoreCoordinator communication and logic
- [x] StateStoreRDD fault-tolerance
- [x] StateStoreRDD preferred location using StateStoreCoordinator
- [ ] Cluster tests
- [ ] Whether preferred locations are set correctly
- [ ] Whether recovery works correctly with distributed storage
- [x] Basic performance tests
- [x] Docs
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#11645 from tdas/state-store.
## What changes were proposed in this pull request?
This PR adds support for TimestampType in the vectorized parquet reader
## How was this patch tested?
1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.INT96)` that made us fall back on parquet-mr for handling timestamps. This condition is now removed.
2. The `ParquetHadoopFsRelationSuite` (that tests for all supported hive types -- including `TimestampType`) fails when the gating condition is removed (https://github.com/apache/spark/pull/11808) and should now pass with this change. Similarly, the `ParquetHiveCompatibilitySuite.SPARK-10177 timestamp` test that fails when the gating condition is removed, should now pass as well.
3. Added tests in `HadoopFsRelationTest` that test both the dictionary encoded and non-encoded versions across all supported datatypes.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11882 from sameeragarwal/timestamp-parquet.
## What changes were proposed in this pull request?
This PR rollback some changes in #11274 , which introduced some performance regression when do a simple aggregation on parquet scan with one integer column.
Does not really understand how this change introduce this huge impact, maybe related show JIT compiler inline functions. (saw very different stats from profiling).
## How was this patch tested?
Manually run the parquet reader benchmark, before this change:
```
Intel(R) Core(TM) i7-4558U CPU 2.80GHz
Int and String Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized 2391 / 3107 43.9 22.8 1.0X
```
After this change
```
Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
Intel(R) Core(TM) i7-4558U CPU 2.80GHz
Int and String Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized 2032 / 2626 51.6 19.4 1.0X```
Author: Davies Liu <davies@databricks.com>
Closes#11912 from davies/fix_regression.
This patch refactors the `MemoryStore` so that it can be tested without needing to construct / mock an entire `BlockManager`.
- The block manager's serialization- and compression-related methods have been moved from `BlockManager` to `SerializerManager`.
- `BlockInfoManager `is now passed directly to classes that need it, rather than being passed via the `BlockManager`.
- The `MemoryStore` now calls `dropFromMemory` via a new `BlockEvictionHandler` interface rather than directly calling the `BlockManager`. This change helps to enforce a narrow interface between the `MemoryStore` and `BlockManager` functionality and makes this interface easier to mock in tests.
- Several of the block unrolling tests have been moved from `BlockManagerSuite` into a new `MemoryStoreSuite`.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11899 from JoshRosen/reduce-memorystore-blockmanager-coupling.
## What changes were proposed in this pull request?
This PR does the renaming as suggested by marmbrus in [this comment][1].
## How was this patch tested?
Existing tests.
[1]: 6d37e1eb90 (commitcomment-16654694)
Author: Cheng Lian <lian@databricks.com>
Closes#11889 from liancheng/spark-13817-follow-up.
## What changes were proposed in this pull request?
1. Deprecated unionAll. It is pretty confusing to have both "union" and "unionAll" when the two do the same thing in Spark but are different in SQL.
2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more consistent with rest of the functions in KeyValueGroupedDataset. Also makes it more obvious what "reduce" and "reduceGroups" mean. Previously it was confusing because it could be reducing a Dataset, or just reducing groups.
3. Added a "name" function, which is more natural to name columns than "as" for non-SQL users.
4. Remove "subtract" function since it is just an alias for "except".
## How was this patch tested?
All changes should be covered by existing tests. Also added couple test cases to cover "name".
Author: Reynold Xin <rxin@databricks.com>
Closes#11908 from rxin/SPARK-14088.
## What changes were proposed in this pull request?
This PR updates `sql/README.md` according to the latest console output and removes some unused imports in `sql` module. This is done by manually, so there is no guarantee to remove all unused imports.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11907 from dongjoon-hyun/update_sql_module.
## What changes were proposed in this pull request?
This patch moves StringToColumn implicit class into SQLImplicits. This was kept in SQLContext.implicits object for binary backward compatibility, in the Spark 1.x series. It makes more sense for this API to be in SQLImplicits since that's the single class that defines all the SQL implicits.
## How was this patch tested?
Should be covered by existing unit tests.
Author: Reynold Xin <rxin@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11878 from rxin/SPARK-14060.
## What changes were proposed in this pull request?
This patch changed the return type for SQLContext.range from `Dataset[Long]` (Scala primitive) to `Dataset[java.lang.Long]` (Java boxed long).
Previously, SPARK-13894 changed the return type of range from `Dataset[Row]` to `Dataset[Long]`. The problem is that due to https://issues.scala-lang.org/browse/SI-4388, Scala compiles primitive types in generics into just Object, i.e. range at bytecode level now just returns `Dataset[Object]`. This is really bad for Java users because they are losing type safety and also need to add a type cast every time they use range.
Talked to Jason Zaugg from Lightbend (Typesafe) who suggested the best approach is to return `Dataset[java.lang.Long]`. The downside is that when Scala users want to explicitly type a closure used on the dataset returned by range, they would need to use `java.lang.Long` instead of the Scala `Long`.
## How was this patch tested?
The signature change should be covered by existing unit tests and API tests. I also added a new test case in DatasetSuite for range.
Author: Reynold Xin <rxin@databricks.com>
Closes#11880 from rxin/SPARK-14063.
This PR relaxes the requirements of a `Sink` for structured streaming to only require idempotent appending of data. Previously the `Sink` needed to be able to transactionally append data while recording an opaque offset indicated how far in a stream we have processed.
In order to do this, a new write-ahead-log has been added to stream execution, which records the offsets that will are present in each batch. The log is created in the newly added `checkpointLocation`, which defaults to `${spark.sql.streaming.checkpointLocation}/${queryName}` but can be overriden by setting `checkpointLocation` in `DataFrameWriter`.
In addition to making sinks easier to write the addition of batchIds and a checkpoint location is done in anticipation of integration with the the `StateStore` (#11645).
Author: Michael Armbrust <michael@databricks.com>
Closes#11804 from marmbrus/batchIds.
SPARK-13774: IllegalArgumentException: Can not create a Path from an empty string for incorrect file path
**Overview:**
- If a non-existent path is given in this call
``
scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
``
it throws the following error:
`java.lang.IllegalArgumentException: Can not create a Path from an empty string` …..
`It gets called from inferSchema call in org.apache.spark.sql.execution.datasources.DataSource.resolveRelation`
- The purpose of this JIRA is to throw a better error message.
- With the fix, you will now get a _Path does not exist_ error message.
```
scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/ksunitha/trunk/spark/file-path-is-incorrect.csv;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:215)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:204)
...
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:204)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141)
... 49 elided
```
**Details**
_Changes include:_
- Check if path exists or not in resolveRelation in DataSource, and throw an AnalysisException with message like “Path does not exist: $path”
- AnalysisException is thrown similar to the exceptions thrown in resolveRelation.
- The glob path and the non glob path is checked with minimal calls to path exists. If the globPath is empty, then it is a nonexistent glob pattern and an error will be thrown. In the scenario that it is not globPath, it is necessary to only check if the first element in the Seq is valid or not.
_Test modifications:_
- Changes went in for 3 tests to account for this error checking.
- SQLQuerySuite:test("run sql directly on files") – Error message needed to be updated.
- 2 tests failed in MetastoreDataSourcesSuite because they had a dummy path and so test is modified to give a tempdir and allow it to move past so it can continue to test the codepath it meant to test
_New Tests:_
2 new tests are added to DataFrameSuite to validate that glob and non-glob path will throw the new error message.
_Testing:_
Unit tests were run with the fix.
**Notes/Questions to reviewers:**
- There is some code duplication in DataSource.scala in resolveRelation method and also createSource with respect to getting the paths. I have not made any changes to the createSource codepath. Should we make the change there as well ?
- From other JIRAs, I know there is restructuring and changes going on in this area, not sure how that will affect these changes, but since this seemed like a starter issue, I looked into it. If we prefer not to add the overhead of the checks, or if there is a better place to do so, let me know.
I would appreciate your review. Thanks for your time and comments.
Author: Sunitha Kambhampati <skambha@us.ibm.com>
Closes#11775 from skambha/improve_errmsg.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13953
Currently, JSON data source creates a new field in `PERMISSIVE` mode for storing malformed string.
This field can be renamed via `spark.sql.columnNameOfCorruptRecord` option but it is a global configuration.
This PR make that option can be applied per read and can be specified via `option()`. This will overwrites `spark.sql.columnNameOfCorruptRecord` if it is set.
## How was this patch tested?
Unit tests were used and `./dev/run_tests` for coding style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11881 from HyukjinKwon/SPARK-13953.
## What changes were proposed in this pull request?
As we have completed the `SQLBuilder`, we can safely turn on native view by default.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11872 from cloud-fan/native-view.
This PR add implements the new `buildReader` interface for the Parquet `FileFormat`. An simple implementation of `FileScanRDD` is also included.
This code should be tested by the many existing tests for parquet.
Author: Michael Armbrust <michael@databricks.com>
Author: Sameer Agarwal <sameer@databricks.com>
Author: Nong Li <nong@databricks.com>
Closes#11709 from marmbrus/parquetReader.
## What changes were proposed in this pull request?
This patch adds support for reading `DecimalTypes` with high (> 18) precision in `VectorizedColumnReader`
## How was this patch tested?
1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getDecimalMetadata().getPrecision() > Decimal.MAX_LONG_DIGITS()` that made us fall back on parquet-mr for handling high-precision decimals. This condition is now removed.
2. In particular, the `ParquetHadoopFsRelationSuite` (that tests for all supported hive types -- including `DecimalType(25, 5)`) fails when the gating condition is removed (https://github.com/apache/spark/pull/11808) and should now pass with this change.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11869 from sameeragarwal/bigdecimal-parquet.
## What changes were proposed in this pull request?
This patch merges DatasetHolder and DataFrameHolder. This makes more sense because DataFrame/Dataset are now one class.
In addition, fixed some minor issues with pull request #11732.
## How was this patch tested?
Updated existing unit tests that test these implicits.
Author: Reynold Xin <rxin@databricks.com>
Closes#11737 from rxin/SPARK-13898.
## What changes were proposed in this pull request?
WholeStageCodegen naturally breaks the execution into pipelines that are easier to
measure duration. This is more granular than the task timings (a task can be multiple
pipelines) and is integrated with the web ui.
We currently report total time (across all tasks), min/mask/median to get a sense of how long each is taking.
## How was this patch tested?
Manually tested looking at the web ui.
Author: Nong Li <nong@databricks.com>
Closes#11741 from nongli/spark-13916.
## What changes were proposed in this pull request?
There is only one exception: `PythonUDF`. However, I don't think the `PythonUDF#` prefix is useful, as we can only create python udf under python context. This PR removes the `PythonUDF#` prefix from `PythonUDF.toString`, so that it doesn't need to overrde `sql`.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11859 from cloud-fan/tmp.
## What changes were proposed in this pull request?
This PR generates code that get a value in each column from ```ColumnVector``` instead of creating ```InternalRow``` when ```ColumnarBatch``` is accessed. This PR improves benchmark program by up to 15%.
This PR consists of two parts:
1. Get an ```ColumnVector ``` by using ```ColumnarBatch.column()``` method
2. Get a value of each column by using ```rdd_col${COLIDX}.getInt(ROWIDX)``` instead of ```rdd_row.getInt(COLIDX)```
This is a motivated example.
````
sqlContext.conf.setConfString(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true")
sqlContext.conf.setConfString(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
val values = 10
withTempPath { dir =>
withTempTable("t1", "tempTable") {
sqlContext.range(values).registerTempTable("t1")
sqlContext.sql("select id % 2 as p, cast(id as INT) as id from t1")
.write.partitionBy("p").parquet(dir.getCanonicalPath)
sqlContext.read.parquet(dir.getCanonicalPath).registerTempTable("tempTable")
sqlContext.sql("select sum(p) from tempTable").collect
}
}
````
The original code
````java
...
/* 072 */ while (!shouldStop() && rdd_batchIdx < numRows) {
/* 073 */ InternalRow rdd_row = rdd_batch.getRow(rdd_batchIdx++);
/* 074 */ /*** CONSUME: TungstenAggregate(key=[], functions=[(sum(cast(p#4 as bigint)),mode=Partial,isDistinct=false)], output=[sum#10L]) */
/* 075 */ /* input[0, int] */
/* 076 */ boolean rdd_isNull = rdd_row.isNullAt(0);
/* 077 */ int rdd_value = rdd_isNull ? -1 : (rdd_row.getInt(0));
...
````
The code generated by this PR
````java
/* 072 */ while (!shouldStop() && rdd_batchIdx < numRows) {
/* 073 */ org.apache.spark.sql.execution.vectorized.ColumnVector rdd_col0 = rdd_batch.column(0);
/* 074 */ /*** CONSUME: TungstenAggregate(key=[], functions=[(sum(cast(p#4 as bigint)),mode=Partial,isDistinct=false)], output=[sum#10L]) */
/* 075 */ /* input[0, int] */
/* 076 */ boolean rdd_isNull = rdd_col0.getIsNull(rdd_batchIdx);
/* 077 */ int rdd_value = rdd_isNull ? -1 : (rdd_col0.getInt(rdd_batchIdx));
...
/* 128 */ rdd_batchIdx++;
/* 129 */ }
/* 130 */ if (shouldStop()) return;
````
Performance
Without this PR
````
model name : Intel(R) Xeon(R) CPU E5-2667 v2 3.30GHz
Partitioned Table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
Read data column 434 / 488 36.3 27.6 1.0X
Read partition column 302 / 346 52.1 19.2 1.4X
Read both columns 588 / 643 26.8 37.4 0.7X
````
With this PR
````
model name : Intel(R) Xeon(R) CPU E5-2667 v2 3.30GHz
Partitioned Table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
Read data column 392 / 516 40.1 24.9 1.0X
Read partition column 256 / 318 61.4 16.3 1.5X
Read both columns 523 / 539 30.1 33.3 0.7X
````
## How was this patch tested?
Tested by existing test suites and benchmark
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#11636 from kiszk/SPARK-13805.
## What changes were proposed in this pull request?
This PR try acquire the memory for hash map in shuffled hash join, fail the task if there is no enough memory (otherwise it could OOM the executor).
It also removed unused HashedRelation.
## How was this patch tested?
Existing unit tests. Manual tests with TPCDS Q78.
Author: Davies Liu <davies@databricks.com>
Closes#11826 from davies/cleanup_hash2.
## What changes were proposed in this pull request?
Ad-hoc Dataset API ScalaDoc fixes
## How was this patch tested?
By building and checking ScalaDoc locally.
Author: Cheng Lian <lian@databricks.com>
Closes#11862 from liancheng/ds-doc-fixes.
#### What changes were proposed in this pull request?
This PR is to support order by position in SQL, e.g.
```SQL
select c1, c2, c3 from tbl order by 1 desc, 3
```
should be equivalent to
```SQL
select c1, c2, c3 from tbl order by c1 desc, c3 asc
```
This is controlled by config option `spark.sql.orderByOrdinal`.
- When true, the ordinal numbers are treated as the position in the select list.
- When false, the ordinal number in order/sort By clause are ignored.
- Only convert integer literals (not foldable expressions). If found foldable expressions, ignore them
- This also works with select *.
**Question**: Do we still need sort by columns that contain zero reference? In this case, it will have no impact on the sorting results. IMO, we should not allow users do it. rxin cloud-fan marmbrus yhuai hvanhovell
-- Update: In these cases, they are ignored in this case.
**Note**: This PR is taken from https://github.com/apache/spark/pull/10731. When merging this PR, please give the credit to zhichao-li
Also cc all the people who are involved in the previous discussion: adrian-wang chenghao-intel tejasapatil
#### How was this patch tested?
Added a few test cases for both positive and negative test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11815 from gatorsmile/orderByPosition.
## What changes were proposed in this pull request?
This PR adds some proper periods and spaces to Spark CLI help messages and SQL/YARN conf docs for consistency.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11848 from dongjoon-hyun/add_proper_period_and_space.
## What changes were proposed in this pull request?
[Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has 100-character limit on lines, but it's disabled for Java since 11/09/15. This PR enables **LineLength** checkstyle again. To help that, this also introduces **RedundantImport** and **RedundantModifier**, too. The following is the diff on `checkstyle.xml`.
```xml
- <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places -->
- <!--
<module name="LineLength">
<property name="max" value="100"/>
<property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/>
</module>
- -->
<module name="NoLineWrap"/>
<module name="EmptyBlock">
<property name="option" value="TEXT"/>
-167,5 +164,7
</module>
<module name="CommentsIndentation"/>
<module name="UnusedImports"/>
+ <module name="RedundantImport"/>
+ <module name="RedundantModifier"/>
```
## How was this patch tested?
Currently, `lint-java` is disabled in Jenkins. It needs a manual test.
After passing the Jenkins tests, `dev/lint-java` should passes locally.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11831 from dongjoon-hyun/SPARK-14011.
## What changes were proposed in this pull request?
Currently, there is no way to control the behaviour when fails to parse corrupt records in JSON data source .
This PR adds the support for parse modes just like CSV data source. There are three modes below:
- `PERMISSIVE` : When it fails to parse, this sets `null` to to field. This is a default mode when it has been this mode.
- `DROPMALFORMED`: When it fails to parse, this drops the whole record.
- `FAILFAST`: When it fails to parse, it just throws an exception.
This PR also make JSON data source share the `ParseModes` in CSV data source.
## How was this patch tested?
Unit tests were used and `./dev/run_tests` for code style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11756 from HyukjinKwon/SPARK-13764.
## What changes were proposed in this pull request?
Previously, Dataset.groupBy returns a GroupedData, and Dataset.groupByKey returns a GroupedDataset. The naming is very similar, and unfortunately does not convey the real differences between the two.
Assume we are grouping by some keys (K). groupByKey is a key-value style group by, in which the schema of the returned dataset is a tuple of just two fields: key and value. groupBy, on the other hand, is a relational style group by, in which the schema of the returned dataset is flattened and contain |K| + |V| fields.
This pull request also removes the experimental tag from RelationalGroupedDataset. It has been with DataFrame since 1.3, and we have enough confidence now to stabilize it.
## How was this patch tested?
This is a rename to improve API understandability. Should be covered by all existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#11841 from rxin/SPARK-13897.
## What changes were proposed in this pull request?
This is a minor followup on https://github.com/apache/spark/pull/11799 that extracts out the `VectorizedColumnReader` from `VectorizedParquetRecordReader` into its own file.
## How was this patch tested?
N/A (refactoring only)
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11834 from sameeragarwal/rename.
## What changes were proposed in this pull request?
This PR cleans up the new parquet record reader with the following changes:
1. Removes the non-vectorized parquet reader code from `UnsafeRowParquetRecordReader`.
2. Removes the non-vectorized column reader code from `ColumnReader`.
3. Renames `UnsafeRowParquetRecordReader` to `VectorizedParquetRecordReader` and `ColumnReader` to `VectorizedColumnReader`
4. Deprecate `PARQUET_UNSAFE_ROW_RECORD_READER_ENABLED`
## How was this patch tested?
Refactoring only; Existing tests should reveal any problems.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11799 from sameeragarwal/vectorized-parquet.
## What changes were proposed in this pull request?
ShuffledHashJoin (also outer join) is removed in 1.6, in favor of SortMergeJoin, which is more robust and also fast.
ShuffledHashJoin is still useful in this case: 1) one table is much smaller than the other one, then cost to build a hash table on smaller table is smaller than sorting the larger table 2) any partition of the small table could fit in memory.
This PR brings back ShuffledHashJoin, basically revert #9645, and fix the conflict. Also merging outer join and left-semi join into the same class. This PR does not implement full outer join, because it's not implemented efficiently (requiring build hash table on both side).
A simple benchmark (one table is 5x smaller than other one) show that ShuffledHashJoin could be 2X faster than SortMergeJoin.
## How was this patch tested?
Added new unit tests for ShuffledHashJoin.
Author: Davies Liu <davies@databricks.com>
Closes#11788 from davies/shuffle_join.
## What changes were proposed in this pull request?
This patch updates documentations for Datasets. I also updated some internal documentation for exchange/broadcast.
## How was this patch tested?
Just documentation/api stability update.
Author: Reynold Xin <rxin@databricks.com>
Closes#11814 from rxin/dataset-docs.
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-13930
Recently the fast serialization has been introduced to collecting DataFrame/Dataset (#11664). The same technology can be used on collect limit operator too.
## How was this patch tested?
Add a benchmark for collect limit to `BenchmarkWholeStageCodegen`.
Without this patch:
model name : Westmere E56xx/L56xx/X56xx (Nehalem-C)
collect limit: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
collect limit 1 million 3413 / 3768 0.3 3255.0 1.0X
collect limit 2 millions 9728 / 10440 0.1 9277.3 0.4X
With this patch:
model name : Westmere E56xx/L56xx/X56xx (Nehalem-C)
collect limit: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
collect limit 1 million 833 / 1284 1.3 794.4 1.0X
collect limit 2 millions 3348 / 4005 0.3 3193.3 0.2X
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#11759 from viirya/execute-take.
## What changes were proposed in this pull request?
This PR revises Dataset API ScalaDoc. All public methods are divided into the following groups
* `groupname basic`: Basic Dataset functions
* `groupname action`: Actions
* `groupname untypedrel`: Untyped Language Integrated Relational Queries
* `groupname typedrel`: Typed Language Integrated Relational Queries
* `groupname func`: Functional Transformations
* `groupname rdd`: RDD Operations
* `groupname output`: Output Operations
`since` tag and sample code are also updated. We may want to add more sample code for typed APIs.
## How was this patch tested?
Documentation change. Checked by building unidoc locally.
Author: Cheng Lian <lian@databricks.com>
Closes#11769 from liancheng/spark-13826-ds-api-doc.
## What changes were proposed in this pull request?
Support queries that JOIN tables with USING clause.
SELECT * from table1 JOIN table2 USING <column_list>
USING clause can be used as a means to simplify the join condition
when :
1) Equijoin semantics is desired and
2) The column names in the equijoin have the same name.
We already have the support for Natural Join in Spark. This PR makes
use of the already existing infrastructure for natural join to
form the join condition and also the projection list.
## How was the this patch tested?
Have added unit tests in SQLQuerySuite, CatalystQlSuite, ResolveNaturalJoinSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#11297 from dilipbiswal/spark-13427.
## What changes were proposed in this pull request?
Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11764 from cloud-fan/logger.
Because ClassTags are available when constructing ShuffledRDD we can use them to automatically use Kryo for shuffle serialization when the RDD's types are known to be compatible with Kryo.
This patch introduces `SerializerManager`, a component which picks the "best" serializer for a shuffle given the elements' ClassTags. It will automatically pick a Kryo serializer for ShuffledRDDs whose key, value, and/or combiner types are primitives, arrays of primitives, or strings. In the future we can use this class as a narrow extension point to integrate specialized serializers for other types, such as ByteBuffers.
In a planned followup patch, I will extend the BlockManager APIs so that we're able to use similar automatic serializer selection when caching RDDs (this is a little trickier because the ClassTags need to be threaded through many more places).
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11755 from JoshRosen/automatically-pick-best-serializer.
## What changes were proposed in this pull request?
Since developer API of plug-able parser has been removed in #10801 , docs should be updated accordingly.
## How was this patch tested?
This patch will not affect the real code path.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#11758 from adrian-wang/spark12855.
## What changes were proposed in this pull request?
We need to copy the UnsafeRow since a Join could produce multiple rows from single input rows. We could avoid that if there is no join (or the join will not produce multiple rows) inside WholeStageCodegen.
Updated the benchmark for `collect`, we could see 20-30% speedup.
## How was this patch tested?
existing unit tests.
Author: Davies Liu <davies@databricks.com>
Closes#11740 from davies/avoid_copy2.
## What changes were proposed in this pull request?
This https://github.com/apache/spark/pull/2400 added the support to parse JSON rows wrapped with an array. However, this throws an exception when the given data contains array data and struct data in the same field as below:
```json
{"a": {"b": 1}}
{"a": []}
```
and the schema is given as below:
```scala
val schema =
StructType(
StructField("a", StructType(
StructField("b", StringType) :: Nil
)) :: Nil)
```
- **Before**
```scala
sqlContext.read.schema(schema).json(path).show()
```
```scala
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
...
```
- **After**
```scala
sqlContext.read.schema(schema).json(path).show()
```
```bash
+----+
| a|
+----+
| [1]|
|null|
+----+
```
For other data types, in this case it converts the given values are `null` but only this case emits an exception.
This PR makes the support for wrapped rows applied only at the top level.
## How was this patch tested?
Unit tests were used and `./dev/run_tests` for code style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11752 from HyukjinKwon/SPARK-3308-follow-up.
## What changes were proposed in this pull request?
Narrow down the parameter type of `UserDefinedType#serialize()`. Currently, the parameter type is `Any`, however it would logically make more sense to narrow it down to the type of the actual user defined type.
## How was this patch tested?
Existing tests were successfully run on local machine.
Author: Jakob Odersky <jakob@odersky.com>
Closes#11379 from jodersky/SPARK-11011-udt-types.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13894
Change the return type of the `SQLContext.range` API from `DataFrame` to `Dataset`.
## How was this patch tested?
No additional unit test required.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#11730 from chenghao-intel/range.
## What changes were proposed in this pull request?
There is a feature of hive SQL called multi-insert. For example:
```
FROM src
INSERT OVERWRITE TABLE dest1
SELECT key + 1
INSERT OVERWRITE TABLE dest2
SELECT key WHERE key > 2
INSERT OVERWRITE TABLE dest3
SELECT col EXPLODE(arr) exp AS col
...
```
We partially support it currently, with some limitations: 1) WHERE can't reference columns produced by LATERAL VIEW. 2) It's not executed eagerly, i.e. `sql("...multi-insert clause...")` won't take place right away like other commands, e.g. CREATE TABLE.
This PR removes these limitations and make us fully support multi-insert.
## How was this patch tested?
new tests in `SQLQuerySuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11754 from cloud-fan/lateral-view.
## What changes were proposed in this pull request?
Follow up to https://github.com/apache/spark/pull/11657
- Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8`
- And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests)
- And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#11725 from srowen/SPARK-13823.2.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13899
This PR makes CSV data source produce `InternalRow` instead of `Row`.
Basically, this resembles JSON data source. It uses the same codes for casting.
## How was this patch tested?
Unit tests were used within IDE and code style was checked by `./dev/run_tests`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11717 from HyukjinKwon/SPARK-13899.
## What changes were proposed in this pull request?
This PR brings codegen support for broadcast left-semi join.
## How was this patch tested?
Existing tests. Added benchmark, the result show 7X speedup.
Author: Davies Liu <davies@databricks.com>
Closes#11742 from davies/gen_semi.
## What changes were proposed in this pull request?
This PR just move some code from SortMergeOuterJoin into SortMergeJoin.
This is for support codegen for outer join.
## How was this patch tested?
existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#11743 from davies/gen_smjouter.
## What changes were proposed in this pull request?
This patch changes DataFrameReader.text()'s return type from DataFrame to Dataset[String].
Closes#11731.
## How was this patch tested?
Updated existing integration tests to reflect the change.
Author: Reynold Xin <rxin@databricks.com>
Closes#11739 from rxin/SPARK-13895.
## What changes were proposed in this pull request?
Change the return type of toJson in Dataset class
## How was this patch tested?
No additional unit test required.
Author: Stavros Kontopoulos <stavros.kontopoulos@typesafe.com>
Closes#11732 from skonto/fix_toJson.
## What changes were proposed in this pull request?
Our internal code can go through SessionState.catalog and SessionState.analyzer. This brings two small benefits:
1. Reduces internal dependency on SQLContext.
2. Removes 2 public methods in Java (Java does not obey package private visibility).
More importantly, according to the design in SPARK-13485, we'd need to claim this catalog function for the user-facing public functions, rather than having an internal field.
## How was this patch tested?
Existing unit/integration test code.
Author: Reynold Xin <rxin@databricks.com>
Closes#11716 from rxin/SPARK-13893.
## What changes were proposed in this pull request?
In general it is better for internal classes to not depend on the external class (in this case SQLContext) to reduce coupling between user-facing APIs and the internal implementations. This patch removes SQLContext dependency from some internal classes such as SparkPlanner, SparkOptimizer.
As part of this patch, I also removed the following internal methods from SQLContext:
```
protected[sql] def functionRegistry: FunctionRegistry
protected[sql] def optimizer: Optimizer
protected[sql] def sqlParser: ParserInterface
protected[sql] def planner: SparkPlanner
protected[sql] def continuousQueryManager
protected[sql] def prepareForExecution: RuleExecutor[SparkPlan]
```
## How was this patch tested?
Existing unit/integration tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#11712 from rxin/sqlContext-planner.
## What changes were proposed in this pull request?
This patch removes DescribeCommand's dependency on LogicalPlan. After this patch, DescribeCommand simply accepts a TableIdentifier. It minimizes the dependency, and blocks my next patch (removes SQLContext dependency from SparkPlanner).
## How was this patch tested?
Should be covered by existing unit tests and Hive compatibility tests that run describe table.
Author: Reynold Xin <rxin@databricks.com>
Closes#11710 from rxin/SPARK-13884.
## What changes were proposed in this pull request?
When we call DataFrame/Dataset.collect(), Java serializer (or Kryo Serializer) will be used to serialize the UnsafeRows in executor, then deserialize them into UnsafeRows in driver. Java serializer (and Kyro serializer) are slow on millions rows, because they try to find out the same rows, but usually there is no same rows.
This PR will serialize the UnsafeRows as byte array by packing them together, then Java serializer (or Kyro serializer) serialize the bytes very fast (there are fewer blocks and byte array are not compared by content).
The UnsafeRow format is highly compressible, the serialized bytes are also compressed (configurable by spark.io.compression.codec).
## How was this patch tested?
Existing unit tests.
Add a benchmark for collect, before this patch:
```
Intel(R) Core(TM) i7-4558U CPU 2.80GHz
collect: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
collect 1 million 3991 / 4311 0.3 3805.7 1.0X
collect 2 millions 10083 / 10637 0.1 9616.0 0.4X
collect 4 millions 29551 / 30072 0.0 28182.3 0.1X
```
```
Intel(R) Core(TM) i7-4558U CPU 2.80GHz
collect: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
collect 1 million 775 / 1170 1.4 738.9 1.0X
collect 2 millions 1153 / 1758 0.9 1099.3 0.7X
collect 4 millions 4451 / 5124 0.2 4244.9 0.2X
```
We can see about 5-7X speedup.
Author: Davies Liu <davies@databricks.com>
Closes#11664 from davies/serialize_row.
## What changes were proposed in this pull request?
Avoid the copy in HashedRelation, since most of the HashedRelation are built with Array[Row], added the copy() for LeftSemiJoinHash. This could help to reduce the memory consumption for Broadcast join.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#11666 from davies/remove_copy.
## What changes were proposed in this pull request?
1. Rename DataFrame.scala Dataset.scala, since the class is now named Dataset.
2. Remove LegacyFunctions. It was introduced in Spark 1.6 for backward compatibility, and can be removed in Spark 2.0.
## How was this patch tested?
Should be covered by existing unit/integration tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#11704 from rxin/SPARK-13880.