Both core and sql have slightly different code that does variable substitution
of config values. This change refactors that code and encapsulates the logic
of reading config values and expanding variables in a new helper class. The
class can be configured so that both core and sql can use it without losing
existing functionality; it also allows for easier testing and makes it easier
to add more features in the future.
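The idea can be sketched as a tiny helper (illustrative only — the object name, the `${prefix:name}` syntax, and the supported prefixes are assumptions here, not the exact Spark implementation):

```scala
object VariableSubstitution {
  private val Pattern = """\$\{(\w+):([\w.\-]+)\}""".r

  // Expand ${conf:key} / ${env:NAME} style references in `input`;
  // unresolvable references are left untouched.
  def substitute(input: String, conf: Map[String, String]): String =
    Pattern.replaceAllIn(input, m => {
      val resolved = m.group(1) match {
        case "conf" => conf.get(m.group(2))
        case "env"  => sys.env.get(m.group(2))
        case _      => None
      }
      java.util.regex.Matcher.quoteReplacement(resolved.getOrElse(m.matched))
    })
}
```

Centralizing this logic means the supported prefixes and lookup sources become configuration of one class rather than duplicated code in core and sql.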
Tested with existing and new unit tests, and by running spark-shell with
some configs referencing variables and making sure it behaved as expected.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#14468 from vanzin/SPARK-16671.
## What changes were proposed in this pull request?
Don't override app name specified in `SparkConf` with a random app name. Only set it if the conf has no app name even after options have been applied.
See also https://github.com/apache/spark/pull/14602
This is similar to Sherry302 's original proposal in https://github.com/apache/spark/pull/14556
## How was this patch tested?
Jenkins test, with new case reproducing the bug
Author: Sean Owen <sowen@cloudera.com>
Closes#14630 from srowen/SPARK-16966.2.
## What changes were proposed in this pull request?
In this PR, we allow the user to add additional options when creating a new table in the JDBC writer.
The options can be `table_options` or `partition_options`.
E.g., "CREATE TABLE t (name string) ENGINE=InnoDB DEFAULT CHARSET=utf8"
Here is the usage example:
```
df.write.option("createTableOptions", "ENGINE=InnoDB DEFAULT CHARSET=utf8").jdbc(...)
```
## How was this patch tested?
Test results will be added soon.
Author: GraceH <93113783@qq.com>
Closes#14559 from GraceH/jdbc_options.
## What changes were proposed in this pull request?
Currently, Spark ignores path names starting with underscore `_` and `.`. This causes read failures for column-partitioned file data sources whose partition column names start with `_`, e.g. `_col`.
**Before**
```scala
scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
scala> spark.read.parquet("/tmp/parquet")
org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /tmp/parquet20. It must be specified manually;
```
**After**
```scala
scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
scala> spark.read.parquet("/tmp/parquet")
res2: org.apache.spark.sql.DataFrame = [id: bigint, _locality_code: int]
```
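The key idea of the fix can be sketched as a listing predicate (an illustration, not Spark's exact code): hidden files are still skipped, but partition directories contain `=` and must be kept.

```scala
object PartitionPathFilter {
  // Skip hidden files such as `_SUCCESS` or `.DS_Store`, but keep
  // partition directories such as `_locality_code=10`, which contain `=`.
  def shouldFilterOut(pathName: String): Boolean =
    (pathName.startsWith("_") || pathName.startsWith(".")) &&
      !pathName.contains("=")
}
```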
## How was this patch tested?
Pass the Jenkins with a new test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14585 from dongjoon-hyun/SPARK-16975-PARQUET.
## What changes were proposed in this pull request?
1. `sampled` doesn't need to be `ArrayBuffer`, we never update it, but assign new value
2. `count` doesn't need to be `var`, we never mutate it.
3. `headSampled` doesn't need to be in constructor, we never pass a non-empty `headSampled` to constructor
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14603 from cloud-fan/simply.
## What changes were proposed in this pull request?
There could be multiple subqueries that generate the same results; we could reuse the result instead of running them multiple times.
This PR also cleans up how we run subqueries.
For SQL query
```sql
select id,(select avg(id) from t) from t where id > (select avg(id) from t)
```
The explain is
```
== Physical Plan ==
*Project [id#15L, Subquery subquery29 AS scalarsubquery()#35]
:  +- Subquery subquery29
:     +- *HashAggregate(keys=[], functions=[avg(id#15L)])
:        +- Exchange SinglePartition
:           +- *HashAggregate(keys=[], functions=[partial_avg(id#15L)])
:              +- *Range (0, 1000, splits=4)
+- *Filter (cast(id#15L as double) > Subquery subquery29)
   :  +- Subquery subquery29
   :     +- *HashAggregate(keys=[], functions=[avg(id#15L)])
   :        +- Exchange SinglePartition
   :           +- *HashAggregate(keys=[], functions=[partial_avg(id#15L)])
   :              +- *Range (0, 1000, splits=4)
   +- *Range (0, 1000, splits=4)
```
The visualized plan:
![reuse-subquery](https://cloud.githubusercontent.com/assets/40902/17573229/e578d93c-5f0d-11e6-8a3c-0150d81d3aed.png)
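The reuse idea amounts to memoizing subquery execution by a canonical key, so the expensive computation runs once and later occurrences reuse the cached result. A toy sketch (not the planner code; the class and method names are illustrative):

```scala
import scala.collection.mutable

class SubqueryCache[K, V] {
  private val cache = mutable.Map.empty[K, V]
  var computations = 0  // how many times a subquery was actually executed

  // Run `execute` only the first time this canonical key is seen;
  // later calls return the cached result.
  def getOrExecute(canonicalKey: K)(execute: => V): V =
    cache.getOrElseUpdate(canonicalKey, { computations += 1; execute })
}
```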
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#14548 from davies/subq.
## What changes were proposed in this pull request?
In both `OnHeapColumnVector` and `OffHeapColumnVector`, we implemented `getInt()` with the following code pattern:
```java
public int getInt(int rowId) {
  if (dictionary == null) {
    return intData[rowId];
  } else {
    return dictionary.decodeToInt(dictionaryIds.getInt(rowId));
  }
}
```
As `dictionaryIds` is also a `ColumnVector`, this results in a recursive call of `getInt()` and breaks JIT inlining. As a result, `getInt()` will not get inlined.
We fix this by adding a separate method `getDictId()` specific for `dictionaryIds` to use.
## How was this patch tested?
We tested the difference with the following aggregate query on a TPCDS dataset (with scale factor = 5):
```sql
select
  max(ss_sold_date_sk) as max_ss_sold_date_sk
from store_sales
```
The query runtime is improved, from 202ms (before) to 159ms (after).
Author: Qifan Pu <qifan.pu@gmail.com>
Closes#14513 from ooq/SPARK-16928.
## What changes were proposed in this pull request?
The base class `SpecificParquetRecordReaderBase`, used by the vectorized Parquet reader, tries to get pushed-down filters from the given configuration. These pushed-down filters are used for row-group-level filtering. However, we don't set the filters to push down into the configuration; in other words, the filters are not actually pushed down to do row-group-level filtering. This patch fixes that by setting up the filters for pushdown in the configuration for the reader.
The benchmark that excludes the time of writing Parquet file:
```scala
test("Benchmark for Parquet") {
  val N = 500 << 12
  withParquetTable((0 until N).map(i => (101, i)), "t") {
    val benchmark = new Benchmark("Parquet reader", N)
    benchmark.addCase("reading Parquet file", 10) { iter =>
      sql("SELECT _1 FROM t where t._1 < 100").collect()
    }
    benchmark.run()
  }
}
```
By default, `withParquetTable` runs the tests with both the vectorized and non-vectorized readers; I only let it run the vectorized reader.
When we set the Parquet block size to 1024 so that there are multiple row groups, the benchmark results are:
Before this patch:
The retrieved row groups: 8063
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU 3.10GHz
Parquet reader:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Parquet file                    825 / 1233          2.5         402.6       1.0X
```
After this patch:
The retrieved row groups: 0
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU 3.10GHz
Parquet reader:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Parquet file                     306 / 503           6.7         149.6       1.0X
```
Next, I ran the benchmark for the non-pushdown case using the same benchmark code but with pushdown disabled. This time the Parquet block size is the default value.
Before this patch:
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU 3.10GHz
Parquet reader:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Parquet file                     136 / 238          15.0          66.5       1.0X
```
After this patch:
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU 3.10GHz
Parquet reader:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Parquet file                     124 / 193          16.5          60.7       1.0X
```
For the non-pushdown case, the results suggest that this patch doesn't affect the normal code path.
I manually output `totalRowCount` in `SpecificParquetRecordReaderBase` to check whether this patch actually filters the row groups. When running the above benchmark:
After this patch:
`totalRowCount = 0`
Before this patch:
`totalRowCount = 1024000`
## How was this patch tested?
Existing tests should be passed.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#13701 from viirya/vectorized-reader-push-down-filter2.
## What changes were proposed in this pull request?
Fix the construction of the file path. The previous construction produced incorrect paths on Windows.
## How was this patch tested?
Run SQL unit tests on Windows
Author: avulanov <nashb@yandex.ru>
Closes#13868 from avulanov/SPARK-15899-file.
## What changes were proposed in this pull request?
Document that `regexp_extract` returns an empty string when the regex or group does not match.
## How was this patch tested?
Jenkins test, with a few new test cases
Author: Sean Owen <sowen@cloudera.com>
Closes#14525 from srowen/SPARK-16324.
#### What changes were proposed in this pull request?
When we do not turn on Hive support, the following query generates a confusing error message from the planner:
```Scala
sql("CREATE TABLE t2 SELECT a, b from t1")
```
```
assertion failed: No plan for CreateTable CatalogTable(
Table: `t2`
Created: Tue Aug 09 23:45:32 PDT 2016
Last Access: Wed Dec 31 15:59:59 PST 1969
Type: MANAGED
Provider: hive
Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), ErrorIfExists
+- Relation[a#19L,b#20L] parquet
java.lang.AssertionError: assertion failed: No plan for CreateTable CatalogTable(
Table: `t2`
Created: Tue Aug 09 23:45:32 PDT 2016
Last Access: Wed Dec 31 15:59:59 PST 1969
Type: MANAGED
Provider: hive
Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), ErrorIfExists
+- Relation[a#19L,b#20L] parquet
```
This PR is to issue a better error message:
```
Hive support is required to use CREATE Hive TABLE AS SELECT
```
#### How was this patch tested?
Added test cases in `DDLSuite.scala`
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13886 from gatorsmile/createCatalogedTableAsSelect.
## What changes were proposed in this pull request?
MSCK REPAIR TABLE could be used to recover the partitions in external catalog based on partitions in file system.
Another syntax is: ALTER TABLE table RECOVER PARTITIONS
The implementation in this PR will only list partitions (not the files within each partition) in the driver (in parallel if needed).
## How was this patch tested?
Added unit tests for it and Hive compatibility test suite.
Author: Davies Liu <davies@databricks.com>
Closes#14500 from davies/repair_table.
## What changes were proposed in this pull request?
This package is meant to be internal, and as a result it does not make sense to mark things as private[sql] or private[spark]. It simply makes debugging harder when Spark developers need to inspect the plans at runtime.
This patch removes all private[sql] and private[spark] visibility modifiers in org.apache.spark.sql.execution.
## How was this patch tested?
N/A - just visibility changes.
Author: Reynold Xin <rxin@databricks.com>
Closes#14554 from rxin/remote-private.
## What changes were proposed in this pull request?
This PR adds argument type information for typed logical plan like MapElements, TypedFilter, and AppendColumn, so that we can use these info in customized optimizer rule.
## How was this patch tested?
Existing test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#14494 from clockfly/add_more_info_for_typed_operator.
## What changes were proposed in this pull request?
The logic for LEAD/LAG processing is more complex than it needs to be. This PR fixes that.
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#14376 from hvanhovell/SPARK-16749.
### What changes were proposed in this pull request?
Currently, the `refreshTable` API is always case sensitive.
When users use the view name without an exact case match, the API silently ignores the call, so users might expect the command to have completed successfully. However, when they run subsequent SQL commands, they might still get exceptions like:
```
Job aborted due to stage failure:
Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 7, localhost):
java.io.FileNotFoundException:
File file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-bd4b9ea6-9aec-49c5-8f05-01cff426211e/part-r-00000-0c84b915-c032-4f2e-abf5-1d48fdbddf38.snappy.parquet does not exist
```
This PR is to fix the issue.
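A minimal model of the fix's idea — normalizing identifiers before the cache lookup so that `refreshTable` hits the same entry regardless of case when the analyzer is case-insensitive (the class and method names here are hypothetical):

```scala
import scala.collection.mutable

class CachedTableRegistry(caseSensitive: Boolean) {
  private val entries = mutable.Map.empty[String, String]

  private def normalize(name: String): String =
    if (caseSensitive) name else name.toLowerCase(java.util.Locale.ROOT)

  def cache(name: String, plan: String): Unit = entries(normalize(name)) = plan

  // Returns true if a cached entry was found (and invalidated).
  def refresh(name: String): Boolean = entries.remove(normalize(name)).isDefined
}
```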
### How was this patch tested?
Added a test case.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14523 from gatorsmile/refreshTempTable.
#### What changes were proposed in this pull request?
When doing a CTAS with a PARTITIONED BY clause, we get a wrong error message.
For example,
```SQL
CREATE TABLE gen__tmp
PARTITIONED BY (key string)
AS SELECT key, value FROM mytable1
```
The error message we get now is like
```
Operation not allowed: Schema may not be specified in a Create Table As Select (CTAS) statement(line 2, pos 0)
```
However, based on the code, the message we should get is like
```
Operation not allowed: A Create Table As Select (CTAS) statement is not allowed to create a partitioned table using Hive's file formats. Please use the syntax of "CREATE TABLE tableName USING dataSource OPTIONS (...) PARTITIONED BY ...\" to create a partitioned table through a CTAS statement.(line 2, pos 0)
```
Currently, partitioning columns are part of the schema. This PR fixes the bug by changing the detection order.
#### How was this patch tested?
Added test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14113 from gatorsmile/ctas.
## What changes were proposed in this pull request?
This PR adds auxiliary info like input class and input schema in TypedAggregateExpression
## How was this patch tested?
Manual test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#14501 from clockfly/typed_aggregation.
## What changes were proposed in this pull request?
This problem was found in [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251) and we disabled filter pushdown on binary columns in Spark before. We enabled this after upgrading Parquet but it seems there is potential incompatibility for Parquet files written in lower Spark versions.
Currently, this does not happen in the normal Parquet reader. However, Spark implements a vectorized reader, separate from Parquet's standard API. The normal Parquet reader handles this case, but the vectorized reader does not.
It is okay to just pass `FileMetaData`. This is how it is handled in parquet-mr (see e3b95020f7), and it prevents loading corrupt statistics from each page in Parquet.
This PR replaces the deprecated usage of constructor.
## How was this patch tested?
N/A
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#14450 from HyukjinKwon/SPARK-16847.
## What changes were proposed in this pull request?
Spark converts **BooleanType** to **BIT(1)**, **LongType** to **BIGINT**, and **ByteType** to **BYTE** when saving a DataFrame to Oracle, but Oracle does not support the BIT, BIGINT, and BYTE types.
This PR converts the following _Spark types_ to _Oracle types_, per the [Oracle Developer's Guide](https://docs.oracle.com/cd/E19501-01/819-3659/gcmaz/):
Spark Type | Oracle
----|----
BooleanType | NUMBER(1)
IntegerType | NUMBER(10)
LongType | NUMBER(19)
FloatType | NUMBER(19, 4)
DoubleType | NUMBER(19, 4)
ByteType | NUMBER(3)
ShortType | NUMBER(5)
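The mapping in the table above can be sketched as a simple dialect function; type names are plain strings here rather than Spark's `DataType`/`JdbcType` classes, so this is an illustration rather than the actual `OracleDialect` code:

```scala
object OracleTypeMapping {
  // Map a Spark type name to an Oracle column type, per the table above.
  def toOracleType(sparkType: String): Option[String] = sparkType match {
    case "BooleanType" => Some("NUMBER(1)")
    case "IntegerType" => Some("NUMBER(10)")
    case "LongType"    => Some("NUMBER(19)")
    case "FloatType"   => Some("NUMBER(19, 4)")
    case "DoubleType"  => Some("NUMBER(19, 4)")
    case "ByteType"    => Some("NUMBER(3)")
    case "ShortType"   => Some("NUMBER(5)")
    case _             => None // fall back to the default JDBC mapping
  }
}
```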
## How was this patch tested?
Add new tests in [JDBCSuite.scala](22b0c2a422 (diff-dc4b58851b084b274df6fe6b189db84d)) and [OracleDialect.scala](22b0c2a422 (diff-5e0cadf526662f9281aa26315b3750ad))
Author: Yuming Wang <wgyumg@gmail.com>
Closes#14377 from wangyum/SPARK-16625.
## What changes were proposed in this pull request?
we have various logical plans for CREATE TABLE and CTAS: `CreateTableUsing`, `CreateTableUsingAsSelect`, `CreateHiveTableAsSelectLogicalPlan`. This PR unifies them to reduce the complexity and centralize the error handling.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14482 from cloud-fan/table.
## What changes were proposed in this pull request?
For non-partitioned parquet table, if the vectorized parquet record reader is not being used, Spark 2.0 adds an extra unnecessary memory copy to append partition values for each row.
There are several typical cases that vectorized parquet record reader is not being used:
1. When the table schema is not flat, like containing nested fields.
2. When `spark.sql.parquet.enableVectorizedReader = false`
By fixing this bug, we get about 20% - 30% performance gain in test case like this:
```scala
// Generates parquet table with nested columns
spark.range(100000000).select(struct($"id").as("nc")).write.parquet("/tmp/data4")

def time[R](block: => R): Long = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0)/1000000 + "ms")
  (t1 - t0)/1000000
}

val x = ((0 until 20).toList.map(x => time(spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()))).sum/20
```
## How was this patch tested?
After a few times warm up, we get 26% performance improvement
Before fix:
```
Average: 4584ms, raw data (10 tries): 4726ms 4509ms 4454ms 4879ms 4586ms 4733ms 4500ms 4361ms 4456ms 4640ms
```
After fix:
```
Average: 3614ms, raw data(10 tries): 3554ms 3740ms 4019ms 3439ms 3460ms 3664ms 3557ms 3584ms 3612ms 3531ms
```
Test env: Intel(R) Core(TM) i7-6700 CPU 3.40GHz, Intel SSD SC2KW24
Author: Sean Zhong <seanzhong@databricks.com>
Closes#14445 from clockfly/fix_parquet_regression_2.
## What changes were proposed in this pull request?
Add the missing args-checking for randomSplit and sample
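A sketch of the kind of checks added (hedged: Spark's actual error messages and where the checks live differ from this illustration):

```scala
object ArgChecks {
  // Reject weight arrays that cannot define a valid random split.
  def checkRandomSplitWeights(weights: Array[Double]): Unit = {
    require(weights.forall(_ >= 0.0),
      s"Weights must be nonnegative, but got ${weights.mkString("[", ", ", "]")}")
    require(weights.sum > 0.0, "Sum of weights must be positive")
  }

  // Reject negative sampling fractions.
  def checkSampleFraction(fraction: Double): Unit =
    require(fraction >= 0.0, s"Fraction must be nonnegative, but got $fraction")
}
```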
## How was this patch tested?
unit tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#14478 from zhengruifeng/fix_randomSplit.
## What changes were proposed in this pull request?
This moves `DataSourceScanExec` out so it's more discoverable, now that it doesn't necessarily depend on an existing RDD. cc davies
## How was this patch tested?
Existing tests.
Author: Eric Liang <ekl@databricks.com>
Closes#14487 from ericl/split-scan.
## What changes were proposed in this pull request?
This patch fixes the overflow in `LongToUnsafeRowMap` when the range of keys is very wide (the key is much smaller than `minKey`; for example, the key is `Long.MinValue` and `minKey` is > 0).
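A toy illustration of this class of overflow and the guard that avoids it (not the actual `LongToUnsafeRowMap` code):

```scala
object RangeCheck {
  // Buggy: `key - minKey` wraps for key = Long.MinValue and minKey > 0,
  // so the "offset >= 0" check passes when it should not.
  def inRangeBuggy(key: Long, minKey: Long, maxKey: Long): Boolean =
    (key - minKey) >= 0 && key <= maxKey

  // Fixed: compare directly, with no overflow-prone subtraction.
  def inRangeFixed(key: Long, minKey: Long, maxKey: Long): Boolean =
    key >= minKey && key <= maxKey
}
```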
## How was this patch tested?
Added regression test (also for SPARK-16740)
Author: Davies Liu <davies@databricks.com>
Closes#14464 from davies/fix_overflow.
## What changes were proposed in this pull request?
For the `Dataset` typed `select`:
```
def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1]
```
If type T is a case class or a tuple class that is not atomic, the resulting logical plan's schema will mismatch with `Dataset[T]` encoder's schema, which will cause encoder error and throw AnalysisException.
### Before change:
```
scala> case class A(a: Int, b: Int)
scala> Seq((0, A(1,2))).toDS.select($"_2".as[A])
org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input columns: [_2];
..
```
### After change:
```
scala> case class A(a: Int, b: Int)
scala> Seq((0, A(1,2))).toDS.select($"_2".as[A]).show
+---+---+
| a| b|
+---+---+
| 1| 2|
+---+---+
```
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#14474 from clockfly/SPARK-16853.
This is a pull request that was originally merged against branch-1.6 as #12000, now being merged into master as well. srowen zzcclp JoshRosen
This pull request fixes an issue in which cluster-mode executors fail to properly register a JDBC driver when the driver is provided in a jar by the user, but the driver class name is derived from a JDBC URL (rather than specified by the user). The consequence of this is that all JDBC accesses under the described circumstances fail with an IllegalStateException. I reported the issue here: https://issues.apache.org/jira/browse/SPARK-14204
My proposed solution is to have the executors register the JDBC driver class under all circumstances, not only when the driver is specified by the user.
This patch was tested manually. I built an assembly jar, deployed it to a cluster, and confirmed that the problem was fixed.
Author: Kevin McHale <kevin@premise.com>
Closes#14420 from mchalek/mchalek-jdbc_driver_registration.
## What changes were proposed in this pull request?
Partition discovery is rather expensive, so we should do it at execution time instead of during physical planning. Right now there is not much benefit, since `ListingFileCatalog` will scan all partitions at planning time anyway, but this can be optimized in the future. Also, some information useful for partition pruning might not be available at planning time.
This PR moves a lot of the file scan logic from planning to execution time. All file scan operations are handled by `FileSourceScanExec`, which handles both batched and non-batched file scans. This requires some duplication with `RowDataSourceScanExec`, but is probably worth it so that `FileSourceScanExec` does not need to depend on an input RDD.
TODO: in another PR, move `DataSourceScanExec` to its own file.
## How was this patch tested?
Existing tests (it might be worth adding a test that catalog.listFiles() is delayed until execution, but this can be delayed until there is an actual benefit to doing so).
Author: Eric Liang <ekl@databricks.com>
Closes#14241 from ericl/refactor.
## What changes were proposed in this pull request?
A small code style change: it's better to make the type parameter more accurate.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14458 from cloud-fan/parquet.
## What changes were proposed in this pull request?
`StructField` has very similar semantic with `CatalogColumn`, except that `CatalogColumn` use string to express data type. I think it's reasonable to use `StructType` as the `CatalogTable.schema` and remove `CatalogColumn`.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14363 from cloud-fan/column.
## What changes were proposed in this pull request?
This fixes a bug where the file scan operator does not take partition pruning into account in its implementation of `sameResult()`. As a result, executions may be incorrect for self-joins over the same base file relation.
The patch here is minimal, but we should reconsider relying on `metadata` for implementing sameResult() in the future, as string representations may not be uniquely identifying.
cc rxin
## How was this patch tested?
Unit tests.
Author: Eric Liang <ekl@databricks.com>
Closes#14425 from ericl/spark-16818.
## What changes were proposed in this pull request?
f12f11e578 introduced this bug: a `foreach` was used where a `map` was needed.
## How was this patch tested?
Test added
Author: Wesley Tang <tangmingjun@mininglamp.com>
Closes#14324 from breakdawn/master.
## What changes were proposed in this pull request?
We currently don't bound or manage the size of the data arrays used by column vectors in the vectorized reader (they're only bounded by `Integer.MAX_VALUE`), which may lead to OOMs while reading data. As a short-term fix, this patch intercepts the `OutOfMemoryError` and suggests that the user disable the vectorized Parquet reader.
## How was this patch tested?
Existing Tests
Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
Closes#14387 from sameeragarwal/oom.
## What changes were proposed in this pull request?
Avoid overflow of Long type causing a NegativeArraySizeException a few lines later.
## How was this patch tested?
Unit tests for HashedRelationSuite still pass.
I can confirm the python script I included in https://issues.apache.org/jira/browse/SPARK-16740 works fine with this patch. Unfortunately I don't have the knowledge/time to write a Scala test case for HashedRelationSuite right now. As the patch is pretty obvious I hope it can be included without this.
Thanks!
Author: Sylvain Zimmer <sylvain@sylvainzimmer.com>
Closes#14373 from sylvinus/master.
#### What changes were proposed in this pull request?
Currently, in Spark SQL, the initial creation of schema can be classified into two groups. It is applicable to both Hive tables and Data Source tables:
**Group A. Users specify the schema.**
_Case 1 CREATE TABLE AS SELECT_: the schema is determined by the result schema of the SELECT clause. For example,
```SQL
CREATE TABLE tab STORED AS TEXTFILE
AS SELECT * from input
```
_Case 2 CREATE TABLE_: users explicitly specify the schema. For example,
```SQL
CREATE TABLE jsonTable (_1 string, _2 string)
USING org.apache.spark.sql.json
```
**Group B. Spark SQL infers the schema at runtime.**
_Case 3 CREATE TABLE_. Users do not specify the schema but the path to the file location. For example,
```SQL
CREATE TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (path '${tempDir.getCanonicalPath}')
```
Before this PR, Spark SQL does not store the inferred schema in the external catalog for the cases in Group B. When users refresh the metadata cache, or access the table for the first time after (re-)starting Spark, Spark SQL infers the schema and stores it in the metadata cache to improve the performance of subsequent metadata requests. However, this runtime schema inference could cause undesirable schema changes after each reboot of Spark.
This PR is to store the inferred schema in the external catalog when creating the table. When users intend to refresh the schema after possible changes on external files (table location), they issue `REFRESH TABLE`. Spark SQL will infer the schema again based on the previously specified table location and update/refresh the schema in the external catalog and metadata cache.
In this PR, we do not use the inferred schema to replace the user-specified schema, to avoid external behavior changes. Based on the design, user-specified schemas (as described in Group A) can be changed by ALTER TABLE commands, although we do not support them yet.
#### How was this patch tested?
TODO: add more cases to cover the changes.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14207 from gatorsmile/userSpecifiedSchema.
## What changes were proposed in this pull request?
Fix two places in SQLConf documents regarding size in bytes and statistics.
## How was this patch tested?
No tests; this is a documentation-only change.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#14341 from viirya/fix-doc-size-in-bytes.
## What changes were proposed in this pull request?
Currently, the generated SQL has unstable IDs for generated attributes.
Stable SQL generation makes the generated queries easier to understand and test.
This PR provides stable SQL generation by the followings.
- Provide unique ids for generated subqueries, `gen_subquery_xxx`.
- Provide unique and stable ids for generated attributes, `gen_attr_xxx`.
**Before**
```scala
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res0: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res1: String = SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS gen_subquery_0
```
**After**
```scala
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res1: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res2: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
```
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14257 from dongjoon-hyun/SPARK-16621.
## What changes were proposed in this pull request?
This PR is the first step for the following feature:
For hash aggregation in Spark SQL, we use a fast aggregation hashmap to act as a "cache" in order to boost aggregation performance. Previously, the hashmap is backed by a `ColumnarBatch`. This has performance issues when we have wide schema for the aggregation table (large number of key fields or value fields).
In this JIRA, we support another implementation of fast hashmap, which is backed by a `RowBasedKeyValueBatch`. We then automatically pick between the two implementations based on certain knobs.
In this first-step PR, implementations for `RowBasedKeyValueBatch` and `RowBasedHashMapGenerator` are added.
## How was this patch tested?
Unit tests: `RowBasedKeyValueBatchSuite`
Author: Qifan Pu <qifan.pu@gmail.com>
Closes#14349 from ooq/SPARK-16524.
## What changes were proposed in this pull request?
Currently there are two inconsistencies:
1. For a data source table, we only print partition names; for a Hive table, we also print the partition schema. After this PR, we will always print the schema.
2. If a column doesn't have a comment, a data source table will print an empty string while a Hive table will print null. After this PR, we will always print null.
## How was this patch tested?
new test in `HiveDDLSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14302 from cloud-fan/minor3.
## What changes were proposed in this pull request?
Currently, `JdbcUtils.savePartition` does type-based dispatch for each row to write the appropriate values.
Instead, the appropriate setters for `PreparedStatement` can be created once according to the schema and then applied to each row. This approach is similar to `CatalystWriteSupport`.
This PR simply precomputes the setters to avoid the per-row dispatch.
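A toy sketch of the setter approach, using a mock statement in place of a real `java.sql.PreparedStatement` (all names here are illustrative, not the `JdbcUtils` code): the per-type dispatch happens once per schema, and each row then just applies the precomputed setters.

```scala
import scala.collection.mutable

object SetterSketch {
  final class MockStatement { val calls = mutable.Buffer.empty[String] }
  type Setter = (MockStatement, Any, Int) => Unit

  // Dispatch on the column type ONCE, returning a reusable setter.
  def makeSetter(dataType: String): Setter = dataType match {
    case "int"    => (st, v, i) => st.calls += s"setInt($i, $v)"
    case "string" => (st, v, i) => st.calls += s"setString($i, $v)"
    case other    => throw new IllegalArgumentException(s"Unsupported type: $other")
  }

  // Per row: no type dispatch, just apply the precomputed setters.
  def writeRow(st: MockStatement, setters: Array[Setter], row: Array[Any]): Unit =
    for (i <- row.indices) setters(i)(st, row(i), i + 1)
}
```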
## How was this patch tested?
Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#14323 from HyukjinKwon/SPARK-16675.
## What changes were proposed in this pull request?
This PR contains three changes.
First, this PR changes the behavior of lead/lag back to Spark 1.6's behavior, which is described as below:
1. lead/lag respect null input values, which means that if the offset row exists and the input value is null, the result will be null instead of the default value.
2. If the offset row does not exist, the default value will be used.
3. OffsetWindowFunction's nullable setting also considers the nullability of its input (because of the first change).
Second, this PR fixes the evaluation of lead/lag when the input expression is a literal. This fix is a result of the first change. In current master, if a literal is used as the input expression of a lead or lag function, the result will be this literal even if the offset row does not exist.
Third, this PR makes ResolveWindowFrame not fire if a window function is not resolved.
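A minimal model of the restored 1.6 semantics (not Spark's window-function code), with `Option` standing in for nullable values: if the offset row exists, its value is returned even when it is null; the default is used only when the offset row does not exist.

```scala
object LeadLag {
  // lead over a window partition: `pos` is the current row index,
  // `offset` the lookahead distance, `default` the fallback value.
  def lead[T](input: Seq[Option[T]], pos: Int, offset: Int, default: Option[T]): Option[T] = {
    val target = pos + offset
    if (target >= 0 && target < input.length) input(target) // may be None (null input)
    else default                                            // offset row absent: use default
  }
}
```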
## How was this patch tested?
New tests in SQLWindowFunctionSuite
Author: Yin Huai <yhuai@databricks.com>
Closes#14284 from yhuai/lead-lag.
## What changes were proposed in this pull request?
Currently, `SQLBuilder` raises `empty.reduceLeft` exceptions on *unoptimized* `EXISTS` queries. We had better prevent this.
```scala
scala> sql("CREATE TABLE t1(a int)")
scala> val df = sql("select * from t1 b where exists (select * from t1 a)")
scala> new org.apache.spark.sql.catalyst.SQLBuilder(df).toSQL
java.lang.UnsupportedOperationException: empty.reduceLeft
```
## How was this patch tested?
Pass the Jenkins tests with a new test suite.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14307 from dongjoon-hyun/SPARK-16672.
## What changes were proposed in this pull request?
**Issue 1: Disallow Creating/Altering a View when the same-name Table Exists (without IF NOT EXISTS)**
When we create OR alter a view, we check whether the view already exists. In the current implementation, if a table with the same name exists, we treat it as a view. However, this is not the right behavior. We should follow what Hive does. For example,
```
hive> CREATE TABLE tab1 (id int);
OK
Time taken: 0.196 seconds
hive> CREATE OR REPLACE VIEW tab1 AS SELECT * FROM t1;
FAILED: SemanticException [Error 10218]: Existing table is not a view
The following is an existing table, not a view: default.tab1
hive> ALTER VIEW tab1 AS SELECT * FROM t1;
FAILED: SemanticException [Error 10218]: Existing table is not a view
The following is an existing table, not a view: default.tab1
hive> CREATE VIEW IF NOT EXISTS tab1 AS SELECT * FROM t1;
OK
Time taken: 0.678 seconds
```
**Issue 2: Strange Error when Issuing Load Table Against A View**
Users should not be allowed to issue LOAD DATA against a view. Currently, when users do so, they get a very strange runtime error. For example,
```SQL
LOAD DATA LOCAL INPATH "$testData" INTO TABLE $viewName
```
```
java.lang.reflect.InvocationTargetException was thrown.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:680)
```
## How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14314 from gatorsmile/tableDDLAgainstView.
## What changes were proposed in this pull request?
The current fix for the deadlock disables interrupts in the StreamExecution while getting offsets for all sources, and when writing to any metadata log, to avoid potential deadlocks in HDFSMetadataLog (see the JIRA for more details). However, disabling interrupts can have unintended consequences in other sources. So I am making the fix narrower, by disabling interrupts only in the HDFSMetadataLog. This is a safer, narrower fix for something as risky as disabling interrupts.
## How was this patch tested?
Existing tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#14292 from tdas/SPARK-14131.
## What changes were proposed in this pull request?
It's weird that we have `BucketSpec` to abstract bucket info, but don't use it in `CatalogTable`. This PR moves `BucketSpec` into catalyst module.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14331 from cloud-fan/check.
## What changes were proposed in this pull request?
`CreateViewCommand` only needs some of the information in a `CatalogTable`, not all of it. We currently use some tricks (e.g. we need to check that the table type is `VIEW`, and we need to make `CatalogColumn.dataType` nullable) to allow it to take a `CatalogTable`.
This PR cleans it up and only pass in necessary information to `CreateViewCommand`.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14297 from cloud-fan/minor2.
## What changes were proposed in this pull request?
Currently, `JDBCRDD.compute` is doing type dispatch for each row to read appropriate values.
It might not have to be done like this because the schema is already kept in `JDBCRDD`.
So, appropriate converters can be created first according to the schema, and then applied to each row.
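The converter-per-column idea can be sketched in plain Scala (the names here are illustrative, not the actual `JDBCRDD` internals): build one conversion function per column from the schema once, then reuse the array for every row.

```scala
// Sketch: per-column converters built once from the schema, instead of
// dispatching on the column type for every value of every row.
sealed trait ColType
case object IntCol extends ColType
case object StrCol extends ColType

object ConverterSketch {
  type Converter = String => Any

  // Built once per schema, not once per row.
  def makeConverters(schema: Seq[ColType]): Array[Converter] =
    schema.map {
      case IntCol => (raw: String) => raw.toInt
      case StrCol => (raw: String) => raw
    }.toArray

  // Applied to each row: a plain loop over prebuilt functions.
  def convertRow(converters: Array[Converter], raw: Seq[String]): Seq[Any] =
    raw.zip(converters).map { case (value, convert) => convert(value) }
}
```

For example, `ConverterSketch.convertRow(ConverterSketch.makeConverters(Seq(IntCol, StrCol)), Seq("1", "a"))` yields `Seq(1, "a")` without any per-row type dispatch.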
## How was this patch tested?
Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#14313 from HyukjinKwon/SPARK-16674.
In the following code in `VectorizedHashMapGenerator.scala`:
```
def hashBytes(b: String): String = {
val hash = ctx.freshName("hash")
s"""
|int $result = 0;
|for (int i = 0; i < $b.length; i++) {
| ${genComputeHash(ctx, s"$b[i]", ByteType, hash)}
| $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2);
|}
""".stripMargin
}
```
when `b = input.getBytes()`, the current 2.0 code results in `getBytes()` being called n times, where n is the length of the input. `getBytes()` involves a memory copy, is thus expensive, and causes a performance degradation.
The fix is to evaluate `getBytes()` before the for loop.
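One way to express the fix is to bind the byte array to a fresh variable before the loop, so the generated Java evaluates `$b` (e.g. `input.getBytes()`) exactly once. The shape below is a sketch in the style of the surrounding generator code, not the exact patch:

```
def hashBytes(b: String): String = {
  val hash = ctx.freshName("hash")
  val bytes = ctx.freshName("bytes")
  s"""
     |byte[] $bytes = $b;
     |int $result = 0;
     |for (int i = 0; i < $bytes.length; i++) {
     |  ${genComputeHash(ctx, s"$bytes[i]", ByteType, hash)}
     |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2);
     |}
   """.stripMargin
}
```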
Performance bug, no additional test added.
Author: Qifan Pu <qifan.pu@gmail.com>
Closes#14337 from ooq/SPARK-16699.
(cherry picked from commit d226dce12b)
Signed-off-by: Reynold Xin <rxin@databricks.com>
## What changes were proposed in this pull request?
we also store data source table options in this field, it's unreasonable to call it `serdeProperties`.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14283 from cloud-fan/minor1.
## What changes were proposed in this pull request?
This PR adds a boolean option, `truncate`, for `SaveMode.Overwrite` of the JDBC DataFrameWriter. If this option is `true`, it tries to take advantage of `TRUNCATE TABLE` instead of `DROP TABLE`. This is a trivial option, but will provide great **convenience** for BI tool users based on RDBMS tables generated by Spark.
**Goal**
- Without `CREATE/DROP` privilege, we can save a dataframe to the database. Sometimes these privileges are not granted for security reasons.
- It will preserve the existing table information, so users can add and keep additional `INDEX`es and `CONSTRAINT`s on the table.
- Sometimes, `TRUNCATE` is faster than the combination of `DROP/CREATE`.
**Supported DBMS**
The following table shows `truncate`-option support. Due to the differing behavior of `TRUNCATE TABLE` among DBMSs, it's not always safe to use `TRUNCATE TABLE`. Spark will ignore the `truncate` option for **unknown** DBMSs and for **some** DBMSs with **default CASCADING** behavior. A newly added JDBCDialect should additionally implement the corresponding function to support the `truncate` option.
Spark Dialects | `truncate` OPTION SUPPORT
---------------|-------------------------------
MySQLDialect | O
PostgresDialect | X
DB2Dialect | O
MsSqlServerDialect | O
DerbyDialect | O
OracleDialect | O
**Before (TABLE with INDEX case)**: SparkShell & MySQL CLI are interleaved intentionally.
```scala
scala> val (url, prop)=("jdbc:mysql://localhost:3306/temp?useSSL=false", new java.util.Properties)
scala> prop.setProperty("user","root")
scala> df.write.mode("overwrite").jdbc(url, "table_with_index", prop)
scala> spark.range(10).write.mode("overwrite").jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id | bigint(20) | NO | | NULL | |
+-------+------------+------+-----+---------+-------+
mysql> CREATE UNIQUE INDEX idx_id ON table_with_index(id);
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id | bigint(20) | NO | PRI | NULL | |
+-------+------------+------+-----+---------+-------+
scala> spark.range(10).write.mode("overwrite").jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id | bigint(20) | NO | | NULL | |
+-------+------------+------+-----+---------+-------+
```
**After (TABLE with INDEX case)**
```scala
scala> spark.range(10).write.mode("overwrite").option("truncate", true).jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id | bigint(20) | NO | PRI | NULL | |
+-------+------------+------+-----+---------+-------+
```
**Error Handling**
- In case of exceptions, Spark will not retry. Users should turn off the `truncate` option.
- In case of schema change:
- If one of the column names changes, this will raise an exception, as expected.
- If only a type differs, this will work like Append mode.
## How was this patch tested?
Pass the Jenkins tests with an updated test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14086 from dongjoon-hyun/SPARK-16410.
### What changes were proposed in this pull request?
**Issue 1: Silent Ignorance of Bucket Specification When Creating Table Using Schema Inference**
When creating a data source table without explicit specification of schema or SELECT clause, we silently ignore the bucket specification (CLUSTERED BY... SORTED BY...) in [the code](ce3b98bae2/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala (L339-L354)).
For example,
```SQL
CREATE TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
path '${tempDir.getCanonicalPath}'
)
CLUSTERED BY (inexistentColumnA) SORTED BY (inexistentColumnB) INTO 2 BUCKETS
```
This PR captures it and issues an error message.
**Issue 2: Run-time `java.lang.ArithmeticException` when the number of buckets is set to zero.**
For example,
```SQL
CREATE TABLE t USING PARQUET
OPTIONS (PATH '${path.toString}')
CLUSTERED BY (a) SORTED BY (b) INTO 0 BUCKETS
AS SELECT 1 AS a, 2 AS b
```
The exception we got is
```
ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1.0 (TID 2)
java.lang.ArithmeticException: / by zero
```
This PR captures the misuse and issues an appropriate error message.
### How was this patch tested?
Added a test case in DDLSuite
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14210 from gatorsmile/createTableWithoutSchema.
## What changes were proposed in this pull request?
As part of the bugfix in https://github.com/apache/spark/pull/12279, if a row batch consists of both dictionary-encoded and non-dictionary-encoded pages, we explicitly decode the dictionary for the values that are already dictionary encoded. Currently we reset the dictionary while reading every page, which can potentially cause a `java.lang.ArrayIndexOutOfBoundsException` while decoding older pages. This patch fixes the problem by maintaining a single dictionary per row batch in the vectorized parquet reader.
## How was this patch tested?
Manual Tests against a number of hand-generated parquet files.
Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
Closes#14225 from sameeragarwal/vectorized.
## What changes were proposed in this pull request?
PR #14278 is a more general and simpler fix for SPARK-16632 than PR #14272. After merging #14278, we no longer need changes made in #14272. So here I revert them.
This PR targets both master and branch-2.0.
## How was this patch tested?
Existing tests.
Author: Cheng Lian <lian@databricks.com>
Closes#14300 from liancheng/revert-pr-14272.
## What changes were proposed in this pull request?
In `SpecificParquetRecordReaderBase`, which is used by the vectorized Parquet reader, we convert the Parquet requested schema into a Spark schema to guide column reader initialization. However, the Parquet requested schema is tailored from the schema of the physical file being scanned, and may have inaccurate type information due to bugs of other systems (e.g. HIVE-14294).
On the other hand, we already set the real Spark requested schema into Hadoop configuration in [`ParquetFileFormat`][1]. This PR simply reads out this schema to replace the converted one.
## How was this patch tested?
New test case added in `ParquetQuerySuite`.
[1]: https://github.com/apache/spark/blob/v2.0.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L292-L294
Author: Cheng Lian <lian@databricks.com>
Closes#14278 from liancheng/spark-16632-simpler-fix.
## What changes were proposed in this pull request?
Saving partitions to JDBC in a transaction can use a weaker transaction isolation level to reduce locking. Use a better method to check whether transactions are supported.
## How was this patch tested?
Existing Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#14054 from srowen/SPARK-16226.
This allows configuration to be more flexible, for example, when the cluster does
not have a homogeneous configuration (e.g. packages are installed on different
paths in different nodes). By allowing one to reference the environment from
the conf, it becomes possible to work around those in certain cases.
As part of the implementation, ConfigEntry now keeps track of all "known" configs
(i.e. those created through the use of ConfigBuilder), since that list is used
by the resolution code. This duplicates some code in SQLConf, which could potentially
be merged with this now. It will also make it simpler to implement some missing
features such as filtering which configs show up in the UI or in event logs - which
are not part of this change.
Another change is in the way ConfigEntry reads config data; it now takes a string
map and a function that reads env variables, so that it can be called both from
SparkConf and SQLConf. This makes it so both places follow the same read path,
instead of having to replicate certain logic in SQLConf. There are still a
couple of methods in SQLConf that peek into fields of ConfigEntry directly,
though.
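For example, with this change a config value can reference the environment, so a per-node path need not be hardcoded (the paths below are illustrative):

```
# spark-defaults.conf: resolve a node-specific install location at read time
spark.sql.hive.metastore.jars    ${env:HIVE_HOME}/lib/*
```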
Tested via unit tests, and by using the new variable expansion functionality
in a shell session with a custom spark.sql.hive.metastore.jars value.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#14022 from vanzin/SPARK-16272.
## What changes were proposed in this pull request?
Due to backward-compatibility reasons, the following Parquet schema is ambiguous:
```
optional group f (LIST) {
repeated group list {
optional group element {
optional int32 element;
}
}
}
```
According to the parquet-format spec, when interpreted as a standard 3-level layout, this type is equivalent to the following SQL type:
```
ARRAY<STRUCT<element: INT>>
```
However, when interpreted as a legacy 2-level layout, it's equivalent to
```
ARRAY<STRUCT<element: STRUCT<element: INT>>>
```
Historically, to disambiguate these cases, we employed two methods:
- `ParquetSchemaConverter.isElementType()`
Used to disambiguate the above cases while converting Parquet types to Spark types.
- `ParquetRowConverter.isElementType()`
Used to disambiguate the above cases while instantiating row converters that convert Parquet records to Spark rows.
Unfortunately, these two methods make different decision about the above problematic Parquet type, and caused SPARK-16344.
`ParquetRowConverter.isElementType()` is necessary for Spark 1.4 and earlier versions because Parquet requested schemata are directly converted from Spark schemata in these versions. The converted Parquet schemata may be incompatible with actual schemata of the underlying physical files when the files are written by a system/library that uses a schema conversion scheme that is different from Spark when writing Parquet LIST and MAP fields.
In Spark 1.5, Parquet requested schemata are always properly tailored from schemata of physical files to be read. Thus `ParquetRowConverter.isElementType()` is no longer necessary. This PR replaces this method with a simple yet accurate scheme: whenever an ambiguous Parquet type is hit, convert the type in question back to a Spark type using `ParquetSchemaConverter` and check whether it matches the corresponding Spark type.
## How was this patch tested?
New test cases added in `ParquetHiveCompatibilitySuite` and `ParquetQuerySuite`.
Author: Cheng Lian <lian@databricks.com>
Closes#14014 from liancheng/spark-16344-for-master-and-2.0.
When Hive (or at least certain versions of Hive) creates parquet files
containing tinyint or smallint columns, it stores them as int32, but
doesn't annotate the parquet field as containing the corresponding
int8 / int16 data. When Spark reads those files using the vectorized
reader, it follows the parquet schema for these fields, but when
actually reading the data it tries to use the type fetched from
the metastore, and then fails because data has been loaded into the
wrong fields in OnHeapColumnVector.
So instead of blindly trusting the parquet schema, check whether the
Catalyst-provided schema disagrees with it, and adjust the types so
that the necessary metadata is present when loading the data into
the ColumnVector instance.
Tested with unit tests and with tests that create byte / short columns
in Hive and try to read them from Spark.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#14272 from vanzin/SPARK-16632.
## What changes were proposed in this pull request?
In ScriptInputOutputSchema, we read the default RecordReader and RecordWriter from the conf. Since Spark 2.0 has deleted those config keys from the Hive conf, we have to set the default reader/writer class names ourselves. Otherwise we get None for LazySimpleSerde, and the data written cannot be read back by the script. The test case added worked fine with previous versions of Spark, but fails now.
## How was this patch tested?
added a test case in SQLQuerySuite.
Closes#14169
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#14249 from yhuai/scriptTransformation.
## What changes were proposed in this pull request?
Currently, `JacksonGenerator.apply` is doing type-based dispatch for each row to write appropriate values.
It might not have to be done like this because the schema is already kept.
So, appropriate writers can be created first according to the schema once, and then applied to each row. This approach is similar to `CatalystWriteSupport`.
This PR corrects `JacksonGenerator` so that it creates all writers for the schema once and then applies them to each row rather than type dispatching for every row.
The benchmark was run with the code below:
```scala
test("Benchmark for JSON writer") {
val N = 500 << 8
val row =
"""{"struct":{"field1": true, "field2": 92233720368547758070},
"structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
"arrayOfString":["str1", "str2"],
"arrayOfInteger":[1, 2147483647, -2147483648],
"arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
"arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
"arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
"arrayOfBoolean":[true, false, true],
"arrayOfNull":[null, null, null, null],
"arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
"arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
"arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
}"""
val df = spark.sqlContext.read.json(spark.sparkContext.parallelize(List.fill(N)(row)))
val benchmark = new Benchmark("JSON writer", N)
benchmark.addCase("writing JSON file", 10) { _ =>
withTempPath { path =>
df.write.format("json").save(path.getCanonicalPath)
}
}
benchmark.run()
}
```
This produced the results below
- **Before**
```
JSON writer: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
writing JSON file 1675 / 1767 0.1 13087.5 1.0X
```
- **After**
```
JSON writer: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
writing JSON file 1597 / 1686 0.1 12477.1 1.0X
```
In addition, I ran this benchmark 10 times for each and calculated the average elapsed time as below:
| **Before** | **After**|
|---------------|------------|
|17478ms |16669ms |
This is roughly a ~5% improvement.
## How was this patch tested?
Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#14028 from HyukjinKwon/SPARK-16351.
## What changes were proposed in this pull request?
This PR changes the name of the columns returned by the `SHOW PARTITIONS` and `SHOW COLUMNS` commands. Currently, both commands use `result` as the column name.
**Comparison: Column Name**
Command|Spark(Before)|Spark(After)|Hive
----------|--------------|------------|-----
SHOW PARTITIONS|result|partition|partition
SHOW COLUMNS|result|col_name|field
Note that both Spark and Hive use `col_name` in `DESC TABLES`. So, this PR chooses `col_name` for consistency among Spark commands.
**Before**
```scala
scala> sql("show partitions p").show()
+------+
|result|
+------+
| b=2|
+------+
scala> sql("show columns in p").show()
+------+
|result|
+------+
| a|
| b|
+------+
```
**After**
```scala
scala> sql("show partitions p").show
+---------+
|partition|
+---------+
| b=2|
+---------+
scala> sql("show columns in p").show
+--------+
|col_name|
+--------+
| a|
| b|
+--------+
```
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14199 from dongjoon-hyun/SPARK-16543.
## What changes were proposed in this pull request?
This patch enables SparkSession to provide spark version.
## How was this patch tested?
Manual test:
```
scala> sc.version
res0: String = 2.1.0-SNAPSHOT
scala> spark.version
res1: String = 2.1.0-SNAPSHOT
```
```
>>> sc.version
u'2.1.0-SNAPSHOT'
>>> spark.version
u'2.1.0-SNAPSHOT'
```
Author: Liwei Lin <lwlin7@gmail.com>
Closes#14165 from lw-lin/add-version.
#### What changes were proposed in this pull request?
If we create a table pointing to a parquet/json dataset without specifying the schema, the describe table command does not show the schema at all. It only shows `# Schema of this table is inferred at runtime`. In 1.6, describe table does show the schema of such a table.
~~For data source tables, to infer the schema, we need to load the data source tables at runtime. Thus, this PR calls the function `lookupRelation`.~~
For data source tables, we infer the schema before table creation. Thus, this PR sets the inferred schema as the table schema at table creation.
#### How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14148 from gatorsmile/describeSchema.
## What changes were proposed in this pull request?
Fix Java style errors and remove unused imports, which were found at random.
## How was this patch tested?
Tested on my local machine.
Author: Xin Ren <iamshrek@126.com>
Closes#14161 from keypointt/SPARK-16437.
This option is used by Hive to directly delete the files instead of
moving them to the trash. This is needed in certain configurations
where moving the files does not work. For non-Hive tables and partitions,
Spark already behaves as if the PURGE option was set, so there's no
need to do anything.
Hive support for PURGE was added in 0.14 (for tables) and 1.2 (for
partitions), so the code reflects that: trying to use the option with
older versions of Hive will cause an exception to be thrown.
The change is a little noisier than I would like, because of the code
to propagate the new flag through all the interfaces and implementations;
the main changes are in the parser and in HiveShim, aside from the tests
(DDLCommandSuite, VersionsSuite).
Tested by running sql and catalyst unit tests, plus VersionsSuite which
has been updated to test the version-specific behavior. I also ran an
internal test suite that uses PURGE and would not pass previously.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#13831 from vanzin/SPARK-16119.
## What changes were proposed in this pull request?
When a query only uses metadata (for example, the partition key), it can return results based on metadata without scanning files. Hive does this in HIVE-1003.
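For example, for a table partitioned by `ds` (illustrative), queries that touch only the partition column can be answered from catalog metadata:

```sql
-- Both can be answered from partition metadata alone, without scanning files:
SELECT DISTINCT ds FROM partitioned_table;
SELECT MAX(ds) FROM partitioned_table;
```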
## How was this patch tested?
add unit tests
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Lianhui Wang <lianhuiwang@users.noreply.github.com>
Closes#13494 from lianhuiwang/metadata-only.
## What changes were proposed in this pull request?
Currently the input `RDD` of a `Dataset` is always serialized to `RDD[InternalRow]` before being wrapped as a `Dataset`, but there are cases where we call `map` or `mapPartitions` right after converting to a `Dataset`.
In this case, serialize and then deserialize happens but it would not be needed.
This pr adds `ExistingRDD` logical plan for input with `RDD` to have a chance to eliminate serialize/deserialize.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#13890 from ueshin/issues/SPARK-16189.
## What changes were proposed in this pull request?
It would be useful to support listing the columns that are referenced by a filter. This can help simplify data source planning, because with this we would be able to implement unhandledFilters method in HadoopFsRelation.
This is based on rxin's patch (#13901) and adds unit tests.
## How was this patch tested?
Added a new suite FiltersSuite.
Author: petermaxlee <petermaxlee@gmail.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#14120 from petermaxlee/SPARK-16199.
## What changes were proposed in this pull request?
In order to make it clear which filters are fully handled by the
underlying data source, we will mark them with an `*`. This gives users a
clear visual cue that the filter is being treated differently by
Catalyst than filters which are merely presented to the underlying
DataSource.
Examples from the FilteredScanSuite, in this example `c IN (...)` is handled by the source, `b < ...` is not
### Before
```
//SELECT a FROM oneToTenFiltered WHERE a + b > 9 AND b < 16 AND c IN ('bbbbbBBBBB', 'cccccCCCCC', 'dddddDDDDD', 'foo')
== Physical Plan ==
Project [a#0]
+- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
+- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
```
### After
```
== Physical Plan ==
Project [a#0]
+- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
+- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), *In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
```
## How was this patch tested?
Manually tested with the Spark Cassandra Connector, a source which fully handles underlying filters. Now fully handled filters appear with an `*` next to their names. I can add an automated test as well if requested.
Post 1.6.1
Tested by modifying the FilteredScanSuite to run explains.
Author: Russell Spitzer <Russell.Spitzer@gmail.com>
Closes#11317 from RussellSpitzer/SPARK-12639-Star.
## What changes were proposed in this pull request?
An incorrect list of files was being allocated to a batch. This caused a file to be read multiple times across multiple batches.
## How was this patch tested?
Added unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#14143 from tdas/SPARK-16430-1.
## What changes were proposed in this pull request?
Display `No physical plan. Waiting for data.` instead of `N/A` for StreamingQuery.explain when no data arrives because `N/A` doesn't provide meaningful information.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#14100 from zsxwing/SPARK-16433.
## What changes were proposed in this pull request?
A structured streaming example with event time windowing.
## How was this patch tested?
Run locally
Author: James Thomas <jamesjoethomas@gmail.com>
Closes#13957 from jjthomas/current.
## What changes were proposed in this pull request?
Temporary tables are used frequently, but `spark.catalog.listColumns` does not support them. This PR makes `SessionCatalog` support temporary table column listing.
**Before**
```scala
scala> spark.range(10).createOrReplaceTempView("t1")
scala> spark.catalog.listTables().collect()
res1: Array[org.apache.spark.sql.catalog.Table] = Array(Table[name=`t1`, tableType=`TEMPORARY`, isTemporary=`true`])
scala> spark.catalog.listColumns("t1").collect()
org.apache.spark.sql.AnalysisException: Table `t1` does not exist in database `default`.;
```
**After**
```
scala> spark.catalog.listColumns("t1").collect()
res2: Array[org.apache.spark.sql.catalog.Column] = Array(Column[name='id', description='id', dataType='bigint', nullable='false', isPartition='false', isBucket='false'])
```
## How was this patch tested?
Pass the Jenkins tests including a new testcase.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14114 from dongjoon-hyun/SPARK-16458.
#### What changes were proposed in this pull request?
When users try to implement a data source API with extending only `RelationProvider` and `CreatableRelationProvider`, they will hit an error when resolving the relation.
```Scala
spark.read
.format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
.load()
.write
.format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
.save()
```
The error they hit is like
```
org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.;
org.apache.spark.sql.AnalysisException: org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.;
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
```
Actually, the bug fix is simple. [`DataSource.createRelation(sparkSession.sqlContext, mode, options, data)`](dd644f8117/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L429)) already returns a BaseRelation. We should not assign schema to `userSpecifiedSchema`. That schema assignment only makes sense for the data sources that extend `FileFormat`.
#### How was this patch tested?
Added a test case.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14075 from gatorsmile/dataSource.
## What changes were proposed in this pull request?
Currently, the JDBC writer uses dialects to get data types, but not to quote field names. This PR uses dialects to quote the field names, too.
**Reported Error Scenario (MySQL case)**
```scala
scala> val url="jdbc:mysql://localhost:3306/temp"
scala> val prop = new java.util.Properties
scala> prop.setProperty("user","root")
scala> val df = spark.createDataset(Seq("a","b","c")).toDF("order")
scala> df.write.mode("overwrite").jdbc(url, "temptable", prop)
...MySQLSyntaxErrorException: ... near 'order TEXT )
```
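The idea behind the fix can be sketched in plain Scala (hypothetical trait and object names, not Spark's exact `JdbcDialect` API): each dialect supplies its own identifier quoting, so reserved words such as `order` become valid in the generated DDL.

```scala
// Sketch: dialect-specific identifier quoting for generated DDL.
object DdlSketch {
  trait Dialect {
    def quoteIdentifier(name: String): String
  }

  // MySQL quotes identifiers with backticks.
  object MySqlDialect extends Dialect {
    def quoteIdentifier(name: String): String = s"`$name`"
  }

  // ANSI SQL quotes identifiers with double quotes.
  object AnsiDialect extends Dialect {
    def quoteIdentifier(name: String): String = "\"" + name + "\""
  }

  // Build the column list for a CREATE TABLE, quoting every field name.
  def columnDdl(dialect: Dialect, fields: Seq[(String, String)]): String =
    fields.map { case (name, tpe) => s"${dialect.quoteIdentifier(name)} $tpe" }
      .mkString(", ")
}
```

With this, `DdlSketch.columnDdl(DdlSketch.MySqlDialect, Seq("order" -> "TEXT"))` produces a quoted column definition instead of the invalid bare `order TEXT`.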
## How was this patch tested?
Pass the Jenkins tests and manually do the above case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14107 from dongjoon-hyun/SPARK-16387.
## What changes were proposed in this pull request?
Adds a `quoteAll` option for writing CSV which quotes all fields.
See https://issues.apache.org/jira/browse/SPARK-13638
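Usage then looks like this (the output path is illustrative):

```scala
df.write
  .option("quoteAll", "true")  // quote every field, not only those that need it
  .csv("/tmp/out")
```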
## How was this patch tested?
Added a test to verify the output columns are quoted for all fields in the Dataframe
Author: Jurriaan Pruis <email@jurriaanpruis.nl>
Closes#13374 from jurriaan/csv-quote-all.
## What changes were proposed in this pull request?
An option that limits the file stream source to reading one file at a time enables rate limiting. It has the additional convenience that a static set of files can be used like a stream for testing, as this allows those files to be considered one at a time.
This PR adds option `maxFilesPerTrigger`.
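Usage sketch (the input path is illustrative):

```scala
val stream = spark.readStream
  .format("text")
  .option("maxFilesPerTrigger", "1")  // consider at most one new file per trigger
  .load("/data/incoming")
```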
## How was this patch tested?
New unit test
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#14094 from tdas/SPARK-16430.
## What changes were proposed in this pull request?
There are cases where `complete` output mode does not output updated aggregated value; for details please refer to [SPARK-16350](https://issues.apache.org/jira/browse/SPARK-16350).
The cause is that, as we do `data.as[T].foreachPartition { iter => ... }` in `ForeachSink.addBatch()`, `foreachPartition()` does not support incremental planning for now.
This patches makes `foreachPartition()` support incremental planning in `ForeachSink`, by making a special version of `Dataset` with its `rdd()` method supporting incremental planning.
## How was this patch tested?
Added a unit test which failed before the change
Author: Liwei Lin <lwlin7@gmail.com>
Closes#14030 from lw-lin/fix-foreach-complete.
## What changes were proposed in this pull request?
This patch removes InSet filter pushdown from Parquet data source, since row-based pushdown is not beneficial to Spark and brings extra complexity to the code base.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#14076 from rxin/SPARK-16400.
#### What changes were proposed in this pull request?
When creating a view, a common user error is the number of columns produced by the `SELECT` clause does not match the number of column names specified by `CREATE VIEW`.
For example, given that table `t1` has only 3 columns
```SQL
create view v1(col2, col4, col3, col5) as select * from t1
```
Currently, Spark SQL reports the following error:
```
requirement failed
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:212)
at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:90)
```
This error message is very confusing. This PR is to detect the error and issue a meaningful error message.
#### How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14047 from gatorsmile/viewMismatchedColumns.
## What changes were proposed in this pull request?
Currently, the Scala API supports options of the types `String`, `Long`, `Double` and `Boolean`, while the Python API also supports other types.
This PR corrects the `tableProperty` rule to support other types (string, boolean, double and integer) so that options for data sources are supported in a consistent way. This will affect other rules such as DBPROPERTIES and TBLPROPERTIES (allowing other types as values).
Also, `TODO add bucketing and partitioning.` was removed because it was resolved in 24bea00047
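As a hedged sketch (the table name and property keys below are made up), the relaxed rule means DDL with unquoted boolean, integer and double property values becomes parseable; internally such values can simply be normalized to strings:

```scala
// Sketch only: normalizing non-string property values to strings, as the
// relaxed `tableProperty` rule allows booleans, integers and doubles.
def propValueToString(v: Any): String = v match {
  case s: String => s
  case other     => other.toString
}

// e.g. (hypothetical table) DDL like the following is now accepted:
//   CREATE TABLE t(a INT) USING parquet
//   TBLPROPERTIES ('k1' = true, 'k2' = 10, 'k3' = 1.5)

println(propValueToString(true))  // true
println(propValueToString(1.5))   // 1.5
```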
## How was this patch tested?
Unit test in `MetastoreDataSourcesSuite.scala`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13517 from HyukjinKwon/SPARK-14839.
## What changes were proposed in this pull request?
Currently, if there is a schema as below:
```
root
|-- _1: struct (nullable = true)
| |-- _1: integer (nullable = true)
```
and if we execute the codes below:
```scala
df.filter("_1 IS NOT NULL").count()
```
This pushes down a filter although the filter is being applied to a `StructType`. (If my understanding is correct, Spark does not push down filters for those.)
The reason is, `ParquetFilters.getFieldMap` produces results below:
```
(_1,StructType(StructField(_1,IntegerType,true)))
(_1,IntegerType)
```
and then it becomes a `Map`
```
(_1,IntegerType)
```
Now, because of ` ....lift(dataTypeOf(name)).map(_(name, value))`, this pushes down filters for `_1` which Parquet thinks is `IntegerType`. However, it is actually `StructType`.
So, Parquet filter2 produces incorrect results, for example, the codes below:
```
df.filter("_1 IS NOT NULL").count()
```
always produces 0.
This PR prevents this by not finding nested fields.
## How was this patch tested?
Unit test in `ParquetFilterSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#14067 from HyukjinKwon/SPARK-16371.
## What changes were proposed in this pull request?
PR #13696 renamed various Parquet support classes but left `CatalystWriteSupport` behind. This PR renames it as a follow-up.
## How was this patch tested?
N/A.
Author: Cheng Lian <lian@databricks.com>
Closes#14070 from liancheng/spark-15979-follow-up.
## What changes were proposed in this pull request?
These two configs should always be true after Spark 2.0. This patch removes them from the config list. Note that ideally this should've gone into branch-2.0, but due to the timing of the release we should only merge this in master for Spark 2.1.
## How was this patch tested?
Updated test cases.
Author: Reynold Xin <rxin@databricks.com>
Closes#14061 from rxin/SPARK-16388.
## What changes were proposed in this pull request?
Currently, the `regexp_replace` function supports `Column` arguments in a query. This PR supports that in a `Dataset` operation, too.
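Per row, the semantics are ordinary regex substitution; a minimal plain-Scala sketch (the Dataset call in the comment is a hypothetical usage of the new overload):

```scala
// Per-row semantics of regexp_replace, sketched without Spark.
def regexpReplace(subject: String, pattern: String, replacement: String): String =
  subject.replaceAll(pattern, replacement)

// With this PR, the pattern and replacement can themselves be Columns, e.g.:
//   df.select(regexp_replace($"text", $"pattern", $"replacement"))

println(regexpReplace("100-200", "(\\d+)", "num"))  // num-num
```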
## How was this patch tested?
Pass the Jenkins tests with an updated test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14060 from dongjoon-hyun/SPARK-16340.
## What changes were proposed in this pull request?
This PR removes `SessionState.executeSql` in favor of `SparkSession.sql`. We can remove this safely since the visibility `SessionState` is `private[sql]` and `executeSql` is only used in one **ignored** test, `test("Multiple Hive Instances")`.
## How was this patch tested?
Pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14055 from dongjoon-hyun/SPARK-16383.
## What changes were proposed in this pull request?
This patch fixes the bug that the refresh command does not work on temporary views. This patch is based on https://github.com/apache/spark/pull/13989, but removes the public Dataset.refresh() API and improves test coverage.
Note that I actually think the public refresh() API is very useful. We can in the future implement it by also invalidating the lazy vals in QueryExecution (or alternatively just create a new QueryExecution).
## How was this patch tested?
Re-enabled a previously ignored test, and added a new test suite for Hive testing behavior of temporary views against MetastoreRelation.
Author: Reynold Xin <rxin@databricks.com>
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14009 from rxin/SPARK-16311.
## What changes were proposed in this pull request?
Currently, there are a few reports about Spark 2.0 query performance regression for large queries.
This PR speeds up SQL query processing performance by removing a redundant **consecutive `executePlan`** call in the `Dataset.ofRows` function and `Dataset` instantiation. Specifically, this PR aims to reduce the overhead of SQL query execution plan generation, not real query execution, so the result cannot be seen in the Spark Web UI. Please use the following query script. The result is **25.78 sec** -> **12.36 sec**, as expected.
**Sample Query**
```scala
val n = 4000
val values = (1 to n).map(_.toString).mkString(", ")
val columns = (1 to n).map("column" + _).mkString(", ")
val query =
s"""
|SELECT $columns
|FROM VALUES ($values) T($columns)
|WHERE 1=2 AND 1 IN ($columns)
|GROUP BY $columns
|ORDER BY $columns
|""".stripMargin
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block
println("Elapsed time: " + ((System.nanoTime - t0) / 1e9) + "s")
result
}
```
**Before**
```scala
scala> time(sql(query))
Elapsed time: 30.138142577s // First query has a little overhead of initialization.
res0: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
scala> time(sql(query))
Elapsed time: 25.787751452s // Let's compare this one.
res1: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
```
**After**
```scala
scala> time(sql(query))
Elapsed time: 17.500279659s // First query has a little overhead of initialization.
res0: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
scala> time(sql(query))
Elapsed time: 12.364812255s // This shows the real difference. The speed up is about 2 times.
res1: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
```
## How was this patch tested?
Manual by the above script.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14044 from dongjoon-hyun/SPARK-16360.
## What changes were proposed in this pull request?
TypedAggregateExpression sets nullable based on the schema of the outputEncoder
## How was this patch tested?
Add test in DatasetAggregatorSuite
Author: Koert Kuipers <koert@tresata.com>
Closes#13532 from koertkuipers/feat-aggregator-nullable.
## What changes were proposed in this pull request?
This PR fixes the minor Java linter errors like the following.
```
- public int read(char cbuf[], int off, int len) throws IOException {
+ public int read(char[] cbuf, int off, int len) throws IOException {
```
## How was this patch tested?
Manual.
```
$ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14017 from dongjoon-hyun/minor_build_java_linter_error.
## What changes were proposed in this pull request?
In structured streaming, Spark does not report errors when the specified directory does not exist. This is a behavior different from the batch mode. This patch changes the behavior to fail if the directory does not exist (when the path is not a glob pattern).
## How was this patch tested?
Updated unit tests to reflect the new behavior.
Author: Reynold Xin <rxin@databricks.com>
Closes#14002 from rxin/SPARK-16335.
#### What changes were proposed in this pull request?
For JDBC data sources, users can specify `batchsize` for multi-row inserts and `fetchsize` for multi-row fetch. A few issues exist:
- The property keys are case sensitive. Thus, the existing test cases for `fetchsize` use incorrect names, `fetchSize`. Basically, the test cases are broken.
- No test case exists for `batchsize`.
- We do not detect the illegal input values for `fetchsize` and `batchsize`.
For example, when `batchsize` is zero, we got the following exception:
```
Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ArithmeticException: / by zero
```
when `fetchsize` is less than zero, we got the exception from the underlying JDBC driver:
```
Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.h2.jdbc.JdbcSQLException: Invalid value "-1" for parameter "rows" [90008-183]
```
This PR fixes all the above issues and issues appropriate exceptions when detecting illegal inputs for `fetchsize` and `batchsize`. It also updates the function descriptions.
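A minimal sketch of the added validation (the function names and error messages here are illustrative, not the actual implementation): `batchsize` must be at least 1 and `fetchsize` must be non-negative, so bad values fail fast instead of surfacing as a division by zero or a driver error.

```scala
// Illustrative validation sketch; names and messages are not Spark's.
def validatedBatchSize(batchSize: Int): Int = {
  require(batchSize >= 1, s"Invalid value `$batchSize` for parameter `batchsize`")
  batchSize
}

def validatedFetchSize(fetchSize: Int): Int = {
  require(fetchSize >= 0, s"Invalid value `$fetchSize` for parameter `fetchsize`")
  fetchSize
}

// Usage note: the property keys are case sensitive, so use the lower-case
// names, e.g. df.write.option("batchsize", "1000").jdbc(url, table, props)

println(validatedBatchSize(1000))  // 1000
```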
#### How was this patch tested?
Test cases are fixed and added.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13919 from gatorsmile/jdbcProperties.
## What changes were proposed in this pull request?
Spark silently drops exceptions during file listing. This is a very bad behavior because it can mask legitimate errors and the resulting plan will silently have 0 rows. This patch changes it to not silently drop the errors.
## How was this patch tested?
Manually verified.
Author: Reynold Xin <rxin@databricks.com>
Closes#13987 from rxin/SPARK-16313.
## What changes were proposed in this pull request?
This patch appends a message to suggest users running refresh table or reloading data frames when Spark sees a FileNotFoundException due to stale, cached metadata.
## How was this patch tested?
Added a unit test for this in MetadataCacheSuite.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14003 from petermaxlee/SPARK-16336.
## What changes were proposed in this pull request?
This PR implements the `posexplode` table-generating function. Currently, the master branch raises the following exception for a `map` argument. This differs from Hive.
**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```
**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
| 0| a| 1|
| 1| b| 2|
+---+---+-----+
```
For an `array` argument, the behavior after this change is the same as before.
```
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
| 0| 1|
| 1| 2|
| 2| 3|
+---+---+
```
## How was this patch tested?
Pass the Jenkins tests with newly added testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13971 from dongjoon-hyun/SPARK-16289.
## What changes were proposed in this pull request?
Force the sorter to spill when the number of elements in the pointer array reaches a certain size. This works around the issue of TimSort failing on large buffer sizes.
## How was this patch tested?
Tested by running a job which was failing without this change due to TimSort bug.
Author: Sital Kedia <skedia@fb.com>
Closes#13107 from sitalkedia/fix_TimSort.
## What changes were proposed in this pull request?
Add Catalog.refreshTable API into python interface for Spark-SQL.
## How was this patch tested?
Existing test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13558 from WeichenXu123/update_python_sql_interface_refreshTable.
## What changes were proposed in this pull request?
This PR adds 3 optimizer rules for typed filter:
1. push typed filter down through `SerializeFromObject` and eliminate the deserialization in filter condition.
2. pull typed filter up through `SerializeFromObject` and eliminate the deserialization in filter condition.
3. combine adjacent typed filters and share the deserialized object among all the condition expressions.
This PR also adds `TypedFilter` logical plan, to separate it from normal filter, so that the concept is more clear and it's easier to write optimizer rules.
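Rule 3 amounts to fusing the predicates so each row's object is deserialized once; a plain-Scala sketch of that fusion (illustrative, not the optimizer code):

```scala
// Sketch: combining two typed filter predicates into one, so the deserialized
// object can be shared by both conditions instead of being rebuilt per filter.
def combine[T](p1: T => Boolean, p2: T => Boolean): T => Boolean =
  x => p1(x) && p2(x)

// Conceptually, ds.filter(f1).filter(f2) becomes ds.filter(combine(f1, f2)).
val fused = combine[Int](_ > 1, _ < 5)
println(fused(3))  // true
println(fused(7))  // false
```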
## How was this patch tested?
`TypedFilterOptimizationSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13846 from cloud-fan/filter.
## What changes were proposed in this pull request?
This PR allows `emptyDataFrame.write` since the user didn't specify any partition columns.
**Before**
```scala
scala> spark.emptyDataFrame.write.parquet("/tmp/t1")
org.apache.spark.sql.AnalysisException: Cannot use all columns for partition columns;
scala> spark.emptyDataFrame.write.csv("/tmp/t1")
org.apache.spark.sql.AnalysisException: Cannot use all columns for partition columns;
```
After this PR, no exception occurs and the created directory has only one file, `_SUCCESS`, as expected.
## How was this patch tested?
Pass the Jenkins tests including updated test cases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13730 from dongjoon-hyun/SPARK-16006.
## What changes were proposed in this pull request?
This PR removes meaningless `StringIteratorReader` for CSV data source.
In `CSVParser.scala`, there is a `Reader` wrapping an `Iterator`, but this causes two problems.
Firstly, it was actually not faster than processing line by line with an `Iterator`, due to the additional logic needed to wrap the `Iterator` in a `Reader`.
Secondly, it brought extra complexity because additional logic is needed to allow every line to be read byte by byte. So, it was pretty difficult to figure out parsing issues (e.g. SPARK-14103).
A benchmark was performed manually and the results were below:
- Original codes with Reader wrapping Iterator
|End-to-end (ns) | Parse Time (ns) |
|-----------------------|------------------------|
|14116265034 |2008277960 |
- New codes with Iterator
|End-to-end (ns) | Parse Time (ns) |
|-----------------------|------------------------|
|13451699644 | 1549050564 |
For details about the environment, dataset and methods, please refer to the JIRA ticket.
## How was this patch tested?
Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13808 from HyukjinKwon/SPARK-14480-small.
#### What changes were proposed in this pull request?
Based on the previous discussion with cloud-fan hvanhovell in another related PR https://github.com/apache/spark/pull/13764#discussion_r67994276, it looks reasonable to add convenience methods for users to add `comment` when defining `StructField`.
Currently, the column-related `comment` attribute is stored in `Metadata` of `StructField`. For example, users can add the `comment` attribute using the following way:
```Scala
StructType(
StructField(
"cl1",
IntegerType,
nullable = false,
new MetadataBuilder().putString("comment", "test").build()) :: Nil)
```
This PR adds more user-friendly methods for the `comment` attribute when defining a `StructField`. After the changes, users are provided three different ways to do it:
```Scala
val struct = (new StructType)
.add("a", "int", true, "test1")
val struct = (new StructType)
.add("c", StringType, true, "test3")
val struct = (new StructType)
.add(StructField("d", StringType).withComment("test4"))
```
#### How was this patch tested?
Added test cases:
- `DataTypeSuite` is for testing three types of API changes,
- `DataFrameReaderWriterSuite` is for parquet, json and csv formats - using in-memory catalog
- `OrcQuerySuite.scala` is for orc format using Hive-metastore
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13860 from gatorsmile/newMethodForComment.
## What changes were proposed in this pull request?
Change the return type mentioned in the JavaDoc for `toJavaRDD` / `javaRDD` to match the actual return type & be consistent with the scala rdd return type.
## How was this patch tested?
Docs only change.
Author: Holden Karau <holden@us.ibm.com>
Closes#13954 from holdenk/trivial-streaming-tojavardd-doc-fix.
## What changes were proposed in this pull request?
Fixes a couple of old references to `DataFrameWriter.startStream`, pointing them at `DataStreamWriter.start()` instead.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#13952 from brkyvz/minor-doc-fix.
#### What changes were proposed in this pull request?
koertkuipers identified that PR https://github.com/apache/spark/pull/13727/ changed the behavior of the `load` API. After the change, the `load` API does not add the value of `path` into the `options`. Thank you!
This PR adds the option `path` back to the `load()` API in `DataFrameReader`, if and only if users specify one and only one `path` in the `load` API. For example, users can see the `path` option after the following API call,
```Scala
spark.read
.format("parquet")
.load("/test")
```
#### How was this patch tested?
Added test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13933 from gatorsmile/optionPath.
## What changes were proposed in this pull request?
Allowing truncation to a specific number of characters is convenient at times, especially while operating from the REPL. Sometimes those last few characters make all the difference, and showing everything brings in a whole lot of noise.
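A sketch of the cell-truncation behavior (the exact cut-off rule here is an assumption; the real `show` implementation may differ at the margins), plus a hypothetical call with the new numeric `truncate` argument:

```scala
// Assumed truncation rule: cut the cell to `len` characters total, spending
// the last three on an ellipsis when the cell is longer than `len`.
def truncateCell(cell: String, len: Int): String =
  if (cell.length > len) cell.take(len - 3) + "..." else cell

// Hypothetical usage of the new API: df.show(20, truncate = 10)
println(truncateCell("a very long value", 10))  // a very ...
println(truncateCell("short", 10))              // short
```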
## How was this patch tested?
Existing tests. + 1 new test in DataFrameSuite.
For SparkR and pyspark, existing tests and manual testing.
Author: Prashant Sharma <prashsh1@in.ibm.com>
Author: Prashant Sharma <prashant@apache.org>
Closes#13839 from ScrapCodes/add_truncateTo_DF.show.
#### What changes were proposed in this pull request?
The API description of `createRelation` in `CreatableRelationProvider` is misleading. The current description only expects users to return the relation.
```Scala
trait CreatableRelationProvider {
def createRelation(
sqlContext: SQLContext,
mode: SaveMode,
parameters: Map[String, String],
data: DataFrame): BaseRelation
}
```
However, the major goal of this API should also include saving the `DataFrame`.
Since this API is critical for Data Source API developers, this PR is to correct the description.
#### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13903 from gatorsmile/readUnderscoreFiles.
## What changes were proposed in this pull request?
[SPARK-8118](https://github.com/apache/spark/pull/8196) implements redirecting the Parquet JUL logger via SLF4J, but it is currently applied only when READ operations occur. If users perform only WRITE operations, many Parquet logs appear.
This PR makes the redirection work on WRITE operations, too.
**Before**
```scala
scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Jun 26, 2016 9:04:38 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
............ about 70 lines Parquet Log .............
scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
............ about 70 lines Parquet Log .............
```
**After**
```scala
scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
```
This PR also fixes some typos.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13918 from dongjoon-hyun/SPARK-16221.
## What changes were proposed in this pull request?
Spark currently shows all functions when a `SHOW FUNCTIONS` command is issued. This PR refines the `SHOW FUNCTIONS` command by allowing users to select all functions, user-defined functions or system functions. The following syntax can be used:
**ALL** (default)
```SHOW FUNCTIONS```
```SHOW ALL FUNCTIONS```
**SYSTEM**
```SHOW SYSTEM FUNCTIONS```
**USER**
```SHOW USER FUNCTIONS```
## How was this patch tested?
Updated tests and added tests to the DDLSuite
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#13929 from hvanhovell/SPARK-16220.
## What changes were proposed in this pull request?
Add `conf` method to get Runtime Config from SparkSession
## How was this patch tested?
unit tests, manual tests
This is how it works in the SparkR shell:
```
SparkSession available as 'spark'.
> conf()
$hive.metastore.warehouse.dir
[1] "file:/opt/spark-2.0.0-bin-hadoop2.6/R/spark-warehouse"
$spark.app.id
[1] "local-1466749575523"
$spark.app.name
[1] "SparkR"
$spark.driver.host
[1] "10.0.2.1"
$spark.driver.port
[1] "45629"
$spark.executorEnv.LD_LIBRARY_PATH
[1] "$LD_LIBRARY_PATH:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/jre/lib/amd64/server"
$spark.executor.id
[1] "driver"
$spark.home
[1] "/opt/spark-2.0.0-bin-hadoop2.6"
$spark.master
[1] "local[*]"
$spark.sql.catalogImplementation
[1] "hive"
$spark.submit.deployMode
[1] "client"
> conf("spark.master")
$spark.master
[1] "local[*]"
```
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13885 from felixcheung/rconf.
## What changes were proposed in this pull request?
Currently the initial buffer size in the sorter is hard-coded and is too small for large workloads. As a result, the sorter spends significant time expanding the buffer and copying the data. It would be useful to have it configurable.
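Why a small hard-coded initial size hurts: growing a buffer by doubling from the initial size to the needed size copies the array on each step. A small arithmetic sketch (the sizes below are illustrative; the actual config key is not named in this description):

```scala
// Count how many doubling-and-copy steps it takes to grow a buffer from
// `initial` bytes to at least `needed` bytes.
def growthSteps(initial: Long, needed: Long): Int = {
  var size = initial
  var steps = 0
  while (size < needed) {
    size *= 2
    steps += 1
  }
  steps
}

println(growthSteps(4096, 64L * 1024 * 1024))              // 14 copies from a tiny default
println(growthSteps(4L * 1024 * 1024, 64L * 1024 * 1024))  // 4 copies when configured larger
```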
## How was this patch tested?
Tested by running a job on the cluster.
Author: Sital Kedia <skedia@fb.com>
Closes#13699 from sitalkedia/config_sort_buffer_upstream.
## What changes were proposed in this pull request?
One of the most frequent usage patterns for Spark SQL is using **cached tables**. This PR improves `InMemoryTableScanExec` to handle the `IN` predicate efficiently by pruning partition batches. Of course, the performance improvement varies across queries and datasets. But, for the following simple query, the query duration in the Spark UI goes from 9 seconds to 50~90ms. It's over 100 times faster.
**Before**
```scala
$ bin/spark-shell --driver-memory 6G
scala> val df = spark.range(2000000000)
scala> df.createOrReplaceTempView("t")
scala> spark.catalog.cacheTable("t")
scala> sql("select id from t where id = 1").collect() // About 2 mins
scala> sql("select id from t where id = 1").collect() // less than 90ms
scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
```
**After**
```scala
scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
```
This PR has impacts over 35 queries of TPC-DS if the tables are cached.
Note that this optimization is applied for `IN`. To apply it to `IN` predicates with more than 10 items, the `spark.sql.optimizer.inSetConversionThreshold` option should be increased.
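For example, to keep an `IN` list of up to 100 items eligible, the SQL conf named above can be raised before the query runs (100 is an arbitrary example value):

```scala
// Set the threshold on an active SparkSession (sketch; requires Spark at runtime):
// spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", 100)
```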
## How was this patch tested?
Pass the Jenkins tests (including new testcases).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13887 from dongjoon-hyun/SPARK-16186.
## What changes were proposed in this pull request?
Allow specifying an empty OVER clause in window expressions through the Dataset API.
In SQL, it's allowed to specify an empty OVER clause in a window expression.
```SQL
select area, sum(product) over () as c from windowData
where product > 3 group by area, product
having avg(month) > 0 order by avg(month), product
```
In this case the analytic function `sum` is computed over all the rows of the result set.
Currently this is not allowed through the Dataset API; this PR adds support for it.
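A plain-Scala sketch of what an empty OVER clause means (the Dataset call in the comment is a hypothetical usage): the aggregate's frame is the entire result set, so every row sees the same value.

```scala
// Empty OVER clause semantics: one aggregate over all rows, attached to each row.
def sumOverAll(products: Seq[Int]): Seq[(Int, Int)] = {
  val total = products.sum
  products.map(p => (p, total))
}

// Hypothetical Dataset usage after this PR:
//   df.select($"area", sum($"product").over())
println(sumOverAll(Seq(4, 5, 6)))  // List((4,15), (5,15), (6,15))
```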
## How was this patch tested?
Added a new test in DataframeWindowSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#13897 from dilipbiswal/spark-empty-over.
## What changes were proposed in this pull request?
This PR fixes `DataFrame.describe()` by forcing materialization to make the `Seq` serializable. Currently, `describe()` on a DataFrame throws a `Task not serializable` Spark exception when joining in Scala 2.10.
## How was this patch tested?
Manual. (After building with Scala 2.10, test on `bin/spark-shell` and `bin/pyspark`.)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13900 from dongjoon-hyun/SPARK-16173.
## What changes were proposed in this pull request?
This PR fixes a bug when a Python UDF is used in explode (a generator): `GenerateExec` requires that all the attributes in its expressions be resolvable from its children when it is created, so we should replace the children first, then replace its expressions.
```
>>> df.select(explode(f(*df))).show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/vlad/dev/spark/python/pyspark/sql/dataframe.py", line 286, in show
print(self._jdf.showString(n, truncate))
File "/home/vlad/dev/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/home/vlad/dev/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/vlad/dev/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o52.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
Generate explode(<lambda>(_1#0L)), false, false, [col#15L]
+- Scan ExistingRDD[_1#0L]
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:69)
at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:45)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:177)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:153)
at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:95)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:95)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:95)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:85)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:85)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2557)
at org.apache.spark.sql.Dataset.head(Dataset.scala:1923)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2138)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$13.apply(TreeNode.scala:413)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$13.apply(TreeNode.scala:413)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:412)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:387)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
... 42 more
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#20
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
at org.apache.spark.sql.execution.GenerateExec.<init>(GenerateExec.scala:63)
... 52 more
Caused by: java.lang.RuntimeException: Couldn't find pythonUDF0#20 in [_1#0L]
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
... 67 more
```
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#13883 from davies/udf_in_generate.
## What changes were proposed in this pull request?
This is a small patch to rewrite the predicate filter translation in DataSourceStrategy. The original code used excessive functional constructs (e.g. unzip) and was very difficult to understand.
## How was this patch tested?
Should be covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#13889 from rxin/simplify-predicate-filter.
## What changes were proposed in this pull request?
Replace use of `commons-lang` in favor of `commons-lang3` and forbid the former via scalastyle; remove `NotImplementedException` from `commons-lang` in favor of JDK `UnsupportedOperationException`
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#13843 from srowen/SPARK-16129.
## What changes were proposed in this pull request?
It's weird that `ParserUtils.operationNotAllowed` returns an exception and the caller throws it.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13874 from cloud-fan/style.
## What changes were proposed in this pull request?
This patch fixes an overflow bug in vectorized parquet reader where both off-heap and on-heap variants of `ColumnVector.reserve()` can unfortunately overflow while reserving additional capacity during reads.
## How was this patch tested?
Manual Tests
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13832 from sameeragarwal/negative-array.
## What changes were proposed in this pull request?
Currently, the `readBatches` accumulator of `InMemoryTableScanExec` is updated only when `spark.sql.inMemoryColumnarStorage.partitionPruning` is true. Although this metric is used only for testing purposes, we should keep it correct regardless of SQL options.
## How was this patch tested?
Pass the Jenkins tests (including a new testcase).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13870 from dongjoon-hyun/SPARK-16165.
## What changes were proposed in this pull request?
This calculation of statistics is not trivial anymore; it could be very slow on a large query (for example, TPC-DS Q64 took several minutes to plan).
During the planning of a query, the statistics of any logical plan should not change (even InMemoryRelation), so we should use `lazy val` to cache the statistics.
For InMemoryRelation, the statistics could be updated after materialization, it's only useful when used in another query (before planning), because once we finished the planning, the statistics will not be used anymore.
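A minimal Python sketch of the idea, using `functools.cached_property` as a stand-in for Scala's `lazy val` (the class and field names are illustrative, not Spark's actual planner classes):

```python
from functools import cached_property

class LogicalPlan:
    """Illustrative stand-in for a logical plan node."""
    computations = 0  # counts how often statistics are actually computed

    @cached_property
    def statistics(self):
        # In the real planner this is an expensive tree traversal;
        # caching guarantees it runs at most once per plan instance.
        LogicalPlan.computations += 1
        return {"sizeInBytes": 1024}

plan = LogicalPlan()
plan.statistics  # computed on first access
plan.statistics  # served from the cache thereafter
```

Repeated accesses during planning then cost nothing, which is exactly what makes the lazy/cached form safe given that statistics must not change mid-planning.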
## How was this patch tested?
Tested with TPC-DS Q64; it could be planned in a second after the patch.
Author: Davies Liu <davies@databricks.com>
Closes#13871 from davies/fix_statistics.
## What changes were proposed in this pull request?
When the user uses `ConsoleSink`, we should use a temp location if `checkpointLocation` is not specified.
## How was this patch tested?
The added unit test.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13817 from zsxwing/console-checkpoint.
## What changes were proposed in this pull request?
When a table is created with a column name containing dots, distinct() will fail to run. For example,
```scala
val rowRDD = sparkContext.parallelize(Seq(Row(1), Row(1), Row(2)))
val schema = StructType(Array(StructField("column.with.dot", IntegerType, nullable = false)))
val df = spark.createDataFrame(rowRDD, schema)
```
running the following will have no problem:
```scala
df.select(new Column("`column.with.dot`"))
```
but running the query with additional distinct() will cause exception:
```scala
df.select(new Column("`column.with.dot`")).distinct()
```
The issue is that distinct() will try to resolve the column name, but the column name in the schema does not have backticks around it. So the solution is to add backticks to the column name before passing it to resolve().
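A rough Python sketch of the quoting idea (the helper name is hypothetical, not Spark's API): wrap names containing dots in backticks before resolution so the resolver does not split them into nested-field references.

```python
def quote_if_needed(name: str) -> str:
    # Hypothetical helper: a name like "column.with.dot" would otherwise be
    # interpreted as nested field access; backticks keep it a single column.
    if "." in name and not name.startswith("`"):
        return f"`{name}`"
    return name

print(quote_if_needed("column.with.dot"))  # `column.with.dot`
print(quote_if_needed("plain"))            # plain
```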
## How was this patch tested?
Added a new test case.
Author: bomeng <bmeng@us.ibm.com>
Closes#13140 from bomeng/SPARK-15230.
## What changes were proposed in this pull request?
We embed partitioning logic in FileSourceStrategy.apply, making the function very long. This is a small refactoring to move it into its own functions. Eventually we would be able to move the partitioning functions into a physical operator, rather than doing it in physical planning.
## How was this patch tested?
This is a simple code move.
Author: Reynold Xin <rxin@databricks.com>
Closes#13862 from rxin/SPARK-16159.
## What changes were proposed in this pull request?
Seems the fix of SPARK-14959 breaks the parallel partitioning discovery. This PR fixes the problem
## How was this patch tested?
Tested manually. (This PR also adds a proper test for SPARK-14959)
Author: Yin Huai <yhuai@databricks.com>
Closes#13830 from yhuai/SPARK-16121.
#### What changes were proposed in this pull request?
This PR is to use the latest `SparkSession` to replace the existing `SQLContext` in `MLlib`. `SQLContext` is removed from `MLlib`.
Also fix a test case issue in `BroadcastJoinSuite`.
BTW, `SQLContext` is not being used in the `MLlib` test suites.
#### How was this patch tested?
Existing test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13380 from gatorsmile/sqlContextML.
## What changes were proposed in this pull request?
This PR makes the `CsvWriter` object reusable instead of creating a new one each time. The approach was taken from the JSON data source.
Originally a `CsvWriter` was created for each row, which was improved in https://github.com/apache/spark/pull/13229. However, a `CsvWriter` object is still created for each `flush()` in `LineCsvWriter`. There is no need to close and re-create the object on every flush.
The original logic is kept as-is, but the `CsvWriter` is reused by resetting the `CharArrayWriter`.
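A rough Python analogue of the reuse pattern, using the stdlib `csv` module (purely illustrative — the class name is hypothetical): keep one writer over a resettable buffer instead of constructing a new writer per flush.

```python
import csv
import io

class ReusableCsvWriter:
    """Illustrative sketch: one csv.writer reused across flushes."""

    def __init__(self):
        self._buffer = io.StringIO()
        self._writer = csv.writer(self._buffer)  # created once, never recreated

    def write_rows(self, rows):
        self._writer.writerows(rows)

    def flush(self) -> str:
        # Hand off the buffered content and reset the buffer in place,
        # keeping the writer object alive (analogous to resetting the
        # CharArrayWriter while reusing the CsvWriter).
        out = self._buffer.getvalue()
        self._buffer.seek(0)
        self._buffer.truncate(0)
        return out
```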
## How was this patch tested?
Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13809 from HyukjinKwon/write-perf.
## What changes were proposed in this pull request?
Add a configuration to allow people to set a minimum polling delay when no new data arrives (default is 10ms). This PR also cleans up some INFO logs.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13718 from zsxwing/SPARK-16002.
## What changes were proposed in this pull request?
1. FORMATTED is actually supported, but partitions are not;
2. Remove the parentheses, as they are not necessary, just like everywhere else.
## How was this patch tested?
Minor issue. I do not think it needs a test case!
Author: bomeng <bmeng@us.ibm.com>
Closes#13791 from bomeng/SPARK-16084.
#### What changes were proposed in this pull request?
This PR is to fix the following bugs:
**Issue 1: Wrong Results when lowerBound is larger than upperBound in Column Partitioning**
```scala
spark.read.jdbc(
url = urlWithUserAndPass,
table = "TEST.seq",
columnName = "id",
lowerBound = 4,
upperBound = 0,
numPartitions = 3,
connectionProperties = new Properties)
```
**Before code changes:**
The returned results are wrong and the generated partitions are wrong:
```
Part 0 id < 3 or id is null
Part 1 id >= 3 AND id < 2
Part 2 id >= 2
```
**After code changes:**
Issue an `IllegalArgumentException` exception:
```
Operation not allowed: the lower bound of partitioning column is larger than the upper bound. lowerBound: 5; higherBound: 1
```
**Issue 2: numPartitions is more than the number of key values between upper and lower bounds**
```scala
spark.read.jdbc(
url = urlWithUserAndPass,
table = "TEST.seq",
columnName = "id",
lowerBound = 1,
upperBound = 5,
numPartitions = 10,
connectionProperties = new Properties)
```
**Before code changes:**
Returned correct results but the generated partitions are very inefficient, like:
```
Partition 0: id < 1 or id is null
Partition 1: id >= 1 AND id < 1
Partition 2: id >= 1 AND id < 1
Partition 3: id >= 1 AND id < 1
Partition 4: id >= 1 AND id < 1
Partition 5: id >= 1 AND id < 1
Partition 6: id >= 1 AND id < 1
Partition 7: id >= 1 AND id < 1
Partition 8: id >= 1 AND id < 1
Partition 9: id >= 1
```
**After code changes:**
Adjust `numPartitions` and can return the correct answers:
```
Partition 0: id < 2 or id is null
Partition 1: id >= 2 AND id < 3
Partition 2: id >= 3 AND id < 4
Partition 3: id >= 4
```
**Issue 3: java.lang.ArithmeticException when numPartitions is zero**
```Scala
spark.read.jdbc(
url = urlWithUserAndPass,
table = "TEST.seq",
columnName = "id",
lowerBound = 0,
upperBound = 4,
numPartitions = 0,
connectionProperties = new Properties)
```
**Before code changes:**
Got the following exception:
```
java.lang.ArithmeticException: / by zero
```
**After code changes:**
Able to return a correct answer by disabling column partitioning when numPartitions is equal to or less than zero
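The three fixes above can be sketched together in Python (an illustrative rendering of the partitioning logic described here, not Spark's actual JDBC code; the function name is hypothetical): reject inverted bounds, clamp the partition count so no empty ranges are generated, and fall back to a single unpartitioned scan when `numPartitions <= 0`.

```python
def jdbc_partition_clauses(column, lower, upper, num_partitions):
    """Build WHERE-clause predicates for column partitioning (sketch)."""
    if lower > upper:
        raise ValueError(
            "Operation not allowed: the lower bound of partitioning column "
            f"is larger than the upper bound. lowerBound: {lower}; "
            f"higherBound: {upper}")
    # Issue 3: disable column partitioning entirely for <= 1 partitions.
    if num_partitions <= 1 or upper == lower:
        return [None]  # one partition, no pushed-down predicate
    # Issue 2: clamp so we never emit empty ranges like `id >= 1 AND id < 1`.
    n = min(num_partitions, upper - lower)
    stride = (upper - lower) // n
    clauses = []
    bound = lower + stride
    clauses.append(f"{column} < {bound} or {column} is null")
    for _ in range(n - 2):
        clauses.append(f"{column} >= {bound} AND {column} < {bound + stride}")
        bound += stride
    clauses.append(f"{column} >= {bound}")
    return clauses
```

For the Issue 2 example (`lowerBound = 1`, `upperBound = 5`, `numPartitions = 10`), this yields the four efficient partitions shown above.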
#### How was this patch tested?
Added test cases to verify the results
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13773 from gatorsmile/jdbcPartitioning.
## What changes were proposed in this pull request?
This pull request adds a new option (maxMalformedLogPerPartition) in the CSV reader to limit the maximum number of log messages Spark generates per partition for malformed records.
The error log looks something like
```
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged.
```
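The capping behavior can be sketched in Python (illustrative only — the helper is hypothetical, though the option name and message wording follow the description above): warn for the first N malformed lines, emit one suppression notice, then stay silent for the rest of the partition.

```python
import logging

def parse_partition(lines, parse, max_malformed_log=10,
                    log=logging.getLogger("CSVRelation")):
    """Parse a partition, capping malformed-record warnings (sketch)."""
    malformed = 0
    rows = []
    for line in lines:
        try:
            rows.append(parse(line))
        except ValueError:
            malformed += 1
            if malformed <= max_malformed_log:
                log.warning("Dropping malformed line: %s", line)
            elif malformed == max_malformed_log + 1:
                # Logged exactly once, then further malformed lines are silent.
                log.warning(
                    "More than %d malformed records have been found on this "
                    "partition. Malformed records from now on will not be "
                    "logged.", max_malformed_log)
    return rows
```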
Closes#12173
## How was this patch tested?
Manually tested.
Author: Reynold Xin <rxin@databricks.com>
Closes#13795 from rxin/SPARK-13792.
## What changes were proposed in this pull request?
The property spark.streaming.stateStore.maintenanceInterval should be renamed and harmonized with other properties related to Structured Streaming like spark.sql.streaming.stateStore.minDeltasForSnapshot.
## How was this patch tested?
Existing unit tests.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#13777 from sarutak/SPARK-16061.
## What changes were proposed in this pull request?
Issues with current reader behavior.
- `text()` without args returns an empty DF with no columns -> inconsistent; it's expected that `text()` will always return a DF with a `value` string field,
- `textFile()` without args fails with exception because of the above reason, it expected the DF returned by `text()` to have a `value` field.
- `orc()` does not have varargs, which is inconsistent with the others
- `json(single-arg)` was removed, but that caused source compatibility issues - [SPARK-16009](https://issues.apache.org/jira/browse/SPARK-16009)
- user specified schema was not respected when `text/csv/...` were used with no args - [SPARK-16007](https://issues.apache.org/jira/browse/SPARK-16007)
The solution I am implementing is to do the following.
- For each format, there will be a single argument method, and a vararg method. For json, parquet, csv, text, this means adding json(string), etc.. For orc, this means adding orc(varargs).
- Remove the special handling of text(), csv(), etc. that returns an empty dataframe with no fields. Rather, pass on the empty sequence of paths to the data source, and let each data source handle it correctly. E.g., the text data source should return an empty DF with schema (value: string)
- Deduped docs and fixed their formatting.
## How was this patch tested?
Added new unit tests for Scala and Java tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13727 from tdas/SPARK-15982.
## What changes were proposed in this pull request?
This PR adds the static partition support to INSERT statement when the target table is a data source table.
## How was this patch tested?
New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite.
**Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change.**
Author: Yin Huai <yhuai@databricks.com>
Closes#13769 from yhuai/SPARK-16030-1.
## What changes were proposed in this pull request?
This patch adds a text-based socket source similar to the one in Spark Streaming for debugging and tutorials. The source is clearly marked as debug-only so that users don't try to run it in production applications, because this type of source cannot provide HA without storing a lot of state in Spark.
## How was this patch tested?
Unit tests and manual tests in spark-shell.
Author: Matei Zaharia <matei@databricks.com>
Closes#13748 from mateiz/socket-source.
## What changes were proposed in this pull request?
`DataFrameWriter` can be used to append data to existing data source tables. It becomes tricky when partition columns used in `DataFrameWriter.partitionBy(columns)` don't match the actual partition columns of the underlying table. This pull request enforces the check so that the partition columns of these two always match.
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13749 from clockfly/SPARK-16034.
## What changes were proposed in this pull request?
The current table insertion has some weird behaviours:
1. inserting into a partitioned table with mismatch columns has confusing error message for hive table, and wrong result for datasource table
2. inserting into a partitioned table without partition list has wrong result for hive table.
This PR fixes these 2 problems.
## How was this patch tested?
new test in hive `SQLQuerySuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13754 from cloud-fan/insert2.
## What changes were proposed in this pull request?
Improve readability of `InMemoryTableScanExec.scala`, which has too much stuff in it.
## How was this patch tested?
Jenkins
Author: Andrew Or <andrew@databricks.com>
Closes#13742 from andrewor14/move-inmemory-relation.
## What changes were proposed in this pull request?
We cannot use `limit` on DataFrame in ConsoleSink because it will use a wrong planner. This PR just collects `DataFrame` and calls `show` on a batch DataFrame based on the result. This is fine since ConsoleSink is only for debugging.
## How was this patch tested?
Manually confirmed ConsoleSink now works with complete mode aggregation.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13740 from zsxwing/complete-console.
## What changes were proposed in this pull request?
This PR introduces the new SparkSession API for SparkR.
`sparkR.session.getOrCreate()` and `sparkR.session.stop()`
"getOrCreate" is a bit unusual in R but it's important to name this clearly.
SparkR implementation should
- SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR)
- SparkSession replaces SQLContext and HiveContext (both a wrapper around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
- Changes to SparkSession are mostly transparent to users due to SPARK-10903
- Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a deprecation warning
- Mostly cosmetic changes to parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
- An advanced syntax with named parameters (aka varargs aka "...") is supported; that should be closer to the Builder syntax that is in Scala/Python (which unfortunately does not work in R because it will look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next, "value"))`
- Updating config on an existing SparkSession is supported, the behavior is the same as Python, in which config is applied to both SparkContext and SparkSession
- Some SparkSession changes are not matched in SparkR, mostly because it would be breaking API change: `catalog` object, `createOrReplaceTempView`
- Other SQLContext workarounds are replicated in SparkR, eg. `tables`, `tableNames`
- `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scala/Python)
- All tests are updated to use the SparkSession entrypoint
- A bug in `read.jdbc` is fixed
TODO
- [x] Add more tests
- [ ] Separate PR - update all roxygen2 doc coding example
- [ ] Separate PR - update SparkR programming guide
## How was this patch tested?
unit tests, manual tests
shivaram sun-rui rxin
Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#13635 from felixcheung/rsparksession.
## What changes were proposed in this pull request?
When inserting into an existing partitioned table, partitioning columns should always be determined by catalog metadata of the existing table to be inserted. Extra `partitionBy()` calls don't make sense, and mess up existing data because newly inserted data may have wrong partitioning directory layout.
## How was this patch tested?
New test case added in `InsertIntoHiveTableSuite`.
Author: Cheng Lian <lian@databricks.com>
Closes#13747 from liancheng/spark-16033-insert-into-without-partition-by.
## What changes were proposed in this pull request?
This PR fixes the problem that the precedence order is messed when pushing where-clause expression to JDBC layer.
**Case 1:**
For sql `select * from table where (a or b) and c`, the where-clause is wrongly converted to JDBC where-clause `a or (b and c)` after filter push down. The consequence is that JDBC may returns less or more rows than expected.
**Case 2:**
For sql `select * from table where always_false_condition`, the result table may not be empty if the JDBC RDD is partitioned using where-clause:
```
spark.read.jdbc(url, table, predicates = Array("partition 1 where clause", "partition 2 where clause"...)
```
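The precedence problem in Case 1 can be sketched in Python (an illustrative rendering of the fix's idea, not Spark's actual `JDBCRDD` code): when compiling a filter tree to a JDBC WHERE string, always parenthesize compound predicates so `(a OR b) AND c` is not flattened into `a OR (b AND c)`.

```python
def compile_filter(f) -> str:
    """Render a filter tree (nested tuples) to SQL, keeping precedence."""
    op = f[0]
    if op == "and":
        return f"({compile_filter(f[1])}) AND ({compile_filter(f[2])})"
    if op == "or":
        return f"({compile_filter(f[1])}) OR ({compile_filter(f[2])})"
    column, cmp, value = f  # leaf comparison, e.g. ("a", "=", 1)
    return f"{column} {cmp} {value}"

# (a = 1 OR b = 2) AND c = 3 keeps its grouping after compilation:
where = compile_filter(("and", ("or", ("a", "=", 1), ("b", "=", 2)),
                        ("c", "=", 3)))
print(where)  # ((a = 1) OR (b = 2)) AND (c = 3)
```

Without the parentheses, a database evaluating the generated string under SQL's native precedence (AND binds tighter than OR) would return different rows than Spark expects.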
## How was this patch tested?
Unit test.
This PR also close#13640
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13743 from clockfly/SPARK-15916.
## What changes were proposed in this pull request?
Iterator can't be serialized in Scala 2.10; we should force it into an array to make sure it can be serialized.
## How was this patch tested?
Build with Scala 2.10 and ran all the Python unit tests manually (will be covered by a jenkins build).
Author: Davies Liu <davies@databricks.com>
Closes#13717 from davies/fix_udf_210.
## What changes were proposed in this pull request?
`UTF8String` and all `Unsafe*` classes are backed by either on-heap or off-heap byte arrays. The code generated version `SortMergeJoin` buffers the left hand side join keys during iteration. This was actually problematic in off-heap mode when one of the keys is a `UTF8String` (or any other `Unsafe*` object) and the left hand side iterator was exhausted (and released its memory); the buffered keys would reference freed memory. This causes seg-faults and all kinds of other undefined behavior when we would use one of these buffered keys.
This PR fixes this problem by creating copies of the buffered variables. I have added a general method to the `CodeGenerator` for this. I have checked all places in which this could happen, and only `SortMergeJoin` had this problem.
This PR is largely based on the work of robbinspg and he should be credited for this.
closes https://github.com/apache/spark/pull/13707
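The aliasing bug has a simple Python analogue (illustrative only — here a plain list stands in for an `UnsafeRow` backed by reusable, possibly off-heap memory): buffering a *reference* into a reused mutable buffer means every buffered key ends up pointing at whatever the buffer holds last, while copying pins each value.

```python
def buffer_keys(values, copy=True):
    """Buffer join keys from a single reused row buffer (sketch)."""
    buffered = []
    row = []  # the one reused buffer, overwritten on every iteration
    for value in values:
        row.clear()
        row.append(value)  # overwrite the shared buffer in place
        # The fix: snapshot the buffer's contents instead of aliasing it.
        buffered.append(list(row) if copy else row)
    return buffered

print(buffer_keys([1, 2, 3]))              # [[1], [2], [3]]  -- copies, correct
print(buffer_keys([1, 2, 3], copy=False))  # [[3], [3], [3]]  -- aliases, stale
```

In the off-heap case the consequence is worse than stale values: the aliased memory may already be freed, hence the seg-faults.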
## How was this patch tested?
Manually tested on problematic workloads.
Author: Pete Robbins <robbinspg@gmail.com>
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#13723 from hvanhovell/SPARK-15822-2.
## What changes were proposed in this pull request?
Before this patch, after a SparkSession has been created, hadoop conf set directly to SparkContext.hadoopConfiguration will not affect the hadoop conf created by SessionState. This patch makes the change to always use SparkContext.hadoopConfiguration as the base.
This patch also changes the behavior of hive-site.xml support added in https://github.com/apache/spark/pull/12689/. With this patch, we will load hive-site.xml to SparkContext.hadoopConfiguration.
## How was this patch tested?
New test in SparkSessionBuilderSuite.
Author: Yin Huai <yhuai@databricks.com>
Closes#13711 from yhuai/SPARK-15991.
## What changes were proposed in this pull request?
For table test1 (C1 varchar (10), C2 varchar (10)), when I insert a row using
```
sqlContext.sql("insert into test1 values ('abc', 'def', 1)")
```
I got error message
```
Exception in thread "main" java.lang.RuntimeException: RelationC1#0,C2#1 JDBCRelation(test1)
requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement
generates the same number of columns as its schema.
```
The error message is a little confusing. In my simple insert statement, it doesn't have a SELECT clause.
I will change the error message to a more general one
```
Exception in thread "main" java.lang.RuntimeException: RelationC1#0,C2#1 JDBCRelation(test1)
requires that the data to be inserted have the same number of columns as the target table.
```
## How was this patch tested?
I tested the patch using my simple unit test, but it's a very trivial change and I don't think I need to check in any test.
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#13492 from huaxingao/spark-15749.
## What changes were proposed in this pull request?
This PR contains a few changes on code comments.
- `HiveTypeCoercion` is renamed into `TypeCoercion`.
- `NoSuchDatabaseException` is only used for the absence of database.
- For partition type inference, only `DoubleType` is considered.
## How was this patch tested?
N/A
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13674 from dongjoon-hyun/minor_doc_types.
## What changes were proposed in this pull request?
This PR fixes some minor `.toString` format issues for `HashAggregateExec`.
Before:
```
*HashAggregate(key=[a#234L,b#235L], functions=[count(1),max(c#236L)], output=[a#234L,b#235L,count(c)#247L,max(c)#248L])
```
After:
```
*HashAggregate(keys=[a#234L, b#235L], functions=[count(1), max(c#236L)], output=[a#234L, b#235L, count(c)#247L, max(c)#248L])
```
## How was this patch tested?
Manually tested.
Author: Cheng Lian <lian@databricks.com>
Closes#13710 from liancheng/minor-agg-string-fix.
## What changes were proposed in this pull request?
`TRUNCATE TABLE` is currently broken for Spark specific datasource tables (json, csv, ...). This PR correctly sets the location for these datasources which allows them to be truncated.
## How was this patch tested?
Extended the datasources `TRUNCATE TABLE` tests in `DDLSuite`.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#13697 from hvanhovell/SPARK-15977.
## What changes were proposed in this pull request?
Interface method `FileFormat.prepareRead()` was added in #12088 to handle a special case in the LibSVM data source.
However, the semantics of this interface method isn't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be easily handled using schema metadata inside `inferSchema`, we can remove this interface method to keep the `FileFormat` interface clean.
## How was this patch tested?
Existing tests.
Author: Cheng Lian <lian@databricks.com>
Closes#13698 from liancheng/remove-prepare-read.
#### What changes were proposed in this pull request?
~~If the temp table already exists, we should not silently replace it when doing `CACHE TABLE AS SELECT`. This is inconsistent with the behavior of `CREATE VIEW` or `CREATE TABLE`. This PR is to fix this silent drop.~~
~~Maybe, we also can introduce new syntax for replacing the existing one. For example, in Hive, to replace a view, the syntax should be like `ALTER VIEW AS SELECT` or `CREATE OR REPLACE VIEW AS SELECT`~~
The table name in `CACHE TABLE AS SELECT` should NOT contain database prefix like "database.table". Thus, this PR captures this in Parser and outputs a better error message, instead of reporting the view already exists.
In addition, this PR refactors the `Parser` to generate table identifiers instead of returning the table name as a string.
#### How was this patch tested?
- Added a test case for caching and uncaching qualified table names
- Fixed a few test cases that do not drop temp table at the end
- Added the related test case for the issue resolved in this PR
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13572 from gatorsmile/cacheTableAsSelect.
## What changes were proposed in this pull request?
gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.
Please let me know what you think and whether you have any ideas to improve it.
Thank you!
## How was this patch tested?
Unit tests.
1. Primitive test with different column types
2. Add a boolean column
3. Compute average by a group
Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Author: NarineK <narine.kokhlikyan@us.ibm.com>
Closes#12836 from NarineK/gapply2.
## What changes were proposed in this pull request?
We currently immediately execute `INSERT` commands when they are issued. This is not the case as soon as we use a `WITH` to define common table expressions, for example:
```sql
WITH
tbl AS (SELECT * FROM x WHERE id = 10)
INSERT INTO y
SELECT *
FROM tbl
```
This PR fixes this problem. This PR closes https://github.com/apache/spark/pull/13561 (which fixes an instance of this problem in the ThriftServer).
## How was this patch tested?
Added a test to `InsertSuite`
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#13678 from hvanhovell/SPARK-15824.
## What changes were proposed in this pull request?
This patch brings https://github.com/apache/spark/pull/11373 up-to-date and increments the record count for JDBC data source.
Closes#11373.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13694 from rxin/SPARK-13498.
## What changes were proposed in this pull request?
This patch renames various Parquet support classes from CatalystAbc to ParquetAbc. This new naming makes more sense for two reasons:
1. These are not optimizer related (i.e. Catalyst) classes.
2. We are in the Spark code base, and as a result it'd be more clear to call out these are Parquet support classes, rather than some Spark classes.
## How was this patch tested?
Renamed test cases as well.
Author: Reynold Xin <rxin@databricks.com>
Closes#13696 from rxin/parquet-rename.
## What changes were proposed in this pull request?
Add missing SQLExecution.withNewExecutionId for hiveResultString so that queries running in `spark-sql` will be shown in Web UI.
Closes#13115
## How was this patch tested?
Existing unit tests.
Author: KaiXinXiaoLei <huleilei1@huawei.com>
Closes#13689 from zsxwing/pr13115.
## What changes were proposed in this pull request?
After we moved the ExtractPythonUDF rule into the physical plan, Python UDFs can't work on top of an aggregate anymore, because they can't be evaluated before the aggregate; they should be evaluated after it. This PR adds another rule to extract this kind of Python UDF from the logical aggregate and create a Project on top of the Aggregate.
## How was this patch tested?
Added regression tests. The plan of added test query looks like this:
```
== Parsed Logical Plan ==
'Project [<lambda>('k, 's) AS t#26]
+- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L]
+- LogicalRDD [key#5L, value#6]
== Analyzed Logical Plan ==
t: int
Project [<lambda>(k#17, s#22L) AS t#26]
+- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L]
+- LogicalRDD [key#5L, value#6]
== Optimized Logical Plan ==
Project [<lambda>(agg#29, agg#30L) AS t#26]
+- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS agg#29, sum(cast(<lambda>(value#6) as bigint)) AS agg#30L]
+- LogicalRDD [key#5L, value#6]
== Physical Plan ==
*Project [pythonUDF0#37 AS t#26]
+- BatchEvalPython [<lambda>(agg#29, agg#30L)], [agg#29, agg#30L, pythonUDF0#37]
+- *HashAggregate(key=[<lambda>(key#5L)#31], functions=[sum(cast(<lambda>(value#6) as bigint))], output=[agg#29,agg#30L])
+- Exchange hashpartitioning(<lambda>(key#5L)#31, 200)
+- *HashAggregate(key=[pythonUDF0#34 AS <lambda>(key#5L)#31], functions=[partial_sum(cast(pythonUDF1#35 as bigint))], output=[<lambda>(key#5L)#31,sum#33L])
+- BatchEvalPython [<lambda>(key#5L), <lambda>(value#6)], [key#5L, value#6, pythonUDF0#34, pythonUDF1#35]
+- Scan ExistingRDD[key#5L,value#6]
```
Author: Davies Liu <davies@databricks.com>
Closes#13682 from davies/fix_py_udf.
## What changes were proposed in this pull request?
This PR adds the support of conf `hive.metastore.warehouse.dir` back. With this patch, the way of setting the warehouse dir is described as follows:
* If `spark.sql.warehouse.dir` is set, `hive.metastore.warehouse.dir` will be automatically set to the value of `spark.sql.warehouse.dir`. The warehouse dir is effectively set to the value of `spark.sql.warehouse.dir`.
* If `spark.sql.warehouse.dir` is not set but `hive.metastore.warehouse.dir` is set, `spark.sql.warehouse.dir` will be automatically set to the value of `hive.metastore.warehouse.dir`. The warehouse dir is effectively set to the value of `hive.metastore.warehouse.dir`.
* If neither `spark.sql.warehouse.dir` nor `hive.metastore.warehouse.dir` is set, `hive.metastore.warehouse.dir` will be automatically set to the default value of `spark.sql.warehouse.dir`. The warehouse dir is effectively set to the default value of `spark.sql.warehouse.dir`.
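The precedence rules above can be sketched in Python (a hypothetical helper, not Spark's API — the default value stands in for `spark.sql.warehouse.dir`'s default): `spark.sql.warehouse.dir` wins, then `hive.metastore.warehouse.dir`, then the default, and both keys are rewritten to the winning value.

```python
def resolve_warehouse_dir(conf, default="spark-warehouse"):
    """Resolve the effective warehouse dir per the precedence above (sketch)."""
    spark_dir = conf.get("spark.sql.warehouse.dir")
    hive_dir = conf.get("hive.metastore.warehouse.dir")
    warehouse = spark_dir or hive_dir or default
    # Keep both settings in sync with whichever value won.
    conf["spark.sql.warehouse.dir"] = warehouse
    conf["hive.metastore.warehouse.dir"] = warehouse
    return warehouse
```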
## How was this patch tested?
`set hive.metastore.warehouse.dir` in `HiveSparkSubmitSuite`.
JIRA: https://issues.apache.org/jira/browse/SPARK-15959
Author: Yin Huai <yhuai@databricks.com>
Closes#13679 from yhuai/hiveWarehouseDir.
Renamed for simplicity, so that it's obvious that it's related to streaming.
Existing unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13673 from tdas/SPARK-15953.
## What changes were proposed in this pull request?
Since we are probably going to add more statistics related configurations in the future, I'd like to rename the newly added `spark.sql.enableFallBackToHdfsForStats` configuration option to `spark.sql.statistics.fallBackToHdfs`. This allows us to put all statistics related configurations in the same namespace.
## How was this patch tested?
None - just a usability thing
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#13681 from hvanhovell/SPARK-15960.
## What changes were proposed in this pull request?
Two issues I've found with the "show databases" command:
1. The returned database name list was not sorted; it was only sorted when `LIKE` was used (Hive always returns a sorted list).
2. When used as sql("show databases").show, it outputs a table with a column named "result", but sql("show tables").show names its column "tableName", so for consistency we should at least use "databaseName".
## How was this patch tested?
Updated existing test case to test its ordering as well.
Author: bomeng <bmeng@us.ibm.com>
Closes#13671 from bomeng/SPARK-15952.
## What changes were proposed in this pull request?
Currently, the DataFrameReader/Writer has methods that are needed for both streaming and non-streaming DFs. This is quite awkward because each method throws a runtime exception for one case or the other. Rather than having half the methods throw runtime exceptions, it's better to have a separate reader/writer API for streams.
- [x] Python API!!
## How was this patch tested?
Existing unit tests + two sets of unit tests for DataFrameReader/Writer and DataStreamReader/Writer.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13653 from tdas/SPARK-15933.
## What changes were proposed in this pull request?
This pr sets the default number of partitions when reading parquet schemas.
SQLContext#read#parquet currently yields at least n_executors * n_cores tasks even if the parquet data consists of a single small file. This issue could increase the latency of small jobs.
## How was this patch tested?
Manually tested and checked.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#13137 from maropu/SPARK-15247.
## What changes were proposed in this pull request?
Take the following directory layout as an example:
```
dir/
+- p0=0/
|-_metadata
+- p1=0/
|-part-00001.parquet
|-part-00002.parquet
|-...
```
The `_metadata` file under `p0=0` shouldn't fail partition discovery.
This PR filters out all metadata files whose names start with `_` while doing partition discovery.
## How was this patch tested?
New unit test added in `ParquetPartitionDiscoverySuite`.
Author: Cheng Lian <lian@databricks.com>
Closes#13623 from liancheng/spark-15895-partition-disco-no-metafiles.
#### What changes were proposed in this pull request?
To uncache a table, we have three different ways:
- _SQL interface_: `UNCACHE TABLE`
- _DataSet API_: `sparkSession.catalog.uncacheTable`
- _DataSet API_: `sparkSession.table(tableName).unpersist()`
When the table is not cached,
- _SQL interface_: `UNCACHE TABLE non-cachedTable` -> **no error message**
- _Dataset API_: `sparkSession.catalog.uncacheTable("non-cachedTable")` -> **report a strange error message:**
```requirement failed: Table [a: int] is not cached```
- _Dataset API_: `sparkSession.table("non-cachedTable").unpersist()` -> **no error message**
This PR will make them consistent. No operation if the table has already been uncached.
In addition, this PR also removes `uncacheQuery`, renames `tryUncacheQuery` to `uncacheQuery`, and documents that it is a no-op if the table has already been uncached.
#### How was this patch tested?
Improved the existing test case for verifying the cases when the table has not been cached.
Also added test cases for verifying the cases when the table does not exist
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13593 from gatorsmile/uncacheNonCachedTable.
## What changes were proposed in this pull request?
A `DataFrame` whose plan overrides `sameResult` but does not compare canonicalized plans cannot be cached with `cacheTable`.
For example:
```
val localRelation = Seq(1, 2, 3).toDF()
localRelation.createOrReplaceTempView("localRelation")
spark.catalog.cacheTable("localRelation")
assert(
localRelation.queryExecution.withCachedData.collect {
case i: InMemoryRelation => i
}.size == 1)
```
and this will fail as:
```
ArrayBuffer() had size 0 instead of expected size 1
```
The reason is that when `spark.catalog.cacheTable("localRelation")` is called, `CacheManager` caches the plan wrapped by `SubqueryAlias`, but when planning the DataFrame `localRelation`, `CacheManager` looks up the cached table for the unwrapped plan, because the plan for the DataFrame is not wrapped.
Some plans like `LocalRelation`, `LogicalRDD`, etc. override the `sameResult` method but do not use the canonicalized plan for comparison, so `CacheManager` can't detect that the plans are the same.
This PR modifies them to use the canonicalized plan when overriding `sameResult`.
## How was this patch tested?
Added a test to check if DataFrame with plan overriding sameResult but not using canonicalized plan to compare can cacheTable.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#13638 from ueshin/issues/SPARK-15915.
## What changes were proposed in this pull request?
Another PR to clean up recent build warnings. This particularly cleans up several instances of the old accumulator API usage in tests that are straightforward to update. I think this qualifies as "minor".
## How was this patch tested?
Jenkins
Author: Sean Owen <sowen@cloudera.com>
Closes#13642 from srowen/BuildWarnings.
## What changes were proposed in this pull request?
Revert partial changes in SPARK-12600, and add some deprecated method back to SQLContext for backward source code compatibility.
## How was this patch tested?
Manual test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13637 from clockfly/SPARK-15914.
#### What changes were proposed in this pull request?
**Issue:** Got wrong results or strange errors when append data to a table with mismatched file format.
_Example 1: Parquet -> ORC_
```Scala
createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc")
createDF(10, 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc")
```
Error we got:
```
Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.RuntimeException: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-00000-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [79, 82, 67, 23]
```
_Example 2: Json -> Parquet_
```Scala
createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV")
createDF(10, 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV")
```
No exception, but wrong results:
```
+----+----+
| c1| c2|
+----+----+
|null|null|
|null|null|
|null|null|
|null|null|
| 0|str0|
| 1|str1|
| 2|str2|
| 3|str3|
| 4|str4|
| 5|str5|
| 6|str6|
| 7|str7|
| 8|str8|
| 9|str9|
+----+----+
```
_Example 3: Json -> Text_
```Scala
createDF(0, 9).write.format("json").saveAsTable("appendJsonToText")
createDF(10, 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText")
```
Error we got:
```
Text data source supports only a single column, and you have 2 columns.
```
This PR is to issue an exception with appropriate error messages.
#### How was this patch tested?
Added test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13546 from gatorsmile/fileFormatCheck.
## What changes were proposed in this pull request?
Right now, Spark 2.0 does not load hive-site.xml. Based on users' feedback, it seems to make sense to still load this conf file.
This PR adds a `hadoopConf` API in `SharedState`, which is `sparkContext.hadoopConfiguration` by default. When users are under hive context, `SharedState.hadoopConf` will load hive-site.xml and append its configs to `sparkContext.hadoopConfiguration`.
When we need to read hadoop config in spark sql, we should call `SessionState.newHadoopConf`, which contains `sparkContext.hadoopConfiguration`, hive-site.xml and sql configs.
## How was this patch tested?
new test in `HiveDataFrameSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13611 from cloud-fan/hive-site.
## What changes were proposed in this pull request?
ContinuousQueries have names that are unique across all the active ones. However, when queries are rapidly restarted with the same name, this causes race conditions with the listener. A listener event from a stopped query can arrive after the query has been restarted, leading to complexities in monitoring infrastructure.
Along with this change, I have also consolidated all the messy code paths to start queries with different sinks.
## How was this patch tested?
Added unit tests, and existing unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13613 from tdas/SPARK-15889.
## What changes were proposed in this pull request?
This pr sets the parallelism explicitly to prevent file listing in `listLeafFilesInParallel` from generating too many tasks when `defaultParallelism` is large.
## How was this patch tested?
Manually checked
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#13444 from maropu/SPARK-15530.
#### What changes were proposed in this pull request?
When creating a Hive Table (not data source tables), a common error users might make is to specify an existing column name as a partition column. Below is what Hive returns in this case:
```
hive> CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (data string, part string);
FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns
```
Currently, the error we issued is very confusing:
```
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.);
```
This PR is to fix the above issue by capturing the usage error in `Parser`.
#### How was this patch tested?
Added a test case to `DDLCommandSuite`
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13415 from gatorsmile/partitionColumnsInTableSchema.
## What changes were proposed in this pull request?
This patch does some replacing (as `streaming Datasets/DataFrames` is the term we've chosen in [SPARK-15593](00c310133d)):
- `continuous queries` -> `streaming Datasets/DataFrames`
- `non-continuous queries` -> `non-streaming Datasets/DataFrames`
This patch also adds `test("check foreach() can only be called on streaming Datasets/DataFrames")`.
## How was this patch tested?
N/A
Author: Liwei Lin <lwlin7@gmail.com>
Closes#13595 from lw-lin/continuous-queries-to-streaming-dss-dfs.
## What changes were proposed in this pull request?
It's similar to the bug fixed in https://github.com/apache/spark/pull/13425, we should consider null object and wrap the `CreateStruct` with `If` to do null check.
This PR also improves the test framework to test the objects of `Dataset[T]` directly, instead of calling `toDF` and compare the rows.
## How was this patch tested?
new test in `DatasetAggregatorSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13553 from cloud-fan/agg-null.
## What changes were proposed in this pull request?
If a cached `DataFrame` is executed more than once and then `uncacheTable` is called, like the following:
```
val selectStar = sql("SELECT * FROM testData WHERE key = 1")
selectStar.createOrReplaceTempView("selectStar")
spark.catalog.cacheTable("selectStar")
checkAnswer(
selectStar,
Seq(Row(1, "1")))
spark.catalog.uncacheTable("selectStar")
checkAnswer(
selectStar,
Seq(Row(1, "1")))
```
then the uncached `DataFrame` fails to execute with a `Task not serializable` exception like:
```
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2038)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1912)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:884)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
at org.apache.spark.rdd.RDD.collect(RDD.scala:883)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:290)
...
Caused by: java.lang.UnsupportedOperationException: Accumulator must be registered before send to executor
at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:153)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1118)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1136)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
...
```
Notice that a `DataFrame` uncached with `DataFrame.unpersist()` works, but one uncached with `spark.catalog.uncacheTable` does not.
This pr reverts part of cf38fe0 so that the `batchStats` accumulator is not unregistered; it does not need to be unregistered here because `ContextCleaner` will do so after it is collected by GC.
## How was this patch tested?
Added a test to check if DataFrame can execute after uncacheTable and other existing tests.
I marked the test that checks whether the accumulator was cleared as `ignore` because it would be flaky.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#13596 from ueshin/issues/SPARK-15870.
## What changes were proposed in this pull request?
This adds support for radix sort of nullable long fields. When a sort field is null and radix sort is enabled, we keep nulls in a separate region of the sort buffer so that radix sort does not need to deal with them. This also has performance benefits when sorting smaller integer types, since the current representation of nulls in two's complement (Long.MIN_VALUE) otherwise forces a full-width radix sort.
This strategy for nulls does mean the sort is no longer stable. cc davies
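The null-handling idea can be sketched as follows. This is an illustrative simplification, not Spark's actual sorter: a stdlib sort stands in for the radix sort, and nulls are placed first here, whereas the real implementation picks the null region based on the requested ordering:
```scala
// Keep null keys in a separate region so the sort of the non-null region
// never has to encode nulls (e.g. as Long.MIN_VALUE, which would force a
// full-width radix sort). Note: as described above, this is not stable.
def sortNullableLongs(keys: Array[java.lang.Long]): Array[java.lang.Long] = {
  val (nulls, nonNulls) = keys.partition(_ == null)
  nulls ++ nonNulls.sortBy(_.longValue) // radix sort would run here
}
```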
## How was this patch tested?
Existing randomized sort tests for correctness. I also tested some TPCDS queries and there does not seem to be any significant regression for non-null sorts.
Some test queries (best of 5 runs each).
Before change:
```
scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6
start: Long = 3190437233227987
res3: Double = 4716.471091
```
After change:
```
scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6
start: Long = 3190367870952791
res4: Double = 2981.143045
```
Author: Eric Liang <ekl@databricks.com>
Closes#13161 from ericl/sc-2998.
## What changes were proposed in this pull request?
It's easy for users to call `range(...).as[Long]` to get a typed Dataset, and this isn't worth an API-breaking change. This PR reverts it.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13605 from cloud-fan/range.
## What changes were proposed in this pull request?
This pr adds documentation on turning off quoting, because this behavior differs from `com.databricks.spark.csv`.
## How was this patch tested?
Checked the behavior when an empty string is put in the CSV options.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#13616 from maropu/SPARK-15585-2.
## What changes were proposed in this pull request?
In case of any bug in whole-stage codegen where the generated code can't be compiled, we should fall back to the non-codegen path to make sure the query can still run.
The batch mode of the new Parquet reader depends on codegen and can't easily be switched to non-batch mode, so we still use codegen for batched scans (for Parquet). Because it only supports primitive types and the number of columns is less than `spark.sql.codegen.maxFields` (100), it should not fail.
This is configurable via `spark.sql.codegen.fallback`.
## How was this patch tested?
Manual test it with buggy operator, it worked well.
Author: Davies Liu <davies@databricks.com>
Closes#13501 from davies/codegen_fallback.
## What changes were proposed in this pull request?
Spark currently incorrectly continues to use cached data even if the underlying data is overwritten.
Current behavior:
```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
sqlContext.read.parquet(dir).count() // outputs 1000 <---- We are still using the cached dataset
```
This patch fixes this bug by adding support for `REFRESH path` that invalidates and refreshes all the cached data (and the associated metadata) for any dataframe that contains the given data source path.
Expected behavior:
```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
spark.catalog.refreshResource(dir)
sqlContext.read.parquet(dir).count() // outputs 10 <---- We are not using the cached dataset
```
## How was this patch tested?
Unit tests for overwrites and appends in `ParquetQuerySuite` and `CachedTableSuite`.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13566 from sameeragarwal/refresh-path-2.
## What changes were proposed in this pull request?
The base class `SpecificParquetRecordReaderBase`, used by the vectorized Parquet reader, tries to get pushed-down filters from the given configuration. These pushed-down filters are used for row-group-level filtering. However, we never set the filters in the configuration, so they are not actually pushed down to do row-group-level filtering. This patch fixes that by setting up the filters in the configuration passed to the reader.
## How was this patch tested?
Existing tests should be passed.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#13371 from viirya/vectorized-reader-push-down-filter.
## What changes were proposed in this pull request?
Currently, we always split files when they are bigger than maxSplitBytes, but Hadoop's LineRecordReader does not respect the splits for compressed files correctly, so FileFormat should have an API to check whether a file can be split or not.
This PR is based on #13442, closes#13442
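The hook described above can be sketched roughly as follows. All names here are illustrative assumptions, not the actual Spark API, and the suffix list is an example only:
```scala
// A file format reports whether a given file may be split; unsplittable
// compressed files (e.g. gzip) must be read as a single partition.
trait FileFormatLike {
  def isSplitable(path: String): Boolean
}

object TextLike extends FileFormatLike {
  // Illustrative list of suffixes whose codecs cannot be split mid-stream.
  private val unsplittableSuffixes = Seq(".gz", ".snappy", ".lzo")
  override def isSplitable(path: String): Boolean =
    !unsplittableSuffixes.exists(path.endsWith)
}
```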
## How was this patch tested?
add regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#13531 from davies/fix_split.
## What changes were proposed in this pull request?
In Scala, `immutable.List.length` is an expensive operation, so we should avoid using `Seq.length == 0` or `Seq.length > 0`, and use `Seq.isEmpty` and `Seq.nonEmpty` instead.
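A small illustration of the difference:
```scala
// isEmpty/nonEmpty inspect at most the head of the list, while length
// must traverse every cons cell of an immutable.List.
val xs: Seq[Int] = List.fill(1000000)(1)
xs.nonEmpty      // O(1): checks only whether a head exists
xs.length > 0    // O(n): walks the whole list before comparing
xs.isEmpty       // O(1): preferred over xs.length == 0
```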
## How was this patch tested?
existing tests
Author: wangyang <wangyang@haizhi.com>
Closes#13601 from yangw1234/isEmpty.
## What changes were proposed in this pull request?
Replace all occurrences of `None: Option[X]` with `Option.empty[X]`
## How was this patch tested?
Existing tests.
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#13591 from techaddict/minor-7.
## What changes were proposed in this pull request?
This PR moves `QueryPlanner.planLater()` method into `GenericStrategy` for extra strategies to be able to use `planLater` in its strategy.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#13147 from ueshin/issues/SPARK-6320.
## What changes were proposed in this pull request?
When saving datasets to storage, `partitionBy` provides an easy way to construct the directory structure. However, if a user chooses all columns as partition columns, exceptions occur.
- **ORC with all column partitioning**: `AnalysisException` on **future read** due to schema inference failure.
```scala
scala> spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save("/tmp/data")
scala> spark.read.format("orc").load("/tmp/data").collect()
org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at /tmp/data. It must be specified manually;
```
- **Parquet with all-column partitioning**: `InvalidSchemaException` on **write execution** due to Parquet limitation.
```scala
scala> spark.range(100).write.format("parquet").mode("overwrite").partitionBy("id").save("/tmp/data")
[Stage 0:> (0 + 8) / 8]16/06/02 16:51:17
ERROR Utils: Aborting task
org.apache.parquet.schema.InvalidSchemaException: A group type can not be empty. Parquet does not support empty group without leaves. Empty group: spark_schema
... (lots of error messages)
```
Although some formats like JSON support all-column partitioning without any problem, it seems not a good idea to make lots of empty directories.
This PR prevents saving with all-column partitioning by consistently raising `AnalysisException` before executing save operation.
## How was this patch tested?
Newly added `PartitioningUtilsSuite`.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13486 from dongjoon-hyun/SPARK-15743.
## What changes were proposed in this pull request?
SparkContext.listAccumulator, by Spark's convention, makes it sound like "list" is a verb and the method should return a list of accumulators. This patch renames the method and the class to use "collection accumulator" instead.
## How was this patch tested?
Updated test case to reflect the names.
Author: Reynold Xin <rxin@databricks.com>
Closes#13594 from rxin/SPARK-15866.
## What changes were proposed in this pull request?
This patch moves some code in `DataFrameWriter.insertInto` that belongs in `Analyzer`.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#13496 from viirya/move-analyzer-stuff.
## What changes were proposed in this pull request?
* Add DataFrameWriter.foreach to allow the user consuming data in ContinuousQuery
* ForeachWriter is the interface for the user to consume partitions of data
* Add a type parameter T to DataFrameWriter
Usage
```Scala
val ds = spark.read....stream().as[String]
ds.....write
.queryName(...)
.option("checkpointLocation", ...)
.foreach(new ForeachWriter[Int] {
def open(partitionId: Long, version: Long): Boolean = {
// prepare some resources for a partition
// check `version` if possible and return `false` if this is a duplicated data to skip the data processing.
}
override def process(value: Int): Unit = {
// process data
}
def close(errorOrNull: Throwable): Unit = {
// release resources for a partition
// check `errorOrNull` and handle the error if necessary.
}
})
```
## How was this patch tested?
New unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13342 from zsxwing/foreach.
## What changes were proposed in this pull request?
The fix is pretty simple, just don't make the executedPlan transient in `ScalarSubquery` since it is referenced at execution time.
## How was this patch tested?
I verified the fix manually in non-local mode. It's not clear to me why the problem did not manifest in local mode, any suggestions?
cc davies
Author: Eric Liang <ekl@databricks.com>
Closes#13569 from ericl/fix-scalar-npe.
## What changes were proposed in this pull request?
SparkSession does not have that many functions due to better namespacing, and as a result we probably don't need the function grouping. This patch removes the grouping and also adds missing scaladocs for createDataset functions in SQLContext.
Closes#13577.
## How was this patch tested?
N/A - this is a documentation change.
Author: Reynold Xin <rxin@databricks.com>
Closes#13582 from rxin/SPARK-15850.
## What changes were proposed in this pull request?
This PR closes the input stream created in `HDFSMetadataLog.get`
## How was this patch tested?
Jenkins unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13583 from zsxwing/leak.
## What changes were proposed in this pull request?
With very wide tables, e.g. thousands of fields, the plan output is unreadable and often causes OOMs due to inefficient string processing. This truncates all struct and operator field lists to a user configurable threshold to limit performance impact.
It would also be nice to optimize string generation to avoid this sort of O(n^2) slowdown entirely (i.e. use StringBuilder everywhere, including expressions), but that is probably too large a change for 2.0 at this point, and truncation has other usability benefits.
## How was this patch tested?
Added a microbenchmark that covers this case particularly well. I also ran the microbenchmark while varying the truncation threshold.
```
numFields = 5
wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem) 2336 / 2558 0.0 23364.4 0.1X
numFields = 25
wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem) 4237 / 4465 0.0 42367.9 0.1X
numFields = 100
wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem) 10458 / 11223 0.0 104582.0 0.0X
numFields = Infinity
wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
[info] java.lang.OutOfMemoryError: Java heap space
```
Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>
Closes#13537 from ericl/truncated-string.
## What changes were proposed in this pull request?
Documentation Fix
## How was this patch tested?
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#13567 from techaddict/minor-4.
## What changes were proposed in this pull request?
On the SparkUI right now we have this SQLTab that displays accumulator values per operator. However, it only displays metrics updated on the executors, not on the driver. It is useful to also include driver metrics, e.g. broadcast time.
This is a different version from https://github.com/apache/spark/pull/12427. This PR sends driver side accumulator updates right after the updating happens, not at the end of execution, by a new event.
## How was this patch tested?
new test in `SQLListenerSuite`
![qq20160606-0](https://cloud.githubusercontent.com/assets/3182036/15841418/0eb137da-2c06-11e6-9068-5694eeb78530.png)
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13189 from cloud-fan/metrics.
## What changes were proposed in this pull request?
revived #13464
Fix Java Lint errors introduced by #13286 and #13280
Before:
```
Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type.
```
## How was this patch tested?
ran `dev/lint-java` locally
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#13559 from techaddict/minor-3.
## What changes were proposed in this pull request?
This PR adds ContinuousQueryInfo to make ContinuousQueryListener events serializable in order to support writing events into the event log.
## How was this patch tested?
Jenkins unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13335 from zsxwing/query-info.
## What changes were proposed in this pull request?
The current implementation of "CREATE TEMPORARY TABLE USING datasource..." does NOT create any intermediate temporary data directory (such as a temporary HDFS folder); instead, it only stores a SQL string in memory. We should probably use "TEMPORARY VIEW" instead.
This PR assumes a temporary table has to be linked to some temporary intermediate data. It follows this definition of a temporary table (from the [hortonworks doc](https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/temp-tables.html)):
> A temporary table is a convenient way for an application to automatically manage intermediate data generated during a complex query
**Example**:
```
scala> spark.sql("CREATE temporary view my_tab7 (c1: String, c2: String) USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')")
scala> spark.sql("select c1, c2 from my_tab7").show()
+----+-----+
| c1| c2|
+----+-----+
|year| make|
|2012|Tesla|
...
```
It NOW prints a **deprecation warning** if "CREATE TEMPORARY TABLE USING..." is used.
```
scala> spark.sql("CREATE temporary table my_tab7 (c1: String, c2: String) USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')")
16/05/31 10:39:27 WARN SparkStrategies$DDLStrategy: CREATE TEMPORARY TABLE tableName USING... is deprecated, please use CREATE TEMPORARY VIEW viewName USING... instead
```
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13414 from clockfly/create_temp_view_using.
## What changes were proposed in this pull request?
This PR allows customization of verbosity in explain output. After this change, `dataframe.explain()` and `dataframe.explain(true)` produce physical-plan output with different verbosity.
Currently, this PR only enables the verbose string for the operators `HashAggregateExec` and `SortAggregateExec`. We will gradually enable it for more operators in the future.
**Less verbose mode:** dataframe.explain(extended = false)
`output=[count(a)#85L]` is **NOT** displayed for HashAggregate.
```
scala> Seq((1,2,3)).toDF("a", "b", "c").createTempView("df2")
scala> spark.sql("select count(a) from df2").explain()
== Physical Plan ==
*HashAggregate(key=[], functions=[count(1)])
+- Exchange SinglePartition
+- *HashAggregate(key=[], functions=[partial_count(1)])
+- LocalTableScan
```
**Verbose mode:** dataframe.explain(extended = true)
`output=[count(a)#85L]` is displayed for HashAggregate.
```
scala> spark.sql("select count(a) from df2").explain(true) // "output=[count(a)#85L]" is added
...
== Physical Plan ==
*HashAggregate(key=[], functions=[count(1)], output=[count(a)#85L])
+- Exchange SinglePartition
+- *HashAggregate(key=[], functions=[partial_count(1)], output=[count#87L])
+- LocalTableScan
```
## How was this patch tested?
Manual test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13535 from clockfly/verbose_breakdown_2.
## What changes were proposed in this pull request?
This PR makes sure the typed Filter doesn't change the Dataset schema.
**Before the change:**
```
scala> val df = spark.range(0,9)
scala> df.schema
res12: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
scala> val afterFilter = df.filter(_=>true)
scala> afterFilter.schema // !!! schema is CHANGED!!! Column name is changed from id to value, nullable is changed from false to true.
res13: org.apache.spark.sql.types.StructType = StructType(StructField(value,LongType,true))
```
SerializeFromObject and DeserializeToObject are inserted to wrap the Filter, and these two can possibly change the schema of Dataset.
**After the change:**
```
scala> afterFilter.schema // schema is NOT changed.
res47: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
```
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13529 from clockfly/spark-15632.
BindReferences contains an O(n^2) loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` array. Because `input` can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n).
Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups.
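The change can be sketched as follows; the types and names here are illustrative, not the actual Spark internals:

```scala
// Sketch of the idea behind the BindReferences change (illustrative types).
case class Attr(exprId: Long, nullable: Boolean)

// Before: an O(n) scan per attribute reference, repeated for every
// reference in the expression tree.
def ordinalLinear(input: Seq[Attr], id: Long): Int =
  input.indexWhere(_.exprId == id)

// After: one O(n) pass builds an index, then each lookup is O(1).
// Converting to an array first also avoids O(n) List indexing.
def buildIndex(input: Seq[Attr]): Map[Long, (Int, Attr)] = {
  val arr = input.toArray
  arr.iterator.zipWithIndex.map { case (a, i) => a.exprId -> (i, a) }.toMap
}
```

The up-front cost of `buildIndex` is amortized because a single expression typically contains many attribute references.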
Perf. benchmarks to follow. /cc ericl
Author: Josh Rosen <joshrosen@databricks.com>
Closes#13505 from JoshRosen/bind-references-improvement.
## What changes were proposed in this pull request?
`an -> a`
Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13515 from zhengruifeng/an_a.
## What changes were proposed in this pull request?
This PR fixes the behaviour of `format("csv").option("quote", null)` to match that of spark-csv.
Also, it explicitly sets default values for CSV options in python.
## How was this patch tested?
Added tests in CSVSuite.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#13372 from maropu/SPARK-15585.
`PartitionStatistics` uses `foldLeft` and list concatenation (`++`) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing `flatMap`/`flatten` because it performs many unnecessary object allocations. Simply replacing this `foldLeft` by a `flatMap` results in decent performance gains when constructing PartitionStatistics instances for tables with many columns.
This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern.
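A minimal illustration of why the `foldLeft` pattern is costly (the data here is made up):

```scala
// Illustrative only: foldLeft with ++ repeatedly copies the accumulated
// prefix and allocates intermediate collections, making it quadratic:
//   partitions.foldLeft(List.empty[Int])(_ ++ _)

// flatten does the same job in a single linear pass:
val partitions: Iterator[List[Int]] = Iterator(List(1, 2), List(3), List(4, 5))
val flat = partitions.flatten.toList   // List(1, 2, 3, 4, 5)
```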
Author: Josh Rosen <joshrosen@databricks.com>
Closes#13491 from JoshRosen/foldleft-to-flatmap.
## What changes were proposed in this pull request?
Spark SQL currently supports 'create table src stored as orc/parquet/avro' for orc/parquet/avro tables, but Hive supports both forms: 'stored as orc/parquet/avro' and 'stored as orcfile/parquetfile/avrofile'.
So this PR adds support for the keywords 'orcfile/parquetfile/avrofile' in Spark SQL.
## How was this patch tested?
add unit tests
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Closes#13500 from lianhuiwang/SPARK-15756.
## What changes were proposed in this pull request?
Currently, the memory for the temporary buffer used by TimSort is always allocated on-heap without bookkeeping, which could cause OOM in both on-heap and off-heap mode.
This PR manages that memory by preallocating it together with the pointer array, as RadixSort does. It works for both on-heap and off-heap mode.
This PR also changes the loadFactor of BytesToBytesMap to 0.5 (it was 0.70); this enables us to use radix sort and also makes sure that we have enough memory for TimSort.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#13318 from davies/fix_timsort.
## What changes were proposed in this pull request?
As of this patch, the following throws an exception because the schemas may not match:
```
CREATE TABLE students (age INT, name STRING) AS SELECT * FROM boxes
```
but this is OK:
```
CREATE TABLE students AS SELECT * FROM boxes
```
## How was this patch tested?
SQLQuerySuite, HiveDDLCommandSuite
Author: Andrew Or <andrew@databricks.com>
Closes#13490 from andrewor14/ctas-no-column.
## What changes were proposed in this pull request?
Our encoder framework has been evolved a lot, this PR tries to clean up the code to make it more readable and emphasise the concept that encoder should be used as a container of serde expressions.
1. move validation logic to analyzer instead of encoder
2. only have a `resolveAndBind` method in encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore.
3. `Dataset` doesn't need to keep a resolved encoder, as there is no such concept anymore. A bound encoder is still needed to do serialization outside of the query framework.
4. Using `BoundReference` to represent an unresolved field in deserializer expression is kind of weird, this PR adds a `GetColumnByOrdinal` for this purpose. (serializer expression still use `BoundReference`, we can replace it with `GetColumnByOrdinal` in follow-ups)
## How was this patch tested?
existing test
Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Closes#13269 from cloud-fan/clean-encoder.
## What changes were proposed in this pull request?
For consistency, this PR updates some remaining `TungstenAggregation/SortBasedAggregate` after SPARK-15728.
- Update a comment in codegen in `VectorizedHashMapGenerator.scala`.
- `TungstenAggregationQuerySuite` --> `HashAggregationQuerySuite`
- `TungstenAggregationQueryWithControlledFallbackSuite` --> `HashAggregationQueryWithControlledFallbackSuite`
- Update two error messages in `SQLQuerySuite.scala` and `AggregationQuerySuite.scala`.
- Update several comments.
## How was this patch tested?
Manual (Only comment changes and test suite renamings).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13487 from dongjoon-hyun/SPARK-15744.
## What changes were proposed in this pull request?
##### The root cause:
When `DataSource.resolveRelation` tries to build a `ListingFileCatalog` object, `ListLeafFiles` is invoked and a list of `FileStatus` objects is retrieved from the provided path. These `FileStatus` objects include the directories for the partitions (id=0 and id=2 in the JIRA). However, these directory `FileStatus` objects also trigger `getFileBlockLocations`, which is not allowed on directories for `DistributedFileSystem`, hence the exception.
This PR is to remove the block of code that invokes `getFileBlockLocations` for every FileStatus object of the provided path. Instead, we call `HadoopFsRelation.listLeafFiles` directly because this utility method filters out the directories before calling `getFileBlockLocations` for generating `LocatedFileStatus` objects.
## How was this patch tested?
Regtest is run. Manual test:
```
scala> spark.read.format("parquet").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_part").show
+-----+---+
| text| id|
+-----+---+
|hello| 0|
|world| 0|
|hello| 1|
|there| 1|
+-----+---+
spark.read.format("orc").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_orc").show
+-----+---+
| text| id|
+-----+---+
|hello| 0|
|world| 0|
|hello| 1|
|there| 1|
+-----+---+
```
I also tried it with 2 level of partitioning.
I have not found a way to add test case in the unit test bucket that can test a real hdfs file location. Any suggestions will be appreciated.
Author: Xin Wu <xinwu@us.ibm.com>
Closes#13463 from xwu0226/SPARK-14959.
## What changes were proposed in this pull request?
Currently we don't support bucketing for `save` and `insertInto`.
For `save`, we just write the data out into a directory users specified; it's not a table, and we don't keep its metadata. When we read it back, we have no idea whether the data is bucketed, so it doesn't make sense to use `save` to write bucketed data, as we can't use the bucket information anyway.
We can support it in the future, once we have features like bucket discovery, or we save bucket information in the data directory too, so that we don't need to rely on a metastore.
For `insertInto`, it inserts data into an existing table, so it doesn't make sense to specify bucket information, as we should get the bucket information from the existing table.
This PR improves the error message for the above 2 cases.
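A sketch of the two rejected call patterns (the path and table name are made up, and the exact error text is not reproduced here):

```scala
// Illustrative only.
val df = spark.range(10).toDF("i")

// `save` writes plain files with no metastore entry, so bucket metadata
// would be lost; this combination now fails with a clearer AnalysisException.
df.write.bucketBy(4, "i").save("/tmp/bucketed_out")

// `insertInto` must take bucketing from the existing table's metadata,
// so specifying it on the writer is also rejected.
df.write.bucketBy(4, "i").insertInto("existing_table")
```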
## How was this patch tested?
new test in `BucketedWriteSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13452 from cloud-fan/error-msg.
## What changes were proposed in this pull request?
This PR disables writing Parquet summary files by default (i.e., when Hadoop configuration "parquet.enable.summary-metadata" is not set).
Please refer to [SPARK-15719][1] for more details.
## How was this patch tested?
New test case added in `ParquetQuerySuite` to check no summary files are written by default.
[1]: https://issues.apache.org/jira/browse/SPARK-15719
Author: Cheng Lian <lian@databricks.com>
Closes#13455 from liancheng/spark-15719-disable-parquet-summary-files.
## What changes were proposed in this pull request?
This PR bans syntax like `CREATE TEMPORARY TABLE USING AS SELECT`
`CREATE TEMPORARY TABLE ... USING ... AS ...` is not properly implemented, the temporary data is not cleaned up when the session exits. Before a full fix, we probably should ban this syntax.
This PR only impact syntax like `CREATE TEMPORARY TABLE ... USING ... AS ...`.
Other syntax like `CREATE TEMPORARY TABLE .. USING ...` and `CREATE TABLE ... USING ...` are not impacted.
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13451 from clockfly/ban_create_temp_table_using_as.
#### What changes were proposed in this pull request?
This PR is to address the following issues:
- **ISSUE 1:** For ORC source format, we are reporting the strange error message when we did not enable Hive support:
```SQL
SQL Example:
select id from `org.apache.spark.sql.hive.orc`.`file_path`
Error Message:
Table or view not found: `org.apache.spark.sql.hive.orc`.`file_path`
```
Instead, we should issue the error message like:
```
Expected Error Message:
The ORC data source must be used with Hive support enabled
```
- **ISSUE 2:** For the Avro format, we report the strange error message like:
The example query is like
```SQL
SQL Example:
select id from `avro`.`file_path`
select id from `com.databricks.spark.avro`.`file_path`
Error Message:
Table or view not found: `com.databricks.spark.avro`.`file_path`
```
The desired message should be like:
```
Expected Error Message:
Failed to find data source: avro. Please use Spark package http://spark-packages.org/package/databricks/spark-avro"
```
- ~~**ISSUE 3:** Unable to detect incompatibility libraries for Spark 2.0 in Data Source Resolution. We report a strange error message:~~
**Update**: The latest code changes contain
- For JDBC format, we added an extra checking in the rule `ResolveRelations` of `Analyzer`. Without the PR, Spark will return the error message like: `Option 'url' not specified`. Now, we are reporting `Unsupported data source type for direct query on files: jdbc`
- Make the data source format name case-insensitive so that error handling behaves consistently with the normal cases.
- Added the test cases for all the supported formats.
#### How was this patch tested?
Added test cases to cover all the above issues
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13283 from gatorsmile/runSQLAgainstFile.
## What changes were proposed in this pull request?
We currently have two physical aggregate operators: TungstenAggregate and SortBasedAggregate. These names don't make a lot of sense from an end-user point of view. This patch renames them HashAggregate and SortAggregate.
## How was this patch tested?
Updated test cases.
Author: Reynold Xin <rxin@databricks.com>
Closes#13465 from rxin/SPARK-15728.
## What changes were proposed in this pull request?
This PR corrects the remaining cases for using old accumulators.
This does not change some old accumulator usages below:
- `ImplicitSuite.scala` - Tests dedicated to old accumulator, for implicits with `AccumulatorParam`
- `AccumulatorSuite.scala` - Tests dedicated to old accumulator
- `JavaSparkContext.scala` - For supporting old accumulators for Java API.
- `debug.package.scala` - Usage with `HashSet[String]`. Currently, it seems no implementation for this. I might be able to write an anonymous class for this but I didn't because I think it is not worth writing a lot of codes only for this.
- `SQLMetricsSuite.scala` - This uses the old accumulator for checking type boxing. It seems new accumulator does not require type boxing for this case whereas the old one requires (due to the use of generic).
## How was this patch tested?
Existing tests cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13434 from HyukjinKwon/accum.
## What changes were proposed in this pull request?
Currently, `freqItems` raises `UnsupportedOperationException` on `empty.min` usually when its `support` argument is high.
```scala
scala> spark.createDataset(Seq(1, 2, 2, 3, 3, 3)).stat.freqItems(Seq("value"), 2)
16/06/01 11:11:38 ERROR Executor: Exception in task 5.0 in stage 0.0 (TID 5)
java.lang.UnsupportedOperationException: empty.min
...
```
Also, the parameter checking message is wrong.
```
require(support >= 1e-4, s"support ($support) must be greater than 1e-4.")
```
This PR changes the logic to handle the `empty` case and also improves parameter checking.
## How was this patch tested?
Pass the Jenkins tests (with a new testcase).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13449 from dongjoon-hyun/SPARK-15709.
## What changes were proposed in this pull request?
When `spark.sql.hive.convertCTAS` is true, for a CTAS statement, we will create a data source table using the default source (i.e. parquet) if the CTAS does not specify any Hive storage format. However, there are two issues with this conversion logic.
1. First, we determine if a CTAS statement defines storage format by checking the serde. However, TEXTFILE/SEQUENCEFILE does not have a default serde. When we do the check, we have not set the default serde. So, a query like `CREATE TABLE abc STORED AS TEXTFILE AS SELECT ...` actually creates a data source parquet table.
2. In the conversion logic, we are ignoring the user-specified location.
This PR fixes the above two issues.
Also, this PR makes the parser throw an exception when a CTAS statement has a PARTITIONED BY clause. This change is made because Hive's syntax does not allow it and our current implementation actually does not work for this case (the insert operation always throws an exception because the insertion does not pick up the partitioning info).
## How was this patch tested?
I am adding new tests in SQLQuerySuite and HiveDDLCommandSuite.
Author: Yin Huai <yhuai@databricks.com>
Closes#13386 from yhuai/SPARK-14507.
## What changes were proposed in this pull request?
Improves the explain output of several physical plans by displaying embedded logical plan in tree style
Some physical plans contain an embedded logical plan, for example, `cache tableName query` maps to:
```
case class CacheTableCommand(
tableName: String,
plan: Option[LogicalPlan],
isLazy: Boolean)
extends RunnableCommand
```
It is easier to read the explain output if we can display the `plan` in tree style.
**Before change:**
Everything is crammed into a single line.
```
scala> Seq((1,2)).toDF().createOrReplaceTempView("testView")
scala> spark.sql("cache table testView2 select * from testView").explain()
== Physical Plan ==
ExecutedCommand CacheTableCommand testView2, Some('Project [*]
+- 'UnresolvedRelation `testView`, None
), false
```
**After change:**
```
scala> spark.sql("cache table testView2 select * from testView").explain()
== Physical Plan ==
ExecutedCommand
: +- CacheTableCommand testView2, false
: : +- 'Project [*]
: : +- 'UnresolvedRelation `testView`, None
```
## How was this patch tested?
Manual test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13433 from clockfly/verbose_breakdown_3_2.
## What changes were proposed in this pull request?
Currently we can't encode top level null object into internal row, as Spark SQL doesn't allow row to be null, only its columns can be null.
This is not a problem before, as we assume the input object is never null. However, for outer join, we do need the semantics of null object.
This PR fixes this problem by making both join sides produce a single column, i.e. nest the logical plan output (by `CreateStruct`), so that we have an extra level to represent a top-level null object.
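An illustrative example of where a null top-level object is needed (the class and data here are made up):

```scala
// Illustrative only; requires a SparkSession and `import spark.implicits._`.
case class Person(name: String, age: Int)

val left  = Seq(Person("a", 1)).toDS()
val right = Seq(Person("b", 2)).toDS()

// For unmatched left rows, the right side is conceptually a null Person.
// Nesting each side's output into a single struct column gives the plan
// an extra level that can be null as a whole.
val joined = left.joinWith(right, left("name") === right("name"), "left_outer")
// joined: Dataset[(Person, Person)], with null on the unmatched side
```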
## How was this patch tested?
new test in `DatasetSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13425 from cloud-fan/outer-join2.
This PR is an alternative to #13120 authored by xwu0226.
## What changes were proposed in this pull request?
When creating an external Spark SQL data source table and persisting its metadata to Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under database location, but doesn't remove it while dropping the table (because the table is external).
This PR works around this issue by explicitly setting `Table.dataLocation` and then manually removing the created directory after creating the external table.
Please refer to [this JIRA comment][1] for more details about why we chose this approach as a workaround.
[1]: https://issues.apache.org/jira/browse/SPARK-15269?focusedCommentId=15297408&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15297408
## How was this patch tested?
1. A new test case is added in `HiveQuerySuite` for this case
2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue at the first place.)
Author: Cheng Lian <lian@databricks.com>
Closes#13270 from liancheng/spark-15269-unpleasant-fix.
## What changes were proposed in this pull request?
**SPARK-15596**: Even after we renamed a cached table, the plan would remain in the cache with the old table name. If I created a new table using the old name then the old table would return incorrect data. Note that this applies only to Hive tables.
**SPARK-15635**: Renaming a datasource table would render the table not query-able. This is because we store the location of the table in a "path" property, which was not updated to reflect Hive's change in table location following a rename.
## How was this patch tested?
DDLSuite
Author: Andrew Or <andrew@databricks.com>
Closes#13416 from andrewor14/rename-table.
## What changes were proposed in this pull request?
This patch moves all user-facing structured streaming classes into sql.streaming. As part of this, I also added some since version annotation to methods and classes that don't have them.
## How was this patch tested?
Updated tests to reflect the moves.
Author: Reynold Xin <rxin@databricks.com>
Closes#13429 from rxin/SPARK-15686.
## What changes were proposed in this pull request?
The text data source ignores the requested schema, and may give a wrong result when the only data column is not requested. This may happen when only the partitioning column(s) are requested for a partitioned text table.
## How was this patch tested?
New test case added in `TextSuite`.
Author: Cheng Lian <lian@databricks.com>
Closes#13431 from liancheng/spark-14343-partitioned-text-table.
## What changes were proposed in this pull request?
This PR changes function `SparkSession.builder.sparkContext(..)` from **private[sql]** into **private[spark]**, and uses it if applicable like the followings.
```
- val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
+ val spark = SparkSession.builder().sparkContext(sc).getOrCreate()
```
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13365 from dongjoon-hyun/SPARK-15618.
This PR fixes a sample code, a description, and indentations in docs.
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13420 from dongjoon-hyun/minor_fix_dataset_doc.
## What changes were proposed in this pull request?
Fixes "Can't drop top level columns that contain dots".
This work is based on dilipbiswal's https://github.com/apache/spark/pull/10943.
This PR fixes problems like:
```
scala> Seq((1, 2)).toDF("a.b", "a.c").drop("a.b")
org.apache.spark.sql.AnalysisException: cannot resolve '`a.c`' given input columns: [a.b, a.c];
```
`drop(columnName)` can only be used to drop a top-level column, so we should parse the column name literally WITHOUT interpreting the dot "."
We should also NOT interpret the back tick "`", otherwise it is hard to understand what
```
```aaa```bbb``
```
actually means.
## How was this patch tested?
Unit tests.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13306 from clockfly/fix_drop_column.
## What changes were proposed in this pull request?
This patch does a few things:
1. Adds since version annotation to methods and classes in sql.catalog.
2. Fixed a typo in FilterFunction and a whitespace issue in spark/api/java/function/package.scala
3. Added "database" field to Function class.
## How was this patch tested?
Updated unit test case for "database" field in Function class.
Author: Reynold Xin <rxin@databricks.com>
Closes#13406 from rxin/SPARK-15662.
## What changes were proposed in this pull request?
Currently structured streaming only supports append output mode. This PR adds the following.
- Added support for Complete output mode in the internal state store, analyzer and planner.
- Added public API in Scala and Python for users to specify output mode
- Added checks for unsupported combinations of output mode and DF operations
- Plans with no aggregation should support only Append mode
- Plans with aggregation should support only Update and Complete modes
- Default output mode is Append mode (**Question: should we change this to automatically set to Complete mode when there is aggregation?**)
- Added support for Complete output mode in Memory Sink. So Memory Sink internally supports Append, Complete, and Update, but from the public API only the Complete and Append output modes are supported.
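The user-facing API described above can be sketched as follows; the sink, query name, and the `startStream()` spelling reflect this era of the API and are assumptions, not verbatim from the PR:

```scala
// Sketch only: an aggregating streaming query opting into Complete mode.
val counts = inputDF.groupBy("word").count()

counts.write
  .outputMode("complete")   // plans with aggregation: Complete (or Update)
  .format("memory")
  .queryName("word_counts")
  .startStream()
```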
## How was this patch tested?
Unit tests in various test suites
- StreamingAggregationSuite: tests for complete mode
- MemorySinkSuite: tests for checking behavior in Append and Complete modes.
- UnsupportedOperationSuite: tests for checking unsupported combinations of DF ops and output modes
- DataFrameReaderWriterSuite: tests for checking that output mode cannot be called on static DFs
- Python doc test and existing unit tests modified to call write.outputMode.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13286 from tdas/complete-mode.
## What changes were proposed in this pull request?
Right now, we split the code for expressions into multiple functions when it exceeds 64KB, which requires that the expressions use a Row object; this is not true for whole-stage codegen, which fails to compile after the split.
This PR will not split the code in whole-stage codegen.
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#13235 from davies/fix_nested_codegen.
## What changes were proposed in this pull request?
This reverts commit c24b6b679c. Sent a PR to run Jenkins tests due to the revert conflicts of `dev/deps/spark-deps-hadoop*`.
## How was this patch tested?
Jenkins unit tests, integration tests, manual tests
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13417 from zsxwing/revert-SPARK-11753.
## What changes were proposed in this pull request?
This patch contains a list of changes as a result of my auditing Dataset, SparkSession, and SQLContext. The patch audits the categorization of experimental APIs, function groups, and deprecations. For the detailed list of changes, please see the diff.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13370 from rxin/SPARK-15638.
## What changes were proposed in this pull request?
I created a bucketed table `bucketed_table` with bucket column `i`:
```scala
case class Data(i: Int, j: Int, k: Int)
sc.makeRDD(Array((1, 2, 3))).map(x => Data(x._1, x._2, x._3)).toDF.write.bucketBy(2, "i").saveAsTable("bucketed_table")
```
and I run the following SQLs:
```sql
SELECT j FROM bucketed_table;
Error in query: bucket column i not found in existing columns (j);
SELECT j, MAX(k) FROM bucketed_table GROUP BY j;
Error in query: bucket column i not found in existing columns (j, k);
```
I think we should add a check so that we only enable bucketing when all of the conditions below are satisfied:
1. the conf is enabled
2. the relation is bucketed
3. the output contains all bucketing columns
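The three conditions above can be sketched as a single guard; all names here are illustrative, not the actual planner code:

```scala
// Sketch of the proposed check (illustrative names).
case class BucketSpec(numBuckets: Int, bucketColumnNames: Seq[String])

def canUseBucketing(
    confEnabled: Boolean,                  // 1. the conf is enabled
    bucketSpec: Option[BucketSpec],        // 2. the relation is bucketed
    outputColumns: Seq[String]): Boolean = // 3. output has all bucket columns
  confEnabled &&
    bucketSpec.exists(_.bucketColumnNames.forall(outputColumns.contains))
```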
## How was this patch tested?
Updated test cases to reflect the changes.
Author: Yadong Qi <qiyadong2010@gmail.com>
Closes#13321 from watermen/SPARK-15549.
## What changes were proposed in this pull request?
Let `Dataset.createTempView` and `Dataset.createOrReplaceTempView` use `CreateViewCommand`, rather than calling `SparkSession.createTempView`. Besides, this patch also removes `SparkSession.createTempView`.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#13327 from viirya/dataset-createtempview.
## What changes were proposed in this pull request?
These commands ignore the partition spec and change the storage properties of the table itself:
```
ALTER TABLE table_name PARTITION (a=1, b=2) SET SERDE 'my_serde'
ALTER TABLE table_name PARTITION (a=1, b=2) SET SERDEPROPERTIES ('key1'='val1')
```
Now they change the storage properties of the specified partition.
## How was this patch tested?
DDLSuite
Author: Andrew Or <andrew@databricks.com>
Closes#13343 from andrewor14/alter-table-serdeproperties.
## What changes were proposed in this pull request?
This includes minimal changes to get Spark using the current release of Parquet, 1.8.1.
## How was this patch tested?
This uses the existing Parquet tests.
Author: Ryan Blue <blue@apache.org>
Closes#13280 from rdblue/SPARK-9876-update-parquet.
## What changes were proposed in this pull request?
Minor typo fixes in Dataset scaladoc
* Corrected context type as SparkSession, not SQLContext.
liancheng rxin andrewor14
## How was this patch tested?
Compiled locally
Author: Xinh Huynh <xinh_huynh@yahoo.com>
Closes#13330 from xinhhuynh/fix-dataset-typos.
## What changes were proposed in this pull request?
This patch adds a new function emptyDataset to SparkSession, for creating an empty dataset.
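Expected usage, assuming the usual implicit encoders are in scope:

```scala
// Illustrative usage of the new method; requires a SparkSession
// and `import spark.implicits._` for the String encoder.
val ds: Dataset[String] = spark.emptyDataset[String]
assert(ds.count() == 0)
```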
## How was this patch tested?
Added a test case.
Author: Reynold Xin <rxin@databricks.com>
Closes#13344 from rxin/SPARK-15597.
## What changes were proposed in this pull request?
Adds API docs and usage examples for the 3 `createDataset` calls in `SparkSession`
## How was this patch tested?
N/A
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13345 from sameeragarwal/dataset-doc.
## What changes were proposed in this pull request?
This PR replaces `spark.sql.sources.` strings with `CreateDataSourceTableUtils.*` constant variables.
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13349 from dongjoon-hyun/SPARK-15584.
#### What changes were proposed in this pull request?
The default value of `spark.sql.warehouse.dir` is `System.getProperty("user.dir")/spark-warehouse`. Since `System.getProperty("user.dir")` is a local dir, we should explicitly set the scheme to local filesystem.
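As an illustration of what "explicitly set the scheme" means, the default can be thought of as a `file:` URI rather than a scheme-less path:

```scala
// Sketch only: pinning the warehouse location to an explicit
// local-filesystem URI.
val warehouse = s"file:${System.getProperty("user.dir")}/spark-warehouse"

val spark = SparkSession.builder()
  .config("spark.sql.warehouse.dir", warehouse)
  .getOrCreate()
```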
cc yhuai
#### How was this patch tested?
Added two test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13348 from gatorsmile/addSchemeToDefaultWarehousePath.
#### What changes were proposed in this pull request?
This PR is to use the new entry point `SparkSession` to replace the existing `SQLContext` and `HiveContext` in SQL test suites.
No change is made in the following suites:
- `ListTablesSuite` is to test the APIs of `SQLContext`.
- `SQLContextSuite` is to test `SQLContext`
- `HiveContextCompatibilitySuite` is to test `HiveContext`
**Update**: Move tests in `ListTableSuite` to `SQLContextSuite`
#### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13337 from gatorsmile/sparkSessionTest.
## What changes were proposed in this pull request?
`a` -> `an`
I use regex to generate potential error lines:
`grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
and review them line by line.
## How was this patch tested?
local build
`lint-java` checking
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13317 from zhengruifeng/a_an.
## What changes were proposed in this pull request?
Certain table properties (and SerDe properties) are in the protected namespace `spark.sql.sources.`, which we use internally for datasource tables. The user should not be allowed to
(1) Create a Hive table setting these properties
(2) Alter these properties in an existing table
Previously, we threw an exception if the user tried to alter the properties of an existing datasource table. However, this is overly restrictive for datasource tables and does not do anything for Hive tables.
## How was this patch tested?
DDLSuite
Author: Andrew Or <andrew@databricks.com>
Closes#13341 from andrewor14/alter-table-props.
## What changes were proposed in this pull request?
Two more changes:
(1) Fix truncate table for data source tables (only for cases without `PARTITION`)
(2) Disallow truncating external tables or views
## How was this patch tested?
`DDLSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#13315 from andrewor14/truncate-table.
## What changes were proposed in this pull request?
This PR changes SQLContext/HiveContext's public constructor to use SparkSession.build.getOrCreate and removes isRootContext from SQLContext.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#13310 from yhuai/SPARK-15532.
## What changes were proposed in this pull request?
This PR addresses two related issues:
1. `Dataset.showString()` should show case classes/Java beans at all levels as rows, while master code only handles top level ones.
2. `Dataset.showString()` should show the full contents produced by the underlying query plan
Dataset is only a view of the underlying query plan. Columns not referred by the encoder are still reachable using methods like `Dataset.col`. So it probably makes more sense to show full contents of the query plan.
## How was this patch tested?
Two new test cases are added in `DatasetSuite` to check `.showString()` output.
Author: Cheng Lian <lian@databricks.com>
Closes#13331 from liancheng/spark-15550-ds-show.
## What changes were proposed in this pull request?
SparkSession has a list of unnecessary private[sql] methods. These methods cause some trouble because private[sql] doesn't apply in Java. In the cases that they are easy to remove, we can simply remove them. This patch does that.
As part of this pull request, I also replaced a bunch of protected[sql] with private[sql], to tighten up visibility.
## How was this patch tested?
Updated test cases to reflect the changes.
Author: Reynold Xin <rxin@databricks.com>
Closes#13319 from rxin/SPARK-15552.
## What changes were proposed in this pull request?
Same as #13302, but for DROP TABLE.
## How was this patch tested?
`DDLSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#13307 from andrewor14/drop-table.
## What changes were proposed in this pull request?
This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names.
They are now named:
- LibSVMFileFormat
- CSVFileFormat
- JdbcRelationProvider
- JsonFileFormat
- ParquetFileFormat
- TextFileFormat
Backward compatibility is maintained through aliasing.
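The aliasing can be sketched as a name-resolution map. The two entries below are examples chosen for illustration, not the exact alias table in Spark:

```scala
// Illustrative sketch: old "DefaultSource" class names resolve to the
// new self-describing names; unknown names pass through unchanged.
val aliases = Map(
  "org.apache.spark.sql.execution.datasources.json.DefaultSource" ->
    "org.apache.spark.sql.execution.datasources.json.JsonFileFormat",
  "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" ->
    "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat")

def resolveProvider(name: String): String = aliases.getOrElse(name, name)
```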
## How was this patch tested?
Updated relevant test cases too.
Author: Reynold Xin <rxin@databricks.com>
Closes#13311 from rxin/SPARK-15543.
## What changes were proposed in this pull request?
This patch deprecates `Dataset.explode` and documents appropriate workarounds to use `flatMap()` or `functions.explode()` instead.
## How was this patch tested?
N/A
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13312 from sameeragarwal/deprecate.
## What changes were proposed in this pull request?
Two changes:
- When things fail, `TRUNCATE TABLE` just returns nothing. Instead, we should throw exceptions.
- Remove `TRUNCATE TABLE ... COLUMN`, which was never supported by either Spark or Hive.
## How was this patch tested?
Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes#13302 from andrewor14/truncate-table.
## What changes were proposed in this pull request?
Extra strategies do not work for streams because `IncrementalExecution` uses a modified planner with stateful operations, but that planner does not include the extra strategies.
This PR fixes `IncrementalExecution` to include extra strategies so they are used.
## How was this patch tested?
I added a test to check if extra strategies work for streams.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#13261 from ueshin/issues/SPARK-15483.
Fixed typos in source code for the [mllib], [streaming], and [SQL] components.
No tests; the changes are trivial and obvious.
Author: lfzCarlosC <lfz.carlos@gmail.com>
Closes#13298 from lfzCarlosC/master.
## What changes were proposed in this pull request?
Override the existing SparkContext if the provided SparkConf is different. The PySpark part hasn't been fixed yet; I will do that after the first round of review to ensure this is the correct approach.
## How was this patch tested?
Manually verify it in spark-shell.
rxin Please help review it, I think this is a very critical issue for spark 2.0
Author: Jeff Zhang <zjffdu@apache.org>
Closes#13160 from zjffdu/SPARK-15345.
## What changes were proposed in this pull request?
This patch removes the last two commands defined in the catalyst module: DescribeFunction and ShowFunctions. They were unnecessary since the parser could just generate DescribeFunctionCommand and ShowFunctionsCommand directly.
## How was this patch tested?
Created a new SparkSqlParserSuite.
Author: Reynold Xin <rxin@databricks.com>
Closes#13292 from rxin/SPARK-15436.
## What changes were proposed in this pull request?
Currently, if a table is used in a join operation, we rely on the size returned by the metastore to decide whether we can convert the operation to a broadcast join. This optimization only kicks in for tables that have statistics available in the metastore. Hive generally falls back to HDFS if the statistics are not available directly from the metastore, and this seems like a reasonable choice to adopt given the optimization benefit of using broadcast joins.
## How was this patch tested?
I have executed queries locally to test.
Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
Closes#13150 from Parth-Brahmbhatt/SPARK-15365.
## What changes were proposed in this pull request?
Relying on the schema being inferred in file streams can break easily, for multiple reasons:
- accidentally running on a directory which has no data
- schema changing underneath
- on restart, the query will infer the schema again, and may unexpectedly infer an incorrect schema, as the files in the directory may be different at the time of the restart.
To avoid these complicated scenarios, for Spark 2.0 we are going to disable schema inference by default behind a config, so that the user is forced to explicitly consider what schema they want, rather than the system trying to infer it and running into weird corner cases.
In this PR, I introduce a SQLConf that determines whether schema inference for file streams is allowed or not. It is disabled by default.
## How was this patch tested?
Updated unit tests that test error behavior with and without schema inference enabled.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13238 from tdas/SPARK-15458.
## What changes were proposed in this pull request?
Jackson supports the `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity", and "INF". The currently used Jackson version (2.5.3) doesn't support it at all. This patch upgrades the library and makes the two ignored tests in `JsonParsingOptionsSuite` pass.
## How was this patch tested?
`JsonParsingOptionsSuite`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#9759 from viirya/fix-json-nonnumric.
## What changes were proposed in this pull request?
Currently the command `ADD FILE|JAR <filepath | jarpath>` is supported natively in Spark SQL. However, once this command is run, the file/jar is added to resources that cannot be looked up by a `LIST FILE(s)|JAR(s)` command, because the `LIST` command is passed to the Hive command processor in Spark-SQL or is simply not supported in Spark-shell. There is no way for users to find out what files/jars have been added to the Spark context.
Refer to [Hive commands](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli)
This PR is to support following commands:
`LIST (FILE[s] [filepath ...] | JAR[s] [jarfile ...])`
### For example:
##### LIST FILE(s)
```
scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt")
res1: org.apache.spark.sql.DataFrame = []
scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt")
res2: org.apache.spark.sql.DataFrame = []
scala> spark.sql("list file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt").show(false)
+----------------------------------------------+
|result |
+----------------------------------------------+
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
+----------------------------------------------+
scala> spark.sql("list files").show(false)
+----------------------------------------------+
|result |
+----------------------------------------------+
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt |
+----------------------------------------------+
```
##### LIST JAR(s)
```
scala> spark.sql("add jar /Users/xinwu/spark/core/src/test/resources/TestUDTF.jar")
res9: org.apache.spark.sql.DataFrame = [result: int]
scala> spark.sql("list jar TestUDTF.jar").show(false)
+---------------------------------------------+
|result |
+---------------------------------------------+
|spark://192.168.1.234:50131/jars/TestUDTF.jar|
+---------------------------------------------+
scala> spark.sql("list jars").show(false)
+---------------------------------------------+
|result |
+---------------------------------------------+
|spark://192.168.1.234:50131/jars/TestUDTF.jar|
+---------------------------------------------+
```
## How was this patch tested?
New test cases are added for Spark-SQL, Spark-Shell and SparkContext API code path.
Author: Xin Wu <xinwu@us.ibm.com>
Author: xin Wu <xinwu@us.ibm.com>
Closes#13212 from xwu0226/list_command.
## What changes were proposed in this pull request?
Adds error handling to the CSV writer for unsupported complex data types. Currently garbage gets written to the output csv files if the data frame schema has complex data types.
## How was this patch tested?
Added new unit test case.
Author: sureshthalamati <suresh.thalamati@gmail.com>
Closes#13105 from sureshthalamati/csv_complex_types_SPARK-15315.
## What changes were proposed in this pull request?
Spark assumes that UDF functions are deterministic. This PR adds explicit notes about that.
## How was this patch tested?
It's only about docs.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13087 from dongjoon-hyun/SPARK-15282.
## What changes were proposed in this pull request?
The user may do something like:
```
CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS PARQUET
CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS ... SERDE 'myserde'
CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ORC
CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ... SERDE 'myserde'
```
None of these should be allowed because the SerDes conflict. As of this patch:
- `ROW FORMAT DELIMITED` is only compatible with `TEXTFILE`
- `ROW FORMAT SERDE` is only compatible with `TEXTFILE`, `RCFILE` and `SEQUENCEFILE`
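The two rules above can be sketched as a small validation function. The types here are hypothetical, not the parser's actual AST classes:

```scala
// Minimal sketch of the ROW FORMAT / STORED AS compatibility rules.
sealed trait RowFormat
case object Delimited extends RowFormat
case object SerDe extends RowFormat

def isCompatible(rowFormat: RowFormat, storedAs: String): Boolean =
  rowFormat match {
    case Delimited => storedAs == "TEXTFILE"
    case SerDe     => Set("TEXTFILE", "RCFILE", "SEQUENCEFILE").contains(storedAs)
  }
```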
## How was this patch tested?
New tests in `DDLCommandSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#13068 from andrewor14/row-format-conflict.
## What changes were proposed in this pull request?
Currently we create a CSVWriter for every row, which is very expensive and memory hungry; it took about 15 seconds to write out 1 million rows (two columns).
This PR writes the rows in batch mode, creating a CSVWriter for every 1k rows, which can write out 1 million rows in about 1 second (15X faster).
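The batching idea can be sketched as follows; `emit` stands in for the CSV library's write call, and the helper name is made up for illustration:

```scala
// Group rows into chunks of `batchSize` and hand each chunk to the
// writer in one call, instead of one call per row.
def writeBatched(rows: Iterator[Seq[String]], batchSize: Int = 1000)
                (emit: Seq[String] => Unit): Int = {
  var batches = 0
  rows.grouped(batchSize).foreach { batch =>
    emit(batch.map(_.mkString(",")))  // format a whole chunk at once
    batches += 1
  }
  batches
}
```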
## How was this patch tested?
Manually benchmark it.
Author: Davies Liu <davies@databricks.com>
Closes#13229 from davies/csv_writer.
## What changes were proposed in this pull request?
In order to prevent users from inadvertently writing queries with cartesian joins, this patch introduces a new conf `spark.sql.crossJoin.enabled` (set to `false` by default); unless it is enabled, a `SparkException` is thrown if the query contains one or more cartesian products.
## How was this patch tested?
Added a test to verify the new behavior in `JoinSuite`. Additionally, `SQLQuerySuite` and `SQLMetricsSuite` were modified to explicitly enable cartesian products.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13209 from sameeragarwal/disallow-cartesian.
## What changes were proposed in this pull request?
This patch simplifies the implementation of the Range operator and makes the explain string consistent between the logical plan and the physical plan. To do this, I changed RangeExec to embed a Range logical plan in it.
Before this patch (note that the logical Range and physical Range actually output different information):
```
== Optimized Logical Plan ==
Range 0, 100, 2, 2, [id#8L]
== Physical Plan ==
*Range 0, 2, 2, 50, [id#8L]
```
After this patch:
If step size is 1:
```
== Optimized Logical Plan ==
Range(0, 100, splits=2)
== Physical Plan ==
*Range(0, 100, splits=2)
```
If step size is not 1:
```
== Optimized Logical Plan ==
Range (0, 100, step=2, splits=2)
== Physical Plan ==
*Range (0, 100, step=2, splits=2)
```
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13239 from rxin/SPARK-15459.
#### What changes were proposed in this pull request?
When there are duplicate keys in the partition specs or table properties, we always use the last value and ignore all the previous values. This is caused by the call to `toMap`.
Partition specs and table properties are widely used in multiple DDL statements.
This PR is to detect the duplicates and issue an exception if found.
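The check can be sketched in plain Scala (an illustration, not the actual parser code):

```scala
// Surface keys that occur more than once before `toMap` silently keeps
// only the last value for each key.
def checkDuplicateKeys(pairs: Seq[(String, String)]): Map[String, String] = {
  val dups = pairs.groupBy(_._1).collect { case (k, vs) if vs.size > 1 => k }
  if (dups.nonEmpty) {
    throw new IllegalArgumentException(s"Found duplicate keys: ${dups.mkString(", ")}")
  }
  pairs.toMap
}
```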
#### How was this patch tested?
Added test cases in DDLSuite
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13095 from gatorsmile/detectDuplicate.
## What changes were proposed in this pull request?
This PR makes BroadcastHint more deterministic by using a special isBroadcastable property
instead of setting the sizeInBytes to 1.
See https://issues.apache.org/jira/browse/SPARK-15415
## How was this patch tested?
Added testcases to test if the broadcast hash join is included in the plan when the BroadcastHint is supplied and also tests for propagation of the joins.
Author: Jurriaan Pruis <email@jurriaanpruis.nl>
Closes#13244 from jurriaan/broadcast-hint.
#### What changes were proposed in this pull request?
Like the `SET` command, `RESET` is also supported by Hive. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-3202
This PR is to implement such a command for resetting the SQL-related configuration to the default values. One of the use case shown in HIVE-3202 is listed below:
> For the purpose of optimization we set various configs per query. It's worthy but all those configs should be reset every time for next query.
#### How was this patch tested?
Added a test case.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13121 from gatorsmile/resetCommand.
## What changes were proposed in this pull request?
The Aggregator API was introduced in 2.0 for Dataset. All typed Dataset APIs should still be marked as experimental in 2.0.
## How was this patch tested?
N/A - annotation only change.
Author: Reynold Xin <rxin@databricks.com>
Closes#13226 from rxin/SPARK-15452.
## What changes were proposed in this pull request?
Generate a shorter default alias for `AggregateExpression`. In this PR, the aggregate function name along with an index is used to generate the alias name.
```SQL
val ds = Seq(1, 3, 2, 5).toDS()
ds.select(typed.sum((i: Int) => i), typed.avg((i: Int) => i)).show()
```
Output before change.
```SQL
+-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
|typedsumdouble(unresolveddeserializer(upcast(input[0, int], IntegerType, - root class: "scala.Int"), value#1), upcast(value))|typedaverage(unresolveddeserializer(upcast(input[0, int], IntegerType, - root class: "scala.Int"), value#1), newInstance(class scala.Tuple2))|
+-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
| 11.0| 2.75|
+-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
```
Output after change:
```SQL
+-----------------+---------------+
|typedsumdouble_c1|typedaverage_c2|
+-----------------+---------------+
| 11.0| 2.75|
+-----------------+---------------+
```
Note: There is one test in ParquetSuites.scala which shows that the system-picked alias name is not usable and is rejected. [test](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala#L672-#L687)
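The naming scheme visible in the output above can be sketched as a one-line helper; the function name is made up for illustration:

```scala
// Shorter alias: the lower-cased aggregate function name plus a
// 1-based column index, matching e.g. "typedsumdouble_c1" above.
def generateAlias(funcName: String, columnIndex: Int): String =
  s"${funcName.toLowerCase}_c$columnIndex"
```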
## How was this patch tested?
A new test was added in DataSetAggregatorSuite.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#13045 from dilipbiswal/spark-15114.
## What changes were proposed in this pull request?
Many other systems (e.g. Impala) use `_xxx` directories as staging, and Spark should not be reading those files.
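The filter amounts to a simple name check; this is a sketch of the rule, not Spark's actual path-filter class:

```scala
// Skip names treated as hidden or staging output
// (leading underscore or dot).
def isDataPath(fileName: String): Boolean =
  !fileName.startsWith("_") && !fileName.startsWith(".")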
## How was this patch tested?
Added a unit test case.
Author: Reynold Xin <rxin@databricks.com>
Closes#13227 from rxin/SPARK-15454.
## What changes were proposed in this pull request?
Currently, the explain of a query with whole-stage codegen looks like this
```
>>> df = sqlCtx.range(1000);df2 = sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 'id').explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [id#1L]
: +- BroadcastHashJoin [id#1L], [id#4L], Inner, BuildRight, None
: :- Range 0, 1, 4, 1000, [id#1L]
: +- INPUT
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
+- WholeStageCodegen
: +- Range 0, 1, 4, 1000, [id#4L]
```
The problem is that the plan looks much different from the logical plan, making it hard to understand (especially when the logical plan is not shown together with it).
This PR will change it to:
```
>>> df = sqlCtx.range(1000);df2 = sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 'id').explain()
== Physical Plan ==
*Project [id#0L]
+- *BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight, None
:- *Range 0, 1, 4, 1000, [id#0L]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
+- *Range 0, 1, 4, 1000, [id#3L]
```
The `*` before an operator means that it is part of whole-stage codegen, which is easy to understand.
## How was this patch tested?
Manually ran some queries and checked the explain output.
Author: Davies Liu <davies@databricks.com>
Closes#13204 from davies/explain_codegen.
This reverts commit 8d05a7a from #12855, which seems to have caused regressions when working with empty DataFrames.
Author: Michael Armbrust <michael@databricks.com>
Closes#13181 from marmbrus/revert12855.
## What changes were proposed in this pull request?
This PR introduces place holders for comments in generated code; the purpose is the same as #12939 but this approach is much safer.
Generated code to be compiled doesn't include actual comments but includes place holders instead.
Place holders in generated code are replaced with actual comments only at the time of logging.
Also, this PR can resolve SPARK-15205.
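The place-holder idea can be sketched as follows. The marker syntax and helper names are illustrative, not Spark's CodegenContext API:

```scala
// Generated code carries short markers like /*c0*/; the marker-to-comment
// map is applied only when the source is logged.
val placeholder = """/\*c(\d+)\*/""".r

def registerComments(comments: Seq[String]): Map[String, String] =
  comments.zipWithIndex.map { case (c, i) => (s"c$i", s"// $c") }.toMap

def expandForLogging(code: String, comments: Map[String, String]): String =
  placeholder.replaceAllIn(code, m => comments(s"c${m.group(1)}"))
```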
## How was this patch tested?
Existing tests.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#12979 from sarutak/SPARK-15205.
## What changes were proposed in this pull request?
We started this convention to append Command suffix to all SQL commands. However, not all commands follow that convention. This patch adds Command suffix to all RunnableCommands.
## How was this patch tested?
Updated test cases to reflect the renames.
Author: Reynold Xin <rxin@databricks.com>
Closes#13215 from rxin/SPARK-15435.
## What changes were proposed in this pull request?
When we parse DDLs involving table or database properties, we need to validate the values.
E.g. if we alter a database's property without providing a value:
```
ALTER DATABASE my_db SET DBPROPERTIES('some_key')
```
Then we'll ignore it with Hive, but override the property with the in-memory catalog. Inconsistencies like these arise because we don't validate the property values.
In such cases, we should throw exceptions instead.
## How was this patch tested?
`DDLCommandSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#13205 from andrewor14/ddl-prop-values.
#### What changes were proposed in this pull request?
`refreshTable` was a method in `HiveContext`. It was deleted accidentally while we were migrating the APIs. This PR is to add it back to `HiveContext`.
In addition, in `SparkSession`, we put it under the catalog namespace (`SparkSession.catalog.refreshTable`).
#### How was this patch tested?
Changed the existing test cases to use the function `refreshTable`. Also added a test case for refreshTable in `hivecontext-compatibility`
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13156 from gatorsmile/refreshTable.
## What changes were proposed in this pull request?
TRUNCATE TABLE is supported by Hive. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-446
This PR implements such a command for truncating tables, excluding column truncation (HIVE-4005).
## How was this patch tested?
Added a test case.
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Closes#13170 from lianhuiwang/truncate.
## What changes were proposed in this pull request?
Currently SparkSession.Builder uses SQLContext.getOrCreate. It should probably be the other way around, i.e. all the core logic goes in SparkSession, and SQLContext just calls that. This patch does that.
This patch also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession.
## How was this patch tested?
Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches.
Author: Reynold Xin <rxin@databricks.com>
Closes#13200 from rxin/SPARK-15075.
## What changes were proposed in this pull request?
If we find a `NoClassDefFoundError` or `ClassNotFoundException`, check whether the class name was removed in Spark 2.0. If so, the user must be using an incompatible library and we can provide a better message.
## How was this patch tested?
1. Run `bin/pyspark --packages com.databricks:spark-avro_2.10:2.0.1`
2. type `sqlContext.read.format("com.databricks.spark.avro").load("src/test/resources/episodes.avro")`.
It will show `java.lang.ClassNotFoundException: org.apache.spark.sql.sources.HadoopFsRelationProvider is removed in Spark 2.0. Please check if your library is compatible with Spark 2.0`
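The mapping can be sketched like this; the set below is a one-entry hypothetical subset of the class names removed in Spark 2.0:

```scala
// Map a failing class name to a friendlier message when the class is
// known to have been removed in Spark 2.0.
val removedClasses = Set(
  "org.apache.spark.sql.sources.HadoopFsRelationProvider")

def betterMessage(className: String): Option[String] =
  if (removedClasses.contains(className)) {
    Some(s"$className is removed in Spark 2.0. " +
      "Please check if your library is compatible with Spark 2.0")
  } else None
```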
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13201 from zsxwing/better-message.
## What changes were proposed in this pull request?
Add ConsoleSink to structured streaming; users can use it to display dataframes on the console (useful for debugging and demonstrating), similar to the functionality of `DStream#print`. To use it:
```
val query = result.write
.format("console")
.trigger(ProcessingTime("2 seconds"))
.startStream()
```
## How was this patch tested?
local verified.
I'm not sure it is suitable to add into structured streaming; please review and help to comment, thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes#13162 from jerryshao/SPARK-15375.
## What changes were proposed in this pull request?
We use autoBroadcastJoinThreshold + 1L as the default value of the size estimation. That is not good in 2.0, because we now calculate the size based on the size of the schema, so the estimation could be less than autoBroadcastJoinThreshold if you have a SELECT on top of a DataFrame created from an RDD.
This PR changes the default value to Long.MaxValue.
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#13183 from davies/fix_default_size.
## What changes were proposed in this pull request?
In general, the Web UI doesn't need to store the Accumulator/AccumulableInfo for every task. It only needs the Accumulator values.
In this PR, it creates new UIData classes to store the necessary fields and makes `JobProgressListener` store only these new classes, so that `JobProgressListener` won't store Accumulator/AccumulableInfo and its size becomes pretty small. I also eliminated `AccumulableInfo` from `SQLListener` so that we don't keep any references to those unused `AccumulableInfo`s.
## How was this patch tested?
I ran two tests reported in JIRA locally:
The first one is:
```
val data = spark.range(0, 10000, 1, 10000)
data.cache().count()
```
The retained size of JobProgressListener decreases from 60.7M to 6.9M.
The second one is:
```
import org.apache.spark.ml.CC
import org.apache.spark.sql.SQLContext
val sqlContext = SQLContext.getOrCreate(sc)
CC.runTest(sqlContext)
```
This test won't cause OOM after applying this patch.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13153 from zsxwing/memory.
## What changes were proposed in this pull request?
This PR is a follow-up of #13079. It replaces `hasUnsupportedFeatures: Boolean` in `CatalogTable` with `unsupportedFeatures: Seq[String]`, which contains unsupported Hive features of the underlying Hive table. In this way, we can accurately report all unsupported Hive features in the exception message.
## How was this patch tested?
Updated existing test case to check exception message.
Author: Cheng Lian <lian@databricks.com>
Closes#13173 from liancheng/spark-14346-follow-up.
## What changes were proposed in this pull request?
This PR corrects another case that uses deprecated `accumulableCollection` to use `listAccumulator`, which seems the previous PR missed.
Since `ArrayBuffer[InternalRow].asJava` is `java.util.List[InternalRow]`, it seems ok to replace the usage.
## How was this patch tested?
Related existing tests `InMemoryColumnarQuerySuite` and `CachedTableSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13187 from HyukjinKwon/SPARK-15322.
## What changes were proposed in this pull request?
When broadcasting a table with more than 100 million rows (which ideally should not be done), the computation of the needed memory size will overflow.
This PR fixes the overflow by converting the value to Long when calculating the memory size.
It also adds more checking in broadcast to show reasonable messages.
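The overflow is easy to reproduce in plain Scala; the row count and per-row size below are made-up numbers for illustration:

```scala
// Sizing memory with Int arithmetic wraps past 2^31 - 1; widening one
// operand to Long before multiplying keeps the result exact.
val rows = 150000000          // 150 million rows
val bytesPerRow = 48
val overflowed = rows * bytesPerRow      // Int multiply wraps to a negative value
val exact = rows.toLong * bytesPerRow    // widen first: 7,200,000,000 bytes
```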
## How was this patch tested?
Add test.
Author: Davies Liu <davies@databricks.com>
Closes#13182 from davies/fix_broadcast.
## What changes were proposed in this pull request?
Whole-stage codegen depends on `SparkPlan.reference` to do some optimization. Physical object operators should be consistent with their logical versions and set the `reference` correctly.
## How was this patch tested?
new test in DatasetSuite
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13167 from cloud-fan/bug.
## What changes were proposed in this pull request?
This patch is a follow-up to https://github.com/apache/spark/pull/13104 and adds documentation to clarify the semantics of read.text with respect to partitioning.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13184 from rxin/SPARK-14463.
#### What changes were proposed in this pull request?
The command `SET -v` always outputs the default values even if we have set the parameter. This behavior is incorrect; instead, if users override a parameter, we should output the user-specified value.
In addition, the output schema of `SET -v` is wrong. We should use the column `value` instead of `default` for the parameter value.
This PR is to fix the above two issues.
#### How was this patch tested?
Added a test case.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13081 from gatorsmile/setVcommand.
## What changes were proposed in this pull request?
This PR adds null check in `SparkSession.createDataFrame`, so that we can make sure the passed in rows matches the given schema.
## How was this patch tested?
new tests in `DatasetSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13008 from cloud-fan/row-encoder.
https://issues.apache.org/jira/browse/SPARK-15323
I was using partitioned text datasets in Spark 1.6.1 but it broke in Spark 2.0.0.
It would be logical if you could also write those, but I'm not entirely sure how to solve this with the new Dataset implementation.
Also, it doesn't work using `sqlContext.read.text`, since that method returns a `Dataset[String]`.
See https://issues.apache.org/jira/browse/SPARK-14463 for that issue.
Author: Jurriaan Pruis <email@jurriaanpruis.nl>
Closes#13104 from jurriaan/fix-partitioned-text-reads.
## What changes were proposed in this pull request?
We use autoBroadcastJoinThreshold + 1L as the default value of the size estimation. That is not good in 2.0, because we now calculate the size based on the size of the schema, so the estimation could be less than autoBroadcastJoinThreshold if you have a SELECT on top of a DataFrame created from an RDD.
This PR changes the default value to Long.MaxValue.
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#13179 from davies/fix_default_size.
## What changes were proposed in this pull request?
Since we support forced spilling for Spillable, which only works in on-heap mode, unlike other SQL operators (which can be on-heap or off-heap), we should consider the mode of the consumer before triggering forced spilling.
## How was this patch tested?
Add new test.
Author: Davies Liu <davies@databricks.com>
Closes#13151 from davies/fix_mode.
## What changes were proposed in this pull request?
Currently, listing files is very slow when there are thousands of files, especially on the local file system, because:
1) FileStatus.getPermission() is very slow on the local file system: it launches a subprocess and parses the stdout.
2) Creating a JobConf is very expensive (ClassUtil.findContainingJar() is slow).
This PR improves these by:
1) Using another constructor of LocatedFileStatus to avoid calling FileStatus.getPermission; the permissions are not used for data sources.
2) Only creating a JobConf once within one task.
## How was this patch tested?
Manually tested on a partitioned table with 1828 partitions; the time to load the table decreased from 22 seconds to 1.6 seconds (most of the time is now spent merging schemas).
Author: Davies Liu <davies@databricks.com>
Closes#13094 from davies/listing.
## What changes were proposed in this pull request?
Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.
## How was this patch tested?
This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13098 from clockfly/spark-15171-remove-deprecation.
## What changes were proposed in this pull request?
This is a follow-up of #12781. It adds native `SHOW CREATE TABLE` support for Hive tables and views. A new field `hasUnsupportedFeatures` is added to `CatalogTable` to indicate whether all table metadata retrieved from the concrete underlying external catalog (i.e. Hive metastore in this case) can be mapped to fields in `CatalogTable`. This flag is useful when the target Hive table contains structures that can't be handled by Spark SQL, e.g., skewed columns and storage handler, etc..
## How was this patch tested?
New test cases are added in `ShowCreateTableSuite` to do round-trip tests.
Author: Cheng Lian <lian@databricks.com>
Closes#13079 from liancheng/spark-14346-show-create-table-for-hive-tables.
## What changes were proposed in this pull request?
Add a check in the constructor of SQLContext/SparkSession to make sure its SparkContext is not stopped.
## How was this patch tested?
Jenkins unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13154 from zsxwing/check-spark-context-stop.
## What changes were proposed in this pull request?
According to the recent change, this PR replaces all the remaining `sqlContext` usage with `spark` in ScalaDoc/JavaDoc (.scala/.java files) except `SQLContext.scala`, `SparkPlan.scala', and `DatasetHolder.scala`.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13125 from dongjoon-hyun/minor_doc_sparksession.
## What changes were proposed in this pull request?
Currently, `INSERT INTO` with a `GROUP BY` query tries to make at least 200 files (the default value of `spark.sql.shuffle.partitions`), which results in lots of empty files.
This PR makes it avoid creating empty files during overwriting into Hive table and in internal data sources with group by query.
This checks whether the given partition has data in it or not and creates/writes file only when it actually has data.
## How was this patch tested?
Unittests in `InsertIntoHiveTableSuite` and `HadoopFsRelationTest`.
Closes#8411
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Keuntae Park <sirpkt@apache.org>
Closes#12855 from HyukjinKwon/pr/8411.
## What changes were proposed in this pull request?
https://github.com/apache/spark/pull/12781 introduced PARTITIONED BY, CLUSTERED BY, and SORTED BY keywords to CREATE TABLE USING. This PR adds tests to make sure those keywords are handled correctly.
This PR also fixes a mistake that we should create non-hive-compatible table if partition or bucket info exists.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13144 from cloud-fan/add-test.
## What changes were proposed in this pull request?
"DESCRIBE table" is broken when table schema is stored at key "spark.sql.sources.schema".
Originally, we used spark.sql.sources.schema to store the schema of a data source table.
After SPARK-6024, we removed this flag. Although we are not using spark.sql.sources.schema any more, we still need to support it.
## How was this patch tested?
Unit test.
When using spark2.0 to load a table generated by spark 1.2.
Before change:
`DESCRIBE table` => Schema of this table is inferred at runtime,,
After change:
`DESCRIBE table` => correct output.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13073 from clockfly/spark-15253.
## What changes were proposed in this pull request?
1. Rename matrix args in BreezeUtil to upper case to match the doc
2. Fix several typos in ML and SQL
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13078 from zhengruifeng/fix_ann.
## What changes were proposed in this pull request?
Was trying out `SparkSession` for the first time and the given class doc (when copied as is) did not work over Spark shell:
```
scala> SparkSession.builder().master("local").appName("Word Count").getOrCreate()
<console>:27: error: org.apache.spark.sql.SparkSession.Builder does not take parameters
SparkSession.builder().master("local").appName("Word Count").getOrCreate()
```
Adding () to the builder method in SparkSession.
## How was this patch tested?
```
scala> SparkSession.builder().master("local").appName("Word Count").getOrCreate()
res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@65c17e38
scala> SparkSession.builder.master("local").appName("Word Count").getOrCreate()
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@65c17e38
```
Author: Tejas Patil <tejasp@fb.com>
Closes#13086 from tejasapatil/doc_correction.
## What changes were proposed in this pull request?
Currently, the Parquet, JSON and CSV data sources have a class for their options (`ParquetOptions`, `JSONOptions` and `CSVOptions`).
It is convenient to gather the options for a source into a class. Currently, the `JDBC`, `Text`, `libsvm` and `ORC` data sources do not have such a class; it would be nicer if these options were in a unified format so that options can be added and maintained consistently.
This PR refactors the options in Spark's internal data sources, adding the new classes `OrcOptions`, `TextOptions`, `JDBCOptions` and `LibSVMOptions`.
Also, this PR changes the default compression codec for ORC from `NONE` to `SNAPPY`.
## How was this patch tested?
Existing tests should cover this for refactoring and unittests in `OrcHadoopFsRelationSuite` for changing the default compression codec for ORC.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13048 from HyukjinKwon/SPARK-15267.
## What changes were proposed in this pull request?
We originally designed the type coercion rules to match Hive, but over time we have diverged. It does not make sense to call it HiveTypeCoercion anymore. This patch renames it TypeCoercion.
## How was this patch tested?
Updated unit tests to reflect the rename.
Author: Reynold Xin <rxin@databricks.com>
Closes#13091 from rxin/SPARK-15310.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13866
This PR adds the support to infer `DecimalType`.
Here are the rules between `IntegerType`, `LongType` and `DecimalType`.
#### Inferring Types
1. `IntegerType` and then `LongType` are tried first.
```scala
Int.MaxValue => IntegerType
Long.MaxValue => LongType
```
2. If it fails, try `DecimalType`.
```scala
(Long.MaxValue + 1) => DecimalType(20, 0)
```
This does not infer `DecimalType` when the scale is less than 0.
3. If that fails, try `DoubleType`.
```scala
0.1 => DoubleType // This fails to be inferred as `DecimalType` because it has a scale of 1.
```
#### Compatible Types (Merging Types)
For merging types, this is the same as the JSON data source: if `DecimalType` cannot represent the value, it becomes `DoubleType`.
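The fallback chain above can be sketched in Python (a simplified, hypothetical illustration of the inference order — not the actual CSV inference code, and the real rule also tracks the decimal's precision and scale):

```python
from decimal import Decimal, InvalidOperation

INT_MAX, LONG_MAX = 2**31 - 1, 2**63 - 1

def infer_numeric_type(s):
    # 1. Try IntegerType first, then LongType.
    try:
        v = int(s)
        if -INT_MAX - 1 <= v <= INT_MAX:
            return "IntegerType"
        if -LONG_MAX - 1 <= v <= LONG_MAX:
            return "LongType"
        # 2. Too large for LongType: fall back to DecimalType.
        return "DecimalType"
    except ValueError:
        pass
    try:
        # 3. Fractional values fall through to DoubleType, since only
        #    scale-0 values are inferred as DecimalType here.
        Decimal(s)
        return "DoubleType"
    except InvalidOperation:
        return "StringType"
```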
## How was this patch tested?
Unit tests were used and `./dev/run_tests` for code style test.
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Closes#11724 from HyukjinKwon/SPARK-13866.
## What changes were proposed in this pull request?
This patch moves all the object related expressions into expressions.objects package, for better code organization.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13085 from rxin/SPARK-15306.
## What changes were proposed in this pull request?
We currently use the Hive implementations for the collect_list/collect_set aggregate functions. This has a few major drawbacks: the use of HiveUDAF (which has quite a bit of overhead) and the lack of support for struct datatypes. This PR adds native implementation of these functions to Spark.
The size of the collected list/set may vary, which means we cannot use the fast Tungsten aggregation path and must fall back to the slower sort-based path. Another big issue with these operators is that when the size of the collected list/set grows too large, we can start experiencing long GC pauses and OOMEs.
The `collect*` aggregates implemented in this PR rely on the sort-based aggregate path for correctness. They maintain their own internal buffer which holds the rows for one group at a time. The sort-based aggregation path is triggered by disabling `partialAggregation` for these aggregates (which is kinda funny); this technique is also employed in `org.apache.spark.sql.hive.HiveUDAFFunction`.
I have done some performance testing:
```scala
import org.apache.spark.sql.{Dataset, Row}
sql("create function collect_list2 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList'")
val df = range(0, 10000000).select($"id", (rand(213123L) * 100000).cast("int").as("grp"))
df.select(countDistinct($"grp")).show
def benchmark(name: String, plan: Dataset[Row], maxItr: Int = 5): Unit = {
// Do not measure planning.
plan.queryExecution.executedPlan
// Execute the plan a number of times and average the result.
val start = System.nanoTime
var i = 0
while (i < maxItr) {
plan.rdd.foreach(row => Unit)
i += 1
}
val time = (System.nanoTime - start) / (maxItr * 1000000L)
println(s"[$name] $maxItr iterations completed in an average time of $time ms.")
}
val plan1 = df.groupBy($"grp").agg(collect_list($"id"))
val plan2 = df.groupBy($"grp").agg(callUDF("collect_list2", $"id"))
benchmark("Spark collect_list", plan1)
...
> [Spark collect_list] 5 iterations completed in an average time of 3371 ms.
benchmark("Hive collect_list", plan2)
...
> [Hive collect_list] 5 iterations completed in an average time of 9109 ms.
```
Performance is improved by a factor of 2-3.
## How was this patch tested?
Added tests to `DataFrameAggregateSuite`.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12874 from hvanhovell/implode.
## What changes were proposed in this pull request?
Deprecates registerTempTable and adds dataset.createTempView and dataset.createOrReplaceTempView.
## How was this patch tested?
Unit tests.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#12945 from clockfly/spark-15171.
## What changes were proposed in this pull request?
This PR adds a new rule to convert `SimpleCatalogRelation` to data source table if its table property contains data source information.
## How was this patch tested?
new test in SQLQuerySuite
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12935 from cloud-fan/ds-table.
## What changes were proposed in this pull request?
This PR adds native `SHOW CREATE TABLE` DDL command for data source tables. Support for Hive tables will be added in follow-up PR(s).
To show table creation DDL for data source tables created by CTAS statements, this PR also added partitioning and bucketing support for normal `CREATE TABLE ... USING ...` syntax.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
A new test suite `ShowCreateTableSuite` is added in sql/hive package to test the new feature.
Author: Cheng Lian <lian@databricks.com>
Closes#12781 from liancheng/spark-14346-show-create-table.
## What changes were proposed in this pull request?
Break copyAndReset into two methods copy and reset instead of just one.
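The split can be sketched with a toy accumulator in Python (illustrative names only — this is not Spark's actual accumulator class):

```python
class ListAccumulator:
    """Toy accumulator showing copyAndReset split into copy() and reset()."""

    def __init__(self, values=None):
        self._values = list(values) if values else []

    def add(self, v):
        self._values.append(v)

    def copy(self):
        # Return an independent accumulator holding the same values.
        return ListAccumulator(self._values)

    def reset(self):
        # Clear this accumulator in place.
        self._values.clear()

    def copy_and_reset(self):
        # The old single-step operation, now expressed via the two primitives.
        c = self.copy()
        self.reset()
        return c
```

Having `copy` and `reset` as independent primitives lets callers snapshot without clearing, or clear without copying.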
## How was this patch tested?
Existing Tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#12936 from techaddict/SPARK-15080.
## What changes were proposed in this pull request?
When a CSV begins with:
- `,,`
OR
- `"","",`
meaning that the first column names are either empty or blank strings and `header` is specified to be `true`, then the column name is replaced with `C` + the index number of that given column. For example, if you were to read in the CSV:
```
"","second column"
"hello", "there"
```
Then column names would become `"C0", "second column"`.
This behavior aligns with what currently happens when `header` is specified to be `false` in recent versions of Spark.
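The renaming rule can be sketched in Python (a hypothetical helper, not the actual CSV reader code):

```python
def normalize_header(names):
    # Empty or blank column names become "C" + column index,
    # matching the naming used when header=false.
    return [n if n and n.strip() else f"C{i}" for i, n in enumerate(names)]
```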
### Current Behavior in Spark <=1.6
In Spark <=1.6, a CSV with a blank column name becomes a blank string, `""`, meaning that this column cannot be accessed. However the CSV reads in without issue.
### Current Behavior in Spark 2.0
Spark throws a `NullPointerException` and will not read in the file.
#### Reproduction in 2.0
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2828750690305044/484361/latest.html
## How was this patch tested?
A new test was added to `CSVSuite` to account for this issue. We then have asserts that test for being able to select both the empty column names as well as the regular column names.
Author: Bill Chambers <bill@databricks.com>
Author: Bill Chambers <wchambers@ischool.berkeley.edu>
Closes#13041 from anabranch/master.
## What changes were proposed in this pull request?
Before:
```sql
-- uses that location but issues a warning
CREATE TABLE my_tab LOCATION /some/path
-- deletes any existing data in the specified location
DROP TABLE my_tab
```
After:
```sql
-- uses that location but creates an EXTERNAL table instead
CREATE TABLE my_tab LOCATION /some/path
-- does not delete the data at /some/path
DROP TABLE my_tab
```
This patch essentially makes the `EXTERNAL` field optional. This is related to #13032.
## How was this patch tested?
New test in `DDLCommandSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#13060 from andrewor14/location-implies-external.
## What changes were proposed in this pull request?
Before:
```sql
-- uses warehouse dir anyway
CREATE EXTERNAL TABLE my_tab
-- doesn't actually delete the data
DROP TABLE my_tab
```
After:
```sql
-- no location is provided, throws exception
CREATE EXTERNAL TABLE my_tab
-- creates an external table using that location
CREATE EXTERNAL TABLE my_tab LOCATION '/path/to/something'
-- doesn't delete the data, which is expected
DROP TABLE my_tab
```
## How was this patch tested?
New test in `DDLCommandSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#13032 from andrewor14/create-external-table-location.
Table partitions can be added with locations different from default warehouse location of a hive table.
`CREATE TABLE parquetTable (a int) PARTITIONED BY (b int) STORED AS parquet `
`ALTER TABLE parquetTable ADD PARTITION (b=1) LOCATION '/partition'`
Querying such a table throws an error because the MetastoreFileCatalog does not list the added partition directory; it only lists the default base location.
```
[info] - SPARK-15248: explicitly added partitions should be readable *** FAILED *** (1 second, 8 milliseconds)
[info] java.util.NoSuchElementException: key not found: file:/Users/tdas/Projects/Spark/spark2/target/tmp/spark-b39ad224-c5d1-4966-8981-fb45a2066d61/partition
[info] at scala.collection.MapLike$class.default(MapLike.scala:228)
[info] at scala.collection.AbstractMap.default(Map.scala:59)
[info] at scala.collection.MapLike$class.apply(MapLike.scala:141)
[info] at scala.collection.AbstractMap.apply(Map.scala:59)
[info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog$$anonfun$listFiles$1.apply(PartitioningAwareFileCatalog.scala:59)
[info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog$$anonfun$listFiles$1.apply(PartitioningAwareFileCatalog.scala:55)
[info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
[info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
[info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
[info] at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
[info] at scala.collection.AbstractTraversable.map(Traversable.scala:104)
[info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.listFiles(PartitioningAwareFileCatalog.scala:55)
[info] at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:93)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
[info] at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
[info] at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:60)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:55)
[info] at org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:55)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
[info] at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
[info] at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
[info] at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:60)
[info] at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:77)
[info] at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
[info] at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:82)
[info] at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:82)
[info] at org.apache.spark.sql.QueryTest.assertEmptyMissingInput(QueryTest.scala:330)
[info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:146)
[info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:159)
[info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7$$anonfun$apply$mcV$sp$25.apply(parquetSuites.scala:554)
[info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7$$anonfun$apply$mcV$sp$25.apply(parquetSuites.scala:535)
[info] at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:125)
[info] at org.apache.spark.sql.hive.ParquetPartitioningTest.withTempDir(parquetSuites.scala:726)
[info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7.apply$mcV$sp(parquetSuites.scala:535)
[info] at org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:166)
[info] at org.apache.spark.sql.hive.ParquetPartitioningTest.withTable(parquetSuites.scala:726)
[info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply$mcV$sp(parquetSuites.scala:534)
[info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply(parquetSuites.scala:534)
[info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply(parquetSuites.scala:534)
```
The solution in this PR is to get the paths to list from the partition spec rather than relying on the default table path alone.
unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13022 from tdas/SPARK-15248.
## What changes were proposed in this pull request?
After SPARK-14669 it seems the sort time metric includes both spill and record insertion time. This makes it not very useful since the metric becomes close to the total execution time of the node.
We should track just the time spent for in-memory sort, as before.
## How was this patch tested?
Verified metric in the UI, also unit test on UnsafeExternalRowSorter.
cc davies
Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>
Closes#13035 from ericl/fix-metrics.
## What changes were proposed in this pull request?
This PR adds documentation about the different behaviors of `insertInto` and `saveAsTable`, and throws an exception when the user tries to add too many columns using `saveAsTable` with append mode.
## How was this patch tested?
Unit tests added in this PR.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13013 from zsxwing/SPARK-15231.
## What changes were proposed in this pull request?
We use the tree string of a SparkPlan as the name of a cached DataFrame, which can be very long and cause the browser to become unresponsive. This PR limits the length of the name to 1000 characters.
## How was this patch tested?
Here is how the UI looks right now:
![ui](https://cloud.githubusercontent.com/assets/40902/15163355/d5640f9c-16bc-11e6-8655-809af8a4fed1.png)
Author: Davies Liu <davies@databricks.com>
Closes#13033 from davies/cache_name.
## What changes were proposed in this pull request?
This PR removes the old `json(path: String)` API which is covered by the new `json(paths: String*)`.
## How was this patch tested?
Jenkins tests (existing tests should cover this)
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Closes#13040 from HyukjinKwon/SPARK-15250.
## What changes were proposed in this pull request?
This patch removes experimental tag from DataFrameReader and DataFrameWriter, and explicitly tags a few methods added for structured streaming as experimental.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13038 from rxin/SPARK-15261.
## What changes were proposed in this pull request?
Currently, the file stream source can only find new files if they appear in the directory given to the source, but not if they appear in subdirectories. This PR adds support for providing glob patterns when creating a file stream source so that it can find new files in nested directories based on the glob pattern.
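The discovery logic can be sketched in Python (a simplified, hypothetical illustration using `fnmatch`; the real source resolves globs via Hadoop's FileSystem API and tracks seen files with timestamps):

```python
import fnmatch

def find_new_files(all_paths, seen, pattern):
    # Return paths matching the glob pattern that have not been seen yet,
    # and remember them so the next batch only picks up genuinely new files.
    new = [p for p in all_paths if p not in seen and fnmatch.fnmatch(p, pattern)]
    seen.update(new)
    return new
```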
## How was this patch tested?
Unit test that tests when new files are discovered with globs and partitioned directories.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12616 from tdas/SPARK-14837.
## What changes were proposed in this pull request?
PR fixes the import issue which breaks udf functions.
The following code snippet throws an error
```
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
scala> udf((v: String) => v.stripSuffix("-abc"))
<console>:30: error: No TypeTag available for String
udf((v: String) => v.stripSuffix("-abc"))
```
This PR resolves the issue.
## How was this patch tested?
patch tested with unit tests.
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: Subhobrata Dey <sbcd90@gmail.com>
Closes#12458 from sbcd90/udfFuncBreak.
Sending un-updated accumulators back to the driver makes no sense, as merging a zero-value accumulator is a no-op. We should only send back updated accumulators, to save network IO.
new test in `TaskContextSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12899 from cloud-fan/acc.
## What changes were proposed in this pull request?
This PR fixes SQL building for predicate subqueries and correlated scalar subqueries. It also enables most Hive subquery tests.
## How was this patch tested?
Enabled new tests in HiveComparisionSuite.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12988 from hvanhovell/SPARK-14773.
#### What changes were proposed in this pull request?
This PR is to address a few existing issues in `EXPLAIN`:
- The `EXPLAIN` options `LOGICAL | FORMATTED | EXTENDED | CODEGEN` should match zero or one time, not zero or more; the parser does not allow users to use more than one option in a single command.
- The option `LOGICAL` is not supported. Issue an exception when users specify this option in the command.
- The output of `EXPLAIN` contains a weird empty line when the output of the analyzed plan is empty. We should remove it. For example:
```
== Parsed Logical Plan ==
CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
== Analyzed Logical Plan ==
CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
== Optimized Logical Plan ==
CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
...
```
#### How was this patch tested?
Added and modified a few test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12991 from gatorsmile/explainCreateTable.
## What changes were proposed in this pull request?
Our case sensitivity support is different from what ANSI SQL standards support. Postgres' behavior is that if an identifier is quoted, then it is treated as case sensitive; otherwise it is folded to lowercase. We will likely need to revisit this in the future and change our behavior. For now, the safest change to do for Spark 2.0 is to make the case sensitive option internal and discourage users from turning it on, effectively making Spark always case insensitive.
## How was this patch tested?
N/A - a small config documentation change.
Author: Reynold Xin <rxin@databricks.com>
Closes#13011 from rxin/SPARK-15229.
## What changes were proposed in this pull request?
Before:
```
scala> spark.catalog.listDatabases.show()
+--------------------+-----------+-----------+
| name|description|locationUri|
+--------------------+-----------+-----------+
|Database[name='de...|
|Database[name='my...|
|Database[name='so...|
+--------------------+-----------+-----------+
```
After:
```
+-------+--------------------+--------------------+
| name| description| locationUri|
+-------+--------------------+--------------------+
|default|Default Hive data...|file:/user/hive/w...|
| my_db| This is a database|file:/Users/andre...|
|some_db| |file:/private/var...|
+-------+--------------------+--------------------+
```
## How was this patch tested?
New test in `CatalogSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#13015 from andrewor14/catalog-show.
## What changes were proposed in this pull request?
The issue is that when the user provides the path option with uppercase "PATH" key, `options` contains `PATH` key and will get into the non-external case in the following code in `createDataSourceTables.scala`, where a new key "path" is created with a default path.
```
val optionsWithPath =
if (!options.contains("path")) {
isExternal = false
options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
} else {
options
}
```
So before creating the Hive table, serdeInfo.parameters will contain both "PATH" and "path" keys pointing to different directories, and the Hive table's dataLocation contains the value of "path".
The fix in this PR is to convert `options` in the code above to `CaseInsensitiveMap` before checking for the "path" key.
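A minimal Python sketch of such a case-insensitive map (illustrative only; Spark's `CaseInsensitiveMap` is a Scala class):

```python
class CaseInsensitiveMap(dict):
    """Minimal sketch: lookups ignore key case, so 'PATH' and 'path' collide."""

    def __init__(self, data=None):
        super().__init__()
        for k, v in (data or {}).items():
            self[k] = v

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)

    def __getitem__(self, key):
        return super().__getitem__(key.lower())

    def __contains__(self, key):
        return super().__contains__(key.lower())
```

With this wrapper, `"path" in options` is true even when the user supplied `PATH`, so the default-path branch is skipped as intended.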
## How was this patch tested?
A testcase is added
Author: xin Wu <xinwu@us.ibm.com>
Closes#12804 from xwu0226/SPARK-15025.
This patch improves the performance of `InferSchema.compatibleType` and `inferField`. The net result of this patch is a 6x speedup in local benchmarks running against cached data with a massive nested schema.
The key idea is to remove unnecessary sorting in `compatibleType`'s `StructType` merging code. This code takes two structs, merges the fields with matching names, and copies over the unique fields, producing a new schema which is the union of the two structs' schemas. Previously, this code performed a very inefficient `groupBy()` to match up fields with the same name, but this is unnecessary because `inferField` already sorts structs' fields by name: since both lists of fields are sorted, we can simply merge them in a single pass.
This patch also speeds up the existing field sorting in `inferField`: the old sorting code allocated unnecessary intermediate collections, while the new code uses mutable collections and performs in-place sorting.
I rewrote inefficient `equals()` implementations in `StructType` and `Metadata`, significantly reducing object allocations in those methods.
Finally, I replaced a `treeAggregate` call with `fold`: I doubt that `treeAggregate` will benefit us very much because the schemas would have to be enormous to realize large savings in network traffic. Since most schemas are probably fairly small in serialized form, they should typically fit within a direct task result and therefore can be incrementally merged at the driver as individual tasks finish. This change eliminates an entire (short) scheduler stage.
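The single-pass merge of two name-sorted field lists can be sketched in Python (a hypothetical helper; fields are shown as `(name, type)` pairs and `merge_type` stands in for `compatibleType`):

```python
def merge_sorted_fields(a, b, merge_type):
    # Both input lists are sorted by field name, so fields with matching
    # names can be paired up in a single linear pass -- no groupBy needed.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            out.append((a[i][0], merge_type(a[i][1], b[j][1])))
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # Copy over whatever unique fields remain in either list.
    out.extend(a[i:])
    out.extend(b[j:])
    return out
```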
Author: Josh Rosen <joshrosen@databricks.com>
Closes#12750 from JoshRosen/schema-inference-speedups.
When we parse `CREATE TABLE USING`, we should build a `CreateTableUsing` plan with `managedIfNoPath` set to true. Then we will add the default table path to the options when writing it to Hive.
new test in `SQLQuerySuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12949 from cloud-fan/bug.
## What changes were proposed in this pull request?
This also simplifies the code being moved.
## How was this patch tested?
Existing tests.
Author: Andrew Or <andrew@databricks.com>
Closes#12941 from andrewor14/move-code.
Enhance the exception message when `checkpointLocation` is not set; previously the message was:
```
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.sql.DataFrameWriter$$anonfun$8.apply(DataFrameWriter.scala:338)
at org.apache.spark.sql.DataFrameWriter$$anonfun$8.apply(DataFrameWriter.scala:338)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:337)
at org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:277)
... 48 elided
```
This is not very meaningful, so this change makes it more specific.
Verified locally.
Author: jerryshao <sshao@hortonworks.com>
Closes#12998 from jerryshao/improve-exception-message.
## What changes were proposed in this pull request?
This is a follow-up of PR #12844. It makes the newly updated `DescribeTableCommand` to support data sources tables.
## How was this patch tested?
A test case is added to check `DESC [EXTENDED | FORMATTED] <table>` output.
Author: Cheng Lian <lian@databricks.com>
Closes#12934 from liancheng/spark-14127-desc-table-follow-up.
#### What changes were proposed in this pull request?
As Hive and the major RDBMSes behave, built-in functions should not be allowed to be dropped. In the current implementation, users can drop built-in functions; however, after dropping them, users are unable to add them back.
#### How was this patch tested?
Added a test case.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12975 from gatorsmile/dropBuildInFunction.
## What changes were proposed in this pull request?
following operations have file system operation now:
1. CREATE DATABASE: create a dir
2. DROP DATABASE: delete the dir
3. CREATE TABLE: create a dir
4. DROP TABLE: delete the dir
5. RENAME TABLE: rename the dir
6. CREATE PARTITIONS: create a dir
7. RENAME PARTITIONS: rename the dir
8. DROP PARTITIONS: drop the dir
## How was this patch tested?
new tests in `ExternalCatalogSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12871 from cloud-fan/catalog.
## What changes were proposed in this pull request?
Currently, when we create an alias against a TypedColumn from a user-defined Aggregator (for example: `agg(aggSum.toColumn as "a")`), Spark uses the alias function from `Column` (`as`), which returns a column containing a TypedAggregateExpression that is unresolved because the inputDeserializer is not defined. Later, the aggregate function (`agg`) injects the inputDeserializer back into the TypedAggregateExpression, but only if the aggregate columns are TypedColumn; in the above case, the TypedAggregateExpression remains unresolved because it is wrapped in a Column, causing the problem reported in [SPARK-15051](https://issues.apache.org/jira/browse/SPARK-15051?jql=project%20%3D%20SPARK).
This PR proposes to create an alias function for TypedColumn that returns a TypedColumn, using a code path similar to Column's alias function.
For Spark's built-in aggregate functions, like `max`, aliasing works, for example:
```scala
val df1 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j")
checkAnswer(df1.agg(max("j") as "b"), Row(3) :: Nil)
```
Thanks for comments.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Add test cases in DatasetAggregatorSuite.scala
run the sql related queries against this patch.
Author: Kevin Yu <qyu@us.ibm.com>
Closes#12893 from kevinyu98/spark-15051.
## What changes were proposed in this pull request?
Lets says there are json files in the following directories structure
```
xyz/file0.json
xyz/subdir1/file1.json
xyz/subdir2/file2.json
xyz/subdir1/subsubdir1/file3.json
```
`sqlContext.read.json("xyz")` should read only file0.json according to the behavior in Spark 1.6.1. However, in the current master, all 4 files are read.
The fix is to make FileCatalog return only the direct children of the given path when no partitioning is detected (instead of the full recursive list of files).
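The fixed listing behavior can be sketched in Python (a hypothetical helper, not the actual FileCatalog code):

```python
import posixpath

def list_data_files(base, all_files, partitioned):
    # With no partitioning detected, return only the direct children of
    # `base`; otherwise keep the full recursive listing.
    if partitioned:
        return all_files
    return [f for f in all_files if posixpath.dirname(f) == base.rstrip("/")]
```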
Closes#12774
## How was this patch tested?
unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12856 from tdas/SPARK-14997.
#### What changes were proposed in this pull request?
When describing a UDTF, the command returns a wrong result. The command is unable to find the function, which has been created and cataloged in the catalog but not in the functionRegistry.
This PR corrects it: if the function is not in the functionRegistry, we check the catalog to collect the information of the UDTF.
#### How was this patch tested?
Added test cases to verify the results
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12885 from gatorsmile/showFunction.
## What changes were proposed in this pull request?
Minor doc and code style fixes
## How was this patch tested?
local build
Author: Jacek Laskowski <jacek@japila.pl>
Closes#12928 from jaceklaskowski/SPARK-15152.
## What changes were proposed in this pull request?
This issue addresses the comments in SPARK-15031 and also fixes java-linter errors.
- Use multiline format in SparkSession builder patterns.
- Update `binary_classification_metrics_example.py` to use `SparkSession`.
- Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far)
## How was this patch tested?
Passed the Jenkins tests and ran `dev/lint-java` manually.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12911 from dongjoon-hyun/SPARK-15134.
## What changes were proposed in this pull request?
Went through SparkSession and its members and fixed non-thread-safe classes used by SparkSession
## How was this patch tested?
Existing unit tests
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12915 from zsxwing/spark-session-thread-safe.
## What changes were proposed in this pull request?
Removing the `withHiveSupport` method of `SparkSession`, instead use `enableHiveSupport`
## How was this patch tested?
ran tests locally
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#12851 from techaddict/SPARK-15072.
#### What changes were proposed in this pull request?
First, a few test cases failed on Mac OS X because the property value of `java.io.tmpdir` does not include a trailing slash on some platforms. Hive always removes the last trailing slash. For example, here is what I found on the web:
```
Win NT --> C:\TEMP\
Win XP --> C:\TEMP
Solaris --> /var/tmp/
Linux --> /var/tmp
```
Second, a couple of test cases are added to verify if the commands work properly.
#### How was this patch tested?
Added a test case for it and correct the previous test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12081 from gatorsmile/mkdir.
## What changes were proposed in this pull request?
Implement repartitionByColumn on DataFrame.
This will allow us to run R functions on each partition identified by column groups with dapply() method.
## How was this patch tested?
Unit tests
Author: NarineK <narine.kokhlikyan@us.ibm.com>
Closes#12887 from NarineK/repartitionByColumns.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-15148
Mainly it improves the performance by roughly 30%-40% according to the [release note](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.1.0). The details of the purpose are described in the JIRA.
This PR upgrades Univocity library from 2.0.2 to 2.1.0.
## How was this patch tested?
Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12923 from HyukjinKwon/SPARK-15148.
## What changes were proposed in this pull request?
Similar to #11990, GenerateOrdering and GenerateColumnAccessor should print debug log for generated code with proper indentation.
## How was this patch tested?
Manually checked.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#12908 from sarutak/SPARK-15132.
## What changes were proposed in this pull request?
Make sure that whenever the StateStoreCoordinator cannot be contacted, we assume that the SparkContext and RpcEnv on the driver have been shut down, and therefore stop the StateStore management thread and unload all loaded stores.
## How was this patch tested?
Updated unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12905 from tdas/SPARK-15131.
#### What changes were proposed in this pull request?
When we load a dataset, if we set the path to ```/path/a=1```, we will not take `a` as the partitioning column. However, if we set the path to ```/path/a=1/file.parquet```, we take `a` as the partitioning column and it shows up in the schema.
This PR is to fix the behavior inconsistency issue.
The base path contains a set of paths that are considered as the base dirs of the input datasets. The partitioning discovery logic will make sure it will stop when it reaches any base path.
By default, the paths of the dataset provided by users will be base paths. Below are three typical cases,
**Case 1**```sqlContext.read.parquet("/path/something=true/")```: the base path will be
`/path/something=true/`, and the returned DataFrame will not contain a column of `something`.
**Case 2**```sqlContext.read.parquet("/path/something=true/a.parquet")```: the base path will be
still `/path/something=true/`, and the returned DataFrame will also not contain a column of
`something`.
**Case 3**```sqlContext.read.parquet("/path/")```: the base path will be `/path/`, and the returned
DataFrame will have the column of `something`.
Users also can override the basePath by setting `basePath` in the options to pass the new base
path to the data source. For example,
```sqlContext.read.option("basePath", "/path/").parquet("/path/something=true/")```,
and the returned DataFrame will have the column of `something`.
The related PRs:
- https://github.com/apache/spark/pull/9651
- https://github.com/apache/spark/pull/10211
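The stop-at-base-path rule in the three cases above can be illustrated with a small sketch (hypothetical Python, not Spark's actual discovery code; `discover_partitions` is an invented name) that collects `key=value` directory segments strictly below the base path:

```python
def discover_partitions(base_path, file_path):
    """Collect key=value directory segments below base_path
    (simplified sketch of the discovery rule described above)."""
    rel = file_path[len(base_path):].strip("/")
    parts = {}
    for seg in rel.split("/")[:-1]:  # last segment is the file itself
        if "=" in seg:
            k, v = seg.split("=", 1)
            parts[k] = v
    return parts

# Case 2: the base path covers the partition dir, so nothing is discovered
print(discover_partitions("/path/something=true/", "/path/something=true/a.parquet"))  # {}
# basePath override: 'something' shows up as a partition column
print(discover_partitions("/path/", "/path/something=true/a.parquet"))  # {'something': 'true'}
```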
#### How was this patch tested?
Added a couple of test cases
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12828 from gatorsmile/readPartitionedTable.
## What changes were proposed in this pull request?
This PR support new SQL syntax CREATE TEMPORARY VIEW.
Like:
```
CREATE TEMPORARY VIEW viewName AS SELECT * from xx
CREATE OR REPLACE TEMPORARY VIEW viewName AS SELECT * from xx
CREATE TEMPORARY VIEW viewName (c1 COMMENT 'blabla', c2 COMMENT 'blabla') AS SELECT * FROM xx
```
## How was this patch tested?
Unit tests.
Author: Sean Zhong <clockfly@gmail.com>
Closes#12872 from clockfly/spark-6399.
## What changes were proposed in this pull request?
Typo fix
## How was this patch tested?
No tests
My apologies for the tiny PR, but I stumbled across this today and wanted to get it corrected for 2.0.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#12912 from sethah/csv_typo.
## What changes were proposed in this pull request?
Currently we return RuntimeConfig itself to facilitate chaining. However, it makes the output in interactive environments (e.g. notebooks, scala repl) weird because it'd show the response of calling set as a RuntimeConfig itself.
## How was this patch tested?
Updated unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12902 from rxin/SPARK-15126.
## What changes were proposed in this pull request?
File Stream Sink writes the list of written files in a metadata log. StreamFileCatalog reads the list of the files for processing. However StreamFileCatalog does not infer partitioning like HDFSFileCatalog.
This PR enables that by refactoring HDFSFileCatalog to create an abstract class PartitioningAwareFileCatalog, that has all the functionality to infer partitions from a list of leaf files.
- HDFSFileCatalog has been renamed to ListingFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from recursive directory scanning.
- StreamFileCatalog has been renamed to MetadataLogFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from the metadata log.
- The above two classes have been moved into their own files, as they are not interfaces that should be in fileSourceInterfaces.scala.
## How was this patch tested?
- FileStreamSinkSuite was updated to see if partitioning gets inferred, and whether on reading the partitions get pruned correctly based on the query.
- Other unit tests are unchanged and pass as expected.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12879 from tdas/SPARK-15103.
## What changes were proposed in this pull request?
We can support subexpression elimination in TungstenAggregate by using current `EquivalentExpressions` which is already used in subexpression elimination for expression codegen.
However, in whole-stage codegen, we can't wrap the common expressions' code in functions as before; we simply generate the code snippets for common expressions. These code snippets are inserted before the common expressions are actually used in the generated Java code.
For multiple `TypedAggregateExpression`s used in an aggregation operator, since their input types should be the same, their `inputDeserializer`s will be the same too. This patch can also reduce redundant input deserialization.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#12729 from viirya/subexpr-elimination-tungstenaggregate.
## What changes were proposed in this pull request?
This patch changes the join API in Dataset so they can accept any Dataset, rather than just DataFrames.
## How was this patch tested?
N/A.
Author: Reynold Xin <rxin@databricks.com>
Closes#12886 from rxin/SPARK-15109.
## What changes were proposed in this pull request?
Currently in `StreamTest`, we have a `StartStream` which will start a streaming query against the trigger `ProcessingTime(intervalMS = 0)` and `SystemClock`.
We also need to test cases against `ProcessingTime(intervalMS > 0)`, which often requires `ManualClock`.
This patch:
- fixes an issue of `ProcessingTimeExecutor`, where for a batch it should run `batchRunner` only once but might run it multiple times under certain conditions;
- adds support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `AdvanceManualClock`, by specifying them as fields for `StartStream`, and by adding an `AdvanceClock` action;
- adds a test, which takes advantage of the new `StartStream` and `AdvanceManualClock`, to test against [PR#[SPARK-14942] Reduce delay between batch construction and execution ](https://github.com/apache/spark/pull/12725).
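The `AdvanceManualClock` idea rests on a clock that only moves when the test says so, so triggers fire deterministically instead of depending on wall-clock sleeps. A minimal sketch of such a clock (hypothetical Python; Spark's real `ManualClock` is a Scala utility with its own API):

```python
import threading

class ManualClock:
    """A tiny manual clock for deterministic trigger tests —
    an illustrative sketch, not Spark's actual ManualClock."""
    def __init__(self, start_ms=0):
        self._time = start_ms
        self._cond = threading.Condition()

    def get_time_millis(self):
        with self._cond:
            return self._time

    def advance(self, ms):
        # the test drives time forward explicitly
        with self._cond:
            self._time += ms
            self._cond.notify_all()

    def wait_till_time(self, target_ms):
        # a trigger executor blocks here until the test advances the clock
        with self._cond:
            while self._time < target_ms:
                self._cond.wait()
            return self._time

clock = ManualClock()
clock.advance(10)
print(clock.wait_till_time(5))  # already past 5ms, returns immediately: 10
```

A test then calls `advance()` instead of sleeping, which removes timing flakiness from interval-based triggers.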
## How was this patch tested?
N/A
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12797 from lw-lin/add-trigger-test-support.
## What changes were proposed in this pull request?
This PR improves the error message for `Generate` in 3 cases:
1. generator is nested in expressions, e.g. `SELECT explode(list) + 1 FROM tbl`
2. generator appears more than one time in SELECT, e.g. `SELECT explode(list), explode(list) FROM tbl`
3. generator appears in an operator other than Project, e.g. `SELECT * FROM tbl SORT BY explode(list)`
## How was this patch tested?
new tests in `AnalysisErrorSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12810 from cloud-fan/bug.
## What changes were proposed in this pull request?
Currently, various `FileFormat` data sources share approximately the same code for partition value appending. This PR tries to eliminate this duplication.
A new method `buildReaderWithPartitionValues()` is added to `FileFormat` with a default implementation that appends partition values to `InternalRow`s produced by the reader function returned by `buildReader()`.
Special data sources like Parquet, which implements partition value appending inside `buildReader()` because of the vectorized reader, and the Text data source, which doesn't support partitioning, override `buildReaderWithPartitionValues()` and simply delegate to `buildReader()`.
This PR brings two benefits:
1. Apparently, it de-duplicates partition value appending logic
2. The reader function returned by `buildReader()` is now only required to produce `InternalRow`s rather than `UnsafeRow`s if the data source doesn't override `buildReaderWithPartitionValues()`, because the safe-to-unsafe conversion is also performed while appending partition values. This makes 3rd-party data sources (e.g. spark-avro) easier to implement since they no longer need to access private APIs involving `UnsafeRow`.
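The default behavior amounts to wrapping the base reader so every row it yields gets the file's partition values appended. A rough sketch of that pattern (hypothetical Python with invented names; the real code operates on `InternalRow`s and performs the unsafe conversion in the same step):

```python
def with_partition_values(base_reader, partition_values):
    """Wrap a reader function so each row it yields gets the file's
    partition values appended — the pattern behind the default
    buildReaderWithPartitionValues() described above (sketch only)."""
    def reader(rows):
        for row in base_reader(rows):
            yield tuple(row) + tuple(partition_values)
    return reader

# a trivial base reader that just passes rows through
identity_reader = lambda rows: iter(rows)
reader = with_partition_values(identity_reader, ("2016-05-01",))
print(list(reader([(1, "a"), (2, "b")])))  # partition value appended to each row
```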
## How was this patch tested?
Existing tests should do the work.
Author: Cheng Lian <lian@databricks.com>
Closes#12866 from liancheng/spark-14237-simplify-partition-values-appending.
## What changes were proposed in this pull request?
Just a bunch of small tweaks on DDL exception messages.
## How was this patch tested?
`DDLCommandSuite` et al.
Author: Andrew Or <andrew@databricks.com>
Closes#12853 from andrewor14/make-exceptions-consistent.
## What changes were proposed in this pull request?
Make Dataset.sqlContext a lazy val so that its a stable identifier and can be used for imports.
Now this works again:
import someDataset.sqlContext.implicits._
## How was this patch tested?
Added a unit test to DatasetSuite that uses the import shown above.
Author: Koert Kuipers <koert@tresata.com>
Closes#12877 from koertkuipers/feat-sqlcontext-stable-import.
## What changes were proposed in this pull request?
Create a new API for handling Optional Configs in SQLConf.
Right now `getConf` for `OptionalConfigEntry[T]` returns a value of type `T`, and throws an exception if it doesn't exist. This adds a new method `getOptionalConf` (suggestions on naming welcome) which returns a value of type `Option[T]` (`None` if the value doesn't exist).
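The proposed contract can be sketched like this (hypothetical Python rendering with invented method names; the real `SQLConf` is Scala and returns `Option[T]` rather than `None`):

```python
class SQLConf:
    """Minimal sketch of the getConf vs. getOptionalConf contract."""
    def __init__(self):
        self._settings = {}

    def set_conf(self, key, value):
        self._settings[key] = value

    def get_conf(self, key):
        # current behavior: throw when the value is missing
        if key not in self._settings:
            raise KeyError(f"{key} is not set")
        return self._settings[key]

    def get_optional_conf(self, key):
        # proposed behavior: None (i.e. Option.empty) instead of throwing
        return self._settings.get(key)

conf = SQLConf()
conf.set_conf("spark.sql.shuffle.partitions", "200")
print(conf.get_optional_conf("spark.sql.shuffle.partitions"))  # 200
print(conf.get_optional_conf("missing.key"))  # None
```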
## How was this patch tested?
Add test and ran tests locally.
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#12846 from techaddict/SPARK-14422.
## What changes were proposed in this pull request?
Users should use the builder pattern instead.
## How was this patch tested?
Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes#12873 from andrewor14/spark-session-constructor.
## What changes were proposed in this pull request?
Observed a StackOverflowError in Kryo when executing TPC-DS Query 27. Spark Thrift Server disables Kryo reference tracking (if not specified in conf). When "spark.kryo.referenceTracking" is set to true explicitly in spark-defaults.conf, the query executes successfully. The root cause is that the TaskMemoryManager inside MemoryConsumer and LongToUnsafeRowMap was not transient and thus was serialized and broadcast around from within LongHashedRelation, which could potentially cause circular references inside Kryo. But the TaskMemoryManager is per task and should not be passed around in the first place. This fix makes it transient.
## How was this patch tested?
core/test, hive/test, sql/test, catalyst/test, dev/lint-scala, org.apache.spark.sql.hive.execution.HiveCompatibilitySuite, dev/scalastyle,
manual test of TPC-DS Query 27 with 1GB data but without the "limit 100", which would cause an NPE due to SPARK-14752.
Author: yzhou2001 <yzhou_1999@yahoo.com>
Closes#12598 from yzhou2001/master.
## What changes were proposed in this pull request?
Remove AccumulatorV2.localValue and keep only value
## How was this patch tested?
existing tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#12865 from techaddict/SPARK-15087.
## What changes were proposed in this pull request?
Support partitioning in the file stream sink. This is implemented using a new, but simpler code path for writing parquet files - both unpartitioned and partitioned. This new code path does not use Output Committers, as we will eventually write the file names to the metadata log for "committing" them.
This patch duplicates < 100 LOC from the WriterContainer. But it's far simpler than WriterContainer, as it does not involve output committing. In addition, it introduces new APIs in FileFormat and OutputWriterFactory in an attempt to simplify the APIs (no Job in the `FileFormat` API, no bucket and other stuff in `OutputWriterFactory.newInstance()`).
## Tests
- New unit tests to test the FileStreamSinkWriter for partitioned and unpartitioned files
- New unit test to partially test the FileStreamSink for partitioned files (does not test recovery of partition column data, as that requires change in the StreamFileCatalog, future PR).
- Updated FileStressSuite to test number of records read from partitioned output files.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12409 from tdas/streaming-partitioned-parquet.
## What changes were proposed in this pull request?
This patch removes SparkSqlSerializer. I believe this is now dead code.
## How was this patch tested?
Removed a test case related to it.
Author: Reynold Xin <rxin@databricks.com>
Closes#12864 from rxin/SPARK-15088.
## What changes were proposed in this pull request?
This patch moves AccumulatorV2 and subclasses into util package.
## How was this patch tested?
Updated relevant tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12863 from rxin/SPARK-15081.
## What changes were proposed in this pull request?
Right now `StreamExecution.awaitBatchLock` uses an unfair lock. `StreamExecution.awaitOffset` may run too long and fail some tests because `StreamExecution.constructNextBatch` keeps getting the lock.
See: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/865/testReport/junit/org.apache.spark.sql.streaming/FileStreamSourceStressTestSuite/file_source_stress_test/
This PR uses a fair ReentrantLock to resolve the thread starvation issue.
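Fairness here means waiters acquire the lock in arrival order, so `awaitOffset` can no longer be starved by a loop that keeps re-taking the lock. Spark gets this by constructing `ReentrantLock(true)`; purely as an illustration of the idea, a FIFO "ticket" lock can be sketched like this (hypothetical Python, not Spark code):

```python
import threading

class FairLock:
    """FIFO 'ticket' lock sketch: each acquirer takes a ticket and waits
    until it is being served, so threads enter in arrival order.
    Illustrative only — Spark simply uses ReentrantLock(true)."""
    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0
        self._now_serving = 0

    def acquire(self):
        with self._cond:
            my_ticket = self._next_ticket
            self._next_ticket += 1
            while self._now_serving != my_ticket:
                self._cond.wait()

    def release(self):
        with self._cond:
            self._now_serving += 1
            self._cond.notify_all()

lock = FairLock()
lock.acquire()
# ... critical section ...
lock.release()
```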
## How was this patch tested?
Modified `FileStreamSourceStressTestSuite.test("file source stress test")` to run the test codes 100 times locally. It always fails because of timeout without this patch.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12852 from zsxwing/SPARK-15077.
## What changes were proposed in this pull request?
This PR addresses a few minor issues in SQL parser:
- Removes some unused rules and keywords in the grammar.
- Removes code path for fallback SQL parsing (was needed for Hive native parsing).
- Use `UnresolvedGenerator` instead of hard-coding `Explode` & `JsonTuple`.
- Adds a more generic way of creating error messages for unsupported Hive features.
- Use `visitFunctionName` as much as possible.
- Interpret a `CatalogColumn`'s `DataType` directly instead of parsing it again.
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12826 from hvanhovell/SPARK-15047.
The contribution is my original work and that I license the work to the project under the project's open source license.
Author: poolis <gmichalopoulos@gmail.com>
Author: Greg Michalopoulos <gmichalopoulos@gmail.com>
Closes#10899 from poolis/spark-12928.
## What changes were proposed in this pull request?
This patch creates a builder pattern for creating SparkSession. The new code is unused and mostly dead code. I'm putting it up here for feedback.
There are a few TODOs that can be done as follow-up pull requests:
- [ ] Update tests to use this
- [ ] Update examples to use this
- [ ] Clean up SQLContext code w.r.t. this one (i.e. SparkSession shouldn't call into SQLContext.getOrCreate; it should be the other way around)
- [ ] Remove SparkSession.withHiveSupport
- [ ] Disable the old constructor (by making it private) so the only way to start a SparkSession is through this builder pattern
## How was this patch tested?
Part of the future pull request is to clean this up and switch existing tests to use this.
Author: Reynold Xin <rxin@databricks.com>
Closes#12830 from rxin/sparksession-builder.
## What changes were proposed in this pull request?
parquet datasource and ColumnarBatch tests fail on big-endian platforms. This patch adds support for little-endian byte arrays being correctly interpreted on a big-endian platform.
## How was this patch tested?
Spark test builds ran on big endian z/Linux and regression build on little endian amd64
Author: Pete Robbins <robbinspg@gmail.com>
Closes#12397 from robbinspg/master.
## What changes were proposed in this pull request?
In order to support nested predicate subqueries, this PR introduces an internal join type ExistenceJoin, which will emit all the rows from the left, plus an additional column which indicates whether any rows from the right matched (it's not null-aware right now). This additional column can be used to replace the subquery in Filter.
In theory, all predicate subqueries could use this join type, but it's slower than LeftSemi and LeftAnti, so it's only used for nested subqueries (subqueries inside OR).
For example, the following SQL:
```sql
SELECT a FROM t WHERE EXISTS (select 0) OR EXISTS (select 1)
```
This PR also fixes a bug where predicate subqueries were pushed down through joins (they should not be).
Nested null-aware subqueries are still not supported. For example, `a > 3 OR b NOT IN (select bb from t)`
After this, we could run TPCDS query Q10, Q35, Q45
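The ExistenceJoin semantics described above can be sketched as follows (hypothetical Python; `existence_join` is an invented name, and like the description says, this is not null-aware):

```python
def existence_join(left_rows, right_rows, condition):
    """Sketch of ExistenceJoin: emit every left row plus an extra
    boolean column marking whether any right row matched."""
    out = []
    for l in left_rows:
        exists = any(condition(l, r) for r in right_rows)
        out.append(l + (exists,))
    return out

left = [(1,), (2,), (3,)]
right = [(2,), (3,)]
# Filter can then test the extra column instead of re-running the subquery
print(existence_join(left, right, lambda l, r: l[0] == r[0]))  # [(1, False), (2, True), (3, True)]
```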
## How was this patch tested?
Added unit tests.
Author: Davies Liu <davies@databricks.com>
Closes#12820 from davies/or_exists.
## What changes were proposed in this pull request?
#12339 didn't fix the race condition. MemorySinkSuite is still flaky: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.2/814/testReport/junit/org.apache.spark.sql.streaming/MemorySinkSuite/registering_as_a_table/
Here is an execution order to reproduce it.
| Time |Thread 1 | MicroBatchThread |
|:-------------:|:-------------:|:-----:|
| 1 | | `MemorySink.getOffset` |
| 2 | | availableOffsets ++= newData (availableOffsets is not changed here) |
| 3 | addData(newData) | |
| 4 | Set `noNewData` to `false` in processAllAvailable | |
| 5 | | `dataAvailable` returns `false` |
| 6 | | noNewData = true |
| 7 | `noNewData` is true so just return | |
| 8 | assert results and fail | |
| 9 | | `dataAvailable` returns true so process the new batch |
This PR expands the scope of `awaitBatchLock.synchronized` to eliminate the above race.
## How was this patch tested?
test("stress test"). It always failed before this patch and passes after applying it. The test is ignored in this PR as it takes several minutes to finish.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12582 from zsxwing/SPARK-14579-2.
## What changes were proposed in this pull request?
NewAccumulator isn't the best name if we ever come up with v3 of the API.
## How was this patch tested?
Updated tests to reflect the change.
Author: Reynold Xin <rxin@databricks.com>
Closes#12827 from rxin/SPARK-15049.
## What changes were proposed in this pull request?
This PR adds the explanation and documentation for CSV options for reading and writing.
## How was this patch tested?
Style tests with `./dev/run_tests` for documentation style.
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Closes#12817 from HyukjinKwon/SPARK-13425.
## What changes were proposed in this pull request?
This is caused by https://github.com/apache/spark/pull/12776, which removes the `synchronized` from all methods in `AccumulatorContext`.
However, a test in `CachedTableSuite` synchronizes on `AccumulatorContext`, expecting that no one else can change it, which is not true anymore.
This PR updates that test to not require a lock on `AccumulatorContext`.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12811 from cloud-fan/flaky.
1. Adds the following options for parsing NaNs: nanValue
2. Adds the following options for parsing infinity: positiveInf, negativeInf.
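The way such options could map tokens to doubles can be sketched like this (hypothetical Python; the real `TypeCast.castTo` is Scala, and the default tokens below are assumptions, not the library's defaults):

```python
import math

def cast_to_double(token, nan_value="NaN", positive_inf="Inf", negative_inf="-Inf"):
    """Sketch of NaN/infinity handling with the nanValue, positiveInf
    and negativeInf options described above (illustrative only)."""
    if token == nan_value:
        return float("nan")
    if token == positive_inf:
        return float("inf")
    if token == negative_inf:
        return float("-inf")
    return float(token)  # ordinary numeric token

print(cast_to_double("1.5"))                               # 1.5
print(math.isnan(cast_to_double("N/A", nan_value="N/A")))  # True
```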
`TypeCast.castTo` is unit tested and an end-to-end test is added to `CSVSuite`
Author: Hossein <hossein@databricks.com>
Closes#11947 from falaki/SPARK-14143.
This PR contains three changes:
1. We will use spark.sql.warehouse.dir to set the warehouse location. We will not use hive.metastore.warehouse.dir.
2. SessionCatalog needs to set the location to default db. Otherwise, when creating a table in SparkSession without hive support, the default db's path will be an empty string.
3. When we create a database, we need to make the path qualified.
Existing tests and new tests
Author: Yin Huai <yhuai@databricks.com>
Closes#12812 from yhuai/warehouse.
## What changes were proposed in this pull request?
This patch removes some code that are no longer relevant -- mainly HiveSessionState.setDefaultOverrideConfs.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#12806 from rxin/SPARK-15028.
## What changes were proposed in this pull request?
This PR adds support for specifying a custom date format for `DateType` and `TimestampType`.
For `TimestampType`, this uses the given format to infer the schema and also to convert the values.
For `DateType`, this uses the given format to convert the values.
If `dateFormat` is not given, then it works with `DateTimeUtils.stringToTime()` for backwards compatibility.
When it's given, then it uses `SimpleDateFormat` for parsing data.
In addition, `IntegerType`, `DoubleType` and `LongType` have a higher priority than `TimestampType` in type inference. This means even if the given format is `yyyy` or `yyyy.MM`, it will be inferred as `IntegerType` or `DoubleType`. Since it is type inference, I think it is okay to give such precedence.
In addition, I renamed `csv.CSVInferSchema` to `csv.InferSchema` as JSON datasource has `json.InferSchema`. Although they have the same names, I did this because I thought the parent package name can still differentiate each. Accordingly, the suite name was also changed from `CSVInferSchemaSuite` to `InferSchemaSuite`.
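The precedence can be sketched as a try-in-order routine (hypothetical Python; the type names are just labels, and the Java date pattern is approximated by a `strptime` pattern):

```python
from datetime import datetime

def infer_field_type(value, strptime_format):
    """Try integer, then double, then the user's timestamp format,
    in that order — so a 'yyyy.MM' value is inferred as DoubleType,
    as described above (illustrative sketch only)."""
    candidates = [
        ("IntegerType", int),
        ("DoubleType", float),
        ("TimestampType", lambda v: datetime.strptime(v, strptime_format)),
    ]
    for name, parse in candidates:
        try:
            parse(value)
            return name
        except ValueError:
            continue
    return "StringType"

print(infer_field_type("2016.05", "%Y.%m"))        # DoubleType, not TimestampType
print(infer_field_type("2016-05-01", "%Y-%m-%d"))  # TimestampType
```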
## How was this patch tested?
unit tests are used and `./dev/run_tests` for coding style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11550 from HyukjinKwon/SPARK-13667.
## What changes were proposed in this pull request?
CatalystSqlParser can parse data types. So, we do not need to have an individual DataTypeParser.
## How was this patch tested?
Existing tests
Author: Yin Huai <yhuai@databricks.com>
Closes#12796 from yhuai/removeDataTypeParser.
## What changes were proposed in this pull request?
1. Remove all the `spark.setConf` etc. Just expose `spark.conf`
2. Make `spark.conf` take in things set in the core `SparkConf` as well, otherwise users may get confused
This was done for both the Python and Scala APIs.
## How was this patch tested?
`SQLConfSuite`, python tests.
This one fixes the failed tests in #12787.
Closes#12787
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12798 from yhuai/conf-api.
## What changes were proposed in this pull request?
Addresses comments in #12765.
## How was this patch tested?
Python tests.
Author: Andrew Or <andrew@databricks.com>
Closes#12784 from andrewor14/python-followup.
## What changes were proposed in this pull request?
dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.
The function signature is:
dapply(df, function(localDF) {}, schema = NULL)
R function input: local data.frame from the partition on local node
R function output: local data.frame
Schema specifies the Row format of the resulting DataFrame. It must match the R function's output.
If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such a resulting DataFrame can be processed by successive calls to dapply().
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>
Closes#12493 from sun-rui/SPARK-12919.
## What changes were proposed in this pull request?
Currently Spark SQL doesn't support sorting columns in descending order. However, the parser accepts the syntax and silently drops sorting directions. This PR fixes this by throwing an exception if `DESC` is specified as sorting direction of a sorting column.
## How was this patch tested?
A test case is added to test the invalid sorting order by checking exception message.
Author: Cheng Lian <lian@databricks.com>
Closes#12759 from liancheng/spark-14981.
## What changes were proposed in this pull request?
The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the python API.
## How was this patch tested?
Python tests.
Author: Andrew Or <andrew@databricks.com>
Closes#12765 from andrewor14/python-spark-session-more.
## What changes were proposed in this pull request?
This patch removes executionHive from HiveSessionState and HiveSharedState.
## How was this patch tested?
Updated test cases.
Author: Reynold Xin <rxin@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12770 from rxin/SPARK-14994.
#### What changes were proposed in this pull request?
Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
```SQL
SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2
==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
```
Note:
1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL.
2. This rule has to be done after de-duplicating the attributes; otherwise, the generated
join conditions will be incorrect.
This PR also corrects the existing behavior in Spark. Before this PR, the behavior is like
```SQL
test("except") {
val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id")
val df_right = Seq(1, 3).toDF("id")
checkAnswer(
df_left.except(df_right),
Row(2) :: Row(2) :: Row(4) :: Nil
)
}
```
After this PR, the result is corrected. We strictly follow the SQL compliance of `Except Distinct`.
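The rewrite's semantics can be sketched directly (hypothetical Python; `except_distinct` is an invented name, and the null-safe `<=>` comparison is spelled out even though Python's `==` already treats `None == None` as true):

```python
def except_distinct(left, right):
    """EXCEPT DISTINCT as a null-safe left-anti join plus DISTINCT —
    an illustrative sketch of the optimization described above."""
    def null_safe_eq(a, b):
        # SQL's <=>: two NULLs (None here) compare equal
        return (a is None and b is None) or a == b
    out = []
    for row in left:
        matched = any(all(null_safe_eq(x, y) for x, y in zip(row, r)) for r in right)
        if not matched and row not in out:  # anti join, then DISTINCT
            out.append(row)
    return out

df_left = [(1,), (2,), (2,), (3,), (3,), (4,)]
df_right = [(1,), (3,)]
print(except_distinct(df_left, df_right))  # [(2,), (4,)] — one copy of 2, per EXCEPT DISTINCT
```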
#### How was this patch tested?
Modified and added a few test cases to verify the optimization rule and the results of operators.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12736 from gatorsmile/exceptByAntiJoin.
## What changes were proposed in this pull request?
Minor typo fixes
## How was this patch tested?
local build
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#12755 from zhengruifeng/fix_doc_dataset.
## What changes were proposed in this pull request?
This patch removes HiveNativeCommand, so we can continue to remove the dependency on Hive. This pull request also removes the ability to generate golden result file using Hive.
## How was this patch tested?
Updated tests to reflect this.
Author: Reynold Xin <rxin@databricks.com>
Closes#12769 from rxin/SPARK-14991.
## What changes were proposed in this pull request?
The FileCatalog object gets created even if the user specifies a schema, which means files in the directory are enumerated even though it's not necessary. For large directories this is very slow. Users would want to specify a schema precisely in such scenarios of large dirs, and this defeats the purpose quite a bit.
## How was this patch tested?
Hard to test this with unit test.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#12748 from tdas/SPARK-14970.
## What changes were proposed in this pull request?
This PR introduces a new accumulator API which is much simpler than before:
1. the type hierarchy is simplified; now we only have an `Accumulator` class
2. Combine `initialValue` and `zeroValue` concepts into just one concept: `zeroValue`
3. there is only one `register` method; the accumulator registration and cleanup registration are combined.
4. the `id`, `name` and `countFailedValues` are combined into an `AccumulatorMetadata`, which is provided during registration.
`SQLMetric` is a good example to show the simplicity of this new API.
What we break:
1. no `setValue` anymore. In the new API, the intermediate type can be different from the result type; it's very hard to implement a general `setValue`
2. accumulator can't be serialized before registered.
Problems need to be addressed in follow-ups:
1. with this new API, `AccumulatorInfo` doesn't make a lot of sense, the partial output is not partial updates, we need to expose the intermediate value.
2. `ExceptionFailure` should not carry the accumulator updates. Why do users care about accumulator updates for failed cases? It looks like we only use this feature to update the internal metrics; how about sending a heartbeat to update internal metrics after the failure event?
3. the public event `SparkListenerTaskEnd` carries a `TaskMetrics`. Ideally this `TaskMetrics` don't need to carry external accumulators, as the only method of `TaskMetrics` that can access external accumulators is `private[spark]`. However, `SQLListener` use it to retrieve sql metrics.
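The simplified surface can be sketched as a single class (hypothetical Python rendering of the ideas above; the real API is Scala's `AccumulatorV2`, and the registration/metadata machinery is omitted here):

```python
class Accumulator:
    """One class, one zeroValue, explicit merge — a sketch of the
    simplified accumulator API described above (illustrative only)."""
    def __init__(self, zero_value, add_fn):
        self._zero = zero_value
        self._add = add_fn
        self._value = zero_value

    def add(self, v):
        self._value = self._add(self._value, v)

    def merge(self, other):
        # driver-side merge of a task's partial value
        self._value = self._add(self._value, other._value)

    def reset(self):
        self._value = self._zero

    @property
    def value(self):
        return self._value

# e.g. a long-sum metric, the SQLMetric-style case
acc = Accumulator(0, lambda a, b: a + b)
acc.add(3)
task_partial = Accumulator(0, lambda a, b: a + b)
task_partial.add(4)
acc.merge(task_partial)
print(acc.value)  # 7
```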
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12612 from cloud-fan/acc.
## What changes were proposed in this pull request?
Currently, `LongToUnsafeRowMap` uses a byte array as the underlying page, which can't be larger than 1G.
This PR improves `LongToUnsafeRowMap` to scale up to 8G bytes by using an array of `Long` instead of an array of bytes.
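The idea can be illustrated with a toy byte-addressable page backed by `Array[Long]`. This is a minimal sketch, not Spark's actual code (which uses unsafe memory access); the class name is hypothetical:

```scala
// A byte-addressable "page" backed by Array[Long]. A Java array index is an
// Int, so an Array[Byte] caps out at roughly 2G bytes, while an Array[Long]
// with the same number of elements spans 8x as many bytes, letting byte
// offsets exceed the Int range.
class LongBackedPage(numWords: Int) {
  private val words = new Array[Long](numWords)

  def putByte(offset: Long, b: Byte): Unit = {
    val word = (offset >> 3).toInt    // which Long holds this byte
    val shift = (offset & 7) * 8      // bit position within that Long
    words(word) = (words(word) & ~(0xFFL << shift)) | ((b & 0xFFL) << shift)
  }

  def getByte(offset: Long): Byte = {
    val word = (offset >> 3).toInt
    val shift = (offset & 7) * 8
    ((words(word) >> shift) & 0xFF).toByte
  }
}
```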
## How was this patch tested?
Manually ran a test to confirm that both UnsafeHashedRelation and LongHashedRelation could build a map larger than 2G.
Author: Davies Liu <davies@databricks.com>
Closes#12740 from davies/larger_broadcast.
## What changes were proposed in this pull request?
`interfaces.scala` was getting big. This just moves the biggest class in there to a new file for cleanliness.
## How was this patch tested?
Just moving things around.
Author: Andrew Or <andrew@databricks.com>
Closes#12721 from andrewor14/move-external-catalog.
Currently, we can only create persisted partitioned and/or bucketed data source tables using the Dataset API but not using SQL DDL. This PR implements the following syntax to add partitioning and bucketing support to the SQL DDL:
```
CREATE TABLE <table-name>
USING <provider> [OPTIONS (<key1> <value1>, <key2> <value2>, ...)]
[PARTITIONED BY (col1, col2, ...)]
[CLUSTERED BY (col1, col2, ...) [SORTED BY (col1, col2, ...)] INTO <n> BUCKETS]
AS SELECT ...
```
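For example, a concrete statement exercising the new clauses might look like the following (the table, column, and path names are purely illustrative):

```sql
CREATE TABLE sales
USING parquet OPTIONS (path '/tmp/sales')
PARTITIONED BY (country)
CLUSTERED BY (customer_id) SORTED BY (order_date) INTO 8 BUCKETS
AS SELECT customer_id, order_date, country FROM staging_sales
```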
Test cases are added in `MetastoreDataSourcesSuite` to check the newly added syntax.
Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#12734 from liancheng/spark-14954.
## What changes were proposed in this pull request?
The `Batch` class, which had been used to indicate progress in a stream, was abandoned by [[SPARK-13985][SQL] Deterministic batches with ids](caea152145) and then became useless.
This patch:
- removes the `Batch` class
- ~~does some related renaming~~ (update: this has been reverted)
- fixes some related comments
## How was this patch tested?
N/A
Author: Liwei Lin <lwlin7@gmail.com>
Closes#12638 from lw-lin/remove-batch.
### What changes were proposed in this pull request?
Anti joins using BroadcastHashJoin's unique-key code path are broken; it currently returns semi join results. This PR fixes this bug.
### How was this patch tested?
Added test cases to `ExistenceJoinSuite`.
cc davies gatorsmile
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12730 from hvanhovell/SPARK-14950.
## What changes were proposed in this pull request?
This PR will make Spark SQL disallow ALTER TABLE ADD/REPLACE/CHANGE COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction-related commands.
## How was this patch tested?
Existing tests. For those tests that I put in the blacklist, I am adding the useful parts back to SQLQuerySuite.
Author: Yin Huai <yhuai@databricks.com>
Closes#12714 from yhuai/banNativeCommand.
## What changes were proposed in this pull request?
We currently expose both Hadoop configuration and Spark SQL configuration in RuntimeConfig. I think we can remove the Hadoop configuration part, and simply generate Hadoop Configuration on the fly by passing all the SQL configurations into it. This way, there is a single interface (in Java/Scala/Python/SQL) for end-users.
As part of this patch, I also removed some config options deprecated in Spark 1.x.
## How was this patch tested?
Updated relevant tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12689 from rxin/SPARK-14913.
## What changes were proposed in this pull request?
#12625 exposed a new user-facing conf interface in `SparkSession`. This patch adds a catalog interface.
## How was this patch tested?
See `CatalogSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#12713 from andrewor14/user-facing-catalog.
## What changes were proposed in this pull request?
This PR adds Native execution of SHOW COLUMNS and SHOW PARTITION commands.
Command Syntax:
``` SQL
SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database]
```
``` SQL
SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)]
```
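Concrete invocations of the two commands might look like this (the database, table, and partition names are illustrative):

```sql
SHOW COLUMNS IN employees IN hr_db
SHOW PARTITIONS hr_db.employees PARTITION (country='US')
```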
## How was this patch tested?
Added test cases in HiveCommandSuite to verify execution and in DDLCommandSuite to verify plans.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#12222 from dilipbiswal/dkb_show_columns.
## What changes were proposed in this pull request?
While the vectorized hash map in `TungstenAggregate` currently supports all primitive data types during partial aggregation, this patch only enables the hash map for a subset of cases that have been verified to show performance improvements on our benchmarks, subject to an internal conf that sets an upper limit on the maximum length of the aggregate key/value schema. This list of supported use cases should be expanded over time.
## How was this patch tested?
There is no new change in functionality, so existing tests should suffice. Performance tests were done on TPCDS benchmarks.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12710 from sameeragarwal/vectorized-enable.
## What changes were proposed in this pull request?
This PR updates SortMergeJoinExec to support LeftSemi/LeftAnti, so it supports all the join types, the same as the other three join implementations: BroadcastHashJoinExec, ShuffledHashJoinExec, and BroadcastNestedLoopJoinExec.
This PR also simplifies the join selection in SparkStrategy.
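The left semi / left anti semantics that SortMergeJoinExec now covers can be sketched as a toy two-pointer merge over sorted sequences. This is an illustration of the join semantics, not Spark's row-based implementation; the function name is hypothetical:

```scala
// Merge two sorted inputs on equality. A left semi join keeps each left row
// that has at least one match on the right; a left anti join keeps each left
// row that has no match. Both inputs must be sorted ascending.
def sortMergeSemi(left: Seq[Int], right: Seq[Int], anti: Boolean = false): Seq[Int] = {
  val out = scala.collection.mutable.ArrayBuffer[Int]()
  var j = 0
  for (l <- left) {
    while (j < right.length && right(j) < l) j += 1  // advance the right side
    val matched = j < right.length && right(j) == l
    if (matched != anti) out += l                    // semi keeps matches, anti keeps non-matches
  }
  out.toSeq
}
```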
## How was this patch tested?
Added new tests.
Author: Davies Liu <davies@databricks.com>
Closes#12668 from davies/smj_semi.
## What changes were proposed in this pull request?
A previous patch mistakenly widened the visibility from `private[x]` to `protected[x]`. This patch reverts those changes.
Author: Andrew Or <andrew@databricks.com>
Closes#12686 from andrewor14/visibility.
## What changes were proposed in this pull request?
We currently have no way for users to propagate options to underlying libraries that rely on Hadoop configurations to work. For example, there are various options in parquet-mr that users might want to set, but the data source API does not expose a per-job way to set them. This patch propagates the user-specified options into the Hadoop Configuration as well.
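The propagation semantics can be sketched as an overlay: user-specified options win over the base configuration. Spark copies the options into a real `org.apache.hadoop.conf.Configuration`; a plain `Map` stands in for it in this hypothetical sketch:

```scala
// Overlay user-specified data source options onto the base Hadoop
// configuration, so per-job settings (e.g. a parquet-mr knob) override
// cluster-wide defaults.
def withUserOptions(
    hadoopConf: Map[String, String],
    options: Map[String, String]): Map[String, String] =
  hadoopConf ++ options  // later entries override earlier ones
```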
## How was this patch tested?
Used a mock data source implementation to test both the read path and the write path.
Author: Reynold Xin <rxin@databricks.com>
Closes#12688 from rxin/SPARK-14912.
## What changes were proposed in this pull request?
Minor typo fixes (too minor to deserve a separate JIRA)
## How was this patch tested?
local build
Author: Jacek Laskowski <jacek@japila.pl>
Closes#12469 from jaceklaskowski/minor-typo-fixes.
## What changes were proposed in this pull request?
Use `Long.parseLong`, which returns a primitive.
Using a series of `append()` calls reduces the creation of an extra StringBuilder.
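Both micro-optimizations can be illustrated in a few lines; the `describe` helper below is a hypothetical example, not code from the patch:

```scala
// java.lang.Long.parseLong returns a primitive long, avoiding the boxed
// java.lang.Long that Long.valueOf would allocate.
val n: Long = java.lang.Long.parseLong("42")

// Chaining append calls on one StringBuilder avoids the hidden extra
// StringBuilder that string concatenation with + allocates per expression.
def describe(name: String, size: Long): String =
  new StringBuilder()
    .append("name=").append(name)
    .append(", size=").append(size)
    .toString
```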
## How was this patch tested?
Unit tests
Author: Azeem Jiva <azeemj@gmail.com>
Closes#12520 from javawithjiva/minor.
## What changes were proposed in this pull request?
In Spark 2.0, `SparkSession` is the new thing. Internally we should stop using `SQLContext` everywhere, since it's not supposed to be the main user-facing API anymore.
In this patch I took care to not break any public APIs. The one place that's suspect is `o.a.s.ml.source.libsvm.DefaultSource`, but according to mengxr it's not supposed to be public so it's OK to change the underlying `FileFormat` trait.
**Reviewers**: This is a big patch that may be difficult to review but the changes are actually really straightforward. If you prefer I can break it up into a few smaller patches, but it will delay the progress of this issue a little.
## How was this patch tested?
No change in functionality intended.
Author: Andrew Or <andrew@databricks.com>
Closes#12625 from andrewor14/spark-session-refactor.
## What changes were proposed in this pull request?
`RuntimeConfig` is the new user-facing API in 2.0 added in #11378. Until now, however, it's been dead code. This patch uses `RuntimeConfig` in `SessionState` and exposes that through the `SparkSession`.
## How was this patch tested?
New test in `SQLContextSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#12669 from andrewor14/use-runtime-conf.
## What changes were proposed in this pull request?
This patch changes UnresolvedFunction and UnresolvedGenerator to use a FunctionIdentifier rather than just a String for function name. Also changed SessionCatalog to accept FunctionIdentifier in lookupFunction.
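The shape of the change can be sketched as a small case class; the fields mirror what the description above implies (a name plus an optional database qualifier), but the exact definition here is illustrative rather than Spark's source:

```scala
// A structured identifier instead of a bare String: the function name can
// carry an optional database qualifier, so "db1.max" and "max" are
// represented distinctly rather than parsed out of a single String.
case class FunctionIdentifier(funcName: String, database: Option[String] = None) {
  override def toString: String =
    database.map(db => s"$db.$funcName").getOrElse(funcName)
}
```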
## How was this patch tested?
Updated related unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#12659 from rxin/SPARK-14888.
## What changes were proposed in this pull request?
```
Spark context available as 'sc' (master = local[*], app id = local-1461283768192).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sql("SHOW TABLES").collect()
16/04/21 17:09:39 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/21 17:09:39 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
res0: Array[org.apache.spark.sql.Row] = Array([src,false])
scala> sql("SHOW TABLES").collect()
res1: Array[org.apache.spark.sql.Row] = Array([src,false])
scala> spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3)))
res2: org.apache.spark.sql.DataFrame = [_1: int, _2: int]
```
Hive things are loaded lazily.
## How was this patch tested?
Manual.
Author: Andrew Or <andrew@databricks.com>
Closes#12589 from andrewor14/spark-session-repl.
## What changes were proposed in this pull request?
This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class.
Note: A couple of things will break after this patch. These will be fixed separately.
- the python HiveContext
- all the documentation / comments referencing HiveContext
- there will be no more HiveContext in the REPL (fixed by #12589)
## How was this patch tested?
No change in functionality.
Author: Andrew Or <andrew@databricks.com>
Closes#12585 from andrewor14/delete-hive-context.
## What changes were proposed in this pull request?
This method was accidentally made `private[sql]` in Spark 2.0. This PR makes it public again, since 3rd party data sources like spark-avro depend on it.
## How was this patch tested?
N/A
Author: Cheng Lian <lian@databricks.com>
Closes#12652 from liancheng/spark-14875.
## What changes were proposed in this pull request?
This PR fixes a bug in `TungstenAggregate` that manifests while aggregating by keys over nullable `BigDecimal` columns. This causes a null pointer exception while executing TPCDS q14a.
## How was this patch tested?
1. Added regression test in `DataFrameAggregateSuite`.
2. Verified that TPCDS q14a works
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12651 from sameeragarwal/tpcds-fix.