## What changes were proposed in this pull request?
We also store data source table options in this field, so it's unreasonable to call it `serdeProperties`.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14283 from cloud-fan/minor1.
## What changes were proposed in this pull request?
This PR adds a boolean option, `truncate`, for `SaveMode.Overwrite` of the JDBC DataFrameWriter. If this option is `true`, it tries to take advantage of `TRUNCATE TABLE` instead of `DROP TABLE`. This is a trivial option, but will provide great **convenience** for BI tool users based on RDBMS tables generated by Spark.
**Goal**
- Without `CREATE/DROP` privileges, we can save a dataframe to the database. Sometimes these privileges are not granted for security reasons.
- It will preserve the existing table information, so users can add and keep some additional `INDEX` and `CONSTRAINT`s for the table.
- Sometimes, `TRUNCATE` is faster than the combination of `DROP/CREATE`.
**Supported DBMS**
The following table shows `truncate` option support. Due to the different behavior of `TRUNCATE TABLE` among DBMSs, it's not always safe to use `TRUNCATE TABLE`. Spark will ignore the `truncate` option for **unknown** DBMSs and for **some** DBMSs with **default CASCADING** behavior. Newly added JDBC dialects should additionally implement the corresponding function to support the `truncate` option.
Spark Dialects | `truncate` OPTION SUPPORT
---------------|-------------------------------
MySQLDialect | O
PostgresDialect | X
DB2Dialect | O
MsSqlServerDialect | O
DerbyDialect | O
OracleDialect | O
**Before (TABLE with INDEX case)**: SparkShell & MySQL CLI are interleaved intentionally.
```scala
scala> val (url, prop)=("jdbc:mysql://localhost:3306/temp?useSSL=false", new java.util.Properties)
scala> prop.setProperty("user","root")
scala> df.write.mode("overwrite").jdbc(url, "table_with_index", prop)
scala> spark.range(10).write.mode("overwrite").jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id | bigint(20) | NO | | NULL | |
+-------+------------+------+-----+---------+-------+
mysql> CREATE UNIQUE INDEX idx_id ON table_with_index(id);
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id | bigint(20) | NO | PRI | NULL | |
+-------+------------+------+-----+---------+-------+
scala> spark.range(10).write.mode("overwrite").jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id | bigint(20) | NO | | NULL | |
+-------+------------+------+-----+---------+-------+
```
**After (TABLE with INDEX case)**
```scala
scala> spark.range(10).write.mode("overwrite").option("truncate", true).jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id | bigint(20) | NO | PRI | NULL | |
+-------+------------+------+-----+---------+-------+
```
**Error Handling**
- In case of exceptions, Spark will not retry. Users should turn off the `truncate` option.
- In case of schema change:
- If one of the column names changes, this will raise an exception, as expected.
- If only the column types differ, this works like Append mode.
## How was this patch tested?
Pass the Jenkins tests with an updated test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14086 from dongjoon-hyun/SPARK-16410.
## What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/12945, we renamed `registerTempTable` to `createTempView`, as we actually create a view. This PR renames `SQLTestUtils.withTempTable` to reflect this change.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#14318 from cloud-fan/minor4.
### What changes were proposed in this pull request?
**Issue 1: Silent Ignorance of Bucket Specification When Creating Table Using Schema Inference**
When creating a data source table without an explicit schema or SELECT clause, we silently ignore the bucket specification (CLUSTERED BY... SORTED BY...) in [the code](ce3b98bae2/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala (L339-L354)).
For example,
```SQL
CREATE TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
path '${tempDir.getCanonicalPath}'
)
CLUSTERED BY (inexistentColumnA) SORTED BY (inexistentColumnB) INTO 2 BUCKETS
```
This PR captures it and issues an error message.
**Issue 2: A run-time `java.lang.ArithmeticException` is thrown when the number of buckets is set to zero.**
For example,
```SQL
CREATE TABLE t USING PARQUET
OPTIONS (PATH '${path.toString}')
CLUSTERED BY (a) SORTED BY (b) INTO 0 BUCKETS
AS SELECT 1 AS a, 2 AS b
```
The exception we got is
```
ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1.0 (TID 2)
java.lang.ArithmeticException: / by zero
```
This PR captures the misuse and issues an appropriate error message.
### How was this patch tested?
Added a test case in DDLSuite
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14210 from gatorsmile/createTableWithoutSchema.
## What changes were proposed in this pull request?
This PR adds the `str_to_map` SQL function in order to remove the Hive fallback.
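For illustration, a minimal usage sketch (the delimiters follow Hive's `str_to_map`; treat the shown result as an assumption):
```scala
// str_to_map(text, pairDelim, keyValueDelim) splits `text` into pairs, then each pair
// into a key and a value, producing a map<string,string> column.
sql("SELECT str_to_map('a:1,b:2,c:3', ',', ':') AS m").show(false)
// Assumed result: Map(a -> 1, b -> 2, c -> 3)
```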
## How was this patch tested?
Pass the Jenkins tests with newly added test cases.
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#13990 from techaddict/SPARK-16287.
## What changes were proposed in this pull request?
As part of the bugfix in https://github.com/apache/spark/pull/12279, if a row batch consists of both dictionary-encoded and non-dictionary-encoded pages, we explicitly decode the dictionary for the values that are already dictionary encoded. Currently we reset the dictionary while reading every page, which can potentially cause a `java.lang.ArrayIndexOutOfBoundsException` while decoding older pages. This patch fixes the problem by maintaining a single dictionary per row batch in the vectorized Parquet reader.
## How was this patch tested?
Manual tests against a number of hand-generated Parquet files.
Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
Closes#14225 from sameeragarwal/vectorized.
## What changes were proposed in this pull request?
PR #14278 is a more general and simpler fix for SPARK-16632 than PR #14272. After merging #14278, we no longer need changes made in #14272. So here I revert them.
This PR targets both master and branch-2.0.
## How was this patch tested?
Existing tests.
Author: Cheng Lian <lian@databricks.com>
Closes#14300 from liancheng/revert-pr-14272.
## What changes were proposed in this pull request?
In `SpecificParquetRecordReaderBase`, which is used by the vectorized Parquet reader, we convert the Parquet requested schema into a Spark schema to guide column reader initialization. However, the Parquet requested schema is tailored from the schema of the physical file being scanned, and may have inaccurate type information due to bugs of other systems (e.g. HIVE-14294).
On the other hand, we already set the real Spark requested schema into Hadoop configuration in [`ParquetFileFormat`][1]. This PR simply reads out this schema to replace the converted one.
## How was this patch tested?
New test case added in `ParquetQuerySuite`.
[1]: https://github.com/apache/spark/blob/v2.0.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L292-L294
Author: Cheng Lian <lian@databricks.com>
Closes#14278 from liancheng/spark-16632-simpler-fix.
## What changes were proposed in this pull request?
Saving partitions to JDBC in a transaction can use a weaker transaction isolation level to reduce locking. Use a better method to check whether transactions are supported.
## How was this patch tested?
Existing Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#14054 from srowen/SPARK-16226.
## What changes were proposed in this pull request?
Aggregate expressions can only be executed inside `Aggregate`. If we propagate them up with constraints, the parent operator cannot execute them and will fail at runtime.
## How was this patch tested?
new test in SQLQuerySuite
Author: Wenchen Fan <wenchen@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#14281 from cloud-fan/bug.
This allows configuration to be more flexible, for example, when the cluster does
not have a homogeneous configuration (e.g. packages are installed on different
paths in different nodes). By allowing one to reference the environment from
the conf, it becomes possible to work around those in certain cases.
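For illustration, a hedged sketch of what such a reference might look like; the `${env:...}` expansion syntax and the exact behavior are assumptions based on this description:
```scala
// Hypothetical illustration: a config value that references an environment variable,
// so nodes with different install paths can resolve it from their own environment.
val conf = new org.apache.spark.SparkConf()
  .set("spark.sql.hive.metastore.jars", "${env:HIVE_HOME}/lib/*")
```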
As part of the implementation, ConfigEntry now keeps track of all "known" configs
(i.e. those created through the use of ConfigBuilder), since that list is used
by the resolution code. This duplicates some code in SQLConf, which could potentially
be merged with this now. It will also make it simpler to implement some missing
features such as filtering which configs show up in the UI or in event logs - which
are not part of this change.
Another change is in the way ConfigEntry reads config data; it now takes a string
map and a function that reads env variables, so that it can be called both from
SparkConf and SQLConf. This makes it so both places follow the same read path,
instead of having to replicate certain logic in SQLConf. There are still a
couple of methods in SQLConf that peek into fields of ConfigEntry directly,
though.
Tested via unit tests, and by using the new variable expansion functionality
in a shell session with a custom spark.sql.hive.metastore.jars value.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#14022 from vanzin/SPARK-16272.
## What changes were proposed in this pull request?
Due to backward-compatibility reasons, the following Parquet schema is ambiguous:
```
optional group f (LIST) {
repeated group list {
optional group element {
optional int32 element;
}
}
}
```
According to the parquet-format spec, when interpreted as a standard 3-level layout, this type is equivalent to the following SQL type:
```
ARRAY<STRUCT<element: INT>>
```
However, when interpreted as a legacy 2-level layout, it's equivalent to
```
ARRAY<STRUCT<element: STRUCT<element: INT>>>
```
Historically, to disambiguate these cases, we employed two methods:
- `ParquetSchemaConverter.isElementType()`
Used to disambiguate the above cases while converting Parquet types to Spark types.
- `ParquetRowConverter.isElementType()`
Used to disambiguate the above cases while instantiating row converters that convert Parquet records to Spark rows.
Unfortunately, these two methods make different decisions about the above problematic Parquet type, which caused SPARK-16344.
`ParquetRowConverter.isElementType()` is necessary for Spark 1.4 and earlier versions because Parquet requested schemata are directly converted from Spark schemata in these versions. The converted Parquet schemata may be incompatible with actual schemata of the underlying physical files when the files are written by a system/library that uses a schema conversion scheme that is different from Spark when writing Parquet LIST and MAP fields.
In Spark 1.5, Parquet requested schemata are always properly tailored from schemata of physical files to be read. Thus `ParquetRowConverter.isElementType()` is no longer necessary. This PR replaces this method with a simple yet accurate scheme: whenever an ambiguous Parquet type is hit, convert the type in question back to a Spark type using `ParquetSchemaConverter` and check whether it matches the corresponding Spark type.
## How was this patch tested?
New test cases added in `ParquetHiveCompatibilitySuite` and `ParquetQuerySuite`.
Author: Cheng Lian <lian@databricks.com>
Closes#14014 from liancheng/spark-16344-for-master-and-2.0.
When Hive (or at least certain versions of Hive) creates parquet files
containing tinyint or smallint columns, it stores them as int32, but
doesn't annotate the parquet field as containing the corresponding
int8 / int16 data. When Spark reads those files using the vectorized
reader, it follows the parquet schema for these fields, but when
actually reading the data it tries to use the type fetched from
the metastore, and then fails because data has been loaded into the
wrong fields in OnHeapColumnVector.
So instead of blindly trusting the parquet schema, check whether the
Catalyst-provided schema disagrees with it, and adjust the types so
that the necessary metadata is present when loading the data into
the ColumnVector instance.
Tested with unit tests and with tests that create byte / short columns
in Hive and try to read them from Spark.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#14272 from vanzin/SPARK-16632.
## What changes were proposed in this pull request?
In ScriptInputOutputSchema, we read the default RecordReader and RecordWriter from conf. Since Spark 2.0 has deleted those config keys from the Hive conf, we have to set the default reader/writer class names ourselves. Otherwise we will get None for LazySimpleSerDe, and the data written would not be readable by the script. The test case added worked fine with previous versions of Spark, but would fail now.
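For context, a hedged sketch of the kind of script transformation that depends on the default reader/writer (table and script are made up; requires a Hive-enabled SparkSession):
```scala
// Hypothetical example: pipe rows through an external script using the default
// record reader/writer classes.
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("SELECT TRANSFORM (key, value) USING 'cat' AS (k, v) FROM src").show()
```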
## How was this patch tested?
added a test case in SQLQuerySuite.
Closes#14169
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#14249 from yhuai/scriptTransformation.
## What changes were proposed in this pull request?
Currently, `JacksonGenerator.apply` is doing type-based dispatch for each row to write appropriate values.
It might not have to be done like this because the schema is already kept.
So, appropriate writers can be created once according to the schema, and then applied to each row. This approach is similar to `CatalystWriteSupport`.
This PR corrects `JacksonGenerator` so that it creates all writers for the schema once and then applies them to each row rather than type dispatching for every row.
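A simplified sketch of the idea, not the actual `JacksonGenerator` code: build one writer per field from the schema up front, then reuse those writers for every row.
```scala
import org.apache.spark.sql.types._

// Simplified sketch: pick a writer per field once, instead of matching on the
// value's type for every row.
type ValueWriter = Any => String

def makeWriter(dt: DataType): ValueWriter = dt match {
  case StringType => v => "\"" + v + "\""   // quote strings
  case _          => v => v.toString        // numbers, booleans, etc. (simplified)
}

def makeRowWriter(schema: StructType): Seq[Any] => String = {
  val writers = schema.fields.map(f => makeWriter(f.dataType)) // built once per schema
  row =>
    schema.fields.indices
      .map(i => "\"" + schema(i).name + "\":" + writers(i)(row(i)))
      .mkString("{", ",", "}")
}

val schema = new StructType().add("a", IntegerType).add("b", StringType)
val toJson = makeRowWriter(schema)
println(toJson(Seq(1, "x")))  // {"a":1,"b":"x"}
```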
The benchmark was run with the code below:
```scala
test("Benchmark for JSON writer") {
val N = 500 << 8
val row =
"""{"struct":{"field1": true, "field2": 92233720368547758070},
"structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
"arrayOfString":["str1", "str2"],
"arrayOfInteger":[1, 2147483647, -2147483648],
"arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
"arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
"arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
"arrayOfBoolean":[true, false, true],
"arrayOfNull":[null, null, null, null],
"arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
"arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
"arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
}"""
val df = spark.sqlContext.read.json(spark.sparkContext.parallelize(List.fill(N)(row)))
val benchmark = new Benchmark("JSON writer", N)
benchmark.addCase("writing JSON file", 10) { _ =>
withTempPath { path =>
df.write.format("json").save(path.getCanonicalPath)
}
}
benchmark.run()
}
```
This produced the results below
- **Before**
```
JSON writer: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
writing JSON file 1675 / 1767 0.1 13087.5 1.0X
```
- **After**
```
JSON writer: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
writing JSON file 1597 / 1686 0.1 12477.1 1.0X
```
In addition, I ran this benchmark 10 times for each and calculated the average elapsed time as below:
| **Before** | **After**|
|---------------|------------|
|17478ms |16669ms |
This is roughly a ~5% improvement.
## How was this patch tested?
Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#14028 from HyukjinKwon/SPARK-16351.
## What changes were proposed in this pull request?
`SQLTestUtils.withTempDatabase` is a frequently used test harness that sets up a temporary database and finally cleans it up. This issue improves it as follows for usability.
```scala
- try f(dbName) finally spark.sql(s"DROP DATABASE $dbName CASCADE")
+ try f(dbName) finally {
+ if (spark.catalog.currentDatabase == dbName) {
+ spark.sql(s"USE ${DEFAULT_DATABASE}")
+ }
+ spark.sql(s"DROP DATABASE $dbName CASCADE")
+ }
```
If a test forgets to reset the current database, `withTempDatabase` will not raise an exception.
## How was this patch tested?
This improves the test harness.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14184 from dongjoon-hyun/SPARK-16529.
## What changes were proposed in this pull request?
This PR changes the name of the columns returned by the `SHOW PARTITIONS` and `SHOW COLUMNS` commands. Currently, both commands use `result` as the column name.
**Comparison: Column Name**
Command|Spark(Before)|Spark(After)|Hive
----------|--------------|------------|-----
SHOW PARTITIONS|result|partition|partition
SHOW COLUMNS|result|col_name|field
Note that Spark/Hive uses `col_name` in `DESC TABLES`. So, this PR chooses `col_name` for consistency among Spark commands.
**Before**
```scala
scala> sql("show partitions p").show()
+------+
|result|
+------+
| b=2|
+------+
scala> sql("show columns in p").show()
+------+
|result|
+------+
| a|
| b|
+------+
```
**After**
```scala
scala> sql("show partitions p").show
+---------+
|partition|
+---------+
| b=2|
+---------+
scala> sql("show columns in p").show
+--------+
|col_name|
+--------+
| a|
| b|
+--------+
```
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14199 from dongjoon-hyun/SPARK-16543.
## What changes were proposed in this pull request?
This patch enables SparkSession to provide the Spark version.
## How was this patch tested?
Manual test:
```
scala> sc.version
res0: String = 2.1.0-SNAPSHOT
scala> spark.version
res1: String = 2.1.0-SNAPSHOT
```
```
>>> sc.version
u'2.1.0-SNAPSHOT'
>>> spark.version
u'2.1.0-SNAPSHOT'
```
Author: Liwei Lin <lwlin7@gmail.com>
Closes#14165 from lw-lin/add-version.
#### What changes were proposed in this pull request?
If we create a table pointing to a Parquet/JSON dataset without specifying the schema, the describe table command does not show the schema at all. It only shows `# Schema of this table is inferred at runtime`. In 1.6, describe table does show the schema of such a table.
~~For data source tables, to infer the schema, we need to load the data source tables at runtime. Thus, this PR calls the function `lookupRelation`.~~
For data source tables, we infer the schema before table creation. Thus, this PR sets the inferred schema as the table schema at table creation.
#### How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14148 from gatorsmile/describeSchema.
## What changes were proposed in this pull request?
It's unnecessary. `QueryTest` already sets it.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#14170 from brkyvz/test-tz.
## What changes were proposed in this pull request?
Fix Java style errors and remove unused imports that were found at random.
## How was this patch tested?
Tested on my local machine.
Author: Xin Ren <iamshrek@126.com>
Closes#14161 from keypointt/SPARK-16437.
## What changes were proposed in this pull request?
A second form of AssertQuery now actually invokes the condition; avoids a build warning too
## How was this patch tested?
Jenkins; running StreamTest
Author: Sean Owen <sowen@cloudera.com>
Closes#14133 from srowen/SPARK-15889.2.
## What changes were proposed in this pull request?
This patch implements reflect SQL function, which can be used to invoke a Java method in SQL. Slightly different from Hive, this implementation requires the class name and the method name to be literals. This implementation also supports only a smaller number of data types, and requires the function to be static, as suggested by rxin in #13969.
java_method is an alias for reflect, so this should also resolve SPARK-16277.
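A quick usage sketch (the class and method names are just examples of the literal arguments this requires):
```scala
// Invoke a static Java method from SQL; class and method names must be literals.
sql("SELECT reflect('java.util.UUID', 'randomUUID') AS id").show(false)
// java_method is an alias for reflect:
sql("SELECT java_method('java.util.UUID', 'fromString', 'a5cf6c01-b07d-4c63-8856-1110c9421a92') AS u").show(false)
```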
## How was this patch tested?
Added expression unit tests and an end-to-end test.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14138 from petermaxlee/reflect-static.
This option is used by Hive to directly delete the files instead of
moving them to the trash. This is needed in certain configurations
where moving the files does not work. For non-Hive tables and partitions,
Spark already behaves as if the PURGE option was set, so there's no
need to do anything.
Hive support for PURGE was added in 0.14 (for tables) and 1.2 (for
partitions), so the code reflects that: trying to use the option with
older versions of Hive will cause an exception to be thrown.
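A hedged sketch of the syntax this adds (table and partition names are placeholders, and a sufficiently new Hive-backed catalog is assumed):
```scala
// Drop the table's files directly instead of moving them to the trash (needs Hive 0.14+).
sql("DROP TABLE page_views PURGE")
// Drop a partition with PURGE (needs Hive 1.2+).
sql("ALTER TABLE page_views DROP PARTITION (dt = '2016-07-01') PURGE")
```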
The change is a little noisier than I would like, because of the code
to propagate the new flag through all the interfaces and implementations;
the main changes are in the parser and in HiveShim, aside from the tests
(DDLCommandSuite, VersionsSuite).
Tested by running sql and catalyst unit tests, plus VersionsSuite which
has been updated to test the version-specific behavior. I also ran an
internal test suite that uses PURGE and would not pass previously.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#13831 from vanzin/SPARK-16119.
## What changes were proposed in this pull request?
In code generation, it is incorrect for an expression to reuse variable names across different instances of itself. As an example, SPARK-16488 reports a bug in which the pmod expression reuses the variable name "r".
This patch updates ExpressionEvalHelper test harness to always project two instances of the same expression, which will help us catch variable reuse problems in expression unit tests. This patch also fixes the bug in crc32 expression.
## How was this patch tested?
This is a test harness change, but I also created a new test suite for testing the test harness.
Author: Reynold Xin <rxin@databricks.com>
Closes#14146 from rxin/SPARK-16489.
## What changes were proposed in this pull request?
When a query only uses metadata (for example, a partition key), it can return results based on the metadata without scanning files. Hive did this in HIVE-1003.
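For illustration, queries of this shape (touching only a partition column) are the intended target; the table and column below are hypothetical:
```scala
// Both queries reference only the partition column `dt` of a partitioned table,
// so they can be answered from partition metadata without scanning data files.
sql("SELECT DISTINCT dt FROM events").show()
sql("SELECT MAX(dt) FROM events").show()
```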
## How was this patch tested?
add unit tests
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Lianhui Wang <lianhuiwang@users.noreply.github.com>
Closes#13494 from lianhuiwang/metadata-only.
## What changes were proposed in this pull request?
Currently the input `RDD` of a `Dataset` is always serialized to `RDD[InternalRow]` prior to being used as a `Dataset`, but there is a case where we use `map` or `mapPartitions` right after converting to a `Dataset`.
In this case, a serialize followed by a deserialize happens, but it is not needed.
This PR adds an `ExistingRDD` logical plan for `RDD` input to have a chance to eliminate the serialize/deserialize.
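A minimal sketch of the pattern this targets (the case class and data are made up):
```scala
import spark.implicits._

// Converting an RDD to a Dataset and mapping right away: previously the rows were
// serialized to InternalRow and then deserialized again before `map` ran.
case class Point(x: Int, y: Int)
val rdd = spark.sparkContext.parallelize(Seq(Point(1, 2), Point(3, 4)))
val ds  = spark.createDataset(rdd).map(p => Point(p.x + 1, p.y + 1))
```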
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#13890 from ueshin/issues/SPARK-16189.
## What changes were proposed in this pull request?
It would be useful to support listing the columns that are referenced by a filter. This can help simplify data source planning, because with this we would be able to implement the unhandledFilters method in HadoopFsRelation.
This is based on rxin's patch (#13901) and adds unit tests.
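A hedged sketch of the kind of API this describes, assuming a `references` method on `Filter` that returns the referenced column names:
```scala
import org.apache.spark.sql.sources._

// Assumed behavior: a Filter reports which columns it references, including
// columns referenced by nested filters.
val f: Filter = And(EqualTo("a", 1), GreaterThan("b", 10))
println(f.references.mkString(", "))  // assumed to print: a, b
```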
## How was this patch tested?
Added a new suite FiltersSuite.
Author: petermaxlee <petermaxlee@gmail.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#14120 from petermaxlee/SPARK-16199.
## What changes were proposed in this pull request?
In order to make it clear which filters are fully handled by the underlying data source, we will mark them with an *. This gives a clear visual cue to users that the filter is being treated differently by Catalyst than filters which are just presented to the underlying DataSource.
Examples from the FilteredScanSuite, in this example `c IN (...)` is handled by the source, `b < ...` is not
### Before
```
//SELECT a FROM oneToTenFiltered WHERE a + b > 9 AND b < 16 AND c IN ('bbbbbBBBBB', 'cccccCCCCC', 'dddddDDDDD', 'foo')
== Physical Plan ==
Project [a#0]
+- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
+- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
```
### After
```
== Physical Plan ==
Project [a#0]
+- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
+- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), *In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
```
## How was this patch tested?
Manually tested with the Spark Cassandra Connector, a source which fully handles underlying filters. Now fully handled filters appear with an * next to their names. I can add an automated test as well if requested
Post 1.6.1
Tested by modifying the FilteredScanSuite to run explains.
Author: Russell Spitzer <Russell.Spitzer@gmail.com>
Closes#11317 from RussellSpitzer/SPARK-12639-Star.
## What changes were proposed in this pull request?
This patch fixes a variable namespace collision bug in pmod and partitionBy
## How was this patch tested?
Regression test for one possible occurrence. A more general fix in `ExpressionEvalHelper.checkEvaluation` will be in a subsequent PR.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#14144 from sameeragarwal/codegen-bug.
## What changes were proposed in this pull request?
An incorrect list of files was being allocated to a batch. This caused a file to be read multiple times across multiple batches.
## How was this patch tested?
Added unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#14143 from tdas/SPARK-16430-1.
## What changes were proposed in this pull request?
Display `No physical plan. Waiting for data.` instead of `N/A` for StreamingQuery.explain when no data arrives because `N/A` doesn't provide meaningful information.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#14100 from zsxwing/SPARK-16433.
## What changes were proposed in this pull request?
A structured streaming example with event time windowing.
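A condensed sketch of event-time windowing of the kind such an example demonstrates (the source DataFrame and column names are placeholders):
```scala
import org.apache.spark.sql.functions.window
import spark.implicits._

// Count words per 10-minute event-time window, sliding every 5 minutes.
// `lines` is assumed to be a streaming DataFrame with (timestamp, word) columns.
val windowedCounts = lines
  .groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"word")
  .count()
```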
## How was this patch tested?
Run locally
Author: James Thomas <jamesjoethomas@gmail.com>
Closes#13957 from jjthomas/current.
## What changes were proposed in this pull request?
Temporary tables are used frequently, but `spark.catalog.listColumns` does not support those tables. This PR makes `SessionCatalog` support temporary table column listing.
**Before**
```scala
scala> spark.range(10).createOrReplaceTempView("t1")
scala> spark.catalog.listTables().collect()
res1: Array[org.apache.spark.sql.catalog.Table] = Array(Table[name=`t1`, tableType=`TEMPORARY`, isTemporary=`true`])
scala> spark.catalog.listColumns("t1").collect()
org.apache.spark.sql.AnalysisException: Table `t1` does not exist in database `default`.;
```
**After**
```
scala> spark.catalog.listColumns("t1").collect()
res2: Array[org.apache.spark.sql.catalog.Column] = Array(Column[name='id', description='id', dataType='bigint', nullable='false', isPartition='false', isBucket='false'])
```
## How was this patch tested?
Pass the Jenkins tests including a new testcase.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14114 from dongjoon-hyun/SPARK-16458.
## What changes were proposed in this pull request?
This PR prevents dropping the current database to avoid errors like the followings.
```scala
scala> sql("create database delete_db")
scala> sql("use delete_db")
scala> sql("drop database delete_db")
scala> sql("create table t as select 1")
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database `delete_db` not found;
```
## How was this patch tested?
Pass the Jenkins tests including an updated testcase.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14115 from dongjoon-hyun/SPARK-16459.
#### What changes were proposed in this pull request?
**Issue 1:** When a query contains LIMIT/TABLESAMPLE 0, the statistics could be zero. Results are correct but it could cause a huge performance regression. For example,
```Scala
Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4)).toDF("k", "v")
.createOrReplaceTempView("test")
val df1 = spark.table("test")
val df2 = spark.table("test").limit(0)
val df = df1.join(df2, Seq("k"), "left")
```
The statistics of both `df` and `df2` are zero. The statistics values should never be zero; otherwise `sizeInBytes` of `BinaryNode` will also be zero (product of children). This PR increases it to `1` when the number of rows is equal to 0.
**Issue 2:** When a query contains a negative LIMIT/TABLESAMPLE, we should issue an exception. Negative values could break implementation assumptions in multiple places, for example, statistics calculation. Below is an example query.
```SQL
SELECT * FROM testData TABLESAMPLE (-1 rows)
SELECT * FROM testData LIMIT -1
```
This PR is to issue an appropriate exception in this case.
**Issue 3:** Spark SQL follows the restriction on the LIMIT clause in Hive. The argument to the LIMIT clause must evaluate to a constant value. It can be a numeric literal, or another kind of numeric expression involving operators, casts, and function return values. You cannot refer to a column or use a subquery. Currently, we do not detect whether the expression in the LIMIT clause is foldable or not. If it is non-foldable, we might issue a strange error message. For example,
```SQL
SELECT * FROM testData LIMIT rand() > 0.2
```
Then, a misleading error message is issued, like
```
assertion failed: No plan for GlobalLimit (_nondeterministic#203 > 0.2)
+- Project [key#11, value#12, rand(-1441968339187861415) AS _nondeterministic#203]
+- LocalLimit (_nondeterministic#202 > 0.2)
+- Project [key#11, value#12, rand(-1308350387169017676) AS _nondeterministic#202]
+- LogicalRDD [key#11, value#12]
java.lang.AssertionError: assertion failed: No plan for GlobalLimit (_nondeterministic#203 > 0.2)
+- Project [key#11, value#12, rand(-1441968339187861415) AS _nondeterministic#203]
+- LocalLimit (_nondeterministic#202 > 0.2)
+- Project [key#11, value#12, rand(-1308350387169017676) AS _nondeterministic#202]
+- LogicalRDD [key#11, value#12]
```
This PR detects it and then issues a meaningful error message.
#### How was this patch tested?
Added test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14034 from gatorsmile/limit.
## What changes were proposed in this pull request?
This patch implements all remaining xpath functions that Hive supports and not natively supported in Spark: xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string, and xpath.
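Usage sketch (expected results follow Hive's xpath semantics; treat them as assumptions):
```scala
sql("SELECT xpath_int('<a><b>1</b><b>2</b></a>', 'sum(a/b)')").show()       // 3
sql("SELECT xpath_string('<a><b>b1</b><b>b2</b></a>', 'a/b[2]')").show()    // b2
sql("SELECT xpath('<a><b>b1</b><b>b2</b></a>', 'a/b/text()')").show(false)  // [b1, b2]
```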
## How was this patch tested?
Added unit tests and end-to-end tests.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#13991 from petermaxlee/SPARK-16318.
#### What changes were proposed in this pull request?
When users try to implement a data source API by extending only `RelationProvider` and `CreatableRelationProvider`, they will hit an error when resolving the relation.
```Scala
spark.read
.format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
.load()
.write.
format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
.save()
```
The error they hit is like
```
org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.;
org.apache.spark.sql.AnalysisException: org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.;
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
```
Actually, the bug fix is simple. [`DataSource.createRelation(sparkSession.sqlContext, mode, options, data)`](dd644f8117/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L429)) already returns a BaseRelation. We should not assign schema to `userSpecifiedSchema`. That schema assignment only makes sense for the data sources that extend `FileFormat`.
#### How was this patch tested?
Added a test case.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14075 from gatorsmile/dataSource.
## What changes were proposed in this pull request?
Currently, the JDBC writer uses dialects to get data types, but doesn't use them to quote field names. This PR uses dialects to quote the field names, too.
**Reported Error Scenario (MySQL case)**
```scala
scala> val url="jdbc:mysql://localhost:3306/temp"
scala> val prop = new java.util.Properties
scala> prop.setProperty("user","root")
scala> val df = spark.createDataset(Seq("a","b","c")).toDF("order")
scala> df.write.mode("overwrite").jdbc(url, "temptable", prop)
...MySQLSyntaxErrorException: ... near 'order TEXT )
```
## How was this patch tested?
Pass the Jenkins tests and manually do the above case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14107 from dongjoon-hyun/SPARK-16387.
## What changes were proposed in this pull request?
This PR adds the parse_url SQL function in order to remove the Hive fallback.
A new implementation of #13999
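Usage sketch (the part names follow Hive's parse_url, e.g. HOST, PATH, QUERY):
```scala
sql("SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST')").show()           // spark.apache.org
sql("SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY', 'query')").show() // 1
```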
## How was this patch tested?
Pass the exist tests including new testcases.
Author: wujian <jan.chou.wu@gmail.com>
Closes#14008 from janplus/SPARK-16281.
## What changes were proposed in this pull request?
Adds a quoteAll option for writing CSV which will quote all fields.
See https://issues.apache.org/jira/browse/SPARK-13638
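A minimal sketch of the option (the output path is a placeholder):
```scala
// With quoteAll enabled, every field is wrapped in quotes, not only fields
// containing separators or quote characters.
df.write
  .option("quoteAll", "true")
  .csv("/tmp/quoted-output")  // hypothetical output path
```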
## How was this patch tested?
Added a test to verify the output columns are quoted for all fields in the DataFrame
Author: Jurriaan Pruis <email@jurriaanpruis.nl>
Closes#13374 from jurriaan/csv-quote-all.
## What changes were proposed in this pull request?
This PR implements `sentences` SQL function.
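Usage sketch (the exact output formatting is an assumption):
```scala
// `sentences` splits a string into an array of sentences, each an array of words.
sql("SELECT sentences('Hi there! Good morning.')").show(false)
// Assumed result: [[Hi, there], [Good, morning]]
```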
## How was this patch tested?
Pass the Jenkins tests with a new testcase.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14004 from dongjoon-hyun/SPARK_16285.
## What changes were proposed in this pull request?
An option that limits the file stream source to read 1 file at a time enables rate limiting. It has the additional convenience that a static set of files can be used like a stream for testing, as this allows those files to be considered one at a time.
This PR adds option `maxFilesPerTrigger`.
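A minimal sketch of the option (the format and directory are placeholders):
```scala
// Read at most one new file per trigger from the input directory.
val fileStream = spark.readStream
  .format("text")
  .option("maxFilesPerTrigger", "1")
  .load("/tmp/stream-input")  // hypothetical directory
```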
## How was this patch tested?
New unit test
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#14094 from tdas/SPARK-16430.
## What changes were proposed in this pull request?
There are cases where `complete` output mode does not output updated aggregated value; for details please refer to [SPARK-16350](https://issues.apache.org/jira/browse/SPARK-16350).
The cause is that, as we do `data.as[T].foreachPartition { iter => ... }` in `ForeachSink.addBatch()`, `foreachPartition()` does not support incremental planning for now.
This patch makes `foreachPartition()` support incremental planning in `ForeachSink`, by making a special version of `Dataset` with its `rdd()` method supporting incremental planning.
## How was this patch tested?
Added a unit test which failed before the change
Author: Liwei Lin <lwlin7@gmail.com>
Closes#14030 from lw-lin/fix-foreach-complete.
## What changes were proposed in this pull request?
This patch removes InSet filter pushdown from Parquet data source, since row-based pushdown is not beneficial to Spark and brings extra complexity to the code base.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#14076 from rxin/SPARK-16400.
#### What changes were proposed in this pull request?
When creating a view, a common user error is that the number of columns produced by the `SELECT` clause does not match the number of column names specified by `CREATE VIEW`.
For example, given Table `t1` only has 3 columns
```SQL
create view v1(col2, col4, col3, col5) as select * from t1
```
Currently, Spark SQL reports the following error:
```
requirement failed
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:212)
at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:90)
```
This error message is very confusing. This PR is to detect the error and issue a meaningful error message.
#### How was this patch tested?
Added test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14047 from gatorsmile/viewMismatchedColumns.
## What changes were proposed in this pull request?
Currently, the Scala API supports options with the types `String`, `Long`, `Double` and `Boolean`, and the Python API also supports other types.
This PR corrects the `tableProperty` rule to support other types (string, boolean, double and integer) so that the options for data sources are supported in a consistent way. This will affect other rules such as DBPROPERTIES and TBLPROPERTIES (allowing other types as values).
Also, `TODO add bucketing and partitioning.` was removed because it was resolved in 24bea00047
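A hedged sketch of property values that are no longer restricted to string literals (the property keys below are hypothetical):
```scala
// The values 10 and true are written as an unquoted number and boolean, which this
// change allows; previously only quoted string literals were accepted here.
sql("CREATE TABLE t (a INT) USING parquet TBLPROPERTIES ('some.int' = 10, 'some.flag' = true)")
```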
## How was this patch tested?
Unit test in `MetastoreDataSourcesSuite.scala`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13517 from HyukjinKwon/SPARK-14839.
## What changes were proposed in this pull request?
This is a small follow-up for SPARK-16371:
1. Hide removeMetadata from public API.
2. Add JIRA ticket number to test case name.
## How was this patch tested?
Updated a test comment.
Author: Reynold Xin <rxin@databricks.com>
Closes#14074 from rxin/parquet-filter.
## What changes were proposed in this pull request?
Currently, if there is a schema as below:
```
root
|-- _1: struct (nullable = true)
| |-- _1: integer (nullable = true)
```
and if we execute the codes below:
```scala
df.filter("_1 IS NOT NULL").count()
```
This pushes down a filter although this filter is being applied to a `StructType`. (If my understanding is correct, Spark does not push down filters for those.)
The reason is, `ParquetFilters.getFieldMap` produces results below:
```
(_1,StructType(StructField(_1,IntegerType,true)))
(_1,IntegerType)
```
and then it becomes a `Map`
```
(_1,IntegerType)
```
Now, because of ` ....lift(dataTypeOf(name)).map(_(name, value))`, this pushes down filters for `_1` which Parquet thinks is `IntegerType`. However, it is actually `StructType`.
So, Parquet filter2 produces incorrect results, for example, the codes below:
```
df.filter("_1 IS NOT NULL").count()
```
produces always 0.
This PR prevents this by not finding nested fields.
## How was this patch tested?
Unit test in `ParquetFilterSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#14067 from HyukjinKwon/SPARK-16371.
## What changes were proposed in this pull request?
PR #13696 renamed various Parquet support classes but left `CatalystWriteSupport` behind. This PR renames it as a follow-up.
## How was this patch tested?
N/A.
Author: Cheng Lian <lian@databricks.com>
Closes#14070 from liancheng/spark-15979-follow-up.
## What changes were proposed in this pull request?
These two configs should always be true after Spark 2.0. This patch removes them from the config list. Note that ideally this should've gone into branch-2.0, but due to the timing of the release we should only merge this in master for Spark 2.1.
## How was this patch tested?
Updated test cases.
Author: Reynold Xin <rxin@databricks.com>
Closes#14061 from rxin/SPARK-16388.
## What changes were proposed in this pull request?
Currently, the `regexp_replace` function supports `Column` arguments in a query. This PR supports that in a `Dataset` operation, too.
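A hedged sketch of the Dataset-side usage this enables, assuming the new overload accepts `Column` pattern and replacement arguments:
```scala
import org.apache.spark.sql.functions.regexp_replace
import spark.implicits._

// Pattern and replacement are taken from columns rather than from string literals.
val df = Seq(("100-200", "(\\d+)", "num")).toDF("str", "pattern", "replacement")
df.select(regexp_replace($"str", $"pattern", $"replacement")).show()
```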
## How was this patch tested?
Pass the Jenkins tests with an updated test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14060 from dongjoon-hyun/SPARK-16340.
## What changes were proposed in this pull request?
This PR implements `stack` table generating function.
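Usage sketch (the column names in the result are an assumption):
```scala
// stack(n, v1, v2, ...) splits the values into n rows.
sql("SELECT stack(2, 1, 2, 3, 4)").show()
// Assumed result: rows (1, 2) and (3, 4)
```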
## How was this patch tested?
Pass the Jenkins tests including new testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14033 from dongjoon-hyun/SPARK-16286.
## What changes were proposed in this pull request?
This PR removes `SessionState.executeSql` in favor of `SparkSession.sql`. We can remove this safely since the visibility `SessionState` is `private[sql]` and `executeSql` is only used in one **ignored** test, `test("Multiple Hive Instances")`.
## How was this patch tested?
Pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14055 from dongjoon-hyun/SPARK-16383.
## What changes were proposed in this pull request?
This patch fixes the bug that the refresh command does not work on temporary views. This patch is based on https://github.com/apache/spark/pull/13989, but removes the public Dataset.refresh() API as well as improved test coverage.
Note that I actually think the public refresh() API is very useful. We can in the future implement it by also invalidating the lazy vals in QueryExecution (or alternatively just create a new QueryExecution).
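A minimal sketch of the fixed behavior (the path and view name are placeholders):
```scala
// Create a temporary view over files, then refresh it after the underlying files change.
spark.read.parquet("/tmp/data").createOrReplaceTempView("my_view")  // hypothetical path
sql("REFRESH TABLE my_view")
```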
## How was this patch tested?
Re-enabled a previously ignored test, and added a new test suite for Hive testing behavior of temporary views against MetastoreRelation.
Author: Reynold Xin <rxin@databricks.com>
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14009 from rxin/SPARK-16311.
## What changes were proposed in this pull request?
Currently, there are a few reports about Spark 2.0 query performance regression for large queries.
This PR speeds up SQL query processing performance by removing a redundant **consecutive `executePlan`** call in the `Dataset.ofRows` function and `Dataset` instantiation. Specifically, this PR aims to reduce the overhead of SQL query execution plan generation, not real query execution, so we cannot see the result in the Spark Web UI. Please use the following query script. The result is **25.78 sec** -> **12.36 sec** as expected.
**Sample Query**
```scala
val n = 4000
val values = (1 to n).map(_.toString).mkString(", ")
val columns = (1 to n).map("column" + _).mkString(", ")
val query =
s"""
|SELECT $columns
|FROM VALUES ($values) T($columns)
|WHERE 1=2 AND 1 IN ($columns)
|GROUP BY $columns
|ORDER BY $columns
|""".stripMargin
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block
println("Elapsed time: " + ((System.nanoTime - t0) / 1e9) + "s")
result
}
```
**Before**
```scala
scala> time(sql(query))
Elapsed time: 30.138142577s // First query has a little overhead of initialization.
res0: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
scala> time(sql(query))
Elapsed time: 25.787751452s // Let's compare this one.
res1: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
```
**After**
```scala
scala> time(sql(query))
Elapsed time: 17.500279659s // First query has a little overhead of initialization.
res0: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
scala> time(sql(query))
Elapsed time: 12.364812255s // This shows the real difference. The speed up is about 2 times.
res1: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields]
```
## How was this patch tested?
Manual by the above script.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14044 from dongjoon-hyun/SPARK-16360.
## What changes were proposed in this pull request?
TypedAggregateExpression sets nullable based on the schema of the outputEncoder
## How was this patch tested?
Add test in DatasetAggregatorSuite
Author: Koert Kuipers <koert@tresata.com>
Closes#13532 from koertkuipers/feat-aggregator-nullable.
## What changes were proposed in this pull request?
This PR implements `inline` table generating function.
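Usage sketch (the default output column names are an assumption):
```scala
// inline explodes an array of structs into rows, one column per struct field.
sql("SELECT inline(array(struct(1, 'a'), struct(2, 'b')))").show()
// Assumed result: rows (1, a) and (2, b)
```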
## How was this patch tested?
Pass the Jenkins tests with new testcase.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13976 from dongjoon-hyun/SPARK-16288.
## What changes were proposed in this pull request?
This PR adds `map_keys` and `map_values` SQL functions in order to remove Hive fallback.
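Usage sketch:
```scala
// map_keys returns the keys of a map column; map_values returns its values.
sql("SELECT map_keys(map('a', 1, 'b', 2)), map_values(map('a', 1, 'b', 2))").show(false)
// Assumed result: [a, b] and [1, 2]
```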
## How was this patch tested?
Pass the Jenkins tests including new testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13967 from dongjoon-hyun/SPARK-16278.
#### What changes were proposed in this pull request?
Star expansion over a table containing zero columns has not worked since 1.6. However, it works in Spark 1.5.1. This PR fixes the issue in the master branch.
For example,
```scala
val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => Row.empty)
val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
dfNoCols.registerTempTable("temp_table_no_cols")
sqlContext.sql("select * from temp_table_no_cols").show
```
Without the fix, users will get the following the exception:
```
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:221)
at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
```
#### How was this patch tested?
Tests are added
Author: gatorsmile <gatorsmile@gmail.com>
Closes#14007 from gatorsmile/starExpansionTableWithZeroColumn.
## What changes were proposed in this pull request?
This PR fixes the minor Java linter errors like the following.
```
- public int read(char cbuf[], int off, int len) throws IOException {
+ public int read(char[] cbuf, int off, int len) throws IOException {
```
## How was this patch tested?
Manual.
```
$ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#14017 from dongjoon-hyun/minor_build_java_linter_error.
## What changes were proposed in this pull request?
In structured streaming, Spark does not report errors when the specified directory does not exist. This is a behavior different from the batch mode. This patch changes the behavior to fail if the directory does not exist (when the path is not a glob pattern).
## How was this patch tested?
Updated unit tests to reflect the new behavior.
Author: Reynold Xin <rxin@databricks.com>
Closes#14002 from rxin/SPARK-16335.
#### What changes were proposed in this pull request?
For JDBC data sources, users can specify `batchsize` for multi-row inserts and `fetchsize` for multi-row fetch. A few issues exist:
- The property keys are case sensitive. Thus, the existing test cases for `fetchsize` use incorrect names, `fetchSize`. Basically, the test cases are broken.
- No test case exists for `batchsize`.
- We do not detect the illegal input values for `fetchsize` and `batchsize`.
For example, when `batchsize` is zero, we got the following exception:
```
Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ArithmeticException: / by zero
```
when `fetchsize` is less than zero, we got the exception from the underlying JDBC driver:
```
Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.h2.jdbc.JdbcSQLException: Invalid value "-1" for parameter "rows" [90008-183]
```
This PR fixes all the above issues, and issues the appropriate exceptions when detecting illegal inputs for `fetchsize` and `batchsize`. It also updates the function descriptions.
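For illustration, a hedged sketch of where the two (lower-case) keys are used; the URL and table names are placeholders:
```scala
import java.util.Properties

// Read: `fetchsize` controls how many rows the JDBC driver fetches per round trip.
val df = spark.read
  .option("fetchsize", "100")
  .jdbc("jdbc:h2:mem:testdb", "people", new Properties)  // hypothetical URL/table

// Write: `batchsize` controls how many rows are sent per INSERT batch; it must be positive.
df.write
  .option("batchsize", "1000")
  .jdbc("jdbc:h2:mem:testdb", "people_copy", new Properties)
```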
#### How was this patch tested?
Test cases are fixed and added.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13919 from gatorsmile/jdbcProperties.
## What changes were proposed in this pull request?
This patch implements the elt function, as it is implemented in Hive.
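Usage sketch (elt returns the n-th string argument, following Hive):
```scala
sql("SELECT elt(2, 'scala', 'java', 'python')").show()  // assumed result: java
```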
## How was this patch tested?
Added expression unit test in StringExpressionsSuite and end-to-end test in StringFunctionsSuite.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#13966 from petermaxlee/SPARK-16276.
## What changes were proposed in this pull request?
Spark silently drops exceptions during file listing. This is a very bad behavior because it can mask legitimate errors and the resulting plan will silently have 0 rows. This patch changes it to not silently drop the errors.
## How was this patch tested?
Manually verified.
Author: Reynold Xin <rxin@databricks.com>
Closes#13987 from rxin/SPARK-16313.
## What changes were proposed in this pull request?
This patch appends a message to suggest users running refresh table or reloading data frames when Spark sees a FileNotFoundException due to stale, cached metadata.
## How was this patch tested?
Added a unit test for this in MetadataCacheSuite.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#14003 from petermaxlee/SPARK-16336.
## What changes were proposed in this pull request?
This PR implements the `posexplode` table generating function. Currently, the master branch raises the following exception for a `map` argument, which differs from Hive.
**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```
**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
| 0| a| 1|
| 1| b| 2|
+---+---+-----+
```
For an `array` argument, the behavior after this change is the same as before.
```
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
| 0| 1|
| 1| 2|
| 2| 3|
+---+---+
```
## How was this patch tested?
Pass the Jenkins tests with newly added testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13971 from dongjoon-hyun/SPARK-16289.
## What changes were proposed in this pull request?
Force the sorter to spill when the number of elements in the pointer array reaches a certain size. This works around the issue of TimSort failing on large buffer sizes.
## How was this patch tested?
Tested by running a job which was failing without this change due to TimSort bug.
Author: Sital Kedia <skedia@fb.com>
Closes#13107 from sitalkedia/fix_TimSort.
## What changes were proposed in this pull request?
Add the Catalog.refreshTable API to the Python interface for Spark SQL.
## How was this patch tested?
Existing test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13558 from WeichenXu123/update_python_sql_interface_refreshTable.
## What changes were proposed in this pull request?
This patch implements the xpath_boolean expression for Spark SQL, an xpath function that returns true or false. The implementation is modelled after Hive's xpath_boolean, except for how the expression handles null inputs. Hive throws a NullPointerException at runtime if either of the inputs is null. This implementation returns null if either of the inputs is null.
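Usage sketch (the null-handling shown follows the description above; treat the results as assumptions):
```scala
sql("SELECT xpath_boolean('<a><b>1</b></a>', 'a/b')").show()  // true
sql("SELECT xpath_boolean(NULL, 'a/b')").show()               // null rather than an NPE
```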
## How was this patch tested?
Created two new test suites. One for unit tests covering the expression, and the other for end-to-end test in SQL.
Author: petermaxlee <petermaxlee@gmail.com>
Closes#13964 from petermaxlee/SPARK-16274.
## What changes were proposed in this pull request?
After SPARK-15674, `DDLStrategy` prints out the following deprecation messages in the testsuites.
```
12:10:53.284 WARN org.apache.spark.sql.execution.SparkStrategies$DDLStrategy:
CREATE TEMPORARY TABLE normal_orc_source USING... is deprecated,
please use CREATE TEMPORARY VIEW viewName USING... instead
```
Total : 40
- JDBCWriteSuite: 14
- DDLSuite: 6
- TableScanSuite: 6
- ParquetSourceSuite: 5
- OrcSourceSuite: 2
- SQLQuerySuite: 2
- HiveCommandSuite: 2
- JsonSuite: 1
- PrunedScanSuite: 1
- FilteredScanSuite 1
This PR replaces `CREATE TEMPORARY TABLE` with `CREATE TEMPORARY VIEW` in order to remove the deprecation messages in the above testsuites except `DDLSuite`, `SQLQuerySuite`, `HiveCommandSuite`.
The Jenkins results show only 10 remaining messages.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61422/consoleFull
## How was this patch tested?
This is a testsuite-only change.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13956 from dongjoon-hyun/SPARK-16267.
## What changes were proposed in this pull request?
This PR adds 3 optimizer rules for typed filter:
1. push typed filter down through `SerializeFromObject` and eliminate the deserialization in filter condition.
2. pull typed filter up through `SerializeFromObject` and eliminate the deserialization in filter condition.
3. combine adjacent typed filters and share the deserialized object among all the condition expressions.
This PR also adds `TypedFilter` logical plan, to separate it from normal filter, so that the concept is more clear and it's easier to write optimizer rules.
## How was this patch tested?
`TypedFilterOptimizationSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13846 from cloud-fan/filter.
## What changes were proposed in this pull request?
This PR allows `emptyDataFrame.write` since the user didn't specify any partition columns.
**Before**
```scala
scala> spark.emptyDataFrame.write.parquet("/tmp/t1")
org.apache.spark.sql.AnalysisException: Cannot use all columns for partition columns;
scala> spark.emptyDataFrame.write.csv("/tmp/t1")
org.apache.spark.sql.AnalysisException: Cannot use all columns for partition columns;
```
After this PR, no exception occurs and the created directory has only one file, `_SUCCESS`, as expected.
## How was this patch tested?
Pass the Jenkins tests including updated test cases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13730 from dongjoon-hyun/SPARK-16006.
## What changes were proposed in this pull request?
This PR removes meaningless `StringIteratorReader` for CSV data source.
In `CSVParser.scala`, there is a `Reader` wrapping an `Iterator`, but this causes two problems.
Firstly, it was actually not faster than processing line by line with an Iterator, due to the additional logic to wrap the `Iterator` in a `Reader`.
Secondly, this brought a bit of complexity because it needs additional logic to allow every line to be read byte by byte. So, it was pretty difficult to figure out parsing issues (e.g. SPARK-14103).
A benchmark was performed manually and the results were below:
- Original codes with Reader wrapping Iterator
|End-to-end (ns) | Parse Time (ns) |
|-----------------------|------------------------|
|14116265034 |2008277960 |
- New codes with Iterator
|End-to-end (ns) | Parse Time (ns) |
|-----------------------|------------------------|
|13451699644 | 1549050564 |
For the details for the environment, dataset and methods, please refer the JIRA ticket.
## How was this patch tested?
Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13808 from HyukjinKwon/SPARK-14480-small.
#### What changes were proposed in this pull request?
Based on the previous discussion with cloud-fan hvanhovell in another related PR https://github.com/apache/spark/pull/13764#discussion_r67994276, it looks reasonable to add convenience methods for users to add `comment` when defining `StructField`.
Currently, the column-related `comment` attribute is stored in `Metadata` of `StructField`. For example, users can add the `comment` attribute using the following way:
```Scala
StructType(
StructField(
"cl1",
IntegerType,
nullable = false,
new MetadataBuilder().putString("comment", "test").build()) :: Nil)
```
This PR is to add more user friendly methods for the `comment` attribute when defining a `StructField`. After the changes, users are provided three different ways to do it:
```Scala
val struct = (new StructType)
.add("a", "int", true, "test1")
val struct = (new StructType)
.add("c", StringType, true, "test3")
val struct = (new StructType)
.add(StructField("d", StringType).withComment("test4"))
```
#### How was this patch tested?
Added test cases:
- `DataTypeSuite` is for testing three types of API changes,
- `DataFrameReaderWriterSuite` is for parquet, json and csv formats - using in-memory catalog
- `OrcQuerySuite.scala` is for orc format using Hive-metastore
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13860 from gatorsmile/newMethodForComment.
## What changes were proposed in this pull request?
`MAX(COUNT(*))` is invalid since an aggregate expression can't be nested within another aggregate expression. This case should be caught at the analysis phase, but somehow sneaks through to runtime.
The reason is that when checking aggregate expressions in `CheckAnalysis`, a checking branch treats all expressions that reference no input attributes as valid ones. However, `MAX(COUNT(*))` is translated into `MAX(COUNT(1))` at analysis phase and also references no input attribute.
This PR fixes this issue by removing the aforementioned branch.
## How was this patch tested?
New test case added in `AnalysisErrorSuite`.
Author: Cheng Lian <lian@databricks.com>
Closes#13968 from liancheng/spark-16291-nested-agg-functions.
## What changes were proposed in this pull request?
Change the return type mentioned in the JavaDoc for `toJavaRDD` / `javaRDD` to match the actual return type & be consistent with the scala rdd return type.
## How was this patch tested?
Docs only change.
Author: Holden Karau <holden@us.ibm.com>
Closes#13954 from holdenk/trivial-streaming-tojavardd-doc-fix.
## What changes were proposed in this pull request?
Fixes a couple of old references from `DataFrameWriter.startStream` to `DataStreamWriter.start`.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#13952 from brkyvz/minor-doc-fix.
## What changes were proposed in this pull request?
The root cause is in `MapObjects`. Its parameter `loopVar` is not declared as a child, but it can sometimes be the same as `lambdaFunction` (e.g. when the function that takes `loopVar` and produces `lambdaFunction` is `identity`), which is a child. This causes trouble when calling `withNewChildren`: it may mistakenly treat `loopVar` as a child and later fail with `IndexOutOfBoundsException: 0`.
This PR fixes the bug by simply pulling the parameters out of `LambdaVariable` and passing them to `MapObjects` directly.
## How was this patch tested?
new test in `DatasetAggregatorSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13835 from cloud-fan/map-objects.
#### What changes were proposed in this pull request?
koertkuipers identified the PR https://github.com/apache/spark/pull/13727/ changed the behavior of `load` API. After the change, the `load` API does not add the value of `path` into the `options`. Thank you!
This PR is to add the option `path` back to `load()` API in `DataFrameReader`, if and only if users specify one and only one `path` in the `load` API. For example, users can see the `path` option after the following API call,
```Scala
spark.read
  .format("parquet")
  .load("/test")
```
#### How was this patch tested?
Added test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13933 from gatorsmile/optionPath.
## What changes were proposed in this pull request?
The root cause is: the output attributes of an outer join are derived from its children, while they are actually different attributes (an outer join can return null).
We have already added some special logic to handle it, e.g. `PushPredicateThroughJoin` won't push down predicates through outer join side, `FixNullability`.
This PR adds one more special logic in `FoldablePropagation`.
## How was this patch tested?
new test in `DataFrameSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13884 from cloud-fan/bug.
## What changes were proposed in this pull request?
Allowing truncation to a specific number of characters is convenient at times, especially while operating from the REPL. Sometimes those last few characters make all the difference, and showing everything brings in a whole lot of noise.
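A hedged sketch of the intended usage; the exact overload added here is an assumption:
```scala
// assumed overload taking both a row count and a truncation length
df.show(20, 10)   // show 20 rows, truncating each cell to at most 10 characters
```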
## How was this patch tested?
Existing tests. + 1 new test in DataFrameSuite.
For SparkR and pyspark, existing tests and manual testing.
Author: Prashant Sharma <prashsh1@in.ibm.com>
Author: Prashant Sharma <prashant@apache.org>
Closes#13839 from ScrapCodes/add_truncateTo_DF.show.
#### What changes were proposed in this pull request?
The API description of `createRelation` in `CreatableRelationProvider` is misleading. The current description only expects users to return the relation.
```Scala
trait CreatableRelationProvider {
  def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation
}
```
However, the major goal of this API should also include saving the `DataFrame`.
Since this API is critical for Data Source API developers, this PR is to correct the description.
#### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13903 from gatorsmile/readUnderscoreFiles.
## What changes were proposed in this pull request?
[SPARK-8118](https://github.com/apache/spark/pull/8196) implements redirecting the Parquet JUL logger via SLF4J, but it is currently applied only when READ operations occur. If users perform only WRITE operations, many Parquet log messages still appear.
This PR makes the redirection work on WRITE operations, too.
**Before**
```scala
scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Jun 26, 2016 9:04:38 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
............ about 70 lines Parquet Log .............
scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
............ about 70 lines Parquet Log .............
```
**After**
```scala
scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
```
This PR also fixes some typos.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13918 from dongjoon-hyun/SPARK-16221.
## What changes were proposed in this pull request?
Spark currently shows all functions when a `SHOW FUNCTIONS` command is issued. This PR refines the `SHOW FUNCTIONS` command by allowing users to select all functions, user-defined functions, or system functions. The following syntax can be used:
**ALL** (default)
```SHOW FUNCTIONS```
```SHOW ALL FUNCTIONS```
**SYSTEM**
```SHOW SYSTEM FUNCTIONS```
**USER**
```SHOW USER FUNCTIONS```
## How was this patch tested?
Updated tests and added tests to the DDLSuite
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#13929 from hvanhovell/SPARK-16220.
## What changes were proposed in this pull request?
- Fix tests regarding show functions functionality
- Revert `catalog.ListFunctions` and `SHOW FUNCTIONS` to return to `Spark 1.X` functionality.
Cherry picked changes from this PR: https://github.com/apache/spark/pull/13413/files
## How was this patch tested?
Unit tests.
Author: Bill Chambers <bill@databricks.com>
Author: Bill Chambers <wchambers@ischool.berkeley.edu>
Closes#13916 from anabranch/master.
## What changes were proposed in this pull request?
This PR adds a testcase to ensure if `checkAnswer` handles Map type correctly.
## How was this patch tested?
Pass the jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13913 from dongjoon-hyun/SPARK-10591.
## What changes were proposed in this pull request?
Add `conf` method to get Runtime Config from SparkSession
## How was this patch tested?
unit tests, manual tests
This is how it works in sparkR shell:
```
SparkSession available as 'spark'.
> conf()
$hive.metastore.warehouse.dir
[1] "file:/opt/spark-2.0.0-bin-hadoop2.6/R/spark-warehouse"
$spark.app.id
[1] "local-1466749575523"
$spark.app.name
[1] "SparkR"
$spark.driver.host
[1] "10.0.2.1"
$spark.driver.port
[1] "45629"
$spark.executorEnv.LD_LIBRARY_PATH
[1] "$LD_LIBRARY_PATH:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/jre/lib/amd64/server"
$spark.executor.id
[1] "driver"
$spark.home
[1] "/opt/spark-2.0.0-bin-hadoop2.6"
$spark.master
[1] "local[*]"
$spark.sql.catalogImplementation
[1] "hive"
$spark.submit.deployMode
[1] "client"
> conf("spark.master")
$spark.master
[1] "local[*]"
```
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13885 from felixcheung/rconf.
## What changes were proposed in this pull request?
Currently the initial buffer size in the sorter is hard-coded and is too small for large workloads. As a result, the sorter spends significant time expanding the buffer and copying the data. It would be useful to have it configurable.
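A hedged sketch of supplying such a setting when building the application's configuration; the property name is an assumption made for illustration, not taken from this description:
```scala
import org.apache.spark.SparkConf

// hypothetical property name -- an assumption for illustration only
val conf = new SparkConf().set("spark.shuffle.sort.initialBufferSize", "4194304")
```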
## How was this patch tested?
Tested by running a job on the cluster.
Author: Sital Kedia <skedia@fb.com>
Closes#13699 from sitalkedia/config_sort_buffer_upstream.
## What changes were proposed in this pull request?
One of the most frequent usage patterns for Spark SQL is using **cached tables**. This PR improves `InMemoryTableScanExec` to handle the `IN` predicate efficiently by pruning partition batches. Of course, the performance improvement varies with the query and the dataset. But, for the following simple query, the query duration in the Spark UI goes from 9 seconds to 50~90 ms, roughly 100 times faster.
**Before**
```scala
$ bin/spark-shell --driver-memory 6G
scala> val df = spark.range(2000000000)
scala> df.createOrReplaceTempView("t")
scala> spark.catalog.cacheTable("t")
scala> sql("select id from t where id = 1").collect() // About 2 mins
scala> sql("select id from t where id = 1").collect() // less than 90ms
scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds
```
**After**
```scala
scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
```
This PR has impacts over 35 queries of TPC-DS if the tables are cached.
Note that this optimization is applied to `IN`. To apply it to `IN` predicates with more than 10 items, the `spark.sql.optimizer.inSetConversionThreshold` option should be increased.
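A hedged sketch of raising that threshold so larger `IN` lists still benefit (the option name is taken from the sentence above):
```scala
// allow IN lists with more than the default 10 items to be converted
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "20")
sql("select id from t where id in (1,2,3,4,5,6,7,8,9,10,11,12)").collect()
```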
## How was this patch tested?
Pass the Jenkins tests (including new testcases).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13887 from dongjoon-hyun/SPARK-16186.
## What changes were proposed in this pull request?
`CollectSet` cannot have map-typed data because MapTypeData does not implement `equals`.
So, this PR adds type checks in `CheckAnalysis`.
## How was this patch tested?
Added tests to check failures when we found map-typed data in `CollectSet`.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#13892 from maropu/SPARK-16192.
## What changes were proposed in this pull request?
Allow specifying an empty OVER clause in window expressions through the Dataset API.
In SQL, it is allowed to specify an empty OVER clause in a window expression.
```SQL
select area, sum(product) over () as c from windowData
where product > 3 group by area, product
having avg(month) > 0 order by avg(month), product
```
In this case the analytic function `sum` is computed over all the rows of the result set.
Currently this is not allowed through the Dataset API; this PR handles it (see the sketch below).
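A hedged Scala sketch of the same query's windowing through the Dataset API; whether this PR exposes it as a no-argument `over()` or via an empty window specification is an assumption, so the empty-specification form is shown (`windowData` and `spark.implicits._` are assumed to be in scope):
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// an empty window specification: no partitioning and no ordering, i.e. all rows
val emptyWindow = Window.partitionBy()
windowData.select($"area", sum($"product").over(emptyWindow).as("c"))
```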
## How was this patch tested?
Added a new test in DataframeWindowSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#13897 from dilipbiswal/spark-empty-over.
## What changes were proposed in this pull request?
This PR fixes `DataFrame.describe()` by forcing materialization to make the `Seq` serializable. Currently, `describe()` of DataFrame throws `Task not serializable` Spark exceptions when joining in Scala 2.10.
## How was this patch tested?
Manual. (After building with Scala 2.10, test on `bin/spark-shell` and `bin/pyspark`.)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13900 from dongjoon-hyun/SPARK-16173.
## What changes were proposed in this pull request?
This PR fixes a bug that occurs when a Python UDF is used in `explode` (a generator): `GenerateExec` requires that all attributes in its expressions be resolvable from its children at creation time, so we should replace the children first and then replace its expressions.
```
>>> df.select(explode(f(*df))).show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/vlad/dev/spark/python/pyspark/sql/dataframe.py", line 286, in show
print(self._jdf.showString(n, truncate))
File "/home/vlad/dev/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/home/vlad/dev/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/vlad/dev/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o52.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
Generate explode(<lambda>(_1#0L)), false, false, [col#15L]
+- Scan ExistingRDD[_1#0L]
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:69)
at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:45)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:177)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:153)
at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:95)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:95)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:95)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:85)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:85)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2557)
at org.apache.spark.sql.Dataset.head(Dataset.scala:1923)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2138)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$13.apply(TreeNode.scala:413)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$13.apply(TreeNode.scala:413)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:412)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:387)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
... 42 more
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#20
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
at org.apache.spark.sql.execution.GenerateExec.<init>(GenerateExec.scala:63)
... 52 more
Caused by: java.lang.RuntimeException: Couldn't find pythonUDF0#20 in [_1#0L]
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
... 67 more
```
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#13883 from davies/udf_in_generate.
## What changes were proposed in this pull request?
This is a small patch to rewrite the predicate filter translation in DataSourceStrategy. The original code used excessive functional constructs (e.g. unzip) and was very difficult to understand.
## How was this patch tested?
Should be covered by existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#13889 from rxin/simplify-predicate-filter.
## What changes were proposed in this pull request?
Replace use of `commons-lang` in favor of `commons-lang3` and forbid the former via scalastyle; remove `NotImplementedException` from `commons-lang` in favor of the JDK's `UnsupportedOperationException`.
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#13843 from srowen/SPARK-16129.
## What changes were proposed in this pull request?
It's weird that `ParserUtils.operationNotAllowed` returns an exception and the caller throws it.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13874 from cloud-fan/style.
## What changes were proposed in this pull request?
This patch fixes an overflow bug in vectorized parquet reader where both off-heap and on-heap variants of `ColumnVector.reserve()` can unfortunately overflow while reserving additional capacity during reads.
## How was this patch tested?
Manual Tests
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13832 from sameeragarwal/negative-array.
## What changes were proposed in this pull request?
Currently, the `readBatches` accumulator of `InMemoryTableScanExec` is updated only when `spark.sql.inMemoryColumnarStorage.partitionPruning` is true. Although this metric is used only for testing purposes, it should be correct regardless of SQL options.
## How was this patch tested?
Pass the Jenkins tests (including a new testcase).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13870 from dongjoon-hyun/SPARK-16165.
## What changes were proposed in this pull request?
This calculation of statistics is no longer trivial; it can be very slow for large queries (for example, TPC-DS Q64 took several minutes to plan).
During the planning of a query, the statistics of any logical plan should not change (even for InMemoryRelation), so we should use `lazy val` to cache the statistics.
For InMemoryRelation, the statistics can be updated after materialization; this is only useful when the relation is used in another query (before planning), because once planning is finished the statistics are not used anymore.
## How was this patch tested?
Tested with TPC-DS Q64; it could be planned in a second after the patch.
Author: Davies Liu <davies@databricks.com>
Closes#13871 from davies/fix_statistics.
## What changes were proposed in this pull request?
When the user uses `ConsoleSink`, we should use a temp location if `checkpointLocation` is not specified.
## How was this patch tested?
The added unit test.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13817 from zsxwing/console-checkpoint.
## What changes were proposed in this pull request?
When a table is created with a column name containing a dot, distinct() will fail to run. For example,
```scala
val rowRDD = sparkContext.parallelize(Seq(Row(1), Row(1), Row(2)))
val schema = StructType(Array(StructField("column.with.dot", IntegerType, nullable = false)))
val df = spark.createDataFrame(rowRDD, schema)
```
running the following will have no problem:
```scala
df.select(new Column("`column.with.dot`"))
```
but running the query with an additional distinct() will cause an exception:
```scala
df.select(new Column("`column.with.dot`")).distinct()
```
The issue is that distinct() will try to resolve the column name, but the column name in the schema does not have backticks around it. So the solution is to add the backticks before passing the column name to resolve().
## How was this patch tested?
Added a new test case.
Author: bomeng <bmeng@us.ibm.com>
Closes#13140 from bomeng/SPARK-15230.
## What changes were proposed in this pull request?
We embed partitioning logic in FileSourceStrategy.apply, making the function very long. This is a small refactoring to move it into its own functions. Eventually we would be able to move the partitioning functions into a physical operator, rather than doing it in physical planning.
## How was this patch tested?
This is a simple code move.
Author: Reynold Xin <rxin@databricks.com>
Closes#13862 from rxin/SPARK-16159.
#### What changes were proposed in this pull request?
This PR is to improve test coverage. It verifies whether the `Comment` of a `Column` is handled appropriately.
The test cases verify the related parts in Parser, both SQL and DataFrameWriter interface, and both Hive Metastore catalog and In-memory catalog.
#### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13764 from gatorsmile/dataSourceComment.
## What changes were proposed in this pull request?
Although the top-level input object cannot be null, when we use `Encoders.tuple` to combine two encoders, their input objects are no longer top level and can be null. We should handle this case.
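A hedged sketch of the scenario: a tuple encoder built from two component encoders, where each component is no longer the top-level object and may therefore be null (the encoder choices are illustrative):
```scala
import org.apache.spark.sql.Encoders

// the String component of the tuple can now be null, even though a top-level input could not be
val tupleEncoder = Encoders.tuple(Encoders.STRING, Encoders.scalaLong)
```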
## How was this patch tested?
new test in DatasetSuite
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13807 from cloud-fan/bug.
## What changes were proposed in this pull request?
It seems the fix for SPARK-14959 breaks parallel partition discovery. This PR fixes the problem.
## How was this patch tested?
Tested manually. (This PR also adds a proper test for SPARK-14959)
Author: Yin Huai <yhuai@databricks.com>
Closes#13830 from yhuai/SPARK-16121.
#### What changes were proposed in this pull request?
This PR is to use the latest `SparkSession` to replace the existing `SQLContext` in `MLlib`. `SQLContext` is removed from `MLlib`.
Also fix a test case issue in `BroadcastJoinSuite`.
BTW, `SQLContext` is not being used in the `MLlib` test suites.
#### How was this patch tested?
Existing test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13380 from gatorsmile/sqlContextML.
## What changes were proposed in this pull request?
This PR makes the `CsvWriter` object reusable instead of creating a new one each time; the approach was adopted from the JSON data source.
Originally a `CsvWriter` was created for each row, which was improved in https://github.com/apache/spark/pull/13229. However, a `CsvWriter` object is still created for each `flush()` in `LineCsvWriter`, and it does not need to be closed and re-created for every flush.
The original logic is followed as is, but the `CsvWriter` is reused by resetting the `CharArrayWriter`.
## How was this patch tested?
Existing tests should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13809 from HyukjinKwon/write-perf.
## What changes were proposed in this pull request?
Add a configuration to allow people to set a minimum polling delay when no new data arrives (default is 10ms). This PR also cleans up some INFO logs.
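A hedged sketch of tuning such a delay; the property name here is an assumption made for illustration, not confirmed by this description:
```scala
// hypothetical property name -- raise the minimum polling delay when no new data arrives
spark.conf.set("spark.sql.streaming.pollingDelay", "50ms")
```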
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13718 from zsxwing/SPARK-16002.
## What changes were proposed in this pull request?
1. FORMATTED is actually supported, but partition is not supported;
2. Remove the parentheses, as they are not necessary, just like anywhere else.
## How was this patch tested?
Minor issue. I do not think it needs a test case!
Author: bomeng <bmeng@us.ibm.com>
Closes#13791 from bomeng/SPARK-16084.
## What changes were proposed in this pull request?
This PR makes `input_file_name()` function return the file paths not empty strings for external data sources based on `NewHadoopRDD`, such as [spark-redshift](cba5eee1ab/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala (L149)) and [spark-xml](https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47).
The code below, used with those external data sources:
```scala
df.select(input_file_name).show()
```
will produce
- **Before**
```
+-----------------+
|input_file_name()|
+-----------------+
| |
+-----------------+
```
- **After**
```
+--------------------+
| input_file_name()|
+--------------------+
|file:/private/var...|
+--------------------+
```
## How was this patch tested?
Unit tests in `ColumnExpressionSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13759 from HyukjinKwon/SPARK-16044.
#### What changes were proposed in this pull request?
This PR is to fix the following bugs:
**Issue 1: Wrong Results when lowerBound is larger than upperBound in Column Partitioning**
```scala
spark.read.jdbc(
  url = urlWithUserAndPass,
  table = "TEST.seq",
  columnName = "id",
  lowerBound = 4,
  upperBound = 0,
  numPartitions = 3,
  connectionProperties = new Properties)
```
**Before code changes:**
The returned results are wrong and the generated partitions are wrong:
```
Part 0 id < 3 or id is null
Part 1 id >= 3 AND id < 2
Part 2 id >= 2
```
**After code changes:**
Issue an `IllegalArgumentException` exception:
```
Operation not allowed: the lower bound of partitioning column is larger than the upper bound. lowerBound: 5; higherBound: 1
```
**Issue 2: numPartitions is more than the number of key values between upper and lower bounds**
```scala
spark.read.jdbc(
  url = urlWithUserAndPass,
  table = "TEST.seq",
  columnName = "id",
  lowerBound = 1,
  upperBound = 5,
  numPartitions = 10,
  connectionProperties = new Properties)
```
**Before code changes:**
Returned correct results but the generated partitions are very inefficient, like:
```
Partition 0: id < 1 or id is null
Partition 1: id >= 1 AND id < 1
Partition 2: id >= 1 AND id < 1
Partition 3: id >= 1 AND id < 1
Partition 4: id >= 1 AND id < 1
Partition 5: id >= 1 AND id < 1
Partition 6: id >= 1 AND id < 1
Partition 7: id >= 1 AND id < 1
Partition 8: id >= 1 AND id < 1
Partition 9: id >= 1
```
**After code changes:**
`numPartitions` is adjusted and the correct answers are returned:
```
Partition 0: id < 2 or id is null
Partition 1: id >= 2 AND id < 3
Partition 2: id >= 3 AND id < 4
Partition 3: id >= 4
```
**Issue 3: java.lang.ArithmeticException when numPartitions is zero**
```Scala
spark.read.jdbc(
  url = urlWithUserAndPass,
  table = "TEST.seq",
  columnName = "id",
  lowerBound = 0,
  upperBound = 4,
  numPartitions = 0,
  connectionProperties = new Properties)
```
**Before code changes:**
Got the following exception:
```
java.lang.ArithmeticException: / by zero
```
**After code changes:**
Able to return a correct answer by disabling column partitioning when numPartitions is equal to or less than zero
#### How was this patch tested?
Added test cases to verify the results
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13773 from gatorsmile/jdbcPartitioning.
## What changes were proposed in this pull request?
This pull request adds a new option (`maxMalformedLogPerPartition`) to the CSV reader to limit the maximum number of log messages Spark generates per partition for malformed records.
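A hedged sketch of setting the new option (the option name is taken from the sentence above; the path is made up):
```scala
spark.read
  .option("maxMalformedLogPerPartition", "10")
  .csv("/path/to/data.csv")
```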
The error log looks something like
```
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged.
```
Closes#12173
## How was this patch tested?
Manually tested.
Author: Reynold Xin <rxin@databricks.com>
Closes#13795 from rxin/SPARK-13792.
## What changes were proposed in this pull request?
The property spark.streaming.stateStore.maintenanceInterval should be renamed and harmonized with other properties related to Structured Streaming like spark.sql.streaming.stateStore.minDeltasForSnapshot.
## How was this patch tested?
Existing unit tests.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#13777 from sarutak/SPARK-16061.
## What changes were proposed in this pull request?
Issues with current reader behavior.
- `text()` without args returns an empty DF with no columns -> inconsistent; it's expected that `text()` will always return a DF with a `value` string field,
- `textFile()` without args fails with exception because of the above reason, it expected the DF returned by `text()` to have a `value` field.
- `orc()` does not have var args, inconsistent with others
- `json(single-arg)` was removed, but that caused source compatibility issues - [SPARK-16009](https://issues.apache.org/jira/browse/SPARK-16009)
- user specified schema was not respected when `text/csv/...` were used with no args - [SPARK-16007](https://issues.apache.org/jira/browse/SPARK-16007)
The solution I am implementing is to do the following.
- For each format, there will be a single-argument method and a vararg method. For json, parquet, csv, and text, this means adding json(string), etc.; for orc, this means adding orc(varargs). See the sketch after this list.
- Remove the special handling of text(), csv(), etc. that returns an empty dataframe with no fields. Rather, pass on the empty sequence of paths to the data source, and let each data source handle it correctly. For example, the text data source should return an empty DF with schema (value: string)
- Deduped docs and fixed their formatting.
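A hedged sketch of the resulting reader surface described in the first bullet (the paths are made up):
```scala
spark.read.json("/data/one.json")                   // single-path method
spark.read.json("/data/a.json", "/data/b.json")     // varargs method
spark.read.orc("/data/a.orc", "/data/b.orc")        // orc gains a varargs method too
```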
## How was this patch tested?
Added new unit tests for Scala and Java tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13727 from tdas/SPARK-15982.
## What changes were proposed in this pull request?
ConsoleSinkSuite just collects content from stdout and compare them with the expected string. However, because Spark may not stop some background threads at once, there is a race condition that other threads are outputting logs to **stdout** while ConsoleSinkSuite is running. Then it will make ConsoleSinkSuite fail.
Therefore, I just deleted `ConsoleSinkSuite`. If we want to test it in the future, we should refactor `ConsoleSink` to make it testable instead of depending on stdout. For now, the test is unreliable, so it is simply removed.
## How was this patch tested?
Just removed a flaky test.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13776 from zsxwing/SPARK-16050.
## What changes were proposed in this pull request?
This PR adds static partition support to the INSERT statement when the target table is a data source table.
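For illustration only (the table, column, and partition names are made up), a static-partition INSERT of the kind this change enables for data source tables might look like:
```sql
INSERT INTO TABLE sales PARTITION (dt = '2016-06-20')
SELECT id, amount FROM staging
```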
## How was this patch tested?
New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite.
**Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change.**
Author: Yin Huai <yhuai@databricks.com>
Closes#13769 from yhuai/SPARK-16030-1.
## What changes were proposed in this pull request?
This patch adds a text-based socket source similar to the one in Spark Streaming for debugging and tutorials. The source is clearly marked as debug-only so that users don't try to run it in production applications, because this type of source cannot provide HA without storing a lot of state in Spark.
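A hedged sketch of reading from this debug-only source (the host and port are illustrative):
```scala
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()
```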
## How was this patch tested?
Unit tests and manual tests in spark-shell.
Author: Matei Zaharia <matei@databricks.com>
Closes#13748 from mateiz/socket-source.
## What changes were proposed in this pull request?
`DataFrameWriter` can be used to append data to existing data source tables. It becomes tricky when partition columns used in `DataFrameWriter.partitionBy(columns)` don't match the actual partition columns of the underlying table. This pull request enforces the check so that the partition columns of these two always match.
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13749 from clockfly/SPARK-16034.
## What changes were proposed in this pull request?
The current table insertion has some weird behaviours:
1. inserting into a partitioned table with mismatched columns produces a confusing error message for Hive tables and wrong results for data source tables;
2. inserting into a partitioned table without a partition list produces wrong results for Hive tables.
This PR fixes these 2 problems.
## How was this patch tested?
new test in hive `SQLQuerySuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13754 from cloud-fan/insert2.
## What changes were proposed in this pull request?
Improve readability of `InMemoryTableScanExec.scala`, which has too much stuff in it.
## How was this patch tested?
Jenkins
Author: Andrew Or <andrew@databricks.com>
Closes#13742 from andrewor14/move-inmemory-relation.
## What changes were proposed in this pull request?
We cannot use `limit` on DataFrame in ConsoleSink because it will use a wrong planner. This PR just collects `DataFrame` and calls `show` on a batch DataFrame based on the result. This is fine since ConsoleSink is only for debugging.
## How was this patch tested?
Manually confirmed ConsoleSink now works with complete mode aggregation.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13740 from zsxwing/complete-console.
## What changes were proposed in this pull request?
This PR introduces the new SparkSession API for SparkR.
`sparkR.session.getOrCreate()` and `sparkR.session.stop()`
"getOrCreate" is a bit unusual in R but it's important to name this clearly.
SparkR implementation should
- SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR)
- SparkSession replaces SQLContext and HiveContext (both a wrapper around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
- Changes to SparkSession are mostly transparent to users due to SPARK-10903
- Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a deprecation warning
- Mostly cosmetic changes to parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
- An advanced syntax with named parameters (aka varargs aka "...") is supported; that should be closer to the Builder syntax that is in Scala/Python (which unfortunately does not work in R because it will look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value"))`
- Updating config on an existing SparkSession is supported, the behavior is the same as Python, in which config is applied to both SparkContext and SparkSession
- Some SparkSession changes are not matched in SparkR, mostly because it would be breaking API change: `catalog` object, `createOrReplaceTempView`
- Other SQLContext workarounds are replicated in SparkR, eg. `tables`, `tableNames`
- `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scala/Python)
- All tests are updated to use the SparkSession entrypoint
- A bug in `read.jdbc` is fixed
TODO
- [x] Add more tests
- [ ] Separate PR - update all roxygen2 doc coding example
- [ ] Separate PR - update SparkR programming guide
## How was this patch tested?
unit tests, manual tests
shivaram sun-rui rxin
Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#13635 from felixcheung/rsparksession.
## What changes were proposed in this pull request?
When inserting into an existing partitioned table, partitioning columns should always be determined by catalog metadata of the existing table to be inserted. Extra `partitionBy()` calls don't make sense, and mess up existing data because newly inserted data may have wrong partitioning directory layout.
## How was this patch tested?
New test case added in `InsertIntoHiveTableSuite`.
Author: Cheng Lian <lian@databricks.com>
Closes#13747 from liancheng/spark-16033-insert-into-without-partition-by.
## What changes were proposed in this pull request?
This PR fixes the problem that the precedence order is messed when pushing where-clause expression to JDBC layer.
**Case 1:**
For the SQL `select * from table where (a or b) and c`, the where-clause is wrongly converted to the JDBC where-clause `a or (b and c)` after filter push-down. The consequence is that JDBC may return fewer or more rows than expected.
**Case 2:**
For sql `select * from table where always_false_condition`, the result table may not be empty if the JDBC RDD is partitioned using where-clause:
```
spark.read.jdbc(url, table, predicates = Array("partition 1 where clause", "partition 2 where clause", ...))
```
## How was this patch tested?
Unit test.
This PR also closes#13640
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13743 from clockfly/SPARK-15916.
## What changes were proposed in this pull request?
My fault -- these 2 conf entries are mysteriously hidden inside the benchmark code, which makes it non-obvious how to disable whole-stage codegen and/or the vectorized parquet reader.
PS: Didn't attach a JIRA as this change should otherwise be a no-op (both these conf are enabled by default in Spark)
## How was this patch tested?
N/A
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13726 from sameeragarwal/tpcds-conf.
## What changes were proposed in this pull request?
Iterators can't be serialized in Scala 2.10, so we should force the result into an array to make sure it can be serialized.
## How was this patch tested?
Build with Scala 2.10 and ran all the Python unit tests manually (will be covered by a jenkins build).
Author: Davies Liu <davies@databricks.com>
Closes#13717 from davies/fix_udf_210.
## What changes were proposed in this pull request?
`UTF8String` and all `Unsafe*` classes are backed by either on-heap or off-heap byte arrays. The code-generated version of `SortMergeJoin` buffers the left-hand-side join keys during iteration. This was problematic in off-heap mode when one of the keys was a `UTF8String` (or any other `Unsafe*` object) and the left-hand-side iterator was exhausted (and released its memory); the buffered keys would then reference freed memory. This causes segfaults and all kinds of other undefined behavior when one of these buffered keys is used.
This PR fixes this problem by creating copies of the buffered variables. I have added a general method to the `CodeGenerator` for this. I have checked all places in which this could happen, and only `SortMergeJoin` had this problem.
This PR is largely based on the work of robbinspg and he should be credited for this.
closes https://github.com/apache/spark/pull/13707
## How was this patch tested?
Manually tested on problematic workloads.
Author: Pete Robbins <robbinspg@gmail.com>
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#13723 from hvanhovell/SPARK-15822-2.
## What changes were proposed in this pull request?
Before this patch, after a SparkSession has been created, hadoop conf set directly to SparkContext.hadoopConfiguration will not affect the hadoop conf created by SessionState. This patch makes the change to always use SparkContext.hadoopConfiguration as the base.
This patch also changes the behavior of hive-site.xml support added in https://github.com/apache/spark/pull/12689/. With this patch, we will load hive-site.xml to SparkContext.hadoopConfiguration.
## How was this patch tested?
New test in SparkSessionBuilderSuite.
Author: Yin Huai <yhuai@databricks.com>
Closes#13711 from yhuai/SPARK-15991.
## What changes were proposed in this pull request?
For table test1 (C1 varchar (10), C2 varchar (10)), when I insert a row using
```
sqlContext.sql("insert into test1 values ('abc', 'def', 1)")
```
I got error message
```
Exception in thread "main" java.lang.RuntimeException: RelationC1#0,C2#1 JDBCRelation(test1)
requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement
generates the same number of columns as its schema.
```
The error message is a little confusing. My simple insert statement doesn't have a SELECT clause.
I will change the error message to a more general one
```
Exception in thread "main" java.lang.RuntimeException: RelationC1#0,C2#1 JDBCRelation(test1)
requires that the data to be inserted have the same number of columns as the target table.
```
## How was this patch tested?
I tested the patch using my simple unit test, but it's a very trivial change and I don't think I need to check in any test.
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#13492 from huaxingao/spark-15749.
## What changes were proposed in this pull request?
This PR contains a few changes on code comments.
- `HiveTypeCoercion` is renamed into `TypeCoercion`.
- `NoSuchDatabaseException` is only used for the absence of database.
- For partition type inference, only `DoubleType` is considered.
## How was this patch tested?
N/A
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13674 from dongjoon-hyun/minor_doc_types.
## What changes were proposed in this pull request?
This PR fixes some minor `.toString` format issues for `HashAggregateExec`.
Before:
```
*HashAggregate(key=[a#234L,b#235L], functions=[count(1),max(c#236L)], output=[a#234L,b#235L,count(c)#247L,max(c)#248L])
```
After:
```
*HashAggregate(keys=[a#234L, b#235L], functions=[count(1), max(c#236L)], output=[a#234L, b#235L, count(c)#247L, max(c)#248L])
```
## How was this patch tested?
Manually tested.
Author: Cheng Lian <lian@databricks.com>
Closes#13710 from liancheng/minor-agg-string-fix.
## What changes were proposed in this pull request?
`TRUNCATE TABLE` is currently broken for Spark specific datasource tables (json, csv, ...). This PR correctly sets the location for these datasources which allows them to be truncated.
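For illustration (the table name is made up), the kind of statement this fix makes work for such tables:
```sql
TRUNCATE TABLE json_backed_table
```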
## How was this patch tested?
Extended the datasources `TRUNCATE TABLE` tests in `DDLSuite`.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#13697 from hvanhovell/SPARK-15977.
## What changes were proposed in this pull request?
Interface method `FileFormat.prepareRead()` was added in #12088 to handle a special case in the LibSVM data source.
However, the semantics of this interface method isn't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be easily handled using schema metadata inside `inferSchema`, we can remove this interface method to keep the `FileFormat` interface clean.
## How was this patch tested?
Existing tests.
Author: Cheng Lian <lian@databricks.com>
Closes#13698 from liancheng/remove-prepare-read.
#### What changes were proposed in this pull request?
~~If the temp table already exists, we should not silently replace it when doing `CACHE TABLE AS SELECT`. This is inconsistent with the behavior of `CREATE VIEW` or `CREATE TABLE`. This PR is to fix this silent drop.~~
~~Maybe, we also can introduce new syntax for replacing the existing one. For example, in Hive, to replace a view, the syntax should be like `ALTER VIEW AS SELECT` or `CREATE OR REPLACE VIEW AS SELECT`~~
The table name in `CACHE TABLE AS SELECT` should NOT contain database prefix like "database.table". Thus, this PR captures this in Parser and outputs a better error message, instead of reporting the view already exists.
In addition, this PR refactors the `Parser` to generate table identifiers instead of returning the table name as a string.
#### How was this patch tested?
- Added a test case for caching and uncaching qualified table names
- Fixed a few test cases that do not drop temp table at the end
- Added the related test case for the issue resolved in this PR
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13572 from gatorsmile/cacheTableAsSelect.
## What changes were proposed in this pull request?
gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.
Please let me know what you think and if you have any ideas to improve it.
Thank you!
## How was this patch tested?
Unit tests.
1. Primitive test with different column types
2. Add a boolean column
3. Compute average by a group
Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Author: NarineK <narine.kokhlikyan@us.ibm.com>
Closes#12836 from NarineK/gapply2.
## What changes were proposed in this pull request?
We currently immediately execute `INSERT` commands when they are issued. This is not the case as soon as we use a `WITH` to define common table expressions, for example:
```sql
WITH
tbl AS (SELECT * FROM x WHERE id = 10)
INSERT INTO y
SELECT *
FROM tbl
```
This PR fixes this problem. This PR closes https://github.com/apache/spark/pull/13561 (which fixes an instance of this problem in the ThriftServer).
## How was this patch tested?
Added a test to `InsertSuite`
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#13678 from hvanhovell/SPARK-15824.
## What changes were proposed in this pull request?
This patch brings https://github.com/apache/spark/pull/11373 up-to-date and increments the record count for JDBC data source.
Closes#11373.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13694 from rxin/SPARK-13498.
## What changes were proposed in this pull request?
This patch renames various Parquet support classes from CatalystAbc to ParquetAbc. This new naming makes more sense for two reasons:
1. These are not optimizer related (i.e. Catalyst) classes.
2. We are in the Spark code base, and as a result it'd be clearer to call out that these are Parquet support classes, rather than some Spark classes.
## How was this patch tested?
Renamed test cases as well.
Author: Reynold Xin <rxin@databricks.com>
Closes#13696 from rxin/parquet-rename.