## What changes were proposed in this pull request?
This PR replaces `spark.sql.sources.` strings with `CreateDataSourceTableUtils.*` constant variables.
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13349 from dongjoon-hyun/SPARK-15584.
#### What changes were proposed in this pull request?
The default value of `spark.sql.warehouse.dir` is `System.getProperty("user.dir")/spark-warehouse`. Since `System.getProperty("user.dir")` is a local dir, we should explicitly set the scheme to local filesystem.
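A minimal sketch of the idea (not the actual Spark code): prefix the default warehouse location with an explicit `file:` scheme so it always resolves to the local filesystem.
```scala
// Sketch only: build the default warehouse path with an explicit local-filesystem scheme.
val userDir = System.getProperty("user.dir")
val defaultWarehouse = "file:" + new java.io.File(userDir, "spark-warehouse").getAbsolutePath
```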
cc yhuai
#### How was this patch tested?
Added two test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13348 from gatorsmile/addSchemeToDefaultWarehousePath.
#### What changes were proposed in this pull request?
This PR uses the new entry point `SparkSession` to replace the existing `SQLContext` and `HiveContext` in SQL test suites.
No change is made in the following suites:
- `ListTablesSuite` is to test the APIs of `SQLContext`.
- `SQLContextSuite` is to test `SQLContext`
- `HiveContextCompatibilitySuite` is to test `HiveContext`
**Update**: Move tests in `ListTableSuite` to `SQLContextSuite`
#### How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13337 from gatorsmile/sparkSessionTest.
## What changes were proposed in this pull request?
`a` -> `an`
I use regex to generate potential error lines:
`grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
and review them line by line.
## How was this patch tested?
local build
`lint-java` checking
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13317 from zhengruifeng/a_an.
## What changes were proposed in this pull request?
Certain table properties (and SerDe properties) are in the protected namespace `spark.sql.sources.`, which we use internally for datasource tables. The user should not be allowed to
(1) Create a Hive table setting these properties
(2) Alter these properties in an existing table
Previously, we threw an exception if the user tried to alter the properties of an existing datasource table. However, this is overly restrictive for datasource tables and does not do anything for Hive tables.
## How was this patch tested?
DDLSuite
Author: Andrew Or <andrew@databricks.com>
Closes#13341 from andrewor14/alter-table-props.
## What changes were proposed in this pull request?
Two more changes:
(1) Fix truncate table for data source tables (only for cases without `PARTITION`)
(2) Disallow truncating external tables or views
## How was this patch tested?
`DDLSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#13315 from andrewor14/truncate-table.
## What changes were proposed in this pull request?
This PR changes SQLContext/HiveContext's public constructors to use SparkSession.builder.getOrCreate and removes isRootContext from SQLContext.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#13310 from yhuai/SPARK-15532.
## What changes were proposed in this pull request?
This PR addresses two related issues:
1. `Dataset.showString()` should show case classes/Java beans at all levels as rows, while master code only handles top level ones.
2. `Dataset.showString()` should show the full contents produced by the underlying query plan
Dataset is only a view of the underlying query plan. Columns not referred by the encoder are still reachable using methods like `Dataset.col`. So it probably makes more sense to show full contents of the query plan.
## How was this patch tested?
Two new test cases are added in `DatasetSuite` to check `.showString()` output.
Author: Cheng Lian <lian@databricks.com>
Closes#13331 from liancheng/spark-15550-ds-show.
## What changes were proposed in this pull request?
Add a more verbose error message when the ORDER BY clause is missing for a window function.
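A hypothetical illustration of the situation (table and column names are made up; an active SparkSession `spark` is assumed): `row_number()` is ordering-sensitive, so the window spec must include ORDER BY.
```scala
spark.sql("SELECT row_number() OVER (PARTITION BY grp) FROM t")             // should now fail with a clearer message
spark.sql("SELECT row_number() OVER (PARTITION BY grp ORDER BY id) FROM t") // ok
```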
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13333 from clockfly/spark-13445.
## What changes were proposed in this pull request?
SparkSession has a list of unnecessary private[sql] methods. These methods cause some trouble because private[sql] doesn't apply in Java. In the cases that they are easy to remove, we can simply remove them. This patch does that.
As part of this pull request, I also replaced a bunch of protected[sql] with private[sql], to tighten up visibility.
## How was this patch tested?
Updated test cases to reflect the changes.
Author: Reynold Xin <rxin@databricks.com>
Closes#13319 from rxin/SPARK-15552.
## What changes were proposed in this pull request?
Same as #13302, but for DROP TABLE.
## How was this patch tested?
`DDLSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#13307 from andrewor14/drop-table.
## What changes were proposed in this pull request?
Some of the test cases, e.g. `OrcSourceSuite`, create temp folders and temp files inside them, but the folders are not removed after the tests finish. If we keep running the test cases, this leaves behind lots of temp files and wasted disk space.
The reason is that dir.delete() won't work if the dir is not empty. We need to recursively delete the contents before deleting the folder.
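A sketch of the fix idea using plain `java.io.File` (the actual implementation may differ):
```scala
// Delete the directory contents recursively before deleting the directory itself,
// since File.delete() fails on non-empty directories.
def deleteRecursively(dir: java.io.File): Unit = {
  Option(dir.listFiles()).foreach(_.foreach(deleteRecursively))
  dir.delete()
}
```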
## How was this patch tested?
Manually checked the temp folder to make sure the temp files were deleted.
Author: Bo Meng <mengbo@hotmail.com>
Closes#13304 from bomeng/SPARK-15537.
## What changes were proposed in this pull request?
This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names.
They are now named:
- LibSVMFileFormat
- CSVFileFormat
- JdbcRelationProvider
- JsonFileFormat
- ParquetFileFormat
- TextFileFormat
Backward compatibility is maintained through aliasing.
## How was this patch tested?
Updated relevant test cases too.
Author: Reynold Xin <rxin@databricks.com>
Closes#13311 from rxin/SPARK-15543.
## What changes were proposed in this pull request?
This patch deprecates `Dataset.explode` and documents appropriate workarounds to use `flatMap()` or `functions.explode()` instead.
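A hedged sketch of the documented workarounds (dataset and column names are illustrative; an active SparkSession named `spark` is assumed):
```scala
import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._

val lines = Seq("a b", "c d").toDS()
// Typed alternative: flatMap instead of Dataset.explode
val words = lines.flatMap(_.split(" "))
// Untyped alternative: functions.explode() on a column
val exploded = lines.toDF("sentence").select(explode(split($"sentence", " ")).as("word"))
```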
## How was this patch tested?
N/A
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13312 from sameeragarwal/deprecate.
## What changes were proposed in this pull request?
Two changes:
- When things fail, `TRUNCATE TABLE` just returns nothing. Instead, we should throw exceptions.
- Remove `TRUNCATE TABLE ... COLUMN`, which was never supported by either Spark or Hive.
## How was this patch tested?
Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes#13302 from andrewor14/truncate-table.
## What changes were proposed in this pull request?
Extra strategies do not work for streams because `IncrementalExecution` uses a modified planner with stateful operations, but that planner does not include the extra strategies.
This PR fixes `IncrementalExecution` to include the extra strategies so they can be used.
## How was this patch tested?
I added a test to check if extra strategies work for streams.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#13261 from ueshin/issues/SPARK-15483.
Fixed typos in source code for the [mllib], [streaming] and [SQL] components.
None and obvious.
Author: lfzCarlosC <lfz.carlos@gmail.com>
Closes#13298 from lfzCarlosC/master.
## What changes were proposed in this pull request?
Override the existing SparkContext if the provided SparkConf is different. The PySpark part hasn't been fixed yet; I will do that after the first round of review to ensure this is the correct approach.
## How was this patch tested?
Manually verify it in spark-shell.
rxin Please help review it; I think this is a very critical issue for Spark 2.0.
Author: Jeff Zhang <zjffdu@apache.org>
Closes#13160 from zjffdu/SPARK-15345.
## What changes were proposed in this pull request?
This patch removes the last two commands defined in the catalyst module: DescribeFunction and ShowFunctions. They were unnecessary since the parser could just generate DescribeFunctionCommand and ShowFunctionsCommand directly.
## How was this patch tested?
Created a new SparkSqlParserSuite.
Author: Reynold Xin <rxin@databricks.com>
Closes#13292 from rxin/SPARK-15436.
## What changes were proposed in this pull request?
This PR fixes 3 slow tests:
1. `ParquetQuerySuite.read/write wide table`: This is not a good unit test as it runs for more than 5 minutes. This PR removes it and adds a new regression test in `CodeGenerationSuite`, which is more of a unit test.
2. `ParquetQuerySuite.returning batch for wide table`: reduce the threshold and use a smaller data size.
3. `DatasetSuite.SPARK-14554: Dataset.map may generate wrong java code for wide table`: improving `CodeFormatter.format` (introduced at https://github.com/apache/spark/pull/12979) dramatically speeds it up.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13273 from cloud-fan/test.
## What changes were proposed in this pull request?
Currently, if a table is used in a join operation we rely on the metastore-returned size to decide whether we can convert the operation to a broadcast join. This optimization only kicks in for tables that have statistics available in the metastore. Hive generally falls back to HDFS if the statistics are not available directly from the metastore, and this seems like a reasonable choice to adopt given the optimization benefit of using broadcast joins.
## How was this patch tested?
I have executed queries locally to test.
Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
Closes#13150 from Parth-Brahmbhatt/SPARK-15365.
## What changes were proposed in this pull request?
Previously, SPARK-8893 added the constraints on positive number of partitions for repartition/coalesce operations in general. This PR adds one missing part for that and adds two explicit test cases.
**Before**
```scala
scala> sc.parallelize(1 to 5).coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> sc.parallelize(1 to 5).repartition(0).collect()
res1: Array[Int] = Array() // empty
scala> spark.sql("select 1").coalesce(0)
res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
scala> spark.sql("select 1").coalesce(0).collect()
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
scala> spark.sql("select 1").repartition(0)
res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
scala> spark.sql("select 1").repartition(0).collect()
res4: Array[org.apache.spark.sql.Row] = Array() // empty
```
**After**
```scala
scala> sc.parallelize(1 to 5).coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> sc.parallelize(1 to 5).repartition(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> spark.sql("select 1").coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> spark.sql("select 1").repartition(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
```
## How was this patch tested?
Pass the Jenkins tests with new testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13282 from dongjoon-hyun/SPARK-15512.
## What changes were proposed in this pull request?
Queries that rely on schema inference for file streams can break easily for multiple reasons:
- accidentally running on a directory which has no data
- schema changing underneath
- on restart, the query will infer schema again, and may unexpectedly infer incorrect schema, as the file in the directory may be different at the time of the restart.
To avoid these complicated scenarios, for Spark 2.0 we are going to disable schema inference by default via a config, so that the user is forced to consider explicitly what schema they want, rather than having the system try to infer it and run into weird corner cases.
In this PR, I introduce a SQLConf that determines whether schema inference for file streams is allowed or not. It is disabled by default.
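A hedged example of what users are expected to do instead (path, field names, and the SparkSession `spark` are illustrative): with inference disabled by default, the schema is supplied explicitly for a file stream.
```scala
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val schema = new StructType().add("id", LongType).add("name", StringType)
val stream = spark.readStream.schema(schema).json("/data/input")
```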
## How was this patch tested?
Updated unit tests that test error behavior with and without schema inference enabled.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13238 from tdas/SPARK-15458.
## What changes were proposed in this pull request?
spark.sql("CREATE FUNCTION myfunc AS 'com.haizhi.bdp.udf.UDFGetGeoCode'") throws "org.apache.hadoop.hive.ql.metadata.HiveException:MetaException(message:NoSuchObjectException(message:Function default.myfunc does not exist))" with hive 1.2.1.
I think it was introduced by PR #12853. This fixes it by catching Exception (not NoSuchObjectException) and string matching.
## How was this patch tested?
added a unit test and also tested it manually
Author: wangyang <wangyang@haizhi.com>
Closes#13177 from wangyang1992/fixCreateFunc2.
## What changes were proposed in this pull request?
Jackson supports the `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity" and "INF". The currently used Jackson version (2.5.3) doesn't fully support it. This patch upgrades the library and makes the two ignored tests in `JsonParsingOptionsSuite` pass.
## How was this patch tested?
`JsonParsingOptionsSuite`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#9759 from viirya/fix-json-nonnumric.
## What changes were proposed in this pull request?
In Hive, `locate("aa", "aaa", 0)` yields 0, `locate("aa", "aaa", 1)` yields 1 and `locate("aa", "aaa", 2)` yields 2, while in Spark, `locate("aa", "aaa", 0)` yields 1, `locate("aa", "aaa", 1)` yields 2 and `locate("aa", "aaa", 2)` yields 0. This results from a different understanding of the third parameter of the `locate` UDF: it is the 1-based starting index, so when we pass 0, the return value should always be 0.
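An illustration of the corrected semantics (assuming an active SparkSession `spark`): the third argument is a 1-based start position, so a start of 0 yields 0, matching Hive.
```scala
spark.sql("SELECT locate('aa', 'aaa', 0), locate('aa', 'aaa', 1), locate('aa', 'aaa', 2)").show()
// expected after this change: 0, 1, 2
```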
## How was this patch tested?
tested with modified `StringExpressionsSuite` and `StringFunctionsSuite`
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#13186 from adrian-wang/locate.
## What changes were proposed in this pull request?
This PR splits the generated code for ```SafeProjection.apply``` by using ```ctx.splitExpressions()```. This is because the large code body for ```NewInstance``` may grow beyond 64KB bytecode size for ```apply()``` method.
## How was this patch tested?
Added new tests
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#13243 from kiszk/SPARK-15285.
#### What changes were proposed in this pull request?
So far, when using In-Memory Catalog, we allow DDL operations for the tables. However, the corresponding DML operations are not supported for the tables that are neither temporary nor data source tables. For example,
```SQL
CREATE TABLE tabName(i INT, j STRING)
SELECT * FROM tabName
INSERT OVERWRITE TABLE tabName SELECT 1, 'a'
```
In the above example, before this PR fix, we will get very confusing exception messages for either `SELECT` or `INSERT`
```
org.apache.spark.sql.AnalysisException: unresolved operator 'SimpleCatalogRelation default, CatalogTable(`default`.`tbl`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(i,int,true,None), CatalogColumn(j,string,true,None)),List(),List(),List(),-1,,1463928681802,-1,Map(),None,None,None,List()), None;
```
This PR is to issue appropriate exceptions in this case. The message will be like
```
org.apache.spark.sql.AnalysisException: Please enable Hive support when operating non-temporary tables: `tbl`;
```
#### How was this patch tested?
Added a test case in `DDLSuite`.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13093 from gatorsmile/selectAfterCreate.
## What changes were proposed in this pull request?
Currently the command `ADD FILE|JAR <filepath | jarpath>` is supported natively in Spark SQL. However, when this command is run, the added file/jar cannot be looked up with a `LIST FILE(s)|JAR(s)` command, because the `LIST` command is passed to the Hive command processor in Spark-SQL or is simply not supported in spark-shell. There is no way for users to find out what files/jars have been added to the SparkContext.
Refer to [Hive commands](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli)
This PR is to support following commands:
`LIST (FILE[s] [filepath ...] | JAR[s] [jarfile ...])`
### For example:
##### LIST FILE(s)
```
scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt")
res1: org.apache.spark.sql.DataFrame = []
scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt")
res2: org.apache.spark.sql.DataFrame = []
scala> spark.sql("list file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt").show(false)
+----------------------------------------------+
|result |
+----------------------------------------------+
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
+----------------------------------------------+
scala> spark.sql("list files").show(false)
+----------------------------------------------+
|result |
+----------------------------------------------+
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt |
+----------------------------------------------+
```
##### LIST JAR(s)
```
scala> spark.sql("add jar /Users/xinwu/spark/core/src/test/resources/TestUDTF.jar")
res9: org.apache.spark.sql.DataFrame = [result: int]
scala> spark.sql("list jar TestUDTF.jar").show(false)
+---------------------------------------------+
|result |
+---------------------------------------------+
|spark://192.168.1.234:50131/jars/TestUDTF.jar|
+---------------------------------------------+
scala> spark.sql("list jars").show(false)
+---------------------------------------------+
|result |
+---------------------------------------------+
|spark://192.168.1.234:50131/jars/TestUDTF.jar|
+---------------------------------------------+
```
## How was this patch tested?
New test cases are added for Spark-SQL, Spark-Shell and SparkContext API code path.
Author: Xin Wu <xinwu@us.ibm.com>
Author: xin Wu <xinwu@us.ibm.com>
Closes#13212 from xwu0226/list_command.
## What changes were proposed in this pull request?
Adds error handling to the CSV writer for unsupported complex data types. Currently garbage gets written to the output csv files if the data frame schema has complex data types.
## How was this patch tested?
Added new unit test case.
Author: sureshthalamati <suresh.thalamati@gmail.com>
Closes#13105 from sureshthalamati/csv_complex_types_SPARK-15315.
## What changes were proposed in this pull request?
Spark assumes that UDF functions are deterministic. This PR adds explicit notes about that.
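An illustrative example of the documented assumption (names are made up; an active SparkSession `spark` is assumed): Spark may evaluate a UDF more than once per input row, so the function body must be deterministic.
```scala
spark.udf.register("plusOne", (x: Int) => x + 1)   // fine: deterministic
// Something like (x: Int) => x + scala.util.Random.nextInt(10) would violate the assumption.
```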
## How was this patch tested?
It's only about docs.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13087 from dongjoon-hyun/SPARK-15282.
## What changes were proposed in this pull request?
The user may do something like:
```
CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS PARQUET
CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS ... SERDE 'myserde'
CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ORC
CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ... SERDE 'myserde'
```
None of these should be allowed because the SerDes conflict. As of this patch:
- `ROW FORMAT DELIMITED` is only compatible with `TEXTFILE`
- `ROW FORMAT SERDE` is only compatible with `TEXTFILE`, `RCFILE` and `SEQUENCEFILE`
## How was this patch tested?
New tests in `DDLCommandSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#13068 from andrewor14/row-format-conflict.
## What changes were proposed in this pull request?
1. simplify the logic of deserializing option type.
2. simplify the logic of serializing array type, and remove silentSchemaFor
3. remove some unnecessary code.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13250 from cloud-fan/encoder.
## What changes were proposed in this pull request?
Currently, we create a CSVWriter for every row, which is very expensive and memory hungry; it took about 15 seconds to write out 1 million rows (two columns).
This PR writes the rows in batch mode, creating a CSVWriter for every 1k rows, which can write out 1 million rows in about 1 second (15X faster).
## How was this patch tested?
Manually benchmark it.
Author: Davies Liu <davies@databricks.com>
Closes#13229 from davies/csv_writer.
## What changes were proposed in this pull request?
In order to prevent users from inadvertently writing queries with cartesian joins, this patch introduces a new conf `spark.sql.crossJoin.enabled` (set to `false` by default); unless it is enabled, queries containing one or more cartesian products result in a `SparkException`.
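A hedged example (df1/df2 and the SparkSession `spark` are illustrative): with the flag at its default of `false`, a query containing a cartesian product fails; enabling it allows the join explicitly.
```scala
spark.conf.set("spark.sql.crossJoin.enabled", "true")
val product = df1.join(df2)   // no join condition, i.e. a cartesian product, now permitted
```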
## How was this patch tested?
Added a test to verify the new behavior in `JoinSuite`. Additionally, `SQLQuerySuite` and `SQLMetricsSuite` were modified to explicitly enable cartesian products.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13209 from sameeragarwal/disallow-cartesian.
## What changes were proposed in this pull request?
When an invalid date string like "2015-02-29 00:00:00" is cast as a date or timestamp using Spark SQL, it used to return not null but another valid date (2015-03-01 in this case).
In this PR, invalid date strings like "2015-02-29" and "2016-04-31" are returned as null when cast as date or timestamp.
## How was this patch tested?
Unit tests are added.
Author: wangyang <wangyang@haizhi.com>
Closes#13169 from wangyang1992/invalid_date.
## What changes were proposed in this pull request?
Fix some typos while browsing the codes.
## How was this patch tested?
None and obvious.
Author: Bo Meng <mengbo@hotmail.com>
Author: bomeng <bmeng@us.ibm.com>
Closes#13246 from bomeng/typo.
## What changes were proposed in this pull request?
Incrementalizing plans with multiple streaming aggregations is tricky and we don't have the necessary "delta" support to implement it correctly. So this disables support for multiple streaming aggregations.
## How was this patch tested?
Additional unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13210 from tdas/SPARK-15428.
## What changes were proposed in this pull request?
This patch simplifies the implementation of the Range operator and makes the explain string consistent between the logical plan and the physical plan. To do this, I changed RangeExec to embed a Range logical plan in it.
Before this patch (note that the logical Range and physical Range actually output different information):
```
== Optimized Logical Plan ==
Range 0, 100, 2, 2, [id#8L]
== Physical Plan ==
*Range 0, 2, 2, 50, [id#8L]
```
After this patch:
If step size is 1:
```
== Optimized Logical Plan ==
Range(0, 100, splits=2)
== Physical Plan ==
*Range(0, 100, splits=2)
```
If step size is not 1:
```
== Optimized Logical Plan ==
Range (0, 100, step=2, splits=2)
== Physical Plan ==
*Range (0, 100, step=2, splits=2)
```
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13239 from rxin/SPARK-15459.
#### What changes were proposed in this pull request?
When there are duplicate keys in the partition specs or table properties, we always use the last value and ignore all the previous values. This is caused by the function call `toMap`.
partition specs or table properties are widely used in multiple DDL statements.
This PR is to detect the duplicates and issue an exception if found.
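A plain-Scala illustration of why duplicates were silently lost: `toMap` keeps only the last value for a duplicated key.
```scala
val props = Seq("k" -> "v1", "k" -> "v2")
props.toMap   // Map(k -> v2): the earlier value is dropped without any warning
```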
#### How was this patch tested?
Added test cases in DDLSuite
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13095 from gatorsmile/detectDuplicate.
## What changes were proposed in this pull request?
This PR makes BroadcastHint more deterministic by using a special isBroadcastable property
instead of setting the sizeInBytes to 1.
See https://issues.apache.org/jira/browse/SPARK-15415
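A hedged usage sketch (df1/df2 and the join column are illustrative): the `broadcast()` hint marks the right side as broadcastable rather than faking its size.
```scala
import org.apache.spark.sql.functions.broadcast

val joined = df1.join(broadcast(df2), "id")
```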
## How was this patch tested?
Added testcases to test if the broadcast hash join is included in the plan when the BroadcastHint is supplied and also tests for propagation of the joins.
Author: Jurriaan Pruis <email@jurriaanpruis.nl>
Closes#13244 from jurriaan/broadcast-hint.
## What changes were proposed in this pull request?
Add new test cases for including distinct aggregate in having clause in 2.0 branch.
This is a followup PR for [#12974](https://github.com/apache/spark/pull/12974), which is for 1.6 branch.
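A hedged sketch of the kind of query covered by the new tests (table and column names are made up; an active SparkSession `spark` is assumed):
```scala
spark.sql("""
  SELECT key, count(value)
  FROM src
  GROUP BY key
  HAVING count(DISTINCT value) > 1
""")
```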
Author: xin Wu <xinwu@us.ibm.com>
Closes#12984 from xwu0226/SPARK-15206.
#### What changes were proposed in this pull request?
Like the `SET` command, `RESET` is also supported by Hive. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-3202
This PR implements such a command for resetting the SQL-related configuration to its default values. One of the use cases shown in HIVE-3202 is listed below:
> For the purpose of optimization we set various configs per query. It's worthy but all those configs should be reset every time for next query.
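A hedged example of the intended workflow (the config key is just for illustration; an active SparkSession `spark` is assumed):
```scala
spark.sql("SET spark.sql.shuffle.partitions=10")  // tune a config for one query
// ... run the query ...
spark.sql("RESET")                                // restore SQL configs to their defaults
```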
#### How was this patch tested?
Added a test case.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#13121 from gatorsmile/resetCommand.
## What changes were proposed in this pull request?
Refactoring: Separated ORC serialization logic from OrcOutputWriter and moved to a new class called OrcSerializer.
## How was this patch tested?
Manual tests & existing tests.
Author: Ergin Seyfe <eseyfe@fb.com>
Closes#13066 from seyfe/orc_serializer.
## What changes were proposed in this pull request?
The Aggregator API was introduced in 2.0 for Dataset. All typed Dataset APIs should still be marked as experimental in 2.0.
## How was this patch tested?
N/A - annotation only change.
Author: Reynold Xin <rxin@databricks.com>
Closes#13226 from rxin/SPARK-15452.
## What changes were proposed in this pull request?
Generate a shorter default alias for `AggregateExpression`. In this PR, the aggregate function name along with an index is used for generating the alias name.
```scala
val ds = Seq(1, 3, 2, 5).toDS()
ds.select(typed.sum((i: Int) => i), typed.avg((i: Int) => i)).show()
```
Output before change.
```SQL
+-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
|typedsumdouble(unresolveddeserializer(upcast(input[0, int], IntegerType, - root class: "scala.Int"), value#1), upcast(value))|typedaverage(unresolveddeserializer(upcast(input[0, int], IntegerType, - root class: "scala.Int"), value#1), newInstance(class scala.Tuple2))|
+-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
| 11.0| 2.75|
+-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
```
Output after change:
```SQL
+-----------------+---------------+
|typedsumdouble_c1|typedaverage_c2|
+-----------------+---------------+
| 11.0| 2.75|
+-----------------+---------------+
```
Note: There is one test in ParquetSuites.scala which shows that the system-picked alias
name is not usable and is rejected. [test](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala#L672-#L687)
## How was this patch tested?
A new test was added in DataSetAggregatorSuite.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#13045 from dilipbiswal/spark-15114.
## What changes were proposed in this pull request?
In the `catalyst` module alone, there exist 8 evaluation test cases on unresolved expressions. But in real-world situations those cases don't happen, since exceptions occur before evaluation.
```scala
scala> sql("select format_number(null, 3)")
res0: org.apache.spark.sql.DataFrame = [format_number(CAST(NULL AS DOUBLE), 3): string]
scala> sql("select format_number(cast(null as NULL), 3)")
org.apache.spark.sql.catalyst.parser.ParseException:
DataType null() is not supported.(line 1, pos 34)
```
This PR makes those testcases more realistic.
```scala
- checkEvaluation(FormatNumber(Literal.create(null, NullType), Literal(3)), null)
+ assert(FormatNumber(Literal.create(null, NullType), Literal(3)).resolved === false)
```
This PR also removes redundant `resolved` checking in the `FoldablePropagation` optimizer.
## How was this patch tested?
Pass the modified Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13241 from dongjoon-hyun/SPARK-15462.
## What changes were proposed in this pull request?
Use longValue() and then manually check whether the value is within the range of a long.
## How was this patch tested?
Existing tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#13223 from techaddict/SPARK-15445.
## What changes were proposed in this pull request?
I initially asked to create a hivecontext-compatibility module to put the HiveContext there. But we are so close to the Spark 2.0 release and there is only a single class in it. It seems overkill to have an entire module for a single class, which only makes things more inconvenient.
## How was this patch tested?
Tests were moved.
Author: Reynold Xin <rxin@databricks.com>
Closes#13207 from rxin/SPARK-15424.
## What changes were proposed in this pull request?
Use `SparkSession` according to [SPARK-15031](https://issues.apache.org/jira/browse/SPARK-15031)
`MLLIB` is not recommended for use now, so examples in `MLLIB` are ignored in this PR.
`StreamingContext` can not be directly obtained from `SparkSession`, so examples in `Streaming` are ignored too.
cc andrewor14
## How was this patch tested?
manual tests with spark-submit
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13164 from zhengruifeng/use_sparksession_ii.
## What changes were proposed in this pull request?
Now that SparkSQL supports all TPC-DS queries, this patch adds all 99 benchmark queries inside SparkSQL.
## How was this patch tested?
Benchmark only
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13188 from sameeragarwal/tpcds-all.
## What changes were proposed in this pull request?
Many other systems (e.g. Impala) use _xxx as staging directories, and Spark should not be reading those files.
## How was this patch tested?
Added a unit test case.
Author: Reynold Xin <rxin@databricks.com>
Closes#13227 from rxin/SPARK-15454.
## What changes were proposed in this pull request?
Currently, the explain of a query with whole-stage codegen looks like this
```
>>> df = sqlCtx.range(1000);df2 = sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 'id').explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [id#1L]
: +- BroadcastHashJoin [id#1L], [id#4L], Inner, BuildRight, None
: :- Range 0, 1, 4, 1000, [id#1L]
: +- INPUT
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
+- WholeStageCodegen
: +- Range 0, 1, 4, 1000, [id#4L]
```
The problem is that the plan looks very different from the logical plan, making it hard to understand (especially when the logical plan is not shown together with it).
This PR will change it to:
```
>>> df = sqlCtx.range(1000);df2 = sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 'id').explain()
== Physical Plan ==
*Project [id#0L]
+- *BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight, None
:- *Range 0, 1, 4, 1000, [id#0L]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
+- *Range 0, 1, 4, 1000, [id#3L]
```
The `*` before an operator means that it's part of whole-stage codegen, which is easy to understand.
## How was this patch tested?
Manually ran some queries and check the explain.
Author: Davies Liu <davies@databricks.com>
Closes#13204 from davies/explain_codegen.
This reverts commit 8d05a7a from #12855, which seems to have caused regressions when working with empty DataFrames.
Author: Michael Armbrust <michael@databricks.com>
Closes#13181 from marmbrus/revert12855.
## What changes were proposed in this pull request?
Right now inferring the schema for case classes happens before searching the SQLUserDefinedType annotation, so the SQLUserDefinedType annotation for case classes doesn't work.
This PR simply changes the inferring order to resolve it. I also reenabled the java.math.BigDecimal test and added two tests for `List`.
## How was this patch tested?
`encodeDecodeTest(UDTCaseClass(new java.net.URI("http://spark.apache.org/")), "udt with case class")`
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12965 from zsxwing/SPARK-15190.
## What changes were proposed in this pull request?
This PR introduces placeholders for comments in generated code; the purpose is the same as #12939 but this is much safer.
The generated code to be compiled doesn't include actual comments but includes placeholders instead.
Placeholders in generated code are replaced with actual comments only at the time of logging.
Also, this PR can resolve SPARK-15205.
## How was this patch tested?
Existing tests.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#12979 from sarutak/SPARK-15205.
## What changes were proposed in this pull request?
`CreateNamedStruct` and `CreateNamedStructUnsafe` should preserve metadata of value expressions if it is `NamedExpression` like `CreateStruct` or `CreateStructUnsafe` are doing.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#13193 from ueshin/issues/SPARK-15400.
## What changes were proposed in this pull request?
We started the convention of appending a Command suffix to all SQL commands. However, not all commands follow that convention. This patch adds the Command suffix to all RunnableCommands.
## How was this patch tested?
Updated test cases to reflect the renames.
Author: Reynold Xin <rxin@databricks.com>
Closes#13215 from rxin/SPARK-15435.
## What changes were proposed in this pull request?
When we parse DDLs involving table or database properties, we need to validate the values.
E.g. if we alter a database's property without providing a value:
```
ALTER DATABASE my_db SET DBPROPERTIES('some_key')
```
Then we'll ignore it with Hive, but override the property with the in-memory catalog. Inconsistencies like these arise because we don't validate the property values.
In such cases, we should throw exceptions instead.
## How was this patch tested?
`DDLCommandSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#13205 from andrewor14/ddl-prop-values.
#### What changes were proposed in this pull request?
`refreshTable` was a method in `HiveContext`. It was deleted accidentally while we were migrating the APIs. This PR is to add it back to `HiveContext`.
In addition, in `SparkSession`, we put it under the catalog namespace (`SparkSession.catalog.refreshTable`).
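A usage sketch (the table name is illustrative; an active SparkSession `spark` is assumed): refresh cached metadata and data for a table after its underlying files have changed outside of Spark.
```scala
spark.catalog.refreshTable("my_table")
```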
#### How was this patch tested?
Changed the existing test cases to use the function `refreshTable`. Also added a test case for refreshTable in `hivecontext-compatibility`
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13156 from gatorsmile/refreshTable.
## What changes were proposed in this pull request?
`TRUNCATE TABLE` is supported by Hive. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-446
This PR implements such a command for truncating tables, excluding column truncation (HIVE-4005).
## How was this patch tested?
Added a test case.
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Closes#13170 from lianhuiwang/truncate.
## What changes were proposed in this pull request?
The following code:
```
val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS()
ds.filter(_._1 == "b").select(expr("_1").as[String]).foreach(println(_))
```
throws an Exception:
```
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _1#420
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
...
Cause: java.lang.RuntimeException: Couldn't find _1#420 in [_1#416,_2#417]
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
...
```
This is because the `EmbedSerializerInFilter` rule drops the `exprId`s of the output of the surrounded `SerializeFromObject`.
The analyzed and optimized plans of the above example are as follows:
```
== Analyzed Logical Plan ==
_1: string
Project [_1#420]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421]
+- Filter <function1>.apply
+- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2
+- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
== Optimized Logical Plan ==
!Project [_1#420]
+- Filter <function1>.apply
+- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
```
This PR fixes the `EmbedSerializerInFilter` rule to keep the `exprId`s of the output of the surrounded `SerializeFromObject`.
The plans after this patch are as follows:
```
== Analyzed Logical Plan ==
_1: string
Project [_1#420]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421]
+- Filter <function1>.apply
+- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2
+- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
== Optimized Logical Plan ==
Project [_1#416]
+- Filter <function1>.apply
+- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
```
## How was this patch tested?
Existing tests and I added a test to check if `filter and then select` works.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#13096 from ueshin/issues/SPARK-15313.
## What changes were proposed in this pull request?
This patch fixes a bug in TypeUtils.checkForSameTypeInputExpr. Previously the code was testing strict equality, which does not take nullability into account.
This is based on https://github.com/apache/spark/pull/12768. This patch fixed a bug there (with empty expression) and added a test case.
## How was this patch tested?
Added a new test suite and test case.
Closes#12768.
Author: Reynold Xin <rxin@databricks.com>
Author: Oleg Danilov <oleg.danilov@wandisco.com>
Closes#13208 from rxin/SPARK-14990.
## What changes were proposed in this pull request?
Currently SparkSession.Builder uses SQLContext.getOrCreate. It should probably be the other way around, i.e. all the core logic goes in SparkSession, and SQLContext just calls that. This patch does that.
This patch also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession.
## How was this patch tested?
Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches.
Author: Reynold Xin <rxin@databricks.com>
Closes#13200 from rxin/SPARK-15075.
Can you help check this PR? I am adding support for java.math.BigInteger for the Java bean code path. I saw that internally Spark converts BigInteger to BigDecimal in ColumnType.scala and CatalystRowConverter.scala. I use a similar approach and convert the BigInteger to a BigDecimal.
Author: Kevin Yu <qyu@us.ibm.com>
Closes#10125 from kevinyu98/working_on_spark-11827.
## What changes were proposed in this pull request?
Fix `MapObjects.itemAccessorMethod` to handle `TimestampType`. Without this fix, `Array[Timestamp]` cannot be properly encoded or decoded. To reproduce this, in `ExpressionEncoderSuite`, if you add the following test case:
`encodeDecodeTest(Array(Timestamp.valueOf("2016-01-29 10:00:00")), "array of timestamp")
`
... you will see that (without this fix) it fails with the following output:
```
- encode/decode for array of timestamp: [Ljava.sql.Timestamp;fd9ebde *** FAILED ***
Exception thrown while decoding
Converted: [0,1000000010,800000001,52a7ccdc36800]
Schema: value#61615
root
-- value: array (nullable = true)
|-- element: timestamp (containsNull = true)
Encoder:
class[value[0]: array<timestamp>] (ExpressionEncoderSuite.scala:312)
```
## How was this patch tested?
Existing tests
Author: Sumedh Mungee <smungee@gmail.com>
Closes#13108 from smungee/fix-itemAccessorMethod.
## What changes were proposed in this pull request?
If finding `NoClassDefFoundError` or `ClassNotFoundException`, check if the class name is removed in Spark 2.0. If so, the user must be using an incompatible library and we can provide a better message.
## How was this patch tested?
1. Run `bin/pyspark --packages com.databricks:spark-avro_2.10:2.0.1`
2. type `sqlContext.read.format("com.databricks.spark.avro").load("src/test/resources/episodes.avro")`.
It will show `java.lang.ClassNotFoundException: org.apache.spark.sql.sources.HadoopFsRelationProvider is removed in Spark 2.0. Please check if your library is compatible with Spark 2.0`
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13201 from zsxwing/better-message.
## What changes were proposed in this pull request?
Add a ConsoleSink to structured streaming, so users can use it to display DataFrames on the console (useful for debugging and demonstrating), similar to the functionality of `DStream#print`. To use it:
```
val query = result.write
.format("console")
.trigger(ProcessingTime("2 seconds"))
.startStream()
```
## How was this patch tested?
Verified locally.
I am not sure whether it is suitable to add this into structured streaming; please review and comment, thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes#13162 from jerryshao/SPARK-15375.
## What changes were proposed in this pull request?
We use autoBroadcastJoinThreshold + 1L as the default value for size estimation; that is not good in 2.0, because we calculate the size based on the size of the schema, so the estimation could be less than autoBroadcastJoinThreshold if you have a SELECT on top of a DataFrame created from an RDD.
This PR changes the default value to Long.MaxValue.
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#13183 from davies/fix_default_size.
## What changes were proposed in this pull request?
In general, the Web UI doesn't need to store the Accumulator/AccumulableInfo for every task. It only needs the Accumulator values.
In this PR, it creates new UIData classes to store the necessary fields and makes `JobProgressListener` store only these new classes, so that `JobProgressListener` won't store Accumulator/AccumulableInfo and the size of `JobProgressListener` becomes pretty small. I also eliminate `AccumulableInfo` from `SQLListener` so that we don't keep any references to those unused `AccumulableInfo`s.
## How was this patch tested?
I ran two tests reported in JIRA locally:
The first one is:
```
val data = spark.range(0, 10000, 1, 10000)
data.cache().count()
```
The retained size of JobProgressListener decreases from 60.7M to 6.9M.
The second one is:
```
import org.apache.spark.ml.CC
import org.apache.spark.sql.SQLContext
val sqlContext = SQLContext.getOrCreate(sc)
CC.runTest(sqlContext)
```
This test won't cause OOM after applying this patch.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13153 from zsxwing/memory.
## What changes were proposed in this pull request?
This PR is a follow-up of #13079. It replaces `hasUnsupportedFeatures: Boolean` in `CatalogTable` with `unsupportedFeatures: Seq[String]`, which contains unsupported Hive features of the underlying Hive table. In this way, we can accurately report all unsupported Hive features in the exception message.
## How was this patch tested?
Updated existing test case to check exception message.
Author: Cheng Lian <lian@databricks.com>
Closes#13173 from liancheng/spark-14346-follow-up.
## What changes were proposed in this pull request?
This PR corrects another case that uses the deprecated `accumulableCollection` to use `listAccumulator`, which the previous PR seems to have missed.
Since `ArrayBuffer[InternalRow].asJava` is `java.util.List[InternalRow]`, it seems ok to replace the usage.
## How was this patch tested?
Related existing tests `InMemoryColumnarQuerySuite` and `CachedTableSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13187 from HyukjinKwon/SPARK-15322.
## What changes were proposed in this pull request?
After #12871 was merged, we are forced to create `/user/hive/warehouse` when SimpleAnalyzer is used, but SimpleAnalyzer may not need that directory.
## How was this patch tested?
Manual test.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#13175 from sarutak/SPARK-15387.
#### What changes were proposed in this pull request?
This follow-up PR is to address the remaining comments in https://github.com/apache/spark/pull/12385
The major change in this PR is to issue better error messages in PySpark by using the mechanism that was proposed by davies in https://github.com/apache/spark/pull/7135
For example, in PySpark, if we input the following statement:
```python
>>> l = [('Alice', 1)]
>>> df = sqlContext.createDataFrame(l)
>>> df.createTempView("people")
>>> df.createTempView("people")
```
Before this PR, the exception we will get is like
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/dataframe.py", line 152, in createTempView
self._jdf.createTempView(name)
File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.createTempView.
: org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException: Temporary table 'people' already exists;
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTempView(SessionCatalog.scala:324)
at org.apache.spark.sql.SparkSession.createTempView(SparkSession.scala:523)
at org.apache.spark.sql.Dataset.createTempView(Dataset.scala:2328)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
```
After this PR, the exception we will get become cleaner:
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/dataframe.py", line 152, in createTempView
self._jdf.createTempView(name)
File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 75, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"Temporary table 'people' already exists;"
```
#### How was this patch tested?
Fixed an existing PySpark test case
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13126 from gatorsmile/followup-14684.
## What changes were proposed in this pull request?
When broadcasting a table with more than 100 million rows (which ideally should not happen), the computed size of the needed memory will overflow.
This PR fixes the overflow by converting to Long when calculating the memory size.
It also adds more checks in broadcast to show reasonable messages.
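A plain-Scala illustration of the overflow (the numbers are made up): computing the size with Int arithmetic wraps around, while widening to Long first gives the correct value.
```scala
val rows = 150000000                    // 150 million rows
val bytesPerRow = 48
val wrong = rows * bytesPerRow          // Int arithmetic overflows to a negative number
val right = rows.toLong * bytesPerRow   // 7200000000 bytes
```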
## How was this patch tested?
Add test.
Author: Davies Liu <davies@databricks.com>
Closes#13182 from davies/fix_broadcast.
## What changes were proposed in this pull request?
This PR aims to add new **FoldablePropagation** optimizer that propagates foldable expressions by replacing all attributes with the aliases of original foldable expression. Other optimizations will take advantage of the propagated foldable expressions: e.g. `EliminateSorts` optimizer now can handle the following Case 2 and 3. (Case 1 is the previous implementation.)
1. Literals and foldable expression, e.g. "ORDER BY 1.0, 'abc', Now()"
2. Foldable ordinals, e.g. "SELECT 1.0, 'abc', Now() ORDER BY 1, 2, 3"
3. Foldable aliases, e.g. "SELECT 1.0 x, 'abc' y, Now() z ORDER BY x, y, z"
This PR has been generalized based on cloud-fan 's key ideas many times; he should be credited for the work he did.
**Before**
```
scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain
== Physical Plan ==
WholeStageCodegen
: +- Sort [1.0#5 ASC,x#0 ASC], true, 0
: +- INPUT
+- Exchange rangepartitioning(1.0#5 ASC, x#0 ASC, 200), None
+- WholeStageCodegen
: +- Project [1.0 AS 1.0#5,1461873043577000 AS x#0]
: +- INPUT
+- Scan OneRowRelation[]
```
**After**
```
scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain
== Physical Plan ==
WholeStageCodegen
: +- Project [1.0 AS 1.0#5,1461873079484000 AS x#0]
: +- INPUT
+- Scan OneRowRelation[]
```
## How was this patch tested?
Pass the Jenkins tests including a new test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12719 from dongjoon-hyun/SPARK-14939.
## What changes were proposed in this pull request?
Whole Stage Codegen depends on `SparkPlan.reference` to do some optimization. For physical object operators, they should be consistent with their logical version and set the `reference` correctly.
## How was this patch tested?
new test in DatasetSuite
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13167 from cloud-fan/bug.
## What changes were proposed in this pull request?
This patch is a follow-up to https://github.com/apache/spark/pull/13104 and adds documentation to clarify the semantics of read.text with respect to partitioning.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13184 from rxin/SPARK-14463.
#### What changes were proposed in this pull request?
The command `SET -v` always outputs the default values even if we set the parameter. This behavior is incorrect. Instead, if users override it, we should output the user-specified value.
In addition, the output schema of `SET -v` is wrong. We should use the column `value` instead of `default` for the parameter value.
This PR is to fix the above two issues.
#### How was this patch tested?
Added a test case.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13081 from gatorsmile/setVcommand.
## What changes were proposed in this pull request?
This PR adds a null check in `SparkSession.createDataFrame`, so that we can make sure the passed-in rows match the given schema.
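A hedged sketch of the guarded scenario (assuming an active SparkSession `spark`): a null value in a column declared non-nullable should now be rejected up front.
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructType}

val schema = new StructType().add("id", LongType, nullable = false)
val rows = java.util.Arrays.asList(Row(null))
spark.createDataFrame(rows, schema)   // expected to fail the new null check
```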
## How was this patch tested?
new tests in `DatasetSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13008 from cloud-fan/row-encoder.
https://issues.apache.org/jira/browse/SPARK-15323
I was using partitioned text datasets in Spark 1.6.1 but it broke in Spark 2.0.0.
It would be logical if you could also write those,
but not entirely sure how to solve this with the new DataSet implementation.
Also it doesn't work using `sqlContext.read.text`, since that method returns a `Dataset[String]`.
See https://issues.apache.org/jira/browse/SPARK-14463 for that issue.
Author: Jurriaan Pruis <email@jurriaanpruis.nl>
Closes#13104 from jurriaan/fix-partitioned-text-reads.
## What changes were proposed in this pull request?
We use autoBroadcastJoinThreshold + 1L as the default value for size estimation; that is not good in 2.0, because we calculate the size based on the size of the schema, so the estimation could be less than autoBroadcastJoinThreshold if you have a SELECT on top of a DataFrame created from an RDD.
This PR changes the default value to Long.MaxValue.
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#13179 from davies/fix_default_size.
## What changes were proposed in this pull request?
Since we support forced spilling for Spillable, which only works in OnHeap mode, unlike other SQL operators (which could be OnHeap or OffHeap), we should consider the mode of the consumer before triggering forced spilling.
## How was this patch tested?
Add new test.
Author: Davies Liu <davies@databricks.com>
Closes#13151 from davies/fix_mode.
## What changes were proposed in this pull request?
This PR fixes a Scala 2.10 compilation failure introduced in PR #13127.
## How was this patch tested?
Jenkins build.
Author: Cheng Lian <lian@databricks.com>
Closes#13166 from liancheng/hotfix-for-scala-2.10.
## What changes were proposed in this pull request?
This PR removes unused pattern matching variable in Optimizers in order to improve readability.
## How was this patch tested?
Pass the existing Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13145 from dongjoon-hyun/remove_unused_pattern_matching_variables.
## What changes were proposed in this pull request?
I used IntelliJ IDEA to search for usages of the deprecated SparkContext.accumulator in the whole Spark project, and updated the code (except the test code for the accumulator method itself).
## How was this patch tested?
Existing unit tests
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13112 from WeichenXu123/update_accuV2_in_mllib.
## What changes were proposed in this pull request?
Currently, listing files is very slow if there are thousands of files, especially on a local file system, because:
1) FileStatus.getPermission() is very slow on a local file system: it launches a subprocess and parses the stdout.
2) Creating a JobConf is very expensive (ClassUtil.findContainingJar() is slow).
This PR improves these by:
1) Using another constructor of LocatedFileStatus to avoid calling FileStatus.getPermission; the permissions are not used for data sources.
2) Only creating a JobConf once within one task.
## How was this patch tested?
Manually tested on a partitioned table with 1828 partitions; the time to load the table decreased from 22 seconds to 1.6 seconds (most of the time is now spent merging schemas).
Author: Davies Liu <davies@databricks.com>
Closes#13094 from davies/listing.
## What changes were proposed in this pull request?
The HiveClient facade is not compatible with Hive 0.12.
This PR fixes the following compatibility issues:
1. `org.apache.spark.sql.hive.client.HiveClientImpl` use `AddPartitionDesc(db, table, ignoreIfExists)` to create partitions, however, Hive 0.12 doesn't have this constructor for `AddPartitionDesc`.
2. `HiveClientImpl` uses `PartitionDropOptions` when dropping partition, however, class `PartitionDropOptions` doesn't exist in Hive 0.12.
3. Hive 0.12 doesn't support adding permanent functions. It is not valid to call `org.apache.hadoop.hive.ql.metadata.Hive.createFunction`, `org.apache.hadoop.hive.ql.metadata.Hive.alterFunction`, and `org.apache.hadoop.hive.ql.metadata.Hive.alterFunction`
4. `org.apache.spark.sql.hive.client.VersionsSuite` doesn't have enough test coverage for different hive versions 0.12, 0.13, 0.14, 1.0.0, 1.1.0, 1.2.0.
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13127 from clockfly/versionSuite.
## What changes were proposed in this pull request?
Scala 2.10 build was broken by #13079. I am reverting the change of that line.
Author: Yin Huai <yhuai@databricks.com>
Closes#13157 from yhuai/SPARK-14346-fix-scala2.10.
## What changes were proposed in this pull request?
Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.
## How was this patch tested?
This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13098 from clockfly/spark-15171-remove-deprecation.
## What changes were proposed in this pull request?
This is a follow-up of #12781. It adds native `SHOW CREATE TABLE` support for Hive tables and views. A new field `hasUnsupportedFeatures` is added to `CatalogTable` to indicate whether all table metadata retrieved from the concrete underlying external catalog (i.e. Hive metastore in this case) can be mapped to fields in `CatalogTable`. This flag is useful when the target Hive table contains structures that can't be handled by Spark SQL, e.g. skewed columns and storage handlers.
## How was this patch tested?
New test cases are added in `ShowCreateTableSuite` to do round-trip tests.
Author: Cheng Lian <lian@databricks.com>
Closes#13079 from liancheng/spark-14346-show-create-table-for-hive-tables.