ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Maxim Gekk	93a264d05a	[SPARK-27535][SQL][TEST] Date and timestamp JSON benchmarks ## What changes were proposed in this pull request? Added new JSON benchmarks related to date and timestamps operations: - Write date/timestamp to JSON files - `to_json()` and `from_json()` for dates and timestamps - Read date/timestamps from JSON files, and infer schemas - Parse and infer schemas from `Dataset[String]` Also existing JSON benchmarks are ported on `NoOp` datasource. Closes #24430 from MaxGekk/json-datetime-benchmark. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-23 11:09:14 +09:00
Maxim Gekk	55f26d8090	[SPARK-27533][SQL][TEST] Date and timestamp CSV benchmarks ## What changes were proposed in this pull request? Added new CSV benchmarks related to date and timestamps operations: - Write date/timestamp to CSV files - `to_csv()` and `from_csv()` for dates and timestamps - Read date/timestamps from CSV files, and infer schemas - Parse and infer schemas from `Dataset[String]` Also existing CSV benchmarks are ported on `NoOp` datasource. Closes #24429 from MaxGekk/csv-timestamp-benchmark. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-23 11:08:02 +09:00
Maxim Gekk	43a73e387c	[SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default ## What changes were proposed in this pull request? In the PR, I propose to use the `TIMESTAMP_MICROS` logical type for timestamps written to parquet files. The type matches semantically to Catalyst's `TimestampType`, and stores microseconds since epoch in UTC time zone. This will allow to avoid conversions of microseconds to nanoseconds and to Julian calendar. Also this will reduce sizes of written parquet files. ## How was this patch tested? By existing test suites. Closes #24425 from MaxGekk/parquet-timestamp_micros. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-23 11:06:39 +09:00
Dilip Biswal	3240e52dc7	[SPARK-27531][SQL] Improve `EXPLAIN DESC TABLE` to show the input parameters of the command. ## What changes were proposed in this pull request? Currently "EXPLAIN DESC TABLE" is special cased and outputs a single row relation as following. Current output: ```sql spark-sql> EXPLAIN DESCRIBE TABLE t; == Physical Plan == *(1) Scan OneRowRelation[] ``` This is not consistent with how we handle explain processing for other commands. In this PR, the inconsistency is handled by removing the special handling for "describe table". After change: ```sql spark-sql> EXPLAIN DESC EXTENDED t == Physical Plan == Execute DescribeTableCommand +- DescribeTableCommand `t`, true ``` ## How was this patch tested? Added new tests in SQLQueryTestSuite. Closes #24427 from dilipbiswal/describe_table_explain2. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-22 13:02:10 -07:00
Maxim Gekk	79d3bc0409	[SPARK-27438][SQL] Parse strings with timestamps by to_timestamp() in microsecond precision ## What changes were proposed in this pull request? In the PR, I propose to parse strings to timestamps in microsecond precision by the ` to_timestamp()` function if the specified pattern contains a sub-pattern for seconds fractions. Closes #24342 ## How was this patch tested? By `DateFunctionsSuite` and `DateExpressionsSuite` Closes #24420 from MaxGekk/to_timestamp-microseconds3. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-22 19:41:32 +08:00
Maxim Gekk	777b797867	[SPARK-27522][SQL][TEST] Test migration from INT96 to TIMESTAMP_MICROS for timestamps in parquet ## What changes were proposed in this pull request? Added tests to check migration from `INT96` to `TIMESTAMP_MICROS` (`INT64`) for timestamps in parquet files. In particular: - Append `TIMESTAMP_MICROS` timestamps to existing parquet files with `INT96` timestamps - Append `TIMESTAMP_MICROS` timestamps to a table with `INT96` timestamps - Append `INT96` to `TIMESTAMP_MICROS` timestamps in parquet files - Append `INT96` to `TIMESTAMP_MICROS` timestamps in a table Closes #24417 from MaxGekk/parquet-timestamp-int64-tests. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-22 16:34:13 +09:00
Dilip Biswal	8a8643c28d	[SPARK-27480][SQL] Improve `EXPLAIN DESC QUERY` to show the input SQL statement Currently running explain on describe query gives a little confusing output. This is a minor pr that improves the output of explain. Before ``` 1.EXPLAIN DESCRIBE WITH s AS (SELECT 'hello' as col1) SELECT * FROM s; == Physical Plan == Execute DescribeQueryCommand +- DescribeQueryCommand CTE [s] 2.EXPLAIN EXTENDED DESCRIBE SELECT * from s1 where c1 > 0; == Physical Plan == Execute DescribeQueryCommand +- DescribeQueryCommand 'Project [] ``` After ``` 1. EXPLAIN DESCRIBE WITH s AS (SELECT 'hello' as col1) SELECT FROM s; == Physical Plan == Execute DescribeQueryCommand +- DescribeQueryCommand WITH s AS (SELECT 'hello' as col1) SELECT * FROM s 2. EXPLAIN DESCRIBE SELECT * from s1 where c1 > 0; == Physical Plan == Execute DescribeQueryCommand +- DescribeQueryCommand SELECT * from s1 where c1 > 0 ``` Added a couple of tests in describe-query.sql under SQLQueryTestSuite. Closes #24385 from dilipbiswal/describe_query_explain. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-21 15:35:05 -07:00
WeichenXu	9793d9ec22	[SPARK-27473][SQL] Support filter push down for status fields in binary file data source ## What changes were proposed in this pull request? Support 4 kinds of filters: - LessThan - LessThanOrEqual - GreatThan - GreatThanOrEqual Support filters applied on 2 columns: - modificationTime - length Note: In order to support datasource filter push-down, I flatten schema to be: ``` val schema = StructType( StructField("path", StringType, false) :: StructField("modificationTime", TimestampType, false) :: StructField("length", LongType, false) :: StructField("content", BinaryType, true) :: Nil) ``` ## How was this patch tested? To be added. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #24387 from WeichenXu123/binary_ds_filter. Lead-authored-by: WeichenXu <weichen.xu@databricks.com> Co-authored-by: Xiangrui Meng <meng@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-04-21 12:45:59 -07:00
Liang-Chi Hsieh	ad60c6d9be	[SPARK-27439][SQL] Use analyzed plan when explaining Dataset ## What changes were proposed in this pull request? Because a review is resolved during analysis when we create a dataset, the content of the view is determined when the dataset is created, not when it is evaluated. Now the explain result of a dataset is not correctly consistent with the collected result of it, because we use pre-analyzed logical plan of the dataset in explain command. The explain command will analyzed the logical plan passed in. So if a view is changed after the dataset was created, the plans shown by explain command aren't the same with the plan of the dataset. ```scala scala> spark.range(10).createOrReplaceTempView("test") scala> spark.range(5).createOrReplaceTempView("test2") scala> spark.sql("select * from test").createOrReplaceTempView("tmp001") scala> val df = spark.sql("select * from tmp001") scala> spark.sql("select * from test2").createOrReplaceTempView("tmp001") scala> df.show +---+ \| id\| +---+ \| 0\| \| 1\| \| 2\| \| 3\| \| 4\| \| 5\| \| 6\| \| 7\| \| 8\| \| 9\| +---+ scala> df.explain ``` Before: ```scala == Physical Plan == (1) Range (0, 5, step=1, splits=12) ``` After: ```scala == Physical Plan == (1) Range (0, 10, step=1, splits=12) ``` ## How was this patch tested? Manually test and unit test. Closes #24415 from viirya/SPARK-27439. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-21 10:25:56 -07:00
Gengliang Wang	31488e1ca5	[SPARK-27504][SQL] File source V2: support refreshing metadata cache ## What changes were proposed in this pull request? In file source V1, if some file is deleted manually, reading the DataFrame/Table will throws an exception with suggestion message ``` It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. ``` After refreshing the table/DataFrame, the reads should return correct results. We should follow it in file source V2 as well. ## How was this patch tested? Unit test Closes #24401 from gengliangwang/refreshFileTable. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-19 18:26:03 +08:00
Liang-Chi Hsieh	9c41bfd83c	[SPARK-27502][SQL][TEST] Update nested schema benchmark result for Orc V2 ## What changes were proposed in this pull request? We added nested schema pruning support to Orc V2 recently. The benchmark result should be updated. The benchmark numbers are obtained by running benchmark on r3.xlarge machine. ## How was this patch tested? Test only change. Closes #24399 from viirya/update-orcv2-benchmark. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-18 08:08:22 -07:00
Gengliang Wang	9c238b8a46	[SPARK-27460][TESTS] Running slowest test suites in their own forked JVMs for higher parallelism ## What changes were proposed in this pull request? This patch modifies SparkBuild so that the largest / slowest test suites (or collections of suites) can run in their own forked JVMs, allowing them to be run in parallel with each other. This opt-in / whitelisting approach allows us to increase parallelism without having to fix a long-tail of flakiness / brittleness issues in tests which aren't performance bottlenecks. See comments in SparkBuild.scala for information on the details, including a summary of why we sometimes opt to run entire groups of tests in a single forked JVM . The time of full new pull request test in Jenkins is reduced by around 53%: before changes: 4hr 40min after changes: 2hr 13min ## How was this patch tested? Unit test Closes #24373 from gengliangwang/parallelTest. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-18 20:49:36 +08:00
Gengliang Wang	7d44ba05d1	[SPARK-27490][SQL] File source V2: return correct result for Dataset.inputFiles() ## What changes were proposed in this pull request? Currently, a `Dateset` with file source V2 always return empty results for method `Dataset.inputFiles()`. We should fix it. ## How was this patch tested? Unit test Closes #24393 from gengliangwang/inputFiles. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-18 14:39:30 +08:00
Wenchen Fan	e6618de809	[SPARK-27430][SQL] broadcast hint should be respected for broadcast nested loop join ## What changes were proposed in this pull request? A followup of https://github.com/apache/spark/pull/24164 broadcast hint should be respected for broadcast nested loop join. This PR also refactors the related code a little bit, to save duplicated code. ## How was this patch tested? new tests Closes #24376 from cloud-fan/join. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-17 19:29:28 +08:00
WeichenXu	1bb0c8e407	[SPARK-25348][SQL] Data source for binary files ## What changes were proposed in this pull request? Implement binary file data source in Spark. Format name: "binaryFile" (case-insensitive) Schema: - content: BinaryType - status: StructType - path: StringType - modificationTime: TimestampType - length: LongType Options: * pathGlobFilter (instead of pathFilterRegex) to reply on GlobFilter behavior * maxBytesPerPartition is not implemented since it is controlled by two SQL confs: maxPartitionBytes and openCostInBytes. ## How was this patch tested? Unit test added. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #24354 from WeichenXu123/binary_file_datasource. Lead-authored-by: WeichenXu <weichen.xu@databricks.com> Co-authored-by: Xiangrui Meng <meng@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-04-16 15:41:32 -07:00
liwensun	26ed65f415	[SPARK-27453] Pass partitionBy as options in DataFrameWriter ## What changes were proposed in this pull request? Pass partitionBy columns as options and feature-flag this behavior. ## How was this patch tested? A new unit test. Closes #24365 from liwensun/partitionby. Authored-by: liwensun <liwen.sun@databricks.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2019-04-16 15:03:16 -07:00
Liang-Chi Hsieh	b404e02574	[SPARK-27476][SQL] Refactoring SchemaPruning rule to remove duplicate code ## What changes were proposed in this pull request? In SchemaPruning rule, there is duplicate code for data source v1 and v2. Their logic is the same and we can refactor the rule to remove duplicate code. ## How was this patch tested? Existing tests. Closes #24383 from viirya/SPARK-27476. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-16 14:50:37 -07:00
shivusondur	88d9de26dd	[SPARK-27464][CORE] Added Constant instead of referring string literal used from many places ## What changes were proposed in this pull request? Added Constant instead of referring the same String literal "spark.buffer.pageSize" from many places ## How was this patch tested? Run the corresponding Unit Test Cases manually. Closes #24368 from shivusondur/Constant. Authored-by: shivusondur <shivusondur@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-16 09:30:46 -05:00
Gengliang Wang	f9837d3bf6	[SPARK-27448][SQL] File source V2 table provider should be compatible with V1 provider ## What changes were proposed in this pull request? In the rule `PreprocessTableCreation`, if an existing table is appended with a different provider, the action will fail. Currently, there are two implementations for file sources and creating a table with file source V2 will always fall back to V1 FileFormat. We should consider the following cases as valid: 1. Appending a table with file source V2 provider using the v1 file format 2. Appending a table with v1 file format provider using file source V2 format ## How was this patch tested? Unit test Closes #24356 from gengliangwang/fixTableProvider. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-16 14:26:38 +08:00
Dilip Biswal	3ab96d7acf	[SPARK-27444][SQL][FOLLOWUP][MINOR][TEST] Add a test for describing multi select query. ## What changes were proposed in this pull request? This is a minor pr to add a test to describe a multi select query. ## How was this patch tested? Added a test in describe-query.sql Closes #24370 from dilipbiswal/describe-query-multiselect-test. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-15 21:26:45 +08:00
Gengliang Wang	27d625d785	[SPARK-27459][SQL] Revise the exception message of schema inference failure in file source V2 ## What changes were proposed in this pull request? Since https://github.com/apache/spark/pull/23383/files#diff-db4a140579c1ac4b1dbec7fe5057eecaR36, the exception message of schema inference failure in file source V2 is `tableName`, which is equivalent to `shortName + path`. While in file source V1, the message is `Unable to infer schema from ORC/CSV/JSON...`. We should make the message in V2 consistent with V1, so that in the future migration the related test cases don't need to be modified. https://github.com/apache/spark/pull/24058#pullrequestreview-226364350 ## How was this patch tested? Revert the modified unit test cases in https://github.com/apache/spark/pull/24005/files#diff-b9ddfbc9be8d83ecf100b3b8ff9610b9R431 and https://github.com/apache/spark/pull/23383/files#diff-9ab56940ee5a53f2bb81e3c008653362R577, and test with them. Closes #24369 from gengliangwang/reviseInferSchemaMessage. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-15 21:06:03 +08:00
herman	4704af4c26	[SPARK-27449] Move WholeStageCodegen.limitNotReachedCond class checks into separate methods. ## What changes were proposed in this pull request? This PR moves the checks done in `WholeStageCodegen.limitNotReachedCond` into a separate protected method. This makes it easier to introduce new leaf or blocking nodes. ## How was this patch tested? Existing tests. Closes #24358 from hvanhovell/SPARK-27449. Authored-by: herman <herman@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-14 15:54:20 +08:00
Gengliang Wang	4eb694c58f	[SPARK-27443][SQL] Support UDF input_file_name in file source V2 ## What changes were proposed in this pull request? Currently, if we select the UDF `input_file_name` as a column in file source V2, the results are empty. We should support it in file source V2. ## How was this patch tested? Unit test Closes #24347 from gengliangwang/input_file_name. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-12 20:30:42 +08:00
Dilip Biswal	5d8aee5886	[SPARK-27445][SQL][TEST] Update SQLQueryTestSuite to process files ending with `.sql` ## What changes were proposed in this pull request? While using vi or vim to edit the test files the .swp or .swo files are created and attempt to run the test suite in the presence of these files causes errors like below : ``` nfo] - subquery/exists-subquery/.exists-basic.sql.swp * FAILED * (117 milliseconds) [info] java.io.FileNotFoundException: /Users/dbiswal/mygit/apache/spark/sql/core/target/scala-2.12/test-classes/sql-tests/results/subquery/exists-subquery/.exists-basic.sql.swp.out (No such file or directory) [info] at java.io.FileInputStream.open0(Native Method) [info] at java.io.FileInputStream.open(FileInputStream.java:195) [info] at java.io.FileInputStream.<init>(FileInputStream.java:138) [info] at org.apache.spark.sql.catalyst.util.package$.fileToString(package.scala:49) [info] at org.apache.spark.sql.SQLQueryTestSuite.runQueries(SQLQueryTestSuite.scala:247) [info] at org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runTest$11(SQLQueryTestSuite.scala:192) ``` ~~This minor pr adds these temp files in the ignore list.~~ While computing the list of test files to process, only consider files with `.sql` extension. This makes sure the unwanted temp files created from various editors are ignored from processing. ## How was this patch tested? Verified manually. Closes #24333 from dilipbiswal/dkb_sqlquerytest. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-11 14:50:46 -07:00
Sean Owen	4ec7f631aa	[SPARK-27404][CORE][SQL][STREAMING][YARN] Fix build warnings for 3.0: postfixOps edition ## What changes were proposed in this pull request? Fix build warnings -- see some details below. But mostly, remove use of postfix syntax where it causes warnings without the `scala.language.postfixOps` import. This is mostly in expressions like "120000 milliseconds". Which, I'd like to simplify to things like "2.minutes" anyway. ## How was this patch tested? Existing tests. Closes #24314 from srowen/SPARK-27404. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-11 13:43:44 -05:00
maryannxue	43da473c1c	[SPARK-27225][SQL] Implement join strategy hints ## What changes were proposed in this pull request? This PR extends the existing BROADCAST join hint (for both broadcast-hash join and broadcast-nested-loop join) by implementing other join strategy hints corresponding to the rest of Spark's existing join strategies: shuffle-hash, sort-merge, cartesian-product. The hint names: SHUFFLE_MERGE, SHUFFLE_HASH, SHUFFLE_REPLICATE_NL are partly different from the code names in order to make them clearer to users and reflect the actual algorithms better. The hinted strategy will be used for the join with which it is associated if it is applicable/doable. Conflict resolving rules in case of multiple hints: 1. Conflicts within either side of the join: take the first strategy hint specified in the query, or the top hint node in Dataset. For example, in "select /+ merge(t1) / /+ broadcast(t1) / k1, v2 from t1 join t2 on t1.k1 = t2.k2", take "merge(t1)"; in ```df1.hint("merge").hint("shuffle_hash").join(df2)```, take "shuffle_hash". This is a general hint conflict resolving strategy, not specific to join strategy hint. 2. Conflicts between two sides of the join: a) In case of different strategy hints, hints are prioritized as ```BROADCAST``` over ```SHUFFLE_MERGE``` over ```SHUFFLE_HASH``` over ```SHUFFLE_REPLICATE_NL```. b) In case of same strategy hints but conflicts in build side, choose the build side based on join type and size. ## How was this patch tested? Added new UTs. Closes #24164 from maryannxue/join-hints. Lead-authored-by: maryannxue <maryannxue@apache.org> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-12 00:14:37 +08:00
s71955	239082d966	[SPARK-27403][SQL] Fix `updateTableStats` to update table stats always with new stats or None ## What changes were proposed in this pull request? System shall update the table stats automatically if user set spark.sql.statistics.size.autoUpdate.enabled as true, currently this property is not having any significance even if it is enabled or disabled. This feature is similar to Hives auto-gather feature where statistics are automatically computed by default if this feature is enabled. Reference: https://cwiki.apache.org/confluence/display/Hive/StatsDev As part of fix , autoSizeUpdateEnabled validation is been done initially so that system will calculate the table size for the user automatically and record it in metastore as per user expectation. ## How was this patch tested? UT is written and manually verified in cluster. Tested with unit tests + some internal tests on real cluster. Before fix: ![image](https://user-images.githubusercontent.com/12999161/55688682-cd8d4780-5998-11e9-85da-e1a4e34419f6.png) After fix ![image](https://user-images.githubusercontent.com/12999161/55688654-7d15ea00-5998-11e9-973f-1f4cee27018f.png) Closes #24315 from sujith71955/master_autoupdate. Authored-by: s71955 <sujithchacko.2010@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-11 08:53:00 -07:00
Gengliang Wang	4177292dcd	[SPARK-27435][SQL] Support schema pruning in ORC V2 ## What changes were proposed in this pull request? Currently, the optimization rule `SchemaPruning` only works for Parquet/Orc V1. We should have the same optimization in ORC V2. ## How was this patch tested? Unit test Closes #24338 from gengliangwang/schemaPruningForV2. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-11 20:03:32 +08:00
Maxim Gekk	1470f23ec9	[SPARK-27422][SQL] current_date() should return current date in the session time zone ## What changes were proposed in this pull request? In the PR, I propose to revert 2 commits `06abd06112` and `61561c1c2d`, and take current date via `LocalDate.now` in the session time zone. The result is stored as days since epoch `1970-01-01`. ## How was this patch tested? It was tested by `DateExpressionsSuite`, `DateFunctionsSuite`, `DateTimeUtilsSuite`, and `ComputeCurrentTimeSuite`. Closes #24330 from MaxGekk/current-date2. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-10 21:54:50 +08:00
10129659	5ea4deec44	[SPARK-26012][SQL] Null and '' values should not cause dynamic partition failure of string types Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously. For example, the test bellow will fail before this PR: test("Null and '' values should not cause dynamic partition failure of string types") { withTable("t1", "t2") { spark.range(3).write.saveAsTable("t1") spark.sql("select id, cast(case when id = 1 then '' else null end as string) as p" + " from t1").write.partitionBy("p").saveAsTable("t2") checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), Row(2, null))) } } The error is: 'org.apache.hadoop.fs.FileAlreadyExistsException: File already exists'. This PR convert the empty strings to null for partition values. This is another way for PR(https://github.com/apache/spark/pull/23010) (Please fill in changes proposed in this fix) How was this patch tested? New added test. Closes #24334 from eatoncys/FileFormatWriter. Authored-by: 10129659 <chen.yanshan@zte.com.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-10 19:54:19 +08:00
Ryan Blue	58674d54ba	[SPARK-27181][SQL] Add public transform API ## What changes were proposed in this pull request? This adds a public Expression API that can be used to pass partition transformations to data sources. ## How was this patch tested? Existing tests to validate no regressions. Added transform cases to DDL suite and v1 conversions suite. Closes #24117 from rdblue/add-public-transform-api. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-10 14:30:39 +08:00
Liang-Chi Hsieh	08858f6abc	[SPARK-27253][SQL][FOLLOW-UP] Update doc about parent-session configuration priority ## What changes were proposed in this pull request? The PR #24189 changes the behavior of merging SparkConf. The existing doc is not updated for it. This is a followup of it to update the doc. ## How was this patch tested? Doc only change. Closes #24326 from viirya/SPARK-27253-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-10 13:21:21 +09:00
Maxim Gekk	63e4bf42c2	[SPARK-27401][SQL] Refactoring conversion of Timestamp to/from java.sql.Timestamp ## What changes were proposed in this pull request? In the PR, I propose simpler implementation of `toJavaTimestamp()`/`fromJavaTimestamp()` by reusing existing functions of `DateTimeUtils`. This will allow to: - Simply implementation of `toJavaTimestamp()`, and handle properly negative inputs. - Detect `Long` overflow in conversion of milliseconds (`java.sql.Timestamp`) to microseconds (Catalyst's Timestamp). ## How was this patch tested? By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite` and `CastSuite`. And by new benchmark for export/import timestamps added to `DateTimeBenchmark`: Before: ``` To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Timestamp 290 335 49 17.2 58.0 1.0X Collect longs 1234 1681 487 4.1 246.8 0.2X Collect timestamps 1718 1755 63 2.9 343.7 0.2X ``` After: ``` To/from java.sql.Timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Timestamp 283 301 19 17.7 56.6 1.0X Collect longs 1048 1087 36 4.8 209.6 0.3X Collect timestamps 1425 1479 56 3.5 285.1 0.2X ``` Closes #24311 from MaxGekk/conv-java-sql-date-timestamp. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-09 15:42:27 -07:00
Gengliang Wang	3db117e43e	[SPARK-27407][SQL] File source V2: Invalidate cache data on overwrite/append ## What changes were proposed in this pull request? File source V2 currently incorrectly continues to use cached data even if the underlying data is overwritten. We should follow https://github.com/apache/spark/pull/13566 and fix it by invalidating and refreshes all the cached data (and the associated metadata) for any Dataframe that contains the given data source path. ## How was this patch tested? Unit test Closes #24318 from gengliangwang/invalidCache. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-09 09:25:37 -07:00
francis0407	601fac2cb3	[SPARK-27411][SQL] DataSourceV2Strategy should not eliminate subquery ## What changes were proposed in this pull request? In DataSourceV2Strategy, it seems we eliminate the subqueries by mistake after normalizing filters. We have a sql with a scalar subquery: ``` scala val plan = spark.sql("select * from t2 where t2a > (select max(t1a) from t1)") plan.explain(true) ``` And we get the log info of DataSourceV2Strategy: ``` Pushing operators to csv:examples/src/main/resources/t2.txt Pushed Filters: Post-Scan Filters: isnotnull(t2a#30) Output: t2a#30, t2b#31 ``` The `Post-Scan Filters` should contain the scalar subquery, but we eliminate it by mistake. ``` == Parsed Logical Plan == 'Project [] +- 'Filter ('t2a > scalar-subquery#56 []) : +- 'Project [unresolvedalias('max('t1a), None)] : +- 'UnresolvedRelation `t1` +- 'UnresolvedRelation `t2` == Analyzed Logical Plan == t2a: string, t2b: string Project [t2a#30, t2b#31] +- Filter (t2a#30 > scalar-subquery#56 []) : +- Aggregate [max(t1a#13) AS max(t1a)#63] : +- SubqueryAlias `t1` : +- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt +- SubqueryAlias `t2` +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt == Optimized Logical Plan == Filter (isnotnull(t2a#30) && (t2a#30 > scalar-subquery#56 [])) : +- Aggregate [max(t1a#13) AS max(t1a)#63] : +- Project [t1a#13] : +- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt == Physical Plan == (1) Project [t2a#30, t2b#31] +- (1) Filter isnotnull(t2a#30) +- (1) BatchScan[t2a#30, t2b#31] class org.apache.spark.sql.execution.datasources.v2.csv.CSVScan ``` ## How was this patch tested? ut Closes #24321 from francis0407/SPARK-27411. Authored-by: francis0407 <hanmingcong123@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-09 21:45:46 +08:00
Wenchen Fan	051336d9dd	[SPARK-25496][SQL][FOLLOWUP] avoid using to_utc_timestamp ## What changes were proposed in this pull request? in https://github.com/apache/spark/pull/24195 , we deprecate `from/to_utc_timestamp`. This PR removes unnecessary use of `to_utc_timestamp` in the test. ## How was this patch tested? test only PR Closes #24319 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-09 10:13:38 +08:00
Gengliang Wang	d50603a37c	[SPARK-27271][SQL] Migrate Text to File Data Source V2 ## What changes were proposed in this pull request? Migrate Text source to File Data Source V2 ## How was this patch tested? Unit test Closes #24207 from gengliangwang/textV2. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-08 10:15:22 -07:00
Yuming Wang	33f3c48cac	[SPARK-27176][SQL] Upgrade hadoop-3's built-in Hive maven dependencies to 2.3.4 ## What changes were proposed in this pull request? This PR mainly contains: 1. Upgrade hadoop-3's built-in Hive maven dependencies to 2.3.4. 2. Resolve compatibility issues between Hive 1.2.1 and Hive 2.3.4 in the `sql/hive` module. ## How was this patch tested? jenkins test hadoop-2.7 manual test hadoop-3: ```shell build/sbt clean package -Phadoop-3.2 -Phive export SPARK_PREPEND_CLASSES=true # rm -rf metastore_db cat <<EOF > test_hadoop3.scala spark.range(10).write.saveAsTable("test_hadoop3") spark.table("test_hadoop3").show EOF bin/spark-shell --conf spark.hadoop.hive.metastore.schema.verification=false --conf spark.hadoop.datanucleus.schema.autoCreateAll=true -i test_hadoop3.scala ``` Closes #23788 from wangyum/SPARK-23710-hadoop3. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-04-08 08:42:21 -07:00
Michael Allman	215609def2	[SPARK-25407][SQL] Allow nested access for non-existent field for Parquet file when nested pruning is enabled ## What changes were proposed in this pull request? As part of schema clipping in `ParquetReadSupport.scala`, we add fields in the Catalyst requested schema which are missing from the Parquet file schema to the Parquet clipped schema. However, nested schema pruning requires we ignore unrequested field data when reading from a Parquet file. Therefore we pass two schema to `ParquetRecordMaterializer`: the schema of the file data we want to read and the schema of the rows we want to return. The reader is responsible for reconciling the differences between the two. Aside from checking whether schema pruning is enabled, there is an additional complication to constructing the Parquet requested schema. The manner in which Spark's two Parquet readers reconcile the differences between the Parquet requested schema and the Catalyst requested schema differ. Spark's vectorized reader does not (currently) support reading Parquet files with complex types in their schema. Further, it assumes that the Parquet requested schema includes all fields requested in the Catalyst requested schema. It includes logic in its read path to skip fields in the Parquet requested schema which are not present in the file. Spark's parquet-mr based reader supports reading Parquet files of any kind of complex schema, and it supports nested schema pruning as well. Unlike the vectorized reader, the parquet-mr reader requires that the Parquet requested schema include only those fields present in the underlying Parquet file's schema. Therefore, in the case where we use the parquet-mr reader we intersect the Parquet clipped schema with the Parquet file's schema to construct the Parquet requested schema that's set in the `ReadContext`. _Additional description (by HyukjinKwon):_ Let's suppose that we have a Parquet schema as below: ``` message spark_schema { required int32 id; optional group name { optional binary first (UTF8); optional binary last (UTF8); } optional binary address (UTF8); } ``` Currently, the clipped schema as follows: ``` message spark_schema { optional group name { optional binary middle (UTF8); } optional binary address (UTF8); } ``` Parquet MR does not support access to the nested non-existent field (`name.middle`). To workaround this, this PR removes `name.middle` request at all to Parquet reader as below: ``` Parquet requested schema: message spark_schema { optional binary address (UTF8); } ``` and produces the record (`name.middle`) properly as the requested Catalyst schema. ``` root -- name: struct (nullable = true) \|-- middle: string (nullable = true) -- address: string (nullable = true) ``` I think technically this is what Parquet library should support since Parquet library made a design decision to produce `null` for non-existent fields IIRC. This PR targets to work around it. ## How was this patch tested? A previously ignored test case which exercises the failure scenario this PR addresses has been enabled. This closes #22880 Closes #24307 from dongjoon-hyun/SPARK-25407. Lead-authored-by: Michael Allman <msa@allman.ms> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-08 22:26:02 +09:00
Gengliang Wang	02e9f93309	[SPARK-27384][SQL] File source V2: Prune unnecessary partition columns ## What changes were proposed in this pull request? When scanning file sources, we can prune unnecessary partition columns on constructing input partitions, so that: 1. Reduce the data transformation from Driver to Executors 2. Make it easier to implement columnar batch readers, since the partition columns are already pruned. ## How was this patch tested? Existing unit tests. Closes #24296 from gengliangwang/prunePartitionValue. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-08 15:14:02 +08:00
Jose Torres	4a5768b2a2	[SPARK-27391][SS] Don't initialize a lazy val in ContinuousExecution job. ## What changes were proposed in this pull request? Fix a potential deadlock in ContinuousExecution by not initializing the toRDD lazy val. Closes #24301 from jose-torres/deadlock. Authored-by: Jose Torres <torres.joseph.f+github@gmail.com> Signed-off-by: Jose Torres <torres.joseph.f+github@gmail.com>	2019-04-05 12:56:36 -07:00
Dongjoon Hyun	982c4c8e3c	[SPARK-27390][CORE][SQL][TEST] Fix package name mismatch ## What changes were proposed in this pull request? This PR aims to clean up package name mismatches. ## How was this patch tested? Pass the Jenkins. Closes #24300 from dongjoon-hyun/SPARK-27390. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-05 11:50:37 -07:00
gatorsmile	5678e687c6	[SPARK-27393][SQL] Show ReusedSubquery in the plan when the subquery is reused ## What changes were proposed in this pull request? With this change, we can easily identify the plan difference when subquery is reused. When the reuse is enabled, the plan looks like ``` == Physical Plan == CollectLimit 1 +- (1) Project [(Subquery subquery240 + ReusedSubquery Subquery subquery240) AS (scalarsubquery() + scalarsubquery())#253] : :- Subquery subquery240 : : +- (2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#250]) : : +- Exchange SinglePartition : : +- (1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#256, count#257L]) : : +- (1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] : : +- Scan[obj#12] : +- ReusedSubquery Subquery subquery240 +- (1) SerializeFromObject +- Scan[obj#12] ``` When the reuse is disabled, the plan looks like ``` == Physical Plan == CollectLimit 1 +- (1) Project [(Subquery subquery286 + Subquery subquery287) AS (scalarsubquery() + scalarsubquery())#299] : :- Subquery subquery286 : : +- (2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#296]) : : +- Exchange SinglePartition : : +- (1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#302, count#303L]) : : +- (1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] : : +- Scan[obj#12] : +- Subquery subquery287 : +- (2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#298]) : +- Exchange SinglePartition : +- (1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#306, count#307L]) : +- (1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] : +- Scan[obj#12] +- *(1) SerializeFromObject +- Scan[obj#12] ``` ## How was this patch tested? Modified the existing test. Closes #24258 from gatorsmile/followupSPARK-27279. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-04-05 08:31:41 -07:00
Gengliang Wang	568db94e0c	[SPARK-27356][SQL] File source V2: Fix the case that data columns overlap with partition schema ## What changes were proposed in this pull request? In the current file source V2 framework, the schema of `FileScan` is not returned correctly if there are overlap columns between `dataSchema` and `partitionSchema`. The actual schema should be `dataSchema - overlapSchema + partitionSchema`, which might have different column order from the pushed down `requiredSchema` in `SupportsPushDownRequiredColumns.pruneColumns`. For example, if the data schema is `[a: String, b: String, c: String]` and the partition schema is `[b: Int, d: Int]`, the result schema is `[a: String, b: Int, c: String, d: Int]` in current `FileTable` and `HadoopFsRelation`. while the actual scan schema is `[a: String, c: String, b: Int, d: Int]` in `FileScan`. To fix the corner case, this PR proposes that the output schema of `FileTable` should be `dataSchema - overlapSchema + partitionSchema`, so that the column order is consistent with `FileScan`. Putting all the partition columns to the end of table schema is more reasonable. ## How was this patch tested? Unit test. Closes #24284 from gengliangwang/FixReadSchema. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-05 13:34:46 +08:00
Wenchen Fan	f7bd1ab586	[SPARK-26811][SQL][FOLLOWUP] some more document fixes ## What changes were proposed in this pull request? while working on https://github.com/apache/spark/pull/24129, I realized that I missed some document fixes in https://github.com/apache/spark/pull/24285. This PR covers all of them. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #24295 from cloud-fan/doc.	2019-04-05 01:07:08 +08:00
Wenchen Fan	5c50f68253	[SPARK-26811][SQL][FOLLOWUP] fix some documentation ## What changes were proposed in this pull request? It's a followup of https://github.com/apache/spark/pull/24012 , to fix 2 documentation: 1. `SupportsRead` and `SupportsWrite` are not internal anymore. They are public interfaces now. 2. `Scan` should link the `BATCH_READ` instead of hardcoding it. ## How was this patch tested? N/A Closes #24285 from cloud-fan/doc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-04 10:31:27 +08:00
Liang-Chi Hsieh	d04a7371da	[MINOR][DOC][SQL] Remove out-of-date doc about ORC in DataFrameReader and Writer ## What changes were proposed in this pull request? According to current status, `orc` is available even Hive support isn't enabled. This is a minor doc change to reflect it. ## How was this patch tested? Doc only change. Closes #24280 from viirya/fix-orc-doc. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-03 09:11:09 -07:00
Maxim Gekk	1bc672366d	[SPARK-27344][SQL][TEST] Support the LocalDate and Instant classes in Java Bean encoders ## What changes were proposed in this pull request? - Added new test for Java Bean encoder of the classes: `java.time.LocalDate` and `java.time.Instant`. - Updated comment for `Encoders.bean` - New Row getters: `getLocalDate` and `getInstant` - Extended `inferDataType` to infer types for `java.time.LocalDate` -> `DateType` and `java.time.Instant` -> `TimestampType`. ## How was this patch tested? By `JavaBeanDeserializationSuite` Closes #24273 from MaxGekk/bean-instant-localdate. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-03 17:45:59 +08:00
Dilip Biswal	3286bff942	[SPARK-27255][SQL] Report error when illegal expressions are hosted by a plan operator. ## What changes were proposed in this pull request? In the PR, we raise an AnalysisError when we detect the presense of aggregate expressions in where clause. Here is the problem description from the JIRA. Aggregate functions should not be allowed in WHERE clause. But Spark SQL throws an exception when generating codes. It is supposed to throw an exception during parsing or analyzing. Here is an example: ``` val df = spark.sql("select * from t where sum(ta) > 0") df.explain(true) df.show() ``` Resulting exception: ``` Exception in thread "main" java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(cast(input[0, int, false] as bigint)) at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291) at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290) at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138) at scala.Option.getOrElse(Option.scala:138) ``` Checked the behaviour of other database and all of them return an exception: Postgress ``` select * from foo where max(c1) > 0; Error ERROR: aggregate functions are not allowed in WHERE Position: 25 ``` DB2 ``` db2 => select * from foo where max(c1) > 0; SQL0120N Invalid use of an aggregate function or OLAP function. ``` Oracle ``` select * from foo where max(c1) > 0; ORA-00934: group function is not allowed here ``` MySql ``` select * from foo where max(c1) > 0; Invalid use of group function ``` Update This PR has been enhanced to report error when expressions such as Aggregate, Window, Generate are hosted by operators where they are invalid. ## How was this patch tested? Added tests in AnalysisErrorSuite and group-by.sql Closes #24209 from dilipbiswal/SPARK-27255. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-03 13:05:06 +08:00
Maxim Gekk	1d20d13149	[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp ## What changes were proposed in this pull request? In the PR, I propose to deprecate the `from_utc_timestamp()` and `to_utc_timestamp`, and disable them by default. The functions can be enabled back via the SQL config `spark.sql.legacy.utcTimestampFunc.enabled`. By default, any calls of the functions throw an analysis exception. One of the reason for deprecation is functions violate semantic of `TimestampType` which is number of microseconds since epoch in UTC time zone. Shifting microseconds since epoch by time zone offset doesn't make sense because the result doesn't represent microseconds since epoch in UTC time zone any more, and cannot be considered as `TimestampType`. ## How was this patch tested? The changes were tested by `DateExpressionsSuite` and `DateFunctionsSuite`. Closes #24195 from MaxGekk/conv-utc-timestamp-deprecate. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-03 10:55:56 +08:00
Hyukjin Kwon	d7dd59a6b4	[SPARK-26224][SQL][PYTHON][R][FOLLOW-UP] Add notes about many projects in withColumn at SparkR and PySpark as well ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/23285. This PR adds the notes into PySpark and SparkR documentation as well. While I am here, I revised the doc a bit to make it sound a bit more neutral ## How was this patch tested? Manually built the doc and verified. Closes #24272 from HyukjinKwon/SPARK-26224. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-04-03 08:30:24 +09:00
Sean Owen	d4420b455a	[SPARK-27323][CORE][SQL][STREAMING] Use Single-Abstract-Method support in Scala 2.12 to simplify code ## What changes were proposed in this pull request? Use Single Abstract Method syntax where possible (and minor related cleanup). Comments below. No logic should change here. ## How was this patch tested? Existing tests. Closes #24241 from srowen/SPARK-27323. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-02 07:37:05 -07:00
Dongjoon Hyun	d575a453db	Revert "[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp" This reverts commit `c5e83ab92c`.	2019-04-02 01:05:54 -07:00
Marco Gaido	0b150f833c	[SPARK-26224][SQL] Advice the user when creating many project on subsequent calls to withColumn ## What changes were proposed in this pull request? We have seen many cases when users make several subsequent calls to `withColumn` on a Dataset. This leads now to the generation of a lot of `Project` nodes on the top of the plan, with serious problem which can lead also to `StackOverflowException`s. The PR improves the doc of `withColumn`, in order to advise the user to avoid this pattern and do something different, ie. a single select with all the column he/she needs. ## How was this patch tested? NA Closes #23285 from mgaido91/SPARK-26224. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-04-02 14:12:47 +09:00
Maxim Gekk	c5e83ab92c	[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp ## What changes were proposed in this pull request? In the PR, I propose to deprecate the `from_utc_timestamp()` and `to_utc_timestamp`, and disable them by default. The functions can be enabled back via the SQL config `spark.sql.legacy.utcTimestampFunc.enabled`. By default, any calls of the functions throw an analysis exception. One of the reason for deprecation is functions violate semantic of `TimestampType` which is number of microseconds since epoch in UTC time zone. Shifting microseconds since epoch by time zone offset doesn't make sense because the result doesn't represent microseconds since epoch in UTC time zone any more, and cannot be considered as `TimestampType`. ## How was this patch tested? The changes were tested by `DateExpressionsSuite` and `DateFunctionsSuite`. Closes #24195 from MaxGekk/conv-utc-timestamp-deprecate. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-02 10:20:06 +08:00
Liang-Chi Hsieh	eaf008ad0e	[SPARK-27329][SQL] Pruning nested field in map of map key and value from object serializers ## What changes were proposed in this pull request? If object serializer has map of map key/value, pruning nested field should work. Previously object serializer pruner don't recursively prunes nested fields if it is deeply located in map key or value. This patch proposed to address it by slightly factoring the pruning logic. ## How was this patch tested? Added tests. Closes #24260 from viirya/SPARK-27329. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-01 13:53:55 -07:00
Maxim Gekk	d332958109	[SPARK-27325][SQL] Add implicit encoders for LocalDate and Instant ## What changes were proposed in this pull request? Added implicit encoders for the `java.time.LocalDate` and `java.time.Instant` classes. This allows creation of datasets from instances of the types. ## How was this patch tested? Added new tests to `JavaDatasetSuite` and `DatasetSuite`. Closes #24249 from MaxGekk/instant-localdate-encoders. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-01 23:02:48 +08:00
Marco Gaido	8012f55a9b	[SPARK-26812][SQL] Report correct nullability for complex datatypes in Union ## What changes were proposed in this pull request? When there is a `Union`, the reported output datatypes are the ones of the first plan and the nullability is updated according to all the plans. For complex types, though, the nullability of their elements is not updated using the types from the other plans. This means that the nullability of the inner elements is the one of the first plan. If this is not compatible with the one of other plans, errors can happen (as reported in the JIRA). The PR proposes to update the nullability of the inner elements of complex datatypes according to most permissive value of all the plans. ## How was this patch tested? added UT Closes #23726 from mgaido91/SPARK-26812. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-01 22:22:10 +08:00
chakravarthiT	fc9aad0957	[SPARK-27253][SQL] Prioritizes parent session's SQLConf over SparkConf when cloning a session ## What changes were proposed in this pull request? Cloned session should prioritize `SQLConf` from parent's over `SparkConf`. Currently, when cloning a session, the child session has configuration set in `SparkConf` even the same properties are set to its parent `SQLConf`. Currently, when a Spark session is cloned, `mergeSparkConf` in `BaseSessionStateBuilder`'s `conf` overwrites `SQLConf` values as set in `SparkConf`. This PR proposes to call `mergeSparkConf` only when the parent session is empty. See below codes to read. 1. Parent's `sessionState` `c26379b446/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (L268)` `c26379b446/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (L157-L161)` `5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L88-L90)` 2. Child `sessionState` `c26379b446/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (L269)` `c26379b446/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (L155)` `c26379b446/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala (L102)` `c26379b446/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala (L74)` `5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L305)` `5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L283)` `5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L292)` `5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L88-L90)` ## How was this patch tested? Added UT and with existing Unit Tests. Closes #24189 from chakravarthiT/CloneDiscardsConf. Authored-by: chakravarthiT <tcchakra@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-04-01 09:33:18 +09:00
Takeshi Yamamuro	885aab40a2	[SPARK-27266][SQL] Support ANALYZE TABLE to collect tables stats for cached catalog views ## What changes were proposed in this pull request? The current master doesn't support ANALYZE TABLE to collect tables stats for catalog views even if they are cached as follows; ```scala scala> sql(s"CREATE VIEW v AS SELECT 1 c") scala> sql(s"CACHE LAZY TABLE v") scala> sql(s"ANALYZE TABLE v COMPUTE STATISTICS") org.apache.spark.sql.AnalysisException: ANALYZE TABLE is not supported on views.; ... ``` Since SPARK-25196 has supported to an ANALYZE command to collect column statistics for cached catalog view, we could support table stats, too. ## How was this patch tested? Added tests in `StatisticsCollectionSuite` and `InMemoryColumnarQuerySuite`. Closes #24200 from maropu/SPARK-27266. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-31 17:24:21 -07:00
Maxim Gekk	6115a5e1a0	[SPARK-27327][SQL] New JSON benchmarks: functions, Dataset[String] ## What changes were proposed in this pull request? Added new benchmarks for: 1. JSON functions: `from_json`, `json_tuple` and `get_json_object` 2. Parsing `Dataset[String]` with JSON records 3. Comparing just splitting input text by lines with schema inferring, per-line parsing when encoding is set and not set. Also existing benchmarks were refactored to use the `NoOp` datasource to eliminate overhead of triggers like `.filter((_: Row) => true).count()`. ## How was this patch tested? By running `JSONBenchmark` locally. Closes #24252 from MaxGekk/json-benchmark-func. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-04-01 08:33:16 +09:00
Gengliang Wang	5dab5f651f	[SPARK-27326][SQL] Fall back all v2 file sources in `InsertIntoTable` to V1 FileFormat ## What changes were proposed in this pull request? In the first PR for file source V2, there was a rule for falling back Orc V2 table to OrcFileFormat: https://github.com/apache/spark/pull/23383/files#diff-57e8244b6964e4f84345357a188421d5R34 As we are migrating more file sources to data source V2, we should make the rule more generic. This PR proposes to: 1. Rename the rule `FallbackOrcDataSourceV2 ` to `FallBackFileSourceV2`.The name is more generic. And we use "fall back" as verb, while "fallback" is noun. 2. Rename the method `fallBackFileFormat` in `FileDataSourceV2` to `fallbackFileFormat`. Here we should use "fallback" as noun. 3. Add new method `fallbackFileFormat` in `FileTable`. This is for falling back to V1 in rule `FallbackOrcDataSourceV2 `. ## How was this patch tested? Existing Unit tests. Closes #24251 from gengliangwang/fallbackV1Rule. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-30 14:38:26 -07:00
10129659	144b35fe3a	[SPARK-27320][SQL] Replacing index with iterator to traverse the expressions list in AggregationIterator, which make it simpler ## What changes were proposed in this pull request? In AggregationIterator's loop function, we access the expressions by `expressions(i)`, the type of `expressions` is `::`, a subtype of list. ``` while (i < expressionsLength) { val func = expressions(i).aggregateFunction ``` This PR replacing index with iterator to access the expressions list, which make it simpler. ## How was this patch tested? Existing tests. Closes #24238 from eatoncys/array. Authored-by: 10129659 <chen.yanshan@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-30 02:27:12 -05:00
Maxim Gekk	06abd06112	[SPARK-27252][SQL] Make current_date() independent from time zones ## What changes were proposed in this pull request? This makes the `CurrentDate` expression and `current_date` function independent from time zone settings. New result is number of days since epoch in `UTC` time zone. Previously, Spark shifted the current date (in `UTC` time zone) according the session time zone which violets definition of `DateType` - number of days since epoch (which is an absolute point in time, midnight of Jan 1 1970 in UTC time). The changes makes `CurrentDate` consistent to `CurrentTimestamp` which is independent from time zone too. ## How was this patch tested? The changes were tested by existing test suites like `DateExpressionsSuite`. Closes #24185 from MaxGekk/current-date. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-28 18:44:08 -07:00
Xianyang Liu	50cded590f	[MINOR] Move java file to java directory ## What changes were proposed in this pull request? move ```scala org.apache.spark.sql.execution.streaming.BaseStreamingSource org.apache.spark.sql.execution.streaming.BaseStreamingSink ``` to java directory ## How was this patch tested? Existing UT. Closes #24222 from ConeyLiu/move-scala-to-java. Authored-by: Xianyang Liu <xianyang.liu@intel.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-28 12:11:00 -05:00
Gengliang Wang	49b0411549	[SPARK-27291][SQL] PartitioningAwareFileIndex: Filter out empty files on listing files ## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/23130, all empty files are excluded from target file splits in `FileSourceScanExec`. In File source V2, we should keep the same behavior. This PR suggests to filter out empty files on listing files in `PartitioningAwareFileIndex` so that the upper level doesn't need to handle them. ## How was this patch tested? Unit test Closes #24227 from gengliangwang/ignoreEmptyFile. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-27 10:08:38 -07:00
Daoyuan Wang	f1fe805bed	[SPARK-27279][SQL] Reuse subquery should compare child plan of `SubqueryExec` ## What changes were proposed in this pull request? For now, `ReuseSubquery` in Spark compares two subqueries at `SubqueryExec` level, which invalidates the `ReuseSubquery` rule. This pull request fixes this, and add a configuration key for subquery reuse exclusively. ## How was this patch tested? add a unit test. Closes #24214 from adrian-wang/reuse. Authored-by: Daoyuan Wang <me@daoyuan.wang> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-03-27 08:45:22 -07:00
Takeshi Yamamuro	956b52b167	[SPARK-26771][SQL][FOLLOWUP] Make all the uncache operations non-blocking by default ## What changes were proposed in this pull request? To make the blocking behaviour consistent, this pr made catalog table/view `uncacheQuery` non-blocking by default. If this pr merged, all the behaviours in spark are non-blocking by default. ## How was this patch tested? Pass Jenkins. Closes #24212 from maropu/SPARK-26771-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-03-27 21:01:36 +09:00
Liang-Chi Hsieh	93ff69003b	[SPARK-27288][SQL] Pruning nested field in complex map key from object serializers ## What changes were proposed in this pull request? In the original PR #24158, pruning nested field in complex map key was not supported, because some methods in schema pruning did't support it at that moment. This is a followup to add it. ## How was this patch tested? Added tests. Closes #24220 from viirya/SPARK-26847-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-03-27 19:40:14 +09:00
liuxian	fac31104f6	[SPARK-27083][SQL] Add a new conf to control subqueryReuse ## What changes were proposed in this pull request? Subquery Reuse and Exchange Reuse are not the same feature， if we don't want to reuse subqueries，and we just want to reuse exchanges，only one configuration that cannot be done. This PR adds a new configuration `spark.sql.subquery.reuse` to control subqueryReuse. ## How was this patch tested? N/A Closes #23998 from 10110346/SUBQUERY_REUSE. Authored-by: liuxian <liu.xian3@zte.com.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-26 23:37:58 -07:00
Gengliang Wang	6bcd4805d2	[SPARK-27286][SQL] Handles exceptions on proceeding to next record in FilePartitionReader ## What changes were proposed in this pull request? In data source V2, the method `PartitionReader.next()` has side effects. When the method is called, the current reader proceeds to the next record. This might throw RuntimeException/IOException and File source V2 framework should handle these exceptions. ## How was this patch tested? Unit test. Closes #24225 from gengliangwang/corruptFile. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-26 22:33:34 -07:00
Yuming Wang	ca1433b94a	[SPARK-27182][SQL] Move the conflict source code of the sql/core module to sql/core/v1.2.1 ## What changes were proposed in this pull request? To make https://github.com/apache/spark/pull/23788 easy to review. This PR moves `OrcColumnVector.java`, `OrcShimUtils.scala`, `OrcFilters.scala` and `OrcFilterSuite.scala` to `sql/core/v1.2.1` and copies it to `sql/core/v2.3.4`. ## How was this patch tested? manual tests ```shell diff -urNa sql/core/v1.2.1 sql/core/v2.3.4 ``` Closes #24119 from wangyum/SPARK-27182. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-03-26 22:32:03 -07:00
Dilip Biswal	6c0e13b456	[SPARK-27285] Support describing output of CTE ## What changes were proposed in this pull request? SPARK-26982 allows users to describe output of a query. However, it had a limitation of not supporting CTEs due to limitation of the grammar having a single rule to parse both select and inserts. After SPARK-27209, which splits select and insert parsing to two different rules, we can now support describing output of the CTEs easily. ## How was this patch tested? Existing tests were modified. Closes #24224 from dilipbiswal/describe_support_cte. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-26 16:00:56 -07:00
Gengliang Wang	267160b360	[SPARK-27269][SQL] File source v2 should validate data schema only ## What changes were proposed in this pull request? Currently, File source v2 allows each data source to specify the supported data types by implementing the method `supportsDataType` in `FileScan` and `FileWriteBuilder`. However, in the read path, the validation checks all the data types in `readSchema`, which might contain partition columns. This is actually a regression. E.g. Text data source only supports String data type, while the partition columns can still contain Integer type since partition columns are processed by Spark. This PR is to: 1. Refactor schema validation and check data schema only. 2. Filter the partition columns in data schema if user specified schema provided. ## How was this patch tested? Unit test Closes #24203 from gengliangwang/schemaValidation. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-27 07:58:31 +09:00
Maxim Gekk	69035684d4	[SPARK-27242][SQL] Make formatting TIMESTAMP/DATE literals independent from the default time zone ## What changes were proposed in this pull request? In the PR, I propose to use the SQL config `spark.sql.session.timeZone` in formatting `TIMESTAMP` literals, and make formatting `DATE` literals independent from time zone. The changes make parsing and formatting `TIMESTAMP`/`DATE` literals consistent, and independent from the default time zone of current JVM. Also this PR ports `TIMESTAMP`/`DATE` literals formatting on Proleptic Gregorian Calendar via using `TimestampFormatter`/`DateFormatter`. ## How was this patch tested? Added new tests to `LiteralExpressionSuite` Closes #24181 from MaxGekk/timezone-aware-literals. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-26 15:29:59 -07:00
Dilip Biswal	9cc925cda2	[SPARK-27209][SQL] Split parsing of SELECT and INSERT into two top-level rules in the grammar file. ## What changes were proposed in this pull request? Currently in the grammar file the rule `query` is responsible to parse both select and insert statements. As a result, we need to have more semantic checks in the code to guard against in-valid insert constructs in a query. Couple of examples are in the `visitCreateView` and `visitAlterView` functions. One other issue is that, we don't catch the `invalid insert constructs` in all the places until checkAnalysis (the errors we raise can be confusing as well). Here are couple of examples : ```SQL select * from (insert into bar values (2)); ``` ``` Error in query: unresolved operator 'Project []; 'Project [] +- SubqueryAlias `__auto_generated_subquery_name` +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1] +- Project [cast(col1#18 as int) AS c1#20] +- LocalRelation [col1#18] ``` ```SQL select * from foo where c1 in (insert into bar values (2)) ``` ``` Error in query: cannot resolve '(default.foo.`c1` IN (listquery()))' due to data type mismatch: The number of columns in the left hand side of an IN subquery does not match the number of columns in the output of subquery. #columns in left hand side: 1. #columns in right hand side: 0. Left side columns: [default.foo.`c1`]. Right side columns: [].;; 'Project [*] +- 'Filter c1#6 IN (list#5 []) : +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1] : +- Project [cast(col1#7 as int) AS c1#9] : +- LocalRelation [col1#7] +- SubqueryAlias `default`.`foo` +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#6] ``` For both the cases above, we should reject the syntax at parser level. In this PR, we create two top-level parser rules to parse `SELECT` and `INSERT` respectively. I will create a small PR to allow CTEs in DESCRIBE QUERY after this PR is in. ## How was this patch tested? Added tests to PlanParserSuite and removed the semantic check tests from SparkSqlParserSuites. Closes #24150 from dilipbiswal/split-query-insert. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-25 17:43:03 -07:00
Yuming Wang	300ec1a74c	[SPARK-27226][SQL] Reduce the code duplicate when upgrading built-in Hive ## What changes were proposed in this pull request? This pr related to #24119. Reduce the code duplicate when upgrading built-in Hive. To achieve this, we should avoid using classes in `org.apache.orc.storage.` because these classes will be replaced with `org.apache.hadoop.hive.` after upgrading the built-in Hive. Such as: ![image](https://user-images.githubusercontent.com/5399861/54437594-e9be1000-476f-11e9-8878-3b7414871ee5.png) - Move the usage of `org.apache.orc.storage.*` to `OrcShimUtils`: 1. Add wrapper for `VectorizedRowBatch`(Reduce code duplication of [OrcColumnarBatchReader](https://github.com/apache/spark/pull/24166/files#diff-e594f7295e5408c01ace8175166313b6)). 2. Move some serializer/deserializer method out of `OrcDeserializer` and `OrcSerializer`(Reduce code duplication of [OrcDeserializer](https://github.com/apache/spark/pull/24166/files#diff-b933819e6dcaff41eee8fce1e8f2932c) and [OrcSerializer](https://github.com/apache/spark/pull/24166/files#diff-6d3849d88929f6ea25c436d71da729da)). 3. Defined two type aliases: `Operator` and `SearchArgument`(Reduce code duplication of [OrcV1FilterSuite](https://github.com/apache/spark/pull/24166/files#diff-48c4fc7a3b3384a6d0aab246723a0058)). - Move duplication code to super class: 1. Add a trait for `OrcFilters`(Reduce code duplication of [OrcFilters](https://github.com/apache/spark/pull/24166/files#diff-224b8cbedf286ecbfdd092d1e2e2f237)). 2. Move `checkNoFilterPredicate` from `OrcFilterSuite` to `OrcTest`(Reduce code duplication of [OrcFilterSuite](https://github.com/apache/spark/pull/24166/files#diff-8e05c1faaaec98edd7723e62f84066f1)). After this pr. We only need to copy these 4 files: OrcColumnVector, OrcFilters, OrcFilterSuite and OrcShimUtils. ## How was this patch tested? existing tests Closes #24166 from wangyum/SPARK-27226. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-25 19:39:00 -05:00
Ajith	b61dce23d2	[SPARK-26961][CORE] Enable parallel classloading capability ## What changes were proposed in this pull request? As per https://docs.oracle.com/javase/8/docs/api/java/lang/ClassLoader.html ``Class loaders that support concurrent loading of classes are known as parallel capable class loaders and are required to register themselves at their class initialization time by invoking the ClassLoader.registerAsParallelCapable method. Note that the ClassLoader class is registered as parallel capable by default. However, its subclasses still need to register themselves if they are parallel capable. `` i.e we can have finer class loading locks by registering classloaders as parallel capable. (Refer to deadlock due to macro lock https://issues.apache.org/jira/browse/SPARK-26961). All the classloaders we have are wrapper of URLClassLoader which by itself is parallel capable. But this cannot be achieved by scala code due to static registration Refer https://github.com/scala/bug/issues/11429 ## How was this patch tested? All Existing UT must pass Closes #24126 from ajithme/driverlock. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-25 19:07:30 -05:00
Liang-Chi Hsieh	8433ff6607	[SPARK-26847][SQL] Pruning nested serializers from object serializers: MapType support ## What changes were proposed in this pull request? In SPARK-26837, we prune nested fields from object serializers if they are unnecessary in the query execution. SPARK-26837 leaves the support of MapType as a TODO item. This proposes to support map type. ## How was this patch tested? Added tests. Closes #24158 from viirya/SPARK-26847. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-25 15:36:58 -07:00
Liang-Chi Hsieh	5a36cf66ed	[SPARK-27268][SQL] Add map_keys and map_values support in nested schema pruning. ## What changes were proposed in this pull request? We need to add `map_keys` and `map_values` into `ProjectionOverSchema` to support those methods in nested schema pruning. This also adds end-to-end tests to SchemaPruningSuite. ## How was this patch tested? Added tests. Closes #24202 from viirya/SPARK-27268. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-25 15:32:01 -07:00
liuxian	e4b36df2c0	[SPARK-27256][CORE][SQL] If the configuration is used to set the number of bytes, we'd better use `bytesConf`'. ## What changes were proposed in this pull request? Currently, if we want to configure `spark.sql.files.maxPartitionBytes` to 256 megabytes, we must set `spark.sql.files.maxPartitionBytes=268435456`, which is very unfriendly to users. And if we set it like this:`spark.sql.files.maxPartitionBytes=256M`, we will encounter this exception: ``` Exception in thread "main" java.lang.IllegalArgumentException: spark.sql.files.maxPartitionBytes should be long, but was 256M at org.apache.spark.internal.config.ConfigHelpers$.toNumber(ConfigBuilder.scala) ``` This PR use `bytesConf` to replace `longConf` or `intConf`, if the configuration is used to set the number of bytes. Configuration change list: `spark.files.maxPartitionBytes` `spark.files.openCostInBytes` `spark.shuffle.sort.initialBufferSize` `spark.shuffle.spill.initialMemoryThreshold` `spark.sql.autoBroadcastJoinThreshold` `spark.sql.files.maxPartitionBytes` `spark.sql.files.openCostInBytes` `spark.sql.defaultSizeInBytes` ## How was this patch tested? 1.Existing unit tests 2.Manual testing Closes #24187 from 10110346/bytesConf. Authored-by: liuxian <liu.xian3@zte.com.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-25 14:47:40 -07:00
Sean Owen	8bc304f97e	[SPARK-26132][BUILD][CORE] Remove support for Scala 2.11 in Spark 3.0.0 ## What changes were proposed in this pull request? Remove Scala 2.11 support in build files and docs, and in various parts of code that accommodated 2.11. See some targeted comments below. ## How was this patch tested? Existing tests. Closes #23098 from srowen/SPARK-26132. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-25 10:46:42 -05:00
Takeshi Yamamuro	b8a0f981f2	[SPARK-25196][SQL][FOLLOWUP] Fix wrong tests in StatisticsCollectionSuite ## What changes were proposed in this pull request? This is a follow-up of #24047 and it fixed wrong tests in `StatisticsCollectionSuite`. ## How was this patch tested? Pass Jenkins. Closes #24198 from maropu/SPARK-25196-FOLLOWUP-2. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-03-25 21:02:01 +09:00
Maxim Gekk	52671d631d	[SPARK-27008][SQL][FOLLOWUP] Fix typo from `_EANBLED` to `_ENABLED` ## What changes were proposed in this pull request? This fixes a typo in the SQL config value: DATETIME_JAVA8API_EANBLED -> DATETIME_JAVA8API_ENABLED. ## How was this patch tested? This was tested by `RowEncoderSuite` and `LiteralExpressionSuite`. Closes #24194 from MaxGekk/date-localdate-followup. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-24 17:16:33 -07:00
Takeshi Yamamuro	01e63053df	[SPARK-25196][SPARK-27251][SQL][FOLLOWUP] Add synchronized for InMemoryRelation.statsOfPlanToCache ## What changes were proposed in this pull request? This is a follow-up of #24047; to follow the `CacheManager.cachedData` lock semantics, this pr wrapped the `statsOfPlanToCache` update with `synchronized`. ## How was this patch tested? Pass Jenkins Closes #24178 from maropu/SPARK-24047-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-23 22:54:27 -07:00
Gengliang Wang	624288556d	[SPARK-27085][SQL] Migrate CSV to File Data Source V2 ## What changes were proposed in this pull request? Migrate CSV to File Data Source V2. ## How was this patch tested? Unit test Closes #24005 from gengliangwang/CSVDataSourceV2. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-23 15:43:46 -07:00
Maxim Gekk	027ed2d11b	[SPARK-23643][CORE][SQL][ML] Shrinking the buffer in hashSeed up to size of the seed parameter ## What changes were proposed in this pull request? The hashSeed method allocates 64 bytes instead of 8. Other bytes are always zeros (thanks to default behavior of ByteBuffer). And they could be excluded from hash calculation because they don't differentiate inputs. ## How was this patch tested? By running the existing tests - XORShiftRandomSuite Closes #20793 from MaxGekk/hash-buff-size. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-23 11:26:09 -05:00
Ryan Blue	34e3cc7060	[SPARK-27108][SQL] Add parsed SQL plans for create, CTAS. ## What changes were proposed in this pull request? This moves parsing `CREATE TABLE ... USING` statements into catalyst. Catalyst produces logical plans with the parsed information and those plans are converted to v1 `DataSource` plans in `DataSourceAnalysis`. This prepares for adding v2 create plans that should receive the information parsed from SQL without being translated to v1 plans first. This also makes it possible to parse in catalyst instead of breaking the parser across the abstract `AstBuilder` in catalyst and `SparkSqlParser` in core. For more information, see the [mailing list thread](https://lists.apache.org/thread.html/54f4e1929ceb9a2b0cac7cb058000feb8de5d6c667b2e0950804c613%3Cdev.spark.apache.org%3E). ## How was this patch tested? This uses existing tests to catch regressions. This introduces no behavior changes. Closes #24029 from rdblue/SPARK-27108-add-parsed-create-logical-plans. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-22 13:58:54 -07:00
Jungtaek Lim (HeartSaVioR)	78d546fe15	[SPARK-27210][SS] Cleanup incomplete output files in ManifestFileCommitProtocol if task is aborted ## What changes were proposed in this pull request? This patch proposes ManifestFileCommitProtocol to clean up incomplete output files in task level if task aborts. Please note that this works as 'best-effort', not kind of guarantee, as we have in HadoopMapReduceCommitProtocol. ## How was this patch tested? Added UT. Closes #24154 from HeartSaVioR/SPARK-27210. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2019-03-22 11:26:53 -07:00
Martin Junghanns	8efc5ec72e	[SPARK-27174][SQL] Add support for casting integer types to binary Co-authored-by: Philip Stutz <philip.stutzgmail.com> ## What changes were proposed in this pull request? This PR adds support for casting * `ByteType` * `ShortType` * `IntegerType` * `LongType` to `BinaryType`. ## How was this patch tested? We added unit tests for casting instances of the above types. For validation, we used Javas `DataOutputStream` to compare the resulting byte array with the result of `Cast`. We state that the contribution is our original work and that we license the work to the project under the project’s open source license. cloud-fan we'd appreciate a review if you find the time, thx Closes #24107 from s1ck/cast_to_binary. Authored-by: Martin Junghanns <martin.junghanns@neotechnology.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-22 10:09:35 -07:00
Maxim Gekk	a529be2930	[SPARK-27212][SQL] Eliminate TimeZone to ZoneId conversion in stringToTimestamp ## What changes were proposed in this pull request? In the PR, I propose to avoid the `TimeZone` to `ZoneId` conversion in `DateTimeUtils.stringToTimestamp` by changing signature of the method, and require a parameter of `ZoneId` type. This will allow to avoid unnecessary conversion (`TimeZone` -> `String` -> `ZoneId`) per each row. Also the PR avoids creation of `ZoneId` instances from `ZoneOffset` because `ZoneOffset` is a sub-class, and the conversion is unnecessary too. ## How was this patch tested? It was tested by `DateTimeUtilsSuite` and `CastSuite`. Closes #24155 from MaxGekk/stringtotimestamp-zoneid. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-22 18:01:29 +09:00
maryannxue	9f58d3b436	[SPARK-27236][TEST] Refactor log-appender pattern in tests ## What changes were proposed in this pull request? Refactored code in tests regarding the "withLogAppender()" pattern by creating a general helper method in SparkFunSuite. ## How was this patch tested? Passed existing tests. Closes #24172 from maryannxue/log-appender. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-03-21 19:18:30 -07:00
John Zhuge	80565ce253	[SPARK-26946][SQL] Identifiers for multi-catalog ## What changes were proposed in this pull request? - Support N-part identifier in SQL - N-part identifier extractor in Analyzer ## How was this patch tested? - A new unit test suite ResolveMultipartRelationSuite - CatalogLoadingSuite rblue cloud-fan mccheah Closes #23848 from jzhuge/SPARK-26946. Authored-by: John Zhuge <jzhuge@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-21 18:04:50 -07:00
Takeshi Yamamuro	0627850b7e	[SPARK-25196][SQL] Extends the analyze column command for cached tables ## What changes were proposed in this pull request? This pr extended `ANALYZE` commands to analyze column stats for cached table. In common use cases, users read catalog table data, join/aggregate them, and then cache the result for following reuse. Since we are only allowed to analyze column statistics in catalog tables via ANALYZE commands, the current optimization depends on non-existing or inaccurate column statistics of cached data. So, it would be great if we could analyze cached data as follows; ```scala scala> def printColumnStats(tableName: String) = { \| spark.table(tableName).queryExecution.optimizedPlan.stats.attributeStats.foreach { \| case (k, v) => println(s"[$k]: $v") \| } \| } scala> sql("SET spark.sql.cbo.enabled=true") scala> sql("SET spark.sql.statistics.histogram.enabled=true") scala> spark.range(1000).selectExpr("id % 33 AS c0", "rand() AS c1", "0 AS c2").write.saveAsTable("t") scala> sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS c0, c1, c2") scala> spark.table("t").groupBy("c0").agg(count("c1").as("v1"), sum("c2").as("v2")).createTempView("temp") // Prints column statistics in catalog table `t` scala> printColumnStats("t") [c0#7073L]: ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;9f7c1c)),2) [c1#7074]: ColumnStat(Some(944),Some(3.2108484832404915E-4),Some(0.997584797423909),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;60a386b1)),2) [c2#7075]: ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(4),Some(4),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;5ffd29e8)),2) // Prints column statistics on cached table `temp` scala> sql("CACHE TABLE temp") scala> printColumnStats("temp") <No Column Statistics> // Analyzes columns `v1` and `v2` on cached table `temp` scala> sql("ANALYZE TABLE temp COMPUTE STATISTICS FOR COLUMNS v1, v2") // Then, prints again scala> printColumnStats("temp") [v1#7084L]: ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;49f7bb6f)),2) [v2#7086L]: ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;12701677)),2) // Analyzes one left column and prints again scala> sql("ANALYZE TABLE temp COMPUTE STATISTICS FOR COLUMNS c0") scala> printColumnStats("temp") [v1#7084L]: ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;49f7bb6f)),2) [v2#7086L]: ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;12701677)),2) [c0#7073L]: ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;1f5c1b81)),2) ``` ## How was this patch tested? Added tests in `CachedTableSuite` and `StatisticsCollectionSuite`. Closes #24047 from maropu/SPARK-25196-4. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-21 09:20:35 -07:00
Bryan Cutler	be08b415da	[SPARK-27163][PYTHON] Cleanup and consolidate Pandas UDF functionality ## What changes were proposed in this pull request? This change is a cleanup and consolidation of 3 areas related to Pandas UDFs: 1) `ArrowStreamPandasSerializer` now inherits from `ArrowStreamSerializer` and uses the base class `dump_stream`, `load_stream` to create Arrow reader/writer and send Arrow record batches. `ArrowStreamPandasSerializer` makes the conversions to/from Pandas and converts to Arrow record batch iterators. This change removed duplicated creation of Arrow readers/writers. 2) `createDataFrame` with Arrow now uses `ArrowStreamPandasSerializer` instead of doing its own conversions from Pandas to Arrow and sending record batches through `ArrowStreamSerializer`. 3) Grouped Map UDFs now reuse existing logic in `ArrowStreamPandasSerializer` to send Pandas DataFrame results as a `StructType` instead of separating each column from the DataFrame. This makes the code a little more consistent with the Python worker, but does require that the returned StructType column is flattened out in `FlatMapGroupsInPandasExec` in Scala. ## How was this patch tested? Existing tests and ran tests with pyarrow 0.12.0 Closes #24095 from BryanCutler/arrow-refactor-cleanup-UDFs. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-21 17:44:51 +09:00
maryannxue	2e090ba628	[SPARK-27223][SQL] Remove private methods that skip conversion when passing user schemas for constructing a DataFrame ## What changes were proposed in this pull request? When passing in a user schema to create a DataFrame, there might be mismatched nullability between the user schema and the the actual data. All related public interfaces now perform catalyst conversion using the user provided schema, which catches such mismatches to avoid runtime errors later on. However, there're private methods which allow this conversion to be skipped, so we need to remove these private methods which may lead to confusion and potential issues. ## How was this patch tested? Passed existing tests. No new tests were added since this PR removed the private interfaces that would potentially cause null problems and other interfaces are covered already by existing tests. Closes #24162 from maryannxue/spark-27223. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-21 11:13:25 +09:00
wangguangxin.cn	46f9f44918	[SPARK-27202][MINOR][SQL] Update comments to keep according with code ## What changes were proposed in this pull request? Update comments in `InMemoryFileIndex.listLeafFiles` to keep according with code. ## How was this patch tested? existing test cases Closes #24146 from WangGuangxin/SPARK-27202. Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-20 17:54:28 -05:00
Maxim Gekk	1882912cca	[SPARK-27199][SQL] Replace TimeZone by ZoneId in TimestampFormatter API ## What changes were proposed in this pull request? In the PR, I propose to use `ZoneId` instead of `TimeZone` in: - the `apply` and `getFractionFormatter ` methods of the `TimestampFormatter` object, - and in implementations of the `TimestampFormatter` trait like `FractionTimestampFormatter`. The reason of the changes is to avoid unnecessary conversion from `TimeZone` to `ZoneId` because `ZoneId` is used in `TimestampFormatter` implementations internally, and the conversion is performed via `String` which is not for free. Also taking into account that `TimeZone` instances are converted from `String` in some cases, the worse case looks like `String` -> `TimeZone` -> `String` -> `ZoneId`. The PR eliminates the unneeded conversions. ## How was this patch tested? It was tested by `DateExpressionsSuite`, `DateTimeUtilsSuite` and `TimestampFormatterSuite`. Closes #24141 from MaxGekk/zone-id. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-20 21:28:11 +09:00
Huon Wilson	b67d369572	[SPARK-27099][SQL] Add 'xxhash64' for hashing arbitrary columns to Long ## What changes were proposed in this pull request? This introduces a new SQL function 'xxhash64' for getting a 64-bit hash of an arbitrary number of columns. This is designed to exactly mimic the 32-bit `hash`, which uses MurmurHash3. The name is designed to be more future-proof than the 'hash', by indicating the exact algorithm used, similar to md5 and the sha hashes. ## How was this patch tested? The tests for the existing `hash` function were duplicated to run with `xxhash64`. Closes #24019 from huonw/hash64. Authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-20 16:34:34 +08:00
Darcy Shen	9a43852f17	[SPARK-27160][SQL] Fix DecimalType when building orc filters ## What changes were proposed in this pull request? DecimalType Literal should not be casted to Long. eg. For `df.filter("x < 3.14")`, assuming df (x in DecimalType) reads from a ORC table and uses the native ORC reader with predicate push down enabled, we will push down the `x < 3.14` predicate to the ORC reader via a SearchArgument. OrcFilters will construct the SearchArgument, but not handle the DecimalType correctly. The previous impl will construct `x < 3` from `x < 3.14`. ## How was this patch tested? ``` $ sbt > sql/testOnly OrcFilterSuite > sql/testOnly OrcQuerySuite -- -z "27160" ``` Closes #24092 from sadhen/spark27160. Authored-by: Darcy Shen <sadhen@zoho.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-19 20:28:46 -07:00
Dongjoon Hyun	257391497b	[SPARK-26975][SQL] Support nested-column pruning over limit/sample/repartition ## What changes were proposed in this pull request? As [SPARK-26958](https://github.com/apache/spark/pull/23862/files) benchmark shows, nested-column pruning has limitations. This PR aims to remove the limitations on `limit/repartition/sample`. Here, repartition means `Repartition`, not `RepartitionByExpression`. PREPARATION ```scala scala> spark.range(100).map(x => (x, (x, s"$x" * 100))).toDF("col1", "col2").write.mode("overwrite").save("/tmp/p") scala> sql("set spark.sql.optimizer.nestedSchemaPruning.enabled=true") scala> spark.read.parquet("/tmp/p").createOrReplaceTempView("t") ``` BEFORE ```scala scala> sql("SELECT col2._1 FROM (SELECT col2 FROM t LIMIT 1000000)").explain == Physical Plan == CollectLimit 1000000 +- (1) Project [col2#22._1 AS _1#28L] +- (1) FileScan parquet [col2#22] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>> scala> sql("SELECT col2._1 FROM (SELECT /+ REPARTITION(1) / col2 FROM t)").explain == Physical Plan == (2) Project [col2#22._1 AS _1#33L] +- Exchange RoundRobinPartitioning(1) +- (1) Project [col2#22] +- (1) FileScan parquet [col2#22] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint,_2:string>> ``` AFTER* ```scala scala> sql("SELECT col2._1 FROM (SELECT /+ REPARTITION(1) / col2 FROM t)").explain == Physical Plan == Exchange RoundRobinPartitioning(1) +- (1) Project [col2#5._1 AS _1#11L] +- (1) FileScan parquet [col2#5] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/p], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>> ``` This supercedes https://github.com/apache/spark/pull/23542 and https://github.com/apache/spark/pull/23873 . ## How was this patch tested? Pass the Jenkins with a newly added test suite. Closes #23964 from dongjoon-hyun/SPARK-26975-ALIAS. Lead-authored-by: Dongjoon Hyun <dhyun@apple.com> Co-authored-by: DB Tsai <d_tsai@apple.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-19 20:24:22 -07:00
Dongjoon Hyun	4d5247778a	[SPARK-27197][SQL][TEST] Add ReadNestedSchemaTest for file-based data sources ## What changes were proposed in this pull request? The reader schema is said to be evolved (or projected) when it changed after the data is written by writers. Apache Spark file-based data sources have a test coverage for that; e.g. [ReadSchemaSuite.scala](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaSuite.scala). This PR aims to add a test coverage for nested columns by adding and hiding nested columns. ## How was this patch tested? Pass the Jenkins with newly added tests. Closes #24139 from dongjoon-hyun/SPARK-27197. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-03-20 00:22:05 +00:00
s71955	e402de5fd0	[SPARK-26176][SQL] Verify column names for CTAS with `STORED AS` ## What changes were proposed in this pull request? Currently, users meet job abortions while creating a table using the Hive serde "STORED AS" with invalid column names. We had better prevent this by raising AnalysisException with a guide to use aliases instead like Paquet data source tables. thus making compatible with error message shown while creating Parquet/ORC native table. BEFORE ```scala scala> sql("set spark.sql.hive.convertMetastoreParquet=false") scala> sql("CREATE TABLE a STORED AS PARQUET AS SELECT 1 AS `COUNT(ID)`") Caused by: java.lang.IllegalArgumentException: No enum constant parquet.schema.OriginalType.col1 ``` AFTER ```scala scala> sql("CREATE TABLE a STORED AS PARQUET AS SELECT 1 AS `COUNT(ID)`") Please use alias to rename it.;eption: Attribute name "count(ID)" contains invalid character(s) among " ,;{}()\n\t=". ``` ## How was this patch tested? Pass the Jenkins with the newly added test case. Closes #24075 from sujith71955/master_serde. Authored-by: s71955 <sujithchacko.2010@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-19 20:29:47 +08:00
Takeshi Yamamuro	901c7408a4	[SPARK-27161][SQL][FOLLOWUP] Drops non-keywords from docs/sql-keywords.md ## What changes were proposed in this pull request? This pr is a follow-up of #24093 and includes fixes below; - Lists up all the keywords of Spark only (that is, drops non-keywords there); I listed up all the keywords of ANSI SQL-2011 in the previous commit (SPARK-26215). - Sorts the keywords in `SqlBase.g4` in a alphabetical order ## How was this patch tested? Pass Jenkins. Closes #24125 from maropu/SPARK-27161-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-19 20:18:40 +08:00
Gengliang Wang	28d35c8578	[SPARK-27162][SQL] Add new method asCaseSensitiveMap in CaseInsensitiveStringMap ## What changes were proposed in this pull request? Currently, DataFrameReader/DataFrameReader supports setting Hadoop configurations via method `.option()`. E.g, the following test case should be passed in both ORC V1 and V2 ``` class TestFileFilter extends PathFilter { override def accept(path: Path): Boolean = path.getParent.getName != "p=2" } withTempPath { dir => val path = dir.getCanonicalPath val df = spark.range(2) df.write.orc(path + "/p=1") df.write.orc(path + "/p=2") val extraOptions = Map( "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName, "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName ) assert(spark.read.options(extraOptions).orc(path).count() === 2) } } ``` While Hadoop Configurations are case sensitive, the current data source V2 APIs are using `CaseInsensitiveStringMap` in the top level entry `TableProvider`. To create Hadoop configurations correctly, I suggest 1. adding a new method `asCaseSensitiveMap` in `CaseInsensitiveStringMap`. 2. Make `CaseInsensitiveStringMap` read-only to ambiguous conversion in `asCaseSensitiveMap` ## How was this patch tested? Unit test Closes #24094 from gengliangwang/originalMap. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-19 13:35:47 +08:00
Dongjoon Hyun	26e9849cb4	[SPARK-27195][SQL][TEST] Add AvroReadSchemaSuite ## What changes were proposed in this pull request? The reader schema is said to be evolved (or projected) when it changed after the data is written by writers. Apache Spark file-based data sources have a test coverage for that, [ReadSchemaSuite.scala](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaSuite.scala). This PR aims to add `AvroReadSchemaSuite` to ensure the minimal consistency among file-based data sources and prevent a future regression in Avro data source. ## How was this patch tested? Pass the Jenkins with the newly added test suite. Closes #24135 from dongjoon-hyun/SPARK-27195. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-18 20:10:30 -07:00
Ryan Blue	e348f14259	[SPARK-26811][SQL] Add capabilities to v2.Table ## What changes were proposed in this pull request? This adds a new method, `capabilities` to `v2.Table` that returns a set of `TableCapability`. Capabilities are used to fail queries during analysis checks, `V2WriteSupportCheck`, when the table does not support operations, like truncation. ## How was this patch tested? Existing tests for regressions, added new analysis suite, `V2WriteSupportCheckSuite`, for new capability checks. Closes #24012 from rdblue/SPARK-26811-add-capabilities. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-18 18:25:11 +08:00
Jungtaek Lim (HeartSaVioR)	4adbcdc424	[SPARK-22000][SQL][FOLLOW-UP] Fix bad test to ensure it can test properly ## What changes were proposed in this pull request? There was some mistake on test code: it has wrong assertion. The patch proposes fixing it, as well as fixing other stuff to make test really pass. ## How was this patch tested? Fixed unit test. Closes #24112 from HeartSaVioR/SPARK-22000-hotfix. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-03-17 08:25:40 +09:00
Dilip Biswal	7a136f8670	[SPARK-27096][SQL][FOLLOWUP] Do the correct validation of join types in R side and fix join docs for scala, python and r ## What changes were proposed in this pull request? This is a minor follow-up PR for SPARK-27096. The original PR reconciled the join types supported between dataset and sql interface. In case of R, we do the join type validation in the R side. In this PR we do the correct validation and adds tests in R to test all the join types along with the error condition. Along with this, i made the necessary doc correction. ## How was this patch tested? Add R tests. Closes #24087 from dilipbiswal/joinfix_followup. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-16 13:04:54 +09:00
Zhu, Lipeng	8ee09f26d5	[SPARK-27159][SQL] update mssql server dialect to support binary type ## What changes were proposed in this pull request? Change the binary type mapping from default blob to varbinary(max) for mssql server. https://docs.microsoft.com/en-us/sql/t-sql/data-types/binary-and-varbinary-transact-sql?view=sql-server-2017 ![image](https://user-images.githubusercontent.com/698621/54351715-0e8c8780-468b-11e9-8931-7ecb85c5ad6b.png) ## How was this patch tested? Unit test. Closes #24091 from lipzhu/SPARK-27159. Authored-by: Zhu, Lipeng <lipzhu@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-15 20:21:59 -05:00
Gengliang Wang	2a37d6ed93	[SPARK-27132][SQL] Improve file source V2 framework ## What changes were proposed in this pull request? During the migration of CSV V2(https://github.com/apache/spark/pull/24005), I find that we can improve the file source v2 framework by: 1. check duplicated column names in both read and write 2. Not all the file sources support filter push down. So remove `SupportsPushDownFilters` from FileScanBuilder 3. The method `isSplitable` might require data source options. Add a new member `options` to FileScan. 4. Make `FileTable.schema` a lazy value instead of a method. ## How was this patch tested? Unit test Closes #24066 from gengliangwang/reviseFileSourceV2. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-15 11:58:03 +08:00
Dongjoon Hyun	74d2f04183	[SPARK-27166][SQL] Improve `printSchema` to print up to the given level ## What changes were proposed in this pull request? This PR aims to improve `printSchema` to be able to print up to the given level of the schema. ```scala scala> val df = Seq((1,(2,(3,4)))).toDF df: org.apache.spark.sql.DataFrame = [_1: int, _2: struct<_1: int, _2: struct<_1: int, _2: int>>] scala> df.printSchema root \|-- _1: integer (nullable = false) \|-- _2: struct (nullable = true) \| \|-- _1: integer (nullable = false) \| \|-- _2: struct (nullable = true) \| \| \|-- _1: integer (nullable = false) \| \| \|-- _2: integer (nullable = false) scala> df.printSchema(1) root \|-- _1: integer (nullable = false) \|-- _2: struct (nullable = true) scala> df.printSchema(2) root \|-- _1: integer (nullable = false) \|-- _2: struct (nullable = true) \| \|-- _1: integer (nullable = false) \| \|-- _2: struct (nullable = true) scala> df.printSchema(3) root \|-- _1: integer (nullable = false) \|-- _2: struct (nullable = true) \| \|-- _1: integer (nullable = false) \| \|-- _2: struct (nullable = true) \| \| \|-- _1: integer (nullable = false) \| \| \|-- _2: integer (nullable = false) ``` ## How was this patch tested? Pass the Jenkins with the newly added test case. Closes #24098 from dongjoon-hyun/SPARK-27166. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-14 20:27:55 -07:00
Gengliang Wang	6d22ee3969	[SPARK-27136][SQL] Remove data source option check_files_exist ## What changes were proposed in this pull request? The data source option check_files_exist is introduced in In #23383 when the file source V2 framework is implemented. In the PR, FileIndex was created as a member of FileTable, so that we could implement partition pruning like 0f9fcab in the future. At that time `FileIndex`es will always be created for file writes, so we needed the option to decide whether to check file existence. After https://github.com/apache/spark/pull/23774, the option is not needed anymore, since Dataframe writes won't create unnecessary FileIndex. This PR is to remove the option. ## How was this patch tested? Unit test. Closes #24069 from gengliangwang/removeOptionCheckFilesExist. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-15 10:19:26 +08:00
Dave DeCaprio	8819eaba4d	[SPARK-26917][SQL] Further reduce locks in CacheManager ## What changes were proposed in this pull request? Further load increases in our production environment have shown that even the read locks can cause some contention, since they contain a mechanism that turns a read lock into an exclusive lock if a writer has been starved out. This PR reduces the potential for lock contention even further than https://github.com/apache/spark/pull/23833. Additionally, it uses more idiomatic scala than the previous implementation. cloud-fan & gatorsmile This is a relatively minor improvement to the previous CacheManager changes. At this point, I think we finally are doing the minimum possible amount of locking. ## How was this patch tested? Has been tested on a live system where the blocking was causing major issues and it is working well. CacheManager has no explicit unit test but is used in many places internally as part of the SharedState. Closes #24028 from DaveDeCaprio/read-locks-master. Lead-authored-by: Dave DeCaprio <daved@alum.mit.edu> Co-authored-by: David DeCaprio <daved@alum.mit.edu> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-15 10:13:34 +08:00
Shahid	8b5224097b	[SPARK-27145][MINOR] Close store in the SQLAppStatusListenerSuite after test ## What changes were proposed in this pull request? We create many stores in the SQLAppStatusListenerSuite, but we need to the close store after test. ## How was this patch tested? Existing tests Closes #24079 from shahidki31/SPARK-27145. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-14 13:08:41 -07:00
Takeshi Yamamuro	66c5cd2d9c	[SPARK-27151][SQL] ClearCacheCommand extends IgnoreCahedData to avoid plan node copys ## What changes were proposed in this pull request? In SPARK-27011, we introduced `IgnoreCahedData` to avoid plan node copys in `CacheManager`. Since `ClearCacheCommand` has no argument, it also can extend `IgnoreCahedData`. ## How was this patch tested? Pass Jenkins. Closes #24081 from maropu/SPARK-27011-2. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-14 11:36:16 -07:00
Takeshi Yamamuro	bacffb8810	[SPARK-23264][SQL] Make INTERVAL keyword optional in INTERVAL clauses when ANSI mode enabled ## What changes were proposed in this pull request? This pr updated parsing rules in `SqlBase.g4` to support a SQL query below when ANSI mode enabled; ``` SELECT CAST('2017-08-04' AS DATE) + 1 days; ``` The current master cannot parse it though, other dbms-like systems support the syntax (e.g., hive and mysql). Also, the syntax is frequently used in the official TPC-DS queries. This pr added new tokens as follows; ``` YEAR \| YEARS \| MONTH \| MONTHS \| WEEK \| WEEKS \| DAY \| DAYS \| HOUR \| HOURS \| MINUTE MINUTES \| SECOND \| SECONDS \| MILLISECOND \| MILLISECONDS \| MICROSECOND \| MICROSECONDS ``` Then, it registered the keywords below as the ANSI reserved (this follows SQL-2011); ``` DAY \| HOUR \| MINUTE \| MONTH \| SECOND \| YEAR ``` ## How was this patch tested? Added tests in `SQLQuerySuite`, `ExpressionParserSuite`, and `TableIdentifierParserSuite`. Closes #20433 from maropu/SPARK-23264. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-03-14 10:45:29 +09:00
Jungtaek Lim (HeartSaVioR)	733f2c0b98	[MINOR][SQL] Deduplicate huge if statements in get between specialized getters ## What changes were proposed in this pull request? This patch deduplicates the huge if statements regarding getting value between specialized getters. ## How was this patch tested? Existing UT. Closes #24016 from HeartSaVioR/MINOR-deduplicate-get-from-specialized-getters. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-13 15:52:21 -05:00
Dongjoon Hyun	3221bf4cd5	[SPARK-27034][SPARK-27123][SQL][FOLLOWUP] Update Nested Schema Pruning BM result with EC2 ## What changes were proposed in this pull request? This is a follow up PR for #23943 in order to update the benchmark result with EC2 `r3.xlarge` instance. ## How was this patch tested? N/A. (Manually compare the diff) Closes #24078 from dongjoon-hyun/SPARK-27034. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-03-13 20:27:10 +00:00
Wenchen Fan	2a80a4cd39	[SPARK-27106][SQL] merge CaseInsensitiveStringMap and DataSourceOptions ## What changes were proposed in this pull request? It's a little awkward to have 2 different classes(`CaseInsensitiveStringMap` and `DataSourceOptions`) to present the options in data source and catalog API. This PR merges these 2 classes, while keeping the name `CaseInsensitiveStringMap`, which is more precise. ## How was this patch tested? existing tests Closes #24025 from cloud-fan/option. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-14 01:23:27 +08:00
Dave DeCaprio	812ad55461	[SPARK-26103][SQL] Limit the length of debug strings for query plans ## What changes were proposed in this pull request? The PR puts in a limit on the size of a debug string generated for a tree node. Helps to fix out of memory errors when large plans have huge debug strings. In addition to SPARK-26103, this should also address SPARK-23904 and SPARK-25380. AN alternative solution was proposed in #23076, but that solution doesn't address all the cases that can cause a large query. This limit is only on calls treeString that don't pass a Writer, which makes it play nicely with #22429, #23018 and #23039. Full plans can be written to files, but truncated plans will be used when strings are held in memory, such as for the UI. - A new configuration parameter called spark.sql.debug.maxPlanLength was added to control the length of the plans. - When plans are truncated, "..." is printed to indicate that it isn't a full plan - A warning is printed out the first time a truncated plan is displayed. The warning explains what happened and how to adjust the limit. ## How was this patch tested? Unit tests were created for the new SizeLimitedWriter. Also a unit test for TreeNode was created that checks that a long plan is correctly truncated. Closes #23169 from DaveDeCaprio/text-plan-size. Lead-authored-by: Dave DeCaprio <daved@alum.mit.edu> Co-authored-by: David DeCaprio <daved@alum.mit.edu> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-13 09:58:43 -07:00
Wenchen Fan	d3813d8b21	[SPARK-27064][SS] create StreamingWrite at the beginning of streaming execution ## What changes were proposed in this pull request? According to the [design](https://docs.google.com/document/d/1vI26UEuDpVuOjWw4WPoH2T6y8WAekwtI7qoowhOFnI4/edit?usp=sharing), the life cycle of `StreamingWrite` should be the same as the read side `MicroBatch/ContinuousStream`, i.e. each run of the stream query, instead of each epoch. This PR fixes it. ## How was this patch tested? existing tests Closes #23981 from cloud-fan/dsv2. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-13 19:47:54 +08:00
Liang-Chi Hsieh	f55c760df6	[SPARK-27034][SQL][FOLLOWUP] Rename ParquetSchemaPruning to SchemaPruning ## What changes were proposed in this pull request? This is a followup to #23943. This proposes to rename ParquetSchemaPruning to SchemaPruning as ParquetSchemaPruning supports both Parquet and ORC v1 now. ## How was this patch tested? Existing tests. Closes #24077 from viirya/nested-schema-pruning-orc-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-13 20:12:01 +09:00
Ajith	e60d8fce0b	[SPARK-27045][SQL] SQL tab in UI shows actual SQL instead of callsite in case of SparkSQLDriver ## What changes were proposed in this pull request? When we run sql in spark via SparkSQLDriver (thrift server, spark-sql), SQL string is siet via ``setJobDescription``. the SparkUI SQL tab must show SQL instead of stacktrace in case ``setJobDescription`` is set which is more useful to end user. Instead it currently shows in description column the callsite shortform which is less useful ![image](https://user-images.githubusercontent.com/22072336/53734682-aaa7d900-3eaa-11e9-957b-0e5006db417e.png) ## How was this patch tested? Manually: ![image](https://user-images.githubusercontent.com/22072336/53734657-9f54ad80-3eaa-11e9-8dc5-2b38f6970f4e.png) Closes #23958 from ajithme/sqlui. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-12 16:14:29 -07:00
Liang-Chi Hsieh	b0c2b3bfd9	[SPARK-27034][SQL] Nested schema pruning for ORC ## What changes were proposed in this pull request? We only supported nested schema pruning for Parquet previously. This proposes to support nested schema pruning for ORC too. Note: This only covers ORC v1. For ORC v2, the necessary change is at the schema pruning rule. We should deal with ORC v2 as a TODO item, in order to reduce review burden. ## How was this patch tested? Added tests. Closes #23943 from viirya/nested-schema-pruning-orc. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-12 15:39:16 -07:00
shivusondur	4b6d39d85d	[SPARK-27090][CORE] Removing old LEGACY_DRIVER_IDENTIFIER ("<driver>") ## What changes were proposed in this pull request? LEGACY_DRIVER_IDENTIFIER and its reference are removed. corresponding references test are updated. ## How was this patch tested? tested UT test cases Closes #24026 from shivusondur/newjira2. Authored-by: shivusondur <shivusondur@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-12 13:29:39 -05:00
Shahid	1853db3186	[SPARK-27125][SQL][TEST] Add test suite for sql execution page ## What changes were proposed in this pull request? Added test suite for AllExecutionsPage class. Checked the scenarios for SPARK-27019 and SPARK-27075. ## How was this patch tested? Added UT, manually tested Closes #24052 from shahidki31/SPARK-27125. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-12 10:15:28 -05:00
Ajith	b8dd84b9e4	[SPARK-27011][SQL] reset command fails with cache ## What changes were proposed in this pull request? When cache is enabled ( i.e once cache table command is executed), any following sql will trigger CacheManager#lookupCachedData which will create a copy of the tree node, which inturn calls TreeNode#makeCopy. Here the problem is it will try to create a copy instance. But as ResetCommand is a case object this will fail ## How was this patch tested? Added UT to reproduce the issue Closes #23918 from ajithme/reset. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-12 11:02:09 +08:00
Hyukjin Kwon	3725b1324f	[SPARK-26923][SQL][R] Refactor ArrowRRunner and RRunner to share one BaseRRunner ## What changes were proposed in this pull request? This PR proposes to have one base R runner. In the high level, Previously, it had `ArrowRRunner` and it inherited `RRunner`: ``` └── RRunner └── ArrowRRunner ``` After this PR, now it has a `BaseRRunner`, and `ArrowRRunner` and `RRunner` inherit `BaseRRunner`: ``` └── BaseRRunner ├── ArrowRRunner └── RRunner ``` This way is consistent with Python's. In more details, see below: ```scala class BaseRRunner[IN, OUT] { def compute: Iterator[OUT] = { ... newWriterThread(...).start() ... newReaderIterator(...) ... } // Make a thread that writes data from JVM to R process abstract protected def newWriterThread(..., iter: Iterator[IN], ...): WriterThread // Make an iterator that reads data from the R process to JVM abstract protected def newReaderIterator(...): ReaderIterator abstract class WriterThread(..., iter: Iterator[IN], ...) extends Thread { override def run(): Unit { ... writeIteratorToStream(...) ... } // Actually writing logic to the socket stream. abstract protected def writeIteratorToStream(dataOut: DataOutputStream): Unit } abstract class ReaderIterator extends Iterator[OUT] { override def hasNext(): Boolean = { ... read(...) ... } override def next(): OUT = { ... hasNext() ... } // Actually reading logic from the socket stream. abstract protected def read(...): OUT } } ``` ```scala case [Arrow]RRunner extends BaseRRunner { override def newWriterThread(...) { new WriterThread(...) { override def writeIteratorToStream(...) { ... } } } override def newReaderIterator(...) { new ReaderIterator(...) { override def read(...) { ... } } } } ``` ## How was this patch tested? Manually tested and existing tests should cover. Closes #23977 from HyukjinKwon/SPARK-26923. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-12 08:45:29 +09:00
Jagadesh Kiran	d9978fb4e4	[SPARK-26860][PYSPARK][SPARKR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation The docs describing RangeBetween & RowsBetween for pySpark & SparkR are not in sync with Spark description. a. Edited PySpark and SparkR docs and made description same for both RangeBetween and RowsBetween b. created executable examples in both pySpark and SparkR documentation c. Locally tested the patch for scala Style checks and UT for checking no testcase failures Closes #23946 from jagadesh-kiran/master. Authored-by: Jagadesh Kiran <jagadesh.n@in.verizon.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-11 08:53:09 -05:00
Dilip Biswal	1b9fd67904	[SPARK-27096][SQL] Reconcile the join types between data frame and sql interface ## What changes were proposed in this pull request? Currently in the grammar file, we have the joinType rule defined as following : ``` joinType : INNER? .... .... \| LEFT SEMI \| LEFT? ANTI ; ``` The keyword LEFT is optional for ANTI join even though its not optional for SEMI join. When using data frame interface join type "anti" is not allowed. The allowed types are "left_anti" or "leftanti" for anti joins. ~~In this PR, i am making the LEFT keyword mandatory for ANTI joins so it aligns better with the LEFT SEMI join in the grammar file and also the join types allowed from dataframe api.~~ This PR makes LEFT optional for SEMI join in .g4 and add "semi" and "anti" join types from dataframe. ~~I have not opened any JIRA for this as probably we may need some discussion to see if we are going to address this or not.~~ ## How was this patch tested? Modified the join type tests. Closes #23982 from dilipbiswal/join_fix. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-11 14:02:21 +08:00
Takeshi Yamamuro	f0927d8ac4	[SPARK-27110][SQL] Moves some functions from AnalyzeColumnCommand to command/CommandUtils ## What changes were proposed in this pull request? To reuse some common logics for improving `Analyze` commands (See the description of `SPARK-25196` for details), this pr moved some functions from `AnalyzeColumnCommand` to `command/CommandUtils`. A follow-up pr will add code to extend `Analyze` commands for cached tables. ## How was this patch tested? Existing tests. Closes #22204 from maropu/SPARK-25196. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-10 15:17:46 -07:00
Shixiong Zhu	6e1c0827ec	[SPARK-27111][SS] Fix a race that a continuous query may fail with InterruptedException ## What changes were proposed in this pull request? Before a Kafka consumer gets assigned with partitions, its offset will contain 0 partitions. However, runContinuous will still run and launch a Spark job having 0 partitions. In this case, there is a race that epoch may interrupt the query execution thread after `lastExecution.toRdd`, and either `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next `runContinuous` will get interrupted unintentionally. To handle this case, this PR has the following changes: - Clean up the resources in `queryExecutionThread.runUninterruptibly`. This may increase the waiting time of `stop` but should be minor because the operations here are very fast (just sending an RPC message in the same process and stopping a very simple thread). - Clear the interrupted status at the end so that it won't impact the `runContinuous` call. We may clear the interrupted status set by `stop`, but it doesn't affect the query termination because `runActivatedStream` will check `state` and exit accordingly. I also updated the clean up codes to make sure exceptions thrown from `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` won't stop the clean up. ## How was this patch tested? Jenkins Closes #24034 from zsxwing/SPARK-27111. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2019-03-09 14:26:58 -08:00
Kris Mok	57ae251f75	[SPARK-27097] Avoid embedding platform-dependent offsets literally in whole-stage generated code ## What changes were proposed in this pull request? Spark SQL performs whole-stage code generation to speed up query execution. There are two steps to it: - Java source code is generated from the physical query plan on the driver. A single version of the source code is generated from a query plan, and sent to all executors. - It's compiled to bytecode on the driver to catch compilation errors before sending to executors, but currently only the generated source code gets sent to the executors. The bytecode compilation is for fail-fast only. - Executors receive the generated source code and compile to bytecode, then the query runs like a hand-written Java program. In this model, there's an implicit assumption about the driver and executors being run on similar platforms. Some code paths accidentally embedded platform-dependent object layout information into the generated code, such as: ```java Platform.putLong(buffer, /* offset / 24, / value */ 1); ``` This code expects a field to be at offset +24 of the `buffer` object, and sets a value to that field. But whole-stage code generation generally uses platform-dependent information from the driver. If the object layout is significantly different on the driver and executors, the generated code can be reading/writing to wrong offsets on the executors, causing all kinds of data corruption. One code pattern that leads to such problem is the use of `Platform.XXX` constants in generated code, e.g. `Platform.BYTE_ARRAY_OFFSET`. Bad: ```scala val baseOffset = Platform.BYTE_ARRAY_OFFSET // codegen template: s"Platform.putLong($buffer, $baseOffset, $value);" ``` This will embed the value of `Platform.BYTE_ARRAY_OFFSET` on the driver into the generated code. Good: ```scala val baseOffset = "Platform.BYTE_ARRAY_OFFSET" // codegen template: s"Platform.putLong($buffer, $baseOffset, $value);" ``` This will generate the offset symbolically -- `Platform.putLong(buffer, Platform.BYTE_ARRAY_OFFSET, value)`, which will be able to pick up the correct value on the executors. Caveat: these offset constants are declared as runtime-initialized `static final` in Java, so they're not compile-time constants from the Java language's perspective. It does lead to a slightly increased size of the generated code, but this is necessary for correctness. NOTE: there can be other patterns that generate platform-dependent code on the driver which is invalid on the executors. e.g. if the endianness is different between the driver and the executors, and if some generated code makes strong assumption about endianness, it would also be problematic. ## How was this patch tested? Added a new test suite `WholeStageCodegenSparkSubmitSuite`. This test suite needs to set the driver's extraJavaOptions to force the driver and executor use different Java object layouts, so it's run as an actual SparkSubmit job. Authored-by: Kris Mok <kris.mokdatabricks.com> Closes #24031 from gatorsmile/cherrypickSPARK-27097. Lead-authored-by: Kris Mok <kris.mok@databricks.com> Co-authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-03-09 01:20:32 +00:00
Ryan Blue	6170e40c15	[SPARK-24252][SQL] Add v2 catalog plugin system ## What changes were proposed in this pull request? This adds a v2 API for adding new catalog plugins to Spark. * Catalog implementations extend `CatalogPlugin` and are loaded via reflection, similar to data sources * `Catalogs` loads and initializes catalogs using configuration from a `SQLConf` * `CaseInsensitiveStringMap` is used to pass configuration to `CatalogPlugin` via `initialize` Catalogs are configured by adding config properties starting with `spark.sql.catalog.(name)`. The name property must specify a class that implements `CatalogPlugin`. Other properties under the namespace (`spark.sql.catalog.(name).(prop)`) are passed to the provider during initialization along with the catalog name. This replaces #21306, which will be implemented in two multiple parts: the catalog plugin system (this commit) and specific catalog APIs, like `TableCatalog`. ## How was this patch tested? Added test suites for `CaseInsensitiveStringMap` and for catalog loading. Closes #23915 from rdblue/SPARK-24252-add-v2-catalog-plugins. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-08 19:31:49 +08:00
Yuming Wang	2036074b99	[SPARK-26004][SQL] InMemoryTable support StartsWith predicate push down ## What changes were proposed in this pull request? [SPARK-24638](https://issues.apache.org/jira/browse/SPARK-24638) adds support for Parquet file `StartsWith` predicate push down. `InMemoryTable` can also support this feature. This is an example to explain how it works, Imagine that the `id` column stored as below: Partition ID \| lowerBound \| upperBound -- \| -- \| -- p1 \| '1' \| '9' p2 \| '10' \| '19' p3 \| '20' \| '29' p4 \| '30' \| '39' p5 \| '40' \| '49' A filter ```df.filter($"id".startsWith("2"))``` or ```id like '2%'``` then we substr lowerBound and upperBound: Partition ID \| lowerBound.substr(0, Length("2")) \| upperBound.substr(0, Length("2")) -- \| -- \| -- p1 \| '1' \| '9' p2 \| '1' \| '1' p3 \| '2' \| '2' p4 \| '3' \| '3' p5 \| '4' \| '4' We can see that we only need to read `p1` and `p3`. ## How was this patch tested? unit tests and benchmark tests benchmark test result: ``` ================================================================================================ Pushdown benchmark for StringStartsWith ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6 Intel(R) Core(TM) i7-7820HQ CPU 2.90GHz StringStartsWith filter: (value like '10%'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ InMemoryTable Vectorized 12068 / 14198 1.3 767.3 1.0X InMemoryTable Vectorized (Pushdown) 5457 / 8662 2.9 347.0 2.2X Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6 Intel(R) Core(TM) i7-7820HQ CPU 2.90GHz StringStartsWith filter: (value like '1000%'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ InMemoryTable Vectorized 5246 / 5355 3.0 333.5 1.0X InMemoryTable Vectorized (Pushdown) 2185 / 2346 7.2 138.9 2.4X Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6 Intel(R) Core(TM) i7-7820HQ CPU 2.90GHz StringStartsWith filter: (value like '786432%'): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ InMemoryTable Vectorized 5112 / 5312 3.1 325.0 1.0X InMemoryTable Vectorized (Pushdown) 2292 / 2522 6.9 145.7 2.2X ``` Closes #23004 from wangyum/SPARK-26004. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-08 19:18:32 +08:00
Sean Owen	5ebb4b5723	[SPARK-24783][SQL] spark.sql.shuffle.partitions=0 should throw exception ## What changes were proposed in this pull request? Throw an exception if spark.sql.shuffle.partitions=0 This takes over https://github.com/apache/spark/pull/23835 ## How was this patch tested? Existing tests. Closes #24008 from srowen/SPARK-24783.2. Lead-authored-by: Sean Owen <sean.owen@databricks.com> Co-authored-by: WindCanDie <491237260@qq.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-08 14:09:53 +09:00
Jungtaek Lim (HeartSaVioR)	d8f77e11a4	[SPARK-27001][SQL][FOLLOWUP] Address primitive array type for serializer ## What changes were proposed in this pull request? This is follow-up PR which addresses review comment in PR for SPARK-27001: https://github.com/apache/spark/pull/23908#discussion_r261511454 This patch proposes addressing primitive array type for serializer - instead of handling it to generic one, Spark now handles it efficiently as primitive array. ## How was this patch tested? UT modified to include primitive array. Closes #24015 from HeartSaVioR/SPARK-27001-FOLLOW-UP-java-primitive-array. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-08 11:54:04 +08:00
Shahid	713646ddc2	[SPARK-27075] Remove duplicate execution tag parameters from the url, when accessing the execution table in the SQL page ## What changes were proposed in this pull request? When we sort any columns in the execution table of the SQL page in the WEBUI, it throws IllegalArgumentException. The root cause is that, in the url, we are duplicating the execution tag parameters in the 'parameterPath'. Actually we should filter out the executionTag related entries while getting the 'parameterOtherTable' `e9e8bb33ef/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala (L161-L163)` `e9e8bb33ef/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala (L241)` `e9e8bb33ef/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala (L263-L266)` ## How was this patch tested? Manually tested Test steps: Sort any column in the sql page execution table Before fix: ![screenshot from 2019-03-07 01-38-17](https://user-images.githubusercontent.com/23054875/53913261-f0b69580-4080-11e9-88ea-f238b47a21d5.png) After fix: ![screenshot from 2019-03-07 02-01-40](https://user-images.githubusercontent.com/23054875/53913285-01670b80-4081-11e9-81b6-78cdbf5a0817.png) Closes #23994 from shahidki31/SPARK-27075. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-07 12:52:46 -08:00
Dilip Biswal	a0e26cffc5	[MINOR][SQL][TEST] Include usage example for generating output for single test in SQLQueryTestSuite ## What changes were proposed in this pull request? This is a very minor pr to include the usage example to generate output for single test in SQLQueryTestSuite. I tried to deduce it from the existing example and ran into a scenario where sbt is simply looping to run the same test over and over again. Here is the example of running a single test. ``` build/sbt "~sql/test-only SQLQueryTestSuite -- -z inline-table.sql" ``` I tried to generate the output for a single test by prepending `SPARK_GENERATE_GOLDEN_FILES=1` like following ``` SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "~sql/test-only SQLQueryTestSuite -- -z describe.sql" ``` In this case i found that sbt is looping trying to run describe.sql over and over again as we are running the test in on continuous mode (because of `~` prefix ) where it detects a change in the generated result file which in turn triggers a build and test. I have included an example where we don't run it in continuous mode when generating the output. Hopefully it saves other developers some time. ## How was this patch tested? Verified manually in my dev setup. Closes #23995 from dilipbiswal/dkb_sqlquerytest_usage. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-07 13:06:23 +09:00
Gengliang Wang	a543f917e0	[SPARK-27049][SQL] Create util class to support handling partition values in file source V2 ## What changes were proposed in this pull request? While I am migrating other data sources, I find that we should abstract the logic that: 1. converting safe `InternalRow`s into `UnsafeRow`s 2. appending partition values to the end of the result row if existed This PR proposes to support handling partition values in file source v2 abstraction by adding a util class `PartitionReaderWithPartitionValues`. ## How was this patch tested? Existing unit tests Closes #23987 from gengliangwang/SPARK-27049. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-07 11:24:15 +08:00
Maxim Gekk	9513d82edd	[SPARK-27057][SQL] Common trait for limit exec operators ## What changes were proposed in this pull request? I would like to refactor `limit.scala` slightly and introduce common trait `LimitExec` for `CollectLimitExec` and `BaseLimitExec` (`LocalLimitExec` and `GlobalLimitExec`). This will allow to distinguish those operators from others, and to get the `limit` value without casting to concrete class. ## How was this patch tested? by existing test suites. Closes #23976 from MaxGekk/limit-exec. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-07 08:47:52 +08:00
Shahid	62fd133f74	[SPARK-27019][SQL][WEBUI] onJobStart happens after onExecutionEnd shouldn't overwrite kvstore ## What changes were proposed in this pull request? Currently, when the event reordering happens, especially onJobStart event come after onExecutionEnd event, SQL page in the UI displays weirdly.(for eg:test mentioned in JIRA and also this issue randomly occurs when the TPCDS query fails due to broadcast timeout etc.) The reason is that, In the SQLAppstatusListener, we remove the liveExecutions entry once the execution ends. So, if a jobStart event come after that, then we create a new liveExecution entry corresponding to the execId. Eventually this will overwrite the kvstore and UI displays confusing entries. ## How was this patch tested? Added UT, Also manually tested with the eventLog, provided in the jira, of the failed query. Before fix: ![screenshot from 2019-03-03 03-05-52](https://user-images.githubusercontent.com/23054875/53687929-53e2b800-3d61-11e9-9dca-620fa41e605c.png) After fix: ![screenshot from 2019-03-03 02-40-18](https://user-images.githubusercontent.com/23054875/53687928-4f1e0400-3d61-11e9-86aa-584646ac68f9.png) Closes #23939 from shahidki31/SPARK-27019. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-06 14:02:30 -08:00
Udbhav30	9bddf7180e	[SPARK-24669][SQL] Invalidate tables in case of DROP DATABASE CASCADE ## What changes were proposed in this pull request? Before dropping database refresh the tables of that database, so as to refresh all cached entries associated with those tables. We follow the same when dropping a table. ## How was this patch tested? UT is added Closes #23905 from Udbhav30/SPARK-24669. Authored-by: Udbhav30 <u.agrawal30@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-06 09:06:10 -08:00
Maxim Gekk	9b55722161	[SPARK-27031][SQL] Avoid double formatting in timestampToString ## What changes were proposed in this pull request? Removed unnecessary conversion of microseconds in `DateTimeUtils.timestampToString` to `java.sql.Timestamp` which aims to output fraction of seconds by casting it to string. This was replaced by special `TimestampFormatter` which appends the fraction formatter to `DateTimeFormatterBuilder`: `appendFraction(ChronoField.NANO_OF_SECOND, 0, 9, true)`. The former one means trailing zeros in second's fraction should be truncated while formatting. ## How was this patch tested? By existing test suites like `CastSuite`, `DateTimeUtilsSuite`, `JDBCSuite`, and by new test in `TimestampFormatterSuite`. Closes #23936 from MaxGekk/timestamp-to-string. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-06 08:26:59 -06:00
Liang-Chi Hsieh	83857496e5	[SPARK-27043][SQL] Add ORC nested schema pruning benchmarks ## What changes were proposed in this pull request? We have benchmark of nested schema pruning, but only for Parquet. This adds similar benchmark for ORC. This is used with nested schema pruning of ORC. ## How was this patch tested? Added test. Closes #23955 from viirya/orc-nested-schema-pruning-benchmark. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-05 11:12:57 -08:00
Anton Okolnychyi	0c23a39384	[SPARK-26205][SQL] Optimize InSet Expression for bytes, shorts, ints, dates ## What changes were proposed in this pull request? This PR optimizes `InSet` expressions for byte, short, integer, date types. It is a follow-up on PR #21442 from dbtsai. `In` expressions are compiled into a sequence of if-else statements, which results in O$n$ time complexity. `InSet` is an optimized version of `In`, which is supposed to improve the performance if all values are literals and the number of elements is big enough. However, `InSet` actually worsens the performance in many cases due to various reasons. The main idea of this PR is to use Java `switch` statements to significantly improve the performance of `InSet` expressions for bytes, shorts, ints, dates. All `switch` statements are compiled into `tableswitch` and `lookupswitch` bytecode instructions. We will have O$1$ time complexity if our case values are compact and `tableswitch` can be used. Otherwise, `lookupswitch` will give us O$log n$. Locally, I tried Spark `OpenHashSet` and primitive collections from `fastutils` in order to solve the boxing issue in `InSet`. Both options significantly decreased the memory consumption and `fastutils` improved the time compared to `HashSet` from Scala. However, the switch-based approach was still more than two times faster even on 500+ non-compact elements. I also noticed that applying the switch-based approach on less than 10 elements gives a relatively minor improvement compared to the if-else approach. Therefore, I placed the switch-based logic into `InSet` and added a new config to track when it is applied. Even if we migrate to primitive collections at some point, the switch logic will be still faster unless the number of elements is really big. Another option is to have a separate `InSwitch` expression. However, this would mean we need to modify other places (e.g., `DataSourceStrategy`). See [here](https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10) and [here](https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch) for more information. This PR does not cover long values as Java `switch` statements cannot be used on them. However, we can have a follow-up PR with an approach similar to binary search. ## How was this patch tested? There are new tests that verify the logic of the proposed optimization. The performance was evaluated using existing benchmarks. This PR was also tested on an EC2 instance (OpenJDK 64-Bit Server VM 1.8.0_191-b12 on Linux 4.14.77-70.59.amzn1.x86_64, Intel(R) Xeon(R) CPU E5-2686 v4 2.30GHz). ## Notes - [This link](http://hg.openjdk.java.net/jdk8/jdk8/langtools/file/30db5e0aaf83/src/share/classes/com/sun/tools/javac/jvm/Gen.java#l1153) contains source code that decides between `tableswitch` and `lookupswitch`. The logic was re-used in the benchmarks. See the `isLookupSwitch` method. Closes #23171 from aokolnychyi/spark-26205. Lead-authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-04 15:40:04 -08:00
Sean Owen	0deebd3820	[SPARK-26016][DOCS] Clarify that text DataSource read/write, and RDD methods that read text, always use UTF-8 ## What changes were proposed in this pull request? Clarify that text DataSource read/write, and RDD methods that read text, always use UTF-8 as they use Hadoop's implementation underneath. I think these are all the places that this needs a mention in the user-facing docs. ## How was this patch tested? Doc tests. Closes #23962 from srowen/SPARK-26016. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-05 08:03:39 +09:00
Gengliang Wang	68fa601d62	[SPARK-27040][SQL] Avoid using unnecessary JoinRow in FileFormat ## What changes were proposed in this pull request? When reading files with empty partition columns, we can avoid using JoinRow. ## How was this patch tested? Existing unit tests. Closes #23953 from gengliangwang/avoidJoinRow. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-04 22:26:11 +08:00
Dilip Biswal	ad4823c99d	[SPARK-19712][SQL] Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc. ## What changes were proposed in this pull request? This PR adds support for pushing down LeftSemi and LeftAnti joins below operators such as Project, Aggregate, Window, Union etc. This is the initial piece of work that will be needed for the subsequent work of moving the subquery rewrites to the beginning of optimization phase. The larger PR is [here](https://github.com/apache/spark/pull/23211) . This PR addresses the comment at [link](https://github.com/apache/spark/pull/23211#issuecomment-445705922). ## How was this patch tested? Added a new test suite LeftSemiAntiJoinPushDownSuite. Closes #23750 from dilipbiswal/SPARK-19712-pushleftsemi. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-04 19:09:24 +08:00

1 2 3 4 5 ...

5640 commits