ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Wenchen Fan	85fd552ed6	[SPARK-27190][SQL] add table capability for streaming ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/24012 , to add the corresponding capabilities for streaming. ## How was this patch tested? existing tests Closes #24129 from cloud-fan/capability. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-26 15:44:23 +08:00
Wenchen Fan	2234667b15	[SPARK-27563][SQL][TEST] automatically get the latest Spark versions in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? We can get the latest downloadable Spark versions from https://dist.apache.org/repos/dist/release/spark/ ## How was this patch tested? manually. Closes #24454 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-26 16:37:43 +09:00
uncleGen	d2656aaecd	[SPARK-27494][SS] Null values don't work in Kafka source v2 ## What changes were proposed in this pull request? Right now Kafka source v2 doesn't support null values. The issue is in org.apache.spark.sql.kafka010.KafkaRecordToUnsafeRowConverter.toUnsafeRow which doesn't handle null values. ## How was this patch tested? add new unit tests Closes #24441 from uncleGen/SPARK-27494. Authored-by: uncleGen <hustyugm@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-26 14:25:31 +08:00
Dongjoon Hyun	d5dbf053d3	Revert "[SPARK-27439][SQL] Use analyzed plan when explaining Dataset" This reverts commit `ad60c6d9be`.	2019-04-25 18:38:52 -07:00
Liang-Chi Hsieh	8b86326521	[SPARK-27551][SQL] Improve error message of mismatched types for CASE WHEN ## What changes were proposed in this pull request? When there are mismatched types among cases or else values in case when expression, current error message is hard to read to figure out what and where the mismatch is. This patch simply improves the error message for mismatched types for case when. Before: ```scala scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as "y")))) org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS BI GINT)) + CAST(123456789 AS BIGINT)))) ELSE array(named_struct('y', ((`id` * CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT)))) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;; ``` After: ```scala scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as "y")))) org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS BI GINT)) + CAST(123456789 AS BIGINT)))) ELSE array(named_struct('y', ((`id` * CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT)))) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type, got CASE WHEN ... THEN array<struct<x:bigint>> ELSE arr ay<struct<y:bigint>> END;; ``` ## How was this patch tested? Added unit test. Closes #24453 from viirya/SPARK-27551. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-25 08:47:19 -07:00
Yuming Wang	f82ed5e8e0	[MINOR][TEST] Remove out-dated hive version in run-tests.py ## What changes were proposed in this pull request? ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-3.2 -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos test:package streaming-kinesis-asl-assembly/assembly ``` `(w/Hive 1.2.1)` is incorrect when testing hadoop-3.2; it should be `(w/Hive 2.3.4)`. This PR removes `(w/Hive 1.2.1)` from run-tests.py. ## How was this patch tested? N/A Closes #24451 from wangyum/run-tests-invalid-info. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-24 21:22:15 -07:00
Rob Vesse	b1c6b60ce7	[SPARK-26729][K8S] Fix typo with default value for R image name ## What changes were proposed in this pull request? As discovered by users making use of this feature, there is a bug in the declaration of the `R_IMAGE_NAME` variable that causes the default name to not be properly set to `spark-r` but rather to just `-r` ## How was this patch tested? Verified that the image name for the R image is now appropriately populated in the integration test script via Bash debug output. NB - The fact that this wasn't spotted earlier highlights the fact that currently the K8S integration test suite does not have any tests for the R image as if it had this would have failed integration testing in the original PR #23846 Closes #24449 from rvesse/SPARK-26729. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-24 21:08:42 -07:00
Wenchen Fan	b7f9830670	[MINOR][TEST] switch from 2.4.1 to 2.4.2 in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? update `HiveExternalCatalogVersionsSuite` to test 2.4.2, as 2.4.1 will be removed from Mirror Network soon. ## How was this patch tested? N/A Closes #24452 from cloud-fan/release. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-25 10:26:40 +08:00
gatorsmile	cd4a284030	[SPARK-27460][FOLLOW-UP][TESTS] Fix flaky tests ## What changes were proposed in this pull request? This patch makes several test flakiness fixes. ## How was this patch tested? N/A Closes #24434 from gatorsmile/fixFlakyTest. Lead-authored-by: gatorsmile <gatorsmile@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-24 17:36:29 +08:00
HyukjinKwon	a30983db57	[SPARK-27512][SQL] Avoid to replace ',' in CSV's decimal type inference for backward compatibility ## What changes were proposed in this pull request? The code below currently infers as decimal but previously it was inferred as string. In branch-2.4, type inference path for decimal and parsing data are different. `2a8343121e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala (L153)` `c284c4e1f6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala (L125)` So the code below: ```scala scala> spark.read.option("delimiter", "\|").option("inferSchema", "true").csv(Seq("1,2").toDS).printSchema() ``` produced string as its type. ``` root \|-- _c0: string (nullable = true) ``` In the current master, it now infers decimal as below: ``` root \|-- _c0: decimal(2,0) (nullable = true) ``` It happened after https://github.com/apache/spark/pull/22979 because, now after this PR, we only have one way to parse decimal: `7a83d71403/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala (L92)` After the fix: ``` root \|-- _c0: string (nullable = true) ``` This PR proposes to restore the previous behaviour back in `CSVInferSchema`. ## How was this patch tested? Manually tested and unit tests were added. Closes #24437 from HyukjinKwon/SPARK-27512. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-24 16:22:07 +09:00
Sean Owen	596a5ff273	[MINOR][BUILD] Update genjavadoc to 0.13 ## What changes were proposed in this pull request? Kind of related to https://github.com/gatorsmile/spark/pull/5 - let's update genjavadoc to see if it generates fewer spurious javadoc errors to begin with. ## How was this patch tested? Existing docs build Closes #24443 from srowen/genjavadoc013. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-24 13:44:48 +09:00
Andrew-Crosby	5bf5d9d854	[SPARK-26970][PYTHON][ML] Add Spark ML interaction transformer to PySpark ## What changes were proposed in this pull request? Adds the Spark ML Interaction transformer to PySpark ## How was this patch tested? - Added Python doctest - Ran the newly added example code - Manually confirmed that a PipelineModel that contains an Interaction transformer can now be loaded in PySpark Closes #24426 from Andrew-Crosby/pyspark-interaction-transformer. Lead-authored-by: Andrew-Crosby <37139900+Andrew-Crosby@users.noreply.github.com> Co-authored-by: Andrew-Crosby <andrew.crosby@autotrader.co.uk> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-04-23 13:53:33 -07:00
Dongjoon Hyun	810be5dd20	[SPARK-27493][BUILD][FOLLOWUP] Upgrade ASM to 7.1 in plugins.sbt ## What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/24395. This PR updates `plugins.sbt`, too. ## How was this patch tested? Pass the Jenkins. Closes #24444 from dongjoon-hyun/SPARK-ASM71-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-04-23 18:18:02 +00:00
Gengliang Wang	00f2f311f7	[SPARK-27128][SQL] Migrate JSON to File Data Source V2 ## What changes were proposed in this pull request? Migrate JSON to File Data Source V2 ## How was this patch tested? Unit test Closes #24058 from gengliangwang/jsonV2. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-23 22:39:59 +08:00
uncleGen	ecfdffcb35	[SPARK-27503][DSTREAM] JobGenerator thread exit for some fatal errors but application keeps running ## What changes were proposed in this pull request? In some corner cases, the `JobGenerator` thread (like some other EventLoop threads) may exit on a fatal error such as OOM, while the Spark Streaming job keeps running without generating any batch jobs. Currently, we only report non-fatal errors. ``` override def run(): Unit = { try { while (!stopped.get) { val event = eventQueue.take() try { onReceive(event) } catch { case NonFatal(e) => try { onError(e) } catch { case NonFatal(e) => logError("Unexpected error in " + name, e) } } } } catch { case ie: InterruptedException => // exit even if eventQueue is not empty case NonFatal(e) => logError("Unexpected error in " + name, e) } } ``` In this PR, we double-check that the event thread is alive when posting an event. ## How was this patch tested? existing unit tests Closes #24400 from uncleGen/SPARK-27503. Authored-by: uncleGen <hustyugm@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-23 07:11:58 -07:00
Yuming Wang	7cc15af156	[SPARK-27481][BUILD] Upgrade commons-logging to 1.1.3 for hadoop-3.2 ## What changes were proposed in this pull request? hadoop-2.7 gets its `commons-logging` version from `hive-metastore`: ``` [INFO] +- org.spark-project.hive:hive-metastore:jar:1.2.1.spark2:compile [INFO] \| +- com.jolbox:bonecp:jar:0.8.0.RELEASE:compile [INFO] \| +- commons-cli:commons-cli:jar:1.2:compile [INFO] \| +- commons-logging:commons-logging:jar:1.1.3:compile ``` But Hive removed `commons-logging` in [HIVE-12237(Hive 2.0.0)](https://issues.apache.org/jira/browse/HIVE-12237), so hadoop-3.2 gets `commons-logging` from `commons-httpclient`: ``` [INFO] +- commons-httpclient:commons-httpclient:jar:3.1:compile [INFO] \| \- commons-logging:commons-logging:jar:1.0.4:compile ``` Thus, we may hit `LogConfigurationException`: ``` bin/spark-sql --conf spark.sql.hive.metastore.version=1.2.2 --conf spark.sql.hive.metastore.jars=file:///apache/hive-1.2.2-bin/lib/* ... Caused by: org.apache.commons.logging.LogConfigurationException: Invalid class loader hierarchy. You have more than one version of 'org.apache.commons.logging.Log' visible, which is not allowed. at org.apache.commons.logging.impl.LogFactoryImpl.getLogConstructor(LogFactoryImpl.java:385) ... 43 more ``` This PR upgrades `commons-logging` to 1.1.3 for hadoop-3.2 to fix this issue. ## How was this patch tested? manual tests Closes #24388 from wangyum/SPARK-27481. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-23 07:08:01 -07:00
pengbo	d9b2ce0f0f	[SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column containing null values ## What changes were proposed in this pull request? This PR is a follow-up of https://github.com/apache/spark/pull/24286. As gatorsmile pointed out, the estimation for a column containing null values is inaccurate as well. ``` > select key from test; 2 NULL 1 spark-sql> desc extended test key; col_name key data_type int comment NULL min 1 max 2 num_nulls 1 distinct_count 2 ``` The distinct count used for the estimate should be distinct_count + 1 when the column contains null values, because NULL forms a group of its own. ## How was this patch tested? Existing tests & new UT added. Closes #24436 from pengbo/aggregation_estimation. Authored-by: pengbo <bo.peng1019@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-22 20:30:08 -07:00
Maxim Gekk	93a264d05a	[SPARK-27535][SQL][TEST] Date and timestamp JSON benchmarks ## What changes were proposed in this pull request? Added new JSON benchmarks related to date and timestamps operations: - Write date/timestamp to JSON files - `to_json()` and `from_json()` for dates and timestamps - Read date/timestamps from JSON files, and infer schemas - Parse and infer schemas from `Dataset[String]` Also existing JSON benchmarks are ported on `NoOp` datasource. Closes #24430 from MaxGekk/json-datetime-benchmark. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-23 11:09:14 +09:00
Maxim Gekk	55f26d8090	[SPARK-27533][SQL][TEST] Date and timestamp CSV benchmarks ## What changes were proposed in this pull request? Added new CSV benchmarks related to date and timestamps operations: - Write date/timestamp to CSV files - `to_csv()` and `from_csv()` for dates and timestamps - Read date/timestamps from CSV files, and infer schemas - Parse and infer schemas from `Dataset[String]` Also existing CSV benchmarks are ported on `NoOp` datasource. Closes #24429 from MaxGekk/csv-timestamp-benchmark. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-23 11:08:02 +09:00
Maxim Gekk	43a73e387c	[SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default ## What changes were proposed in this pull request? In the PR, I propose to use the `TIMESTAMP_MICROS` logical type for timestamps written to parquet files. The type semantically matches Catalyst's `TimestampType`, and stores microseconds since epoch in the UTC time zone. This avoids converting microseconds to nanoseconds and to the Julian calendar, and it also reduces the size of written parquet files. ## How was this patch tested? By existing test suites. Closes #24425 from MaxGekk/parquet-timestamp_micros. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-23 11:06:39 +09:00
Dilip Biswal	3240e52dc7	[SPARK-27531][SQL] Improve `EXPLAIN DESC TABLE` to show the input parameters of the command. ## What changes were proposed in this pull request? Currently "EXPLAIN DESC TABLE" is special cased and outputs a single row relation as following. Current output: ```sql spark-sql> EXPLAIN DESCRIBE TABLE t; == Physical Plan == *(1) Scan OneRowRelation[] ``` This is not consistent with how we handle explain processing for other commands. In this PR, the inconsistency is handled by removing the special handling for "describe table". After change: ```sql spark-sql> EXPLAIN DESC EXTENDED t == Physical Plan == Execute DescribeTableCommand +- DescribeTableCommand `t`, true ``` ## How was this patch tested? Added new tests in SQLQueryTestSuite. Closes #24427 from dilipbiswal/describe_table_explain2. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-22 13:02:10 -07:00
Eric Liang	5172190da1	[SPARK-27392][SQL] TestHive test tables should be placed in shared test state, not per session ## What changes were proposed in this pull request? Otherwise, tests that use tables from multiple sessions will run into issues if they access the same table. The correct location is in shared state. A couple other minor test improvements. cc gatorsmile srinathshankar ## How was this patch tested? Existing unit tests. Closes #24302 from ericl/test-conflicts. Lead-authored-by: Eric Liang <ekl@databricks.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-22 11:05:31 -07:00
Maxim Gekk	79d3bc0409	[SPARK-27438][SQL] Parse strings with timestamps by to_timestamp() in microsecond precision ## What changes were proposed in this pull request? In the PR, I propose to parse strings to timestamps in microsecond precision by the `to_timestamp()` function if the specified pattern contains a sub-pattern for seconds fractions. Closes #24342 ## How was this patch tested? By `DateFunctionsSuite` and `DateExpressionsSuite` Closes #24420 from MaxGekk/to_timestamp-microseconds3. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-22 19:41:32 +08:00
Bryan Cutler	d36cce18e2	[SPARK-27276][PYTHON][SQL] Increase minimum version of pyarrow to 0.12.1 and remove prior workarounds ## What changes were proposed in this pull request? This increases the minimum support version of pyarrow to 0.12.1 and removes workarounds in pyspark to remain compatible with prior versions. This means that users will need to have at least pyarrow 0.12.1 installed and available in the cluster or an `ImportError` will be raised to indicate an upgrade is needed. ## How was this patch tested? Existing tests using: Python 2.7.15, pyarrow 0.12.1, pandas 0.24.2 Python 3.6.7, pyarrow 0.12.1, pandas 0.24.0 Closes #24298 from BryanCutler/arrow-bump-min-pyarrow-SPARK-27276. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-22 19:30:31 +09:00
Maxim Gekk	777b797867	[SPARK-27522][SQL][TEST] Test migration from INT96 to TIMESTAMP_MICROS for timestamps in parquet ## What changes were proposed in this pull request? Added tests to check migration from `INT96` to `TIMESTAMP_MICROS` (`INT64`) for timestamps in parquet files. In particular: - Append `TIMESTAMP_MICROS` timestamps to existing parquet files with `INT96` timestamps - Append `TIMESTAMP_MICROS` timestamps to a table with `INT96` timestamps - Append `INT96` to `TIMESTAMP_MICROS` timestamps in parquet files - Append `INT96` to `TIMESTAMP_MICROS` timestamps in a table Closes #24417 from MaxGekk/parquet-timestamp-int64-tests. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-22 16:34:13 +09:00
Sean Owen	d4a16f46f7	[SPARK-27419][FOLLOWUP][DOCS] Add note about spark.executor.heartbeatInterval change to migration guide ## What changes were proposed in this pull request? Add note about spark.executor.heartbeatInterval change to migration guide See also https://github.com/apache/spark/pull/24329 ## How was this patch tested? N/A Closes #24432 from srowen/SPARK-27419.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-22 12:02:16 +08:00
Shixiong Zhu	009059e3c2	[SPARK-27496][CORE] Fatal errors should also be sent back to the sender ## What changes were proposed in this pull request? When a fatal error (such as StackOverflowError) throws from "receiveAndReply", we should try our best to notify the sender. Otherwise, the sender will hang until timeout. In addition, when a MessageLoop is dying unexpectedly, it should resubmit a new one so that Dispatcher is still working. ## How was this patch tested? New unit tests. Closes #24396 from zsxwing/SPARK-27496. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-21 17:00:07 -07:00
Dilip Biswal	8a8643c28d	[SPARK-27480][SQL] Improve `EXPLAIN DESC QUERY` to show the input SQL statement Currently running explain on describe query gives a little confusing output. This is a minor pr that improves the output of explain. Before ``` 1.EXPLAIN DESCRIBE WITH s AS (SELECT 'hello' as col1) SELECT * FROM s; == Physical Plan == Execute DescribeQueryCommand +- DescribeQueryCommand CTE [s] 2.EXPLAIN EXTENDED DESCRIBE SELECT * from s1 where c1 > 0; == Physical Plan == Execute DescribeQueryCommand +- DescribeQueryCommand 'Project [] ``` After ``` 1. EXPLAIN DESCRIBE WITH s AS (SELECT 'hello' as col1) SELECT FROM s; == Physical Plan == Execute DescribeQueryCommand +- DescribeQueryCommand WITH s AS (SELECT 'hello' as col1) SELECT * FROM s 2. EXPLAIN DESCRIBE SELECT * from s1 where c1 > 0; == Physical Plan == Execute DescribeQueryCommand +- DescribeQueryCommand SELECT * from s1 where c1 > 0 ``` Added a couple of tests in describe-query.sql under SQLQueryTestSuite. Closes #24385 from dilipbiswal/describe_query_explain. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-21 15:35:05 -07:00
WeichenXu	9793d9ec22	[SPARK-27473][SQL] Support filter push down for status fields in binary file data source ## What changes were proposed in this pull request? Support 4 kinds of filters: - LessThan - LessThanOrEqual - GreaterThan - GreaterThanOrEqual Support filters applied on 2 columns: - modificationTime - length Note: In order to support datasource filter push-down, I flattened the schema to: ``` val schema = StructType( StructField("path", StringType, false) :: StructField("modificationTime", TimestampType, false) :: StructField("length", LongType, false) :: StructField("content", BinaryType, true) :: Nil) ``` ## How was this patch tested? To be added. Closes #24387 from WeichenXu123/binary_ds_filter. Lead-authored-by: WeichenXu <weichen.xu@databricks.com> Co-authored-by: Xiangrui Meng <meng@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-04-21 12:45:59 -07:00
Liang-Chi Hsieh	ad60c6d9be	[SPARK-27439][SQL] Use analyzed plan when explaining Dataset ## What changes were proposed in this pull request? Because a view is resolved during analysis when we create a dataset, the content of the view is determined when the dataset is created, not when it is evaluated. Currently the explain result of a dataset is not consistent with its collected result, because the explain command is given the pre-analyzed logical plan of the dataset and analyzes that plan again. So if a view is changed after the dataset was created, the plan shown by the explain command is not the same as the plan of the dataset. ```scala scala> spark.range(10).createOrReplaceTempView("test") scala> spark.range(5).createOrReplaceTempView("test2") scala> spark.sql("select * from test").createOrReplaceTempView("tmp001") scala> val df = spark.sql("select * from tmp001") scala> spark.sql("select * from test2").createOrReplaceTempView("tmp001") scala> df.show +---+ \| id\| +---+ \| 0\| \| 1\| \| 2\| \| 3\| \| 4\| \| 5\| \| 6\| \| 7\| \| 8\| \| 9\| +---+ scala> df.explain ``` Before: ```scala == Physical Plan == (1) Range (0, 5, step=1, splits=12) ``` After: ```scala == Physical Plan == (1) Range (0, 10, step=1, splits=12) ``` ## How was this patch tested? Manually test and unit test. Closes #24415 from viirya/SPARK-27439. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-21 10:25:56 -07:00
shivusondur	4cb1cd6ab7	[SPARK-27532][DOC] Correct the default value in the Documentation for "spark.redaction.regex" ## What changes were proposed in this pull request? Corrected the default value in the Documentation for "spark.redaction.regex" ## How was this patch tested? NA Closes #24428 from shivusondur/doc2. Authored-by: shivusondur <shivusondur@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-21 16:56:12 +09:00
Maxim Gekk	d61b3bc875	[SPARK-27527][SQL][DOCS] Improve descriptions of Timestamp and Date types ## What changes were proposed in this pull request? In the PR, I propose more precise description of `TimestampType` and `DateType`, how they store timestamps and dates internally. Closes #24424 from MaxGekk/timestamp-date-type-doc. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-21 16:53:11 +09:00
Yuming Wang	777b4502b2	[SPARK-27176][FOLLOW-UP][SQL] Upgrade Hive parquet to 1.10.1 for hadoop-3.2 ## What changes were proposed in this pull request? When we compile and test Hadoop 3.2, we hit the following two issues: 1. JobSummaryLevel is not a member of object org.apache.parquet.hadoop.ParquetOutputFormat. Fixed by [PARQUET-381](https://issues.apache.org/jira/browse/PARQUET-381)(Parquet 1.9.0) 2. java.lang.NoSuchFieldError: BROTLI at org.apache.parquet.hadoop.metadata.CompressionCodecName.<clinit>(CompressionCodecName.java:31). Fixed by [PARQUET-1143](https://issues.apache.org/jira/browse/PARQUET-1143)(Parquet 1.10.0) The reason is that the `parquet-hadoop-bundle-1.8.1.jar` conflicts with Parquet 1.10.1. I think it would be safe to upgrade Hive's parquet to 1.10.1 to work around this issue. This is what Hive did when upgrading Parquet 1.8.1 to 1.10.0: [HIVE-17000](https://issues.apache.org/jira/browse/HIVE-17000) and [HIVE-19464](https://issues.apache.org/jira/browse/HIVE-19464). We can see that all changes are related to vectors, and vectors are disabled by default: see [HIVE-14826](https://issues.apache.org/jira/browse/HIVE-14826) and [HiveConf.java#L2723](https://github.com/apache/hive/blob/rel/release-2.3.4/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2723). This PR removes [parquet-hadoop-bundle-1.8.1.jar](https://github.com/apache/parquet-mr/tree/master/parquet-hadoop-bundle), so Hive serde will use [parquet-common-1.10.1.jar, parquet-column-1.10.1.jar and parquet-hadoop-1.10.1.jar](https://github.com/apache/spark/blob/master/dev/deps/spark-deps-hadoop-3.2#L185-L189). ## How was this patch tested? 1. manual tests 2. [upgrade Hive Parquet to 1.10.1 and run Hadoop 3.2 test on jenkins](https://github.com/apache/spark/pull/24044#commits-pushed-0c3f962) Closes #24346 from wangyum/SPARK-27176. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-04-19 08:59:08 -07:00
Shahid	16bbe0f798	[SPARK-27486][CORE][TEST] Enable History server storage information test in the HistoryServerSuite ## What changes were proposed in this pull request? We have disabled a test related to storage in the History server suite after SPARK-13845. But, after SPARK-22050, we can store the information about block updated events to eventLog, if we enable "spark.eventLog.logBlockUpdates.enabled=true". So, we can enable the test, by adding an eventlog corresponding to the application, which has enabled the configuration, "spark.eventLog.logBlockUpdates.enabled=true" ## How was this patch tested? Existing UTs Closes #24390 from shahidki31/enableRddStorageTest. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-19 08:12:20 -07:00
Gengliang Wang	31488e1ca5	[SPARK-27504][SQL] File source V2: support refreshing metadata cache ## What changes were proposed in this pull request? In file source V1, if a file is deleted manually, reading the DataFrame/table throws an exception with the following suggestion: ``` It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. ``` After refreshing the table/DataFrame, the reads should return correct results. File source V2 should behave the same way. ## How was this patch tested? Unit test Closes #24401 from gengliangwang/refreshFileTable. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-19 18:26:03 +08:00
Yifei Huang	163a6e2982	[SPARK-27514] Skip collapsing windows with empty window expressions ## What changes were proposed in this pull request? A previous change moved the removal of empty window expressions to the RemoveNoopOperations rule, which comes after the CollapseWindow rule. Therefore, by the time we get to CollapseWindow, we aren't guaranteed that empty windows have been removed. This change checks that the window expressions are not empty, and only collapses the windows if both windows are non-empty. A lengthier description and repro steps here: https://issues.apache.org/jira/browse/SPARK-27514 ## How was this patch tested? A unit test, plus I reran the breaking case mentioned in the Jira ticket. Closes #24411 from yifeih/yh/spark-27514. Authored-by: Yifei Huang <yifeih@palantir.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-19 14:04:44 +08:00
Yuming Wang	8f82237a5b	[SPARK-27501][SQL][TEST] Add test for HIVE-13083: Writing HiveDecimal to ORC can wrongly suppress present stream ## What changes were proposed in this pull request? This PR adds a test for [HIVE-13083](https://issues.apache.org/jira/browse/HIVE-13083): Writing HiveDecimal to ORC can wrongly suppress present stream. ## How was this patch tested? manual tests: ``` build/sbt "hive/testOnly *HiveOrcQuerySuite" -Phive -Phadoop-3.2 ``` Closes #24397 from wangyum/SPARK-26437. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-19 10:12:21 +09:00
shane knapp	e1ece6a319	[SPARK-25079][PYTHON] update python3 executable to 3.6.x ## What changes were proposed in this pull request? have jenkins test against python3.6 (instead of 3.4). ## How was this patch tested? extensive testing on both the centos and ubuntu jenkins workers. NOTE: this will need to be backported to all active branches. Closes #24266 from shaneknapp/updating-python3-executable. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-19 10:03:50 +09:00
Gengliang Wang	3748b381df	[SPARK-27460][TESTS][FOLLOWUP] Add HiveClientVersions to parallel test suite list ## What changes were proposed in this pull request? The test time of `HiveClientVersions` is around 3.5 minutes. This PR is to add it into the parallel test suite list. To make sure there is no colliding warehouse location, we can change the warehouse path to a temporary directory. ## How was this patch tested? Unit test Closes #24404 from gengliangwang/parallelTestFollowUp. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-18 15:37:55 -07:00
Liang-Chi Hsieh	9c41bfd83c	[SPARK-27502][SQL][TEST] Update nested schema benchmark result for Orc V2 ## What changes were proposed in this pull request? We added nested schema pruning support to Orc V2 recently. The benchmark result should be updated. The benchmark numbers are obtained by running benchmark on r3.xlarge machine. ## How was this patch tested? Test only change. Closes #24399 from viirya/update-orcv2-benchmark. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-18 08:08:22 -07:00
Gengliang Wang	9c238b8a46	[SPARK-27460][TESTS] Running slowest test suites in their own forked JVMs for higher parallelism ## What changes were proposed in this pull request? This patch modifies SparkBuild so that the largest / slowest test suites (or collections of suites) can run in their own forked JVMs, allowing them to be run in parallel with each other. This opt-in / whitelisting approach allows us to increase parallelism without having to fix a long-tail of flakiness / brittleness issues in tests which aren't performance bottlenecks. See comments in SparkBuild.scala for information on the details, including a summary of why we sometimes opt to run entire groups of tests in a single forked JVM . The time of full new pull request test in Jenkins is reduced by around 53%: before changes: 4hr 40min after changes: 2hr 13min ## How was this patch tested? Unit test Closes #24373 from gengliangwang/parallelTest. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-18 20:49:36 +08:00
Gengliang Wang	7d44ba05d1	[SPARK-27490][SQL] File source V2: return correct result for Dataset.inputFiles() ## What changes were proposed in this pull request? Currently, a `Dataset` with file source V2 always returns empty results for the method `Dataset.inputFiles()`. We should fix it. ## How was this patch tested? Unit test Closes #24393 from gengliangwang/inputFiles. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-18 14:39:30 +08:00
Kris Mok	50bdc9befa	[SPARK-27423][SQL][FOLLOWUP] Minor polishes to Cast codegen templates for Date <-> Timestamp ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/24332 introduced an unnecessary `import` statement and two slight issues in the codegen templates in `Cast` for `Date` <-> `Timestamp`. This PR removes the unused import statement and fixes the slight codegen issue. The issue in those two codegen templates is this pattern: ```scala val zid = JavaCode.global( ctx.addReferenceObj("zoneId", zoneId, "java.time.ZoneId"), zoneId.getClass) ``` `zoneId` can refer to an instance of a non-public class, e.g. `java.time.ZoneRegion`, and while this code correctly puts in the 3rd argument to `ctx.addReferenceObj()`, it's still passing `zoneId.getClass` to `JavaCode.global()` which is not desirable, but doesn't cause any immediate bugs in this particular case, because `zid` is used in an expression immediately afterwards. If this `zid` ever needs to spill to any explicitly typed variables, e.g. a local variable, and if the spill handling uses the `javaType` on this `GlobalVariable`, it'd generate code that looks like: ```java java.time.ZoneRegion value1 = ((java.time.ZoneId) references[2] /* literal */); ``` which would then be a real bug: - a non-accessible type `java.time.ZoneRegion` is referenced in the generated code, and - `ZoneId` -> `ZoneRegion` requires an explicit downcast. ## How was this patch tested? Existing tests. This PR does not change behavior, and the original PR won't cause any real behavior bug to begin with. Closes #24392 from rednaxelafx/spark-27423-followup. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-18 14:27:33 +08:00
Dongjoon Hyun	f93460dae9	[SPARK-27493][BUILD] Upgrade ASM to 7.1 ## What changes were proposed in this pull request? [SPARK-25946](https://issues.apache.org/jira/browse/SPARK-25946) upgraded ASM to 7.0 to support JDK11. This PR aims to update ASM to 7.1 to bring the bug fixes. - https://asm.ow2.io/versions.html - https://issues.apache.org/jira/browse/XBEAN-316 ## How was this patch tested? Pass the Jenkins. Closes #24395 from dongjoon-hyun/SPARK-27493. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-18 13:36:52 +09:00
Dilip Biswal	e1c90d66bb	[SPARK-19712][SQL] Pushdown LeftSemi/LeftAnti below join ## What changes were proposed in this pull request? This PR adds support for pushing down LeftSemi and LeftAnti joins below the Join operator. This is prerequisite work that's needed for the subsequent task of moving the subquery rewrites to the beginning of the optimization phase. The larger PR is [here](https://github.com/apache/spark/pull/23211) . This PR addresses the comment at [link](https://github.com/apache/spark/pull/23211#issuecomment-445705922). ## How was this patch tested? Added tests under LeftSemiAntiJoinPushDownSuite. Closes #24331 from dilipbiswal/SPARK-19712-pushleftsemi-belowjoin. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-17 20:30:20 +08:00
Wenchen Fan	e6618de809	[SPARK-27430][SQL] broadcast hint should be respected for broadcast nested loop join ## What changes were proposed in this pull request? A followup of https://github.com/apache/spark/pull/24164: the broadcast hint should be respected for broadcast nested loop join. This PR also refactors the related code a little bit, to reduce duplicated code. ## How was this patch tested? new tests Closes #24376 from cloud-fan/join. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-17 19:29:28 +08:00
pengbo	54b0d1e0ef	[SPARK-27416][SQL] UnsafeMapData & UnsafeArrayData Kryo serialization … ## What changes were proposed in this pull request? Finish the remaining work of https://github.com/apache/spark/pull/24317, https://github.com/apache/spark/pull/9030 a. Implement Kryo serialization for UnsafeArrayData b. Fix the UnsafeMapData Java/Kryo serialization issue when two machines have different Oops sizes c. Move the duplicate code "getBytes()" to Utils. ## How was this patch tested? Corresponding unit tests have been added & tested Closes #24357 from pengbo/SPARK-27416_new. Authored-by: pengbo <bo.peng1019@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-17 13:03:00 +08:00
gatorsmile	61feb16352	[SPARK-27479][BUILD] Hide API docs for org.apache.spark.util.kvstore ## What changes were proposed in this pull request? The API docs should not include the "org.apache.spark.util.kvstore" package because they are internal private APIs. See the doc link: https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/kvstore/LevelDB.html ## How was this patch tested? N/A Closes #24386 from gatorsmile/rmDoc. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-04-16 19:53:01 -07:00
WeichenXu	1bb0c8e407	[SPARK-25348][SQL] Data source for binary files ## What changes were proposed in this pull request? Implement binary file data source in Spark. Format name: "binaryFile" (case-insensitive) Schema: - content: BinaryType - status: StructType - path: StringType - modificationTime: TimestampType - length: LongType Options: * pathGlobFilter (instead of pathFilterRegex) to rely on GlobFilter behavior * maxBytesPerPartition is not implemented since it is controlled by two SQL confs: maxPartitionBytes and openCostInBytes. ## How was this patch tested? Unit test added. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #24354 from WeichenXu123/binary_file_datasource. Lead-authored-by: WeichenXu <weichen.xu@databricks.com> Co-authored-by: Xiangrui Meng <meng@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-04-16 15:41:32 -07:00
liwensun	26ed65f415	[SPARK-27453] Pass partitionBy as options in DataFrameWriter ## What changes were proposed in this pull request? Pass partitionBy columns as options and feature-flag this behavior. ## How was this patch tested? A new unit test. Closes #24365 from liwensun/partitionby. Authored-by: liwensun <liwen.sun@databricks.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2019-04-16 15:03:16 -07:00

24477 commits