ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Liang-Chi Hsieh	4b725e50a7	[SPARK-27439][SQL] Explaining Dataset should show correct resolved plans ## What changes were proposed in this pull request? Because a view is resolved during analysis when we create a dataset, the content of the view is determined when the dataset is created, not when it is evaluated. Currently the explain result of a dataset is not consistent with its collected result, because we use the pre-analyzed logical plan of the dataset in the explain command. The explain command will analyze the logical plan passed in. So if a view is changed after the dataset was created, the plans shown by the explain command aren't the same as the plan of the dataset. ```scala scala> spark.range(10).createOrReplaceTempView("test") scala> spark.range(5).createOrReplaceTempView("test2") scala> spark.sql("select * from test").createOrReplaceTempView("tmp001") scala> val df = spark.sql("select * from tmp001") scala> spark.sql("select * from test2").createOrReplaceTempView("tmp001") scala> df.show +---+ \| id\| +---+ \| 0\| \| 1\| \| 2\| \| 3\| \| 4\| \| 5\| \| 6\| \| 7\| \| 8\| \| 9\| +---+ scala> df.explain(true) ``` Before: ```scala == Parsed Logical Plan == 'Project [] +- 'UnresolvedRelation `tmp001` == Analyzed Logical Plan == id: bigint Project [id#2L] +- SubqueryAlias `tmp001` +- Project [id#2L] +- SubqueryAlias `test2` +- Range (0, 5, step=1, splits=Some(12)) == Optimized Logical Plan == Range (0, 5, step=1, splits=Some(12)) == Physical Plan == (1) Range (0, 5, step=1, splits=12) ``` After: ```scala == Parsed Logical Plan == 'Project [] +- 'UnresolvedRelation `tmp001` == Analyzed Logical Plan == id: bigint Project [id#0L] +- SubqueryAlias `tmp001` +- Project [id#0L] +- SubqueryAlias `test` +- Range (0, 10, step=1, splits=Some(12)) == Optimized Logical Plan == Range (0, 10, step=1, splits=Some(12)) == Physical Plan == (1) Range (0, 10, step=1, splits=12) ``` To fix it, this passes query execution of Dataset when explaining it. 
The query execution contains pre-analyzed plan which is consistent with Dataset's result. ## How was this patch tested? Manually test and unit test. Closes #24464 from viirya/SPARK-27439-2. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-05 23:19:19 -07:00
Dilip Biswal	6001d476ce	[SPARK-27596][SQL] The JDBC 'query' option doesn't work for Oracle database ## What changes were proposed in this pull request? Description from JIRA: For the JDBC option `query`, we use an identifier name that starts with an underscore: s"(${subquery}) _SPARK_GEN_JDBC_SUBQUERY_NAME${curId.getAndIncrement()}". This is not supported by Oracle. Oracle doesn't seem to support identifier names that start with a non-alphabetic character (unless quoted) and has length restrictions as well. [link](https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements008.htm) In this PR, the generated alias name 'SPARK_GEN_JDBC_SUBQUERY_NAME<int value>' is fixed to remove the "_" prefix, and the alias name is also shortened so that it does not exceed the identifier length limit. ## How was this patch tested? Tests are added for MySQL, PostgreSQL, Oracle and DB2 to ensure enough coverage. Closes #24532 from dilipbiswal/SPARK-27596. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-05-05 21:52:23 -07:00
Liang-Chi Hsieh	d9bcacf94b	[SPARK-27629][PYSPARK] Prevent Unpickler from intervening each unpickling ## What changes were proposed in this pull request? In SPARK-27612, one correctness issue was reported. When protocol 4 is used to pickle Python objects, we found that unpickled objects were wrong. A temporary fix was proposed by not using the highest protocol. It was found that Opcodes.MEMOIZE appears in the opcodes in protocol 4, and it is suspected to be the cause of this issue. A deeper dive found that Opcodes.MEMOIZE stores objects into the internal map of the Unpickler object. We use a single Unpickler object to unpickle serialized Python bytes. If the map is not cleared, stored objects interfere with the next round of unpickling. We have two options: 1. Continue to reuse Unpickler, but call its close after each unpickling. 2. Do not reuse Unpickler, and create a new Unpickler object for each unpickling. This patch takes option 1. ## How was this patch tested? Passing the test added in SPARK-27612 (#24519). Closes #24521 from viirya/SPARK-27629. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-05-04 13:21:08 +09:00
Seth Fitzsimmons	5182aa25f0	[MINOR][DOCS] Correct date_trunc docs ## What changes were proposed in this pull request? `date_trunc` argument order was flipped, phrasing was awkward. ## How was this patch tested? Documentation-only. Closes #24522 from mojodna/patch-2. Authored-by: Seth Fitzsimmons <seth@mojodna.net> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-05-04 09:13:23 +09:00
sandeep katta	c66ec43945	[SPARK-27555][SQL] HiveSerDe should fall back to hadoopconf if hive.default.fileformat is not found in SQLConf ## What changes were proposed in this pull request? SQLConf does not load hive-site.xml. So HiveSerDe should fall back to hadoopconf if hive.default.fileformat is not found in SQLConf. ## How was this patch tested? Tested manually. Added UT. Closes #24489 from sandeep-katta/spark-27555. Authored-by: sandeep katta <sandeep.katta2007@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-05-04 09:02:12 +09:00
Marco Gaido	7a8cc8e071	[SPARK-27607][SQL] Improve Row.toString performance ## What changes were proposed in this pull request? `Row.toString` is currently causing the useless creation of an `Array` containing all the values in the row before generating the string containing it. This operation adds a considerable overhead. The PR proposes to avoid this operation in order to get a faster implementation. ## How was this patch tested? Run ```scala test("Row toString perf test") { val n = 100000 val rows = (1 to n).map { i => Row(i, i.toDouble, i.toString, i.toShort, true, null) } // warmup (1 to 10).foreach { _ => rows.foreach(_.toString) } val times = (1 to 100).map { _ => val t0 = System.nanoTime() rows.foreach(_.toString) val t1 = System.nanoTime() t1 - t0 } // scalastyle:off println println(s"Avg time on ${times.length} iterations for $n toString:" + s" ${times.sum.toDouble / times.length / 1e6} ms") // scalastyle:on println } ``` Before the PR: ``` Avg time on 100 iterations for 100000 toString: 61.08408419 ms ``` After the PR: ``` Avg time on 100 iterations for 100000 toString: 38.16539432 ms ``` This means the new implementation is about 1.60X faster than the original one. Closes #24505 from mgaido91/SPARK-27607. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-02 07:20:33 -07:00
Xiangrui Meng	618d6bff71	[SPARK-27588] Binary file data source fails fast and doesn't attempt to read very large files ## What changes were proposed in this pull request? If a file is too big (>2GB), we should fail fast and not try to read the file. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #24483 from mengxr/SPARK-27588. Authored-by: Xiangrui Meng <meng@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-04-29 16:24:49 -07:00
Gabor Somogyi	fb6b19ab7c	[SPARK-23014][SS] Fully remove V1 memory sink. ## What changes were proposed in this pull request? There is a MemorySink v2 already so v1 can be removed. In this PR I've removed it completely. What this PR contains: * V1 memory sink removal * V2 memory sink renamed to become the only implementation * Since DSv2 sends exceptions in a chained format (linking them with cause field) I've made python side compliant * Adapted all the tests ## How was this patch tested? Existing unit tests. Closes #24403 from gaborgsomogyi/SPARK-23014. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-04-29 09:44:23 -07:00
Sean Owen	a6716d3f03	[SPARK-27571][CORE][YARN][EXAMPLES] Avoid scala.language.reflectiveCalls ## What changes were proposed in this pull request? This PR avoids usage of reflective calls in Scala. It removes the import that suppresses the warnings and rewrites code in small ways to avoid accessing methods that aren't technically accessible. ## How was this patch tested? Existing tests. Closes #24463 from srowen/SPARK-27571. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-29 11:16:45 -05:00
Sean Owen	8a17d26784	[SPARK-27536][CORE][ML][SQL][STREAMING] Remove most use of scala.language.existentials ## What changes were proposed in this pull request? I want to get rid of as much use of `scala.language.existentials` as possible for 3.0. It's a complicated language feature that generates warnings unless this value is imported. It might even be on the way out of Scala: https://contributors.scala-lang.org/t/proposal-to-remove-existential-types-from-the-language/2785 For Spark, it comes up mostly where the code plays fast and loose with generic types, not the advanced situations you'll often see referenced where this feature is explained. For example, it comes up in cases where a function returns something like `(String, Class[_])`. Scala doesn't like matching this to any other instance of `(String, Class[_])` because doing so requires inferring the existence of some type that satisfies both. Seems obvious if the generic type is a wildcard, but, not technically something Scala likes to let you get away with. This is a large PR, and it only gets rid of _most_ instances of `scala.language.existentials`. The change should be all compile-time and shouldn't affect APIs or logic. Many of the changes simply touch up sloppiness about generic types, making the known correct value explicit in the code. Some fixes involve being more explicit about the existence of generic types in methods. For instance, `def foo(arg: Class[_])` seems innocent enough but should really be declared `def foo[T](arg: Class[T])` to let Scala select and fix a single type when evaluating calls to `foo`. For kind of surprising reasons, this comes up in places where code evaluates a tuple of things that involve a generic type, but is OK if the two parts of the tuple are evaluated separately. One key change was altering `Utils.classForName(...): Class[_]` to the more correct `Utils.classForName[T](...): Class[T]`. 
This caused a number of small but positive changes to callers that otherwise had to cast the result. In several tests, `Dataset[_]` was used where `DataFrame` seems to be the clear intent. Finally, in a few cases in MLlib, the return type `this.type` was used where there are no subclasses of the class that uses it. This really isn't needed and causes issues for Scala reasoning about the return type. These are just changed to be concrete classes as return types. After this change, we have only a few classes that still import `scala.language.existentials` (because modifying them would require extensive rewrites to fix) and no build warnings. ## How was this patch tested? Existing tests. Closes #24431 from srowen/SPARK-27536. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-29 11:02:01 -05:00
Liang-Chi Hsieh	76785cd6f0	[SPARK-27581][SQL] DataFrame countDistinct("*") shouldn't fail with AnalysisException ## What changes were proposed in this pull request? Currently `countDistinct("*")` doesn't work. An analysis exception is thrown: ```scala val df = sql("select id % 100 from range(100000)") df.select(countDistinct("*")).first() org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'count'; ``` Users need to use `expr`. ```scala df.select(expr("count(distinct(*))")).first() ``` This limits some API usage like `df.select(count("*"), countDistinct("*"))`. The PR takes the simplest fix that lets analyzer expand star and resolve `count` function. ## How was this patch tested? Added unit test. Closes #24482 from viirya/SPARK-27581. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-29 21:17:32 +08:00
Yuming Wang	5a62295219	[SPARK-27580][HOT-FIX] Fix wrong import order in FileScan.scala ## What changes were proposed in this pull request? ``` ======================================================================== Running Scala style checks ======================================================================== [info] Checking Scala style using SBT with these profiles: -Phadoop-2.7 -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos Scalastyle checks failed at following occurrences: [error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala:29:0: org.apache.spark.sql.sources.Filter is in wrong order relative to org.apache.spark.sql.sources.v2.reader.. [error] Total time: 17 s, completed Apr 29, 2019 3:09:43 AM ``` https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/104987/console ## How was this patch tested? manual tests: ``` dev/scalastyle ``` Closes #24487 from wangyum/SPARK-27580. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-29 20:48:12 +08:00
Gengliang Wang	07d07fec03	[SPARK-27580][SQL] Implement `doCanonicalize` in BatchScanExec for comparing query plan results ## What changes were proposed in this pull request? The method `QueryPlan.sameResult` is used for comparing logical plans in order to: 1. cache data in CacheManager 2. uncache data in CacheManager 3. Reuse subqueries 4. etc... Currently the method `sameResult` always returns false for `BatchScanExec`. We should fix it by implementing `doCanonicalize` for the node. ## How was this patch tested? Unit test Closes #24475 from gengliangwang/sameResultForV2. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-29 17:54:12 +08:00
Xiangrui Meng	20a3ef7259	[SPARK-27534][SQL] Do not load `content` column in binary data source if it is not selected ## What changes were proposed in this pull request? A follow-up task from SPARK-25348. To save I/O cost, Spark shouldn't attempt to read the file if users didn't request the `content` column. For example: ``` spark.read.format("binaryFile").load(path).filter($"length" < 1000000).count() ``` ## How was this patch tested? Unit test added. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #24473 from WeichenXu123/SPARK-27534. Lead-authored-by: Xiangrui Meng <meng@databricks.com> Co-authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-04-28 07:57:03 -07:00
Jash Gala	90085a1847	[SPARK-23619][DOCS] Add output description for some generator expressions / functions ## What changes were proposed in this pull request? This PR addresses SPARK-23619: https://issues.apache.org/jira/browse/SPARK-23619 It adds additional comments indicating the default column names for the `explode` and `posexplode` functions in Spark-SQL. Functions for which comments have been updated so far: * stack * inline * explode * posexplode * explode_outer * posexplode_outer ## How was this patch tested? This is just a change in the comments. The package builds and tests successfully after the change. Closes #23748 from jashgala/SPARK-23619. Authored-by: Jash Gala <jashgala@amazon.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-27 10:30:12 +09:00
uncleGen	6328be78f9	[MINOR][TEST][DOC] Execute action miss name message ## What changes were proposed in this pull request? some minor updates: - `Execute` action miss `name` message - typo in SS document - typo in SQLConf ## How was this patch tested? N/A Closes #24466 from uncleGen/minor-fix. Authored-by: uncleGen <hustyugm@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-27 09:28:31 +08:00
Wenchen Fan	85fd552ed6	[SPARK-27190][SQL] add table capability for streaming ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/24012 , to add the corresponding capabilities for streaming. ## How was this patch tested? existing tests Closes #24129 from cloud-fan/capability. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-26 15:44:23 +08:00
Dongjoon Hyun	d5dbf053d3	Revert "[SPARK-27439][SQL] Use analyzed plan when explaining Dataset" This reverts commit `ad60c6d9be`.	2019-04-25 18:38:52 -07:00
Liang-Chi Hsieh	8b86326521	[SPARK-27551][SQL] Improve error message of mismatched types for CASE WHEN ## What changes were proposed in this pull request? When there are mismatched types among cases or else values in case when expression, current error message is hard to read to figure out what and where the mismatch is. This patch simply improves the error message for mismatched types for case when. Before: ```scala scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as "y")))) org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS BIGINT)) + CAST(123456789 AS BIGINT)))) ELSE array(named_struct('y', ((`id` * CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT)))) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;; ``` After: ```scala scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as "y")))) org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS BIGINT)) + CAST(123456789 AS BIGINT)))) ELSE array(named_struct('y', ((`id` * CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT)))) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type, got CASE WHEN ... THEN array<struct<x:bigint>> ELSE array<struct<y:bigint>> END;; ``` ## How was this patch tested? Added unit test. Closes #24453 from viirya/SPARK-27551. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-25 08:47:19 -07:00
gatorsmile	cd4a284030	[SPARK-27460][FOLLOW-UP][TESTS] Fix flaky tests ## What changes were proposed in this pull request? This patch makes several test flakiness fixes. ## How was this patch tested? N/A Closes #24434 from gatorsmile/fixFlakyTest. Lead-authored-by: gatorsmile <gatorsmile@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-24 17:36:29 +08:00
HyukjinKwon	a30983db57	[SPARK-27512][SQL] Avoid to replace ',' in CSV's decimal type inference for backward compatibility ## What changes were proposed in this pull request? The code below currently infers as decimal but previously it was inferred as string. In branch-2.4, type inference path for decimal and parsing data are different. `2a8343121e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala (L153)` `c284c4e1f6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala (L125)` So the code below: ```scala scala> spark.read.option("delimiter", "\|").option("inferSchema", "true").csv(Seq("1,2").toDS).printSchema() ``` produced string as its type. ``` root \|-- _c0: string (nullable = true) ``` In the current master, it now infers decimal as below: ``` root \|-- _c0: decimal(2,0) (nullable = true) ``` It happened after https://github.com/apache/spark/pull/22979 because, now after this PR, we only have one way to parse decimal: `7a83d71403/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala (L92)` After the fix: ``` root \|-- _c0: string (nullable = true) ``` This PR proposes to restore the previous behaviour back in `CSVInferSchema`. ## How was this patch tested? Manually tested and unit tests were added. Closes #24437 from HyukjinKwon/SPARK-27512. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-24 16:22:07 +09:00
Gengliang Wang	00f2f311f7	[SPARK-27128][SQL] Migrate JSON to File Data Source V2 ## What changes were proposed in this pull request? Migrate JSON to File Data Source V2 ## How was this patch tested? Unit test Closes #24058 from gengliangwang/jsonV2. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-23 22:39:59 +08:00
Maxim Gekk	93a264d05a	[SPARK-27535][SQL][TEST] Date and timestamp JSON benchmarks ## What changes were proposed in this pull request? Added new JSON benchmarks related to date and timestamps operations: - Write date/timestamp to JSON files - `to_json()` and `from_json()` for dates and timestamps - Read date/timestamps from JSON files, and infer schemas - Parse and infer schemas from `Dataset[String]` Also existing JSON benchmarks are ported on `NoOp` datasource. Closes #24430 from MaxGekk/json-datetime-benchmark. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-23 11:09:14 +09:00
Maxim Gekk	55f26d8090	[SPARK-27533][SQL][TEST] Date and timestamp CSV benchmarks ## What changes were proposed in this pull request? Added new CSV benchmarks related to date and timestamps operations: - Write date/timestamp to CSV files - `to_csv()` and `from_csv()` for dates and timestamps - Read date/timestamps from CSV files, and infer schemas - Parse and infer schemas from `Dataset[String]` Also existing CSV benchmarks are ported on `NoOp` datasource. Closes #24429 from MaxGekk/csv-timestamp-benchmark. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-23 11:08:02 +09:00
Maxim Gekk	43a73e387c	[SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default ## What changes were proposed in this pull request? In the PR, I propose to use the `TIMESTAMP_MICROS` logical type for timestamps written to parquet files. The type matches Catalyst's `TimestampType` semantically, and stores microseconds since epoch in UTC time zone. This avoids conversions of microseconds to nanoseconds and to Julian calendar. It also reduces the size of written parquet files. ## How was this patch tested? By existing test suites. Closes #24425 from MaxGekk/parquet-timestamp_micros. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-23 11:06:39 +09:00
Dilip Biswal	3240e52dc7	[SPARK-27531][SQL] Improve `EXPLAIN DESC TABLE` to show the input parameters of the command. ## What changes were proposed in this pull request? Currently "EXPLAIN DESC TABLE" is special cased and outputs a single row relation as following. Current output: ```sql spark-sql> EXPLAIN DESCRIBE TABLE t; == Physical Plan == *(1) Scan OneRowRelation[] ``` This is not consistent with how we handle explain processing for other commands. In this PR, the inconsistency is handled by removing the special handling for "describe table". After change: ```sql spark-sql> EXPLAIN DESC EXTENDED t == Physical Plan == Execute DescribeTableCommand +- DescribeTableCommand `t`, true ``` ## How was this patch tested? Added new tests in SQLQueryTestSuite. Closes #24427 from dilipbiswal/describe_table_explain2. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-22 13:02:10 -07:00
Maxim Gekk	79d3bc0409	[SPARK-27438][SQL] Parse strings with timestamps by to_timestamp() in microsecond precision ## What changes were proposed in this pull request? In the PR, I propose to parse strings to timestamps in microsecond precision by the ` to_timestamp()` function if the specified pattern contains a sub-pattern for seconds fractions. Closes #24342 ## How was this patch tested? By `DateFunctionsSuite` and `DateExpressionsSuite` Closes #24420 from MaxGekk/to_timestamp-microseconds3. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-22 19:41:32 +08:00
Maxim Gekk	777b797867	[SPARK-27522][SQL][TEST] Test migration from INT96 to TIMESTAMP_MICROS for timestamps in parquet ## What changes were proposed in this pull request? Added tests to check migration from `INT96` to `TIMESTAMP_MICROS` (`INT64`) for timestamps in parquet files. In particular: - Append `TIMESTAMP_MICROS` timestamps to existing parquet files with `INT96` timestamps - Append `TIMESTAMP_MICROS` timestamps to a table with `INT96` timestamps - Append `INT96` to `TIMESTAMP_MICROS` timestamps in parquet files - Append `INT96` to `TIMESTAMP_MICROS` timestamps in a table Closes #24417 from MaxGekk/parquet-timestamp-int64-tests. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-04-22 16:34:13 +09:00
Dilip Biswal	8a8643c28d	[SPARK-27480][SQL] Improve `EXPLAIN DESC QUERY` to show the input SQL statement Currently running explain on describe query gives a slightly confusing output. This is a minor PR that improves the output of explain. Before ``` 1. EXPLAIN DESCRIBE WITH s AS (SELECT 'hello' as col1) SELECT * FROM s; == Physical Plan == Execute DescribeQueryCommand +- DescribeQueryCommand CTE [s] 2. EXPLAIN EXTENDED DESCRIBE SELECT * from s1 where c1 > 0; == Physical Plan == Execute DescribeQueryCommand +- DescribeQueryCommand 'Project [] ``` After ``` 1. EXPLAIN DESCRIBE WITH s AS (SELECT 'hello' as col1) SELECT * FROM s; == Physical Plan == Execute DescribeQueryCommand +- DescribeQueryCommand WITH s AS (SELECT 'hello' as col1) SELECT * FROM s 2. EXPLAIN DESCRIBE SELECT * from s1 where c1 > 0; == Physical Plan == Execute DescribeQueryCommand +- DescribeQueryCommand SELECT * from s1 where c1 > 0 ``` Added a couple of tests in describe-query.sql under SQLQueryTestSuite. Closes #24385 from dilipbiswal/describe_query_explain. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-21 15:35:05 -07:00
WeichenXu	9793d9ec22	[SPARK-27473][SQL] Support filter push down for status fields in binary file data source ## What changes were proposed in this pull request? Support 4 kinds of filters: - LessThan - LessThanOrEqual - GreaterThan - GreaterThanOrEqual Support filters applied on 2 columns: - modificationTime - length Note: In order to support datasource filter push-down, I flattened the schema to be: ``` val schema = StructType( StructField("path", StringType, false) :: StructField("modificationTime", TimestampType, false) :: StructField("length", LongType, false) :: StructField("content", BinaryType, true) :: Nil) ``` ## How was this patch tested? To be added. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #24387 from WeichenXu123/binary_ds_filter. Lead-authored-by: WeichenXu <weichen.xu@databricks.com> Co-authored-by: Xiangrui Meng <meng@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-04-21 12:45:59 -07:00
Liang-Chi Hsieh	ad60c6d9be	[SPARK-27439][SQL] Use analyzed plan when explaining Dataset ## What changes were proposed in this pull request? Because a view is resolved during analysis when we create a dataset, the content of the view is determined when the dataset is created, not when it is evaluated. Currently the explain result of a dataset is not consistent with its collected result, because we use the pre-analyzed logical plan of the dataset in the explain command. The explain command will analyze the logical plan passed in. So if a view is changed after the dataset was created, the plans shown by the explain command aren't the same as the plan of the dataset. ```scala scala> spark.range(10).createOrReplaceTempView("test") scala> spark.range(5).createOrReplaceTempView("test2") scala> spark.sql("select * from test").createOrReplaceTempView("tmp001") scala> val df = spark.sql("select * from tmp001") scala> spark.sql("select * from test2").createOrReplaceTempView("tmp001") scala> df.show +---+ \| id\| +---+ \| 0\| \| 1\| \| 2\| \| 3\| \| 4\| \| 5\| \| 6\| \| 7\| \| 8\| \| 9\| +---+ scala> df.explain ``` Before: ```scala == Physical Plan == (1) Range (0, 5, step=1, splits=12) ``` After: ```scala == Physical Plan == (1) Range (0, 10, step=1, splits=12) ``` ## How was this patch tested? Manually test and unit test. Closes #24415 from viirya/SPARK-27439. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-21 10:25:56 -07:00
Gengliang Wang	31488e1ca5	[SPARK-27504][SQL] File source V2: support refreshing metadata cache ## What changes were proposed in this pull request? In file source V1, if some file is deleted manually, reading the DataFrame/Table throws an exception with a suggestion message: ``` It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. ``` After refreshing the table/DataFrame, the reads should return correct results. We should follow it in file source V2 as well. ## How was this patch tested? Unit test Closes #24401 from gengliangwang/refreshFileTable. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-19 18:26:03 +08:00
Liang-Chi Hsieh	9c41bfd83c	[SPARK-27502][SQL][TEST] Update nested schema benchmark result for Orc V2 ## What changes were proposed in this pull request? We added nested schema pruning support to Orc V2 recently. The benchmark result should be updated. The benchmark numbers are obtained by running benchmark on r3.xlarge machine. ## How was this patch tested? Test only change. Closes #24399 from viirya/update-orcv2-benchmark. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-18 08:08:22 -07:00
Gengliang Wang	9c238b8a46	[SPARK-27460][TESTS] Running slowest test suites in their own forked JVMs for higher parallelism ## What changes were proposed in this pull request? This patch modifies SparkBuild so that the largest / slowest test suites (or collections of suites) can run in their own forked JVMs, allowing them to be run in parallel with each other. This opt-in / whitelisting approach allows us to increase parallelism without having to fix a long-tail of flakiness / brittleness issues in tests which aren't performance bottlenecks. See comments in SparkBuild.scala for information on the details, including a summary of why we sometimes opt to run entire groups of tests in a single forked JVM . The time of full new pull request test in Jenkins is reduced by around 53%: before changes: 4hr 40min after changes: 2hr 13min ## How was this patch tested? Unit test Closes #24373 from gengliangwang/parallelTest. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-18 20:49:36 +08:00
Gengliang Wang	7d44ba05d1	[SPARK-27490][SQL] File source V2: return correct result for Dataset.inputFiles() ## What changes were proposed in this pull request? Currently, a `Dataset` with file source V2 always returns empty results for method `Dataset.inputFiles()`. We should fix it. ## How was this patch tested? Unit test Closes #24393 from gengliangwang/inputFiles. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-18 14:39:30 +08:00
Wenchen Fan	e6618de809	[SPARK-27430][SQL] broadcast hint should be respected for broadcast nested loop join ## What changes were proposed in this pull request? A followup of https://github.com/apache/spark/pull/24164 broadcast hint should be respected for broadcast nested loop join. This PR also refactors the related code a little bit, to save duplicated code. ## How was this patch tested? new tests Closes #24376 from cloud-fan/join. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-17 19:29:28 +08:00
WeichenXu	1bb0c8e407	[SPARK-25348][SQL] Data source for binary files ## What changes were proposed in this pull request? Implement binary file data source in Spark. Format name: "binaryFile" (case-insensitive) Schema: - content: BinaryType - status: StructType - path: StringType - modificationTime: TimestampType - length: LongType Options: * pathGlobFilter (instead of pathFilterRegex) to rely on GlobFilter behavior * maxBytesPerPartition is not implemented since it is controlled by two SQL confs: maxPartitionBytes and openCostInBytes. ## How was this patch tested? Unit test added. Closes #24354 from WeichenXu123/binary_file_datasource. Lead-authored-by: WeichenXu <weichen.xu@databricks.com> Co-authored-by: Xiangrui Meng <meng@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-04-16 15:41:32 -07:00
liwensun	26ed65f415	[SPARK-27453] Pass partitionBy as options in DataFrameWriter ## What changes were proposed in this pull request? Pass partitionBy columns as options and feature-flag this behavior. ## How was this patch tested? A new unit test. Closes #24365 from liwensun/partitionby. Authored-by: liwensun <liwen.sun@databricks.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2019-04-16 15:03:16 -07:00
Liang-Chi Hsieh	b404e02574	[SPARK-27476][SQL] Refactoring SchemaPruning rule to remove duplicate code ## What changes were proposed in this pull request? In SchemaPruning rule, there is duplicate code for data source v1 and v2. Their logic is the same and we can refactor the rule to remove duplicate code. ## How was this patch tested? Existing tests. Closes #24383 from viirya/SPARK-27476. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-16 14:50:37 -07:00
shivusondur	88d9de26dd	[SPARK-27464][CORE] Added Constant instead of referring string literal used from many places ## What changes were proposed in this pull request? Added Constant instead of referring the same String literal "spark.buffer.pageSize" from many places ## How was this patch tested? Run the corresponding Unit Test Cases manually. Closes #24368 from shivusondur/Constant. Authored-by: shivusondur <shivusondur@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-16 09:30:46 -05:00
Gengliang Wang	f9837d3bf6	[SPARK-27448][SQL] File source V2 table provider should be compatible with V1 provider ## What changes were proposed in this pull request? In the rule `PreprocessTableCreation`, if an existing table is appended with a different provider, the action will fail. Currently, there are two implementations for file sources and creating a table with file source V2 will always fall back to V1 FileFormat. We should consider the following cases as valid: 1. Appending a table with file source V2 provider using the v1 file format 2. Appending a table with v1 file format provider using file source V2 format ## How was this patch tested? Unit test Closes #24356 from gengliangwang/fixTableProvider. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-16 14:26:38 +08:00
Dilip Biswal	3ab96d7acf	[SPARK-27444][SQL][FOLLOWUP][MINOR][TEST] Add a test for describing multi select query. ## What changes were proposed in this pull request? This is a minor pr to add a test to describe a multi select query. ## How was this patch tested? Added a test in describe-query.sql Closes #24370 from dilipbiswal/describe-query-multiselect-test. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-15 21:26:45 +08:00
Gengliang Wang	27d625d785	[SPARK-27459][SQL] Revise the exception message of schema inference failure in file source V2 ## What changes were proposed in this pull request? Since https://github.com/apache/spark/pull/23383/files#diff-db4a140579c1ac4b1dbec7fe5057eecaR36, the exception message of schema inference failure in file source V2 is `tableName`, which is equivalent to `shortName + path`. While in file source V1, the message is `Unable to infer schema from ORC/CSV/JSON...`. We should make the message in V2 consistent with V1, so that in the future migration the related test cases don't need to be modified. https://github.com/apache/spark/pull/24058#pullrequestreview-226364350 ## How was this patch tested? Revert the modified unit test cases in https://github.com/apache/spark/pull/24005/files#diff-b9ddfbc9be8d83ecf100b3b8ff9610b9R431 and https://github.com/apache/spark/pull/23383/files#diff-9ab56940ee5a53f2bb81e3c008653362R577, and test with them. Closes #24369 from gengliangwang/reviseInferSchemaMessage. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-15 21:06:03 +08:00
herman	4704af4c26	[SPARK-27449] Move WholeStageCodegen.limitNotReachedCond class checks into separate methods. ## What changes were proposed in this pull request? This PR moves the checks done in `WholeStageCodegen.limitNotReachedCond` into a separate protected method. This makes it easier to introduce new leaf or blocking nodes. ## How was this patch tested? Existing tests. Closes #24358 from hvanhovell/SPARK-27449. Authored-by: herman <herman@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-14 15:54:20 +08:00
Gengliang Wang	4eb694c58f	[SPARK-27443][SQL] Support UDF input_file_name in file source V2 ## What changes were proposed in this pull request? Currently, if we select the UDF `input_file_name` as a column in file source V2, the results are empty. We should support it in file source V2. ## How was this patch tested? Unit test Closes #24347 from gengliangwang/input_file_name. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-12 20:30:42 +08:00
Dilip Biswal	5d8aee5886	[SPARK-27445][SQL][TEST] Update SQLQueryTestSuite to process files ending with `.sql` ## What changes were proposed in this pull request? When vi or vim is used to edit the test files, `.swp` or `.swo` files are created, and attempting to run the test suite in the presence of these files causes errors like the one below: ``` [info] - subquery/exists-subquery/.exists-basic.sql.swp * FAILED * (117 milliseconds) [info] java.io.FileNotFoundException: /Users/dbiswal/mygit/apache/spark/sql/core/target/scala-2.12/test-classes/sql-tests/results/subquery/exists-subquery/.exists-basic.sql.swp.out (No such file or directory) [info] at java.io.FileInputStream.open0(Native Method) [info] at java.io.FileInputStream.open(FileInputStream.java:195) [info] at java.io.FileInputStream.<init>(FileInputStream.java:138) [info] at org.apache.spark.sql.catalyst.util.package$.fileToString(package.scala:49) [info] at org.apache.spark.sql.SQLQueryTestSuite.runQueries(SQLQueryTestSuite.scala:247) [info] at org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runTest$11(SQLQueryTestSuite.scala:192) ``` ~~This minor pr adds these temp files in the ignore list.~~ While computing the list of test files to process, only consider files with the `.sql` extension. This makes sure that unwanted temp files created by various editors are excluded from processing. ## How was this patch tested? Verified manually. Closes #24333 from dilipbiswal/dkb_sqlquerytest. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-11 14:50:46 -07:00
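The extension check described in SPARK-27445 can be sketched in plain Scala. This is a minimal illustration only, not the code from `SQLQueryTestSuite`; the object and method names are hypothetical.

```scala
object SqlTestFiles {
  // Hypothetical helper (not Spark's code): a file is a test input
  // only if its name ends with the ".sql" extension.
  def isSqlTestFile(name: String): Boolean = name.endsWith(".sql")

  // Keep only the names that pass the extension check, so editor
  // swap files such as ".exists-basic.sql.swp" are never processed.
  def select(names: Seq[String]): Seq[String] = names.filter(isSqlTestFile)
}
```

Filtering on a positive condition (ends with `.sql`) avoids maintaining an ignore list for every editor's temp-file suffix, which is the change of approach the commit message describes.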
Sean Owen	4ec7f631aa	[SPARK-27404][CORE][SQL][STREAMING][YARN] Fix build warnings for 3.0: postfixOps edition ## What changes were proposed in this pull request? Fix build warnings -- see some details below. But mostly, remove use of postfix syntax where it causes warnings without the `scala.language.postfixOps` import. This is mostly in expressions like "120000 milliseconds". Which, I'd like to simplify to things like "2.minutes" anyway. ## How was this patch tested? Existing tests. Closes #24314 from srowen/SPARK-27404. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-11 13:43:44 -05:00
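As a small illustration of the syntax change in SPARK-27404, the sketch below uses only the Scala standard library's `scala.concurrent.duration` package. The object and value names are hypothetical and are not taken from the patch.

```scala
import scala.concurrent.duration._

object Timeouts {
  // The postfix form `120000 milliseconds` triggers a feature warning
  // unless `scala.language.postfixOps` is imported. The dotted method
  // call below expresses the same duration without that import.
  val viaMethodCall: FiniteDuration = 120000.milliseconds

  // The simplified spelling the commit message mentions as preferable.
  val simplified: FiniteDuration = 2.minutes
}
```

Both values denote the same length of time, so replacing one form with the other does not change behavior.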
maryannxue	43da473c1c	[SPARK-27225][SQL] Implement join strategy hints ## What changes were proposed in this pull request? This PR extends the existing BROADCAST join hint (for both broadcast-hash join and broadcast-nested-loop join) by implementing other join strategy hints corresponding to the rest of Spark's existing join strategies: shuffle-hash, sort-merge, cartesian-product. The hint names: SHUFFLE_MERGE, SHUFFLE_HASH, SHUFFLE_REPLICATE_NL are partly different from the code names in order to make them clearer to users and reflect the actual algorithms better. The hinted strategy will be used for the join with which it is associated if it is applicable/doable. Conflict resolving rules in case of multiple hints: 1. Conflicts within either side of the join: take the first strategy hint specified in the query, or the top hint node in Dataset. For example, in "select /*+ merge(t1) */ /*+ broadcast(t1) */ k1, v2 from t1 join t2 on t1.k1 = t2.k2", take "merge(t1)"; in ```df1.hint("merge").hint("shuffle_hash").join(df2)```, take "shuffle_hash". This is a general hint conflict resolving strategy, not specific to join strategy hint. 2. Conflicts between two sides of the join: a) In case of different strategy hints, hints are prioritized as ```BROADCAST``` over ```SHUFFLE_MERGE``` over ```SHUFFLE_HASH``` over ```SHUFFLE_REPLICATE_NL```. b) In case of same strategy hints but conflicts in build side, choose the build side based on join type and size. ## How was this patch tested? Added new UTs. Closes #24164 from maryannxue/join-hints. Lead-authored-by: maryannxue <maryannxue@apache.org> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-12 00:14:37 +08:00
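Rule 2(a) of SPARK-27225, the priority order between different strategy hints on the two sides of a join, can be modeled in a few lines of plain Scala. This is a simplified sketch under the assumption that each side carries at most one strategy hint. `JoinHintPriority` and `resolve` are hypothetical names, not part of Spark's planner, and applicability checks and build-side selection (rule 2(b)) are left out.

```scala
object JoinHintPriority {
  // Strategy hints from highest to lowest priority, in the order
  // stated in the commit message.
  val priority: Seq[String] =
    Seq("BROADCAST", "SHUFFLE_MERGE", "SHUFFLE_HASH", "SHUFFLE_REPLICATE_NL")

  // Hypothetical helper: given the optional hint on each side of a
  // join, return the hint that wins, or None if neither side is hinted.
  def resolve(left: Option[String], right: Option[String]): Option[String] = {
    val hinted = Seq(left, right).flatten
    // The first entry in priority order that appears on either side wins.
    priority.find(strategy => hinted.contains(strategy))
  }
}
```

For example, `JoinHintPriority.resolve(Some("SHUFFLE_HASH"), Some("BROADCAST"))` selects `BROADCAST`, because it ranks above `SHUFFLE_HASH` in the list.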
s71955	239082d966	[SPARK-27403][SQL] Fix `updateTableStats` to update table stats always with new stats or None ## What changes were proposed in this pull request? The system should update the table stats automatically if the user sets spark.sql.statistics.size.autoUpdate.enabled to true; currently this property has no effect whether it is enabled or disabled. This feature is similar to Hive's auto-gather feature, where statistics are automatically computed by default if the feature is enabled. Reference: https://cwiki.apache.org/confluence/display/Hive/StatsDev As part of the fix, the autoSizeUpdateEnabled validation is now done first, so that the system calculates the table size for the user automatically and records it in the metastore, as the user expects. ## How was this patch tested? UT is written and manually verified in cluster. Tested with unit tests + some internal tests on real cluster. Before fix: ![image](https://user-images.githubusercontent.com/12999161/55688682-cd8d4780-5998-11e9-85da-e1a4e34419f6.png) After fix: ![image](https://user-images.githubusercontent.com/12999161/55688654-7d15ea00-5998-11e9-973f-1f4cee27018f.png) Closes #24315 from sujith71955/master_autoupdate. Authored-by: s71955 <sujithchacko.2010@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-11 08:53:00 -07:00
Gengliang Wang	4177292dcd	[SPARK-27435][SQL] Support schema pruning in ORC V2 ## What changes were proposed in this pull request? Currently, the optimization rule `SchemaPruning` only works for Parquet/Orc V1. We should have the same optimization in ORC V2. ## How was this patch tested? Unit test Closes #24338 from gengliangwang/schemaPruningForV2. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-11 20:03:32 +08:00

5562 commits