Commit graph

6953 commits

Author SHA1 Message Date
Sean Owen be2eca22e9 [SPARK-32398][TESTS][CORE][STREAMING][SQL][ML] Update to scalatest 3.2.0 for Scala 2.13.3+
### What changes were proposed in this pull request?

Updates to scalatest 3.2.0. Though the diff looks large, 99% of the changes are just updates to the new locations of the scalatest classes.

### Why are the changes needed?

3.2.0+ has a fix that is required for Scala 2.13.3+ compatibility.

### Does this PR introduce _any_ user-facing change?

No, only affects tests.

### How was this patch tested?

Existing tests.

Closes #29196 from srowen/SPARK-32398.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-23 16:20:17 -07:00
Terry Kim 35345e30e5 [SPARK-32374][SQL] Disallow setting properties when creating temporary views
### What changes were proposed in this pull request?

Currently, you can specify properties when creating a temporary view. However, the specified properties are not used and can be misleading.

This PR proposes to disallow specifying properties when creating temporary views.
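For illustration, a minimal sketch of the form that keeps working (no `TBLPROPERTIES`), assuming an active `SparkSession` named `spark`:

```scala
// Creating a temporary view without properties is unaffected by this change.
spark.sql("CREATE TEMPORARY VIEW tv AS SELECT 1 AS c1")
spark.sql("SELECT * FROM tv").show()
```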

### Why are the changes needed?

To avoid confusion, by disallowing properties that are never used.

### Does this PR introduce _any_ user-facing change?

Yes, now if you create a temporary view with properties, the operation will fail:
```
scala> sql("CREATE TEMPORARY VIEW tv TBLPROPERTIES('p1'='v1') AS SELECT 1 AS c1")
org.apache.spark.sql.catalyst.parser.ParseException:
Operation not allowed: CREATE TEMPORARY VIEW ... TBLPROPERTIES (property_name = property_value, ...)(line 1, pos 0)

== SQL ==
CREATE TEMPORARY VIEW tv TBLPROPERTIES('p1'='v1') AS SELECT 1 AS c1
^^^

```

### How was this patch tested?

Added tests

Closes #29167 from imback82/disable_properties_temp_view.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-23 14:32:10 +00:00
yi.wu a8e3de36e7 [SPARK-32280][SPARK-32372][SQL] ResolveReferences.dedupRight should only rewrite attributes for ancestor nodes of the conflict plan
### What changes were proposed in this pull request?

This PR refactors `ResolveReferences.dedupRight` to make sure it only rewrites attributes for ancestor nodes of the conflict plan.

### Why are the changes needed?

This is a bug fix.

```scala
sql("SELECT name, avg(age) as avg_age FROM person GROUP BY name")
  .createOrReplaceTempView("person_a")
sql("SELECT p1.name, p2.avg_age FROM person p1 JOIN person_a p2 ON p1.name = p2.name")
  .createOrReplaceTempView("person_b")
sql("SELECT * FROM person_a UNION SELECT * FROM person_b")
  .createOrReplaceTempView("person_c")
sql("SELECT p1.name, p2.avg_age FROM person_c p1 JOIN person_c p2 ON p1.name = p2.name").show()
```
When executing the above query, we'll hit the error:

```scala
[info]   Failed to analyze query: org.apache.spark.sql.AnalysisException: Resolved attribute(s) avg_age#231 missing from name#223,avg_age#218,id#232,age#234,name#233 in operator !Project [name#233, avg_age#231]. Attribute(s) with the same name appear in the operation: avg_age. Please check if the right attribute(s) are used.;;
...
```

The plan below is the problematic plan, which is the right child of a `Join` operator and contains plans that conflict with the left child. In this problematic plan, the first `Aggregate` operator (the one under the first child of `Union`) conflicts with the left side and has the rewrite attribute pair `avg_age#218` -> `avg_age#231`. With the current `dedupRight` logic, we first replace this `Aggregate` with a new one, and then rewrite the attribute `avg_age#218` from bottom to top. As you can see, projects with the attribute `avg_age#218` under the second child of the `Union` are also replaced with `avg_age#231` (that means we also rewrite attributes for plans that are not ancestors of the conflict plan). Ideally, the attribute `avg_age#218` in the second `Aggregate` operator (the one under the second child of `Union`) should also be replaced, but it isn't, because it is an `Alias` while we only rewrite `Attribute`s. Therefore, the project above the second `Aggregate` becomes unresolved.

```scala
:

:
+- SubqueryAlias p2
   +- SubqueryAlias person_c
      +- Distinct
         +- Union
            :- Project [name#233, avg_age#231]
            :  +- SubqueryAlias person_a
            :     +- Aggregate [name#233], [name#233, avg(cast(age#234 as bigint)) AS avg_age#231]
            :        +- SubqueryAlias person
            :           +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).id AS id#232, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).name, true, false) AS name#233, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).age AS age#234]
            :              +- ExternalRDD [obj#165]
            +- Project [name#233 AS name#227, avg_age#231 AS avg_age#228]
               +- Project [name#233, avg_age#231]
                  +- SubqueryAlias person_b
                     +- !Project [name#233, avg_age#231]
                        +- Join Inner, (name#233 = name#223)
                           :- SubqueryAlias p1
                           :  +- SubqueryAlias person
                           :     +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).id AS id#232, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).name, true, false) AS name#233, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).age AS age#234]
                           :        +- ExternalRDD [obj#165]
                           +- SubqueryAlias p2
                              +- SubqueryAlias person_a
                                 +- Aggregate [name#223], [name#223, avg(cast(age#224 as bigint)) AS avg_age#218]
                                    +- SubqueryAlias person
                                       +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).id AS id#222, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).name, true, false) AS name#223, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).age AS age#224]
                                          +- ExternalRDD [obj#165]
```

### Does this PR introduce _any_ user-facing change?

Yes, users would no longer hit the error after this fix.

### How was this patch tested?

Added test.

Closes #29166 from Ngone51/impr-dedup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-23 14:24:47 +00:00
Dongjoon Hyun aed8dbab1d [SPARK-32364][SQL][FOLLOWUP] Add toMap to return originalMap and documentation
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/29160. We already removed the non-determinism. This PR aims to do the following for the existing code base.
1. Add an explicit document to `DataFrameReader/DataFrameWriter`.

2. Add `toMap` to `CaseInsensitiveMap` in order to return `originalMap: Map[String, T]` because it's more consistent with the existing `case-sensitive key names` behavior for the existing code pattern like `AppendData.byName(..., extraOptions.toMap)`. Previously, it was `HashMap.toMap`.

3. During (2), we need to change the following to keep the original logic using `CaseInsensitiveMap.++`.
```scala
- val params = extraOptions.toMap ++ connectionProperties.asScala.toMap
+ val params = extraOptions ++ connectionProperties.asScala
```

4. Additionally, use `.toMap` in the following because `dsOptions.asCaseSensitiveMap()` is used later.
```scala
- val options = sessionOptions ++ extraOptions
+ val options = sessionOptions.filterKeys(!extraOptions.contains(_)) ++ extraOptions.toMap
  val dsOptions = new CaseInsensitiveStringMap(options.asJava)
```

### Why are the changes needed?

`extraOptions.toMap` is used in several places (e.g. `DataFrameReader`) to hand over a `Map[String, T]`. In this case, `CaseInsensitiveMap[T] private (val originalMap: Map[String, T])` should return `originalMap`.
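As a rough sketch (not Spark's actual `CaseInsensitiveMap` implementation), the idea is that `toMap` hands back the original, case-preserving map rather than a lowercased copy; the keys and values below are hypothetical:

```scala
// Simplified illustration only.
class CaseInsensitiveMap[T](val originalMap: Map[String, T]) {
  private val lowerCased = originalMap.map { case (k, v) => (k.toLowerCase, v) }
  def get(key: String): Option[T] = lowerCased.get(key.toLowerCase)
  def toMap: Map[String, T] = originalMap // case-sensitive key names are preserved
}

val extraOptions = new CaseInsensitiveMap(Map("PATH" -> "/tmp/a", "Header" -> "true"))
extraOptions.get("path") // Some("/tmp/a")
extraOptions.toMap       // Map("PATH" -> "/tmp/a", "Header" -> "true")
```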

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins or GitHub Action with the existing tests and a newly added test case in `JDBCSuite`.

Closes #29191 from dongjoon-hyun/SPARK-32364-3.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-23 06:28:08 -07:00
LantaoJin 182566bf57 [SPARK-32237][SQL] Resolve hint in CTE
### What changes were proposed in this pull request?
This PR moves the `Substitution` rule before the `Hints` rule in `Analyzer`, so that hints inside CTEs are not lost.

### Why are the changes needed?
The SQL below throws an AnalysisException in Spark 3.0, but works in Spark 2.x:
```sql
WITH cte AS (SELECT /*+ REPARTITION(3) */ T.id, T.data FROM $t1 T)
SELECT cte.id, cte.data FROM cte
```
```
Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`cte.id`' given input columns: [cte.data, cte.id]; line 3 pos 7;
'Project ['cte.id, 'cte.data]
+- SubqueryAlias cte
   +- Project [id#21L, data#22]
      +- SubqueryAlias T
         +- SubqueryAlias testcat.ns1.ns2.tbl
            +- RelationV2[id#21L, data#22] testcat.ns1.ns2.tbl

'Project ['cte.id, 'cte.data]
+- SubqueryAlias cte
   +- Project [id#21L, data#22]
      +- SubqueryAlias T
         +- SubqueryAlias testcat.ns1.ns2.tbl
            +- RelationV2[id#21L, data#22] testcat.ns1.ns2.tbl
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add a unit test

Closes #29062 from LantaoJin/SPARK-32237.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-23 03:10:45 +00:00
ulysses 184074de22 [SPARK-31999][SQL] Add REFRESH FUNCTION command
### What changes were proposed in this pull request?

In Hive mode, permanent functions are shared with the Hive metastore, so a function may be modified by other Hive clients. In a long-lived Spark application, it is hard to pick up such changes.

Here are 2 reasons:
* Spark caches the function in memory using `FunctionRegistry`.
* Users may not know the location or class name of a UDF when using `replace function`.

Note that we use the v2 command code path to add the new command.

### Why are the changes needed?

Give an easy way to keep the Spark function registry in sync with the Hive metastore.
Then we can call
```
refresh function functionName
```
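For example, a hedged usage sketch (the function name is hypothetical):

```scala
// Re-sync the cached definition of a permanent function with the Hive metastore.
spark.sql("REFRESH FUNCTION my_udf")
```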

### Does this PR introduce _any_ user-facing change?

Yes, new command.

### How was this patch tested?

New UT.

Closes #28840 from ulysses-you/SPARK-31999.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-22 19:05:50 +00:00
Dongjoon Hyun cd16a10475 [SPARK-32364][SQL] Use CaseInsensitiveMap for DataFrameReader/Writer options
### What changes were proposed in this pull request?

When a user has multiple options like `path`, `paTH`, and `PATH` for the same key `path`, `option/options` is non-deterministic because `extraOptions` is a `HashMap`. This PR aims to use `CaseInsensitiveMap` instead of `HashMap` to fix this bug fundamentally.

### Why are the changes needed?

As shown below, DataFrame's `option/options` have been non-deterministic in terms of case insensitivity, because the options are stored in `extraOptions`, which uses the `HashMap` class.

```scala
spark.read
  .option("paTh", "1")
  .option("PATH", "2")
  .option("Path", "3")
  .option("patH", "4")
  .load("5")
...
org.apache.spark.sql.AnalysisException:
Path does not exist: file:/.../1;
```

### Does this PR introduce _any_ user-facing change?

Yes. However, this is a bug fix for the non-deterministic cases.

### How was this patch tested?

Pass the Jenkins or GitHub Action with newly added test cases.

Closes #29160 from dongjoon-hyun/SPARK-32364.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-22 07:58:45 -07:00
Takeshi Yamamuro 04bf3511f1 [SPARK-21117][SQL][FOLLOWUP] Define prettyName for WidthBucket
### What changes were proposed in this pull request?

This PR is to define prettyName for `WidthBucket`.
This comes from the gatorsmile's suggestion: https://github.com/apache/spark/pull/28764#discussion_r457802957

### Why are the changes needed?

For a better name.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29183 from maropu/SPARK-21117-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-22 02:51:30 -07:00
Max Gekk feca9edbdd [MINOR][SQL][TESTS] Create tables once in JDBC tests
### What changes were proposed in this pull request?
In this PR, I propose to create the input tables once before executing tests in `JDBCSuite` and `JdbcRDDSuite`. Currently, the tables are created before every test in these suites.

### Why are the changes needed?
This speeds up the test suites by 30-40%.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Run the modified test suites

Closes #29176 from MaxGekk/jdbc-suite-before-all.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-22 08:32:01 +00:00
Cheng Su 39181ff209 [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable
### What changes were proposed in this pull request?
Based on a follow-up comment in https://github.com/apache/spark/pull/28123, we can coalesce buckets for shuffled hash join as well. Note that we only coalesce buckets on the shuffled hash join stream side (i.e. the side not building the hash map), so we don't need to worry about OOM from coalescing multiple buckets into the one task that builds the hash map.

> If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.

Refactor existing physical plan rule `CoalesceBucketsInSortMergeJoin` to `CoalesceBucketsInJoin`, for covering shuffled hash join as well.
Refactor existing unit test `CoalesceBucketsInSortMergeJoinSuite` to `CoalesceBucketsInJoinSuite`, for covering shuffled hash join as well.

### Why are the changes needed?
Avoiding the shuffle when joining differently bucketed tables is also useful for shuffled hash join. In production, we see users use shuffled hash join to join bucketed tables (setting `spark.sql.join.preferSortMergeJoin=false` to avoid the sort), and this can help avoid the shuffle when the numbers of buckets are not the same.
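A minimal sketch of the setup described above, assuming an active `SparkSession` named `spark` and using the config name from this description:

```scala
// Prefer shuffled hash join over sort merge join when joining bucketed tables.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
```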

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added unit tests in `CoalesceBucketsInJoinSuite` for verifying shuffled hash join physical plan.

### Performance number per request from maropu

I looked at TPCDS per maropu's suggestion, but found that most TPCDS queries do aggregation and only a few do joins, and none of the input tables are bucketed. So I tested a modified version of `TPCDS q93`:

```
SELECT ss_ticket_number, sr_ticket_number
FROM store_sales
JOIN store_returns
ON ss_ticket_number = sr_ticket_number
```

And made `store_sales` and `store_returns` bucketed tables.

Physical query plan without coalesce:

```
ShuffledHashJoin [ss_ticket_number#109L], [sr_ticket_number#120L], Inner, BuildLeft
:- Exchange hashpartitioning(ss_ticket_number#109L, 4), true, [id=#67]
:  +- *(1) Project [ss_ticket_number#109L]
:     +- *(1) Filter isnotnull(ss_ticket_number#109L)
:        +- *(1) ColumnarToRow
:           +- FileScan parquet default.store_sales[ss_ticket_number#109L] Batched: true, DataFilters: [isnotnull(ss_ticket_number#109L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/chengsu/spark/spark-warehouse/store_sales], PartitionFilters: [], PushedFilters: [IsNotNull(ss_ticket_number)], ReadSchema: struct<ss_ticket_number:bigint>, SelectedBucketsCount: 2 out of 2
+- *(2) Project [sr_returned_date_sk#111L, sr_return_time_sk#112L, sr_item_sk#113L, sr_customer_sk#114L, sr_cdemo_sk#115L, sr_hdemo_sk#116L, sr_addr_sk#117L, sr_store_sk#118L, sr_reason_sk#119L, sr_ticket_number#120L, sr_return_quantity#121L, sr_return_amt#122, sr_return_tax#123, sr_return_amt_inc_tax#124, sr_fee#125, sr_return_ship_cost#126, sr_refunded_cash#127, sr_reversed_charge#128, sr_store_credit#129, sr_net_loss#130]
   +- *(2) Filter isnotnull(sr_ticket_number#120L)
      +- *(2) ColumnarToRow
         +- FileScan parquet default.store_returns[sr_returned_date_sk#111L,sr_return_time_sk#112L,sr_item_sk#113L,sr_customer_sk#114L,sr_cdemo_sk#115L,sr_hdemo_sk#116L,sr_addr_sk#117L,sr_store_sk#118L,sr_reason_sk#119L,sr_ticket_number#120L,sr_return_quantity#121L,sr_return_amt#122,sr_return_tax#123,sr_return_amt_inc_tax#124,sr_fee#125,sr_return_ship_cost#126,sr_refunded_cash#127,sr_reversed_charge#128,sr_store_credit#129,sr_net_loss#130] Batched: true, DataFilters: [isnotnull(sr_ticket_number#120L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/chengsu/spark/spark-warehouse/store_returns], PartitionFilters: [], PushedFilters: [IsNotNull(sr_ticket_number)], ReadSchema: struct<sr_returned_date_sk:bigint,sr_return_time_sk:bigint,sr_item_sk:bigint,sr_customer_sk:bigin..., SelectedBucketsCount: 4 out of 4
```

Physical query plan with coalesce:

```
ShuffledHashJoin [ss_ticket_number#109L], [sr_ticket_number#120L], Inner, BuildLeft
:- *(1) Project [ss_ticket_number#109L]
:  +- *(1) Filter isnotnull(ss_ticket_number#109L)
:     +- *(1) ColumnarToRow
:        +- FileScan parquet default.store_sales[ss_ticket_number#109L] Batched: true, DataFilters: [isnotnull(ss_ticket_number#109L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/chengsu/spark/spark-warehouse/store_sales], PartitionFilters: [], PushedFilters: [IsNotNull(ss_ticket_number)], ReadSchema: struct<ss_ticket_number:bigint>, SelectedBucketsCount: 2 out of 2
+- *(2) Project [sr_returned_date_sk#111L, sr_return_time_sk#112L, sr_item_sk#113L, sr_customer_sk#114L, sr_cdemo_sk#115L, sr_hdemo_sk#116L, sr_addr_sk#117L, sr_store_sk#118L, sr_reason_sk#119L, sr_ticket_number#120L, sr_return_quantity#121L, sr_return_amt#122, sr_return_tax#123, sr_return_amt_inc_tax#124, sr_fee#125, sr_return_ship_cost#126, sr_refunded_cash#127, sr_reversed_charge#128, sr_store_credit#129, sr_net_loss#130]
   +- *(2) Filter isnotnull(sr_ticket_number#120L)
      +- *(2) ColumnarToRow
         +- FileScan parquet default.store_returns[sr_returned_date_sk#111L,sr_return_time_sk#112L,sr_item_sk#113L,sr_customer_sk#114L,sr_cdemo_sk#115L,sr_hdemo_sk#116L,sr_addr_sk#117L,sr_store_sk#118L,sr_reason_sk#119L,sr_ticket_number#120L,sr_return_quantity#121L,sr_return_amt#122,sr_return_tax#123,sr_return_amt_inc_tax#124,sr_fee#125,sr_return_ship_cost#126,sr_refunded_cash#127,sr_reversed_charge#128,sr_store_credit#129,sr_net_loss#130] Batched: true, DataFilters: [isnotnull(sr_ticket_number#120L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/chengsu/spark/spark-warehouse/store_returns], PartitionFilters: [], PushedFilters: [IsNotNull(sr_ticket_number)], ReadSchema: struct<sr_returned_date_sk:bigint,sr_return_time_sk:bigint,sr_item_sk:bigint,sr_customer_sk:bigin..., SelectedBucketsCount: 4 out of 4 (Coalesced to 2)
```

Run time improvement of ~50% in wall clock time:

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join coalesce bucket off              1541           1664         106          1.9         535.1       1.0X
shuffle hash join coalesce bucket on               1060           1169          81          2.7         368.1       1.5X
```

Closes #29079 from c21/split-bucket.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-07-22 08:56:26 +09:00
LantaoJin 8a1c24bb03 [SPARK-32362][SQL][TEST] AdaptiveQueryExecSuite misses verifying AE results
### What changes were proposed in this pull request?
Verify results for `AdaptiveQueryExecSuite`

### Why are the changes needed?
`AdaptiveQueryExecSuite` misses verifying AE results
```scala
QueryTest.sameRows(result.toSeq, df.collect().toSeq)
```
Even if the results are different, the test does not fail.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests.

Closes #29158 from LantaoJin/SPARK-32362.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-21 12:47:45 +09:00
gengjiaan 02114f96d6 [SPARK-32365][SQL] Add a boundary condition for negative index in regexp_extract
### What changes were proposed in this pull request?
The current implementation of regexp_extract throws an unhandled exception, shown below:
SELECT regexp_extract('1a 2b 14m', 'd+', -1)

```
java.lang.IndexOutOfBoundsException: No group -1
java.util.regex.Matcher.group(Matcher.java:538)
org.apache.spark.sql.catalyst.expressions.RegExpExtract.nullSafeEval(regexpExpressions.scala:455)
org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:704)
org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:52)
org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:45)
```
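For reference, a hedged sketch of a valid call (group 0 denotes the whole match); after this change, a negative group index should be rejected up front instead of surfacing the exception above:

```scala
// Group index 0 returns the entire first match, here "1".
spark.sql("""SELECT regexp_extract('1a 2b 14m', '\\d+', 0)""").show()
```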

### Why are the changes needed?
Fix a bug `java.lang.IndexOutOfBoundsException: No group -1`

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
new UT

Closes #29161 from beliefer/regexp_extract-group-not-allow-less-than-zero.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-20 20:34:51 -07:00
Nik Vanderhoof 7d65caebec [SPARK-32338][SQL] Overload slice to accept Column for start and length
### What changes were proposed in this pull request?

Add an overload for the `slice` function that can accept Columns for the `start` and `length` parameters.

### Why are the changes needed?

This will allow users to take slices of arrays based on the length of the arrays, or via data in other columns.
```scala
df.select(slice(x, 4, size(x) - 4))
```

### Does this PR introduce _any_ user-facing change?

Yes. Before, the `slice` method would only accept Ints for the start and length parameters; now we can pass in Columns and/or Ints.
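A short sketch of the new overload, assuming an active `SparkSession` named `spark`; the column name `x` is just an example:

```scala
import org.apache.spark.sql.functions.{col, lit, size, slice}
import spark.implicits._

val df = Seq(Seq(1, 2, 3, 4, 5)).toDF("x")
// Drop the first element: start and length are both Columns here.
df.select(slice(col("x"), lit(2), size(col("x")) - 1)).show()
```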

### How was this patch tested?

I've extended the existing tests for slice to use combinations of Columns and Ints.

Closes #29138 from nvander1/SPARK-32338.

Authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2020-07-20 17:48:07 -07:00
HyukjinKwon 133c5edc80 [SPARK-32368][SQL] pathGlobFilter, recursiveFileLookup and basePath should respect case insensitivity
### What changes were proposed in this pull request?

This PR proposes to make the datasource options at `PartitioningAwareFileIndex` respect case insensitivity consistently:
- `pathGlobFilter`
- `recursiveFileLookup`
- `basePath`

### Why are the changes needed?

To support consistent case insensitivity in datasource options.

### Does this PR introduce _any_ user-facing change?

Yes, now users can also use case insensitive options such as `PathglobFilter`.
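For example, a hedged sketch (the path is hypothetical):

```scala
// Option keys are matched case-insensitively after this change.
spark.read
  .option("PathglobFilter", "*.json")
  .option("recursiveFileLookup", "true")
  .json("/tmp/data")
```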

### How was this patch tested?

Unit tests were added. They reuse existing tests and add extra clues to make it easier to track down when a test breaks.

Closes #29165 from HyukjinKwon/SPARK-32368.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-20 13:56:00 -07:00
Cheng Su fe07521c9e [SPARK-32330][SQL] Preserve shuffled hash join build side partitioning
### What changes were proposed in this pull request?

Currently `ShuffledHashJoin.outputPartitioning` inherits from `HashJoin.outputPartitioning`, which only preserves stream side partitioning (`HashJoin.scala`):

```
override def outputPartitioning: Partitioning = streamedPlan.outputPartitioning
```

This loses the build side partitioning information and causes an extra shuffle if there's another join / group-by after this join.

Example:

```
withSQLConf(
    SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50",
    SQLConf.SHUFFLE_PARTITIONS.key -> "2",
    SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
  val df1 = spark.range(10).select($"id".as("k1"))
  val df2 = spark.range(30).select($"id".as("k2"))
  Seq("inner", "cross").foreach(joinType => {
    val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count()
      .queryExecution.executedPlan
    assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1)
    // No extra shuffle before aggregate
    assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2)
  })
}
```

Current physical plan (having an extra shuffle on `k1` before aggregate)

```
*(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, count#235L])
+- Exchange hashpartitioning(k1#220L, 2), true, [id=#117]
   +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], output=[k1#220L, count#239L])
      +- *(3) Project [k1#220L]
         +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
            :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109]
            :  +- *(1) Project [id#218L AS k1#220L]
            :     +- *(1) Range (0, 10, step=1, splits=2)
            +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111]
               +- *(2) Project [id#222L AS k2#224L]
                  +- *(2) Range (0, 30, step=1, splits=2)
```

Ideal physical plan (no shuffle on `k1` before aggregate)

```
*(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, count#235L])
+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], output=[k1#220L, count#239L])
   +- *(3) Project [k1#220L]
      +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
         :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107]
         :  +- *(1) Project [id#218L AS k1#220L]
         :     +- *(1) Range (0, 10, step=1, splits=2)
         +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109]
            +- *(2) Project [id#222L AS k2#224L]
               +- *(2) Range (0, 30, step=1, splits=2)
```

This can be fixed by overriding `outputPartitioning` method in `ShuffledHashJoinExec`, similar to `SortMergeJoinExec`.
In addition, also fix one typo in `HashJoin`, as that code path is shared between broadcast hash join and shuffled hash join.

### Why are the changes needed?

To avoid shuffles (for queries with multiple joins or group-bys), saving CPU and IO.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `JoinSuite`.

Closes #29130 from c21/shj.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-20 14:38:43 +00:00
Terry Kim e0ecb66f53 [SPARK-31869][SQL] BroadcastHashJoinExec can utilize the build side for its output partitioning
### What changes were proposed in this pull request?

Currently, the `BroadcastHashJoinExec`'s `outputPartitioning` only uses the streamed side's `outputPartitioning`. However, if the join type of `BroadcastHashJoinExec` is an inner-like join, the build side's info (the join keys) can be added to `BroadcastHashJoinExec`'s `outputPartitioning`.

 For example,
```Scala
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "500")
val t1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i1", "j1")
val t2 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i2", "j2")
val t3 = (0 until 20).map(i => (i % 7, i % 11)).toDF("i3", "j3")
val t4 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i4", "j4")

// join1 is a sort merge join.
val join1 = t1.join(t2, t1("i1") === t2("i2"))

// join2 is a broadcast join where t3 is broadcasted.
val join2 = join1.join(t3, join1("i1") === t3("i3"))

// Join on the column from the broadcasted side (i3).
val join3 = join2.join(t4, join2("i3") === t4("i4"))

join3.explain
```
You can see that `Exchange hashpartitioning(i3#29, 200)` is introduced because there is no output partitioning info from the build side.
```
== Physical Plan ==
*(6) SortMergeJoin [i3#29], [i4#40], Inner
:- *(4) Sort [i3#29 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(i3#29, 200), true, [id=#55]
:     +- *(3) BroadcastHashJoin [i1#7], [i3#29], Inner, BuildRight
:        :- *(3) SortMergeJoin [i1#7], [i2#18], Inner
:        :  :- *(1) Sort [i1#7 ASC NULLS FIRST], false, 0
:        :  :  +- Exchange hashpartitioning(i1#7, 200), true, [id=#28]
:        :  :     +- LocalTableScan [i1#7, j1#8]
:        :  +- *(2) Sort [i2#18 ASC NULLS FIRST], false, 0
:        :     +- Exchange hashpartitioning(i2#18, 200), true, [id=#29]
:        :        +- LocalTableScan [i2#18, j2#19]
:        +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#34]
:           +- LocalTableScan [i3#29, j3#30]
+- *(5) Sort [i4#40 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(i4#40, 200), true, [id=#39]
      +- LocalTableScan [i4#40, j4#41]
```
This PR proposes to introduce output partitioning for the build side for `BroadcastHashJoinExec` if the streamed side has a `HashPartitioning` or a collection of `HashPartitioning`s.

There is a new internal config, `spark.sql.execution.broadcastHashJoin.outputPartitioningExpandLimit`, which limits the number of partitionings a `HashPartitioning` can expand to. It can be set to "0" to disable this feature.
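For example, a hedged sketch of disabling the feature via the config named above:

```scala
// "0" disables expanding the build side keys into the output partitioning.
spark.conf.set("spark.sql.execution.broadcastHashJoin.outputPartitioningExpandLimit", "0")
```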

### Why are the changes needed?

To remove unnecessary shuffle.

### Does this PR introduce _any_ user-facing change?

Yes, now the shuffle in the above example can be eliminated:
```
== Physical Plan ==
*(5) SortMergeJoin [i3#108], [i4#119], Inner
:- *(3) Sort [i3#108 ASC NULLS FIRST], false, 0
:  +- *(3) BroadcastHashJoin [i1#86], [i3#108], Inner, BuildRight
:     :- *(3) SortMergeJoin [i1#86], [i2#97], Inner
:     :  :- *(1) Sort [i1#86 ASC NULLS FIRST], false, 0
:     :  :  +- Exchange hashpartitioning(i1#86, 200), true, [id=#120]
:     :  :     +- LocalTableScan [i1#86, j1#87]
:     :  +- *(2) Sort [i2#97 ASC NULLS FIRST], false, 0
:     :     +- Exchange hashpartitioning(i2#97, 200), true, [id=#121]
:     :        +- LocalTableScan [i2#97, j2#98]
:     +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#126]
:        +- LocalTableScan [i3#108, j3#109]
+- *(4) Sort [i4#119 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(i4#119, 200), true, [id=#130]
      +- LocalTableScan [i4#119, j4#120]
```

### How was this patch tested?

Added new tests.

Closes #28676 from imback82/broadcast_join_output.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-20 14:25:51 +00:00
Gengliang Wang d0c83f372b [SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions
### What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/28733 and #28805, CNF conversion is used to push down disjunctive predicates through join and partitions pruning.

It's a good improvement; however, converting all the predicates to CNF can lead to a very long result, even with grouping functions over expressions. For example, the following predicate
```
(p0 = '1' AND p1 = '1') OR (p0 = '2' AND p1 = '2') OR (p0 = '3' AND p1 = '3') OR (p0 = '4' AND p1 = '4') OR (p0 = '5' AND p1 = '5') OR (p0 = '6' AND p1 = '6') OR (p0 = '7' AND p1 = '7') OR (p0 = '8' AND p1 = '8') OR (p0 = '9' AND p1 = '9') OR (p0 = '10' AND p1 = '10') OR (p0 = '11' AND p1 = '11') OR (p0 = '12' AND p1 = '12') OR (p0 = '13' AND p1 = '13') OR (p0 = '14' AND p1 = '14') OR (p0 = '15' AND p1 = '15') OR (p0 = '16' AND p1 = '16') OR (p0 = '17' AND p1 = '17') OR (p0 = '18' AND p1 = '18') OR (p0 = '19' AND p1 = '19') OR (p0 = '20' AND p1 = '20')
```
is converted into a very long query (130K characters) sent to the Hive metastore, and there will be an error:
```
javax.jdo.JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MPartition' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.PART_NAME,A0.PART_ID,A0.PART_NAME AS NUCORDER0 FROM PARTITIONS A0 LEFT OUTER JOIN TBLS B0 ON A0.TBL_ID = B0.TBL_ID LEFT OUTER JOIN DBS C0 ON B0.DB_ID = C0.DB_ID WHERE B0.TBL_NAME = ? AND C0."NAME" = ? AND ((((((A0.PART_NAME LIKE '%/p1=1' ESCAPE '\' ) OR (A0.PART_NAME LIKE '%/p1=2' ESCAPE '\' )) OR (A0.PART_NAME LIKE '%/p1=3' ESCAPE '\' )) OR ((A0.PART_NAME LIKE '%/p1=4' ESCAPE '\' ) O ...
```

Essentially, we just need to traverse the predicate and extract the convertible sub-predicates, as we did in https://github.com/apache/spark/pull/24598. There is no need to maintain the CNF result set.

### Why are the changes needed?

A better implementation for pushing down disjunctive and complex predicates. The pushed-down predicates are always equal in length to, or shorter than, the CNF result.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #29101 from gengliangwang/pushJoin.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-20 14:17:31 +00:00
Takeshi Yamamuro c7a68a920d [SPARK-32344][SQL] Unevaluable expr is set to FIRST/LAST ignoreNullsExpr in distinct aggregates
### What changes were proposed in this pull request?

This PR intends to fix a bug in distinct FIRST/LAST aggregates in v2.4.6/v3.0.0/master:
```
scala> sql("SELECT FIRST(DISTINCT v) FROM VALUES 1, 2, 3 t(v)").show()
...
Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: false#37
  at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
  at org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:226)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.ignoreNulls(First.scala:68)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions$lzycompute(First.scala:82)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions(First.scala:81)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$15.apply(HashAggregateExec.scala:268)
```
A root cause of this bug is that the `Aggregation` strategy replaces the foldable boolean `ignoreNullsExpr` expression with an `Unevaluable` expression (`AttributeReference`) for distinct FIRST/LAST aggregate functions. But this operation cannot be allowed, because the `Analyzer` has checked that it must be foldable;
ffdbbae1d4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala (L74-L76)
So, this PR proposes to change the variable for `IGNORE NULLS` from `Expression` to `Boolean` to avoid this case.

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a test in `DataFrameAggregateSuite`.

Closes #29143 from maropu/SPARK-32344.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-19 11:11:42 +09:00
Yaroslav Tkachenko 34baed8139 [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
### What changes were proposed in this pull request?
New `spark.sql.metadataCacheTTLSeconds` option that adds time-to-live cache behaviour to the existing caches in `FileStatusCache` and `SessionCatalog`.

### Why are the changes needed?
Currently Spark [caches file listing for tables](https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing) and requires issuing `REFRESH TABLE` any time the file listing has changed outside of Spark. Unfortunately, simply submitting `REFRESH TABLE` commands could be very cumbersome. Assuming frequently added files, hundreds of tables and dozens of users querying the data (and expecting up-to-date results), manually refreshing metadata for each table is not a solution.

This is a pretty common use-case for streaming ingestion of data, which can be done outside of Spark (with tools like Kafka Connect, etc.).

A similar feature exists in Presto: `hive.file-status-cache-expire-time` can be found [here](https://prestosql.io/docs/current/connector/hive.html#hive-configuration-properties).

### Does this PR introduce _any_ user-facing change?
Yes, it's controlled with the new `spark.sql.metadataCacheTTLSeconds` option.

When it's set to `-1` (by default), the behaviour of caches doesn't change, so it stays _backwards-compatible_.

Otherwise, you can specify a value in seconds, for example `spark.sql.metadataCacheTTLSeconds: 60` means 1-minute cache TTL.
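A minimal sketch of enabling a 1-minute TTL, assuming the option is set when the session is built:

```scala
import org.apache.spark.sql.SparkSession

// Cached file listings and catalog entries expire after 60 seconds.
val spark = SparkSession.builder()
  .config("spark.sql.metadataCacheTTLSeconds", "60")
  .getOrCreate()
```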

### How was this patch tested?

Added new tests in:

- FileIndexSuite
- SessionCatalogSuite

Closes #28852 from sap1ens/SPARK-30616-metadata-cache-ttl.

Authored-by: Yaroslav Tkachenko <sapiensy@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-17 13:40:54 +00:00
Maxim Gekk c1f160e097 [SPARK-30648][SQL] Support filters pushdown in JSON datasource
### What changes were proposed in this pull request?
In this PR, I propose to support pushed-down filters in the JSON datasource. The reason for pushing a filter down to `JacksonParser` is to apply the filter as soon as all of its attributes become available, i.e. are converted from JSON field values to the desired values according to the schema. This allows skipping the parsing of the rest of the JSON record and the conversions of other values if the filter returns `false`. This can improve performance when the pushed filters are highly selective and the conversion of JSON string fields to the desired values is comparatively expensive (for example, conversion to `TIMESTAMP` values).

The main idea behind `JsonFilters` is to group pushdown filters by their references, convert the grouped filters to expressions, and then compile them to predicates. The predicates are indexed by schema field positions. Each predicate keeps a reference counter of the row fields that have not been set yet. As soon as the counter reaches `0`, the predicate can be applied to the row because all of its dependencies have been set. Before processing a new row, each predicate's reference counter is reset to the total number of its references (dependencies in a row).
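An illustrative sketch of that reference-counting scheme (simplified; not Spark's actual `JsonFilters` API):

```scala
// A predicate fires only once every row field it references has been set.
final class CountedPredicate(refs: Set[Int], pred: Array[Any] => Boolean) {
  private var remaining = refs.size
  def reset(): Unit = remaining = refs.size
  // Called after field `pos` of `row` is converted; returns false to skip the rest of the record.
  def fieldSet(pos: Int, row: Array[Any]): Boolean = {
    if (refs.contains(pos)) remaining -= 1
    remaining > 0 || pred(row)
  }
}

val p = new CountedPredicate(Set(0), row => row(0).asInstanceOf[Int] > 0)
val row = new Array[Any](2)
p.reset()
row(0) = -5
p.fieldSet(0, row) // false: the remaining fields need not be parsed or converted
```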

The common code shared between `CSVFilters` and `JsonFilters` is moved to the `StructFilters` class and its companion object.

### Why are the changes needed?
The changes improve performance on synthetic benchmarks by up to **27 times** on JDK 8 and **25 times** on JDK 11:
```
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2  2.50GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                       25230          25255          22          0.0      252299.6       1.0X
pushdown disabled                                 25248          25282          33          0.0      252475.6       1.0X
w/ filters                                          905            911           8          0.1        9047.9      27.9X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- Added new test suites `JsonFiltersSuite` and `JacksonParserSuite`.
- By new end-to-end and case sensitivity tests in `JsonSuite`.
- By `CSVFiltersSuite`, `UnivocityParserSuite` and `CSVSuite`.
- Re-running `CSVBenchmark` and `JsonBenchmark` using Amazon EC2:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`|

and `./dev/run-benchmarks`:
```python
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #27366 from MaxGekk/json-filters-pushdown.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-17 00:01:13 +09:00
SaurabhChawla 6be8b935a4 [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables
### What changes were proposed in this pull request?
Spark SQL commands fail when selecting from ORC tables.
Steps to reproduce:
Example 1 -
Prerequisite - This is the location (/Users/test/tpcds_scale5data/date_dim) of ORC data generated by Hive.
```
val table = """CREATE TABLE `date_dim` (
  `d_date_sk` INT,
  `d_date_id` STRING,
  `d_date` TIMESTAMP,
  `d_month_seq` INT,
  `d_week_seq` INT,
  `d_quarter_seq` INT,
  `d_year` INT,
  `d_dow` INT,
  `d_moy` INT,
  `d_dom` INT,
  `d_qoy` INT,
  `d_fy_year` INT,
  `d_fy_quarter_seq` INT,
  `d_fy_week_seq` INT,
  `d_day_name` STRING,
  `d_quarter_name` STRING,
  `d_holiday` STRING,
  `d_weekend` STRING,
  `d_following_holiday` STRING,
  `d_first_dom` INT,
  `d_last_dom` INT,
  `d_same_day_ly` INT,
  `d_same_day_lq` INT,
  `d_current_day` STRING,
  `d_current_week` STRING,
  `d_current_month` STRING,
  `d_current_quarter` STRING,
  `d_current_year` STRING)
USING orc
LOCATION '/Users/test/tpcds_scale5data/date_dim'"""

spark.sql(table).collect

val u = """select date_dim.d_date_id from date_dim limit 5"""

spark.sql(u).collect
```
Example 2

```
  val table = """CREATE TABLE `test_orc_data` (
  `_col1` INT,
  `_col2` STRING,
  `_col3` INT)
  USING orc"""

spark.sql(table).collect

spark.sql("insert into test_orc_data values(13, '155', 2020)").collect

val df = """select _col2 from test_orc_data limit 5"""
spark.sql(df).collect

```

It fails with the error below:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, 192.168.0.103, executor driver): java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:133)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)`
```

The reason is that `initBatch` is not given the schema needed to find the column values, in OrcFileFormat.scala:
```
batchReader.initBatch(
 TypeDescription.fromString(resultSchemaString)
```

### Why are the changes needed?
Spark SQL queries on ORC tables are failing.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
A unit test is added for this. Also tested the failing queries through spark-shell and spark-submit.

Closes #29045 from SaurabhChawla100/SPARK-32234.

Lead-authored-by: SaurabhChawla <saurabhc@qubole.com>
Co-authored-by: SaurabhChawla <s.saurabhtim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-16 13:11:47 +00:00
Kent Yao bdeb626c5a [SPARK-32272][SQL] Add SQL standard command SET TIME ZONE
### What changes were proposed in this pull request?

This PR adds the SQL standard command `SET TIME ZONE`, which sets the current default time zone displacement for the current SQL session; it is equivalent to the existing `SET spark.sql.session.timeZone=xxx`.

All in all, this PR adds the following syntax:

```
SET TIME ZONE LOCAL;
SET TIME ZONE 'valid time zone';  -- zone offset or region
SET TIME ZONE INTERVAL XXXX; -- XXXX must be in [-18, +18] hours; this range is bigger than the ANSI range [-14, +14]
```
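A hedged usage sketch of the new syntax (the zone names and offsets below are just examples):

```scala
spark.sql("SET TIME ZONE 'America/Los_Angeles'") // region-based zone
spark.sql("SET TIME ZONE '+08:00'")              // zone offset
spark.sql("SET TIME ZONE LOCAL")                 // back to the local default
```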

### Why are the changes needed?

ANSI compliance, and supplying pure SQL users a way to retrieve all supported time zones.

### Does this PR introduce _any_ user-facing change?

Yes, new syntax is added.

### How was this patch tested?

Added unit tests.

Also locally verified the reference doc:

![image](https://user-images.githubusercontent.com/8326978/87510244-c8dc3680-c6a5-11ea-954c-b098be84afee.png)

Closes #29064 from yaooqinn/SPARK-32272.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-16 13:01:53 +00:00
Erik Krogen cf22d947fb [SPARK-32036] Replace references to blacklist/whitelist language with more appropriate terminology, excluding the blacklisting feature
### What changes were proposed in this pull request?

This PR will remove references to these "blacklist" and "whitelist" terms besides the blacklisting feature as a whole, which can be handled in a separate JIRA/PR.

This touches quite a few files, but the changes are straightforward (variable/method/etc. name changes) and mostly quite self-contained.

### Why are the changes needed?

As per discussion on the Spark dev list, it will be beneficial to remove references to problematic language that can alienate potential community members. One such reference is "blacklist" and "whitelist". While it seems to me that there is some valid debate as to whether these terms have racist origins, the cultural connotations are inescapable in today's world.

### Does this PR introduce _any_ user-facing change?

In the test file `HiveQueryFileTest`, a developer has the ability to specify the system property `spark.hive.whitelist` to specify a list of Hive query files that should be tested. This system property has been renamed to `spark.hive.includelist`. The old property has been kept for compatibility, but will log a warning if used. I am open to feedback from others on whether keeping a deprecated property here is unnecessary given that this is just for developers running tests.

### How was this patch tested?

Existing tests should be suitable since no behavior changes are expected as a result of this PR.

Closes #28874 from xkrogen/xkrogen-SPARK-32036-rename-blacklists.

Authored-by: Erik Krogen <ekrogen@linkedin.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-07-15 11:40:55 -05:00
Dilip Biswal e4499932da [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
### What changes were proposed in this pull request?
Improve the EXPLAIN FORMATTED output of DSV2 Scan nodes (file based ones).

**Before**
```
== Physical Plan ==
* Project (4)
+- * Filter (3)
   +- * ColumnarToRow (2)
      +- BatchScan (1)

(1) BatchScan
Output [2]: [value#7, id#8]
Arguments: [value#7, id#8], ParquetScan(org.apache.spark.sql.test.TestSparkSession17477bbb,Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml,org.apache.spark.sql.execution.datasources.InMemoryFileIndexa6c363ce,StructType(StructField(value,IntegerType,true)),StructType(StructField(value,IntegerType,true)),StructType(StructField(id,IntegerType,true)),[Lorg.apache.spark.sql.sources.Filter;40fee459,org.apache.spark.sql.util.CaseInsensitiveStringMapfeca1ec6,Vector(isnotnull(id#8), (id#8 > 1)),List(isnotnull(value#7), (value#7 > 2)))
(2) ...
(3) ...
(4) ...
```
**After**
```
== Physical Plan ==
* Project (4)
+- * Filter (3)
   +- * ColumnarToRow (2)
      +- BatchScan (1)

(1) BatchScan
Output [2]: [value#7, id#8]
DataFilters: [isnotnull(value#7), (value#7 > 2)]
Format: parquet
Location: InMemoryFileIndex[....]
PartitionFilters: [isnotnull(id#8), (id#8 > 1)]
PushedFilers: [IsNotNull(id), IsNotNull(value), GreaterThan(id,1), GreaterThan(value,2)]
ReadSchema: struct<value:int>
(2) ...
(3) ...
(4) ...
```
### Why are the changes needed?
The old format is not very readable. This improves the readability of the plan.

### Does this PR introduce any user-facing change?
Yes. the explain output will be different.

### How was this patch tested?
Added a test case in ExplainSuite.

Closes #28425 from dilipbiswal/dkb_dsv2_explain.

Lead-authored-by: Dilip Biswal <dkbiswal@gmail.com>
Co-authored-by: Dilip Biswal <dkbiswal@apache.org>
Signed-off-by: Dilip Biswal <dkbiswal@apache.org>
2020-07-15 01:28:39 -07:00
Jungtaek Lim (HeartSaVioR) 542aefb4c4 [SPARK-31985][SS] Remove incomplete/undocumented stateful aggregation in continuous mode
### What changes were proposed in this pull request?

This removes the undocumented and incomplete feature of "stateful aggregation" in continuous mode, which reduces the codebase by 1100+ lines.

### Why are the changes needed?

Work on the feature has been stopped for over a year, and no one in the community has asked or requested this feature. The current state of the feature is that it only works with `coalesce(1)`, which forces the query to read, process, and write in a single task, which doesn't make sense in production.

The remaining code also increases the work on DSv2 changes. That's why I don't simply propose reverting the relevant commits: the code path has changed due to DSv2 evolution.

### Does this PR introduce _any_ user-facing change?

Technically no, because it was never documented and can't be used in production in its current shape.

### How was this patch tested?

Existing tests.

Closes #29077 from HeartSaVioR/SPARK-31985.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-07-15 13:40:43 +09:00
HyukjinKwon 03b5707b51 [MINOR][R] Match collectAsArrowToR with non-streaming collectAsArrowToPython
### What changes were proposed in this pull request?

This PR proposes to port forward #29098 to `collectAsArrowToR`. `collectAsArrowToR` follows `collectAsArrowToPython` in branch-2.4 due to the limitation of ARROW-4512. SparkR vectorization currently cannot use streaming format.

### Why are the changes needed?

For simplicity and consistency.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The same code is being tested in `collectAsArrowToPython` of branch-2.4.

Closes #29100 from HyukjinKwon/minor-parts.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-15 08:46:20 +09:00
yi.wu a47b69a88a [SPARK-32307][SQL] ScalaUDF's canonicalized expression should exclude inputEncoders
### What changes were proposed in this pull request?

Override `canonicalized` to empty the `inputEncoders` for the canonicalized `ScalaUDF`.

### Why are the changes needed?

The following currently fails on `branch-3.0`, but not on the Apache Spark 3.0.0 release.

```scala
spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt))
Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t")
checkAnswer(sql("SELECT key(a) AS k FROM t GROUP BY key(a)"), Row(1) :: Nil)

[info]   org.apache.spark.sql.AnalysisException: expression 't.`a`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
[info] Aggregate [UDF(a#6)], [UDF(a#6) AS k#8]
[info] +- SubqueryAlias t
[info]    +- Project [value#3 AS a#6]
[info]       +- LocalRelation [value#3]
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:130)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:257)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259)
[info]   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[info]   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259)
[info]   at scala.collection.immutable.List.foreach(List.scala:392)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259)
...
```

We use the rule `ResolveEncodersInUDF` to resolve `inputEncoders`, and the original `ScalaUDF` instance is updated to a new `ScalaUDF` instance with the resolved encoders at the end. Note that during encoder resolution, types like `map` and `array` are resolved to new expressions (e.g. `MapObjects`, `CatalystToExternalMap`).

However, `ExpressionEncoder` can't be canonicalized. Thus, the canonicalized `ScalaUDF`s become different even if their original `ScalaUDF`s are the same. Finally, `checkValidAggregateExpression` fails when such a `ScalaUDF` is used as a group expression.

### Does this PR introduce _any_ user-facing change?

Yes, users will not hit the exception after this fix.

### How was this patch tested?

Added tests.

Closes #29106 from Ngone51/spark-32307.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 12:19:01 -07:00
Sean Owen d6a68e0b67 [SPARK-29292][STREAMING][SQL][BUILD] Get streaming, catalyst, sql compiling for Scala 2.13
### What changes were proposed in this pull request?

Continuation of https://github.com/apache/spark/pull/28971 which lets streaming, catalyst and sql compile for 2.13. Same idea.

### Why are the changes needed?

Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests. (2.13 was not tested; this is about getting it to compile without breaking 2.12)

Closes #29078 from srowen/SPARK-29292.2.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 02:06:50 -07:00
angerszhu 5521afbd22 [SPARK-32220][SQL][FOLLOW-UP] SHUFFLE_REPLICATE_NL Hint should not change Non-Cartesian Product join result
### What changes were proposed in this pull request?
Follow up on the review comment at https://github.com/apache/spark/pull/29035#discussion_r453468999
and add an explanatory comment for the original PR.

### Why are the changes needed?
Add a clarifying comment.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed; this is a comment-only change.

Closes #29084 from AngersZhuuuu/follow-spark-32220.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-13 08:23:25 -07:00
angerszhu 6d499647b3 [SPARK-32105][SQL] Refactor current ScriptTransformationExec code
### What changes were proposed in this pull request?

 * Renamed the Hive transform script class `hive/execution/ScriptTransformationExec` to `hive/execution/HiveScriptTransformationExec` (without renaming the file)
 * Extracted `BaseScriptTransformationExec` to hold the common code shared by `SparkScriptTransformationExec` (to be added in the next PR) and `HiveScriptTransformationExec`
 * Extracted `BaseScriptTransformationWriterThread` for the data-writing thread shared by `SparkScriptTransformationWriterThread` (added next to support transform in sql/core) and `HiveScriptTransformationWriterThread`
 * `HiveScriptTransformationWriterThread` additionally supports the Hive SerDe format
 * Renamed the current `Script` strategies in the hive module to `HiveScript`; the next PR will add `SparkScript` strategies to support transform in sql/core.

Todo list:

- Support transform in sql/core based on `BaseScriptTransformationExec`, which would run the script operator in SQL mode (without Hive).
The script output would be read as strings and column values extracted using a delimiter (default: tab character).
- For Hive, SerDes are required by default, while the non-Hive path can run without a SerDe.
- Clean up past hacks that have been observed (or suggested / reported), such as:
       - [Solve string value error about Date/Timestamp in ScriptTransform](https://issues.apache.org/jira/browse/SPARK-31947)
       - [support use transform with aggregation](https://issues.apache.org/jira/browse/SPARK-28227)
       - [support array/map as transform's input](https://issues.apache.org/jira/browse/SPARK-22435)
- Use a code-gen projection to serialize rows to the output stream.

### Why are the changes needed?
Support running transform in SQL mode without Hive.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Added UT

Closes #27983 from AngersZhuuuu/follow_spark_15694.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-13 08:58:25 +00:00
HyukjinKwon c4b0639f83 [SPARK-32270][SQL] Use TextFileFormat in CSV's schema inference with a different encoding
### What changes were proposed in this pull request?

This PR proposes to use the text datasource in CSV's schema inference. This shares the same reasons as SPARK-18362, SPARK-19885 and SPARK-19918 - we're currently using a Hadoop RDD when the encoding is different, which is unnecessary. This PR completes SPARK-18362 and addresses the comment at https://github.com/apache/spark/pull/15813#discussion_r90751405.

It is also better to keep the code paths consistent with the existing CSV and JSON datasources; currently, CSV schema inference with a non-UTF-8 encoding specified takes a different path from the UTF-8 case alone.

This PR might also amount to a bug fix: Spark session configurations, such as Hadoop configurations, are not respected during CSV schema inference when a different encoding is used (they currently have to be set on the Spark context for schema inference in that case).
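
A minimal usage sketch of the path this touches (the file path and encoding below are made up for illustration):

```scala
// CSV schema inference with a non-UTF-8 encoding; with this change the read performed
// for inference goes through the text datasource instead of a Hadoop RDD.
val df = spark.read
  .option("encoding", "ISO-8859-1")
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/latin1.csv") // hypothetical path
df.printSchema()
```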

### Why are the changes needed?

For consistency, potentially better performance, and fixing a potentially very corner case bug.

### Does this PR introduce _any_ user-facing change?

Virtually no.

### How was this patch tested?

Existing tests should cover.

Closes #29063 from HyukjinKwon/SPARK-32270.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-12 09:44:27 -07:00
yi.wu 004aea8155 [SPARK-32154][SQL] Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst type
### What changes were proposed in this pull request?

This PR proposes to use `ExpressionEncoder` for the return type of ScalaUDF to convert to the catalyst type, instead of using `CatalystTypeConverters`.

Note that this change only takes effect for typed Scala UDFs, since that is the only case where we know the type tag of the raw type.

### Why are the changes needed?

Currently, users can register a UDF with `Instant`/`LocalDate` as the return type even with `spark.sql.datetime.java8API.enabled=false`. However, the UDF cannot actually be used.
For example, if we try:

```scala
scala> sql("set spark.sql.datetime.java8API.enabled=false")
scala> spark.udf.register("buildDate", udf{ d: String => java.time.LocalDate.parse(d) })
scala> Seq("2020-07-02").toDF("d").selectExpr("CAST(buildDate(d) AS STRING)").show
```
Then, we will hit the error:
```scala
java.lang.ClassCastException: java.time.LocalDate cannot be cast to java.sql.Date
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$DateConverter$.toCatalystImpl(CatalystTypeConverters.scala:304)
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:107)
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$.$anonfun$createToCatalystConverter$2(CatalystTypeConverters.scala:425)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1169)
...
```
because using the UDF actually requires `spark.sql.datetime.java8API.enabled` to be enabled, which can confuse users.

This happens because, when registering the UDF, Spark uses `ExpressionEncoder` to serialize/deserialize types. However, when the UDF is invoked, Spark uses `CatalystTypeConverters`, which is controlled by `spark.sql.datetime.java8API.enabled`, to serialize/deserialize types. Therefore, Spark fails to convert the Java 8 date-time types.

If we also use `ExpressionEncoder` for the return type, similar to what we do for the input parameter types, then UDFs can support `Instant`/`LocalDate` and even complex types combining them.

### Does this PR introduce _any_ user-facing change?

Yes. Before this PR, if users run the demo above, they would hit the error. After this PR, the demo will run successfully.

### How was this patch tested?

Updated 2 tests and added a new one for combined types of `Instant` and `LocalDate`.

Closes #28979 from Ngone51/udf-return-encoder.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-07-12 09:45:18 +09:00
Liang-Chi Hsieh 98504e92a7 [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls
### What changes were proposed in this pull request?

This patch proposes to make `unionByName` optionally fill missing columns with nulls.

### Why are the changes needed?

Currently, `unionByName` throws an exception if it detects different column names between the two Datasets. This is a strict requirement, and users sometimes want the more flexible behavior of unioning, by name resolution, two Datasets with different subsets of columns.

### Does this PR introduce _any_ user-facing change?

Yes. This adds an overload of `Dataset.unionByName` with a boolean parameter that allows different sets of column names between the two Datasets. Missing columns on either side are filled with null values.
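
A minimal usage sketch (assuming the new boolean parameter is exposed as `allowMissingColumns`; the parameter name is my assumption, not taken from this description):

```scala
import spark.implicits._

val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((3, 4)).toDF("a", "c")

// Expected shape of the result: columns a, b, c, with nulls where a column was absent.
df1.unionByName(df2, allowMissingColumns = true).show()
```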

### How was this patch tested?

Unit test.

Closes #28996 from viirya/SPARK-29358.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-11 15:58:40 -07:00
Sean Owen 3ad4863673 [SPARK-29292][SPARK-30010][CORE] Let core compile for Scala 2.13
### What changes were proposed in this pull request?

The purpose of this PR is to partly resolve SPARK-29292, and fully resolve SPARK-30010, which should allow Spark to compile vs Scala 2.13 in Spark Core and up through GraphX (not SQL, Streaming, etc).

Note that we are not trying to determine here whether this makes Spark work on 2.13 yet, just compile, as a prerequisite for assessing test outcomes. However, of course, we need to ensure that the change does not break 2.12.

The changes are, in the main, adding .toSeq and .toMap calls where mutable collections / maps are returned as Seq / Map, which are immutable by default in Scala 2.13. The theory is that it should be a no-op for Scala 2.12 (these return themselves), and required for 2.13.
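
For illustration, a minimal (non-Spark) sketch of the pattern; the method and names here are made up:

```scala
import scala.collection.mutable.ArrayBuffer

// In 2.13, scala.Seq is immutable, so returning a mutable buffer where the signature
// promises Seq requires the explicit conversion; in 2.12, ArrayBuffer is already a
// scala.Seq and .toSeq returns the buffer itself, so the call is effectively a no-op.
def collectedNames(buf: ArrayBuffer[String]): Seq[String] = buf.toSeq
```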

There are a few non-trivial changes highlighted below.
In particular, to get Core to compile, we need to resolve SPARK-30010 which removes a deprecated SparkConf method

### Why are the changes needed?

Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.

### Does this PR introduce _any_ user-facing change?

Yes, removal of the deprecated SparkConf.setAll overload, which isn't legal in Scala 2.13 anymore.

### How was this patch tested?

Existing tests. (2.13 was not _tested_; this is about getting it to compile without breaking 2.12)

Closes #28971 from srowen/SPARK-29292.1.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-11 14:34:02 -07:00
yi.wu 0c9196e549 [SPARK-32238][SQL] Use Utils.getSimpleName to avoid hitting Malformed class name in ScalaUDF
### What changes were proposed in this pull request?

This PR proposes to use `Utils.getSimpleName(function)` instead of `function.getClass.getSimpleName` to get the class name.

### Why are the changes needed?

For some functions (see the demo below), using `function.getClass.getSimpleName` can hit a "Malformed class name" error.

### Does this PR introduce _any_ user-facing change?

Yes.  For the demo,

```scala
 object MalformedClassObject extends Serializable {
    class MalformedNonPrimitiveFunction extends (String => Int) with Serializable {
      override def apply(v1: String): Int = v1.toInt / 0
    }
  }
 OuterScopes.addOuterScope(MalformedClassObject)
 val f = new MalformedClassObject.MalformedNonPrimitiveFunction()
 Seq("20").toDF("col").select(udf(f).apply(Column("col"))).collect()
```

Before this PR, users can only see the "Malformed class name" error:

```scala
An exception or error caused a run to abort: Malformed class name
java.lang.InternalError: Malformed class name
	at java.lang.Class.getSimpleName(Class.java:1330)
	at org.apache.spark.sql.catalyst.expressions.ScalaUDF.udfErrorMessage$lzycompute(ScalaUDF.scala:1157)
	at org.apache.spark.sql.catalyst.expressions.ScalaUDF.udfErrorMessage(ScalaUDF.scala:1155)
	at org.apache.spark.sql.catalyst.expressions.ScalaUDF.doGenCode(ScalaUDF.scala:1077)
	at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:147)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:142)
	at org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:160)
	at org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$1(basicPhysicalOperators.scala:69)
        ...
```

After this PR, users can see the real root cause of the UDF failure:

```scala
org.apache.spark.SparkException: Failed to execute user defined function(UDFSuite$MalformedClassObject$MalformedNonPrimitiveFunction: (string) => int)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:464)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:467)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArithmeticException: / by zero
	at org.apache.spark.sql.UDFSuite$MalformedClassObject$MalformedNonPrimitiveFunction.apply(UDFSuite.scala:677)
	at org.apache.spark.sql.UDFSuite$MalformedClassObject$MalformedNonPrimitiveFunction.apply(UDFSuite.scala:676)
	... 17 more

```
### How was this patch tested?

Added a test.

Closes #29050 from Ngone51/fix-malformed-udf.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-11 06:27:56 -07:00
angerszhu 560fe1f54c [SPARK-32220][SQL] SHUFFLE_REPLICATE_NL Hint should not change Non-Cartesian Product join result
### What changes were proposed in this pull request?
In the current join hint strategies, using the SHUFFLE_REPLICATE_NL hint directly converts the join into a Cartesian product join and loses the join condition, making the result incorrect.

For Example:
```
spark-sql> select * from test4 order by a asc;
1 2
Time taken: 1.063 seconds, Fetched 4 row(s)20/07/08 14:11:25 INFO SparkSQLCLIDriver: Time taken: 1.063 seconds, Fetched 4 row(s)
spark-sql>select * from test5 order by a asc
1 2
2 2
Time taken: 1.18 seconds, Fetched 24 row(s)20/07/08 14:13:59 INFO SparkSQLCLIDriver: Time taken: 1.18 seconds, Fetched 24 row(s)spar
spark-sql>select /*+ shuffle_replicate_nl(test4) */ * from test4 join test5 where test4.a = test5.a order by test4.a asc ;
1 2 1 2
1 2 2 2
Time taken: 0.351 seconds, Fetched 2 row(s)
20/07/08 14:18:16 INFO SparkSQLCLIDriver: Time taken: 0.351 seconds, Fetched 2 row(s)
```

### Why are the changes needed?
Fixes a wrong query result.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #29035 from AngersZhuuuu/SPARK-32220.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-10 09:03:16 -07:00
Kent Yao 4609f1fdab [SPARK-32207][SQL] Support 'F'-suffixed Float Literals
### What changes were proposed in this pull request?

In this PR, I propose to support 'f'-suffixed float literals, e.g. `select 1.1f`.

### Why are the changes needed?

This is a very common feature across platforms; checked with PostgreSQL, Presto, Hive, MySQL, etc.

### Does this PR introduce _any_ user-facing change?

Yes.

`select 1.1f` returns the float value 1.1 instead of throwing `AnalysisException: Can't extract value from 1: need struct type but got int;`.

### How was this patch tested?

Added unit tests.

Closes #29022 from yaooqinn/SPARK-32207.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-09 19:45:16 -07:00
HyukjinKwon 01e9dd9050 [SPARK-20680][SQL][FOLLOW-UP] Revert NullType.simpleString from 'unknown' to 'null'
### What changes were proposed in this pull request?

This PR proposes to partially revert https://github.com/apache/spark/pull/28833: it changes `NullType.simpleString` back from `unknown` to `null`.

### Why are the changes needed?

- Technically speaking, it's orthogonal to the issue itself, SPARK-20680.
- It needs some more discussion, see https://github.com/apache/spark/pull/28833#issuecomment-655277714

### Does this PR introduce _any_ user-facing change?

It reverts back the user-facing changes at https://github.com/apache/spark/pull/28833.
The simple string of `NullType` is back to `null`.

### How was this patch tested?

I just logically reverted. Jenkins should test it out.

Closes #29041 from HyukjinKwon/SPARK-20680.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-09 19:44:08 -07:00
Dilip Biswal 18aae21d96 [SPARK-31875][SQL] Provide a option to disable user supplied Hints
### What changes were proposed in this pull request?
Introduce a new SQL config, `spark.sql.optimizer.ignoreHints`. When it is set to true, the
application of hints is disabled. This is similar to Oracle's OPTIMIZER_IGNORE_HINTS.
This can be helpful for studying the performance difference when hints are applied versus when they are not.
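
A usage sketch (the config name is taken from this description; `t1` and `t2` are placeholder tables):

```scala
// With the config enabled, user-supplied hints are ignored, so the plan below can be
// compared directly against the plan produced with the config turned off.
spark.conf.set("spark.sql.optimizer.ignoreHints", "true")
spark.sql("SELECT /*+ BROADCAST(t2) */ * FROM t1 JOIN t2 ON t1.id = t2.id").explain()
```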

### Why are the changes needed?
This can be helpful for studying the performance difference when hints are applied versus when they are not.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
New tests added in ResolveHintsSuite.

Closes #28683 from dilipbiswal/disable_hint.

Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-09 18:27:07 -07:00
Erik Erlandson 1cb5bfc47a [SPARK-32159][SQL] Fix integration between Aggregator[Array[_], _, _] and UnresolvedMapObjects
Context: The fix for SPARK-27296 introduced by #25024 allows `Aggregator` objects to appear in queries. This works fine for aggregators with atomic input types, e.g. `Aggregator[Double, _, _]`.

However it can cause a null pointer exception if the input type is `Array[_]`.  This was historically considered an ignorable case for serialization of `UnresolvedMapObjects`, but the new ScalaAggregator class causes these expressions to be serialized over to executors because the resolve-and-bind is being deferred.

### What changes were proposed in this pull request?
A new rule `ResolveEncodersInScalaAgg` that performs the resolution of the expressions contained in the encoders so that properly resolved expressions are serialized over to executors.

### Why are the changes needed?
Applying an aggregator of the form `Aggregator[Array[_], _, _]` using `functions.udaf()` currently causes a null pointer error in Catalyst.
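
A hedged sketch of the affected shape (the aggregator below is made up for illustration, not taken from the PR):

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// An Aggregator whose *input* type is an array, registered through functions.udaf;
// this is the shape that previously hit a null pointer error in Catalyst.
object ArraySum extends Aggregator[Array[Double], Double, Double] {
  def zero: Double = 0.0
  def reduce(acc: Double, in: Array[Double]): Double = acc + in.sum
  def merge(a: Double, b: Double): Double = a + b
  def finish(acc: Double): Double = acc
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

spark.udf.register("array_sum", udaf(ArraySum))
```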

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
A unit test has been added that does aggregation with array types for input, buffer, and output. I have done additional testing with my own custom aggregators in the spark REPL.

Closes #28983 from erikerlandson/fix-spark-32159.

Authored-by: Erik Erlandson <eerlands@redhat.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-09 08:42:20 +00:00
Jungtaek Lim (HeartSaVioR) 526cb2d1ba [SPARK-32148][SS] Fix stream-stream join issue on missing to copy reused unsafe row
### What changes were proposed in this pull request?

This patch fixes the odd join result being occurred from stream-stream join for state store format V2.

There are some spots on the V2 path that leverage UnsafeProjection. Since the result row is reused, the row should be copied to avoid its value changing while it is being read (or the caller must be guaranteed not to be affected by such behavior), but `SymmetricHashJoinStateManager.removeByValueCondition` violates this.

This patch makes `KeyWithIndexToValueRowConverterV2.convertValue` copy the row by itself so that callers don't need to take care of it. It doesn't change the behavior of `KeyWithIndexToValueRowConverterV2.convertToValueRow`, to avoid double-copying, as the caller is expected to store the row and the state store implementation will call `copy()` on it.

This patch adds such behavior into each method doc in `KeyWithIndexToValueRowConverter`, so that further contributors can read through and make sure the change / new addition doesn't break the contract.
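
For context, a conceptual sketch of the reuse/copy contract with Spark's internal `UnsafeProjection` (illustrative only, not the patched code):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(Seq(StructField("v", IntegerType)))
val proj = UnsafeProjection.create(schema)

// The projection reuses a single output row across calls, so a caller that keeps the
// result beyond the next invocation must copy() it first.
val first  = proj(InternalRow(1)).copy()
val second = proj(InternalRow(2)).copy() // without copy(), `first` would now also read 2
```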

### Why are the changes needed?

Stream-stream join with state store format V2 (newly added in Spark 3.0.0) has a serious correctness bug that produces non-deterministic results.

### Does this PR introduce _any_ user-facing change?

Yes, some Spark 3.0.0 users using stream-stream join from a new checkpoint (the bug exists only in the v2 format path) may encounter wrong join results. This patch fixes that.

### How was this patch tested?

The reported case is converted into a new UT, which is confirmed to pass. All UTs in StreamingInnerJoinSuite and StreamingOuterJoinSuite pass as well.

Closes #28975 from HeartSaVioR/SPARK-32148.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-09 07:37:06 +00:00
Wenchen Fan 8c5bee599d [SPARK-28067][SPARK-32018] Fix decimal overflow issues
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/27627 to fix the remaining issues. There are 2 issues fixed in this PR:
1. `UnsafeRow.setDecimal` can set an overflowed decimal, which causes an error when reading it back. The expected behavior is to return null.
2. The update/merge expression for decimal type in `Sum` is wrong. We shouldn't turn the `sum` value back to 0 after it becomes null due to overflow. This issue was hidden because:
2.1 for hash aggregate, the buffer is unsafe row. Due to the first bug, we fail when overflow happens, so there is no chance to mistakenly turn null back to 0.
2.2 for sort-based aggregate, the buffer is generic row. The decimal can overflow (the Decimal class has unlimited precision) and we don't have the null problem.

If we only fix the first bug, then the second bug is exposed and test fails. If we only fix the second bug, there is no way to test it. This PR fixes these 2 bugs together.
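
A sketch of the user-visible behavior after the fix (the values are chosen to force overflow; the expected output is my reading of this description, not copied from it):

```scala
// Two values at the maximum of DECIMAL(38,0); their sum overflows 38 digits of precision.
// Under non-ANSI mode the aggregate should now return null instead of a wrong value.
spark.sql("SET spark.sql.ansi.enabled=false")
spark.sql("""
  SELECT sum(d) FROM VALUES
    (CAST('99999999999999999999999999999999999999' AS DECIMAL(38, 0))),
    (CAST('99999999999999999999999999999999999999' AS DECIMAL(38, 0))) AS t(d)
""").show()
```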

### Why are the changes needed?

Fix issues during decimal sum when overflow happens

### Does this PR introduce _any_ user-facing change?

Yes. Now decimal sum can return null correctly for overflow under non-ansi mode.

### How was this patch tested?

New and updated tests.

Closes #29026 from cloud-fan/decimal.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-09 15:56:40 +09:00
Takuya UESHIN cfecc2030d [SPARK-32160][CORE][PYSPARK] Disallow to create SparkContext in executors
### What changes were proposed in this pull request?

This PR proposes to disallow to create `SparkContext` in executors, e.g., in UDFs.

### Why are the changes needed?

Currently, executors can create a SparkContext, but they shouldn't be able to.

```scala
sc.range(0, 1).foreach { _ =>
  new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
}
```

### Does this PR introduce _any_ user-facing change?

Yes, users won't be able to create `SparkContext` in executors.

### How was this patch tested?

Added tests.

Closes #28986 from ueshin/issues/SPARK-32160/disallow_spark_context_in_executors.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-09 15:51:56 +09:00
Ryan Blue 3bb1ac597a [SPARK-32168][SQL] Fix hidden partitioning correctness bug in SQL overwrite
### What changes were proposed in this pull request?

When converting an `INSERT OVERWRITE` query to a v2 overwrite plan, Spark attempts to detect when a dynamic overwrite and a static overwrite will produce the same result so it can use the static overwrite. Spark incorrectly detects when dynamic and static overwrites are equivalent when there are hidden partitions, such as `days(ts)`.

This updates the analyzer rule `ResolveInsertInto` to always use a dynamic overwrite when the mode is dynamic, and static when the mode is static. This avoids the problem by not trying to determine whether the two plans are equivalent and always using the one that corresponds to the partition overwrite mode.

### Why are the changes needed?

This is a correctness bug. If a table has hidden partitions, all of the values for those partitions are dropped instead of dynamically overwriting changed partitions.

This only affects SQL commands (not `DataFrameWriter`) writing to tables that have hidden partitions. It is also only a problem when the partition overwrite mode is dynamic.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the correctness bug detailed above.

### How was this patch tested?

* This updates the in-memory table to support a hidden partition transform, `days`, and adds a test case to `DataSourceV2SQLSuite` in which the table uses this hidden partition function. This test fails without the fix to `ResolveInsertInto`.
* This updates the test case `InsertInto: overwrite - multiple static partitions - dynamic mode` in `InsertIntoTests`. The result of the SQL command is unchanged, but the SQL command will now use a dynamic overwrite so the test now uses `dynamicOverwriteTest`.

Closes #28993 from rdblue/fix-insert-overwrite-v2-conversion.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-08 16:06:40 -07:00
Kousuke Saruta 371b35d2e0 [SPARK-32214][SQL] The type conversion function generated in makeFromJava for "other" type uses a wrong variable
### What changes were proposed in this pull request?

This PR fixes an inconsistency in `EvaluatePython.makeFromJava`, which creates a type conversion function for some Java/Scala types.

`other` is a type, but the function should actually pass `obj`:

```scala
case other => (obj: Any) => nullSafeConvert(other)(PartialFunction.empty)
```
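
The corrected form, per the description above, passes the object rather than the type (sketch of the fix):

```scala
case other => (obj: Any) => nullSafeConvert(obj)(PartialFunction.empty)
```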

This does not change the output because it always returns `null` for unsupported datatypes.

### Why are the changes needed?

To make the code coherent and consistent.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No behaviour change.

Closes #29029 from sarutak/fix-makeFromJava.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-08 17:46:25 +09:00
LantaoJin b5297c43b0 [SPARK-20680][SQL] Spark-sql do not support for creating table with void column datatype
### What changes were proposed in this pull request?

This is the new PR which to address the close one #17953

1. support "void" primitive data type in the `AstBuilder`, point it to `NullType`
2. forbid creating tables with VOID/NULL column type

### Why are the changes needed?

1. Spark is incompatible with the Hive void type. When a Hive table schema contains a void type, DESC TABLE throws an exception in Spark.

>hive> create table bad as select 1 x, null z from dual;
>hive> describe bad;
OK
x	int
z	void

In Spark 2.0.x, describing this table works normally:
>spark-sql> describe bad;
x       int     NULL
z       void    NULL
Time taken: 4.431 seconds, Fetched 2 row(s)

But in the latest Spark version, it fails with `SparkException: Cannot recognize hive type string: void`:

>spark-sql> describe bad;
17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
org.apache.spark.SparkException: Cannot recognize hive type string: void
Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
DataType void() is not supported.(line 1, pos 0)
== SQL ==
void
^^^
        ... 61 more
org.apache.spark.SparkException: Cannot recognize hive type string: void

2. Hive CTAS statements throw an error when the select clause has a NULL/VOID type column, since HIVE-11217.
In Spark, creating a table with a VOID/NULL column should throw a readable exception message, including:

- create data source table (using parquet, json, ...)
- create hive table (with or without stored as)
- CTAS

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Add unit tests

Closes #28833 from LantaoJin/SPARK-20680_COPY.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-07 18:58:01 -07:00
Liang-Chi Hsieh 90b9099064 [SPARK-32163][SQL] Nested pruning should work even with cosmetic variations
### What changes were proposed in this pull request?

This patch proposes to deal with cosmetic variations when processing nested column extractors in `NestedColumnAliasing`. Currently, if there are cosmetic variations in the nested column extractors, the query is not optimized.

### Why are the changes needed?

If the expressions extracting nested fields have cosmetic variations, such as a qualifier difference, nested column pruning currently cannot work well.

For example, two attributes that are semantically the same may be referenced in a query, but their nested column extractors are treated differently when we deal with nested column pruning.
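
A hypothetical illustration of such a variation (column names and data are made up):

```scala
import org.apache.spark.sql.functions.col

val df = spark.sql("SELECT named_struct('first', 'Ada', 'last', 'Lovelace') AS name, 36 AS age")

// Both extractors refer to the same nested field, but one carries a qualifier; before
// this fix such a cosmetic difference could prevent nested column pruning from applying.
df.as("p").select(col("p.name.first"), col("name.first")).explain()
```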

### Does this PR introduce _any_ user-facing change?

Yes, fixing a bug in nested column pruning.

### How was this patch tested?

Unit test.

Closes #28988 from viirya/SPARK-32163.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-07 11:17:53 -07:00
fqaiser94@gmail.com 4bbc343a4c [SPARK-31317][SQL] Add withField method to Column
### What changes were proposed in this pull request?

Added a new `withField` method to the `Column` class. This method should allow users to add or replace a `StructField` in a `StructType` column (with very similar semantics to the `withColumn` method on `Dataset`).

### Why are the changes needed?

Often Spark users have to work with deeply nested data e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column.

For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing):
```
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val data = spark.createDataFrame(sc.parallelize(
      Seq(Row(Row(Row(1, 2, 3), Row(Row(4, null, 6), Row(7, 8, 9), Row(10, 11, 12)), Row(13, 14, 15))))),
      StructType(Seq(
        StructField("a", StructType(Seq(
          StructField("a", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType)))),
          StructField("b", StructType(Seq(
            StructField("a", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("b", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("c", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType))))
          ))),
          StructField("c", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType))))
        )))))).cache

data.show(false)
+---------------------------------+
|a                                |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+
```
Currently, to replace the missing value users would have to do something like this:
```
val result = data.withColumn("a",
  struct(
    $"a.a",
    struct(
      struct(
        $"a.b.a.a",
        lit(5).as("b"),
        $"a.b.a.c"
      ).as("a"),
      $"a.b.b",
      $"a.b.c"
    ).as("b"),
    $"a.c"
  ))

result.show(false)
+---------------------------------------------------------------+
|a                                                              |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 5, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+
```
As you can see above, with the existing methods users must call the `struct` function and list all fields, including fields they don't want to change. This is not ideal as:
>this leads to complex, fragile code that cannot survive schema evolution.
[SPARK-16483](https://issues.apache.org/jira/browse/SPARK-16483)

In contrast, with the method added in this PR, a user could simply do something like this:
```
val result = data.withColumn("a", 'a.withField("b.a.b", lit(5)))
result.show(false)
+---------------------------------------------------------------+
|a                                                              |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 5, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+

```

This is the first of maybe a few methods that could be added to the `Column` class to make it easier to manipulate nested data. Other methods under discussion in [SPARK-22231](https://issues.apache.org/jira/browse/SPARK-22231) include `drop` and `renameField`. However, these should be added in a separate PR.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New unit tests were added. Jenkins must pass them.

### Related JIRAs:
- https://issues.apache.org/jira/browse/SPARK-22231
- https://issues.apache.org/jira/browse/SPARK-16483

Closes #27066 from fqaiser94/SPARK-22231-withField.

Lead-authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com>
Co-authored-by: fqaiser94 <fqaiser94@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-07 16:34:03 +00:00
Wenchen Fan 5d296ed39e [SPARK-32167][SQL] Fix GetArrayStructFields to respect inner field's nullability together
### What changes were proposed in this pull request?

Fix nullability of `GetArrayStructFields`. It should consider both the original array's `containsNull` and the inner field's nullability.
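
An illustrative shape (made-up data) showing where the fixed nullability matters:

```scala
// Selecting `arr.b` from an array of structs produces a GetArrayStructFields expression.
// Its element nullability must account for the inner field `b` being nullable, not only
// the array's containsNull flag.
val df = spark.sql("SELECT array(named_struct('a', 1, 'b', CAST(NULL AS INT))) AS arr")
df.selectExpr("arr.b").printSchema()
```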

### Why are the changes needed?

Fix a correctness issue.

### Does this PR introduce _any_ user-facing change?

Yes. See the added test.

### How was this patch tested?

a new UT and end-to-end test

Closes #28992 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-06 20:07:33 -07:00
Max Gekk 42f01e314b [SPARK-32130][SQL][FOLLOWUP] Enable timestamps inference in JsonBenchmark
### What changes were proposed in this pull request?
Set the JSON option `inferTimestamp` to `true` for the cases that measure perf of timestamp inference.

### Why are the changes needed?
The PR https://github.com/apache/spark/pull/28966 disabled timestamp inference by default. As a consequence, some benchmarks don't measure the perf of timestamp inference from JSON fields. This PR explicitly enables such inference.
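
For reference, the option in question is a JSON datasource option (the path below is a placeholder):

```scala
// With inferTimestamp enabled, string fields that parse as timestamps are inferred as
// TimestampType during schema inference, which is what the benchmark now measures.
val df = spark.read
  .option("inferTimestamp", "true")
  .json("/tmp/events.json") // placeholder path
df.printSchema()
```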

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By re-generating results of `JsonBenchmark`.

Closes #28981 from MaxGekk/json-inferTimestamps-disable-by-default-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-02 13:26:57 -07:00