Commit graph

3172 commits

Marek Novotny 2c9d8f56c7 [SPARK-25469][SQL] Eval methods of Concat, Reverse and ElementAt should use pattern matching only once
## What changes were proposed in this pull request?

The PR proposes to avoid repeating the pattern matching on every call of the ```eval``` method within the following expressions (see the sketch after this list):
- ```Concat```
- ```Reverse```
- ```ElementAt```
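
As an illustration only, here is a minimal sketch of the idea under hypothetical names (not Spark's actual classes): resolve the type-specific evaluation function once, in a lazy val, so that `eval` no longer pattern-matches per row.

```scala
// Hedged sketch with illustrative names, not Spark's actual API: the match
// on the input type runs once, when the lazy val is first forced, instead
// of on every eval() call.
sealed trait DataType
case object StringType extends DataType
case object ArrayType extends DataType

class Reverse(childDataType: DataType) {
  @transient private lazy val doReverse: Any => Any = childDataType match {
    case StringType => s => s.asInstanceOf[String].reverse
    case ArrayType  => a => a.asInstanceOf[Seq[Any]].reverse
  }

  // Hot path: a plain function call, no per-row pattern match.
  def eval(input: Any): Any = doReverse(input)
}
```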

## How was this patch tested?

Ran the existing tests for the ```Concat```, ```Reverse``` and ```ElementAt``` expression classes.

Closes #22471 from mn-mikke/SPARK-25470.

Authored-by: Marek Novotny <mn.mikke@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2018-09-21 18:16:54 +09:00
Reynold Xin 411ecc365e [SPARK-23549][SQL] Rename config spark.sql.legacy.compareDateTimestampInTimestamp
## What changes were proposed in this pull request?
See title. Makes our legacy backward compatibility configs more consistent.

## How was this patch tested?
Make sure all references have been updated:
```
> git grep compareDateTimestampInTimestamp
docs/sql-programming-guide.md:  - Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set `false` to `spark.sql.legacy.compareDateTimestampInTimestamp` restores the previous behavior. This option will be removed in Spark 3.0.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:    // if conf.compareDateTimestampInTimestamp is true
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:      => if (conf.compareDateTimestampInTimestamp) Some(TimestampType) else Some(StringType)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:      => if (conf.compareDateTimestampInTimestamp) Some(TimestampType) else Some(StringType)
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:    buildConf("spark.sql.legacy.compareDateTimestampInTimestamp")
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:  def compareDateTimestampInTimestamp : Boolean = getConf(COMPARE_DATE_TIMESTAMP_IN_TIMESTAMP)
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala:        "spark.sql.legacy.compareDateTimestampInTimestamp" -> convertToTS.toString) {
```

Closes #22508 from rxin/SPARK-23549.

Authored-by: Reynold Xin <rxin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-09-21 14:27:14 +08:00
Reynold Xin fb3276a54a [SPARK-25384][SQL] Clarify fromJsonForceNullableSchema will be removed in Spark 3.0
See above. This should go into the 2.4 release.

Closes #22509 from rxin/SPARK-25384.

Authored-by: Reynold Xin <rxin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-09-21 14:17:34 +08:00
gatorsmile 5d25e15440 Revert "[SPARK-23715][SQL] the input of to/from_utc_timestamp can not have timezone
## What changes were proposed in this pull request?

This reverts commit 417ad92502.

We decided to keep the current behaviors unchanged and will consider whether to deprecate these functions in 3.0. For more details, see the discussion in https://issues.apache.org/jira/browse/SPARK-23715

## How was this patch tested?

The existing tests.

Closes #22505 from gatorsmile/revertSpark-23715.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-09-21 10:39:45 +08:00
maryannxue 88446b6ad1 [SPARK-25450][SQL] PushProjectThroughUnion rule uses the same exprId for project expressions in each Union child, causing mistakes in constant propagation
## What changes were proposed in this pull request?

The problem was caused by the PushProjectThroughUnion rule which, when creating a new Project for each child of Union, uses the same exprId for expressions at the same position. This is wrong because the expressions in each child of Union are all independent, and it can lead to a wrong result if other rules like FoldablePropagation kick in, treating two different expressions as the same.

The fix is to create new expressions in the new Project for each child of Union, as the sketch below illustrates.
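
A minimal sketch of that idea, using simplified stand-ins for Catalyst's `Alias`/`ExprId` (shapes are illustrative, not the rule's actual code):

```scala
// Each Union child must get its own freshly numbered aliases; reusing one
// shared list leaks the same exprIds across independent children.
final case class ExprId(id: Long)
object ExprId {
  private var counter = 0L
  def fresh(): ExprId = { counter += 1; ExprId(counter) }
}

final case class Alias(name: String, exprId: ExprId)

// Buggy shape: one alias list built once and reused for every child.
def buggyProjects(children: Seq[String], names: Seq[String]): Seq[Seq[Alias]] = {
  val shared = names.map(n => Alias(n, ExprId.fresh()))
  children.map(_ => shared) // same exprIds in every child
}

// Fixed shape: mint new exprIds for each child.
def fixedProjects(children: Seq[String], names: Seq[String]): Seq[Seq[Alias]] =
  children.map(_ => names.map(n => Alias(n, ExprId.fresh())))
```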

## How was this patch tested?

Added UT.

Closes #22447 from maryannxue/push-project-thru-union-bug.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2018-09-20 10:00:28 -07:00
Dilip Biswal 67f2c6a554 [SPARK-25417][SQL] ArrayContains function may return incorrect result when right expression is implicitly down casted
## What changes were proposed in this pull request?
In ArrayContains, we currently cast the right-hand side expression to match the element type of the left-hand side array. This may result in down-casting and return a wrong or questionable result.

Example :
```SQL
spark-sql> select array_contains(array(1), 1.34);
true
```
```SQL
spark-sql> select array_contains(array(1), 'foo');
null
```

We should safely coerce both the left- and right-hand side expressions.
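
A hedged sketch of the safe coercion idea (simplified; not Spark's actual TypeCoercion code): widen both sides to a common type, and reject comparisons that have no sensible common type at analysis time.

```scala
// Illustrative only: choose a wider common type rather than down-casting
// the right-hand side to the array's element type.
sealed trait DT
case object IntT extends DT
case object DoubleT extends DT
case object StringT extends DT

def widerCommonType(a: DT, b: DT): Option[DT] = (a, b) match {
  case (x, y) if x == y                  => Some(x)
  case (IntT, DoubleT) | (DoubleT, IntT) => Some(DoubleT) // widen, never narrow
  case _                                 => None // e.g. Int vs String: raise an analysis error
}
```
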
## How was this patch tested?
Added tests in DataFrameFunctionsSuite

Closes #22408 from dilipbiswal/SPARK-25417.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-09-20 20:33:44 +08:00
Liang-Chi Hsieh 89671a27e7 Revert [SPARK-19355][SPARK-25352]
## What changes were proposed in this pull request?

This reverts a sequence of PRs based on the discussion and comments at https://github.com/apache/spark/pull/16677#issuecomment-422650759:

#22344
#22330
#22239
#16677

## How was this patch tested?

Existing tests.

Closes #22481 from viirya/revert-SPARK-19355-1.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-09-20 20:18:31 +08:00
Reynold Xin 76399d75e2 [SPARK-4502][SQL] Rename to spark.sql.optimizer.nestedSchemaPruning.enabled
## What changes were proposed in this pull request?
This patch adds an "optimizer" prefix to nested schema pruning.

## How was this patch tested?
Should be covered by existing tests.

Closes #22475 from rxin/SPARK-4502.

Authored-by: Reynold Xin <rxin@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2018-09-19 21:23:35 -07:00
Marco Gaido 47d6e80a2e [SPARK-25457][SQL] IntegralDivide returns data type of the operands
## What changes were proposed in this pull request?

The PR proposes to return the data type of the operands as the result type of the `div` operator. Before the PR, `bigint` was always returned. It also introduces a `spark.sql.legacy.integralDivide.returnBigint` config to let users restore the legacy behavior.
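
A hedged usage sketch (assumes a `SparkSession` named `spark`; result types follow the PR description):

```scala
// After this change, `div` keeps the operands' integral type.
spark.sql("SELECT 7 div 2").schema  // integer result; previously always bigint

// Restore the legacy behavior:
spark.conf.set("spark.sql.legacy.integralDivide.returnBigint", "true")
spark.sql("SELECT 7 div 2").schema  // bigint again
```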

## How was this patch tested?

added UTs

Closes #22465 from mgaido91/SPARK-25457.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-09-20 10:23:37 +08:00
Reynold Xin 936c920347 [SPARK-24157][SS][FOLLOWUP] Rename to spark.sql.streaming.noDataMicroBatches.enabled
## What changes were proposed in this pull request?
This patch changes the config option `spark.sql.streaming.noDataMicroBatchesEnabled` to `spark.sql.streaming.noDataMicroBatches.enabled` to be more consistent with rest of the configs. Unfortunately there is one streaming config called `spark.sql.streaming.metricsEnabled`. For that one we should just use a fallback config and change it in a separate patch.

## How was this patch tested?
Made sure no other references to this config are in the code base:
```
> git grep "noDataMicro"
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:    buildConf("spark.sql.streaming.noDataMicroBatches.enabled")
```

Closes #22476 from rxin/SPARK-24157.

Authored-by: Reynold Xin <rxin@databricks.com>
Signed-off-by: Reynold Xin <rxin@databricks.com>
2018-09-19 18:51:20 -07:00
Dongjoon Hyun cb1b55cf77
Revert "[SPARK-23173][SQL] rename spark.sql.fromJsonForceNullableSchema"
This reverts commit 6c7db7fd1c.
2018-09-19 14:33:40 -07:00
Takeshi Yamamuro 12b1e91e6b [SPARK-25358][SQL] MutableProjection supports fallback to an interpreted mode
## What changes were proposed in this pull request?
In SPARK-23711, `UnsafeProjection` gained support for falling back to an interpreted mode. This PR changes the code to support the same fallback mode in `MutableProjection`, based on `CodeGeneratorWithInterpretedFallback`.
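
A hedged sketch of the fallback pattern, heavily simplified relative to Catalyst's `CodeGeneratorWithInterpretedFallback`:

```scala
import scala.util.control.NonFatal

// Try the compiled (code-generated) implementation first; on failure,
// fall back to a slower but always-available interpreted implementation.
abstract class WithInterpretedFallback[IN, OUT] {
  protected def createCodeGeneratedObject(in: IN): OUT
  protected def createInterpretedObject(in: IN): OUT

  def createObject(in: IN): OUT =
    try createCodeGeneratedObject(in)
    catch { case NonFatal(_) => createInterpretedObject(in) }
}
```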

## How was this patch tested?
Added tests in `CodeGeneratorWithInterpretedFallbackSuite`.

Closes #22355 from maropu/SPARK-25358.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-09-19 19:54:49 +08:00
Reynold Xin 4193c7623b [SPARK-24626] Add statistics prefix to parallelFileListingInStatsComputation
## What changes were proposed in this pull request?
To be more consistent with other statistics based configs.

## How was this patch tested?
N/A - straightforward rename of a config option. Used `git grep` to make sure no mentions of it remain.

Closes #22457 from rxin/SPARK-24626.

Authored-by: Reynold Xin <rxin@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2018-09-18 22:41:27 -07:00
Reynold Xin 6c7db7fd1c [SPARK-23173][SQL] rename spark.sql.fromJsonForceNullableSchema
## What changes were proposed in this pull request?
`spark.sql.fromJsonForceNullableSchema` -> `spark.sql.function.fromJson.forceNullable`

## How was this patch tested?
Made sure there are no more references to `spark.sql.fromJsonForceNullableSchema`.

Closes #22459 from rxin/SPARK-23173.

Authored-by: Reynold Xin <rxin@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2018-09-18 22:39:29 -07:00
James Thompson ba838fee00 [SPARK-24151][SQL] Case insensitive resolution of CURRENT_DATE and CURRENT_TIMESTAMP
## What changes were proposed in this pull request?

SPARK-22333 introduced a regression in the resolution of `CURRENT_DATE` and `CURRENT_TIMESTAMP`. Before that ticket, these two functions were resolved in a case-insensitive way. After it, their resolution depends on the value of `spark.sql.caseSensitive`.

The PR restores the previous behavior and makes their resolution case insensitive regardless of that setting. The PR takes over #21217; therefore it closes #21217, and credit for this patch should be given to jamesthomp.

## How was this patch tested?

added UT

Closes #22440 from mgaido91/SPARK-24151.

Lead-authored-by: James Thompson <jamesthomp@users.noreply.github.com>
Co-authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2018-09-17 23:19:04 -07:00
Kazuaki Ishizaki acc6452579 [SPARK-25444][SQL] Refactor GenArrayData.genCodeToCreateArrayData method
## What changes were proposed in this pull request?

This PR simplifies the `GenArrayData.genCodeToCreateArrayData` method by using the `ArrayData.createArrayData` method.

Before this PR, the `genCodeToCreateArrayData` method was complicated:
* It generated a temporary Java array to create `ArrayData`
* It had separate code generation paths to assign values for `GenericArrayData` and `UnsafeArrayData`

After this PR, the method:
* Directly generates `GenericArrayData` or `UnsafeArrayData` without a temporary array
* Has only one code generation path to assign values

## How was this patch tested?

Existing UTs

Closes #22439 from kiszk/SPARK-25444.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2018-09-18 12:44:54 +09:00
Marco Gaido 553af22f2c
[SPARK-16323][SQL] Add IntegralDivide expression
## What changes were proposed in this pull request?

The PR takes over #14036 and introduces a new expression `IntegralDivide` in order to avoid the several unneeded casts added previously.

In order to prove the performance gain, the following benchmark has been run:

```
  test("Benchmark IntegralDivide") {
    val r = new scala.util.Random(91)
    val nData = 1000000
    val testDataInt = (1 to nData).map(_ => (r.nextInt(), r.nextInt()))
    val testDataLong = (1 to nData).map(_ => (r.nextLong(), r.nextLong()))
    val testDataShort = (1 to nData).map(_ => (r.nextInt().toShort, r.nextInt().toShort))

    // old code
    val oldExprsInt = testDataInt.map(x =>
      Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType))
    val oldExprsLong = testDataLong.map(x =>
      Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType))
    val oldExprsShort = testDataShort.map(x =>
      Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType))

    // new code
    val newExprsInt = testDataInt.map(x => IntegralDivide(x._1, x._2))
    val newExprsLong = testDataLong.map(x => IntegralDivide(x._1, x._2))
    val newExprsShort = testDataShort.map(x => IntegralDivide(x._1, x._2))

    Seq(("Long", "old", oldExprsLong),
      ("Long", "new", newExprsLong),
      ("Int", "old", oldExprsInt),
      ("Int", "new", newExprsShort),
      ("Short", "old", oldExprsShort),
      ("Short", "new", oldExprsShort)).foreach { case (dt, t, ds) =>
      val start = System.nanoTime()
      ds.foreach(e => e.eval(EmptyRow))
      val endNoCodegen = System.nanoTime()
      println(s"Running $nData op with $t code on $dt (no-codegen): ${(endNoCodegen - start) / 1000000} ms")
    }
  }
```

The results on my laptop are:

```
Running 1000000 op with old code on Long (no-codegen): 600 ms
Running 1000000 op with new code on Long (no-codegen): 112 ms
Running 1000000 op with old code on Int (no-codegen): 560 ms
Running 1000000 op with new code on Int (no-codegen): 135 ms
Running 1000000 op with old code on Short (no-codegen): 317 ms
Running 1000000 op with new code on Short (no-codegen): 153 ms
```

The results show a 2-5X improvement. The benchmark doesn't include code generation, as it is hard to measure performance there: for such simple operations, most of the time is spent in the code generation/compilation process.

## How was this patch tested?

added UTs

Closes #22395 from mgaido91/SPARK-16323.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2018-09-17 11:33:50 -07:00
Takuya UESHIN 8cf6fd1c23 [SPARK-25431][SQL][EXAMPLES] Fix function examples and the example results.
## What changes were proposed in this pull request?

There are some mistakes in the examples of newly added functions. Also, the format of the example results is not unified. We should fix them.

## How was this patch tested?

Manually executed the examples.

Closes #22437 from ueshin/issues/SPARK-25431/fix_examples_2.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-09-17 20:40:42 +08:00
gatorsmile bb2f069cf2 [SPARK-25436] Bump master branch version to 2.5.0-SNAPSHOT
## What changes were proposed in this pull request?
On the dev list, we can still discuss whether the next version is 2.5.0 or 3.0.0. Let us first bump the master branch version to `2.5.0-SNAPSHOT`.

## How was this patch tested?
N/A

Closes #22426 from gatorsmile/bumpVersionMaster.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2018-09-15 16:24:02 -07:00
Takeshi Yamamuro 5ebef33c85 [SPARK-25426][SQL] Remove the duplicate fallback logic in UnsafeProjection
## What changes were proposed in this pull request?
This PR removes the duplicate fallback logic in `UnsafeProjection`.

This PR comes from #22355.

## How was this patch tested?
Added tests in `CodeGeneratorWithInterpretedFallbackSuite`.

Closes #22417 from maropu/SPARK-25426.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2018-09-15 16:20:45 -07:00
Takuya UESHIN be454a7cef Revert "[SPARK-25431][SQL][EXAMPLES] Fix function examples and unify the format of the example results."
This reverts commit 9c25d7f735.
2018-09-15 12:50:46 +09:00
Takuya UESHIN 9c25d7f735 [SPARK-25431][SQL][EXAMPLES] Fix function examples and unify the format of the example results.
## What changes were proposed in this pull request?

There are some mistakes in the examples of newly added functions. Also, the format of the example results is not unified. We should fix and unify them.

## How was this patch tested?

Manually executed the examples.

Closes #22421 from ueshin/issues/SPARK-25431/fix_examples.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2018-09-14 09:25:27 -07:00
maryannxue 8b702e1e0a [SPARK-25415][SQL] Make plan change log in RuleExecutor configurable by SQLConf
## What changes were proposed in this pull request?

In RuleExecutor, after applying a rule, if the plan has changed, the before and after plans are logged at the `trace` level. At times, however, such information can be very helpful for debugging. Hence, making the log level configurable in SQLConf allows users to turn on the plan change log independently and saves the trouble of tweaking log4j settings. Meanwhile, filtering the plan change log for specific rules can also be very useful.
So this PR adds two SQL configurations (see the usage sketch after the list):
1. spark.sql.optimizer.planChangeLog.level - set a specific log level for logging plan changes after a rule is applied.
2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only for a set of specified rules, separated by commas.
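
A hedged usage sketch (assumes a `SparkSession` named `spark`; the rule-name format expected by the second config is an assumption here):

```scala
// Log plan changes at WARN, and only for one rule of interest.
spark.conf.set("spark.sql.optimizer.planChangeLog.level", "WARN")
spark.conf.set("spark.sql.optimizer.planChangeLog.rules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate") // assumed name format
```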

## How was this patch tested?

Added UT.

Closes #22406 from maryannxue/spark-25415.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2018-09-12 21:56:09 -07:00
gatorsmile 79cc59718f [SPARK-25402][SQL] Null handling in BooleanSimplification
## What changes were proposed in this pull request?
This PR fixes the null handling in BooleanSimplification. In the rule BooleanSimplification, there are two cases that do not properly handle null values; the optimization is incorrect if either side is null. See the illustration below.
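
To see why null makes such rewrites unsound, here is a hedged illustration of SQL's three-valued logic modeled with `Option[Boolean]`; the rewrite shown, `a AND (NOT a OR b)` to `a AND b`, is one plausible instance of the problem, not necessarily the exact case the PR touches:

```scala
// Kleene (SQL) three-valued logic: None stands for NULL.
def and(l: Option[Boolean], r: Option[Boolean]): Option[Boolean] = (l, r) match {
  case (Some(false), _) | (_, Some(false)) => Some(false)
  case (Some(true), Some(true))            => Some(true)
  case _                                   => None
}
def or(l: Option[Boolean], r: Option[Boolean]): Option[Boolean] = (l, r) match {
  case (Some(true), _) | (_, Some(true)) => Some(true)
  case (Some(false), Some(false))        => Some(false)
  case _                                 => None
}
def not(v: Option[Boolean]): Option[Boolean] = v.map(!_)

// With a = NULL and b = FALSE:
val a: Option[Boolean] = None
val b: Option[Boolean] = Some(false)
assert(and(a, or(not(a), b)).isEmpty)  // original expression: NULL
assert(and(a, b) == Some(false))       // "simplified" expression: FALSE, which differs
```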

## How was this patch tested?
Added test cases

Closes #22390 from gatorsmile/fixBooleanSimplification.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-09-12 21:11:22 +08:00
Sean Owen cfbdd6a1f5 [SPARK-25398] Minor bugs from comparing unrelated types
## What changes were proposed in this pull request?

Correct some comparisons between unrelated types to what they seem to have been trying to do.

## How was this patch tested?

Existing tests.

Closes #22384 from srowen/SPARK-25398.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-09-11 14:46:03 -05:00
Marco Gaido 0736e72a66 [SPARK-25371][SQL] struct() should allow being called with 0 args
## What changes were proposed in this pull request?

SPARK-21281 introduced a check that the inputs of `CreateStructLike` be non-empty. This means that `struct()`, which was previously considered valid, now throws an exception. This behavior change was introduced in 2.3.0. The change may break users' applications on upgrade, and it causes `VectorAssembler` to fail when an empty `inputCols` is defined.

The PR removes the added check, making `struct()` valid again.
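
A hedged example of the restored behavior (assumes a `SparkSession` named `spark`):

```scala
import org.apache.spark.sql.functions.struct

// Threw an exception in 2.3.x; valid again with this fix: an empty struct column.
spark.range(1).select(struct()).printSchema()
```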

## How was this patch tested?

added UT

Closes #22373 from mgaido91/SPARK-25371.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-09-11 14:16:56 +08:00
Marco Gaido 12e3e9f17d [SPARK-25278][SQL] Avoid duplicated Exec nodes when the same logical plan appears in the query
## What changes were proposed in this pull request?

In the Planner, we collect the placeholders that need to be substituted in the query execution plan and, once we plan them, we substitute each placeholder with the effective plan.

In this second phase, we rely on the `==` comparison, i.e. the `equals` method. This means that if two placeholder plans - which are different instances - have the same attributes (so that they are equal according to the `equals` method), they are both substituted with their corresponding new physical plans. So, in such a situation, the first time we substitute both of them with the first of the two newly generated plans, and the second time we substitute nothing.

This is usually of no harm for the execution of the query itself, as the two plans are identical. But since they are now the same instance, the local variables are shared, which is unexpected. This causes issues for the metrics collected: the same node is executed twice, so the metrics are accumulated twice, wrongly.

The PR proposes to use the `eq` method (reference equality) when checking which placeholder needs to be substituted; thus, in the previous situation, both of the two different physical nodes that are created (one for each time the logical plan appears in the query plan) are actually used, and the metrics are collected properly for each of them.
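
The distinction the fix relies on, in a hedged nutshell:

```scala
// `==` (equals) is structural: two placeholder plans with the same attributes
// compare equal even as distinct instances. `eq` is reference identity and
// distinguishes the two occurrences.
case class Placeholder(attrs: Seq[String])

val p1 = Placeholder(Seq("a"))
val p2 = Placeholder(Seq("a"))
assert(p1 == p2)    // structurally equal: before the fix, both got substituted at once
assert(!(p1 eq p2)) // distinct instances: `eq` matches only the intended one
```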

## How was this patch tested?

added UT

Closes #22284 from mgaido91/SPARK-25278.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-09-10 19:41:51 +08:00
gatorsmile 6f6517837b [SPARK-24849][SPARK-24911][SQL][FOLLOW-UP] Converting a value of StructType to a DDL string
## What changes were proposed in this pull request?
Add the version number for the new APIs.

## How was this patch tested?
N/A

Closes #22377 from gatorsmile/followup24849.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-09-10 19:18:00 +08:00
Yuming Wang 77c996403d [SPARK-25368][SQL] Incorrect predicate pushdown returns wrong result
## What changes were proposed in this pull request?
How to reproduce:
```scala
val df1 = spark.createDataFrame(Seq(
   (1, 1)
)).toDF("a", "b").withColumn("c", lit(null).cast("int"))
val df2 = df1.union(df1).withColumn("d", spark_partition_id).filter($"c".isNotNull)
df2.show

+---+---+----+---+
|  a|  b|   c|  d|
+---+---+----+---+
|  1|  1|null|  0|
|  1|  1|null|  1|
+---+---+----+---+
```
`filter($"c".isNotNull)` was transformed to `(null <=> c#10)` before https://github.com/apache/spark/pull/19201, but it has been transformed to `(c#10 = null)` since https://github.com/apache/spark/pull/20155. This PR reverts it to `(null <=> c#10)` to fix this issue.

## How was this patch tested?

unit tests

Closes #22368 from wangyum/SPARK-25368.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2018-09-09 09:07:31 -07:00
gatorsmile 0b9ccd55c2 Revert [SPARK-10399] [SPARK-23879] [SPARK-23762] [SPARK-25317]
## What changes were proposed in this pull request?

When running TPC-DS benchmarks on the 2.4 release, npoggi and winglungngai saw a more than 10% performance regression on the following queries: q67, q24a and q24b. After applying the PR https://github.com/apache/spark/pull/22338, the performance regression still exists. If we revert the changes in https://github.com/apache/spark/pull/19222, npoggi and winglungngai found that the performance regression was resolved. Thus, this PR reverts the related changes to unblock the 2.4 release.

In a future release, we can continue the investigation and find out the root cause of the regression.

## How was this patch tested?

The existing test cases

Closes #22361 from gatorsmile/revertMemoryBlock.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-09-09 21:25:19 +08:00
ptkool 78981efc2c [SPARK-20636] Add new optimization rule to transpose adjacent Window expressions.
## What changes were proposed in this pull request?

Add a new optimization rule to eliminate unnecessary shuffling by flipping adjacent Window expressions.
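
A hedged sketch of the kind of query the rule targets (assumes a `SparkSession` and a DataFrame `df` with columns `a`, `b`, `x`, `y`; the exact transposition condition lives in the rule itself):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Two adjacent Window operators over compatible partitionings. Evaluating
// them in the right order can let one reuse the other's partitioning and
// avoid an extra shuffle between them.
val byAB = Window.partitionBy("a", "b")
val byA  = Window.partitionBy("a")
val result = df
  .withColumn("sumXByAB", sum("x").over(byAB))
  .withColumn("sumYByA", sum("y").over(byA))
```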

## How was this patch tested?

Tested with unit tests, integration tests, and manual tests.

Closes #17899 from ptkool/adjacent_window_optimization.

Authored-by: ptkool <michael.styles@shopify.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2018-09-08 11:36:55 -07:00
hyukjinkwon 01c3dfab15 [MINOR][SQL] Add a debug log when a SQL text is used for a view
## What changes were proposed in this pull request?

This took me a while to debug and figure out. It looks like we had better at least leave a debug log noting that the SQL text for a view will be used.

Here's how I got there:

**Hive:**

```
CREATE TABLE emp AS SELECT 'user' AS name, 'address' as address;
CREATE DATABASE d100;
CREATE FUNCTION d100.udf100 AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper';
CREATE VIEW testview AS SELECT d100.udf100(name) FROM default.emp;
```

**Spark:**

```
sql("SELECT * FROM testview").show()
```

```
scala> sql("SELECT * FROM testview").show()
org.apache.spark.sql.AnalysisException: Undefined function: 'd100.udf100'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
```

Under the hood, it actually makes sense, since the view is defined as `SELECT d100.udf100(name) FROM default.emp;` and the Hive API:

```
org.apache.hadoop.hive.ql.metadata.Table.getViewExpandedText()
```

This returns a wrongly qualified SQL string for the view as below:

```
SELECT `d100.udf100`(`emp`.`name`) FROM `default`.`emp`
```

which works fine in Hive but not in Spark.

## How was this patch tested?

Manually:

```
18/09/06 19:32:48 DEBUG HiveSessionCatalog: 'SELECT `d100.udf100`(`emp`.`name`) FROM `default`.`emp`' will be used for the view(testview).
```

Closes #22351 from HyukjinKwon/minor-debug.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-09-08 12:55:44 +08:00
Xiao Li f96a8bf8ff [SPARK-12321][SQL][FOLLOW-UP] Add tests for fromString
## What changes were proposed in this pull request?
Add test cases for fromString

## How was this patch tested?
N/A

Closes #22345 from gatorsmile/addTest.

Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2018-09-06 23:36:30 -07:00
Takuya UESHIN 1b1711e053 [SPARK-25208][SQL][FOLLOW-UP] Reduce code size.
## What changes were proposed in this pull request?

This is a follow-up pr of #22200.

When casting to a decimal type, if `Cast.canNullSafeCastToDecimal()` returns true, overflow won't happen, so we don't need to check the result of `Decimal.changePrecision()`.

## How was this patch tested?

Existing tests.

Closes #22352 from ueshin/issues/SPARK-25208/reduce_code_size.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-09-07 10:12:20 +08:00
Maxim Gekk d749d034a8 [SPARK-25252][SQL] Support arrays of any types by to_json
## What changes were proposed in this pull request?

In the PR, I propose to extend `to_json` and support any type as the element type of input arrays. It should allow converting arrays of primitive types and arrays of arrays. For example:

```
select to_json(array('1','2','3'))
> ["1","2","3"]
select to_json(array(array(1,2,3),array(4)))
> [[1,2,3],[4]]
```

## How was this patch tested?

Added a couple of SQL tests for arrays of primitive types and arrays of arrays. I also added a round-trip test `from_json` -> `to_json`.

Closes #22226 from MaxGekk/to_json-array.

Authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-09-06 12:35:59 +08:00
Xiangrui Meng 061bb01d9b [SPARK-25248][CORE] Audit barrier Scala APIs for 2.4
## What changes were proposed in this pull request?

I made one pass over the barrier APIs added to Spark 2.4 and updated some scopes and docs. I will update the Python docs once the Scala docs are reviewed.

One major issue is that `BarrierTaskContext` extends `TaskContextImpl`, which exposes some public methods. Internally, there were also several direct references to `TaskContextImpl` methods instead of `TaskContext`. This PR moves some methods from `TaskContextImpl` to `TaskContext`, keeping them package private, and uses delegate methods to avoid inheriting `TaskContextImpl` and exposing unnecessary APIs.

TODOs:
- [x] scala doc
- [x] python doc (#22261 ).

Closes #22240 from mengxr/SPARK-25248.

Authored-by: Xiangrui Meng <meng@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2018-09-04 09:55:53 -07:00
Kazuaki Ishizaki e319ac92e5 [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeArray, ArraySetLike, and ArrayDistinct
## What changes were proposed in this pull request?

This PR integrates the handling of `UnsafeArrayData` and `GenericArrayData` into one path. The current `CodeGenerator.createUnsafeArray` handles only the allocation of `UnsafeArrayData`.
This PR introduces a new method `createArrayData` that returns code to allocate `UnsafeArrayData` or `GenericArrayData` and to assign a value into the allocated array.

This PR also reduces the size of the generated code by calling a runtime helper.

This PR replaces `createUnsafeArray` with `createArrayData`. It also refactors `ArraySetLike` so that it can be used for `ArrayDistinct`, too,
and refactors `ArrayDistinct` to use `ArrayBuilder`.

## How was this patch tested?

Existing tests

Closes #21912 from kiszk/SPARK-24962.

Lead-authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Co-authored-by: Takuya UESHIN <ueshin@happy-camper.st>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-09-04 15:26:34 +08:00
Kazuaki Ishizaki 4cb2ff9d8a [SPARK-25310][SQL] ArraysOverlap may throw a CompilationException
## What changes were proposed in this pull request?

This PR fixes a problem that `ArraysOverlap` function throws a `CompilationException` with non-nullable array type.

The following is the stack trace of the original problem:

```
Code generation of arrays_overlap([1,2,3], [4,5,3]) failed:
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: Expression "isNull_0" is not an rvalue
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: Expression "isNull_0" is not an rvalue
	at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
	at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
	at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
	at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305)
	at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143)
	at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:48)
	at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:32)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1260)
```

## How was this patch tested?

Added test in `CollectionExpressionSuite`.

Closes #22317 from kiszk/SPARK-25310.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2018-09-04 14:00:00 +09:00
Dilip Biswal b60ee3a337 [SPARK-25307][SQL] ArraySort function may return an error in the code generation phase
## What changes were proposed in this pull request?
Sorting an array of booleans (not nullable) returns a compilation error in the code generation phase. Below is the compilation error:
```SQL
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 51, Column 23: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 51, Column 23: No applicable constructor/method found for actual parameters "boolean[]"; candidates are: "public static void java.util.Arrays.sort(long[])", "public static void java.util.Arrays.sort(long[], int, int)", "public static void java.util.Arrays.sort(byte[], int, int)", "public static void java.util.Arrays.sort(float[])", "public static void java.util.Arrays.sort(float[], int, int)", "public static void java.util.Arrays.sort(char[])", "public static void java.util.Arrays.sort(char[], int, int)", "public static void java.util.Arrays.sort(short[], int, int)", "public static void java.util.Arrays.sort(short[])", "public static void java.util.Arrays.sort(byte[])", "public static void java.util.Arrays.sort(java.lang.Object[], int, int, java.util.Comparator)", "public static void java.util.Arrays.sort(java.lang.Object[], java.util.Comparator)", "public static void java.util.Arrays.sort(int[])", "public static void java.util.Arrays.sort(java.lang.Object[], int, int)", "public static void java.util.Arrays.sort(java.lang.Object[])", "public static void java.util.Arrays.sort(double[])", "public static void java.util.Arrays.sort(double[], int, int)", "public static void java.util.Arrays.sort(int[], int, int)"
	at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
	at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
	at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
	at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305)

```

## How was this patch tested?
Added a test in CollectionExpressionSuite.

Closes #22314 from dilipbiswal/SPARK-25307.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2018-09-04 13:39:29 +09:00
Dilip Biswal 8e2169696f [SPARK-25308][SQL] ArrayContains function may return an error in the code generation phase.
## What changes were proposed in this pull request?
Invoking the ArrayContains function with a non-nullable array type throws the following error in the code generation phase. Below is the error snippet.
```SQL
Code generation of array_contains([1,2,3], 1) failed:
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 40, Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 40, Column 11: Expression "isNull_0" is not an rvalue
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 40, Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 40, Column 11: Expression "isNull_0" is not an rvalue
	at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
	at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
	at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
	at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305)

```
## How was this patch tested?
Added a test in CollectionExpressionSuite.

Closes #22315 from dilipbiswal/SPARK-25308.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2018-09-04 13:28:36 +09:00
Kazuaki Ishizaki c5583fdcd2 [SPARK-23466][SQL] Remove redundant null checks in generated Java code by GenerateUnsafeProjection
## What changes were proposed in this pull request?

This PR addresses one of the TODOs in `GenerateUnsafeProjection` ("if the nullability of field is correct, we can use it to save null check") to simplify the generated code.
When `nullable=false` in the `DataType`, `GenerateUnsafeProjection` now omits the null-check code in the generated Java code.

## How was this patch tested?

Added new test cases into `GenerateUnsafeProjectionSuite`

Closes #20637 from kiszk/SPARK-23466.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2018-09-01 12:19:19 +09:00
Kazuaki Ishizaki 9e0f9591af [SPARK-23997][SQL][FOLLOWUP] Update exception message
## What changes were proposed in this pull request?

This PR is a follow-up of #21087, based on [a discussion thread](https://github.com/apache/spark/pull/21087#discussion_r211080067). Since #21087 changed a condition of an `if` statement, the message in an exception is no longer consistent with the current behavior.
This PR updates the exception message.

## How was this patch tested?

Existing UTs

Closes #22269 from kiszk/SPARK-23997-followup.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-08-30 11:21:40 -05:00
Sean Owen 1fd59c129a [WIP][SPARK-25044][SQL] (take 2) Address translation of LMF closure primitive args to Object in Scala 2.12
## What changes were proposed in this pull request?

Alternative take on https://github.com/apache/spark/pull/22063 that does not introduce udfInternal.
Resolves the issue with inferring function types in 2.12 by instead using info captured when the UDF is registered -- capturing which types are nullable (i.e. not primitive).

## How was this patch tested?

Existing tests.

Closes #22259 from srowen/SPARK-25044.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-08-29 15:23:16 +08:00
Marco Gaido 32c8a3d7be [MINOR] Avoid code duplication for nullable in Higher Order function
## What changes were proposed in this pull request?

Most `HigherOrderFunction`s have the same `nullable` definition, i.e. they are nullable when one of their arguments is nullable. The PR refactors this in order to avoid code duplication.
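
A hedged sketch of the shape of the refactoring, with simplified stand-ins for the Catalyst types:

```scala
trait Expression { def nullable: Boolean }

trait HigherOrderFunction extends Expression {
  def arguments: Seq[Expression]

  // Defined once here instead of repeated in every concrete function:
  // nullable when any argument is nullable.
  override def nullable: Boolean = arguments.exists(_.nullable)
}
```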

## How was this patch tested?

NA

Closes #22243 from mgaido91/MINOR_nullable_hof.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-08-29 09:20:32 +08:00
Bogdan Raducanu 103854028e [SPARK-25212][SQL] Support Filter in ConvertToLocalRelation
## What changes were proposed in this pull request?
Support Filter in ConvertToLocalRelation, similar to how Project works.
Additionally, in the Optimizer, run ConvertToLocalRelation earlier to simplify the plan. This is good for very short queries, which often are queries on local relations.
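
A hedged sketch of the idea, greatly simplified from Catalyst's rule (names illustrative):

```scala
// ConvertToLocalRelation folds operators over in-memory rows eagerly. With
// this change, a deterministic Filter over a LocalRelation is evaluated at
// optimization time, just as Project already was.
final case class LocalRelation(rows: Seq[Map[String, Any]])

def convertFilter(rel: LocalRelation, predicate: Map[String, Any] => Boolean): LocalRelation =
  LocalRelation(rel.rows.filter(predicate)) // fold the Filter into the local data
```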

## How was this patch tested?
New test. Manual benchmark.

Author: Bogdan Raducanu <bogdan@databricks.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Author: Yinan Li <ynli@google.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: s71955 <sujithchacko.2010@gmail.com>
Author: DB Tsai <d_tsai@apple.com>
Author: jaroslav chládek <mastermism@gmail.com>
Author: Huangweizhe <huangweizhe@bbdservice.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: hyukjinkwon <gurwls223@apache.org>
Author: Kent Yao <yaooqinn@hotmail.com>
Author: caoxuewen <cao.xuewen@zte.com.cn>
Author: liuxian <liu.xian3@zte.com.cn>
Author: Adam Bradbury <abradbury@users.noreply.github.com>
Author: Jose Torres <torres.joseph.f+github@gmail.com>
Author: Yuming Wang <yumwang@ebay.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #22205 from bogdanrdc/local-relation-filter.
2018-08-28 15:50:25 -07:00
Fernando Pereira de46df549a [SPARK-23997][SQL] Configurable maximum number of buckets
## What changes were proposed in this pull request?
This PR makes it possible for the user to override the maximum number of buckets when saving to a table.
Currently the limit is a hard-coded 100k, which might be insufficient for large workloads.
A new configuration entry is proposed: `spark.sql.bucketing.maxBuckets`, which defaults to the previous 100k.

## How was this patch tested?
Added unit tests in the following spark.sql test suites:

- CreateTableAsSelectSuite
- BucketedWriteSuite

Author: Fernando Pereira <fernando.pereira@epfl.ch>

Closes #21087 from ferdonline/enh/configurable_bucket_limit.
2018-08-28 10:31:47 -07:00
caoxuewen 6193a202aa [SPARK-24978][SQL] Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation.
## What changes were proposed in this pull request?

This PR adds a configuration parameter to configure the capacity of fast aggregation.
Performance comparison:

```
 Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
 Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
 Aggregate w multiple keys:               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
 fasthash = default                            5612 / 5882          3.7         267.6       1.0X
 fasthash = config                             3586 / 3595          5.8         171.0       1.6X

```

## How was this patch tested?
The existing test cases.

Closes #21931 from heary-cao/FastHashCapacity.

Authored-by: caoxuewen <cao.xuewen@zte.com.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-08-27 15:45:48 +08:00
Sean Owen 9b6baeb7b9 [SPARK-25029][BUILD][CORE] Janino "Two non-abstract methods ..." errors
## What changes were proposed in this pull request?

Update to janino 3.0.9 to address Java 8 + Scala 2.12 incompatibility. The error manifests as test failures like this in `ExpressionEncoderSuite`:

```
- encode/decode for seq of string: List(abc, xyz) *** FAILED ***
java.lang.RuntimeException: Error while encoding: org.codehaus.janino.InternalCompilerException: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Two non-abstract methods "public int scala.collection.TraversableOnce.size()" have the same parameter types, declaring type and return type
```

It comes up almost immediately in any generated code that references Scala collections, and virtually always concerns the `size()` method.

## How was this patch tested?

Existing tests

Closes #22203 from srowen/SPARK-25029.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2018-08-23 21:36:53 -07:00
Michael Allman f2d35427ee [SPARK-4502][SQL] Parquet nested column pruning - foundation
(Link to Jira: https://issues.apache.org/jira/browse/SPARK-4502)

_N.B. This is a restart of PR #16578 which includes a subset of that code. Relevant review comments from that PR should be considered incorporated by reference. Please avoid duplication in review by reviewing that PR first. The summary below is an edited copy of the summary of the previous PR._

## What changes were proposed in this pull request?

One of the hallmarks of a column-oriented data storage format is the ability to read data from a subset of columns, efficiently skipping reads from other columns. Spark has long had support for pruning unneeded top-level schema fields from the scan of a parquet file. For example, consider a table, `contacts`, backed by parquet with the following Spark SQL schema:

```
root
 |-- name: struct
 |    |-- first: string
 |    |-- last: string
 |-- address: string
```

Parquet stores this table's data in three physical columns: `name.first`, `name.last` and `address`. To answer the query

```SQL
select address from contacts
```

Spark will read only from the `address` column of parquet data. However, to answer the query

```SQL
select name.first from contacts
```

Spark will read `name.first` and `name.last` from parquet.

This PR modifies Spark SQL to support a finer-grain of schema pruning. With this patch, Spark reads only the `name.first` column to answer the previous query.

### Implementation

There are two main components of this patch. First, there is a `ParquetSchemaPruning` optimizer rule for gathering the required schema fields of a `PhysicalOperation` over a parquet file, constructing a new schema based on those required fields and rewriting the plan in terms of that pruned schema. The pruned schema fields are pushed down to the parquet requested read schema. `ParquetSchemaPruning` uses a new `ProjectionOverSchema` extractor for rewriting a catalyst expression in terms of a pruned schema.

Second, the `ParquetRowConverter` has been patched to ensure the ordinals of the parquet columns read are correct for the pruned schema. `ParquetReadSupport` has been patched to address a compatibility mismatch between Spark's built in vectorized reader and the parquet-mr library's reader.

### Limitation

Among the complex Spark SQL data types, this patch supports parquet column pruning of nested sequences of struct fields only.

## How was this patch tested?

Care has been taken to ensure correctness and prevent regressions. A more advanced version of this patch incorporating optimizations for rewriting queries involving aggregations and joins has been running on a production Spark cluster at VideoAmp for several years. In that time, one bug was found and fixed early on, and we added a regression test for that bug.

We forward-ported this patch to Spark master in June 2016 and have been running this patch against Spark 2.x branches on ad-hoc clusters since then.

Closes #21320 from mallman/spark-4502-parquet_column_pruning-foundation.

Lead-authored-by: Michael Allman <msa@allman.ms>
Co-authored-by: Adam Jacques <adam@technowizardry.net>
Co-authored-by: Michael Allman <michael@videoamp.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2018-08-23 21:31:10 -07:00
Takuya UESHIN a9aacdf1c2 [SPARK-25208][SQL] Loosen Cast.forceNullable for DecimalType.
## What changes were proposed in this pull request?

Casting to `DecimalType` does not always need to force nullability.
If the target decimal type is wider than the original type, or the cast involves only truncation or precision loss, the cast value won't be `null`.
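
A hedged sketch of the widening condition (the exact rule in `Cast` may differ):

```scala
// A decimal(toPrecision, toScale) can hold every decimal(fromPrecision,
// fromScale) value when neither the integral digits nor the scale shrink;
// such casts can never produce null from overflow.
def isWideningDecimalCast(fromPrecision: Int, fromScale: Int,
                          toPrecision: Int, toScale: Int): Boolean =
  toScale >= fromScale &&
    (toPrecision - toScale) >= (fromPrecision - fromScale)
```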

## How was this patch tested?

Added and modified tests.

Closes #22200 from ueshin/issues/SPARK-25208/cast_nullable_decimal.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-08-23 22:48:26 +08:00