ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Xiayun Sun	b304e07e06	[SPARK-23462][SQL] improve missing field error message in `StructType` ## What changes were proposed in this pull request? The error message ```s"""Field "$name" does not exist."""``` is thrown when looking up an unknown field in StructType. In the error message, we should also contain the information about which columns/fields exist in this struct. ## How was this patch tested? Added new unit tests. Note: I created a new `StructTypeSuite.scala` as I couldn't find an existing suite that's suitable to place these tests. I may be missing something so feel free to propose new locations. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Xiayun Sun <xiayunsun@gmail.com> Closes #20649 from xysun/SPARK-23462.	2018-03-12 22:13:28 +09:00
Michał Świtakowski	2ca9bb083c	[SPARK-23173][SQL] Avoid creating corrupt parquet files when loading data from JSON ## What changes were proposed in this pull request? The from_json() function accepts an additional parameter, where the user might specify the schema. The issue is that the specified schema might not be compatible with the data. In particular, the JSON data might be missing data for fields declared as non-nullable in the schema. The from_json() function does not verify the data against such errors. When data with missing fields is sent to the parquet encoder, there is no verification either. The end result is a corrupt parquet file. To avoid corruption, make sure that all fields in the user-specified schema are set to be nullable. Since this changes the behavior of a public function, we need to include it in release notes. The behavior can be reverted by setting `spark.sql.fromJsonForceNullableSchema=false` ## How was this patch tested? Added two new tests. Author: Michał Świtakowski <michal.switakowski@databricks.com> Closes #20694 from mswit-databricks/SPARK-23173.	2018-03-09 14:29:31 -08:00
Marco Gaido	e7bbca8896	[SPARK-23602][SQL] PrintToStderr prints value also in interpreted mode ## What changes were proposed in this pull request? `PrintToStderr` was doing what it is supposed to do only when code generation is enabled. The PR adds the same behavior in interpreted mode too. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #20773 from mgaido91/SPARK-23602.	2018-03-08 22:02:28 +01:00
Marco Gaido	ea480990e7	[SPARK-23628][SQL] calculateParamLength should not return 1 + num of expressions ## What changes were proposed in this pull request? There was a bug in `calculateParamLength` which caused it to always return 1 + the number of expressions. This could lead to exceptions, especially with expressions of type long. ## How was this patch tested? added UT + fixed previous UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #20772 from mgaido91/SPARK-23628.	2018-03-08 11:09:15 -08:00
Marco Gaido	92e7ecbbbd	[SPARK-23592][SQL] Add interpreted execution to DecodeUsingSerializer ## What changes were proposed in this pull request? The PR adds interpreted execution to DecodeUsingSerializer. ## How was this patch tested? added UT Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Marco Gaido <marcogaido91@gmail.com> Closes #20760 from mgaido91/SPARK-23592.	2018-03-08 14:18:14 +01:00
Marco Gaido	aff7d81cb7	[SPARK-23591][SQL] Add interpreted execution to EncodeUsingSerializer ## What changes were proposed in this pull request? The PR adds interpreted execution to EncodeUsingSerializer. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #20751 from mgaido91/SPARK-23591.	2018-03-07 18:31:59 +01:00
Takeshi Yamamuro	33c2cb22b3	[SPARK-23611][SQL] Add a helper function to check exception for expr evaluation ## What changes were proposed in this pull request? This PR adds a helper function in `ExpressionEvalHelper` to check exceptions in all the paths of expression evaluation. ## How was this patch tested? Modified the existing tests. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20748 from maropu/SPARK-23611.	2018-03-07 13:10:51 +01:00
Marco Gaido	4c587eb488	[SPARK-23590][SQL] Add interpreted execution to CreateExternalRow ## What changes were proposed in this pull request? The PR adds interpreted execution to CreateExternalRow ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #20749 from mgaido91/SPARK-23590.	2018-03-06 17:42:17 +01:00
Takeshi Yamamuro	e8a259d66d	[SPARK-23594][SQL] GetExternalRowField should support interpreted execution ## What changes were proposed in this pull request? This pr added interpreted execution for `GetExternalRowField`. ## How was this patch tested? Added tests in `ObjectExpressionsSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20746 from maropu/SPARK-23594.	2018-03-06 13:55:13 +01:00
Marco Gaido	f6b49f9d1b	[SPARK-23586][SQL] Add interpreted execution to WrapOption ## What changes were proposed in this pull request? The PR adds interpreted execution to WrapOption. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #20741 from mgaido91/SPARK-23586_2.	2018-03-06 01:37:51 +01:00
Marco Gaido	ba622f45ca	[SPARK-23585][SQL] Add interpreted execution to UnwrapOption ## What changes were proposed in this pull request? The PR adds interpreted execution to UnwrapOption. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #20736 from mgaido91/SPARK-23586.	2018-03-05 20:43:03 +01:00
Kazuaki Ishizaki	2ce37b50fc	[SPARK-23546][SQL] Refactor stateless methods/values in CodegenContext ## What changes were proposed in this pull request? The current `CodegenContext` class also contains immutable values and methods that do not depend on mutable state. This refactoring moves them to the `CodeGenerator` object, which can be accessed from anywhere in the program without an instantiated `CodegenContext`. ## How was this patch tested? Existing tests Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #20700 from kiszk/SPARK-23546.	2018-03-05 11:39:01 +01:00
KaiXinXiaoLei	cdcccd7b41	[SPARK-23405] Generate additional constraints for Join's children ## What changes were proposed in this pull request? I ran the query `select ls.cs_order_number from ls left semi join catalog_sales cs on ls.cs_order_number = cs.cs_order_number`. The `ls` table is small, with a single row. The `catalog_sales` table is large, with 10 billion rows. The task hangs. I found that the `cs_order_number` column of the `catalog_sales` table contains many null values. I think these null values should be filtered out in the logical plan. Before this patch, the plan is: >== Optimized Logical Plan == >Join LeftSemi, (cs_order_number#1 = cs_order_number#22) >:- Project cs_order_number#1 > : +- Filter isnotnull(cs_order_number#1) > : +- MetastoreRelation 100t, ls >+- Project cs_order_number#22 > +- MetastoreRelation 100t, catalog_sales With this patch, the plan becomes: >== Optimized Logical Plan == >Join LeftSemi, (cs_order_number#1 = cs_order_number#22) >:- Project cs_order_number#1 > : +- Filter isnotnull(cs_order_number#1) > : +- MetastoreRelation 100t, ls >+- Project cs_order_number#22 > : +- Filter isnotnull(cs_order_number#22) > :+- MetastoreRelation 100t, catalog_sales ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Author: KaiXinXiaoLei <584620569@qq.com> Author: hanghang <584620569@qq.com> Closes #20670 from KaiXinXiaoLei/Spark-23405.	2018-03-02 00:09:44 +08:00
Juliusz Sompolski	8077bb04f3	[SPARK-23445] ColumnStat refactoring ## What changes were proposed in this pull request? Refactor ColumnStat to be more flexible. * Split `ColumnStat` and `CatalogColumnStat` just like `CatalogStatistics` is split from `Statistics`. This detaches how the statistics are stored from how they are processed in the query plan. `CatalogColumnStat` keeps `min` and `max` as `String`, making it not depend on dataType information. * For `CatalogColumnStat`, parse column names from property names in the metastore (`KEY_VERSION` property), not from metastore schema. This means that `CatalogColumnStat`s can be created for columns even if the schema itself is not stored in the metastore. * Make all fields optional. `min`, `max` and `histogram` for columns were optional already. Having them all optional is more consistent, and gives flexibility to e.g. drop some of the fields through transformations if they are difficult / impossible to calculate. The added flexibility will make it possible to have alternative implementations for stats, and separates stats collection from stats and estimation processing in plans. ## How was this patch tested? Refactored existing tests to work with refactored `ColumnStat` and `CatalogColumnStat`. New tests added in `StatisticsSuite` checking that backwards / forwards compatibility is not broken. Author: Juliusz Sompolski <julek@databricks.com> Closes #20624 from juliuszsompolski/SPARK-23445.	2018-02-26 23:37:31 -08:00
hyukjinkwon	ed86476098	[SPARK-23359][SQL] Adds an alias 'names' of 'fieldNames' in Scala's StructType ## What changes were proposed in this pull request? This PR proposes to add an alias 'names' of 'fieldNames' in Scala. Please see the discussion in [SPARK-20090](https://issues.apache.org/jira/browse/SPARK-20090). ## How was this patch tested? Unit tests added in `DataTypeSuite.scala`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20545 from HyukjinKwon/SPARK-23359.	2018-02-15 17:13:05 +08:00
caoxuewen	63b49fa2e5	[SPARK-23311][SQL][TEST] add FilterFunction test case for test CombineTypedFilters ## What changes were proposed in this pull request? The current test cases for CombineTypedFilters lack a test of FilterFunction, so let's add one. In addition, let's extract a common LocalRelation in TypedFilterOptimizationSuite's existing test cases. ## How was this patch tested? Added new test cases. Author: caoxuewen <cao.xuewen@zte.com.cn> Closes #20482 from heary-cao/TypedFilterOptimizationSuite.	2018-02-03 00:02:03 -08:00
gatorsmile	ca04c3ff23	[SPARK-23274][SQL] Fix ReplaceExceptWithFilter when the right's Filter contains the references that are not in the left output ## What changes were proposed in this pull request? This PR fixes the `ReplaceExceptWithFilter` rule when the right's Filter contains references that are not in the left output. Before this PR, we got an error like ``` java.util.NoSuchElementException: key not found: a at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:59) ``` After this PR, `ReplaceExceptWithFilter` will not take effect in this case. ## How was this patch tested? Added tests Author: gatorsmile <gatorsmile@gmail.com> Closes #20444 from gatorsmile/fixReplaceExceptWithFilter.	2018-01-30 20:05:57 -08:00
Herman van Hovell	2d903cf9d3	[SPARK-23223][SQL] Make stacking dataset transforms more performant ## What changes were proposed in this pull request? It is a common pattern to apply multiple transforms to a `Dataset` (using `Dataset.withColumn` for example). This is currently quite expensive because we run `CheckAnalysis` on the full plan and create an encoder for each intermediate `Dataset`. This PR extends the usage of the `AnalysisBarrier` to include `CheckAnalysis`. By doing this we hide the already analyzed plan from `CheckAnalysis` because the barrier is a `LeafNode`. The `AnalysisBarrier` is in the `FinishAnalysis` phase of the optimizer. We also make binding the `Dataset` encoder lazy. The bound encoder is only needed when we materialize the dataset. ## How was this patch tested? Existing tests should cover this. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #20402 from hvanhovell/SPARK-23223.	2018-01-29 09:00:54 -08:00
caoxuewen	54dd7cf4ef	[SPARK-23199][SQL] improved Removes repetition from group expressions in Aggregate ## What changes were proposed in this pull request? Currently, every Aggregate operation goes through RemoveRepetitionFromGroupExpressions, but when there are no grouping expressions, or no duplicates among the grouping expressions, we do not need to copy the logical plan. ## How was this patch tested? The existing test cases. Author: caoxuewen <cao.xuewen@zte.com.cn> Closes #20375 from heary-cao/RepetitionGroupExpressions.	2018-01-29 08:56:42 -08:00
Herman van Hovell	e29b08add9	[SPARK-23208][SQL] Fix code generation for complex create array (related) expressions ## What changes were proposed in this pull request? The `GenArrayData.genCodeToCreateArrayData` produces illegal java code when code splitting is enabled. This is used in `CreateArray` and `CreateMap` expressions for complex object arrays. This issue is caused by a typo. ## How was this patch tested? Added a regression test in `complexTypesSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #20391 from hvanhovell/SPARK-23208.	2018-01-25 16:40:41 +08:00
caoxuewen	6f0ba8472d	[MINOR][SQL] add new unit test to LimitPushdown ## What changes were proposed in this pull request? This PR makes the following changes: 1. Update y -> x in the "left outer join" test case, which appears to be a mistake. 2. Add a new test case: "left outer join and left sides are limited" 3. Add a new test case: "left outer join and right sides are limited" 4. Add a new test case: "right outer join and right sides are limited" 5. Add a new test case: "right outer join and left sides are limited" 6. Remove comments that have no corresponding code implementation. ## How was this patch tested? Added new unit test cases. Author: caoxuewen <cao.xuewen@zte.com.cn> Closes #20381 from heary-cao/LimitPushdownSuite.	2018-01-24 13:06:09 -08:00
Jacek Laskowski	76b8b840dd	[MINOR] Typo fixes ## What changes were proposed in this pull request? Typo fixes ## How was this patch tested? Local build / Doc-only changes Author: Jacek Laskowski <jacek@japila.pl> Closes #20344 from jaceklaskowski/typo-fixes.	2018-01-22 13:55:14 -06:00
Marco Gaido	e28eb43114	[SPARK-22036][SQL] Decimal multiplication with high precision/scale often returns NULL ## What changes were proposed in this pull request? When there is an operation between Decimals and the result is a number which is not representable exactly with the result's precision and scale, Spark returns `NULL`. This was done to reflect Hive's behavior, but it is against SQL ANSI 2011, which states that "If the result cannot be represented exactly in the result type, then whether it is rounded or truncated is implementation-defined". Moreover, Hive has now changed its behavior in order to respect the standard, thanks to HIVE-15331. Therefore, the PR proposes to: - update the rules that determine the result precision and scale according to Hive's new ones introduced in HIVE-15331; - round the result of the operations, when it is not representable exactly with the result's precision and scale, instead of returning `NULL`; - introduce a new config `spark.sql.decimalOperations.allowPrecisionLoss`, which defaults to `true` (i.e. the new behavior), in order to allow users to switch back to the previous one. Hive's behavior reflects SQL Server's. The only difference is that the precision and scale are adjusted for all the arithmetic operations in Hive, while the SQL Server documentation says it does so only for multiplications and divisions. This PR follows Hive's behavior. A more detailed explanation is available here: https://mail-archives.apache.org/mod_mbox/spark-dev/201712.mbox/%3CCAEorWNAJ4TxJR9NBcgSFMD_VxTg8qVxusjP%2BAJP-x%2BJV9zH-yA%40mail.gmail.com%3E. ## How was this patch tested? Modified and added UTs. Comparisons with results of Hive and SQL Server. Author: Marco Gaido <marcogaido91@gmail.com> Closes #20023 from mgaido91/SPARK-22036.	2018-01-18 21:24:39 +08:00
Wang Gengliang	8598a982b4	[SPARK-23079][SQL] Fix query constraints propagation with aliases ## What changes were proposed in this pull request? Previously, PR #19201 fixed the problem of non-converging constraints. After that, PR #19149 improved the loop so that constraints are inferred only once. So the problem of non-converging constraints is gone. However, the case below will fail. ``` spark.range(5).write.saveAsTable("t") val t = spark.read.table("t") val left = t.withColumn("xid", $"id" + lit(1)).as("x") val right = t.withColumnRenamed("id", "xid").as("y") val df = left.join(right, "xid").filter("id = 3").toDF() checkAnswer(df, Row(4, 3)) ``` This is because `aliasMap` replaces all the aliased children. See the test case in the PR for details. This PR fixes the bug by removing the now-useless code for preventing non-converging constraints. It can also be fixed with #20270, but this is much simpler and cleans up the code. ## How was this patch tested? Unit test Author: Wang Gengliang <ltnwgl@gmail.com> Closes #20278 from gengliangwang/FixConstraintSimple.	2018-01-18 00:05:26 +08:00
Gabor Somogyi	a9b845ebb5	[SPARK-22361][SQL][TEST] Add unit test for Window Frames ## What changes were proposed in this pull request? There are already quite a few integration tests using window frames, but the unit test coverage is not ideal. In this PR the existing tests are reorganized and extended, and additional cases are added where gaps were found. ## How was this patch tested? Automated: Pass the Jenkins. Author: Gabor Somogyi <gabor.g.somogyi@gmail.com> Closes #20019 from gaborgsomogyi/SPARK-22361.	2018-01-17 10:03:25 +08:00
xubo245	6c81fe227a	[SPARK-23035][SQL] Fix improper information of TempTableAlreadyExistsException ## What changes were proposed in this pull request? Problem: `TempTableAlreadyExistsException` is thrown with the message "Temporary table '$table' already exists" when a temp view is created through org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, which is improper. This PR fixes the message of TempTableAlreadyExistsException when creating a temp view: change "Temporary table" to "Temporary view". ## How was this patch tested? test("rename temporary view - destination table already exists, with: CREATE TEMPORARY view") test("rename temporary view - destination table with database name,with:CREATE TEMPORARY view") Author: xubo245 <601450868@qq.com> Closes #20227 from xubo245/fixDeprecated.	2018-01-15 23:13:15 +08:00
Marco Gaido	5050868069	[SPARK-23025][SQL] Support Null type in scala reflection ## What changes were proposed in this pull request? Add support for `Null` type in the `schemaFor` method for Scala reflection. ## How was this patch tested? Added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #20219 from mgaido91/SPARK-23025.	2018-01-12 18:04:44 +08:00
Feng Liu	9b33dfc408	[SPARK-22951][SQL] fix aggregation after dropDuplicates on empty data frames ## What changes were proposed in this pull request? (courtesy of liancheng) Spark SQL supports both global aggregation and grouping aggregation. Global aggregation always returns a single row with the initial aggregation state as the output, even if there are zero input rows. Spark implements this by simply checking the number of grouping keys and treats an aggregation as a global aggregation if it has zero grouping keys. However, this simple principle drops the ball in the following case: ```scala spark.emptyDataFrame.dropDuplicates().agg(count($"*") as "c").show() // +---+ // | c | // +---+ // | 1 | // +---+ ``` The reason is that: 1. `df.dropDuplicates()` is roughly translated into something equivalent to: ```scala val allColumns = df.columns.map { col } df.groupBy(allColumns: _*).agg(allColumns.head, allColumns.tail: _*) ``` This translation is implemented in the rule `ReplaceDeduplicateWithAggregate`. 2. `spark.emptyDataFrame` contains zero columns and zero rows. Therefore, rule `ReplaceDeduplicateWithAggregate` makes a confusing transformation roughly equivalent to the following one: ```scala spark.emptyDataFrame.dropDuplicates() => spark.emptyDataFrame.groupBy().agg(Map.empty[String, String]) ``` The above transformation is confusing because the resulting aggregate operator contains no grouping keys (because `emptyDataFrame` contains no columns), and gets recognized as a global aggregation. As a result, Spark SQL allocates a single row filled by the initial aggregation state and uses it as the output, and returns a wrong result. To fix this issue, this PR tweaks `ReplaceDeduplicateWithAggregate` by appending a literal `1` to the grouping key list of the resulting `Aggregate` operator when the input plan contains zero output columns. 
In this way, `spark.emptyDataFrame.dropDuplicates()` is now translated into a grouping aggregation, roughly depicted as: ```scala spark.emptyDataFrame.dropDuplicates() => spark.emptyDataFrame.groupBy(lit(1)).agg(Map.empty[String, String]) ``` Which is now properly treated as a grouping aggregation and returns the correct answer. ## How was this patch tested? New unit tests added Author: Feng Liu <fengliu@databricks.com> Closes #20174 from liufengdb/fix-duplicate.	2018-01-10 14:25:04 -08:00
Takeshi Yamamuro	2250cb75b9	[SPARK-22981][SQL] Fix incorrect results of Casting Struct to String ## What changes were proposed in this pull request? This PR fixes the issue when casting structs into strings: ``` scala> val df = Seq(((1, "a"), 0), ((2, "b"), 0)).toDF("a", "b") scala> df.write.saveAsTable("t") scala> sql("SELECT CAST(a AS STRING) FROM t").show +-------------------+ | a| +-------------------+ |[0,1,1800000001,61]| |[0,2,1800000001,62]| +-------------------+ ``` This PR changes the result into: ``` +------+ | a| +------+ |[1, a]| |[2, b]| +------+ ``` ## How was this patch tested? Added tests in `CastSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20176 from maropu/SPARK-22981.	2018-01-09 21:58:55 +08:00
Josh Rosen	f20131dd35	[SPARK-22984] Fix incorrect bitmap copying and offset adjustment in GenerateUnsafeRowJoiner ## What changes were proposed in this pull request? This PR fixes a longstanding correctness bug in `GenerateUnsafeRowJoiner`. This class was introduced in https://github.com/apache/spark/pull/7821 (July 2015 / Spark 1.5.0+) and is used to combine pairs of UnsafeRows in TungstenAggregationIterator, CartesianProductExec, and AppendColumns. ### Bugs fixed by this patch 1. Incorrect combining of null-tracking bitmaps: when concatenating two UnsafeRows, the implementation "Concatenate the two bitsets together into a single one, taking padding into account". If one row has no columns then it has a bitset size of 0, but the code was incorrectly assuming that if the left row had a non-zero number of fields then the right row would also have at least one field, so it was copying invalid bytes and treating them as part of the bitset. I'm not sure whether this bug was also present in the original implementation or whether it was introduced in https://github.com/apache/spark/pull/7892 (which fixed another bug in this code). 2. Incorrect updating of data offsets for null variable-length fields: after updating the bitsets and copying fixed-length and variable-length data, we need to perform adjustments to the offsets pointing to the start of variable-length fields' data. The existing code was _conditionally_ adding a fixed offset to correct for the new length of the combined row, but it is unsafe to do this if the variable-length field has a null value: we always represent nulls by storing `0` in the fixed-length slot, but this code was incorrectly incrementing those values. This bug was present since the original version of `GenerateUnsafeRowJoiner`. 
### Why this bug remained latent for so long The PR which introduced `GenerateUnsafeRowJoiner` features several randomized tests, including tests of the cases where one side of the join has no fields and where string-valued fields are null. However, the existing assertions were too weak to uncover this bug: - If a null field has a non-zero value in its fixed-length data slot then this will not cause problems for field accesses because the null-tracking bitmap should still be correct and we will not try to use the incorrect offset for anything. - If the null tracking bitmap is corrupted by joining against a row with no fields then the corruption occurs in field numbers past the actual field numbers contained in the row. Thus valid `isNullAt()` calls will not read the incorrectly-set bits. The existing `GenerateUnsafeRowJoinerSuite` tests only exercised `.get()` and `isNullAt()`, but didn't actually check the UnsafeRows for bit-for-bit equality, preventing these bugs from failing assertions. It turns out that there was even a [GenerateUnsafeRowJoinerBitsetSuite](`03377d2522/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoinerBitsetSuite.scala`) but it looks like it also didn't catch this problem because it only tested the bitsets in an end-to-end fashion by accessing them through the `UnsafeRow` interface instead of actually comparing the bitsets' bytes. ### Impact of these bugs - This bug will cause `equals()` and `hashCode()` to be incorrect for these rows, which will be problematic in case`GenerateUnsafeRowJoiner`'s results are used as join or grouping keys. - Chained / repeated invocations of `GenerateUnsafeRowJoiner` may result in reads from invalid null bitmap positions causing fields to incorrectly become NULL (see the end-to-end example below). 
- It looks like this generally only happens in `CartesianProductExec`, which our query optimizer often avoids executing (usually we try to plan a `BroadcastNestedLoopJoin` instead). ### End-to-end test case demonstrating the problem The following query demonstrates how this bug may result in incorrect query results: ```sql set spark.sql.autoBroadcastJoinThreshold=-1; -- Needed to trigger CartesianProductExec create table a as select * from values 1; create table b as select * from values 2; SELECT t3.col1, t1.col1 FROM a t1 CROSS JOIN b t2 CROSS JOIN b t3 ``` This should return `(2, 1)` but instead was returning `(null, 1)`. Column pruning ends up trimming off all columns from `t2`, so when `t2` joins with another table this triggers the bitmap-copying bug. This incorrect bitmap is subsequently copied again when performing the final join, causing the final output to have an incorrectly-set null bit for the first field. ## How was this patch tested? Strengthened the assertions in existing tests in GenerateUnsafeRowJoinerSuite. Also verified that the end-to-end test case which uncovered this now passes. Author: Josh Rosen <joshrosen@databricks.com> Closes #20181 from JoshRosen/SPARK-22984-fix-generate-unsaferow-joiner-bitmap-bugs.	2018-01-09 11:49:10 +08:00
Wenchen Fan	eb45b52e82	[SPARK-21865][SQL] simplify the distribution semantic of Spark SQL ## What changes were proposed in this pull request? The current shuffle planning logic: 1. Each operator specifies the distribution requirements for its children, via the `Distribution` interface. 2. Each operator specifies its output partitioning, via the `Partitioning` interface. 3. `Partitioning.satisfy` determines whether a `Partitioning` can satisfy a `Distribution`. 4. For each operator, check each of its children, and add a shuffle node above the child if the child partitioning cannot satisfy the required distribution. 5. For each operator, check if its children's output partitionings are compatible with each other, via `Partitioning.compatibleWith`. 6. If the check in 5 failed, add a shuffle above each child. 7. Try to eliminate the shuffles added in 6, via `Partitioning.guarantees`. This design has a major problem with the definition of "compatible". `Partitioning.compatibleWith` is not well defined; ideally a `Partitioning` can't know if it's compatible with another `Partitioning` without more information from the operator. For example, for `t1 join t2 on t1.a = t2.b`, `HashPartitioning(a, 10)` should be compatible with `HashPartitioning(b, 10)`, but the partitioning itself doesn't know it. As a result, `Partitioning.compatibleWith` currently always returns false except for literals, which makes it almost useless. This also means that if an operator has distribution requirements for multiple children, Spark always adds shuffle nodes to all the children (although some of them can be eliminated). However, there is no guarantee that the children's output partitionings are compatible with each other after adding these shuffles; we just assume that the operator will only specify `ClusteredDistribution` for multiple children. 
I think it's very hard to guarantee children co-partition for all kinds of operators, and we cannot even give a clear definition of co-partition between distributions like `ClusteredDistribution(a,b)` and `ClusteredDistribution(c)`. I think we should drop the "compatible" concept in the distribution model, and let the operator achieve the co-partition requirement by special distribution requirements. Proposed shuffle planning logic after this PR (the first 4 are the same as before): 1. Each operator specifies the distribution requirements for its children, via the `Distribution` interface. 2. Each operator specifies its output partitioning, via the `Partitioning` interface. 3. `Partitioning.satisfy` determines whether a `Partitioning` can satisfy a `Distribution`. 4. For each operator, check each of its children, and add a shuffle node above the child if the child partitioning cannot satisfy the required distribution. 5. For each operator, check if its children's output partitionings have the same number of partitions. 6. If the check in 5 failed, pick the max number of partitions from the children's output partitionings, and add a shuffle to each child whose number of partitions doesn't equal the max one. The new distribution model is very simple: we only have one kind of relationship, which is `Partitioning.satisfy`. For multiple children, Spark only guarantees they have the same number of partitions, and it's the operator's responsibility to leverage this guarantee to achieve more complicated requirements. For example, non-broadcast joins can use the newly added `HashPartitionedDistribution` to achieve co-partition. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #19080 from cloud-fan/exchange.	2018-01-08 19:41:41 +08:00
Josh Rosen	2c73d2a948	[SPARK-22983] Don't push filters beneath aggregates with empty grouping expressions ## What changes were proposed in this pull request? The following SQL query should return zero rows, but in Spark it actually returns one row: ``` SELECT 1 from ( SELECT 1 AS z, MIN(a.x) FROM (select 1 as x) a WHERE false ) b where b.z != b.z ``` The problem stems from the `PushDownPredicate` rule: when this rule encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, it removes the original filter and adds a new filter onto Aggregate's child, e.g. `Agg(Filter(...))`. This is sometimes okay, but the case above is a counterexample: because there is no explicit `GROUP BY`, we are implicitly computing a global aggregate over the entire table so the original filter was not acting like a `HAVING` clause filtering the number of groups: if we push this filter then it fails to actually reduce the cardinality of the Aggregate output, leading to the wrong answer. In 2016 I fixed a similar problem involving invalid pushdowns of data-independent filters (filters which reference no columns of the filtered relation). There was additional discussion after my fix was merged which pointed out that my patch was an incomplete fix (see #15289), but it looks like I must have either misunderstood the comment or forgotten to follow up on the additional points raised there. This patch fixes the problem by choosing to never push down filters in cases where there are no grouping expressions. Since there are no grouping keys, the only columns are aggregate columns and we can't push filters defined over aggregate results, so this change won't cause us to miss out on any legitimate pushdown opportunities. ## How was this patch tested? New regression tests in `SQLQueryTestSuite` and `FilterPushdownSuite`. Author: Josh Rosen <joshrosen@databricks.com> Closes #20180 from JoshRosen/SPARK-22983-dont-push-filters-beneath-aggs-with-empty-grouping-expressions.	2018-01-08 16:04:03 +08:00
Josh Rosen	71d65a3215	[SPARK-22985] Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen ## What changes were proposed in this pull request? This patch adds additional escaping in `from_utc_timestamp` / `to_utc_timestamp` expression codegen in order to fix a bug where invalid timezones which contain special characters could cause generated code to fail to compile. ## How was this patch tested? New regression tests in `DateExpressionsSuite`. Author: Josh Rosen <joshrosen@databricks.com> Closes #20182 from JoshRosen/SPARK-22985-fix-utc-timezone-function-escaping-bugs.	2018-01-08 11:39:45 +08:00
Takeshi Yamamuro	18e9414999	[SPARK-22973][SQL] Fix incorrect results of Casting Map to String ## What changes were proposed in this pull request? This pr fixed the issue when casting maps into strings; ``` scala> Seq(Map(1 -> "a", 2 -> "b")).toDF("a").write.saveAsTable("t") scala> sql("SELECT cast(a as String) FROM t").show(false) +----------------------------------------------------------------+ \|a \| +----------------------------------------------------------------+ \|org.apache.spark.sql.catalyst.expressions.UnsafeMapData38bdd75d\| +----------------------------------------------------------------+ ``` This pr modified the result into; ``` +----------------+ \|a \| +----------------+ \|[1 -> a, 2 -> b]\| +----------------+ ``` ## How was this patch tested? Added tests in `CastSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20166 from maropu/SPARK-22973.	2018-01-07 13:42:01 +08:00
Takeshi Yamamuro	e8af7e8aec	[SPARK-22937][SQL] SQL elt output binary for binary inputs ## What changes were proposed in this pull request? This pr modified `elt` to output binary for binary inputs. `elt` in the current master always outputs data as a string. But in some databases (e.g., MySQL), if all inputs are binary, `elt` also outputs binary (also, this might be a small surprise). This pr is related to #19977. ## How was this patch tested? Added tests in `SQLQueryTestSuite` and `TypeCoercionSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20135 from maropu/SPARK-22937.	2018-01-06 09:26:03 +08:00
Adrian Ionescu	51c33bd0d4	[SPARK-22961][REGRESSION] Constant columns should generate QueryPlanConstraints ## What changes were proposed in this pull request? #19201 introduced the following regression: given something like `df.withColumn("c", lit(2))`, we're no longer picking up `c === 2` as a constraint and inferring filters from it when joins are involved, which may lead to noticeable performance degradation. This patch re-enables this optimization by picking up Aliases of Literals in Projection lists as constraints and making sure they're not treated as aliased columns. ## How was this patch tested? Unit test was added. Author: Adrian Ionescu <adrian@databricks.com> Closes #20155 from adrian-ionescu/constant_constraints.	2018-01-05 21:32:39 +08:00
Takeshi Yamamuro	52fc5c17d9	[SPARK-22825][SQL] Fix incorrect results of Casting Array to String ## What changes were proposed in this pull request? This pr fixed the issue when casting arrays into strings; ``` scala> val df = spark.range(10).select('id.cast("integer")).agg(collect_list('id).as('ids)) scala> df.write.saveAsTable("t") scala> sql("SELECT cast(ids as String) FROM t").show(false) +------------------------------------------------------------------+ \|ids \| +------------------------------------------------------------------+ \|org.apache.spark.sql.catalyst.expressions.UnsafeArrayData8bc285df\| +------------------------------------------------------------------+ ``` This pr modified the result into; ``` +------------------------------+ \|ids \| +------------------------------+ \|[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\| +------------------------------+ ``` ## How was this patch tested? Added tests in `CastSuite` and `SQLQuerySuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20024 from maropu/SPARK-22825.	2018-01-05 14:02:21 +08:00
Takeshi Yamamuro	6f68316e98	[SPARK-22771][SQL] Add a missing return statement in Concat.checkInputDataTypes ## What changes were proposed in this pull request? This pr is a follow-up to fix a bug left in #19977. ## How was this patch tested? Added tests in `StringExpressionsSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20149 from maropu/SPARK-22771-FOLLOWUP.	2018-01-04 21:15:10 +08:00
Wenchen Fan	7d045c5f00	[SPARK-22944][SQL] improve FoldablePropagation ## What changes were proposed in this pull request? `FoldablePropagation` is a little tricky as it needs to handle attributes that are mis-derived from children, e.g. outer join outputs. This rule does a kind of stoppable tree transform, skipping application of the rule when it hits a node which may have mis-derived attributes. Logically we should be able to apply this rule above the unsupported nodes, by just treating the unsupported nodes as leaf nodes. This PR improves this rule so that it does not stop the tree transformation, but instead reduces the set of foldable expressions that we want to propagate. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #20139 from cloud-fan/foldable.	2018-01-04 13:14:52 +08:00
Sean Owen	c284c4e1f6	[MINOR] Fix a bunch of typos	2018-01-02 07:10:19 +09:00
gatorsmile	cfbe11e816	[SPARK-22895][SQL] Push down the deterministic predicates that are after the first non-deterministic ## What changes were proposed in this pull request? Currently, we do not guarantee the evaluation order of conjuncts in either the Filter or Join operator. This is also true of mainstream RDBMS vendors like DB2 and MS SQL Server. Thus, we should also push down the deterministic predicates that are after the first non-deterministic, if possible. ## How was this patch tested? Updated the existing test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #20069 from gatorsmile/morePushDown.	2017-12-31 15:06:54 +08:00
Zhenhua Wang	234d9435d4	[TEST][MINOR] remove redundant `EliminateSubqueryAliases` in test code ## What changes were proposed in this pull request? The `analyze` method in `implicit class DslLogicalPlan` already includes `EliminateSubqueryAliases`. So there's no need to call `EliminateSubqueryAliases` again after calling `analyze` in some test code. ## How was this patch tested? Existing tests. Author: Zhenhua Wang <wzh_zju@163.com> Closes #20122 from wzhfy/redundant_code.	2017-12-30 20:48:39 +08:00
Takeshi Yamamuro	f2b3525c17	[SPARK-22771][SQL] Concatenate binary inputs into a binary output ## What changes were proposed in this pull request? This pr modified `concat` to concat binary inputs into a single binary output. `concat` in the current master always output data as a string. But, in some databases (e.g., PostgreSQL), if all inputs are binary, `concat` also outputs binary. ## How was this patch tested? Added tests in `SQLQueryTestSuite` and `TypeCoercionSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #19977 from maropu/SPARK-22771.	2017-12-30 14:09:56 +08:00
oraviv	fcf66a3276	[SPARK-21657][SQL] optimize explode quadratic memory consumpation ## What changes were proposed in this pull request? The issue has been raised in two Jira tickets: [SPARK-21657](https://issues.apache.org/jira/browse/SPARK-21657), [SPARK-16998](https://issues.apache.org/jira/browse/SPARK-16998). Basically, what happens is that in collection generators like explode/inline we create many rows from each row. Currently each exploded row also contains the column from which it was created. As a result, if for example we have a 10k array in one row, this array gets copied 10k times, once to each of the rows. This results in quadratic memory consumption. However, it is a common case that the original column gets projected out after the explode, so we can avoid duplicating it. In this solution we propose to identify this situation in the optimizer and turn on a flag for omitting the original column in the generation process. ## How was this patch tested? 1. We added a benchmark test to MiscBenchmark that shows x16 improvement in runtimes. 2. We ran some of the other tests in MiscBenchmark and they show 15% improvements. 3. We ran this code on a specific case from our production data with rows containing arrays of size ~200k and it reduced the runtime from 6 hours to 3 mins. Author: oraviv <oraviv@paypal.com> Author: uzadude <ohad.raviv@gmail.com> Author: uzadude <15645757+uzadude@users.noreply.github.com> Closes #19683 from uzadude/optimize_explode.	2017-12-29 21:08:34 +08:00
Marco Gaido	c6f01caded	[SPARK-22750][SQL] Reuse mutable states when possible ## What changes were proposed in this pull request? The PR introduces a new method `addImmutableStateIfNotExists` to `CodeGenerator` to allow reusing and sharing the same global variable between different Expressions. This helps reduce the number of global variables needed, which is important to limit the impact on the constant pool. ## How was this patch tested? added UTs Author: Marco Gaido <marcogaido91@gmail.com> Author: Marco Gaido <mgaido@hortonworks.com> Closes #19940 from mgaido91/SPARK-22750.	2017-12-22 10:13:26 +08:00
Youngbin Kim	6e36d8d562	[SPARK-22829] Add new built-in function date_trunc() ## What changes were proposed in this pull request? Adding date_trunc() as a built-in function. `date_trunc` is common in other databases, but Spark or Hive does not have support for this. `date_trunc` is commonly used by data scientists and business intelligence application such as Superset (https://github.com/apache/incubator-superset). We do have `trunc` but this only works with 'MONTH' and 'YEAR' level on the DateType input. date_trunc() in other databases: AWS Redshift: http://docs.aws.amazon.com/redshift/latest/dg/r_DATE_TRUNC.html PostgreSQL: https://www.postgresql.org/docs/9.1/static/functions-datetime.html Presto: https://prestodb.io/docs/current/functions/datetime.html ## How was this patch tested? Unit tests Author: Youngbin Kim <ykim828@hotmail.com> Closes #20015 from youngbink/date_trunc.	2017-12-19 20:22:33 -08:00
Kazuaki Ishizaki	ee56fc3432	[SPARK-18016][SQL] Code Generation: Constant Pool Limit - reduce entries for mutable state ## What changes were proposed in this pull request? This PR is a follow-on to #19518. This PR tries to reduce the number of constant pool entries used for accessing mutable state. There are two directions: 1. Primitive type variables should be allocated at the outer class for better performance. Otherwise, this PR allocates an array. 2. The length of an allocated array is at most 32768, to avoid using a constant pool entry on access (e.g. `mutableStateArray[32767]`). Here are some discussions that determined these directions. 1. [[1]](https://github.com/apache/spark/pull/19518#issuecomment-346690464), [[2]](https://github.com/apache/spark/pull/19518#issuecomment-346690642), [[3]](https://github.com/apache/spark/pull/19518#issuecomment-346828180), [[4]](https://github.com/apache/spark/pull/19518#issuecomment-346831544), [[5]](https://github.com/apache/spark/pull/19518#issuecomment-346857340) 2. [[6]](https://github.com/apache/spark/pull/19518#issuecomment-346729172), [[7]](https://github.com/apache/spark/pull/19518#issuecomment-346798358), [[8]](https://github.com/apache/spark/pull/19518#issuecomment-346870408) This PR modifies the `addMutableState` function in the `CodeGenerator` to check if the declared state can be easily initialized and compacted into an array. We identify three types of states that cannot be compacted: - Primitive type state (ints, booleans, etc) if the number of them does not exceed a threshold - Multiple-dimensional array type - `inline = true` When `useFreshName = false`, the given name is used. Much of the code was ported from #19518, where a lot of effort was invested. I think this PR should credit bdrillard. With this PR, the following code is generated: ``` /* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private InternalRow mutableRow; /* 009 */ private boolean isNull_0; /* 010 */ private boolean isNull_1; /* 011 */ private boolean isNull_2; /* 012 */ private int value_2; /* 013 */ private boolean isNull_3; ... /* 10006 */ private int value_4999; /* 10007 */ private boolean isNull_5000; /* 10008 */ private int value_5000; /* 10009 */ private InternalRow[] mutableStateArray = new InternalRow[2]; /* 10010 */ private boolean[] mutableStateArray1 = new boolean[7001]; /* 10011 */ private int[] mutableStateArray2 = new int[1001]; /* 10012 */ private UTF8String[] mutableStateArray3 = new UTF8String[6000]; /* 10013 */ ... /* 107956 */ private void init_176() { /* 107957 */ isNull_4986 = true; /* 107958 */ value_4986 = -1; ... /* 108004 */ } ... ``` ## How was this patch tested? Added a new test case to `GeneratedProjectionSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19811 from kiszk/SPARK-18016.	2017-12-20 00:10:54 +08:00
Zhenhua Wang	571aa27554	[SPARK-21984][SQL] Join estimation based on equi-height histogram ## What changes were proposed in this pull request? Equi-height histogram is one of the state-of-the-art statistics for cardinality estimation; it can provide better estimation accuracy and handles skewed data well. This PR is to improve join estimation based on equi-height histogram. The difference from basic estimation (based on ndv) is the logic for computing join cardinality and the new ndv after join. The main idea is as follows: 1. find overlapped ranges between two histograms from two join keys; 2. apply the formula `T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1))` in each overlapped range. ## How was this patch tested? Added new test cases. Author: Zhenhua Wang <wangzhenhua@huawei.com> Closes #19594 from wzhfy/join_estimation_histogram.	2017-12-19 21:55:21 +08:00
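The per-range formula above can be illustrated with a small sketch. It assumes the overlapped ranges have already been computed and that each side's row count and distinct-value count within a range are known; `RangeStats`, `join_rows_in_range`, and `estimate_join_rows` are hypothetical names for this illustration, not Spark classes.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class RangeStats:
    """Rows and distinct key values one join side contributes to an overlapped range."""
    rows: float   # T(.) restricted to the range
    ndv: float    # V(.k1) restricted to the range


def join_rows_in_range(a: RangeStats, b: RangeStats) -> float:
    """Apply T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1)) to a single range."""
    max_ndv = max(a.ndv, b.ndv)
    if max_ndv == 0:
        return 0.0  # no distinct values on either side, so no matches
    return a.rows * b.rows / max_ndv


def estimate_join_rows(overlaps: Iterable[Tuple[RangeStats, RangeStats]]) -> float:
    """Sum the per-range estimates over all overlapped ranges."""
    return sum(join_rows_in_range(a, b) for a, b in overlaps)
```

For example, two overlapped ranges with (100 rows, 10 distinct values) joined to (50 rows, 5 distinct values), and (20 rows, 4 distinct values) joined to (40 rows, 8 distinct values), contribute 100 * 50 / 10 = 500 and 20 * 40 / 8 = 100 rows respectively.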
Wenchen Fan	2a29a60da3	Revert "[SPARK-22600][SQL] Fix 64kb limit for deeply nested expressions under wholestage codegen" This reverts commit `c7d0148615`.	2017-12-14 11:22:23 +08:00
gatorsmile	c5a4701acc	Revert "[SPARK-21417][SQL] Infer join conditions using propagated constraints" This reverts commit `6ac57fd0d1`.	2017-12-13 11:50:04 -08:00
Wenchen Fan	f6bcd3e53f	[SPARK-22767][SQL] use ctx.addReferenceObj in InSet and ScalaUDF ## What changes were proposed in this pull request? We should not operate on `references` directly in `Expression.doGenCode`, instead we should use the high-level API `addReferenceObj`. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #19962 from cloud-fan/codegen.	2017-12-14 01:16:44 +08:00
gatorsmile	13e489b675	[SPARK-22759][SQL] Filters can be combined iff both are deterministic ## What changes were proposed in this pull request? The query execution/optimization does not guarantee the expressions are evaluated in order. We can combine them if and only if both are deterministic. We need to update the optimizer rule: CombineFilters. ## How was this patch tested? Updated the existing tests. Author: gatorsmile <gatorsmile@gmail.com> Closes #19947 from gatorsmile/combineFilters.	2017-12-12 22:48:31 -08:00
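The condition stated above, that two adjacent filters are merged only when both are deterministic, can be modelled with a toy plan tree. This is a simplified sketch, not the Catalyst rule itself: `Predicate`, `Relation`, `Filter`, and `combine_filters` are hypothetical stand-ins, and predicates are kept as plain SQL strings.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Predicate:
    sql: str
    deterministic: bool = True


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Filter:
    condition: Predicate
    child: Union["Filter", Relation]


def combine_filters(plan):
    """Merge adjacent Filter nodes, but only when both conditions are deterministic."""
    if not isinstance(plan, Filter):
        return plan
    child = combine_filters(plan.child)
    if (isinstance(child, Filter)
            and plan.condition.deterministic
            and child.condition.deterministic):
        # Evaluation order is not guaranteed after merging, which is only
        # safe because neither side is non-deterministic.
        merged = Predicate(f"({child.condition.sql}) AND ({plan.condition.sql})")
        return Filter(merged, child.child)
    return Filter(plan.condition, child)
```

A filter such as `rand() > 0.5` stacked on `a > 1` is left as two separate nodes in this model, because merging them would allow the non-deterministic predicate to be evaluated on rows the other filter would have removed.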
Liang-Chi Hsieh	c7d0148615	[SPARK-22600][SQL] Fix 64kb limit for deeply nested expressions under wholestage codegen ## What changes were proposed in this pull request? SPARK-22543 fixes the 64kb compile error for deeply nested expressions for non-wholestage codegen. This PR extends it to support wholestage codegen. This patch introduces some util methods to extract the necessary parameters for an expression if it is split into a function. The util methods are put in object `ExpressionCodegen` under `codegen`. The main entry is `getExpressionInputParams` which returns all necessary parameters to evaluate the given expression in a split function. These util methods can be used to split expressions too. This is a TODO item for later. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19813 from viirya/reduce-expr-code-for-wholestage.	2017-12-13 10:40:05 +08:00
Marco Gaido	4117786a87	[SPARK-22716][SQL] Avoid the creation of mutable states in addReferenceObj ## What changes were proposed in this pull request? We have two methods to reference an object: `addReferenceMinorObj` and `addReferenceObj`. The latter creates a new global variable, which means new entries in the constant pool. The PR unifies the two methods into a single `addReferenceObj` which returns the code to access the object in the `references` array and doesn't add new mutable states. ## How was this patch tested? added UTs. Author: Marco Gaido <mgaido@hortonworks.com> Closes #19916 from mgaido91/SPARK-22716.	2017-12-13 10:29:14 +08:00
Ron Hu	ecc179ecaa	[SPARK-21322][SQL] support histogram in filter cardinality estimation ## What changes were proposed in this pull request? Histogram is effective in dealing with skewed distribution. After we generate histogram information for column statistics, we need to adjust filter estimation based on histogram data structure. ## How was this patch tested? We revised all the unit test cases by including histogram data structure. Author: Ron Hu <ron.hu@huawei.com> Closes #19783 from ron8hu/supportHistogram.	2017-12-12 15:04:49 +08:00
Marco Gaido	b79071910e	[SPARK-22696][SQL] objects functions should not use unneeded global variables ## What changes were proposed in this pull request? Some objects functions are using global variables which are not needed. This can generate some unneeded entries in the constant pool. The PR replaces the unneeded global variables with local variables. ## How was this patch tested? added UTs Author: Marco Gaido <mgaido@hortonworks.com> Author: Marco Gaido <marcogaido91@gmail.com> Closes #19908 from mgaido91/SPARK-22696.	2017-12-07 21:24:36 +08:00
Marco Gaido	fc29446300	[SPARK-22699][SQL] GenerateSafeProjection should not use global variables for struct ## What changes were proposed in this pull request? GenerateSafeProjection is defining a mutable state for each struct, which is not needed. This is bad because of the well-known issues related to constant pool limits. The PR replaces the global variable with a local one. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #19914 from mgaido91/SPARK-22699.	2017-12-07 21:18:27 +08:00
Kazuaki Ishizaki	ea2fbf4197	[SPARK-22705][SQL] Case, Coalesce, and In use less global variables ## What changes were proposed in this pull request? This PR accomplishes the following two items. 1. Reduce # of global variables from two to one for generated code of `Case` and `Coalesce` and remove global variables for generated code of `In`. 2. Make lifetime of global variable local within an operation. Item 1 reduces # of constant pool entries in a Java class. Item 2 ensures that a variable is not passed as an argument to a method split by `CodegenContext.splitExpressions()`, which is addressed by #19865. ## How was this patch tested? Added new tests into `PredicateSuite`, `NullExpressionsSuite`, and `ConditionalExpressionSuite`. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19901 from kiszk/SPARK-22705.	2017-12-07 20:55:35 +08:00
Marco Gaido	f110a7f884	[SPARK-22693][SQL] CreateNamedStruct and InSet should not use global variables ## What changes were proposed in this pull request? CreateNamedStruct and InSet are using a global variable which is not needed. This can generate some unneeded entries in the constant pool. The PR removes the unnecessary mutable states and makes them local variables. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Author: Marco Gaido <mgaido@hortonworks.com> Closes #19896 from mgaido91/SPARK-22693.	2017-12-06 14:12:16 -08:00
Marco Gaido	e98f9647f4	[SPARK-22695][SQL] ScalaUDF should not use global variables ## What changes were proposed in this pull request? ScalaUDF is using global variables which are not needed. This can generate some unneeded entries in the constant pool. The PR replaces the unneeded global variables with local variables. ## How was this patch tested? added UT Author: Marco Gaido <mgaido@hortonworks.com> Author: Marco Gaido <marcogaido91@gmail.com> Closes #19900 from mgaido91/SPARK-22695.	2017-12-07 00:50:49 +08:00
Kazuaki Ishizaki	813c0f945d	[SPARK-22704][SQL] Least and Greatest use less global variables ## What changes were proposed in this pull request? This PR accomplishes the following two items. 1. Reduce # of global variables from two to one 2. Make lifetime of global variable local within an operation. Item 1 reduces # of constant pool entries in a Java class. Item 2 ensures that a variable is not passed as an argument to a method split by `CodegenContext.splitExpressions()`, which is addressed by #19865. ## How was this patch tested? Added new test into `ArithmeticExpressionSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19899 from kiszk/SPARK-22704.	2017-12-07 00:45:51 +08:00
Liang-Chi Hsieh	00d176d2fe	[SPARK-20392][SQL] Set barrier to prevent re-entering a tree ## What changes were proposed in this pull request? The SQL `Analyzer` goes through a whole query plan even if most of it is already analyzed. This increases the time spent on query analysis, especially for long pipelines in ML. This patch adds a logical node called `AnalysisBarrier` that wraps an analyzed logical plan to prevent it from being analyzed again. The barrier is applied to the analyzed logical plan in `Dataset`. It won't change the output of the wrapped logical plan and just acts as a wrapper to hide it from the analyzer. New operations on the dataset will be put on the barrier, so only the new nodes created will be analyzed. This analysis barrier will be removed at the end of the analysis stage. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19873 from viirya/SPARK-20392-reopen.	2017-12-05 21:43:41 -08:00
Zhenhua Wang	1e17ab83de	[SPARK-22662][SQL] Failed to prune columns after rewriting predicate subquery ## What changes were proposed in this pull request? As a simple example: ``` spark-sql> create table base (a int, b int) using parquet; Time taken: 0.066 seconds spark-sql> create table relInSubq ( x int, y int, z int) using parquet; Time taken: 0.042 seconds spark-sql> explain select a from base where a in (select x from relInSubq); == Physical Plan == Project [a#83] +- BroadcastHashJoin [a#83], [x#85], LeftSemi, BuildRight :- FileScan parquet default.base[a#83,b#84] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://100.0.0.4:9000/wzh/base], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int,b:int> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))) +- Project [x#85] +- *FileScan parquet default.relinsubq[x#85] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://100.0.0.4:9000/wzh/relinsubq], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<x:int> ``` We only need column `a` in table `base`, but all columns (`a`, `b`) are fetched. The reason is that, in "Operator Optimizations" batch, `ColumnPruning` first produces a `Project` on table `base`, but then it's removed by `removeProjectBeforeFilter`. Because at that time, the predicate subquery is in filter form. Then, in "Rewrite Subquery" batch, `RewritePredicateSubquery` converts the subquery into a LeftSemi join, but this batch doesn't have the `ColumnPruning` rule. This results in reading all columns for the `base` table. ## How was this patch tested? Added a new test case. Author: Zhenhua Wang <wangzhenhua@huawei.com> Closes #19855 from wzhfy/column_pruning_subquery.	2017-12-05 15:15:32 -08:00
Wenchen Fan	a8af4da12c	[SPARK-22682][SQL] HashExpression does not need to create global variables ## What changes were proposed in this pull request? It turns out that `HashExpression` can pass around some values via parameter when splitting codes into methods, to save some global variable slots. This can also prevent a weird case that global variable appears in parameter list, which is discovered by https://github.com/apache/spark/pull/19865 ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #19878 from cloud-fan/minor.	2017-12-05 12:43:05 +08:00
Marco Gaido	3887b7eef7	[SPARK-22665][SQL] Avoid repartitioning with empty list of expressions ## What changes were proposed in this pull request? Repartitioning by an empty set of expressions is currently possible, even though it is a case which is not handled properly. Indeed, in `HashExpression` there is a check to avoid running it on an empty set, but this check is not performed while repartitioning. Thus, the PR adds a check to avoid this wrong situation. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #19870 from mgaido91/SPARK-22665.	2017-12-04 17:08:56 -08:00
Marco Gaido	3927bb9b46	[SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date functions ## What changes were proposed in this pull request? #19696 replaced the deprecated usages for `Date` and `Waiter`, but a few methods were missed. The PR fixes the forgotten deprecated usages. ## How was this patch tested? existing UTs Author: Marco Gaido <mgaido@hortonworks.com> Closes #19875 from mgaido91/SPARK-22473_FOLLOWUP.	2017-12-04 11:07:27 -06:00
Adrian Ionescu	f5f8e84d9d	[SPARK-22614] Dataset API: repartitionByRange(...) ## What changes were proposed in this pull request? This PR introduces a way to explicitly range-partition a Dataset. So far, only round-robin and hash partitioning were possible via `df.repartition(...)`, but sometimes range partitioning might be desirable: e.g. when writing to disk, for better compression without the cost of global sort. The current implementation piggybacks on the existing `RepartitionByExpression` `LogicalPlan` and simply adds the following logic: If its expressions are of type `SortOrder`, then it will do `RangePartitioning`; otherwise `HashPartitioning`. This was by far the least intrusive solution I could come up with. ## How was this patch tested? Unit test for `RepartitionByExpression` changes, a test to ensure we're not changing the behavior of existing `.repartition()` and a few end-to-end tests in `DataFrameSuite`. Author: Adrian Ionescu <adrian@databricks.com> Closes #19828 from adrian-ionescu/repartitionByRange.	2017-11-30 15:41:34 -08:00
aokolnychyi	6ac57fd0d1	[SPARK-21417][SQL] Infer join conditions using propagated constraints ## What changes were proposed in this pull request? This PR adds an optimization rule that infers join conditions using propagated constraints. For instance, if there is a join, where the left relation has 'a = 1' and the right relation has 'b = 1', then the rule infers 'a = b' as a join predicate. Only semantically new predicates are appended to the existing join condition. Refer to the corresponding ticket and tests for more details. ## How was this patch tested? This patch comes with a new test suite to cover the implemented logic. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #18692 from aokolnychyi/spark-21417.	2017-11-30 14:25:10 -08:00
Kazuaki Ishizaki	999ec137a9	[SPARK-22570][SQL] Avoid to create a lot of global variables by using a local variable with allocation of an object in generated code ## What changes were proposed in this pull request? This PR reduces # of global variables in generated code by replacing a global variable with a local variable with an allocation of an object every time. When a lot of global variables are generated, the generated code may hit the 64K constant pool limit. This PR reduces # of generated global variables in the following three operations: * `Cast` with String to primitive byte/short/int/long * `RegExpReplace` * `CreateArray` I intentionally left [this part](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L595-L603) unchanged. This is because this variable keeps a class that is dynamically generated. In other words, it is not possible to reuse one class. ## How was this patch tested? Added test cases Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19797 from kiszk/SPARK-22570.	2017-12-01 02:28:24 +08:00
Wang Gengliang	57687280d4	[SPARK-22615][SQL] Handle more cases in PropagateEmptyRelation ## What changes were proposed in this pull request? Currently, in the optimizer rule `PropagateEmptyRelation`, the following cases are not handled: 1. empty relation as right child in left outer join 2. empty relation as left child in right outer join 3. empty relation as right child in left semi join 4. empty relation as right child in left anti join 5. only one empty relation in full outer join. Cases 1 / 2 / 5 can be treated as a Cartesian product and cause an exception. See the new test cases. ## How was this patch tested? Unit test Author: Wang Gengliang <ltnwgl@gmail.com> Closes #19825 from gengliangwang/SPARK-22615.	2017-11-29 09:17:39 -08:00
Marco Gaido	087879a77a	[SPARK-22520][SQL] Support code generation for large CaseWhen ## What changes were proposed in this pull request? Code generation is disabled for CaseWhen when the number of branches is higher than `spark.sql.codegen.maxCaseBranches` (which defaults to 20). This was done to prevent the well known 64KB method limit exception. This PR proposes to support code generation also in those cases (without causing exceptions of course). As a side effect, we could get rid of the `spark.sql.codegen.maxCaseBranches` configuration. ## How was this patch tested? existing UTs Author: Marco Gaido <mgaido@hortonworks.com> Author: Marco Gaido <marcogaido91@gmail.com> Closes #19752 from mgaido91/SPARK-22520.	2017-11-28 07:46:18 +08:00
Kazuaki Ishizaki	2dbe275b2d	[SPARK-22603][SQL] Fix 64KB JVM bytecode limit problem with FormatString ## What changes were proposed in this pull request? This PR changes `FormatString` code generation to place the generated code for argument expressions into separate methods if their size could be large. This PR passes variable arguments by using an `Object` array. ## How was this patch tested? Added new test cases into `StringExpressionSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19817 from kiszk/SPARK-22603.	2017-11-27 20:32:01 +08:00
Kazuaki Ishizaki	554adc77d2	[SPARK-22595][SQL] fix flaky test: CastSuite.SPARK-22500: cast for struct should not generate codes beyond 64KB ## What changes were proposed in this pull request? This PR reduces the number of fields in the test case of `CastSuite` to fix an issue that is pointed at [here](https://github.com/apache/spark/pull/19800#issuecomment-346634950). ``` java.lang.OutOfMemoryError: GC overhead limit exceeded java.lang.OutOfMemoryError: GC overhead limit exceeded at org.codehaus.janino.UnitCompiler.findClass(UnitCompiler.java:10971) at org.codehaus.janino.UnitCompiler.findTypeByName(UnitCompiler.java:7607) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5758) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5732) at org.codehaus.janino.UnitCompiler.access$13200(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$18.visitReferenceType(UnitCompiler.java:5668) at org.codehaus.janino.UnitCompiler$18.visitReferenceType(UnitCompiler.java:5660) at org.codehaus.janino.Java$ReferenceType.accept(Java.java:3356) at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5660) at org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2892) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2764) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) ... ``` ## How was this patch tested? Used existing test case Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19806 from kiszk/SPARK-22595.	2017-11-24 12:08:49 +01:00
Liang-Chi Hsieh	62a826f17c	[SPARK-22591][SQL] GenerateOrdering shouldn't change CodegenContext.INPUT_ROW ## What changes were proposed in this pull request? When I played with codegen in developing another PR, I found the value of `CodegenContext.INPUT_ROW` is not reliable. Under wholestage codegen, it is assigned to null first and then suddenly changed to `i`. The reason is `GenerateOrdering` changes `CodegenContext.INPUT_ROW` but doesn't restore it back. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19800 from viirya/SPARK-22591.	2017-11-24 11:46:58 +01:00
Wenchen Fan	0605ad7614	[SPARK-22543][SQL] fix java 64kb compile error for deeply nested expressions ## What changes were proposed in this pull request? A frequently reported issue of Spark is the Java 64kb compile error. This is because Spark generates a very big method and it's usually caused by 3 reasons: 1. a deep expression tree, e.g. a very complex filter condition 2. many individual expressions, e.g. expressions can have many children, operators can have many expressions. 3. a deep query plan tree (with whole stage codegen) This PR focuses on 1. There are already several patches(#15620 #18972 #18641) trying to fix this issue and some of them are already merged. However this is an endless job as every non-leaf expression has this issue. This PR proposes to fix this issue in `Expression.genCode`, to make sure the code for a single expression won't grow too big. According to maropu 's benchmark, no regression is found with TPCDS (thanks maropu !): https://docs.google.com/spreadsheets/d/1K3_7lX05-ZgxDXi9X_GleNnDjcnJIfoSlSCDZcL4gdg/edit?usp=sharing ## How was this patch tested? existing test Author: Wenchen Fan <wenchen@databricks.com> Author: Wenchen Fan <cloud0fan@gmail.com> Closes #19767 from cloud-fan/codegen.	2017-11-22 10:05:46 -08:00
Kazuaki Ishizaki	ac10171bea	[SPARK-22500][SQL] Fix 64KB JVM bytecode limit problem with cast ## What changes were proposed in this pull request? This PR changes `cast` code generation to place the generated code for the field expressions of a structure into separate methods if their size could be large. ## How was this patch tested? Added new test cases into `CastSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19730 from kiszk/SPARK-22500.	2017-11-21 22:24:43 +01:00
Kazuaki Ishizaki	9bdff0bcd8	[SPARK-22550][SQL] Fix 64KB JVM bytecode limit problem with elt ## What changes were proposed in this pull request? This PR changes `elt` code generation to place the generated code for argument expressions into separate methods if their size could be large. This PR resolves the case of `elt` with a lot of arguments. ## How was this patch tested? Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19778 from kiszk/SPARK-22550.	2017-11-21 12:19:11 +01:00
Kazuaki Ishizaki	c957714806	[SPARK-22508][SQL] Fix 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create() ## What changes were proposed in this pull request? This PR changes `GenerateUnsafeRowJoiner.create()` code generation to place the generated code for the statements that operate on the bitmap and offsets into separate methods if their size could be large. ## How was this patch tested? Added a new test case into `GenerateUnsafeRowJoinerSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19737 from kiszk/SPARK-22508.	2017-11-21 12:16:54 +01:00
Kazuaki Ishizaki	41c6f36018	[SPARK-22549][SQL] Fix 64KB JVM bytecode limit problem with concat_ws ## What changes were proposed in this pull request? This PR changes `concat_ws` code generation to place the generated code for argument expressions into separate methods if their size could be large. This PR resolves the case of `concat_ws` with a lot of arguments. ## How was this patch tested? Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19777 from kiszk/SPARK-22549.	2017-11-21 01:42:05 +01:00
Kazuaki Ishizaki	d54bfec2e0	[SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem with concat ## What changes were proposed in this pull request? This PR changes `concat` code generation to place the generated code for argument expressions into separate methods if their size could be large. This PR resolves the case of `concat` with a lot of arguments. ## How was this patch tested? Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19728 from kiszk/SPARK-22498.	2017-11-18 19:40:06 +01:00
Kazuaki Ishizaki	7f2e62ee6b	[SPARK-22501][SQL] Fix 64KB JVM bytecode limit problem with in ## What changes were proposed in this pull request? This PR changes `In` code generation to place the generated code for argument expressions into separate methods if their size could be large. ## How was this patch tested? Added new test cases into `PredicateSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19733 from kiszk/SPARK-22501.	2017-11-16 18:24:49 +01:00
Marco Gaido	4e7f07e255	[SPARK-22494][SQL] Fix 64KB limit exception with Coalesce and AtLeastNNonNulls ## What changes were proposed in this pull request? Both `Coalesce` and `AtLeastNNonNulls` can cause the 64KB limit exception when used with a lot of arguments and/or complex expressions. This PR splits their expressions in order to avoid the issue. ## How was this patch tested? Added UTs Author: Marco Gaido <marcogaido91@gmail.com> Author: Marco Gaido <mgaido@hortonworks.com> Closes #19720 from mgaido91/SPARK-22494.	2017-11-16 18:19:13 +01:00
Kazuaki Ishizaki	ed885e7a65	[SPARK-22499][SQL] Fix 64KB JVM bytecode limit problem with least and greatest ## What changes were proposed in this pull request? This PR changes `least` and `greatest` code generation to place the generated code for argument expressions into separate methods if their size could be large. This PR resolves two cases: * `least` with a lot of arguments * `greatest` with a lot of arguments ## How was this patch tested? Added a new test case into `ArithmeticExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19729 from kiszk/SPARK-22499.	2017-11-16 17:56:21 +01:00
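The 64KB fixes listed above share one idea: rather than emitting the code for every argument into a single method, the generator groups the snippets into several smaller methods and calls them in sequence. The following is a minimal sketch of that idea only. `MethodSplitter`, `group`, `render`, and the character budget are hypothetical names and parameters chosen for illustration; they are not Spark's code generation API, and source length is used here as a rough stand-in for the JVM's per-method bytecode limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Spark.
class MethodSplitter {
    // Groups statements so that each group stays within maxChars of source
    // text (a group always holds at least one statement).
    static List<List<String>> group(List<String> statements, int maxChars) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int size = 0;
        for (String stmt : statements) {
            // Start a new group when this statement would exceed the budget.
            if (!current.isEmpty() && size + stmt.length() > maxChars) {
                groups.add(current);
                current = new ArrayList<>();
                size = 0;
            }
            current.add(stmt);
            size += stmt.length();
        }
        if (!current.isEmpty()) groups.add(current);
        return groups;
    }

    // Renders one helper method per group plus an apply method calling them in order.
    static String render(List<List<String>> groups) {
        StringBuilder helpers = new StringBuilder();
        StringBuilder calls = new StringBuilder();
        for (int n = 0; n < groups.size(); n++) {
            helpers.append("private void apply_").append(n).append("(InternalRow i) {\n");
            for (String stmt : groups.get(n)) helpers.append("  ").append(stmt).append('\n');
            helpers.append("}\n");
            calls.append("  apply_").append(n).append("(i);\n");
        }
        return helpers + "public void apply(InternalRow i) {\n" + calls + "}\n";
    }
}
```

With three four-character statements and a budget of eight characters, `group` yields two groups, and `render` emits `apply_0`, `apply_1`, and an `apply` that calls both.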
liutang123	bc0848b4c1	[SPARK-22469][SQL] Accuracy problem in comparison with string and numeric ## What changes were proposed in this pull request? This fixes a problem caused by #15880 `select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive. ` When comparing a string with a numeric value, cast both to double, as Hive does. Author: liutang123 <liutang123@yeah.net> Closes #19692 from liutang123/SPARK-22469.	2017-11-15 09:02:54 -08:00
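The comparison rule described in SPARK-22469 can be illustrated outside Spark. This is a sketch under assumptions: `StringNumericComparison` and `greaterThan` are hypothetical names, Java's `Double.parseDouble` stands in for Spark's string-to-double cast (the two do not accept exactly the same inputs), and a Java `null` stands in for SQL NULL.

```java
// Hypothetical illustration; not Spark's implementation.
class StringNumericComparison {
    // Compares a string with a number by converting the string to double.
    // Returns null (standing in for SQL NULL) when the string is missing
    // or cannot be parsed as a number.
    static Boolean greaterThan(String left, double right) {
        if (left == null) return null;
        try {
            return Double.parseDouble(left.trim()) > right;
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Under these assumptions, `greaterThan("1.5", 0.5)` evaluates to true, matching the Hive result quoted in the commit message.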
Kazuaki Ishizaki	9bf696dbec	[SPARK-21720][SQL] Fix 64KB JVM bytecode limit problem with AND or OR ## What changes were proposed in this pull request? This PR changes `AND` or `OR` code generation to place condition and then expressions' generated code into separated methods if these size could be large. When the method is newly generated, variables for `isNull` and `value` are declared as an instance variable to pass these values (e.g. `isNull1409` and `value1409`) to the callers of the generated method. This PR resolved two cases: * large code size of left expression * large code size of right expression ## How was this patch tested? Added a new test case into `CodeGenerationSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18972 from kiszk/SPARK-21720.	2017-11-12 22:44:47 +01:00
Kazuaki Ishizaki	f2da738c76	[SPARK-22284][SQL] Fix 64KB JVM bytecode limit problem in calculating hash for nested structs ## What changes were proposed in this pull request? This PR avoids generating a huge method for calculating a murmur3 hash for nested structs. This PR splits a huge method (e.g. `apply_4`) into multiple smaller methods. Sample program ``` val structOfString = new StructType().add("str", StringType) var inner = new StructType() for (_ <- 0 until 800) { inner = inner.add("structOfString", structOfString) } var schema = new StructType() for (_ <- 0 until 50) { schema = schema.add("structOfStructOfStrings", inner) } GenerateMutableProjection.generate(Seq(Murmur3Hash(exprs, 42))) ``` Without this PR ``` /* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private InternalRow mutableRow; /* 009 */ private int value; /* 010 */ private int value_0; ... /* 034 */ public java.lang.Object apply(java.lang.Object _i) { /* 035 */ InternalRow i = (InternalRow) _i; /* 036 */ /* 037 */ /* 038 */ /* 039 */ value = 42; /* 040 */ apply_0(i); /* 041 */ apply_1(i); /* 042 */ apply_2(i); /* 043 */ apply_3(i); /* 044 */ apply_4(i); /* 045 */ nestedClassInstance.apply_5(i); ... /* 089 */ nestedClassInstance8.apply_49(i); /* 090 */ value_0 = value; /* 091 */ /* 092 */ // copy all the results into MutableRow /* 093 */ mutableRow.setInt(0, value_0); /* 094 */ return mutableRow; /* 095 */ } /* 096 */ /* 097 */ /* 098 */ private void apply_4(InternalRow i) { /* 099 */ /* 100 */ boolean isNull5 = i.isNullAt(4); /* 101 */ InternalRow value5 = isNull5 ? null : (i.getStruct(4, 800)); /* 102 */ if (!isNull5) { /* 103 */ /* 104 */ if (!value5.isNullAt(0)) { /* 105 */ /* 106 */ final InternalRow element6400 = value5.getStruct(0, 1); /* 107 */ /* 108 */ if (!element6400.isNullAt(0)) { /* 109 */ /* 110 */ final UTF8String element6401 = element6400.getUTF8String(0); /* 111 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6401.getBaseObject(), element6401.getBaseOffset(), element6401.numBytes(), value); /* 112 */ /* 113 */ } /* 114 */ /* 115 */ /* 116 */ } /* 117 */ /* 118 */ /* 119 */ if (!value5.isNullAt(1)) { /* 120 */ /* 121 */ final InternalRow element6402 = value5.getStruct(1, 1); /* 122 */ /* 123 */ if (!element6402.isNullAt(0)) { /* 124 */ /* 125 */ final UTF8String element6403 = element6402.getUTF8String(0); /* 126 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6403.getBaseObject(), element6403.getBaseOffset(), element6403.numBytes(), value); /* 127 */ /* 128 */ } /* 128 */ } /* 129 */ /* 130 */ /* 131 */ } /* 132 */ /* 133 */ /* 134 */ if (!value5.isNullAt(2)) { /* 135 */ /* 136 */ final InternalRow element6404 = value5.getStruct(2, 1); /* 137 */ /* 138 */ if (!element6404.isNullAt(0)) { /* 139 */ /* 140 */ final UTF8String element6405 = element6404.getUTF8String(0); /* 141 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6405.getBaseObject(), element6405.getBaseOffset(), element6405.numBytes(), value); /* 142 */ /* 143 */ } /* 144 */ /* 145 */ /* 146 */ } /* 147 */ ...
/* 12074 */ if (!value5.isNullAt(798)) { /* 12075 */ /* 12076 */ final InternalRow element7996 = value5.getStruct(798, 1); /* 12077 */ /* 12078 */ if (!element7996.isNullAt(0)) { /* 12079 */ /* 12080 */ final UTF8String element7997 = element7996.getUTF8String(0); /* 12083 */ } /* 12084 */ /* 12085 */ /* 12086 */ } /* 12087 */ /* 12088 */ /* 12089 */ if (!value5.isNullAt(799)) { /* 12090 */ /* 12091 */ final InternalRow element7998 = value5.getStruct(799, 1); /* 12092 */ /* 12093 */ if (!element7998.isNullAt(0)) { /* 12094 */ /* 12095 */ final UTF8String element7999 = element7998.getUTF8String(0); /* 12096 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element7999.getBaseObject(), element7999.getBaseOffset(), element7999.numBytes(), value); /* 12097 */ /* 12098 */ } /* 12099 */ /* 12100 */ /* 12101 */ } /* 12102 */ /* 12103 */ } /* 12104 */ /* 12105 */ } /* 12106 */ /* 12106 */ /* 12107 */ /* 12108 */ private void apply_1(InternalRow i) { ... ``` With this PR ``` /* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private InternalRow mutableRow; /* 009 */ private int value; /* 010 */ private int value_0; /* 011 */ ... /* 034 */ public java.lang.Object apply(java.lang.Object _i) { /* 035 */ InternalRow i = (InternalRow) _i; /* 036 */ /* 037 */ /* 038 */ /* 039 */ value = 42; /* 040 */ nestedClassInstance11.apply50_0(i); /* 041 */ nestedClassInstance11.apply50_1(i); ... /* 088 */ nestedClassInstance11.apply50_48(i); /* 089 */ nestedClassInstance11.apply50_49(i); /* 090 */ value_0 = value; /* 091 */ /* 092 */ // copy all the results into MutableRow /* 093 */ mutableRow.setInt(0, value_0); /* 094 */ return mutableRow; /* 095 */ } /* 096 */ ...
/* 37717 */ private void apply4_0(InternalRow value5, InternalRow i) { /* 37718 */ /* 37719 */ if (!value5.isNullAt(0)) { /* 37720 */ /* 37721 */ final InternalRow element6400 = value5.getStruct(0, 1); /* 37722 */ /* 37723 */ if (!element6400.isNullAt(0)) { /* 37724 */ /* 37725 */ final UTF8String element6401 = element6400.getUTF8String(0); /* 37726 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6401.getBaseObject(), element6401.getBaseOffset(), element6401.numBytes(), value); /* 37727 */ /* 37728 */ } /* 37729 */ /* 37730 */ /* 37731 */ } /* 37732 */ /* 37733 */ if (!value5.isNullAt(1)) { /* 37734 */ /* 37735 */ final InternalRow element6402 = value5.getStruct(1, 1); /* 37736 */ /* 37737 */ if (!element6402.isNullAt(0)) { /* 37738 */ /* 37739 */ final UTF8String element6403 = element6402.getUTF8String(0); /* 37740 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6403.getBaseObject(), element6403.getBaseOffset(), element6403.numBytes(), value); /* 37741 */ /* 37742 */ } /* 37743 */ /* 37744 */ /* 37745 */ } /* 37746 */ /* 37747 */ if (!value5.isNullAt(2)) { /* 37748 */ /* 37749 */ final InternalRow element6404 = value5.getStruct(2, 1); /* 37750 */ /* 37751 */ if (!element6404.isNullAt(0)) { /* 37752 */ /* 37753 */ final UTF8String element6405 = element6404.getUTF8String(0); /* 37754 */ value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6405.getBaseObject(), element6405.getBaseOffset(), element6405.numBytes(), value); /* 37755 */ /* 37756 */ } /* 37757 */ /* 37758 */ /* 37759 */ } /* 37760 */ /* 37761 */ } ... /* 218470 */ /* 218471 */ private void apply50_4(InternalRow i) { /* 218472 */ /* 218473 */ boolean isNull5 = i.isNullAt(4); /* 218474 */ InternalRow value5 = isNull5 ? null : (i.getStruct(4, 800)); /* 218475 */ if (!isNull5) { /* 218476 */ apply4_0(value5, i); /* 218477 */ apply4_1(value5, i); /* 218478 */ apply4_2(value5, i); ... /* 218742 */ nestedClassInstance.apply4_266(value5, i); /* 218743 */ } /* 218744 */ /* 218745 */ } ``` ## How was this patch tested? Added new test to `HashExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19563 from kiszk/SPARK-22284.	2017-11-10 21:17:49 +01:00
Shixiong Zhu	24ea781cd3	[SPARK-19644][SQL] Clean up Scala reflection garbage after creating Encoder ## What changes were proposed in this pull request? Because of the memory leak issue in `scala.reflect.api.Types.TypeApi.<:<` (https://github.com/scala/bug/issues/8302), creating an encoder may leak memory. This PR adds `cleanUpReflectionObjects` to clean up these leaking objects for methods calling `scala.reflect.api.Types.TypeApi.<:<`. ## How was this patch tested? The updated unit tests. Author: Shixiong Zhu <zsxwing@gmail.com> Closes #19687 from zsxwing/SPARK-19644.	2017-11-10 11:27:28 -08:00
Wenchen Fan	0025ddeb1d	[SPARK-22472][SQL] add null check for top-level primitive values ## What changes were proposed in this pull request? One powerful feature of `Dataset` is that we can easily map SQL rows to Scala/Java objects and do runtime null checks automatically. For example, let's say we have a parquet file with schema `<a: int, b: string>`, and we have a `case class Data(a: Int, b: String)`. Users can easily read this parquet file into `Data` objects, and Spark will throw an NPE if column `a` has null values. However, the null check is missing for top-level primitive values. For example, let's say we have a parquet file with schema `<a: Int>`, and we read it into Scala `Int`. If column `a` has null values, we will get some weird results. ``` scala> val ds = spark.read.parquet(...).as[Int] scala> ds.show() +----+ |v | +----+ |null| |1 | +----+ scala> ds.collect res0: Array[Long] = Array(0, 1) scala> ds.map(_ * 2).show +-----+ |value| +-----+ |-2 | |2 | +-----+ ``` This is because Spark internally uses some special default values for primitive types, but never expects users to see or operate on these default values directly. This PR adds a null check for top-level primitive values. ## How was this patch tested? new test Author: Wenchen Fan <wenchen@databricks.com> Closes #19707 from cloud-fan/bug.	2017-11-09 21:56:20 -08:00
Nathan Kronenfeld	b57ed2245c	[SPARK-22308][TEST-MAVEN] Support alternative unit testing styles in external applications Continuation of PR#19529 (https://github.com/apache/spark/pull/19529#issuecomment-340252119) The problem with the maven build in the previous PR was the new tests: the creation of a spark session outside the tests meant there was more than one spark session around at a time. I was using the spark session outside the tests so that the tests could share data; I've changed it so that each test creates the data anew. Author: Nathan Kronenfeld <nicole.oresme@gmail.com> Author: Nathan Kronenfeld <nkronenfeld@uncharted.software> Closes #19705 from nkronenfeld/alternative-style-tests-2.	2017-11-09 19:11:30 -08:00
jerryshao	6793a3dac0	[SPARK-22405][SQL] Add new alter table and alter database related ExternalCatalogEvent ## What changes were proposed in this pull request? We're building a data lineage tool in which we need to monitor the metadata changes in ExternalCatalog. The current ExternalCatalog already provides several useful events like "CreateDatabaseEvent" for a custom SparkListener to use, but some events are still missing, such as the alter database event and the alter table event. So here we propose to add new ExternalCatalogEvents. ## How was this patch tested? Enrich the current UT and tested on local cluster. CC hvanhovell please let me know your comments about current proposal, thanks. Author: jerryshao <sshao@hortonworks.com> Closes #19649 from jerryshao/SPARK-22405.	2017-11-09 11:57:56 +01:00
Liang-Chi Hsieh	40a8aefaf3	[SPARK-22442][SQL] ScalaReflection should produce correct field names for special characters ## What changes were proposed in this pull request? For a class with field names containing special characters, e.g.: ```scala case class MyType(`field.1`: String, `field 2`: String) ``` Although we can manipulate DataFrame/Dataset, the field names are encoded: ```scala scala> val df = Seq(MyType("a", "b"), MyType("c", "d")).toDF df: org.apache.spark.sql.DataFrame = [field$u002E1: string, field$u00202: string] scala> df.as[MyType].collect res7: Array[MyType] = Array(MyType(a,b), MyType(c,d)) ``` This causes a resolution problem when we try to convert data with non-encoded field names: ```scala spark.read.json(path).as[MyType] ... [info] org.apache.spark.sql.AnalysisException: cannot resolve '`field$u002E1`' given input columns: [field 2, field.1]; [info] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) ... ``` We should use the decoded field names in the Dataset schema. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19664 from viirya/SPARK-22442.	2017-11-09 11:54:50 +01:00
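The encoded names in SPARK-22442 (`field$u002E1`, `field$u00202`) replace each special character with `$u` followed by four hexadecimal digits. The sketch below decodes only that escape form. `FieldNameDecoder` is a hypothetical helper written for illustration; it is not the code in the patch and it does not handle the operator encodings (such as `$plus`) that Scala also uses. In Scala itself, `scala.reflect.NameTransformer.decode` covers the general case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; handles only the $uXXXX escape form.
class FieldNameDecoder {
    // Matches "$u" followed by exactly four hexadecimal digits.
    private static final Pattern ESCAPE = Pattern.compile("\\$u([0-9A-Fa-f]{4})");

    // Replaces every $uXXXX escape with the character it encodes.
    static String decode(String encoded) {
        Matcher m = ESCAPE.matcher(encoded);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or '\' in the decoded character.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Applied to the two encoded names from the commit message, `decode` returns `field.1` and `field 2`, which are the column names the JSON input actually carries.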
Dongjoon Hyun	98be55c0fa	[SPARK-22222][CORE][TEST][FOLLOW-UP] Remove redundant and deprecated `Timeouts` ## What changes were proposed in this pull request? Since SPARK-21939, Apache Spark uses `TimeLimits` instead of the deprecated `Timeouts`. This PR fixes the build warning `BufferHolderSparkSubmitSuite.scala` introduced at [SPARK-22222](https://github.com/apache/spark/pull/19460/files#diff-d8cf6e0c229969db94ec8ffc31a9239cR36) by removing the redundant `Timeouts`. ```scala trait Timeouts in package concurrent is deprecated: Please use org.scalatest.concurrent.TimeLimits instead [warn] with Timeouts { ``` ## How was this patch tested? N/A Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19697 from dongjoon-hyun/SPARK-22222.	2017-11-09 16:34:38 +09:00
Kazuaki Ishizaki	3bba8621cf	[SPARK-22378][SQL] Eliminate redundant null check in generated code for extracting an element from complex types ## What changes were proposed in this pull request? This PR eliminates the redundant null check in the code generated by `GetArrayItem`, `GetMapValue`, and `GetArrayStructFields` for extracting an element from complex types. Since code generation for these expressions does not take the `nullable` flag of a `DataType` such as `ArrayType` into account, the generated code always contains `isNullAt(index)`. This PR avoids generating `isNullAt(index)` if `nullable` is false in the `DataType`. Example ``` val nonNullArray = Literal.create(Seq(1), ArrayType(IntegerType, false)) checkEvaluation(GetArrayItem(nonNullArray, Literal(0)), 1) ``` Before this PR ``` /* 034 */ public java.lang.Object apply(java.lang.Object _i) { /* 035 */ InternalRow i = (InternalRow) _i; /* 036 */ /* 037 */ /* 038 */ /* 039 */ boolean isNull = true; /* 040 */ int value = -1; /* 041 */ /* 042 */ /* 043 */ /* 044 */ isNull = false; // resultCode could change nullability. /* 045 */ /* 046 */ final int index = (int) 0; /* 047 */ if (index >= ((ArrayData) references[0]).numElements() || index < 0 || ((ArrayData) references[0]).isNullAt(index)) { /* 048 */ isNull = true; /* 049 */ } else { /* 050 */ value = ((ArrayData) references[0]).getInt(index); /* 051 */ } /* 052 */ isNull_0 = isNull; /* 053 */ value_0 = value; /* 054 */ /* 055 */ // copy all the results into MutableRow /* 056 */ /* 057 */ if (!isNull_0) { /* 058 */ mutableRow.setInt(0, value_0); /* 059 */ } else { /* 060 */ mutableRow.setNullAt(0); /* 061 */ } /* 062 */ /* 063 */ return mutableRow; /* 064 */ } ``` After this PR (Line 47 is changed) ``` /* 034 */ public java.lang.Object apply(java.lang.Object _i) { /* 035 */ InternalRow i = (InternalRow) _i; /* 036 */ /* 037 */ /* 038 */ /* 039 */ boolean isNull = true; /* 040 */ int value = -1; /* 041 */ /* 042 */ /* 043 */ /* 044 */ isNull = false; // resultCode could change nullability. /* 045 */ /* 046 */ final int index = (int) 0; /* 047 */ if (index >= ((ArrayData) references[0]).numElements() || index < 0) { /* 048 */ isNull = true; /* 049 */ } else { /* 050 */ value = ((ArrayData) references[0]).getInt(index); /* 051 */ } /* 052 */ isNull_0 = isNull; /* 053 */ value_0 = value; /* 054 */ /* 055 */ // copy all the results into MutableRow /* 056 */ /* 057 */ if (!isNull_0) { /* 058 */ mutableRow.setInt(0, value_0); /* 059 */ } else { /* 060 */ mutableRow.setNullAt(0); /* 061 */ } /* 062 */ /* 063 */ return mutableRow; /* 064 */ } ``` ## How was this patch tested? Added test cases into `ComplexTypeSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19598 from kiszk/SPARK-22378.	2017-11-04 22:57:12 -07:00
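The change to line 47 in SPARK-22378 amounts to the generator consulting the element nullability before it appends the `isNullAt` clause. A minimal sketch of that decision follows. `ArrayItemCodegen`, `nullCondition`, and the `containsNull` parameter are hypothetical names for illustration, not the signature used inside Spark.

```java
// Hypothetical illustration of conditional null-check generation; not Spark's code.
class ArrayItemCodegen {
    // Builds the condition under which reading array[index] yields null.
    // The isNullAt clause is emitted only when the element type may hold nulls.
    static String nullCondition(String array, String index, boolean containsNull) {
        String outOfBounds = index + " >= " + array + ".numElements() || " + index + " < 0";
        return containsNull
            ? outOfBounds + " || " + array + ".isNullAt(" + index + ")"
            : outOfBounds;
    }
}
```

For a non-nullable element type the returned condition contains only the bounds check, which corresponds to the shorter form of line 47 shown in the commit message.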
Henry Robinson	6c6626614e	[SPARK-22211][SQL] Remove incorrect FOJ limit pushdown ## What changes were proposed in this pull request? It's not safe in all cases to push down a LIMIT below a FULL OUTER JOIN. If the limit is pushed to one side of the FOJ, the physical join operator can not tell if a row in the non-limited side would have a match in the other side. If the join operator guarantees that unmatched tuples from the limited side are emitted before any unmatched tuples from the other side, pushing down the limit is safe. But this is impractical for some join implementations, e.g. SortMergeJoin. For now, disable limit pushdown through a FULL OUTER JOIN, and we can evaluate whether a more complicated solution is necessary in the future. ## How was this patch tested? Ran org.apache.spark.sql.* tests. Altered full outer join tests in LimitPushdownSuite. Author: Henry Robinson <henry@cloudera.com> Closes #19647 from henryr/spark-22211.	2017-11-04 22:47:25 -07:00
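A small worked example shows why the pushdown removed in SPARK-22211 is unsafe. The sketch below implements a naive full outer equi-join over integer lists and compares the join of the full inputs with the join computed after limiting the left side first. `FullOuterJoinLimit` and its methods are hypothetical and exist only for this illustration; they say nothing about how Spark's join operators are implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; a naive in-memory join, not Spark's operator.
class FullOuterJoinLimit {
    // Full outer equi-join of two integer lists; rows are rendered as "(l, r)".
    static List<String> fullOuterJoin(List<Integer> left, List<Integer> right) {
        List<String> rows = new ArrayList<>();
        for (Integer l : left) {
            boolean matched = false;
            for (Integer r : right) {
                if (l.equals(r)) {
                    rows.add("(" + l + ", " + r + ")");
                    matched = true;
                }
            }
            if (!matched) rows.add("(" + l + ", null)");
        }
        // Right-side rows with no partner on the left are padded with null.
        for (Integer r : right) {
            if (!left.contains(r)) rows.add("(null, " + r + ")");
        }
        return rows;
    }

    // Keeps the first n rows, mimicking a LIMIT pushed below the join.
    static List<Integer> limit(List<Integer> rows, int n) {
        return new ArrayList<>(rows.subList(0, Math.min(n, rows.size())));
    }
}
```

With left = [1, 2] and right = [2, 3], the correct join is (1, null), (2, 2), (null, 3). Limiting the left side to one row before joining produces (1, null), (null, 2), (null, 3): the row (null, 2) does not exist in the correct result, because the join can no longer see that 2 had a match on the left.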
Wenchen Fan	2fd12af437	[SPARK-22306][SQL] alter table schema should not erase the bucketing metadata at hive side forward-port https://github.com/apache/spark/pull/19622 to master branch. This bug doesn't exist in master because we've added hive bucketing support and the hive bucketing metadata can be recognized by Spark, but we should still port it to master: 1) there may be other unsupported hive metadata removed by Spark. 2) reduce code difference between master and 2.2 to ease backports in the future. *** When we alter table schema, we set the new schema to spark `CatalogTable`, convert it to hive table, and finally call `hive.alterTable`. This causes a problem in Spark 2.2, because hive bucketing metadata is not recognized by Spark, which means a Spark `CatalogTable` representing a hive table is always non-bucketed, and when we convert it to hive table and call `hive.alterTable`, the original hive bucketing metadata will be removed. To fix this bug, we should read out the raw hive table metadata, update its schema, and call `hive.alterTable`. By doing this we can guarantee only the schema is changed, and nothing else. Author: Wenchen Fan <wenchen@databricks.com> Closes #19644 from cloud-fan/infer.	2017-11-02 23:41:16 +01:00
Henry Robinson	9f5c77ae32	[SPARK-21983][SQL] Fix Antlr 4.7 deprecation warnings ## What changes were proposed in this pull request? Fix three deprecation warnings introduced by move to ANTLR 4.7: * Use ParserRuleContext.addChild(TerminalNode) in preference to deprecated ParserRuleContext.addChild(Token) interface. * TokenStream.reset() is deprecated in favour of seek(0) * Replace use of deprecated ANTLRInputStream with stream returned by CharStreams.fromString() The last item changed the way we construct ANTLR's input stream (from direct instantiation to factory construction), so necessitated a change to how we override the LA() method to always return an upper-case char. The ANTLR object is now wrapped, rather than inherited-from. * Also fix incorrect usage of CharStream.getText() which expects the rhs of the supplied interval to be the last char to be returned, i.e. the interval is inclusive, and work around bug in ANTLR 4.7 where empty streams or intervals may cause getText() to throw an error. ## How was this patch tested? Ran all the sql tests. Confirmed that LA() override has coverage by breaking it, and noting that tests failed. Author: Henry Robinson <henry@apache.org> Closes #19578 from henryr/spark-21983.	2017-10-30 07:45:54 +00:00
gatorsmile	659acf18da	Revert "[SPARK-22308] Support alternative unit testing styles in external applications" This reverts commit `592cfeab9c`.	2017-10-29 10:37:25 -07:00
Wenchen Fan	7fdacbc77b	[SPARK-19727][SQL][FOLLOWUP] Fix for round function that modifies original column ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/17075 , to fix the bug in codegen path. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #19576 from cloud-fan/bug.	2017-10-28 18:24:18 -07:00
donnyzone	c42d208e19	[SPARK-22333][SQL] timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP) has conflicts with columnReference ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-22333 In the current version, users can use CURRENT_DATE() and CURRENT_TIMESTAMP() without specifying parentheses. However, when a table has columns named "current_date" or "current_timestamp", the name will still be parsed as a function call. There are many such cases in our production cluster. We get the wrong answer due to this inappropriate behavior. In general, ColumnReference should get higher priority than timeFunctionCall. ## How was this patch tested? unit test, manual test Author: donnyzone <wellfengzhu@gmail.com> Closes #19559 from DonnyZone/master.	2017-10-27 23:40:59 -07:00
Sathiya	01f6ba0e7a	[SPARK-22181][SQL] Adds ReplaceExceptWithFilter rule ## What changes were proposed in this pull request? Adds a new optimisation rule 'ReplaceExceptWithFilter' that replaces the Except logical operator with a Filter operator, and schedules it before the 'ReplaceExceptWithAntiJoin' rule. This way we can avoid an expensive join operation if one or both of the datasets of the Except operation are fully derived from Filters on the same parent. ## How was this patch tested? The patch is tested locally using spark-shell + unit test. Author: Sathiya <sathiya.kumar@polytechnique.edu> Closes #19451 from sathiyapk/SPARK-22181-optimize-exceptWithFilter.	2017-10-27 18:57:08 -07:00
Marco Gaido	b3d8fc3dc4	[SPARK-22226][SQL] splitExpression can create too many method calls in the outer class ## What changes were proposed in this pull request? SPARK-18016 introduced `NestedClass` to prevent the many methods generated by `splitExpressions` from contributing to the outer class' constant pool and making it grow too much. Unfortunately, although their definitions are stored in the `NestedClass`, they are all invoked in the outer class, and for each method invocation two entries are added to the constant pool: a `Methodref` and a `Utf8` entry (you can easily check this by compiling a simple sample class with `janinoc` and looking at its Constant Pool). This limits the scalability of the solution with very large methods which are split into a lot of small ones. This means that currently we are generating classes like this one: ``` class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection { ... public UnsafeRow apply(InternalRow i) { rowWriter.zeroOutNullBytes(); apply_0(i); apply_1(i); ... nestedClassInstance.apply_862(i); nestedClassInstance.apply_863(i); ... nestedClassInstance1.apply_1612(i); nestedClassInstance1.apply_1613(i); ... } ... private class NestedClass { private void apply_862(InternalRow i) { ... } private void apply_863(InternalRow i) { ... } ... } private class NestedClass1 { private void apply_1612(InternalRow i) { ... } private void apply_1613(InternalRow i) { ... } ... } } ``` This PR reduces the Constant Pool size of the outer class by adding a new method to each nested class: in this method we invoke all the small methods generated by `splitExpression` in that nested class. In this way, the outer class contains only one method invocation per nested class, which reduces by orders of magnitude the entries that method invocations add to its constant pool. 
This means that after the patch the generated code becomes: ``` class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection { ... public UnsafeRow apply(InternalRow i) { rowWriter.zeroOutNullBytes(); apply_0(i); apply_1(i); ... nestedClassInstance.apply(i); nestedClassInstance1.apply(i); ... } ... private class NestedClass { private void apply_862(InternalRow i) { ... } private void apply_863(InternalRow i) { ... } ... private void apply(InternalRow i) { apply_862(i); apply_863(i); ... } } private class NestedClass1 { private void apply_1612(InternalRow i) { ... } private void apply_1613(InternalRow i) { ... } ... private void apply(InternalRow i) { apply_1612(i); apply_1613(i); ... } } } ``` ## How was this patch tested? Added UT and existing UTs Author: Marco Gaido <mgaido@hortonworks.com> Author: Marco Gaido <marcogaido91@gmail.com> Closes #19480 from mgaido91/SPARK-22226.	2017-10-27 13:43:09 -07:00
Nathan Kronenfeld	592cfeab9c	[SPARK-22308] Support alternative unit testing styles in external applications ## What changes were proposed in this pull request? Support unit tests of external code (i.e., applications that use spark) using scalatest that don't want to use FunSuite. SharedSparkContext already supports this, but SharedSQLContext does not. I've introduced SharedSparkSession as a parent to SharedSQLContext, written in a way that it does support all scalatest styles. ## How was this patch tested? There are three new unit test suites added that just test using FunSpec, FlatSpec, and WordSpec. Author: Nathan Kronenfeld <nicole.oresme@gmail.com> Closes #19529 from nkronenfeld/alternative-style-tests-2.	2017-10-26 00:29:49 -07:00
Ruben Berenguel Montoro	427359f077	[SPARK-13947][SQL] The error message from using an invalid column reference is not clear ## What changes were proposed in this pull request? Rewritten error message for clarity. Added extra information in case of attribute name collision, hinting the user to double-check referencing two different tables ## How was this patch tested? No functional changes, only final message has changed. It has been tested manually against the situation proposed in the JIRA ticket. Automated tests in repository pass. This PR is original work from me and I license this work to the Spark project Author: Ruben Berenguel Montoro <ruben@mostlymaths.net> Author: Ruben Berenguel Montoro <ruben@dreamattic.com> Author: Ruben Berenguel <ruben@mostlymaths.net> Closes #17100 from rberenguel/SPARK-13947-error-message.	2017-10-24 23:02:11 -07:00
Marco Gaido	3f5ba968c5	[SPARK-22301][SQL] Add rule to Optimizer for In with not-nullable value and empty list ## What changes were proposed in this pull request? For performance reasons, we should resolve an `In` operation on an empty list to false in the optimization phase, as discussed in #19522. ## How was this patch tested? Added UT cc gatorsmile Author: Marco Gaido <marcogaido91@gmail.com> Author: Marco Gaido <mgaido@hortonworks.com> Closes #19523 from mgaido91/SPARK-22301.	2017-10-24 09:11:52 -07:00
Zhenhua Wang	f6290aea24	[SPARK-22285][SQL] Change implementation of ApproxCountDistinctForIntervals to TypedImperativeAggregate ## What changes were proposed in this pull request? The current implementation of `ApproxCountDistinctForIntervals` is `ImperativeAggregate`. The number of `aggBufferAttributes` is the total number of words in the hllppHelper array. Each hllppHelper has 52 words with the default relativeSD. Since this aggregate function is used in equi-height histogram generation, and the number of buckets in a histogram is usually in the hundreds, the number of `aggBufferAttributes` can easily reach tens of thousands or even more. This leads to a huge method in codegen and causes the error: ``` org.codehaus.janino.JaninoRuntimeException: Code of method "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB. ``` Besides, huge generated methods also result in performance regression. In this PR, we change its implementation to `TypedImperativeAggregate`. After the fix, `ApproxCountDistinctForIntervals` can deal with thousands of endpoints without throwing a codegen error, and performance improves from `20 sec` to `2 sec` in a test case of 500 endpoints. ## How was this patch tested? Test by an added test case and existing tests. Author: Zhenhua Wang <wangzhenhua@huawei.com> Closes #19506 from wzhfy/change_forIntervals_typedAgg.	2017-10-23 23:02:36 +01:00
Dongjoon Hyun	6412ea1759	[SPARK-21247][SQL] Type comparison should respect case-sensitive SQL conf ## What changes were proposed in this pull request? This is an effort to reduce the difference between Hive and Spark. Spark supports case-sensitivity in columns. Especially, for Struct types, with `spark.sql.caseSensitive=true`, the following is supported. ```scala scala> sql("select named_struct('a', 1, 'A', 2).a").show +--------------------------+ \|named_struct(a, 1, A, 2).a\| +--------------------------+ \| 1\| +--------------------------+ scala> sql("select named_struct('a', 1, 'A', 2).A").show +--------------------------+ \|named_struct(a, 1, A, 2).A\| +--------------------------+ \| 2\| +--------------------------+ ``` And vice versa, with `spark.sql.caseSensitive=false`, the following is supported. ```scala scala> sql("select named_struct('a', 1).A, named_struct('A', 1).a").show +--------------------+--------------------+ \|named_struct(a, 1).A\|named_struct(A, 1).a\| +--------------------+--------------------+ \| 1\| 1\| +--------------------+--------------------+ ``` However, types are considered different. For example, SET operations fail. ```scala scala> sql("SELECT named_struct('a',1) union all (select named_struct('A',2))").show org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<A:int> <> struct<a:int> at the first column of the second table;; 'Union :- Project [named_struct(a, 1) AS named_struct(a, 1)#57] : +- OneRowRelation$ +- Project [named_struct(A, 2) AS named_struct(A, 2)#58] +- OneRowRelation$ ``` This PR aims to support case-insensitive type equality. For example, in Set operation, the above operation succeed when `spark.sql.caseSensitive=false`. ```scala scala> sql("SELECT named_struct('a',1) union all (select named_struct('A',2))").show +------------------+ \|named_struct(a, 1)\| +------------------+ \| [1]\| \| [2]\| +------------------+ ``` ## How was this patch tested? 
Passes Jenkins with a newly added test case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18460 from dongjoon-hyun/SPARK-21247.	2017-10-14 00:35:12 +08:00
Wang Gengliang	2f00a71a87	[SPARK-22257][SQL] Reserve all non-deterministic expressions in ExpressionSet ## What changes were proposed in this pull request? For non-deterministic expressions, they should be considered as not contained in the [[ExpressionSet]]. This is consistent with how we define `semanticEquals` between two expressions. Otherwise, combining expressions will remove non-deterministic expressions which should be reserved. E.g. Combine filters of ```scala testRelation.where(Rand(0) > 0.1).where(Rand(0) > 0.1) ``` should result in ```scala testRelation.where(Rand(0) > 0.1 && Rand(0) > 0.1) ``` ## How was this patch tested? Unit test Author: Wang Gengliang <ltnwgl@gmail.com> Closes #19475 from gengliangwang/non-deterministic-expressionSet.	2017-10-12 22:45:19 -07:00
Zhenhua Wang	655f6f86f8	[SPARK-22208][SQL] Improve percentile_approx by not rounding up targetError and starting from index 0 ## What changes were proposed in this pull request? Currently percentile_approx never returns the first element when percentile is in (relativeError, 1/N], where relativeError default 1/10000, and N is the total number of elements. But ideally, percentiles in [0, 1/N] should all return the first element as the answer. For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1, because the first value 1 already reaches 10%. Currently it returns 2. Based on the paper, targetError is not rounded up, and searching index should start from 0 instead of 1. By following the paper, we should be able to fix the cases mentioned above. ## How was this patch tested? Added a new test case and fix existing test cases. Author: Zhenhua Wang <wzh_zju@163.com> Closes #19438 from wzhfy/improve_percentile_approx.	2017-10-11 00:16:12 -07:00
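The boundary case described in SPARK-22208 can be illustrated with an exact, rank-based lookup. This is a minimal sketch under stated assumptions, not Spark's implementation: `percentile_approx` works on a compressed quantile summary, whereas the hypothetical `ExactPercentile.percentile` below sorts the full input. It only shows why a zero-based search index makes every percentile in [0, 1/N] return the first element.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Spark.
class ExactPercentile {
    // Returns the smallest element whose cumulative share of the sorted data reaches p.
    static int percentile(int[] data, double p) {
        if (data.length == 0 || p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("need non-empty data and p in [0, 1]");
        }
        int[] sorted = data.clone();
        Arrays.sort(sorted);
        // 1-based rank of the answer; a rank of 0 (p == 0) is clamped to the first element.
        int rank = (int) Math.ceil(p * sorted.length);
        int index = Math.max(rank - 1, 0); // zero-based index, so the first element is reachable
        return sorted[index];
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(percentile(data, 0.1));
        System.out.println(percentile(data, 0.05));
    }
}
```

With the input 1 to 10 from the commit message, a 10% query (or anything smaller) lands on index 0, which is the behaviour the patch aims for.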
Kazuaki Ishizaki	76fb173dd6	[SPARK-21751][SQL] CodeGenerator.splitExpressions counts code size more precisely ## What changes were proposed in this pull request? Currently, `CodeGenerator.splitExpressions` splits statements into methods if the total length of the statements is more than 1024 characters. The length may include comments and empty lines. This PR excludes comments and empty lines from the length, to reduce the number of generated methods in a class, by using the `CodeFormatter.stripExtraNewLinesAndComments()` method. ## How was this patch tested? Existing tests Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18966 from kiszk/SPARK-21751.	2017-10-10 20:29:02 -07:00
gatorsmile	633ffd816d	rename the file.	2017-10-10 11:01:02 -07:00
Feng Liu	bebd2e1ce1	[SPARK-22222][CORE] Fix the ARRAY_MAX in BufferHolder and add a test ## What changes were proposed in this pull request? We should not break the assumption that the length of the allocated byte array is word rounded: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L170 So we want to use `Integer.MAX_VALUE - 15` instead of `Integer.MAX_VALUE - 8` as the upper bound of an allocated byte array. cc: srowen gatorsmile ## How was this patch tested? Since the Spark unit test JVM has less than 1GB of heap, we run the test code as a submitted job, so it can run on a JVM that has 4GB of memory. Author: Feng Liu <fengliu@databricks.com> Closes #19460 from liufengdb/fix_array_max.	2017-10-09 21:34:37 -07:00
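The arithmetic behind the `Integer.MAX_VALUE - 15` bound in SPARK-22222 can be checked with a small sketch. The class and method names below are hypothetical and are not taken from `BufferHolder`; the only assumption is an 8-byte word, as implied by the "word rounded" wording above.

```java
// Hypothetical helper for illustration only; not Spark's BufferHolder code.
class WordRounding {
    static final int WORD_SIZE = 8; // assumed word size in bytes

    // Rounds a byte count up to the next multiple of the word size.
    static long roundUpToWord(long numBytes) {
        return (numBytes + WORD_SIZE - 1) / WORD_SIZE * WORD_SIZE;
    }

    static boolean isWordAligned(long numBytes) {
        return numBytes % WORD_SIZE == 0;
    }

    public static void main(String[] args) {
        long oldBound = Integer.MAX_VALUE - 8L;  // 2147483639, not a multiple of 8
        long newBound = Integer.MAX_VALUE - 15L; // 2147483632, a multiple of 8
        System.out.println(isWordAligned(oldBound));
        System.out.println(isWordAligned(newBound));
        // Rounding a request at the old bound overshoots the bound itself.
        System.out.println(roundUpToWord(oldBound) > oldBound);
    }
}
```

Because the new bound is itself word aligned, any request at or below it still fits within the bound after rounding; the old bound does not have that property.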
Liang-Chi Hsieh	debcbec749	[SPARK-21947][SS] Check and report error when monotonically_increasing_id is used in streaming query ## What changes were proposed in this pull request? `monotonically_increasing_id` doesn't work in Structured Streaming. We should throw an exception if a streaming query uses it. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19336 from viirya/SPARK-21947.	2017-10-06 13:10:04 -07:00
Wenchen Fan	bb035f1ee5	[SPARK-22169][SQL] support byte length literal as identifier ## What changes were proposed in this pull request? By definition the table name in Spark can be something like `123x`, `25a`, etc., with exceptions for literals like `12L`, `23BD`, etc. However, Spark SQL has a special byte length literal, which prevents users from using digits followed by `b`, `k`, `m`, `g` as identifiers. The byte length literal is not a standard SQL literal and is only used in the `tableSample` parser rule. This PR moves the parsing of the byte length literal from the lexer to the parser, so that users can use such names as identifiers. ## How was this patch tested? regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #19392 from cloud-fan/parser-bug.	2017-10-04 13:13:51 -07:00
Takeshi Yamamuro	4a779bdac3	[SPARK-21871][SQL] Check actual bytecode size when compiling generated code ## What changes were proposed in this pull request? This pr added code to check actual bytecode size when compiling generated code. In #18810, we added code to give up code compilation and use interpreter execution in `SparkPlan` if the line number of generated functions goes over `maxLinesPerFunction`. But, we already have code to collect metrics for compiled bytecode size in `CodeGenerator` object. So,we could easily reuse the code for this purpose. ## How was this patch tested? Added tests in `WholeStageCodegenSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #19083 from maropu/SPARK-21871.	2017-10-04 10:08:24 -07:00
Jose Torres	3099c574c5	[SPARK-22136][SS] Implement stream-stream outer joins. ## What changes were proposed in this pull request? Allow one-sided outer joins between two streams when a watermark is defined. ## How was this patch tested? new unit tests Author: Jose Torres <jose@databricks.com> Closes #19327 from joseph-torres/outerjoin.	2017-10-03 21:42:51 -07:00
gatorsmile	5f69433453	[SPARK-22171][SQL] Describe Table Extended Failed when Table Owner is Empty ## What changes were proposed in this pull request? Users could hit `java.lang.NullPointerException` when the tables were created by Hive and the table's owner is `null` that are got from Hive metastore. `DESC EXTENDED` failed with the error: > SQLExecutionException: java.lang.NullPointerException at scala.collection.immutable.StringOps$.length$extension(StringOps.scala:47) at scala.collection.immutable.StringOps.length(StringOps.scala:47) at scala.collection.IndexedSeqOptimized$class.isEmpty(IndexedSeqOptimized.scala:27) at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:29) at scala.collection.TraversableOnce$class.nonEmpty(TraversableOnce.scala:111) at scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:29) at org.apache.spark.sql.catalyst.catalog.CatalogTable.toLinkedHashMap(interface.scala:300) at org.apache.spark.sql.execution.command.DescribeTableCommand.describeFormattedTableInfo(tables.scala:565) at org.apache.spark.sql.execution.command.DescribeTableCommand.run(tables.scala:543) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:66) at ## How was this patch tested? Added a unit test case Author: gatorsmile <gatorsmile@gmail.com> Closes #19395 from gatorsmile/desc.	2017-10-03 21:27:58 -07:00
Zhenhua Wang	365a29bdbf	[SPARK-22100][SQL] Make percentile_approx support date/timestamp type and change the output type to be the same as input type ## What changes were proposed in this pull request? The `percentile_approx` function previously accepted numeric type input and output double type results. But since all numeric types, date and timestamp types are represented as numerics internally, `percentile_approx` can support them easily. After this PR, it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles. This change is also required when we generate equi-height histograms for these types. ## How was this patch tested? Added a new test and modified some existing tests. Author: Zhenhua Wang <wangzhenhua@huawei.com> Closes #19321 from wzhfy/approx_percentile_support_types.	2017-09-25 09:28:42 -07:00
Tathagata Das	f32a842505	[SPARK-22053][SS] Stream-stream inner join in Append Mode ## What changes were proposed in this pull request? #### Architecture This PR implements stream-stream inner join using a two-way symmetric hash join. At a high level, we want to do the following. 1. For each stream, we maintain the past rows as state in State Store. - For each joining key, there can be multiple rows that have been received. - So, we have to effectively maintain a key-to-list-of-values multimap as state for each stream. 2. In each batch, for each input row in each stream - Look up the other streams state to see if there are matching rows, and output them if they satisfy the joining condition - Add the input row to corresponding stream’s state. - If the data has a timestamp/window column with watermark, then we will use that to calculate the threshold for keys that are required to buffered for future matches and drop the rest from the state. Cleaning up old unnecessary state rows depends completely on whether watermark has been defined and what are join conditions. We definitely want to support state clean up two types of queries that are likely to be common. - Queries to time range conditions - E.g. `SELECT * FROM leftTable, rightTable ON leftKey = rightKey AND leftTime > rightTime - INTERVAL 8 MINUTES AND leftTime < rightTime + INTERVAL 1 HOUR` - Queries with windows as the matching key - E.g. `SELECT * FROM leftTable, rightTable ON leftKey = rightKey AND window(leftTime, "1 hour") = window(rightTime, "1 hour")` (pseudo-SQL) #### Implementation The stream-stream join is primarily implemented in three classes - `StreamingSymmetricHashJoinExec` implements the above symmetric join algorithm. - `SymmetricsHashJoinStateManagers` manages the streaming state for the join. This essentially is a fault-tolerant key-to-list-of-values multimap built on the StateStore APIs. `StreamingSymmetricHashJoinExec` instantiates two such managers, one for each join side. 
- `StreamingSymmetricHashJoinExecHelper` is a helper class to extract the threshold for the state based on the join conditions and the event watermark. Refer to the scaladocs of these classes for more implementation details. Besides the implementation of the stream-stream inner join SparkPlan, some additional changes are - Allowed inner join in append mode in UnsupportedOperationChecker - Prevented a stream-stream join on an empty batch dataframe from being collapsed by the optimizer ## How was this patch tested? - New tests in StreamingJoinSuite - Updated tests in UnsupportedOperationSuite Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #19271 from tdas/SPARK-22053.	2017-09-21 15:39:07 -07:00
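The two-way symmetric hash join outlined in SPARK-22053 can be sketched in memory as follows. This is an illustration under stated assumptions, not the `StreamingSymmetricHashJoinExec` code: plain `HashMap`s stand in for the fault-tolerant state store, rows are simple strings, the join condition is key equality only, and watermark-based state cleanup is left out. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sketch; not Spark's implementation.
class SymmetricHashJoinSketch {
    // One key-to-list-of-values multimap per join side.
    private final Map<String, List<String>> leftState = new HashMap<>();
    private final Map<String, List<String>> rightState = new HashMap<>();

    // Processes a left-side row: emit matches from the right state, then remember the row.
    List<String> onLeft(String key, String value) {
        List<String> out = new ArrayList<>();
        for (String right : rightState.getOrDefault(key, Collections.<String>emptyList())) {
            out.add(value + "|" + right);
        }
        leftState.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        return out;
    }

    // Processes a right-side row symmetrically.
    List<String> onRight(String key, String value) {
        List<String> out = new ArrayList<>();
        for (String left : leftState.getOrDefault(key, Collections.<String>emptyList())) {
            out.add(left + "|" + value);
        }
        rightState.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        return out;
    }

    public static void main(String[] args) {
        SymmetricHashJoinSketch join = new SymmetricHashJoinSketch();
        System.out.println(join.onLeft("k", "l1"));  // nothing buffered on the right yet
        System.out.println(join.onRight("k", "r1")); // matches the buffered left row
    }
}
```

Each row is looked up against the other side before being buffered on its own side, so a pair of matching rows is emitted exactly once regardless of arrival order. Without the watermark-driven cleanup described in the commit, both maps grow without bound.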
Zhenhua Wang	1d1a09be9f	[SPARK-17997][SQL] Add an aggregation function for counting distinct values for multiple intervals ## What changes were proposed in this pull request? This work is a part of [SPARK-17074](https://issues.apache.org/jira/browse/SPARK-17074) to compute equi-height histograms. An equi-height histogram is an array of bins. A bin consists of two endpoints which form an interval of values and the number of distinct values (ndv) in that interval. This PR creates a new aggregate function that, given an array of endpoints, counts distinct values (ndv) in the intervals between those endpoints. This PR also refactors `HyperLogLogPlusPlus` by extracting a helper class `HyperLogLogPlusPlusHelper`, which holds the underlying HLLPP algorithm. ## How was this patch tested? Add new test cases. Author: Zhenhua Wang <wangzhenhua@huawei.com> Closes #15544 from wzhfy/countIntervals.	2017-09-21 21:43:02 +08:00
Kevin Yu	c66d64b3df	[SPARK-14878][SQL] Trim characters string function support #### What changes were proposed in this pull request? This PR enhances the TRIM function support in Spark SQL by allowing the specification of a trim character set. Below is the SQL syntax : ``` SQL <trim function> ::= TRIM <left paren> <trim operands> <right paren> <trim operands> ::= [ [ <trim specification> ] [ <trim character set> ] FROM ] <trim source> <trim source> ::= <character value expression> <trim specification> ::= LEADING \| TRAILING \| BOTH <trim character set> ::= <characters value expression> ``` or ``` SQL LTRIM (source-exp [, trim-exp]) RTRIM (source-exp [, trim-exp]) ``` Here are the documentation links for the support of this feature by other mainstream databases. - Oracle: [TRIM function](http://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2126.htm#OLADM704) - DB2: [TRIM scalar function](https://www.ibm.com/support/knowledgecenter/en/SSMKHH_10.0.0/com.ibm.etools.mft.doc/ak05270_.htm) - MySQL: [Trim function](http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_trim) - Oracle: [ltrim](https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2018.htm#OLADM594) - DB2: [ltrim](https://www.ibm.com/support/knowledgecenter/en/SSEPEK_11.0.0/sqlref/src/tpc/db2z_bif_ltrim.html) This PR implements the above enhancement. In the implementation, the design principle is to keep the changes to a minimum. Also, the existing trim functions (which handle a special case, i.e., trimming space characters) are kept unchanged for performance reasons. #### How was this patch tested? The unit test cases are added in the following files: - UTF8StringSuite.java - StringExpressionsSuite.scala - sql/SQLQuerySuite.scala - StringFunctionsSuite.scala Author: Kevin Yu <qyu@us.ibm.com> Closes #12646 from kevinyu98/spark-14878.	2017-09-18 12:12:35 -07:00
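The LEADING / TRAILING / BOTH semantics from the SPARK-14878 grammar can be sketched on plain Java strings. This is an illustrative sketch only: the hypothetical `TrimChars.trim` below works on UTF-16 `char` values and is not the `UTF8String` code the PR touches.

```java
// Hypothetical helper for illustration only; not Spark's UTF8String implementation.
class TrimChars {
    // Removes any character contained in trimChars from the selected ends of src.
    static String trim(String src, String trimChars, boolean leading, boolean trailing) {
        int start = 0;
        int end = src.length();
        if (leading) {
            while (start < end && trimChars.indexOf(src.charAt(start)) >= 0) {
                start++;
            }
        }
        if (trailing) {
            while (end > start && trimChars.indexOf(src.charAt(end - 1)) >= 0) {
                end--;
            }
        }
        return src.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(trim("xxSparkxx", "x", true, true));  // BOTH
        System.out.println(trim("xxSparkxx", "x", true, false)); // LEADING
        System.out.println(trim("xxSparkxx", "x", false, true)); // TRAILING
    }
}
```

The trim argument is treated as a set of characters rather than a prefix or suffix string, so `trim("xyxSQLyx", "xy", true, true)` strips any mix of `x` and `y` from both ends.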
Tathagata Das	88661747f5	[SPARK-22018][SQL] Preserve top-level alias metadata when collapsing projects ## What changes were proposed in this pull request? If there are two projects like as follows. ``` Project [a_with_metadata#27 AS b#26] +- Project [a#0 AS a_with_metadata#27] +- LocalRelation <empty>, [a#0, b#1] ``` Child Project has an output column with a metadata in it, and the parent Project has an alias that implicitly forwards the metadata. So this metadata is visible for higher operators. Upon applying CollapseProject optimizer rule, the metadata is not preserved. ``` Project [a#0 AS b#26] +- LocalRelation <empty>, [a#0, b#1] ``` This is incorrect, as downstream operators that expect certain metadata (e.g. watermark in structured streaming) to identify certain fields will fail to do so. This PR fixes it by preserving the metadata of top-level aliases. ## How was this patch tested? New unit test Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #19240 from tdas/SPARK-22018.	2017-09-14 22:32:16 -07:00
goldmedal	371e4e2053	[SPARK-21513][SQL] Allow UDF to_json support converting MapType to json # What changes were proposed in this pull request? UDF to_json currently only supports converting `StructType` or `ArrayType` of `StructType`s to a JSON output string. Following the discussion in JIRA SPARK-21513, this PR lets `to_json` also convert `MapType` and `ArrayType` of `MapType`s to a JSON output string. This PR is for the SQL and Scala APIs only. # How was this patch tested? Adding unit test case. cc viirya HyukjinKwon Author: goldmedal <liugs963@gmail.com> Author: Jia-Xuan Liu <liugs963@gmail.com> Closes #18875 from goldmedal/SPARK-21513.	2017-09-13 09:43:00 +09:00
Wang Gengliang	1a98574766	[SPARK-21979][SQL] Improve QueryPlanConstraints framework ## What changes were proposed in this pull request? Improve QueryPlanConstraints framework, make it robust and simple. In https://github.com/apache/spark/pull/15319, constraints for expressions like `a = f(b, c)` is resolved. However, for expressions like ```scala a = f(b, c) && c = g(a, b) ``` The current QueryPlanConstraints framework will produce non-converging constraints. Essentially, the problem is caused by having both the name and child of aliases in the same constraint set. We infer constraints, and push down constraints as predicates in filters, later on these predicates are propagated as constraints, etc.. Simply using the alias names only can resolve these problems. The size of constraints is reduced without losing any information. We can always get these inferred constraints on child of aliases when pushing down filters. Also, the EqualNullSafe between name and child in propagating alias is meaningless ```scala allConstraints += EqualNullSafe(e, a.toAttribute) ``` It just produces redundant constraints. ## How was this patch tested? Unit test Author: Wang Gengliang <ltnwgl@gmail.com> Closes #19201 from gengliangwang/QueryPlanConstraints.	2017-09-12 13:02:29 -07:00
Liang-Chi Hsieh	6b45d7e941	[SPARK-21954][SQL] JacksonUtils should verify MapType's value type instead of key type ## What changes were proposed in this pull request? `JacksonUtils.verifySchema` verifies if a data type can be converted to JSON. For `MapType`, it now verifies the key type. However, in `JacksonGenerator`, when converting a map to JSON, we only care about its values and create a writer for the values. The keys in a map are treated as strings by calling `toString` on the keys. Thus, we should change `JacksonUtils.verifySchema` to verify the value type of `MapType`. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19167 from viirya/test-jacksonutils.	2017-09-09 19:10:52 +09:00
Liang-Chi Hsieh	6e37524a1f	[SPARK-21726][SQL] Check for structural integrity of the plan in Optimizer in test mode. ## What changes were proposed in this pull request? We have many optimization rules now in `Optimizer`. Right now we don't have any checks in the optimizer for the structural integrity of the plan (e.g. resolved). When debugging, it is difficult to identify which rules return invalid plans. It would be great if, in test mode, we could check whether a plan is still resolved after the execution of each rule, so we can catch rules that return invalid plans. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18956 from viirya/SPARK-21726.	2017-09-07 23:12:18 -07:00
Jose Torres	acdf45fb52	[SPARK-21765] Check that optimization doesn't affect isStreaming bit. ## What changes were proposed in this pull request? Add an assert in logical plan optimization that the isStreaming bit stays the same, and fix empty relation rules where that wasn't happening. ## How was this patch tested? new and existing unit tests Author: Jose Torres <joseph.torres@databricks.com> Author: Jose Torres <joseph-torres@databricks.com> Closes #19056 from joseph-torres/SPARK-21765-followup.	2017-09-06 11:19:46 -07:00
Liang-Chi Hsieh	9f30d92803	[SPARK-21654][SQL] Complement SQL predicates expression description ## What changes were proposed in this pull request? SQL predicates don't have complete expression description. This patch goes to complement the description by adding arguments, examples. This change also adds related test cases for the SQL predicate expressions. ## How was this patch tested? Existing tests. And added predicate test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18869 from viirya/SPARK-21654.	2017-09-03 21:55:18 -07:00
Sean Owen	12ab7f7e89	[SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation …build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure ## What changes were proposed in this pull request? This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts. In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11. It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release. - Scalatest 2.x -> 3.0.3 - Chill 0.8.0 -> 0.8.4 - Clapper 1.0.x -> 1.1.2 - json4s 3.2.x -> 3.4.2 - Jackson 2.6.x -> 2.7.9 (required by json4s) This change does _not_ fully enable a Scala 2.12 build: - It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here - It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too. What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build. ## How was this patch tested? Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above. Author: Sean Owen <sowen@cloudera.com> Closes #18645 from srowen/SPARK-14280.	2017-09-01 19:21:21 +01:00
Andrew Ray	cba69aeb45	[SPARK-21110][SQL] Structs, arrays, and other orderable datatypes should be usable in inequalities ## What changes were proposed in this pull request? Allows `BinaryComparison` operators to work on any data type that actually supports ordering as verified by `TypeUtils.checkForOrderingExpr` instead of relying on the incomplete list `TypeCollection.Ordered` (which is removed by this PR). ## How was this patch tested? Updated unit tests to cover structs and arrays. Author: Andrew Ray <ray.andrew@gmail.com> Closes #18818 from aray/SPARK-21110.	2017-08-31 15:08:03 -07:00
Herman van Hovell	05af2de0fd	[SPARK-21830][SQL] Bump ANTLR version and fix a few issues. ## What changes were proposed in this pull request? This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump. The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse: ```sql SELECT * FROM RANGE(1000) WHERE TRUE AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' ``` This is caused by a know bug in ANTLR (https://github.com/antlr/antlr4/issues/994), which was fixed in version 4.6. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #19042 from hvanhovell/SPARK-21830.	2017-08-24 16:33:55 -07:00
Liang-Chi Hsieh	183d4cb71f	[SPARK-21759][SQL] In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery ## What changes were proposed in this pull request? With the check for structural integrity proposed in SPARK-21726, it is found that the optimization rule `PullupCorrelatedPredicates` can produce unresolved plans. For a correlated IN query looks like: SELECT t1.a FROM t1 WHERE t1.a IN (SELECT t2.c FROM t2 WHERE t1.b < t2.d); The query plan might look like: Project [a#0] +- Filter a#0 IN (list#4 [b#1]) : +- Project [c#2] : +- Filter (outer(b#1) < d#3) : +- LocalRelation <empty>, [c#2, d#3] +- LocalRelation <empty>, [a#0, b#1] After `PullupCorrelatedPredicates`, it produces query plan like: 'Project [a#0] +- 'Filter a#0 IN (list#4 [(b#1 < d#3)]) : +- Project [c#2, d#3] : +- LocalRelation <empty>, [c#2, d#3] +- LocalRelation <empty>, [a#0, b#1] Because the correlated predicate involves another attribute `d#3` in subquery, it has been pulled out and added into the `Project` on the top of the subquery. When `list` in `In` contains just one `ListQuery`, `In.checkInputDataTypes` checks if the size of `value` expressions matches the output size of subquery. In the above example, there is only `value` expression and the subquery output has two attributes `c#2, d#3`, so it fails the check and `In.resolved` returns `false`. We should not let `In.checkInputDataTypes` wrongly report unresolved plans to fail the structural integrity check. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18968 from viirya/SPARK-21759.	2017-08-24 21:46:58 +08:00
Jen-Ming Chung	95713eb4f2	[SPARK-21804][SQL] json_tuple returns null values within repeated columns except the first one ## What changes were proposed in this pull request? When json_tuple extracts values from JSON, it returns null values for repeated columns except the first one, as shown below: ``` scala scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 'a')""").show() +---+---+----+ \| c0\| c1\| c2\| +---+---+----+ \| 1\| 2\|null\| +---+---+----+ ``` I think this should be consistent with Hive's implementation: ``` hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a'); ... 1 1 ``` In this PR, we locate all the matched indices in `fieldNames` instead of returning only the first matched index, i.e., indexOf. ## How was this patch tested? Added test in JsonExpressionsSuite. Author: Jen-Ming Chung <jenmingisme@gmail.com> Closes #19017 from jmchung/SPARK-21804.	2017-08-24 19:24:00 +09:00
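The difference between a first-match lookup and an all-matches lookup, as described in SPARK-21804, can be sketched as follows. This is an illustration under stated assumptions: the JSON document is represented as an already-parsed `Map`, and both method names are hypothetical rather than taken from the `json_tuple` implementation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not Spark's json_tuple code.
class RepeatedFieldLookup {
    // Fills only the first requested column that matches each JSON key (indexOf behaviour).
    static String[] firstMatchOnly(Map<String, String> json, List<String> fieldNames) {
        String[] row = new String[fieldNames.size()];
        for (Map.Entry<String, String> entry : json.entrySet()) {
            int idx = fieldNames.indexOf(entry.getKey());
            if (idx >= 0) {
                row[idx] = entry.getValue();
            }
        }
        return row;
    }

    // Fills every requested column whose name matches the JSON key.
    static String[] allMatches(Map<String, String> json, List<String> fieldNames) {
        String[] row = new String[fieldNames.size()];
        for (Map.Entry<String, String> entry : json.entrySet()) {
            for (int i = 0; i < fieldNames.size(); i++) {
                if (fieldNames.get(i).equals(entry.getKey())) {
                    row[i] = entry.getValue();
                }
            }
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> json = new LinkedHashMap<>();
        json.put("a", "1");
        json.put("b", "2");
        List<String> fields = Arrays.asList("a", "b", "a");
        System.out.println(Arrays.toString(firstMatchOnly(json, fields))); // [1, 2, null]
        System.out.println(Arrays.toString(allMatches(json, fields)));     // [1, 2, 1]
    }
}
```

The first variant reproduces the `null` in the third column from the commit message; the second fills the repeated column as Hive does.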
10129659	b8aaef49fb	[SPARK-21807][SQL] Override ++ operation in ExpressionSet to reduce clone time ## What changes were proposed in this pull request? The getAliasedConstraints function in LogicalPlan.scala clones the expression set whenever an element is added, which takes a long time. This PR adds a function that adds multiple elements at once to reduce the clone time. Before the change, the cost of getAliasedConstraints is: 100 expressions: 41 seconds 150 expressions: 466 seconds After the change, the cost of getAliasedConstraints is: 100 expressions: 1.8 seconds 150 expressions: 6.5 seconds The test is like this: test("getAliasedConstraints") { val expressionNum = 150 val aggExpression = (1 to expressionNum).map(i => Alias(Count(Literal(1)), s"cnt$i")()) val aggPlan = Aggregate(Nil, aggExpression, LocalRelation()) val beginTime = System.currentTimeMillis() val expressions = aggPlan.validConstraints println(s"validConstraints cost: ${System.currentTimeMillis() - beginTime}ms") // The size of Aliased expression is n * (n - 1) / 2 + n assert( expressions.size === expressionNum * (expressionNum - 1) / 2 + expressionNum) } ## How was this patch tested? Ran the newly added test. Author: 10129659 <chen.yanshan@zte.com.cn> Closes #19022 from eatoncys/getAliasedConstraints.	2017-08-23 20:35:08 -07:00
Jose Torres	3c0c2d09ca	[SPARK-21765] Set isStreaming on leaf nodes for streaming plans. ## What changes were proposed in this pull request? All streaming logical plans will now have isStreaming set. This involved adding isStreaming as a case class arg in a few cases, since a node might be logically streaming depending on where it came from. ## How was this patch tested? Existing unit tests - no functional change is intended in this PR. Author: Jose Torres <joseph-torres@databricks.com> Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #18973 from joseph-torres/SPARK-21765.	2017-08-22 19:07:43 -07:00
Jen-Ming Chung	7ab951885f	[SPARK-21677][SQL] json_tuple throws NullPointException when column is null as string type ## What changes were proposed in this pull request? ``` scala scala> Seq(("""{"Hyukjin": 224, "John": 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show() ... java.lang.NullPointerException at ... ``` Currently a `null` field name throws a NullPointerException. Since a null field name can't match any field name in the JSON, we just output null as its column value. This PR achieves this by returning a very unlikely column name `__NullFieldName` when evaluating the field names. ## How was this patch tested? Added unit test. Author: Jen-Ming Chung <jenmingisme@gmail.com> Closes #18930 from jmchung/SPARK-21677.	2017-08-17 15:59:45 -07:00
Takeshi Yamamuro	6aad02d036	[SPARK-18394][SQL] Make an AttributeSet.toSeq output order consistent ## What changes were proposed in this pull request? This pr sorted output attributes on their name and exprId in `AttributeSet.toSeq` to make the order consistent. If the order is different, spark possibly generates different code and then misses cache in `CodeGenerator`, e.g., `GenerateColumnAccessor` generates code depending on an input attribute order. ## How was this patch tested? Added tests in `AttributeSetSuite` and manually checked if the cache worked well in the given query of the JIRA. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18959 from maropu/SPARK-18394.	2017-08-17 22:47:14 +02:00
10129659	1cce1a3b63	[SPARK-21603][SQL] The wholestage codegen will be much slower then that is closed when the function is too long ## What changes were proposed in this pull request? Disable whole-stage codegen when the generated function has more lines than the maximum set by the spark.sql.codegen.MaxFunctionLength parameter, because a function that is too long will not be JIT-optimized. A benchmark shows a 10x slowdown when the generated function is too long: ignore("max function length of wholestagecodegen") { val N = 20 << 15 val benchmark = new Benchmark("max function length of wholestagecodegen", N) def f(): Unit = sparkSession.range(N) .selectExpr( "id", "(id & 1023) as k1", "cast(id & 1023 as double) as k2", "cast(id & 1023 as int) as k3", "case when id > 100 and id <= 200 then 1 else 0 end as v1", "case when id > 200 and id <= 300 then 1 else 0 end as v2", "case when id > 300 and id <= 400 then 1 else 0 end as v3", "case when id > 400 and id <= 500 then 1 else 0 end as v4", "case when id > 500 and id <= 600 then 1 else 0 end as v5", "case when id > 600 and id <= 700 then 1 else 0 end as v6", "case when id > 700 and id <= 800 then 1 else 0 end as v7", "case when id > 800 and id <= 900 then 1 else 0 end as v8", "case when id > 900 and id <= 1000 then 1 else 0 end as v9", "case when id > 1000 and id <= 1100 then 1 else 0 end as v10", "case when id > 1100 and id <= 1200 then 1 else 0 end as v11", "case when id > 1200 and id <= 1300 then 1 else 0 end as v12", "case when id > 1300 and id <= 1400 then 1 else 0 end as v13", "case when id > 1400 and id <= 1500 then 1 else 0 end as v14", "case when id > 1500 and id <= 1600 then 1 else 0 end as v15", "case when id > 1600 and id <= 1700 then 1 else 0 end as v16", "case when id > 1700 and id <= 1800 then 1 else 0 end as v17", "case when id > 1800 and id <= 1900 then 1 else 0 end as v18") .groupBy("k1", "k2", "k3") .sum() .collect() benchmark.addCase(s"codegen = F") { iter => 
sparkSession.conf.set("spark.sql.codegen.wholeStage", "false") f() } benchmark.addCase(s"codegen = T") { iter => sparkSession.conf.set("spark.sql.codegen.wholeStage", "true") sparkSession.conf.set("spark.sql.codegen.MaxFunctionLength", "10000") f() } benchmark.run() /* Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14 on Windows 7 6.1 Intel64 Family 6 Model 58 Stepping 9, GenuineIntel max function length of wholestagecodegen: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ codegen = F 443 / 507 1.5 676.0 1.0X codegen = T 3279 / 3283 0.2 5002.6 0.1X */ } ## How was this patch tested? Run the unit test Author: 10129659 <chen.yanshan@zte.com.cn> Closes #18810 from eatoncys/codegen.	2017-08-16 09:12:20 -07:00
Marcelo Vanzin	3f958a9992	[SPARK-21731][BUILD] Upgrade scalastyle to 0.9. This version fixes a few issues in the import order checker; it provides better error messages, and detects more improper ordering (thus the need to change a lot of files in this patch). The main fix is that it correctly complains about the order of packages vs. classes. As part of the above, I moved some "SparkSession" import in ML examples inside the "$example on$" blocks; that didn't seem consistent across different source files to start with, and avoids having to add more on/off blocks around specific imports. The new scalastyle also seems to have a better header detector, so a few license headers had to be updated to match the expected indentation. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18943 from vanzin/SPARK-21731.	2017-08-15 13:59:00 -07:00
Wenchen Fan	14bdb25fd7	[SPARK-18464][SQL][FOLLOWUP] support old table which doesn't store schema in table properties ## What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/15900 , to fix one more bug: when the table schema is empty and needs to be inferred at runtime, we should not resolve parent plans before the schema has been inferred; otherwise the parent plans will be resolved against an empty schema and may produce wrong results for something like `select *`. The fix is to introduce `UnresolvedCatalogRelation` as a placeholder. Then we replace it with `LogicalRelation` or `HiveTableRelation` during analysis, so that it's guaranteed that we won't resolve parent plans until the schema has been inferred. ## How was this patch tested? regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #18907 from cloud-fan/bug.	2017-08-15 09:04:56 -07:00
aokolnychyi	5596ce83c4	[MINOR][SQL] Additional test case for CheckCartesianProducts rule ## What changes were proposed in this pull request? While discovering optimization rules and their test coverage, I did not find any tests for `CheckCartesianProducts` in the Catalyst folder. So, I decided to create a new test suite. Once I finished, I found a test in `JoinSuite` for this functionality so feel free to discard this change if it does not make much sense. The proposed test suite covers a few additional use cases. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #18909 from aokolnychyi/check-cartesian-join-tests.	2017-08-13 21:33:16 -07:00
Reynold Xin	584c7f1437	[SPARK-21699][SQL] Remove unused getTableOption in ExternalCatalog ## What changes were proposed in this pull request? This patch removes the unused SessionCatalog.getTableMetadataOption and ExternalCatalog.getTableOption. ## How was this patch tested? Removed the test case. Author: Reynold Xin <rxin@databricks.com> Closes #18912 from rxin/remove-getTableOption.	2017-08-10 18:56:25 -07:00
Jose Torres	0fb73253fc	[SPARK-21587][SS] Added filter pushdown through watermarks. ## What changes were proposed in this pull request? Push filter predicates through EventTimeWatermark if they're deterministic and do not reference the watermarked attribute. (This is similar but not identical to the logic for pushing through UnaryNode.) ## How was this patch tested? unit tests Author: Jose Torres <joseph-torres@databricks.com> Closes #18790 from joseph-torres/SPARK-21587.	2017-08-09 12:50:04 -07:00
gatorsmile	2d799d0808	[SPARK-21504][SQL] Add spark version info into table metadata ## What changes were proposed in this pull request? This PR is to add the spark version info in the table metadata. When creating the table, this value is assigned. It can help users find which version of Spark was used to create the table. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #18709 from gatorsmile/addVersion.	2017-08-09 08:46:25 -07:00
BartekH	438c381584	Add "full_outer" name to join types The "full_outer" join type name works in Spark 2.0, but it is not listed in the exception message. Please verify. Author: BartekH <bartekhamielec@gmail.com> Closes #17985 from BartekH/patch-1.	2017-08-06 16:40:59 -07:00
Takeshi Yamamuro	74b47845ea	[SPARK-20963][SQL][FOLLOW-UP] Use UnresolvedSubqueryColumnAliases for visitTableName ## What changes were proposed in this pull request? This pr (follow-up of #18772) used `UnresolvedSubqueryColumnAliases` for `visitTableName` in `AstBuilder`, which is a new unresolved `LogicalPlan` implemented in #18185. ## How was this patch tested? Existing tests Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18857 from maropu/SPARK-20963-FOLLOWUP.	2017-08-06 10:14:45 -07:00
Takeshi Yamamuro	990efad1c6	[SPARK-20963][SQL] Support column aliases for join relations in FROM clause ## What changes were proposed in this pull request? This pr added parsing rules to support column aliases for join relations in FROM clause. This pr is a sub-task of #18079. ## How was this patch tested? Added tests in `AnalysisSuite`, `PlanParserSuite,` and `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18772 from maropu/SPARK-20963-2.	2017-08-05 20:35:54 -07:00
Reynold Xin	5ad1796b9f	[SPARK-21634][SQL] Change OneRowRelation from a case object to case class ## What changes were proposed in this pull request? OneRowRelation is the only plan that is a case object, which causes some issues with makeCopy using a 0-arg constructor. This patch changes it from a case object to a case class. This blocks SPARK-21619. ## How was this patch tested? Should be covered by existing test cases. Author: Reynold Xin <rxin@databricks.com> Closes #18839 from rxin/SPARK-21634.	2017-08-04 10:36:08 -07:00
Yuming Wang	231f67247b	[SPARK-21205][SQL] pmod(number, 0) should be null. ## What changes were proposed in this pull request? Hive `pmod(3.13, 0)`: ```:sql hive> select pmod(3.13, 0); OK NULL Time taken: 2.514 seconds, Fetched: 1 row(s) hive> ``` Spark `mod(3.13, 0)`: ```:sql spark-sql> select mod(3.13, 0); NULL spark-sql> ``` But the Spark `pmod(3.13, 0)`: ```:sql spark-sql> select pmod(3.13, 0); 17/06/25 09:35:58 ERROR SparkSQLDriver: Failed in [select pmod(3.13, 0)] java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.Pmod.pmod(arithmetic.scala:504) at org.apache.spark.sql.catalyst.expressions.Pmod.nullSafeEval(arithmetic.scala:432) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:419) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:323) ... ``` This PR makes `pmod(number, 0)` return null. ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18413 from wangyum/SPARK-21205.	2017-08-04 12:06:08 +02:00
bravo-zhang	6b186c9d60	[SPARK-18950][SQL] Report conflicting fields when merging two StructTypes ## What changes were proposed in this pull request? Currently, StructType.merge() only reports data types of conflicting fields when merging two incompatible schemas. It would be nice to also report the field names for easier debugging. ## How was this patch tested? Unit test in DataTypeSuite. Print exception message when conflict is triggered. Author: bravo-zhang <mzhang1230@gmail.com> Closes #16365 from bravo-zhang/spark-18950.	2017-07-31 17:19:55 -07:00
Takeshi Yamamuro	6550086bbd	[SPARK-20962][SQL] Support subquery column aliases in FROM clause ## What changes were proposed in this pull request? This pr added parsing rules to support subquery column aliases in FROM clause. This pr is a sub-task of #18079. ## How was this patch tested? Added tests in `PlanParserSuite` and `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18185 from maropu/SPARK-20962.	2017-07-29 10:14:47 -07:00
Xingbo Jiang	92d85637e7	[SPARK-19451][SQL] rangeBetween method should accept Long value as boundary ## What changes were proposed in this pull request? Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert them to Int values. This can cause wrong results and should be fixed. Furthermore, we should accept any legal literal values as range frame boundaries. In this PR, we make it possible to use Long values, and make accepting other DataTypes really easy to add. This PR is mostly based on Herman's previous amazing work: `596f53c339` After this has been merged, we can close #16818. ## How was this patch tested? Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18540 from jiangxb1987/rangeFrame.	2017-07-29 10:11:31 -07:00
pj.fanning	2a53fbfce7	[SPARK-20871][SQL] limit logging of Janino code ## What changes were proposed in this pull request? When the generated code is greater than 64k, the Janino compile will fail and CodeGenerator.scala will log the entire code at Error level. SPARK-20871 suggests only logging the code at Debug level. Since the code is already logged at Debug level, this Pull Request proposes not including the formatted code in the Error logging and exception message at all. When an exception occurs, the code will be logged at Info level but truncated if it is more than 1000 lines long. ## How was this patch tested? Existing tests were run. An extra test case was added to CodeFormatterSuite to test the new maxLines parameter. Author: pj.fanning <pj.fanning@workday.com> Closes #18658 from pjfanning/SPARK-20871.	2017-07-23 10:38:03 -07:00
gatorsmile	ae253e5a87	[SPARK-21273][SQL][FOLLOW-UP] Propagate logical plan stats using visitor pattern and mixin ## What changes were proposed in this pull request? This PR is to add back the stats propagation of `Window` and remove the stats calculation of the leaf node `Range`, which has been covered by `9c32d2507d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala (L56)` ## How was this patch tested? Added two test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #18677 from gatorsmile/visitStats.	2017-07-19 10:57:15 +08:00
Wenchen Fan	f18b905f6c	[SPARK-21457][SQL] ExternalCatalog.listPartitions should correctly handle partition values with dot ## What changes were proposed in this pull request? When we list partitions from the Hive metastore with a partial partition spec, we expect exact matching on the partition values. However, Hive treats a dot specially and matches any single character for it. We should apply an extra filter to drop unexpected partitions. ## How was this patch tested? new regression test. Author: Wenchen Fan <wenchen@databricks.com> Closes #18671 from cloud-fan/hive.	2017-07-18 15:56:16 -07:00
Sean Owen	e26dac5feb	[SPARK-21415] Triage scapegoat warnings, part 1 ## What changes were proposed in this pull request? Address scapegoat warnings for: - BigDecimal double constructor - Catching NPE - Finalizer without super - List.size is O(n) - Prefer Seq.empty - Prefer Set.empty - reverse.map instead of reverseMap - Type shadowing - Unnecessary if condition. - Use .log1p - Var could be val In some instances like Seq.empty, I avoided making the change even where valid in test code to keep the scope of the change smaller. Those issues are concerned with performance and it won't matter for tests. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #18635 from srowen/Scapegoat1.	2017-07-18 08:47:17 +01:00
aokolnychyi	0be5fb41a6	[SPARK-21332][SQL] Incorrect result type inferred for some decimal expressions ## What changes were proposed in this pull request? This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below: ``` val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil) val sc = spark.sparkContext val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12))) val df = spark.createDataFrame(rdd, inputSchema) // Works correctly since no nested decimal expression is involved // Expected result type: (26, 6) * (26, 6) = (38, 12) df.select($"col" * $"col").explain(true) df.select($"col" * $"col").printSchema() // Gives a wrong result since there is a nested decimal expression that should be visited first // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18) df.select($"col" * $"col" * $"col").explain(true) df.select($"col" * $"col" * $"col").printSchema() ``` The example above gives the following output: ``` // Correct result without sub-expressions == Parsed Logical Plan == 'Project [('col * 'col) AS (col * col)#4] +- LogicalRDD [col#1] == Analyzed Logical Plan == (col * col): decimal(38,12) Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4] +- LogicalRDD [col#1] == Optimized Logical Plan == Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4] +- LogicalRDD [col#1] == Physical Plan == Project [CheckOverflow((col#1 col#1), DecimalType(38,12)) AS (col * col)#4] +- Scan ExistingRDD[col#1] // Schema root \|-- (col * col): decimal(38,12) (nullable = true) // Incorrect result with 
sub-expressions == Parsed Logical Plan == 'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Analyzed Logical Plan == ((col * col) * col): decimal(38,12) Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Optimized Logical Plan == Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Physical Plan == Project [CheckOverflow((cast(CheckOverflow((col#1 col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11] +- Scan ExistingRDD[col#1] // Schema root \|-- ((col * col) * col): decimal(38,12) (nullable = true) ``` ## How was this patch tested? This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #18583 from aokolnychyi/spark-21332.	2017-07-17 21:07:50 -07:00
Sean Owen	fd52a747fd	[SPARK-19810][SPARK-19810][MINOR][FOLLOW-UP] Follow-ups from to remove Scala 2.10 ## What changes were proposed in this pull request? Follow up to a few comments on https://github.com/apache/spark/pull/17150#issuecomment-315020196 that couldn't be addressed before it was merged. ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #18646 from srowen/SPARK-19810.2.	2017-07-17 09:22:42 +08:00
Kazuaki Ishizaki	ac5d5d7959	[SPARK-21344][SQL] BinaryType comparison does signed byte array comparison ## What changes were proposed in this pull request? This PR fixes a wrong comparison for `BinaryType`. It enables unsigned comparison and unsigned prefix generation for a byte array of `BinaryType`. The previous implementation used signed operations. ## How was this patch tested? Added a test suite in `OrderingSuite`. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18571 from kiszk/SPARK-21344.	2017-07-14 20:16:04 -07:00
Sean Owen	425c4ada4c	[SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 ## What changes were proposed in this pull request? - Remove Scala 2.10 build profiles and support - Replace some 2.10 support in scripts with commented placeholders for 2.12 later - Remove deprecated API calls from 2.10 support - Remove usages of deprecated context bounds where possible - Remove Scala 2.10 workarounds like ScalaReflectionLock - Other minor Scala warning fixes ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #17150 from srowen/SPARK-19810.	2017-07-13 17:06:24 +08:00
Takeshi Yamamuro	647963a26a	[SPARK-20460][SQL] Make it more consistent to handle column name duplication ## What changes were proposed in this pull request? This pr made it more consistent to handle column name duplication. In the current master, error handling is different when hitting column name duplication: ``` // json scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil) scala> Seq("""{"a":1, "a":1}"""""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data") scala> spark.read.format("json").schema(schema).load("/tmp/data").show org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#12, a#13.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153) scala> spark.read.format("json").load("/tmp/data").show org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found, cannot save to JSON format; at org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81) at org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferSchema(JsonDataSource.scala:63) at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:57) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176) // csv scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil) scala> Seq("a,a", "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data") scala> spark.read.format("csv").schema(schema).option("header", false).load("/tmp/data").show org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could 
be: a#41, a#42.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152) // If `inferSchema` is true, a CSV format is duplicate-safe (See SPARK-16896) scala> spark.read.format("csv").option("header", true).load("/tmp/data").show +---+---+ \| a0\| a1\| +---+---+ \| 1\| 1\| +---+---+ // parquet scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil) scala> Seq((1, 1)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data") scala> spark.read.format("parquet").schema(schema).option("header", false).load("/tmp/data").show org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#110, a#111.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) ``` When this patch applied, the results change to; ``` // json scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil) scala> Seq("""{"a":1, "a":1}"""""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data") scala> spark.read.format("json").schema(schema).load("/tmp/data").show org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a"; 
at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47) at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368) scala> spark.read.format("json").load("/tmp/data").show org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a"; at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47) at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156) // csv scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil) scala> Seq("a,a", "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data") scala> spark.read.format("csv").schema(schema).option("header", false).load("/tmp/data").show org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a"; at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47) at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) scala> spark.read.format("csv").option("header", true).load("/tmp/data").show +---+---+ \| a0\| a1\| 
+---+---+ \| 1\| 1\| +---+---+ // parquet scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil) scala> Seq((1, 1)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data") scala> spark.read.format("parquet").schema(schema).option("header", false).load("/tmp/data").show org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a"; at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47) at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368) ``` ## How was this patch tested? Added tests in `DataFrameReaderWriterSuite` and `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #17758 from maropu/SPARK-20460.	2017-07-10 15:58:34 +08:00
Xiao Li	c3712b77a9	[SPARK-21307][REVERT][SQL] Remove SQLConf parameters from the parser-related classes ## What changes were proposed in this pull request? Since we do not set active sessions when parsing the plan, we are unable to correctly use SQLConf.get to find the correct active session. Since https://github.com/apache/spark/pull/18531 breaks the build, I plan to revert it at first. ## How was this patch tested? The existing test cases Author: Xiao Li <gatorsmile@gmail.com> Closes #18568 from gatorsmile/revert18531.	2017-07-08 11:56:19 -07:00
Takeshi Yamamuro	7896e7b99d	[SPARK-21281][SQL] Use string types by default if array and map have no argument ## What changes were proposed in this pull request? This pr modified code to use string types by default if `array` and `map` in functions have no argument. This behaviour is the same with Hive one; ``` hive> CREATE TEMPORARY TABLE t1 AS SELECT map(); hive> DESCRIBE t1; _c0 map<string,string> hive> CREATE TEMPORARY TABLE t2 AS SELECT array(); hive> DESCRIBE t2; _c0 array<string> ``` ## How was this patch tested? Added tests in `DataFrameFunctionsSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18516 from maropu/SPARK-21281.	2017-07-07 23:05:38 -07:00
Wenchen Fan	fef081309f	[SPARK-21335][SQL] support un-aliased subquery ## What changes were proposed in this pull request? Un-aliased subqueries have been supported by Spark SQL for a long time. Their semantics were not well defined and had confusing behaviors, and they are not standard SQL syntax, so we disallowed them in https://issues.apache.org/jira/browse/SPARK-20690 . However, this is a breaking change, and we do have existing queries using un-aliased subqueries. We should add the support back and fix the semantics. This PR fixes un-aliased subqueries by assigning a default alias name. After this PR, there is no syntax change from branch 2.2 to master, but we invalidate a weird use case: `SELECT v.i from (SELECT i FROM v)`. Now this query will throw an analysis exception because users should not be able to use the qualifier inside a subquery. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #18559 from cloud-fan/sub-query.	2017-07-07 20:04:30 +08:00
Bogdan Raducanu	26ac085deb	[SPARK-21228][SQL] InSet incorrect handling of structs ## What changes were proposed in this pull request? When data type is struct, InSet now uses TypeUtils.getInterpretedOrdering (similar to EqualTo) to build a TreeSet. In other cases it will use a HashSet as before (which should be faster). Similarly, In.eval uses Ordering.equiv instead of equals. ## How was this patch tested? New test in SQLQuerySuite. Author: Bogdan Raducanu <bogdan@databricks.com> Closes #18455 from bogdanrdc/SPARK-21228.	2017-07-07 01:04:57 +08:00
Wang Gengliang	d540dfbff3	[SPARK-21273][SQL][FOLLOW-UP] Add missing test cases back and revise code style ## What changes were proposed in this pull request? Add missing test cases back and revise code style Follow up the previous PR: https://github.com/apache/spark/pull/18479 ## How was this patch tested? Unit test Author: Wang Gengliang <ltnwgl@gmail.com> Closes #18548 from gengliangwang/stat_propagation_revise.	2017-07-06 19:12:15 +08:00
gatorsmile	75b168fd30	[SPARK-21308][SQL] Remove SQLConf parameters from the optimizer ### What changes were proposed in this pull request? This PR removes SQLConf parameters from the optimizer rules ### How was this patch tested? The existing test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #18533 from gatorsmile/rmSQLConfOptimizer.	2017-07-06 14:18:50 +08:00
gatorsmile	c8e7f445b9	[SPARK-21307][SQL] Remove SQLConf parameters from the parser-related classes. ### What changes were proposed in this pull request? This PR is to remove SQLConf parameters from the parser-related classes. ### How was this patch tested? The existing test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #18531 from gatorsmile/rmSQLConfParser.	2017-07-05 11:06:15 -07:00
ouyangxiaochen	5787ace463	[SPARK-20383][SQL] Supporting Create [temporary] Function with the keyword 'OR REPLACE' and 'IF NOT EXISTS' ## What changes were proposed in this pull request? support to create [temporary] function with the keyword 'OR REPLACE' and 'IF NOT EXISTS' ## How was this patch tested? manual test and added test cases Please review http://spark.apache.org/contributing.html before opening a pull request. Author: ouyangxiaochen <ou.yangxiaochen@zte.com.cn> Closes #17681 from ouyangxiaochen/spark-419.	2017-07-05 20:46:42 +08:00
Takuya UESHIN	873f3ad2b8	[SPARK-16167][SQL] RowEncoder should preserve array/map type nullability. ## What changes were proposed in this pull request? Currently `RowEncoder` doesn't preserve the nullability of `ArrayType` or `MapType`. It always returns `containsNull = true` for `ArrayType` and `valueContainsNull = true` for `MapType`, and the nullability of the type itself is also always `true`. This PR fixes their nullability. ## How was this patch tested? Add tests to check if `RowEncoder` preserves array/map nullability. Author: Takuya UESHIN <ueshin@happy-camper.st> Author: Takuya UESHIN <ueshin@databricks.com> Closes #13873 from ueshin/issues/SPARK-16167.	2017-07-05 20:32:47 +08:00
Takuya UESHIN	ce10545d34	[SPARK-21300][SQL] ExternalMapToCatalyst should null-check map key prior to converting to internal value. ## What changes were proposed in this pull request? `ExternalMapToCatalyst` should null-check map key prior to converting to internal value to throw an appropriate Exception instead of something like NPE. ## How was this patch tested? Added a test and existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #18524 from ueshin/issues/SPARK-21300.	2017-07-05 11:24:38 +08:00
gatorsmile	29b1f6b09f	[SPARK-21256][SQL] Add withSQLConf to Catalyst Test ### What changes were proposed in this pull request? SQLConf is moved to Catalyst. We are adding more and more test cases for verifying the conf-specific behaviors. It is nice to add a helper function to simplify the test cases. ### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #18469 from gatorsmile/withSQLConf.	2017-07-04 08:54:07 -07:00
Wenchen Fan	f953ca56ec	[SPARK-21284][SQL] rename SessionCatalog.registerFunction parameter name ## What changes were proposed in this pull request? Looking at the code in `SessionCatalog.registerFunction`, the parameter `ignoreIfExists` is a wrong name. When `ignoreIfExists` is true, we will override the function if it already exists. So `overrideIfExists` should be the corrected name. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #18510 from cloud-fan/minor.	2017-07-03 10:51:44 -07:00
Reynold Xin	b1d719e7c9	[SPARK-21273][SQL] Propagate logical plan stats using visitor pattern and mixin ## What changes were proposed in this pull request? We currently implement statistics propagation directly in logical plan. Given we already have two different implementations, it'd make sense to actually decouple the two and add stats propagation using mixin. This would reduce the coupling between logical plan and statistics handling. This can also be a powerful pattern in the future to add additional properties (e.g. constraints). ## How was this patch tested? Should be covered by existing test cases. Author: Reynold Xin <rxin@databricks.com> Closes #18479 from rxin/stats-trait.	2017-06-30 21:10:23 -07:00
Wenchen Fan	4eb41879ce	[SPARK-17528][SQL] data should be copied properly before saving into InternalRow ## What changes were proposed in this pull request? For performance reasons, `UnsafeRow.getString`, `getStruct`, etc. return a "pointer" that points to a memory region of this unsafe row. This makes the unsafe projection a little dangerous, because all of its output rows share one instance. When we implement SQL operators, we should be careful not to cache the input rows, because they may be produced by an unsafe projection in a child operator and thus their content may change over time. However, when we update values of an InternalRow (e.g. in mutable projection and safe projection), we only copy UTF8String; we should also copy InternalRow, ArrayData and MapData. This PR fixes this, and also fixes the copy of various InternalRow, ArrayData and MapData implementations. ## How was this patch tested? new regression tests Author: Wenchen Fan <wenchen@databricks.com> Closes #18483 from cloud-fan/fix-copy.	2017-07-01 09:25:29 +08:00
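The hazard described in SPARK-17528 can be reproduced in miniature without Spark. The following is a sketch under stated assumptions: `ReusingProjection` is a hypothetical class that returns the same output buffer on every call, which mimics how all output rows of an unsafe projection share one instance. Caching its results without copying leaves every cached entry pointing at the last value written.

```java
import java.util.ArrayList;
import java.util.List;

final class SharedRowCopy {
    // Hypothetical projection that reuses one output buffer for every row.
    static final class ReusingProjection {
        private final int[] out = new int[1];

        int[] apply(int input) {
            out[0] = input * 10;
            return out; // the same instance is returned on every call
        }
    }

    // Buggy pattern: every cached entry aliases the shared buffer.
    static List<int[]> cacheWithoutCopy(int[] inputs) {
        ReusingProjection projection = new ReusingProjection();
        List<int[]> cached = new ArrayList<>();
        for (int in : inputs) {
            cached.add(projection.apply(in));
        }
        return cached;
    }

    // Safe pattern: copy the buffer before keeping a reference to it.
    static List<int[]> cacheWithCopy(int[] inputs) {
        ReusingProjection projection = new ReusingProjection();
        List<int[]> cached = new ArrayList<>();
        for (int in : inputs) {
            cached.add(projection.apply(in).clone());
        }
        return cached;
    }

    public static void main(String[] args) {
        int[] inputs = {1, 2, 3};
        System.out.println("without copy, first entry: " + cacheWithoutCopy(inputs).get(0)[0]);
        System.out.println("with copy, first entry: " + cacheWithCopy(inputs).get(0)[0]);
    }
}
```

A shallow `clone()` is enough here because the buffer holds primitives; nested data would need a deep copy, which is the gap the commit closes for InternalRow, ArrayData and MapData.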
Xiao Li	eed9c4ef85	[SPARK-21129][SQL] Arguments of SQL function call should not be named expressions ### What changes were proposed in this pull request? Function arguments should not be named expressions. Allowing them could cause two issues: - Misleading error message - Unexpected query results when the column name is `distinct`, which is not a reserved word in our parser. ``` spark-sql> select count(distinct c1, distinct c2) from t1; Error in query: cannot resolve '`distinct`' given input columns: [c1, c2]; line 1 pos 26; 'Project [unresolvedalias('count(c1#30, 'distinct), None)] +- SubqueryAlias t1 +- CatalogRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#30, c2#31] ``` After the fix, the error message becomes ``` spark-sql> select count(distinct c1, distinct c2) from t1; Error in query: extraneous input 'c2' expecting {')', ',', '.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '||', '^'}(line 1, pos 35) == SQL == select count(distinct c1, distinct c2) from t1 -----------------------------------^^^ ``` ### How was this patch tested? Added a test case to parser suite. Author: Xiao Li <gatorsmile@gmail.com> Author: gatorsmile <gatorsmile@gmail.com> Closes #18338 from gatorsmile/parserDistinctAggFunc.	2017-06-30 14:23:56 -07:00
wangzhenhua	82e24912d6	[SPARK-21237][SQL] Invalidate stats once table data is changed ## What changes were proposed in this pull request? Invalidate spark's stats after data changing commands: - InsertIntoHadoopFsRelationCommand - InsertIntoHiveTable - LoadDataCommand - TruncateTableCommand - AlterTableSetLocationCommand - AlterTableDropPartitionCommand ## How was this patch tested? Added test cases. Author: wangzhenhua <wangzhenhua@huawei.com> Closes #18449 from wzhfy/removeStats.	2017-06-29 11:32:29 +08:00
Wang Gengliang	b72b8521d9	[SPARK-21222] Move elimination of Distinct clause from analyzer to optimizer ## What changes were proposed in this pull request? Move elimination of the Distinct clause from the analyzer to the optimizer. The Distinct clause is useless inside a MAX/MIN aggregate. For example, "SELECT MAX(DISTINCT a) FROM src" is equivalent to "SELECT MAX(a) FROM src". However, this optimization is implemented in the analyzer. It should be in the optimizer. ## How was this patch tested? Unit test gatorsmile cloud-fan Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Wang Gengliang <ltnwgl@gmail.com> Closes #18429 from gengliangwang/distinct_opt.	2017-06-29 08:47:31 +08:00
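The equivalence behind SPARK-21222 is easy to check outside SQL: removing duplicates cannot change the largest or smallest element. A minimal sketch in plain Java, where `DistinctMax` and its methods are hypothetical names used only for illustration:

```java
import java.util.Arrays;
import java.util.OptionalInt;

final class DistinctMax {
    // Analogue of MAX(DISTINCT a): deduplicate first, then take the maximum.
    static OptionalInt maxDistinct(int[] a) {
        return Arrays.stream(a).distinct().max();
    }

    // Analogue of MAX(a): take the maximum directly.
    static OptionalInt maxPlain(int[] a) {
        return Arrays.stream(a).max();
    }

    public static void main(String[] args) {
        int[] a = {3, 7, 7, 1, 3};
        System.out.println(maxDistinct(a).equals(maxPlain(a)));
    }
}
```

Since both forms agree for every input, including the empty one, dropping the deduplication step only removes work.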
Xiao Li	03eb6117af	[SPARK-21164][SQL] Remove isTableSample from Sample and isGenerated from Alias and AttributeReference ## What changes were proposed in this pull request? `isTableSample` and `isGenerated ` were introduced for SQL Generation respectively by https://github.com/apache/spark/pull/11148 and https://github.com/apache/spark/pull/11050 Since SQL Generation is removed, we do not need to keep `isTableSample`. ## How was this patch tested? The existing test cases Author: Xiao Li <gatorsmile@gmail.com> Closes #18379 from gatorsmile/CleanSample.	2017-06-23 14:48:33 -07:00
Dilip Biswal	13c2a4f2f8	[SPARK-20417][SQL] Move subquery error handling to checkAnalysis from Analyzer ## What changes were proposed in this pull request? Currently we do a lot of validations for subquery in the Analyzer. We should move them to CheckAnalysis which is the framework to catch and report Analysis errors. This was mentioned as a review comment in SPARK-18874. ## How was this patch tested? Exists tests + A few tests added to SQLQueryTestSuite. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #17713 from dilipbiswal/subquery_checkanalysis.	2017-06-23 11:02:54 -07:00
wangzhenhua	b803b66a81	[SPARK-21180][SQL] Remove conf from stats functions since now we have conf in LogicalPlan ## What changes were proposed in this pull request? After wiring `SQLConf` in logical plan ([PR 18299](https://github.com/apache/spark/pull/18299)), we can remove the need of passing `conf` into `def stats` and `def computeStats`. ## How was this patch tested? Covered by existing tests, plus some modified existing tests. Author: wangzhenhua <wangzhenhua@huawei.com> Author: Zhenhua Wang <wzh_zju@163.com> Closes #18391 from wzhfy/removeConf.	2017-06-23 10:33:53 -07:00
Xingbo Jiang	cad88f17e8	[SPARK-17851][SQL][TESTS] Make sure all test sqls in catalyst pass checkAnalysis ## What changes were proposed in this pull request? Currently, several dozen test SQL queries in catalyst fail at `SimpleAnalyzer.checkAnalysis`; we should make sure they are valid. This PR makes the following changes: 1. Apply `checkAnalysis` to plans that test `Optimizer` rules, but don't require the test cases for `Parser`/`Analyzer` to pass `checkAnalysis`; 2. Fix the test cases for `Optimizer` that would otherwise fail. ## How was this patch tested? Apply `SimpleAnalyzer.checkAnalysis` on plans in `PlanTest.comparePlans`, update invalid test cases. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15417 from jiangxb1987/cptest.	2017-06-21 09:40:06 -07:00
Xiao Li	9413b84b5a	[SPARK-21132][SQL] DISTINCT modifier of function arguments should not be silently ignored ### What changes were proposed in this pull request? We should not silently ignore `DISTINCT` when they are not supported in the function arguments. This PR is to block these cases and issue the error messages. ### How was this patch tested? Added test cases for both regular functions and window functions Author: Xiao Li <gatorsmile@gmail.com> Closes #18340 from gatorsmile/firstCount.	2017-06-19 15:51:21 +08:00
Yuming Wang	f913f158ec	[SPARK-20948][SQL] Built-in SQL Function UnaryMinus/UnaryPositive support string type ## What changes were proposed in this pull request? Built-in SQL Function UnaryMinus/UnaryPositive support string type, if it's string type, convert it to double type, after this PR: ```sql spark-sql> select positive('-1.11'), negative('-1.11'); -1.11 1.11 spark-sql> ``` ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18173 from wangyum/SPARK-20948.	2017-06-18 20:14:05 -07:00
Yuming Wang	53e48f73e4	[SPARK-20931][SQL] ABS function support string type. ## What changes were proposed in this pull request? ABS function support string type. Hive/MySQL support this feature. Ref: `4ba713ccd8/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAbs.java (L93)` ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18153 from wangyum/SPARK-20931.	2017-06-16 09:40:58 -07:00
Kazuaki Ishizaki	7a3e5dc28b	[SPARK-20749][SQL] Built-in SQL Function Support - all variants of LEN[GTH] ## What changes were proposed in this pull request? This PR adds built-in SQL function `BIT_LENGTH()`, `CHAR_LENGTH()`, and `OCTET_LENGTH()` functions. `BIT_LENGTH()` returns the bit length of the given string or binary expression. `CHAR_LENGTH()` returns the length of the given string or binary expression. (i.e. equal to `LENGTH()`) `OCTET_LENGTH()` returns the byte length of the given string or binary expression. ## How was this patch tested? Added new test suites for these three functions Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18046 from kiszk/SPARK-20749.	2017-06-15 23:06:58 -07:00
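The relationship between the three length variants in SPARK-20749 can be sketched in plain Java. This is an illustration, not Spark's implementation: it assumes strings are measured in their UTF-8 encoding and that the character length counts code points, and the class `LengthVariants` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

final class LengthVariants {
    // Analogue of OCTET_LENGTH: number of bytes in the UTF-8 encoding (assumed).
    static int octetLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Analogue of BIT_LENGTH: eight bits per byte.
    static int bitLength(String s) {
        return octetLength(s) * 8;
    }

    // Analogue of CHAR_LENGTH: number of characters, counted as code points.
    static int charLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // four characters, the last one takes two bytes
        System.out.println(charLength(s) + " chars, " + octetLength(s) + " bytes, " + bitLength(s) + " bits");
    }
}
```

The three values differ only for non-ASCII input, where one character occupies more than one byte.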
Xianyang Liu	87ab0cec65	[SPARK-21072][SQL] TreeNode.mapChildren should only apply to the children node. ## What changes were proposed in this pull request? As the name and comments of `TreeNode.mapChildren` state, the function should be applied only to the children of the current node. So, the following code should check whether an argument is actually a child node. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342 ## How was this patch tested? Existing tests. Author: Xianyang Liu <xianyang.liu@intel.com> Closes #18284 from ConeyLiu/treenode.	2017-06-16 12:10:09 +08:00
ALeksander Eskilson	b32b2123dd	[SPARK-18016][SQL][CATALYST] Code Generation: Constant Pool Limit - Class Splitting ## What changes were proposed in this pull request? This pull-request exclusively includes the class splitting feature described in #16648. When code for a given class would grow beyond 1600k bytes, a private, nested sub-class is generated into which subsequent functions are inlined. Additional sub-classes are generated as the code threshold is met subsequent times. This code includes 3 changes: 1. Includes helper maps, lists, and functions for keeping track of sub-classes during code generation (included in the `CodeGenerator` class). These helper functions allow nested classes and split functions to be initialized/declared/inlined to the appropriate locations in the various projection classes. 2. Changes `addNewFunction` to return a string to support instances where a split function is inlined to a nested class and not the outer class (and so must be invoked using the class-qualified name). Uses of `addNewFunction` throughout the codebase are modified so that the returned name is properly used. 3. Removes instances of the `this` keyword when used on data inside generated classes. All state declared in the outer class is by default global and accessible to the nested classes. However, if a reference to global state in a nested class is prepended with the `this` keyword, it would attempt to reference state belonging to the nested class (which would not exist), rather than the correct variable belonging to the outer class. ## How was this patch tested? Added a test case to the `GeneratedProjectionSuite` that increases the number of columns tested in various projections to a threshold that would previously have triggered a `JaninoRuntimeException` for the Constant Pool. Note: This PR does not address the second Constant Pool issue with code generation (also mentioned in #16648): excess global mutable state. 
A second PR may be opened to resolve that issue. Author: ALeksander Eskilson <alek.eskilson@cerner.com> Closes #18075 from bdrillard/class_splitting_only.	2017-06-15 13:45:08 +08:00
Reynold Xin	fffeb6d7c3	[SPARK-21092][SQL] Wire SQLConf in logical plan and expressions ## What changes were proposed in this pull request? It is really painful to not have configs in logical plan and expressions. We had to add all sorts of hacks (e.g. pass SQLConf explicitly in functions). This patch exposes SQLConf in logical plan, using a thread local variable and a getter closure that's set once there is an active SparkSession. The implementation is a bit of a hack, since we didn't anticipate this need in the beginning (config was only exposed in physical plan). The implementation is described in `SQLConf.get`. In terms of future work, we should follow up to clean up CBO (remove the need for passing in config). ## How was this patch tested? Updated relevant tests for constraint propagation. Author: Reynold Xin <rxin@databricks.com> Closes #18299 from rxin/SPARK-21092.	2017-06-14 22:11:41 -07:00
Yuming Wang	4d01aa4648	[SPARK-20754][SQL][FOLLOWUP] Add Function Alias For MOD/POSITION. ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/18106 supports TRUNC (number). We should also add function aliases for `MOD` and `POSITION`. `POSITION(substr IN str)` is a synonym for `LOCATE(substr,str)`, the same as in MySQL: https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_position ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18206 from wangyum/SPARK-20754-mod&position.	2017-06-13 23:39:06 -07:00
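The synonym relationship in SPARK-20754 amounts to one function delegating to the other. A minimal sketch in plain Java, assuming the MySQL convention of a 1-based result with 0 meaning "not found"; the class `PositionAlias` and its methods are hypothetical and do not reproduce Spark's code:

```java
final class PositionAlias {
    // Analogue of LOCATE(substr, str): 1-based index of the first match, 0 if absent.
    static int locate(String substr, String str) {
        return str.indexOf(substr) + 1;
    }

    // Analogue of POSITION(substr IN str): the same function under another name.
    static int position(String substr, String str) {
        return locate(substr, str);
    }

    public static void main(String[] args) {
        System.out.println(position("bar", "foobar"));
        System.out.println(position("baz", "foobar"));
    }
}
```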
Dongjoon Hyun	2639c3ed03	[SPARK-19910][SQL] `stack` should not reject NULL values due to type mismatch ## What changes were proposed in this pull request? Since `stack` function generates a table with nullable columns, it should allow mixed null values. ```scala scala> sql("select stack(3, 1, 2, 3)").printSchema root |-- col0: integer (nullable = true) scala> sql("select stack(3, 1, 2, null)").printSchema org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); line 1 pos 7; ``` ## How was this patch tested? Pass the Jenkins with a new test case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #17251 from dongjoon-hyun/SPARK-19910.	2017-06-12 21:18:43 -07:00
Reynold Xin	b1436c7496	[SPARK-21059][SQL] LikeSimplification can NPE on null pattern ## What changes were proposed in this pull request? This patch fixes a bug that can cause NullPointerException in LikeSimplification, when the pattern for like is null. ## How was this patch tested? Added a new unit test case in LikeSimplificationSuite. Author: Reynold Xin <rxin@databricks.com> Closes #18273 from rxin/SPARK-21059.	2017-06-12 14:07:51 -07:00
aokolnychyi	ca4e960aec	[SPARK-17914][SQL] Fix parsing of timestamp strings with nanoseconds The PR contains a tiny change to fix the way Spark parses string literals into timestamps. Currently, some timestamps that contain nanoseconds are corrupted during the conversion from internal UTF8Strings into the internal representation of timestamps. Consider the following example: ``` spark.sql("SELECT cast('2015-01-02 00:00:00.000000001' as TIMESTAMP)").show(false) +------------------------------------------------+ |CAST(2015-01-02 00:00:00.000000001 AS TIMESTAMP)| +------------------------------------------------+ |2015-01-02 00:00:00.000001 | +------------------------------------------------+ ``` The fix was tested with existing tests. Also, there is a new test to cover cases that did not work previously. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #18252 from aokolnychyi/spark-17914.	2017-06-12 13:06:14 -07:00
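The corruption shown in SPARK-17914 comes from fitting a nine-digit fraction into a microsecond field. One way to handle this is sketched below in plain Java. It is not the code from the PR: the helper `fractionToMicros` is hypothetical, and it assumes that digits beyond microsecond precision are simply truncated.

```java
final class FractionToMicros {
    // Converts the digits after the decimal point of the seconds field to microseconds.
    // Assumption: digits past the sixth are dropped rather than rounded.
    static int fractionToMicros(String digits) {
        String kept = digits.length() > 6 ? digits.substring(0, 6) : digits;
        StringBuilder padded = new StringBuilder(kept);
        while (padded.length() < 6) {
            padded.append('0'); // ".5" means 500000 microseconds
        }
        return Integer.parseInt(padded.toString());
    }

    public static void main(String[] args) {
        System.out.println(fractionToMicros("000000001"));
        System.out.println(fractionToMicros("5"));
        System.out.println(fractionToMicros("123456789"));
    }
}
```

Under this assumption, the fraction `000000001` from the example contributes zero microseconds instead of one.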
liuxian	d140918093	[SPARK-20665][SQL][FOLLOW-UP] Move test case to MathExpressionsSuite ## What changes were proposed in this pull request? add test case to MathExpressionsSuite as #17906 ## How was this patch tested? unit test cases Author: liuxian <liu.xian3@zte.com.cn> Closes #18082 from 10110346/wip-lx-0524.	2017-06-11 22:29:09 -07:00
Michal Senkyr	0538f3b0ae	[SPARK-18891][SQL] Support for Scala Map collection types ## What changes were proposed in this pull request? Add support for arbitrary Scala `Map` types in deserialization as well as a generic implicit encoder. Used the builder approach as in #16541 to construct any provided `Map` type upon deserialization. Please note that this PR also adds (ignored) tests for issue [SPARK-19104 CompileException with Map and Case Class in Spark 2.1.0](https://issues.apache.org/jira/browse/SPARK-19104) but doesn't solve it. Added support for Java Maps in codegen code (encoders will be added in a different PR) with the following default implementations for interfaces/abstract classes: * `java.util.Map`, `java.util.AbstractMap` => `java.util.HashMap` * `java.util.SortedMap`, `java.util.NavigableMap` => `java.util.TreeMap` * `java.util.concurrent.ConcurrentMap` => `java.util.concurrent.ConcurrentHashMap` * `java.util.concurrent.ConcurrentNavigableMap` => `java.util.concurrent.ConcurrentSkipListMap` Resulting codegen for `Seq(Map(1 -> 2)).toDS().map(identity).queryExecution.debug.codegen`: ``` /* 001 / public Object generate(Object[] references) { / 002 / return new GeneratedIterator(references); / 003 / } / 004 / / 005 / final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { / 006 / private Object[] references; / 007 / private scala.collection.Iterator[] inputs; / 008 / private scala.collection.Iterator inputadapter_input; / 009 / private boolean CollectObjectsToMap_loopIsNull1; / 010 / private int CollectObjectsToMap_loopValue0; / 011 / private boolean CollectObjectsToMap_loopIsNull3; / 012 / private int CollectObjectsToMap_loopValue2; / 013 / private UnsafeRow deserializetoobject_result; / 014 / private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder deserializetoobject_holder; / 015 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter deserializetoobject_rowWriter; / 016 / private 
scala.collection.immutable.Map mapelements_argValue; / 017 / private UnsafeRow mapelements_result; / 018 / private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder mapelements_holder; / 019 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter mapelements_rowWriter; / 020 / private UnsafeRow serializefromobject_result; / 021 / private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder; / 022 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter; / 023 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter serializefromobject_arrayWriter; / 024 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter serializefromobject_arrayWriter1; / 025 / / 026 / public GeneratedIterator(Object[] references) { / 027 / this.references = references; / 028 / } / 029 / / 030 / public void init(int index, scala.collection.Iterator[] inputs) { / 031 / partitionIndex = index; / 032 / this.inputs = inputs; / 033 / wholestagecodegen_init_0(); / 034 / wholestagecodegen_init_1(); / 035 / / 036 / } / 037 / / 038 / private void wholestagecodegen_init_0() { / 039 / inputadapter_input = inputs[0]; / 040 / / 041 / deserializetoobject_result = new UnsafeRow(1); / 042 / this.deserializetoobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(deserializetoobject_result, 32); / 043 / this.deserializetoobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(deserializetoobject_holder, 1); / 044 / / 045 / mapelements_result = new UnsafeRow(1); / 046 / this.mapelements_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(mapelements_result, 32); / 047 / this.mapelements_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(mapelements_holder, 1); / 048 / serializefromobject_result = new UnsafeRow(1); / 049 / 
this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 32); / 050 / this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1); / 051 / this.serializefromobject_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter(); / 052 / / 053 / } / 054 / / 055 / private void wholestagecodegen_init_1() { / 056 / this.serializefromobject_arrayWriter1 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter(); / 057 / / 058 / } / 059 / / 060 / protected void processNext() throws java.io.IOException { / 061 / while (inputadapter_input.hasNext() && !stopEarly()) { / 062 / InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); / 063 / boolean inputadapter_isNull = inputadapter_row.isNullAt(0); / 064 / MapData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getMap(0)); / 065 / / 066 / boolean deserializetoobject_isNull1 = true; / 067 / ArrayData deserializetoobject_value1 = null; / 068 / if (!inputadapter_isNull) { / 069 / deserializetoobject_isNull1 = false; / 070 / if (!deserializetoobject_isNull1) { / 071 / Object deserializetoobject_funcResult = null; / 072 / deserializetoobject_funcResult = inputadapter_value.keyArray(); / 073 / if (deserializetoobject_funcResult == null) { / 074 / deserializetoobject_isNull1 = true; / 075 / } else { / 076 / deserializetoobject_value1 = (ArrayData) deserializetoobject_funcResult; / 077 / } / 078 / / 079 / } / 080 / deserializetoobject_isNull1 = deserializetoobject_value1 == null; / 081 / } / 082 / / 083 / boolean deserializetoobject_isNull3 = true; / 084 / ArrayData deserializetoobject_value3 = null; / 085 / if (!inputadapter_isNull) { / 086 / deserializetoobject_isNull3 = false; / 087 / if (!deserializetoobject_isNull3) { / 088 / Object deserializetoobject_funcResult1 = null; / 089 / 
deserializetoobject_funcResult1 = inputadapter_value.valueArray(); / 090 / if (deserializetoobject_funcResult1 == null) { / 091 / deserializetoobject_isNull3 = true; / 092 / } else { / 093 / deserializetoobject_value3 = (ArrayData) deserializetoobject_funcResult1; / 094 / } / 095 / / 096 / } / 097 / deserializetoobject_isNull3 = deserializetoobject_value3 == null; / 098 / } / 099 / scala.collection.immutable.Map deserializetoobject_value = null; / 100 / / 101 / if ((deserializetoobject_isNull1 && !deserializetoobject_isNull3) \|\| / 102 / (!deserializetoobject_isNull1 && deserializetoobject_isNull3)) { / 103 / throw new RuntimeException("Invalid state: Inconsistent nullability of key-value"); / 104 / } / 105 / / 106 / if (!deserializetoobject_isNull1) { / 107 / if (deserializetoobject_value1.numElements() != deserializetoobject_value3.numElements()) { / 108 / throw new RuntimeException("Invalid state: Inconsistent lengths of key-value arrays"); / 109 / } / 110 / int deserializetoobject_dataLength = deserializetoobject_value1.numElements(); / 111 / / 112 / scala.collection.mutable.Builder CollectObjectsToMap_builderValue5 = scala.collection.immutable.Map$.MODULE$.newBuilder(); / 113 / CollectObjectsToMap_builderValue5.sizeHint(deserializetoobject_dataLength); / 114 / / 115 / int deserializetoobject_loopIndex = 0; / 116 / while (deserializetoobject_loopIndex < deserializetoobject_dataLength) { / 117 / CollectObjectsToMap_loopValue0 = (int) (deserializetoobject_value1.getInt(deserializetoobject_loopIndex)); / 118 / CollectObjectsToMap_loopValue2 = (int) (deserializetoobject_value3.getInt(deserializetoobject_loopIndex)); / 119 / CollectObjectsToMap_loopIsNull1 = deserializetoobject_value1.isNullAt(deserializetoobject_loopIndex); / 120 / CollectObjectsToMap_loopIsNull3 = deserializetoobject_value3.isNullAt(deserializetoobject_loopIndex); / 121 / / 122 / if (CollectObjectsToMap_loopIsNull1) { / 123 / throw new RuntimeException("Found null in map key!"); / 124 / } / 125 / 
/ 126 / scala.Tuple2 CollectObjectsToMap_loopValue4; / 127 / / 128 / if (CollectObjectsToMap_loopIsNull3) { / 129 / CollectObjectsToMap_loopValue4 = new scala.Tuple2(CollectObjectsToMap_loopValue0, null); / 130 / } else { / 131 / CollectObjectsToMap_loopValue4 = new scala.Tuple2(CollectObjectsToMap_loopValue0, CollectObjectsToMap_loopValue2); / 132 / } / 133 / / 134 / CollectObjectsToMap_builderValue5.$plus$eq(CollectObjectsToMap_loopValue4); / 135 / / 136 / deserializetoobject_loopIndex += 1; / 137 / } / 138 / / 139 / deserializetoobject_value = (scala.collection.immutable.Map) CollectObjectsToMap_builderValue5.result(); / 140 / } / 141 / / 142 / boolean mapelements_isNull = true; / 143 / scala.collection.immutable.Map mapelements_value = null; / 144 / if (!false) { / 145 / mapelements_argValue = deserializetoobject_value; / 146 / / 147 / mapelements_isNull = false; / 148 / if (!mapelements_isNull) { / 149 / Object mapelements_funcResult = null; / 150 / mapelements_funcResult = ((scala.Function1) references[0]).apply(mapelements_argValue); / 151 / if (mapelements_funcResult == null) { / 152 / mapelements_isNull = true; / 153 / } else { / 154 / mapelements_value = (scala.collection.immutable.Map) mapelements_funcResult; / 155 / } / 156 / / 157 / } / 158 / mapelements_isNull = mapelements_value == null; / 159 / } / 160 / / 161 / MapData serializefromobject_value = null; / 162 / if (!mapelements_isNull) { / 163 / final int serializefromobject_length = mapelements_value.size(); / 164 / final Object[] serializefromobject_convertedKeys = new Object[serializefromobject_length]; / 165 / final Object[] serializefromobject_convertedValues = new Object[serializefromobject_length]; / 166 / int serializefromobject_index = 0; / 167 / final scala.collection.Iterator serializefromobject_entries = mapelements_value.iterator(); / 168 / while(serializefromobject_entries.hasNext()) { / 169 / final scala.Tuple2 serializefromobject_entry = (scala.Tuple2) 
serializefromobject_entries.next(); / 170 / int ExternalMapToCatalyst_key1 = (Integer) serializefromobject_entry._1(); / 171 / int ExternalMapToCatalyst_value1 = (Integer) serializefromobject_entry._2(); / 172 / / 173 / boolean ExternalMapToCatalyst_value_isNull1 = false; / 174 / / 175 / if (false) { / 176 / throw new RuntimeException("Cannot use null as map key!"); / 177 / } else { / 178 / serializefromobject_convertedKeys[serializefromobject_index] = (Integer) ExternalMapToCatalyst_key1; / 179 / } / 180 / / 181 / if (false) { / 182 / serializefromobject_convertedValues[serializefromobject_index] = null; / 183 / } else { / 184 / serializefromobject_convertedValues[serializefromobject_index] = (Integer) ExternalMapToCatalyst_value1; / 185 / } / 186 / / 187 / serializefromobject_index++; / 188 / } / 189 / / 190 / serializefromobject_value = new org.apache.spark.sql.catalyst.util.ArrayBasedMapData(new org.apache.spark.sql.catalyst.util.GenericArrayData(serializefromobject_convertedKeys), new org.apache.spark.sql.catalyst.util.GenericArrayData(serializefromobject_convertedValues)); / 191 / } / 192 / serializefromobject_holder.reset(); / 193 / / 194 / serializefromobject_rowWriter.zeroOutNullBytes(); / 195 / / 196 / if (mapelements_isNull) { / 197 / serializefromobject_rowWriter.setNullAt(0); / 198 / } else { / 199 / // Remember the current cursor so that we can calculate how many bytes are / 200 / // written later. / 201 / final int serializefromobject_tmpCursor = serializefromobject_holder.cursor; / 202 / / 203 / if (serializefromobject_value instanceof UnsafeMapData) { / 204 / final int serializefromobject_sizeInBytes = ((UnsafeMapData) serializefromobject_value).getSizeInBytes(); / 205 / // grow the global buffer before writing data. 
/ 206 / serializefromobject_holder.grow(serializefromobject_sizeInBytes); / 207 / ((UnsafeMapData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor); / 208 / serializefromobject_holder.cursor += serializefromobject_sizeInBytes; / 209 / / 210 / } else { / 211 / final ArrayData serializefromobject_keys = serializefromobject_value.keyArray(); / 212 / final ArrayData serializefromobject_values = serializefromobject_value.valueArray(); / 213 / / 214 / // preserve 8 bytes to write the key array numBytes later. / 215 / serializefromobject_holder.grow(8); / 216 / serializefromobject_holder.cursor += 8; / 217 / / 218 / // Remember the current cursor so that we can write numBytes of key array later. / 219 / final int serializefromobject_tmpCursor1 = serializefromobject_holder.cursor; / 220 / / 221 / if (serializefromobject_keys instanceof UnsafeArrayData) { / 222 / final int serializefromobject_sizeInBytes1 = ((UnsafeArrayData) serializefromobject_keys).getSizeInBytes(); / 223 / // grow the global buffer before writing data. 
/ 224 / serializefromobject_holder.grow(serializefromobject_sizeInBytes1); / 225 / ((UnsafeArrayData) serializefromobject_keys).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor); / 226 / serializefromobject_holder.cursor += serializefromobject_sizeInBytes1; / 227 / / 228 / } else { / 229 / final int serializefromobject_numElements = serializefromobject_keys.numElements(); / 230 / serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 4); / 231 / / 232 / for (int serializefromobject_index1 = 0; serializefromobject_index1 < serializefromobject_numElements; serializefromobject_index1++) { / 233 / if (serializefromobject_keys.isNullAt(serializefromobject_index1)) { / 234 / serializefromobject_arrayWriter.setNullInt(serializefromobject_index1); / 235 / } else { / 236 / final int serializefromobject_element = serializefromobject_keys.getInt(serializefromobject_index1); / 237 / serializefromobject_arrayWriter.write(serializefromobject_index1, serializefromobject_element); / 238 / } / 239 / } / 240 / } / 241 / / 242 / // Write the numBytes of key array into the first 8 bytes. / 243 / Platform.putLong(serializefromobject_holder.buffer, serializefromobject_tmpCursor1 - 8, serializefromobject_holder.cursor - serializefromobject_tmpCursor1); / 244 / / 245 / if (serializefromobject_values instanceof UnsafeArrayData) { / 246 / final int serializefromobject_sizeInBytes2 = ((UnsafeArrayData) serializefromobject_values).getSizeInBytes(); / 247 / // grow the global buffer before writing data. 
/ 248 / serializefromobject_holder.grow(serializefromobject_sizeInBytes2); / 249 / ((UnsafeArrayData) serializefromobject_values).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor); / 250 / serializefromobject_holder.cursor += serializefromobject_sizeInBytes2; / 251 / / 252 / } else { / 253 / final int serializefromobject_numElements1 = serializefromobject_values.numElements(); / 254 / serializefromobject_arrayWriter1.initialize(serializefromobject_holder, serializefromobject_numElements1, 4); / 255 / / 256 / for (int serializefromobject_index2 = 0; serializefromobject_index2 < serializefromobject_numElements1; serializefromobject_index2++) { / 257 / if (serializefromobject_values.isNullAt(serializefromobject_index2)) { / 258 / serializefromobject_arrayWriter1.setNullInt(serializefromobject_index2); / 259 / } else { / 260 / final int serializefromobject_element1 = serializefromobject_values.getInt(serializefromobject_index2); / 261 / serializefromobject_arrayWriter1.write(serializefromobject_index2, serializefromobject_element1); / 262 / } / 263 / } / 264 / } / 265 / / 266 / } / 267 / / 268 / serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor); / 269 / } / 270 / serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize()); / 271 / append(serializefromobject_result); / 272 / if (shouldStop()) return; / 273 / } / 274 / } / 275 / } ``` Codegen for `java.util.Map`: ``` / 001 / public Object generate(Object[] references) { / 002 / return new GeneratedIterator(references); / 003 / } / 004 / / 005 / final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { / 006 / private Object[] references; / 007 / private scala.collection.Iterator[] inputs; / 008 / private scala.collection.Iterator inputadapter_input; / 009 / private boolean CollectObjectsToMap_loopIsNull1; / 010 / private int 
CollectObjectsToMap_loopValue0; / 011 / private boolean CollectObjectsToMap_loopIsNull3; / 012 / private int CollectObjectsToMap_loopValue2; / 013 / private UnsafeRow deserializetoobject_result; / 014 / private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder deserializetoobject_holder; / 015 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter deserializetoobject_rowWriter; / 016 / private java.util.HashMap mapelements_argValue; / 017 / private UnsafeRow mapelements_result; / 018 / private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder mapelements_holder; / 019 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter mapelements_rowWriter; / 020 / private UnsafeRow serializefromobject_result; / 021 / private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder; / 022 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter; / 023 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter serializefromobject_arrayWriter; / 024 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter serializefromobject_arrayWriter1; / 025 / / 026 / public GeneratedIterator(Object[] references) { / 027 / this.references = references; / 028 / } / 029 / / 030 / public void init(int index, scala.collection.Iterator[] inputs) { / 031 / partitionIndex = index; / 032 / this.inputs = inputs; / 033 / wholestagecodegen_init_0(); / 034 / wholestagecodegen_init_1(); / 035 / / 036 / } / 037 / / 038 / private void wholestagecodegen_init_0() { / 039 / inputadapter_input = inputs[0]; / 040 / / 041 / deserializetoobject_result = new UnsafeRow(1); / 042 / this.deserializetoobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(deserializetoobject_result, 32); / 043 / this.deserializetoobject_rowWriter = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(deserializetoobject_holder, 1); / 044 / / 045 / mapelements_result = new UnsafeRow(1); / 046 / this.mapelements_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(mapelements_result, 32); / 047 / this.mapelements_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(mapelements_holder, 1); / 048 / serializefromobject_result = new UnsafeRow(1); / 049 / this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 32); / 050 / this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1); / 051 / this.serializefromobject_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter(); / 052 / / 053 / } / 054 / / 055 / private void wholestagecodegen_init_1() { / 056 / this.serializefromobject_arrayWriter1 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter(); / 057 / / 058 / } / 059 / / 060 / protected void processNext() throws java.io.IOException { / 061 / while (inputadapter_input.hasNext() && !stopEarly()) { / 062 / InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); / 063 / boolean inputadapter_isNull = inputadapter_row.isNullAt(0); / 064 / MapData inputadapter_value = inputadapter_isNull ? 
null : (inputadapter_row.getMap(0)); / 065 / / 066 / boolean deserializetoobject_isNull1 = true; / 067 / ArrayData deserializetoobject_value1 = null; / 068 / if (!inputadapter_isNull) { / 069 / deserializetoobject_isNull1 = false; / 070 / if (!deserializetoobject_isNull1) { / 071 / Object deserializetoobject_funcResult = null; / 072 / deserializetoobject_funcResult = inputadapter_value.keyArray(); / 073 / if (deserializetoobject_funcResult == null) { / 074 / deserializetoobject_isNull1 = true; / 075 / } else { / 076 / deserializetoobject_value1 = (ArrayData) deserializetoobject_funcResult; / 077 / } / 078 / / 079 / } / 080 / deserializetoobject_isNull1 = deserializetoobject_value1 == null; / 081 / } / 082 / / 083 / boolean deserializetoobject_isNull3 = true; / 084 / ArrayData deserializetoobject_value3 = null; / 085 / if (!inputadapter_isNull) { / 086 / deserializetoobject_isNull3 = false; / 087 / if (!deserializetoobject_isNull3) { / 088 / Object deserializetoobject_funcResult1 = null; / 089 / deserializetoobject_funcResult1 = inputadapter_value.valueArray(); / 090 / if (deserializetoobject_funcResult1 == null) { / 091 / deserializetoobject_isNull3 = true; / 092 / } else { / 093 / deserializetoobject_value3 = (ArrayData) deserializetoobject_funcResult1; / 094 / } / 095 / / 096 / } / 097 / deserializetoobject_isNull3 = deserializetoobject_value3 == null; / 098 / } / 099 / java.util.HashMap deserializetoobject_value = null; / 100 / / 101 / if ((deserializetoobject_isNull1 && !deserializetoobject_isNull3) \|\| / 102 / (!deserializetoobject_isNull1 && deserializetoobject_isNull3)) { / 103 / throw new RuntimeException("Invalid state: Inconsistent nullability of key-value"); / 104 / } / 105 / / 106 / if (!deserializetoobject_isNull1) { / 107 / if (deserializetoobject_value1.numElements() != deserializetoobject_value3.numElements()) { / 108 / throw new RuntimeException("Invalid state: Inconsistent lengths of key-value arrays"); / 109 / } / 110 / int 
deserializetoobject_dataLength = deserializetoobject_value1.numElements(); / 111 / java.util.Map CollectObjectsToMap_builderValue5 = new java.util.HashMap(deserializetoobject_dataLength); / 112 / / 113 / int deserializetoobject_loopIndex = 0; / 114 / while (deserializetoobject_loopIndex < deserializetoobject_dataLength) { / 115 / CollectObjectsToMap_loopValue0 = (int) (deserializetoobject_value1.getInt(deserializetoobject_loopIndex)); / 116 / CollectObjectsToMap_loopValue2 = (int) (deserializetoobject_value3.getInt(deserializetoobject_loopIndex)); / 117 / CollectObjectsToMap_loopIsNull1 = deserializetoobject_value1.isNullAt(deserializetoobject_loopIndex); / 118 / CollectObjectsToMap_loopIsNull3 = deserializetoobject_value3.isNullAt(deserializetoobject_loopIndex); / 119 / / 120 / if (CollectObjectsToMap_loopIsNull1) { / 121 / throw new RuntimeException("Found null in map key!"); / 122 / } / 123 / / 124 / CollectObjectsToMap_builderValue5.put(CollectObjectsToMap_loopValue0, CollectObjectsToMap_loopValue2); / 125 / / 126 / deserializetoobject_loopIndex += 1; / 127 / } / 128 / / 129 / deserializetoobject_value = (java.util.HashMap) CollectObjectsToMap_builderValue5; / 130 / } / 131 / / 132 / boolean mapelements_isNull = true; / 133 / java.util.HashMap mapelements_value = null; / 134 / if (!false) { / 135 / mapelements_argValue = deserializetoobject_value; / 136 / / 137 / mapelements_isNull = false; / 138 / if (!mapelements_isNull) { / 139 / Object mapelements_funcResult = null; / 140 / mapelements_funcResult = ((scala.Function1) references[0]).apply(mapelements_argValue); / 141 / if (mapelements_funcResult == null) { / 142 / mapelements_isNull = true; / 143 / } else { / 144 / mapelements_value = (java.util.HashMap) mapelements_funcResult; / 145 / } / 146 / / 147 / } / 148 / mapelements_isNull = mapelements_value == null; / 149 / } / 150 / / 151 / MapData serializefromobject_value = null; / 152 / if (!mapelements_isNull) { / 153 / final int serializefromobject_length = 
mapelements_value.size(); / 154 / final Object[] serializefromobject_convertedKeys = new Object[serializefromobject_length]; / 155 / final Object[] serializefromobject_convertedValues = new Object[serializefromobject_length]; / 156 / int serializefromobject_index = 0; / 157 / final java.util.Iterator serializefromobject_entries = mapelements_value.entrySet().iterator(); / 158 / while(serializefromobject_entries.hasNext()) { / 159 / final java.util.Map$Entry serializefromobject_entry = (java.util.Map$Entry) serializefromobject_entries.next(); / 160 / int ExternalMapToCatalyst_key1 = (Integer) serializefromobject_entry.getKey(); / 161 / int ExternalMapToCatalyst_value1 = (Integer) serializefromobject_entry.getValue(); / 162 / / 163 / boolean ExternalMapToCatalyst_value_isNull1 = false; / 164 / / 165 / if (false) { / 166 / throw new RuntimeException("Cannot use null as map key!"); / 167 / } else { / 168 / serializefromobject_convertedKeys[serializefromobject_index] = (Integer) ExternalMapToCatalyst_key1; / 169 / } / 170 / / 171 / if (false) { / 172 / serializefromobject_convertedValues[serializefromobject_index] = null; / 173 / } else { / 174 / serializefromobject_convertedValues[serializefromobject_index] = (Integer) ExternalMapToCatalyst_value1; / 175 / } / 176 / / 177 / serializefromobject_index++; / 178 / } / 179 / / 180 / serializefromobject_value = new org.apache.spark.sql.catalyst.util.ArrayBasedMapData(new org.apache.spark.sql.catalyst.util.GenericArrayData(serializefromobject_convertedKeys), new org.apache.spark.sql.catalyst.util.GenericArrayData(serializefromobject_convertedValues)); / 181 / } / 182 / serializefromobject_holder.reset(); / 183 / / 184 / serializefromobject_rowWriter.zeroOutNullBytes(); / 185 / / 186 / if (mapelements_isNull) { / 187 / serializefromobject_rowWriter.setNullAt(0); / 188 / } else { / 189 / // Remember the current cursor so that we can calculate how many bytes are / 190 / // written later. 
/ 191 / final int serializefromobject_tmpCursor = serializefromobject_holder.cursor; / 192 / / 193 / if (serializefromobject_value instanceof UnsafeMapData) { / 194 / final int serializefromobject_sizeInBytes = ((UnsafeMapData) serializefromobject_value).getSizeInBytes(); / 195 / // grow the global buffer before writing data. / 196 / serializefromobject_holder.grow(serializefromobject_sizeInBytes); / 197 / ((UnsafeMapData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor); / 198 / serializefromobject_holder.cursor += serializefromobject_sizeInBytes; / 199 / / 200 / } else { / 201 / final ArrayData serializefromobject_keys = serializefromobject_value.keyArray(); / 202 / final ArrayData serializefromobject_values = serializefromobject_value.valueArray(); / 203 / / 204 / // preserve 8 bytes to write the key array numBytes later. / 205 / serializefromobject_holder.grow(8); / 206 / serializefromobject_holder.cursor += 8; / 207 / / 208 / // Remember the current cursor so that we can write numBytes of key array later. / 209 / final int serializefromobject_tmpCursor1 = serializefromobject_holder.cursor; / 210 / / 211 / if (serializefromobject_keys instanceof UnsafeArrayData) { / 212 / final int serializefromobject_sizeInBytes1 = ((UnsafeArrayData) serializefromobject_keys).getSizeInBytes(); / 213 / // grow the global buffer before writing data. 
/ 214 / serializefromobject_holder.grow(serializefromobject_sizeInBytes1); / 215 / ((UnsafeArrayData) serializefromobject_keys).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor); / 216 / serializefromobject_holder.cursor += serializefromobject_sizeInBytes1; / 217 / / 218 / } else { / 219 / final int serializefromobject_numElements = serializefromobject_keys.numElements(); / 220 / serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 4); / 221 / / 222 / for (int serializefromobject_index1 = 0; serializefromobject_index1 < serializefromobject_numElements; serializefromobject_index1++) { / 223 / if (serializefromobject_keys.isNullAt(serializefromobject_index1)) { / 224 / serializefromobject_arrayWriter.setNullInt(serializefromobject_index1); / 225 / } else { / 226 / final int serializefromobject_element = serializefromobject_keys.getInt(serializefromobject_index1); / 227 / serializefromobject_arrayWriter.write(serializefromobject_index1, serializefromobject_element); / 228 / } / 229 / } / 230 / } / 231 / / 232 / // Write the numBytes of key array into the first 8 bytes. / 233 / Platform.putLong(serializefromobject_holder.buffer, serializefromobject_tmpCursor1 - 8, serializefromobject_holder.cursor - serializefromobject_tmpCursor1); / 234 / / 235 / if (serializefromobject_values instanceof UnsafeArrayData) { / 236 / final int serializefromobject_sizeInBytes2 = ((UnsafeArrayData) serializefromobject_values).getSizeInBytes(); / 237 / // grow the global buffer before writing data. 
/ 238 / serializefromobject_holder.grow(serializefromobject_sizeInBytes2); / 239 / ((UnsafeArrayData) serializefromobject_values).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor); / 240 / serializefromobject_holder.cursor += serializefromobject_sizeInBytes2; / 241 / / 242 / } else { / 243 / final int serializefromobject_numElements1 = serializefromobject_values.numElements(); / 244 / serializefromobject_arrayWriter1.initialize(serializefromobject_holder, serializefromobject_numElements1, 4); / 245 / / 246 / for (int serializefromobject_index2 = 0; serializefromobject_index2 < serializefromobject_numElements1; serializefromobject_index2++) { / 247 / if (serializefromobject_values.isNullAt(serializefromobject_index2)) { / 248 / serializefromobject_arrayWriter1.setNullInt(serializefromobject_index2); / 249 / } else { / 250 / final int serializefromobject_element1 = serializefromobject_values.getInt(serializefromobject_index2); / 251 / serializefromobject_arrayWriter1.write(serializefromobject_index2, serializefromobject_element1); / 252 / } / 253 / } / 254 / } / 255 / / 256 / } / 257 / / 258 / serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor); / 259 / } / 260 / serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize()); / 261 / append(serializefromobject_result); / 262 / if (shouldStop()) return; / 263 / } / 264 / } / 265 */ } ``` ## How was this patch tested? ``` build/mvn -DskipTests clean package && dev/run-tests ``` Additionally in Spark shell: ``` scala> Seq(collection.mutable.HashMap(1 -> 2, 2 -> 3)).toDS().map(_ += (3 -> 4)).collect() res0: Array[scala.collection.mutable.HashMap[Int,Int]] = Array(Map(2 -> 3, 1 -> 2, 3 -> 4)) ``` Author: Michal Senkyr <mike.senkyr@gmail.com> Author: Michal Šenkýř <mike.senkyr@gmail.com> Closes #16986 from michalsenkyr/dataset-map-builder.	2017-06-12 08:47:01 +08:00
Zhenhua Wang	a7c61c100b	[SPARK-21031][SQL] Add `alterTableStats` to store spark's stats and let `alterTable` keep existing stats ## What changes were proposed in this pull request? Currently, Hive's stats are read into `CatalogStatistics`, while Spark's stats are also persisted through `CatalogStatistics`. As a result, Hive's stats can be unexpectedly propagated into Spark's stats. For example, for a catalog table, we read stats from Hive, e.g. "totalSize", and put them into `CatalogStatistics`. Then, when the "ALTER TABLE" command is used, we store the stats in `CatalogStatistics` into the metastore as Spark's stats (because we don't know whether they came from Spark or not). But Spark's stats should only be generated by the "ANALYZE" command, so this is unexpected behavior for "ALTER TABLE". Secondly, once Spark's stats are in the metastore, after inserting new data, although Hive updates "totalSize" in the metastore, we still cannot get the right `sizeInBytes` in `CatalogStatistics`, because we respect Spark's stats (which should not exist) over Hive's stats. A running example is shown in [JIRA](https://issues.apache.org/jira/browse/SPARK-21031). To fix this, we add a new method `alterTableStats` to store Spark's stats, and let `alterTable` keep existing stats. ## How was this patch tested? Added new tests. Author: Zhenhua Wang <wzh_zju@163.com> Closes #18248 from wzhfy/separateHiveStats.	2017-06-12 08:23:04 +08:00
liuxian	5301a19a0e	[SPARK-20620][TEST] Improve some unit tests for NullExpressionsSuite and TypeCoercionSuite ## What changes were proposed in this pull request? Add more data types to some unit tests. ## How was this patch tested? Unit tests. Author: liuxian <liu.xian3@zte.com.cn> Closes #17880 from 10110346/wip_lx_0506.	2017-06-10 10:42:23 -07:00
Xiao Li	8e96acf71c	[SPARK-20211][SQL] Fix the Precision and Scale of Decimal Values when the Input is BigDecimal between -1.0 and 1.0 ### What changes were proposed in this pull request? The precision and scale of decimal values are wrong when the input is a BigDecimal between -1.0 and 1.0. A BigDecimal's precision is the digit count starting from the leftmost nonzero digit, per [Java's BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html). However, our Decimal definition follows the database decimal standard, in which precision is the total number of digits, including those both to the left and to the right of the decimal point. This PR fixes the issue by doing the conversion. Before this PR, the following queries failed: ```SQL select 1 > 0.0001 select floor(0.0001) select ceil(0.0001) ``` ### How was this patch tested? Added test cases. Author: Xiao Li <gatorsmile@gmail.com> Closes #18244 from gatorsmile/bigdecimal.	2017-06-10 10:28:14 -07:00
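The mismatch described in the entry above can be reproduced with plain `java.math.BigDecimal`. The sketch below is only an illustration: `DecimalPrecisionSketch` and `sqlStylePrecision` are hypothetical names, and the `max(precision, scale)` rule is an assumption chosen to show the idea, not necessarily the conversion this patch implements.

```java
import java.math.BigDecimal;

// Hypothetical illustration; not part of Spark.
class DecimalPrecisionSketch {
    // Assumed rule: a SQL-style precision has to cover every digit of the
    // fractional part, so it can never be smaller than the scale.
    static int sqlStylePrecision(BigDecimal d) {
        return Math.max(d.precision(), d.scale());
    }

    public static void main(String[] args) {
        BigDecimal small = new BigDecimal("0.0001");
        // Java counts digits from the leftmost nonzero digit, so the
        // precision (1) is smaller than the scale (4) for this value.
        System.out.println("precision=" + small.precision() + " scale=" + small.scale());
        System.out.println("sql-style precision=" + sqlStylePrecision(small));
        // For values outside (-1.0, 1.0) the two notions already agree.
        System.out.println("sql-style precision=" + sqlStylePrecision(new BigDecimal("12.34")));
    }
}
```

A precision smaller than the scale is exactly the state a database-style decimal type cannot represent, which is why inputs between -1.0 and 1.0 were the ones affected.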
Xiao Li	571635488d	[SPARK-20918][SQL] Use FunctionIdentifier as function identifiers in FunctionRegistry ### What changes were proposed in this pull request? Currently, the unquoted string of a function identifier is used as the function identifier in the function registry. This can cause incorrect behavior when users use `.` in function names. This PR uses `FunctionIdentifier` as the identifier in the function registry. - Add one new function `createOrReplaceTempFunction` to `FunctionRegistry` ```Scala final def createOrReplaceTempFunction(name: String, builder: FunctionBuilder): Unit ``` ### How was this patch tested? Added extra test cases to verify the included bug fixes. Author: Xiao Li <gatorsmile@gmail.com> Author: gatorsmile <gatorsmile@gmail.com> Closes #18142 from gatorsmile/fuctionRegistry.	2017-06-09 10:16:30 -07:00
Bogdan Raducanu	cb83ca1433	[SPARK-20854][TESTS] Removing duplicate test case ## What changes were proposed in this pull request? Removed a duplicate case in "SPARK-20854: select hint syntax with expressions" ## How was this patch tested? Existing tests. Author: Bogdan Raducanu <bogdan@databricks.com> Closes #18217 from bogdanrdc/SPARK-20854-2.	2017-06-06 22:51:10 -07:00
Wenchen Fan	c92949ac23	[SPARK-20972][SQL] rename HintInfo.isBroadcastable to broadcast ## What changes were proposed in this pull request? `HintInfo.isBroadcastable` is not an accurate name: it is used to force the planner to broadcast a plan no matter what the data size is, via the hint mechanism. I think `forceBroadcast` is a better name. Also, `isBroadcastable` only has 2 possible values, `Some(true)` and `None`, so we can just use a boolean type for it. ## How was this patch tested? Existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #18189 from cloud-fan/stats.	2017-06-06 22:50:06 -07:00
Bogdan Raducanu	2134196a9c	[SPARK-20854][SQL] Extend hint syntax to support expressions ## What changes were proposed in this pull request? SQL hint syntax: * support expressions such as strings, numbers, etc. instead of only identifiers as it is currently. * support multiple hints, which was missing compared to the DataFrame syntax. DataFrame API: * support any parameters in DataFrame.hint instead of just strings ## How was this patch tested? Existing tests. New tests in PlanParserSuite. New suite DataFrameHintSuite. Author: Bogdan Raducanu <bogdan@databricks.com> Closes #18086 from bogdanrdc/SPARK-20854.	2017-06-01 15:50:40 -07:00
Yuming Wang	6d05c1c1da	[SPARK-20910][SQL] Add build-in SQL function - UUID ## What changes were proposed in this pull request? Add built-in SQL function - UUID. ## How was this patch tested? Unit tests. Author: Yuming Wang <wgyumg@gmail.com> Closes #18136 from wangyum/SPARK-20910.	2017-06-01 16:15:24 +09:00
Wenchen Fan	1f5dddffa3	Revert "[SPARK-20392][SQL] Set barrier to prevent re-entering a tree" This reverts commit `8ce0d8ffb6`.	2017-05-30 21:14:55 -07:00
Liang-Chi Hsieh	35b644bd03	[SPARK-20916][SQL] Improve error message for unaliased subqueries in FROM clause ## What changes were proposed in this pull request? We changed the parser to reject unaliased subqueries in the FROM clause in SPARK-20690. However, the error message that we now give isn't very helpful: scala> sql("""SELECT x FROM (SELECT 1 AS x)""") org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'FROM' expecting {<EOF>, 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9) We should modify the parser to throw a clearer error for such queries: scala> sql("""SELECT x FROM (SELECT 1 AS x)""") org.apache.spark.sql.catalyst.parser.ParseException: The unaliased subqueries in the FROM clause are not supported.(line 1, pos 14) ## How was this patch tested? Modified existing tests to reflect this change. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18141 from viirya/SPARK-20916.	2017-05-30 06:28:43 -07:00
Yuming Wang	d797ed0ef1	[SPARK-20909][SQL] Add build-int SQL function - DAYOFWEEK ## What changes were proposed in this pull request? Add built-in SQL function - DAYOFWEEK. ## How was this patch tested? Unit tests. Author: Yuming Wang <wgyumg@gmail.com> Closes #18134 from wangyum/SPARK-20909.	2017-05-30 15:40:50 +09:00
Kazuaki Ishizaki	ef9fd920c3	[SPARK-20750][SQL] Built-in SQL Function Support - REPLACE ## What changes were proposed in this pull request? This PR adds the built-in SQL function `REPLACE(<string_expression>, <search_string> [, <replacement_string>])`. `REPLACE()` returns the string with all occurrences of the search string replaced by the given replacement string. ## How was this patch tested? Added new test suites. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18047 from kiszk/SPARK-20750.	2017-05-29 11:47:31 -07:00
Tejas Patil	f9b59abeae	[SPARK-20758][SQL] Add Constant propagation optimization ## What changes were proposed in this pull request? See class doc of `ConstantPropagation` for the approach used. ## How was this patch tested? - Added unit tests Author: Tejas Patil <tejasp@fb.com> Closes #17993 from tejasapatil/SPARK-20758_const_propagation.	2017-05-29 12:21:34 +02:00
Takeshi Yamamuro	24d34281d7	[SPARK-20841][SQL] Support table column aliases in FROM clause ## What changes were proposed in this pull request? This PR adds parsing rules to support table column aliases in the FROM clause. ## How was this patch tested? Added tests in `PlanParserSuite` and `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18079 from maropu/SPARK-20841.	2017-05-28 13:23:18 -07:00
Xiao Li	06c155c90d	[SPARK-20908][SQL] Cache Manager: Hint should be ignored in plan matching ### What changes were proposed in this pull request? In Cache manager, the plan matching should ignore Hint. ```Scala val df1 = spark.range(10).join(broadcast(spark.range(10))) df1.cache() spark.range(10).join(spark.range(10)).explain() ``` The output plan of the above query shows that the second query is not using the cached data of the first query. ``` BroadcastNestedLoopJoin BuildRight, Inner :- Range (0, 10, step=1, splits=2) +- BroadcastExchange IdentityBroadcastMode +- Range (0, 10, step=1, splits=2) ``` After the fix, the plan becomes ``` InMemoryTableScan [id#20L, id#23L] +- InMemoryRelation [id#20L, id#23L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) +- BroadcastNestedLoopJoin BuildRight, Inner :- Range (0, 10, step=1, splits=2) +- BroadcastExchange IdentityBroadcastMode +- Range (0, 10, step=1, splits=2) ``` ### How was this patch tested? Added a test. Author: Xiao Li <gatorsmile@gmail.com> Closes #18131 from gatorsmile/HintCache.	2017-05-27 21:32:18 -07:00
liuxian	3969a8078e	[SPARK-20876][SQL] If the input parameter is float type for ceil or floor,the result is not we expected ## What changes were proposed in this pull request? spark-sql>SELECT ceil(cast(12345.1233 as float)); spark-sql>12345 For this case, the result we expected is `12346`. spark-sql>SELECT floor(cast(-12345.1233 as float)); spark-sql>-12345 For this case, the result we expected is `-12346`. This happens because `inputTypes` in `Ceil` and `Floor` has no FloatType, so the input is converted to LongType. ## How was this patch tested? After the modification: spark-sql>SELECT ceil(cast(12345.1233 as float)); spark-sql>12346 spark-sql>SELECT floor(cast(-12345.1233 as float)); spark-sql>-12346 Author: liuxian <liu.xian3@zte.com.cn> Closes #18103 from 10110346/wip-lx-0525-1.	2017-05-27 16:23:45 -07:00
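The wrong results in the entry above follow from truncation: once a float has been converted to a long, the fractional part is gone before `ceil` or `floor` ever sees it. A minimal JDK-only sketch, assuming the implicit conversion to LongType behaves like a Java narrowing cast; `FloatCeilFloorSketch` and its methods are hypothetical names, not Spark code:

```java
// Hypothetical illustration; not part of Spark.
class FloatCeilFloorSketch {
    // Converting to long first truncates toward zero; ceil/floor of an
    // integral value is then a no-op, which reproduces the reported results.
    static long ceilAfterLongCast(float x) {
        return (long) x;
    }

    static long floorAfterLongCast(float x) {
        return (long) x;
    }

    // Widening to double keeps the fractional part, so ceil/floor can round.
    static long ceilAsDouble(float x) {
        return (long) Math.ceil((double) x);
    }

    static long floorAsDouble(float x) {
        return (long) Math.floor((double) x);
    }

    public static void main(String[] args) {
        System.out.println(ceilAfterLongCast(12345.1233f));
        System.out.println(ceilAsDouble(12345.1233f));
        System.out.println(floorAfterLongCast(-12345.1233f));
        System.out.println(floorAsDouble(-12345.1233f));
    }
}
```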
Yuming Wang	a0f8a072e3	[SPARK-20748][SQL] Add built-in SQL function CH[A]R. ## What changes were proposed in this pull request? Add built-in SQL function `CH[A]R`: For `CHR(bigint\|double n)`, returns the ASCII character having the binary equivalent to `n`. If n is larger than 256 the result is equivalent to CHR(n % 256) ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18019 from wangyum/SPARK-20748.	2017-05-26 20:59:14 -07:00
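The `CHR(n % 256)` rule in the entry above can be sketched in a few lines. This is an illustration under stated assumptions rather than Spark's implementation: `ChrSketch.chr` is a hypothetical helper, and rejecting negative inputs is a choice made only for this sketch, since the commit message does not describe that case.

```java
// Hypothetical illustration; not part of Spark.
class ChrSketch {
    // Maps n to the character whose code is n modulo 256.
    static String chr(long n) {
        if (n < 0) {
            // Assumption: negative input is outside the scope of this sketch.
            throw new IllegalArgumentException("negative input is not handled here");
        }
        return String.valueOf((char) (n % 256));
    }

    public static void main(String[] args) {
        System.out.println(chr(65));
        // 321 % 256 == 65, so this yields the same character as chr(65).
        System.out.println(chr(321));
    }
}
```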
Liang-Chi Hsieh	8ce0d8ffb6	[SPARK-20392][SQL] Set barrier to prevent re-entering a tree ## What changes were proposed in this pull request? It is reported that performance degrades when applying an ML pipeline to a dataset with many columns but few rows. A big part of the degradation comes from operations (e.g., `select`) on DataFrame/Dataset that re-create a new DataFrame/Dataset with a new `LogicalPlan`. The cost can normally be ignored in SQL usage. However, it's not rare to chain dozens of pipeline stages in ML. When the query plan grows incrementally while running those stages, the total cost spent on re-creating DataFrames grows too. In particular, the `Analyzer` goes through the big query plan even if most of it is already analyzed. By eliminating part of the cost, the time to run the example code locally is reduced from about 1min to about 30 secs. In particular, the time applying the pipeline locally is mostly spent on calling transform of the 137 `Bucketizer`s. Before the change, each call of `Bucketizer`'s transform can cost about 0.4 sec. So the total time spent on all `Bucketizer`s' transform is about 50 secs. After the change, each call only costs about 0.1 sec. <del>We also make `boundEnc` a lazy variable to reduce unnecessary running time.</del> ### Performance improvement The code and datasets provided by Barry Becker to reproduce this issue and benchmark it can be found on the JIRA. Before this patch: about 1 min After this patch: about 20 secs ## How was this patch tested? Existing tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #17770 from viirya/SPARK-20392.	2017-05-26 13:45:55 +08:00
Reynold Xin	a64746677b	[SPARK-20867][SQL] Move hints from Statistics into HintInfo class ## What changes were proposed in this pull request? This is a follow-up to SPARK-20857 to move the broadcast hint from Statistics into a new HintInfo class, so we can be more flexible in adding new hints in the future. ## How was this patch tested? Updated test cases to reflect the change. Author: Reynold Xin <rxin@databricks.com> Closes #18087 from rxin/SPARK-20867.	2017-05-24 13:57:19 -07:00
Reynold Xin	0d589ba00b	[SPARK-20857][SQL] Generic resolved hint node ## What changes were proposed in this pull request? This patch renames BroadcastHint to ResolvedHint (and Hint to UnresolvedHint) so the hint framework is more generic and would allow us to introduce other hint types in the future without introducing new hint nodes. ## How was this patch tested? Updated test cases. Author: Reynold Xin <rxin@databricks.com> Closes #18072 from rxin/SPARK-20857.	2017-05-23 18:44:49 +02:00
Liang-Chi Hsieh	442287ae29	[SPARK-20399][SQL][FOLLOW-UP] Add a config to fallback string literal parsing consistent with old sql parser behavior ## What changes were proposed in this pull request? As srowen pointed out in `609ba5f2b9 (commitcomment-22221259)`, the previous tests are not proper. This follow-up fixes the tests. ## How was this patch tested? Jenkins tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18048 from viirya/SPARK-20399-follow-up.	2017-05-23 16:09:38 +08:00
Xiao Li	a2460be9c3	[SPARK-17410][SPARK-17284] Move Hive-generated Stats Info to HiveClientImpl ### What changes were proposed in this pull request? After adding a new field `stats` to `CatalogTable`, we should not expose Hive-specific stats metadata to `MetastoreRelation`. It complicates all the related code. It also introduces a bug in `SHOW CREATE TABLE`. The statistics-related table properties should be skipped by `SHOW CREATE TABLE`, since they could be incorrect in the newly created table. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-13792 This PR also fixes the filling of Hive-generated RowCounts into our stats. This PR handles Hive-specific stats metadata in `HiveClientImpl`. ### How was this patch tested? Added a few test cases. Author: Xiao Li <gatorsmile@gmail.com> Closes #14971 from gatorsmile/showCreateTableNew.	2017-05-22 17:28:30 -07:00
Yuming Wang	9b09101938	[SPARK-20751][SQL][FOLLOWUP] Add cot test in MathExpressionsSuite ## What changes were proposed in this pull request? Add cot test in MathExpressionsSuite as https://github.com/apache/spark/pull/17999#issuecomment-302832794. ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18039 from wangyum/SPARK-20751-test.	2017-05-22 13:05:05 -07:00
gatorsmile	f3ed62a381	[SPARK-20831][SQL] Fix INSERT OVERWRITE data source tables with IF NOT EXISTS ### What changes were proposed in this pull request? Currently, we have a bug when we specify `IF NOT EXISTS` in `INSERT OVERWRITE` data source tables. For example, given a query: ```SQL INSERT OVERWRITE TABLE $tableName partition (b=2, c=3) IF NOT EXISTS SELECT 9, 10 ``` we will get the following error: ``` unresolved operator 'InsertIntoTable Relation[a#425,d#426,b#427,c#428] parquet, Map(b -> Some(2), c -> Some(3)), true, true;; 'InsertIntoTable Relation[a#425,d#426,b#427,c#428] parquet, Map(b -> Some(2), c -> Some(3)), true, true +- Project [cast(9#423 as int) AS a#429, cast(10#424 as int) AS d#430] +- Project [9 AS 9#423, 10 AS 10#424] +- OneRowRelation$ ``` This PR is to fix the issue to follow the behavior of Hive serde tables > INSERT OVERWRITE will overwrite any existing data in the table or partition unless IF NOT EXISTS is provided for a partition ### How was this patch tested? Modified an existing test case Author: gatorsmile <gatorsmile@gmail.com> Closes #18050 from gatorsmile/insertPartitionIfNotExists.	2017-05-22 22:24:50 +08:00
caoxuewen	3c9eef35a8	[SPARK-20786][SQL] Improve ceil and floor handle the value which is not expected ## What changes were proposed in this pull request? spark-sql>SELECT ceil(1234567890123456); 1234567890123456 spark-sql>SELECT ceil(12345678901234567); 12345678901234568 spark-sql>SELECT ceil(123456789012345678); 123456789012345680 When the length of getText is greater than 16, converting the long to a double loses precision, whereas MySQL handles these values correctly: mysql> SELECT ceil(1234567890123456); +------------------------+ | ceil(1234567890123456) | +------------------------+ | 1234567890123456 | +------------------------+ 1 row in set (0.00 sec) mysql> SELECT ceil(12345678901234567); +-------------------------+ | ceil(12345678901234567) | +-------------------------+ | 12345678901234567 | +-------------------------+ 1 row in set (0.00 sec) mysql> SELECT ceil(123456789012345678); +--------------------------+ | ceil(123456789012345678) | +--------------------------+ | 123456789012345678 | +--------------------------+ 1 row in set (0.00 sec) ## How was this patch tested? Supplemented the unit tests. Author: caoxuewen <cao.xuewen@zte.com.cn> Closes #18016 from heary-cao/ceil_long.	2017-05-21 22:39:07 -07:00
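The precision loss described in SPARK-20786 can be reproduced outside Spark. The following is a minimal sketch, not the patch's code: the class `CeilPrecisionSketch` and its methods are hypothetical names, and it only assumes that the old path widened the long to a double before applying `ceil`.

```java
// Hypothetical demo class; not part of Spark.
class CeilPrecisionSketch {
    // Lossy path: widen the long to a double, apply ceil, then narrow back.
    static long ceilViaDouble(long value) {
        return (long) Math.ceil((double) value);
    }

    // An integral input is already its own ceiling, so no conversion is needed.
    static long ceilExact(long value) {
        return value;
    }

    public static void main(String[] args) {
        long small = 1234567890123456L;   // below 2^53, exactly representable as a double
        long large = 12345678901234567L;  // above 2^53 and odd, so not representable as a double
        System.out.println(ceilViaDouble(small) == small); // value survives the round trip
        System.out.println(ceilViaDouble(large) == large); // value is altered by the round trip
        System.out.println(ceilExact(large) == large);     // exact path keeps the value
    }
}
```

A double has a 53-bit significand, so any odd long above 2^53 cannot survive the round trip; keeping the computation in an integral or decimal type avoids the problem.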
liuxian	ea3b1e352a	[SPARK-20763][SQL] The function of `month` and `day` return the value which is not we expected. ## What changes were proposed in this pull request? spark-sql>select month("1582-09-28"); spark-sql>10 For this case, the expected result is 9, but it is 10. spark-sql>select day("1582-04-18"); spark-sql>28 For this case, the expected result is 18, but it is 28. When the date is before "1582-10-04", the `month` and `day` functions return values that are not what we expected. ## How was this patch tested? unit tests Author: liuxian <liu.xian3@zte.com.cn> Closes #17997 from 10110346/wip_lx_0516.	2017-05-19 10:25:21 -07:00
Ala Luszczak	ce8edb8bf4	[SPARK-20798] GenerateUnsafeProjection should check if a value is null before calling the getter ## What changes were proposed in this pull request? GenerateUnsafeProjection.writeStructToBuffer() did not honor the assumption that the caller must make sure that a value is not null before using the getter. This could lead to various errors. This change fixes that behavior. Example of code generated before: ```java final UTF8String fieldName = value.getUTF8String(0); if (value.isNullAt(0)) { rowWriter1.setNullAt(0); } else { rowWriter1.write(0, fieldName); } ``` Example of code generated now: ```java boolean isNull1 = value.isNullAt(0); UTF8String value1 = isNull1 ? null : value.getUTF8String(0); if (isNull1) { rowWriter1.setNullAt(0); } else { rowWriter1.write(0, value1); } ``` ## How was this patch tested? Adds GenerateUnsafeProjectionSuite. Author: Ala Luszczak <ala@databricks.com> Closes #18030 from ala/fix-generate-unsafe-projection.	2017-05-19 13:18:48 +02:00
Xingbo Jiang	b7aac15d56	[SPARK-20700][SQL] InferFiltersFromConstraints stackoverflows for query (v2) ## What changes were proposed in this pull request? In the previous approach we used `aliasMap` to link an `Attribute` to the expression with potentially the form `f(a, b)`, but we only searched the `expressions` and `children.expressions` for this, which is not enough when an `Alias` may lie deep in the logical plan. In that case, we can't generate the valid equivalent constraint classes and thus we fail to prevent the recursive deductions. We fix this problem by collecting all `Alias`es from the logical plan. ## How was this patch tested? No additional test case was added, but one existing test case was modified to cover this situation. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18020 from jiangxb1987/inferConstrants.	2017-05-17 23:32:31 -07:00
Liang-Chi Hsieh	7463a88be6	[SPARK-20690][SQL] Subqueries in FROM should have alias names ## What changes were proposed in this pull request? We add missing attributes into Filter in Analyzer. But we shouldn't do it through subqueries like this: select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 This query works in the current codebase. However, the outer where clause shouldn't be able to refer to the `t1.c1` attribute. The root cause is that we previously allowed subqueries in FROM to have no alias names, which is confusing and isn't supported by various databases such as MySQL, Postgres, and Oracle. We shouldn't support it either. ## How was this patch tested? Jenkins tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #17935 from viirya/SPARK-20690.	2017-05-17 12:57:35 +08:00
Tejas Patil	d2416925c4	[SPARK-17729][SQL] Enable creating hive bucketed tables ## What changes were proposed in this pull request? Hive allows inserting data into a bucketed table without guaranteeing bucketing and sortedness, based on these two configs: `hive.enforce.bucketing` and `hive.enforce.sorting`. What does this PR achieve? - Spark will disallow users from writing outputs to hive bucketed tables by default (given that the output won't adhere to Hive's semantics). - If the user still wants to write to a hive bucketed table, the only resort is to use `hive.enforce.bucketing=false` and `hive.enforce.sorting=false`, which means the user does NOT care about bucketing guarantees. Changes done in this PR: - Extract the table's bucketing information in `HiveClientImpl` - While writing table info to the metastore, `HiveClientImpl` now populates the bucketing information in the hive `Table` object - `InsertIntoHiveTable` allows inserts to a bucketed table only if both `hive.enforce.bucketing` and `hive.enforce.sorting` are `false` The ability to create bucketed tables will enable adding test cases to Spark while I add more changes related to hive bucketing support. Design doc for hive bucketing support: https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit# ## How was this patch tested? - Added test for creating bucketed and sorted table. - Added test to ensure that INSERTs fail if strict bucket / sort is enforced - Added test to ensure that INSERTs can go through if strict bucket / sort is NOT enforced - Added test to validate that bucketing information shows up in output of DESC FORMATTED - Added test to ensure that `SHOW CREATE TABLE` works for hive bucketed tables Author: Tejas Patil <tejasp@fb.com> Closes #17644 from tejasapatil/SPARK-17729_create_bucketed_table.	2017-05-16 01:47:23 +08:00
Takeshi Yamamuro	b0888d1ac3	[SPARK-20730][SQL] Add an optimizer rule to combine nested Concat ## What changes were proposed in this pull request? This pr added a new Optimizer rule to combine nested Concat. The master supports a pipeline operator '||' to concatenate strings in #17711 (This pr is follow-up). Since the parser currently generates nested Concat expressions, the optimizer needs to combine the nested expressions. ## How was this patch tested? Added tests in `CombineConcatSuite` and `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #17970 from maropu/SPARK-20730.	2017-05-15 16:24:55 +08:00
hyukjinkwon	720708ccdd	[SPARK-20639][SQL] Add single argument support for to_timestamp in SQL with documentation improvement ## What changes were proposed in this pull request? This PR proposes three things as below: - Use casting rules to a timestamp in `to_timestamp` by default (it was `yyyy-MM-dd HH:mm:ss`). - Support single argument for `to_timestamp` similarly with APIs in other languages. For example, the one below works ``` import org.apache.spark.sql.functions._ Seq("2016-12-31 00:12:00.00").toDF("a").select(to_timestamp(col("a"))).show() ``` prints ``` +----------------------------------------+ \|to_timestamp(`a`, 'yyyy-MM-dd HH:mm:ss')\| +----------------------------------------+ \| 2016-12-31 00:12:00\| +----------------------------------------+ ``` whereas this does not work in SQL. Before ``` spark-sql> SELECT to_timestamp('2016-12-31 00:12:00'); Error in query: Invalid number of arguments for function to_timestamp; line 1 pos 7 ``` After ``` spark-sql> SELECT to_timestamp('2016-12-31 00:12:00'); 2016-12-31 00:12:00 ``` - Related document improvement for SQL function descriptions and other API descriptions accordingly. Before ``` spark-sql> DESCRIBE FUNCTION extended to_date; ... Usage: to_date(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input. Extended Usage: Examples: > SELECT to_date('2016-12-31', 'yyyy-MM-dd'); 2016-12-31 ``` ``` spark-sql> DESCRIBE FUNCTION extended to_timestamp; ... Usage: to_timestamp(timestamp, fmt) - Parses the `left` expression with the `format` expression to a timestamp. Returns null with invalid input. Extended Usage: Examples: > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd'); 2016-12-31 00:00:00.0 ``` After ``` spark-sql> DESCRIBE FUNCTION extended to_date; ... Usage: to_date(date_str[, fmt]) - Parses the `date_str` expression with the `fmt` expression to a date. Returns null with invalid input. By default, it follows casting rules to a date if the `fmt` is omitted. 
Extended Usage: Examples: > SELECT to_date('2009-07-30 04:17:52'); 2009-07-30 > SELECT to_date('2016-12-31', 'yyyy-MM-dd'); 2016-12-31 ``` ``` spark-sql> DESCRIBE FUNCTION extended to_timestamp; ... Usage: to_timestamp(timestamp[, fmt]) - Parses the `timestamp` expression with the `fmt` expression to a timestamp. Returns null with invalid input. By default, it follows casting rules to a timestamp if the `fmt` is omitted. Extended Usage: Examples: > SELECT to_timestamp('2016-12-31 00:12:00'); 2016-12-31 00:12:00 > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd'); 2016-12-31 00:00:00 ``` ## How was this patch tested? Added tests in `datetime.sql`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17901 from HyukjinKwon/to_timestamp_arg.	2017-05-12 16:42:58 +08:00
liuxian	2b36eb696f	[SPARK-20665][SQL] "Bround" and "Round" function return NULL ## What changes were proposed in this pull request? spark-sql>select bround(12.3, 2); spark-sql>NULL For this case, the expected result is 12.3, but it is null. So, when the second parameter is bigger than "decimal.scala", the result is not what we expected. The "round" function has the same problem. This PR solves the problem for both of them. ## How was this patch tested? unit test cases in MathExpressionsSuite and MathFunctionsSuite Author: liuxian <liu.xian3@zte.com.cn> Closes #17906 from 10110346/wip_lx_0509.	2017-05-12 11:38:50 +08:00
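The expected behaviour in SPARK-20665 (rounding to more decimal places than the value has should leave the value numerically unchanged) can be sketched with `java.math.BigDecimal`. This is an illustration only, assuming half-even rounding for `bround`; the class `RoundScaleSketch` is a hypothetical name and does not reflect how Spark's `Decimal` is implemented.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical demo class; not part of Spark.
class RoundScaleSketch {
    // Round half-even to the requested number of decimal places.
    static BigDecimal bround(BigDecimal value, int scale) {
        return value.setScale(scale, RoundingMode.HALF_EVEN);
    }

    public static void main(String[] args) {
        BigDecimal input = new BigDecimal("12.3");
        // Requested scale (2) exceeds the input's scale (1): the value is padded, not lost.
        BigDecimal widened = bround(input, 2);
        System.out.println(widened.compareTo(input) == 0);
        // Requested scale below the input's scale: ties go to the even neighbour.
        System.out.println(bround(new BigDecimal("12.25"), 1));
    }
}
```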
Liang-Chi Hsieh	609ba5f2b9	[SPARK-20399][SQL] Add a config to fallback string literal parsing consistent with old sql parser behavior ## What changes were proposed in this pull request? The new SQL parser was introduced in Spark 2.0. All string literals are unescaped in the parser. This seems to introduce an issue regarding regex pattern strings. The following code can reproduce it: val data = Seq("\u0020\u0021\u0023", "abc") val df = data.toDF() // 1st usage: works in 1.6 // Let parser parse pattern string val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'") // 2nd usage: works in 1.6, 2.x // Call Column.rlike so the pattern string is a literal which doesn't go through parser val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$")) // In 2.x, we need to add backslashes to make the regex pattern parse correctly val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'") Following the discussion in #17736, this patch adds a config to fall back to 1.6 string literal parsing and mitigate the migration issue. ## How was this patch tested? Jenkins tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #17887 from viirya/add-config-fallback-string-parsing.	2017-05-12 11:15:10 +08:00
Takeshi Yamamuro	8c67aa7f00	[SPARK-20311][SQL] Support aliases for table value functions ## What changes were proposed in this pull request? This pr adds parsing rules to support aliases in table value functions. The previous pr (#17666) was reverted because of a regression. This new pr fixes the regression and adds tests in `SQLQueryTestSuite`. ## How was this patch tested? Added tests in `PlanParserSuite` and `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #17928 from maropu/SPARK-20311-3.	2017-05-11 18:09:31 +08:00
wangzhenhua	76e4a5566b	[SPARK-20678][SQL] Ndv for columns not in filter condition should also be updated ## What changes were proposed in this pull request? In filter estimation, we update column stats for the columns in the filter condition. However, if the number of rows decreases after the filter (i.e. the overall selectivity is less than 1), we need to update (scale down) the number of distinct values (NDV) for all columns, regardless of whether they appear in the filter condition. This pr also fixes the inconsistency of rounding mode for ndv and rowCount. ## How was this patch tested? Added new tests. Author: wangzhenhua <wangzhenhua@huawei.com> Closes #17918 from wzhfy/scaleDownNdvAfterFilter.	2017-05-10 19:42:49 +08:00
Josh Rosen	a90c5cd822	[SPARK-20686][SQL] PropagateEmptyRelation incorrectly handles aggregate without grouping ## What changes were proposed in this pull request? The query ``` SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1 ``` should return a single row of output because the subquery is an aggregate without a group-by and thus should return a single row. However, Spark incorrectly returns zero rows. This is caused by SPARK-16208 / #13906, a patch which added an optimizer rule to propagate EmptyRelation through operators. The logic for handling aggregates is wrong: it checks whether aggregate expressions are non-empty for deciding whether the output should be empty, whereas it should be checking grouping expressions instead: An aggregate with non-empty grouping expression will return one output row per group. If the input to the grouped aggregate is empty then all groups will be empty and thus the output will be empty. It doesn't matter whether the aggregation output columns include aggregate expressions since that won't affect the number of output rows. If the grouping expressions are empty, however, then the aggregate will always produce a single output row and thus we cannot propagate the EmptyRelation. The current implementation is incorrect and also misses an optimization opportunity by not propagating EmptyRelation in the case where a grouped aggregate has aggregate expressions (in other words, `SELECT COUNT(*) from emptyRelation GROUP BY x` would _not_ be optimized to `EmptyRelation` in the old code, even though it safely could be). This patch resolves this issue by modifying `PropagateEmptyRelation` to consider only the presence/absence of grouping expressions, not the aggregate functions themselves, when deciding whether to propagate EmptyRelation. ## How was this patch tested? - Added end-to-end regression tests in `SQLQueryTest`'s `group-by.sql` file. - Updated unit tests in `PropagateEmptyRelationSuite`. 
Author: Josh Rosen <joshrosen@databricks.com> Closes #17929 from JoshRosen/fix-PropagateEmptyRelation.	2017-05-10 14:36:36 +08:00
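The decision rule described in SPARK-20686 fits in a few lines. The sketch below is a simplified model, not the Catalyst rule itself: `EmptyAggregateSketch` and its methods are hypothetical names, and grouping expressions are represented as plain strings.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the decision; not the Catalyst optimizer rule.
class EmptyAggregateSketch {
    // An aggregate over an empty input may be replaced by an empty relation
    // only when it has grouping expressions (no input rows means no groups).
    static boolean canPropagateEmpty(List<String> groupingExpressions) {
        return !groupingExpressions.isEmpty();
    }

    // Rows produced by an aggregate whose input is empty:
    // a global aggregate always yields one row, a grouped one yields none.
    static int outputRowsOnEmptyInput(List<String> groupingExpressions) {
        return groupingExpressions.isEmpty() ? 1 : 0;
    }

    public static void main(String[] args) {
        List<String> noGrouping = Collections.emptyList(); // SELECT COUNT(*) WHERE FALSE
        List<String> groupByX = Arrays.asList("x");        // SELECT COUNT(*) ... GROUP BY x
        System.out.println(canPropagateEmpty(noGrouping));
        System.out.println(canPropagateEmpty(groupByX));
    }
}
```

Note that the aggregate functions in the select list do not appear in either method; only the grouping expressions decide the row count.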
Yin Huai	f79aa285cf	Revert "[SPARK-20311][SQL] Support aliases for table value functions" This reverts commit `714811d0b5`.	2017-05-09 14:47:45 -07:00
Takeshi Yamamuro	714811d0b5	[SPARK-20311][SQL] Support aliases for table value functions ## What changes were proposed in this pull request? This pr added parsing rules to support aliases in table value functions. ## How was this patch tested? Added tests in `PlanParserSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #17666 from maropu/SPARK-20311.	2017-05-09 20:22:51 +08:00
Sean Owen	16fab6b0ef	[SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release ## What changes were proposed in this pull request? Fix build warnings primarily related to Breeze 0.13 operator changes, Java style problems ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #17803 from srowen/SPARK-20523.	2017-05-03 10:18:35 +01:00
Burak Yavuz	86174ea89b	[SPARK-20549] java.io.CharConversionException: Invalid UTF-32' in JsonToStructs ## What changes were proposed in this pull request? A fix for the same problem was made in #17693 but ignored `JsonToStructs`. This PR uses the same fix for `JsonToStructs`. ## How was this patch tested? Regression test Author: Burak Yavuz <brkyvz@gmail.com> Closes #17826 from brkyvz/SPARK-20549.	2017-05-02 14:08:16 +08:00
ptkool	259860d23d	[SPARK-20463] Add support for IS [NOT] DISTINCT FROM. ## What changes were proposed in this pull request? Add support for the SQL standard distinct predicate to SPARK SQL. ``` <expression> IS [NOT] DISTINCT FROM <expression> ``` ## How was this patch tested? Tested using unit tests, integration tests, manual tests. Author: ptkool <michael.styles@shopify.com> Closes #17764 from ptkool/is_not_distinct_from.	2017-05-01 17:05:35 -07:00
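For readers unfamiliar with the predicate added in SPARK-20463, its semantics differ from plain equality only in how NULL is treated. The following sketch models SQL NULL as Java `null`; `DistinctFromSketch` and its methods are hypothetical names used for illustration, not Spark code.

```java
import java.util.Objects;

// Hypothetical demo class; SQL NULL is modelled as Java null.
class DistinctFromSketch {
    // a IS DISTINCT FROM b: never NULL; two NULLs count as the same value.
    static boolean isDistinctFrom(Integer left, Integer right) {
        return !Objects.equals(left, right);
    }

    // a IS NOT DISTINCT FROM b: the null-safe form of equality.
    static boolean isNotDistinctFrom(Integer left, Integer right) {
        return Objects.equals(left, right);
    }

    // Plain SQL equality for contrast: the result is NULL if either side is NULL.
    static Boolean sqlEquals(Integer left, Integer right) {
        if (left == null || right == null) {
            return null;
        }
        return left.equals(right);
    }

    public static void main(String[] args) {
        System.out.println(isDistinctFrom(null, null)); // two NULLs are not distinct
        System.out.println(isDistinctFrom(1, null));    // a value is distinct from NULL
        System.out.println(sqlEquals(1, null));         // plain equality yields NULL
    }
}
```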
hyukjinkwon	1ee494d086	[SPARK-20492][SQL] Do not print empty parentheses for invalid primitive types in parser ## What changes were proposed in this pull request? Currently, when the type string is invalid, the error message prints empty parentheses after it. This PR proposes a small improvement to the error message by removing them in the parser, as below: ```scala spark.range(1).select($"col".cast("aa")) ``` Before ``` org.apache.spark.sql.catalyst.parser.ParseException: DataType aa() is not supported.(line 1, pos 0) == SQL == aa ^^^ ``` After ``` org.apache.spark.sql.catalyst.parser.ParseException: DataType aa is not supported.(line 1, pos 0) == SQL == aa ^^^ ``` ## How was this patch tested? Unit tests in `DataTypeParserSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17784 from HyukjinKwon/SPARK-20492.	2017-04-30 08:24:10 -07:00
Kris Mok	26ac2ce05c	[SPARK-20482][SQL] Resolving Casts is too strict on having time zone set ## What changes were proposed in this pull request? Relax the requirement that a `TimeZoneAwareExpression` has to have its `timeZoneId` set to be considered resolved. With this change, a `Cast` (which is a `TimeZoneAwareExpression`) can be considered resolved if the `(fromType, toType)` combination doesn't require time zone information. Also de-relaxed test cases in `CastSuite` so Casts in that test suite don't get a default `timeZoneId = Option("GMT")`. ## How was this patch tested? Ran the de-relaxed `CastSuite` and it's passing. Also ran the SQL unit tests and they're passing too. Author: Kris Mok <kris.mok@databricks.com> Closes #17777 from rednaxelafx/fix-catalyst-cast-timezone.	2017-04-27 12:08:16 -07:00
Eric Wasserman	57e1da3946	[SPARK-16548][SQL] Inconsistent error handling in JSON parsing SQL functions ## What changes were proposed in this pull request? change to using Jackson's `com.fasterxml.jackson.core.JsonFactory` public JsonParser createParser(String content) ## How was this patch tested? existing unit tests Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Eric Wasserman <ericw@sgn.com> Closes #17693 from ewasserman/SPARK-20314.	2017-04-26 11:42:43 +08:00
Kazuaki Ishizaki	a750a59597	[SPARK-20341][SQL] Support BigInt's value that does not fit in long value range ## What changes were proposed in this pull request? This PR avoids an exception in the case where `scala.math.BigInt` has a value that does not fit into the long value range (e.g. `Long.MAX_VALUE+1`). When we run the sample program below with the current Spark, the exception shown below is thrown. This PR keeps the value using `BigDecimal` if we detect such an overflow case by catching `ArithmeticException`. Sample program: ``` case class BigIntWrapper(value:scala.math.BigInt) spark.createDataset(BigIntWrapper(scala.math.BigInt("10000000000000000002"))::Nil).show ``` Exception: ``` Error while encoding: java.lang.ArithmeticException: BigInteger out of long range staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), apply, assertnotnull(assertnotnull(input[0, org.apache.spark.sql.BigIntWrapper, true])).value, true) AS value#0 java.lang.RuntimeException: Error while encoding: java.lang.ArithmeticException: BigInteger out of long range staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), apply, assertnotnull(assertnotnull(input[0, org.apache.spark.sql.BigIntWrapper, true])).value, true) AS value#0 at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290) at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454) at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454) at 
org.apache.spark.sql.Agg$$anonfun$18.apply$mcV$sp(MySuite.scala:192) at org.apache.spark.sql.Agg$$anonfun$18.apply(MySuite.scala:192) at org.apache.spark.sql.Agg$$anonfun$18.apply(MySuite.scala:192) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) ... Caused by: java.lang.ArithmeticException: BigInteger out of long range at java.math.BigInteger.longValueExact(BigInteger.java:4531) at org.apache.spark.sql.types.Decimal.set(Decimal.scala:140) at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:434) at org.apache.spark.sql.types.Decimal.apply(Decimal.scala) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) ... 59 more ``` ## How was this patch tested? Add new test suite into `DecimalSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #17684 from kiszk/SPARK-20341.	2017-04-21 22:25:35 +08:00
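The fallback strategy described in SPARK-20341 (try the exact long conversion, and keep the value as a `BigDecimal` when `ArithmeticException` signals overflow) can be sketched in plain Java. `BigIntFallbackSketch` and `toStoredValue` are hypothetical names; this is not the code in Spark's `Decimal` class.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical demo class; not Spark's Decimal implementation.
class BigIntFallbackSketch {
    // Returns a Long when the value fits in 64 bits, otherwise a BigDecimal.
    static Object toStoredValue(BigInteger value) {
        try {
            // Throws ArithmeticException ("BigInteger out of long range") on overflow.
            return value.longValueExact();
        } catch (ArithmeticException overflow) {
            // Keep the full value instead of failing.
            return new BigDecimal(value);
        }
    }

    public static void main(String[] args) {
        System.out.println(toStoredValue(new BigInteger("42")).getClass().getSimpleName());
        System.out.println(toStoredValue(new BigInteger("10000000000000000002")).getClass().getSimpleName());
    }
}
```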
Herman van Hovell	e2b3d2367a	[SPARK-20420][SQL] Add events to the external catalog ## What changes were proposed in this pull request? It is often useful to be able to track changes to the `ExternalCatalog`. This PR makes the `ExternalCatalog` emit events when a catalog object is changed. Events are fired before and after the change. The following events are fired per object: - Database - CreateDatabasePreEvent: event fired before the database is created. - CreateDatabaseEvent: event fired after the database has been created. - DropDatabasePreEvent: event fired before the database is dropped. - DropDatabaseEvent: event fired after the database has been dropped. - Table - CreateTablePreEvent: event fired before the table is created. - CreateTableEvent: event fired after the table has been created. - RenameTablePreEvent: event fired before the table is renamed. - RenameTableEvent: event fired after the table has been renamed. - DropTablePreEvent: event fired before the table is dropped. - DropTableEvent: event fired after the table has been dropped. - Function - CreateFunctionPreEvent: event fired before the function is created. - CreateFunctionEvent: event fired after the function has been created. - RenameFunctionPreEvent: event fired before the function is renamed. - RenameFunctionEvent: event fired after the function has been renamed. - DropFunctionPreEvent: event fired before the function is dropped. - DropFunctionEvent: event fired after the function has been dropped. The events currently contain only the names of the modified objects. We will add more events, and more details, at a later point. A user can monitor changes to the external catalog by adding a listener to the Spark listener bus that checks for `ExternalCatalogEvent`s using the `SparkListener.onOtherEvent` hook. A more direct approach is to add a listener directly to the `ExternalCatalog`. ## How was this patch tested? Added the `ExternalCatalogEventSuite`. 
Author: Herman van Hovell <hvanhovell@databricks.com> Closes #17710 from hvanhovell/SPARK-20420.	2017-04-21 00:05:03 -07:00
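The pre/post event pattern described in SPARK-20420 can be modelled with a small observer. Everything in the sketch below is hypothetical (`CatalogEventSketch`, its `Listener` interface, and the string-encoded events); it does not use Spark's `ExternalCatalog` or `SparkListener` APIs and only illustrates the ordering of events around a change.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory catalog; not Spark's ExternalCatalog.
class CatalogEventSketch {
    // Hypothetical listener type; events are plain strings for brevity.
    interface Listener {
        void onEvent(String event);
    }

    private final List<Listener> listeners = new ArrayList<>();
    private final List<String> tables = new ArrayList<>();

    void addListener(Listener listener) {
        listeners.add(listener);
    }

    // Fires a pre-event, applies the change, then fires the post-event.
    void createTable(String name) {
        post("CreateTablePreEvent(" + name + ")");
        tables.add(name);
        post("CreateTableEvent(" + name + ")");
    }

    // Same ordering for drops: pre-event, change, post-event.
    void dropTable(String name) {
        post("DropTablePreEvent(" + name + ")");
        tables.remove(name);
        post("DropTableEvent(" + name + ")");
    }

    private void post(String event) {
        for (Listener listener : listeners) {
            listener.onEvent(event);
        }
    }

    public static void main(String[] args) {
        CatalogEventSketch catalog = new CatalogEventSketch();
        catalog.addListener(event -> System.out.println(event));
        catalog.createTable("t");
        catalog.dropTable("t");
    }
}
```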
Herman van Hovell	760c8d088d	[SPARK-20329][SQL] Make timezone aware expression without timezone unresolved ## What changes were proposed in this pull request? A cast expression with a resolved time zone is not equal to a cast expression without a resolved time zone. The `ResolveAggregateFunction` assumed that these expression were the same, and would fail to resolve `HAVING` clauses which contain a `Cast` expression. This is in essence caused by the fact that a `TimeZoneAwareExpression` can be resolved without a set time zone. This PR fixes this, and makes a `TimeZoneAwareExpression` unresolved as long as it has no TimeZone set. ## How was this patch tested? Added a regression test to the `SQLQueryTestSuite.having` file. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #17641 from hvanhovell/SPARK-20329.	2017-04-21 10:06:12 +08:00
ptkool	63824b2c8e	[SPARK-20350] Add optimization rules to apply Complementation Laws. ## What changes were proposed in this pull request? Apply Complementation Laws during boolean expression simplification. ## How was this patch tested? Tested using unit tests, integration tests, and manual tests. Author: ptkool <michael.styles@shopify.com> Author: Michael Styles <michael.styles@shopify.com> Closes #17650 from ptkool/apply_complementation_laws.	2017-04-20 09:51:13 +08:00
Kazuaki Ishizaki	e468a96c40	[SPARK-20254][SQL] Remove unnecessary data conversion for Dataset with primitive array ## What changes were proposed in this pull request? This PR eliminates unnecessary data conversion, which was introduced by SPARK-19716, for Datasets with primitive arrays in the generated Java code. When we run the following example program, we currently get the Java code shown under "Without this PR". In this code, lines 56-82 are unnecessary since the primitive array in ArrayData can be converted into a Java primitive array by using the ``toDoubleArray()`` method. ``GenericArrayData`` is not required. ```scala val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache ds.count ds.map(e => e).show ``` Without this PR ``` == Parsed Logical Plan == 'SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25] +- 'MapElements <function1>, class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D +- 'DeserializeToObject unresolveddeserializer(unresolvedmapobjects(<function1>, getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), obj#23: [D +- SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2] +- ExternalRDD [obj#1] == Analyzed Logical Plan == value: array<double> SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25] +- MapElements <function1>, class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D +- DeserializeToObject mapobjects(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, assertnotnull(lambdavariable(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, true), - array element class: "scala.Double", - root 
class: "scala.Array"), value#2, None, MapObjects_builderValue5).toDoubleArray, obj#23: [D +- SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2] +- ExternalRDD [obj#1] == Optimized Logical Plan == SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25] +- MapElements <function1>, class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D +- DeserializeToObject mapobjects(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, assertnotnull(lambdavariable(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, true), - array element class: "scala.Double", - root class: "scala.Array"), value#2, None, MapObjects_builderValue5).toDoubleArray, obj#23: [D +- InMemoryRelation [value#2], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) +- SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2] +- Scan ExternalRDDScan[obj#1] == Physical Plan == SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25] +- MapElements <function1>, obj#24: [D +- DeserializeToObject mapobjects(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, assertnotnull(lambdavariable(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, true), - array element class: "scala.Double", - root class: "scala.Array"), value#2, None, MapObjects_builderValue5).toDoubleArray, obj#23: [D +- InMemoryTableScan [value#2] +- InMemoryRelation [value#2], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) +- SerializeFromObject 
[staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2] +- Scan ExternalRDDScan[obj#1] ``` ```java / 050 / protected void processNext() throws java.io.IOException { / 051 / while (inputadapter_input.hasNext() && !stopEarly()) { / 052 / InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); / 053 / boolean inputadapter_isNull = inputadapter_row.isNullAt(0); / 054 / ArrayData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getArray(0)); / 055 / / 056 / ArrayData deserializetoobject_value1 = null; / 057 / / 058 / if (!inputadapter_isNull) { / 059 / int deserializetoobject_dataLength = inputadapter_value.numElements(); / 060 / / 061 / Double[] deserializetoobject_convertedArray = null; / 062 / deserializetoobject_convertedArray = new Double[deserializetoobject_dataLength]; / 063 / / 064 / int deserializetoobject_loopIndex = 0; / 065 / while (deserializetoobject_loopIndex < deserializetoobject_dataLength) { / 066 / MapObjects_loopValue2 = (double) (inputadapter_value.getDouble(deserializetoobject_loopIndex)); / 067 / MapObjects_loopIsNull2 = inputadapter_value.isNullAt(deserializetoobject_loopIndex); / 068 / / 069 / if (MapObjects_loopIsNull2) { / 070 / throw new RuntimeException(((java.lang.String) references[0])); / 071 / } / 072 / if (false) { / 073 / deserializetoobject_convertedArray[deserializetoobject_loopIndex] = null; / 074 / } else { / 075 / deserializetoobject_convertedArray[deserializetoobject_loopIndex] = MapObjects_loopValue2; / 076 / } / 077 / / 078 / deserializetoobject_loopIndex += 1; / 079 / } / 080 / / 081 / deserializetoobject_value1 = new org.apache.spark.sql.catalyst.util.GenericArrayData(deserializetoobject_convertedArray); /###/ / 082 / } / 083 / boolean deserializetoobject_isNull = true; / 084 / double[] deserializetoobject_value = null; / 085 / if (!inputadapter_isNull) { / 086 / 
deserializetoobject_isNull = false; / 087 / if (!deserializetoobject_isNull) { / 088 / Object deserializetoobject_funcResult = null; / 089 / deserializetoobject_funcResult = deserializetoobject_value1.toDoubleArray(); / 090 / if (deserializetoobject_funcResult == null) { / 091 / deserializetoobject_isNull = true; / 092 / } else { / 093 / deserializetoobject_value = (double[]) deserializetoobject_funcResult; / 094 / } / 095 / / 096 / } / 097 / deserializetoobject_isNull = deserializetoobject_value == null; / 098 / } / 099 / / 100 / boolean mapelements_isNull = true; / 101 / double[] mapelements_value = null; / 102 / if (!false) { / 103 / mapelements_resultIsNull = false; / 104 / / 105 / if (!mapelements_resultIsNull) { / 106 / mapelements_resultIsNull = deserializetoobject_isNull; / 107 / mapelements_argValue = deserializetoobject_value; / 108 / } / 109 / / 110 / mapelements_isNull = mapelements_resultIsNull; / 111 / if (!mapelements_isNull) { / 112 / Object mapelements_funcResult = null; / 113 / mapelements_funcResult = ((scala.Function1) references[1]).apply(mapelements_argValue); / 114 / if (mapelements_funcResult == null) { / 115 / mapelements_isNull = true; / 116 / } else { / 117 / mapelements_value = (double[]) mapelements_funcResult; / 118 / } / 119 / / 120 / } / 121 / mapelements_isNull = mapelements_value == null; / 122 / } / 123 / / 124 / serializefromobject_resultIsNull = false; / 125 / / 126 / if (!serializefromobject_resultIsNull) { / 127 / serializefromobject_resultIsNull = mapelements_isNull; / 128 / serializefromobject_argValue = mapelements_value; / 129 / } / 130 / / 131 / boolean serializefromobject_isNull = serializefromobject_resultIsNull; / 132 / final ArrayData serializefromobject_value = serializefromobject_resultIsNull ? 
null : org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.fromPrimitiveArray(serializefromobject_argValue); / 133 / serializefromobject_isNull = serializefromobject_value == null; / 134 / serializefromobject_holder.reset(); / 135 / / 136 / serializefromobject_rowWriter.zeroOutNullBytes(); / 137 / / 138 / if (serializefromobject_isNull) { / 139 / serializefromobject_rowWriter.setNullAt(0); / 140 / } else { / 141 / // Remember the current cursor so that we can calculate how many bytes are / 142 / // written later. / 143 / final int serializefromobject_tmpCursor = serializefromobject_holder.cursor; / 144 / / 145 / if (serializefromobject_value instanceof UnsafeArrayData) { / 146 / final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes(); / 147 / // grow the global buffer before writing data. / 148 / serializefromobject_holder.grow(serializefromobject_sizeInBytes); / 149 / ((UnsafeArrayData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor); / 150 / serializefromobject_holder.cursor += serializefromobject_sizeInBytes; / 151 / / 152 / } else { / 153 / final int serializefromobject_numElements = serializefromobject_value.numElements(); / 154 / serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 8); / 155 / / 156 / for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements; serializefromobject_index++) { / 157 / if (serializefromobject_value.isNullAt(serializefromobject_index)) { / 158 / serializefromobject_arrayWriter.setNullDouble(serializefromobject_index); / 159 / } else { / 160 / final double serializefromobject_element = serializefromobject_value.getDouble(serializefromobject_index); / 161 / serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element); / 162 / } / 163 / } / 164 / } / 165 / / 166 / 
serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor); / 167 / } / 168 / serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize()); / 169 / append(serializefromobject_result); / 170 / if (shouldStop()) return; / 171 / } / 172 / } ``` With this PR (eliminated lines 56-62 in the above code) ```java / 047 / protected void processNext() throws java.io.IOException { / 048 / while (inputadapter_input.hasNext() && !stopEarly()) { / 049 / InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); / 050 / boolean inputadapter_isNull = inputadapter_row.isNullAt(0); / 051 / ArrayData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getArray(0)); / 052 / / 053 / boolean deserializetoobject_isNull = true; / 054 / double[] deserializetoobject_value = null; / 055 / if (!inputadapter_isNull) { / 056 / deserializetoobject_isNull = false; / 057 / if (!deserializetoobject_isNull) { / 058 / Object deserializetoobject_funcResult = null; / 059 / deserializetoobject_funcResult = inputadapter_value.toDoubleArray(); / 060 / if (deserializetoobject_funcResult == null) { / 061 / deserializetoobject_isNull = true; / 062 / } else { / 063 / deserializetoobject_value = (double[]) deserializetoobject_funcResult; / 064 / } / 065 / / 066 / } / 067 / deserializetoobject_isNull = deserializetoobject_value == null; / 068 / } / 069 / / 070 / boolean mapelements_isNull = true; / 071 / double[] mapelements_value = null; / 072 / if (!false) { / 073 / mapelements_resultIsNull = false; / 074 / / 075 / if (!mapelements_resultIsNull) { / 076 / mapelements_resultIsNull = deserializetoobject_isNull; / 077 / mapelements_argValue = deserializetoobject_value; / 078 / } / 079 / / 080 / mapelements_isNull = mapelements_resultIsNull; / 081 / if (!mapelements_isNull) { / 082 / Object mapelements_funcResult = null; / 083 / mapelements_funcResult = ((scala.Function1) 
references[0]).apply(mapelements_argValue); / 084 / if (mapelements_funcResult == null) { / 085 / mapelements_isNull = true; / 086 / } else { / 087 / mapelements_value = (double[]) mapelements_funcResult; / 088 / } / 089 / / 090 / } / 091 / mapelements_isNull = mapelements_value == null; / 092 / } / 093 / / 094 / serializefromobject_resultIsNull = false; / 095 / / 096 / if (!serializefromobject_resultIsNull) { / 097 / serializefromobject_resultIsNull = mapelements_isNull; / 098 / serializefromobject_argValue = mapelements_value; / 099 / } / 100 / / 101 / boolean serializefromobject_isNull = serializefromobject_resultIsNull; / 102 / final ArrayData serializefromobject_value = serializefromobject_resultIsNull ? null : org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.fromPrimitiveArray(serializefromobject_argValue); / 103 / serializefromobject_isNull = serializefromobject_value == null; / 104 / serializefromobject_holder.reset(); / 105 / / 106 / serializefromobject_rowWriter.zeroOutNullBytes(); / 107 / / 108 / if (serializefromobject_isNull) { / 109 / serializefromobject_rowWriter.setNullAt(0); / 110 / } else { / 111 / // Remember the current cursor so that we can calculate how many bytes are / 112 / // written later. / 113 / final int serializefromobject_tmpCursor = serializefromobject_holder.cursor; / 114 / / 115 / if (serializefromobject_value instanceof UnsafeArrayData) { / 116 / final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes(); / 117 / // grow the global buffer before writing data. 
/ 118 / serializefromobject_holder.grow(serializefromobject_sizeInBytes); / 119 / ((UnsafeArrayData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor); / 120 / serializefromobject_holder.cursor += serializefromobject_sizeInBytes; / 121 / / 122 / } else { / 123 / final int serializefromobject_numElements = serializefromobject_value.numElements(); / 124 / serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 8); / 125 / / 126 / for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements; serializefromobject_index++) { / 127 / if (serializefromobject_value.isNullAt(serializefromobject_index)) { / 128 / serializefromobject_arrayWriter.setNullDouble(serializefromobject_index); / 129 / } else { / 130 / final double serializefromobject_element = serializefromobject_value.getDouble(serializefromobject_index); / 131 / serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element); / 132 / } / 133 / } / 134 / } / 135 / / 136 / serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor); / 137 / } / 138 / serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize()); / 139 / append(serializefromobject_result); / 140 / if (shouldStop()) return; / 141 / } / 142 */ } ``` ## How was this patch tested? Add test suites into `DatasetPrimitiveSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #17568 from kiszk/SPARK-20254.	2017-04-19 10:58:05 +08:00
wangzhenhua	321b4f03bc	[SPARK-20366][SQL] Fix recursive join reordering: inside joins are not reordered ## What changes were proposed in this pull request? If a plan has multi-level successive joins, e.g.: ``` Join / \ Union t5 / \ Join t4 / \ Join t3 / \ t1 t2 ``` Currently we fail to reorder the inside joins, i.e. t1, t2, t3. In join reorder, we use `OrderedJoin` to indicate that a join has been ordered, so that when transforming down the plan, these joins don't need to be reordered again. But there's a problem in the definition of `OrderedJoin`: the real join node is a parameter, but not a child. This breaks the transform procedure because `mapChildren` applies the transform function only to parameters that are children. In this patch, we change `OrderedJoin` to a class having the same structure as a join node. ## How was this patch tested? Add a corresponding test case. Author: wangzhenhua <wangzhenhua@huawei.com> Closes #17668 from wzhfy/recursiveReorder.	2017-04-18 20:12:21 +08:00
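The `OrderedJoin` problem described in SPARK-20366 can be illustrated with a toy tree. This is a simplified model, not Catalyst code: `Node`, `OpaqueOrderedJoin`, `ChildShapedOrderedJoin`, and `visited` are hypothetical names invented for the sketch, and the marker that hides the join is modelled as having no children at all. It only shows that a top-down transform never reaches a node that is held as a plain field rather than exposed as a child.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

final class OrderedJoinSketch {
    /** Minimal plan node: a name plus children (hypothetical stand-in for a logical plan). */
    static class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }

        Node withNewChildren(List<Node> newChildren) {
            return new Node(name, newChildren.toArray(new Node[0]));
        }

        /** Applies the rule to this node first, then recursively to its children. */
        Node transformDown(UnaryOperator<Node> rule) {
            Node after = rule.apply(this);
            List<Node> mapped = new ArrayList<>();
            for (Node child : after.children) {
                mapped.add(child.transformDown(rule));
            }
            return after.withNewChildren(mapped);
        }
    }

    /** Marker that keeps the join in a field: the traversal sees no children. */
    static final class OpaqueOrderedJoin extends Node {
        final Node join;

        OpaqueOrderedJoin(Node join) {
            super("OrderedJoin");
            this.join = join;
        }

        @Override
        Node withNewChildren(List<Node> newChildren) {
            return this; // nothing to rebuild, the wrapped join is invisible
        }
    }

    /** Marker shaped like a join: the join inputs are ordinary children. */
    static final class ChildShapedOrderedJoin extends Node {
        ChildShapedOrderedJoin(Node left, Node right) {
            super("OrderedJoin", left, right);
        }

        @Override
        Node withNewChildren(List<Node> c) {
            return new ChildShapedOrderedJoin(c.get(0), c.get(1));
        }
    }

    /** Records the names of the nodes that a top-down transform reaches. */
    static List<String> visited(Node root) {
        List<String> names = new ArrayList<>();
        root.transformDown(n -> {
            names.add(n.name);
            return n;
        });
        return names;
    }

    public static void main(String[] args) {
        Node inner = new Node("Join", new Node("t1"), new Node("t2"));
        System.out.println(visited(new OpaqueOrderedJoin(inner)));
        System.out.println(visited(new ChildShapedOrderedJoin(new Node("t1"), new Node("t2"))));
    }
}
```

With the opaque marker the rule only ever sees the marker itself; with the child-shaped marker it also reaches `t1` and `t2`, which is the property the patch relies on.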
Jacek Laskowski	33ea908af9	[TEST][MINOR] Replace repartitionBy with distribute in CollapseRepartitionSuite ## What changes were proposed in this pull request? Replace non-existent `repartitionBy` with `distribute` in `CollapseRepartitionSuite`. ## How was this patch tested? local build and `catalyst/testOnly *CollapseRepartitionSuite` Author: Jacek Laskowski <jacek@japila.pl> Closes #17657 from jaceklaskowski/CollapseRepartitionSuite.	2017-04-17 17:58:10 -07:00
Jakob Odersky	e5fee3e4f8	[SPARK-17647][SQL] Fix backslash escaping in 'LIKE' patterns. ## What changes were proposed in this pull request? This patch fixes a bug in the way LIKE patterns are translated to Java regexes. The bug causes any character following an escaped backslash to be escaped, i.e. there is double-escaping. A concrete example is the following pattern: `'%\\%'`. The expected Java regex that this pattern should correspond to (according to the behavior described below) is `'.*\\.*'`, however the current situation leads to `'.*\\%'` instead. --- Update: in light of the discussion that ensued, we should explicitly define the expected behaviour of LIKE expressions, especially in certain edge cases. With the help of gatorsmile, we put together a list of different RDBMS and their variations with respect to certain standard features. \| RDBMS\Features \| Wildcards \| Default escape [1] \| Case sensitivity \| \| --- \| --- \| --- \| --- \| \| [MS SQL Server](https://msdn.microsoft.com/en-us/library/ms179859.aspx) \| _, %, [], [^] \| none \| no \| \| [Oracle](https://docs.oracle.com/cd/B12037_01/server.101/b10759/conditions016.htm) \| _, % \| none \| yes \| \| [DB2 z/OS](http://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_likepredicate.html) \| _, % \| none \| yes \| \| [MySQL](http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html) \| _, % \| none \| no \| \| [PostgreSQL](https://www.postgresql.org/docs/9.0/static/functions-matching.html) \| _, % \| \ \| yes \| \| [Hive](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) \| _, % \| none \| yes \| \| Current Spark \| _, % \| \ \| yes \| [1] Default escape character: most systems do not have a default escape character, instead the user can specify one by calling a like expression with an escape argument [A] LIKE [B] ESCAPE [C]. This syntax is currently not supported by Spark, however I would volunteer to implement this feature in a separate ticket. 
The specifications are often quite terse and certain scenarios are undocumented, so here is a list of scenarios that I am uncertain about and would appreciate any input. Specifically I am looking for feedback on whether or not Spark's current behavior should be changed. 1. [x] Ending a pattern with the escape sequence, e.g. `like 'a\'`. PostgreSQL gives an error: 'LIKE pattern must not end with escape character', which I personally find logical. Currently, Spark allows "non-terminated" escapes and simply ignores them as part of the pattern. According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), ending a pattern in an escape character is invalid. _Proposed new behaviour in Spark: throw AnalysisException_ 2. [x] Empty input, e.g. `'' like ''`. PostgreSQL and DB2 will match empty input only if the pattern is empty as well; any other combination of empty input will not match. Spark currently follows this rule. 3. [x] Escape before a non-special character, e.g. `'a' like '\a'`. Escaping a non-wildcard character is not really documented but PostgreSQL just treats it verbatim, which I also find the least surprising behavior. Spark does the same. According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), it is invalid to follow an escape character with anything other than an escape character, an underscore or a percent sign. _Proposed new behaviour in Spark: throw AnalysisException_ The current specification is also described in the operator's source code in this patch. ## How was this patch tested? Extra case in regex unit tests. Author: Jakob Odersky <jakob@odersky.com> This patch had conflicts when merged, resolved by Committer: Reynold Xin <rxin@databricks.com> Closes #15398 from jodersky/SPARK-17647.	2017-04-17 11:17:57 -07:00
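The LIKE rules proposed in SPARK-17647 can be sketched as a small translator. This is a minimal illustration under stated assumptions, not the code from the patch: the class and method names are hypothetical, `IllegalArgumentException` stands in for Spark's `AnalysisException`, and the `(?s)` prefix (so that wildcards also match newlines) is a choice made for this sketch.

```java
import java.util.regex.Pattern;

final class LikePatternSketch {
    /** Translates a LIKE pattern into a Java regex, using '\' as the escape character. */
    static String toJavaRegex(String like) {
        StringBuilder out = new StringBuilder("(?s)"); // DOTALL, an assumption of this sketch
        int i = 0;
        while (i < like.length()) {
            char c = like.charAt(i);
            if (c == '\\') {
                // Scenario 1: a pattern must not end with the escape character.
                if (i + 1 == like.length()) {
                    throw new IllegalArgumentException(
                        "LIKE pattern must not end with the escape character");
                }
                char next = like.charAt(i + 1);
                // Scenario 3: only '_', '%' and '\' may follow the escape character.
                if (next != '_' && next != '%' && next != '\\') {
                    throw new IllegalArgumentException("invalid escape sequence: \\" + next);
                }
                // The escaped character is matched literally; both characters are consumed,
                // so whatever follows an escaped backslash is treated normally.
                out.append(Pattern.quote(String.valueOf(next)));
                i += 2;
            } else {
                if (c == '_') {
                    out.append('.');   // exactly one character
                } else if (c == '%') {
                    out.append(".*");  // any run of characters, possibly empty
                } else {
                    out.append(Pattern.quote(String.valueOf(c)));
                }
                i += 1;
            }
        }
        return out.toString();
    }

    /** True if the whole input matches the LIKE pattern. */
    static boolean like(String input, String pattern) {
        return Pattern.matches(toJavaRegex(pattern), input);
    }

    public static void main(String[] args) {
        System.out.println(like("a\\b", "%\\\\%")); // pattern %\\% : a backslash anywhere
        System.out.println(like("100%", "100\\%")); // pattern 100\% : literal percent sign
        System.out.println(like("", ""));           // scenario 2: empty matches empty
    }
}
```

Each of the three calls in `main` prints `true`. Consuming the escape character and the escaped character as a pair is what avoids the double-escaping described in the commit message.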
wangzhenhua	fb036c4413	[SPARK-20318][SQL] Use Catalyst type for min/max in ColumnStat for ease of estimation ## What changes were proposed in this pull request? Currently when estimating predicates like col > literal or col = literal, we will update min or max in column stats based on literal value. However, literal value is of Catalyst type (internal type), while min/max is of external type. Then for the next predicate, we again need to do type conversion to compare and update column stats. This is awkward and causes many unnecessary conversions in estimation. To solve this, we use Catalyst type for min/max in `ColumnStat`. Note that the persistent format in metastore is still of external type, so there's no inconsistency for statistics in metastore. This PR also fixes a bug for boolean type in `IN` condition. ## How was this patch tested? The changes for ColumnStat are covered by existing tests. For the bug fix, a new test for boolean type in IN condition is added. Author: wangzhenhua <wangzhenhua@huawei.com> Closes #17630 from wzhfy/refactorColumnStat.	2017-04-14 19:16:47 +08:00
Ioana Delaney	fbe4216e1e	[SPARK-20233][SQL] Apply star-join filter heuristics to dynamic programming join enumeration ## What changes were proposed in this pull request? Implements star-join filter to reduce the search space for dynamic programming join enumeration. Consider the following join graph: ``` T1 D1 - T2 - T3 \ / F1 \| D2 star-join: {F1, D1, D2} non-star: {T1, T2, T3} ``` The following join combinations will be generated: ``` level 0: (F1), (D1), (D2), (T1), (T2), (T3) level 1: {F1, D1}, {F1, D2}, {T2, T3} level 2: {F1, D1, D2} level 3: {F1, D1, D2, T1}, {F1, D1, D2, T2} level 4: {F1, D1, D2, T1, T2}, {F1, D1, D2, T2, T3} level 5: {F1, D1, D2, T1, T2, T3} ``` ## How was this patch tested? New test suite ```StarJoinCostBasedReorderSuite.scala```. Author: Ioana Delaney <ioanamdelaney@gmail.com> Closes #17546 from ioana-delaney/starSchemaCBOv3.	2017-04-13 22:27:04 +08:00
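The level-by-level listing in SPARK-20233 can be reproduced with a small enumeration sketch. Several things here are assumptions and not taken from the patch: the edge list is read off the flattened join graph (T1-F1, D1-F1, F1-D2, D1-T2, T2-T3), the filter rule (keep a candidate only if it lies entirely inside the star-join, entirely inside the non-star set, or contains the whole star-join) is one rule that yields the listed combinations, and all class and method names are hypothetical. Costs are ignored; only the search space is enumerated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class StarJoinFilterSketch {
    // Join graph edges, assumed from the diagram in the commit message.
    static final String[][] EDGES = {
        {"T1", "F1"}, {"D1", "F1"}, {"F1", "D2"}, {"D1", "T2"}, {"T2", "T3"}
    };
    static final Set<String> STAR = new TreeSet<>(Arrays.asList("F1", "D1", "D2"));
    static final Set<String> NON_STAR = new TreeSet<>(Arrays.asList("T1", "T2", "T3"));

    /** True if some join edge links a table in a with a table in b. */
    static boolean connected(Set<String> a, Set<String> b) {
        for (String[] e : EDGES) {
            if ((a.contains(e[0]) && b.contains(e[1])) || (a.contains(e[1]) && b.contains(e[0]))) {
                return true;
            }
        }
        return false;
    }

    /** Assumed filter: never mix star and non-star tables before the star-join is complete. */
    static boolean passesStarFilter(Set<String> joined) {
        return STAR.containsAll(joined) || NON_STAR.containsAll(joined) || joined.containsAll(STAR);
    }

    /** Level n holds the surviving table sets of size n + 1. */
    static List<Set<Set<String>>> enumerate() {
        List<Set<Set<String>>> levels = new ArrayList<>();
        Set<Set<String>> level0 = new LinkedHashSet<>();
        for (String t : new String[] {"F1", "D1", "D2", "T1", "T2", "T3"}) {
            level0.add(new TreeSet<>(Collections.singleton(t)));
        }
        levels.add(level0);
        for (int lev = 1; lev < 6; lev++) {
            Set<Set<String>> next = new LinkedHashSet<>();
            // Combine level k with level (lev - 1 - k) so that sizes add up to lev + 1.
            for (int k = 0; k <= lev - 1 - k; k++) {
                for (Set<String> a : levels.get(k)) {
                    for (Set<String> b : levels.get(lev - 1 - k)) {
                        if (!Collections.disjoint(a, b) || !connected(a, b)) {
                            continue;
                        }
                        Set<String> joined = new TreeSet<>(a);
                        joined.addAll(b);
                        if (passesStarFilter(joined)) {
                            next.add(joined); // duplicates collapse via set equality
                        }
                    }
                }
            }
            levels.add(next);
        }
        return levels;
    }

    public static void main(String[] args) {
        List<Set<Set<String>>> levels = enumerate();
        for (int i = 0; i < levels.size(); i++) {
            System.out.println("level " + i + ": " + levels.get(i));
        }
    }
}
```

Under these assumptions the levels contain 6, 3, 1, 2, 2 and 1 sets respectively, matching the combinations listed in the commit message (element order inside each printed set is alphabetical because of `TreeSet`).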
Xiao Li	504e62e2f4	[SPARK-20303][SQL] Rename createTempFunction to registerFunction ### What changes were proposed in this pull request? Session catalog API `createTempFunction` is being used by Hive built-in functions, persistent functions, and temporary functions. Thus, the name is confusing. This PR renames it to `registerFunction`. Also we can move construction of `FunctionBuilder` and `ExpressionInfo` into the new `registerFunction`, instead of duplicating the logic everywhere. In the next PRs, the remaining Function-related APIs also need cleanups. ### How was this patch tested? Existing test cases. Author: Xiao Li <gatorsmile@gmail.com> Closes #17615 from gatorsmile/cleanupCreateTempFunction.	2017-04-12 09:01:26 -07:00

1509 commits