ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Marcelo Vanzin	1474ed05fb	[SPARK-29562][SQL] Speed up and slim down metric aggregation in SQL listener First, a bit of background on the code being changed. The current code tracks metric updates for each task, recording which metrics the task is monitoring and the last update value. Once a SQL execution finishes, then the metrics for all the stages are aggregated, by building a list with all (metric ID, value) pairs collected for all tasks in the stages related to the execution, then grouping by metric ID, and then calculating the values shown in the UI. That is full of inefficiencies: - in normal operation, all tasks will be tracking and updating the same metrics. So recording the metric IDs per task is wasteful. - tracking by task means we might be double-counting values if you have speculative tasks (as a comment in the code mentions). - creating a list of (metric ID, value) is extremely inefficient, because now you have a huge map in memory storing boxed versions of the metric IDs and values. - same thing for the aggregation part, where now a Seq is built with the values for each metric ID. The end result is that for large queries, this code can become both really slow, thus affecting the processing of events, and memory hungry. The updated code changes the approach to the following: - stages track metrics by their ID; this means the stage tracking code naturally groups values, making aggregation later simpler. - each metric ID being tracked uses a long array matching the number of partitions of the stage; this means that it's cheap to update the value of the metric once a task ends. - when aggregating, custom code just concatenates the arrays corresponding to the matching metric IDs; this is cheaper than the previous, boxing-heavy approach. The end result is that the listener uses about half as much memory as before for tracking metrics, since it doesn't need to track metric IDs per task. 
I captured heap dumps with the old and the new code during metric aggregation in the listener, for an execution with 3 stages, 100k tasks per stage, 50 metrics updated per task. The dumps contained just reachable memory - so data kept by the listener plus the variables in the aggregateMetrics() method. With the old code, the thread doing aggregation references >1G of memory - and that does not include temporary data created by the "groupBy" transformation (for which the intermediate state is not referenced in the aggregation method). The same thread with the new code references ~250M of memory. The old code uses about ~250M to track all the metric values for that execution, while the new code uses about ~130M. (Note the per-thread numbers include the amount used to track the metrics - so, e.g., in the old case, aggregation was referencing about ~750M of temporary data.) I'm also including a small benchmark (based on the Benchmark class) so that we can measure how much changes to this code affect performance. The benchmark contains some extra code to measure things the normal Benchmark class does not, given that the code under test does not really map that well to the expectations of that class. Running with the old code (I removed results that don't make much sense for this benchmark): ``` [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Linux 4.15.0-66-generic [info] Intel(R) Core(TM) i7-6820HQ CPU 2.70GHz [info] metrics aggregation (50 metrics, 100k tasks per stage): Best Time(ms) Avg Time(ms) [info] -------------------------------------------------------------------------------------- [info] 1 stage(s) 2113 2118 [info] 2 stage(s) 4172 4392 [info] 3 stage(s) 7755 8460 [info] [info] Stage Count Stage Proc. Time Aggreg. 
Time [info] 1 614 1187 [info] 2 620 2480 [info] 3 718 5069 ``` With the new code: ``` [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Linux 4.15.0-66-generic [info] Intel(R) Core(TM) i7-6820HQ CPU 2.70GHz [info] metrics aggregation (50 metrics, 100k tasks per stage): Best Time(ms) Avg Time(ms) [info] -------------------------------------------------------------------------------------- [info] 1 stage(s) 727 886 [info] 2 stage(s) 1722 1983 [info] 3 stage(s) 2752 3013 [info] [info] Stage Count Stage Proc. Time Aggreg. Time [info] 1 408 177 [info] 2 389 423 [info] 3 372 660 ``` So the new code is faster than the old when processing task events, and about an order of magnitude faster when aggregating metrics. Note this still leaves room for improvement; for example, using the above measurements, 600ms is still a huge amount of time to spend in an event handler. But I'll leave further enhancements for a separate change. Tested with benchmarking code + existing unit tests. Closes #26218 from vanzin/SPARK-29562. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-24 22:18:10 -07:00
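The per-stage tracking described in SPARK-29562 can be illustrated with a small sketch. This is not the listener's actual code: the class and function names are hypothetical, and a plain sum stands in for whatever statistics the UI computes. It only shows the idea of one fixed-size long array per metric ID, indexed by partition, with aggregation done by concatenating arrays.

```python
from array import array
from collections import defaultdict


class StageMetrics:
    """Hypothetical per-stage tracker: one long array per metric ID."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.values = {}  # metric ID -> array of signed 64-bit ints

    def on_task_end(self, partition, updates):
        # Writing into the partition's slot means a speculative or retried
        # task overwrites the earlier value instead of being counted twice.
        for metric_id, value in updates.items():
            slots = self.values.get(metric_id)
            if slots is None:
                slots = array("q", [0]) * self.num_partitions
                self.values[metric_id] = slots
            slots[partition] = value


def aggregate(stages):
    """Concatenate the arrays of matching metric IDs, then reduce them."""
    merged = defaultdict(lambda: array("q"))
    for stage in stages:
        for metric_id, slots in stage.values.items():
            merged[metric_id].extend(slots)
    # Assumption: a sum is used here purely for illustration.
    return {metric_id: sum(slots) for metric_id, slots in merged.items()}
```

Because values are stored unboxed and already grouped by metric ID, no intermediate list of (metric ID, value) pairs is needed.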
wenxuanguan	40df9d246e	[SPARK-29227][SS] Track rule info in optimization phase ### What changes were proposed in this pull request? Track timing info for each rule in optimization phase using `QueryPlanningTracker` in Structured Streaming ### Why are the changes needed? In Structured Streaming we only track rule info in analysis phase, not in optimization phase. ### Does this PR introduce any user-facing change? No Closes #25914 from wenxuanguan/spark-29227. Authored-by: wenxuanguan <choose_home@126.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-25 10:02:54 +09:00
Terry Kim	dec99d8ac5	[SPARK-29526][SQL] UNCACHE TABLE should look up catalog/table like v2 commands ### What changes were proposed in this pull request? Add UncacheTableStatement and make UNCACHE TABLE go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. ``` USE my_catalog DESC t // success and describe the table t from my_catalog UNCACHE TABLE t // report table not found as there is no table t in the session catalog ``` ### Does this PR introduce any user-facing change? yes. When running UNCACHE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog. ### How was this patch tested? New unit tests Closes #26237 from imback82/uncache_table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-24 14:51:23 -07:00
fuwhu	92b25295ca	[SPARK-21287][SQL] Remove requirement of fetch_size>=0 from JDBCOptions ### What changes were proposed in this pull request? Remove the requirement of fetch_size>=0 from JDBCOptions to allow negative fetch size. ### Why are the changes needed? Namely, to allow data fetch in stream manner (row-by-row fetch) against MySQL database. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test (JDBCSuite) This closes #26230 . Closes #26244 from fuwhu/SPARK-21287-FIX. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-24 12:35:32 -07:00
stczwd	dcf5eaf1a6	[SPARK-29444][FOLLOWUP] add doc and python parameter for ignoreNullFields in json generating ### What changes were proposed in this pull request? Add a description of ignoreNullFields, which was committed in #26098, to DataFrameWriter and readwriter.py. Enable users to use ignoreNullFields in PySpark. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Run unit tests. Closes #26227 from stczwd/json-generator-doc. Authored-by: stczwd <qcsd2011@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-24 10:25:04 -07:00
Wenchen Fan	cdea520ff8	[SPARK-29532][SQL] Simplify interval string parsing ### What changes were proposed in this pull request? Only use antlr4 to parse the interval string, and remove the duplicated parsing logic from `CalendarInterval`. ### Why are the changes needed? Simplify the code and fix inconsistent behaviors. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the Jenkins with the updated test cases. Closes #26190 from cloud-fan/parser. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-24 09:15:59 -07:00
angerszhu	67cf0433ee	[SPARK-29145][SQL] Support sub-queries in join conditions ### What changes were proposed in this pull request? Support using IN/EXISTS with a subquery in a JOIN condition in Spark SQL. ### Why are the changes needed? So that SQL queries can use IN/EXISTS with a subquery in a JOIN condition. ### Does this PR introduce any user-facing change? This PR enables users to use a subquery in a `JOIN`'s ON condition. For example, given three tables: ``` CREATE TABLE A(id String); CREATE TABLE B(id String); CREATE TABLE C(id String); ``` we can run a query like: ``` SELECT A.id from A JOIN B ON A.id = B.id and A.id IN (select C.id from C) ``` ### How was this patch tested? Added UT. Closes #25854 from AngersZhuuuu/SPARK-29145. Lead-authored-by: angerszhu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-10-24 21:55:03 +09:00
Yuanjian Li	9e77d48315	[SPARK-21492][SQL][FOLLOW UP] Reimplement UnsafeExternalRowSorter in database style iterator ### What changes were proposed in this pull request? Reimplement the iterator in UnsafeExternalRowSorter in database style. This can be done by reusing the `RowIterator` in our code base. ### Why are the changes needed? During the work in #26164, after introducing the var `isReleased` in `hasNext`, it is possible that `isReleased` is false when `hasNext` is called but becomes true before `next` is called. A safer way is to use a database-style iterator: `advanceNext` and `getRow`. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #26229 from xuanyuanking/SPARK-21492-follow-up. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-24 15:43:13 +08:00
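The difference between the two iterator styles can be sketched as follows. This is a simplified, single-threaded illustration with hypothetical names, not the `RowIterator` class from the Spark code base: the point is that `advance_next` checks the released flag and moves the cursor in one call, so there is no window between a `hasNext`-style check and the fetch.

```python
class RowCursor:
    """Hypothetical database-style cursor over an in-memory row source."""

    def __init__(self, rows):
        self._rows = iter(rows)
        self._current = None
        self._released = False

    def release(self):
        # Simulates the resources being freed by another part of the plan.
        self._released = True
        self._current = None

    def advance_next(self):
        # The release check and the cursor move happen together.
        if self._released:
            return False
        try:
            self._current = next(self._rows)
            return True
        except StopIteration:
            self._current = None
            return False

    def get_row(self):
        # Only meaningful after advance_next() returned True.
        return self._current
```

A caller loops with `while cursor.advance_next(): use(cursor.get_row())`, so a release that happens between iterations simply ends the loop.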
Liang-Chi Hsieh	177bf672e4	[SPARK-29522][SQL] CACHE TABLE should look up catalog/table like v2 commands ### What changes were proposed in this pull request? Add CacheTableStatement and make CACHE TABLE go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. ``` USE my_catalog DESC t // success and describe the table t from my_catalog CACHE TABLE t // report table not found as there is no table t in the session catalog ``` ### Does this PR introduce any user-facing change? yes. When running CACHE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog. ### How was this patch tested? Unit tests. Closes #26179 from viirya/SPARK-29522. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-24 15:00:21 +08:00
07ARB	55ced9c148	[SPARK-29571][SQL][TESTS][FOLLOWUP] Fix UT in AllExecutionsPageSuite ### What changes were proposed in this pull request? This is a follow-up of #24052 to correct the assert condition. ### Why are the changes needed? To test the IllegalArgumentException condition. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manual test (this issue was found while fixing SPARK-29453). Closes #26234 from 07ARB/SPARK-29571. Authored-by: 07ARB <ankitrajboudh@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-24 15:57:16 +09:00
Dongjoon Hyun	b91356e4c2	[SPARK-29533][SQL][TESTS][FOLLOWUP] Regenerate the result on EC2 ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/26189 to regenerate the result on EC2. ### Why are the changes needed? This will be used for the other PR reviews. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A. Closes #26233 from dongjoon-hyun/SPARK-29533. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-10-23 21:41:05 +00:00
jiake	7e8e4c0a14	[SPARK-29552][SQL] Execute the "OptimizeLocalShuffleReader" rule when creating new query stage and then can optimize the shuffle reader to local shuffle reader as much as possible ### What changes were proposed in this pull request? `OptimizeLocalShuffleReader` rule is very conservative and gives up optimization as long as there are extra shuffles introduced. It's very likely that most of the added local shuffle readers are fine and only one introduces extra shuffle. However, it's very hard to make `OptimizeLocalShuffleReader` optimal, a simple workaround is to run this rule again right before executing a query stage. ### Why are the changes needed? Optimize more shuffle reader to local shuffle reader. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing ut Closes #26207 from JkSelf/resolve-multi-joins-issue. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-24 01:18:07 +08:00
Jungtaek Lim (HeartSaVioR)	bfbf2821f3	[SPARK-29503][SQL] Remove conversion CreateNamedStruct to CreateNamedStructUnsafe ### What changes were proposed in this pull request? There's a case where MapObjects has a lambda function that creates a nested struct: unsafe data inside a safe data struct. In this case, MapObjects doesn't copy the row returned from the lambda function (as the outermost data type is a safe data struct), so it misses copying the nested unsafe data. The culprit is that `UnsafeProjection.toUnsafeExprs` converts `CreateNamedStruct` to `CreateNamedStructUnsafe` (this is the only place where `CreateNamedStructUnsafe` is used), which causes safe and unsafe data to be mixed temporarily. That mixing may not be needed at all, at least logically, as these evaluations are finally assembled into an `UnsafeRow`. > Before the patch ``` private ArrayData MapObjects_0(InternalRow i) { boolean isNull_1 = i.isNullAt(0); ArrayData value_1 = isNull_1 ? null : (i.getArray(0)); ArrayData value_0 = null; if (!isNull_1) { int dataLength_0 = value_1.numElements(); ArrayData[] convertedArray_0 = null; convertedArray_0 = new ArrayData[dataLength_0]; int loopIndex_0 = 0; while (loopIndex_0 < dataLength_0) { value_MapObject_lambda_variable_1 = (int) (value_1.getInt(loopIndex_0)); isNull_MapObject_lambda_variable_1 = value_1.isNullAt(loopIndex_0); ArrayData arrayData_0 = ArrayData.allocateArrayData( -1, 1L, " createArray failed."); mutableStateArray_0[0].reset(); mutableStateArray_0[0].zeroOutNullBytes(); if (isNull_MapObject_lambda_variable_1) { mutableStateArray_0[0].setNullAt(0); } else { mutableStateArray_0[0].write(0, value_MapObject_lambda_variable_1); } arrayData_0.update(0, (mutableStateArray_0[0].getRow())); if (false) { convertedArray_0[loopIndex_0] = null; } else { convertedArray_0[loopIndex_0] = arrayData_0 instanceof UnsafeArrayData? arrayData_0.copy() : arrayData_0; } loopIndex_0 += 1; } value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(convertedArray_0); } globalIsNull_0 = isNull_1; return value_0; } ``` > After the patch ``` private ArrayData MapObjects_0(InternalRow i) { boolean isNull_1 = i.isNullAt(0); ArrayData value_1 = isNull_1 ? null : (i.getArray(0)); ArrayData value_0 = null; if (!isNull_1) { int dataLength_0 = value_1.numElements(); ArrayData[] convertedArray_0 = null; convertedArray_0 = new ArrayData[dataLength_0]; int loopIndex_0 = 0; while (loopIndex_0 < dataLength_0) { value_MapObject_lambda_variable_1 = (int) (value_1.getInt(loopIndex_0)); isNull_MapObject_lambda_variable_1 = value_1.isNullAt(loopIndex_0); ArrayData arrayData_0 = ArrayData.allocateArrayData( -1, 1L, " createArray failed."); Object[] values_0 = new Object[1]; if (isNull_MapObject_lambda_variable_1) { values_0[0] = null; } else { values_0[0] = value_MapObject_lambda_variable_1; } final InternalRow value_3 = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(values_0); values_0 = null; arrayData_0.update(0, value_3); if (false) { convertedArray_0[loopIndex_0] = null; } else { convertedArray_0[loopIndex_0] = arrayData_0 instanceof UnsafeArrayData? arrayData_0.copy() : arrayData_0; } loopIndex_0 += 1; } value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(convertedArray_0); } globalIsNull_0 = isNull_1; return value_0; } ``` ### Why are the changes needed? This patch fixes the bug described above. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? UT added which fails on master branch and passes on PR. Closes #26173 from HeartSaVioR/SPARK-29503. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-24 00:41:48 +08:00
Terry Kim	53a5f17803	[SPARK-29513][SQL] REFRESH TABLE should look up catalog/table like v2 commands ### What changes were proposed in this pull request? Add RefreshTableStatement and make REFRESH TABLE go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. ``` USE my_catalog DESC t // success and describe the table t from my_catalog REFRESH TABLE t // report table not found as there is no table t in the session catalog ``` ### Does this PR introduce any user-facing change? yes. When running REFRESH TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog. ### How was this patch tested? New unit tests Closes #26183 from imback82/refresh_table. Lead-authored-by: Terry Kim <yuminkim@gmail.com> Co-authored-by: Terry Kim <terryk@terrys-mbp-2.lan> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2019-10-23 08:26:47 -07:00
Burak Yavuz	cbe6eadc0c	[SPARK-29352][SQL][SS] Track active streaming queries in the SparkSession.sharedState ### What changes were proposed in this pull request? This moves the tracking of active queries from a per SparkSession state, to the shared SparkSession for better safety in isolated Spark Session environments. ### Why are the changes needed? We have checks to prevent the restarting of the same stream on the same spark session, but we can actually make that better in multi-tenant environments by actually putting that state in the SharedState instead of SessionState. This would allow a more comprehensive check for multi-tenant clusters. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added tests to StreamingQueryManagerSuite Closes #26018 from brkyvz/sharedStreamingQueryManager. Lead-authored-by: Burak Yavuz <burak@databricks.com> Co-authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2019-10-23 10:56:19 +02:00
Terry Kim	c128ac564d	[SPARK-29511][SQL] DataSourceV2: Support CREATE NAMESPACE ### What changes were proposed in this pull request? This PR adds `CREATE NAMESPACE` support for V2 catalogs. ### Why are the changes needed? Currently, you cannot explicitly create namespaces for v2 catalogs. ### Does this PR introduce any user-facing change? The user can now perform the following: ```SQL CREATE NAMESPACE mycatalog.ns ``` to create a namespace `ns` inside `mycatalog` V2 catalog. ### How was this patch tested? Added unit tests. Closes #26166 from imback82/create_namespace. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-23 12:17:20 +08:00
DylanGuedes	e6749092f7	[SPARK-29107][SQL][TESTS] Port window.sql (Part 1) ### What changes were proposed in this pull request? This PR ports window.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/window.sql from lines 1~319 The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/window.out ### Why are the changes needed? To ensure compatibility with PostgreSQL. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the Jenkins. And, Comparison with PgSQL results. Closes #26119 from DylanGuedes/spark-29107. Authored-by: DylanGuedes <djmgguedes@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-23 10:24:38 +09:00
Huaxin Gao	3bf5355e24	[SPARK-29539][SQL] SHOW PARTITIONS should look up catalog/table like v2 commands ### What changes were proposed in this pull request? Add ShowPartitionsStatement and make SHOW PARTITIONS go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. ### Does this PR introduce any user-facing change? Yes. When running SHOW PARTITIONS, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog. ### How was this patch tested? Unit tests. Closes #26198 from huaxingao/spark-29539. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2019-10-22 14:47:17 -07:00
Liang-Chi Hsieh	b4844eea1f	[SPARK-29517][SQL] TRUNCATE TABLE should look up catalog/table like v2 commands ### What changes were proposed in this pull request? Add TruncateTableStatement and make TRUNCATE TABLE go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. ``` USE my_catalog DESC t // success and describe the table t from my_catalog TRUNCATE TABLE t // report table not found as there is no table t in the session catalog ``` ### Does this PR introduce any user-facing change? yes. When running TRUNCATE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog. ### How was this patch tested? Unit tests. Closes #26174 from viirya/SPARK-29517. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-22 19:17:28 +08:00
Yuanjian Li	bb49c80c89	[SPARK-21492][SQL] Fix memory leak in SortMergeJoin ### What changes were proposed in this pull request? We need a mechanism by which downstream operators can notify their parents that they may release the output data stream. In this PR, we implement the mechanism as follows: - Add a function named `cleanupResources` to SparkPlan, which by default calls the children's `cleanupResources` function. An operator that needs resource cleanup should override it with its own cleanup and also call `super.cleanupResources`, like SortExec in this PR. - Add support on the trigger side, which in this PR is SortMergeJoinExec; it makes sure to call `cleanupResources` to do the cleanup for all of its upstream (children) operators. ### Why are the changes needed? Bugfix for the SortMergeJoin memory leak, and implement a general framework for SparkPlan resource cleanup. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? UT: Add new test suite JoinWithResourceCleanSuite to check both the standard and code generation scenarios. Integration test: tested with driver/executor default memory set to 1g, local mode with 10 threads. The test below (thanks taosaildrone for providing this test [here](https://github.com/apache/spark/pull/23762#issuecomment-463303175)) will pass with this PR. ``` from pyspark.sql.functions import rand, col spark.conf.set("spark.sql.join.preferSortMergeJoin", "true") spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) # spark.conf.set("spark.sql.sortMergeJoinExec.eagerCleanupResources", "true") r1 = spark.range(1, 1001).select(col("id").alias("timestamp1")) r1 = r1.withColumn('value', rand()) r2 = spark.range(1000, 1001).select(col("id").alias("timestamp2")) r2 = r2.withColumn('value2', rand()) joined = r1.join(r2, r1.timestamp1 == r2.timestamp2, "inner") joined = joined.coalesce(1) joined.explain() joined.show() ``` Closes #26164 from xuanyuanking/SPARK-21492. 
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-22 19:08:09 +08:00
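The cleanup mechanism from SPARK-21492 can be sketched in a few lines. The classes below are hypothetical stand-ins for `SparkPlan`, `SortExec`, and `SortMergeJoinExec`, not the real operators: the base class forwards the cleanup call to its children, an operator holding resources overrides it and still calls the parent implementation, and the join triggers the chain once it is finished.

```python
class Plan:
    """Hypothetical plan node; the default cleanup forwards to the children."""

    def __init__(self, *children):
        self.children = children

    def cleanup_resources(self):
        for child in self.children:
            child.cleanup_resources()


class Sort(Plan):
    """Holds a buffer that must be released explicitly."""

    def __init__(self, *children):
        super().__init__(*children)
        self.buffer = [3, 1, 2]  # stands in for the sorter's memory

    def cleanup_resources(self):
        self.buffer = None           # the operator's own cleanup
        super().cleanup_resources()  # then propagate to the children


class SortMergeJoin(Plan):
    """Trigger side: asks all upstream operators to release resources."""

    def finish(self):
        self.cleanup_resources()
```

Calling `finish()` on a join over two `Sort` nodes releases both buffers without the join knowing what each child holds.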
angerszhu	484f93e255	[SPARK-29530][SQL] Make SQLConf in SQL parse process thread safe ### What changes were proposed in this pull request? As I commented in [SPARK-29516](https://github.com/apache/spark/pull/26172#issuecomment-544364977), the parse process of the SparkSession.sql() method does not run under the current SparkSession's conf, so some parser-related configurations do not take effect in multi-threaded situations. In this PR, we add a SQLConf parameter to AbstractSqlParser and initialize it with the SessionState's conf. Then each SparkSession's parse process uses its own SessionState's SQLConf and is thread safe. ### Why are the changes needed? Fix bug ### Does this PR introduce any user-facing change? NO ### How was this patch tested? NO Closes #26187 from AngersZhuuuu/SPARK-29530. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-22 10:38:06 +08:00
wuyi	3d567a357c	[MINOR][SQL] Avoid unnecessary invocation on checkAndGlobPathIfNecessary ### What changes were proposed in this pull request? Only invoke `checkAndGlobPathIfNecessary()` when we have to use `InMemoryFileIndex`. ### Why are the changes needed? Avoid unnecessary function invocation. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass Jenkins. Closes #26196 from Ngone51/dev-avoid-unnecessary-invocation-on-globpath. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-21 21:10:21 -05:00
DylanGuedes	bb4400c23a	[SPARK-29108][SQL][TESTS] Port window.sql (Part 2) ### What changes were proposed in this pull request? This PR ports window.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/window.sql from lines 320~562 The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/window.out ### Why are the changes needed? To ensure compatibility with PostgreSQL. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the Jenkins. And, comparison with PgSQL results. Closes #26121 from DylanGuedes/spark-29108. Authored-by: DylanGuedes <djmgguedes@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-22 10:49:40 +09:00
Maxim Gekk	eef11ba9ef	[SPARK-29518][SQL][TEST] Benchmark `date_part` for `INTERVAL` ### What changes were proposed in this pull request? I extended `ExtractBenchmark` to support the `INTERVAL` type of the `source` parameter of the `date_part` function. ### Why are the changes needed? - To detect performance issues while changing implementation of the `date_part` function in the future. - To find out current performance bottlenecks in `date_part` for the `INTERVAL` type ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the benchmark and print out produced values per each `field` value. Closes #26175 from MaxGekk/extract-interval-benchmark. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-22 10:47:54 +09:00
Maxim Gekk	6ffec5e6a6	[SPARK-29533][SQL][TEST] Benchmark casting strings to intervals ### What changes were proposed in this pull request? Added new benchmark `IntervalBenchmark` to measure performance of interval related functions. In the PR, I added benchmarks for casting strings to interval. In particular, interval strings with `interval` prefix and without it because there is special code for this `da576a737c/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java (L100-L103)` . And also I added benchmarks for different number of units in interval strings, for example 1 unit is `interval 10 years`, 2 units w/o interval is `10 years 5 months`, and etc. ### Why are the changes needed? - To find out current performance issues in casting to intervals - The benchmark can be used while refactoring/re-implementing `CalendarInterval.fromString()` or `CalendarInterval.fromCaseInsensitiveString()`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the benchmark via the command: ```shell SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.IntervalBenchmark" ``` Closes #26189 from MaxGekk/interval-from-string-benchmark. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-22 10:47:04 +09:00
Kent Yao	5b4d9170ed	[SPARK-27879][SQL] Add support for bit_and and bit_or aggregates ### What changes were proposed in this pull request? ``` bit_and(expression) -- The bitwise AND of all non-null input values, or null if none bit_or(expression) -- The bitwise OR of all non-null input values, or null if none ``` More details: https://www.postgresql.org/docs/9.3/functions-aggregate.html ### Why are the changes needed? PostgreSQL, MySQL, and many other popular databases support them. ### Does this PR introduce any user-facing change? Adds two bitwise aggregate functions. ### How was this patch tested? Added UT. Closes #26155 from yaooqinn/SPARK-27879. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-21 14:32:31 +08:00
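The semantics quoted for `bit_and` and `bit_or` can be illustrated outside Spark. The sketch below is plain Python with `None` standing in for SQL NULL; it mirrors the stated rule (fold over the non-null inputs, return null when there are none) and is not the aggregate implementation added by the PR.

```python
from functools import reduce
from operator import and_, or_


def bit_and(values):
    """Bitwise AND of all non-null values, or None if there are none."""
    non_null = [v for v in values if v is not None]
    return reduce(and_, non_null) if non_null else None


def bit_or(values):
    """Bitwise OR of all non-null values, or None if there are none."""
    non_null = [v for v in values if v is not None]
    return reduce(or_, non_null) if non_null else None
```

For the inputs 3 (`0b011`), 5 (`0b101`), and a null, `bit_and` yields 1 and `bit_or` yields 7; the null is skipped rather than poisoning the result.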
Yuming Wang	0f65b49f55	[SPARK-29525][SQL][TEST] Fix the associated location already exists in SQLQueryTestSuite ### What changes were proposed in this pull request? This PR fixes the "associated location already exists" error in `SQLQueryTestSuite`: ``` build/sbt "~sql/test-only SQLQueryTestSuite -- -z postgreSQL/join.sql" ... [info] - postgreSQL/join.sql *** FAILED *** (35 seconds, 420 milliseconds) [info] postgreSQL/join.sql [info] Expected "[]", but got "[org.apache.spark.sql.AnalysisException [info] Can not create the managed table('`default`.`tt3`'). The associated location('file:/root/spark/sql/core/spark-warehouse/org.apache.spark.sql.SQLQueryTestSuite/tt3') already exists.;]" Result did not match for query #108 ``` ### Why are the changes needed? Fix bug. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #26181 from wangyum/TestError. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-20 13:31:59 -07:00
Terry Kim	ab92e1715e	[SPARK-29512][SQL] REPAIR TABLE should look up catalog/table like v2 commands ### What changes were proposed in this pull request? Add RepairTableStatement and make REPAIR TABLE go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. ``` USE my_catalog DESC t // success and describe the table t from my_catalog MSCK REPAIR TABLE t // report table not found as there is no table t in the session catalog ``` ### Does this PR introduce any user-facing change? yes. When running MSCK REPAIR TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog. ### How was this patch tested? New unit tests Closes #26168 from imback82/repair_table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2019-10-18 22:43:58 -07:00
angerszhu	9a3dccae72	[SPARK-29379][SQL] SHOW FUNCTIONS show '!=', '<>' , 'between', 'case' ### What changes were proposed in this pull request? Currently, Spark SQL's `SHOW FUNCTIONS` doesn't show `!=`, `<>`, `between`, or `case`, but these expressions are truly functions. We should show them in `SHOW FUNCTIONS`. ### Why are the changes needed? SHOW FUNCTIONS show '!=', '<>' , 'between', 'case' ### Does this PR introduce any user-facing change? SHOW FUNCTIONS show '!=', '<>' , 'between', 'case' ### How was this patch tested? UT Closes #26053 from AngersZhuuuu/SPARK-29379. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-19 00:19:56 +08:00
Maxim Gekk	77fe8a8e7c	[SPARK-28420][SQL] Support the `INTERVAL` type in `date_part()` ### What changes were proposed in this pull request? The `date_part()` function can accept the `source` parameter of the `INTERVAL` type (`CalendarIntervalType`). The following values of the `field` parameter are supported: - `"MILLENNIUM"` (`"MILLENNIA"`, `"MIL"`, `"MILS"`) - number of millenniums in the given interval. It is `YEAR / 1000`. - `"CENTURY"` (`"CENTURIES"`, `"C"`, `"CENT"`) - number of centuries in the interval calculated as `YEAR / 100`. - `"DECADE"` (`"DECADES"`, `"DEC"`, `"DECS"`) - decades in the `YEAR` part of the interval calculated as `YEAR / 10`. - `"YEAR"` (`"Y"`, `"YEARS"`, `"YR"`, `"YRS"`) - years in a values of `CalendarIntervalType`. It is `MONTHS / 12`. - `"QUARTER"` (`"QTR"`) - a quarter of year calculated as `MONTHS / 3 + 1` - `"MONTH"` (`"MON"`, `"MONS"`, `"MONTHS"`) - the months part of the interval calculated as `CalendarInterval.months % 12` - `"DAY"` (`"D"`, `"DAYS"`) - total number of days in `CalendarInterval.microseconds` - `"HOUR"` (`"H"`, `"HOURS"`, `"HR"`, `"HRS"`) - the hour part of the interval. - `"MINUTE"` (`"M"`, `"MIN"`, `"MINS"`, `"MINUTES"`) - the minute part of the interval. - `"SECOND"` (`"S"`, `"SEC"`, `"SECONDS"`, `"SECS"`) - the seconds part with fractional microsecond part. - `"MILLISECONDS"` (`"MSEC"`, `"MSECS"`, `"MILLISECON"`, `"MSECONDS"`, `"MS"`) - the millisecond part of the interval with fractional microsecond part. - `"MICROSECONDS"` (`"USEC"`, `"USECS"`, `"USECONDS"`, `"MICROSECON"`, `"US"`) - the total number of microseconds in the `second`, `millisecond` and `microsecond` parts of the given interval. - `"EPOCH"` - the total number of seconds in the interval including the fractional part with microsecond precision. Here we assume 365.25 days per year (leap year every four years). 
For example: ```sql > SELECT date_part('days', interval 1 year 10 months 5 days); 5 > SELECT date_part('seconds', interval 30 seconds 1 milliseconds 1 microseconds); 30.001001 ``` ### Why are the changes needed? To maintain feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT) ### Does this PR introduce any user-facing change? No ### How was this patch tested? - Added new test suite `IntervalExpressionsSuite` - Add new test cases to `date_part.sql` Closes #25981 from MaxGekk/extract-from-intervals. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-18 23:54:59 +08:00
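The year-based field formulas listed in the SPARK-28420 entry above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Spark's implementation: the function name `interval_fields` is hypothetical, the interval is reduced to a non-negative total month count, and `QUARTER` is assumed to be computed from the month part (`months % 12`).

```python
def interval_fields(months: int) -> dict:
    """Derive the year-based fields from a non-negative total month count."""
    year = months // 12          # YEAR is MONTHS / 12
    month = months % 12          # MONTH is CalendarInterval.months % 12
    return {
        "MILLENNIUM": year // 1000,
        "CENTURY": year // 100,
        "DECADE": year // 10,
        "YEAR": year,
        "QUARTER": month // 3 + 1,  # assumption: based on the month part
        "MONTH": month,
    }


# interval 1 year 10 months is 22 months in total
fields = interval_fields(22)
print(fields)
```

Negative intervals are left out on purpose, because Python's floor division rounds differently from JVM integer division.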
jiake	c3a0d02a40	[SPARK-28560][SQL][FOLLOWUP] resolve the remaining comments for PR#25295 ### What changes were proposed in this pull request? A followup of [#25295](https://github.com/apache/spark/pull/25295). 1) change the logWarning to logDebug in `OptimizeLocalShuffleReader`. 2) update the test to check whether query stage reuse can work well with local shuffle reader. ### Why are the changes needed? make code robust ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing tests Closes #26157 from JkSelf/followup-25295. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-18 23:16:58 +08:00
Terry Kim	39af51dbc6	[SPARK-29014][SQL] DataSourceV2: Fix current/default catalog usage ### What changes were proposed in this pull request? The handling of the catalog across plans should be as follows ([SPARK-29014](https://issues.apache.org/jira/browse/SPARK-29014)): * The current catalog should be used when no catalog is specified * The default catalog is the catalog that the current catalog is initialized to * If the default catalog is not set, then the current catalog is the built-in Spark session catalog. This PR addresses the issue where current catalog usage does not follow the rules described above. ### Why are the changes needed? It is a bug as described in the previous section. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit tests added. Closes #26120 from imback82/cleanup_catalog. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-18 22:45:42 +08:00
Wenchen Fan	74351468de	[SPARK-29482][SQL] ANALYZE TABLE should look up catalog/table like v2 commands ### What changes were proposed in this pull request? Add `AnalyzeTableStatement` and `AnalyzeColumnStatement`, and make ANALYZE TABLE go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. ``` USE my_catalog DESC t // success and describe the table t from my_catalog ANALYZE TABLE t // report table not found as there is no table t in the session catalog ``` ### Does this PR introduce any user-facing change? yes. When running ANALYZE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog. ### How was this patch tested? new tests Closes #26129 from cloud-fan/analyze-table. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2019-10-18 12:55:49 +02:00
Dilip Biswal	ec5d698d99	[SPARK-29092][SQL] Report additional information about DataSourceScanExec in EXPLAIN FORMATTED # What changes were proposed in this pull request? Currently we report only output attributes of a scan while doing EXPLAIN FORMATTED. This PR implements the ```verboseStringWithOperatorId``` in DataSourceScanExec to report additional information about a scan such as pushed down filters, partition filters, location etc. SQL ``` EXPLAIN FORMATTED SELECT key, max(val) FROM explain_temp1 WHERE key > 0 GROUP BY key ORDER BY key ``` Before ``` == Physical Plan == * Sort (9) +- Exchange (8) +- * HashAggregate (7) +- Exchange (6) +- * HashAggregate (5) +- * Project (4) +- * Filter (3) +- * ColumnarToRow (2) +- Scan parquet default.explain_temp1 (1) (1) Scan parquet default.explain_temp1 Output: [key#x, val#x] .... .... .... ``` After ``` == Physical Plan == * Sort (9) +- Exchange (8) +- * HashAggregate (7) +- Exchange (6) +- * HashAggregate (5) +- * Project (4) +- * Filter (3) +- * ColumnarToRow (2) +- Scan parquet default.explain_temp1 (1) (1) Scan parquet default.explain_temp1 Output: [key#x, val#x] Batched: true DataFilters: [isnotnull(key#x), (key#x > 0)] Format: Parquet Location: InMemoryFileIndex[file:/tmp/apache/spark/spark-warehouse/explain_temp1] PushedFilters: [IsNotNull(key), GreaterThan(key,0)] ReadSchema: struct<key:int,val:int> ... ... ... ``` ### Why are the changes needed? ### Does this PR introduce any user-facing change? ### How was this patch tested? Closes #26042 from dilipbiswal/verbose_string_datasrc_scanexec. Authored-by: Dilip Biswal <dkbiswal@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-18 15:53:13 +08:00
Jiajia Li	dc0bc7a6eb	[MINOR][DOCS] Fix some typos ### What changes were proposed in this pull request? This PR fixes a few typos: 1. Sparks => Spark's 2. parallize => parallelize 3. doesnt => doesn't Closes #26140 from plusplusjiajia/fix-typos. Authored-by: Jiajia Li <jiajia.li@intel.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-17 07:22:01 -07:00
Kent Yao	4b902d3b45	[SPARK-29491][SQL] Add bit_count function support ### What changes were proposed in this pull request? BIT_COUNT(N) - Returns the number of bits that are set in the argument N as an unsigned 64-bit integer, or NULL if the argument is NULL ### Why are the changes needed? Supported by MySQL, Microsoft SQL Server, etc. ### Does this PR introduce any user-facing change? add a built-in function ### How was this patch tested? add UTs Closes #26139 from yaooqinn/SPARK-29491. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-17 20:22:38 +08:00
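To make the BIT_COUNT semantics described in the SPARK-29491 entry above concrete, here is a minimal Python sketch. It assumes the argument is viewed as a 64-bit two's-complement value and that NULL maps to `None`; the helper name `bit_count` mirrors the SQL function but is otherwise hypothetical and is not Spark code.

```python
from typing import Optional


def bit_count(n: Optional[int]) -> Optional[int]:
    """Count the set bits of n, viewed as an unsigned 64-bit integer."""
    if n is None:                       # NULL in, NULL out
        return None
    masked = n & 0xFFFFFFFFFFFFFFFF     # reinterpret as unsigned 64-bit
    return bin(masked).count("1")


print(bit_count(7), bit_count(0), bit_count(None))
```

The mask is what makes negative inputs well defined: under this assumption, `-1` has all 64 bits set.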
Yuanjian Li	239ee3f561	[SPARK-9853][CORE] Optimize shuffle fetch of continuous partition IDs This PR takes over #19788. After we split the shuffle fetch protocol from `OpenBlock` in #24565, this optimization can be extended in the new shuffle protocol. Credit to yucai, closes #19788. ### What changes were proposed in this pull request? This PR adds the support for continuous shuffle block fetching in batch: - Shuffle client changes: - Add new feature tag `spark.shuffle.fetchContinuousBlocksInBatch`, implement the decision logic in `BlockStoreShuffleReader`. - Merge the continuous shuffle block ids in batch if needed in ShuffleBlockFetcherIterator. - Shuffle server changes: - Add support in `ExternalBlockHandler` for the external shuffle service side. - Make `ShuffleBlockResolver.getBlockData` accept getting block data by range. - Protocol changes: - Add new block id type `ShuffleBlockBatchId` represent continuous shuffle block ids. - Extend `FetchShuffleBlocks` and `OneForOneBlockFetcher`. - After the new shuffle fetch protocol completed in #24565, the backward compatibility for external shuffle service can be controlled by `spark.shuffle.useOldFetchProtocol`. ### Why are the changes needed? In adaptive execution, one reducer may fetch multiple continuous shuffle blocks from one map output file. However, as the original approach, each reducer needs to fetch those 10 reducer blocks one by one. This way needs many IO and impacts performance. This PR is to support fetching those continuous shuffle blocks in one IO (batch way). See below example: The shuffle block is stored like below: ![image](https://user-images.githubusercontent.com/2989575/51654634-c37fbd80-1fd3-11e9-935e-5652863676c3.png) The ShuffleId format is s"shuffle_$shuffleId_$mapId_$reduceId", referring to BlockId.scala. In adaptive execution, one reducer may want to read output for reducer 5 to 14, whose block Ids are from shuffle_0_x_5 to shuffle_0_x_14. 
Before this PR, Spark needs 10 disk IOs + 10 network IOs for each output file. After this PR, Spark only needs 1 disk IO and 1 network IO. This way can reduce IO dramatically. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Add new UT. Integrate test with setting `spark.sql.adaptive.enabled=true`. Closes #26040 from xuanyuanking/SPARK-9853. Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Co-authored-by: yucai <yyu1@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-17 14:47:56 +08:00
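The batching idea in the SPARK-9853 entry above (one request for reducers 5 to 14 instead of ten) comes down to collapsing runs of consecutive reduce IDs into ranges. The sketch below shows only that grouping step, under the assumption that all IDs belong to the same shuffle and map output; `merge_contiguous` is a hypothetical helper, not the logic in `ShuffleBlockFetcherIterator`.

```python
def merge_contiguous(reduce_ids):
    """Collapse consecutive reduce IDs into inclusive (start, end) ranges."""
    ranges = []
    for rid in sorted(reduce_ids):
        if ranges and rid == ranges[-1][1] + 1:
            # Extends the current run: grow the last range by one.
            ranges[-1] = (ranges[-1][0], rid)
        else:
            # Gap (or first element): start a new range.
            ranges.append((rid, rid))
    return ranges


# Blocks shuffle_0_x_5 .. shuffle_0_x_14 become a single range.
print(merge_contiguous(range(5, 15)))
```

Each resulting range would correspond to one batched fetch; non-adjacent IDs still produce separate ranges.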
Kent Yao	6d4cc7b855	[SPARK-27880][SQL] Add bool_and for every and bool_or for any as function aliases ### What changes were proposed in this pull request? bool_or(x) <=> any/some(x) <=> max(x) bool_and(x) <=> every(x) <=> min(x) Args: x: boolean ### Why are the changes needed? PostgreSQL, Presto and Vertica, etc also support this feature: ### Does this PR introduce any user-facing change? add new functions support ### How was this patch tested? add ut Closes #26126 from yaooqinn/SPARK-27880. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-16 22:43:47 +08:00
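The equivalences stated in the SPARK-27880 entry above (`bool_and` behaves like `min`, `bool_or` like `max`) hold for any ordering where false sorts before true. A small Python check, assuming a non-empty column with no NULLs:

```python
values = [True, False, True]

# With False < True, the minimum is True only if every value is True,
# and the maximum is True as soon as any value is True.
bool_and = min(values)
bool_or = max(values)

print(bool_and, bool_or)
```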
Maxim Gekk	d11cbf2e36	[SPARK-29364][SQL] Return an interval from date subtract according to SQL standard ### What changes were proposed in this pull request? Proposed new expression `SubtractDates` which is used in `date1` - `date2`. It has the `INTERVAL` type, and returns the interval between `date1` (inclusive) and `date2` (exclusive). For example: ```sql > select date'tomorrow' - date'yesterday'; interval 2 days ``` Closes #26034 ### Why are the changes needed? - To conform to the SQL standard, which states that the result type of `date operand 1` - `date operand 2` must be the interval type. See [4.5.3 Operations involving datetimes and intervals](http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt). - Improve Spark SQL UX and allow mixing date and timestamp in subtractions. For example: `select timestamp'now' + (date'2019-10-01' - date'2019-09-15')` ### Does this PR introduce any user-facing change? Before, the query below returned the number of days: ```sql spark-sql> select date'2019-10-05' - date'2018-09-01'; 399 ``` After, it returns an interval: ```sql spark-sql> select date'2019-10-05' - date'2018-09-01'; interval 1 years 1 months 4 days ``` ### How was this patch tested? - by new tests in `DateExpressionsSuite` and `TypeCoercionSuite`. - by existing tests in `date.sql` Closes #26112 from MaxGekk/date-subtract. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-10-16 06:26:01 -07:00
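The before/after results quoted in the SPARK-29364 entry above can be reproduced with plain calendar arithmetic. The sketch below is an illustration under stated assumptions, not the `SubtractDates` implementation: `year_month_day_diff` is a hypothetical helper, it expects `end >= start`, and it assumes the start day of month exists in the anchor month (for example, days 1 to 28).

```python
from datetime import date


def year_month_day_diff(end: date, start: date):
    """Split end - start into (years, months, days); assumes end >= start."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1                      # last month is not complete yet
    # Shift start forward by the whole months, then count leftover days.
    y, m = divmod(start.year * 12 + (start.month - 1) + months, 12)
    anchor = date(y, m + 1, start.day)   # assumes this day exists in the month
    return months // 12, months % 12, (end - anchor).days


d1, d2 = date(2019, 10, 5), date(2018, 9, 1)
print((d1 - d2).days)                 # old behavior: a plain day count
print(year_month_day_diff(d1, d2))    # new behavior: years, months, days
```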
Yuming Wang	e00344edc1	[SPARK-29423][SS] lazily initialize StreamingQueryManager in SessionState ### What changes were proposed in this pull request? This PR makes `SessionState` lazily initialize `StreamingQueryManager` to avoid constructing `StreamingQueryManager` for each session when connecting to ThriftServer. ### Why are the changes needed? Reduce memory usage. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? manual test 1. Start thriftserver: ``` build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver export SPARK_PREPEND_CLASSES=true sbin/start-thriftserver.sh ``` 2. Open a session: ``` bin/beeline -u jdbc:hive2://localhost:10000 ``` 3. Check `StreamingQueryManager` instance: ``` jcmd \| grep HiveThriftServer2 \| awk -F ' ' '{print $1}' \| xargs jmap -histo \| grep StreamingQueryManager ``` Before this PR: ``` [rootspark-3267648 spark]# jcmd \| grep HiveThriftServer2 \| awk -F ' ' '{print $1}' \| xargs jmap -histo \| grep StreamingQueryManager 1954: 2 96 org.apache.spark.sql.streaming.StreamingQueryManager ``` After this PR: ``` [rootspark-3267648 spark]# jcmd \| grep HiveThriftServer2 \| awk -F ' ' '{print $1}' \| xargs jmap -histo \| grep StreamingQueryManager [rootspark-3267648 spark]# ``` Closes #26089 from wangyum/SPARK-29423. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-15 21:08:15 -07:00
Wenchen Fan	51f10ed90f	[SPARK-28560][SQL][FOLLOWUP] code cleanup for local shuffle reader ### What changes were proposed in this pull request? A followup of https://github.com/apache/spark/pull/25295 This PR proposes a few code cleanups: 1. rename the special `getMapSizesByExecutorId` to `getMapSizesByMapIndex` 2. rename the parameter `mapId` to `mapIndex` as that's really a mapper index. 3. `BlockStoreShuffleReader` should take `blocksByAddress` directly instead of a map id. 4. rename `getMapReader` to `getReaderForOneMapper` to be clearer. ### Why are the changes needed? make code easier to understand ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26128 from cloud-fan/followup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-16 11:19:16 +08:00
Jeff Evans	95de93b24e	[SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters Moving univocity-parsers version to spark-parent pom dependencyManagement section Adding new utility method to build multi-char delimiter string, which delegates to existing one Adding tests for multiple character delimited CSV ### What changes were proposed in this pull request? Adds support for parsing CSV data using multiple-character delimiters. Existing logic for converting the input delimiter string to characters was kept and invoked in a loop. Project dependencies were updated to remove redundant declaration of `univocity-parsers` version, and also to change that version to the latest. ### Why are the changes needed? It is quite common for people to have delimited data, where the delimiter is not a single character, but rather a sequence of characters. Currently, it is difficult to handle such data in Spark (typically needs pre-processing). ### Does this PR introduce any user-facing change? Yes. Specifying the "delimiter" option for the DataFrame read, and providing more than one character, will no longer result in an exception. Instead, it will be converted as before and passed to the underlying library (Univocity), which has accepted multiple character delimiters since 2.8.0. ### How was this patch tested? The `CSVSuite` tests were confirmed passing (including new methods), and `sbt` tests for `sql` were executed. Closes #26027 from jeff303/SPARK-24540. Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-15 15:44:51 -05:00
Gengliang Wang	322ec0ba9b	[SPARK-28885][SQL] Follow ANSI store assignment rules in table insertion by default ### What changes were proposed in this pull request? When inserting a value into a column with a different data type, Spark performs type coercion. Currently, we support 3 policies for the store assignment rules: ANSI, legacy and strict, which can be set via the option "spark.sql.storeAssignmentPolicy": 1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the behavior is mostly the same as PostgreSQL. It disallows certain unreasonable type conversions such as converting `string` to `int` and `double` to `boolean`. It will throw a runtime exception if the value is out of range (overflow). 2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`, which is very loose. E.g., converting either `string` to `int` or `double` to `boolean` is allowed. It is the current behavior in Spark 2.x for compatibility with Hive. When inserting an out-of-range value into an integral field, the low-order bits of the value are inserted (the same as Java/Scala numeric type casting). For example, if 257 is inserted into a field of Byte type, the result is 1. 3. Strict: Spark doesn't allow any possible precision loss or data truncation in store assignment, e.g., converting either `double` to `int` or `decimal` to `double` is not allowed. The rules are originally for Dataset encoder. As far as I know, no mainstream DBMS is using this policy by default. Currently, the V1 data source uses "Legacy" policy by default, while V2 uses "Strict". This proposal is to use "ANSI" policy by default for both V1 and V2 in Spark 3.0. ### Why are the changes needed? Following the ANSI SQL standard is most reasonable among the 3 policies. ### Does this PR introduce any user-facing change? Yes. The default store assignment policy is ANSI for both V1 and V2 data sources. ### How was this patch tested? Unit test Closes #26107 from gengliangwang/ansiPolicyAsDefault. 
Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-15 10:41:37 -07:00
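The Legacy policy example in the SPARK-28885 entry above (257 stored into a Byte field yields 1) follows from keeping only the low-order 8 bits, as JVM narrowing casts do. A minimal Python sketch of that wrap-around, with the hypothetical helper name `to_signed_byte`:

```python
def to_signed_byte(value: int) -> int:
    """Keep the low-order 8 bits and reinterpret them as a signed byte."""
    low = value & 0xFF                       # discard everything above bit 7
    return low - 256 if low >= 128 else low  # map 128..255 to -128..-1


print(to_signed_byte(257))
```

Under the ANSI policy described in the same entry, an out-of-range value like this would raise a runtime exception rather than being wrapped silently.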
jiake	9ac4b2dbc5	[SPARK-28560][SQL] Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution ## What changes were proposed in this pull request? Implement a rule in the new adaptive execution framework introduced in [SPARK-23128](https://issues.apache.org/jira/browse/SPARK-23128). This rule is used to optimize the shuffle reader to local shuffle reader when smj is converted to bhj in adaptive execution. ## How was this patch tested? Existing tests Closes #25295 from JkSelf/localShuffleOptimization. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-15 21:51:15 +08:00
Wenchen Fan	8915966bf4	[SPARK-29473][SQL] move statement logical plans to a new file ### What changes were proposed in this pull request? move the statement logical plans that were created for v2 commands to a new file `statements.scala`, under the same package of `v2Commands.scala`. This PR also includes some minor cleanups: 1. remove `private[sql]` from `ParsedStatement` as it's in the private package. 2. remove unnecessary override of `output` and `children`. 3. add missing classdoc. ### Why are the changes needed? Similar to https://github.com/apache/spark/pull/26111 , this is to better organize the logical plans of data source v2. It's a bit weird to put the statements in the package `org.apache.spark.sql.catalyst.plans.logical.sql` as `sql` is not a good sub-package name in Spark SQL. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26125 from cloud-fan/statement. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-15 15:05:49 +02:00
yangjie01	a988aaf3fa	[SPARK-29454][SQL] Reduce unsafeProjection times when read Parquet file use non-vectorized mode ### What changes were proposed in this pull request? There are two `unsafeProjection` conversions when reading a Parquet data file in non-vectorized mode: 1. `ParquetGroupConverter` calls the unsafeProjection function to convert `SpecificInternalRow` to `UnsafeRow` every time a Parquet data file is read using `ParquetRecordReader`. 2. `ParquetFileFormat` calls the unsafeProjection function to convert this `UnsafeRow` to another `UnsafeRow` again when partitionSchema is not empty in the DataSourceV1 branch, and `PartitionReaderWithPartitionValues` always does this conversion in the DataSourceV2 branch. This PR removes the `unsafeProjection` conversion in `ParquetGroupConverter` and changes `ParquetRecordReader` to produce `SpecificInternalRow` instead of `UnsafeRow`. ### Why are the changes needed? The first conversion in `ParquetGroupConverter` is redundant; it is enough for `ParquetRecordReader` to return an `InternalRow` (`SpecificInternalRow`). ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit Test Closes #26106 from LuciferYang/spark-parquet-unsafe-projection. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-15 12:42:42 +08:00
Wenchen Fan	9407fba037	[SPARK-29412][SQL] refine the document of v2 session catalog config ### What changes were proposed in this pull request? Refine the document of v2 session catalog config, to clearly explain what it is, when it should be used and how to implement it. ### Why are the changes needed? Make this config more understandable ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the Jenkins with the newly updated test cases. Closes #26071 from cloud-fan/config. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-15 10:18:58 +08:00
Dongjoon Hyun	ff9fcd501c	Revert "[SPARK-29107][SQL][TESTS] Port window.sql (Part 1)" This reverts commit `81915dacc4`.	2019-10-14 15:15:32 -07:00
Dongjoon Hyun	e696c36e32	[SPARK-29442][SQL] Set `default` mode should override the existing mode ### What changes were proposed in this pull request? This PR aims to fix the behavior of `mode("default")` to set `SaveMode.ErrorIfExists`. Also, this PR updates the exception message by adding `default` explicitly. ### Why are the changes needed? This is reported during `GRAPH API` PR. This builder pattern should work like the documentation. ### Does this PR introduce any user-facing change? Yes if the app has multiple `mode()` invocation including `mode("default")` and the `mode("default")` is the last invocation. This is really a corner case. - Previously, the last invocation was handled as `No-Op`. - After this bug fix, it will work like the documentation. ### How was this patch tested? Pass the Jenkins with the newly added test case. Closes #26094 from dongjoon-hyun/SPARK-29442. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-14 13:11:05 -07:00
DylanGuedes	81915dacc4	[SPARK-29107][SQL][TESTS] Port window.sql (Part 1) ### What changes were proposed in this pull request? This PR ports window.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/window.sql from lines 1~319 The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/window.out ## How was this patch tested? Pass the Jenkins. ### Why are the changes needed? To ensure compatibility with PGSQL ### Does this PR introduce any user-facing change? No ### How was this patch tested? Comparison with PgSQL results. Closes #25816 from DylanGuedes/spark-29107. Authored-by: DylanGuedes <djmgguedes@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-14 10:17:16 -07:00
Maxim Gekk	da576a737c	[SPARK-29369][SQL] Support string intervals without the `interval` prefix ### What changes were proposed in this pull request? In the PR, I propose to move interval parsing to `CalendarInterval.fromCaseInsensitiveString()` which throws an `IllegalArgumentException` for invalid strings, and reuse it from `CalendarInterval.fromString()`. The latter handles `IllegalArgumentException` only and returns `NULL` for invalid interval strings. This allows supporting interval strings without the `interval` prefix in casting strings to intervals and in the interval type constructor because they use `fromString()` for parsing string intervals. For example: ```sql spark-sql> select cast('1 year 10 days' as interval); interval 1 years 1 weeks 3 days spark-sql> SELECT INTERVAL '1 YEAR 10 DAYS'; interval 1 years 1 weeks 3 days ``` ### Why are the changes needed? To maintain feature parity with PostgreSQL which supports interval strings without prefix: ```sql # select interval '2 months 1 microsecond'; interval ------------------------ 2 mons 00:00:00.000001 ``` and to improve Spark SQL UX. ### Does this PR introduce any user-facing change? Yes, previously parsing of interval strings without `interval` gave `NULL`: ```sql spark-sql> select interval '2 months 1 microsecond'; NULL ``` After: ```sql spark-sql> select interval '2 months 1 microsecond'; interval 2 months 1 microseconds ``` ### How was this patch tested? - Added new tests to `CalendarIntervalSuite.java` - A test for casting strings to intervals in `CastSuite` - Test for interval type constructor from strings in `ExpressionParserSuite` Closes #26079 from MaxGekk/interval-str-without-prefix. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-14 23:34:18 +08:00
Terry Kim	ef6dce29b2	[SPARK-29279][SQL] Merge SHOW NAMESPACES and SHOW DATABASES code path ### What changes were proposed in this pull request? Currently, `SHOW NAMESPACES` and `SHOW DATABASES` are separate code paths. This PR merges two implementations. ### Why are the changes needed? To remove code/behavior duplication ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added new unit tests. Closes #26006 from imback82/combine_show. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-14 22:35:26 +08:00
Peter Toth	9e12c94c15	[SPARK-29359][SQL][TESTS] Better exception handling in (SQL\|ThriftServer)QueryTestSuite ### What changes were proposed in this pull request? This PR adds 2 changes regarding exception handling in `SQLQueryTestSuite` and `ThriftServerQueryTestSuite` - fixes an expected output sorting issue in `ThriftServerQueryTestSuite` as if there is an exception then there is no need for sort - introduces common exception handling in those 2 suites with a new `handleExceptions` method ### Why are the changes needed? Currently `ThriftServerQueryTestSuite` passes on master, but it fails on one of my PRs (https://github.com/apache/spark/pull/23531) with this error (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111651/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/sql_3/): ``` org.scalatest.exceptions.TestFailedException: Expected " [Recursion level limit 100 reached but query has not exhausted, try increasing spark.sql.cte.recursion.level.limit org.apache.spark.SparkException] ", but got " [org.apache.spark.SparkException Recursion level limit 100 reached but query has not exhausted, try increasing spark.sql.cte.recursion.level.limit] " Result did not match for query #4 WITH RECURSIVE r(level) AS ( VALUES (0) UNION ALL SELECT level + 1 FROM r ) SELECT * FROM r ``` The unexpected reversed order of expected output (error message comes first, then the exception class) is due to this line: https://github.com/apache/spark/pull/26028/files#diff-b3ea3021602a88056e52bf83d8782de8L146. It should not sort the expected output if there was an error during execution. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UTs. Closes #26028 from peter-toth/SPARK-29359-better-exception-handling. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-10-12 22:17:37 -07:00
Maxim Gekk	d193248205	[SPARK-29368][SQL][TEST] Port interval.sql ### What changes were proposed in this pull request? This PR is to port interval.sql from PostgreSQL regression tests: https://raw.githubusercontent.com/postgres/postgres/REL_12_STABLE/src/test/regress/sql/interval.sql The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/interval.out When porting the test cases, found PostgreSQL specific features below that do not exist in Spark SQL: - [SPARK-29369](https://issues.apache.org/jira/browse/SPARK-29369): Accept strings without `interval` prefix in casting to intervals - [SPARK-29370](https://issues.apache.org/jira/browse/SPARK-29370): Interval strings without explicit unit markings - [SPARK-29371](https://issues.apache.org/jira/browse/SPARK-29371): Support interval field values with fractional parts - [SPARK-29382](https://issues.apache.org/jira/browse/SPARK-29382): Support the `INTERVAL` type by Parquet datasource - [SPARK-29383](https://issues.apache.org/jira/browse/SPARK-29383): Support the optional prefix `` in interval strings - [SPARK-29384](https://issues.apache.org/jira/browse/SPARK-29384): Support `ago` in interval strings - [SPARK-29385](https://issues.apache.org/jira/browse/SPARK-29385): Make `INTERVAL` values comparable - [SPARK-29386](https://issues.apache.org/jira/browse/SPARK-29386): Copy data between a file and a table - [SPARK-29387](https://issues.apache.org/jira/browse/SPARK-29387): Support `*` and `\` operators for intervals - [SPARK-29388](https://issues.apache.org/jira/browse/SPARK-29388): Construct intervals from the `millenniums`, `centuries` or `decades` units - [SPARK-29389](https://issues.apache.org/jira/browse/SPARK-29389): Support synonyms for interval units - [SPARK-29390](https://issues.apache.org/jira/browse/SPARK-29390): Add the justify_days(), justify_hours() and justify_interval() functions - 
[SPARK-29391](https://issues.apache.org/jira/browse/SPARK-29391): Default year-month units - [SPARK-29393](https://issues.apache.org/jira/browse/SPARK-29393): Add the make_interval() function - [SPARK-29394](https://issues.apache.org/jira/browse/SPARK-29394): Support ISO 8601 format for intervals - [SPARK-29395](https://issues.apache.org/jira/browse/SPARK-29395): Precision of the interval type - [SPARK-29406](https://issues.apache.org/jira/browse/SPARK-29406): Interval output styles - [SPARK-29407](https://issues.apache.org/jira/browse/SPARK-29407): Support syntax for zero interval - [SPARK-29408](https://issues.apache.org/jira/browse/SPARK-29408): Support interval literal with negative sign `-` ### Why are the changes needed? To improve the test coverage, see https://issues.apache.org/jira/browse/SPARK-27763 ### Does this PR introduce any user-facing change? No ### How was this patch tested? By manually comparing Spark results with PostgreSQL Closes #26055 from MaxGekk/port-interval-sql. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-12 17:44:40 -07:00
Maxim Gekk	f302c2ee62	[SPARK-29328][SQL][FOLLOWUP] Revert calculation of mean seconds per month ### What changes were proposed in this pull request? Revert this commit `18b7ad2fc5`. ### Why are the changes needed? See https://github.com/apache/spark/pull/16304#discussion_r92753590 ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? There is no test for that. Closes #26101 from MaxGekk/revert-mean-seconds-per-month. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-12 09:38:08 -05:00
Sean Owen	cc7493fa21	[SPARK-29416][CORE][ML][SQL][MESOS][TESTS] Use .sameElements to compare arrays, instead of .deep (gone in 2.13) ### What changes were proposed in this pull request? Use `.sameElements` to compare (non-nested) arrays, as `Arrays.deep` is removed in 2.13 and wasn't the best way to do this in the first place. ### Why are the changes needed? To compile with 2.13. ### Does this PR introduce any user-facing change? None. ### How was this patch tested? Existing tests. Closes #26073 from srowen/SPARK-29416. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-09 17:00:48 -07:00
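A minimal sketch of the difference the commit above relies on, in plain Scala with no Spark dependency. The object and method names are hypothetical, chosen only for illustration: `==` on arrays compares references, while `.sameElements` compares contents element by element.

```scala
object ArrayCompare {
  // Arrays are JVM arrays, so == falls back to reference equality.
  def byReference(a: Array[Int], b: Array[Int]): Boolean = a == b

  // sameElements walks both arrays and compares them element by element.
  def byContent(a: Array[Int], b: Array[Int]): Boolean = a.sameElements(b)
}
```

Note that `.sameElements` compares each element with `==`, so for nested arrays the inner arrays are again compared by reference. That is consistent with the commit message restricting the change to non-nested arrays.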
Sean Owen	fa95a5c395	[SPARK-29411][CORE][ML][SQL][DSTREAM] Replace use of Unit object with () for Scala 2.13 ### What changes were proposed in this pull request? Replace `Unit` with equivalent `()` where code refers to the `Unit` companion object. ### Why are the changes needed? It doesn't compile otherwise in Scala 2.13. - https://github.com/scala/scala/blob/v2.13.0/src/library/scala/Unit.scala#L30 ### Does this PR introduce any user-facing change? Should be no behavior change at all. ### How was this patch tested? Existing tests. Closes #26070 from srowen/SPARK-29411. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-09 10:24:13 -07:00
herman	ba4d413fc9	[SPARK-29346][SQL] Add Aggregating Accumulator ### What changes were proposed in this pull request? This PR adds an accumulator that computes a global aggregate over a number of rows. A user can define an arbitrary number of aggregate functions which can be computed at the same time. The accumulator uses the standard technique for implementing (interpreted) aggregation in Spark. It uses projections and manual updates for each of the aggregation steps (initialize buffer, update buffer with new input row, merge two buffers and compute the final result on the buffer). Note that two of the steps (update and merge) use the aggregation buffer both as input and output. Accumulators do not have an explicit point at which they get serialized. A somewhat surprising side effect is that the buffers of a `TypedImperativeAggregate` go over the wire as-is instead of serializing them. The merging logic for `TypedImperativeAggregate` assumes that the input buffer contains serialized buffers, this is violated by the accumulator's implicit serialization. In order to get around this I have added `mergeBuffersObjects` method that merges two unserialized buffers to `TypedImperativeAggregate`. ### Why are the changes needed? This is the mechanism we are going to use to implement observable metrics. ### Does this PR introduce any user-facing change? No, not yet. ### How was this patch tested? Added `AggregatingAccumulator` test suite. Closes #26012 from hvanhovell/SPARK-29346. Authored-by: herman <herman@databricks.com> Signed-off-by: herman <herman@databricks.com>	2019-10-09 16:05:14 +02:00
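The four aggregation steps listed in the commit above (initialize, update, merge, compute the final result) can be sketched for a single average aggregate as follows. This is an illustration in plain Scala under assumed names (`AvgBuffer`, `AvgAggregate`); it is not the `AggregatingAccumulator` code and does not use Spark's projections or serialization.

```scala
// Hypothetical aggregation buffer for an average: running sum and row count.
final case class AvgBuffer(sum: Double, count: Long)

object AvgAggregate {
  // Step 1: initialize an empty buffer.
  def initialize: AvgBuffer = AvgBuffer(0.0, 0L)

  // Step 2: update the buffer with a new input value (buffer is input and output).
  def update(buf: AvgBuffer, value: Double): AvgBuffer =
    AvgBuffer(buf.sum + value, buf.count + 1)

  // Step 3: merge two buffers (buffer is again input and output).
  def merge(a: AvgBuffer, b: AvgBuffer): AvgBuffer =
    AvgBuffer(a.sum + b.sum, a.count + b.count)

  // Step 4: compute the final result; an empty buffer has no average.
  def evaluate(buf: AvgBuffer): Option[Double] =
    if (buf.count == 0) None else Some(buf.sum / buf.count)
}
```

In this sketch each partition would fold its rows with `update`, and the partial buffers would then be combined with `merge` before `evaluate` is called once.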
Terry Kim	a927f1aefc	[SPARK-29373][SQL] DataSourceV2: Commands should not submit a spark job ### What changes were proposed in this pull request? DataSourceV2 Exec classes (ShowTablesExec, ShowNamespacesExec, etc.) all extend LeafExecNode. This results in running a job when executeCollect() is called. This breaks the previous behavior [SPARK-19650](https://issues.apache.org/jira/browse/SPARK-19650). A new command physical operator will be introduced from which all V2 Exec classes derive to avoid running a job. ### Why are the changes needed? It is a bug since the current behavior runs a spark job, which breaks the existing behavior: [SPARK-19650](https://issues.apache.org/jira/browse/SPARK-19650). ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing unit tests. Closes #26048 from imback82/dsv2_command. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-09 11:44:25 +08:00
Sean Owen	ee83d09b53	[SPARK-29401][CORE][ML][SQL][GRAPHX][TESTS] Replace calls to .parallelize Arrays of tuples, ambiguous in Scala 2.13, with Seqs of tuples ### What changes were proposed in this pull request? Invocations like `sc.parallelize(Array((1,2)))` cause a compile error in 2.13, like: ``` [ERROR] [Error] /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/ShuffleSuite.scala:47: overloaded method value apply with alternatives: (x: Unit,xs: Unit)Array[Unit] <and> (x: Double,xs: Double)Array[Double] <and> (x: Float,xs: Float)Array[Float] <and> (x: Long,xs: Long)Array[Long] <and> (x: Int,xs: Int)Array[Int] <and> (x: Char,xs: Char)Array[Char] <and> (x: Short,xs: Short)Array[Short] <and> (x: Byte,xs: Byte)Array[Byte] <and> (x: Boolean,xs: Boolean*)Array[Boolean] cannot be applied to ((Int, Int), (Int, Int), (Int, Int), (Int, Int)) ``` Using a `Seq` instead appears to resolve it, and is effectively equivalent. ### Why are the changes needed? To better cross-build for 2.13. ### Does this PR introduce any user-facing change? None. ### How was this patch tested? Existing tests. Closes #26062 from srowen/SPARK-29401. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-08 20:22:02 -07:00
Sean Owen	2d871ad0e7	[SPARK-29392][CORE][SQL][STREAMING] Remove symbol literal syntax 'foo, deprecated in Scala 2.13, in favor of Symbol("foo") ### What changes were proposed in this pull request? Syntax like `'foo` is deprecated in Scala 2.13. Replace usages with `Symbol("foo")` ### Why are the changes needed? Avoids ~50 deprecation warnings when attempting to build with 2.13. ### Does this PR introduce any user-facing change? None, should be no functional change at all. ### How was this patch tested? Existing tests. Closes #26061 from srowen/SPARK-29392. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-08 20:15:37 -07:00
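As a small illustration of the replacement described in the commit above, the sketch below builds a symbol through the `Symbol("foo")` factory instead of the deprecated `'foo` literal. The object name is hypothetical.

```scala
object SymbolLiterals {
  // Symbol("foo") is the non-deprecated way to obtain the symbol formerly written 'foo.
  val viaFactory: Symbol = Symbol("foo")

  // The name of a symbol is the string it was created from.
  def nameOf(s: Symbol): String = s.name
}
```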
Guilherme	de360e96d7	[SPARK-29336][SQL] Fix the implementation of QuantileSummaries.merge (guarantee that the relativeError will be respected) ### What changes were proposed in this pull request? Reimplement `org.apache.spark.sql.catalyst.util.QuantileSummaries#merge` and add a test-case showing the previous bug. ### Why are the changes needed? The original Greenwald-Khanna paper, from which the algorithm behind `approxQuantile` was taken, does not cover how to merge the result of multiple parallel QuantileSummaries. The current implementation violates some invariants and therefore the effective error can be larger than specified. ### Does this PR introduce any user-facing change? Yes, for some cases, the results from `approxQuantile` (`percentile_approx` in SQL) will now be within the expected error margin. For example: ```scala var values = (1 to 100).toArray val all_quantiles = values.indices.map(i => (i+1).toDouble / values.length).toArray for (n <- 0 until 5) { var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5) val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1) val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => Math.abs(expected - answer) }).toArray val max_error = error.max print(max_error + "\n") } ``` In the current build it returns: ``` 16 12 10 11 17 ``` I couldn't run the code with this patch applied to double check the implementation. Can someone please confirm it now outputs at most `10`, please? ### How was this patch tested? A new unit test was added to uncover the previous bug. Closes #26029 from sitegui/SPARK-29336. Authored-by: Guilherme <sitegui@sitegui.com.br> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-08 08:11:10 -05:00
Dilip Biswal	ef1e8495ba	[SPARK-29366][SQL] Subqueries created for DPP are not printed in EXPLAIN FORMATTED ### What changes were proposed in this pull request? The subquery expressions introduced by DPP are not printed in the newer explain command. This PR fixes the code that computes the list of subqueries in the plan. SQL df1 and df2 are partitioned on k. ``` SELECT df1.id, df2.k FROM df1 JOIN df2 ON df1.k = df2.k AND df2.id < 2 ``` Before ``` \|== Physical Plan == * Project (9) +- * BroadcastHashJoin Inner BuildRight (8) :- * ColumnarToRow (2) : +- Scan parquet default.df1 (1) +- BroadcastExchange (7) +- * Project (6) +- * Filter (5) +- * ColumnarToRow (4) +- Scan parquet default.df2 (3) (1) Scan parquet default.df1 Output: [id#19L, k#20L] (2) ColumnarToRow [codegen id : 2] Input: [id#19L, k#20L] (3) Scan parquet default.df2 Output: [id#21L, k#22L] (4) ColumnarToRow [codegen id : 1] Input: [id#21L, k#22L] (5) Filter [codegen id : 1] Input : [id#21L, k#22L] Condition : (isnotnull(id#21L) AND (id#21L < 2)) (6) Project [codegen id : 1] Output : [k#22L] Input : [id#21L, k#22L] (7) BroadcastExchange Input: [k#22L] (8) BroadcastHashJoin [codegen id : 2] Left keys: List(k#20L) Right keys: List(k#22L) Join condition: None (9) Project [codegen id : 2] Output : [id#19L, k#22L] Input : [id#19L, k#20L, k#22L] ``` After ``` \|== Physical Plan == * Project (9) +- * BroadcastHashJoin Inner BuildRight (8) :- * ColumnarToRow (2) : +- Scan parquet default.df1 (1) +- BroadcastExchange (7) +- * Project (6) +- * Filter (5) +- * ColumnarToRow (4) +- Scan parquet default.df2 (3) (1) Scan parquet default.df1 Output: [id#19L, k#20L] (2) ColumnarToRow [codegen id : 2] Input: [id#19L, k#20L] (3) Scan parquet default.df2 Output: [id#21L, k#22L] (4) ColumnarToRow [codegen id : 1] Input: [id#21L, k#22L] (5) Filter [codegen id : 1] Input : [id#21L, k#22L] Condition : (isnotnull(id#21L) AND (id#21L < 2)) (6) Project [codegen id : 1] Output : [k#22L] Input : [id#21L, k#22L] (7) 
BroadcastExchange Input: [k#22L] (8) BroadcastHashJoin [codegen id : 2] Left keys: List(k#20L) Right keys: List(k#22L) Join condition: None (9) Project [codegen id : 2] Output : [id#19L, k#22L] Input : [id#19L, k#20L, k#22L] ===== Subqueries ===== Subquery:1 Hosting operator id = 1 Hosting Expression = k#20L IN subquery25 * HashAggregate (16) +- Exchange (15) +- * HashAggregate (14) +- * Project (13) +- * Filter (12) +- * ColumnarToRow (11) +- Scan parquet default.df2 (10) (10) Scan parquet default.df2 Output: [id#21L, k#22L] (11) ColumnarToRow [codegen id : 1] Input: [id#21L, k#22L] (12) Filter [codegen id : 1] Input : [id#21L, k#22L] Condition : (isnotnull(id#21L) AND (id#21L < 2)) (13) Project [codegen id : 1] Output : [k#22L] Input : [id#21L, k#22L] (14) HashAggregate [codegen id : 1] Input: [k#22L] (15) Exchange Input: [k#22L] (16) HashAggregate [codegen id : 2] Input: [k#22L] ``` ### Why are the changes needed? Without the fix, the subqueries are not printed in the explain plan. ### Does this PR introduce any user-facing change? Yes. the explain output will be different. ### How was this patch tested? Added a test case in ExplainSuite. Closes #26039 from dilipbiswal/explain_subquery_issue. Authored-by: Dilip Biswal <dkbiswal@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-10-07 23:39:05 -07:00
Wenchen Fan	948a6e80fe	[SPARK-28892][SQL][FOLLOWUP] add resolved logical plan for UPDATE TABLE ### What changes were proposed in this pull request? Add back the resolved logical plan for UPDATE TABLE. It was in https://github.com/apache/spark/pull/25626 before but was removed later. ### Why are the changes needed? In https://github.com/apache/spark/pull/25626 , we decided to not add the update API in DS v2, but we still want to implement UPDATE for builtin source like JDBC. We should at least add the resolved logical plan. ### Does this PR introduce any user-facing change? no, UPDATE is still not supported yet. ### How was this patch tested? new tests. Closes #26025 from cloud-fan/update. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-10-07 23:36:26 -07:00
Dongjoon Hyun	cb501771fa	[SPARK-25668][SQL][TESTS] Refactor TPCDSQueryBenchmark to use main method ### What changes were proposed in this pull request? This PR aims at the following. - Refactor `TPCDSQueryBenchmark` to use main method to improve the usability. - Reduce the number of iterations from 5 to 2 because it takes too long. (2 is okay because we have `Stdev` field now. If there is an irregular run, we can notice easily with that). - Generate one result file for TPCDS scale factor 1. (Note that this test suite can be used for the other scale factors, too.) - AWS EC2 `r3.xlarge` with `ami-06f2f779464715dc5 (ubuntu-bionic-18.04-amd64-server-20190722.1)` is used. This PR adds a JDK8 result based on the TPCDS ScaleFactor 1G data generated by the following. ``` # `spark-tpcds-datagen` needs this. (JDK8) $ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4 $ export SPARK_HOME=$PWD $ ./build/mvn clean package -DskipTests # Generate data. (JDK8) $ git clone git@github.com:maropu/spark-tpcds-datagen.git $ cd spark-tpcds-datagen/ $ build/mvn clean package $ mkdir -p /data/tpcds $ ./bin/dsdgen --output-location /data/tpcds/s1 // This need `Spark 2.4` ``` ### Why are the changes needed? Although the generated TPCDS data is random, we can keep the record. ### Does this PR introduce any user-facing change? No. (This is dev-only test benchmark). ### How was this patch tested? Manually run the benchmark. Please note that you need to have TPCDS data. ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location /data/tpcds/s1" ``` Closes #26049 from dongjoon-hyun/SPARK-25668. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-08 13:33:42 +09:00
gwang3	64fe82b519	[SPARK-29189][SQL] Add an option to ignore block locations when listing file ### What changes were proposed in this pull request? In our PROD env, we have a pure Spark cluster, which I think is also pretty common, where computation is separated from the storage layer. In such a deploy mode, data locality is never reachable. There are some configurations in the Spark scheduler to reduce the waiting time for data locality (e.g. "spark.locality.wait"). The problem is that, in the file listing phase, the location information of all the files, with all the blocks inside each file, is fetched from the distributed file system. In a PROD environment, a table can be so huge that even fetching all this location information takes tens of seconds. To improve this scenario, Spark needs to provide an option with which data locality can be totally ignored: all we need in the file listing phase are the file locations, without any block location information. ### Why are the changes needed? We ran a benchmark in our PROD env; after ignoring the block locations, we got a pretty huge improvement. Table Size \| Total File Number \| Total Block Number \| List File Duration(With Block Location) \| List File Duration(Without Block Location) -- \| -- \| -- \| -- \| -- 22.6T \| 30000 \| 120000 \| 16.841s \| 1.730s 28.8T \| 42001 \| 148964 \| 10.099s \| 2.858s 3.4T \| 20000 \| 20000 \| 5.833s \| 4.881s ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Via ut. Closes #25869 from wangshisan/SPARK-29189. Authored-by: gwang3 <gwang3@ebay.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-10-07 14:52:55 -05:00
Maxim Gekk	18b7ad2fc5	[SPARK-29328][SQL] Fix calculation of mean seconds per month ### What changes were proposed in this pull request? I introduced new constants `SECONDS_PER_MONTH` and `MILLIS_PER_MONTH`, and reused them in calculations of seconds/milliseconds per month. `SECONDS_PER_MONTH` is 2629746 because the average year of the Gregorian calendar is 365.2425 days long or 60 * 60 * 24 * 365.2425 = 31556952.0 = 12 * 2629746 seconds per year. ### Why are the changes needed? Spark uses the proleptic Gregorian calendar (see https://issues.apache.org/jira/browse/SPARK-26651) in which the average year is 365.2425 days (see https://en.wikipedia.org/wiki/Gregorian_calendar) but the existing implementation assumes 31 days per month, or 12 * 31 = 372 days per year. That's far from the truth. ### Does this PR introduce any user-facing change? Yes, the changes affect at least 3 methods in `GroupStateImpl`, `EventTimeWatermark` and `MonthsBetween`. For example, the `months_between()` function will return a different result in some cases. Before: ```sql spark-sql> select months_between('2019-09-15', '1970-01-01'); 596.4516129 ``` After: ```sql spark-sql> select months_between('2019-09-15', '1970-01-01'); 596.45996838 ``` ### How was this patch tested? By existing test suite `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`. Closes #25998 from MaxGekk/days-in-year. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-07 08:47:46 -05:00
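The arithmetic in the commit above can be checked with a short plain-Scala sketch. The constant values come from the message. The object and method names are hypothetical, and so is the assumption that the quoted `months_between` results are 596 whole months plus a fraction for the 14 remaining days; that decomposition reproduces both quoted outputs but is not taken from Spark's `MonthsBetween` code.

```scala
object MonthConstants {
  val SecondsPerDay: Long = 60L * 60L * 24L

  // 365.2425 days written as the exact ratio 3652425 / 10000 to stay in integer arithmetic.
  val SecondsPerYear: Long = SecondsPerDay * 3652425L / 10000L

  // Mean seconds per month in the Gregorian calendar.
  val SecondsPerMonth: Long = SecondsPerYear / 12L

  // Fraction of a month for leftover days when every month is assumed to have 31 days.
  def fraction31(days: Int): Double = days / 31.0

  // Fraction of a month for leftover days using the mean month length.
  def fractionMean(days: Int): Double = days * SecondsPerDay.toDouble / SecondsPerMonth
}
```

With 14 leftover days, `596 + fraction31(14)` is about 596.4516129 and `596 + fractionMean(14)` is about 596.45996838, matching the "Before" and "After" outputs quoted in the message.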
Maxim Gekk	932e2619ce	[SPARK-29365][SQL] Support dates and timestamps subtraction ### What changes were proposed in this pull request? Added new rules to `TypeCoercion.DateTimeOperations` for the `Subtract` expression, which is replaced by the existing `TimestampDiff` expression if one of its parameters has the `DATE` type and the other one has the `TIMESTAMP` type. The date argument is cast to timestamp. ### Why are the changes needed? - To maintain feature parity with PostgreSQL which supports subtraction of a date from a timestamp and a timestamp from a date: ```sql maxim=# select timestamp'now' - date'epoch'; ?column? ---------------------------- 18175 days 21:07:33.412875 (1 row) maxim=# select date'2020-01-01' - timestamp'now'; ?column? ------------------------- 86 days 02:52:00.945296 (1 row) ``` - To conform to the SQL standard which defines datetime subtraction as an interval. ### Does this PR introduce any user-facing change? Yes, currently the query below fails with the error: ```sql spark-sql> select timestamp'now' - date'2019-10-01'; Error in query: cannot resolve '(TIMESTAMP('2019-10-06 21:05:07.234') - DATE '2019-10-01')' due to data type mismatch: differing types in '(TIMESTAMP('2019-10-06 21:05:07.234') - DATE '2019-10-01')' (timestamp and date).; line 1 pos 7; 'Project [unresolvedalias((1570385107234000 - 18170), None)] +- OneRowRelation ``` after the changes: ```sql spark-sql> select timestamp'now' - date'2019-10-01'; interval 5 days 21 hours 4 minutes 55 seconds 878 milliseconds ``` ### How was this patch tested? - Added new cases to the `rule for date/timestamp operations` test in `TypeCoercionSuite` - By 2 new tests in `datetime.sql` Closes #26036 from MaxGekk/date-timestamp-subtract. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-07 16:47:00 +09:00
Yuanjian Li	130e9ae5dc	[SPARK-29357][SQL][TESTS] Fix flaky test by changing to use AtomicLong ### What changes were proposed in this pull request? Change to use AtomicLong instead of a var in the test. ### Why are the changes needed? Fix flaky test. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #26020 from xuanyuanking/SPARK-25159. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-04 10:11:31 -07:00
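The commit above does not show the test it fixes, so the following is only a generic sketch of why an `AtomicLong` is preferred over a plain `var` when several threads update a counter. All names are hypothetical.

```scala
import java.util.concurrent.atomic.AtomicLong

object CounterDemo {
  // Starts `threads` threads that each increment a shared counter `perThread` times,
  // waits for all of them, and returns the final count.
  def countWithAtomic(threads: Int, perThread: Int): Long = {
    val counter = new AtomicLong(0L)
    val workers = (1 to threads).map { _ =>
      new Thread(() => (1 to perThread).foreach(_ => counter.incrementAndGet()))
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    counter.get()
  }
}
```

`incrementAndGet()` is an atomic read-modify-write, so no increment is lost. With an unsynchronized `var`, concurrent `+= 1` updates can overwrite each other, which is a common source of flaky assertions.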
Maxim Gekk	eecef75350	[SPARK-29355][SQL] Support timestamps subtraction ### What changes were proposed in this pull request? Added new expression `TimestampDiff` for timestamp subtractions. It accepts 2 timestamp expressions and returns another expression of the `CalendarIntervalType`. While creating an instance of `CalendarInterval`, it initializes only the microsecond field with the difference of the given timestamps in microseconds, and sets the `months` field to zero. Also I added a rule for converting `Subtract` to `TimestampDiff`, and enabled already ported test queries in `postgreSQL/timestamp.sql`. ### Why are the changes needed? To maintain feature parity with PostgreSQL which allows getting the timestamp difference: ```sql # select timestamp'today' - timestamp'yesterday'; ?column? ---------- 1 day (1 row) ``` ### Does this PR introduce any user-facing change? Yes, previously users got the following error from timestamp subtraction: ```sql spark-sql> select timestamp'today' - timestamp'yesterday'; Error in query: cannot resolve '(TIMESTAMP('2019-10-04 00:00:00') - TIMESTAMP('2019-10-03 00:00:00'))' due to data type mismatch: '(TIMESTAMP('2019-10-04 00:00:00') - TIMESTAMP('2019-10-03 00:00:00'))' requires (numeric or interval) type, not timestamp; line 1 pos 7; 'Project [unresolvedalias((1570136400000000 - 1570050000000000), None)] +- OneRowRelation ``` after the changes they should get an interval: ```sql spark-sql> select timestamp'today' - timestamp'yesterday'; interval 1 days ``` ### How was this patch tested? - Added tests for `TimestampDiff` to `DateExpressionsSuite` - By new test in `TypeCoercionSuite`. - Enabled tests in `postgreSQL/timestamp.sql`. Closes #26022 from MaxGekk/timestamp-diff. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-04 09:39:19 -07:00
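The behavior described in the commit above, where only the microsecond field is filled in and `months` stays zero, can be sketched in plain Scala. `IntervalSketch` and `TimestampDiffSketch` are hypothetical stand-ins, not Spark's `CalendarInterval` or `TimestampDiff`. The two input values are the microsecond timestamps that appear in the error message quoted in the commit.

```scala
// Hypothetical stand-in for an interval with only the two fields the message mentions.
final case class IntervalSketch(months: Int, microseconds: Long)

object TimestampDiffSketch {
  val MicrosPerDay: Long = 24L * 60L * 60L * 1000000L

  // Subtracts two timestamps given in microseconds since the epoch.
  // Only the microsecond field carries the difference; months is always zero.
  def subtract(endMicros: Long, startMicros: Long): IntervalSketch =
    IntervalSketch(months = 0, microseconds = endMicros - startMicros)
}
```

For the quoted values, `subtract(1570136400000000L, 1570050000000000L)` yields exactly `MicrosPerDay` microseconds, which corresponds to the `interval 1 days` result shown in the message.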
Wenchen Fan	275e044ba8	[SPARK-29039][SQL] centralize the catalog and table lookup logic ### What changes were proposed in this pull request? Currently we deal with different `ParsedStatement` in many places and write duplicated catalog/table lookup logic. In general the lookup logic is 1. try look up the catalog by name. If no such catalog, and default catalog is not set, convert `ParsedStatement` to v1 command like `ShowDatabasesCommand`. Otherwise, convert `ParsedStatement` to v2 command like `ShowNamespaces`. 2. try look up the table by name. If no such table, fail. If the table is a `V1Table`, convert `ParsedStatement` to v1 command like `CreateTable`. Otherwise, convert `ParsedStatement` to v2 command like `CreateV2Table`. However, since the code is duplicated we don't apply this lookup logic consistently. For example, we forget to consider the v2 session catalog in several places. This PR centralizes the catalog/table lookup logic by 3 rules. 1. `ResolveCatalogs` (in catalyst). This rule resolves v2 catalog from the multipart identifier in SQL statements, and convert the statement to v2 command if the resolved catalog is not session catalog. If the command needs to resolve the table (e.g. ALTER TABLE), put an `UnresolvedV2Table` in the command. 2. `ResolveTables` (in catalyst). It resolves `UnresolvedV2Table` to `DataSourceV2Relation`. 3. `ResolveSessionCatalog` (in sql/core). This rule is only effective if the resolved catalog is session catalog. For commands that don't need to resolve the table, this rule converts the statement to v1 command directly. Otherwise, it converts the statement to v1 command if the resolved table is v1 table, and convert to v2 command if the resolved table is v2 table. Hopefully we can remove this rule eventually when v1 fallback is not needed anymore. ### Why are the changes needed? Reduce duplicated code and make the catalog/table lookup logic consistent. ### Does this PR introduce any user-facing change? 
no ### How was this patch tested? existing tests Closes #25747 from cloud-fan/lookup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-04 16:21:13 +08:00
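The two-step lookup logic that the commit above describes as "in general" can be sketched as two small decision functions. This is a simplification under assumed names (`TableSketch`, `LookupSketch`, string results "v1" and "v2"); it does not model `ResolveCatalogs`, `ResolveTables` or `ResolveSessionCatalog`.

```scala
// Hypothetical table kinds standing in for a v1 table and any other (v2) table.
sealed trait TableSketch
case object V1TableSketch extends TableSketch
case object V2TableSketch extends TableSketch

object LookupSketch {
  // Step 1: if the named catalog is not found and no default catalog is set,
  // fall back to a v1 command; otherwise use a v2 command.
  def commandForCatalog(namedCatalogFound: Boolean, defaultCatalogSet: Boolean): String =
    if (!namedCatalogFound && !defaultCatalogSet) "v1" else "v2"

  // Step 2: fail if the table does not exist; otherwise pick the command by table kind.
  def commandForTable(table: Option[TableSketch]): Either[String, String] = table match {
    case None                => Left("table not found")
    case Some(V1TableSketch) => Right("v1")
    case Some(V2TableSketch) => Right("v2")
  }
}
```

Keeping each decision in one place is the point of the change: every statement goes through the same branches instead of re-implementing them.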
Yuanjian Li	93289b54f5	[SPARK-29203][TESTS][MINOR][FOLLOW UP] Add access modifier for sparkConf in SQLQueryTestSuite ### What changes were proposed in this pull request? Add access modifier `protected` for `sparkConf` in SQLQueryTestSuite, because in the parent trait SharedSparkSession, it is protected. ### Why are the changes needed? Code consistency. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UT. Closes #26019 from xuanyuanking/SPARK-29203. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-04 16:54:47 +09:00
Gengliang Wang	91747bd91b	[SPARK-29326][SQL] ANSI store assignment policy: throw exception on casting failure ### What changes were proposed in this pull request? 1. With ANSI store assignment policy, an exception is thrown on casting failure 2. Introduce a new expression `AnsiCast` for the ANSI store assignment policy, so that the store assignment policy configuration won't affect the general `Cast`. ### Why are the changes needed? As per ANSI SQL standard, ANSI store assignment policy should throw an exception on insertion failure, such as inserting out-of-range value to a numeric field. ### Does this PR introduce any user-facing change? With ANSI store assignment policy, an exception is thrown on casting failure ### How was this patch tested? Unit test Closes #25997 from gengliangwang/newCast. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-04 15:53:38 +08:00
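As a rough illustration of "throw an exception on casting failure" for an out-of-range numeric value, the sketch below contrasts plain JVM narrowing with a checked conversion. The names are hypothetical and neither method is Spark's `Cast` or `AnsiCast`; the lenient variant only shows JVM wrap-around for contrast.

```scala
object AnsiCastSketch {
  // Plain JVM narrowing: silently wraps around when the value does not fit in an Int.
  def lenientToInt(v: Long): Int = v.toInt

  // Checked narrowing: Math.toIntExact throws ArithmeticException on overflow.
  def strictToInt(v: Long): Int = Math.toIntExact(v)
}
```

A strict store assignment in this style surfaces the bad value at insertion time instead of writing a wrapped number into the target column.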
maryannxue	8fabbab299	[SPARK-29350] Fix BroadcastExchange reuse in Dynamic Partition Pruning ### What changes were proposed in this pull request? Dynamic partition pruning filters are added as an in-subquery containing a `BroadcastExchangeExec` in case of a broadcast hash join. This PR makes the `ReuseExchange` rule visit in-subquery nodes, to ensure the new `BroadcastExchangeExec` added by dynamic partition pruning can be reused. ### Why are the changes needed? This initial dynamic partition pruning PR did not enable this reuse, which means a broadcast exchange would be executed twice, in the main query and in the DPP filter. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added broadcast exchange reuse check in `DynamicPartitionPruningSuite` Closes #26015 from maryannxue/exchange-reuse. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-10-03 16:11:32 -07:00
Nik Vanderhoof	6f687691ef	[SPARK-28962][SPARK-27297][SQL] Add overload for filter with index to functions object ### What changes were proposed in this pull request? Add an overload for the higher order function `filter` that takes array index as its second argument to `org.apache.spark.sql.functions`. ### Why are the changes needed? See: SPARK-28962 and SPARK-27297. Specifically, ueshin pointed out the discrepancy here: https://github.com/apache/spark/pull/24232#issuecomment-533288653 ### Does this PR introduce any user-facing change? ### How was this patch tested? Updated these test suites: `test.org.apache.spark.sql.JavaHigherOrderFunctionsSuite` and `org.apache.spark.sql.DataFrameFunctionsSuite` Closes #26007 from nvander1/add_index_overload_for_filter. Authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2019-10-03 11:12:14 -07:00
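The idea of a `filter` whose predicate also receives the element's index can be shown on an ordinary Scala collection. This is a sketch of the semantics only, with a hypothetical helper name and an assumed zero-based index; it is not the `org.apache.spark.sql.functions.filter` overload, which operates on `Column` values.

```scala
object FilterWithIndex {
  // Keeps the elements for which the predicate holds.
  // The predicate receives the element first and its zero-based index second.
  def filter[T](xs: Seq[T])(p: (T, Int) => Boolean): Seq[T] =
    xs.zipWithIndex.collect { case (x, i) if p(x, i) => x }
}
```

For example, `FilterWithIndex.filter(Seq(10, 1, 7, 2))((x, i) => x > i)` keeps only the elements that are greater than their own position.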
Dongjoon Hyun	4e0e4e51c4	[MINOR][TESTS] Rename JSONBenchmark to JsonBenchmark ### What changes were proposed in this pull request? This PR renames `object JSONBenchmark` to `object JsonBenchmark` and the benchmark result file `JSONBenchmark-results.txt` to `JsonBenchmark-results.txt`. ### Why are the changes needed? Since the file name doesn't match `object JSONBenchmark`, it causes confusion when we run the benchmark. In addition, this makes automation difficult. ``` $ find . -name JsonBenchmark.scala ./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala ``` ``` $ build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.json.JsonBenchmark" [info] Running org.apache.spark.sql.execution.datasources.json.JsonBenchmark [error] Error: Could not find or load main class org.apache.spark.sql.execution.datasources.json.JsonBenchmark ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? This is just renaming. Closes #26008 from dongjoon-hyun/SPARK-RENAME-JSON. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-03 09:02:06 -07:00
Dongjoon Hyun	854a0f752e	[SPARK-29320][TESTS] Compare `sql/core` module in JDK8/11 (Part 1) ### What changes were proposed in this pull request? This PR regenerates the `sql/core` benchmarks in JDK8/11 to compare the result. In general, we compare the ratio instead of the time. However, in this PR, the average time is compared. This PR should be considered as a rough comparison. A. EXPECTED CASES(JDK11 is faster in general) - [x] BloomFilterBenchmark (JDK11 is faster except one case) - [x] BuiltInDataSourceWriteBenchmark (JDK11 is faster at CSV/ORC) - [x] CSVBenchmark (JDK11 is faster except five cases) - [x] ColumnarBatchBenchmark (JDK11 is faster at `boolean`/`string` and some cases in `int`/`array`) - [x] DatasetBenchmark (JDK11 is faster with `string`, but is slower for `long` type) - [x] ExternalAppendOnlyUnsafeRowArrayBenchmark (JDK11 is faster except two cases) - [x] ExtractBenchmark (JDK11 is faster except HOUR/MINUTE/SECOND/MILLISECONDS/MICROSECONDS) - [x] HashedRelationMetricsBenchmark (JDK11 is faster) - [x] JSONBenchmark (JDK11 is much faster except eight cases) - [x] JoinBenchmark (JDK11 is faster except five cases) - [x] OrcNestedSchemaPruningBenchmark (JDK11 is faster in nine cases) - [x] PrimitiveArrayBenchmark (JDK11 is faster) - [x] SortBenchmark (JDK11 is faster except `Arrays.sort` case) - [x] UDFBenchmark (N/A, values are too small) - [x] UnsafeArrayDataBenchmark (JDK11 is faster except one case) - [x] WideTableBenchmark (JDK11 is faster except two cases) B. 
CASES WE NEED TO INVESTIGATE MORE LATER - [x] AggregateBenchmark (JDK11 is slower in general) - [x] CompressionSchemeBenchmark (JDK11 is slower in general except `string`) - [x] DataSourceReadBenchmark (JDK11 is slower in general) - [x] DateTimeBenchmark (JDK11 is slightly slower in general except `parsing`) - [x] MakeDateTimeBenchmark (JDK11 is slower except two cases) - [x] MiscBenchmark (JDK11 is slower except ten cases) - [x] OrcV2NestedSchemaPruningBenchmark (JDK11 is slower) - [x] ParquetNestedSchemaPruningBenchmark (JDK11 is slower except six cases) - [x] RangeBenchmark (JDK11 is slower except one case) `FilterPushdownBenchmark/InExpressionBenchmark/WideSchemaBenchmark` will be compared later because they take a long time. ### Why are the changes needed? According to the result, there are some differences between JDK8/JDK11. This will be a baseline for the future improvement and comparison. Also, as a reproducible environment, the following environment is used. - Instance: `r3.xlarge` - OS: `CentOS Linux release 7.5.1804 (Core)` - JDK: - `OpenJDK Runtime Environment (build 1.8.0_222-b10)` - `OpenJDK Runtime Environment 18.9 (build 11.0.4+11-LTS)` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? This is a test-only PR. We need to run benchmark. Closes #26003 from dongjoon-hyun/SPARK-29320. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-03 08:58:25 -07:00
Sean Owen	7aca0dd658	[SPARK-29296][BUILD][CORE] Remove use of .par to make 2.13 support easier; add scala-2.13 profile to enable pulling in par collections library separately, for the future ### What changes were proposed in this pull request? Scala 2.13 removes the parallel collections classes to a separate library, so first, this establishes a `scala-2.13` profile to bring it back, for future use. However the library enables use of `.par` implicit conversions via a new class that is not in 2.12, which makes cross-building hard. This implements a suggested workaround from https://github.com/scala/scala-parallel-collections/issues/22 to avoid `.par` entirely. ### Why are the changes needed? To compile for 2.13 and later to work with 2.13. ### Does this PR introduce any user-facing change? Should not, no. ### How was this patch tested? Existing tests. Closes #25980 from srowen/SPARK-29296. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-03 08:56:08 -05:00
s71955	ee66890f30	[SPARK-28084][SQL] Resolving the partition column name based on the resolver in sql load command ### What changes were proposed in this pull request? The LOAD DATA command resolves partition column names in a case-sensitive manner, whereas the INSERT command resolves them using the SQLConf resolver, which honors the `spark.sql.caseSensitive` property. The same logic can be applied when resolving partition column names in the LOAD DATA command. ### Why are the changes needed? To handle partition column names correctly according to the configuration. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT and manual testing. Closes #24903 from sujith71955/master_paritionColName. Lead-authored-by: s71955 <sujithchacko.2010@gmail.com> Co-authored-by: sujith71955 <sujithchacko.2010@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-03 01:11:48 -07:00
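The resolver-based lookup described in SPARK-28084 can be modeled in plain Scala. This is a minimal sketch, not the code from the patch: `PartitionColumnResolution`, `resolverFor` and `normalizeSpec` are hypothetical names, and the boolean argument stands in for reading `spark.sql.caseSensitive`.

```scala
// Illustrative model only; the names below are hypothetical, not Spark's code.
object PartitionColumnResolution {
  // A resolver decides whether two identifiers refer to the same column.
  type Resolver = (String, String) => Boolean

  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Stand-in for consulting `spark.sql.caseSensitive`.
  def resolverFor(caseSensitiveConf: Boolean): Resolver =
    if (caseSensitiveConf) caseSensitive else caseInsensitive

  // Maps each user-supplied partition key to the table's own column name,
  // or returns the keys that match no partition column.
  def normalizeSpec(
      spec: Map[String, String],
      partitionColumns: Seq[String],
      resolver: Resolver): Either[Seq[String], Map[String, String]] = {
    val resolved = spec.toSeq.map { case (key, value) =>
      partitionColumns.find(resolver(_, key)).map(_ -> value).toRight(key)
    }
    val unknown = resolved.collect { case Left(key) => key }
    if (unknown.nonEmpty) Left(unknown)
    else Right(resolved.collect { case Right(kv) => kv }.toMap)
  }
}
```

With the case-insensitive resolver, a spec keyed by `PART` resolves against a table column named `part`; with the case-sensitive resolver the same spec is rejected.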
HyukjinKwon	40485f4656	[SPARK-29317][SQL][PYTHON] Avoid inheritance hierarchy in pandas CoGroup arrow runner and its plan ### What changes were proposed in this pull request? This PR proposes to avoid abstract classes introduced at https://github.com/apache/spark/pull/24965 but instead uses trait and object. - `abstract class BaseArrowPythonRunner` -> `trait PythonArrowOutput` to allow mix-in Before: ``` BasePythonRunner ├── BaseArrowPythonRunner │ ├── ArrowPythonRunner │ └── CoGroupedArrowPythonRunner ├── PythonRunner └── PythonUDFRunner ``` After: ``` └── BasePythonRunner ├── ArrowPythonRunner ├── CoGroupedArrowPythonRunner ├── PythonRunner └── PythonUDFRunner ``` - `abstract class BasePandasGroupExec ` -> `object PandasGroupUtils` to decouple Before: ``` └── BasePandasGroupExec ├── FlatMapGroupsInPandasExec └── FlatMapCoGroupsInPandasExec ``` After: ``` ├── FlatMapGroupsInPandasExec └── FlatMapCoGroupsInPandasExec ``` ### Why are the changes needed? The problem is that R code path is being matched with Python side: Python: ``` └── BasePythonRunner ├── ArrowPythonRunner ├── CoGroupedArrowPythonRunner ├── PythonRunner └── PythonUDFRunner ``` R: ``` └── BaseRRunner ├── ArrowRRunner └── RRunner ``` I would like to match the hierarchy and decouple other stuff for now if possible. Ideally we should deduplicate both code paths. Internal implementation is also similar intentionally. `BasePandasGroupExec` case is similar as well. R (with Arrow optimization, in particular) has some duplicated codes with Pandas UDFs. `FlatMapGroupsInRWithArrowExec` <> `FlatMapGroupsInPandasExec` `MapPartitionsInRWithArrowExec` <> `ArrowEvalPythonExec` In order to prepare deduplication here as well, it might better avoid changing hierarchy alone in Python side. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Locally tested existing tests. Jenkins tests should verify this too. Closes #25989 from HyukjinKwon/SPARK-29317. 
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-03 16:42:37 +09:00
Henry D	51d6ba7490	[SPARK-28962][SQL] Provide index argument to filter lambda functions ### What changes were proposed in this pull request? Lambda functions to array `filter` can now take as input the index as well as the element. This behavior matches array `transform`. ### Why are the changes needed? See JIRA. It's generally useful, and particularly so if you're working with fixed length arrays. ### Does this PR introduce any user-facing change? Previously filter lambdas had to look like `filter(arr, el -> whatever)` Now, lambdas can take an index argument as well `filter(array, (el, idx) -> whatever)` ### How was this patch tested? I added unit tests to `HigherOrderFunctionsSuite`. Closes #25666 from henrydavidge/filter-idx. Authored-by: Henry D <henrydavidge@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2019-10-02 13:03:06 -07:00
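The two-argument `filter` lambda from SPARK-28962 behaves like the following plain-Scala model. It is a sketch only: `filterWithIndex` is a hypothetical helper, and it assumes the index is 0-based, as it is for array `transform`.

```scala
object FilterWithIndexModel {
  // Hypothetical helper modeling `filter(arr, (el, idx) -> ...)`.
  // Assumption: idx is 0-based, matching array `transform`.
  def filterWithIndex[A](arr: Seq[A])(p: (A, Int) => Boolean): Seq[A] =
    arr.zipWithIndex.collect { case (el, idx) if p(el, idx) => el }
}
```

For example, `FilterWithIndexModel.filterWithIndex(Seq(0, 2, 3))((x, i) => x > i)` keeps only the elements that are larger than their own position.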
Nik Vanderhoof	730a17823f	[SPARK-27297][SQL] Add higher order functions to scala API ## What changes were proposed in this pull request? There is currently no existing Scala API equivalent for the higher order functions introduced in Spark 2.4.0. * transform * aggregate * filter * exists * forall * zip_with * map_zip_with * map_filter * transform_values * transform_keys Equivalent column based functions should be added to the Scala API for org.apache.spark.sql.functions with the following signatures: ```scala def transform(column: Column, f: Column => Column): Column = ??? def transform(column: Column, f: (Column, Column) => Column): Column = ??? def exists(column: Column, f: Column => Column): Column = ??? def filter(column: Column, f: Column => Column): Column = ??? def aggregate( expr: Column, zero: Column, merge: (Column, Column) => Column, finish: Column => Column): Column = ??? def aggregate( expr: Column, zero: Column, merge: (Column, Column) => Column): Column = ??? def zip_with( left: Column, right: Column, f: (Column, Column) => Column): Column = ??? def transform_keys(expr: Column, f: (Column, Column) => Column): Column = ??? def transform_values(expr: Column, f: (Column, Column) => Column): Column = ??? def map_filter(expr: Column, f: (Column, Column) => Column): Column = ??? def map_zip_with(left: Column, right: Column, f: (Column, Column, Column) => Column): Column = ??? ``` ## How was this patch tested? I've mimicked the existing tests for the higher order functions in `org.apache.spark.sql.DataFrameFunctionsSuite` that use `expr` to test the higher order functions. 
As an example of an existing test: ```scala test("map_zip_with function - map of primitive types") { val df = Seq( (Map(8 -> 6L, 3 -> 5L, 6 -> 2L), Map[Integer, Integer]((6, 4), (8, 2), (3, 2))), (Map(10 -> 6L, 8 -> 3L), Map[Integer, Integer]((8, 4), (4, null))), (Map.empty[Int, Long], Map[Integer, Integer]((5, 1))), (Map(5 -> 1L), null) ).toDF("m1", "m2") checkAnswer(df.selectExpr("map_zip_with(m1, m2, (k, v1, v2) -> k == v1 + v2)"), Seq( Row(Map(8 -> true, 3 -> false, 6 -> true)), Row(Map(10 -> null, 8 -> false, 4 -> null)), Row(Map(5 -> null)), Row(null))) } ``` I've added this test that performs the same logic, but with the new column based API I've added. ```scala checkAnswer(df.select(map_zip_with(df("m1"), df("m2"), (k, v1, v2) => k === v1 + v2)), Seq( Row(Map(8 -> true, 3 -> false, 6 -> true)), Row(Map(10 -> null, 8 -> false, 4 -> null)), Row(Map(5 -> null)), Row(null))) ``` Closes #24232 from nvander1/feature/add_higher_order_functions_to_scala_api. Lead-authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com> Co-authored-by: Nik <nikolasrvanderhoof@gmail.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2019-10-02 12:53:39 -07:00
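The semantics exercised by the `map_zip_with` test in SPARK-27297 can be sketched on ordinary Scala maps. This is a model under stated assumptions, not the Column-based API: `mapZipWith` and `keyEqualsSum` are hypothetical helpers, and `None` stands in for SQL NULL when a key is missing from one side.

```scala
object MapZipWithModel {
  // Applies f to every key present in either map; a side that lacks the key
  // contributes None (a stand-in for SQL NULL).
  def mapZipWith[K, V1, V2, R](left: Map[K, V1], right: Map[K, V2])(
      f: (K, Option[V1], Option[V2]) => R): Map[K, R] =
    (left.keySet ++ right.keySet).iterator
      .map(k => k -> f(k, left.get(k), right.get(k)))
      .toMap

  // Mirrors `k == v1 + v2` with NULL propagation: None if either value is absent.
  def keyEqualsSum(k: Int, v1: Option[Long], v2: Option[Int]): Option[Boolean] =
    for (a <- v1; b <- v2) yield k == a + b
}
```

Applied to the first row of the test data, keys `8` and `6` compare equal to the sum of their values and key `3` does not, which matches the `Row(Map(8 -> true, 3 -> false, 6 -> true))` expectation quoted above.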
Terry Kim	f2ead4d0b5	[SPARK-28970][SQL] Implement USE CATALOG/NAMESPACE for Data Source V2 ### What changes were proposed in this pull request? This PR exposes USE CATALOG/USE SQL commands as described in this [SPIP](https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit#) It also exposes `currentCatalog` in `CatalogManager`. Finally, it changes `SHOW NAMESPACES` and `SHOW TABLES` to use the current catalog if no catalog is specified (instead of default catalog). ### Why are the changes needed? There is currently no mechanism to change current catalog/namespace thru SQL commands. ### Does this PR introduce any user-facing change? Yes, you can perform the following: ```scala // Sets the current catalog to 'testcat' spark.sql("USE CATALOG testcat") // Sets the current catalog to 'testcat' and current namespace to 'ns1.ns2'. spark.sql("USE ns1.ns2 IN testcat") // Now, the following will use 'testcat' as the current catalog and 'ns1.ns2' as the current namespace. spark.sql("SHOW NAMESPACES") ``` ### How was this patch tested? Added new unit tests. Closes #25771 from imback82/use_namespace. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-02 21:55:21 +08:00
Maxim Gekk	3b1674cb1f	[SPARK-29313][SQL] Fix failure on writing to `noop` in benchmarks ### What changes were proposed in this pull request? In the PR, I propose to specify the save mode explicitly while writing to the `noop` datasource in benchmarks. I set `Overwrite` mode in the following benchmarks: - JsonBenchmark - CSVBenchmark - UDFBenchmark - MakeDateTimeBenchmark - ExtractBenchmark - DateTimeBenchmark - NestedSchemaPruningBenchmark ### Why are the changes needed? Otherwise writing to `noop` fails with: ``` [error] Exception in thread "main" org.apache.spark.sql.AnalysisException: TableProvider implementation noop cannot be written with ErrorIfExists mode, please use Append or Overwrite modes instead.; [error] at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:284) ``` most likely due to https://github.com/apache/spark/pull/25876 ### Does this PR introduce any user-facing change? No ### How was this patch tested? I generated results of `ExtractBenchmark` via the command: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.ExtractBenchmark" ``` Closes #25988 from MaxGekk/noop-overwrite-mode. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-01 21:04:56 -07:00
Maxim Gekk	e13880128d	[SPARK-29311][SQL] Return seconds with fraction from `date_part()` and `extract` ### What changes were proposed in this pull request? Added new expression `SecondWithFraction` which produces the `seconds` part of timestamps/dates with fractional part containing microseconds. This expression is used only in the `DatePart` expression. As the result, `date_part()` and `extract` return seconds and microseconds as the fractional part of the seconds part when `field` is `SECOND` (or synonyms). ### Why are the changes needed? The `date_part()` and `extract` were added to maintain feature parity with PostgreSQL which has different behavior for the `SECOND` value of the `field` parameter. The fix is needed to behave in the same way. Here is PostgreSQL's output: ```sql # SELECT date_part('SECONDS', timestamp'2019-10-01 00:00:01.000001'); date_part ----------- 1.000001 (1 row) ``` ### Does this PR introduce any user-facing change? Yes, type of `date_part('SECOND', ...)` is changed from `INT` to `DECIMAL(8, 6)`. Before: ```sql spark-sql> SELECT date_part('SECONDS', '2019-10-01 00:00:01.000001'); 1 ``` After: ```sql spark-sql> SELECT date_part('SECONDS', '2019-10-01 00:00:01.000001'); 1.000001 ``` ### How was this patch tested? - Added new tests to `DateExpressionSuite` for the `SecondWithFraction` expression - Regenerated results of `date_part.sql`, `extract.sql` and `timestamp.sql` - Updated results of `ExtractBenchmark` Closes #25986 from MaxGekk/extract-seconds-from-timestamp. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-02 11:16:31 +09:00
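The value that SPARK-29311 describes for the `SECOND` field (whole seconds plus microseconds as a six-digit fraction) can be illustrated with `java.time`. This is a sketch, not the `SecondWithFraction` expression itself: `secondsWithFraction` is a hypothetical helper operating on a `LocalDateTime`.

```scala
import java.time.LocalDateTime

object SecondsWithFractionModel {
  // Hypothetical helper: the seconds field plus microseconds, as a decimal
  // with scale 6 (the fractional part holds the microseconds).
  def secondsWithFraction(ts: LocalDateTime): java.math.BigDecimal = {
    val micros = ts.getSecond * 1000000L + ts.getNano / 1000
    java.math.BigDecimal.valueOf(micros, 6)
  }
}
```

For the timestamp used in the PR description, `2019-10-01 00:00:01.000001`, the helper yields `1.000001`.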
Dongjoon Hyun	bd031c2173	[SPARK-29307][BUILD][TESTS] Remove scalatest deprecation warnings ### What changes were proposed in this pull request? This PR aims to remove `scalatest` deprecation warnings with the following changes. - `org.scalatest.mockito.MockitoSugar` -> `org.scalatestplus.mockito.MockitoSugar` - `org.scalatest.selenium.WebBrowser` -> `org.scalatestplus.selenium.WebBrowser` - `org.scalatest.prop.Checkers` -> `org.scalatestplus.scalacheck.Checkers` - `org.scalatest.prop.GeneratorDrivenPropertyChecks` -> `org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks` ### Why are the changes needed? According to the Jenkins logs, there are 118 warnings about this. ``` grep "is deprecated" ~/consoleText \| grep scalatest \| wc -l 118 ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? After Jenkins passes, we need to check the Jenkins log. Closes #25982 from dongjoon-hyun/SPARK-29307. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-30 21:00:11 -07:00
Jeff Evans	d841b33ba3	[SPARK-25153][SQL] Improve error messages for columns with dots/periods ### What changes were proposed in this pull request? Adds an additional check in `Dataset#resolve`, in the else clause (i.e. column not resolved), that appends a suffix to the error message of the `AnalysisException` if the column name is found verbatim among the schema fields, suggesting to the user that it might need to be quoted with backticks. Also adds a test for the extra part of the error message. ### Why are the changes needed? Forgetting to quote such column names is a common occurrence for new Spark users. ### Does this PR introduce any user-facing change? No (other than the extra suffix on the error message). ### How was this patch tested? `test` was run for `core` in `sbt`, and passed. Closes #25807 from jeff303/SPARK-25153. Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com> Signed-off-by: Holden Karau <hkarau@apple.com>	2019-09-30 18:34:44 -07:00
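The check described in SPARK-25153 can be sketched as follows. This is an illustration under assumptions, not the patched method: `unresolvedMessage` is a hypothetical helper and the message wording is made up for the example.

```scala
object DottedColumnHint {
  // Hypothetical helper: builds the message for a column that failed to resolve.
  // The wording is illustrative only.
  def unresolvedMessage(colName: String, fieldNames: Seq[String]): String = {
    val fields = fieldNames.mkString(", ")
    val base = s"Cannot resolve column name '$colName' among ($fields)"
    // Only add the hint when the exact name exists as a schema field,
    // i.e. the dot was meant literally and the name needs backticks.
    val hint =
      if (fieldNames.contains(colName)) s"; did you mean to quote the `$colName` column?"
      else ""
    base + hint
  }
}
```

A lookup of `a.b` against a schema that literally contains a field named `a.b` gets the quoting hint, while a lookup of a name that is simply absent does not.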
Sean Owen	e1ea806b30	[SPARK-29291][CORE][SQL][STREAMING][MLLIB] Change procedure-like declaration to function + Unit for 2.13 ### What changes were proposed in this pull request? Scala 2.13 emits a deprecation warning for procedure-like declarations: ``` def foo() { ... ``` This is equivalent to the following, so should be changed to avoid a warning: ``` def foo(): Unit = { ... ``` ### Why are the changes needed? It will avoid about a thousand compiler warnings when we start to support Scala 2.13. I wanted to make the change in 3.0 as there are less likely to be back-ports from 3.0 to 2.4 than 3.1 to 3.0, for example, minimizing that downside to touching so many files. Unfortunately, that makes this quite a big change. ### Does this PR introduce any user-facing change? No behavior change at all. ### How was this patch tested? Existing tests. Closes #25968 from srowen/SPARK-29291. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-30 10:03:23 -07:00
Chris Martin	76791b89f5	[SPARK-27463][PYTHON][FOLLOW-UP] Miscellaneous documentation and code cleanup of cogroup pandas UDF Follow up from https://github.com/apache/spark/pull/24981 incorporating some comments from HyukjinKwon. Specifically: - Adding `CoGroupedData` to `pyspark/sql/__init__.py __all__` so that documentation is generated. - Added pydoc, including example, for the use case whereby the user supplies a cogrouping function including a key. - Added the boilerplate for doctests to cogroup.py. Note that cogroup.py only contains the apply() function which has doctests disabled as per the other pandas UDFs. - Restricted the newly exposed RelationalGroupedDataset constructor parameters to access only by the sql package. - Some minor formatting tweaks. This was tested by running the appropriate unit tests. I'm unsure how to check that my change will cause the documentation to be generated correctly, but if someone can describe how I can do this I'd be happy to check. Closes #25939 from d80tb7/SPARK-27463-fixes. Authored-by: Chris Martin <chris@cmartinit.co.uk> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-30 22:25:35 +09:00
Yuming Wang	31700116d2	[SPARK-28476][SQL] Support ALTER DATABASE SET LOCATION ### What changes were proposed in this pull request? Support the syntax `ALTER (DATABASE\|SCHEMA) database_name SET LOCATION path`. Please note that only the Hive 3.x metastore supports this syntax. Ref: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL https://issues.apache.org/jira/browse/HIVE-8472 ### Why are the changes needed? Support more syntax. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit test. Closes #25883 from wangyum/SPARK-28476. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-29 11:31:49 -07:00
TomokoKomiyama	67d5b9b157	[SPARK-29172][SQL] Fix some exception issue of explain commands ### What changes were proposed in this pull request? Added exception handling. ### Why are the changes needed? How an exception is handled differs depending on which explain command is run. I think the behavior should be unified. [ >spark.sql("explain cost select * from hoge").show(false) ] ![cost](https://user-images.githubusercontent.com/55128575/65225389-09a80500-db00-11e9-9246-0f1a3a881595.png) [ >spark.sql("explain extended select * from hoge").show(false) ] ![extemded](https://user-images.githubusercontent.com/55128575/65225430-188eb780-db00-11e9-99bf-ff550b2ffd12.png) ### Does this PR introduce any user-facing change? No ### How was this patch tested? Tested manually. Closes #25848 from TomokoKomiyama/fix-explain. Authored-by: TomokoKomiyama <btkomiyamatm@oss.nttdata.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-29 10:41:57 -05:00
Maxim Gekk	2409320d8f	[SPARK-29237][SQL][FOLLOWUP] Ignore `SET` commands in expression examples while checking the _FUNC_ pattern ### What changes were proposed in this pull request? The `SET` commands do not contain the `_FUNC_` pattern a priori. In the PR, I propose to filter out such commands in the `using _FUNC_ instead of function names in examples` test. ### Why are the changes needed? After the merge of https://github.com/apache/spark/pull/25942, examples will require particular settings. Currently, the whole expression example has to be ignored, which is excessive. It makes sense to ignore only `SET` commands in expression examples. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the `using _FUNC_ instead of function names in examples` test. Closes #25958 from MaxGekk/dont-check-_FUNC_-in-set. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-29 08:51:47 +09:00
Jungtaek Lim (HeartSaVioR)	94946e4836	[SPARK-29281][SQL] Correct example of Like/RLike to test the origin intention correctly ### What changes were proposed in this pull request? This patch fixes the examples of Like/RLike so that they test their original intention correctly. The examples don't consider the default value of spark.sql.parser.escapedStringLiterals: it's false by default. Please take a look at the current example of Like: `d72f39897b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (L97-L106)` If spark.sql.parser.escapedStringLiterals=false, then it should fail as there's `\U` in the pattern (spark.sql.parser.escapedStringLiterals=false by default) but it doesn't fail. ``` The escape character is '\'. If an escape character precedes a special symbol or another escape character, the following character is matched literally. It is invalid to escape any other character. ``` For the query ``` SET spark.sql.parser.escapedStringLiterals=false; SELECT '%SystemDrive%\Users\John' like '\%SystemDrive\%\Users%'; ``` the SQL parser removes each single `\` (not sure that is intended), so the expressions of Like are constructed as follows (I've printed out the left and right expressions for Like/RLike): > LIKE - left `%SystemDrive%UsersJohn` / right `\%SystemDrive\%Users%` which no longer carry the original intention (see left). The query below tests the original intention: ``` SET spark.sql.parser.escapedStringLiterals=false; SELECT '%SystemDrive%\\Users\\John' like '\%SystemDrive\%\\\\Users%'; ``` > LIKE - left `%SystemDrive%\Users\John` / right `\%SystemDrive\%\\Users%` Note that `\\\\` is needed in the pattern as `StringUtils.escapeLikeRegex` requires `\\` to represent the normal character `\`. 
Same for RLIKE: ``` SET spark.sql.parser.escapedStringLiterals=true; SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.'; ``` > RLIKE - left `%SystemDrive%\Users\John` / right `%SystemDrive%\\Users.` which is OK, but ``` SET spark.sql.parser.escapedStringLiterals=false; SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.'; ``` > RLIKE - left `%SystemDrive%UsersJohn` / right `%SystemDrive%Users.` which no longer carries the original intention. The query below tests the original intention: ``` SET spark.sql.parser.escapedStringLiterals=true; SELECT '%SystemDrive%\\Users\\John' rlike '%SystemDrive%\\\\Users.'; ``` > RLIKE - left `%SystemDrive%\Users\John` / right `%SystemDrive%\\Users.` ### Why are the changes needed? Because the example doesn't test the original intention. Spark now runs automated tests from these examples, so this is not only a documentation issue but also a test issue. ### Does this PR introduce any user-facing change? No, as it only corrects documentation. ### How was this patch tested? Added debug log (like above) and ran queries from `spark-sql`. Closes #25957 from HeartSaVioR/SPARK-29281. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-29 03:05:49 +09:00
Maxim Gekk	ece4213176	[SPARK-21914][FOLLOWUP][TEST-HADOOP3.2][TEST-JAVA11] Clone SparkSession per each function example ### What changes were proposed in this pull request? In the PR, I propose to clone the Spark session for each expression example. Examples can modify SQL settings, and can influence each other if they run in the same Spark session in parallel. ### Why are the changes needed? This should fix test failures like [this](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-jdk-11/478/testReport/junit/org.apache.spark.sql/SQLQuerySuite/check_outputs_of_expression_examples/) when checking the `Like` example: ``` org.apache.spark.sql.AnalysisException: the pattern '\%SystemDrive\%\Users%' is invalid, the escape character is not allowed to precede 'U'; at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:48) at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:57) at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:108) ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running `check outputs of expression examples` in `org.apache.spark.sql.SQLQuerySuite` Closes #25956 from MaxGekk/fix-expr-examples-checks. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-29 02:57:55 +09:00
Jungtaek Lim (HeartSaVioR)	d72f39897b	[SPARK-27254][SS] Cleanup complete but invalid output files in ManifestFileCommitProtocol if job is aborted ## What changes were proposed in this pull request? SPARK-27210 enables ManifestFileCommitProtocol to clean up incomplete output files at the task level if a task is aborted. This patch extends the scope of the cleanup: ManifestFileCommitProtocol also cleans up complete but invalid output files at the job level if the job aborts. Please note that this works on a best-effort basis, not as a guarantee, as we have in HadoopMapReduceCommitProtocol. ## How was this patch tested? Added UT. Closes #24186 from HeartSaVioR/SPARK-27254. Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2019-09-27 12:35:26 -07:00
Maxim Gekk	4dd0066d40	[SPARK-21914][SQL][TESTS] Check results of expression examples ### What changes were proposed in this pull request? New test compares outputs of expression examples in comments with results of `hiveResultString()`. Also I fixed existing examples where actual and expected outputs are different. ### Why are the changes needed? This prevents mistakes in expression examples, and fixes existing mistakes in comments. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Add new test to `SQLQuerySuite`. Closes #25942 from MaxGekk/run-expr-examples. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-27 21:30:37 +09:00
Wang Shuo	bd28e8e179	[SPARK-29213][SQL] Generate extra IsNotNull predicate in FilterExec ### What changes were proposed in this pull request? Currently, `FilterExec` derives its output and generates its null checks in different ways. Thus some nullable attributes could be treated as not nullable by mistake. In `FilterExec.output`, an attribute is marked as nullable or not by finding its `exprId` in notNullAttributes: ``` a.nullable && notNullAttributes.contains(a.exprId) ``` But in `FilterExec.doConsume`, whether a `nullCheck` is generated for a predicate is decided by whether there is a semantically equal not-null predicate: ``` val nullChecks = c.references.map { r => val idx = notNullPreds.indexWhere { n => n.asInstanceOf[IsNotNull].child.semanticEquals(r)} if (idx != -1 && !generatedIsNotNullChecks(idx)) { generatedIsNotNullChecks(idx) = true // Use the child's output. The nullability is what the child produced. genPredicate(notNullPreds(idx), input, child.output) } else { "" } }.mkString("\n").trim ``` An NPE happens when running the SQL below: ``` sql("create table table1(x string)") sql("create table table2(x bigint)") sql("create table table3(x string)") sql("insert into table2 select null as x") sql( """ \|select t1.x \|from ( \| select x from table1) t1 \|left join ( \| select x from ( \| select x from table2 \| union all \| select substr(x,5) x from table3 \| ) a \| where length(x)>0 \|) t3 \|on t1.x=t3.x """.stripMargin).collect() ``` NPE Exception: ``` java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(generated.java:40) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:135) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:94) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:449) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:452) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` the generated code: ``` == Subtree 4 / 5 == (2) Project [cast(x#7L as string) AS x#9] +- (2) Filter ((length(cast(x#7L as string)) > 0) AND isnotnull(cast(x#7L as string))) +- Scan hive default.table2 [x#7L], HiveTableRelation `default`.`table2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [x#7L] Generated code: /* 001 / public Object generate(Object[] references) { / 002 / return new GeneratedIteratorForCodegenStage2(references); / 003 / } / 004 / / 005 / // codegenStageId=2 / 006 / final class GeneratedIteratorForCodegenStage2 extends org.apache.spark.sql.execution.BufferedRowIterator { / 007 / private Object[] references; / 008 / private scala.collection.Iterator[] inputs; / 009 / private scala.collection.Iterator inputadapter_input_0; / 010 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] filter_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2]; / 011 / / 012 / public GeneratedIteratorForCodegenStage2(Object[] references) { / 013 / this.references = references; / 014 / } / 015 / / 016 / public void init(int index, scala.collection.Iterator[] inputs) { / 017 
/ partitionIndex = index; / 018 / this.inputs = inputs; / 019 / inputadapter_input_0 = inputs[0]; / 020 / filter_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); / 021 / filter_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 32); / 022 / / 023 / } / 024 / / 025 / protected void processNext() throws java.io.IOException { / 026 / while ( inputadapter_input_0.hasNext()) { / 027 / InternalRow inputadapter_row_0 = (InternalRow) inputadapter_input_0.next(); / 028 / / 029 / do { / 030 / boolean inputadapter_isNull_0 = inputadapter_row_0.isNullAt(0); / 031 / long inputadapter_value_0 = inputadapter_isNull_0 ? / 032 / -1L : (inputadapter_row_0.getLong(0)); / 033 / / 034 / boolean filter_isNull_2 = inputadapter_isNull_0; / 035 / UTF8String filter_value_2 = null; / 036 / if (!inputadapter_isNull_0) { / 037 / filter_value_2 = UTF8String.fromString(String.valueOf(inputadapter_value_0)); / 038 / } / 039 / int filter_value_1 = -1; / 040 / filter_value_1 = (filter_value_2).numChars(); / 041 / / 042 / boolean filter_value_0 = false; / 043 / filter_value_0 = filter_value_1 > 0; / 044 / if (!filter_value_0) continue; / 045 / / 046 / boolean filter_isNull_6 = inputadapter_isNull_0; / 047 / UTF8String filter_value_6 = null; / 048 / if (!inputadapter_isNull_0) { / 049 / filter_value_6 = UTF8String.fromString(String.valueOf(inputadapter_value_0)); / 050 / } / 051 / if (!(!filter_isNull_6)) continue; / 052 / / 053 / ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] / numOutputRows /).add(1); / 054 / / 055 / boolean project_isNull_0 = false; / 056 / UTF8String project_value_0 = null; / 057 / if (!false) { / 058 / project_value_0 = UTF8String.fromString(String.valueOf(inputadapter_value_0)); / 059 / } / 060 / filter_mutableStateArray_0[1].reset(); / 061 / / 062 / filter_mutableStateArray_0[1].zeroOutNullBytes(); / 063 / / 064 / if (project_isNull_0) { / 065 / 
filter_mutableStateArray_0[1].setNullAt(0); / 066 / } else { / 067 / filter_mutableStateArray_0[1].write(0, project_value_0); / 068 / } / 069 / append((filter_mutableStateArray_0[1].getRow())); / 070 / / 071 / } while(false); / 072 / if (shouldStop()) return; / 073 / } / 074 / } / 075 / / 076 / } ``` This PR proposes to use semantic comparison both in `FilterExec.output` and `FilterExec.doConsume` for nullable attribute. With this PR, the generated code snippet is below: ``` == Subtree 2 / 5 == (3) Project [substring(x#8, 5, 2147483647) AS x#5] +- (3) Filter ((length(substring(x#8, 5, 2147483647)) > 0) AND isnotnull(substring(x#8, 5, 2147483647))) +- Scan hive default.table3 [x#8], HiveTableRelation `default`.`table3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [x#8] Generated code: / 001 / public Object generate(Object[] references) { / 002 / return new GeneratedIteratorForCodegenStage3(references); / 003 / } / 004 / / 005 / // codegenStageId=3 / 006 / final class GeneratedIteratorForCodegenStage3 extends org.apache.spark.sql.execution.BufferedRowIterator { / 007 / private Object[] references; / 008 / private scala.collection.Iterator[] inputs; / 009 / private scala.collection.Iterator inputadapter_input_0; / 010 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] filter_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2]; / 011 / / 012 / public GeneratedIteratorForCodegenStage3(Object[] references) { / 013 / this.references = references; / 014 / } / 015 / / 016 / public void init(int index, scala.collection.Iterator[] inputs) { / 017 / partitionIndex = index; / 018 / this.inputs = inputs; / 019 / inputadapter_input_0 = inputs[0]; / 020 / filter_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 32); / 021 / filter_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 32); / 022 / / 023 / } / 024 / 
/ 025 / protected void processNext() throws java.io.IOException { / 026 / while ( inputadapter_input_0.hasNext()) { / 027 / InternalRow inputadapter_row_0 = (InternalRow) inputadapter_input_0.next(); / 028 / / 029 / do { / 030 / boolean inputadapter_isNull_0 = inputadapter_row_0.isNullAt(0); / 031 / UTF8String inputadapter_value_0 = inputadapter_isNull_0 ? / 032 / null : (inputadapter_row_0.getUTF8String(0)); / 033 / / 034 / boolean filter_isNull_0 = true; / 035 / boolean filter_value_0 = false; / 036 / boolean filter_isNull_2 = true; / 037 / UTF8String filter_value_2 = null; / 038 / / 039 / if (!inputadapter_isNull_0) { / 040 / filter_isNull_2 = false; // resultCode could change nullability. / 041 / filter_value_2 = inputadapter_value_0.substringSQL(5, 2147483647); / 042 / / 043 / } / 044 / boolean filter_isNull_1 = filter_isNull_2; / 045 / int filter_value_1 = -1; / 046 / / 047 / if (!filter_isNull_2) { / 048 / filter_value_1 = (filter_value_2).numChars(); / 049 / } / 050 / if (!filter_isNull_1) { / 051 / filter_isNull_0 = false; // resultCode could change nullability. / 052 / filter_value_0 = filter_value_1 > 0; / 053 / / 054 / } / 055 / if (filter_isNull_0 \|\| !filter_value_0) continue; / 056 / boolean filter_isNull_8 = true; / 057 / UTF8String filter_value_8 = null; / 058 / / 059 / if (!inputadapter_isNull_0) { / 060 / filter_isNull_8 = false; // resultCode could change nullability. / 061 / filter_value_8 = inputadapter_value_0.substringSQL(5, 2147483647); / 062 / / 063 / } / 064 / if (!(!filter_isNull_8)) continue; / 065 / / 066 / ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] / numOutputRows /).add(1); / 067 / / 068 / boolean project_isNull_0 = true; / 069 / UTF8String project_value_0 = null; / 070 / / 071 / if (!inputadapter_isNull_0) { / 072 / project_isNull_0 = false; // resultCode could change nullability. 
/ 073 / project_value_0 = inputadapter_value_0.substringSQL(5, 2147483647); / 074 / / 075 / } / 076 / filter_mutableStateArray_0[1].reset(); / 077 / / 078 / filter_mutableStateArray_0[1].zeroOutNullBytes(); / 079 / / 080 / if (project_isNull_0) { / 081 / filter_mutableStateArray_0[1].setNullAt(0); / 082 / } else { / 083 / filter_mutableStateArray_0[1].write(0, project_value_0); / 084 / } / 085 / append((filter_mutableStateArray_0[1].getRow())); / 086 / / 087 / } while(false); / 088 / if (shouldStop()) return; / 089 / } / 090 / } / 091 / / 092 */ } ``` ### Why are the changes needed? Fix NPE bug in FilterExec. ### Does this PR introduce any user-facing change? no ### How was this patch tested? new UT Closes #25902 from wangshuo128/filter-codegen-npe. Authored-by: Wang Shuo <wangshuo128@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-27 15:14:17 +08:00
uncleGen	570525f886	[SPARK-27715][SQL][UI] SQL query details in UI does not show in correct format ## What changes were proposed in this pull request? before pr: ![image](https://user-images.githubusercontent.com/7402327/57752168-bb7e9180-771a-11e9-8757-63236ecab753.png) after pr: ![image](https://user-images.githubusercontent.com/7402327/57752175-c802ea00-771a-11e9-96fd-aef1890b7985.png) ## How was this patch tested? manual test Closes #24609 from uncleGen/SPARK-27715. Authored-by: uncleGen <hustyugm@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-26 22:52:22 -05:00
Rahij Ramsharan	9f3c82163a	[SPARK-29259][SQL] call fs.exists only when necessary ### What changes were proposed in this pull request? Call fs.exists only when necessary in InsertIntoHadoopFsRelationCommand. ### Why are the changes needed? When saving a dataframe into Hadoop, spark first checks if the file exists before inspecting the SaveMode to determine if it should actually insert data. However, the pathExists variable is actually not used in the case of SaveMode.Append. In some file systems, the exists call can be expensive and hence this PR makes that call only when necessary. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing unit tests should cover it since this doesn't change the behavior. Closes #25928 from rahij/rr/exists-upstream. Authored-by: Rahij Ramsharan <rramsharan@palantir.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-26 15:46:31 -07:00
Gengliang Wang	a1213d5f96	[SPARK-28997][SQL] Add `spark.sql.dialect` ### What changes were proposed in this pull request? After https://github.com/apache/spark/pull/25158 and https://github.com/apache/spark/pull/25458, SQL features of PostgreSQL are introduced into Spark. AFAIK, both features are implementation-defined behaviors, which are not specified in ANSI SQL. In such a case, this proposal is to add a configuration `spark.sql.dialect` for choosing a database dialect. After this PR, Spark supports two database dialects, `Spark` and `PostgreSQL`. With `PostgreSQL` dialect, Spark will: 1. perform integral division with the / operator if both sides are integral types; 2. accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. ### Why are the changes needed? Unify the external database dialect with one configuration, instead of small flags. ### Does this PR introduce any user-facing change? A new configuration `spark.sql.dialect` for choosing a database dialect. ### How was this patch tested? Existing tests. Closes #25697 from gengliangwang/dialect. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-26 21:00:27 +08:00
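The boolean input rule described for the `PostgreSQL` dialect above (accept "true", "yes", "1", "false", "no", "0" and their unique prefixes, after trimming) can be sketched as follows. This is a minimal Python illustration, not Spark's implementation; the function name `parse_pg_boolean` and the choice to return `None` for unrecognized input are assumptions made for the example.

```python
from typing import Optional

# Word lists taken from the commit message; everything else is illustrative.
_TRUE_WORDS = ("true", "yes", "1")
_FALSE_WORDS = ("false", "no", "0")


def parse_pg_boolean(text: str) -> Optional[bool]:
    """Hypothetical helper: map trimmed input to a boolean, else None."""
    s = text.strip().lower()
    if not s:
        return None
    # Collect the boolean value of every accepted word that starts with s.
    matches = {
        value
        for words, value in ((_TRUE_WORDS, True), (_FALSE_WORDS, False))
        for word in words
        if word.startswith(s)
    }
    # A prefix is accepted only when it identifies exactly one value.
    return matches.pop() if len(matches) == 1 else None
```

With these word lists no prefix is shared between a true word and a false word, so any non-empty prefix of an accepted word resolves to a single value.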
Gengliang Wang	66c9dc316a	[SPARK-29255][SQL][TESTS] Rename package pgSQL to postgreSQL ### What changes were proposed in this pull request? Rename the package pgSQL to postgreSQL ### Why are the changes needed? To address the comment in https://github.com/apache/spark/pull/25697#discussion_r328431070 . The official full name seems more reasonable. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #25936 from gengliangwang/renamePGSQL. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-09-26 05:36:15 -07:00
Burak Yavuz	c8159c7941	[SPARK-29197][SQL] Remove saveModeForDSV2 from DataFrameWriter ### What changes were proposed in this pull request? It is very confusing that the default save mode is different between the internal implementation of a Data source. The reason that we had to have saveModeForDSV2 was that there was no easy way to check the existence of a Table in DataSource v2. Now, we have catalogs for that. Therefore we should be able to remove the different save modes. We also have a plan forward for `save`, where we can't really check the existence of a table, and therefore create one. That will come in a future PR. ### Why are the changes needed? Because it is confusing that the internal implementation of a data source (which is generally non-obvious to users) decides which default save mode is used within Spark. ### Does this PR introduce any user-facing change? It changes the default save mode for V2 Tables in the DataFrameWriter APIs ### How was this patch tested? Existing tests Closes #25876 from brkyvz/removeSM. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-26 15:20:04 +08:00
Liang-Chi Hsieh	b8b59d6fa3	[SPARK-29239][SPARK-29221][SQL] Subquery should not cause NPE when eliminating subexpression ### What changes were proposed in this pull request? This patch proposes to skip PlanExpression when doing subexpression elimination on executors. ### Why are the changes needed? Subexpression elimination can possibly cause NPE when applying on execution subquery expression like ScalarSubquery on executors. It is because PlanExpression wraps query plan. To compare query plan on executor when eliminating subexpression, can cause unexpected error, like NPE when accessing transient fields. The NPE looks like: ``` [info] - SPARK-29239: Subquery should not cause NPE when eliminating subexpression * FAILED * (175 milliseconds) [info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1395.0 (TID 3447, 10.0.0.196, executor driver): java.lang.NullPointerException [info] at org.apache.spark.sql.execution.LocalTableScanExec.stringArgs(LocalTableScanExec.scala:62) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:506) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:534) [info] at org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:179) [info] at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:181) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:647) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:569) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:559) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:551) [info] at 
org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:548) [info] at org.apache.spark.sql.catalyst.errors.package$TreeNodeException.<init>(package.scala:36) [info] at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:436) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:425) [info] at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:102) [info] at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:63) [info] at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:132) [info] at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:261) ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added unit test. Closes #25925 from viirya/SPARK-29239. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-26 13:55:01 +08:00
Ryan Blue	6a4235aee7	[SPARK-29249][SQL] V2 writer: Don't allow tableProperty for existing tables ### What changes were proposed in this pull request? Don't allow calling append, overwrite, or overwritePartitions after tableProperty is used in DataFrameWriterV2 because table properties are not set as part of operations on existing tables. Only tables that are created or replaced can set table properties. ### Why are the changes needed? The properties are discarded otherwise, so this avoids confusing behavior. ### Does this PR introduce any user-facing change? Yes, but to a new API, DataFrameWriterV2. ### How was this patch tested? Removed test cases that used this method and the append, etc. methods because they no longer compile. Closes #25931 from rdblue/fix-dfw-v2-table-properties. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-26 12:41:34 +08:00
Maxim Gekk	21db2f86f7	[SPARK-29237][SQL] Prevent real function names in expression example template ### What changes were proposed in this pull request? In the PR, I propose to replace function names in some expression examples by `_FUNC_`, and add a test to check that `_FUNC_` is always present in all examples. ### Why are the changes needed? Binding of a function name to an expression is performed in `FunctionRegistry`, which is the single source of truth. Expression examples should avoid using the function name directly because this can make the examples invalid in the future. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a new test to `SQLQuerySuite` which analyses expression examples and checks for the presence of `_FUNC_`. Closes #25924 from MaxGekk/fix-func-examples. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-25 15:16:00 -07:00
Wenchen Fan	a36a7235db	[SPARK-29215][SQL] current namespace should be tracked in SessionCatalog if the current catalog is session catalog ### What changes were proposed in this pull request? when the current catalog is session catalog, get/set the current namespace from/to the `SessionCatalog`. ### Why are the changes needed? It's super confusing that we don't have a single source of truth for the current namespace of the session catalog. It can be in `CatalogManager` or `SessionCatalog`. Ideally, we should always track the current catalog/namespace in `CatalogManager`. However, there are many commands that do not support v2 catalog API. They ignore the current catalog in `CatalogManager` and blindly go to `SessionCatalog`. This means, we must keep track of the current namespace of session catalog even if the current catalog is not session catalog. Thus, we can't use `CatalogManager` to track the current namespace of session catalog because it changes when the current catalog is changed. To keep single source of truth, we should only track the current namespace of session catalog in `SessionCatalog`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Newly added and updated test cases. Closes #25903 from cloud-fan/current. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2019-09-25 17:01:36 +08:00
Xiao Li	7c02c143aa	[SPARK-28292][SQL] Enable Injection of User-defined Hint ### What changes were proposed in this pull request? Move the rule `RemoveAllHints` after the batch `Resolution`. ### Why are the changes needed? User-defined hints can be resolved by the rules injected via `extendedResolutionRules` or `postHocResolutionRules`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a test case Closes #25746 from gatorsmile/moveRemoveAllHints. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-24 18:04:17 +08:00
windpiger	da7e5c4ffb	[SPARK-19917][SQL] qualified partition path stored in catalog ## What changes were proposed in this pull request? The partition path should be qualified before it is stored in the catalog. There are several cases: 1. ALTER TABLE t PARTITION(b=1) SET LOCATION '/path/x' should be qualified: file:/path/x Hive 2.0.0 does not support a location without a scheme here. ``` FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. {0} is not absolute or has no scheme information. Please specify a complete absolute uri with scheme information. ``` 2. ALTER TABLE t PARTITION(b=1) SET LOCATION 'x' should be qualified: file:/tablelocation/x Hive 2.0.0 does not support a relative location here. 3. ALTER TABLE t ADD PARTITION(b=1) LOCATION '/path/x' should be qualified: file:/path/x, the same as Hive 2.0.0 4. ALTER TABLE t ADD PARTITION(b=1) LOCATION 'x' should be qualified: file:/tablelocation/x, the same as Hive 2.0.0 Currently only ALTER TABLE t ADD PARTITION(b=1) LOCATION for a Hive serde table has the expected qualified path. The other cases should be made consistent with it. Another change is for alter table location. ## How was this patch tested? add / modify existing TestCases Closes #17254 from windpiger/qualifiedPartitionPath. Authored-by: windpiger <songjun@outlook.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-24 14:48:47 +08:00
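The qualification rules listed in the commit above can be sketched in a few lines. This is a rough Python illustration that only covers the local `file:` examples from the message; `qualify_partition_location` and its parameters are hypothetical names, and the real change relies on Hadoop's path handling rather than anything shown here.

```python
import posixpath
from urllib.parse import urlparse


def qualify_partition_location(location, table_location, default_scheme="file"):
    """Hypothetical helper: return a scheme-qualified partition location."""
    if urlparse(location).scheme:
        # Already qualified, e.g. "file:/path/x": keep as-is.
        return location
    if location.startswith("/"):
        # Absolute path without a scheme: prepend the default scheme.
        return f"{default_scheme}:{location}"
    # Relative path: resolve it against the table location's path.
    base = urlparse(table_location).path or table_location
    return f"{default_scheme}:{posixpath.join(base, location)}"
```

For example, with a table stored at `file:/tablelocation`, the input `'/path/x'` becomes `file:/path/x` and the input `'x'` becomes `file:/tablelocation/x`, matching the four cases in the message.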
Yuming Wang	0c40b94ae5	[SPARK-29203][SQL][TESTS] Reduce shuffle partitions in SQLQueryTestSuite ### What changes were proposed in this pull request? This PR reduces shuffle partitions from 200 to 4 in `SQLQueryTestSuite` to reduce testing time. ### Why are the changes needed? Reduce testing time. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested locally: Before: ``` ... [info] - subquery/in-subquery/in-joins.sql (6 minutes, 19 seconds) [info] - subquery/in-subquery/not-in-joins.sql (2 minutes, 17 seconds) [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (45 seconds, 763 milliseconds) ... Run completed in 1 hour, 22 minutes. ``` After: ``` ... [info] - subquery/in-subquery/in-joins.sql (1 minute, 12 seconds) [info] - subquery/in-subquery/not-in-joins.sql (27 seconds, 541 milliseconds) [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (17 seconds, 360 milliseconds) ... Run completed in 47 minutes. ``` Closes #25891 from wangyum/SPARK-29203. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-09-23 08:38:40 -07:00
xy_xin	655356e825	[SPARK-28892][SQL] support UPDATE in the parser and add the corresponding logical plan ### What changes were proposed in this pull request? This PR supports UPDATE in the parser and add the corresponding logical plan. The SQL syntax is a standard UPDATE statement: ``` UPDATE tableName tableAlias SET colName=value [, colName=value]+ WHERE predicate? ``` ### Why are the changes needed? With this change, we can start to implement UPDATE in builtin sources and think about how to design the update API in DS v2. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New test cases added. Closes #25626 from xianyinxin/SPARK-28892. Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-23 19:25:56 +08:00
Takeshi Yamamuro	7a2ea58e78	[SPARK-29084][SQL][TESTS] Check method bytecode size in BenchmarkQueryTest ### What changes were proposed in this pull request? This pr proposes to check method bytecode size in `BenchmarkQueryTest`. This metric is critical for performance numbers. ### Why are the changes needed? For performance checks ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #25788 from maropu/CheckMethodSize. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-22 14:47:42 -07:00
Dongjoon Hyun	76bc9db749	[SPARK-29191][TESTS][SQL] Add tag ExtendedSQLTest for SQLQueryTestSuite ### What changes were proposed in this pull request? This PR aims to add tag `ExtendedSQLTest` for `SQLQueryTestSuite`. This doesn't affect our Jenkins test coverage. Instead, this tag gives us an ability to parallelize them by splitting this test suite and the other suites. ### Why are the changes needed? `SQLQueryTestSuite` takes 45 mins alone because it has many SQL scripts to run. <img width="906" alt="time" src="https://user-images.githubusercontent.com/9700541/65353553-4af0f100-dba2-11e9-9f2f-386742d28f92.png"> ### Does this PR introduce any user-facing change? No. ### How was this patch tested? ``` build/sbt "sql/test-only *.SQLQueryTestSuite" -Dtest.exclude.tags=org.apache.spark.tags.ExtendedSQLTest ... [info] SQLQueryTestSuite: [info] ScalaTest [info] Run completed in 3 seconds, 147 milliseconds. [info] Total number of tests run: 0 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 [info] No tests were executed. [info] Passed: Total 0, Failed 0, Errors 0, Passed 0 [success] Total time: 22 s, completed Sep 20, 2019 12:23:13 PM ``` Closes #25872 from dongjoon-hyun/SPARK-29191. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-22 13:53:21 -07:00
Maxim Gekk	051e691029	[SPARK-28141][SQL] Support special date values ### What changes were proposed in this pull request? Supported special string values for `DATE` type. They are simply notational shorthands that will be converted to ordinary date values when read. The following string values are supported: - `epoch [zoneId]` - `1970-01-01` - `today [zoneId]` - the current date in the time zone specified by `spark.sql.session.timeZone`. - `yesterday [zoneId]` - the current date -1 - `tomorrow [zoneId]` - the current date + 1 - `now` - the date of running the current query. It has the same notion as `today`. For example: ```sql spark-sql> SELECT date 'tomorrow' - date 'yesterday'; 2 ``` ### Why are the changes needed? To maintain feature parity with PostgreSQL, see [8.5.1.4. Special Values](https://www.postgresql.org/docs/12/datatype-datetime.html) ### Does this PR introduce any user-facing change? Previously, the parser fails on the special values with the error: ```sql spark-sql> select date 'today'; Error in query: Cannot parse the DATE value: today(line 1, pos 7) ``` After the changes, the special values are converted to appropriate dates: ```sql spark-sql> select date 'today'; 2019-09-06 ``` ### How was this patch tested? - Added tests to `DateFormatterSuite` to check parsing special values from regular strings. - Tests in `DateTimeUtilsSuite` check parsing those values from `UTF8String` - Uncommented tests in `date.sql` Closes #25708 from MaxGekk/datetime-special-values. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-22 17:31:33 +09:00
Maxim Gekk	89bad267d4	[SPARK-29200][SQL] Optimize `extract`/`date_part` for epoch ### What changes were proposed in this pull request? Refactoring of the `DateTimeUtils.getEpoch()` function by avoiding decimal operations that are pretty expensive, and converting the final result to the decimal type at the end. ### Why are the changes needed? The changes improve performance of the `getEpoch()` method at least up to 20 times. Before: ``` Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ cast to timestamp 256 277 33 39.0 25.6 1.0X EPOCH of timestamp 23455 23550 131 0.4 2345.5 0.0X ``` After: ``` Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ cast to timestamp 255 294 34 39.2 25.5 1.0X EPOCH of timestamp 1049 1054 9 9.5 104.9 0.2X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing test from `DateExpressionSuite`. Closes #25881 from MaxGekk/optimize-extract-epoch. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-22 16:59:59 +09:00
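The idea behind the optimization above, doing the work on integers and converting to a decimal only once at the end, can be illustrated outside Spark. The sketch below uses Python's `decimal` module as a stand-in for Spark's `Decimal` type; both function names are hypothetical, and the code does not claim to reproduce `DateTimeUtils.getEpoch()`.

```python
from decimal import Decimal

MICROS_PER_SECOND = 1_000_000


def epoch_with_decimal_division(micros):
    """Decimal-heavy variant: a full decimal division per value."""
    return Decimal(micros) / Decimal(MICROS_PER_SECOND)


def epoch_with_single_conversion(micros):
    """Cheaper variant: keep the integer, then shift the decimal point once."""
    # scaleb(-6) adjusts only the exponent, so no division is performed.
    return Decimal(micros).scaleb(-6)
```

Both variants return numerically equal results for a microsecond count such as 1500000 (1.5 seconds); the second one avoids the division entirely.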
Maxim Gekk	3be5741029	[SPARK-29190][SQL] Optimize `extract`/`date_part` for the milliseconds `field` ### What changes were proposed in this pull request? Changed `DateTimeUtils.getMilliseconds()` to avoid the decimal division, replacing it with setting the scale and precision while converting microseconds to the decimal type. ### Why are the changes needed? This improves performance of `extract` and `date_part()` by more than 50 times: Before: ``` Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ cast to timestamp 397 428 45 25.2 39.7 1.0X MILLISECONDS of timestamp 36723 36761 63 0.3 3672.3 0.0X ``` After: ``` Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ cast to timestamp 278 284 6 36.0 27.8 1.0X MILLISECONDS of timestamp 592 606 13 16.9 59.2 0.5X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing test suite - `DateExpressionsSuite` Closes #25871 from MaxGekk/optimize-epoch-millis. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-21 21:11:31 -07:00
Jungtaek Lim (HeartSaVioR)	f7cc695808	[SPARK-29140][SQL] Handle parameters having "array" of javaType properly in splitAggregateExpressions ### What changes were proposed in this pull request? This patch fixes the issue brought by [SPARK-21870](http://issues.apache.org/jira/browse/SPARK-21870): when generating code for parameter type, it doesn't consider array type in javaType. At least we have one, Spark should generate code for BinaryType as `byte[]`, but Spark create the code for BinaryType as `[B` and generated code fails compilation. Below is the generated code which failed compilation (Line 380): ``` /* 380 / private void agg_doAggregate_count_0([B agg_expr_1_1, boolean agg_exprIsNull_1_1, org.apache.spark.sql.catalyst.InternalRow agg_unsafeRowAggBuffer_1) throws java.io.IOException { / 381 / // evaluate aggregate function for count / 382 / boolean agg_isNull_26 = false; / 383 / long agg_value_28 = -1L; / 384 / if (!false && agg_exprIsNull_1_1) { / 385 / long agg_value_31 = agg_unsafeRowAggBuffer_1.getLong(1); / 386 / agg_isNull_26 = false; / 387 / agg_value_28 = agg_value_31; / 388 / } else { / 389 / long agg_value_33 = agg_unsafeRowAggBuffer_1.getLong(1); / 390 / / 391 / long agg_value_32 = -1L; / 392 / / 393 / agg_value_32 = agg_value_33 + 1L; / 394 / agg_isNull_26 = false; / 395 / agg_value_28 = agg_value_32; / 396 / } / 397 / // update unsafe row buffer / 398 / agg_unsafeRowAggBuffer_1.setLong(1, agg_value_28); / 399 */ } ``` There wasn't any test for HashAggregateExec specifically testing this, but randomized test in ObjectHashAggregateSuite could encounter this and that's why ObjectHashAggregateSuite is flaky. ### Why are the changes needed? Without the fix, generated code from HashAggregateExec may fail compilation. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added new UT. Without the fix, newly added UT fails. Closes #25830 from HeartSaVioR/SPARK-29140. 
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-21 16:29:23 +09:00
Maxim Gekk	252b6cf3c9	[SPARK-29187][SQL] Return null from `date_part()` for the null `field` ### What changes were proposed in this pull request? In the PR, I propose to change the behavior of the `date_part()` function in handling a `null` field, and make it match PostgreSQL. If the `field` parameter is `null`, the function should return `null` of the `double` type as PostgreSQL does: ```sql # select date_part(null, date '2019-09-20'); date_part ----------- (1 row) # select pg_typeof(date_part(null, date '2019-09-20')); pg_typeof ------------------ double precision (1 row) ``` ### Why are the changes needed? The `date_part()` function was added to maintain feature parity with PostgreSQL, but the current behavior of the function differs in handling null as `field`. ### Does this PR introduce any user-facing change? Yes. Before: ```sql spark-sql> select date_part(null, date'2019-09-20'); Error in query: null; line 1 pos 7 ``` After: ```sql spark-sql> select date_part(null, date'2019-09-20'); NULL ``` ### How was this patch tested? Added new tests to `DateFunctionsSuite` for 2 cases: - `field` = `null`, `source` = a date literal - `field` = `null`, `source` = a date column Closes #25865 from MaxGekk/date_part-null. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-20 20:28:56 -07:00
Yuanjian Li	abc88deeed	[SPARK-29063][SQL] Modify fillValue approach to support joined dataframe ### What changes were proposed in this pull request? Modify the approach in `DataFrameNaFunctions.fillValue`: the new one uses `df.withColumns`, which only addresses the columns that need to be filled. After this change, no ambiguous fields are detected for a joined dataframe. ### Why are the changes needed? Before this change, when you have a joined table that has the same field name in both original tables, fillna will fail even if you specify a subset that does not include the 'ambiguous' fields. ``` scala> val df1 = Seq(("f1-1", "f2", null), ("f1-2", null, null), ("f1-3", "f2", "f3-1"), ("f1-4", "f2", "f3-1")).toDF("f1", "f2", "f3") scala> val df2 = Seq(("f1-1", null, null), ("f1-2", "f2", null), ("f1-3", "f2", "f4-1")).toDF("f1", "f2", "f4") scala> val df_join = df1.alias("df1").join(df2.alias("df2"), Seq("f1"), joinType="left_outer") scala> df_join.na.fill("", cols=Seq("f4")) org.apache.spark.sql.AnalysisException: Reference 'f2' is ambiguous, could be: df1.f2, df2.f2.; ``` ### Does this PR introduce any user-facing change? Yes, the fillna operation will pass and give the right answer for a joined table. ### How was this patch tested? Local test and newly added UT. Closes #25768 from xuanyuanking/SPARK-29063. Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-21 08:26:30 +09:00
Holden Karau	42050c3f4f	[SPARK-27659][PYTHON] Allow PySpark to prefetch during toLocalIterator ### What changes were proposed in this pull request? This PR allows Python toLocalIterator to prefetch the next partition while the first partition is being collected. The PR also adds a demo micro-benchmark in the examples directory; we may wish to keep this or not. ### Why are the changes needed? In https://issues.apache.org/jira/browse/SPARK-23961 / `5e79ae3b40` we changed PySpark to only pull one partition at a time. This is memory efficient, but if partitions take time to compute this can mean we're spending more time blocking. ### Does this PR introduce any user-facing change? A new param is added to toLocalIterator ### How was this patch tested? A new unit test inside of `test_rdd.py` checks the time at which the elements are evaluated. Another test, which checks that the results remain the same, is added to `test_dataframe.py`. I also ran a micro-benchmark in the examples directory, `prefetch.py`, which shows an improvement of ~40% in this specific use case. > > 19/08/16 17:11:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). > Running timers: > > [Stage 32:> (0 + 1) / 1] > Results: > > Prefetch time: > > 100.228110831 > > > Regular time: > > 188.341721614 > > > Closes #25515 from holdenk/SPARK-27659-allow-pyspark-tolocalitr-to-prefetch. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: Holden Karau <hkarau@apple.com>	2019-09-20 09:59:31 -07:00
Burak Yavuz	eb7ee6834d	[SPARK-29062][SQL] Add V1_BATCH_WRITE to the TableCapabilityChecks ### What changes were proposed in this pull request? Currently the checks in the Analyzer require that V2 Tables have BATCH_WRITE defined for all tables that have V1 Write fallbacks. This is confusing as these tables may not have the V2 writer interface implemented yet. This PR adds this table capability to these checks. In addition, this allows V2 tables to leverage the V1 APIs for DataFrameWriter.save if they do extend the V1_BATCH_WRITE capability. This way, these tables can continue to receive partitioning information and also perform checks for the existence of tables, and support all SaveModes. ### Why are the changes needed? Partitioned saves through DataFrame.write are otherwise broken for V2 tables that support the V1 write API. ### Does this PR introduce any user-facing change? No ### How was this patch tested? V1WriteFallbackSuite Closes #25767 from brkyvz/bwcheck. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-20 22:04:32 +08:00
Takeshi Yamamuro	ec8a1a8e88	[SPARK-29122][SQL] Propagate all the SQL conf to executors in SQLQueryTestSuite ### What changes were proposed in this pull request? This pr is to propagate all the SQL configurations to executors in `SQLQueryTestSuite`. When the propagation enabled in the tests, a potential bug below becomes apparent; ``` CREATE TABLE num_data (id int, val decimal(38,10)) USING parquet; .... select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4): QueryOutput(select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4),struct<>,java.lang.IllegalArgumentException [info] requirement failed: MutableProjection cannot use UnsafeRow for output data types: decimal(38,0)) (SQLQueryTestSuite.scala:380) ``` The root culprit is that `InterpretedMutableProjection` has incorrect validation in the interpreter mode: `validExprs.forall { case (e, _) => UnsafeRow.isFixedLength(e.dataType) }`. This validation should be the same with the condition (`isMutable`) in `HashAggregate.supportsAggregate`: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L1126 ### Why are the changes needed? Bug fixes. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added tests in `AggregationQuerySuite` Closes #25831 from maropu/SPARK-29122. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-20 21:41:09 +09:00
Jungtaek Lim (HeartSaVioR)	5e92301723	[SPARK-29161][CORE][SQL][STREAMING] Unify default wait time for waitUntilEmpty ### What changes were proposed in this pull request? This is a follow-up of the [review comment](https://github.com/apache/spark/pull/25706#discussion_r321923311). This patch unifies the default wait time to be 10 seconds, as it would fit most UTs (they have smaller timeouts) and doesn't bring additional latency since it will return as soon as the condition is met. This patch doesn't touch the one which waits 100000 milliseconds (100 seconds), to not break anything unintentionally, though it is questionable whether we really need to wait for 100 seconds. ### Why are the changes needed? It simplifies the test code and gets rid of various heuristic timeout values. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? CI build will test the patch, as it would be the best environment to test the patch (builds are running there). Closes #25837 from HeartSaVioR/MINOR-unify-default-wait-time-for-wait-until-empty. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-19 23:11:54 -07:00
Dongjoon Hyun	5b478416f8	[SPARK-28208][SQL][FOLLOWUP] Use `tryWithResource` pattern ### What changes were proposed in this pull request? This PR aims to use `tryWithResource` for ORC file. ### Why are the changes needed? This is a follow-up to address https://github.com/apache/spark/pull/25006#discussion_r298788206 . ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins with the existing tests. Closes #25842 from dongjoon-hyun/SPARK-28208. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-19 15:33:12 -07:00
Ryan Blue	2c775f418f	[SPARK-28612][SQL] Add DataFrameWriterV2 API ## What changes were proposed in this pull request? This adds a new write API as proposed in the [SPIP to standardize logical plans](https://issues.apache.org/jira/browse/SPARK-23521). This new API: * Uses clear verbs to execute writes, like `append`, `overwrite`, `create`, and `replace` that correspond to the new logical plans. * Only creates v2 logical plans so the behavior is always consistent. * Does not allow table configuration options for operations that cannot change table configuration. For example, `partitionedBy` can only be called when the writer executes `create` or `replace`. Here are a few example uses of the new API: ```scala df.writeTo("catalog.db.table").append() df.writeTo("catalog.db.table").overwrite($"date" === "2019-06-01") df.writeTo("catalog.db.table").overwritePartitions() df.writeTo("catalog.db.table").asParquet.create() df.writeTo("catalog.db.table").partitionedBy(days($"ts")).createOrReplace() df.writeTo("catalog.db.table").using("abc").replace() ``` ## How was this patch tested? Added `DataFrameWriterV2Suite` that tests the new write API. Existing tests for v2 plans. Closes #25681 from rdblue/SPARK-28612-add-data-frame-writer-v2. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2019-09-19 13:32:09 -07:00
Gengliang Wang	b917a6593d	[SPARK-28989][SQL] Add a SQLConf `spark.sql.ansi.enabled` ### What changes were proposed in this pull request? Currently, there are new configurations for compatibility with ANSI SQL: * `spark.sql.parser.ansi.enabled` * `spark.sql.decimalOperations.nullOnOverflow` * `spark.sql.failOnIntegralTypeOverflow` This PR is to add new configuration `spark.sql.ansi.enabled` and remove the 3 options above. When the configuration is true, Spark tries to conform to the ANSI SQL specification. It will be disabled by default. ### Why are the changes needed? Make it simple and straightforward. ### Does this PR introduce any user-facing change? The new features for ANSI compatibility will be set via one configuration `spark.sql.ansi.enabled`. ### How was this patch tested? Existing unit tests. Closes #25693 from gengliangwang/ansiEnabled. Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-18 22:30:28 -07:00
Maxim Gekk	a6a663c437	[SPARK-29141][SQL][TEST] Use SqlBasedBenchmark in SQL benchmarks ### What changes were proposed in this pull request? Refactored SQL-related benchmark and made them depend on `SqlBasedBenchmark`. In particular, creation of Spark session are moved into `override def getSparkSession: SparkSession`. ### Why are the changes needed? This should simplify maintenance of SQL-based benchmarks by reducing the number of dependencies. In the future, it should be easier to refactor & extend all SQL benchmarks by changing only one trait. Finally, all SQL-based benchmarks will look uniformly. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the modified benchmarks. Closes #25828 from MaxGekk/sql-benchmarks-refactoring. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-18 17:52:23 -07:00
bartosz25	b4b2e958ce	[MINOR][SS][DOCS] Adapt multiple watermark policy comment to the reality ### What changes were proposed in this pull request? Previous comment was true for Apache Spark 2.3.0. The 2.4.0 release brought multiple watermark policy and therefore stating that the 'min' is always chosen is misleading. This PR updates the comments about multiple watermark policy. They aren't true anymore since in case of multiple watermarks, we can configure which one will be applied to the query. This change was brought with Apache Spark 2.4.0 release. ### Why are the changes needed? It introduces some confusion about the real execution of the commented code. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? The tests weren't added because the change is only about the documentation level. I affirm that the contribution is my original work and that I license the work to the project under the project's open source license. Closes #25832 from bartosz25/fix_comments_multiple_watermark_policy. Authored-by: bartosz25 <bartkonieczny@yahoo.fr> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-18 10:51:11 -07:00
Owen O'Malley	dfb0a8bb04	[SPARK-28208][BUILD][SQL] Upgrade to ORC 1.5.6 including closing the ORC readers ## What changes were proposed in this pull request? It upgrades ORC from 1.5.5 to 1.5.6 and closes the ORC readers when they aren't used to create RecordReaders. ## How was this patch tested? The changed unit tests were run. Closes #25006 from omalley/spark-28208. Lead-authored-by: Owen O'Malley <omalley@apache.org> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-18 09:32:43 -07:00
sandeep katta	376e17c082	[SPARK-29101][SQL] Fix count API for csv file when DROPMALFORMED mode is selected ### What changes were proposed in this pull request? #DataSet fruit,color,price,quantity apple,red,1,3 banana,yellow,2,4 orange,orange,3,5 xxx This PR aims to fix the below ``` scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false) scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count res1: Long = 4 ``` This is caused by the issue [SPARK-24645](https://issues.apache.org/jira/browse/SPARK-24645). SPARK-24645 issue can also be solved by [SPARK-25387](https://issues.apache.org/jira/browse/SPARK-25387) ### Why are the changes needed? SPARK-24645 caused this regression, so reverted the code as it can also be solved by SPARK-25387 ### Does this PR introduce any user-facing change? No, ### How was this patch tested? Added UT, and also tested the bug SPARK-24645 SPARK-24645 regression ![image](https://user-images.githubusercontent.com/35216143/65067957-4c08ff00-d9a5-11e9-8d43-a4a23a61e8b8.png) Closes #25820 from sandeep-katta/SPARK-29101. Authored-by: sandeep katta <sandeep.katta2007@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-18 23:33:13 +09:00
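The expected count for the sample data above can be sketched outside Spark. This is a minimal pure-Python illustration, assuming that a row counts as malformed when its number of fields differs from the header's. `count_well_formed` is a hypothetical helper, not a Spark API, and Spark's CSV parser applies richer rules than this.

```python
import csv
import io

# Sample data from the commit message: three valid rows and one malformed row.
FRUIT_CSV = """fruit,color,price,quantity
apple,red,1,3
banana,yellow,2,4
orange,orange,3,5
xxx
"""


def count_well_formed(text: str) -> int:
    """Count data rows whose field count matches the header (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # Assumption: "malformed" means a different number of fields than the header.
    return sum(1 for row in reader if len(row) == len(header))


if __name__ == "__main__":
    print(count_well_formed(FRUIT_CSV))
```

For this input the sketch drops the `xxx` row and counts the three remaining rows, whereas the regression described above reported 4.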
Maxim Gekk	c2734ab1fc	[SPARK-29012][SQL] Support special timestamp values ### What changes were proposed in this pull request? Supported special string values for `TIMESTAMP` type. They are simply notational shorthands that will be converted to ordinary timestamp values when read. The following string values are supported: - `epoch [zoneId]` - `1970-01-01 00:00:00+00 (Unix system time zero)` - `today [zoneId]` - midnight today. - `yesterday [zoneId]` -midnight yesterday - `tomorrow [zoneId]` - midnight tomorrow - `now` - current query start time. For example: ```sql spark-sql> SELECT timestamp 'tomorrow'; 2019-09-07 00:00:00 ``` ### Why are the changes needed? To maintain feature parity with PostgreSQL, see [8.5.1.4. Special Values](https://www.postgresql.org/docs/12/datatype-datetime.html) ### Does this PR introduce any user-facing change? Previously, the parser fails on the special values with the error: ```sql spark-sql> select timestamp 'today'; Error in query: Cannot parse the TIMESTAMP value: today(line 1, pos 7) ``` After the changes, the special values are converted to appropriate dates: ```sql spark-sql> select timestamp 'today'; 2019-09-06 00:00:00 ``` ### How was this patch tested? - Added tests to `TimestampFormatterSuite` to check parsing special values from regular strings. - Tests in `DateTimeUtilsSuite` check parsing those values from `UTF8String` - Uncommented tests in `timestamp.sql` Closes #25716 from MaxGekk/timestamp-special-values. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-18 23:30:59 +09:00
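The mapping from special strings to ordinary timestamps can be sketched as follows. This is a pure-Python illustration and not Spark's implementation: `resolve_special_timestamp` is a hypothetical name, the optional `[zoneId]` suffix is ignored, and case-insensitive matching is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def resolve_special_timestamp(value: str, now: datetime) -> Optional[datetime]:
    """Map a special string to a concrete timestamp, or return None if it is not special."""
    # Midnight of the day that contains `now`, in the same time zone as `now`.
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    special = {
        "epoch": datetime(1970, 1, 1, tzinfo=timezone.utc),
        "today": midnight,
        "yesterday": midnight - timedelta(days=1),
        "tomorrow": midnight + timedelta(days=1),
        "now": now,  # stands in for the query start time
    }
    # Assumption: matching ignores surrounding whitespace and letter case.
    return special.get(value.strip().lower())


if __name__ == "__main__":
    query_start = datetime(2019, 9, 6, 12, 30, tzinfo=timezone.utc)
    print(resolve_special_timestamp("tomorrow", query_start))
```

Passing the query start time in as `now` keeps every special value within one query consistent, which mirrors the description of `now` as the current query start time.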
Gengliang Wang	3da2786dc6	[SPARK-29096][SQL] The exact math method should be called only when there is a corresponding function in Math ### What changes were proposed in this pull request? 1. After https://github.com/apache/spark/pull/21599, if the option "spark.sql.failOnIntegralTypeOverflow" is enabled, all the Binary Arithmetic operators will use the exact version of the function. However, only `Add`/`Subtract`/`Multiply` have a corresponding exact function in `java.lang.Math`. When the option "spark.sql.failOnIntegralTypeOverflow" is enabled, a runtime exception "BinaryArithmetics must override either exactMathMethod or genCode" is thrown if the other Binary Arithmetic operators are used, such as "Divide", "Remainder". The exact math method should be called only when there is a corresponding function in `java.lang.Math` 2. Revise the log output of casting to `Int`/`Short` 3. Enable `spark.sql.failOnIntegralTypeOverflow` for pgSQL tests in `SQLQueryTestSuite`. ### Why are the changes needed? 1. Fix the bugs of https://github.com/apache/spark/pull/21599 2. The test case of pgSQL intends to check the overflow of integer/long type. We should enable `spark.sql.failOnIntegralTypeOverflow`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test. Closes #25804 from gengliangwang/enableIntegerOverflowInSQLTest. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-18 16:59:17 +08:00
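The dispatch rule described above (use an overflow-checked variant only for operators that have one) can be sketched in pure Python. This is an illustration under stated assumptions, not Spark's code: `evaluate`, `EXACT_OPS` and `PLAIN_OPS` are hypothetical names, the 64-bit range check stands in for the `java.lang.Math` exact methods, and Python integers never wrap, so the plain path does not reproduce JVM overflow behaviour.

```python
import operator

INT64_MIN, INT64_MAX = -(2 ** 63), 2 ** 63 - 1


def _checked(op):
    """Wrap an integer operation so that it raises when the result leaves the 64-bit range."""
    def exact(a: int, b: int) -> int:
        result = op(a, b)
        if not INT64_MIN <= result <= INT64_MAX:
            raise OverflowError("long overflow")
        return result
    return exact


# Only these operators get a checked counterpart (cf. addExact, subtractExact, multiplyExact).
EXACT_OPS = {
    "Add": _checked(operator.add),
    "Subtract": _checked(operator.sub),
    "Multiply": _checked(operator.mul),
}

# Plain implementations for every operator in this sketch.
# Note: Python's % differs from Java's for negative operands.
PLAIN_OPS = {
    "Add": operator.add,
    "Subtract": operator.sub,
    "Multiply": operator.mul,
    "Remainder": operator.mod,
}


def evaluate(name: str, a: int, b: int, fail_on_overflow: bool) -> int:
    """Use the checked variant only when one exists; otherwise fall back to the plain one."""
    if fail_on_overflow and name in EXACT_OPS:
        return EXACT_OPS[name](a, b)
    return PLAIN_OPS[name](a, b)
```

With this shape, enabling the overflow check no longer fails for an operator such as `Remainder` that has no checked variant; it simply takes the plain path.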
turbofei	eef5e6d348	[SPARK-29113][DOC] Fix some annotation errors and remove meaningless annotations in project ### What changes were proposed in this pull request? In this PR, I fix some annotation errors and remove meaningless annotations in project. ### Why are the changes needed? There are some annotation errors and meaningless annotations in project. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Verified manually. Closes #25809 from turboFei/SPARK-29113. Authored-by: turbofei <fwang12@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-18 13:12:18 +09:00
Chris Martin	05988b256e	[SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs ### What changes were proposed in this pull request? Adds a new cogroup Pandas UDF. This allows two grouped dataframes to be cogrouped together and apply a (pandas.DataFrame, pandas.DataFrame) -> pandas.DataFrame UDF to each cogroup. Example usage ``` from pyspark.sql.functions import pandas_udf, PandasUDFType df1 = spark.createDataFrame( [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)], ("time", "id", "v1")) df2 = spark.createDataFrame( [(20000101, 1, "x"), (20000101, 2, "y")], ("time", "id", "v2")) @pandas_udf("time int, id int, v1 double, v2 string", PandasUDFType.COGROUPED_MAP) def asof_join(l, r): return pd.merge_asof(l, r, on="time", by="id") df1.groupby("id").cogroup(df2.groupby("id")).apply(asof_join).show() ``` +--------+---+---+---+ \| time\| id\| v1\| v2\| +--------+---+---+---+ \|20000101\| 1\|1.0\| x\| \|20000102\| 1\|3.0\| x\| \|20000101\| 2\|2.0\| y\| \|20000102\| 2\|4.0\| y\| +--------+---+---+---+ ### How was this patch tested? Added unit test test_pandas_udf_cogrouped_map Closes #24981 from d80tb7/SPARK-27463-poc-arrow-stream. Authored-by: Chris Martin <chris@cmartinit.co.uk> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-09-17 17:13:50 -07:00
Maxim Gekk	02db706090	[SPARK-29115][SQL][TEST] Add benchmarks for make_date() and make_timestamp() ### What changes were proposed in this pull request? Added new benchmarks for `make_date()` and `make_timestamp()` to detect performance issues, and figure out functions speed on foldable arguments. - `make_date()` is benchmarked on fully foldable arguments. - `make_timestamp()` is benchmarked on corner case `60.0`, foldable time fields and foldable date. ### Why are the changes needed? To find out inputs where `make_date()` and `make_timestamp()` have performance problems. This should be useful in the future optimizations of the functions and users apps. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the benchmark and manually checking generated dates/timestamps. Closes #25813 from MaxGekk/make_datetime-benchmark. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-17 15:09:16 -07:00
xy_xin	3fc52b5557	[SPARK-28950][SQL] Refine the code of DELETE ### What changes were proposed in this pull request? This PR refines the code of DELETE: 1. makes `whereClause` optional, in which case DELETE deletes all of the data in a table; 2. adds more test cases; 3. makes some other refinements. This is a follow-up of SPARK-28351. ### Why are the changes needed? An optional where clause in DELETE respects the SQL standard. ### Does this PR introduce any user-facing change? Yes. But since this is a non-released feature, this change does not have any end-user effects. ### How was this patch tested? A new test case is added. Closes #25652 from xianyinxin/SPARK-28950. Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-18 01:14:14 +08:00
Maxim Gekk	db996ccad9	[SPARK-29074][SQL] Optimize `date_format` for foldable `fmt` ### What changes were proposed in this pull request? In the PR, I propose to create an instance of `TimestampFormatter` only once at the initialization, and reuse it inside of `nullSafeEval()` and `doGenCode()` in the case when the `fmt` parameter is foldable. ### Why are the changes needed? The changes improve performance of the `date_format()` function. Before: ``` format date: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ format date wholestage off 7180 / 7181 1.4 718.0 1.0X format date wholestage on 7051 / 7194 1.4 705.1 1.0X ``` After: ``` format date: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ format date wholestage off 4787 / 4839 2.1 478.7 1.0X format date wholestage on 4736 / 4802 2.1 473.6 1.0X ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? By existing test suites `DateExpressionsSuite` and `DateFunctionsSuite`. Closes #25782 from MaxGekk/date_format-foldable. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-17 16:00:10 +09:00
Takeshi Yamamuro	95073fb62b	[SPARK-29008][SQL] Define an individual method for each common subexpression in HashAggregateExec ### What changes were proposed in this pull request? This pr proposes to define an individual method for each common subexpression in HashAggregateExec. In the current master, the common subexpr elimination code in HashAggregateExec is expanded in a single method; `4664a082c2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala (L397)` The method size can be too big for JIT compilation, so I believe splitting it is beneficial for performance. For example, in a query `SELECT SUM(a + b), AVG(a + b + c) FROM VALUES (1, 1, 1) t(a, b, c)`, the current master generates; ``` /* 098 */ private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0, int agg_expr_1_0, int agg_expr_2_0) throws java.io.IOException { /* 099 */ // do aggregate /* 100 */ // common sub-expressions /* 101 */ int agg_value_6 = -1; /* 102 */ /* 103 */ agg_value_6 = agg_expr_0_0 + agg_expr_1_0; /* 104 */ /* 105 */ int agg_value_5 = -1; /* 106 */ /* 107 */ agg_value_5 = agg_value_6 + agg_expr_2_0; /* 108 */ boolean agg_isNull_4 = false; /* 109 */ long agg_value_4 = -1L; /* 110 */ if (!false) { /* 111 */ agg_value_4 = (long) agg_value_5; /* 112 */ } /* 113 */ int agg_value_10 = -1; /* 114 */ /* 115 */ agg_value_10 = agg_expr_0_0 + agg_expr_1_0; /* 116 */ // evaluate aggregate functions and update aggregation buffers /* 117 */ agg_doAggregate_sum_0(agg_value_10); /* 118 */ agg_doAggregate_avg_0(agg_value_4, agg_isNull_4); /* 119 */ /* 120 */ } ``` On the other hand, this pr generates; ``` /* 121 */ private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0, int agg_expr_1_0, int agg_expr_2_0) throws java.io.IOException { /* 122 */ // do aggregate /* 123 */ // common sub-expressions /* 124 */ long agg_subExprValue_0 = agg_subExpr_0(agg_expr_2_0, agg_expr_0_0, agg_expr_1_0); /* 125 */ int agg_subExprValue_1 = agg_subExpr_1(agg_expr_0_0, agg_expr_1_0); /* 126 */ // evaluate aggregate functions and update aggregation buffers /* 127 */ agg_doAggregate_sum_0(agg_subExprValue_1); /* 128 */ agg_doAggregate_avg_0(agg_subExprValue_0); /* 129 */ /* 130 */ } ``` I run some micro benchmarks for this pr; ``` (base) maropu~:$system_profiler SPHardwareDataType Hardware: Hardware Overview: Processor Name: Intel Core i5 Processor Speed: 2 GHz Number of Processors: 1 Total Number of Cores: 2 L2 Cache (per Core): 256 KB L3 Cache: 4 MB Memory: 8 GB (base) maropu~:$java -version java version "1.8.0_181" Java(TM) SE Runtime Environment (build 1.8.0_181-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode) (base) maropu~:$ /bin/spark-shell --master=local[1] --conf spark.driver.memory=8g --conf spark.sql.shuffle.partitions=1 -v val numCols = 40 val colExprs = "id AS key" +: (0 until numCols).map { i => s"id AS _c$i" } spark.range(3000000).selectExpr(colExprs: _*).createOrReplaceTempView("t") val aggExprs = (2 until numCols).map { i => (0 until i).map(d => s"_c$d") .mkString("AVG(", " + ", ")") } // Drops the time of a first run then pick that of a second run timer { sql(s"SELECT ${aggExprs.mkString(", ")} FROM t").write.format("noop").save() } // the master maxCodeGen: 12957 Elapsed time: 36.309858661s // this pr maxCodeGen=4184 Elapsed time: 2.399490285s ``` ### Why are the changes needed? To avoid the too-long-function issue in JVMs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added tests in `WholeStageCodegenSuite` Closes #25710 from maropu/SplitSubexpr. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-17 11:09:55 +09:00
Takeshi Yamamuro	6297287dfa	[SPARK-29061][SQL] Prints bytecode statistics in debugCodegen ### What changes were proposed in this pull request? This PR proposes to print bytecode statistics (max class bytecode size, max method bytecode size, max constant pool size, and # of inner classes) for generated classes in debug prints, `debugCodegen`. Since these metrics are critical for codegen framework development, I think it's worth printing them there. This PR intends to enable `debugCodegen` to print these metrics as follows: ``` scala> sql("SELECT sum(v) FROM VALUES(1) t(v)").debugCodegen Found 2 WholeStageCodegen subtrees. == Subtree 1 / 2 (maxClassCodeSize:2693; maxMethodCodeSize:124; maxConstantPoolSize:130(0.20% used); numInnerClasses:0) == ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (1) HashAggregate(keys=[], functions=[partial_sum(cast(v#0 as bigint))], output=[sum#5L]) +- (1) LocalTableScan [v#0] Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } ... ``` ### Why are the changes needed? For efficient development. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually tested Closes #25766 from maropu/PrintBytecodeStats. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-16 21:48:07 +08:00
Maxim Gekk	1b7afc0c98	[SPARK-28471][SQL][DOC][FOLLOWUP] Fix year patterns in the comments of date-time expressions ### What changes were proposed in this pull request? In the PR, I propose to fix comments of date-time expressions, and replace the `yyyy` pattern by `uuuu` when the implementation supposes the former one. ### Why are the changes needed? To make comments consistent to implementations. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running Scala Style checker. Closes #25796 from MaxGekk/year-pattern-uuuu-followup. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-15 11:02:15 -07:00
Dongjoon Hyun	13b77e52d2	Revert "[SPARK-29046][SQL] Fix NPE in SQLConf.get when active SparkContext is stopping" This reverts commit `850833fa17`.	2019-09-14 00:09:45 -07:00
WeichenXu	5631a96367	[SPARK-29048] Improve performance on Column.isInCollection() with a large size collection ### What changes were proposed in this pull request? The `Column.isInCollection()` with a large size collection will generate an expression with large size children expressions. This make analyzer and optimizer take a long time to run. In this PR, in `isInCollection()` function, directly generate `InSet` expression, avoid generating too many children expressions. ### Why are the changes needed? `Column.isInCollection()` with a large size collection sometimes become a bottleneck when running sql. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually benchmark it in spark-shell: ``` def testExplainTime(collectionSize: Int) = { val df = spark.range(10).withColumn("id2", col("id") + 1) val list = Range(0, collectionSize).toList val startTime = System.currentTimeMillis() df.where(col("id").isInCollection(list)).where(col("id2").isInCollection(list)).explain() val elapsedTime = System.currentTimeMillis() - startTime println(s"cost time: ${elapsedTime}ms") } ``` Then test on collection size 5, 10, 100, 1000, 10000, test result is: collection size \| explain time (before) \| explain time (after) ------ \| ------ \| ------ 5 \| 26ms \| 29ms 10 \| 30ms \| 48ms 100 \| 104ms \| 50ms 1000 \| 1202ms \| 58ms 10000 \| 10012ms \| 523ms Closes #25754 from WeichenXu123/improve_in_collection. Lead-authored-by: WeichenXu <weichen.xu@databricks.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-12 17:23:08 -07:00
maryannxue	c56a012bc8	[SPARK-29060][SQL] Add tree traversal helper for adaptive spark plans ### What changes were proposed in this pull request? This PR adds a utility class `AdaptiveSparkPlanHelper` which provides methods related to tree traversal of an `AdaptiveSparkPlanExec` plan. Unlike their counterparts in `TreeNode` or `QueryPlan`, these methods traverse down leaf nodes of adaptive plans, i.e., `AdaptiveSparkPlanExec` and `QueryStageExec`. ### Why are the changes needed? This utility class can greatly simplify tree traversal code for adaptive spark plans. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Refined `AdaptiveQueryExecSuite` with the help of the new utility methods. Closes #25764 from maryannxue/aqe-utils. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-12 21:49:21 +08:00
Maxim Gekk	8e9fafbb21	[SPARK-29065][SQL][TEST] Extend `EXTRACT` benchmark ### What changes were proposed in this pull request? In the PR, I propose to extend `ExtractBenchmark` and add new ones for: - `EXTRACT` and `DATE` as input column - the `DATE_PART` function and `DATE`/`TIMESTAMP` input column ### Why are the changes needed? The `EXTRACT` expression is rebased on the `DATE_PART` expression by the PR https://github.com/apache/spark/pull/25410 where some of sub-expressions take `DATE` column as the input (`Millennium`, `Year` and etc.) but others require `TIMESTAMP` column (`Hour`, `Minute`). Separate benchmarks for `DATE` should exclude overhead of implicit conversions `DATE` <-> `TIMESTAMP`. ### Does this PR introduce any user-facing change? No, it doesn't. ### How was this patch tested? - Regenerated results of `ExtractBenchmark` Closes #25772 from MaxGekk/date_part-benchmark. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-12 21:32:35 +09:00
Wenchen Fan	053dd858d3	[SPARK-28998][SQL] reorganize the packages of DS v2 interfaces/classes ### What changes were proposed in this pull request? reorganize the packages of DS v2 interfaces/classes: 1. `org.spark.sql.connector.catalog`: put `TableCatalog`, `Table` and other related interfaces/classes 2. `org.spark.sql.connector.expression`: put `Expression`, `Transform` and other related interfaces/classes 3. `org.spark.sql.connector.read`: put `ScanBuilder`, `Scan` and other related interfaces/classes 4. `org.spark.sql.connector.write`: put `WriteBuilder`, `BatchWrite` and other related interfaces/classes ### Why are the changes needed? Data Source V2 has evolved a lot. It's a bit weird that `Expression` is in `org.spark.sql.catalog.v2` and `Table` is in `org.spark.sql.sources.v2`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing tests Closes #25700 from cloud-fan/package. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-12 19:59:34 +08:00
LantaoJin	6768431c97	[SPARK-29045][SQL][TESTS] Drop table to avoid test failure in SQLMetricsSuite ### What changes were proposed in this pull request? In method `SQLMetricsTestUtils.testMetricsDynamicPartition()`, there is a CREATE TABLE statement without a `withTable` block. It causes test failures if other unit tests use the same table name. ### Why are the changes needed? To avoid "table already exists" in tests. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UT Closes #25752 from LantaoJin/SPARK-29045. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-09-11 23:05:03 -07:00
Jungtaek Lim (HeartSaVioR)	850833fa17	[SPARK-29046][SQL] Fix NPE in SQLConf.get when active SparkContext is stopping # What changes were proposed in this pull request? This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping. Note that it can't be encountered easily as `SparkContext.stop()` blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.) ### Why are the changes needed? The bug brings NPE. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added new UT to verify NPE doesn't occur. Without patch, the test fails with throwing NPE. Closes #25753 from HeartSaVioR/SPARK-29046. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-12 11:16:33 +09:00
dengziming	8f632d7045	[MINOR][DOCS] Fix few typos in the java docs JIRA :https://issues.apache.org/jira/browse/SPARK-29050 'a hdfs' change into 'an hdfs' 'an unique' change into 'a unique' 'an url' change into 'a url' 'a error' change into 'an error' Closes #25756 from dengziming/feature_fix_typos. Authored-by: dengziming <dengziming@growingio.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-12 09:30:03 +09:00
Wenchen Fan	eec728a0d4	[SPARK-29057][SQL] remove InsertIntoTable ### What changes were proposed in this pull request? Remove `InsertIntoTable` and replace its usage with `InsertIntoStatement` ### Why are the changes needed? `InsertIntoTable` and `InsertIntoStatement` are almost identical (except for some naming). It doesn't make sense to keep 2 identical plans. After the removal of `InsertIntoTable`, the analysis process becomes: 1. parser creates `InsertIntoStatement` 2. v2 rule `ResolveInsertInto` converts `InsertIntoStatement` to v2 commands. 3. v1 rules like `DataSourceAnalysis` and `HiveAnalysis` convert `InsertIntoStatement` to v1 commands. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing tests Closes #25763 from cloud-fan/remove. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-12 09:24:36 +09:00
Terry Kim	bf43541c92	[SPARK-28856][SQL] Implement SHOW DATABASES for Data Source V2 Tables ### What changes were proposed in this pull request? Implement the SHOW DATABASES logical and physical plans for data source v2 tables. ### Why are the changes needed? To support `SHOW DATABASES` SQL commands for v2 tables. ### Does this PR introduce any user-facing change? `spark.sql("SHOW DATABASES")` will return namespaces if the default catalog is set: ``` +---------------+ \| namespace\| +---------------+ \| ns1\| \| ns1.ns1_1\| \|ns1.ns1_1.ns1_2\| +---------------+ ``` ### How was this patch tested? Added unit tests to `DataSourceV2SQLSuite`. Closes #25601 from imback82/show_databases. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-10 21:23:57 +08:00
Marco Gaido	ca6f693ef1	[SPARK-28939][SQL][FOLLOWUP] Avoid useless Properties ### What changes were proposed in this pull request? Removes useless `Properties` created according to hvanhovell 's suggestion. ### Why are the changes needed? Avoid useless code. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? existing UTs Closes #25742 from mgaido91/SPARK-28939_followup. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-10 20:47:55 +09:00
Dongjoon Hyun	580c6266fb	[SPARK-28939][SQL][FOLLOWUP] Fix JDK11 compilation due to ambiguous reference ### What changes were proposed in this pull request? This PR aims to recover the JDK11 compilation with a workaround. For now, the master branch is broken like the following due to a [Scala bug](https://github.com/scala/bug/issues/10418) which is fixed in `2.13.0-RC2`. ``` [ERROR] [Error] /spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecutionRDD.scala:42: ambiguous reference to overloaded definition, both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit and method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit match argument types (java.util.Map[String,String]) ``` - https://github.com/apache/spark/actions (JDK11 build monitoring) ### Why are the changes needed? This workaround recovers JDK11 compilation. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manual build with JDK11 because this is JDK11 compilation fix. - Jenkins builds with JDK8 and tests with JDK11. - GitHub action will verify this after merging. Closes #25738 from dongjoon-hyun/SPARK-28939. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-09 20:30:49 -07:00
Wenchen Fan	c2d8ee9c54	[SPARK-28878][SQL][FOLLOWUP] Remove extra project for DSv2 streaming scan ### What changes were proposed in this pull request? Remove the project node if the streaming scan is columnar ### Why are the changes needed? This is a followup of https://github.com/apache/spark/pull/25586. Batch and streaming share the same DS v2 read API so both can support columnar reads. We should apply #25586 to streaming scan as well. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #25727 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-10 11:01:57 +08:00
gengjiaan	aafce7ebff	[SPARK-28412][SQL] ANSI SQL: OVERLAY function support byte array ## What changes were proposed in this pull request? This is an ANSI SQL feature with id `T312`: ``` <binary overlay function> ::= OVERLAY <left paren> <binary value expression> PLACING <binary value expression> FROM <start position> [ FOR <string length> ] <right paren> ``` This PR is related to https://github.com/apache/spark/pull/24918 and adds support for byte arrays. ref: https://www.postgresql.org/docs/11/functions-binarystring.html ## How was this patch tested? New UT. Here are some sample runs of this PR in my production environment. ``` spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('_', 'utf-8') FROM 6); Spark_SQL Time taken: 0.285 s spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('CORE', 'utf-8') FROM 7); Spark CORE Time taken: 0.202 s spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('ANSI ', 'utf-8') FROM 7 FOR 0); Spark ANSI SQL Time taken: 0.165 s spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('tructured', 'utf-8') FROM 2 FOR 4); Structured SQL Time taken: 0.141 s ``` Closes #25172 from beliefer/ansi-overlay-byte-array. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-10 08:16:18 +09:00
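The OVERLAY semantics shown in the SPARK-28412 entry above can be sketched in plain Python on `bytes`. This is only an illustration of the 1-based `FROM` position and the optional `FOR` length, assuming the length defaults to the size of the replacement; the function name `overlay` and its parameters are made up for this sketch and are not Spark's implementation, and edge cases (non-positive positions, out-of-range lengths) are not claimed to match Spark.

```python
from typing import Optional


def overlay(source: bytes, replacement: bytes, start: int,
            length: Optional[int] = None) -> bytes:
    """Replace `length` bytes of `source`, beginning at 1-based `start`.

    Assumption for this sketch: when `length` is omitted, it defaults to
    the number of bytes in `replacement`.
    """
    if length is None:
        length = len(replacement)
    head = source[: start - 1]            # bytes before the overlay position
    tail = source[start - 1 + length:]    # bytes after the replaced span
    return head + replacement + tail


# The four queries from the commit message, expressed on byte strings.
print(overlay(b"Spark SQL", b"_", 6))
print(overlay(b"Spark SQL", b"CORE", 7))
print(overlay(b"Spark SQL", b"ANSI ", 7, 0))
print(overlay(b"Spark SQL", b"tructured", 2, 4))
```

With `FOR 0` nothing is removed, so the replacement is a pure insertion; with `FOR 4` four bytes are dropped regardless of how long the replacement is.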
Sean Owen	6378d4bc06	[SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3 ### What changes were proposed in this pull request? - Remove SQLContext.createExternalTable and Catalog.createExternalTable, deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods - Remove HiveContext, deprecated in 2.0.0, in favor of `SparkSession.builder.enableHiveSupport` - Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated methods, deprecated in 2.2.0 - Remove deprecated MLlib (not Spark ML) linear method support, mostly utility constructors and 'train' methods, and associated docs. This includes methods in LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been deprecated since 2.0.0 - Remove deprecated Pyspark MLlib linear method support, including LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD - Remove 'runs' argument in KMeans.train() method, which has been a no-op since 2.0.0 - Remove deprecated ChiSqSelector isSorted protected method - Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor of 'yarn' and deploy mode 'cluster', etc Notes: - I was not able to remove deprecated DataFrameReader.json(RDD) in favor of DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but it is still needed to support Pyspark's .json() method, which can't use a Dataset. - It looks like SQLContext.createExternalTable was not actually deprecated in Pyspark, but almost certainly was meant to be. Catalog.createExternalTable was. - I afterwards noted that the toDegrees, toRadians functions were almost removed fully in SPARK-25908, but Felix suggested keeping just the R version as they hadn't been technically deprecated. I'd like to revisit that. Do we really want the inconsistency? I'm not against reverting it again, but then that implies leaving SQLContext.createExternalTable just in Pyspark too, which seems weird. 
- I kept LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully removed in Scala. Maybe should not have been deprecated. ### Why are the changes needed? Deprecated items are easiest to remove in a major release, so we should do so as much as possible for Spark 3. This does not target items deprecated 'recently' as of Spark 2.3, which is still 18 months old. ### Does this PR introduce any user-facing change? Yes, in that deprecated items are removed from some public APIs. ### How was this patch tested? Existing tests. Closes #25684 from srowen/SPARK-28980. Lead-authored-by: Sean Owen <sean.owen@databricks.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-09 10:19:40 -05:00
Marco Gaido	3d6b33a49a	[SPARK-28939][SQL] Propagate SQLConf for plans executed by toRdd ### What changes were proposed in this pull request? The PR proposes to create a custom `RDD` which propagates `SQLConf` also in cases not tracked by SQL execution, as happens when a `Dataset` is converted to an RDD using either `.rdd` or `.queryExecution.toRdd` and actions are then invoked on the returned RDD. In this way, SQL configs are effective in these cases too, while earlier they were ignored. ### Why are the changes needed? Without this patch, whenever `.rdd` or `.queryExecution.toRdd` is used, all the SQL configs that were set are ignored. An example of a reproducer can be: ``` withSQLConf(SQLConf.SUBEXPRESSION_ELIMINATION_ENABLED.key, "false") { val df = spark.range(2).selectExpr((0 to 5000).map(i => s"id as field_$i"): _*) df.createOrReplaceTempView("spark64kb") val data = spark.sql("select * from spark64kb limit 10") // Subexpression elimination is used here, although it should have been disabled data.describe() } ``` ### Does this PR introduce any user-facing change? When a user calls `.queryExecution.toRdd`, a `SQLExecutionRDD` is returned wrapping the `RDD` of the execute. When `.rdd` is used, an additional `SQLExecutionRDD` is present in the hierarchy. ### How was this patch tested? added UT Closes #25643 from mgaido91/SPARK-28939. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-09 21:20:34 +08:00
Wenchen Fan	abec6d7763	[SPARK-28341][SQL] create a public API for V2SessionCatalog ## What changes were proposed in this pull request? The `V2SessionCatalog` has 2 functionalities: 1. work as an adapter: provide v2 APIs and translate calls to the `SessionCatalog`. 2. allow users to extend it, so that they can add hooks to apply custom logic before calling methods of the builtin catalog (session catalog). To leverage the second functionality, users must extend `V2SessionCatalog` which is an internal class. There is no doc to explain this usage. This PR does 2 things: 1. refine the document of the config `spark.sql.catalog.session`. 2. add a public abstract class `CatalogExtension` for users to write implementations. TODOs for followup PRs: 1. discuss if we should allow users to completely overwrite the v2 session catalog with a new one. 2. discuss to change the name of session catalog, so that it's less likely to conflict with existing namespace names. ## How was this patch tested? existing tests Closes #25104 from cloud-fan/session-catalog. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-09 21:14:37 +08:00
turbofei	d4eca7c99d	[SPARK-29000][SQL] Decimal precision overflow when don't allow precision loss ### What changes were proposed in this pull request? When spark.sql.decimalOperations.allowPrecisionLoss is set to false, the result of the SQL below overflows and returns null. Case a: `select case when 1=2 then 1 else 1.000000000000000000000001 end * 1` The division operation is similar: the SQL below loses precision. Case b: `select case when 1=2 then 1 else 1.000000000000000000000001 end / 1` Let us check the code of TypeCoercion.scala. `a75467432e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (L864-L875)`. For a binaryOperator, if the two operands have different data types, the rule ImplicitTypeCasts finds a common type and casts both operands to it. So, for the cases mentioned, the left operand is Decimal(34, 24) and the right operand is a Literal. Their common type is Decimal(34,24), and Literal(1) will be cast to Decimal(34,24). Then both operands are of decimal type and they are processed by the decimalAndDecimal method of the DecimalPrecision class. Let's check the relevant code. `a75467432e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala (L123-L153)` When we don't allow precision loss, the result type of the multiply operation in case a is Decimal(38, 38), and that of the division operation in case b is Decimal(38, 20). So the multiply operation in case a overflows and the division operation in case b loses precision. In this PR, we skip handling the binaryOperator if DecimalType operands are involved, so that the rule `DecimalPrecision` handles it. ### Why are the changes needed? Data will be corrupted without this change. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit test. Closes #25701 from turboFei/SPARK-29000. Authored-by: turbofei <fwang12@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-09 13:50:17 +08:00
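The result types quoted in the SPARK-29000 entry above, Decimal(38, 38) and Decimal(38, 20), can be reproduced with a little arithmetic. The sketch below assumes the usual no-precision-loss rules (multiply: precision `p1 + p2 + 1`, scale `s1 + s2`, both capped at 38; divide: integer and fractional digits trimmed until they fit in 38). The function names are invented for this illustration, and the rules are written from memory of `DecimalPrecision`, so treat them as an assumption rather than a copy of the Spark source.

```python
MAX_PRECISION = 38
MAX_SCALE = 38


def bounded(precision: int, scale: int) -> tuple:
    """Cap precision and scale independently at the maximum of 38."""
    return (min(precision, MAX_PRECISION), min(scale, MAX_SCALE))


def multiply_type_no_loss(p1: int, s1: int, p2: int, s2: int) -> tuple:
    """Result type of Decimal(p1,s1) * Decimal(p2,s2) without precision loss."""
    return bounded(p1 + p2 + 1, s1 + s2)


def divide_type_no_loss(p1: int, s1: int, p2: int, s2: int) -> tuple:
    """Result type of Decimal(p1,s1) / Decimal(p2,s2) without precision loss."""
    int_dig = min(MAX_SCALE, p1 - s1 + s2)
    dec_dig = min(MAX_SCALE, max(6, s1 + p2 + 1))
    diff = (int_dig + dec_dig) - MAX_SCALE
    if diff > 0:
        # Give up a bit more than half of the excess from the fraction.
        dec_dig -= diff // 2 + 1
        int_dig = MAX_SCALE - dec_dig
    return bounded(int_dig + dec_dig, dec_dig)


# Both operands were coerced to Decimal(34, 24) before DecimalPrecision ran.
print(multiply_type_no_loss(34, 24, 34, 24))  # case a
print(divide_type_no_loss(34, 24, 34, 24))    # case b
```

A Decimal(38, 38) leaves no digits for the integer part, which is why a product close to 1 cannot be represented and comes back as null in case a.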
Yuming Wang	a75467432e	[SPARK-28000][SQL][TEST] Port comments.sql ## What changes were proposed in this pull request? This PR is to port comments.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/comments.sql The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/expected/comments.out When porting the test cases, I found one PostgreSQL-specific feature that does not exist in Spark SQL: [SPARK-28880](https://issues.apache.org/jira/browse/SPARK-28880): ANSI SQL: Bracketed comments ## How was this patch tested? N/A Closes #25588 from wangyum/SPARK-28000. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-08 10:32:08 +09:00
Takeshi Yamamuro	ff5fa5873e	[SPARK-21870][SQL][FOLLOW-UP] Clean up string template formats for generated code in HashAggregateExec ### What changes were proposed in this pull request? This PR cleans up string template formats for generated code in HashAggregateExec. This change comes from rednaxelafx's comment: https://github.com/apache/spark/pull/20965#discussion_r316418729 ### Why are the changes needed? To improve code readability. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #25714 from maropu/SPARK-21870-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-07 07:16:36 +09:00
maryannxue	b2f06608b7	[SPARK-29002][SQL] Avoid changing SMJ to BHJ if the build side has a high ratio of empty partitions ### What changes were proposed in this pull request? This PR aims to avoid AQE regressions by avoiding changing a sort merge join to a broadcast hash join when the expected build plan has a high ratio of empty partitions, in which case sort merge join can actually perform faster. This PR achieves this by adding an internal join hint in order to let the planner know which side has this high ratio of empty partitions and it should avoid planning it as a build plan of a BHJ. Still, it won't affect the other side if the other side qualifies for a build plan of a BHJ. ### Why are the changes needed? It is a performance improvement for AQE. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT. Closes #25703 from maryannxue/aqe-demote-bhj. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-06 12:46:54 -07:00
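The check described in the SPARK-29002 entry above, "does this side have a high ratio of empty partitions?", can be sketched as follows. The function name, the input shape (a list of per-partition byte sizes) and the `max_empty_ratio` threshold are all hypothetical and chosen only for illustration; the entry does not state the actual threshold or how Spark obtains the partition statistics.

```python
from typing import Sequence


def should_avoid_as_build_side(partition_sizes: Sequence[int],
                               max_empty_ratio: float = 0.8) -> bool:
    """Return True when too many partitions are empty for a BHJ build side.

    `max_empty_ratio` is a made-up default for this sketch, not a Spark setting.
    """
    if not partition_sizes:
        return False  # no statistics, so no reason to demote this side
    empty = sum(1 for size in partition_sizes if size == 0)
    return empty / len(partition_sizes) > max_empty_ratio


# Nine of ten partitions are empty: keep the sort merge join for this side.
print(should_avoid_as_build_side([0, 0, 0, 0, 0, 0, 0, 0, 0, 10]))
# Only one of four partitions is empty: broadcasting is still allowed.
print(should_avoid_as_build_side([5, 0, 7, 3]))
```

The decision is made per side, which matches the note in the entry that the other side may still qualify as the build plan of a BHJ.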
Maxim Gekk	67b4329fb0	[SPARK-28690][SQL] Add `date_part` function for timestamps/dates ## What changes were proposed in this pull request? In the PR, I propose a new function `date_part()`. The function is modeled on the traditional Ingres equivalent to the SQL-standard function `extract`: ``` date_part('field', source) ``` and added for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT). The `source` can have `DATE` or `TIMESTAMP` type. Supported string values of `'field'` are: - `millennium` - the current millennium for given date (or a timestamp implicitly cast to a date). For example, years in the 1900s are in the second millennium. The third millennium started _January 1, 2001_. - `century` - the current century for given date (or timestamp). The first century starts at 0001-01-01 AD. - `decade` - the current decade for given date (or timestamp). Actually, this is the year field divided by 10. - `isoyear` - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January. - `year`, `month`, `day`, `hour`, `minute`, `second` - `week` - the number of the ISO 8601 week-numbering week of the year. By definition, ISO weeks start on Mondays and the first week of a year contains January 4 of that year. - `quarter` - the quarter of the year (1 - 4) - `dayofweek` - the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday) - `dow` - the day of the week as Sunday (0) to Saturday (6) - `isodow` - the day of the week as Monday (1) to Sunday (7) - `doy` - the day of the year (1 - 365/366) - `milliseconds` - the seconds field including fractional parts multiplied by 1,000. - `microseconds` - the seconds field including fractional parts multiplied by 1,000,000. - `epoch` - the number of seconds since 1970-01-01 00:00:00 local time in microsecond precision. 
Here are examples: ```sql spark-sql> select date_part('year', timestamp'2019-08-12 01:00:00.123456'); 2019 spark-sql> select date_part('week', timestamp'2019-08-12 01:00:00.123456'); 33 spark-sql> select date_part('doy', timestamp'2019-08-12 01:00:00.123456'); 224 ``` I changed implementation of `extract` to re-use `date_part()` internally. ## How was this patch tested? Added `date_part.sql` and regenerated results of `extract.sql`. Closes #25410 from MaxGekk/date_part. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-09-06 23:36:00 +09:00
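Several of the calendar fields listed in the SPARK-28690 entry above can be cross-checked with Python's standard `datetime` module. The sketch below covers only the date-based fields and only positive (AD) years; the helper name `date_part` mirrors the SQL function for readability, but it is a stand-alone illustration and not Spark code.

```python
from datetime import datetime


def date_part(field: str, ts: datetime) -> int:
    """Compute a subset of the date_part fields for a timestamp (AD years only)."""
    iso_year, iso_week, iso_dow = ts.isocalendar()  # ISO 8601: Monday = 1
    parts = {
        "millennium": (ts.year - 1) // 1000 + 1,  # 2001 starts the 3rd millennium
        "century": (ts.year - 1) // 100 + 1,      # year 1 starts the 1st century
        "decade": ts.year // 10,
        "isoyear": iso_year,
        "year": ts.year,
        "quarter": (ts.month - 1) // 3 + 1,
        "week": iso_week,
        "dayofweek": iso_dow % 7 + 1,             # 1 = Sunday ... 7 = Saturday
        "dow": iso_dow % 7,                       # 0 = Sunday ... 6 = Saturday
        "isodow": iso_dow,                        # 1 = Monday ... 7 = Sunday
        "doy": ts.timetuple().tm_yday,
    }
    return parts[field]


ts = datetime(2019, 8, 12, 1, 0, 0, 123456)
for field in ("year", "week", "doy", "millennium", "century", "quarter", "dow"):
    print(field, date_part(field, ts))
```

For the timestamp used in the commit message this gives year 2019, week 33 and day-of-year 224, in line with the `spark-sql` output quoted above.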
Takeshi Yamamuro	cb0cddffe9	[SPARK-21870][SQL] Split aggregation code into small functions ## What changes were proposed in this pull request? This PR proposes to split aggregation code into small functions in `HashAggregateExec`. In #18810, we saw a performance regression when JVMs did not compile overly long functions. I checked and found that the codegen of `HashAggregateExec` frequently goes over the limit when a query has too many aggregate functions (e.g., q66 in TPCDS). The current master places all the generated aggregation code in a single function. In this PR, I modified the code to assign an individual function to each aggregate function (e.g., `SUM` and `AVG`). For example, in a query `SELECT SUM(a), AVG(a) FROM VALUES(1) t(a)`, the proposed code defines two functions for `SUM(a)` and `AVG(a)` as follows: - generated code with this PR (https://gist.github.com/maropu/812990012bc967a78364be0fa793f559): ``` /* 173 */ private void agg_doConsume_0(InternalRow inputadapter_row_0, long agg_expr_0_0, boolean agg_exprIsNull_0_0, double agg_expr_1_0, boolean agg_exprIsNull_1_0, long agg_expr_2_0, boolean agg_exprIsNull_2_0) throws java.io.IOException { /* 174 */ // do aggregate /* 175 */ // common sub-expressions /* 176 */ /* 177 */ // evaluate aggregate functions and update aggregation buffers /* 178 */ agg_doAggregate_sum_0(agg_exprIsNull_0_0, agg_expr_0_0); /* 179 */ agg_doAggregate_avg_0(agg_expr_1_0, agg_exprIsNull_1_0, agg_exprIsNull_2_0, agg_expr_2_0); /* 180 */ /* 181 */ } ... /* 071 */ private void agg_doAggregate_avg_0(double agg_expr_1_0, boolean agg_exprIsNull_1_0, boolean agg_exprIsNull_2_0, long agg_expr_2_0) throws java.io.IOException { /* 072 */ // do aggregate for avg /* 073 */ // evaluate aggregate function /* 074 */ boolean agg_isNull_19 = true; /* 075 */ double agg_value_19 = -1.0; ... 
/* 114 */ private void agg_doAggregate_sum_0(boolean agg_exprIsNull_0_0, long agg_expr_0_0) throws java.io.IOException { /* 115 */ // do aggregate for sum /* 116 */ // evaluate aggregate function /* 117 */ agg_agg_isNull_11_0 = true; /* 118 */ long agg_value_11 = -1L; ``` - generated code in the current master (https://gist.github.com/maropu/e9d772af2c98d8991a6a5f0af7841760) ``` /* 059 */ private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0) throws java.io.IOException { /* 060 */ // do aggregate /* 061 */ // common sub-expressions /* 062 */ boolean agg_isNull_4 = false; /* 063 */ long agg_value_4 = -1L; /* 064 */ if (!false) { /* 065 */ agg_value_4 = (long) agg_expr_0_0; /* 066 */ } /* 067 */ // evaluate aggregate function /* 068 */ agg_agg_isNull_7_0 = true; /* 069 */ long agg_value_7 = -1L; /* 070 */ do { /* 071 */ if (!agg_bufIsNull_0) { /* 072 */ agg_agg_isNull_7_0 = false; /* 073 */ agg_value_7 = agg_bufValue_0; /* 074 */ continue; /* 075 */ } /* 076 */ /* 077 */ boolean agg_isNull_9 = false; /* 078 */ long agg_value_9 = -1L; /* 079 */ if (!false) { /* 080 */ agg_value_9 = (long) 0; /* 081 */ } /* 082 */ if (!agg_isNull_9) { /* 083 */ agg_agg_isNull_7_0 = false; /* 084 */ agg_value_7 = agg_value_9; /* 085 */ continue; /* 086 */ } /* 087 */ /* 088 */ } while (false); /* 089 */ /* 090 */ long agg_value_6 = -1L; /* 091 */ /* 092 */ agg_value_6 = agg_value_7 + agg_value_4; /* 093 */ boolean agg_isNull_11 = true; /* 094 */ double agg_value_11 = -1.0; /* 095 */ /* 096 */ if (!agg_bufIsNull_1) { /* 097 */ agg_agg_isNull_13_0 = true; /* 098 */ double agg_value_13 = -1.0; /* 099 */ do { /* 100 */ boolean agg_isNull_14 = agg_isNull_4; /* 101 */ double agg_value_14 = -1.0; /* 102 */ if (!agg_isNull_4) { /* 103 */ agg_value_14 = (double) agg_value_4; /* 104 */ } /* 105 */ if (!agg_isNull_14) { /* 106 */ agg_agg_isNull_13_0 = false; /* 107 */ agg_value_13 = agg_value_14; /* 108 */ continue; /* 109 */ } /* 110 */ /* 111 */ boolean agg_isNull_15 = false; /* 112 */ double agg_value_15 = -1.0; /* 113 */ if (!false) { /* 114 */ agg_value_15 = (double) 0; /* 115 */ } 
/* 116 */ if (!agg_isNull_15) { /* 117 */ agg_agg_isNull_13_0 = false; /* 118 */ agg_value_13 = agg_value_15; /* 119 */ continue; /* 120 */ } /* 121 */ /* 122 */ } while (false); /* 123 */ /* 124 */ agg_isNull_11 = false; // resultCode could change nullability. /* 125 */ /* 126 */ agg_value_11 = agg_bufValue_1 + agg_value_13; /* 127 */ /* 128 */ } /* 129 */ boolean agg_isNull_17 = false; /* 130 */ long agg_value_17 = -1L; /* 131 */ if (!false && agg_isNull_4) { /* 132 */ agg_isNull_17 = agg_bufIsNull_2; /* 133 */ agg_value_17 = agg_bufValue_2; /* 134 */ } else { /* 135 */ boolean agg_isNull_20 = true; /* 136 */ long agg_value_20 = -1L; /* 137 */ /* 138 */ if (!agg_bufIsNull_2) { /* 139 */ agg_isNull_20 = false; // resultCode could change nullability. /* 140 */ /* 141 */ agg_value_20 = agg_bufValue_2 + 1L; /* 142 */ /* 143 */ } /* 144 */ agg_isNull_17 = agg_isNull_20; /* 145 */ agg_value_17 = agg_value_20; /* 146 */ } /* 147 */ // update aggregation buffer /* 148 */ agg_bufIsNull_0 = false; /* 149 */ agg_bufValue_0 = agg_value_6; /* 150 */ /* 151 */ agg_bufIsNull_1 = agg_isNull_11; /* 152 */ agg_bufValue_1 = agg_value_11; /* 153 */ /* 154 */ agg_bufIsNull_2 = agg_isNull_17; /* 155 */ agg_bufValue_2 = agg_value_17; /* 156 */ /* 157 */ } ``` You can check the previous discussion in https://github.com/apache/spark/pull/19082 ## How was this patch tested? Existing tests Closes #20965 from maropu/SPARK-21870-2. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-06 11:45:14 +08:00
Mukul Murthy	3929d16604	[SPARK-26046][SS] Add StreamingQueryManager.listListeners() ### What changes were proposed in this pull request? Add a listListeners() method to StreamingQueryManager that lists all StreamingQueryListeners that have been added to that manager. ### Why are the changes needed? While it's best practice to keep handles on all listeners added, it's still nice to have an API to be able to list what listeners have been added to a StreamingQueryManager. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Modified existing unit tests to use the new API instead of using reflection. Closes #25518 from mukulmurthy/26046-listener. Authored-by: Mukul Murthy <mukul.murthy@gmail.com> Signed-off-by: Jose Torres <torres.joseph.f+github@gmail.com>	2019-09-05 14:27:54 -07:00
Wenchen Fan	c81fd0cd61	[SPARK-28974][SQL] centralize the Data Source V2 table capability checks ### What changes were proposed in this pull request? merge the `V2WriteSupportCheck` and `V2StreamingScanSupportCheck` to one rule: `TableCapabilityCheck`. ### Why are the changes needed? It's a little confusing to have 2 rules to check DS v2 table capability, while one rule says it checks write and another rule says it checks streaming scan. We can clearly tell it from the rule names that the batch scan check is missing. It's better to have a centralized place for this check, with a name that clearly says it checks table capability. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing tests Closes #25679 from cloud-fan/dsv2-check. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-05 20:22:29 +08:00
HyukjinKwon	103d50b3f6	[SPARK-28272][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part3.sql' into UDF test base ### What changes were proposed in this pull request? This PR proposes to port `pgSQL/aggregates_part3.sql` into UDF test base. <details><summary>Diff comparing to 'pgSQL/aggregates_part3.sql'</summary> <p> ```diff diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part3.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part3.sql.out index f102383cb4d..eff33f280cf 100644 --- a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part3.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part3.sql.out @@ -3,7 +3,7 @@ -- !query 0 -select max(min(unique1)) from tenk1 +select udf(max(min(unique1))) from tenk1 -- !query 0 schema struct<> -- !query 0 output @@ -12,11 +12,11 @@ It is not allowed to use an aggregate function in the argument of another aggreg -- !query 1 -select (select count(*) - from (values (1)) t0(inner_c)) +select udf((select udf(count(*)) + from (values (1)) t0(inner_c))) as col from (values (2),(3)) t1(outer_c) -- !query 1 schema -struct<scalarsubquery():bigint> +struct<col:bigint> -- !query 1 output 1 1 ``` </p> </details> ### Why are the changes needed? To improve test coverage in UDFs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested via: ```bash build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-aggregates_part3.sql" ``` as guided in https://issues.apache.org/jira/browse/SPARK-27921 Closes #25676 from HyukjinKwon/SPARK-28272. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-05 18:35:21 +09:00
HyukjinKwon	be04c97262	[SPARK-28971][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part4.sql' into UDF test base ### What changes were proposed in this pull request? This PR proposes to port `pgSQL/aggregates_part4.sql` into UDF test base. <details><summary>Diff comparing to 'pgSQL/aggregates_part4.sql'</summary> <p> ```diff ``` </p> </details> ### Why are the changes needed? To improve test coverage in UDFs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested via: ```bash build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-aggregates_part4.sql" ``` as guided in https://issues.apache.org/jira/browse/SPARK-27921 Closes #25677 from HyukjinKwon/SPARK-28971. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-05 18:34:44 +09:00
Sean Owen	36559b6525	[SPARK-28977][DOCS][SQL] Fix DataFrameReader.json docs to doc that partition column can be numeric, date or timestamp type ### What changes were proposed in this pull request? `DataFrameReader.jdbc()` accepts a partition column that is of numeric, date or timestamp type, according to the implementation in `JDBCRelation.scala`. Update the scaladoc accordingly, to match the documentation in `sql-data-sources-jdbc.md` too. ### Why are the changes needed? The scaladoc is incorrect. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #25687 from srowen/SPARK-28977. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-09-05 18:32:45 +09:00
WeichenXu	f8bc91f749	[SPARK-28782][SQL] Generator support in aggregate expressions ### What changes were proposed in this pull request? Support generators in aggregate expressions. In this PR, I check the aggregate logical plan; if its aggregateExpressions include a generator, the plan is converted into "normal agg plan + generator plan + projection plan". I.e.: ``` aggregate(with generator) |--child_plan ``` ===> ``` project |--generator(resolved) |--aggregate |--child_plan ``` ### Why are the changes needed? We should support SQL like: ``` select explode(array(min(a), max(a))) from t ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test added. Closes #25512 from WeichenXu123/explode_bug. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-05 16:17:49 +08:00
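The plan rewrite in the SPARK-28782 entry above (aggregate first, then run the generator, then project) can be mimicked on plain Python lists for the query `select explode(array(min(a), max(a))) from t`. The three step functions and the sample rows are invented for this sketch; they only mirror the order of the operators, not Catalyst's API.

```python
from typing import Dict, List


def aggregate_step(rows: List[Dict[str, int]]) -> Dict[str, List[int]]:
    """Normal aggregate: compute array(min(a), max(a)) as a single row."""
    values = [row["a"] for row in rows]
    return {"arr": [min(values), max(values)]}


def generator_step(agg_row: Dict[str, List[int]]) -> List[Dict[str, int]]:
    """Generator: explode the array into one output row per element."""
    return [{"col": value} for value in agg_row["arr"]]


def project_step(generated: List[Dict[str, int]]) -> List[int]:
    """Projection: keep only the generated column."""
    return [row["col"] for row in generated]


table_t = [{"a": 3}, {"a": 1}, {"a": 2}]  # hypothetical contents of table t
result = project_step(generator_step(aggregate_step(table_t)))
print(result)
```

Running the generator after the aggregate is what makes the query well defined: `explode` sees exactly one array per group instead of being evaluated inside the aggregation.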
Ryan Blue	dde393142f	[SPARK-28878][SQL] Remove extra project for DSv2 reads with columnar batches ### What changes were proposed in this pull request? Remove unnecessary physical projection added to ensure rows are `UnsafeRow` when the DSv2 scan is columnar. This is not needed because conversions are automatically added to convert from columnar operators to `UnsafeRow` when the next operator does not support columnar execution. ### Why are the changes needed? Removes an extra projection and copy. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #25586 from rdblue/SPARK-28878-remove-dsv2-project-with-columnar. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-05 15:38:46 +08:00
Burak Yavuz	b9edd44bd6	[SPARK-28964] Add the provider information to the table properties in saveAsTable ### What changes were proposed in this pull request? Adds the provider information to the table properties in saveAsTable. ### Why are the changes needed? Otherwise, catalog implementations don't know what kind of Table definition to create. ### Does this PR introduce any user-facing change? nope ### How was this patch tested? Existing unit tests check the existence of the provider now. Closes #25669 from brkyvz/provider. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-05 14:33:35 +08:00
Ryan Blue	5adaa2e103	[SPARK-28979][SQL] Rename UnresolvedTable to V1Table ### What changes were proposed in this pull request? Rename `UnresolvedTable` to `V1Table` because it is not unresolved. ### Why are the changes needed? The class name is inaccurate. This should be fixed before it is in a release. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #25683 from rdblue/SPARK-28979-rename-unresolved-table. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-05 11:41:21 +08:00
Xianjin YE	ca71177868	[SPARK-28907][CORE] Review invalid usage of new Configuration() ### What changes were proposed in this pull request? Replaces some incorrect usages of `new Configuration()`, as it will load the default configs defined in Hadoop ### Why are the changes needed? An unexpected config could be accessed instead of the expected config; see SPARK-28203 for an example ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #25616 from advancedxy/remove_invalid_configuration. Authored-by: Xianjin YE <advancedxy@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-04 19:52:19 -05:00
maryannxue	a7a3935c97	[SPARK-11150][SQL] Dynamic Partition Pruning ### What changes were proposed in this pull request? This patch implements dynamic partition pruning by adding a dynamic-partition-pruning filter if there is a partitioned table and a filter on the dimension table. The filter is then planned using a heuristic approach: 1. As a broadcast relation if it is a broadcast hash join. The broadcast relation will then be transformed into a reused broadcast exchange by the `ReuseExchange` rule; or 2. As a subquery duplicate if the estimated benefit of partition table scan being saved is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise 3. As a bypassed condition (`true`). ### Why are the changes needed? This is an important performance feature. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT - Testing DPP by enabling / disabling the reuse broadcast results feature and / or the subquery duplication feature. - Testing DPP with reused broadcast results. - Testing the key iterators on different HashedRelation types. - Testing the packing and unpacking of the broadcast keys in a LongType. Closes #25600 from maryannxue/dpp. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-09-04 13:13:23 -07:00
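The three-way heuristic listed in the SPARK-11150 entry above reads naturally as a small decision function. Everything named here (`plan_pruning_filter`, the two cost arguments, the returned labels) is hypothetical and exists only to restate the order of the checks; how Spark actually estimates the benefit and the cost is not described in the entry.

```python
def plan_pruning_filter(is_broadcast_hash_join: bool,
                        estimated_scan_savings: float,
                        estimated_subquery_cost: float) -> str:
    """Pick how a dynamic-partition-pruning filter would be planned."""
    if is_broadcast_hash_join:
        # 1. Reuse the broadcast relation that the join builds anyway.
        return "reused_broadcast"
    if estimated_scan_savings > estimated_subquery_cost:
        # 2. Pay for a duplicated subquery because pruning saves more.
        return "subquery_duplicate"
    # 3. Not worth it: the filter becomes a bypassed condition (`true`).
    return "bypassed_true"


print(plan_pruning_filter(True, 0.0, 0.0))
print(plan_pruning_filter(False, 100.0, 10.0))
print(plan_pruning_filter(False, 5.0, 10.0))
```

The broadcast case is checked first because reusing an existing broadcast adds no extra scan, so no cost comparison is needed there.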
Ryan Blue	5ea134c354	[SPARK-28628][SQL] Implement SupportsNamespaces in V2SessionCatalog ## What changes were proposed in this pull request? This adds namespace support to V2SessionCatalog. ## How was this patch tested? WIP: will add tests for v2 session catalog namespace methods. Closes #25363 from rdblue/SPARK-28628-support-namespaces-in-v2-session-catalog. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2019-09-03 13:13:27 -07:00
Xianjin YE	d5688dc732	[SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table ## What changes were proposed in this pull request? Datasource tables have supported partitioned tables for a long time. This commit adds the ability to translate InsertIntoTable(HiveTableRelation) into a datasource table insertion. ## How was this patch tested? Existing tests with some modifications Closes #25306 from advancedxy/SPARK-28573. Authored-by: Xianjin YE <advancedxy@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-09-03 13:40:06 +08:00
HyukjinKwon	bd3915e356	Revert "[SPARK-28612][SQL] Add DataFrameWriterV2 API" This reverts commit `3821d75b83`.	2019-09-02 12:47:14 +09:00
Sean Owen	eb037a8180	[SPARK-28855][CORE][ML][SQL][STREAMING] Remove outdated usages of Experimental, Evolving annotations ### What changes were proposed in this pull request? The Experimental and Evolving annotations are both (like Unstable) used to express that an API may change. However there are many things in the code that have been marked that way since even Spark 1.x. Per the dev thread, anything introduced at or before Spark 2.3.0 is pretty much 'stable' in that it would not change without a deprecation cycle. Therefore I'd like to remove most of these annotations. And, remove the `:: Experimental ::` scaladoc tag too. And likewise for Python, R. The changes below can be summarized as: - Generally, anything introduced at or before Spark 2.3.0 is no longer marked as Evolving or Experimental - Obviously experimental items like DSv2, Barrier mode, ExperimentalMethods are untouched - I _did_ unmark a few MLlib classes introduced in 2.4, as I am quite confident they're not going to change (e.g. KolmogorovSmirnovTest, PowerIterationClustering) It's a big change to review, so I'd suggest scanning the list of _files_ changed to see if any area seems like it should remain partly experimental and examine those. ### Why are the changes needed? Many of these annotations are incorrect; the APIs are de facto stable. Leaving them also makes legitimate usages of the annotations less meaningful. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #25558 from srowen/SPARK-28855. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-09-01 10:15:00 -05:00
Ryan Blue	3821d75b83	[SPARK-28612][SQL] Add DataFrameWriterV2 API ## What changes were proposed in this pull request? This adds a new write API as proposed in the [SPIP to standardize logical plans](https://issues.apache.org/jira/browse/SPARK-23521). This new API: * Uses clear verbs to execute writes, like `append`, `overwrite`, `create`, and `replace` that correspond to the new logical plans. * Only creates v2 logical plans so the behavior is always consistent. * Does not allow table configuration options for operations that cannot change table configuration. For example, `partitionedBy` can only be called when the writer executes `create` or `replace`. Here are a few example uses of the new API: ```scala df.writeTo("catalog.db.table").append() df.writeTo("catalog.db.table").overwrite($"date" === "2019-06-01") df.writeTo("catalog.db.table").overwritePartitions() df.writeTo("catalog.db.table").asParquet.create() df.writeTo("catalog.db.table").partitionedBy(days($"ts")).createOrReplace() df.writeTo("catalog.db.table").using("abc").replace() ``` ## How was this patch tested? Added `DataFrameWriterV2Suite` that tests the new write API. Existing tests for v2 plans. Closes #25354 from rdblue/SPARK-28612-add-data-frame-writer-v2. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2019-08-31 21:28:20 -07:00
HyukjinKwon	7cc0f0e9a7	[SPARK-28894][SQL][TESTS] Add a clue to make it easier to debug via Jenkins's test results ### What changes were proposed in this pull request? See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109834/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/ ![Screen Shot 2019-08-28 at 4 08 58 PM](https://user-images.githubusercontent.com/6477701/63833484-2a23ea00-c9ae-11e9-91a1-0859cb183fea.png) ```xml <?xml version="1.0" encoding="UTF-8"?> <testsuite hostname="C02Y52ZLJGH5" name="org.apache.spark.sql.SQLQueryTestSuite" tests="3" errors="0" failures="0" skipped="0" time="14.475"> ... <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Scala UDF" time="6.703"> </testcase> <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Regular Python UDF" time="4.442"> </testcase> <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Scalar Pandas UDF" time="3.33"> </testcase> <system-out/> <system-err/> </testsuite> ``` Root cause seems a bug in SBT - it truncates the test name based on the last dot. https://github.com/sbt/sbt/issues/2949 https://github.com/sbt/sbt/blob/v0.13.18/testing/src/main/scala/sbt/JUnitXmlTestsListener.scala#L71-L79 I tried to find a better way but couldn't find. Therefore, this PR proposes a workaround by appending the test file name into the assert log: ```diff [info] - inner-join.sql * FAILED * (4 seconds, 306 milliseconds) + [info] inner-join.sql [info] Expected "1 a [info] 1 a [info] 1 b [info] 1[]", but got "1 a [info] 1 a [info] 1 b [info] 1[ b]" Result did not match for query #6 [info] SELECT tb.* FROM ta INNER JOIN tb ON ta.a = tb.a AND ta.tag = tb.tag (SQLQueryTestSuite.scala:377) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528) ``` It will at least prevent us to search full logs to identify which test file is failed by clicking filed test. 
Note that this PR does not fully fix the issue; it only fixes the logs of failed tests. ### Why are the changes needed? To make Jenkins logs easier to debug. Otherwise, we have to open the full logs and search for which test failed. ### Does this PR introduce any user-facing change? It will print out the file name of failed tests in Jenkins' test reports. ### How was this patch tested? Manually tested, but Jenkins tests are required in this PR. Now it at least shows which file it is: ![Screen Shot 2019-08-30 at 10 16 32 PM](https://user-images.githubusercontent.com/6477701/64023705-de22a200-cb73-11e9-8806-2e98ad35adef.png) Closes #25630 from HyukjinKwon/SPARK-28894-1. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-30 15:10:40 -07:00
younggyu chun	3b07a4eb28	[SPARK-27931][SQL] Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type ## What changes were proposed in this pull request? This PR aims to add "true", "yes", "1", "false", "no", "0", and unique prefixes as input for the boolean data type and ignore input whitespace. Please see the following what string representations are using for the boolean type in other databases. https://www.postgresql.org/docs/devel/datatype-boolean.html https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html ## How was this patch tested? Added new tests to CastSuite. Closes #25458 from younggyuchun/SPARK-27931. Authored-by: younggyu chun <younggyuchun@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-30 14:18:13 -07:00
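The input rule described in the commit above can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, not Spark's `Cast` implementation: the function name `parse_boolean`, the case-insensitive matching, and the choice to return `None` for unrecognized input are all assumptions made for the example.

```python
from typing import Optional

# Literals named in the commit title; everything else here is illustrative.
TRUE_WORDS = ("true", "yes", "1")
FALSE_WORDS = ("false", "no", "0")


def parse_boolean(text: str) -> Optional[bool]:
    """Hypothetical helper: trim the input, then accept any non-empty
    prefix that matches only true-like or only false-like literals."""
    s = text.strip().lower()  # case-insensitivity is an assumption
    if not s:
        return None
    true_hit = any(word.startswith(s) for word in TRUE_WORDS)
    false_hit = any(word.startswith(s) for word in FALSE_WORDS)
    if true_hit and not false_hit:
        return True
    if false_hit and not true_hit:
        return False
    return None  # unrecognized input (or an ambiguous prefix)
```

With these literals every prefix is unambiguous, because the true-like and false-like words never share a first character; the ambiguity branch only matters if more literals are added.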
Burak Yavuz	827969399b	[SPARK-28668][SQL] Support V2SessionCatalog for ALTER TABLE ### What changes were proposed in this pull request? Adds support for the V2SessionCatalog for ALTER TABLE statements. Implementation changes are ~50 loc. The rest is just test refactoring. ### Why are the changes needed? To allow V2 DataSources to plug in through a configurable plugin interface without requiring the explicit use of catalog identifiers, and leverage ALTER TABLE statements. ### How was this patch tested? By re-using existing tests in DataSourceV2SQLSuite. Closes #25502 from brkyvz/alterV3. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-30 14:16:47 +08:00
Wenchen Fan	f8f7c52f12	[SPARK-28899][SQL][TEST] merge the testing in-memory v2 catalogs from catalyst and core ### What changes were proposed in this pull request? There are 2 in-memory `TableCatalog` and `Table` implementations for testing, in sql/catalyst and sql/core. This PR merges them. After merging, there are 3 classes: 1. `InMemoryTable` 2. `InMemoryTableCatalog` 3. `StagingInMemoryTableCatalog` For better maintainability, these 3 classes are put in 3 different files. ### Why are the changes needed? reduce duplicated code ### Does this PR introduce any user-facing change? no ### How was this patch tested? N/A Closes #25610 from cloud-fan/dsv2-test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Ryan Blue <blue@apache.org>	2019-08-29 12:56:19 -07:00
Matt Hawes	137b20b964	[SPARK-28818][SQL] Respect source column nullability in the arrays created by `freqItems()` ### What changes were proposed in this pull request? This PR replaces the hard-coded non-nullability of the array elements returned by `freqItems()` with a nullability that reflects the original schema. Essentially [the functional change](https://github.com/apache/spark/pull/25575/files#diff-bf59bb9f3dc351f5bf6624e5edd2dcf4R122) to the schema generation is: ``` StructField(name + "_freqItems", ArrayType(dataType, false)) ``` Becomes: ``` StructField(name + "_freqItems", ArrayType(dataType, originalField.nullable)) ``` Respecting the original nullability prevents issues when Spark depends on `ArrayType`'s `containsNull` being accurate. The example that uncovered this is calling `collect()` on the dataframe (see [ticket](https://issues.apache.org/jira/browse/SPARK-28818) for full repro). Though it's likely that there a several places where this could cause a problem. I've also refactored a small amount of the surrounding code to remove some unnecessary steps and group together related operations. ### Why are the changes needed? I think it's pretty clear why this change is needed. It fixes a bug that currently prevents users from calling `df.freqItems.collect()` along with potentially causing other, as yet unknown, issues. ### Does this PR introduce any user-facing change? Nullability of columns when calling freqItems on them is now respected after the change. ### How was this patch tested? I added a test that specifically tests the carry-through of the nullability as well as explicitly calling `collect()` to catch the exact regression that was observed. I also ran the test against the old version of the code and it fails as expected. Closes #25575 from MGHawes/mhawes/SPARK-28818. Authored-by: Matt Hawes <mhawes@palantir.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-29 10:49:10 +09:00
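The schema change quoted in the commit above can be modeled without Spark. The sketch below uses small stand-in dataclasses named after Spark's `StructField` and `ArrayType`; they are simplified assumptions for illustration, and `freq_items_field` is a hypothetical helper, not a Spark API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayType:
    # Simplified stand-in for Spark's ArrayType(elementType, containsNull)
    element_type: object
    contains_null: bool


@dataclass(frozen=True)
class StructField:
    # Simplified stand-in for Spark's StructField(name, dataType, nullable)
    name: str
    data_type: object
    nullable: bool = True


def freq_items_field(original: StructField) -> StructField:
    """Build the '<name>_freqItems' field, carrying the source column's
    nullability into the array's contains_null flag instead of hard-coding False."""
    return StructField(
        original.name + "_freqItems",
        ArrayType(original.data_type, original.nullable),
    )
```

For a nullable source column the resulting array type reports `contains_null=True`, which is the property the commit says downstream code such as `collect()` relies on.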
hemanth meka	6252c54e39	[SPARK-23519][SQL] create view should work from query with duplicate output columns What changes were proposed in this pull request? Moving the call to checkColumnNameDuplication out of generateViewProperties. This way we can choose whether checkColumnNameDuplication is performed on the analyzed or the aliased plan without having to pass an additional argument (aliasedPlan) to generateViewProperties. Before this PR, the column name duplication check was performed on the query output of the SQL below (c1, c1); this PR makes it check the user-provided schema of the view definition (c1, c2). Why are the changes needed? The changes fix the SPARK-23519 bug. The queries below would cause an exception. This PR fixes them and also adds a test case. `CREATE TABLE t23519 AS SELECT 1 AS c1; CREATE VIEW v23519 (c1, c2) AS SELECT c1, c1 FROM t23519` Does this PR introduce any user-facing change? No How was this patch tested? new unit test added in SQLViewSuite Closes #25570 from hem1891/SPARK-23519. Lead-authored-by: hemanth meka <hmeka@tibco.com> Co-authored-by: hem1891 <hem1891@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-28 12:11:10 +08:00
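The behavioral difference described in the commit above comes down to which list of names the duplication check sees. The following is a rough sketch in Python, assuming case-insensitive comparison; `check_column_name_duplication` and `view_columns` are hypothetical stand-ins for the Spark helpers mentioned in the message, not their real signatures.

```python
from collections import Counter


def check_column_name_duplication(names, case_sensitive=False):
    """Raise if any column name occurs more than once (illustrative only)."""
    keys = list(names) if case_sensitive else [n.lower() for n in names]
    duplicates = sorted(k for k, count in Counter(keys).items() if count > 1)
    if duplicates:
        raise ValueError("Found duplicate column(s): " + ", ".join(duplicates))


def view_columns(query_output, user_columns=None):
    """Check the final view schema: the user-provided column list when one
    is given, otherwise the query's own output names."""
    final = list(user_columns) if user_columns else list(query_output)
    check_column_name_duplication(final)
    return final
```

Under this model, `view_columns(["c1", "c1"], ["c1", "c2"])` succeeds because the check runs on the aliased names, while `view_columns(["c1", "c1"])` still raises.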
Wenchen Fan	90b10b4f7a	[HOT-FIX] fix compilation This is caused by 2 PRs that were merged at the same time: `cb06209fc9` `2b24a71fec` Closes #25597 from cloud-fan/hot-fix. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 23:30:44 +08:00
Gengliang Wang	2b24a71fec	[SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion ### What changes were proposed in this pull request? Introduce ANSI store assignment policy for table insertion. With ANSI policy, Spark performs the type coercion of table insertion as per ANSI SQL. ### Why are the changes needed? In Spark version 2.4 and earlier, when inserting into a table, Spark will cast the data type of input query to the data type of target table by coercion. This can be super confusing, e.g. users make a mistake and write string values to an int column. In data source V2, by default, only upcasting is allowed when inserting data into a table. E.g. int -> long and int -> string are allowed, while decimal -> double or long -> int are not allowed. The rules of UpCast was originally created for Dataset type coercion. They are quite strict and different from the behavior of all existing popular DBMS. This is breaking change. It is possible that existing queries are broken after 3.0 releases. Following ANSI SQL standard makes Spark consistent with the table insertion behaviors of popular DBMS like PostgreSQL/Oracle/Mysql. ### Does this PR introduce any user-facing change? A new optional mode for table insertion. ### How was this patch tested? Unit test Closes #25581 from gengliangwang/ANSImode. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 22:13:23 +08:00
WeichenXu	7f605f5559	[SPARK-28621][SQL] Make spark.sql.crossJoin.enabled default value true ### What changes were proposed in this pull request? Make `spark.sql.crossJoin.enabled` default value true ### Why are the changes needed? For implicit cross join, we can set up a watchdog to cancel it if running for a long time. When "spark.sql.crossJoin.enabled" is false, because `CheckCartesianProducts` is implemented in logical plan stage, it may generate some mismatching error which may confuse end user: * it's done in logical phase, so we may fail queries that can be executed via broadcast join, which is very fast. * if we move the check to the physical phase, then a query may success at the beginning, and begin to fail when the table size gets larger (other people insert data to the table). This can be quite confusing. * the CROSS JOIN syntax doesn't work well if join reorder happens. * some non-equi-join will generate plan using cartesian product, but `CheckCartesianProducts` do not detect it and raise error. So that in order to address this in simpler way, we can turn off showing this cross-join error by default. For reference, I list some cases raising mismatching error here: Providing: ``` spark.range(2).createOrReplaceTempView("sm1") // can be broadcast spark.range(50000000).createOrReplaceTempView("bg1") // cannot be broadcast spark.range(60000000).createOrReplaceTempView("bg2") // cannot be broadcast ``` 1) Some join could be convert to broadcast nested loop join, but CheckCartesianProducts raise error. e.g. ``` select sm1.id, bg1.id from bg1 join sm1 where sm1.id < bg1.id ``` 2) Some join will run by CartesianJoin but CheckCartesianProducts DO NOT raise error. e.g. ``` select bg1.id, bg2.id from bg1 join bg2 where bg1.id < bg2.id ``` ### Does this PR introduce any user-facing change? ### How was this patch tested? Closes #25520 from WeichenXu123/SPARK-28621. 
Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 21:53:37 +08:00
Wenchen Fan	cb06209fc9	[SPARK-28747][SQL] merge the two data source v2 fallback configs ## What changes were proposed in this pull request? Currently we have 2 configs to specify which v2 sources should fallback to v1 code path. One config for read path, and one config for write path. However, I found it's awkward to work with these 2 configs: 1. for `CREATE TABLE USING format`, should this be read path or write path? 2. for `V2SessionCatalog.loadTable`, we need to return `UnresolvedTable` if it's a DS v1 or we need to fallback to v1 code path. However, at that time, we don't know if the returned table will be used for read or write. We don't have any new features or perf improvement in file source v2. The fallback API is just a safeguard if we have bugs in v2 implementations. There are not many benefits to support falling back to v1 for read and write path separately. This PR proposes to merge these 2 configs into one. ## How was this patch tested? existing tests Closes #25465 from cloud-fan/merge-conf. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 20:47:24 +08:00
Burak Yavuz	e31aec9be4	[SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog ### What changes were proposed in this pull request? This PR adds support for INSERT INTO through both the SQL and DataFrameWriter APIs through the V2SessionCatalog. ### Why are the changes needed? This will allow V2 tables to be plugged in through the V2SessionCatalog, and be used seamlessly with existing APIs. ### Does this PR introduce any user-facing change? No behavior changes. ### How was this patch tested? Pulled out a lot of tests so that they can be shared across the DataFrameWriter and SQL code paths. Closes #25507 from brkyvz/insertSesh. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 12:59:53 +08:00
Yuming Wang	6e12b585a9	[SPARK-28527][SQL][TEST] Re-run all the tests in SQLQueryTestSuite via Thrift Server ### What changes were proposed in this pull request? This PR builds a test framework that directly re-runs all the tests in `SQLQueryTestSuite` via Thrift Server. It differs slightly from `SQLQueryTestSuite`: 1. It cannot support [UDF testing](`44e607e921/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (L293-L297)`). 2. It cannot support the `DESC` and `SHOW` commands because `SQLQueryTestSuite` [formatted the output](`1882912cca/sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala (L38-L50)`.). While building this framework, we found two bugs: [SPARK-28624](https://issues.apache.org/jira/browse/SPARK-28624): `make_date` is inconsistent when reading from table [SPARK-28611](https://issues.apache.org/jira/browse/SPARK-28611): Histogram's height is different. We found two features that ThriftServer cannot support: [SPARK-28636](https://issues.apache.org/jira/browse/SPARK-28636): ThriftServer can not support decimal type with negative scale [SPARK-28637](https://issues.apache.org/jira/browse/SPARK-28637): ThriftServer can not support interval type. We also found two inconsistent behaviors: [SPARK-28620](https://issues.apache.org/jira/browse/SPARK-28620): Double type returned for float type in Beeline/JDBC [SPARK-28619](https://issues.apache.org/jira/browse/SPARK-28619): The golden result file is different when tested by `bin/spark-sql` ### Why are the changes needed? Improve the overall test coverage for Thrift Server. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #25567 from wangyum/SPARK-28527. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-26 22:39:57 +09:00
Dilip Biswal	c61270fd74	[SPARK-27395][SQL] Improve EXPLAIN command ## What changes were proposed in this pull request? This PR aims at improving the way physical plans are explained in Spark. Currently, the explain output for a physical plan may look very cluttered, and each operator's string representation can be very wide and wrap around in the display, making it a little hard to follow. This especially happens when explaining a query that 1) operates on wide tables or 2) has complex expressions, etc. This PR attempts to split the output into two sections. In the header section, we display the basic operator tree with a number associated with each operator. In this section, we strictly control what we output for each operator. In the footer section, each operator is verbosely displayed. Based on the feedback from Maryann, the uncorrelated subqueries (SubqueryExecs) are not included in the main plan. They are printed separately after the main plan and can be correlated by the originating expression id from its parent plan. To illustrate, here is a simple plan displayed in the old vs. the new way. 
Example query1 : ``` EXPLAIN SELECT key, Max(val) FROM explain_temp1 WHERE key > 0 GROUP BY key HAVING max(val) > 0 ``` Old : ``` (2) Project [key#2, max(val)#15] +- (2) Filter (isnotnull(max(val#3)#18) AND (max(val#3)#18 > 0)) +- (2) HashAggregate(keys=[key#2], functions=[max(val#3)], output=[key#2, max(val)#15, max(val#3)#18]) +- Exchange hashpartitioning(key#2, 200) +- (1) HashAggregate(keys=[key#2], functions=[partial_max(val#3)], output=[key#2, max#21]) +- (1) Project [key#2, val#3] +- (1) Filter (isnotnull(key#2) AND (key#2 > 0)) +- (1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), (key#2 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), GreaterThan(key,0)], ReadSchema: struct<key:int,val:int> ``` New : ``` Project (8) +- Filter (7) +- HashAggregate (6) +- Exchange (5) +- HashAggregate (4) +- Project (3) +- Filter (2) +- Scan parquet default.explain_temp1 (1) (1) Scan parquet default.explain_temp1 [codegen id : 1] Output: [key#2, val#3] (2) Filter [codegen id : 1] Input : [key#2, val#3] Condition : (isnotnull(key#2) AND (key#2 > 0)) (3) Project [codegen id : 1] Output : [key#2, val#3] Input : [key#2, val#3] (4) HashAggregate [codegen id : 1] Input: [key#2, val#3] (5) Exchange Input: [key#2, max#11] (6) HashAggregate [codegen id : 2] Input: [key#2, max#11] (7) Filter [codegen id : 2] Input : [key#2, max(val)#5, max(val#3)#8] Condition : (isnotnull(max(val#3)#8) AND (max(val#3)#8 > 0)) (8) Project [codegen id : 2] Output : [key#2, max(val)#5] Input : [key#2, max(val)#5, max(val#3)#8] ``` Example Query2 (subquery): ``` SELECT FROM explain_temp1 WHERE KEY = (SELECT Max(KEY) FROM explain_temp2 WHERE KEY = (SELECT Max(KEY) FROM explain_temp3 WHERE val > 0) AND val = 2) AND val > 3 ``` Old: ``` (1) Project [key#2, val#3] +- (1) Filter (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#39)) 
AND (val#3 > 3)) : +- Subquery scalar-subquery#39 : +- (2) HashAggregate(keys=[], functions=[max(KEY#26)], output=[max(KEY)#45]) : +- Exchange SinglePartition : +- (1) HashAggregate(keys=[], functions=[partial_max(KEY#26)], output=[max#47]) : +- (1) Project [key#26] : +- (1) Filter (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#38)) AND (val#27 = 2)) : : +- Subquery scalar-subquery#38 : : +- (2) HashAggregate(keys=[], functions=[max(KEY#28)], output=[max(KEY)#43]) : : +- Exchange SinglePartition : : +- (1) HashAggregate(keys=[], functions=[partial_max(KEY#28)], output=[max#49]) : : +- (1) Project [key#28] : : +- (1) Filter (isnotnull(val#29) AND (val#29 > 0)) : : +- (1) FileScan parquet default.explain_temp3[key#28,val#29] Batched: true, DataFilters: [isnotnull(val#29), (val#29 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp3], PartitionFilters: [], PushedFilters: [IsNotNull(val), GreaterThan(val,0)], ReadSchema: struct<key:int,val:int> : +- (1) FileScan parquet default.explain_temp2[key#26,val#27] Batched: true, DataFilters: [isnotnull(key#26), isnotnull(val#27), (val#27 = 2)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp2], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), EqualTo(val,2)], ReadSchema: struct<key:int,val:int> +- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), isnotnull(val#3), (val#3 > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), GreaterThan(val,3)], ReadSchema: struct<key:int,val:int> ``` New: ``` Project (3) +- Filter (2) +- Scan parquet default.explain_temp1 (1) (1) Scan parquet default.explain_temp1 [codegen id : 1] Output: [key#2, val#3] (2) Filter [codegen id : 1] Input : [key#2, val#3] Condition : (((isnotnull(KEY#2) AND 
isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#23)) AND (val#3 > 3)) (3) Project [codegen id : 1] Output : [key#2, val#3] Input : [key#2, val#3] ===== Subqueries ===== Subquery:1 Hosting operator id = 2 Hosting Expression = Subquery scalar-subquery#23 HashAggregate (9) +- Exchange (8) +- HashAggregate (7) +- Project (6) +- Filter (5) +- Scan parquet default.explain_temp2 (4) (4) Scan parquet default.explain_temp2 [codegen id : 1] Output: [key#26, val#27] (5) Filter [codegen id : 1] Input : [key#26, val#27] Condition : (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#22)) AND (val#27 = 2)) (6) Project [codegen id : 1] Output : [key#26] Input : [key#26, val#27] (7) HashAggregate [codegen id : 1] Input: [key#26] (8) Exchange Input: [max#35] (9) HashAggregate [codegen id : 2] Input: [max#35] Subquery:2 Hosting operator id = 5 Hosting Expression = Subquery scalar-subquery#22 HashAggregate (15) +- Exchange (14) +- HashAggregate (13) +- Project (12) +- Filter (11) +- Scan parquet default.explain_temp3 (10) (10) Scan parquet default.explain_temp3 [codegen id : 1] Output: [key#28, val#29] (11) Filter [codegen id : 1] Input : [key#28, val#29] Condition : (isnotnull(val#29) AND (val#29 > 0)) (12) Project [codegen id : 1] Output : [key#28] Input : [key#28, val#29] (13) HashAggregate [codegen id : 1] Input: [key#28] (14) Exchange Input: [max#37] (15) HashAggregate [codegen id : 2] Input: [max#37] ``` Note: I opened this PR as a WIP to start getting feedback. I will be on vacation starting tomorrow would not be able to immediately incorporate the feedback. I will start to work on them as soon as i can. Also, currently this PR provides a basic infrastructure for explain enhancement. The details about individual operators will be implemented in follow-up prs ## How was this patch tested? Added a new test `explain.sql` that tests basic scenarios. Need to add more tests. Closes #24759 from dilipbiswal/explain_feature. 
Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-26 20:37:13 +08:00
Yuming Wang	c353a84d1a	[SPARK-28642][SQL][TEST][FOLLOW-UP] Test spark.sql.redaction.options.regex with and without default values ### What changes were proposed in this pull request? Test `spark.sql.redaction.options.regex` with and without default values. ### Why are the changes needed? Normally, we do not rely on the default value of `spark.sql.redaction.options.regex`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #25579 from wangyum/SPARK-28642-f1. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-08-25 23:12:16 -07:00
Terry Kim	a3328cdc0a	[SPARK-28238][SQL][FOLLOW-UP] Clean up attributes for Datasource v2 DESCRIBE TABLE ### What changes were proposed in this pull request? 1. Fix the physical plan (`DescribeTableExec`) to have the same output attributes as the corresponding logical plan. 2. Remove `output` in statements since they are unresolved plans. ### Why are the changes needed? Correctness of how output attributes should work. ### Does this PR introduce any user-facing change? NO ### How was this patch tested? Existing tests Closes #25568 from imback82/describe_table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-26 13:39:36 +08:00
Yuming Wang	4b16cf11b3	[SPARK-27988][SQL][TEST] Port AGGREGATES.sql [Part 3] ## What changes were proposed in this pull request? This PR is to port AGGREGATES.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/aggregates.sql#L352-L605 The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/aggregates.out#L986-L1613 When porting the test cases, found seven PostgreSQL specific features that do not exist in Spark SQL: [SPARK-27974](https://issues.apache.org/jira/browse/SPARK-27974): Add built-in Aggregate Function: array_agg [SPARK-27978](https://issues.apache.org/jira/browse/SPARK-27978): Add built-in Aggregate Functions: string_agg [SPARK-27986](https://issues.apache.org/jira/browse/SPARK-27986): Support Aggregate Expressions with filter [SPARK-27987](https://issues.apache.org/jira/browse/SPARK-27987): Support POSIX Regular Expressions [SPARK-28682](https://issues.apache.org/jira/browse/SPARK-28682): ANSI SQL: Collation Support [SPARK-28768](https://issues.apache.org/jira/browse/SPARK-28768): Implement more text pattern operators [SPARK-28865](https://issues.apache.org/jira/browse/SPARK-28865): Table inheritance ## How was this patch tested? N/A Closes #24829 from wangyum/SPARK-27988. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-25 23:34:59 +09:00
Xiao Li	07c4b9bd1f	Revert "[SPARK-25474][SQL] Support `spark.sql.statistics.fallBackToHdfs` in data source tables" This reverts commit `485ae6d181`. Closes #25563 from gatorsmile/revert. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-23 07:41:39 -07:00
Ali Afroozeh	1472e664ba	[SPARK-28716][SQL] Add id to Exchange and Subquery's stringArgs method for easier identifying their reuses in query plans ## What changes were proposed in this pull request? Add id to Exchange and Subquery's stringArgs method for easier identifying their reuses in query plans, for example: ``` ReusedExchange d_date_sk#827, BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))) [id=#2710] ``` Where `2710` is the id of the reused exchange. ## How was this patch tested? Passes existing tests Closes #25434 from dbaliafroozeh/ImplementStringArgsExchangeSubqueryExec. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: herman <herman@databricks.com>	2019-08-23 13:29:32 +02:00
Ali Afroozeh	aef7ca1f0b	[SPARK-28836][SQL] Remove the canonicalize(attributes) method from PlanExpression ### What changes were proposed in this pull request? This PR removes the `canonicalize(attrs: AttributeSeq)` from `PlanExpression` and taking care of normalizing expressions in `QueryPlan`. ### Why are the changes needed? `Expression` has already a `canonicalized` method and having the `canonicalize` method in `PlanExpression` is confusing. ### Does this PR introduce any user-facing change? Removes the `canonicalize` plan from `PlanExpression`. Also renames the `normalizeExprId` to `normalizeExpressions` in query plan. ### How was this patch tested? This PR is a refactoring and passes the existing tests Closes #25534 from dbaliafroozeh/ImproveCanonicalizeAPI. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: herman <herman@databricks.com>	2019-08-23 13:26:58 +02:00
terryk	98e1a4cea4	[SPARK-28319][SQL] Implement SHOW TABLES for Data Source V2 Tables ## What changes were proposed in this pull request? Implements the SHOW TABLES logical and physical plans for data source v2 tables. ## How was this patch tested? Added unit tests to `DataSourceV2SQLSuite`. Closes #25247 from imback82/dsv2_show_tables. Lead-authored-by: terryk <yuminkim@gmail.com> Co-authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-23 14:20:25 +08:00
Ali Afroozeh	9976b876f1	[SPARK-28835][SQL][TEST] Add TPCDSSchema trait ### What changes were proposed in this pull request? This PR extracts the schema information of TPCDS tables into a separate class called `TPCDSSchema` which can be reused for other testing purposes ### How was this patch tested? This PR is only a refactoring for tests and passes existing tests Closes #25535 from dbaliafroozeh/IntroduceTPCDSSchema. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-22 23:18:46 -07:00
Jungtaek Lim (HeartSaVioR)	406c5331ff	[SPARK-28025][SS] Fix FileContextBasedCheckpointFileManager leaking crc files ### What changes were proposed in this pull request? This PR fixes the leak of crc files from CheckpointFileManager when FileContextBasedCheckpointFileManager is being used. Spark hits the Hadoop bug, [HADOOP-16255](https://issues.apache.org/jira/browse/HADOOP-16255) which seems to be a long-standing issue. This is because there are two `renameInternal` methods: ``` public void renameInternal(Path src, Path dst) public void renameInternal(final Path src, final Path dst, boolean overwrite) ``` which should both be overridden to handle all cases, but ChecksumFs only overrides the method with 2 params, so when the latter is called, FilterFs.renameInternal(...) is called instead, and it does the rename with RawLocalFs as the underlying filesystem. The bug is related to FileContext, so FileSystemBasedCheckpointFileManager is not affected. [SPARK-17475](https://issues.apache.org/jira/browse/SPARK-17475) took a workaround for this bug, but [SPARK-23966](https://issues.apache.org/jira/browse/SPARK-23966) seems to have brought a regression. This PR deletes the crc file as "best-effort" when renaming, as failing to delete the crc file is not critical enough to fail the task. ### Why are the changes needed? This PR prevents crc files from being left behind even after batches are purged. Too many files in the same directory often hurt performance, and each crc file occupies more space than its own size, so they can occupy a nontrivial amount of space when batches go up to 100000+. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Some unit tests are modified to check leakage of crc files. Closes #25488 from HeartSaVioR/SPARK-28025. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2019-08-22 23:10:16 -07:00
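The override gap described in the commit above can be reproduced with a toy model. This is a simplified sketch in Python, not Hadoop or Spark code: the class names only mirror the ones in the message, Python has no overloading so the three-parameter variant is modeled as a separately named method, and the `.name.crc` side-file naming and the `demo` helper are assumptions for illustration.

```python
def crc_name(path):
    # Assumed naming convention for the checksum side file.
    return "." + path + ".crc"


class RawFs:
    """Toy filesystem: just a set of file names."""
    def __init__(self):
        self.files = set()

    def rename(self, src, dst):
        self.files.remove(src)
        self.files.add(dst)


class FilterFs:
    """Base class: both rename variants delegate straight to the raw fs."""
    def __init__(self, raw):
        self.raw = raw

    def rename_internal(self, src, dst):
        self.raw.rename(src, dst)

    def rename_internal_overwrite(self, src, dst, overwrite):
        self.raw.rename(src, dst)


class ChecksumFs(FilterFs):
    """Overrides only the two-argument variant, so only that path moves the crc file."""
    def rename_internal(self, src, dst):
        self.raw.rename(src, dst)
        if crc_name(src) in self.raw.files:
            self.raw.rename(crc_name(src), crc_name(dst))


def demo(best_effort_cleanup):
    raw = RawFs()
    raw.files |= {"tmp", crc_name("tmp")}
    fs = ChecksumFs(raw)
    # The three-argument variant falls through to FilterFs and skips the crc file.
    fs.rename_internal_overwrite("tmp", "batch-1", True)
    if best_effort_cleanup:
        # Workaround in the spirit of the commit: drop the orphaned crc file,
        # ignoring the case where it is already gone.
        raw.files.discard(crc_name("tmp"))
    return sorted(raw.files)
```

Calling `demo(False)` leaves the orphaned `.tmp.crc` next to `batch-1`, while `demo(True)` leaves only `batch-1`, which is the leak and the best-effort cleanup the message describes.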


6211 commits