Commit graph

8528 commits

Author SHA1 Message Date
Kent Yao 9a46702791 [SPARK-29554][SQL] Add version SQL function
### What changes were proposed in this pull request?

```
hive> select version();
OK
3.1.1 rf4e0529634b6231a0072295da48af466cf2f10b7
Time taken: 2.113 seconds, Fetched: 1 row(s)
```
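
For comparison, a minimal spark-shell sketch of calling the new function (hedged: assuming the Spark output follows the same "version plus revision" shape shown above for Hive):

```scala
// spark-shell sketch; `spark: SparkSession` is predefined there.
spark.sql("SELECT version()").show(truncate = false)
```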

### Why are the changes needed?

This follows Hive's behavior, and it is useful for debugging and development.

### Does this PR introduce any user-facing change?

Yes, it adds a new miscellaneous SQL function, `version()`.

### How was this patch tested?

Added a unit test.

Closes #26209 from yaooqinn/SPARK-29554.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-25 23:02:11 -07:00
Dongjoon Hyun 5bdc58bf8a [SPARK-27653][SQL][FOLLOWUP] Fix since version of min_by/max_by
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/24557 to fix `since` version.

### Why are the changes needed?

This is found during 3.0.0-preview preparation.
The version will be exposed to our SQL document like the following. We had better fix this.
- https://spark.apache.org/docs/latest/api/sql/#array_min

### Does this PR introduce any user-facing change?

Yes. It's exposed via the `DESC FUNCTION EXTENDED` SQL command and the SQL doc, but this is new in 3.0.0.

### How was this patch tested?

Manual.
```
spark-sql> DESC FUNCTION EXTENDED min_by;
Function: min_by
Class: org.apache.spark.sql.catalyst.expressions.aggregate.MinBy
Usage: min_by(x, y) - Returns the value of `x` associated with the minimum value of `y`.
Extended Usage:
    Examples:
      > SELECT min_by(x, y) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y);
       a

    Since: 3.0.0
```

Closes #26264 from dongjoon-hyun/SPARK-27653.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-25 21:17:17 -07:00
Liang-Chi Hsieh 68dca9a095 [SPARK-29527][SQL] SHOW CREATE TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add ShowCreateTableStatement and make SHOW CREATE TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
SHOW CREATE TABLE t // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?

Yes. When running SHOW CREATE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26184 from viirya/SPARK-29527.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-25 23:09:08 +08:00
Kent Yao 0cf4f07c66 [SPARK-29545][SQL] Add support for bit_xor aggregate function
### What changes were proposed in this pull request?

bit_xor(expr) - Returns the bitwise XOR of all non-null input values, or null if none
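
A hedged usage sketch (spark-shell, where `spark` is predefined), just to illustrate the semantics described above; the inline-table syntax is standard Spark SQL:

```scala
// 3 ^ 5 ^ 7 = 1, so the aggregate should return a single row with value 1.
spark.sql("SELECT bit_xor(col) FROM VALUES (3), (5), (7) AS tab(col)").show()
```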

### Why are the changes needed?

As we now support `bit_and` and `bit_or`, we'd better also support the related aggregate function **bit_xor** ahead of PostgreSQL, because many other popular databases support it.

http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.1/dbreference/bit-xor-function.html

https://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_bit-or

https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Aggregate/BIT_XOR.htm?TocPath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAggregate%20Functions%7C_____10

### Does this PR introduce any user-facing change?

Yes, it adds a new bitwise aggregate function, `bit_xor`.

### How was this patch tested?

UTs added

Closes #26205 from yaooqinn/SPARK-29545.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-10-25 22:19:19 +09:00
Jungtaek Lim (HeartSaVioR) cfbdd9d293 [SPARK-29461][SQL] Measure the number of records being updated for JDBC writer
### What changes were proposed in this pull request?

This patch adds the functionality to measure records written by the JDBC writer. In practice, the value is the number of records updated by the queries, since per the JDBC spec the driver returns an update count.
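
A hedged illustration (not the PR's test) of one way to observe records written per task via the standard listener API, assuming the new metric surfaces through the task output metrics; the JDBC URL and table below are placeholders:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Print records written per finished task (spark-shell style; `spark` is predefined).
spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    Option(taskEnd.taskMetrics).foreach { m =>
      println(s"task ${taskEnd.taskInfo.taskId} wrote ${m.outputMetrics.recordsWritten} records")
    }
  }
})

// df.write.mode("append").jdbc("jdbc:postgresql://db-host/db", "target_table", props)  // placeholder write
```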

### Why are the changes needed?

Output metrics for the JDBC writer are currently missing. The value of "bytesWritten" is also missing, but it can't be measured through the JDBC API.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Unit test added.

Closes #26109 from HeartSaVioR/SPARK-29461.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-10-25 16:32:06 +09:00
Marcelo Vanzin 1474ed05fb [SPARK-29562][SQL] Speed up and slim down metric aggregation in SQL listener
First, a bit of background on the code being changed. The current code tracks
metric updates for each task, recording which metrics the task is monitoring
and the last update value.

Once a SQL execution finishes, then the metrics for all the stages are
aggregated, by building a list with all (metric ID, value) pairs collected
for all tasks in the stages related to the execution, then grouping by metric
ID, and then calculating the values shown in the UI.

That is full of inefficiencies:

- in normal operation, all tasks will be tracking and updating the same
  metrics. So recording the metric IDs per task is wasteful.
- tracking by task means we might be double-counting values if you have
  speculative tasks (as a comment in the code mentions).
- creating a list of (metric ID, value) is extremely inefficient, because now
  you have a huge map in memory storing boxed versions of the metric IDs and
  values.
- same thing for the aggregation part, where now a Seq is built with the values
  for each metric ID.

The end result is that for large queries, this code can become both really
slow, thus affecting the processing of events, and memory hungry.

The updated code changes the approach to the following:

- stages track metrics by their ID; this means the stage tracking code
  naturally groups values, making aggregation later simpler.
- each metric ID being tracked uses a long array matching the number of
  partitions of the stage; this means that it's cheap to update the value of
  the metric once a task ends.
- when aggregating, custom code just concatenates the arrays corresponding to
  the matching metric IDs; this is cheaper than the previous, boxing-heavy
  approach.

The end result is that the listener uses about half as much memory as before
for tracking metrics, since it doesn't need to track metric IDs per task.
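
An illustrative sketch of the array-per-metric idea described above (hypothetical names, not the actual listener code):

```scala
import scala.collection.mutable

// One long array per metric ID, sized by the stage's partition count.
class StageMetrics(numPartitions: Int) {
  private val valuesPerMetric = mutable.HashMap.empty[Long, Array[Long]]

  // Cheap per-task update: no boxing, no per-task map of metric IDs.
  def update(metricId: Long, partitionId: Int, value: Long): Unit = {
    val arr = valuesPerMetric.getOrElseUpdate(metricId, new Array[Long](numPartitions))
    arr(partitionId) = value
  }

  def values(metricId: Long): Array[Long] =
    valuesPerMetric.getOrElse(metricId, Array.emptyLongArray)
}

// Aggregation just concatenates the arrays for a metric ID across stages.
def aggregate(stages: Seq[StageMetrics], metricId: Long): Array[Long] =
  stages.iterator.flatMap(_.values(metricId)).toArray
```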

I captured heap dumps with the old and the new code during metric aggregation
in the listener, for an execution with 3 stages, 100k tasks per stage, 50
metrics updated per task. The dumps contained just reachable memory - so data
kept by the listener plus the variables in the aggregateMetrics() method.

With the old code, the thread doing aggregation references >1G of memory - and
that does not include temporary data created by the "groupBy" transformation
(for which the intermediate state is not referenced in the aggregation method).
The same thread with the new code references ~250M of memory. The old code uses
about ~250M to track all the metric values for that execution, while the new
code uses about ~130M. (Note the per-thread numbers include the amount used to
track the metrics - so, e.g., in the old case, aggregation was referencing
about ~750M of temporary data.)

I'm also including a small benchmark (based on the Benchmark class) so that we
can measure how much changes to this code affect performance. The benchmark
contains some extra code to measure things the normal Benchmark class does not,
given that the code under test does not really map that well to the
expectations of that class.

Running with the old code (I removed results that don't make much
sense for this benchmark):

```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Linux 4.15.0-66-generic
[info] Intel(R) Core(TM) i7-6820HQ CPU  2.70GHz
[info] metrics aggregation (50 metrics, 100k tasks per stage):  Best Time(ms)   Avg Time(ms)
[info] --------------------------------------------------------------------------------------
[info] 1 stage(s)                                                  2113           2118
[info] 2 stage(s)                                                  4172           4392
[info] 3 stage(s)                                                  7755           8460
[info]
[info] Stage Count    Stage Proc. Time    Aggreg. Time
[info]      1              614                1187
[info]      2              620                2480
[info]      3              718                5069
```

With the new code:

```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Linux 4.15.0-66-generic
[info] Intel(R) Core(TM) i7-6820HQ CPU  2.70GHz
[info] metrics aggregation (50 metrics, 100k tasks per stage):  Best Time(ms)   Avg Time(ms)
[info] --------------------------------------------------------------------------------------
[info] 1 stage(s)                                                   727            886
[info] 2 stage(s)                                                  1722           1983
[info] 3 stage(s)                                                  2752           3013
[info]
[info] Stage Count    Stage Proc. Time    Aggreg. Time
[info]      1              408                177
[info]      2              389                423
[info]      3              372                660

```

So the new code is faster than the old when processing task events, and about
an order of magnitude faster when aggregating metrics.

Note this still leaves room for improvement; for example, using the above
measurements, 600ms is still a huge amount of time to spend in an event
handler. But I'll leave further enhancements for a separate change.

Tested with benchmarking code + existing unit tests.

Closes #26218 from vanzin/SPARK-29562.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-24 22:18:10 -07:00
wenxuanguan 40df9d246e [SPARK-29227][SS] Track rule info in optimization phase
### What changes were proposed in this pull request?

Track timing info for each rule in optimization phase using `QueryPlanningTracker` in Structured Streaming

### Why are the changes needed?

In Structured Streaming we only track rule info in analysis phase, not in optimization phase.

### Does this PR introduce any user-facing change?

No

Closes #25914 from wenxuanguan/spark-29227.

Authored-by: wenxuanguan <choose_home@126.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-25 10:02:54 +09:00
Terry Kim dec99d8ac5 [SPARK-29526][SQL] UNCACHE TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add UncacheTableStatement and make UNCACHE TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
UNCACHE TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?

Yes. When running UNCACHE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

New unit tests

Closes #26237 from imback82/uncache_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-24 14:51:23 -07:00
fuwhu 92b25295ca [SPARK-21287][SQL] Remove requirement of fetch_size>=0 from JDBCOptions
### What changes were proposed in this pull request?
 Remove the requirement of fetch_size>=0 from JDBCOptions to allow negative fetch size.

### Why are the changes needed?

Namely, to allow fetching data in a streaming manner (row-by-row fetch) against a MySQL database.
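
A hedged read-side sketch (spark-shell): with the check removed, a negative value can be passed through to the driver; MySQL's Connector/J treats `Integer.MIN_VALUE` as "stream row by row". The URL and table name are placeholders:

```scala
val streamedDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://mysql-host:3306/mydb")  // placeholder
  .option("dbtable", "big_table")                      // placeholder
  .option("fetchsize", Int.MinValue.toString)          // row-by-row fetch on MySQL
  .load()
```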

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Unit test (JDBCSuite)

This closes #26230 .

Closes #26244 from fuwhu/SPARK-21287-FIX.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-24 12:35:32 -07:00
stczwd dcf5eaf1a6 [SPARK-29444][FOLLOWUP] add doc and python parameter for ignoreNullFields in json generating
### What changes were proposed in this pull request?
Add a description for ignoreNullFields, which was committed in #26098, to DataFrameWriter and readwriter.py.
Enable users to use ignoreNullFields in PySpark.
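
A hedged Scala counterpart of the documented behavior (assuming the writer option name matches the `ignoreNullFields` parameter described here; the output path is a placeholder):

```scala
spark.sql("SELECT null AS a, 1 AS b")
  .write
  .option("ignoreNullFields", "false")   // keep {"a":null,"b":1} instead of {"b":1}
  .json("/tmp/json-with-nulls")          // placeholder path
```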

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
run unit tests

Closes #26227 from stczwd/json-generator-doc.

Authored-by: stczwd <qcsd2011@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-24 10:25:04 -07:00
Wenchen Fan cdea520ff8 [SPARK-29532][SQL] Simplify interval string parsing
### What changes were proposed in this pull request?

Only use antlr4 to parse the interval string, and remove the duplicated parsing logic from `CalendarInterval`.

### Why are the changes needed?

Simplify the code and fix inconsistent behaviors.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Pass the Jenkins with the updated test cases.

Closes #26190 from cloud-fan/parser.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-24 09:15:59 -07:00
Sean Owen a35fb4fd50 [SPARK-29578][TESTS] Add "8634" as another skipped day for Kwajalein timezone due to more recent timezone updates in later JDK 8
### What changes were proposed in this pull request?

Recent timezone definition changes in very new JDK 8 (and beyond) releases cause test failures. The below was observed on JDK 1.8.0_232. As before, the easy fix is to allow for these inconsequential variations in test results due to differing definition of timezones.

### Why are the changes needed?

Keeps test passing on the latest JDK releases.

### Does this PR introduce any user-facing change?

None

### How was this patch tested?

Existing tests

Closes #26236 from srowen/SPARK-29578.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-24 08:30:27 -05:00
shahid 76d4bebb54 [SPARK-29559][WEBUI] Support pagination for JDBC/ODBC Server page
### What changes were proposed in this pull request?
Supports pagination for the SQL Statistics table in the JDBC/ODBC tab using the existing Spark pagination framework.

### Why are the changes needed?
It will be easier for users to analyze the table, and it may fix potential issues like OOM while loading the page, similar to what could occur on the SQL page (refer to https://github.com/apache/spark/pull/22645)

### Does this PR introduce any user-facing change?
There will be no change in the `SQLStatistics` table on the JDBC/ODBC server page except for pagination support.

### How was this patch tested?
Manually verified.

Before PR:
![Screenshot 2019-10-22 at 11 37 29 PM](https://user-images.githubusercontent.com/23054875/67316080-73636680-f525-11e9-91bc-ff7e06e3736d.png)

After PR:

![Screenshot 2019-10-22 at 10 33 00 PM](https://user-images.githubusercontent.com/23054875/67316092-778f8400-f525-11e9-93f8-1e2815abd66f.png)

Closes #26215 from shahidki31/jdbcPagination.

Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-24 08:29:05 -05:00
angerszhu 67cf0433ee [SPARK-29145][SQL] Support sub-queries in join conditions
### What changes were proposed in this pull request?
Support using IN/EXISTS with a subquery in a Spark SQL JOIN condition.

### Why are the changes needed?
Allow SQL queries to use IN/EXISTS with a subquery in a JOIN condition.

### Does this PR introduce any user-facing change?

This PR enables users to use a subquery in a `JOIN`'s ON condition. For example, given three tables:
```
CREATE TABLE A(id String);
CREATE TABLE B(id String);
CREATE TABLE C(id String);
```
we can run a query like:
```
SELECT A.id  from  A JOIN B ON A.id = B.id and A.id IN (select C.id from C)
```

### How was this patch tested?
Added UT.

Closes #25854 from AngersZhuuuu/SPARK-29145.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-10-24 21:55:03 +09:00
Yuanjian Li 9e77d48315 [SPARK-21492][SQL][FOLLOW UP] Reimplement UnsafeExternalRowSorter in database style iterator
### What changes were proposed in this pull request?
Reimplement the iterator in UnsafeExternalRowSorter in database style. This can be done by reusing the `RowIterator` in our code base.

### Why are the changes needed?
During the work in #26164, after introducing a var `isReleased` in `hasNext`, it's possible that `isReleased` is false when `hasNext` is called but becomes true before `next` is called. A safer way is to use a database-style iterator: `advanceNext` and `getRow`.
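
An illustrative sketch of the two iteration styles (hypothetical trait, not the actual `RowIterator`): the check and the fetch are collapsed into a single `advanceNext()` call, so state cannot change between them:

```scala
trait DbStyleIterator[T] {
  def advanceNext(): Boolean  // move to the next row, report whether one is available
  def getRow: T               // return the row fetched by the last successful advanceNext()
}

// Consumption pattern: no separate hasNext/next pair that a state change could split.
def consume[T](iter: DbStyleIterator[T])(f: T => Unit): Unit = {
  while (iter.advanceNext()) {
    f(iter.getRow)
  }
}
```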

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #26229 from xuanyuanking/SPARK-21492-follow-up.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-24 15:43:13 +08:00
Liang-Chi Hsieh 177bf672e4 [SPARK-29522][SQL] CACHE TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add CacheTableStatement and make CACHE TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
CACHE TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?

Yes. When running CACHE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26179 from viirya/SPARK-29522.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-24 15:00:21 +08:00
07ARB 55ced9c148 [SPARK-29571][SQL][TESTS][FOLLOWUP] Fix UT in AllExecutionsPageSuite
### What changes were proposed in this pull request?

This is a follow-up of #24052 to correct assert condition.

### Why are the changes needed?
To test the IllegalArgumentException condition.

### Does this PR introduce any user-facing change?
 No.

### How was this patch tested?

Manual test (found this issue while fixing SPARK-29453).

Closes #26234 from 07ARB/SPARK-29571.

Authored-by: 07ARB <ankitrajboudh@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-24 15:57:16 +09:00
Dongjoon Hyun b91356e4c2 [SPARK-29533][SQL][TESTS][FOLLOWUP] Regenerate the result on EC2
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/26189 to regenerate the result on EC2.

### Why are the changes needed?

This will be used for the other PR reviews.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A.

Closes #26233 from dongjoon-hyun/SPARK-29533.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-10-23 21:41:05 +00:00
jiake 7e8e4c0a14 [SPARK-29552][SQL] Execute the "OptimizeLocalShuffleReader" rule when creating a new query stage, so that more shuffle readers can be optimized to local shuffle readers
### What changes were proposed in this pull request?
The `OptimizeLocalShuffleReader` rule is very conservative and gives up optimization as long as any extra shuffle would be introduced. It's very likely that most of the added local shuffle readers are fine and only one introduces an extra shuffle.

However, it's very hard to make `OptimizeLocalShuffleReader` optimal, a simple workaround is to run this rule again right before executing a query stage.

### Why are the changes needed?
Optimize more shuffle readers into local shuffle readers.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing ut

Closes #26207 from JkSelf/resolve-multi-joins-issue.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-24 01:18:07 +08:00
Jungtaek Lim (HeartSaVioR) bfbf2821f3 [SPARK-29503][SQL] Remove conversion CreateNamedStruct to CreateNamedStructUnsafe
### What changes were proposed in this pull request?

There's a case where MapObjects has a lambda function which creates a nested struct - unsafe data inside a safe data struct. In this case, MapObjects doesn't copy the row returned from the lambda function (as the outermost data type is a safe data struct), which misses copying the nested unsafe data.

The culprit is that `UnsafeProjection.toUnsafeExprs` converts `CreateNamedStruct` to `CreateNamedStructUnsafe` (this is the only place where `CreateNamedStructUnsafe` is used), which causes safe and unsafe data to be mixed up temporarily. This may not be needed at all, at least logically, as these evaluations are finally assembled into an `UnsafeRow`.

> Before the patch

```
/* 105 */   private ArrayData MapObjects_0(InternalRow i) {
/* 106 */     boolean isNull_1 = i.isNullAt(0);
/* 107 */     ArrayData value_1 = isNull_1 ?
/* 108 */     null : (i.getArray(0));
/* 109 */     ArrayData value_0 = null;
/* 110 */
/* 111 */     if (!isNull_1) {
/* 112 */
/* 113 */       int dataLength_0 = value_1.numElements();
/* 114 */
/* 115 */       ArrayData[] convertedArray_0 = null;
/* 116 */       convertedArray_0 = new ArrayData[dataLength_0];
/* 117 */
/* 118 */
/* 119 */       int loopIndex_0 = 0;
/* 120 */
/* 121 */       while (loopIndex_0 < dataLength_0) {
/* 122 */         value_MapObject_lambda_variable_1 = (int) (value_1.getInt(loopIndex_0));
/* 123 */         isNull_MapObject_lambda_variable_1 = value_1.isNullAt(loopIndex_0);
/* 124 */
/* 125 */         ArrayData arrayData_0 = ArrayData.allocateArrayData(
/* 126 */           -1, 1L, " createArray failed.");
/* 127 */
/* 128 */         mutableStateArray_0[0].reset();
/* 129 */
/* 130 */
/* 131 */         mutableStateArray_0[0].zeroOutNullBytes();
/* 132 */
/* 133 */
/* 134 */         if (isNull_MapObject_lambda_variable_1) {
/* 135 */           mutableStateArray_0[0].setNullAt(0);
/* 136 */         } else {
/* 137 */           mutableStateArray_0[0].write(0, value_MapObject_lambda_variable_1);
/* 138 */         }
/* 139 */         arrayData_0.update(0, (mutableStateArray_0[0].getRow()));
/* 140 */         if (false) {
/* 141 */           convertedArray_0[loopIndex_0] = null;
/* 142 */         } else {
/* 143 */           convertedArray_0[loopIndex_0] = arrayData_0 instanceof UnsafeArrayData? arrayData_0.copy() : arrayData_0;
/* 144 */         }
/* 145 */
/* 146 */         loopIndex_0 += 1;
/* 147 */       }
/* 148 */
/* 149 */       value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(convertedArray_0);
/* 150 */     }
/* 151 */     globalIsNull_0 = isNull_1;
/* 152 */     return value_0;
/* 153 */   }
```

> After the patch

```
/* 104 */   private ArrayData MapObjects_0(InternalRow i) {
/* 105 */     boolean isNull_1 = i.isNullAt(0);
/* 106 */     ArrayData value_1 = isNull_1 ?
/* 107 */     null : (i.getArray(0));
/* 108 */     ArrayData value_0 = null;
/* 109 */
/* 110 */     if (!isNull_1) {
/* 111 */
/* 112 */       int dataLength_0 = value_1.numElements();
/* 113 */
/* 114 */       ArrayData[] convertedArray_0 = null;
/* 115 */       convertedArray_0 = new ArrayData[dataLength_0];
/* 116 */
/* 117 */
/* 118 */       int loopIndex_0 = 0;
/* 119 */
/* 120 */       while (loopIndex_0 < dataLength_0) {
/* 121 */         value_MapObject_lambda_variable_1 = (int) (value_1.getInt(loopIndex_0));
/* 122 */         isNull_MapObject_lambda_variable_1 = value_1.isNullAt(loopIndex_0);
/* 123 */
/* 124 */         ArrayData arrayData_0 = ArrayData.allocateArrayData(
/* 125 */           -1, 1L, " createArray failed.");
/* 126 */
/* 127 */         Object[] values_0 = new Object[1];
/* 128 */
/* 129 */
/* 130 */         if (isNull_MapObject_lambda_variable_1) {
/* 131 */           values_0[0] = null;
/* 132 */         } else {
/* 133 */           values_0[0] = value_MapObject_lambda_variable_1;
/* 134 */         }
/* 135 */
/* 136 */         final InternalRow value_3 = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(values_0);
/* 137 */         values_0 = null;
/* 138 */         arrayData_0.update(0, value_3);
/* 139 */         if (false) {
/* 140 */           convertedArray_0[loopIndex_0] = null;
/* 141 */         } else {
/* 142 */           convertedArray_0[loopIndex_0] = arrayData_0 instanceof UnsafeArrayData? arrayData_0.copy() : arrayData_0;
/* 143 */         }
/* 144 */
/* 145 */         loopIndex_0 += 1;
/* 146 */       }
/* 147 */
/* 148 */       value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(convertedArray_0);
/* 149 */     }
/* 150 */     globalIsNull_0 = isNull_1;
/* 151 */     return value_0;
/* 152 */   }
```

### Why are the changes needed?

This patch fixes the bug described above.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

UT added which fails on master branch and passes on PR.

Closes #26173 from HeartSaVioR/SPARK-29503.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-24 00:41:48 +08:00
Terry Kim 53a5f17803 [SPARK-29513][SQL] REFRESH TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add RefreshTableStatement and make REFRESH TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
REFRESH TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?

Yes. When running REFRESH TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

New unit tests

Closes #26183 from imback82/refresh_table.

Lead-authored-by: Terry Kim <yuminkim@gmail.com>
Co-authored-by: Terry Kim <terryk@terrys-mbp-2.lan>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-10-23 08:26:47 -07:00
turbofei 70dd9c0cab [SPARK-29542][SQL][DOC] Make the descriptions of spark.sql.files.* clear
### What changes were proposed in this pull request?
As described in [SPARK-29542](https://issues.apache.org/jira/browse/SPARK-29542), the descriptions of `spark.sql.files.*` are confusing.
In this PR, I make their descriptions clear.
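
For reference, a hedged spark-shell sketch touching two members of the `spark.sql.files.*` family (assumed examples of the configs whose descriptions are clarified; both apply only to file-based sources):

```scala
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728L)  // 128 MB max bytes packed into one partition
spark.conf.set("spark.sql.files.openCostInBytes", 4194304L)      // 4 MB estimated cost of opening a file
```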

### Why are the changes needed?
It makes the descriptions of `spark.sql.files.*` clear.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #26200 from turboFei/SPARK-29542-partition-maxSize.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-23 20:31:06 +09:00
Burak Yavuz cbe6eadc0c [SPARK-29352][SQL][SS] Track active streaming queries in the SparkSession.sharedState
### What changes were proposed in this pull request?

This moves the tracking of active queries from per-SparkSession state to the SparkSession's SharedState, for better safety in isolated SparkSession environments.

### Why are the changes needed?

We have checks to prevent the restarting of the same stream on the same spark session, but we can actually make that better in multi-tenant environments by actually putting that state in the SharedState instead of SessionState. This would allow a more comprehensive check for multi-tenant clusters.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added tests to StreamingQueryManagerSuite

Closes #26018 from brkyvz/sharedStreamingQueryManager.

Lead-authored-by: Burak Yavuz <burak@databricks.com>
Co-authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2019-10-23 10:56:19 +02:00
Terry Kim c128ac564d [SPARK-29511][SQL] DataSourceV2: Support CREATE NAMESPACE
### What changes were proposed in this pull request?

This PR adds `CREATE NAMESPACE` support for V2 catalogs.

### Why are the changes needed?

Currently, you cannot explicitly create namespaces for v2 catalogs.

### Does this PR introduce any user-facing change?

The user can now perform the following:
```SQL
CREATE NAMESPACE mycatalog.ns
```
to create a namespace `ns` inside `mycatalog` V2 catalog.

### How was this patch tested?

Added unit tests.

Closes #26166 from imback82/create_namespace.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-23 12:17:20 +08:00
DylanGuedes e6749092f7 [SPARK-29107][SQL][TESTS] Port window.sql (Part 1)
### What changes were proposed in this pull request?

This PR ports window.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/window.sql from lines 1~319

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/window.out

### Why are the changes needed?

To ensure compatibility with PostgreSQL.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Pass the Jenkins, and comparison with PgSQL results.

Closes #26119 from DylanGuedes/spark-29107.

Authored-by: DylanGuedes <djmgguedes@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-23 10:24:38 +09:00
Huaxin Gao 3bf5355e24 [SPARK-29539][SQL] SHOW PARTITIONS should look up catalog/table like v2 commands
### What changes were proposed in this pull request?
Add ShowPartitionsStatement and make SHOW PARTITIONS go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users.

### Does this PR introduce any user-facing change?
Yes. When running SHOW PARTITIONS, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?
Unit tests.

Closes #26198 from huaxingao/spark-29539.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-10-22 14:47:17 -07:00
Liang-Chi Hsieh b4844eea1f [SPARK-29517][SQL] TRUNCATE TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add TruncateTableStatement and make TRUNCATE TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
TRUNCATE TABLE t // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?

Yes. When running TRUNCATE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

Unit tests.

Closes #26174 from viirya/SPARK-29517.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-22 19:17:28 +08:00
Yuanjian Li bb49c80c89 [SPARK-21492][SQL] Fix memory leak in SortMergeJoin
### What changes were proposed in this pull request?
We introduce a new mechanism by which downstream operators can notify their parents that the output data stream may be released. In this PR, we implement the mechanism as below (see the sketch after this list):
- Add a function named `cleanupResources` to SparkPlan, which by default calls the children's `cleanupResources`. An operator that needs resource cleanup should override it with its own cleanup and also call `super.cleanupResources`, like SortExec in this PR.
- Add support on the trigger side, which in this PR is SortMergeJoinExec, to make sure `cleanupResources` is called to do the cleanup for all of its upstream (children) operators.
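
An illustrative sketch of the mechanism (simplified, hypothetical node types; not the actual SparkPlan/SortExec code):

```scala
abstract class PlanNode {
  def children: Seq[PlanNode]

  // Default behavior: just propagate cleanup to all children.
  def cleanupResources(): Unit = children.foreach(_.cleanupResources())
}

class SortNode(child: PlanNode) extends PlanNode {
  override def children: Seq[PlanNode] = Seq(child)

  private def releaseSortBuffers(): Unit = ()  // placeholder for operator-specific cleanup

  // An operator holding resources cleans itself up first, then calls super.
  override def cleanupResources(): Unit = {
    releaseSortBuffers()
    super.cleanupResources()
  }
}
```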

### Why are the changes needed?
Bugfix for SortMergeJoin memory leak, and implement a general framework for SparkPlan resource cleanup.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
UT: Added a new test suite, JoinWithResourceCleanSuite, to check both the standard and code generation scenarios.

Integration test: Tested with driver/executor default memory set to 1g, local mode with 10 threads. The test below (thanks to taosaildrone for providing it [here](https://github.com/apache/spark/pull/23762#issuecomment-463303175)) will pass with this PR.

```
from pyspark.sql.functions import rand, col

spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
# spark.conf.set("spark.sql.sortMergeJoinExec.eagerCleanupResources", "true")

r1 = spark.range(1, 1001).select(col("id").alias("timestamp1"))
r1 = r1.withColumn('value', rand())
r2 = spark.range(1000, 1001).select(col("id").alias("timestamp2"))
r2 = r2.withColumn('value2', rand())
joined = r1.join(r2, r1.timestamp1 == r2.timestamp2, "inner")
joined = joined.coalesce(1)
joined.explain()
joined.show()
```

Closes #26164 from xuanyuanking/SPARK-21492.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-22 19:08:09 +08:00
Yuming Wang 3163b6b43b [SPARK-29516][SQL][TEST] Test ThriftServerQueryTestSuite asynchronously
### What changes were proposed in this pull request?
This PR tests `ThriftServerQueryTestSuite` in an asynchronous way.

### Why are the changes needed?
The default value of `spark.sql.hive.thriftServer.async` is `true`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
```
build/sbt "hive-thriftserver/test-only *.ThriftServerQueryTestSuite" -Phive-thriftserver
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite test -Phive-thriftserver
```

Closes #26172 from wangyum/SPARK-29516.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-10-22 03:20:49 -07:00
denglingang 467c3f610f [SPARK-29529][DOCS] Remove unnecessary orc version and hive version in doc
### What changes were proposed in this pull request?

This PR removes the unnecessary ORC version and Hive version from the docs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A.

Closes #26146 from denglingang/SPARK-24576.

Lead-authored-by: denglingang <chitin1027@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-22 14:49:23 +09:00
angerszhu 484f93e255 [SPARK-29530][SQL] Make SQLConf in SQL parse process thread safe
### What changes were proposed in this pull request?
As I commented in [SPARK-29516](https://github.com/apache/spark/pull/26172#issuecomment-544364977), the SparkSession.sql() parse process does not run under the current SparkSession's conf, so some parser-related configurations are not effective in multi-threaded situations.

In this PR, we add a SQLConf parameter to AbstractSqlParser and initialize it with the SessionState's conf.
Then each SparkSession's parse process uses its own SessionState's SQLConf and is thread safe.
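
An illustrative sketch of the shape of the change (hypothetical, heavily simplified; not the actual parser code): the conf is captured per session at construction instead of being read from shared mutable state:

```scala
// Stand-in for SQLConf, holding one parser-related flag.
final case class ParserConf(caseSensitive: Boolean)

// The parser receives the owning session's conf, so concurrent sessions with
// different settings no longer step on each other.
class SessionParser(conf: ParserConf) {
  def normalizeIdentifier(name: String): String =
    if (conf.caseSensitive) name else name.toLowerCase
}
```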

### Why are the changes needed?
Fix bug

### Does this PR introduce any user-facing change?
NO

### How was this patch tested?
NO

Closes #26187 from AngersZhuuuu/SPARK-29530.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-22 10:38:06 +08:00
wuyi 3d567a357c [MINOR][SQL] Avoid unnecessary invocation on checkAndGlobPathIfNecessary
### What changes were proposed in this pull request?

Only invoke `checkAndGlobPathIfNecessary()` when we have to use `InMemoryFileIndex`.

### Why are the changes needed?

Avoid unnecessary function invocation.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #26196 from Ngone51/dev-avoid-unnecessary-invocation-on-globpath.

Authored-by: wuyi <ngone_5451@163.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-21 21:10:21 -05:00
DylanGuedes bb4400c23a [SPARK-29108][SQL][TESTS] Port window.sql (Part 2)
### What changes were proposed in this pull request?

This PR ports window.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/window.sql from lines 320~562

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/window.out

### Why are the changes needed?
To ensure compatibility with PostgreSQL.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Pass the Jenkins, and comparison with PgSQL results.

Closes #26121 from DylanGuedes/spark-29108.

Authored-by: DylanGuedes <djmgguedes@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-22 10:49:40 +09:00
Maxim Gekk eef11ba9ef [SPARK-29518][SQL][TEST] Benchmark date_part for INTERVAL
### What changes were proposed in this pull request?
I extended `ExtractBenchmark` to support the `INTERVAL` type of the `source` parameter of the `date_part` function.

### Why are the changes needed?
- To detect performance issues while changing implementation of the `date_part` function in the future.
- To find out current performance bottlenecks in `date_part` for the `INTERVAL` type

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running the benchmark and printing out the produced values for each `field` value.

Closes #26175 from MaxGekk/extract-interval-benchmark.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-22 10:47:54 +09:00
Maxim Gekk 6ffec5e6a6 [SPARK-29533][SQL][TEST] Benchmark casting strings to intervals
### What changes were proposed in this pull request?
Added a new benchmark, `IntervalBenchmark`, to measure the performance of interval-related functions. In the PR, I added benchmarks for casting strings to intervals: in particular, interval strings with the `interval` prefix and without it, because there is special code for this case (common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java, L100-L103 at commit da576a737c). I also added benchmarks for different numbers of units in interval strings; for example, 1 unit is `interval 10 years`, 2 units without the prefix is `10 years 5 months`, etc.

### Why are the changes needed?
- To find out current performance issues in casting to intervals
- The benchmark can be used while refactoring/re-implementing `CalendarInterval.fromString()` or `CalendarInterval.fromCaseInsensitiveString()`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running the benchmark via the command:
```shell
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.IntervalBenchmark"
```

Closes #26189 from MaxGekk/interval-from-string-benchmark.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-22 10:47:04 +09:00
fuwhu 31a5dea48f [SPARK-29531][SQL][TEST] refine ThriftServerQueryTestSuite.blackList to reuse black list in SQLQueryTestSuite
### What changes were proposed in this pull request?
This PR refines the code in ThriftServerQueryTestSuite.blackList to reuse the black list of SQLQueryTestSuite instead of duplicating all test cases from SQLQueryTestSuite.blackList.

### Why are the changes needed?
To reduce code duplication.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #26188 from fuwhu/SPARK-TBD.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-10-21 05:19:27 -07:00
Yuming Wang e99a9f78ea [SPARK-29498][SQL] CatalogTable to HiveTable should not change the table's ownership
### What changes were proposed in this pull request?

`CatalogTable` to `HiveTable` will change the table's ownership. How to reproduce:
```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.{LongType, StructType}

val identifier = TableIdentifier("spark_29498", None)
val owner = "SPARK-29498"
val newTable = CatalogTable(
  identifier,
  tableType = CatalogTableType.EXTERNAL,
  storage = CatalogStorageFormat(
    locationUri = None,
    inputFormat = None,
    outputFormat = None,
    serde = None,
    compressed = false,
    properties = Map.empty),
  owner = owner,
  schema = new StructType().add("i", LongType, false),
  provider = Some("hive"))

spark.sessionState.catalog.createTable(newTable, false)
// The owner is not SPARK-29498
println(spark.sessionState.catalog.getTableMetadata(identifier).owner)
```

This PR makes it set the `HiveTable`'s owner to the `CatalogTable`'s owner, if its owner is not empty, when converting `CatalogTable` to `HiveTable`.
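
A hedged sketch of the described rule (hypothetical helper, not the actual conversion code):

```scala
// Keep the CatalogTable's owner when converting, falling back to the current
// user only when the CatalogTable carries no owner.
def resolveOwner(catalogTableOwner: String, currentUser: String): String =
  if (catalogTableOwner.nonEmpty) catalogTableOwner else currentUser
```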

### Why are the changes needed?
We should not change the ownership of the table when converting `CatalogTable` to `HiveTable`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
unit test

Closes #26160 from wangyum/SPARK-29498.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-21 15:53:36 +08:00
Kent Yao 5b4d9170ed [SPARK-27879][SQL] Add support for bit_and and bit_or aggregates
### What changes were proposed in this pull request?

```
bit_and(expression) -- The bitwise AND of all non-null input values, or null if none
bit_or(expression) -- The bitwise OR of all non-null input values, or null if none
```
More details:
https://www.postgresql.org/docs/9.3/functions-aggregate.html
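
A hedged usage sketch (spark-shell); with inputs 3, 5, 7: 3 & 5 & 7 = 1 and 3 | 5 | 7 = 7:

```scala
spark.sql("SELECT bit_and(col), bit_or(col) FROM VALUES (3), (5), (7) AS tab(col)").show()
```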

### Why are the changes needed?

Postgres, MySQL, and many other popular databases support them.

### Does this PR introduce any user-facing change?

Yes, it adds two bitwise aggregate functions, `bit_and` and `bit_or`.

### How was this patch tested?

Added UT.

Closes #26155 from yaooqinn/SPARK-27879.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-21 14:32:31 +08:00
Yuming Wang 0f65b49f55 [SPARK-29525][SQL][TEST] Fix the associated location already exists in SQLQueryTestSuite
### What changes were proposed in this pull request?

This PR fixes the "associated location already exists" failure in `SQLQueryTestSuite`:
```
build/sbt "~sql/test-only *SQLQueryTestSuite -- -z postgreSQL/join.sql"
...
[info] - postgreSQL/join.sql *** FAILED *** (35 seconds, 420 milliseconds)
[info]   postgreSQL/join.sql
[info]   Expected "[]", but got "[org.apache.spark.sql.AnalysisException
[info]   Can not create the managed table('`default`.`tt3`'). The associated location('file:/root/spark/sql/core/spark-warehouse/org.apache.spark.sql.SQLQueryTestSuite/tt3') already exists.;]" Result did not match for query #108
```

### Why are the changes needed?
Fix bug.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #26181 from wangyum/TestError.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-20 13:31:59 -07:00
Terry Kim ab92e1715e [SPARK-29512][SQL] REPAIR TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add RepairTableStatement and make REPAIR TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.

```
USE my_catalog
DESC t // success and describe the table t from my_catalog
MSCK REPAIR TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?

Yes. When running MSCK REPAIR TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

New unit tests

Closes #26168 from imback82/repair_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-10-18 22:43:58 -07:00
Wenchen Fan 2437878299 [SPARK-29502][SQL] typed interval expression should fail for invalid format
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/25241 .

The typed interval expression should fail for invalid format.

### Why are the changes needed?

To be consistent with the typed timestamp/date expressions.

### Does this PR introduce any user-facing change?

Yes. But this feature is not released yet.

### How was this patch tested?

updated test

Closes #26151 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-10-18 16:12:03 -07:00
Rahul Mahadev 4cfce3e5d0 [SPARK-29494][SQL] Fix for ArrayOutofBoundsException while converting string to timestamp
### What changes were proposed in this pull request?
* Adding an additional check in `stringToTimestamp` to handle cases where the input has trailing ':'
* Added a test to make sure this works.

### Why are the changes needed?
In a couple of scenarios, while converting from String to Timestamp, `DateTimeUtils.stringToTimestamp` throws an array out of bounds exception if there is a trailing ':'. The expected behavior of this method is to return `None` when the format of the string is incorrect.
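
A hedged spark-shell illustration of the failure mode (the exact input string is an assumed example of a trailing ':'): after the fix, the cast should yield NULL instead of throwing:

```scala
// String-to-timestamp casts go through DateTimeUtils.stringToTimestamp;
// a malformed value should produce NULL, not an ArrayIndexOutOfBoundsException.
spark.sql("SELECT CAST('2019-10-18 12:' AS timestamp)").show()
```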

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a test in the `DateTimeTestUtils` suite to test if my fix works.

Closes #26143 from rahulsmahadev/SPARK-29494.

Lead-authored-by: Rahul Mahadev <rahul.mahadev@databricks.com>
Co-authored-by: Rahul Shivu Mahadev <51690557+rahulsmahadev@users.noreply.github.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-18 16:45:25 -05:00
angerszhu 9a3dccae72 [SPARK-29379][SQL] SHOW FUNCTIONS show '!=', '<>' , 'between', 'case'
### What changes were proposed in this pull request?
Currently, Spark SQL `SHOW FUNCTIONS` doesn't show `!=`, `<>`, `between`, or `case`.
But these expressions are truly functions. We should show them in SQL `SHOW FUNCTIONS`.
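
A hedged spark-shell check of the new behavior (pattern syntax as in standard `SHOW FUNCTIONS LIKE`):

```scala
// After this change, the operators show up in the listing.
spark.sql("SHOW FUNCTIONS LIKE '!='").show()
spark.sql("SHOW FUNCTIONS LIKE 'between'").show()
```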

### Why are the changes needed?

SHOW FUNCTIONS show '!=', '<>' , 'between', 'case'

### Does this PR introduce any user-facing change?
SHOW FUNCTIONS show '!=', '<>' , 'between', 'case'

### How was this patch tested?
UT

Closes #26053 from AngersZhuuuu/SPARK-29379.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-19 00:19:56 +08:00
Maxim Gekk 77fe8a8e7c [SPARK-28420][SQL] Support the INTERVAL type in date_part()
### What changes were proposed in this pull request?
The `date_part()` function can accept the `source` parameter of the `INTERVAL` type (`CalendarIntervalType`). The following values of the `field` parameter are supported:
- `"MILLENNIUM"` (`"MILLENNIA"`, `"MIL"`, `"MILS"`) - number of millenniums in the given interval. It is `YEAR / 1000`.
- `"CENTURY"` (`"CENTURIES"`, `"C"`, `"CENT"`) - number of centuries in the interval calculated as `YEAR / 100`.
- `"DECADE"` (`"DECADES"`, `"DEC"`, `"DECS"`) - decades in the `YEAR` part of the interval calculated as `YEAR / 10`.
- `"YEAR"` (`"Y"`, `"YEARS"`, `"YR"`, `"YRS"`) - years in a values of `CalendarIntervalType`. It is `MONTHS / 12`.
- `"QUARTER"` (`"QTR"`) - a quarter of year calculated as `MONTHS / 3 + 1`
- `"MONTH"` (`"MON"`, `"MONS"`, `"MONTHS"`) - the months part of the interval calculated as `CalendarInterval.months % 12`
- `"DAY"` (`"D"`, `"DAYS"`) - total number of days in `CalendarInterval.microseconds`
- `"HOUR"` (`"H"`, `"HOURS"`, `"HR"`, `"HRS"`) - the hour part of the interval.
- `"MINUTE"` (`"M"`, `"MIN"`, `"MINS"`, `"MINUTES"`) - the minute part of the interval.
- `"SECOND"` (`"S"`, `"SEC"`, `"SECONDS"`, `"SECS"`) - the seconds part with fractional microsecond part.
- `"MILLISECONDS"` (`"MSEC"`, `"MSECS"`, `"MILLISECON"`, `"MSECONDS"`, `"MS"`) - the millisecond part of the interval with fractional microsecond part.
- `"MICROSECONDS"` (`"USEC"`, `"USECS"`, `"USECONDS"`, `"MICROSECON"`, `"US"`) - the total number of microseconds in the `second`, `millisecond` and `microsecond` parts of the given interval.
- `"EPOCH"` - the total number of seconds in the interval including the fractional part with microsecond precision. Here we assume 365.25 days per year (leap year every four years).

For example:
```sql
> SELECT date_part('days', interval 1 year 10 months 5 days);
 5
> SELECT date_part('seconds', interval 30 seconds 1 milliseconds 1 microseconds);
 30.001001
```

### Why are the changes needed?
To maintain feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT)

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- Added new test suite `IntervalExpressionsSuite`
- Add new test cases to `date_part.sql`

Closes #25981 from MaxGekk/extract-from-intervals.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-18 23:54:59 +08:00
jiake c3a0d02a40 [SPARK-28560][SQL][FOLLOWUP] resolve the remaining comments for PR#25295
### What changes were proposed in this pull request?
A followup of [#25295](https://github.com/apache/spark/pull/25295).
1) change the logWarning to logDebug in `OptimizeLocalShuffleReader`.
2) update the test to check whether query stage reuse can work well with local shuffle reader.

### Why are the changes needed?
Make the code more robust.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing tests

Closes #26157 from JkSelf/followup-25295.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-18 23:16:58 +08:00
Terry Kim 39af51dbc6 [SPARK-29014][SQL] DataSourceV2: Fix current/default catalog usage
### What changes were proposed in this pull request?
The handling of the catalog across plans should be as follows ([SPARK-29014](https://issues.apache.org/jira/browse/SPARK-29014)):
* The *current* catalog should be used when no catalog is specified
* The default catalog is the catalog *current* is initialized to
* If the *default* catalog is not set, then *current* catalog is the built-in Spark session catalog.

This PR addresses the issue where *current* catalog usage is not followed as describe above.

### Why are the changes needed?

It is a bug as described in the previous section.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

Unit tests added.

Closes #26120 from imback82/cleanup_catalog.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-18 22:45:42 +08:00
Wenchen Fan 74351468de [SPARK-29482][SQL] ANALYZE TABLE should look up catalog/table like v2 commands
### What changes were proposed in this pull request?

Add `AnalyzeTableStatement` and `AnalyzeColumnStatement`, and make ANALYZE TABLE go through the same catalog/table resolution framework of v2 commands.

### Why are the changes needed?

It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ANALYZE TABLE t // report table not found as there is no table t in the session catalog
```

### Does this PR introduce any user-facing change?

Yes. When running ANALYZE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specifies a v2 catalog.

### How was this patch tested?

new tests

Closes #26129 from cloud-fan/analyze-table.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2019-10-18 12:55:49 +02:00
Liang-Chi Hsieh 5692680e37 [SPARK-29295][SQL] Insert overwrite to Hive external table partition should delete old data
### What changes were proposed in this pull request?

This patch proposes to delete old Hive external partition directory even the partition does not exist in Hive, when insert overwrite Hive external table partition.

### Why are the changes needed?

When insert overwrite to a Hive external table partition, if the partition does not exist, Hive will not check if the external partition directory exists or not before copying files. So if users drop the partition, and then do insert overwrite to the same partition, the partition will have both old and new data.

For example:
```scala
withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false") {
  // test is an external Hive table.
  sql("INSERT OVERWRITE TABLE test PARTITION(name='n1') SELECT 1")
  sql("ALTER TABLE test DROP PARTITION(name='n1')")
  sql("INSERT OVERWRITE TABLE test PARTITION(name='n1') SELECT 2")
  sql("SELECT id FROM test WHERE name = 'n1' ORDER BY id") // Got both 1 and 2.
}
```

### Does this PR introduce any user-facing change?

Yes. This fix a correctness issue when users drop partition on a Hive external table partition and then insert overwrite it.

### How was this patch tested?

Added test.

Closes #25979 from viirya/SPARK-29295.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-18 16:35:44 +08:00
Kent Yao ef4c298cc9 [SPARK-29405][SQL] Alter table / Insert statements should not change a table's ownership
### What changes were proposed in this pull request?

In this change, we give preference to the original table's owner if it is not empty.

### Why are the changes needed?

When executing 'insert into/overwrite ...' DML, or 'alter table set tblproperties ...' DDL, Spark would change the ownership of the table to the user who runs the Spark application.

### Does this PR introduce any user-facing change?

NO

### How was this patch tested?

Compare with the behavior of Apache Hive

Closes #26068 from yaooqinn/SPARK-29405.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-18 16:21:31 +08:00
stczwd 78b0cbe265 [SPARK-29444] Add configuration to support JacksonGenrator to keep fields with null values
### Why are the changes needed?
As mentioned in the JIRA, sometimes we need to be able to support the retention of null columns when writing JSON.
For example, sparkmagic (widely used in Jupyter with Livy) generates SQL query results based on Dataset.toJSON and parses the JSON into a pandas DataFrame for display. If there is a null column, it is easy to end up with some columns missing or even an empty query result. The loss of the null column in the first row may cause parsing exceptions or the loss of entire column data.

### Does this PR introduce any user-facing change?
Example in spark-shell:
```scala
scala> spark.sql("select null as a, 1 as b").toJSON.collect.foreach(println)
{"b":1}

scala> spark.sql("set spark.sql.jsonGenerator.struct.ignore.null=false")
res2: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("select null as a, 1 as b").toJSON.collect.foreach(println)
{"a":null,"b":1}
```

### How was this patch tested?
Add new test to JacksonGeneratorSuite

Closes #26098 from stczwd/json.

Lead-authored-by: stczwd <qcsd2011@163.com>
Co-authored-by: Jackey Lee <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-10-18 16:06:54 +08:00