Commit graph

30345 commits

Author SHA1 Message Date
Takuya UESHIN ef7545b788 [SPARK-35759][PYTHON] Remove the upperbound for numpy for pandas-on-Spark
### What changes were proposed in this pull request?

Removes the upperbound for numpy for pandas-on-Spark.

### Why are the changes needed?

We can remove the upper-bound for numpy for pandas-on-Spark because currently it works well on the CI with numpy 1.20.3.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32908 from ueshin/issues/SPARK-35759/numpy.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-15 09:59:05 +09:00
Xinrong Meng 03756618fc [SPARK-35616][PYTHON] Make astype method data-type-based
### What changes were proposed in this pull request?

Make `astype` method data-type-based.

**Non-goal: Match pandas' `astype` TypeErrors.**
Currently, `astype` throws a TypeError only when the destination type is not recognized. However, for some destination types that don't make sense for the specific type of Series/Index, for example `numeric Series/Index → bytes`, we don't have proper TypeError messages.
Since the goal of the PR is refactoring mainly, the above issue might be resolved later if needed.

### Why are the changes needed?

There are many type checks in the `astype` method. Since `DataTypeOps` and its subclasses are introduced, we should refactor `astype` to make it data-type-based. In this way, code is cleaner, more maintainable, and more flexible.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32847 from xinrong-databricks/datatypeops_astype.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-14 16:33:15 -07:00
Kousuke Saruta aab0c2bf66 [SPARK-35736][SPARK-35737][SQL][FOLLOWUP] Move a common logic to DayTimeIntervalType
### What changes were proposed in this pull request?

This is a followup PR for SPARK-35736(#32893) and SPARK-35737(#32892).
This PR moves a common logic to `object DayTimeIntervalType`.
That logic is like `val strToFieldIndex = DayTimeIntervalType.dayTimeFields.map(i => DayTimeIntervalType.fieldToString(i) -> i).toMap`, a `Map` which maps each time unit name to the corresponding day-time field index.

### Why are the changes needed?

That logic appears in both the SPARK-35736 and SPARK-35737 changes, so it can be shared as common logic; it's better to avoid scattering similar logic.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32905 from sarutak/followup-SPARK-35736-35737.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-06-14 20:51:18 +03:00
Kousuke Saruta 82af318c31 [SPARK-35748][SS][SQL] Fix StreamingJoinHelper to be able to handle day-time interval
### What changes were proposed in this pull request?

This PR fixes `StreamingJoinHelper` to be able to handle day-time interval.

### Why are the changes needed?

In the current master, `StreamingJoinHelper.getStateValueWatermark` can't handle conditions which contain day-time interval literals.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New assertions added to `StreamingJoinHelperSuite`.

Closes #32896 from sarutak/streamingjoinhelper-daytime.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-06-14 15:45:36 +03:00
Kousuke Saruta 439e94c171 [SPARK-35737][SQL] Parse day-time interval literals to tightest types
### What changes were proposed in this pull request?

This PR adds a feature which parses day-time interval literals to the tightest types.

### Why are the changes needed?

To comply with the ANSI behavior.
For example, `INTERVAL '10 20:30' DAY TO MINUTE` should be parsed as `DayTimeIntervalType(DAY, MINUTE)` but not as `DayTimeIntervalType(DAY, SECOND)`.
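
As a rough illustration (a sketch only, assuming a Spark 3.2+ spark-shell session where `spark` is available; the printed schema is approximate), the tightest type shows up directly in the literal's schema:

```scala
// Sketch only: after this change the literal below should carry
// DayTimeIntervalType(DAY, MINUTE) rather than DayTimeIntervalType(DAY, SECOND).
val df = spark.sql("SELECT INTERVAL '10 20:30' DAY TO MINUTE AS i")
df.printSchema()
// root
//  |-- i: interval day to minute (nullable = false)
```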

### Does this PR introduce _any_ user-facing change?

No because `DayTimeIntervalType` will be introduced in `3.2.0`.

### How was this patch tested?

New tests.

Closes #32892 from sarutak/tight-daytime-interval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-06-14 10:06:19 +03:00
Kousuke Saruta 7978fdc97b [SPARK-35736][SQL] Parse any day-time interval types in SQL
### What changes were proposed in this pull request?
This PR adds a feature which allows the parser to parse any day-time interval types in SQL.

### Why are the changes needed?
To comply with the ANSI standard, we additionally need to support the following types.

* INTERVAL DAY
* INTERVAL DAY TO HOUR
* INTERVAL DAY TO MINUTE
* INTERVAL HOUR
* INTERVAL HOUR TO MINUTE
* INTERVAL HOUR TO SECOND
* INTERVAL MINUTE
* INTERVAL MINUTE TO SECOND
* INTERVAL SECOND

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New tests.

Closes #32893 from sarutak/parse-any-day-time.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-06-14 00:13:50 +03:00
Kun Wan 69aa7ad11f [SPARK-35714][CORE] Bug fix for deadlock during the executor shutdown
### What changes were proposed in this pull request?

Bug fix for deadlock during the executor shutdown

### Why are the changes needed?

When an executor receives a TERM signal, it (on the second TERM signal) locks the java.lang.Shutdown class and then calls Shutdown.exit() to exit the JVM.
Shutdown will call SparkShutdownHook to shut down the executor.
During the executor shutdown phase, a RemoteProcessDisconnected event will be sent to the RPC inbox, and then WorkerWatcher will try to call System.exit(-1) again.
Because java.lang.Shutdown is already locked, a deadlock occurs.
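
The shape of the deadlock is easier to see in isolation. Below is a minimal, self-contained sketch of the same pattern (not Spark code; the threads only play roles analogous to SparkShutdownHook and WorkerWatcher):

```scala
object ShutdownDeadlockDemo {
  def main(args: Array[String]): Unit = {
    Runtime.getRuntime.addShutdownHook(new Thread(() => {
      // Plays the role of WorkerWatcher reacting to RemoteProcessDisconnected
      // during executor shutdown and calling System.exit(-1) again.
      val t = new Thread(() => System.exit(-1))
      t.start()
      t.join() // the hook waits on a thread that is blocked on the Shutdown lock
    }))
    // The first (TERM-triggered) exit acquires the java.lang.Shutdown lock and runs
    // the hook above; the nested exit can never acquire the lock, so the JVM hangs.
    System.exit(0)
  }
}
```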

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Test case "task reaper kills JVM if killed tasks keep running for too long" in JobCancellationSuite

Closes #32868 from wankunde/SPARK-35714.

Authored-by: Kun Wan <wankun@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-06-13 16:01:00 -05:00
Kent Yao 1125afd462 [MINOR][K8S] Print the driver pod name instead of Some(name) if absent
### What changes were proposed in this pull request?

Print the driver pod name instead of Some(name) if it is absent.

### Why are the changes needed?

Fix the incorrect error hint.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

Closes #32889 from yaooqinn/minork8s.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-13 09:11:14 -07:00
Gengliang Wang 6272222bc0 [SPARK-35719][SQL] Support type conversion between timestamp and timestamp without time zone type
### What changes were proposed in this pull request?

1. Extend the Cast expression and support TimestampType in casting to TimestampWithoutTZType.
2. There was a mistake in casting TimestampWithoutTZType as TimestampType in https://github.com/apache/spark/pull/32864. The target value should be `sourceValue - timeZoneOffset` instead of being the same value.
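
The corrected arithmetic in point 2 can be sketched with plain java.time (illustrative only; the helper name and signature are hypothetical, not Spark internals):

```scala
import java.time.{LocalDateTime, ZoneId}

// Interpreting a wall-clock (timestamp without time zone) value in a zone yields an
// instant whose epoch micros equal the local epoch micros minus the zone offset,
// i.e. sourceValue - timeZoneOffset as described above.
def localDateTimeToInstantMicros(ldt: LocalDateTime, zone: ZoneId): Long = {
  val instant = ldt.atZone(zone).toInstant
  instant.getEpochSecond * 1000000L + instant.getNano / 1000L
}

// localDateTimeToInstantMicros(LocalDateTime.parse("2021-06-13T18:44:24"), ZoneId.of("UTC+03:00"))
// is three hours' worth of microseconds smaller than the same wall-clock time at UTC.
```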

### Why are the changes needed?

To conform to the ANSI SQL standard, which requires support for such casting.

### Does this PR introduce _any_ user-facing change?

No, the new timestamp type is not released yet.

### How was this patch tested?

Unit test

Closes #32878 from gengliangwang/timestampToTimestampWithoutTZ.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-06-13 18:44:24 +03:00
Haiyang Sun 0ba1d3852b [SPARK-35701][SQL] Use copy-on-write semantics for SQLConf registered configurations
### What changes were proposed in this pull request?

Using copy-on-write for `SQLConf.sqlConfEntries` and `SQLConf.staticConfKeys` to reduce contention in concurrent workloads.

### Why are the changes needed?

The global locks used to protect `SQLConf.sqlConfEntries` map and the `SQLConf.staticConfKeys` set can cause significant contention on the `SQLConf` instance in a concurrent setting.

Using copy-on-write versions should reduce the contention given that modifications to the configs are relatively rare.
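
A generic copy-on-write registry looks roughly like the sketch below (illustrative only; `SQLConf` itself may use different concrete structures): reads dereference an immutable snapshot without locking, and the rare writes copy the map under a lock.

```scala
import java.util.concurrent.atomic.AtomicReference

class CopyOnWriteRegistry[K, V] {
  private val snapshot = new AtomicReference(Map.empty[K, V])

  // Lock-free read path: just dereference the current immutable map.
  def get(key: K): Option[V] = snapshot.get.get(key)

  // Writes (config registration) are rare, so copying the whole map is cheap.
  def register(key: K, value: V): Unit = synchronized {
    snapshot.set(snapshot.get + (key -> value))
  }
}
```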

Closes #32865 from haiyangsun-db/SPARK-35701.

Authored-by: Haiyang Sun <haiyang.sun@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-06-12 14:59:48 -07:00
Kousuke Saruta 80f7989d9a [SPARK-35734][SQL] Format day-time intervals using type fields
### What changes were proposed in this pull request?

This PR add a feature which formats day-time interval to strings using the start and end fields of `DayTimeIntervalType`.

### Why are the changes needed?

Currently, they are ignored, and any `DayTimeIntervalType` is formatted as `INTERVAL DAY TO SECOND`.

### Does this PR introduce _any_ user-facing change?

Yes. The format of day-time intervals is determined by the start and end fields.

### How was this patch tested?

New test.

Closes #32891 from sarutak/interval-format.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-06-12 21:45:12 +03:00
shahid 450b415028 [SPARK-35746][UI] Fix taskid in the stage page task event timeline
### What changes were proposed in this pull request?
The task id shown in the timeline plot on the Stage Page is incorrect.

### Why are the changes needed?
Map event timeline plots to correct task
**Before:**
![image](https://user-images.githubusercontent.com/23054875/121761077-81775800-cb4b-11eb-8ec6-ee71926a6549.png)

**After**
![image](https://user-images.githubusercontent.com/23054875/121761195-02ceea80-cb4c-11eb-8ce6-07bb1cca190e.png)
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually tested

Closes #32888 from shahidki31/shahid/fixtaskid.

Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-06-12 15:38:41 +09:00
Hyukjin Kwon 76e08a8e3d [SPARK-35738][PYTHON] Support 'y' properly in DataFrame with non-numeric columns with plots
### What changes were proposed in this pull request?

This PR proposes to port the fix https://github.com/databricks/koalas/pull/2172.

```python
ks.DataFrame({'a': [1, 2, 3], 'b':["a", "b", "c"], 'c': [4, 5, 6]}).plot(kind='hist', x='a', y='c', bins=200)
```

**Before:**

```
pyspark.sql.utils.AnalysisException: cannot resolve 'least(min(a), min(b), min(c))' due to data type mismatch: The expressions should all have the same type, got LEAST(bigint, string, bigint).;
'Aggregate [unresolvedalias(least(min(a#1L), min(b#2), min(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1)), unresolvedalias(greatest(max(a#1L), max(b#2), max(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1))]
+- Project [a#1L, b#2, c#3L]
   +- Project [__index_level_0__#0L, a#1L, b#2, c#3L, monotonically_increasing_id() AS __natural_order__#8L]
      +- LogicalRDD [__index_level_0__#0L, a#1L, b#2, c#3L], false
```

**After:**

```python
Figure({
    'data': [{'hovertemplate': 'variable=a<br>value=%{text}<br>count=%{y}',
              'name': 'a',
...
```

### Why are the changes needed?

To match the behaviour with pandas' and allow users to set `x` and `y` in a DataFrame with non-numeric columns.

### Does this PR introduce _any_ user-facing change?

No to end users since the change is not released yet. Yes to devs as described before.

### How was this patch tested?

Manually tested, added a test and tested in notebooks:

![Screen Shot 2021-06-11 at 9 11 25 PM](https://user-images.githubusercontent.com/6477701/121686038-a47a1b80-cafb-11eb-8f8e-8d968db7ebef.png)

![Screen Shot 2021-06-11 at 9 48 58 PM](https://user-images.githubusercontent.com/6477701/121688858-e22c7380-cafe-11eb-9d0a-adcbe560030f.png)

Closes #32884 from HyukjinKwon/fix-hist-plot.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-12 14:36:46 +09:00
Chao Sun 9c7250fa73 [SPARK-35321][SQL] Don't register Hive permanent functions when creating Hive client
### What changes were proposed in this pull request?

Instantiate a new Hive client through `Hive.getWithoutRegisterFns(conf, false)` instead of `Hive.get(conf)`, if `Hive` version is >= '2.3.9' (the built-in version).

### Why are the changes needed?

[HIVE-10319](https://issues.apache.org/jira/browse/HIVE-10319) introduced a new API `get_all_functions` which is only supported in Hive 1.3.0/2.0.0 and up. As a result, when Spark 3.x talks to an HMS service of version 1.2 or lower, the following error will occur:
```
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions'
        at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897)
        at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248)
        at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
        ... 96 more
Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions'
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833)
```

The `get_all_functions` is called only when `doRegisterAllFns` is set to true:
```java
  private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException {
    conf = c;
    if (doRegisterAllFns) {
      registerAllFunctionsOnce();
    }
  }
```

What this does is register all Hive permanent functions defined in HMS in Hive's `FunctionRegistry` class, by iterating through the results from `get_all_functions`. To Spark, this seems unnecessary, as it loads Hive permanent (not built-in) UDFs by directly calling the HMS API, i.e., `get_function`. The `FunctionRegistry` is only used for loading Hive built-in functions that are not supported by Spark; at this time, that only applies to `histogram_numeric`.

[HIVE-21563](https://issues.apache.org/jira/browse/HIVE-21563) introduced a new API `getWithoutRegisterFns` which skips the above registration and is available in Hive 2.3.9. Therefore, Spark should adopt it to avoid the cost.
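
A simplified sketch of the selection logic (not the exact Spark code path; the helper and flag names here are hypothetical):

```scala
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.metadata.Hive

def newHiveClient(conf: HiveConf, builtInHiveSupportsSkip: Boolean): Hive =
  if (builtInHiveSupportsSkip) {
    // Hive >= 2.3.9: skip registerAllFunctionsOnce and the get_all_functions call
    Hive.getWithoutRegisterFns(conf, false)
  } else {
    // Older clients: keep the previous behavior
    Hive.get(conf)
  }
```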

### Does this PR introduce _any_ user-facing change?

Yes. With this fix, Spark should now be able to talk to an HMS server running Hive 1.2.x or lower.

### How was this patch tested?

Manually started a HMS server of Hive version 1.2.2. Without the PR it failed with the above exception. With the PR the error disappeared and I can successfully perform common operations such as create table, create database, list tables, etc.

Closes #32887 from sunchao/SPARK-35321-new.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-06-12 10:32:30 +08:00
Liang-Chi Hsieh 703376e8a9 [SPARK-35689][SS] Add log warn when keyWithIndexToValue returns null value
### What changes were proposed in this pull request?

This patch adds log warn when `keyWithIndexToValue` returns null value in `SymmetricHashJoinStateManager`.

### Why are the changes needed?

Once we get a null from the state store in SymmetricHashJoinStateManager, it is better to emit meaningful logging for the case; it helps debugging.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #32828 from viirya/fix-ss-joinstatemanager-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-06-12 10:17:09 +09:00
Takuya UESHIN 4d21b94d13 [SPARK-35475][PYTHON] Fix disallow_untyped_defs mypy checks
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/namespace.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #32871 from ueshin/issues/SPARK-35475/disallow_untyped_defs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-11 11:07:11 -07:00
Chendi Xue e958833c72 [SPARK-35396][SQL][TESTS][FOLLOWUP] Add a UT to check if a user-defined cachedBatch is completely released
### What changes were proposed in this pull request?
This PR adds a UT to check if user-defined cached batches are completely released when clearCache is called.

### Why are the changes needed?
Add a new UT file RefCountedTestCachedBatchSerializerSuite.scala

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT is added, org.apache.spark.sql.execution.columnar.RefCountedTestCachedBatchSerializerSuite

Closes #32717 from xuechendi/support_manual_close_in_InMemoryRelation.

Authored-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-11 08:20:26 -07:00
Max Gekk d53831ff5c [SPARK-35704][SQL] Add fields to DayTimeIntervalType
### What changes were proposed in this pull request?
Extend DayTimeIntervalType to support interval fields. Valid interval field values:
- 0 (DAY)
- 1 (HOUR)
- 2 (MINUTE)
- 3 (SECOND)

After the changes, the following day-time interval types are supported:
1. `DayTimeIntervalType(0, 0)` or `DayTimeIntervalType(DAY, DAY)`
2. `DayTimeIntervalType(0, 1)` or `DayTimeIntervalType(DAY, HOUR)`
3. `DayTimeIntervalType(0, 2)` or `DayTimeIntervalType(DAY, MINUTE)`
4. `DayTimeIntervalType(0, 3)` or `DayTimeIntervalType(DAY, SECOND)`. **It is the default one**. The second fraction precision is microseconds.
5. `DayTimeIntervalType(1, 1)` or `DayTimeIntervalType(HOUR, HOUR)`
6. `DayTimeIntervalType(1, 2)` or `DayTimeIntervalType(HOUR, MINUTE)`
7. `DayTimeIntervalType(1, 3)` or `DayTimeIntervalType(HOUR, SECOND)`
8. `DayTimeIntervalType(2, 2)` or `DayTimeIntervalType(MINUTE, MINUTE)`
9. `DayTimeIntervalType(2, 3)` or `DayTimeIntervalType(MINUTE, SECOND)`
10. `DayTimeIntervalType(3, 3)` or `DayTimeIntervalType(SECOND, SECOND)`

### Why are the changes needed?
In the current implementation, Spark supports only `interval day to second`, but the SQL standard allows specifying the start and end fields. The changes will allow Spark to follow the ANSI SQL standard more precisely.

### Does this PR introduce _any_ user-facing change?
Yes but `DayTimeIntervalType` has not been released yet.

### How was this patch tested?
By existing test suites.

Closes #32849 from MaxGekk/day-time-interval-type-units.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-06-11 16:16:33 +03:00
Tanel Kiis 692dc66c4a [SPARK-35695][SQL] Collect observed metrics from cached and adaptive execution sub-trees
### What changes were proposed in this pull request?

Collect observed metrics from cached and adaptive execution sub-trees.

### Why are the changes needed?

Currently persisting/caching will hide all observed metrics in that sub-tree from reaching the `QueryExecutionListeners`. Adaptive query execution can also hide the metrics from reaching `QueryExecutionListeners`.

### Does this PR introduce _any_ user-facing change?

Bugfix

### How was this patch tested?

New UTs

Closes #32862 from tanelk/SPARK-35695_collect_metrics_persist.

Lead-authored-by: Tanel Kiis <tanel.kiis@gmail.com>
Co-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-11 21:03:08 +08:00
RoryQi 57ce64c511 [SPARK-35706][SQL] Consider making the ':' in STRUCT data type definition optional
### What changes were proposed in this pull request?

The STRUCT type syntax is defined like this:

STRUCT<fieldName: fieldType [NOT NULL] [COMMENT stringLiteral] [, ...]>

So the field list is nearly the same as a column list.

If we could make the ':' optional, it would be much cleaner and less proprietary.
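
For illustration (assuming an active spark-shell session providing `spark`; the table names are hypothetical), both spellings of the field list would be accepted after the change:

```scala
// With the existing ':' separator
spark.sql("CREATE TABLE t1 (s STRUCT<a: INT, b: STRING>) USING parquet")
// Without ':', written the same way as a column list
spark.sql("CREATE TABLE t2 (s STRUCT<a INT, b STRING>) USING parquet")
```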

### Why are the changes needed?
ease of use

### Does this PR introduce _any_ user-facing change?
Yes, the field list of a STRUCT type can now be written almost like a column list, with the ':' optional.

### How was this patch tested?
unit tests

Closes #32858 from jerqi/master.

Authored-by: RoryQi <1242949407@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-11 12:58:32 +00:00
dgd-contributor 6e1aa15679 [SPARK-35652][SQL] joinWith on two table generated from same one
### What changes were proposed in this pull request?
It seems like Spark's inner join performs a cartesian join when self-joining using `joinWith`.

To reproduce this issue:
```
val df = spark.range(0,3)
df.joinWith(df, df("id") === df("id")).show()
```

Before this pull request, the result is
```
+---+---+
| _1| _2|
+---+---+
|  0|  0|
|  0|  1|
|  0|  2|
|  1|  0|
|  1|  1|
|  1|  2|
|  2|  0|
|  2|  1|
|  2|  2|
+---+---+
```

The expected result is
```
+---+---+
| _1| _2|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
+---+---+
```
### Why are the changes needed?
correctness

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
add test

Closes #32863 from dgd-contributor/SPARK-35652_join_and_joinWith_in_seft_joining.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-11 20:36:50 +08:00
itholic ebe529e8e1 [SPARK-35591][PYTHON][DOCS] Rename "Koalas" to "pandas API on Spark" in the documents
### What changes were proposed in this pull request?

This PR proposes changing the name "Koalas" to "Pandas APIs on Spark" in the documents.

### Why are the changes needed?

Since we don't use the name "Koalas" anymore, we should use "Pandas APIs on Spark" instead.

### Does this PR introduce _any_ user-facing change?

Yes, the name "Koalas" is renamed to "Pandas APIs on Spark" in the documents.

### How was this patch tested?

Manually built the docs and checked one by one.

Closes #32835 from itholic/SPARK-35591.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-11 20:42:38 +09:00
Gengliang Wang 62be22e929 [SPARK-35694][INFRA][FOLLOWUP] Increase the default JVM stack size of SBT/Maven
### What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/32838, we set the default JVM stack size to 16M from 4M.
However, there are still stack overflow errors in builds:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139672/console

Let's update the value to 64M

### Why are the changes needed?

Make test build stable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual trigger test builds.

Closes #32879 from gengliangwang/increaseStackAgain.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-11 18:51:07 +09:00
Liang-Chi Hsieh c463472e85 [SPARK-35439][SQL][FOLLOWUP] ExpressionContainmentOrdering should not sort unrelated expressions
### What changes were proposed in this pull request?

This is a followup of #32586. We introduced `ExpressionContainmentOrdering` to sort common expressions according to their parent-child relations. For unrelated expressions, the ordering previously returned -1, which is not correct and can possibly lead to a transitivity issue.
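
A toy comparator (not Spark's `ExpressionContainmentOrdering`) shows why answering -1 for unrelated values is unsafe: the sign of the comparison no longer flips when the arguments are swapped, so sorting becomes order-dependent and TimSort may even reject the comparator.

```scala
val badOrdering = new Ordering[String] {
  def compare(a: String, b: String): Int =
    if (a == b) 0
    else if (a.contains(b)) 1       // a "contains" b => a sorts after b
    else if (b.contains(a)) -1      // b "contains" a => a sorts before b
    else -1                         // unrelated: always -1, regardless of argument order
}

// badOrdering.compare("foo", "bar") == -1 and badOrdering.compare("bar", "foo") == -1,
// violating the Ordering contract ("Comparison method violates its general contract!").
```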

### Why are the changes needed?

To fix the possible transitivity issue of `ExpressionContainmentOrdering`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #32870 from viirya/SPARK-35439-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-06-11 16:13:46 +09:00
Gengliang Wang e9af4576d5 [SPARK-35718][SQL] Support casting of Date to timestamp without time zone type
### What changes were proposed in this pull request?

Extend the Cast expression and support DateType in casting to TimestampWithoutTZType.

### Why are the changes needed?

To conform to the ANSI SQL standard, which requires support for such casting.

### Does this PR introduce _any_ user-facing change?

No, the new timestamp type is not released yet.

### How was this patch tested?

Unit test

Closes #32873 from gengliangwang/dateToTswtz.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-11 05:41:28 +00:00
Chao Sun e9ccf4a50c [SPARK-35640][SQL] Refactor Parquet vectorized reader to remove duplicated code paths
### What changes were proposed in this pull request?

1. Remove duplicated code in the form of `readXXX` in `VectorizedRleValuesReader`. For instance:
```java
  public void readIntegers(
      int total,
      WritableColumnVector c,
      int rowId,
      int level,
      VectorizedValuesReader data) throws IOException {
    int left = total;
    while (left > 0) {
      if (this.currentCount == 0) this.readNextGroup();
      int n = Math.min(left, this.currentCount);
      switch (mode) {
        case RLE:
          if (currentValue == level) {
            data.readIntegers(n, c, rowId);
          } else {
            c.putNulls(rowId, n);
          }
          break;
        case PACKED:
          for (int i = 0; i < n; ++i) {
            if (currentBuffer[currentBufferIdx++] == level) {
              c.putInt(rowId + i, data.readInteger());
            } else {
              c.putNull(rowId + i);
            }
          }
          break;
      }
      rowId += n;
      left -= n;
      currentCount -= n;
    }
  }
```
and replace with:
```java
  public void readBatch(
       int total,
       int offset,
       WritableColumnVector values,
       int maxDefinitionLevel,
       VectorizedValuesReader valueReader,
       ParquetVectorUpdater updater) throws IOException {
     int left = total;
     while (left > 0) {
       if (this.currentCount == 0) this.readNextGroup();
       int n = Math.min(left, this.currentCount);
       switch (mode) {
         case RLE:
           if (currentValue == maxDefinitionLevel) {
             updater.updateBatch(n, offset, values, valueReader);
           } else {
             values.putNulls(offset, n);
           }
           break;
         case PACKED:
           for (int i = 0; i < n; ++i) {
             if (currentBuffer[currentBufferIdx++] == maxDefinitionLevel) {
               updater.update(offset + i, values, valueReader);
             } else {
               values.putNull(offset + i);
             }
           }
           break;
       }
       offset += n;
       left -= n;
       currentCount -= n;
     }
   }
```
where the `ParquetVectorUpdater` is type specific, and has different implementations under `updateBatch` and `update`. Together, this also changes code paths handling timestamp types to use the batch read API for decoding definition levels.

2. Similar to the above, this removes code duplication in `VectorizedColumnReader.decodeDictionaryIds`. Now different implementations are under `ParquetVectorUpdater.decodeSingleDictionaryId`.

### Why are the changes needed?

`VectorizedRleValuesReader` and `VectorizedColumnReader` are becoming increasingly harder to maintain, as any change touching the above logic **will need to be replicated in 20+ places**. The issue becomes even more serious when we are going to implement column index (for instance, see how the change [here](https://github.com/apache/spark/pull/32753/files#diff-a01e174e178366aadf07f64ee690d47d343b2ca416a4a2b2ea735887c22d5934R191) has to be replicated multiple times) and complex type support (in progress) for the vectorized path.

In addition, currently dictionary decoding (see `VectorizedColumnReader.decodeDictionaryIds`) and non-dictionary decoding are handled separately, and therefore the same (very complicated) branching logic based on input Spark & Parquet types have to be replicated in two places, which is another burden for code maintenance.

The original intention was performance. However, these days JIT compilers tend to be very effective at this and will inline virtual calls aggressively to eliminate the method invocation costs (see [this](https://shipilev.net/blog/2015/black-magic-method-dispatch/) and [this](http://insightfullogic.com/blog/2014/may/12/fast-and-megamorphic-what-influences-method-invoca/)). I've also done benchmarks using a modified `DataSourceReadBenchmark` and `DateTimeRebaseBenchmark`, and the results are almost exactly the same before and after the change. The results can be found [here](https://gist.github.com/sunchao/674afbf942ccc2370bdcfa33efb4471c), and [here's](https://github.com/sunchao/spark/tree/parquet-refactor) the source code.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #32777 from sunchao/SPARK-35640.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2021-06-11 05:39:43 +00:00
Yuming Wang 463daabd5a [SPARK-34512][BUILD][SQL] Upgrade built-in Hive to 2.3.9
### What changes were proposed in this pull request?

This pr upgrades built-in Hive to 2.3.9. Hive 2.3.9 changes:
- [HIVE-17155] - findConfFile() in HiveConf.java has some issues with the conf path
- [HIVE-24797] - Disable validate default values when parsing Avro schemas
- [HIVE-24608] - Switch back to get_table in HMS client for Hive 2.3.x
- [HIVE-21200] - Vectorization: date column throwing java.lang.UnsupportedOperationException for parquet
- [HIVE-21563] - Improve Table#getEmptyTable performance by disabling registerAllFunctionsOnce
- [HIVE-19228] - Remove commons-httpclient 3.x usage

### Why are the changes needed?

Fix regression caused by AVRO-2035.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #32750 from wangyum/SPARK-34512.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-10 20:44:35 -07:00
Pawel Ptaszynski 912d60b6dd [SPARK-35709][DOCS] Remove the reference to third party Nomad integration project
### What changes were proposed in this pull request?
This PR updates the documentation by removing the reference to [hashicorp/nomad-spark](https://github.com/hashicorp/nomad-spark), which was deprecated in April 2020 and will no longer be developed.

### Why are the changes needed?
To keep the documentation up to date and remove confusion for potential users interested in running Spark on Nomad.

### Does this PR introduce _any_ user-facing change?
Yes. A change to the documentation.

### How was this patch tested?
Generated the documentation and checked that everything is alright in the output.

Closes #32860 from pptaszynski/doc/remove-spark-nomad-project-reference.

Authored-by: Pawel Ptaszynski <pawel.ptaszynski@bolt.eu>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-06-11 08:34:59 +09:00
Dongjoon Hyun cf07036d9b [SPARK-35593][K8S][CORE] Support shuffle data recovery on the reused PVCs
### What changes were proposed in this pull request?

Previously, the following two commits allow driver-owned on-demand PVC reuse.
- SPARK-35182 Support driver-owned on-demand PVC
- SPARK-35416 Support PersistentVolumeClaim Reuse

This PR aims to recover the shuffle data on those remounted PVCs. The lifecycle of the PVCs is tied to that of the Spark jobs. Since this is a K8s-specific feature, the `ShuffleDataIO` plugin is used.

### Why are the changes needed?

Although Pod is killed, we can remount PVCs and recover some data from it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the newly added test cases.

Closes #32730 from dongjoon-hyun/SPARK-RECOVER-SHUFFLE-DATA.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-10 16:06:58 -07:00
Ye Zhou a97885bb2c [SPARK-33350][SHUFFLE] Add support to DiskBlockManager to create merge directory and to get the local shuffle merged data
### What changes were proposed in this pull request?
This is one of the patches for SPIP SPARK-30602 which is needed for push-based shuffle.

### Summary of changes:
Executors will create the merge directories under the application temp directory provided by YARN. The access control of the folder will be set to 770, so that the Shuffle Service can create merged shuffle files and write merged shuffle data into those files.
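
A minimal sketch of the permission setup described above (illustrative; the path is hypothetical and the real implementation goes through Spark/YARN utilities):

```scala
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.PosixFilePermissions

// 770: the executor (owner) and the shuffle service (same group) can create and
// write merged shuffle files; everyone else gets no access.
val mergeDir = Paths.get("/tmp/yarn-app-temp/merge_manager")
Files.createDirectories(mergeDir,
  PosixFilePermissions.asFileAttribute(PosixFilePermissions.fromString("rwxrwx---")))
```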

Serve the merged shuffle block fetch requests and read the merged shuffle blocks.

### Why are the changes needed?
Refer to the SPIP in SPARK-30602.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added unit tests.
The reference PR with the consolidated changes covering the complete implementation is also provided in SPARK-30602.
We have already verified the functionality and the improved performance as documented in the SPIP doc.

Lead-authored-by: Min Shen mshen@linkedin.com
Co-authored-by: Chandni Singh chsingh@linkedin.com
Co-authored-by: Ye Zhou yezhou@linkedin.com

Closes #32007 from zhouyejoe/SPARK-33350.

Lead-authored-by: Ye Zhou <yezhou@linkedin.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Co-authored-by: Min Shen <mshen@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-06-10 16:57:46 -05:00
Kent Yao bc1edba8f6 [SPARK-35692][K8S] Use AtomicInteger for executor id generating
### What changes were proposed in this pull request?

An AtomicInteger is enough for executor ids. In this PR, we use it to replace AtomicLong, like other cluster managers, e.g. YARN and standalone.
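
The change amounts to something like the following sketch (names illustrative, not the exact K8s backend code):

```scala
import java.util.concurrent.atomic.AtomicInteger

// Before: new AtomicLong(0L). An Int counter is more than enough for executor ids
// and matches what the YARN and standalone backends use.
val executorIdCounter = new AtomicInteger(0)
def nextExecutorId(): String = executorIdCounter.incrementAndGet().toString
```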

### Why are the changes needed?

See the discussion here https://github.com/apache/spark/pull/32610#discussion_r648007320

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

pass CI with existing tests

Closes #32837 from yaooqinn/SPARK-35692.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-10 13:42:07 -07:00
Kent Yao b4b78ce265 [SPARK-32975][K8S][FOLLOWUP] Avoid None.get exception
### What changes were proposed in this pull request?

A follow-up for SPARK-32975 to avoid an unexpected `None.get` exception.

Run SparkPi with Docker Desktop; as podName is an Option, we will get
```logtalk
21/06/09 01:09:12 ERROR Utils: Uncaught exception in thread main
java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:529)
	at scala.None$.get(Option.scala:527)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$1(ExecutorPodsAllocator.scala:110)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1417)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.start(ExecutorPodsAllocator.scala:111)
	at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.start(KubernetesClusterSchedulerBackend.scala:99)
	at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:581)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2686)
	at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:948)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:942)
	at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:30)
	at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

### Why are the changes needed?

fix a regression

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

Manual.

Closes #32830 from yaooqinn/SPARK-32975.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-10 13:39:39 -07:00
Gengliang Wang d21ff1318f [SPARK-35716][SQL] Support casting of timestamp without time zone to date type
### What changes were proposed in this pull request?

Extend the Cast expression and support TimestampWithoutTZType in casting to DateType.

### Why are the changes needed?

To conform to the ANSI SQL standard, which requires support for such casting.

### Does this PR introduce _any_ user-facing change?

No, the new timestamp type is not released yet.

### How was this patch tested?

Unit test

Closes #32869 from gengliangwang/castToDate.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-06-10 23:37:02 +03:00
Venkata krishnan Sowrirajan b5a1503585 [SPARK-32920][SHUFFLE] Finalization of Shuffle push/merge with Push based shuffle and preparation step for the reduce stage
### What changes were proposed in this pull request?

Summary of the changes made as part of this PR:

1. `DAGScheduler` changes to finalize a ShuffleMapStage which involves talking to all the shuffle mergers (`ExternalShuffleService`) and getting all the completed merge statuses.
2. Once the `ShuffleMapStage` finalization is complete, mark the `ShuffleMapStage` to be finalized which marks the stage as complete and subsequently letting the child stage start.
3. Also added the relevant tests to `DAGSchedulerSuite` for changes made as part of [SPARK-32919](https://issues.apache.org/jira/browse/SPARK-32919)

Lead-authored-by: Min Shen mshen@linkedin.com
Co-authored-by: Venkata krishnan Sowrirajan vsowrirajan@linkedin.com
Co-authored-by: Chandni Singh chsingh@linkedin.com

### Why are the changes needed?

Refer to [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests to DAGSchedulerSuite

Closes #30691 from venkata91/SPARK-32920.

Lead-authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Co-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-06-10 13:06:15 -05:00
Kousuke Saruta 44b695fbb0 [SPARK-35296][SQL] Allow Dataset.observe to work even if CollectMetricsExec in a task handles multiple partitions
### What changes were proposed in this pull request?

This PR fixes an issue that `Dataset.observe` doesn't work if `CollectMetricsExec` in a task handles multiple partitions.
If `coalesce` follows `observe` and the number of partitions shrinks after `coalesce`, `CollectMetricsExec` can handle multiple partitions in a task.

### Why are the changes needed?

The current implementation of `CollectMetricsExec` doesn't consider the case where it handles multiple partitions.
Because a new `updater` is created for each partition even though those partitions belong to the same task, `collector.setState(updater)` raises an assertion error.
This is a simple reproducible example.
```
$ bin/spark-shell --master "local[1]"
scala> spark.range(1, 4, 1, 3).observe("my_event", count($"id").as("count_val")).coalesce(2).collect
```
```
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:208)
	at org.apache.spark.sql.execution.AggregatingAccumulator.setState(AggregatingAccumulator.scala:204)
	at org.apache.spark.sql.execution.CollectMetricsExec.$anonfun$doExecute$2(CollectMetricsExec.scala:72)
	at org.apache.spark.sql.execution.CollectMetricsExec.$anonfun$doExecute$2$adapted(CollectMetricsExec.scala:71)
	at org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:125)
	at org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1(TaskContextImpl.scala:124)
	at org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1$adapted(TaskContextImpl.scala:124)
	at org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1(TaskContextImpl.scala:137)
	at org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1$adapted(TaskContextImpl.scala:135)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #32786 from sarutak/fix-collectmetricsexec.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-11 01:20:35 +08:00
Emil Ejbyfeldt e2e3fe7782 [SPARK-35653][SQL] Fix CatalystToExternalMap interpreted path fails for Map with case classes as keys or values
### What changes were proposed in this pull request?
Use the key/value LambdaFunction to convert the elements instead of
using CatalystTypeConverters.createToScalaConverter. This is how it is
done in MapObjects and that correctly handles Arrays with case classes.

### Why are the changes needed?
Before these changes the added test cases would fail with the following:
```
[info] - encode/decode for map with case class as value: Map(1 -> IntAndString(1,a)) (interpreted path) *** FAILED *** (64 milliseconds)
[info]   Encoded/Decoded data does not match input data
[info]
[info]   in:  Map(1 -> IntAndString(1,a))
[info]   out: Map(1 -> [1,a])
[info]   types: scala.collection.immutable.Map$Map1 [info]
[info]   Encoded Data: [org.apache.spark.sql.catalyst.expressions.UnsafeMapData5ecf5d9e]
[info]   Schema: value#823
[info]   root
[info]   -- value: map (nullable = true)
[info]       |-- key: integer
[info]       |-- value: struct (valueContainsNull = true)
[info]       |    |-- i: integer (nullable = false)
[info]       |    |-- s: string (nullable = true)
[info]
[info]
[info]   fromRow Expressions:
[info]   catalysttoexternalmap(lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178), lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178), lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179), if (isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179))) null else newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString), input[0, map<int,struct<i:int,s:string>>, true], interface scala.collection.immutable.Map
[info]   :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178)
[info]   :- lambdavariable(CatalystToExternalMap_key, IntegerType, false, 178)
[info]   :- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)
[info]   :- if (isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179))) null else newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString)
[info]   :  :- isnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179))
[info]   :  :  +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)
[info]   :  :- null
[info]   :  +- newInstance(class org.apache.spark.sql.catalyst.encoders.IntAndString)
[info]   :     :- assertnotnull(lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i)
[info]   :     :  +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).i
[info]   :     :     +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)
[info]   :     +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).s.toString
[info]   :        +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179).s
[info]   :           +- lambdavariable(CatalystToExternalMap_value, StructField(i,IntegerType,false), StructField(s,StringType,true), true, 179)
[info]   +- input[0, map<int,struct<i:int,s:string>>, true] (ExpressionEncoderSuite.scala:627)
```
So using a map with case classes for keys or values and using the interpreted path would incorrectly deserialize data from the catalyst representation.

### Does this PR introduce _any_ user-facing change?
Yes, it fixes the bug.

### How was this patch tested?
Existing and new unit tests in the ExpressionEncoderSuite

Closes #32783 from eejbyfeldt/fix-interpreted-path-for-map-with-case-classes.

Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-06-10 09:37:27 -07:00
Gengliang Wang 4180692135 [SPARK-35711][SQL] Support casting of timestamp without time zone to timestamp type
### What changes were proposed in this pull request?

Extend the Cast expression and support TimestampWithoutTZType in casting to TimestampType.

### Why are the changes needed?

To conform to the ANSI SQL standard, which requires support for such casting.

### Does this PR introduce _any_ user-facing change?

No, the new timestamp type is not released yet.

### How was this patch tested?

Unit test

Closes #32864 from gengliangwang/castToTimestamp.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-06-10 23:03:52 +08:00
Terry Kim 88f1d82a46 [SPARK-34524][SQL][FOLLOWUP] Remove unused checkAlterTablePartition in CheckAnalysis.scala
### What changes were proposed in this pull request?

#31637 removed the usage of `CheckAnalysis.checkAlterTablePartition` but didn't remove the function.

### Why are the changes needed?

To remove an unused function.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #32855 from imback82/SPARK-34524-followup.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-10 12:43:09 +00:00
Fu Chen 5280f02747 [SPARK-35673][SQL] Fix user-defined hint and unrecognized hint in subquery
### What changes were proposed in this pull request?

Use `UnresolvedHint.resolved = child.resolved` instead of `UnresolvedHint.resolved = false`, so that a plan containing an `UnresolvedHint` child can be optimized by the rules in the `Resolution` batch.

For instance, before this PR, the following plan can't be optimized by `ResolveReferences`.
```
!'Project [*]
 +- SubqueryAlias __auto_generated_subquery_name
    +- UnresolvedHint use_hash
       +- Project [42 AS 42#10]
          +- OneRowRelation
```

### Why are the changes needed?

Fix the hint-in-subquery bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #32841 from cfmcgrady/SPARK-35673.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-10 15:32:10 +08:00
Kevin Su cadd3a0588 [SPARK-35474] Enable disallow_untyped_defs mypy check for pyspark.pandas.indexing
### What changes were proposed in this pull request?

Adds more type annotations in the file:
`python/pyspark/pandas/spark/indexing.py`
and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.
`./dev/lint-python`

Closes #32738 from pingsutw/SPARK-35474.

Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-09 22:35:12 -07:00
dgd-contributor aa3de40773 [SPARK-35679][SQL] instantToMicros overflow
### Why are the changes needed?
When Long.MinValue is cast to an instant, secs is floored in the function microsToInstant, which causes an overflow when multiplied with MICROS_PER_SECOND.

```
def microsToInstant(micros: Long): Instant = {
  val secs = Math.floorDiv(micros, MICROS_PER_SECOND)
  // Unfolded Math.floorMod(us, MICROS_PER_SECOND) to reuse the result of
  // the above calculation of `secs` via `floorDiv`.
  val mos = micros - secs * MICROS_PER_SECOND  <- it will overflow here
  Instant.ofEpochSecond(secs, mos * NANOS_PER_MICROS)
}
```

But the overflow is acceptable because it doesn't change the result.

However, when converting the instant back to a micros value, it will raise an overflow error:

```
def instantToMicros(instant: Instant): Long = {
  val us = Math.multiplyExact(instant.getEpochSecond, MICROS_PER_SECOND) <- It overflow here
  val result = Math.addExact(us, NANOSECONDS.toMicros(instant.getNano))
  result
}
```

Code to reproduce this error
```
instantToMicros(microsToInstant(Long.MinValue))
```
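
One way to avoid the intermediate overflow (a sketch only, not necessarily the exact fix merged here) is to special-case the minimum number of seconds so the multiplication stays within the Long range:

```scala
import java.time.Instant
import java.util.concurrent.TimeUnit.NANOSECONDS

val MICROS_PER_SECOND = 1000000L
val MIN_SECONDS = Math.floorDiv(Long.MinValue, MICROS_PER_SECOND)

def instantToMicros(instant: Instant): Long = {
  val secs = instant.getEpochSecond
  if (secs == MIN_SECONDS) {
    // Shift by one second before multiplying, then compensate, so the
    // intermediate product never leaves the Long range.
    val us = Math.multiplyExact(secs + 1, MICROS_PER_SECOND)
    Math.addExact(us, NANOSECONDS.toMicros(instant.getNano) - MICROS_PER_SECOND)
  } else {
    val us = Math.multiplyExact(secs, MICROS_PER_SECOND)
    Math.addExact(us, NANOSECONDS.toMicros(instant.getNano))
  }
}
```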

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Test added

Closes #32839 from dgd-contributor/SPARK-35679_instantToMicro.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-06-10 08:08:51 +03:00
ulysses-you 8dde20a993 [SPARK-35675][SQL] EnsureRequirements remove shuffle should respect PartitioningCollection
### What changes were proposed in this pull request?

Add `PartitioningCollection` handling in `EnsureRequirements` when removing shuffles.

### Why are the changes needed?

Currently `EnsureRequirements` only checks whether the child has a semantically equal `HashPartitioning` and removes the redundant shuffle. We can enhance this case using `PartitioningCollection`.

### Does this PR introduce _any_ user-facing change?

Yes, plan might be changed.

### How was this patch tested?

Add test.

Closes #32815 from ulysses-you/shuffle-node.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
2021-06-10 13:03:47 +08:00
Linhong Liu 87d2ffbbcf [MINOR][SQL] No need to normalize name for built-in functions
### What changes were proposed in this pull request?
Add an `internalRegisterFunction` to the built-in function registry so that we can skip the unnecessary function normalization.

### Why are the changes needed?
small refactor

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing ut

Closes #32842 from linhongliu-db/function-refactor.

Lead-authored-by: Linhong Liu <linhong.liu@databricks.com>
Co-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-10 04:35:26 +00:00
Kousuke Saruta 7e99b65295 [SPARK-35194][SQL][FOLLOWUP] Change Seq to collections.Seq in NestedColumnAliasing to work with Scala 2.13
### What changes were proposed in this pull request?

This PR changes an occurrence of `Seq` to `collections.Seq` in `NestedColumnAliasing`.

### Why are the changes needed?

In the current master, `NestedColumnAliasing` doesn't work with Scala 2.13 and the relevant tests fail.
The following are examples.

* `NestedColumnAliasingSuite`
* Subclasses of `SchemaPruningSuite`
* `ColumnPruningSuite`

```
NestedColumnAliasingSuite:
[info] - Pushing a single nested field projection *** FAILED *** (14 milliseconds)
[info]   scala.MatchError: (none#211451,ArrayBuffer(name#211451.middle)) (of class scala.Tuple2)
[info]   at org.apache.spark.sql.catalyst.optimizer.NestedColumnAliasing$.$anonfun$getAttributeToExtractValues$5(NestedColumnAliasing.scala:258)
[info]   at scala.collection.StrictOptimizedMapOps.flatMap(StrictOptimizedMapOps.scala:31)
[info]   at scala.collection.StrictOptimizedMapOps.flatMap$(StrictOptimizedMapOps.scala:30)
[info]   at scala.collection.immutable.HashMap.flatMap(HashMap.scala:39)
[info]   at org.apache.spark.sql.catalyst.optimizer.NestedColumnAliasing$.getAttributeToExtractValues(NestedColumnAliasing.scala:258)
```
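
For background, a small self-contained sketch (illustrative, not the Spark code): on Scala 2.13 the default `Seq` alias is `scala.collection.immutable.Seq`, so a pattern that used to match any `scala.collection.Seq` (such as an `ArrayBuffer`) stops matching, which is why `collections.Seq` has to be spelled out.

```scala
import scala.collection.mutable.ArrayBuffer

val xs: scala.collection.Seq[Int] = ArrayBuffer(1, 2, 3)
xs match {
  case s: Seq[_] => println("matched the default Seq alias")  // 2.12: collection.Seq, matches
  case other     => println(s"2.13: fell through for ${other.getClass.getSimpleName}")
}
```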

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran tests mentioned above and all passed with Scala 2.13.

Closes #32848 from sarutak/followup-SPARK-35194-2.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-10 02:14:40 +00:00
Kousuke Saruta 94b66f5e28 [MINOR][SQL] Modify the example of rand and randn
### What changes were proposed in this pull request?

This PR fixes the examples of `rand` and `randn`.

### Why are the changes needed?

SPARK-23643 (#20793) fixes an issue which is related to the seed and changes the results of `rand` and `randn`.
Now the results of `SELECT rand(0)` and `SELECT randn(null)` are `0.7604953758285915` and `1.6034991609278433` respectively, and they should be deterministic because the number of partitions is always 1 (the leaf node is `OneRowRelation`).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Built the doc and confirmed it.
![rand-doc](https://user-images.githubusercontent.com/4736016/121359059-145a9b80-c96e-11eb-84c2-2f2b313614f3.png)

Closes #32844 from sarutak/rand-example.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-10 10:37:38 +09:00
Xinrong Meng e9d60156c4 [SPARK-35705][PYTHON] Adjust pandas-on-spark test_groupby_multiindex_columns test for different pandas versions
### What changes were proposed in this pull request?

Adjust pandas-on-spark test_groupby_multiindex_columns test in order to pass with different pandas versions.

### Why are the changes needed?

pandas had introduced bugs as below:

- For pandas 1.1.3 and 1.1.4
Type error: only integer scalar arrays can be converted to a scalar index

- For pandas < 1.0.4
Type error: Can only tuple-index with a MultiIndex

We ought to adjust `test_groupby_multiindex_columns` tests by comparing with a predefined return value, rather than comparing with the pandas return value in the pandas versions above.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32851 from xinrong-databricks/SPARK-35705.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-10 10:36:19 +09:00
Xinrong Meng 3c66c11aa6 [SPARK-35601][PYTHON] Complete arithmetic operators involving bool literals, Series, and Index
### What changes were proposed in this pull request?

Completing arithmetic operators involving bool literals, Series, and Index consists of two main tasks:
- Support arithmetic operations against bool literals
- Support operators (+, *) between bool Series/Indexes.

### Why are the changes needed?

Arithmetic operators involving bool literals, Series, and Index are incomplete now.
We ought to match pandas' behaviors.

### Does this PR introduce _any_ user-facing change?

Yes.

Newly supported operations example:
```py
>>> ps.Series([1, 2, 3]) + True
0    2
1    3
2    4
dtype: int64
>>> ps.Series([1, 2, 3]) + False
0    1
1    2
2    3
dtype: int64
>>> ps.Series([True, False, True]) + True
0    True
1    True
2    True
dtype: bool
>>> ps.Series([True, False, True]) + False
0     True
1    False
2     True
dtype: bool
>>> ps.Series([True, False, True]) * True
0     True
1    False
2     True
dtype: bool
>>> ps.Series([True, False, True]) * False
0    False
1    False
2    False
dtype: bool
>>> ps.set_option('compute.ops_on_diff_frames', True)
>>> ps.Series([True, True, False]) + ps.Series([True, False, True])
0    True
1    True
2    True
dtype: bool
>>> ps.Series([True, True, False]) * ps.Series([True, False, True])
0     True
1    False
2    False
dtype: bool
```
Before the change, operations above are not supported, raising a TypeError such as
```py
>>> ps.Series([True, False, True]) + True
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans and the given type.
>>> ps.Series([True, False, True]) + False
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans and the given type.
```

### How was this patch tested?

Unit tests.

Closes #32785 from xinrong-databricks/datatypeops_arith_bool.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-09 15:13:03 -07:00
Gengliang Wang 74b3df86f3 [SPARK-35698][SQL] Support casting of timestamp without time zone to strings
### What changes were proposed in this pull request?

Extend the Cast expression and support TimestampWithoutTZType in casting to StringType.

### Why are the changes needed?

To conform to the ANSI SQL standard, which requires support for such casting.

### Does this PR introduce _any_ user-facing change?

No, the new timestamp type is not released yet.

### How was this patch tested?

Unit test

Closes #32846 from gengliangwang/tswtzToString.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-06-10 02:29:37 +08:00
allisonwang-db f49bf1a072 [SPARK-34382][SQL] Support LATERAL subqueries
### What changes were proposed in this pull request?
This PR adds support for lateral subqueries. A lateral subquery is a subquery preceded by the `LATERAL` keyword in the FROM clause of a query that can reference columns in the preceding FROM items. For example:
```sql
SELECT * FROM t1, LATERAL (SELECT * FROM t2 WHERE t1.a = t2.c)
```
A new subquery expression`LateralSubquery` is used to represent a lateral subquery. It is similar to `ScalarSubquery` but can return multiple rows and columns. A new logical unary node `LateralJoin` is used to represent a lateral join.

Here is the analyzed plan for the above query:
```scala
Project [a, b, c, d]
+- LateralJoin lateral-subquery [a], Inner
   :  +- Project [c, d]
   :     +- Filter (outer(a) = c)
   :        +- Relation [c, d]
   +- Relation [a, b]
```

Similar to a correlated subquery, a lateral subquery can be viewed as a dependent (nested loop) join where the evaluation of the right subtree depends on the current value of the left subtree.  The same technique to decorrelate a subquery is used to decorrelate a lateral join:
```scala
Project [a, b, c, d]
+- LateralJoin lateral-subquery [a && a = c], Inner  // pull up correlated predicates as join conditions
   :  +- Project [c, d]
   :     +- Relation [c, d]
   +- Relation [a, b]
```
Then the lateral join can be rewritten into a normal join:
```scala
Join Inner (a = c)
:- Relation [a, b]
+- Relation [c, d]
```

#### Follow-ups:
1. Similar to rewriting correlated scalar subqueries, rewriting lateral joins is also subject to the COUNT bug (See SPARK-15370 for more details). This is **not** handled in the current PR as it requires a sizeable amount of refactoring. It will be addressed in a subsequent PR (SPARK-35551).
2. Currently Spark does not use outer query references to resolve star expressions in subqueries. This is not lateral subquery specific and can be handled in a separate PR (SPARK-35618)

### Why are the changes needed?
To support an ANSI SQL feature.

### Does this PR introduce _any_ user-facing change?
Yes. It allows users to use lateral subqueries in the FROM clause of a query.

### How was this patch tested?
- Parser test: `PlanParserSuite.scala`
- Analyzer test: `ResolveSubquerySuite.scala`
- Optimizer test: `PullupCorrelatedPredicatesSuite.scala`
- SQL test: `join-lateral.sql`, `postgreSQL/join.sql`

Closes #32303 from allisonwang-db/spark-34382-lateral.

Lead-authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-09 17:08:32 +00:00
shahid 519be238be [SPARK-35423][ML] PCA results should be consistent, If the Matrix contains both Sparse and Dense vectors
### What changes were proposed in this pull request?
If the dataset contains a mix of sparse and dense vectors, the output of PCA differs. The issue here is that we check only the first row's Vector type. If the first row is dense and all the remaining rows are sparse, we compute PCA via the dense path. Similarly, if only the first row is sparse and all the remaining rows are dense, we compute via the sparse computation path.

The following datasets will produce different results with PCA, even though the data is the same, except that the first row's type is sparse.
```
val data1 = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
```

```
+-----------------------------------------------------------+
|pcaFeatures                                                |
+-----------------------------------------------------------+
|[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
|[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
|[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
+-----------------------------------------------------------+

```
```
val data1 = Array(
  Vectors.dense(0.0, 1.0, 0.0, 7.0, 0.0 ),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
```

```
+------------------------------------------------------------+
|pcaFeatures                                                 |
+------------------------------------------------------------+
|[1.6485728230883814,-4.0132827005162985,-1.0091435193998504]|
|[-4.645104331781533,-1.1167972663619048,-1.0091435193998501]|
|[-6.428880535676488,-5.337951427775359,-1.009143519399851]  |
+------------------------------------------------------------+
```

### Why are the changes needed?
To fix the inconsistent result when a dataset contains both sparse and dense vectors. We need to treat the entire matrix as sparse ONLY if all the rows are sparse; otherwise we need to consider the matrix as dense. This PR can be a follow-up for the PR: https://github.com/apache/spark/pull/23126
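
The consistency rule boils down to something like the sketch below (illustrative; the real change lives inside MLlib's matrix computation rather than a standalone helper):

```scala
import org.apache.spark.ml.linalg.{SparseVector, Vector, Vectors}

// Use the sparse code path only when *every* row is sparse; a single dense row
// means the whole matrix should go through the dense path.
def treatAsSparse(rows: Seq[Vector]): Boolean =
  rows.nonEmpty && rows.forall(_.isInstanceOf[SparseVector])

treatAsSparse(Seq(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0)))  // false => dense path, consistent results
```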

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UTs

Closes #32734 from shahidki31/shahid/pca.

Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-06-09 10:23:46 -05:00