ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
PengLei	3904c0edba	[SPARK-35778][SQL][TESTS] Check multiply/divide of year month interval of any fields by numeric ### What changes were proposed in this pull request? Check multiply/divide of year-month intervals of any fields by numeric. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Expanded existed test cases. Closes #33051 from Peng-Lei/SPARK-35778. Authored-by: PengLei <18066542445@189.cn> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-24 12:25:06 +03:00
PengLei	8a1995f936	[SPARK-35728][SQL][TESTS] Check multiply/divide of day-time intervals of any fields by numeric ### What changes were proposed in this pull request? 1. The testcase is just cover the DayTimeIntervalType() / numeric 2. Add testcase for following intervals / numeric: INTERVAL DAY INTERVAL DAY TO HOUR INTERVAL DAY TO MINUTE INTERVAL HOUR INTERVAL HOUR TO MINUTE INTERVAL HOUR TO SECOND INTERVAL MINUTE INTERVAL MINUTE TO SECOND INTERVAL SECOND ### Why are the changes needed? Add testcase coverage. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existed testcase Closes #33014 from Peng-Lei/SPARK-35728. Authored-by: PengLei <18066542445@189.cn> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-24 12:11:47 +03:00
ulysses-you	1295e8876c	[SPARK-35786][SQL] Add a new operator to distingush if AQE can optimize safely ### What changes were proposed in this pull request? * Add a new repartition operator `RebalanceRepartition`. * Support a new hint `REBALANCE` After this patch, user can run this query: ```sql SELECT /+ REBALANCE(c) / * FROM t ``` ### Why are the changes needed? Add a new hint to distingush if we can optimize it safely. This new hint can let AQE optimize with `CustomShuffleReaderExec` safely. Currently, AQE can only coalesce shuffle partitions but can not expand shuffle partitions due to the semantics of output partitioning. Let's say we have a query: ```sql SELECT /+ REPARTITION(col) / * FROM t ``` AQE can not expand the shuffle partitions even if `col` is skewed because expanding shuffle partitions will break the hashed output paritioning of `RepartitionByExpression`. But if the query is use`REPARTITION_BY_AQE`, AQE can optimize it without considering the semantics of output partitioning. ### Does this PR introduce _any_ user-facing change? Yes, a new hint. ### How was this patch tested? Add test. Closes #32932 from ulysses-you/SPARK-35786. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-24 09:04:38 +00:00
Hyukjin Kwon	5a7686a393	[SPARK-35301][PYTHON][DOCS] Document migration guide from Koalas to pandas APIs on Spark ### What changes were proposed in this pull request? This PR proposes to add a migration guide for legacy Koalas users in pandas API on Spark. ### Why are the changes needed? For easier migration. ### Does this PR introduce _any_ user-facing change? Yes, this adds a new page for migration from Koalas. ### How was this patch tested? Manually built the docs and checked manually. Closes #33050 from HyukjinKwon/SPARK-35301. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-24 17:58:09 +09:00
dgd-contributor	5b9c5c126f	[SPARK-35841][SQL] Casting string to decimal type doesn't work if the… … sum of the digits is greater than 38 ### What changes were proposed in this pull request? Since Spark 3.1.1, NULL is returned when casting a string with many decimal places to a decimal type. If the sum of the digits before and after the decimal point is less than 39, a value is returned. From 39 digits, however, NULL is returned. This worked until Spark 3.0.X. Code to reproduce: A string with 2 decimal places in front of the decimal point and 37 decimal places after the decimal point returns null ``` val data = Seq( "28.9259999999999983799625624669715762138", "28.925999999999998379962562466971576213", "2.9259999999999983799625624669715762138" ) val df = data.toDF("num") df.withColumn("numConverted", col("num").cast("decimal(38, 5)")).show() ``` before this pull request, the result is +----------------------+---------------+ \| num \|numConverted\| +----------------------+---------------+ \|28.92599999999999...\| null\| \|28.92599999999999...\| 28.92600\| \|2.925999999999998...\| 2.92600\| +----------------------+---------------+ the correct result should be +----------------------+---------------+ \| num \|numConverted\| +----------------------+---------------+ \|28.92599999999999...\| 28.92600\| \|28.92599999999999...\| 28.92600\| \|2.925999999999998...\| 2.92600\| +----------------------+---------------+ The problem occur since https://issues.apache.org/jira/browse/SPARK-32706, it because the fast fail is checking precision length, which should only check the whole number part length of the input value, not the precision length ### Why are the changes needed? correctness ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test added Closes #33011 from dgd-contributor/SPARK-35841_castStringToDecimalTypeError. Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-24 16:44:58 +08:00
ulysses-you	ff9ba89dcb	[SPARK-35282][SQL][FOLLOWUP] Simplify condition code of shuffled hash join ### What changes were proposed in this pull request? Simplify the condition code which is introduced by [SPARK-35282](https://issues.apache.org/jira/browse/SPARK-35282). ### Why are the changes needed? Reduce the code size and make code more readable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass CI Closes #33046 from ulysses-you/simplify-shj. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-24 08:42:24 +00:00
Angerszhuuuu	5e77ca8071	[SPARK-35768][SQL] Take into account year-month interval fields in cast ### What changes were proposed in this pull request? Support take into account year-month interval field in cast ##### Rule cast to target YearMonthIntervalType \| string \| demo \| strict target type \| months \| \|---\|---\|---\|---\| \| [+\\|-]y-m \| 1-1 \| YearMonthIntervalType(YEAR. MONTH) \| 13 \| \| [+\\|-]y\| 1 \| YearMonthIntervalType(YEAR. YEAR) \| 12 \| \| [+\\|-]m \| 1 \| YearMonthIntervalType(MONTH. MONTH) \| 1 \| \| INTERVAL [+\\|-]'[+\\|-]y-m' YEAR TO MONTH \| interval '1-1' year to month \| YearMonthIntervalType(YEAR. MONTH) \| 13 \| \| INTERVAL [+\\|-]'[+\\|-]m' MONTH \| interval '1' month \| YearMonthIntervalType(MONTH. MONTH) \| 1 \| \| INTERVAL [+\\|-]'[+\\|-]y' YEAR \| interval '1' year \| YearMonthIntervalType(YEAR.YEAR) \| 12 \| ### Why are the changes needed? Support take into account year-month interval field in cast ### Does this PR introduce _any_ user-facing change? user can use `cast(str, YearMonthInterval(YEAR, YEAR))` etc ### How was this patch tested? Added UT Closes #32940 from AngersZhuuuu/SPARK-35768. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-24 07:48:47 +00:00
AngersZhuuuu	7d0786f535	[SPARK-35730][SQL][TESTS] Check all day-time interval types in UDF ### What changes were proposed in this pull request? Check all day-time interval types in UDF. ### Why are the changes needed? New checks should improve test coverage. ### Does this PR introduce _any_ user-facing change? Yes but `DayTimeIntervalType` has not been released yet. ### How was this patch tested? Existed UT. Closes #33047 from AngersZhuuuu/SPARK-35730. Lead-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-24 10:42:20 +03:00
itholic	92ddef7cfb	[SPARK-35696][PYTHON][DOCS][FOLLOW-UP] Fix underline for title in FAQ to remove warnings ### What changes were proposed in this pull request? This PR follow-up for SPARK-35696 to fix incorrect underline in the documents to remove warnings. ### Why are the changes needed? We should build the docs without any incorrect documentation style ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually build docs and see the warning is removed Closes #33052 from itholic/SPARK-35696-followup. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-24 15:20:13 +09:00
Vinod KC	4dabba8f76	[SPARK-35747][CORE] Avoid printing full Exception stack trace, if Hbase/Kafka/Hive services are not running in a secure cluster ### What changes were proposed in this pull request? In a secure Yarn cluster, even though HBase or Kafka, or Hive services are not used in the user application, yarn client unnecessarily trying to generate Delegations token from these services. This will add additional delays while submitting spark application in a yarn cluster Also during HBase delegation token generation step in the application submit stage, HBaseDelegationTokenProvider prints a full Exception Stack trace and it causes a noisy warning. Apart from printing exception stack trace, Application submission taking more time as it retries connection to HBase master multiple times before it gives up. So, if HBase is not used in the user Applications, it is better to suggest User disable HBase Delegation Token generation. This PR aims to avoid printing full Exception Stack by just printing just Exception name and also add a suggestion message to disable `Delegation Token generation` if service is not used in the Spark Application. eg: `If HBase is not used, set spark.security.credentials.hbase.enabled to false` ### Why are the changes needed? To avoid printing full Exception stack trace in WARN log #### Before the fix ---------------- ``` spark-shell --master yarn ....... ....... 21/06/12 14:29:41 WARN security.HBaseDelegationTokenProvider: Failed to get token from service hbase java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokensWithHBaseConn(HBaseDelegationT okenProvider.scala:93) at org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider. scala:60) at org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$6.apply(HadoopDelegationTokenManager.scala: 166) at org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$6.apply(HadoopDelegationTokenManager.scala: 164) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at org.apache.spark.deploy.security.HadoopDelegationTokenManager.obtainDelegationTokens(HadoopDelegationTokenManager. scala:164) ``` #### After the fix ------------ ``` spark-shell --master yarn Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 21/06/13 02:10:02 WARN security.HBaseDelegationTokenProvider: Failed to get token from service hbase due to java.lang.reflect.InvocationTargetException Retrying to fetch HBase security token with hbase connection parameter. 21/06/13 02:10:40 WARN security.HBaseDelegationTokenProvider: Failed to get token from service hbase java.lang.reflect.InvocationTargetException. If HBase is not used, set spark.security.credentials.hbase.enabled to false 21/06/13 02:10:47 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered! ``` ### Does this PR introduce _any_ user-facing change? Yes, in the log, it avoids printing full Exception stack trace. Instread prints this. WARN security.HBaseDelegationTokenProvider: Failed to get token from service hbase java.lang.reflect.InvocationTargetException. If HBase is not used, set spark.security.credentials.hbase.enabled to false ### How was this patch tested? Tested manually as it can be verified only in a secure cluster Closes #32894 from vinodkc/br_fix_Hbase_DT_Exception_stack_printing. Authored-by: Vinod KC <vinod.kc.in@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-23 23:12:02 -07:00
Angerszhuuuu	490ae8f4d6	[SPARK-35777][SQL][TESTS] Check all year-month interval types in UDF ### What changes were proposed in this pull request? Check all year-month interval types in UDF. ### Why are the changes needed? New checks should improve test coverage. ### Does this PR introduce _any_ user-facing change? Yes but `YearMonthIntervalType` has not been released yet. ### How was this patch tested? Existed UT. Closes #32985 from AngersZhuuuu/SPARK-35777. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-24 08:56:08 +03:00
itholic	712ed87faa	[SPARK-35696][PYTHON][DOCS] Refine the code examples in pandas-on-Spark documentation ### What changes were proposed in this pull request? This PR proposes to refine the code examples for pandas-on-Spark since some of them still follows the naming for Koalas. For example, ```python kdf = ks.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) ``` should be refined to ```python psdf = ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) ``` Also fixed the several remaining Koalas stuffs in FAQ ### Why are the changes needed? Because we don't want to use the name "Koalas" in the Apache Spark anymore. ### Does this PR introduce _any_ user-facing change? Yes, the examples in the documentation will be changed with refined names. ### How was this patch tested? Manually built the docs and check one by one. Closes #33017 from itholic/SPARK-35696. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-24 14:48:13 +09:00
PengLei	61bd036cb9	[SPARK-35852][SQL] Use DateAdd instead of TimeAdd for DateType +/- INTERVAL DAY ### What changes were proposed in this pull request? We use `DateAdd` to impl `DateType` `+`/`-` `INTERVAL DAY` ### Why are the changes needed? To improve the impl of `DateType` `+`/`-` `INTERVAL DAY` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add ut test Closes #33033 from Peng-Lei/SPARK-35852. Authored-by: PengLei <18066542445@189.cn> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-24 08:47:29 +03:00
Dongjoon Hyun	af9b47f8f8	[SPARK-35868][CORE] Add fs.s3a.downgrade.syncable.exceptions if not set ### What changes were proposed in this pull request? This PR aims to add `fs.s3a.downgrade.syncable.exceptions=true` if it's not provided by the users. ### Why are the changes needed? Currently, event log feature is broken with Hadoop 3.2 profile due to `UnsupportedOperationException` because [HADOOP-17597](https://issues.apache.org/jira/browse/HADOOP-17597) changes the default behavior to throw exceptions by default since Apache Hadoop 3.3.1. We know that it's because `EventLogFileWriters` is using `hadoopDataStream.foreach(_.hflush())`, but this PR aims to provide the same UX across Spark distributions with Hadoop2/Hadoop 3 at Apache Spark 3.2.0. ``` $ bin/spark-shell -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/ ... 21/06/23 17:34:35 ERROR SparkContext: Error initializing SparkContext. java.lang.UnsupportedOperationException: S3A streams are not Syncable. See HADOOP-17597. ``` ### Does this PR introduce _any_ user-facing change? Yes, this will recover the existing behavior. ### How was this patch tested? Manual. ``` $ build/sbt package -Phadoop-3.2 -Phadoop-cloud $ bin/spark-shell -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/ ...(working)... ``` If the users provide the configuration explicitly, it will return to the original behavior throwing exceptions. ``` $ bin/spark-shell -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/ -c spark.hadoop.fs.s3a.downgrade.syncable.exceptions=false ... 21/06/23 17:44:41 ERROR Main: Failed to initialize Spark session. java.lang.UnsupportedOperationException: S3A streams are not Syncable. See HADOOP-17597. ``` Closes #33044 from dongjoon-hyun/SPARK-35868. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-23 22:46:36 -07:00
Ruifeng Zheng	37f70422b5	[SPARK-35678][ML][FOLLOWUP] Revert changes in ANN ### What changes were proposed in this pull request? revert changes related to ANN ### Why are the changes needed? using the new `softmax` may cause flaky failure ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? reverted testsuite Closes #33049 from zhengruifeng/revert_softmax_ann. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-24 14:02:28 +09:00
attilapiros	0bdece015e	[SPARK-35543][CORE][FOLLOWUP] Fix memory leak in BlockManagerMasterEndpoint removeRdd ### What changes were proposed in this pull request? Wrapping `JHashMap[BlockId, BlockStatus]` (used in `blockStatusByShuffleService`) into a new class `BlockStatusPerBlockId` which removes the reference to the map when all the persisted blocks are removed. ### Why are the changes needed? With https://github.com/apache/spark/pull/32790 a bug is introduced when all the persisted blocks are removed we remove the HashMap which already shared by the block manger infos but when new block is persisted this map is needed to be used again for storing the data (and this HashMap must be the same which shared by the block manger infos created for registered block managers running on the same host where the external shuffle service is). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Extending `BlockManagerInfoSuite` with test which removes all the persisted blocks then adds another one. Closes #33020 from attilapiros/SPARK-35543-2. Authored-by: attilapiros <piros.attila.zsolt@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>	2021-06-24 00:01:40 -05:00
yi.wu	1cdc56c70d	[SPARK-35869][INFRA] Fix "Cannot run program python" error from do-release-docker.sh ### What changes were proposed in this pull request? Add `python-is-python3` to `create-release/spark-rm/Dockerfile` ### Why are the changes needed? Systems that use pthon3 by default should explicitly indicate the python version is 3. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested during Apache 3.0.3 release. Closes #33048 from Ngone51/fix-release-script. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-24 12:47:28 +09:00
tanel.kiis@gmail.com	b3a2cebc2b	[SPARK-34807][SQL] Transpose Window nodes with Project between them ### What changes were proposed in this pull request? Extend the `TransposeWindow` rule to transpose `Window` nodes, that have `Project` between them. ### Why are the changes needed? The analyzer will turn a `dataset.withColumn("colName", expressionWithWindowFunction)` method call to a `Project - Window - Project` chain in the logical plan. When this method is called multiple times in a row, then the projects can block the `Window` nodes from being transposed by the current `TransposeWindow` rule. TPCDS q47 and q57 are also improved by this. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #31980 from tanelk/SPARK-34807_transpose_window. Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Co-authored-by: Tanel Kiis <tanel.kiis@gmail.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2021-06-24 10:28:57 +08:00
Ruifeng Zheng	a66738823c	[SPARK-35678][ML][FOLLOWUP] softmax support offset and step ### What changes were proposed in this pull request? softmax support offset and step, then we can use it in ANN and NB ### Why are the changes needed? to simplify impl ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing testsuite Closes #32991 from zhengruifeng/softmax_support_offset_step. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>	2021-06-23 21:03:18 -05:00
Hyukjin Kwon	be9089731a	[SPARK-35588][PYTHON][DOCS] Merge Binder integration and quickstart notebook for pandas API on Spark ### What changes were proposed in this pull request? This PR proposes to fix: - the Binder integration of pandas API on Spark, and merge them together with the existing PySpark one. - update quickstart of pandas API on Spark, and make it working The notebooks can be easily reviewed here: https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-35588-3?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb Original page in Koalas: https://koalas.readthedocs.io/en/latest/getting_started/10min.html ### Why are the changes needed? - To show the working examples of quickstart to end users. - To allow users to try out the examples without installation easily. ### Does this PR introduce _any_ user-facing change? No to end users because the existing quickstart of pandas API on Spark is not released yet. ### How was this patch tested? I manually tested it by uploading built Spark distribution to Binder. See `3bc15310a0` Closes #33041 from HyukjinKwon/SPARK-35588-2. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-24 10:17:22 +09:00
zengruios	1cf18a2277	[SPARK-35851][GRAPHX] Modify the wrong variable used in GraphGenerators.sampleLogNormal ### What changes were proposed in this pull request? modify the wrong variable used in GraphGenerators.sampleLogNormal ### Why are the changes needed? wrong variable used ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #33010 from zengruios/SPARK-35851. Authored-by: zengruios <578395184@qq.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-06-23 18:20:33 -05:00
Angerszhuuuu	ad187227f1	[SPARK-35731][SQL][TESTS] Check all day-time interval types in arrow ### What changes were proposed in this pull request? Check all day-time interval types in arrow. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added UT. Closes #33039 from AngersZhuuuu/SPARK-35731. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-23 23:38:41 +03:00
Kousuke Saruta	2d3fa04e90	[SPARK-35729][SQL][TESTS] Check all day-time interval types in aggregate expressions ### What changes were proposed in this pull request? This PR adds test to check `sum` and `avg` works with all the `DayTimeIntervalType`. This PR also moves a dataframe commonly used by tests `SPARK-34837: Support ANSI SQL intervals by the aggregate function avg` and `SPARK-34716: Support ANSI SQL intervals by the aggregate function sum` to `SQLTestData.scala`, and a little bit modifies it. ### Why are the changes needed? To ensure the results of aggregations are what is expected. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Closes #33042 from sarutak/check-interval-agg-dt. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-23 23:34:28 +03:00
Jungtaek Lim (HeartSaVioR)	476197791b	[SPARK-34889][SS] Introduce MergingSessionsIterator merging elements directly which belong to the same session Introduction: this PR is a part of SPARK-10816 (`EventTime based sessionization (session window)`). Please refer #31937 to see the overall view of the code change. (Note that code diff could be diverged a bit.) ### What changes were proposed in this pull request? This PR introduces MergingSessionsIterator, which enables to merge elements belong to the same session directly. MergingSessionsIterator is a variant of SortAggregateIterator which merges the session windows based on the fact input rows are sorted by "group keys + the start time of session window". When merging windows, MergingSessionsIterator also applies aggregations on merged window, which eliminates the necessity on buffering inputs (which requires copying rows) and update the session spec for each input. MergingSessionsIterator is quite performant compared to UpdatingSessionsIterator brought by SPARK-34888. Note that MergingSessionsIterator can only apply to the cases aggregation can be applied altogether, so there're still rooms for UpdatingSessionIterator to be used. This issue also introduces MergingSessionsExec which is the physical node on leveraging MergingSessionsIterator to sort the input rows and aggregate rows according to the session windows. ### Why are the changes needed? This part is a one of required on implementing SPARK-10816. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test suite added. Closes #31987 from HeartSaVioR/SPARK-34889-SPARK-10816-PR-31570-part-2. Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Co-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-06-23 13:04:37 -07:00
Angerszhuuuu	077cf2acdb	[SPARK-35733][SQL][TESTS] Check all day-time interval types in `HiveInspectors` tests ### What changes were proposed in this pull request? Check all day-time interval types in HiveInspectors tests. ### Why are the changes needed? New tests should improve test coverage for day-time interval types. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added UT. Closes #33036 from AngersZhuuuu/SPARK-35733. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-23 19:20:51 +03:00
Bruce Robbins	66d5a0049a	[SPARK-35817][SQL] Restore performance of queries against wide Avro tables ### What changes were proposed in this pull request? When creating a record writer in an AvroDeserializer, or creating a struct converter in an AvroSerializer, look up Avro fields using a map rather than scanning the entire list of Avro fields. ### Why are the changes needed? A query against an Avro table can be quite slow when all are true: * There are many columns in the Avro file * The query contains a wide projection * There are many splits in the input * Some of the splits are read serially (e.g., less executors than there are tasks) A write to an Avro table can be quite slow when all are true: * There are many columns in the new rows * The operation is creating many files For example, a single-threaded query against a 6000 column Avro data set with 50K rows and 20 files takes less than a minute with Spark 3.0.1 but over 7 minutes with Spark 3.2.0-SNAPSHOT. This PR restores the faster time. For the 1000 column read benchmark: Before patch: 108447 ms After patch: 35925 ms percent improvement: 66% For the 1000 column write benchmark: Before patch: 123307 After patch: 42313 percent improvement: 65% ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? * Ran existing unit tests * Added new unit tests * Added new benchmarks Closes #32969 from bersprockets/SPARK-35817. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-23 22:36:56 +08:00
yi.wu	7f937730ff	[SPARK-33741][FOLLOW-UP][CORE] Rename the min threshold time speculation config ### What changes were proposed in this pull request? This's a follow-up of https://github.com/apache/spark/pull/30710. Rename the conf from `spark.speculation.min.threshold` to `spark.speculation.minTaskRuntime`. ### Why are the changes needed? To follow the [config naming policy](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala#L21). ### Does this PR introduce _any_ user-facing change? No (since Spark 3.2 hasn't been released). ### How was this patch tested? Pass existing tests. Closes #33037 from Ngone51/spark-33741-followup. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-23 13:29:58 +00:00
Yikun Jiang	4824c53398	[SPARK-35812][PYTHON] Throw ValueError if version and timestamp are used together in to_delta ### What changes were proposed in this pull request? Throw ValueError if version and timestamp are used together in to_delta ### Why are the changes needed? read_delta has arguments named `version` and `timestamp`, but they cannot be used together. We should raise the proper error message when they are used together. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #33023 from Yikun/SPARK-35812. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-23 19:04:45 +09:00
Chao Sun	b8acbf6d88	[SPARK-35846][SQL] Introduce ParquetReadState to track various states while reading a Parquet column chunk ### What changes were proposed in this pull request? Move all the bookkeeping states while scanning a Parquet column chunk into a single class `ParquetReadState`. ### Why are the changes needed? As suggested [here](https://github.com/apache/spark/pull/32753#discussion_r655580942). To support column index in the vectorized reader path, we'll going to introduce more states to track. These are spread across different classes which make the code harder to maintain. Therefore, this proposes to move them into a single class so they can be managed better. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. Closes #33006 from sunchao/SPARK-35846. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-23 02:56:00 -07:00
Gengliang Wang	6f51e37eb5	[SPARK-35857][SQL] The ANSI flag of Cast should be kept after being copied ### What changes were proposed in this pull request? Make the ANSI flag part of expression `Cast`'s parameter list, instead of fetching it from the sessional SQLConf. ### Why are the changes needed? For Views, it is important to show consistent results even the ANSI configuration is different in the running session. This is why many expressions like 'Add'/'Divide' making the ANSI flag part of its case class parameter list. We should make it consistent for the expression `Cast` ### Does this PR introduce _any_ user-facing change? Yes, the `Cast` inside a View always behaves the same, independent of the ANSI model SQL configuration in the current session. ### How was this patch tested? Existing UT Closes #33027 from gengliangwang/ansiFlagInCast. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-23 16:52:33 +08:00
Angerszhuuuu	758b423a31	[SPARK-35860][SQL] Support UpCast between different field of YearMonthIntervalType/DayTimeIntervalType ### What changes were proposed in this pull request? Support UpCast between different field of YearMonthIntervalType/DayTimeIntervalType ### Why are the changes needed? Since in our encoder we handle Period/Duration as default YearMonthIntervalType/DayTimeIntervalType, when we use udf to handle this type, it will upcast all type of YearMonthIntervalType/DayTimeIntervalType to default YearMonthIntervalType/DayTimeIntervalType, so we need to support this. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added Ut Closes #33035 from AngersZhuuuu/SPARK-35860. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-23 11:32:13 +03:00
Angerszhuuuu	7c1a9dd3f5	[SPARK-35776][SQL][TESTS] Check all year-month interval types in arrow ### What changes were proposed in this pull request? Add tests to check that all year-month interval types are supported in (de-)serialization from/to Arrow format. ### Why are the changes needed? New tests should improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? added ut Closes #32993 from AngersZhuuuu/SPARK-35776. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-23 10:59:50 +03:00
Peter Toth	79e3d0d98f	[SPARK-35855][SQL] Unify reuse map data structures in non-AQE and AQE rules ### What changes were proposed in this pull request? This PR unifies reuse map data structures in non-AQE and AQE rules to a simple `Map[<canonicalized plan>, <plan>]` based on the discussion here: https://github.com/apache/spark/pull/28885#discussion_r655073897 ### Why are the changes needed? The proposed `Map[<canonicalized plan>, <plan>]` is simpler than the currently used `Map[<schema>, ArrayBuffer[<plan>]]` in `ReuseMap`/`ReuseExchangeAndSubquery` (non-AQE) and consistent with the `ReuseAdaptiveSubquery` (AQE) subquery reuse rule. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. Closes #33021 from peter-toth/SPARK-35855-unify-reuse-map-data-structures. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-23 07:20:47 +00:00
Wenchen Fan	20edfdd39a	[SPARK-35845][SQL] OuterReference resolution should reject ambiguous column names ### What changes were proposed in this pull request? The current OuterReference resolution is a bit weird: when the outer plan has more than one child, it resolves OuterReference from the output of each child, one by one, left to right. This is incorrect in the case of join, as the column name can be ambiguous if both left and right sides output this column. This PR fixes this bug by resolving OuterReference with `outerPlan.resolveChildren`, instead of something like `outerPlan.children.foreach(_.resolve(...))` ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? The problem only occurs in join, and join condition doesn't support correlated subquery yet. So this PR only improves the error message. Before this PR, people see ``` java.lang.UnsupportedOperationException Cannot generate code for expression: outer(t1a#291) ``` ### How was this patch tested? a new test Closes #33004 from cloud-fan/outer-ref. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-23 14:32:34 +08:00
Angerszhuuuu	df55945804	[SPARK-35772][SQL][TESTS] Check all year-month interval types in `HiveInspectors` tests ### What changes were proposed in this pull request? Check all year-month interval types in HiveInspectors tests. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT. Closes #32970 from AngersZhuuuu/SPARK-35772. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-23 08:54:07 +03:00
Kousuke Saruta	4416b4b8ba	[SPARK-35734][SQL][FOLLOWUP] IntervalUtils.toDayTimeIntervalString should consider the case a day-time type is casted as another day-time type ### What changes were proposed in this pull request? This PR fixes an issue that `IntervalUtils.toDayTimeIntervalString` doesn't consider the case that a day-time interval type is casted as another day-time interval type. if data of `interval day to second` is casted as `interval hour to second`, the value of the day is multiplied by 24 and added to the value of hour. For example, `INTERVAL '1 2' DAY TO HOUR` will be `INTERVAL '26' HOUR` if it's casted. If this behavior is intended, it should be stringified as `INTERVAL '26' HOUR` but currently, it will be `INTERVAL '2' HOUR` ### Why are the changes needed? t's a bug if the behavior of cast is intended. ### Does this PR introduce _any_ user-facing change? No, because this feature is not released yet. ### How was this patch tested? Modified the tests added in SPARK-35734 (#32891) Closes #33031 from sarutak/fix-toDayTimeIntervalString. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-23 08:00:35 +03:00
Gengliang Wang	960a7e5fce	[SPARK-35856][SQL][TESTS] Move new interval type test cases from CastSuite to CastBaseSuite ### What changes were proposed in this pull request? There are a few test cases that are supposed to be in CastSuiteBase instead of CastSuite: - SPARK-35112: Cast string to day-time interval - SPARK-35111: Cast string to year-month interval - SPARK-35820: Support cast DayTimeIntervalType in different fields - SPARK-35819: Support cast YearMonthIntervalType in different fields This PR is to move them to CastSuiteBase. Also, it adds comments for the scope of CastSuiteBase/CastSuite/AnsiCastSuiteBase. ### Why are the changes needed? Increase test coverage so that we can test the casting under ANSI mode. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #33022 from gengliangwang/moveTest. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-23 11:20:50 +08:00
Wenchen Fan	a87ee5d8b9	[SPARK-35695][SQL][FOLLOWUP] Use AQE helper to simplify the code in CollectMetricsExec ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/32862 , to simplify the code with AQE helper. ### Why are the changes needed? code cleanup ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #33026 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-23 09:54:12 +09:00
Takuya UESHIN	68b54b702c	[SPARK-35473][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.groupby ### What changes were proposed in this pull request? Adds more type annotations in the file `python/pyspark/pandas/groupby.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more disallow_untyped_defs mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. Closes #33032 from ueshin/issues/SPARK-35473/disallow_untyped_defs_groupby. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-23 09:51:33 +09:00
Wenchen Fan	7a21e9c48f	[SPARK-35858][SQL] SparkPlan.makeCopy should not set the active session ### What changes were proposed in this pull request? We introduced `SparkSession.withActive` a while ago, and we use it when we need to run some code with a certain SparkSession as the active session. Somehow we missed `SparkPlan.makeCopy`, which sets active session directly. This PR proposes to call `SparkSession.withActive` there. ### Why are the changes needed? make sure we don't change the active session unexpectedly. ### Does this PR introduce _any_ user-facing change? No. `makeCopy` is an internal function and I can't find a real case that this can change the active session. Mostly in an upper level, there is already a `SparkSession.withActive`, like `QueryExecution.executePhase` ### How was this patch tested? existing tests Closes #33029 from cloud-fan/minor1. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-23 09:50:59 +09:00
Wenchen Fan	a2c1a55b1f	[SPARK-35700][SQL][FOLLOWUP] Read schema from ORC files should strip CHAR/VARCHAR types ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/33001 , to provide a more direct fix. The regression in 3.1 was caused by the fact that we changed the parser and allow the parser to return CHAR/VARCHAR type. We should have replaced CHAR/VARCHAR with STRING before the data type flows into the query engine, however, `OrcUtils` is missed. When reading ORC files, at the task side we will read the real schema from ORC file metadata, then apply filter pushdown. For some reason, the implementation turns ORC schema to Spark schema before filter pushdown, and this step does not strip CHAR/VARCHAR. Note, for Parquet we use the Parquet schema directly in filter pushdown, and do not this have problem. This PR proposes to replace the CHAR/VARCHAR with STRING when turning ORC schema to Spark schema. ### Why are the changes needed? a more directly bug fix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #33030 from cloud-fan/help. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-22 13:50:49 -07:00
Takuya UESHIN	c418803df7	[SPARK-35847][PYTHON] Manage InternalField in DataTypeOps.isnull ### What changes were proposed in this pull request? Properly set `InternalField` for `DataTypeOps.isnull`. ### Why are the changes needed? The result of `DataTypeOps.isnull` must always be non-nullable boolean. We should manage `InternalField` for this case. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added some more tests. Closes #33005 from ueshin/issues/SPARK-35847/isnull_field. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-22 12:54:01 -07:00
Li Zhang	dfd7b026dc	[SPARK-35800][SS] Improving GroupState testability by introducing TestGroupState ### What changes were proposed in this pull request? Proposed changes in this pull request: 1. Introducing the `TestGroupState` interface which is inherited from `GroupState` so that testing related getters can be exposed in a controlled manner 2. Changing `GroupStateImpl` to inherit from `TestGroupState` interface, instead of directly from `GroupState` 3. Implementing `TestGroupState` object with `create()` method to forward inputs to the private `GroupStateImpl` constructor 4. User input validations have been added into `GroupStateImpl`'s `createForStreaming()` method to prevent users from creating invalid GroupState objects. 5. Replacing existing `GroupStateImpl` usages in sql pkg internal unit tests with the newly added `TestGroupState` to give user best practice about `TestGroupState` usage. With the changes in this PR, the class hierarchy is changed from `GroupStateImpl` -> `GroupState` to `GroupStateImpl` -> `TestGroupState` -> `GroupState` (-> means inherits from) ### Why are the changes needed? The internal `GroupStateImpl` implementation for the `GroupState` interface has no public constructors accessible outside of the sql pkg. However, the user-provided state transition function for `[map\|flatMap]GroupsWithState` requires a `GroupState` object as the prevState input. Currently, users are calling the Structured Streaming engine in their unit tests in order to instantiate such `GroupState` instances, which makes UTs cumbersome. The proposed `TestGroupState` interface is to give users controlled access to the `GroupStateImpl` internal implementation to largely improve testability of Structured Streaming state transition functions. Usage Example ``` import org.apache.spark.sql.streaming.TestGroupState test(“Structured Streaming state update function”) { var prevState = TestGroupState.create[UserStatus]( optionalState = Optional.empty[UserStatus], timeoutConf = EventTimeTimeout, batchProcessingTimeMs = 1L, eventTimeWatermarkMs = Optional.of(1L), hasTimedOut = false) val userId: String = ... val actions: Iterator[UserAction] = ... assert(!prevState.hasUpdated) updateState(userId, actions, prevState) assert(prevState.hasUpdated) } ``` ### Does this PR introduce _any_ user-facing change? Yes, the `TestGroupState` interface and its corresponding `create()` factory function in its companion object are introduced in this pull request for users to use in unit tests. ### How was this patch tested? - New unit tests are added - Existing GroupState unit tests are updated Closes #32938 from lizhangdatabricks/improve-group-state-testability. Authored-by: Li Zhang <li.zhang@databricks.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2021-06-22 15:04:01 -04:00
Yikun Jiang	1c26433f1d	[SPARK-35849][PYTHON] Make `astype` method data-type-based for DecimalOps ### What changes were proposed in this pull request? Make DecimalOps astype data-type-based. See more in: https://github.com/apache/spark/pull/32821#issuecomment-861119905 ### Why are the changes needed? Make DecimalOps astype data-type-based. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test NumOpsTest.test_astype in pyspark/pandas/tests/data_type_ops/test_num_ops.py Closes #33009 from Yikun/SPARK-35849. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-22 10:41:22 -07:00
Hyukjin Kwon	27046582e4	[SPARK-35645][PYTHON][DOCS] Merge contents and remove obsolete pages in Getting Started section ### What changes were proposed in this pull request? This PR revise the installation to describe `pip install pyspark[pandas_on_spark]` and removes pandas-on-Spark installation and videos/blogposts. ### Why are the changes needed? pandas-on-Spark installation is merged to PySpark installation pages. For videos/blogposts, now this is named pandas API on Spark. Old Koalas blogposts and videos are obsolete. ### Does this PR introduce _any_ user-facing change? To end users, no because the docs are not released yet. ### How was this patch tested? I manually built the docs and checked the output Closes #33018 from HyukjinKwon/SPARK-35645. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-22 09:36:27 -07:00
Gengliang Wang	ce53b7199d	[SPARK-35854][SQL] Improve the error message of to_timestamp_ntz with invalid format pattern ### What changes were proposed in this pull request? When SQL function `to_timestamp_ntz` has invalid format pattern input, throw a runtime exception with hints for the valid patterns, instead of throwing an upgrade exception with suggestions to use legacy formatters. ### Why are the changes needed? As discussed in https://github.com/apache/spark/pull/32995/files#r655148980, there is an error message saying "You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyy-MM-dd GGGGG' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0" This is not true for function to_timestamp_ntz, which only uses the Iso8601TimestampFormatter and added since Spark 3.2. We should improve it. ### Does this PR introduce _any_ user-facing change? No, the new SQL function is not released yet. ### How was this patch tested? Unit test Closes #33019 from gengliangwang/improveError. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-22 23:45:54 +08:00
Lei Peng	bc61b62a55	[SPARK-35727][SQL] Return INTERVAL DAY from dates subtraction What changes were proposed in this pull request? 1. Change the return value type from DayTimeIntervalType(DAY, SECOND) to DayTimeIntervalType(DAY, DAY) of SubtractDates. Why are the changes needed? https://issues.apache.org/jira/browse/SPARK-35727 Does this PR introduce any user-facing change? no How was this patch tested? existed ut test Closes #32999 from Peng-Lei/SPARK-35727. Lead-authored-by: Lei Peng <peng.8lei@gmail.com> Co-authored-by: PengLei <18066542445@189.cn> Co-authored-by: Peng-Lei <peng.8lei@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-22 13:43:25 +00:00
YangJie	6c05459600	[SPARK-35838][BUILD][TESTS] Ensure all modules can be maven test independently in Scala 2.13 ### What changes were proposed in this pull request? Similar to SPARK-35532, the main change of this pr is add `scala-2.13` profile to external/kafka-0-10-sql/pom.xml, external/avro/pom.xml and sql/hive-thriftserver/pom.xml, the `scala-2.13` profile include dependency on `scala-parallel-collections_2.13`, then all(34) spark modules can maven test independently. ### Why are the changes needed? Ensure alll(34) spark modules can be maven test independently in Scala 2.13 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the GitHub Action Scala 2.13 job - Manual test： 1. Execute ``` dev/change-scala-version.sh 2.13 mvn clean install -DskipTests -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 ``` 2. maven test `external/kafka-0-10-sql` module ``` mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl external/kafka-0-10-sql ``` before ``` Discovery starting. Discovery completed in 857 milliseconds. Run starting. Expected test count is: 464 ... KafkaRelationSuiteV2: - explicit earliest to latest offsets - default starting and ending offsets - explicit offsets - default starting and ending offsets with headers - timestamp provided for starting and ending - timestamp provided for starting, offset provided for ending - timestamp provided for ending, offset provided for starting - timestamp provided for starting, ending not provided - timestamp provided for ending, starting not provided - global timestamp provided for starting and ending - no matched offset for timestamp - startingOffsets - preferences on offset related options - no matched offset for timestamp - endingOffsets * RUN ABORTED * java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) at org.apache.spark.SparkContext.union(SparkContext.scala:1405) at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217) ... Cause: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) at org.apache.spark.SparkContext.union(SparkContext.scala:1405) at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) ... ``` After ``` Run completed in 33 minutes, 51 seconds. Total number of tests run: 464 Suites: completed 31, aborted 0 Tests: succeeded 464, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` 3. maven test `external/avro` module ``` mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl external/avro ``` before ``` Discovery starting. Discovery completed in 2 seconds, 765 milliseconds. Run starting. Expected test count is: 255 AvroReadSchemaSuite: - append column at the end - hide column at the end - append column into middle - hide column in the middle - add a nested column at the end of the leaf struct column - add a nested column in the middle of the leaf struct column - add a nested column at the end of the middle struct column - add a nested column in the middle of the middle struct column - hide a nested column at the end of the leaf struct column - hide a nested column in the middle of the leaf struct column - hide a nested column at the end of the middle struct column - hide a nested column in the middle of the middle struct column * RUN ABORTED * java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) at org.apache.spark.SparkContext.union(SparkContext.scala:1405) at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217) ... Cause: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) at org.apache.spark.SparkContext.union(SparkContext.scala:1405) at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) ... ``` After ``` Run completed in 1 minute, 42 seconds. Total number of tests run: 255 Suites: completed 12, aborted 0 Tests: succeeded 255, failed 0, canceled 0, ignored 2, pending 0 All tests passed. ``` 4. maven test `sql/hive-thriftserver` module ``` mvn test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl sql/hive-thriftserver ``` before ``` - union.sql * FAILED * "1 a 1 a 2 b 2 b" did not contain "Exception" Exception did not match for query #2 SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t1), expected: 1 a 1 a 2 b 2 b, but got: java.sql.SQLException org.apache.hive.service.cli.HiveSQLException: Error running query: java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:38) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:324) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:229) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79) at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:229) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:224) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:238) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1411) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.SparkContext.withScope(SparkContext.scala:788) at org.apache.spark.SparkContext.union(SparkContext.scala:1405) at org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:697) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:182) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:220) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:217) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:178) at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:323) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:389) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3719) at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2987) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3710) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:774) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3708) at org.apache.spark.sql.Dataset.collect(Dataset.scala:2987) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:299) ... 16 more Caused by: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ... 40 more (ThriftServerQueryTestSuite.scala:209) ``` After ``` Run completed in 29 minutes, 17 seconds. Total number of tests run: 535 Suites: completed 20, aborted 0 Tests: succeeded 535, failed 0, canceled 0, ignored 17, pending 0 All tests passed. ``` Closes #32994 from LuciferYang/SPARK-35838. Authored-by: YangJie <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-22 06:31:24 -07:00
Angerszhuuuu	5a510cf578	[SPARK-35726][SPARK-35769][SQL][FOLLOWUP] Call periodToMonths and durationToMicros in HiveResult should add endField ### What changes were proposed in this pull request? When we call periodToMonths and durationToMicros with certain type field, we should pass endField parameter. ### Why are the changes needed? When we call periodToMonths and durationToMicros with certain type field, we should pass endField parameter. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existed UT Closes #32984 from AngersZhuuuu/SPARK-35726-35769. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-22 11:15:35 +03:00
gengjiaan	43cd6ca687	[SPARK-35378][SQL][FOLLOWUP] isLocal should consider CommandResult ### What changes were proposed in this pull request? #32513 added the case class `CommandResult` so as we can eagerly execute command locally. But we forgot to update `isLocal` of `Dataset`. ### Why are the changes needed? `Dataset.isLocal` should consider `CommandResult`. ### Does this PR introduce _any_ user-facing change? Yes. If the SQL plan is `CommandResult`, `Dataset.isLocal` must return true. ### How was this patch tested? No test. Closes #32963 from beliefer/SPARK-35378-followup2. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-22 07:39:54 +00:00

1 2 3 4 5 ...

30489 commits