ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
weixiuli	947c7ea27c	[SPARK-35783][SQL] Set the list of read columns in the task configuration to reduce reading of ORC data ### What changes were proposed in this pull request? Set the list of read columns in the task configuration to reduce reading of ORC data. ### Why are the changes needed? Currently, the ORC reader reads all columns of the ORC table when the task configuration does not set the list of read columns. Therefore, we should set the list of read columns in the task configuration to reduce reading of ORC data. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests. Closes #32923 from weixiuli/SPARK-35783. Authored-by: weixiuli <weixiuli@jd.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-16 22:06:31 -07:00
Hyukjin Kwon	94bdbec380	[SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section ### What changes were proposed in this pull request? This PR proposes to merge contents and remove obsolete pages in Development section, especially about pandas API on Spark. Some were removed, and some were merged to the existing PySpark guides. I will inline some comments in the PRs to make the review easier. ### Why are the changes needed? To guide developers on the code base of pandas API on Spark. ### Does this PR introduce _any_ user-facing change? Yes, it updates the user-facing documentation. ### How was this patch tested? Manually built the docs and checked. Closes #32926 from HyukjinKwon/SPARK-35644. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-17 13:35:20 +09:00
Venki Korukanti	8e594f084a	[SPARK-35763][SS] Remove the StateStoreCustomMetric subclass enumeration dependency ### What changes were proposed in this pull request? Remove the usage of the enumerating subclasses of `StateStoreCustomMetric` dependency. To achieve it, add couple of utility methods to `StateStoreCustomMetric` * `withNewDesc(desc : String)` to `StateStoreCustomMetric` for cloning the instance with a new `desc` (currently used in `SymmetricHashJoinStateManager`) * `createSQLMetric(sparkContext: sparkContext): SQLMetric` for creating a corresponding `SQLMetric` to show the metric in UI and accumulate at the query level (currently used in `statefulOperator. stateStoreCustomMetrics`) ### Why are the changes needed? Code in [SymmetricHashJoinStateManager](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala#L321) and [StateStoreWriter](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala#L129) rely on the subclass implementations of [StateStoreCustomMetric](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L187). If a new subclass of `StateStoreCustomMetric` is added, it requires code changes to `SymmetricHashJoinStateManager` and `StateStoreWriter`, and we may miss the update if there is no existing test coverage. To prevent these issues add a couple of utility methods to `StateStoreCustomMetric` as mentioned above. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT and a new UT Closes #32914 from vkorukanti/SPARK-35763. Authored-by: Venki Korukanti <venki.korukanti@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2021-06-17 07:48:24 +09:00
Chao Sun	506ef9aad7	[SPARK-29250][BUILD] Upgrade to Hadoop 3.3.1 ### What changes were proposed in this pull request? This upgrades the default Hadoop version from 3.2.1 to 3.3.1. The changes here simply update the version number and dependency file. ### Why are the changes needed? Hadoop 3.3.1 just came out, which comes with many client-side improvements such as for S3A/ABFS (20% faster when accessing S3). These are important for users who want to use Spark in a cloud environment. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Existing unit tests in Spark - Manually tested using my S3 bucket for event log dir: ``` bin/spark-shell \ -c spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \ -c spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \ -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://<my-bucket> ``` - Manually tested against docker-based YARN dev cluster, by running `SparkPi`. Closes #30135 from sunchao/SPARK-29250. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-16 13:28:07 -07:00
YangJie	87bf6b0ea4	[SPARK-35556][SQL] Remove close HiveClient's SessionState ### What changes were proposed in this pull request? It will not generate `tmpOutputFile`, `tmpErrOutputFile` and `sessionDirs` since [SPARK-35286](https://issues.apache.org/jira/browse/SPARK-35286). So we can remove `HiveClientImpl.closeState` to avoid these exceptions: ``` java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File ``` ### Why are the changes needed? 1. Avoid incompatible exceptions. 2. Remove useless code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the GitHub Action - Manual test： Execute ``` mvn clean install -DskipTests -pl sql/hive -am mvn test -pl sql/hive -DwildcardSuites=org.apache.spark.sql.hive.client.VersionsSuite -Dtest=none ``` Before ``` Run completed in 17 minutes, 18 seconds. Total number of tests run: 867 Suites: completed 2, aborted 0 Tests: succeeded 867, failed 0, canceled 0, ignored 1, pending 0 All tests passed. 15:04:02.407 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0 15:04:02.408 WARN org.apache.hadoop.hive.metastore.ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore yangjie010.2.30.21 15:04:02.441 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException 15:04:03.140 ERROR org.apache.spark.util.Utils: Uncaught exception in thread shutdown-hook-0 java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$closeState$1(HiveClientImpl.scala:168) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:312) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:243) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:242) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:292) at org.apache.spark.sql.hive.client.HiveClientImpl.closeState(HiveClientImpl.scala:158) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$new$1(HiveClientImpl.scala:175) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1994) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at 
org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 15:04:03.141 WARN org.apache.hadoop.util.ShutdownHookManager: ShutdownHook '$anon$2' failed, java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:206) at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95) Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.getTmpErrOutputFile()Ljava/io/File; at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$closeState$1(HiveClientImpl.scala:168) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:312) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:243) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:242) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:292) at org.apache.spark.sql.hive.client.HiveClientImpl.closeState(HiveClientImpl.scala:158) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$new$1(HiveClientImpl.scala:175) at 
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1994) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` After ``` Run completed in 11 minutes, 41 seconds. Total number of tests run: 867 Suites: completed 2, aborted 0 Tests: succeeded 867, failed 0, canceled 0, ignored 1, pending 0 All tests passed. ``` Closes #32693 from LuciferYang/SPARK-35556. Lead-authored-by: YangJie <yangjie01@baidu.com> Co-authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2021-06-16 23:30:30 +08:00
Wenchen Fan	a2961ddfdf	[SPARK-35712][SQL] Simplify ResolveAggregateFunctions ### What changes were proposed in this pull request? Currently, `ResolveAggregateFunctions` is a complicated rule that recursively calls the entire analyzer to resolve aggregate functions in parent nodes of aggregate. It's kind of necessary as we need to do many things to identify the aggregate function and push it down to the aggregate node: resolve columns as if they are in the aggregate node, resolve functions, apply type coercion, etc. However, this is overly complicated and it's hard to fully understand how the resolution is done there. It also leads to hacks such as the [char/varchar hack](https://github.com/apache/spark/blob/v3.1.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2396-L2401), [subquery hack](https://github.com/apache/spark/blob/v3.1.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2274-L2277), [grouping function hack](https://github.com/apache/spark/blob/v3.1.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2465-L2467), etc. This PR simplifies the `ResolveAggregateFunctions` rule and clarifies the resolution logic. To resolve aggregate functions/grouping columns in HAVING, ORDER BY and `df.where`, we expand the aggregate node below to output these required aggregate functions/grouping columns. In details, when resolving an expression from the parent of an aggregate node: 1. try to resolve columns with `agg.child` and wrap the result with `TempResolvedColumn`. 2. try to resolve subqueries with `agg.child` 3. if the expression is not resolved, return it and wait for other rules to resolve it, such as resolve functions, type coercions, etc. 4. if the expression is resolved, we transform it and push aggregate functions/grouping columns into the aggregate node below. 
4.1 the expression may already be present in `agg.aggregateExpressions`, in which case we can simply replace the expression with an attr ref. 4.2 if a `TempResolvedColumn` is neither inside an aggregate function nor wraps a grouping column, turn it back to an `UnresolvedAttribute` 5. after the main resolution batch, remove all `TempResolvedColumn` and turn them back to `UnresolvedAttribute`. ### Why are the changes needed? Code cleanup ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing test Closes #32470 from cloud-fan/agg2. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-16 09:52:19 +00:00
Kousuke Saruta	184f65e7c7	[SPARK-35771][SQL] Format year-month intervals using type fields ### What changes were proposed in this pull request? This PR proposes to format year-month interval to strings using the start and end fields of `YearMonthIntervalType`. ### Why are the changes needed? Currently, they are ignored, and any `YearMonthIntervalType` is formatted as `INTERVAL YEAR TO MONTH`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Closes #32924 from sarutak/year-month-interval-format. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-16 11:08:02 +03:00
Kousuke Saruta	4530760c40	[SPARK-35774][SQL] Parse any year-month interval types in SQL ### What changes were proposed in this pull request? This PR extends the parser rules to be able to parse the following types: * INTERVAL YEAR * INTERVAL YEAR TO MONTH * INTERVAL MONTH ### Why are the changes needed? For ANSI compliance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New assertion. Closes #32922 from sarutak/parse-any-year-month. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-16 09:41:57 +03:00
Venkata krishnan Sowrirajan	aaa8a80c9d	[SPARK-35613][CORE][SQL] Cache commonly occurring strings in SQLMetrics, JSONProtocol and AccumulatorV2 classes ### What changes were proposed in this pull request? Cache commonly occurring duplicate Some objects in SQLMetrics by using a Guava cache and reusing the existing Guava String Interner to avoid duplicate strings in JSONProtocol. Also with AccumulatorV2 we have seen lot of Some(-1L) and Some(0L) occurrences in a heap dump that is naively interned by having reusing a already constructed Some(-1L) and Some(0L) To give some context on the impact and the garbage got accumulated, below are the details of the complex spark job which we troubleshooted and figured out the bottlenecks. tl;dr - In short, major issues were the accumulation of duplicate objects mainly from SQLMetrics. Greater than 25% of the 40G driver heap filled with (a very large number of) duplicate, immutable objects. 1. Very large number of duplicate immutable objects. - Type of metric is represented by `'scala.Some("sql")'` - which is created for each metric. - Fixing this reduced memory usage from 4GB to a few bytes. 2. `scala.Some(0)` and `scala.Some(-1)` are very common metric values (typically to indicate absence of metric) - Individually the values are all immutable, but spark sql was creating a new instance each time. - Intern'ing these resulted in saving ~4.5GB for a 40G heap. 3. Using string interpolation for metric names. - Interpolation results in creation of a new string object. - We end up with a very large number of metric names - though the number of unique strings is miniscule. - ~7.5 GB in the 40 GB heap : which went down to a few KB's when fixed. ### Why are the changes needed? To reduce overall driver memory footprint which eventually reduces the Full GC pauses. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Since these are memory related optimizations, unit tests are not added. 
These changes were added to our internal platform and, along with another set of optimizations, made it possible for a complex Spark job that had been continuously failing to succeed. Closes #32754 from venkata91/SPARK-35613. Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>	2021-06-15 22:02:19 -05:00
Yuming Wang	b08cf6e822	[SPARK-35203][SQL] Improve Repartition statistics estimation ### What changes were proposed in this pull request? This PR improves `Repartition` and `RepartitionByExpr` statistics estimation using child statistics. ### Why are the changes needed? The current implementation misses column statistics. For example: ```sql CREATE TABLE t1 USING parquet AS SELECT id % 10 AS key FROM range(100); ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS; set spark.sql.cbo.enabled=true; EXPLAIN COST SELECT key FROM (SELECT key FROM t1 DISTRIBUTE BY key) t GROUP BY key; ``` Before this PR: ``` == Optimized Logical Plan == Aggregate [key#2950L], [key#2950L], Statistics(sizeInBytes=1600.0 B) +- RepartitionByExpression [key#2950L], Statistics(sizeInBytes=1600.0 B, rowCount=100) +- Relation default.t1[key#2950L] parquet, Statistics(sizeInBytes=1600.0 B, rowCount=100) ``` After this PR: ``` == Optimized Logical Plan == Aggregate [key#2950L], [key#2950L], Statistics(sizeInBytes=160.0 B, rowCount=10) +- RepartitionByExpression [key#2950L], Statistics(sizeInBytes=1600.0 B, rowCount=100) +- Relation default.t1[key#2950L] parquet, Statistics(sizeInBytes=1600.0 B, rowCount=100) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #32309 from wangyum/SPARK-35203. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-06-16 10:20:13 +09:00
Ruifeng Zheng	5c96d643ee	[SPARK-35707][ML] optimize sparse GEMM by skipping bound checking ### What changes were proposed in this pull request? Sparse gemm uses the method `DenseMatrix.apply` to access the values, which can be optimized by skipping the bounds and `isTransposed` checks ``` override def apply(i: Int, j: Int): Double = values(index(i, j)) private[ml] def index(i: Int, j: Int): Int = { require(i >= 0 && i < numRows, s"Expected 0 <= i < $numRows, got i = $i.") require(j >= 0 && j < numCols, s"Expected 0 <= j < $numCols, got j = $j.") if (!isTransposed) i + numRows * j else j + numCols * i } ``` ### Why are the changes needed? To improve performance: about 15% faster in the designed case ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing testsuite and additional performance test Closes #32857 from zhengruifeng/gemm_opt_index. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>	2021-06-16 08:57:27 +08:00
Ruifeng Zheng	2802ac321f	[SPARK-35666][ML] gemv skip array shape checking ### What changes were proposed in this pull request? In existing implementations, it is common for a vector/matrix to be sliced/copied just to match shapes, which makes the logic complex and introduces the extra cost of slicing and copying. ### Why are the changes needed? 1, avoid slicing and copying due to shape checking; 2, simplify the usages; ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing testsuites Closes #32805 from zhengruifeng/new_blas_func_for_agg. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>	2021-06-16 08:54:34 +08:00
Wenchen Fan	11e96dc843	[SPARK-35669][SQL] Quote the pushed column name only when nested column predicate pushdown is enabled ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/31964 We should only quote the column name when nested column predicate pushdown is enabled, otherwise the data source side may not have the logic to parse the quoted column name and fail. This is not a problem before #31964 , as we don't quote the column name if there is no dot in the name. But #31964 changed it. ### Why are the changes needed? fix a query failure ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test Closes #32807 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-06-16 09:43:28 +09:00
Kevin Su	ac228d43bc	[SPARK-35691][CORE] addFile/addJar/addDirectory should put CanonicalFile ### What changes were proposed in this pull request? `addFile/addJar/addDirectory` should put CanonicalFile ### Why are the changes needed? I met the error below. 21/06/07 00:06:57 ERROR SparkContext: Failed to add file:/home/runner/work/spark/spark/./core/target/scala-2.12/spark-core_2.12-3.2.0-SNAPSHOT-tests.jar to Spark environment java.lang.IllegalArgumentException: requirement failed: File spark-core_2.12-3.2.0-SNAPSHOT-tests.jar was already registered with a different path (old path = /home/runner/work/spark/spark/core/target/scala-2.12/spark-core_2.12-3.2.0-SNAPSHOT-tests.jar, new path = /home/runner/work/spark/spark/./core/target/scala-2.12/spark-core_2.12-3.2.0-SNAPSHOT-tests.jar But actually, `/home/runner/work/spark/spark/./core/target/scala-2.12/spark-core_2.12-3.2.0-SNAPSHOT-tests.jar` and `/home/runner/work/spark/spark/core/target/scala-2.12/spark-core_2.12-3.2.0-SNAPSHOT-tests.jar` are the same. I think we should put the Canonical File in ConcurrentHashMap. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the CIs. Closes #32845 from pingsutw/SPARK-35691. Authored-by: Kevin Su <pingsutw@apache.org> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2021-06-16 09:39:37 +09:00
Cheng Su	9709ee5ffd	[SPARK-35760][SQL] Fix the max rows check for broadcast exchange ### What changes were proposed in this pull request? This is to fix the maximal allowed number of rows check in `BroadcastExchangeExec`. After https://github.com/apache/spark/pull/27828, the max number of rows is calculated based on max capacity of `BytesToBytesMap` (previous value before the PR is 512000000). This calculation is not accurate as only `UnsafeHashedRelation` is using `BytesToBytesMap`. `LongHashedRelation` (used for broadcast join on key with long data type) has limit of [512000000](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L584), and `BroadcastNestedLoopJoinExec` is not depending on `HashedRelation` at all. The change is to only specialize the max rows limit when needed. Keep other broadcast case with the previous limit - 512000000. ### Why are the changes needed? Fix code logic and avoid unexpected behavior. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #32911 from c21/broadcast. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-06-16 09:36:24 +09:00
Sumeet Gajjar	864ff67746	[SPARK-35429][CORE] Remove commons-httpclient from Hadoop-3.2 profile due to EOL and CVEs ### What changes were proposed in this pull request? Remove commons-httpclient as a direct dependency for Hadoop-3.2 profile. Hadoop-2.7 profile distribution still has it, hadoop-client has a compile dependency on commons-httpclient, thus we cannot remove it for Hadoop-2.7 profile. ``` [INFO] +- org.apache.hadoop:hadoop-client:jar:2.7.4:compile [INFO] \| +- org.apache.hadoop:hadoop-common:jar:2.7.4:compile [INFO] \| \| +- commons-cli:commons-cli:jar:1.2:compile [INFO] \| \| +- xmlenc:xmlenc:jar:0.52:compile [INFO] \| \| +- commons-httpclient:commons-httpclient:jar:3.1:compile ``` ### Why are the changes needed? Spark is pulling in commons-httpclient as a dependency directly. commons-httpclient went EOL years ago and there are most likely CVEs not being reported against it, thus we should remove it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Existing unittests - Checked the dependency tree before and after introducing the changes Before: ``` ./build/mvn dependency:tree -Phadoop-3.2 \| grep -i "commons-httpclient" Using `mvn` from path: /usr/bin/mvn [INFO] +- commons-httpclient:commons-httpclient:jar:3.1:compile [INFO] \| +- commons-httpclient:commons-httpclient:jar:3.1:provided ``` After ``` ./build/mvn dependency:tree \| grep -i "commons-httpclient" Using `mvn` from path: /Users/sumeet.gajjar/cloudera/upstream-spark/build/apache-maven-3.6.3/bin/mvn ``` P.S. Reopening this since [spark upgraded](`463daabd5a`) its `hive.version` to `2.3.9` which does not have a dependency on `commons-httpclient`. Closes #32912 from sumeetgajjar/SPARK-35429. Authored-by: Sumeet Gajjar <sumeetgajjar93@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-15 14:43:30 -07:00
Max Gekk	61ce8f7649	[SPARK-35680][SQL] Add fields to `YearMonthIntervalType` ### What changes were proposed in this pull request? Extend `YearMonthIntervalType` to support interval fields. Valid interval field values: - 0 (YEAR) - 1 (MONTH) After the changes, the following year-month interval types are supported: 1. `YearMonthIntervalType(0, 0)` or `YearMonthIntervalType(YEAR, YEAR)` 2. `YearMonthIntervalType(0, 1)` or `YearMonthIntervalType(YEAR, MONTH)`. It is the default one. 3. `YearMonthIntervalType(1, 1)` or `YearMonthIntervalType(MONTH, MONTH)` Closes #32825 ### Why are the changes needed? In the current implementation, Spark supports only `interval year to month` but the SQL standard allows to specify the start and end fields. The changes will allow to follow ANSI SQL standard more precisely. ### Does this PR introduce _any_ user-facing change? Yes but `YearMonthIntervalType` has not been released yet. ### How was this patch tested? By existing test suites. Closes #32909 from MaxGekk/add-fields-to-YearMonthIntervalType. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-15 23:08:12 +03:00
Andy Grove	1012967ace	[SPARK-35767][SQL] Avoid executing child plan twice in CoalesceExec ### What changes were proposed in this pull request? `CoalesceExec` needlessly calls `child.execute` twice when it could just call it once and re-use the results. This only happens when `numPartitions == 1`. ### Why are the changes needed? It is more efficient to execute the child plan once rather than twice. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? There are no functional changes. This is just a performance optimization, so the existing tests should be sufficient to catch any regressions. Closes #32920 from andygrove/coalesce-exec-executes-twice. Authored-by: Andy Grove <andygrove73@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-15 11:59:21 -07:00
Angerszhuuuu	8a02f3a413	[SPARK-35129][SQL] Construct year-month interval column from integral fields ### What changes were proposed in this pull request? Add a new function to support constructing YearMonthIntervalType from integral fields ### Why are the changes needed? Add a new function to support constructing YearMonthIntervalType from integral fields ### Does this PR introduce _any_ user-facing change? Yes, users can use `make_ym_interval` to construct YearMonthIntervalType from years/months integral fields ### How was this patch tested? Added UT Closes #32645 from AngersZhuuuu/SPARK-35129. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-15 19:19:41 +03:00
Gengliang Wang	c382d4009b	[SPARK-35766][SQL][TESTS] Break down CastSuite/AnsiCastSuite into multiple files ### What changes were proposed in this pull request? Currently, the file CastSuite.scala becomes big: 2000 lines, 2 base classes, 4 test suites. In my previous work of Timestamp without time zone, I planned to put new test cases in CastSuiteBase, but they were accidentally added in AnsiCastSuiteBase. This PR is to break the file down into 3 files. It also moves the test cases about timestamp without time zone to the right base class. ### Why are the changes needed? Make development and review easier. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #32918 from gengliangwang/refactorCastSuite. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-16 00:17:04 +08:00
Tanel Kiis	b74260f67f	[SPARK-35765][SQL] Distinct aggs are not duplicate sensitive ### What changes were proposed in this pull request? Extended `RemoveRedundantAggregates` to remove deduplicating aggregations before aggregations that ignore duplicates. ### Why are the changes needed? Performance improvement. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Extending existing UT Closes #32904 from tanelk/SPARK-33122_followup2_distinct_agg. Authored-by: Tanel Kiis <tanel.kiis@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-15 22:25:04 +09:00
Kousuke Saruta	d54edf0bde	[SPARK-35758][DOCS] Update the document about building Spark with Hadoop for Hadoop 2.x and 3.x ### What changes were proposed in this pull request? This PR updates the document about building Spark with Hadoop for Hadoop 3.x and Hadoop 3.2. ### Why are the changes needed? The document gives the following build command: ``` ./build/mvn -Pyarn -Dhadoop.version=2.8.5 -DskipTests clean package ``` But this command fails because the default build settings are for Hadoop 3.x. So, we need to modify the command example. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I confirmed that both of these commands finished successfully. ``` ./build/mvn -Pyarn -Dhadoop.version=3.3.0 -DskipTests package ./build/mvn -Phadoop-2.7 -Pyarn -Dhadoop.version=2.8.5 -DskipTests package ``` I also built the document and confirmed the result. This is before: ![hadoop-version-before](https://user-images.githubusercontent.com/4736016/122016157-bf020c80-cdfb-11eb-8e74-4840861f8541.png) And this is after: ![hadoop-version-after](https://user-images.githubusercontent.com/4736016/122016188-c75a4780-cdfb-11eb-8427-2f0765e6ff7a.png) Closes #32917 from sarutak/fix-build-doc-with-hadoop. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-15 20:19:50 +09:00
gengjiaan	b191d720e1	[SPARK-35056][SQL] Group exception messages in execution/streaming ### What changes were proposed in this pull request? This PR groups exception messages in `sql/core/src/main/scala/org/apache/spark/sql/execution/streaming`. ### Why are the changes needed? It will largely help with the standardization of error messages and their maintenance. ### Does this PR introduce _any_ user-facing change? No. Error messages remain unchanged. ### How was this patch tested? No new tests - pass all original tests to make sure it doesn't break any existing behavior. Closes #32880 from beliefer/SPARK-35056. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-15 12:19:52 +03:00
Gengliang Wang	195090afcc	[SPARK-35764][SQL] Assign pretty names to TimestampWithoutTZType ### What changes were proposed in this pull request? In the PR, I propose to override the typeName() method in TimestampWithoutTZType, and assign it a name according to the ANSI SQL standard ![image](https://user-images.githubusercontent.com/1097932/122013859-2cf50680-cdf1-11eb-9fcd-0ec1b59fb5c0.png) ### Why are the changes needed? To improve Spark SQL user experience, and have readable types in error messages. ### Does this PR introduce _any_ user-facing change? No, the new timestamp type is not released yet. ### How was this patch tested? Unit test Closes #32915 from gengliangwang/typename. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-15 12:15:13 +03:00
Wenchen Fan	a50bd8f810	[SPARK-35742][SQL] Expression.semanticEquals should be symmetrical ### What changes were proposed in this pull request? Currently, there are some expressions that override `semanticEquals`, which makes it not symmetrical. Ideally, expressions should override `canonicalized` instead of `semanticEquals`. This PR marks `semanticEquals` as final, and implements `canonicalized` for the few expressions that overrode `semanticEquals` before. ### Why are the changes needed? To avoid subtle bugs (I haven't found a real bug yet). ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? a new test Closes #32885 from cloud-fan/attr. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-15 08:53:04 +00:00
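The symmetry argument in SPARK-35742 can be illustrated with a small sketch. This is not Catalyst code: the `Expr` class, its `canonicalized` method, and the operand-sorting rule are hypothetical stand-ins, assumed only to show why comparing canonical forms is symmetric by construction.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Expr:
    """Hypothetical expression node; not Spark's Expression class."""
    op: str
    children: Tuple["Expr", ...] = ()

    def canonicalized(self) -> "Expr":
        # Canonicalize children first, then sort the operands of
        # commutative operators so that a + b and b + a share one form.
        kids = tuple(c.canonicalized() for c in self.children)
        if self.op in ("+", "*"):
            kids = tuple(sorted(kids, key=repr))
        return Expr(self.op, kids)

    def semantic_equals(self, other: "Expr") -> bool:
        # Plain equality on canonical forms is symmetric, so this method
        # cannot disagree with other.semantic_equals(self).
        return self.canonicalized() == other.canonicalized()


a, b = Expr("a"), Expr("b")
left, right = Expr("+", (a, b)), Expr("+", (b, a))
print(left.semantic_equals(right), right.semantic_equals(left))
```

Because subclasses in this sketch can only customize `canonicalized`, no subclass can make the comparison one-sided, which is the property the commit aims for.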
itholic	b9aeeb4e6c	[SPARK-35683][PYTHON] Fix Index.difference to avoid collect 'other' to driver side ### What changes were proposed in this pull request? This PR fixes the wrong behavior of `Index.difference` in pandas APIs on Spark, based on the comment https://github.com/databricks/koalas/pull/1325#discussion_r647889901 and https://github.com/databricks/koalas/pull/1325#discussion_r647890007 - it couldn't handle the case properly when `self` is `Index` or `MultiIndex` and `other` is `MultiIndex` or `Index`. ```python >>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)]) >>> idx1 = ps.Index([1, 2, 3]) >>> midx1.difference(idx1) pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Index.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead. ``` - it collected all the data into the driver side when `other` is a list-like object, which is very dangerous especially when `other` is a distributed object such as a Series. Related test cases are also added. ### Why are the changes needed? To correct the behavior that is incompatible with pandas, and to prevent cases that can easily cause OOM. ```python >>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)]) >>> idx1 = ps.Index([1, 2, 3]) >>> midx1.difference(idx1) MultiIndex([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)], ) ``` It now uses the for loop only when `other` is a `list`, `set` or `dict`. ### Does this PR introduce _any_ user-facing change? Yes, the previous bug is fixed as described in the above code examples. ### How was this patch tested? Manually tested with the linter and unit tests locally; it should also pass on CI. Closes #32853 from itholic/SPARK-35683. 
Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-15 14:18:54 +09:00
Ruifeng Zheng	8c4b535baf	[SPARK-35678][ML] add a common softmax function ### What changes were proposed in this pull request? add softmax function in utils ### Why are the changes needed? it can be used in multi places ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? existing testsuites Closes #32822 from zhengruifeng/add_softmax_func. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>	2021-06-15 10:33:57 +08:00
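The commit message does not show the function itself. As a minimal sketch of what a shared softmax helper typically looks like, assuming the usual max-subtraction trick for numerical stability (the name `softmax` and its list-based signature are illustrative, not the Spark ML utility's API):

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores (illustrative only)."""
    if not values:
        return []
    # Subtracting the maximum keeps exp() from overflowing on large inputs
    # and does not change the result, since the factor cancels out.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


print(softmax([1.0, 2.0, 3.0]))
```

Without the subtraction, inputs such as `[1000.0, 1000.0]` would overflow in `math.exp`; with it, both entries come out as 0.5.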
Takuya UESHIN	2a56cc36ca	[SPARK-35761][PYTHON] Use type-annotation based pandas_udf or avoid specifying udf types to suppress warnings ### What changes were proposed in this pull request? Modify the `pandas_udf` usage to use type-annotation based pandas_udf or avoid specifying udf types to suppress warnings. ### Why are the changes needed? The usage of `pandas_udf` in pandas-on-Spark is outdated and shows warnings. We should use type-annotation based `pandas_udf` or avoid specifying udf types. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32913 from ueshin/issues/SPARK-35761/suppress_warnings. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-15 11:17:56 +09:00
Hyukjin Kwon	95f36e76c6	[SPARK-35750][PYTHON][DOCS] Rename "pandas APIs on Spark" to "pandas API on Spark" ### What changes were proposed in this pull request? This PR proposes to rename "pandas APIs on Spark" to "pandas API on Spark" which is more natural (since API stands for Application Program Interface). ### Why are the changes needed? To make it sound more natural. ### Does this PR introduce _any_ user-facing change? It fixes a typo in the unreleased changes. ### How was this patch tested? N/A Closes #32903 from HyukjinKwon/SPARK-34885. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-15 10:01:04 +09:00
Hyukjin Kwon	2d47fb7683	[SPARK-35755][PYTHON][INFRA] Use higher PyArrow versions in GitHub Actions build ### What changes were proposed in this pull request? This PR proposes to use higher versions of PyArrow which more users use in general. Without this PR, the testing matrix as follows: - (Python 3.8) Use PyArrow 2.x in [pandas UDF tests in SQL side](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala) - (Python 3.6) Use PyArrow 2.x in PySpark tests - (Python 3.9) Use PyArrow 4.x in PySpark tests (no change) - (Python 3.6) Use PyArrow 2.x in PySpark documentation generation (it runs Spark jobs to generate images to use in PySpark API docs) After this PR, the testing matrix as follows: - (Python 3.8) Use PyArrow 4.x in [pandas UDF tests in SQL side](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala) - (Python 3.6) Use PyArrow 3.x in PySpark tests - (Python 3.9) Use PyArrow 4.x in PySpark tests (no change) - (Python 3.6) Use PyArrow 4.x in PySpark documentation generation (it runs Spark jobs to generate images to use in PySpark API docs) ### Why are the changes needed? Test matrix which more people use. ### Does this PR introduce _any_ user-facing change? No, dev and testing only. ### How was this patch tested? GitHub Actions in this PR should test it out. Closes #32906 from HyukjinKwon/SPARK-35755. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-15 09:59:38 +09:00
Takuya UESHIN	ef7545b788	[SPARK-35759][PYTHON] Remove the upperbound for numpy for pandas-on-Spark ### What changes were proposed in this pull request? Removes the upperbound for numpy for pandas-on-Spark. ### Why are the changes needed? We can remove the upper-bound for numpy for pandas-on-Spark because currently it works well on the CI with numpy 1.20.3. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32908 from ueshin/issues/SPARK-35759/numpy. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-15 09:59:05 +09:00
Xinrong Meng	03756618fc	[SPARK-35616][PYTHON] Make `astype` method data-type-based ### What changes were proposed in this pull request? Make `astype` method data-type-based. Non-goal: Match pandas' `astype` TypeErrors. Currently, `astype` throws TypeError error messages only when the destination type is not recognized. However, for some destination types that don't make sense to the specific type of Series/Index, for example, `numeric Series/Index → bytes`, we don't have proper TypeError error messages. Since the goal of the PR is refactoring mainly, the above issue might be resolved later if needed. ### Why are the changes needed? There are many type checks in the `astype` method. Since `DataTypeOps` and its subclasses are introduced, we should refactor `astype` to make it data-type-based. In this way, code is cleaner, more maintainable, and more flexible. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #32847 from xinrong-databricks/datatypeops_astype. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-14 16:33:15 -07:00
Kousuke Saruta	aab0c2bf66	[SPARK-35736][SPARK-35737][SQL][FOLLOWUP] Move a common logic to DayTimeIntervalType ### What changes were proposed in this pull request? This is a followup PR for SPARK-35736(#32893) and SPARK-35737(#32892). This PR moves common logic to `object DayTimeIntervalType`. That logic is like `val strToFieldIndex = DayTimeIntervalType.dayTimeFields.map(i => DayTimeIntervalType.fieldToString(i) -> i).toMap`, a `Map` which maps each time unit to the corresponding day-time field index. ### Why are the changes needed? That logic appeared in the changes for both SPARK-35736 and SPARK-35737, so it can be shared, and it is better to avoid scattering similar logic. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32905 from sarutak/followup-SPARK-35736-35737. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-14 20:51:18 +03:00
Kousuke Saruta	82af318c31	[SPARK-35748][SS][SQL] Fix StreamingJoinHelper to be able to handle day-time interval ### What changes were proposed in this pull request? This PR fixes `StreamingJoinHelper` to be able to handle day-time interval. ### Why are the changes needed? In the current master, `StreamingJoinHelper.getStateValueWatermark` can't handle conditions which contain day-time interval literals. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New assertions added to `StreamingJoinHelperSuite`. Closes #32896 from sarutak/streamingjoinhelper-daytime. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-14 15:45:36 +03:00
Kousuke Saruta	439e94c171	[SPARK-35737][SQL] Parse day-time interval literals to tightest types ### What changes were proposed in this pull request? This PR adds a feature which parses day-time interval literals to the tightest type. ### Why are the changes needed? To comply with the ANSI behavior. For example, `INTERVAL '10 20:30' DAY TO MINUTE` should be parsed as `DayTimeIntervalType(DAY, MINUTE)` but not as `DayTimeIntervalType(DAY, SECOND)`. ### Does this PR introduce _any_ user-facing change? No because `DayTimeIntervalType` will be introduced in `3.2.0`. ### How was this patch tested? New tests. Closes #32892 from sarutak/tight-daytime-interval. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-14 10:06:19 +03:00
Kousuke Saruta	7978fdc97b	[SPARK-35736][SQL] Parse any day-time interval types in SQL ### What changes were proposed in this pull request? This PR adds a feature which allows the parser to parse any day-time interval type in SQL. ### Why are the changes needed? To comply with the ANSI standard, we additionally need to support the following types. * INTERVAL DAY * INTERVAL DAY TO HOUR * INTERVAL DAY TO MINUTE * INTERVAL HOUR * INTERVAL HOUR TO MINUTE * INTERVAL HOUR TO SECOND * INTERVAL MINUTE * INTERVAL MINUTE TO SECOND * INTERVAL SECOND ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests. Closes #32893 from sarutak/parse-any-day-time. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-14 00:13:50 +03:00
Kun Wan	69aa7ad11f	[SPARK-35714][CORE] Bug fix for deadlock during the executor shutdown ### What changes were proposed in this pull request? Bug fix for deadlock during the executor shutdown ### Why are the changes needed? When an executor receives a TERM signal, it (the second TERM signal) locks the java.lang.Shutdown class and then calls the Shutdown.exit() method to exit the JVM. Shutdown calls SparkShutdownHook to shut down the executor. During the executor shutdown phase, a RemoteProcessDisconnected event is sent to the RPC inbox, and then WorkerWatcher tries to call System.exit(-1) again. Because java.lang.Shutdown is already locked, a deadlock occurs. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test case "task reaper kills JVM if killed tasks keep running for too long" in JobCancellationSuite Closes #32868 from wankunde/SPARK-35714. Authored-by: Kun Wan <wankun@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-06-13 16:01:00 -05:00
Kent Yao	1125afd462	[MINOR][K8S] Print the driver pod name instead of Some(name) if absent Print the driver pod name instead of Some(name) if absent ### What changes were proposed in this pull request? ### Why are the changes needed? fix error hint ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test Closes #32889 from yaooqinn/minork8s. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-13 09:11:14 -07:00
Gengliang Wang	6272222bc0	[SPARK-35719][SQL] Support type conversion between timestamp and timestamp without time zone type ### What changes were proposed in this pull request? 1. Extend the Cast expression and support TimestampType in casting to TimestampWithoutTZType. 2. There was a mistake in casting TimestampWithoutTZType as TimestampType in https://github.com/apache/spark/pull/32864. The target value should be `sourceValue - timeZoneOffset` instead of being the same value. ### Why are the changes needed? To conform to the ANSI SQL standard, which requires support for such casting. ### Does this PR introduce _any_ user-facing change? No, the new timestamp type is not released yet. ### How was this patch tested? Unit test Closes #32878 from gengliangwang/timestampToTimestampWithoutTZ. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-13 18:44:24 +03:00
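The `sourceValue - timeZoneOffset` rule in SPARK-35719 can be sketched as plain arithmetic on microsecond counts. The function name and the fixed-offset assumption below are illustrative only; Spark resolves the offset from the session time zone, which can vary with daylight saving time and is not modelled here.

```python
from datetime import timedelta


def local_micros_to_instant_micros(local_micros: int, zone_offset: timedelta) -> int:
    """Convert a wall-clock value (microseconds since 1970-01-01 00:00:00
    local time) to an instant (microseconds since the UTC epoch).

    Hypothetical helper: assumes a fixed zone offset."""
    offset_micros = zone_offset // timedelta(microseconds=1)
    # The instant is the wall-clock value minus the zone offset, which is
    # the correction the commit message describes.
    return local_micros - offset_micros


# 03:00:00 on 1970-01-01 in a UTC+3 zone is exactly the UTC epoch.
three_hours = 3 * 3600 * 1_000_000
print(local_micros_to_instant_micros(three_hours, timedelta(hours=3)))
```

Returning the source value unchanged, as in the earlier mistake the commit mentions, would only be correct for a zero offset.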
Haiyang Sun	0ba1d3852b	[SPARK-35701][SQL] Use copy-on-write semantics for SQLConf registered configurations ### What changes were proposed in this pull request? Using copy-on-write for `SQLConf.sqlConfEntries` and `SQLConf.staticConfKeys` to reduce contention in concurrent workloads. ### Why are the changes needed? The global locks used to protect `SQLConf.sqlConfEntries` map and the `SQLConf.staticConfKeys` set can cause significant contention on the `SQLConf` instance in a concurrent setting. Using copy-on-write versions should reduce the contention given that modifications to the configs are relatively rare. Closes #32865 from haiyangsun-db/SPARK-35701. Authored-by: Haiyang Sun <haiyang.sun@databricks.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-06-12 14:59:48 -07:00
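The copy-on-write idea in SPARK-35701 can be sketched as follows. This is not the `SQLConf` code: the class and method names are hypothetical, and the sketch only assumes the general pattern in which readers use an immutable snapshot without locking while the rare writers copy, modify, and swap the reference.

```python
import threading
from types import MappingProxyType
from typing import Mapping, Optional


class CopyOnWriteRegistry:
    """Hypothetical registry of config entries with copy-on-write updates."""

    def __init__(self) -> None:
        self._write_lock = threading.Lock()
        self._entries: Mapping[str, str] = MappingProxyType({})

    def register(self, key: str, doc: str) -> None:
        # Writers serialize among themselves, build a fresh copy, and then
        # publish it with a single reference assignment.
        with self._write_lock:
            updated = dict(self._entries)
            updated[key] = doc
            self._entries = MappingProxyType(updated)

    def get(self, key: str) -> Optional[str]:
        # Readers take no lock: they see either the old or the new snapshot,
        # never a partially updated map.
        return self._entries.get(key)

    def snapshot(self) -> Mapping[str, str]:
        return self._entries


registry = CopyOnWriteRegistry()
before = registry.snapshot()
registry.register("example.key", "an illustrative entry")
print(len(before), registry.get("example.key"))
```

The trade-off is that each write copies the whole map, which suits the situation the commit describes, where modifications to the registered configs are relatively rare.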
Kousuke Saruta	80f7989d9a	[SPARK-35734][SQL] Format day-time intervals using type fields ### What changes were proposed in this pull request? This PR adds a feature which formats day-time intervals as strings using the start and end fields of `DayTimeIntervalType`. ### Why are the changes needed? Currently, they are ignored, and any `DayTimeIntervalType` is formatted as `INTERVAL DAY TO SECOND`. ### Does this PR introduce _any_ user-facing change? Yes. The format of day-time intervals is determined by the start and end fields. ### How was this patch tested? New test. Closes #32891 from sarutak/interval-format. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-12 21:45:12 +03:00
shahid	450b415028	[SPARK-35746][UI] Fix taskid in the stage page task event timeline ### What changes were proposed in this pull request? The task id is shown incorrectly in the timeline plot on the Stage page. ### Why are the changes needed? To map event timeline plots to the correct task. Before: ![image](https://user-images.githubusercontent.com/23054875/121761077-81775800-cb4b-11eb-8ec6-ee71926a6549.png) After ![image](https://user-images.githubusercontent.com/23054875/121761195-02ceea80-cb4c-11eb-8ce6-07bb1cca190e.png) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested Closes #32888 from shahidki31/shahid/fixtaskid. Authored-by: shahid <shahidki31@gmail.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2021-06-12 15:38:41 +09:00
Hyukjin Kwon	76e08a8e3d	[SPARK-35738][PYTHON] Support 'y' properly in DataFrame with non-numeric columns with plots ### What changes were proposed in this pull request? This PR proposes to port the fix https://github.com/databricks/koalas/pull/2172. ```python ks.DataFrame({'a': [1, 2, 3], 'b':["a", "b", "c"], 'c': [4, 5, 6]}).plot(kind='hist', x='a', y='c', bins=200) ``` Before: ``` pyspark.sql.utils.AnalysisException: cannot resolve 'least(min(a), min(b), min(c))' due to data type mismatch: The expressions should all have the same type, got LEAST(bigint, string, bigint).; 'Aggregate [unresolvedalias(least(min(a#1L), min(b#2), min(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1)), unresolvedalias(greatest(max(a#1L), max(b#2), max(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1))] +- Project [a#1L, b#2, c#3L] +- Project [__index_level_0__#0L, a#1L, b#2, c#3L, monotonically_increasing_id() AS __natural_order__#8L] +- LogicalRDD [__index_level_0__#0L, a#1L, b#2, c#3L], false ``` After: ```python Figure({ 'data': [{'hovertemplate': 'variable=a<br>value=%{text}<br>count=%{y}', 'name': 'a', ... ``` ### Why are the changes needed? To match pandas' behaviour and allow users to set `x` and `y` in a DataFrame with non-numeric columns. ### Does this PR introduce _any_ user-facing change? No to end users since the change is not released yet. Yes to dev as described before. ### How was this patch tested? Manually tested, added a test and tested in notebooks: ![Screen Shot 2021-06-11 at 9 11 25 PM](https://user-images.githubusercontent.com/6477701/121686038-a47a1b80-cafb-11eb-8f8e-8d968db7ebef.png) ![Screen Shot 2021-06-11 at 9 48 58 PM](https://user-images.githubusercontent.com/6477701/121688858-e22c7380-cafe-11eb-9d0a-adcbe560030f.png) Closes #32884 from HyukjinKwon/fix-hist-plot. 
Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-12 14:36:46 +09:00
Chao Sun	9c7250fa73	[SPARK-35321][SQL] Don't register Hive permanent functions when creating Hive client ### What changes were proposed in this pull request? Instantiate a new Hive client through `Hive.getWithoutRegisterFns(conf, false)` instead of `Hive.get(conf)`, if `Hive` version is >= '2.3.9' (the built-in version). ### Why are the changes needed? [HIVE-10319](https://issues.apache.org/jira/browse/HIVE-10319) introduced a new API `get_all_functions` which is only supported in Hive 1.3.0/2.0.0 and up. As result, when Spark 3.x talks to a HMS service of version 1.2 or lower, the following error will occur: ``` Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions' at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) ... 96 more Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions' at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833) ``` The `get_all_functions` is called only when `doRegisterAllFns` is set to true: ```java private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException { conf = c; if (doRegisterAllFns) { registerAllFunctionsOnce(); } } ``` what this does is to register all Hive permanent functions defined in HMS in Hive's `FunctionRegistry` class, via iterating through results from `get_all_functions`. To Spark, this seems unnecessary as it loads Hive permanent (not built-in) UDF via directly calling the HMS API, i.e., `get_function`. 
The `FunctionRegistry` is only used in loading Hive's built-in function that is not supported by Spark. At this time, it only applies to `histogram_numeric`. [HIVE-21563](https://issues.apache.org/jira/browse/HIVE-21563) introduced a new API `getWithoutRegisterFns` which skips the above registration and is available in Hive 2.3.9. Therefore, Spark should adopt it to avoid the cost. ### Does this PR introduce _any_ user-facing change? Yes with this fix Spark now should be able to talk to HMS server with Hive 1.2.x and lower. ### How was this patch tested? Manually started a HMS server of Hive version 1.2.2. Without the PR it failed with the above exception. With the PR the error disappeared and I can successfully perform common operations such as create table, create database, list tables, etc. Closes #32887 from sunchao/SPARK-35321-new. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2021-06-12 10:32:30 +08:00
Liang-Chi Hsieh	703376e8a9	[SPARK-35689][SS] Add log warn when keyWithIndexToValue returns null value ### What changes were proposed in this pull request? This patch adds log warn when `keyWithIndexToValue` returns null value in `SymmetricHashJoinStateManager`. ### Why are the changes needed? Once we get null from state store in SymmetricHashJoinStateManager, it is better to add meaningful logging for the case. It is better for debugging. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #32828 from viirya/fix-ss-joinstatemanager-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2021-06-12 10:17:09 +09:00
Takuya UESHIN	4d21b94d13	[SPARK-35475][PYTHON] Fix disallow_untyped_defs mypy checks ### What changes were proposed in this pull request? Adds more type annotations in the file `python/pyspark/pandas/namespace.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more disallow_untyped_defs mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. Closes #32871 from ueshin/issues/SPARK-35475/disallow_untyped_defs. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-11 11:07:11 -07:00
Chendi Xue	e958833c72	[SPARK-35396][SQL][TESTS][FOLLOWUP] Add a UT to check if a user-defined cachedBatch is completely released ### What changes were proposed in this pull request? This PR adds a UT to check that user-defined cached batches are completely released when clearCache is called. ### Why are the changes needed? Add a new UT file RefCountedTestCachedBatchSerializerSuite.scala ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT is added, org.apache.spark.sql.execution.columnar.RefCountedTestCachedBatchSerializerSuite Closes #32717 from xuechendi/support_manual_close_in_InMemoryRelation. Authored-by: Chendi Xue <chendi.xue@intel.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-11 08:20:26 -07:00
Max Gekk	d53831ff5c	[SPARK-35704][SQL] Add fields to `DayTimeIntervalType` ### What changes were proposed in this pull request? Extend DayTimeIntervalType to support interval fields. Valid interval field values: - 0 (DAY) - 1 (HOUR) - 2 (MINUTE) - 3 (SECOND) After the changes, the following day-time interval types are supported: 1. `DayTimeIntervalType(0, 0)` or `DayTimeIntervalType(DAY, DAY)` 2. `DayTimeIntervalType(0, 1)` or `DayTimeIntervalType(DAY, HOUR)` 3. `DayTimeIntervalType(0, 2)` or `DayTimeIntervalType(DAY, MINUTE)` 4. `DayTimeIntervalType(0, 3)` or `DayTimeIntervalType(DAY, SECOND)`. It is the default one. The second fraction precision is microseconds. 5. `DayTimeIntervalType(1, 1)` or `DayTimeIntervalType(HOUR, HOUR)` 6. `DayTimeIntervalType(1, 2)` or `DayTimeIntervalType(HOUR, MINUTE)` 7. `DayTimeIntervalType(1, 3)` or `DayTimeIntervalType(HOUR, SECOND)` 8. `DayTimeIntervalType(2, 2)` or `DayTimeIntervalType(MINUTE, MINUTE)` 9. `DayTimeIntervalType(2, 3)` or `DayTimeIntervalType(MINUTE, SECOND)` 10. `DayTimeIntervalType(3, 3)` or `DayTimeIntervalType(SECOND, SECOND)` ### Why are the changes needed? In the current implementation, Spark supports only `interval day to second`, but the SQL standard allows specifying the start and end fields. The changes allow Spark to follow the ANSI SQL standard more precisely. ### Does this PR introduce _any_ user-facing change? Yes but `DayTimeIntervalType` has not been released yet. ### How was this patch tested? By existing test suites. Closes #32849 from MaxGekk/day-time-interval-type-units. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-11 16:16:33 +03:00
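The ten supported types listed in SPARK-35704 are exactly the (start, end) pairs with start not greater than end over the four field values. A minimal sketch of that enumeration follows; the helper names are hypothetical and the strings only mirror the notation used in the commit message, not a Spark API.

```python
from itertools import combinations_with_replacement
from typing import List, Tuple

# Field indexes as given in the commit message.
FIELD_NAMES = {0: "DAY", 1: "HOUR", 2: "MINUTE", 3: "SECOND"}


def valid_day_time_interval_fields() -> List[Tuple[int, int]]:
    """All (start, end) field pairs with start <= end, in listing order."""
    return list(combinations_with_replacement(sorted(FIELD_NAMES), 2))


def describe(start: int, end: int) -> str:
    """Render a pair in the DayTimeIntervalType(START, END) notation."""
    return f"DayTimeIntervalType({FIELD_NAMES[start]}, {FIELD_NAMES[end]})"


for start, end in valid_day_time_interval_fields():
    print(describe(start, end))
```

The pair `(0, 3)`, rendered as `DayTimeIntervalType(DAY, SECOND)`, is the one the commit message names as the default.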
Tanel Kiis	692dc66c4a	[SPARK-35695][SQL] Collect observed metrics from cached and adaptive execution sub-trees ### What changes were proposed in this pull request? Collect observed metrics from cached and adaptive execution sub-trees. ### Why are the changes needed? Currently persisting/caching will hide all observed metrics in that sub-tree from reaching the `QueryExecutionListeners`. Adaptive query execution can also hide the metrics from reaching `QueryExecutionListeners`. ### Does this PR introduce _any_ user-facing change? Bugfix ### How was this patch tested? New UTs Closes #32862 from tanelk/SPARK-35695_collect_metrics_persist. Lead-authored-by: Tanel Kiis <tanel.kiis@gmail.com> Co-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-11 21:03:08 +08:00
RoryQi	57ce64c511	[SPARK-35706][SQL] Consider making the ':' in STRUCT data type definition optional ### What changes were proposed in this pull request? The STRUCT type syntax is defined like this: STRUCT(fieldName: fieldType [NOT NULL][COMMENT stringLiteral][,.....]) So the field list is nearly the same as a column list. If we could make ':' optional, it would be much cleaner and less proprietary. ### Why are the changes needed? ease of use ### Does this PR introduce _any_ user-facing change? Yes, a STRUCT field list can now be written in nearly the same way as a column list. ### How was this patch tested? unit tests Closes #32858 from jerqi/master. Authored-by: RoryQi <1242949407@qq.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-11 12:58:32 +00:00


30575 commits