ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Dongjoon Hyun	745bd090f7	[SPARK-35589][CORE][TESTS][FOLLOWUP] Remove the duplicated test coverage ### What changes were proposed in this pull request? This removes the accidental duplicated test coverage. ### Why are the changes needed? To save the test resources. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A because this is a removal of the duplicated test coverage. Closes #32774 from dongjoon-hyun/SPARK-35589-3. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-04 10:28:12 +09:00
Xinrong Meng	7eeb07d0f9	[SPARK-35606][PYTHON][INFRA] List Python 3.9 installed libraries in build_and_test workflow ### What changes were proposed in this pull request? In the build_and_test workflow, tests are run against both Python 3.6 and Python 3.9. However, only libraries installed in Python 3.6 are listed. We should list Python 3.9's installed libraries as well. ### Why are the changes needed? Listing Python 3.9's installed libraries is helpful for debugging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual check. Closes #32737 from xinrong-databricks/ci_py3.9lib. Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com> Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-03 14:24:53 -07:00
fornaix	878527d9fa	[SPARK-35612][SQL] Support LZ4 compression in ORC data source ### What changes were proposed in this pull request? This PR aims to support LZ4 compression in the ORC data source. ### Why are the changes needed? Apache ORC supports LZ4 compression, but we cannot set LZ4 compression in the ORC data source BEFORE ```scala scala> spark.range(10).write.option("compression", "lz4").orc("/tmp/lz4") java.lang.IllegalArgumentException: Codec [lz4] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none, zstd. ``` AFTER ```scala scala> spark.range(10).write.option("compression", "lz4").orc("/tmp/lz4") ``` ```bash $ orc-tools meta /tmp/lz4 Processing data file file:/tmp/lz4/part-00000-6a244eee-b092-4c79-a977-fb8a69dde2eb-c000.lz4.orc [length: 222] Structure for file:/tmp/lz4/part-00000-6a244eee-b092-4c79-a977-fb8a69dde2eb-c000.lz4.orc File Version: 0.12 with ORC_517 Rows: 10 Compression: LZ4 Compression size: 262144 Type: struct<id:bigint> Stripe Statistics: Stripe 1: Column 0: count: 10 hasNull: false Column 1: count: 10 hasNull: false bytesOnDisk: 7 min: 0 max: 9 sum: 45 File Statistics: Column 0: count: 10 hasNull: false Column 1: count: 10 hasNull: false bytesOnDisk: 7 min: 0 max: 9 sum: 45 Stripes: Stripe: offset: 3 data: 7 rows: 10 tail: 35 index: 35 Stream: column 0 section ROW_INDEX start: 3 length 11 Stream: column 1 section ROW_INDEX start: 14 length 24 Stream: column 1 section DATA start: 38 length 7 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 File length: 222 bytes Padding length: 0 bytes Padding ratio: 0% User Metadata: org.apache.spark.version=3.2.0 ``` ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Pass the newly added test case. Closes #32751 from fornaix/spark-35612. Authored-by: fornaix <foxnaix@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-03 14:07:26 -07:00
Liang-Chi Hsieh	0342dcb628	[SPARK-35580][SQL] Implement canonicalized method for HigherOrderFunction ### What changes were proposed in this pull request? This patch implements `canonicalized` method for `HigherOrderFunction`. Basically it canonicalizes the name of all `NamedLambdaVariable`s and their `ExprId`. The name and `ExprId` of `NamedLambdaVariable` are unque. But to compare semantic equality between `HigherOrderFunction`, we can canonicalize them. ### Why are the changes needed? The default `canonicalized` method does not work for `HigherOrderFunction`. It makes subexpression elimination not work for higher functions. Manual check gen-ed code for: ```scala val df = Seq(Seq(1, 2, 3)).toDF("a") df.select(transform($"a", x => x + 1), transform($"a", x => x + 1)).collect() ``` The code for `transform(input[0, array<int>, true], lambdafunction((lambda x_20#19041 + 1), lambda x_20#19041, false)),transform(input[0, array<int>, true], lambdafunction((lambda x_21#19042 + 1), lambda x_21#19042, false))`, generated by `GenerateUnsafeProjection`. Before: ```java /* 005 / class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection { ... / 028 / public UnsafeRow apply(InternalRow i) { ... / 034 / Object obj_0 = ((Expression) references[0]).eval(i); ... / 062 / Object obj_1 = ((Expression) references[1]).eval(i); ... / 093 / } ``` After: ```java / 005 / class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection { ... / 031 / public UnsafeRow apply(InternalRow i) { ... / 033 / subExpr_0(i); ... / 086 / private void subExpr_0(InternalRow i) { / 087 / Object obj_0 = ((Expression) references[0]).eval(i); / 088 / boolean isNull_0 = obj_0 == null; / 089 / ArrayData value_0 = null; / 090 / if (!isNull_0) { / 091 / value_0 = (ArrayData) obj_0; / 092 / } / 093 / subExprIsNull_0 = isNull_0; / 094 / mutableStateArray_0[0] = value_0; / 095 / } / 096 / / 097 */ } ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test and manual check gen-ed code. Closes #32735 from viirya/higher-func-canonicalize. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-06-03 09:16:47 -07:00
Dongjoon Hyun	4f0db872a0	[SPARK-35416][K8S][FOLLOWUP] Use Set instead of ArrayBuffer ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/32564 . ### Why are the changes needed? To use Set instead of ArrayBuffer and add a return type. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #32758 from dongjoon-hyun/SPARK-35416-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>	2021-06-03 10:41:11 -05:00
Fu Chen	cfde117c6f	[SPARK-35316][SQL] UnwrapCastInBinaryComparison support In/InSet predicate ### What changes were proposed in this pull request? This pr add in/inset predicate support for `UnwrapCastInBinaryComparison`. Current implement doesn't pushdown filters for `In/InSet` which contains `Cast`. For instance: ```scala spark.range(50).selectExpr("cast(id as int) as id").write.mode("overwrite").parquet("/tmp/parquet/t1") spark.read.parquet("/tmp/parquet/t1").where("id in (1L, 2L, 4L)").explain ``` before this pr: ``` == Physical Plan == (1) Filter cast(id#5 as bigint) IN (1,2,4) +- (1) ColumnarToRow +- FileScan parquet [id#5] Batched: true, DataFilters: [cast(id#5 as bigint) IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int> ``` after this pr: ``` == Physical Plan == (1) Filter id#95 IN (1,2,4) +- (1) ColumnarToRow +- FileScan parquet [id#95] Batched: true, DataFilters: [id#95 IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [In(id, [1,2,4])], ReadSchema: struct<id:int> ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Closes #32488 from cfmcgrady/SPARK-35316. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-03 14:45:17 +00:00
Kousuke Saruta	c532f8260e	[SPARK-35609][BUILD] Add style rules to prohibit to use a Guava's API which is incompatible with newer versions ### What changes were proposed in this pull request? This PR adds rules to `checkstyle.xml` and `scalastyle-config.xml` to avoid introducing `Objects.toStringHelper` a Guava's API which is no longer present in newer Guava. ### Why are the changes needed? SPARK-30272 (#26911) replaced `Objects.toStringHelper` which is an APIs Guava 14 provides with `commons.lang3` API because `Objects.toStringHelper` is no longer present in newer Guava. But toStringHelper was introduced into Spark again and replaced them in SPARK-35420 (#32567). I think it's better to have a style rule to avoid such repetition. SPARK-30272 replaced some APIs aside from `Objects.toStringHelper` but `Objects.toStringHelper` seems to affect Spark for now so I add rules only for it. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I confirmed that `lint-java` and `lint-scala` detect the usage of `toStringHelper` and let the lint check fail. ``` $ dev/lint-java exec: curl --silent --show-error -L https://downloads.lightbend.com/scala/2.12.14/scala-2.12.14.tgz Using `mvn` from path: /opt/maven/3.6.3//bin/mvn Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/network/protocol/OneWayMessage.java:[78] (regexp) RegexpSinglelineJava: Avoid using Object.toStringHelper. Use ToStringBuilder instead. $ dev/lint-scala Scalastyle checks failed at following occurrences: [error] /home/kou/work/oss/spark/core/src/main/scala/org/apache/spark/rdd/RDD.scala:93:25: Avoid using Object.toStringHelper. Use ToStringBuilder instead. [error] Total time: 25 s, completed 2021/06/02 16:18:25 ``` Closes #32740 from sarutak/style-rule-for-guava. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2021-06-03 21:52:41 +09:00
itholic	2658bc590f	[SPARK-35081][DOCS] Add Data Source Option links to missing documents ### What changes were proposed in this pull request? This PR proposes adding the missing link to Data Source Option page, for related functions such as `to_csv`, `to_json`, `from_csv`, `from_json`, `schema_of_csv`, `schema_of_json`. - Before <img width="797" alt="Screen Shot 2021-06-03 at 11 39 17 AM" src="https://user-images.githubusercontent.com/44108233/120578877-7b092200-c461-11eb-9e24-bd5349445c66.png"> - After <img width="776" alt="Screen Shot 2021-06-03 at 11 59 14 AM" src="https://user-images.githubusercontent.com/44108233/120579868-29fa2d80-c463-11eb-9329-bd6c8f068f5b.png"> ### Why are the changes needed? To provide users available options in detail with the proper documentation link. ### Does this PR introduce _any_ user-facing change? Yes, the link to Data Source Options page is added to the API documentations, as shown in the above screen capture. ### How was this patch tested? Manually built the docs and checked one by one. Closes #32762 from itholic/SPARK-35081. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-03 13:52:46 +09:00
yangjie01	4a549f2de2	[SPARK-35574][BUILD] Add a compile arg to turn compilation warnings related to `procedure syntax` to compilation errors in Scala 2.13 ### What changes were proposed in this pull request? There are several pr to fix compilation warnings related to `procedure syntax` like SPARK-29291， SPARK-33352 and SPARK-35526, in order to prevent the recurrence of similar problems, this pr add a compile arg to convert `procedure syntax` related compilation warnings to compilation errors in Scala 2.13. ### Why are the changes needed? Prevent the recurrence of compilation warnings related to `procedure syntax is deprecated` ### Does this PR introduce _any_ user-facing change? `procedure syntax` is no longer allowed in Spark code with Scala 2.13, for constructors methods definition should be `this(...) = { }` not `this(...) { }`, for without `return type` methods definition should be `def methodName(...): Unit = {}` not `def methodName(...) {}`. ### How was this patch tested? - Pass the GitHub Action Scala 2.13 job - Manual test： Do some code change like: ``` Index: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala =================================================================== -67,7 +67,7 private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) extends SparkListener with ThreadSafeRpcEndpoint with Logging { - def this(sc: SparkContext) = { + def this(sc: SparkContext) { this(sc, new SystemClock) } Index: core/src/main/scala/org/apache/spark/MapOutputTracker.scala =================================================================== -720,7 +720,7 } } - def registerMergeResult(shuffleId: Int, reduceId: Int, status: MergeStatus): Unit = { + def registerMergeResult(shuffleId: Int, reduceId: Int, status: MergeStatus) { shuffleStatuses(shuffleId).addMergeResult(reduceId, status) } ``` sbt with Scala 2.13 profile compile failed as follows:* ``` [error] /home/runner/work/spark/spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70:29: procedure syntax is deprecated for constructors: add `=`, as in method definition [error] def this(sc: SparkContext) { [error] ^ [error] /home/runner/work/spark/spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:723:79: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `registerMergeResult`'s return type [error] def registerMergeResult(shuffleId: Int, reduceId: Int, status: MergeStatus) { [error] ^ [error] two errors found [error] (core / Compile / compileIncremental) Compilation failed [error] Total time: 136 s (02:16), completed May 31, 2021 10:06:50 AM Error: Process completed with exit code 1. ``` maven with Scala 2.13 profile compile failed as follows: ``` [ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: procedure syntax is deprecated for constructors: add `=`, as in method definition [ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:723: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `registerMergeResult`'s return type [ERROR] two errors found ``` Closes #32710 from LuciferYang/SPARK-35574. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-03 13:52:04 +09:00
itholic	e0bccc1831	[SPARK-35528][DOCS] Add more options at Data Source Options pages ### What changes were proposed in this pull request? This PR proposes adding more methods to set data source option to `Data Source Option` page for each data source. For example, Data Source Option page for JSON as below: - Before <img width="322" alt="Screen Shot 2021-06-03 at 10 51 54 AM" src="https://user-images.githubusercontent.com/44108233/120574245-eb13aa00-c459-11eb-9f81-0b356023bcb5.png"> - After <img width="470" alt="Screen Shot 2021-06-03 at 10 52 21 AM" src="https://user-images.githubusercontent.com/44108233/120574253-ed760400-c459-11eb-9008-1f075e0b9267.png"> ### Why are the changes needed? To provide users various options when they set options for data source. ### Does this PR introduce _any_ user-facing change? Yes, now the document provides more methods for setting options than before, as in above screen capture. ### How was this patch tested? Manually built the docs and check one by one. Closes #32757 from itholic/SPARK-35528. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-03 12:49:10 +09:00
Hyukjin Kwon	d478cff8bb	[SPARK-35620][BUILD][PYTHON] Remove documentation build in Python linter ### What changes were proposed in this pull request? This PR proposes to remove PySpark documentation build in linter check because: - to speed up CI build by removing duplicate documentation build (linter and doc build) - for https://github.com/apache/spark/pull/32726. With this PR PySpark documentation build requires a full Spark build to generate plot images in PySpark documentation. It makes less sense to require it in Python linter. - to remove unnecessary dependency installation for Python linter in CI ### Why are the changes needed? Python linter script includes documentation build. Because of this, we run documentation builds duplicately in CI, and requires unnecessary dependencies to be installed, and takes extra time. It would more make sense to exclude this in Python linter. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested, and it will be tested in CI. Closes #32760 from HyukjinKwon/SPARK-35620. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-03 12:48:30 +09:00
Sumeet Gajjar	b9e53f8937	[SPARK-35011][CORE] Avoid Block Manager registrations when StopExecutor msg is in-flight ### What changes were proposed in this pull request? This patch proposes a fix to prevent triggering BlockManager reregistration while `StopExecutor` msg is in-flight. Here on receiving `StopExecutor` msg, we do not remove the corresponding `BlockManagerInfo` from `blockManagerInfo` map, instead we mark it as dead by updating the corresponding `executorRemovalTs`. There's a separate cleanup thread running to periodically remove the stale `BlockManagerInfo` from `blockManangerInfo` map. Now if a recently removed `BlockManager` tries to register, the driver simply ignores it since the `blockManagerInfo` map already contains an entry for it. The same applies to `BlockManagerHeartbeat`, if the BlockManager belongs to a recently removed executor, the `blockManagerInfo` map would contain an entry and we shall not ask the corresponding `BlockManager` to re-register. ### Why are the changes needed? This changes are needed since BlockManager reregistration while executor is shutting down causes inconsistent bookkeeping of executors in Spark. Consider the following scenario: - `CoarseGrainedSchedulerBackend` issues async `StopExecutor` on executorEndpoint - `CoarseGrainedSchedulerBackend` removes that executor from Driver's internal data structures and publishes `SparkListenerExecutorRemoved` on the `listenerBus`. - Executor has still not processed `StopExecutor` from the Driver - Driver receives heartbeat from the Executor, since it cannot find the `executorId` in its data structures, it responds with `HeartbeatResponse(reregisterBlockManager = true)` - `BlockManager` on the Executor reregisters with the `BlockManagerMaster` and `SparkListenerBlockManagerAdded` is published on the `listenerBus` - Executor starts processing the `StopExecutor` and exits - `AppStatusListener` picks the `SparkListenerBlockManagerAdded` event and updates `AppStatusStore` - `statusTracker.getExecutorInfos` refers `AppStatusStore` to get the list of executors which returns the dead executor as alive. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Modified the existing unittests. - Ran a simple test application on minikube that asserts on number of executors are zero once the executor idle timeout is reached. Closes #32114 from sumeetgajjar/SPARK-35011. Authored-by: Sumeet Gajjar <sumeetgajjar93@gmail.com> Signed-off-by: yi.wu <yi.wu@databricks.com>	2021-06-03 11:15:50 +08:00
Dongjoon Hyun	2550490c09	[SPARK-35617][INFRA] Update GitHub Action docker image to 20210602 ### What changes were proposed in this pull request? This PR aims to update GitHub Action docker image with the following updates. 1. Add `pip` explicitly to Python 3.8/3.9 2. Add `plotly` to Python 3.8. 3. Since SPARK-35573 fixes SparkR UT failures on R 4.1.0, update SparkR job to run R 4.1.0. ### Why are the changes needed? To improve the GitHub Action test infra and unblock #32737 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the GitHub Action. Closes #32755 from dongjoon-hyun/SPARK-35617. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-02 18:30:38 -07:00
attilapiros	806edf8f44	[SPARK-35610][CORE] Fix the memory leak introduced by the Executor's stop shutdown hook ### What changes were proposed in this pull request? Fixing the memory leak by deregistering the shutdown hook when the executor is stopped. This way the Garbage Collector can release the executor object early. Which is a huge win for our tests as user's classloader could be also released which keeps references to objects which are created for the jars on the classpath. ### Why are the changes needed? I have identified this leak by running the Livy tests (I know it is close to the attic but this leak causes a constant OOM there) and it is in our Spark unit tests as well. This leak can be identified by checking the number of `LeakyEntry` in case of Scala 2.12.14 (and `ZipEntry` for Scala 2.12.10) instances which with its related data can take up a considerable amount of memory (as those are created from the jars which are on the classpath). I have my own tool for instrumenting JVM code [trace-agent](https://github.com/attilapiros/trace-agent) and with that I am able to call JVM diagnostic commands at specific methods. Let me show how it in action. It has a single text file embedded into the tool's jar called action.txt. In this case actions.txt content is: {noformat} $ unzip -q -c trace-agent-0.0.7.jar actions.txt diagnostic_command org.apache.spark.repl.ReplSuite runInterpreter cmd:gcClassHistogram,limit_output_lines:8,where:beforeAndAfter,with_gc:true diagnostic_command org.apache.spark.repl.ReplSuite afterAll cmd:gcClassHistogram,limit_output_lines:8,where:after,with_gc:true {noformat} Which creates a class histogram at the beginning and at the end of `org.apache.spark.repl.ReplSuite#runInterpreter()` (after triggering a GC which might not finish as GC is done in a separate thread..) and one histogram in the end of the `org.apache.spark.repl.ReplSuite#afterAll()` method. And the histograms are the followings on master branch: ``` $ ./build/sbt ";project repl;set Test/javaOptions += \"-javaagent:/Users/attilazsoltpiros/git/attilapiros/memoryLeak/trace-agent-0.0.7.jar\"; testOnly" \|grep "ZipEntry\\|LeakyEntry" 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 3: 394178 18920544 scala.reflect.io.FileZipArchive$LeakyEntry 3: 394178 18920544 scala.reflect.io.FileZipArchive$LeakyEntry 3: 591267 28380816 scala.reflect.io.FileZipArchive$LeakyEntry 3: 591267 28380816 scala.reflect.io.FileZipArchive$LeakyEntry 3: 788356 37841088 scala.reflect.io.FileZipArchive$LeakyEntry 3: 788356 37841088 scala.reflect.io.FileZipArchive$LeakyEntry 3: 985445 47301360 scala.reflect.io.FileZipArchive$LeakyEntry 3: 985445 47301360 scala.reflect.io.FileZipArchive$LeakyEntry 3: 1182534 56761632 scala.reflect.io.FileZipArchive$LeakyEntry 3: 1182534 56761632 scala.reflect.io.FileZipArchive$LeakyEntry 3: 1379623 66221904 scala.reflect.io.FileZipArchive$LeakyEntry 3: 1379623 66221904 scala.reflect.io.FileZipArchive$LeakyEntry 3: 1576712 75682176 scala.reflect.io.FileZipArchive$LeakyEntry ``` Where the header of the table is: ``` num #instances #bytes class name ``` So the `LeakyEntry` in the end is about 75MB (173MB in case of Scala 2.12.10 and before for another class called `ZipEntry`) but the first item (a char/byte arrays) and the second item (strings) in the histogram also relates to this leak: ``` $ ./build/sbt ";project repl;set Test/javaOptions += \"-javaagent:/Users/attilazsoltpiros/git/attilapiros/memoryLeak/trace-agent-0.0.7.jar\"; testOnly" \|grep "1:\\|2:\\|3:" 1: 2701 3496112 [B 2: 21855 2607192 [C 3: 4885 537264 java.lang.Class 1: 480323 55970208 [C 2: 480499 11531976 java.lang.String 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 1: 481825 56148024 [C 2: 481998 11567952 java.lang.String 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 1: 487056 57550344 [C 2: 487179 11692296 java.lang.String 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 1: 487054 57551008 [C 2: 487176 11692224 java.lang.String 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 1: 927823 107139160 [C 2: 928072 22273728 java.lang.String 3: 394178 18920544 scala.reflect.io.FileZipArchive$LeakyEntry 1: 927793 107129328 [C 2: 928041 22272984 java.lang.String 3: 394178 18920544 scala.reflect.io.FileZipArchive$LeakyEntry 1: 1361851 155555608 [C 2: 1362261 32694264 java.lang.String 3: 591267 28380816 scala.reflect.io.FileZipArchive$LeakyEntry 1: 1361683 155493464 [C 2: 1362092 32690208 java.lang.String 3: 591267 28380816 scala.reflect.io.FileZipArchive$LeakyEntry 1: 1803074 205157728 [C 2: 1803268 43278432 java.lang.String 3: 788356 37841088 scala.reflect.io.FileZipArchive$LeakyEntry 1: 1802385 204938224 [C 2: 1802579 43261896 java.lang.String 3: 788356 37841088 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2236631 253636592 [C 2: 2237029 53688696 java.lang.String 3: 985445 47301360 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2236536 253603008 [C 2: 2236933 53686392 java.lang.String 3: 985445 47301360 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2668892 301893920 [C 2: 2669510 64068240 java.lang.String 3: 1182534 56761632 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2668759 301846376 [C 2: 2669376 64065024 java.lang.String 3: 1182534 56761632 scala.reflect.io.FileZipArchive$LeakyEntry 1: 3101238 350101048 [C 2: 3102073 74449752 java.lang.String 3: 1379623 66221904 scala.reflect.io.FileZipArchive$LeakyEntry 1: 3101240 350101104 [C 2: 3102075 74449800 java.lang.String 3: 1379623 66221904 scala.reflect.io.FileZipArchive$LeakyEntry 1: 3533785 398371760 [C 2: 3534835 84836040 java.lang.String 3: 1576712 75682176 scala.reflect.io.FileZipArchive$LeakyEntry 1: 3533759 398367088 [C 2: 3534807 84835368 java.lang.String 3: 1576712 75682176 scala.reflect.io.FileZipArchive$LeakyEntry 1: 3967049 446893400 [C 2: 3968314 95239536 java.lang.String 3: 1773801 85142448 scala.reflect.io.FileZipArchive$LeakyEntry [info] - SPARK-26633: ExecutorClassLoader.getResourceAsStream find REPL classes (8 seconds, 248 milliseconds) Setting default log level to "ERROR". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 1: 3966423 446709584 [C 2: 3967682 95224368 java.lang.String 3: 1773801 85142448 scala.reflect.io.FileZipArchive$LeakyEntry 1: 4399583 495097208 [C 2: 4401050 105625200 java.lang.String 3: 1970890 94602720 scala.reflect.io.FileZipArchive$LeakyEntry 1: 4399578 495070064 [C 2: 4401040 105624960 java.lang.String 3: 1970890 94602720 scala.reflect.io.FileZipArchive$LeakyEntry ``` The last three is about 700MB altogether. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I used the trace-agent tool with the same settings for the modified code: ``` $ ./build/sbt ";project repl;set Test/javaOptions += \"-javaagent:/Users/attilazsoltpiros/git/attilapiros/memoryLeak/trace-agent-0.0.7.jar\"; testOnly" \|grep "1:\\|2:\\|3:" 1: 2701 3496112 [B 2: 21855 2607192 [C 3: 4885 537264 java.lang.Class 1: 480323 55970208 [C 2: 480499 11531976 java.lang.String 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 1: 481825 56148024 [C 2: 481998 11567952 java.lang.String 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 1: 487056 57550344 [C 2: 487179 11692296 java.lang.String 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 1: 487054 57551008 [C 2: 487176 11692224 java.lang.String 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 1: 927823 107139160 [C 2: 928072 22273728 java.lang.String 3: 394178 18920544 scala.reflect.io.FileZipArchive$LeakyEntry 1: 927793 107129328 [C 2: 928041 22272984 java.lang.String 3: 394178 18920544 scala.reflect.io.FileZipArchive$LeakyEntry 1: 1361851 155555608 [C 2: 1362261 32694264 java.lang.String 3: 591267 28380816 scala.reflect.io.FileZipArchive$LeakyEntry 1: 1361683 155493464 [C 2: 1362092 32690208 java.lang.String 3: 591267 28380816 scala.reflect.io.FileZipArchive$LeakyEntry 1: 1803074 205157728 [C 2: 1803268 43278432 java.lang.String 3: 788356 37841088 scala.reflect.io.FileZipArchive$LeakyEntry 1: 1802385 204938224 [C 2: 1802579 43261896 java.lang.String 3: 788356 37841088 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2236631 253636592 [C 2: 2237029 53688696 java.lang.String 3: 985445 47301360 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2236536 253603008 [C 2: 2236933 53686392 java.lang.String 3: 985445 47301360 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2668892 301893920 [C 2: 2669510 64068240 java.lang.String 3: 1182534 56761632 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2668759 301846376 [C 2: 2669376 64065024 java.lang.String 3: 1182534 56761632 scala.reflect.io.FileZipArchive$LeakyEntry 1: 3101238 350101048 [C 2: 3102073 74449752 java.lang.String 3: 1379623 66221904 scala.reflect.io.FileZipArchive$LeakyEntry 1: 3101240 350101104 [C 2: 3102075 74449800 java.lang.String 3: 1379623 66221904 scala.reflect.io.FileZipArchive$LeakyEntry 1: 3533785 398371760 [C 2: 3534835 84836040 java.lang.String 3: 1576712 75682176 scala.reflect.io.FileZipArchive$LeakyEntry 1: 3533759 398367088 [C 2: 3534807 84835368 java.lang.String 3: 1576712 75682176 scala.reflect.io.FileZipArchive$LeakyEntry 1: 3967049 446893400 [C 2: 3968314 95239536 java.lang.String 3: 1773801 85142448 scala.reflect.io.FileZipArchive$LeakyEntry [info] - SPARK-26633: ExecutorClassLoader.getResourceAsStream find REPL classes (8 seconds, 248 milliseconds) Setting default log level to "ERROR". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 1: 3966423 446709584 [C 2: 3967682 95224368 java.lang.String 3: 1773801 85142448 scala.reflect.io.FileZipArchive$LeakyEntry 1: 4399583 495097208 [C 2: 4401050 105625200 java.lang.String 3: 1970890 94602720 scala.reflect.io.FileZipArchive$LeakyEntry 1: 4399578 495070064 [C 2: 4401040 105624960 java.lang.String 3: 1970890 94602720 scala.reflect.io.FileZipArchive$LeakyEntry [success] Total time: 174 s (02:54), completed Jun 2, 2021 2:00:43 PM ╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/memoryLeak ‹SPARK-35610› ╰─$ vim ╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/memoryLeak ‹SPARK-35610› ╰─$ ./build/sbt ";project repl;set Test/javaOptions += \"-javaagent:/Users/attilazsoltpiros/git/attilapiros/memoryLeak/trace-agent-0.0.7.jar\"; testOnly" \|grep "1:\\|2:\\|3:" 1: 2685 3457368 [B 2: 21833 2606712 [C 3: 4885 537264 java.lang.Class 1: 480245 55978400 [C 2: 480421 11530104 java.lang.String 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 1: 480460 56005784 [C 2: 480633 11535192 java.lang.String 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 1: 486643 57537784 [C 2: 486766 11682384 java.lang.String 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 1: 486636 57538192 [C 2: 486758 11682192 java.lang.String 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 1: 501208 60411856 [C 2: 501180 12028320 java.lang.String 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 1: 501206 60412960 [C 2: 501177 12028248 java.lang.String 3: 197089 9460272 scala.reflect.io.FileZipArchive$LeakyEntry 1: 934925 108773320 [C 2: 935058 22441392 java.lang.String 3: 394178 18920544 scala.reflect.io.FileZipArchive$LeakyEntry 1: 934912 108769528 [C 2: 935044 22441056 java.lang.String 3: 394178 18920544 scala.reflect.io.FileZipArchive$LeakyEntry 1: 1370351 156901296 [C 2: 1370318 32887632 java.lang.String 3: 591267 28380816 scala.reflect.io.FileZipArchive$LeakyEntry 1: 1369660 156681680 [C 2: 1369627 32871048 java.lang.String 3: 591267 28380816 scala.reflect.io.FileZipArchive$LeakyEntry 1: 1803746 205383136 [C 2: 1803917 43294008 java.lang.String 3: 788356 37841088 scala.reflect.io.FileZipArchive$LeakyEntry 1: 1803658 205353096 [C 2: 1803828 43291872 java.lang.String 3: 788356 37841088 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2235677 253608240 [C 2: 2236068 53665632 java.lang.String 3: 985445 47301360 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2235539 253560088 [C 2: 2235929 53662296 java.lang.String 3: 985445 47301360 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2667775 301799240 [C 2: 2668383 64041192 java.lang.String 3: 1182534 56761632 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2667765 301798568 [C 2: 2668373 64040952 java.lang.String 3: 1182534 56761632 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2666665 301491096 [C 2: 2667285 64014840 java.lang.String 3: 1182534 56761632 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2666648 301490792 [C 2: 2667266 64014384 java.lang.String 3: 1182534 56761632 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2668169 301833032 [C 2: 2668782 64050768 java.lang.String 3: 1182534 56761632 scala.reflect.io.FileZipArchive$LeakyEntry [info] - SPARK-26633: ExecutorClassLoader.getResourceAsStream find REPL classes (6 seconds, 396 milliseconds) Setting default log level to "ERROR". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 1: 2235495 253419952 [C 2: 2235887 53661288 java.lang.String 3: 985445 47301360 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2668379 301800768 [C 2: 2668979 64055496 java.lang.String 3: 1182534 56761632 scala.reflect.io.FileZipArchive$LeakyEntry 1: 2236123 253522640 [C 2: 2236514 53676336 java.lang.String 3: 985445 47301360 scala.reflect.io.FileZipArchive$LeakyEntry ``` The sum of the last three numbers is about 354MB. Closes #32748 from attilapiros/SPARK-35610. Authored-by: attilapiros <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-02 09:34:28 -07:00
Yuming Wang	8041aed296	[SPARK-34808][SQL][FOLLOWUP] Remove canPlanAsBroadcastHashJoin check in EliminateOuterJoin ### What changes were proposed in this pull request? This PR removes `canPlanAsBroadcastHashJoin` check in `EliminateOuterJoin. ### Why are the changes needed? We can always removes outer join if it only has DISTINCT on streamed side. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #32744 from wangyum/SPARK-34808-2. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-02 14:14:37 +00:00
gengjiaan	9f7cdb89f7	[SPARK-35059][SQL] Group exception messages in hive/execution ### What changes were proposed in this pull request? This PR group exception messages in `sql/hive/src/main/scala/org/apache/spark/sql/hive/execution`. ### Why are the changes needed? It will largely help with standardization of error messages and its maintenance. ### Does this PR introduce _any_ user-facing change? No. Error messages remain unchanged. ### How was this patch tested? No new tests - pass all original tests to make sure it doesn't break any existing behavior. Closes #32694 from beliefer/SPARK-35059. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-02 13:06:55 +00:00
Kent Yao	345d35ed1a	[SPARK-21957][SQL] Support current_user function ### What changes were proposed in this pull request? Currently, we do not have a suitable definition of the `user` concept in Spark. We only have a `sparkUser` app widely but do not support identify or retrieve the user information from a session in STS or a runtime query execution. `current_user()` is very popular and supported by plenty of other modern or old school databases, and also ANSI compliant. This PR add `current_user()` as a SQL function. And, they are the same. In this PR, we add these functions w/o ambiguity. 1. For a normal single-threaded Spark application, clearly the `sparkUser` is always equivalent to `current_user()` . 2. For a multi-threaded Spark application, e.g. Spark thrift server, we use a `ThreadLocal` variable to store the client-side user(after authenticated) before running the query and retrieve it in the parser. ### Why are the changes needed? `current_user()` is very popular and supported by plenty of other modern or old school databases, and also ANSI compliant. ### Does this PR introduce _any_ user-facing change? yes, added `current_user()` as a SQL function ### How was this patch tested? new tests in thrift server and sql/catalyst Closes #32718 from yaooqinn/SPARK-21957. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-02 13:04:40 +00:00
ulysses-you	daf9d198dc	[SPARK-35585][SQL] Support propagate empty relation through project/filter ### What changes were proposed in this pull request? Add rule `ConvertToLocalRelation` into AQE Optimizer. ### Why are the changes needed? Support propagate empty local relation through project and filter like such SQL case: ``` Aggregate Project Join ShuffleStage ShuffleStage ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add test. Closes #32724 from ulysses-you/SPARK-35585. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-02 07:49:56 +00:00
Cheng Su	54e9999d39	[SPARK-35604][SQL] Fix condition check for FULL OUTER sort merge join ### What changes were proposed in this pull request? The condition check for FULL OUTER sort merge join (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1368 ) has unnecessary trip when `leftIndex == leftMatches.size` or `rightIndex == rightMatches.size`. Though this does not affect correctness (`scanNextInBuffered()` returns false anyway). But we can avoid it in the first place. ### Why are the changes needed? Better readability for developers and avoid unnecessary execution. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests, such as `OuterJoinSuite.scala`. Closes #32736 from c21/join-bug. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>	2021-06-02 14:01:34 +08:00
itholic	48252bac95	[SPARK-35583][DOCS] Move JDBC data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes move missing JDBC data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for JDBC data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown below after this change: - "JDBC To Other Databases" page <img width="803" alt="Screen Shot 2021-06-02 at 11 34 14 AM" src="https://user-images.githubusercontent.com/44108233/120415520-a115c000-c396-11eb-9663-9e666e08ed2b.png"> - Python ![Screen Shot 2021-06-01 at 2 57 40 PM](https://user-images.githubusercontent.com/44108233/120273628-ba146780-c2e9-11eb-96a8-11bd25415197.png) - Scala ![Screen Shot 2021-06-01 at 2 57 03 PM](https://user-images.githubusercontent.com/44108233/120273567-a2d57a00-c2e9-11eb-9788-ea58028ca0a6.png) - Java ![Screen Shot 2021-06-01 at 2 58 27 PM](https://user-images.githubusercontent.com/44108233/120273722-d912f980-c2e9-11eb-83b3-e09992d8c582.png) ### How was this patch tested? Manually build docs and confirm the page. Closes #32723 from itholic/SPARK-35583. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-02 14:21:16 +09:00
Yingyi Bu	3f6322f9aa	[SPARK-35077][SQL] Migrate to transformWithPruning for leftover optimizer rules ### What changes were proposed in this pull request? Migrate to transformWithPruning for the following queries: - SimplifyExtractValueOps - NormalizeFloatingNumbers - PushProjectionThroughUnion - PushDownPredicates - ExtractPythonUDFFromAggregate - ExtractPythonUDFFromJoinCondition - ExtractGroupingPythonUDFFromAggregate - ExtractPythonUDFs - CleanupDynamicPruningFilters </google-sheets-html-origin> ### Why are the changes needed? Reduce the number of tree traversals and hence improve the query compilation latency. ### How was this patch tested? Existing tests. Performance diff: <google-sheets-html-origin><style type="text/css"></style>   \| Baseline \| Experiment \| Experiment/Baseline -- \| -- \| -- \| -- SimplifyExtractValueOps \| 99367049 \| 3679579 \| 0.04 NormalizeFloatingNumbers \| 24717928 \| 20451094 \| 0.83 PushProjectionThroughUnion \| 14130245 \| 7913551 \| 0.56 PushDownPredicates \| 276333542 \| 261246842 \| 0.95 ExtractPythonUDFFromAggregate \| 6459451 \| 2683556 \| 0.42 ExtractPythonUDFFromJoinCondition \| 5695404 \| 2504573 \| 0.44 ExtractGroupingPythonUDFFromAggregate \| 5546701 \| 1858755 \| 0.34 ExtractPythonUDFs \| 58726458 \| 1598518 \| 0.03 CleanupDynamicPruningFilters \| 26606652 \| 15417936 \| 0.58 OptimizeSubqueries \| 3072287940 \| 2876462708 \| 0.94 </google-sheets-html-origin> Closes #32721 from sigmod/pushdown. Authored-by: Yingyi Bu <yingyi.bu@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-02 11:46:33 +08:00
Ruifeng Zheng	c2de0a64e9	[SPARK-35100][ML] Refactor AFT - support virtual centering ### What changes were proposed in this pull request? 1, remove old agg; 2, apply new agg supporting virtual centering ### Why are the changes needed? for better convergence ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing testsuites Closes #32199 from zhengruifeng/refactor_aft. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>	2021-06-02 10:47:45 +08:00
Liang-Chi Hsieh	dbf0b50757	[SPARK-35560][SQL] Remove redundant subexpression evaluation in nested subexpressions ### What changes were proposed in this pull request? This patch proposes to improve subexpression evaluation under whole-stage codegen for the cases of nested subexpressions. ### Why are the changes needed? In the cases of nested subexpressions, whole-stage codegen's subexpression elimination will do redundant subexpression evaluation. We should reduce it. For example, if we have two sub-exprs: 1. `simpleUDF($"id")` 2. `functions.length(simpleUDF($"id"))` We should only evaluate `simpleUDF($"id")` once, i.e. ```java subExpr1 = simpleUDF($"id"); subExpr2 = functions.length(subExpr1); ``` Snippets of generated codes: Before: ```java /* 040 / private int project_subExpr_1(long project_expr_0_0) { / 041 / boolean project_isNull_6 = false; / 042 / UTF8String project_value_6 = null; / 043 / if (!false) { / 044 / project_value_6 = UTF8String.fromString(String.valueOf(project_expr_0_0)); / 045 / } / 046 / / 047 / Object project_arg_1 = null; / 048 / if (project_isNull_6) { / 049 / project_arg_1 = ((scala.Function1[]) references[3] / converters /)[0].apply(null); / 050 / } else { / 051 / project_arg_1 = ((scala.Function1[]) references[3] / converters /)[0].apply(project_value_6); / 052 / } / 053 / / 054 / UTF8String project_result_1 = null; / 055 / try { / 056 / project_result_1 = (UTF8String)((scala.Function1[]) references[3] / converters /)[1].apply(((scala.Function1) references[4] / udf /).apply(project_arg_1) ); / 057 / } catch (Throwable e) { / 058 / throw QueryExecutionErrors.failedExecuteUserDefinedFunctionError( / 059 / "DataFrameSuite$$Lambda$6418/1507986601", "string", "string", e); / 060 / } / 061 / / 062 / boolean project_isNull_5 = project_result_1 == null; / 063 / UTF8String project_value_5 = null; / 064 / if (!project_isNull_5) { / 065 / project_value_5 = project_result_1; / 066 / } / 067 / boolean project_isNull_4 = project_isNull_5; / 068 / int project_value_4 = -1; / 069 / / 070 / if (!project_isNull_5) { / 071 / project_value_4 = (project_value_5).numChars(); / 072 / } / 073 / project_subExprIsNull_1 = project_isNull_4; / 074 / return project_value_4; / 075 / } ... / 149 / private UTF8String project_subExpr_0(long project_expr_0_0) { / 150 / boolean project_isNull_2 = false; / 151 / UTF8String project_value_2 = null; / 152 / if (!false) { / 153 / project_value_2 = UTF8String.fromString(String.valueOf(project_expr_0_0)); / 154 / } / 155 / / 156 / Object project_arg_0 = null; / 157 / if (project_isNull_2) { / 158 / project_arg_0 = ((scala.Function1[]) references[1] / converters /)[0].apply(null); / 159 / } else { / 160 / project_arg_0 = ((scala.Function1[]) references[1] / converters /)[0].apply(project_value_2); / 161 / } / 162 / / 163 / UTF8String project_result_0 = null; / 164 / try { / 165 / project_result_0 = (UTF8String)((scala.Function1[]) references[1] / converters /)[1].apply(((scala.Function1) references[2] / udf /).apply(project_arg_0) ); / 166 / } catch (Throwable e) { / 167 / throw QueryExecutionErrors.failedExecuteUserDefinedFunctionError( / 168 / "DataFrameSuite$$Lambda$6418/1507986601", "string", "string", e); / 169 / } / 170 / / 171 / boolean project_isNull_1 = project_result_0 == null; / 172 / UTF8String project_value_1 = null; / 173 / if (!project_isNull_1) { / 174 / project_value_1 = project_result_0; / 175 / } / 176 / project_subExprIsNull_0 = project_isNull_1; / 177 / return project_value_1; / 178 / } ``` After: ```java / 041 / private void project_subExpr_1(long project_expr_0_0) { / 042 / boolean project_isNull_8 = project_subExprIsNull_0; / 043 / int project_value_8 = -1; / 044 / / 045 / if (!project_subExprIsNull_0) { / 046 / project_value_8 = (project_mutableStateArray_0[0]).numChars(); / 047 / } / 048 / project_subExprIsNull_1 = project_isNull_8; / 049 / project_subExprValue_0 = project_value_8; / 050 / } / 056 / ... / 123 / / 124 / private void project_subExpr_0(long project_expr_0_0) { / 125 / boolean project_isNull_6 = false; / 126 / UTF8String project_value_6 = null; / 127 / if (!false) { / 128 / project_value_6 = UTF8String.fromString(String.valueOf(project_expr_0_0)); / 129 / } / 130 / / 131 / Object project_arg_1 = null; / 132 / if (project_isNull_6) { / 133 / project_arg_1 = ((scala.Function1[]) references[3] / converters /)[0].apply(null); / 134 / } else { / 135 / project_arg_1 = ((scala.Function1[]) references[3] / converters /)[0].apply(project_value_6); / 136 / } / 137 / / 138 / UTF8String project_result_1 = null; / 139 / try { / 140 / project_result_1 = (UTF8String)((scala.Function1[]) references[3] / converters /)[1].apply(((scala.Function1) references[4] / udf /).apply(project_arg_1) ); / 141 / } catch (Throwable e) { / 142 / throw QueryExecutionErrors.failedExecuteUserDefinedFunctionError( / 143 / "DataFrameSuite$$Lambda$6430/2004847941", "string", "string", e); / 144 / } / 145 / / 146 / boolean project_isNull_5 = project_result_1 == null; / 147 / UTF8String project_value_5 = null; / 148 / if (!project_isNull_5) { / 149 / project_value_5 = project_result_1; / 150 / } / 151 / project_subExprIsNull_0 = project_isNull_5; / 152 / project_mutableStateArray_0[0] = project_value_5; / 153 */ } ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #32699 from viirya/improve-subexpr. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-06-01 19:13:12 -07:00
Gengliang Wang	9d0d4edb43	[SPARK-35595][TESTS] Support multiple loggers in testing method withLogAppender ### What changes were proposed in this pull request? A test case of AdaptiveQueryExecSuite becomes flaky since there are too many debug logs in RootLogger: https://github.com/Yikun/spark/runs/2715222392?check_suite_focus=true https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139125/testReport/ To fix it, I suggest supporting multiple loggers in the testing method withLogAppender. So that the LogAppender gets clean target log outputs. ### Why are the changes needed? Fix a flaky test case. Also, reduce unnecessary memory cost in tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #32725 from gengliangwang/fixFlakyLogAppender. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-02 10:05:29 +08:00
itholic	0ad5ae54b2	[SPARK-35539][PYTHON] Restore to_koalas to keep the backward compatibility ### What changes were proposed in this pull request? This PR proposes restoring `to_koalas` to keep the backward compatibility, with throwing deprecated warning. ### Why are the changes needed? If we remove `to_koalas`, the existing Koalas codes that include `to_koalas` wouldn't work. ### Does this PR introduce _any_ user-facing change? No. It's restoring the existing functionality. ### How was this patch tested? Manually tested in local. ```shell >>> sdf.to_koalas() .../spark/python/pyspark/pandas/frame.py:4550: FutureWarning: DataFrame.to_koalas is deprecated as of DataFrame.to_pandas_on_spark. Please use the API instead. warnings.warn( ``` Closes #32729 from itholic/SPARK-35539. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-02 10:39:24 +09:00
Gengliang Wang	6a277bb7c6	[SPARK-35600][TESTS] Move Set command related test cases to SetCommandSuite ### What changes were proposed in this pull request? Move `Set` command related test cases from `SQLQuerySuite` to a new test suite `SetCommandSuite`. There are 7 test cases in total. ### Why are the changes needed? Code refactoring. `SQLQuerySuite` is becoming big. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #32732 from gengliangwang/setsuite. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-02 10:36:21 +09:00
Dongjoon Hyun	35cfabcf5c	[SPARK-35589][CORE] BlockManagerMasterEndpoint should not ignore index-only shuffle file during updating ### What changes were proposed in this pull request? This PR aims to make `BlockManagerMasterEndpoint.updateBlockInfo` not to ignore index-only shuffle files. In addition, this PR fixes `IndexShuffleBlockResolver.getMigrationBlocks` to return data files first. ### Why are the changes needed? When [SPARK-20629](`a4ca355af8`) introduced a worker decommission, index-only shuffle files are not considered properly. - SPARK-33198 fixed `getMigrationBlocks` to handle index only shuffle files - SPARK-35589 (this) aims to fix `updateBlockInfo` to handle index only shuffle files. ### Does this PR introduce _any_ user-facing change? No. This is a bug fix. ### How was this patch tested? Pass the CIs with the newly added test case. Closes #32727 from dongjoon-hyun/SPARK-UPDATE-OUTPUT. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-01 14:23:24 -07:00
Xinrong Meng	0ac5c16177	[SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin ### What changes were proposed in this pull request? Support arithmetic operations against bool IndexOpsMixin. ### Why are the changes needed? Existing binary operations of bool IndexOpsMixin in Koalas do not match pandas’ behaviors. pandas take True as 1, False as 0 when dealing with numeric values, numeric collections, and numeric Series/Index; whereas Koalas raises an AnalysisException no matter what the binary operation is. We aim to match pandas' behaviors. ### Does this PR introduce _any_ user-facing change? Yes. Before the change: ```py >>> import pyspark.pandas as ps >>> psser = ps.Series([True, True, False]) >>> psser + 1 Traceback (most recent call last): ... TypeError: Addition can not be applied to booleans. >>> 1 + psser Traceback (most recent call last): ... TypeError: Addition can not be applied to booleans. >>> from pyspark.pandas.config import set_option >>> set_option("compute.ops_on_diff_frames", True) >>> psser + ps.Series([1, 2, 3]) Traceback (most recent call last): ... TypeError: Addition can not be applied to booleans. >>> ps.Series([1, 2, 3]) + psser Traceback (most recent call last): ... TypeError: addition can not be applied to given types. ``` After the change: ```py >>> import pyspark.pandas as ps >>> psser = ps.Series([True, True, False]) >>> psser + 1 0 2 1 2 2 1 dtype: int64 >>> 1 + psser 0 2 1 2 2 1 dtype: int64 >>> from pyspark.pandas.config import set_option >>> set_option("compute.ops_on_diff_frames", True) >>> psser + ps.Series([1, 2, 3]) 0 2 1 3 2 3 dtype: int64 >>> ps.Series([1, 2, 3]) + psser 0 2 1 3 2 3 dtype: int64 ``` ### How was this patch tested? Unit tests. Closes #32611 from xinrong-databricks/datatypeop_arith_bool. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-01 10:57:12 -07:00
Kent Yao	a127d91292	[SPARK-35402][WEBUI] Increase the max thread pool size of jetty server in HistoryServer UI ### What changes were proposed in this pull request? For different UIs, e.g. History Server or Spark Live UI, maybe need different capabilities to handle HTTP requests. Usually, a History Server is for multi-users and needs more threads to increase concurrency, while Live UI is per application, which needn't that large pool size. In this PR, we increase the max pool size of the History Server's jetty backend ### Why are the changes needed? increase the client concurrency of HistoryServer ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #32539 from yaooqinn/SPARK-35402. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org>	2021-06-02 01:02:41 +08:00
Kousuke Saruta	08e6f633b5	[SPARK-35577][TESTS] Allow to log container output for docker integration tests ### What changes were proposed in this pull request? This PR proposes to add a feature that logs container output for docker integration tests. With this change, if we run test with SBT, we will have like the following log in `unit-tests.log`. ``` ===== CONTAINER LOGS FOR container Id: 3360c98eb28337d8b217fb614e47bf49aafa18a6cb60ecadf3178aee0c663021 ===== 21/05/31 20:54:56.433 pool-1-thread-1 INFO PostgresIntegrationSuite: The files belonging to this database system will be owned by user "postgres". This user must also own the server process. The database cluster will be initialized with locale "en_US.utf8". The default database encoding has accordingly been set to "UTF8". The default text search configuration will be set to "english". Data page checksums are disabled. fixing permissions on existing directory /var/lib/postgresql/data ... ok creating subdirectories ... ok selecting dynamic shared memory implementation ... posix selecting default max_connections ... 100 selecting default shared_buffers ... 128MB selecting default time zone ... UTC creating configuration files ... ok running bootstrap script ... ok sh: locale: not found 2021-05-31 11:54:49.892 UTC [29] WARNING: no usable system locales were found performing post-bootstrap initialization ... ok initdb: warning: enabling "trust" authentication for local connections You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb. syncing data to disk ... ok Success. You can now start the database server using: pg_ctl -D /var/lib/postgresql/data -l logfile start waiting for server to start....2021-05-31 11:54:50.284 UTC [34] LOG: starting PostgreSQL 13.0 on x86_64-pc-linux-musl, compiled by gcc (Alpine 9.3.0) 9.3.0, 64-bit 2021-05-31 11:54:50.287 UTC [34] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 2021-05-31 11:54:50.296 UTC [35] LOG: database system was shut down at 2021-05-31 11:54:50 UTC 2021-05-31 11:54:50.301 UTC [34] LOG: database system is ready to accept connections done server started /usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/* waiting for server to shut down....2021-05-31 11:54:50.363 UTC [34] LOG: received fast shutdown request 2021-05-31 11:54:50.366 UTC [34] LOG: aborting any active transactions 2021-05-31 11:54:50.368 UTC [34] LOG: background worker "logical replication launcher" (PID 41) exited with exit code 1 2021-05-31 11:54:50.368 UTC [36] LOG: shutting down 2021-05-31 11:54:50.402 UTC [34] LOG: database system is shut down done server stopped PostgreSQL init process complete; ready for start up. 2021-05-31 11:54:50.510 UTC [1] LOG: starting PostgreSQL 13.0 on x86_64-pc-linux-musl, compiled by gcc (Alpine 9.3.0) 9.3.0, 64-bit 2021-05-31 11:54:50.510 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432 2021-05-31 11:54:50.510 UTC [1] LOG: listening on IPv6 address "::", port 5432 2021-05-31 11:54:50.517 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 2021-05-31 11:54:50.526 UTC [43] LOG: database system was shut down at 2021-05-31 11:54:50 UTC 2021-05-31 11:54:50.531 UTC [1] LOG: database system is ready to accept connections 2021-05-31 11:54:54.226 UTC [54] ERROR: relation "public.barcopy" does not exist at character 15 2021-05-31 11:54:54.226 UTC [54] STATEMENT: SELECT 1 FROM public.barcopy LIMIT 1 2021-05-31 11:54:54.610 UTC [59] ERROR: relation "public.barcopy2" does not exist at character 15 2021-05-31 11:54:54.610 UTC [59] STATEMENT: SELECT 1 FROM public.barcopy2 LIMIT 1 2021-05-31 11:54:54.934 UTC [63] ERROR: relation "shortfloat" does not exist at character 15 2021-05-31 11:54:54.934 UTC [63] STATEMENT: SELECT 1 FROM shortfloat LIMIT 1 2021-05-31 11:54:55.675 UTC [75] ERROR: relation "byte_to_smallint_test" does not exist at character 15 2021-05-31 11:54:55.675 UTC [75] STATEMENT: SELECT 1 FROM byte_to_smallint_test LIMIT 1 21/05/31 20:54:56.434 pool-1-thread-1 INFO PostgresIntegrationSuite: ===== END OF CONTAINER LOGS FOR container Id: 3360c98eb28337d8b217fb614e47bf49aafa18a6cb60ecadf3178aee0c663021 ===== ``` ### Why are the changes needed? If we have container logs, it's useful to debug especially for GA. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I run docker integration tests and got logs. The example shown above is for `PostgresIntegrationSuite`. Closes #32715 from sarutak/log-docker-container. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-06-01 22:44:48 +09:00
Max Gekk	a59063d544	[SPARK-35581][SQL] Support special datetime values in typed literals only ### What changes were proposed in this pull request? In the PR, I propose to support special datetime values introduced by #25708 and by #25716 only in typed literals, and don't recognize them in parsing strings to dates/timestamps. The following string values are supported only in typed timestamp literals: - `epoch [zoneId]` - `1970-01-01 00:00:00+00 (Unix system time zero)` - `today [zoneId]` - midnight today. - `yesterday [zoneId]` - midnight yesterday - `tomorrow [zoneId]` - midnight tomorrow - `now` - current query start time. For example: ```sql spark-sql> SELECT timestamp 'tomorrow'; 2019-09-07 00:00:00 ``` Similarly, the following special date values are supported only in typed date literals: - `epoch [zoneId]` - `1970-01-01` - `today [zoneId]` - the current date in the time zone specified by `spark.sql.session.timeZone`. - `yesterday [zoneId]` - the current date -1 - `tomorrow [zoneId]` - the current date + 1 - `now` - the date of running the current query. It has the same notion as `today`. For example: ```sql spark-sql> SELECT date 'tomorrow' - date 'yesterday'; 2 ``` ### Why are the changes needed? In the current implementation, Spark supports the special date/timestamp value in any input strings casted to dates/timestamps that leads to the following problems: - If executors have different system time, the result is inconsistent, and random. Column values depend on where the conversions were performed. - The special values play the role of distributed non-deterministic functions though users might think of the values as constants. ### Does this PR introduce _any_ user-facing change? Yes but the probability should be small. ### How was this patch tested? By running existing test suites: ``` $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z interval.sql" $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z date.sql" $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z timestamp.sql" $ build/sbt "test:testOnly *DateTimeUtilsSuite" ``` Closes #32714 from MaxGekk/remove-datetime-special-values. Lead-authored-by: Max Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-06-01 15:29:05 +03:00
lidiyag	b7dd4b37e5	[SPARK-35516][WEBUI] Storage UI tab Storage Level tool tip correction ### What changes were proposed in this pull request? Fixed tooltip for "Storage" tab in UI ### Why are the changes needed? Tooltip correction was needed ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Manually tested Closes #32664 from lidiyag/storagewebui. Authored-by: lidiyag <lidiya.nixon@huawei.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2021-06-01 19:00:13 +09:00
Yikun Jiang	d773373074	[SPARK-35584][CORE][TESTS] Increase the timeout in FallbackStorageSuite ### What changes were proposed in this pull request? ``` - Upload multi stages * FAILED * {{ The code passed to eventually never returned normally. Attempted 20 times over 10.011176743 seconds. Last failure message: fallbackStorage.exists(0, file) was false. (FallbackStorageSuite.scala:243)}} ``` The error like above was raised in aarch64 randomly and also in github action test[1][2]. [1] https://github.com/apache/spark/actions/runs/489319612 [2]https://github.com/apache/spark/actions/runs/479317320 ### Why are the changes needed? timeout is too short, need to increase to let test case complete. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? build/mvn test -Dtest=none -DwildcardSuites=org.apache.spark.storage.FallbackStorageSuite -pl :spark-core_2.12 Closes #32719 from Yikun/SPARK-35584. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-01 00:45:58 -07:00
Kousuke	e04883880f	[SPARK-35586][K8S][TESTS] Set a default value for spark.kubernetes.test.sparkTgz in pom.xml for Kubernetes integration tests ### What changes were proposed in this pull request? This PR set a default value for `spark.kubernetes.test.sparkTgz` in `kubernetes/integration-tests/pom.xml` for Kubernetes integration tests. ### Why are the changes needed? In the current master, running the integration tests with the following command will fail because there is no default value set for the property. ``` build/mvn -Dspark.kubernetes.test.namespace=default -Pkubernetes -Pkubernetes-integration-tests -Psparkr -pl resource-managers/kubernetes/integration-tests integration-test ``` ``` + mkdir -p /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked + tar -xzvf --test-exclude-tags --strip-components=1 -C /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked tar (child): --test-exclude-tags: Cannot open: No such file or directory tar (child): Error is not recoverable: exiting now tar: Child returned status 2 tar: Error is not recoverable: exiting now [ERROR] Command execution failed. ``` According to `setup-integration-test-env.sh`, `N/A` is intended as the default value so this PR choose it. ``` SPARK_TGZ="N/A" MVN="$TEST_ROOT_DIR/build/mvn" EXCLUDE_TAGS="" ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Build and tests successfully finish with the command shown above. Closes #32722 from sarutak/fix-pom-for-kube-integ. Authored-by: Kousuke <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-01 00:40:02 -07:00
itholic	fe09def323	[SPARK-35582][PYTHON][DOCS] Remove # noqa in Python API documents ### What changes were proposed in this pull request? This PR aims to move `# noqa` in the Python docstring to the proper place so that hide them from the official documents. ### Why are the changes needed? If we don't move `# noqa` to the proper place, it is exposed in the middle of the docstring, and it looks a bit wired as below: <img width="613" alt="Screen Shot 2021-06-01 at 3 17 52 PM" src="https://user-images.githubusercontent.com/44108233/120275617-91da3800-c2ec-11eb-9778-16c5fe789418.png"> ### Does this PR introduce _any_ user-facing change? Yes, the `# noqa` is no more shown in the documents as below: <img width="609" alt="Screen Shot 2021-06-01 at 3 21 00 PM" src="https://user-images.githubusercontent.com/44108233/120275927-fbf2dd00-c2ec-11eb-950d-346af2745711.png"> ### How was this patch tested? Manually build docs and check. Closes #32728 from itholic/SPARK-35582. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-01 15:24:04 +09:00
Yingyi Bu	1dd0ca23f6	[SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules ### What changes were proposed in this pull request? Added the following TreePattern enums: - AGGREGATE_EXPRESSION - ALIAS - GROUPING_ANALYTICS - GENERATOR - HIGH_ORDER_FUNCTION - LAMBDA_FUNCTION - NEW_INSTANCE - PIVOT - PYTHON_UDF - TIME_WINDOW - TIME_ZONE_AWARE_EXPRESSION - UP_CAST - COMMAND - EVENT_TIME_WATERMARK - UNRESOLVED_RELATION - WITH_WINDOW_DEFINITION - UNRESOLVED_ALIAS - UNRESOLVED_ATTRIBUTE - UNRESOLVED_DESERIALIZER - UNRESOLVED_ORDINAL - UNRESOLVED_FUNCTION - UNRESOLVED_HINT - UNRESOLVED_SUBQUERY_COLUMN_ALIAS - UNRESOLVED_FUNC Added tree pattern pruning to the following Analyzer rules: - ResolveBinaryArithmetic - WindowsSubstitution - ResolveAliases - ResolveGroupingAnalytics - ResolvePivot - ResolveOrdinalInOrderByAndGroupBy - LookupFunction - ResolveSubquery - ResolveSubqueryColumnAliases - ApplyCharTypePadding - UpdateOuterReferences - ResolveCreateNamedStruct - TimeWindowing - CleanupAliases - EliminateUnions - EliminateSubqueryAliases - HandleAnalysisOnlyCommand - ResolveNewInstances - ResolveUpCast - ResolveDeserializer - ResolveOutputRelation - ResolveEncodersInUDF - HandleNullInputsForUDF - ResolveGenerate - ExtractGenerator - GlobalAggregates - ResolveAggregateFunctions ### Why are the changes needed? Reduce the number of tree traversals and hence improve the query compilation latency. ### How was this patch tested? Existing tests. Performance diff: <google-sheets-html-origin><style type="text/css"></style>   \| Baseline \| Experiment \| Experiment/Baseline -- \| -- \| -- \| -- ResolveBinaryArithmetic \| 43264874 \| 34707117 \| 0.80 WindowsSubstitution \| 3322996 \| 2734192 \| 0.82 ResolveAliases \| 24859263 \| 21359941 \| 0.86 ResolveGroupingAnalytics \| 39249143 \| 25417569 \| 0.80 ResolvePivot \| 6393408 \| 2843314 \| 0.44 ResolveOrdinalInOrderByAndGroupBy \| 10750806 \| 3386715 \| 0.32 LookupFunction \| 22087384 \| 15481294 \| 0.70 ResolveSubquery \| 1129139340 \| 944402323 \| 0.84 ResolveSubqueryColumnAliases \| 5055038 \| 2808210 \| 0.56 ApplyCharTypePadding \| 76285576 \| 63785681 \| 0.84 UpdateOuterReferences \| 6548321 \| 3092539 \| 0.47 ResolveCreateNamedStruct \| 38111477 \| 17350249 \| 0.46 TimeWindowing \| 41694190 \| 3739134 \| 0.09 CleanupAliases \| 48683506 \| 39584921 \| 0.81 EliminateUnions \| 3405069 \| 2372506 \| 0.70 EliminateSubqueryAliases \| 9626649 \| 9518216 \| 0.99 HandleAnalysisOnlyCommand \| 2562123 \| 2661432 \| 1.04 ResolveNewInstances \| 16208966 \| 1982314 \| 0.12 ResolveUpCast \| 14067843 \| 1868615 \| 0.13 ResolveDeserializer \| 17991103 \| 2320308 \| 0.13 ResolveOutputRelation \| 5815277 \| 2088787 \| 0.36 ResolveEncodersInUDF \| 14182892 \| 1045113 \| 0.07 HandleNullInputsForUDF \| 19850838 \| 1329528 \| 0.07 ResolveGenerate \| 5587345 \| 1953192 \| 0.35 ExtractGenerator \| 120378046 \| 3386286 \| 0.03 GlobalAggregates \| 16510455 \| 13553155 \| 0.82 ResolveAggregateFunctions \| 1041848509 \| 828049280 \| 0.79 </google-sheets-html-origin> Closes #32686 from sigmod/analyzer. Authored-by: Yingyi Bu <yingyi.bu@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>	2021-06-01 11:39:42 +08:00
itholic	73d4f67145	[SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes move CSV data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for CSV data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown below after this change: - "CSV Files" page <img width="970" alt="Screen Shot 2021-05-27 at 12 35 36 PM" src="https://user-images.githubusercontent.com/44108233/119762269-586a8c80-bee8-11eb-8443-ae5b3c7a685c.png"> - Python <img width="785" alt="Screen Shot 2021-05-25 at 4 12 10 PM" src="https://user-images.githubusercontent.com/44108233/119455390-83cc6a80-bd74-11eb-9156-65785ae27db0.png"> - Scala <img width="718" alt="Screen Shot 2021-05-25 at 4 12 39 PM" src="https://user-images.githubusercontent.com/44108233/119455414-89c24b80-bd74-11eb-9775-aeda549d081e.png"> - Java <img width="667" alt="Screen Shot 2021-05-25 at 4 13 09 PM" src="https://user-images.githubusercontent.com/44108233/119455422-8d55d280-bd74-11eb-97e8-86c1eabeadc2.png"> ### How was this patch tested? Manually build docs and confirm the page. Closes #32658 from itholic/SPARK-35433. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-01 10:58:49 +09:00
Wenchen Fan	bb2a0747d2	[SPARK-35578][SQL][TEST] Add a test case for a bug in janino ### What changes were proposed in this pull request? This PR adds a unit test to show a bug in the latest janino version which fails to compile valid Java code. Unfortunately, I can't share the exact query that can trigger this bug (includes some custom expressions), but this pattern is not very uncommon and I believe can be triggered by some real queries. A follow-up is needed before the 3.2 release, to either fix this bug in janino, or revert the janino version upgrade, or work around it in Spark. ### Why are the changes needed? make it easy for people to debug janino, as I'm not a janino expert. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #32716 from cloud-fan/janino. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-01 10:51:05 +09:00
Hyukjin Kwon	1ba1b70cfe	[SPARK-35573][R][TESTS] Make SparkR tests pass with R 4.1+ ### What changes were proposed in this pull request? This PR proposes to support R 4.1.0+ in SparkR. Currently the tests are being failed as below: ``` ══ Failed ══════════════════════════════════════════════════════════════════════ ── 1. Failure (test_sparkSQL_arrow.R:71:3): createDataFrame/collect Arrow optimi collect(createDataFrame(rdf)) not equal to `expected`. Component “g”: 'tzone' attributes are inconsistent ('UTC' and '') ── 2. Failure (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type collect(ret) not equal to `rdf`. Component “b”: 'tzone' attributes are inconsistent ('UTC' and '') ── 3. Failure (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type collect(ret) not equal to `rdf`. Component “b”: 'tzone' attributes are inconsistent ('UTC' and '') ── 4. Error (test_sparkSQL.R:1454:3): column functions ───────────────────────── Error: (converted from warning) cannot xtfrm data frames Backtrace: 1. base::sort(collect(distinct(select(df, input_file_name())))) test_sparkSQL.R:1454:2 2. base::sort.default(collect(distinct(select(df, input_file_name())))) 5. base::order(x, na.last = na.last, decreasing = decreasing) 6. base::lapply(z, function(x) if (is.object(x)) as.vector(xtfrm(x)) else x) 7. base:::FUN(X[[i]], ...) 10. base::xtfrm.data.frame(x) ── 5. Failure (test_utils.R:67:3): cleanClosure on R functions ───────────────── `actual` not equal to `g`. names for current but not for target Length mismatch: comparison on first 0 components ── 6. Failure (test_utils.R:80:3): cleanClosure on R functions ───────────────── `actual` not equal to `g`. names for current but not for target Length mismatch: comparison on first 0 components ``` It fixes three as below: - Avoid a sort on DataFrame which isn't legitimate: https://github.com/apache/spark/pull/32709#discussion_r642458108 - Treat the empty timezone and local timezone as equivalent in SparkR: https://github.com/apache/spark/pull/32709#discussion_r642464454 - Disable `check.environment` in the cleaned closure comparison (enabled by default from R 4.1+, https://cran.r-project.org/doc/manuals/r-release/NEWS.html), and keep the test as is https://github.com/apache/spark/pull/32709#discussion_r642510089 ### Why are the changes needed? Higher R versions have bug fixes and improvements. More importantly R users tend to use highest R versions. ### Does this PR introduce _any_ user-facing change? Yes, SparkR will work together with R 4.1.0+ ### How was this patch tested? ```bash ./R/run-tests.sh ``` ``` sparkSQL_arrow: SparkSQL Arrow optimization: ................. ... sparkSQL: SparkSQL functions: ........................................................................................................................................................................................................ ........................................................................................................................................................................................................ ........................................................................................................................................................................................................ ........................................................................................................................................................................................................ ........................................................................................................................................................................................................ ........................................................................................................................................................................................................ ... utils: functions in utils.R: .............................................. ``` Closes #32709 from HyukjinKwon/SPARK-35573. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-01 10:35:52 +09:00
itholic	7e2717333b	[SPARK-35453][PYTHON] Move Koalas accessor to pandas_on_spark accessor ### What changes were proposed in this pull request? This PR proposes renaming the existing "Koalas Accessor" to "Pandas API on Spark Accessor". ### Why are the changes needed? Because we don't use name "Koalas" anymore, rather use "Pandas API on Spark". So, the related code bases are all need to be changed. ### Does this PR introduce _any_ user-facing change? Yes, the usage of pandas API on Spark accessor is changed from `df.koalas.[...]`. to `df.pandas_on_spark.[...]`. Note: `df.koalas.[...]` is still available but with deprecated warnings. ### How was this patch tested? Manually tested in local and checked one by one. Closes #32674 from itholic/SPARK-35453. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-01 10:33:10 +09:00
Gengliang Wang	8e11f5f007	[SPARK-35576][SQL] Redact the sensitive info in the result of Set command ### What changes were proposed in this pull request? Currently, the results of following SQL queries are not redacted: ``` SET [KEY]; SET; ``` For example: ``` scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show() +--------------------+------+ \| key\| value\| +--------------------+------+ \|javax.jdo.option....\|123456\| +--------------------+------+ scala> spark.sql("set javax.jdo.option.ConnectionPassword").show() +--------------------+------+ \| key\| value\| +--------------------+------+ \|javax.jdo.option....\|123456\| +--------------------+------+ scala> spark.sql("set").show() +--------------------+--------------------+ \| key\| value\| +--------------------+--------------------+ \|javax.jdo.option....\| 123456\| ``` We should hide the sensitive information and redact the query output. ### Why are the changes needed? Security. ### Does this PR introduce _any_ user-facing change? Yes, the sensitive information in the output of Set commands are redacted ### How was this patch tested? Unit test Closes #32712 from gengliangwang/redactSet. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-31 14:50:18 -07:00
shahid	cd2ef9cb43	[SPARK-35567][SQL] Fix: Explain cost is not showing statistics for all the nodes ### What changes were proposed in this pull request? Explain cost command in spark currently doesn't show statistics for all the nodes. It misses some nodes in almost all the TPCDS queries. In this PR, we are collecting all the plan nodes including the subqueries and computing the statistics for each node, if it doesn't exists in stats cache, ### Why are the changes needed? Before Fix For eg: Query1, Project node doesn't have statistics ![image](https://user-images.githubusercontent.com/23054875/120123442-868feb00-c1cc-11eb-9af9-3a87bf2117d2.png) Query15, Aggregate node doesn't have statistics ![image](https://user-images.githubusercontent.com/23054875/120123296-a4108500-c1cb-11eb-89df-7fddd651572e.png) After Fix: Query1: ![image](https://user-images.githubusercontent.com/23054875/120123559-1df53e00-c1cd-11eb-938a-53704f5240e6.png) Query 15: ![image](https://user-images.githubusercontent.com/23054875/120123665-bb507200-c1cd-11eb-8ea2-84c732215bac.png) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual testing Closes #32704 from shahidki31/shahid/fixshowstats. Authored-by: shahid <shahidki31@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-06-01 00:55:29 +08:00
Tengfei Huang	1603775934	[SPARK-35411][SQL][FOLLOWUP] Handle Currying Product while serializing TreeNode to JSON ### What changes were proposed in this pull request? Handle Currying Product while serializing TreeNode to JSON. While processing [Product](https://github.com/apache/spark/blob/v3.1.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L820), we may get an assert error for cases like Currying Product because of the mismatch of sizes between field name and field values. Fallback to use reflection to get all the values for constructor parameters when we meet such cases. ### Why are the changes needed? Avoid throwing error while serializing TreeNode to JSON, try to output as much information as possible. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT case added. Closes #32713 from ivoson/SPARK-35411-followup. Authored-by: Tengfei Huang <tengfei.h@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-31 22:15:26 +08:00
Hyukjin Kwon	14e12c64d3	[SPARK-35575][INFRA] Recover updating build status in GitHub Actions ### What changes were proposed in this pull request? This PR fixes the logic to be fault tolerant when it gets the status of the workflow run from PR author's forked repository. Looks like https://github.com/apache/spark/pull/32483 removed and disabled (see also https://github.com/apache/spark/pull/32486/checks?check_run_id=2648696751) the GitHub actions workflow runs in the forked repositories, and the detection logic in the main repo fails because the runs don't exist anymore. See also https://github.com/apache/spark/runs/2709537998?check_suite_focus=true ### Why are the changes needed? To recover the status update of GitHub Actions in PRs. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? It cannot be tested without being merged. Closes #32711 from HyukjinKwon/SPARK-35575. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-31 19:29:54 +09:00
Yuming Wang	6cd6c438f2	[SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side ### What changes were proposed in this pull request? This pr add new rule to removes outer join if it only has distinct on streamed side. For example: ```scala spark.range(200L).selectExpr("id AS a").createTempView("t1") spark.range(300L).selectExpr("id AS b").createTempView("t2") spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true) ``` Before this pr: ``` == Optimized Logical Plan == Aggregate [a#2L], [a#2L] +- Project [a#2L] +- Join LeftOuter, (a#2L = b#6L) :- Project [id#0L AS a#2L] : +- Range (0, 200, step=1, splits=Some(2)) +- Project [id#4L AS b#6L] +- Range (0, 300, step=1, splits=Some(2)) ``` After this pr: ``` == Optimized Logical Plan == Aggregate [a#2L], [a#2L] +- Project [id#0L AS a#2L] +- Range (0, 200, step=1, splits=Some(2)) ``` ### Why are the changes needed? Improve query performance. [DB2](https://www.ibm.com/docs/en/db2-for-zos/11?topic=manipulation-how-db2-simplifies-join-operations) support this feature: ![image](https://user-images.githubusercontent.com/5399861/119594277-0d7c4680-be0e-11eb-8bd4-366d8c4639f0.png) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #31908 from wangyum/SPARK-34808. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2021-05-31 18:14:15 +08:00
Shiqi Sun	8c69e9cd94	[SPARK-35562][DOC] Fix docs about Kubernetes and Yarn Fixed some places in cluster-overview that are obsolete (i.e. not mentioning Kubernetes), and also fixed the Yarn spark-submit sample command in submitting-applications. ### What changes were proposed in this pull request? This is to fix the docs in "Cluster Overview" and "Submitting Applications" for places where Kubernetes is missed (mostly due to obsolete docs that haven't got updated) and where Yarn sample spark-submit command is incorrectly written. ### Why are the changes needed? To help the Spark users who uses Kubernetes as cluster manager to have a correct idea when reading the "Cluster Overview" doc page. Also to make the sample spark-submit command for Yarn actually runnable in the "Submitting Applications" doc page, by removing the invalid comment after line continuation char `\`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No test, as this is doc fix. Closes #32701 from huskysun/doc-fix. Authored-by: Shiqi Sun <s.sun@salesforce.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-31 02:43:58 -07:00
Liang-Chi Hsieh	73ba4492b1	[SPARK-35566][SS] Fix StateStoreRestoreExec output rows ### What changes were proposed in this pull request? This is a minor change to update how `StateStoreRestoreExec` computes its number of output rows. Previously we only count input rows, but the optionally restored rows are not counted in. ### Why are the changes needed? Currently the number of output rows of `StateStoreRestoreExec` only counts the each input row. But it actually outputs input rows + optional restored rows. We should provide correct number of output rows. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #32703 from viirya/fix-outputrows. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-31 16:45:56 +09:00
Dongjoon Hyun	c225196be0	[SPARK-35507][INFRA] Add Python 3.9 in the docker image for GitHub Action ### What changes were proposed in this pull request? This PR aims to add `Python 3.9.5` and updates the docker image references except SparkR job. ### Why are the changes needed? To save GitHub Action resource and be more robust on the the Python and R library changes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the GitHub Action. Closes #32706 from dongjoon-hyun/SPARK-35507. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-31 05:56:47 +00:00
allisonwang-db	806da9d6fa	[SPARK-35545][SQL] Split SubqueryExpression's children field into outer attributes and join conditions ### What changes were proposed in this pull request? This PR refactors `SubqueryExpression` class. It removes the children field from SubqueryExpression's constructor and adds `outerAttrs` and `joinCond`. ### Why are the changes needed? Currently, the children field of a subquery expression is used to store both collected outer references in the subquery plan and join conditions after correlated predicates are pulled up. For example: `SELECT (SELECT max(c1) FROM t1 WHERE t1.c1 = t2.c1) FROM t2` During the analysis phase, outer references in the subquery are stored in the children field: `scalar-subquery [t2.c1]`, but after the optimizer rule `PullupCorrelatedPredicates`, the children field will be used to store the join conditions, which contain both the inner and the outer references: `scalar-subquery [t1.c1 = t2.c1]`. This is why the references of SubqueryExpression excludes the inner plan's output: `29ed1a2de4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala (L68-L69)` This can be confusing and error-prone. The references for a subquery expression should always be defined as outer attribute references. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32687 from allisonwang-db/refactor-subquery-expr. Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-05-31 04:57:24 +00:00
Dongjoon Hyun	1a55019b1f	[SPARK-31168][BUILD][FOLLOWUP] Update scala-2.12 profile ### What changes were proposed in this pull request? This PR is a follow-up of https://github.com/apache/spark/pull/32697 to update the missed part. After SPARK-34774, we have Scala 2.12 version in `scala-2.12` profile. ### Why are the changes needed? To be consistent. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs and manual. BEFORE ``` $ build/mvn help:evaluate -Pscala-2.12 -Dexpression=scala.version \| grep "^2.12" Using `mvn` from path: /usr/local/bin/mvn 2.12.10 ``` AFTER ``` $ build/mvn help:evaluate -Pscala-2.12 -Dexpression=scala.version \| grep "^2.12" Using `mvn` from path: /usr/local/bin/mvn 2.12.14 ``` Closes #32707 from dongjoon-hyun/SPARK-31168-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-30 21:27:24 -07:00

... 3 4 5 6 7 ...

30440 commits