ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Kousuke Saruta	f7ba95264d	[SPARK-33048][BUILD] Fix SparkBuild.scala to recognize build settings for Scala 2.13 ### What changes were proposed in this pull request? This PR fixes `SparkBuild.scala` to recognize build settings for Scala 2.13. In `SparkBuild.scala`, a variable `scalaBinaryVersion` is hardcoded as `2.12`. So, an environment variable `SPARK_SCALA_VERSION` is also to be `2.12`. This issue causes some test suites (e.g. `SparkSubmitSuite`) to be error. ``` ===== TEST OUTPUT FOR o.a.s.deploy.SparkSubmitSuite: 'user classpath first in driver' ===== 20/10/02 08:55:30.234 redirect stderr for command /home/kou/work/oss/spark-scala-2.13/bin/spark-submit INFO Utils: Error: Could not find or load m ain class org.apache.spark.launcher.Main 20/10/02 08:55:30.235 redirect stderr for command /home/kou/work/oss/spark-scala-2.13/bin/spark-submit INFO Utils: /home/kou/work/oss/spark-scala- 2.13/bin/spark-class: line 96: CMD: bad array subscript ``` The reason of this error is that environment variables `SPARK_JARS_DIR` and `LAUNCH_CLASSPATH` is defined in `bin/spark-class` as follows. ``` SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars" LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH" ``` ### Why are the changes needed? To build for Scala 2.13 successfully. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests for `core` module finish successfully. ``` build/sbt -Pscala-2.13 clean "core/test" ``` Closes #29927 from sarutak/fix-sparkbuild-for-scala-2.13. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-02 15:17:44 +09:00
HyukjinKwon	b205be5ff6	[SPARK-33051][INFRA][R] Uses setup-r to install R in GitHub Actions build ### What changes were proposed in this pull request? At SPARK-32493, the R installation was switched to manual installation because setup-r was broken. This seems fixed in the upstream so we should better switch it back. ### Why are the changes needed? To avoid maintaining the installation steps by ourselve. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? GitHub Actions build in this PR should test it. Closes #29931 from HyukjinKwon/recover-r-build. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-02 15:12:33 +09:00
Yuming Wang	9996e252ad	[SPARK-33026][SQL] Add numRows to metric of BroadcastExchangeExec ### What changes were proposed in this pull request? This pr adds `numRows` to the metric and runtimeStatistics of `BroadcastExchangeExec`. ### Why are the changes needed? [`JoinEstimation.estimateInnerOuterJoin`](`d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)`) need row count. The [ShuffleExchangeExec](`1c6dff7b5f/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala (L127)`) have added the row count, but `BroadcastExchangeExec` missing the row count. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #29904 from wangyum/SPARK-33026. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-01 23:01:31 -07:00
Gabor Somogyi	991f7e81d4	[SPARK-32001][SQL] Create JDBC authentication provider developer API ### What changes were proposed in this pull request? At the moment only the baked in JDBC connection providers can be used but there is a need to support additional databases and use-cases. In this PR I'm proposing a new developer API name `JdbcConnectionProvider`. To show how an external JDBC connection provider can be implemented I've created an example [here](https://github.com/gaborgsomogyi/spark-jdbc-connection-provider). The PR contains the following changes: * Added connection provider developer API * Made JDBC connection providers constructor to noarg => needed to load them w/ service loader * Connection providers are now loaded w/ service loader * Added tests to load providers independently * Moved `SecurityConfigurationLock` into a central place because other areas will change global JVM security config ### Why are the changes needed? No custom authentication possibility. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? * Existing + additional unit tests * Docker integration tests * Tested manually the newly created external JDBC connection provider Closes #29024 from gaborgsomogyi/SPARK-32001. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-02 13:04:40 +09:00
Cheng Su	d6f3138352	[SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically ### What changes were proposed in this pull request? This PR is to add support to decide bucketed table scan dynamically based on actual query plan. Currently bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true), so for all bucketed tables in the query plan, we will use bucket table scan (all input files per the bucket will be read by same task). This has the drawback that if the bucket table scan is not benefitting at all (no join/groupby/etc in the query), we don't need to use bucket table scan as it would restrict the # of tasks to be # of buckets and might hurt parallelism. The feature is to add a physical plan rule right after `EnsureRequirements`: The rule goes through plan nodes. For all operators which has "interesting partition" (i.e., require `ClusteredDistribution` or `HashClusteredDistribution`), check if the sub-plan for operator has `Exchange` and bucketed table scan (and only allow certain operators in plan (i.e. `Scan/Filter/Project/Sort/PartialAgg/etc`.), see details in `DisableUnnecessaryBucketedScan.disableBucketWithInterestingPartition`). If yes, disable the bucketed table scan in the sub-plan. In addition, disabling bucketed table scan if there's operator with interesting partition along the sub-plan. Why the algorithm works is that if there's a shuffle between the bucketed table scan and operator with interesting partition, then bucketed table scan partitioning will be destroyed by the shuffle operator in the middle, and we don't need bucketed table scan for sure. The idea of "interesting partition" is inspired from "interesting order" in "Access Path Selection in a Relational Database Management System"(http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf), after discussion with cloud-fan . ### Why are the changes needed? To avoid unnecessary bucketed scan in the query, and this is prerequisite for https://github.com/apache/spark/pull/29625 (decide bucketed sorted scan dynamically will be added later in that PR). ### Does this PR introduce _any_ user-facing change? A new config `spark.sql.sources.bucketing.autoBucketedScan.enabled` is introduced which set to false by default (the rule is disabled by default as it can regress cached bucketed table query, see discussion in https://github.com/apache/spark/pull/29804#issuecomment-701151447). User can opt-in/opt-out by enabling/disabling the config, as we found in prod, some users rely on assumption of # of tasks == # of buckets when reading bucket table to precisely control # of tasks. This is a bad assumption but it does happen on our side, so leave a config here to allow them opt-out for the feature. ### How was this patch tested? Added unit tests in `DisableUnnecessaryBucketedScanSuite.scala` Closes #29804 from c21/bucket-rule. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-02 09:01:15 +09:00
Shruti Gumma	8657742ec7	[SPARK-32996][WEB-UI][FOLLOWUP] Move ExecutorSummarySuite to proper path ### What changes were proposed in this pull request? This change updates the test file location in #29872 to proper path. ### Why are the changes needed? ExecutorSummarySuite.scala should be in core/src/test/scala instead of core/src/test/java. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #29926 from shrutig/SPARK-32996. Authored-by: Shruti Gumma <shruti_gumma@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2020-10-01 16:33:19 -07:00
Kousuke Saruta	005999721f	[SPARK-33046][DOCS] Update how to build doc for Scala 2.13 with sbt ### What changes were proposed in this pull request? This PR fixes the description how to build Spark for Scala 2.13 with sbt. In the current doc, how to build Spark for Scala 2.13 with sbt is described like: ![scala-2 13-build-before](https://user-images.githubusercontent.com/4736016/94816248-80c3e900-0436-11eb-9bc2-99af5786971a.png) But build fails with this command because scala-2.13 profile is not enabled and scala-parallel-collections is absent. ``` [error] /home/kou/work/oss/spark-scala-2.13/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala:23: object parallel is not a member of package collection ``` The correct command should be: ``` build/sbt -Pspark-2.13 compile ``` ### Why are the changes needed? The build command is wrong. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I checked that `sbt -Pspark-2.13` is correct with the following command: ``` build/sbt -Dscala.version=2.13.3 -Phive -Phive-thriftserver -Pyarn -Pkubernetes compile ``` I also build the modified doc and checked the generated html: ![spark-scala-2 13-build-doc-after](https://user-images.githubusercontent.com/4736016/94869259-f2745500-047f-11eb-89e5-20816f3ed24d.png) Closes #29921 from sarutak/fix-scala-2.13-build-doc. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-10-01 18:01:23 -05:00
ulysses	e62d24717e	[SPARK-32585][SQL] Support scala enumeration in ScalaReflection ### What changes were proposed in this pull request? Add code in `ScalaReflection` to support scala enumeration and make enumeration type as string type in Spark. ### Why are the changes needed? We support java enum but failed with scala enum, it's better to keep the same behavior. Here is a example. ``` package test object TestEnum extends Enumeration { type TestEnum = Value val E1, E2, E3 = Value } import TestEnum._ case class TestClass(i: Int, e: TestEnum) { } import test._ Seq(TestClass(1, TestEnum.E1)).toDS ``` Before this PR ``` Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for test.TestEnum.TestEnum - field (class: "scala.Enumeration.Value", name: "e") - root class: "test.TestClass" at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:567) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:882) at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:881) ``` After this PR `org.apache.spark.sql.Dataset[test.TestClass] = [i: int, e: string]` ### Does this PR introduce _any_ user-facing change? Yes, user can make case class which include scala enumeration field as dataset. ### How was this patch tested? Add test. Closes #29403 from ulysses-you/SPARK-32585. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2020-10-01 15:58:01 -04:00
Dongjoon Hyun	9c618b3308	[SPARK-33047][BUILD] Upgrade hive-storage-api to 2.7.2 ### What changes were proposed in this pull request? This PR aims to upgrade Apache Hive `hive-storage-api` library from 2.7.1 to 2.7.2. ### Why are the changes needed? [storage-api 2.7.2](https://github.com/apache/hive/commits/rel/storage-release-2.7.2/storage-api) has the following extension and can be used when users uses a provided orc dependency. [HIVE-22959](`dade9919d9 (diff-ccfc9dd7584117f531322cda3a29f3c3)`) : Extend storage-api to expose FilterContext [HIVE-23215](`361925d2f3 (diff-ccfc9dd7584117f531322cda3a29f3c3)`) : Make FilterContext and MutableFilterContext interfaces ### Does this PR introduce _any_ user-facing change? Yes. This is a dependency change. ### How was this patch tested? Pass the existing tests. Closes #29923 from dongjoon-hyun/SPARK-33047. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-01 12:41:40 -07:00
yangjie01	0963fcd848	[SPARK-33024][SQL] Fix CodeGen fallback issue of UDFSuite in Scala 2.13 ### What changes were proposed in this pull request? After `SPARK-32851` set `CODEGEN_FACTORY_MODE` to `CODEGEN_ONLY` of `sparkConf` in `SharedSparkSessionBase` to construction `SparkSession` in test, the test suite `SPARK-32459: UDF should not fail on WrappedArray` in s.sql.UDFSuite exposed a codegen fallback issue in Scala 2.13 as follow: ``` - SPARK-32459: UDF should not fail on WrappedArray * FAILED * Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 99: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 99: No applicable constructor/method found for zero actual parameters; candidates are: "public scala.collection.mutable.Builder scala.collection.mutable.ArraySeq$.newBuilder(java.lang.Object)", "public scala.collection.mutable.Builder scala.collection.mutable.ArraySeq$.newBuilder(scala.reflect.ClassTag)", "public abstract scala.collection.mutable.Builder scala.collection.EvidenceIterableFactory.newBuilder(java.lang.Object)" ``` The root cause is `WrappedArray` represent `mutable.ArraySeq` in Scala 2.13 and has a different constructor of `newBuilder` method. The main change of is pr is add Scala 2.13 only code part to deal with `case match WrappedArray` in Scala 2.13. ### Why are the changes needed? We need to support a Scala 2.13 build ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Scala 2.12: Pass the Jenkins or GitHub Action - Scala 2.13: All tests passed. Do the following: ``` dev/change-scala-version.sh 2.13 mvn clean install -DskipTests -pl sql/core -Pscala-2.13 -am mvn test -pl sql/core -Pscala-2.13 ``` Before ``` Tests: succeeded 8540, failed 1, canceled 1, ignored 52, pending 0 * 1 TEST FAILED * ``` After ``` Tests: succeeded 8541, failed 0, canceled 1, ignored 52, pending 0 All tests passed. ``` Closes #29903 from LuciferYang/fix-udfsuite. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-10-01 08:37:07 -05:00
iRakson	d3dbe1a907	[SQL][DOC][MINOR] Corrects input table names in the examples of CREATE FUNCTION doc ### What changes were proposed in this pull request? Fix Typo ### Why are the changes needed? To maintain consistency. Correct table name should be used for SELECT command. ### Does this PR introduce _any_ user-facing change? Yes. Now CREATE FUNCTION doc will show the correct name of table. ### How was this patch tested? Manually. Doc changes. Closes #29920 from iRakson/fixTypo. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-01 20:50:16 +09:00
Max Gekk	5651284c3b	[SPARK-32992][SQL] Map Oracle's ROWID type to StringType in read via JDBC ### What changes were proposed in this pull request? Convert the `ROWID` type in the Oracle JDBC dialect to Catalyst's `StringType`. The doc for Oracle 19c says explicitly that the type must be string: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Data-Types.html#GUID-AEF1FE4C-2DE5-4BE7-BB53-83AD8F1E34EF ### Why are the changes needed? To avoid the exception showed in https://stackoverflow.com/questions/52244492/spark-jdbc-dataframereader-fails-to-read-oracle-table-with-datatype-as-rowid ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? N/A Closes #29884 from MaxGekk/jdbc-oracle-rowid-string. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-01 14:50:32 +09:00
Peter Toth	28ed3a512a	[SPARK-32723][WEBUI] Upgrade to jQuery 3.5.1 ### What changes were proposed in this pull request? Upgrade to the latest available version of jQuery (3.5.1). ### Why are the changes needed? There are some CVE-s reported (CVE-2020-11022, CVE-2020-11023) affecting older versions of jQuery. Although Spark UI is read-only and those CVEs doesn't seem to affect Spark, using the latest version of this library can help to handle vulnerability reports of security scans. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual tests and checked the jQuery 3.5 upgrade guide. Closes #29902 from peter-toth/SPARK-32723-upgrade-to-jquery-3.5.1. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-30 21:30:17 -07:00
angerszhu	0b5a379c1f	[SPARK-33023][CORE] Judge path of Windows need add condition `Utils.isWindows` ### What changes were proposed in this pull request? according to https://github.com/apache/spark/pull/29881#discussion_r496648397 we need add condition `Utils.isWindows` ### Why are the changes needed? add strict condition of judging path is window path ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No Closes #29909 from AngersZhuuuu/SPARK-33023. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-30 19:24:50 -07:00
jlafleche	d75222dd1b	[SPARK-33012][BUILD][K8S] Upgrade fabric8 to 4.10.3 ### What changes were proposed in this pull request? This PR aims to upgrade `kubernetes-client` library to track fabric8's declared compatibility for k8s 1.18.0: https://github.com/fabric8io/kubernetes-client#compatibility-matrix ### Why are the changes needed? According to fabric8, 4.9.2 is incompatible with k8s 1.18.0. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Not tested yet. Closes #29888 from laflechejonathan/jlf/fabric8Ugprade. Authored-by: jlafleche <jlafleche@palantir.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-30 19:00:18 -07:00
GuoPhilipse	3bdbb5546d	[SPARK-31753][SQL][DOCS][FOLLOW-UP] Add missing keywords in the SQL docs ### What changes were proposed in this pull request? update sql-ref docs, the following key words will be added in this PR. CLUSTERED BY SORTED BY INTO num_buckets BUCKETS ### Why are the changes needed? let more users know the sql key words usage ### Does this PR introduce _any_ user-facing change? No ![image](https://user-images.githubusercontent.com/46367746/94428281-0a6b8080-01c3-11eb-9ff3-899f8da602ca.png) ![image](https://user-images.githubusercontent.com/46367746/94428285-0d667100-01c3-11eb-8a54-90e7641d917b.png) ![image](https://user-images.githubusercontent.com/46367746/94428288-0f303480-01c3-11eb-9e1d-023538aa6e2d.png) ### How was this patch tested? generate html test Closes #29883 from GuoPhilipse/add-sql-missing-keywords. Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Co-authored-by: GuoPhilipse <guofei_ok@126.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-01 08:15:53 +09:00
Dongjoon Hyun	ece8d8e22c	[SPARK-33006][K8S][DOCS] Add dynamic PVC usage example into K8s doc ### What changes were proposed in this pull request? This updates K8s document to describe new dynamic PVC features. ### Why are the changes needed? This will help the user use the new features easily. ### Does this PR introduce _any_ user-facing change? Yes, but it's a doc updates. ### How was this patch tested? Manual. <img width="847" alt="Screen Shot 2020-09-28 at 3 54 53 PM" src="https://user-images.githubusercontent.com/9700541/94494923-3ed04400-01a5-11eb-81f9-127db42d4256.png"> <img width="779" alt="Screen Shot 2020-09-28 at 3 55 07 PM" src="https://user-images.githubusercontent.com/9700541/94494930-4394f800-01a5-11eb-9387-50ebc14af477.png"> Closes #29897 from dongjoon-hyun/SPARK-33006. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-30 09:27:57 -07:00
Takeshi Yamamuro	3a299aa648	[SPARK-32741][SQL] Check if the same ExprId refers to the unique attribute in logical plans ### What changes were proposed in this pull request? Some plan transformations (e.g., `RemoveNoopOperators`) implicitly assume the same `ExprId` refers to the unique attribute. But, `RuleExecutor` does not check this integrity between logical plan transformations. So, this PR intends to add this check in `isPlanIntegral` of `Analyzer`/`Optimizer`. This PR comes from the talk with cloud-fan viirya in https://github.com/apache/spark/pull/29485#discussion_r475346278 ### Why are the changes needed? For better logical plan integrity checking. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #29585 from maropu/PlanIntegrityTest. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-09-30 21:37:29 +09:00
Dongjoon Hyun	cc06266ade	[SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default ### What changes were proposed in this pull request? Apache Spark 3.1's default Hadoop profile is `hadoop-3.2`. Instead of having a warning documentation, this PR aims to use a consistent and safer version of Apache Hadoop file output committer algorithm which is `v1`. This will prevent a silent correctness regression during migration from Apache Spark 2.4/3.0 to Apache Spark 3.1.0. Of course, if there is a user-provided configuration, `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2`, that will be used still. ### Why are the changes needed? Apache Spark provides multiple distributions with Hadoop 2.7 and Hadoop 3.2. `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version` depends on the Hadoop version. Apache Hadoop 3.0 switches the default algorithm from `v1` to `v2` and now there exists a discussion to remove `v2`. We had better provide a consistent default behavior of `v1` across various Spark distributions. - [MAPREDUCE-7282](https://issues.apache.org/jira/browse/MAPREDUCE-7282) MR v2 commit algorithm should be deprecated and not the default ### Does this PR introduce _any_ user-facing change? Yes. This changes the default behavior. Users can override this conf. ### How was this patch tested? Manual. BEFORE (spark-3.0.1-bin-hadoop3.2) ```scala scala> sc.version res0: String = 3.0.1 scala> sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version") res1: String = 2 ``` AFTER ```scala scala> sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version") res0: String = 1 ``` Closes #29895 from dongjoon-hyun/SPARK-DEFAUT-COMMITTER. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-29 12:02:45 -07:00
Yuming Wang	711d8dd28a	[SPARK-33018][SQL] Fix estimate statistics issue if child has 0 bytes ### What changes were proposed in this pull request? This pr fix estimate statistics issue if child has 0 bytes. ### Why are the changes needed? The `sizeInBytes` can be `0` when AQE and CBO are enabled(`spark.sql.adaptive.enabled`=true, `spark.sql.cbo.enabled`=true and `spark.sql.cbo.planStats.enabled`=true). This will generate incorrect BroadcastJoin, resulting in Driver OOM. For example: ![SPARK-33018](https://user-images.githubusercontent.com/5399861/94457606-647e3d00-01e7-11eb-85ee-812ae6efe7bb.jpg) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #29894 from wangyum/SPARK-33018. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-29 16:46:04 +00:00
Akshat Bordia	7766fd13c9	[MINOR][DOCS] Fixing log message for better clarity Fixing log message for better clarity. Closes #29870 from akshatb1/master. Lead-authored-by: Akshat Bordia <akshat.bordia31@gmail.com> Co-authored-by: Akshat Bordia <akshat.bordia@citrix.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-09-29 08:38:43 -05:00
Tom van Bussel	f167002522	[SPARK-32901][CORE] Do not allocate memory while spilling UnsafeExternalSorter ### What changes were proposed in this pull request? This PR changes `UnsafeExternalSorter` to no longer allocate any memory while spilling. In particular it removes the allocation of a new pointer array in `UnsafeInMemorySorter`. Instead the new pointer array is allocated whenever the next record is inserted into the sorter. ### Why are the changes needed? Without this change the `UnsafeExternalSorter` could throw an OOM while spilling. The following sequence of events would have triggered an OOM: 1. `UnsafeExternalSorter` runs out of space in its pointer array and attempts to allocate a new large array to replace the old one. 2. `TaskMemoryManager` tries to allocate the memory backing the new large array using `MemoryManager`, but `MemoryManager` is only willing to return most but not all of the memory requested. 3. `TaskMemoryManager` asks `UnsafeExternalSorter` to spill, which causes `UnsafeExternalSorter` to spill the current run to disk, to free its record pages and to reset its `UnsafeInMemorySorter`. 4. `UnsafeInMemorySorter` frees the old pointer array, and tries to allocate a new small pointer array. 5. `TaskMemoryManager` tries to allocate the memory backing the small array using `MemoryManager`, but `MemoryManager` is unwilling to give it any memory, as the `TaskMemoryManager` is still holding on to the memory it got for the new large array. 6. `TaskMemoryManager` again asks `UnsafeExternalSorter` to spill, but this time there is nothing to spill. 7. `UnsafeInMemorySorter` receives less memory than it requested, and causes a `SparkOutOfMemoryError` to be thrown, which causes the current task to fail. With the changes in the PR the following will happen instead: 1. `UnsafeExternalSorter` runs out of space in its pointer array and attempts to allocate a new large array to replace the old one. 2. `TaskMemoryManager` tries to allocate the memory backing the new large array using `MemoryManager`, but `MemoryManager` is only willing to return most but not all of the memory requested. 3. `TaskMemoryManager` asks `UnsafeExternalSorter` to spill, which causes `UnsafeExternalSorter` to spill the current run to disk, to free its record pages and to reset its `UnsafeInMemorySorter`. 4. `UnsafeInMemorySorter` frees the old pointer array. 5. `TaskMemoryManager` returns control to `UnsafeExternalSorter.growPointerArrayIfNecessary` (either by returning the the new large array or by throwing a `SparkOutOfMemoryError`). 6. `UnsafeExternalSorter` either frees the new large array or it ignores the `SparkOutOfMemoryError` depending on what happened in the previous step. 7. `UnsafeExternalSorter` successfully allocates a new small pointer array and operation continues as normal. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tests were added in `UnsafeExternalSorterSuite` and `UnsafeInMemorySorterSuite`. Closes #29785 from tomvanbussel/SPARK-32901. Authored-by: Tom van Bussel <tom.vanbussel@databricks.com> Signed-off-by: herman <herman@databricks.com>	2020-09-29 13:05:33 +02:00
tanel.kiis@gmail.com	90e86f6fac	[SPARK-32970][SPARK-32019][SQL][TEST] Reduce the runtime of an UT for ### What changes were proposed in this pull request? The UT for SPARK-32019 (#28853) tries to write about 16GB of data do the disk. We must change the value of `spark.sql.files.maxPartitionBytes` to a smaller value do check the correct behavior with less data. By default it is `128MB`. The other parameters in this UT are also changed to smaller values to keep the behavior the same. ### Why are the changes needed? The runtime of this one UT can be over 7 minutes on Jenkins. After the change it is few seconds. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #29842 from tanelk/SPARK-32970. Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-29 16:51:44 +09:00
Liang-Chi Hsieh	202115e7cd	[SPARK-32948][SQL] Optimize to_json and from_json expression chain ### What changes were proposed in this pull request? This patch proposes to optimize from_json + to_json expression chain. ### Why are the changes needed? To optimize json expression chain that could be manually generated or generated automatically during query optimization. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #29828 from viirya/SPARK-32948. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-28 22:22:47 -07:00
Max Gekk	1b60ff5afe	[MINOR][DOCS] Document when `current_date` and `current_timestamp` are evaluated ### What changes were proposed in this pull request? Explicitly document that `current_date` and `current_timestamp` are executed at the start of query evaluation. And all calls of `current_date`/`current_timestamp` within the same query return the same value ### Why are the changes needed? Users could expect that `current_date` and `current_timestamp` return the current date/timestamp at the moment of query execution but in fact the functions are folded by the optimizer at the start of query evaluation: `0df8dd6073/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala (L71-L91)` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running `./dev/scalastyle`. Closes #29892 from MaxGekk/doc-current_date. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-29 05:20:12 +00:00
HyukjinKwon	6868b40517	[SPARK-33020][PYTHON] Add nth_value as a PySpark function ### What changes were proposed in this pull request? `nth_value` was added at SPARK-27951. This PR adds the corresponding PySpark API. ### Why are the changes needed? To support the consistent APIs ### Does this PR introduce _any_ user-facing change? Yes, it introduces a new PySpark function API. ### How was this patch tested? Unittest was added. Closes #29899 from HyukjinKwon/SPARK-33020. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-28 22:14:28 -07:00
Max Gekk	68cd5677ae	[SPARK-33015][SQL] Compute the current date only once ### What changes were proposed in this pull request? Compute the current date at the specified time zone using timestamp taken at the start of query evaluation. ### Why are the changes needed? According to the doc for [current_date()](http://spark.apache.org/docs/latest/api/sql/#current_date), the current date should be computed at the start of query evaluation but it can be computed multiple times. As a consequence of that, the function can return different values if the query is executed at the border of two dates. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? By existing test suites `ComputeCurrentTimeSuite` and `DateExpressionsSuite`. Closes #29889 from MaxGekk/fix-current_date. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-29 05:13:01 +00:00
HyukjinKwon	376ede1301	[SPARK-33021][PYTHON][TESTS] Move functions related test cases into test_functions.py ### What changes were proposed in this pull request? Move functions related test cases from `test_context.py` to `test_functions.py`. ### Why are the changes needed? To group the similar test cases. ### Does this PR introduce _any_ user-facing change? Nope, test-only. ### How was this patch tested? Jenkins and GitHub Actions should test. Closes #29898 from HyukjinKwon/SPARK-33021. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-28 21:54:00 -07:00
gengjiaan	a53fc9b7ae	[SPARK-27951][SQL][FOLLOWUP] Improve the window function nth_value ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/29604 supports the ANSI SQL NTH_VALUE. We should override the `prettyName` and `sql`. ### Why are the changes needed? Make the name of nth_value correct. To show the ignoreNulls parameter correctly. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Jenkins test. Closes #29886 from beliefer/improve-nth_value. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-29 09:54:43 +09:00
Shruti Gumma	173da5bf11	[SPARK-32996][WEB-UI] Handle empty ExecutorMetrics in ExecutorMetricsJsonSerializer ### What changes were proposed in this pull request? When `peakMemoryMetrics` in `ExecutorSummary` is `Option.empty`, then the `ExecutorMetricsJsonSerializer#serialize` method does not execute the `jsonGenerator.writeObject` method. This causes the json to be generated with `peakMemoryMetrics` key added to the serialized string, but no corresponding value. This causes an error to be thrown when it is the next key `attributes` turn to be added to the json: `com.fasterxml.jackson.core.JsonGenerationException: Can not write a field name, expecting a value ` ### Why are the changes needed? At the start of the Spark job, if `peakMemoryMetrics` is `Option.empty`, then it causes a `com.fasterxml.jackson.core.JsonGenerationException` to be thrown when we navigate to the Executors tab in Spark UI. Complete stacktrace: > com.fasterxml.jackson.core.JsonGenerationException: Can not write a field name, expecting a value > at com.fasterxml.jackson.core.JsonGenerator._reportError(JsonGenerator.java:2080) > at com.fasterxml.jackson.core.json.WriterBasedJsonGenerator.writeFieldName(WriterBasedJsonGenerator.java:161) > at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:725) > at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:721) > at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:166) > at com.fasterxml.jackson.databind.ser.std.CollectionSerializer.serializeContents(CollectionSerializer.java:145) > at com.fasterxml.jackson.module.scala.ser.IterableSerializer.serializeContents(IterableSerializerModule.scala:26) > at com.fasterxml.jackson.module.scala.ser.IterableSerializer.serializeContents$(IterableSerializerModule.scala:25) > at com.fasterxml.jackson.module.scala.ser.UnresolvedIterableSerializer.serializeContents(IterableSerializerModule.scala:54) > at com.fasterxml.jackson.module.scala.ser.UnresolvedIterableSerializer.serializeContents(IterableSerializerModule.scala:54) > at com.fasterxml.jackson.databind.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:250) > at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480) > at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319) > at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:4094) > at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:3404) > at org.apache.spark.ui.exec.ExecutorsPage.allExecutorsDataScript$1(ExecutorsTab.scala:64) > at org.apache.spark.ui.exec.ExecutorsPage.render(ExecutorsTab.scala:76) > at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89) > at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:873) > at org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623) > at org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) > at org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610) > at org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540) > at org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) > at org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345) > at org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203) > at org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480) > at org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201) > at org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247) > at org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) > at org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753) > at org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220) > at org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) > at org.sparkproject.jetty.server.Server.handle(Server.java:505) > at org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:370) > at org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:267) > at org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305) > at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:103) > at org.sparkproject.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117) > at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333) > at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310) > at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168) > at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126) > at org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366) > at org.sparkproject.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698) > at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804) > at java.base/java.lang.Thread.run(Thread.java:834) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #29872 from shrutig/SPARK-32996. Authored-by: Shruti Gumma <shruti_gumma@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2020-09-28 10:07:36 -07:00
Jungtaek Lim (HeartSaVioR)	d15f504a5e	[SPARK-33011][ML] Promote the stability annotation to Evolving for MLEvent traits/classes ### What changes were proposed in this pull request? This PR proposes to promote the stability annotation to `Evolving` for MLEvent traits/classes. ### Why are the changes needed? The feature is released in Spark 3.0.0 having SPARK-26818 as the last change in Feb. 2020, and haven't changed in Spark 3.0.1. (There's no change more than a half of year.) While we'd better to wait for some minor releases to consider the API as stable, it would worth to promote to Evolving so that we clearly state that we support the API. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Just changed the annotation, no tests required. Closes #29887 from HeartSaVioR/SPARK-33011. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-28 14:57:59 +09:00
Fabian Höring	a7f84a0b45	[SPARK-32187][PYTHON][DOCS] Doc on Python packaging ### What changes were proposed in this pull request? This PR proposes to document PySpark specific packaging guidelines. ### Why are the changes needed? To have a single place for PySpark users, and better documentation. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? ``` cd python/docs make clean html ``` Closes #29806 from fhoering/add_doc_python_packaging. Lead-authored-by: Fabian Höring <f.horing@criteo.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-28 12:30:28 +09:00
tanel.kiis@gmail.com	f41ba2a2f3	[SPARK-32927][SQL] Bitwise OR, AND and XOR should have similar canonicalization rules to boolean OR and AND ### What changes were proposed in this pull request? Add canonicalization rules for commutative bitwise operations. ### Why are the changes needed? Canonical form is used in many other optimization rules. Reduces the number of cases, where plans with identical results are considered to be distinct. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #29794 from tanelk/SPARK-32927. Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-28 12:22:15 +09:00
yangjie01	bb6d5e7a90	[SPARK-32972][ML] Pass all UTs of `mllib` module in Scala 2.13 ### What changes were proposed in this pull request? The purpose of this pr is to resolve SPARK-32972, total of 51 Scala failed test cases and 3 Java failed test cases were fixed, the main change of this pr as follow: - Specified `Seq` to `scala.collection.Seq` in case match `Seq` scene and `x.asInstanceOf[Seq[T]]` scene - Use `Row.getSeq[T]` instead of `Row.getAs[Seq]` - Manual call `toMap` method to convert `MapView` to `Map` in Scala 2.13 - Change the tol in the last test to 0.75 to pass `RandomForestRegressorSuite#training with sample weights` in Scala 2.13 ### Why are the changes needed? We need to support a Scala 2.13 build. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Scala 2.12: Pass the Jenkins or GitHub Action - Scala 2.13: Pass GitHub 2.13 Build Action Do the follow: ``` dev/change-scala-version.sh 2.13 mvn clean install -DskipTests -pl mllib -Pscala-2.13 -am mvn test -pl mllib -Pscala-2.13 -fn ``` Before ``` [ERROR] Errors: [ERROR] JavaVectorIndexerSuite.vectorIndexerAPI:51 » ClassCast scala.collection.conver... [ERROR] JavaWord2VecSuite.testJavaWord2Vec:51 » Spark Job aborted due to stage failure... [ERROR] JavaPrefixSpanSuite.runPrefixSpanSaveLoad:79 » Spark Job aborted due to stage ... Tests: succeeded 1567, failed 51, canceled 0, ignored 7, pending 0 * 51 TESTS FAILED * ``` After ``` [INFO] Tests run: 122, Failures: 0, Errors: 0, Skipped: 0 Tests: succeeded 1617, failed 0, canceled 0, ignored 7, pending 0 All tests passed. ``` Closes #29857 from LuciferYang/fix-mllib-2. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-09-27 10:26:51 -05:00
zhengruifeng	bc77e5b840	[SPARK-32973][ML][DOC] FeatureHasher does not check categoricalCols in inputCols ### What changes were proposed in this pull request? 1, update the comment: `Note, the relevant columns must also be set in inputCols` -> `Note, the relevant columns should also be set in inputCols`; 2, add a check, and if there are `categoricalCols` not set in `inputCols`, log.warn it; ### Why are the changes needed? 1, there is no check to make sure `categoricalCols` are all set in `inputCols`, to keep existing behavior, update this comments; ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? repl Closes #29868 from zhengruifeng/feature_hash_cat_doc. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-09-27 10:26:05 -05:00
zero323	c65b64552f	[SPARK-32714][FOLLOW-UP][PYTHON] Address pyspark.install typing errors ### What changes were proposed in this pull request? This PR adds two `type: ignores`, one in `pyspark.install` and one in related tests. ### Why are the changes needed? To satisfy MyPy type checks. It seems like we originally missed some changes that happened around merge of `31a16fbb40` ``` python/pyspark/install.py:30: error: Need type annotation for 'UNSUPPORTED_COMBINATIONS' (hint: "UNSUPPORTED_COMBINATIONS: List[<type>] = ...") [var-annotated] python/pyspark/tests/test_install_spark.py:105: error: Cannot find implementation or library stub for module named 'xmlrunner' [import] python/pyspark/tests/test_install_spark.py:105: note: See https://mypy.readthedocs.io/en/latest/running_mypy.html#missing-imports ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Existing tests. - MyPy tests ``` mypy --show-error-code --no-incremental --config python/mypy.ini python/pyspark ``` Closes #29878 from zero323/SPARK-32714-FOLLOW-UP. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-27 16:21:23 +09:00
zhengruifeng	0c38765b29	[SPARK-32974][ML] FeatureHasher transform optimization ### What changes were proposed in this pull request? pre-compute the output indices of numerical columns, instead of computing them on each row. ### Why are the changes needed? for a numerical column, its output index is a hash of its `col_name`, we can pre-compute it at first, instead of computing it on each row. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing testsuites Closes #29850 from zhengruifeng/hash_opt. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-09-27 09:35:05 +08:00
Kris Mok	9a155d42a3	[SPARK-32999][SQL] Use Utils.getSimpleName to avoid hitting Malformed class name in TreeNode ### What changes were proposed in this pull request? Use `Utils.getSimpleName` to avoid hitting `Malformed class name` error in `TreeNode`. ### Why are the changes needed? On older JDK versions (e.g. JDK8u), nested Scala classes may trigger `java.lang.Class.getSimpleName` to throw an `java.lang.InternalError: Malformed class name` error. Similar to https://github.com/apache/spark/pull/29050, we should use Spark's `Utils.getSimpleName` utility function in place of `Class.getSimpleName` to avoid hitting the issue. ### Does this PR introduce _any_ user-facing change? Fixes a bug that throws an error when invoking `TreeNode.nodeName`, otherwise no changes. ### How was this patch tested? Added new unit test case in `TreeNodeSuite`. Note that the test case assumes the test code can trigger the expected error, otherwise it'll skip the test safely, for compatibility with newer JDKs. Manually tested on JDK8u and JDK11u and observed expected behavior: - JDK8u: the test case triggers the "Malformed class name" issue and the fix works; - JDK11u: the test case does not trigger the "Malformed class name" issue, and the test case is safely skipped. Closes #29875 from rednaxelafx/spark-32999-getsimplename. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-26 16:03:59 -07:00
zhengruifeng	934a91fcb4	[SPARK-21481][ML][FOLLOWUP][TRIVIAL] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap ### What changes were proposed in this pull request? `HashingTF` use `util.collection.OpenHashMap` instead of `mutable.HashMap` ### Why are the changes needed? according to `util.collection.OpenHashMap` 's doc: > This map is about 5X faster than java.util.HashMap, while using much less space overhead. according to performance tests like ([Simple microbenchmarks comparing Scala vs Java mutable map performance ](https://gist.github.com/pchiusano/1423303)), `mutable.HashMap` maybe more inefficient than `java.util.HashMap` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing testsuites Closes #29852 from zhengruifeng/hashingtf_opt. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-09-26 08:16:39 -05:00
Dongjoon Hyun	6c805470a7	[SPARK-32997][K8S] Support dynamic PVC creation and deletion in K8s driver ### What changes were proposed in this pull request? This PR aims to support dynamic PVC creation and deletion in K8s driver. Configuration This PR reuses the existing PVC volume configs. ``` spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=gp2 spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=200Gi spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false ``` PVC ``` $ kubectl get pvc \| grep driver tpcds-d6087874c6705564-driver-pvc-0 Bound pvc-fae914a2-ca5c-4e1e-8aba-54a35357d072 200Gi RWO gp2 12m ``` Disk ``` $ k exec -it tpcds-d6087874c6705564-driver -- df -h \| grep data /dev/nvme5n1 197G 61M 197G 1% /data ``` ``` $ k exec -it tpcds-d6087874c6705564-driver -- ls -al /data total 28 drwxr-xr-x 5 root root 4096 Sep 25 18:06 . drwxr-xr-x 1 root root 63 Sep 25 18:06 .. drwxr-xr-x 66 root root 4096 Sep 25 18:09 blockmgr-2c9a8cc5-a05c-45fe-a58e-b8f42da88a57 drwx------ 2 root root 16384 Sep 25 18:06 lost+found drwx------ 4 root root 4096 Sep 25 18:07 spark-0448efe7-da2c-4f3a-bd3c-769aadb11dd6 ``` NOTE This should be used carefully because Apache Spark doesn't delete driver pod automatically. Since the driver PVC shares the lifecycle of driver pod, it will exist after the job completion until the pod deletion. However, if the users are already using pre-populated PVCs, this isn't a regression at all in terms of the cost. ``` $ k get pod -l spark-role=driver NAME READY STATUS RESTARTS AGE tpcds-d6087874c6705564-driver 0/1 Completed 0 35m ``` ### Why are the changes needed? Like executors, driver also needs larger PVC. ### Does this PR introduce _any_ user-facing change? Yes. This is a new feature. ### How was this patch tested? Pass the newly added test case. Closes #29873 from dongjoon-hyun/SPARK-32997. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-25 16:36:15 -07:00
gatorsmile	e887c639a7	[SPARK-32931][SQL] Unevaluable Expressions are not Foldable ### What changes were proposed in this pull request? Unevaluable expressions are not foldable because we don't have an eval for it. This PR is to clean up the code and enforce it. ### Why are the changes needed? Ensure that we will not hit the weird cases that trigger ConstantFolding. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The existing tests. Closes #29798 from gatorsmile/refactorUneval. Lead-authored-by: gatorsmile <gatorsmile@gmail.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-25 07:27:29 +00:00
Yuanjian Li	9e6882feca	[SPARK-32885][SS] Add DataStreamReader.table API ### What changes were proposed in this pull request? This pr aims to add a new `table` API in DataStreamReader, which is similar to the table API in DataFrameReader. ### Why are the changes needed? Users can directly use this API to get a Streaming DataFrame on a table. Below is a simple example: Application 1 for initializing and starting the streaming job: ``` val path = "/home/yuanjian.li/runtime/to_be_deleted" val tblName = "my_table" // Write some data to `my_table` spark.range(3).write.format("parquet").option("path", path).saveAsTable(tblName) // Read the table as a streaming source, write result to destination directory val table = spark.readStream.table(tblName) table.writeStream.format("parquet").option("checkpointLocation", "/home/yuanjian.li/runtime/to_be_deleted_ck").start("/home/yuanjian.li/runtime/to_be_deleted_2") ``` Application 2 for appending new data: ``` // Append new data into the path spark.range(5).write.format("parquet").option("path", "/home/yuanjian.li/runtime/to_be_deleted").mode("append").save() ``` Check result: ``` // The desitination directory should contains all written data spark.read.parquet("/home/yuanjian.li/runtime/to_be_deleted_2").show() ``` ### Does this PR introduce _any_ user-facing change? Yes, a new API added. ### How was this patch tested? New UT added and integrated testing. Closes #29756 from xuanyuanking/SPARK-32885. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-25 06:50:24 +00:00
ulysses	f2fc966674	[SPARK-32877][SQL][TEST] Add test for Hive UDF complex decimal type ### What changes were proposed in this pull request? Add test to cover Hive UDF whose input contains complex decimal type. Add comment to explain why we can't make `HiveSimpleUDF` extend `ImplicitTypeCasts`. ### Why are the changes needed? For better test coverage with Hive which we compatible or not. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add test. Closes #29863 from ulysses-you/SPARK-32877-test. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-24 22:16:05 -07:00
Terry Kim	e9c98c910a	[SPARK-32990][SQL] Migrate REFRESH TABLE to use UnresolvedTableOrView to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `REFRESH TABLE` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). ### Why are the changes needed? The current behavior is not consistent between v1 and v2 commands when resolving a temp view. In v2, the `t` in the following example is resolved to a table: ```scala sql("CREATE TABLE testcat.ns.t (id bigint) USING foo") sql("CREATE TEMPORARY VIEW t AS SELECT 2") sql("USE testcat.ns") sql("REFRESH TABLE t") // 't' is resolved to testcat.ns.t ``` whereas in v1, the `t` is resolved to a temp view: ```scala sql("CREATE DATABASE test") sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv") sql("CREATE TEMPORARY VIEW t AS SELECT 2") sql("USE spark_catalog.test") sql("REFRESH TABLE t") // 't' is resolved to a temp view ``` ### Does this PR introduce _any_ user-facing change? After this PR, `REFRESH TABLE t` is resolved to a temp view `t` instead of `testcat.ns.t`. ### How was this patch tested? Added a new test Closes #29866 from imback82/refresh_table_consistent. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-25 04:29:09 +00:00
Dongjoon Hyun	d7aa3b56e8	[SPARK-32889][SQL][TESTS][FOLLOWUP] Skip special column names test in Hive 1.2 ### What changes were proposed in this pull request? This PR is a followup of SPARK-32889 in order to ignore the special column names test in `hive-1.2` profile. ### Why are the changes needed? Hive 1.2 is too old to support special column names because it doesn't use Apache ORC. This will recover our `hive-1.2` Jenkins job. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-1.2/ ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the test with Hive 1.2 profile. Closes #29867 from dongjoon-hyun/SPARK-32889-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-24 16:22:08 -07:00
Chao Sun	8ccfbc114e	[SPARK-32381][CORE][SQL] Move and refactor parallel listing & non-location sensitive listing to core <!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html 2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html 3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'. 4. Be sure to keep the PR description updated to reflect all changes. 5. Please write your PR title to summarize what this PR proposes. 6. If possible, provide a concise example to reproduce the issue for a faster review. 7. If you want to add a new configuration, please read the guideline first for naming configurations in 'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'. --> ### What changes were proposed in this pull request? <!-- Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below. 1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers. 2. If you fix some SQL features, you can provide some references of other DBMSes. 3. If there is design documentation, please add the link. 4. If there is a discussion in the mailing list, please add the link. --> This moves and refactors the parallel listing utilities from `InMemoryFileIndex` to Spark core so it can be reused by modules beside SQL. Along the process this also did some cleanups/refactorings: - Created a `HadoopFSUtils` class under core - Moved `InMemoryFileIndex.bulkListLeafFiles` into `HadoopFSUtils.parallelListLeafFiles`. It now depends on a `SparkContext` instead of `SparkSession` in SQL. Also added a few parameters which used to be read from `SparkSession.conf`: `ignoreMissingFiles`, `ignoreLocality`, `parallelismThreshold`, `parallelismMax ` and `filterFun` (for additional filtering support but we may be able to merge this with `filter` parameter in future). - Moved `InMemoryFileIndex.listLeafFiles` into `HadoopFSUtils.listLeafFiles` with similar changes above. ### Why are the changes needed? <!-- Please clarify why the changes are needed. For instance, 1. If you propose a new API, clarify the use case for a new API. 2. If you fix a bug, you can clarify why it is a bug. --> Currently the locality-aware parallel listing mechanism only applies to `InMemoryFileIndex`. By moving this to core, we can potentially reuse the same mechanism for other code paths as well. ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as the documentation fix. If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible. If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master. If no, write 'No'. --> No. ### How was this patch tested? <!-- If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Since this is mostly a refactoring, it relies on existing unit tests such as those for `InMemoryFileIndex`. Closes #29471 from sunchao/SPARK-32381. Lead-authored-by: Chao Sun <sunchao@apache.org> Co-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Chao Sun <sunchao@uber.com> Signed-off-by: Holden Karau <hkarau@apple.com>	2020-09-24 10:58:52 -07:00
yangjie01	4ae0f70395	[SPARK-32954][YARN][TEST] Add jakarta.servlet-api test dependency to yarn module to avoid UTs badcase ### What changes were proposed in this pull request? When I tried to verify that the `resource-managers/yarn` module passed all UTs in Scala 2.13 , I found that there is a issue related to classpath order maybe blocked the UTs because there are more than one `servlet-api` dependency in spark now: - One is `javax.servlet:javax.servlet-api:3.10:compile` config in core/pom.xml, - The other is `jakarta.servlet:jakarta.servlet-api:4.0.3:test` cascaded by `org.glassfish.jersey.test-framework.providers` we can use `mvn dependency:tree` to check it . So when we execute `resource-managers/yarn` module test use ``` mvn clean test -pl resource-managers/yarn -Pyarn ``` or ``` mvn clean test -pl resource-managers/yarn -Pyarn -Pscala-2.13 ``` and if the position of `javax.servlet-api` in the in classpath is before `jakarta.servlet-api`, there are some cases failed in `YarnClusterSuite`, `YarnShuffleIntegrationSuite` and `YarnShuffleAuthSuite`. The failed reason as follow: ``` 20/09/18 19:14:07.486 launcher-proc-1 INFO YarnClusterDriver: Exception in thread "main" java.lang.ExceptionInInitializerError ... 20/09/18 19:14:07.486 launcher-proc-1 INFO YarnClusterDriver: Caused by: java.lang.SecurityException: class "javax.servlet.http.HttpSessionIdListener"'s signer information does not match signer information of other classes in the same package ... ``` ### Why are the changes needed? Avoid UTs error caused by classpath order . ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Scala 2.12: Pass the Jenkins or GitHub Action - Scala 2.13: Pass 2.13 Build GitHub Action and do the following: ``` dev/change-scala-version.sh 2.13 mvn clean install -DskipTests -pl resource-managers/yarn -Pyarn -Pscala-2.13 -am mvn clean test -pl resource-managers/yarn -Pyarn -Pscala-2.13 ``` ``` Tests: succeeded 136, failed 0, canceled 1, ignored 0, pending 0 All tests passed. ``` Closes #29824 from LuciferYang/yarn-tests-deps. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-24 08:32:32 -07:00
yangjie01	fe6d38d243	[SPARK-32987][MESOS] Pass all `mesos` module UTs in Scala 2.13 ### What changes were proposed in this pull request? The main change of this pr is add a manual sort to `defaultConf ++ driverConf` before constructing `--conf` options to ensure options has same order in Scala 2.12 and Scala 2.13. ### Why are the changes needed? We need to support a Scala 2.13 build. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Scala 2.12: Pass the Jenkins or GitHub Action - Scala 2.13: Pass GitHub 2.13 Build Action Do the following: ``` dev/change-scala-version.sh 2.13 mvn clean install -DskipTests -pl resource-managers/mesos -Pscala-2.13 -Pmesos -am mvn test -pl resource-managers/mesos -Pscala-2.13 -Pmesos ``` Before ``` Tests: succeeded 106, failed 1, canceled 0, ignored 0, pending 0 * 1 TESTS FAILED * ``` After ``` Tests: succeeded 107, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #29865 from LuciferYang/SPARK-32987-2. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-24 08:25:24 -07:00
HyukjinKwon	688d016c7a	[SPARK-32982][BUILD] Remove hive-1.2 profiles in PIP installation option ### What changes were proposed in this pull request? This PR removes Hive 1.2 option (and therefore `HIVE_VERSION` environment variable as well). ### Why are the changes needed? Hive 1.2 is a fork version. We shouldn't promote users to use. ### Does this PR introduce _any_ user-facing change? Nope, `HIVE_VERSION` and Hive 1.2 are removed but this is new experimental feature in master only. ### How was this patch tested? Manually tested: ```bash SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz -v SPARK_VERSION=3.0.1 HADOOP_VERSION=2.7 pip install pyspark-3.1.0.dev0.tar.gz -v SPARK_VERSION=3.0.1 HADOOP_VERSION=invalid pip install pyspark-3.1.0.dev0.tar.gz -v ``` Closes #29858 from HyukjinKwon/SPARK-32981. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-24 14:49:58 +09:00
zero323	31a16fbb40	[SPARK-32714][PYTHON] Initial pyspark-stubs port ### What changes were proposed in this pull request? This PR proposes migration of [`pyspark-stubs`](https://github.com/zero323/pyspark-stubs) into Spark codebase. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? Yes. This PR adds type annotations directly to Spark source. This can impact interaction with development tools for users, which haven't used `pyspark-stubs`. ### How was this patch tested? - [x] MyPy tests of the PySpark source ``` mypy --no-incremental --config python/mypy.ini python/pyspark ``` - [x] MyPy tests of Spark examples ``` MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming ``` - [x] Existing Flake8 linter - [x] Existing unit tests Tested against: - `mypy==0.790+dev.e959952d9001e9713d329a2f9b196705b028f894` - `mypy==0.782` Closes #29591 from zero323/SPARK-32681. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-24 14:15:36 +09:00

1 2 3 4 5 ...

28173 commits