ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Dongjoon Hyun	dfb7790a9d	[SPARK-33108][BUILD] Remove sbt-dependency-graph SBT plugin ### What changes were proposed in this pull request? This PR aims to remove `sbt-dependency-graph` SBT plugin. ### Why are the changes needed? `sbt-dependency-graph` officially doesn't support SBT 1.3.x and it's broken due to `NoSuchMethodError`. This cannot be fixed in `sbt-dependency-graph` side at SBT 1.3.x - https://github.com/sbt/sbt-dependency-graph > Note: Under sbt >= 1.3.x some features might currently not work as expected or not at all (like dependencyLicenses). ``` $ build/sbt dependencyTree Launching sbt from build/sbt-launch-1.3.13.jar [info] welcome to sbt 1.3.13 (AdoptOpenJDK Java 1.8.0_252) ... [error] java.lang.NoSuchMethodError: sbt.internal.LibraryManagement$.cachedUpdate(Lsbt/librarymanagement/DependencyResolution;Lsbt/librarymanagement/ModuleDescriptor;Lsbt/util/CacheStoreFactory;Ljava/lang/String;Lsbt/librarymanagement/UpdateConfiguration;Lscala/Function1;ZZZLsbt/librarymanagement/UnresolvedWarningConfiguration;Lsbt/librarymanagement/EvictionWarningOptions;ZLsbt/internal/librarymanagement/CompatibilityWarningOptions;Lsbt/util/Logger;)Lsbt/librarymanagement/UpdateReport; ``` ALTERNATIVES - One alternative is `coursier`, but it requires `coursier-based sbt launcher` which is more intrusive. - https://get-coursier.io/docs/sbt-coursier.html#sbt-13x > you'll have to use the coursier-based sbt launcher, via its custom sbt-extras launcher for example. - Another alternative is moving to `SBT 1.4.0` which uses `sbt-dependency-graph` as a built-in, but it's still new and will requires many change. So, this PR aims to remove the broken plugin simply. ### Does this PR introduce _any_ user-facing change? No. This is a dev-only change. ### How was this patch tested? Manual. ``` $ build/sbt dependencyTree ... [error] Not a valid command: dependencyTree [error] Not a valid project ID: dependencyTree [error] Not a valid key: dependencyTree (similar: dependencyOverrides, sbtDependency, dependencyResolution) [error] dependencyTree [error] ^ ``` Closes #29997 from dongjoon-hyun/remove_depedencyTree. Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-09 22:35:12 -07:00
Gabor Somogyi	1e63dcc8f0	[SPARK-33102][SQL] Use stringToSeq on SQL list typed parameters ### What changes were proposed in this pull request? While I've implemented JDBC provider disable functionality it has been popped up [here](https://github.com/apache/spark/pull/29964#discussion_r501786746) that `Utils.stringToSeq` must be used when String list type SQL parameter handled. In this PR I've fixed the problematic parameters. ### Why are the changes needed? `Utils.stringToSeq` must be used when String list type SQL parameter handled. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #29989 from gaborgsomogyi/SPARK-33102. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-10 13:53:09 +09:00
zero323	018811f974	[SPARK-33105][INFRA] Change default R arch from i386 to x64 and parametrize BINPREF ### What changes were proposed in this pull request? - Change default R `arch` from `i386` to `x64`, to match Rtools version. - Parameterize `BINPREF` with `WIN` (https://stackoverflow.com/a/44035904) Reported on dev: http://apache-spark-developers-list.1001551.n3.nabble.com/Broken-rlang-installation-on-AppVeyor-td30294.html ### Why are the changes needed? It seems like update from rlang 0.4.7 to 0.4.8 exposed an issue, where build fails because of incompatible ddl ``` c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: skipping incompatible C:/R/bin/i386/R.dll when searching for -lR [00:01:52] c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: skipping incompatible C:/R/bin/i386/R.dll when searching for -lR [00:01:52] c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: cannot find -lR [00:01:52] collect2.exe: error: ld returned 1 exit status ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #29991 from zero323/APPVEYOR-DEAFAULT-ARCH. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-10 13:48:26 +09:00
HyukjinKwon	2e07ed3041	[SPARK-33082][SPARK-20202][BUILD][SQL][FOLLOW-UP] Remove Hive 1.2 workarounds and Hive 1.2 profile in Jenkins script ### What changes were proposed in this pull request? This PR removes the leftover of Hive 1.2 workarounds and Hive 1.2 profile in Jenkins script. - `test-hive1.2` title is not used anymore in Jenkins - Remove some comments related to Hive 1.2 - Remove unused codes in `OrcFilters.scala` Hive - Test `spark.sql.hive.convertMetastoreOrc` disabled case for the tests added at SPARK-19809 and SPARK-22267 ### Why are the changes needed? To remove unused codes & improve test coverage ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually ran the unit tests. Also It will be tested in CI in this PR. Closes #29973 from HyukjinKwon/SPARK-33082-SPARK-20202. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-09 03:04:26 -07:00
Jungtaek Lim (HeartSaVioR)	edb140eb5c	[SPARK-32896][SS] Add DataStreamWriter.table API ### What changes were proposed in this pull request? This PR proposes to add `DataStreamWriter.table` to specify the output "table" to write from the streaming query. ### Why are the changes needed? For now, there's no way to write to the table (especially catalog table) even the table is capable to handle streaming write, so even with Spark 3, writing to the catalog table via SS should go through the `DataStreamWriter.format(provider)` and wish the provider can handle it as same as we do with catalog table. With the new API, we can directly point to the catalog table which supports streaming write. Some of usages are covered with tests - simply saying, end users can do the following: ```scala // assuming `testcat` is a custom catalog, and `ns` is a namespace in the catalog spark.sql("CREATE TABLE testcat.ns.table1 (id bigint, data string) USING foo") val query = inputDF .writeStream .table("testcat.ns.table1") .option(...) .start() ``` ### Does this PR introduce _any_ user-facing change? Yes, as this adds a new public API in DataStreamWriter. This doesn't bring backward incompatible change. ### How was this patch tested? New unit tests. Closes #29767 from HeartSaVioR/SPARK-32896. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-09 03:01:54 -07:00
Dongjoon Hyun	e1909c96fb	[SPARK-33099][K8S] Respect executor idle timeout conf in ExecutorPodsAllocator ### What changes were proposed in this pull request? This PR aims to protect the executor pod request or pending pod during executor idle timeout. ### Why are the changes needed? In case of dynamic allocation, Apache Spark K8s `ExecutorPodsAllocator` cancels the pod requests or pending pods too eagerly. Like the following example, `ExecutorPodsAllocator` received the new total executor adjust request rapidly in two minutes. Sometimes, it's called 3 times in a single second. It repeats `request` and `delete` on that request or pending pod frequently. This PR is reusing `spark.dynamicAllocation.executorIdleTimeout (default: 60s)` to keep the pod request or pending pod. ``` 20/10/08 05:58:08 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 3 20/10/08 05:58:08 INFO ExecutorPodsAllocator: Going to request 3 executors from Kubernetes. 20/10/08 05:58:09 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 3 20/10/08 05:58:43 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 1 20/10/08 05:58:47 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 0 20/10/08 05:59:26 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 3 20/10/08 05:59:30 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 2 20/10/08 05:59:31 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 3 20/10/08 05:59:44 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 2 20/10/08 05:59:44 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 0 20/10/08 05:59:45 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 3 20/10/08 05:59:50 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 2 20/10/08 05:59:50 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 1 20/10/08 05:59:50 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 0 20/10/08 05:59:54 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 3 20/10/08 05:59:54 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes. ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the newly added test case. Closes #29981 from dongjoon-hyun/SPARK-K8S-INITIAL. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-09 02:50:38 -07:00
Max Gekk	1234c66fa6	[SPARK-33101][ML] Make LibSVM format propagate Hadoop config from DS options to underlying HDFS file system ### What changes were proposed in this pull request? Propagate LibSVM options to Hadoop configs in the LibSVM datasource. ### Why are the changes needed? There is a bug that when running: ```scala spark.read.format("libsvm").options(conf).load(path) ``` The underlying file system will not receive the `conf` options. ### Does this PR introduce _any_ user-facing change? Yes. After the changes, for example, users should read files from Azure Data Lake successfully: ```scala def hadoopConf1() = Map[String, String]( s"fs.adl.oauth2.access.token.provider.type" -> "ClientCredential", s"fs.adl.oauth2.client.id" -> dbutils.secrets.get(scope = "...", key = "..."), s"fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "...", key = "..."), s"fs.adl.oauth2.refresh.url" -> s"https://login.microsoftonline.com/.../oauth2/token") val df = spark.read.format("libsvm").options(hadoopConf1).load("adl://....azuredatalakestore.net/foldersp1/...") ``` and not get the following exception because the settings above are not propagated to the filesystem: ```java java.lang.IllegalArgumentException: No value for fs.adl.oauth2.access.token.provider found in conf file. at ....adl.AdlFileSystem.getNonEmptyVal(AdlFileSystem.java:820) at ....adl.AdlFileSystem.getCustomAccessTokenProvider(AdlFileSystem.java:220) at ....adl.AdlFileSystem.getAccessTokenProvider(AdlFileSystem.java:257) at ....adl.AdlFileSystem.initialize(AdlFileSystem.java:164) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669) ``` ### How was this patch tested? Added UT to `LibSVMRelationSuite`. Closes #29984 from MaxGekk/ml-option-propagation. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-09 02:37:47 -07:00
zero323	3beab8d8a8	[SPARK-32793][FOLLOW-UP] Minor corrections for PySpark annotations and SparkR ### What changes were proposed in this pull request? - Annotated return types of `assert_true` and `raise_error` as discussed [here](https://github.com/apache/spark/pull/29947#pullrequestreview-504495801). - Add `assert_true` and `raise_error` to SparkR NAMESPACE. - Validating message vector size in SparkR as discussed [here](https://github.com/apache/spark/pull/29947#pullrequestreview-504539004). ### Why are the changes needed? As discussed in review for #29947. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Existing tests. - Validation of annotations using MyPy Closes #29978 from zero323/SPARK-32793-FOLLOW-UP. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-09 09:50:45 +09:00
ulysses	a9077299d7	[SPARK-32743][SQL] Add distinct info at UnresolvedFunction toString ### What changes were proposed in this pull request? Add distinct info at `UnresolvedFunction.toString`. ### Why are the changes needed? Make `UnresolvedFunction` info complete. ``` create table test (c1 int, c2 int); explain extended select sum(distinct c1) from test; -- before this pr == Parsed Logical Plan == 'Project [unresolvedalias('sum('c1), None)] +- 'UnresolvedRelation [test] -- after this pr == Parsed Logical Plan == 'Project [unresolvedalias('sum(distinct 'c1), None)] +- 'UnresolvedRelation [test] ``` ### Does this PR introduce _any_ user-facing change? Yes, get distinct info during sql parse. ### How was this patch tested? manual test. Closes #29586 from ulysses-you/SPARK-32743. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-09 09:25:22 +09:00
Max Gekk	c5f6af9f17	[SPARK-33094][SQL] Make ORC format propagate Hadoop config from DS options to underlying HDFS file system ### What changes were proposed in this pull request? Propagate ORC options to Hadoop configs in Hive `OrcFileFormat` and in the regular ORC datasource. ### Why are the changes needed? There is a bug that when running: ```scala spark.read.format("orc").options(conf).load(path) ``` The underlying file system will not receive the conf options. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Added UT to `OrcSourceSuite`. Closes #29976 from MaxGekk/orc-option-propagation. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-08 11:59:30 -07:00
Dongjoon Hyun	4987db8c88	[SPARK-33096][K8S] Use LinkedHashMap instead of Map for newlyCreatedExecutors ### What changes were proposed in this pull request? This PR aims to use `LinkedHashMap` instead of `Map` for `newlyCreatedExecutors`. ### Why are the changes needed? This makes log messages (INFO/DEBUG) more readable. This is helpful when `spark.kubernetes.allocation.batch.size` is large and especially when K8s dynamic allocation is used. BEFORE ``` 20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 8 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 2 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 5 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 4 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 7 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 10 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 9 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 3 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 6 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:24:21 INFO ExecutorPodsAllocator: Deleting 9 excess pod requests (5,10,6,9,2,7,3,8,4). ``` AFTER ``` 20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 2 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 3 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 4 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 5 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 6 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 7 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 8 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 9 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 10 was not found in the Kubernetes cluster since it was created 0 milliseconds ago. 20/10/08 10:25:17 INFO ExecutorPodsAllocator: Deleting 9 excess pod requests (2,3,4,5,6,7,8,9,10). ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CI or `build/sbt -Pkubernetes "kubernetes/test"` Closes #29979 from dongjoon-hyun/SPARK-K8S-LOG. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-08 11:50:53 -07:00
manubatham20	4a47b3e110	[DOC][MINOR] pySpark usage - removed repeated keyword causing confusion ### What changes were proposed in this pull request? While explaining pySpark usage, use of repeated synonymous words were causing confusion. Removed "instead of a JAR" word, to keep it more readable. ### Why are the changes needed? To keep the docs more readable and easy to understand. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No code changes, minor documentation change only. No tests added. Closes #29956 from manubatham20/patch-1. Authored-by: manubatham20 <manubatham2006@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-10-08 07:52:00 -05:00
HyukjinKwon	5effa8ea26	[SPARK-33091][SQL] Avoid using map instead of foreach to avoid potential side effect at callers of OrcUtils.readCatalystSchema ### What changes were proposed in this pull request? This is a kind of a followup of SPARK-32646. New JIRA was filed to control the fixed versions properly. When you use `map`, it might be lazily evaluated and not executed. To avoid this, we should better use `foreach`. See also SPARK-16694. Current codes look not causing any bug for now but it should be best to fix to avoid potential issues. ### Why are the changes needed? To avoid potential issues from `map` being lazy and not executed. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Ran related tests. CI in this PR should verify. Closes #29974 from HyukjinKwon/SPARK-32646. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-08 16:29:15 +09:00
Max Gekk	7d6e3fb998	[SPARK-33074][SQL] Classify dialect exceptions in JDBC v2 Table Catalog ### What changes were proposed in this pull request? 1. Add new method to the `JdbcDialect` class - `classifyException()`. It converts dialect specific exception to Spark's `AnalysisException` or its sub-classes. 2. Replace H2 exception `org.h2.jdbc.JdbcSQLException` in `JDBCTableCatalogSuite` by `AnalysisException`. 3. Add `H2Dialect` ### Why are the changes needed? Currently JDBC v2 Table Catalog implementation throws dialect specific exception and ignores exceptions defined in the `TableCatalog` interface. This PR adds new method for converting dialect specific exception, and assumes that follow up PRs will implement `classifyException()`. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? By running existing test suites `JDBCTableCatalogSuite` and `JDBCV2Suite`. Closes #29952 from MaxGekk/jdbcv2-classify-exception. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-08 05:28:33 +00:00
Terry Kim	1c781a4354	[SPARK-32282][SQL] Improve EnsureRquirement.reorderJoinKeys to handle more scenarios such as PartitioningCollection ### What changes were proposed in this pull request? This PR proposes to improve `EnsureRquirement.reorderJoinKeys` to handle the following scenarios: 1. If the keys cannot be reordered to match the left-side `HashPartitioning`, consider the right-side `HashPartitioning`. 2. Handle `PartitioningCollection`, which may contain `HashPartitioning` ### Why are the changes needed? 1. For the scenario 1), the current behavior matches either the left-side `HashPartitioning` or the right-side `HashPartitioning`. This means that if both sides are `HashPartitioning`, it will try to match only the left side. The following will not consider the right-side `HashPartitioning`: ``` val df1 = (0 until 10).map(i => (i % 5, i % 13)).toDF("i1", "j1") val df2 = (0 until 10).map(i => (i % 7, i % 11)).toDF("i2", "j2") df1.write.format("parquet").bucketBy(4, "i1", "j1").saveAsTable("t1")df2.write.format("parquet").bucketBy(4, "i2", "j2").saveAsTable("t2") val t1 = spark.table("t1") val t2 = spark.table("t2") val join = t1.join(t2, t1("i1") === t2("j2") && t1("i1") === t2("i2")) join.explain == Physical Plan == (5) SortMergeJoin [i1#26, i1#26], [j2#31, i2#30], Inner :- (2) Sort [i1#26 ASC NULLS FIRST, i1#26 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(i1#26, i1#26, 4), true, [id=#69] : +- (1) Project [i1#26, j1#27] : +- (1) Filter isnotnull(i1#26) : +- (1) ColumnarToRow : +- FileScan parquet default.t1[i1#26,j1#27] Batched: true, DataFilters: [isnotnull(i1#26)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(i1)], ReadSchema: struct<i1:int,j1:int>, SelectedBucketsCount: 4 out of 4 +- (4) Sort [j2#31 ASC NULLS FIRST, i2#30 ASC NULLS FIRST], false, 0. +- Exchange hashpartitioning(j2#31, i2#30, 4), true, [id=#79]. <===== This can be removed +- (3) Project [i2#30, j2#31] +- (3) Filter (((j2#31 = i2#30) AND isnotnull(j2#31)) AND isnotnull(i2#30)) +- (3) ColumnarToRow +- FileScan parquet default.t2[i2#30,j2#31] Batched: true, DataFilters: [(j2#31 = i2#30), isnotnull(j2#31), isnotnull(i2#30)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(j2), IsNotNull(i2)], ReadSchema: struct<i2:int,j2:int>, SelectedBucketsCount: 4 out of 4 ``` 2. For the scenario 2), the current behavior does not handle `PartitioningCollection`: ``` val df1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i1", "j1") val df2 = (0 until 100).map(i => (i % 7, i % 11)).toDF("i2", "j2") val df3 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i3", "j3") val join = df1.join(df2, df1("i1") === df2("i2") && df1("j1") === df2("j2")) // PartitioningCollection val join2 = join.join(df3, join("j1") === df3("j3") && join("i1") === df3("i3")) join2.explain == Physical Plan == (9) SortMergeJoin [j1#8, i1#7], [j3#30, i3#29], Inner :- (6) Sort [j1#8 ASC NULLS FIRST, i1#7 ASC NULLS FIRST], false, 0. <===== This can be removed : +- Exchange hashpartitioning(j1#8, i1#7, 5), true, [id=#58] <===== This can be removed : +- (5) SortMergeJoin [i1#7, j1#8], [i2#18, j2#19], Inner : :- (2) Sort [i1#7 ASC NULLS FIRST, j1#8 ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(i1#7, j1#8, 5), true, [id=#45] : : +- (1) Project [_1#2 AS i1#7, _2#3 AS j1#8] : : +- (1) LocalTableScan [_1#2, _2#3] : +- (4) Sort [i2#18 ASC NULLS FIRST, j2#19 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(i2#18, j2#19, 5), true, [id=#51] : +- (3) Project [_1#13 AS i2#18, _2#14 AS j2#19] : +- (3) LocalTableScan [_1#13, _2#14] +- (8) Sort [j3#30 ASC NULLS FIRST, i3#29 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(j3#30, i3#29, 5), true, [id=#64] +- (7) Project [_1#24 AS i3#29, _2#25 AS j3#30] +- (7) LocalTableScan [_1#24, _2#25] ``` ### Does this PR introduce _any_ user-facing change? Yes, now from the above examples, the shuffle/sort nodes pointed by `This can be removed` are now removed: 1. Senario 1): ``` == Physical Plan == (4) SortMergeJoin [i1#26, i1#26], [i2#30, j2#31], Inner :- (2) Sort [i1#26 ASC NULLS FIRST, i1#26 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(i1#26, i1#26, 4), true, [id=#67] : +- (1) Project [i1#26, j1#27] : +- (1) Filter isnotnull(i1#26) : +- (1) ColumnarToRow : +- FileScan parquet default.t1[i1#26,j1#27] Batched: true, DataFilters: [isnotnull(i1#26)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(i1)], ReadSchema: struct<i1:int,j1:int>, SelectedBucketsCount: 4 out of 4 +- (3) Sort [i2#30 ASC NULLS FIRST, j2#31 ASC NULLS FIRST], false, 0 +- (3) Project [i2#30, j2#31] +- (3) Filter (((j2#31 = i2#30) AND isnotnull(j2#31)) AND isnotnull(i2#30)) +- (3) ColumnarToRow +- FileScan parquet default.t2[i2#30,j2#31] Batched: true, DataFilters: [(j2#31 = i2#30), isnotnull(j2#31), isnotnull(i2#30)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(j2), IsNotNull(i2)], ReadSchema: struct<i2:int,j2:int>, SelectedBucketsCount: 4 out of 4 ``` 2. Scenario 2): ``` == Physical Plan == (8) SortMergeJoin [i1#7, j1#8], [i3#29, j3#30], Inner :- (5) SortMergeJoin [i1#7, j1#8], [i2#18, j2#19], Inner : :- (2) Sort [i1#7 ASC NULLS FIRST, j1#8 ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(i1#7, j1#8, 5), true, [id=#43] : : +- (1) Project [_1#2 AS i1#7, _2#3 AS j1#8] : : +- (1) LocalTableScan [_1#2, _2#3] : +- (4) Sort [i2#18 ASC NULLS FIRST, j2#19 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(i2#18, j2#19, 5), true, [id=#49] : +- (3) Project [_1#13 AS i2#18, _2#14 AS j2#19] : +- (3) LocalTableScan [_1#13, _2#14] +- (7) Sort [i3#29 ASC NULLS FIRST, j3#30 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(i3#29, j3#30, 5), true, [id=#58] +- (6) Project [_1#24 AS i3#29, _2#25 AS j3#30] +- *(6) LocalTableScan [_1#24, _2#25] ``` ### How was this patch tested? Added tests. Closes #29074 from imback82/reorder_keys. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-08 04:58:41 +00:00
Yuning Zhang	bbc887bf73	[SPARK-33089][SQL] make avro format propagate Hadoop config from DS options to underlying HDFS file system ### What changes were proposed in this pull request? In `AvroUtils`'s `inferSchema()`, propagate Hadoop config from DS options to underlying HDFS file system. ### Why are the changes needed? There is a bug that when running: ```scala spark.read.format("avro").options(conf).load(path) ``` The underlying file system will not receive the `conf` options. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? unit test added Closes #29971 from yuningzh-db/avro_options. Authored-by: Yuning Zhang <yuning.zhang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-08 12:18:06 +09:00
Karen Feng	39510b0e9b	[SPARK-32793][SQL] Add raise_error function, adds error message parameter to assert_true ## What changes were proposed in this pull request? Adds a SQL function `raise_error` which underlies the refactored `assert_true` function. `assert_true` now also (optionally) accepts a custom error message field. `raise_error` is exposed in SQL, Python, Scala, and R. `assert_true` was previously only exposed in SQL; it is now also exposed in Python, Scala, and R. ### Why are the changes needed? Improves usability of `assert_true` by clarifying error messaging, and adds the useful helper function `raise_error`. ### Does this PR introduce _any_ user-facing change? Yes: - Adds `raise_error` function to the SQL, Python, Scala, and R APIs. - Adds `assert_true` function to the SQL, Python and R APIs. ### How was this patch tested? Adds unit tests in SQL, Python, Scala, and R for `assert_true` and `raise_error`. Closes #29947 from karenfeng/spark-32793. Lead-authored-by: Karen Feng <karen.feng@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-08 12:05:39 +09:00
zero323	473b3ba6aa	[SPARK-32511][FOLLOW-UP][SQL][R][PYTHON] Add dropFields to SparkR and PySpark ### What changes were proposed in this pull request? This PR adds `dropFields` method to: - PySpark `Column` - SparkR `Column` ### Why are the changes needed? Feature parity. ### Does this PR introduce _any_ user-facing change? No, new API. ### How was this patch tested? - New unit tests. - Manual verification of examples / doctests. - Manual run of MyPy tests Closes #29967 from zero323/SPARK-32511-FOLLOW-UP-PYSPARK-SPARKR. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-08 10:37:42 +09:00
zero323	37e1b0c4a5	[SPARK-33086][PYTHON] Add static annotations for pyspark.resource ### What changes were proposed in this pull request? This PR replaces dynamically generated annotations for following modules: - `pyspark.resource.information` - `pyspark.resource.profile` - `pyspark.resource.requests` ### Why are the changes needed? These modules where not manually annotated in `pyspark-stubs`, but are part of the public API and we should provide more precise annotations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? MyPy tests: ``` mypy --no-incremental --config python/mypy.ini python/pyspark ``` Closes #29969 from zero323/SPARK-32714-FOLLOW-UP-RESOURCE. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-08 10:32:30 +09:00
Denis Pyshev	6daa2aeb01	[SPARK-21708][BUILD] Migrate build to sbt 1.x ### What changes were proposed in this pull request? Migrate sbt-launcher URL to download one for sbt 1.x. Update plugins versions where required by sbt update. Change sbt version to be used to latest released at the moment, 1.3.13 Adjust build settings according to plugins and sbt changes. ### Why are the changes needed? Migration to sbt 1.x: 1. enhances dev experience in development 2. updates build plugins to bring there new features/to fix bugs in them 3. enhances build performance on sbt side 4. eases movement to Scala 3 / dotty ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? All existing tests passed, both on Jenkins and via Github Actions, also manually for Scala 2.13 profile. Closes #29286 from gemelen/feature/sbt-1.x. Authored-by: Denis Pyshev <git@gemelen.net> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-07 15:28:00 -07:00
Max Gekk	23afc930ae	[SPARK-26499][SQL][FOLLOWUP] Print the loading provider exception starting from the INFO level ### What changes were proposed in this pull request? 1. Don't print the exception in the error message while loading a built-in provider. 2. Print the exception starting from the INFO level. Up to the INFO level, the output is: ``` 17:48:32.342 ERROR org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Failed to load built in provider. ``` and starting from the INFO level: ``` 17:48:32.342 ERROR org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Failed to load built in provider. 17:48:32.342 INFO org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Loading of the provider failed with the exception: java.util.ServiceConfigurationError: org.apache.spark.sql.jdbc.JdbcConnectionProvider: Provider org.apache.spark.sql.execution.datasources.jdbc.connection.IntentionallyFaultyConnectionProvider could not be instantiated at java.util.ServiceLoader.fail(ServiceLoader.java:232) at java.util.ServiceLoader.access$100(ServiceLoader.java:185) at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384) at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) at java.util.ServiceLoader$1.next(ServiceLoader.java:480) at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.loadProviders(ConnectionProvider.scala:41) ``` ### Why are the changes needed? To avoid "noise" in logs while running tests. Currently, logs are blown up: ``` org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Loading of the provider failed with the exception: java.util.ServiceConfigurationError: org.apache.spark.sql.jdbc.JdbcConnectionProvider: Provider org.apache.spark.sql.execution.datasources.jdbc.connection.IntentionallyFaultyConnectionProvider could not be instantiated at java.util.ServiceLoader.fail(ServiceLoader.java:232) at java.util.ServiceLoader.access$100(ServiceLoader.java:185) at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384) at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) at java.util.ServiceLoader$1.next(ServiceLoader.java:480) at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.loadProviders(ConnectionProvider.scala:41) ... at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.IllegalArgumentException: Intentional Exception at org.apache.spark.sql.execution.datasources.jdbc.connection.IntentionallyFaultyConnectionProvider.<init>(IntentionallyFaultyConnectionProvider.scala:26) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at java.lang.Class.newInstance(Class.java:442) at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running: ``` $ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalogSuite" ``` Closes #29968 from MaxGekk/gaborgsomogyi-SPARK-32001-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-07 13:50:15 -07:00
Dongjoon Hyun	a127387a53	[SPARK-33082][SQL] Remove hive-1.2 workaround code ### What changes were proposed in this pull request? This PR removes old Hive-1.2 profile related workaround code. ### Why are the changes needed? To simply the code. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CI. Closes #29961 from dongjoon-hyun/SPARK-HIVE12. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-07 12:27:23 -07:00
Stijn De Haes	3099fd9f9d	[SPARK-32067][K8S] Use unique ConfigMap name for executor pod template ### What changes were proposed in this pull request? The pod template configmap always had the same name. This PR makes it unique. ### Why are the changes needed? If you scheduled 2 spark jobs they will both use the same configmap name this will result in conflicts. This PR fixes that BEFORE ``` $ kubectl get cm --all-namespaces -w \| grep podspec podspec-configmap 1 65s ``` AFTER ``` $ kubectl get cm --all-namespaces -w \| grep podspec aaece65ef82e4a30b7b7800aad600d4f spark-test-app-aac9f37502b2ca55-driver-podspec-conf-map 1 0s ``` This can be seen when running the integration tests ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests and the integration tests test if this works Closes #29934 from stijndehaes/bugfix/SPARK-32067-unique-name-for-template-configmap. Authored-by: Stijn De Haes <stijndehaes@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-07 09:52:00 -07:00
Takeshi Yamamuro	94d648dff5	[SPARK-33036][SQL] Refactor RewriteCorrelatedScalarSubquery code to replace exprIds in a bottom-up manner ### What changes were proposed in this pull request? This PR intends to refactor code in `RewriteCorrelatedScalarSubquery` for replacing `ExprId`s in a bottom-up manner instead of doing in a top-down one. This PR comes from the talk with cloud-fan in https://github.com/apache/spark/pull/29585#discussion_r490371252. ### Why are the changes needed? To improve code. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #29913 from maropu/RefactorRewriteCorrelatedScalarSubquery. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-10-07 20:16:40 +09:00
zero323	72da6f86cf	[SPARK-33002][PYTHON] Remove non-API annotations ### What changes were proposed in this pull request? This PR: - removes annotations for modules which are not part of the public API. - removes `__init__.pyi` files, if no annotations, beyond exports, are present. ### Why are the changes needed? Primarily to reduce maintenance overhead and as requested in the comments to https://github.com/apache/spark/pull/29591 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests and additional MyPy checks: ``` mypy --no-incremental --config python/mypy.ini python/pyspark MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming ``` Closes #29879 from zero323/SPARK-33002. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-07 19:53:59 +09:00
itholic	4e1ded67f8	[SPARK-32189][DOCS][PYTHON][FOLLOW-UP] Fixed broken link and typo in PySpark docs ### What changes were proposed in this pull request? This PR is a follow-up of #29781 to fix broken link and typo. <img width="638" alt="Screen Shot 2020-10-07 at 3 56 28 PM" src="https://user-images.githubusercontent.com/44108233/95297583-aa0ccb00-08b5-11eb-85db-89022c76d7e1.png"> <img width="734" alt="Screen Shot 2020-10-07 at 3 55 36 PM" src="https://user-images.githubusercontent.com/44108233/95297508-8ba6cf80-08b5-11eb-9caa-0b52a2482ada.png"> ### Why are the changes needed? Current link is not working properly because of wrong path. ### Does this PR introduce _any_ user-facing change? Yes, the link is working properly now. ### How was this patch tested? Manually built the doc. Closes #29963 from itholic/SPARK-32189-FOLLOWUP. Authored-by: itholic <haejoon309@naver.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-07 16:39:25 +09:00
Terry Kim	7e99fcd64e	[SPARK-33004][SQL] Migrate DESCRIBE column to use UnresolvedTableOrView to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `DESCRIBE tbl colname` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). ### Why are the changes needed? The current behavior is not consistent between v1 and v2 commands when resolving a temp view. In v2, the `t` in the following example is resolved to a table: ```scala sql("CREATE TABLE testcat.ns.t (id bigint) USING foo") sql("CREATE TEMPORARY VIEW t AS SELECT 2 as i") sql("USE testcat.ns") sql("DESCRIBE t i") // 't' is resolved to testcat.ns.t Describing columns is not supported for v2 tables.; org.apache.spark.sql.AnalysisException: Describing columns is not supported for v2 tables.; ``` whereas in v1, the `t` is resolved to a temp view: ```scala sql("CREATE DATABASE test") sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv") sql("CREATE TEMPORARY VIEW t AS SELECT 2 as i") sql("USE spark_catalog.test") sql("DESCRIBE t i").show // 't' is resolved to a temp view +---------+----------+ \|info_name\|info_value\| +---------+----------+ \| col_name\| i\| \|data_type\| int\| \| comment\| NULL\| +---------+----------+ ``` ### Does this PR introduce _any_ user-facing change? After this PR, `DESCRIBE t i` is resolved to a temp view `t` instead of `testcat.ns.t`. ### How was this patch tested? Added a new test Closes #29880 from imback82/describe_column_consistent. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-07 06:33:20 +00:00
Max Gekk	aea78d2c8c	[SPARK-33034][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (Oracle dialect) ### What changes were proposed in this pull request? 1. Override the default SQL strings in the Oracle Dialect for: - ALTER TABLE ADD COLUMN - ALTER TABLE UPDATE COLUMN TYPE - ALTER TABLE UPDATE COLUMN NULLABILITY 2. Add new docker integration test suite `jdbc/v2/OracleIntegrationSuite.scala` ### Why are the changes needed? In SPARK-24907, we implemented JDBC v2 Table Catalog but it doesn't support some `ALTER TABLE` at the moment. This PR supports Oracle specific `ALTER TABLE`. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? By running new integration test suite: ``` $ ./build/sbt -Pdocker-integration-tests "test-only *.OracleIntegrationSuite" ``` Closes #29912 from MaxGekk/jdbcv2-oracle-alter-table. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-07 04:48:57 +00:00
HyukjinKwon	5ce321dc80	[SPARK-33017][PYTHON][DOCS][FOLLOW-UP] Add getCheckpointDir into API documentation ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/29918. We should add it into the documentation as well. ### Why are the changes needed? To show users new APIs. ### Does this PR introduce _any_ user-facing change? Yes, `SparkContext.getCheckpointDir` will be documented. ### How was this patch tested? Manually built the PySpark documentation: ```bash cd python/docs make clean html cd build/html open index.html ``` Closes #29960 from HyukjinKwon/SPARK-33017. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-07 13:00:59 +09:00
Max Gekk	584f90c82e	[SPARK-33067][SQL][TESTS][FOLLOWUP] Check error messages in JDBCTableCatalogSuite ### What changes were proposed in this pull request? Get error message from the expected exception, and check that they are reasonable. ### Why are the changes needed? To improve tests by expecting particular error messages. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `JDBCTableCatalogSuite`. Closes #29957 from MaxGekk/jdbcv2-negative-tests-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-07 09:29:30 +09:00
Liang-Chi Hsieh	57ed5a829b	[SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain ### What changes were proposed in this pull request? This proposes to simplify named_struct + get struct field + from_json expression chain from `struct(from_json.col1, from_json.col2, from_json.col3...)` to `struct(from_json)`. ### Why are the changes needed? Simplify complex expression tree that could be produced by query optimization or user. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #29942 from viirya/SPARK-33007. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-06 16:59:23 -07:00
yi.wu	0b326d5327	[SPARK-32857][CORE] Fix flaky o.a.s.s.BarrierTaskContextSuite.throw exception if the number of barrier() calls are not the same on every task ### What changes were proposed in this pull request? Fix the flaky test. ### Why are the changes needed? The test is flaky: `Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown`. Check the full error stack [here](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128548/testReport/org.apache.spark.scheduler/BarrierTaskContextSuite/throw_exception_if_the_number_of_barrier___calls_are_not_the_same_on_every_task/). By analyzing the log below, I found that task 0 hadn't reached the second `context.barrier()` when another three tasks already raised the sync timeout exceptions by the first `context.barrier()`. The timeout exceptions were caught by the `try...catch...`. Then, each task started another round barrier sync from the second `context.barrier()` and completed the sync successfully. ```scala 20/09/10 20:54:48.821 dispatcher-event-loop-10 INFO BarrierCoordinator: Current barrier epoch for Stage 0 (Attempt 0) is 0. 20/09/10 20:54:48.822 dispatcher-event-loop-10 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 2, current progress: 1/4. 20/09/10 20:54:48.826 dispatcher-BlockManagerMaster INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:38420 (size: 2.2 KiB, free: 546.3 MiB) 20/09/10 20:54:48.908 dispatcher-event-loop-12 INFO BarrierCoordinator: Current barrier epoch for Stage 0 (Attempt 0) is 0. 20/09/10 20:54:48.909 dispatcher-event-loop-12 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 1, current progress: 2/4. 20/09/10 20:54:48.959 dispatcher-event-loop-11 INFO BarrierCoordinator: Current barrier epoch for Stage 0 (Attempt 0) is 0. 20/09/10 20:54:48.960 dispatcher-event-loop-11 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 3, current progress: 3/4. 20/09/10 20:54:49.616 dispatcher-CoarseGrainedScheduler INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 0 because the barrier taskSet requires 4 slots, while the total number of available slots is 0. 20/09/10 20:54:49.899 dispatcher-event-loop-15 INFO BarrierCoordinator: Current barrier epoch for Stage 0 (Attempt 0) is 0. 20/09/10 20:54:49.900 dispatcher-event-loop-15 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 1, current progress: 1/4. 20/09/10 20:54:49.965 dispatcher-event-loop-13 INFO BarrierCoordinator: Current barrier epoch for Stage 0 (Attempt 0) is 0. 20/09/10 20:54:49.966 dispatcher-event-loop-13 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 3, current progress: 2/4. 20/09/10 20:54:50.112 dispatcher-event-loop-16 INFO BarrierCoordinator: Current barrier epoch for Stage 0 (Attempt 0) is 0. 20/09/10 20:54:50.113 dispatcher-event-loop-16 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 0, current progress: 3/4. 20/09/10 20:54:50.609 dispatcher-CoarseGrainedScheduler INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 0 because the barrier taskSet requires 4 slots, while the total number of available slots is 0. 20/09/10 20:54:50.826 dispatcher-event-loop-17 INFO BarrierCoordinator: Current barrier epoch for Stage 0 (Attempt 0) is 0. 20/09/10 20:54:50.827 dispatcher-event-loop-17 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 2, current progress: 4/4. 20/09/10 20:54:50.827 dispatcher-event-loop-17 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 0 (Attempt 0) received all updates from tasks, finished successfully. ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated the test and tested a hundred times without failure(Previously, there could be several failures). Closes #29732 from Ngone51/fix-flaky-throw-exception. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-06 14:18:37 -07:00
Kousuke Saruta	3b2a38d735	[SPARK-32511][SQL][FOLLOWUP] Fix the broken build for Scala 2.13 with Maven ### What changes were proposed in this pull request? This PR fixes the broken build for Scala 2.13 with Maven. https://github.com/apache/spark/pull/29913/checks?check_run_id=1187826966 #29795 was merged though it doesn't successfully finish the build for Scala 2.13 ### Why are the changes needed? To fix the build. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? `build/mvn -Pscala-2.13 -Phive -Phive-thriftserver -DskipTests package` Closes #29954 from sarutak/hotfix-seq. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-06 09:40:16 -07:00
Kent Yao	17d309dfac	[SPARK-32963][SQL] empty string should be consistent for schema name in SparkGetSchemasOperation ### What changes were proposed in this pull request? This PR makes the empty string for schema name pattern match the global temp view as same as it works for other databases. This PR also add new tests to covering different kinds of wildcards to verify the SparkGetSchemasOperation ### Why are the changes needed? When the schema name is empty string, it is considered as "." and can match all databases in the catalog. But when it can not match the global temp view as it is not converted to "." ### Does this PR introduce _any_ user-facing change? yes , JDBC operation like `statement.getConnection.getMetaData..getSchemas(null, "")` now also provides the global temp view in the result set. ### How was this patch tested? new tests Closes #29834 from yaooqinn/SPARK-32963. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-06 16:01:10 +00:00
Wenchen Fan	ec6fccb922	[SPARK-32243][SQL][FOLLOWUP] Fix compilation in HiveSessionCatalog Fix a mistake when merging https://github.com/apache/spark/pull/29054 Closes #29955 from cloud-fan/hot-fix. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-06 14:33:34 +00:00
Michael Munday	b5e4b8c73e	[SPARK-27428][CORE][TEST] Increase receive buffer size used in StatsdSinkSuite ### What changes were proposed in this pull request? Increase size of socket receive buffer in these tests. ### Why are the changes needed? The socket receive buffer size set in this test was too small for the StatsdSinkSuite tests to run reliably on some systems. For a test in this suite to run reliably the buffer needs to be large enough to hold all the data in the packets being sent in a test along with any additional kernel or protocol overhead. The amount of kernel overhead per packet can vary from system to system but is typically far higher than the protocol overhead. If the receive buffer is too small and fills up then packets are silently dropped. This leads to the test failing with a timeout. If the socket defaults to a larger receive buffer (normally true) then we should keep that size. As well as increasing the minimum buffer size I've also decoupled the datagram packet buffer size from the receive buffer size. The receive buffer should in general be far larger to account for the fact that multiple packets might be buffered, as well as the aforementioned overhead. Any truncated data in individual packets will be picked up by the tests. ### Does this PR introduce _any_ user-facing change? No, this only affects the tests. ### How was this patch tested? Existing tests on IBM Z and x86. Closes #29819 from mundaym/fix-statsd. Authored-by: Michael Munday <mike.munday@ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-10-06 08:31:06 -05:00
Bryan Cutler	0812d6c17c	[SPARK-33073][PYTHON] Improve error handling on Pandas to Arrow conversion failures ### What changes were proposed in this pull request? This improves error handling when a failure in conversion from Pandas to Arrow occurs. And fixes tests to be compatible with upcoming Arrow 2.0.0 release. ### Why are the changes needed? Current tests will fail with Arrow 2.0.0 because of a change in error message when the schema is invalid. For these cases, the current error message also includes information on disabling safe conversion config, which is mainly meant for floating point truncation and overflow. The tests have been updated to use a message that is show for past Arrow versions, and upcoming. If the user enters an invalid schema, the error produced by pyarrow is not consistent and either `TypeError` or `ArrowInvalid`, with the latter being caught, and raised as a `RuntimeError` with the extra info. The error handling is improved by: - narrowing the exception type to `TypeError`s, which `ArrowInvalid` is a subclass and what is raised on safe conversion failures. - The exception is only raised with additional information on disabling "spark.sql.execution.pandas.convertToArrowArraySafely" if it is enabled in the first place. - The original exception is chained to better show it to the user. ### Does this PR introduce _any_ user-facing change? Yes, the error re-raised changes from a RuntimeError to a ValueError, which better categorizes this type of error and in-line with the original Arrow error. ### How was this patch tested? Existing tests, using pyarrow 1.0.1 and 2.0.0-snapshot Closes #29951 from BryanCutler/arrow-better-handle-pandas-errors-SPARK-33073. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-06 18:11:24 +09:00
angerszhu	ddc7012b3d	[SPARK-32243][SQL] HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid arguments number error ### What changes were proposed in this pull request? When we create a UDAF function use class extended `UserDefinedAggregeteFunction`, when we call the function, in support hive mode, in HiveSessionCatalog, it will call super.makeFunctionExpression, but it will catch error such as the function need 2 parameter and we only give 1, throw exception only show ``` No handler for UDF/UDAF/UDTF xxxxxxxx ``` This is confused for develop , we should show error thrown by super method too, For this pr's UT : Before change, throw Exception like ``` No handler for UDF/UDAF/UDTF 'org.apache.spark.sql.hive.execution.LongProductSum'; line 1 pos 7 ``` After this pr, throw exception ``` Spark UDAF Error: Invalid number of arguments for function longProductSum. Expected: 2; Found: 1; Hive UDF/UDAF/UDTF Error: No handler for UDF/UDAF/UDTF 'org.apache.spark.sql.hive.execution.LongProductSum'; line 1 pos 7 ``` ### Why are the changes needed? Show more detail error message when define UDAF ### Does this PR introduce _any_ user-facing change? People will see more detail error message when use spark sql's UDAF in hive support Mode ### How was this patch tested? Added UT Closes #29054 from AngersZhuuuu/SPARK-32243. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-06 09:09:19 +00:00
fqaiser94@gmail.com	2793347972	[SPARK-32511][SQL] Add dropFields method to Column class ### What changes were proposed in this pull request? 1. Refactored `WithFields` Expression to make it more extensible (now `UpdateFields`). 2. Added a new `dropFields` method to the `Column` class. This method should allow users to drop a `StructField` in a `StructType` column (with similar semantics to the `drop` method on `Dataset`). ### Why are the changes needed? Often Spark users have to work with deeply nested data e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column. For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing): ``` import org.apache.spark.sql._ import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ val data = spark.createDataFrame(sc.parallelize( Seq(Row(Row(Row(1, 2, 3), Row(Row(4, null, 6), Row(7, 8, 9), Row(10, 11, 12)), Row(13, 14, 15))))), StructType(Seq( StructField("a", StructType(Seq( StructField("a", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("b", StructType(Seq( StructField("a", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("b", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("c", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))) ))), StructField("c", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))) )))))).cache data.show(false) +---------------------------------+ \|a \| +---------------------------------+ \|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]\| +---------------------------------+ ``` Currently, to drop the missing value users would have to do something like this: ``` val result = data.withColumn("a", struct( $"a.a", struct( struct( $"a.b.a.a", $"a.b.a.c" ).as("a"), $"a.b.b", $"a.b.c" ).as("b"), $"a.c" )) result.show(false) +---------------------------------------------------------------+ \|a \| +---------------------------------------------------------------+ \|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]\| +---------------------------------------------------------------+ ``` As you can see above, with the existing methods users must call the `struct` function and list all fields, including fields they don't want to change. This is not ideal as: >this leads to complex, fragile code that cannot survive schema evolution. [SPARK-16483](https://issues.apache.org/jira/browse/SPARK-16483) In contrast, with the method added in this PR, a user could simply do something like this to get the same result: ``` val result = data.withColumn("a", 'a.dropFields("b.a.b")) result.show(false) +---------------------------------------------------------------+ \|a \| +---------------------------------------------------------------+ \|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]\| +---------------------------------------------------------------+ ``` This is the second of maybe 3 methods that could be added to the `Column` class to make it easier to manipulate nested data. Other methods under discussion in [SPARK-22231](https://issues.apache.org/jira/browse/SPARK-22231) include `withFieldRenamed`. However, this should be added in a separate PR. ### Does this PR introduce _any_ user-facing change? The documentation for `Column.withField` method has changed to include an additional note about how to write optimized queries when adding multiple nested Column directly. ### How was this patch tested? New unit tests were added. Jenkins must pass them. ### Related JIRAs: More discussion on this topic can be found here: - https://issues.apache.org/jira/browse/SPARK-22231 - https://issues.apache.org/jira/browse/SPARK-16483 Closes #29795 from fqaiser94/SPARK-32511-dropFields-second-try. Authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-06 08:53:30 +00:00
Takeshi Yamamuro	4adc2822a3	[SPARK-33035][SQL] Updates the obsoleted entries of attribute mapping in QueryPlan#transformUpWithNewOutput ### What changes were proposed in this pull request? This PR intends to fix corner-case bugs in the `QueryPlan#transformUpWithNewOutput` that is used to propagate updated `ExprId`s in a bottom-up way. Let's say we have a rule to simply assign new `ExprId`s in a projection list like this; ``` case class TestRule extends Rule[LogicalPlan] { override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUpWithNewOutput { case p Project(projList, _) => val newPlan = p.copy(projectList = projList.map { _.transform { // Assigns a new `ExprId` for references case a: AttributeReference => Alias(a, a.name)() }}.asInstanceOf[Seq[NamedExpression]]) val attrMapping = p.output.zip(newPlan.output) newPlan -> attrMapping } } ``` Then, this rule is applied into a plan below; ``` (3) Project [a#5, b#6] +- (2) Project [a#5, b#6] +- (1) Project [a#5, b#6] +- LocalRelation <empty>, [a#5, b#6] ``` In the first transformation, the rule assigns new `ExprId`s in `(1) Project` (e.g., a#5 AS a#7, b#6 AS b#8). In the second transformation, the rule corrects the input references of `(2) Project` first by using attribute mapping given from `(1) Project` (a#5->a#7 and b#6->b#8) and then assigns new `ExprId`s (e.g., a#7 AS a#9, b#8 AS b#10). But, in the third transformation, the rule fails because it tries to correct the references of `(3) Project` by using incorrect attribute mapping (a#7->a#9 and b#8->b#10) even though the correct one is a#5->a#9 and b#6->b#10. To fix this issue, this PR modified the code to update the attribute mapping entries that are obsoleted by generated entries in a given rule. ### Why are the changes needed? bugfix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests in `QueryPlanSuite`. Closes #29911 from maropu/QueryPlanBug. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-06 08:32:55 +00:00
Max Gekk	9870cf9c08	[SPARK-33067][SQL][TESTS] Add negative checks to JDBC v2 Table Catalog tests ### What changes were proposed in this pull request? Add checks for the cases when JDBC v2 Table Catalog commands fail. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `JDBCTableCatalogSuite`. Closes #29945 from MaxGekk/jdbcv2-negative-tests. Lead-authored-by: Max Gekk <max.gekk@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-06 13:01:57 +09:00
HyukjinKwon	a0aa8f33a9	[SPARK-33069][INFRA] Skip test result report if no JUnit XML files are found ### What changes were proposed in this pull request? This PR proposes to skip test reporting ("Report test results") if there are no JUnit XML files are found. Currently, we're running and skipping the tests dynamically. For example, - if there are only changes in SparkR at the underlying commit, it only runs the SparkR tests, and skip the other tests and generate JUnit XML files for SparkR test cases. - if there are only changes in `docs` at the underlying commit, the build skips all tests except linters and do not generate any JUnit XML files. When test reporting ("Report test results") job is triggered after the main build ("Build and test ") is finished, and there are no JUnit XML files found, it reports the case as a failure. See https://github.com/apache/spark/runs/1196184007 as an example. This PR works around it by simply skipping the testing report when there are no JUnit XML files are found. Please see https://github.com/apache/spark/pull/29906#issuecomment-702525542 for more details. ### Why are the changes needed? To avoid false alarm for test results. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested in my fork. Positive case: https://github.com/HyukjinKwon/spark/runs/1208624679?check_suite_focus=true https://github.com/HyukjinKwon/spark/actions/runs/288996327 Negative case: https://github.com/HyukjinKwon/spark/runs/1208229838?check_suite_focus=true https://github.com/HyukjinKwon/spark/actions/runs/289000058 Closes #29946 from HyukjinKwon/test-junit-files. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-06 09:09:58 +09:00
Dongjoon Hyun	008a2ad1f8	[SPARK-20202][BUILD][SQL] Remove references to org.spark-project.hive (Hive 1.2.1) ### What changes were proposed in this pull request? As of today, - SPARK-30034 Apache Spark 3.0.0 switched its default Hive execution engine from Hive 1.2 to Hive 2.3. This removes the direct dependency to the forked Hive 1.2.1 in maven repository. - SPARK-32981 Apache Spark 3.1.0(`master` branch) removed Hive 1.2 related artifacts from Apache Spark binary distributions. This PR(SPARK-20202) aims to remove the following usage of unofficial Apache Hive fork completely from Apache Spark master for Apache Spark 3.1.0. ``` <hive.group>org.spark-project.hive</hive.group> <hive.version>1.2.1.spark2</hive.version> ``` For the forked Hive 1.2.1.spark2 users, Apache Spark 2.4(LTS) and 3.0 (~ 2021.12) will provide it. ### Why are the changes needed? - First, Apache Spark community should not use the unofficial forked release of another Apache project. - Second, Apache Hive 1.2.1 was released at 2015-06-26 and the forked Hive `1.2.1.spark2` exposed many unfixable bugs in Apache because the forked `1.2.1.spark2` is not maintained at all. Apache Hive 2.3.0 was released at 2017-07-19 and it has been used with less number of bugs compared with `1.2.1.spark2`. Many bugs still exist in `hive-1.2` profile and new Apache Spark unit tests are added with `HiveUtils.isHive23` condition so far. ### Does this PR introduce _any_ user-facing change? No. This is a dev-only change. PRBuilder will not accept `[test-hive1.2]` on master and `branch-3.1`. ### How was this patch tested? 1. SBT/Hadoop 3.2/Hive 2.3 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129366) 2. SBT/Hadoop 2.7/Hive 2.3 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129382) 3. SBT/Hadoop 3.2/Hive 1.2 (This has not been supported already due to Hive 1.2 doesn't work with Hadoop 3.2.) 4. SBT/Hadoop 2.7/Hive 1.2 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129383, This is rejected) Closes #29936 from dongjoon-hyun/SPARK-REMOVE-HIVE1. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-05 15:29:56 -07:00
allisonwang-db	14aeab3b27	[SPARK-33038][SQL] Combine AQE initial and current plan string when two plans are the same ### What changes were proposed in this pull request? This PR combines the current plan and the initial plan in the AQE query plan string when the two plans are the same. It also removes the `== Current Plan ==` and `== Initial Plan ==` headers: Before ```scala AdaptiveSparkPlan isFinalPlan=false +- == Current Plan == SortMergeJoin [key#13], [a#23], Inner :- Sort [key#13 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(key#13, 5), true, [id=#94] ... +- == Initial Plan == SortMergeJoin [key#13], [a#23], Inner :- Sort [key#13 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(key#13, 5), true, [id=#94] ... ``` After ```scala AdaptiveSparkPlan isFinalPlan=false +- SortMergeJoin [key#13], [a#23], Inner :- Sort [key#13 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(key#13, 5), true, [id=#94] ... ``` For SQL `EXPLAIN` output: Before ```scala AdaptiveSparkPlan (8) +- == Current Plan == Sort (7) +- Exchange (6) ... +- == Initial Plan == Sort (7) +- Exchange (6) ... ``` After ```scala AdaptiveSparkPlan (8) +- Sort (7) +- Exchange (6) ... ``` ### Why are the changes needed? To simplify the AQE plan string by removing the redundant plan information. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Modified the existing unit test. Closes #29915 from allisonwang-db/aqe-explain. Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-10-05 09:30:27 -07:00
gschiavon	a09747bf32	[SPARK-33063][K8S] Improve error message for insufficient K8s volume confs ### What changes were proposed in this pull request? Provide error handling when creating kubernetes volumes. Right now they keys are expected to be there and if not it fails with a `key not found` error, but not knowing why do you need that `key`. Also I renamed some tests that didn't indicate the kind of kubernetes volume ### Why are the changes needed? Easier for the users to understand why `spark-submit` command is failing if not providing they right kubernetes volumes properties. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It was tested with the current tests plus added one more. [Jira ticket](https://issues.apache.org/jira/browse/SPARK-33063) Closes #29941 from Gschiavon/SPARK-33063-provide-error-handling-k8s-volumes. Authored-by: gschiavon <germanschiavon@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-05 09:02:06 -07:00
Yuming Wang	023eb482b2	[SPARK-32914][SQL] Avoid constructing dataType multiple times ### What changes were proposed in this pull request? Some expression's data type not a static value. It needs to be constructed a new object when calling `dataType` function. E.g.: `CaseWhen`. We should avoid constructing dataType multiple times because it may be used many times. E.g.: [`HyperLogLogPlusPlus.update`](`10edeafc69/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.scala (L122)`). ### Why are the changes needed? Improve query performance. for example: ```scala spark.range(100000000L).selectExpr("approx_count_distinct(case when id % 400 > 20 then id else 0 end)").show ``` Profiling result: ``` -- Execution profile --- Total samples : 18365 Frame buffer usage : 2.6688% --- 58443254327 ns (31.82%), 5844 samples [ 0] GenericTaskQueueSet<OverflowTaskQueue<StarTask, (MemoryType)1, 131072u>, (MemoryType)1>::steal_best_of_2(unsigned int, int, StarTask&) [ 1] StealTask::do_it(GCTaskManager, unsigned int) [ 2] GCTaskThread::run() [ 3] java_start(Thread) [ 4] start_thread --- 6140668667 ns (3.34%), 614 samples [ 0] GenericTaskQueueSet<OverflowTaskQueue<StarTask, (MemoryType)1, 131072u>, (MemoryType)1>::peek() [ 1] ParallelTaskTerminator::offer_termination(TerminatorTerminator) [ 2] StealTask::do_it(GCTaskManager, unsigned int) [ 3] GCTaskThread::run() [ 4] java_start(Thread) [ 5] start_thread --- 5679994036 ns (3.09%), 568 samples [ 0] scala.collection.generic.Growable.$plus$plus$eq [ 1] scala.collection.generic.Growable.$plus$plus$eq$ [ 2] scala.collection.mutable.ListBuffer.$plus$plus$eq [ 3] scala.collection.mutable.ListBuffer.$plus$plus$eq [ 4] scala.collection.generic.GenericTraversableTemplate.$anonfun$flatten$1 [ 5] scala.collection.generic.GenericTraversableTemplate$$Lambda$107.411506101.apply [ 6] scala.collection.immutable.List.foreach [ 7] scala.collection.generic.GenericTraversableTemplate.flatten [ 8] scala.collection.generic.GenericTraversableTemplate.flatten$ [ 9] scala.collection.AbstractTraversable.flatten [10] org.apache.spark.internal.config.ConfigEntry.readString [11] org.apache.spark.internal.config.ConfigEntryWithDefault.readFrom [12] org.apache.spark.sql.internal.SQLConf.getConf [13] org.apache.spark.sql.internal.SQLConf.caseSensitiveAnalysis [14] org.apache.spark.sql.types.DataType.sameType [15] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1 [16] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1$adapted [17] org.apache.spark.sql.catalyst.analysis.TypeCoercion$$$Lambda$1527.1975399904.apply [18] scala.collection.IndexedSeqOptimized.prefixLengthImpl [19] scala.collection.IndexedSeqOptimized.forall [20] scala.collection.IndexedSeqOptimized.forall$ [21] scala.collection.mutable.ArrayBuffer.forall [22] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.haveSameType [23] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck [24] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$ [25] org.apache.spark.sql.catalyst.expressions.CaseWhen.dataTypeCheck [26] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType [27] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$ [28] org.apache.spark.sql.catalyst.expressions.CaseWhen.dataType [29] org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus.update [30] org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2 [31] org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2$adapted [32] org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$Lambda$1534.1383512673.apply [33] org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateProcessRow$7 [34] org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateProcessRow$7$adapted [35] org.apache.spark.sql.execution.aggregate.AggregationIterator$$Lambda$1555.725788712.apply ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test and benchmark test: Benchmark code \| Before this PR(Milliseconds) \| After this PR(Milliseconds) --- \| --- \| --- spark.range(100000000L).selectExpr("approx_count_distinct(case when id % 400 > 20 then id else 0 end)").collect() \| 56462 \| 3794 Closes #29790 from wangyum/SPARK-32914. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-05 22:00:42 +09:00
Yuning Zhang	0fb2574d4e	[SPARK-33042][SQL][TEST] Add a test case to ensure changes to spark.sql.optimizer.maxIterations take effect at runtime ### What changes were proposed in this pull request? Add a test case to ensure changes to `spark.sql.optimizer.maxIterations` take effect at runtime. ### Why are the changes needed? Currently, there is only one related test case: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala#L156 However, this test case only checks the value of the conf can be changed at runtime. It does not check the updated value is actually used by the Optimizer. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? unit test Closes #29919 from yuningzh-db/add_optimizer_test. Authored-by: Yuning Zhang <yuning.zhang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-05 20:25:57 +09:00
zero323	24f890e8e8	[SPARK-33040][FOLLOW-UP][R] Reorder argument choices and add examples ### What changes were proposed in this pull request? - Reorder choices of `dtype` to match Scala defaults. - Add example to ml_functions. ### Why are the changes needed? As requested: - https://github.com/apache/spark/pull/29917#pullrequestreview-501715344 - https://github.com/apache/spark/pull/29917#pullrequestreview-501716521 ### Does this PR introduce _any_ user-facing change? No (changes to newly added component). ### How was this patch tested? Existing tests. Closes #29944 from zero323/SPARK-33040-FOLLOW-UP. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-05 16:31:17 +09:00
zero323	e83d03ca48	[SPARK-33040][R][ML] Add SparkR wrapper for vector_to_array ### What changes were proposed in this pull request? Add SparkR wrapper for `o.a.s.ml.functions.vector_to_array` ### Why are the changes needed? - Currently ML vectors, including predictions, are almost inaccessible to R users. That's is a serious loss of functionality. - Feature parity. ### Does this PR introduce _any_ user-facing change? Yes, new R function is added. ### How was this patch tested? - New unit tests. - Manual verification. Closes #29917 from zero323/SPARK-33040. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-05 13:18:12 +09:00
reidy-p	4ab9aa0305	[SPARK-33017][PYTHON] Add getCheckpointDir method to PySpark Context ### What changes were proposed in this pull request? Adding a method to get the checkpoint directory from the PySpark context to match the Scala API ### Why are the changes needed? To make the Scala and Python APIs consistent and remove the need to use the JavaObject ### Does this PR introduce _any_ user-facing change? Yes, there is a new method which makes it easier to get the checkpoint directory directly rather than using the JavaObject #### Previous behaviour: ```python >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/') >>> sc._jsc.sc().getCheckpointDir().get() 'file:/tmp/spark/checkpoint/63f7b67c-e5dc-4d11-a70c-33554a71717a' ``` This method returns a confusing Scala error if it has not been set ```python >>> sc._jsc.sc().getCheckpointDir().get() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/paul/Desktop/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/home/paul/Desktop/spark/python/pyspark/sql/utils.py", line 111, in deco return f(a, *kw) File "/home/paul/Desktop/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o25.get. : java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) ``` #### New method: ```python >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/') >>> spark.sparkContext.getCheckpointDir() 'file:/tmp/spark/checkpoint/b38aca2e-8ace-44fc-a4c4-f4e36c2da2a7' ``` ``getCheckpointDir()`` returns ``None`` if it has not been set ```python >>> print(spark.sparkContext.getCheckpointDir()) None ``` ### How was this patch tested? Added to existing unit tests. But I'm not sure how to add a test for the case where ``getCheckpointDir()`` should return ``None`` since the existing checkpoint tests set the checkpoint directory in the ``setUp`` method before any tests are run as far as I can tell. Closes #29918 from reidy-p/SPARK-33017. Authored-by: reidy-p <paul_reidy@outlook.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-05 11:48:28 +09:00

... 2 3 4 5 6 ...

28383 commits