Commit graph

30551 commits

Yuming Wang 4a17e7a5ae [SPARK-35906][SQL] Remove order by if the maximum number of rows less than or equal to 1
### What changes were proposed in this pull request?

This PR removes the order by if the maximum number of rows is less than or equal to 1. For example:
```scala
spark.sql("select count(*) from range(1, 10, 2, 2) order by 1 limit 10").explain("cost")
```
Before this PR:
```
== Optimized Logical Plan ==
Sort [count(1)#2L ASC NULLS FIRST], true, Statistics(sizeInBytes=16.0 B)
+- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1)
   +- Project, Statistics(sizeInBytes=20.0 B)
      +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5)
```

After this PR:
```
== Optimized Logical Plan ==
Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project, Statistics(sizeInBytes=20.0 B)
   +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5)
```
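
A minimal sketch of the idea as a standalone Catalyst rule (the rule name and exact placement are hypothetical; the actual change is folded into Spark's existing optimizer rules):
```scala
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Drop a Sort whose child is statically known to produce at most one row,
// since ordering a single row (or an empty result) is a no-op.
object RemoveSortOnSingleRow extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    case Sort(_, _, child) if child.maxRows.exists(_ <= 1) => child
  }
}
```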

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33100 from wangyum/SPARK-35906.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-29 11:04:54 -07:00
Takuya UESHIN 2702fb9af0 [SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark
### What changes were proposed in this pull request?

Cleaning up the type hints in pandas-on-Spark.

- Use a single file `_typing.py` for type variables or aliases
- Rename `IndexOpsLike` to `SeriesOrIndex`.
- Rename `T_Frame` and `T_IndexOps` to `FrameLike` and `IndexOpsLike` respectively
- Introduce `DataFrameOrSeries` for `Union[DataFrame, Series]`

### Why are the changes needed?

This is a cleanup as part of the ongoing mypy check work.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33117 from ueshin/issues/SPARK-35859/cleanup.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-29 10:52:24 -07:00
Dongjoon Hyun 7e7028282c [SPARK-35928][BUILD] Upgrade ASM to 9.1
### What changes were proposed in this pull request?

This PR aims to upgrade ASM to 9.1

### Why are the changes needed?

The latest `xbean-asm9-shaded` is built with ASM 9.1.

- https://mvnrepository.com/artifact/org.apache.xbean/xbean-asm9-shaded/4.20
- 5e0e3c0c64/pom.xml (L67)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33130 from dongjoon-hyun/SPARK-35928.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-29 10:27:51 -07:00
ulysses-you def738365e [SPARK-35923][SQL] Coalesce empty partition with mixed CoalescedPartitionSpec and PartialReducerPartitionSpec
### What changes were proposed in this pull request?

Skip empty partitions in `ShufflePartitionsUtil.coalescePartitionsWithSkew`.

### Why are the changes needed?

Since [SPARK-35447](https://issues.apache.org/jira/browse/SPARK-35447), we apply `OptimizeSkewedJoin` before `CoalesceShufflePartitions`. However, the order of these two rules changes the result.

Say we have skewed partitions: [0, 128MB, 0, 128MB, 0]:

* coalesce partitions first then optimize skewed partitions:
  [64MB, 64MB, 64MB, 64MB]
* optimize skewed partition first then coalesce partitions:
  [0, 64MB, 64MB, 0, 64MB, 64MB, 0]

So in `ShufflePartitionsUtil.coalescePartitionsWithSkew` we can coalesce with mixed `CoalescedPartitionSpec` and `PartialReducerPartitionSpec` when the `CoalescedPartitionSpec` is empty.

### Does this PR introduce _any_ user-facing change?

No, this is not released yet.

### How was this patch tested?

Add test.

Closes #33123 from ulysses-you/SPARK-35923.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-29 14:58:51 +00:00
Gengliang Wang 78e6263cce [SPARK-35927][SQL] Remove type collection AllTimestampTypes
### What changes were proposed in this pull request?

Replace the type collection `AllTimestampTypes` with the new data type `AnyTimestampType`

### Why are the changes needed?

As discussed in https://github.com/apache/spark/pull/33115#discussion_r659866760, it is more convenient to have a new data type "AnyTimestampType" instead of using the type collection `AllTimestampTypes`:
1. It simplifies pattern matching.
2. In the default type coercion rules, when implicitly casting a type to a TypeCollection type, Spark chooses the first convertible data type as the result. If we are going to make the default timestamp type configurable, having `AnyTimestampType` is better.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

Closes #33129 from gengliangwang/allTimestampTypes.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-06-29 16:57:08 +08:00
Yikun Jiang 5db51efa1a [SPARK-35721][PYTHON] Path level discover for python unittests
### What changes were proposed in this pull request?
Add path level discover for python unittests.

### Why are the changes needed?
Currently we need to specify the Python test cases manually when we add a new test case. Sometimes we forget to add the test case to the module list, so it never gets executed.

Such as:
- pyspark-core pyspark.tests.test_pin_thread

Thus we need an auto-discovery mechanism to find all test cases rather than specifying every case manually.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add the code below at the end of `dev/sparktestsupport/modules.py`:
```python
for m in sorted(all_modules):
    for g in sorted(m.python_test_goals):
        print(m.name, g)
```
Compare the result before and after:
https://www.diffchecker.com/iO3FvhKL

Closes #32867 from Yikun/SPARK_DISCOVER_TEST.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-29 17:56:13 +09:00
Gengliang Wang 7635114d53 [SPARK-35916][SQL] Support subtraction among Date/Timestamp/TimestampWithoutTZ
### What changes were proposed in this pull request?

Support the following operations:

- TimestampWithoutTZ - Date
- Date - TimestampWithoutTZ
- TimestampWithoutTZ - Timestamp
- Timestamp - TimestampWithoutTZ
- TimestampWithoutTZ - TimestampWithoutTZ

For subtraction between `TimestampWithoutTZ` and `Timestamp`, the `Timestamp` column is cast to `TimestampWithoutTZType`.

### Why are the changes needed?

Support basic subtraction among Date/Timestamp/TimestampWithoutTZ.

### Does this PR introduce _any_ user-facing change?

No, the timestamp without time zone type is not release yet.

### How was this patch tested?

Unit tests

Closes #33115 from gengliangwang/subtractTimestampWithoutTz.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-06-29 14:45:09 +08:00
Dongjoon Hyun 0a7a6f750c [SPARK-35483][FOLLOWUP][TESTS] Update run-tests.py doctest
### What changes were proposed in this pull request?

This PR updates the doctests in `run-tests.py`.

### Why are the changes needed?

This should be consistent with the `modules.py` behavior.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the GitHub Action.

I checked manually.
```
$ python dev/run-tests.py
Cannot install SparkR as R was not found in PATH
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment local
[info] Found the following changed modules: root
[info] Setup the following environment variables for tests:

========================================================================
Running Apache RAT checks
========================================================================
RAT checks passed.
```

Closes #33127 from dongjoon-hyun/SPARK-35483-2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-28 23:14:47 -07:00
Dongjoon Hyun 16e50356ee [SPARK-34302][FOLLOWUP][SQL][TESTS] Update jdbc.v2.*IntegrationSuite
### What changes were proposed in this pull request?

This PR aims to update JDBC v2 integration suite by adding `catalogName`.

### Why are the changes needed?

To recover the integration test suite.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #33124 from dongjoon-hyun/SPARK-34302.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-28 23:01:54 -07:00
Dongjoon Hyun 57896e662e [SPARK-35483][FOLLOWUP][TESTS] Enable docker_integration_tests for catalyst/sql module changes too
### What changes were proposed in this pull request?

This PR aims to enable `docker_integration_tests` when `catalyst` and `sql` module changes additionally.

### Why are the changes needed?

Currently, `catalyst` and `sql` module changes do not trigger the JDBC integration test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33125 from dongjoon-hyun/SPARK-35483.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-28 22:59:56 -07:00
Anton Okolnychyi 8a21d2dcfe [SPARK-35899][SQL][FOLLOWUP] Utility to convert connector expressions to Catalyst
### What changes were proposed in this pull request?

This PR addresses post-review comments on PR #33096:
- removes `private[sql]` modifier
- removes the option to pass a resolver to simplify the API

### Why are the changes needed?

These changes are needed to simplify the utility API.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33120 from aokolnychyi/spark-35899-follow-up.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-28 22:22:07 -07:00
Dongjoon Hyun c45a6f5d09 [SPARK-35922][BUILD] Upgrade maven-shade-plugin to 3.2.4
### What changes were proposed in this pull request?

This PR aims to upgrade `maven-shade-plugin` to 3.2.4.

### Why are the changes needed?

This is required to build with Java 17-ea.

Since `maven-shade-plugin` 3.2.3, ASM 8.0 is used, so we should remove our custom dependency on `7.3.1`.
- https://mvnrepository.com/artifact/org.apache.maven.plugins/maven-shade-plugin/3.2.4
- https://mvnrepository.com/artifact/org.apache.maven.plugins/maven-shade-plugin/3.2.3

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33122 from dongjoon-hyun/SPARK-35922.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-28 22:08:44 -07:00
Dongjoon Hyun b999e6bd90 [SPARK-35920][BUILD] Upgrade to Chill 0.10.0
### What changes were proposed in this pull request?

This PR aims to upgrade Chill to 0.10.0.

### Why are the changes needed?

This is a maintenance release with cross-compilation against Scala 2.12.14 and 2.13.6.
- https://github.com/twitter/chill/releases/tag/v0.10.0

### Does this PR introduce _any_ user-facing change?

No, this is a dependency change.

### How was this patch tested?

Pass the CIs.

Closes #33119 from dongjoon-hyun/SPARK-35920.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-28 22:06:41 -07:00
Kousuke Saruta 880bbd6aaa [SPARK-35876][SQL] ArraysZip should retain field names to avoid being re-written by analyzer/optimizer
### What changes were proposed in this pull request?

This PR fixes an issue where field names of structs generated by the `arrays_zip` function could be unexpectedly re-written by the analyzer/optimizer.
Here is an example.
```
val df = sc.parallelize(Seq((Array(1, 2), Array(3, 4)))).toDF("a1", "b1").selectExpr("arrays_zip(a1, b1) as zipped")
df.printSchema
root
 |-- zipped: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- a1: integer (nullable = true)                                      // OK. a1 is expected name
 |    |    |-- b1: integer (nullable = true)                                      // OK. b1 is expected name

df.explain
== Physical Plan ==
*(1) Project [arrays_zip(_1#3, _2#4) AS zipped#12]               // Not OK. field names are re-written as _1 and _2 respectively

df.write.parquet("/tmp/test.parquet")
val df2 = spark.read.parquet("/tmp/test.parquet")

df2.printSchema
root
 |-- zipped: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: integer (nullable = true)                                      // Not OK. a1 is expected but got _1
 |    |    |-- _2: integer (nullable = true)                                      // Not OK. b1 is expected but got _2
```

This issue happens when aliases are eliminated by `AliasHelper.replaceAliasButKeepName` or `AliasHelper.trimNonTopLevelAliases` called via analyzer/optimizer
b89cd8d75a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (L883)
b89cd8d75a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (L3759)
I investigated which functions can be affected by this issue, but so far I have found only `arrays_zip`.

To fix this issue, this PR changes the definition of `ArraysZip` to retain field names to avoid being re-written by analyzer/optimizer.

### Why are the changes needed?

This is apparently a bug.

### Does this PR introduce _any_ user-facing change?

No. After this change, the field names are no longer re-written, which should be the behavior users expect.

### How was this patch tested?

New tests.

Closes #33106 from sarutak/arrays-zip-retain-names.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-29 12:28:41 +09:00
Terry Kim 620fde4767 [SPARK-34302][SQL] Migrate ALTER TABLE ... CHANGE COLUMN command to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate the following `ALTER TABLE ... CHANGE COLUMN` command to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).

### Does this PR introduce _any_ user-facing change?

After this PR, the above `ALTER TABLE ... CHANGE COLUMN` commands will have a consistent resolution behavior.

### How was this patch tested?

Updated existing tests.

Closes #33113 from imback82/alter_change_column.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-29 02:53:05 +00:00
ulysses-you 622fc686e2 [SPARK-35888][SQL] Add dataSize field in CoalescedPartitionSpec
### What changes were proposed in this pull request?

* add `dataSize` field in `CoalescedPartitionSpec`
* add data size test suite in `ShufflePartitionsUtilSuite`

### Why are the changes needed?

Currently, the test suites around `CoalescedPartitionSpec` do not check the data size because it doesn't contain a data size field.

We can add the data size to `CoalescedPartitionSpec` and then add test cases for better coverage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass CI

Closes #33079 from ulysses-you/SPARK-35888.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-29 02:51:24 +00:00
PengLei 8fbbd2e6d7 [SPARK-33898][SQL] Support SHOW CREATE TABLE In V2
### What changes were proposed in this pull request?
1. Implement the V2 execution node `ShowCreateTableExec`, similar to the V1 `ShowCreateTableCommand`
2. `SHOW CREATE TABLE XXX AS SERDE` is not supported

### Why are the changes needed?
[SPARK-33898](https://issues.apache.org/jira/browse/SPARK-33898)

### Does this PR introduce _any_ user-facing change?
Yes. Users can now execute the `SHOW CREATE TABLE` command on V2 tables.
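
For instance (a hedged example; the catalog and table names are illustrative, with `testcat` standing for any registered v2 catalog):
```scala
// After this change, SHOW CREATE TABLE also works for tables in a v2 catalog.
spark.sql("SHOW CREATE TABLE testcat.ns.tbl").show(false)
```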

### How was this patch tested?
Add two UT tests
1. ./dev/scalastyle
2. run test DataSourceV2SQLSuite

Closes #32931 from Peng-Lei/SPARK-33898.

Lead-authored-by: PengLei <18066542445@189.cn>
Co-authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
2021-06-29 10:14:46 +08:00
Xinrong Meng 5f0113e3a6 [SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark
### What changes were proposed in this pull request?

This PR proposes to support creating a Column from a numpy literal value in pandas-on-Spark. It mainly consists of three changes:
- Enable the `lit` function defined in `pyspark.pandas.spark.functions` to support numpy literals input.

```py
>>> from pyspark.pandas.spark import functions as SF
>>> SF.lit(np.int64(1))
Column<'CAST(1 AS BIGINT)'>
>>> SF.lit(np.int32(1))
Column<'CAST(1 AS INT)'>
>>> SF.lit(np.int8(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.byte(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.float32(1))
Column<'CAST(1.0 AS FLOAT)'>
```
- Substitute `F.lit` with `SF.lit`, that is, use the `lit` function defined in `pyspark.pandas.spark.functions` rather than the one defined in `pyspark.sql.functions`, to allow creating columns out of numpy literals.
- Enable numpy literal input in the `isin` method

Non-goal:
- Some pandas-on-Spark APIs use PySpark column-related APIs internally, and these column-related APIs don't support numpy literals, so numpy literals are disallowed as input (e.g. the `to_replace` parameter in the `replace` API). This PR doesn't aim to adjust all of them; it adjusts `isin` only, because that is what inspired the PR (see https://github.com/databricks/koalas/issues/2161).
- Completing the mappings between all kinds of numpy literals and Spark data types should be a follow-up task.

### Why are the changes needed?

Spark (the `lit` function defined in `pyspark.sql.functions`) doesn't support creating a Column out of a numpy literal value.
So the `lit` function defined in `pyspark.pandas.spark.functions` is adjusted to support that in pandas-on-Spark.

### Does this PR introduce _any_ user-facing change?

Yes.
Before:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
Traceback (most recent call last):
...
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
```

After:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
0     True
1     True
2    False
3    False
4    False
Name: source, dtype: bool
```

### How was this patch tested?

Unit tests.

Closes #32955 from xinrong-databricks/datatypeops_literal.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-28 19:03:42 -07:00
Kent Yao 9c157a490b [SPARK-35910][CORE][SHUFFLE] Update remoteBlockBytes based on merged block info to reduce task time
### What changes were proposed in this pull request?

Currently, we calculate `remoteBlockBytes` based on the original block info list, which is inefficient. Usually, roughly an extra ~25% of time is spent here.

If the original reducer size is big but the actual reducer size is small due to AQE's automatic partition coalescing, the reducer will take more time to calculate `remoteBlockBytes`.

We can reduce this cost via the remote requests, which contain merged block info lists.

### Why are the changes needed?

improve task performance

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new unit tests and verified manually.

Closes #33109 from yaooqinn/SPARK-35910.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-28 13:55:59 -07:00
Tom van Bussel c6606502a2 [SPARK-35898][SQL] Fix arrays and maps in RowToColumnConverter
### What changes were proposed in this pull request?

This PR fixes support for arrays and maps in `RowToColumnConverter`. In particular this PR fixes two bugs:

1. `appendArray` in `WritableColumnVector` does not reserve any elements in its child arrays, which causes the assertion in `OffHeapColumnVector.putArray` to fail.
2. The nullability of the child columns is propagated incorrectly when creating the child converters of `ArrayConverter` and `MapConverter` in `RowToColumnConverter`.

This PR fixes these issues.

### Why are the changes needed?

Both bugs cause an exception to be thrown.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

I added additional test cases to `ColumnVectorSuite` to catch the first bug, and I added `RowToColumnConverterSuite` to catch both bugs (but specifically the second).

Closes #33108 from tomvanbussel/SPARK-35898.

Authored-by: Tom van Bussel <tom.vanbussel@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2021-06-28 16:50:53 +02:00
PengLei 356aef48b8 [SPARK-35728][SPARK-35778][SQL][TESTS] Check multiply/divide of day-time and year-month interval of any fields by a numeric
### What changes were proposed in this pull request?
[SPARK-35728](https://issues.apache.org/jira/browse/SPARK-35728): Add test case to check multiply/divide of day-time
intervals of any fields by numeric
[SPARK-35778](https://issues.apache.org/jira/browse/SPARK-35778): Add test case to check multiply/divide of year-month intervals of any fields by numeric

### Why are the changes needed?
Improve test coverage

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add ut tests

Lead-authored-by: Lei Peng <peng.8lei@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>

Closes #33080 from Peng-Lei/SPARK-35728-35778.

Lead-authored-by: PengLei <peng.8lei@gmail.com>
Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: PengLei <18066542445@189.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-06-28 13:35:54 +03:00
Yuming Wang 108635af17 Revert "[SPARK-35904][SQL] Collapse above RebalancePartitions"
This reverts commit def29e50
2021-06-28 16:23:23 +08:00
Erik Krogen 3255511d52 [SPARK-35258][SHUFFLE][YARN] Add new metrics to ExternalShuffleService for better monitoring
### What changes were proposed in this pull request?
This adds two additional metrics to `ExternalBlockHandler`:
- `blockTransferRate` -- for indicating the rate of transferring blocks, vs. the data within them
- `blockTransferAvgSize_1min` -- a 1-minute trailing average of block sizes transferred by the ESS

Additionally, this enhances `YarnShuffleServiceMetrics` to expose the histogram/`Snapshot` information from `Timer` metrics within `ExternalBlockHandler`.

### Why are the changes needed?
Currently `ExternalBlockHandler` exposes some useful metrics, but is lacking around metrics for the rate of block transfers. We have `blockTransferRateBytes` to tell us the rate of _bytes_, but no metric to tell us the rate of _blocks_, which is especially relevant when running the ESS on HDDs that are sensitive to random reads. Many small block transfers can have a negative impact on performance, but won't show up as a spike in `blockTransferRateBytes` since the sizes are small. Thus the new metrics to show information around average block size and block transfer rate are very useful to monitor the health/performance of the ESS, especially when running on HDDs.

For the `YarnShuffleServiceMetrics`, currently the three `Timer` metrics exposed by `ExternalBlockHandler` are underutilized in a YARN-based environment -- they are basically treated as a `Meter`, only exposing rate-based information, even though the metrics themselves collect detailed histograms of timing information. We should expose this information for better observability.

### Does this PR introduce _any_ user-facing change?
Yes, there are two entirely new metrics for the ESS, as documented in `monitoring.md`. Additionally in a YARN environment, `Timer` metrics exposed by the ESS will include more rich timing information.

### How was this patch tested?
New unit tests are added to verify that new metrics are showing up as expected.

We have been running this patch internally for approx. 1 year and have found it to be useful for monitoring the health of ESS and diagnosing performance issues.

Closes #32388 from xkrogen/xkrogen-SPARK-35258-ess-new-metrics.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-06-28 02:36:17 -05:00
dgd-contributor 1c81ad2029 [SPARK-35064][SQL] Group error in spark-catalyst
### What changes were proposed in this pull request?
This PR groups exception messages in sql/catalyst/src/main/scala/org/apache/spark/sql (except catalyst).

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #32916 from dgd-contributor/SPARK-35064_catalyst_group_error.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-28 07:21:24 +00:00
RoryQi 378ac78bdf [SPARK-35318][SQL][FOLLOWUP] Hide the internal view properties for show tblproperties
### What changes were proposed in this pull request?
PR #32441 hid the internal view properties for the DESCRIBE TABLE command, but the `show tblproperties view` case was not covered.

### Why are the changes needed?
Avoid internal properties confusing the users.

### Does this PR introduce _any_ user-facing change?
Yes
Before this change, the user would see the output below for `show tblproperties test_view`:
```
....
p1 v1
p2 v2
view.catalogAndNamespace.numParts	2
view.catalogAndNamespace.part.0	spark_catalog
view.catalogAndNamespace.part.1	default
view.query.out.col.0	c1
view.query.out.numCols	1
view.referredTempFunctionsNames	[]
view.referredTempViewNames	[]
...
```
After this change, the internal properties will be hidden.
```
....
p1 v1
p2 v2
...
```
### How was this patch tested?
existing UT

Closes #33016 from jerqi/hide_show_tblproperties.

Authored-by: RoryQi <1242949407@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-28 07:05:29 +00:00
Venki Korukanti 0da463e593 [SPARK-35880][SS] Track the duplicates dropped count in dedupe operator
### What changes were proposed in this pull request?

Add a metric to track the number of duplicates dropped in input in streaming deduplication operator. Also introduce a `StatefulOperatorCustomMetric` to allow stateful operators to output their own unique metrics in `StateOperatorProgress.customMetrics` in `StreamingQueryProgress`.

### Why are the changes needed?

1. Having the duplicates-dropped count helps monitor and debug any incorrect-results issue or find reasons for state size increases in the dedupe operator.
2. The new API `StatefulOperatorCustomMetric` allows stateful operators to expose their own unique metrics in `StateOperatorProgress.customMetrics` in `StreamingQueryProgress`

### Does this PR introduce _any_ user-facing change?

Yes. For the deduplication stateful operator, a new metric `numDuplicatesDropped` is shown in `StateOperatorProgress` within `StreamingQueryProgress`. Example `StreamingQueryProgress` output in JSON form:

```
{
  "id" : "510be3cd-a955-4faf-8456-d97c78d39af5",
  "runId" : "c170c4cd-04cb-4a28-b054-74020e3998e1",
  ...
  ,
  "stateOperators" : [ {
    "numRowsTotal" : 1,
    "numRowsUpdated" : 1,
    "numRowsDroppedByWatermark" : 0,
    "customMetrics" : {
      "loadedMapCacheHitCount" : 0,
      "loadedMapCacheMissCount" : 0,
      "numDuplicatesDropped" : 0,
      "stateOnCurrentVersionSizeBytes" : 392
    }
  }],
  ...
}
```
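
A hedged sketch of how this metric could be read from a running query (the source, sink, checkpoint path, and dedup column are illustrative):
```scala
// Streaming dedup query; rows dropped by dropDuplicates feed the new metric.
val dedup = spark.readStream.format("rate").load().dropDuplicates("value")
val query = dedup.writeStream
  .format("noop")
  .option("checkpointLocation", "/tmp/dedup-ckpt")   // hypothetical path
  .start()

// Once the query has made progress, the metric shows up under customMetrics.
Option(query.lastProgress).foreach { p =>
  p.stateOperators.foreach { op =>
    println(s"numDuplicatesDropped = ${op.customMetrics.get("numDuplicatesDropped")}")
  }
}
```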

### How was this patch tested?

Existing UTs for regression and added a UT.

Closes #33065 from vkorukanti/SPARK-35880.

Authored-by: Venki Korukanti <venki.korukanti@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-06-28 13:21:00 +09:00
Takuya UESHIN 8c401beb80 [SPARK-35901][PYTHON] Refine type hints in pyspark.pandas.window
### What changes were proposed in this pull request?

Refines type hints in `pyspark.pandas.window`.

Also, some refactoring is included to clean up the type hierarchy of `Rolling` and `Expanding`.

### Why are the changes needed?

We can use stricter type hints for functions in pyspark.pandas.window by using generics.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33097 from ueshin/issues/SPARK-35901/window.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-28 12:23:32 +09:00
itholic 03e6de2abe [SPARK-35605][PYTHON] Move to_pandas_on_spark to the Spark DataFrame
### What changes were proposed in this pull request?

This PR proposes moving the `to_pandas_on_spark` function from `pyspark.pandas.frame` to `pyspark.sql.dataframe`, and adds the related tests to the PySpark DataFrame tests.

### Why are the changes needed?

Now that Koalas is ported into PySpark, we don't need the Spark auto-patching anymore.
Also, having `to_pandas_on_spark` belong to the pandas-on-Spark DataFrame doesn't make much sense.

### Does this PR introduce _any_ user-facing change?

No, this is internal refactoring.

### How was this patch tested?

Added the related tests and manually checked that they pass.

Closes #33054 from itholic/SPARK-35605.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-28 11:47:09 +09:00
Dhruvil Dave a7369b3080 [SPARK-35909][DOCS] Fix broken Python Links in docs/sql-getting-started.md
### What changes were proposed in this pull request?

The hyperlinks in Python code blocks in [Spark SQL Guide - Getting Started](https://spark.apache.org/docs/latest/sql-getting-started.html) currently point to invalid addresses and return 404. This pull request fixes that issue by pointing them to correct links in Python API docs.

### Why are the changes needed?

Errors in documentation count as bugs and hence need to be fixed.

### Does this PR introduce _any_ user-facing change?

Yes. This PR fixes documentation error in https://spark.apache.org/docs/latest/sql-getting-started.html

### How was this patch tested?

This patch was tested by cloning the repo from scratch and doing a clean local build after fixing the problems.

Closes #33107 from dhruvildave/sql-doc.

Authored-by: Dhruvil Dave <dhruvil.dave@outlook.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-27 11:34:28 -07:00
Liang-Chi Hsieh b89cd8d75a [SPARK-35886][SQL] PromotePrecision should not overwrite genCode
### What changes were proposed in this pull request?

This patch fixes `PromotePrecision`, which overwrites `genCode` where subexpression elimination should happen.

### Why are the changes needed?

`PromotePrecision` overwrites `genCode` where subexpression elimination should happen. So if it is the topmost expression of a subexpression, it is never replaced.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added test.

Closes #33103 from viirya/fix-precision.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-06-26 23:19:58 -07:00
zengruios b11b175148 [SPARK-35893][TESTS] Add unit test case for MySQLDialect.getCatalystType
### What changes were proposed in this pull request?
Add unit test case for MySQLDialect.getCatalystType

### Why are the changes needed?
add unit test case

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit Test

Closes #33087 from zengruios/SPARK-35893.

Authored-by: zengruios <578395184@qq.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-26 21:43:52 -07:00
Yuming Wang def29e5075 [SPARK-35904][SQL] Collapse above RebalancePartitions
### What changes were proposed in this pull request?

1. Make `RebalancePartitions` extend `RepartitionOperation`.
2. Make `CollapseRepartition` support `RebalancePartitions`.

### Why are the changes needed?

`CollapseRepartition` can optimize `RebalancePartitions` if possible.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33099 from wangyum/SPARK-35904.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-26 21:19:58 -07:00
Angerszhuuuu 74637a6ca7 [SPARK-35905][SQL][TESTS] Fix UT to clean up table/view in SQLQuerySuite
### What changes were proposed in this pull request?
Fix UT mistake in SQLQuerySuite

### Why are the changes needed?
Fix UT mistake in SQLQuerySuite

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #33092 from AngersZhuuuu/SPARK-33338-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-26 09:55:34 -07:00
Dongjoon Hyun f68fbae7ab [SPARK-35903][TESTS] Parameterize 'master' in TPCDSQueryBenchmark
### What changes were proposed in this pull request?

Like SPARK-8397, this PR aims to parameterize TPCDSQueryBenchmark's Spark master by reusing `spark.sql.test.master`.

### Why are the changes needed?

This is helpful for testers.

### Does this PR introduce _any_ user-facing change?

No. This is a test environment.

### How was this patch tested?

Manually, I checked the performance difference with TPCDS 10g data.

Closes #33098 from dongjoon-hyun/SPARK-35903.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-26 09:33:55 -07:00
Gengliang Wang 645fb59652 [SPARK-35895][SQL] Support subtracting Intervals from TimestampWithoutTZ
### What changes were proposed in this pull request?

Support the following operation:
- TimestampWithoutTZ - Year-Month interval

The following operations are actually already supported via https://github.com/apache/spark/pull/33076; this PR adds end-to-end tests for them:
- TimestampWithoutTZ - Calendar interval
- TimestampWithoutTZ - Daytime interval

### Why are the changes needed?

Support subtracting all 3 interval types from a timestamp without time zone

### Does this PR introduce _any_ user-facing change?

No, the timestamp without time zone type is not release yet.

### How was this patch tested?

Unit tests

Closes #33086 from gengliangwang/subtract.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-06-26 13:19:00 +03:00
Kent Yao 14d4decf73 [SPARK-35879][CORE][SHUFFLE] Fix performance regression caused by collectFetchRequests
### What changes were proposed in this pull request?

This PR fixes a perf regression on the executor side when creating fetch requests with a large number of initial partitions.

![image](https://user-images.githubusercontent.com/8326978/123270865-dd21e800-d532-11eb-8447-ad80e47b034f.png)

At NetEase, we had an online job that took `45min` to "fetch" about 100MB of shuffle data; it turned out that it was just collecting fetch requests slowly. Normally, such a task should finish in seconds.

See the `DEBUG` log

```
21/06/22 11:52:26 DEBUG BlockManagerStorageEndpoint: Sent response: 0 to kyuubi.163.org:
21/06/22 11:53:05 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 3941440 at BlockManagerId(12, .., 43559, None) with 19 blocks
21/06/22 11:53:44 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 3716400 at BlockManagerId(20, .., 38287, None) with 18 blocks
21/06/22 11:54:41 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 4559280 at BlockManagerId(6, .., 39689, None) with 22 blocks
21/06/22 11:55:08 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 3120160 at BlockManagerId(33, .., 39449, None) with 15 blocks
```

I also created a test case locally with my laptop's docker env to give a reproducible case.

```
bin/spark-sql --conf spark.kubernetes.file.upload.path=./ --master k8s://https://kubernetes.docker.internal:6443 --conf spark.kubernetes.container.image=yaooqinn/spark:v20210624-5 -c spark.kubernetes.context=docker-for-desktop_1 --num-executors 5 --driver-memory 5g --conf spark.kubernetes.executor.podNamePrefix=sparksql
```

```sql
 SET spark.sql.adaptive.enabled=true;
 SET spark.sql.shuffle.partitions=3000;
 SELECT /*+ REPARTITION */ 1 as pid, id from range(1, 1000000, 1, 500);
 SELECT /*+ REPARTITION(pid, id) */ 1 as pid, id from range(1, 1000000, 1, 500);
 ```

### Why are the changes needed?

Fix a perf regression which was introduced by SPARK-29292 (3ad4863673) in v3.1.0.

3ad4863673 was for supporting compilation with Scala 2.13, but the performance loss is huge. We need to consider backporting this PR to branch-3.1.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Manually.

#### before
```log
 21/06/23 13:54:22 DEBUG ShuffleBlockFetcherIterator: maxBytesInFlight: 50331648, targetRemoteRequestSize: 10066329, maxBlocksInFlightPerAddress: 2147483647
 21/06/23 13:54:38 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 2314708 at BlockManagerId(2, 10.1.3.114, 36423, None) with 86 blocks
 21/06/23 13:54:59 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 2636612 at BlockManagerId(3, 10.1.3.115, 34293, None) with 87 blocks
 21/06/23 13:55:18 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 2508706 at BlockManagerId(4, 10.1.3.116, 41869, None) with 90 blocks
 21/06/23 13:55:34 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 2350854 at BlockManagerId(5, 10.1.3.117, 45787, None) with 85 blocks
 21/06/23 13:55:34 INFO ShuffleBlockFetcherIterator: Getting 438 (11.8 MiB) non-empty blocks including 90 (2.5 MiB) local and 0 (0.0 B) host-local and 348 (9.4 MiB) remote blocks
 21/06/23 13:55:34 DEBUG ShuffleBlockFetcherIterator: Sending request for 87 blocks (2.5 MiB) from 10.1.3.115:34293
 21/06/23 13:55:34 INFO TransportClientFactory: Successfully created connection to /10.1.3.115:34293 after 1 ms (0 ms spent in bootstraps)
 21/06/23 13:55:34 DEBUG ShuffleBlockFetcherIterator: Sending request for 90 blocks (2.4 MiB) from 10.1.3.116:41869
 21/06/23 13:55:34 INFO TransportClientFactory: Successfully created connection to /10.1.3.116:41869 after 2 ms (0 ms spent in bootstraps)
 21/06/23 13:55:34 DEBUG ShuffleBlockFetcherIterator: Sending request for 85 blocks (2.2 MiB) from 10.1.3.117:45787
 ```
```log
 21/06/23 14:00:45 INFO MapOutputTracker: Broadcast outputstatuses size = 411, actual size = 828997
 21/06/23 14:00:45 INFO MapOutputTrackerWorker: Got the map output locations
 21/06/23 14:00:45 DEBUG ShuffleBlockFetcherIterator: maxBytesInFlight: 50331648, targetRemoteRequestSize: 10066329, maxBlocksInFlightPerAddress: 2147483647
 21/06/23 14:00:55 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 1894389 at BlockManagerId(2, 10.1.3.114, 36423, None) with 99 blocks
 21/06/23 14:01:04 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 1919993 at BlockManagerId(3, 10.1.3.115, 34293, None) with 100 blocks
 21/06/23 14:01:14 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 1977186 at BlockManagerId(5, 10.1.3.117, 45787, None) with 103 blocks
 21/06/23 14:01:23 DEBUG ShuffleBlockFetcherIterator: Creating fetch request of 1938336 at BlockManagerId(4, 10.1.3.116, 41869, None) with 101 blocks
 21/06/23 14:01:23 INFO ShuffleBlockFetcherIterator: Getting 500 (9.1 MiB) non-empty blocks including 97 (1820.3 KiB) local and 0 (0.0 B) host-local and 403 (7.4 MiB) remote blocks
 21/06/23 14:01:23 DEBUG ShuffleBlockFetcherIterator: Sending request for 101 blocks (1892.9 KiB) from 10.1.3.116:41869
 21/06/23 14:01:23 DEBUG ShuffleBlockFetcherIterator: Sending request for 103 blocks (1930.8 KiB) from 10.1.3.117:45787
 21/06/23 14:01:23 DEBUG ShuffleBlockFetcherIterator: Sending request for 99 blocks (1850.0 KiB) from 10.1.3.114:36423
 21/06/23 14:01:23 DEBUG ShuffleBlockFetcherIterator: Sending request for 100 blocks (1875.0 KiB) from 10.1.3.115:34293
 21/06/23 14:01:23 INFO ShuffleBlockFetcherIterator: Started 4 remote fetches in 37889 ms
 ```

#### After

```log
21/06/24 13:01:16 DEBUG ShuffleBlockFetcherIterator: maxBytesInFlight: 50331648, targetRemoteRequestSize: 10066329, maxBlocksInFlightPerAddress: 2147483647
21/06/24 13:01:16 INFO ShuffleBlockFetcherIterator: ==> Call blockInfos.map(_._2).sum: 40 ms
21/06/24 13:01:16 INFO ShuffleBlockFetcherIterator: ==> Call mergeFetchBlockInfo for shuffle_0_9_2990_2997/9: 0 ms
21/06/24 13:01:16 INFO ShuffleBlockFetcherIterator: ==> Call mergeFetchBlockInfo for shuffle_0_15_2395_2997/15: 0 ms
```

Closes #33063 from yaooqinn/SPARK-35879.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
2021-06-26 12:48:24 +08:00
Takuya UESHIN a9ebfc5374 [SPARK-35466][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.data_type_ops.*
### What changes were proposed in this pull request?

Adds more type annotations in the files `python/pyspark/pandas/data_type_ops/*.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #33094 from ueshin/issues/SPARK-35466/disallow_untyped_defs_data_ops.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-25 18:16:25 -07:00
Anton Okolnychyi 63cd1314d2 [SPARK-35899][SQL] Utility to convert connector expressions to Catalyst
### What changes were proposed in this pull request?

This PR adds a utility to convert public connector expressions to Catalyst expressions.

Notable differences:
- Switched to `QueryCompilationErrors` from an explicit `AnalysisException`.
- Decoupled the resolving logic for v2 references into separate methods to use in other places.

### Why are the changes needed?

These changes are needed as more and more places require this logic and it is better to implement it in a single place.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33096 from aokolnychyi/spark-35899.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-25 18:04:07 -07:00
Jungtaek Lim 67eddf2ffc [SPARK-35894][BUILD] Introduce new style enforce to not import scala.collection.Seq/IndexedSeq
### What changes were proposed in this pull request?

This PR proposes to add a new scalastyle rule to enforce not importing `scala.collection.Seq` and `scala.collection.IndexedSeq` which conflicts with `scala.Seq` and `scala.IndexedSeq`.

The problem occurs as Scala 2.13 changed the alias of `scala.Seq` and `scala.IndexedSeq`. Before Scala 2.13, they were `scala.collection.Seq` and `scala.collection.IndexedSeq`. After Scala 2.13, they become `scala.collection.immutable.Seq` and `scala.collection.immutable.IndexedSeq`.

Please refer to the doc below for more details.
https://docs.scala-lang.org/overviews/core/collections-migration-213.html
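
A hedged illustration of the kind of code that breaks (not Spark code; the variable names are made up):
```scala
import scala.collection.mutable.ArrayBuffer

val buf = ArrayBuffer(1, 2, 3)
// val s: Seq[Int] = buf     // compiles on 2.12 (scala.Seq == collection.Seq) but not on 2.13
val ok: Seq[Int] = buf.toSeq // cross-build safe: explicit conversion to the default Seq
```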

### Why are the changes needed?

We have seen Seq/IndexedSeq issues with cross-compilation between Scala 2.12 and 2.13. While I'm not sure this can prevent all cases, it will prevent the simple cases from breaking cross-compilation.

### Does this PR introduce _any_ user-facing change?

No change for end users. Contributors will be restricted, but this shouldn't block them from doing the right thing.

### How was this patch tested?

Ran scalastyle against current master (before #33084)

```
> dev/scalastyle
Scalastyle checks failed at following occurrences:
[error] /Users/Jungtaek.Lim/WorkArea/ScalaProjects/spark-apache/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala:28:0:
[error]       Don't import scala.collection.Seq and scala.collection.IndexedSeq as it may bring some problems with cross-build between Scala 2.12 and 2.13.
[error]
[error]       Please refer below page to see the details of changes around Seq.
[error]       https://docs.scala-lang.org/overviews/core/collections-migration-213.html
[error]
[error]       If you really need to use scala.collection.Seq or scala.collection.IndexedSeq, please use the fully-qualified name instead.
[error]
[error] /Users/Jungtaek.Lim/WorkArea/ScalaProjects/spark-apache/core/src/main/scala/org/apache/spark/util/Utils.scala:37:0:
[error]       Don't import scala.collection.Seq and scala.collection.IndexedSeq as it may bring some problems with cross-build between Scala 2.12 and 2.13.
[error]
[error]       Please refer below page to see the details of changes around Seq.
[error]       https://docs.scala-lang.org/overviews/core/collections-migration-213.html
[error]
[error]       If you really need to use scala.collection.Seq or scala.collection.IndexedSeq, please use the fully-qualified name instead.
[error]
[error] Total time: 15 s, completed Jun 25, 2021 9:01:32 PM
```

Closes #33085 from HeartSaVioR/SPARK-35894.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-06-26 09:41:16 +09:00
Adam Binford 939ea3d5da [SPARK-35863][BUILD] Update Ivy to 2.5.0
### What changes were proposed in this pull request?

Update Ivy from 2.4.0 to 2.5.0.

- https://ant.apache.org/ivy/history/2.5.0/release-notes.html

### Why are the changes needed?

This brings various improvements and bug fixes. Most notably, the addition of the `ivy.maven.lookup.sources` and `ivy.maven.lookup.javadoc` configs can significantly speed up module resolution time if these are turned off, especially behind a proxy. These could arguably be turned off by default, because when submitting jobs you probably don't care about the sources or javadoc jars. I didn't include that here, but am happy to look into it if desired.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT and build passes

Closes #33088 from Kimahriman/feature/ivy-update.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-25 07:37:36 -07:00
Yuanjian Li 0c31137172 [SPARK-35628][SS][FOLLOW-UP] Fix the consistent break on Scala 2.13 build
### What changes were proposed in this pull request?
Fix the consistent build break on Scala 2.13 caused by PR https://github.com/apache/spark/pull/32767.

### Why are the changes needed?
Fix the consistent build break.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #33084 from xuanyuanking/SPARK-35628-follow.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-25 07:08:03 -07:00
Erik Krogen 866df69c62 [SPARK-35672][CORE][YARN] Pass user classpath entries to executors using config instead of command line
### What changes were proposed in this pull request?
Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs.

### Why are the changes needed?
User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message:

> /bin/bash: Argument list too long

A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue.

### Does this PR introduce _any_ user-facing change?
No, except for fixing the bug, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before.

### How was this patch tested?
New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success.

Closes #32810 from xkrogen/xkrogen-SPARK-35672-classpath-scalable.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-06-25 08:53:57 -05:00
Steve Loughran 36aaaa14c3 [SPARK-35878][CORE] Add fs.s3a.endpoint if unset and fs.s3a.endpoint.region is null
### What changes were proposed in this pull request?

This patches the hadoop configuration so that fs.s3a.endpoint is set to
s3.amazonaws.com if neither it nor fs.s3a.endpoint.region is set.

This stops S3A Filesystem creation failing with the error
"Unable to find a region via the region provider chain."
in some non-EC2 deployments.

See: HADOOP-17771.

When Spark options are propagated to the Hadoop configuration
in SparkHadoopUtil, the fs.s3a.endpoint value is set to
"s3.amazonaws.com" if it is unset and no explicit region
is set in fs.s3a.endpoint.region.
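
A hedged sketch of that propagation logic (simplified; not the exact code in SparkHadoopUtil):
```scala
import org.apache.hadoop.conf.Configuration

// Only fall back to the global endpoint when neither the endpoint nor the
// region has been configured explicitly, matching the behaviour described above.
def patchS3AEndpoint(hadoopConf: Configuration): Unit = {
  val endpoint = hadoopConf.getTrimmed("fs.s3a.endpoint", "")
  val region = hadoopConf.getTrimmed("fs.s3a.endpoint.region", "")
  if (endpoint.isEmpty && region.isEmpty) {
    hadoopConf.set("fs.s3a.endpoint", "s3.amazonaws.com")
  }
}
```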

### Why are the changes needed?

A regression in Hadoop 3.3.1 has surfaced which causes S3A filesystem
instantiation to fail outside EC2 deployments if the host lacks
a CLI configuration in ~/.aws/config declaring the region, or
the `AWS_REGION` environment variable

HADOOP-17771 fixes this in Hadoop-3.3.2+, but
this spark patch will correct the behavior when running
Spark with the 3.3.1 artifacts.

It is harmless for older versions and compatible
with hadoop releases containing the HADOOP-17771
fix.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New tests to verify propagation logic from spark conf to hadoop conf.

Closes #33064 from steveloughran/SPARK-35878-regions.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-25 05:24:55 -07:00
Gengliang Wang 9814cf8853 [SPARK-35889][SQL] Support adding TimestampWithoutTZ with Interval types
### What changes were proposed in this pull request?

Support the following operations:

- TimestampWithoutTZ + Calendar interval
- TimestampWithoutTZ + Year-Month interval
- TimestampWithoutTZ + Daytime interval

### Why are the changes needed?

Support basic '+' operator for timestamp without time zone type.

### Does this PR introduce _any_ user-facing change?

No, the timestamp without time zone type is not release yet.

### How was this patch tested?

Unit tests

Closes #33076 from gengliangwang/addForNewTS.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-06-25 19:58:42 +08:00
Yuanjian Li f2029e7442 [SPARK-35628][SS] RocksDBFileManager - load checkpoint from DFS
### What changes were proposed in this pull request?
The implementation for the load operation of RocksDBFileManager.

### Why are the changes needed?
Provide the functionality of loading all necessary files for specific checkpoint versions from DFS to the given local directory.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New UT added.

Closes #32767 from xuanyuanking/SPARK-35628.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-06-25 18:38:26 +09:00
Wenchen Fan c0cfbb1743 [SPARK-35884][SQL] EXPLAIN FORMATTED for AQE
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/29137, which has some issues when running EXPLAIN FORMATTED:
```
AdaptiveSparkPlan (13)
+- == Final Plan ==
   * HashAggregate (12)
   +- CustomShuffleReader (11)
      +- ShuffleQueryStage (10)
         +- Exchange (9)
            +- * HashAggregate (8)
               +- * Project (7)
                  +- * BroadcastHashJoin Inner BuildRight (6)
                     :- * LocalTableScan (1)
                     +- BroadcastQueryStage (5)
                        +- BroadcastExchange (4)
                           +- * Project (3)
                              +- * LocalTableScan (2)
+- == Initial Plan ==
   HashAggregate (unknown)
   +- Exchange (unknown)
      +- HashAggregate (unknown)
         +- Project (unknown)
            +- BroadcastHashJoin Inner BuildRight (unknown)
               :- Project (unknown)
               :  +- LocalTableScan (unknown)
               +- BroadcastExchange (unknown)
                  +- Project (3)
                     +- LocalTableScan (2)
```

Some nodes do not have an ID and show `unknown`. This PR fixes the issue.
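
A hedged way to reproduce this kind of output (the query is illustrative):
```scala
spark.conf.set("spark.sql.adaptive.enabled", "true")
val df = spark.range(100).groupBy("id").count()
df.collect()              // run the query so AQE can produce the final plan
df.explain("formatted")   // formatted output then shows both the initial and final plans
```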

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

EXPLAIN FORMATTED with AQE displays correctly.

### How was this patch tested?

new tests

Closes #33067 from cloud-fan/explain.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-06-25 00:18:26 -07:00
Terry Kim f1ad34558c [SPARK-35883][SQL] Migrate ALTER TABLE RENAME COLUMN command to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate the following `ALTER TABLE ... RENAME COLUMN` command to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
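
For reference, the affected command looks like this (a hedged example; the table and column names are illustrative):
```scala
spark.sql("ALTER TABLE testcat.ns.tbl RENAME COLUMN old_name TO new_name")
```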

### Why are the changes needed?

This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).

### Does this PR introduce _any_ user-facing change?

After this PR, the above `ALTER TABLE ... RENAME COLUMN` commands will have a consistent resolution behavior.

### How was this patch tested?

Updated existing tests.

Closes #33066 from imback82/alter_rename.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-06-25 05:53:56 +00:00
Takuya UESHIN 6497ac3585 [SPARK-35471][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.frame
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/frame.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #33073 from ueshin/issues/SPARK-35471/disallow_untyped_defs_frame.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-25 14:41:58 +09:00
William Hyun c6555f1845 [SPARK-35887][BUILD] Find and set JAVA_HOME from javac location
### What changes were proposed in this pull request?
This PR aims to find and set JAVA_HOME from the javac location.

### Why are the changes needed?
Since SPARK-35850, Maven compilation fails with Java 8 when `JAVA_HOME` is not set and the `java` found on the `PATH` is a JRE `java`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually ran `mvn` on a vanilla Ubuntu environment.

Closes #33075 from williamhyun/util.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-24 21:09:18 -07:00
Kousuke Saruta c562c1674e [SPARK-34320][SQL][FOLLOWUP] Modify V2JDBCTest to follow the change of the error message
### What changes were proposed in this pull request?

This is a followup PR for SPARK-34320 (#32854).
That PR changed the error message of `ALTER TABLE`, but `V2JDBCTest` was not updated to match the new message.
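A hedged sketch of the kind of expectation the suite needs after the message change (the identifiers mirror the failure log below, but this is not the exact diff):

```scala
// Fragment in the style of a V2JDBCTest suite (ScalaTest's intercept/assert and the
// shared sql(...) helper are assumed to come from the test harness).
val catalogName = "postgresql" // each integration suite supplies its own catalog name
val msg = intercept[org.apache.spark.sql.AnalysisException] {
  sql(s"ALTER TABLE $catalogName.alt_table DROP COLUMN bad_column")
}.getMessage
// The expected message now carries the catalog-qualified table name.
assert(msg.contains(s"Cannot delete missing field bad_column in $catalogName.alt_table schema"))
```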

### Why are the changes needed?

To fix the `v2.*JDBCSuite` failures.
```
[info] - SPARK-33034: ALTER TABLE ... add new columns (173 milliseconds)
[info] - SPARK-33034: ALTER TABLE ... drop column *** FAILED *** (126 milliseconds)
[info]   "Cannot delete missing field bad_column in postgresql.alt_table schema: root
[info]    |-- C2: string (nullable = true)
[info]   ; line 1 pos 0;
[info]   'AlterTableDropColumns [unresolvedfieldname(bad_column)]
[info]   +- ResolvedTable org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog7f4b7516, alt_table, JDBCTable(alt_table,StructType(StructField(C2,StringType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions5842301d), [C2#1879]
[info]   " did not contain "Cannot delete missing field bad_column in alt_table schema" (V2JDBCTest.scala:106)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at org.apache.spark.sql.jdbc.v2.V2JDBCTest.$anonfun$$init$$6(V2JDBCTest.scala:106)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1461)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withTable(SQLTestUtils.scala:305)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withTable$(SQLTestUtils.scala:303)
[info]   at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.withTable(DockerJDBCIntegrationSuite.scala:95)
[info]   at org.apache.spark.sql.jdbc.v2.V2JDBCTest.$anonfun$$init$$5(V2JDBCTest.scala:95)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
[info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
[info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
[info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
[info]   at org.scalatest.Suite.run(Suite.scala:1112)
[info]   at org.scalatest.Suite.run$(Suite.scala:1094)
[info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:237)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:237)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:236)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:62)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:62)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:748)
[info] - SPARK-33034: ALTER TABLE ... update column type (122 milliseconds)
[info] - SPARK-33034: ALTER TABLE ... rename column (93 milliseconds)
[info] - SPARK-33034: ALTER TABLE ... update column nullability (92 milliseconds)
[info] - CREATE TABLE with table comment (38 milliseconds)
[info] - CREATE TABLE with table property (52 milliseconds)
[info] MySQLIntegrationSuite:
[info] - Basic test (61 milliseconds)
[info] - Numeric types (67 milliseconds)
[info] - Date types (59 milliseconds)
[info] - String types (50 milliseconds)
[info] - Basic write test (216 milliseconds)
[info] - query JDBC option (64 milliseconds)
[info] Run completed in 19 minutes, 43 seconds.
[info] Total number of tests run: 89
[info] Suites: completed 14, aborted 0
[info] Tests: succeeded 84, failed 5, canceled 0, ignored 0, pending 0
[info] *** 5 TESTS FAILED ***
[error] Failed tests:
[error] 	org.apache.spark.sql.jdbc.v2.OracleIntegrationSuite
[error] 	org.apache.spark.sql.jdbc.v2.MsSqlServerIntegrationSuite
[error] 	org.apache.spark.sql.jdbc.v2.DB2IntegrationSuite
[error] 	org.apache.spark.sql.jdbc.v2.MySQLIntegrationSuite
[error] 	org.apache.spark.sql.jdbc.v2.PostgresIntegrationSuite
[error] (docker-integration-tests / Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 1223 s (20:23), completed Jun 25, 2021 1:31:04 AM
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran docker-integration-tests on GitHub Actions (GA).

Closes #33074 from sarutak/followup-SPARK-34320.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-06-25 12:58:38 +09:00