### What changes were proposed in this pull request?
This PR renames the blacklisting feature. I ended up using "excludeOnFailure" or "excluded" in most cases, but there is a mix. I renamed BlacklistTracker to HealthTracker, but for TaskSetBlacklist, HealthTracker didn't make sense to me since it's not the health of the taskset itself but rather tracking the things it is excluded on, so I renamed it to TaskSetExcludeList. Everywhere else I tried to use the context, and in most cases "excluded" made sense. It made more sense to me than "blocked" since you are basically excluding those executors and nodes from having tasks scheduled on them, and they can be unexcluded later after timeouts and such. For the configs, I changed the names to use "excludeOnFailure", which I thought explained it.
I unfortunately couldn't get rid of some of them because they are part of the event listener and history files. To keep backwards compatibility I kept the events and some of the parsing so that the history server would still properly read older history files. It is not forward compatible though, meaning a new application writes the "Excluded" events, so an older history server won't properly display them as being blacklisted.
A few of the files below are showing up as deleted and recreated even though I did a git mv on them. I'm not sure why.
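For illustration, here is a minimal sketch of the renamed configuration in use; the keys below are assumed from the "excludeOnFailure" naming described above (the old names remain as deprecated aliases), and the full list of renamed keys is in the diff:
```scala
import org.apache.spark.SparkConf

// Sketch only: new key names assumed from the description above.
val conf = new SparkConf()
  .set("spark.excludeOnFailure.enabled", "true")   // previously spark.blacklist.enabled
  .set("spark.excludeOnFailure.timeout", "1h")     // previously spark.blacklist.timeout
```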
### Why are the changes needed?
Get rid of problematic language.
### Does this PR introduce _any_ user-facing change?
The config names change, but the old configs still work and are now deprecated.
### How was this patch tested?
Updated tests; also manually tested the UI changes and manually verified that the history server can read older versions of history files, and vice versa.
Closes#29906 from tgravescs/SPARK-32037.
Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
Add a test verifying that a pending executor does not stall pod allocation.
### Why are the changes needed?
Better test coverage
### Does this PR introduce _any_ user-facing change?
Test only change.
### How was this patch tested?
New test passes.
Closes#30205 from holdenk/verify-pod-allocation-does-not-stall.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Small typo fix in the description of `spark.storage.decommission.shuffleBlocks.enabled` property.
Closes#30208 from dsabanin/patch-1.
Authored-by: Dmitry Sabanin <sdmitry@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to switch the class loader strategy from `ScalaLibrary` to `Flat` (see https://www.scala-sbt.org/1.x/docs/In-Process-Classloaders.html):
https://github.com/apache/spark/runs/1314691686
```
Error: java.util.MissingResourceException: Can't find bundle for base name org.scalactic.ScalacticBundle, locale en
Error: at java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1581)
Error: at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1396)
Error: at java.util.ResourceBundle.getBundle(ResourceBundle.java:782)
Error: at org.scalactic.Resources$.resourceBundle$lzycompute(Resources.scala:8)
Error: at org.scalactic.Resources$.resourceBundle(Resources.scala:8)
Error: at org.scalactic.Resources$.pleaseDefineScalacticFillFilePathnameEnvVar(Resources.scala:256)
Error: at org.scalactic.source.PositionMacro$PositionMacroImpl.apply(PositionMacro.scala:65)
Error: at org.scalactic.source.PositionMacro$.genPosition(PositionMacro.scala:85)
Error: at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
Error: at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Error: at java.lang.reflect.Method.invoke(Method.java:498)
```
See also https://github.com/sbt/sbt/issues/5736
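For reference, a hedged sketch of the sbt setting involved (the exact scoping inside Spark's SparkBuild may differ):
```scala
// In the sbt build definition: switch the test class-loader layering strategy to Flat,
// so that resources such as the Scalactic bundle are visible to test code.
Test / classLoaderLayeringStrategy := ClassLoaderLayeringStrategy.Flat
```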
### Why are the changes needed?
To make the build less flaky.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
GitHub Actions builds in this PR.
Closes#30198 from HyukjinKwon/SPARK-33297.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add a `query.resolved` check before analyzing `InsertIntoDir`.
### Why are the changes needed?
For a better error message.
```
INSERT OVERWRITE DIRECTORY '/tmp/file' USING PARQUET
SELECT * FROM (
SELECT c3 FROM (
SELECT c1, c2 from values(1,2) t(c1, c2)
)
)
```
Before this PR, we get the following error message:
```
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to toAttribute on unresolved object, tree: *
at org.apache.spark.sql.catalyst.analysis.Star.toAttribute(unresolved.scala:244)
at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
```
### Does this PR introduce _any_ user-facing change?
Yes, the error message changed.
### How was this patch tested?
New test.
Closes#30197 from ulysses-you/SPARK-33294.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size.
Since we can't decide whether it's a bug, and some users need the behavior to be the same as Hive's.
### Why are the changes needed?
Provides a compatible choice between the historical behavior and Hive's behavior.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UTs.
Closes#30156 from AngersZhuuuu/SPARK-33284.
Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to improve the error message from `from_json`/`from_csv` by combining errors from all schema parsers:
- DataType.fromJson (except CSV)
- CatalystSqlParser.parseDataType
- CatalystSqlParser.parseTableSchema
Before the changes, `from_json` does not show error messages from the first parser in the chain, which could mislead users.
### Why are the changes needed?
Currently, `from_json` outputs the error message from the fallback schema parser which can confuse end-users. For example:
```scala
// Assumes `import spark.implicits._` and `import org.apache.spark.sql.functions.from_json`; `df` below is illustrative sample data.
val df = Seq("""{"id": 1}""").toDF("json")
val invalidJsonSchema = """{"fields": [{"a":123}], "type": "struct"}"""
df.select(from_json($"json", invalidJsonSchema, Map.empty[String, String])).show()
```
The JSON schema has an issue in `{"a":123}` but the error message doesn't point it out:
```
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)
== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '{' expecting {'ADD', 'AFTER', ... }(line 1, pos 0)
== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^
```
### Does this PR introduce _any_ user-facing change?
Yes, after the changes for the example above:
```
Cannot parse the schema in JSON format: Failed to convert the JSON string '{"a":123}' to a field.
Failed fallback parsing: Cannot parse the data type:
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)
== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^
Failed fallback parsing:
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)
== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^
```
### How was this patch tested?
- By existing tests suites like `JsonFunctionsSuite` and `JsonExpressionsSuite`.
- Add new test to `JsonFunctionsSuite`.
- Re-gen results for `json-functions.sql`.
Closes#30183 from MaxGekk/fromDDL-error-msg.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to wrap `ArrayBasedMapData` literal representation with `map(...)`.
### Why are the changes needed?
Literal ArrayBasedMapData has inconsistent string representation from `LogicalPlan` to `Optimized Logical Plan/Physical Plan`. Also, the representation at `Optimized Logical Plan` and `Physical Plan` is ambiguous like `[1 AS a#0, keys: [key1], values: [value1] AS b#1]`.
**BEFORE**
```scala
scala> spark.version
res0: String = 2.4.7
scala> sql("SELECT 1 a, map('key1', 'value1') b").explain(true)
== Parsed Logical Plan ==
'Project [1 AS a#0, 'map(key1, value1) AS b#1]
+- OneRowRelation
== Analyzed Logical Plan ==
a: int, b: map<string,string>
Project [1 AS a#0, map(key1, value1) AS b#1]
+- OneRowRelation
== Optimized Logical Plan ==
Project [1 AS a#0, keys: [key1], values: [value1] AS b#1]
+- OneRowRelation
== Physical Plan ==
*(1) Project [1 AS a#0, keys: [key1], values: [value1] AS b#1]
+- Scan OneRowRelation[]
```
**AFTER**
```scala
scala> spark.version
res0: String = 3.1.0-SNAPSHOT
scala> sql("SELECT 1 a, map('key1', 'value1') b").explain(true)
== Parsed Logical Plan ==
'Project [1 AS a#4, 'map(key1, value1) AS b#5]
+- OneRowRelation
== Analyzed Logical Plan ==
a: int, b: map<string,string>
Project [1 AS a#4, map(key1, value1) AS b#5]
+- OneRowRelation
== Optimized Logical Plan ==
Project [1 AS a#4, map(keys: [key1], values: [value1]) AS b#5]
+- OneRowRelation
== Physical Plan ==
*(1) Project [1 AS a#4, map(keys: [key1], values: [value1]) AS b#5]
+- *(1) Scan OneRowRelation[]
```
### Does this PR introduce _any_ user-facing change?
Yes. This changes the query plan's string representation in `explain` command and UI. However, this is a bug fix.
### How was this patch tested?
Pass the CI with the newly added test case.
Closes#30190 from dongjoon-hyun/SPARK-33292.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In ANSI mode, when a division by zero occurs performing a divide-like operation (Divide, IntegralDivide, Remainder or Pmod), we are returning an incorrect value. Instead, we should throw an exception, as stated in the SQL standard.
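For example, a small sketch of the intended ANSI-mode behavior (using the existing `spark.sql.ansi.enabled` flag):
```scala
// With ANSI mode enabled, divide-like operations on a zero divisor should raise an error
// instead of returning an incorrect value.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT 1 / 0").collect()   // expected: ArithmeticException (divide by zero)
spark.sql("SELECT 5 % 0").collect()   // Remainder: same expectation
```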
### Why are the changes needed?
The results are corrupted.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT + existing UTs (improved).
Closes#29882 from luluorta/SPARK-33008.
Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The "external block store" API was removed in SPARK-12667, so `externalBlockStoreSize` in `RDDInfo` is always 0 and useless. This PR just removes this useless variable.
### Why are the changes needed?
remove useless variable.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#30179 from LuciferYang/SPARK-12667-FOLLOWUP.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch proposes to make StateStore compression codec configurable.
### Why are the changes needed?
Currently the compression codec of StateStore is not configurable and is hard-coded to lz4. It is better if we can follow other Spark modules and make the compression codec of StateStore configurable. For example, we can choose the zstd codec, which is configurable with different compression levels.
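A hedged sketch of how the new setting might be used (the config key is assumed from this PR; check the structured streaming docs for the final name):
```scala
// Sketch only: switch the StateStore codec from the previously hard-coded lz4 to zstd.
spark.conf.set("spark.sql.streaming.stateStore.compression.codec", "zstd")
```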
### Does this PR introduce _any_ user-facing change?
Yes, after this change users can configure a different codec for StateStore.
### How was this patch tested?
Unit test.
Closes#30162 from viirya/SPARK-33263.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Return schema in SQL format instead of Catalog string from the SchemaOfCsv expression.
### Why are the changes needed?
To unify output of the `schema_of_json()` and `schema_of_csv()`.
### Does this PR introduce _any_ user-facing change?
Yes, but `schema_of_csv()` is usually used in combination with `from_csv()`, so the schema format shouldn't matter much.
Before:
```
> SELECT schema_of_csv('1,abc');
struct<_c0:int,_c1:string>
```
After:
```
> SELECT schema_of_csv('1,abc');
STRUCT<`_c0`: INT, `_c1`: STRING>
```
### How was this patch tested?
By existing test suites `CsvFunctionsSuite` and `CsvExpressionsSuite`.
Closes#30180 from MaxGekk/schema_of_csv-sql-schema.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Return schema in SQL format instead of Catalog string from the `SchemaOfJson` expression.
### Why are the changes needed?
In some cases, `from_json()` cannot parse schemas returned by `schema_of_json`, for instance, when JSON fields have spaces (gaps). Such fields will be quoted after the changes, and can be parsed by `from_json()`.
Here is the example:
```scala
val in = Seq("""{"a b": 1}""").toDS()
in.select(from_json('value, schema_of_json("""{"a b": 100}""")) as "parsed")
```
raises the exception:
```
== SQL ==
struct<a b:bigint>
------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:76)
at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:131)
at org.apache.spark.sql.catalyst.expressions.ExprUtils$.evalTypeExpr(ExprUtils.scala:33)
at org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:537)
at org.apache.spark.sql.functions$.from_json(functions.scala:4141)
```
### Does this PR introduce _any_ user-facing change?
Yes. For example, `schema_of_json` for the input `{"col":0}`.
Before: `struct<col:bigint>`
After: ``STRUCT<`col`: BIGINT>``
### How was this patch tested?
By existing test suites `JsonFunctionsSuite` and `JsonExpressionsSuite`.
Closes#30172 from MaxGekk/schema_of_json-sql-schema.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
For complex query plans, `QueryPlan.transformUpWithNewOutput` will keep accumulating the attributes mapping to be propagated, which may hurt performance. This PR prunes the attributes mapping before propagating.
### Why are the changes needed?
A simple perf improvement.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests
Closes#30173 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix the NPE issue on the `In` filter when one of the values contains null. In a real case, you can trigger this issue when you try to push down a filter with `in (..., null)` against a V2 source table. `DataSourceStrategy` caches the mapping (filter instance -> expression) in a HashMap, which leverages the hash code of the key, hence it could trigger the NPE issue.
### Why are the changes needed?
This is an obvious bug, as the `In` filter doesn't take null values into account when calculating its hash code.
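A minimal reproduction sketch of the hashing problem on the public `In` filter:
```scala
import org.apache.spark.sql.sources.In

// DataSourceStrategy keys a HashMap by the filter instance, so hashCode() gets invoked on it.
val filter = In("col", Array[Any](1, 2, null))
filter.hashCode()   // threw NullPointerException before this PR; returns a hash afterwards
```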
### Does this PR introduce _any_ user-facing change?
Yes. Previously, a query having `null` in an "in" condition against a data source V2 table supporting filter push-down failed with an NPE, whereas after the PR the query will not fail.
### How was this patch tested?
UT added. The new UT fails without the PR and passes with the PR.
Closes#30170 from HeartSaVioR/SPARK-33267.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR intends to fix bugs for casting data from/to PythonUserDefinedType. A sequence of queries to reproduce this issue is as follows;
```
>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import col
>>> from pyspark.sql.types import *
>>> from pyspark.testing.sqlutils import *
>>>
>>> row = Row(point=ExamplePoint(1.0, 2.0))
>>> df = spark.createDataFrame([row])
>>> df.select(col("point").cast(PythonOnlyUDT()))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/dataframe.py", line 1402, in select
jdf = self._jdf.select(self._jcols(*cols))
File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o44.select.
: java.lang.NullPointerException
at org.apache.spark.sql.types.UserDefinedType.acceptsType(UserDefinedType.scala:84)
at org.apache.spark.sql.catalyst.expressions.Cast$.canCast(Cast.scala:96)
at org.apache.spark.sql.catalyst.expressions.CastBase.checkInputDataTypes(Cast.scala:267)
at org.apache.spark.sql.catalyst.expressions.CastBase.resolved$lzycompute(Cast.scala:290)
at org.apache.spark.sql.catalyst.expressions.CastBase.resolved(Cast.scala:290)
```
A root cause of this issue is that, since `PythonUserDefinedType#userClass` is always null, `isAssignableFrom` in `UserDefinedType#acceptsType` throws a NullPointerException. To fix it, this PR defines `acceptsType` in `PythonUserDefinedType` and filters out the null case in `UserDefinedType#acceptsType`.
### Why are the changes needed?
Bug fixes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests.
Closes#30169 from maropu/FixPythonUDTCast.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Change-Id: I82db1f9e8f667573aa3a03e05152cbed0ea7686b
### What changes were proposed in this pull request?
Update the document of SparkSession#sql, mention that this API eagerly runs DDL/DML commands, but not for SELECT queries.
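A small sketch of the documented behavior:
```scala
// DDL/DML commands are executed eagerly by SparkSession#sql ...
spark.sql("CREATE TABLE t (id INT) USING parquet")   // runs immediately
// ... whereas SELECT queries are lazy and only run when an action is triggered.
val df = spark.sql("SELECT * FROM t")                // builds a DataFrame, nothing executes yet
df.show()                                            // the SELECT runs here
```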
### Why are the changes needed?
To clarify the behavior of SparkSession#sql.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Not needed.
Closes#30168 from waitinfuture/master.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
After the SBT upgrade to 1.4.0 and above, there is always a ".bsp" directory after sbt starts:
https://github.com/sbt/sbt/releases/tag/v1.4.0
This PR is to put the directory in to `.gitignore`.
### Why are the changes needed?
The ".bsp" directory shows up as an untracked file in git during development, which is annoying.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual local test
Closes#30171 from gengliangwang/ignoreBSP.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Spark SQL supports some window function like `NTH_VALUE`.
If we specify a window frame like `UNBOUNDED PRECEDING AND CURRENT ROW` or `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`, we can eliminate some calculations.
For example: if we execute the SQL show below:
```
SELECT NTH_VALUE(col, 2) OVER (
  ORDER BY rank
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM tab;
```
For rows whose row number is greater than 1, the output is a fixed value; otherwise, the output is null. So we just need to calculate the value once and check whether the row number is less than 2.
`UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` is simpler.
### Why are the changes needed?
Improve the performance for `NTH_VALUE`, `FIRST_VALUE` and `LAST_VALUE`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Jenkins test.
Closes#29800 from beliefer/optimize-nth_value.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to fix a correctness bug in the optimizer rule `EliminateSorts`. It also adds a new physical rule to remove redundant sorts that cannot be eliminated in the Optimizer rule after the bugfix.
### Why are the changes needed?
A global sort should not be eliminated even if its child is ordered since we don't know if its child ordering is global or local. For example, in the following scenario, the first sort shouldn't be removed because it has a stronger guarantee than the second sort even if the sort orders are the same for both sorts.
```
Sort(orders, global = True, ...)
Sort(orders, global = False, ...)
```
Since there is no straightforward way to identify whether a node's output ordering is local or global, we should not remove a global sort even if its child is already ordered.
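A hedged DataFrame-level sketch of the scenario (the inner sort is local, the outer one must stay global):
```scala
val df = spark.range(100).repartition(4)
val sorted = df
  .sortWithinPartitions("id")   // local sort: rows are ordered only within each partition
  .orderBy("id")                // global sort: must not be eliminated, or the total order is lost
```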
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
Unit tests
Closes#30093 from allisonwang-db/fix-sort.
Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to migrate `DROP TABLE` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
The current behavior is not consistent between v1 and v2 commands when resolving a temp view.
In v2, the `t` in the following example is resolved to a table:
```scala
sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE testcat.ns")
sql("DROP TABLE t") // 't' is resolved to testcat.ns.t
```
whereas in v1, the `t` is resolved to a temp view:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("DROP TABLE t") // 't' is resolved to a temp view
```
### Does this PR introduce _any_ user-facing change?
After this PR, for v2, `DROP TABLE t` is resolved to a temp view `t` instead of `testcat.ns.t`, consistent with v1 behavior.
### How was this patch tested?
Added a new test
Closes#30079 from imback82/drop_table_consistent.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch proposes to change the behavior on failing fast when Spark fails to instantiate configured v2 session catalog.
### Why are the changes needed?
The Spark behavior is against the intention of the end users: if end users configure a session catalog which Spark fails to initialize, Spark swallows the error, only logging the error message, and silently uses the default catalog implementation.
This follows the voices on [discussion thread](https://lists.apache.org/thread.html/rdfa22a5ebdc4ac66e2c5c8ff0cd9d750e8a1690cd6fb456d119c2400%40%3Cdev.spark.apache.org%3E) in dev mailing list.
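A hedged sketch of the configuration involved (`spark.sql.catalog.spark_catalog` is the key for the v2 session catalog; the class name below is hypothetical):
```scala
import org.apache.spark.sql.SparkSession

// If com.example.BrokenCatalog cannot be instantiated, Spark now fails immediately
// instead of silently falling back to the default session catalog implementation.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.spark_catalog", "com.example.BrokenCatalog")
  .getOrCreate()
```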
### Does this PR introduce _any_ user-facing change?
Yes. After the PR Spark will fail immediately if Spark fails to instantiate configured session catalog.
### How was this patch tested?
New UT added.
Closes#30147 from HeartSaVioR/SPARK-33240.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to add a dedicated page for SQL-on-file in SQL documents.
This comes from the comment: https://github.com/apache/spark/pull/30095/files#r508965149
### Why are the changes needed?
For better documentations.
### Does this PR introduce _any_ user-facing change?
<img width="544" alt="Screen Shot 2020-10-28 at 9 56 59" src="https://user-images.githubusercontent.com/692303/97378051-c1fbcb80-1904-11eb-86c0-a88c5269d41c.png">
### How was this patch tested?
N/A
Closes#30165 from maropu/DocForFile.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR adds the following `Column` methods to R API:
- asc_nulls_first
- asc_nulls_last
- desc_nulls_first
- desc_nulls_last
### Why are the changes needed?
Feature parity.
### Does this PR introduce _any_ user-facing change?
No, new methods.
### How was this patch tested?
New unit tests.
Closes#30159 from zero323/SPARK-33258.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The documentation of the Spark SQL null semantics states that "NULL AND False" yields NULL. This is incorrect. "NULL AND False" yields False.
```
Seq[(java.lang.Boolean, java.lang.Boolean)](
(null, false)
)
.toDF("left_operand", "right_operand")
.withColumn("AND", 'left_operand && 'right_operand)
.show(truncate = false)
+------------+-------------+-----+
|left_operand|right_operand|AND |
+------------+-------------+-----+
|null |false |false|
+------------+-------------+-----+
```
I propose the documentation be updated to reflect that "NULL AND False" yields False.
This contribution is my original work and I license it to the project under the project’s open source license.
### Why are the changes needed?
This change improves the accuracy of the documentation.
### Does this PR introduce _any_ user-facing change?
Yes. This PR introduces a fix to the documentation.
### How was this patch tested?
Since this is only a documentation change, no tests were added.
Closes#30161 from stwhit/SPARK-33246.
Authored-by: Stuart White <stuart@spotright.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
The following query produces incorrect results. The query has two essential features: (1) it contains a string aggregate, resulting in a `SortExec` node, and (2) it contains a duplicate grouping key, causing `RemoveRepetitionFromGroupExpressions` to produce a sort order stored as a `Stream`.
```sql
SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
FROM table_4
GROUP BY bigint_col_1, bigint_col_9, bigint_col_9
```
When the sort order is stored as a `Stream`, the line `ordering.map(_.child.genCode(ctx))` in `GenerateOrdering#createOrderKeys()` produces unpredictable side effects to `ctx`. This is because `genCode(ctx)` modifies `ctx`. When ordering is a `Stream`, the modifications will not happen immediately as intended, but will instead occur lazily when the returned `Stream` is used later.
Similar bugs have occurred at least three times in the past: https://issues.apache.org/jira/browse/SPARK-24500, https://issues.apache.org/jira/browse/SPARK-25767, https://issues.apache.org/jira/browse/SPARK-26680.
The fix is to check if `ordering` is a `Stream` and force the modifications to happen immediately if so.
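A small sketch of the underlying Scala `Stream` pitfall, independent of the Spark internals:
```scala
// Mapping over a Stream defers its side effects, analogous to genCode(ctx) mutating ctx lazily.
var sideEffects = 0
val ordering = Stream(1, 2, 3)
val mapped = ordering.map { x => sideEffects += 1; x * 2 }
println(sideEffects)   // 1 — only the head has been forced so far
mapped.toList          // forcing the Stream up front is the essence of the fix
println(sideEffects)   // 3 — all side effects have now happened
```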
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added a unit test for `SortExec` where `sortOrder` is a `Stream`. The test previously failed and now passes.
Closes#30160 from ankurdave/SPARK-33260.
Authored-by: Ankur Dave <ankurdave@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Make pod allocation executor timeouts configurable. Keep all known pods in mind when allocating executors to avoid over-allocating if the pending time is much higher than the allocation interval.
This PR increases the default wait time to 600s from the current 60s.
Since nodes can now remain "pending" for long periods of time, we allow additional batches to be scheduled during pending allocation but take the total number of pods into account.
### Why are the changes needed?
The current executor timeouts do not match those of all real-world clusters, especially under load. While this can be worked around by increasing the allocation batch delay, that will decrease the speed at which the total number of executors can be requested.
The increase in default timeout is needed to handle real-world testing environments I've encountered on moderately busy clusters and K8s clusters with their own underlying dynamic scale-up of hardware (e.g. GKE, EKS, etc.)
### Does this PR introduce _any_ user-facing change?
Yes, a new configuration property.
### How was this patch tested?
Updated existing test to use the timeout from the new configuration property. Verified test failed without the update.
Closes#30155 from holdenk/SPARK-33231-make-pod-creation-timeout-configurable.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Override the default SQL strings in Postgres Dialect for:
- ALTER TABLE UPDATE COLUMN TYPE
- ALTER TABLE UPDATE COLUMN NULLABILITY
Add new docker integration test suite `jdbc/v2/PostgreSQLIntegrationSuite.scala`
### Why are the changes needed?
Supports Postgres-specific ALTER TABLE syntax.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add new test `PostgreSQLIntegrationSuite`
Closes#30089 from huaxingao/postgres_docker.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Extract methods related to handling Aliases to a trait.
### Why are the changes needed?
Avoid code duplication
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UTs cover this
Closes#30134 from tanelk/SPARK-33225_aliasHelper.
Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Since Issue [SPARK-33139](https://issues.apache.org/jira/browse/SPARK-33139) has been done, and SQLConf.get and SparkSession.active are more reliable, we are trying to refine the existing code that passes SQLConf and SparkSession into subclasses of Rule[QueryPlan].
In this PR (see the sketch after this list):
* remove SQLConf from the constructor parameters of all subclasses of Rule[QueryPlan];
* use SQLConf.get to replace the original SQLConf instance;
* remove SparkSession from the constructor parameters of all subclasses of Rule[QueryPlan];
* use SparkSession.active to replace the original SparkSession instance.
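A hedged sketch of the refactor pattern (the rule names are illustrative, not actual Spark rules):
```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.internal.SQLConf

// Before: the conf is threaded through the constructor.
case class ExampleRuleBefore(sqlConf: SQLConf) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    if (sqlConf.caseSensitiveAnalysis) plan else plan
}

// After: the rule reads the active conf via SQLConf.get, so no constructor parameter is needed.
object ExampleRuleAfter extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    if (SQLConf.get.caseSensitiveAnalysis) plan else plan
}
```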
### Why are the changes needed?
Code refine.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing test
Closes#30097 from leanken/leanken-SPARK-33140.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch separates the view permission checks from getAppUi in FsHistoryServerProvider, thus enabling SHS to check the view permissions of a given attempt for a given user without rebuilding the UI. This is achieved by adding a method `checkUIViewPermissions(appId: String, attemptId: Option[String], user: String): Boolean` to many layers of history server components. Currently, this feature is useful for event log download.
### Why are the changes needed?
Right now, when we want to download the event logs from the Spark history server, SHS needs to parse the entire event log to rebuild the UI, and this is just for view permission checks. UI rebuilding is a time-consuming and memory-intensive task, especially for large logs. However, this process is unnecessary for event log download. With this patch, the UI rebuild can be skipped when downloading event logs from the history server. Thus the time to download a GB-scale event log can be reduced from several minutes to several seconds, and the memory consumption of UI rebuilding can be avoided.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added test cases to confirm the view permission checks work properly and download event logs won't trigger UI loading. Also did some manual tests to verify the download speed can be drastically improved and the authentication works properly.
Closes#30126 from baohe-zhang/bypass_ui_rebuild_for_log_download.
Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to initiate the migration to NumPy documentation style (from reST style) in PySpark docstrings.
This PR also adds one migration example of `SparkContext`.
- **Before:**
...
![Screen Shot 2020-10-26 at 7 02 05 PM](https://user-images.githubusercontent.com/6477701/97161090-a8ea0200-17c0-11eb-8204-0e70d18fc571.png)
...
![Screen Shot 2020-10-26 at 7 02 09 PM](https://user-images.githubusercontent.com/6477701/97161100-aab3c580-17c0-11eb-92ad-f5ad4441ce16.png)
...
- **After:**
...
![Screen Shot 2020-10-26 at 7 24 08 PM](https://user-images.githubusercontent.com/6477701/97161219-d636b000-17c0-11eb-80ab-d17a570ecb4b.png)
...
See also https://numpydoc.readthedocs.io/en/latest/format.html
### Why are the changes needed?
There are many reasons for switching to NumPy documentation style.
1. Arguably reST style doesn't fit well when the docstring grows large because it provides (arguably) less structure and syntax.
2. NumPy documentation style provides a better human readable docstring format. For example, notebook users often just do `help(...)` by `pydoc`.
3. NumPy documentation style is pretty commonly used in data science libraries, for example, pandas, numpy, Dask, Koalas, matplotlib, ... Using NumPy documentation style can give users a consistent documentation style.
### Does this PR introduce _any_ user-facing change?
The dependency itself doesn't change anything user-facing.
The documentation change in `SparkContext` does, as shown above.
### How was this patch tested?
Manually tested via running `cd python` and `make clean html`.
Closes#30149 from HyukjinKwon/SPARK-33243.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- [x] Expand dictionary definitions into standalone functions.
- [x] Fix annotations for ordering functions.
### Why are the changes needed?
To simplify further maintenance of docstrings.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#30143 from zero323/SPARK-32084.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to document the APIs in `Column` as well in API reference of PySpark documentation.
### Why are the changes needed?
To document common APIs in PySpark.
### Does this PR introduce _any_ user-facing change?
Yes, `Column.*` will be shown in API reference page.
### How was this patch tested?
Manually tested via `cd python` and `make clean html`.
Closes#30150 from HyukjinKwon/SPARK-32188.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In the current Spark script transformation with Hive serde mode, in the schema-less case, the result is different from Hive's.
This PR keeps the result the same as Hive's script transform with serde.
#### Hive script transform with serde in the schema-less case
```
hive> create table t (c0 int, c1 int, c2 int);
hive> INSERT INTO t VALUES (1, 1, 1);
hive> INSERT INTO t VALUES (2, 2, 2);
hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;
hive> DESCRIBE v;
key string
value string
hive> SELECT * FROM v;
1 1 1
2 2 2
hive> SELECT key FROM v;
1
2
hive> SELECT value FROM v;
1 1
2 2
```
#### Spark script transform with Hive serde in the schema-less case
```
hive> create table t (c0 int, c1 int, c2 int);
hive> INSERT INTO t VALUES (1, 1, 1);
hive> INSERT INTO t VALUES (2, 2, 2);
hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;
hive> SELECT * FROM v;
1 1
2 2
```
**No-serde mode in Hive (ROW FORMAT DELIMITED)**
![image](https://user-images.githubusercontent.com/46485123/90088770-55841e00-dd52-11ea-92dd-7fe52d93f0b3.png)
### Why are the changes needed?
Keep the same behavior as Hive script transform.
### Does this PR introduce _any_ user-facing change?
Before this PR, with Hive serde script transform:
```
select transform(*)
USING 'cat'
from (
select 1, 2, 3, 4
) tmp
key value
1 2
```
After
```
select transform(*)
USING 'cat'
from (
select 1, 2, 3, 4
) tmp
key value
1 2 3 4
```
### How was this patch tested?
UT
Closes#29421 from AngersZhuuuu/SPARK-32388.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to use `hadoop-3.2` profile in K8s IT Jenkins jobs.
- [x] Switch the default value of `HADOOP_PROFILE` from `hadoop-2.7` to `hadoop-3.2`.
- [x] Remove `-Phadoop2.7` from Jenkins K8s IT job.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure
**BEFORE**
```
./dev/make-distribution.sh --name ${DATE}-${REVISION} --r --pip --tgz -DzincPort=${ZINC_PORT} \
-Phadoop-2.7 -Pkubernetes -Pkinesis-asl -Phive -Phive-thriftserver
```
**AFTER**
```
./dev/make-distribution.sh --name ${DATE}-${REVISION} --r --pip --tgz -DzincPort=${ZINC_PORT} \
-Pkubernetes -Pkinesis-asl -Phive -Phive-thriftserver
```
### Why are the changes needed?
Since Apache Spark 3.1.0, Hadoop 3 is the default.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Check the Jenkins K8s IT log and result.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34899/
```
+ /home/jenkins/workspace/SparkPullRequestBuilder-K8s/build/mvn clean package -DskipTests -DzincPort=4021 -Pkubernetes -Pkinesis-asl -Phive -Phive-thriftserver
Using `mvn` from path: /home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.6.3/bin/mvn
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
```
Closes#30153 from dongjoon-hyun/SPARK-33237.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This reinstates the old option `spark.sql.sources.write.jobUUID` to set a unique jobId in the jobconf so that Hadoop MR committers have a unique ID which is (a) consistent across tasks and workers and (b) not brittle compared to generated-timestamp job IDs. The latter match the format that JobID requires, but as they are generated per-thread, they may not always be unique within a cluster.
### Why are the changes needed?
If a committer (e.g s3a staging committer) uses job-attempt-ID as a unique ID then any two jobs started within the same second have the same ID, so can clash.
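A hedged sketch of how a custom committer could pick up the reinstated property from the job configuration:
```scala
import org.apache.hadoop.mapreduce.JobContext

// Sketch only: the key name is taken from the description above.
def jobUuid(context: JobContext): Option[String] =
  Option(context.getConfiguration.get("spark.sql.sources.write.jobUUID"))
```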
### Does this PR introduce _any_ user-facing change?
Good Q. It is "developer-facing" in the context of anyone writing a committer. But it reinstates a property which was in Spark 1.x and "went away".
### How was this patch tested?
Testing: no test here. You'd have to create a new committer which extracted the value in both job and task(s) and verified consistency. That is possible (with a task output whose records contained the UUID), but it would be pretty convoluted and a high maintenance cost.
Because it's trying to address a race condition, it's hard to regenerate the problem downstream and so verify a fix in a test run...I'll just look at the logs to see what temporary dir is being used in the cluster FS and verify it's a UUID
Closes#30141 from steveloughran/SPARK-33230-jobId.
Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
The page returned by /jobs in the Spark UI stores the detailed information of each job in JavaScript like this:
```javascript
{
'className': 'executor added',
'group': 'executors',
'start': new Date(1602834008978),
'content': '<div class="executor-event-content"' +
'data-toggle="tooltip" data-placement="top"' +
'data-title="Executor 3<br>' +
'Added at 2020/10/16 15:40:08"' +
'data-html="true">Executor 3 added</div>'
}
```
If an application has a failed job, the failure reason corresponding to the job will be stored in the `content` field in the JavaScript. If the failure reason contains the character **'**, the JavaScript code will throw an exception, causing the event timeline URL to stop responding. The following is an example of the broken JSON:
```javascript
{
'className': 'executor removed',
'group': 'executors',
'start': new Date(1602925908654),
'content': '<div class="executor-event-content"' +
'data-toggle="tooltip" data-placement="top"' +
'data-title="Executor 2<br>' +
'Removed at 2020/10/17 17:11:48' +
'<br>Reason: Container from a bad node: ... 20/10/17 16:00:42 WARN ShutdownHookManager: ShutdownHook **'$anon$2'** timeout..."' +
'data-html="true">Executor 2 removed</div>'
}
```
So we need to consider this special case: if the returned job info contains the character **'**, just remove it.
### Why are the changes needed?
Ensure that the UI page can function normally
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This PR only fixes an exception in a special case; manual test result as below:
![fixed](https://user-images.githubusercontent.com/52202080/96711638-74490580-13d0-11eb-93e0-b44d9ed5da5c.gif)
Closes#30119 from akiyamaneko/timeline_view_cannot_open.
Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This PR is to enable auto bucketed table scan by default, with the exception of disabling it only for cached queries (similar to AQE). The reason for disabling auto scan for cached queries is that the cached query's output partitioning can be leveraged later to avoid shuffle and sort when doing join and aggregate.
### Why are the changes needed?
Enabling auto bucketed table scan by default is useful, as it can optimize queries automatically under the hood, without user interaction.
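For reference, a hedged sketch of opting back out (the config key is assumed from the rule this PR enables):
```scala
// Sketch only: disable the now-default auto bucketed scan if the old behavior is needed.
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
```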
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test for cached query in `DisableUnnecessaryBucketedScanSuite.scala`. Also change a bunch of unit tests which should disable auto bucketed scan to make them work.
Closes#30138 from c21/enable-auto-bucket.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR aims to use a pre-built image for Github Action SparkR job.
### Why are the changes needed?
This will reduce the execution time and the flakiness.
**BEFORE (21 minutes 39 seconds)**
![Screen Shot 2020-10-16 at 1 24 43 PM](https://user-images.githubusercontent.com/9700541/96305593-fbeada80-0fb2-11eb-9b8e-86d8abaad9ef.png)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GitHub Action `sparkr` job in this PR.
Closes#30066 from dongjoon-hyun/SPARKR.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Make changes to `spark.sql.analyzer.maxIterations` take effect at runtime.
### Why are the changes needed?
`spark.sql.analyzer.maxIterations` is not a static conf. However, before this patch, changing `spark.sql.analyzer.maxIterations` at runtime does not take effect.
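For example, after this patch the following runtime update takes effect on subsequent analyses:
```scala
// spark.sql.analyzer.maxIterations is a runtime (non-static) conf, so this now applies immediately.
spark.conf.set("spark.sql.analyzer.maxIterations", "200")
```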
### Does this PR introduce _any_ user-facing change?
Yes. Before this patch, changing `spark.sql.analyzer.maxIterations` at runtime does not take effect.
### How was this patch tested?
modified unit test
Closes#30108 from yuningzh-db/dynamic-analyzer-max-iterations.
Authored-by: Yuning Zhang <yuning.zhang@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This is to support left semi join in stream-stream join. The implementation of left semi join is (mostly in `StreamingSymmetricHashJoinExec` and `SymmetricHashJoinStateManager`):
* For a left side input row, check if there's a match in the right side state store.
  * If there's a match, output the left side row, but do not put the row in the left side state store (no need to put it in the state store).
  * If there's no match, output nothing, but put the row in the left side state store (with the "matched" field set to false in the state store).
* For a right side input row, check if there's a match in the left side state store.
  * For all matched left rows in the state store with the "matched" field as false, output the rows and set their "matched" field to true. Only output the left side rows matched for the first time to guarantee left semi join semantics.
* State store eviction: evict rows from the left/right side state store below the watermark, same as inner join.
Note a followup optimization can be to evict matched left side rows from state store earlier, even when the rows are still above watermark. However this needs more change in `SymmetricHashJoinStateManager`, so will leave this as a followup.
### Why are the changes needed?
Current stream-stream join supports inner, left outer and right outer join (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166 ). Internally we do see a lot of users using left semi stream-stream join (outside of Spark Structured Streaming), e.g. I want to get the ad impressions (join left side) which have a click (join right side), but I don't care how many clicks there are per ad (left semi semantics).
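A hedged sketch of the ad impression/click example above expressed as a left semi stream-stream join (schemas, sources, and watermarks are assumed for illustration):
```scala
import org.apache.spark.sql.functions.expr

// The rate source is just a stand-in for real impression/click streams.
val impressions = spark.readStream.format("rate").load()
  .withColumnRenamed("value", "adId").withWatermark("timestamp", "10 seconds")
val clicks = spark.readStream.format("rate").load()
  .withColumnRenamed("value", "adId").withWatermark("timestamp", "10 seconds")

// Emit each impression at most once when a matching click arrives within one minute.
val impressionsWithClicks = impressions.alias("i").join(
  clicks.alias("c"),
  expr("i.adId = c.adId AND c.timestamp BETWEEN i.timestamp AND i.timestamp + interval 1 minute"),
  "left_semi")
```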
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit tests in `UnsupportedOperationChecker.scala` and `StreamingJoinSuite.scala`.
Closes#30076 from c21/stream-join.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
I am generating the SHA-512 using the standard `shasum`, which also has better output compared to GPG.
### Why are the changes needed?
This makes the hash much easier to verify for users that don't have GPG, because a user with GPG can check the keys, but a user without GPG will have a hard time validating the SHA-512 based on the 'pretty printed' format.
Apache Spark is the only project where I've seen this format. Most other Apache projects have a one-line hash file.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This patch assumes the build system has shasum (it should, but I can't test this).
Closes#30123 from emilianbold/master.
Authored-by: Emi <emilian.bold@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch adds new UTs to prevent SPARK-29438 from recurring for streaming aggregation as well as flatMapGroupsWithState, as agreed in the review comment quoted here:
https://github.com/apache/spark/pull/26162#issuecomment-576929692
> LGTM for this PR. But on an additional note, this is a very subtle and easy-to-make bug with TaskContext.getPartitionId. I wonder if this bug is present in any other stateful operation. Can you please verify how partitionId is used in the other stateful operations?
For now they're not broken, but even better if we have UTs to prevent the case for the future.
### Why are the changes needed?
New UTs will prevent streaming aggregation and flatMapGroupsWithState from being broken in the future when they are placed on the right side of a UNION and the number of partitions is changing on the left side of the UNION. Please refer to SPARK-29438 for more details.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UTs.
Closes#27333 from HeartSaVioR/SPARK-29438-add-regression-test.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
Handle executor failure with multiple containers
Added a spark property spark.kubernetes.executor.checkAllContainers, with default being false. When it's true, the executor snapshot will take all containers in the executor into consideration when deciding whether the executor is in "Running" state, if the pod restart policy is "Never". Also, added the new spark property to the doc.
### What changes were proposed in this pull request?
Checking of all containers in the executor pod when reporting executor status, if the `spark.kubernetes.executor.checkAllContainers` property is set to true.
### Why are the changes needed?
Currently, a pod remains "running" as long as there is at least one running container. This prevents Spark from noticing when a container has failed in an executor pod with multiple containers. With this change, user can configure the behavior to be different. Namely, if any container in the executor pod has failed, either the executor process or one of its sidecars, the pod is considered to be failed, and it will be rescheduled.
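A minimal sketch of enabling the new behavior via the property named above:
```scala
import org.apache.spark.SparkConf

// Consider every container in the executor pod (including sidecars) when judging executor state.
val conf = new SparkConf()
  .set("spark.kubernetes.executor.checkAllContainers", "true")
```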
### Does this PR introduce _any_ user-facing change?
Yes, new spark property added.
User is now able to choose whether to turn on this feature using the `spark.kubernetes.executor.checkAllContainers` property.
### How was this patch tested?
Unit test was added and all passed.
I tried to run integration test by following the instruction [here](https://spark.apache.org/developer-tools.html) (section "Testing K8S") and also [here](https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/README.md), but I wasn't able to run it smoothly as it fails to talk with minikube cluster. Maybe it's because my minikube version is too new (I'm using v1.13.1)...? Since I've been trying it for two days and still can't make it work, I decided to submit this PR and hopefully the Jenkins test will pass.
Closes#29924 from huskysun/exec-sidecar-failure.
Authored-by: Shiqi Sun <s.sun@salesforce.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
Add type hints guidelines to developer docs.
### Why are the changes needed?
Since it is a new and still somewhat evolving feature, we should provide clear guidelines for potential contributors.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Closes#30094 from zero323/SPARK-33003.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>