### What changes were proposed in this pull request?
Inline type hints for python/pyspark/sql/window.py
### Why are the changes needed?
Currently, stub files are used for type hints. However, statements within functions cannot be type-checked.
This PR inlines type hints for python/pyspark/sql/window.py so that statements within functions can be type-checked.
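For illustration only (not the actual diff), a minimal sketch of what inlining buys us: once the hints live in the `.py` file rather than a stub, a type checker can also verify statements inside the function body. The function below is made up for the example.
```python
# Hypothetical example, not code from window.py.
def rows_between(start: int, end: int) -> str:
    # With inlined hints, mypy can check this body against the declared types.
    frame = f"ROWS BETWEEN {start} AND {end}"
    return frame
```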
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#34173 from xinrong-databricks/inline_window.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
We found an issue where a user configured both AQE and push-based shuffle, but the job started to hang after running some stages. Thread dumps taken from the executors showed tasks still waiting to fetch shuffle blocks.
This PR proposes changes to fix the issue.
### What changes were proposed in this pull request?
Disable batch fetch when push-based shuffle is enabled.
### Why are the changes needed?
Without this patch, enabling both AQE and push-based shuffle can cause tasks to hang.
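For reference, a minimal sketch of the configuration combination that exposed the hang (values are illustrative and this is not a complete push-based shuffle setup):
```python
from pyspark.sql import SparkSession

# Illustrative: AQE and push-based shuffle enabled together.
spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")      # AQE
    .config("spark.shuffle.push.enabled", "true")      # push-based shuffle (client side)
    .config("spark.shuffle.service.enabled", "true")   # external shuffle service, required for push
    .getOrCreate()
)
```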
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested the PR in our environment with spark-shell, using the following queries:
sql("""SELECT CASE WHEN rand() < 0.8 THEN 100 ELSE CAST(rand() * 30000000 AS INT) END AS s_item_id, CAST(rand() * 100 AS INT) AS s_quantity, DATE_ADD(current_date(), - CAST(rand() * 360 AS INT)) AS s_date FROM RANGE(1000000000)""").createOrReplaceTempView("sales")
// Dynamically coalesce partitions
sql("""SELECT s_date, sum(s_quantity) AS q FROM sales GROUP BY s_date ORDER BY q DESC""").collect
Unit tests to be added.
Closes#34156 from zhouyejoe/SPARK-36892.
Authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR proposes to refactor typing logic for multi-index support that was mostly introduced in https://github.com/apache/spark/pull/34176
At a high level, the functions below were introduced:
```bash
_extract_types # renamed from `extract_types`
```
```
_is_named_params
_address_named_type_holders
_to_tuple_of_params
_convert_tuples_to_zip
_address_unnamed_type_holders
```
In this PR, they are simplified to the following:
```bash
_to_type_holders # renamed from `_extract_types`
```
```bash
_new_type_holders # merged from `_is_named_params`, etc.
```
### Why are the changes needed?
To make the codes easier to read.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Existing tests should cover them.
Closes#34181 from HyukjinKwon/SPARK-36711.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Make the val lazy wherever `isPushBasedShuffleEnabled` is invoked when it is a class instance variable, so it can happen after user-defined jars/classes in `spark.kryo.classesToRegister` are downloaded and available on executor-side, as part of the fix for the exception mentioned below.
- Add a flag `checkSerializer` to control whether we need to check that a serializer `supportsRelocationOfSerializedObjects` within `isPushBasedShuffleEnabled`, as part of the fix for the exception mentioned below. Specifically, we don't check this in `registerWithExternalShuffleServer()` in `BlockManager` and `createLocalDirsForMergedShuffleBlocks()` in `DiskBlockManager.scala`, as the same issue would arise otherwise.
- Move `instantiateClassFromConf` and `instantiateClass` from `SparkEnv` into `Utils`, in order to let `isPushBasedShuffleEnabled` leverage them for instantiating serializer instances.
### Why are the changes needed?
When a user sets classes for Kryo serialization via `spark.kryo.classesToRegister`, the exception below (or a similar one) is encountered in `isPushBasedShuffleEnabled`.
Reproduced the issue in our internal branch by launching spark-shell as:
```
spark-shell --spark-version 3.1.1 --packages ml.dmlc:xgboost4j_2.12:1.3.1 --conf spark.kryo.classesToRegister=ml.dmlc.xgboost4j.scala.Booster
```
```
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1911)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:393)
at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:83)
at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Failed to register classes with Kryo
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:183)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:230)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:171)
at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102)
at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109)
at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346)
at org.apache.spark.serializer.KryoSerializerInstance.getAutoReset(KryoSerializer.scala:446)
at org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects$lzycompute(KryoSerializer.scala:253)
at org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects(KryoSerializer.scala:249)
at org.apache.spark.util.Utils$.isPushBasedShuffleEnabled(Utils.scala:2584)
at org.apache.spark.MapOutputTrackerWorker.<init>(MapOutputTracker.scala:1109)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:322)
at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:205)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:442)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
... 4 more
Caused by: java.lang.ClassNotFoundException: ml.dmlc.xgboost4j.scala.Booster
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:217)
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$6(KryoSerializer.scala:174)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:173)
... 24 more
```
Registering user classes for Kryo serialization happens after serializer creation in SparkEnv. Serializer creation can happen in `isPushBasedShuffleEnabled`, which can be called in some places before SparkEnv is created. Also, as per analysis by JoshRosen, Kryo instantiation was probably failing because the added packages hadn't been downloaded to the executor yet (this code runs during executor startup, not task startup). The proposed change fixes this issue.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passed existing tests.
Tested this patch in our internal branch where user reported the issue. Issue is now not reproducible with this patch.
Closes#34158 from rmcyang/SPARK-33781-bugFix.
Lead-authored-by: Minchu Yang <minyang@minyang-mn3.linkedin.biz>
Co-authored-by: Minchu Yang <31781684+rmcyang@users.noreply.github.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
Change allowed input types of `Abs()` from:
```
NumericType + CalendarIntervalType + YearMonthIntervalType + DayTimeIntervalType
```
to
```
NumericType + YearMonthIntervalType + DayTimeIntervalType
```
### Why are the changes needed?
The changes make the error message clearer.
Before changes:
```sql
spark-sql> set spark.sql.legacy.interval.enabled=true;
spark.sql.legacy.interval.enabled true
spark-sql> select abs(interval -10 days -20 minutes);
21/10/05 09:11:30 ERROR SparkSQLDriver: Failed in [select abs(interval -10 days -20 minutes)]
java.lang.ClassCastException: org.apache.spark.sql.types.CalendarIntervalType$ cannot be cast to org.apache.spark.sql.types.NumericType
at org.apache.spark.sql.catalyst.util.TypeUtils$.getNumeric(TypeUtils.scala:77)
at org.apache.spark.sql.catalyst.expressions.Abs.numeric$lzycompute(arithmetic.scala:172)
at org.apache.spark.sql.catalyst.expressions.Abs.numeric(arithmetic.scala:169)
```
After:
```sql
spark.sql.legacy.interval.enabled true
spark-sql> select abs(interval -10 days -20 minutes);
Error in query: cannot resolve 'abs(INTERVAL '-10 days -20 minutes')' due to data type mismatch: argument 1 requires (numeric or interval day to second or interval year to month) type, however, 'INTERVAL '-10 days -20 minutes'' is of interval type.; line 1 pos 7;
'Project [unresolvedalias(abs(-10 days -20 minutes, false), None)]
+- OneRowRelation
```
### Does this PR introduce _any_ user-facing change?
No, because the original changes of https://github.com/apache/spark/pull/34169 haven't been released yet.
### How was this patch tested?
Manually checked in the command line, see examples above.
Closes#34183 from MaxGekk/fix-abs-input-types.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
1. Add the `SET CATALOG xxx` statement to change the current catalog.
2. Retain the `USE` statement for changing the current catalog.
3. Force loading of the new catalog when the current catalog is changed.
### Why are the changes needed?
ANSI SQL uses the `SET CATALOG XXX` statement to change the catalog.
[DISCUSS](https://github.com/apache/spark/pull/34030#issuecomment-925936538)
<img width="521" alt="set-catalog" src="https://user-images.githubusercontent.com/41178002/134658562-4e4dd879-b6e5-484c-9461-6345c3faaf2e.png">
### Does this PR introduce _any_ user-facing change?
Yes, users can use `SET CATALOG XXX` to change the current catalog.
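A rough usage sketch (assuming an active `spark` session as in the PySpark shell, and a made-up v2 catalog named `my_catalog` that has already been configured):
```python
# Hypothetical: requires a catalog registered via spark.sql.catalog.my_catalog=...
spark.sql("SET CATALOG my_catalog")
spark.sql("SELECT current_catalog()").show()
```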
### How was this patch tested?
Added unit test cases.
Closes#34096 from Peng-Lei/set-catalog-statement.
Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This bug was introduced by https://github.com/apache/spark/pull/33177
When checking overflow of the sum value in the average function, we should use the `sumDataType` instead of the input decimal type.
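For context, a sketch of the kind of query affected (run in the PySpark shell where `spark` is predefined; values are made up): the sum of many `Decimal(10, 0)` values needs a wider decimal type than the input, so the overflow check must use that wider sum data type.
```python
from decimal import Decimal
from pyspark.sql.types import DecimalType, StructField, StructType

schema = StructType([StructField("v", DecimalType(10, 0))])
# Ten rows whose sum no longer fits in Decimal(10, 0).
df = spark.createDataFrame([(Decimal(9_999_999_999),)] * 10, schema)
df.selectExpr("avg(v)").show()
```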
### Why are the changes needed?
fix a regression
### Does this PR introduce _any_ user-facing change?
Yes, the result was wrong before this PR.
### How was this patch tested?
a new test
Closes#34180 from cloud-fan/bug.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Inline type hints from `python/pyspark/sql/functions.pyi` to `python/pyspark/sql/functions.py`.
### Why are the changes needed?
Currently, there is a type hint stub file `python/pyspark/sql/functions.pyi` to show the expected types for functions, but we can also take advantage of static type checking within the functions by inlining the type hints.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing test.
Closes#34130 from xinrong-databricks/inline_functions.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Support multi-index in new syntax to specify index data type
### Why are the changes needed?
Support multi-index in new syntax to specify index data type
https://issues.apache.org/jira/browse/SPARK-36707
### Does this PR introduce _any_ user-facing change?
After this PR user can use
``` python
>>> ps.DataFrame[[int, int],[int, int]]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType]
>>> arrays = [[1, 1, 2], ['red', 'blue', 'red']]
>>> idx = pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
>>> pdf = pd.DataFrame([[1,2,3],[2,3,4],[4,5,6]], index=idx, columns=["a", "b", "c"])
>>> ps.DataFrame[pdf.index.dtypes, pdf.dtypes]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType]
>>> ps.DataFrame[[("index", int), ("index-2", int)], [("id", int), ("A", int)]]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType]
>>> ps.DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType]
```
### How was this patch tested?
Existing tests.
Closes#34176 from dchvn/SPARK-36711.
Authored-by: dchvn nguyen <dgd_contributor@viettel.com.vn>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue that ambiguous self join can't be detected if the left and right DataFrame are swapped.
This is an example.
```
val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
val df2 = df1.filter($"value" === "A2")
df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown.
df2.join(df1, df1("key1") === df2("key2")) // Ambiguous self join is not detected.
```
The root cause seems to be that the inner function `collectConflictPlans` in `DeduplicateRelations` doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`.
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes#34172 from sarutak/fix-deduplication-issue.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to handle ANSI interval types by the `Abs` expression, and the `abs()` function as a consequence of that:
- for positive and zero intervals, `ABS()` returns the input value unchanged,
- for the minimal supported values (`Int.MinValue` months for year-month intervals and `Long.MinValue` microseconds for day-time intervals), `ABS()` throws an arithmetic overflow exception,
- for other supported negative intervals, `ABS()` negates its input and returns a positive interval.
For example:
```sql
spark-sql> SELECT ABS(INTERVAL -'10-8' YEAR TO MONTH);
10-8
spark-sql> SELECT ABS(INTERVAL '-10 01:02:03.123456' DAY TO SECOND);
10 01:02:03.123456000
```
### Why are the changes needed?
To improve user experience with Spark SQL.
### Does this PR introduce _any_ user-facing change?
No, this PR just extends `ABS()` by supporting new types.
### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *ArithmeticExpressionSuite"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
```
Closes#34169 from MaxGekk/abs-ansi-intervals.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Inline type hints for python/pyspark/sql/catalog.py.
### Why are the changes needed?
Currently, the type hint stub file python/pyspark/sql/catalog.pyi is used. We can leverage static type checking by inlining the type hints.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing test.
Closes#34133 from xinrong-databricks/inline_catalog.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Inline type hints for conf.py and observation.py in python/pyspark/sql.
### Why are the changes needed?
Currently, there are type hint stub files (*.pyi) to show the expected types for functions, but we can also take advantage of static type checking within the functions by inlining the type hints.
### Does this PR introduce _any_ user-facing change?
No.
It has a DOC typo fix:
`Metrics are aggregation expressions, which are applied to the DataFrame while **is** is being`
is changed to
`Metrics are aggregation expressions, which are applied to the DataFrame while **it** is being`
### How was this patch tested?
Existing test.
Closes#34159 from xinrong-databricks/inline_conf_observation.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Use `SQLConf.resolver` for `caseSensitiveResolution`/`caseInsensitiveResolution` instead of adding a new method.
### Why are the changes needed?
remove redundant code
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing code
Closes#34171 from huaxingao/minor.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Fix `DataFrameGroupBy.apply` without shortcut.
Pandas' `DataFrameGroupBy.apply` sometimes behaves oddly when the UDF returns a `Series`, depending on whether there is only one group or more. E.g.:
```py
>>> pdf = pd.DataFrame(
... {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
... columns=["a", "b", "c"],
... )
>>> pdf.groupby('b').apply(lambda x: x['a'])
b
1 0 1
1 2
2 2 3
3 3 4
5 4 5
8 5 6
Name: a, dtype: int64
>>> pdf[pdf['b'] == 1].groupby('b').apply(lambda x: x['a'])
a 0 1
b
1 1 2
```
If there is only one group, it returns a "wide" `DataFrame` instead of a `Series`.
In our non-shortcut path, there is always only one group because the UDF runs via `groupby-applyInPandas`, so we will get a `DataFrame`, and then we should convert it to a `Series` ourselves.
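A pure-pandas sketch of the reshaping involved (illustrative only, not the actual pandas-on-Spark code path): with exactly one group the result comes back as a wide `DataFrame`, and something like `stack()` recovers the `Series` shape.
```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [1, 1, 1]})
res = pdf.groupby("b").apply(lambda x: x["a"])

# Single group -> wide DataFrame; convert it back to a Series ourselves.
if isinstance(res, pd.DataFrame):
    res = res.stack()
print(res)
```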
### Why are the changes needed?
`DataFrameGroupBy.apply` without shortcut could raise an exception when it returns `Series`.
```py
>>> ps.options.compute.shortcut_limit = 3
>>> psdf = ps.DataFrame(
... {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
... columns=["a", "b", "c"],
... )
>>> psdf.groupby("b").apply(lambda x: x["a"])
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements
```
### Does this PR introduce _any_ user-facing change?
The error above will be gone:
```py
>>> psdf.groupby("b").apply(lambda x: x["a"])
b
1 0 1
1 2
2 2 3
3 3 4
5 4 5
8 5 6
Name: a, dtype: int64
```
### How was this patch tested?
Added tests.
Closes#34160 from ueshin/issues/SPARK-36907/groupby-apply.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add a new test suite `ParquetVectorizedSuite` to provide more coverage on vectorized Parquet decoding logic, with different combinations on column index, dictionary, batch size, page size, etc.
To facilitate the test, this PR also refactors `SpecificParquetRecordReaderBase` and makes the Parquet row group reader pluggable.
### Why are the changes needed?
Currently `ParquetIOSuite` and `ParquetColumnIndexSuite` only test on the high-level API which is insufficient, especially after the introduction of column index support, for which we want to cover various combinations involving row ranges, first row index, batch size, page size, etc.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added new test suite.
Closes#34149 from sunchao/SPARK-36891-parquet-test.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Typo fix
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#34129 from fishmandev/patch-1.
Authored-by: Dmitriy Fishman <fishman.code@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Currently there are no speculation metrics available for Spark either at application/job/stage level. This PR is to add some basic speculation metrics for a stage when speculation execution is enabled.
This is similar to the existing stage-level metrics, tracking numTotal (total number of speculated tasks), numCompleted (total number of successful speculated tasks), numFailed (total number of failed speculated tasks), numKilled (total number of killed speculated tasks), etc.
This new set of metrics helps in further understanding the speculative execution feature in the context of the application and also helps in tuning the speculative execution config knobs.
Screenshot of Spark UI with speculation summary:
![Screen Shot 2021-09-22 at 12 12 20 PM](https://user-images.githubusercontent.com/8871522/135321311-db7699ad-f1ae-4729-afea-d1e2c4e86103.png)
Screenshot of Spark UI with API output:
![Screen Shot 2021-09-22 at 12 10 37 PM](https://user-images.githubusercontent.com/8871522/135321486-4dbb7a67-5580-47f8-bccf-81c758c2e988.png)
### Why are the changes needed?
Additional metrics for speculative execution.
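For reference, a minimal sketch of turning speculation on so these metrics get populated (threshold values are illustrative; these must be set before the application starts):
```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .set("spark.speculation", "true")            # enable speculative execution
    .set("spark.speculation.quantile", "0.75")   # fraction of tasks that must finish first
    .set("spark.speculation.multiplier", "1.5")  # how much slower than the median counts as slow
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```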
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests added and also deployed in our internal platform for quite some time now.
Lead-authored by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Co-authored by: Ron Hu <rhu@linkedin.com>
Co-authored by: Thejdeep Gudivada <tgudivada@linkedin.com>
Closes#33253 from venkata91/speculation-metrics.
Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR proposes implementing `MultiIndex.equal_levels`.
```python
>>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
>>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
>>> psmidx1.equal_levels(psmidx2)
True
>>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z"), ("a", "y")])
>>> psmidx2 = ps.MultiIndex.from_tuples([("a", "y"), ("b", "x"), ("c", "z"), ("c", "x")])
>>> psmidx1.equal_levels(psmidx2)
True
```
This was originally proposed in https://github.com/databricks/koalas/pull/1789, and all reviews in the original PR have been resolved.
### Why are the changes needed?
We should support the pandas API as much as possible for pandas-on-Spark module.
### Does this PR introduce _any_ user-facing change?
Yes, the `MultiIndex.equal_levels` API is available.
### How was this patch tested?
Unittests
Closes#34113 from itholic/SPARK-36435.
Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds the `INTERNAL_ERROR` error class and the `isInternalError` API to `SparkThrowable`.
Removes existing error classes that are internal-only and replaces them with `INTERNAL_ERROR`.
### Why are the changes needed?
Makes it easy for end-users to diagnose whether an error is an internal error. If an end-user encounters an internal error, it should be reported immediately. This also limits the number of error classes, making it easy to audit. We do not need high-quality error messages for internal errors, as they should not be exposed to the end-user.
### Does this PR introduce _any_ user-facing change?
Yes; this changes the error class in master.
### How was this patch tested?
Unit tests
Closes#34123 from karenfeng/internal-error-class.
Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to support reading and writing ANSI intervals from/to JSON datasources.
With this change, interval data is written in a literal form like `{"col":"INTERVAL '1-2' YEAR TO MONTH"}`.
For the reading part, we need to specify the schema explicitly like:
```
val readDF = spark.read.schema("col INTERVAL YEAR TO MONTH").json(...)
```
### Why are the changes needed?
For better usability. There should be no reason to prohibit reading/writing ANSI intervals from/to JSON datasources.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test. It covers both V1 and V2 sources.
Closes#34155 from sarutak/ansi-interval-json-source.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR adds PySpark API document of `session_window`.
The docstring of the function doesn't comply with the numpydoc format, so this PR also fixes it.
Further, the API document of `window` doesn't have a `Parameters` section, so it's also added in this PR.
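A small usage sketch of the function being documented (run in the PySpark shell where `spark` is predefined; data and column names are made up):
```python
from pyspark.sql import functions as F

# Count events per user within sessions that close after 5 minutes of inactivity.
events = spark.createDataFrame(
    [("2021-10-01 10:00:00", "u1"), ("2021-10-01 10:02:00", "u1")],
    ["ts", "user"],
).withColumn("ts", F.col("ts").cast("timestamp"))

events.groupBy(F.session_window("ts", "5 minutes"), "user").count().show(truncate=False)
```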
### Why are the changes needed?
To provide PySpark users with the API document of the newly added function.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran `make html` in `python/docs` and got the following docs.
[window]
![time-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963797-ce25b268-20ca-48e3-ac8d-cbcbd85ebb3e.png)
[session_window]
![session-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963853-dd9d8417-139b-41ee-9924-14544b1a91af.png)
Closes#34118 from sarutak/python-session-window-doc.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Support ILIKE (case insensitive LIKE) API on R.
### Why are the changes needed?
The ILIKE statement on the SQL interface is supported as of SPARK-36674.
This PR adds the R API for it.
### Does this PR introduce _any_ user-facing change?
Yes. Users can call ilike from R.
### How was this patch tested?
Unit tests.
Closes#34152 from yoda-mon/r-ilike.
Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
Currently `dropTempView` and `dropGlobalTempView` don't return a value, which conflicts with their docstring:
`Returns true if this view is dropped successfully, false otherwise.` It's also inconsistent with the same API in other languages.
The PR proposes a fix for that.
### Why are the changes needed?
Be consistent with API in other languages.
### Does this PR introduce _any_ user-facing change?
Yes.
#### From
```py
# dropTempView
>>> spark.createDataFrame([(1, 1)]).createTempView("my_table")
>>> spark.table("my_table").collect()
[Row(_1=1, _2=1)]
>>> spark.catalog.dropTempView("my_table")
>>> spark.catalog.dropTempView("my_table")
# dropGlobalTempView
>>> spark.createDataFrame([(1, 1)]).createGlobalTempView("my_table")
>>> spark.table("global_temp.my_table").collect()
[Row(_1=1, _2=1)]
>>> spark.catalog.dropGlobalTempView("my_table")
>>> spark.catalog.dropGlobalTempView("my_table")
```
#### To
```py
# dropTempView
>>> spark.createDataFrame([(1, 1)]).createTempView("my_table")
>>> spark.table("my_table").collect()
[Row(_1=1, _2=1)]
>>> spark.catalog.dropTempView("my_table")
True
>>> spark.catalog.dropTempView("my_table")
False
# dropGlobalTempView
>>> spark.createDataFrame([(1, 1)]).createGlobalTempView("my_table")
>>> spark.table("global_temp.my_table").collect()
[Row(_1=1, _2=1)]
>>> spark.catalog.dropGlobalTempView("my_table")
True
>>> spark.catalog.dropGlobalTempView("my_table")
False
```
### How was this patch tested?
Existing tests.
Closes#34147 from xinrong-databricks/fix_return.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Upgrade Mesos to 1.4.3.
### Why are the changes needed?
Fix CVE-2018-11793
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually
Closes#34144 from warrenzhu25/mesos.
Authored-by: Warren Zhu <warren.zhu25@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Add tests to `HashExpressionsSuite` for the `sha2` function with bit lengths of 0 and 512. Also add comments to clarify an existing ambiguous comment.
### Why are the changes needed?
Currently, the `sha2` function with bit lengths of `0` and `512` was not tested. This PR adds those tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Ran the sha2 test in `HashExpressionsSuite` to ensure added tests pass.
Closes#34145 from richardc-db/add_sha_tests.
Authored-by: Richard Chen <r.chen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade GitHub Action CI image to recover CRAN installation failure.
### Why are the changes needed?
Sometimes, the GitHub Action linter job fails:
- https://github.com/apache/spark/runs/3739748809
The new image has R 4.1.1 and will recover from the failure.
```
$ docker run -it --rm dongjoon/apache-spark-github-action-image:20210928 R --version
R version 4.1.1 (2021-08-10) -- "Kick Things"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass `GitHub Action`.
Closes#34138 from dongjoon-hyun/SPARK-36883.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to support reading and writing ANSI intervals from/to CSV datasources.
With this change, interval data is written in a literal form like `INTERVAL '1-2' YEAR TO MONTH`.
For the reading part, we need to specify the schema explicitly like:
```
val readDF = spark.read.schema("col INTERVAL YEAR TO MONTH").csv(...)
```
### Why are the changes needed?
For better usability. There should be no reason to prohibit reading/writing ANSI intervals from/to CSV datasources.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test. It covers both V1 and V2 sources.
Closes#34142 from sarutak/ansi-interval-csv-source.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
In the current code for YARN client mode, even when the user runs `yarn application -kill` to kill the application, the driver side still exits with code 0. Because of this behavior, the job scheduler cannot tell that the job did not succeed, and neither can the user.
In this case we should exit the program with a non-zero code.
### Why are the changes needed?
Make the application's status clearer to the scheduler/user.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Closes#33873 from AngersZhuuuu/SPDI-36624.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
When the exception is InvocationTargetException, get cause and stack trace.
### Why are the changes needed?
Currently, when UDF reflection fails, `InvocationTargetException` is thrown, but it is not a specific exception and hides the underlying cause.
```
Error in query: No handler for Hive UDF 'XXX': java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
manual test
Closes#33796 from cxzl25/SPARK-36550.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to take into account the SQL config `spark.sql.parquet.filterPushdown` while building the array of pushed down filters in v2 `ParquetScanBuilder`.
### Why are the changes needed?
Before the changes, `explain()` outputs some pushed filters even when filter pushdown to Parquet is disabled:
```
== Physical Plan ==
*(1) Filter (isnotnull(c0#7) AND (c0#7 = 1))
+- *(1) ColumnarToRow
+- BatchScan[c0#7] ParquetScan DataFilters: [isnotnull(c0#7), (c0#7 = 1)], ... PushedFilters: [IsNotNull(c0), EqualTo(c0,1)] RuntimeFilters: []
```
### Does this PR introduce _any_ user-facing change?
If users parse the output of `explain()`, this PR can affect them, but in general it shouldn't. Note that the PR affects DSv2 only.
### How was this patch tested?
By running new test in `ParquetV2FilterSuite`:
```
$ build/sbt "test:testOnly *ParquetV2FilterSuite"
```
Closes#34140 from MaxGekk/fix-parquet-v2-pushed-filters.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Split the plan into several groups; every child of a union is a new group
- Coalesce partitions for every group
### Why are the changes needed?
#### First Issue
The rule `CoalesceShufflePartitions` can only coalesce partitions if
* leaf node is ShuffleQueryStage
* all shuffle have same partition number
With `Union`, this assumption might break. Let's say we have a plan like:
```
Union
  HashAggregate
    ShuffleQueryStage
  FileScan
```
`CoalesceShufflePartitions` cannot optimize it, and the resulting partition number would be `shuffle partitions + FileScan partitions`, which can be quite large.
It's better to support partial optimization with `Union`.
#### Second Issue
The partition-coalescing formula uses the **sum** as the total input size, which is not friendly for unions; see
```
// ShufflePartitionsUtil.coalescePartitions
val totalPostShuffleInputSize = mapOutputStatistics.flatMap(_.map(_.bytesByPartitionId.sum)).sum
```
So for a case such as:
```
Union
  HashAggregate
    ShuffleQueryStage
  HashAggregate
    ShuffleQueryStage
```
The `CoalesceShufflePartitions` rule will return an unexpected partition number.
### Does this PR introduce _any_ user-facing change?
Probably yes, the resulting partition numbers might change.
### How was this patch tested?
Add test.
Closes#32084 from ulysses-you/SPARK-34980.
Lead-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Co-authored-by: ulysses <ulyssesyou18@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support ILIKE (case insensitive LIKE) API on Python.
### Why are the changes needed?
The ILIKE statement on the SQL interface is supported as of SPARK-36674.
This PR adds the Python API for it.
### Does this PR introduce _any_ user-facing change?
Yes. Users can call `ilike` from Python.
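A small usage sketch of the new API (run in the PySpark shell where `spark` is predefined; data is made up):
```python
df = spark.createDataFrame([("Spark",), ("spark",), ("flink",)], ["name"])

# Case-insensitive LIKE: matches both "Spark" and "spark".
df.filter(df.name.ilike("spark%")).show()
```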
### How was this patch tested?
Unit tests.
Closes#34135 from yoda-mon/python-ilike.
Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
Indexes are database objects created on one or more columns of a table, and are used to improve query performance. A detailed explanation of database indexes is here: https://en.wikipedia.org/wiki/Database_index
This PR adds `supportsIndex` interface that provides APIs to work with indexes.
### Why are the changes needed?
Many data sources support indexes to improve query performance. In order to take advantage of index support in data sources, this `supportsIndex` interface is added to let users create/drop an index, list indexes, etc.
### Does this PR introduce _any_ user-facing change?
yes, the following new APIs are added:
- createIndex
- dropIndex
- indexExists
- listIndexes
New SQL syntax:
```
CREATE [index_type] INDEX [index_name] ON [TABLE] table_name (column_index_property_list)[OPTIONS indexPropertyList]
column_index_property_list: column_name [OPTIONS(indexPropertyList)] [ , . . . ]
indexPropertyList: index_property_name = index_property_value [ , . . . ]
DROP INDEX index_name
```
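A hypothetical usage sketch of the syntax above (table, column, and index names are made up, and this assumes a data source that actually implements the new interface; only the interface itself is added in this PR):
```python
# Hypothetical: the underlying catalog/data source must support indexes.
spark.sql("CREATE INDEX people_idx ON TABLE people (name)")
spark.sql("DROP INDEX people_idx")
```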
### How was this patch tested?
Only the interface is added for now. Tests will be added along with the implementation.
Closes#33754 from huaxingao/index_interface.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch simply removes a few `unsupportedOperationCheck` in `TextSocketStreamSuite`.
### Why are the changes needed?
`unsupportedOperationCheck` is used to disable the check for unsupported operations. Since we are not testing unsupported operations, setting it in `TextSocketStreamSuite` was unnecessary and could cause unexpected errors by skipping the check.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#34132 from viirya/minor-test.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Inlines type hint files under `pyspark/sql/pandas` folder, except for `pyspark/sql/pandas/functions.pyi` and files under `pyspark/sql/pandas/_typing`.
- Since the file contains a lot of overloads, we should revisit and manage it separately.
- We can't inline files under `pyspark/sql/pandas/_typing` because it includes new syntax for type hints.
### Why are the changes needed?
Currently there are type hint stub files (`*.pyi`) to show the expected types for functions, but we can also take advantage of static type checking within the functions by inlining the type hints.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#34101 from ueshin/issues/SPARK-36846/inline_typehints.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/34104
`SHOW CURRENT NAMESPACE` is a very simple command that does not involve the v2 catalog API, does not need analysis, and has no children. We can simply use `RunnableCommand` to avoid defining a physical plan.
### Why are the changes needed?
code simplification
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes#34128 from cloud-fan/follow.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to pushdown filters with ANSI intervals as filter values to the parquet datasource. After the changes, filter values are pushed down with the following values via Filter API:
- `java.time.Period` for year-month filters
- `java.time.Duration` for day-time filters.
At the Parquet filter level we don't have info about Catalyst's types (`YearMonthIntervalType` and `DayTimeIntervalType`), only about the primitive Parquet types `INT32` and `INT64`. As a consequence, Spark has to convert filter values "dynamically" to `Int`/`Long` while building Parquet filters.
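A rough sketch of the value mapping involved (plain Python arithmetic, not the actual conversion code): year-month intervals become a month count (INT32) and day-time intervals a microsecond count (INT64).
```python
# Illustrative encoding only.
def year_month_to_int(years: int, months: int) -> int:
    return years * 12 + months  # e.g. INTERVAL '1-2' YEAR TO MONTH -> 14

def day_time_to_long(days: int, hours: int = 0, minutes: int = 0, seconds: int = 0) -> int:
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return total_seconds * 1_000_000  # microseconds, e.g. INTERVAL '1' DAY -> 86_400_000_000

print(year_month_to_int(1, 2), day_time_to_long(1))
```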
### Why are the changes needed?
The PR https://github.com/apache/spark/pull/34057 supported ANSI intervals in the Parquet datasource as INT32 (year-month interval) and INT64 (day-time interval), but filters with such values are not pushed down. So, compared to primitive types, ANSI intervals can suffer from worse read performance. This PR aims to solve the issue and achieve feature parity with other types.
### Does this PR introduce _any_ user-facing change?
No, the changes might only influence the performance of the Parquet datasource.
### How was this patch tested?
By running new tests in `ParquetFilterSuite`:
```
$ build/sbt clean "test:testOnly *ParquetV1FilterSuite"
$ build/sbt clean "test:testOnly *ParquetV2FilterSuite"
```
Closes#34115 from MaxGekk/interval-parquet-filter-pushdown.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`sha2(input, bit_length)` returns incorrect results when `bit_length == 224` for all inputs.
This error can be reproduced by running `spark.sql("SELECT sha2('abc', 224)").show()`, for instance, in spark-shell.
Spark currently returns
```
#\t}"4�"�B�w��U�*��你���l��
```
while the expected result is
```
23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7
```
This appears to happen because `MessageDigest.digest()` returns bytes intended to be interpreted as a `BigInt` rather than as a string. Thus, the output of `MessageDigest.digest()` must first be interpreted as a `BigInt` and then transformed into a hex string, rather than being interpreted directly as a hex string.
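As a quick cross-check of the expected digest, independent of Spark:
```python
import hashlib

# SHA-224 of "abc"; matches the expected value shown above.
print(hashlib.sha224(b"abc").hexdigest())
# 23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7
```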
### Why are the changes needed?
`sha2(input, bit_length)` with a `bit_length` input of `224` previously returned an incorrect result.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new test to `HashExpressionsSuite.scala` which previously failed and now passes.
Closes#34086 from richardc-db/sha224.
Authored-by: Richard Chen <r.chen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Add provided Guava dependency to `network-yarn` module.
### Why are the changes needed?
In Spark 3.1 and earlier the `network-yarn` module implicitly relied on Guava from the `hadoop-client` dependency. This was changed by SPARK-33212, where we moved to the shaded Hadoop client which no longer exposes the transitive Guava dependency. It stayed fine for a while since we were not using `createDependencyReducedPom`, so the module picked up the transitive dependency from `spark-network-common` instead. However, things started to break after SPARK-36835, where we restored `createDependencyReducedPom`, and now the build is no longer able to locate Guava classes:
```
build/mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn
...
[INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
[WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol
symbol: class VisibleForTesting
location: class org.apache.spark.network.yarn.YarnShuffleService
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested with the above `mvn` command and it's now passing.
Closes#34125 from sunchao/SPARK-36873.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Update `dev/test-dependencies.sh` so that it now covers all released Spark artifacts such as `hadoop-cloud`.
### Why are the changes needed?
Currently `dev/test-dependencies.sh` doesn't cover all released Spark modules. Therefore, if there is any dependency change in those modules (e.g., `hadoop-cloud`), we won't be able to detect it early.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes#34119 from sunchao/SPARK-36863-dependency.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch proposes to remove the broadcast variable in `InSubqueryExec`, which is used in DPP.
### Why are the changes needed?
Currently we include a broadcast variable in `InSubqueryExec`. We use it to hold the filtering-side query result of DPP. It looks weird because we don't use the result in executors but only need it in the driver during query planning. We already hold the original result, so basically we hold two copies of the query result at the moment.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test.
Closes#34051 from viirya/dpp-no-broadcast.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, JAVA_HOME may be improperly set to the path "/usr"; with this change, JAVA_HOME is fetched from the command "/usr/libexec/java_home" on macOS.
### Why are the changes needed?
Command "./build/mvn xxx" will be stuck on MacOS 11.4, because JAVA_HOME is set to path "/usr" improperly.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
`build/mvn -DskipTests package` passed on `macOS 11.5.2`.
Closes#34111 from copperybean/work.
Authored-by: copperybean <copperybean.zhang@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
For a LEFT SEMI or LEFT ANTI hash equi-join without an extra join condition, we only need to keep one row per unique join key inside the hash table (`HashedRelation`) when building it. This helps reduce the size of the join's hash table.
This PR adds the optimization in `UnsafeHashedRelation` for broadcast hash join and shuffled hash join. The optimization for `LongHashedRelation` will be added later, because it needs more changes to the underlying hash table data structure `LongToUnsafeRowMap` to check whether a key already exists in the hash table.
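A sketch of the kind of query that benefits (run in the PySpark shell where `spark` is predefined; table and column names are made up): a LEFT SEMI join with no extra condition, where only the existence of a matching key matters, so one row per key in the build-side hash table is enough.
```python
orders = spark.createDataFrame([(1, "a"), (2, "b"), (2, "c")], ["cust_id", "item"])
customers = spark.createDataFrame([(1,), (2,), (3,)], ["cust_id"])

# Only key existence matters, so duplicate build-side keys (cust_id = 2)
# can be deduplicated in the hash table.
customers.join(orders, "cust_id", "left_semi").show()
```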
### Why are the changes needed?
Help reduce the hash table size of join for LEFT SEMI and LEFT ANTI.
This can increase the chance of broadcast join of these queries, and reduce OOM possibility of shuffled hash join.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test in `JoinSuite.scala`.
To help ease review (a lot of files changed due to unit test plan changes), the files below contain the real code changes:
* BroadcastExchangeExec.scala
* BroadcastHashJoinExec.scala
* HashJoin.scala
* HashedRelation.scala
* ShuffledHashJoinExec.scala
* JoinSuite.scala
* DebuggingSuite.scala
All other files change are generated with followed commands:
```
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite"
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite"
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilityWithStatsSuite"
```
Closes#34034 from c21/join-deduplicate.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This reverts commit aaa0d2a66b.
### Why are the changes needed?
This approach has 2 disadvantages:
1. It needs to disable `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly`.
2. The filtering side will be evaluated 2 times. For example: https://github.com/apache/spark/pull/29726#issuecomment-780266596
Instead, we can use bloom filter join pruning in the future.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#34116 from wangyum/revert-SPARK-32855.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
A follow up of https://github.com/apache/spark/pull/34054. Three things changed:
1. Add a test for extendable class `ColumnarBatch`
2. Make `ColumnarBatchRow` public.
3. Change private fields to protected fields.
### Why are the changes needed?
A follow up of https://github.com/apache/spark/pull/34054. The class `ColumnarBatch` needs to be extendable to support better vectorized reading in multiple data sources. For example, Iceberg needs to filter out deleted rows in a batch before Spark consumes it, to support row-level deletes (apache/iceberg#3141) in vectorized reads.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
A new test is added
Closes#34087 from flyrain/SPARK-36821.
Authored-by: Yufei Gu <yufei_gu@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
Migrate `ShowCurrentNamespaceStatement` to v2 command framework
### Why are the changes needed?
Migrate to the standard V2 framework
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes#34104 from huaxingao/namesp.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>