### What changes were proposed in this pull request?
This PR fixes an issue that `YarnClusterSuite` fails with `NoClassDefFoundError` unless the `hadoop-3.2` profile is activated explicitly, regardless of whether the build uses SBT or Maven.
```
build/sbt -Pyarn "yarn/testOnly org.apache.spark.deploy.yarn.YarnClusterSuite"
...
[info] YarnClusterSuite:
[info] org.apache.spark.deploy.yarn.YarnClusterSuite *** ABORTED *** (598 milliseconds)
[info] java.lang.NoClassDefFoundError: org/bouncycastle/operator/OperatorCreationException
[info] at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:888)
[info] at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
[info] at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1410)
[info] at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:344)
[info] at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
[info] at org.apache.hadoop.yarn.server.MiniYARNCluster.initResourceManager(MiniYARNCluster.java:359)
```
The solution is to modify `yarn/pom.xml` so that the `hadoop-3.2` profile is activated by default.
### Why are the changes needed?
The `hadoop-3.2` profile should be enabled by default so that `YarnClusterSuite` also finishes successfully without `-Phadoop-3.2`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran `YarnClusterSuite` with both SBT and Maven without `-Phadoop-3.2`, and it finished successfully.
```
build/sbt -Pyarn "yarn/testOnly org.apache.spark.deploy.yarn.YarnClusterSuite"
...
[info] Run completed in 5 minutes, 38 seconds.
[info] Total number of tests run: 27
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
build/mvn -Pyarn -pl resource-managers/yarn test -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite
...
Run completed in 5 minutes, 49 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```
Closes#33276 from sarutak/fix-bouncy-issue.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 9aa65a6e0e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change `currentPhysicalPlan.logicalLink.get` to `inputPlan.logicalLink.get` for initial logical link.
### Why are the changes needed?
At `initialPlan` we may remove some `SparkPlan` nodes with `queryStagePreparationRules`; if the removed node is the top-level node, we lose the linked logical node.
Since we support the AQE-side broadcast join config, it is now common that a join is planned as SMJ by the normal planner and changed to BHJ after AQE re-optimization. However, `RemoveRedundantSorts` is applied at `initialPlan`, before re-optimization, so a local sort might be removed incorrectly if a join starts as SMJ but changes to BHJ during re-optimization.
### Does this PR introduce _any_ user-facing change?
yes, bug fix
### How was this patch tested?
add test
Closes#33244 from ulysses-you/SPARK-36032.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 484b50cadf)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Merge test_decimal_ops into test_num_ops
- merge test_isnull() into test_num_ops.test_isnull()
- remove test_datatype_ops(), which is already covered in 11fcbc73cb/python/pyspark/pandas/tests/data_type_ops/test_base.py (L58-L59)
### Why are the changes needed?
Tests for data-type-based operations of decimal Series are in two places:
- python/pyspark/pandas/tests/data_type_ops/test_decimal_ops.py
- python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
We'd better merge test_decimal_ops into test_num_ops.
See also [SPARK-36002](https://issues.apache.org/jira/browse/SPARK-36002) .
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
unittests passed
Closes#33206 from Yikun/SPARK-36002.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit fdc50f4452)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
For tests with operations on different Series, sort index of results before comparing them with pandas.
### Why are the changes needed?
We have many tests with operations on different Series in `spark/python/pyspark/pandas/tests/data_type_ops/` that assume the result's index to be sorted and then compare to the pandas' behavior.
The assumption on the result's index ordering is wrong since a Spark DataFrame join is used internally and the order is not preserved if the data is in different partitions.
So we should assume the result to be disordered and sort the index of such results before comparing them with pandas.
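A minimal sketch of the test pattern, assuming the `compute.ops_on_diff_frames` option to allow operations across different Series (the data here is purely illustrative):
```python
import pandas as pd
import pyspark.pandas as ps

ps.set_option("compute.ops_on_diff_frames", True)  # allow operations across different Series

pser1, pser2 = pd.Series([1, 2, 3]), pd.Series([4, 5, 6])
psser1, psser2 = ps.from_pandas(pser1), ps.from_pandas(pser2)

# The underlying Spark join may return rows in any order, so sort the index before
# comparing with pandas instead of assuming the original ordering is preserved.
assert (psser1 + psser2).sort_index().to_pandas().equals(pser1 + pser2)
```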
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#33274 from xinrong-databricks/datatypeops_testdiffframe.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit af81ad0d7e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Try to capture the error message from the `faulthandler` when the Python worker crashes.
### Why are the changes needed?
Currently, we just see an error message saying `"exited unexpectedly (crashed)"` when a UDF causes the Python worker to crash, e.g., by a segmentation fault.
We should take advantage of [`faulthandler`](https://docs.python.org/3/library/faulthandler.html) and try to capture the error message from the `faulthandler`.
### Does this PR introduce _any_ user-facing change?
Yes, when a Spark config `spark.python.worker.faulthandler.enabled` is `true`, the stack trace will be seen in the error message when the Python worker crashes.
```py
>>> def f():
... import ctypes
... ctypes.string_at(0)
...
>>> sc.parallelize([1]).map(lambda x: f()).count()
```
```
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Fatal Python error: Segmentation fault
Current thread 0x000000010965b5c0 (most recent call first):
File "/.../ctypes/__init__.py", line 525 in string_at
File "<stdin>", line 3 in f
File "<stdin>", line 1 in <lambda>
...
```
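A minimal sketch of enabling the config when building the session (assuming a standard `SparkSession` setup; the config name is the one introduced by this PR):
```python
from pyspark.sql import SparkSession

# Enable faulthandler-based crash reporting for Python workers.
spark = (SparkSession.builder
         .config("spark.python.worker.faulthandler.enabled", "true")
         .getOrCreate())
```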
### How was this patch tested?
Added some tests, and manually.
Closes#33273 from ueshin/issues/SPARK-36062/faulthandler.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 115b8a180f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to remove the automatic build guides of documentation in `docs/README.md`.
### Why are the changes needed?
This doesn't work very well:
1. It doesn't detect changes in RST files. But PySpark internally generates RST files, so we can't simply include them in the detection; otherwise, it goes into an infinite loop.
2. During PySpark documentation generation, it launches some jobs to generate plot images now. This is broken with the `entr` command, and the job fails. It seems related to how `entr` creates the process internally.
3. Minor issue, but the documentation build directory was changed (`_build` -> `build` in `python/docs`).
I don't think it's worthwhile testing and fixing the docs to show a working example because dev people are already able to do it manually.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually tested.
Closes#33266 from HyukjinKwon/SPARK-36051.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a1ce64904f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Remove IntervalUnit
### Why are the changes needed?
Clean code
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Not needed.
Closes#33265 from AngersZhuuuu/SPARK-36049.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit fef7e1703c)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Support group by TimestampNTZ type column
### Why are the changes needed?
It's a basic SQL operation.
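A hedged sketch of the operation via `spark.sql` (assumes an existing `spark` session and the `TIMESTAMP_NTZ` literal syntax added elsewhere in this series; the values are illustrative only):
```python
# Group rows by a TIMESTAMP_NTZ column.
spark.sql("""
    SELECT ts, count(*) AS cnt
    FROM VALUES (TIMESTAMP_NTZ '2021-01-01 00:00:00'),
                (TIMESTAMP_NTZ '2021-01-01 00:00:00') AS t(ts)
    GROUP BY ts
""").show()
```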
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#33268 from gengliangwang/agg.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 382b66e267)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to standardize TypeError messages for unsupported basic operations by:
- Capitalize the first letter
- Leverage TypeError messages defined in `pyspark/pandas/data_type_ops/base.py`
- Take advantage of the utility `is_valid_operand_for_numeric_arithmetic` to avoid duplicating TypeError messages
Related unit tests should be adjusted as well.
### Why are the changes needed?
Inconsistent TypeError messages are shown for unsupported data-type-based basic operations.
Take addition's TypeError messages for example:
- addition can not be applied to given types.
- string addition can only be applied to string series or literals.
Standardizing TypeError messages would improve user experience and reduce maintenance costs.
### Does this PR introduce _any_ user-facing change?
No user-facing behavior change. Only TypeError messages are modified.
### How was this patch tested?
Unit tests.
Closes#33237 from xinrong-databricks/datatypeops_err.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 819c482498)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Currently, TimestampNTZ literals show only the long value instead of the timestamp string in their SQL string and toString results.
Before the changes (with the default timestamp type as TIMESTAMP_NTZ):
```
-- !query
select timestamp '2019-01-01\t'
-- !query schema
struct<1546300800000000:timestamp_ntz>
```
After changes:
```
-- !query
select timestamp '2019-01-01\t'
-- !query schema
struct<TIMESTAMP_NTZ '2019-01-01 00:00:00':timestamp_ntz>
```
### Why are the changes needed?
Make the schema of TimestampNTZ literals readable.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes#33269 from gengliangwang/ntzLiteralString.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit ee945e99cc)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
When executing the command `SHOW CREATE TABLE`, we should not lose the nullability info when a table column is specified as `NOT NULL`.
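An illustrative sketch of the scenario (hypothetical table name; assumes an existing `spark` session):
```python
# Create a table with a NOT NULL column and check that SHOW CREATE TABLE keeps the constraint.
spark.sql("CREATE TABLE t1 (id INT NOT NULL, name STRING) USING parquet")
spark.sql("SHOW CREATE TABLE t1").show(truncate=False)  # the NOT NULL on `id` should be preserved
```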
### Why are the changes needed?
[SPARK-36012](https://issues.apache.org/jira/browse/SPARK-36012)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a UT for V1; V2 is covered by existing UTs.
Closes#33219 from Peng-Lei/SPARK-36012.
Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e071721a51)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, AQE uses a very tricky way to trigger and wait for the subqueries:
1. submitting stage calls `QueryStageExec.materialize`
2. `QueryStageExec.materialize` calls `executeQuery`
3. `executeQuery` does some preparation works, which goes to `QueryStageExec.doPrepare`
4. `QueryStageExec.doPrepare` calls `prepare` of shuffle/broadcast, which triggers all the subqueries in this stage
5. `executeQuery` then calls `waitForSubqueries`, which does nothing because `QueryStageExec` itself has no subqueries
6. then we submit the shuffle/broadcast job, without waiting for subqueries
7. for `ShuffleExchangeExec.mapOutputStatisticsFuture`, it calls `child.execute`, which calls `executeQuery` and wait for subqueries in the query tree of `child`
8. The only missing case is: `ShuffleExchangeExec` itself may contain subqueries (repartition expressions) and AQE doesn't wait for them.
A simple fix would be overriding `waitForSubqueries` in `QueryStageExec` and forwarding the request to shuffle/broadcast, but this PR proposes a different and probably cleaner way: we follow `execute`/`doExecute` in `SparkPlan`, and add similar APIs in the AQE version of "execute", which gets a future from shuffle/broadcast.
### Why are the changes needed?
bug fix
### Does this PR introduce _any_ user-facing change?
a query fails without the fix and can run now
### How was this patch tested?
new test
Closes#33058 from cloud-fan/aqe.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2df67a1a1b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Refactors the base Throwable trait `SparkError.scala` (introduced in SPARK-34920) into an interface, `SparkThrowable.java`.
### Why are the changes needed?
- Renaming `SparkError` to `SparkThrowable` better reflects that this is the base interface for both `Exception` and `Error`
- Migrating to Java maximizes its extensibility
### Does this PR introduce _any_ user-facing change?
Yes; the base trait has been renamed and the accessor methods have changed (e.g. `sqlState` -> `getSqlState()`).
### How was this patch tested?
Unit tests.
Closes#33164 from karenfeng/SPARK-35958.
Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 71c086eb87)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add the implementation for the RocksDBStateStoreProvider. It's the subclass of StateStoreProvider that leverages all the functionalities implemented in the RocksDB instance.
### Why are the changes needed?
The interface for the end-user to use the RocksDB state store.
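A minimal sketch of how an application might opt in (assumes an existing `spark` session; the config key is the standard Structured Streaming state store provider setting and must be set before the streaming query starts):
```python
# Use the RocksDB-backed state store provider for stateful streaming queries.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
```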
### Does this PR introduce _any_ user-facing change?
Yes. New RocksDBStateStore can be used in their applications.
### How was this patch tested?
New UT added.
Closes#33187 from xuanyuanking/SPARK-35988.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 0621e78b5f)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Run end-to-end tests with default timestamp type as TIMESTAMP_NTZ to increase test coverage.
### Why are the changes needed?
Increase test coverage.
Also, there will be more and more expressions that have different behaviors when the default timestamp type is TIMESTAMP_NTZ, for example, `to_timestamp`, `from_json`, `from_csv`, and so on. Having this new test suite helps future development.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI tests.
Closes#33259 from gengliangwang/ntzTest.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 57342dfc1d)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
With more thought, it is better that all DT/YM functions use byte for the field to keep consistency.
### Why are the changes needed?
Keep the code consistent.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Not needed.
Closes#33252 from AngersZhuuuu/SPARK-36021-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 89aa16b4a8)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR fixes an issue about `extract`.
`Extract` should process only existing fields of interval types. For example:
```
spark-sql> SELECT EXTRACT(MONTH FROM INTERVAL '2021-11' YEAR TO MONTH);
11
spark-sql> SELECT EXTRACT(MONTH FROM INTERVAL '2021' YEAR);
0
```
The last command should fail because the month field is not present in INTERVAL YEAR.
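A hedged illustration of extractions over fields that do exist in each interval type (assumes an existing `spark` session; expected results shown as comments):
```python
# Extract only fields that are part of the interval type.
spark.sql("SELECT EXTRACT(YEAR FROM INTERVAL '2021' YEAR)").show()               # expect 2021
spark.sql("SELECT EXTRACT(MONTH FROM INTERVAL '2021-11' YEAR TO MONTH)").show()  # expect 11
```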
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes#33247 from sarutak/fix-extract-interval.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 39002cb995)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This is the followup from https://github.com/apache/spark/pull/33236#issuecomment-875242730, where we are fixing the config value of shuffled hash join, for all other test queries. Found all configs by searching in https://github.com/apache/spark/search?q=spark.sql.join.preferSortMergeJoin .
### Why are the changes needed?
Fix test to have better test coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33249 from c21/join-test.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 23943e5e40)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Make unary and comparison operators data-type-based. Refactored operators include:
- Unary operators: `__neg__`, `__abs__`, `__invert__`,
- Comparison operators: `>`, `>=`, `<`, `<=`, `==`, `!=`
Non-goal: the tasks below were inspired during the development of this PR.
[[SPARK-35997] Implement comparison operators for CategoricalDtype in pandas API on Spark](https://issues.apache.org/jira/browse/SPARK-35997)
[[SPARK-36000] Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled](https://issues.apache.org/jira/browse/SPARK-36000)
[[SPARK-36001] Assume result's index to be disordered in tests with operations on different Series](https://issues.apache.org/jira/browse/SPARK-36001)
[[SPARK-36002] Consolidate tests for data-type-based operations of decimal Series](https://issues.apache.org/jira/browse/SPARK-36002)
[[SPARK-36003] Implement unary operator `invert` of numeric ps.Series/Index](https://issues.apache.org/jira/browse/SPARK-36003)
### Why are the changes needed?
We have been refactoring basic operators to be data-type-based for readability, flexibility, and extensibility.
Unary and comparison operators are not data-type-based yet. We should fill the gaps.
### Does this PR introduce _any_ user-facing change?
Yes.
- Better error messages. For example,
Before:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b"2", b"3", b"4"])
>>> -psser
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(- `0`)' due to data type mismatch: ...
```
After:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b"2", b"3", b"4"])
>>> -psser
Traceback (most recent call last):
...
TypeError: Unary - can not be applied to binaries.
>>>
```
- Support unary `-` of `bool` Series. For example,
Before:
```py
>>> psser = ps.Series([True, False, True])
>>> -psser
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(- `0`)' due to data type mismatch: ...
```
After:
```py
>>> psser = ps.Series([True, False, True])
>>> -psser
0 False
1 True
2 False
dtype: bool
```
### How was this patch tested?
Unit tests.
Closes#33162 from xinrong-databricks/datatypeops_refactor.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 6e4e04f2a1)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
For the case:
```
spark-sql> select interval '123456:12' minute to second;
Error in query:
requirement failed: Interval string must match day-time format of '^(?<sign>[+|-])?(?<minute>\d{1,2}):(?<second>(\d{1,2})(\.(\d{1,9}))?)$': 123456:12, set spark.sql.legacy.fromDayTimeString.enabled to true to restore the behavior before Spark 3.0.(line 1, pos 16)
== SQL ==
select interval '123456:12' minute to second
----------------^^^
```
we should support hour/minute/second with more than 2 digits when parsing interval literal strings.
### Why are the changes needed?
Keep consistency.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#33231 from AngersZhuuuu/SPARK-36021.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit ea3333a200)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
The method `WindowSpecDefinition.isValidFrameType` doesn't consider `TimestampNTZType`. We should support it as for `TimestampType`.
### Why are the changes needed?
Support `TimestampNTZType` in the Window spec definition.
### Does this PR introduce _any_ user-facing change?
'Yes'. This PR allows users to use `TimestampNTZType` as the sort spec in a window spec definition.
### How was this patch tested?
New tests.
Closes#33246 from beliefer/SPARK-36015.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 62ff2add94)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
The current `ApproxCountDistinctForInterval` supports `TimestampType`, but does not support timestamp without time zone yet.
This PR will add that support.
### Why are the changes needed?
`ApproxCountDistinctForInterval` needs to support `TimestampNTZType`.
### Does this PR introduce _any_ user-facing change?
'Yes'. `ApproxCountDistinctForInterval` accepts `TimestampNTZType`.
### How was this patch tested?
New tests.
Closes#33243 from beliefer/SPARK-36016.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit be382a6285)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
The current `ApproximatePercentile` supports `TimestampType`, but does not support timestamp without time zone yet.
This PR will add that support.
### Why are the changes needed?
`ApproximatePercentile` needs to support `TimestampNTZType`.
### Does this PR introduce _any_ user-facing change?
'Yes'. `ApproximatePercentile` accepts `TimestampNTZType`.
### How was this patch tested?
New tests.
Closes#33241 from beliefer/SPARK-36017.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit cc4463e818)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
When casting `10:10` to interval minute to second, it can be caught by the hour-to-minute regex; this PR fixes that.
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#33242 from AngersZhuuuu/SPARK-35735-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 3953754f36)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
We found the `in-join.sql` does not test shuffled hash join properly in https://issues.apache.org/jira/browse/SPARK-32577, but didn't find a good way to fix it. Given we now have a test config to enforce shuffled hash join in https://github.com/apache/spark/pull/33182, we can fix the test here now as well.
### Why are the changes needed?
Fix test to have better test coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Reran the test to compare the output, and verified the query plan manually to make sure shuffled hash join being used.
Closes#33236 from c21/join-test.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit f3c11595ce)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Make sort removal recursive if the maximum number of rows is less than or equal to 1. For example:
```sql
select a from (select a from values(0, 1) t(a, b) order by a) order by a
```
### Why are the changes needed?
Fix the broken idempotence of the Once strategy for the batch Eliminate Sorts.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#33240 from wangyum/SPARK-35906-2.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ddc5cb9051)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to bump up the mypy version to 0.910 which is the latest.
### Why are the changes needed?
To catch the type hint mistakes better in PySpark.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GitHub Actions should test it out.
Closes#33223 from HyukjinKwon/SPARK-35684.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 16c195ccfb)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add a config `spark.sql.join.forceApplyShuffledHashJoin` to force applying shuffled hash join
during the join selection.
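A minimal sketch of how a test might use it (the config name is the one introduced here; `df1`/`df2` are hypothetical DataFrames and an existing `spark` session is assumed):
```python
# Force shuffled hash join regardless of the usual size-based selection.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
spark.conf.set("spark.sql.join.forceApplyShuffledHashJoin", "true")
df1.join(df2, "id").explain()  # expect ShuffledHashJoin in the physical plan
```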
### Why are the changes needed?
In the `SQLQueryTestSuite`, we want to cover 3 kinds of join (BHJ, SHJ, SMJ) in join.sql. But even
if `spark.sql.join.preferSortMergeJoin` is set to `false`, shuffled hash join is still not guaranteed.
Thus, we need another config to force the selection.
### Does this PR introduce _any_ user-facing change?
No, only for testing
### How was this patch tested?
newly added tests
Verified all queries in join.sql will use `ShuffledHashJoin` when the config is set to `true`
Closes#33182 from linhongliu-db/SPARK-35984-hash-join-config.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 7566db6033)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The `RemoveRedundantProjects` feature can sometimes conflict with the AQE broadcast threshold ([PR](https://github.com/apache/spark/pull/32391)). After removing the project, the physical-plan-to-logical-plan link can be changed and we may have a `Project` above `LogicalQueryStage`. This breaks the AQE broadcast threshold, because the stats of `Project` do not have the `isRuntime = true` flag and thus still use the normal broadcast threshold.
This PR updates `RemoveRedundantProjects` to not remove `ProjectExec` that has a different logical plan link than its child.
### Why are the changes needed?
Make AQE broadcast threshold work in more cases.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
new tests
Closes#33222 from cloud-fan/aqe2.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6b3ab8262f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Some of the test cases in `DateExpressionsSuite` are quite slow:
- `Hour`: 24s
- `Minute`: 26s
- `Day / DayOfMonth`: 8s
- `Year`: 4s
Each test case has a large loop. We should improve them.
### Why are the changes needed?
Save test running time
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Verified the run times locally:
- `Hour`: 2s
- `Minute`: 3.2s
- `Day / DayOfMonth`: 0.5s
- `Year`: 2s
Total reduced time: 54.3s
Closes#33229 from gengliangwang/improveTest.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit d5d1222686)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Refactor the code for parsing strings to DT/YM intervals.
### Why are the changes needed?
Extracting the common code for parsing strings to DT/YM intervals should improve code maintenance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#33217 from AngersZhuuuu/SPARK-35735-35768.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 26d1bb16bc)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR fixes an issue that `from_csv/to_csv` doesn't handle day-time intervals properly.
`from_csv` throws exception if day-time interval types are given.
```
spark-sql> select from_csv("interval '1 2:3:4' day to second", "a interval day to second");
21/07/03 04:39:13 ERROR SparkSQLDriver: Failed in [select from_csv("interval '1 2:3:4' day to second", "a interval day to second")]
java.lang.Exception: Unsupported type: interval day to second
at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedTypeError(QueryExecutionErrors.scala:775)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.makeConverter(UnivocityParser.scala:224)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$valueConverters$1(UnivocityParser.scala:134)
```
Also, `to_csv` doesn't handle day-time interval types properly, though no exception is thrown.
The result of `to_csv` for day-time interval types is not ANSI interval compliant form.
```
spark-sql> select to_csv(named_struct("a", interval '1 2:3:4' day to second));
93784000000
```
The result above should be `INTERVAL '1 02:03:04' DAY TO SECOND`.
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes#33226 from sarutak/csv-dtinterval.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit def8bc5c96)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes two issues. One is that `to_json` doesn't support `map` types where value types are `day-time` interval types like:
```
spark-sql> select to_json(map('a', interval '1 2:3:4' day to second));
21/07/06 14:53:58 ERROR SparkSQLDriver: Failed in [select to_json(map('a', interval '1 2:3:4' day to second))]
java.lang.RuntimeException: Failed to convert value 93784000000 (class of class java.lang.Long) with the type of DayTimeIntervalType(0,3) to JSON.
```
The other issue is that even if the issue of `to_json` is resolved, `from_json` doesn't support converting a `day-time` interval string from JSON. So the result of the following query will be `null`.
```
spark-sql> select from_json(to_json(map('a', interval '1 2:3:4' day to second)), 'a interval day to second');
{"a":null}
```
### Why are the changes needed?
There should be no reason why day-time intervals cannot be used as map value types; `CalendarIntervalType` already can be.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes#33225 from sarutak/json-dtinterval.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit c8ff613c3c)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Ideally, in a SQL query, nested columns should result in `GetStructField` with a non-None name. But there are places that can create `GetStructField` with a None name, such as `UnresolvedStar.expand`, the Dataset encoder code, etc.
The current `nestedFieldToAlias` cannot handle this and causes the job to fail.
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT.
Closes#33183 from AngersZhuuuu/SPARK-35972.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 87282f04bf)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
PR #15764 disabled creating permanent views based on temporary views or UDFs. But AlterViewCommand didn't block temporary objects.
### Why are the changes needed?
More robust view canonicalization.
### Does this PR introduce _any_ user-facing change?
Yes, now if you alter a permanent view based on temporary views or UDFs, the operation will fail.
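A sketch of the newly blocked case (illustrative view names; assumes an existing `spark` session):
```python
# A permanent view must not reference temporary objects; the ALTER below is expected to fail after this change.
spark.sql("CREATE TEMPORARY VIEW tmp_v AS SELECT 1 AS id")
spark.sql("CREATE VIEW perm_v AS SELECT 2 AS id")
spark.sql("ALTER VIEW perm_v AS SELECT * FROM tmp_v")  # expected to raise an AnalysisException
```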
### How was this patch tested?
Add new unit tests.
Closes#33204 from jerqi/alter_view.
Authored-by: RoryQi <1242949407@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e0c6b2e965)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support new keyword `TIMESTAMP_LTZ` (see the sketch after this list), which can be used for:
- timestamp with local time zone data type in DDL
- timestamp with local time zone data type in Cast clause.
- timestamp with local time zone data type literal
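An illustrative sketch of the three usages (assumes an existing `spark` session and the syntax introduced by this PR; the table name is hypothetical):
```python
spark.sql("CREATE TABLE t2 (ts TIMESTAMP_LTZ) USING parquet")             # DDL
spark.sql("SELECT CAST('2021-01-01 00:00:00' AS TIMESTAMP_LTZ)").show()   # Cast
spark.sql("SELECT TIMESTAMP_LTZ '2021-01-01 00:00:00'").show()            # literal
```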
### Why are the changes needed?
Users can use `TIMESTAMP_LTZ` in DDL/Cast/Literals for the timestamp with local time zone type directly. The new keyword is independent of the SQL configuration `spark.sql.timestampType`.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#33224 from gengliangwang/TIMESTAMP_LTZ.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit b0b9643cd7)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Add more metrics for the RocksDB instance. We transform the native states from RocksDB.
### Why are the changes needed?
Improve the usability with more metrics for RocksDB instance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32934 from xuanyuanking/SPARK-35788.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 9544277b0a)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/33113, to do some code cleanup:
1. `UnresolvedFieldPosition` doesn't need to include the field name. We can get it through "context" (`AlterTableAlterColumn.column.name`).
2. Run `ResolveAlterTableCommands` in the main resolution batch, so that the column/field resolution is also unified between v1 and v2 commands (same error message).
3. Fail immediately in `ResolveAlterTableCommands` if we can't resolve the field, instead of waiting until `CheckAnalysis`. We don't expect other rules to resolve fields in ALTER TABLE commands, so failing immediately is simpler and we can remove duplicated code in `CheckAnalysis`.
### Why are the changes needed?
code simplification.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes#33213 from cloud-fan/follow.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8b46e26fc6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support new keyword TIMESTAMP_NTZ, which can be used for:
- timestamp without time zone data type in DDL
- timestamp without time zone data type in Cast clause.
- timestamp without time zone data type literal
### Why are the changes needed?
Users can use `TIMESTAMP_NTZ` in DDL/Cast/Literals for the timestamp without time zone type directly.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#33221 from gengliangwang/timstamp_ntz.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 5f44acff3d)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
For the timestamp literal, it should have the following behavior (sketched after this list).
1. When `spark.sql.timestampType` is TIMESTAMP_NTZ: if there is no time zone part, return timestamp without time zone literal; otherwise, return timestamp with local time zone literal
2. When `spark.sql.timestampType` is TIMESTAMP_LTZ: return timestamp with local time zone literal
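A hedged sketch of the rule (assumes an existing `spark` session and the `spark.sql.timestampType` config from this series; the `typeof` outputs in the comments are expected results, not verified output):
```python
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
spark.sql("SELECT typeof(TIMESTAMP '2021-01-01 00:00:00')").show()        # expect timestamp_ntz
spark.sql("SELECT typeof(TIMESTAMP '2021-01-01 00:00:00+08:00')").show()  # expect timestamp (LTZ)
```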
### Why are the changes needed?
When the default timestamp type is TIMESTAMP_NTZ, a timestamp literal should return TIMESTAMP_NTZ when there is no time zone part in the string.
From section 5.3 "literal" of the ANSI SQL standard 2011:
```
27) The declared type of a <timestamp literal> that does not specify <time zone interval> is TIMESTAMP(P) WITHOUT TIME ZONE, where P is the number of digits in <seconds fraction>, if specified, and 0 (zero) otherwise. The declared type of a <timestamp literal> that specifies <time zone interval> is TIMESTAMP(P) WITH TIME ZONE, where P is the number of digits in <seconds fraction>, if specified, and 0 (zero) otherwise.
```
Since we don't have "timestamp with time zone", we use timestamp with local time zone instead.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type and the default timestamp configuration are not released yet.
### How was this patch tested?
Unit test
Closes#33215 from gengliangwang/tsLiteral.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 2fffec7de8)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR fixes the incorrect comment for `TimestampNTZType`.
### Why are the changes needed?
Fix the incorrect comment
### Does this PR introduce _any_ user-facing change?
'No'.
### How was this patch tested?
No need.
Closes#33218 from beliefer/SPARK-35664-followup.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit c605ba2d46)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue that `from_csv/to_csv` doesn't handle year-month intervals properly.
`from_csv` throws exception if year-month interval types are given.
```
spark-sql> select from_csv("interval '1-2' year to month", "a interval year to month");
21/07/03 04:32:24 ERROR SparkSQLDriver: Failed in [select from_csv("interval '1-2' year to month", "a interval year to month")]
java.lang.Exception: Unsupported type: interval year to month
at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedTypeError(QueryExecutionErrors.scala:775)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.makeConverter(UnivocityParser.scala:224)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$valueConverters$1(UnivocityParser.scala:134)
```
Also, `to_csv` doesn't handle year-month interval types properly, though no exception is thrown.
The result of `to_csv` for year-month interval types is not ANSI interval compliant form.
```
spark-sql> select to_csv(named_struct("a", interval '1-2' year to month));
14
```
The result above should be `INTERVAL '1-2' YEAR TO MONTH`.
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes#33210 from sarutak/csv-yminterval.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit f4237aff7e)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/31019 that forgot to update the SBT side to match.
### Why are the changes needed?
To use the same version in both Maven and SBT.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI should test them.
Closes#33207 from HyukjinKwon/SPARK-33996.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 2fe6c94544)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Skip removing a shuffle in AQE if its shuffle origin is not `REPARTITION_BY_COL`.
### Why are the changes needed?
`REPARTITION_BY_COL` doesn't guarantee the output partitioning number so we can remove it safely in AQE.
For `REPARTITION_BY_NUM`, we should retain the shuffle whose partition number is specified by the user.
For `REBALANCE_PARTITIONS_BY_COL`, it is a special shuffle used to rebalance partitions so we should not remove it.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
add test
Closes#33188 from ulysses-you/SPARK-35989.
Lead-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Co-authored-by: ulysses <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 7fe4c4a9ad)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Current AQE has a cost evaluator to decide whether to use the new plan after replanning. The currently used evaluator is `SimpleCostEvaluator`, which makes the decision based on the number of shuffles in the query plan. This is not a perfect cost evaluator, and different production environments might want to use different custom evaluators. E.g., sometimes we might want to still do skew join even though it might introduce extra shuffle (trade off resource for better latency), sometimes we might want to take sort into consideration for cost as well. Take our own setting as an example: we are using a custom remote shuffle service (Cosco), and the cost model is more complicated. So we want to make the cost evaluator pluggable, so developers can implement their own `CostEvaluator` subclass and plug it in dynamically based on configuration.
The approach is to introduce a new config to define the subclass name of `CostEvaluator` - `spark.sql.adaptive.customCostEvaluatorClass` - and add `CostEvaluator.instantiate` to instantiate the cost evaluator class in `AdaptiveSparkPlanExec.costEvaluator`.
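A minimal sketch of plugging in a custom evaluator (assumes an existing `spark` session; `com.example.MyCostEvaluator` is a hypothetical implementation of the `CostEvaluator` interface available on the classpath):
```python
# Point AQE at a custom cost evaluator class.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.customCostEvaluatorClass", "com.example.MyCostEvaluator")
```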
### Why are the changes needed?
Make AQE cost evaluation more flexible.
### Does this PR introduce _any_ user-facing change?
No but an internal config is introduced - `spark.sql.adaptive.customCostEvaluatorClass` to allow custom implementation of `CostEvaluator`.
### How was this patch tested?
Added unit test in `AdaptiveQueryExecSuite.scala`.
Closes#32944 from c21/aqe-cost.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 044dddf288)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes two issues. One is that `to_json` doesn't support `map` types where value types are `year-month` interval types like:
```
spark-sql> select to_json(map('a', interval '1-2' year to month));
21/07/02 11:38:15 ERROR SparkSQLDriver: Failed in [select to_json(map('a', interval '1-2' year to month))]
java.lang.RuntimeException: Failed to convert value 14 (class of class java.lang.Integer) with the type of YearMonthIntervalType(0,1) to JSON.
```
The other issue is that even if the issue of `to_json` is resolved, `from_json` doesn't support converting a `year-month` interval string from JSON. So the result of the following query will be `null`.
```
spark-sql> select from_json(to_json(map('a', interval '1-2' year to month)), 'a interval year to month');
{"a":null}
```
### Why are the changes needed?
There should be no reason why year-month intervals cannot be used as map value types; `CalendarIntervalType` already can be.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes#33181 from sarutak/map-json-yminterval.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 6474226852)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Make the ANSI flag part of expressions `Sum` and `Average`'s parameter list, instead of fetching it from the sessional SQLConf.
### Why are the changes needed?
For views, it is important to show consistent results even if the ANSI configuration is different in the running session. This is why many expressions like `Add`/`Divide` make the ANSI flag part of their case class parameter lists.
We should make it consistent for the expressions `Sum` and `Average`.
### Does this PR introduce _any_ user-facing change?
Yes, the `Sum` and `Average` inside a view always behave the same, independent of the ANSI mode SQL configuration in the current session.
### How was this patch tested?
Existing UT
Closes#33186 from gengliangwang/sumAndAvg.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 51103cdcdd)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
When I ran the benchmark in GA, I hit the error below.
https://github.com/pingsutw/spark/runs/2867617238?check_suite_focus=true
```
java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.j
ava:1692)java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
21/06/20 07:40:02 ERROR SparkContext: Error initializing SparkContext.java.lang.AssertionError: assertion failed:
spark.test.home is not set! at scala.Predef$.assert(Predef.scala:223) at org.apache.spark.deploy.worker.Worker.<init>
(Worker.scala:148) at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:954) at
org.apache.spark.deploy.LocalSparkCluster.$anonfun$start$2(LocalSparkCluster.scala:68) at
org.apache.spark.deploy.LocalSparkCluster.$anonfun$start$2$adapted(LocalSparkCluster.scala:65) at
scala.collection.immutable.Range.foreach(Range.scala:158) at
org.apache.spark.deploy.LocalSparkCluster.start(LocalSparkCluster.scala:65) at
org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2954) at
org.apache.spark.SparkContext.<init>(SparkContext.scala:559) at org.apache.spark.SparkContext.<init>
(SparkContext.scala:137) at
org.apache.spark.serializer.KryoSerializerBenchmark$.createSparkContext(KryoSerializerBenchmark.scala:86) at
org.apache.spark.serializer.KryoSerializerBenchmark$.sc$lzycompute$1(KryoSerializerBenchmark.scala:58) at
org.apache.spark.serializer.KryoSerializerBenchmark$.sc$1(KryoSerializerBenchmark.scala:58) at
org.apache.spark.serializer.KryoSerializerBenchmark$.$anonfun$run$3(KryoSerializerBenchmark.scala:63)
```
Set `spark.test.home` in the benchmark workflow.
No
Reran the benchmark in my fork.
https://github.com/pingsutw/spark/actions/runs/996067851
Closes#33203 from pingsutw/SPARK-36007.
Lead-authored-by: Kevin Su <pingsutw@apache.org>
Co-authored-by: Kevin Su <pingsutw@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 11fcbc73cb)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Fix the type hint for `pyspark.rdd.RDD.histogram`'s `buckets` argument
The current type hint is incomplete.
![image](https://user-images.githubusercontent.com/17701527/124248180-df7fd580-db22-11eb-8391-ba0bb51d689b.png)
From `pyspark.rdd.RDD.histogram`'s source:
```python
if isinstance(buckets, int):
...
elif isinstance(buckets, (list, tuple)):
...
else:
raise TypeError("buckets should be a list or tuple or number(int or long)")
```
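A short usage sketch of both accepted forms (assumes an existing SparkContext `sc`, as in the other examples above):
```python
rdd = sc.parallelize([1.0, 2.0, 3.0, 4.0])
print(rdd.histogram(2))                # buckets as an int: number of evenly spaced buckets
print(rdd.histogram([0.0, 2.0, 4.0]))  # buckets as an explicit list of boundaries
```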
Fixed the warning displayed above.
Fixed the warning above with this change.
Closes#33185 from tpvasconcelos/master.
Authored-by: Tomas Pereira de Vasconcelos <tomasvasconcelos1@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 495d234c6e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR uses 2 ideas to make `EquivalentExpressions` more efficient:
1. do not keep all the equivalent expressions, we only need a count
2. track the "height" of common subexpressions, to quickly do child-parent sort, and filter out non-child expressions in `addCommonExprs`
This PR also fixes several small bugs (exposed by the refactoring), please see PR comments.
### Why are the changes needed?
code cleanup and small perf improvement
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes#33142 from cloud-fan/codegen.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit e6ce220690)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>