Commit graph

31187 commits

Xinrong Meng 0b6af464dc [SPARK-36470][PYTHON] Implement CategoricalIndex.map and DatetimeIndex.map
### What changes were proposed in this pull request?
Implement `CategoricalIndex.map` and `DatetimeIndex.map`

`MultiIndex.map` cannot be implemented in the same way as the `map` of other indexes. It should be taken care of separately if necessary.

### Why are the changes needed?
Mapping values using input correspondence is a common operation that is supported in pandas. We shall support that as well.

### Does this PR introduce _any_ user-facing change?
Yes. `CategoricalIndex.map` and `DatetimeIndex.map` can be used now.

- CategoricalIndex.map

```py
>>> idx = ps.CategoricalIndex(['a', 'b', 'c'])
>>> idx
CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')

>>> idx.map(lambda x: x.upper())
CategoricalIndex(['A', 'B', 'C'],  categories=['A', 'B', 'C'], ordered=False, dtype='category')

>>> pser = pd.Series([1, 2, 3], index=pd.CategoricalIndex(['a', 'b', 'c'], ordered=True))
>>> idx.map(pser)
CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=True, dtype='category')

>>> idx.map({'a': 'first', 'b': 'second', 'c': 'third'})
CategoricalIndex(['first', 'second', 'third'], categories=['first', 'second', 'third'], ordered=False, dtype='category')
```

- DatetimeIndex.map

```py
>>> pidx = pd.date_range(start="2020-08-08", end="2020-08-10")
>>> psidx = ps.from_pandas(pidx)

>>> mapper_dict = {
...   datetime.datetime(2020, 8, 8): datetime.datetime(2021, 8, 8),
...   datetime.datetime(2020, 8, 9): datetime.datetime(2021, 8, 9),
... }
>>> psidx.map(mapper_dict)
DatetimeIndex(['2021-08-08', '2021-08-09', 'NaT'], dtype='datetime64[ns]', freq=None)

>>> mapper_pser = pd.Series([1, 2, 3], index=pidx)
>>> psidx.map(mapper_pser)
Int64Index([1, 2, 3], dtype='int64')
>>> psidx
DatetimeIndex(['2020-08-08', '2020-08-09', '2020-08-10'], dtype='datetime64[ns]', freq=None)

>>> psidx.map(lambda x: x.strftime("%B %d, %Y, %r"))
Index(['August 08, 2020, 12:00:00 AM', 'August 09, 2020, 12:00:00 AM',
       'August 10, 2020, 12:00:00 AM'],
      dtype='object')
```

### How was this patch tested?
Unit tests.

Closes #33756 from xinrong-databricks/other_indexes_map.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-23 10:08:40 +09:00
Gengliang Wang 3da0e9500f [SPARK-36557][DOCS] Update the MAVEN_OPTS in Spark build docs
### What changes were proposed in this pull request?

As Jacek Laskowski pointed out on the dev list, a StackOverflowError occurs when compiling Spark with the MAVEN_OPTS currently shown in the Spark documentation.
We should update it to include `-Xss64m` to avoid this.

### Why are the changes needed?

Correct the documentation

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test. The updated MAVEN_OPTS is consistent with our GitHub Actions build.

Closes #33804 from gengliangwang/updateBuildDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-23 09:46:00 +09:00
Venkata krishnan Sowrirajan 7b2842e986 [SPARK-36374][FOLLOW-UP] Change config key spark.shuffle.server.mergedShuffleFileManagerImpl to spark.shuffle.push.server.mergedShuffleFileManagerImpl
### What changes were proposed in this pull request?

Minor changes to change the config key name from `spark.shuffle.server.mergedShuffleFileManagerImpl` to `spark.shuffle.push.server.mergedShuffleFileManagerImpl`. This is missed out in https://github.com/apache/spark/pull/33615.
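
For illustration, a minimal sketch of the renamed key in use (a hedged example: the implementation class shown is only an example value, and in practice this is a shuffle-server-side setting for the external shuffle service):

```scala
import org.apache.spark.SparkConf

// Illustrative only: sets the renamed push-based shuffle server config key.
val conf = new SparkConf()
  .set("spark.shuffle.push.server.mergedShuffleFileManagerImpl",
    "org.apache.spark.network.shuffle.RemoteBlockPushResolver")
```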

### Why are the changes needed?

To keep the config names consistent

### Does this PR introduce _any_ user-facing change?

Yes, this changes a config key name, but the new config name has not been released yet, so technically there is no user-facing change.

### How was this patch tested?

Existing tests.

Closes #33799 from venkata91/SPARK-36374-follow-up.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-08-22 01:28:31 -05:00
Liang-Chi Hsieh 5876e04de2 [MINOR][SS][DOCS] Update doc for streaming deduplication
### What changes were proposed in this pull request?

This patch fixes an error about streaming deduplication in Structured Streaming, and also updates an item about unsupported operations.

### Why are the changes needed?

Update the user document.

### Does this PR introduce _any_ user-facing change?

No. It's a doc only change.

### How was this patch tested?

Doc only change.

Closes #33801 from viirya/minor-ss-deduplication.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-21 18:20:17 -07:00
Angerszhuuuu 5740d5641d [SPARK-36549][SQL] Add taskStatus supports multiple value to monitoring doc
### What changes were proposed in this pull request?
In the stage-related REST API, we support the `taskStatus` parameter as a list:
```
 QueryParam("taskStatus") taskStatus: JList[TaskStatus]
```
In a REST request it should be written like:
```
taskStatus=SUCCESS&taskStatus=FAILED
```

It's useful but not shown in the doc, and many users don't know how to write list parameters.
So this PR adds it to the monitoring doc as well.

### Why are the changes needed?
Make doc clear

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
With the REST request
```
http://localhost:4040/api/v1/applications/local-1629432414554/stages/0?details=true&taskStatus=FAILED
```
The request returns the following tasks:
```
tasks" : {
    "0" : {
      "taskId" : 0,
      "index" : 0,
      "attempt" : 0,
      "launchTime" : "2021-08-20T04:06:55.515GMT",
      "duration" : 273,
      "executorId" : "driver",
      "host" : "host",
      "status" : "FAILED",
      "taskLocality" : "PROCESS_LOCAL",
      "speculative" : false,
      "accumulatorUpdates" : [ ],
      "errorMessage" : "java.lang.RuntimeException\n\tat org.apache.spark.ui.UISuite.$anonfun$new$8(UISuite.scala:95)\n\tat scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)\n\tat scala.collection.Iterator.foreach(Iterator.scala:943)\n\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\n\tat org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)\n\tat org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1003)\n\tat org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1003)\n\tat org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:136)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n",
      "taskMetrics" : {
        "executorDeserializeTime" : 0,
        "executorDeserializeCpuTime" : 0,
        "executorRunTime" : 206,
        "executorCpuTime" : 0,
        "resultSize" : 0,
        "jvmGcTime" : 0,
        "resultSerializationTime" : 0,
        "memoryBytesSpilled" : 0,
        "diskBytesSpilled" : 0,
        "peakExecutionMemory" : 0,
        "inputMetrics" : {
          "bytesRead" : 0,
          "recordsRead" : 0
        },
        "outputMetrics" : {
          "bytesWritten" : 0,
          "recordsWritten" : 0
        },
        "shuffleReadMetrics" : {
          "remoteBlocksFetched" : 0,
          "localBlocksFetched" : 0,
          "fetchWaitTime" : 0,
          "remoteBytesRead" : 0,
          "remoteBytesReadToDisk" : 0,
          "localBytesRead" : 0,
          "recordsRead" : 0
        },
        "shuffleWriteMetrics" : {
          "bytesWritten" : 0,
          "writeTime" : 0,
          "recordsWritten" : 0
        }
      },
      "executorLogs" : { },
      "schedulerDelay" : 67,
      "gettingResultTime" : 0
    }
  },
```

With the REST request
```
http://localhost:4040/api/v1/applications/local-1629432414554/stages/0?details=true&taskStatus=FAILED&taskStatus=SUCCESS
```
The request returns the following tasks:
```
"tasks" : {
    "1" : {
      "taskId" : 1,
      "index" : 1,
      "attempt" : 0,
      "launchTime" : "2021-08-20T04:06:55.786GMT",
      "duration" : 16,
      "executorId" : "driver",
      "host" : "host",
      "status" : "SUCCESS",
      "taskLocality" : "PROCESS_LOCAL",
      "speculative" : false,
      "accumulatorUpdates" : [ ],
      "taskMetrics" : {
        "executorDeserializeTime" : 2,
        "executorDeserializeCpuTime" : 2638000,
        "executorRunTime" : 2,
        "executorCpuTime" : 1993000,
        "resultSize" : 837,
        "jvmGcTime" : 0,
        "resultSerializationTime" : 0,
        "memoryBytesSpilled" : 0,
        "diskBytesSpilled" : 0,
        "peakExecutionMemory" : 0,
        "inputMetrics" : {
          "bytesRead" : 0,
          "recordsRead" : 0
        },
        "outputMetrics" : {
          "bytesWritten" : 0,
          "recordsWritten" : 0
        },
        "shuffleReadMetrics" : {
          "remoteBlocksFetched" : 0,
          "localBlocksFetched" : 0,
          "fetchWaitTime" : 0,
          "remoteBytesRead" : 0,
          "remoteBytesReadToDisk" : 0,
          "localBytesRead" : 0,
          "recordsRead" : 0
        },
        "shuffleWriteMetrics" : {
          "bytesWritten" : 0,
          "writeTime" : 0,
          "recordsWritten" : 0
        }
      },
      "executorLogs" : { },
      "schedulerDelay" : 12,
      "gettingResultTime" : 0
    },
    "0" : {
      "taskId" : 0,
      "index" : 0,
      "attempt" : 0,
      "launchTime" : "2021-08-20T04:06:55.515GMT",
      "duration" : 273,
      "executorId" : "driver",
      "host" : "host",
      "status" : "FAILED",
      "taskLocality" : "PROCESS_LOCAL",
      "speculative" : false,
      "accumulatorUpdates" : [ ],
      "errorMessage" : "java.lang.RuntimeException\n\tat org.apache.spark.ui.UISuite.$anonfun$new$8(UISuite.scala:95)\n\tat scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)\n\tat scala.collection.Iterator.foreach(Iterator.scala:943)\n\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\n\tat org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)\n\tat org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1003)\n\tat org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1003)\n\tat org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:136)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n",
      "taskMetrics" : {
        "executorDeserializeTime" : 0,
        "executorDeserializeCpuTime" : 0,
        "executorRunTime" : 206,
        "executorCpuTime" : 0,
        "resultSize" : 0,
        "jvmGcTime" : 0,
        "resultSerializationTime" : 0,
        "memoryBytesSpilled" : 0,
        "diskBytesSpilled" : 0,
        "peakExecutionMemory" : 0,
        "inputMetrics" : {
          "bytesRead" : 0,
          "recordsRead" : 0
        },
        "outputMetrics" : {
          "bytesWritten" : 0,
          "recordsWritten" : 0
        },
        "shuffleReadMetrics" : {
          "remoteBlocksFetched" : 0,
          "localBlocksFetched" : 0,
          "fetchWaitTime" : 0,
          "remoteBytesRead" : 0,
          "remoteBytesReadToDisk" : 0,
          "localBytesRead" : 0,
          "recordsRead" : 0
        },
        "shuffleWriteMetrics" : {
          "bytesWritten" : 0,
          "writeTime" : 0,
          "recordsWritten" : 0
        }
      },
      "executorLogs" : { },
      "schedulerDelay" : 67,
      "gettingResultTime" : 0
    }
  },
```

Closes #33793 from AngersZhuuuu/SPARK-36549.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-22 09:45:21 +09:00
Kent Yao f918c123a0 [SPARK-36552][SQL] Fix different behavior for writing char/varchar to hive and datasource table
### What changes were proposed in this pull request?

For Hive tables, the actual write path and the schema handling are inconsistent when `spark.sql.legacy.charVarcharAsString` is true.

This causes problems like the one described in SPARK-36552.

In this PR we respect `spark.sql.legacy.charVarcharAsString` when generating the Hive table schema from Spark data types.
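
A minimal sketch of the intended behavior (a hedged example: assumes a SparkSession `spark` built with Hive support; the table name and values are placeholders):

```scala
// With the legacy flag on, a CHAR/VARCHAR column of a Hive table behaves like plain STRING
// on both the schema and the write path.
spark.conf.set("spark.sql.legacy.charVarcharAsString", "true")
spark.sql("CREATE TABLE t (c CHAR(5)) USING hive")
spark.sql("INSERT INTO t VALUES ('abcdefg')")  // stored as a string, no char(5) handling
```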

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

Yes. When `spark.sql.legacy.charVarcharAsString` is true, Hive tables with char/varchar columns will follow string behavior.

### How was this patch tested?

newly added test

Closes #33798 from yaooqinn/SPARK-36552.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-22 09:38:39 +09:00
yangjie01 1ccb06ca8c Revert "[SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache"
### What changes were proposed in this pull request?
This PR reverts the changes of SPARK-34309, which include:

- https://github.com/apache/spark/pull/31517
- https://github.com/apache/spark/pull/33772

### Why are the changes needed?

1. No real performance improvement in Spark
2. It added an additional dependency

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #33784 from LuciferYang/revert-caffeine.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-22 09:36:15 +09:00
ulysses-you 90cbf9ca3e [SPARK-35083][CORE][FOLLLOWUP] Improve docs and migration guide
### What changes were proposed in this pull request?

* improve docs in `docs/job-scheduling.md`
* add migration guide docs in `docs/core-migration-guide.md`

### Why are the changes needed?

Help users migrate.

### Does this PR introduce _any_ user-facing change?

yes

### How was this patch tested?

Pass CI

Closes #33794 from ulysses-you/SPARK-35083-f.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
2021-08-20 21:32:48 +08:00
sweisdb c441c7e365 Updates AuthEngine to pass the correct SecretKeySpec format
AuthEngineSuite was passing on some platforms (macOS) but failing on others (Linux) with an InvalidKeyException stemming from the SecretKeySpec construction. We should explicitly pass AES as the key format.

### What changes were proposed in this pull request?

Changes the AuthEngine SecretKeySpec from "RAW" to "AES".

### Why are the changes needed?

Unit tests were failing on some platforms with InvalidKeyExceptions when this key was used to instantiate a Cipher.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests on macOS and Linux platforms.

Closes #33790 from sweisdb/patch-1.

Authored-by: sweisdb <60895808+sweisdb@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-08-20 08:31:39 -05:00
Yesheng Ma 5c0762b5d2 [SPARK-36448][SQL] Exceptions in NoSuchItemException.scala have to be case classes
### What changes were proposed in this pull request?
Change all exceptions in NoSuchItemException.scala to case classes.

### Why are the changes needed?
Exceptions in NoSuchItemException.scala are not case classes. This causes issues because the Analyzer's executeAndCheck method always calls the `copy` method on the exception. However, since these exceptions are not case classes, the `copy` call was always delegated to `AnalysisException::copy`, which is not the specialized version.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #33673 from yeshengm/SPARK-36448.

Authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-20 20:16:30 +08:00
Gengliang Wang 42eebb84f5 [SPARK-36551][BUILD] Add sphinx-plotly-directive in Spark release Dockerfile
### What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/32726, Python doc build requires `sphinx-plotly-directive`.
This PR is to install it from `spark-rm/Dockerfile` to make sure `do-release-docker.sh` can run successfully.
Also, this PR mentions it in the README of docs.

### Why are the changes needed?

Fix release script and update README of docs

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

Manual test locally.

Closes #33797 from gengliangwang/fixReleaseDocker.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-20 20:02:24 +08:00
Gengliang Wang f0775d215e [SPARK-36547][BUILD] Downgrade scala-maven-plugin to 4.3.0
### What changes were proposed in this pull request?

When preparing Spark 3.2.0 RC1, I hit the same issue as in https://github.com/apache/spark/pull/31031.
```
[INFO] Compiling 21 Scala sources and 3 Java sources to /opt/spark-rm/output/spark-3.1.0-bin-hadoop2.7/resource-managers/yarn/target/scala-2.12/test-classes ...
[ERROR] ## Exception when compiling 24 sources to /opt/spark-rm/output/spark-3.1.0-bin-hadoop2.7/resource-managers/yarn/target/scala-2.12/test-classes
java.lang.SecurityException: class "javax.servlet.SessionCookieConfig"'s signer information does not match signer information of other classes in the same package
java.lang.ClassLoader.checkCerts(ClassLoader.java:891)
java.lang.ClassLoader.preDefineClass(ClassLoader.java:661)
```
This PR is to apply the same fix again by downgrading scala-maven-plugin to 4.3.0

### Why are the changes needed?

To unblock the release process.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

Build test

Closes #33791 from gengliangwang/downgrade.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-20 10:45:16 +08:00
Yuanjian Li a0b24019ed [SPARK-35312][SS][FOLLOW-UP] More documents and checking logic for the new options
### What changes were proposed in this pull request?
Add more documents and checking logic for the new options `minOffsetsPerTrigger` and `maxTriggerDelay`.

### Why are the changes needed?
Have a clear description of the behavior introduced in SPARK-35312

### Does this PR introduce _any_ user-facing change?
Yes.
If the user sets minOffsetsPerTrigger > maxOffsetsPerTrigger, the new code will throw an AnalysisException. The original behavior was to silently ignore maxOffsetsPerTrigger.
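
As a hedged sketch of the new check (assumes a SparkSession `spark`; the topic and bootstrap servers are placeholders), a read like the following is now rejected because minOffsetsPerTrigger exceeds maxOffsetsPerTrigger:

```scala
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("minOffsetsPerTrigger", "1000")  // exceeds maxOffsetsPerTrigger: now rejected
  .option("maxOffsetsPerTrigger", "500")
  .option("maxTriggerDelay", "15m")        // upper bound on waiting for minOffsetsPerTrigger
  .load()
```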

### How was this patch tested?
Existing tests.

Closes #33792 from xuanyuanking/SPARK-35312-follow.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-20 10:41:42 +09:00
gengjiaan 462aa7cd3c [SPARK-36428][TESTS][FOLLOWUP] Revert mistake change to DateExpressionsSuite
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/33775 committed debug code by mistake.
This PR reverts the test path.

### Why are the changes needed?
Remove the debug code.

### Does this PR introduce _any_ user-facing change?
 'No'.
Just adjust test.

### How was this patch tested?
Revert non-ansi test path.

Closes #33787 from beliefer/SPARK-36428-followup2.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-19 21:33:21 +08:00
Gengliang Wang b36b1c7e8a Revert "[SPARK-35083][FOLLOW-UP][CORE] Add migration guide for the remote scheduler pool files support"

This reverts commit e3902d1975. The feature is an improvement rather than a behavior change.

Closes #33789 from gengliangwang/revertDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-19 21:30:00 +08:00
Yuming Wang 2310b99e14 [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning
### What changes were proposed in this pull request?

Remove `OptimizeSubqueries` from batch of `PartitionPruning` to make DPP support more cases. For example:
```sql
SELECT date_id, product_id FROM fact_sk f
JOIN (select store_id + 3 as new_store_id from dim_store where country = 'US') s
ON f.store_id = s.new_store_id
```

Before this PR:
```
== Physical Plan ==
*(2) Project [date_id#3998, product_id#3999]
+- *(2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false
   :- *(2) ColumnarToRow
   :  +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(true)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#274]
      +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997]
         +- *(1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3)))
            +- *(1) ColumnarToRow
               +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string>
```

After this PR:
```
== Physical Plan ==
*(2) Project [date_id#3998, product_id#3999]
+- *(2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false
   :- *(2) ColumnarToRow
   :  +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(store_id#4001 IN dynamicpruning#4007)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int>
   :        +- SubqueryBroadcast dynamicpruning#4007, 0, [new_store_id#3997], [id=#263]
   :           +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#262]
   :              +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997]
   :                 +- *(1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3)))
   :                    +- *(1) ColumnarToRow
   :                       +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string>
   +- ReusedExchange [new_store_id#3997], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#262]
```
This is because `OptimizeSubqueries` will infer more filters, so we cannot reuse broadcasts. The following is the plan if disable `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly`:
```
== Physical Plan ==
*(2) Project [date_id#3998, product_id#3999]
+- *(2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false
   :- *(2) ColumnarToRow
   :  +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(store_id#4001 IN subquery#4009)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int>
   :        +- Subquery subquery#4009, [id=#284]
   :           +- *(2) HashAggregate(keys=[new_store_id#3997#4008], functions=[])
   :              +- Exchange hashpartitioning(new_store_id#3997#4008, 5), ENSURE_REQUIREMENTS, [id=#280]
   :                 +- *(1) HashAggregate(keys=[new_store_id#3997 AS new_store_id#3997#4008], functions=[])
   :                    +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997]
   :                       +- *(1) Filter (((isnotnull(store_id#4002) AND isnotnull(country#4004)) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3)))
   :                          +- *(1) ColumnarToRow
   :                             +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(store_id#4002), isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002..., Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(store_id), IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#305]
      +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997]
         +- *(1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3)))
            +- *(1) ColumnarToRow
               +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string>
```

### Why are the changes needed?

Improve DPP to support more cases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and benchmark test:
SQL | Before this PR (seconds) | After this PR (seconds)
-- | -- | --
TPC-DS q58 | 40 | 20
TPC-DS q83 | 18 | 14

Closes #33664 from wangyum/SPARK-36444.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-08-19 16:44:50 +08:00
yi.wu e3902d1975 [SPARK-35083][FOLLOW-UP][CORE] Add migration guide for the remote scheduler pool files support
### What changes were proposed in this pull request?

Add remote scheduler pool files support to the migration guide.

### Why are the changes needed?

To highlight this useful support.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass existing tests.

Closes #33785 from Ngone51/SPARK-35083-follow-up.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-19 16:28:59 +08:00
Shixiong Zhu ea4919801a [SPARK-36519][SS] Store RocksDB format version in the checkpoint for streaming queries
### What changes were proposed in this pull request?

RocksDB provides backward compatibility but it doesn't always provide forward compatibility. It's better to store the RocksDB format version in the checkpoint so that it would give us more information to provide the rollback guarantee when we upgrade the RocksDB version that may introduce incompatible change in a new Spark version.

A typical case is when a user upgrades their query to a new Spark version, and this new Spark version has a new RocksDB version which may use a new format. But the user hits some bug and decides to roll back. However, in the old Spark version, the old RocksDB version cannot read the new format.

In order to handle this case, we will write the RocksDB format version to the checkpoint. When restarting from a checkpoint, we will force RocksDB to use the format version stored in the checkpoint. This will ensure the user can rollback their Spark version if needed.

We also provide a config `spark.sql.streaming.stateStore.rocksdb.formatVersion` for users who don't need to rollback their Spark versions to overwrite the format version specified in the checkpoint.
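
For illustration, a minimal sketch of overriding the checkpointed version via the new config (a hedged example: assumes a SparkSession `spark`, and the version number shown is only an example value):

```scala
// Illustrative only: force RocksDB to use this format version instead of the one
// recorded in the checkpoint.
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.formatVersion", "5")
```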

### Why are the changes needed?

Provide the Spark version rollback guarantee for streaming queries when a new RocksDB introduces an incompatible format change.

### Does this PR introduce _any_ user-facing change?

No. RocksDB state store is a new feature in Spark 3.2, which has not yet released.

### How was this patch tested?

The new unit tests.

Closes #33749 from zsxwing/SPARK-36519.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-19 00:23:40 -07:00
Max Gekk 1235bd29f0 [SPARK-36536][SQL] Use CAST for datetime in CSV/JSON by default
### What changes were proposed in this pull request?
In the PR, I propose to split the `dateFormat` and `timestampFormat` options in CSV/JSON datasources to:
- In write (`dateFormatInWrite`/`timestampFormatInWrite`). The CSV/JSON datasource will use it when formatting dates/timestamps. If a user doesn't set it, it will be set to a default value.
- In read (`dateFormatInRead`/`timestampFormatInRead`). The datasources will use it while parsing input date/timestamp strings. If a user doesn't set it, we will keep it uninitialized (None) and use CAST to parse the input date/timestamp strings.

### Why are the changes needed?
This should improve user experience with Spark SQL, and make the default parsing behavior more flexible.

### Does this PR introduce _any_ user-facing change?
Potentially, it can.

### How was this patch tested?
By existing test suites, and by new tests that are added to `JsonSuite` and to `CSVSuite`:
```
$ build/sbt "sql/test:testOnly *CSVLegacyTimeParserSuite"
$ build/sbt "sql/test:testOnly *JsonFunctionsSuite"
$ build/sbt "sql/test:testOnly *CSVv1Suite"
$ build/sbt "sql/test:testOnly *JsonV2Suite"
```

Closes #33769 from MaxGekk/split-datetime-ds-options.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-19 09:30:50 +03:00
Kousuke Saruta 013f2b7439 [SPARK-36512][UI][TESTS] Fix UISeleniumSuite in sql/hive-thriftserver
### What changes were proposed in this pull request?

This PR fixes an issue that `UISeleniumSuite` in `sql/hive-thriftserver` doesn't work even though the ignored test is enabled due to the following two reasons.

(1) The suite waits for the Thrift server to start up by reading a startup message from stdin, but the expected message is never read.
(2) The URL and CSS selector for test are wrong.

To resolve (1), this PR adopts the approach that `HiveThriftServer2Suite` uses.
3f8ec0dae4/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala (L1222-L1248)

This PR also enables the ignored test. Let's disable it again in the following PR if it's still flaky.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CIs.

Closes #33741 from sarutak/fix-thrift-uiseleniumsuite.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-19 13:27:58 +08:00
Angerszhuuuu 559fe96a40 [SPARK-35991][SQL] Add PlanStability suite for TPCH
### What changes were proposed in this pull request?
Add PlanStability suite for TPCH

### Why are the changes needed?
Add PlanStability suite for TPCH

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #33736 from AngersZhuuuu/SPARK-35991.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-19 11:40:34 +08:00
Kousuke Saruta c458edb77e [SPARK-36371][SQL] Support raw string literal
### What changes were proposed in this pull request?

This PR proposes to support raw string literals, in which no character is escaped by `\`.
A raw string literal takes the form `r"..."` or `r'...'`, like the syntax BigQuery and Python support.

Actually, there is no standard way to represent raw string literals.

In PostgreSQL, no special character is escaped by `\` unless the string literal starts with the E prefix.
https://www.postgresql.org/docs/13/sql-syntax-lexical.html#SQL-SYNTAX-CONSTANTS

In MySQL, special characters in a string literal are not escaped if NO_BACKSLASH_ESCAPES is enabled.
https://dev.mysql.com/doc/refman/8.0/en/string-literals.html

In MsSQLServer, no special character is escaped by `\`, but the STRING_ESCAPE function can escape such characters.
https://docs.microsoft.com/en-us/sql/t-sql/functions/string-escape-transact-sql?view=sql-server-ver15

But BigQuery supports the `r"..."` and `r'...'` forms, so this PR proposes this syntax.
https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#literals

### Why are the changes needed?

In the current master, it is sometimes confusing to represent JSON and regexes in a string literal when they contain backslashes.
For example, in JSON, `\` needs to be escaped as follows.
```
{"a": "\\"}
```
But if the JSON above is written as a string literal, two further `\` are needed because the string literal also requires `\` to be escaped.
```
SELECT from_json('{"a": "\\\\"}', 'a string')
{"a":"\"}
```
With a raw string literal, we can represent such JSON as follows.
```
SELECT from_json(r'{"a": "\\"}', 'a string')
{"a":"\"}
```

### Does this PR introduce _any_ user-facing change?

No. This PR just extends the existing syntax.

### How was this patch tested?

Added new test.
I also confirmed that the modified document is successfully built with `SKIP_API=1 bundle exec jekyll build`.
![raw_string_literal](https://user-images.githubusercontent.com/4736016/129223184-3bb4e206-f40f-42d2-a128-0cd8fc83b9c9.png)

Closes #33599 from sarutak/raw-string-literal.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-19 11:37:32 +08:00
Wenchen Fan 07d173a8b0 [SPARK-33687][SQL][DOC][FOLLOWUP] Merge the doc pages of ANALYZE TABLE and ANALYZE TABLES
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30648

ANALYZE TABLE and ANALYZE TABLES are essentially the same command; it's odd to put them on two different doc pages. This PR proposes to merge them into one doc page.

### Why are the changes needed?

simplify the doc

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #33781 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-19 11:04:05 +08:00
tooptoop4 2fc9c0bfb5 [SPARK-36147][SQL] Warn if less files visible after stats write in BasicWriteStatsTracker
### What changes were proposed in this pull request?

This log message (in org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala) should be at least WARN, not INFO:
"Expected $numSubmittedFiles files, but only saw $numFiles."

### Why are the changes needed?

INFO logs don't indicate a possible issue, but WARN logs do.

### Does this PR introduce any user-facing change?

Yes, the log level changed.

### How was this patch tested?

manual, trivial change

Closes #33332 from tooptoop4/warn.

Authored-by: tooptoop4 <33283496+tooptoop4@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-19 10:31:10 +09:00
itholic f2e593bcf1 [SPARK-36368][PYTHON] Fix CategoricalOps.astype to follow pandas 1.3
### What changes were proposed in this pull request?

This PR proposes to fix the behavior of `astype` for `CategoricalDtype` to follow pandas 1.3.

**Before:**
```python
>>> pcat
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0    a
1    b
2    c
dtype: category
Categories (3, object): ['b', 'c', 'a']
```

**After:**
```python
>>> pcat
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']  # CategoricalDtype is not updated if dtype is the same
```

`CategoricalDtype` is treated as the same `dtype` if the unique values are the same.

```python
>>> pcat1 = pser.astype(CategoricalDtype(["b", "c", "a"]))
>>> pcat2 = pser.astype(CategoricalDtype(["a", "b", "c"]))
>>> pcat1.dtype == pcat2.dtype
True
```

### Why are the changes needed?

We should follow the latest pandas as much as possible.

### Does this PR introduce _any_ user-facing change?

Yes, the behavior is changed as example in the PR description.

### How was this patch tested?

Unittest

Closes #33757 from itholic/SPARK-36368.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-18 11:38:59 -07:00
itholic c91ae544fd [SPARK-36388][SPARK-36386][PYTHON][FOLLOWUP] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/33646 to add missing tests.

### Why are the changes needed?

Some tests are missing

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unittest

Closes #33776 from itholic/SPARK-36388-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-18 11:17:01 -07:00
yangjie01 1859d9bc85 [SPARK-36407][CORE][SQL] Convert int to long to avoid potential integer multiplications overflow risk
### What changes were proposed in this pull request?
The main change of this PR is converting int literals to long literals to avoid the risk of integer multiplication overflow.

For example:

**Before**

```java
void f(int i) {
    long val = 65536 * i;
  }
```

**After**

```java
void f(int i) {
    long val = 65536L * i;
  }
```

### Why are the changes needed?
Avoid the risk of integer multiplication overflow

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #33629 from LuciferYang/cast-to-long.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-08-18 11:30:37 -05:00
gengjiaan 707eefa3c7 [SPARK-36428][SQL][FOLLOWUP] Simplify the implementation of make_timestamp
### What changes were proposed in this pull request?
The implementation in https://github.com/apache/spark/pull/33665 made `make_timestamp` accept an integer type for the seconds parameter.
This PR lets `make_timestamp` accept the `decimal(16,6)` type for the seconds parameter; since casting an integer to `decimal(16,6)` is safe, we can simplify the code.

### Why are the changes needed?
Simplify `make_timestamp`.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
New tests.

Closes #33775 from beliefer/SPARK-36428-followup.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-18 22:57:06 +08:00
yi.wu 996551fece [SPARK-36532][CORE] Fix deadlock in CoarseGrainedExecutorBackend.onDisconnected to avoid executor shutdown hang
### What changes were proposed in this pull request?

Instead of exiting the executor within the RpcEnv's thread, exit the executor in a separate thread.
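
A minimal sketch of the idea (not the actual Spark code; the object and thread names are illustrative):

```scala
object SelfExit {
  // Hand System.exit off to a dedicated thread so the shutdown hooks it triggers
  // never block the RPC message-loop thread that delivered the disconnect event.
  def exitInSeparateThread(code: Int): Unit = {
    val t = new Thread(() => System.exit(code), "executor-self-exit")
    t.setDaemon(true)
    t.start()
  }
}
```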

### Why are the changes needed?

The current way of exiting in `onDisconnected` can cause a deadlock, which has the exact same root cause as https://github.com/apache/spark/pull/12012:

* `onDisconnected` -> `System.exit` are called in sequence in the thread of `MessageLoop.threadpool`
* `System.exit` triggers shutdown hooks and `executor.stop` is one of the hooks.
* `executor.stop` stops the `Dispatcher`, which waits for the `MessageLoop.threadpool`  to shutdown further.
* Thus, the thread which runs `System.exit` waits for the hooks to finish, but shutting down the `MessageLoop.threadpool` inside the hook waits for that thread to finish. This mutual dependency results in the deadlock.

### Does this PR introduce _any_ user-facing change?

Yes, the executor shutdown won't hang.

### How was this patch tested?

Pass existing tests.

Closes #33759 from Ngone51/fix-executor-shutdown-hang.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-18 22:46:48 +08:00
sychen a1ecf83f2e [SPARK-36451][BUILD] Ivy skips looking for source and doc pom
### What changes were proposed in this pull request?
SPARK-35863 upgraded Ivy to 2.5.0, which supports skipping the lookup of source and doc poms, but at present the remote repo is still queried for them.

### Why are the changes needed?
This can improve the speed of some unit tests, such as `IsolatedClientLoader#downloadVersion`, since there is no need to look up source and doc poms.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manual UT

Closes #33678 from cxzl25/SPARK-36451.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-08-18 08:29:51 -05:00
Kousuke Saruta 281b00ab5b [SPARK-34309][BUILD][FOLLOWUP] Upgrade Caffeine to 2.9.2
### What changes were proposed in this pull request?

This PR upgrades Caffeine to `2.9.2`.
Caffeine was introduced in SPARK-34309 (#31517). At the time that PR was opened, the latest version of caffeine was `2.9.1` but now `2.9.2` is available.

### Why are the changes needed?

`2.9.2` has the following improvements (https://github.com/ben-manes/caffeine/releases/tag/v2.9.2).

* Fixed reading an intermittent null weak/soft value during a concurrent write
* Fixed extraneous eviction when concurrently removing a collected entry after a writer resurrects it with a new mapping
* Fixed excessive retries of discarding an expired entry when the fixed duration period is extended, thereby resurrecting it

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CIs.

Closes #33772 from sarutak/upgrade-caffeine-2.9.2.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-08-18 13:40:52 +09:00
Kousuke Saruta b914ff7d54 [SPARK-36400][SPARK-36398][SQL][WEBUI] Make ThriftServer recognize spark.sql.redaction.string.regex
### What changes were proposed in this pull request?

This PR fixes an issue that ThriftServer doesn't recognize `spark.sql.redaction.string.regex`.
The problem is that sensitive information included in queries can be exposed.
![thrift-password1](https://user-images.githubusercontent.com/4736016/129440772-46379cc5-987b-41ac-adce-aaf2139f6955.png)
![thrift-password2](https://user-images.githubusercontent.com/4736016/129440775-fd328c0f-d128-4a20-82b0-46c331b9fd64.png)

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran ThriftServer, connect to it and execute `CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde");` with `spark.sql.redaction.string.regex=((?i)(?<=password=))(".*")|('.*')`
Then, confirmed UI.

![thrift-hide-password1](https://user-images.githubusercontent.com/4736016/129440863-cabea247-d51f-41a4-80ac-6c64141e1fb7.png)
![thrift-hide-password2](https://user-images.githubusercontent.com/4736016/129440874-96cd0f0c-720b-4010-968a-cffbc85d2be5.png)

Closes #33743 from sarutak/thrift-redact.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-08-18 13:31:22 +09:00
Takuya UESHIN 7fb8ea319e [SPARK-36370][PYTHON][FOLLOWUP] Use LooseVersion instead of pkg_resources.parse_version
### What changes were proposed in this pull request?

This is a follow-up of #33687.

Use `LooseVersion` instead of `pkg_resources.parse_version`.

### Why are the changes needed?

In the previous PR, `pkg_resources.parse_version` was used, but we should use `LooseVersion` instead to be consistent in the code base.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33768 from ueshin/issues/SPARK-36370/LooseVersion.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-18 10:36:09 +09:00
Wenchen Fan 4b015e8d7d [SPARK-36535][SQL] Refine the sql reference doc
### What changes were proposed in this pull request?

Refine the SQL reference doc:
- remove useless subitems in the sidebar
- remove useless sub-menu-pages (e.g. `sql-ref-syntax-aux.md`)
- avoid using `#####` in `sql-ref-literals.md`

### Why are the changes needed?

The subitems in the sidebar are quite useless, as the menu page serves the same functionalities:
<img width="1040" alt="WX20210817-2358402x" src="https://user-images.githubusercontent.com/3182036/129765924-d7e69bc1-e351-4581-a6de-f2468022f372.png">
It's also extra work to keep the menu page and sidebar subitems in sync (the ANSI compliance page is already out of sync).

The sub-menu pages are only referenced by the sidebar and duplicate the content of the menu page. As a result, `sql-ref-syntax-aux.md` is already outdated compared to the menu page. It's easier to just look at the menu page.

The `#####` is not rendered properly:
<img width="776" alt="WX20210818-0001192x" src="https://user-images.githubusercontent.com/3182036/129766760-6f385443-e597-44aa-888d-14d128d45f84.png">
It's better to avoid using it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #33767 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-08-17 12:46:38 -07:00
Cedric-Magnan 964dfe254f [SPARK-36370][PYTHON] _builtin_table directly imported from pandas instead of being redefined
### What changes were proposed in this pull request?
This PR suggests refactoring the way `_builtin_table` is defined in the `python/pyspark/pandas/groupby.py` module.
Pandas has recently refactored the way `_builtin_table` is imported: it is now part of the `pandas.core.common` module instead of being an attribute of the `pandas.core.base.SelectionMixin` class.

### Why are the changes needed?
This change is not strictly needed, but the current implementation redefines this table within pyspark, so any change to this table in the pandas library would need to be mirrored in the pyspark repository as well.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran the following command successfully :
```sh
python/run-tests --testnames 'pyspark.pandas.tests.test_groupby'
```
Tests passed in 327 seconds

Closes #33687 from Cedric-Magnan/_builtin_table_from_pandas.

Authored-by: Cedric-Magnan <cedric.magnan@artefact.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-17 10:46:49 -07:00
itholic c0441bb7e8 [SPARK-36387][PYTHON] Fix Series.astype from datetime to nullable string
### What changes were proposed in this pull request?

This PR proposes to fix `Series.astype` when converting datetime type to StringDtype, to match the behavior of pandas 1.3.

In pandas < 1.3,
```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0    2020-10-27 00:00:01
1                    NaT
Name: datetime, dtype: string
```

This is changed to

```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0    2020-10-27 00:00:01
1                   <NA>
Name: datetime, dtype: string
```

in pandas >= 1.3, so we follow the behavior of latest pandas.

### Why are the changes needed?

Because pandas-on-Spark always follow the behavior of latest pandas.

### Does this PR introduce _any_ user-facing change?

Yes, the behavior is changed to latest pandas when converting datetime to nullable string (StringDtype)

### How was this patch tested?

Unittest passed

Closes #33735 from itholic/SPARK-36387.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-17 10:29:16 -07:00
Gengliang Wang 8bfb4f1e72 Revert "[SPARK-35028][SQL] ANSI mode: disallow group by aliases"
### What changes were proposed in this pull request?

Revert [[SPARK-35028][SQL] ANSI mode: disallow group by aliases ](https://github.com/apache/spark/pull/32129)

### Why are the changes needed?

It turns out that many users are using the group-by-alias feature. Spark has a precedence rule for when alias names conflict with column names in the GROUP BY clause: always use the table column. This should be reasonable and acceptable.
Also, external DBMSs such as PostgreSQL and MySQL allow grouping by aliases too.

As we are going to announce ANSI mode GA in Spark 3.2, I suggest allowing the group by alias in ANSI mode.
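
For example (a hedged sketch, assuming a SparkSession `spark`), a query that groups by the alias `k` keeps working in ANSI mode after this revert:

```scala
// Groups by the alias `k`, which resolves to the grouping expression id % 2.
spark.sql("SELECT id % 2 AS k, count(*) AS cnt FROM range(10) GROUP BY k").show()
```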

### Does this PR introduce _any_ user-facing change?

No, the feature is not released yet.

### How was this patch tested?

Unit tests

Closes #33758 from gengliangwang/revertGroupByAlias.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-17 20:23:49 +08:00
Max Gekk 82a31508af [SPARK-36524][SQL] Common class for ANSI interval types
### What changes were proposed in this pull request?
Add a new type `AnsiIntervalType` to `AbstractDataType.scala`, and have `YearMonthIntervalType` and `DayTimeIntervalType` extend it.

### Why are the changes needed?
To improve code maintenance. The change allows replacing a check of both `YearMonthIntervalType` and `DayTimeIntervalType` with a single check of `AnsiIntervalType`, for instance:
```scala
    case _: YearMonthIntervalType | _: DayTimeIntervalType => false
```
by
```scala
    case _: AnsiIntervalType => false
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By existing test suites.

Closes #33753 from MaxGekk/ansi-interval-type-trait.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-17 12:27:56 +03:00
Dongjoon Hyun ea13c5a743 [SPARK-36052][K8S][FOLLOWUP] Update config version to 3.2.0
### What changes were proposed in this pull request?

This PR is a follow-up to update the version of config, `spark.kubernetes.allocation.maxPendingPods`, from 3.3.0 to 3.2.0.

### Why are the changes needed?

SPARK-36052 landed at branch-3.2 to fix a bug.

### Does this PR introduce _any_ user-facing change?

Yes, but this is a new configuration to fix a bug.

### How was this patch tested?

Pass the CIs.

Closes #33755 from dongjoon-hyun/SPARK-36052.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-17 10:28:02 +09:00
Gengliang Wang 26d6b952dc [SPARK-36521][SQL] Disallow comparison between Interval and String
### What changes were proposed in this pull request?

Disallow comparison between Interval and String in the default type coercion rules.

### Why are the changes needed?

If a binary comparison contains an interval type and a string type, we can't decide which interval type the string should be promoted to. There are many possible interval types, such as year interval, month interval, day interval, hour interval, etc.
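
As a hedged illustration (assuming a SparkSession `spark`), a comparison like the one below is now rejected at analysis time because the string has no unambiguous interval type to be promoted to:

```scala
// Which interval type should '1' become? Undecidable, so this comparison now fails to resolve.
spark.sql("SELECT INTERVAL '1' YEAR > '1'")
```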

### Does this PR introduce _any_ user-facing change?

No, the new interval type is not released yet.

### How was this patch tested?

Existing UT

Closes #33750 from gengliangwang/disallowCom.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-16 22:41:14 +03:00
Yuanjian Li 3d57e00a7f [SPARK-36041][SS][DOCS] Introduce the RocksDBStateStoreProvider in the programming guide
### What changes were proposed in this pull request?
Add the document for the new RocksDBStateStoreProvider.
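
For reference, a minimal sketch of selecting the provider documented here (a hedged example: assumes a SparkSession `spark`; the fully qualified class name is quoted from Spark 3.2 for illustration):

```scala
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
```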

### Why are the changes needed?
User guide for the new feature.

### Does this PR introduce _any_ user-facing change?
No, doc only.

### How was this patch tested?
Doc only.

Closes #33683 from xuanyuanking/SPARK-36041.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-16 12:32:08 -07:00
zhuqi-lucas 05cd5f97c3 [SPARK-35548][CORE][SHUFFLE] Handling new attempt has started error message in BlockPushErrorHandler in client
### What changes were proposed in this pull request?
Add a new type of error message in BlockPushErrorHandler which indicates that the PushBlockStream message was received after a new application attempt has started. This error message should be correctly handled in the client without retrying the block push.

### Why are the changes needed?
When a block push fails because the attempt is too old, we will neither retry pushing the block nor log the exception on the client side.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add the corresponding unit test.

Closes #33617 from zhuqi-lucas/master.

Authored-by: zhuqi-lucas <821684824@qq.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-08-16 13:58:48 -05:00
Xinrong Meng 4dcd746025 [SPARK-36469][PYTHON] Implement Index.map
### What changes were proposed in this pull request?
Implement `Index.map`.

The PR is based on https://github.com/databricks/koalas/pull/2136. Thanks awdavidson for the prototype.

`map` of CategoricalIndex and DatetimeIndex will be implemented in separate PRs.

### Why are the changes needed?
Mapping values using input correspondence (a dict, Series, or function) is supported in pandas as [Index.map](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.map.html).
We should also support that.

### Does this PR introduce _any_ user-facing change?
Yes. `Index.map` is available now.

```py
>>> psidx = ps.Index([1, 2, 3])

>>> psidx.map({1: "one", 2: "two", 3: "three"})
Index(['one', 'two', 'three'], dtype='object')

>>> psidx.map(lambda id: "{id} + 1".format(id=id))
Index(['1 + 1', '2 + 1', '3 + 1'], dtype='object')

>>> pser = pd.Series(["one", "two", "three"], index=[1, 2, 3])
>>> psidx.map(pser)
Index(['one', 'two', 'three'], dtype='object')
```

### How was this patch tested?
Unit tests.

Closes #33694 from xinrong-databricks/index_map.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-16 11:06:10 -07:00
Kazuyuki Tanimura 8ee464cd7a [SPARK-32210][CORE] Fix NegativeArraySizeException in MapOutputTracker with large spark.default.parallelism
### What changes were proposed in this pull request?
The current `MapOutputTracker` class may throw `NegativeArraySizeException` with a large number of partitions. Within the serializeOutputStatuses() method, it compresses an array of mapStatuses and writes the binary data into an (Apache) ByteArrayOutputStream. Inside (Apache) ByteArrayOutputStream.toByteArray(), a negative-index exception happens because the index is an int and overflows (the 2GB limit) when the output binary size is too large.

This PR proposes two high-level ideas:
  1. Use `org.apache.spark.util.io.ChunkedByteBufferOutputStream`, which has a way to output the underlying buffer as `Array[Array[Byte]]`.
  2. Change the signatures from `Array[Byte]` to `Array[Array[Byte]]` in order to handle over 2GB compressed data.

### Why are the changes needed?
This issue seems to be missed out in the earlier effort of addressing 2GB limitations [SPARK-6235](https://issues.apache.org/jira/browse/SPARK-6235)

Without this fix, `spark.default.parallelism` needs to be kept at a low number. The drawback of a smaller spark.default.parallelism is that it requires more executor memory (more data per partition). Setting `spark.io.compression.zstd.level` to a higher number (default 1) hardly helps.

That essentially means there is a data size limit for shuffling, and it does not scale.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passed existing tests
```
build/sbt "core/testOnly org.apache.spark.MapOutputTrackerSuite"
```
Also added a new unit test
```
build/sbt "core/testOnly org.apache.spark.MapOutputTrackerSuite  -- -z SPARK-32210"
```
Ran the benchmark using GitHub Actions and did not observe any performance penalties. The results are attached in this PR:
```
core/benchmarks/MapStatusesSerDeserBenchmark-jdk11-results.txt
core/benchmarks/MapStatusesSerDeserBenchmark-results.txt
```

Closes #33721 from kazuyukitanimura/SPARK-32210.

Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-16 09:11:39 -07:00
Max Gekk f620996142 [SPARK-36418][SQL] Use CAST in parsing of dates/timestamps with default pattern
### What changes were proposed in this pull request?
In the PR, I propose to use the `CAST` logic when the pattern is not specified in `DateFormatter` or `TimestampFormatter`. In particular, invoke `DateTimeUtils.stringToTimestampAnsi()` or `stringToDateAnsi()` in that case.

### Why are the changes needed?
1. This can improve the user experience with Spark SQL by making the default date/timestamp parsers more flexible and tolerant of their inputs.
2. We make the default case consistent with the behavior of the `CAST` expression, which makes the implementation more consistent.

### Does this PR introduce _any_ user-facing change?
The changes shouldn't introduce a behavior change in regular cases, but they can influence corner cases. The new implementation is able to parse more dates/timestamps by default. For instance, the old (current) date parser can recognize dates only in the format **yyyy-MM-dd**, but the new one can handle:
   * `[+-]yyyy*`
   * `[+-]yyyy*-[m]m`
   * `[+-]yyyy*-[m]m-[d]d`
   * `[+-]yyyy*-[m]m-[d]d `
   * `[+-]yyyy*-[m]m-[d]d *`
   * `[+-]yyyy*-[m]m-[d]dT*`

Similarly for timestamps: the old (current) timestamp formatter is able to parse timestamps only in the format **yyyy-MM-dd HH:mm:ss** by default, but the new implementation can handle:
   * `[+-]yyyy*`
   * `[+-]yyyy*-[m]m`
   * `[+-]yyyy*-[m]m-[d]d`
   * `[+-]yyyy*-[m]m-[d]d `
   * `[+-]yyyy*-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
   * `[+-]yyyy*-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
   * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
   * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
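
For illustration (assuming a running `SparkSession` named `spark`; these lenient forms are what `CAST` already accepts, and the default formatters now follow the same logic):

```scala
// Illustrative only (assumes a running SparkSession named `spark`). CAST already accepts
// these lenient forms; with this change the default DateFormatter/TimestampFormatter
// follow the same logic.
spark.sql("SELECT CAST('2021-8-16' AS DATE)").show()            // -> 2021-08-16
spark.sql("SELECT CAST('2021-8-16T9:5:3' AS TIMESTAMP)").show() // -> 2021-08-16 09:05:03
```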

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *ImageFileFormatSuite"
$ build/sbt "test:testOnly *ParquetV2PartitionDiscoverySuite"
```

Closes #33709 from MaxGekk/datetime-cast-default-pattern.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-16 23:29:33 +08:00
Venkata krishnan Sowrirajan 2270ecf32f [SPARK-36374][SHUFFLE][DOC] Push-based shuffle high level user documentation
### What changes were proposed in this pull request?

Document the push-based shuffle feature with a high-level overview of the feature and the corresponding configuration options for both the shuffle server side and the client side. This is how the doc changes look in the browser ([img](https://user-images.githubusercontent.com/8871522/129231582-ad86ee2f-246f-4b42-9528-4ccd693e86d2.png)).

### Why are the changes needed?

Helps users understand the feature

### Does this PR introduce _any_ user-facing change?

Docs

### How was this patch tested?

N/A

Closes #33615 from venkata91/SPARK-36374.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-08-16 10:24:40 -05:00
Wenchen Fan f4b31c6068 [SPARK-36498][SQL] Reorder inner fields of the input query in byName V2 write
### What changes were proposed in this pull request?

Today, when we write data to a v2 table with byName mode, we only reorder the top-level columns, not inner struct fields. This doesn't make sense, as Spark should treat inner struct fields as first-class citizens (e.g. nested column pruning, filter pushdown with nested columns).

This PR improves `TableOutputResolver` to reorder inner fields as well.
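
For example, a hypothetical sketch (assuming a pre-configured v2 catalog named `testcat` whose table `testcat.ns.people` has the schema `id INT, info STRUCT<name: STRING, age: INT>`):

```scala
import org.apache.spark.sql.functions.{col, lit, struct}

// Hypothetical sketch: `testcat` is an assumed, pre-configured v2 catalog and
// `testcat.ns.people` has the schema (id INT, info STRUCT<name: STRING, age: INT>).
val df = spark.range(3).select(
  col("id").cast("int").as("id"),
  // The inner fields are deliberately produced in the opposite order: age before name.
  struct(lit(42).as("age"), lit("unknown").as("name")).as("info"))

// writeTo(...).append() resolves columns by name; with this change the inner struct
// fields (age, name) are reordered to match the table's (name, age) as well.
df.writeTo("testcat.ns.people").append()
```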

### Why are the changes needed?

Better user experience.

### Does this PR introduce _any_ user-facing change?

yes, more queries are allowed to write to v2 tables.

### How was this patch tested?

new test

Closes #33728 from cloud-fan/reorder.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-16 15:08:08 +08:00
Kousuke Saruta 9b9db5a8a0 [SPARK-36491][SQL] Make from_json/to_json to handle timestamp_ntz type properly
### What changes were proposed in this pull request?

This PR fixes an issue that `from_json` and `to_json` cannot handle `timestamp_ntz` type properly.
In the current master, `from_json`/`to_json` can handle the `timestamp` type as follows.
```
SELECT from_json('{"a":"2021-11-23 11:22:33"}', "a TIMESTAMP");
{"a":2021-11-23 11:22:33}
```
```
SELECT to_json(map("a", TIMESTAMP"2021-11-23 11:22:33"));
{"a":"2021-11-23T11:22:33.000+09:00"}
```
But they cannot handle the `timestamp_ntz` type properly.
```
SELECT from_json('{"a":"2021-11-23 11:22:33"}', "a TIMESTAMP_NTZ");
21/08/12 16:16:00 ERROR SparkSQLDriver: Failed in [SELECT from_json('{"a":"2021-11-23 11:22:33"}', "a TIMESTAMP_NTZ")]
java.lang.Exception: Unsupported type: timestamp_ntz
        at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedTypeError(QueryExecutionErrors.scala:777)
        at org.apache.spark.sql.catalyst.json.JacksonParser.makeConverter(JacksonParser.scala:339)
        at org.apache.spark.sql.catalyst.json.JacksonParser.$anonfun$makeConverter$17(JacksonParser.scala:313)
```
```
SELECT to_json(map("a", TIMESTAMP_NTZ"2021-11-23 11:22:33"));
21/08/12 16:14:07 ERROR SparkSQLDriver: Failed in [SELECT to_json(map("a", TIMESTAMP_NTZ"2021-11-23 11:22:33"))]
java.lang.RuntimeException: Failed to convert value 1637666553000000 (class of class java.lang.Long) with the type of TimestampNTZType to JSON.
        at org.apache.spark.sql.errors.QueryExecutionErrors$.failToConvertValueToJsonError(QueryExecutionErrors.scala:294)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$25(JacksonGenerator.scala:201)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$25$adapted(JacksonGenerator.scala:199)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.writeMapData(JacksonGenerator.scala:253)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$write$3(JacksonGenerator.scala:293)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.writeObject(JacksonGenerator.scala:206)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.write(JacksonGenerator.scala:292)
```
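
With the fix, the same calls are expected to succeed, mirroring the `timestamp` behavior above (illustrative, assuming a `SparkSession` named `spark`):

```scala
// Illustrative: after the fix, these should no longer throw, mirroring the TIMESTAMP case.
spark.sql("""SELECT from_json('{"a":"2021-11-23 11:22:33"}', "a TIMESTAMP_NTZ")""").show(false)
spark.sql("""SELECT to_json(map("a", TIMESTAMP_NTZ"2021-11-23 11:22:33"))""").show(false)
```
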
### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33742 from sarutak/json-ntz.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-16 10:03:22 +03:00
Liang-Chi Hsieh 8b8d91cf64 [SPARK-36465][SS] Dynamic gap duration in session window
### What changes were proposed in this pull request?

This patch supports dynamic gap duration in session window.

### Why are the changes needed?

The gap duration used in session windows is currently a static value. To support more complex usage, it is better to support a dynamic gap duration, which determines the gap by looking at the current data. For example, in our use case, we may want a different gap depending on a certain column in the input rows.
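
As a minimal sketch of what this enables (illustrative names; `events` is an assumed streaming DataFrame with `userId` and `timestamp` columns):

```scala
import org.apache.spark.sql.functions.{col, count, session_window, when}

// Illustrative: `events` is an assumed streaming DataFrame with userId and timestamp columns.
// The gap duration is now an expression evaluated per row instead of a single static value.
val dynamicGap = when(col("userId") === "power-user", "30 minutes").otherwise("5 minutes")

val sessionized = events
  .groupBy(session_window(col("timestamp"), dynamicGap), col("userId"))
  .agg(count("*").as("numEvents"))
```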

### Does this PR introduce _any_ user-facing change?

Yes, users can specify dynamic gap duration.

### How was this patch tested?

Modified existing tests and new test.

Closes #33691 from viirya/dynamic-session-window-gap.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-16 11:06:00 +09:00
Hyukjin Kwon e369499d14 [MINOR][DOCS] More correct results for GitHub Actions build link at README.md
### What changes were proposed in this pull request?

This PR proposes to use a more correct link for the GitHub Actions build so that it only shows the full builds at the commits in the master branch.

The current link includes scheduled builds (https://github.com/apache/spark/actions/workflows/build_and_test.yml?query=branch%3Amaster+event%3Aschedule).

### Why are the changes needed?

To show the full builds at the commits in the master branch only.

### Does this PR introduce _any_ user-facing change?

No, dev only.

Before:

![Screen Shot 2021-08-15 at 11 11 19 AM](https://user-images.githubusercontent.com/6477701/129464841-7207865e-3f94-4e18-8db6-e8fa36a926bf.png)

After:

![Screen Shot 2021-08-15 at 11 12 01 AM](https://user-images.githubusercontent.com/6477701/129464842-887d121e-88d3-4342-b11a-3048df88856d.png)

### How was this patch tested?

Manually checked by clicking these links.

Closes #33745 from HyukjinKwon/minor-link-readme.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-14 22:05:16 -07:00