Commit graph

28626 commits

Author SHA1 Message Date
Yuanjian Li 9f983a68f1 [SPARK-30294][SS][FOLLOW-UP] Directly override RDD methods
### Why are the changes needed?
Follow the comment: https://github.com/apache/spark/pull/26935#discussion_r514697997

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test and Mima test.

Closes #30344 from xuanyuanking/SPARK-30294-follow.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-12 12:22:25 +09:00
Ruifeng Zheng 6244407ce6 Revert "[WIP] Test (#30327)"
This reverts commit 61ee5d8a4e.

### What changes were proposed in this pull request?
I need to merge https://github.com/apache/spark/pull/30327 to https://github.com/apache/spark/pull/30009,
but I merged it to master by mistake.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #30345 from zhengruifeng/revert-30327-adaptively_blockify_linear_svc_II.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-12 11:32:12 +09:00
WeichenXu 61ee5d8a4e
[WIP] Test (#30327)
* resend

* address comments

* directly gen new Iter

* directly gen new Iter

* update blockify strategy

* address comments

* try to fix 2.13

* try to fix scala 2.13

* use 1.0 as the default value for gemv

* update

Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
2020-11-12 10:20:33 +08:00
Josh Soref 9d58a2f0f0 [MINOR][GRAPHX] Correct typos in the sub-modules: graphx, external, and examples
### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules: graphx, external, and examples.
Split per holdenk https://github.com/apache/spark/pull/30323#issuecomment-725159710

NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)

### Why are the changes needed?

Misspelled words make it harder to read / understand content.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No testing was performed

Closes #30326 from jsoref/spelling-graphx.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-12 08:29:22 +09:00
Steve Loughran 318a173fce
[SPARK-33402][CORE] Jobs launched in same second have duplicate MapReduce JobIDs
### What changes were proposed in this pull request?

1. Applies the SQL changes in SPARK-33230 to SparkHadoopWriter, so that `rdd.saveAsNewAPIHadoopDataset` passes in a unique job UUID in `spark.sql.sources.writeJobUUID`
1. `SparkHadoopWriterUtils.createJobTrackerID` generates a JobID by appending a random long number to the supplied timestamp to ensure the probability of a collision is near-zero.
1. With tests of uniqueness, round trips and negative jobID rejection.

### Why are the changes needed?

Without this, if more than one job is started in the same second *and the committer expects application attempt IDs to be unique* is at risk of clashing with other jobs.

With the fix,

* those committers which use the ID set in `spark.sql.sources.writeJobUUID` as a priority ID will pick that up instead and so be unique.
* committers which use the Hadoop JobID for unique paths and filenames will get the randomly generated jobID.  Assuming all clocks in a cluster in sync, the probability of two jobs launched in the same second has dropped from 1 to 1/(2^63)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests

There's a new test suite SparkHadoopWriterUtilsSuite which creates jobID, verifies they are unique even for the same timestamp and that they can be marshalled to string and parsed back in the hadoop code, which contains some (brittle) assumptions about the format of job IDs.

Functional Integration Tests

1. Hadoop-trunk built with [HADOOP-17318], publishing to local maven repository
1. Spark built with hadoop.version=3.4.0-SNAPSHOT to pick up these JARs.
1. Spark + Object store integration tests at [https://github.com/hortonworks-spark/cloud-integration](https://github.com/hortonworks-spark/cloud-integration) were built against that local spark version
1. And executed against AWS london.

The tests were run with `fs.s3a.committer.require.uuid=true`, so the s3a committers fail fast if they don't get a job ID down. This showed that `rdd.saveAsNewAPIHadoopDataset` wasn't setting the UUID option. It again uses the current Date value for an app attempt -which is not guaranteed to be unique.

With the change applied to spark, the relevant tests work, therefore the committers are getting unique job IDs.

Closes #30319 from steveloughran/BUG/SPARK-33402-jobuuid.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-11 14:27:48 -08:00
Max Gekk 7e867298fe
[SPARK-33404][SQL][FOLLOWUP] Update benchmark results for date_trunc
### What changes were proposed in this pull request?
Updated results of `DateTimeBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`|

### Why are the changes needed?
The fix https://github.com/apache/spark/pull/30303 slowed down `date_trunc`. This PR updates benchmark results to have actual info about performance of `date_trunc`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By regenerating benchmark results:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeBenchmark"
```

Closes #30338 from MaxGekk/fix-trunc_date-benchmark.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-11 08:50:43 -08:00
zero323 4b76a74f1c [SPARK-33415][PYTHON][SQL] Don't encode JVM response in Column.__repr__
### What changes were proposed in this pull request?

Removes encoding of the JVM response in `pyspark.sql.column.Column.__repr__`.

### Why are the changes needed?

API consistency and improved readability of the expressions.

### Does this PR introduce _any_ user-facing change?

Before this change

    col("abc")
    col("wąż")

result in

    Column<b'abc'>
    Column<b'w\xc4\x85\xc5\xbc'>

After this change we'll get

    Column<'abc'>
    Column<'wąż'>

### How was this patch tested?

Existing tests and manual inspection.

Closes #30322 from zero323/SPARK-33415.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-12 00:13:17 +09:00
stczwd 1eb236b936 [SPARK-32512][SQL] add alter table add/drop partition command for datasourcev2
### What changes were proposed in this pull request?
This patch is trying to add `AlterTableAddPartitionExec` and `AlterTableDropPartitionExec` with the new table partition API, defined in #28617.

### Does this PR introduce _any_ user-facing change?
Yes. User can use `alter table add partition` or `alter table drop partition` to create/drop partition in V2Table.

### How was this patch tested?
Run suites and fix old tests.

Closes #29339 from stczwd/SPARK-32512-new.

Lead-authored-by: stczwd <qcsd2011@163.com>
Co-authored-by: Jacky Lee <qcsd2011@163.com>
Co-authored-by: Jackey Lee <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-11 09:30:42 +00:00
Wenchen Fan 8760032f4f [SPARK-33412][SQL] OverwriteByExpression should resolve its delete condition based on the table relation not the input query
### What changes were proposed in this pull request?

Make a special case in `ResolveReferences`, which resolves `OverwriteByExpression`'s condition expression based on the table relation instead of the input query.

### Why are the changes needed?

The condition expression is passed to the table implementation at the end, so we should resolve it using table schema. Previously it works because we have a hack in `ResolveReferences` to delay the resolution if `outputResolved == false`. However, this hack doesn't work for tables accepting any schema like https://github.com/delta-io/delta/pull/521 . We may wrongly resolve the delete condition using input query's outout columns which don't match the table column names.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests and updated test in v2 write.

Closes #30318 from cloud-fan/v2-write.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-11 16:13:21 +09:00
Takeshi Yamamuro 4b367976a8 [SPARK-33417][SQL][TEST] Correct the behaviour of query filters in TPCDSQueryBenchmark
### What changes were proposed in this pull request?

This PR intends to fix the behaviour of query filters in `TPCDSQueryBenchmark`. We can use an option `--query-filter` for selecting TPCDS queries to run, e.g., `--query-filter q6,q8,q13`. But, the current master has a weird behaviour about the option. For example, if we pass `--query-filter q6` so as to run the TPCDS q6 only, `TPCDSQueryBenchmark` runs `q6` and `q6-v2.7` because the `filterQueries` method does not respect the name suffix. So, there is no way now to run the TPCDS q6 only.

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked.

Closes #30324 from maropu/FilterBugInTPCDSQueryBenchmark.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-11 15:24:05 +09:00
Terry Kim 6d5d030957 [SPARK-33414][SQL] Migrate SHOW CREATE TABLE command to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `SHOW CREATE TABLE` to use `UnresolvedTableOrView` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `SHOW CREATE TABLE` works only with a v1 table and a permanent view, and not supported for v2 tables.

### Why are the changes needed?

The changes allow consistent resolution behavior when resolving the table identifier. For example, the following is the current behavior:
```scala
sql("CREATE TEMPORARY VIEW t AS SELECT 1")
sql("CREATE DATABASE db")
sql("CREATE TABLE t (key INT, value STRING) USING hive")
sql("USE db")
sql("SHOW CREATE TABLE t AS SERDE") // Succeeds
```
With this change, `SHOW CREATE TABLE ... AS SERDE` above fails with the following:
```
org.apache.spark.sql.AnalysisException: t is a temp view not table or permanent view.; line 1 pos 0
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$43(Analyzer.scala:883)
  at scala.Option.map(Option.scala:230)
```
, which is expected since temporary view is resolved first and `SHOW CREATE TABLE ... AS SERDE` doesn't support a temporary view.

Note that there is no behavior change for `SHOW CREATE TABLE` without `AS SERDE` since it was already resolving to a temporary view first. See below for more detail.

### Does this PR introduce _any_ user-facing change?

After this PR, `SHOW CREATE TABLE t AS SERDE` is resolved to a temp view `t` instead of table `db.t` in the above scenario.

Note that there is no behavior change for `SHOW CREATE TABLE` without `AS SERDE`, but the exception message changes from `SHOW CREATE TABLE is not supported on a temporary view` to `t is a temp view not table or permanent view`.

### How was this patch tested?

Updated existing tests.

Closes #30321 from imback82/show_create_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-11 05:54:27 +00:00
Max Gekk 1e2eeda20e [SPARK-33382][SQL][TESTS] Unify datasource v1 and v2 SHOW TABLES tests
### What changes were proposed in this pull request?
In the PR, I propose to gather common `SHOW TABLES` tests into one trait `org.apache.spark.sql.execution.command.ShowTablesSuite`, and put datasource specific tests to the `v1.ShowTablesSuite` and `v2.ShowTablesSuite`. Also tests for parsing `SHOW TABLES` are extracted to `ShowTablesParserSuite`.

### Why are the changes needed?
- The unification will allow to run common `SHOW TABLES` tests for both DSv1 and DSv2
- We can detect missing features and differences between DSv1 and DSv2 implementations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
- `org.apache.spark.sql.execution.command.v1.ShowTablesSuite`
- `org.apache.spark.sql.execution.command.v2.ShowTablesSuite`
- `ShowTablesParserSuite`

Closes #30287 from MaxGekk/unify-dsv1_v2-tests.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-11 05:26:46 +00:00
ulysses 5197c5d2e7 [SPARK-33390][SQL] Make Literal support char array
### What changes were proposed in this pull request?

Make Literal support char array.

### Why are the changes needed?

We always use `Literal()` to create foldable value, and `char[]` is a usual data type. We can make it easy that support create String Literal with `char[]`.

### Does this PR introduce _any_ user-facing change?

Yes, user can call `Literal()` with `char[]`.

### How was this patch tested?

Add test.

Closes #30295 from ulysses-you/SPARK-33390.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-11 11:39:11 +09:00
Utkarsh 46346943bb [SPARK-33404][SQL] Fix incorrect results in date_trunc expression
### What changes were proposed in this pull request?
The following query produces incorrect results:
```
SELECT date_trunc('minute', '1769-10-17 17:10:02')
```
Spark currently incorrectly returns
```
1769-10-17 17:10:02
```
against the expected return value of
```
1769-10-17 17:10:00
```
**Steps to repro**
Run the following commands in spark-shell:
```
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("SELECT date_trunc('minute', '1769-10-17 17:10:02')").show()
```
This happens as `truncTimestamp` in package `org.apache.spark.sql.catalyst.util.DateTimeUtils` incorrectly assumes that time zone offsets can never have the granularity of a second and thus does not account for time zone adjustment when truncating the given timestamp to `minute`.
This assumption is currently used when truncating the timestamps to `microsecond, millisecond, second, or minute`.

This PR fixes this issue and always uses time zone knowledge when truncating timestamps regardless of the truncation unit.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added new tests to `DateTimeUtilsSuite` which previously failed and pass now.

Closes #30303 from utkarsh39/trunc-timestamp-fix.

Authored-by: Utkarsh <utkarsh.agarwal@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-11 09:28:59 +09:00
Liang-Chi Hsieh 6fa80ed1dd [SPARK-33337][SQL] Support subexpression elimination in branches of conditional expressions
### What changes were proposed in this pull request?

Currently we skip subexpression elimination in branches of conditional expressions including `If`, `CaseWhen`, and `Coalesce`. Actually we can do subexpression elimination for such branches if the subexpression is common across all branches. This patch proposes to support subexpression elimination in branches of conditional expressions.

### Why are the changes needed?

We may miss subexpression elimination chances in branches of conditional expressions. This kind of subexpression is frequently seen. It may be written manually by users or come from query optimizer. For example, project collapsing could embed expressions between two `Project`s and produces conditional expression like:

```
CASE WHEN jsonToStruct(json).a = '1' THEN 1.0 WHEN jsonToStruct(json).a = '2' THEN 2.0 ... ELSE 1.2 END
```

If `jsonToStruct(json)` is time-expensive expression, we don't eliminate the duplication and waste time on running it repeatedly now.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #30245 from viirya/SPARK-33337.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-11-10 16:17:00 -08:00
zero323 122c8999cb [SPARK-33251][FOLLOWUP][PYTHON][DOCS][MINOR] Adjusts returns PrefixSpan.findFrequentSequentialPatterns
### What changes were proposed in this pull request?

Changes

    pyspark.sql.dataframe.DataFrame

to

    :py:class:`pyspark.sql.DataFrame`

### Why are the changes needed?

Consistency (see https://github.com/apache/spark/pull/30285#pullrequestreview-526764104).

### Does this PR introduce _any_ user-facing change?

User will see shorter reference with a link.

### How was this patch tested?

`dev/lint-python` and manual check of the rendered docs.

Closes #30313 from zero323/SPARK-33251-FOLLOW-UP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
2020-11-10 09:17:00 -08:00
Chao Sun 3165ca742a [SPARK-33376][SQL] Remove the option of "sharesHadoopClasses" in Hive IsolatedClientLoader
### What changes were proposed in this pull request?

This removes the `sharesHadoopClasses` flag from `IsolatedClientLoader` in Hive module.

### Why are the changes needed?

Currently, when initializing `IsolatedClientLoader`, users can set the `sharesHadoopClasses` flag to decide whether the `HiveClient` created should share Hadoop classes with Spark itself or not. In the latter case, the client will only load Hadoop classes from the Hive dependencies.

There are two reasons to remove this:
1. this feature is currently used in two cases: 1) unit tests, 2) when the Hadoop version defined in Maven can not be found when `spark.sql.hive.metastore.jars` is equal to "maven", which could be very rare.
2. when `sharesHadoopClasses` is false, Spark doesn't really only use Hadoop classes from Hive jars: we also download `hadoop-client` jar and put all the sub-module jars (e.g., `hadoop-common`, `hadoop-hdfs`) together with the Hive jars, and the Hadoop version used by `hadoop-client` is the same version used by Spark itself. As result, we're mixing two versions of Hadoop jars in the classpath, which could potentially cause issues, especially considering that the default Hadoop version is already 3.2.0 while most Hive versions supported by the `IsolatedClientLoader` is still using Hadoop 2.x or even lower.

### Does this PR introduce _any_ user-facing change?

This affects Spark users in one scenario: when `spark.sql.hive.metastore.jars` is set to `maven` AND the Hadoop version specified in pom file cannot be downloaded, currently the behavior is to switch to _not_ share Hadoop classes, but with the PR it will share Hadoop classes with Spark.

### How was this patch tested?

Existing UTs.

Closes #30284 from sunchao/SPARK-33376.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 15:41:04 +00:00
angerszhu 34f5e7ce77 [SPARK-33302][SQL] Push down filters through Expand
### What changes were proposed in this pull request?
Push down filter through expand.  For case below:
```
create table t1(pid int, uid int, sid int, dt date, suid int) using parquet;
create table t2(pid int, vs int, uid int, csid int) using parquet;

SELECT
       years,
       appversion,
       SUM(uusers) AS users
FROM   (SELECT
               Date_trunc('year', dt)          AS years,
               CASE
                 WHEN h.pid = 3 THEN 'iOS'
                 WHEN h.pid = 4 THEN 'Android'
                 ELSE 'Other'
               END                             AS viewport,
               h.vs                            AS appversion,
               Count(DISTINCT u.uid)           AS uusers
               ,Count(DISTINCT u.suid)         AS srcusers
        FROM   t1 u
               join t2 h
                 ON h.uid = u.uid
        GROUP  BY 1,
                  2,
                  3) AS a
WHERE  viewport = 'iOS'
GROUP  BY 1,
          2
```

Plan. before this pr:
```
== Physical Plan ==
*(5) HashAggregate(keys=[years#30, appversion#32], functions=[sum(uusers#33L)])
+- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251]
   +- *(4) HashAggregate(keys=[years#30, appversion#32], functions=[partial_sum(uusers#33L)])
      +- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) u.`uid`#47 else null)])
         +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246]
            +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = 1)) u.`uid`#47 else null)])
               +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[])
                  +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), true, [id=#241]
                     +- *(2) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[])
                        +- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS)
                           +- *(2) Expand [ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44]
                              +- *(2) Project [uid#7, dt#9, suid#10, pid#11, vs#12]
                                 +- *(2) BroadcastHashJoin [uid#7], [uid#13], Inner, BuildRight
                                    :- *(2) Project [uid#7, dt#9, suid#10]
                                    :  +- *(2) Filter isnotnull(uid#7)
                                    :     +- *(2) ColumnarToRow
                                    :        +- FileScan parquet default.t1[uid#7,dt#9,suid#10] Batched: true, DataFilters: [isnotnull(uid#7)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-bin-hadoop3.2/spark-warehouse/t1], PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<uid:int,dt:date,suid:int>
                                    +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[2, int, true] as bigint))), [id=#233]
                                       +- *(1) Project [pid#11, vs#12, uid#13]
                                          +- *(1) Filter isnotnull(uid#13)
                                             +- *(1) ColumnarToRow
                                                +- FileScan parquet default.t2[pid#11,vs#12,uid#13] Batched: true, DataFilters: [isnotnull(uid#13)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-bin-hadoop3.2/spark-warehouse/t2], PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<pid:int,vs:int,uid:int>
```

Plan. after. this pr. :
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[years#0, appversion#2], functions=[sum(uusers#3L)], output=[years#0, appversion#2, users#5L])
   +- Exchange hashpartitioning(years#0, appversion#2, 5), true, [id=#71]
      +- HashAggregate(keys=[years#0, appversion#2], functions=[partial_sum(uusers#3L)], output=[years#0, appversion#2, sum#22L])
         +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12], functions=[count(distinct uid#7)], output=[years#0, appversion#2, uusers#3L])
            +- Exchange hashpartitioning(date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, 5), true, [id=#67]
               +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12], functions=[partial_count(distinct uid#7)], output=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, count#27L])
                  +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7], functions=[], output=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7])
                     +- Exchange hashpartitioning(date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7, 5), true, [id=#63]
                        +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles)) AS date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END AS CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7], functions=[], output=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7])
                           +- Project [uid#7, dt#9, pid#11, vs#12]
                              +- BroadcastHashJoin [uid#7], [uid#13], Inner, BuildRight, false
                                 :- Filter isnotnull(uid#7)
                                 :  +- FileScan parquet default.t1[uid#7,dt#9] Batched: true, DataFilters: [isnotnull(uid#7)], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/4l/7_c5c97s1_gb0d9_d6shygx00000gn/T/warehouse-c069d87..., PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<uid:int,dt:date>
                                 +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[2, int, false] as bigint)),false), [id=#58]
                                    +- Filter ((CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END = iOS) AND isnotnull(uid#13))
                                       +- FileScan parquet default.t2[pid#11,vs#12,uid#13] Batched: true, DataFilters: [(CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END = iOS), isnotnull..., Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/4l/7_c5c97s1_gb0d9_d6shygx00000gn/T/warehouse-c069d87..., PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<pid:int,vs:int,uid:int>

```

### Why are the changes needed?
Improve  performance, filter more data.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #30278 from AngersZhuuuu/SPARK-33302.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 14:40:24 +00:00
Chao Sun 4934da56bc [SPARK-33305][SQL] DSv2: DROP TABLE command should also invalidate cache
### What changes were proposed in this pull request?

This changes `DropTableExec` to also invalidate caches referencing the table to be dropped, in a cascading manner.

### Why are the changes needed?

In DSv1, `DROP TABLE` command also invalidate caches as described in [SPARK-19765](https://issues.apache.org/jira/browse/SPARK-19765). However in DSv2 the same command only drops the table but doesn't handle the caches. This could lead to correctness issue.

### Does this PR introduce _any_ user-facing change?

Yes. Now DSv2 `DROP TABLE` command also invalidates cache.

### How was this patch tested?

Added a new UT

Closes #30211 from sunchao/SPARK-33305.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 14:37:42 +00:00
lrz 27bb40b629 [SPARK-33339][PYTHON] Pyspark application will hang due to non Exception error
### What changes were proposed in this pull request?

When a system.exit exception occurs during the process, the python worker exits abnormally, and then the executor task is still waiting for the worker for reading from socket, causing it to hang.
The system.exit exception may be caused by the user's error code, but spark should at least throw an error to remind the user, not get stuck
we can run a simple test to reproduce this case:

```
from pyspark.sql import SparkSession
def err(line):
  raise SystemExit
spark = SparkSession.builder.appName("test").getOrCreate()
spark.sparkContext.parallelize(range(1,2), 2).map(err).collect()
spark.stop()
```

### Why are the changes needed?

to make sure pyspark application won't hang if there's non-Exception error in python worker

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

added a new test and also manually tested the case above

Closes #30248 from li36909/pyspark.

Lead-authored-by: lrz <lrz@lrzdeMacBook-Pro.local>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-10 19:39:18 +09:00
xuewei.linxuewei e3a768dd79 [SPARK-33391][SQL] element_at with CreateArray not respect one based index
### What changes were proposed in this pull request?

element_at with CreateArray not respect one based index.

repo step:

```
var df = spark.sql("select element_at(array(3, 2, 1), 0)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 1)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 2)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 3)")
df.printSchema()

root
– element_at(array(3, 2, 1), 0): integer (nullable = false)

root
– element_at(array(3, 2, 1), 1): integer (nullable = false)

root
– element_at(array(3, 2, 1), 2): integer (nullable = false)

root
– element_at(array(3, 2, 1), 3): integer (nullable = true)

correct answer should be
0 true which is outOfBounds return default true.
1 false
2 false
3 false

```

For expression eval, it respect the oneBasedIndex, but within checking the nullable, it calculates with zeroBasedIndex using `computeNullabilityFromArray`.

### Why are the changes needed?

Correctness issue.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added UT and existing UT.

Closes #30296 from leanken/leanken-SPARK-33391.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 07:23:47 +00:00
Yuanjian Li ad02ceda29 [SPARK-33244][SQL] Unify the code paths for spark.table and spark.read.table
### What changes were proposed in this pull request?

- Call `spark.read.table` in `spark.table`.
- Add comments for `spark.table` to emphasize it also support streaming temp view reading.

### Why are the changes needed?
The code paths of `spark.table` and `spark.read.table` should be the same. This behavior is broke in SPARK-32592 since we need to respect options in `spark.read.table` API.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #30148 from xuanyuanking/SPARK-33244.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 05:46:45 +00:00
Terry Kim 90f6f39e42 [SPARK-33366][SQL] Migrate LOAD DATA command to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `LOAD DATA` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `LOAD DATA` is not supported for v2 tables.

### Why are the changes needed?

The changes allow consistent resolution behavior when resolving the table identifier. For example, the following is the current behavior:
```scala
sql("CREATE TEMPORARY VIEW t AS SELECT 1")
sql("CREATE DATABASE db")
sql("CREATE TABLE t (key INT, value STRING) USING hive")
sql("USE db")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE t") // Succeeds
```
With this change, `LOAD DATA` above fails with the following:
```
org.apache.spark.sql.AnalysisException: t is a temp view not table.; line 1 pos 0
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$39(Analyzer.scala:865)
    at scala.Option.foreach(Option.scala:407)
```
, which is expected since temporary view is resolved first and `LOAD DATA` doesn't support a temporary view.

### Does this PR introduce _any_ user-facing change?

After this PR, `LOAD DATA ... t` is resolved to a temp view `t` instead of table `db.t` in the above scenario.

### How was this patch tested?

Updated existing tests.

Closes #30270 from imback82/load_data_cmd.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 05:28:06 +00:00
Gengliang Wang a1f84d8714 [SPARK-33369][SQL] DSV2: Skip schema inference in write if table provider supports external metadata
### What changes were proposed in this pull request?

When TableProvider.supportsExternalMetadata() is true, Spark will use the input Dataframe's schema in `DataframeWriter.save()`/`DataStreamWriter.start()` and skip schema/partitioning inference.

### Why are the changes needed?

For all the v2 data sources which are not FileDataSourceV2, Spark always infers the table schema/partitioning on `DataframeWriter.save()`/`DataStreamWriter.start()`.
The inference of table schema/partitioning can be expensive. However, there is no such trait or flag for indicating a V2 source can use the input DataFrame's schema on `DataframeWriter.save()`/`DataStreamWriter.start()`. We can resolve the problem by adding a new expected behavior for the method `TableProvider.supportsExternalMetadata()`.

### Does this PR introduce _any_ user-facing change?

Yes, a new behavior for the data source v2 API `TableProvider.supportsExternalMetadata()` when it returns true.

### How was this patch tested?

Unit test

Closes #30273 from gengliangwang/supportsExternalMetadata.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 04:43:32 +00:00
Chao Sun c2caf2522b [SPARK-33213][BUILD] Upgrade Apache Arrow to 2.0.0
### What changes were proposed in this pull request?

This upgrade Apache Arrow version from 1.0.1 to 2.0.0

### Why are the changes needed?

Apache Arrow 2.0.0 was released with some improvements from Java side, so it's better to upgrade Spark to the new version.
Note that the format version in Arrow 2.0.0 is still 1.0.0 so API should still be compatible between 1.0.1 and 2.0.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #30306 from sunchao/SPARK-33213.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-09 19:07:16 -08:00
Gabor Somogyi 4ac8133866 [SPARK-33223][SS][UI] Structured Streaming Web UI state information
### What changes were proposed in this pull request?
Structured Streaming UI is not containing state information. In this PR I've added it.

### Why are the changes needed?
Missing state information.

### Does this PR introduce _any_ user-facing change?
Additional UI elements appear.

### How was this patch tested?
Existing unit tests + manual test.
<img width="1044" alt="Screenshot 2020-10-30 at 15 14 21" src="https://user-images.githubusercontent.com/18561820/97715405-a1797000-1ac2-11eb-886a-e3e6efa3af3e.png">

Closes #30151 from gaborgsomogyi/SPARK-33223.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-11-10 11:22:35 +09:00
neko 4360c6f12a [SPARK-33363] Add prompt information related to the current task when pyspark/sparkR starts
### What changes were proposed in this pull request?
add prompt information about current applicationId, current URL and master info when pyspark / sparkR starts.

### Why are the changes needed?
The information printed when pyspark/sparkR starts does not prompt the basic information of current application, and it is not convenient when used pyspark/sparkR in dos.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
manual test result shows below:
![pyspark new print](https://user-images.githubusercontent.com/52202080/98274268-2a663f00-1fce-11eb-88ce-964ce90b439e.png)
![sparkR](https://user-images.githubusercontent.com/52202080/98541235-1a01dd00-22ca-11eb-9304-09bcde87b05e.png)

Closes #30266 from akiyamaneko/pyspark-hint-info.

Authored-by: neko <echohlne@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-10 11:12:19 +09:00
Dongjoon Hyun 35ac314181 [SPARK-33405][BUILD] Upgrade commons-compress to 1.20
### What changes were proposed in this pull request?

This PR aims to upgrade `commons-compress` from 1.8 to 1.20.

### Why are the changes needed?

- https://commons.apache.org/proper/commons-compress/security-reports.html

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #30304 from dongjoon-hyun/SPARK-33405.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-10 11:08:55 +09:00
Kent Yao 036c11b0d4 [SPARK-33397][YARN][DOC] Fix generating md to html for available-patterns-for-shs-custom-executor-log-url
### What changes were proposed in this pull request?

1. replace `{{}}`  with `&#123;&#123;&#125;&#125;`
2. using `<code></code>` in td-tag

### Why are the changes needed?

to fix this.
![image](https://user-images.githubusercontent.com/8326978/98544155-8c74bc00-22ce-11eb-8889-8dacb726b762.png)

### Does this PR introduce _any_ user-facing change?

yes, you will see the correct online doc with this change

![image](https://user-images.githubusercontent.com/8326978/98545256-2e48d880-22d0-11eb-9dd9-b8cae3df8659.png)

### How was this patch tested?

shown as the above pic via jekyll serve.

Closes #30298 from yaooqinn/SPARK-33397.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-10 10:15:55 +09:00
zero323 090962cd42 [SPARK-33251][PYTHON][DOCS] Migration to NumPy documentation style in ML (pyspark.ml.*)
### What changes were proposed in this pull request?

This PR proposes migration of `pyspark.ml` to NumPy documentation style.

### Why are the changes needed?

To improve documentation style.

### Does this PR introduce _any_ user-facing change?

Yes, this changes both rendered HTML docs and console representation (SPARK-33243).

### How was this patch tested?

`dev/lint-python` and manual inspection.

Closes #30285 from zero323/SPARK-33251.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-10 09:33:48 +09:00
huangtianhua 83a80796aa [SPARK-32691][BUILD] Update commons-crypto to v1.1.0
### What changes were proposed in this pull request?
Update the package commons-crypto to v1.1.0 to support aarch64 platform
- https://issues.apache.org/jira/browse/CRYPTO-139

### Why are the changes needed?

The package commons-crypto-1.0.0 available in the Maven repository
doesn't support aarch64 platform. It costs long time in
CryptoRandomFactory.getCryptoRandom(properties).nextBytes(iv) when NettyBlockRpcSever
receive block data from client,  if the time more than the default value 120s, IOException raised and client
will retry replicate the block data to other executors. But in fact the replication is complete,
it makes the replication number incorrect.
This makes DistributedSuite tests pass.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Pass the CIs.

Closes #30275 from huangtianhua/SPARK-32691.

Authored-by: huangtianhua <huangtianhua223@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-09 14:33:27 -08:00
Chandni Singh 8113c88542 [SPARK-32916][SHUFFLE] Implementation of shuffle service that leverages push-based shuffle in YARN deployment mode
### What changes were proposed in this pull request?
This is one of the patches for SPIP [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602) which is needed for push-based shuffle.
Summary of changes:
- Adds an implementation of `MergedShuffleFileManager` which was introduced with [Spark 32915](https://issues.apache.org/jira/browse/SPARK-32915).
- Integrated the push-based shuffle service with `YarnShuffleService`.

### Why are the changes needed?
Refer to the SPIP in  [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests.
The reference PR with the consolidated changes covering the complete implementation is also provided in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
We have already verified the functionality and the improved performance as documented in the SPIP doc.

Lead-authored-by: Min Shen mshenlinkedin.com
Co-authored-by: Chandni Singh chsinghlinkedin.com
Co-authored-by: Ye Zhou yezhoulinkedin.com

Closes #30062 from otterc/SPARK-32916.

Lead-authored-by: Chandni Singh <singh.chandni@gmail.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Co-authored-by: Ye Zhou <yezhou@linkedin.com>
Co-authored-by: Min Shen <mshen@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2020-11-09 11:00:52 -06:00
Peter Toth 84dc374611 [SPARK-33303][SQL] Deduplicate deterministic PythonUDF calls
### What changes were proposed in this pull request?
This PR modifies the `ExtractPythonUDFs` rule to deduplicate deterministic PythonUDF calls.

Before this PR the dataframe: `df.withColumn("c", batchedPythonUDF(col("a"))).withColumn("d", col("c"))` has the plan:
```
*(1) Project [value#1 AS a#4, pythonUDF1#15 AS c#7, pythonUDF1#15 AS d#10]
+- BatchEvalPython [dummyUDF(value#1), dummyUDF(value#1)], [pythonUDF0#14, pythonUDF1#15]
   +- LocalTableScan [value#1]
```
After this PR the deterministic PythonUDF calls are deduplicated:
```
*(1) Project [value#1 AS a#4, pythonUDF0#14 AS c#7, pythonUDF0#14 AS d#10]
+- BatchEvalPython [dummyUDF(value#1)], [pythonUDF0#14]
   +- LocalTableScan [value#1]
```

### Why are the changes needed?
To fix a performance issue.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New and existing UTs.

Closes #30203 from peter-toth/SPARK-33303-deduplicate-deterministic-udf-calls.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-09 19:27:36 +09:00
Linhong Liu 4e1c89400d [SPARK-33140][SQL][FOLLOW-UP] Use sparkSession in AQE context when applying rules
### What changes were proposed in this pull request?
After #30097, all rules are using `SparkSession.active` to get `SQLConf`
and `SparkSession`. But in AQE, when applying the rules for the initial plan,
we should use the spark session in AQE context.

### Why are the changes needed?
Fix potential problem caused by using the wrong spark session

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing ut

Closes #30294 from linhongliu-db/SPARK-33140-followup.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-09 09:44:58 +00:00
Yuming Wang 7a5647a93a [SPARK-33385][SQL] Support bucket pruning for IsNaN
### What changes were proposed in this pull request?

This pr add support bucket pruning on `IsNaN` predicate.

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30291 from wangyum/SPARK-33385.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-09 09:20:31 +00:00
Yuming Wang 69799c514f [SPARK-33372][SQL] Fix InSet bucket pruning
### What changes were proposed in this pull request?

This pr fix `InSet` bucket pruning because of it's values should not be `Literal`:
cbd3fdea62/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (L253-L255)

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and manual test:

```scala
spark.sql("select id as a, id as b from range(10000)").write.bucketBy(100, "a").saveAsTable("t")
spark.sql("select * from t where a in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)").show
```

Before this PR | After this PR
-- | --
![image](https://user-images.githubusercontent.com/5399861/98380788-fb120980-2083-11eb-8fae-4e21ad873e9b.png) | ![image](https://user-images.githubusercontent.com/5399861/98381095-5ba14680-2084-11eb-82ca-2d780c85305c.png)

Closes #30279 from wangyum/SPARK-33372.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-09 08:32:51 +00:00
Wenchen Fan 98730b7ee2 [SPARK-33087][SQL] DataFrameWriterV2 should delegate table resolution to the analyzer
### What changes were proposed in this pull request?

This PR makes `DataFrameWriterV2` to create query plans with `UnresolvedRelation` and leave the table resolution work to the analyzer.

### Why are the changes needed?

Table resolution work should be done by the analyzer. After this PR, the behavior is more consistent between different APIs (DataFrameWriter, DataFrameWriterV2 and SQL). See the next section for behavior changes.

### Does this PR introduce _any_ user-facing change?

Yes.
1. writes to a temp view of v2 relation: previously it fails with table not found exception, now it works if the v2 relation is writable. This is consistent with `DataFrameWriter` and SQL INSERT.
2. writes to other temp views: previously it fails with table not found exception, now it fails with a more explicit error message, saying that writing to a temp view of non-v2-relation is not allowed.
3. writes to a view: previously it fails with table not writable error, now it fails with a more explicit error message, saying that writing to a view is not allowed.
4. writes to a v1 table: previously it fails with table not writable error, now it fails with a more explicit error message, saying that writing to a v1 table is not allowed. (We can allow it later, by falling back to v1 command)

### How was this patch tested?

new tests

Closes #29970 from cloud-fan/refactor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-09 08:08:00 +00:00
Huaxin Gao bfb257f078 [SPARK-32405][SQL] Apply table options while creating tables in JDBC Table Catalog
### What changes were proposed in this pull request?
Currently in JDBCTableCatalog, we ignore the table options when creating table.
```
    // TODO (SPARK-32405): Apply table options while creating tables in JDBC Table Catalog
    if (!properties.isEmpty) {
      logWarning("Cannot create JDBC table with properties, these properties will be " +
        "ignored: " + properties.asScala.map { case (k, v) => s"$k=$v" }.mkString("[", ", ", "]"))
    }
```

### Why are the changes needed?
need to apply the table options when we create table

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
add new test

Closes #30154 from huaxingao/table_options.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-09 07:02:14 +00:00
Dongjoon Hyun aa0849b46a [SPARK-33387][CORE] Support ordered shuffle block migration
### What changes were proposed in this pull request?

This PR aims to support sorted shuffle block migration.

### Why are the changes needed?

Since the current shuffle block migration works in a random order, the failure during worker decommission affects all shuffles. We had better finish the shuffles one by one to minimize the number of affected shuffle.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #30293 from dongjoon-hyun/SPARK-33387.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-08 22:43:27 -08:00
Liang-Chi Hsieh c269b53f07 [SPARK-33384][SS] Delete temporary file when cancelling writing to final path even underlying stream throwing error
### What changes were proposed in this pull request?

In `RenameBasedFSDataOutputStream.cancel`, we do two things: closing underlying stream and delete temporary file, in a single try/catch block. Closing `OutputStream` could possibly throw `IOException` so we possibly missing deleting temporary file.

This patch proposes to delete temporary even underlying stream throwing error.

### Why are the changes needed?

To avoid leaving temporary files during canceling writing in `RenameBasedFSDataOutputStream`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #30290 from viirya/SPARK-33384.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-08 18:44:26 -08:00
yangjie01 02fd52cfbc [SPARK-33352][CORE][SQL][SS][MLLIB][AVRO][K8S] Fix procedure-like declaration compilation warnings in Scala 2.13
### What changes were proposed in this pull request?
There are two similar compilation warnings about procedure-like declaration in Scala 2.13:

```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: procedure syntax is deprecated for constructors: add `=`, as in method definition
```
and

```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:211: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `run`'s return type
```

this pr is the first part to resolve SPARK-33352:

- For constructors method definition add `=` to convert to function syntax

- For without `return type` methods definition add `: Unit =` to convert to function syntax

### Why are the changes needed?
Eliminate compilation warnings in Scala 2.13 and this change should be compatible with Scala 2.12

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30255 from LuciferYang/SPARK-29392-FOLLOWUP.1.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-11-08 12:51:48 -06:00
Hannah Amundson 1090b1b00a [SPARK-32860][DOCS][SQL] Updating documentation about map support in Encoders
### What changes were proposed in this pull request?

Javadocs updated for the encoder to include maps as a collection type

### Why are the changes needed?

The javadocs were not updated with fix SPARK-16706

### Does this PR introduce _any_ user-facing change?

Yes, the javadocs are updated

### How was this patch tested?

sbt was run to ensure it meets scalastyle

Closes #30274 from hannahkamundson/SPARK-32860.

Lead-authored-by: Hannah Amundson <amundson.hannah@heb.com>
Co-authored-by: Hannah <48397717+hannahkamundson@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-08 20:29:24 +09:00
HyukjinKwon e11a24c1ba [SPARK-33371][PYTHON] Update setup.py and tests for Python 3.9
### What changes were proposed in this pull request?

This PR proposes to fix PySpark to officially support Python 3.9. The main codes already work. We should just note that we support Python 3.9.

Also, this PR fixes some minor fixes into the test codes.
- `Thread.isAlive` is removed in Python 3.9, and `Thread.is_alive` exists in Python 3.6+, see https://docs.python.org/3/whatsnew/3.9.html#removed
- Fixed `TaskContextTestsWithWorkerReuse.test_barrier_with_python_worker_reuse` and `TaskContextTests.test_barrier` to be less flaky. This becomes more flaky in Python 3.9 for some reasons.

NOTE that PyArrow does not support Python 3.9 yet.

### Why are the changes needed?

To officially support Python 3.9.

### Does this PR introduce _any_ user-facing change?

Yes, it officially supports Python 3.9.

### How was this patch tested?

Manually ran the tests:

```
$  ./run-tests --python-executable=python
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-resource', 'pyspark-sql', 'pyspark-streaming']
python python_implementation is CPython
python version is: Python 3.9.0
Starting test(python): pyspark.ml.tests.test_base
Starting test(python): pyspark.ml.tests.test_evaluation
Starting test(python): pyspark.ml.tests.test_algorithms
Starting test(python): pyspark.ml.tests.test_feature
Finished test(python): pyspark.ml.tests.test_base (12s)
Starting test(python): pyspark.ml.tests.test_image
Finished test(python): pyspark.ml.tests.test_evaluation (15s)
Starting test(python): pyspark.ml.tests.test_linalg
Finished test(python): pyspark.ml.tests.test_feature (25s)
Starting test(python): pyspark.ml.tests.test_param
Finished test(python): pyspark.ml.tests.test_image (17s)
Starting test(python): pyspark.ml.tests.test_persistence
Finished test(python): pyspark.ml.tests.test_param (17s)
Starting test(python): pyspark.ml.tests.test_pipeline
Finished test(python): pyspark.ml.tests.test_linalg (30s)
Starting test(python): pyspark.ml.tests.test_stat
Finished test(python): pyspark.ml.tests.test_pipeline (6s)
Starting test(python): pyspark.ml.tests.test_training_summary
Finished test(python): pyspark.ml.tests.test_stat (12s)
Starting test(python): pyspark.ml.tests.test_tuning
Finished test(python): pyspark.ml.tests.test_algorithms (68s)
Starting test(python): pyspark.ml.tests.test_wrapper
Finished test(python): pyspark.ml.tests.test_persistence (51s)
Starting test(python): pyspark.mllib.tests.test_algorithms
Finished test(python): pyspark.ml.tests.test_training_summary (33s)
Starting test(python): pyspark.mllib.tests.test_feature
Finished test(python): pyspark.ml.tests.test_wrapper (19s)
Starting test(python): pyspark.mllib.tests.test_linalg
Finished test(python): pyspark.mllib.tests.test_feature (26s)
Starting test(python): pyspark.mllib.tests.test_stat
Finished test(python): pyspark.mllib.tests.test_stat (22s)
Starting test(python): pyspark.mllib.tests.test_streaming_algorithms
Finished test(python): pyspark.mllib.tests.test_algorithms (53s)
Starting test(python): pyspark.mllib.tests.test_util
Finished test(python): pyspark.mllib.tests.test_linalg (54s)
Starting test(python): pyspark.sql.tests.test_arrow
Finished test(python): pyspark.sql.tests.test_arrow (0s) ... 61 tests were skipped
Starting test(python): pyspark.sql.tests.test_catalog
Finished test(python): pyspark.mllib.tests.test_util (11s)
Starting test(python): pyspark.sql.tests.test_column
Finished test(python): pyspark.sql.tests.test_catalog (16s)
Starting test(python): pyspark.sql.tests.test_conf
Finished test(python): pyspark.sql.tests.test_column (17s)
Starting test(python): pyspark.sql.tests.test_context
Finished test(python): pyspark.sql.tests.test_context (6s) ... 3 tests were skipped
Starting test(python): pyspark.sql.tests.test_dataframe
Finished test(python): pyspark.sql.tests.test_conf (11s)
Starting test(python): pyspark.sql.tests.test_datasources
Finished test(python): pyspark.sql.tests.test_datasources (19s)
Starting test(python): pyspark.sql.tests.test_functions
Finished test(python): pyspark.sql.tests.test_dataframe (35s) ... 3 tests were skipped
Starting test(python): pyspark.sql.tests.test_group
Finished test(python): pyspark.sql.tests.test_functions (32s)
Starting test(python): pyspark.sql.tests.test_pandas_cogrouped_map
Finished test(python): pyspark.sql.tests.test_pandas_cogrouped_map (1s) ... 15 tests were skipped
Starting test(python): pyspark.sql.tests.test_pandas_grouped_map
Finished test(python): pyspark.sql.tests.test_group (19s)
Starting test(python): pyspark.sql.tests.test_pandas_map
Finished test(python): pyspark.sql.tests.test_pandas_grouped_map (0s) ... 21 tests were skipped
Starting test(python): pyspark.sql.tests.test_pandas_udf
Finished test(python): pyspark.sql.tests.test_pandas_map (0s) ... 6 tests were skipped
Starting test(python): pyspark.sql.tests.test_pandas_udf_grouped_agg
Finished test(python): pyspark.sql.tests.test_pandas_udf (0s) ... 6 tests were skipped
Starting test(python): pyspark.sql.tests.test_pandas_udf_scalar
Finished test(python): pyspark.sql.tests.test_pandas_udf_grouped_agg (0s) ... 13 tests were skipped
Starting test(python): pyspark.sql.tests.test_pandas_udf_typehints
Finished test(python): pyspark.sql.tests.test_pandas_udf_scalar (0s) ... 50 tests were skipped
Starting test(python): pyspark.sql.tests.test_pandas_udf_window
Finished test(python): pyspark.sql.tests.test_pandas_udf_typehints (0s) ... 10 tests were skipped
Starting test(python): pyspark.sql.tests.test_readwriter
Finished test(python): pyspark.sql.tests.test_pandas_udf_window (0s) ... 14 tests were skipped
Starting test(python): pyspark.sql.tests.test_serde
Finished test(python): pyspark.sql.tests.test_serde (19s)
Starting test(python): pyspark.sql.tests.test_session
Finished test(python): pyspark.mllib.tests.test_streaming_algorithms (120s)
Starting test(python): pyspark.sql.tests.test_streaming
Finished test(python): pyspark.sql.tests.test_readwriter (25s)
Starting test(python): pyspark.sql.tests.test_types
Finished test(python): pyspark.ml.tests.test_tuning (208s)
Starting test(python): pyspark.sql.tests.test_udf
Finished test(python): pyspark.sql.tests.test_session (31s)
Starting test(python): pyspark.sql.tests.test_utils
Finished test(python): pyspark.sql.tests.test_streaming (35s)
Starting test(python): pyspark.streaming.tests.test_context
Finished test(python): pyspark.sql.tests.test_types (34s)
Starting test(python): pyspark.streaming.tests.test_dstream
Finished test(python): pyspark.sql.tests.test_utils (14s)
Starting test(python): pyspark.streaming.tests.test_kinesis
Finished test(python): pyspark.streaming.tests.test_kinesis (0s) ... 2 tests were skipped
Starting test(python): pyspark.streaming.tests.test_listener
Finished test(python): pyspark.streaming.tests.test_listener (11s)
Starting test(python): pyspark.tests.test_appsubmit
Finished test(python): pyspark.sql.tests.test_udf (39s)
Starting test(python): pyspark.tests.test_broadcast
Finished test(python): pyspark.streaming.tests.test_context (23s)
Starting test(python): pyspark.tests.test_conf
Finished test(python): pyspark.tests.test_conf (15s)
Starting test(python): pyspark.tests.test_context
Finished test(python): pyspark.tests.test_broadcast (33s)
Starting test(python): pyspark.tests.test_daemon
Finished test(python): pyspark.tests.test_daemon (5s)
Starting test(python): pyspark.tests.test_install_spark
Finished test(python): pyspark.tests.test_context (44s)
Starting test(python): pyspark.tests.test_join
Finished test(python): pyspark.tests.test_appsubmit (68s)
Starting test(python): pyspark.tests.test_profiler
Finished test(python): pyspark.tests.test_join (7s)
Starting test(python): pyspark.tests.test_rdd
Finished test(python): pyspark.tests.test_profiler (9s)
Starting test(python): pyspark.tests.test_rddbarrier
Finished test(python): pyspark.tests.test_rddbarrier (7s)
Starting test(python): pyspark.tests.test_readwrite
Finished test(python): pyspark.streaming.tests.test_dstream (107s)
Starting test(python): pyspark.tests.test_serializers
Finished test(python): pyspark.tests.test_serializers (8s)
Starting test(python): pyspark.tests.test_shuffle
Finished test(python): pyspark.tests.test_readwrite (14s)
Starting test(python): pyspark.tests.test_taskcontext
Finished test(python): pyspark.tests.test_install_spark (65s)
Starting test(python): pyspark.tests.test_util
Finished test(python): pyspark.tests.test_shuffle (8s)
Starting test(python): pyspark.tests.test_worker
Finished test(python): pyspark.tests.test_util (5s)
Starting test(python): pyspark.accumulators
Finished test(python): pyspark.accumulators (5s)
Starting test(python): pyspark.broadcast
Finished test(python): pyspark.broadcast (6s)
Starting test(python): pyspark.conf
Finished test(python): pyspark.tests.test_worker (14s)
Starting test(python): pyspark.context
Finished test(python): pyspark.conf (4s)
Starting test(python): pyspark.ml.classification
Finished test(python): pyspark.tests.test_rdd (60s)
Starting test(python): pyspark.ml.clustering
Finished test(python): pyspark.context (21s)
Starting test(python): pyspark.ml.evaluation
Finished test(python): pyspark.tests.test_taskcontext (69s)
Starting test(python): pyspark.ml.feature
Finished test(python): pyspark.ml.evaluation (26s)
Starting test(python): pyspark.ml.fpm
Finished test(python): pyspark.ml.clustering (45s)
Starting test(python): pyspark.ml.functions
Finished test(python): pyspark.ml.fpm (24s)
Starting test(python): pyspark.ml.image
Finished test(python): pyspark.ml.functions (17s)
Starting test(python): pyspark.ml.linalg.__init__
Finished test(python): pyspark.ml.linalg.__init__ (0s)
Starting test(python): pyspark.ml.recommendation
Finished test(python): pyspark.ml.classification (74s)
Starting test(python): pyspark.ml.regression
Finished test(python): pyspark.ml.image (8s)
Starting test(python): pyspark.ml.stat
Finished test(python): pyspark.ml.stat (29s)
Starting test(python): pyspark.ml.tuning
Finished test(python): pyspark.ml.regression (53s)
Starting test(python): pyspark.mllib.classification
Finished test(python): pyspark.ml.tuning (35s)
Starting test(python): pyspark.mllib.clustering
Finished test(python): pyspark.ml.feature (103s)
Starting test(python): pyspark.mllib.evaluation
Finished test(python): pyspark.mllib.classification (33s)
Starting test(python): pyspark.mllib.feature
Finished test(python): pyspark.mllib.evaluation (21s)
Starting test(python): pyspark.mllib.fpm
Finished test(python): pyspark.ml.recommendation (103s)
Starting test(python): pyspark.mllib.linalg.__init__
Finished test(python): pyspark.mllib.linalg.__init__ (1s)
Starting test(python): pyspark.mllib.linalg.distributed
Finished test(python): pyspark.mllib.feature (26s)
Starting test(python): pyspark.mllib.random
Finished test(python): pyspark.mllib.fpm (23s)
Starting test(python): pyspark.mllib.recommendation
Finished test(python): pyspark.mllib.clustering (50s)
Starting test(python): pyspark.mllib.regression
Finished test(python): pyspark.mllib.random (13s)
Starting test(python): pyspark.mllib.stat.KernelDensity
Finished test(python): pyspark.mllib.stat.KernelDensity (1s)
Starting test(python): pyspark.mllib.stat._statistics
Finished test(python): pyspark.mllib.linalg.distributed (42s)
Starting test(python): pyspark.mllib.tree
Finished test(python): pyspark.mllib.stat._statistics (19s)
Starting test(python): pyspark.mllib.util
Finished test(python): pyspark.mllib.regression (33s)
Starting test(python): pyspark.profiler
Finished test(python): pyspark.mllib.recommendation (36s)
Starting test(python): pyspark.rdd
Finished test(python): pyspark.profiler (9s)
Starting test(python): pyspark.resource.tests.test_resources
Finished test(python): pyspark.mllib.tree (19s)
Starting test(python): pyspark.serializers
Finished test(python): pyspark.mllib.util (21s)
Starting test(python): pyspark.shuffle
Finished test(python): pyspark.resource.tests.test_resources (9s)
Starting test(python): pyspark.sql.avro.functions
Finished test(python): pyspark.shuffle (1s)
Starting test(python): pyspark.sql.catalog
Finished test(python): pyspark.rdd (22s)
Starting test(python): pyspark.sql.column
Finished test(python): pyspark.serializers (12s)
Starting test(python): pyspark.sql.conf
Finished test(python): pyspark.sql.conf (6s)
Starting test(python): pyspark.sql.context
Finished test(python): pyspark.sql.catalog (14s)
Starting test(python): pyspark.sql.dataframe
Finished test(python): pyspark.sql.avro.functions (15s)
Starting test(python): pyspark.sql.functions
Finished test(python): pyspark.sql.column (24s)
Starting test(python): pyspark.sql.group
Finished test(python): pyspark.sql.context (20s)
Starting test(python): pyspark.sql.pandas.conversion
Finished test(python): pyspark.sql.pandas.conversion (13s)
Starting test(python): pyspark.sql.pandas.group_ops
Finished test(python): pyspark.sql.group (36s)
Starting test(python): pyspark.sql.pandas.map_ops
Finished test(python): pyspark.sql.pandas.group_ops (21s)
Starting test(python): pyspark.sql.pandas.serializers
Finished test(python): pyspark.sql.pandas.serializers (0s)
Starting test(python): pyspark.sql.pandas.typehints
Finished test(python): pyspark.sql.pandas.typehints (0s)
Starting test(python): pyspark.sql.pandas.types
Finished test(python): pyspark.sql.pandas.types (0s)
Starting test(python): pyspark.sql.pandas.utils
Finished test(python): pyspark.sql.pandas.utils (0s)
Starting test(python): pyspark.sql.readwriter
Finished test(python): pyspark.sql.dataframe (56s)
Starting test(python): pyspark.sql.session
Finished test(python): pyspark.sql.functions (57s)
Starting test(python): pyspark.sql.streaming
Finished test(python): pyspark.sql.pandas.map_ops (12s)
Starting test(python): pyspark.sql.types
Finished test(python): pyspark.sql.types (10s)
Starting test(python): pyspark.sql.udf
Finished test(python): pyspark.sql.streaming (16s)
Starting test(python): pyspark.sql.window
Finished test(python): pyspark.sql.session (19s)
Starting test(python): pyspark.streaming.util
Finished test(python): pyspark.streaming.util (0s)
Starting test(python): pyspark.util
Finished test(python): pyspark.util (0s)
Finished test(python): pyspark.sql.readwriter (24s)
Finished test(python): pyspark.sql.udf (13s)
Finished test(python): pyspark.sql.window (14s)
Tests passed in 780 seconds

```

Closes #30277 from HyukjinKwon/SPARK-33371.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-06 15:05:37 -08:00
yangjie01 fb9c873e7d [SPARK-33347][CORE] Cleanup useless variables of MutableApplicationInfo
### What changes were proposed in this pull request?
There are 4 fields in `MutableApplicationInfo ` seems useless:

- `coresGranted`
- `maxCores`
- `coresPerExecutor`
- `memoryPerExecutorMB`

They are always `None` and not reassigned.

So the main change of this pr is  cleanup these useless fields in `MutableApplicationInfo`.

### Why are the changes needed?
Cleanup useless variables.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30251 from LuciferYang/SPARK-33347.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-11-07 06:43:27 +09:00
Stuart White 09fa7ecae1 [SPARK-33291][SQL] Improve DataFrame.show for nulls in arrays and structs
### What changes were proposed in this pull request?
The changes in [SPARK-32501 Inconsistent NULL conversions to strings](https://issues.apache.org/jira/browse/SPARK-32501) introduced some behavior that I'd like to clean up a bit.

Here's sample code to illustrate the behavior I'd like to clean up:

```scala
val rows = Seq[String](null)
  .toDF("value")
  .withColumn("struct1", struct('value as "value1"))
  .withColumn("struct2", struct('value as "value1", 'value as "value2"))
  .withColumn("array1", array('value))
  .withColumn("array2", array('value, 'value))

// Show the DataFrame using the "first" codepath.
rows.show(truncate=false)
+-----+-------+-------------+------+--------+
|value|struct1|struct2      |array1|array2  |
+-----+-------+-------------+------+--------+
|null |{ null}|{ null, null}|[]    |[, null]|
+-----+-------+-------------+------+--------+

// Write the DataFrame to disk, then read it back and show it to trigger the "codegen" code path:
rows.write.parquet("rows")
spark.read.parquet("rows").show(truncate=false)

+-----+-------+-------------+-------+-------------+
|value|struct1|struct2      |array1 |array2       |
+-----+-------+-------------+-------+-------------+
|null |{ null}|{ null, null}|[ null]|[ null, null]|
+-----+-------+-------------+-------+-------------+
```

Notice:

1. If the first element of a struct is null, it is printed with a leading space (e.g. "\{ null\}").  I think it's preferable to print it without the leading space (e.g. "\{null\}").  This is consistent with how non-null values are printed inside a struct.
2. If the first element of an array is null, it is not printed at all in the first code path, and the "codegen" code path prints it with a leading space.  I think both code paths should be consistent and print it without a leading space (e.g. "[null]").

The desired result of this PR is to product the following output via both code paths:

```
+-----+-------+------------+------+------------+
|value|struct1|struct2     |array1|array2      |
+-----+-------+------------+------+------------+
|null |{null} |{null, null}|[null]|[null, null]|
+-----+-------+------------+------+------------+
```

This contribution is my original work and I license the work to the project under the project’s open source license.

### Why are the changes needed?

To correct errors and inconsistencies in how DataFrame.show() displays nulls inside arrays and structs.

### Does this PR introduce _any_ user-facing change?

Yes.  This PR changes what is printed out by DataFrame.show().

### How was this patch tested?

I added new test cases in CastSuite.scala to cover the cases addressed by this PR.

Closes #30189 from stwhit/show_nulls.

Authored-by: Stuart White <stuart.white1@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-11-06 13:12:35 -08:00
Warren Zhu 93ad26be01 [SPARK-23432][UI] Add executor peak jvm memory metrics in executors page
### What changes were proposed in this pull request?
Add executor peak jvm memory metrics in executors page

![image](https://user-images.githubusercontent.com/1633312/97767765-9121bf00-1adb-11eb-93c7-7912d9fe7826.png)

### Why are the changes needed?
Users can know executor peak jvm metrics on in executors page

### Does this PR introduce _any_ user-facing change?
Users can know executor peak jvm metrics on in executors page

### How was this patch tested?
Manually tested

Closes #30186 from warrenzhu25/23432.

Authored-by: Warren Zhu <warren.zhu25@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-11-06 16:53:10 +09:00
Terry Kim 68c032c246 [SPARK-33364][SQL] Introduce the "purge" option in TableCatalog.dropTable for v2 catalog
### What changes were proposed in this pull request?

This PR proposes to introduce the `purge` option in `TableCatalog.dropTable` so that v2 catalogs can use the option if needed.

Related discussion: https://github.com/apache/spark/pull/30079#discussion_r510594110

### Why are the changes needed?

Spark DDL supports passing the purge option to `DROP TABLE` command. However, the option is not used (ignored) for v2 catalogs.

### Does this PR introduce _any_ user-facing change?

This PR introduces a new API in `TableCatalog`.

### How was this patch tested?

Added a test.

Closes #30267 from imback82/purge_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-05 22:00:45 -08:00
Prashant Sharma 733a468726 [SPARK-33130][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MsSqlServer dialect)
### What changes were proposed in this pull request?

Override the default SQL strings for:
ALTER TABLE RENAME COLUMN
ALTER TABLE UPDATE COLUMN NULLABILITY
in the following MsSQLServer JDBC dialect according to official documentation.
Write MsSqlServer integration tests for JDBC.

### Why are the changes needed?

To add the support for alter table when interacting with MSSql Server.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

added tests

Closes #30038 from ScrapCodes/mssql-dialect.

Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-06 05:46:38 +00:00
neko f6c0007970 [SPARK-33342][WEBUI] fix the wrong url and display name of blocking thread in threadDump page
### What changes were proposed in this pull request?
fix the wrong url and display name of blocking thread in threadDump page.
The blockingThreadId variable passed to the page should be of string type instead of Option type.

### Why are the changes needed?
blocking threadId in the ui page is not displayed well, and the corresponding  url cannot be redirected normally

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
The pr  only involves minor changes to the page and does not affect other functions,
The manual test results are as follows. The thread name displayed on the page is correct, and you can click on the URL to jump to the corresponding url

![shows_ok](https://user-images.githubusercontent.com/52202080/98108177-89488d00-1ed6-11eb-9488-8446c3f38bad.gif)

Closes #30249 from akiyamaneko/thread-dump-improve.

Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-11-06 13:45:02 +08:00
Wenchen Fan d16311051d [SPARK-32934][SQL][FOLLOW-UP] Refine class naming and code comments
### What changes were proposed in this pull request?

1. Rename `OffsetWindowSpec` to `OffsetWindowFunction`, as it's the base class for all offset based window functions.
2. Refine and add more comments.
3. Remove `isRelative` as it's useless.

### Why are the changes needed?

code refinement

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #30261 from cloud-fan/window.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-06 05:20:25 +00:00