27132 commits

Author SHA1 Message Date
Antonin Delpeuch 497024956a
[SPARK-31518][CORE] Expose filterByRange in JavaPairRDD
### What changes were proposed in this pull request?

This exposes the `filterByRange` method from `OrderedRDDFunctions` in the Java API (as a method of JavaPairRDD).

This is the only method of `OrderedRDDFunctions` which is not exposed in the Java API so far.
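
As a rough illustration of what the method does, the following plain-Java sketch filters key/value pairs by key range. It is a stand-in for the semantics only, not Spark code: the class `FilterByRangeSketch`, its `sample()` helper, and the list-based signature are hypothetical, and the bounds are assumed to be inclusive, following the Scala documentation of `OrderedRDDFunctions.filterByRange`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the semantics of filterByRange; this is not Spark code.
final class FilterByRangeSketch {
    // Keeps the pairs whose key k satisfies lower <= k <= upper (inclusive bounds assumed).
    static <K extends Comparable<K>, V> List<Map.Entry<K, V>> filterByRange(
            List<Map.Entry<K, V>> pairs, K lower, K upper) {
        List<Map.Entry<K, V>> kept = new ArrayList<>();
        for (Map.Entry<K, V> pair : pairs) {
            K key = pair.getKey();
            if (key.compareTo(lower) >= 0 && key.compareTo(upper) <= 0) {
                kept.add(pair);
            }
        }
        return kept;
    }

    // Hypothetical sample data used by main.
    static List<Map.Entry<Integer, String>> sample() {
        List<Map.Entry<Integer, String>> pairs = new ArrayList<>();
        pairs.add(Map.entry(1, "a"));
        pairs.add(Map.entry(5, "b"));
        pairs.add(Map.entry(9, "c"));
        return pairs;
    }

    public static void main(String[] args) {
        // Keys 5 and 9 fall inside [2, 9]; key 1 does not.
        System.out.println(filterByRange(sample(), 2, 9));
    }
}
```

In Spark itself the Scala method can also skip whole partitions when the RDD is range-partitioned; the sketch above ignores that optimization.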

### Why are the changes needed?

This improves the consistency between the Scala and Java APIs. Calling the Scala method manually from a Java context is cumbersome as it requires passing many ClassTags.

### Does this PR introduce any user-facing change?

Yes, a new method in the Java API.

### How was this patch tested?

With unit tests. The implementation of the Scala method is already tested independently and it was not touched in this PR.

Suggesting srowen as a reviewer.

Closes #28293 from wetneb/SPARK-31518.

Authored-by: Antonin Delpeuch <antonin@delpeuch.eu>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-22 20:04:17 -07:00
HyukjinKwon da3c6c4e35 [SPARK-31510][R][BUILD] Set setwd in R documentation build
It seems that in certain environments `setwd` must be set explicitly; otherwise the documentation build fails as below:

```
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd"))
Loading required package: usethis
Error: Could not find package root, is your working directory inside a package?
```

see also https://stackoverflow.com/questions/52670051/how-to-troubleshoot-error-could-not-find-package-root and https://groups.google.com/forum/#!topic/rdevtools/79jjjdc_wjg

The same failure can be reproduced another way. For example, if you set a specific directory in your `~/.Rprofile`, the R documentation build fails as below:

```
echo 'setwd("~")' > ~/.Rprofile
sh R/create-rd.sh
```

```
Using R_SCRIPT_PATH = /usr/local/bin
Loading required package: usethis
Error: Can't find './pkg'.
Execution halted
```

This PR proposes to call `setwd` explicitly so that the build is not affected by the global environment.

To make R dev env more independent.

No, dev only.

Manually tested:

```bash
echo 'setwd("~")' > ~/.Rprofile
sh R/create-rd.sh
```

Before:

```
Using R_SCRIPT_PATH = /usr/local/bin
Loading required package: usethis
Error: Can't find './pkg'.
Execution halted
```

After:

```
Using R_SCRIPT_PATH = /usr/local/bin
Loading required package: usethis
Updating SparkR documentation
Loading SparkR
Creating a new generic function for ‘as.data.frame’ in package ‘SparkR’
Creating a new generic function for ‘colnames’ in package ‘SparkR’
Creating a new generic function for ‘colnames<-’ in package ‘SparkR’
Creating a new generic function for ‘cov’ in package ‘SparkR’
Creating a new generic function for ‘drop’ in package ‘SparkR’
Creating a new generic function for ‘na.omit’ in package ‘SparkR’
Creating a new generic function for ‘filter’ in package ‘SparkR’
Creating a new generic function for ‘intersect’ in package ‘SparkR’
...
```

Closes #28285
2020-04-23 10:23:01 +09:00
Thomas Graves 95aec091e4 [SPARK-29641][PYTHON][CORE] Stage Level Sched: Add python api's and tests
### What changes were proposed in this pull request?

As part of the Stage level scheduling features, add the Python APIs to set resource profiles.
This also adds the functionality to properly apply the pyspark memory configuration when specified in the ResourceProfile. The pyspark memory configuration is passed in the task local properties. This was an easy way to get it to the PythonRunner that needs it. I modeled this on how the barrier task scheduling passes the addresses. As part of this I added the JavaRDD APIs because those are needed by Python.

### Why are the changes needed?

To provide the Python API for this feature.

### Does this PR introduce any user-facing change?

Yes, this adds the Java and Python APIs that let users specify a ResourceProfile in order to use stage level scheduling.

### How was this patch tested?

Unit tests and manual testing on YARN. Tests were also run to verify that it errors properly in standalone and local mode, where it's not yet supported.

Closes #28085 from tgravescs/SPARK-29641-pr-base.

Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-23 10:20:39 +09:00
Gabor Somogyi c619990c1d [SPARK-31272][SQL] Support DB2 Kerberos login in JDBC connector
### What changes were proposed in this pull request?
When loading DataFrames from JDBC datasource with Kerberos authentication, remote executors (yarn-client/cluster etc. modes) fail to establish a connection due to lack of Kerberos ticket or ability to generate it.

This is a real issue when trying to ingest data from kerberized data sources (SQL Server, Oracle) in enterprise environment where exposing simple authentication access is not an option due to IT policy issues.

In this PR I've added DB2 support (other supported databases will come in later PRs).

What this PR contains:
* Added `DB2ConnectionProvider`
* Added `DB2ConnectionProviderSuite`
* Added `DB2KrbIntegrationSuite` docker integration test
* Changed DB2 JDBC driver to use the latest (test scope only)
* Changed test table data type to a type which is supported by all the databases
* Removed double connection creation on test side
* Increased connection timeout in docker tests because the DB2 docker image takes quite a while to start

### Why are the changes needed?
Missing JDBC kerberos support.

### Does this PR introduce any user-facing change?
Yes, users are now able to connect to DB2 using Kerberos.

### How was this patch tested?
* Additional + existing unit tests
* Additional + existing integration tests
* Test on cluster manually

Closes #28215 from gaborgsomogyi/SPARK-31272.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@apache.org>
2020-04-22 17:10:30 -07:00
Cong Du 54b97b2e14 [MINOR][DOCS] Fix a typo in ContainerPlacementStrategy's class comment
### What changes were proposed in this pull request?
This PR fixes a typo in deploy/yarn/LocalityPreferredContainerPlacementStrategy.scala file.

### Why are the changes needed?
To deliver correct explanation about how the placement policy works.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
UT as specified, although shouldn't influence any functionality since it's in the comment.

Closes #28267 from asclepiusaka/master.

Authored-by: Cong Du <asclepius1993@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-04-22 09:44:43 -05:00
Nicholas Marcott 8b77b31835 [SPARK-18886][CORE][FOLLOWUP] allow follow up locality resets even if no task was launched
### What changes were proposed in this pull request?
Remove the requirement to launch a task in order to reset locality wait timer.

### Why are the changes needed?
Recently https://github.com/apache/spark/pull/27207 was merged, but contained a bug which leads to undesirable behavior.

The crux of the issue is that single resource offers couldn't reset the timer, if there had been a previous reject followed by an allResourceOffer with no available resources.
This led to a problem where once locality level reached ANY, single resource offers are all accepted, leaving allResourceOffers with no resources to utilize (hence no task being launched on an all resource offer -> no timer reset). The task manager would be stuck in ANY locality level.

Noting down here the downsides of using below reset conditions, in case we want to follow up.
As this is quite complex, I could easily be missing something, so please comment/respond if you have more bad behavior scenarios or find something wrong here:
The format is:

> **Reset condition**
>  - the unwanted side effect
>      - the cause/use case

Below references to locality increase/decrease mean:
```
PROCESS_LOCAL, NODE_LOCAL ... .. ANY
    ------ locality decrease --->
   <----- locality increase -----
```

**Task launch:**
- locality decrease:
   - Blacklisting, FAIR/FIFO scheduling, or task resource requirements can minimize tasks launched
- locality increase:
   - single task launch decreases locality despite many tasks remaining

**No delay schedule reject since last allFreeResource offer**
- locality decrease:
   - locality wait less than allFreeResource offer frequency, which occurs at least 1 per second
- locality increase:
   - single resource (or none) not rejected despite many tasks remaining (other lower priority tasks utilizing resources)

**Current impl - No delay schedule reject since last (allFreeResource offer + task launch)**
- locality decrease:
  - all from above
- locality increase:
   - single resource accepted and task launched despite many tasks remaining

The current impl is an improvement on the legacy (task launch) in that the unintended locality decrease case is similar and the unintended locality increase case only occurs when the cluster is fully utilized.

For the locality increase cases, perhaps a config which specifies a certain % of tasks in a taskset to finish before resetting locality levels would be helpful.

**If** that was considered a good approach then perhaps removing the task launch as a requirement would eliminate most of the downsides listed above.
Let me know if you have more ideas for eliminating the locality increase downside of **No delay schedule reject since last allFreeResource offer**

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
TaskSchedulerImplSuite

Also manually tested similar to how I tested in https://github.com/apache/spark/pull/27207 using [this simple app](https://github.com/bmarcott/spark-test-apps/blob/master/src/main/scala/TestLocalityWait.scala).

With the new changes, given locality wait of 10s the behavior is generally:
10 seconds of locality being respected, followed by a single full utilization of resources using ANY locality level, followed by 10 seconds of locality being respected, and so on

If the legacy flag is enabled (spark.locality.wait.legacyResetOnTaskLaunch=true), the behavior is only scheduling PROCESS_LOCAL tasks (only utilizing a single executor)

cloud-fan
tgravescs

Closes #28188 from bmarcott/nmarcott-locality-fix.

Authored-by: Nicholas Marcott <481161+bmarcott@users.noreply.github.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-04-22 08:25:24 -05:00
yi.wu 8fbfdb38c0 [SPARK-31495][SQL] Support formatted explain for AQE
### What changes were proposed in this pull request?

To support formatted explain for AQE.

### Why are the changes needed?

AQE does not support formatted explain yet. It's good to support it for better user experience, debugging, etc.

Before:
```
== Physical Plan ==
AdaptiveSparkPlan (1)
+- * HashAggregate (unknown)
   +- CustomShuffleReader (unknown)
      +- ShuffleQueryStage (unknown)
         +- Exchange (unknown)
            +- * HashAggregate (unknown)
               +- * Project (unknown)
                  +- * BroadcastHashJoin Inner BuildRight (unknown)
                     :- * LocalTableScan (unknown)
                     +- BroadcastQueryStage (unknown)
                        +- BroadcastExchange (unknown)
                           +- LocalTableScan (unknown)

(1) AdaptiveSparkPlan
Output [4]: [k#7, count(v1)#32L, sum(v1)#33L, avg(v2)#34]
Arguments: HashAggregate(keys=[k#7], functions=[count(1), sum(cast(v1#8 as bigint)), avg(cast(v2#19 as bigint))]), AdaptiveExecutionContext(org.apache.spark.sql.SparkSession104ab57b), [PlanAdaptiveSubqueries(Map())], false
```

After:
```
== Physical Plan ==
 AdaptiveSparkPlan (14)
 +- * HashAggregate (13)
    +- CustomShuffleReader (12)
       +- ShuffleQueryStage (11)
          +- Exchange (10)
             +- * HashAggregate (9)
                +- * Project (8)
                   +- * BroadcastHashJoin Inner BuildRight (7)
                      :- * Project (2)
                      :  +- * LocalTableScan (1)
                      +- BroadcastQueryStage (6)
                         +- BroadcastExchange (5)
                            +- * Project (4)
                               +- * LocalTableScan (3)

 (1) LocalTableScan [codegen id : 2]
 Output [2]: [_1#x, _2#x]
 Arguments: [_1#x, _2#x]

 (2) Project [codegen id : 2]
 Output [2]: [_1#x AS k#x, _2#x AS v1#x]
 Input [2]: [_1#x, _2#x]

 (3) LocalTableScan [codegen id : 1]
 Output [2]: [_1#x, _2#x]
 Arguments: [_1#x, _2#x]

 (4) Project [codegen id : 1]
 Output [2]: [_1#x AS k#x, _2#x AS v2#x]
 Input [2]: [_1#x, _2#x]

 (5) BroadcastExchange
 Input [2]: [k#x, v2#x]
 Arguments: HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#x]

 (6) BroadcastQueryStage
 Output [2]: [k#x, v2#x]
 Arguments: 0

 (7) BroadcastHashJoin [codegen id : 2]
 Left keys [1]: [k#x]
 Right keys [1]: [k#x]
 Join condition: None

 (8) Project [codegen id : 2]
 Output [3]: [k#x, v1#x, v2#x]
 Input [4]: [k#x, v1#x, k#x, v2#x]

 (9) HashAggregate [codegen id : 2]
 Input [3]: [k#x, v1#x, v2#x]
 Keys [1]: [k#x]
 Functions [3]: [partial_count(1), partial_sum(cast(v1#x as bigint)), partial_avg(cast(v2#x as bigint))]
 Aggregate Attributes [4]: [count#xL, sum#xL, sum#x, count#xL]
 Results [5]: [k#x, count#xL, sum#xL, sum#x, count#xL]

 (10) Exchange
 Input [5]: [k#x, count#xL, sum#xL, sum#x, count#xL]
 Arguments: hashpartitioning(k#x, 5), true, [id=#x]

 (11) ShuffleQueryStage
 Output [5]: [sum#xL, k#x, sum#x, count#xL, count#xL]
 Arguments: 1

 (12) CustomShuffleReader
 Input [5]: [k#x, count#xL, sum#xL, sum#x, count#xL]
 Arguments: coalesced

 (13) HashAggregate [codegen id : 3]
 Input [5]: [k#x, count#xL, sum#xL, sum#x, count#xL]
 Keys [1]: [k#x]
 Functions [3]: [count(1), sum(cast(v1#x as bigint)), avg(cast(v2#x as bigint))]
 Aggregate Attributes [3]: [count(1)#xL, sum(cast(v1#x as bigint))#xL, avg(cast(v2#x as bigint))#x]
 Results [4]: [k#x, count(1)#xL AS count(v1)#xL, sum(cast(v1#x as bigint))#xL AS sum(v1)#xL, avg(cast(v2#x as bigint))#x AS avg(v2)#x]

 (14) AdaptiveSparkPlan
 Output [4]: [k#x, count(v1)#xL, sum(v1)#xL, avg(v2)#x]
 Arguments: isFinalPlan=true
```

### Does this PR introduce any user-facing change?

No, this is a new feature that ships along with AQE in Spark 3.0.

### How was this patch tested?

Added a query file: `explain-aqe.sql` and a unit test.

Closes #28271 from Ngone51/support_formatted_explain_for_aqe.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-22 12:44:06 +00:00
Liang-Chi Hsieh 1d30884963 [SPARK-31484][CORE][FLOLLOWUP] Use taskAttemptId in checkpoint filename
### What changes were proposed in this pull request?

As suggested by https://github.com/apache/spark/pull/28255#discussion_r412619438, this patch proposes to use taskAttemptId in checkpoint filename, instead of stageAttemptNumber + attemptNumber.

### Why are the changes needed?

To make the checkpoint filename simpler and unique.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #28289 from viirya/SPARK-31484-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-22 21:23:48 +09:00
Kent Yao 37d2e037ed [SPARK-31507][SQL] Remove uncommon fields support and update some fields with meaningful names for extract function
### What changes were proposed in this pull request?

Extracting millennium, century, decade, millisecond, microsecond and epoch from datetime is neither ANSI standard nor common in modern SQL platforms. Most of the systems listed below do not support these; PostgreSQL and Redshift are the exceptions.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm

https://prestodb.io/docs/current/functions/datetime.html

https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html

https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts

https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT

This PR removes support for these fields from the extract function for date and timestamp values.

`isoyear` is PostgreSQL specific but `yearofweek` is more commonly used across platforms.
`isodow` is PostgreSQL specific but `iso` as a suffix is more commonly used across platforms, so `dow_iso` and `dayofweek_iso` are used to replace it.

For historical reasons, we have [`dayofweek`, `dow`] implemented for representing a non-ISO day-of-week and a newly added `isodow` from PostgreSQL for ISO day-of-week. Many other systems support only one week-numbering system and use either full names or abbreviations. Things in Spark become a little bit complicated:
1. Because `isodow` exists, we would need to add an `iso` prefix to `dayofweek` to make a pair for it too: [`dayofweek`, `isodayofweek`, `dow` and `isodow`].
2. Because few systems use an `iso` prefix and more systems choose an `iso` suffix, we might end up with [`dayofweek`, `dayofweekiso`, `dow`, `dowiso`].
3. `dayofweekiso` looks nice and has use cases in the platforms listed above, e.g. Snowflake, but `dowiso` looks weird and no use cases were found.
4. After a discussion with the community, we agreed that an underscore before `iso` looks much better. Because `isodow` is new and there is no standard for `iso`-style names, this keeps things simple and clear for end users, provided the names are well documented.

Thus, we finally end up with [`dayofweek`, `dow`] for the non-ISO day-of-week system and [`dayofweek_iso`, `dow_iso`] for the ISO system.

### Why are the changes needed?

Remove some nonstandard and uncommon features as we can add them back if necessary

### Does this PR introduce any user-facing change?

No, we should target this to 3.0.0, and these fields were added during 3.0.0.

### How was this patch tested?

Remove unused tests

Closes #28284 from yaooqinn/SPARK-31507.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-22 10:24:49 +00:00
Kent Yao 2c2062ea7c [SPARK-31498][SQL][DOCS] Dump public static sql configurations through doc generation
### What changes were proposed in this pull request?

Currently, only the non-static public SQL configurations are dumped to the public doc. We should also add the static public ones, as the command `set -v` does.

This PR forces a call to StaticSQLConf to buildStaticConf.

### Why are the changes needed?

Fix missing SQL configurations in doc

### Does this PR introduce any user-facing change?

NO

### How was this patch tested?

Added a unit test and verified locally that the public static SQL confs are in `docs/sql-config.html`.

Closes #28274 from yaooqinn/SPARK-31498.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-22 10:16:39 +00:00
herman cf6038499d
[SPARK-31511][SQL] Make BytesToBytesMap iterators thread-safe
### What changes were proposed in this pull request?
This PR increases the thread safety of the `BytesToBytesMap`:
- It makes the `iterator()` and `destructiveIterator()` methods use their own `Location` object. This used to be shared, which caused issues when the map was being iterated over in two threads by two different iterators.
- Removes the `safeIterator()` function. This is not needed anymore.
- Improves the documentation of a couple of methods w.r.t. thread-safety.

### Why are the changes needed?
It is unexpected that an iterator shares the object it is returning with all other iterators. This is a violation of the iterator contract, and it causes issues with iterators over a map that are consumed in different threads.
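
The interference can be shown without any threads. The sketch below is a plain-Java illustration under stated assumptions: `SharedLocationSketch` and its nested `Location` are hypothetical stand-ins, not Spark's `BytesToBytesMap`. Two iterators are interleaved on one thread; with a shared `Location`, advancing the second iterator silently changes the object the first one already handed out.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Plain-Java illustration only; Location is a hypothetical stand-in, not Spark's class.
final class SharedLocationSketch {
    // Mutable cursor object that iterators hand back to their callers.
    static final class Location {
        int index;
    }

    private final Location shared = new Location();
    private final int size;

    SharedLocationSketch(int size) {
        this.size = size;
    }

    // Every iterator writes into, and returns, the same Location instance.
    Iterator<Location> sharingIterator() {
        return iteratorOver(shared);
    }

    // Each iterator gets a private Location, so iterators cannot interfere.
    Iterator<Location> isolatedIterator() {
        return iteratorOver(new Location());
    }

    private Iterator<Location> iteratorOver(final Location loc) {
        return new Iterator<Location>() {
            private int position = 0;

            @Override
            public boolean hasNext() {
                return position < size;
            }

            @Override
            public Location next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                loc.index = position++;
                return loc;
            }
        };
    }

    // Reads one element from the first iterator, advances the second twice,
    // then reports what the first element looks like afterwards.
    static int observe(boolean share) {
        SharedLocationSketch map = new SharedLocationSketch(3);
        Iterator<Location> first = share ? map.sharingIterator() : map.isolatedIterator();
        Iterator<Location> second = share ? map.sharingIterator() : map.isolatedIterator();
        Location seen = first.next(); // index 0 at this point
        second.next();
        second.next();                // overwrites seen.index when the Location is shared
        return seen.index;
    }

    public static void main(String[] args) {
        System.out.println("shared:   " + observe(true));
        System.out.println("isolated: " + observe(false));
    }
}
```

With two threads the same overwrite happens at unpredictable times, which is why giving each iterator its own object removes the problem.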

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests.

Closes #28286 from hvanhovell/SPARK-31511.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-21 18:17:19 -07:00
Huang-Yi a5c16cbf05
[SPARK-31233][CORE] Enhance RpcTimeoutException Log Message
### What changes were proposed in this pull request?

The `askAbortable` method throws a TimeoutException when it does not complete in time. Currently, the error message contains null as remoteAddr when the receiver is in client mode.
This change prints the correct rpcAddress instead of null in the error message.

### Why are the changes needed?

It provides the address of an endpoint that does not reply in time. It helps users find slow executors when a timeout happens.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Add a unit test.

Closes #28002 from Huang-yi-3456/SPARK-31233-enhance-rpctimeoutexception-log.

Authored-by: Huang-Yi <huang.yi.3456@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-21 14:08:37 -07:00
Wenchen Fan b209b5f406
[SPARK-31503][SQL] fix the SQL string of the TRIM functions
### What changes were proposed in this pull request?

Override the `sql` method of `StringTrim`, `StringTrimLeft` and `StringTrimRight` to use the standard SQL syntax.

### Why are the changes needed?

The current implementation is wrong. It gives you a SQL string that returns a different result.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

new tests

Closes #28281 from cloud-fan/sql.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-21 11:22:18 -07:00
Wenchen Fan a5ebbacf53 [SPARK-31361][SQL] Rebase datetime in parquet/avro according to file metadata
### What changes were proposed in this pull request?

This PR adds a new parquet/avro file metadata key: `org.apache.spark.legacyDatetime`. It indicates that the file was written with the "rebaseInWrite" config enabled, and that Spark needs to rebase when reading it.

This allows Spark to decide more intelligently whether to rebase:
1. If we don't know which Spark version writes the file, do rebase if the "rebaseInRead" config is true.
2. If the file was written by Spark 2.4 and earlier, then do rebase.
3. If the file was written by Spark 3.0 and later, do rebase if the `org.apache.spark.legacyDatetime` exists in file metadata.
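
The three rules above can be read as a small decision function. The following is a minimal sketch in plain Java, not the code from this PR: the class and method names, the `null` convention for an unknown writer version, and the version-string parsing are all assumptions made for illustration.

```java
// Illustrative decision function for the three rules; names and parsing are hypothetical.
final class RebaseDecisionSketch {
    /**
     * @param writerVersion     Spark version that wrote the file, e.g. "2.4.5", or null if unknown
     * @param hasLegacyDatetime whether the file metadata contains org.apache.spark.legacyDatetime
     * @param rebaseInRead      value of the "rebaseInRead" config
     */
    static boolean shouldRebase(String writerVersion, boolean hasLegacyDatetime, boolean rebaseInRead) {
        if (writerVersion == null) {
            // Rule 1: unknown writer, fall back to the read-side config.
            return rebaseInRead;
        }
        int major = Integer.parseInt(writerVersion.split("\\.")[0]);
        if (major < 3) {
            // Rule 2: Spark 2.4 and earlier always need a rebase.
            return true;
        }
        // Rule 3: Spark 3.0 and later rebase only if the file is marked as legacy.
        return hasLegacyDatetime;
    }

    public static void main(String[] args) {
        System.out.println(shouldRebase(null, false, true));
        System.out.println(shouldRebase("2.4.5", false, false));
        System.out.println(shouldRebase("3.0.0", false, true));
        System.out.println(shouldRebase("3.0.0", true, false));
    }
}
```

Note that under these rules the read-side config only matters for files whose writer version is unknown.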

### Why are the changes needed?

It's very easy to end up with mixed-calendar parquet/avro files: e.g. a user upgrades to Spark 3.0 and writes some parquet files to an existing directory. Then they realize that the directory contains legacy datetime values before 1582. However, it's too late, and they have to find all the legacy files manually and read them separately.

To support mixed-calendar parquet/avro files, we need to decide to rebase or not based on the file metadata.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Updated test

Closes #28137 from cloud-fan/datetime.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-22 00:26:23 +09:00
yi.wu 55b026a783 [SPARK-31504][SQL] Formatted Explain should have determined order of Output fields
### What changes were proposed in this pull request?

In `verboseStringWithOperatorId`, use `output` (a `Seq[Attribute]`) instead of `producedAttributes` (an `AttributeSet`) to generate `"Output"` for the leaf node, in order to make `"Output"` deterministic.

### Why are the changes needed?

Currently, Formatted Explain uses `producedAttributes`, an `AttributeSet`, to generate `"Output"`. As a result, the order of fields within `"Output"` can differ from run to run. That means the same plan could have different explain outputs.

### Does this PR introduce any user-facing change?

Yes, users now see a deterministic field order within formatted explain.

### How was this patch tested?

Added a regression test.

Closes #28282 from Ngone51/fix_output.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-21 12:33:58 +00:00
Kent Yao 1985437110 [SPARK-31474][SQL] Consistency between dayofweek/dow in extract exprsession and dayofweek function
### What changes were proposed in this pull request?
```sql
spark-sql> SELECT extract(dayofweek from '2009-07-26');
1
spark-sql> SELECT extract(dow from '2009-07-26');
0
spark-sql> SELECT extract(isodow from '2009-07-26');
7
spark-sql> SELECT dayofweek('2009-07-26');
1
spark-sql> SELECT weekday('2009-07-26');
6
```
Currently, there are 4 types of day-of-week range:
1. the function `dayofweek`(2.3.0) and extracting `dayofweek`(2.4.0) result as of Sunday(1) to Saturday(7)
2. extracting `dow`(3.0.0) results as of Sunday(0) to Saturday(6)
3. extracting `isodow` (3.0.0) results as of Monday(1) to Sunday(7)
4. the function `weekday`(2.4.0) results as of Monday(0) to Sunday(6)
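
The four ranges listed above (the behavior before this PR) differ only by simple arithmetic on the ISO day number. The sketch below uses `java.time` to reproduce the values shown for `2009-07-26`, a Sunday. The class `DayOfWeekRanges` and its helper methods are hypothetical names that mirror the SQL field names; they are not Spark code.

```java
import java.time.LocalDate;

// Hypothetical helpers mirroring the four numbering schemes; not Spark code.
final class DayOfWeekRanges {
    // ISO numbering from java.time: Monday = 1 ... Sunday = 7.
    static int isodow(LocalDate date) {
        return date.getDayOfWeek().getValue();
    }

    // Sunday = 1 ... Saturday = 7.
    static int dayofweek(LocalDate date) {
        return isodow(date) % 7 + 1;
    }

    // Sunday = 0 ... Saturday = 6.
    static int dow(LocalDate date) {
        return isodow(date) % 7;
    }

    // Monday = 0 ... Sunday = 6.
    static int weekday(LocalDate date) {
        return isodow(date) - 1;
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2009, 7, 26);
        System.out.println("dayofweek = " + dayofweek(date));
        System.out.println("dow       = " + dow(date));
        System.out.println("isodow    = " + isodow(date));
        System.out.println("weekday   = " + weekday(date));
    }
}
```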

Actually, extracting `dayofweek` and `dow` are both derived from PostgreSQL but have different meanings.
https://issues.apache.org/jira/browse/SPARK-23903
https://issues.apache.org/jira/browse/SPARK-28623

In this PR, we make extracting `dow` the same as extracting `dayofweek` and the `dayofweek` function, for historical reasons and to avoid breaking anything.

Also, add more documentation to the extract function to make the extract fields easier to understand.

### Why are the changes needed?

To ensure consistency.

### Does this PR introduce any user-facing change?

Yes, the doc is updated and extracting `dow` now gives the same result as `dayofweek`.

### How was this patch tested?

1. modified ut
2. local SQL doc verification
#### before
![image](https://user-images.githubusercontent.com/8326978/79601949-3535b100-811c-11ea-957b-a33d68641181.png)

#### after
![image](https://user-images.githubusercontent.com/8326978/79601847-12a39800-811c-11ea-8ff6-aa329255d099.png)

Closes #28248 from yaooqinn/SPARK-31474.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-21 11:55:33 +00:00
yi.wu 7103f19fea [SPARK-31472][CORE] Make sure Barrier Task always return messages or exception with abortableRpcFuture check
### What changes were proposed in this pull request?

Rewrite the periodic check logic of `abortableRpcFuture` to make sure that a barrier task always returns either the desired messages or the expected exception.

This PR also simplifies the code around `AbortableRpcFuture` a bit.

### Why are the changes needed?

Currently, the periodic check logic of `abortableRpcFuture` is implemented as follows:

```scala
...
var messages: Array[String] = null

while (!abortableRpcFuture.toFuture.isCompleted) {
   messages = ThreadUtils.awaitResult(abortableRpcFuture.toFuture, 1.second)
   ...
}
return messages
```
It's possible that `abortableRpcFuture` completes before the next invocation of `messages = ...`. In this case, the task may return null messages or execute successfully when it should throw an exception (e.g. `SparkException` from `BarrierCoordinator`).
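
The race can be reproduced deterministically in plain Java with a future that is already complete when the loop is first evaluated. This is a sketch using `CompletableFuture` as a stand-in for the RPC future; `AwaitLoopSketch`, `racyAwait` and `safeAwait` are hypothetical names, and `safeAwait` is only one possible repair, not necessarily the one this PR applies.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// CompletableFuture stands in for the RPC future; all names here are hypothetical.
final class AwaitLoopSketch {
    // Waits up to one second; returns null if the future is not finished yet.
    private static String[] awaitOnce(CompletableFuture<String[]> future) {
        try {
            return future.get(1, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    // Racy pattern: the result is only ever read inside the loop body, so a future
    // that completes outside the timed wait leaves messages as null.
    static String[] racyAwait(CompletableFuture<String[]> future) {
        String[] messages = null;
        while (!future.isDone()) {
            messages = awaitOnce(future);
        }
        return messages;
    }

    // One possible repair: use the loop only for waiting, then always read the
    // completed future, which yields the value or rethrows the failure.
    static String[] safeAwait(CompletableFuture<String[]> future) {
        while (!future.isDone()) {
            awaitOnce(future);
        }
        return future.join();
    }

    public static void main(String[] args) {
        CompletableFuture<String[]> done =
                CompletableFuture.completedFuture(new String[] {"msg"});
        System.out.println("racy returned null: " + (racyAwait(done) == null));
        System.out.println("safe returned:      " + safeAwait(done)[0]);
    }
}
```

The same reasoning applies to a future that fails between two checks: the racy loop exits quietly, while reading the completed future surfaces the exception.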

And here's a flaky test caused by this bug:

```
[info] BarrierTaskContextSuite:
[info] - share messages with allGather() call *** FAILED *** (18 seconds, 705 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(0, 2) finished unsuccessfully.
[info] java.lang.NullPointerException
[info] 	at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:204)
[info] 	at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:204)
[info] 	at scala.collection.IndexedSeqOptimized.toList(IndexedSeqOptimized.scala:285)
[info] 	at scala.collection.IndexedSeqOptimized.toList$(IndexedSeqOptimized.scala:284)
[info] 	at scala.collection.mutable.ArrayOps$ofRef.toList(ArrayOps.scala:198)
[info] 	at org.apache.spark.scheduler.BarrierTaskContextSuite.$anonfun$new$4(BarrierTaskContextSuite.scala:68)
...
```

The test exception can be reproduced by changing the line `messages = ...` to the following:

```scala
messages = ThreadUtils.awaitResult(abortableRpcFuture.toFuture, 10.micros)
Thread.sleep(5000)
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested and updated some unit tests.

Closes #28245 from Ngone51/fix_barrier.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-21 10:12:56 +00:00
zhengruifeng e7bc6f38b9 [SPARK-31494][ML] flatten the result dataframe of ANOVATest
### What changes were proposed in this pull request?
add a new method `def test(dataset: DataFrame, featuresCol: String, labelCol: String, flatten: Boolean): DataFrame`

### Why are the changes needed?
Similar to new `test` method in `ChiSquareTest`, it will:
1, support df operation on the returned df;
2, make driver no longer a bottleneck with large numFeatures

### Does this PR introduce any user-facing change?
Yes, new method added

### How was this patch tested?
existing testsuites

Closes #28270 from zhengruifeng/flatten_anova.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-04-21 12:43:14 +08:00
Onur Satici ad965103a5
[SPARK-30949][K8S][CORE] Decouple requests and parallelism on drivers in K8s
### What changes were proposed in this pull request?
The `spark.driver.cores` configuration is used to set the amount of parallelism in kubernetes cluster mode drivers. Previously the amount of parallelism in the drivers was the number of cores in the host when running on JDK 8u120 or older, or the maximum of the driver container's resource requests and limits when running on [JDK 8u121 or newer](https://bugs.openjdk.java.net/browse/JDK-8173345). This will enable users to specify `spark.driver.cores` to set parallelism, and specify `spark.kubernetes.driver.request.cores` to limit the resource requests of the driver container, effectively decoupling the two.

### Why are the changes needed?
Drivers submitted in kubernetes cluster mode set the parallelism of various components like `RpcEnv`, `MemoryManager`, `BlockManager` by inferring the number of available cores from `Runtime.getRuntime().availableProcessors()`. As a result, spark applications running on JDK 8u120 or older incorrectly get the total number of cores in the host, [ignoring the cgroup limits set by kubernetes](https://bugs.openjdk.java.net/browse/JDK-6515172). JDK 8u121 and newer runtimes do not have this problem.

Orthogonal to this, it is currently not possible to decouple resource limits on the driver container from the amount of parallelism of the various network and memory components listed above.

### Does this PR introduce any user-facing change?
Yes. Previously the amount of parallelism in drivers submitted in kubernetes cluster mode was the number of cores in the host when running on JDK 8u120 or older, or the maximum of the driver container's resource requests and limits when running on JDK 8u121 or newer. Now the value of `spark.driver.cores` is used.

### How was this patch tested?
happy to add tests if my proposal looks reasonable

Closes #27695 from onursatici/os/decouple-requests-and-parallelism.

Authored-by: Onur Satici <onursatici@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-20 21:32:43 -07:00
Maryann Xue ae29cf24fc [SPARK-31501][SQL] AQE update UI should not cause deadlock
### What changes were proposed in this pull request?

This PR makes sure that AQE does not call update UI if the current execution ID does not match the current query. This PR also includes a minor refactoring that moves `getOrCloneSessionWithAqeOff` from `QueryExecution` to `AdaptiveSparkPlanHelper` since that function is not used by `QueryExecution` any more.

### Why are the changes needed?

Without this fix, there could be a potential deadlock.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added UT.

Closes #28275 from maryannxue/aqe-ui-deadlock.

Authored-by: Maryann Xue <maryann.xue@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-21 03:56:42 +00:00
zhengruifeng 32259c9733 [SPARK-31492][ML] flatten the result dataframe of FValueTest
### What changes were proposed in this pull request?
add a new method  `def test(dataset: DataFrame, featuresCol: String, labelCol: String, flatten: Boolean): DataFrame`

### Why are the changes needed?

Similar to the new test method in ChiSquareTest, it will:
1. support DataFrame operations on the returned DataFrame;
2. make the driver no longer a bottleneck with large `numFeatures`.

### Does this PR introduce any user-facing change?
Yes, add a new method

### How was this patch tested?
existing testsuites

Closes #28268 from zhengruifeng/flatten_fvalue.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-04-21 11:09:05 +08:00
Takeshi Yamamuro e42dbe7cd4 [SPARK-31429][SQL][DOC] Automatically generates a SQL document for built-in functions
### What changes were proposed in this pull request?

This PR intends to add a Python script to generate a SQL document for built-in functions, and to add the generated document to the SQL references.

### Why are the changes needed?

To make SQL references complete.

### Does this PR introduce any user-facing change?

Yes;

![a](https://user-images.githubusercontent.com/692303/79406712-c39e1b80-7fd2-11ea-8b85-9f9cbb6efed3.png)
![b](https://user-images.githubusercontent.com/692303/79320526-eb46a280-7f44-11ea-8639-90b1fb2b8848.png)
![c](https://user-images.githubusercontent.com/692303/79320707-3365c500-7f45-11ea-9984-69ffe800fb87.png)

### How was this patch tested?

Manually checked and added tests.

Closes #28224 from maropu/SPARK-31429.

Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-21 10:55:13 +09:00
rishi 4f8b03d336
[SPARK-31389][SQL][TESTS] Add codegen-on test coverage for some tests in SQLMetricsSuite
### What changes were proposed in this pull request?
Adding missing unit tests in SQLMetricsSuite to cover the code-generated path.
**Additional tests were added in the following unit tests.**
Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, BroadcastHashJoin metrics,  ShuffledHashJoin metrics, BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, BroadcastLeftSemiJoinHash metrics, CartesianProduct metrics,  SortMergeJoin(left-anti) metrics

### Why are the changes needed?
The existing tests in SQLMetricsSuite only cover the interpreted path.
The tests need to cover the code-generated path as well, since that path is often used in production.

The PR doesn't change test("Aggregate metrics") and test("ObjectHashAggregate metrics"). The test("Aggregate metrics") tests metrics when a HashAggregate is used. Enabling codegen forces the test to use ObjectHashAggregate rather than the regular HashAggregate. ObjectHashAggregate has a test of its own. Therefore, I feel that enabling codegen for these two tests is not necessary.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
I added debug statements in the code to make sure both Code generated and Interpreted paths are being exercised.
I further used Intellij debugger to ensure that the newly added unit tests are in fact exercising both code generated and interpreted paths.

Closes #28173 from sririshindra/SPARK-31389.

Authored-by: rishi <spothireddi@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-20 14:41:45 -07:00
Wenchen Fan 69f9ee18b6
[SPARK-31452][SQL] Do not create partition spec for 0-size partitions in AQE
### What changes were proposed in this pull request?

This PR skips creating the partition specs in `ShufflePartitionsUtil` for 0-size partitions, which avoids launching unnecessary tasks that do nothing.
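
The idea can be sketched in Python as follows. This is a simplified model and not the code in `ShufflePartitionsUtil`: the function name `coalesce_partition_specs`, the `(start, end)` tuple representation (end exclusive), and the single-shuffle input are assumptions for illustration.

```python
def coalesce_partition_specs(sizes, target_size):
    """Group consecutive partitions into (start, end) ranges, end exclusive.

    A range is only emitted when it holds data, so runs of 0-size
    partitions never produce a spec (and hence never produce a task).
    """
    specs = []
    start = 0       # first partition index of the range being built
    coalesced = 0   # bytes accumulated in the range being built
    for i, size in enumerate(sizes):
        if i > start and coalesced + size > target_size:
            # Adding this partition would exceed the target: close the range.
            if coalesced > 0:
                specs.append((start, i))
            start = i
            coalesced = size
        else:
            coalesced += size
    # Close the trailing range, again skipping it if it is empty.
    if coalesced > 0:
        specs.append((start, len(sizes)))
    return specs
```

For example, `coalesce_partition_specs([0, 0, 20, 0], 15)` yields a single spec covering only the partition that has data, and an all-empty input yields no specs at all.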

### Why are the changes needed?

launching tasks that do nothing is a waste.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

updated tests

Closes #28226 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-20 13:50:07 -07:00
Yuming Wang b11e42663b
[SPARK-31381][SPARK-29245][SQL] Upgrade built-in Hive 2.3.6 to 2.3.7
### What changes were proposed in this pull request?

**Hive 2.3.7** fixed these issues:
- HIVE-21508: ClassCastException when initializing HiveMetaStoreClient on JDK10 or newer
- HIVE-21980: Parsing time can be high in case of deeply nested subqueries
- HIVE-22249: Support Parquet through HCatalog

### Why are the changes needed?
Fix the ClassCastException (CCE) thrown when creating HiveMetaStoreClient in a JDK 11 environment: [SPARK-29245](https://issues.apache.org/jira/browse/SPARK-29245).

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

- [x] Test Jenkins with Hadoop 2.7 (https://github.com/apache/spark/pull/28148#issuecomment-616757840)
- [x] Test Jenkins with Hadoop 3.2 on JDK11 (https://github.com/apache/spark/pull/28148#issuecomment-616294353)
- [x] Manual test with remote hive metastore.

Hive side:

```
export JAVA_HOME=/usr/lib/jdk1.8.0_221
export PATH=$JAVA_HOME/bin:$PATH
cd /usr/lib/hive-2.3.6 # Start Hive metastore with Hive 2.3.6
bin/schematool -dbType derby -initSchema --verbose
bin/hive --service metastore
```

Spark side:

```
export JAVA_HOME=/usr/lib/jdk-11.0.3
export PATH=$JAVA_HOME/bin:$PATH
build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver
export SPARK_PREPEND_CLASSES=true
bin/spark-sql --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083
```

Closes #28148 from wangyum/SPARK-31381.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-20 13:38:24 -07:00
gatorsmile 6c792a79c1 [SPARK-31234][SQL][FOLLOW-UP] ResetCommand should not affect static SQL Configuration
### What changes were proposed in this pull request?
This PR is the follow-up PR of https://github.com/apache/spark/pull/28003

- add a migration guide
- add an end-to-end test case.

### Why are the changes needed?
The original PR made the major behavior change in the user-facing RESET command.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a new end-to-end test

Closes #28265 from gatorsmile/spark-31234followup.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-04-20 13:08:55 -07:00
Maryann Xue 44d370dd45 [SPARK-31475][SQL] Broadcast stage in AQE did not timeout
### What changes were proposed in this pull request?

This PR adds a timeout for the Future of a BroadcastQueryStageExec to make sure it can have the same timeout behavior as a non-AQE broadcast exchange.
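
The general pattern of bounding the wait on a future can be sketched in Python with `concurrent.futures`. This only illustrates the technique; it is not how `BroadcastQueryStageExec` is implemented, and the helper name `run_with_timeout` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError


def run_with_timeout(task, timeout_seconds):
    """Run `task` on a worker thread and wait at most `timeout_seconds`.

    Raises concurrent.futures.TimeoutError if the result is not ready in time.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(task)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeoutError:
        # cancel() only prevents a not-yet-started task from running;
        # a task that is already running is not interrupted.
        future.cancel()
        raise
    finally:
        # Do not block the caller while the worker thread finishes.
        executor.shutdown(wait=False)
```

Without the `timeout` argument, the caller would wait indefinitely on a stuck task, which is the inconsistency with the non-AQE path that this PR addresses.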

### Why are the changes needed?

This is to make the broadcast timeout behavior in AQE consistent with that in non-AQE.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added UT.

Closes #28250 from maryannxue/aqe-broadcast-timeout.

Authored-by: Maryann Xue <maryann.xue@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-04-20 11:55:48 -07:00
Max Gekk f1fde0cc22 [SPARK-31490][SQL][TESTS] Benchmark conversions to/from Java 8 datetime types
### What changes were proposed in this pull request?
- Add benchmark cases for **parallelizing** `java.time.LocalDate` and `java.time.Instant` column values.
- Add benchmark cases for **collecting** `java.time.LocalDate` and `java.time.Instant` column values.

### Why are the changes needed?
- To detect perf regression in the future
- To compare parallelization/collection of Java 8 date-time types with Java 7 date-time types `java.sql.Date` & `java.sql.Timestamp`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running the modified benchmarks in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

Closes #28263 from MaxGekk/java8-datetime-collect-benchmark.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-20 07:26:38 +00:00
Max Gekk 88d39e5a89 [SPARK-31385][SQL] Restrict micros rebasing via switch arrays up to 2037 year
### What changes were proposed in this pull request?
1. Generate rebasing arrays for micros up to 2037 in `RebaseDateTimeSuite.generateRebaseJson()`.
2. Exclude 4 time zones from the black list in `generateRebaseJson()`.
3. Re-generate JSON files with rebasing info - `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`.

### Why are the changes needed?
1. `sun.util.calendar.ZoneInfo` resolves DST incorrectly after the year 2037. See aa318070b2/jdk/src/share/classes/sun/util/calendar/ZoneInfo.java (L55-L62). By restricting the rebase arrays to the year 2037, we follow the behaviour of `ZoneInfo`, which uses the DST of 2037 for all years beyond 2037.
2. To enable optimization of micros rebasing via switch arrays for the time zones:
    - Asia/Tehran
    - Iran
    - Africa/Casablanca
    - Africa/El_Aaiun

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites `RebaseDateTimeUtils`, `DateTimeUtilsSuite` and `DateFunctionsSuite`.

Closes #28253 from MaxGekk/fix-4-time-zones-rebasing.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-20 06:35:16 +00:00
Terry Kim d7499aed9c [SPARK-31256][SQL] DataFrameNaFunctions.drop should work for nested columns
### What changes were proposed in this pull request?

#26700 removed the ability to drop a row whose nested column value is null.

For example, for the following `df`:
```
val schema = new StructType()
  .add("c1", new StructType()
    .add("c1-1", StringType)
    .add("c1-2", StringType))
val data = Seq(Row(Row(null, "a2")), Row(Row("b1", "b2")), Row(null))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```
In Spark 2.4.4,
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```
In Spark 2.4.5 or Spark 3.0.0-preview2, if nested columns are specified, they are ignored.
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```
### Why are the changes needed?

This seems like a regression.

### Does this PR introduce any user-facing change?

Now, the nested column can be specified:
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```

Also, if `*` is specified as a column, it will throw an `AnalysisException` that `*` cannot be resolved, which was the behavior in 2.4.4. Currently, in master, it has no effect.

### How was this patch tested?

Updated existing tests.

Closes #28266 from imback82/SPARK-31256.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-20 02:59:09 +00:00
HyukjinKwon bc212df610 [SPARK-29672][BUILD][PYTHON][FOLLOW-UP] Recover PySpark via pip installation with deprecated Python 2, 3.4 and 3.5
### What changes were proposed in this pull request?

The RC fails to install against Python 2.7 via `pip`. We deprecated but didn't remove Python 2, 3.4 and 3.5 support yet. This PR partially reverts the changes from SPARK-29672 to recover Python 2, 3.4 and 3.5 pip installation.

```bash
python2.7 -m pip install https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/pyspark-3.0.0.tar.gz
```
```
...
Collecting https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/pyspark-3.0.0.tar.gz
  Using cached https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/pyspark-3.0.0.tar.gz (203.0 MB)
    ERROR: Command errored out with exit status 1:
     command: /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/_1/bzcp960d0hlb988k90654z2w0000gp/T/pip-req-build-sfCnmZ/setup.py'"'"'; __file__='"'"'/private/var/folders/_1/bzcp960d0hlb988k90654z2w0000gp/T/pip-req-build-sfCnmZ/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/_1/bzcp960d0hlb988k90654z2w0000gp/T/pip-req-build-sfCnmZ/pip-egg-info
         cwd: /private/var/folders/_1/bzcp960d0hlb988k90654z2w0000gp/T/pip-req-build-sfCnmZ/
    Complete output (6 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/_1/bzcp960d0hlb988k90654z2w0000gp/T/pip-req-build-sfCnmZ/setup.py", line 27
        file=sys.stderr)
            ^
    SyntaxError: invalid syntax
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
```

### Why are the changes needed?

To keep the deprecated support instead of removing.

### Does this PR introduce any user-facing change?

No, the change only affects branches that are not released yet.

### How was this patch tested?

```bash
./build/mvn -DskipTests -Phive -Phive-thriftserver clean package
cd python
python2.7 setup.py sdist
python2.7 -m pip install dist/pyspark-3.1.0.dev0.tar.gz
```

Closes #28243 from HyukjinKwon/SPARK-29672-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-20 10:31:44 +09:00
Dongjoon Hyun c6e39dffd6
[SPARK-31464][BUILD][SS] Upgrade Kafka to 2.5.0
### What changes were proposed in this pull request?

This PR aims to upgrade Kafka library to 2.5.0 for Apache Spark 3.1.0.

### Why are the changes needed?

Apache Kafka 2.5.0 client has improvements and bug fixes like [KAFKA-9241](https://issues.apache.org/jira/browse/KAFKA-9241)
- https://downloads.apache.org/kafka/2.5.0/RELEASE_NOTES.html

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

- [x] SBT https://github.com/apache/spark/pull/28235#issuecomment-615936382
- [x] Maven https://github.com/apache/spark/pull/28235#issuecomment-616138840 (All Scala/Java/Python/R UT tests passed. It timed out during R installation testing, which is already covered by SBT.)

Closes #28235 from dongjoon-hyun/SPARK-KAFKA-2.5.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-19 10:51:09 -07:00
Liang-Chi Hsieh e3ac56c8f4
[SPARK-31484][CORE] Add stage attempt number to temp checkpoint filename to avoid file already existing exception
### What changes were proposed in this pull request?

To avoid a `FileAlreadyExistsException` when creating a checkpoint file, this PR proposes to add the stage attempt number to the temporary checkpoint file name.

### Why are the changes needed?

On our production clusters, we have seen checkpointing failures. A failed stage can leave a partially written checkpoint file, and the task of the retried stage that writes the checkpoint file can then fail with a `FileAlreadyExistsException` when creating the same file, like
```
org.apache.hadoop.fs.FileAlreadyExistsException: /path_to_checkpoint/rdd-114/.part-03154-attempt-0 for client xxx.xxx.xxx.xxx already exists
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:359)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2353)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2273)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:728)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:413)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:851)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:794)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2490)

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
	at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:270)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1263)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1205)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:473)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:470)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:470)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:411)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:872)
	at org.apache.spark.rdd.ReliableCheckpointRDD$.writePartitionToCheckpointFile(ReliableCheckpointRDD.scala:204)
```
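
The trace above shows a temporary name of the form `.part-03154-attempt-0`, which is identical across stage attempts. A minimal Python sketch of a name that also encodes the stage attempt is shown below; the exact layout, the order of the two numbers, and the function name `temp_checkpoint_name` are assumptions for illustration, not necessarily the format used by this PR.

```python
def temp_checkpoint_name(partition_id, stage_attempt, task_attempt):
    """Build a temporary checkpoint file name for one task attempt.

    Including the stage attempt number keeps a retried stage from
    colliding with a partial file left behind by the failed stage.
    """
    final_name = "part-%05d" % partition_id
    # Hypothetical layout: .<final name>-attempt-<stage attempt>-<task attempt>
    return ".%s-attempt-%d-%d" % (final_name, stage_attempt, task_attempt)
```

With this scheme, task attempt 0 of stage attempt 1 writes to a different path than task attempt 0 of stage attempt 0, so a leftover partial file no longer triggers the exception.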

### Does this PR introduce any user-facing change?

Yes. Users won't see checkpoint file already existing exception after this PR.

### How was this patch tested?

Add unit test.

Closes #28255 from viirya/delete-temp-checkpoint.

Authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-19 09:11:17 -07:00
Takeshi Yamamuro 74aed8cc8b
[SPARK-31476][SQL] Add an ExpressionInfo entry for EXTRACT
### What changes were proposed in this pull request?

This PR intends to add an ExpressionInfo entry for EXTRACT for better documentation.
This PR comes from the comment in https://github.com/apache/spark/pull/21479#discussion_r409900080

### Why are the changes needed?

To make the SQL documentation complete.

### Does this PR introduce any user-facing change?

Yes, this PR updates the `Spark SQL, Built-in Functions` page.

### How was this patch tested?

Run the example tests.

Closes #28251 from maropu/AddExtractExpr.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-18 13:37:12 -07:00
ulysses 6c2bf8248a
[SPARK-31442][SQL] Print shuffle id at coalesce partitions target size
### What changes were proposed in this pull request?

Minor change. Print shuffle id.

### Why are the changes needed?

Make the log clearer.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Not needed.

Closes #28211 from ulysses-you/print-shuffle-id.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-18 09:27:44 -07:00
gatorsmile 6bf5f01a4a [SPARK-31477][SQL] Dump codegen and compile time in BenchmarkQueryTest
### What changes were proposed in this pull request?
This PR is to dump the codegen and compilation time for benchmark query tests.

### Why are the changes needed?
Measure the codegen and compilation time costs in TPC-DS queries

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manual test on my local laptop:
```
23:13:12.845 WARN org.apache.spark.sql.TPCDSQuerySuite:
=== Metrics of Whole-stage Codegen ===
Total code generation time: 21.275102261 seconds
Total compilation time: 12.223771828 seconds
```

Closes #28252 from gatorsmile/testMastercode.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-18 20:59:45 +09:00
Kent Yao 77cb7cde0d
[SPARK-31469][SQL][TESTS][FOLLOWUP] Remove unsupported fields from ExtractBenchmark
### What changes were proposed in this pull request?

In 697083c051, we removed the "MILLENNIUM", "CENTURY", "DECADE", "QUARTER", "MILLISECONDS", "MICROSECONDS" and "EPOCH" fields for the date_part and extract expressions. This PR fixes the related benchmark.
### Why are the changes needed?

test fix.

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

passing Jenkins

Closes #28249 from yaooqinn/SPARK-31469-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-18 00:32:42 -07:00
Maryann Xue 6198f38405
[SPARK-31473][SQL] AQE should set active session during execution
### What changes were proposed in this pull request?

AQE creates new SparkPlan nodes during execution. This PR makes sure that the active session is set correctly during this process and AQE execution is not disrupted by external session change.

### Why are the changes needed?

To prevent potential errors. If not changed, the physical plans generated by AQE would have the wrong SparkSession or even null SparkSession, which could lead to NPE.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added UT.

Closes #28247 from maryannxue/aqe-activesession.

Authored-by: Maryann Xue <maryann.xue@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-18 00:08:36 -07:00
Huaxin Gao 142f43629c [SPARK-31390][SQL][DOCS] Document Window Function in SQL Syntax Section
### What changes were proposed in this pull request?
Document Window Function in SQL syntax

### Why are the changes needed?
Make SQL Reference complete

### Does this PR introduce any user-facing change?
Yes

<img width="1050" alt="Screen Shot 2020-04-16 at 9 13 34 PM" src="https://user-images.githubusercontent.com/13592258/79531509-7bf5af00-8027-11ea-8291-a91b2e97a1b5.png">

<img width="1050" alt="Screen Shot 2020-04-16 at 9 14 12 PM" src="https://user-images.githubusercontent.com/13592258/79531514-7e580900-8027-11ea-8761-4c5a888c476f.png">

<img width="1050" alt="Screen Shot 2020-04-16 at 9 14 45 PM" src="https://user-images.githubusercontent.com/13592258/79531518-82842680-8027-11ea-876f-6375aa5b5ead.png">

<img width="1050" alt="Screen Shot 2020-04-16 at 9 15 10 PM" src="https://user-images.githubusercontent.com/13592258/79531521-844dea00-8027-11ea-8948-712f054d42ee.png">

<img width="1050" alt="Screen Shot 2020-04-16 at 9 15 25 PM" src="https://user-images.githubusercontent.com/13592258/79531528-8748da80-8027-11ea-9dae-a465286982ac.png">

### How was this patch tested?
Manually build and check

Closes #28220 from huaxingao/sql-win-fun.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-18 09:31:52 +09:00
Wenchen Fan db7b8651a1 [SPARK-31253][SQL][FOLLOW-UP] simplify the code of calculating size metrics of AQE shuffle
### What changes were proposed in this pull request?

A followup of https://github.com/apache/spark/pull/28175:
1. use mutable collection to store the driver metrics
2. don't send size metrics if there is no map stats, as UI will display size as 0 if there is no data
3. calculate partition data size separately, to make the code easier to read.

### Why are the changes needed?

code simplification

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #28240 from cloud-fan/refactor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2020-04-17 13:20:34 -07:00
Dongjoon Hyun fde996be87
[SPARK-31394][DOC][FOLLOWUP] Add nfs volume type description
### What changes were proposed in this pull request?

This adds newly supported `nfs` volume type description into the document for Apache Spark 3.1.0.

### Why are the changes needed?

To complete the document.

### Does this PR introduce any user-facing change?

Yes. (Doc)

![nfs_screen_shot](https://user-images.githubusercontent.com/9700541/79530887-8f077f80-8025-11ea-8cc1-e0b551802d5d.png)

### How was this patch tested?

Manually generate doc and check it.
```
SKIP_API=1 jekyll build
```

Closes #28236 from dongjoon-hyun/SPARK-NFS-DOC.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-17 12:07:34 -07:00
Sean Owen 61b7d446b3 Apply appropriate RPC handler to receive, receiveStream when auth enabled 2020-04-17 13:25:12 -05:00
Takeshi Yamamuro a7fb330ed3 [SPARK-31468][SQL] Null types should be implicitly casted to Decimal types
### What changes were proposed in this pull request?

This PR intends to fix a bug that occurs when comparing null types to decimal types in master/branch-3.0;
```
scala> Seq(BigDecimal(10)).toDF("v1").selectExpr("v1 = NULL").explain(true)
org.apache.spark.sql.AnalysisException: cannot resolve '(`v1` = NULL)' due to data type mismatch: differing types in '(`v1` = NULL)' (decimal(38,18) and null).; line 1 pos 0;
'Project [(v1#5 = null) AS (v1 = NULL)#7]
+- Project [value#2 AS v1#5]
   +- LocalRelation [value#2]
...
```
The query above passed in v2.4.5.

### Why are the changes needed?

bugfix

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #28241 from maropu/SPARK-31468.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-17 14:11:17 +00:00
Kent Yao 697083c051 [SPARK-31469][SQL] Make extract interval field ANSI compliance
### What changes were proposed in this pull request?

Currently, we can extract `millennium/century/decade/year/quarter/month/week/day/hour/minute/second(with fractions)/millisecond/microseconds` and `epoch` from interval values.

When getting `millennium/century/decade/year`, the result is how many of that unit the interval's `months` part can be converted to. The values of `millennium/century/decade` overlap with `year` and with each other.

When getting `month/day` and so on, the result is the integral remainder of the previous unit. Here all the units, including `year`, are individual.

So when extracting `year`, `month`, `day`, `hour`, `minute`, `second`, which are the ANSI primary datetime units, the semantics are those of `extracting`, while the others are closer to `transforming`.
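
The difference between the two semantics can be shown with plain integer arithmetic on the `months` part of an interval. This is a Python sketch for non-negative intervals only; the function names are hypothetical and do not correspond to Spark functions.

```python
MONTHS_PER_YEAR = 12


def extract_year(months):
    """'Extracting': the whole years contained in the months part."""
    return months // MONTHS_PER_YEAR


def extract_month(months):
    """'Extracting': the months left over after removing whole years."""
    return months % MONTHS_PER_YEAR


def to_decades(months):
    """'Transforming': re-express the same months as whole decades."""
    return months // (10 * MONTHS_PER_YEAR)


# An interval of 10 years 10 months is stored as 130 months.
months = 130
year, month, decade = extract_year(months), extract_month(months), to_decades(months)
```

Here `year` and `month` partition the interval (10 years plus 10 months), while `decade` counts the same 10 years a second time in a different unit, which is the overlap described above.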

When getting the epoch, we have to treat a month as 30 days, which deviates from the natural calendar rules we use.

To avoid ambiguity, I suggest we only support the extract fields defined by ANSI, along with their abbreviations.

### Why are the changes needed?

Extracting `millennium`, `century`, etc. does not match the meaning of extracting, and these fields are not useful enough to be worth maintaining.

`extract` is an ANSI standard expression and `date_part` is its PostgreSQL-specific alias function. The currently supported extract fields are fully borrowed from PostgreSQL.

Looking at other systems like Presto/Hive, they don't support those ambiguous fields either.

e.g. Hive 2.2.x also takes it from PostgreSQL but without introducing those ambiguous fields https://issues.apache.org/jira/secure/attachment/12828349/HIVE-14579

e.g. presto

```sql
presto> select extract(quater from interval '10-0' year to month);
Query 20200417_094723_00020_m8xq4 failed: line 1:8: Invalid EXTRACT field: quater
select extract(quater from interval '10-0' year to month)

presto> select extract(decade from interval '10-0' year to month);
Query 20200417_094737_00021_m8xq4 failed: line 1:8: Invalid EXTRACT field: decade
select extract(decade from interval '10-0' year to month)

```

### Does this PR introduce any user-facing change?

Yes. As we already have preview versions, this PR will remove support for extracting `millennium/century/decade/quarter/week/millisecond/microseconds` and `epoch` from intervals with the `date_part` function.

### How was this patch tested?

rm some used tests

Closes #28242 from yaooqinn/SPARK-31469.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-17 13:59:02 +00:00
zhengruifeng f1489e6b12 [SPARK-31436][ML] MinHash keyDistance optimization
### What changes were proposed in this pull request?
re-impl `keyDistance`:
if both vectors are dense, new impl is 9.09x faster;
if both vectors are sparse, new impl is 5.66x faster;
if one is dense and the other is sparse, new impl is 7.8x faster;
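
For reference, the MinHash key distance is the Jaccard distance between the sets of non-zero indices of the two vectors. One way to compute it without building sets is a two-pointer pass over sorted index arrays, sketched here in Python. This illustrates the general technique under the assumption that indices are sorted and unique; it is not necessarily the implementation in this PR, and `jaccard_distance` is a hypothetical name.

```python
def jaccard_distance(x_indices, y_indices):
    """Jaccard distance between two sorted, duplicate-free index lists."""
    i = j = 0
    intersection = 0
    # Walk both sorted lists once, counting indices present in both.
    while i < len(x_indices) and j < len(y_indices):
        if x_indices[i] == y_indices[j]:
            intersection += 1
            i += 1
            j += 1
        elif x_indices[i] < y_indices[j]:
            i += 1
        else:
            j += 1
    union = len(x_indices) + len(y_indices) - intersection
    if union == 0:
        raise ValueError("at least one vector must have a non-zero entry")
    return 1.0 - intersection / union
```

The pass is linear in the number of non-zero entries and allocates nothing, whereas a set-based version has to hash every index first.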

### Why are the changes needed?
current implementation based on set operations is inefficient

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #28206 from zhengruifeng/minhash_opt.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-04-17 08:27:21 -05:00
beliefer 1513673f83 [SPARK-30913][SPARK-30841][CORE][SQL][FOLLOWUP] Supplement version information to the configuration of Tests.scala and SQL
### What changes were proposed in this pull request?
I checked all the configs of Spark again and found some new commits that did not add version information.

**Tests.scala**
Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.testing.skipValidateCores | 3.1.0 | SPARK-29154 | 474b1bb5c2bce2f83c4dd8e19b9b7c5b3aebd6c4#diff-8b4ea8f3b0cc1e7ce7e943de1abbb165 |  

**SQL**
Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.legacy.integerGroupingId | 3.1.0 | SPARK-30279 | 71c73d58f6e88d2558ed2e696897767d93bac60f#diff-9a6b543db706f1a90f790783d6930a13 |  

These two configs exist only in the master branch.

### Why are the changes needed?
Supplement version information.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28233 from beliefer/sql-conf-version-legacy-integerGroupingId.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-17 17:10:48 +09:00
jiake d136b7248e [SPARK-31253][SQL][FOLLOW-UP] Improve the partition data size metrics in CustomShuffleReaderExec
### What changes were proposed in this pull request?
Currently the partition data size metrics contain three entries (min/max/avg) in the Spark UI, which is not user friendly. This PR puts min/max/avg into one metrics entry by calling SQLMetrics.postDriverMetricUpdates multiple times.
Before this PR, the Spark UI looks as follows:
![image](https://user-images.githubusercontent.com/11972570/78980137-da1a2200-7b4f-11ea-81ee-76858e887bde.png)

After this PR, the Spark UI looks as follows:
![image](https://user-images.githubusercontent.com/11972570/78980192-fae27780-7b4f-11ea-9faa-07f58699acfd.png)

### Why are the changes needed?
Improving UI

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing ut

Closes #28175 from JkSelf/improveAqeMetrics.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-17 06:23:54 +00:00
yi.wu 40f9dbb628 [SPARK-31425][SQL][CORE] UnsafeKVExternalSorter/VariableLengthRowBasedKeyValueBatch should also respect UnsafeAlignedOffset
### What changes were proposed in this pull request?

Make `UnsafeKVExternalSorter` / `VariableLengthRowBasedKeyValueBatch` also respect `UnsafeAlignedOffset` when reading the record, and update some out-of-date comments.

### Why are the changes needed?

Since `BytesToBytesMap` respects `UnsafeAlignedOffset` when writing the record, `UnsafeKVExternalSorter` should also respect `UnsafeAlignedOffset` when reading the record from `BytesToBytesMap` otherwise it will causes data correctness issue.

Unlike `UnsafeKVExternalSorter`, which may read records from `BytesToBytesMap`, `VariableLengthRowBasedKeyValueBatch` writes and reads records by itself. Thus, similar to #22053 and the [comment](https://github.com/apache/spark/pull/22053#issuecomment-411975239) there, the fix for `VariableLengthRowBasedKeyValueBatch` is more likely an improvement for SPARC platform support.
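
The mismatch can be sketched with a plain `ByteBuffer` (this is an illustration under assumptions, not Spark's `UnsafeAlignedOffset` code). The class `AlignedRecordSketch`, its methods, and the `uaoSize` parameter are hypothetical; the sketch assumes a length prefix that is 4 or 8 bytes wide depending on the platform, and relies on `ByteBuffer` being big-endian by default.

```java
import java.nio.ByteBuffer;

// Hypothetical record layout: [length prefix of uaoSize bytes][payload].
final class AlignedRecordSketch {
    // Writes the record length using the platform's prefix width (4 or 8 bytes).
    static void putSize(ByteBuffer buf, int pos, int size, int uaoSize) {
        if (uaoSize == 8) {
            buf.putLong(pos, size);
        } else {
            buf.putInt(pos, size);
        }
    }

    // Reads the record length; must use the same width the writer used.
    static int getSize(ByteBuffer buf, int pos, int uaoSize) {
        return uaoSize == 8 ? (int) buf.getLong(pos) : buf.getInt(pos);
    }

    static ByteBuffer write(byte[] payload, int uaoSize) {
        ByteBuffer buf = ByteBuffer.allocate(uaoSize + payload.length);
        putSize(buf, 0, payload.length, uaoSize);
        for (int i = 0; i < payload.length; i++) {
            buf.put(uaoSize + i, payload[i]);
        }
        return buf;
    }

    static byte[] read(ByteBuffer buf, int uaoSize) {
        int len = getSize(buf, 0, uaoSize);
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) {
            out[i] = buf.get(uaoSize + i);
        }
        return out;
    }
}
```

In this sketch, a record written with an 8-byte prefix reads back intact when the reader also uses 8 bytes. A reader that hard-codes a 4-byte prefix sees only the high half of the length and gets a wrong size, which is the kind of writer/reader disagreement the fix removes.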

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested `HashAggregationQueryWithControlledFallbackSuite` with `UAO_SIZE=8` to simulate the SPARC platform. The tests pass only with this fix.

Closes #28195 from Ngone51/fix_uao.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-17 04:48:27 +00:00
yi.wu b2e9e1717b [SPARK-31344][CORE] Polish implementation of barrier() and allGather()
### What changes were proposed in this pull request?

1. Combine `BarrierRequestToSync` and `AllGatherRequestToSync` into a single `RequestToSync`, whose kind is distinguished by its `RequestMethod` type.

2. Remove unnecessary JSON serialization/deserialization.

3. Clean up some code to make `runBarrier()` and `BarrierCoordinator` more general.

4. Remove unused imports.
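
The first item can be pictured roughly as follows. This is a simplified sketch, not the actual Spark classes: the fields of `RequestToSync`, the enum constants, and the `reply` helper are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class BarrierSketch {
    // Hypothetical request kinds; one enum replaces two separate message classes.
    enum RequestMethod { BARRIER, ALL_GATHER }

    // A single message type for both calls; the method field tells them apart.
    static final class RequestToSync {
        final int taskId;
        final String message;
        final RequestMethod method;

        RequestToSync(int taskId, String message, RequestMethod method) {
            this.taskId = taskId;
            this.message = message;
            this.method = method;
        }
    }

    // Builds the reply shared by all tasks once every task has checked in.
    static List<String> reply(List<RequestToSync> requests) {
        RequestMethod method = requests.get(0).method;
        List<String> messages = new ArrayList<>();
        for (RequestToSync r : requests) {
            if (r.method != method) {
                throw new IllegalArgumentException("mixed request methods in one sync");
            }
            // Only allGather carries payloads back; barrier just synchronizes.
            if (method == RequestMethod.ALL_GATHER) {
                messages.add(r.message);
            }
        }
        return messages;
    }
}
```

With one message type, the coordinator needs a single handling path and branches on the method only where the two calls genuinely differ.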

### Why are the changes needed?

To simplify the code and make it easier to maintain in the future.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This is a pure code refactoring, so it should be covered by existing tests.

Closes #28117 from Ngone51/refactor_barrier.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2020-04-16 21:23:32 -07:00
herman fab4ca5156
[SPARK-31450][SQL] Make ExpressionEncoder thread-safe
### What changes were proposed in this pull request?
This PR moves the `ExpressionEncoder.toRow` and `ExpressionEncoder.fromRow` functions into their own function objects (`ExpressionEncoder.Serializer` & `ExpressionEncoder.Deserializer`). This effectively makes the `ExpressionEncoder` stateless, thread-safe and (more) reusable. The function objects themselves are not thread-safe; however, they are documented as such and should be used in a more limited scope, which makes it easier to reason about thread safety.
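
The pattern can be sketched in miniature as below. This is an illustration of the general idea only, not the `ExpressionEncoder` implementation: `EncoderSketch`, its `Serializer`, and the string output format are hypothetical.

```java
// Hypothetical encoder: holds only immutable state, so it is safe to share.
final class EncoderSketch {
    private final String fieldName;

    EncoderSketch(String fieldName) {
        this.fieldName = fieldName;
    }

    // Each caller (for example, each thread or task) creates its own function object.
    Serializer createSerializer() {
        return new Serializer(fieldName);
    }

    // Function object with a reused mutable buffer: fast, but NOT thread-safe.
    static final class Serializer {
        private final String fieldName;
        private final StringBuilder buffer = new StringBuilder();

        Serializer(String fieldName) {
            this.fieldName = fieldName;
        }

        String apply(int value) {
            buffer.setLength(0); // reuse the buffer between calls
            buffer.append(fieldName).append('=').append(value);
            return buffer.toString();
        }
    }
}
```

The mutable state lives only in the short-lived `Serializer`, so sharing the encoder across threads is harmless as long as each thread calls `createSerializer()` for itself.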

### Why are the changes needed?
ExpressionEncoders are not thread-safe. We had various (nasty) bugs because of this.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #28223 from hvanhovell/SPARK-31450.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-16 18:47:46 -07:00