Commit graph

26425 commits

Author SHA1 Message Date
Yuanjian Li 3db3e39f11 [SPARK-28228][SQL] Change the default behavior for name conflict in nested WITH clause
### What changes were proposed in this pull request?
This is a follow-up for #25029. In this PR we throw an `AnalysisException` when a name conflict is detected in a nested WITH clause, so the config `spark.sql.legacy.ctePrecedence.enabled` has to be set explicitly to opt into the previous behavior.
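
A minimal illustrative sketch (hypothetical query, not from the PR) of the kind of conflict that is now rejected by default:
```scala
// Hypothetical example: the name `t` is defined in both the outer and the inner WITH clause.
// Under the new default this fails analysis with an AnalysisException; setting
// spark.sql.legacy.ctePrecedence.enabled explicitly restores a defined precedence.
spark.sql("""
  WITH t AS (SELECT 1 AS c),
       t2 AS (
         WITH t AS (SELECT 2 AS c)  -- conflicts with the outer `t`
         SELECT * FROM t
       )
  SELECT * FROM t2
""").show()
```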

### Why are the changes needed?
The original change might be risky for end users because it changes behavior silently.

### Does this PR introduce any user-facing change?
Yes, the config `spark.sql.legacy.ctePrecedence.enabled` becomes optional and has to be set explicitly to restore the old behavior.

### How was this patch tested?
New UT.

Closes #27454 from xuanyuanking/SPARK-28228-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-08 14:10:28 -08:00
zhengruifeng 12e1bbaddb Revert "[SPARK-30642][SPARK-30659][SPARK-30660][SPARK-30662]"
### What changes were proposed in this pull request?
Revert
#27360
#27396
#27374
#27389

### Why are the changes needed?
BLAS needs more performance tests, especially on sparse datasets.
A performance test of LogisticRegression (https://github.com/apache/spark/pull/27374) on a sparse dataset shows that blockifying vectors into matrices and using BLAS causes a performance regression.
LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure there is no regression.

### Does this PR introduce any user-facing change?
remove newly added param blockSize

### How was this patch tested?
reverted testsuites

Closes #27487 from zhengruifeng/revert_blockify_ii.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-02-08 08:46:16 +08:00
Terry Kim a7451f44d2 [SPARK-30614][SQL] The native ALTER COLUMN syntax should change one property at a time
### What changes were proposed in this pull request?

The current ALTER COLUMN syntax allows changing multiple properties at a time:
```
ALTER TABLE table=multipartIdentifier
  (ALTER | CHANGE) COLUMN? column=multipartIdentifier
  (TYPE dataType)?
  (COMMENT comment=STRING)?
  colPosition?
```
The SQL standard (section 11.12) only allows changing one property at a time. This is also true of other recent SQL systems like [snowflake](https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html) and [redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html). (credit to cloud-fan)

This PR proposes to change ALTER COLUMN to follow the SQL standard, thus allowing only one column property to be altered at a time.

Note that the ALTER COLUMN syntax being changed here was newly added in Spark 3.0, so it doesn't affect Spark 2.4 behavior.

### Why are the changes needed?

To follow SQL standard (and other recent SQL systems) behavior.

### Does this PR introduce any user-facing change?

Yes, now the user can update the column properties only one at a time.

For example,
```
ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint COMMENT 'new comment'
```
should be broken into
```
ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint
ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment'
```

### How was this patch tested?

Updated existing tests.

Closes #27444 from imback82/alter_column_one_at_a_time.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-08 02:47:44 +08:00
Maxim Gekk a3e77773cf [SPARK-30752][SQL] Fix to_utc_timestamp on daylight saving day
### What changes were proposed in this pull request?
- Rewrite the `convertTz` method of `DateTimeUtils` using Java 8 time API
- Change the types of the `convertTz` parameters from `TimeZone` to `ZoneId`. This avoids unnecessary `TimeZone` -> `ZoneId` conversions and the performance regressions they would cause.

### Why are the changes needed?
- Fixes incorrect behavior of `to_utc_timestamp` on daylight saving day. For example:
```scala
scala> df.select(to_utc_timestamp(lit("2019-11-03T12:00:00"), "Asia/Hong_Kong").as("local UTC")).show
+-------------------+
|          local UTC|
+-------------------+
|2019-11-03 03:00:00|
+-------------------+
```
but the result must be 2019-11-03 04:00:00:
<img width="1013" alt="Screen Shot 2020-02-06 at 20 09 36" src="https://user-images.githubusercontent.com/1580697/73960846-a129bb00-491c-11ea-92f5-45831cb28a62.png">

- Simplifies the code and makes it more maintainable
- Switches `convertTz` to the Proleptic Gregorian calendar used by the Java 8 time classes by default, which makes the function consistent with other date-time functions.

### Does this PR introduce any user-facing change?
Yes, after the changes `to_utc_timestamp` returns the correct result `2019-11-03 04:00:00`.

### How was this patch tested?
- By existing test suite `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.
- Added `convert time zones on a daylight saving day` to DateFunctionsSuite

Closes #27474 from MaxGekk/port-convertTz-on-Java8-api.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-08 02:32:07 +08:00
sharif ahmad dd2f4431f5 [MINOR][DOCS] Fix typos at python/pyspark/sql/types.py
### What changes were proposed in this pull request?

This PR fixes some typos in `python/pyspark/sql/types.py` file.

### Why are the changes needed?

To deliver correct wording in documentation and code.

### Does this PR introduce any user-facing change?

Yes, it fixes some typos in user-facing API documentation.

### How was this patch tested?

Locally tested the linter.

Closes #27475 from sharifahmad2061/master.

Lead-authored-by: sharif ahmad <sharifahmad2061@gmail.com>
Co-authored-by: Sharif ahmad <sharifahmad2061@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-07 18:42:16 +09:00
Yuanjian Li 4804445327 [MINOR][DOC] Fix document UI left menu broken
### What changes were proposed in this pull request?
Fix the broken left menu introduced in #25459.

### Why are the changes needed?
The `left-menu-wrapper` CSS is reused for both ml-guide and sql-programming-guide; the previous change broke the UI.

Before:
![image](https://user-images.githubusercontent.com/4833765/73952563-1061d800-493a-11ea-8a75-d802a1534a44.png)
![image](https://user-images.githubusercontent.com/4833765/73952584-18217c80-493a-11ea-85a3-ce5f9875545f.png)
![image](https://user-images.githubusercontent.com/4833765/73952605-21124e00-493a-11ea-8d79-24f4dfec73d9.png)

After:
![image](https://user-images.githubusercontent.com/4833765/73952630-2a031f80-493a-11ea-80ff-4630801cfaf4.png)
![image](https://user-images.githubusercontent.com/4833765/73952652-30919700-493a-11ea-9db1-8bb4a3f913b4.png)
![image](https://user-images.githubusercontent.com/4833765/73952671-35eee180-493a-11ea-801b-d50c4397adf2.png)

### Does this PR introduce any user-facing change?
Document UI change only.

### How was this patch tested?
Local test; screenshots are attached above.

Closes #27479 from xuanyuanking/doc-ui.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-06 14:58:53 -08:00
Wenchen Fan 5a4c70b4e2 [SPARK-27986][SQL][FOLLOWUP] window aggregate function with filter predicate is not supported
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/26656.

We don't support window aggregate functions with filter predicates yet, so we should fail explicitly.

Observable metrics have the same issue; this PR fixes it as well.
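
As an illustration (hypothetical table and columns), a query of the kind that now fails explicitly instead of having its filter silently ignored:
```scala
// Hypothetical: an aggregate with a FILTER clause used as a window function.
// Before this change the filter predicate could be silently dropped; now analysis fails explicitly.
spark.sql("""
  SELECT dept,
         count(*) FILTER (WHERE salary > 1000)
                  OVER (PARTITION BY dept) AS high_paid
  FROM empsalary
""")
```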

### Why are the changes needed?

If we simply ignore the filter predicate when we don't support it, the result is wrong.

### Does this PR introduce any user-facing change?

Yes, it fixes the query result.

### How was this patch tested?

new tests

Closes #27476 from cloud-fan/filter.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-06 13:33:39 -08:00
Wenchen Fan 8ce58627eb [SPARK-30719][SQL] do not log warning if AQE is intentionally skipped and add a config to force apply
### What changes were proposed in this pull request?

Update `InsertAdaptiveSparkPlan` to not log warning if AQE is skipped intentionally.

This PR also adds a config to force-apply AQE instead of skipping it.

### Why are the changes needed?

It shouldn't be logged as a warning at all if we intentionally skip AQE.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Ran `AdaptiveQueryExecSuite` locally and verified that there are no warning logs.

Closes #27452 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-02-06 09:16:14 -08:00
Yuanjian Li d8613571bc [SPARK-26700][CORE][FOLLOWUP] Add config spark.network.maxRemoteBlockSizeFetchToMem
### What changes were proposed in this pull request?
Add a new config, `spark.network.maxRemoteBlockSizeFetchToMem`, that falls back to the old config `spark.maxRemoteBlockSizeFetchToMem`.
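
A small usage sketch (the value is just an assumed example); the old name keeps working as a fallback:
```scala
import org.apache.spark.SparkConf

// Hypothetical: prefer the new, consistently named config;
// the old spark.maxRemoteBlockSizeFetchToMem is still honored as a fallback.
val conf = new SparkConf()
  .set("spark.network.maxRemoteBlockSizeFetchToMem", "200m")
```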

### Why are the changes needed?
For naming consistency.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #27463 from xuanyuanking/SPARK-26700-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-06 20:53:44 +08:00
yi.wu 368ee62a5d [SPARK-27297][DOC][FOLLOW-UP] Improve documentation for various Scala functions
### What changes were proposed in this pull request?

Add examples and parameter descriptions for these Scala functions (a brief usage sketch follows the list):

* transform
* exists
* forall
* aggregate
* zip_with
* transform_keys
* transform_values
* map_filter
* map_zip_with
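
A brief usage sketch of two of them, assuming a hypothetical DataFrame `df` with an integer array column `values`:
```scala
import org.apache.spark.sql.functions._

// transform: apply a lambda to each array element; exists: check whether any element matches.
val out = df.select(
  transform(col("values"), x => x + 1).as("incremented"),
  exists(col("values"), x => x % 2 === 0).as("has_even")
)
```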

### Why are the changes needed?

Better documentation for UX.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27449 from Ngone51/doc-funcs.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-06 20:34:29 +08:00
yi.wu 3f5b23340e [SPARK-30744][SQL] Optimize AnalyzePartitionCommand by calculating location sizes in parallel
### What changes were proposed in this pull request?

Use `CommandUtils.calculateTotalLocationSize` for `AnalyzePartitionCommand` in order to calculate location sizes in parallel.

### Why are the changes needed?

For better performance.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27471 from Ngone51/dev_calculate_in_parallel.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-06 20:20:44 +08:00
beliefer c8ef1dee90 [SPARK-29108][SQL][TESTS][FOLLOWUP] Comment out no use test case and add 'insert into' statement of window.sql (Part 2)
### What changes were proposed in this pull request?
When running the `window_part2.sql` tests, I found that the file lacks the INSERT statements, so the output is empty.
I checked PostgreSQL for reference: https://github.com/postgres/postgres/blob/master/src/test/regress/sql/window.sql
Although `window_part1.sql` and `window_part3.sql` contain the INSERT statements, I think we should also add them to `window_part2.sql`.
Currently only one case references the table `empsalary`, and it throws `AnalysisException`:
```
-- !query
select last(salary) over(order by salary range between 1000 preceding and 1000 following),
lag(salary) over(order by salary range between 1000 preceding and 1000 following),
salary from empsalary
-- !query schema
struct<>
-- !query output
org.apache.spark.sql.AnalysisException
Window Frame specifiedwindowframe(RangeFrame, -1000, 1000) must match the required frame specifiedwindowframe(RowFrame, -1, -1);
```

So we should do two things:
1. Comment out the single failing case and create a new ticket.
2. Add `INSERT INTO empsalary`.

Note: `window_part4.sql` does not use the table `empsalary`.

### Why are the changes needed?
Supplementary test data.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New test case

Closes #27439 from beliefer/add-insert-to-window.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-06 15:24:26 +09:00
Terry Kim c27a616450 [SPARK-30612][SQL] Resolve qualified column name with v2 tables
### What changes were proposed in this pull request?

This PR fixes the issue where queries with qualified columns like `SELECT t.a FROM t` would fail to resolve for v2 tables.

This PR allows qualified column names in queries, as follows:
```SQL
SELECT testcat.ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT tbl.foo FROM testcat.ns1.ns2.tbl
```

### Why are the changes needed?

This fixes a bug where you cannot qualify column names in queries against v2 tables.

### Does this PR introduce any user-facing change?

Yes, now users can qualify column names for v2 tables.

### How was this patch tested?

Added new tests.

Closes #27391 from imback82/qualified_col.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-06 13:54:17 +08:00
yi.wu aebabf0bed [SPARK-30729][CORE] Eagerly filter out zombie TaskSetManager before offering resources
### What changes were proposed in this pull request?

Eagerly filter out zombie `TaskSetManager`s before offering resources, to reduce overhead as much as possible.

This PR also avoids doing `recomputeLocality` and `addPendingTask` when a `TaskSetManager` is zombie.

### Why are the changes needed?

A zombie `TaskSetManager` could still exist in the Pool's `schedulableQueue` when it has running tasks. Offering resources to a zombie `TaskSetManager` brings unnecessary overhead and is meaningless.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27455 from Ngone51/exclude-zombie-tsm.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-06 12:48:27 +08:00
HyukjinKwon b95ccb1d8b [SPARK-30737][SPARK-27262][R][BUILD] Reenable CRAN check with UTF-8 encoding to DESCRIPTION
### What changes were proposed in this pull request?

This PR proposes to re-enable the CRAN check disabled at https://github.com/apache/spark/pull/27460. Given the tests at https://github.com/apache/spark/pull/27468, it seems we should also port https://github.com/apache/spark/pull/23823 together.

### Why are the changes needed?
To bring the CRAN check back.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
It was tested at https://github.com/apache/spark/pull/27468 and Jenkins should test it out.

Closes #27472 from HyukjinKwon/SPARK-30737.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-06 13:01:08 +09:00
HyukjinKwon 20c60a43cc [MINOR][INFRA] Factor Python executable out as a variable in 'lint-python' script
### What changes were proposed in this pull request?

This PR proposes to replace the hardcoded `python3` with a variable `PYTHON_EXECUTABLE` in the `lint-python` script.

### Why are the changes needed?

To make changes easier. See 561e9b9688 as an example.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manually by running `dev/lint-python`.

Closes #27470 from HyukjinKwon/minor-PYTHON_EXECUTABLE.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-05 17:01:33 -08:00
Dongjoon Hyun 9d907bc84d [SPARK-30743][K8S][TESTS] Use JRE instead of JDK in K8S test docker image
### What changes were proposed in this pull request?

This PR aims to replace the JDK with a JRE in the K8S integration test Docker images.

### Why are the changes needed?

This will save some resources and make sure that we only need a JRE at runtime and for testing.
- https://lists.apache.org/thread.html/3145150b711d7806a86bcd3ab43e18bcd0e4892ab5f11600689ba087%40%3Cdev.spark.apache.org%3E

### Does this PR introduce any user-facing change?

No. This is a dev-only test environment.

### How was this patch tested?

Pass the Jenkins K8s Integration Test.
- https://github.com/apache/spark/pull/27469#issuecomment-582681125

Closes #27469 from dongjoon-hyun/SPARK-30743.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-05 16:55:45 -08:00
yudovin f9f06eee98 [SPARK-30122][K8S] Support spark.kubernetes.authenticate.executor.serviceAccountName
### What changes were proposed in this pull request?

Currently, it doesn't seem to be possible to have the Spark driver set the `serviceAccountName` for the executor pods it launches; this PR adds that capability.
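
A minimal sketch of how the new setting (named after the config in the title) could be supplied; the service account name is an assumed example:
```scala
import org.apache.spark.SparkConf

// Hypothetical: give executor pods their own service account, distinct from the driver's.
val conf = new SparkConf()
  .set("spark.kubernetes.authenticate.executor.serviceAccountName", "spark-executor-sa")
```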

### Why are the changes needed?

It will allow setting the `serviceAccountName` for executor pods.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

It was covered by a unit test.

Closes #27034 from ayudovin/srevice-account-name-for-executor-pods.

Authored-by: yudovin <artsiom.yudovin@profitero.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-05 14:16:59 -08:00
Wenchen Fan 3b26f807a0 [SPARK-30721][SQL][TESTS] Fix DataFrameAggregateSuite when enabling AQE
### What changes were proposed in this pull request?

update `DataFrameAggregateSuite` to make it pass with AQE

### Why are the changes needed?

We don't need to turn off AQE in `DataFrameAggregateSuite`

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

run `DataFrameAggregateSuite` locally with AQE on.

Closes #27451 from cloud-fan/aqe-test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-05 12:36:51 -08:00
Yuanjian Li 4938905a1c [SPARK-29864][SQL][FOLLOWUP] Reference the config for the old behavior in error message
### What changes were proposed in this pull request?
Follow-up work for SPARK-29864: reference the config `spark.sql.legacy.fromDayTimeString.enabled` in the error message.

### Why are the changes needed?
For better usability.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #27464 from xuanyuanking/SPARK-29864-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-05 11:19:42 -08:00
Dongjoon Hyun 9d90c8b898 [SPARK-30738][K8S] Use specific image version in "Launcher client dependencies" test
### What changes were proposed in this pull request?

This PR uses a specific version of the Docker image instead of `latest`. As of today, when I run the K8s integration test locally, this test case always fails.

Also, in this PR, I show two consecutive failures with a dummy change.
- https://github.com/apache/spark/pull/27465#issuecomment-582326614
- https://github.com/apache/spark/pull/27465#issuecomment-582329114
```
- Launcher client dependencies *** FAILED ***
```

After that, I added the patch and the K8s integration test passed.
- https://github.com/apache/spark/pull/27465#issuecomment-582361696

### Why are the changes needed?

[SPARK-28465](https://github.com/apache/spark/pull/25222) switched from `v4.0.0-stable-4.0-master-centos-7-x86_64` to `latest` to catch up with an API change. However, the API seems to have changed again, so we had better use a specific version to prevent accidental failures.

```scala
- .withImage("ceph/daemon:v4.0.0-stable-4.0-master-centos-7-x86_64")
+ .withImage("ceph/daemon:latest")
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass `Launcher client dependencies` test in Jenkins K8s Integration Suite.
Or, run K8s Integration test locally.

Closes #27465 from dongjoon-hyun/SPARK-K8S-IT.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-05 11:01:53 -08:00
WeichenXu ec70e0708f [MINOR][DOC] Add migration note for removing org.apache.spark.ml.image.ImageSchema.readImages
### What changes were proposed in this pull request?

Add migration note for removing `org.apache.spark.ml.image.ImageSchema.readImages`

### Why are the changes needed?

### Does this PR introduce any user-facing change?

### How was this patch tested?

Closes #27467 from WeichenXu123/SC-26286.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-05 07:54:16 -08:00
turbofei 6d507b4a31 [SPARK-26218][SQL][FOLLOW UP] Fix the corner case when casting float to Integer
### What changes were proposed in this pull request?
When spark.sql.ansi.enabled is true, for the statement:
```
select cast(cast(2147483648 as Float) as Integer) //result is 2147483647
```
Its result is 2147483647 and does not throw `ArithmeticException`.

The root cause is that the code below does not work for some corner cases.
94fc0e3235/sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala (L129-L141)

For example:

![image](https://user-images.githubusercontent.com/6757692/72074911-badfde80-332d-11ea-963e-2db0e43c33e8.png)

In this PR, I fix it by comparing `Math.floor(x)` with `Int.MaxValue` directly.
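
A minimal sketch of the corner case and the fixed bound check (not the actual Spark code):
```scala
// 2147483648 is exactly representable as a Float but exceeds Int.MaxValue,
// so an ANSI-mode cast to Integer must throw instead of returning 2147483647.
val x: Float = 2147483648L.toFloat
val overflows = math.floor(x) > Int.MaxValue
assert(overflows)  // the old check missed this corner case
```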

### Why are the changes needed?
The result is corrupted.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?

Added Unit test.

Closes #27151 from turboFei/SPARK-26218-follow-up-int-overflow.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-05 21:24:02 +08:00
Maxim Gekk 459e757ed4 [SPARK-30668][SQL] Support SimpleDateFormat patterns in parsing timestamps/dates strings
### What changes were proposed in this pull request?
In the PR, I propose to partially revert the commit 51a6ba0181 and provide a legacy parser based on `FastDateFormat` which is compatible with `SimpleDateFormat`.

To enable the legacy parser, set `spark.sql.legacy.timeParser.enabled` to `true`.

### Why are the changes needed?
To allow users to restore the old behavior of parsing timestamps/dates using `SimpleDateFormat` patterns. The main reason for restoring it is that `DateTimeFormatter`'s patterns are not fully compatible with `SimpleDateFormat` patterns; see https://issues.apache.org/jira/browse/SPARK-30668

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
- Added new test to `DateFunctionsSuite`
- Restored additional test cases in `JsonInferSchemaSuite`.

Closes #27441 from MaxGekk/support-simpledateformat.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-05 18:48:45 +08:00
Onur Satici 86fdb818bf [SPARK-30715][K8S] Bump fabric8 to 4.7.1
### What changes were proposed in this pull request?
Bump fabric8 kubernetes-client to 4.7.1

### Why are the changes needed?
New fabric8 version brings support for Kubernetes 1.17 clusters.
Full release notes:
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.7.0
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.7.1

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing unit and integration tests cover the creation of K8S objects. Adjusted them to work with the new fabric8 version.

Closes #27443 from onursatici/os/bump-fabric8.

Authored-by: Onur Satici <onursatici@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-05 01:17:30 -08:00
yi.wu 5983ad9cc4 [SPARK-30506][SQL][DOC] Document for generic file source options/configs
### What changes were proposed in this pull request?

Add a new documentation page named *Generic File Source Options* under the *Data Sources* menu, with the following sub-items (a usage sketch follows the list):

* spark.sql.files.ignoreCorruptFiles
* spark.sql.files.ignoreMissingFiles
* pathGlobFilter
* recursiveFileLookup
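
A usage sketch of these options with assumed paths and values:
```scala
// The two configs are session configs; the two options are per-read data source options.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

val df = spark.read
  .option("pathGlobFilter", "*.parquet")      // only pick up files matching the glob
  .option("recursiveFileLookup", "true")      // recurse into subdirectories
  .parquet("/data/events")                    // hypothetical input path
```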

Here are snapshots of the generated document:
<img width="1080" alt="doc-1" src="https://user-images.githubusercontent.com/16397174/73816825-87a54800-4824-11ea-97da-e5c40c59a7d4.png">
<img width="1081" alt="doc-2" src="https://user-images.githubusercontent.com/16397174/73816827-8a07a200-4824-11ea-99ec-9c8b0286625e.png">
<img width="1080" alt="doc-3" src="https://user-images.githubusercontent.com/16397174/73816831-8c69fc00-4824-11ea-84f0-6c9e94c2f0e2.png">
<img width="1081" alt="doc-4" src="https://user-images.githubusercontent.com/16397174/73816834-8f64ec80-4824-11ea-9355-76ad45476634.png">

### Why are the changes needed?

Better guidance for end users.

### Does this PR introduce any user-facing change?

No, added in Spark 3.0.

### How was this patch tested?

Pass Jenkins.

Closes #27302 from Ngone51/doc-generic-file-source-option.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-05 17:16:38 +08:00
yi.wu 30e418a6fe [SPARK-30594][CORE] Do not post SparkListenerBlockUpdated when updateBlockInfo returns false
### What changes were proposed in this pull request?

If `updateBlockInfo` returns false, the `BlockManager` will re-register and report all blocks later. So we may report the same block twice, which causes `AppStatusListener` to count the used memory twice as well. As a result, the used memory can exceed the total memory.
So this PR changes it to not post `SparkListenerBlockUpdated` when `updateBlockInfo` returns false, and to always clean up used memory whenever `AppStatusListener` receives `SparkListenerBlockManagerAdded`.

### Why are the changes needed?

This PR tries to fix the negative memory usage in the UI (https://user-images.githubusercontent.com/3488126/72131225-95e37e00-33b6-11ea-8708-6e5ed328d1ca.png, see #27144), though I'm not very sure this is the root cause of #27144 since the known information is limited here.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added new tests by xuanyuanking

Closes #27306 from Ngone51/fix-possible-negative-memory.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-05 16:15:44 +08:00
HyukjinKwon e2d984aa1c [SPARK-30733][R][HOTFIX] Fix SparkR tests per testthat and R version upgrade, and disable CRAN
### What changes were proposed in this pull request?

There are currently R test failures after upgrading `testthat` to 2.0.0 and R to version 3.5.2 as of SPARK-23435. This PR targets fixing the tests and making them pass. See the explanations and causes below:

```
test_context.R:49: failure: Check masked functions
length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
1/1 mismatches
[1] 6 - 4 == 2

test_context.R:53: failure: Check masked functions
sort(maskedCompletely, na.last = TRUE) not equal to sort(namesOfMaskedCompletely, na.last = TRUE).
5/6 mismatches
x[2]: "endsWith"
y[2]: "filter"

x[3]: "filter"
y[3]: "not"

x[4]: "not"
y[4]: "sample"

x[5]: "sample"
y[5]: NA

x[6]: "startsWith"
y[6]: NA
```

From my cursory look, R base and R's version are mismatched. I fixed accordingly and Jenkins will test it out.

```
test_includePackage.R:31: error: include inside function
package or namespace load failed for 'plyr':
 package 'plyr' was installed by an R version with different internals; it needs to be reinstalled for use with this R version
Seems it's a package installation issue. Looks like plyr has to be re-installed.
```

From my cursory look, the previously installed `plyr` remains and is not compatible with the new R version. I fixed it accordingly and Jenkins will test it out.

```
test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA as date and time
Your system is mis-configured: '/etc/localtime' is not a symlink
```

Seems like an environment problem. I suppressed the warnings for now.

```
test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA as date and time
It is strongly recommended to set envionment variable TZ to 'America/Los_Angeles' (or equivalent)
```

Seems like an environment problem. I suppressed the warnings for now.

```
test_sparkSQL.R:1814: error: string operators
unable to find an inherited method for function 'startsWith' for signature '"character"'
1: expect_true(startsWith("Hello World", "Hello")) at /home/jenkins/workspace/SparkPullRequestBuilder2/R/pkg/tests/fulltests/test_sparkSQL.R:1814
2: quasi_label(enquo(object), label)
3: eval_bare(get_expr(quo), get_env(quo))
4: startsWith("Hello World", "Hello")
5: (function (classes, fdef, mtable)
   {
       methods <- .findInheritedMethods(classes, fdef, mtable)
       if (length(methods) == 1L)
           return(methods[[1L]])
       else if (length(methods) == 0L) {
           cnames <- paste0("\"", vapply(classes, as.character, ""), "\"", collapse = ", ")
           stop(gettextf("unable to find an inherited method for function %s for signature %s",
               sQuote(fdefgeneric), sQuote(cnames)), domain = NA)
       }
       else stop("Internal error in finding inherited methods; didn't return a unique method",
           domain = NA)
   })(list("character"), new("nonstandardGenericFunction", .Data = function (x, prefix)
   {
       standardGeneric("startsWith")
   }, generic = structure("startsWith", package = "SparkR"), package = "SparkR", group = list(),
       valueClass = character(0), signature = c("x", "prefix"), default = NULL, skeleton = (function (x,
           prefix)
       stop("invalid call in method dispatch to 'startsWith' (no default method)", domain = NA))(x,
           prefix)), <environment>)
6: stop(gettextf("unable to find an inherited method for function %s for signature %s",
       sQuote(fdefgeneric), sQuote(cnames)), domain = NA)
```

From my cursory look, R base and R's version are mismatched. I fixed accordingly and Jenkins will test it out.

Also, this PR causes a CRAN check failure as below:

```
* creating vignettes ... ERROR
Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
package 'htmltools' was installed by an R version with different internals; it needs to be reinstalled for use with this R version
```

This PR disables it for now.

### Why are the changes needed?

To unblock other PRs.

### Does this PR introduce any user-facing change?

No. Test only and dev only.

### How was this patch tested?

No. I am going to use Jenkins to test.

Closes #27460 from HyukjinKwon/r-test-failure.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-05 16:45:54 +09:00
HyukjinKwon 692e3ddb4e [SPARK-27870][PYTHON][FOLLOW-UP] Rename spark.sql.pandas.udf.buffer.size to spark.sql.execution.pandas.udf.buffer.size
### What changes were proposed in this pull request?

This PR renames `spark.sql.pandas.udf.buffer.size` to `spark.sql.execution.pandas.udf.buffer.size` to be more consistent with other pandas configuration prefixes, given:
-  `spark.sql.execution.pandas.arrowSafeTypeConversion`
- `spark.sql.execution.pandas.respectSessionTimeZone`
- `spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName`
- other configurations like `spark.sql.execution.arrow.*`.

### Why are the changes needed?

To make configuration names consistent.

### Does this PR introduce any user-facing change?

No because this configuration was not released yet.

### How was this patch tested?

Existing tests should cover.

Closes #27450 from HyukjinKwon/SPARK-27870-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-05 11:38:33 +09:00
Dongjoon Hyun 898716980d Revert "[SPARK-28310][SQL] Support (FIRST_VALUE|LAST_VALUE)(expr[ (IGNORE|RESPECT) NULLS]?) syntax"
### What changes were proposed in this pull request?

This reverts commit b89c3de1a4.

### Why are the changes needed?

`FIRST_VALUE` is used only for window expression. Please see the discussion on https://github.com/apache/spark/pull/25082 .

### Does this PR introduce any user-facing change?

Yes.

### How was this patch tested?

Pass the Jenkins.

Closes #27458 from dongjoon-hyun/SPARK-28310.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-04 17:26:46 -08:00
Liang-Chi Hsieh 7631275f97 [SPARK-25040][SQL][FOLLOWUP] Add legacy config for allowing empty strings for certain types in json parser
### What changes were proposed in this pull request?

This is a follow-up for #22787, where we disallowed empty strings in the JSON parser except for string and binary types. This follow-up adds a legacy config for restoring the previous behavior of allowing empty strings.

### Why are the changes needed?

Adding a legacy config to make migration easy for Spark users.

### Does this PR introduce any user-facing change?

Yes. If this legacy config is set to true, users can restore the previous behavior prior to Spark 3.0.0.

### How was this patch tested?

Unit test.

Closes #27456 from viirya/SPARK-25040-followup.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-04 17:22:23 -08:00
Maxim Gekk f2dd082544 [SPARK-30725][SQL] Make legacy SQL configs as internal configs
### What changes were proposed in this pull request?
All legacy SQL configs are marked as internal configs. In particular, the following configs are now internal:
- spark.sql.legacy.sizeOfNull
- spark.sql.legacy.replaceDatabricksSparkAvro.enabled
- spark.sql.legacy.typeCoercion.datetimeToString.enabled
- spark.sql.legacy.looseUpcast
- spark.sql.legacy.arrayExistsFollowsThreeValuedLogic

### Why are the changes needed?
In the general case, users shouldn't change legacy configs, so they can be marked as internal.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Tested by the Jenkins build and existing test runs.

Closes #27448 from MaxGekk/legacy-internal-sql-conf.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-04 21:17:05 +08:00
Maxim Gekk 0202b675af [SPARK-26618][SQL][FOLLOWUP] Describe the behavior change of typed TIMESTAMP/DATE literals
### What changes were proposed in this pull request?
In the PR, I propose to update the SQL migration guide and clarify the behavior change of typed `TIMESTAMP` and `DATE` literals for input strings without time zone information (local timestamp and date strings).
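
For example (hypothetical values), a typed literal without time zone information:
```scala
// In Spark 3.0 the literals below follow spark.sql.session.timeZone;
// in Spark 2.4 and earlier they followed the JVM system time zone.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("SELECT TIMESTAMP '2020-02-04 10:00:00' AS ts, DATE '2020-02-04' AS d").show()
```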

### Why are the changes needed?
To inform users that the typed literals may change their behavior in Spark 3.0 because of different sources of the default time zone: the JVM system time zone in Spark 2.4 and earlier, and `spark.sql.session.timeZone` in Spark 3.0.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #27435 from MaxGekk/timestamp-lit-migration-guide.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-04 16:33:34 +09:00
maryannxue 6097b343ba [SPARK-30717][SQL] AQE subquery map should cache SubqueryExec instead of ExecSubqueryExpression
### What changes were proposed in this pull request?
This PR is to fix a potential bug in AQE where an `ExecSubqueryExpression` could be mistakenly replaced with another `ExecSubqueryExpression` with the same `ListQuery` but a different `child` expression.
This is because a ListQuery's id can only identify the ListQuery itself, not the parent expression `InSubquery`, but right now the `subqueryMap` in `InsertAdaptiveSparkPlan` uses the `ListQuery`'s id as key and the corresponding `InSubqueryExec` for the `ListQuery`'s parent expression as value. So the fix uses the corresponding `SubqueryExec` for the `ListQuery` itself as the map's value.

### Why are the changes needed?
This logical bug could potentially cause a wrong query plan, which could throw an exception related to unresolved columns.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Passed existing UTs.

Closes #27446 from maryannxue/spark-30717.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-04 12:31:44 +08:00
Dongjoon Hyun 41bdb7ad39 [SPARK-30718][BUILD] Exclude jdk.tools dependency from hadoop-yarn-api
### What changes were proposed in this pull request?

This PR removes the `jdk.tools:jdk.tools` transitive dependency from `hadoop-yarn-api`.
- This is only used in the `hadoop-annotation` project, in some `*Doclet.java` files.

### Why are the changes needed?

Although this is not used in Apache Spark, it can cause a resolution failure in a JDK11 environment.

<img width="530" alt="jdk tools" src="https://user-images.githubusercontent.com/9700541/73697745-2f3f4080-4694-11ea-95a7-228638e31cf7.png">

### Does this PR introduce any user-facing change?

No. This is a dev-only change.
For developers, this will remove the `Cannot resolve` error in the IDE environment.

### How was this patch tested?

- Pass the Jenkins in JDK8
- Manually, import the project with JDK11.

Closes #27445 from dongjoon-hyun/SPARK-30718.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-03 19:57:16 -08:00
fuwhu 47659a0675 [SPARK-30525][SQL] HiveTableScanExec do not need to prune partitions again after pushing down to SessionCatalog for partition pruning
### What changes were proposed in this pull request?
HiveTableScanExec no longer prunes partitions again after `SessionCatalog.listPartitionsByFilter` is called.

### Why are the changes needed?
In HiveTableScanExec, partition pruning is pushed down to the Hive metastore if `spark.sql.hive.metastorePartitionPruning` is true, and the returned partitions are then pruned again using the partition filters, because some predicates, e.g. "b like 'xyz'", are not supported by the Hive metastore. But this problem is already fixed in `HiveExternalCatalog.listPartitionsByFilter`, which can now return exactly what we want, so it is no longer necessary to prune twice in `HiveTableScanExec`.

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
Existing unit tests.

Closes #27232 from fuwhu/SPARK-30525.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-04 01:24:53 +08:00
Yuanjian Li a4912cee61
[SPARK-29543][SS][FOLLOWUP] Move spark.sql.streaming.ui.* configs to StaticSQLConf
### What changes were proposed in this pull request?
Put the configs below, which are needed by the Structured Streaming UI, into StaticSQLConf (a usage sketch follows the list):

- spark.sql.streaming.ui.enabled
- spark.sql.streaming.ui.retainedProgressUpdates
- spark.sql.streaming.ui.retainedQueries
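
A usage sketch: since these are static configs, they are set when the session is created (values here are assumed examples):
```scala
import org.apache.spark.sql.SparkSession

// Hypothetical: static SQL configs cannot be changed at runtime via spark.conf.set.
val spark = SparkSession.builder()
  .config("spark.sql.streaming.ui.enabled", "true")
  .config("spark.sql.streaming.ui.retainedQueries", "50")
  .config("spark.sql.streaming.ui.retainedProgressUpdates", "100")
  .getOrCreate()
```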

### Why are the changes needed?
Make all SS UI configs consistent with other similar configs in usage and naming.

### Does this PR introduce any user-facing change?
Yes, add new static config `spark.sql.streaming.ui.retainedProgressUpdates`.

### How was this patch tested?
Existing UT.

Closes #27425 from xuanyuanking/SPARK-29543-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2020-02-02 23:37:13 -08:00
Burak Yavuz 2eccfd8a73 [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView
### What changes were proposed in this pull request?

Adds NoSuchDatabaseException and NoSuchNamespaceException to the `isView` method for SessionCatalog.

### Why are the changes needed?

This method prevents specialized resolutions from kicking in within Analysis when using V2 Catalogs if the identifier is a specialized identifier.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added test to DataSourceV2SessionCatalogSuite

Closes #27423 from brkyvz/isViewF.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-03 14:08:59 +08:00
zero323 fb321b666d [MINOR][SPARKR][DOCS] Remove duplicate @name tags from read.df and read.stream
### What changes were proposed in this pull request?

Remove duplicate `name` tags from `read.df` and `read.stream`.

### Why are the changes needed?

These tags are already present in

1adf3520e3/R/pkg/R/SQLContext.R (L546)

and

1adf3520e3/R/pkg/R/SQLContext.R (L678)

for `read.df` and `read.stream` respectively.

As only one `name` tag per block is allowed, this causes build warnings with recent `roxygen2` versions:

```
Warning: [/path/to/spark/R/pkg/R/SQLContext.R:559] name May only use one name per block
Warning: [/path/to/spark/R/pkg/R/SQLContext.R:690] name May only use one name per block
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #27437 from zero323/roxygen-warnings-names.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-03 11:09:25 +09:00
Dongjoon Hyun 1adf3520e3 [SPARK-30704][INFRA] Use jekyll-redirect-from 0.15.0 instead of the latest
### What changes were proposed in this pull request?

This PR aims to pin the version of `jekyll-redirect-from` to 0.15.0. This is a release blocker for both Apache Spark 3.0.0 and 2.4.5.

### Why are the changes needed?

`jekyll-redirect-from` released 0.16.0 a few days ago, and it requires Ruby >= 2.4.0.
- https://github.com/jekyll/jekyll-redirect-from/releases/tag/v0.16.0
```
$ cd dev/create-release/spark-rm/
$ docker build -t spark:test .
...
ERROR:  Error installing jekyll-redirect-from:
	jekyll-redirect-from requires Ruby version >= 2.4.0.
...
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually do the above command to build `spark-rm` Docker image.
```
...
Successfully installed jekyll-redirect-from-0.15.0
Parsing documentation for jekyll-redirect-from-0.15.0
Installing ri documentation for jekyll-redirect-from-0.15.0
Done installing documentation for jekyll-redirect-from after 0 seconds
1 gem installed
Successfully installed rouge-3.15.0
Parsing documentation for rouge-3.15.0
Installing ri documentation for rouge-3.15.0
Done installing documentation for rouge after 4 seconds
1 gem installed
Removing intermediate container e0ec7c77b69f
 ---> 32dec37291c6
```

Closes #27434 from dongjoon-hyun/SPARK-30704.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-02 00:44:25 -08:00
Yuming Wang cd5f03a3ba [SPARK-27686][DOC][SQL] Update migration guide for make Hive 2.3 dependency by default
### What changes were proposed in this pull request?

We have upgraded the built-in Hive from 1.2 to 2.3, so you may need to set `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars` according to the version of your Hive metastore. Example:
```
--conf spark.sql.hive.metastore.version=1.2.1 --conf spark.sql.hive.metastore.jars=/root/hive-1.2.1-lib/*
```
Otherwise:
```
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table spark_27686. Invalid method name: 'get_table_req';
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110)
  at org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:841)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.tableExists(ExternalCatalogWithListener.scala:146)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:431)
  at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:52)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:226)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3487)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3485)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:226)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607)
  ... 47 elided
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table spark_27686. Invalid method name: 'get_table_req'
  at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1282)
  at org.apache.spark.sql.hive.client.HiveClientImpl.getRawTableOption(HiveClientImpl.scala:422)
  at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$tableExists$1(HiveClientImpl.scala:436)
  at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
  at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:322)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:305)
  at org.apache.spark.sql.hive.client.HiveClientImpl.tableExists(HiveClientImpl.scala:436)
  at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$tableExists$1(HiveExternalCatalog.scala:841)
  at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:100)
  ... 63 more
Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_table_req'
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
  at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table_req(ThriftHiveMetastore.java:1567)
  at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table_req(ThriftHiveMetastore.java:1554)
  at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1350)
  at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:127)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
  at com.sun.proxy.$Proxy38.getTable(Unknown Source)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2336)
  at com.sun.proxy.$Proxy38.getTable(Unknown Source)
  at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1274)
  ... 74 more
```

### Why are the changes needed?

Improve documentation.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?
```SKIP_API=1 jekyll build```:
![image](https://user-images.githubusercontent.com/5399861/73531432-67a50b80-4455-11ea-9401-5cad12fd3d14.png)

Closes #27161 from wangyum/SPARK-27686.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-01 20:50:47 -08:00
zhengruifeng da32d1e6b5 [SPARK-30700][ML] NaiveBayesModel predict optimization
### What changes were proposed in this pull request?
The var `negThetaSum` is always used together with `pi`, so we can add them together up front.

### Why are the changes needed?
We only need to add one var, `piMinusThetaSum`, instead of both `pi` and `negThetaSum`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #27427 from zhengruifeng/nb_predict.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-02-01 15:19:16 +08:00
Dongjoon Hyun 534f5d409a [SPARK-29138][PYTHON][TEST] Increase timeout of StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy
### What changes were proposed in this pull request?

This PR aims to increase the timeout of `StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy` from 30s (default) to 60s.

In this PR, before increasing the timeout,
1. I verified that this is not a JDK11 environmental issue by repeating 3 times first.
2. I reproduced the accuracy failure by reducing the timeout in Jenkins (https://github.com/apache/spark/pull/27424#issuecomment-580981262)

Then, the final commit passed the Jenkins.

### Why are the changes needed?

This seems to happen when the Jenkins environment is congested and the jobs slow down. The streaming job seems unable to complete the designed `numIteration=25` iterations within 30 seconds. Since the error decreases at each iteration, the failure occurs.

By reducing the timeout, we can reproduce a similar issue locally, like on Jenkins.
```python
- eventually(condition, catch_assertions=True)
+ eventually(condition, timeout=10.0, catch_assertions=True)
```

```
$ python/run-tests --testname 'pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy' --python-executables=python
...
======================================================================
FAIL: test_parameter_accuracy (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dongjoon/PRS/SPARK-TEST/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 229, in test_parameter_accuracy
    eventually(condition, timeout=10.0, catch_assertions=True)
  File "/Users/dongjoon/PRS/SPARK-TEST/python/pyspark/testing/utils.py", line 86, in eventually
    raise lastValue
Reproduce the error
  File "/Users/dongjoon/PRS/SPARK-TEST/python/pyspark/testing/utils.py", line 77, in eventually
    lastValue = condition()
  File "/Users/dongjoon/PRS/SPARK-TEST/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 226, in condition
    self.assertAlmostEqual(rel, 0.1, 1)
AssertionError: 0.25749106949322637 != 0.1 within 1 places (0.15749106949322636 difference)

----------------------------------------------------------------------
Ran 1 test in 14.814s

FAILED (failures=1)
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins (and manual check by reducing the timeout).

Since this is a flakiness issue depending on the Jenkins job situation, it's difficult to reproduce there.

Closes #27424 from dongjoon-hyun/SPARK-TEST.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-01 15:38:16 +09:00
jiaan geng 35380958b8 [SPARK-30698][BUILD] Bumps checkstyle from 8.25 to 8.29
### What changes were proposed in this pull request?
I found that checkstyle has a new release: https://checkstyle.org/releasenotes.html#Release_8.29
This bumps checkstyle from 8.25 to 8.29.

### Why are the changes needed?
I have bumped checkstyle from 8.25 to 8.29 on my fork branch and tested the build.
It's OK.

8.29 added some new features:

- New Check: AvoidNoArgumentSuperConstructorCall.

- New Check NoEnumTrailingComma.

- ENUM_DEF token support in RightCurlyCheck.

- FallThrough module does not support the spelling "fall-through" by default.

8.29 fixes some bugs:

- Java 8 Grammar: annotations on varargs parameters.

- Sonar violation: Disable XML external entity (XXE) processing.

- Disable instantiation of modules with private ctor.

- Sonar violation: "ThreadLocal" variables should be cleaned up when no longer used.

- Indentation incorrect level for chained method with bracket on new line.

- InvalidJavadocPosition: false positive when comment is between javadoc and package.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
No UT

Closes #27426 from beliefer/bump-checkstyle.

Authored-by: jiaan geng <jiaan.geng@jiaandeMacBook-Air.local>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-31 21:14:11 -08:00
Thomas Graves 878094f972 [SPARK-30689][CORE][YARN] Add resource discovery plugin api to support YARN versions with resource scheduling
### What changes were proposed in this pull request?

This change is to allow custom resource scheduler (GPUs,FPGAs,etc) resource discovery to be more flexible. Users are asking for it to work with hadoop 2.x versions that do not support resource scheduling in YARN and/or also they may not run in an isolated environment.
This change creates a plugin API so that users can write their own resource discovery class, which allows a lot more flexibility. The user can chain plugins for different resource types. The user-specified plugins execute in the specified order and will fall back to the discovery script plugin if they don't return information for a particular resource.

I had to open up a few of the classes to be public, change them to not be case classes, and make them developer API in order for the plugin to get the information it needs.

I also relaxed the YARN side so that if YARN isn't configured for resource scheduling we just warn and go on. This helps users that have YARN 3.1 but haven't configured the resource scheduling side on their cluster yet, or aren't running in an isolated environment.

The user would configure this like:
--conf spark.resources.discovery.plugin="org.apache.spark.resource.ResourceDiscoveryFPGAPlugin, org.apache.spark.resource.ResourceDiscoveryGPUPlugin"

Note the executor side had to be wrapped with a classloader to make sure we include the user classpath for jars they specified on submission.

Note this is more flexible because the discovery script has limitations, such as being spawned in a separate process: if you are trying to allocate resources in that process, they might be released when the script returns. A class also makes it easier to integrate with existing systems and solutions for assigning resources.

### Why are the changes needed?

To more easily use Spark resource scheduling with older versions of Hadoop or in non-isolated environments.

### Does this PR introduce any user-facing change?

Yes, a plugin API.

### How was this patch tested?

Unit tests added and manual testing done on yarn and standalone modes.

Closes #27410 from tgravescs/hadoop27spark3.

Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-01-31 22:20:28 -06:00
Liang-Chi Hsieh 8eecc20b11 [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table"
## What changes were proposed in this pull request?

This patch adds a DDL command `SHOW CREATE TABLE AS SERDE`. It is used to generate Hive DDL for a Hive table.

The original `SHOW CREATE TABLE` now always shows Spark DDL. If given a Hive table, it tries to generate Spark DDL.
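
For example (hypothetical Hive table name):
```scala
// SHOW CREATE TABLE now always emits Spark DDL (converting the Hive table where possible);
// the new AS SERDE variant emits the original Hive DDL.
spark.sql("SHOW CREATE TABLE hive_tbl").show(truncate = false)
spark.sql("SHOW CREATE TABLE hive_tbl AS SERDE").show(truncate = false)
```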

For Hive serde to data source conversion, this uses the existing mapping inside `HiveSerDe`. If it can't find a mapping there, it throws an analysis exception for the unsupported serde configuration.

Arguably, some Hive fileformat + row serde combinations might be mapped to a Spark data source, e.g. CSV. That is not included in this PR; to be conservative, it may not be supported.

For Hive serde properties, this doesn't save them to Spark DDL for now, because it may not be useful to keep Hive serde properties in a Spark table.

## How was this patch tested?

Added test.

Closes #24938 from viirya/SPARK-27946.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-01-31 19:55:25 -08:00
zhengruifeng d0c3e9f1f7 [SPARK-30660][ML][PYSPARK] LinearRegression blockify input vectors
### What changes were proposed in this pull request?
1. Use blocks instead of vectors for performance improvement.
2. Use level-2 BLAS.
3. Move standardization of input vectors outside of gradient computation.

### Why are the changes needed?
1. Less RAM to persist training data (saves ~40%).
2. Faster than the existing impl (30% ~ 102%).

### Does this PR introduce any user-facing change?
add a new expert param `blockSize`

### How was this patch tested?
updated testsuites

Closes #27396 from zhengruifeng/blockify_lireg.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-31 21:04:26 -06:00
Dongjoon Hyun 2fd15a26fb [SPARK-30695][BUILD] Upgrade Apache ORC to 1.5.9
### What changes were proposed in this pull request?

This PR aims to upgrade to Apache ORC 1.5.9.
- For `hive-2.3` profile, we need to upgrade `hive-storage-api` from `2.6.0` to `2.7.1`.
- For `hive-1.2` profile, ORC library with classifier `nohive` already shaded it. So, there is no change.

### Why are the changes needed?

This will bring the latest bug fixes. The following is the full release note.
- https://issues.apache.org/jira/projects/ORC/versions/12346546

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Here is the summary.
1. `Hive 1.2 + Hadoop 2.7` passed. ([here](https://github.com/apache/spark/pull/27421#issuecomment-580924552))
2. `Hive 2.3 + Hadoop 2.7` passed. ([here](https://github.com/apache/spark/pull/27421#issuecomment-580973391))

Closes #27421 from dongjoon-hyun/SPARK-ORC-1.5.9.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-31 17:41:27 -08:00
yi.wu 82b4f753a0 [SPARK-30508][SQL] Add SparkSession.executeCommand API for external datasource
### What changes were proposed in this pull request?

This PR adds `SparkSession.executeCommand` API for external datasource to execute a random command like

```
val df = spark.executeCommand("xxxCommand", "xxxSource", "xxxOptions")
```
Note that the command doesn't execute in Spark but inside an external execution engine, depending on the data source. It is eagerly executed once `executeCommand` is called, and the returned `DataFrame` will contain the output of the command (if any).

### Why are the changes needed?

This can be useful when a user wants to execute commands outside of Spark, for example executing a custom DDL/DML command for JDBC, creating an index for Elasticsearch, creating cores for Solr, and so on (as HyukjinKwon suggested).

Previously, a user needed to use an option to achieve this, e.g. `spark.read.format("xxxSource").option("command", "xxxCommand").load()`, which is kind of cumbersome. With this change, it is more convenient for users to achieve the same goal.

### Does this PR introduce any user-facing change?

Yes, new API from `SparkSession` and a new interface `ExternalCommandRunnableProvider`.

### How was this patch tested?

Added a new test suite.

Closes #27199 from Ngone51/dev-executeCommand.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-01-31 15:05:26 -08:00
Maxim Gekk 2d4b5eaee4 [SPARK-30676][CORE][TESTS] Eliminate warnings from deprecated constructors of java.lang.Integer and java.lang.Double
### What changes were proposed in this pull request?
- Replace `new Integer(0)` by a serializable instance in RDD.scala
- Use `.valueOf()` instead of the constructors of `java.lang.Integer` and `java.lang.Double` because the constructors have been deprecated; see https://docs.oracle.com/javase/9/docs/api/java/lang/Integer.html (a quick sketch follows)
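
A quick sketch of the replacement pattern:
```scala
// Instead of the deprecated boxing constructors...
//   val i: java.lang.Integer = new java.lang.Integer(0)
//   val d: java.lang.Double  = new java.lang.Double(0.0)
// ...use the valueOf factory methods (or plain autoboxing):
val i: java.lang.Integer = java.lang.Integer.valueOf(0)
val d: java.lang.Double  = java.lang.Double.valueOf(0.0)
```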

### Why are the changes needed?
This fixes the following warnings:
1. RDD.scala:240: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information.
2. MutableProjectionSuite.scala:63: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information.
3. UDFSuite.scala:446: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information.
4. UDFSuite.scala:451: constructor Double in class Double is deprecated: see corresponding Javadoc for more information.
5. HiveUserDefinedTypeSuite.scala:71: constructor Double in class Double is deprecated: see corresponding Javadoc for more information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By RDDSuite, MutableProjectionSuite, UDFSuite and HiveUserDefinedTypeSuite

Closes #27399 from MaxGekk/eliminate-warning-part4.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-31 15:03:16 -06:00