Commit graph

23697 commits

Author SHA1 Message Date
Bryan Cutler 16990f9299 [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0
## What changes were proposed in this pull request?

Upgrade Apache Arrow to version 0.12.0. This includes the Java artifacts and fixes to enable usage with pyarrow 0.12.0

Version 0.12.0 includes the following selected fixes/improvements relevant to Spark users:

* Safe cast fails from numpy float64 array with nans to integer, ARROW-4258
* Java, Reduce heap usage for variable width vectors, ARROW-4147
* Binary identity cast not implemented, ARROW-4101
* pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
* conversion to date object no longer needed, ARROW-3910
* Error reading IPC file with no record batches, ARROW-3894
* Signed to unsigned integer cast yields incorrect results when type sizes are the same, ARROW-3790
* from_pandas gives incorrect results when converting floating point to bool, ARROW-3428
* Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / libboost issue), ARROW-3048
* Java update to official Flatbuffers version 1.9.0, ARROW-3175

complete list [here](https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0)

PySpark requires the following fixes to work with PyArrow 0.12.0

* Encrypted pyspark worker fails due to ChunkedStream missing closed property
* pyarrow now converts dates as objects by default, which causes error because type is assumed datetime64
* ArrowTests fails due to difference in raised error message
* pyarrow.open_stream deprecated
* tests fail because groupby adds index column with duplicate name

## How was this patch tested?

Ran unit tests with pyarrow versions 0.8.0, 0.10.0, 0.11.1, 0.12.0

Closes #23657 from BryanCutler/arrow-upgrade-012.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-29 14:18:45 +08:00
Takeshi Yamamuro 92706e6576
[SPARK-26747][SQL] Makes GetMapValue nullability more precise
## What changes were proposed in this pull request?
In master, `GetMapValue` nullable is always true;
cf133e6110/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala (L371)

But, If input key is foldable, we could make its nullability more precise.
This fix is the same with SPARK-26637(#23566).

## How was this patch tested?
Added tests in `ComplexTypeSuite`.

Closes #23669 from maropu/SPARK-26747.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-28 13:39:50 -08:00
Marcelo Vanzin 2a67dbfbd3 [SPARK-26595][CORE] Allow credential renewal based on kerberos ticket cache.
This change addes a new mode for credential renewal that does not require
a keytab; it uses the local ticket cache instead, so it works while the
user keeps the cache valid.

This can be useful for, e.g., people running long spark-shell sessions where
their kerberos login is kept up-to-date.

The main change to enable this behavior is in HadoopDelegationTokenManager,
with a small change in the HDFS token provider. The other changes are to avoid
creating duplicate tokens when submitting the application to YARN; they allow
the tokens from the scheduler to be sent to the YARN AM, reducing the round trips
to HDFS.

For that, the scheduler initialization code was changed a little bit so that
the tokens are available when the YARN client is initialized. That basically
takes care of a long-standing TODO that was in the code to clean up configuration
propagation to the driver's RPC endpoint (in CoarseGrainedSchedulerBackend).

Tested with an app designed to stress this functionality, with both keytab and
cache-based logins. Some basic kerberos tests on k8s also.

Closes #23525 from vanzin/SPARK-26595.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-01-28 13:32:34 -08:00
Sean Owen 8baf3ba35b [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary
## What changes were proposed in this pull request?

The warning introduced in https://github.com/apache/spark/pull/23580 has a bug: https://github.com/apache/spark/pull/23580#issuecomment-458000380 This just fixes the logic.

## How was this patch tested?

N/A

Closes #23668 from srowen/SPARK-26660.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-28 13:47:32 -06:00
s71955 dfed439e33 [SPARK-26432][CORE] Obtain HBase delegation token operation compatible with HBase 2.x.x version API
## What changes were proposed in this pull request?

While obtaining token from hbase service , spark uses deprecated API of hbase ,
```public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf)```
This deprecated API is already been removed from hbase 2.x version as part of the hbase 2.x major release. https://issues.apache.org/jira/browse/HBASE-14713_
there is one more stable API in
```public static Token<AuthenticationTokenIdentifier> obtainToken(Connection conn)``` in TokenUtil class
spark shall use this stable api for getting the delegation token.

To invoke this api first connection object has to be retrieved from ConnectionFactory and the same connection can be passed to obtainToken(Connection conn) for getting token.
eg: Call   ```public static Connection createConnection(Configuration conf)```
, then call   ```public static Token<AuthenticationTokenIdentifier> obtainToken( Connection conn)```.

## How was this patch tested?
Manual testing is been done.
Manual test result:
Before fix:

![hbase-dep-obtaintok 1](https://user-images.githubusercontent.com/12999161/50699264-64cac200-106d-11e9-81b4-e50ae8097f27.png)

After fix:
1. Create 2 tables in hbase shell
 >Launch hbase shell
 >Enter commands to create tables and load data
    create 'table1','cf'
    put 'table1','row1','cf:cid','20'

    create 'table2','cf'
    put 'table2','row1','cf:cid','30'

 >Show values command
   get 'table1','row1','cf:cid'  will diplay value as 20
   get 'table2','row1','cf:cid'  will diplay value as 30

2.Run SparkHbasetoHbase class in testSpark.jar using spark-submit

spark-submit --master yarn-cluster --class com.mrs.example.spark.SparkHbasetoHbase --conf "spark.yarn.security.credentials.hbase.enabled"="true" --conf "spark.security.credentials.hbase.enabled"="true" --keytab /opt/client/user.keytab --principal sen testSpark.jar

The SparkHbasetoHbase test class will update the value of table2 with sum of values of table1 & table2.

table2 = table1+table2
As we can see in the snapshot the spark job has been successfully able to interact with hbase service and able to update the row count.
![obtaintok_success 1](https://user-images.githubusercontent.com/12999161/50699393-bd9a5a80-106d-11e9-96c6-6c250d561efa.png)

Closes #23429 from sujith71955/master_hbase_service.

Authored-by: s71955 <sujithchacko.2010@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-01-28 10:08:23 -08:00
Xianjin YE 1280bfd756 [SPARK-26713][CORE] Interrupt pipe IO threads in PipedRDD when task is finished
## What changes were proposed in this pull request?
Manually release stdin writer and stderr reader thread when task is finished. This commit also marks
ShuffleBlockFetchIterator as fully consumed if isZombie is set.

## How was this patch tested?
Added new test

Closes #23638 from advancedxy/SPARK-26713.

Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-28 10:54:18 -06:00
Maxim Gekk 58e42cf506 [SPARK-26719][SQL] Get rid of java.util.Calendar in DateTimeUtils
## What changes were proposed in this pull request?

- Replacing `java.util.Calendar` in  `DateTimeUtils. truncTimestamp` and in `DateTimeUtils.getOffsetFromLocalMillis ` by equivalent code using Java 8 API for timestamp manipulations. The reason is `java.util.Calendar` is based on the hybrid calendar (Julian+Gregorian) but *java.time* classes use Proleptic Gregorian calendar which assumes by SQL standard.
-  Replacing `Calendar.getInstance()` in `DateTimeUtilsSuite` by similar code in `DateTimeTestUtils` using *java.time* classes

## How was this patch tested?

The changes were tested by existing suites: `DateExpressionsSuite`, `DateFunctionsSuite` and `DateTimeUtilsSuite`.

Closes #23641 from MaxGekk/cleanup-date-time-utils.

Authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-28 10:52:17 -06:00
Wenchen Fan ed71a825c5 [SPARK-26700][CORE] enable fetch-big-block-to-disk by default
## What changes were proposed in this pull request?

This is a followup of #16989

The fetch-big-block-to-disk feature is disabled by default, because it's not compatible with external shuffle service prior to Spark 2.2. The client sends stream request to fetch block chunks, and old shuffle service can't support it.

After 2 years, Spark 2.2 has EOL, and now it's safe to turn on this feature by default

## How was this patch tested?

existing tests

Closes #23625 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-01-28 23:41:55 +08:00
Maxim Gekk bd027f6e0e [SPARK-26656][SQL] Benchmarks for date and timestamp functions
## What changes were proposed in this pull request?

Added the following benchmarks:
- Extract components from timestamp like year, month, day and etc.
- Current date and time
- Date arithmetic like date_add, date_sub
- Format dates and timestamps
- Convert timestamps from/to UTC

Closes #23661 from MaxGekk/datetime-benchmark.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
2019-01-28 14:21:21 +01:00
Sean Owen d53e11ffce [SPARK-26725][TEST] Fix the input values of UnifiedMemoryManager constructor in test suites
## What changes were proposed in this pull request?

Adjust mem settings in UnifiedMemoryManager used in test suites to ha…ve execution memory > 0
Ref: https://github.com/apache/spark/pull/23457#issuecomment-457409976

## How was this patch tested?

Existing tests

Closes #23645 from srowen/SPARK-26725.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-01-28 12:42:14 +08:00
Hyukjin Kwon 3a17c6a06b [SPARK-26743][PYTHON] Adds a test to check the actual resource limit set via 'spark.executor.pyspark.memory'
## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/21977 added a feature to limit Python worker resource limit.
This PR is kind of a followup of it. It proposes to add a test that checks the actual resource limit set by 'spark.executor.pyspark.memory'.

## How was this patch tested?

Unit tests were added.

Closes #23663 from HyukjinKwon/test_rlimit.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-28 10:02:27 +08:00
maryannxue ce7e7df99d [SPARK-26708][SQL] Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan
## What changes were proposed in this pull request?

When performing non-cascading cache invalidation, `recache` is called on the other cache entries which are dependent on the cache being invalidated. It leads to the the physical plans of those cache entries being re-compiled. For those cache entries, if the cache RDD has already been persisted, chances are there will be inconsistency between the data and the new plan. It can cause a correctness issue if the new plan's `outputPartitioning`  or `outputOrdering` is different from the that of the actual data, and meanwhile the cache is used by another query that asks for specific `outputPartitioning` or `outputOrdering` which happens to match the new plan but not the actual data.

The fix is to keep the cache entry as it is if the data has been loaded, otherwise re-build the cache entry, with a new plan and an empty cache buffer.

## How was this patch tested?

Added UT.

Closes #23644 from maryannxue/spark-26708.

Lead-authored-by: maryannxue <maryannxue@apache.org>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-01-27 11:39:27 -08:00
Gengliang Wang 36a2e6371b
[SPARK-26716][SQL] FileFormat: the supported types of read/write should be consistent
## What changes were proposed in this pull request?

1. Remove parameter `isReadPath`. The supported types of read/write should be the same.

2. Disallow reading `NullType` for ORC data source. In #21667 and #21389, it was supposed that ORC supports reading `NullType`, but can't write it. This doesn't make sense. I read docs and did some tests. ORC doesn't support `NullType`.

## How was this patch tested?

Unit tset

Closes #23639 from gengliangwang/supportDataType.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-27 10:11:42 -08:00
Dongjoon Hyun 1ca6b8bc3d
[SPARK-26379][SS][FOLLOWUP] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
## What changes were proposed in this pull request?

Spark replaces `CurrentTimestamp` with `CurrentBatchTimestamp`.
However, `CurrentBatchTimestamp` is `TimeZoneAwareExpression` while `CurrentTimestamp` isn't.
Without TimeZoneId, `CurrentBatchTimestamp` becomes unresolved and raises `UnresolvedException`.

Since `CurrentDate` is `TimeZoneAwareExpression`, there is no problem with `CurrentDate`.

This PR reverts the [previous patch](https://github.com/apache/spark/pull/23609) on `MicroBatchExecution` and fixes the root cause.

## How was this patch tested?

Pass the Jenkins with the updated test cases.

Closes #23660 from dongjoon-hyun/SPARK-26379.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-27 10:04:51 -08:00
Kris Mok 860336d31e [SPARK-26735][SQL] Verify plan integrity for special expressions
## What changes were proposed in this pull request?

Add verification of plan integrity with regards to special expressions being hosted only in supported operators. Specifically:

- `AggregateExpression`: should only be hosted in `Aggregate`, or indirectly in `Window`
- `WindowExpression`: should only be hosted in `Window`
- `Generator`: should only be hosted in `Generate`

This will help us catch errors in future optimizer rules that incorrectly hoist special expression out of their supported operator.

TODO: This PR actually caught a bug in the analyzer in the test case `SPARK-23957 Remove redundant sort from subquery plan(scalar subquery)` in `SubquerySuite`, where a `max()` aggregate function is hosted in a `Sort` operator in the analyzed plan, which is invalid. That test case is disabled in this PR.
SPARK-26741 has been opened to track the fix in the analyzer.

## How was this patch tested?

Added new test case in `OptimizerStructuralIntegrityCheckerSuite`

Closes #23658 from rednaxelafx/plan-integrity.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-01-26 22:26:10 -08:00
hyukjinkwon e8982ca7ad [SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame
## What changes were proposed in this pull request?

This PR targets to support Arrow optimization for conversion from R DataFrame to Spark DataFrame.
Like PySpark side, it falls back to non-optimization code path when it's unable to use Arrow optimization.

This can be tested as below:

```bash
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
collect(createDataFrame(mtcars))
```

### Requirements
  - R 3.5.x
  - Arrow package 0.12+
    ```bash
    Rscript -e 'remotes::install_github("apache/arrowapache-arrow-0.12.0", subdir = "r")'
    ```

**Note:** currently, Arrow R package is not in CRAN. Please take a look at ARROW-3204.
**Note:** currently, Arrow R package seems not supporting Windows. Please take a look at ARROW-3204.

### Benchmarks

**Shall**

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false
```

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

**R code**

```r
createDataFrame(mtcars) # Initializes
rdf <- read.csv("500000.csv")

test <- function() {
  options(digits.secs = 6) # milliseconds
  start.time <- Sys.time()
  createDataFrame(rdf)
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()
```

**Data (350 MB):**

```r
object.size(read.csv("500000.csv"))
350379504 bytes
```

"500000 Records"  http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

**Results**

```
Time difference of 29.9468 secs
```

```
Time difference of 3.222129 secs
```

The performance improvement was around **950%**.
Actually, this PR improves around **1200%**+ because this PR includes a small optimization about regular R DataFrame -> Spark DatFrame. See https://github.com/apache/spark/pull/22954#discussion_r231847272

### Limitations:

For now, Arrow optimization with R does not support when the data is `raw`, and when user explicitly gives float type in the schema. They produce corrupt values.
In this case, we decide to fall back to non-optimization code path.

## How was this patch tested?

Small test was added.

I manually forced to set this optimization `true` for _all_ R tests and they were _all_ passed (with few of fallback warnings).

**TODOs:**
- [x] Draft codes
- [x] make the tests passed
- [x] make the CRAN check pass
- [x] Performance measurement
- [x] Supportability investigation (for instance types)
- [x] Wait for Arrow 0.12.0 release
- [x] Fix and match it to Arrow 0.12.0

Closes #22954 from HyukjinKwon/r-arrow-createdataframe.

Lead-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-27 10:45:49 +08:00
heguozi e71acd9a23 [SPARK-26630][SQL] Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce
## What changes were proposed in this pull request?

When we read a hive table and create RDDs in `TableReader`, it'll throw exception `java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to org.apache.hadoop.mapred.InputFormat` if the input format class of the table is from mapreduce package.

Now we use NewHadoopRDD to deal with the new input format and keep HadoopRDD to the old one.

This PR is from #23506. We can reproduce this issue by executing the new test with the code in old version. When create a table with `org.apache.hadoop.mapreduce.....` input format, we will find the exception thrown in `org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)`

## How was this patch tested?

Added a new test.

Closes #23559 from Deegue/fix-hadoopRDD.

Lead-authored-by: heguozi <zyzzxycj@gmail.com>
Co-authored-by: Yizhong Zhang <zyzzxycj@163.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-01-26 10:17:03 -08:00
SongYadong aa3d16d68b [SPARK-26698][CORE] Use ConfigEntry for hardcoded configs for memory and storage categories
## What changes were proposed in this pull request?

This PR makes hardcoded configs about spark memory and storage to use `ConfigEntry` and put them in the config package.

## How was this patch tested?

Existing unit tests.

Closes #23623 from SongYadong/configEntry_for_mem_storage.

Authored-by: SongYadong <song.yadong1@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-25 22:28:12 -06:00
Bruce Robbins f17a3d9c3a
[SPARK-26711][SQL] Lazily convert string values to BigDecimal during JSON schema inference
## What changes were proposed in this pull request?

This PR fixes a bug where JSON schema inference attempts to convert every String value to a BigDecimal regardless of the setting of "prefersDecimal". With that bug, behavior is still correct, but performance is impacted.

This PR makes this conversion lazy, so it is only performed if prefersDecimal is set to true.

Using Spark with a single executor thread to infer the schema of a single-column, 100M row JSON file, the performance impact is as follows:

option | baseline | pr
-----|----|-----
inferTimestamp=_default_<br>prefersDecimal=_default_ | 12.5 minutes | 6.1 minutes |
inferTimestamp=false<br>prefersDecimal=_default_ | 6.5 minutes | 49 seconds |
inferTimestamp=false<br>prefersDecimal=true | 6.5 minutes | 6.5 minutes |

## How was this patch tested?

I ran JsonInferSchemaSuite and JsonSuite. Also, I ran manual tests to see performance impact (see above).

Closes #23653 from bersprockets/SPARK-26711_improved.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-25 16:14:38 -08:00
Jungtaek Lim (HeartSaVioR) a4e48359ac
[SPARK-26379][SS] Fix issue on adding current_timestamp/current_date to streaming query
## What changes were proposed in this pull request?

This patch proposes to fix issue on adding `current_timestamp` / `current_date` with streaming query.

The root reason is that Spark transforms `CurrentTimestamp`/`CurrentDate` to `CurrentBatchTimestamp` in MicroBatchExecution which makes transformed attributes not-yet-resolved. They will be resolved by IncrementalExecution.
(In ContinuousExecution, Spark doesn't allow using `current_timestamp` and `current_date` so it has been OK.)

It's OK for DataSource V1 sink because it simply leverages transformed logical plan and don't evaluate until they're resolved, but for DataSource V2 sink, Spark tries to extract the schema of transformed logical plan in prior to IncrementalExecution, and unresolved attributes will raise errors.

This patch fixes the issue via having separate pre-resolved logical plan to pass the schema to StreamingWriteSupport safely.

## How was this patch tested?

Added UT.

Closes #23609 from HeartSaVioR/SPARK-26379.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-25 14:58:03 -08:00
Jungtaek Lim (HeartSaVioR) 5f3658a8d8 [SPARK-26170][SS] Add missing metrics in FlatMapGroupsWithState
## What changes were proposed in this pull request?

This patch addresses measuring possible metrics in StateStoreWriter to FlatMapGroupsWithStateExec. Please note that some metrics like time to remove elements are not addressed because they are coupled with state function.

## How was this patch tested?

Manually tested with https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala.

Snapshots below:

![screen shot 2018-11-26 at 4 13 40 pm](https://user-images.githubusercontent.com/1317309/48999346-b5f7b400-f199-11e8-89c7-8795f13470d6.png)
![screen shot 2018-11-26 at 4 13 54 pm](https://user-images.githubusercontent.com/1317309/48999347-b5f7b400-f199-11e8-91ef-ef0b2f816b2e.png)

Closes #23142 from HeartSaVioR/SPARK-26170.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Jose Torres <torres.joseph.f+github@gmail.com>
2019-01-25 13:37:42 -08:00
Devaraj K f06bc0cd1d [SPARK-22404][YARN] Provide an option to use unmanaged AM in yarn-client mode
## What changes were proposed in this pull request?

Providing a new configuration "spark.yarn.un-managed-am" (defaults to false) to enable the Unmanaged AM Application in Yarn Client mode which launches the Application Master service as part of the Client. It utilizes the existing code for communicating between the Application Master <-> Task Scheduler for the container requests/allocations/launch, and eliminates these,
1. Allocating and launching the Application Master container
2. Remote Node/Process communication between Application Master <-> Task Scheduler

## How was this patch tested?

I verified manually running the applications in yarn-client mode with "spark.yarn.un-managed-am" enabled, and also ensured that there is no impact to the existing execution flows.

I would like to hear others feedback/thoughts on this.

Closes #19616 from devaraj-kavali/SPARK-22404.

Authored-by: Devaraj K <devaraj@apache.org>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-01-25 11:52:45 -08:00
Gabor Somogyi 773efede20 [SPARK-26254][CORE] Extract Hive + Kafka dependencies from Core.
## What changes were proposed in this pull request?

There are ugly provided dependencies inside core for the following:
* Hive
* Kafka

In this PR I've extracted them out. This PR contains the following:
* Token providers are now loaded with service loader
* Hive token provider moved to hive project
* Kafka token provider extracted into a new project

## How was this patch tested?

Existing + newly added unit tests.
Additionally tested on cluster.

Closes #23499 from gaborgsomogyi/SPARK-26254.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-01-25 10:36:00 -08:00
ankurgupta b484490824 [SPARK-26694][CORE] Progress bar should be enabled by default for spark-shell
## What changes were proposed in this pull request?
SPARK-21568 made a change to ensure that progress bar is enabled for spark-shell
by default but not for other apps. Before that change, this was distinguished
using log-level which is not a good way to determine the same as users can change
the default log-level. That commit changed the way to determine whether current
app is running in spark-shell or not but it left the log-level part as it is,
which causes this regression. SPARK-25118 changed the default log level to INFO
for spark-shell because of which the progress bar is not enabled anymore.

This commit will remove the log-level check for enabling progress bar for
spark-shell as it is not necessary and seems to be a leftover from SPARK-21568

## How was this patch tested?
1. Ensured that progress bar is enabled with spark-shell by default
2. Ensured that progress bar is not enabled with spark-submit

Closes #23618 from ankuriitg/ankurgupta/SPARK-26694.

Authored-by: ankurgupta <ankur.gupta@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-01-25 10:21:26 -08:00
Maxim Gekk e3411a82c3 [SPARK-26720][SQL] Remove DateTimeUtils methods based on system default time zone
## What changes were proposed in this pull request?

In the PR, I propose to remove the following methods from `DateTimeUtils`:
- `timestampAddInterval` and `stringToTimestamp` - used only in test suites
- `truncTimestamp`, `getSeconds`, `getMinutes`, `getHours` - those methods assume system default time zone. They are not used in Spark.

## How was this patch tested?

This was tested by `DateTimeUtilsSuite` and `UnsafeArraySuite`.

Closes #23643 from MaxGekk/unused-date-time-utils.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-25 17:06:22 +08:00
Gabor Somogyi 9452e0508a
[SPARK-26649][SS] Add DSv2 noop sink
## What changes were proposed in this pull request?

Noop data source for batch was added in [#23471](https://github.com/apache/spark/pull/23471).
In this PR I've added the streaming part.

## How was this patch tested?

Additional unit tests.

Closes #23631 from gaborgsomogyi/SPARK-26649.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-24 19:25:38 -08:00
Gengliang Wang f5b9370da2 [SPARK-26709][SQL] OptimizeMetadataOnlyQuery does not handle empty records correctly
## What changes were proposed in this pull request?

When reading from empty tables, the optimization `OptimizeMetadataOnlyQuery` may return wrong results:
```
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
sql("SELECT MAX(p1) FROM t")
```
The result is supposed to be `null`. However, with the optimization the result is `5`.

The rule is originally ported from https://issues.apache.org/jira/browse/HIVE-1003 in #13494. In Hive, the rule is disabled by default in a later release(https://issues.apache.org/jira/browse/HIVE-15397), due to the same problem.

It is hard to completely avoid the correctness issue. Because data sources like Parquet can be metadata-only. Spark can't tell whether it is empty or not without actually reading it. This PR disable the optimization by default.

## How was this patch tested?

Unit test

Closes #23635 from gengliangwang/optimizeMetadata.

Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-01-24 18:24:49 -08:00
Hyukjin Kwon d2ff10cbe1 [SPARK-23674][ML] Adds Spark ML Events to Instrumentation
## What changes were proposed in this pull request?

This PR proposes to add ML events to Instrumentation, and use it in Pipeline so that other developers can track and add some actions for them.

## Introduction

ML events (like SQL events) can be quite useful when people want to track and make some actions for corresponding ML operations. For instance, I have been working on integrating
Apache Spark with [Apache Atlas](https://atlas.apache.org/QuickStart.html). With some custom changes with this PR, I can visualise ML pipeline as below:

![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png)

Another good thing that might have to be considered is, that we can interact this with other SQL/Streaming events. For instance, where the input `Dataset` is originated. For instance, with current Apache Spark, I can visualise SQL operations as below:

![screen shot 2018-12-10 at 9 41 36 am](https://user-images.githubusercontent.com/6477701/49706269-d9bdfe00-fc5f-11e8-943a-3309d1856ba5.png)

I think we can combine those existing lineages together to easily understand where the data comes and goes. Currently, ML side is a hole so the lineages can't be connected for the current Apache Spark ..

To add up, I think it's not to mention how useful it is to track the SQL/Streaming operations. Likewise, I would like to propose ML events as well (as lowest stability `Unstable` APIs for now - no guarantee about stability).

## Implementation Details

### Sends event (but not expose ML specific listener)

**`mllib/src/main/scala/org/apache/spark/ml/events.scala`**

```scala
Unstable
case class ...StartEvent(caller, input)
Unstable
case class ...EndEvent(caller, output)

trait MLEvents {
  // Wrappers to send events:
  // def with...Event(body) = {
  //   body()
  //   SparkContext.getOrCreate().listenerBus.post(event)
  // }
}
```

This trait is used by `Instrumentation`.

```scala
class Instrumentation ... with MLEvents {
```

and used as below:

```scala
instrumented { instr =>
  instr.with...Event(...) {
    ...
  }
}
```

This way mimics both:

**1. Catalog events (see `org/apache/spark/sql/catalyst/catalog/events.scala`)**

- This allows a Catalog specific listener to be added `ExternalCatalogEventListener`

- It's implemented in a way of wrapping whole `ExternalCatalog` named `ExternalCatalogWithListener`
which delegates the operations to `ExternalCatalog`

This is not quite possible in this case because most of instances (like `Pipeline`) will be directly created in most of cases. We might be able to do that via extending `ListenerBus` for all possible instances but IMHO it's too invasive. Also, exposing another ML specific listener sounds a bit too much at this stage. Therefore, I simply borrowed file name and structures here

**2. SQL execution events (see `org/apache/spark/sql/execution/SQLExecution.scala`)**

- Add an object that wraps a body to send events

Current apporach is rather close to this. It has a `with...` wrapper to send events. I borrowed this approach to be consistent.

## Usage

It needs a custom implementation for a query listener. For instance,

with the custom listener below:

```scala
class CustomMLListener extends SparkListener
  def onOtherEvents(e) = e match {
    case e: MLEvent => // do something
    case _ => // pass
  }
}
```

There are two (existing) ways to use this.

```scala
spark.sparkContext.addSparkListener(new CustomMLListener)
```

```bash
spark-submit ...\
  --conf spark.extraListeners=CustomMLListener\
  ...
```

It's also similar with other existing implementation in SQL side.

## Target users

1. I think someone in general would likely utilise this feature like other event listeners. At least, I can see some interests going on outside.

    - SQL Listener
      - https://stackoverflow.com/questions/46409339/spark-listener-to-an-sql-query
      - http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-Custom-Query-Execution-listener-via-conf-properties-td30979.html

    - Streaming Query Listener
      - https://jhui.github.io/2017/01/15/Apache-Spark-Streaming/
      -  http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-with-Watermark-td25413.html#a25416

2. Someone would likely run this via Atlas. The plugin mirror intentionally is exposed at [spark-atlas-connector](https://github.com/hortonworks-spark/spark-atlas-connector) so that anyone could do something about lineage and governance in Atlas. I'm trying to show integrated lineages in Apache Spark but this is a missing hole.

## How was this patch tested?

Manually tested and unit tests were added.

Closes #23263 from HyukjinKwon/SPARK-23674-1.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-25 10:11:49 +08:00
Hyukjin Kwon d97481ba17 [MINOR] Add Jenkins and AppVeyor badges at README.md
## What changes were proposed in this pull request?

This PR adds Jenkins build and AppVeyor build badges

Take a look at cc24b9e0b3/README.md

## How was this patch tested?

Manually tested by `grip`.

Closes #23629 from HyukjinKwon/minor-badges.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-25 10:04:21 +08:00
Ilya Matiach b2d36f65db [SPARK-19591][ML][MLLIB] Add sample weights to decision trees
This is updated PR https://github.com/apache/spark/pull/16722 to latest master

## What changes were proposed in this pull request?

This patch adds support for sample weights to DecisionTreeRegressor and DecisionTreeClassifier.

Note: This patch does not add support for sample weights to RandomForest. As discussed in the JIRA, we would like to add sample weights into the bagging process. This patch is large enough as is, and there are some additional considerations to be made for random forests. Since the machinery introduced here needs to be present regardless, I have opted to leave random forests for a follow up pr.
## How was this patch tested?

The algorithms are tested to ensure that:
    1. Arbitrary scaling of constant weights has no effect
    2. Outliers with small weights do not affect the learned model
    3. Oversampling and weighting are equivalent

Unit tests are also added to test other smaller components.
## Summary of changes

   - Impurity aggregators now store weighted sufficient statistics. They also store a raw count, however, since this is needed to use minInstancesPerNode.

   - Impurity aggregators now also hold the raw count.

   - This patch maintains the meaning of minInstancesPerNode, in that the parameter still corresponds to raw, unweighted counts. It also adds a new parameter minWeightFractionPerNode which requires that nodes must contain at least minWeightFractionPerNode * weightedNumExamples total weight.

   - This patch modifies findSplitsForContinuousFeatures to use weighted sums. Unit tests are added.

   - TreePoint is modified to hold a sample weight

   - BaggedPoint is modified from:
``` Scala
private[spark] class BaggedPoint[Datum](val datum: Datum, val subsampleWeights: Array[Double]) extends Serializable
```
to
``` Scala
private[spark] class BaggedPoint[Datum](
    val datum: Datum,
    val subsampleCounts: Array[Int],
    val sampleWeight: Double) extends Serializable
```
We do not simply multiply the counts by the weight and store that because we need the raw counts and the weight in order to use both minInstancesPerNode and minWeightPerNode

**Note**: many of the changed files are due simply to using Instance instead of LabeledPoint

Closes #21632 from imatiach-msft/ilmat/sample-weights.

Authored-by: Ilya Matiach <ilmat@microsoft.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-24 18:20:28 -07:00
Imran Rashid 3699763fda [SPARK-26697][CORE] Log local & remote block sizes.
## What changes were proposed in this pull request?

To help debugging failed or slow tasks, its really useful to know the
size of the blocks getting fetched.  Though that is available at the
debug level, debug logs aren't on in general -- but there is already an
info level log line that this augments a little.

## How was this patch tested?

Ran very basic local-cluster mode app, looked at logs.  Example line:

```
INFO ShuffleBlockFetcherIterator: Getting 2 (194.0 B) non-empty blocks including 1 (97.0 B) local blocks and 1 (97.0 B) remote blocks
```

Full suite via jenkins.

Closes #23621 from squito/SPARK-26697.

Authored-by: Imran Rashid <irashid@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-01-24 16:13:58 -08:00
Liupengcheng 8d667c511c [SPARK-26530][CORE] Validate heartheat arguments in HeartbeatReceiver
## What changes were proposed in this pull request?

Currently, heartbeat related arguments is not validated in spark, so if these args are inproperly specified, the Application may run for a while and not failed until the max executor failures reached(especially with spark.dynamicAllocation.enabled=true), thus may incurs resources waste.

This PR is to precheck these arguments in HeartbeatReceiver to fix this problem.

## How was this patch tested?

NA-just validation changes

Closes #23445 from liupc/validate-heartbeat-arguments-in-SparkSubmitArguments.

Authored-by: Liupengcheng <liupengcheng@xiaomi.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-01-24 15:12:57 -08:00
Rob Vesse 69dab94b13 [SPARK-26687][K8S] Fix handling of custom Dockerfile paths
## What changes were proposed in this pull request?

With the changes from vanzin's PR #23019 (SPARK-26025) we use a pared down temporary Docker build context which significantly improves build times.  However the way this is implemented leads to non-intuitive behaviour when supplying custom Docker file paths.  This is because of the following code snippets:

```
(cd $(img_ctx_dir base) && docker build $NOCACHEARG "${BUILD_ARGS[]}" \
    -t $(image_ref spark) \
    -f "$BASEDOCKERFILE" .)
```

Since the script changes to the temporary build context directory and then runs `docker build` there any path given for the Docker file is taken as relative to the temporary build context directory rather than to the directory where the user invoked the script.  This is rather unintuitive and produces somewhat unhelpful errors e.g.

```
> ./bin/docker-image-tool.sh -r rvesse -t badpath -p resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile build
Sending build context to Docker daemon  218.4MB
Step 1/15 : FROM openjdk:8-alpine
 ---> 5801f7d008e5
Step 2/15 : ARG spark_uid=185
 ---> Using cache
 ---> 5fd63df1ca39
...
Successfully tagged rvesse/spark:badpath
unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /Users/rvesse/Documents/Work/Code/spark/target/tmp/docker/pyspark/resource-managers: no such file or directory
Failed to build PySpark Docker image, please refer to Docker build output for details.
```

Here we can see that the relative path that was valid where the user typed the command was not valid inside the build context directory.

To resolve this we need to ensure that we are resolving relative paths to Docker files appropriately which we do by adding a `resolve_file` function to the script and invoking that on the supplied Docker file paths

## How was this patch tested?

Validated that relative paths now work as expected:

```
> ./bin/docker-image-tool.sh -r rvesse -t badpath -p resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile build
Sending build context to Docker daemon  218.4MB
Step 1/15 : FROM openjdk:8-alpine
 ---> 5801f7d008e5
Step 2/15 : ARG spark_uid=185
 ---> Using cache
 ---> 5fd63df1ca39
Step 3/15 : RUN set -ex &&     apk upgrade --no-cache &&     apk add --no-cache bash tini libc6-compat linux-pam krb5 krb5-libs &&     mkdir -p /opt/spark &&     mkdir -p /opt/spark/examples &&     mkdir -p /opt/spark/work-dir &&     touch /opt/spark/RELEASE &&     rm /bin/sh &&     ln -sv /bin/bash /bin/sh &&     echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su &&     chgrp root /etc/passwd && chmod ug+rw /etc/passwd
 ---> Using cache
 ---> eb0a568e032f
Step 4/15 : COPY jars /opt/spark/jars
...
Successfully tagged rvesse/spark:badpath
Sending build context to Docker daemon  6.599MB
Step 1/13 : ARG base_img
Step 2/13 : ARG spark_uid=185
Step 3/13 : FROM $base_img
 ---> 8f4fff16f903
Step 4/13 : WORKDIR /
 ---> Running in 25466e66f27f
Removing intermediate container 25466e66f27f
 ---> 1470b6efae61
Step 5/13 : USER 0
 ---> Running in b094b739df37
Removing intermediate container b094b739df37
 ---> 6a27eb4acad3
Step 6/13 : RUN mkdir ${SPARK_HOME}/python
 ---> Running in bc8002c5b17c
Removing intermediate container bc8002c5b17c
 ---> 19bb12f4286a
Step 7/13 : RUN apk add --no-cache python &&     apk add --no-cache python3 &&     python -m ensurepip &&     python3 -m ensurepip &&     rm -r /usr/lib/python*/ensurepip &&     pip install --upgrade pip setuptools &&     rm -r /root/.cache
 ---> Running in 12dcba5e527f
...
Successfully tagged rvesse/spark-py:badpath
```

Closes #23613 from rvesse/SPARK-26687.

Authored-by: Rob Vesse <rvesse@dotnetrdf.org>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-01-24 10:11:55 -08:00
Tom van Bussel 9813b1d074 [SPARK-26690] Track query execution and time cost for checkpoints
## What changes were proposed in this pull request?

Checkpoints of Dataframes currently do not show up in SQL UI. This PR fixes that by setting an execution id for the execution of the checkpoint by wrapping the checkpoint code with a `withAction`.

## How was this patch tested?

A unit test was added to DatasetSuite.

Closes #23636 from tomvanbussel/SPARK-26690.

Authored-by: Tom van Bussel <tom.vanbussel@databricks.com>
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
2019-01-24 16:44:39 +01:00
Bruce Robbins d4a30fa9af [SPARK-26680][SQL] Eagerly create inputVars while conditions are appropriate
## What changes were proposed in this pull request?

When a user passes a Stream to groupBy, ```CodegenSupport.consume``` ends up lazily generating ```inputVars``` from a Stream, since the field ```output``` will be a Stream. At the time ```output.zipWithIndex.map``` is called, conditions are correct. However, by the time the map operation actually executes, conditions are no longer appropriate. The closure used by the map operation ends up using a reference to the partially created ```inputVars```. As a result, a StackOverflowError occurs.

This PR ensures that ```inputVars``` is eagerly created while conditions are appropriate. It seems this was also an issue with the code path for creating ```inputVars``` from ```outputVars``` (SPARK-25767). I simply extended the solution for that code path to encompass both code paths.

## How was this patch tested?

SQL unit tests
new test
python tests

Closes #23617 from bersprockets/SPARK-26680_opt1.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
2019-01-24 11:18:08 +01:00
Ryan Blue d5a97c1c2c [SPARK-26682][SQL] Use taskAttemptID instead of attemptNumber for Hadoop.
## What changes were proposed in this pull request?

Updates the attempt ID used by FileFormatWriter. Tasks in stage attempts use the same task attempt number and could conflict. Using Spark's task attempt ID guarantees that Hadoop TaskAttemptID instances are unique.

## How was this patch tested?

Existing tests. Also validated that we no longer detect this failure case in our logs after deployment.

Closes #23608 from rdblue/SPARK-26682-fix-hadoop-task-attempt-id.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-01-24 12:45:25 +08:00
Dave DeCaprio d0e9219e03 [SPARK-26617][SQL] Cache manager locks
## What changes were proposed in this pull request?

Fixed several places in CacheManager where a write lock was being held while running the query optimizer.  This could cause a very lock block if the query optimization takes a long time.  This builds on changes from [SPARK-26548] that fixed this issue for one specific case in the CacheManager.

gatorsmile This is very similar to the PR you approved last week.

## How was this patch tested?

Has been tested on a live system where the blocking was causing major issues and it is working well.
CacheManager has no explicit unit test but is used in many places internally as part of the SharedState.

Closes #23539 from DaveDeCaprio/cache-manager-locks.

Lead-authored-by: Dave DeCaprio <daved@alum.mit.edu>
Co-authored-by: David DeCaprio <daved@alum.mit.edu>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-01-24 10:48:48 +08:00
ayudovin 11be22bb5e [SPARK-25713][SQL] implementing copy for ColumnArray
## What changes were proposed in this pull request?

Implement copy() for ColumnarArray

## How was this patch tested?
 Updating test case to existing tests in ColumnVectorSuite

Closes #23569 from ayudovin/copy-for-columnArray.

Authored-by: ayudovin <a.yudovin6695@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-01-24 10:35:44 +08:00
Anton Okolnychyi 0df29bfbdc
[SPARK-26706][SQL] Fix illegalNumericPrecedence for ByteType
## What changes were proposed in this pull request?

This PR contains a minor change in `Cast$mayTruncate` that fixes its logic for bytes.

Right now, `mayTruncate(ByteType, LongType)` returns `false` while `mayTruncate(ShortType, LongType)` returns `true`. Consequently, `spark.range(1, 3).as[Byte]` and `spark.range(1, 3).as[Short]` behave differently.

Potentially, this bug can silently corrupt someone's data.
```scala
// executes silently even though Long is converted into Byte
spark.range(Long.MaxValue - 10, Long.MaxValue).as[Byte]
  .map(b => b - 1)
  .show()
+-----+
|value|
+-----+
|  -12|
|  -11|
|  -10|
|   -9|
|   -8|
|   -7|
|   -6|
|   -5|
|   -4|
|   -3|
+-----+
// throws an AnalysisException: Cannot up cast `id` from bigint to smallint as it may truncate
spark.range(Long.MaxValue - 10, Long.MaxValue).as[Short]
  .map(s => s - 1)
  .show()
```
## How was this patch tested?

This PR comes with a set of unit tests.

Closes #23632 from aokolnychyi/cast-fix.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-01-24 00:12:26 +00:00
Liupengcheng 0446363ef4 [SPARK-26660] Add warning logs when broadcasting large task binary
## What changes were proposed in this pull request?

Currently, some ML library may generate large ml model, which may be referenced in the task closure, so driver will broadcasting large task binary, and executor may not able to deserialize it and result in OOM failures(for instance, executor's memory is not enough). This problem not only affects apps using ml library, some user specified closure or function which refers large data may also have this problem.

In order to facilitate the debuging of memory problem caused by large taskBinary broadcast, we can add same warning logs for it.

This PR will add some warning logs on the driver side when broadcasting a large task binary, and it also included some minor log changes in the reading of broadcast.

## How was this patch tested?
NA-Just log changes.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes #23580 from liupc/Add-warning-logs-for-large-taskBinary-size.

Authored-by: Liupengcheng <liupengcheng@xiaomi.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-23 08:51:39 -06:00
Ryan Blue d008e23ab5 [SPARK-26681][SQL] Support Ammonite inner-class scopes.
## What changes were proposed in this pull request?

This adds a new pattern to recognize Ammonite REPL classes and return the correct scope.

## How was this patch tested?

Manually tested with Spark in an Ammonite session.

Closes #23607 from rdblue/SPARK-26681-support-ammonite-scopes.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-23 08:50:03 -06:00
Maxim Gekk 46d5bb9a0f [SPARK-26653][SQL] Use Proleptic Gregorian calendar in parsing JDBC lower/upper bounds
## What changes were proposed in this pull request?

In the PR, I propose using of the `stringToDate` and `stringToTimestamp` methods in parsing JDBC lower/upper bounds of the partition column if it has `DateType` or `TimestampType`. Since those methods have been ported on Proleptic Gregorian calendar by #23512, the PR switches parsing of JDBC bounds of the partition column on the calendar as well.

## How was this patch tested?

This was tested by `JDBCSuite`.

Closes #23597 from MaxGekk/jdbc-parse-timestamp-bounds.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-01-23 20:23:17 +08:00
Takeshi Yamamuro 1ed1b4d8e1 [SPARK-26637][SQL] Makes GetArrayItem nullability more precise
## What changes were proposed in this pull request?
In the master, GetArrayItem nullable is always true;
cf133e6110/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala (L236)

But, If input array size is constant and ordinal is foldable, we could make GetArrayItem nullability more precise. This pr added code to make `GetArrayItem` nullability more precise.

## How was this patch tested?
Added tests in `ComplexTypeSuite`.

Closes #23566 from maropu/GetArrayItemNullability.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-01-23 15:33:02 +08:00
Ngone51 3da71f2da1 [SPARK-22465][CORE][FOLLOWUP] Use existing partitioner when defaultNumPartitions is equal to maxPartitioner.numPartitions
## What changes were proposed in this pull request?

Followup of #20091. We could also use existing partitioner when defaultNumPartitions is equal to the maxPartitioner's numPartitions.

## How was this patch tested?

Existed.

Closes #23581 from Ngone51/dev-use-existing-partitioner-when-defaultNumPartitions-equalTo-MaxPartitioner#-numPartitions.

Authored-by: Ngone51 <ngone_5451@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-01-23 10:23:40 +08:00
Sean Owen 6dcad38ba3 [SPARK-26228][MLLIB] OOM issue encountered when computing Gramian matrix
## What changes were proposed in this pull request?

Avoid memory problems in closure cleaning when handling large Gramians (>= 16K rows/cols) by using null as zeroValue

## How was this patch tested?

Existing tests.
Note that it's hard to test the case that triggers this issue as it would require a large amount of memory and run a while. I confirmed locally that a 16K x 16K Gramian failed with tons of driver memory before, and didn't fail upfront after this change.

Closes #23600 from srowen/SPARK-26228.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-22 19:22:06 -06:00
Shahid 0d35f9ea3a [SPARK-24484][MLLIB] Power Iteration Clustering is giving incorrect clustering results when there are mutiple leading eigen values.
## What changes were proposed in this pull request?
![image](https://user-images.githubusercontent.com/23054875/41823325-e83e1d34-781b-11e8-8c34-fc6e7a042f3f.png)

![image](https://user-images.githubusercontent.com/23054875/41823367-733c9ba4-781c-11e8-8da2-b26460c2af63.png)
![image](https://user-images.githubusercontent.com/23054875/41823409-179dd910-781d-11e8-8d8c-9865156fad15.png)

**Method to determine if the top eigen values has same magnitude but opposite signs**
The vector is written as a linear combination of the eigen vectors at iteration k.
![image](https://user-images.githubusercontent.com/23054875/41822941-f8b13d4c-7814-11e8-8091-54c02721c1c5.png)
![image](https://user-images.githubusercontent.com/23054875/41822982-b80a6fc4-7815-11e8-9129-ed96a14f037f.png)
![image](https://user-images.githubusercontent.com/23054875/41823022-5b69e906-7816-11e8-847a-8fa5f0b6200e.png)

![image](https://user-images.githubusercontent.com/23054875/41823087-54311398-7817-11e8-90bf-e1be2bbff323.png)
![image](https://user-images.githubusercontent.com/23054875/41823121-e0b78324-7817-11e8-9596-379bd2e518af.png)
![image](https://user-images.githubusercontent.com/23054875/41823151-965319d2-7818-11e8-8b91-10f6276ace62.png)
![image](https://user-images.githubusercontent.com/23054875/41823182-75cdbad6-7819-11e8-912f-23c66a8359de.png)
![image](https://user-images.githubusercontent.com/23054875/41823221-1ca77a36-781a-11e8-9a40-48bd165797cc.png)
![image](https://user-images.githubusercontent.com/23054875/41823272-f6962b2a-781a-11e8-9978-1b2dc0dc8b2c.png)
![image](https://user-images.githubusercontent.com/23054875/41823303-75b296f0-781b-11e8-8501-6133b04769c8.png)

**So, we need to check if the reileigh coefficient at the convergence is lesser than the norm of the estimated eigen vector before normalizing**

(Please fill in changes proposed in this fix)
Added a UT

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes #21627 from shahidki31/picConvergence.

Authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-22 18:29:18 -06:00
Darcy Shen 9d2a11554b [MINOR][DOC] Documentation on JVM options for SBT
## What changes were proposed in this pull request?

Documentation and .gitignore

## How was this patch tested?

Manual test that SBT honors the settings in .jvmopts if present

Closes #23615 from sadhen/impr/gitignore.

Authored-by: Darcy Shen <sadhen@zoho.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-22 18:27:24 -06:00
Kris Mok 02d8ae3d59
[SPARK-26661][SQL] Show actual class name of the writing command in CTAS explain
## What changes were proposed in this pull request?

The explain output of the Hive CTAS command, regardless of whether it's actually writing via Hive's SerDe or converted into using Spark's data source, would always show that it's using `InsertIntoHiveTable` because it's hardcoded.

e.g.
```
Execute OptimizedCreateHiveTableAsSelectCommand [Database:default, TableName: foo, InsertIntoHiveTable]
```
This CTAS is converted into using Spark's data source, but it still says `InsertIntoHiveTable` in the explain output.

It's better to show the actual class name of the writing command used. For the example above, it'd be:
```
Execute OptimizedCreateHiveTableAsSelectCommand [Database:default, TableName: foo, InsertIntoHadoopFsRelationCommand]
```

## How was this patch tested?

Added test case in `HiveExplainSuite`

Closes #23582 from rednaxelafx/fix-explain-1.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-22 13:55:41 -08:00
Rob Vesse dc2da72100 [SPARK-26685][K8S] Correct placement of ARG declaration
Latest Docker releases are stricter in their enforcement of build argument scope.  The location of the `ARG spark_uid` declaration in the Python and R Dockerfiles means the variable is out of scope by the time it is used in a `USER` declaration resulting in a container running as root rather than the default/configured UID.

Also with some of the refactoring of the script that has happened since my PR that introduced the configurable UID it turns out the `-u <uid>` argument is not being properly passed to the Python and R image builds when those are opted into

## What changes were proposed in this pull request?

This commit moves the `ARG` declaration to just before the argument is used such that it is in scope.  It also ensures that Python and R image builds receive the build arguments that include the `spark_uid` argument where relevant

## How was this patch tested?

Prior to the patch images are produced where the Python and R images ignore the default/configured UID:

```
> docker run -it --entrypoint /bin/bash rvesse/spark-py:uid456
bash-4.4# whoami
root
bash-4.4# id -u
0
bash-4.4# exit
> docker run -it --entrypoint /bin/bash rvesse/spark:uid456
bash-4.4$ id -u
456
bash-4.4$ exit
```

Note that the Python image is still running as `root` having ignored the configured UID of 456 while the base image has the correct UID because the relevant `ARG` declaration is correctly in scope.

After the patch the correct UID is observed:

```
> docker run -it --entrypoint /bin/bash rvesse/spark-r:uid456
bash-4.4$ id -u
456
bash-4.4$ exit
exit
> docker run -it --entrypoint /bin/bash rvesse/spark-py:uid456
bash-4.4$ id -u
456
bash-4.4$ exit
exit
> docker run -it --entrypoint /bin/bash rvesse/spark:uid456
bash-4.4$ id -u
456
bash-4.4$ exit
```

Closes #23611 from rvesse/SPARK-26685.

Authored-by: Rob Vesse <rvesse@dotnetrdf.org>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-01-22 10:31:17 -08:00
Rob Vesse c542c247bb [SPARK-25887][K8S] Configurable K8S context support
This enhancement allows for specifying the desired context to use for the initial K8S client auto-configuration.  This allows users to more easily access alternative K8S contexts without having to first
explicitly change their current context via kubectl.

Explicitly set my K8S context to a context pointing to a non-existent cluster, then launched Spark jobs with explicitly specified contexts via the new `spark.kubernetes.context` configuration property.

Example Output:

```
> kubectl config current-context
minikube
> minikube status
minikube: Stopped
cluster:
kubectl:
> ./spark-submit --master k8s://https://localhost:6443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.context=docker-for-desktop --conf spark.kubernetes.container.image=rvesse/spark:debian local:///opt/spark/examples/jars/spark-examples_2.11-3.0.0-SNAPSHOT.jar 4
18/10/31 11:57:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/31 11:57:51 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using context docker-for-desktop from users K8S config file
18/10/31 11:57:52 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: spark-pi-1540987071845-driver
	 namespace: default
	 labels: spark-app-selector -> spark-2c4abc226ed3415986eb602bd13f3582, spark-role -> driver
	 pod uid: 32462cac-dd04-11e8-b6c6-025000000001
	 creation time: 2018-10-31T11:57:52Z
	 service account name: default
	 volumes: spark-local-dir-1, spark-conf-volume, default-token-glpfv
	 node name: N/A
	 start time: N/A
	 phase: Pending
	 container status: N/A
18/10/31 11:57:52 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: spark-pi-1540987071845-driver
	 namespace: default
	 labels: spark-app-selector -> spark-2c4abc226ed3415986eb602bd13f3582, spark-role -> driver
	 pod uid: 32462cac-dd04-11e8-b6c6-025000000001
	 creation time: 2018-10-31T11:57:52Z
	 service account name: default
	 volumes: spark-local-dir-1, spark-conf-volume, default-token-glpfv
	 node name: docker-for-desktop
	 start time: N/A
	 phase: Pending
	 container status: N/A
...
18/10/31 11:58:03 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: spark-pi-1540987071845-driver
	 namespace: default
	 labels: spark-app-selector -> spark-2c4abc226ed3415986eb602bd13f3582, spark-role -> driver
	 pod uid: 32462cac-dd04-11e8-b6c6-025000000001
	 creation time: 2018-10-31T11:57:52Z
	 service account name: default
	 volumes: spark-local-dir-1, spark-conf-volume, default-token-glpfv
	 node name: docker-for-desktop
	 start time: 2018-10-31T11:57:52Z
	 phase: Succeeded
	 container status:
		 container name: spark-kubernetes-driver
		 container image: rvesse/spark:debian
		 container state: terminated
		 container started at: 2018-10-31T11:57:54Z
		 container finished at: 2018-10-31T11:58:02Z
		 exit code: 0
		 termination reason: Completed
```

Without the `spark.kubernetes.context` setting this will fail because the current context - `minikube` - is pointing to a non-running cluster e.g.

```
> ./spark-submit --master k8s://https://localhost:6443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=rvesse/spark:debian local:///opt/spark/examples/jars/spark-examples_2.11-3.0.0-SNAPSHOT.jar 4
18/10/31 12:02:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/31 12:02:30 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
18/10/31 12:02:31 WARN WatchConnectionManager: Exec Failure
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
	at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1509)
	at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
	at sun.security.ssl.Handshaker.process_record(Handshaker.java:914)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062)
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
	at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:281)
	at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:251)
	at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:151)
	at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:195)
	at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
	at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:66)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:109)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387)
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:292)
	at sun.security.validator.Validator.validate(Validator.java:260)
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
	at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1491)
	... 39 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
	at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
	at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:382)
	... 45 more
Exception in thread "kubernetes-dispatcher-0" Exception in thread "main" java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask611a9c09 rejected from java.util.concurrent.ScheduledThreadPoolExecutor404819e4[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
	at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326)
	at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
	at java.util.concurrent.ScheduledThreadPoolExecutor.submit(ScheduledThreadPoolExecutor.java:632)
	at java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:678)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.scheduleReconnect(WatchConnectionManager.java:300)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$800(WatchConnectionManager.java:48)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onFailure(WatchConnectionManager.java:213)
	at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:543)
	at okhttp3.internal.ws.RealWebSocket$2.onFailure(RealWebSocket.java:208)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:148)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
io.fabric8.kubernetes.client.KubernetesClientException: Failed to start websocket
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onFailure(WatchConnectionManager.java:204)
	at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:543)
	at okhttp3.internal.ws.RealWebSocket$2.onFailure(RealWebSocket.java:208)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:148)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
	at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1509)
	at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
	at sun.security.ssl.Handshaker.process_record(Handshaker.java:914)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062)
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
	at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:281)
	at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:251)
	at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:151)
	at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:195)
	at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
	at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:66)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:109)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135)
	... 4 more
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387)
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:292)
	at sun.security.validator.Validator.validate(Validator.java:260)
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
	at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1491)
	... 39 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
	at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
	at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:382)
	... 45 more
18/10/31 12:02:31 INFO ShutdownHookManager: Shutdown hook called
18/10/31 12:02:31 INFO ShutdownHookManager: Deleting directory /private/var/folders/6b/y1010qp107j9w2dhhy8csvz0000xq3/T/spark-5e649891-8a0f-4f17-bf3a-33b34082eba8
```

Suggested reviews: mccheah liyinan926 - this is the follow up fix to the bug discovered while working on SPARK-25809 (PR #22805)

Closes #22904 from rvesse/SPARK-25887.

Authored-by: Rob Vesse <rvesse@dotnetrdf.org>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-01-22 10:25:21 -08:00