Commit graph

24027 commits

Author SHA1 Message Date
Sean Owen b76f262fc8 [SPARK-27032][TEST] De-flake org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog: metadata directory collision
## What changes were proposed in this pull request?

Reduce work in HDFSMetadataLogSuite test to possibly de-flake it.

## How was this patch tested?

Existing tests

Closes #23937 from srowen/SPARK-27032.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-04 13:36:41 +09:00
Jungtaek Lim (HeartSaVioR) 34f606678a [SPARK-27001][SQL] Refactor "serializerFor" method between ScalaReflection and JavaTypeInference
## What changes were proposed in this pull request?

This patch proposes refactoring `serializerFor` method between `ScalaReflection` and `JavaTypeInference`, being consistent with what we refactored for `deserializerFor` in #23854.

This patch also extracts the logic on recording walk type path since the logic is duplicated across `serializerFor` and `deserializerFor` with `ScalaReflection` and `JavaTypeInference`.

## How was this patch tested?

Existing tests.

Closes #23908 from HeartSaVioR/SPARK-27001.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-04 10:45:48 +08:00
Dilip Biswal 04ad559ab6 [SPARK-27016][SQL][BUILD] Treat all antlr warnings as errors while generating parser from the sql grammar file.
## What changes were proposed in this pull request?
Use the maven plugin option `treatWarningsAsErrors` to make sure the warnings are treated as errors while generating the parser file. In the absence of it, we may inadvertently introducing problems while making grammar changes.  Please refer to [PR-23897](https://github.com/apache/spark/pull/23897) to know more about the context.
## How was this patch tested?
We can use two ways to build Spark 1) sbt 2) Maven
This PR, we made a change to configure the maven antlr plugin to include a parameter that makes antlr4 report error on warning. However, when spark is built using sbt, we use the sbt antlr plugin which does not allow us to pass this additional compilation flag.  More info on sbt-antlr plugin can be found at [link](https://github.com/ihji/sbt-antlr4/blob/master/src/main/scala/com/simplytyped/Antlr4Plugin.scala)
In summary, this fix only applicable when we use maven to build.

Closes #23925 from dilipbiswal/antlr_fix.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-03 10:02:25 -06:00
Sean Owen d8754df2bf [SPARK-27029][BUILD] Update Thrift to 0.12.0
## What changes were proposed in this pull request?

Update Thrift to 0.12.0 to pick up bug and security fixes.
Changes: https://github.com/apache/thrift/blob/master/CHANGES.md
The important one is for https://issues.apache.org/jira/browse/THRIFT-4506

## How was this patch tested?

Existing tests. A quick local test suggests this works.

Closes #23935 from srowen/SPARK-27029.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-02 17:28:37 -08:00
Marcelo Vanzin d00eca75b3 [SPARK-26048][BUILD] Enable flume profile when creating 2.x releases.
Closes #23931 from vanzin/SPARK-26048.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-02 08:14:06 -08:00
Huaxin Gao be5d95adc6 [SPARK-27007][PYTHON] add rawPrediction to OneVsRest in PySpark
## What changes were proposed in this pull request?

Add RawPrediction to OneVsRest in PySpark to make it consistent with scala implementation

## How was this patch tested?

Add doctest

Closes #23910 from huaxingao/spark-27007.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-02 09:09:28 -06:00
Sean Owen a97a19dd93 [SPARK-26807][DOCS] Clarify that Pyspark is on PyPi now
## What changes were proposed in this pull request?

Docs still say that Spark will be available on PyPi "in the future"; just needs to be updated.

## How was this patch tested?

Doc build

Closes #23933 from srowen/SPARK-26807.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-02 14:23:53 +09:00
Dilip Biswal 4a486d6716 [SPARK-26982][SQL] Enhance describe framework to describe the output of a query.
## What changes were proposed in this pull request?
Currently we can use `df.printSchema` to discover the schema information for a query. We should have a way to describe the output schema of a query using SQL interface.

Example:

DESCRIBE SELECT * FROM desc_table
DESCRIBE QUERY SELECT * FROM desc_table
```SQL

spark-sql> create table desc_table (c1 int comment 'c1-comment', c2 decimal comment 'c2-comment', c3 string);

spark-sql> desc select * from desc_table;
c1	int	        c1-comment
c2	decimal(10,0)	c2-comment
c3	string	        NULL

```
## How was this patch tested?
Added a new test under SQLQueryTestSuite and SparkSqlParserSuite

Closes #23883 from dilipbiswal/dkb_describe_query.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-02 11:21:23 +08:00
manuzhang 81dd21fda9 [SPARK-26977][CORE] Fix warn against subclassing scala.App
## What changes were proposed in this pull request?

Fix warn against subclassing scala.App

## How was this patch tested?

Manual test

Closes #23903 from manuzhang/fix_submit_warning.

Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-01 17:37:58 -06:00
Dilip Biswal 5fd62ca65a [SPARK-26215][SQL][FOLLOW-UP][MINOR] Fix the warning from ANTR4
## What changes were proposed in this pull request?
I see the following new warning from ANTR4 after SPARK-26215 after it added `SCHEMA` keyword in the reserved/unreserved list. This is a minor PR to cleanup the warning.

```
WARNING] warning(125): org/apache/spark/sql/catalyst/parser/SqlBase.g4:784:90: implicit definition of token SCHEMA in parser
[WARNING] .../apache/spark/org/apache/spark/sql/catalyst/parser/SqlBase.g4 [784:90]: implicit definition of token SCHEMA in parser
```
## How was this patch tested?
Manually built catalyst after the fix to verify

Closes #23897 from dilipbiswal/minor_parser_token.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-01 12:34:15 -08:00
Marcelo Vanzin 9f16af6366 [K8S][MINOR] Log minikube version when running integration tests.
Closes #23893 from vanzin/minikube-version.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-03-01 11:24:08 -08:00
SongYadong 86b25c4350 [SPARK-26967][CORE] Put MetricsSystem instance names together for clearer management
## What changes were proposed in this pull request?

`MetricsSystem` instance creations have a scattered distribution in the project code. So do their names. It may cause some inconvenience for browsing and management.
This PR tries to put them together. In this way, we can have a uniform location for adding or removing them, and have a overall view of `MetircsSystem `instances in current project.
It's also helpful for maintaining user documents by avoiding missing something.

## How was this patch tested?

Existing unit tests.

Closes #23869 from SongYadong/metrics_system_inst_manage.

Lead-authored-by: SongYadong <song.yadong1@zte.com.cn>
Co-authored-by: walter2001 <ydsong2007@163.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-01 11:49:43 -06:00
liuxian 02bbe977ab [MINOR] Remove unnecessary gets when getting a value from map.
## What changes were proposed in this pull request?

Redundant `get`  when getting a value from `Map` given a key.

## How was this patch tested?

N/A

Closes #23901 from 10110346/removegetfrommap.

Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-01 11:48:07 -06:00
Sean Owen 131b464d0c [SPARK-26986][ML][FOLLOWUP] Add JAXB reference impl to build for Java 9+
## What changes were proposed in this pull request?

Remove a few new JAXB dependencies that shouldn't be necessary now.
See https://github.com/apache/spark/pull/23890#issuecomment-468299922

## How was this patch tested?

Existing tests

Closes #23923 from srowen/SPARK-26986.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-01 11:23:40 -06:00
Yifei Huang bc7592ba11 [SPARK-27009][TEST] Add Standard Deviation to benchmark results
## What changes were proposed in this pull request?

Add standard deviation to the stats taken during benchmark testing.

## How was this patch tested?

Manually ran a few benchmark tests locally and visually inspected the output

Closes #23914 from yifeih/spark-27009-stdev.

Authored-by: Yifei Huang <yifeih@palantir.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-02-28 20:55:55 -08:00
Marcelo Vanzin 14f714fb30 [SPARK-26420][K8S] Generate more unique IDs when creating k8s resource names.
Using the current time as an ID is more prone to clashes than people generally
realize, so try to make things a bit more unique without necessarily using a
UUID, which would eat too much space in the names otherwise.

The implemented approach uses some bits from the current time, plus some random
bits, which should be more resistant to clashes.

Closes #23805 from vanzin/SPARK-26420.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-02-28 20:39:13 -08:00
Maxim Gekk 8e5f9995ca [SPARK-27008][SQL] Support java.time.LocalDate as an external type of DateType
## What changes were proposed in this pull request?

In the PR, I propose to add new Catalyst type converter for `DateType`. It should be able to convert `java.time.LocalDate` to/from `DateType`.

Main motivations for the changes:
- Smoothly support Java 8 time API
- Avoid inconsistency of calendars used inside of Spark 3.0 (Proleptic Gregorian calendar) and `java.sql.Date` (hybrid calendar - Julian + Gregorian).
- Make conversion independent from current system timezone.

By default, Spark converts values of `DateType` to `java.sql.Date` instances but the SQL config `spark.sql.datetime.java8API.enabled` can change the behavior. If it is set to `true`, Spark uses `java.time.LocalDate` as external type for `DateType`.

## How was this patch tested?

Added new testes to `CatalystTypeConvertersSuite` to check conversion of `DateType` to/from `java.time.LocalDate`, `JavaUDFSuite`/ `UDFSuite` to test usage of `LocalDate` type in Scala/Java UDFs.

Closes #23913 from MaxGekk/date-localdate.

Authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-03-01 11:04:28 +08:00
Imran Rashid c8e7eb1fa7 [SPARK-26774][CORE] Update some docs on TaskSchedulerImpl.
A couple of places in TaskSchedulerImpl could use a minor doc update on
threading concerns.  There is one bug fix here, but only in
sc.killTaskAttempt() which is probably not used much.

Closes #23874 from squito/SPARK-26774.

Authored-by: Imran Rashid <irashid@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-02-28 11:30:20 -08:00
zhengruifeng acd086f207 [SPARK-19591][ML][PYSPARK][FOLLOWUP] Add sample weights to decision trees
## What changes were proposed in this pull request?
Add sample weights to decision trees

## How was this patch tested?
updated testsuites

Closes #23818 from zhengruifeng/py_tree_support_sample_weight.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-27 21:11:30 -06:00
Hyukjin Kwon 6e31ccf2a1 [SPARK-26895][CORE][FOLLOW-UP] Uninitializing log after prepareSubmitEnvironment in SparkSubmit
## What changes were proposed in this pull request?

Currently, if I run `spark-shell` in my local, it started to show the logs as below:

```
$ ./bin/spark-shell
...
19/02/28 04:42:43 INFO SecurityManager: Changing view acls to: hkwon
19/02/28 04:42:43 INFO SecurityManager: Changing modify acls to: hkwon
19/02/28 04:42:43 INFO SecurityManager: Changing view acls groups to:
19/02/28 04:42:43 INFO SecurityManager: Changing modify acls groups to:
19/02/28 04:42:43 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hkwon); groups with view permissions: Set(); users  with modify permissions: Set(hkwon); groups with modify permissions: Set()
19/02/28 04:42:43 INFO SignalUtils: Registered signal handler for INT
19/02/28 04:42:48 INFO SparkContext: Running Spark version 3.0.0-SNAPSHOT
19/02/28 04:42:48 INFO SparkContext: Submitted application: Spark shell
19/02/28 04:42:48 INFO SecurityManager: Changing view acls to: hkwon
```

Seems to be the cause is https://github.com/apache/spark/pull/23806 and `prepareSubmitEnvironment` looks actually reinitializing the logging again.

This PR proposes to uninitializing log later after `prepareSubmitEnvironment`.

## How was this patch tested?

Manually tested.

Closes #23911 from HyukjinKwon/SPARK-26895.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-02-27 17:01:30 -08:00
Gabor Somogyi 76e0b6bafb [SPARK-27002][SS] Get kafka delegation tokens right before consumer/producer created
## What changes were proposed in this pull request?

Spark not always picking up the latest Kafka delegation tokens even if a new one properly obtained.
In the PR I'm setting delegation tokens right before `KafkaConsumer` and `KafkaProducer` creation to be on the safe side.

## How was this patch tested?

Long running Kafka to Kafka tests on 4 node cluster with randomly thrown artificial exceptions.

Test scenario:
* 4 node cluster
* Yarn
* Kafka broker version 2.1.0
* security.protocol = SASL_SSL
* sasl.mechanism = SCRAM-SHA-512

Kafka broker settings:
* delegation.token.expiry.time.ms=600000 (10 min)
* delegation.token.max.lifetime.ms=1200000 (20 min)
* delegation.token.expiry.check.interval.ms=300000 (5 min)

After each 7.5 minutes new delegation token obtained from Kafka broker (10 min * 0.75).
But when token expired after 10 minutes (Spark obtains new one and doesn't renew the old), the brokers expiring thread comes after each 5 minutes (invalidates expired tokens) and artificial exception has been thrown inside the Spark application (such case Spark closes connection), then the latest delegation token not always picked up.

Closes #23906 from gaborgsomogyi/SPARK-27002.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-02-27 10:07:02 -08:00
Gabor Somogyi c4bbfd177b [SPARK-24063][SS] Add maximum epoch queue threshold for ContinuousExecution
## What changes were proposed in this pull request?

Continuous processing is waiting on epochs which are not yet complete (for example one partition is not making progress) and stores pending items in queues. These queues are unbounded and can consume up all the memory easily. In this PR I've added `spark.sql.streaming.continuous.epochBacklogQueueSize` configuration possibility to make them bounded. If the related threshold reached then the query will stop with `IllegalStateException`.

## How was this patch tested?

Existing + additional unit tests.

Closes #23156 from gaborgsomogyi/SPARK-24063.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-02-27 09:52:43 -08:00
Marcelo Vanzin a6ddc9d083 [SPARK-24736][K8S] Let spark-submit handle dependency resolution.
Before this change, there was some code in the k8s backend to deal
with how to resolve dependencies and make them available to the
Spark application. It turns out that none of that code is necessary,
since spark-submit already handles all that for applications started
in client mode - like the k8s driver that is run inside a Spark-created
pod.

For that reason, specifically for pyspark, there's no need for the
k8s backend to deal with PYTHONPATH; or, in general, to change the URIs
provided by the user at all. spark-submit takes care of that.

For testing, I created a pyspark script that depends on another module
that is shipped with --py-files. Then I used:

- --py-files http://.../dep.py http://.../test.py
- --py-files http://.../dep.zip http://.../test.py
- --py-files local:/.../dep.py local:/.../test.py
- --py-files local:/.../dep.zip local:/.../test.py

Without this change, all of the above commands fail. With the change, the
driver is able to see the dependencies in all the above cases; but executors
don't see the dependencies in the last two. That's a bug in shared Spark code
that deals with local: dependencies in pyspark (SPARK-26934).

I also tested a Scala app using the main jar from an http server.

Closes #23793 from vanzin/SPARK-24736.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-02-27 09:49:31 -08:00
Hyukjin Kwon a67e8426e3 [SPARK-27000][PYTHON] Upgrades cloudpickle to v0.8.0
## What changes were proposed in this pull request?

After upgrading cloudpickle to 0.6.1 at https://github.com/apache/spark/pull/20691, one regression was found. Cloudpickle had a critical https://github.com/cloudpipe/cloudpickle/pull/240 for that.

Basically, it currently looks existing globals would override globals shipped in a function's, meaning:

**Before:**

```python
>>> def hey():
...     return "Hi"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Hi']
>>> def hey():
...     return "Yeah"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Hi']
```

**After:**

```python
>>> def hey():
...     return "Hi"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Hi']
>>>
>>> def hey():
...     return "Yeah"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Yeah']
```

Therefore, this PR upgrades cloudpickle to 0.8.0.

Note that cloudpickle's release cycle is quite short.

Between 0.6.1 and 0.7.0, it contains minor bug fixes. I don't see notable changes to double check and/or avoid.

There is virtually only this fix between 0.7.0 and 0.8.1 - other fixes are about testing.

## How was this patch tested?

Manually tested, tests were added. Verified unit tests were added in cloudpickle.

Closes #23904 from HyukjinKwon/SPARK-27000.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-02-28 02:33:10 +09:00
Luca Canali 5fd28e8f5c [SPARK-26890][DOC] Add list of available Dropwizard metrics in Spark and add additional configuration details to the monitoring documentation
## What changes were proposed in this pull request?

This PR proposes to extend the documentation of the Spark metrics system in the monitoring guide. In particular by:
- adding a list of the available metrics grouped per component instance
- adding information on configuration parameters that can be used to configure the metrics system in alternative to the metrics.properties file
- adding information on the configuration parameters needed to enable certain metrics
- it also propose to add an example of Graphite sink configuration in metrics.properties.template

Closes #23798 from LucaCanali/metricsDocUpdate.

Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-27 10:07:15 -06:00
Oliver Urs Lenz 28e1695e17 [SPARK-26803][PYTHON] Add sbin subdirectory to pyspark
## What changes were proposed in this pull request?

Modifies `setup.py` so that `sbin` subdirectory is included in pyspark

## How was this patch tested?

Manually tested with python 2.7 and python 3.7

```sh
$ ./build/mvn -D skipTests -P hive -P hive-thriftserver -P yarn -P mesos clean package
$ cd python
$ python setup.py sdist
$ pip install  dist/pyspark-2.1.0.dev0.tar.gz
```

Checked manually that `sbin` is now present in install directory.

srowen holdenk

Closes #23715 from oulenz/pyspark_sbin.

Authored-by: Oliver Urs Lenz <oliver.urs.lenz@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-27 08:39:55 -06:00
liuxian 7912dbb88f [MINOR] Simplify boolean expression
## What changes were proposed in this pull request?

Comparing whether Boolean expression is equal to true is redundant
For example:
The datatype of `a` is boolean.
Before:
if (a == true)
After:
if (a)

## How was this patch tested?
N/A

Closes #23884 from 10110346/simplifyboolean.

Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-27 08:38:00 -06:00
Maxim Gekk b0450d07bd [SPARK-26902][SQL] Support java.time.Instant as an external type of TimestampType
## What changes were proposed in this pull request?

In the PR, I propose to add new Catalyst type converter for `TimestampType`. It should be able to convert `java.time.Instant` to/from `TimestampType`.

Main motivations for the changes:
- Smoothly support Java 8 time API
- Avoid inconsistency of calendars used inside of Spark 3.0 (Proleptic Gregorian calendar) and `java.sql.Timestamp` (hybrid calendar - Julian + Gregorian).
- Make conversion independent from current system timezone.

By default, Spark converts values of `TimestampType` to `java.sql.Timestamp` instances but the SQL config `spark.sql.catalyst.timestampType` can change the behavior. It accepts two values `Timestamp` (default) and `Instant`. If the former one is set, Spark returns `java.time.Instant` instances for timestamp values.

## How was this patch tested?

Added new testes to `CatalystTypeConvertersSuite` to check conversion of `TimestampType` to/from `java.time.Instant`.

Closes #23811 from MaxGekk/timestamp-instant.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-27 21:05:19 +08:00
Gengliang Wang 95e55720d4 [SPARK-26990][SQL] FileIndex: use user specified field names if possible
## What changes were proposed in this pull request?

WIth the following file structure:
```
/tmp/data
└── a=5
```

In the previous release:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- A: integer (nullable = true)
```

While in current code:
```
scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema
root
 |-- ID: long (nullable = true)
 |-- a: integer (nullable = true)
```

We can see that the partition column name `a` is different from `A` as user specifed. This PR is to fix the case and make it more user-friendly.

## How was this patch tested?

Unit test

Closes #23894 from gengliangwang/fileIndexSchema.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-27 14:38:35 +08:00
Jungtaek Lim (HeartSaVioR) dea18ee85b [SPARK-22000][SQL] Address missing Upcast in JavaTypeInference.deserializerFor
## What changes were proposed in this pull request?

Spark expects the type of column and the type of matching field is same when deserializing to Object, but Spark hasn't actually restrict it (at least for Java bean encoder) and some users just do it and experience undefined behavior (in SPARK-22000, Spark throws compilation failure on generated code because it calls `.toString()` against primitive type.

It doesn't produce error in Scala side because `ScalaReflection.deserializerFor` properly inject Upcast if necessary. This patch proposes applying same thing to `JavaTypeInference.deserializerFor` as well.

Credit to srowen, maropu, and cloud-fan since they provided various approaches to solve this.

## How was this patch tested?

Added UT which query is slightly modified based on sample code in attachment on JIRA issue.

Closes #23854 from HeartSaVioR/SPARK-22000.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-27 13:47:20 +08:00
Hyukjin Kwon 88bc481b9e [SPARK-26830][SQL][R] Vectorized R dapply() implementation
## What changes were proposed in this pull request?

This PR targets to add vectorized `dapply()` in R, Arrow optimization.

This can be tested as below:

```bash
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
df <- createDataFrame(mtcars)
collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double")))
```

### Requirements
  - R 3.5.x
  - Arrow package 0.12+
    ```bash
    Rscript -e 'remotes::install_github("apache/arrowapache-arrow-0.12.0", subdir = "r")'
    ```

**Note:** currently, Arrow R package is not in CRAN. Please take a look at ARROW-3204.
**Note:** currently, Arrow R package seems not supporting Windows. Please take a look at ARROW-3204.

### Benchmarks

**Shall**

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g
```

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g
```

**R code**

```r
rdf <- read.csv("500000.csv")
df <- cache(createDataFrame(rdf))
count(df)

test <- function() {
  options(digits.secs = 6) # milliseconds
  start.time <- Sys.time()
  count(cache(dapply(df, function(rdf) { rdf }, schema(df))))
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()
```

**Data (350 MB):**

```r
object.size(read.csv("500000.csv"))
350379504 bytes
```

"500000 Records"  http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

**Results**

```
Time difference of 13.42037 mins
```

```
Time difference of 30.64156 secs
```

The performance improvement was around **2627%**.

### Limitations

- For now, Arrow optimization with R does not support when the data is `raw`, and when user explicitly gives float type in the schema. They produce corrupt values.

- Due to ARROW-4512, it cannot send and receive batch by batch. It has to send all batches in Arrow stream format at once. It needs improvement later.

## How was this patch tested?

Unit tests were added, and manually tested.

Closes #23787 from HyukjinKwon/SPARK-26830-1.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-02-27 14:29:58 +09:00
Liang-Chi Hsieh 0f2c0b53e8 [SPARK-26837][SQL] Pruning nested fields from object serializers
## What changes were proposed in this pull request?

In SPARK-26619, we make change to prune unnecessary individual serializers when serializing objects. This is extension to SPARK-26619. We can further prune nested fields from object serializers if they are not used.

For example, in following query, we only use one field in a struct column:

```scala
val data = Seq((("a", 1), 1), (("b", 2), 2), (("c", 3), 3))
val df = data.toDS().map(t => (t._1, t._2 + 1)).select("_1._1")
```

So, instead of having a serializer to create a two fields struct, we can prune unnecessary field from it. This is what this PR proposes to do.

In order to make this change conservative and safer, a SQL config is added to control it. It is disabled by default.

TODO: Support to prune nested fields inside MapType's key and value.

## How was this patch tested?

Added tests.

Closes #23740 from viirya/nested-pruning-serializer-2.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-27 12:45:24 +08:00
Sean Owen 9c283662c6 [SPARK-26986][ML] Add JAXB reference impl to build for Java 9+
## What changes were proposed in this pull request?

Add reference JAXB impl for Java 9+ from Glassfish. Right now it's only apparently necessary in MLlib but can be expanded later.

## How was this patch tested?

Existing tests particularly PMML-related ones, which use JAXB.
This works on Java 11.

Closes #23890 from srowen/SPARK-26986.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-26 18:26:49 -06:00
Hellsen83 387efe29b7 [SPARK-26449][PYTHON] Add transform method to DataFrame API
## What changes were proposed in this pull request?

Added .transform() method to Python DataFrame API to be in sync with Scala API.

## How was this patch tested?

Addition has been tested manually.

Closes #23877 from Hellsen83/pyspark-dataframe-transform.

Authored-by: Hellsen83 <erik.christiansen83@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-26 18:22:36 -06:00
Jungtaek Lim (HeartSaVioR) c17150a5f5 [SPARK-22860][CORE][YARN] Redact command line arguments for running Driver and Executor before logging (standalone and YARN)
## What changes were proposed in this pull request?

This patch applies redaction to command line arguments before logging them. This applies to two resource managers: standalone cluster and YARN.

This patch only concerns about arguments starting with `-D` since Spark is likely passing the Spark configuration to command line arguments as `-Dspark.blabla=blabla`. More change is necessary if we also want to handle the case of `--conf spark.blabla=blabla`.

## How was this patch tested?

Added UT for redact logic. This patch only touches how to log so not easy to add UT regarding it.

Closes #23820 from HeartSaVioR/MINOR-redact-command-line-args-for-running-driver-executor.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-02-26 14:49:46 -08:00
Marcelo Vanzin afbff6446f Revert "[SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2"
This reverts commit a3192d966a.
2019-02-26 13:42:07 -08:00
Maxim Gekk a2a41b7bf2 [SPARK-26978][CORE][SQL] Avoid magic time constants
## What changes were proposed in this pull request?

In the PR, I propose to refactor existing code related to date/time conversions, and replace constants like `1000` and `1000000` by `DateTimeUtils` constants and transformation functions from `java.util.concurrent.TimeUnit._`.

## How was this patch tested?

The changes are tested by existing test suites.

Closes #23878 from MaxGekk/magic-time-constants.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-26 09:08:12 -06:00
seancxmao 3d2e55abd0 [MINOR][DOCS] Remove Akka leftover
## What changes were proposed in this pull request?
Since Spark 2.0, Akka is not used anymore and Akka related stuff were removed. However there are still some leftover. This PR aims to remove these leftover.

* `/pom.xml` has a comment about Akka, which is not needed anymore.

## How was this patch tested?
Existing tests.

Closes #23885 from seancxmao/remove-akka-leftover.

Authored-by: seancxmao <seancxmao@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-26 08:31:02 -06:00
Xianyang Liu bc03c8b3fa [SPARK-26952][SQL] Row count statics should respect the data reported by data source
## What changes were proposed in this pull request?

In data source v2, if the data source scan implemented `SupportsReportStatistics`. `DataSourceV2Relation` should respect the row count reported by the data source.

## How was this patch tested?

New UT test.

Closes #23853 from ConeyLiu/report-row-count.

Authored-by: Xianyang Liu <xianyang.liu@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-02-26 14:10:54 +08:00
liupengcheng 52a180f25f [SPARK-26674][CORE] Consolidate CompositeByteBuf when reading large frame
## What changes were proposed in this pull request?

Currently, TransportFrameDecoder will not consolidate the buffers read from network which may cause memory waste. Actually, bytebuf's writtenIndex is far less than it's capacity  in most cases, so we can optimize it by doing consolidation.

This PR will do this optimization.

Related codes:
9a30e23211/common/network-common/src/main/java/org/apache/spark/network/util/TransportFrameDecoder.java (L143)

## How was this patch tested?

UT

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes #23602 from liupc/Reduce-memory-consumption-in-TransportFrameDecoder.

Lead-authored-by: liupengcheng <liupengcheng@xiaomi.com>
Co-authored-by: Liupengcheng <liupengcheng@xiaomi.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-02-25 16:40:46 -08:00
Gengliang Wang 4baa2d4449 [SPARK-26673][FOLLOWUP][SQL] File Source V2: check existence of output path before delete it
## What changes were proposed in this pull request?
This is a followup PR to resolve comment: https://github.com/apache/spark/pull/23601#pullrequestreview-207101115

When Spark writes DataFrame with "overwrite" mode, it deletes the output path before actual writes. To safely handle the case that the output path doesn't exist,  it is suggested to follow the V1 code by checking the existence.

## How was this patch tested?

Apply https://github.com/apache/spark/pull/23836 and run unit tests

Closes #23889 from gengliangwang/checkFileBeforeOverwrite.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-02-25 16:20:06 -08:00
Ilya Matiach b66be0e490 [SPARK-24103][ML][MLLIB] ML Evaluators should use weight column - added weight column for binary classification evaluator
## What changes were proposed in this pull request?

The evaluators BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator and the corresponding metrics classes BinaryClassificationMetrics, RegressionMetrics and MulticlassMetrics should use sample weight data.

I've closed the PR: https://github.com/apache/spark/pull/16557
as recommended in favor of creating three pull requests, one for each of the evaluators (binary/regression/multiclass) to make it easier to review/update.

## How was this patch tested?
I added tests to the metrics and evaluators classes.

Closes #17084 from imatiach-msft/ilmat/binary-evalute.

Authored-by: Ilya Matiach <ilmat@microsoft.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-25 17:16:51 -06:00
Marcelo Vanzin 4808393449 [SPARK-26788][YARN] Remove SchedulerExtensionService.
Since the yarn module is actually private to Spark, this interface was never
actually "public". Since it has no use inside of Spark, let's avoid adding
a yarn-specific extension that isn't public, and point any potential users
are more general solutions (like using a SparkListener).

Closes #23839 from vanzin/SPARK-26788.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-25 13:57:37 -06:00
“attilapiros” 0ac516bebd [SPARK-25035][CORE] Avoiding memory mapping at disk-stored blocks replication
Before this PR the method `BlockManager#putBlockDataAsStream()` (which is used during block replication where the block data is received as a stream) was reading the whole block content into the memory even at DISK_ONLY storage level.

With this change the received block data (which was temporary stored in a file) is just simply moved into the right location backing the target block. This way a possible OOM error is avoided.

In this implementation to save code duplications the method `doPutBytes` is refactored into a template method called `BlockStoreUpdater` which has a separate implementation to handle byte buffer based and temporary file based block store updates.

With existing unit tests of `DistributedSuite` (the ones dealing with replications):
- caching on disk, replicated (encryption = off) (with replication as stream)
- caching on disk, replicated (encryption = on) (with replication as stream)
- caching in memory, serialized, replicated (encryption = on) (with replication as stream)
- caching in memory, serialized, replicated (encryption = off) (with replication as stream)
- etc.

And with new unit tests testing `putBlockDataAsStream` method directly:
- test putBlockDataAsStream with caching (encryption = off)
- test putBlockDataAsStream with caching (encryption = on)
- test putBlockDataAsStream with caching on disk (encryption = off)
- test putBlockDataAsStream with caching on disk (encryption = on)

Closes #23688 from attilapiros/SPARK-25035.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-02-25 11:43:39 -08:00
Jungtaek Lim (HeartSaVioR) c5de804093 [MINOR][BUILD] Update all checkstyle dtd to use "https://checkstyle.org"
## What changes were proposed in this pull request?

Below build failed with Java checkstyle test, but instead of violation it shows FileNotFound on dtd file.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102751/

Looks like the link of dtd file is dead `http://www.puppycrawl.com/dtds/configuration_1_3.dtd`.

This patch updates the dtd link to "https://checkstyle.org/dtds/" given checkstyle repository also updated the URL path.
https://github.com/checkstyle/checkstyle/issues/5601

## How was this patch tested?

Checked the new links.

Closes #23887 from HeartSaVioR/java-checkstyle-dtd-change-url.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-02-25 11:25:53 -08:00
Jiaxin Shan a3192d966a [SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2
## What changes were proposed in this pull request?
Changed the `kubernetes-client` version to 4.1.2.  Latest version fix error with exec credentials (used by aws eks) and this will be used to talk with kubernetes API server. Users can submit spark job to EKS api endpoint now with this patch.

## How was this patch tested?
unit tests and manual tests.

Closes #23814 from Jeffwan/update_k8s_sdk.

Authored-by: Jiaxin Shan <seedjeffwan@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-25 04:56:04 -06:00
Sean Owen d2529788ed [SPARK-26966][ML] Update to JPMML 1.4.8
## What changes were proposed in this pull request?

JPMML apparently only supports Java 9 in 1.4.2+. We are seeing text failures from JPMML relating to JAXB when running on Java 11. It's shaded and not a big change, so should be safe.

## How was this patch tested?

Existing tests.

Closes #23868 from srowen/SPARK-26966.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-25 04:37:45 -06:00
Maxim Gekk 2d2fb34b93 [SPARK-26953][CORE][TEST] Test TimSort for ArrayIndexOutOfBoundsException
## What changes were proposed in this pull request?

In the PR, I propose to test the input showed at the end of the article: https://arxiv.org/pdf/1805.08612.pdf . The difference of the test and paper's test is type of array. This test allocates arrays of bytes instead of array of ints.

## How was this patch tested?

New test is added to `SorterSuite`.

Closes #23856 from MaxGekk/timsort-bug-fix.

Authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-24 17:37:32 -06:00
Douglas R Colkitt faa61980c4 [SPARK-26935][SQL] Skip DataFrameReader's CSV first line scan when not used
Prior to this patch, all DataFrameReader.csv() calls would collect the first
line from the CSV input iterator. This is done to allow schema inference from the
header row.

However when schema is already specified this is a wasteful operation. It results
in an unncessary compute step on the first partition. This can be expensive if
the CSV itself is expensive to generate (e.g. it's the product of a long-running
external pipe()).

This patch short-circuits the first-line collection in DataFrameReader.csv() when
schema is specified. Thereby improving CSV read performance in certain cases.

## What changes were proposed in this pull request?

Short-circuiting DataFrameReader.csv() first-line read when schema is user-specified.

## How was this patch tested?

Compiled and tested against several CSV datasets.

Closes #23830 from Mister-Meeseeks/master.

Authored-by: Douglas R Colkitt <douglas.colkitt@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-23 14:00:57 -06:00
Maxim Gekk 75c48ac36d [SPARK-26908][SQL] Fix DateTimeUtils.toMillis and millisToDays
## What changes were proposed in this pull request?

The `DateTimeUtils.toMillis` can produce inaccurate result for some negative values (timestamps before epoch). The error can be around 1ms. In the PR, I propose to use `Math.floorDiv` in casting microseconds to milliseconds, and milliseconds to days since epoch.

## How was this patch tested?

Added new test to `DateTimeUtilsSuite`, and tested by `CastSuite` as well.

Closes #23815 from MaxGekk/micros-to-millis.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-23 11:35:11 -06:00