Commit graph

502 commits

Author SHA1 Message Date
Kazuaki Ishizaki 8ae004b460 [SPARK-22688][SQL] Upgrade Janino version to 3.0.8
## What changes were proposed in this pull request?

This PR upgrades the Janino version to 3.0.8. [Janino 3.0.8](https://janino-compiler.github.io/janino/changelog.html) includes an important fix that reduces the number of constant pool entries by using the 'sipush' Java bytecode.

* SIPUSH bytecode is not used for short integer constant [#33](https://github.com/janino-compiler/janino/issues/33).

Please see details in [this discussion thread](https://github.com/apache/spark/pull/19518#issuecomment-346674976).

## How was this patch tested?

Existing tests

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #19890 from kiszk/SPARK-22688.
2017-12-06 16:15:25 -08:00
smurakozi 9948b860ac [SPARK-22516][SQL] Bump up Univocity version to 2.5.9
## What changes were proposed in this pull request?

There was a bug in Univocity Parser that causes the issue in SPARK-22516. This was fixed by upgrading the library from version 2.5.4 to 2.5.9:

**Executing**
```
spark.read.option("header","true").option("inferSchema", "true").option("multiLine", "true").option("comment", "g").csv("test_file_without_eof_char.csv").show()
```
**Before**
```
ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 6)
com.univocity.parsers.common.TextParsingException: java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of input reached
...
Internal state when error was thrown: line=3, column=0, record=2, charIndex=31
	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
	at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
```
**After**
```
+-------+-------+
|column1|column2|
+-------+-------+
|    abc|    def|
+-------+-------+
```

## How was this patch tested?
The existing `CSVSuite.commented lines in CSV data` test was extended to also parse the file in multiline mode. The test input file was modified to also include a comment in the last line.

Author: smurakozi <smurakozi@gmail.com>

Closes #19906 from smurakozi/SPARK-22516.
2017-12-06 13:22:08 -08:00
Sean Owen d2cf95aa63 [SPARK-22634][BUILD] Update Bouncy Castle to 1.58
## What changes were proposed in this pull request?

Update Bouncy Castle to 1.58, and jets3t to 0.9.4 to (sort of) match.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #19859 from srowen/SPARK-22634.
2017-12-02 07:37:02 -06:00
Min Shen 7da1f5708c [SPARK-22373] Bump Janino dependency version to fix thread safety issue…
… with Janino when compiling generated code.

## What changes were proposed in this pull request?

Bump up the Janino dependency version to fix a thread safety issue when compiling generated code.

## How was this patch tested?

Check https://issues.apache.org/jira/browse/SPARK-22373 for details.
Converted part of the code in CodeGenerator into a standalone application, so the issue can be consistently reproduced locally.
Verified that changing the Janino dependency version resolved this issue.

Author: Min Shen <mshen@linkedin.com>

Closes #19839 from Victsm/SPARK-22373.
2017-11-30 19:24:44 -06:00
Stavros Kontopoulos 193555f79c [SPARK-18935][MESOS] Fix dynamic reservations on mesos
## What changes were proposed in this pull request?

- Solves the issue described in the ticket by preserving reservation and allocation info in all cases (port handling included).
- Upgrades the Mesos dependency to 1.4.
- Adds extra debug-level logging to make debugging easier in the future; for example, we add reservation info when applicable.
```
 17/09/29 14:53:07 DEBUG MesosCoarseGrainedSchedulerBackend: Accepting offer: f20de49b-dee3-45dd-a3c1-73418b7de891-O32 with attributes: Map() allocation info: role: "spark-prive"
  reservation info: name: "ports"
 type: RANGES
 ranges {
   range {
     begin: 31000
     end: 32000
   }
 }
 role: "spark-prive"
 reservation {
   principal: "test"
 }
 allocation_info {
   role: "spark-prive"
 }
```
- Some style cleanup.

## How was this patch tested?

Manually, by running the example in the ticket with and without a principal. Specifically, I tested it on a DC/OS 1.10 cluster with 7 nodes and played with reservations. From the master node, in order to reserve resources, I executed:

```
for i in 0 1 2 3 4 5 6
do
curl -i \
      -d slaveId=90ec65ea-1f7b-479f-a824-35d2527d6d26-S$i \
      -d resources='[
        {
          "name": "cpus",
          "type": "SCALAR",
          "scalar": { "value": 2 },
          "role": "spark-role",
          "reservation": {
            "principal": ""
          }
        },
        {
          "name": "mem",
          "type": "SCALAR",
          "scalar": { "value": 8026 },
          "role": "spark-role",
          "reservation": {
            "principal": ""
          }
        }
      ]' \
      -X POST http://master.mesos:5050/master/reserve
done
```
Nodes had 4 CPUs (m3.xlarge instances) and I reserved either 2 or 4 CPUs (all for a role).
I verified it launches tasks on nodes with reserved resources under the `spark-role` role only if:
a) there are remaining resources for the default (*) role and the Spark driver has no role assigned to it, or
b) the Spark driver has a role assigned to it and it is the same role used in the reservations.
I also tested this locally on my machine.

Author: Stavros Kontopoulos <st.kontopoulos@gmail.com>

Closes #19390 from skonto/fix_dynamic_reservation.
2017-11-29 14:15:35 -08:00
Yinan Li e9b2070ab2 [SPARK-18278][SCHEDULER] Spark on Kubernetes - Basic Scheduler Backend
## What changes were proposed in this pull request?

This is a stripped down version of the `KubernetesClusterSchedulerBackend` for Spark with the following components:
- Static Allocation of Executors
- Executor Pod Factory
- Executor Recovery Semantics

It's step 1 from the step-wise plan documented [here](https://github.com/apache-spark-on-k8s/spark/issues/441#issuecomment-330802935).
This addition is covered by the [SPIP vote](http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-td22147.html), which passed on Aug 31.

## How was this patch tested?

- The patch contains unit tests which are passing.
- Manual testing: `./build/mvn -Pkubernetes clean package` succeeded.
- It is a **subset** of the entire changelist hosted in http://github.com/apache-spark-on-k8s/spark which is in active use in several organizations.
- There is integration testing enabled in the fork currently [hosted by PepperData](spark-k8s-jenkins.pepperdata.org:8080) which is being moved over to RiseLAB CI.
- Detailed documentation on trying out the patch in its entirety is in: https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html

cc rxin felixcheung mateiz (shepherd)
k8s-big-data SIG members & contributors: mccheah ash211 ssuchter varunkatta kimoonkim erikerlandson liyinan926 tnachen ifilonenko

Author: Yinan Li <liyinan926@gmail.com>
Author: foxish <ramanathana@google.com>
Author: mcheah <mcheah@palantir.com>

Closes #19468 from foxish/spark-kubernetes-3.
2017-11-28 23:02:09 -08:00
Ilya Matiach 1edb3175d8 [SPARK-21866][ML][PYSPARK] Adding spark image reader
## What changes were proposed in this pull request?
Adding a Spark image reader: an implementation of a schema for representing images in Spark DataFrames.

The code is taken from the spark package located here:
(https://github.com/Microsoft/spark-images)

Please see the JIRA for more information (https://issues.apache.org/jira/browse/SPARK-21866)

Please see mailing list for SPIP vote and approval information:
(http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-SPARK-21866-Image-support-in-Apache-Spark-td22510.html)

# Background and motivation
As Apache Spark is being used more and more in the industry, some new use cases are emerging for different data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep Learning applications commonly deal with image processing. A number of projects add some Deep Learning capabilities to Spark (see list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts for representing images in Spark by defining a representation that caters to the most common needs of users and library developers.
This SPIP proposes a specification to represent images in Spark DataFrames and Datasets (based on existing industrial standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on. Several packages already offer various processing facilities for transforming images or doing more complex operations, and each has various design tradeoffs that make them better as standalone solutions.
This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open source packages: MMLSpark and Deep Learning Pipelines.
The proposed image format is an in-memory, decompressed representation that targets low-level applications. It is significantly more liberal in memory usage than compressed image representations such as JPEG, PNG, etc., but it allows easy communication with popular image processing libraries and has no decoding overhead.
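
As a quick illustration of the resulting DataFrame, here is a minimal PySpark sketch; the `pyspark.ml.image.ImageSchema.readImages` entry point, the `image` struct fields, and the path are assumptions for illustration and may differ from the final API.

```python
# Minimal sketch (assumed API surface): load a directory of images into a
# DataFrame using the image schema described above. The path is a placeholder.
from pyspark.sql import SparkSession
from pyspark.ml.image import ImageSchema

spark = SparkSession.builder.appName("image-schema-sketch").getOrCreate()

# Each row is expected to carry a single struct column "image" with fields such
# as origin, height, width, nChannels, mode and the raw decompressed data.
image_df = ImageSchema.readImages("/tmp/images")
image_df.printSchema()
image_df.select("image.origin", "image.height", "image.width").show(truncate=False)
```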

## How was this patch tested?

Unit tests in Scala (`ImageSchemaSuite`) and unit tests in Python.

Author: Ilya Matiach <ilmat@microsoft.com>
Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19439 from imatiach-msft/ilmat/spark-images.
2017-11-22 15:45:45 -08:00
Sean Owen b009722591 [SPARK-22511][BUILD] Update maven central repo address
## What changes were proposed in this pull request?

Use repo.maven.apache.org repo address; use latest ASF parent POM version 18

## How was this patch tested?

Existing tests; no functional change

Author: Sean Owen <sowen@cloudera.com>

Closes #19742 from srowen/SPARK-22511.
2017-11-14 17:58:07 -06:00
hyukjinkwon c8b7f97b8a [SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exists in release-build.sh
## What changes were proposed in this pull request?

This PR proposes to use `/usr/sbin/lsof` if `lsof` is missing from the path, to fix the nightly snapshot Jenkins jobs. Please refer to https://github.com/apache/spark/pull/19359#issuecomment-340139557:

> Looks like some of the snapshot builds are having lsof issues:
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console
>
>https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.2-maven-snapshots/134/console
>
>spark-build/dev/create-release/release-build.sh: line 344: lsof: command not found
>usage: kill [ -s signal | -p ] [ -a ] pid ...
>kill -l [ signal ]

To my knowledge, the full path of `lsof` is required for non-root users on a few OSes.

## How was this patch tested?

Manually tested as below:

```bash
#!/usr/bin/env bash

LSOF=lsof
if ! hash $LSOF 2>/dev/null; then
  echo "a"
  LSOF=/usr/sbin/lsof
fi

$LSOF -P | grep "a"
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19695 from HyukjinKwon/SPARK-22377.
2017-11-14 08:28:13 +09:00
hyukjinkwon 160a540610 [SPARK-22376][TESTS] Makes dev/run-tests.py script compatible with Python 3
## What changes were proposed in this pull request?

This PR proposes to fix `dev/run-tests.py` script to support Python 3.

Here is some background. To my knowledge:

In Python 2,
- `unicode` is NOT `str` in Python 2 (`type("foo") != type(u"foo")`).
- `str` has an alias, `bytes` in Python 2 (`type("foo") == type(b"foo")`).

In Python 3,
- `unicode` was (roughly) replaced by `str` in Python 3 (`type("foo") == type(u"foo")`).
- `str` is NOT `bytes` in Python 3 (`type("foo") != type(b"foo")`).

So, this PR fixes:

  1. Use `b''` instead of `''` so that both `str` in Python 2 and `bytes` in Python 3 can be handled. `sbt_proc.stdout.readline()` returns `str` (which has an alias, `bytes`) in Python 2 and `bytes` in Python 3.

  2. Similarly, use `b''` instead of `''` so that both `str` in Python 2 and `bytes` in Python 3 can be handled. `re.compile` with a `str` pattern does not seem to support matching `bytes` in Python 3.

Actually, this change is recommended, to my knowledge; see https://docs.python.org/3/howto/pyporting.html#text-versus-binary-data:

> Mark all binary literals with a b prefix, textual literals with a u prefix
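
A minimal, self-contained sketch of the kind of change described in items 1 and 2 above (not the exact diff in this PR; the sbt-style log line is made up for illustration):

```python
# Sketch: read subprocess output as bytes and match it with a bytes pattern,
# so the same code runs on both Python 2 and Python 3.
import re
import subprocess

proc = subprocess.Popen(["echo", "[info] Compiling 1 Scala source"],
                        stdout=subprocess.PIPE)

pattern = re.compile(br"\[info\]")            # b'' pattern so it can match bytes

for line in iter(proc.stdout.readline, b""):  # b'' sentinel instead of ''
    if pattern.search(line):
        print(line.decode("utf-8").rstrip())
```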

## How was this patch tested?

I manually tested this via Python 3 with a few additional changes to reduce the elapsed time.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19665 from HyukjinKwon/SPARK-22376.
2017-11-07 19:45:34 +09:00
Sital Kedia 444bce1c98 [SPARK-19112][CORE] Support for ZStandard codec
## What changes were proposed in this pull request?

Using zstd compression for Spark jobs spilling 100s of TBs of data, we could reduce the amount of data written to disk by as much as 50%. This translates to a significant latency gain because of reduced disk I/O operations. There is a 2-5% degradation in CPU time because of zstd compression overhead, but for jobs which are bottlenecked by disk I/O, this hit can be taken.
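
For reference, a minimal PySpark sketch of opting in to the new codec; the `zstd` short name for `spark.io.compression.codec` is an assumption for illustration and may differ from the final configuration value.

```python
# Sketch: enable ZStandard for shuffle/spill compression (codec name assumed).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("zstd-compression-sketch")
    .config("spark.io.compression.codec", "zstd")  # default is lz4
    .getOrCreate()
)

# Any shuffle-heavy job now writes zstd-compressed shuffle/spill files.
spark.range(0, 10000000).repartition(200).count()
```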

## Benchmark
Please note that this benchmark uses a real-world, compute-heavy production workload spilling TBs of data to disk.

|         | zstd performance as compared to LZ4   |
| ------------- | -----:|
| spill/shuffle bytes    | -48% |
| cpu time    |    + 3% |
| cpu reservation time       |    -40%|
| latency     |     -40% |

## How was this patch tested?

Tested by running a few jobs spilling large amounts of data on the cluster; the amount of intermediate data written to disk was reduced by as much as 50%.

Author: Sital Kedia <skedia@fb.com>

Closes #18805 from sitalkedia/skedia/upstream_zstd.
2017-11-01 14:54:08 +01:00
Xin Lu 544a1ba678 [SPARK-22375][TEST] Test script can fail if eggs are installed by set…
…up.py during test process

## What changes were proposed in this pull request?

Ignore the python/.eggs folder when running lint-python

## How was this patch tested?
1) Put a bad Python file in python/.eggs and ran the original script. Results were:

xins-MBP:spark xinlu$ dev/lint-python
PEP8 checks failed.
./python/.eggs/worker.py:33:4: E121 continuation line under-indented for hanging indent
./python/.eggs/worker.py:34:5: E131 continuation line unaligned for hanging indent

2) Tested the same situation with the change:

xins-MBP:spark xinlu$ dev/lint-python
PEP8 checks passed.
The sphinx-build command was not found. Skipping pydoc checks for now

Author: Xin Lu <xlu@salesforce.com>

Closes #19597 from xynny/SPARK-22375.
2017-10-29 15:29:23 +09:00
hyukjinkwon ff8de99a1c [SPARK-22302][INFRA] Remove manual backports for subprocess and print explicit message for < Python 2.7
## What changes were proposed in this pull request?

It seems there was a mistake, a missing import for `subprocess.call`, introduced while refactoring this script a long time ago; `call` is used by the backports of some functions missing from `subprocess` in Python < 2.7.

Reproduction is:

```
cd dev && python2.6
```

```
>>> from sparktestsupport import shellutils
>>> shellutils.subprocess_check_call("ls")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sparktestsupport/shellutils.py", line 46, in subprocess_check_call
    retcode = call(*popenargs, **kwargs)
NameError: global name 'call' is not defined
```

For Jenkins logs, please see https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3950/console

Since we dropped Python 2.6.x support, it looks better to remove those workarounds and print out explicit error messages, in order to reduce the effort of finding out the root causes in such cases, for example `https://github.com/apache/spark/pull/19513#issuecomment-337406734`.
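
A minimal sketch of the kind of explicit version check described above (the exact placement and exit code in the scripts may differ):

```python
# Sketch: fail fast with an explicit message instead of an obscure NameError
# when the interpreter is older than Python 2.7.
import sys

if sys.version_info < (2, 7):
    print("Python versions prior to 2.7 are not supported.")
    sys.exit(-1)
```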

## How was this patch tested?

Manually tested:

```
./dev/run-tests
```

```
Python versions prior to 2.7 are not supported.
```

```
./dev/run-tests-jenkins
```

```
Python versions prior to 2.7 are not supported.
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19524 from HyukjinKwon/SPARK-22302.
2017-10-22 02:22:35 +09:00
Dongjoon Hyun 6f1d0dea1c [SPARK-22300][BUILD] Update ORC to 1.4.1
## What changes were proposed in this pull request?

Apache ORC 1.4.1 was released yesterday.
- https://orc.apache.org/news/2017/10/16/ORC-1.4.1/

Like ORC-233 (Allow `orc.include.columns` to be empty), there are several important fixes.
This PR updates the Apache ORC dependency to use the latest one, 1.4.1.

## How was this patch tested?

Passes the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #19521 from dongjoon-hyun/SPARK-22300.
2017-10-19 13:30:55 +08:00
Sean Owen 0c03297bf0 [SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile, take 2
## What changes were proposed in this pull request?

Move flume behind a profile, take 2. See https://github.com/apache/spark/pull/19365 for most of the back-story.

This change should fix the problem by removing the examples module dependency and moving Flume examples to the module itself. It also adds deprecation messages, per a discussion on dev about deprecating for 2.3.0.

## How was this patch tested?

Existing tests, which still enable flume integration.

Author: Sean Owen <sowen@cloudera.com>

Closes #19412 from srowen/SPARK-22142.2.
2017-10-06 15:08:28 +01:00
hyukjinkwon 02c91e03f9 [SPARK-22063][R] Fixes lint check failures in R by latest commit sha1 ID of lint-r
## What changes were proposed in this pull request?

Currently, we set lintr to jimhester/lintr@a769c0b (see [this](7d1175011c) and [SPARK-14074](https://issues.apache.org/jira/browse/SPARK-14074)).

I first tested and checked lintr-1.0.1, but it looks like many important fixes are missing (for example, checking the 100-character line length). So, I instead tried the latest commit, 5431140ffe, locally and fixed the check failures.

It looks like it has fixed many bugs and now finds many instances that I have observed from time to time and thought should be caught; I filed [the results](https://gist.github.com/HyukjinKwon/4f59ddcc7b6487a02da81800baca533c) here.

The downside is that it now takes about 7 minutes locally (it was about 2 minutes before).

## How was this patch tested?

Manually, `./dev/lint-r` after manually updating the lintr package.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: zuotingbing <zuo.tingbing9@zte.com.cn>

Closes #19290 from HyukjinKwon/upgrade-r-lint.
2017-10-01 18:42:45 +09:00
gatorsmile 472864014c Revert "[SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile"
This reverts commit a2516f41ae.
2017-09-29 11:45:58 -07:00
Holden Karau ecbe416ab5 [SPARK-22129][SPARK-22138] Release script improvements
## What changes were proposed in this pull request?

Use the GPG_KEY param, fix lsof to use a non-hardcoded path, and remove the version swap since it wasn't really needed. Also export JAVA_HOME for downstream scripts.

## How was this patch tested?

Rolled 2.1.2 RC2

Author: Holden Karau <holden@us.ibm.com>

Closes #19359 from holdenk/SPARK-22129-fix-signing.
2017-09-29 08:04:14 -07:00
Sean Owen a2516f41ae [SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile
## What changes were proposed in this pull request?

Add 'flume' profile to enable Flume-related integration modules

## How was this patch tested?

Existing tests; no functional change

Author: Sean Owen <sowen@cloudera.com>

Closes #19365 from srowen/SPARK-22142.
2017-09-29 08:26:53 +01:00
Sean Owen 01bd00d135 [SPARK-22128][CORE] Update paranamer to 2.8 to avoid BytecodeReadingParanamer ArrayIndexOutOfBoundsException with Scala 2.12 + Java 8 lambda
## What changes were proposed in this pull request?

Un-manage the jackson-module-paranamer version to let it use the version desired by jackson-module-scala; manage paranamer up to 2.8 for jackson-module-scala 2.7.9, to override avro 1.7.7's desired paranamer 2.3.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #19352 from srowen/SPARK-22128.
2017-09-28 08:22:48 +01:00
Sean Owen 9b98aef6a3 [HOTFIX][BUILD] Fix finalizer checkstyle error and re-disable checkstyle
## What changes were proposed in this pull request?

Fix finalizer checkstyle violation by just turning it off; re-disable checkstyle as it won't be run by SBT PR builder. See https://github.com/apache/spark/pull/18887#issuecomment-332580700

## How was this patch tested?

`./dev/lint-java` runs successfully

Author: Sean Owen <sowen@cloudera.com>

Closes #19371 from srowen/HotfixFinalizerCheckstlye.
2017-09-27 13:40:21 -07:00
Holden Karau 8f130ad401 [SPARK-22072][SPARK-22071][BUILD] Improve release build scripts
## What changes were proposed in this pull request?

Check JDK version (with javac) and use SPARK_VERSION for publish-release

## How was this patch tested?

Manually tried local build with wrong JDK / JAVA_HOME & built a local release (LFTP disabled)

Author: Holden Karau <holden@us.ibm.com>

Closes #19312 from holdenk/improve-release-scripts-r2.
2017-09-22 00:14:57 -07:00
Sean Owen 3d4dd14cd5 [SPARK-22066][BUILD] Update checkstyle to 8.2, enable it, fix violations
## What changes were proposed in this pull request?

Update plugins, including scala-maven-plugin, to latest versions. Update checkstyle to 8.2. Remove bogus checkstyle config and enable it. Fix existing and new Java checkstyle errors.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #19282 from srowen/SPARK-22066.
2017-09-20 10:01:46 +01:00
alexmnyc 94f7e046a2 [SPARK-22030][CORE] GraphiteSink fails to re-connect to Graphite instances behind an ELB or any other auto-scaled LB
## What changes were proposed in this pull request?

Upgrade the codahale metrics library so that the Graphite constructor can re-resolve hosts behind a CNAME with retried DNS lookups. When Graphite is deployed behind an ELB, the ELB may change IP addresses based on auto-scaling needs. The current approach makes Graphite unusable in that setup; this fixes that use case.

- Upgrade to codahale 3.1.5
- Use new Graphite(host, port) constructor instead of new Graphite(new InetSocketAddress(host, port)) constructor

## How was this patch tested?

The same logic is used for another project that uses the same configuration and code path, and Graphite re-connects behind ELBs are no longer an issue.

These are the proposed changes for the codahale lib - https://github.com/dropwizard/metrics/compare/v3.1.2...v3.1.5#diff-6916c85d2dd08d89fe771c952e3b8512R120. Specifically, b4d246d34e/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java (L120)


Author: alexmnyc <project@alexandermarkham.com>

Closes #19210 from alexmnyc/patch-1.
2017-09-19 10:05:59 +08:00
Sean Owen 4fbf748bf8 [SPARK-21893][BUILD][STREAMING][WIP] Put Kafka 0.8 behind a profile
## What changes were proposed in this pull request?

Put Kafka 0.8 support behind a kafka-0-8 profile.

## How was this patch tested?

Existing tests; but until the PR builder and Jenkins configs are updated, the effect here is to not build or test Kafka 0.8 support at all.

Author: Sean Owen <sowen@cloudera.com>

Closes #19134 from srowen/SPARK-21893.
2017-09-13 10:10:40 +01:00
jerryshao 445f1790ad [SPARK-9104][CORE] Expose Netty memory metrics in Spark
## What changes were proposed in this pull request?

This PR exposes Netty memory usage for Spark's `TransportClientFactory` and `TransportServer`, including detailed metrics for each direct arena and heap arena, as well as aggregated metrics. The purpose of adding the Netty metrics is to better understand the memory usage of Netty in Spark shuffle, RPC and other network communication, and to guide us in better configuring the memory size of executors.

This PR doesn't expose these metrics to any sink; to leverage this feature, one still needs to connect to either the MetricsSystem or collect them back to the driver to display.

## How was this patch tested?

Added a unit test to verify it; also manually verified in a real cluster.

Author: jerryshao <sshao@hortonworks.com>

Closes #18935 from jerryshao/SPARK-9104.
2017-09-05 21:28:54 -07:00
hyukjinkwon 02a4386aec [SPARK-20978][SQL] Bump up Univocity version to 2.5.4
## What changes were proposed in this pull request?

There was a bug in Univocity Parser that causes the issue in SPARK-20978. This was fixed as below:

```scala
val df = spark.read.schema("a string, b string, unparsed string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS())
df.show()
```

**Before**

```
java.lang.NullPointerException
	at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89)
	at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
...
```

**After**

```
+---+----+--------+
|  a|   b|unparsed|
+---+----+--------+
|  a|null|       a|
+---+----+--------+
```

It was fixed in 2.5.0 and 2.5.4 was released. I guess it'd be safe to upgrade this.

## How was this patch tested?

Unit test added in `CSVSuite.scala`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19113 from HyukjinKwon/bump-up-univocity.
2017-09-05 23:21:43 +08:00
Sean Owen 12ab7f7e89 [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation
…build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure

## What changes were proposed in this pull request?

This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.

In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.

It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release.

- Scalatest 2.x -> 3.0.3
- Chill 0.8.0 -> 0.8.4
- Clapper 1.0.x -> 1.1.2
- json4s 3.2.x -> 3.4.2
- Jackson 2.6.x -> 2.7.9 (required by json4s)

This change does _not_ fully enable a Scala 2.12 build:

- It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here
- It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too.

What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.

## How was this patch tested?

Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above.

Author: Sean Owen <sowen@cloudera.com>

Closes #18645 from srowen/SPARK-14280.
2017-09-01 19:21:21 +01:00
ArtRand fc45c2c88a [SPARK-20812][MESOS] Add secrets support to the dispatcher
Mesos has secrets primitives for environment- and file-based secrets; this PR adds that functionality to the Spark dispatcher, along with the appropriate configuration flags.
Unit tested and manually tested against a DC/OS cluster with Mesos 1.4.

Author: ArtRand <arand@soe.ucsc.edu>

Closes #18837 from ArtRand/spark-20812-dispatcher-secrets-and-labels.
2017-08-31 10:58:41 -07:00
Herman van Hovell 05af2de0fd [SPARK-21830][SQL] Bump ANTLR version and fix a few issues.
## What changes were proposed in this pull request?
This PR bumps the ANTLR version to 4.7 and fixes a number of small parser-related issues uncovered by the bump.

The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse:
```sql
SELECT *
FROM RANGE(1000)
WHERE
TRUE
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
```

This is caused by a known bug in ANTLR (https://github.com/antlr/antlr4/issues/994), which was fixed in version 4.6.

## How was this patch tested?
Existing tests.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #19042 from hvanhovell/SPARK-21830.
2017-08-24 16:33:55 -07:00
Dongjoon Hyun 8c54f1eb71 [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0
## What changes were proposed in this pull request?

Like Parquet, this PR aims to depend on the latest Apache ORC 1.4 for Apache Spark 2.3. There are key benefits for Apache ORC 1.4.

- Stability: Apache ORC 1.4.0 has many fixes and we can depend on the ORC community more.
- Maintainability: Reduces the Hive dependency and allows old legacy code to be removed later.

Later, we can get the following two key benefits by adding the new ORCFileFormat in SPARK-20728 (#17980), too.
- Usability: Users can use ORC data sources without the hive module, i.e., without -Phive.
- Speed: Use both Spark ColumnarBatch and ORC RowBatch together. This will be faster than the current implementation in Spark.

## How was this patch tested?

Passes the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #18640 from dongjoon-hyun/SPARK-21422.
2017-08-15 23:00:13 -07:00
pj.fanning c0e333dbed [SPARK-21709][BUILD] sbt 0.13.16 and some plugin updates
## What changes were proposed in this pull request?

Update sbt version to 0.13.16. I think this is a useful stepping stone to getting to sbt 1.0.0.

## How was this patch tested?

Existing Build.

Author: pj.fanning <pj.fanning@workday.com>

Closes #18921 from pjfanning/SPARK-21709.
2017-08-12 20:01:20 +01:00
Sean Owen b0bdfce9ca [MINOR][BUILD] Download RAT and R version info over HTTPS; use RAT 0.12
## What changes were proposed in this pull request?

This is trivial, but bugged me. We should download software over HTTPS.
And we can use RAT 0.12 while at it to pick up bug fixes.

## How was this patch tested?

N/A

Author: Sean Owen <sowen@cloudera.com>

Closes #18927 from srowen/Rat012.
2017-08-12 14:31:05 +09:00
Takeshi Yamamuro b78cf13bf0 [SPARK-21276][CORE] Update lz4-java to the latest (v1.4.0)
## What changes were proposed in this pull request?
This PR updates `lz4-java` to the latest (v1.4.0) and removes the custom `LZ4BlockInputStream`. We currently use a custom `LZ4BlockInputStream` to read concatenated byte streams in shuffle. But this functionality has been implemented in the latest lz4-java (https://github.com/lz4/lz4-java/pull/105), so we can update to the latest version to remove the custom `LZ4BlockInputStream`.

Major diffs between the latest release and v1.3.0 in the master are as follows (62f7547abb...6d4693f562):
- fixed NPE in XXHashFactory similarly
- Don't place resources in default package to support shading
- Fixes ByteBuffer methods failing to apply arrayOffset() for array-backed
- Try to load lz4-java from java.library.path, then fallback to bundled
- Add ppc64le binary
- Add s390x JNI binding
- Add basic LZ4 Frame v1.5.0 support
- enable aarch64 support for lz4-java
- Allow unsafeInstance() for ppc64le architecture
- Add unsafeInstance support for AArch64
- Support 64-bit JNI build on Solaris
- Avoid over-allocating a buffer
- Allow EndMark to be incompressible for LZ4FrameInputStream.
- Concat byte stream

## How was this patch tested?
Existing tests.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #18883 from maropu/SPARK-21276.
2017-08-09 17:31:52 +02:00
WeichenXu b35660dd0e [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search
## What changes were proposed in this pull request?

Update breeze to 0.13.2 for an emergency bugfix in the strong Wolfe line search:
https://github.com/scalanlp/breeze/pull/651

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #18797 from WeichenXu123/update-breeze.
2017-08-09 14:44:10 +08:00
Sean Owen fb54a564d7 [SPARK-20433][BUILD] Bump jackson from 2.6.5 to 2.6.7.1
## What changes were proposed in this pull request?

Taking over https://github.com/apache/spark/pull/18789 ; Closes #18789

Update Jackson to 2.6.7 uniformly, and some components to 2.6.7.1, to get some fixes and prep for Scala 2.12

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #18881 from srowen/SPARK-20433.
2017-08-08 18:15:29 -07:00
hyukjinkwon 08ef7d7187 [MINOR][R][BUILD] More reliable detection of R version for Windows in AppVeyor
## What changes were proposed in this pull request?

This PR proposes to use https://rversions.r-pkg.org/r-release-win instead of https://rversions.r-pkg.org/r-release to check R's version for Windows correctly.

We met a syncing problem with the Windows release (see #15709) before. To cut a long story short, it went like this:

- The 3.3.2 release was out, but not for Windows, for a few hours.
- `https://rversions.r-pkg.org/r-release` returned the latest as 3.3.2, so the download link for 3.3.1 became `windows/base/old` in our script.
- The 3.3.2 release was not available for Windows yet.
- 3.3.1 was still not in `windows/base/old` but in `windows/base` as the latest.
- The download with the `windows/base/old` link failed and builds were broken.

I believe we are not the only ones who hit this problem. Please see 01ce943929; also, this `r-release-win` API came out between 3.3.1 and 3.3.2 (presumably to deal with this issue), please see `https://github.com/metacran/rversions.app/issues/2`.

Using this API will prevent the problem, although it looks quite rare judging from the commit logs in https://github.com/metacran/rversions.app/commits/master. After 3.3.2, both `r-release-win` and `r-release` have been updated together.
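
For illustration only, a small Python sketch of querying the Windows-specific endpoint (the real check lives in the AppVeyor scripts, which are not Python, and no particular response format is assumed):

```python
# Illustration: fetch the Windows-specific R release info instead of the generic
# one, so the version used for the download link cannot get ahead of the Windows build.
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

response = urlopen("https://rversions.r-pkg.org/r-release-win")
print(response.read().decode("utf-8"))    # the latest R release that actually has a Windows build
```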

## How was this patch tested?

AppVeyor tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18859 from HyukjinKwon/use-reliable-link.
2017-08-08 23:18:59 +09:00
Felix Cheung d4e7f20f54 [SPARKR][BUILD] AppVeyor change to latest R version
## What changes were proposed in this pull request?

R version update

## How was this patch tested?

AppVeyor

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #18856 from felixcheung/rappveyorver.
2017-08-06 19:51:35 +09:00
hyukjinkwon f1a798b576 [MINOR] Minor comment fixes in merge_spark_pr.py script
## What changes were proposed in this pull request?

This PR proposes to fix a few typos in `merge_spark_pr.py`.

- `#   usage: ./apache-pr-merge.py    (see config env vars below)`
  -> `#   usage: ./merge_spark_pr.py    (see config env vars below)`

- `... have local a Spark ...` -> `... have a local Spark ...`

- `... to Apache.` -> `... to Apache Spark.`

I skimmed this file and these look like all I could find.

## How was this patch tested?

pep8 check (`./dev/lint-python`).

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18776 from HyukjinKwon/minor-merge-script.
2017-07-31 10:07:33 +09:00
Sean Owen d3f4a21196 [SPARK-15526][ML][FOLLOWUP] Make JPMML provided scope to avoid including unshaded JARs, and repromote to compile in MLlib
Following the comment at https://issues.apache.org/jira/browse/SPARK-15526?focusedCommentId=16086106&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16086106 -- this change actually needed a little more work to be complete.

This also marks JPMML as `provided` to make sure its JARs aren't included in the `jars` output, but then scopes it to `compile` in `mllib`. This is how Guava is handled.

Checked result in `assembly/target/scala-2.11/jars` to verify there are no JPMML jars. Maven and SBT builds still work.

Author: Sean Owen <sowen@cloudera.com>

Closes #18637 from srowen/SPARK-15526.2.
2017-07-18 09:53:51 -07:00
Sean Owen 425c4ada4c [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10
## What changes were proposed in this pull request?

- Remove Scala 2.10 build profiles and support
- Replace some 2.10 support in scripts with commented placeholders for 2.12 later
- Remove deprecated API calls from 2.10 support
- Remove usages of deprecated context bounds where possible
- Remove Scala 2.10 workarounds like ScalaReflectionLock
- Other minor Scala warning fixes

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #17150 from srowen/SPARK-19810.
2017-07-13 17:06:24 +08:00
Bryan Cutler d03aebbe65 [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`. This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays, which are then served to the Python process. The Python DataFrame can then collect the Arrow payloads, where they are combined and converted to a pandas DataFrame. All data types except complex, date, timestamp, and decimal are currently supported; otherwise, an `UnsupportedOperation` exception is thrown.

Additions to Spark include a Scala package private method `Dataset.toArrowPayload` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served.  A package private class/object `ArrowConverters` that provide data type mappings and conversion routines.  In Python, a private method `DataFrame._collectAsArrow` is added to collect Arrow payloads and a SQLConf "spark.sql.execution.arrow.enable" can be used in `toPandas()` to enable using Arrow (uses the old conversion by default).
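
A minimal usage sketch of the conf-based path described above (the conf name is taken from this description; pyarrow must be installed, and the old conversion remains the default):

```python
# Sketch: opt in to the Arrow-based conversion for DataFrame.toPandas().
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-topandas-sketch").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enable", "true")

df = spark.range(0, 1000).selectExpr("id", "id * 2 AS doubled")
pdf = df.toPandas()   # partitions become Arrow payloads, then a pandas DataFrame
print(pdf.head())
```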

## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types. The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data. This will ensure that the schema and data have been converted correctly.

Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow. A roundtrip test ensures the pandas DataFrame produced by PySpark is equal to one made directly with pandas.

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #18459 from BryanCutler/toPandas_with_arrow-SPARK-13534.
2017-07-10 15:21:03 -07:00
Dongjoon Hyun c8d0aba198 [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6
## What changes were proposed in this pull request?

This PR aims to bump Py4J in order to fix the following float/double bug.
Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272) and the latest Py4J is 0.10.6.

**BEFORE**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+--------------------+
|(id + 17.1335742042)|
+--------------------+
|       17.1335742042|
+--------------------+
```

**AFTER**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+-------------------------+
|(id + 17.133574204226083)|
+-------------------------+
|       17.133574204226083|
+-------------------------+
```

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #18546 from dongjoon-hyun/SPARK-21278.
2017-07-05 16:33:23 -07:00
Wenchen Fan 838effb98a Revert "[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas"
This reverts commit e44697606f.
2017-06-28 14:28:40 +08:00
hyukjinkwon 7c7bc8fc0f [SPARK-21189][INFRA] Handle unknown error codes in Jenkins rather then leaving incomplete comment in PRs
## What changes were proposed in this pull request?

Recently, Jenkins tests have been unstable for unknown reasons, as below:

```
 /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-r ; process was terminated by signal 9
    test_result_code, test_result_note = run_tests(tests_timeout)
  File "./dev/run-tests-jenkins.py", line 140, in run_tests
    test_result_note = ' * This patch **fails %s**.' % failure_note_by_errcode[test_result_code]
KeyError: -9
```

```
Traceback (most recent call last):
  File "./dev/run-tests-jenkins.py", line 226, in <module>
    main()
  File "./dev/run-tests-jenkins.py", line 213, in main
    test_result_code, test_result_note = run_tests(tests_timeout)
  File "./dev/run-tests-jenkins.py", line 140, in run_tests
    test_result_note = ' * This patch **fails %s**.' % failure_note_by_errcode[test_result_code]
KeyError: -10
```

This exception appears to cause failures when updating the comments in the PR. For example:

![2017-06-23 4 19 41](https://user-images.githubusercontent.com/6477701/27470626-d035ecd8-582f-11e7-883e-0ae6941659b7.png)

![2017-06-23 4 19 50](https://user-images.githubusercontent.com/6477701/27470629-d11ba782-582f-11e7-97e0-64d28cbc19aa.png)

These comments just remain.

This always imposes, on both reviewers and the author, the overhead of clicking and checking the logs, which I believe is not really useful.

This PR proposes to include the error code in the PR comment messages and still let the comments be updated.
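
One plausible shape of the fix, sketched below; the helper name, the placeholder table entries, and the fallback wording are illustrative and not necessarily the exact change in this PR.

```python
# Sketch: tolerate unknown error codes instead of raising KeyError, so
# run-tests-jenkins.py can always post/update the PR comment.
failure_note_by_errcode = {
    1: "some tests",
    # ... the real table maps each known error code to a short description ...
}

def build_test_result_note(test_result_code):
    if test_result_code in failure_note_by_errcode:
        return ' * This patch **fails %s**.' % failure_note_by_errcode[test_result_code]
    # Unknown codes (e.g. -9 or -10 from killed processes) no longer raise KeyError.
    return ' * This patch **fails due to an unknown error code, %d**.' % test_result_code
```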

## How was this patch tested?

Jenkins tests below; I manually gave the error code to test this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18399 from HyukjinKwon/jenkins-print-errors.
2017-06-24 10:14:31 +01:00
Bryan Cutler e44697606f [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`. This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays, which are then served to the Python process. The Python DataFrame can then collect the Arrow payloads, where they are combined and converted to a pandas DataFrame. All non-complex data types are currently supported; otherwise, an `UnsupportedOperation` exception is thrown.

Additions to Spark include a Scala package private method `Dataset.toArrowPayloadBytes` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served.  A package private class/object `ArrowConverters` that provide data type mappings and conversion routines.  In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads and an optional flag in `toPandas(useArrow=False)` to enable using Arrow (uses the old conversion by default).

## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types. The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data. This will ensure that the schema and data have been converted correctly.

Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow. A roundtrip test ensures the pandas DataFrame produced by PySpark is equal to one made directly with pandas.

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.
2017-06-23 09:01:13 +08:00
Xianyang Liu 0a4b7e4f81 [MINOR] Fix some typo of the document
## What changes were proposed in this pull request?

Fix some typos in the document.

## How was this patch tested?

Existing tests.


Author: Xianyang Liu <xianyang.liu@intel.com>

Closes #18350 from ConeyLiu/fixtypo.
2017-06-19 20:35:58 +01:00
Michael Gummelt a18d637112 [SPARK-20434][YARN][CORE] Move Hadoop delegation token code from yarn to core
## What changes were proposed in this pull request?

Move Hadoop delegation token code from `spark-yarn` to `spark-core`, so that other schedulers (such as Mesos) may use it. In order to avoid exposing Hadoop interfaces in spark-core, the new Hadoop delegation token classes are kept private. In order to provide backward compatibility, and to allow YARN users to continue to load their own delegation token providers via Java service loading, the old YARN interfaces, as well as the client code that uses them, have been retained.

Summary:
- Move registered `yarn.security.ServiceCredentialProvider` classes from `spark-yarn` to `spark-core`.  Moved them into a new, private hierarchy under `HadoopDelegationTokenProvider`.  Client code in `HadoopDelegationTokenManager` now loads credentials from a whitelist of three providers (`HadoopFSDelegationTokenProvider`, `HiveDelegationTokenProvider`, `HBaseDelegationTokenProvider`), instead of service loading, which means that users are not able to implement their own delegation token providers, as they are in the `spark-yarn` module.

- The `yarn.security.ServiceCredentialProvider` interface has been kept for backwards compatibility, and to continue to allow YARN users to implement their own delegation token provider implementations. Client code in YARN now fetches tokens via the new `YARNHadoopDelegationTokenManager` class, which fetches tokens from the core providers through `HadoopDelegationTokenManager`, as well as service-loading them from `yarn.security.ServiceCredentialProvider`.

Old Hierarchy:

```
yarn.security.ServiceCredentialProvider (service loaded)
  HadoopFSCredentialProvider
  HiveCredentialProvider
  HBaseCredentialProvider
yarn.security.ConfigurableCredentialManager
```

New Hierarchy:

```
HadoopDelegationTokenManager
HadoopDelegationTokenProvider (not service loaded)
  HadoopFSDelegationTokenProvider
  HiveDelegationTokenProvider
  HBaseDelegationTokenProvider

yarn.security.ServiceCredentialProvider (service loaded)
yarn.security.YARNHadoopDelegationTokenManager
```
## How was this patch tested?

unit tests

Author: Michael Gummelt <mgummelt@mesosphere.io>
Author: Dr. Stefan Schimanski <sttts@mesosphere.io>

Closes #17723 from mgummelt/SPARK-20434-refactor-kerberos.
2017-06-15 11:46:00 -07:00
Yuming Wang 823f1eef58 [SPARK-13933][BUILD] Update hadoop-2.7 profile's curator version to 2.7.1
## What changes were proposed in this pull request?

Update the hadoop-2.7 profile's curator version to 2.7.1; for more, see [SPARK-13933](https://issues.apache.org/jira/browse/SPARK-13933).

## How was this patch tested?

manual tests

Author: Yuming Wang <wgyumg@gmail.com>

Closes #18247 from wangyum/SPARK-13933.
2017-06-11 10:05:47 +01:00
Wenchen Fan 864d94fe87 [SPARK-20974][BUILD] we should run REPL tests if SQL module has code changes
## What changes were proposed in this pull request?

The REPL module depends on the SQL module, so we should run REPL tests if the SQL module has code changes.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #18191 from cloud-fan/test.
2017-06-02 21:59:52 -07:00