Commit graph

267 commits

Author SHA1 Message Date
Bryan Cutler ed075e1ff6 [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10.0
## What changes were proposed in this pull request?

Upgrade Apache Arrow to 0.10.0

Version 0.10.0 has a number of bug fixes and improvements with the following pertaining directly to usage in Spark:
 * Allow for adding BinaryType support ARROW-2141
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported in as input to pyarrow ARROW-2141
 * Java has common interface for reset to cleanup complex vectors in Spark ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

## How was this patch tested?

existing tests

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #21939 from BryanCutler/arrow-upgrade-010.
2018-08-14 17:13:38 -07:00
Fokko Driesprong 5d6abad36d [SPARK-25033] Bump Apache commons.{httpclient, httpcore}
## What changes were proposed in this pull request?

Bump the versions of Apache commons.{httpclient, httpcore} to make it congruent with Stocator.

Changelog httpclient: https://archive.apache.org/dist/httpcomponents/httpclient/RELEASE_NOTES-4.5.x.txt
Changelog httpcore: https://archive.apache.org/dist/httpcomponents/httpcore/RELEASE_NOTES.txt

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes #22007 from Fokko/SPARK-25033.

Authored-by: Fokko Driesprong <fokkodriesprong@godatadriven.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-08-13 09:14:17 +08:00
Sean Owen eb9a696dd6 [MINOR][BUILD] Update Jetty to 9.3.24.v20180605
## What changes were proposed in this pull request?

Update Jetty to 9.3.24.v20180605 to pick up security fix

## How was this patch tested?

Existing tests.

Closes #22055 from srowen/Jetty9324.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2018-08-09 13:04:03 -05:00
Sean Owen 5f9633dc97 [SPARK-25015][BUILD] Update Hadoop 2.7 to 2.7.7
## What changes were proposed in this pull request?

Update Hadoop 2.7 to 2.7.7 to pull in bug and security fixes.

## How was this patch tested?

Existing tests.

Author: Sean Owen <srowen@gmail.com>

Closes #21987 from srowen/SPARK-25015.
2018-08-04 14:59:13 -05:00
Maxim Gekk b3f2911eeb [SPARK-24945][SQL] Switching to uniVocity 2.7.3
## What changes were proposed in this pull request?

In the PR, I propose to upgrade uniVocity parser from **2.6.3** to **2.7.3**. The recent version includes a fix for the SPARK-24645 issue and has better performance.

Before changes:
```
Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
One quoted string                           33336 / 34122          0.0      666727.0       1.0X

Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Select 1000 columns                         90287 / 91713          0.0       90286.9       1.0X
Select 100 columns                          31826 / 36589          0.0       31826.4       2.8X
Select one column                           25738 / 25872          0.0       25737.9       3.5X
count()                                       6931 / 7269          0.1        6931.5      13.0X
```
after:
```
Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
One quoted string                           33411 / 33510          0.0      668211.4       1.0X

Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Select 1000 columns                         88028 / 89311          0.0       88028.1       1.0X
Select 100 columns                          29010 / 32755          0.0       29010.1       3.0X
Select one column                           22936 / 22953          0.0       22936.5       3.8X
count()                                       6657 / 6740          0.2        6656.6      13.5X
```
Closes #21892

## How was this patch tested?

It was tested by `CSVSuite` and `CSVBenchmarks`

Author: Maxim Gekk <maxim.gekk@databricks.com>

Closes #21969 from MaxGekk/univocity-2_7_3.
2018-08-03 08:33:28 +08:00
Gengliang Wang b90bfe3c42 [SPARK-24771][BUILD] Upgrade Apache AVRO to 1.8.2
## What changes were proposed in this pull request?

Upgrade Apache Avro from 1.7.7 to 1.8.2. The major new features:

1. More logical types. From the spec of 1.8.2 https://avro.apache.org/docs/1.8.2/spec.html#Logical+Types we can see comparing to [1.7.7](https://avro.apache.org/docs/1.7.7/spec.html#Logical+Types), the new version support:
    - Date
    - Time (millisecond precision)
    - Time (microsecond precision)
    - Timestamp (millisecond precision)
    - Timestamp (microsecond precision)
    - Duration

2. Single-object encoding: https://avro.apache.org/docs/1.8.2/spec.html#single_object_encoding

This PR aims to update Apache Spark to support these new features.

## How was this patch tested?

Unit test

Author: Gengliang Wang <gengliang.wang@databricks.com>

Closes #21761 from gengliangwang/upgrade_avro_1.8.
2018-07-30 07:30:47 -07:00
Dongjoon Hyun 3b59d326c7 [SPARK-24576][BUILD] Upgrade Apache ORC to 1.5.2
## What changes were proposed in this pull request?

This issue aims to upgrade Apache ORC library from 1.4.4 to 1.5.2 in order to bring the following benefits into Apache Spark.

- [ORC-91](https://issues.apache.org/jira/browse/ORC-91) Support for variable length blocks in HDFS (The current space wasted in ORC to padding is known to be 5%.)
- [ORC-344](https://issues.apache.org/jira/browse/ORC-344) Support for using Decimal64ColumnVector

In addition to that, Apache Hive 3.1 and 3.2 will use ORC 1.5.1 ([HIVE-19669](https://issues.apache.org/jira/browse/HIVE-19465)) and 1.5.2 ([HIVE-19792](https://issues.apache.org/jira/browse/HIVE-19792)) respectively. This will improve the compatibility between Apache Spark and Apache Hive by sharing the common library.

## How was this patch tested?

Pass the Jenkins with all existing tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #21582 from dongjoon-hyun/SPARK-24576.
2018-07-17 23:52:17 -07:00
DB Tsai 5585c5765f
[SPARK-24420][BUILD] Upgrade ASM to 6.1 to support JDK9+
## What changes were proposed in this pull request?

Upgrade ASM to 6.1 to support JDK9+

## How was this patch tested?

Existing tests.

Author: DB Tsai <d_tsai@apple.com>

Closes #21459 from dbtsai/asm.
2018-07-03 10:13:48 -07:00
DB Tsai c7967c6049 [SPARK-24418][BUILD] Upgrade Scala to 2.11.12 and 2.12.6
## What changes were proposed in this pull request?

Scala is upgraded to `2.11.12` and `2.12.6`.

We used `loadFIles()` in `ILoop` as a hook to initialize the Spark before REPL sees any files in Scala `2.11.8`. However, it was a hack, and it was not intended to be a public API, so it was removed in Scala `2.11.12`.

From the discussion in Scala community, https://github.com/scala/bug/issues/10913 , we can use `initializeSynchronous` to initialize Spark instead. This PR implements the Spark initialization there.

However, in Scala `2.11.12`'s `ILoop.scala`, in function `def startup()`, the first thing it calls is `printWelcome()`. As a result, Scala will call `printWelcome()` and `splash` before calling `initializeSynchronous`.

Thus, the Spark shell will allow users to type commends first, and then show the Spark UI URL. It's working, but it will change the Spark Shell interface as the following.

```scala
➜  apache-spark git:(scala-2.11.12) ✗ ./bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161)
Type in expressions to have them evaluated.
Type :help for more information.

scala> Spark context Web UI available at http://192.168.1.169:4040
Spark context available as 'sc' (master = local[*], app id = local-1528180279528).
Spark session available as 'spark'.

scala>
```

It seems there is no easy way to inject the Spark initialization code in the proper place as Scala doesn't provide a hook. Maybe som-snytt can comment on this.

The following command is used to update the dep files.
```scala
./dev/test-dependencies.sh --replace-manifest
```
## How was this patch tested?

Existing tests

Author: DB Tsai <d_tsai@apple.com>

Closes #21495 from dbtsai/scala-2.11.12.
2018-06-26 09:48:52 +08:00
Dongjoon Hyun 486ecc680e [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4
## What changes were proposed in this pull request?

ORC 1.4.4 includes [nine fixes](https://issues.apache.org/jira/issues/?filter=12342568&jql=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4). One of the issues is about `Timestamp` bug (ORC-306) which occurs when `native` ORC vectorized reader reads ORC column vector's sub-vector `times` and `nanos`. ORC-306 fixes this according to the [original definition](https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46) and this PR includes the updated interpretation on ORC column vectors. Note that `hive` ORC reader and ORC MR reader is not affected.

```scala
scala> spark.version
res0: String = 2.3.0
scala> spark.sql("set spark.sql.orc.impl=native")
scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.000789")).toDF().write.orc("/tmp/orc")
scala> spark.read.orc("/tmp/orc").show(false)
+--------------------------+
|value                     |
+--------------------------+
|1900-05-05 12:34:55.000789|
+--------------------------+
```

This PR aims to update Apache Spark to use it.

**FULL LIST**

ID | TITLE
-- | --
ORC-281 | Fix compiler warnings from clang 5.0
ORC-301 | `extractFileTail` should open a file in `try` statement
ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api
ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp
ORC-324 | Add support for ARM and PPC arch
ORC-330 | Remove unnecessary Hive artifacts from root pom
ORC-332 | Add syntax version to orc_proto.proto
ORC-336 | Remove avro and parquet dependency management entries
ORC-360 | Implement error checking on subtype fields in Java

## How was this patch tested?

Pass the Jenkins.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #21372 from dongjoon-hyun/SPARK_ORC144.
2018-05-24 11:34:13 +08:00
Maxim Gekk 7a2d4895c7 [SPARK-17916][SQL] Fix empty string being parsed as null when nullValue is set.
## What changes were proposed in this pull request?

I propose to bump version of uniVocity parser up to 2.6.3 where quoted empty strings are replaced by the empty value (passed to `setEmptyValue`) instead of `null` values as in the current version 2.5.9:
https://github.com/uniVocity/univocity-parsers/blob/v2.6.3/src/main/java/com/univocity/parsers/csv/CsvParser.java#L125

Empty value for writer is set to `""`. So, empty string in dataframe/dataset is stored as empty quoted string `""`. Empty value for reader is set to empty string (zero size). In this way, saved empty quoted string will be read as just empty string. Please, look at the tests for more details.

Here are main changes made in [2.6.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.0), [2.6.1](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.1), [2.6.2](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.2), [2.6.3](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.3):

- CSV parser now parses quoted values ~30% faster
- CSV format detection process has option provide a list of possible delimiters, in order of priority ( i.e. settings.detectFormatAutomatically( '-', '.');) - https://github.com/uniVocity/univocity-parsers/issues/214
- Implemented trim quoted values support - https://github.com/uniVocity/univocity-parsers/issues/230
- NullPointer when stopping parser when nothing is parsed - https://github.com/uniVocity/univocity-parsers/issues/219
- Concurrency issue when calling stopParsing() - https://github.com/uniVocity/univocity-parsers/issues/231

Closes #20068

## How was this patch tested?

Added tests from the PR https://github.com/apache/spark/pull/20068

Author: Maxim Gekk <maxim.gekk@databricks.com>

Closes #21273 from MaxGekk/univocity-2.6.
2018-05-14 10:01:06 +08:00
Marcelo Vanzin cc613b552e [PYSPARK] Update py4j to version 0.10.7. 2018-05-09 10:47:35 -07:00
Ryan Blue cac9b1dea1 [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
## What changes were proposed in this pull request?

This updates Parquet to 1.10.0 and updates the vectorized path for buffer management changes. Parquet 1.10.0 uses ByteBufferInputStream instead of byte arrays in encoders. This allows Parquet to break allocations into smaller chunks that are better for garbage collection.

## How was this patch tested?

Existing Parquet tests. Running in production at Netflix for about 3 months.

Author: Ryan Blue <blue@apache.org>

Closes #21070 from rdblue/SPARK-23972-update-parquet-to-1.10.0.
2018-05-09 12:27:32 +08:00
Steve Loughran ce7ba2e98e [SPARK-23807][BUILD] Add Hadoop 3.1 profile with relevant POM fix ups
## What changes were proposed in this pull request?

1. Adds a `hadoop-3.1` profile build depending on the hadoop-3.1 artifacts.
1. In the hadoop-cloud module, adds an explicit hadoop-3.1 profile which switches from explicitly pulling in cloud connectors (hadoop-openstack, hadoop-aws, hadoop-azure) to depending on the hadoop-cloudstorage POM artifact, which pulls these in, has pre-excluded things like hadoop-common, and stays up to date with new connectors (hadoop-azuredatalake, hadoop-allyun). Goal: it becomes the Hadoop projects homework of keeping this clean, and the spark project doesn't need to handle new hadoop releases adding more dependencies.
1. the hadoop-cloud/hadoop-3.1 profile also declares support for jetty-ajax and jetty-util to ensure that these jars get into the distribution jar directory when needed by unshaded libraries.
1. Increases the curator and zookeeper versions to match those in hadoop-3, fixing spark core to build in sbt with the hadoop-3 dependencies.

## How was this patch tested?

* Everything this has been built and tested against both ASF Hadoop branch-3.1 and hadoop trunk.
* spark-shell was used to create connectors to all the stores and verify that file IO could take place.

The spark hive-1.2.1 JAR has problems here, as it's version check logic fails for Hadoop versions > 2.

This can be avoided with either of

* The hadoop JARs built to declare their version as Hadoop 2.11  `mvn install -DskipTests -DskipShade -Ddeclared.hadoop.version=2.11` . This is safe for local test runs, not for deployment (HDFS is very strict about cross-version deployment).
* A modified version of spark hive whose version check switch statement is happy with hadoop 3.

I've done both, with maven and SBT.

Three issues surfaced

1. A spark-core test failure —fixed in SPARK-23787.
1. SBT only: Zookeeper not being found in spark-core. Somehow curator 2.12.0 triggers some slightly different dependency resolution logic from previous versions, and Ivy was missing zookeeper.jar entirely. This patch adds the explicit declaration for all spark profiles, setting the ZK version = 3.4.9 for hadoop-3.1
1. Marking jetty-utils as provided in spark was stopping hadoop-azure from being able to instantiate the azure wasb:// client; it was using jetty-util-ajax, which could then not find a class in jetty-util.

Author: Steve Loughran <stevel@hortonworks.com>

Closes #20923 from steveloughran/cloud/SPARK-23807-hadoop-31.
2018-04-24 09:57:09 -07:00
Kazuaki Ishizaki 649ed9c573 [SPARK-23509][BUILD] Upgrade commons-net from 2.2 to 3.1
## What changes were proposed in this pull request?

This PR avoids version conflicts of `commons-net` by upgrading commons-net from 2.2 to 3.1. We are seeing the following message during the build using sbt.

```
[warn] Found version conflict(s) in library dependencies; some are suspected to be binary incompatible:
...
[warn] 	* commons-net:commons-net:3.1 is selected over 2.2
[warn] 	    +- org.apache.hadoop:hadoop-common:2.6.5              (depends on 3.1)
[warn] 	    +- org.apache.spark:spark-core_2.11:2.4.0-SNAPSHOT    (depends on 2.2)
[warn]
```

[Here](https://commons.apache.org/proper/commons-net/changes-report.html) is a release history.

[Here](https://commons.apache.org/proper/commons-net/migration.html) is a migration guide from 2.x to 3.0.

## How was this patch tested?

Existing tests

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #20672 from kiszk/SPARK-23509.
2018-02-27 08:18:41 -06:00
Dongjoon Hyun 3ee3b2ae1f [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3
## What changes were proposed in this pull request?

This PR updates Apache ORC dependencies to 1.4.3 released on February 9th. Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 more patches (https://s.apache.org/Fll8).

Especially, the following ORC-285 is fixed at 1.4.3.

```scala
scala> val df = Seq(Array.empty[Float]).toDF()

scala> df.write.format("orc").save("/tmp/floatarray")

scala> spark.read.orc("/tmp/floatarray")
res1: org.apache.spark.sql.DataFrame = [value: array<float>]

scala> spark.read.orc("/tmp/floatarray").show()
18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.io.IOException: Error reading file: file:/tmp/floatarray/part-00000-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
	at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
...
Caused by: java.io.EOFException: Read past EOF for compressed stream Stream for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
```

## How was this patch tested?

Pass the Jenkins test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #20511 from dongjoon-hyun/SPARK-23340.
2018-02-17 00:25:36 -08:00
Yuming Wang 4df84c3f81 [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.7.1
## What changes were proposed in this pull request?

This PR upgrade snappy-java from 1.1.2.6 to 1.1.7.1.
1.1.7.1 release notes:
- Improved performance for big-endian architecture
- The other performance improvement in [snappy-1.1.5](https://github.com/google/snappy/releases/tag/1.1.5)

1.1.4 release notes:
- Fix a 1% performance regression when snappy is used in PIE executables.
- Improve compression performance by 5%.
- Improve decompression performance by 20%.

More details:
https://github.com/xerial/snappy-java/blob/master/Milestone.md

## How was this patch tested?

manual tests

Author: Yuming Wang <wgyumg@gmail.com>

Closes #20510 from wangyum/SPARK-23336.
2018-02-08 12:52:08 -06:00
foxish c3548d11c3 [SPARK-23063][K8S] K8s changes for publishing scripts (and a couple of other misses)
## What changes were proposed in this pull request?

Including the `-Pkubernetes` flag in a few places it was missed.

## How was this patch tested?

checkstyle, mima through manual tests.

Author: foxish <ramanathana@google.com>

Closes #20256 from foxish/SPARK-23063.
2018-01-13 21:34:28 -08:00
shimamoto 628a1ca5a4 [SPARK-23043][BUILD] Upgrade json4s to 3.5.3
## What changes were proposed in this pull request?

Spark still use a few years old version 3.2.11. This change is to upgrade json4s to 3.5.3.

Note that this change does not include the Jackson update because the Jackson version referenced in json4s 3.5.3 is 2.8.4, which has a security vulnerability ([see](https://issues.apache.org/jira/browse/SPARK-20433)).

## How was this patch tested?

Existing unit tests and build.

Author: shimamoto <chibochibo@gmail.com>

Closes #20233 from shimamoto/upgrade-json4s.
2018-01-13 09:40:00 -06:00
Fokko Driesprong fd7d141d8b [SPARK-22919] Bump httpclient versions
Hi all,

I would like to bump the PATCH versions of both the Apache httpclient Apache httpcore. I use the SparkTC Stocator library for connecting to an object store, and I would align the versions to reduce java version mismatches. Furthermore it is good to bump these versions since they fix stability and performance issues:
https://archive.apache.org/dist/httpcomponents/httpclient/RELEASE_NOTES-4.5.x.txt
https://www.apache.org/dist/httpcomponents/httpcore/RELEASE_NOTES-4.4.x.txt

Cheers, Fokko

## What changes were proposed in this pull request?

Update the versions of the httpclient and httpcore. Only update the PATCH versions, so no breaking changes.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Fokko Driesprong <fokkodriesprong@godatadriven.com>

Closes #20103 from Fokko/SPARK-22919-bump-httpclient-versions.
2017-12-30 10:37:41 -06:00
Bryan Cutler 59d52631eb [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
## What changes were proposed in this pull request?

Upgrade Spark to Arrow 0.8.0 for Java and Python.  Also includes an upgrade of Netty to 4.1.17 to resolve dependency requirements.

The highlights that pertain to Spark for the update from Arrow versoin 0.4.1 to 0.8.0 include:

* Java refactoring for more simple API
* Java reduced heap usage and streamlined hot code paths
* Type support for DecimalType, ArrayType
* Improved type casting support in Python
* Simplified type checking in Python

## How was this patch tested?

Existing tests

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>

Closes #19884 from BryanCutler/arrow-upgrade-080-SPARK-22324.
2017-12-21 20:43:56 +09:00
Kazuaki Ishizaki 8ae004b460 [SPARK-22688][SQL] Upgrade Janino version to 3.0.8
## What changes were proposed in this pull request?

This PR upgrade Janino version to 3.0.8. [Janino 3.0.8](https://janino-compiler.github.io/janino/changelog.html) includes an important fix to reduce the number of constant pool entries by using 'sipush' java bytecode.

* SIPUSH bytecode is not used for short integer constant [#33](https://github.com/janino-compiler/janino/issues/33).

Please see detail in [this discussion thread](https://github.com/apache/spark/pull/19518#issuecomment-346674976).

## How was this patch tested?

Existing tests

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #19890 from kiszk/SPARK-22688.
2017-12-06 16:15:25 -08:00
smurakozi 9948b860ac [SPARK-22516][SQL] Bump up Univocity version to 2.5.9
## What changes were proposed in this pull request?

There was a bug in Univocity Parser that causes the issue in SPARK-22516. This was fixed by upgrading from 2.5.4 to 2.5.9 version of the library :

**Executing**
```
spark.read.option("header","true").option("inferSchema", "true").option("multiLine", "true").option("comment", "g").csv("test_file_without_eof_char.csv").show()
```
**Before**
```
ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 6)
com.univocity.parsers.common.TextParsingException: java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of input reached
...
Internal state when error was thrown: line=3, column=0, record=2, charIndex=31
	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
	at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
```
**After**
```
+-------+-------+
|column1|column2|
+-------+-------+
|    abc|    def|
+-------+-------+
```

## How was this patch tested?
The already existing `CSVSuite.commented lines in CSV data` test was extended to parse the file also in multiline mode. The test input file was modified to also include a comment in the last line.

Author: smurakozi <smurakozi@gmail.com>

Closes #19906 from smurakozi/SPARK-22516.
2017-12-06 13:22:08 -08:00
Sean Owen d2cf95aa63 [SPARK-22634][BUILD] Update Bouncy Castle to 1.58
## What changes were proposed in this pull request?

Update Bouncy Castle to 1.58, and jets3t to 0.9.4 to (sort of) match.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #19859 from srowen/SPARK-22634.
2017-12-02 07:37:02 -06:00
Min Shen 7da1f5708c [SPARK-22373] Bump Janino dependency version to fix thread safety issue…
… with Janino when compiling generated code.

## What changes were proposed in this pull request?

Bump up Janino dependency version to fix thread safety issue during compiling generated code

## How was this patch tested?

Check https://issues.apache.org/jira/browse/SPARK-22373 for details.
Converted part of the code in CodeGenerator into a standalone application, so the issue can be consistently reproduced locally.
Verified that changing Janino dependency version resolved this issue.

Author: Min Shen <mshen@linkedin.com>

Closes #19839 from Victsm/SPARK-22373.
2017-11-30 19:24:44 -06:00
Stavros Kontopoulos 193555f79c [SPARK-18935][MESOS] Fix dynamic reservations on mesos
## What changes were proposed in this pull request?

- Solves the issue described in the ticket by preserving reservation and allocation info in all cases (port handling included).
- upgrades to 1.4
- Adds extra debug level logging to make debugging easier in the future, for example we add reservation info when applicable.
```
 17/09/29 14:53:07 DEBUG MesosCoarseGrainedSchedulerBackend: Accepting offer: f20de49b-dee3-45dd-a3c1-73418b7de891-O32 with attributes: Map() allocation info: role: "spark-prive"
  reservation info: name: "ports"
 type: RANGES
 ranges {
   range {
     begin: 31000
     end: 32000
   }
 }
 role: "spark-prive"
 reservation {
   principal: "test"
 }
 allocation_info {
   role: "spark-prive"
 }
```
- Some style cleanup.

## How was this patch tested?

Manually by running the example in the ticket with and without a principal. Specifically I tested it on a dc/os 1.10 cluster with 7 nodes and played with reservations. From the master node in order to reserve resources I executed:

```for i in 0 1 2 3 4 5 6
do
curl -i \
      -d slaveId=90ec65ea-1f7b-479f-a824-35d2527d6d26-S$i \
      -d resources='[
        {
          "name": "cpus",
          "type": "SCALAR",
          "scalar": { "value": 2 },
          "role": "spark-role",
          "reservation": {
            "principal": ""
          }
        },
        {
          "name": "mem",
          "type": "SCALAR",
          "scalar": { "value": 8026 },
          "role": "spark-role",
          "reservation": {
            "principal": ""
          }
        }
      ]' \
      -X POST http://master.mesos:5050/master/reserve
done
```
Nodes had 4 cpus (m3.xlarge instances)  and I reserved either 2 or 4 cpus (all for a role).
I verified it launches tasks on nodes with reserved resources under `spark-role` role  only if
a) there are remaining resources for (*) default role and the spark driver has no role assigned to it.
b) the spark driver has a role assigned to it and it is the same role used in reservations.
I also tested this locally on my machine.

Author: Stavros Kontopoulos <st.kontopoulos@gmail.com>

Closes #19390 from skonto/fix_dynamic_reservation.
2017-11-29 14:15:35 -08:00
Sital Kedia 444bce1c98 [SPARK-19112][CORE] Support for ZStandard codec
## What changes were proposed in this pull request?

Using zstd compression for Spark jobs spilling 100s of TBs of data, we could reduce the amount of data written to disk by as much as 50%. This translates to significant latency gain because of reduced disk io operations. There is a degradation CPU time by 2 - 5% because of zstd compression overhead, but for jobs which are bottlenecked by disk IO, this hit can be taken.

## Benchmark
Please note that this benchmark is using real world compute heavy production workload spilling TBs of data to disk

|         | zstd performance as compred to LZ4   |
| ------------- | -----:|
| spill/shuffle bytes    | -48% |
| cpu time    |    + 3% |
| cpu reservation time       |    -40%|
| latency     |     -40% |

## How was this patch tested?

Tested by running few jobs spilling large amount of data on the cluster and amount of intermediate data written to disk reduced by as much as 50%.

Author: Sital Kedia <skedia@fb.com>

Closes #18805 from sitalkedia/skedia/upstream_zstd.
2017-11-01 14:54:08 +01:00
Dongjoon Hyun 6f1d0dea1c [SPARK-22300][BUILD] Update ORC to 1.4.1
## What changes were proposed in this pull request?

Apache ORC 1.4.1 is released yesterday.
- https://orc.apache.org/news/2017/10/16/ORC-1.4.1/

Like ORC-233 (Allow `orc.include.columns` to be empty), there are several important fixes.
This PR updates Apache ORC dependency to use the latest one, 1.4.1.

## How was this patch tested?

Pass the Jenkins.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #19521 from dongjoon-hyun/SPARK-22300.
2017-10-19 13:30:55 +08:00
Sean Owen 01bd00d135 [SPARK-22128][CORE] Update paranamer to 2.8 to avoid BytecodeReadingParanamer ArrayIndexOutOfBoundsException with Scala 2.12 + Java 8 lambda
## What changes were proposed in this pull request?

Un-manage jackson-module-paranamer version to let it use the version desired by jackson-module-scala; manage paranamer up from 2.8 for jackson-module-scala 2.7.9, to override avro 1.7.7's desired paranamer 2.3

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #19352 from srowen/SPARK-22128.
2017-09-28 08:22:48 +01:00
alexmnyc 94f7e046a2 [SPARK-22030][CORE] GraphiteSink fails to re-connect to Graphite instances behind an ELB or any other auto-scaled LB
## What changes were proposed in this pull request?

Upgrade codahale metrics library so that Graphite constructor can re-resolve hosts behind a CNAME with re-tried DNS lookups. When Graphite is deployed behind an ELB, ELB may change IP addresses based on auto-scaling needs. Using current approach yields Graphite usage impossible, fixing for that use case

- Upgrade to codahale 3.1.5
- Use new Graphite(host, port) constructor instead of new Graphite(new InetSocketAddress(host, port)) constructor

## How was this patch tested?

The same logic is used for another project that is using the same configuration and code path, and graphite re-connect's behind ELB's are no longer an issue

This are proposed changes for codahale lib - https://github.com/dropwizard/metrics/compare/v3.1.2...v3.1.5#diff-6916c85d2dd08d89fe771c952e3b8512R120. Specifically, b4d246d34e/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java (L120)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: alexmnyc <project@alexandermarkham.com>

Closes #19210 from alexmnyc/patch-1.
2017-09-19 10:05:59 +08:00
jerryshao 445f1790ad [SPARK-9104][CORE] Expose Netty memory metrics in Spark
## What changes were proposed in this pull request?

This PR exposes Netty memory usage for Spark's `TransportClientFactory` and `TransportServer`, including the details of each direct arena and heap arena metrics, as well as aggregated metrics. The purpose of adding the Netty metrics is to better know the memory usage of Netty in Spark shuffle, rpc and others network communications, and guide us to better configure the memory size of executors.

This PR doesn't expose these metrics to any sink, to leverage this feature, still requires to connect to either MetricsSystem or collect them back to Driver to display.

## How was this patch tested?

Add Unit test to verify it, also manually verified in real cluster.

Author: jerryshao <sshao@hortonworks.com>

Closes #18935 from jerryshao/SPARK-9104.
2017-09-05 21:28:54 -07:00
hyukjinkwon 02a4386aec [SPARK-20978][SQL] Bump up Univocity version to 2.5.4
## What changes were proposed in this pull request?

There was a bug in Univocity Parser that causes the issue in SPARK-20978. This was fixed as below:

```scala
val df = spark.read.schema("a string, b string, unparsed string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS())
df.show()
```

**Before**

```
java.lang.NullPointerException
	at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89)
	at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207)
...
```

**After**

```
+---+----+--------+
|  a|   b|unparsed|
+---+----+--------+
|  a|null|       a|
+---+----+--------+
```

It was fixed in 2.5.0 and 2.5.4 was released. I guess it'd be safe to upgrade this.

## How was this patch tested?

Unit test added in `CSVSuite.scala`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19113 from HyukjinKwon/bump-up-univocity.
2017-09-05 23:21:43 +08:00
Sean Owen 12ab7f7e89 [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation
…build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure

## What changes were proposed in this pull request?

This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.

In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.

It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release.

- Scalatest 2.x -> 3.0.3
- Chill 0.8.0 -> 0.8.4
- Clapper 1.0.x -> 1.1.2
- json4s 3.2.x -> 3.4.2
- Jackson 2.6.x -> 2.7.9 (required by json4s)

This change does _not_ fully enable a Scala 2.12 build:

- It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here
- It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too.

What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.

## How was this patch tested?

Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above.

Author: Sean Owen <sowen@cloudera.com>

Closes #18645 from srowen/SPARK-14280.
2017-09-01 19:21:21 +01:00
ArtRand fc45c2c88a [SPARK-20812][MESOS] Add secrets support to the dispatcher
Mesos has secrets primitives for environment and file-based secrets, this PR adds that functionality to the Spark dispatcher and the appropriate configuration flags.
Unit tested and manually tested against a DC/OS cluster with Mesos 1.4.

Author: ArtRand <arand@soe.ucsc.edu>

Closes #18837 from ArtRand/spark-20812-dispatcher-secrets-and-labels.
2017-08-31 10:58:41 -07:00
Herman van Hovell 05af2de0fd [SPARK-21830][SQL] Bump ANTLR version and fix a few issues.
## What changes were proposed in this pull request?
This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump.

The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse:
```sql
SELECT *
FROM RANGE(1000)
WHERE
TRUE
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
```

This is caused by a know bug in ANTLR (https://github.com/antlr/antlr4/issues/994), which was fixed in version 4.6.

## How was this patch tested?
Existing tests.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #19042 from hvanhovell/SPARK-21830.
2017-08-24 16:33:55 -07:00
Dongjoon Hyun 8c54f1eb71 [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0
## What changes were proposed in this pull request?

Like Parquet, this PR aims to depend on the latest Apache ORC 1.4 for Apache Spark 2.3. There are key benefits for Apache ORC 1.4.

- Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC community more.
- Maintainability: Reduce the Hive dependency and can remove old legacy code later.

Later, we can get the following two key benefits by adding new ORCFileFormat in SPARK-20728 (#17980), too.
- Usability: User can use ORC data sources without hive module, i.e, -Phive.
- Speed: Use both Spark ColumnarBatch and ORC RowBatch together. This will be faster than the current implementation in Spark.

## How was this patch tested?

Pass the jenkins.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #18640 from dongjoon-hyun/SPARK-21422.
2017-08-15 23:00:13 -07:00
Takeshi Yamamuro b78cf13bf0 [SPARK-21276][CORE] Update lz4-java to the latest (v1.4.0)
## What changes were proposed in this pull request?
This pr updated `lz4-java` to the latest (v1.4.0) and removed custom `LZ4BlockInputStream`. We currently use custom `LZ4BlockInputStream` to read concatenated byte stream in shuffle. But, this functionality has been implemented in the latest lz4-java (https://github.com/lz4/lz4-java/pull/105). So, we might update the latest to remove the custom `LZ4BlockInputStream`.

Major diffs between the latest release and v1.3.0 in the master are as follows (62f7547abb...6d4693f562);
- fixed NPE in XXHashFactory similarly
- Don't place resources in default package to support shading
- Fixes ByteBuffer methods failing to apply arrayOffset() for array-backed
- Try to load lz4-java from java.library.path, then fallback to bundled
- Add ppc64le binary
- Add s390x JNI binding
- Add basic LZ4 Frame v1.5.0 support
- enable aarch64 support for lz4-java
- Allow unsafeInstance() for ppc64le archiecture
- Add unsafeInstance support for AArch64
- Support 64-bit JNI build on Solaris
- Avoid over-allocating a buffer
- Allow EndMark to be incompressible for LZ4FrameInputStream.
- Concat byte stream

## How was this patch tested?
Existing tests.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #18883 from maropu/SPARK-21276.
2017-08-09 17:31:52 +02:00
WeichenXu b35660dd0e [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search
## What changes were proposed in this pull request?

Update breeze to 0.13.1 for an emergency bugfix in strong wolfe line search
https://github.com/scalanlp/breeze/pull/651

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #18797 from WeichenXu123/update-breeze.
2017-08-09 14:44:10 +08:00
Sean Owen fb54a564d7 [SPARK-20433][BUILD] Bump jackson from 2.6.5 to 2.6.7.1
## What changes were proposed in this pull request?

Taking over https://github.com/apache/spark/pull/18789 ; Closes #18789

Update Jackson to 2.6.7 uniformly, and some components to 2.6.7.1, to get some fixes and prep for Scala 2.12

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #18881 from srowen/SPARK-20433.
2017-08-08 18:15:29 -07:00
Sean Owen d3f4a21196 [SPARK-15526][ML][FOLLOWUP] Make JPMML provided scope to avoid including unshaded JARs, and repromote to compile in MLlib
Following the comment at https://issues.apache.org/jira/browse/SPARK-15526?focusedCommentId=16086106&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16086106 -- this change actually needed a little more work to be complete.

This also marks JPMML as `provided` to make sure its JARs aren't included in the `jars` output, but then scopes to `compile` in `mllib`. This is how Guava is handled.

Checked result in `assembly/target/scala-2.11/jars` to verify there are no JPMML jars. Maven and SBT builds still work.

Author: Sean Owen <sowen@cloudera.com>

Closes #18637 from srowen/SPARK-15526.2.
2017-07-18 09:53:51 -07:00
Bryan Cutler d03aebbe65 [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`.  This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process.  The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame.  Data types except complex, date, timestamp, and decimal  are currently supported, otherwise an `UnsupportedOperation` exception is thrown.

Additions to Spark include a Scala package private method `Dataset.toArrowPayload` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served.  A package private class/object `ArrowConverters` that provide data type mappings and conversion routines.  In Python, a private method `DataFrame._collectAsArrow` is added to collect Arrow payloads and a SQLConf "spark.sql.execution.arrow.enable" can be used in `toPandas()` to enable using Arrow (uses the old conversion by default).

## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types.  The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data.  This will ensure that the schema and data has been converted correctly.

Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow.  A roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to a one made directly with pandas.

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #18459 from BryanCutler/toPandas_with_arrow-SPARK-13534.
2017-07-10 15:21:03 -07:00
Dongjoon Hyun c8d0aba198 [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6
## What changes were proposed in this pull request?

This PR aims to bump Py4J in order to fix the following float/double bug.
Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272) and the latest Py4J is 0.10.6.

**BEFORE**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+--------------------+
|(id + 17.1335742042)|
+--------------------+
|       17.1335742042|
+--------------------+
```

**AFTER**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+-------------------------+
|(id + 17.133574204226083)|
+-------------------------+
|       17.133574204226083|
+-------------------------+
```

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #18546 from dongjoon-hyun/SPARK-21278.
2017-07-05 16:33:23 -07:00
Wenchen Fan 838effb98a Revert "[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas"
This reverts commit e44697606f.
2017-06-28 14:28:40 +08:00
Bryan Cutler e44697606f [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`.  This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process.  The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame.  All non-complex data types are currently supported, otherwise an `UnsupportedOperation` exception is thrown.

Additions to Spark include a Scala package private method `Dataset.toArrowPayloadBytes` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served.  A package private class/object `ArrowConverters` that provide data type mappings and conversion routines.  In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads and an optional flag in `toPandas(useArrow=False)` to enable using Arrow (uses the old conversion by default).

## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types.  The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data.  This will ensure that the schema and data has been converted correctly.

Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow.  A roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to a one made directly with pandas.

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.
2017-06-23 09:01:13 +08:00
Yuming Wang 823f1eef58 [SPARK-13933][BUILD] Update hadoop-2.7 profile's curator version to 2.7.1
## What changes were proposed in this pull request?

Update hadoop-2.7 profile's curator version to 2.7.1, more see [SPARK-13933](https://issues.apache.org/jira/browse/SPARK-13933).

## How was this patch tested?

manual tests

Author: Yuming Wang <wgyumg@gmail.com>

Closes #18247 from wangyum/SPARK-13933.
2017-06-11 10:05:47 +01:00
Yanbo Liang 67eef47acf
[SPARK-20449][ML] Upgrade breeze version to 0.13.1
## What changes were proposed in this pull request?
Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17746 from yanboliang/spark-20449.
2017-04-25 17:10:41 +00:00
Sean Owen e8d3fca450
[SPARK-19464][CORE][YARN][TEST-HADOOP2.6] Remove support for Hadoop 2.5 and earlier
## What changes were proposed in this pull request?

- Remove support for Hadoop 2.5 and earlier
- Remove reflection and code constructs only needed to support multiple versions at once
- Update docs to reflect newer versions
- Remove older versions' builds and profiles.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16810 from srowen/SPARK-19464.
2017-02-08 12:20:07 +00:00
Dongjoon Hyun 26a4cba3ff [SPARK-19409][BUILD] Bump parquet version to 1.8.2
## What changes were proposed in this pull request?

According to the discussion on #16281 which tried to upgrade toward Apache Parquet 1.9.0, Apache Spark community prefer to upgrade to 1.8.2 instead of 1.9.0. Now, Apache Parquet 1.8.2 is released officially last week on 26 Jan. We can use 1.8.2 now.

https://lists.apache.org/thread.html/af0c813f1419899289a336d96ec02b3bbeecaea23aa6ef69f435c142%3Cdev.parquet.apache.org%3E

This PR only aims to bump Parquet version to 1.8.2. It didn't touch any other codes.

## How was this patch tested?

Pass the existing tests and also manually by doing `./dev/test-dependencies.sh`.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16751 from dongjoon-hyun/SPARK-19409.
2017-01-31 11:43:52 +01:00
Adam Roberts 17ce0b5b3f
[SPARK-18782][BUILD] Bump Hadoop 2.6 version to use Hadoop 2.6.5
**What changes were proposed in this pull request?**

Use Hadoop 2.6.5 for the Hadoop 2.6 profile, I see a bunch of fixes including security ones in the release notes that we should pick up

**How was this patch tested?**

Running the unit tests now with IBM's SDK for Java and let's see what happens with OpenJDK in the community builder - expecting no trouble as it is only a minor release.

Author: Adam Roberts <aroberts@uk.ibm.com>

Closes #16616 from a-roberts/Hadoop265Bumper.
2017-01-18 09:46:34 +00:00
Shixiong Zhu a8567e34dc
[SPARK-18971][CORE] Upgrade Netty to 4.0.43.Final
## What changes were proposed in this pull request?

Upgrade Netty to `4.0.43.Final` to add the fix for https://github.com/netty/netty/issues/6153

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16568 from zsxwing/SPARK-18971.
2017-01-15 11:15:35 +00:00
Sean Owen 856bae6af6 [SPARK-18997][CORE] Recommended upgrade libthrift to 0.9.3
## What changes were proposed in this pull request?

Updates to libthrift 0.9.3 to address a CVE.

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #16530 from srowen/SPARK-18997.
2017-01-10 12:40:21 -08:00
Yin Huai 1a64388973 [SPARK-18951] Upgrade com.thoughtworks.paranamer/paranamer to 2.6
## What changes were proposed in this pull request?
I recently hit a bug of com.thoughtworks.paranamer/paranamer, which causes jackson fail to handle byte array defined in a case class. Then I find https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests that it is caused by a bug in paranamer. Let's upgrade paranamer. Since we are using jackson 2.6.5 and jackson-module-paranamer 2.6.5 use com.thoughtworks.paranamer/paranamer 2.6, I suggests that we upgrade paranamer to 2.6.

Author: Yin Huai <yhuai@databricks.com>

Closes #16359 from yhuai/SPARK-18951.
2016-12-21 09:26:13 -08:00
Sean Owen 553aac56bd
[SPARK-18586][BUILD] netty-3.8.0.Final.jar has vulnerability CVE-2014-3488 and CVE-2014-0193
## What changes were proposed in this pull request?

Force update to latest Netty 3.9.x, for dependencies like Flume, to resolve two CVEs. 3.9.2 is the first version that resolves both, and, this is the latest in the 3.9.x line.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16102 from srowen/SPARK-18586.
2016-12-03 09:53:47 +00:00
Yin Huai eba727757e [SPARK-18602] Set the version of org.codehaus.janino:commons-compiler to 3.0.0 to match the version of org.codehaus.janino:janino
## What changes were proposed in this pull request?
org.codehaus.janino:janino depends on org.codehaus.janino:commons-compiler and we have been upgraded to org.codehaus.janino:janino 3.0.0.

However, seems we are still pulling in org.codehaus.janino:commons-compiler 2.7.6 because of calcite. It looks like an accident because we exclude janino from calcite (see here https://github.com/apache/spark/blob/branch-2.1/pom.xml#L1759). So, this PR upgrades org.codehaus.janino:commons-compiler to 3.0.0.

## How was this patch tested?
jenkins

Author: Yin Huai <yhuai@databricks.com>

Closes #16025 from yhuai/janino-commons-compile.
2016-11-28 10:09:30 -08:00
Guoqiang Li bc41d997ea
[SPARK-18375][SPARK-18383][BUILD][CORE] Upgrade netty to 4.0.42.Final
## What changes were proposed in this pull request?

One of the important changes for 4.0.42.Final is "Support any FileRegion implementation when using epoll transport netty/netty#5825".
In 4.0.42.Final, `MessageWithHeader` can work properly when `spark.[shuffle|rpc].io.mode` is set to epoll

## How was this patch tested?

Existing tests

Author: Guoqiang Li <witgo@qq.com>

Closes #15830 from witgo/SPARK-18375_netty-4.0.42.
2016-11-12 09:49:14 +00:00
Sean Owen 16eaad9dae [SPARK-18262][BUILD][SQL] JSON.org license is now CatX
## What changes were proposed in this pull request?

Try excluding org.json:json from hive-exec dep as it's Cat X now. It may be the case that it's not used by the part of Hive Spark uses anyway.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #15798 from srowen/SPARK-18262.
2016-11-10 10:20:03 -08:00
Jagadeesan 595893d33a
[SPARK-17960][PYSPARK][UPGRADE TO PY4J 0.10.4]
## What changes were proposed in this pull request?

1) Upgrade the Py4J version on the Java side
2) Update the py4j src zip file we bundle with Spark

## How was this patch tested?

Existing doctests & unit tests pass

Author: Jagadeesan <as2@us.ibm.com>

Closes #15514 from jagadeesanas2/SPARK-17960.
2016-10-21 09:48:24 +01:00
Takuya UESHIN 9540357ada
[SPARK-17985][CORE] Bump commons-lang3 version to 3.5.
## What changes were proposed in this pull request?

`SerializationUtils.clone()` of commons-lang3 (<3.5) has a bug that breaks thread safety, which gets stack sometimes caused by race condition of initializing hash map.
See https://issues.apache.org/jira/browse/LANG-1251.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #15548 from ueshin/issues/SPARK-17985.
2016-10-19 10:06:43 +01:00
Reynold Xin cd662bc7a2 Revert "[SPARK-17985][CORE] Bump commons-lang3 version to 3.5."
This reverts commit bfe7885aee.

The commit caused build failures on Hadoop 2.2 profile:

```
[error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1489: value read is not a member of object org.apache.commons.io.IOUtils
[error]       var numBytes = IOUtils.read(gzInputStream, buf)
[error]                              ^
[error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1492: value read is not a member of object org.apache.commons.io.IOUtils
[error]         numBytes = IOUtils.read(gzInputStream, buf)
[error]                            ^
```
2016-10-18 13:56:35 -07:00
Takuya UESHIN bfe7885aee [SPARK-17985][CORE] Bump commons-lang3 version to 3.5.
## What changes were proposed in this pull request?

`SerializationUtils.clone()` of commons-lang3 (<3.5) has a bug that breaks thread safety, which gets stack sometimes caused by race condition of initializing hash map.
See https://issues.apache.org/jira/browse/LANG-1251.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #15525 from ueshin/issues/SPARK-17985.
2016-10-18 13:36:00 -07:00
Bryan Cutler 658c7147f5
[SPARK-17808][PYSPARK] Upgraded version of Pyrolite to 4.13
## What changes were proposed in this pull request?
Upgraded to a newer version of Pyrolite which supports serialization of a BinaryType StructField for PySpark.SQL

## How was this patch tested?
Added a unit test which fails with a raised ValueError when using the previous version of Pyrolite 4.9 and Python3

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #15386 from BryanCutler/pyrolite-upgrade-SPARK-17808.
2016-10-11 08:29:52 +02:00
hyukjinkwon 25a020be99
[SPARK-17583][SQL] Remove uesless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV
## What changes were proposed in this pull request?

This PR includes the changes below:

1. Upgrade Univocity library from 2.1.1 to 2.2.1

  This includes some performance improvement and also enabling auto-extending buffer in `maxCharsPerColumn` option in CSV. Please refer the [release notes](https://github.com/uniVocity/univocity-parsers/releases).

2. Remove useless `rowSeparator` variable existing in `CSVOptions`

  We have this unused variable in [CSVOptions.scala#L127](29952ed096/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala (L127)) but it seems possibly causing confusion that it actually does not care of `\r\n`. For example, we have an issue open about this, [SPARK-17227](https://issues.apache.org/jira/browse/SPARK-17227), describing this variable.

  This variable is virtually not being used because we rely on `LineRecordReader` in Hadoop which deals with only both `\n` and `\r\n`.

3. Set the default value of `maxCharsPerColumn` to auto-expending.

  We are setting 1000000 for the length of each column. It'd be more sensible we allow auto-expending rather than fixed length by default.

  To make sure, using `-1` is being described in the release note, [2.2.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.2.0).

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15138 from HyukjinKwon/SPARK-17583.
2016-09-21 10:35:29 +01:00
Reynold Xin dca771bec6 [SPARK-17558] Bump Hadoop 2.7 version from 2.7.2 to 2.7.3
## What changes were proposed in this pull request?
This patch bumps the Hadoop version in hadoop-2.7 profile from 2.7.2 to 2.7.3, which was recently released and contained a number of bug fixes.

## How was this patch tested?
The change should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #15115 from rxin/SPARK-17558.
2016-09-16 11:24:26 -07:00
Adam Roberts 0ad8eeb4d3 [SPARK-17379][BUILD] Upgrade netty-all to 4.0.41 final for bug fixes
## What changes were proposed in this pull request?
Upgrade netty-all to latest in the 4.0.x line which is 4.0.41, mentions several bug fixes and performance improvements we may find useful, see netty.io/news/2016/08/29/4-0-41-Final-4-1-5-Final.html. Initially tried to use 4.1.5 but noticed it's not backwards compatible.

## How was this patch tested?
Existing unit tests against branch-1.6 and branch-2.0 using IBM Java 8 on Intel, Power and Z architectures

Author: Adam Roberts <aroberts@uk.ibm.com>

Closes #14961 from a-roberts/netty.
2016-09-15 10:40:10 -07:00
Adam Roberts 6c08dbf683 [SPARK-17378][BUILD] Upgrade snappy-java to 1.1.2.6
## What changes were proposed in this pull request?

Upgrades the Snappy version to 1.1.2.6 from 1.1.2.4, release notes: https://github.com/xerial/snappy-java/blob/master/Milestone.md mention "Fix a bug in SnappyInputStream when reading compressed data that happened to have the same first byte with the stream magic header (#142)"

## How was this patch tested?
Existing unit tests using the latest IBM Java 8 on Intel, Power and Z architectures (little and big-endian)

Author: Adam Roberts <aroberts@uk.ibm.com>

Closes #14958 from a-roberts/master.
2016-09-06 22:13:25 +01:00
Ferdinand Xu 4b4e329e49 [SPARK-5682][CORE] Add encrypted shuffle in spark
This patch is using Apache Commons Crypto library to enable shuffle encryption support.

Author: Ferdinand Xu <cheng.a.xu@intel.com>
Author: kellyzly <kellyzly@126.com>

Closes #8880 from winningsix/SPARK-10771.
2016-08-30 09:15:31 -07:00
Sean Owen 0b3a4be92c [SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same java used in the spark environment
## What changes were proposed in this pull request?

Update to py4j 0.10.3 to enable JAVA_HOME support

## How was this patch tested?

Pyspark tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14748 from srowen/SPARK-16781.
2016-08-24 20:04:09 +01:00
Stefan Schulze 4775eb414f [SPARK-16770][BUILD] Fix JLine dependency management and version (Sca…
## What changes were proposed in this pull request?
As of Scala 2.11.x there is no longer a org.scala-lang:jline version aligned to the scala version itself. Scala console now uses the plain jline:jline module. Spark's  dependency management did not reflect this change properly, causing Maven to pull in Jline via transitive dependency. Unfortunately Jline 2.12 contained a minor but very annoying bug rendering the shell almost useless for developers with german keyboard layout. This request contains the following chages:
- Exclude transitive dependency 'jline:jline' from hive-exec module
- Remove global properties 'jline.version' and 'jline.groupId'
- Add both properties and dependency to 'scala-2.11' profile
- Add explicit dependency on 'jline:jline' to  module 'spark-repl'

## How was this patch tested?
- Running mvn dependency:tree and checking for correct Jline version 2.12.1
- Running full builds with assembly and checking for jline-2.12.1.jar in 'lib' folder of generated tarball

Author: Stefan Schulze <stefan.schulze@pentasys.de>

Closes #14429 from stsc-pentasys/SPARK-16770.
2016-08-03 17:07:10 -07:00
Michael Gummelt 266b92faff [SPARK-16637] Unified containerizer
## What changes were proposed in this pull request?

New config var: spark.mesos.docker.containerizer={"mesos","docker" (default)}

This adds support for running docker containers via the Mesos unified containerizer: http://mesos.apache.org/documentation/latest/container-image/

The benefit is losing the dependency on `dockerd`, and all the costs which it incurs.

I've also updated the supported Mesos version to 0.28.2 for support of the required protobufs.

This is blocked on: https://github.com/apache/spark/pull/14167

## How was this patch tested?

- manually testing jobs submitted with both "mesos" and "docker" settings for the new config var.
- spark/mesos integration test suite

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #14275 from mgummelt/unified-containerizer.
2016-07-29 05:50:47 -07:00
Adam Roberts 04a2c072d9 [SPARK-16751] Upgrade derby to 10.12.1.1
## What changes were proposed in this pull request?

Version of derby upgraded based on important security info at VersionEye. Test scope added so we don't include it in our final package anyway. NB: I think this should be backported to all previous releases as it is a security problem https://www.versioneye.com/java/org.apache.derby:derby/10.11.1.1

The CVE number is 2015-1832. I also suggest we add a SECURITY tag for JIRAs

## How was this patch tested?
Existing tests with the change making sure that we see no new failures. I checked derby 10.12.x and not derby 10.11.x is downloaded to our ~/.m2 folder.

I then used dev/make-distribution.sh and checked the dist/jars folder for Spark 2.0: no derby jar is present.

I don't know if this would also remove it from the assembly jar in our 1.x branches.

Author: Adam Roberts <aroberts@uk.ibm.com>

Closes #14379 from a-roberts/patch-4.
2016-07-29 04:43:01 -07:00
Philipp Hoffmann 0869b3a5f0 [SPARK-15271][MESOS] Allow force pulling executor docker images
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Mesos agents by default will not pull docker images which are cached
locally already. In order to run Spark executors from mutable tags like
`:latest` this commit introduces a Spark setting
(`spark.mesos.executor.docker.forcePullImage`). Setting this flag to
true will tell the Mesos agent to force pull the docker image (default is `false` which is consistent with the previous
implementation and Mesos' default
behaviour).

Author: Philipp Hoffmann <mail@philipphoffmann.de>

Closes #14348 from philipphoffmann/force-pull-image.
2016-07-26 16:09:10 +01:00
Josh Rosen fc17121d59 Revert "[SPARK-15271][MESOS] Allow force pulling executor docker images"
This reverts commit 978cd5f125.
2016-07-25 12:43:44 -07:00
Philipp Hoffmann 978cd5f125 [SPARK-15271][MESOS] Allow force pulling executor docker images
## What changes were proposed in this pull request?

Mesos agents by default will not pull docker images which are cached
locally already. In order to run Spark executors from mutable tags like
`:latest` this commit introduces a Spark setting
`spark.mesos.executor.docker.forcePullImage`. Setting this flag to
true will tell the Mesos agent to force pull the docker image (default is `false` which is consistent with the previous
implementation and Mesos' default
behaviour).

## How was this patch tested?

I ran a sample application including this change on a Mesos cluster and verified the correct behaviour for both, with and without, force pulling the executor image. As expected the image is being force pulled if the flag is set.

Author: Philipp Hoffmann <mail@philipphoffmann.de>

Closes #13051 from philipphoffmann/force-pull-image.
2016-07-25 20:14:47 +01:00
Yanbo Liang 670891496a [SPARK-16494][ML] Upgrade breeze version to 0.12
## What changes were proposed in this pull request?
breeze 0.12 has been released for more than half a year, and it brings lots of new features, performance improvement and bug fixes.
One of the biggest features is ```LBFGS-B``` which is an implementation of ```LBFGS``` with box constraints and much faster for some special case.
We would like to implement Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)) and it requires ```LBFGS-B``` as the optimization solver. So we should bump up the dependent breeze version to 0.12.
For more features, improvements and bug fixes of breeze 0.12, you can refer the following link:
https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c

## How was this patch tested?
No new tests, should pass the existing ones.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14150 from yanboliang/spark-16494.
2016-07-19 12:31:04 +01:00
Kazuaki Ishizaki f12a38b2db [SPARK-15467][BUILD] update janino version to 3.0.0
## What changes were proposed in this pull request?

This PR updates version of Janino compiler from 2.7.8 to 3.0.0. This version fixes [an Janino issue](https://github.com/janino-compiler/janino/issues/1) that fixes [an issue](https://issues.apache.org/jira/browse/SPARK-15467), which throws Java exception, in Spark.

## How was this patch tested?

Manually tested using a program in [the JIRA entry](https://issues.apache.org/jira/browse/SPARK-15467)

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #14127 from kiszk/SPARK-15467.
2016-07-10 17:58:27 -07:00
Adam Roberts 147c020823 [SPARK-15818][BUILD] Upgrade to Hadoop 2.7.2
## What changes were proposed in this pull request?

Updating the Hadoop version from 2.7.0 to 2.7.2 if we use the Hadoop-2.7 build profile

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Existing tests

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

I'd like us to use Hadoop 2.7.2 owing to the Hadoop release notes stating Hadoop 2.7.0 is not ready for production use

https://hadoop.apache.org/docs/r2.7.0/ states

"Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.6.0.
This release is not yet ready for production use. Production users should use 2.7.1 release and beyond."

Hadoop 2.7.1 release notes:
"Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building upon the previous release 2.7.0. This is the next stable release after Apache Hadoop 2.6.x."

And then Hadoop 2.7.2 release notes:
"Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building upon the previous stable release 2.7.1."

I've tested this is OK with Intel hardware and IBM Java 8 so let's test it with OpenJDK, ideally this will be pushed to branch-2.0 and master.

Author: Adam Roberts <aroberts@uk.ibm.com>

Closes #13556 from a-roberts/patch-2.
2016-06-09 10:34:01 +01:00
Shixiong Zhu 9a74de18a1 Revert "[SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work
## What changes were proposed in this pull request?

This reverts commit c24b6b679c. Sent a PR to run Jenkins tests due to the revert conflicts of `dev/deps/spark-deps-hadoop*`.

## How was this patch tested?

Jenkins unit tests, integration tests, manual tests)

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13417 from zsxwing/revert-SPARK-11753.
2016-05-31 14:50:07 -07:00
Ryan Blue 776d183c82 [SPARK-9876][SQL] Update Parquet to 1.8.1.
## What changes were proposed in this pull request?

This includes minimal changes to get Spark using the current release of Parquet, 1.8.1.

## How was this patch tested?

This uses the existing Parquet tests.

Author: Ryan Blue <blue@apache.org>

Closes #13280 from rdblue/SPARK-9876-update-parquet.
2016-05-27 16:59:38 -07:00
Villu Ruusmann 6d506c9ae9 [SPARK-15523][ML][MLLIB] Update JPMML to 1.2.15
## What changes were proposed in this pull request?

See https://issues.apache.org/jira/browse/SPARK-15523

This PR replaces PR #13293. It's isolated to a new branch, and contains some more squashed changes.

## How was this patch tested?

1. Executed `mvn clean package` in `mllib` directory
2. Executed `dev/test-dependencies.sh --replace-manifest` in the root directory.

Author: Villu Ruusmann <villu.ruusmann@gmail.com>

Closes #13297 from vruusmann/update-jpmml.
2016-05-26 08:11:34 -05:00
Herman van Hovell 527499b624 [SPARK-15525][SQL][BUILD] Upgrade ANTLR4 SBT plugin
## What changes were proposed in this pull request?
The ANTLR4 SBT plugin has been moved from its own repo to one on bintray. The version was also changed from `0.7.10` to `0.7.11`. The latter actually broke our build (ihji has fixed this by also adding `0.7.10` and others to the bin-tray repo).

This PR upgrades the SBT-ANTLR4 plugin and ANTLR4 to their most recent versions (`0.7.11`/`4.5.3`). I have also removed a few obsolete build configurations.

## How was this patch tested?
Manually running SBT/Maven builds.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #13299 from hvanhovell/SPARK-15525.
2016-05-25 15:35:38 -07:00
Jurriaan Pruis c875d81a3d [SPARK-15493][SQL] default QuoteEscapingEnabled flag to true when writing CSV
## What changes were proposed in this pull request?

Default QuoteEscapingEnabled flag to true when writing CSV and add an escapeQuotes option to be able to change this.

See f3eb2af263/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java (L231-L247)

This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2)

https://issues.apache.org/jira/browse/SPARK-15493

## How was this patch tested?

Added a test that verifies the output is quoted correctly.

Author: Jurriaan Pruis <email@jurriaanpruis.nl>

Closes #13267 from jurriaan/quote-escaping.
2016-05-25 12:40:16 -07:00
Liang-Chi Hsieh c24b6b679c [SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work
## What changes were proposed in this pull request?

Jackson suppprts `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity", "INF".  Currently used Jackson version (2.5.3) doesn't support it all. This patch upgrades the library and make the two ignored tests in `JsonParsingOptionsSuite` passed.

## How was this patch tested?

`JsonParsingOptionsSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9759 from viirya/fix-json-nonnumric.
2016-05-24 09:43:39 -07:00
Sean Owen fabc8e5b12 [SPARK-12972][CORE][TEST-MAVEN][TEST-HADOOP2.2] Update org.apache.httpcomponents.httpclient, commons-io
## What changes were proposed in this pull request?

This is sort of a hot-fix for https://github.com/apache/spark/pull/13117, but, the problem is limited to Hadoop 2.2. The change is to manage `commons-io` to 2.4 for all Hadoop builds, which is only a net change for Hadoop 2.2, which was using 2.1.

## How was this patch tested?

Jenkins tests -- normal PR builder, then the `[test-hadoop2.2] [test-maven]` if successful.

Author: Sean Owen <sowen@cloudera.com>

Closes #13132 from srowen/SPARK-12972.3.
2016-05-16 16:27:04 +01:00
Sean Owen f5576a052d [SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient
## What changes were proposed in this pull request?

(Retry of https://github.com/apache/spark/pull/13049)

- update to httpclient 4.5 / httpcore 4.4
- remove some defunct exclusions
- manage httpmime version to match
- update selenium / httpunit to support 4.5 (possible now that Jetty 9 is used)

## How was this patch tested?

Jenkins tests. Also, locally running the same test command of one Jenkins profile that failed: `mvn -Phadoop-2.6 -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl ...`

Author: Sean Owen <sowen@cloudera.com>

Closes #13117 from srowen/SPARK-12972.2.
2016-05-15 15:56:46 +01:00
Sean Owen 10a8389674 Revert "[SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient"
This reverts commit c74a6c3f23.
2016-05-13 13:50:26 +01:00
Sean Owen c74a6c3f23 [SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient
## What changes were proposed in this pull request?

- update httpcore/httpclient to latest
- centralize version management
- remove excludes that are no longer relevant according to SBT/Maven dep graphs
- also manage httpmime to match httpclient

## How was this patch tested?

Jenkins tests, plus review of dependency graphs from SBT/Maven, and review of test-dependencies.sh  output

Author: Sean Owen <sowen@cloudera.com>

Closes #13049 from srowen/SPARK-12972.
2016-05-13 09:00:50 +01:00
Holden Karau 382dbc12bb [SPARK-15061][PYSPARK] Upgrade to Py4J 0.10.1
## What changes were proposed in this pull request?

This upgrades to Py4J 0.10.1 which reduces syscal overhead in Java gateway ( see https://github.com/bartdag/py4j/issues/201 ). Related https://issues.apache.org/jira/browse/SPARK-6728 .

## How was this patch tested?

Existing doctests & unit tests pass

Author: Holden Karau <holden@us.ibm.com>

Closes #13064 from holdenk/SPARK-15061-upgrade-to-py4j-0.10.1.
2016-05-13 08:59:18 +01:00
bomeng 81bf870848 [SPARK-14897][SQL] upgrade to jetty 9.2.16
## What changes were proposed in this pull request?

Since Jetty 8 is EOL (end of life) and has critical security issue [http://www.securityweek.com/critical-vulnerability-found-jetty-web-server], I think upgrading to 9 is necessary. I am using latest 9.2 since 9.3 requires Java 8+.

`javax.servlet` and `derby` were also upgraded since Jetty 9.2 needs corresponding version.

## How was this patch tested?

Manual test and current test cases should cover it.

Author: bomeng <bmeng@us.ibm.com>

Closes #12916 from bomeng/SPARK-14897.
2016-05-12 20:07:44 +01:00
hyukjinkwon ac12b35d31 [SPARK-15148][SQL] Upgrade Univocity library from 2.0.2 to 2.1.0
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-15148

Mainly it improves the performance roughtly about 30%-40% according to the [release note](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.1.0). For the details of the purpose is described in the JIRA.

This PR upgrades Univocity library from 2.0.2 to 2.1.0.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12923 from HyukjinKwon/SPARK-15148.
2016-05-05 11:26:40 -07:00
mcheah b7fdc23ccc [SPARK-12154] Upgrade to Jersey 2
## What changes were proposed in this pull request?

Replace com.sun.jersey with org.glassfish.jersey. Changes to the Spark Web UI code were required to compile. The changes were relatively standard Jersey migration things.

## How was this patch tested?

I did a manual test for the standalone web APIs. Although I didn't test the functionality of the security filter itself, the code that changed non-trivially is how we actually register the filter. I attached a debugger to the Spark master and verified that the SecurityFilter code is indeed invoked upon hitting /api/v1/applications.

Author: mcheah <mcheah@palantir.com>

Closes #12715 from mccheah/feature/upgrade-jersey.
2016-05-05 10:51:03 +01:00
Lining Sun 592fc45563 [SPARK-15123] upgrade org.json4s to 3.2.11 version
## What changes were proposed in this pull request?

We had the issue when using snowplow in our Spark applications. Snowplow requires json4s version 3.2.11 while Spark still use a few years old version 3.2.10. The change is to upgrade json4s jar to 3.2.11.

## How was this patch tested?

We built Spark jar and successfully ran our applications in local and cluster modes.

Author: Lining Sun <lining@gmail.com>

Closes #12901 from liningalex/master.
2016-05-05 10:47:39 +01:00
Davies Liu 7feeb82cb7 [SPARK-14987][SQL] inline hive-service (cli) into sql/hive-thriftserver
## What changes were proposed in this pull request?

This PR copy the thrift-server from hive-service-1.2 (including  TCLIService.thrift and generated Java source code) into sql/hive-thriftserver, so we can do further cleanup and improvements.

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #12764 from davies/thrift_server.
2016-04-29 09:32:42 -07:00
hyukjinkwon ec2a276022 [SPARK-14787][SQL] Upgrade Joda-Time library from 2.9 to 2.9.3
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14787

The possible problems are described in the JIRA above. Please refer this if you are wondering the purpose of this PR.

This PR upgrades Joda-Time library from 2.9 to 2.9.3.

## How was this patch tested?

`sbt scalastyle` and Jenkins tests in this PR.

closes #11847

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12552 from HyukjinKwon/SPARK-14787.
2016-04-21 11:32:27 +01:00
Josh Rosen 906eef4c7a [SPARK-11416][BUILD] Update to Chill 0.8.0 & Kryo 3.0.3
This patch upgrades Chill to 0.8.0 and Kryo to 3.0.3. While we'll likely need to bump these dependencies again before Spark 2.0 (due to SPARK-14221 / https://github.com/twitter/chill/issues/252), I wanted to get the bulk of the Kryo 2 -> Kryo 3 migration done now in order to figure out whether there are any unexpected surprises.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12076 from JoshRosen/kryo3.
2016-04-08 16:35:30 -07:00
hyukjinkwon 725b860e2b [SPARK-14103][SQL] Parse unescaped quotes in CSV data source.
## What changes were proposed in this pull request?

This PR resolves the problem during parsing unescaped quotes in input data. For example, currently the data below:

```
"a"b,ccc,ddd
e,f,g
```

produces a data below:

- **Before**

```bash
["a"b,ccc,ddd[\n]e,f,g]  <- as a value.
```

- **After**

```bash
["a"b], [ccc], [ddd]
[e], [f], [g]
```

This PR bumps up the Univocity parser's version. This was fixed in `2.0.2`, https://github.com/uniVocity/univocity-parsers/issues/60.

## How was this patch tested?

Unit tests in `CSVSuite` and `sbt/sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12226 from HyukjinKwon/SPARK-14103-quote.
2016-04-08 00:28:59 -07:00
Marcelo Vanzin 24d7d2e453 [SPARK-13579][BUILD] Stop building the main Spark assembly.
This change modifies the "assembly/" module to just copy needed
dependencies to its build directory, and modifies the packaging
script to pick those up (and remove duplicate jars packages in the
examples module).

I also made some minor adjustments to dependencies to remove some
test jars from the final packaging, and remove jars that conflict with each
other when packaged separately (e.g. servlet api).

Also note that this change restores guava in applications' classpaths, even
though it's still shaded inside Spark. This is now needed for the Hadoop
libraries that are packaged with Spark, which now are not processed by
the shade plugin.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11796 from vanzin/SPARK-13579.
2016-04-04 16:52:22 -07:00
Jacek Laskowski c16a396886 [SPARK-13825][CORE] Upgrade to Scala 2.11.8
## What changes were proposed in this pull request?

Upgrade to 2.11.8 (from the current 2.11.7)

## How was this patch tested?

A manual build

Author: Jacek Laskowski <jacek@japila.pl>

Closes #11681 from jaceklaskowski/SPARK-13825-scala-2_11_8.
2016-04-01 15:21:29 -07:00
Sital Kedia 8de201baed [SPARK-14277][CORE] Upgrade Snappy Java to 1.1.2.4
## What changes were proposed in this pull request?

Upgrade snappy to 1.1.2.4 to improve snappy read/write performance.

## How was this patch tested?

Tested by running a job on the cluster and saw 7.5% cpu savings after this change.

Author: Sital Kedia <skedia@fb.com>

Closes #12096 from sitalkedia/snappyRelease.
2016-03-31 16:06:44 -07:00
Herman van Hovell a9b93e0739 [SPARK-14211][SQL] Remove ANTLR3 based parser
### What changes were proposed in this pull request?

This PR removes the ANTLR3 based parser, and moves the new ANTLR4 based parser into the `org.apache.spark.sql.catalyst.parser package`.

### How was this patch tested?

Existing unit tests.

cc rxin andrewor14 yhuai

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12071 from hvanhovell/SPARK-14211.
2016-03-31 09:25:09 -07:00
Herman van Hovell 600c0b69ca [SPARK-13713][SQL] Migrate parser from ANTLR3 to ANTLR4
### What changes were proposed in this pull request?
The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4.

This parser is based on the [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQl DDL and some of the DML functionality is currently missing, the plan is to add this in follow-up PRs.

This PR is a work in progress, and work needs to be done in the following area's:

- [x] Error handling should be improved.
- [x] Documentation should be improved.
- [x] Multi-Insert needs to be tested.
- [ ] Naming and package locations.

### How was this patch tested?

Catalyst and SQL unit tests.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #11557 from hvanhovell/ngParser.
2016-03-28 12:31:12 -07:00
Josh Rosen 07cb323e7a [SPARK-13848][SPARK-5185] Update to Py4J 0.9.2 in order to fix classloading issue
This patch upgrades Py4J from 0.9.1 to 0.9.2 in order to include a patch which modifies Py4J to use the current thread's ContextClassLoader when performing reflection / class loading. This is necessary in order to fix [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185), a longstanding issue affecting the use of `--jars` and `--packages` in PySpark.

In order to demonstrate that the fix works, I removed the workarounds which were added as part of [SPARK-6027](https://issues.apache.org/jira/browse/SPARK-6027) / #4779 and other patches.

Py4J diff: https://github.com/bartdag/py4j/compare/0.9.1...0.9.2

/cc zsxwing tdas davies brkyvz

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11687 from JoshRosen/py4j-0.9.2.
2016-03-14 12:22:02 -07:00
Sean Owen 927e22eff8 [SPARK-13663][CORE] Upgrade Snappy Java to 1.1.2.1
## What changes were proposed in this pull request?

Update snappy to 1.1.2.1 to pull in a single fix -- the OOM fix we already worked around.
Supersedes https://github.com/apache/spark/pull/11524

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #11631 from srowen/SPARK-13663.
2016-03-10 15:17:37 +00:00
Steve Loughran 9a48c656ee [SPARK-13599][BUILD] remove transitive groovy dependencies from Hive
## What changes were proposed in this pull request?

Modifies the dependency declarations of the all the hive artifacts, to explicitly exclude the groovy-all JAR.

This stops the groovy classes *and everything else in that uber-JAR* from getting into spark-assembly JAR.

## How was this patch tested?

1. Pre-patch build was made: `mvn clean install -Pyarn,hive,hive-thriftserver`
1. spark-assembly expanded, observed to have the org.codehaus.groovy packages and JARs
1. A maven dependency tree was created `mvn dependency:tree -Pyarn,hive,hive-thriftserver  -Dverbose > target/dependencies.txt`
1. This text file examined to confirm that groovy was being imported as a dependency of `org.spark-project.hive`
1. Patch applied
1. Repeated step1: clean build of project with ` -Pyarn,hive,hive-thriftserver` set
1. Examined created spark-assembly, verified no org.codehaus packages
1. Verified that the maven dependency tree no longer references groovy

Note also that the size of the assembly JAR was 181628646 bytes before this patch, 166318515 after —15MB smaller. That's a good metric of things being excluded

Author: Steve Loughran <stevel@hortonworks.com>

Closes #11449 from steveloughran/fixes/SPARK-13599-groovy-dependency.
2016-03-03 09:35:49 -08:00
mark800 ec0cc75e15 [SPARK-7483][MLLIB] Upgrade Chill to 0.7.2 to support Kryo with FPGrowth
It registers more Scala classes, including ListBuffer to support Kryo with FPGrowth.

See https://github.com/twitter/chill/releases for Chill's change log.

Author: mark800 <yky800@126.com>

Closes #11041 from mark800/master.
2016-02-27 13:50:37 +00:00
Sean Owen b84404865b [SPARK-13324][CORE][BUILD] Update plugin, test, example dependencies for 2.x
Phase 1: update plugin versions, test dependencies, some example and third-party versions

Author: Sean Owen <sowen@cloudera.com>

Closes #11206 from srowen/SPARK-13324.
2016-02-17 19:03:29 -08:00
Josh Rosen 289373b28c [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version
This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).

The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).

After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10608 from JoshRosen/SPARK-6363.
2016-01-30 00:20:28 -08:00
Shixiong Zhu bc1babd63d [SPARK-7997][CORE] Remove Akka from Spark Core and Streaming
- Remove Akka dependency from core. Note: the streaming-akka project still uses Akka.
- Remove HttpFileServer
- Remove Akka configs from SparkConf and SSLOptions
- Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth to keep this config because using `DirectTaskResult` or `IndirectTaskResult`  depends on it.
- Update comments and docs

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10854 from zsxwing/remove-akka.
2016-01-22 21:20:04 -08:00
Josh Rosen 8dbbf3e75e [SPARK-12842][TEST-HADOOP2.7] Add Hadoop 2.7 build profile
This patch adds a Hadoop 2.7 build profile in order to let us automate tests against that version.

/cc rxin srowen

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10775 from JoshRosen/add-hadoop-2.7-profile.
2016-01-15 17:07:24 -08:00
Reynold Xin ad1503f92e [SPARK-12667] Remove block manager's internal "external block store" API
This pull request removes the external block store API. This is rarely used, and the file system interface is actually a better, more standard way to interact with external storage systems.

There are some other things to remove also, as pointed out by JoshRosen. We will do those as follow-up pull requests.

Author: Reynold Xin <rxin@databricks.com>

Closes #10752 from rxin/remove-offheap.
2016-01-15 12:03:28 -08:00
Hossein 5f83c6991c [SPARK-12833][SQL] Initial import of spark-csv
CSV is the most common data format in the "small data" world. It is often the first format people want to try when they see Spark on a single node. Having to rely on a 3rd party component for this leads to poor user experience for new users. This PR merges the popular spark-csv data source package (https://github.com/databricks/spark-csv) with SparkSQL.

This is a first PR to bring the functionality to spark 2.0 master. We will complete items outlines in the design document (see JIRA attachment) in follow up pull requests.

Author: Hossein <hossein@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #10766 from rxin/csv.
2016-01-15 11:46:46 -08:00
Shixiong Zhu 4f60651cbe [SPARK-12652][PYSPARK] Upgrade Py4J to 0.9.1
- [x] Upgrade Py4J to 0.9.1
- [x] SPARK-12657: Revert SPARK-12617
- [x] SPARK-12658: Revert SPARK-12511
  - Still keep the change that only reading checkpoint once. This is a manual change and worth to take a look carefully. bfd4b5c040
- [x] Verify no leak any more after reverting our workarounds

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10692 from zsxwing/py4j-0.9.1.
2016-01-12 14:27:05 -08:00
BrianLondon 8fe928b4fe [SPARK-12269][STREAMING][KINESIS] Update aws-java-sdk version
The current Spark Streaming kinesis connector references a quite old version 1.9.40 of the AWS Java SDK (1.10.40 is current). Numerous AWS features including Kinesis Firehose are unavailable in 1.9. Those two versions of the AWS SDK in turn require conflicting versions of Jackson (2.4.4 and 2.5.3 respectively) such that one cannot include the current AWS SDK in a project that also uses the Spark Streaming Kinesis ASL.

Author: BrianLondon <brian@seatgeek.com>

Closes #10256 from BrianLondon/master.
2016-01-11 09:32:06 +00:00
Josh Rosen f13c7f8f7d [SPARK-12734][HOTFIX][TEST-MAVEN] Fix bug in Netty exclusions
This is a hotfix for a build bug introduced by the Netty exclusion changes in #10672. We can't exclude `io.netty:netty` because Akka depends on it. There's not a direct conflict between `io.netty:netty` and `io.netty:netty-all`, because the former puts classes in the `org.jboss.netty` namespace while the latter uses the `io.netty` namespace. However, there still is a conflict between `org.jboss.netty:netty` and `io.netty:netty`, so we need to continue to exclude the JBoss version of that artifact.

While the diff here looks somewhat large, note that this is only a revert of a some of the changes from #10672. You can see the net changes in pom.xml at 3119206b71...5211ab8 (diff-600376dffeb79835ede4a0b285078036)

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10693 from JoshRosen/netty-hotfix.
2016-01-11 00:31:29 -08:00
Josh Rosen 3ab0138b0f [SPARK-12734][BUILD] Fix Netty exclusion and use Maven Enforcer to prevent future bugs
Netty classes are published under multiple artifacts with different names, so our build needs to exclude the `io.netty:netty` and `org.jboss.netty:netty` versions of the Netty artifact. However, our existing exclusions were incomplete, leading to situations where duplicate Netty classes would wind up on the classpath and cause compile errors (or worse).

This patch fixes the exclusion issue by adding more exclusions and uses Maven Enforcer's [banned dependencies](https://maven.apache.org/enforcer/enforcer-rules/bannedDependencies.html) rule to prevent these classes from accidentally being reintroduced. I also updated `dev/test-dependencies.sh` to run `mvn validate` so that the enforcer rules can run as part of pull request builds.

/cc rxin srowen pwendell. I'd like to backport at least the exclusion portion of this fix to `branch-1.5` in order to fix the documentation publishing job, which fails nondeterministically due to incompatible versions of Netty classes taking precedence on the compile-time classpath.

Author: Josh Rosen <rosenville@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #10672 from JoshRosen/enforce-netty-exclusions.
2016-01-10 19:59:01 -08:00
Herman van Hovell ea489f14f1 [SPARK-12573][SPARK-12574][SQL] Move SQL Parser from Hive to Catalyst
This PR moves a major part of the new SQL parser to Catalyst. This is a prelude to start using this parser for all of our SQL parsing. The following key changes have been made:

The ANTLR Parser & Supporting classes have been moved to the Catalyst project. They are now part of the ```org.apache.spark.sql.catalyst.parser``` package. These classes contained quite a bit of code that was originally from the Hive project, I have added aknowledgements whenever this applied. All Hive dependencies have been factored out. I have also taken this chance to clean-up the ```ASTNode``` class, and to improve the error handling.

The HiveQl object that provides the functionality to convert an AST into a LogicalPlan has been refactored into three different classes, one for every SQL sub-project:
- ```CatalystQl```: This implements Query and Expression parsing functionality.
- ```SparkQl```: This is a subclass of CatalystQL and provides SQL/Core only functionality such as Explain and Describe.
- ```HiveQl```: This is a subclass of ```SparkQl``` and this adds Hive-only functionality to the parser such as Analyze, Drop, Views, CTAS & Transforms. This class still depends on Hive.

cc rxin

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #10583 from hvanhovell/SPARK-12575.
2016-01-06 11:16:53 -08:00
Josh Rosen 0d165ec205 [SPARK-12612][PROJECT-INFRA] Add missing Hadoop profiles to dev/run-tests-*.py scripts and dev/deps
There are a couple of places in the `dev/run-tests-*.py` scripts which deal with Hadoop profiles, but the set of profiles that they handle does not include all Hadoop profiles defined in our POM. Similarly, the `hadoop-2.2` and `hadoop-2.6` profiles were missing from `dev/deps`.

This patch updates these scripts to include all four Hadoop profiles defined in our POM.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10565 from JoshRosen/add-missing-hadoop-profiles-in-test-scripts.
2016-01-03 22:05:02 -08:00
Josh Rosen 27a42c7108 [SPARK-10359] Enumerate dependencies in a file and diff against it for new pull requests
This patch adds a new build check which enumerates Spark's resolved runtime classpath and saves it to a file, then diffs against that file to detect whether pull requests have introduced dependency changes. The aim of this check is to make it simpler to reason about whether pull request which modify the build have introduced new dependencies or changed transitive dependencies in a way that affects the final classpath.

This supplants the checks added in SPARK-4123 / #5093, which are currently disabled due to bugs.

This patch is based on pwendell's work in #8531.

Closes #8531.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Patrick Wendell <patrick@databricks.com>

Closes #10461 from JoshRosen/SPARK-10359.
2015-12-30 12:47:42 -08:00