Commit graph

555 commits

Author SHA1 Message Date
Sean Owen eb9a696dd6 [MINOR][BUILD] Update Jetty to 9.3.24.v20180605
## What changes were proposed in this pull request?

Update Jetty to 9.3.24.v20180605 to pick up security fix

## How was this patch tested?

Existing tests.

Closes #22055 from srowen/Jetty9324.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2018-08-09 13:04:03 -05:00
DB Tsai 51bee7aca1 [SPARK-25018][INFRA] Use Co-authored-by and Signed-off-by git trailer in merge_spark_pr.py
## What changes were proposed in this pull request?

In the [Linux community](https://git.wiki.kernel.org/index.php/CommitMessageConventions), the `Co-authored-by` and `Signed-off-by` git trailers have been used for a while.

Recently, GitHub adopted `Co-authored-by` to include the work of co-authors in the profile contributions graph and the repository's statistics. It's a convention for recognizing multiple authors, and it can encourage people to collaborate in OSS communities.

Git provides command-line tools to read this metadata and find out who committed the code upstream, but that is not as convenient as having `Signed-off-by` in the message itself, which makes it easier for developers to find the committers who can help with a certain part of the codebase.

For a single-author PR, I propose to use `Authored-by` and `Signed-off-by`, so the message will look like

```
Authored-by: Author's name <author@example.com>
Signed-off-by: Committer's name <committer@example.com>
```

For a multi-author PR, I propose to use `Lead-authored-by:` and `Co-authored-by:` for the lead author and co-authors. The message will look like

```
Lead-authored-by: Lead Author's name <leadauthor@example.com>
Co-authored-by: CoAuthor's name <coauthor@example.com>
Signed-off-by: Committer's name <committer@example.com>
```

It's also useful to include `Reviewed-by:` to give credit to the people who participated in the code review. We can add this in a later iteration.
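
As a rough illustration of the convention (not the actual merge-script code; the helper and variable names below are made up), composing the trailer block from an ordered author list might look like this:

```python
def format_trailers(authors, committer):
    """Build the git trailer block.

    `authors` is an ordered list of "Name <email>" strings, lead author first
    (hypothetical helper, for illustration only).
    """
    lines = []
    if len(authors) == 1:
        lines.append("Authored-by: %s" % authors[0])
    else:
        lines.append("Lead-authored-by: %s" % authors[0])
        lines.extend("Co-authored-by: %s" % a for a in authors[1:])
    lines.append("Signed-off-by: %s" % committer)
    return "\n".join(lines)


print(format_trailers(["DB Tsai <d_tsai@apple.com>",
                       "Liang-Chi Hsieh <viirya@gmail.com>"],
                      "hyukjinkwon <gurwls223@apache.org>"))
```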

Closes #21991 from dbtsai/script.

Lead-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Brian Lindblom <blindblom@apple.com>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-08-07 10:31:11 +08:00
Sean Owen 5f9633dc97 [SPARK-25015][BUILD] Update Hadoop 2.7 to 2.7.7
## What changes were proposed in this pull request?

Update Hadoop 2.7 to 2.7.7 to pull in bug and security fixes.

## How was this patch tested?

Existing tests.

Author: Sean Owen <srowen@gmail.com>

Closes #21987 from srowen/SPARK-25015.
2018-08-04 14:59:13 -05:00
Maxim Gekk b3f2911eeb [SPARK-24945][SQL] Switching to uniVocity 2.7.3
## What changes were proposed in this pull request?

In the PR, I propose to upgrade uniVocity parser from **2.6.3** to **2.7.3**. The recent version includes a fix for the SPARK-24645 issue and has better performance.

Before changes:
```
Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
One quoted string                           33336 / 34122          0.0      666727.0       1.0X

Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Select 1000 columns                         90287 / 91713          0.0       90286.9       1.0X
Select 100 columns                          31826 / 36589          0.0       31826.4       2.8X
Select one column                           25738 / 25872          0.0       25737.9       3.5X
count()                                       6931 / 7269          0.1        6931.5      13.0X
```
After changes:
```
Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
One quoted string                           33411 / 33510          0.0      668211.4       1.0X

Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Select 1000 columns                         88028 / 89311          0.0       88028.1       1.0X
Select 100 columns                          29010 / 32755          0.0       29010.1       3.0X
Select one column                           22936 / 22953          0.0       22936.5       3.8X
count()                                       6657 / 6740          0.2        6656.6      13.5X
```
Closes #21892

## How was this patch tested?

It was tested by `CSVSuite` and `CSVBenchmarks`

Author: Maxim Gekk <maxim.gekk@databricks.com>

Closes #21969 from MaxGekk/univocity-2_7_3.
2018-08-03 08:33:28 +08:00
hyukjinkwon f1550aaf15 [SPARK-24956][BUILD][FOLLOWUP] Upgrade Maven version to 3.5.4 for AppVeyor as well
## What changes were proposed in this pull request?

The Maven version was upgraded, and AppVeyor should also use the upgraded Maven version.

Currently, the build looks broken because of this:

https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/2458-master

```
[WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion failed with message:
Detected Maven Version: 3.3.9 is not in the allowed range 3.5.4.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
```

## How was this patch tested?

AppVeyor tests

Author: hyukjinkwon <gurwls223@apache.org>

Closes #21920 from HyukjinKwon/SPARK-24956.
2018-07-31 09:14:29 +08:00
Gengliang Wang b90bfe3c42 [SPARK-24771][BUILD] Upgrade Apache AVRO to 1.8.2
## What changes were proposed in this pull request?

Upgrade Apache Avro from 1.7.7 to 1.8.2. The major new features:

1. More logical types. From the spec of [1.8.2](https://avro.apache.org/docs/1.8.2/spec.html#Logical+Types) we can see that, compared to [1.7.7](https://avro.apache.org/docs/1.7.7/spec.html#Logical+Types), the new version supports (a schema sketch follows this list):
    - Date
    - Time (millisecond precision)
    - Time (microsecond precision)
    - Timestamp (millisecond precision)
    - Timestamp (microsecond precision)
    - Duration

2. Single-object encoding: https://avro.apache.org/docs/1.8.2/spec.html#single_object_encoding
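
For reference, a minimal sketch of what one of these logical types looks like in an Avro 1.8 schema (the record and field names here are invented for illustration):

```python
import json

# A hypothetical Avro 1.8 schema: `created_at` is a long annotated with the
# `timestamp-millis` logical type introduced in the 1.8.x spec.
event_schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "created_at",
         "type": {"type": "long", "logicalType": "timestamp-millis"}},
    ],
}

print(json.dumps(event_schema, indent=2))
```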

This PR aims to update Apache Spark to support these new features.

## How was this patch tested?

Unit test

Author: Gengliang Wang <gengliang.wang@databricks.com>

Closes #21761 from gengliangwang/upgrade_avro_1.8.
2018-07-30 07:30:47 -07:00
hyukjinkwon f9c9d80e46 [SPARK-24929][INFRA] Make merge script don't swallow KeyboardInterrupt
## What changes were proposed in this pull request?

If you try to get out of the loop that assigns the JIRA user with Command+C (KeyboardInterrupt), you can't get out. I faced this problem when the user didn't have a contributor role and I just wanted to cancel and take action on the JIRA manually.

**Before:**

```
JIRA is unassigned, choose assignee
[0] todd.chen (Reporter)
Enter number of user, or userid,  to assign to (blank to leave unassigned):Traceback (most recent call last):
  File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee
    "Enter number of user, or userid,  to assign to (blank to leave unassigned):")
KeyboardInterrupt
Error assigning JIRA, try again (or leave blank and fix manually)
JIRA is unassigned, choose assignee
[0] todd.chen (Reporter)
Enter number of user, or userid,  to assign to (blank to leave unassigned):Traceback (most recent call last):
  File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee
    "Enter number of user, or userid,  to assign to (blank to leave unassigned):")
KeyboardInterrupt
```

**After:**

```
JIRA is unassigned, choose assignee
[0] Dongjoon Hyun (Reporter)
Enter number of user, or userid to assign to (blank to leave unassigned):Traceback (most recent call last):
  File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee
    "Enter number of user, or userid to assign to (blank to leave unassigned):")
KeyboardInterrupt
Restoring head pointer to master
git checkout master
Already on 'master'
git branch
```
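
A minimal sketch of the intended control flow (simplified, not the exact patch): the catch-all handler keeps the retry loop alive for ordinary errors, but Ctrl+C is allowed to propagate so the script can clean up and exit.

```python
while True:
    try:
        answer = raw_input(
            "Enter number of user, or userid, to assign to (blank to leave unassigned): ")
        break
    except KeyboardInterrupt:
        raise  # let the script's clean-up restore the head pointer and exit
    except Exception:
        print("Error assigning JIRA, try again (or leave blank and fix manually)")
```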

## How was this patch tested?

I tested this manually (I use my own merging script with a few fixes).

Author: hyukjinkwon <gurwls223@apache.org>

Closes #21880 from HyukjinKwon/key-error.
2018-07-27 13:29:54 +08:00
Dongjoon Hyun 3b59d326c7 [SPARK-24576][BUILD] Upgrade Apache ORC to 1.5.2
## What changes were proposed in this pull request?

This issue aims to upgrade Apache ORC library from 1.4.4 to 1.5.2 in order to bring the following benefits into Apache Spark.

- [ORC-91](https://issues.apache.org/jira/browse/ORC-91) Support for variable length blocks in HDFS (the space currently wasted on padding in ORC is known to be 5%.)
- [ORC-344](https://issues.apache.org/jira/browse/ORC-344) Support for using Decimal64ColumnVector

In addition to that, Apache Hive 3.1 and 3.2 will use ORC 1.5.1 ([HIVE-19669](https://issues.apache.org/jira/browse/HIVE-19465)) and 1.5.2 ([HIVE-19792](https://issues.apache.org/jira/browse/HIVE-19792)) respectively. This will improve the compatibility between Apache Spark and Apache Hive by sharing the common library.

## How was this patch tested?

Pass the Jenkins with all existing tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #21582 from dongjoon-hyun/SPARK-24576.
2018-07-17 23:52:17 -07:00
Gengliang Wang 395860a986 [SPARK-24768][SQL] Have a built-in AVRO data source implementation
## What changes were proposed in this pull request?

Apache Avro (https://avro.apache.org) is a popular data serialization format. It is widely used in the Spark and Hadoop ecosystem, especially for Kafka-based data pipelines. Using the external package https://github.com/databricks/spark-avro, Spark SQL can read and write Avro data. Making spark-avro built-in can provide a better experience for first-time users of Spark SQL and Structured Streaming. We expect the built-in Avro data source to further improve the adoption of Structured Streaming.
The proposal is to inline code from spark-avro package (https://github.com/databricks/spark-avro). The target release is Spark 2.4.

[Built-in AVRO Data Source In Spark 2.4.pdf](https://github.com/apache/spark/files/2181511/Built-in.AVRO.Data.Source.In.Spark.2.4.pdf)
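
For context, a sketch of what reading and writing Avro looks like once the data source ships with Spark (written against the planned 2.4 behavior; the module still needs to be on the classpath, e.g. via `--packages`, and the path below is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write a DataFrame as Avro and read it back using the short format name.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("avro").mode("overwrite").save("/tmp/events_avro")

spark.read.format("avro").load("/tmp/events_avro").show()
```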

## How was this patch tested?

Unit test

Author: Gengliang Wang <gengliang.wang@databricks.com>

Closes #21742 from gengliangwang/export_avro.
2018-07-12 13:55:25 -07:00
hyukjinkwon 4984f1af7e [MINOR] Add Sphinx into dev/requirements.txt
## What changes were proposed in this pull request?

Not a big deal, but this PR adds `sphinx` to `dev/requirements.txt` since we found it is needed - https://github.com/apache/spark-website/pull/122#discussion_r200896018

## How was this patch tested?

manually:

```
pip install -r requirements.txt
```

Author: hyukjinkwon <gurwls223@apache.org>

Closes #21735 from HyukjinKwon/minor-dev.
2018-07-10 13:54:04 +08:00
cclauss b42fda8ab3 [SPARK-23698] Remove raw_input() from Python 2
Signed-off-by: cclauss <cclauss@bluewin.ch>

## What changes were proposed in this pull request?

With this change, humans will be able to enter text at Python 3 prompts, which they cannot do today.
The Python builtin __raw_input()__ was removed in Python 3 in favor of __input()__. This PR does the same thing in the Python 2 code.
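
One common way to keep such scripts working on both Python 2 and 3 is a small shim like the following (a sketch of the general pattern, not necessarily the exact change made in this PR):

```python
import sys

if sys.version_info[0] < 3:
    # On Python 2, make `input` behave like Python 3's input(), i.e. raw_input().
    input = raw_input  # noqa: F821

answer = input("Proceed with the merge? (y/n): ")
print("You typed: %s" % answer)
```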

## How was this patch tested?

flake8 testing

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: cclauss <cclauss@bluewin.ch>

Closes #21702 from cclauss/python-fix-raw_input.
2018-07-04 09:40:58 +08:00
DB Tsai 5585c5765f [SPARK-24420][BUILD] Upgrade ASM to 6.1 to support JDK9+
## What changes were proposed in this pull request?

Upgrade ASM to 6.1 to support JDK9+

## How was this patch tested?

Existing tests.

Author: DB Tsai <d_tsai@apple.com>

Closes #21459 from dbtsai/asm.
2018-07-03 10:13:48 -07:00
Sean Owen f825847c82 [SPARK-24654][BUILD] Update, fix LICENSE and NOTICE, and specialize for source vs binary
Whew, lots of work to track down again all the license requirements, but this ought to be a pretty good pass. Below, find a writeup on how I approached it for future reference.

- LICENSE and NOTICE and licenses/ now reflect the *source* release
- LICENSE-binary and NOTICE-binary and licenses-binary now reflect the binary release
- Recreated all the license info from scratch
- Added notes about how this was constructed for next time
- License-oriented info was moved from NOTICE to LICENSE, esp. for Cat B deps
- Some seemingly superfluous or stale license info was removed, especially for test-scope deps
- Updated release script to put binary-oriented versions in binary releases

----

# Principles

ASF projects distribute source and binary code under the Apache License 2.0. However these project distributions frequently include copies of source or binary code from third parties, under possibly other license terms. This triggers conditions of those licenses, which essentially amount to including license information in a LICENSE and/or NOTICE file, and including copies of license texts (here, in a directory called `license/`).

See http://www.apache.org/dev/licensing-howto.html and https://www.apache.org/legal/resolved.html#required-third-party-notices

# In Spark

Spark produces source releases, and also binary releases of that code. Spark source code may contain source from third parties, possibly modified. This is true in Scala, Java, Python and R, and in the UI's JavaScript and CSS files. These must be handled appropriately per above in a LICENSE and NOTICE file created for the source release.

Separately, the binary releases may contain binary code from third parties. This is very much true for Scala and Java, as Spark produces an 'assembly' binary release which includes all transitive binary dependencies of this part of Spark. With perhaps the exception of py4j, this doesn't occur in the same way for Python or R because of the way these ecosystems work. (Note that the JS and CSS for the UI will be in both 'source' and 'binary' releases.) These must also be handled in a separate LICENSE and NOTICE file for the binary release.

# Binary Release License

## Transitive Maven Dependencies

We'll first tackle the binary release, and that almost entirely means assessing the transitive dependencies of the Scala/Java backbone of Spark.

Run `project-info-reports:dependencies` with essentially all profiles: a set that would bring in all different possible transitive dependencies. However, don't activate any of the '-lgpl' profiles as these would bring in LGPL-licensed dependencies that are explicitly excluded from Spark binary releases.

```
mvn -Phadoop-2.7 -Pyarn -Phive -Pmesos -Pkubernetes -Pflume -Pkinesis-asl -Pdocker-integration-tests -Phive-thriftserver -Pkafka-0-8 -Ddependency.locations.enabled=false project-info-reports:dependencies
```

Open `assembly/target/site/dependencies.html`. Find "Project Transitive Dependencies", and find "compile" and "runtime" (if it exists). This is a list of all the dependencies that Spark is going to ship in its binary "assembly" distro and whose licenses therefore need to be appropriately considered in LICENSE and NOTICE. Copy this table into a spreadsheet for easy management.

The next job is to fill in some blanks, as a few projects will not have clearly declared their licenses in a POM. Sort by license.

This is a good time to verify all the dependencies are at least Cat A/B licenses, and not Cat X! http://www.apache.org/legal/resolved.html

### Apache License 2

The Apache License 2 variants are typically the easiest to deal with, as they will not require you to modify LICENSE or add to licenses/. It's still good form to list the ALv2 dependencies in LICENSE for completeness, but this is optional.

They may require you to propagate bits from NOTICE. It's tedious to track down all the NOTICE files and evaluate what, if anything, needs to be copied to NOTICE.

Fortunately, this can be made easier as the assembly module can be temporarily modified to produce a NOTICE file that concatenates all NOTICE files bundled with transitive dependencies.

First change the packaging of `assembly/spark-assembly_2.11/pom.xml` to `<packaging>jar</packaging>`. Next add this stanza somewhere in the body of the same POM file:

```
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <shadedArtifactAttached>false</shadedArtifactAttached>
    <artifactSet>
      <includes>
        <include>*:*</include>
      </includes>
    </artifactSet>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ApacheNoticeResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Finally execute `mvn ... package` with all of the same `-P` profile flags as above. In the JAR file at `assembly/target/spark-assembly_2.11....jar` you'll find a file `META-INF/NOTICE` that concatenates all NOTICE files bundled with transitive dependencies. This should be the starting point for the binary release's NOTICE file.

Some elements in the file are from Spark itself, like:

```
Spark Project Assembly
Copyright 2018 The Apache Software Foundation

Spark Project Core
Copyright 2018 The Apache Software Foundation
```

These can be removed.

Remove elements of the combined NOTICE file that aren't relevant to Spark. It's actually rare that we are sure that some element is completely irrelevant to Spark, because each transitive dependency includes all its transitive dependencies. So there may be nothing that can be done here.

Of course, some projects may not publish NOTICE in their Maven artifacts. Ideally, search for the NOTICE file of projects that don't seem to have produced any text in NOTICE, but there is some argument that projects that don't produce a NOTICE in their Maven artifacts don't impose an obligation on projects that depend solely on their Maven artifacts.

### Other Licenses

Next are "Cat A" permissively licensed (BSD 2-Clause, BSD 3-Clause, MIT) components. List the components grouped by their license type in LICENSE. Then add the text of the license to licenses/. For example if you list "foo bar" as a BSD-licensed dependency, add its license text as licenses/LICENSE-foo-bar.txt.

Public domain and similar works are treated like permissively licensed dependencies.

And the same goes for all Cat B licenses too, like CDDL. However, these additionally require at least a URL pointer to the project's page. Use the artifact hyperlink in your spreadsheet if possible; if it doesn't exist or doesn't resolve, do your best to determine a URL for the project's source.

### Shaded third-party dependencies

Some third party dependencies actually copy in other dependencies rather than depend on them as Maven artifacts. This means they don't show up in the process above. These can be quite hard to track down, but are rare. A key example is reflectasm, embedded in kryo.

### Examples module

The above _almost_ considers everything bundled in a Spark binary release. The main assembly won't include examples. The same must be done for dependencies marked as 'compile' for the examples module. See `examples/target/site/dependencies.html`. At the time of this writing however this just adds one dependency: `scopt`.

### provided scope

Above we considered just compile and runtime scope dependencies, which makes sense as they are the ones that are packaged. However, for complicated reasons (shading), a few components that Spark does bundle are not marked as compile dependencies in the assembly. Therefore it's also necessary to consider 'provided' dependencies from `assembly/target/site/dependencies.html` actually! Right now that's just Jetty and JPMML artifacts.

## Python, R

Don't forget that Py4J is also distributed in the binary release, actually. There should be no other R, Python code in the binary release. That's it.

## Sense checking

Compare the contents of `jars/`, `examples/jars/` and `python/lib` from a recent binary release to see if anything appears there that doesn't seem to have been covered above. These additional components will have to be handled manually, but should be few or none of this type.

# Source Release License

While there are relatively few third-party source artifacts included as source code, there is no automated way to detect them, really. It requires some degree of manual auditing. Most third-party source comes from included JS and CSS files.

At the time of this writing, some places to look or consider: `build/sbt-launch-lib.bash`, `python/lib`, third party source in `python/pyspark` like `heapq3.py`, `docs/js/vendor`, and `core/src/main/resources/org/apache/spark/ui/static`.

The principles are the same as above.

Remember some JS files copy in other JS files! Look out for Modernizr.

# One More Thing: JS and CSS in Binary Release

Now that you've got a handle on source licenses, recall that all the JS and CSS source code will *also* be part of the binary release. Copy that info from source to binary license files accordingly.

Author: Sean Owen <srowen@gmail.com>

Closes #21640 from srowen/SPARK-24654.
2018-06-30 19:27:16 -05:00
DB Tsai c7967c6049 [SPARK-24418][BUILD] Upgrade Scala to 2.11.12 and 2.12.6
## What changes were proposed in this pull request?

Scala is upgraded to `2.11.12` and `2.12.6`.

We used `loadFiles()` in `ILoop` as a hook to initialize Spark before the REPL sees any files in Scala `2.11.8`. However, it was a hack, it was not intended to be a public API, and it was removed in Scala `2.11.12`.

From the discussion in Scala community, https://github.com/scala/bug/issues/10913 , we can use `initializeSynchronous` to initialize Spark instead. This PR implements the Spark initialization there.

However, in Scala `2.11.12`'s `ILoop.scala`, in function `def startup()`, the first thing it calls is `printWelcome()`. As a result, Scala will call `printWelcome()` and `splash` before calling `initializeSynchronous`.

Thus, the Spark shell will allow users to type commands first, and only then show the Spark UI URL. It works, but it changes the Spark shell interface as follows.

```scala
➜  apache-spark git:(scala-2.11.12) ✗ ./bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161)
Type in expressions to have them evaluated.
Type :help for more information.

scala> Spark context Web UI available at http://192.168.1.169:4040
Spark context available as 'sc' (master = local[*], app id = local-1528180279528).
Spark session available as 'spark'.

scala>
```

It seems there is no easy way to inject the Spark initialization code in the proper place as Scala doesn't provide a hook. Maybe som-snytt can comment on this.

The following command is used to update the dep files.
```
./dev/test-dependencies.sh --replace-manifest
```
## How was this patch tested?

Existing tests

Author: DB Tsai <d_tsai@apple.com>

Closes #21495 from dbtsai/scala-2.11.12.
2018-06-26 09:48:52 +08:00
Marcelo Vanzin 4e7d8678a3 [SPARK-24372][BUILD] Add scripts to help with preparing releases.
The "do-release.sh" script asks questions about the RC being prepared,
trying to find out as much as possible automatically, and then executes
the existing scripts with proper arguments to prepare the release. This
script was used to prepare the 2.3.1 release candidates, so was tested
in that context.

The docker version runs that same script inside a docker image especially
crafted for building Spark releases. That image is based on the work
by Felix C. linked in the bug. At this point it has been only mildly tested.

I also added a template for the vote e-mail, with placeholders for
things that need to be replaced, although there is no automation around
that for the moment. It shouldn't be hard to hook up certain things like
version and tags to this, or to figure out certain things like the
repo URL from the output of the release scripts.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #21515 from vanzin/SPARK-24372.
2018-06-22 12:38:34 -05:00
hyukjinkwon b0a9352559 [SPARK-24573][INFRA] Runs SBT checkstyle after the build to work around a side-effect
## What changes were proposed in this pull request?

It seems checkstyle affects the build in the PR builder in Jenkins. I can't reproduce it locally, and it seems it can only be reproduced in the PR builder.

I was checking the places it goes through, and it is just speculation, but checkstyle's compilation in SBT seems to have a side effect on the assembly build.

This PR proposes to run the SBT checkstyle after the build.

## How was this patch tested?

Jenkins tests.

Author: hyukjinkwon <gurwls223@apache.org>

Closes #21579 from HyukjinKwon/investigate-javastyle.
2018-06-18 15:32:34 +08:00
Sean Suchter f433ef7867 [SPARK-23010][K8S] Initial checkin of k8s integration tests.
These tests were developed in the https://github.com/apache-spark-on-k8s/spark-integration repo
by several contributors. This is a copy of the current state into the main apache spark repo.
The only changes from the current spark-integration repo state are:
* Move the files from the repo root into resource-managers/kubernetes/integration-tests
* Add a reference to these tests in the root README.md
* Fix a path reference in dev/dev-run-integration-tests.sh
* Add a TODO in include/util.sh

## What changes were proposed in this pull request?

Incorporation of Kubernetes integration tests.

## How was this patch tested?

This code has its own unit tests, but the main purpose is to provide the integration tests.
I tested this on my laptop by running dev/dev-run-integration-tests.sh --spark-tgz ~/spark-2.4.0-SNAPSHOT-bin--.tgz

The spark-integration tests have already been running for months in AMPLab, here is an example:
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-scheduled-spark-integration-master/

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Sean Suchter <sean-github@suchter.com>
Author: Sean Suchter <ssuchter@pepperdata.com>

Closes #20697 from ssuchter/ssuchter-k8s-integration-tests.
2018-06-08 15:15:24 -07:00
hyukjinkwon 4a14dc0aff [SPARK-22269][BUILD] Run Java linter via SBT for Jenkins
## What changes were proposed in this pull request?

This PR proposes to check Java lint via SBT for Jenkins. It uses the SBT wrapper for checkstyle.

I tested manually. If we build the code once, running this script takes at most 2 minutes on my machine:

Test codes:

```
Checkstyle failed at following occurrences:
[error] Checkstyle error found in /.../spark/core/src/test/java/test/org/apache/spark/JavaAPISuite.java:82: Line is longer than 100 characters (found 103).
[error] 1 issue(s) found in Checkstyle report: /.../spark/core/target/checkstyle-test-report.xml
[error] Checkstyle error found in /.../spark/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:84: Line is longer than 100 characters (found 115).
[error] 1 issue(s) found in Checkstyle report: /.../spark/sql/hive/target/checkstyle-test-report.xml
...
```

Main codes:

```
Checkstyle failed at following occurrences:
[error] Checkstyle error found in /.../spark/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartition.java:39: Line is longer than 100 characters (found 104).
[error] Checkstyle error found in /.../spark/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartitionReader.java:26: Line is longer than 100 characters (found 110).
[error] Checkstyle error found in /.../spark/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartitionReader.java:30: Line is longer than 100 characters (found 104).
...
```

## How was this patch tested?

Manually tested. Jenkins build should test this.

Author: hyukjinkwon <gurwls223@apache.org>

Closes #21399 from HyukjinKwon/SPARK-22269.
2018-05-24 14:19:32 +08:00
Dongjoon Hyun 486ecc680e [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4
## What changes were proposed in this pull request?

ORC 1.4.4 includes [nine fixes](https://issues.apache.org/jira/issues/?filter=12342568&jql=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4). One of the issues is a `Timestamp` bug (ORC-306) which occurs when the `native` ORC vectorized reader reads the ORC column vector's sub-vectors `times` and `nanos`. ORC-306 fixes this according to the [original definition](https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46), and this PR includes the updated interpretation of ORC column vectors. Note that the `hive` ORC reader and the ORC MR reader are not affected.

```scala
scala> spark.version
res0: String = 2.3.0
scala> spark.sql("set spark.sql.orc.impl=native")
scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.000789")).toDF().write.orc("/tmp/orc")
scala> spark.read.orc("/tmp/orc").show(false)
+--------------------------+
|value                     |
+--------------------------+
|1900-05-05 12:34:55.000789|
+--------------------------+
```

This PR aims to update Apache Spark to use it.

**FULL LIST**

ID | TITLE
-- | --
ORC-281 | Fix compiler warnings from clang 5.0
ORC-301 | `extractFileTail` should open a file in `try` statement
ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api
ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp
ORC-324 | Add support for ARM and PPC arch
ORC-330 | Remove unnecessary Hive artifacts from root pom
ORC-332 | Add syntax version to orc_proto.proto
ORC-336 | Remove avro and parquet dependency management entries
ORC-360 | Implement error checking on subtype fields in Java

## How was this patch tested?

Pass the Jenkins.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #21372 from dongjoon-hyun/SPARK_ORC144.
2018-05-24 11:34:13 +08:00
hyukjinkwon f32b7faf7c [MINOR][PROJECT-INFRA] Check if 'original_head' variable is defined in clean_up at merge script
## What changes were proposed in this pull request?

This PR proposes to check whether the global variable exists in clean_up. The failure can happen when the script fails at:

7013eea11c/dev/merge_spark_pr.py (L423)

I found this (it was a problem with my environment), but the error message took me a while to debug.
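
A minimal sketch of the kind of guard this implies (the function and helper names follow the traceback above but should be read as illustrative, not as the exact patch):

```python
import subprocess


def run_cmd(cmd):
    # Helper matching the script's usage shown in the traceback above.
    return subprocess.check_output(cmd.split(" "))


def clean_up():
    # Only try to restore the branch if main() got far enough to set original_head.
    if "original_head" in globals():
        print("Restoring head pointer to %s" % original_head)
        run_cmd("git checkout %s" % original_head)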

## How was this patch tested?

Manually tested:

**Before**

```
git rev-parse --abbrev-ref HEAD
fatal: Not a git repository (or any of the parent directories): .git
Traceback (most recent call last):
  File "./dev/merge_spark_pr_jira.py", line 517, in <module>
    clean_up()
  File "./dev/merge_spark_pr_jira.py", line 104, in clean_up
    print("Restoring head pointer to %s" % original_head)
NameError: global name 'original_head' is not defined
```

**After**

```
git rev-parse --abbrev-ref HEAD
fatal: Not a git repository (or any of the parent directories): .git
Traceback (most recent call last):
  File "./dev/merge_spark_pr.py", line 516, in <module>
    main()
  File "./dev/merge_spark_pr.py", line 424, in main
    original_head = get_current_ref()
  File "./dev/merge_spark_pr.py", line 412, in get_current_ref
    ref = run_cmd("git rev-parse --abbrev-ref HEAD").strip()
  File "./dev/merge_spark_pr.py", line 94, in run_cmd
    return subprocess.check_output(cmd.split(" "))
  File "/usr/local/Cellar/python2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 219, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['git', 'rev-parse', '--abbrev-ref', 'HEAD']' returned non-zero exit status 128
```

Author: hyukjinkwon <gurwls223@apache.org>

Closes #21349 from HyukjinKwon/minor-merge-script.
2018-05-21 09:47:52 +08:00
Marcelo Vanzin 8e60a16b73 [SPARK-23601][BUILD][FOLLOW-UP] Keep md5 checksums for nexus artifacts.
The repository.apache.org server still requires md5 checksums or
it won't publish the staging repo.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #21338 from vanzin/SPARK-23601.
2018-05-16 13:34:54 -07:00
Maxim Gekk 7a2d4895c7 [SPARK-17916][SQL] Fix empty string being parsed as null when nullValue is set.
## What changes were proposed in this pull request?

I propose to bump the version of the uniVocity parser up to 2.6.3, where quoted empty strings are replaced by the empty value (passed to `setEmptyValue`) instead of `null` values as in the current version 2.5.9:
https://github.com/uniVocity/univocity-parsers/blob/v2.6.3/src/main/java/com/univocity/parsers/csv/CsvParser.java#L125

The empty value for the writer is set to `""`, so an empty string in a dataframe/dataset is stored as the empty quoted string `""`. The empty value for the reader is set to the empty string (zero size). In this way, a saved empty quoted string will be read back as just an empty string. Please look at the tests for more details.
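
A sketch of the intended round trip after the upgrade (the path and data are placeholders; the expected behavior follows the description above, with empty strings preserved and missing values read back as null):

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([Row(name="alice", note=""), Row(name="bob", note=None)])
df.write.mode("overwrite").option("header", True).csv("/tmp/csv_empty_values")

# With uniVocity 2.6.3, the quoted empty string comes back as "" rather than null.
spark.read.option("header", True).csv("/tmp/csv_empty_values").show()
```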

Here are main changes made in [2.6.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.0), [2.6.1](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.1), [2.6.2](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.2), [2.6.3](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.3):

- CSV parser now parses quoted values ~30% faster
- The CSV format detection process now has an option to provide a list of possible delimiters, in order of priority (i.e. `settings.detectFormatAutomatically('-', '.');`) - https://github.com/uniVocity/univocity-parsers/issues/214
- Implemented trim quoted values support - https://github.com/uniVocity/univocity-parsers/issues/230
- NullPointer when stopping parser when nothing is parsed - https://github.com/uniVocity/univocity-parsers/issues/219
- Concurrency issue when calling stopParsing() - https://github.com/uniVocity/univocity-parsers/issues/231

Closes #20068

## How was this patch tested?

Added tests from the PR https://github.com/apache/spark/pull/20068

Author: Maxim Gekk <maxim.gekk@databricks.com>

Closes #21273 from MaxGekk/univocity-2.6.
2018-05-14 10:01:06 +08:00
Marcelo Vanzin cc613b552e [PYSPARK] Update py4j to version 0.10.7. 2018-05-09 10:47:35 -07:00
Ryan Blue cac9b1dea1 [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
## What changes were proposed in this pull request?

This updates Parquet to 1.10.0 and updates the vectorized path for buffer management changes. Parquet 1.10.0 uses ByteBufferInputStream instead of byte arrays in encoders. This allows Parquet to break allocations into smaller chunks that are better for garbage collection.

## How was this patch tested?

Existing Parquet tests. Running in production at Netflix for about 3 months.

Author: Ryan Blue <blue@apache.org>

Closes #21070 from rdblue/SPARK-23972-update-parquet-to-1.10.0.
2018-05-09 12:27:32 +08:00
Steve Loughran ce7ba2e98e [SPARK-23807][BUILD] Add Hadoop 3.1 profile with relevant POM fix ups
## What changes were proposed in this pull request?

1. Adds a `hadoop-3.1` profile build depending on the hadoop-3.1 artifacts.
1. In the hadoop-cloud module, adds an explicit hadoop-3.1 profile which switches from explicitly pulling in cloud connectors (hadoop-openstack, hadoop-aws, hadoop-azure) to depending on the hadoop-cloudstorage POM artifact, which pulls these in, has pre-excluded things like hadoop-common, and stays up to date with new connectors (hadoop-azuredatalake, hadoop-aliyun). Goal: it becomes the Hadoop project's homework to keep this clean, and the Spark project doesn't need to handle new Hadoop releases adding more dependencies.
1. the hadoop-cloud/hadoop-3.1 profile also declares support for jetty-ajax and jetty-util to ensure that these jars get into the distribution jar directory when needed by unshaded libraries.
1. Increases the curator and zookeeper versions to match those in hadoop-3, fixing spark core to build in sbt with the hadoop-3 dependencies.

## How was this patch tested?

* Everything this has been built and tested against both ASF Hadoop branch-3.1 and hadoop trunk.
* spark-shell was used to create connectors to all the stores and verify that file IO could take place.

The Spark hive-1.2.1 JAR has problems here, as its version check logic fails for Hadoop versions > 2.

This can be avoided with either of

* The Hadoop JARs built to declare their version as Hadoop 2.11: `mvn install -DskipTests -DskipShade -Ddeclared.hadoop.version=2.11`. This is safe for local test runs, not for deployment (HDFS is very strict about cross-version deployment).
* A modified version of spark hive whose version check switch statement is happy with hadoop 3.

I've done both, with maven and SBT.

Three issues surfaced:

1. A spark-core test failure, fixed in SPARK-23787.
1. SBT only: Zookeeper not being found in spark-core. Somehow curator 2.12.0 triggers some slightly different dependency resolution logic from previous versions, and Ivy was missing zookeeper.jar entirely. This patch adds the explicit declaration for all spark profiles, setting the ZK version = 3.4.9 for hadoop-3.1
1. Marking jetty-utils as provided in spark was stopping hadoop-azure from being able to instantiate the azure wasb:// client; it was using jetty-util-ajax, which could then not find a class in jetty-util.

Author: Steve Loughran <stevel@hortonworks.com>

Closes #20923 from steveloughran/cloud/SPARK-23807-hadoop-31.
2018-04-24 09:57:09 -07:00
Benjamin Peterson 7013eea11c [SPARK-23522][PYTHON] always use sys.exit over builtin exit
The exit() builtin is only for interactive use; applications should use sys.exit().

## What changes were proposed in this pull request?

All usage of the builtin `exit()` function is replaced by `sys.exit()`.
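
A small sketch of the pattern (illustrative only; the script name and argument are made up):

```python
import sys


def main():
    if len(sys.argv) < 2:
        print("usage: script.py <pr_number>")
        # sys.exit() works in any context; the exit() builtin is injected by the
        # site module for interactive sessions and may not always be available.
        sys.exit(1)


if __name__ == "__main__":
    main()
```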

## How was this patch tested?

I ran `python/run-tests`.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Benjamin Peterson <benjamin@python.org>

Closes #20682 from benjaminp/sys-exit.
2018-03-08 20:38:34 +09:00
Sean Owen 8bceb899dc [SPARK-23601][BUILD] Remove .md5 files from release
## What changes were proposed in this pull request?

Remove .md5 files from release artifacts

## How was this patch tested?

N/A

Author: Sean Owen <sowen@cloudera.com>

Closes #20737 from srowen/SPARK-23601.
2018-03-06 08:52:28 -06:00
Kazuaki Ishizaki 649ed9c573 [SPARK-23509][BUILD] Upgrade commons-net from 2.2 to 3.1
## What changes were proposed in this pull request?

This PR avoids version conflicts of `commons-net` by upgrading commons-net from 2.2 to 3.1. We are seeing the following message during the build using sbt.

```
[warn] Found version conflict(s) in library dependencies; some are suspected to be binary incompatible:
...
[warn] 	* commons-net:commons-net:3.1 is selected over 2.2
[warn] 	    +- org.apache.hadoop:hadoop-common:2.6.5              (depends on 3.1)
[warn] 	    +- org.apache.spark:spark-core_2.11:2.4.0-SNAPSHOT    (depends on 2.2)
[warn]
```

[Here](https://commons.apache.org/proper/commons-net/changes-report.html) is a release history.

[Here](https://commons.apache.org/proper/commons-net/migration.html) is a migration guide from 2.x to 3.0.

## How was this patch tested?

Existing tests

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #20672 from kiszk/SPARK-23509.
2018-02-27 08:18:41 -06:00
Kent Yao 189f56f3dc [SPARK-23383][BUILD][MINOR] Make a distribution should exit with usage while detecting wrong options
## What changes were proposed in this pull request?
```shell
./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz  -Phadoop-2.7
+++ dirname ./dev/make-distribution.sh
++ cd ./dev/..
++ pwd
+ SPARK_HOME=/Users/Kent/Documents/spark
+ DISTDIR=/Users/Kent/Documents/spark/dist
+ MAKE_TGZ=false
+ MAKE_PIP=false
+ MAKE_R=false
+ NAME=none
+ MVN=/Users/Kent/Documents/spark/build/mvn
+ ((  5  ))
+ case $1 in
+ NAME=ne-1.0.0-SNAPSHOT
+ shift
+ shift
+ ((  3  ))
+ case $1 in
+ break
+ '[' -z /Users/Kent/.jenv/candidates/java/current ']'
+ '[' -z /Users/Kent/.jenv/candidates/java/current ']'
++ command -v git
+ '[' /usr/local/bin/git ']'
++ git rev-parse --short HEAD
+ GITREV=98ea6a7
+ '[' '!' -z 98ea6a7 ']'
+ GITREVSTRING=' (git revision 98ea6a7)'
+ unset GITREV
++ command -v /Users/Kent/Documents/spark/build/mvn
+ '[' '!' /Users/Kent/Documents/spark/build/mvn ']'
++ /Users/Kent/Documents/spark/build/mvn help:evaluate -Dexpression=project.version xyz --tgz -Phadoop-2.7
++ grep -v INFO
++ tail -n 1
+ VERSION=' -X,--debug                             Produce execution debug output'
```
It is better to report the mistake and exit with the usage message than to `break`.

## How was this patch tested?

manually

cc srowen

Author: Kent Yao <yaooqinn@hotmail.com>

Closes #20571 from yaooqinn/SPARK-23383.
2018-02-20 07:51:30 -06:00
Dongjoon Hyun 3ee3b2ae1f [SPARK-23340][SQL] Upgrade Apache ORC to 1.4.3
## What changes were proposed in this pull request?

This PR updates the Apache ORC dependencies to 1.4.3, released on February 9th. The Apache ORC 1.4.2 release removes unnecessary dependencies, and 1.4.3 has 5 more patches (https://s.apache.org/Fll8).

In particular, the following ORC-285 issue is fixed in 1.4.3.

```scala
scala> val df = Seq(Array.empty[Float]).toDF()

scala> df.write.format("orc").save("/tmp/floatarray")

scala> spark.read.orc("/tmp/floatarray")
res1: org.apache.spark.sql.DataFrame = [value: array<float>]

scala> spark.read.orc("/tmp/floatarray").show()
18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.io.IOException: Error reading file: file:/tmp/floatarray/part-00000-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
	at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
...
Caused by: java.io.EOFException: Read past EOF for compressed stream Stream for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
```

## How was this patch tested?

Pass the Jenkins test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #20511 from dongjoon-hyun/SPARK-23340.
2018-02-17 00:25:36 -08:00
Tathagata Das 0a73aa31f4 [SPARK-23362][SS] Migrate Kafka Microbatch source to v2
## What changes were proposed in this pull request?
Migrating KafkaSource (with data source v1) to KafkaMicroBatchReader (with data source v2).

Performance comparison:
In a unit test with an in-process Kafka broker, I tested the read throughput of V1 and V2 using 20M records in a single partition. They were comparable.
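
The user-facing API is unchanged by this migration; for reference, a minimal micro-batch read from Kafka looks like the sketch below (the broker address and topic are placeholders, and the `spark-sql-kafka-0-10` package is assumed to be on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a Kafka topic as a streaming DataFrame and echo the records to the console.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9092")
      .option("subscribe", "events")
      .load())

query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("console")
         .start())
```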

## How was this patch tested?
Existing tests, few modified to be better tests than the existing ones.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20554 from tdas/SPARK-23362.
2018-02-16 14:30:19 -08:00
Yuming Wang 4df84c3f81 [SPARK-23336][BUILD] Upgrade snappy-java to 1.1.7.1
## What changes were proposed in this pull request?

This PR upgrades snappy-java from 1.1.2.6 to 1.1.7.1.
1.1.7.1 release notes:
- Improved performance on big-endian architectures
- The other performance improvements in [snappy-1.1.5](https://github.com/google/snappy/releases/tag/1.1.5)

1.1.4 release notes:
- Fix a 1% performance regression when snappy is used in PIE executables.
- Improve compression performance by 5%.
- Improve decompression performance by 20%.

More details:
https://github.com/xerial/snappy-java/blob/master/Milestone.md

## How was this patch tested?

manual tests

Author: Yuming Wang <wgyumg@gmail.com>

Closes #20510 from wangyum/SPARK-23336.
2018-02-08 12:52:08 -06:00
Kent Yao eefec93d19 [SPARK-23295][BUILD][MINOR] Exclude Warning message when generating versions in make-distribution.sh
## What changes were proposed in this pull request?

When we specify a wrong profile while making a Spark distribution, such as `-Phadoop1000`, we get an odd package name like `spark-[WARNING] The requested profile "hadoop1000" could not be activated because it does not exist.-bin-hadoop-2.7.tgz`, when it actually should be `"spark-$VERSION-bin-$NAME.tgz"`.

## How was this patch tested?
### before
```
build/mvn help:evaluate -Dexpression=scala.binary.version -Phadoop1000 2>/dev/null | grep -v "INFO" | tail -n 1
[WARNING] The requested profile "hadoop1000" could not be activated because it does not exist.
```
```
build/mvn help:evaluate -Dexpression=project.version -Phadoop1000 2>/dev/null | grep -v "INFO" | tail -n 1
[WARNING] The requested profile "hadoop1000" could not be activated because it does not exist.
```
### after
```
 build/mvn help:evaluate -Dexpression=project.version -Phadoop1000 2>/dev/null | grep  -v "INFO" | grep -v "WARNING" | tail -n 1
2.4.0-SNAPSHOT
```
```
build/mvn help:evaluate -Dexpression=scala.binary.version -Dscala.binary.version=2.11.1 2>/dev/null | grep  -v "INFO" | grep -v "WARNING" | tail -n 1
2.11.1
```

cloud-fan srowen

Author: Kent Yao <yaooqinn@hotmail.com>

Closes #20469 from yaooqinn/dist-minor.
2018-02-02 10:17:51 -06:00
Shashwat Anand 9623a98248 [MINOR] Fix typos in dev/* scripts.
## What changes were proposed in this pull request?

Consistency in style, grammar and removal of extraneous characters.

## How was this patch tested?

Manually as this is a doc change.

Author: Shashwat Anand <me@shashwat.me>

Closes #20436 from ashashwat/SPARK-23174.
2018-01-31 07:37:25 +09:00
Takuya UESHIN a23187f530 [SPARK-23174][BUILD][PYTHON][FOLLOWUP] Add pycodestyle*.py to .gitignore file.
## What changes were proposed in this pull request?

This is a follow-up PR of #20338, which changed the downloaded file name of the Python code style checker, but the new name is not in the .gitignore file, so the file remains untracked by git after running the checker.
This PR adds the file name to the .gitignore file.

## How was this patch tested?

Tested manually.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #20432 from ueshin/issues/SPARK-23174/fup1.
2018-01-31 00:51:00 +09:00
“attilapiros” 0ec95bb7df [SPARK-22577][CORE] executor page blacklist status should update with TaskSet level blacklisting
## What changes were proposed in this pull request?

In this PR, stage-level blacklisting is propagated to the UI by introducing a new Spark listener event (SparkListenerExecutorBlacklistedForStage) which indicates that an executor is blacklisted for a stage, either because the number of failures exceeded the limit given for an executor (spark.blacklist.stage.maxFailedTasksPerExecutor) or because the whole node is blacklisted for the stage (spark.blacklist.stage.maxFailedExecutorsPerNode). If the node is blacklisted, all of its executors will be listed as blacklisted for the stage.

The blacklisting state for a selected stage can be seen in the "Aggregated Metrics by Executor" table's blacklisting column, where after this change three possible labels can be found:
- "for application": when the executor is blacklisted for the application (see the configuration spark.blacklist.application.maxFailedTasksPerExecutor for details)
- "for stage": when the executor is **only** blacklisted for the stage
- "false" : when the executor is not blacklisted at all

## How was this patch tested?

It is tested both manually and with unit tests.

#### Unit tests

- HistoryServerSuite
- TaskSetBlacklistSuite
- AppStatusListenerSuite

#### Manual test for executor blacklisting

Running Spark as a local cluster:
```
$ bin/spark-shell --master "local-cluster[2,1,1024]" --conf "spark.blacklist.enabled=true" --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf "spark.blacklist.application.maxFailedTasksPerExecutor=10" --conf "spark.eventLog.enabled=true"
```

Executing:
``` scala
import org.apache.spark.SparkEnv

sc.parallelize(1 to 10, 10).map { x =>
  if (SparkEnv.get.executorId == "0") throw new RuntimeException("Bad executor")
  else (x % 3, x)
}.reduceByKey((a, b) => a + b).collect()
```

To see the result, check the "Aggregated Metrics by Executor" section at the bottom of the picture:

![UI screenshot for stage level blacklisting executor](https://issues.apache.org/jira/secure/attachment/12905283/stage_blacklisting.png)

#### Manual test for node blacklisting

Running Spark on a cluster:

``` bash
./bin/spark-shell --master yarn --deploy-mode client --executor-memory=2G --num-executors=8 --conf "spark.blacklist.enabled=true" --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf "spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf "spark.blacklist.application.maxFailedTasksPerExecutor=10" --conf "spark.eventLog.enabled=true"
```

And the job was:

``` scala
import org.apache.spark.SparkEnv

sc.parallelize(1 to 10000, 10).map { x =>
  if (SparkEnv.get.executorId.toInt >= 4) throw new RuntimeException("Bad executor")
    else (x % 3, x)
}.reduceByKey((a, b) => a + b).collect()
```

The result is:

![UI screenshot for stage level node blacklisting](https://issues.apache.org/jira/secure/attachment/12906833/node_blacklisting_for_stage.png)

Here you can see that apiros3.gce.test.com was node-blacklisted for the stage because of failures on executors 4 and 5. As expected, executor 3 is also blacklisted even though it has no failures itself, because it shares the node with 4 and 5.

Author: “attilapiros” <piros.attila.zsolt@gmail.com>
Author: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com>

Closes #20203 from attilapiros/SPARK-22577.
2018-01-24 11:34:59 -06:00
Rekha Joshi 7af1a325da [SPARK-23174][BUILD][PYTHON] python code style checker update
## What changes were proposed in this pull request?
- Reference the latest Python code style checker from PyPI (pycodestyle)
- Removed the pending TODO
- For now, tox.ini excludes the additional style errors that the latest checker reports on the existing Python code (will fall back on review comments to finalize the exclusions or fix the code)
- Any further code styling requirements need to be part of pycodestyle, not of Spark.

## How was this patch tested?
./dev/run-tests

Author: Rekha Joshi <rekhajoshm@gmail.com>
Author: rjoshi2 <rekhajoshm@gmail.com>

Closes #20338 from rekhajoshm/SPARK-11222.
2018-01-24 21:13:47 +09:00
hyukjinkwon 12faae295e [SPARK-23169][INFRA][R] Run lintr on the changes of lint-r script and .lintr configuration
## What changes were proposed in this pull request?

When running the `run-tests` script, it seems we don't run lintr on changes to the `lint-r` script and the `.lintr` configuration.

## How was this patch tested?

Jenkins builds

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #20339 from HyukjinKwon/check-r-changed.
2018-01-22 09:45:27 +09:00
hyukjinkwon 39d244d921 [SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs in SQLContext and Catalog in PySpark
## What changes were proposed in this pull request?

This PR proposes to deprecate `register*` for UDFs in `SQLContext` and `Catalog` in Spark 2.3.0.

These are inconsistent with Scala / Java APIs and also these basically do the same things with `spark.udf.register*`.

Also, this PR moves the logic from `[sqlContext|spark.catalog].register*` to `spark.udf.register*` and reuses the docstrings.

This PR also handles minor doc corrections. It also includes https://github.com/apache/spark/pull/20158
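
For reference, the preferred registration path that the deprecated aliases delegate to (a minimal PySpark sketch; the UDF name and function are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Preferred: register through spark.udf rather than sqlContext.registerFunction
# or spark.catalog.registerFunction.
spark.udf.register("add_one", lambda x: x + 1, IntegerType())
spark.sql("SELECT add_one(41) AS v").show()
```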

## How was this patch tested?

Manually tested, manually checked the API documentation and tests added to check if deprecated APIs call the aliases correctly.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #20288 from HyukjinKwon/deprecate-udf.
2018-01-18 14:51:05 +09:00
Imran Rashid 5ae333391b [SPARK-23044] Error handling for jira assignment
## What changes were proposed in this pull request?

* If there is any error while trying to assign the JIRA, prompt again
* Filter out the "Apache Spark" choice
* Allow arbitrary user ids to be entered (a sketch of this logic follows)
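
A rough sketch of the selection logic these bullets describe (the attribute and helper names are hypothetical, loosely modeled on the jira client's user objects):

```python
def pick_assignee(candidates, raw):
    """candidates: JIRA user objects; raw: what the committer typed at the prompt."""
    candidates = [c for c in candidates if c.displayName != "Apache Spark"]
    raw = raw.strip()
    if raw == "":
        return None                        # leave unassigned
    if raw.isdigit() and int(raw) < len(candidates):
        return candidates[int(raw)].name   # numbered choice from the list
    return raw                             # otherwise treat it as an arbitrary user id
```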

## How was this patch tested?

Couldn't really test the error case, just some testing of similar-ish code in python shell.  Haven't run a merge yet.

Author: Imran Rashid <irashid@cloudera.com>

Closes #20236 from squito/SPARK-23044.
2018-01-16 16:25:10 -08:00
foxish c3548d11c3 [SPARK-23063][K8S] K8s changes for publishing scripts (and a couple of other misses)
## What changes were proposed in this pull request?

Including the `-Pkubernetes` flag in a few places where it was missed.

## How was this patch tested?

checkstyle, mima through manual tests.

Author: foxish <ramanathana@google.com>

Closes #20256 from foxish/SPARK-23063.
2018-01-13 21:34:28 -08:00
shimamoto 628a1ca5a4 [SPARK-23043][BUILD] Upgrade json4s to 3.5.3
## What changes were proposed in this pull request?

Spark still uses the several-years-old version 3.2.11. This change upgrades json4s to 3.5.3.

Note that this change does not include the Jackson update because the Jackson version referenced in json4s 3.5.3 is 2.8.4, which has a security vulnerability ([see](https://issues.apache.org/jira/browse/SPARK-20433)).

## How was this patch tested?

Existing unit tests and build.

Author: shimamoto <chibochibo@gmail.com>

Closes #20233 from shimamoto/upgrade-json4s.
2018-01-13 09:40:00 -06:00
gatorsmile 651f76153f [SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT
## What changes were proposed in this pull request?
This patch bumps the master branch version to `2.4.0-SNAPSHOT`.

## How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #20222 from gatorsmile/bump24.
2018-01-13 00:37:59 +08:00
Marcelo Vanzin 95f9659abe [SPARK-22948][K8S] Move SparkPodInitContainer to correct package.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #20156 from vanzin/SPARK-22948.
2018-01-04 15:00:09 -08:00
hyukjinkwon e734a4b9c2 [SPARK-21893][SPARK-22142][TESTS][FOLLOWUP] Enables PySpark tests for Flume and Kafka in Jenkins
## What changes were proposed in this pull request?

This PR proposes to enable PySpark tests for Flume and Kafka in Jenkins by explicitly setting the environment variables in `modules.py`.

It seems we are not taking the dependencies into account when calculating the environment variables:

3a07eff5af/dev/run-tests.py (L554-L561)

## How was this patch tested?

Manual tests with Jenkins in https://github.com/apache/spark/pull/20126.

**Before** - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85559/consoleFull

```
[info] Setup the following environment variables for tests:
...
```

**After** - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85560/consoleFull

```
[info] Setup the following environment variables for tests:
ENABLE_KAFKA_0_8_TESTS=1
ENABLE_FLUME_TESTS=1
...
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #20128 from HyukjinKwon/SPARK-21893.
2018-01-02 07:20:05 +09:00
Sean Owen c284c4e1f6 [MINOR] Fix a bunch of typos 2018-01-02 07:10:19 +09:00
Fokko Driesprong fd7d141d8b [SPARK-22919] Bump httpclient versions
Hi all,

I would like to bump the PATCH versions of both Apache httpclient and Apache httpcore. I use the SparkTC Stocator library for connecting to an object store, and I would like to align the versions to reduce Java version mismatches. Furthermore, it is good to bump these versions since they fix stability and performance issues:
https://archive.apache.org/dist/httpcomponents/httpclient/RELEASE_NOTES-4.5.x.txt
https://www.apache.org/dist/httpcomponents/httpcore/RELEASE_NOTES-4.4.x.txt

Cheers, Fokko

## What changes were proposed in this pull request?

Update the versions of httpclient and httpcore. Only the PATCH versions are updated, so there are no breaking changes.

## How was this patch tested?


Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Fokko Driesprong <fokkodriesprong@godatadriven.com>

Closes #20103 from Fokko/SPARK-22919-bump-httpclient-versions.
2017-12-30 10:37:41 -06:00
Imran Rashid ccda75b0d1 [SPARK-22921][PROJECT-INFRA] Bug fix in jira assigning
Small bug fix from last pr, ran a successful merge with this code.

Author: Imran Rashid <irashid@cloudera.com>

Closes #20117 from squito/SPARK-22921.
2017-12-29 17:07:01 -06:00
Imran Rashid dbd492b7e2 [SPARK-22921][PROJECT-INFRA] Choices for Assigning Jira on Merge
In general, JIRAs are assigned to the original reporter or one of
the commenters. This updates the merge script to give you a simple
choice to do that, so you don't have to do it manually.

Author: Imran Rashid <irashid@cloudera.com>

Closes #20107 from squito/SPARK-22921.
2017-12-29 07:30:49 -06:00
Bryan Cutler 59d52631eb [SPARK-22324][SQL][PYTHON] Upgrade Arrow to 0.8.0
## What changes were proposed in this pull request?

Upgrade Spark to Arrow 0.8.0 for Java and Python.  Also includes an upgrade of Netty to 4.1.17 to resolve dependency requirements.

The highlights that pertain to Spark for the update from Arrow version 0.4.1 to 0.8.0 include:

* Java refactoring for more simple API
* Java reduced heap usage and streamlined hot code paths
* Type support for DecimalType, ArrayType
* Improved type casting support in Python
* Simplified type checking in Python
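
One place the upgrade shows up for PySpark users is Arrow-backed conversion to pandas; a minimal sketch (assuming pandas and pyarrow are installed, and using the Arrow flag name from Spark 2.3/2.4):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Use Arrow for the DataFrame -> pandas conversion path touched by this upgrade.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = spark.range(0, 1000).toPandas()
print(pdf.head())
```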

## How was this patch tested?

Existing tests

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>

Closes #19884 from BryanCutler/arrow-upgrade-080-SPARK-22324.
2017-12-21 20:43:56 +09:00