Commit graph

22523 commits

Author SHA1 Message Date
Sean Owen 753f115162 [SPARK-21261][DOCS][SQL] SQL Regex document fix
## What changes were proposed in this pull request?

Fix regexes in spark-sql command examples.
This takes over https://github.com/apache/spark/pull/18477

## How was this patch tested?

Existing tests. I verified the existing example doesn't work in spark-sql, but new ones does.

Author: Sean Owen <srowen@gmail.com>

Closes #21808 from srowen/SPARK-21261.
2018-07-18 18:39:23 -05:00
Devaraj K d404e54e64 [SPARK-24129][K8S] Add option to pass --build-arg's to docker-image-tool.sh
## What changes were proposed in this pull request?

Adding `-b arg` option to take `--build-arg` parameters to pass into the docker command

## How was this patch tested?

I verified by passing proxy details which fails without this change and succeeds with the changes.

Author: Devaraj K <devaraj@apache.org>

Closes #21202 from devaraj-kavali/SPARK-24129.
2018-07-18 16:18:29 -05:00
maryannxue cd203e0dfc [SPARK-24163][SPARK-24164][SQL] Support column list as the pivot column in Pivot
## What changes were proposed in this pull request?

1. Extend the Parser to enable parsing a column list as the pivot column.
2. Extend the Parser and the Pivot node to enable parsing complex expressions with aliases as the pivot value.
3. Add type check and constant check in Analyzer for Pivot node.

## How was this patch tested?

Add tests in pivot.sql

Author: maryannxue <maryannxue@apache.org>

Closes #21720 from maryannxue/spark-24164.
2018-07-18 13:33:26 -07:00
pgandhi 1272b2034d [SPARK-22151] PYTHONPATH not picked up from the spark.yarn.appMaste…
…rEnv properly

Running in yarn cluster mode and trying to set pythonpath via spark.yarn.appMasterEnv.PYTHONPATH doesn't work.

the yarn Client code looks at the env variables:
val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath)
But when you set spark.yarn.appMasterEnv it puts it into the local env.

So the python path set in spark.yarn.appMasterEnv isn't properly set.

You can work around if you are running in cluster mode by setting it on the client like:

PYTHONPATH=./addon/python/ spark-submit

## What changes were proposed in this pull request?
In Client.scala, PYTHONPATH was being overridden, so changed code to append values to PYTHONPATH instead of overriding them.

## How was this patch tested?
Added log statements to ApplicationMaster.scala to check for environment variable PYTHONPATH, ran a spark job in cluster mode before the change and verified the issue. Performed the same test after the change and verified the fix.

Author: pgandhi <pgandhi@oath.com>

Closes #21468 from pgandhi999/SPARK-22151.
2018-07-18 14:07:03 -05:00
sychen c8bee932cb [SPARK-24677][CORE] Avoid NoSuchElementException from MedianHeap
## What changes were proposed in this pull request?
When speculation is enabled,
TaskSetManager#markPartitionCompleted should write successful task duration to MedianHeap,
not just increase tasksSuccessful.

Otherwise when TaskSetManager#checkSpeculatableTasks,tasksSuccessful non-zero, but MedianHeap is empty.
Then throw an exception successfulTaskDurations.median java.util.NoSuchElementException: MedianHeap is empty.
Finally led to stopping SparkContext.
## How was this patch tested?
TaskSetManagerSuite.scala
unit test:[SPARK-24677] MedianHeap should not be empty when speculation is enabled

Author: sychen <sychen@ctrip.com>

Closes #21656 from cxzl25/fix_MedianHeap_empty.
2018-07-18 13:24:41 -05:00
mcheah fc0c8c9717 [SPARK-24825][K8S][TEST] Kubernetes integration tests build the whole reactor
## What changes were proposed in this pull request?

Make the integration test script build all modules.

In order to not run all the non-Kubernetes integration tests in the build, support specifying tags and tag all integration tests specifically with "k8s". Supply the k8s tag in the dev/dev-run-integration-tests.sh script.

## How was this patch tested?

The build system will test this.

Author: mcheah <mcheah@palantir.com>

Closes #21800 from mccheah/k8s-integration-tests-maven-fix.
2018-07-18 10:01:39 -07:00
Huangweizhe ebe9e28488 [SPARK-24628][DOC] Typos of the example code in docs/mllib-data-types.md
## What changes were proposed in this pull request?

The example wants to create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)), but the list is given as [1, 2, 3, 4, 5, 6]. Now it is changed as [1, 3, 5, 2, 4, 6].

And the example wants to create an RDD of coordinate entries like:
entries = sc.parallelize([(0, 0, 1.2), (1, 0, 2.1), (2, 1, 3.7)]).
However, it is done with the MatrixEntry class like:
entries = sc.parallelize([MatrixEntry(0, 0, 1.2), MatrixEntry(1, 0, 2.1), MatrixEntry(6, 1, 3.7)]),
where the third MatrixEntry has a different row index.
Now it is changed as MatrixEntry(2, 1, 3.7).

## How was this patch tested?

This is trivial enough that it should not affect tests.

Author: Weizhe Huang <huangweizhebbdservice.com>

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Huangweizhe <huangweizhe@bbdservice.com>

Closes #21612 from huangweizhe123/my_change.
2018-07-18 09:45:56 -05:00
韩田田00222924 002300dd41 [SPARK-24804] There are duplicate words in the test title in the DatasetSuite
## What changes were proposed in this pull request?
In DatasetSuite.scala, in the 1299 line,
test("SPARK-19896: cannot have circular references in in case class") ,
there are  duplicate words "in in".  We can get rid of one.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: 韩田田00222924 <han.tiantian@zte.com.cn>

Closes #21767 from httfighter/inin.
2018-07-18 09:40:36 -05:00
Nihar Sheth 2694dd2bf0 [MINOR][CORE] Add test cases for RDD.cartesian
## What changes were proposed in this pull request?

While looking through the codebase, it appeared that the scala code for RDD.cartesian does not have any tests for correctness. This adds a couple basic tests to verify cartesian yields correct values. While the implementation for RDD.cartesian is pretty simple, it always helps to have a few tests!

## How was this patch tested?

The new test cases pass, and the scala style tests from running dev/run-tests all pass.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Nihar Sheth <niharrsheth@gmail.com>

Closes #21765 from NiharS/cartesianTests.
2018-07-18 09:14:36 -05:00
Takuya UESHIN 34cb3b54e9 [SPARK-24386][SPARK-24768][BUILD][FOLLOWUP] Fix lint-java and Scala 2.12 build.
## What changes were proposed in this pull request?

This pr fixes lint-java and Scala 2.12 build.

lint-java:

```
[ERROR] src/test/resources/log4j.properties:[0] (misc) NewlineAtEndOfFile: File does not end with a newline.
```

Scala 2.12 build:

```
[error] /.../sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousCoalesceRDD.scala:121: overloaded method value addTaskCompletionListener with alternatives:
[error]   (f: org.apache.spark.TaskContext => Unit)org.apache.spark.TaskContext <and>
[error]   (listener: org.apache.spark.util.TaskCompletionListener)org.apache.spark.TaskContext
[error]  cannot be applied to (org.apache.spark.TaskContext => java.util.List[Runnable])
[error]       context.addTaskCompletionListener { ctx =>
[error]               ^
```

## How was this patch tested?

Manually executed lint-java and Scala 2.12 build in my local environment.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #21801 from ueshin/issues/SPARK-24386_24768/fix_build.
2018-07-18 19:17:18 +08:00
Dongjoon Hyun 3b59d326c7 [SPARK-24576][BUILD] Upgrade Apache ORC to 1.5.2
## What changes were proposed in this pull request?

This issue aims to upgrade Apache ORC library from 1.4.4 to 1.5.2 in order to bring the following benefits into Apache Spark.

- [ORC-91](https://issues.apache.org/jira/browse/ORC-91) Support for variable length blocks in HDFS (The current space wasted in ORC to padding is known to be 5%.)
- [ORC-344](https://issues.apache.org/jira/browse/ORC-344) Support for using Decimal64ColumnVector

In addition to that, Apache Hive 3.1 and 3.2 will use ORC 1.5.1 ([HIVE-19669](https://issues.apache.org/jira/browse/HIVE-19465)) and 1.5.2 ([HIVE-19792](https://issues.apache.org/jira/browse/HIVE-19792)) respectively. This will improve the compatibility between Apache Spark and Apache Hive by sharing the common library.

## How was this patch tested?

Pass the Jenkins with all existing tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #21582 from dongjoon-hyun/SPARK-24576.
2018-07-17 23:52:17 -07:00
Yuming Wang fc2e18963e [SPARK-24529][BUILD][TEST-MAVEN][FOLLOW-UP] Set spotbugs-maven-plugin's fork to true
## What changes were proposed in this pull request?

Set `spotbugs-maven-plugin`'s fork to `true`, otherwise will throw exception when make distribution:
```
./dev/make-distribution.sh --name SPARK-24529  --tgz  -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn -Phadoop-provided
```
exception:
```java
...
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [  8.753 s]
[INFO] Spark Project Tags ................................. SUCCESS [  9.334 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 12.029 s]
[INFO] Spark Project Local DB ............................. SUCCESS [ 13.641 s]
[INFO] Spark Project Networking ........................... FAILURE [10:10 min]
[INFO] Spark Project Shuffle Streaming Service ............ SKIPPED
[INFO] Spark Project Unsafe ............................... SUCCESS [ 16.415 s]
[INFO] Spark Project Launcher ............................. SKIPPED
[INFO] Spark Project Core ................................. SKIPPED
[INFO] Spark Project ML Local Library ..................... SKIPPED
[INFO] Spark Project GraphX ............................... SKIPPED
[INFO] Spark Project Streaming ............................ SKIPPED
[INFO] Spark Project Catalyst ............................. SKIPPED
[INFO] Spark Project SQL .................................. SKIPPED
[INFO] Spark Project ML Library ........................... SKIPPED
[INFO] Spark Project Tools ................................ SUCCESS [  8.750 s]
[INFO] Spark Project Hive ................................. SKIPPED
[INFO] Spark Project REPL ................................. SKIPPED
[INFO] Spark Project YARN Shuffle Service ................. SKIPPED
[INFO] Spark Project YARN ................................. SKIPPED
[INFO] Spark Project Hive Thrift Server ................... SKIPPED
[INFO] Spark Project Assembly ............................. SKIPPED
[INFO] Spark Integration for Kafka 0.10 ................... SKIPPED
[INFO] Kafka 0.10 Source for Structured Streaming ......... SKIPPED
[INFO] Spark Project Examples ............................. SKIPPED
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SKIPPED
[INFO] Spark Avro ......................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10:29 min (Wall Clock)
[INFO] Finished at: 2018-07-16T21:39:46+08:00
[INFO] Final Memory: 61M/885M
[INFO] ------------------------------------------------------------------------
Timeout: sub-process interrupted
[ERROR] Failed to execute goal com.github.spotbugs:spotbugs-maven-plugin:3.1.3:spotbugs (spotbugs) on project spark-network-common_2.11: Execution spotbugs of goal com.github.spotbugs:spotbugs-maven-plugin:3.1.3:spotbugs failed: Timeout: killed the sub-process -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :spark-network-common_2.11
org.apache.tools.ant.ExitException: Permission ("java.lang.RuntimePermission" "exitVM") was not granted.
        at org.apache.tools.ant.types.Permissions$MySM.checkExit(Permissions.java:194)
        at java.lang.Runtime.exit(Runtime.java:107)
        at java.lang.System.exit(System.java:971)
        at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:358)
Exception in thread "main" org.apache.tools.ant.ExitException: Permission ("java.lang.RuntimePermission" "exitVM") was not granted.
        at org.apache.tools.ant.types.Permissions$MySM.checkExit(Permissions.java:194)
        at java.lang.Runtime.exit(Runtime.java:107)
        at java.lang.System.exit(System.java:971)
        at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:364)
Timeout: sub-process interrupted
```

## How was this patch tested?

manual tests

Author: Yuming Wang <yumwang@ebay.com>

Closes #21785 from wangyum/SPARK-24529.
2018-07-18 10:00:13 +08:00
DB Tsai 681845fd62
[SPARK-24402][SQL] Optimize In expression when only one element in the collection or collection is empty
## What changes were proposed in this pull request?

Two new rules in the logical plan optimizers are added.

1. When there is only one element in the **`Collection`**, the
physical plan will be optimized to **`EqualTo`**, so predicate
pushdown can be used.

```scala
    profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true)
    """
      |== Physical Plan ==
      |*(1) Project [profileID#0]
      |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))
      |   +- *(1) FileScan parquet [profileID#0] Batched: true, Format: Parquet,
      |     PartitionFilters: [],
      |     PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],
      |     ReadSchema: struct<profileID:int>
    """.stripMargin
```

2. When the **`Collection`** is empty, and the input is nullable, the
logical plan will be simplified to

```scala
    profileDF.filter( $"profileID".isInCollection(Set())).explain(true)
    """
      |== Optimized Logical Plan ==
      |Filter if (isnull(profileID#0)) null else false
      |+- Relation[profileID#0] parquet
    """.stripMargin
```

TODO:

1. For multiple conditions with numbers less than certain thresholds,
we should still allow predicate pushdown.
2. Optimize the **`In`** using **`tableswitch`** or **`lookupswitch`**
when the numbers of the categories are low, and they are **`Int`**,
**`Long`**.
3. The default immutable hash trees set is slow for query, and we
should do benchmark for using different set implementation for faster
query.
4. **`filter(if (condition) null else false)`** can be optimized to false.

## How was this patch tested?

Couple new tests are added.

Author: DB Tsai <d_tsai@apple.com>

Closes #21797 from dbtsai/optimize-in.
2018-07-17 17:33:52 -07:00
Takeshi Yamamuro 2a4dd6f06c [SPARK-24681][SQL] Verify nested column names in Hive metastore
## What changes were proposed in this pull request?
This pr added code to check if nested column names do not include ',', ':', and ';' because Hive metastore can't handle these characters in nested column names;
ref: https://github.com/apache/hive/blob/release-1.2.1/serde/src/java/org/apache/hadoop/hive/serde2/typeinfo/TypeInfoUtils.java#L239

## How was this patch tested?
Added tests in `HiveDDLSuite`.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #21711 from maropu/SPARK-24681.
2018-07-17 14:15:30 -07:00
Bago Amirbekian 912634b004 [SPARK-24747][ML] Make Instrumentation class more flexible
## What changes were proposed in this pull request?

This PR updates the Instrumentation class to make it more flexible and a little bit easier to use. When these APIs are merged, I'll followup with a PR to update the training code to use these new APIs so we can remove the old APIs. These changes are all to private APIs so this PR doesn't make any user facing changes.

## How was this patch tested?

Existing tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Bago Amirbekian <bago@databricks.com>

Closes #21719 from MrBago/new-instrumentation-apis.
2018-07-17 13:11:52 -07:00
HanShuliang 7688ce88b2 [SPARK-21590][SS] Window start time should support negative values
## What changes were proposed in this pull request?

Remove the non-negative checks of window start time to make window support negative start time, and add a check to guarantee the absolute value of start time is less than slide duration.

## How was this patch tested?

New unit tests.

Author: HanShuliang <kevinzwx1992@gmail.com>

Closes #18903 from KevinZwx/dev.
2018-07-17 11:25:23 -05:00
Sean Owen 5215344dea [SPARK-24813][BUILD][FOLLOW-UP][HOTFIX] HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
## What changes were proposed in this pull request?

Test HiveExternalCatalogVersionsSuite vs only current Spark releases

## How was this patch tested?

`HiveExternalCatalogVersionsSuite`

Author: Sean Owen <srowen@gmail.com>

Closes #21793 from srowen/SPARK-24813.3.
2018-07-17 11:23:34 -05:00
Marek Novotny 4cf1bec4dc [SPARK-24305][SQL][FOLLOWUP] Avoid serialization of private fields in collection expressions.
## What changes were proposed in this pull request?

The PR tries to avoid serialization of private fields of already added collection functions and follows up on comments in [SPARK-23922](https://github.com/apache/spark/pull/21028) and [SPARK-23935](https://github.com/apache/spark/pull/21236)

## How was this patch tested?

Run tests from:
- CollectionExpressionSuite.scala
- DataFrameFunctionsSuite.scala

Author: Marek Novotny <mn.mikke@gmail.com>

Closes #21352 from mn-mikke/SPARK-24305.
2018-07-17 23:07:18 +08:00
hyukjinkwon 0ca16f6e14 Revert "[SPARK-24402][SQL] Optimize In expression when only one element in the collection or collection is empty"
This reverts commit 0f0d1865f5.
2018-07-17 11:30:53 +08:00
Miklos C f876d3fa80 [SPARK-20220][DOCS] Documentation Add thrift scheduling pool config to scheduling docs
## What changes were proposed in this pull request?

The thrift scheduling pool configuration was removed from a previous release. Adding this back to the job scheduling configuration docs.

This PR takes over #17536 and handle some comments here.

## How was this patch tested?

Manually.

Closes #17536

Author: hyukjinkwon <gurwls223@apache.org>

Closes #21778 from HyukjinKwon/SPARK-20220.
2018-07-17 09:22:16 +08:00
Feng Liu d57a267b79 [SPARK-23259][SQL] Clean up legacy code around hive external catalog and HiveClientImpl
## What changes were proposed in this pull request?

Three legacy statements are removed by this patch:

- in HiveExternalCatalog: The withClient wrapper is not necessary for the private method getRawTable.

- in HiveClientImpl: There are some redundant code in both the tableExists and getTableOption method.

This PR takes over https://github.com/apache/spark/pull/20425

## How was this patch tested?

Existing tests

Closes #20425

Author: hyukjinkwon <gurwls223@apache.org>

Closes #21780 from HyukjinKwon/SPARK-23259.
2018-07-17 09:13:35 +08:00
DB Tsai 0f0d1865f5 [SPARK-24402][SQL] Optimize In expression when only one element in the collection or collection is empty
## What changes were proposed in this pull request?

Two new rules in the logical plan optimizers are added.

1. When there is only one element in the **`Collection`**, the
physical plan will be optimized to **`EqualTo`**, so predicate
pushdown can be used.

```scala
    profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true)
    """
      |== Physical Plan ==
      |*(1) Project [profileID#0]
      |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))
      |   +- *(1) FileScan parquet [profileID#0] Batched: true, Format: Parquet,
      |     PartitionFilters: [],
      |     PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],
      |     ReadSchema: struct<profileID:int>
    """.stripMargin
```

2. When the **`Collection`** is empty, and the input is nullable, the
logical plan will be simplified to

```scala
    profileDF.filter( $"profileID".isInCollection(Set())).explain(true)
    """
      |== Optimized Logical Plan ==
      |Filter if (isnull(profileID#0)) null else false
      |+- Relation[profileID#0] parquet
    """.stripMargin
```

TODO:

1. For multiple conditions with numbers less than certain thresholds,
we should still allow predicate pushdown.
2. Optimize the **`In`** using **`tableswitch`** or **`lookupswitch`**
when the numbers of the categories are low, and they are **`Int`**,
**`Long`**.
3. The default immutable hash trees set is slow for query, and we
should do benchmark for using different set implementation for faster
query.
4. **`filter(if (condition) null else false)`** can be optimized to false.

## How was this patch tested?

Couple new tests are added.

Author: DB Tsai <d_tsai@apple.com>

Closes #21442 from dbtsai/optimize-in.
2018-07-16 15:33:39 -07:00
Maxim Gekk ba437fc5c7 [SPARK-24805][SQL] Do not ignore avro files without extensions by default
## What changes were proposed in this pull request?

In the PR, I propose to change default behaviour of AVRO datasource which currently ignores files without `.avro` extension in read by default. This PR sets the default value for `avro.mapred.ignore.inputs.without.extension` to `false` in the case if the parameter is not set by an user.

## How was this patch tested?

Added a test file without extension in AVRO format, and new test for reading the file with and wihout specified schema.

Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>

Closes #21769 from MaxGekk/avro-without-extension.
2018-07-16 14:35:44 -07:00
Marek Novotny b0c95a1d69 [SPARK-23901][SQL] Removing masking functions
The PR reverts #21246.

Author: Marek Novotny <mn.mikke@gmail.com>

Closes #21786 from mn-mikke/SPARK-23901.
2018-07-16 14:28:35 -07:00
Takuya UESHIN b045315e5d [SPARK-24734][SQL] Fix type coercions and nullabilities of nested data types of some functions.
## What changes were proposed in this pull request?

We have some functions which need to aware the nullabilities of all children, such as `CreateArray`, `CreateMap`, `Concat`, and so on. Currently we add casts to fix the nullabilities, but the casts might be removed during the optimization phase.
After the discussion, we decided to not add extra casts for just fixing the nullabilities of the nested types, but handle them by functions themselves.

## How was this patch tested?

Modified and added some tests.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #21704 from ueshin/issues/SPARK-24734/concat_containsnull.
2018-07-16 23:16:25 +08:00
Shahid cf97045349 [SPARK-18230][MLLIB] Throw a better exception, if the user or product doesn't exist
When invoking MatrixFactorizationModel.recommendProducts(Int, Int) with a non-existing user, a java.util.NoSuchElementException is thrown:

> java.util.NoSuchElementException: next on empty iterator
	at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
	at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
	at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
	at scala.collection.IterableLike$class.head(IterableLike.scala:107)
	at scala.collection.mutable.WrappedArray.scala$collection$IndexedSeqOptimized$$super$head(WrappedArray.scala:35)
	at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
	at scala.collection.mutable.WrappedArray.head(WrappedArray.scala:35)
	at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:169)

## What changes were proposed in this pull request?
Throw a better exception, like "user-id/product-id doesn't found in the model", for a non-existent user/product

## How was this patch tested?
Added UT

Author: Shahid <shahidki31@gmail.com>

Closes #21740 from shahidki31/checkInvalidUserProduct.
2018-07-16 09:50:43 -05:00
Yuming Wang 9549a28149 [SPARK-24549][SQL] Support Decimal type push down to the parquet data sources
## What changes were proposed in this pull request?

Support Decimal type push down to the parquet data sources.
The Decimal comparator used is: [`BINARY_AS_SIGNED_INTEGER_COMPARATOR`](c6764c4a08/parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveComparator.java (L224-L292)).

## How was this patch tested?

unit tests and manual tests.

**manual tests**:
```scala
spark.range(10000000).selectExpr("id", "cast(id as decimal(9)) as d1", "cast(id as decimal(9, 2)) as d2", "cast(id as decimal(18)) as d3", "cast(id as decimal(18, 4)) as d4", "cast(id as decimal(38)) as d5", "cast(id as decimal(38, 18)) as d6").coalesce(1).write.option("parquet.block.size", 1048576).parquet("/tmp/spark/parquet/decimal")
val df = spark.read.parquet("/tmp/spark/parquet/decimal/")
spark.sql("set spark.sql.parquet.filterPushdown.decimal=true")
// Only read about 1 MB data
df.filter("d2 = 10000").show
// Only read about 1 MB data
df.filter("d4 = 10000").show
spark.sql("set spark.sql.parquet.filterPushdown.decimal=false")
// Read 174.3 MB data
df.filter("d2 = 10000").show
// Read 174.3 MB data
df.filter("d4 = 10000").show
```

Author: Yuming Wang <yumwang@ebay.com>

Closes #21556 from wangyum/SPARK-24549.
2018-07-16 15:44:51 +08:00
sandeep-katta 2603ae30be [SPARK-24558][CORE] wrong Idle Timeout value is used in case of the cacheBlock.
It is corrected as per the configuration.

## What changes were proposed in this pull request?
IdleTimeout info used to print in the logs is taken based on the cacheBlock. If it is cacheBlock then cachedExecutorIdleTimeoutS is considered else executorIdleTimeoutS

## How was this patch tested?
Manual Test
spark-sql> cache table sample;
2018-05-15 14:44:02 INFO  DAGScheduler:54 - Submitting 3 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[8] at processCmd at CliDriver.java:376) (first 15 tasks are for partitions Vector(0, 1, 2))
2018-05-15 14:44:02 INFO  YarnScheduler:54 - Adding task set 0.0 with 3 tasks
2018-05-15 14:44:03 INFO  ExecutorAllocationManager:54 - Requesting 1 new executor because tasks are backlogged (new desired total will be 1)
...
...
2018-05-15 14:46:10 INFO  YarnClientSchedulerBackend:54 - Actual list of executor(s) to be killed is 1
2018-05-15 14:46:10 INFO  **ExecutorAllocationManager:54 - Removing executor 1 because it has been idle for 120 seconds (new desired total will be 0)**
2018-05-15 14:46:11 INFO  YarnSchedulerBackend$YarnDriverEndpoint:54 - Disabling executor 1.
2018-05-15 14:46:11 INFO  DAGScheduler:54 - Executor lost: 1 (epoch 1)

Author: sandeep-katta <sandeep.katta2007@gmail.com>

Closes #21565 from sandeep-katta/loginfoBug.
2018-07-16 14:52:49 +08:00
Maxim Gekk 9f929458fb [SPARK-24810][SQL] Fix paths to test files in AvroSuite
## What changes were proposed in this pull request?

In the PR, I propose to move `testFile()` to the common trait `SQLTestUtilsBase` and wrap test files in `AvroSuite` by the method `testFile()` which returns full paths to test files in the resource folder.

Author: Maxim Gekk <maxim.gekk@databricks.com>

Closes #21773 from MaxGekk/test-file.
2018-07-15 23:01:36 -07:00
Takeshi Yamamuro d463533ded [SPARK-24676][SQL] Project required data from CSV parsed data when column pruning disabled
## What changes were proposed in this pull request?
This pr modified code to project required data from CSV parsed data when column pruning disabled.
In the current master, an exception below happens if `spark.sql.csv.parser.columnPruning.enabled` is false. This is because required formats and CSV parsed formats are different from each other;
```
./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false
scala> val dir = "/tmp/spark-csv/csv"
scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
        at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
        ...
```

## How was this patch tested?
Added tests in `CSVSuite`.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #21657 from maropu/SPARK-24676.
2018-07-15 20:22:09 -07:00
郑瑞峰 bcf7121ed2 [TRIVIAL][ML] GMM unpersist RDD after training
## What changes were proposed in this pull request?
unpersist `instances` after training

## How was this patch tested?
existing tests

Author: 郑瑞峰 <zhengruifeng@ZBMAC-C02VX5XWH.local>

Closes #21562 from zhengruifeng/gmm_unpersist.
2018-07-15 20:14:17 -07:00
Sean Owen bbc2ffc8ab [SPARK-24813][TESTS][HIVE][HOTFIX] HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
## What changes were proposed in this pull request?

Try only unique ASF mirrors to download Spark release; fall back to Apache archive if no mirrors available or release is not mirrored

## How was this patch tested?

Existing HiveExternalCatalogVersionsSuite

Author: Sean Owen <srowen@gmail.com>

Closes #21776 from srowen/SPARK-24813.
2018-07-16 09:29:51 +08:00
Zoltan C. Toth 5d62a985dc Doc fix: The Imputer is an Estimator
Fixing the doc as the imputer is not a `Transformer` but an `Estimator`.

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala#L96-L97

## What changes were proposed in this pull request?

Simple documentation fix

## How was this patch tested?

manual testing

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Zoltan C. Toth <zoltanctoth@gmail.com>

Closes #21755 from zoltanctoth/doc-imputer-is-estimator.
2018-07-15 17:08:26 -05:00
Gengliang Wang 9603087638 [SPARK-24800][SQL] Refactor Avro Serializer and Deserializer
## What changes were proposed in this pull request?
Currently the Avro Deserializer converts input Avro format data to `Row`, and then convert the `Row` to `InternalRow`.
While the Avro Serializer converts `InternalRow` to `Row`, and then output Avro format data.
This PR allows direct conversion between `InternalRow` and Avro format data.

## How was this patch tested?

Unit test

Author: Gengliang Wang <gengliang.wang@databricks.com>

Closes #21762 from gengliangwang/avro_io.
2018-07-15 22:06:33 +08:00
Maxim Gekk 69993217fc [SPARK-24807][CORE] Adding files/jars twice: output a warning and add a note
## What changes were proposed in this pull request?

In the PR, I propose to output an warning if the `addFile()` or `addJar()` methods are callled more than once for the same path. Currently, overwriting of already added files is not supported. New comments and warning are reflected the existing behaviour.

Author: Maxim Gekk <maxim.gekk@databricks.com>

Closes #21771 from MaxGekk/warning-on-adding-file.
2018-07-14 22:07:49 -07:00
Gengliang Wang 3e7dc82960 [SPARK-24776][SQL] Avro unit test: deduplicate code and replace deprecated methods
## What changes were proposed in this pull request?

Improve Avro unit test:
1. use QueryTest/SharedSQLContext/SQLTestUtils, instead of the duplicated test utils.
2. replace deprecated methods

This is a follow up PR for #21760, the PR passes pull request tests but failed in: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.6/7842/

This PR is to fix it.
## How was this patch tested?
Unit test.
Compile with different commands:

```
./build/mvn --force -DzincPort=3643 -DskipTests -Phadoop-2.6 -Phive-thriftserver -Pkinesis-asl -Pspark-ganglia-lgpl -Pmesos -Pyarn  compile test-compile
./build/mvn --force -DzincPort=3643 -DskipTests -Phadoop-2.7 -Phive-thriftserver -Pkinesis-asl -Pspark-ganglia-lgpl -Pmesos -Pyarn  compile test-compile
./build/mvn --force -DzincPort=3643 -DskipTests -Phadoop-3.1 -Phive-thriftserver -Pkinesis-asl -Pspark-ganglia-lgpl -Pmesos -Pyarn  compile test-compile

```

Author: Gengliang Wang <gengliang.wang@databricks.com>

Closes #21768 from gengliangwang/improve_avro_test.
2018-07-14 21:36:56 -07:00
Yuming Wang 43e4e851b6 [SPARK-24718][SQL] Timestamp support pushdown to parquet data source
## What changes were proposed in this pull request?

`Timestamp` support pushdown to parquet data source.
Only `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` support push down.

## How was this patch tested?

unit tests and benchmark tests

Author: Yuming Wang <yumwang@ebay.com>

Closes #21741 from wangyum/SPARK-24718.
2018-07-15 11:13:49 +08:00
Sean Owen 8aceb961c3 [SPARK-24754][ML] Minhash integer overflow
## What changes were proposed in this pull request?

Use longs in calculating min hash to avoid bias due to int overflow.

## How was this patch tested?

Existing tests.

Author: Sean Owen <srowen@gmail.com>

Closes #21750 from srowen/SPARK-24754.
2018-07-14 15:59:17 -05:00
Yuming Wang e1de34113e [SPARK-17091][SQL] Add rule to convert IN predicate to equivalent Parquet filter
## What changes were proposed in this pull request?

The original pr is: https://github.com/apache/spark/pull/18424

Add a new optimizer rule to convert an IN predicate to an equivalent Parquet filter and add `spark.sql.parquet.pushdown.inFilterThreshold` to control limit thresholds. Different data types have different limit thresholds, this is a copy of data for reference:

Type | limit threshold
-- | --
string | 370
int | 210
long | 285
double | 270
float | 220
decimal | Won't provide better performance before [SPARK-24549](https://issues.apache.org/jira/browse/SPARK-24549)

## How was this patch tested?
unit tests and manual tests

Author: Yuming Wang <yumwang@ebay.com>

Closes #21603 from wangyum/SPARK-17091.
2018-07-14 17:50:54 +08:00
Ilan Filonenko f1a99ad582 [SPARK-23984][K8S][TEST] Added Integration Tests for PySpark on Kubernetes
## What changes were proposed in this pull request?

I added integration tests for PySpark ( + checking JVM options + RemoteFileTest) which wasn't properly merged in the initial integration test PR.

## How was this patch tested?

I tested this with integration tests using:

`dev/dev-run-integration-tests.sh --spark-tgz spark-2.4.0-SNAPSHOT-bin-2.7.3.tgz`

Author: Ilan Filonenko <if56@cornell.edu>

Closes #21583 from ifilonenko/master.
2018-07-13 17:19:28 -07:00
Yuming Wang a75571b46f [SPARK-23831][SQL] Add org.apache.derby to IsolatedClientLoader
## What changes were proposed in this pull request?

Add `org.apache.derby` to `IsolatedClientLoader`, otherwise it may throw an exception:
```scala
...
[info] Cause: java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$12439ab23, see the next exception for details.
[info] at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
[info] at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
[info] at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
[info] at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
[info] at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
[info] at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
...
```

## How was this patch tested?

unit tests and manual tests

Author: Yuming Wang <yumwang@ebay.com>

Closes #20944 from wangyum/SPARK-23831.
2018-07-13 14:07:52 -07:00
Marco Gaido 3b6005b8a2 [SPARK-23528][ML] Add numIter to ClusteringSummary
## What changes were proposed in this pull request?

Added the number of iterations in `ClusteringSummary`. This is an helpful information in evaluating how to eventually modify the parameters in order to get a better model.

## How was this patch tested?

modified existing UTs

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #20701 from mgaido91/SPARK-23528.
2018-07-13 11:23:42 -07:00
Xiao Li 3bcb1b4814 Revert "[SPARK-24776][SQL] Avro unit test: use SQLTestUtils and replace deprecated methods"
This reverts commit c1b62e420a.
2018-07-13 10:06:26 -07:00
Gengliang Wang c1b62e420a [SPARK-24776][SQL] Avro unit test: use SQLTestUtils and replace deprecated methods
## What changes were proposed in this pull request?
Improve Avro unit test:
1. use QueryTest/SharedSQLContext/SQLTestUtils, instead of the duplicated test utils.
2. replace deprecated methods

## How was this patch tested?

Unit test

Author: Gengliang Wang <gengliang.wang@databricks.com>

Closes #21760 from gengliangwang/improve_avro_test.
2018-07-13 08:55:46 -07:00
Liang-Chi Hsieh dfd7ac9887 [SPARK-24781][SQL] Using a reference from Dataset in Filter/Sort might not work
## What changes were proposed in this pull request?

When we use a reference from Dataset in filter or sort, which was not used in the prior select, an AnalysisException occurs, e.g.,

```scala
val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
df.select(df("name")).filter(df("id") === 0).show()
```

```scala
org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#6 missing from name#5 in operator !Filter (id#6 = 0).;;
!Filter (id#6 = 0)
   +- AnalysisBarrier
      +- Project [name#5]
         +- Project [_1#2 AS name#5, _2#3 AS id#6]
            +- LocalRelation [_1#2, _2#3]
```
This change updates the rule `ResolveMissingReferences` so `Filter` and `Sort` with non-empty `missingInputs` will also be transformed.

## How was this patch tested?

Added tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #21745 from viirya/SPARK-24781.
2018-07-13 08:25:00 -07:00
Yuanbo Liu 0f24c6f8ab [SPARK-24713] AppMatser of spark streaming kafka OOM if there are hund…
We have hundreds of kafka topics need to be consumed in one application. The application master will throw OOM exception after hanging for nearly half of an hour.

OOM happens in the env with a lot of topics, and it's not convenient to set up such kind of env in the unit test. So I didn't change/add test case.

Author: Yuanbo Liu <yuanbo@Yuanbos-MacBook-Air.local>
Author: yuanbo <yuanbo@apache.org>

Closes #21690 from yuanboliu/master.
2018-07-13 07:37:24 -06:00
Kevin Yu 0ce11d0e3a [SPARK-23486] cache the function name from the external catalog for lookupFunctions
## What changes were proposed in this pull request?

This PR will cache the function name from external catalog, it is used by lookupFunctions in the analyzer, and it is cached for each query plan. The original problem is reported in the [ spark-19737](https://issues.apache.org/jira/browse/SPARK-19737)

## How was this patch tested?

create new test file LookupFunctionsSuite and add test case in SessionCatalogSuite

Author: Kevin Yu <qyu@us.ibm.com>

Closes #20795 from kevinyu98/spark-23486.
2018-07-12 22:20:06 -07:00
Huaxin Gao e0f4f206b7 [SPARK-24537][R] Add array_remove / array_zip / map_from_arrays / array_distinct
## What changes were proposed in this pull request?
Add array_remove / array_zip / map_from_arrays / array_distinct functions in SparkR.

## How was this patch tested?
Add tests in test_sparkSQL.R

Author: Huaxin Gao <huaxing@us.ibm.com>

Closes #21645 from huaxingao/spark-24537.
2018-07-13 10:40:58 +08:00
maryannxue 75725057b3 [SPARK-24790][SQL] Allow complex aggregate expressions in Pivot
## What changes were proposed in this pull request?

Relax the check to allow complex aggregate expressions, like `ceil(sum(col1))` or `sum(col1) + 1`, which roughly means any aggregate expression that could appear in an Aggregate plan except pandas UDF (due to the fact that it is not supported in pivot yet).

## How was this patch tested?

Added 2 tests in pivot.sql

Author: maryannxue <maryannxue@apache.org>

Closes #21753 from maryannxue/pivot-relax-syntax.
2018-07-12 16:54:03 -07:00
Marco Gaido 11384893b6 [SPARK-24208][SQL][FOLLOWUP] Move test cases to proper locations
## What changes were proposed in this pull request?

The PR is a followup to move the test cases introduced by the original PR in their proper location.

## How was this patch tested?

moved UTs

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #21751 from mgaido91/SPARK-24208_followup.
2018-07-12 15:13:26 -07:00