Commit graph

17621 commits

Author SHA1 Message Date
WeichenXu 8087ecf8da [SPARK CORE][MINOR] fix "default partitioner cannot partition array keys" error message in PairRDDfunctions
## What changes were proposed in this pull request?

In order to avoid confusing user,
error message in `PairRDDfunctions`
`Default partitioner cannot partition array keys.`
is updated,
the one in `partitionBy` is replaced with
`Specified partitioner cannot partition array keys.`
other is replaced with
`Specified or default partitioner cannot partition array keys.`

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #15045 from WeichenXu123/fix_partitionBy_error_message.
2016-09-12 12:23:16 +01:00
Gaetan Semet b3c2291228 [SPARK-16992][PYSPARK] use map comprehension in doc
Code is equivalent, but map comprehency is most of the time faster than a map.

Author: Gaetan Semet <gaetan@xeberon.net>

Closes #14863 from Stibbons/map_comprehension.
2016-09-12 12:21:33 +01:00
codlife 4efcdb7fea [SPARK-17447] Performance improvement in Partitioner.defaultPartitioner without sortBy
## What changes were proposed in this pull request?

if there are many rdds in some situations,the sort will loss he performance servely,actually we needn't sort the rdds , we can just scan the rdds one time to gain the same goal.

## How was this patch tested?

manual tests

Author: codlife <1004910847@qq.com>

Closes #15039 from codlife/master.
2016-09-12 12:10:46 +01:00
cenyuhai cc87280fcd [SPARK-17171][WEB UI] DAG will list all partitions in the graph
## What changes were proposed in this pull request?
DAG will list all partitions in the graph, it is too slow and hard to see all graph.
Always we don't want to see all partitions,we just want to see the relations of DAG graph.
So I just show 2 root nodes for Rdds.

Before this PR, the DAG graph looks like [dag1.png](https://issues.apache.org/jira/secure/attachment/12824702/dag1.png), [dag3.png](https://issues.apache.org/jira/secure/attachment/12825456/dag3.png), after this PR, the DAG graph looks like [dag2.png](https://issues.apache.org/jira/secure/attachment/12824703/dag2.png),[dag4.png](https://issues.apache.org/jira/secure/attachment/12825457/dag4.png)

Author: cenyuhai <cenyuhai@didichuxing.com>
Author: 岑玉海 <261810726@qq.com>

Closes #14737 from cenyuhai/SPARK-17171.
2016-09-12 11:52:56 +01:00
Josh Rosen 72eec70bdb [SPARK-17486] Remove unused TaskMetricsUIData.updatedBlockStatuses field
The `TaskMetricsUIData.updatedBlockStatuses` field is assigned to but never read, increasing the memory consumption of the web UI. We should remove this field.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15038 from JoshRosen/remove-updated-block-statuses-from-TaskMetricsUIData.
2016-09-11 21:51:22 -07:00
Sameer Agarwal 767d480769 [SPARK-17415][SQL] Better error message for driver-side broadcast join OOMs
## What changes were proposed in this pull request?

This is a trivial patch that catches all `OutOfMemoryError` while building the broadcast hash relation and rethrows it by wrapping it in a nice error message.

## How was this patch tested?

Existing Tests

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14979 from sameeragarwal/broadcast-join-error.
2016-09-11 17:35:27 +02:00
Yanbo Liang 883c763184 [SPARK-17389][FOLLOW-UP][ML] Change KMeans k-means|| default init steps from 5 to 2.
## What changes were proposed in this pull request?
#14956 reduced default k-means|| init steps to 2 from 5 only for spark.mllib package, we should also do same change for spark.ml and PySpark.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15050 from yanboliang/spark-17389.
2016-09-11 13:47:13 +01:00
Bryan Cutler c76baff0cc [SPARK-17336][PYSPARK] Fix appending multiple times to PYTHONPATH from spark-config.sh
## What changes were proposed in this pull request?
During startup of Spark standalone, the script file spark-config.sh appends to the PYTHONPATH and can be sourced many times, causing duplicates in the path.  This change adds a env flag that is set when the PYTHONPATH is appended so it will happen only one time.

## How was this patch tested?
Manually started standalone master/worker and verified PYTHONPATH has no duplicate entries.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #15028 from BryanCutler/fix-duplicate-pythonpath-SPARK-17336.
2016-09-11 10:19:39 +01:00
tone-zhang bf22217377 [SPARK-17330][SPARK UT] Clean up spark-warehouse in UT
## What changes were proposed in this pull request?

Check the database warehouse used in Spark UT, and remove the existing database file before run the UT (SPARK-8368).

## How was this patch tested?

Run Spark UT with the command for several times:
./build/sbt -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver "test-only *HiveSparkSubmitSuit*"
Without the patch, the test case can be passed only at the first time, and always failed from the second time.
With the patch the test case always can be passed correctly.

Author: tone-zhang <tone.zhang@linaro.org>

Closes #14894 from tone-zhang/issue1.
2016-09-11 10:17:53 +01:00
Timothy Hunter 180796ecb3 [SPARK-17439][SQL] Fixing compression issues with approximate quantiles and adding more tests
## What changes were proposed in this pull request?

This PR build on #14976 and fixes a correctness bug that would cause the wrong quantile to be returned for small target errors.

## How was this patch tested?

This PR adds 8 unit tests that were failing without the fix.

Author: Timothy Hunter <timhunter@databricks.com>
Author: Sean Owen <sowen@cloudera.com>

Closes #15002 from thunterdb/ml-1783.
2016-09-11 08:03:45 +01:00
Sean Owen 29ba9578f4 [SPARK-17389][ML][MLLIB] KMeans speedup with better choice of k-means|| init steps = 2
## What changes were proposed in this pull request?

Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion.
See also https://github.com/apache/spark/pull/14948

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #14956 from srowen/SPARK-17389.2.
2016-09-11 08:00:55 +01:00
Xin Ren 71b7d42f5f [SPARK-16445][MLLIB][SPARKR] Fix @return description for sparkR mlp summary() method
## What changes were proposed in this pull request?

Fix summary() method's `return` description for spark.mlp

## How was this patch tested?

Ran tests locally on my laptop.

Author: Xin Ren <iamshrek@126.com>

Closes #15015 from keypointt/SPARK-16445-2.
2016-09-10 09:52:53 -07:00
Ryan Blue 6ea5055fa7 [SPARK-17396][CORE] Share the task support between UnionRDD instances.
## What changes were proposed in this pull request?

Share the ForkJoinTaskSupport between UnionRDD instances to avoid creating a huge number of threads if lots of RDDs are created at the same time.

## How was this patch tested?

This uses existing UnionRDD tests.

Author: Ryan Blue <blue@apache.org>

Closes #14985 from rdblue/SPARK-17396-use-shared-pool.
2016-09-10 10:18:53 +01:00
Yanbo Liang bcdd259c37 [SPARK-15509][FOLLOW-UP][ML][SPARKR] R MLlib algorithms should support input columns "features" and "label"
## What changes were proposed in this pull request?
#13584 resolved the issue of features and label columns conflict with ```RFormula``` default ones when loading libsvm data, but it still left some issues should be resolved:
1, It’s not necessary to check and rename label column.
Since we have considerations on the design of ```RFormula```, it can handle the case of label column already exists(with restriction of the existing label column should be numeric/boolean type). So it’s not necessary to change the column name to avoid conflict. If the label column is not numeric/boolean type, ```RFormula``` will throw exception.

2, We should rename features column name to new one if there is conflict, but appending a random value is enough since it was used internally only. We done similar work when implementing ```SQLTransformer```.

3, We should set correct new features column for the estimators. Take ```GLM``` as example:
```GLM``` estimator should set features column with the changed one(rFormula.getFeaturesCol) rather than the default “features”. Although it’s same when training model, but it involves problems when predicting. The following is the prediction result of GLM before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18308227/84c3c452-74a8-11e6-9caa-9d6d846cc957.png)
We should drop the internal used feature column name, otherwise, it will appear on the prediction DataFrame which will confused users. And this behavior is same as other scenarios which does not exist column name conflict.
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/18308240/92082a04-74a8-11e6-9226-801f52b856d9.png)

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14993 from yanboliang/spark-15509.
2016-09-10 00:27:10 -07:00
Yves Raimond 1fec3ce4e1 [SPARK-11496][GRAPHX] Parallel implementation of personalized pagerank
(Updated version of [PR-9457](https://github.com/apache/spark/pull/9457), rebased on latest Spark master, and using mllib-local).

This implements a parallel version of personalized pagerank, which runs all propagations for a list of source vertices in parallel.

I ran a few benchmarks on the full [DBpedia](http://dbpedia.org/) graph. When running personalized pagerank for only one source node, the existing implementation is twice as fast as the parallel one (because of the SparseVector overhead). However for 10 source nodes, the parallel implementation is four times as fast. When increasing the number of source nodes, this difference becomes even greater.

![image](https://cloud.githubusercontent.com/assets/2491/10927702/dd82e4fa-8256-11e5-89a8-4799b407f502.png)

Author: Yves Raimond <yraimond@netflix.com>

Closes #14998 from moustaki/parallel-ppr.
2016-09-10 00:15:59 -07:00
Tejas Patil 335491704c [SPARK-15453][SQL] FileSourceScanExec to extract outputOrdering information
## What changes were proposed in this pull request?

Jira : https://issues.apache.org/jira/browse/SPARK-15453

Extracting sort ordering information in `FileSourceScanExec` so that planner can make use of it. My motivation to make this change was to get Sort Merge join in par with Hive's Sort-Merge-Bucket join when the source tables are bucketed + sorted.

Query:

```
val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", "k").coalesce(1)
df.write.bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table8")
df.write.bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("table9")
context.sql("SELECT * FROM table8 a JOIN table9 b ON a.j=b.j AND a.k=b.k").explain(true)
```

Before:

```
== Physical Plan ==
*SortMergeJoin [j#120, k#121], [j#123, k#124], Inner
:- *Sort [j#120 ASC, k#121 ASC], false, 0
:  +- *Project [i#119, j#120, k#121]
:     +- *Filter (isnotnull(k#121) && isnotnull(j#120))
:        +- *FileScan orc default.table8[i#119,j#120,k#121] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table8, PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct<i:int,j:int,k:string>
+- *Sort [j#123 ASC, k#124 ASC], false, 0
+- *Project [i#122, j#123, k#124]
+- *Filter (isnotnull(k#124) && isnotnull(j#123))
 +- *FileScan orc default.table9[i#122,j#123,k#124] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table9, PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct<i:int,j:int,k:string>
```

After:  (note that the `Sort` step is no longer there)

```
== Physical Plan ==
*SortMergeJoin [j#49, k#50], [j#52, k#53], Inner
:- *Project [i#48, j#49, k#50]
:  +- *Filter (isnotnull(k#50) && isnotnull(j#49))
:     +- *FileScan orc default.table8[i#48,j#49,k#50] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table8, PartitionFilters: [], PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: struct<i:int,j:int,k:string>
+- *Project [i#51, j#52, k#53]
   +- *Filter (isnotnull(j#52) && isnotnull(k#53))
      +- *FileScan orc default.table9[i#51,j#52,k#53] Batched: false, Format: ORC, InputPaths: file:/Users/tejasp/Desktop/dev/tp-spark/spark-warehouse/table9, PartitionFilters: [], PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: struct<i:int,j:int,k:string>
```

## How was this patch tested?

Added a test case in `JoinSuite`. Ran all other tests in `JoinSuite`

Author: Tejas Patil <tejasp@fb.com>

Closes #14864 from tejasapatil/SPARK-15453_smb_optimization.
2016-09-10 09:27:22 +08:00
hyukjinkwon f7d2143705 [SPARK-17354] [SQL] Partitioning by dates/timestamps should work with Parquet vectorized reader
## What changes were proposed in this pull request?

This PR fixes `ColumnVectorUtils.populate` so that Parquet vectorized reader can read partitioned table with dates/timestamps. This works fine with Parquet normal reader.

This is being only called within [VectorizedParquetRecordReader.java#L185](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L185).

When partition column types are explicitly given to `DateType` or `TimestampType` (rather than inferring the type of partition column), this fails with the exception below:

```
16/09/01 10:30:07 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 6)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date
	at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:89)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:185)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:204)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)
...
```

## How was this patch tested?

Unit tests in `SQLQuerySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14919 from HyukjinKwon/SPARK-17354.
2016-09-09 14:23:05 -07:00
Thomas Graves a3981c28c9 [SPARK-17433] YarnShuffleService doesn't handle moving credentials levelDb
The secrets leveldb isn't being moved if you run spark shuffle services without yarn nm recovery on and then turn it on.  This fixes that.  I unfortunately missed this when I ported the patch from our internal branch 2 to master branch due to the changes for the recovery path.  Note this only applies to master since it is the only place the yarn nm recovery dir is used.

Unit tests ran and tested on 8 node cluster.  Fresh startup with NM recovery, fresh startup no nm recovery, switching between no nm recovery and recovery.  Also tested running applications to make sure wasn't affected by rolling upgrade.

Author: Thomas Graves <tgraves@prevailsail.corp.gq1.yahoo.com>
Author: Tom Graves <tgraves@apache.org>

Closes #14999 from tgravescs/SPARK-17433.
2016-09-09 13:43:32 -05:00
Satendra Kumar 7098a12945 Streaming doc correction.
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
Streaming doc correction.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Satendra Kumar <satendra@knoldus.com>

Closes #14996 from satendrakumar06/patch-1.
2016-09-09 19:15:06 +01:00
Yanbo Liang 2ed601217f [SPARK-17464][SPARKR][ML] SparkR spark.als argument reg should be 0.1 by default.
## What changes were proposed in this pull request?
SparkR ```spark.als``` arguments ```reg``` should be 0.1 by default, which need to be consistent with ML.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15021 from yanboliang/spark-17464.
2016-09-09 05:43:34 -07:00
Joseph K. Bradley 65b814bf50 [SPARK-17456][CORE] Utility for parsing Spark versions
## What changes were proposed in this pull request?

This patch adds methods for extracting major and minor versions as Int types in Scala from a Spark version string.

Motivation: There are many hacks within Spark's codebase to identify and compare Spark versions. We should add a simple utility to standardize these code paths, especially since there have been mistakes made in the past. This will let us add unit tests as well.  Currently, I want this functionality to check Spark versions to provide backwards compatibility for ML model persistence.

## How was this patch tested?

Unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #15017 from jkbradley/version-parsing.
2016-09-09 05:35:10 -07:00
Gurvinder Singh 92ce8d4849 [SPARK-15487][WEB UI] Spark Master UI to reverse proxy Application and Workers UI
## What changes were proposed in this pull request?

This pull request adds the functionality to enable accessing worker and application UI through master UI itself. Thus helps in accessing SparkUI when running spark cluster in closed networks e.g. Kubernetes. Cluster admin needs to expose only spark master UI and rest of the UIs can be in the private network, master UI will reverse proxy the connection request to corresponding resource. It adds the path for workers/application UIs as

WorkerUI: <http/https>://master-publicIP:<port>/target/workerID/
ApplicationUI: <http/https>://master-publicIP:<port>/target/appID/

This makes it easy for users to easily protect the Spark master cluster access by putting some reverse proxy e.g. https://github.com/bitly/oauth2_proxy

## How was this patch tested?

The functionality has been tested manually and there is a unit test too for testing access to worker UI with reverse proxy address.

pwendell bomeng BryanCutler can you please review it, thanks.

Author: Gurvinder Singh <gurvinder.singh@uninett.no>

Closes #13950 from gurvindersingh/rproxy.
2016-09-08 17:20:20 -07:00
Eric Liang 722afbb2b3 [SPARK-17405] RowBasedKeyValueBatch should use default page size to prevent OOMs
## What changes were proposed in this pull request?

Before this change, we would always allocate 64MB per aggregation task for the first-level hash map storage, even when running in low-memory situations such as local mode. This changes it to use the memory manager default page size, which is automatically reduced from 64MB in these situations.

cc ooq JoshRosen

## How was this patch tested?

Tested manually with `bin/spark-shell --master=local[32]` and verifying that `(1 to math.pow(10, 3).toInt).toDF("n").withColumn("m", 'n % 2).groupBy('m).agg(sum('n)).show` does not crash.

Author: Eric Liang <ekl@databricks.com>

Closes #15016 from ericl/sc-4483.
2016-09-08 16:47:18 -07:00
hyukjinkwon 78d5d4dd5c [SPARK-17200][PROJECT INFRA][BUILD][SPARKR] Automate building and testing on Windows (currently SparkR only)
## What changes were proposed in this pull request?

This PR adds the build automation on Windows with [AppVeyor](https://www.appveyor.com/) CI tool.

Currently, this only runs the tests for SparkR as we have been having some issues with testing Windows-specific PRs (e.g. https://github.com/apache/spark/pull/14743 and https://github.com/apache/spark/pull/13165) and hard time to verify this.

One concern is, this build is dependent on [steveloughran/winutils](https://github.com/steveloughran/winutils) for pre-built Hadoop bin package (who is a Hadoop PMC member).

## How was this patch tested?

Manually, https://ci.appveyor.com/project/HyukjinKwon/spark/build/88-SPARK-17200-build-profile
This takes roughly 40 mins.

Some tests are already being failed and this was found in https://github.com/apache/spark/pull/14743#issuecomment-241405287.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14859 from HyukjinKwon/SPARK-17200-build.
2016-09-08 08:26:59 -07:00
Felix Cheung f0d21b7f90 [SPARK-17442][SPARKR] Additional arguments in write.df are not passed to data source
## What changes were proposed in this pull request?

additional options were not passed down in write.df.

## How was this patch tested?

unit tests
falaki shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15010 from felixcheung/testreadoptions.
2016-09-08 08:22:58 -07:00
Wenchen Fan 3ced39df32 [SPARK-17432][SQL] PreprocessDDL should respect case sensitivity when checking duplicated columns
## What changes were proposed in this pull request?

In `PreprocessDDL` we will check if table columns are duplicated. However, this checking ignores case sensitivity config(it's always case-sensitive) and lead to different result between `HiveExternalCatalog` and `InMemoryCatalog`. `HiveExternalCatalog` will throw exception because hive metastore is always case-nonsensitive, and `InMemoryCatalog` is fine.

This PR fixes it.

## How was this patch tested?

a new test in DDLSuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14994 from cloud-fan/check-dup.
2016-09-08 19:41:49 +08:00
gatorsmile b230fb92a5 [SPARK-17052][SQL] Remove Duplicate Test Cases auto_join from HiveCompatibilitySuite.scala
### What changes were proposed in this pull request?
The original [JIRA Hive-1642](https://issues.apache.org/jira/browse/HIVE-1642) delivered the test cases `auto_joinXYZ` for verifying the results when the joins are automatically converted to map-join. Basically, most of them are just copied from the corresponding `joinXYZ`.

After comparison between `auto_joinXYZ` and `joinXYZ`, below is a list of duplicate cases:
```
    "auto_join0",
    "auto_join1",
    "auto_join10",
    "auto_join11",
    "auto_join12",
    "auto_join13",
    "auto_join14",
    "auto_join14_hadoop20",
    "auto_join15",
    "auto_join17",
    "auto_join18",
    "auto_join2",
    "auto_join20",
    "auto_join21",
    "auto_join23",
    "auto_join24",
    "auto_join3",
    "auto_join4",
    "auto_join5",
    "auto_join6",
    "auto_join7",
    "auto_join8",
    "auto_join9"
```

We can remove all of them without affecting the test coverage.

### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14635 from gatorsmile/removeAuto.
2016-09-07 14:03:14 -07:00
Eric Liang 649fa4bf1d [SPARK-17370] Shuffle service files not invalidated when a slave is lost
## What changes were proposed in this pull request?

DAGScheduler invalidates shuffle files when an executor loss event occurs, but not when the external shuffle service is enabled. This is because when shuffle service is on, the shuffle file lifetime can exceed the executor lifetime.

However, it also doesn't invalidate shuffle files when the shuffle service itself is lost (due to whole slave loss). This can cause long hangs when slaves are lost since the file loss is not detected until a subsequent stage attempts to read the shuffle files.

The proposed fix is to also invalidate shuffle files when an executor is lost due to a `SlaveLost` event.

## How was this patch tested?

Unit tests, also verified on an actual cluster that slave loss invalidates shuffle files immediately as expected.

cc mateiz

Author: Eric Liang <ekl@databricks.com>

Closes #14931 from ericl/sc-4439.
2016-09-07 12:33:50 -07:00
Srinivasa Reddy Vundela 76ad89e924 [MINOR][SQL] Fixing the typo in unit test
## What changes were proposed in this pull request?

Fixing the typo in the unit test of CodeGenerationSuite.scala

## How was this patch tested?
Ran the unit test after fixing the typo and it passes

Author: Srinivasa Reddy Vundela <vsr@cloudera.com>

Closes #14989 from vundela/typo_fix.
2016-09-07 12:41:03 +01:00
Daoyuan Wang 6f4aeccf8c [SPARK-17427][SQL] function SIZE should return -1 when parameter is null
## What changes were proposed in this pull request?

`select size(null)` returns -1 in Hive. In order to be compatible, we should return `-1`.

## How was this patch tested?

unit test in `CollectionFunctionsSuite` and `DataFrameFunctionsSuite`.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #14991 from adrian-wang/size.
2016-09-07 13:01:27 +02:00
hyukjinkwon 6b41195bca [SPARK-17339][SPARKR][CORE] Fix some R tests and use Path.toUri in SparkContext for Windows paths in SparkR
## What changes were proposed in this pull request?

This PR fixes the Windows path issues in several APIs. Please refer https://issues.apache.org/jira/browse/SPARK-17339 for more details.

## How was this patch tested?

Tests via AppVeyor CI - https://ci.appveyor.com/project/HyukjinKwon/spark/build/82-SPARK-17339-fix-r

Also, manually,

![2016-09-06 3 14 38](https://cloud.githubusercontent.com/assets/6477701/18263406/b93a98be-7444-11e6-9521-b28ee65a4771.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14960 from HyukjinKwon/SPARK-17339.
2016-09-07 19:24:03 +09:00
Liwei Lin 3ce3a282c8 [SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of ArrayBuffer.append(A) in performance critical paths
## What changes were proposed in this pull request?

We should generally use `ArrayBuffer.+=(A)` rather than `ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / unboxing.

## How was this patch tested?

N/A

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14914 from lw-lin/append_to_plus_eq_v2.
2016-09-07 10:04:00 +01:00
Clark Fitzgerald 9fccde4ff8 [SPARK-16785] R dapply doesn't return array or raw columns
## What changes were proposed in this pull request?

Fixed bug in `dapplyCollect` by changing the `compute` function of `worker.R` to explicitly handle raw (binary) vectors.

cc shivaram

## How was this patch tested?

Unit tests

Author: Clark Fitzgerald <clarkfitzg@gmail.com>

Closes #14783 from clarkfitzg/SPARK-16785.
2016-09-06 23:40:37 -07:00
Tathagata Das eb1ab88a86 [SPARK-17372][SQL][STREAMING] Avoid serialization issues by using Arrays to save file names in FileStreamSource
## What changes were proposed in this pull request?

When we create a filestream on a directory that has partitioned subdirs (i.e. dir/x=y/), then ListingFileCatalog.allFiles returns the files in the dir as Seq[String] which internally is a Stream[String]. This is because of this [line](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L93), where a LinkedHashSet.values.toSeq returns Stream. Then when the [FileStreamSource](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L79) filters this Stream[String] to remove the seen files, it creates a new Stream[String], which has a filter function that has a $outer reference to the FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String] causes NotSerializableException. This will happened even if there is just one file in the dir.

Its important to note that this behavior is different in Scala 2.11. There is no $outer reference to FileStreamSource, so it does not throw NotSerializableException. However, with a large sequence of files (tested with 10000 files), it throws StackOverflowError. This is because how Stream class is implemented. Its basically like a linked list, and attempting to serialize a long Stream requires *recursively* going through linked list, thus resulting in StackOverflowError.

In short, across both Scala 2.10 and 2.11, serialization fails when both the following conditions are true.
- file stream defined on a partitioned directory
- directory has 10k+ files

The right solution is to convert the seq to an array before writing to the log. This PR implements this fix in two ways.
- Changing all uses for HDFSMetadataLog to ensure Array is used instead of Seq
- Added a `require` in HDFSMetadataLog such that it is never used with type Seq

## How was this patch tested?

Added unit test that test that ensures the file stream source can handle with 10000 files. This tests fails in both Scala 2.10 and 2.11 with different failures as indicated above.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #14987 from tdas/SPARK-17372.
2016-09-06 19:34:11 -07:00
Wenchen Fan d6eede9a36 [SPARK-17238][SQL] simplify the logic for converting data source table into hive compatible format
## What changes were proposed in this pull request?

Previously we have 2 conditions to decide whether a data source table is hive-compatible:

1. the data source is file-based and has a corresponding Hive serde
2. have a `path` entry in data source options/storage properties

However, if condition 1 is true, condition 2 must be true too, as we will put the default table path into data source options/storage properties for managed data source tables.

There is also a potential issue: we will set the `locationUri` even for managed table.

This PR removes the condition 2 and only set the `locationUri` for external data source tables.

Note: this is also a first step to unify the `path` of data source tables and `locationUri` of hive serde tables. For hive serde tables, `locationUri` is only set for external table. For data source tables, `path` is always set. We can make them consistent after this PR.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14809 from cloud-fan/minor2.
2016-09-07 09:36:53 +08:00
gatorsmile a40657bfd3 [SPARK-17408][TEST] Flaky test: org.apache.spark.sql.hive.StatisticsSuite
### What changes were proposed in this pull request?
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64956/testReport/junit/org.apache.spark.sql.hive/StatisticsSuite/test_statistics_of_LogicalRelation_converted_from_MetastoreRelation/
```
org.apache.spark.sql.hive.StatisticsSuite.test statistics of LogicalRelation converted from MetastoreRelation

Failing for the past 1 build (Since Failed#64956 )
Took 1.4 sec.
Error Message

org.scalatest.exceptions.TestFailedException: 6871 did not equal 4236
Stacktrace

sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 6871 did not equal 4236
	at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
```

This fix does not check the exact value of `sizeInBytes`. Instead, we compare whether it is larger than zero and compare the values between different values.

In addition, we also combine `checkMetastoreRelationStats` and `checkLogicalRelationStats` into the same checking function.

### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14978 from gatorsmile/spark17408.
2016-09-07 08:13:12 +08:00
Eric Liang c07cbb3534 [SPARK-17371] Resubmitted shuffle outputs can get deleted by zombie map tasks
## What changes were proposed in this pull request?

It seems that old shuffle map tasks hanging around after a stage resubmit will delete intended shuffle output files on stop(), causing downstream stages to fail even after successful resubmit completion. This can happen easily if the prior map task is waiting for a network timeout when its stage is resubmitted.

This can cause unnecessary stage resubmits, sometimes multiple times as fetch fails cause a cascade of shuffle file invalidations, and confusing FetchFailure messages that report shuffle index files missing from the local disk.

Given that IndexShuffleBlockResolver commits data atomically, it seems unnecessary to ever delete committed task output: even in the rare case that a task is failed after it finishes committing shuffle output, it should be safe to retain that output.

## How was this patch tested?

Prior to the fix proposed in https://github.com/apache/spark/pull/14931, I was able to reproduce this behavior by killing slaves in the middle of a large shuffle. After this patch, stages were no longer resubmitted multiple times due to shuffle index loss.

cc JoshRosen vanzin

Author: Eric Liang <ekl@databricks.com>

Closes #14932 from ericl/dont-remove-committed-files.
2016-09-06 16:55:22 -07:00
Shixiong Zhu 175b434411 [SPARK-17316][CORE] Fix the 'ask' type parameter in 'removeExecutor'
## What changes were proposed in this pull request?

Fix the 'ask' type parameter in 'removeExecutor' to eliminate a lot of error logs `Cannot cast java.lang.Boolean to scala.runtime.Nothing$`

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #14983 from zsxwing/SPARK-17316-3.
2016-09-06 16:49:06 -07:00
Marcelo Vanzin 0bd00ff245 [SPARK-15891][YARN] Clean up some logging in the YARN AM.
To make the log file more readable, rework some of the logging done
by the AM:

- log executor command / env just once, since they're all almost the same;
  the information that changes, such as executor ID, is already available
  in other log messages.
- avoid printing logs when nothing happens, especially when updating the
  container requests in the allocator.
- print fewer log messages when requesting many unlocalized executors,
  instead of repeating the same message multiple times.
- removed some logs that seemed unnecessary.

In the process, I slightly fixed up the wording in a few log messages, and
did some minor clean up of method arguments that were redundant.

Tested by running existing unit tests, and analyzing the logs of an
application that exercises dynamic allocation by forcing executors
to be allocated and be killed in waves.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14943 from vanzin/SPARK-15891.
2016-09-06 15:54:54 -07:00
Herman van Hovell 4f769b903b [SPARK-17296][SQL] Simplify parser join processing.
## What changes were proposed in this pull request?
Join processing in the parser relies on the fact that the grammar produces a right nested trees, for instance the parse tree for `select * from a join b join c` is expected to produce a tree similar to `JOIN(a, JOIN(b, c))`. However there are cases in which this (invariant) is violated, like:
```sql
SELECT COUNT(1)
FROM test T1
     CROSS JOIN test T2
     JOIN test T3
      ON T3.col = T1.col
     JOIN test T4
      ON T4.col = T1.col
```
In this case the parser returns a tree in which Joins are located on both the left and the right sides of the parent join node.

This PR introduces a different grammar rule which does not make this assumption. The new rule takes a relation and searches for zero or more joined relations. As a bonus processing is much easier.

## How was this patch tested?
Existing tests and I have added a regression test to the plan parser suite.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #14867 from hvanhovell/SPARK-17296.
2016-09-07 00:44:07 +02:00
Josh Rosen 29cfab3f15 [SPARK-17110] Fix StreamCorruptionException in BlockManager.getRemoteValues()
## What changes were proposed in this pull request?

This patch fixes a `java.io.StreamCorruptedException` error affecting remote reads of cached values when certain data types are used. The problem stems from #11801 / SPARK-13990, a patch to have Spark automatically pick the "best" serializer when caching RDDs. If PySpark cached a PythonRDD, then this would be cached as an `RDD[Array[Byte]]` and the automatic serializer selection would pick KryoSerializer for replication and block transfer. However, the `getRemoteValues()` / `getRemoteBytes()` code path did not pass proper class tags in order to enable the same serializer to be used during deserialization, causing Java to be inappropriately used instead of Kryo, leading to the StreamCorruptedException.

We already fixed a similar bug in #14311, which dealt with similar issues in block replication. Prior to that patch, it seems that we had no tests to ensure that block replication actually succeeded. Similarly, prior to this bug fix patch it looks like we had no tests to perform remote reads of cached data, which is why this bug was able to remain latent for so long.

This patch addresses the bug by modifying `BlockManager`'s `get()` and  `getRemoteValues()` methods to accept ClassTags, allowing the proper class tag to be threaded in the `getOrElseUpdate` code path (which is used by `rdd.iterator`)

## How was this patch tested?

Extended the caching tests in `DistributedSuite` to exercise the `getRemoteValues` path, plus manual testing to verify that the PySpark bug reproduction in SPARK-17110 is fixed.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14952 from JoshRosen/SPARK-17110.
2016-09-06 15:07:28 -07:00
Zheng RuiFeng 8bbb08a300 [MINOR] Remove unnecessary check in MLSerDe
## What changes were proposed in this pull request?
1, remove unnecessary `require()`, because it will make following check useless.
2, update the error msg.

## How was this patch tested?
no test

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #14972 from zhengruifeng/del_unnecessary_check.
2016-09-06 14:20:56 -07:00
Sandeep Singh 7775d9f224 [SPARK-17299] TRIM/LTRIM/RTRIM should not strips characters other than spaces
## What changes were proposed in this pull request?
TRIM/LTRIM/RTRIM should not strips characters other than spaces, we were trimming all chars small than ASCII 0x20(space)

## How was this patch tested?
fixed existing tests.

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #14924 from techaddict/SPARK-17299.
2016-09-06 22:18:28 +01:00
Adam Roberts 6c08dbf683 [SPARK-17378][BUILD] Upgrade snappy-java to 1.1.2.6
## What changes were proposed in this pull request?

Upgrades the Snappy version to 1.1.2.6 from 1.1.2.4, release notes: https://github.com/xerial/snappy-java/blob/master/Milestone.md mention "Fix a bug in SnappyInputStream when reading compressed data that happened to have the same first byte with the stream magic header (#142)"

## How was this patch tested?
Existing unit tests using the latest IBM Java 8 on Intel, Power and Z architectures (little and big-endian)

Author: Adam Roberts <aroberts@uk.ibm.com>

Closes #14958 from a-roberts/master.
2016-09-06 22:13:25 +01:00
Davies Liu f7e26d7887 [SPARK-16922] [SPARK-17211] [SQL] make the address of values portable in LongToUnsafeRowMap
## What changes were proposed in this pull request?

In LongToUnsafeRowMap, we use offset of a value as pointer, stored in a array also in the page for chained values. The offset is not portable, because Platform.LONG_ARRAY_OFFSET will be different with different JVM Heap size, then the deserialized LongToUnsafeRowMap will be corrupt.

This PR will change to use portable address (without Platform.LONG_ARRAY_OFFSET).

## How was this patch tested?

Added a test case with random generated keys, to improve the coverage. But this test is not a regression test, that could require a Spark cluster that have at least 32G heap in driver or executor.

Author: Davies Liu <davies@databricks.com>

Closes #14927 from davies/longmap.
2016-09-06 10:46:31 -07:00
Sean Zhong bc2767df26 [SPARK-17374][SQL] Better error messages when parsing JSON using DataFrameReader
## What changes were proposed in this pull request?

This PR adds better error messages for malformed record when reading a JSON file using DataFrameReader.

For example, for query:
```
import org.apache.spark.sql.types._
val corruptRecords = spark.sparkContext.parallelize("""{"a":{, b:3}""" :: Nil)
val schema = StructType(StructField("a", StringType, true) :: Nil)
val jsonDF = spark.read.schema(schema).json(corruptRecords)
```

**Before change:**
We silently replace corrupted line with null
```
scala> jsonDF.show
+----+
|   a|
+----+
|null|
+----+
```

**After change:**
Add an explicit warning message:
```
scala> jsonDF.show
16/09/02 14:43:16 WARN JacksonParser: Found at least one malformed records (sample: {"a":{, b:3}). The JSON reader will replace
all malformed records with placeholder null in current PERMISSIVE parser mode.
To find out which corrupted records have been replaced with null, please use the
default inferred schema instead of providing a custom schema.

Code example to print all malformed records (scala):
===================================================
// The corrupted record exists in column _corrupt_record.
val parsedJson = spark.read.json("/path/to/json/file/test.json")

+----+
|   a|
+----+
|null|
+----+
```

###

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14929 from clockfly/logwarning_if_schema_not_contain_corrupted_record.
2016-09-06 22:20:55 +08:00
Yanbo Liang 39d538dddf [MINOR][ML] Correct weights doc of MultilayerPerceptronClassificationModel.
## What changes were proposed in this pull request?
```weights``` of ```MultilayerPerceptronClassificationModel``` should be the output weights of layers rather than initial weights, this PR correct it.

## How was this patch tested?
Doc change.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14967 from yanboliang/mlp-weights.
2016-09-06 03:30:37 -07:00
Sean Zhong 6f13aa7dfe [SPARK-17356][SQL] Fix out of memory issue when generating JSON for TreeNode
## What changes were proposed in this pull request?

class `org.apache.spark.sql.types.Metadata` is widely used in mllib to store some ml attributes. `Metadata` is commonly stored in `Alias` expression.

```
case class Alias(child: Expression, name: String)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Option[String] = None,
    val explicitMetadata: Option[Metadata] = None,
    override val isGenerated: java.lang.Boolean = false)
```

The `Metadata` can take a big memory footprint since the number of attributes is big ( in scale of million). When `toJSON` is called on `Alias` expression, the `Metadata` will also be converted to a big JSON string.
If a plan contains many such kind of `Alias` expressions, it may trigger out of memory error when `toJSON` is called, since converting all `Metadata` references to JSON will take huge memory.

With this PR, we will skip scanning Metadata when doing JSON conversion. For a reproducer of the OOM, and analysis, please look at jira https://issues.apache.org/jira/browse/SPARK-17356.

## How was this patch tested?

Existing tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14915 from clockfly/json_oom.
2016-09-06 16:05:50 +08:00
Wenchen Fan c0ae6bc6ea [SPARK-17361][SQL] file-based external table without path should not be created
## What changes were proposed in this pull request?

Using the public `Catalog` API, users can create a file-based data source table, without giving the path options. For this case, currently we can create the table successfully, but fail when we read it. Ideally we should fail during creation.

This is because when we create data source table, we resolve the data source relation without validating path: `resolveRelation(checkPathExist = false)`.

Looking back to why we add this trick(`checkPathExist`), it's because when we call `resolveRelation` for managed table, we add the path to data source options but the path is not created yet. So why we add this not-yet-created path to data source options? This PR fix the problem by adding path to options after we call `resolveRelation`. Then we can remove the `checkPathExist` parameter in `DataSource.resolveRelation` and do some related cleanups.

## How was this patch tested?

existing tests and new test in `CatalogSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14921 from cloud-fan/check-path.
2016-09-06 14:17:47 +08:00
Yadong Qi 64e826f91e [SPARK-17358][SQL] Cached table(parquet/orc) should be shard between beelines
## What changes were proposed in this pull request?
Cached table(parquet/orc) couldn't be shard between beelines, because the `sameResult` method used by `CacheManager` always return false(`sparkSession` are different) when compare two `HadoopFsRelation` in different beelines. So we make `sparkSession` a curry parameter.

## How was this patch tested?
Beeline1
```
1: jdbc:hive2://localhost:10000> CACHE TABLE src_pqt;
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (5.143 seconds)
1: jdbc:hive2://localhost:10000> EXPLAIN SELECT * FROM src_pqt;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                                                                                                                                            plan                                                                                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan ==
InMemoryTableScan [key#49, value#50]
   +- InMemoryRelation [key#49, value#50], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `src_pqt`
         +- *FileScan parquet default.src_pqt[key#0,value#1] Batched: true, Format: ParquetFormat, InputPaths: hdfs://199.0.0.1:9000/qiyadong/src_pqt, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:int,value:string>  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
```

Beeline2
```
0: jdbc:hive2://localhost:10000> EXPLAIN SELECT * FROM src_pqt;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                                                                                                                                            plan                                                                                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan ==
InMemoryTableScan [key#68, value#69]
   +- InMemoryRelation [key#68, value#69], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `src_pqt`
         +- *FileScan parquet default.src_pqt[key#0,value#1] Batched: true, Format: ParquetFormat, InputPaths: hdfs://199.0.0.1:9000/qiyadong/src_pqt, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:int,value:string>  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
```

Author: Yadong Qi <qiyadong2010@gmail.com>

Closes #14913 from watermen/SPARK-17358.
2016-09-06 10:57:21 +08:00