Commit graph

15828 commits

Author SHA1 Message Date
Zheng RuiFeng 9bfb35da1e [SPARK-14515][DOC] Add python example for ChiSqSelector
## What changes were proposed in this pull request?
Add the missing python example for ChiSqSelector
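A minimal PySpark sketch in the spirit of the missing example (it assumes an existing `sqlContext`; the data and column names are illustrative, not the exact example added):

```python
from pyspark.mllib.linalg import Vectors  # pyspark.ml.linalg in later releases
from pyspark.ml.feature import ChiSqSelector

# Toy data: a vector column of features and a numeric label column.
df = sqlContext.createDataFrame([
    (Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
    (Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
    (Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)], ["features", "clicked"])

# Keep only the single feature most dependent on the label (chi-squared test).
selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="clicked")
selector.fit(df).transform(df).show()
```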

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12283 from zhengruifeng/chi2_pe.
2016-04-18 17:14:22 -07:00
Wenchen Fan 602734084c [SPARK-14628][CORE][FOLLOW-UP] Always tracking read/write metrics
## What changes were proposed in this pull request?

This PR is a follow-up to https://github.com/apache/spark/pull/12417; we now always track input/output/shuffle metrics in the Spark JSON protocol and status API.

Most of the line changes come from re-generating the golden answer for `HistoryServerSuite`, which now includes many 0 values for read/write metrics.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12462 from cloud-fan/follow.
2016-04-18 15:17:29 -07:00
Shixiong Zhu 6ff0435858 [SPARK-14713][TESTS] Fix the flaky test NettyBlockTransferServiceSuite
## What changes were proposed in this pull request?

When there are multiple tests running, "NettyBlockTransferServiceSuite.can bind to a specific port twice and the second increments" may fail.

E.g., assume there are 2 tests running. Here is the execution order that reproduces the test failure.

| Execution Order | Test 1 | Test 2 |
| ------------- | ------------- | ------------- |
| 1 | service0 binds to 17634 |  |
| 2 |  | service0 binds to 17635 (17634 is occupied) |
| 3 | service1 binds to 17636 |  |
| 4 | pass test |  |
| 5 | service0.close (release 17634) |  |
| 6 |  | service1 binds to 17634 |
| 7 |  | `service1.port should be (service0.port + 1)` fails (17634 != 17635 + 1) |

Here is an example in Jenkins: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/786/testReport/junit/org.apache.spark.network.netty/NettyBlockTransferServiceSuite/can_bind_to_a_specific_port_twice_and_the_second_increments/

This PR makes two changes:

- Use a random port between 17634 and 27634 to reduce the possibility of port conflicts.
- Make `service1` use `service0.port` to bind to avoid the above race condition.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12477 from zsxwing/SPARK-14713.
2016-04-18 14:41:45 -07:00
Luciano Resende 68450c8c6e [SPARK-14504][SQL] Enable Oracle docker tests
## What changes were proposed in this pull request?

Enable Oracle docker tests

## How was this patch tested?

Existing tests

Author: Luciano Resende <lresende@apache.org>

Closes #12270 from lresende/oracle.
2016-04-18 14:35:10 -07:00
Andrew Or f1a11976db [SPARK-14674][SQL] Move HiveContext.hiveconf to HiveSessionState
## What changes were proposed in this pull request?

This is just cleanup. This allows us to remove HiveContext later without inflating the diff too much. This PR fixes the conflicts of https://github.com/apache/spark/pull/12431. It also removes the `def hiveConf` from `HiveSqlParser`. So, we will pass the HiveConf associated with a session explicitly instead of relying on Hive's `SessionState` to pass `HiveConf`.

## How was this patch tested?
Existing tests.

Closes #12431

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12449 from yhuai/hiveconf.
2016-04-18 14:28:47 -07:00
Sameer Agarwal 8bd8121329 [SPARK-14710][SQL] Rename gen/genCode to genCode/doGenCode to better reflect the semantics
## What changes were proposed in this pull request?

Per rxin's suggestions, this patch renames `gen` to `genCode` and `genCode` to `doGenCode` to better reflect the semantics of these two function calls.

## How was this patch tested?

N/A (refactoring only)

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12475 from sameeragarwal/gencode.
2016-04-18 14:03:40 -07:00
hyukjinkwon 6fc1e72d9b [MINOR] Revert removing explicit typing (changed in some examples and StatFunctions)
## What changes were proposed in this pull request?

This PR reverts some changes in https://github.com/apache/spark/pull/12413. (please see the discussion in that PR).

from
```scala
    words.foreachRDD { (rdd, time) =>
    ...
```

to
```scala
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
    ...
```

Also, this was discussed on the dev mailing list, [here](http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-Scala-style-explicit-typing-within-transformation-functions-and-anonymous-val-td17173.html)

## How was this patch tested?

This was tested with `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12452 from HyukjinKwon/revert-explicit-typing.
2016-04-18 13:45:03 -07:00
Xusen Yin 8c62edb70f [SPARK-14299][EXAMPLES] Remove duplications for scala.examples.ml
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14299

Delete duplications in scala/examples/ml.

TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample
CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample

## How was this patch tested?

Existing tests passed.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12366 from yinxusen/SPARK-14299-2.
2016-04-18 13:34:36 -07:00
Xusen Yin f31a62d1b2 [SPARK-14440][PYSPARK] Remove pipeline specific reader and writer
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14440

Remove

* PipelineMLWriter
* PipelineMLReader
* PipelineModelMLWriter
* PipelineModelMLReader

and modify comments.

## How was this patch tested?

Tested with unit tests.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12216 from yinxusen/SPARK-14440.
2016-04-18 13:31:48 -07:00
Andrew Or 28ee15702d [SPARK-14647][SQL] Group SQLContext/HiveContext state into SharedState
## What changes were proposed in this pull request?

This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future.

## How was this patch tested?
Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #12463 from yhuai/sharedState.
2016-04-18 13:15:23 -07:00
Reynold Xin e4ae974294 [HOTFIX] Fix Scala 2.10 compilation break. 2016-04-18 12:57:23 -07:00
Jason Lee 3d66a2ce9b [SPARK-14564][ML][MLLIB][PYSPARK] Python Word2Vec missing setWindowSize method
## What changes were proposed in this pull request?
Added windowSize getter/setter to ML/MLlib
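A hedged PySpark sketch of the new setter/getter (assumes an existing `sqlContext`; the values are illustrative):

```python
from pyspark.ml.feature import Word2Vec

doc = sqlContext.createDataFrame(
    [("a b c".split(" "),), ("a c d e".split(" "),)], ["text"])

w2v = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
w2v.setWindowSize(10)       # context window size, now settable from Python
print(w2v.getWindowSize())  # 10
model = w2v.fit(doc)
```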

## How was this patch tested?
Added test cases in tests.py under both ML and MLlib

Author: Jason Lee <cjlee@us.ibm.com>

Closes #12428 from jasoncl/SPARK-14564.
2016-04-18 12:47:14 -07:00
Dongjoon Hyun d280d1da1a [SPARK-14580][SPARK-14655][SQL] Hive IfCoercion should preserve predicate.
## What changes were proposed in this pull request?

Currently, `HiveTypeCoercion.IfCoercion` removes all predicates whose return type is null. However, some UDFs still need to be evaluated because they are designed to throw exceptions. This PR fixes that to preserve such predicates. Also, `assert_true` is now implemented as a Spark SQL function.

**Before**
```
scala> sql("select if(assert_true(false),2,3)").head
res2: org.apache.spark.sql.Row = [3]
```

**After**
```
scala> sql("select if(assert_true(false),2,3)").head
... ASSERT_TRUE ...
```

**Hive**
```
hive> select if(assert_true(false),2,3);
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: ASSERT_TRUE(): assertion failed.
```

## How was this patch tested?

Pass the Jenkins tests (including a new testcase in `HivePlanTest`)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12340 from dongjoon-hyun/SPARK-14580.
2016-04-18 12:26:56 -07:00
Xusen Yin b64482f49f [SPARK-14306][ML][PYSPARK] PySpark ml.classification OneVsRest support export/import
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14306

Add PySpark OneVsRest save/load support.
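A hedged sketch of the new persistence API (the path is illustrative):

```python
from pyspark.ml.classification import LogisticRegression, OneVsRest

ovr = OneVsRest(classifier=LogisticRegression(maxIter=10))

# Write the meta-estimator out and read it back; a fitted OneVsRestModel
# can be persisted the same way.
ovr.save("/tmp/ovr")
same_ovr = OneVsRest.load("/tmp/ovr")
```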

## How was this patch tested?

Tested with Python unit tests.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12439 from yinxusen/SPARK-14306-0415.
2016-04-18 11:52:29 -07:00
Tathagata Das 775cf17eaa [SPARK-14473][SQL] Define analysis rules to catch operations not supported in streaming
## What changes were proposed in this pull request?

There are many operations that are currently not supported in the streaming execution. For example:
 - joining two streams
 - unioning a stream and a batch source
 - sorting
 - window functions (not time windows)
 - distinct aggregates

Furthermore, executing a query with a stream source as a batch query should also fail.

This patch adds an additional step after analysis in QueryExecution that checks whether all operations in the analyzed logical plan are supported.

## How was this patch tested?
unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12246 from tdas/SPARK-14473.
2016-04-18 11:09:33 -07:00
Dongjoon Hyun 432d1399cb [SPARK-14614] [SQL] Add bround function
## What changes were proposed in this pull request?

This PR aims to add the `bround` function (aka banker's rounding) by extending the current `round` implementation. [Hive supports `bround` since 1.3.0.](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF)

**Hive (1.3 ~ 2.0)**
```
hive> select round(2.5), bround(2.5);
OK
3.0	2.0
```

**After this PR**
```scala
scala> sql("select round(2.5), bround(2.5)").head
res0: org.apache.spark.sql.Row = [3,2]
```

## How was this patch tested?

Pass the Jenkins tests (with extended tests).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12376 from dongjoon-hyun/SPARK-14614.
2016-04-18 10:44:51 -07:00
jerryshao d6fb485de8 [SPARK-14423][YARN] Avoid same name files added to distributed cache again
## What changes were proposed in this pull request?

In the current implementation of assembly-free Spark deployment, jars under `assembly/target/scala-xxx/jars` are uploaded to the distributed cache by default. There is a chance that their names conflict with the names of jars specified in `--jars`, which will cause an exception when starting the application:

```
client token: N/A
	 diagnostics: Application application_1459907402325_0004 failed 2 times due to AM Container for appattempt_1459907402325_0004_000002 exited with  exitCode: -1000
For more detailed output, check application tracking page:http://hw12100.local:8088/proxy/application_1459907402325_0004/Then, click on links to logs of each attempt.
Diagnostics: Resource hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar changed on src filesystem (expected 1459909780508, was 1459909782590
java.io.IOException: Resource hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar changed on src filesystem (expected 1459909780508, was 1459909782590
	at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
	at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
```

So this patch checks file names to avoid uploading files with the same name again.

## How was this patch tested?

Unit tests and a manual integration test were done locally.

Author: jerryshao <sshao@hortonworks.com>

Closes #12203 from jerryshao/SPARK-14423.
2016-04-18 10:13:38 -07:00
Reynold Xin 1a3966472c [SPARK-14696][SQL] Add implicit encoders for boxed primitive types
## What changes were proposed in this pull request?
We currently only have implicit encoders for Scala primitive types. We should also add implicit encoders for boxed primitives; otherwise, the following code would not have an encoder:

```scala
sqlContext.range(1000).map { i => i }
```

## How was this patch tested?
Added a unit test case for this.

Author: Reynold Xin <rxin@databricks.com>

Closes #12466 from rxin/SPARK-14696.
2016-04-18 17:03:15 +08:00
Wenchen Fan 2f1d0320c9 [SPARK-13363][SQL] support Aggregator in RelationalGroupedDataset
## What changes were proposed in this pull request?

set the input encoder for `TypedColumn` in `RelationalGroupedDataset.agg`.

## How was this patch tested?

new tests in `DatasetAggregatorSuite`

close https://github.com/apache/spark/pull/11269

This PR brings https://github.com/apache/spark/pull/12359 up to date and fixes the compilation.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12451 from cloud-fan/agg.
2016-04-18 14:27:26 +08:00
Andrew Or 7de06a646d Revert "[SPARK-14647][SQL] Group SQLContext/HiveContext state into SharedState"
This reverts commit 5cefecc95a.
2016-04-17 17:35:41 -07:00
Subhobrata Dey 699a4dfd89 [SPARK-14632] randomSplit method fails on dataframes with maps in schema
## What changes were proposed in this pull request?

The patch fixes an issue where the randomSplit method cannot split DataFrames that have maps in their schema. The bug was introduced in Spark 1.6.1.
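A hedged PySpark snippet of the failing scenario (assumes an existing `sqlContext`; data and ratios are illustrative):

```python
# A DataFrame whose schema contains a MapType column. Before the fix,
# randomSplit failed on such schemas: the splits rely on an internal sort
# for determinism, and map columns are not orderable.
df = sqlContext.createDataFrame([({"a": 1},), ({"b": 2, "c": 3},)], ["m"])
train, test = df.randomSplit([0.8, 0.2], seed=42)
```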

## How was this patch tested?

Tested with unit tests.

Author: Subhobrata Dey <sbcd90@gmail.com>

Closes #12438 from sbcd90/randomSplitIssue.
2016-04-17 15:18:32 -07:00
Reynold Xin 8a87f7d5c8 Mark ExternalClusterManager as private[spark]. 2016-04-16 23:49:26 -07:00
Hemant Bhanawat af1f4da762 [SPARK-13904][SCHEDULER] Add support for pluggable cluster manager
## What changes were proposed in this pull request?

This commit adds support for a pluggable cluster manager. It also allows a cluster manager to clean up tasks without taking the parent process down.

To plug in a new external cluster manager, the ExternalClusterManager trait should be implemented. It returns the task scheduler and scheduler backend that SparkContext will use to schedule tasks. An external cluster manager is registered using the java.util.ServiceLoader mechanism (the same mechanism used to register data sources like parquet, json, jdbc, etc.), which allows implementations of the ExternalClusterManager interface to be loaded automatically.

Currently, when a driver fails, executors exit using System.exit. This does not bode well for cluster managers that would like to reuse the parent process of an executor. Hence:

  1. System.exit is moved to a function that can be overridden in subclasses of CoarseGrainedExecutorBackend.
  2. Functionality to kill all the running tasks in an executor is added.

## How was this patch tested?
ExternalClusterManagerSuite.scala was added to test this patch.

Author: Hemant Bhanawat <hemant@snappydata.io>

Closes #11723 from hbhanawat/pluggableScheduler.
2016-04-16 23:43:32 -07:00
Andrew Or 3394b12c37 [SPARK-14672][SQL] Move HiveContext analyze logic to AnalyzeTable
## What changes were proposed in this pull request?

Move the implementation of `hiveContext.analyze` to the command of `AnalyzeTable`.

## How was this patch tested?
Existing tests.

Closes #12429

Author: Yin Huai <yhuai@databricks.com>
Author: Andrew Or <andrew@databricks.com>

Closes #12448 from yhuai/analyzeTable.
2016-04-16 15:35:51 -07:00
Andrew Or 5cefecc95a [SPARK-14647][SQL] Group SQLContext/HiveContext state into SharedState
## What changes were proposed in this pull request?

This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future.

## How was this patch tested?
Existing tests.

Closes #12405

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12447 from yhuai/sharedState.
2016-04-16 14:00:53 -07:00
杨博 (Yang Bo) 3f49afee93 [SPARK-14683][DOCUMENTATION] Configure external links in ScalaDoc
Right now Spark's Scaladoc does not link to the Scala standard library and other dependencies. This can trip up Spark newcomers, who may not be experienced Scala programmers.

This patch fixes these links in ScalaDoc.

Author: 杨博 (Yang Bo) <pop.atry@gmail.com>

Closes #12444 from Atry/patch-1.
2016-04-16 11:44:12 -07:00
Reynold Xin 7319fcc1cd [SPARK-14677][SQL] follow up: make max iter num config internal
## What changes were proposed in this pull request?
This is a follow-up to make the max iteration number an internal config.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #12441 from rxin/maxIterConfInternal.
2016-04-16 11:39:47 -07:00
Joseph K. Bradley 36da5e3234 [SPARK-14605][ML][PYTHON] Changed Python to use unicode UIDs for spark.ml Identifiable
## What changes were proposed in this pull request?

Python spark.ml Identifiable classes use UIDs of type str, but they should use unicode (in Python 2.x) to match Java. This could be a problem if someone created a class in Java with odd unicode characters, saved it, and loaded it in Python.

This PR: Use unicode everywhere in Python.
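A quick hedged check of the change (Python 2 semantics; the choice of `Binarizer` is arbitrary):

```python
from pyspark.ml.feature import Binarizer

b = Binarizer()
# After this change, uids are unicode on Python 2 (str on Python 3),
# matching what comes back from Java when a model is saved and reloaded.
print(type(b.uid))
```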

## How was this patch tested?

Updated persistence unit test to check uid type

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12368 from jkbradley/python-uid-unicode.
2016-04-16 11:23:28 -07:00
hyukjinkwon 9f678e9754 [MINOR] Remove inappropriate type notation and extra anonymous closure within functional transformations
## What changes were proposed in this pull request?

This PR removes

- Inappropriate type notations
    For example, from
    ```scala
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
    ...
    ```
    to
    ```scala
    words.foreachRDD { (rdd, time) =>
    ...
    ```

- Extra anonymous closure within functional transformations.
    For example,
    ```scala
    .map(item => {
      ...
    })
    ```

    which can be just simply as below:

    ```scala
    .map { item =>
      ...
    }
    ```

and corrects some obvious style nits.

## How was this patch tested?

This was tested after adding rules to `scalastyle-config.xml`, although the rules ended up not catching every case perfectly.

The rules applied were below:

- For the extra anonymous closures (customId `NoExtraClosure`):

```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
</check>
```

```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
</check>
```

- For the type notations (customId `TypeNotation`):
```xml
<check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
</check>
```

**Those rules were not added**

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12413 from HyukjinKwon/SPARK-style.
2016-04-16 14:56:23 +01:00
Reynold Xin 527c780bb0 Revert "[SPARK-13363][SQL] support Aggregator in RelationalGroupedDataset"
This reverts commit 12854464c4.
2016-04-16 01:05:26 -07:00
Wenchen Fan 12854464c4 [SPARK-13363][SQL] support Aggregator in RelationalGroupedDataset
## What changes were proposed in this pull request?

set the input encoder for `TypedColumn` in `RelationalGroupedDataset.agg`.

## How was this patch tested?

new tests in `DatasetAggregatorSuite`

close https://github.com/apache/spark/pull/11269

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12359 from cloud-fan/agg.
2016-04-16 00:31:51 -07:00
Reynold Xin f4be0946af [SPARK-14677][SQL] Make the max number of iterations configurable for Catalyst
## What changes were proposed in this pull request?
We currently hard code the max number of optimizer/analyzer iterations to 100. This patch makes it configurable. While I'm at it, I also added the SessionCatalog to the optimizer, so we can use information there in optimization.

## How was this patch tested?
Updated unit tests to reflect the change.

Author: Reynold Xin <rxin@databricks.com>

Closes #12434 from rxin/SPARK-14677.
2016-04-15 20:28:09 -07:00
Yin Huai b2dfa84959 [SPARK-14668][SQL] Move CurrentDatabase to Catalyst
## What changes were proposed in this pull request?

This PR moves `CurrentDatabase` from sql/hive package to sql/catalyst. It also adds the function description, which looks like the following.

```
scala> sqlContext.sql("describe function extended current_database").collect.foreach(println)
[Function: current_database]
[Class: org.apache.spark.sql.execution.command.CurrentDatabase]
[Usage: current_database() - Returns the current database.]
[Extended Usage:
> SELECT current_database()]
```

## How was this patch tested?
Existing tests

Author: Yin Huai <yhuai@databricks.com>

Closes #12424 from yhuai/SPARK-14668.
2016-04-15 17:48:41 -07:00
Sameer Agarwal 4df65184b6 [SPARK-14620][SQL] Use/benchmark a better hash in VectorizedHashMap
## What changes were proposed in this pull request?

This PR uses a better hashing algorithm while probing the AggregateHashMap:

```java
long h = 0;
h = (h ^ (0x9e3779b9)) + key_1 + (h << 6) + (h >>> 2);
h = (h ^ (0x9e3779b9)) + key_2 + (h << 6) + (h >>> 2);
h = (h ^ (0x9e3779b9)) + key_3 + (h << 6) + (h >>> 2);
...
h = (h ^ (0x9e3779b9)) + key_n + (h << 6) + (h >>> 2);
return h;
```

Depends on: https://github.com/apache/spark/pull/12345
## How was this patch tested?

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
    Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
    Aggregate w keys:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    codegen = F                              2417 / 2457          8.7         115.2       1.0X
    codegen = T hashmap = F                  1554 / 1581         13.5          74.1       1.6X
    codegen = T hashmap = T                   877 /  929         23.9          41.8       2.8X

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12379 from sameeragarwal/hash.
2016-04-15 15:55:31 -07:00
Reynold Xin 8028a28885 [SPARK-14628][CORE] Simplify task metrics by always tracking read/write metrics
## What changes were proposed in this pull request?

Part of the reason why TaskMetrics and its callers are complicated is the optional metrics we collect, including input, output, shuffle read, and shuffle write. I think we can always track them and just assign 0 as the initial values. It is usually very obvious whether a task is supposed to read any data or not. By always tracking them, we can remove a lot of map, foreach, flatMap, and getOrElse(0L) calls throughout Spark.

This patch also changes a few behaviors.

1. Removed the distinction of data read/write methods (e.g. Hadoop, Memory, Network, etc).
2. Accumulate all data reads and writes, rather than only the first method. (Fixes SPARK-5225)

## How was this patch tested?

existing tests.

This is based on https://github.com/apache/spark/pull/12388, with more test fixes.

Author: Reynold Xin <rxin@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>

Closes #12417 from cloud-fan/metrics-refactor.
2016-04-15 15:39:39 -07:00
Xusen Yin 90b46e014a [SPARK-7861][ML] PySpark OneVsRest
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-7861

Add PySpark OneVsRest. I implemented it in Python since it's a meta-pipeline.
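A hedged sketch of the new API (assumes an existing `sqlContext`; the data is illustrative):

```python
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.mllib.linalg import Vectors  # pyspark.ml.linalg in later releases

df = sqlContext.createDataFrame([
    (0.0, Vectors.dense(1.0, 0.8)),
    (1.0, Vectors.dense(0.2, 0.3)),
    (2.0, Vectors.dense(0.5, 0.5))], ["label", "features"])

# One binary LogisticRegression model is trained per class ("one vs rest").
ovr = OneVsRest(classifier=LogisticRegression(maxIter=10, regParam=0.01))
model = ovr.fit(df)
model.transform(df).select("label", "prediction").show()
```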

## How was this patch tested?

Tested with doctests.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12124 from yinxusen/SPARK-14306-7861.
2016-04-15 12:58:38 -07:00
sethah 129f2f455d [SPARK-14104][PYSPARK][ML] All Python param setters should use the _set method
## What changes were proposed in this pull request?

Param setters in Python previously accessed `_paramMap` directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` except the one in the `_set` method itself, to ensure type checking happens.
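A hedged illustration of the pattern, using the StringIndexer param mentioned in the list below:

```python
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")

# After this change a setter delegates to _set, roughly:
#     def setHandleInvalid(self, value):
#         self._set(handleInvalid=value)
#         return self
# so the param's type converter runs on assignment instead of the setter
# writing into _paramMap directly.
indexer.setHandleInvalid("skip")
print(indexer.getHandleInvalid())  # skip
```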

Additional changes:
* [SPARK-13068](https://github.com/apache/spark/pull/11663) missed adding type converters in evaluation.py so those are done here
* An incorrect `toBoolean` type converter was used for the StringIndexer `handleInvalid` param in a previous PR. This is fixed here.

## How was this patch tested?

Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11939 from sethah/SPARK-14104.
2016-04-15 12:14:41 -07:00
Joseph K. Bradley d6ae7d4637 [SPARK-14665][ML][PYTHON] Fixed bug with StopWordsRemover default stopwords
## What changes were proposed in this pull request?

The default stopwords were a Java object.  They are no longer.
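A hedged snippet showing where the old behavior surfaced:

```python
from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="raw", outputCol="filtered")

# Before the fix, the default value of stopWords leaked through as a py4j
# JavaObject; it should now be a plain Python list of strings.
stopwords = remover.getStopWords()
print(type(stopwords), len(stopwords))
```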

## How was this patch tested?

Unit test which failed before the fix

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12422 from jkbradley/pyspark-stopwords.
2016-04-15 11:50:21 -07:00
Yanbo Liang 83af297ac4 [SPARK-13925][ML][SPARKR] Expose R-like summary statistics in SparkR::glm for more family and link functions
## What changes were proposed in this pull request?
Expose R-like summary statistics in SparkR::glm for more family and link functions.
Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed; we only provide the most commonly used statistics in this PR. More statistics can be added in follow-up work.

## How was this patch tested?
Unit tests.

SparkR Output:
```
Deviance Residuals:
(Note: These are approximate quantiles with relative error <= 0.01)
     Min        1Q    Median        3Q       Max
-0.95096  -0.16585  -0.00232   0.17410   0.72918

Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)         1.6765    0.23536     7.1231   4.4561e-11
Sepal_Length        0.34988   0.046301    7.5566   4.1873e-12
Species_versicolor  -0.98339  0.072075    -13.644  0
Species_virginica   -1.0075   0.093306    -10.798  0

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.22

Number of Fisher Scoring iterations: 1
```
R output:
```
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.95096  -0.16522   0.00171   0.18416   0.72918

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.67650    0.23536   7.123 4.46e-11 ***
Sepal.Length       0.34988    0.04630   7.557 4.19e-12 ***
Speciesversicolor -0.98339    0.07207 -13.644  < 2e-16 ***
Speciesvirginica  -1.00751    0.09331 -10.798  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.217

Number of Fisher Scoring iterations: 2
```

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12393 from yanboliang/spark-13925.
2016-04-15 08:23:51 -07:00
Peter Ableda 06b9d623e8 [SPARK-14633] Use more readable format to show memory bytes in Error Message
## What changes were proposed in this pull request?

Round the memory bytes value and convert it back to Long, its original type. This change fixes the formatting issue in the exception message.

## How was this patch tested?

Manual tests were done in CDH cluster.

Author: Peter Ableda <peter.ableda@cloudera.com>

Closes #12392 from peterableda/SPARK-14633.
2016-04-15 13:18:48 +01:00
Pravin Gadakh e24923267f [SPARK-14370][MLLIB] removed duplicate generation of ids in OnlineLDAOptimizer
## What changes were proposed in this pull request?

Removed duplicated generation of `ids` in OnlineLDAOptimizer.

## How was this patch tested?

tested with existing unit tests.

Author: Pravin Gadakh <prgadakh@in.ibm.com>

Closes #12176 from pravingadakh/SPARK-14370.
2016-04-15 13:08:30 +01:00
DB Tsai 96534aa47c [SPARK-14549][ML] Copy the Vector and Matrix classes from mllib to ml in mllib-local
## What changes were proposed in this pull request?

This task copies the Vector and Matrix classes from the mllib package to the ml package in the mllib-local jar. The UDTs and `Since` annotations in the ml vector and matrix are removed for now. UDTs will be handled by SPARK-14487, and `Since` will be replaced by `/* since 1.2.0 */` Java doc comments.

The BLAS implementation is copied, and some of the test utilities are copied as well.

Summary of changes:

1. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/BLAS.scala
  - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/BLAS.scala
  - logDebug("gemm: alpha is equal to 0 and beta is equal to 1. Returning C.") is removed in ml version.
2. In  mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Matrices.scala
  - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Matrices.scala
  - `Since` was removed; we'll use a standard `/* Since */` Java doc comment. That will be in another PR.
  - `UDT` related code was removed, and will use `SPARK-13944` https://github.com/apache/spark/pull/12259  to replace the annotation.
3. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Vectors.scala
  - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
  - `Since` was removed.
  - `UDT` related code was removed.
  - In `def parseNumeric`, it was throwing `throw new SparkException(s"Cannot parse $other.")`, and now it's throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
4. In mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
  - For consistency with ML version of vector, `def parseNumeric` is now throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
5. mllib/src/main/scala/org/apache/spark/**mllib**/util/NumericParser.scala is moved to mllib-local/src/main/scala/org/apache/spark/**ml**/util/NumericParser.scala
  - All the `throw new SparkException` were replaced by `throw new IllegalArgumentException`

## How was this patch tested?

unit tests

Author: DB Tsai <dbt@netflix.com>

Closes #12317 from dbtsai/dbtsai-ml-vector.
2016-04-15 01:17:03 -07:00
Reynold Xin a9324a06ef Closes #12407
Closes #12408
Closes #12401
2016-04-14 22:04:26 -07:00
Yanbo Liang b9613239d3 [SPARK-14374][ML][PYSPARK] PySpark ml GBTClassifier, Regressor support export/import
## What changes were proposed in this pull request?
PySpark ml GBTClassifier, Regressor support export/import.
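A hedged sketch of the added persistence support (the path and params are illustrative):

```python
from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(maxIter=5, maxDepth=2, seed=42)

# The estimator (and, once fitted, the GBTClassificationModel) can now be
# written out and read back.
gbt.save("/tmp/gbt")
same_gbt = GBTClassifier.load("/tmp/gbt")
```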

## How was this patch tested?
Doc test.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12383 from yanboliang/spark-14374.
2016-04-14 21:36:03 -07:00
Wenchen Fan 297ba3f1b4 [SPARK-14275][SQL] Reimplement TypedAggregateExpression to DeclarativeAggregate
## What changes were proposed in this pull request?

`ExpressionEncoder` is just a container for serialization and deserialization expressions, we can use these expressions to build `TypedAggregateExpression` directly, so that it can fit in `DeclarativeAggregate`, which is more efficient.

One trick: each buffer serializer expression references the result object of the serialization function call. To avoid re-calculating this result object, we serialize the buffer object to a single struct field, so that a special `Expression` can evaluate the result object only once.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12067 from cloud-fan/typed_udaf.
2016-04-15 12:10:00 +08:00
Sameer Agarwal b5c60bcdca [SPARK-14447][SQL] Speed up TungstenAggregate w/ keys using VectorizedHashMap
## What changes were proposed in this pull request?

This patch speeds up group-by aggregates by around 3-5x by leveraging an in-memory `AggregateHashMap` (please see https://github.com/apache/spark/pull/12161), an append-only aggregate hash map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates (and fall back to the `BytesToBytesMap` if a given key isn't found).

Architecturally, it is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the key-value pairs. The index lookups in the array rely on linear probing (with a small number of maximum tries) and use an inexpensive hash function which makes it really efficient for a majority of lookups. However, using linear probing and an inexpensive hash function also makes it less robust as compared to the `BytesToBytesMap` (especially for a large number of keys or even for certain distribution of keys) and requires us to fall back on the latter for correctness.

## How was this patch tested?

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
    Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
    Aggregate w keys:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    codegen = F                              2124 / 2204          9.9         101.3       1.0X
    codegen = T hashmap = F                  1198 / 1364         17.5          57.1       1.8X
    codegen = T hashmap = T                   369 /  600         56.8          17.6       5.8X

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12345 from sameeragarwal/tungsten-aggregate-integration.
2016-04-14 20:57:03 -07:00
Mark Grover ff9ae61a3b [SPARK-14601][DOC] Minor doc/usage changes related to removal of Spark assembly
## What changes were proposed in this pull request?

Removing references to assembly jar in documentation.
Adding an additional (previously undocumented) usage of spark-submit to run examples.

## How was this patch tested?

Ran spark-submit usage to ensure formatting was fine. Ran examples using SparkSubmit.

Author: Mark Grover <mark@apache.org>

Closes #12365 from markgrover/spark-14601.
2016-04-14 18:51:43 -07:00
Fokko Driesprong c80586d9e8 [SPARK-12869] Implemented an improved version of the toIndexedRowMatrix
Hi guys,

I've implemented an improved version of the `toIndexedRowMatrix` function on the `BlockMatrix`. I needed this for a project, but would like to share it with the rest of the community. In the case of dense matrices, it can increase performance up to 19 times:
https://github.com/Fokko/BlockMatrixToIndexedRowMatrix

If there are any questions or suggestions, please let me know. Keep up the good work! Cheers.

Author: Fokko Driesprong <f.driesprong@catawiki.nl>
Author: Fokko Driesprong <fokko@driesprongen.nl>

Closes #10839 from Fokko/master.
2016-04-14 17:32:20 -07:00
Yong Tang 01dd1f5c07 [SPARK-14565][ML] RandomForest should use parseInt and parseDouble for feature subset size instead of regexes
## What changes were proposed in this pull request?

This fix changes RandomForest's parsing of the supported feature subset size strategies from regexes to parseInt and parseDouble, for robustness and maintainability.

## How was this patch tested?

Existing tests passed.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #12360 from yongtang/SPARK-14565.
2016-04-14 17:23:16 -07:00
Dongjoon Hyun d7e124edfe [SPARK-14545][SQL] Improve LikeSimplification by adding a%b rule
## What changes were proposed in this pull request?

Currently, `LikeSimplification` handles the following four rules.
- 'a%' => expr.StartsWith("a")
- '%b' => expr.EndsWith("b")
- '%a%' => expr.Contains("a")
- 'a' => EqualTo("a")

This PR adds the following rule.
- 'a%b' => expr.Length() >= 2 && expr.StartsWith("a") && expr.EndsWith("b")

Here, 2 is statically calculated from "a".size + "b".size.

**Before**
```
scala> sql("select a from (select explode(array('abc','adc')) a) T where a like 'a%c'").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Filter a#5 LIKE a%c
:     +- INPUT
+- Generate explode([abc,adc]), false, false, [a#5]
   +- Scan OneRowRelation[]
```

**After**
```
scala> sql("select a from (select explode(array('abc','adc')) a) T where a like 'a%c'").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Filter ((length(a#5) >= 2) && (StartsWith(a#5, a) && EndsWith(a#5, c)))
:     +- INPUT
+- Generate explode([abc,adc]), false, false, [a#5]
   +- Scan OneRowRelation[]
```

## How was this patch tested?

Pass the Jenkins tests (including new testcase).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12312 from dongjoon-hyun/SPARK-14545.
2016-04-14 13:34:29 -07:00