Commit graph

15869 commits

Author SHA1 Message Date
Yanbo Liang 08f84d7a9a [MINOR][ML][PYSPARK] Fix omissive param setters which should use _set method
## What changes were proposed in this pull request?
#11939 make Python param setters use the `_set` method. This PR fix omissive ones.

## How was this patch tested?
Existing tests.

cc jkbradley sethah

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12531 from yanboliang/setters-omissive.
2016-04-20 20:06:27 +02:00
jerryshao 90cbc82fd4 [SPARK-14725][CORE] Remove HttpServer class
## What changes were proposed in this pull request?

This proposal removes the class `HttpServer`, with the changing of internal file/jar/class transmission to RPC layer, currently there's no code using this `HttpServer`, so here propose to remove it.

## How was this patch tested?

Unit test is verified locally.

Author: jerryshao <sshao@hortonworks.com>

Closes #12526 from jerryshao/SPARK-14725.
2016-04-20 10:48:11 -07:00
Sean Owen b4e76a9a3b [SPARK-14742][DOCS] Redirect spark-ec2 doc to new location
## What changes were proposed in this pull request?

Restore `ec2-scripts.md` as a redirect to amplab/spark-ec2 docs

## How was this patch tested?

`jekyll build` and checked with the browser

Author: Sean Owen <sowen@cloudera.com>

Closes #12534 from srowen/SPARK-14742.
2016-04-20 10:46:02 -07:00
Burak Yavuz 80bf48f437 [SPARK-14555] First cut of Python API for Structured Streaming
## What changes were proposed in this pull request?

This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
 - ContinuousQuery
 - Trigger
 - ProcessingTime
in pyspark under `pyspark.sql.streaming`.

In addition, it contains the new methods added under:
 -  `DataFrameWriter`
     a) `startStream`
     b) `trigger`
     c) `queryName`

 -  `DataFrameReader`
     a) `stream`

 - `DataFrame`
    a) `isStreaming`

This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
 - `exception`
 - `sourceStatuses`
 - `sinkStatus`

They may be added in a follow up.

This PR also contains some very minor doc fixes in the Scala side.

## How was this patch tested?

Python doc tests

TODO:
 - [ ] verify Python docs look good

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

Closes #12320 from brkyvz/stream-python.
2016-04-20 10:32:01 -07:00
Alex Bozarth 834277884f [SPARK-8171][WEB UI] Javascript based infinite scrolling for the log page
Updated the log page by replacing the current pagination with a javascript-based infinite scroll solution

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #10910 from ajbozarth/spark8171.
2016-04-20 21:24:11 +09:00
Yuhao Yang ed9d803854 [SPARK-14635][ML] Documentation and Examples for TF-IDF only refer to HashingTF
## What changes were proposed in this pull request?

Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.

## How was this patch tested?

unit tests and doc generation

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #12454 from hhbyyh/tfdoc.
2016-04-20 11:45:08 +01:00
Liwei Lin 17db4bfeaa [SPARK-14687][CORE][SQL][MLLIB] Call path.getFileSystem(conf) instead of call FileSystem.get(conf)
## What changes were proposed in this pull request?

- replaced `FileSystem.get(conf)` calls with `path.getFileSystem(conf)`

## How was this patch tested?

N/A

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12450 from lw-lin/fix-fs-get.
2016-04-20 11:28:51 +01:00
Ryan Blue a3451119d9 [SPARK-14679][UI] Fix UI DAG visualization OOM.
## What changes were proposed in this pull request?

The DAG visualization can cause an OOM when generating the DOT file.
This happens because clusters are not correctly deduped by a contains
check because they use the default equals implementation. This adds a
working equals implementation.

## How was this patch tested?

This adds a test suite that checks the new equals implementation.

Author: Ryan Blue <blue@apache.org>

Closes #12437 from rdblue/SPARK-14679-fix-ui-oom.
2016-04-20 11:26:42 +01:00
Wenchen Fan 7abe9a6578 [SPARK-9013][SQL] generate MutableProjection directly instead of return a function
`MutableProjection` is not thread-safe and we won't use it in multiple threads. I think the reason that we return `() => MutableProjection` is not about thread safety, but to save the costs of generating code when we need same but individual mutable projections.

However, I only found one place that use this [feature](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala#L122-L123), and comparing to the troubles it brings, I think we should generate `MutableProjection` directly instead of return a function.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #7373 from cloud-fan/project.
2016-04-20 00:44:02 -07:00
Dongjoon Hyun 14869ae64e [SPARK-14639] [PYTHON] [R] Add bround function in Python/R.
## What changes were proposed in this pull request?

This issue aims to expose Scala `bround` function in Python/R API.
`bround` function is implemented in SPARK-14614 by extending current `round` function.
We used the following semantics from Hive.
```java
public static double bround(double input, int scale) {
    if (Double.isNaN(input) || Double.isInfinite(input)) {
      return input;
    }
    return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
}
```

After this PR, `pyspark` and `sparkR` also support `bround` function.

**PySpark**
```python
>>> from pyspark.sql.functions import bround
>>> sqlContext.createDataFrame([(2.5,)], ['a']).select(bround('a', 0).alias('r')).collect()
[Row(r=2.0)]
```

**SparkR**
```r
> df = createDataFrame(sqlContext, data.frame(x = c(2.5, 3.5)))
> head(collect(select(df, bround(df$x, 0))))
  bround(x, 0)
1            2
2            4
```

## How was this patch tested?

Pass the Jenkins tests (including new testcases).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12509 from dongjoon-hyun/SPARK-14639.
2016-04-19 22:28:11 -07:00
Dongjoon Hyun 6f1ec1f267 [MINOR] [SQL] Re-enable explode() and json_tuple() testcases in ExpressionToSQLSuite
## What changes were proposed in this pull request?

Since [SPARK-12719: SQL Generation supports for generators](https://issues.apache.org/jira/browse/SPARK-12719) was resolved, this PR enables the related testcases: `explode()` and `json_tuple()`.

## How was this patch tested?

Pass the Jenkins tests (with re-enabled test cases).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12329 from dongjoon-hyun/minor_enable_testcases.
2016-04-19 21:55:29 -07:00
Wenchen Fan 856bc465d5 [SPARK-14600] [SQL] Push predicates through Expand
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14600

This PR makes `Expand.output` have different attributes from the grouping attributes produced by the underlying `Project`, as they have different meaning, so that we can safely push down filter through `Expand`

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12496 from cloud-fan/expand.
2016-04-19 21:53:19 -07:00
Wenchen Fan 85d759ca3a [SPARK-14704][CORE] create accumulators in TaskMetrics
## What changes were proposed in this pull request?

Before this PR, we create accumulators at driver side(and register them) and send them to executor side, then we create `TaskMetrics` with these accumulators at executor side.
After this PR, we will create `TaskMetrics` at driver side and send it to executor side, so that we can create accumulators inside `TaskMetrics` directly, which is cleaner.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12472 from cloud-fan/acc.
2016-04-19 21:20:24 -07:00
Luciano Resende 78b38109ed [SPARK-13419] [SQL] Update SubquerySuite to use checkAnswer for validation
## What changes were proposed in this pull request?

Change SubquerySuite to validate test results utilizing checkAnswer helper method

## How was this patch tested?

Existing tests

Author: Luciano Resende <lresende@apache.org>

Closes #12269 from lresende/SPARK-13419.
2016-04-19 21:02:10 -07:00
Sun Rui 8eedf0b553 [SPARK-13905][SPARKR] Change signature of as.data.frame() to be consistent with the R base package.
## What changes were proposed in this pull request?

Change the signature of as.data.frame() to be consistent with that in the R base package to meet R user's convention.

## How was this patch tested?
dev/lint-r
SparkR unit tests

Author: Sun Rui <rui.sun@intel.com>

Closes #11811 from sun-rui/SPARK-13905.
2016-04-19 19:57:03 -07:00
Lianhui Wang 4514aebd1e [SPARK-14705][YARN] support Multiple FileSystem for YARN STAGING DIR
## What changes were proposed in this pull request?
In SPARK-13063, It makes the SPARK YARN STAGING DIR as configurable. But it only support default FileSystem. If there are many clusters, It can be different FileSystem for different cluster in our spark.

## How was this patch tested?
I have tested it successfully with following commands:
MASTER=yarn-client ./bin/spark-shell --conf spark.yarn.stagingDir=hdfs:namenode2/temp
$SPARK_HOME/bin/spark-submit --conf spark.yarn.stagingDir=hdfs:namenode2/temp

cc tgravescs vanzin andrewor14

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #12473 from lianhuiwang/SPARK-14705.
2016-04-19 19:48:03 -07:00
Joan 3ae25f244b [SPARK-13929] Use Scala reflection for UDTs
## What changes were proposed in this pull request?

Enable ScalaReflection and User Defined Types for plain Scala classes.

This involves the move of `schemaFor` from `ScalaReflection` trait (which is Runtime and Compile time (macros) reflection) to the `ScalaReflection` object (runtime reflection only) as I believe this code wouldn't work at compile time anyway as it manipulates `Class`'s that are not compiled yet.

## How was this patch tested?

Unit test

Author: Joan <joan@goyeau.com>

Closes #12149 from joan38/SPARK-13929-Scala-reflection.
2016-04-19 17:36:31 -07:00
Cheng Lian 10f273d8db [SPARK-14407][SQL] Hides HadoopFsRelation related data source API into execution/datasources package #12178
## What changes were proposed in this pull request?

This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package.

Note that to avoid conflicts, this PR is based on #12153. Effective changes for this PR only consist of the last three commits. Will rebase after merging #12153.

## How was this patch tested?

Existing tests.

Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #12361 from liancheng/spark-14407-hide-hadoop-fs-relation.
2016-04-19 17:32:23 -07:00
felixcheung 3664142350 [SPARK-14717] [PYTHON] Scala, Python APIs for Dataset.unpersist differ in default blocking value
## What changes were proposed in this pull request?

Change unpersist blocking parameter default value to match Scala

## How was this patch tested?

unit tests, manual tests

jkbradley davies

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #12507 from felixcheung/pyunpersist.
2016-04-19 17:29:28 -07:00
Josh Rosen a685e65a4c Revert "[SPARK-14719] WriteAheadLogBasedBlockHandler should ignore BlockManager put errors"
This reverts commit ed2de0299a.
2016-04-19 16:15:06 -07:00
felixcheung ecd877e833 [SPARK-12224][SPARKR] R support for JDBC source
Add R API for `read.jdbc`, `write.jdbc`.

Tested this quite a bit manually with different combinations of parameters. It's not clear if we could have automated tests in R for this - Scala `JDBCSuite` depends on Java H2 in-memory database.

Refactored some code into util so they could be tested.

Core's R SerDe code needs to be updated to allow access to java.util.Properties as `jobj` handle which is required by DataFrameReader/Writer's `jdbc` method. It would be possible, though more code to add a `sql/r/SQLUtils` helper function.

Tested:
```
# with postgresql
../bin/sparkR --driver-class-path /usr/share/java/postgresql-9.4.1207.jre7.jar

# read.jdbc
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = 12345)

# partitionColumn and numPartitions test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, numPartitions = 4, user = "user", password = 12345)
a <- SparkR:::toRDD(df)
SparkR:::getNumPartitions(a)
[1] 4
SparkR:::collectPartition(a, 2L)

# defaultParallelism test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, user = "user", password = 12345)
SparkR:::getNumPartitions(a)
[1] 2

# predicates test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", predicates = list("did<=105"), user = "user", password = 12345)
count(df) == 1

# write.jdbc, default save mode "error"
irisDf <- as.DataFrame(sqlContext, iris)
write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
"error, already exists"

write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "iris", user = "user", password = "12345")
```

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10480 from felixcheung/rreadjdbc.
2016-04-19 15:59:47 -07:00
Eric Liang 008a8bbef0 [SPARK-14733] Allow custom timing control in microbenchmarks
## What changes were proposed in this pull request?

The current benchmark framework runs a code block for several iterations and reports statistics. However there is no way to exclude per-iteration setup time from the overall results. This PR adds a timer control object passed into the closure that can be used for this purpose.

## How was this patch tested?

Existing benchmark code. Also see https://github.com/apache/spark/pull/12490

Author: Eric Liang <ekl@databricks.com>

Closes #12502 from ericl/spark-14733.
2016-04-19 15:55:21 -07:00
Herman van Hovell da8859226e [SPARK-4226] [SQL] Support IN/EXISTS Subqueries
### What changes were proposed in this pull request?
This PR adds support for in/exists predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms:

- `[NOT] EXISTS(subquery)`
- `[NOT] IN (subquery)`

This PR is (loosely) based on the work of davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/9055). They should be credited for the work they did.

### How was this patch tested?
Modified parsing unit tests.
Added tests to `org.apache.spark.sql.SQLQuerySuite`

cc rxin, davies & chenghao-intel

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12306 from hvanhovell/SPARK-4226.
2016-04-19 15:16:02 -07:00
Nezih Yigitbasi 3c91afec20 [SPARK-14042][CORE] Add custom coalescer support
## What changes were proposed in this pull request?

This PR adds support for specifying an optional custom coalescer to the `coalesce()` method. Currently I have only added this feature to the `RDD` interface, and once we sort out the details we can proceed with adding this feature to the other APIs (`Dataset` etc.)

## How was this patch tested?

Added a unit test for this functionality.

/cc rxin (per our discussion on the mailing list)

Author: Nezih Yigitbasi <nyigitbasi@netflix.com>

Closes #11865 from nezihyigitbasi/custom_coalesce_policy.
2016-04-19 14:35:26 -07:00
Kazuaki Ishizaki 0b8369d854 [SPARK-14656][CORE] Fix Benchmark.getPorcessorName() always return "Unknown processor" on Linux
## What changes were proposed in this pull request?
This PR returns correct processor name in ```/proc/cpuinfo``` on Linux from  ```Benchmark.getPorcessorName()```. Now, this return ```Unknown processor```.
Since ```Utils.executeAndGetOutput(Seq("which", "grep"))``` return ```/bin/grep\n```, it is failed to execute ```/bin/grep\n```. This PR strips ```\n``` at the end of the line of a result of ```Utils.executeAndGetOutput()```

Before applying this PR
````
Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Linux 2.6.32-504.el6.x86_64
Unknown processor
back-to-back filter:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
Dataset                                   472 /  503         21.2          47.2       1.0X
DataFrame                                  51 /   58        198.0           5.1       9.3X
RDD                                       189 /  211         52.8          18.9       2.5X
````

After applying this PR
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Linux 2.6.32-504.el6.x86_64
Intel(R) Xeon(R) CPU E5-2667 v2  3.30GHz
back-to-back filter:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
Dataset                                   490 /  502         20.4          49.0       1.0X
DataFrame                                  55 /   61        183.4           5.5       9.0X
RDD                                       210 /  237         47.7          21.0       2.3X
```

## How was this patch tested?
Run Benchmark programs on Linux by hand

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #12411 from kiszk/SPARK-14656.
2016-04-19 23:30:34 +02:00
Wenchen Fan 5cb2e33609 [SPARK-14675][SQL] ClassFormatError when use Seq as Aggregator buffer type
## What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/12067, we now use expressions to do the aggregation in `TypedAggregateExpression`. To implement buffer merge, we produce a new buffer deserializer expression by replacing `AttributeReference` with right-side buffer attribute, like other `DeclarativeAggregate`s do, and finally combine the left and right buffer deserializer with `Invoke`.

However, after https://github.com/apache/spark/pull/12338, we will add loop variable to class members when codegen `MapObjects`. If the `Aggregator` buffer type is `Seq`, which is implemented by `MapObjects` expression, we will add the same loop variable to class members twice(by left and right buffer deserializer), which cause the `ClassFormatError`.

This PR fixes this issue by calling `distinct` before declare the class menbers.

## How was this patch tested?

new regression test in `DatasetAggregatorSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12468 from cloud-fan/bug.
2016-04-19 10:51:58 -07:00
Josh Rosen 947b9020b0 [SPARK-14676] Wrap and re-throw Await.result exceptions in order to capture full stacktrace
When `Await.result` throws an exception which originated from a different thread, the resulting stacktrace doesn't include the path leading to the `Await.result` call itself, making it difficult to identify the impact of these exceptions. For example, I've seen cases where broadcast cleaning errors propagate to the main thread and crash it but the resulting stacktrace doesn't include any of the main thread's code, making it difficult to pinpoint which exception crashed that thread.

This patch addresses this issue by explicitly catching, wrapping, and re-throwing exceptions that are thrown by `Await.result`.

I tested this manually using 16b31c8251, a patch which reproduces an issue where an RPC exception which occurs while unpersisting RDDs manages to crash the main thread without any useful stacktrace, and verified that informative, full stacktraces were generated after applying the fix in this PR.

/cc rxin nongli yhuai anabranch

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12433 from JoshRosen/wrap-and-rethrow-await-exceptions.
2016-04-19 10:38:10 -07:00
gatorsmile d9620e769e [SPARK-12457] Fixed the Wrong Description and Missing Example in Collection Functions
#### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/12185 contains the original PR I submitted in https://github.com/apache/spark/pull/10418

However, it misses one of the extended example, a wrong description and a few typos for collection functions. This PR is fix all these issues.

#### How was this patch tested?
The existing test cases already cover it.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12492 from gatorsmile/expressionUpdate.
2016-04-19 10:33:40 -07:00
tedyu e89633605e [SPARK-13904] Add exit code parameter to exitExecutor()
## What changes were proposed in this pull request?

This PR adds exit code parameter to exitExecutor() so that caller can specify different exit code.

## How was this patch tested?

Existing test

rxin hbhanawat

Author: tedyu <yuzhihong@gmail.com>

Closes #12457 from tedyu/master.
2016-04-19 10:12:36 -07:00
Wenchen Fan 9ee95b6ecc [SPARK-14491] [SQL] refactor object operator framework to make it easy to eliminate serializations
## What changes were proposed in this pull request?

This PR tries to separate the serialization and deserialization logic from object operators, so that it's easier to eliminate unnecessary serializations in optimizer.

Typed aggregate related operators are special, they will deserialize the input row to multiple objects and it's difficult to simply use a deserializer operator to abstract it, so we still mix the deserialization logic there.

## How was this patch tested?

existing tests and new test in `EliminateSerializationSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12260 from cloud-fan/encoder.
2016-04-19 10:00:44 -07:00
Cheng Lian 5e360c93be [SPARK-13681][SPARK-14458][SPARK-14566][SQL] Add back once removed CommitFailureTestRelationSuite and SimpleTextHadoopFsRelationSuite
## What changes were proposed in this pull request?

These test suites were removed while refactoring `HadoopFsRelation` related API. This PR brings them back.

This PR also fixes two regressions:

- SPARK-14458, which causes runtime error when saving partitioned tables using `FileFormat` data sources that are not able to infer their own schemata. This bug wasn't detected by any built-in data sources because all of them happen to have schema inference feature.

- SPARK-14566, which happens to be covered by SPARK-14458 and causes wrong query result or runtime error when
  - appending a Dataset `ds` to a persisted partitioned data source relation `t`, and
  - partition columns in `ds` don't all appear after data columns

## How was this patch tested?

`CommitFailureTestRelationSuite` uses a testing relation that always fails when committing write tasks to test write job cleanup.

`SimpleTextHadoopFsRelationSuite` uses a testing relation to test general `HadoopFsRelation` and `FileFormat` interfaces.

The two regressions are both covered by existing test cases.

Author: Cheng Lian <lian@databricks.com>

Closes #12179 from liancheng/spark-13681-commit-failure-test.
2016-04-19 09:37:00 -07:00
Dongjoon Hyun 3d46d796a3 [SPARK-14577][SQL] Add spark.sql.codegen.maxCaseBranches config option
## What changes were proposed in this pull request?

We currently disable codegen for `CaseWhen` if the number of branches is greater than 20 (in CaseWhen.MAX_NUM_CASES_FOR_CODEGEN). It would be better if this value is a non-public config defined in SQLConf.

## How was this patch tested?

Pass the Jenkins tests (including a new testcase `Support spark.sql.codegen.maxCaseBranches option`)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12353 from dongjoon-hyun/SPARK-14577.
2016-04-19 21:38:15 +08:00
bomeng 74fe235ab5 [SPARK-14398][SQL] Audit non-reserved keyword list in ANTLR4 parser
## What changes were proposed in this pull request?

I have compared non-reserved list in Antlr3 and Antlr4 one by one as well as all the existing keywords defined in Antlr4, added the missing keywords to the non-reserved keywords list.  If we need to support more syntax, we can add more keywords by then.

Any recommendation for the above is welcome.

## How was this patch tested?

I manually checked the keywords one by one. Please let me know if there is a better way to test.

Another thought: I suggest to put all the keywords definition and non-reserved list in order, that will be much easier to check in the future.

Author: bomeng <bmeng@us.ibm.com>

Closes #12191 from bomeng/SPARK-14398.
2016-04-19 09:09:58 +02:00
Wenchen Fan d4b94ead92 [SPARK-14595][SQL] add input metrics for FileScanRDD
## What changes were proposed in this pull request?

This is roughly based on the input metrics logic in `SqlNewHadoopRDD`

## How was this patch tested?

Not sure how to write a test, I manually verified it in Spark UI.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12352 from cloud-fan/metrics.
2016-04-18 23:48:22 -07:00
Sameer Agarwal 6f88006895 [SPARK-14722][SQL] Rename upstreams() -> inputRDDs() in WholeStageCodegen
## What changes were proposed in this pull request?

Per rxin's suggestions, this patch renames `upstreams()` to `inputRDDs()` in `WholeStageCodegen` for better implied semantics

## How was this patch tested?

N/A

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12486 from sameeragarwal/codegen-cleanup.
2016-04-18 20:28:58 -07:00
Sameer Agarwal 4eae1dbd7c [SPARK-14718][SQL] Avoid mutating ExprCode in doGenCode
## What changes were proposed in this pull request?

The `doGenCode` method currently takes in an `ExprCode`, mutates it and returns the java code to evaluate the given expression. It should instead just return a new `ExprCode` to avoid passing around mutable objects during code generation.

## How was this patch tested?

Existing Tests

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12483 from sameeragarwal/new-exprcode-2.
2016-04-18 20:28:22 -07:00
Josh Rosen ed2de0299a [SPARK-14719] WriteAheadLogBasedBlockHandler should ignore BlockManager put errors
WriteAheadLogBasedBlockHandler will currently throw exceptions if its BlockManager `put()` calls fail, even though those calls are only performed as a performance optimization. Instead, it should log and ignore exceptions during that `put()`.

This is a longstanding issue that was masked by an incorrect test case. I think that we haven't noticed this in production because

1. most people probably use a `MEMORY_AND_DISK` storage level, and
2. typically, individual blocks may be small enough relative to the total storage memory such that they're able to evict blocks from previous batches, so `put()` failures here may be rare in practice.

This patch fixes the faulty test and fixes the bug.

/cc tdas

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12484 from JoshRosen/received-block-hadndler-fix.
2016-04-18 19:36:40 -07:00
Reynold Xin 5e92583d38 [SPARK-14667] Remove HashShuffleManager
## What changes were proposed in this pull request?
The sort shuffle manager has been the default since Spark 1.2. It is time to remove the old hash shuffle manager.

## How was this patch tested?
Removed some tests related to the old manager.

Author: Reynold Xin <rxin@databricks.com>

Closes #12423 from rxin/SPARK-14667.
2016-04-18 19:30:00 -07:00
CodingCat 4b3d1294ae [SPARK-13227] Risky apply() in OpenHashMap
https://issues.apache.org/jira/browse/SPARK-13227

It might confuse the future developers when they use OpenHashMap.apply() with a numeric value type.

null.asInstance[Int], null.asInstance[Long], null.asInstace[Float] and null.asInstance[Double] will return 0/0.0/0L, which might confuse the developer if the value set contains 0/0.0/0L with an existing key

The current patch only adds the comments describing the issue, with the respect to apply the minimum changes to the code base

The more direct, yet more aggressive, approach is use Option as the return type

andrewor14  JoshRosen  any thoughts about how to avoid the potential issue?

Author: CodingCat <zhunansjtu@gmail.com>

Closes #11107 from CodingCat/SPARK-13227.
2016-04-18 18:51:23 -07:00
Mark Grover 2b151b6b93 [SPARK-14711][BUILD] Examples jar not a part of distribution.
## What changes were proposed in this pull request?

Move the spark-examples.jar from being in examples/target to examples/target/scala-2.11/jars

## How was this patch tested?

Built distribution to make sure examples jar was being included in the tarball.
Ran run-example to make sure examples were run.

Author: Mark Grover <mark@apache.org>

Closes #12476 from markgrover/spark-14711.
2016-04-18 17:44:42 -07:00
Joseph K. Bradley d29e429eeb [SPARK-14714][ML][PYTHON] Fixed issues with non-kwarg typeConverter arg for Param constructor
## What changes were proposed in this pull request?

PySpark Param constructors need to pass the TypeConverter argument by name, partly to make sure it is not mistaken for the expectedType arg and partly because we will remove the expectedType arg in 2.1. In several places, this is not being done correctly.

This PR changes all usages in pyspark/ml/ to keyword args.

## How was this patch tested?

Existing unit tests.  I will not test type conversion for every Param unless we really think it necessary.

Also, if you start the PySpark shell and import classes (e.g., pyspark.ml.feature.StandardScaler), then you no longer get this warning:
```
/Users/josephkb/spark/python/pyspark/ml/param/__init__.py:58: UserWarning: expectedType is deprecated and will be removed in 2.1. Use typeConverter instead, as a keyword argument.
  "Use typeConverter instead, as a keyword argument.")
```
That warning came from the typeConverter argument being passes as the expectedType arg by mistake.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12480 from jkbradley/typeconverter-fix.
2016-04-18 17:15:12 -07:00
Zheng RuiFeng 9bfb35da1e [SPARK-14515][DOC] Add python example for ChiSqSelector
## What changes were proposed in this pull request?
Add the missing python example for ChiSqSelector

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12283 from zhengruifeng/chi2_pe.
2016-04-18 17:14:22 -07:00
Wenchen Fan 602734084c [SPARK-14628][CORE][FOLLLOW-UP] Always tracking read/write metrics
## What changes were proposed in this pull request?

This PR is a follow up for https://github.com/apache/spark/pull/12417, now we always track input/output/shuffle metrics in spark JSON protocol and status API.

Most of the line changes are because of re-generating the gold answer for `HistoryServerSuite`, and we add a lot of 0 values for read/write metrics.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12462 from cloud-fan/follow.
2016-04-18 15:17:29 -07:00
Shixiong Zhu 6ff0435858 [SPARK-14713][TESTS] Fix the flaky test NettyBlockTransferServiceSuite
## What changes were proposed in this pull request?

When there are multiple tests running, "NettyBlockTransferServiceSuite.can bind to a specific port twice and the second increments" may fail.

E.g., assume there are 2 tests running. Here are the execution order to reproduce the test failure.

| Execution Order | Test 1 | Test 2 |
| ------------- | ------------- | ------------- |
| 1 | service0 binds to 17634 |  |
| 2 |  | service0 binds to 17635 (17634 is occupied) |
| 3 | service1 binds to 17636 |  |
| 4 | pass test |  |
| 5 | service0.close (release 17634) |  |
| 6 |  | service1 binds to 17634 |
| 7 |  | `service1.port should be (service0.port + 1)` fails (17634 != 17635 + 1) |

Here is an example in Jenkins: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/786/testReport/junit/org.apache.spark.network.netty/NettyBlockTransferServiceSuite/can_bind_to_a_specific_port_twice_and_the_second_increments/

This PR makes two changes:

- Use a random port between 17634 and 27634 to reduce the possibility of port conflicts.
- Make `service1` use `service0.port` to bind to avoid the above race condition.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12477 from zsxwing/SPARK-14713.
2016-04-18 14:41:45 -07:00
Luciano Resende 68450c8c6e [SPARK-14504][SQL] Enable Oracle docker tests
## What changes were proposed in this pull request?

Enable Oracle docker tests

## How was this patch tested?

Existing tests

Author: Luciano Resende <lresende@apache.org>

Closes #12270 from lresende/oracle.
2016-04-18 14:35:10 -07:00
Andrew Or f1a11976db [SPARK-14674][SQL] Move HiveContext.hiveconf to HiveSessionState
## What changes were proposed in this pull request?

This is just cleanup. This allows us to remove HiveContext later without inflating the diff too much. This PR fixes the conflicts of https://github.com/apache/spark/pull/12431. It also removes the `def hiveConf` from `HiveSqlParser`. So, we will pass the HiveConf associated with a session explicitly instead of relying on Hive's `SessionState` to pass `HiveConf`.

## How was this patch tested?
Existing tests.

Closes #12431

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12449 from yhuai/hiveconf.
2016-04-18 14:28:47 -07:00
Sameer Agarwal 8bd8121329 [SPARK-14710][SQL] Rename gen/genCode to genCode/doGenCode to better reflect the semantics
## What changes were proposed in this pull request?

Per rxin's suggestions, this patch renames `s/gen/genCode` and `s/genCode/doGenCode` to better reflect the semantics of these 2 function calls.

## How was this patch tested?

N/A (refactoring only)

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12475 from sameeragarwal/gencode.
2016-04-18 14:03:40 -07:00
hyukjinkwon 6fc1e72d9b [MINOR] Revert removing explicit typing (changed in some examples and StatFunctions)
## What changes were proposed in this pull request?

This PR reverts some changes in https://github.com/apache/spark/pull/12413. (please see the discussion in that PR).

from
```scala
    words.foreachRDD { (rdd, time) =>
    ...
```

to
```scala
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
    ...
```

Also, this was discussed in dev-mailing list, [here](http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-Scala-style-explicit-typing-within-transformation-functions-and-anonymous-val-td17173.html)

## How was this patch tested?

This was tested with `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12452 from HyukjinKwon/revert-explicit-typing.
2016-04-18 13:45:03 -07:00
Xusen Yin 8c62edb70f [SPARK-14299][EXAMPLES] Remove duplications for scala.examples.ml
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14299

Delete duplications in scala/examples/ml.

TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample
CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample

## How was this patch tested?

Existing tests passed.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12366 from yinxusen/SPARK-14299-2.
2016-04-18 13:34:36 -07:00
Xusen Yin f31a62d1b2 [SPARK-14440][PYSPARK] Remove pipeline specific reader and writer
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14440

Remove

* PipelineMLWriter
* PipelineMLReader
* PipelineModelMLWriter
* PipelineModelMLReader

and modify comments.

## How was this patch tested?

test with unit test.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12216 from yinxusen/SPARK-14440.
2016-04-18 13:31:48 -07:00