Commit graph

16197 commits

Author SHA1 Message Date
Reynold Xin 5503e453ba [SPARK-15088] [SQL] Remove SparkSqlSerializer
## What changes were proposed in this pull request?
This patch removes SparkSqlSerializer. I believe this is now dead code.

## How was this patch tested?
Removed a test case related to it.

Author: Reynold Xin <rxin@databricks.com>

Closes #12864 from rxin/SPARK-15088.
2016-05-03 09:43:47 -07:00
Sun Rui 8b6491fc0b [SPARK-15091][SPARKR] Fix warnings and a failure in SparkR test cases with testthat version 1.0.1
## What changes were proposed in this pull request?
Fix warnings and a failure in SparkR test cases with testthat version 1.0.1

## How was this patch tested?
SparkR unit test cases.

Author: Sun Rui <sunrui2016@gmail.com>

Closes #12867 from sun-rui/SPARK-15091.
2016-05-03 09:29:49 -07:00
Yanbo Liang d26f7cb012 [SPARK-14971][ML][PYSPARK] PySpark ML Params setter code clean up
## What changes were proposed in this pull request?
PySpark ML Params setter code clean up.
For example,
```setInputCol``` can be simplified from
```
self._set(inputCol=value)
return self
```
to:
```
return self._set(inputCol=value)
```
This is a pretty big sweep, and we cleaned up wherever possible.
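
A minimal sketch of why the shorter form works, assuming `_set` itself returns `self`. `ToyParams` is a hypothetical stand-in for illustration, not the PySpark `Params` class:

```python
class ToyParams:
    """Hypothetical stand-in for a Params-like class (not the PySpark code)."""

    def __init__(self):
        self._param_map = {}

    def _set(self, **kwargs):
        # Store every keyword argument, then return self so calls can chain.
        self._param_map.update(kwargs)
        return self

    def setInputCol(self, value):
        # Because _set returns self, the setter can return its result directly.
        return self._set(inputCol=value)

    def setOutputCol(self, value):
        return self._set(outputCol=value)


# Setters still chain exactly as they did with the two-line form.
stage = ToyParams().setInputCol("text").setOutputCol("tokens")
```
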
## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12749 from yanboliang/spark-14971.
2016-05-03 16:46:13 +02:00
Dongjoon Hyun 46965cd014 [SPARK-15057][GRAPHX] Remove stale TODO comment for making enum in GraphGenerators
## What changes were proposed in this pull request?

This PR removes a stale TODO comment in `GraphGenerators.scala`

## How was this patch tested?

Just comment removed.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12839 from dongjoon-hyun/SPARK-15057.
2016-05-03 14:02:04 +01:00
Sean Owen 57ac7c1824 [SPARK-14897][CORE] Upgrade Jetty to latest version of 8
## What changes were proposed in this pull request?

Update Jetty 8.1 to the latest 2016/02 release, from a 2013/10 release, for security and bug fixes. This does not resolve the JIRA necessarily, as it's still worth considering an update to 9.3.

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #12842 from srowen/SPARK-14897.
2016-05-03 13:13:35 +01:00
Reynold Xin d557a5e01e [SPARK-15081] Move AccumulatorV2 and subclasses into util package
## What changes were proposed in this pull request?
This patch moves AccumulatorV2 and subclasses into util package.

## How was this patch tested?
Updated relevant tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12863 from rxin/SPARK-15081.
2016-05-03 19:45:12 +08:00
Dongjoon Hyun a744457076 [SPARK-15053][BUILD] Fix Java Lint errors on Hive-Thriftserver module
## What changes were proposed in this pull request?

This issue fixes or hides 181 Java linter errors introduced by SPARK-14987 which copied hive service code from Hive. We had better clean up these errors before releasing Spark 2.0.

- Fix UnusedImports (15 lines), RedundantModifier (14 lines), SeparatorWrap (9 lines), MethodParamPad (6 lines), FileTabCharacter (5 lines), ArrayTypeStyle (3 lines), ModifierOrder (3 lines), RedundantImport (1 line), CommentsIndentation (1 line), UpperEll (1 line), FallThrough (1 line), OneStatementPerLine (1 line), NewlineAtEndOfFile (1 line) errors.
- Ignore `LineLength` errors under `hive/service/*` (118 lines).
- Ignore `MethodName` error in `PasswdAuthenticationProvider.java` (1 line).
- Ignore `NoFinalizer` error in `ThreadWithGarbageCleanup.java` (1 line).

## How was this patch tested?

After the Jenkins build passes, run `dev/lint-java` manually.
```bash
$ dev/lint-java
Checkstyle checks passed.
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12831 from dongjoon-hyun/SPARK-15053.
2016-05-03 12:39:37 +01:00
Sandeep Singh dfd9723dd3 [MINOR][DOCS] Fix type Information in Quick Start and Programming Guide
Author: Sandeep Singh <sandeep@techaddict.me>

Closes #12841 from techaddict/improve_docs_1.
2016-05-03 12:38:21 +01:00
Holden Karau f10ae4b1e1 [SPARK-6717][ML] Clear shuffle files after checkpointing in ALS
## What changes were proposed in this pull request?

When ALS is run with a checkpoint interval, materialize the current state during the checkpoint and clean up the previous shuffles (non-blocking).

## How was this patch tested?

Existing ALS unit tests, new ALS checkpoint cleanup unit tests added & shuffle files checked after ALS w/checkpointing run.

Author: Holden Karau <holden@us.ibm.com>
Author: Holden Karau <holden@pigscanfly.ca>

Closes #11919 from holdenk/SPARK-6717-clear-shuffle-files-after-checkpointing-in-ALS.
2016-05-03 00:18:10 -07:00
Andrew Ray d8f528ceb6 [SPARK-13749][SQL][FOLLOW-UP] Faster pivot implementation for many distinct values with two phase aggregation
## What changes were proposed in this pull request?

This is a follow up PR for #11583. It makes 3 lazy vals into just vals and adds unit test coverage.

## How was this patch tested?

Existing unit tests and additional unit tests.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #12861 from aray/fast-pivot-follow-up.
2016-05-02 22:47:32 -07:00
Reynold Xin bb9ab56b96 [SPARK-15079] Support average/count/sum in Long/DoubleAccumulator
## What changes were proposed in this pull request?
This patch removes AverageAccumulator and adds the ability to compute average to LongAccumulator and DoubleAccumulator. The patch also improves documentation for the two accumulators.
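
As a rough illustration of the idea, an accumulator that tracks both a running sum and a count can report the average without a separate class. The sketch below is hypothetical (`ToyLongAccumulator` is not Spark's `LongAccumulator`), and returning NaN for an empty accumulator is a choice made only for this sketch:

```python
class ToyLongAccumulator:
    """Hypothetical sketch of an accumulator exposing sum, count and avg."""

    def __init__(self):
        self._sum = 0
        self._count = 0

    def add(self, value):
        # Each added value updates both the running sum and the count.
        self._sum += value
        self._count += 1

    def merge(self, other):
        # Merging two partial accumulators adds their sums and counts.
        self._sum += other._sum
        self._count += other._count

    @property
    def sum(self):
        return self._sum

    @property
    def count(self):
        return self._count

    @property
    def avg(self):
        # Assumption for this sketch: an empty accumulator reports NaN.
        return self._sum / self._count if self._count else float("nan")


# Two partial accumulators, as if filled by two tasks, merged into one.
acc_a = ToyLongAccumulator()
for v in (1, 2, 3):
    acc_a.add(v)
acc_b = ToyLongAccumulator()
acc_b.add(10)
acc_a.merge(acc_b)
```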

## How was this patch tested?
Added unit tests for this.

Author: Reynold Xin <rxin@databricks.com>

Closes #12858 from rxin/SPARK-15079.
2016-05-02 21:12:48 -07:00
Marcin Tustin 8028f3a0b4 [SPARK-14685][CORE] Document heritability of localProperties
## What changes were proposed in this pull request?

This updates the java-/scala- doc for setLocalProperty to document heritability of localProperties. This also adds tests for that behaviour.

## How was this patch tested?

Tests pass. New tests were added.

Author: Marcin Tustin <marcin.tustin@gmail.com>

Closes #12455 from marcintustin/SPARK-14685.
2016-05-02 19:37:57 -07:00
Shixiong Zhu 4e3685ae5e [SPARK-15077][SQL] Use a fair lock to avoid thread starvation in StreamExecution
## What changes were proposed in this pull request?

Right now `StreamExecution.awaitBatchLock` uses an unfair lock. `StreamExecution.awaitOffset` may run too long and fail some test because `StreamExecution.constructNextBatch` keeps getting the lock.

See: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/865/testReport/junit/org.apache.spark.sql.streaming/FileStreamSourceStressTestSuite/file_source_stress_test/

This PR uses a fair ReentrantLock to resolve the thread starvation issue.

## How was this patch tested?

Modified `FileStreamSourceStressTestSuite.test("file source stress test")` to run the test code 100 times locally. Without this patch, it always fails with a timeout.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12852 from zsxwing/SPARK-15077.
2016-05-02 18:27:49 -07:00
bomeng 0fd95be3cd [SPARK-15062][SQL] fix list type infer serializer issue
## What changes were proposed in this pull request?

Infer the serializer correctly when the input type is `List[_]`. Because `List[_]` is a kind of `Seq[_]`, it should be handled as one; previously it was matched by a different case (`case t if definedByConstructorParams(t)`).

## How was this patch tested?

New test case was added.

Author: bomeng <bmeng@us.ibm.com>

Closes #12849 from bomeng/SPARK-15062.
2016-05-02 18:20:29 -07:00
Herman van Hovell 1c19c2769e [SPARK-15047][SQL] Cleanup SQL Parser
## What changes were proposed in this pull request?
This PR addresses a few minor issues in SQL parser:

- Removes some unused rules and keywords in the grammar.
- Removes code path for fallback SQL parsing (was needed for Hive native parsing).
- Use `UnresolvedGenerator` instead of hard-coding `Explode` & `JsonTuple`.
- Adds a more generic way of creating error messages for unsupported Hive features.
- Use `visitFunctionName` as much as possible.
- Interpret a `CatalogColumn`'s `DataType` directly instead of parsing it again.

## How was this patch tested?
Existing tests.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12826 from hvanhovell/SPARK-15047.
2016-05-02 18:12:31 -07:00
hyukjinkwon d37c7f7f04 [SPARK-15050][SQL] Put CSV and JSON options as Python csv and json function parameters
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-15050

This PR adds function parameters for Python API for reading and writing `csv()`.

## How was this patch tested?

This was tested by `./dev/run_tests`.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #12834 from HyukjinKwon/SPARK-15050.
2016-05-02 17:50:40 -07:00
Liwei Lin 35d9c8aa69 [SPARK-14747][SQL] Add assertStreaming/assertNoneStreaming checks in DataFrameWriter
## Problem

If an end user happens to write code mixed with continuous-query-oriented methods and non-continuous-query-oriented methods:

```scala
ctx.read
   .format("text")
   .stream("...")  // continuous query
   .write
   .text("...")    // non-continuous query; should be startStream() here
```

They would get this somewhat confusing exception:

>
Exception in thread "main" java.lang.AssertionError: assertion failed: No plan for FileSource[./continuous_query_test_input]
	at scala.Predef$.assert(Predef.scala:170)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
	at ...

## What changes were proposed in this pull request?

This PR adds checks for continuous-query-oriented methods and non-continuous-query-oriented methods in `DataFrameWriter`:

<table>
<tr>
	<td align="center"></td>
	<td align="center"><strong>can be called on continuous query?</strong></td>
	<td align="center"><strong>can be called on non-continuous query?</strong></td>
</tr>
<tr>
	<td align="center">mode</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">trigger</td>
	<td align="center">yes</td>
	<td align="center"></td>
</tr>
<tr>
	<td align="center">format</td>
	<td align="center">yes</td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">option/options</td>
	<td align="center">yes</td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">partitionBy</td>
	<td align="center">yes</td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">bucketBy</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">sortBy</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">save</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">queryName</td>
	<td align="center">yes</td>
	<td align="center"></td>
</tr>
<tr>
	<td align="center">startStream</td>
	<td align="center">yes</td>
	<td align="center"></td>
</tr>
<tr>
	<td align="center">insertInto</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">saveAsTable</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">jdbc</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">json</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">parquet</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">orc</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">text</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">csv</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
</table>

After this PR's change, the friendly exception would be:
>
Exception in thread "main" org.apache.spark.sql.AnalysisException: text() can only be called on non-continuous queries;
	at org.apache.spark.sql.DataFrameWriter.assertNotStreaming(DataFrameWriter.scala:678)
	at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:629)
	at ss.SSDemo$.main(SSDemo.scala:47)

## How was this patch tested?

Dedicated unit tests were added.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12521 from lw-lin/dataframe-writer-check.
2016-05-02 16:48:20 -07:00
Herman van Hovell f362363d14 [SPARK-14785] [SQL] Support correlated scalar subqueries
## What changes were proposed in this pull request?
In this PR we add support for correlated scalar subqueries. An example of such a query is:
```SQL
select * from tbl1 a where a.value > (select max(value) from tbl2 b where b.key = a.key)
```
The implementation adds the `RewriteCorrelatedScalarSubquery` rule to the Optimizer. This rule plans these subqueries using `LEFT OUTER` joins. It currently supports rewrites for `Project`, `Aggregate` & `Filter` logical plans.

I could not find well-defined semantics for the use of scalar subqueries in an `Aggregate`. The current implementation evaluates the scalar subquery *before* aggregation. This means that you either have to make the scalar subquery part of the grouping expression, or you have to aggregate it further on. I am open to suggestions on this.

The implementation currently forces the uniqueness of a scalar subquery by enforcing that it is aggregated and that the resulting column is wrapped in an `AggregateExpression`.
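
The rewrite can be pictured with plain Python collections. This is only a sketch of the general technique (aggregate the subquery per key, left-outer-join it, then filter), not the optimizer rule itself; the function names and sample rows are hypothetical:

```python
def max_by_key(rows):
    """Aggregate the subquery side: key -> max(value), one row per key."""
    result = {}
    for key, value in rows:
        if key not in result or value > result[key]:
            result[key] = value
    return result


def filter_with_correlated_max(tbl1, tbl2):
    """Mimic: select * from tbl1 a where a.value > (select max(value) ...)."""
    aggregated = max_by_key(tbl2)
    kept = []
    for key, value in tbl1:
        # Left outer join: a key missing on the right yields None (SQL NULL).
        sub = aggregated.get(key)
        # A comparison against NULL is not true, so that row is filtered out.
        if sub is not None and value > sub:
            kept.append((key, value))
    return kept


tbl1 = [(1, 10), (1, 3), (2, 7), (3, 5)]
tbl2 = [(1, 5), (1, 4), (2, 9)]
rows = filter_with_correlated_max(tbl1, tbl2)
```

Aggregating first is what guarantees at most one right-hand row per key, so the join cannot duplicate rows of `tbl1`.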

## How was this patch tested?
Added tests to `SubquerySuite`.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12822 from hvanhovell/SPARK-14785.
2016-05-02 16:32:31 -07:00
poolis 917d05f43b [SPARK-12928][SQL] Oracle FLOAT datatype is not properly handled when reading via JDBC
The contribution is my original work and that I license the work to the project under the project's open source license.

Author: poolis <gmichalopoulos@gmail.com>
Author: Greg Michalopoulos <gmichalopoulos@gmail.com>

Closes #10899 from poolis/spark-12928.
2016-05-02 16:15:07 -07:00
Reynold Xin ca1b219858 [SPARK-15052][SQL] Use builder pattern to create SparkSession
## What changes were proposed in this pull request?
This patch creates a builder pattern for creating SparkSession. The new code is unused and mostly dead code. I'm putting it up here for feedback.

There are a few TODOs that can be done as follow-up pull requests:
- [ ] Update tests to use this
- [ ] Update examples to use this
- [ ] Clean up SQLContext code w.r.t. this one (i.e. SparkSession shouldn't call into SQLContext.getOrCreate; it should be the other way around)
- [ ] Remove SparkSession.withHiveSupport
- [ ] Disable the old constructor (by making it private) so the only way to start a SparkSession is through this builder pattern

## How was this patch tested?
Part of the future pull request is to clean this up and switch existing tests to use this.

Author: Reynold Xin <rxin@databricks.com>

Closes #12830 from rxin/sparksession-builder.
2016-05-02 15:27:16 -07:00
Reynold Xin d5c79f564f [SPARK-15054] Deprecate old accumulator API
## What changes were proposed in this pull request?
This patch deprecates the old accumulator API.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #12832 from rxin/SPARK-15054.
2016-05-02 14:57:00 -07:00
Pete Robbins 8a1ce4899f [SPARK-13745] [SQL] Support columnar in memory representation on Big Endian platforms
## What changes were proposed in this pull request?

Parquet datasource and ColumnarBatch tests fail on big-endian platforms. This patch adds support for correctly interpreting the little-endian byte arrays on a big-endian platform.

## How was this patch tested?

Spark test builds ran on big endian z/Linux and regression build on little endian amd64

Author: Pete Robbins <robbinspg@gmail.com>

Closes #12397 from robbinspg/master.
2016-05-02 13:16:46 -07:00
Davies Liu 95e372141a [SPARK-14781] [SQL] support nested predicate subquery
## What changes were proposed in this pull request?

In order to support nested predicate subqueries, this PR introduces an internal join type, ExistenceJoin, which emits all the rows from the left side plus an additional column that indicates whether any rows from the right side matched (it's not null-aware right now). This additional column can be used to replace the subquery in a Filter.

In theory, all predicate subqueries could use this join type, but it's slower than LeftSemi and LeftAnti, so it's only used for nested subqueries (a subquery inside OR).

For example, the following SQL:
```sql
SELECT a FROM t  WHERE EXISTS (select 0) OR EXISTS (select 1)
```
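
The extra column described above can be sketched in plain Python. This is a simplified, hypothetical model (the names `existence_join`, `t`, `s1` and `s2` are illustrative, the subqueries are correlated for the sake of the example, and NULL handling is ignored):

```python
def existence_join(left, right, matches):
    """Emit every left row plus a flag: does any right row match it?"""
    return [(row, any(matches(row, r) for r in right)) for row in left]


t = [1, 2, 3, 4]
s1 = [1, 5]
s2 = [3]

# One existence join per subquery; each adds a boolean column.
with_e1 = existence_join(t, s1, lambda a, k: a == k)
with_both = existence_join(with_e1, s2, lambda row, k: row[0] == k)

# The filter then combines the flag columns with OR instead of the subqueries.
result = [a for (a, e1), e2 in with_both if e1 or e2]
```

Unlike a semi join, no left row is dropped by the join itself, which is what lets the two flags be combined with OR afterwards.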

This PR also fixes a bug in which predicate subqueries were pushed down through joins (they should not be).

Nested null-aware subquery is still not supported. For example,   `a > 3 OR b NOT IN (select bb from t)`

After this, we can run TPCDS queries Q10, Q35, and Q45.

## How was this patch tested?

Added unit tests.

Author: Davies Liu <davies@databricks.com>

Closes #12820 from davies/or_exists.
2016-05-02 12:58:59 -07:00
Dongjoon Hyun 6e6320122e [SPARK-14830][SQL] Add RemoveRepetitionFromGroupExpressions optimizer.
## What changes were proposed in this pull request?

This PR aims to optimize GroupExpressions by removing repeating expressions. `RemoveRepetitionFromGroupExpressions` is added.

**Before**
```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
:  +- TungstenAggregate(key=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9], functions=[], output=[(a + 1)#5])
:     +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, (1 + a#0)#7, (A#0 + 1)#8, (1 + A#0)#9, 200), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6,(1 + a#0) AS (1 + a#0)#7,(A#0 + 1) AS (A#0 + 1)#8,(1 + A#0) AS (1 + A#0)#9], functions=[], output=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9])
      :     +- INPUT
      +- LocalTableScan [a#0], [[1],[2]]
```

**After**
```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
:  +- TungstenAggregate(key=[(a#0 + 1)#6], functions=[], output=[(a + 1)#5])
:     +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, 200), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6], functions=[], output=[(a#0 + 1)#6])
      :     +- INPUT
      +- LocalTableScan [a#0], [[1],[2]]
```
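
One way to picture the deduplication is to compare grouping expressions by a canonical form. The sketch below is a hypothetical stand-in, not the optimizer rule: expressions are modeled as tuples, attribute names are compared case-insensitively, and the operands of the commutative `add` are put in a fixed order:

```python
def canonical(expr):
    """Return a canonical form so that equivalent expressions compare equal."""
    if isinstance(expr, str):
        return expr.lower()  # attribute names compared case-insensitively
    if isinstance(expr, int):
        return expr          # literals stay as they are
    op, left, right = expr
    left, right = canonical(left), canonical(right)
    if op == "add":
        # Addition is commutative, so order the operands deterministically.
        left, right = sorted((left, right), key=repr)
    return (op, left, right)


def remove_repetition(group_exprs):
    """Keep the first occurrence of each semantically equal expression."""
    seen = set()
    kept = []
    for expr in group_exprs:
        key = canonical(expr)
        if key not in seen:
            seen.add(key)
            kept.append(expr)
    return kept


# Models: group by a+1, 1+a, A+1, 1+A
group_by = [("add", "a", 1), ("add", 1, "a"), ("add", "A", 1), ("add", 1, "A")]
deduplicated = remove_repetition(group_by)
```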

## How was this patch tested?

Pass the Jenkins tests (with a new testcase)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12590 from dongjoon-hyun/SPARK-14830.
2016-05-02 12:40:21 -07:00
Shixiong Zhu a35a67a83d [SPARK-14579][SQL] Fix the race condition in StreamExecution.processAllAvailable again
## What changes were proposed in this pull request?

#12339 didn't fix the race condition. MemorySinkSuite is still flaky: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.2/814/testReport/junit/org.apache.spark.sql.streaming/MemorySinkSuite/registering_as_a_table/

Here is an execution order to reproduce it.

| Time        |Thread 1           | MicroBatchThread  |
|:-------------:|:-------------:|:-----:|
| 1 | |  `MemorySink.getOffset` |
| 2 | |  availableOffsets ++= newData (availableOffsets is not changed here)  |
| 3 | addData(newData)      |   |
| 4 | Set `noNewData` to `false` in  processAllAvailable |  |
| 5 | | `dataAvailable` returns `false`   |
| 6 | | noNewData = true |
| 7 | `noNewData` is true so just return | |
| 8 |  assert results and fail | |
| 9 |   | `dataAvailable` returns true so process the new batch |

This PR expands the scope of `awaitBatchLock.synchronized` to eliminate the above race.

## How was this patch tested?

`test("stress test")`: it always failed before this patch and passes with it. The test is ignored in this PR because it takes several minutes to finish.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12582 from zsxwing/SPARK-14579-2.
2016-05-02 11:28:21 -07:00
Andrew Ray 9927441868 [SPARK-13749][SQL] Faster pivot implementation for many distinct values with two phase aggregation
## What changes were proposed in this pull request?

The existing implementation of pivot translates into a single aggregation with one aggregate per distinct pivot value. When the number of distinct pivot values is large (say 1000+) this can get extremely slow since each input value gets evaluated on every aggregate even though it only affects the value of one of them.

I'm proposing an alternate strategy for when there are 10+ (somewhat arbitrary threshold) distinct pivot values. We do two phases of aggregation. In the first we group by the grouping columns plus the pivot column and perform the specified aggregations (one or sometimes more). In the second aggregation we group by the grouping columns and use the new (non public) PivotFirst aggregate that rearranges the outputs of the first aggregation into an array indexed by the pivot value. Finally we do a project to extract the array entries into the appropriate output column.
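
The two phases can be sketched with plain Python dictionaries. This is an illustrative model only: it uses `sum` as the aggregate, and the function name and sample rows are hypothetical:

```python
from collections import defaultdict


def two_phase_pivot(rows, pivot_values):
    """rows are (group_key, pivot_value, number); aggregate is sum."""
    # Phase 1: aggregate by the grouping key plus the pivot column.
    partial = defaultdict(int)
    for group, pivot, value in rows:
        partial[(group, pivot)] += value

    # Phase 2: group by the key alone and place each partial result into
    # the array slot that belongs to its pivot value.
    index = {p: i for i, p in enumerate(pivot_values)}
    arrays = {}
    for (group, pivot), total in partial.items():
        slots = arrays.setdefault(group, [None] * len(pivot_values))
        if pivot in index:
            slots[index[pivot]] = total

    # Final projection: turn array entries into named output columns.
    return {group: dict(zip(pivot_values, slots)) for group, slots in arrays.items()}


rows = [
    (2012, "dotNET", 10000),
    (2012, "Java", 20000),
    (2012, "dotNET", 5000),
    (2013, "Java", 30000),
]
pivoted = two_phase_pivot(rows, ["dotNET", "Java"])
```

Each input row touches exactly one partial aggregate in the first phase, instead of being evaluated against one aggregate per distinct pivot value.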

## How was this patch tested?

Additional unit tests in DataFramePivotSuite and manual larger scale testing.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #11583 from aray/fast-pivot.
2016-05-02 11:12:55 -07:00
Jeff Zhang 0a3026990b [SPARK-14845][SPARK_SUBMIT][YARN] spark.files in properties file is n…
## What changes were proposed in this pull request?

Initialize SparkSubmitArgument#files first from the spark-submit arguments and then from the properties file, so that the system property spark.yarn.dist.files is set correctly.
```
OptionAssigner(args.files, YARN, ALL_DEPLOY_MODES, sysProp = "spark.yarn.dist.files"),
```
## How was this patch tested?

Manual test. A file defined in the properties file is also distributed to the driver in yarn-cluster mode.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #12656 from zjffdu/SPARK-14845.
2016-05-02 11:03:37 -07:00
Wenchen Fan 0513c3ac93 [SPARK-14637][SQL] object expressions cleanup
## What changes were proposed in this pull request?

Simplify and clean up some object expressions:

1. simplify the logic to handle `propagateNull`
2. add `propagateNull` parameter to `Invoke`
3. simplify the unbox logic in `Invoke`
4. other minor cleanup

TODO: simplify `MapObjects`

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12399 from cloud-fan/object.
2016-05-02 10:21:14 -07:00
Ben McCann 214d1be4fd Fix reference to external metrics documentation
Author: Ben McCann <benjamin.j.mccann@gmail.com>

Closes #12833 from benmccann/patch-1.
2016-05-01 22:43:28 -07:00
Reynold Xin 44da8d8eab [SPARK-15049] Rename NewAccumulator to AccumulatorV2
## What changes were proposed in this pull request?
NewAccumulator isn't the best name if we ever come up with v3 of the API.

## How was this patch tested?
Updated tests to reflect the change.

Author: Reynold Xin <rxin@databricks.com>

Closes #12827 from rxin/SPARK-15049.
2016-05-01 20:21:02 -07:00
hyukjinkwon a832cef112 [SPARK-13425][SQL] Documentation for CSV datasource options
## What changes were proposed in this pull request?

This PR adds the explanation and documentation for CSV options for reading and writing.

## How was this patch tested?

Style tests with `./dev/run_tests` for documentation style.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #12817 from HyukjinKwon/SPARK-13425.
2016-05-01 19:05:20 -07:00
Xusen Yin a6428292f7 [SPARK-14931][ML][PYTHON] Mismatched default values between pipelines in Spark and PySpark - update
## What changes were proposed in this pull request?

This PR is an update for [https://github.com/apache/spark/pull/12738] which:
* Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side
* Various fixes for bugs found
  * This includes changing classes taking weightCol to treat unset and empty String Param values the same way.

Defaults changed:
* Scala
 * LogisticRegression: weightCol defaults to not set (instead of empty string)
 * StringIndexer: labels default to not set (instead of empty array)
 * GeneralizedLinearRegression:
   * maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver)
   * weightCol defaults to not set (instead of empty string)
 * LinearRegression: weightCol defaults to not set (instead of empty string)
* Python
 * MultilayerPerceptron: layers default to not set (instead of [1,1])
 * ChiSqSelector: numTopFeatures defaults to 50 (instead of not set)

## How was this patch tested?

Generic unit test.  Manually tested that unit test by changing defaults and verifying that broke the test.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: yinxusen <yinxusen@gmail.com>

Closes #12816 from jkbradley/yinxusen-SPARK-14931.
2016-05-01 12:29:01 -07:00
Allen cdf9e9753d [SPARK-14505][CORE] Fix bug : creating two SparkContext objects in the same jvm, the first one will can not run any task!
After an attempt to create two SparkContext objects in the same JVM (the second one cannot be created successfully),
using the first one to run a job throws an exception like the one below:

![image](https://cloud.githubusercontent.com/assets/7162889/14402832/0c8da2a6-fe73-11e5-8aba-68ee3ddaf605.png)

Author: Allen <yufan_1990@163.com>

Closes #12273 from the-sea/context-create-bug.
2016-05-01 15:39:14 +01:00
Wenchen Fan 90787de864 [SPARK-15033][SQL] fix a flaky test in CachedTableSuite
## What changes were proposed in this pull request?

This is caused by https://github.com/apache/spark/pull/12776, which removes the `synchronized` from all methods in `AccumulatorContext`.

However, a test in `CachedTableSuite` synchronizes on `AccumulatorContext` and expects that no one else can change it, which is no longer true.

This PR updates that test so that it no longer requires a lock on `AccumulatorContext`.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12811 from cloud-fan/flaky.
2016-04-30 20:28:22 -07:00
Hossein 507bea5ca6 [SPARK-14143] Options for parsing NaNs, Infinity and nulls for numeric types
1. Adds the following options for parsing NaNs: nanValue

2. Adds the following options for parsing infinity: positiveInf, negativeInf.
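
A minimal sketch of how such options might drive numeric parsing. The function below is hypothetical and is not the `TypeCast.castTo` implementation; the parameter names mirror the option names, and the default tokens are assumptions made for this sketch:

```python
def parse_double(token, nan_value="NaN", positive_inf="Inf", negative_inf="-Inf"):
    """Parse one CSV field as a float, honoring configurable special tokens.

    The default tokens are assumptions made for this sketch.
    """
    if token == nan_value:
        return float("nan")
    if token == positive_inf:
        return float("inf")
    if token == negative_inf:
        return float("-inf")
    return float(token)


row = ["1.5", "n/a", "+oo", "-oo"]
parsed = [
    parse_double(t, nan_value="n/a", positive_inf="+oo", negative_inf="-oo")
    for t in row
]
```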

`TypeCast.castTo` is unit tested and an end-to-end test is added to `CSVSuite`

Author: Hossein <hossein@databricks.com>

Closes #11947 from falaki/SPARK-14143.
2016-04-30 18:12:03 -07:00
Yin Huai 0182d9599d [SPARK-15034][SPARK-15035][SPARK-15036][SQL] Use spark.sql.warehouse.dir as the warehouse location
This PR contains three changes:
1. We will use spark.sql.warehouse.dir to set the warehouse location. We will not use hive.metastore.warehouse.dir.
2. SessionCatalog needs to set the location to default db. Otherwise, when creating a table in SparkSession without hive support, the default db's path will be an empty string.
3. When we create a database, we need to make the path qualified.

Existing tests and new tests

Author: Yin Huai <yhuai@databricks.com>

Closes #12812 from yhuai/warehouse.
2016-04-30 18:04:42 -07:00
Yanbo Liang 19a6d192d5 [SPARK-15030][ML][SPARKR] Support formula in spark.kmeans in SparkR
## What changes were proposed in this pull request?
* ```RFormula``` supports empty response variable like ```~ x + y```.
* Support formula in ```spark.kmeans``` in SparkR.
* Fix some outdated docs for SparkR.

## How was this patch tested?
Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12813 from yanboliang/spark-15030.
2016-04-30 08:37:56 -07:00
Herman van Hovell e5fb78baf9 [SPARK-14952][CORE][ML] Remove methods that were deprecated in 1.6.0
#### What changes were proposed in this pull request?

This PR removes three methods that were deprecated in 1.6.0:
- `PortableDataStream.close()`
- `LinearRegression.weights`
- `LogisticRegression.weights`

The rationale for doing this is that the impact is small and that Spark 2.0 is a major release.

#### How was this patch tested?
Compilation succeeded.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12732 from hvanhovell/SPARK-14952.
2016-04-30 16:06:20 +01:00
Xiangrui Meng 0847fe4eb3 [SPARK-14653][ML] Remove json4s from mllib-local
## What changes were proposed in this pull request?

This PR moves Vector.toJson/fromJson to ml.linalg.VectorEncoder under mllib/ to keep mllib-local's dependency minimal. The json encoding is used by Params. So we still need this feature in SPARK-14615, where we will switch to ml.linalg in spark.ml APIs.

## How was this patch tested?

Copied existing unit tests over.

cc: dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #12802 from mengxr/SPARK-14653.
2016-04-30 06:30:39 -07:00
Junyang 1192fe4cd2 [SPARK-13289][MLLIB] Fix infinite distances between word vectors in Word2VecModel
## What changes were proposed in this pull request?

This PR fixes the bug that generates infinite distances between word vectors. For example,

Before this PR, calling
```
val synonyms = model.findSynonyms("who", 40)
```
gave the following results:
```
to Infinity
and Infinity
that Infinity
with Infinity
```
With this PR, the distance between words is a value between 0 and 1, as follows:
```
scala> model.findSynonyms("who", 10)
res0: Array[(String, Double)] = Array((Harvard-educated,0.5253688097000122), (ex-SAS,0.5213794708251953), (McMutrie,0.5187736749649048), (fellow,0.5166833400726318), (businessman,0.5145374536514282), (American-born,0.5127736330032349), (British-born,0.5062344074249268), (gray-bearded,0.5047978162765503), (American-educated,0.5035858750343323), (mentored,0.49849334359169006))

scala> model.findSynonyms("king", 10)
res1: Array[(String, Double)] = Array((queen,0.6787897944450378), (prince,0.6786158084869385), (monarch,0.659771203994751), (emperor,0.6490438580513), (goddess,0.643266499042511), (dynasty,0.635733425617218), (sultan,0.6166239380836487), (pharaoh,0.6150713562965393), (birthplace,0.6143025159835815), (empress,0.6109727025032043))

scala> model.findSynonyms("queen", 10)
res2: Array[(String, Double)] = Array((princess,0.7670737504959106), (godmother,0.6982434988021851), (raven-haired,0.6877717971801758), (swan,0.684934139251709), (hunky,0.6816608309745789), (Titania,0.6808111071586609), (heroine,0.6794036030769348), (king,0.6787897944450378), (diva,0.67848801612854), (lip-synching,0.6731793284416199))
```

### There are two places changed in this PR:
- Normalize the word vector to avoid overflow when calculating inner product between word vectors. This also simplifies the distance calculation, since the word vectors only need to be normalized once.
- Scale the learning rate by the number of iterations, to be consistent with the Google Word2Vec implementation.
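
The normalization step can be illustrated with a small sketch. It is not the `Word2VecModel` code: the function names and toy vectors are hypothetical, and similarity here is the inner product of unit-length vectors (cosine similarity):

```python
import math


def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.hypot(*vec)
    return [x / norm for x in vec] if norm > 0 else list(vec)


def find_synonyms(vectors, word, num):
    # Normalize every word vector once, before any comparison.
    unit = {w: normalize(v) for w, v in vectors.items()}
    query = unit[word]
    # With unit vectors, the inner product is bounded in magnitude by 1.
    scored = [
        (w, sum(a * b for a, b in zip(query, u)))
        for w, u in unit.items()
        if w != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]


vectors = {"king": [3.0, 4.0], "queen": [6.0, 8.0], "apple": [4.0, -3.0]}
synonyms = find_synonyms(vectors, "king", 2)
```

Because the vectors are normalized up front, each similarity is a single inner product and cannot grow with the magnitude of the raw vectors.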

## How was this patch tested?

Trained word2vec on a text corpus and ran model.findSynonyms() to get the distances between word vectors.

Author: Junyang <fly.shenjy@gmail.com>
Author: flyskyfly <fly.shenjy@gmail.com>

Closes #11812 from flyjy/TVec.
2016-04-30 10:16:35 +01:00
pshearer 0368ff30dd [SPARK-13973][PYSPARK] Make pyspark fail noisily if IPYTHON or IPYTHON_OPTS are set
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13973

Following discussion with srowen the IPYTHON and IPYTHON_OPTS variables are removed. If they are set in the user's environment, pyspark will not execute and prints an error message. Failing noisily will force users to remove these options and learn the new configuration scheme, which is much more sustainable and less confusing.

## How was this patch tested?

Manual testing; set IPYTHON=1 and verified that the error message prints.

Author: pshearer <pshearer@massmutual.com>
Author: shearerp <shearerp@umich.edu>

Closes #12528 from shearerp/master.
2016-04-30 10:15:20 +01:00
Reynold Xin 8dc3987d09 [SPARK-15028][SQL] Remove HiveSessionState.setDefaultOverrideConfs
## What changes were proposed in this pull request?
This patch removes some code that is no longer relevant -- mainly HiveSessionState.setDefaultOverrideConfs.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #12806 from rxin/SPARK-15028.
2016-04-30 01:32:00 -07:00
Xiangrui Meng b3ea579314 [SPARK-14831][.2][ML][R] rename ml.save/ml.load to write.ml/read.ml
## What changes were proposed in this pull request?

Continue the work of #12789 to rename ml.save/ml.load to write.ml/read.ml, which are more consistent with read.df/write.df and other methods in SparkR.

I didn't rename `data` to `df` because we still use `predict` for prediction, which uses `newData` to match the signature in R.

## How was this patch tested?

Existing unit tests.

cc: yanboliang thunterdb

Author: Xiangrui Meng <meng@databricks.com>

Closes #12807 from mengxr/SPARK-14831.
2016-04-30 00:45:44 -07:00
Xiangrui Meng 7fbe1bb24d [SPARK-14412][.2][ML] rename *RDDStorageLevel to *StorageLevel in ml.ALS
## What changes were proposed in this pull request?

As discussed in #12660, this PR renames
* intermediateRDDStorageLevel -> intermediateStorageLevel
* finalRDDStorageLevel -> finalStorageLevel

The argument name in `ALS.train` will be addressed in SPARK-15027.

## How was this patch tested?

Existing unit tests.

Author: Xiangrui Meng <meng@databricks.com>

Closes #12803 from mengxr/SPARK-14412.
2016-04-30 00:41:28 -07:00
Sean Owen 5886b6217b [SPARK-14533][MLLIB] RowMatrix.computeCovariance inaccurate when values are very large (partial fix)
## What changes were proposed in this pull request?

Fix for part of SPARK-14533: trivial simplification and more accurate computation of column means. See also https://github.com/apache/spark/pull/12299 which contained a complete fix that was very slow. This PR does _not_ resolve SPARK-14533 entirely.
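
To illustrate the kind of inaccuracy SPARK-14533 is about, here is a small Python sketch comparing a one-pass variance formula with a two-pass one on large values. It is not the code in `RowMatrix.computeCovariance` and not this PR's change; the function names and sample values are hypothetical, and a single column is used so that covariance reduces to variance.

```python
def variance_one_pass(xs):
    """Sample variance via sum of squares minus squared sum.

    Subtracts two huge, nearly equal numbers, so most significant
    digits cancel when the values are large relative to their spread.
    """
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)


def variance_two_pass(xs):
    """Sample variance computed from deviations around the mean.

    Centering first keeps the squared terms small, so far less
    precision is lost.
    """
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Values near 1e9 whose true sample variance is 30.
values = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

accurate = variance_two_pass(values)   # 30.0
unstable = variance_one_pass(values)   # far from 30 in double precision
```

The squares here are around 1e18, where adjacent doubles are more than 100 apart, so the one-pass difference cannot represent the true value of 90 before the final division.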

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #12779 from srowen/SPARK-14533.2.
2016-04-30 00:15:41 -07:00
Dongjoon Hyun f86f71763c [MINOR][EXAMPLE] Use SparkSession instead of SQLContext in RDDRelation.scala
## What changes were proposed in this pull request?

Since `SQLContext` is now kept for backward compatibility, the Spark 2.0 examples should use `SparkSession` instead.

## How was this patch tested?

It's just an example change. After building, run `bin/run-example org.apache.spark.examples.sql.RDDRelation`.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12808 from dongjoon-hyun/rddrelation.
2016-04-30 00:15:04 -07:00
Xiangrui Meng 3d09ceeef9 [SPARK-14850][.2][ML] use UnsafeArrayData.fromPrimitiveArray in ml.VectorUDT/MatrixUDT
## What changes were proposed in this pull request?

This PR uses `UnsafeArrayData.fromPrimitiveArray` to implement `ml.VectorUDT/MatrixUDT` to avoid boxing/unboxing.

## How was this patch tested?

Existing unit tests.

cc: cloud-fan

Author: Xiangrui Meng <meng@databricks.com>

Closes #12805 from mengxr/SPARK-14850.
2016-04-29 23:51:01 -07:00
Marcelo Vanzin 73c20bf325 [SPARK-14391][LAUNCHER] Fix launcher communication test, take 2.
There's actually a race here: the state of the handler was changed before
the connection was set, so the test code could be notified of the state
change, wake up, and still see the connection as null, triggering the assert.
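
The ordering issue can be sketched in Python as follows. This is an analogy, not the launcher's Java code: the `Handler` class, its method names, and the state strings are hypothetical. The point it shows is that the connection is assigned before the state change is published, and both happen under the same lock that waiters use.

```python
import threading


class Handler:
    """Toy stand-in for a handler whose state is observed by test code."""

    def __init__(self):
        self._cond = threading.Condition()
        self.state = "UNKNOWN"
        self.connection = None

    def on_connect(self, connection):
        with self._cond:
            # Set the connection first, then publish the state change,
            # so anyone woken by the new state also sees the connection.
            self.connection = connection
            self.state = "CONNECTED"
            self._cond.notify_all()

    def wait_for_state(self, expected, timeout=5.0):
        """Block until the state matches; return (matched, connection)."""
        with self._cond:
            matched = self._cond.wait_for(
                lambda: self.state == expected, timeout
            )
            return matched, self.connection


handler = Handler()
result = {}


def waiter():
    result["matched"], result["connection"] = handler.wait_for_state("CONNECTED")


thread = threading.Thread(target=waiter)
thread.start()
handler.on_connect(object())
thread.join()
```

If the state were published first and the connection assigned afterwards without a shared lock, the waiter could wake up in between and observe `connection is None`, which is the failure described above.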

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #12785 from vanzin/SPARK-14391.
2016-04-29 23:13:50 -07:00
Timothy Hunter bc36fe6e89 [SPARK-14831][SPARKR] Make the SparkR MLlib API more consistent with Spark
## What changes were proposed in this pull request?

This PR splits the MLlib algorithms into two flavors:
 - the R flavor, which tries to mimic the existing R API for these algorithms (and works as an S4 specialization for Spark dataframes)
 - the Spark flavor, which follows the same API and naming conventions as the rest of the MLlib algorithms in the other languages

In practice, the former calls the latter.

## How was this patch tested?

The tests for the various algorithms were adapted to be run against both interfaces.

Author: Timothy Hunter <timhunter@databricks.com>

Closes #12789 from thunterdb/14831.
2016-04-29 23:13:03 -07:00
Wenchen Fan 43b149fb88 [SPARK-14850][ML] convert primitive array from/to unsafe array directly in VectorUDT/MatrixUDT
## What changes were proposed in this pull request?

This PR adds `fromPrimitiveArray` and `toPrimitiveArray` in `UnsafeArrayData`, so that we can do the conversion much faster in VectorUDT/MatrixUDT.

## How was this patch tested?

Existing tests and a new test suite, `UnsafeArraySuite`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12640 from cloud-fan/ml.
2016-04-29 23:04:51 -07:00