Commit graph

17074 commits

Author SHA1 Message Date
Joseph K. Bradley 5ffd5d3838 [SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide
## What changes were proposed in this pull request?

Made DataFrame-based API primary
* Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
* mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
* ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
  * **Reviewers: please check this carefully**
* (minor) Titles for DF API no longer include "- spark.ml" suffix.  Titles for RDD API have "- RDD-based API" suffix
* Moved migration guide to ml-guide from mllib-guide
  * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
  * **Reviewers**: I did not change any of the content of the migration guides.

Reorganized DataFrame-based guide:
* ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
* Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
  * **Reviewers**: I did not change the content of these guides, except some intro text.
* Sidebar remains the same, but with pipeline and tuning sections added

Other:
* ml-classification-regression.html: Moved text about linear methods to new section in page

## How was this patch tested?

Generated docs locally

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14213 from jkbradley/ml-guide-2.0.
2016-07-15 13:38:23 -07:00
z001qdp 71ad945bbb [SPARK-16426][MLLIB] Fix bug that caused NaNs in IsotonicRegression
## What changes were proposed in this pull request?

Fixed a bug that caused `NaN`s in `IsotonicRegression`. The problem occurs when training rows with the same feature value but different labels end up on different partitions. This patch changes a `sortBy` call to a `partitionBy(RangePartitioner)` followed by a `mapPartitions(sortBy)` in order to ensure that all rows with the same feature value end up on the same partition.
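
The partitioning idea can be sketched in plain Python (this is not Spark code). Rows are assigned to partitions by feature-value range, so equal feature values always share a partition, and each partition is then sorted locally. The names `range_partition` and `boundaries` are hypothetical and exist only for this illustration.

```python
from bisect import bisect_left


def range_partition(rows, boundaries):
    """Assign (label, feature) rows to partitions by feature range.

    `boundaries` is a sorted list of upper bounds. Every row with the same
    feature value maps to the same partition index.
    """
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for label, feature in rows:
        partitions[bisect_left(boundaries, feature)].append((label, feature))
    # Sort each partition locally by feature, in the spirit of mapPartitions(sortBy).
    return [sorted(p, key=lambda r: r[1]) for p in partitions]


rows = [(1.0, 0.5), (2.0, 0.5), (3.0, 0.9), (0.0, 0.1)]
parts = range_partition(rows, boundaries=[0.3, 0.7])
```

Both rows with feature `0.5` land in the middle partition, whatever their labels are.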

## How was this patch tested?

Added a unit test.

Author: z001qdp <Nicholas.Eggert@target.com>

Closes #14140 from neggert/SPARK-16426-isotonic-nan.
2016-07-15 12:30:22 +01:00
WeichenXu 1832423827 [SPARK-16546][SQL][PYSPARK] update python dataframe.drop
## What changes were proposed in this pull request?

Make the `dataframe.drop` API in Python accept multiple columns,
so that it matches the Scala API.

## How was this patch tested?

The doc test.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14203 from WeichenXu123/drop_python_api.
2016-07-14 22:55:49 -07:00
Reynold Xin 2e4075e2ec [SPARK-16557][SQL] Remove stale doc in sql/README.md
## What changes were proposed in this pull request?
Most of the documentation in https://github.com/apache/spark/blob/master/sql/README.md is stale. It would be useful to keep the list of projects to explain what's going on, and everything else should be removed.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #14211 from rxin/SPARK-16557.
2016-07-14 19:24:42 -07:00
Josh Rosen 972673aca5 [SPARK-16555] Work around Jekyll error-handling bug which led to silent failures
If a custom Jekyll template tag throws Ruby's equivalent of a "file not found" exception, then Jekyll will stop the doc building process but will exit with a successful status, causing our doc publishing jobs to silently fail.

This is caused by https://github.com/jekyll/jekyll/issues/5104, a case of bad error-handling logic in Jekyll. This patch works around this by updating our `include_example.rb` plugin to catch the exception and exit rather than allowing it to bubble up and be ignored by Jekyll.

I tested this manually with

```
rm ./examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala
cd docs
SKIP_API=1 jekyll build
echo $?
```

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14209 from JoshRosen/fix-doc-building.
2016-07-14 15:55:36 -07:00
Shivaram Venkataraman 01c4c1fa53 [SPARK-16553][DOCS] Fix SQL example file name in docs
## What changes were proposed in this pull request?

Fixes a typo in the sql programming guide

## How was this patch tested?

Building docs locally

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #14208 from shivaram/spark-sql-doc-fix.
2016-07-14 14:19:30 -07:00
jerryshao 91575cac32 [SPARK-16540][YARN][CORE] Avoid adding jars twice for Spark running on yarn
## What changes were proposed in this pull request?

Currently, when running Spark on YARN, jars specified with `--jars` or `--packages` are added twice: once to Spark's own file server and once to YARN's distributed cache. This can be seen in the log. For example:

```
./bin/spark-shell --master yarn-client --jars examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar
```

Here the jar to be added is the scopt jar, and it is added twice:

```
...
16/07/14 15:06:48 INFO Server: Started 5603ms
16/07/14 15:06:48 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/07/14 15:06:48 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.102:4040
16/07/14 15:06:48 INFO SparkContext: Added JAR file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar at spark://192.168.0.102:63996/jars/scopt_2.11-3.3.0.jar with timestamp 1468480008637
16/07/14 15:06:49 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/07/14 15:06:49 INFO Client: Requesting a new application from cluster with 1 NodeManagers
16/07/14 15:06:49 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
16/07/14 15:06:49 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
16/07/14 15:06:49 INFO Client: Setting up container launch context for our AM
16/07/14 15:06:49 INFO Client: Setting up the launch environment for our AM container
16/07/14 15:06:49 INFO Client: Preparing resources for our AM container
16/07/14 15:06:49 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/07/14 15:06:50 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g40000gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_libs__6486179704064718817.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_libs__6486179704064718817.zip
16/07/14 15:06:51 INFO Client: Uploading resource file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/scopt_2.11-3.3.0.jar
16/07/14 15:06:51 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g40000gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_conf__326416236462420861.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_conf__.zip
...
```

This patch avoids adding jars to Spark's file server unnecessarily.

## How was this patch tested?

Manually verified both in yarn client and cluster mode, also in standalone mode.

Author: jerryshao <sshao@hortonworks.com>

Closes #14196 from jerryshao/SPARK-16540.
2016-07-14 10:40:59 -07:00
Jacek Lewandowski 31ca741aef [SPARK-16528][SQL] Fix NPE problem in HiveClientImpl
## What changes were proposed in this pull request?

Some methods and fields (`getParameters`, `properties`) are called and their results are passed to Java/Scala collection converters. Unfortunately, those values can be null in some cases, and the conversion then throws an NPE. We fix this by wrapping those calls in `Option` before doing the conversion.
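
As a rough Python analogue of the null-safe conversion (the actual fix is in Scala and uses `Option`), a possibly missing mapping can be normalized before it is converted. The helper name `to_dict` is hypothetical.

```python
def to_dict(maybe_mapping):
    """Convert a possibly-None mapping to a dict, treating None as empty."""
    return dict(maybe_mapping) if maybe_mapping is not None else {}


params = to_dict(None)            # no exception, just an empty dict
props = to_dict({"owner": "me"})  # a regular mapping is copied as-is
```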

## How was this patch tested?

Manually tested with a custom Hive metastore.

Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>

Closes #14200 from jacek-lewandowski/SPARK-16528.
2016-07-14 10:18:31 -07:00
Dongjoon Hyun c576f9fb90 [SPARK-16529][SQL][TEST] withTempDatabase should set default database before dropping
## What changes were proposed in this pull request?

`SQLTestUtils.withTempDatabase` is a frequently used test harness that sets up a temporary database and cleans it up afterwards. This PR improves its usability as follows.

```scala
-    try f(dbName) finally spark.sql(s"DROP DATABASE $dbName CASCADE")
+    try f(dbName) finally {
+      if (spark.catalog.currentDatabase == dbName) {
+        spark.sql(s"USE ${DEFAULT_DATABASE}")
+      }
+      spark.sql(s"DROP DATABASE $dbName CASCADE")
+    }
```

If a test forgets to reset the current database, `withTempDatabase` will no longer raise an exception.
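
The same cleanup pattern can be sketched in plain Python with a context manager. `FakeCatalog` is a hypothetical stand-in rather than a Spark class, and it assumes that dropping the currently selected database fails.

```python
from contextlib import contextmanager

DEFAULT_DATABASE = "default"


class FakeCatalog:
    """Hypothetical stand-in for a session catalog; not a Spark class."""

    def __init__(self):
        self.databases = {DEFAULT_DATABASE}
        self.current = DEFAULT_DATABASE

    def use(self, name):
        if name not in self.databases:
            raise ValueError(f"no such database: {name}")
        self.current = name

    def drop(self, name):
        # Assumption for this sketch: the current database cannot be dropped.
        if name == self.current:
            raise RuntimeError(f"cannot drop current database: {name}")
        self.databases.discard(name)


@contextmanager
def with_temp_database(catalog, name="tmp_db"):
    catalog.databases.add(name)
    try:
        yield name
    finally:
        # Switch back first, so the drop succeeds even if the body
        # left the temporary database selected.
        if catalog.current == name:
            catalog.use(DEFAULT_DATABASE)
        catalog.drop(name)


catalog = FakeCatalog()
with with_temp_database(catalog) as db:
    catalog.use(db)  # the test body forgets to switch back
```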

## How was this patch tested?

This improves test harness.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14184 from dongjoon-hyun/SPARK-16529.
2016-07-15 00:51:11 +08:00
Felix Cheung 12005c88fb [SPARK-16538][SPARKR] fix R call with namespace operator on SparkSession functions
## What changes were proposed in this pull request?

Fix function routing to work with and without namespace operator `SparkR::createDataFrame`

## How was this patch tested?

manual, unit tests

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14195 from felixcheung/rroutedefault.
2016-07-14 09:45:30 -07:00
Sun Rui 093ebbc628 [SPARK-16509][SPARKR] Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy.
## What changes were proposed in this pull request?
Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy to pass CRAN package check.

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <sunrui2016@gmail.com>

Closes #14192 from sun-rui/SPARK-16509.
2016-07-14 09:38:42 -07:00
Dongjoon Hyun 56183b84fb [SPARK-16543][SQL] Rename the columns of SHOW PARTITION/COLUMNS commands
## What changes were proposed in this pull request?

This PR changes the names of the columns returned by the `SHOW PARTITIONS` and `SHOW COLUMNS` commands. Currently, both commands use `result` as the column name.

**Comparison: Column Name**

Command|Spark(Before)|Spark(After)|Hive
----------|--------------|------------|-----
SHOW PARTITIONS|result|partition|partition
SHOW COLUMNS|result|col_name|field

Note that Spark/Hive use `col_name` in `DESC TABLE`. So, this PR chooses `col_name` for consistency among Spark commands.

**Before**
```scala
scala> sql("show partitions p").show()
+------+
|result|
+------+
|   b=2|
+------+

scala> sql("show columns in p").show()
+------+
|result|
+------+
|     a|
|     b|
+------+
```

**After**
```scala
scala> sql("show partitions p").show
+---------+
|partition|
+---------+
|      b=2|
+---------+

scala> sql("show columns in p").show
+--------+
|col_name|
+--------+
|       a|
|       b|
+--------+
```

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14199 from dongjoon-hyun/SPARK-16543.
2016-07-14 17:18:34 +02:00
gatorsmile 1b5c9e52a7 [SPARK-16530][SQL][TRIVIAL] Wrong Parser Keyword in ALTER TABLE CHANGE COLUMN
#### What changes were proposed in this pull request?
Based on the [Hive SQL syntax](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ChangeColumnName/Type/Position/Comment), the command to change column name/type/position/comment is `ALTER TABLE CHANGE COLUMN`. However, in our .g4 file, it is `ALTER TABLE CHANGE COLUMNS`. Because it is the last, optional keyword, the mistake has no effect, so I filed the issue as Trivial.

cc hvanhovell

#### How was this patch tested?
Existing test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14186 from gatorsmile/changeColumns.
2016-07-14 17:15:51 +02:00
Marcelo Vanzin b7b5e17876 [SPARK-16505][YARN] Optionally propagate error during shuffle service startup.
This prevents the NM from starting when something is wrong, which would
lead to later errors which are confusing and harder to debug.

Added a unit test to verify startup fails if something is wrong.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14162 from vanzin/SPARK-16505.
2016-07-14 09:42:32 -05:00
jerryshao c4bc2ed844 [SPARK-14963][MINOR][YARN] Fix typo in YarnShuffleService recovery file name
## What changes were proposed in this pull request?

Due to the changes in [SPARK-14963](https://issues.apache.org/jira/browse/SPARK-14963), the external shuffle recovery file name was changed by mistake. This patch changes it back to the previous file name.

This only affects the master branch; branch-2.0 is correct [here](https://github.com/apache/spark/blob/branch-2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L195).

## How was this patch tested?

N/A

Author: jerryshao <sshao@hortonworks.com>

Closes #14197 from jerryshao/fix-typo-file-name.
2016-07-14 08:31:04 -05:00
Bryan Cutler e3f8a03367 [SPARK-16403][EXAMPLES] Cleanup to remove unused imports, consistent style, minor fixes
## What changes were proposed in this pull request?

Cleanup of examples, mostly from PySpark-ML, to fix minor issues: unused imports, style consistency, pipeline_example being a duplicate, use of the future print function, and a spelling error.

* The "Pipeline Example" is duplicated by "Simple Text Classification Pipeline" in Scala, Python, and Java.

* "Estimator Transformer Param Example" is duplicated by "Simple Params Example" in Scala, Python and Java

* Synced random_forest_classifier_example.py with Scala by adding an IndexToString label converter

* Synced train_validation_split.py (in Scala ModelSelectionViaTrainValidationExample) by adjusting data split, adding grid for intercept.

* RegexTokenizer was doing nothing in tokenizer_example.py and JavaTokenizerExample.java, synced with Scala version

## How was this patch tested?
local tests and run modified examples

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #14081 from BryanCutler/examples-cleanup-SPARK-16403.
2016-07-14 09:12:46 +01:00
WeichenXu 252d4f27f2 [SPARK-16500][ML][MLLIB][OPTIMIZER] add LBFGS convergence warning for all used place in MLLib
## What changes were proposed in this pull request?

Add a warning for the following cases when LBFGS training does not actually converge:

1) LogisticRegression
2) AFTSurvivalRegression
3) LBFGS algorithm wrapper in mllib package

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14157 from WeichenXu123/add_lbfgs_convergence_warning_for_all_used_place.
2016-07-14 09:11:04 +01:00
Wenchen Fan db7317ac3c [SPARK-16448] RemoveAliasOnlyProject should not remove alias with metadata
## What changes were proposed in this pull request?

`Alias` with metadata is not a no-op and we should not strip it in `RemoveAliasOnlyProject` rule.
This PR also improves the rule:

1. Extend the semantics of `alias-only`: the project list may now be partially aliased.
2. Add a unit test for this rule.

## How was this patch tested?

new `RemoveAliasOnlyProjectSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14106 from cloud-fan/bug.
2016-07-14 15:48:22 +08:00
Liwei Lin 39c836e976 [SPARK-16503] SparkSession should provide Spark version
## What changes were proposed in this pull request?

This patch enables SparkSession to provide the Spark version.

## How was this patch tested?

Manual test:

```
scala> sc.version
res0: String = 2.1.0-SNAPSHOT

scala> spark.version
res1: String = 2.1.0-SNAPSHOT
```

```
>>> sc.version
u'2.1.0-SNAPSHOT'
>>> spark.version
u'2.1.0-SNAPSHOT'
```

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14165 from lw-lin/add-version.
2016-07-13 22:30:46 -07:00
Dongjoon Hyun 9c530576a4 [SPARK-16536][SQL][PYSPARK][MINOR] Expose sql in PySpark Shell
## What changes were proposed in this pull request?

This PR exposes `sql` in PySpark Shell like Scala/R Shells for consistency.

**Background**
 * Scala
 ```scala
scala> sql("select 1 a")
res0: org.apache.spark.sql.DataFrame = [a: int]
```

 * R
 ```r
> sql("select 1")
SparkDataFrame[1:int]
```

**Before**
 * Python

 ```python
>>> sql("select 1 a")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sql' is not defined
```

**After**
 * Python

 ```python
>>> sql("select 1 a")
DataFrame[a: int]
```

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14190 from dongjoon-hyun/SPARK-16536.
2016-07-13 22:24:26 -07:00
Joseph K. Bradley a5f51e2162 [SPARK-16485][ML][DOC] Fix privacy of GLM members, rename sqlDataTypes for ML, doc fixes
## What changes were proposed in this pull request?

Fixing issues found during 2.0 API checks:
* GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should not be exposed
* sqlDataTypes: name does not follow conventions. Do we need to expose it?
* Evaluator: inconsistent doc between evaluate and isLargerBetter
* MinMaxScaler: math rendering --> hard to make it great, but I'll change it a little
* GeneralizedLinearRegressionSummary: aic doc is incorrect --> will change to use more common name

## How was this patch tested?

Existing unit tests.  Docs generated locally.  (MinMaxScaler is improved a tiny bit.)

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14187 from jkbradley/final-api-check-2.0.
2016-07-13 15:40:44 -07:00
gatorsmile c5ec879828 [SPARK-16482][SQL] Describe Table Command for Tables Requiring Runtime Inferred Schema
#### What changes were proposed in this pull request?
If we create a table pointing to Parquet/JSON datasets without specifying the schema, the describe table command does not show the schema at all. It only shows `# Schema of this table is inferred at runtime`. In 1.6, describe table does show the schema of such a table.

~~For data source tables, to infer the schema, we need to load the data source tables at runtime. Thus, this PR calls the function `lookupRelation`.~~

For data source tables, we infer the schema before table creation. Thus, this PR sets the inferred schema as the table schema at table creation.

#### How was this patch tested?
Added test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14148 from gatorsmile/describeSchema.
2016-07-13 15:23:37 -07:00
Felix Cheung fb2e8eeb0b [SPARKR][DOCS][MINOR] R programming guide to include csv data source example
## What changes were proposed in this pull request?

Minor documentation update for code example, code style, and missed reference to "sparkR.init"

## How was this patch tested?

manual

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14178 from felixcheung/rcsvprogrammingguide.
2016-07-13 15:09:23 -07:00
Felix Cheung b4baf086ca [SPARKR][MINOR] R examples and test updates
## What changes were proposed in this pull request?

Minor example updates

## How was this patch tested?

manual

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14171 from felixcheung/rexample.
2016-07-13 13:33:34 -07:00
James Thomas 51a6706b13 [SPARK-16114][SQL] updated structured streaming guide
## What changes were proposed in this pull request?

Updated structured streaming programming guide with new windowed example.

## How was this patch tested?

Docs

Author: James Thomas <jamesjoethomas@gmail.com>

Closes #14183 from jjthomas/ss_docs_update.
2016-07-13 13:26:23 -07:00
Burak Yavuz 0744d84c91 [SPARK-16531][SQL][TEST] Remove timezone setting from DataFrameTimeWindowingSuite
## What changes were proposed in this pull request?

It's unnecessary. `QueryTest` already sets it.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #14170 from brkyvz/test-tz.
2016-07-13 12:54:57 -07:00
Joseph K. Bradley 01f09b1612 [SPARK-14812][ML][MLLIB][PYTHON] Experimental, DeveloperApi annotation audit for ML
## What changes were proposed in this pull request?

General decisions to follow, except where noted:
* spark.mllib, pyspark.mllib: Remove all Experimental annotations.  Leave DeveloperApi annotations alone.
* spark.ml, pyspark.ml
  * Annotate Estimator-Model pairs of classes and companion objects the same way.
  * For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation.
  * For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation.
* DeveloperApi annotations are left alone, except where noted.
* No changes to which types are sealed.

Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new:
* Model Summary classes
* MLWriter, MLReader, MLWritable, MLReadable
* Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency.
* RFormula: Its behavior may need to change slightly to match R in edge cases.
* AFTSurvivalRegression
* MultilayerPerceptronClassifier

DeveloperApi changes:
* ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi

## How was this patch tested?

N/A

Note to reviewers:
* spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental.
* Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature.  I did not find such cases, but please verify.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14147 from jkbradley/experimental-audit.
2016-07-13 12:33:39 -07:00
jerryshao d8220c1e5e [SPARK-16435][YARN][MINOR] Add warning log if initialExecutors is less than minExecutors
## What changes were proposed in this pull request?

Currently, if `spark.dynamicAllocation.initialExecutors` is less than `spark.dynamicAllocation.minExecutors`, Spark automatically picks minExecutors without any warning, whereas Spark 1.6 throws an exception for this configuration. This patch adds a warning log when these parameters are configured inconsistently.
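
A minimal sketch of the described behaviour in plain Python, covering only the two settings named above. The function and logger names are hypothetical and this is not Spark's implementation.

```python
import logging

logger = logging.getLogger("dynamic_allocation_sketch")


def initial_target_executors(initial_executors, min_executors):
    """Return the starting executor count, warning on an inconsistent config."""
    if initial_executors < min_executors:
        logger.warning(
            "initialExecutors (%d) is less than minExecutors (%d); using minExecutors",
            initial_executors,
            min_executors,
        )
    return max(initial_executors, min_executors)
```

For example, `initial_target_executors(1, 3)` logs a warning and returns `3`, while `initial_target_executors(5, 3)` returns `5` silently.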

## How was this patch tested?

Unit test added to verify the scenario.

Author: jerryshao <sshao@hortonworks.com>

Closes #14149 from jerryshao/SPARK-16435.
2016-07-13 13:24:47 -05:00
蒋星博 f376c37268 [SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown predicates correctly in non-deterministic condition.
## What changes were proposed in this pull request?

Currently our Optimizer may reorder predicates to run them more efficiently, but when the condition is non-deterministic, changing the order between deterministic and non-deterministic parts may change the number of input rows. For example:
```SELECT a FROM t WHERE rand() < 0.1 AND a = 1```
And
```SELECT a FROM t WHERE a = 1 AND rand() < 0.1```
may call rand() a different number of times, and therefore the output rows differ.

This PR fixes this by checking, before pushing a predicate down, whether it is placed before any non-deterministic predicates.
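
The rule can be illustrated with a small Python sketch that models each conjunct as a `(sql_text, is_deterministic)` pair. The function name and this representation are assumptions made for the illustration, not the optimizer's actual data structures.

```python
def split_pushable(conjuncts):
    """Split AND-ed predicates into (pushable, remaining).

    Each conjunct is a (sql_text, is_deterministic) pair. Only the
    deterministic predicates that appear before the first non-deterministic
    one are pushable; everything from that point on stays in place.
    """
    pushable = []
    for i, (text, deterministic) in enumerate(conjuncts):
        if not deterministic:
            return pushable, [t for t, _ in conjuncts[i:]]
        pushable.append(text)
    return pushable, []


q1 = [("rand() < 0.1", False), ("a = 1", True)]
q2 = [("a = 1", True), ("rand() < 0.1", False)]
```

With `q1` nothing is pushed down, because `a = 1` follows `rand() < 0.1`. With `q2` only `a = 1` is pushed down.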

## How was this patch tested?

Expanded related testcases in FilterPushdownSuite.

Author: 蒋星博 <jiangxingbo@meituan.com>

Closes #14012 from jiangxb1987/ppd.
2016-07-14 00:21:27 +08:00
oraviv ea06e4ef34 [SPARK-16469] enhanced simulate multiply
## What changes were proposed in this pull request?

We have a use case of multiplying very big sparse matrices. We multiply about 1000x1000 distributed block matrices, and the simulate multiply step scales like O(n^4) (n being 1000) and takes about 1.5 hours. We modified it slightly to use a classical hashmap, and it now runs in about 30 seconds, at O(n^2).
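
The general idea of replacing a repeated scan with a hashmap lookup can be sketched as follows. This is a simplified illustration under assumed names: blocks are `(row, col)` index pairs and the result lists the output blocks each left block contributes to. The actual Spark code differs in detail.

```python
from collections import defaultdict


def destinations_naive(left_blocks, right_blocks):
    """For each left block (i, k), scan all right blocks for matches (k, j)."""
    return {
        (i, k): sorted({(i, j) for (k2, j) in right_blocks if k2 == k})
        for (i, k) in left_blocks
    }


def destinations_hashed(left_blocks, right_blocks):
    """Same result, but index right blocks by row once, then look them up."""
    cols_by_row = defaultdict(set)
    for (k, j) in right_blocks:
        cols_by_row[k].add(j)
    return {
        (i, k): sorted((i, j) for j in cols_by_row.get(k, ()))
        for (i, k) in left_blocks
    }


left = [(0, 0), (0, 1), (1, 1)]
right = [(0, 0), (1, 0), (1, 2)]
```

Both functions return the same mapping; the hashed version avoids rescanning every right block for every left block.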

## How was this patch tested?

We have added a performance test and verified the reduced time.

Author: oraviv <oraviv@paypal.com>

Closes #14068 from uzadude/master.
2016-07-13 14:47:08 +01:00
Sean Owen 51ade51a9f [SPARK-16440][MLLIB] Undeleted broadcast variables in Word2Vec causing OoM for long runs
## What changes were proposed in this pull request?

Unpersist broadcasted vars in Word2Vec.fit for more timely / reliable resource cleanup

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14153 from srowen/SPARK-16440.
2016-07-13 11:39:32 +01:00
sharkd 3d6f679cfe [MINOR][YARN] Fix code error in yarn-cluster unit test
## What changes were proposed in this pull request?

Fix code error in yarn-cluster unit test.

## How was this patch tested?

Existing tests.

Author: sharkd <sharkd.tu@gmail.com>

Closes #14166 from sharkdtu/master.
2016-07-13 11:36:02 +01:00
sandy bf107f1e65 [SPARK-16438] Add Asynchronous Actions documentation
## What changes were proposed in this pull request?

Add Asynchronous Actions documentation to the Actions section of the programming guide.

## How was this patch tested?

Checked the documentation indentation and formatting with an md preview.

Author: sandy <phalodi@gmail.com>

Closes #14104 from phalodi/SPARK-16438.
2016-07-13 11:33:46 +01:00
Maciej Brynski 83879ebc58 [SPARK-16439] Fix number formatting in SQL UI
## What changes were proposed in this pull request?

The Spark SQL UI displays numbers greater than 1000 with U+00A0 as the grouping separator.
The problem occurs when the server locale uses a non-breaking space as the separator (for example, pl_PL).
This patch turns off grouping and removes the separator.

The problem was introduced by this PR:
https://github.com/apache/spark/pull/12425/files#diff-803f475b01acfae1c5c96807c2ea9ddcR125

## How was this patch tested?

Manual UI tests. Screenshot attached.

![image](https://cloud.githubusercontent.com/assets/4006010/16749556/5cb5a372-47cb-11e6-9a95-67fd3f9d1c71.png)

Author: Maciej Brynski <maciej.brynski@adpilot.pl>

Closes #14142 from maver1ck/master.
2016-07-13 10:50:26 +01:00
Xin Ren f73891e0b9 [MINOR] Fix Java style errors and remove unused imports
## What changes were proposed in this pull request?

Fix Java style errors and remove unused imports that were found incidentally.

## How was this patch tested?

Tested on my local machine.

Author: Xin Ren <iamshrek@126.com>

Closes #14161 from keypointt/SPARK-16437.
2016-07-13 10:47:07 +01:00
Alex Bozarth f156136dae [SPARK-16375][WEB UI] Fixed misassigned var: numCompletedTasks was assigned to numSkippedTasks
## What changes were proposed in this pull request?

I fixed a misassigned var,  numCompletedTasks was assigned to numSkippedTasks in the convertJobData method

## How was this patch tested?

dev/run-tests

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #14141 from ajbozarth/spark16375.
2016-07-13 10:45:06 +01:00
Sean Owen c190d89bd3 [SPARK-15889][STREAMING] Follow-up fix to erroneous condition in StreamTest
## What changes were proposed in this pull request?

A second form of AssertQuery now actually invokes the condition; avoids a build warning too

## How was this patch tested?

Jenkins; running StreamTest

Author: Sean Owen <sowen@cloudera.com>

Closes #14133 from srowen/SPARK-15889.2.
2016-07-13 10:44:07 +01:00
aokolnychyi 772c213ec7 [SPARK-16303][DOCS][EXAMPLES] Updated SQL programming guide and examples
- Hard-coded Spark SQL sample snippets were moved into source files under examples sub-project.
- Removed the inconsistency between Scala and Java Spark SQL examples
- Scala and Java Spark SQL examples were updated

The work is still in progress. All involved examples were tested manually. An additional round of testing will be done after the code review.

![image](https://cloud.githubusercontent.com/assets/6235869/16710314/51851606-462a-11e6-9fbe-0818daef65e4.png)

Author: aokolnychyi <okolnychyyanton@gmail.com>

Closes #14119 from aokolnychyi/spark_16303.
2016-07-13 16:12:11 +08:00
Eric Liang 1c58fa905b [SPARK-16514][SQL] Fix various regex codegen bugs
## What changes were proposed in this pull request?

RegexExtract and RegexReplace currently crash on non-nullable input due to the use of a hard-coded local variable name (e.g. compilation fails with `java.lang.Exception: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 85, Column 26: Redefinition of local variable "m" `).

This changes those variables to use fresh names, and also in a few other places.
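
Why fresh names matter can be shown with a toy code generator in Python. The helpers `fresh_name`, `gen_match`, and `declared_names` are hypothetical and only mimic the idea; they are not Spark's code-generation API.

```python
import itertools

_counter = itertools.count()


def fresh_name(prefix):
    """Return a name that is unique within this code-generation run."""
    return f"{prefix}{next(_counter)}"


def gen_match(input_var, pattern, var_name=None):
    """Emit a Java-like snippet that declares a matcher variable."""
    m = var_name or fresh_name("m")
    return f"java.util.regex.Matcher {m} = {pattern}.matcher({input_var});"


def declared_names(snippets):
    # The declared variable is the token just before " = ".
    return [s.split(" = ")[0].split()[-1] for s in snippets]


# Two instances of the same expression in one generated method body.
hard_coded = [gen_match("s1", "p", var_name="m"), gen_match("s2", "p", var_name="m")]
fresh = [gen_match("s1", "p"), gen_match("s2", "p")]
```

The hard-coded variant declares the same name twice, which a Java compiler rejects as a redefinition, while the fresh variant declares two distinct names.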

## How was this patch tested?

Unit tests. rxin

Author: Eric Liang <ekl@databricks.com>

Closes #14168 from ericl/sc-3906.
2016-07-12 23:09:02 -07:00
petermaxlee 56bd399a86 [SPARK-16284][SQL] Implement reflect SQL function
## What changes were proposed in this pull request?
This patch implements reflect SQL function, which can be used to invoke a Java method in SQL. Slightly different from Hive, this implementation requires the class name and the method name to be literals. This implementation also supports only a smaller number of data types, and requires the function to be static, as suggested by rxin in #13969.

java_method is an alias for reflect, so this should also resolve SPARK-16277.

## How was this patch tested?
Added expression unit tests and an end-to-end test.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14138 from petermaxlee/reflect-static.
2016-07-13 08:05:20 +08:00
Marcelo Vanzin 7f968867ff [SPARK-16119][SQL] Support PURGE option to drop table / partition.
This option is used by Hive to directly delete the files instead of
moving them to the trash. This is needed in certain configurations
where moving the files does not work. For non-Hive tables and partitions,
Spark already behaves as if the PURGE option was set, so there's no
need to do anything.

Hive support for PURGE was added in 0.14 (for tables) and 1.2 (for
partitions), so the code reflects that: trying to use the option with
older versions of Hive will cause an exception to be thrown.
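
The version gating described above could look roughly like this in Python. The function name, the tuple representation of versions, and the exception type are assumptions made for this sketch; the minimum versions are the ones stated in the message.

```python
def check_purge_supported(hive_version, target):
    """Raise if the given Hive version cannot PURGE the given target.

    `hive_version` is a (major, minor) tuple; `target` is "table" or
    "partition".
    """
    minimum = {"table": (0, 14), "partition": (1, 2)}[target]
    if hive_version < minimum:
        raise NotImplementedError(
            f"PURGE on a {target} requires Hive {minimum[0]}.{minimum[1]} or later"
        )


check_purge_supported((0, 14), "table")      # supported, returns quietly
check_purge_supported((1, 2), "partition")   # supported, returns quietly
```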

The change is a little noisier than I would like, because of the code
to propagate the new flag through all the interfaces and implementations;
the main changes are in the parser and in HiveShim, aside from the tests
(DDLCommandSuite, VersionsSuite).

Tested by running sql and catalyst unit tests, plus VersionsSuite which
has been updated to test the version-specific behavior. I also ran an
internal test suite that uses PURGE and would not pass previously.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #13831 from vanzin/SPARK-16119.
2016-07-12 12:47:46 -07:00
Yangyang Liu 68df47aca5 [SPARK-16405] Add metrics and source for external shuffle service
## What changes were proposed in this pull request?

Since the external shuffle service is essential for Spark, it needs better monitoring. To that end, we added various metrics to the shuffle service and imported them into ExternalShuffleServiceSource for the metrics system.
Metrics added in shuffle service:
* registeredExecutorsSize
* openBlockRequestLatencyMillis
* registerExecutorRequestLatencyMillis
* blockTransferRateBytes

JIRA Issue: https://issues.apache.org/jira/browse/SPARK-16405

## How was this patch tested?

Test cases were added to verify that the metrics behave as expected in the metrics system. These unit tests are in `ExternalShuffleBlockHandlerSuite`.

Author: Yangyang Liu <yangyangliu@fb.com>

Closes #14080 from lovexi/yangyang-metrics.
2016-07-12 10:13:58 -07:00
sharkd d513c99c19 [SPARK-16414][YARN] Fix bugs for "Can not get user config when calling SparkHadoopUtil.get.conf on yarn cluster mode"
## What changes were proposed in this pull request?

When deploying Spark in YARN cluster mode, the `SparkHadoopUtil` singleton was instantiated before `ApplicationMaster` in `ApplicationMaster.main`, so the `conf` in the `SparkHadoopUtil` singleton did not include the user's configuration.

So, we should load the properties file with the Spark configuration and set its entries as system properties before `SparkHadoopUtil` is first instantiated.

## How was this patch tested?

Add a test case

Author: sharkd <sharkd.tu@gmail.com>
Author: sharkdtu <sharkdtu@tencent.com>

Closes #14088 from sharkdtu/master.
2016-07-12 10:10:35 -07:00
Reynold Xin c377e49e38 [SPARK-16489][SQL] Guard against variable reuse mistakes in expression code generation
## What changes were proposed in this pull request?
In code generation, it is incorrect for an expression to reuse variable names across different instances of the same expression. As an example, SPARK-16488 reports a bug in which the pmod expression reuses the variable name "r".

This patch updates the ExpressionEvalHelper test harness to always project two instances of the same expression, which will help us catch variable reuse problems in expression unit tests. This patch also fixes the bug in the crc32 expression.
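
Why projecting two instances helps can be shown with a toy code generator in standalone Python. This is a sketch under assumptions, not Spark's codegen: `gen_pmod`, `project`, and the naming helpers are hypothetical, and Python source text stands in for generated Java.

```python
import itertools


def reusing_names(prefix):
    # Buggy naming: every instance gets the same variable name.
    return prefix


def make_fresh_names():
    # Correct naming: each call returns a unique name such as r0, r1, ...
    counter = itertools.count()
    return lambda prefix: f"{prefix}{next(counter)}"


def gen_pmod(a, b, name_for):
    """Emit one statement computing a modulo into a named variable."""
    var = name_for("r")
    return f"{var} = {a} % {b}", var


def project(exprs, name_for):
    """Generate one function that evaluates all expressions, then run it."""
    lines, results = [], []
    for a, b in exprs:
        code, var = gen_pmod(a, b, name_for)
        lines.append("    " + code)
        results.append(var)
    body = "\n".join(lines)
    src = f"def projection():\n{body}\n    return ({', '.join(results)},)"
    namespace = {}
    exec(src, namespace)  # compile and load the generated source
    return namespace["projection"]()
```

With a single instance, `project([(7, 3)], reusing_names)` returns the right answer, so the bug stays hidden. With two instances, the second assignment overwrites the first and both outputs read the same variable, which is the kind of mistake the updated harness is meant to surface.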

## How was this patch tested?
This is a test harness change, but I also created a new test suite for testing the test harness.

Author: Reynold Xin <rxin@databricks.com>

Closes #14146 from rxin/SPARK-16489.
2016-07-12 10:07:23 -07:00
Lianhui Wang 5ad68ba5ce [SPARK-15752][SQL] Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators.
## What changes were proposed in this pull request?
When a query uses only metadata (for example, partition keys), it can return results based on that metadata without scanning files. Hive did this in HIVE-1003.
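
The idea can be sketched in standalone Python. This is not the optimizer rule from the patch: `PartitionedTableSketch` is a hypothetical stand-in in which each partition value maps to a list of rows, and the sketch assumes every partition holds at least one row, since otherwise the two answers could differ.

```python
class PartitionedTableSketch:
    """Hypothetical table: partition value -> rows stored in that partition."""

    def __init__(self, partitions):
        self.partitions = partitions
        self.files_scanned = 0

    def scan(self):
        # Full scan: touches every partition's data.
        for value, rows in self.partitions.items():
            self.files_scanned += 1
            for row in rows:
                yield {"part": value, **row}

    def distinct_by_scan(self, column):
        return sorted({row[column] for row in self.scan()})

    def distinct_partition_values(self):
        # Metadata-only: answered from the partition listing, no data read.
        # Assumes every listed partition contains at least one row.
        return sorted(self.partitions)
```

Both methods give the same answer for a query over the partition column, but only the scan-based one touches the data.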

## How was this patch tested?
Added unit tests.

Author: Lianhui Wang <lianhuiwang09@gmail.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Lianhui Wang <lianhuiwang@users.noreply.github.com>

Closes #13494 from lianhuiwang/metadata-only.
2016-07-12 18:52:15 +02:00
WeichenXu 6cb75db9ab [SPARK-16470][ML][OPTIMIZER] Check linear regression training whether actually reach convergence and add warning if not
## What changes were proposed in this pull request?

`ml.regression.LinearRegression` uses breeze's `LBFGS` and `OWLQN` optimizers for training, but does not check whether the result returned by the optimizer actually reached convergence.

The `LBFGS` and `OWLQN` optimizers in breeze can stop iterating in any of the following situations:

1) the maximum number of iterations is reached
2) the function value reaches convergence
3) the objective function stops improving
4) the gradient reaches convergence
5) the search fails (due to some internal numerical error)

I added warning code so that if the iteration ends in situation (1), (3), or (5) above, a warning is printed with the corresponding reason string.
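
The warning logic described above can be sketched in standalone Python. The enum `StopReason`, its member names, and the message wording are hypothetical and do not correspond to breeze's types; only the five situations and the choice of which ones warn are taken from the description.

```python
import logging
from enum import Enum

logger = logging.getLogger("lr_convergence_sketch")


class StopReason(Enum):
    MAX_ITERATIONS = "reached the maximum number of iterations"
    FUNCTION_VALUES_CONVERGED = "function values converged"
    OBJECTIVE_NOT_IMPROVING = "objective function stopped improving"
    GRADIENT_CONVERGED = "gradient converged"
    SEARCH_FAILED = "search failed"


# Situations (1), (3), and (5): the optimizer stopped without converging.
NOT_CONVERGED = {
    StopReason.MAX_ITERATIONS,
    StopReason.OBJECTIVE_NOT_IMPROVING,
    StopReason.SEARCH_FAILED,
}


def convergence_warning(reason):
    """Return a warning message for non-converged stops, else None."""
    if reason in NOT_CONVERGED:
        return f"Training finished without convergence: {reason.value}"
    return None


def report(reason):
    """Log the warning, if any, and return it for inspection."""
    message = convergence_warning(reason)
    if message is not None:
        logger.warning(message)
    return message
```

Situations (2) and (4) produce no message because they indicate that training did converge.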

## How was this patch tested?

Manual.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14122 from WeichenXu123/add_lr_not_convergence_warn.
2016-07-12 13:04:34 +01:00
Takuya UESHIN 5b28e02584 [SPARK-16189][SQL] Add ExternalRDD logical plan for input with RDD to have a chance to eliminate serialize/deserialize.
## What changes were proposed in this pull request?

Currently the input `RDD` of a `Dataset` is always serialized to `RDD[InternalRow]` before being used as a `Dataset`, but there are cases where `map` or `mapPartitions` is called right after the conversion to `Dataset`.
In such cases the data is serialized and then deserialized, which is unnecessary.

This PR adds an `ExternalRDD` logical plan for `RDD` input so that the serialize/deserialize pair can be eliminated.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #13890 from ueshin/issues/SPARK-16189.
2016-07-12 17:16:59 +08:00
WeichenXu fc11c509e2 [MINOR][ML] update comment where is inconsistent with code in ml.regression.LinearRegression
## What changes were proposed in this pull request?

In the `train` method of `ml.regression.LinearRegression`, when handling the case `std(label) == 0`,
the code replaces `std(label)` with `mean(label)`, but the corresponding comment was inconsistent with this. This patch updates the comment.

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14121 from WeichenXu123/update_lr_comment.
2016-07-12 09:23:59 +01:00
petermaxlee c9a6762150 [SPARK-16199][SQL] Add a method to list the referenced columns in data source Filter
## What changes were proposed in this pull request?
It would be useful to support listing the columns that are referenced by a filter. This can help simplify data source planning, because it would let us implement the unhandledFilters method in HadoopFsRelation.
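
A minimal sketch of listing referenced columns, written as standalone Python rather than against Spark's data source API. The classes below only borrow the shape of a filter tree; their names, fields, and the `references` helper are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EqualTo:
    attribute: str
    value: object


@dataclass(frozen=True)
class GreaterThan:
    attribute: str
    value: object


@dataclass(frozen=True)
class And:
    left: object
    right: object


@dataclass(frozen=True)
class Or:
    left: object
    right: object


@dataclass(frozen=True)
class Not:
    child: object


def references(filter_):
    """Return the sorted, de-duplicated column names a filter refers to."""
    if isinstance(filter_, (And, Or)):
        return sorted(set(references(filter_.left)) | set(references(filter_.right)))
    if isinstance(filter_, Not):
        return references(filter_.child)
    # Leaf filters refer to exactly one column.
    return [filter_.attribute]
```

A planner could then, for example, check whether a filter touches only a given set of columns before deciding how to treat it; that use is a guess at the motivation, not something this sketch implements.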

This is based on rxin's patch (#13901) and adds unit tests.

## How was this patch tested?
Added a new suite FiltersSuite.

Author: petermaxlee <petermaxlee@gmail.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #14120 from petermaxlee/SPARK-16199.
2016-07-11 22:23:32 -07:00
Russell Spitzer b1e5281c5c [SPARK-12639][SQL] Mark Filters Fully Handled By Sources with *
## What changes were proposed in this pull request?

In order to make it clear which filters are fully handled by the
underlying datasource, we will mark them with an *. This will give a
clear visual cue to users that the filter is being treated differently
by catalyst than filters which are just presented to the underlying
DataSource.

Examples from the FilteredScanSuite. In this example, `c IN (...)` is handled by the source and `b < ...` is not.
### Before
```
//SELECT a FROM oneToTenFiltered WHERE a + b > 9 AND b < 16 AND c IN ('bbbbbBBBBB', 'cccccCCCCC', 'dddddDDDDD', 'foo')
== Physical Plan ==
Project [a#0]
+- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
   +- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
```

### After
```
== Physical Plan ==
Project [a#0]
+- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
   +- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), *In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
```
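
The marking itself can be sketched in a few lines of standalone Python. The function `format_pushed_filters` and the use of plain strings for filters are assumptions for illustration; this is not the code from the patch.

```python
def format_pushed_filters(pushed, handled):
    """Render pushed filters, prefixing those the source fully handles with '*'."""
    marked = [("*" + f) if f in handled else f for f in pushed]
    return "PushedFilters: [" + ", ".join(marked) + "]"
```

Filters in `handled` are the ones the source evaluates completely; the rest are pushed down but still re-checked by the `Filter` operator above the scan, as in the plans shown above.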

## How was this patch tested?

Manually tested with the Spark Cassandra Connector, a source which fully handles underlying filters. Fully handled filters now appear with an * next to their names. I can add an automated test as well if requested.

Post 1.6.1
Tested by modifying the FilteredScanSuite to run explains.

Author: Russell Spitzer <Russell.Spitzer@gmail.com>

Closes #11317 from RussellSpitzer/SPARK-12639-Star.
2016-07-11 21:40:09 -07:00