Commit graph

15610 commits

Author SHA1 Message Date
zhonghaihua bd7b91cefb [SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n…
Currently, when max number of executor failures reached the `maxNumExecutorFailures`, `ApplicationMaster` will be killed and re-register another one.This time, `YarnAllocator` will be created a new instance.
But, the value of property `executorIdCounter` in `YarnAllocator` will reset to `0`. Then the Id of new executor will starting from `1`. This will confuse with the executor has already created before, which will cause FetchFailedException.
This situation is just in yarn client mode, so this is an issue in yarn client mode. For more details, [link to jira issues SPARK-12864](https://issues.apache.org/jira/browse/SPARK-12864)
This PR introduce a mechanism to initialize `executorIdCounter` after `ApplicationMaster` killed.

Author: zhonghaihua <793507405@qq.com>

Closes #10794 from zhonghaihua/initExecutorIdCounterAfterAMKilled.
2016-04-01 16:23:14 -05:00
Liang-Chi Hsieh 3e991dbc31 [SPARK-13674] [SQL] Add wholestage codegen support to Sample
JIRA: https://issues.apache.org/jira/browse/SPARK-13674

## What changes were proposed in this pull request?

Sample operator doesn't support wholestage codegen now. This pr is to add support to it.

## How was this patch tested?

A test is added into `BenchmarkWholeStageCodegen`. Besides, all tests should be passed.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #11517 from viirya/add-wholestage-sample.
2016-04-01 14:02:32 -07:00
Burak Yavuz 1b829ce139 [SPARK-14160] Time Windowing functions for Datasets
## What changes were proposed in this pull request?

This PR adds the function `window` as a column expression.

`window` can be used to bucket rows into time windows given a time column. With this expression, performing time series analysis on batch data, as well as streaming data should become much more simpler.

### Usage

Assume the following schema:

`sensor_id, measurement, timestamp`

To average 5 minute data every 1 minute (window length of 5 minutes, slide duration of 1 minute), we will use:
```scala
df.groupBy(window("timestamp", “5 minutes”, “1 minute”), "sensor_id")
  .agg(mean("measurement").as("avg_meas"))
```

This will generate windows such as:
```
09:00:00-09:05:00
09:01:00-09:06:00
09:02:00-09:07:00 ...
```

Intervals will start at every `slideDuration` starting at the unix epoch (1970-01-01 00:00:00 UTC).
To start intervals at a different point of time, e.g. 30 seconds after a minute, the `startTime` parameter can be used.

```scala
df.groupBy(window("timestamp", “5 minutes”, “1 minute”, "30 second"), "sensor_id")
  .agg(mean("measurement").as("avg_meas"))
```

This will generate windows such as:
```
09:00:30-09:05:30
09:01:30-09:06:30
09:02:30-09:07:30 ...
```

Support for Python will be made in a follow up PR after this.

## How was this patch tested?

This patch has some basic unit tests for the `TimeWindow` expression testing that the parameters pass validation, and it also has some unit/integration tests testing the correctness of the windowing and usability in complex operations (multi-column grouping, multi-column projections, joins).

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #12008 from brkyvz/df-time-window.
2016-04-01 13:19:24 -07:00
Tejas Patil 1e88615984 [SPARK-14070][SQL] Use ORC data source for SQL queries on ORC tables
## What changes were proposed in this pull request?

This patch enables use of OrcRelation for SQL queries which read data from Hive tables. Changes in this patch:

- Added a new rule `OrcConversions` which would alter the plan to use `OrcRelation`. In this diff, the conversion is done only for reads.
- Added a new config `spark.sql.hive.convertMetastoreOrc` to control the conversion

BEFORE

```
scala>  hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None

== Analyzed Logical Plan ==
key: string, value: string
Project [key#171,value#172]
+- MetastoreRelation default, orc_table, None

== Optimized Logical Plan ==
MetastoreRelation default, orc_table, None

== Physical Plan ==
HiveTableScan [key#171,value#172], MetastoreRelation default, orc_table, None
```

AFTER

```
scala> hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None

== Analyzed Logical Plan ==
key: string, value: string
Project [key#76,value#77]
+- SubqueryAlias orc_table
   +- Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>

== Optimized Logical Plan ==
Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>

== Physical Plan ==
WholeStageCodegen
:  +- Scan ORC part: struct<>, data: struct<key:string,value:string>[key#76,value#77] InputPaths: file:/user/hive/warehouse/orc_table
```

## How was this patch tested?

- Added a new unit test. Ran existing unit tests
- Ran with production like data

## Performance gains

Ran on a production table in Facebook (note that the data was in DWRF file format which is similar to ORC)

Best case : when there was no matching rows for the predicate in the query (everything is filtered out)

```
                      CPU time          Wall time     Total wall time across all tasks
================================================================
Without the change   541_515 sec    25.0 mins    165.8 hours
With change              407 sec       1.5 mins     15 mins
```

Average case: A subset of rows in the data match the query predicate

```
                        CPU time        Wall time     Total wall time across all tasks
================================================================
Without the change   624_630 sec     31.0 mins    199.0 h
With change           14_769 sec      5.3 mins      7.7 h
```

Author: Tejas Patil <tejasp@fb.com>

Closes #11891 from tejasapatil/orc_ppd.
2016-04-01 13:13:16 -07:00
Liang-Chi Hsieh a884daad80 [SPARK-14191][SQL] Remove invalid Expand operator constraints
`Expand` operator now uses its child plan's constraints as its valid constraints (i.e., the base of constraints). This is not correct because `Expand` will set its group by attributes to null values. So the nullability of these attributes should be true.

E.g., for an `Expand` operator like:

    val input = LocalRelation('a.int, 'b.int, 'c.int).where('c.attr > 10 && 'a.attr < 5 && 'b.attr > 2)
    Expand(
      Seq(
        Seq('c, Literal.create(null, StringType), 1),
        Seq('c, 'a, 2)),
      Seq('c, 'a, 'gid.int),
      Project(Seq('a, 'c), input))

The `Project` operator has the constraints `IsNotNull('a)`, `IsNotNull('b)` and `IsNotNull('c)`. But the `Expand` should not have `IsNotNull('a)` in its constraints.

This PR is the first step for this issue and remove invalid constraints of `Expand` operator.

A test is added to `ConstraintPropagationSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #11995 from viirya/fix-expand-constraints.
2016-04-01 13:08:09 -07:00
Liang-Chi Hsieh df68beb85d [SPARK-13995][SQL] Extract correct IsNotNull constraints for Expression
## What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-13995

We infer relative `IsNotNull` constraints from logical plan's expressions in `constructIsNotNullConstraints` now. However, we don't consider the case of (nested) `Cast`.

For example:

    val tr = LocalRelation('a.int, 'b.long)
    val plan = tr.where('a.attr === 'b.attr).analyze

Then, the plan's constraints will have `IsNotNull(Cast(resolveColumn(tr, "a"), LongType))`, instead of `IsNotNull(resolveColumn(tr, "a"))`. This PR fixes it.

Besides, as `IsNotNull` constraints are most useful for `Attribute`, we should do recursing through any `Expression` that is null intolerant and construct `IsNotNull` constraints for all `Attribute`s under these Expressions.

For example, consider the following constraints:

    val df = Seq((1,2,3)).toDF("a", "b", "c")
    df.where("a + b = c").queryExecution.analyzed.constraints

The inferred isnotnull constraints should be isnotnull(a), isnotnull(b), isnotnull(c), instead of isnotnull(a + c) and isnotnull(c).

## How was this patch tested?

Test is added into `ConstraintPropagationSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #11809 from viirya/constraint-cast.
2016-04-01 13:00:55 -07:00
Yanbo Liang 381358fbe9 [SPARK-14305][ML][PYSPARK] PySpark ml.clustering BisectingKMeans support export/import
## What changes were proposed in this pull request?
PySpark ml.clustering BisectingKMeans support export/import
## How was this patch tested?
doc test.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12112 from yanboliang/spark-14305.
2016-04-01 12:53:39 -07:00
jerryshao 8ba2b7f28f [SPARK-12343][YARN] Simplify Yarn client and client argument
## What changes were proposed in this pull request?

Currently in Spark on YARN, configurations can be passed through SparkConf, env and command arguments, some parts are duplicated, like client argument and SparkConf. So here propose to simplify the command arguments.

## How was this patch tested?

This patch is tested manually with unit test.

CC vanzin tgravescs , please help to suggest this proposal. The original purpose of this JIRA is to remove `ClientArguments`, through refactoring some arguments like `--class`, `--arg` are not so easy to replace, so here I remove the most part of command line arguments, only keep the minimal set.

Author: jerryshao <sshao@hortonworks.com>

Closes #11603 from jerryshao/SPARK-12343.
2016-04-01 10:52:13 -07:00
Dongjoon Hyun 58e6bc827f [MINOR] [SQL] Update usage of debug by removing typeCheck and adding debugCodegen
## What changes were proposed in this pull request?

This PR updates the usage comments of `debug` according to the following commits.
- [SPARK-9754](https://issues.apache.org/jira/browse/SPARK-9754) removed `typeCheck`.
- [SPARK-14227](https://issues.apache.org/jira/browse/SPARK-14227) added `debugCodegen`.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12094 from dongjoon-hyun/minor_fix_debug_usage.
2016-04-01 10:36:01 -07:00
sureshthalamati a471c7f9ea [SPARK-14133][SQL] Throws exception for unsupported create/drop/alter index , and lock/unlock operations.
## What changes were proposed in this pull request?

This  PR  throws Unsupported Operation exception for create index, drop index, alter index , lock table , lock database, unlock table, and unlock database operations that are not supported in Spark SQL. Currently these operations are executed executed by Hive.

Error:
spark-sql> drop index my_index on my_table;
Error in query:
Unsupported operation: drop index(line 1, pos 0)

## How was this patch tested?
Added test cases to HiveQuerySuite

yhuai hvanhovell andrewor14

Author: sureshthalamati <suresh.thalamati@gmail.com>

Closes #12069 from sureshthalamati/unsupported_ddl_spark-14133.
2016-04-01 18:33:31 +02:00
Dilip Biswal 0b04f8fdf1 [SPARK-14184][SQL] Support native execution of SHOW DATABASE command and fix SHOW TABLE to use table identifier pattern
## What changes were proposed in this pull request?

This PR addresses the following

1. Supports native execution of SHOW DATABASES command
2. Fixes SHOW TABLES to apply the identifier_with_wildcards pattern if supplied.

SHOW TABLE syntax
```
SHOW TABLES [IN database_name] ['identifier_with_wildcards'];
```
SHOW DATABASES syntax
```
SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards'];
```

## How was this patch tested?
Tests added in SQLQuerySuite (both hive and sql contexts) and DDLCommandSuite

Note: Since the table name pattern was not working , tests are added in both SQLQuerySuite to
verify the application of the table pattern.

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #11991 from dilipbiswal/dkb_show_database.
2016-04-01 18:27:11 +02:00
Cheng Lian 3715ecdf41 [SPARK-14295][MLLIB][HOTFIX] Fixes Scala 2.10 compilation failure
## What changes were proposed in this pull request?

Fixes a compilation failure introduced in PR #12088 under Scala 2.10.

## How was this patch tested?

Compilation.

Author: Cheng Lian <lian@databricks.com>

Closes #12107 from liancheng/spark-14295-hotfix.
2016-04-01 17:02:48 +08:00
Yanbo Liang 22249afb4a [SPARK-14303][ML][SPARKR] Define and use KMeansWrapper for SparkR::kmeans
## What changes were proposed in this pull request?
Define and use ```KMeansWrapper``` for ```SparkR::kmeans```. It's only the code refactor for the original ```KMeans``` wrapper.

## How was this patch tested?
Existing tests.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12039 from yanboliang/spark-14059.
2016-03-31 23:49:58 -07:00
Alexander Ulanov 26867ebc67 [SPARK-11262][ML] Unit test for gradient, loss layers, memory management for multilayer perceptron
1.Implement LossFunction trait and implement squared error and cross entropy
loss with it
2.Implement unit test for gradient and loss
3.Implement InPlace trait and in-place layer evaluation
4.Refactor interface for ActivationFunction
5.Update of Layer and LayerModel interfaces
6.Fix random weights assignment
7.Implement memory allocation by MLP model instead of individual layers

These features decreased the memory usage and increased flexibility of
internal API.

Author: Alexander Ulanov <nashb@yandex.ru>
Author: avulanov <avulanov@gmail.com>

Closes #9229 from avulanov/mlp-refactoring.
2016-03-31 23:48:36 -07:00
Cheng Lian 1b070637fa [SPARK-14295][SPARK-14274][SQL] Implements buildReader() for LibSVM
## What changes were proposed in this pull request?

This PR implements `FileFormat.buildReader()` for the LibSVM data source. Besides that, a new interface method `prepareRead()` is added to `FileFormat`:

```scala
  def prepareRead(
      sqlContext: SQLContext,
      options: Map[String, String],
      files: Seq[FileStatus]): Map[String, String] = options
```

After migrating from `buildInternalScan()` to `buildReader()`, we lost the opportunity to collect necessary global information, since `buildReader()` works in a per-partition manner. For example, LibSVM needs to infer the total number of features if the `numFeatures` data source option is not set. Any necessary collected global information should be returned using the data source options map. By default, this method just returns the original options untouched.

An alternative approach is to absorb `inferSchema()` into `prepareRead()`, since schema inference is also some kind of global information gathering. However, this approach wasn't chosen because schema inference is optional, while `prepareRead()` must be called whenever a `HadoopFsRelation` based data source relation is instantiated.

One unaddressed problem is that, when `numFeatures` is absent, now the input data will be scanned twice. The `buildInternalScan()` code path doesn't need to do this because it caches the raw parsed RDD in memory before computing the total number of features. However, with `FileScanRDD`, the raw parsed RDD is created in a different way (e.g. partitioning) from the final RDD.

## How was this patch tested?

Tested using existing test suites.

Author: Cheng Lian <lian@databricks.com>

Closes #12088 from liancheng/spark-14295-libsvm-build-reader.
2016-03-31 23:46:08 -07:00
Zhang, Liye 96941b12f8 [SPARK-14242][CORE][NETWORK] avoid copy in compositeBuffer for frame decoder
## What changes were proposed in this pull request?
In this patch, we set the initial `maxNumComponents` to `Integer.MAX_VALUE` instead of the default size ( which is 16) when allocating `compositeBuffer` in `TransportFrameDecoder` because `compositeBuffer` will introduce too many memory copies underlying if `compositeBuffer` is with default `maxNumComponents` when the frame size is large (which result in many transport messages). For details, please refer to [SPARK-14242](https://issues.apache.org/jira/browse/SPARK-14242).

## How was this patch tested?
spark unit tests and manual tests.
For manual tests, we can reproduce the performance issue with following code:
`sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Double](1024 * 1024 * 50)).iterator).reduce((a,b)=> a).length`
It's easy to see the performance gain, both from the running time and CPU usage.

Author: Zhang, Liye <liye.zhang@intel.com>

Closes #12038 from liyezhang556520/spark-14242.
2016-03-31 20:17:52 -07:00
Davies Liu f0afafdc5d [SPARK-14267] [SQL] [PYSPARK] execute multiple Python UDFs within single batch
## What changes were proposed in this pull request?

This PR support multiple Python UDFs within single batch, also improve the performance.

```python
>>> from pyspark.sql.types import IntegerType
>>> sqlContext.registerFunction("double", lambda x: x * 2, IntegerType())
>>> sqlContext.registerFunction("add", lambda x, y: x + y, IntegerType())
>>> sqlContext.sql("SELECT double(add(1, 2)), add(double(2), 1)").explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias('double('add(1, 2)), None),unresolvedalias('add('double(2), 1), None)]
+- OneRowRelation$

== Analyzed Logical Plan ==
double(add(1, 2)): int, add(double(2), 1): int
Project [double(add(1, 2))#14,add(double(2), 1)#15]
+- Project [double(add(1, 2))#14,add(double(2), 1)#15]
   +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
      +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
         +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
            +- OneRowRelation$

== Optimized Logical Plan ==
Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
+- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
   +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
      +- OneRowRelation$

== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
:     +- INPUT
+- !BatchPythonEvaluation [add(pythonUDF1#17, 1)], [pythonUDF0#16,pythonUDF1#17,pythonUDF0#18]
   +- !BatchPythonEvaluation [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
      +- Scan OneRowRelation[]
```

## How was this patch tested?

Added new tests.

Using the following script to benchmark 1, 2 and 3 udfs,
```
df = sqlContext.range(1, 1 << 23, 1, 4)
double = F.udf(lambda x: x * 2, LongType())
print df.select(double(df.id)).count()
print df.select(double(df.id), double(df.id + 1)).count()
print df.select(double(df.id), double(df.id + 1), double(df.id + 2)).count()
```
Here is the results:

N | Before | After  | speed up
---- |------------ | -------------|------
1 | 22 s | 7 s |  3.1X
2 | 38 s | 13 s | 2.9X
3 | 58 s | 16 s | 3.6X

This benchmark ran locally with 4 CPUs. For 3 UDFs, it launched 12 Python before before this patch, 4 process after this patch. After this patch, it will use less memory for multiple UDFs than before (less buffering).

Author: Davies Liu <davies@databricks.com>

Closes #12057 from davies/multi_udfs.
2016-03-31 16:40:20 -07:00
Sital Kedia 8de201baed [SPARK-14277][CORE] Upgrade Snappy Java to 1.1.2.4
## What changes were proposed in this pull request?

Upgrade snappy to 1.1.2.4 to improve snappy read/write performance.

## How was this patch tested?

Tested by running a job on the cluster and saw 7.5% cpu savings after this change.

Author: Sital Kedia <skedia@fb.com>

Closes #12096 from sitalkedia/snappyRelease.
2016-03-31 16:06:44 -07:00
Josh Rosen a7af6cd2ea [SPARK-14281][TESTS] Fix java8-tests and simplify their build
This patch fixes a compilation / build break in Spark's `java8-tests` and refactors their POM to simplify the build. See individual commit messages for more details.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12073 from JoshRosen/fix-java8-tests.
2016-03-31 13:52:59 -07:00
sethah b11887c086 [SPARK-14264][PYSPARK][ML] Add feature importance for GBTs in pyspark
## What changes were proposed in this pull request?

Feature importances are exposed in the python API for GBTs.

Other changes:
* Update the random forest feature importance documentation to not repeat decision tree docstring and instead place a reference to it.

## How was this patch tested?

Python doc tests were updated to validate GBT feature importance.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #12056 from sethah/Pyspark_GBT_feature_importance.
2016-03-31 13:00:10 -07:00
Shixiong Zhu e785402826 [SPARK-14304][SQL][TESTS] Fix tests that don't create temp files in the java.io.tmpdir folder
## What changes were proposed in this pull request?

If I press `CTRL-C` when running these tests, the temp files will be left in `sql/core` folder and I need to delete them manually. It's annoying. This PR just moves the temp files to the `java.io.tmpdir` folder and add a name prefix for them.

## How was this patch tested?

Existing Jenkins tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12093 from zsxwing/temp-file.
2016-03-31 12:17:25 -07:00
Michel Lemay 3cfbeb70b1 [SPARK-13710][SHELL][WINDOWS] Fix jline dependency on Windows
## What changes were proposed in this pull request?

Exclude jline from curator-recipes since it conflicts with scala 2.11 when running spark-shell.  Should not affect scala 2.10 since it is builtin.

## How was this patch tested?

Ran spark-shell manually.

Author: Michel Lemay <mlemay@gmail.com>

Closes #12043 from michellemay/spark-13710-fix-jline-on-windows.
2016-03-31 12:15:32 -07:00
Jo Voordeckers 10508f36ad [SPARK-11327][MESOS] Dispatcher does not respect all args from the Submit request
Supersedes https://github.com/apache/spark/pull/9752

Author: Jo Voordeckers <jo.voordeckers@gmail.com>
Author: Iulian Dragos <jaguarul@gmail.com>

Closes #10370 from jayv/mesos_cluster_params.
2016-03-31 12:08:10 -07:00
Wenchen Fan 0abee534f0 [SPARK-14069][SQL] Improve SparkStatusTracker to also track executor information
## What changes were proposed in this pull request?

Track executor information like host and port, cache size, running tasks.

TODO: tests

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11888 from cloud-fan/status-tracker.
2016-03-31 12:07:19 -07:00
Michael Gummelt 4d93b653f7 [Docs] Update monitoring.md to accurately describe the history server
It looks like the docs were recently updated to reflect the History Server's support for incomplete applications, but they still had wording that suggested only completed applications were viewable.  This fixes that.

My editor also introduced several whitespace removal changes, that I hope are OK, as text files shouldn't have trailing whitespace.  To verify they're purely whitespace changes, add `&w=1` to your browser address.  If this isn't acceptable, let me know and I'll update the PR.

I also didn't think this required a JIRA.  Let me know if I should create one.

Not tested

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #12045 from mgummelt/update-history-docs.
2016-03-31 12:06:21 -07:00
jeanlyn 8a333d2da8 [SPARK-14243][CORE] update task metrics when removing blocks
## What changes were proposed in this pull request?

This PR try to use `incUpdatedBlockStatuses ` to update the `updatedBlockStatuses ` when removing blocks, making sure `BlockManager` correctly updates `updatedBlockStatuses`

## How was this patch tested?

test("updated block statuses") in BlockManagerSuite.scala

Author: jeanlyn <jeanlyn92@gmail.com>

Closes #12091 from jeanlyn/updateBlock.
2016-03-31 12:04:42 -07:00
gatorsmile 446c45bd87 [SPARK-14182][SQL] Parse DDL Command: Alter View
This PR is to provide native parsing support for DDL commands: `Alter View`. Since its AST trees are highly similar to `Alter Table`. Thus, both implementation are integrated into the same one.

Based on the Hive DDL document:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL and https://cwiki.apache.org/confluence/display/Hive/PartitionedViews

**Syntax:**
```SQL
ALTER VIEW view_name RENAME TO new_view_name
```
 - to change the name of a view to a different name

**Syntax:**
```SQL
ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment);
```
 - to add metadata to a view

**Syntax:**
```SQL
ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
```
 - to remove metadata from a view

**Syntax:**
```SQL
ALTER VIEW view_name ADD [IF NOT EXISTS] PARTITION spec1[, PARTITION spec2, ...]
```
 - to add the partitioning metadata for a view.
 - the syntax of partition spec in `ALTER VIEW` is identical to `ALTER TABLE`, **EXCEPT** that it is **ILLEGAL** to specify a `LOCATION` clause.

**Syntax:**
```SQL
ALTER VIEW view_name DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...]
```
 - to drop the related partition metadata for a view.

Added the related test cases to `DDLCommandSuite`

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #11987 from gatorsmile/parseAlterView.
2016-03-31 12:04:03 -07:00
Nishkam Ravi ac1b8b302a [SPARK-13796] Redirect error message to logWarning
## What changes were proposed in this pull request?

Redirect error message to logWarning

## How was this patch tested?

Unit tests, manual tests

JoshRosen

Author: Nishkam Ravi <nishkamravi@gmail.com>

Closes #12052 from nishkamravi2/master_warning.
2016-03-31 12:03:05 -07:00
Sameer Agarwal 3586929320 [SPARK-14278][SQL] Initialize columnar batch with proper memory mode
## What changes were proposed in this pull request?

Fixes a minor bug in the record reader constructor that was possibly introduced during refactoring.

## How was this patch tested?

N/A

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12070 from sameeragarwal/vectorized-rr.
2016-03-31 11:56:28 -07:00
Sameer Agarwal 8d6207206c [SPARK-14263][SQL] Benchmark Vectorized HashMap for GroupBy Aggregates
## What changes were proposed in this pull request?

This PR proposes a new data-structure based on a vectorized hashmap that can be potentially _codegened_ in `TungstenAggregate` to speed up aggregates with group by. Micro-benchmarks show a 10x improvement over the current `BytesToBytes` aggregation map.

## How was this patch tested?

    Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
    BytesToBytesMap:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    hash                                      108 /  119         96.9          10.3       1.0X
    fast hash                                  63 /   70        166.2           6.0       1.7X
    arrayEqual                                 70 /   73        150.8           6.6       1.6X
    Java HashMap (Long)                       141 /  200         74.3          13.5       0.8X
    Java HashMap (two ints)                   145 /  185         72.3          13.8       0.7X
    Java HashMap (UnsafeRow)                  499 /  524         21.0          47.6       0.2X
    BytesToBytesMap (off Heap)                483 /  548         21.7          46.0       0.2X
    BytesToBytesMap (on Heap)                 485 /  562         21.6          46.2       0.2X
    Vectorized Hashmap                         54 /   60        193.7           5.2       2.0X

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12055 from sameeragarwal/vectorized-hashmap.
2016-03-31 11:53:13 -07:00
Xusen Yin 8b207f3b6a [SPARK-11892][ML] Model export/import for spark.ml: OneVsRest
# What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-11892

Add save/load for spark ml.OneVsRest and its model. Also add OneVsRest and OneVsRestModel in MetaAlgorithmReadWrite.

# How was this patch tested?

Test with Scala unit test.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9934 from yinxusen/SPARK-11892.
2016-03-31 11:17:32 -07:00
Yuhao Yang a0a1991580 [SPARK-13782][ML] Model export/import for spark.ml: BisectingKMeans
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-13782
Model export/import for BisectingKMeans in spark.ml and mllib

## How was this patch tested?

unit tests

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11933 from hhbyyh/bisectingsave.
2016-03-31 11:12:40 -07:00
jerryshao 3b3cc76004 [SPARK-14062][YARN] Fix log4j and upload metrics.properties automatically with distributed cache
## What changes were proposed in this pull request?

1. Currently log4j which uses distributed cache only adds to AM's classpath, not executor's, this is introduced in #9118, which breaks the original meaning of that PR, so here add log4j file to the classpath of both AM and executors.
2. Automatically upload metrics.properties to distributed cache, so that it could be used by remote driver and executors implicitly.

## How was this patch tested?

Unit test and integration test is done.

Author: jerryshao <sshao@hortonworks.com>

Closes #11885 from jerryshao/SPARK-14062.
2016-03-31 10:27:33 -07:00
Dongjoon Hyun 208fff3ac8 [SPARK-14164][MLLIB] Improve input layer validation of MultilayerPerceptronClassifier
## What changes were proposed in this pull request?

This issue improves an input layer validation and adds related testcases to MultilayerPerceptronClassifier.

```scala
-    // TODO: how to check ALSO that all elements are greater than 0?
-    ParamValidators.arrayLengthGt(1)
+    (t: Array[Int]) => t.forall(ParamValidators.gt(0)) && t.length > 1
```

## How was this patch tested?

Pass the Jenkins tests including the new testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11964 from dongjoon-hyun/SPARK-14164.
2016-03-31 09:39:15 -07:00
Herman van Hovell a9b93e0739 [SPARK-14211][SQL] Remove ANTLR3 based parser
### What changes were proposed in this pull request?

This PR removes the ANTLR3 based parser, and moves the new ANTLR4 based parser into the `org.apache.spark.sql.catalyst.parser package`.

### How was this patch tested?

Existing unit tests.

cc rxin andrewor14 yhuai

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12071 from hvanhovell/SPARK-14211.
2016-03-31 09:25:09 -07:00
Cheng Lian 26445c2e47 [SPARK-14206][SQL] buildReader() implementation for CSV
## What changes were proposed in this pull request?

Major changes:

1. Implement `FileFormat.buildReader()` for the CSV data source.
1. Add an extra argument to `FileFormat.buildReader()`, `physicalSchema`, which is basically the result of `FileFormat.inferSchema` or user specified schema.

   This argument is necessary because the CSV data source needs to know all the columns of the underlying files to read the file.

## How was this patch tested?

Existing tests should do the work.

Author: Cheng Lian <lian@databricks.com>

Closes #12002 from liancheng/spark-14206-csv-build-reader.
2016-03-30 18:21:06 -07:00
Travis Crawford da54abfd87 [SPARK-14081][SQL] - Preserve DataFrame column types when filling nulls.
## What changes were proposed in this pull request?
This change resolves an issue where `DataFrameNaFunctions.fill` changes a `FloatType` column to a `DoubleType`. We also clarify the contract that replacement values will be cast to the column data type, which may change the replacement value when casting to a lower precision type.

## How was this patch tested?
This patch has associated unit tests.

Author: Travis Crawford <travis@medium.com>

Closes #11967 from traviscrawford/SPARK-14081-dataframena.
2016-03-30 16:59:52 -07:00
Dongjoon Hyun 258a243419 [SPARK-14282][SQL] CodeFormatter should handle oneline comment with /* */ properly
## What changes were proposed in this pull request?

This PR improves `CodeFormatter` to fix the following malformed indentations.
```java
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */     InternalRow i = (InternalRow) _i;
/* 021 */     /* createexternalrow(if (isnull(input[0, double])) null else input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
/* 022 */       boolean isNull = false;
/* 023 */       final Object[] values = new Object[2];
/* 024 */       /* if (isnull(input[0, double])) null else input[0, double] */
/* 025 */     /* isnull(input[0, double]) */
...
/* 053 */     if (!false && false) {
/* 054 */       /* null */
/* 055 */     final int value9 = -1;
/* 056 */     isNull6 = true;
/* 057 */     value6 = value9;
/* 058 */   } else {
...
/* 077 */   return mutableRow;
/* 078 */ }
/* 079 */ }
/* 080 */
```

After this PR, the code will be formatted like the following.
```java
/* 019 */   public java.lang.Object apply(java.lang.Object _i) {
/* 020 */     InternalRow i = (InternalRow) _i;
/* 021 */     /* createexternalrow(if (isnull(input[0, double])) null else input[0, double], if (isnull(input[1, int])) null else input[1, int], ... */
/* 022 */     boolean isNull = false;
/* 023 */     final Object[] values = new Object[2];
/* 024 */     /* if (isnull(input[0, double])) null else input[0, double] */
/* 025 */     /* isnull(input[0, double]) */
...
/* 053 */     if (!false && false) {
/* 054 */       /* null */
/* 055 */       final int value9 = -1;
/* 056 */       isNull6 = true;
/* 057 */       value6 = value9;
/* 058 */     } else {
...
/* 077 */     return mutableRow;
/* 078 */   }
/* 079 */ }
/* 080 */
```

Also, this issue fixes the following too. (Similar with [SPARK-14185](https://issues.apache.org/jira/browse/SPARK-14185))
```java
16/03/30 12:39:24 DEBUG WholeStageCodegen: /* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
```
```java
16/03/30 12:46:32 DEBUG WholeStageCodegen:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
```

## How was this patch tested?

Pass the Jenkins tests (including new CodeFormatterSuite testcases.)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12072 from dongjoon-hyun/SPARK-14282.
2016-03-30 16:15:37 -07:00
Takeshi YAMAMURO dadf0138b3 [SPARK-14259][SQL] Add a FileSourceStrategy option for limiting #files in a partition
## What changes were proposed in this pull request?
This pr is to add a config to control the maximum number of files as even small files have a non-trivial fixed cost. The current packing can put a lot of small files together which cases straggler tasks.

## How was this patch tested?
I added tests to check if many files get split into partitions in FileSourceStrategySuite.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #12068 from maropu/SPARK-14259.
2016-03-30 16:02:48 -07:00
Yuhao Yang ca458618d8 [SPARK-11507][MLLIB] add compact in Matrices fromBreeze
jira: https://issues.apache.org/jira/browse/SPARK-11507
"In certain situations when adding two block matrices, I get an error regarding colPtr and the operation fails. External issue URL includes full error and code for reproducing the problem."

root cause: colPtr.last does NOT always equal to values.length in breeze SCSMatrix, which fails the require in SparseMatrix.

easy step to repro:
```
val m1: BM[Double] = new CSCMatrix[Double] (Array (1.0, 1, 1), 3, 3, Array (0, 1, 2, 3), Array (0, 1, 2) )
val m2: BM[Double] = new CSCMatrix[Double] (Array (1.0, 2, 2, 4), 3, 3, Array (0, 0, 2, 4), Array (1, 2, 1, 2) )
val sum = m1 + m2
Matrices.fromBreeze(sum)
```

Solution: By checking the code in [CSCMatrix](28000a7b90/math/src/main/scala/breeze/linalg/CSCMatrix.scala), CSCMatrix in breeze can have extra zeros in the end of data array. Invoking compact will make sure it aligns with the require of SparseMatrix. This should add limited overhead as the actual compact operation is only performed when necessary.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9520 from hhbyyh/matricesFromBreeze.
2016-03-30 15:58:19 -07:00
Yanbo Liang f301df37cb [SPARK-14152][ML][PYSPARK] MultilayerPerceptronClassifier supports save/load for Python API
## What changes were proposed in this pull request?
```MultilayerPerceptronClassifier``` supports save/load for Python API.

## How was this patch tested?
doctest.

cc mengxr jkbradley yinxusen

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11952 from yanboliang/spark-14152.
2016-03-30 15:47:01 -07:00
Yanbo Liang 5dc948e812 [MINOR][ML] Fix the wrong param name of LDA topicDistributionCol
## What changes were proposed in this pull request?
Fix the wrong param name of LDA ```topicDistributionCol```.
## How was this patch tested?
No tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12065 from yanboliang/lda-topicDistributionCol.
2016-03-30 14:57:38 -07:00
Xusen Yin 529d6ce8f9 [SPARK-14181] TrainValidationSplit should have HasSeed
https://issues.apache.org/jira/browse/SPARK-14181

TrainValidationSplit should have HasSeed for the random split of RDD. I also changed the random split from the RDD function to the DataFrame function.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #11985 from yinxusen/SPARK-14181.
2016-03-30 14:32:29 -07:00
Marcelo Vanzin bdabfd43f6 [SPARK-13955][YARN] Also look for Spark jars in the build directory.
Move the logic to find Spark jars to CommandBuilderUtils and make it
available for YARN code, so that it's possible to easily launch Spark
on YARN from a build directory.

Tested by running SparkPi from the build directory on YARN.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11970 from vanzin/SPARK-13955.
2016-03-30 13:59:10 -07:00
Wenchen Fan d46c71b39d [SPARK-14268][SQL] rename toRowExpressions and fromRowExpression to serializer and deserializer in ExpressionEncoder
## What changes were proposed in this pull request?

In `ExpressionEncoder`, we use `constructorFor` to build `fromRowExpression` as the `deserializer` in `ObjectOperator`. It's kind of confusing, we should make the name consistent.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12058 from cloud-fan/rename.
2016-03-30 11:03:15 -07:00
Wenchen Fan 816f359cf0 [SPARK-14114][SQL] implement buildReader for text data source
## What changes were proposed in this pull request?

This PR implements buildReader for text data source and enable it in the new data source code path.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11934 from cloud-fan/text.
2016-03-30 17:32:53 +08:00
Shixiong Zhu 7320f9bd19 [SPARK-14254][CORE] Add logs to help investigate the network performance
## What changes were proposed in this pull request?

It would be very helpful for network performance investigation if we log the time spent on connecting and resolving host.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12046 from zsxwing/connection-time.
2016-03-29 21:14:48 -07:00
gatorsmile b66b97cd04 [SPARK-14124][SQL] Implement Database-related DDL Commands
#### What changes were proposed in this pull request?
This PR is to implement the following four Database-related DDL commands:
 - `CREATE DATABASE|SCHEMA [IF NOT EXISTS] database_name`
 - `DROP DATABASE [IF EXISTS] database_name [RESTRICT|CASCADE]`
 - `DESCRIBE DATABASE [EXTENDED] db_name`
 - `ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...)`

Another PR will be submitted to handle the unsupported commands. In the Database-related DDL commands, we will issue an error exception for `ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role`.

cc yhuai andrewor14 rxin Could you review the changes? Is it in the right direction? Thanks!

#### How was this patch tested?
Added a few test cases in `command/DDLSuite.scala` for testing DDL command execution in `SQLContext`. Since `HiveContext` also shares the same implementation, the existing test cases in `\hive` also verifies the correctness of these commands.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #12009 from gatorsmile/dbDDL.
2016-03-29 17:39:52 -07:00
tedyu e1f6845391 [SPARK-12181] Check Cached unaligned-access capability before using Unsafe
## What changes were proposed in this pull request?

For MemoryMode.OFF_HEAP, Unsafe.getInt etc. are used with no restriction.

However, the Oracle implementation uses these methods only if the class variable unaligned (commented as "Cached unaligned-access capability") is true, which seems to be calculated whether the architecture is i386, x86, amd64, or x86_64.

I think we should perform similar check for the use of Unsafe.

Reference: https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L112

## How was this patch tested?

Unit test suite

Author: tedyu <yuzhihong@gmail.com>

Closes #11943 from tedyu/master.
2016-03-29 17:16:53 -07:00
Sameer Agarwal 366cac6fb0 [SPARK-14225][SQL] Cap the length of toCommentSafeString at 128 chars
## What changes were proposed in this pull request?

Builds on https://github.com/apache/spark/pull/12022 and (a) appends "..." to truncated comment strings and (b) fixes indentation in lines after the commented strings if they happen to have a `(`, `{`, `)` or `}`

## How was this patch tested?

Manually examined the generated code.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12044 from sameeragarwal/comment.
2016-03-29 16:46:45 -07:00