Commit graph

15466 commits

Author SHA1 Message Date
Dongjoon Hyun cfcca732b4 [MINOR][SQL] Fix substr/substring testcases.
## What changes were proposed in this pull request?

This PR fixes the following two testcases in order to test the correct usages.
```
checkSqlGeneration("SELECT substr('This is a test', 'is')")
checkSqlGeneration("SELECT substring('This is a test', 'is')")
```

Actually, the testcases works but tests on exceptional cases.
```
scala> sql("SELECT substr('This is a test', 'is')")
res0: org.apache.spark.sql.DataFrame = [substring(This is a test, CAST(is AS INT), 2147483647): string]

scala> sql("SELECT substr('This is a test', 'is')").collect()
res1: Array[org.apache.spark.sql.Row] = Array([null])
```

## How was this patch tested?

Pass the modified unit tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11963 from dongjoon-hyun/fix_substr_testcase.
2016-03-27 20:06:02 +01:00
gatorsmile a01b6a92b5 [SPARK-14177][SQL] Native Parsing for DDL Command "Describe Database" and "Alter Database"
#### What changes were proposed in this pull request?

This PR is to provide native parsing support for two DDL commands:  ```Describe Database``` and ```Alter Database Set Properties```

Based on the Hive DDL document:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

##### 1. ALTER DATABASE
**Syntax:**
```SQL
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...)
```
 - `ALTER DATABASE` is to add new (key, value) pairs into `DBPROPERTIES`

##### 2. DESCRIBE DATABASE
**Syntax:**
```SQL
DESCRIBE DATABASE [EXTENDED] db_name
```
 - `DESCRIBE DATABASE` shows the name of the database, its comment (if one has been set), and its root location on the filesystem. When `extended` is true, it also shows the database's properties

#### How was this patch tested?
Added the related test cases to `DDLCommandSuite`

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

This patch had conflicts when merged, resolved by
Committer: Yin Huai <yhuai@databricks.com>

Closes #11977 from gatorsmile/parseAlterDatabase.
2016-03-26 20:12:30 -07:00
Liang-Chi Hsieh bc925b73a6 [SPARK-14157][SQL] Parse Drop Function DDL command
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14157

We only parse create function command. In order to support native drop function command, we need to parse it too.

From Hive [manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/ReloadFunction), the drop function command has syntax as:

DROP [TEMPORARY] FUNCTION [IF EXISTS] function_name;

## How was this patch tested?

Added test into `DDLCommandSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #11959 from viirya/parse-drop-func.
2016-03-26 20:09:01 -07:00
Cheng Lian b547de8a60 [SPARK-14116][SQL] Implements buildReader() for ORC data source
## What changes were proposed in this pull request?

This PR implements `FileFormat.buildReader()` for our ORC data source. It also fixed several minor styling issues related to `HadoopFsRelation` planning code path.

Note that `OrcNewInputFormat` doesn't rely on `OrcNewSplit` for creating `OrcRecordReader`s, plain `FileSplit` is just fine. That's why we can simply create the record reader with the help of `OrcNewInputFormat` and `FileSplit`.

## How was this patch tested?

Existing test cases should do the work

Author: Cheng Lian <lian@databricks.com>

Closes #11936 from liancheng/spark-14116-build-reader-for-orc.
2016-03-26 16:10:35 -07:00
gatorsmile 8989d3a396 [SPARK-14161][SQL] Native Parsing for DDL Command Drop Database
### What changes were proposed in this pull request?
Based on the Hive DDL document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

The syntax of DDL command for Drop Database is
```SQL
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
```
 - If `IF EXISTS` is not specified, the default behavior is to issue a warning message if `database_name` does't exist
 - `RESTRICT` is the default behavior.

This PR is to provide a native parsing support for `DROP DATABASE`.

#### How was this patch tested?

Added a test case `DDLCommandSuite`

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11962 from gatorsmile/parseDropDatabase.
2016-03-26 14:11:13 -07:00
Josh Rosen 20c0bcd972 [SPARK-14135] Add off-heap storage memory bookkeeping support to MemoryManager
This patch extends Spark's `UnifiedMemoryManager` to add bookkeeping support for off-heap storage memory, an requirement for enabling off-heap caching (which will be done by #11805). The `MemoryManager`'s `storageMemoryPool` has been split into separate on- and off-heap pools and the storage and unroll memory allocation methods have been updated to accept a `memoryMode` parameter to specify whether allocations should be performed on- or off-heap.

In order to reduce the testing surface, the `StaticMemoryManager` does not support off-heap caching (we plan to eventually remove the `StaticMemoryManager`, so this isn't a significant limitation).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11942 from JoshRosen/off-heap-storage-memory-bookkeeping.
2016-03-26 11:03:25 -07:00
Davies Liu bd94ea4c80 [SPARK-14175][SQL] whole stage codegen interface refactor
## What changes were proposed in this pull request?

1. merge consumeChild into consume()
2. always generate code for input variables and UnsafeRow, a plan can use eight of them.

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #11975 from davies/gen_refactor.
2016-03-26 11:03:05 -07:00
Rekha Joshi a91784fb6e [SPARK-13973][PYSPARK] ipython notebook` is going away
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13973

## How was this patch tested?
Pyspark

Author: Rekha Joshi <rekhajoshm@gmail.com>
Author: Joshi <rekhajoshm@gmail.com>

Closes #11829 from rekhajoshm/SPARK-13973.
2016-03-26 12:53:37 +00:00
Liwei Lin 62a85eb09f [SPARK-14089][CORE][MLLIB] Remove methods that has been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5
## What changes were proposed in this pull request?

Removed methods that has been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5.

## How was this patch tested?

- manully checked that no codes in Spark call these methods any more
- existing test suits

Author: Liwei Lin <lwlin7@gmail.com>
Author: proflin <proflin.me@gmail.com>

Closes #11910 from lw-lin/remove-deprecates.
2016-03-26 12:41:34 +00:00
Dongjoon Hyun 1808465855 [MINOR] Fix newly added java-lint errors
## What changes were proposed in this pull request?

This PR fixes some newly added java-lint errors(unused-imports, line-lengsth).

## How was this patch tested?

Pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11968 from dongjoon-hyun/SPARK-14167.
2016-03-26 11:55:49 +00:00
Shixiong Zhu d23ad7c1c9 [SPARK-13874][DOC] Remove docs of streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter
## What changes were proposed in this pull request?

This PR removes all docs about the old streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter projects since I have already copied them to https://github.com/spark-packages

Also remove mqtt_wordcount.py that I forgot to remove previously.

## How was this patch tested?

Jenkins PR Build.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11824 from zsxwing/remove-doc.
2016-03-26 01:47:27 -07:00
Tathagata Das 13945dd83b [SPARK-14109][SQL] Fix HDFSMetadataLog to fallback from FileContext to FileSystem API
## What changes were proposed in this pull request?

HDFSMetadataLog uses newer FileContext API to achieve atomic renaming. However, FileContext implementations may not exist for many scheme for which there may be FileSystem implementations. In those cases, rather than failing completely, we should fallback to the FileSystem based implementation, and log warning that there may be file consistency issues in case the log directory is concurrently modified.

In addition I have also added more tests to increase the code coverage.

## How was this patch tested?

Unit test.
Tested on cluster with custom file system.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #11925 from tdas/SPARK-14109.
2016-03-25 20:07:54 -07:00
Shixiong Zhu 24587ce433 [SPARK-14073][STREAMING][TEST-MAVEN] Move flume back to Spark
## What changes were proposed in this pull request?

This PR moves flume back to Spark as per the discussion in the dev mail-list.

## How was this patch tested?

Existing Jenkins tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11895 from zsxwing/move-flume-back.
2016-03-25 17:37:16 -07:00
Joseph K. Bradley 54d13bed87 [SPARK-14159][ML] Fixed bug in StringIndexer + related issue in RFormula
## What changes were proposed in this pull request?

StringIndexerModel.transform sets the output column metadata to use name inputCol.  It should not.  Fixing this causes a problem with the metadata produced by RFormula.

Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and I modified VectorAttributeRewriter to find and replace all "prefixes" since attributes collect multiple prefixes from StringIndexer + Interaction.

Note that "prefixes" is no longer accurate since internal strings may be replaced.

## How was this patch tested?

Unit test which failed before this fix.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11965 from jkbradley/StringIndexer-fix.
2016-03-25 16:00:09 -07:00
Rajesh Balamohan ff7cc45f52 [SPARK-14091][CORE] Improve performance of SparkContext.getCallSite()
Currently SparkContext.getCallSite() makes a call to Utils.getCallSite().

```
 private[spark] def getCallSite(): CallSite = {
    val callSite = Utils.getCallSite()
    CallSite(
      Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
      Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
    )
  }
```
However, in some places utils.withDummyCallSite(sc) is invoked to avoid expensive threaddumps within getCallSite(). But Utils.getCallSite() is evaluated earlier causing threaddumps to be computed.

This can have severe impact on smaller queries (that finish in 10-20 seconds) having large number of RDDs.

Creating this patch for lazy evaluation of  getCallSite.

No new test cases are added. Following standalone test was tried out manually. Also, built entire spark binary and tried with few SQL queries in TPC-DS  and TPC-H in multi node cluster
```
def run(): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext("local[1]", "test-context", conf)
    val start: Long = System.currentTimeMillis();
    val confBroadcast = sc.broadcast(new SerializableConfiguration(new Configuration()))
    Utils.withDummyCallSite(sc) {
      //Large tables end up creating 5500 RDDs
      for(i <- 1 to 5000) {
       //ignore nulls in RDD as its mainly for testing callSite
        val testRDD = new HadoopRDD(sc, confBroadcast, None, null,
          classOf[NullWritable], classOf[Writable], 10)
      }
    }
    val end: Long = System.currentTimeMillis();
    println("Time taken : " + (end - start))
  }

def main(args: Array[String]): Unit = {
    run
  }
```

Author: Rajesh Balamohan <rbalamohan@apache.org>

Closes #11911 from rajeshbalamohan/SPARK-14091.
2016-03-25 15:09:52 -07:00
Shixiong Zhu b554b3c46b [SPARK-14131][SQL] Add a workaround for HADOOP-10622 to fix DataFrameReaderWriterSuite
## What changes were proposed in this pull request?

There is a potential dead-lock in Hadoop Shell.runCommand before 2.5.0 ([HADOOP-10622](https://issues.apache.org/jira/browse/HADOOP-10622)). If we interrupt some thread running Shell.runCommand, we may hit this issue.

This PR adds some protecion to prevent from interrupting the microBatchThread when we may run into Shell.runCommand. There are two places will call Shell.runCommand now:

- offsetLog.add
- FileStreamSource.getOffset

They will create a file using HDFS API and call Shell.runCommand to set the file permission.

## How was this patch tested?

Existing unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11940 from zsxwing/workaround-for-HADOOP-10622.
2016-03-25 13:28:26 -07:00
Sameer Agarwal afd0debe07 [SPARK-14137] [SPARK-14150] [SQL] Infer IsNotNull constraints from non-nullable attributes
## What changes were proposed in this pull request?

This PR adds support for automatically inferring `IsNotNull` constraints from any non-nullable attributes that are part of an operator's output. This also fixes the issue that causes the optimizer to hit the maximum number of iterations for certain queries in https://github.com/apache/spark/pull/11828.

## How was this patch tested?

Unit test in `ConstraintPropagationSuite`

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11953 from sameeragarwal/infer-isnotnull.
2016-03-25 12:57:26 -07:00
Liang-Chi Hsieh ca003354da [SPARK-12443][SQL] encoderFor should support Decimal
## What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-12443

`constructorFor` will call `dataTypeFor` to determine if a type is `ObjectType` or not. If there is not case for `Decimal`, it will be recognized as `ObjectType` and causes the bug.

## How was this patch tested?

Test is added into `ExpressionEncoderSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10399 from viirya/fix-encoder-decimal.
2016-03-25 12:07:56 -07:00
Tathagata Das 11fa8741ca [SQL][HOTFIX] Fix flakiness in StateStoreRDDSuite
## What changes were proposed in this pull request?
StateStoreCoordinator.reportActiveInstance is async, so subsequence state checks must be in eventually.
## How was this patch tested?
Jenkins tests

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #11924 from tdas/state-store-flaky-fix.
2016-03-25 12:04:47 -07:00
Sameer Agarwal b5f8c36e3c [SPARK-14144][SQL] Explicitly identify/catch UnsupportedOperationException during parquet reader initialization
## What changes were proposed in this pull request?

This PR is a minor cleanup task as part of https://issues.apache.org/jira/browse/SPARK-14008 to explicitly identify/catch the `UnsupportedOperationException` while initializing the vectorized parquet reader. Other exceptions will simply be thrown back to `SqlNewHadoopPartition`.

## How was this patch tested?

N/A (cleanup only; no new functionality added)

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11950 from sameeragarwal/parquet-cleanup.
2016-03-25 11:48:05 -07:00
Wenchen Fan 43b15e01c4 [SPARK-14061][SQL] implement CreateMap
## What changes were proposed in this pull request?

As we have `CreateArray` and `CreateStruct`, we should also have `CreateMap`.  This PR adds the `CreateMap` expression, and the DataFrame API, and python API.

## How was this patch tested?

various new tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11879 from cloud-fan/create_map.
2016-03-25 09:50:06 -07:00
Davies Liu 6603d9f7e2 [SPARK-13919] [SQL] fix column pruning through filter
## What changes were proposed in this pull request?

This PR fix the conflict between ColumnPruning and PushPredicatesThroughProject, because ColumnPruning will try to insert a Project before Filter, but PushPredicatesThroughProject will move the Filter before Project.This is fixed by remove the Project before Filter, if the Project only do column pruning.

The RuleExecutor will fail the test if reached max iterations.

Closes #11745

## How was this patch tested?

Existing tests.

This is a test case still failing, disabled for now, will be fixed by https://issues.apache.org/jira/browse/SPARK-14137

Author: Davies Liu <davies@databricks.com>

Closes #11828 from davies/fail_rule.
2016-03-25 09:05:23 -07:00
Holden Karau 55a605763d [SPARK-13887][PYTHON][TRIVIAL][BUILD] Make lint-python script fail fast
## What changes were proposed in this pull request?

Change lint python script to stop on first error rather than building them up so its clearer why we failed (requested by rxin). Also while in the file, remove the commented out code.

## How was this patch tested?

Manually ran lint-python script with & without pep8 errors locally and verified expected results.

Author: Holden Karau <holden@us.ibm.com>

Closes #11898 from holdenk/SPARK-13887-pylint-fast-fail.
2016-03-25 12:53:34 +00:00
Wenchen Fan e9b6e7d857 [SPARK-13456][SQL][FOLLOW-UP] lazily generate the outer pointer for case class defined in REPL
## What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/11410, we missed a corner case: define the inner class and use it in `Dataset` at the same time by using paste mode. For this case, the inner class and the `Dataset` are inside same line object, when we build the `Dataset`, we try to get outer pointer from line object, and it will fail because the line object is not initialized yet.

https://issues.apache.org/jira/browse/SPARK-13456?focusedCommentId=15209174&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209174 is an example for this corner case.

This PR make the process of getting outer pointer from line object lazy, so that we can successfully build the `Dataset` and finish initializing the line object.

## How was this patch tested?

new test in repl suite.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11931 from cloud-fan/repl.
2016-03-25 20:19:04 +08:00
Reynold Xin 70a6f0bb57 [SPARK-14149] Log exceptions in tryOrIOException
## What changes were proposed in this pull request?
We ran into a problem today debugging some class loading problem during deserialization, and JVM was masking the underlying exception which made it very difficult to debug. We can however log the exceptions using try/catch ourselves in serialization/deserialization. The good thing is that all these methods are already using Utils.tryOrIOException, so we can just put the try catch and logging in a single place.

## How was this patch tested?
A logging change with a manual test.

Author: Reynold Xin <rxin@databricks.com>

Closes #11951 from rxin/SPARK-14149.
2016-03-25 01:17:23 -07:00
Andrew Or 20ddf5fddf [SPARK-14014][SQL] Integrate session catalog (attempt #2)
## What changes were proposed in this pull request?

This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests.

## How was this patch tested?

See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.

Author: Andrew Or <andrew@databricks.com>

Closes #11938 from andrewor14/session-catalog-again.
2016-03-24 22:59:35 -07:00
Reynold Xin 1c70b7650f [SPARK-14145][SQL] Remove the untyped version of Dataset.groupByKey
## What changes were proposed in this pull request?
Dataset has two variants of groupByKey, one for untyped and the other for typed. It actually doesn't make as much sense to have an untyped API here, since apps that want to use untyped APIs should just use the groupBy "DataFrame" API.

## How was this patch tested?
This patch removes a method, and removes the associated tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #11949 from rxin/SPARK-14145.
2016-03-24 22:56:34 -07:00
Reynold Xin 3619fec1ec [SPARK-14142][SQL] Replace internal use of unionAll with union
## What changes were proposed in this pull request?
unionAll has been deprecated in SPARK-14088.

## How was this patch tested?
Should be covered by all existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #11946 from rxin/SPARK-14142.
2016-03-24 22:34:55 -07:00
Yanbo Liang 13cbb2de70 [SPARK-13010][ML][SPARKR] Implement a simple wrapper of AFTSurvivalRegression in SparkR
## What changes were proposed in this pull request?
This PR continues the work in #11447, we implemented the wrapper of ```AFTSurvivalRegression``` named ```survreg``` in SparkR.

## How was this patch tested?
Test against output from R package survival's survreg.

cc mengxr felixcheung

Close #11447

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11932 from yanboliang/spark-13010-new.
2016-03-24 22:29:34 -07:00
gatorsmile 05f652d6c2 [SPARK-13957][SQL] Support Group By Ordinal in SQL
#### What changes were proposed in this pull request?
This PR is to support group by position in SQL. For example, when users input the following query
```SQL
select c1 as a, c2, c3, sum(*) from tbl group by 1, 3, c4
```
The ordinals are recognized as the positions in the select list. Thus, `Analyzer` converts it to
```SQL
select c1, c2, c3, sum(*) from tbl group by c1, c3, c4
```

This is controlled by the config option `spark.sql.groupByOrdinal`.
- When true, the ordinal numbers in group by clauses are treated as the position in the select list.
- When false, the ordinal numbers are ignored.
- Only convert integer literals (not foldable expressions). If found foldable expressions, ignore them.
- When the positions specified in the group by clauses correspond to the aggregate functions in select list, output an exception message.
- star is not allowed to use in the select list when users specify ordinals in group by

Note: This PR is taken from https://github.com/apache/spark/pull/10731. When merging this PR, please give the credit to zhichao-li

Also cc all the people who are involved in the previous discussion:  rxin cloud-fan marmbrus yhuai hvanhovell adrian-wang chenghao-intel tejasapatil

#### How was this patch tested?

Added a few test cases for both positive and negative test cases.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #11846 from gatorsmile/groupByOrdinal.
2016-03-25 12:55:58 +08:00
GayathriMurali 0874ff3aad [SPARK-13949][ML][PYTHON] PySpark ml DecisionTreeClassifier, Regressor support export/import
## What changes were proposed in this pull request?

Added MLReadable and MLWritable to Decision Tree Classifier and Regressor. Added doctests.

## How was this patch tested?

Python Unit tests. Tests added to check persistence in DecisionTreeClassifier and DecisionTreeRegressor.

Author: GayathriMurali <gayathri.m.softie@gmail.com>

Closes #11892 from GayathriMurali/SPARK-13949.
2016-03-24 19:20:49 -07:00
sethah 585097716c [SPARK-14107][PYSPARK][ML] Add seed as named argument to GBTs in pyspark
## What changes were proposed in this pull request?

GBTs in pyspark previously had seed parameters, but they could not be passed as keyword arguments through the class constructor. This patch adds seed as a keyword argument and also sets default value.

## How was this patch tested?

Doc tests were updated to pass a random seed through the GBTClassifier and GBTRegressor constructors.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11944 from sethah/SPARK-14107.
2016-03-24 19:14:24 -07:00
Josh Rosen fdd460f5f4 [SPARK-13980] Incrementally serialize blocks while unrolling them in MemoryStore
When a block is persisted in the MemoryStore at a serialized storage level, the current MemoryStore.putIterator() code will unroll the entire iterator as Java objects in memory, then will turn around and serialize an iterator obtained from the unrolled array. This is inefficient and doubles our peak memory requirements.

Instead, I think that we should incrementally serialize blocks while unrolling them.

A downside to incremental serialization is the fact that we will need to deserialize the partially-unrolled data in case there is not enough space to unroll the block and the block cannot be dropped to disk. However, I'm hoping that the memory efficiency improvements will outweigh any performance losses as a result of extra serialization in that hopefully-rare case.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11791 from JoshRosen/serialize-incrementally.
2016-03-24 17:33:21 -07:00
Xusen Yin 2cf46d5a96 [SPARK-11871] Add save/load for MLPC
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-11871

Add save/load for MLPC

## How was this patch tested?

Test with Scala unit test

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9854 from yinxusen/SPARK-11871.
2016-03-24 15:29:17 -07:00
Xin Ren d283223a5a [SPARK-13017][DOCS] Replace example code in mllib-feature-extraction.md using include_example
Replace example code in mllib-feature-extraction.md using include_example
https://issues.apache.org/jira/browse/SPARK-13017

The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.

Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
`{% include_example scala/org/apache/spark/examples/mllib/TFIDFExample.scala %}`
Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/TFIDFExample.scala` and pick code blocks marked "example" and replace code block in
`{% highlight %}`
 in the markdown.

See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337

Author: Xin Ren <iamshrek@126.com>

Closes #11142 from keypointt/SPARK-13017.
2016-03-24 14:25:10 -07:00
Sean Owen 342079dc45 Revert "[SPARK-2208] Fix for local metrics tests can fail on fast machines". The test appears to still be flaky after this change, or more flaky.
This reverts commit 5519760e0f.
2016-03-24 17:27:20 +00:00
Joan 5519760e0f [SPARK-2208] Fix for local metrics tests can fail on fast machines
## What changes were proposed in this pull request?

A fix for local metrics tests that can fail on fast machines.
This is probably what is suggested here #3380 by aarondav?

## How was this patch tested?

CI Tests

Cheers

Author: Joan <joan@goyeau.com>

Closes #11747 from joan38/SPARK-2208-Local-metrics-tests.
2016-03-24 09:47:44 +00:00
Xin Ren dd9ca7b960 [SPARK-13019][DOCS] fix for scala-2.10 build: Replace example code in mllib-statistics.md using include_example
## What changes were proposed in this pull request?

This PR for ticket SPARK-13019 is based on previous PR(https://github.com/apache/spark/pull/11108).
Since PR(https://github.com/apache/spark/pull/11108) is breaking scala-2.10 build, more work is needed to fix build errors.

What I did new in this PR is adding keyword argument for 'fractions':
`    val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)`
`    val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)`

I reopened ticket on JIRA but sorry I don't know how to reopen a GitHub pull request, so I just submitting a new pull request.
## How was this patch tested?

Manual build testing on local machine, build based on scala-2.10.

Author: Xin Ren <iamshrek@126.com>

Closes #11901 from keypointt/SPARK-13019.
2016-03-24 09:34:54 +00:00
Ruifeng Zheng 048a7594e2 [SPARK-14030][MLLIB] Add parameter check to MLLIB
## What changes were proposed in this pull request?

add parameter verification to MLLIB, like
numCorrections > 0
tolerance >= 0
iters > 0
regParam >= 0

## How was this patch tested?

manual tests

Author: Ruifeng Zheng <ruifengz@foxmail.com>
Author: Zheng RuiFeng <mllabs@datanode1.(none)>
Author: mllabs <mllabs@datanode1.(none)>
Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11852 from zhengruifeng/lbfgs_check.
2016-03-24 09:25:00 +00:00
Juarez Bochi 1803bf6333 Fix typo in ALS.scala
## What changes were proposed in this pull request?

Just a typo

## How was this patch tested?

N/A

Author: Juarez Bochi <jbochi@gmail.com>

Closes #11896 from jbochi/patch-1.
2016-03-24 09:24:00 +00:00
Tejas Patil 01849da080 [SPARK-14110][CORE] PipedRDD to print the command ran on non zero exit
## What changes were proposed in this pull request?

In case of failure in subprocess launched in PipedRDD, the failure exception reads “Subprocess exited with status XXX”. Debugging this is not easy for users especially if there are multiple pipe() operations in the Spark application.

Changes done:
- Changed the exception message when non-zero exit code is seen
- If the reader and writer threads see exception, simply logging the command ran. The current model is to propagate the exception "as is" so that upstream Spark logic will take the right action based on what the exception was (eg. for fetch failure, it needs to retry; but for some fatal exception, it will decide to fail the stage / job). So wrapping the exception with a generic exception will not work. Altering the exception message will keep that guarantee but that is ugly (plus not all exceptions might have a constructor for a string message)

## How was this patch tested?

- Added a new test case
- Ran all existing tests for PipedRDD

Author: Tejas Patil <tejasp@fb.com>

Closes #11927 from tejasapatil/SPARK-14110-piperdd-failure.
2016-03-24 00:31:13 -07:00
Andrew Or c44d140cae Revert "[SPARK-14014][SQL] Replace existing catalog with SessionCatalog"
This reverts commit 5dfc01976b.
2016-03-23 22:21:15 -07:00
Joseph K. Bradley cf823bead1 [SPARK-12183][ML][MLLIB] Remove mllib tree implementation, and wrap spark.ml one
Primary change:
* Removed spark.mllib.tree.DecisionTree implementation of tree and forest learning.
* spark.mllib now calls the spark.ml implementation.
* Moved unit tests (of tree learning internals) from spark.mllib to spark.ml as needed.

ml.tree.DecisionTreeModel
* Added toOld and made ```private[spark]```, implemented for Classifier and Regressor in subclasses.  These methods now use OldInformationGainStats.invalidInformationGainStats for LeafNodes in order to mimic the spark.mllib implementation.

ml.tree.Node
* Added ```private[tree] def deepCopy```, used by unit tests

Copied developer comments from spark.mllib implementation to spark.ml one.

Moving unit tests
* Tree learning internals were tested by spark.mllib.tree.DecisionTreeSuite, or spark.mllib.tree.RandomForestSuite.
* Those tests were all moved to spark.ml.tree.impl.RandomForestSuite.  The order in the file + the test names are the same, so you should be able to compare them by opening them in 2 windows side-by-side.
* I made minimal changes to each test to allow it to run.  Each test makes the same checks as before, except for a few removed assertions which were checking irrelevant values.
* No new unit tests were added.
* mllib.tree.DecisionTreeSuite: I removed some checks of splits and bins which were not relevant to the unit tests they were in.  Those same split calculations were already being tested in other unit tests, for each dataset type.

**Changes of behavior** (to be noted in SPARK-13448 once this PR is merged)
* spark.ml.tree.impl.RandomForest: Rather than throwing an error when maxMemoryInMB is set to too small a value (to split any node), we now allow 1 node to be split, even if its memory requirements exceed maxMemoryInMB.  This involved removing the maxMemoryPerNode check in RandomForest.run, as well as modifying selectNodesToSplit().  Once this PR is merged, I will note the change of behavior on SPARK-13448.
* spark.mllib.tree.DecisionTree: When a tree only has one node (root = leaf node), the "stats" field will now be empty, rather than being set to InformationGainStats.invalidInformationGainStats.  This does not remove information from the tree, and it will save a bit of storage.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11855 from jkbradley/remove-mllib-tree-impl.
2016-03-23 21:16:00 -07:00
gatorsmile f42eaf42bd [SPARK-14085][SQL] Star Expansion for Hash
#### What changes were proposed in this pull request?

This PR is to support star expansion in hash. For example,
```SQL
val structDf = testData2.select("a", "b").as("record")
structDf.select(hash($"*")
```

In addition, it refactors the codes for the rule `ResolveStar` and fixes a regression for star expansion in group by when using SQL API. For example,
```SQL
SELECT * FROM testData2 group by a, b
```

cc cloud-fan Now, the code for star resolution is much cleaner. The coverage is better. Could you check if this refactoring is good? Thanks!

#### How was this patch tested?
Added a few test cases to cover it.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11904 from gatorsmile/starResolution.
2016-03-24 11:13:36 +08:00
Liwei Lin de4e48b62b [SPARK-14025][STREAMING][WEBUI] Fix streaming job descriptions on the event timeline
## What changes were proposed in this pull request?

Removed the extra `<a href=...>...</a>` for each streaming job's description on the event timeline.

### [Before]
![before](https://cloud.githubusercontent.com/assets/15843379/13898653/0a6c1838-ee13-11e5-9761-14bb7b114c13.png)

### [After]
![after](https://cloud.githubusercontent.com/assets/15843379/13898650/012b8808-ee13-11e5-92a6-64aff0799c83.png)

## How was this patch tested?

test suits, manual checks (see screenshots above)

Author: Liwei Lin <proflin.me@gmail.com>
Author: proflin <proflin.me@gmail.com>

Closes #11845 from lw-lin/description-event-line.
2016-03-23 15:15:55 -07:00
sethah 69bc2c17f1 [SPARK-13952][ML] Add random seed to GBT
## What changes were proposed in this pull request?

`GBTClassifier` and `GBTRegressor` should use random seed for reproducible results. Because of the nature of current unit tests, which compare GBTs in ML and GBTs in MLlib for equality, I also added a random seed to MLlib GBT algorithm. I made alternate constructors in `mllib.tree.GradientBoostedTrees` to accept a random seed, but left them as private so as to not change the API unnecessarily.

## How was this patch tested?

Existing unit tests verify that functionality did not change. Other ML algorithms do not seem to have unit tests that directly test the functionality of random seeding, but reproducibility with seeding for GBTs is effectively verified in existing tests. I can add more tests if needed.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11903 from sethah/SPARK-13952.
2016-03-23 15:08:47 -07:00
Andrew Or 5dfc01976b [SPARK-14014][SQL] Replace existing catalog with SessionCatalog
## What changes were proposed in this pull request?

`SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`.

As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:
- SPARK-14013: Properly implement temporary functions in `SessionCatalog`
- SPARK-13879: Decide which DDL/DML commands to support natively in Spark
- SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
- SPARK-?????: Merge SQL/HiveContext

## How was this patch tested?

This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #11836 from andrewor14/use-session-catalog.
2016-03-23 13:34:22 -07:00
Michael Armbrust 6bc4be64f8 [SPARK-14078] Streaming Parquet Based FileSink
This PR adds a new `Sink` implementation that writes out Parquet files.  In order to correctly handle partial failures while maintaining exactly once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log.  When a parquet based `DataSource` is initialized for reading, we first check for this log directory and use it instead of file listing when present.

Unit tests are added, as well as a stress test that checks the answer after non-deterministic injected failures.

Author: Michael Armbrust <michael@databricks.com>

Closes #11897 from marmbrus/fileSink.
2016-03-23 13:03:25 -07:00
Herman van Hovell 919bf32198 [SPARK-13325][SQL] Create a 64-bit hashcode expression
This PR introduces a 64-bit hashcode expression. Such an expression is especially usefull for HyperLogLog++ and other probabilistic datastructures.

I have implemented xxHash64 which is a 64-bit hashing algorithm created by Yann Colet and Mathias Westerdahl. This is a high speed (C implementation runs at memory bandwidth) and high quality hashcode. It exploits both Instruction Level Parralellism (for speed) and the multiplication and rotation techniques (for quality) like MurMurHash does.

The initial results are promising. I have added a CG'ed test to the `HashBenchmark`, and this results in the following results (running from SBT):

    Running benchmark: Hash For simple
      Running case: interpreted version
      Running case: codegen version
      Running case: codegen version 64-bit

    Intel(R) Core(TM) i7-4750HQ CPU  2.00GHz
    Hash For simple:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    interpreted version                      1011 / 1016        132.8           7.5       1.0X
    codegen version                          1864 / 1869         72.0          13.9       0.5X
    codegen version 64-bit                   1614 / 1644         83.2          12.0       0.6X

    Running benchmark: Hash For normal
      Running case: interpreted version
      Running case: codegen version
      Running case: codegen version 64-bit

    Intel(R) Core(TM) i7-4750HQ CPU  2.00GHz
    Hash For normal:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    interpreted version                      2467 / 2475          0.9        1176.1       1.0X
    codegen version                          2008 / 2115          1.0         957.5       1.2X
    codegen version 64-bit                    728 /  758          2.9         347.0       3.4X

    Running benchmark: Hash For array
      Running case: interpreted version
      Running case: codegen version
      Running case: codegen version 64-bit

    Intel(R) Core(TM) i7-4750HQ CPU  2.00GHz
    Hash For array:                     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    interpreted version                      1544 / 1707          0.1       11779.6       1.0X
    codegen version                          2728 / 2745          0.0       20815.5       0.6X
    codegen version 64-bit                   2508 / 2549          0.1       19132.8       0.6X

    Running benchmark: Hash For map
      Running case: interpreted version
      Running case: codegen version
      Running case: codegen version 64-bit

    Intel(R) Core(TM) i7-4750HQ CPU  2.00GHz
    Hash For map:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    interpreted version                      1819 / 1826          0.0      444014.3       1.0X
    codegen version                           183 /  194          0.0       44642.9       9.9X
    codegen version 64-bit                    173 /  174          0.0       42120.9      10.5X

This shows that algorithm is consistently faster than MurMurHash32 in all cases and up to 3x (!) in the normal case.

I have also added this to HyperLogLog++ and it cuts the processing time of the following code in half:

    val df = sqlContext.range(1<<25).agg(approxCountDistinct("id"))
    df.explain()
    val t = System.nanoTime()
    df.show()
    val ns = System.nanoTime() - t

    // Before
    ns: Long = 5821524302

    // After
    ns: Long = 2836418963

cc cloud-fan (you have been working on hashcodes) / rxin

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #11209 from hvanhovell/xxHash.
2016-03-23 20:51:01 +01:00
Tathagata Das 8c826880f5 [SPARK-13809][SQL] State store for streaming aggregations
## What changes were proposed in this pull request?

In this PR, I am implementing a new abstraction for management of streaming state data - State Store. It is a key-value store for persisting running aggregates for aggregate operations in streaming dataframes. The motivation and design is discussed here.

https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254/edit#

## How was this patch tested?
- [x] Unit tests
- [x] Cluster tests

**Coverage from unit tests**

<img width="952" alt="screen shot 2016-03-21 at 3 09 40 pm" src="https://cloud.githubusercontent.com/assets/663212/13935872/fdc8ba86-ef76-11e5-93e8-9fa310472c7b.png">

## TODO
- [x] Fix updates() iterator to avoid duplicate updates for same key
- [x] Use Coordinator in ContinuousQueryManager
- [x] Plugging in hadoop conf and other confs
- [x] Unit tests
  - [x] StateStore object lifecycle and methods
  - [x] StateStoreCoordinator communication and logic
  - [x] StateStoreRDD fault-tolerance
  - [x] StateStoreRDD preferred location using StateStoreCoordinator
- [ ] Cluster tests
  - [ ] Whether preferred locations are set correctly
  - [ ] Whether recovery works correctly with distributed storage
  - [x] Basic performance tests
- [x] Docs

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #11645 from tdas/state-store.
2016-03-23 12:48:05 -07:00