Commit graph

15934 commits

Author SHA1 Message Date
Sun Rui 4ae9fe091c [SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR.
## What changes were proposed in this pull request?

dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.

The function signature is:

	dapply(df, function(localDF) {}, schema = NULL)

R function input: local data.frame from the partition on local node
R function output: local data.frame

Schema specifies the Row format of the resulting DataFrame. It must match the R function's output.
If schema is not specified, each partition of the resulting DataFrame is serialized in R into a single byte array. Such a DataFrame can be processed by successive calls to dapply().
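
As a rough illustration of the per-partition semantics, here is a plain-Python sketch. It is not SparkR code: `dapply_sketch`, `double_x`, and the list-of-dicts model of a partition are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]
Partition = List[Row]


def dapply_sketch(partitions: List[Partition],
                  func: Callable[[Partition], Partition]) -> List[Partition]:
    """Apply func to each partition independently and collect the results."""
    # Each partition is handed to func as a whole, the way dapply() passes
    # one local data.frame per partition to the R function.
    return [func(part) for part in partitions]


def double_x(part: Partition) -> Partition:
    # Hypothetical user function: builds a new local "data frame".
    return [{"x": row["x"] * 2} for row in part]


partitions = [[{"x": 1}, {"x": 2}], [{"x": 3}]]
result = dapply_sketch(partitions, double_x)
```

The function never sees rows from another partition, so any per-partition state (for example a running total) restarts with each partition.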

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>

Closes #12493 from sun-rui/SPARK-12919.
2016-04-29 16:41:07 -07:00
BenFradet d78fbcc3cc [SPARK-14570][ML] Log instrumentation in Random forests
## What changes were proposed in this pull request?

Added Instrumentation logging to DecisionTree{Classifier,Regressor} and RandomForest{Classifier,Regressor}

## How was this patch tested?

No tests involved since it's logging related.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #12536 from BenFradet/SPARK-14570.
2016-04-29 15:42:47 -07:00
Yin Huai af32f4aed6 [SPARK-15013][SQL] Remove hiveConf from HiveSessionState
## What changes were proposed in this pull request?
The hiveConf in HiveSessionState is not actually used anymore. Let's remove it.

## How was this patch tested?
Existing tests

Author: Yin Huai <yhuai@databricks.com>

Closes #12786 from yhuai/removeHiveConf.
2016-04-29 14:54:40 -07:00
Cheng Lian a04b1de5fa [SPARK-14981][SQL] Throws exception if DESC is specified for sorting columns
## What changes were proposed in this pull request?

Currently Spark SQL doesn't support sorting columns in descending order. However, the parser accepts the syntax and silently drops sorting directions. This PR fixes this by throwing an exception if `DESC` is specified as sorting direction of a sorting column.

## How was this patch tested?

A test case is added to test the invalid sorting order by checking exception message.

Author: Cheng Lian <lian@databricks.com>

Closes #12759 from liancheng/spark-14981.
2016-04-29 14:52:32 -07:00
Reynold Xin 8ebae466a3 [SPARK-15004][SQL] Remove zookeeper service discovery code in thrift-server
## What changes were proposed in this pull request?
We recently inlined Hive's thrift server code in SPARK-14987. This patch removes the code related to zookeeper service discovery, Tez, and Hive on Spark, since they are irrelevant.

## How was this patch tested?
N/A - removing dead code

Author: Reynold Xin <rxin@databricks.com>

Closes #12780 from rxin/SPARK-15004.
2016-04-29 13:32:08 -07:00
Yin Huai ac115f6628 [SPARK-15011][SQL][TEST] Ignore org.apache.spark.sql.hive.StatisticsSuite.analyze MetastoreRelation
This test always fails with sbt's hadoop 2.3 and 2.4 tests. Let's disable it for now and investigate the problem.

Author: Yin Huai <yhuai@databricks.com>

Closes #12783 from yhuai/SPARK-15011-ignore.
2016-04-29 12:14:28 -07:00
Jeff Zhang 775772de36 [SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA PR2
## What changes were proposed in this pull request?

pyspark.ml API for LDA
* LDA, LDAModel, LocalLDAModel, DistributedLDAModel
* includes persistence

This replaces [https://github.com/apache/spark/pull/10242]

## How was this patch tested?

* doc test for LDA, including Param setters
* unit test for persistence

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Jeff Zhang <zjffdu@apache.org>

Closes #12723 from jkbradley/zjffdu-SPARK-11940.
2016-04-29 10:42:52 -07:00
Joseph K. Bradley f08dcdb8d3 [SPARK-14984][ML] Deprecated model field in LinearRegressionSummary
## What changes were proposed in this pull request?

Deprecated model field in LinearRegressionSummary

Removed unnecessary Since annotations

## How was this patch tested?

Existing tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12763 from jkbradley/lr-summary-api.
2016-04-29 10:40:00 -07:00
Yanbo Liang 87ac84d437 [SPARK-14314][SPARK-14315][ML][SPARKR] Model persistence in SparkR (glm & kmeans)
SparkR `glm` and `kmeans` model persistence.

Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>
Author: Gayathri Murali <gayathri.m.softie@gmail.com>

Closes #12778 from yanboliang/spark-14311.
Closes #12680
Closes #12683
2016-04-29 09:43:04 -07:00
Andrew Or a7d0fedc94 [SPARK-14988][PYTHON] SparkSession catalog and conf API
## What changes were proposed in this pull request?

The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the python API.

## How was this patch tested?

Python tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12765 from andrewor14/python-spark-session-more.
2016-04-29 09:34:10 -07:00
Davies Liu 7feeb82cb7 [SPARK-14987][SQL] inline hive-service (cli) into sql/hive-thriftserver
## What changes were proposed in this pull request?

This PR copies the thrift-server from hive-service-1.2 (including TCLIService.thrift and generated Java source code) into sql/hive-thriftserver, so we can do further cleanup and improvements.

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #12764 from davies/thrift_server.
2016-04-29 09:32:42 -07:00
wm624@hotmail.com b6fa7e5934 [SPARK-14571][ML] Log instrumentation in ALS
## What changes were proposed in this pull request?

Add log instrumentation for parameters:
rank, numUserBlocks, numItemBlocks, implicitPrefs, alpha,
userCol, itemCol, ratingCol, predictionCol, maxIter,
regParam, nonnegative, checkpointInterval, seed

Add log instrumentation for numUserFeatures and numItemFeatures

## How was this patch tested?

Manual test: set a breakpoint in IntelliJ and run `testALS()`. Single-step through the code and check that the log method is called.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12560 from wangmiao1981/log.
2016-04-29 16:18:25 +02:00
dding3 6d5aeaae26 [SPARK-14969][MLLIB] Remove duplicate implementation of compute in LogisticGradient
## What changes were proposed in this pull request?

This PR removes duplicate implementation of compute in LogisticGradient class

## How was this patch tested?

unit tests

Author: dding3 <dingding@dingding-ubuntu.sh.intel.com>

Closes #12747 from dding3/master.
2016-04-29 10:19:51 +01:00
Jakob Odersky 7226e19067 [SPARK-14511][BUILD] Upgrade genjavadoc to latest upstream
## What changes were proposed in this pull request?
In the past, genjavadoc had issues with package-private members, which led the Spark project to use a forked version. This issue has been fixed upstream (typesafehub/genjavadoc#70) and a release is available for Scala versions 2.10, 2.11 **and 2.12**, hence a forked version for Spark is no longer necessary.
This pull request updates the build configuration to use the newest upstream genjavadoc.

## How was this patch tested?
The build was run with `sbt unidoc`. During the process, javadoc emits some errors on the generated Java stubs; however, these errors were also present before the upgrade. Furthermore, the produced HTML is fine.

Author: Jakob Odersky <jakob@odersky.com>

Closes #12707 from jodersky/SPARK-14511-genjavadoc.
2016-04-29 10:10:20 +01:00
Reynold Xin 054f991c43 [SPARK-14994][SQL] Remove execution hive from HiveSessionState
## What changes were proposed in this pull request?
This patch removes executionHive from HiveSessionState and HiveSharedState.

## How was this patch tested?
Updated test cases.

Author: Reynold Xin <rxin@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12770 from rxin/SPARK-14994.
2016-04-29 01:14:02 -07:00
Sameer Agarwal 2057cbcb0b [SPARK-14996][SQL] Add TPCDS Benchmark Queries for SparkSQL
## What changes were proposed in this pull request?

This PR adds support for easily running and benchmarking a set of common TPCDS queries locally in SparkSQL.

## How was this patch tested?

N/A

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12771 from sameeragarwal/tpcds-2.
2016-04-29 00:52:42 -07:00
gatorsmile 222dcf7937 [SPARK-12660][SPARK-14967][SQL] Implement Except Distinct by Left Anti Join
#### What changes were proposed in this pull request?
Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
```SQL
  SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2
  ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
```
 Note:
 1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL.
 2. This rule has to be done after de-duplicating the attributes; otherwise, the generated
    join conditions will be incorrect.

This PR also corrects the existing behavior in Spark. Before this PR, the behavior is like
```scala
  test("except") {
    val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id")
    val df_right = Seq(1, 3).toDF("id")

    checkAnswer(
      df_left.except(df_right),
      Row(2) :: Row(2) :: Row(4) :: Nil
    )
  }
```
After this PR, the result is corrected: duplicates are removed, so the expected answer becomes `Row(2) :: Row(4) :: Nil`. This strictly follows the SQL semantics of `EXCEPT DISTINCT`.
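
The rewrite can be sketched in plain Python. This models the semantics only and is not the optimizer rule itself: `left_anti_join`, `except_distinct`, and the tuple representation of rows are hypothetical, and Python's `None == None` stands in for the null-safe `<=>` comparison.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[int], ...]


def null_safe_equal(a: Row, b: Row) -> bool:
    # Models <=> column by column: None compared with None counts as equal.
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


def left_anti_join(left: List[Row], right: List[Row]) -> List[Row]:
    # Keep the left rows that have no null-safe match on the right.
    return [l for l in left if not any(null_safe_equal(l, r) for r in right)]


def except_distinct(left: List[Row], right: List[Row]) -> List[Row]:
    # DISTINCT over the anti-join result, preserving first-seen order.
    seen = set()
    out: List[Row] = []
    for row in left_anti_join(left, right):
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out


left = [(1,), (2,), (2,), (3,), (3,), (4,)]
right = [(1,), (3,)]
result = except_distinct(left, right)  # [(2,), (4,)]
```

Without the final de-duplication step the sketch would return `(2,)` twice, which is the old behavior shown in the test above.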

#### How was this patch tested?
Modified and added a few test cases to verify the optimization rule and the results of operators.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12736 from gatorsmile/exceptByAntiJoin.
2016-04-29 15:30:36 +08:00
Reynold Xin e249e6f8b5 [HOTFIX] Disable flaky test StatisticsSuite.analyze MetastoreRelations 2016-04-29 00:23:59 -07:00
Sean Owen d1cf320105 [SPARK-14886][MLLIB] RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException
## What changes were proposed in this pull request?

Handle the case where the number of predictions is smaller than the label set size or k in the nDCG computation
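
For illustration, a generic binary-relevance nDCG in Python that tolerates a prediction list shorter than `k`. This is a sketch of the general technique, not the MLlib code; `ndcg_at` and its normalization are assumptions and may differ from what `RankingMetrics` does.

```python
import math
from typing import Sequence, Set


def ndcg_at(predicted: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG at k with binary relevance; safe when len(predicted) < k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    # Slicing never reads past the end of the prediction list, so a short
    # list cannot cause an out-of-bounds access.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(predicted[:k]) if item in relevant)
    # Ideal DCG: as many relevant items as fit in the top k.
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal


score = ndcg_at(["a"], {"a", "b", "c"}, 5)  # one prediction, k = 5
```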

## How was this patch tested?

New unit test; existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #12756 from srowen/SPARK-14886.
2016-04-29 09:21:27 +02:00
Reynold Xin 24d07e45d4 [HOTFIX] Disable flaky LauncherServerSuite.testCommunication 2016-04-29 00:20:52 -07:00
Zheng RuiFeng 4f83e442b1 [MINOR][DOC] Minor typo fixes
## What changes were proposed in this pull request?
Minor typo fixes

## How was this patch tested?
local build

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12755 from zhengruifeng/fix_doc_dataset.
2016-04-28 22:56:26 -07:00
Zheng RuiFeng cabd54d931 [SPARK-14829][MLLIB] Deprecate GLM APIs using SGD
## What changes were proposed in this pull request?
According to the [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829), deprecate API of LogisticRegression and LinearRegression using SGD

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12596 from zhengruifeng/deprecate_sgd.
2016-04-28 22:44:14 -07:00
Timothy Hunter 769a909d13 [SPARK-7264][ML] Parallel lapply for sparkR
## What changes were proposed in this pull request?

This PR adds a new function in SparkR called `sparkLapply(list, function)`. This function implements a distributed version of `lapply` using Spark as a backend.

TODO:
 - [x] check documentation
 - [ ] check tests

Trivial example in SparkR:

```R
sparkLapply(1:5, function(x) { 2 * x })
```

Output:

```
[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 6

[[4]]
[1] 8

[[5]]
[1] 10
```

Here is a slightly more complex example to perform distributed training of multiple models. Under the hood, Spark broadcasts the dataset.

```R
library("MASS")
data(menarche)
families <- c("gaussian", "poisson")
train <- function(family){glm(Menarche ~ Age  , family=family, data=menarche)}
results <- sparkLapply(families, train)
```

## How was this patch tested?

This PR was tested in SparkR. I am unfamiliar with R and SparkR, so any feedback on style, testing, etc. will be much appreciated.

cc falaki davies

Author: Timothy Hunter <timhunter@databricks.com>

Closes #12426 from thunterdb/7264.
2016-04-28 22:42:48 -07:00
Reynold Xin 4607f6e7f7 [SPARK-14991][SQL] Remove HiveNativeCommand
## What changes were proposed in this pull request?
This patch removes HiveNativeCommand, so we can continue to remove the dependency on Hive. This pull request also removes the ability to generate golden result file using Hive.

## How was this patch tested?
Updated tests to reflect this.

Author: Reynold Xin <rxin@databricks.com>

Closes #12769 from rxin/SPARK-14991.
2016-04-28 21:58:48 -07:00
Wenchen Fan 6f9a18fe31 [HOTFIX][CORE] fix a concurrence issue in NewAccumulator
## What changes were proposed in this pull request?

`AccumulatorContext` is not thread-safe, which is why all of its methods are synchronized. However, there is one exception: `AccumulatorContext.originals`. `NewAccumulator` uses it to check whether it is registered, which is wrong because that access is not synchronized.

This PR marks `AccumulatorContext.originals` as `private`, so all access to `AccumulatorContext` is now synchronized.
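
The pattern, sketched in Python under the assumption that a single lock guards the registry: the map is kept private and every read or write goes through a locked method. `AccumulatorRegistry` and its method names are hypothetical and do not mirror the Scala API.

```python
import threading
from typing import Dict, Optional


class AccumulatorRegistry:
    """Hypothetical registry whose internal map is only touched under a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # Private by convention: callers never read this map directly.
        self._originals: Dict[int, object] = {}

    def register(self, acc_id: int, acc: object) -> None:
        with self._lock:
            self._originals[acc_id] = acc

    def is_registered(self, acc_id: int) -> bool:
        # The membership check takes the same lock as the writers.
        with self._lock:
            return acc_id in self._originals

    def get(self, acc_id: int) -> Optional[object]:
        with self._lock:
            return self._originals.get(acc_id)

    def remove(self, acc_id: int) -> None:
        with self._lock:
            self._originals.pop(acc_id, None)
```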

## How was this patch tested?

I verified it locally. To be safe, we can let jenkins test it many times to make sure this problem is gone.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12773 from cloud-fan/debug.
2016-04-28 21:57:58 -07:00
Yin Huai 9c7c42bc6a Revert "[SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local"
This reverts commit dae538a4d7.
2016-04-28 19:57:41 -07:00
jerryshao 2398e3d69c [SPARK-14836][YARN] Zip all the jars before uploading to distributed cache
## What changes were proposed in this pull request?

<copy from JIRA>

Currently, if neither `spark.yarn.jars` nor `spark.yarn.archive` is set (the default), the Spark on YARN code uploads every jar in the folder to the distributed cache separately, which is time-consuming and very verbose. This change zips all the jars first and then puts the single archive into the distributed cache.

This significantly reduces startup time.
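
A minimal Python sketch of the idea: bundle every jar in a directory into one archive so that a single file is uploaded. `zip_jars` is a hypothetical helper, not the YARN client code, and storing entries uncompressed is a choice made for this sketch.

```python
import os
import tempfile
import zipfile


def zip_jars(jars_dir: str, archive_path: str) -> int:
    """Bundle every .jar in jars_dir into one archive; return the jar count."""
    count = 0
    # ZIP_STORED skips recompression, since jar files are already compressed.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as archive:
        for name in sorted(os.listdir(jars_dir)):
            if name.endswith(".jar"):
                archive.write(os.path.join(jars_dir, name), arcname=name)
                count += 1
    return count


# Demo with throwaway files standing in for real jars.
with tempfile.TemporaryDirectory() as tmp:
    jars_dir = os.path.join(tmp, "jars")
    os.mkdir(jars_dir)
    for name in ("b.jar", "a.jar", "notes.txt"):
        with open(os.path.join(jars_dir, name), "wb") as handle:
            handle.write(b"placeholder")
    archive_path = os.path.join(tmp, "bundle.zip")
    jar_count = zip_jars(jars_dir, archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()
```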

## How was this patch tested?

Unit tests and local integration tests were run.

Verified with SparkPi in both cluster and client mode.

Author: jerryshao <sshao@hortonworks.com>

Closes #12597 from jerryshao/SPARK-14836.
2016-04-28 16:39:49 -07:00
Joseph K. Bradley 4f4721a21c [SPARK-14862][ML] Updated Classifiers to not require labelCol metadata
## What changes were proposed in this pull request?

Updated Classifier, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier to not require input column metadata.
* They first check for metadata.
* If numClasses is not specified in metadata, they identify the largest label value (up to a limit).

This functionality is implemented in a new Classifier.getNumClasses method.
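
The fallback can be sketched as follows. This is an illustrative Python version, not `Classifier.getNumClasses` itself: the name `get_num_classes`, the 0-based label assumption, and the default limit of 100 are all assumptions made for the example.

```python
from typing import Iterable, Optional


def get_num_classes(labels: Iterable[float],
                    metadata_num_classes: Optional[int] = None,
                    max_num_classes: int = 100) -> int:
    """Return the class count from metadata, else infer it from the labels."""
    # Prefer metadata when it is available.
    if metadata_num_classes is not None:
        return metadata_num_classes
    max_label = max(labels)  # raises ValueError on empty input
    # Only the largest label is validated here.
    if max_label < 0 or max_label != int(max_label):
        raise ValueError("largest label must be a non-negative integer, "
                         "got %r" % max_label)
    num_classes = int(max_label) + 1  # labels are assumed to be 0-based
    if num_classes > max_num_classes:
        raise ValueError("inferred %d classes, above the limit of %d"
                         % (num_classes, max_num_classes))
    return num_classes
```

The limit guards against treating a continuous column as a label column by mistake, which would otherwise yield an enormous class count.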

Also
* Updated Classifier.extractLabeledPoints to (a) check label values and (b) include a second version which takes a numClasses value for validity checking.

## How was this patch tested?

* Unit tests in ClassifierSuite for helper methods
* Unit tests for DecisionTreeClassifier, RandomForestClassifier, GBTClassifier with toy datasets lacking label metadata

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12663 from jkbradley/trees-no-metadata.
2016-04-28 16:20:00 -07:00
Pravin Gadakh dae538a4d7 [SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local
## What changes were proposed in this pull request?

This PR adds `since` tag into the matrix and vector classes in spark-mllib-local.

## How was this patch tested?

Scala-style checks passed.

Author: Pravin Gadakh <prgadakh@in.ibm.com>

Closes #12416 from pravingadakh/SPARK-14613.
2016-04-28 15:59:18 -07:00
Burak Yavuz 78c8aaf849 [SPARK-14555] Second cut of Python API for Structured Streaming
## What changes were proposed in this pull request?

This PR adds Python APIs for:
 - `ContinuousQueryManager`
 - `ContinuousQueryException`

The `ContinuousQueryException` is a very basic wrapper; it doesn't provide the functionality that the Scala side provides, but it follows the same pattern as `AnalysisException`.

For `ContinuousQueryManager`, all APIs are provided except for registering listeners.

This PR also attempts to fix test flakiness by stopping all active streams just before tests.

## How was this patch tested?

Python Doc tests and unit tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #12673 from brkyvz/pyspark-cqm.
2016-04-28 15:22:28 -07:00
Kai Jiang d584a2b8ac [SPARK-12810][PYSPARK] PySpark CrossValidatorModel should support avgMetrics
## What changes were proposed in this pull request?
support avgMetrics in CrossValidatorModel with Python
## How was this patch tested?
Doctest and `test_save_load` in `pyspark/ml/test.py`
[JIRA](https://issues.apache.org/jira/browse/SPARK-12810)

Author: Kai Jiang <jiangkai@gmail.com>

Closes #12464 from vectorijk/spark-12810.
2016-04-28 14:19:11 -07:00
Tathagata Das 0ee5419b6c [SPARK-14970][SQL] Prevent DataSource from enumerating all files in a directory if there is a user-specified schema
## What changes were proposed in this pull request?
The FileCatalog object gets created even if the user specifies a schema, which means files in the directory are enumerated even though that is not necessary. For large directories this is very slow. Users want to specify a schema precisely in such large-directory scenarios, and this defeats the purpose.
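
In outline, the fix is to defer the listing until it is actually needed. The Python sketch below illustrates that control flow only. `resolve_schema` and its callback parameters are hypothetical and unrelated to the real `DataSource` class.

```python
from typing import Callable, List, Optional


def resolve_schema(path: str,
                   user_schema: Optional[List[str]],
                   list_files: Callable[[str], List[str]],
                   infer_schema: Callable[[List[str]], List[str]]) -> List[str]:
    """Return the schema, listing files only when inference is required."""
    if user_schema is not None:
        # A user-supplied schema makes the directory listing unnecessary.
        return user_schema
    return infer_schema(list_files(path))


listing_calls: List[str] = []


def counting_list_files(path: str) -> List[str]:
    # Stand-in for an expensive directory enumeration.
    listing_calls.append(path)
    return [path + "/part-0"]


schema = resolve_schema("/data", ["id", "name"], counting_list_files,
                        lambda files: ["inferred"])
```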

## How was this patch tested?
Hard to test this with unit test.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12748 from tdas/SPARK-14970.
2016-04-28 12:59:08 -07:00
Yuhao Yang d5ab42ceb9 [SPARK-14916][MLLIB] A more friendly tostring for FreqItemset in mllib.fpm
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-14916
FreqItemset as the result of FPGrowth should have a more friendly toString(), to help users and developers.
sample:
{a, b}: 5
{x, y, z}: 4
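
A Python sketch of the same formatting, for illustration only. The class below is a hypothetical stand-in for the Scala `FreqItemset`, with `__str__` playing the role of `toString()`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FreqItemset:
    items: Tuple[str, ...]
    freq: int

    def __str__(self) -> str:
        # Render as "{item1, item2}: freq", matching the sample above.
        return "{" + ", ".join(self.items) + "}: " + str(self.freq)


rendered = str(FreqItemset(("a", "b"), 5))
```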

## How was this patch tested?

existing unit tests.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #12698 from hhbyyh/freqtos.
2016-04-28 19:52:09 +01:00
Joseph K. Bradley 5ee72454df [SPARK-14852][ML] refactored GLM summary into training, non-training summaries
## What changes were proposed in this pull request?

This splits GeneralizedLinearRegressionSummary into 2 summary types:
* GeneralizedLinearRegressionSummary, which does not store info from fitting (diagInvAtWA)
* GeneralizedLinearRegressionTrainingSummary, which is a subclass of GeneralizedLinearRegressionSummary and stores info from fitting

This also adds a method evaluate() which can produce a GeneralizedLinearRegressionSummary on a new dataset.

The summary no longer provides the model itself as a public val.

Also:
* Fixes bug where GeneralizedLinearRegressionTrainingSummary was created with model, not summaryModel.
* Adds hasSummary method.
* Renames findSummaryModelAndPredictionCol -> getSummaryModel and simplifies that method.
* In summary, extract values from model immediately in case user later changes those (e.g., predictionCol).
* Pardon the style fixes; that is IntelliJ being obnoxious.

## How was this patch tested?

Existing unit tests + updated test for evaluate and hasSummary

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12624 from jkbradley/model-summary-api.
2016-04-28 11:22:13 -07:00
Gregory Hart 12c360c057 [SPARK-14965][SQL] Indicate an exception is thrown for a missing struct field
## What changes were proposed in this pull request?

Fix to ScalaDoc for StructType.

## How was this patch tested?

Built locally.

Author: Gregory Hart <greg.hart@thinkbiganalytics.com>

Closes #12758 from freastro/hotfix/SPARK-14965.
2016-04-28 11:21:43 -07:00
Andrew Or 89addd40ab [SPARK-14945][PYTHON] SparkSession Python API
## What changes were proposed in this pull request?

```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/

Using Python version 2.7.5 (default, Mar  9 2014 22:15:05)
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x101f3bfd0>
>>> spark.sql("SHOW TABLES").show()
...
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|      src|      false|
+---------+-----------+

>>> spark.range(1, 10, 2).show()
+---+
| id|
+---+
|  1|
|  3|
|  5|
|  7|
|  9|
+---+
```
**Note**: This API is NOT complete in its current state. In particular, for now I left out the `conf` and `catalog` APIs, which were added later in Scala. These will be added later before 2.0.

## How was this patch tested?

Python tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12746 from andrewor14/python-spark-session.
2016-04-28 10:55:48 -07:00
Xin Ren 5743352a28 [SPARK-14935][CORE] DistributedSuite "local-cluster format" shouldn't actually launch clusters
https://issues.apache.org/jira/browse/SPARK-14935

In DistributedSuite, the "local-cluster format" test actually launches a bunch of clusters, but this doesn't seem necessary for what should just be a unit test of a regex. We should clean up the code so that this is testable without actually launching a cluster, which should buy us about 20 seconds per build.
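
A regex like this can be unit tested without starting anything, as the Python sketch below shows. The pattern and the reading of the three numbers (workers, cores per worker, memory per worker) are assumptions for illustration and are not copied from the Spark source.

```python
import re
from typing import Optional, Tuple

# Assumed shape of the master URL: local-cluster[workers, cores, memory].
LOCAL_CLUSTER_PATTERN = re.compile(
    r"local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*\]")


def parse_local_cluster(master: str) -> Optional[Tuple[int, int, int]]:
    """Return the three numbers from the master string, or None if malformed."""
    match = LOCAL_CLUSTER_PATTERN.fullmatch(master)
    if match is None:
        return None
    workers, cores, memory = (int(group) for group in match.groups())
    return workers, cores, memory


parsed = parse_local_cluster("local-cluster[2 , 1 , 1024]")
```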

Passed unit test on my local machine

Author: Xin Ren <iamshrek@126.com>

Closes #12744 from keypointt/SPARK-14935.
2016-04-28 10:50:06 -07:00
Sean Owen bed0b00202 [SPARK-14882][DOCS] Clarify that Spark can be cross-built for other Scala versions
## What changes were proposed in this pull request?

Add simple clarification that Spark can be cross-built for other Scala versions.

## How was this patch tested?

Automated doc build

Author: Sean Owen <sowen@cloudera.com>

Closes #12757 from srowen/SPARK-14882.
2016-04-28 10:41:15 -07:00
jerryshao 8b44bd52fa [SPARK-6735][YARN] Add window based executor failure tracking mechanism for long running service
This work is based on twinkle-sachdeva's proposal. Analogous to the existing window-based mechanism for AM failures, this adds a similar mechanism for executor failure tracking, which helps long-running Spark services mitigate executor failure problems.
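
The general technique, sketched in Python: count only failures that fall inside a sliding time window, so old failures of a long-running service eventually stop counting. `FailureWindow` and its parameters are hypothetical and do not correspond to actual configuration keys.

```python
from collections import deque
from typing import Deque


class FailureWindow:
    """Counts failures that happened within the last window_seconds."""

    def __init__(self, max_failures: int, window_seconds: float) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures: Deque[float] = deque()

    def record_failure(self, now: float) -> None:
        self._failures.append(now)

    def failures_in_window(self, now: float) -> int:
        # Drop failures that are older than the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures)

    def should_abort(self, now: float) -> bool:
        # Abort only when recent failures exceed the allowed maximum.
        return self.failures_in_window(now) > self.max_failures


window = FailureWindow(max_failures=2, window_seconds=60.0)
for timestamp in (0.0, 10.0, 20.0):
    window.record_failure(timestamp)
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the sketch deterministic and easy to test.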

Please help to review, tgravescs sryza and vanzin

Author: jerryshao <sshao@hortonworks.com>

Closes #10241 from jerryshao/SPARK-6735.
2016-04-28 12:38:19 -05:00
Sun Rui 9e785079b6 [SPARK-12235][SPARKR] Enhance mutate() to support replace existing columns.
Besides adding support for replacing existing columns, this makes the behavior of mutate() more consistent with dplyr:
1. Throw an error when there are duplicated column names in the DataFrame being mutated.
2. When there are duplicated column names among the columns specified by arguments, the last column of the same name takes effect.
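
These rules can be modeled in a few lines of Python. This is an illustrative sketch and not SparkR code: `mutate_sketch` and the list-of-pairs representation of a frame are hypothetical, and the ordering of newly added columns is an assumption.

```python
from typing import Dict, List, Sequence, Tuple

Column = Tuple[str, List[int]]


def mutate_sketch(frame: Sequence[Column],
                  new_cols: Sequence[Column]) -> List[Column]:
    names = [name for name, _ in frame]
    if len(names) != len(set(names)):
        raise ValueError("duplicated column names in the frame being mutated")
    # Later arguments with the same name overwrite earlier ones.
    latest: Dict[str, List[int]] = {}
    for name, values in new_cols:
        latest[name] = values
    # Replace existing columns in place, then append the genuinely new ones.
    result = [(name, latest.pop(name, values)) for name, values in frame]
    result.extend(latest.items())
    return result


mutated = mutate_sketch(
    [("a", [1, 2]), ("b", [3, 4])],
    [("b", [0, 0]), ("c", [5, 6]), ("b", [9, 9])],
)
```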

Author: Sun Rui <rui.sun@intel.com>

Closes #10220 from sun-rui/SPARK-12235.
2016-04-28 09:33:58 -07:00
Ergin Seyfe 23256be0d0 [SPARK-14576][WEB UI] Spark console should display Web UI url
## What changes were proposed in this pull request?
This is a proposal to print the Spark Driver UI link when spark-shell is launched.

## How was this patch tested?
Launched spark-shell in local mode and cluster mode. Spark-shell console output included following line:
"Spark context Web UI available at <Spark web url>"

Author: Ergin Seyfe <eseyfe@fb.com>

Closes #12341 from seyfe/spark_console_display_webui_link.
2016-04-28 16:16:28 +01:00
Liang-Chi Hsieh 7c6937a885 [SPARK-14487][SQL] User Defined Type registration without SQLUserDefinedType annotation
## What changes were proposed in this pull request?

Currently we use the `SQLUserDefinedType` annotation to register UDTs for user classes. However, by doing this, we add a Spark dependency to user classes.

For some user classes, adding such a dependency is unnecessary and makes deployment harder.

We should provide an alternative approach to register UDTs for user classes without the `SQLUserDefinedType` annotation.

## How was this patch tested?

`UserDefinedTypeSuite`

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12259 from viirya/improve-sql-usertype.
2016-04-28 01:14:49 -07:00
Wenchen Fan bf5496dbda [SPARK-14654][CORE] New accumulator API
## What changes were proposed in this pull request?

This PR introduces a new accumulator API  which is much simpler than before:

1. the type hierarchy is simplified, now we only have an `Accumulator` class
2. Combine `initialValue` and `zeroValue` concepts into just one concept: `zeroValue`
3. there is only one `register` method; the accumulator registration and cleanup registration are combined.
4. the `id`, `name` and `countFailedValues` are combined into an `AccumulatorMetadata`, which is provided during registration.

`SQLMetric` is a good example to show the simplicity of this new API.

What we break:

1. no `setValue` anymore. In the new API, the intermediate type can be different from the result type, so it is very hard to implement a general `setValue`.
2. an accumulator can't be serialized before it is registered.
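
To see why a general `setValue` is hard, consider an accumulator whose intermediate state differs from its result. The Python sketch below is hypothetical and does not mirror the Scala class: it keeps `(sum, count)` internally but reports an average.

```python
class AverageAccumulator:
    """Hypothetical accumulator: intermediate state (sum, count), result float."""

    def __init__(self) -> None:
        # The zero value: nothing accumulated yet.
        self._sum = 0.0
        self._count = 0

    def is_zero(self) -> bool:
        return self._count == 0

    def add(self, value: float) -> None:
        self._sum += value
        self._count += 1

    def merge(self, other: "AverageAccumulator") -> None:
        # Merging combines intermediate states, not final results.
        self._sum += other._sum
        self._count += other._count

    @property
    def value(self) -> float:
        # An average alone cannot be turned back into (sum, count), which is
        # why a set_value(float) has no well-defined meaning here.
        return self._sum / self._count if self._count else 0.0


left_acc = AverageAccumulator()
left_acc.add(1.0)
left_acc.add(3.0)
right_acc = AverageAccumulator()
right_acc.add(5.0)
left_acc.merge(right_acc)
```

Averaging the two partial results (2.0 and 5.0) would give 3.5, whereas merging the intermediate states gives the correct 3.0.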

Problems need to be addressed in follow-ups:

1. With this new API, `AccumulatorInfo` doesn't make a lot of sense: the partial output is not partial updates, so we need to expose the intermediate value.
2. `ExceptionFailure` should not carry the accumulator updates. Why do users care about accumulator updates for failed cases? It looks like we only use this feature to update the internal metrics; how about sending a heartbeat to update internal metrics after the failure event?
3. The public event `SparkListenerTaskEnd` carries a `TaskMetrics`. Ideally this `TaskMetrics` doesn't need to carry external accumulators, as the only method of `TaskMetrics` that can access external accumulators is `private[spark]`. However, `SQLListener` uses it to retrieve SQL metrics.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12612 from cloud-fan/acc.
2016-04-28 00:26:39 -07:00
Jakob Odersky be317d4a90 [SPARK-10001][CORE] Don't short-circuit actions in signal handlers
## What changes were proposed in this pull request?
The current signal handlers have a subtle bug: they stop evaluating registered actions as soon as one of them returns true, because `forall` short-circuits.
This PR adds a strict mapping stage before evaluating the returned results.
There are no known occurrences of the bug and this is a preemptive fix.
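
The same pitfall exists with Python's `all()`, which makes for a compact illustration. The sketch below is an analogy and not the Scala handler code: `make_action`, `run_short_circuit`, and `run_strict` are hypothetical names.

```python
from typing import Callable, List, Sequence

calls: List[str] = []


def make_action(name: str, handled: bool) -> Callable[[], bool]:
    def action() -> bool:
        calls.append(name)  # record that this action really ran
        return handled
    return action


def run_short_circuit(actions: Sequence[Callable[[], bool]]) -> bool:
    # Lazy: all() stops at the first action that returns True.
    return all(not action() for action in actions)


def run_strict(actions: Sequence[Callable[[], bool]]) -> bool:
    # Strict: run every action first, then combine the results.
    results = [action() for action in actions]
    return all(not handled for handled in results)


actions = [make_action("first", True), make_action("second", False)]
```

Both functions return the same boolean, but only `run_strict` guarantees that every registered action is invoked.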

## How was this patch tested?
As with the original introduction of signal handlers, this was tested manually (unit testing with signals is not straightforward).

Author: Jakob Odersky <jakob@odersky.com>

Closes #12745 from jodersky/SPARK-10001-hotfix.
2016-04-27 22:46:43 -07:00
Davies Liu ae4e3def5e [SPARK-14961] Build HashedRelation larger than 1G
## What changes were proposed in this pull request?

Currently, LongToUnsafeRowMap uses a byte array as the underlying page, which can't be larger than 1G.

This PR improves LongToUnsafeRowMap to scale up to 8G bytes by using an array of Long instead of an array of bytes.

## How was this patch tested?

Manually ran a test to confirm that both UnsafeHashedRelation and LongHashedRelation could build a map that is larger than 2G.

Author: Davies Liu <davies@databricks.com>

Closes #12740 from davies/larger_broadcast.
2016-04-27 21:23:40 -07:00
hyukjinkwon f5da592fc6 [SPARK-12143][SQL] Binary type support for Hive thrift server
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-12143

This PR adds the support for conversion between `SparkRow` in Spark and `RowSet` in Hive for `BinaryType` as `Array[Byte]` (JDBC)
## How was this patch tested?

Unittests in `HiveThriftBinaryServerSuite` (regression test)

Closes #10139

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12733 from HyukjinKwon/SPARK-12143.
2016-04-27 17:41:05 -07:00
Arun Allamsetty b0ce0d1312 [MINOR][MAINTENANCE] Sort the entries in .gitignore.
## What changes were proposed in this pull request?

The contents of `.gitignore` have been sorted to make it more readable. The actual contents of the file have not been changed.

## How was this patch tested?

Does not require any tests.

Author: Arun Allamsetty <arun@instructure.com>

Closes #12742 from aa8y/gitignore.
2016-04-27 17:35:25 -07:00
Josh Rosen 8c49cebce5 [SPARK-14966] SizeEstimator should ignore classes in the scala.reflect package
In local profiling, I noticed SizeEstimator spending tons of time estimating the size of objects which contain TypeTag or ClassTag fields. The problem with these tags is that they reference global Scala reflection objects, which, in turn, reference many singletons, such as TestHive. This throws off the accuracy of the size estimation and wastes tons of time traversing a huge object graph.

As a result, I think that SizeEstimator should ignore any classes in the `scala.reflect` package.
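
The idea can be sketched in Python as a graph walk that refuses to descend into objects whose class comes from an ignored module. This is only an analogy to the JVM `SizeEstimator`: `estimate_size`, the use of `sys.getsizeof`, and the `reflect_sketch` module name are all assumptions made for the example.

```python
import sys
from typing import Any, List, Set, Tuple


def estimate_size(obj: Any, ignored_prefixes: Tuple[str, ...] = ()) -> int:
    """Sum sys.getsizeof over the reachable graph, skipping ignored classes."""
    seen: Set[int] = set()
    stack: List[Any] = [obj]
    total = 0
    while stack:
        current = stack.pop()
        if id(current) in seen:
            continue
        seen.add(id(current))
        module = type(current).__module__
        if ignored_prefixes and module.startswith(ignored_prefixes):
            continue  # skip this object and whatever is reachable only through it
        total += sys.getsizeof(current)
        if isinstance(current, dict):
            stack.extend(current.keys())
            stack.extend(current.values())
        elif isinstance(current, (list, tuple, set, frozenset)):
            stack.extend(current)
        elif hasattr(current, "__dict__"):
            stack.append(vars(current))
    return total


class Tag:
    # Pretend this class lives in a reflection package with global references.
    __module__ = "reflect_sketch.tags"

    def __init__(self) -> None:
        self.registry = list(range(1000))


class Record:
    def __init__(self) -> None:
        self.payload = [1, 2, 3]
        self.tag = Tag()


record = Record()
full_size = estimate_size(record)
trimmed_size = estimate_size(record, ignored_prefixes=("reflect_sketch",))
```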

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12741 from JoshRosen/ignore-scala-reflect-in-size-estimator.
2016-04-27 17:34:55 -07:00
Joseph K. Bradley f5ebb18c45 [SPARK-14671][ML] Pipeline setStages should handle subclasses of PipelineStage
## What changes were proposed in this pull request?

Pipeline.setStages failed for some code examples which worked in 1.5 but fail in 1.6.  This tends to occur when using a mix of transformers from ml.feature. It is because Scala Arrays are not covariant and the addition of MLWritable to some transformers means the stages0/1 arrays above are not of type Array[PipelineStage].  This PR modifies the following to accept subclasses of PipelineStage:
* Pipeline.setStages()
* Params.w()

## How was this patch tested?

Unit test which fails to compile before this fix.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12430 from jkbradley/pipeline-setstages.
2016-04-27 16:11:12 -07:00
Josh Rosen 6466d6c8a4 Revert "[SPARK-14683][DOCUMENTATION] Configure external links in ScalaDoc"
This reverts commit 3f49afee93.
2016-04-27 15:49:21 -07:00