15041 commits

Author SHA1 Message Date
Xusen Yin 8d29001dec [SPARK-13011] K-means wrapper in SparkR
https://issues.apache.org/jira/browse/SPARK-13011

Author: Xusen Yin <yinxusen@gmail.com>

Closes #11124 from yinxusen/SPARK-13011.
2016-02-23 15:42:58 -08:00
Timothy Hunter 15e3015563 [SPARK-6761][SQL][ML] Fixes to API and documentation of approximate quantiles
## What changes were proposed in this pull request?

This continues thunterdb's work on the `approxQuantile` API. It changes the signature of `approxQuantile` from `(col: String, quantile: Double, epsilon: Double): Double` to `(col: String, probabilities: Array[Double], relativeError: Double): Array[Double]` and updates the API doc. It also improves the error message in tests and simplifies the merge algorithm for summaries.
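
To illustrate the shape of the new signature (several probabilities in, one value per probability out), here is a minimal Python sketch. It is not Spark's implementation, which works on quantile summaries; `approx_quantiles` is a hypothetical helper that simply sorts the data, so its answers are exact and any non-negative `relative_error` is trivially satisfied.

```python
import math


def approx_quantiles(values, probabilities, relative_error):
    """Return one quantile per probability (hypothetical helper, not Spark's API)."""
    if relative_error < 0:
        raise ValueError("relative_error must be non-negative")
    if any(p < 0.0 or p > 1.0 for p in probabilities):
        raise ValueError("probabilities must lie in [0, 1]")
    ordered = sorted(values)
    if not ordered:
        return []
    n = len(ordered)
    result = []
    for p in probabilities:
        # Target rank ceil(p * n), clamped to [1, n], then used as a 1-based rank.
        rank = min(max(math.ceil(p * n), 1), n)
        result.append(ordered[rank - 1])
    return result


quartiles = approx_quantiles([5.0, 1.0, 4.0, 2.0, 3.0], [0.25, 0.5, 1.0], 0.0)
```

Returning a list keeps one pass over the data useful for many probabilities at once, which is the point of moving from a single `quantile` to `probabilities`.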

## How was this patch tested?

Use the same unit tests as before.

Closes #11325

Author: Timothy Hunter <timhunter@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #11332 from mengxr/SPARK-6761.
2016-02-23 15:31:17 -08:00
Davies Liu 9cdd867da9 [SPARK-13373] [SQL] generate sort merge join
## What changes were proposed in this pull request?

Generates code for SortMergeJoin.

## How was this patch tested?

Unit tests, plus a manual test with TPCDS Q72, which showed a 70% performance improvement (from 42s to 25s). Micro benchmarks show only minor improvements; the gain may depend on the distribution of the data and the number of columns.

Author: Davies Liu <davies@databricks.com>

Closes #11248 from davies/gen_smj.
2016-02-23 15:00:10 -08:00
Davies Liu c481bdf512 [SPARK-13329] [SQL] considering output for statistics of logical plan
The current implementation of statistics for UnaryNode does not consider the output (for example, a Project may produce far fewer columns than its child). We should take it into account to get a better guess.

We usually join on only a few columns from a parquet table, so the projected plan can be much smaller than the original parquet files. A better size estimate helps us choose between broadcast join and sort merge join.

After this PR, I saw a few queries choose broadcast join rather than sort merge join without tuning spark.sql.autoBroadcastJoinThreshold for every query, which ended up with about 6-8X improvements in end-to-end time.

We use the `defaultSize` of DataType to estimate the size of a column. Currently, for DecimalType/StringType/BinaryType and UDT, we over-estimate by far too much (4096 bytes), so this PR changes them to more reasonable values. Here are the new defaultSize values:

DecimalType:  8 or 16 bytes, based on the precision
StringType:  20 bytes
BinaryType: 100 bytes
UDT: default size of SQL type

These numbers are not perfect (hard to have a perfect number for them), but should be better than 4096.
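
As a rough illustration of why the smaller defaults matter, the Python sketch below sums per-column default sizes for the projected output and compares the estimate with a broadcast threshold. It is a simplified model, not the planner's code. Two values are assumptions not stated above: the 18-digit cutoff between 8-byte and 16-byte decimals, and the 10 MB threshold (the documented default of spark.sql.autoBroadcastJoinThreshold).

```python
# Per-type default sizes in bytes, taken from the list above.
STRING_SIZE = 20
BINARY_SIZE = 100
# Assumption: decimals with at most 18 digits fit in a long and take 8 bytes.
MAX_LONG_DIGITS = 18
# Assumption: 10 MB, the documented default of spark.sql.autoBroadcastJoinThreshold.
BROADCAST_THRESHOLD = 10 * 1024 * 1024


def default_size(column_type, precision=None):
    """Estimated size in bytes of one value of the given column type."""
    if column_type == "decimal":
        return 8 if precision is not None and precision <= MAX_LONG_DIGITS else 16
    if column_type == "string":
        return STRING_SIZE
    if column_type == "binary":
        return BINARY_SIZE
    raise ValueError(f"unknown column type: {column_type}")


def estimated_plan_size(row_count, columns):
    """columns is a list of (type, precision) pairs for the projected output."""
    row_size = sum(default_size(t, p) for t, p in columns)
    return row_count * row_size


def choose_join(row_count, columns):
    """Pick broadcast join when the estimated size fits under the threshold."""
    size = estimated_plan_size(row_count, columns)
    return "broadcast" if size <= BROADCAST_THRESHOLD else "sort_merge"
```

For 100,000 rows projected to one string and one decimal(10) column, the estimate is 2,800,000 bytes, which fits under the assumed threshold; with 4096 bytes per column the same projection would have been estimated at 819,200,000 bytes.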

Author: Davies Liu <davies@databricks.com>

Closes #11210 from davies/statics.
2016-02-23 12:55:44 -08:00
Michael Armbrust c5bfe5d2a2 [SPARK-13440][SQL] ObjectType should accept any ObjectType, If should not care about nullability
The type checking functions of `If` and `UnwrapOption` are fixed to eliminate spurious failures.  `UnwrapOption` was checking for an input of `ObjectType` but `ObjectType`'s accept function was hard coded to return `false`.  `If`'s type check was returning a false negative in the case that the two options differed only by nullability.

Tests added:
 -  an end-to-end regression test is added to `DatasetSuite` for the reported failure.
 - all the unit tests in `ExpressionEncoderSuite` are augmented to also confirm successful analysis.  These tests are actually what pointed out the additional issues with `If` resolution.

Author: Michael Armbrust <michael@databricks.com>

Closes #11316 from marmbrus/datasetOptions.
2016-02-23 11:20:27 -08:00
Lianhui Wang 9f4263392e [SPARK-7729][UI] Executor which has been killed should also be displayed on Executor Tab
andrewor14 squito Dead executors should also be displayed on the Executor Tab, as follows:
![image](https://cloud.githubusercontent.com/assets/545478/11492707/ae55d7f6-982b-11e5-919a-b62cd84684b2.png)

Author: Lianhui Wang <lianhuiwang09@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Andrew Or <andrew@databricks.com>

Closes #10058 from lianhuiwang/SPARK-7729.
2016-02-23 11:08:39 -08:00
Grzegorz Chilkiewicz 5d69eaf097 [SPARK-13338][ML] Allow setting 'degree' parameter to 1 for PolynomialExpansion
Author: Grzegorz Chilkiewicz <grzegorz.chilkiewicz@codilime.com>

Closes #11216 from grzegorz-chilkiewicz/master.
2016-02-23 10:30:02 -08:00
zhuol 4d1e5f92e1 [SPARK-13364] Sort appId as num rather than str in history page.
## What changes were proposed in this pull request?

The history page currently sorts the appId as a string, which can lead to an unexpected order for cases such as "application_11111_9" and "application_11111_20".
Adding a new sort type called appId-numeric fixes it.
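
The difference between the two orderings can be reproduced with a small Python sketch. The helper name `app_id_sort_key` is hypothetical, and the history page itself implements the sort in the UI, not in Python.

```python
def app_id_sort_key(app_id):
    """Split an ID such as 'application_11111_9' into a numeric-aware sort key."""
    parts = app_id.split("_")
    # Numeric segments compare as integers, everything else as text.
    # The leading flag keeps ints and strs from being compared directly.
    return [(0, int(p), "") if p.isdigit() else (1, 0, p) for p in parts]


ids = ["application_11111_20", "application_11111_9"]
as_strings = sorted(ids)                       # "..._20" sorts before "..._9"
as_numbers = sorted(ids, key=app_id_sort_key)  # 9 sorts before 20
```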

## How was this patch tested?
This patch was manually tested with UI. See the screenshot below:
![sortappidbetter](https://cloud.githubusercontent.com/assets/11683054/13185564/7f941a16-d707-11e5-8fb7-0316368d3030.png)

Author: zhuol <zhuol@yahoo-inc.com>

Closes #11259 from zhuoliu/13364.
2016-02-23 11:16:42 -06:00
Liang-Chi Hsieh 87d7f8904a [SPARK-13358] [SQL] Retrieve grep path when do benchmark
JIRA: https://issues.apache.org/jira/browse/SPARK-13358

When trying to run a benchmark, I found that on my Ubuntu Linux machine grep is not in /usr/bin/ but in /bin/. So I wonder whether it is better to use `which` to retrieve the grep path.
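
The same idea, sketched in Python with the standard library: resolve the executable through PATH instead of hard-coding its directory. `find_executable` and the fallback path are illustrative names, not part of the patch.

```python
import shutil


def find_executable(name, fallback=None):
    """Resolve an executable on PATH, like the shell's `which`."""
    path = shutil.which(name)
    return path if path is not None else fallback


# The fallback is only used when grep is not on PATH at all.
grep_path = find_executable("grep", fallback="/usr/bin/grep")
```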

cc davies

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #11231 from viirya/benchmark-grep-path.
2016-02-23 07:56:08 -08:00
jerryshao e99d017098 [SPARK-13220][CORE] deprecate yarn-client and yarn-cluster mode
Author: jerryshao <sshao@hortonworks.com>

Closes #11229 from jerryshao/SPARK-13220.
2016-02-23 12:30:57 +00:00
gatorsmile 87250580f2 [SPARK-13263][SQL] SQL Generation Support for Tablesample
In the parser, tableSample clause is part of tableSource.
```
tableSource
init { gParent.pushMsg("table source", state); }
after { gParent.popMsg(state); }
    : tabname=tableName
    ((tableProperties) => props=tableProperties)?
    ((tableSample) => ts=tableSample)?
    ((KW_AS) => (KW_AS alias=Identifier)
    |
    (Identifier) => (alias=Identifier))?
    -> ^(TOK_TABREF $tabname $props? $ts? $alias?)
    ;
```

Two typical query samples using TABLESAMPLE are:
```
    "SELECT s.id FROM t0 TABLESAMPLE(10 PERCENT) s"
    "SELECT * FROM t0 TABLESAMPLE(0.1 PERCENT)"
```

FYI, the logical plan of a TABLESAMPLE query:
```
sql("SELECT * FROM t0 TABLESAMPLE(0.1 PERCENT)").explain(true)

== Analyzed Logical Plan ==
id: bigint
Project [id#16L]
+- Sample 0.0, 0.001, false, 381
   +- Subquery t0
      +- Relation[id#16L] ParquetRelation
```

Thanks! cc liancheng

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

This patch had conflicts when merged, resolved by
Committer: Cheng Lian <lian@databricks.com>

Closes #11148 from gatorsmile/tablesplitsample.
2016-02-23 16:13:09 +08:00
movelikeriver 5cd3e6f60b [SPARK-13257][IMPROVEMENT] Refine naive Bayes example by checking model after loading it
Refine naive Bayes example by checking model after loading it

Author: movelikeriver <mars.lenjoy@gmail.com>

Closes #11125 from movelikeriver/naive_bayes.
2016-02-22 23:58:54 -08:00
Xiangrui Meng 764ca18037 [SPARK-13355][MLLIB] replace GraphImpl.fromExistingRDDs by Graph.apply
`GraphImpl.fromExistingRDDs` expects preprocessed vertex RDD as input. We call it in LDA without validating this requirement. So it might introduce errors. Replacing it by `Graph.apply` would be safer and more proper because it is a public API. The tests still pass. So maybe it is safe to use `fromExistingRDDs` here (though it doesn't seem so based on the implementation) or the test cases are special. jkbradley ankurdave

Author: Xiangrui Meng <meng@databricks.com>

Closes #11226 from mengxr/SPARK-13355.
2016-02-22 23:54:21 -08:00
Yanbo Liang 72427c3e11 [SPARK-13429][MLLIB] Unify Logistic Regression convergence tolerance of ML & MLlib
## What changes were proposed in this pull request?
To provide better and more consistent results, let's change the default value of MLlib ```LogisticRegressionWithLBFGS convergenceTol``` from ```1E-4``` to ```1E-6```, which will be equal to ML ```LogisticRegression```.
cc dbtsai
## How was this patch tested?
unit tests

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11299 from yanboliang/spark-13429.
2016-02-22 23:37:09 -08:00
Timothy Hunter 4fd1993692 [SPARK-6761][SQL] Approximate quantile for DataFrame
JIRA: https://issues.apache.org/jira/browse/SPARK-6761

Compute approximate quantile based on the paper Greenwald, Michael and Khanna, Sanjeev, "Space-efficient Online Computation of Quantile Summaries," SIGMOD '01.

Author: Timothy Hunter <timhunter@databricks.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6042 from viirya/approximate_quantile.
2016-02-22 23:31:00 -08:00
gatorsmile 01e10c9fef [SPARK-13236] SQL Generation for Set Operations
This PR is to implement SQL generation for the following three set operations:
- Union Distinct
- Intersect
- Except

liancheng Thanks!

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #11195 from gatorsmile/setOpSQLGen.
2016-02-23 15:16:59 +08:00
gatorsmile 9dd5399d78 [SPARK-12723][SQL] Comprehensive Verification and Fixing of SQL Generation Support for Expressions
#### What changes were proposed in this pull request?

Ensure that all built-in expressions can be mapped to their SQL representation if there is one (e.g. ScalaUDF doesn't have a SQL representation). The function lists are from the expression list in `FunctionRegistry`.

Window functions, grouping sets functions (`cube`, `rollup`, `grouping`, `grouping_id`), and generator functions (`explode` and `json_tuple`) are covered by separate JIRAs and PRs, so this PR does not cover them. Except for these functions, all the built-in expressions are covered. For details, see the list in `ExpressionToSQLSuite`.

Fixed a few issues. For example, the `prettyName` of `approx_count_distinct` is not right. The `sql` of the `hash` function is not right, since the `hash` function does not accept `seed`.

Additionally, this corrects the order of expressions in `FunctionRegistry` so that it is easier to find which functions are missing.

cc liancheng

#### How was this patch tested?
Added two test cases in LogicalPlanToSQLSuite for covering `not like` and `not in`.

Added a new test suite `ExpressionToSQLSuite` to cover the functions:

1. misc non-aggregate functions + complex type creators + null expressions
2. math functions
3. aggregate functions
4. string functions
5. date time functions + calendar interval
6. collection functions
7. misc functions

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11314 from gatorsmile/expressionToSQL.
2016-02-22 22:17:56 -08:00
Daoyuan Wang 5d80fac58f [SPARK-11624][SPARK-11972][SQL] fix commands that need hive to exec
In SparkSQLCLI, we create a `CliSessionState`, but then we call `SparkSQLEnv.init()`, which starts another `SessionState`. This leads to an exception because `processCmd` needs to get the `CliSessionState` instance by calling `SessionState.get()`, but the return value is an instance of `SessionState`. See the exception below.

spark-sql> !echo "test";
Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.hive.ql.session.SessionState cannot be cast to org.apache.hadoop.hive.cli.CliSessionState
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:112)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:301)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:242)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:691)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #9589 from adrian-wang/clicommand.
2016-02-22 18:13:32 -08:00
Shixiong Zhu a11b399519 [SPARK-13298][CORE][UI] Escape "label" to avoid DAG being broken by some special character
## What changes were proposed in this pull request?

When `label` contains special characters (e.g., `"`, `\`), the DAG will be broken. This patch escapes `label` to avoid that.
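
A minimal Python sketch of this kind of escaping, assuming the label ends up inside a double-quoted DOT attribute. The helper names are hypothetical, and the actual patch may rely on a different escaping utility.

```python
def escape_dot_label(label):
    """Escape backslashes and double quotes for use inside a quoted DOT label."""
    # Backslashes go first so the ones added for quotes are not escaped again.
    return label.replace("\\", "\\\\").replace('"', '\\"')


def dot_node(node_id, label):
    """Render one DOT node statement with a safely quoted label."""
    return f'{node_id} [label="{escape_dot_label(label)}"];'
```

The order of the two replacements matters: escaping quotes first would add backslashes that the second step then doubles.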

## How was this patch tested?

Jenkins tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11309 from zsxwing/SPARK-13298.
2016-02-22 17:42:30 -08:00
Narine Kokhlikyan 33ef3aa7ea [SPARK-13295][ ML, MLLIB ] AFTSurvivalRegression.AFTAggregator improvements - avoid creating new instances of arrays/vectors for each record
As also marked by a TODO in the AFTAggregator.AFTAggregator.add(data: AFTPoint) method, a new array is created for the intercept value and concatenated with another array that contains the betas. The resulting array is converted into a dense vector, which in turn is converted into a breeze vector.
This is expensive and not necessarily beautiful.

I've tried to solve the above-mentioned problem with a simple algebraic decomposition: keeping and treating the intercept independently.
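
The decomposition can be sketched in plain Python. Both functions compute the same margin; the second avoids building temporary lists for each record. Extending the features with a leading 1.0 in the first function is an assumption made for illustration, and none of these names come from the Spark code.

```python
def margin_concat(intercept, coefficients, features):
    """Concatenate per record, then take a single dot product."""
    weights = [intercept] + list(coefficients)  # new list for every record
    extended = [1.0] + list(features)           # new list for every record
    return sum(w * x for w, x in zip(weights, extended))


def margin_split(intercept, coefficients, features):
    """Same value, with the intercept kept separate and no temporary lists."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))
```

The identity behind it is that the dot product of (intercept, betas) with (1, x) equals intercept plus the dot product of betas with x.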

Please let me know what you think and if you have any questions.

Thanks,
Narine

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>

Closes #11179 from NarineK/survivaloptim.
2016-02-22 17:26:32 -08:00
Devaraj K 02b1fefffb [SPARK-13012][DOCUMENTATION] Replace example code in ml-guide.md using include_example
Replaced example code in ml-guide.md using include_example

Author: Devaraj K <devaraj@apache.org>

Closes #11053 from devaraj-kavali/SPARK-13012.
2016-02-22 17:21:37 -08:00
Devaraj K 9f410871ca [SPARK-13016][DOCUMENTATION] Replace example code in mllib-dimensionality-reduction.md using include_example
Replaced example code in mllib-dimensionality-reduction.md using include_example

Author: Devaraj K <devaraj@apache.org>

Closes #11132 from devaraj-kavali/SPARK-13016.
2016-02-22 17:16:56 -08:00
Xiu Guo 2063781840 [SPARK-13422][SQL] Use HashedRelation instead of HashSet in Left Semi Joins
Use HashedRelation, which is a more optimized data structure, and reduce code complexity.

Author: Xiu Guo <xguo27@gmail.com>

Closes #11291 from xguo27/SPARK-13422.
2016-02-22 16:34:02 -08:00
Michael Armbrust 173aa949c3 [SPARK-12546][SQL] Change default number of open parquet files
A common problem that users encounter with Spark 1.6.0 is that writing to a partitioned parquet table OOMs.  The root cause is that parquet allocates a significant amount of memory that is not accounted for by our own mechanisms.  As a workaround, we can ensure that only a single file is open per task unless the user explicitly asks for more.

Author: Michael Armbrust <michael@databricks.com>

Closes #11308 from marmbrus/parquetWriteOOM.
2016-02-22 15:27:29 -08:00
Reynold Xin 4a91806a45 [SPARK-13413] Remove SparkContext.metricsSystem
## What changes were proposed in this pull request?

This patch removes SparkContext.metricsSystem. SparkContext.metricsSystem returns MetricsSystem, which is a private class. I think it was added by accident.

In addition, I also removed an unused private[spark] method schedulerBackend setter.

## How was this patch tested?

N/A.

Author: Reynold Xin <rxin@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Josh Rosen <joshrosen@databricks.com>

Closes #11282 from rxin/SPARK-13413.
2016-02-22 14:01:35 -08:00
Timothy Chen 00461bb911 [SPARK-10749][MESOS] Support multiple roles with mesos cluster mode.
Currently the Mesos cluster dispatcher does not use offers from multiple roles correctly: it simply aggregates all the offers' resource values into one, but does not apply them correctly before calling the driver, because Mesos needs each resource from the offers to specify which role it originally belongs to. Multiple roles are already supported by the fine/coarse-grained schedulers, so this ports that logic to the cluster scheduler.

https://issues.apache.org/jira/browse/SPARK-10749

Author: Timothy Chen <tnachen@gmail.com>

Closes #8872 from tnachen/cluster_multi_roles.
2016-02-22 11:11:33 -08:00
Yanbo Liang 40e6d40fe7 [SPARK-13334][ML] ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should set parent
ML ```KMeansModel / BisectingKMeansModel / QuantileDiscretizer``` should set parent.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11214 from yanboliang/spark-13334.
2016-02-22 12:59:50 +02:00
Bryan Cutler e298ac91e3 [SPARK-12632][PYSPARK][DOC] PySpark fpm and als parameter desc to consistent format
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent.  This is for the fpm and recommendation modules.

Closes #10602
Closes #10897

Author: Bryan Cutler <cutlerb@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>

Closes #11186 from BryanCutler/param-desc-consistent-fpmrecc-SPARK-12632.
2016-02-22 12:48:37 +02:00
Dongjoon Hyun 024482bf51 [MINOR][DOCS] Fix all typos in markdown files of doc and similar patterns in other comments
## What changes were proposed in this pull request?

This PR tries to fix all typos in all markdown files under `docs` module,
and fixes similar typos in other comments, too.

## How was this patch tested?

manual tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11300 from dongjoon-hyun/minor_fix_typos.
2016-02-22 09:52:07 +00:00
Holden Karau 1b144455b6 [SPARK-13399][STREAMING] Fix checkpointsuite type erasure warnings
## What changes were proposed in this pull request?

Change the checkpointsuite getting the outputstreams to explicitly be unchecked on the generic type so as to avoid the warnings. This only impacts test code.

Alternatively we could encode the type tag in the TestOutputStreamWithPartitions and filter the type tag as well - but this is unnecessary since multiple testoutputstreams are not registered and the previous code was not actually checking this type.

## How was this patch tested?

unit tests (streaming/testOnly org.apache.spark.streaming.CheckpointSuite)

Author: Holden Karau <holden@us.ibm.com>

Closes #11286 from holdenk/SPARK-13399-checkpointsuite-type-erasure.
2016-02-22 09:50:51 +00:00
Yong Gang Cao ef1047fca7 [SPARK-12153][SPARK-7617][MLLIB] add support of arbitrary length sentence and other tuning for Word2Vec
Add support for arbitrary-length sentences by using the natural representation of sentences in the input.

Add new similarity functions and a normalization option for distances in synonym finding.
Add a new accessor for internal structures (the vocabulary and wordindex) for convenience.

Need instructions on how to set the value of the Since annotation for newly added public functions. 1.5.3?

jira link: https://issues.apache.org/jira/browse/SPARK-12153

Author: Yong Gang Cao <ygcao@amazon.com>
Author: Yong-Gang Cao <ygcao@users.noreply.github.com>

Closes #10152 from ygcao/improvementForSentenceBoundary.
2016-02-22 09:47:36 +00:00
Huaxin Gao 8f35d3eac9 [SPARK-13186][STREAMING] migrate away from SynchronizedMap
trait SynchronizedMap in package mutable is deprecated: Synchronization via traits is deprecated as it is inherently unreliable. Change to java.util.concurrent.ConcurrentHashMap instead.

Author: Huaxin Gao <huaxing@us.ibm.com>

Closes #11250 from huaxingao/spark__13186.
2016-02-22 09:44:32 +00:00
jerryshao 39ff154570 [SPARK-13426][CORE] Remove the support of SIMR
## What changes were proposed in this pull request?

This PR removes the support of SIMR. SIMR has not been actively used or maintained for a long time and is not supported by `SparkSubmit`, so I propose to remove it.

## How was this patch tested?

This patch is tested locally by running unit tests.

Author: jerryshao <sshao@hortonworks.com>

Closes #11296 from jerryshao/SPARK-13426.
2016-02-22 00:57:10 -08:00
Yanbo Liang 8a4ed78869 [SPARK-13379][MLLIB] Fix MLlib LogisticRegressionWithLBFGS set regularization incorrectly
## What changes were proposed in this pull request?
Fix MLlib LogisticRegressionWithLBFGS regularization map as:
```SquaredL2Updater``` -> ```elasticNetParam = 0.0```
```L1Updater``` -> ```elasticNetParam = 1.0```
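
To show what the two endpoints of this mapping mean, the Python sketch below evaluates the standard elastic-net penalty, regParam * (alpha * L1 + (1 - alpha) / 2 * L2), for both settings. The dictionary mirrors the mapping above; the function name is hypothetical and this is not MLlib code.

```python
# Mapping stated in the commit message: updater name -> elasticNetParam.
ELASTIC_NET_PARAM = {
    "SquaredL2Updater": 0.0,  # pure L2 regularization
    "L1Updater": 1.0,         # pure L1 regularization
}


def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Standard elastic-net penalty: alpha * L1 + (1 - alpha) / 2 * L2."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2)
```

With `elasticNetParam = 0.0` only the squared L2 term remains, and with `1.0` only the L1 term remains, so swapping the two values would apply the wrong kind of regularization.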
cc dbtsai
## How was this patch tested?
unit tests

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11258 from yanboliang/spark-13379.
2016-02-21 20:20:41 -08:00
Reynold Xin 9bf6a926a1 [HOTFIX] Fix compilation break 2016-02-21 19:37:35 -08:00
hyukjinkwon 819b0ea029 [SPARK-13381][SQL] Support for loading CSV with a single function call
https://issues.apache.org/jira/browse/SPARK-13381

This PR adds the support to load CSV data directly by a single call with given paths.

Also, I corrected this to refer to all paths rather than only the first path in schema inference, which is what the JSON datasource does.

Several unit tests were added for each functionality.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #11262 from HyukjinKwon/SPARK-13381.
2016-02-21 19:11:03 -08:00
Liang-Chi Hsieh 55d6fdf22d [SPARK-13321][SQL] Support nested UNION in parser
JIRA: https://issues.apache.org/jira/browse/SPARK-13321

The following SQL can not be parsed with current parser:

    SELECT  `u_1`.`id` FROM (((SELECT  `t0`.`id` FROM `default`.`t0`) UNION ALL (SELECT  `t0`.`id` FROM `default`.`t0`)) UNION ALL (SELECT  `t0`.`id` FROM `default`.`t0`)) AS u_1

We should fix it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #11204 from viirya/nested-union.
2016-02-21 19:10:17 -08:00
Robin East 3d79f1065c [SPARK-3650][GRAPHX] Triangle Count handles reverse edges incorrectly
jegonzal ankurdave please could you review

## What changes were proposed in this pull request?

Reworking of jegonzal PR #2495 to address the issue identified in SPARK-3650. Code amended to use the convertToCanonicalEdges method.
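
Why canonical edges matter can be seen in a small Python sketch: once every edge is oriented as (low, high) and duplicates are dropped, a reverse edge such as (2, 1) no longer distorts the count. This is an illustrative model only; GraphX's `convertToCanonicalEdges` and its per-vertex triangle counts are not reproduced here, and the function names are hypothetical.

```python
def canonical_edges(edges):
    """Orient every edge as (low, high), dropping self-loops and duplicates."""
    return {(min(s, d), max(s, d)) for s, d in edges if s != d}


def triangle_count(edges):
    """Count distinct triangles in the undirected graph induced by `edges`."""
    canon = canonical_edges(edges)
    neighbors = {}
    for s, d in canon:
        neighbors.setdefault(s, set()).add(d)
        neighbors.setdefault(d, set()).add(s)
    count = 0
    for s, d in canon:
        # Every common neighbor of s and d closes one triangle on this edge.
        count += len(neighbors[s] & neighbors[d])
    # Each triangle is found once per edge, so three times in total.
    return count // 3
```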

## How was this patch tested?

Patch was tested using the unit tests created in PR #2495

Author: Robin East <robin.east@xense.co.uk>
Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>

Closes #11290 from insidedctm/spark-3650.
2016-02-21 17:07:17 -08:00
Franklyn D'souza 0f90f4e6ac [SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns.
## What changes were proposed in this pull request?

This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations.

This was previously causing `"AnalysisException: u"unresolved operator 'Union;""` when trying to unionAll two dataframes with UDT columns as below.

```
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types

schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])

a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)

c = a.unionAll(b)
```
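
The underlying issue can be sketched without Spark: if a type descriptor does not define equality, two instances of the same class compare unequal, and a schema check that relies on `==` fails. `PointUDT` and `schemas_match` below are hypothetical stand-ins, not the PySpark classes.

```python
class PointUDT:
    """Hypothetical stand-in for a user-defined type descriptor."""

    def __eq__(self, other):
        # Two descriptors of the same class describe the same data type.
        return type(self) is type(other)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        # Keep hashing consistent with __eq__.
        return hash(type(self).__name__)


def schemas_match(left, right):
    """A union needs the column types on both sides to compare equal."""
    return len(left) == len(right) and all(a == b for a, b in zip(left, right))
```

Without `__eq__`, Python falls back to identity comparison, so `PointUDT() == PointUDT()` would be `False` even though both describe the same type.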

## How was this patch tested?

Tested using two unit tests in sql/test.py and the DataFrameSuite.

Additional information here : https://issues.apache.org/jira/browse/SPARK-13410

Author: Franklyn D'souza <franklynd@gmail.com>

Closes #11279 from damnMeddlingKid/udt-union-all.
2016-02-21 16:58:17 -08:00
Shixiong Zhu 0cbadf28c9 [SPARK-13271][SQL] Better error message if 'path' is not specified
Improved the error message as per discussion in https://github.com/apache/spark/pull/11034#discussion_r52111238. Also made `path` and `metadataPath` in FileStreamSource case insensitive.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11154 from zsxwing/path.
2016-02-21 15:34:39 -08:00
Shixiong Zhu 76bd98d914 [SPARK-13405][STREAMING][TESTS] Make sure no messages leak to the next test
## What changes were proposed in this pull request?

Fixed the test failure `org.apache.spark.sql.util.ContinuousQueryListenerSuite.event ordering`: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/202/testReport/junit/org.apache.spark.sql.util/ContinuousQueryListenerSuite/event_ordering/

```
      org.scalatest.exceptions.TestFailedException:
Assert failed: : null equaled null onQueryTerminated called before onQueryStarted
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
	org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
	org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
	org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector$$anonfun$onQueryTerminated$1.apply$mcV$sp(ContinuousQueryListenerSuite.scala:204)
	org.scalatest.concurrent.AsyncAssertions$Waiter.apply(AsyncAssertions.scala:349)
	org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector.onQueryTerminated(ContinuousQueryListenerSuite.scala:203)
	org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:67)
	org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:32)
	org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
	org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.postToAll(ContinuousQueryListenerBus.scala:32)
```

In the previous code, when the test `adding and removing listener` finishes, there may still be some QueryTerminated events in the listener bus queue. Then when `event ordering` starts to run, it may see these events and throw the above exception.

This PR adds `waitUntilEmpty` in `after` to make sure all events are consumed after each test.
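
The pattern can be sketched in Python with a queue and a delivery thread. `ListenerBus` here is a hypothetical model, not Spark's class; `wait_until_empty` blocks until every posted event has been delivered, which is what a test teardown needs before the next test registers its own listener.

```python
import queue
import threading


class ListenerBus:
    """Hypothetical asynchronous event bus with a single delivery thread."""

    def __init__(self):
        self._events = queue.Queue()
        self._listeners = []
        threading.Thread(target=self._run, daemon=True).start()

    def add_listener(self, listener):
        self._listeners.append(listener)

    def post(self, event):
        self._events.put(event)

    def wait_until_empty(self):
        # Blocks until task_done() has been called for every posted event.
        self._events.join()

    def _run(self):
        while True:
            event = self._events.get()
            try:
                for listener in list(self._listeners):
                    listener(event)
            finally:
                self._events.task_done()
```

Calling `wait_until_empty()` in the teardown hook drains the queue, so a late event from one test cannot reach a listener registered by the next.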

## How was this patch tested?

Jenkins tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11275 from zsxwing/SPARK-13405.
2016-02-21 15:32:49 -08:00
Dongjoon Hyun 03e62aa3f6 [MINOR][DOCS] Fix typos in configuration.md and hardware-provisioning.md
## What changes were proposed in this pull request?

This PR fixes some typos in the following documentation files.
 * `NOTICE`, `configuration.md`, and `hardware-provisioning.md`.

## How was this patch tested?

manual tests

Author: Dongjoon Hyun <dongjoonapache.org>

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11289 from dongjoon-hyun/minor_fix_typos_notice_and_confdoc.
2016-02-21 15:27:07 -08:00
Andrew Or 6c3832b26e [SPARK-13080][SQL] Implement new Catalog API using Hive
## What changes were proposed in this pull request?

This is a step towards merging `SQLContext` and `HiveContext`. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using `HiveClient`, an existing interface to Hive. It also extends `HiveClient` with additional calls to Hive that are needed to complete the catalog implementation.

*Where should I start reviewing?* The new catalog introduced is `HiveCatalog`. This class is relatively simple because it just calls `HiveClientImpl`, where most of the new logic is. I would not start with `HiveClient`, `HiveQl`, or `HiveMetastoreCatalog`, which are modified mainly because of a refactor.

*Why is this patch so big?* I had to refactor HiveClient to remove an intermediate representation of databases, tables, partitions etc. After this refactor, `CatalogTable` converts directly to and from `HiveTable` (etc.). Otherwise we would have to first convert `CatalogTable` to the intermediate representation and then convert that to HiveTable, which is messy.

The new class hierarchy is as follows:
```
org.apache.spark.sql.catalyst.catalog.Catalog
  - org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
  - org.apache.spark.sql.hive.HiveCatalog
```

Note that, as of this patch, none of these classes are used anywhere yet. This will come before the Spark 2.0 release.

## How was this patch tested?
All existing unit tests, and HiveCatalogSuite that extends CatalogTestCases.

Author: Andrew Or <andrew@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #11293 from rxin/hive-catalog.
2016-02-21 15:00:24 -08:00
hyukjinkwon 7eb83fefd1 [SPARK-13137][SQL] NullPointerException in schema inference for CSV when the first line is empty
https://issues.apache.org/jira/browse/SPARK-13137

This PR adds a filter in schema inference so that it does not throw a NullPointerException.

Also, I removed `MAX_COMMENT_LINES_IN_HEADER` and instead used monadic chaining with `filter()` and `first()`.
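
The filter-then-first idea looks roughly like this in Python. The helper name and the `#` comment prefix are assumptions made for illustration; the actual change operates on an RDD of lines.

```python
def first_data_line(lines, comment_prefix="#"):
    """Return the first non-empty, non-comment line, or None if there is none."""
    candidates = (
        line for line in lines
        if line.strip() and not line.startswith(comment_prefix)
    )
    # next() with a default avoids raising when every line is filtered out.
    return next(candidates, None)
```

Because the filter runs before the first element is taken, an empty first line is skipped instead of being handed to the parser, and no fixed limit on leading comment lines is needed.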

Lastly, I simply added a newline rather than adding a new file for this so that this is covered with the original tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #11023 from HyukjinKwon/SPARK-13137.
2016-02-21 13:21:59 -08:00
Herman van Hovell b6a873d6d4 [SPARK-13136][SQL] Create a dedicated Broadcast exchange operator
Quite a few Spark SQL join operators broadcast one side of the join to all nodes. There are a few problems with this:

- This conflates broadcasting (a data exchange) with joining. Data exchanges should be managed by a different operator.
- All these nodes implement their own (duplicate) broadcasting logic.
- Re-use of indices is quite hard.

This PR defines both a ```BroadcastDistribution``` and a ```BroadcastPartitioning```, each of which contains a `BroadcastMode`. The `BroadcastMode` defines the way in which we transform the array of `InternalRow`s into an index. We currently support the following `BroadcastMode`s:

- IdentityBroadcastMode: This broadcasts the rows in their original form.
- HashSetBroadcastMode: This applies a projection to the input rows, deduplicates these rows and broadcasts the resulting `Set`.
- HashedRelationBroadcastMode: This transforms the input rows into a `HashedRelation`, and broadcasts this index.
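
The three modes can be pictured as three ways of turning a list of rows into the value that gets broadcast. The Python sketch below is a conceptual model with hypothetical function names; rows are plain tuples and the "relation" is an ordinary dictionary, not Spark's `HashedRelation`.

```python
def identity_mode(rows):
    """Broadcast the rows unchanged."""
    return list(rows)


def hash_set_mode(rows, project):
    """Project each row, then deduplicate into a set."""
    return {project(row) for row in rows}


def hashed_relation_mode(rows, key):
    """Index rows by join key so each probe is a dictionary lookup."""
    index = {}
    for row in rows:
        index.setdefault(key(row), []).append(row)
    return index
```

Separating the transformation from the join operator means the same broadcast index can be described once and requested by whichever join needs it.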

To match this distribution we implement a ```BroadcastExchange``` operator which will perform the broadcast for us, and have ```EnsureRequirements``` plan this operator. The old Exchange operator has been renamed to ShuffleExchange in order to clearly separate shuffled and broadcast exchanges. Finally, the classes in Exchange.scala have been moved to a dedicated package.

cc rxin davies

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #11083 from hvanhovell/SPARK-13136.
2016-02-21 12:32:31 -08:00
Reynold Xin af441ddbd1 [SPARK-13306][SQL] Addendum to uncorrelated scalar subquery
## What changes were proposed in this pull request?
This pull request fixes some minor issues (documentation, test flakiness, test organization) with #11190, which was merged earlier tonight.

## How was this patch tested?
unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #11285 from rxin/subquery.
2016-02-21 12:27:02 -08:00
Reynold Xin 0947f0989b [SPARK-13420][SQL] Rename Subquery logical plan to SubqueryAlias
## What changes were proposed in this pull request?
This patch renames logical.Subquery to logical.SubqueryAlias, which is a more appropriate name for this operator (versus subqueries as expressions).

## How was this patch tested?
Unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #11288 from rxin/SPARK-13420.
2016-02-21 11:31:46 -08:00
Luciano Resende 1a340da8d7 [SPARK-13248][STREAMING] Remove deprecated Streaming APIs.
Remove deprecated Streaming APIs and adjust sample applications.

Author: Luciano Resende <lresende@apache.org>

Closes #11139 from lresende/streaming-deprecated-apis.
2016-02-21 16:27:56 +00:00
Cheng Lian d9efe63ecd [SPARK-12799] Simplify various string output for expressions
This PR introduces several major changes:

1. Replacing `Expression.prettyString` with `Expression.sql`

   The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.

1. Using a SQL-like representation as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)

   Before, we were using `prettyString` as column names when possible, and sometimes the result column names can be weird.  Here are several examples:

   Expression         | `prettyString` | `sql`      | Note
   ------------------ | -------------- | ---------- | ---------------
   `a && b`           | `a && b`       | `a AND b`  |
   `a.getField("f")`  | `a[f]`         | `a.f`      | `a` is a struct

1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)

   `NonSQLExpression.sql` may return an arbitrary user-facing string representation of the expression.
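
The two examples in the table above can be modeled with a tiny expression tree in Python. The classes are hypothetical and only mirror the idea of having separate debug and SQL renderings; they are not Catalyst's expression classes.

```python
class Attr:
    """A named column reference."""

    def __init__(self, name):
        self.name = name

    def pretty_string(self):
        return self.name

    def sql(self):
        return self.name


class And:
    """Logical conjunction of two expressions."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    def pretty_string(self):
        return f"{self.left.pretty_string()} && {self.right.pretty_string()}"

    def sql(self):
        return f"{self.left.sql()} AND {self.right.sql()}"


class GetField:
    """Access to a field of a struct-typed expression."""

    def __init__(self, child, field):
        self.child, self.field = child, field

    def pretty_string(self):
        return f"{self.child.pretty_string()}[{self.field}]"

    def sql(self):
        return f"{self.child.sql()}.{self.field}"
```

Keeping `sql()` separate from the debug rendering lets column names follow SQL syntax while the developer-facing string stays free to change.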

Author: Cheng Lian <lian@databricks.com>

Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
2016-02-21 22:53:15 +08:00
Zheng RuiFeng d806ed3436 [SPARK-13416][GraphX] Add positive check for option 'numIter' in StronglyConnectedComponents
JIRA: https://issues.apache.org/jira/browse/SPARK-13416

## What changes were proposed in this pull request?

The output of StronglyConnectedComponents with a non-positive numIter may make no sense, so I just add a require check for it.

## How was this patch tested?

Unit tests passed.

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11284 from zhengruifeng/scccheck.
2016-02-21 00:53:15 -08:00