Commit graph

2600 commits

Author SHA1 Message Date
gatorsmile 8f0a3d5bcb [SPARK-15330][SQL] Implement Reset Command
#### What changes were proposed in this pull request?
Like `Set` Command in Hive, `Reset` is also supported by Hive. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli

Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-3202

This PR is to implement such a command for resetting the SQL-related configuration to the default values. One of the use case shown in HIVE-3202 is listed below:

> For the purpose of optimization we set various configs per query. It's worthy but all those configs should be reset every time for next query.

#### How was this patch tested?
Added a test case.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #13121 from gatorsmile/resetCommand.
2016-05-21 20:07:34 -07:00
Reynold Xin 201a51f366 [SPARK-15452][SQL] Mark aggregator API as experimental
## What changes were proposed in this pull request?
The Aggregator API was introduced in 2.0 for Dataset. All typed Dataset APIs should still be marked as experimental in 2.0.

## How was this patch tested?
N/A - annotation only change.

Author: Reynold Xin <rxin@databricks.com>

Closes #13226 from rxin/SPARK-15452.
2016-05-21 12:46:25 -07:00
Dilip Biswal 5e1ee28984 [SPARK-15114][SQL] Column name generated by typed aggregate is super verbose
## What changes were proposed in this pull request?

Generate a shorter default alias for `AggregateExpression `, In this PR, aggregate function name along with a index is used for generating the alias name.

```SQL
val ds = Seq(1, 3, 2, 5).toDS()
ds.select(typed.sum((i: Int) => i), typed.avg((i: Int) => i)).show()
```

Output before change.
```SQL
+-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
|typedsumdouble(unresolveddeserializer(upcast(input[0, int], IntegerType, - root class: "scala.Int"), value#1), upcast(value))|typedaverage(unresolveddeserializer(upcast(input[0, int], IntegerType, - root class: "scala.Int"), value#1), newInstance(class scala.Tuple2))|
+-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                         11.0|                                                                                                                                         2.75|
+-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
```
Output after change:
```SQL
+-----------------+---------------+
|typedsumdouble_c1|typedaverage_c2|
+-----------------+---------------+
|             11.0|           2.75|
+-----------------+---------------+
```

Note: There is one test in ParquetSuites.scala which shows that that the system picked alias
name is not usable and is rejected.  [test](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala#L672-#L687)
## How was this patch tested?

A new test was added in DataSetAggregatorSuite.

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #13045 from dilipbiswal/spark-15114.
2016-05-21 08:36:08 -07:00
Zheng RuiFeng 127bf1bb07 [SPARK-15031][EXAMPLE] Use SparkSession in examples
## What changes were proposed in this pull request?
Use `SparkSession` according to [SPARK-15031](https://issues.apache.org/jira/browse/SPARK-15031)

`MLLLIB` is not recommended to use now, so examples in `MLLIB` are ignored in this PR.
`StreamingContext` can not be directly obtained from `SparkSession`, so example in `Streaming` are ignored too.

cc andrewor14

## How was this patch tested?
manual tests with spark-submit

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13164 from zhengruifeng/use_sparksession_ii.
2016-05-20 16:40:33 -07:00
Sameer Agarwal a78d6ce376 [SPARK-15078] [SQL] Add all TPCDS 1.4 benchmark queries for SparkSQL
## What changes were proposed in this pull request?

Now that SparkSQL supports all TPC-DS queries, this patch adds all 99 benchmark queries inside SparkSQL.

## How was this patch tested?

Benchmark only

Author: Sameer Agarwal <sameer@databricks.com>

Closes #13188 from sameeragarwal/tpcds-all.
2016-05-20 15:19:28 -07:00
Reynold Xin dcac8e6f49 [SPARK-15454][SQL] Filter out files starting with _
## What changes were proposed in this pull request?
Many other systems (e.g. Impala) uses _xxx as staging, and Spark should not be reading those files.

## How was this patch tested?
Added a unit test case.

Author: Reynold Xin <rxin@databricks.com>

Closes #13227 from rxin/SPARK-15454.
2016-05-20 14:49:54 -07:00
Davies Liu 0e70fd61b4 [SPARK-15438][SQL] improve explain of whole stage codegen
## What changes were proposed in this pull request?

Currently, the explain of a query with whole-stage codegen looks like this
```
>>> df = sqlCtx.range(1000);df2 = sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 'id').explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [id#1L]
:     +- BroadcastHashJoin [id#1L], [id#4L], Inner, BuildRight, None
:        :- Range 0, 1, 4, 1000, [id#1L]
:        +- INPUT
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
   +- WholeStageCodegen
      :  +- Range 0, 1, 4, 1000, [id#4L]
```

The problem is that the plan looks much different than logical plan, make us hard to understand the plan (especially when the logical plan is not showed together).

This PR will change it to:

```
>>> df = sqlCtx.range(1000);df2 = sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 'id').explain()
== Physical Plan ==
*Project [id#0L]
+- *BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight, None
   :- *Range 0, 1, 4, 1000, [id#0L]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
      +- *Range 0, 1, 4, 1000, [id#3L]
```

The `*`before the plan means that it's part of whole-stage codegen, it's easy to understand.

## How was this patch tested?

Manually ran some queries and check the explain.

Author: Davies Liu <davies@databricks.com>

Closes #13204 from davies/explain_codegen.
2016-05-20 13:21:53 -07:00
Michael Armbrust 2ba3ff0449 [SPARK-10216][SQL] Revert "[] Avoid creating empty files during overwrit…
This reverts commit 8d05a7a from #12855, which seems to have caused regressions when working with empty DataFrames.

Author: Michael Armbrust <michael@databricks.com>

Closes #13181 from marmbrus/revert12855.
2016-05-20 13:00:29 -07:00
Kousuke Saruta 22947cd021 [SPARK-15165] [SPARK-15205] [SQL] Introduce place holder for comments in generated code
## What changes were proposed in this pull request?

This PR introduce place holder for comment in generated code and the purpose  is same for #12939 but much safer.

Generated code to be compiled doesn't include actual comments but includes place holder instead.

Place holders in generated code will be replaced with actual comments only at the time of  logging.

Also, this PR can resolve SPARK-15205.

## How was this patch tested?

Existing tests.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #12979 from sarutak/SPARK-15205.
2016-05-20 10:56:35 -07:00
Davies Liu 5a25cd4ff3 [HOTFIX] disable stress test 2016-05-20 10:44:26 -07:00
Reynold Xin e8adc552df [SPARK-15435][SQL] Append Command to all commands
## What changes were proposed in this pull request?
We started this convention to append Command suffix to all SQL commands. However, not all commands follow that convention. This patch adds Command suffix to all RunnableCommands.

## How was this patch tested?
Updated test cases to reflect the renames.

Author: Reynold Xin <rxin@databricks.com>

Closes #13215 from rxin/SPARK-15435.
2016-05-20 09:36:14 -07:00
Andrew Or 2573750192 [SPARK-15421][SQL] Validate DDL property values
## What changes were proposed in this pull request?

When we parse DDLs involving table or database properties, we need to validate the values.
E.g. if we alter a database's property without providing a value:
```
ALTER DATABASE my_db SET DBPROPERTIES('some_key')
```
Then we'll ignore it with Hive, but override the property with the in-memory catalog. Inconsistencies like these arise because we don't validate the property values.

In such cases, we should throw exceptions instead.

## How was this patch tested?

`DDLCommandSuite`

Author: Andrew Or <andrew@databricks.com>

Closes #13205 from andrewor14/ddl-prop-values.
2016-05-19 23:43:01 -07:00
gatorsmile 39fd469078 [SPARK-15367][SQL] Add refreshTable back
#### What changes were proposed in this pull request?
`refreshTable` was a method in `HiveContext`. It was deleted accidentally while we were migrating the APIs. This PR is to add it back to `HiveContext`.

In addition, in `SparkSession`, we put it under the catalog namespace (`SparkSession.catalog.refreshTable`).

#### How was this patch tested?
Changed the existing test cases to use the function `refreshTable`. Also added a test case for refreshTable in `hivecontext-compatibility`

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13156 from gatorsmile/refreshTable.
2016-05-20 14:38:25 +08:00
Lianhui Wang 09a00510c4 [SPARK-15335][SQL] Implement TRUNCATE TABLE Command
## What changes were proposed in this pull request?

Like TRUNCATE TABLE Command in Hive, TRUNCATE TABLE is also supported by Hive. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-446
This PR is to implement such a command for truncate table excluded column truncation(HIVE-4005).

## How was this patch tested?
Added a test case.

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #13170 from lianhuiwang/truncate.
2016-05-19 23:03:59 -07:00
Takuya UESHIN d5e1c5acde [SPARK-15313][SQL] EmbedSerializerInFilter rule should keep exprIds of output of surrounded SerializeFromObject.
## What changes were proposed in this pull request?

The following code:

```
val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS()
ds.filter(_._1 == "b").select(expr("_1").as[String]).foreach(println(_))
```

throws an Exception:

```
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _1#420
 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)

...
 Cause: java.lang.RuntimeException: Couldn't find _1#420 in [_1#416,_2#417]
 at scala.sys.package$.error(package.scala:27)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88)
 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
...
```

This is because `EmbedSerializerInFilter` rule drops the `exprId`s of output of surrounded `SerializeFromObject`.

The analyzed and optimized plans of the above example are as follows:

```
== Analyzed Logical Plan ==
_1: string
Project [_1#420]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421]
   +- Filter <function1>.apply
      +- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2
         +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]

== Optimized Logical Plan ==
!Project [_1#420]
+- Filter <function1>.apply
   +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
```

This PR fixes `EmbedSerializerInFilter` rule to keep `exprId`s of output of surrounded `SerializeFromObject`.

The plans after this patch are as follows:

```
== Analyzed Logical Plan ==
_1: string
Project [_1#420]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421]
   +- Filter <function1>.apply
      +- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2
         +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]

== Optimized Logical Plan ==
Project [_1#416]
+- Filter <function1>.apply
   +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
```

## How was this patch tested?

Existing tests and I added a test to check if `filter and then select` works.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #13096 from ueshin/issues/SPARK-15313.
2016-05-19 22:55:44 -07:00
Reynold Xin f2ee0ed4b7 [SPARK-15075][SPARK-15345][SQL] Clean up SparkSession builder and propagate config options to existing sessions if specified
## What changes were proposed in this pull request?
Currently SparkSession.Builder use SQLContext.getOrCreate. It should probably the the other way around, i.e. all the core logic goes in SparkSession, and SQLContext just calls that. This patch does that.

This patch also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession.

## How was this patch tested?
Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches.

Author: Reynold Xin <rxin@databricks.com>

Closes #13200 from rxin/SPARK-15075.
2016-05-19 21:53:26 -07:00
Kevin Yu 17591d90e6 [SPARK-11827][SQL] Adding java.math.BigInteger support in Java type inference for POJOs and Java collections
Hello : Can you help check this PR? I am adding support for the java.math.BigInteger for java bean code path. I saw internally spark is converting the BigInteger to BigDecimal in ColumnType.scala and CatalystRowConverter.scala. I use the similar way and convert the BigInteger to the BigDecimal. .

Author: Kevin Yu <qyu@us.ibm.com>

Closes #10125 from kevinyu98/working_on_spark-11827.
2016-05-20 12:41:14 +08:00
Shixiong Zhu 16ba71aba4 [SPARK-15416][SQL] Display a better message for not finding classes removed in Spark 2.0
## What changes were proposed in this pull request?

If finding `NoClassDefFoundError` or `ClassNotFoundException`, check if the class name is removed in Spark 2.0. If so, the user must be using an incompatible library and we can provide a better message.

## How was this patch tested?

1. Run `bin/pyspark --packages com.databricks:spark-avro_2.10:2.0.1`
2. type `sqlContext.read.format("com.databricks.spark.avro").load("src/test/resources/episodes.avro")`.

It will show `java.lang.ClassNotFoundException: org.apache.spark.sql.sources.HadoopFsRelationProvider is removed in Spark 2.0. Please check if your library is compatible with Spark 2.0`

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13201 from zsxwing/better-message.
2016-05-19 18:31:05 -07:00
jerryshao dcf407de67 [SPARK-15375][SQL][STREAMING] Add ConsoleSink to structure streaming
## What changes were proposed in this pull request?

Add ConsoleSink to structure streaming, user could use it to display dataframes on the console (useful for debugging and demostrating), similar to the functionality of `DStream#print`, to use it:

```
    val query = result.write
      .format("console")
      .trigger(ProcessingTime("2 seconds"))
      .startStream()
```

## How was this patch tested?

local verified.

Not sure it is suitable to add into structure streaming, please review and help to comment, thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #13162 from jerryshao/SPARK-15375.
2016-05-19 17:42:59 -07:00
Davies Liu 5ccecc078a [SPARK-15392][SQL] fix default value of size estimation of logical plan
## What changes were proposed in this pull request?

We use autoBroadcastJoinThreshold + 1L as the default value of size estimation, that is not good in 2.0, because we will calculate the size based on size of schema, then the estimation could be less than autoBroadcastJoinThreshold if you have an SELECT on top of an DataFrame created from RDD.

This PR change the default value to Long.MaxValue.

## How was this patch tested?

Added regression tests.

Author: Davies Liu <davies@databricks.com>

Closes #13183 from davies/fix_default_size.
2016-05-19 12:12:42 -07:00
Shixiong Zhu 4e3cb7a5d9 [SPARK-15317][CORE] Don't store accumulators for every task in listeners
## What changes were proposed in this pull request?

In general, the Web UI doesn't need to store the Accumulator/AccumulableInfo for every task. It only needs the Accumulator values.

In this PR, it creates new UIData classes to store the necessary fields and make `JobProgressListener` store only these new classes, so that `JobProgressListener` won't store Accumulator/AccumulableInfo and the size of `JobProgressListener` becomes pretty small. I also eliminates `AccumulableInfo` from `SQLListener` so that we don't keep any references for those unused `AccumulableInfo`s.

## How was this patch tested?

I ran two tests reported in JIRA locally:

The first one is:
```
val data = spark.range(0, 10000, 1, 10000)
data.cache().count()
```
The retained size of JobProgressListener decreases from 60.7M to 6.9M.

The second one is:
```
import org.apache.spark.ml.CC
import org.apache.spark.sql.SQLContext
val sqlContext = SQLContext.getOrCreate(sc)
CC.runTest(sqlContext)
```

This test won't cause OOM after applying this patch.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13153 from zsxwing/memory.
2016-05-19 12:05:17 -07:00
Cheng Lian 6ac1c3a040 [SPARK-14346][SQL] Lists unsupported Hive features in SHOW CREATE TABLE output
## What changes were proposed in this pull request?

This PR is a follow-up of #13079. It replaces `hasUnsupportedFeatures: Boolean` in `CatalogTable` with `unsupportedFeatures: Seq[String]`, which contains unsupported Hive features of the underlying Hive table. In this way, we can accurately report all unsupported Hive features in the exception message.

## How was this patch tested?

Updated existing test case to check exception message.

Author: Cheng Lian <lian@databricks.com>

Closes #13173 from liancheng/spark-14346-follow-up.
2016-05-19 12:02:41 -07:00
hyukjinkwon f5065abf49 [SPARK-15322][SQL][FOLLOW-UP] Update deprecated accumulator usage into accumulatorV2
## What changes were proposed in this pull request?

This PR corrects another case that uses deprecated `accumulableCollection` to use `listAccumulator`, which seems the previous PR missed.

Since `ArrayBuffer[InternalRow].asJava` is `java.util.List[InternalRow]`, it seems ok to replace the usage.

## How was this patch tested?

Related existing tests `InMemoryColumnarQuerySuite` and `CachedTableSuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13187 from HyukjinKwon/SPARK-15322.
2016-05-19 11:54:50 -07:00
Davies Liu 9308bf1192 [SPARK-15390] fix broadcast with 100 millions rows
## What changes were proposed in this pull request?

When broadcast a table with more than 100 millions rows (should not ideally), the size of needed memory will overflow.

This PR fix the overflow by converting it to Long when calculating the size of memory.

Also add more checking in broadcast to show reasonable messages.

## How was this patch tested?

Add test.

Author: Davies Liu <davies@databricks.com>

Closes #13182 from davies/fix_broadcast.
2016-05-19 11:45:18 -07:00
Dongjoon Hyun 5907ebfc11 [SPARK-14939][SQL] Add FoldablePropagation optimizer
## What changes were proposed in this pull request?

This PR aims to add new **FoldablePropagation** optimizer that propagates foldable expressions by replacing all attributes with the aliases of original foldable expression. Other optimizations will take advantage of the propagated foldable expressions: e.g. `EliminateSorts` optimizer now can handle the following Case 2 and 3. (Case 1 is the previous implementation.)

1. Literals and foldable expression, e.g. "ORDER BY 1.0, 'abc', Now()"
2. Foldable ordinals, e.g. "SELECT 1.0, 'abc', Now() ORDER BY 1, 2, 3"
3. Foldable aliases, e.g. "SELECT 1.0 x, 'abc' y, Now() z ORDER BY x, y, z"

This PR has been generalized based on cloud-fan 's key ideas many times; he should be credited for the work he did.

**Before**
```
scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain
== Physical Plan ==
WholeStageCodegen
:  +- Sort [1.0#5 ASC,x#0 ASC], true, 0
:     +- INPUT
+- Exchange rangepartitioning(1.0#5 ASC, x#0 ASC, 200), None
   +- WholeStageCodegen
      :  +- Project [1.0 AS 1.0#5,1461873043577000 AS x#0]
      :     +- INPUT
      +- Scan OneRowRelation[]
```

**After**
```
scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain
== Physical Plan ==
WholeStageCodegen
:  +- Project [1.0 AS 1.0#5,1461873079484000 AS x#0]
:     +- INPUT
+- Scan OneRowRelation[]
```

## How was this patch tested?

Pass the Jenkins tests including a new test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12719 from dongjoon-hyun/SPARK-14939.
2016-05-19 15:57:44 +08:00
Wenchen Fan 661c21049b [SPARK-15381] [SQL] physical object operator should define reference correctly
## What changes were proposed in this pull request?

Whole Stage Codegen depends on `SparkPlan.reference` to do some optimization. For physical object operators, they should be consistent with their logical version and set the `reference` correctly.

## How was this patch tested?

new test in DatasetSuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13167 from cloud-fan/bug.
2016-05-18 21:43:07 -07:00
Reynold Xin 4987f39ac7 [SPARK-14463][SQL] Document the semantics for read.text
## What changes were proposed in this pull request?
This patch is a follow-up to https://github.com/apache/spark/pull/13104 and adds documentation to clarify the semantics of read.text with respect to partitioning.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #13184 from rxin/SPARK-14463.
2016-05-18 19:16:28 -07:00
gatorsmile 9c2a376e41 [SPARK-15297][SQL] Fix Set -V Command
#### What changes were proposed in this pull request?
The command `SET -v` always outputs the default values even if we set the parameter. This behavior is incorrect. Instead, if users override it, we should output the user-specified value.

In addition, the output schema of `SET -v` is wrong. We should use the column `value` instead of `default` for the parameter value.

This PR is to fix the above two issues.

#### How was this patch tested?
Added a test case.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13081 from gatorsmile/setVcommand.
2016-05-19 10:05:53 +08:00
Wenchen Fan ebfe3a1f2c [SPARK-15192][SQL] null check for SparkSession.createDataFrame
## What changes were proposed in this pull request?

This PR adds null check in `SparkSession.createDataFrame`, so that we can make sure the passed in rows matches the given schema.

## How was this patch tested?

new tests in `DatasetSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13008 from cloud-fan/row-encoder.
2016-05-18 18:06:38 -07:00
Jurriaan Pruis 32be51fba4 [SPARK-15323][SPARK-14463][SQL] Fix reading of partitioned format=text datasets
https://issues.apache.org/jira/browse/SPARK-15323

I was using partitioned text datasets in Spark 1.6.1 but it broke in Spark 2.0.0.

It would be logical if you could also write those,
but not entirely sure how to solve this with the new DataSet implementation.

Also it doesn't work using `sqlContext.read.text`, since that method returns a `DataSet[String]`.
See https://issues.apache.org/jira/browse/SPARK-14463 for that issue.

Author: Jurriaan Pruis <email@jurriaanpruis.nl>

Closes #13104 from jurriaan/fix-partitioned-text-reads.
2016-05-18 16:15:09 -07:00
Davies Liu 84b23453dd Revert "[SPARK-15392][SQL] fix default value of size estimation of logical plan"
This reverts commit fc29b896da.
2016-05-18 16:02:52 -07:00
Davies Liu fc29b896da [SPARK-15392][SQL] fix default value of size estimation of logical plan
## What changes were proposed in this pull request?

We use  autoBroadcastJoinThreshold + 1L as the default value of size estimation, that is not good in 2.0, because we will calculate the size based on size of schema, then the estimation could be less than autoBroadcastJoinThreshold if you have an SELECT on top of an DataFrame created from RDD.

This PR change the default value to Long.MaxValue.

## How was this patch tested?

Added regression tests.

Author: Davies Liu <davies@databricks.com>

Closes #13179 from davies/fix_default_size.
2016-05-18 15:45:59 -07:00
Davies Liu 8fb1d1c7f3 [SPARK-15357] Cooperative spilling should check consumer memory mode
## What changes were proposed in this pull request?

Since we support forced spilling for Spillable, which only works in OnHeap mode, different from other SQL operators (could be OnHeap or OffHeap), we should considering the mode of consumer before calling trigger forced spilling.

## How was this patch tested?

Add new test.

Author: Davies Liu <davies@databricks.com>

Closes #13151 from davies/fix_mode.
2016-05-18 09:44:21 -07:00
WeichenXu 2f9047b5eb [SPARK-15322][MLLIB][CORE][SQL] update deprecate accumulator usage into accumulatorV2 in spark project
## What changes were proposed in this pull request?

I use Intellj-IDEA to search usage of deprecate SparkContext.accumulator in the whole spark project, and update the code.(except those test code for accumulator method itself)

## How was this patch tested?

Exisiting unit tests

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13112 from WeichenXu123/update_accuV2_in_mllib.
2016-05-18 11:48:46 +01:00
Davies Liu 33814f887a [SPARK-15307][SQL] speed up listing files for data source
## What changes were proposed in this pull request?

Currently, listing files is very slow if there is thousands files, especially on local file system, because:
1) FileStatus.getPermission() is very slow on local file system, which is launch a subprocess and parse the stdout.
2) Create an JobConf is very expensive (ClassUtil.findContainingJar() is slow).

This PR improve these by:
1) Use another constructor of LocatedFileStatus to avoid calling FileStatus.getPermission, the permissions are not used for data sources.
2) Only create an JobConf once within one task.

## How was this patch tested?

Manually tests on a partitioned table with 1828 partitions, decrease the time to load the table from 22 seconds to 1.6 seconds (Most of time are spent in merging schema now).

Author: Davies Liu <davies@databricks.com>

Closes #13094 from davies/listing.
2016-05-18 18:46:57 +08:00
Sean Zhong 25b315e6ca [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable
## What changes were proposed in this pull request?

Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.

## How was this patch tested?

This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13098 from clockfly/spark-15171-remove-deprecation.
2016-05-18 09:01:59 +08:00
Cheng Lian b674e67c22 [SPARK-14346][SQL] Native SHOW CREATE TABLE for Hive tables/views
## What changes were proposed in this pull request?

This is a follow-up of #12781. It adds native `SHOW CREATE TABLE` support for Hive tables and views. A new field `hasUnsupportedFeatures` is added to `CatalogTable` to indicate whether all table metadata retrieved from the concrete underlying external catalog (i.e. Hive metastore in this case) can be mapped to fields in `CatalogTable`. This flag is useful when the target Hive table contains structures that can't be handled by Spark SQL, e.g., skewed columns and storage handler, etc..

## How was this patch tested?

New test cases are added in `ShowCreateTableSuite` to do round-trip tests.

Author: Cheng Lian <lian@databricks.com>

Closes #13079 from liancheng/spark-14346-show-create-table-for-hive-tables.
2016-05-17 15:56:44 -07:00
Shixiong Zhu 8e8bc9f957 [SPARK-11735][CORE][SQL] Add a check in the constructor of SQLContext/SparkSession to make sure its SparkContext is not stopped
## What changes were proposed in this pull request?

Add a check in the constructor of SQLContext/SparkSession to make sure its SparkContext is not stopped.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13154 from zsxwing/check-spark-context-stop.
2016-05-17 14:57:21 -07:00
Dongjoon Hyun 9f176dd391 [MINOR][DOCS] Replace remaining 'sqlContext' in ScalaDoc/JavaDoc.
## What changes were proposed in this pull request?

According to the recent change, this PR replaces all the remaining `sqlContext` usage with `spark` in ScalaDoc/JavaDoc (.scala/.java files) except `SQLContext.scala`, `SparkPlan.scala', and `DatasetHolder.scala`.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13125 from dongjoon-hyun/minor_doc_sparksession.
2016-05-17 20:50:22 +02:00
hyukjinkwon 8d05a7a98b [SPARK-10216][SQL] Avoid creating empty files during overwriting with group by query
## What changes were proposed in this pull request?

Currently, `INSERT INTO` with `GROUP BY` query tries to make at least 200 files (default value of `spark.sql.shuffle.partition`), which results in lots of empty files.

This PR makes it avoid creating empty files during overwriting into Hive table and in internal data sources  with group by query.

This checks whether the given partition has data in it or not and creates/writes file only when it actually has data.

## How was this patch tested?

Unittests in `InsertIntoHiveTableSuite` and `HadoopFsRelationTest`.

Closes #8411

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Keuntae Park <sirpkt@apache.org>

Closes #12855 from HyukjinKwon/pr/8411.
2016-05-17 11:18:51 -07:00
Wenchen Fan 20a89478e1 [SPARK-14346][SQL][FOLLOW-UP] add tests for CREAT TABLE USING with partition and bucket
## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/12781 introduced PARTITIONED BY, CLUSTERED BY, and SORTED BY keywords to CREATE TABLE USING. This PR adds tests to make sure those keywords are handled correctly.

This PR also fixes a mistake that we should create non-hive-compatible table if partition or bucket info exists.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13144 from cloud-fan/add-test.
2016-05-17 10:12:51 -07:00
Kousuke Saruta c0c3ec3547 [SPARK-15165] [SQL] Codegen can break because toCommentSafeString is not actually safe
## What changes were proposed in this pull request?

toCommentSafeString method replaces "\u" with "\\\\u" to avoid codegen breaking.
But if the even number of "\" is put before "u", like "\\\\u", in the string literal in the query, codegen can break.

Following code causes compilation error.

```
val df = Seq(...).toDF
df.select("'\\\\\\\\u002A/'").show
```

The reason of the compilation error is because "\\\\\\\\\\\\\\\\u002A/" is translated into "*/" (the end of comment).

Due to this unsafety, arbitrary code can be injected like as follows.

```
val df = Seq(...).toDF
// Inject "System.exit(1)"
df.select("'\\\\\\\\u002A/{System.exit(1);}/*'").show
```

## How was this patch tested?

Added new test cases.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Author: sarutak <sarutak@oss.nttdata.co.jp>

Closes #12939 from sarutak/SPARK-15165.
2016-05-17 10:07:01 -07:00
Liwei Lin 95f4fbae52 [SPARK-14942][SQL][STREAMING] Reduce delay between batch construction and execution
## Problem

Currently in `StreamExecution`, [we first run the batch, then construct the next](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L165):
```scala
if (dataAvailable) runBatch()
constructNextBatch()
```

This is good when we run batches ASAP, where data would get processed in the **very next batch**:

![1](https://cloud.githubusercontent.com/assets/15843379/14779964/2786e698-0b0d-11e6-9d2c-bb41513488b2.png)

However, when we run batches at trigger like `ProcessTime("1 minute")`, data - such as _y_ below - may not get processed in the very next batch i.e. _batch 1_, but in _batch 2_:

![2](https://cloud.githubusercontent.com/assets/15843379/14779818/6f3bb064-0b0c-11e6-9f16-c1ce4897186b.png)

## What changes were proposed in this pull request?

This patch reverses the order of `constructNextBatch()` and `runBatch()`. After this patch, data would get processed in the **very next batch**, i.e. _batch 1_:

![3](https://cloud.githubusercontent.com/assets/15843379/14779816/6f36ee62-0b0c-11e6-9e53-bc8397fade18.png)

In addition, this patch alters when we do `currentBatchId += 1`: let's do that when the processing of the current batch's data is completed, so we won't bother passing `currentBatchId + 1` or  `currentBatchId - 1` to states or sinks.

## How was this patch tested?

New added test case. Also this should be covered by existing test suits, e.g. stress tests and others.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12725 from lw-lin/construct-before-run-3.
2016-05-16 12:59:55 -07:00
Sean Zhong 4a5ee1954a [SPARK-15253][SQL] Support old table schema config key "spark.sql.sources.schema" for DESCRIBE TABLE
## What changes were proposed in this pull request?

"DESCRIBE table" is broken when table schema is stored at key "spark.sql.sources.schema".

Originally, we used spark.sql.sources.schema to store the schema of a data source table.
After SPARK-6024, we removed this flag. Although we are not using spark.sql.sources.schema any more, we need to still support it.

## How was this patch tested?

Unit test.

When using spark2.0 to load a table generated by spark 1.2.
Before change:
`DESCRIBE table` => Schema of this table is inferred at runtime,,

After change:
`DESCRIBE table` => correct output.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13073 from clockfly/spark-15253.
2016-05-16 10:41:20 +08:00
Zheng RuiFeng c7efc56c7b [MINOR] Fix Typos
## What changes were proposed in this pull request?
1,Rename matrix args in BreezeUtil to upper to match the doc
2,Fix several typos in ML and SQL

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13078 from zhengruifeng/fix_ann.
2016-05-15 15:59:49 +01:00
Tejas Patil 4210e2a6b7 [TRIVIAL] Add () to SparkSession's builder function
## What changes were proposed in this pull request?

Was trying out `SparkSession` for the first time and the given class doc (when copied as is) did not work over Spark shell:

```
scala> SparkSession.builder().master("local").appName("Word Count").getOrCreate()
<console>:27: error: org.apache.spark.sql.SparkSession.Builder does not take parameters
       SparkSession.builder().master("local").appName("Word Count").getOrCreate()
```

Adding () to the builder method in SparkSession.

## How was this patch tested?

```
scala> SparkSession.builder().master("local").appName("Word Count").getOrCreate()
res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession65c17e38

scala> SparkSession.builder.master("local").appName("Word Count").getOrCreate()
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession65c17e38
```

Author: Tejas Patil <tejasp@fb.com>

Closes #13086 from tejasapatil/doc_correction.
2016-05-13 18:10:22 -07:00
hyukjinkwon 3ded5bc4db [SPARK-15267][SQL] Refactor options for JDBC and ORC data sources and change default compression for ORC
## What changes were proposed in this pull request?

Currently, Parquet, JSON and CSV data sources have a class for thier options, (`ParquetOptions`, `JSONOptions` and `CSVOptions`).

It is convenient to manage options for sources to gather options into a class. Currently, `JDBC`, `Text`, `libsvm` and `ORC` datasources do not have this class. This might be nicer if these options are in a unified format so that options can be added and

This PR refactors the options in Spark internal data sources adding new classes, `OrcOptions`, `TextOptions`, `JDBCOptions` and `LibSVMOptions`.

Also, this PR change the default compression codec for ORC from `NONE` to `SNAPPY`.

## How was this patch tested?

Existing tests should cover this for refactoring and unittests in `OrcHadoopFsRelationSuite` for changing the default compression codec for ORC.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13048 from HyukjinKwon/SPARK-15267.
2016-05-13 09:04:37 -07:00
Reynold Xin e1dc853737 [SPARK-15310][SQL] Rename HiveTypeCoercion -> TypeCoercion
## What changes were proposed in this pull request?
We originally designed the type coercion rules to match Hive, but over time we have diverged. It does not make sense to call it HiveTypeCoercion anymore. This patch renames it TypeCoercion.

## How was this patch tested?
Updated unit tests to reflect the rename.

Author: Reynold Xin <rxin@databricks.com>

Closes #13091 from rxin/SPARK-15310.
2016-05-13 00:15:39 -07:00
hyukjinkwon 51841d77d9 [SPARK-13866] [SQL] Handle decimal type in CSV inference at CSV data source.
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13866

This PR adds the support to infer `DecimalType`.
Here are the rules between `IntegerType`, `LongType` and `DecimalType`.

#### Infering Types

1. `IntegerType` and then `LongType`are tried first.

  ```scala
  Int.MaxValue => IntegerType
  Long.MaxValue => LongType
  ```

2. If it fails, try `DecimalType`.

  ```scala
  (Long.MaxValue + 1) => DecimalType(20, 0)
  ```
  This does not try to infer this as `DecimalType` when scale is less than 0.

3. if it fails, try `DoubleType`
  ```scala
  0.1 => DoubleType // This is failed to be inferred as `DecimalType` because it has the scale, 1.
  ```

#### Compatible Types (Merging Types)

For merging types, this is the same with JSON data source. If `DecimalType` is not capable, then it becomes `DoubleType`

## How was this patch tested?

Unit tests were used and `./dev/run_tests` for code style test.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #11724 from HyukjinKwon/SPARK-13866.
2016-05-12 22:31:14 -07:00
Reynold Xin eda2800d44 [SPARK-14541][SQL] Support IFNULL, NULLIF, NVL and NVL2
## What changes were proposed in this pull request?
This patch adds support for a few SQL functions to improve compatibility with other databases: IFNULL, NULLIF, NVL and NVL2. In order to do this, this patch introduced a RuntimeReplaceable expression trait that allows replacing an unevaluable expression in the optimizer before evaluation.

Note that the semantics are not completely identical to other databases in esoteric cases.

## How was this patch tested?
Added a new test suite SQLCompatibilityFunctionSuite.

Closes #12373.

Author: Reynold Xin <rxin@databricks.com>

Closes #13084 from rxin/SPARK-14541.
2016-05-12 22:18:39 -07:00
Reynold Xin ba169c3230 [SPARK-15306][SQL] Move object expressions into expressions.objects package
## What changes were proposed in this pull request?
This patch moves all the object related expressions into expressions.objects package, for better code organization.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #13085 from rxin/SPARK-15306.
2016-05-12 21:35:14 -07:00
Herman van Hovell bb1362eb3b [SPARK-10605][SQL] Create native collect_list/collect_set aggregates
## What changes were proposed in this pull request?
We currently use the Hive implementations for the collect_list/collect_set aggregate functions. This has a few major drawbacks: the use of HiveUDAF (which has quite a bit of overhead) and the lack of support for struct datatypes. This PR adds native implementation of these functions to Spark.

The size of the collected list/set may vary, this means we cannot use the fast, Tungsten, aggregation path to perform the aggregation, and that we fallback to the slower sort based path. Another big issue with these operators is that when the size of the collected list/set grows too large, we can start experiencing large GC pauzes and OOMEs.

This `collect*` aggregates implemented in this PR rely on the sort based aggregate path for correctness. They maintain their own internal buffer which holds the rows for one group at a time. The sortbased aggregation path is triggered by disabling `partialAggregation` for these aggregates (which is kinda funny); this technique is also employed in `org.apache.spark.sql.hiveHiveUDAFFunction`.

I have done some performance testing:
```scala
import org.apache.spark.sql.{Dataset, Row}

sql("create function collect_list2 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList'")

val df = range(0, 10000000).select($"id", (rand(213123L) * 100000).cast("int").as("grp"))
df.select(countDistinct($"grp")).show

def benchmark(name: String, plan: Dataset[Row], maxItr: Int = 5): Unit = {
   // Do not measure planning.
   plan1.queryExecution.executedPlan

   // Execute the plan a number of times and average the result.
   val start = System.nanoTime
   var i = 0
   while (i < maxItr) {
     plan.rdd.foreach(row => Unit)
     i += 1
   }
   val time = (System.nanoTime - start) / (maxItr * 1000000L)
   println(s"[$name] $maxItr iterations completed in an average time of $time ms.")
}

val plan1 = df.groupBy($"grp").agg(collect_list($"id"))
val plan2 = df.groupBy($"grp").agg(callUDF("collect_list2", $"id"))

benchmark("Spark collect_list", plan1)
...
> [Spark collect_list] 5 iterations completed in an average time of 3371 ms.

benchmark("Hive collect_list", plan2)
...
> [Hive collect_list] 5 iterations completed in an average time of 9109 ms.
```
Performance is improved by a factor 2-3.

## How was this patch tested?
Added tests to `DataFrameAggregateSuite`.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12874 from hvanhovell/implode.
2016-05-12 13:56:00 -07:00
gatorsmile be617f3d06 [SPARK-14684][SPARK-15277][SQL] Partition Spec Validation in SessionCatalog and Checking Partition Spec Existence Before Dropping
#### What changes were proposed in this pull request?
~~Currently, multiple partitions are allowed to drop by using a single DDL command: Alter Table Drop Partition. However, the internal implementation could break atomicity. That means, we could just drop a subset of qualified partitions, if hitting an exception when dropping one of qualified partitions~~

~~This PR contains the following behavior changes:~~
~~- disallow dropping multiple partitions by a single command ~~
~~- allow users to input predicates in partition specification and issue a nicer error message if the predicate's comparison operator is not `=`.~~
~~- verify the partition spec in SessionCatalog. This can ensure each partition spec in `Drop Partition` does not correspond to multiple partitions.~~

This PR has two major parts:
- Verify the partition spec in SessionCatalog for fixing the following issue:
  ```scala
  sql(s"ALTER TABLE $externalTab DROP PARTITION (ds='2008-04-09', unknownCol='12')")
  ```
  Above example uses an invalid partition spec. Without this PR, we will drop all the partitions. The reason is Hive megastores getPartitions API returns all the partitions if we provide an invalid spec.

- Re-implemented the `dropPartitions` in `HiveClientImpl`. Now, we always check if all the user-specified partition specs exist before attempting to drop the partitions. Previously, we start drop the partition before completing checking the existence of all the partition specs. If any failure happened after we start to drop the partitions, we will log an error message to indicate which partitions have been dropped and which partitions have not been dropped.

#### How was this patch tested?
Modified the existing test cases and added new test cases.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #12801 from gatorsmile/banDropMultiPart.
2016-05-12 11:14:40 -07:00
Liang-Chi Hsieh 470de743ec [SPARK-15094][SPARK-14803][SQL] Remove extra Project added in EliminateSerialization
## What changes were proposed in this pull request?

We will eliminate the pair of `DeserializeToObject` and `SerializeFromObject` in `Optimizer` and add extra `Project`. However, when DeserializeToObject's outputObjectType is ObjectType and its cls can't be processed by unsafe project, it will be failed.

To fix it, we can simply remove the extra `Project` and replace the output attribute of `DeserializeToObject` in another rule.

## How was this patch tested?
`DatasetSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12926 from viirya/fix-eliminate-serialization-projection.
2016-05-12 10:11:12 -07:00
Sean Zhong 33c6eb5218 [SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView
## What changes were proposed in this pull request?

Deprecates registerTempTable and add dataset.createTempView, dataset.createOrReplaceTempView.

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #12945 from clockfly/spark-15171.
2016-05-12 15:51:53 +08:00
Wenchen Fan 46991448aa [SPARK-15160][SQL] support data source table in InMemoryCatalog
## What changes were proposed in this pull request?

This PR adds a new rule to convert `SimpleCatalogRelation` to data source table if its table property contains data source information.

## How was this patch tested?

new test in SQLQuerySuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12935 from cloud-fan/ds-table.
2016-05-11 23:55:42 -07:00
Cheng Lian f036dd7ce7 [SPARK-14346] SHOW CREATE TABLE for data source tables
## What changes were proposed in this pull request?

This PR adds native `SHOW CREATE TABLE` DDL command for data source tables. Support for Hive tables will be added in follow-up PR(s).

To show table creation DDL for data source tables created by CTAS statements, this PR also added partitioning and bucketing support for normal `CREATE TABLE ... USING ...` syntax.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

A new test suite `ShowCreateTableSuite` is added in sql/hive package to test the new feature.

Author: Cheng Lian <lian@databricks.com>

Closes #12781 from liancheng/spark-14346-show-create-table.
2016-05-11 20:44:04 -07:00
Sandeep Singh ff92eb2e80 [SPARK-15080][CORE] Break copyAndReset into copy and reset
## What changes were proposed in this pull request?
Break copyAndReset into two methods copy and reset instead of just one.

## How was this patch tested?
Existing Tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #12936 from techaddict/SPARK-15080.
2016-05-12 11:12:09 +08:00
Bill Chambers 603f4453a1 [SPARK-15264][SPARK-15274][SQL] CSV Reader Error on Blank Column Names
## What changes were proposed in this pull request?

When a CSV begins with:
- `,,`
OR
- `"","",`

meaning that the first column names are either empty or blank strings and `header` is specified to be `true`, then the column name is replaced with `C` + the index number of that given column. For example, if you were to read in the CSV:
```
"","second column"
"hello", "there"
```
Then column names would become `"C0", "second column"`.

This behavior aligns with what currently happens when `header` is specified to be `false` in recent versions of Spark.

### Current Behavior in Spark <=1.6
In Spark <=1.6, a CSV with a blank column name becomes a blank string, `""`, meaning that this column cannot be accessed. However the CSV reads in without issue.

### Current Behavior in Spark 2.0
Spark throws a NullPointerError and will not read in the file.

#### Reproduction in 2.0
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2828750690305044/484361/latest.html

## How was this patch tested?
A new test was added to `CSVSuite` to account for this issue. We then have asserts that test for being able to select both the empty column names as well as the regular column names.

Author: Bill Chambers <bill@databricks.com>
Author: Bill Chambers <wchambers@ischool.berkeley.edu>

Closes #13041 from anabranch/master.
2016-05-11 17:42:13 -07:00
Andrew Or f14c4ba001 [SPARK-15276][SQL] CREATE TABLE with LOCATION should imply EXTERNAL
## What changes were proposed in this pull request?

Before:
```sql
-- uses that location but issues a warning
CREATE TABLE my_tab LOCATION /some/path
-- deletes any existing data in the specified location
DROP TABLE my_tab
```

After:
```sql
-- uses that location but creates an EXTERNAL table instead
CREATE TABLE my_tab LOCATION /some/path
-- does not delete the data at /some/path
DROP TABLE my_tab
```

This patch essentially makes the `EXTERNAL` field optional. This is related to #13032.

## How was this patch tested?

New test in `DDLCommandSuite`.

Author: Andrew Or <andrew@databricks.com>

Closes #13060 from andrewor14/location-implies-external.
2016-05-11 17:29:58 -07:00
Andrew Or 8881765ac7 [SPARK-15257][SQL] Require CREATE EXTERNAL TABLE to specify LOCATION
## What changes were proposed in this pull request?

Before:
```sql
-- uses warehouse dir anyway
CREATE EXTERNAL TABLE my_tab
-- doesn't actually delete the data
DROP TABLE my_tab
```
After:
```sql
-- no location is provided, throws exception
CREATE EXTERNAL TABLE my_tab
-- creates an external table using that location
CREATE EXTERNAL TABLE my_tab LOCATION '/path/to/something'
-- doesn't delete the data, which is expected
DROP TABLE my_tab
```

## How was this patch tested?

New test in `DDLCommandSuite`

Author: Andrew Or <andrew@databricks.com>

Closes #13032 from andrewor14/create-external-table-location.
2016-05-11 15:30:53 -07:00
Tathagata Das 81c68eceba [SPARK-15248][SQL] Make MetastoreFileCatalog consider directories from partition specs of a partitioned metastore table
Table partitions can be added with locations different from default warehouse location of a hive table.
`CREATE TABLE parquetTable (a int) PARTITIONED BY (b int) STORED AS parquet `
`ALTER TABLE parquetTable ADD PARTITION (b=1) LOCATION '/partition'`
Querying such a table throws error as the MetastoreFileCatalog does not list the added partition directory, it only lists the default base location.

```
[info] - SPARK-15248: explicitly added partitions should be readable *** FAILED *** (1 second, 8 milliseconds)
[info]   java.util.NoSuchElementException: key not found: file:/Users/tdas/Projects/Spark/spark2/target/tmp/spark-b39ad224-c5d1-4966-8981-fb45a2066d61/partition
[info]   at scala.collection.MapLike$class.default(MapLike.scala:228)
[info]   at scala.collection.AbstractMap.default(Map.scala:59)
[info]   at scala.collection.MapLike$class.apply(MapLike.scala:141)
[info]   at scala.collection.AbstractMap.apply(Map.scala:59)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog$$anonfun$listFiles$1.apply(PartitioningAwareFileCatalog.scala:59)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog$$anonfun$listFiles$1.apply(PartitioningAwareFileCatalog.scala:55)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
[info]   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
[info]   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.listFiles(PartitioningAwareFileCatalog.scala:55)
[info]   at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:93)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
[info]   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
[info]   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:60)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:55)
[info]   at org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:55)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
[info]   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
[info]   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:60)
[info]   at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:77)
[info]   at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
[info]   at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:82)
[info]   at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:82)
[info]   at org.apache.spark.sql.QueryTest.assertEmptyMissingInput(QueryTest.scala:330)
[info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:146)
[info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:159)
[info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7$$anonfun$apply$mcV$sp$25.apply(parquetSuites.scala:554)
[info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7$$anonfun$apply$mcV$sp$25.apply(parquetSuites.scala:535)
[info]   at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:125)
[info]   at org.apache.spark.sql.hive.ParquetPartitioningTest.withTempDir(parquetSuites.scala:726)
[info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7.apply$mcV$sp(parquetSuites.scala:535)
[info]   at org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:166)
[info]   at org.apache.spark.sql.hive.ParquetPartitioningTest.withTable(parquetSuites.scala:726)
[info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply$mcV$sp(parquetSuites.scala:534)
[info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply(parquetSuites.scala:534)
[info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply(parquetSuites.scala:534)
```

The solution in this PR to get the paths to list from the partition spec and not rely on the default table path alone.

unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13022 from tdas/SPARK-15248.
2016-05-11 12:36:25 -07:00
Eric Liang 6d0368ab8d [SPARK-15259] Sort time metric should not include spill and record insertion time
## What changes were proposed in this pull request?

After SPARK-14669 it seems the sort time metric includes both spill and record insertion time. This makes it not very useful since the metric becomes close to the total execution time of the node.

We should track just the time spent for in-memory sort, as before.

## How was this patch tested?

Verified metric in the UI, also unit test on UnsafeExternalRowSorter.

cc davies

Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>

Closes #13035 from ericl/fix-metrics.
2016-05-11 11:25:46 -07:00
Shixiong Zhu 875ef76428 [SPARK-15231][SQL] Document the semantic of saveAsTable and insertInto and don't drop columns silently
## What changes were proposed in this pull request?

This PR adds documents about the different behaviors between `insertInto` and `saveAsTable`, and throws an exception when the user try to add too man columns using `saveAsTable with append`.

## How was this patch tested?

Unit tests added in this PR.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13013 from zsxwing/SPARK-15231.
2016-05-10 23:53:55 -07:00
Davies Liu 1fbe2785df [SPARK-15255][SQL] limit the length of name for cached DataFrame
## What changes were proposed in this pull request?

We use the tree string of an SparkPlan as the name of cached DataFrame, that could be very long, cause the browser to be not responsive. This PR will limit the length of the name to 1000 characters.

## How was this patch tested?

Here is how the UI looks right now:

![ui](https://cloud.githubusercontent.com/assets/40902/15163355/d5640f9c-16bc-11e6-8655-809af8a4fed1.png)

Author: Davies Liu <davies@databricks.com>

Closes #13033 from davies/cache_name.
2016-05-10 22:29:41 -07:00
hyukjinkwon 3ff012051f [SPARK-15250][SQL] Remove deprecated json API in DataFrameReader
## What changes were proposed in this pull request?

This PR removes the old `json(path: String)` API which is covered by the new `json(paths: String*)`.

## How was this patch tested?

Jenkins tests (existing tests should cover this)

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #13040 from HyukjinKwon/SPARK-15250.
2016-05-10 22:21:17 -07:00
Reynold Xin 5a5b83c97b [SPARK-15261][SQL] Remove experimental tag from DataFrameReader/Writer
## What changes were proposed in this pull request?
This patch removes experimental tag from DataFrameReader and DataFrameWriter, and explicitly tags a few methods added for structured streaming as experimental.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #13038 from rxin/SPARK-15261.
2016-05-10 21:54:32 -07:00
Sean Zhong 61e0bdcff2 [SPARK-14476][SQL] Improve the physical plan visualization by adding meta info like table name and file path for data source.
## What changes were proposed in this pull request?
Improve the physical plan visualization by adding meta info like table name and file path for data source.

Meta info InputPaths and TableName are newly added. Example:
```
scala> spark.range(10).write.saveAsTable("tt")
scala> spark.sql("select * from tt").explain()
== Physical Plan ==
WholeStageCodegen
:  +- BatchedScan HadoopFiles[id#13L] Format: ParquetFormat, InputPaths: file:/home/xzhong10/spark-linux/assembly/spark-warehouse/tt, PushedFilters: [], ReadSchema: struct<id:bigint>, TableName: default.tt
```

## How was this patch tested?

manual tests.

Changes for UI:
Before:
![ui_before_change](https://cloud.githubusercontent.com/assets/2595532/15064559/3d423e3c-1388-11e6-8099-7803ef496c4d.jpg)

After:
![fix_long_string](https://cloud.githubusercontent.com/assets/2595532/15133566/8ad09e26-1696-11e6-939c-99b908249b9d.jpg)

![for_load](https://cloud.githubusercontent.com/assets/2595532/15157224/3ba95c98-171d-11e6-885a-de0ee8dec27c.jpg)

Author: Sean Zhong <clockfly@gmail.com>

Closes #12947 from clockfly/spark-14476.
2016-05-10 21:50:53 -07:00
Tathagata Das d9ca9fd3e5 [SPARK-14837][SQL][STREAMING] Added support in file stream source for reading new files added to subdirs
## What changes were proposed in this pull request?
Currently, file stream source can only find new files if they appear in the directory given to the source, but not if they appear in subdirs. This PR add support for providing glob patterns when creating file stream source so that it can find new files in nested directories based on the glob pattern.

## How was this patch tested?

Unit test that tests when new files are discovered with globs and partitioned directories.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12616 from tdas/SPARK-14837.
2016-05-10 16:43:32 -07:00
Sandeep Singh da02d006bb [SPARK-15249][SQL] Use FunctionResource instead of (String, String) in CreateFunction and CatalogFunction for resource
Use FunctionResource instead of (String, String) in CreateFunction and CatalogFunction for resource
see: TODO's here
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L36
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala#L42

Existing tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13024 from techaddict/SPARK-15249.
2016-05-10 14:22:03 -07:00
Herman van Hovell d28c67544b [SPARK-14986][SQL] Return correct result for empty LATERAL VIEW OUTER
## What changes were proposed in this pull request?
A Generate with the `outer` flag enabled should always return one or more rows for every input row. The optimizer currently violates this by rewriting `outer` Generates that do not contain columns of the child plan into an unjoined generate, for example:
```sql
select e from a lateral view outer explode(a.b) as e
```
The result of this is that `outer` Generate does not produce output at all when the Generators' input expression is empty. This PR fixes this.

## How was this patch tested?
Added test case to `SQLQuerySuite`.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12906 from hvanhovell/SPARK-14986.
2016-05-10 12:47:31 -07:00
Subhobrata Dey 89f73f6741 [SPARK-14642][SQL] import org.apache.spark.sql.expressions._ breaks udf under functions
## What changes were proposed in this pull request?

PR fixes the import issue which breaks udf functions.

The following code snippet throws an error

```
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._

scala> udf((v: String) => v.stripSuffix("-abc"))
<console>:30: error: No TypeTag available for String
       udf((v: String) => v.stripSuffix("-abc"))
```

This PR resolves the issue.

## How was this patch tested?

patch tested with unit tests.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Subhobrata Dey <sbcd90@gmail.com>

Closes #12458 from sbcd90/udfFuncBreak.
2016-05-10 12:32:56 -07:00
Andrew Or 69641066ae [SPARK-15037][HOTFIX] Don't create 2 SparkSessions in constructor
## What changes were proposed in this pull request?

After #12907 `TestSparkSession` creates a spark session in one of the constructors just to get the `SparkContext` from it. This ends up creating 2 `SparkSession`s from one call, which is definitely not what we want.

## How was this patch tested?

Jenkins.

Author: Andrew Or <andrew@databricks.com>

Closes #13031 from andrewor14/sql-test.
2016-05-10 12:07:47 -07:00
Andrew Or cddb9da074 [HOTFIX] SQL test compilation error from merge conflict 2016-05-10 11:46:02 -07:00
gatorsmile 5c6b085578 [SPARK-14603][SQL] Verification of Metadata Operations by Session Catalog
Since we cannot really trust if the underlying external catalog can throw exceptions when there is an invalid metadata operation, let's do it in SessionCatalog.

- [X] The first step is to unify the error messages issued in Hive-specific Session Catalog and general Session Catalog.
- [X] The second step is to verify the inputs of metadata operations for partitioning-related operations. This is moved to a separate PR: https://github.com/apache/spark/pull/12801
- [X] The third step is to add database existence verification in `SessionCatalog`
- [X] The fourth step is to add table existence verification in `SessionCatalog`
- [X] The fifth step is to add function existence verification in `SessionCatalog`

Add test cases and verify the error messages we issued

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #12385 from gatorsmile/verifySessionAPIs.
2016-05-10 11:25:55 -07:00
Sandeep Singh ed0b4070fb [SPARK-15037][SQL][MLLIB] Use SparkSession instead of SQLContext in Scala/Java TestSuites
## What changes were proposed in this pull request?
Use SparkSession instead of SQLContext in Scala/Java TestSuites
as this PR already very big working Python TestSuites in a diff PR.

## How was this patch tested?
Existing tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #12907 from techaddict/SPARK-15037.
2016-05-10 11:17:47 -07:00
Wenchen Fan bcfee153b1 [SPARK-12837][CORE] reduce network IO for accumulators
Sending un-updated accumulators back to driver makes no sense, as merging a zero value accumulator is a no-op. We should only send back updated accumulators, to save network IO.

new test in `TaskContextSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12899 from cloud-fan/acc.
2016-05-10 11:16:56 -07:00
Herman van Hovell 2646265368 [SPARK-14773] [SPARK-15179] [SQL] Fix SQL building and enable Hive tests
## What changes were proposed in this pull request?
This PR fixes SQL building for predicate subqueries and correlated scalar subqueries. It also enables most Hive subquery tests.

## How was this patch tested?
Enabled new tests in HiveComparisionSuite.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12988 from hvanhovell/SPARK-14773.
2016-05-10 09:56:07 -07:00
Pete Robbins 2dfb9cd1f7 [SPARK-15154] [SQL] Change key types to Long in tests
## What changes were proposed in this pull request?

As reported in the Jira the 2 tests changed here are using a key of type Integer where the Spark sql code assumes the type is Long. This PR changes the tests to use the correct key types.

## How was this patch tested?

Test builds run on both Big Endian and Little Endian platforms

Author: Pete Robbins <robbinspg@gmail.com>

Closes #13009 from robbinspg/HashedRelationSuiteFix.
2016-05-10 09:53:56 -07:00
Cheng Lian 8a12580d25 [SPARK-14127][SQL] "DESC <table>": Extracts schema information from table properties for data source tables
## What changes were proposed in this pull request?

This is a follow-up of #12934 and #12844. This PR adds a set of utility methods in `DDLUtils` to help extract schema information (user-defined schema, partition columns, and bucketing information) from data source table properties. These utility methods are then used in `DescribeTableCommand` to refine output for data source tables. Before this PR, the aforementioned schema information are only shown as table properties, which are hard to read.

Sample output:

```
+----------------------------+---------------------------------------------------------+-------+
|col_name                    |data_type                                                |comment|
+----------------------------+---------------------------------------------------------+-------+
|a                           |bigint                                                   |       |
|b                           |bigint                                                   |       |
|c                           |bigint                                                   |       |
|d                           |bigint                                                   |       |
|# Partition Information     |                                                         |       |
|# col_name                  |                                                         |       |
|d                           |                                                         |       |
|                            |                                                         |       |
|# Detailed Table Information|                                                         |       |
|Database:                   |default                                                  |       |
|Owner:                      |lian                                                     |       |
|Create Time:                |Tue May 10 03:20:34 PDT 2016                             |       |
|Last Access Time:           |Wed Dec 31 16:00:00 PST 1969                             |       |
|Location:                   |file:/Users/lian/local/src/spark/workspace-a/target/...  |       |
|Table Type:                 |MANAGED                                                  |       |
|Table Parameters:           |                                                         |       |
|  rawDataSize               |-1                                                       |       |
|  numFiles                  |1                                                        |       |
|  transient_lastDdlTime     |1462875634                                               |       |
|  totalSize                 |684                                                      |       |
|  spark.sql.sources.provider|parquet                                                  |       |
|  EXTERNAL                  |FALSE                                                    |       |
|  COLUMN_STATS_ACCURATE     |false                                                    |       |
|  numRows                   |-1                                                       |       |
|                            |                                                         |       |
|# Storage Information       |                                                         |       |
|SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       |       |
|InputFormat:                |org.apache.hadoop.mapred.SequenceFileInputFormat         |       |
|OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat|       |
|Compressed:                 |No                                                       |       |
|Num Buckets:                |2                                                        |       |
|Bucket Columns:             |[b]                                                      |       |
|Sort Columns:               |[c]                                                      |       |
|Storage Desc Parameters:    |                                                         |       |
|  path                      |file:/Users/lian/local/src/spark/workspace-a/target/...  |       |
|  serialization.format      |1                                                        |       |
+----------------------------+---------------------------------------------------------+-------+
```

## How was this patch tested?

Test cases are added in `HiveDDLSuite` to check command output.

Author: Cheng Lian <lian@databricks.com>

Closes #13025 from liancheng/spark-14127-extract-schema-info.
2016-05-10 09:00:53 -07:00
gatorsmile 5706472670 [SPARK-15215][SQL] Fix Explain Parsing and Output
#### What changes were proposed in this pull request?
This PR is to address a few existing issues in `EXPLAIN`:
- The `EXPLAIN` options `LOGICAL | FORMATTED | EXTENDED | CODEGEN` should not be 0 or more match. It should 0 or one match. Parser does not allow users to use more than one option in a single command.
- The option `LOGICAL` is not supported. Issue an exception when users specify this option in the command.
- The output of `EXPLAIN ` contains a weird empty line when the output of analyzed plan is empty. We should remove it. For example:
  ```
  == Parsed Logical Plan ==
  CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.  HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false

  == Analyzed Logical Plan ==

  CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.  HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false

  == Optimized Logical Plan ==
  CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.  HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
  ...
  ```

#### How was this patch tested?
Added and modified a few test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12991 from gatorsmile/explainCreateTable.
2016-05-10 11:53:37 +02:00
gatorsmile f45379173b [SPARK-15187][SQL] Disallow Dropping Default Database
#### What changes were proposed in this pull request?
In Hive Metastore, dropping default database is not allowed. However, in `InMemoryCatalog`, this is allowed.

This PR is to disallow users to drop default database.

#### How was this patch tested?
Previously, we already have a test case in HiveDDLSuite. Now, we also add the same one in DDLSuite

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12962 from gatorsmile/dropDefaultDB.
2016-05-10 11:57:01 +08:00
Reynold Xin 4b4344a813 [SPARK-15229][SQL] Make case sensitivity setting internal
## What changes were proposed in this pull request?
Our case sensitivity support is different from what ANSI SQL standards support. Postgres' behavior is that if an identifier is quoted, then it is treated as case sensitive; otherwise it is folded to lowercase. We will likely need to revisit this in the future and change our behavior. For now, the safest change to do for Spark 2.0 is to make the case sensitive option internal and discourage users from turning it on, effectively making Spark always case insensitive.

## How was this patch tested?
N/A - a small config documentation change.

Author: Reynold Xin <rxin@databricks.com>

Closes #13011 from rxin/SPARK-15229.
2016-05-09 20:03:01 -07:00
Andrew Or 8f932fb88d [SPARK-15234][SQL] Fix spark.catalog.listDatabases.show()
## What changes were proposed in this pull request?

Before:
```
scala> spark.catalog.listDatabases.show()
+--------------------+-----------+-----------+
|                name|description|locationUri|
+--------------------+-----------+-----------+
|Database[name='de...|
|Database[name='my...|
|Database[name='so...|
+--------------------+-----------+-----------+
```

After:
```
+-------+--------------------+--------------------+
|   name|         description|         locationUri|
+-------+--------------------+--------------------+
|default|Default Hive data...|file:/user/hive/w...|
|  my_db|  This is a database|file:/Users/andre...|
|some_db|                    |file:/private/var...|
+-------+--------------------+--------------------+
```

## How was this patch tested?

New test in `CatalogSuite`

Author: Andrew Or <andrew@databricks.com>

Closes #13015 from andrewor14/catalog-show.
2016-05-09 20:02:23 -07:00
xin Wu 980bba0dcf [SPARK-15025][SQL] fix duplicate of PATH key in datasource table options
## What changes were proposed in this pull request?
The issue is that when the user provides the path option with uppercase "PATH" key, `options` contains `PATH` key and will get into the non-external case in the following code in `createDataSourceTables.scala`, where a new key "path" is created with a default path.
```
val optionsWithPath =
      if (!options.contains("path")) {
        isExternal = false
        options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
      } else {
        options
      }
```
So before creating hive table, serdeInfo.parameters will contain both "PATH" and "path" keys and different directories. and Hive table's dataLocation contains the value of "path".

The fix in this PR is to convert `options` in the code above to `CaseInsensitiveMap` before checking for containing "path" key.

## How was this patch tested?
A testcase is added

Author: xin Wu <xinwu@us.ibm.com>

Closes #12804 from xwu0226/SPARK-15025.
2016-05-09 17:18:48 -07:00
Josh Rosen c3350cadb8 [SPARK-14972] Improve performance of JSON schema inference's compatibleType method
This patch improves the performance of `InferSchema.compatibleType` and `inferField`. The net result of this patch is a 6x speedup in local benchmarks running against cached data with a massive nested schema.

The key idea is to remove unnecessary sorting in `compatibleType`'s `StructType` merging code. This code takes two structs, merges the fields with matching names, and copies over the unique fields, producing a new schema which is the union of the two structs' schemas. Previously, this code performed a very inefficient `groupBy()` to match up fields with the same name, but this is unnecessary because `inferField` already sorts structs' fields by name: since both lists of fields are sorted, we can simply merge them in a single pass.

This patch also speeds up the existing field sorting in `inferField`: the old sorting code allocated unnecessary intermediate collections, while the new code uses mutable collects and performs in-place sorting.

I rewrote inefficient `equals()` implementations in `StructType` and `Metadata`, significantly reducing object allocations in those methods.

Finally, I replaced a `treeAggregate` call with `fold`: I doubt that `treeAggregate` will benefit us very much because the schemas would have to be enormous to realize large savings in network traffic. Since most schemas are probably fairly small in serialized form, they should typically fit within a direct task result and therefore can be incrementally merged at the driver as individual tasks finish. This change eliminates an entire (short) scheduler stage.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12750 from JoshRosen/schema-inference-speedups.
2016-05-09 13:11:18 -07:00
Wenchen Fan 2adb11f6db [SPARK-15173][SQL] DataFrameWriter.insertInto should work with datasource table stored in hive
When we parse `CREATE TABLE USING`, we should build a `CreateTableUsing` plan with the `managedIfNoPath` set to true. Then we will add default table path to options when write it to hive.

new test in `SQLQuerySuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12949 from cloud-fan/bug.
2016-05-09 12:58:27 -07:00
Andrew Or 7bf9b12019 [SPARK-15166][SQL] Move some hive-specific code from SparkSession
## What changes were proposed in this pull request?

This also simplifies the code being moved.

## How was this patch tested?

Existing tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12941 from andrewor14/move-code.
2016-05-09 11:24:58 -07:00
jerryshao ee6a8d7eaa [MINOR][SQL] Enhance the exception message if checkpointLocation is not set
Enhance the exception message when `checkpointLocation` is not set, previously the message is:

```
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:347)
  at scala.None$.get(Option.scala:345)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$8.apply(DataFrameWriter.scala:338)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$8.apply(DataFrameWriter.scala:338)
  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
  at scala.collection.AbstractMap.getOrElse(Map.scala:59)
  at org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:337)
  at org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:277)
  ... 48 elided
```

This is not so meaningful, so changing to make it more specific.

Local verified.

Author: jerryshao <sshao@hortonworks.com>

Closes #12998 from jerryshao/improve-exception-message.
2016-05-09 11:14:40 -07:00
Cheng Lian 671b382a80 [SPARK-14127][SQL] Makes 'DESC [EXTENDED|FORMATTED] <table>' support data source tables
## What changes were proposed in this pull request?

This is a follow-up of PR #12844. It makes the newly updated `DescribeTableCommand` to support data sources tables.

## How was this patch tested?

A test case is added to check `DESC [EXTENDED | FORMATTED] <table>` output.

Author: Cheng Lian <lian@databricks.com>

Closes #12934 from liancheng/spark-14127-desc-table-follow-up.
2016-05-09 10:53:32 -07:00
gatorsmile b1e01fd519 [SPARK-15199][SQL] Disallow Dropping Build-in Functions
#### What changes were proposed in this pull request?
As Hive and the major RDBMS behave, the built-in functions are not allowed to drop. In the current implementation, users can drop the built-in functions. However, after dropping the built-in functions, users are unable to add them back.

#### How was this patch tested?
Added a test case.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12975 from gatorsmile/dropBuildInFunction.
2016-05-09 10:49:54 -07:00
Wenchen Fan beb16ec556 [SPARK-15093][SQL] create/delete/rename directory for InMemoryCatalog operations if needed
## What changes were proposed in this pull request?

following operations have file system operation now:

1. CREATE DATABASE: create a dir
2. DROP DATABASE: delete the dir
3. CREATE TABLE: create a dir
4. DROP TABLE: delete the dir
5. RENAME TABLE: rename the dir
6. CREATE PARTITIONS: create a dir
7. RENAME PARTITIONS: rename the dir
8. DROP PARTITIONS: drop the dir

## How was this patch tested?

new tests in `ExternalCatalogSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12871 from cloud-fan/catalog.
2016-05-09 10:47:45 -07:00
gatorsmile a59ab594ca [SPARK-15184][SQL] Fix Silent Removal of An Existent Temp Table by Rename Table
#### What changes were proposed in this pull request?
Currently, if we rename a temp table `Tab1` to another existent temp table `Tab2`. `Tab2` will be silently removed. This PR is to detect it and issue an exception message.

In addition, this PR also detects another issue in the rename table command. When the destination table identifier does have database name, we should not ignore them. That might mean users could rename a regular table.

#### How was this patch tested?
Added two related test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12959 from gatorsmile/rewriteTable.
2016-05-09 13:05:18 +08:00
Herman van Hovell df89f1d43d [SPARK-15122] [SQL] Fix TPC-DS 41 - Normalize predicates before pulling them out
## What changes were proposed in this pull request?
The official TPC-DS 41 query currently fails because it contains a scalar subquery with a disjunctive correlated predicate (the correlated predicates were nested in ORs). This makes the `Analyzer` pull out the entire predicate which is wrong and causes the following (correct) analysis exception: `The correlated scalar subquery can only contain equality predicates`

This PR fixes this by first simplifing (or normalizing) the correlated predicates before pulling them out of the subquery.

## How was this patch tested?
Manual testing on TPC-DS 41, and added a test to SubquerySuite.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12954 from hvanhovell/SPARK-15122.
2016-05-06 21:06:03 -07:00
Kevin Yu 607a27a0d1 [SPARK-15051][SQL] Create a TypedColumn alias
## What changes were proposed in this pull request?

Currently when we create an alias against a TypedColumn from user-defined Aggregator(for example: agg(aggSum.toColumn as "a")), spark is using the alias' function from Column( as), the alias function will return a column contains a TypedAggregateExpression, which is unresolved because the inputDeserializer is not defined. Later the aggregator function (agg) will inject the inputDeserializer back to the TypedAggregateExpression, but only if the aggregate columns are TypedColumn, in the above case, the TypedAggregateExpression will remain unresolved because it is under column and caused the
problem reported by this jira [15051](https://issues.apache.org/jira/browse/SPARK-15051?jql=project%20%3D%20SPARK).

This PR propose to create an alias function for TypedColumn,  it will return a TypedColumn. It is using the similar code path  as Column's alia function.

For the spark build in aggregate function, like max, it is working with alias, for example

val df1 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j")
checkAnswer(df1.agg(max("j") as "b"), Row(3) :: Nil)

Thanks for comments.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Add test cases in DatasetAggregatorSuite.scala
run the sql related queries against this patch.

Author: Kevin Yu <qyu@us.ibm.com>

Closes #12893 from kevinyu98/spark-15051.
2016-05-07 11:13:48 +08:00
Tathagata Das f7b7ef4166 [SPARK-14997][SQL] Fixed FileCatalog to return correct set of files when there is no partitioning scheme in the given paths
## What changes were proposed in this pull request?
Lets says there are json files in the following directories structure
```
xyz/file0.json
xyz/subdir1/file1.json
xyz/subdir2/file2.json
xyz/subdir1/subsubdir1/file3.json
```
`sqlContext.read.json("xyz")` should read only file0.json according to behavior in Spark 1.6.1. However in current master, all the 4 files are read.

The fix is to make FileCatalog return only the children files of the given path if there is not partitioning detected (instead of all the recursive list of files).

Closes #12774

## How was this patch tested?

unit tests

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12856 from tdas/SPARK-14997.
2016-05-06 15:04:16 -07:00
gatorsmile 5c8fad7b9b [SPARK-15108][SQL] Describe Permanent UDTF
#### What changes were proposed in this pull request?
When Describe a UDTF, the command returns a wrong result. The command is unable to find the function, which has been created and cataloged in the catalog but not in the functionRegistry.

This PR is to correct it. If the function is not in the functionRegistry, we will check the catalog for collecting the information of the UDTF function.

#### How was this patch tested?
Added test cases to verify the results

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12885 from gatorsmile/showFunction.
2016-05-06 11:43:07 -07:00
hyukjinkwon fa928ff9a3 [SPARK-14962][SQL] Do not push down isnotnull/isnull on unsuportted types in ORC
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14962

ORC filters were being pushed down for all types for both `IsNull` and `IsNotNull`.

This is apparently OK because both `IsNull` and `IsNotNull` do not take a type as an argument (Hive 1.2.x) during building filters (`SearchArgument`) in Spark-side but they do not filter correctly because stored statistics always produces `null` for not supported types (eg `ArrayType`) in ORC-side. So, it is always `true` for `IsNull` which ends up with always `false` for `IsNotNull`. (Please see [RecordReaderImpl.java#L296-L318](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L296-L318)  and [RecordReaderImpl.java#L359-L365](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L359-L365) in Hive 1.2)

This looks prevented in Hive 1.3.x >= by forcing to give a type ([`PredicateLeaf.Type`](e085b7e9bd/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java (L50-L56))) when building a filter ([`SearchArgument`](26b5c7b56a/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgument.java (L260))) but Hive 1.2.x seems not doing this.

This PR prevents ORC filter creation for `IsNull` and `IsNotNull` on unsupported types. `OrcFilters` resembles `ParquetFilters`.

## How was this patch tested?

Unittests in `OrcQuerySuite` and `OrcFilterSuite` and `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #12777 from HyukjinKwon/SPARK-14962.
2016-05-07 01:46:45 +08:00
Jacek Laskowski bbb7773437 [SPARK-15152][DOC][MINOR] Scaladoc and Code style Improvements
## What changes were proposed in this pull request?

Minor doc and code style fixes

## How was this patch tested?

local build

Author: Jacek Laskowski <jacek@japila.pl>

Closes #12928 from jaceklaskowski/SPARK-15152.
2016-05-05 16:34:27 -07:00
Dongjoon Hyun 2c170dd3d7 [SPARK-15134][EXAMPLE] Indent SparkSession builder patterns and update binary_classification_metrics_example.py
## What changes were proposed in this pull request?

This issue addresses the comments in SPARK-15031 and also fix java-linter errors.
- Use multiline format in SparkSession builder patterns.
- Update `binary_classification_metrics_example.py` to use `SparkSession`.
- Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far)

## How was this patch tested?

After passing the Jenkins tests and run `dev/lint-java` manually.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12911 from dongjoon-hyun/SPARK-15134.
2016-05-05 14:37:50 -07:00
Shixiong Zhu bb9991dec5 [SPARK-15135][SQL] Make sure SparkSession thread safe
## What changes were proposed in this pull request?

Went through SparkSession and its members and fixed non-thread-safe classes used by SparkSession

## How was this patch tested?

Existing unit tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12915 from zsxwing/spark-session-thread-safe.
2016-05-05 14:36:47 -07:00
Sandeep Singh ed6f3f8a5f [SPARK-15072][SQL][REPL][EXAMPLES] Remove SparkSession.withHiveSupport
## What changes were proposed in this pull request?
Removing the `withHiveSupport` method of `SparkSession`, instead use `enableHiveSupport`

## How was this patch tested?
ran tests locally

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #12851 from techaddict/SPARK-15072.
2016-05-05 14:35:15 -07:00
gatorsmile 8cba57a75c [SPARK-14124][SQL][FOLLOWUP] Implement Database-related DDL Commands
#### What changes were proposed in this pull request?

First, a few test cases failed in mac OS X  because the property value of `java.io.tmpdir` does not include a trailing slash on some platform. Hive always removes the last trailing slash. For example, what I got in the web:
```
Win NT  --> C:\TEMP\
Win XP  --> C:\TEMP
Solaris --> /var/tmp/
Linux   --> /var/tmp
```
Second, a couple of test cases are added to verify if the commands work properly.

#### How was this patch tested?
Added a test case for it and correct the previous test cases.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #12081 from gatorsmile/mkdir.
2016-05-05 14:34:24 -07:00
NarineK 22226fcc92 [SPARK-15110] [SPARKR] Implement repartitionByColumn for SparkR DataFrames
## What changes were proposed in this pull request?

Implement repartitionByColumn on DataFrame.
This will allow us to run R functions on each partition identified by column groups with dapply() method.

## How was this patch tested?

Unit tests

Author: NarineK <narine.kokhlikyan@us.ibm.com>

Closes #12887 from NarineK/repartitionByColumns.
2016-05-05 12:00:55 -07:00
hyukjinkwon ac12b35d31 [SPARK-15148][SQL] Upgrade Univocity library from 2.0.2 to 2.1.0
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-15148

Mainly it improves the performance roughtly about 30%-40% according to the [release note](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.1.0). For the details of the purpose is described in the JIRA.

This PR upgrades Univocity library from 2.0.2 to 2.1.0.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12923 from HyukjinKwon/SPARK-15148.
2016-05-05 11:26:40 -07:00
Wenchen Fan 55cc1c991a [SPARK-14139][SQL] RowEncoder should preserve schema nullability
## What changes were proposed in this pull request?

The problem is: In `RowEncoder`, we use `Invoke` to get the field of an external row, which lose the nullability information. This PR creates a `GetExternalRowField` expression, so that we can preserve the nullability info.

TODO: simplify the null handling logic in `RowEncoder`, to remove so many if branches, in follow-up PR.

## How was this patch tested?

new tests in `RowEncoderSuite`

Note that, This PR takes over https://github.com/apache/spark/pull/11980, with a little simplification, so all credits should go to koertkuipers

Author: Wenchen Fan <wenchen@databricks.com>
Author: Koert Kuipers <koert@tresata.com>

Closes #12364 from cloud-fan/nullable.
2016-05-06 01:08:04 +08:00
Kousuke Saruta 1a9b341581 [SPARK-15132][MINOR][SQL] Debug log for generated code should be printed with proper indentation
## What changes were proposed in this pull request?

Similar to #11990, GenerateOrdering and GenerateColumnAccessor should print debug log for generated code with proper indentation.

## How was this patch tested?

Manually checked.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #12908 from sarutak/SPARK-15132.
2016-05-04 22:18:55 -07:00
Tathagata Das bde27b89a2 [SPARK-15131][SQL] Shutdown StateStore management thread when SparkContext has been shutdown
## What changes were proposed in this pull request?

Make sure that whenever the StateStoreCoordinator cannot be contacted, assume that the SparkContext and RpcEnv on the driver has been shutdown, and therefore stop the StateStore management thread, and unload all loaded stores.

## How was this patch tested?

Updated unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12905 from tdas/SPARK-15131.
2016-05-04 21:19:53 -07:00
gatorsmile ef55e46c92 [SPARK-14993][SQL] Fix Partition Discovery Inconsistency when Input is a Path to Parquet File
#### What changes were proposed in this pull request?
When we load a dataset, if we set the path to ```/path/a=1```, we will not take `a` as the partitioning column. However, if we set the path to ```/path/a=1/file.parquet```, we take `a` as the partitioning column and it shows up in the schema.

This PR is to fix the behavior inconsistency issue.

The base path contains a set of paths that are considered as the base dirs of the input datasets. The partitioning discovery logic will make sure it will stop when it reaches any base path.

By default, the paths of the dataset provided by users will be base paths. Below are three typical cases,
**Case 1**```sqlContext.read.parquet("/path/something=true/")```: the base path will be
`/path/something=true/`, and the returned DataFrame will not contain a column of `something`.
**Case 2**```sqlContext.read.parquet("/path/something=true/a.parquet")```: the base path will be
still `/path/something=true/`, and the returned DataFrame will also not contain a column of
`something`.
**Case 3**```sqlContext.read.parquet("/path/")```: the base path will be `/path/`, and the returned
DataFrame will have the column of `something`.

Users also can override the basePath by setting `basePath` in the options to pass the new base
path to the data source. For example,
```sqlContext.read.option("basePath", "/path/").parquet("/path/something=true/")```,
and the returned DataFrame will have the column of `something`.

The related PRs:
- https://github.com/apache/spark/pull/9651
- https://github.com/apache/spark/pull/10211

#### How was this patch tested?
Added a couple of test cases

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #12828 from gatorsmile/readPartitionedTable.
2016-05-04 18:47:27 -07:00
Sean Zhong 8fb1463d6a [SPARK-6339][SQL] Supports CREATE TEMPORARY VIEW tableIdentifier AS query
## What changes were proposed in this pull request?

This PR support new SQL syntax CREATE TEMPORARY VIEW.
Like:
```
CREATE TEMPORARY VIEW viewName AS SELECT * from xx
CREATE OR REPLACE TEMPORARY VIEW viewName AS SELECT * from xx
CREATE TEMPORARY VIEW viewName (c1 COMMENT 'blabla', c2 COMMENT 'blabla') AS SELECT * FROM xx
```

## How was this patch tested?

Unit tests.

Author: Sean Zhong <clockfly@gmail.com>

Closes #12872 from clockfly/spark-6399.
2016-05-04 18:27:25 -07:00
sethah b281377647 [MINOR][SQL] Fix typo in DataFrameReader csv documentation
## What changes were proposed in this pull request?
Typo fix

## How was this patch tested?
No tests

My apologies for the tiny PR, but I stumbled across this today and wanted to get it corrected for 2.0.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #12912 from sethah/csv_typo.
2016-05-04 16:46:13 -07:00
Reynold Xin 6ae9fc00ed [SPARK-15126][SQL] RuntimeConfig.set should return Unit
## What changes were proposed in this pull request?
Currently we return RuntimeConfig itself to facilitate chaining. However, it makes the output in interactive environments (e.g. notebooks, scala repl) weird because it'd show the response of calling set as a RuntimeConfig itself.

## How was this patch tested?
Updated unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12902 from rxin/SPARK-15126.
2016-05-04 14:26:05 -07:00
Tathagata Das 0fd3a47484 [SPARK-15103][SQL] Refactored FileCatalog class to allow StreamFileCatalog to infer partitioning
## What changes were proposed in this pull request?

File Stream Sink writes the list of written files in a metadata log. StreamFileCatalog reads the list of the files for processing. However StreamFileCatalog does not infer partitioning like HDFSFileCatalog.

This PR enables that by refactoring HDFSFileCatalog to create an abstract class PartitioningAwareFileCatalog, that has all the functionality to infer partitions from a list of leaf files.
- HDFSFileCatalog has been renamed to ListingFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from recursive directory scanning.
- StreamFileCatalog has been renamed to MetadataLogFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from the metadata log.
- The above two classes has been moved into their own files as they are not interfaces that should be in fileSourceInterfaces.scala.

## How was this patch tested?
- FileStreamSinkSuite was update to see if partitioning gets inferred, and on reading whether the partitions get pruned correctly based on the query.
- Other unit tests are unchanged and pass as expected.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12879 from tdas/SPARK-15103.
2016-05-04 11:02:48 -07:00
Reynold Xin 6274a520fa [SPARK-15115][SQL] Reorganize whole stage codegen benchmark suites
## What changes were proposed in this pull request?
We currently have a single suite that is very large, making it difficult to maintain and play with specific primitives. This patch reorganizes the file by creating multiple benchmark suites in a single package.

Most of the changes are straightforward move of code. On top of the code moving, I did:
1. Use SparkSession instead of SQLContext.
2. Turned most benchmark scenarios into a their own test cases, rather than having multiple scenarios in a single test case, which takes forever to run.

## How was this patch tested?
This is a test only change.

Author: Reynold Xin <rxin@databricks.com>

Closes #12891 from rxin/SPARK-15115.
2016-05-04 11:00:01 -07:00
Liang-Chi Hsieh b85d21fb9d [SPARK-14951] [SQL] Support subexpression elimination in TungstenAggregate
## What changes were proposed in this pull request?

We can support subexpression elimination in TungstenAggregate by using current `EquivalentExpressions` which is already used in subexpression elimination for expression codegen.

However, in wholestage codegen, we can't wrap the common expression's codes in functions as before, we simply generate the code snippets for common expressions. These code snippets are inserted before the common expressions are actually used in generated java codes.

For multiple `TypedAggregateExpression` used in aggregation operator, since their input type should be the same. So their `inputDeserializer` will be the same too. This patch can also reduce redundant input deserialization.

## How was this patch tested?
Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12729 from viirya/subexpr-elimination-tungstenaggregate.
2016-05-04 10:54:51 -07:00
Reynold Xin d864c55cf8 [SPARK-15109][SQL] Accept Dataset[_] in joins
## What changes were proposed in this pull request?
This patch changes the join API in Dataset so they can accept any Dataset, rather than just DataFrames.

## How was this patch tested?
N/A.

Author: Reynold Xin <rxin@databricks.com>

Closes #12886 from rxin/SPARK-15109.
2016-05-04 10:38:27 -07:00
Liwei Lin e597ec6f1c [SPARK-15022][SPARK-15023][SQL][STREAMING] Add support for testing against the ProcessingTime(intervalMS > 0) trigger and ManualClock
## What changes were proposed in this pull request?

Currently in `StreamTest`, we have a `StartStream` which will start a streaming query against trigger `ProcessTime(intervalMS = 0)` and `SystemClock`.

We also need to test cases against `ProcessTime(intervalMS > 0)`, which often requires `ManualClock`.

This patch:
- fixes an issue of `ProcessingTimeExecutor`, where for a batch it should run `batchRunner` only once but might run multiple times under certain conditions;
- adds support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `AdvanceManualClock`, by specifying them as fields for `StartStream`, and by adding an `AdvanceClock` action;
- adds a test, which takes advantage of the new `StartStream` and `AdvanceManualClock`, to test against [PR#[SPARK-14942] Reduce delay between batch construction and execution ](https://github.com/apache/spark/pull/12725).

## How was this patch tested?

N/A

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12797 from lw-lin/add-trigger-test-support.
2016-05-04 10:25:14 -07:00
Cheng Lian f152fae306 [SPARK-14127][SQL] Native "DESC [EXTENDED | FORMATTED] <table>" DDL command
## What changes were proposed in this pull request?

This PR implements native `DESC [EXTENDED | FORMATTED] <table>` DDL command. Sample output:

```
scala> spark.sql("desc extended src").show(100, truncate = false)
+----------------------------+---------------------------------+-------+
|col_name                    |data_type                        |comment|
+----------------------------+---------------------------------+-------+
|key                         |int                              |       |
|value                       |string                           |       |
|                            |                                 |       |
|# Detailed Table Information|CatalogTable(`default`.`src`, ...|       |
+----------------------------+---------------------------------+-------+

scala> spark.sql("desc formatted src").show(100, truncate = false)
+----------------------------+----------------------------------------------------------+-------+
|col_name                    |data_type                                                 |comment|
+----------------------------+----------------------------------------------------------+-------+
|key                         |int                                                       |       |
|value                       |string                                                    |       |
|                            |                                                          |       |
|# Detailed Table Information|                                                          |       |
|Database:                   |default                                                   |       |
|Owner:                      |lian                                                      |       |
|Create Time:                |Mon Jan 04 17:06:00 CST 2016                              |       |
|Last Access Time:           |Thu Jan 01 08:00:00 CST 1970                              |       |
|Location:                   |hdfs://localhost:9000/user/hive/warehouse_hive121/src     |       |
|Table Type:                 |MANAGED                                                   |       |
|Table Parameters:           |                                                          |       |
|  transient_lastDdlTime     |1451898360                                                |       |
|                            |                                                          |       |
|# Storage Information       |                                                          |       |
|SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe        |       |
|InputFormat:                |org.apache.hadoop.mapred.TextInputFormat                  |       |
|OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat|       |
|Num Buckets:                |-1                                                        |       |
|Bucket Columns:             |[]                                                        |       |
|Sort Columns:               |[]                                                        |       |
|Storage Desc Parameters:    |                                                          |       |
|  serialization.format      |1                                                         |       |
+----------------------------+----------------------------------------------------------+-------+
```

## How was this patch tested?

A test case is added to `HiveDDLSuite` to check command output.

Author: Cheng Lian <lian@databricks.com>

Closes #12844 from liancheng/spark-14127-desc-table.
2016-05-04 16:44:09 +08:00
Wenchen Fan 6c12e801e8 [SPARK-15029] improve error message for Generate
## What changes were proposed in this pull request?

This PR improve the error message for `Generate` in 3 cases:

1. generator is nested in expressions, e.g. `SELECT explode(list) + 1 FROM tbl`
2. generator appears more than one time in SELECT, e.g. `SELECT explode(list), explode(list) FROM tbl`
3. generator appears in other operator which is not project, e.g. `SELECT * FROM tbl SORT BY explode(list)`

## How was this patch tested?

new tests in `AnalysisErrorSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12810 from cloud-fan/bug.
2016-05-04 00:10:20 -07:00
Cheng Lian bc3760d405 [SPARK-14237][SQL] De-duplicate partition value appending logic in various buildReader() implementations
## What changes were proposed in this pull request?

Currently, various `FileFormat` data sources share approximately the same code for partition value appending. This PR tries to eliminate this duplication.

A new method `buildReaderWithPartitionValues()` is added to `FileFormat` with a default implementation that appends partition values to `InternalRow`s produced by the reader function returned by `buildReader()`.

Special data sources like Parquet, which implements partition value appending inside `buildReader()` because of the vectorized reader, and the Text data source, which doesn't support partitioning, override `buildReaderWithPartitionValues()` and simply delegate to `buildReader()`.

This PR brings two benefits:

1. Apparently, it de-duplicates partition value appending logic

2. Now the reader function returned by `buildReader()` is only required to produce `InternalRow`s rather than `UnsafeRow`s if the data source doesn't override `buildReaderWithPartitionValues()`.

   Because the safe-to-unsafe conversion is also performed while appending partition values. This makes 3rd-party data sources (e.g. spark-avro) easier to implement since they no longer need to access private APIs involving `UnsafeRow`.

## How was this patch tested?

Existing tests should do the work.

Author: Cheng Lian <lian@databricks.com>

Closes #12866 from liancheng/spark-14237-simplify-partition-values-appending.
2016-05-04 14:16:57 +08:00
Reynold Xin 695f0e9195 [SPARK-15107][SQL] Allow varying # iterations by test case in Benchmark
## What changes were proposed in this pull request?
This patch changes our micro-benchmark util to allow setting different iteration numbers for different test cases. For some of our benchmarks, turning off whole-stage codegen can make the runtime 20X slower, making it very difficult to run a large number of times without substantially shortening the input cardinality.

With this change, I set the default num iterations to 2 for whole stage codegen off, and 5 for whole stage codegen on. I also updated some results.

## How was this patch tested?
N/A - this is a test util.

Author: Reynold Xin <rxin@databricks.com>

Closes #12884 from rxin/SPARK-15107.
2016-05-03 22:56:40 -07:00
Andrew Or 6ba17cd147 [SPARK-14414][SQL] Make DDL exceptions more consistent
## What changes were proposed in this pull request?

Just a bunch of small tweaks on DDL exception messages.

## How was this patch tested?

`DDLCommandSuite` et al.

Author: Andrew Or <andrew@databricks.com>

Closes #12853 from andrewor14/make-exceptions-consistent.
2016-05-03 18:07:53 -07:00
Koert Kuipers 9e4928b7e0 [SPARK-15097][SQL] make Dataset.sqlContext a stable identifier for imports
## What changes were proposed in this pull request?
Make Dataset.sqlContext a lazy val so that its a stable identifier and can be used for imports.
Now this works again:
import someDataset.sqlContext.implicits._

## How was this patch tested?
Add unit test to DatasetSuite that uses the import show above.

Author: Koert Kuipers <koert@tresata.com>

Closes #12877 from koertkuipers/feat-sqlcontext-stable-import.
2016-05-03 18:06:35 -07:00
Sandeep Singh a8d56f5388 [SPARK-14422][SQL] Improve handling of optional configs in SQLConf
## What changes were proposed in this pull request?
Create a new API for handling Optional Configs in SQLConf.
Right now `getConf` for `OptionalConfigEntry[T]` returns value of type `T`, if doesn't exist throws an exception. Add new method `getOptionalConf`(suggestions on naming) which will now returns value of type `Option[T]`(so if doesn't exist it returns `None`).

## How was this patch tested?
Add test and ran tests locally.

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #12846 from techaddict/SPARK-14422.
2016-05-03 18:02:57 -07:00
Andrew Or 588cac414a [SPARK-15073][SQL] Hide SparkSession constructor from the public
## What changes were proposed in this pull request?

Users should use the builder pattern instead.

## How was this patch tested?

Jenks.

Author: Andrew Or <andrew@databricks.com>

Closes #12873 from andrewor14/spark-session-constructor.
2016-05-03 13:47:58 -07:00
yzhou2001 a4aed71719 [SPARK-14521] [SQL] StackOverflowError in Kryo when executing TPC-DS
## What changes were proposed in this pull request?

Observed stackOverflowError in Kryo when executing TPC-DS Query27. Spark thrift server disables kryo reference tracking (if not specified in conf). When "spark.kryo.referenceTracking" is set to true explicitly in spark-defaults.conf, query executes successfully. The root cause is that the TaskMemoryManager inside MemoryConsumer and LongToUnsafeRowMap were not transient and thus were serialized and broadcast around from within LongHashedRelation, which could potentially cause circular reference inside Kryo. But the TaskMemoryManager is per task and should not be passed around at the first place. This fix makes it transient.

## How was this patch tested?
core/test, hive/test, sql/test, catalyst/test, dev/lint-scala, org.apache.spark.sql.hive.execution.HiveCompatibilitySuite, dev/scalastyle,
manual test of TBC-DS Query 27 with 1GB data but without the "limit 100" which would cause a NPE due to SPARK-14752.

Author: yzhou2001 <yzhou_1999@yahoo.com>

Closes #12598 from yzhou2001/master.
2016-05-03 13:41:04 -07:00
Sandeep Singh ca813330c7 [SPARK-15087][CORE][SQL] Remove AccumulatorV2.localValue and keep only value
## What changes were proposed in this pull request?
Remove AccumulatorV2.localValue and keep only value

## How was this patch tested?
existing tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #12865 from techaddict/SPARK-15087.
2016-05-03 11:38:43 -07:00
Shixiong Zhu b545d75219 [SPARK-14860][TESTS] Create a new Waiter in reset to bypass an issue of ScalaTest's Waiter.wait
## What changes were proposed in this pull request?

This PR updates `QueryStatusCollector.reset` to create Waiter instead of calling `await(1 milliseconds)` to bypass an ScalaTest's issue that Waiter.await may block forever.

## How was this patch tested?

I created a local stress test to call codes in `test("event ordering")` 100 times. It cannot pass without this patch.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12623 from zsxwing/flaky-test.
2016-05-03 11:16:55 -07:00
Tathagata Das 4ad492c403 [SPARK-14716][SQL] Added support for partitioning in FileStreamSink
# What changes were proposed in this pull request?

Support partitioning in the file stream sink. This is implemented using a new, but simpler code path for writing parquet files - both unpartitioned and partitioned. This new code path does not use Output Committers, as we will eventually write the file names to the metadata log for "committing" them.

This patch duplicates < 100 LOC from the WriterContainer. But its far simpler that WriterContainer as it does not involve output committing. In addition, it introduces the new APIs in FileFormat and OutputWriterFactory in an attempt to simplify the APIs (not have Job in the `FileFormat` API, not have bucket and other stuff in the `OutputWriterFactory.newInstance()` ).

# Tests
- New unit tests to test the FileStreamSinkWriter for partitioned and unpartitioned files
- New unit test to partially test the FileStreamSink for partitioned files (does not test recovery of partition column data, as that requires change in the StreamFileCatalog, future PR).
- Updated FileStressSuite to test number of records read from partitioned output files.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12409 from tdas/streaming-partitioned-parquet.
2016-05-03 10:58:26 -07:00
Liwei Lin 5bd9a2f697 [SPARK-14884][SQL][STREAMING][WEBUI] Fix call site for continuous queries
## What changes were proposed in this pull request?

Since we've been processing continuous queries in separate threads, the call sites are then `run at <unknown>:0`. It's not wrong but provides very little information; in addition, we can not distinguish two queries only from their call sites.

This patch fixes this.

### Before
[Jobs Tab]
![s1a](https://cloud.githubusercontent.com/assets/15843379/14766101/a47246b2-0a30-11e6-8d81-06a9a600113b.png)
[SQL Tab]
![s1b](https://cloud.githubusercontent.com/assets/15843379/14766102/a4750226-0a30-11e6-9ada-773d977d902b.png)
### After
[Jobs Tab]
![s2a](https://cloud.githubusercontent.com/assets/15843379/14766104/a89705b6-0a30-11e6-9830-0d40ec68527b.png)
[SQL Tab]
![s2b](https://cloud.githubusercontent.com/assets/15843379/14766103/a8966728-0a30-11e6-8e4d-c2e326400478.png)

## How was this patch tested?

Manually checks - see screenshots above.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12650 from lw-lin/fix-call-site.
2016-05-03 10:10:25 -07:00
Reynold Xin 5503e453ba [SPARK-15088] [SQL] Remove SparkSqlSerializer
## What changes were proposed in this pull request?
This patch removes SparkSqlSerializer. I believe this is now dead code.

## How was this patch tested?
Removed a test case related to it.

Author: Reynold Xin <rxin@databricks.com>

Closes #12864 from rxin/SPARK-15088.
2016-05-03 09:43:47 -07:00
Reynold Xin d557a5e01e [SPARK-15081] Move AccumulatorV2 and subclasses into util package
## What changes were proposed in this pull request?
This patch moves AccumulatorV2 and subclasses into util package.

## How was this patch tested?
Updated relevant tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12863 from rxin/SPARK-15081.
2016-05-03 19:45:12 +08:00
Andrew Ray d8f528ceb6 [SPARK-13749][SQL][FOLLOW-UP] Faster pivot implementation for many distinct values with two phase aggregation
## What changes were proposed in this pull request?

This is a follow up PR for #11583. It makes 3 lazy vals into just vals and adds unit test coverage.

## How was this patch tested?

Existing unit tests and additional unit tests.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #12861 from aray/fast-pivot-follow-up.
2016-05-02 22:47:32 -07:00
Shixiong Zhu 4e3685ae5e [SPARK-15077][SQL] Use a fair lock to avoid thread starvation in StreamExecution
## What changes were proposed in this pull request?

Right now `StreamExecution.awaitBatchLock` uses an unfair lock. `StreamExecution.awaitOffset` may run too long and fail some test because `StreamExecution.constructNextBatch` keeps getting the lock.

See: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/865/testReport/junit/org.apache.spark.sql.streaming/FileStreamSourceStressTestSuite/file_source_stress_test/

This PR uses a fair ReentrantLock to resolve the thread starvation issue.

## How was this patch tested?

Modified `FileStreamSourceStressTestSuite.test("file source stress test")` to run the test codes 100 times locally. It always fails because of timeout without this patch.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12852 from zsxwing/SPARK-15077.
2016-05-02 18:27:49 -07:00
Herman van Hovell 1c19c2769e [SPARK-15047][SQL] Cleanup SQL Parser
## What changes were proposed in this pull request?
This PR addresses a few minor issues in SQL parser:

- Removes some unused rules and keywords in the grammar.
- Removes code path for fallback SQL parsing (was needed for Hive native parsing).
- Use `UnresolvedGenerator` instead of hard-coding `Explode` & `JsonTuple`.
- Adds a more generic way of creating error messages for unsupported Hive features.
- Use `visitFunctionName` as much as possible.
- Interpret a `CatalogColumn`'s `DataType` directly instead of parsing it again.

## How was this patch tested?
Existing tests.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12826 from hvanhovell/SPARK-15047.
2016-05-02 18:12:31 -07:00
Liwei Lin 35d9c8aa69 [SPARK-14747][SQL] Add assertStreaming/assertNoneStreaming checks in DataFrameWriter
## Problem

If an end user happens to write code mixed with continuous-query-oriented methods and non-continuous-query-oriented methods:

```scala
ctx.read
   .format("text")
   .stream("...")  // continuous query
   .write
   .text("...")    // non-continuous query; should be startStream() here
```

He/she would get this somehow confusing exception:

>
Exception in thread "main" java.lang.AssertionError: assertion failed: No plan for FileSource[./continuous_query_test_input]
	at scala.Predef$.assert(Predef.scala:170)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
	at ...

## What changes were proposed in this pull request?

This PR adds checks for continuous-query-oriented methods and non-continuous-query-oriented methods in `DataFrameWriter`:

<table>
<tr>
	<td align="center"></td>
	<td align="center"><strong>can be called on continuous query?</strong></td>
	<td align="center"><strong>can be called on non-continuous query?</strong></td>
</tr>
<tr>
	<td align="center">mode</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">trigger</td>
	<td align="center">yes</td>
	<td align="center"></td>
</tr>
<tr>
	<td align="center">format</td>
	<td align="center">yes</td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">option/options</td>
	<td align="center">yes</td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">partitionBy</td>
	<td align="center">yes</td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">bucketBy</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">sortBy</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">save</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">queryName</td>
	<td align="center">yes</td>
	<td align="center"></td>
</tr>
<tr>
	<td align="center">startStream</td>
	<td align="center">yes</td>
	<td align="center"></td>
</tr>
<tr>
	<td align="center">insertInto</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">saveAsTable</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">jdbc</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">json</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">parquet</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">orc</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">text</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
<tr>
	<td align="center">csv</td>
	<td align="center"></td>
	<td align="center">yes</td>
</tr>
</table>

After this PR's change, the friendly exception would be:
>
Exception in thread "main" org.apache.spark.sql.AnalysisException: text() can only be called on non-continuous queries;
	at org.apache.spark.sql.DataFrameWriter.assertNotStreaming(DataFrameWriter.scala:678)
	at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:629)
	at ss.SSDemo$.main(SSDemo.scala:47)

## How was this patch tested?

dedicated unit tests were added

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12521 from lw-lin/dataframe-writer-check.
2016-05-02 16:48:20 -07:00
Herman van Hovell f362363d14 [SPARK-14785] [SQL] Support correlated scalar subqueries
## What changes were proposed in this pull request?
In this PR we add support for correlated scalar subqueries. An example of such a query is:
```SQL
select * from tbl1 a where a.value > (select max(value) from tbl2 b where b.key = a.key)
```
The implementation adds the `RewriteCorrelatedScalarSubquery` rule to the Optimizer. This rule plans these subqueries using `LEFT OUTER` joins. It currently supports rewrites for `Project`, `Aggregate` & `Filter` logical plans.

I could not find a well defined semantics for the use of scalar subqueries in an `Aggregate`. The current implementation currently evaluates the scalar subquery *before* aggregation. This means that you either have to make scalar subquery part of the grouping expression, or that you have to aggregate it further on. I am open to suggestions on this.

The implementation currently forces the uniqueness of a scalar subquery by enforcing that it is aggregated and that the resulting column is wrapped in an `AggregateExpression`.

## How was this patch tested?
Added tests to `SubquerySuite`.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12822 from hvanhovell/SPARK-14785.
2016-05-02 16:32:31 -07:00
poolis 917d05f43b [SPARK-12928][SQL] Oracle FLOAT datatype is not properly handled when reading via JDBC
The contribution is my original work and that I license the work to the project under the project's open source license.

Author: poolis <gmichalopoulos@gmail.com>
Author: Greg Michalopoulos <gmichalopoulos@gmail.com>

Closes #10899 from poolis/spark-12928.
2016-05-02 16:15:07 -07:00
Reynold Xin ca1b219858 [SPARK-15052][SQL] Use builder pattern to create SparkSession
## What changes were proposed in this pull request?
This patch creates a builder pattern for creating SparkSession. The new code is unused and mostly deadcode. I'm putting it up here for feedback.

There are a few TODOs that can be done as follow-up pull requests:
- [ ] Update tests to use this
- [ ] Update examples to use this
- [ ] Clean up SQLContext code w.r.t. this one (i.e. SparkSession shouldn't call into SQLContext.getOrCreate; it should be the other way around)
- [ ] Remove SparkSession.withHiveSupport
- [ ] Disable the old constructor (by making it private) so the only way to start a SparkSession is through this builder pattern

## How was this patch tested?
Part of the future pull request is to clean this up and switch existing tests to use this.

Author: Reynold Xin <rxin@databricks.com>

Closes #12830 from rxin/sparksession-builder.
2016-05-02 15:27:16 -07:00
Pete Robbins 8a1ce4899f [SPARK-13745] [SQL] Support columnar in memory representation on Big Endian platforms
## What changes were proposed in this pull request?

parquet datasource and ColumnarBatch tests fail on big-endian platforms This patch adds support for the little-endian byte arrays being correctly interpreted on a big-endian platform

## How was this patch tested?

Spark test builds ran on big endian z/Linux and regression build on little endian amd64

Author: Pete Robbins <robbinspg@gmail.com>

Closes #12397 from robbinspg/master.
2016-05-02 13:16:46 -07:00
Davies Liu 95e372141a [SPARK-14781] [SQL] support nested predicate subquery
## What changes were proposed in this pull request?

In order to support nested predicate subquery, this PR introduce an internal join type ExistenceJoin, which will emit all the rows from left, plus an additional column, which presents there are any rows matched from right or not (it's not null-aware right now). This additional column could be used to replace the subquery in Filter.

In theory, all the predicate subquery could use this join type, but it's slower than LeftSemi and LeftAnti, so it's only used for nested subquery (subquery inside OR).

For example, the following SQL:
```sql
SELECT a FROM t  WHERE EXISTS (select 0) OR EXISTS (select 1)
```

This PR also fix a bug in predicate subquery push down through join (they should not).

Nested null-aware subquery is still not supported. For example,   `a > 3 OR b NOT IN (select bb from t)`

After this, we could run TPCDS query Q10, Q35, Q45

## How was this patch tested?

Added unit tests.

Author: Davies Liu <davies@databricks.com>

Closes #12820 from davies/or_exists.
2016-05-02 12:58:59 -07:00
Shixiong Zhu a35a67a83d [SPARK-14579][SQL] Fix the race condition in StreamExecution.processAllAvailable again
## What changes were proposed in this pull request?

#12339 didn't fix the race condition. MemorySinkSuite is still flaky: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.2/814/testReport/junit/org.apache.spark.sql.streaming/MemorySinkSuite/registering_as_a_table/

Here is an execution order to reproduce it.

| Time        |Thread 1           | MicroBatchThread  |
|:-------------:|:-------------:|:-----:|
| 1 | |  `MemorySink.getOffset` |
| 2 | |  availableOffsets ++= newData (availableOffsets is not changed here)  |
| 3 | addData(newData)      |   |
| 4 | Set `noNewData` to `false` in  processAllAvailable |  |
| 5 | | `dataAvailable` returns `false`   |
| 6 | | noNewData = true |
| 7 | `noNewData` is true so just return | |
| 8 |  assert results and fail | |
| 9 |   | `dataAvailable` returns true so process the new batch |

This PR expands the scope of `awaitBatchLock.synchronized` to eliminate the above race.

## How was this patch tested?

test("stress test"). It always failed before this patch. And it will pass after applying this patch. Ignore this test in the PR as it takes several minutes to finish.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12582 from zsxwing/SPARK-14579-2.
2016-05-02 11:28:21 -07:00
Andrew Ray 9927441868 [SPARK-13749][SQL] Faster pivot implementation for many distinct values with two phase aggregation
## What changes were proposed in this pull request?

The existing implementation of pivot translates into a single aggregation with one aggregate per distinct pivot value. When the number of distinct pivot values is large (say 1000+) this can get extremely slow since each input value gets evaluated on every aggregate even though it only affects the value of one of them.

I'm proposing an alternate strategy for when there are 10+ (somewhat arbitrary threshold) distinct pivot values. We do two phases of aggregation. In the first we group by the grouping columns plus the pivot column and perform the specified aggregations (one or sometimes more). In the second aggregation we group by the grouping columns and use the new (non public) PivotFirst aggregate that rearranges the outputs of the first aggregation into an array indexed by the pivot value. Finally we do a project to extract the array entries into the appropriate output column.

## How was this patch tested?

Additional unit tests in DataFramePivotSuite and manual larger scale testing.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #11583 from aray/fast-pivot.
2016-05-02 11:12:55 -07:00
Reynold Xin 44da8d8eab [SPARK-15049] Rename NewAccumulator to AccumulatorV2
## What changes were proposed in this pull request?
NewAccumulator isn't the best name if we ever come up with v3 of the API.

## How was this patch tested?
Updated tests to reflect the change.

Author: Reynold Xin <rxin@databricks.com>

Closes #12827 from rxin/SPARK-15049.
2016-05-01 20:21:02 -07:00
hyukjinkwon a832cef112 [SPARK-13425][SQL] Documentation for CSV datasource options
## What changes were proposed in this pull request?

This PR adds the explanation and documentation for CSV options for reading and writing.

## How was this patch tested?

Style tests with `./dev/run_tests` for documentation style.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #12817 from HyukjinKwon/SPARK-13425.
2016-05-01 19:05:20 -07:00
Wenchen Fan 90787de864 [SPARK-15033][SQL] fix a flaky test in CachedTableSuite
## What changes were proposed in this pull request?

This is caused by https://github.com/apache/spark/pull/12776, which removes the `synchronized` from all methods in `AccumulatorContext`.

However, a test in `CachedTableSuite` synchronize on `AccumulatorContext` and expecting no one else can change it, which is not true anymore.

This PR update that test to not require to lock on `AccumulatorContext`.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12811 from cloud-fan/flaky.
2016-04-30 20:28:22 -07:00
Hossein 507bea5ca6 [SPARK-14143] Options for parsing NaNs, Infinity and nulls for numeric types
1. Adds the following options for parsing NaNs: nanValue

2. Adds the following options for parsing infinity: positiveInf, negativeInf.

`TypeCast.castTo` is unit tested and an end-to-end test is added to `CSVSuite`

Author: Hossein <hossein@databricks.com>

Closes #11947 from falaki/SPARK-14143.
2016-04-30 18:12:03 -07:00
Yin Huai 0182d9599d [SPARK-15034][SPARK-15035][SPARK-15036][SQL] Use spark.sql.warehouse.dir as the warehouse location
This PR contains three changes:
1. We will use spark.sql.warehouse.dir set warehouse location. We will not use hive.metastore.warehouse.dir.
2. SessionCatalog needs to set the location to default db. Otherwise, when creating a table in SparkSession without hive support, the default db's path will be an empty string.
3. When we create a database, we need to make the path qualified.

Existing tests and new tests

Author: Yin Huai <yhuai@databricks.com>

Closes #12812 from yhuai/warehouse.
2016-04-30 18:04:42 -07:00
Reynold Xin 8dc3987d09 [SPARK-15028][SQL] Remove HiveSessionState.setDefaultOverrideConfs
## What changes were proposed in this pull request?
This patch removes some code that are no longer relevant -- mainly HiveSessionState.setDefaultOverrideConfs.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #12806 from rxin/SPARK-15028.
2016-04-30 01:32:00 -07:00
hyukjinkwon 4bac703eb9 [SPARK-13667][SQL] Support for specifying custom date format for date and timestamp types at CSV datasource.
## What changes were proposed in this pull request?

This PR adds the support to specify custom date format for `DateType` and `TimestampType`.

For `TimestampType`, this uses the given format to infer schema and also to convert the values
For `DateType`, this uses the given format to convert the values.
If the `dateFormat` is not given, then it works with `DateTimeUtils.stringToTime()` for backwords compatibility.
When it's given, then it uses `SimpleDateFormat` for parsing data.

In addition, `IntegerType`, `DoubleType` and `LongType` have a higher priority than `TimestampType` in type inference. This means even if the given format is `yyyy` or `yyyy.MM`, it will be inferred as `IntegerType` or `DoubleType`. Since it is type inference, I think it is okay to give such precedences.

In addition, I renamed `csv.CSVInferSchema` to `csv.InferSchema` as JSON datasource has `json.InferSchema`. Although they have the same names, I did this because I thought the parent package name can still differentiate each.  Accordingly, the suite name was also changed from `CSVInferSchemaSuite` to `InferSchemaSuite`.

## How was this patch tested?

unit tests are used and `./dev/run_tests` for coding style tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #11550 from HyukjinKwon/SPARK-13667.
2016-04-29 22:52:21 -07:00
Yin Huai ac41fc648d [SPARK-14591][SQL] Remove DataTypeParser and add more keywords to the nonReserved list.
## What changes were proposed in this pull request?
CatalystSqlParser can parse data types. So, we do not need to have an individual DataTypeParser.

## How was this patch tested?
Existing tests

Author: Yin Huai <yhuai@databricks.com>

Closes #12796 from yhuai/removeDataTypeParser.
2016-04-29 22:49:12 -07:00
Andrew Or 66773eb8a5 [SPARK-15012][SQL] Simplify configuration API further
## What changes were proposed in this pull request?

1. Remove all the `spark.setConf` etc. Just expose `spark.conf`
2. Make `spark.conf` take in things set in the core `SparkConf` as well, otherwise users may get confused

This was done for both the Python and Scala APIs.

## How was this patch tested?
`SQLConfSuite`, python tests.

This one fixes the failed tests in #12787

Closes #12787

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12798 from yhuai/conf-api.
2016-04-29 20:46:07 -07:00
Herman van Hovell 83061be697 [SPARK-14858] [SQL] Enable subquery pushdown
The previous subquery PRs did not include support for pushing subqueries used in filters (`WHERE`/`HAVING`) down. This PR adds this support. For example :
```scala
range(0, 10).registerTempTable("a")
range(5, 15).registerTempTable("b")
range(7, 25).registerTempTable("c")
range(3, 12).registerTempTable("d")
val plan = sql("select * from a join b on a.id = b.id left join c on c.id = b.id where a.id in (select id from d)")
plan.explain(true)
```
Leads to the following Analyzed & Optimized plans:
```
== Parsed Logical Plan ==
...

== Analyzed Logical Plan ==
id: bigint, id: bigint, id: bigint
Project [id#0L,id#4L,id#8L]
+- Filter predicate-subquery#16 [(id#0L = id#12L)]
   :  +- SubqueryAlias predicate-subquery#16 [(id#0L = id#12L)]
   :     +- Project [id#12L]
   :        +- SubqueryAlias d
   :           +- Range 3, 12, 1, 8, [id#12L]
   +- Join LeftOuter, Some((id#8L = id#4L))
      :- Join Inner, Some((id#0L = id#4L))
      :  :- SubqueryAlias a
      :  :  +- Range 0, 10, 1, 8, [id#0L]
      :  +- SubqueryAlias b
      :     +- Range 5, 15, 1, 8, [id#4L]
      +- SubqueryAlias c
         +- Range 7, 25, 1, 8, [id#8L]

== Optimized Logical Plan ==
Join LeftOuter, Some((id#8L = id#4L))
:- Join Inner, Some((id#0L = id#4L))
:  :- Join LeftSemi, Some((id#0L = id#12L))
:  :  :- Range 0, 10, 1, 8, [id#0L]
:  :  +- Range 3, 12, 1, 8, [id#12L]
:  +- Range 5, 15, 1, 8, [id#4L]
+- Range 7, 25, 1, 8, [id#8L]

== Physical Plan ==
...
```
I have also taken the opportunity to move quite a bit of code around:
- Rewriting subqueris and pulling out correlated predicated from subqueries has been moved into the analyzer. The analyzer transforms `Exists` and `InSubQuery` into `PredicateSubquery` expressions. A PredicateSubquery exposes the 'join' expressions and the proper references. This makes things like type coercion, optimization and planning easier to do.
- I have added support for `Aggregate` plans in subqueries. Any correlated expressions will be added to the grouping expressions. I have removed support for `Union` plans, since pulling in an outer reference from beneath a Union has no value (a filtered value could easily be part of another Union child).
- Resolution of subqueries is now done using `OuterReference`s. These are used to wrap any outer reference; this makes the identification of these references easier, and also makes dealing with duplicate attributes in the outer and inner plans easier. The resolution of subqueries initially used a resolution loop which would alternate between calling the analyzer and trying to resolve the outer references. We now use a dedicated analyzer which uses a special rule for outer reference resolution.

These changes are a stepping stone for enabling correlated scalar subqueries, enabling all Hive tests & allowing us to use predicate subqueries anywhere.

Current tests and added test cases in FilterPushdownSuite.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12720 from hvanhovell/SPARK-14858.
2016-04-29 16:50:12 -07:00
Andrew Or d33e3d572e [SPARK-14988][PYTHON] SparkSession API follow-ups
## What changes were proposed in this pull request?

Addresses comments in #12765.

## How was this patch tested?

Python tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12784 from andrewor14/python-followup.
2016-04-29 16:41:13 -07:00
Sun Rui 4ae9fe091c [SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR.
## What changes were proposed in this pull request?

dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.

The function signature is:

	dapply(df, function(localDF) {}, schema = NULL)

R function input: local data.frame from the partition on local node
R function output: local data.frame

Schema specifies the Row format of the resulting DataFrame. It must match the R function's output.
If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such resulting DataFrame can be processed by successive calls to dapply().

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>

Closes #12493 from sun-rui/SPARK-12919.
2016-04-29 16:41:07 -07:00
Cheng Lian a04b1de5fa [SPARK-14981][SQL] Throws exception if DESC is specified for sorting columns
## What changes were proposed in this pull request?

Currently Spark SQL doesn't support sorting columns in descending order. However, the parser accepts the syntax and silently drops sorting directions. This PR fixes this by throwing an exception if `DESC` is specified as sorting direction of a sorting column.

## How was this patch tested?

A test case is added to test the invalid sorting order by checking exception message.

Author: Cheng Lian <lian@databricks.com>

Closes #12759 from liancheng/spark-14981.
2016-04-29 14:52:32 -07:00
Andrew Or a7d0fedc94 [SPARK-14988][PYTHON] SparkSession catalog and conf API
## What changes were proposed in this pull request?

The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the python API.

## How was this patch tested?

Python tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12765 from andrewor14/python-spark-session-more.
2016-04-29 09:34:10 -07:00
Reynold Xin 054f991c43 [SPARK-14994][SQL] Remove execution hive from HiveSessionState
## What changes were proposed in this pull request?
This patch removes executionHive from HiveSessionState and HiveSharedState.

## How was this patch tested?
Updated test cases.

Author: Reynold Xin <rxin@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12770 from rxin/SPARK-14994.
2016-04-29 01:14:02 -07:00
Sameer Agarwal 2057cbcb0b [SPARK-14996][SQL] Add TPCDS Benchmark Queries for SparkSQL
## What changes were proposed in this pull request?

This PR adds support for easily running and benchmarking a set of common TPCDS queries locally in SparkSQL.

## How was this patch tested?

N/A

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12771 from sameeragarwal/tpcds-2.
2016-04-29 00:52:42 -07:00
gatorsmile 222dcf7937 [SPARK-12660][SPARK-14967][SQL] Implement Except Distinct by Left Anti Join
#### What changes were proposed in this pull request?
Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
```SQL
  SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2
  ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
```
 Note:
 1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL.
 2. This rule has to be done after de-duplicating the attributes; otherwise, the enerated
    join conditions will be incorrect.

This PR also corrects the existing behavior in Spark. Before this PR, the behavior is like
```SQL
  test("except") {
    val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id")
    val df_right = Seq(1, 3).toDF("id")

    checkAnswer(
      df_left.except(df_right),
      Row(2) :: Row(2) :: Row(4) :: Nil
    )
  }
```
After this PR, the result is corrected. We strictly follow the SQL compliance of `Except Distinct`.

#### How was this patch tested?
Modified and added a few test cases to verify the optimization rule and the results of operators.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12736 from gatorsmile/exceptByAntiJoin.
2016-04-29 15:30:36 +08:00
Zheng RuiFeng 4f83e442b1 [MINOR][DOC] Minor typo fixes
## What changes were proposed in this pull request?
Minor typo fixes

## How was this patch tested?
local build

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12755 from zhengruifeng/fix_doc_dataset.
2016-04-28 22:56:26 -07:00
Reynold Xin 4607f6e7f7 [SPARK-14991][SQL] Remove HiveNativeCommand
## What changes were proposed in this pull request?
This patch removes HiveNativeCommand, so we can continue to remove the dependency on Hive. This pull request also removes the ability to generate golden result file using Hive.

## How was this patch tested?
Updated tests to reflect this.

Author: Reynold Xin <rxin@databricks.com>

Closes #12769 from rxin/SPARK-14991.
2016-04-28 21:58:48 -07:00
Wenchen Fan 6f9a18fe31 [HOTFIX][CORE] fix a concurrence issue in NewAccumulator
## What changes were proposed in this pull request?

`AccumulatorContext` is not thread-safe, that's why all of its methods are synchronized. However, there is one exception: the `AccumulatorContext.originals`. `NewAccumulator` use it to check if it's registered, which is wrong as it's not synchronized.

This PR mark `AccumulatorContext.originals` as `private` and now all access to `AccumulatorContext` is synchronized.

## How was this patch tested?

I verified it locally. To be safe, we can let jenkins test it many times to make sure this problem is gone.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12773 from cloud-fan/debug.
2016-04-28 21:57:58 -07:00
Tathagata Das 0ee5419b6c [SPARK-14970][SQL] Prevent DataSource from enumerates all files in a directory if there is user specified schema
## What changes were proposed in this pull request?
The FileCatalog object gets created even if the user specifies schema, which means files in the directory is enumerated even thought its not necessary. For large directories this is very slow. User would want to specify schema in such scenarios of large dirs, and this defeats the purpose quite a bit.

## How was this patch tested?
Hard to test this with unit test.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12748 from tdas/SPARK-14970.
2016-04-28 12:59:08 -07:00
Liang-Chi Hsieh 7c6937a885 [SPARK-14487][SQL] User Defined Type registration without SQLUserDefinedType annotation
## What changes were proposed in this pull request?

Currently we use `SQLUserDefinedType` annotation to register UDTs for user classes. However, by doing this, we add Spark dependency to user classes.

For some user classes, it is unnecessary to add such dependency that will increase deployment difficulty.

We should provide alternative approach to register UDTs for user classes without `SQLUserDefinedType` annotation.

## How was this patch tested?

`UserDefinedTypeSuite`

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12259 from viirya/improve-sql-usertype.
2016-04-28 01:14:49 -07:00
Wenchen Fan bf5496dbda [SPARK-14654][CORE] New accumulator API
## What changes were proposed in this pull request?

This PR introduces a new accumulator API  which is much simpler than before:

1. the type hierarchy is simplified, now we only have an `Accumulator` class
2. Combine `initialValue` and `zeroValue` concepts into just one concept: `zeroValue`
3. there in only one `register` method, the accumulator registration and cleanup registration are combined.
4. the `id`,`name` and `countFailedValues` are combined into an `AccumulatorMetadata`, and is provided during registration.

`SQLMetric` is a good example to show the simplicity of this new API.

What we break:

1. no `setValue` anymore. In the new API, the intermedia type can be different from the result type, it's very hard to implement a general `setValue`
2. accumulator can't be serialized before registered.

Problems need to be addressed in follow-ups:

1. with this new API, `AccumulatorInfo` doesn't make a lot of sense, the partial output is not partial updates, we need to expose the intermediate value.
2. `ExceptionFailure` should not carry the accumulator updates. Why do users care about accumulator updates for failed cases? It looks like we only use this feature to update the internal metrics, how about we sending a heartbeat to update internal metrics after the failure event?
3. the public event `SparkListenerTaskEnd` carries a `TaskMetrics`. Ideally this `TaskMetrics` don't need to carry external accumulators, as the only method of `TaskMetrics` that can access external accumulators is `private[spark]`. However, `SQLListener` use it to retrieve sql metrics.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12612 from cloud-fan/acc.
2016-04-28 00:26:39 -07:00
Davies Liu ae4e3def5e [SPARK-14961] Build HashedRelation larger than 1G
## What changes were proposed in this pull request?

Currently, LongToUnsafeRowMap use byte array as the underlying page, which can't be larger 1G.

This PR improves LongToUnsafeRowMap  to scale up to 8G bytes by using array of Long instead of array of byte.

## How was this patch tested?

Manually ran a test to confirm that both UnsafeHashedRelation and LongHashedRelation could build a map that larger than 2G.

Author: Davies Liu <davies@databricks.com>

Closes #12740 from davies/larger_broadcast.
2016-04-27 21:23:40 -07:00
Andrew Or 37575115b9 [SPARK-14940][SQL] Move ExternalCatalog to own file
## What changes were proposed in this pull request?

`interfaces.scala` was getting big. This just moves the biggest class in there to a new file for cleanliness.

## How was this patch tested?

Just moving things around.

Author: Andrew Or <andrew@databricks.com>

Closes #12721 from andrewor14/move-external-catalog.
2016-04-27 14:17:36 -07:00
Cheng Lian 24bea00047 [SPARK-14954] [SQL] Add PARTITION BY and BUCKET BY clause for data source CTAS syntax
Currently, we can only create persisted partitioned and/or bucketed data source tables using the Dataset API but not using SQL DDL. This PR implements the following syntax to add partitioning and bucketing support to the SQL DDL:

```
CREATE TABLE <table-name>
USING <provider> [OPTIONS (<key1> <value1>, <key2> <value2>, ...)]
[PARTITIONED BY (col1, col2, ...)]
[CLUSTERED BY (col1, col2, ...) [SORTED BY (col1, col2, ...)] INTO <n> BUCKETS]
AS SELECT ...
```

Test cases are added in `MetastoreDataSourcesSuite` to check the newly added syntax.

Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12734 from liancheng/spark-14954.
2016-04-27 13:55:13 -07:00
Dongjoon Hyun af92299fdb [SPARK-14664][SQL] Implement DecimalAggregates optimization for Window queries
## What changes were proposed in this pull request?

This PR aims to implement decimal aggregation optimization for window queries by improving existing `DecimalAggregates`. Historically, `DecimalAggregates` optimizer is designed to transform general `sum/avg(decimal)`, but it breaks recently added windows queries like the followings. The following queries work well without the current `DecimalAggregates` optimizer.

**Sum**
```scala
scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").head
java.lang.RuntimeException: Unsupported window function: MakeDecimal((sum(UnscaledValue(a#31)),mode=Complete,isDistinct=false),12,1)
scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [sum(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#23]
:     +- INPUT
+- Window [MakeDecimal((sum(UnscaledValue(a#21)),mode=Complete,isDistinct=false),12,1) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#23]
   +- Exchange SinglePartition, None
      +- Generate explode([1.0,2.0]), false, false, [a#21]
         +- Scan OneRowRelation[]
```

**Average**
```scala
scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").head
java.lang.RuntimeException: Unsupported window function: cast(((avg(UnscaledValue(a#40)),mode=Complete,isDistinct=false) / 10.0) as decimal(6,5))
scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [avg(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#44]
:     +- INPUT
+- Window [cast(((avg(UnscaledValue(a#42)),mode=Complete,isDistinct=false) / 10.0) as decimal(6,5)) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS avg(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#44]
   +- Exchange SinglePartition, None
      +- Generate explode([1.0,2.0]), false, false, [a#42]
         +- Scan OneRowRelation[]
```

After this PR, those queries work fine and new optimized physical plans look like the followings.

**Sum**
```scala
scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [sum(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#35]
:     +- INPUT
+- Window [MakeDecimal((sum(UnscaledValue(a#33)),mode=Complete,isDistinct=false) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),12,1) AS sum(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#35]
   +- Exchange SinglePartition, None
      +- Generate explode([1.0,2.0]), false, false, [a#33]
         +- Scan OneRowRelation[]
```

**Average**
```scala
scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [avg(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#47]
:     +- INPUT
+- Window [cast(((avg(UnscaledValue(a#45)),mode=Complete,isDistinct=false) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) / 10.0) as decimal(6,5)) AS avg(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#47]
   +- Exchange SinglePartition, None
      +- Generate explode([1.0,2.0]), false, false, [a#45]
         +- Scan OneRowRelation[]
```

In this PR, *SUM over window* pattern matching is based on the code of hvanhovell ; he should be credited for the work he did.

## How was this patch tested?

Pass the Jenkins tests (with newly added testcases)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12421 from dongjoon-hyun/SPARK-14664.
2016-04-27 21:36:19 +02:00
Liwei Lin a234cc6146 [SPARK-14874][SQL][STREAMING] Remove the obsolete Batch representation
## What changes were proposed in this pull request?

The `Batch` class, which had been used to indicate progress in a stream, was abandoned by [[SPARK-13985][SQL] Deterministic batches with ids](caea152145) and then became useless.

This patch:
- removes the `Batch` class
- ~~does some related renaming~~ (update: this has been reverted)
- fixes some related comments

## How was this patch tested?

N/A

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12638 from lw-lin/remove-batch.
2016-04-27 10:25:33 -07:00
Herman van Hovell 7dd01d9c01 [SPARK-14950][SQL] Fix BroadcastHashJoin's unique key Anti-Joins
### What changes were proposed in this pull request?
Anti-Joins using BroadcastHashJoin's unique key code path are broken; it currently returns Semi Join results . This PR fixes this bug.

### How was this patch tested?
Added tests cases to `ExistenceJoinSuite`.

cc davies gatorsmile

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12730 from hvanhovell/SPARK-14950.
2016-04-27 19:15:17 +02:00
Yin Huai 54a3eb8312 [SPARK-14130][SQL] Throw exceptions for ALTER TABLE ADD/REPLACE/CHANGE COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands
## What changes were proposed in this pull request?
This PR will make Spark SQL not allow ALTER TABLE ADD/REPLACE/CHANGE COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands.

## How was this patch tested?
Existing tests. For those tests that I put in the blacklist, I am adding the useful parts back to SQLQuerySuite.

Author: Yin Huai <yhuai@databricks.com>

Closes #12714 from yhuai/banNativeCommand.
2016-04-27 00:30:54 -07:00
Reynold Xin 8fda5a73dc [SPARK-14913][SQL] Simplify configuration API
## What changes were proposed in this pull request?
We currently expose both Hadoop configuration and Spark SQL configuration in RuntimeConfig. I think we can remove the Hadoop configuration part, and simply generate Hadoop Configuration on the fly by passing all the SQL configurations into it. This way, there is a single interface (in Java/Scala/Python/SQL) for end-users.

As part of this patch, I also removed some config options deprecated in Spark 1.x.

## How was this patch tested?
Updated relevant tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12689 from rxin/SPARK-14913.
2016-04-26 22:02:28 -07:00
Andrew Or d8a83a564f [SPARK-13477][SQL] Expose new user-facing Catalog interface
## What changes were proposed in this pull request?

#12625 exposed a new user-facing conf interface in `SparkSession`. This patch adds a catalog interface.

## How was this patch tested?

See `CatalogSuite`.

Author: Andrew Or <andrew@databricks.com>

Closes #12713 from andrewor14/user-facing-catalog.
2016-04-26 21:29:25 -07:00
Dilip Biswal d93976d866 [SPARK-14445][SQL] Support native execution of SHOW COLUMNS and SHOW PARTITIONS
## What changes were proposed in this pull request?
This PR adds Native execution of SHOW COLUMNS and SHOW PARTITION commands.

Command Syntax:
``` SQL
SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database]
```
``` SQL
SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)]
```

## How was this patch tested?

Added test cases in HiveCommandSuite to verify execution and DDLCommandSuite
to verify plans.

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #12222 from dilipbiswal/dkb_show_columns.
2016-04-27 09:28:24 +08:00
Sameer Agarwal 9797cc20c0 [SPARK-14929] [SQL] Disable vectorized map for wide schemas & high-precision decimals
## What changes were proposed in this pull request?

While the vectorized hash map in `TungstenAggregate` is currently supported for all primitive data types during partial aggregation, this patch only enables the hash map for a subset of cases that've been verified to show performance improvements on our benchmarks subject to an internal conf that sets an upper limit on the maximum length of the aggregate key/value schema. This list of supported use-cases should be expanded over time.

## How was this patch tested?

This is no new change in functionality so existing tests should suffice. Performance tests were done on TPCDS benchmarks.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12710 from sameeragarwal/vectorized-enable.
2016-04-26 14:51:14 -07:00
Davies Liu 7131b03bcf [SPARK-14853] [SQL] Support LeftSemi/LeftAnti in SortMergeJoinExec
## What changes were proposed in this pull request?

This PR update SortMergeJoinExec to support LeftSemi/LeftAnti, so it could support all the join types, same as other three join implementations: BroadcastHashJoinExec, ShuffledHashJoinExec,and BroadcastNestedLoopJoinExec.

This PR also simplify the join selection in SparkStrategy.

## How was this patch tested?

Added new tests.

Author: Davies Liu <davies@databricks.com>

Closes #12668 from davies/smj_semi.
2016-04-26 12:43:47 -07:00
Andrew Or 2a3d39f48b [MINOR] Follow-up to #12625
## What changes were proposed in this pull request?

That patch mistakenly widened the visibility from `private[x]` to `protected[x]`. This patch reverts those changes.

Author: Andrew Or <andrew@databricks.com>

Closes #12686 from andrewor14/visibility.
2016-04-26 11:08:08 -07:00
Reynold Xin 5cb03220a0 [SPARK-14912][SQL] Propagate data source options to Hadoop configuration
## What changes were proposed in this pull request?
We currently have no way for users to propagate options to the underlying library that rely in Hadoop configurations to work. For example, there are various options in parquet-mr that users might want to set, but the data source API does not expose a per-job way to set it. This patch propagates the user-specified options also into Hadoop Configuration.

## How was this patch tested?
Used a mock data source implementation to test both the read path and the write path.

Author: Reynold Xin <rxin@databricks.com>

Closes #12688 from rxin/SPARK-14912.
2016-04-26 10:58:56 -07:00
gatorsmile 162cf02efa [SPARK-14910][SQL] Native DDL Command Support for Describe Function in Non-identifier Format
#### What changes were proposed in this pull request?
The existing `Describe Function` only support the function name in `identifier`. This is different from what Hive behaves. That is why many test cases `udf_abc` in `HiveCompatibilitySuite` are not using our native DDL support. For example,
- udf_not.q
- udf_bitwise_not.q

This PR is to resolve the issues. Now, we can support the command of `Describe Function` whose function names are in the following format:
- `qualifiedName` (e.g., `db.func1`)
- `STRING` (e.g., `'func1'`)
- `comparisonOperator` (e.g,. `<`)
- `arithmeticOperator` (e.g., `+`)
- `predicateOperator` (e.g., `or`)

Note, before this PR, we only have a native command support when the function name is in the format of `qualifiedName`.
#### How was this patch tested?
Added test cases in `DDLSuite.scala`. Also manually verified all the related test cases in `HiveCompatibilitySuite` passed.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12679 from gatorsmile/descFunction.
2016-04-26 19:29:34 +02:00
Jacek Laskowski b208229ba1 [MINOR][DOCS] Minor typo fixes
## What changes were proposed in this pull request?

Minor typo fixes (too minor to deserve separate a JIRA)

## How was this patch tested?

local build

Author: Jacek Laskowski <jacek@japila.pl>

Closes #12469 from jaceklaskowski/minor-typo-fixes.
2016-04-26 11:51:12 +01:00
Azeem Jiva de6e633420 [SPARK-14756][CORE] Use parseLong instead of valueOf
## What changes were proposed in this pull request?

Use Long.parseLong which returns a primative.
Use a series of appends() reduces the creation of an extra StringBuilder type

## How was this patch tested?

Unit tests

Author: Azeem Jiva <azeemj@gmail.com>

Closes #12520 from javawithjiva/minor.
2016-04-26 11:49:04 +01:00
Andrew Or 18c2c92580 [SPARK-14861][SQL] Replace internal usages of SQLContext with SparkSession
## What changes were proposed in this pull request?

In Spark 2.0, `SparkSession` is the new thing. Internally we should stop using `SQLContext` everywhere since that's supposed to be not the main user-facing API anymore.

In this patch I took care to not break any public APIs. The one place that's suspect is `o.a.s.ml.source.libsvm.DefaultSource`, but according to mengxr it's not supposed to be public so it's OK to change the underlying `FileFormat` trait.

**Reviewers**: This is a big patch that may be difficult to review but the changes are actually really straightforward. If you prefer I can break it up into a few smaller patches, but it will delay the progress of this issue a little.

## How was this patch tested?

No change in functionality intended.

Author: Andrew Or <andrew@databricks.com>

Closes #12625 from andrewor14/spark-session-refactor.
2016-04-25 20:54:31 -07:00
Sameer Agarwal c71c6853fc [SPARK-14870][SQL][FOLLOW-UP] Move decimalDataWithNulls in DataFrameAggregateSuite
## What changes were proposed in this pull request?

Minor followup to https://github.com/apache/spark/pull/12651

## How was this patch tested?

Test-only change

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12674 from sameeragarwal/tpcds-fix-2.
2016-04-25 18:22:06 -07:00
Andrew Or cfa64882fc [SPARK-14902][SQL] Expose RuntimeConfig in SparkSession
## What changes were proposed in this pull request?

`RuntimeConfig` is the new user-facing API in 2.0 added in #11378. Until now, however, it's been dead code. This patch uses `RuntimeConfig` in `SessionState` and exposes that through the `SparkSession`.

## How was this patch tested?

New test in `SQLContextSuite`.

Author: Andrew Or <andrew@databricks.com>

Closes #12669 from andrewor14/use-runtime-conf.
2016-04-25 17:52:25 -07:00
Reynold Xin f36c9c8379 [SPARK-14888][SQL] UnresolvedFunction should use FunctionIdentifier
## What changes were proposed in this pull request?
This patch changes UnresolvedFunction and UnresolvedGenerator to use a FunctionIdentifier rather than just a String for function name. Also changed SessionCatalog to accept FunctionIdentifier in lookupFunction.

## How was this patch tested?
Updated related unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12659 from rxin/SPARK-14888.
2016-04-25 16:20:57 -07:00
Andrew Or 34336b6250 [SPARK-14828][SQL] Start SparkSession in REPL instead of SQLContext
## What changes were proposed in this pull request?

```
Spark context available as 'sc' (master = local[*], app id = local-1461283768192).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("SHOW TABLES").collect()
16/04/21 17:09:39 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/21 17:09:39 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
res0: Array[org.apache.spark.sql.Row] = Array([src,false])

scala> sql("SHOW TABLES").collect()
res1: Array[org.apache.spark.sql.Row] = Array([src,false])

scala> spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3)))
res2: org.apache.spark.sql.DataFrame = [_1: int, _2: int]
```

Hive things are loaded lazily.

## How was this patch tested?

Manual.

Author: Andrew Or <andrew@databricks.com>

Closes #12589 from andrewor14/spark-session-repl.
2016-04-25 15:30:18 -07:00
gatorsmile 0c47e274ab [SPARK-13739][SQL] Push Predicate Through Window
#### What changes were proposed in this pull request?

For performance, predicates can be pushed through Window if and only if the following conditions are satisfied:
 1. All the expressions are part of window partitioning key. The expressions can be compound.
 2. Deterministic

#### How was this patch tested?

TODO:
- [X]  DSL needs to be modified for window
- [X] more tests will be added.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #11635 from gatorsmile/pushPredicateThroughWindow.
2016-04-25 22:32:34 +02:00
Andrew Or 3c5e65c339 [SPARK-14721][SQL] Remove HiveContext (part 2)
## What changes were proposed in this pull request?

This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class.

Note: A couple of things will break after this patch. These will be fixed separately.
- the python HiveContext
- all the documentation / comments referencing HiveContext
- there will be no more HiveContext in the REPL (fixed by #12589)

## How was this patch tested?

No change in functionality.

Author: Andrew Or <andrew@databricks.com>

Closes #12585 from andrewor14/delete-hive-context.
2016-04-25 13:23:05 -07:00
Cheng Lian e66afd5c66 [SPARK-14875][SQL] Makes OutputWriterFactory.newInstance public
## What changes were proposed in this pull request?

This method was accidentally made `private[sql]` in Spark 2.0. This PR makes it public again, since 3rd party data sources like spark-avro depend on it.

## How was this patch tested?

N/A

Author: Cheng Lian <lian@databricks.com>

Closes #12652 from liancheng/spark-14875.
2016-04-25 20:42:49 +08:00
Sameer Agarwal cbdcd4edab [SPARK-14870] [SQL] Fix NPE in TPCDS q14a
## What changes were proposed in this pull request?

This PR fixes a bug in `TungstenAggregate` that manifests while aggregating by keys over nullable `BigDecimal` columns. This causes a null pointer exception while executing TPCDS q14a.

## How was this patch tested?

1. Added regression test in `DataFrameAggregateSuite`.
2. Verified that TPCDS q14a works

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12651 from sameeragarwal/tpcds-fix.
2016-04-24 22:52:50 -07:00
Yin Huai 35319d3264 [SPARK-14885][SQL] When creating a CatalogColumn, we should use the catalogString of a DataType object.
## What changes were proposed in this pull request?

Right now, the data type field of a CatalogColumn is using the string representation. When we create this string from a DataType object, there are places where we use simpleString instead of catalogString. Although catalogString is the same as simpleString right now, it is still good to use catalogString. So, we will not silently introduce issues when we change the semantic of simpleString or the implementation of catalogString.

## How was this patch tested?
Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #12654 from yhuai/useCatalogString.
2016-04-24 20:48:01 -07:00
Dongjoon Hyun d34d650378 [SPARK-14868][BUILD] Enable NewLineAtEofChecker in checkstyle and fix lint-java errors
## What changes were proposed in this pull request?

Spark uses `NewLineAtEofChecker` rule in Scala by ScalaStyle. And, most Java code also comply with the rule. This PR aims to enforce the same rule `NewlineAtEndOfFile` by CheckStyle explicitly. Also, this fixes lint-java errors since SPARK-14465. The followings are the items.

- Adds a new line at the end of the files (19 files)
- Fixes 25 lint-java errors (12 RedundantModifier, 6 **ArrayTypeStyle**, 2 LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder)

## How was this patch tested?

After the Jenkins test succeeds, `dev/lint-java` should pass. (Currently, Jenkins dose not run lint-java.)
```bash
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12632 from dongjoon-hyun/SPARK-14868.
2016-04-24 20:40:03 -07:00
Reynold Xin d0ca5797a8 [SPARK-14876][SQL] SparkSession should be case insensitive by default
## What changes were proposed in this pull request?
This patch changes SparkSession to be case insensitive by default, in order to match other database systems.

## How was this patch tested?
N/A - I'm sure some tests will fail and I will need to fix those.

Author: Reynold Xin <rxin@databricks.com>

Closes #12643 from rxin/SPARK-14876.
2016-04-24 19:38:21 -07:00
jliwork f0f1a8afde [SPARK-14548][SQL] Support not greater than and not less than operator in Spark SQL
!< means not less than which is equivalent to >=
!> means not greater than which is equivalent to <=

I'd to create a PR to support these two operators.

I've added new test cases in: DataFrameSuite, ExpressionParserSuite, JDBCSuite, PlanParserSuite, SQLQuerySuite

dilipbiswal viirya gatorsmile

Author: jliwork <jiali@us.ibm.com>

Closes #12316 from jliwork/SPARK-14548.
2016-04-24 11:22:06 -07:00
gatorsmile 337289d712 [SPARK-14691][SQL] Simplify and Unify Error Generation for Unsupported Alter Table DDL
#### What changes were proposed in this pull request?
So far, we are capturing each unsupported Alter Table in separate visit functions. They should be unified and issue the same ParseException instead.

This PR is to refactor the existing implementation and make error message consistent for Alter Table DDL.

#### How was this patch tested?
Updated the existing test cases and also added new test cases to ensure all the unsupported statements are covered.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #12459 from gatorsmile/cleanAlterTable.
2016-04-24 18:53:27 +02:00
Yin Huai 1672149c26 [SPARK-14879][SQL] Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to sql/core
## What changes were proposed in this pull request?

CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect are not Hive-specific. So, this PR moves them from sql/hive to sql/core. Also, I am adding `Command` suffix to these two classes.

## How was this patch tested?
Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #12645 from yhuai/moveCreateDataSource.
2016-04-23 22:29:31 -07:00
Tathagata Das 2853859655 [SPARK-14833][SQL][STREAMING][TEST] Refactor StreamTests to test for source fault-tolerance correctly.
## What changes were proposed in this pull request?

Current StreamTest allows testing of a streaming Dataset generated explicitly wraps a source. This is different from the actual production code path where the source object is dynamically created through a DataSource object every time a query is started. So all the fault-tolerance testing in FileSourceSuite and FileSourceStressSuite is not really testing the actual code path as they are just reusing the FileStreamSource object.

This PR fixes StreamTest and the FileSource***Suite to test this correctly. Instead of maintaining a mapping of source --> expected offset in StreamTest (which requires reuse of source object), it now maintains a mapping of source index --> offset, so that it is independent of the source object.

Summary of changes
- StreamTest refactored to keep track of offset by source index instead of source
- AddData, AddTextData and AddParquetData updated to find the FileStreamSource object from an active query, so that it can work with sources generated when query is started.
- Refactored unit tests in FileSource***Suite to test using DataFrame/Dataset generated with public, rather than reusing the same FileStreamSource. This correctly tests fault tolerance.

The refactoring changed a lot of indents in FileSourceSuite, so its recommended to hide whitespace changes with this - https://github.com/apache/spark/pull/12592/files?w=1

## How was this patch tested?

Refactored unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12592 from tdas/SPARK-14833.
2016-04-23 21:53:05 -07:00
Liang-Chi Hsieh ba5e0b87a0 [SPARK-14838] [SQL] Set default size for ObjecType to avoid failure when estimating sizeInBytes in ObjectProducer
## What changes were proposed in this pull request?

We have logical plans that produce domain objects which are `ObjectType`. As we can't estimate the size of `ObjectType`, we throw an `UnsupportedOperationException` if trying to do that. We should set a default size for `ObjectType` to avoid this failure.

## How was this patch tested?

`DatasetSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12599 from viirya/skip-broadcast-objectproducer.
2016-04-23 21:15:31 -07:00
tedyu b45819ac43 [SPARK-14856] Correct message in assertion for 'returning batch for wide table'
## What changes were proposed in this pull request?

There was a typo in the message for second assertion in "returning batch for wide table" test

## How was this patch tested?

Existing tests.

Author: tedyu <yuzhihong@gmail.com>

Closes #12639 from tedyu/master.
2016-04-23 16:42:37 -07:00
Reynold Xin e3c1366bbc [SPARK-14865][SQL] Better error handling for view creation.
## What changes were proposed in this pull request?
This patch improves error handling in view creation. CreateViewCommand itself will analyze the view SQL query first, and if it cannot successfully analyze it, throw an AnalysisException.

In addition, I also added the following two conservative guards for easier identification of Spark bugs:

1. If there is a bug and the generated view SQL cannot be analyzed, throw an exception at runtime. Note that this is not an AnalysisException because it is not caused by the user and more likely indicate a bug in Spark.
2. SQLBuilder when it gets an unresolved plan, it will also show the plan in the error message.

I also took the chance to simplify the internal implementation of CreateViewCommand, and *removed* a fallback path that would've masked an exception from before.

## How was this patch tested?
1. Added a unit test for the user facing error handling.
2. Manually introduced some bugs in Spark to test the internal defensive error handling.
3. Also added a test case to test nested views (not super relevant).

Author: Reynold Xin <rxin@databricks.com>

Closes #12633 from rxin/SPARK-14865.
2016-04-23 13:19:57 -07:00
Reynold Xin 890abd1279 [SPARK-14869][SQL] Don't mask exceptions in ResolveRelations
## What changes were proposed in this pull request?
In order to support running SQL directly on files, we added some code in ResolveRelations to catch the exception thrown by catalog.lookupRelation and ignore it. This unfortunately masks all the exceptions. This patch changes the logic to simply test the table's existence.

## How was this patch tested?
I manually hacked some bugs into Spark and made sure the exceptions were being propagated up.

Author: Reynold Xin <rxin@databricks.com>

Closes #12634 from rxin/SPARK-14869.
2016-04-23 12:49:36 -07:00
Reynold Xin 5c8a0ec99b [SPARK-14872][SQL] Restructure command package
## What changes were proposed in this pull request?
This patch restructures sql.execution.command package to break the commands into multiple files, in some logical organization: databases, tables, views, functions.

I also renamed basicOperators.scala to basicLogicalOperators.scala and basicPhysicalOperators.scala.

## How was this patch tested?
N/A - all I did was moving code around.

Author: Reynold Xin <rxin@databricks.com>

Closes #12636 from rxin/SPARK-14872.
2016-04-23 12:44:00 -07:00
Davies Liu ee6b209a9d [HOTFIX] disable generated aggregate map 2016-04-23 11:41:42 -07:00
Zheng RuiFeng 86ca8fefc8 [MINOR][ML][MLLIB] Remove unused imports
## What changes were proposed in this pull request?
del unused imports in ML/MLLIB

## How was this patch tested?
unit tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12497 from zhengruifeng/del_unused_imports.
2016-04-22 23:20:10 -07:00
Davies Liu 39a77e1567 [SPARK-14856] [SQL] returning batch correctly
## What changes were proposed in this pull request?

Currently, the Parquet reader decide whether to return batch based on required schema or full schema, it's not consistent, this PR fix that.

## How was this patch tested?

Added regression tests.

Author: Davies Liu <davies@databricks.com>

Closes #12619 from davies/fix_return_batch.
2016-04-22 22:32:32 -07:00
Reynold Xin c06110187b [SPARK-14842][SQL] Implement view creation in sql/core
## What changes were proposed in this pull request?
This patch re-implements view creation command in sql/core, based on the pre-existing view creation command in the Hive module. This consolidates the view creation logical command and physical command into a single one, called CreateViewCommand.

## How was this patch tested?
All the code should've been tested by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12615 from rxin/SPARK-14842-2.
2016-04-22 20:30:51 -07:00
Reynold Xin d7d0cad0ad [SPARK-14855][SQL] Add "Exec" suffix to physical operators
## What changes were proposed in this pull request?
This patch adds "Exec" suffix to all physical operators. Before this patch, Spark's physical operators and logical operators are named the same (e.g. Project could be logical.Project or execution.Project), which caused small issues in code review and bigger issues in code refactoring.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #12617 from rxin/exec-node.
2016-04-22 17:43:56 -07:00
Tathagata Das c431a76d06 [SPARK-14832][SQL][STREAMING] Refactor DataSource to ensure schema is inferred only once when creating a file stream
## What changes were proposed in this pull request?

When creating a file stream using sqlContext.write.stream(), existing files are scanned twice for finding the schema
- Once, when creating a DataSource + StreamingRelation in the DataFrameReader.stream()
- Again, when creating streaming Source from the DataSource, in DataSource.createSource()

Instead, the schema should be generated only once, at the time of creating the dataframe, and when the streaming source is created, it should just reuse that schema

The solution proposed in this PR is to add a lazy field in DataSource that caches the schema. Then streaming Source created by the DataSource can just reuse the schema.

## How was this patch tested?
Refactored unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12591 from tdas/SPARK-14832.
2016-04-22 17:17:37 -07:00
Davies Liu c25b97fcce [SPARK-14582][SQL] increase parallelism for small tables
## What changes were proposed in this pull request?

This PR try to increase the parallelism for small table (a few of big files) to reduce the query time, by decrease the maxSplitBytes, the goal is to have at least one task per CPU in the cluster, if the total size of all files is bigger than openCostInBytes * 2 * nCPU.

For example, a small/medium table could be used as dimension table in huge query, this will be useful to reduce the time waiting for broadcast.

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #12344 from davies/more_partition.
2016-04-22 17:09:16 -07:00
Dongjoon Hyun 3647120a5a [SPARK-14796][SQL] Add spark.sql.optimizer.inSetConversionThreshold config option.
## What changes were proposed in this pull request?

Currently, `OptimizeIn` optimizer replaces `In` expression into `InSet` expression if the size of set is greater than a constant, 10.
This issue aims to make a configuration `spark.sql.optimizer.inSetConversionThreshold` for that.

After this PR, `OptimizerIn` is configurable.
```scala
scala> sql("select a in (1,2,3) from (select explode(array(1,2)) a) T").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [a#7 IN (1,2,3) AS (a IN (1, 2, 3))#8]
:     +- INPUT
+- Generate explode([1,2]), false, false, [a#7]
   +- Scan OneRowRelation[]

scala> sqlContext.setConf("spark.sql.optimizer.inSetConversionThreshold", "2")

scala> sql("select a in (1,2,3) from (select explode(array(1,2)) a) T").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [a#16 INSET (1,2,3) AS (a IN (1, 2, 3))#17]
:     +- INPUT
+- Generate explode([1,2]), false, false, [a#16]
   +- Scan OneRowRelation[]
```

## How was this patch tested?

Pass the Jenkins tests (with a new testcase)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12562 from dongjoon-hyun/SPARK-14796.
2016-04-22 14:14:47 -07:00
Davies Liu 0dcf9dbebb [SPARK-14669] [SQL] Fix some SQL metrics in codegen and added more
## What changes were proposed in this pull request?

1. Fix the "spill size" of TungstenAggregate and Sort
2. Rename "data size" to "peak memory" to match the actual meaning (also consistent with task metrics)
3. Added "data size" for ShuffleExchange and BroadcastExchange
4. Added some timing for Sort, Aggregate and BroadcastExchange (this requires another patch to work)

## How was this patch tested?

Existing tests.
![metrics](https://cloud.githubusercontent.com/assets/40902/14573908/21ad2f00-030d-11e6-9e2c-c544f30039ea.png)

Author: Davies Liu <davies@databricks.com>

Closes #12425 from davies/fix_metrics.
2016-04-22 12:59:32 -07:00
Davies Liu 0419d63169 [SPARK-14791] [SQL] fix risk condition between broadcast and subquery
## What changes were proposed in this pull request?

SparkPlan.prepare() could be called in different threads (BroadcastExchange will call it in a thread pool), it only make sure that doPrepare() will only be called once, the second call to prepare() may return earlier before all the children had finished prepare(). Then some operator may call doProduce() before prepareSubqueries(), `null` will be used as the result of subquery, which is wrong. This cause TPCDS Q23B returns wrong answer sometimes.

This PR added synchronization for prepare(), make sure all the children had finished prepare() before return. Also call prepare() in produce() (similar to execute()).

Added checking for ScalarSubquery to make sure that the subquery has finished before using the result.

## How was this patch tested?

Manually tested with Q23B, no wrong answer anymore.

Author: Davies Liu <davies@databricks.com>

Closes #12600 from davies/fix_risk.
2016-04-22 12:29:53 -07:00
Davies Liu c417cec067 [SPARK-14763][SQL] fix subquery resolution
## What changes were proposed in this pull request?

Currently, a column could be resolved wrongly if there are columns from both outer table and subquery have the same name, we should only resolve the attributes that can't be resolved within subquery. They may have same exprId than other attributes in subquery, so we should create alias for them.

Also, the column in IN subquery could have same exprId, we should create alias for them.

## How was this patch tested?

Added regression tests. Manually tests TPCDS Q70 and Q95, work well after this patch.

Author: Davies Liu <davies@databricks.com>

Closes #12539 from davies/fix_subquery.
2016-04-22 20:55:41 +02:00
Reynold Xin aeb52bea56 [SPARK-14841][SQL] Move SQLBuilder into sql/core
## What changes were proposed in this pull request?
This patch moves SQLBuilder into sql/core so we can in the future move view generation also into sql/core.

## How was this patch tested?
Also moved unit tests.

Author: Reynold Xin <rxin@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>

Closes #12602 from rxin/SPARK-14841.
2016-04-22 11:10:31 -07:00
Liang-Chi Hsieh 056883e070 [SPARK-13266] [SQL] None read/writer options were not transalated to "null"
## What changes were proposed in this pull request?

In Python, the `option` and `options` method of `DataFrameReader` and `DataFrameWriter` were sending the string "None" instead of `null` when passed `None`, therefore making it impossible to send an actual `null`. This fixes that problem.

This is based on #11305 from mathieulongtin.

## How was this patch tested?

Added test to readwriter.py.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: mathieu longtin <mathieu.longtin@nuance.com>

Closes #12494 from viirya/py-df-none-option.
2016-04-22 09:19:36 -07:00
Pete Robbins 5bed13a872 [SPARK-14848][SQL] Compare as Set in DatasetSuite - Java encoder
## What changes were proposed in this pull request?
Change test to compare sets rather than sequence

## How was this patch tested?
Full test runs on little endian and big endian platforms

Author: Pete Robbins <robbinspg@gmail.com>

Closes #12610 from robbinspg/DatasetSuiteFix.
2016-04-22 23:07:12 +08:00
Joan bf95b8da27 [SPARK-6429] Implement hashCode and equals together
## What changes were proposed in this pull request?

Implement some `hashCode` and `equals` together in order to enable the scalastyle.
This is a first batch, I will continue to implement them but I wanted to know your thoughts.

Author: Joan <joan@goyeau.com>

Closes #12157 from joan38/SPARK-6429-HashCode-Equals.
2016-04-22 12:24:12 +01:00
Liang-Chi Hsieh e09ab5da8b [SPARK-14609][SQL] Native support for LOAD DATA DDL command
## What changes were proposed in this pull request?

Add the native support for LOAD DATA DDL command that loads data into Hive table/partition.

## How was this patch tested?

`HiveDDLCommandSuite` and `HiveQuerySuite`. Besides, few Hive tests (`WindowQuerySuite`, `HiveTableScanSuite` and `HiveSerDeSuite`) also use `LOAD DATA` command.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12412 from viirya/ddl-load-data.
2016-04-22 18:26:28 +08:00
Reynold Xin 284b15d2fb [SPARK-14826][SQL] Remove HiveQueryExecution
## What changes were proposed in this pull request?
This patch removes HiveQueryExecution. As part of this, I consolidated all the describe commands into DescribeTableCommand.

## How was this patch tested?
Should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12588 from rxin/SPARK-14826.
2016-04-22 01:31:13 -07:00
Cheng Lian 145433f1aa [SPARK-14369] [SQL] Locality support for FileScanRDD
(This PR is a rebased version of PR #12153.)

## What changes were proposed in this pull request?

This PR adds preliminary locality support for `FileFormat` data sources by overriding `FileScanRDD.preferredLocations()`. The strategy can be divided into two parts:

1.  Block location lookup

    Unlike `HadoopRDD` or `NewHadoopRDD`, `FileScanRDD` doesn't have access to the underlying `InputFormat` or `InputSplit`, and thus can't rely on `InputSplit.getLocations()` to gather locality information. Instead, this PR queries block locations using `FileSystem.getBlockLocations()` after listing all `FileStatus`es in `HDFSFileCatalog` and convert all `FileStatus`es into `LocatedFileStatus`es.

    Note that although S3/S3A/S3N file systems don't provide valid locality information, their `getLocatedStatus()` implementations don't actually issue remote calls either. So there's no need to special case these file systems.

2.  Selecting preferred locations

    For each `FilePartition`, we pick up top 3 locations that containing the most data to be retrieved. This isn't necessarily the best algorithm out there. Further improvements may be brought up in follow-up PRs.

## How was this patch tested?

Tested by overriding default `FileSystem` implementation for `file:///` with a mocked one, which returns mocked block locations.

Author: Cheng Lian <lian@databricks.com>

Closes #12527 from liancheng/spark-14369-locality-rebased.
2016-04-21 21:48:09 -07:00
Sameer Agarwal b29bc3f515 [SPARK-14680] [SQL] Support all datatypes to use VectorizedHashmap in TungstenAggregate
## What changes were proposed in this pull request?

This PR adds support for all primitive datatypes, decimal types and stringtypes in the VectorizedHashmap during aggregation.

## How was this patch tested?

Existing tests for group-by aggregates should already test for all these datatypes. Additionally, manually inspected the generated code for all supported datatypes (details below).

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12440 from sameeragarwal/all-datatypes.
2016-04-21 21:31:01 -07:00
Reynold Xin f181aee07c [SPARK-14821][SQL] Implement AnalyzeTable in sql/core and remove HiveSqlAstBuilder
## What changes were proposed in this pull request?
This patch moves analyze table parsing into SparkSqlAstBuilder and removes HiveSqlAstBuilder.

In order to avoid extensive refactoring, I created a common trait for CatalogRelation and MetastoreRelation, and match on that. In the future we should probably just consolidate the two into a single thing so we don't need this common trait.

## How was this patch tested?
Updated unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12584 from rxin/SPARK-14821.
2016-04-21 17:41:29 -07:00
Eric Liang e2b5647ab9 [SPARK-14724] Use radix sort for shuffles and sort operator when possible
## What changes were proposed in this pull request?

Spark currently uses TimSort for all in-memory sorts, including sorts done for shuffle. One low-hanging fruit is to use radix sort when possible (e.g. sorting by integer keys). This PR adds a radix sort implementation to the unsafe sort package and switches shuffles and sorts to use it when possible.

The current implementation does not have special support for null values, so we cannot radix-sort `LongType`. I will address this in a follow-up PR.

## How was this patch tested?

Unit tests, enabling radix sort on existing tests. Microbenchmark results:

```
Running benchmark: radix sort 25000000
Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Linux 3.13.0-44-generic
Intel(R) Core(TM) i7-4600U CPU  2.10GHz

radix sort 25000000:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
reference TimSort key prefix array     15546 / 15859          1.6         621.9       1.0X
reference Arrays.sort                    2416 / 2446         10.3          96.6       6.4X
radix sort one byte                       133 /  137        188.4           5.3     117.2X
radix sort two bytes                      255 /  258         98.2          10.2      61.1X
radix sort eight bytes                    991 /  997         25.2          39.6      15.7X
radix sort key prefix array              1540 / 1563         16.2          61.6      10.1X
```

I also ran a mix of the supported TPCDS queries and compared TimSort vs RadixSort metrics. The overall benchmark ran ~10% faster with radix sort on. In the breakdown below, the radix-enabled sort phases averaged about 20x faster than TimSort, however sorting is only a small fraction of the overall runtime. About half of the TPCDS queries were able to take advantage of radix sort.

```
TPCDS on master: 2499s real time, 8185s executor
    - 1171s in TimSort, avg 267 MB/s
(note the /s accounting is weird here since dataSize counts the record sizes too)

TPCDS with radix enabled: 2294s real time, 7391s executor
    - 596s in TimSort, avg 254 MB/s
    - 26s in radix sort, avg 4.2 GB/s
```

cc davies rxin

Author: Eric Liang <ekl@databricks.com>

Closes #12490 from ericl/sort-benchmark.
2016-04-21 16:48:51 -07:00
Sameer Agarwal f82aa82480 [SPARK-14774][SQL] Write unscaled values in ColumnVector.putDecimal
## What changes were proposed in this pull request?

We recently made `ColumnarBatch.row` mutable and added a new `ColumnVector.putDecimal` method to support putting `Decimal` values in the `ColumnarBatch`. This unfortunately introduced a bug wherein we were not updating the vector with the proper unscaled values.

## How was this patch tested?

This codepath is hit only when the vectorized aggregate hashmap is enabled. https://github.com/apache/spark/pull/12440 makes sure that a number of regression tests/benchmarks test this bugfix.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12541 from sameeragarwal/fix-bigdecimal.
2016-04-21 16:00:59 -07:00
Reynold Xin 1a95397bb6 [SPARK-14798][SQL] Move native command and script transformation parsing into SparkSqlAstBuilder
## What changes were proposed in this pull request?
This patch moves native command and script transformation into SparkSqlAstBuilder. This builds on #12561. See the last commit for diff.

## How was this patch tested?
Updated test cases to reflect this.

Author: Reynold Xin <rxin@databricks.com>

Closes #12564 from rxin/SPARK-14798.
2016-04-21 15:59:37 -07:00
Andrew Or ef6be7bedd [MINOR] Comment whitespace changes in #12553 2016-04-21 14:52:42 -07:00
Andrew Or a2e8d4fddd [SPARK-13643][SQL] Implement SparkSession
## What changes were proposed in this pull request?

After removing most of `HiveContext` in 8fc267ab33 we can now move existing functionality in `SQLContext` to `SparkSession`. As of this PR `SQLContext` becomes a simple wrapper that has a `SparkSession` and delegates all functionality to it.

## How was this patch tested?

Jenkins.

Author: Andrew Or <andrew@databricks.com>

Closes #12553 from andrewor14/implement-spark-session.
2016-04-21 14:18:18 -07:00
Liang-Chi Hsieh 4ac6e75cd6 [HOTFIX] Remove wrong DDL tests
## What changes were proposed in this pull request?

As we moved most parsing rules to `SparkSqlParser`, some tests expected to throw exception are not correct anymore.

## How was this patch tested?
`DDLCommandSuite`

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12572 from viirya/hotfix-ddl.
2016-04-21 13:18:39 +02:00
Wenchen Fan cb51680d22 [SPARK-14753][CORE] remove internal flag in Accumulable
## What changes were proposed in this pull request?

the `Accumulable.internal` flag is only used to avoid registering internal accumulators for 2 certain cases:

1. `TaskMetrics.createTempShuffleReadMetrics`: the accumulators in the temp shuffle read metrics should not be registered.
2. `TaskMetrics.fromAccumulatorUpdates`: the created task metrics is only used to post event, accumulators inside it should not be registered.

For 1, we can create a `TempShuffleReadMetrics` that don't create accumulators, just keep the data and merge it at last.
For 2, we can un-register these accumulators immediately.

TODO: remove `internal` flag in `AccumulableInfo` with followup PR

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12525 from cloud-fan/acc.
2016-04-21 01:06:22 -07:00
Reynold Xin 77d847ddb2 [SPARK-14792][SQL] Move as many parsing rules as possible into SQL parser
## What changes were proposed in this pull request?
This patch moves as many parsing rules as possible into SQL parser. There are only three more left after this patch: (1) run native command, (2) analyze, and (3) script IO. These 3 will be dealt with in a follow-up PR.

## How was this patch tested?
No test change. This simply moves code around.

Author: Reynold Xin <rxin@databricks.com>

Closes #12556 from rxin/SPARK-14792.
2016-04-21 00:24:24 -07:00
Reynold Xin 8045814114 [SPARK-14782][SPARK-14778][SQL] Remove HiveConf dependency from HiveSqlAstBuilder
## What changes were proposed in this pull request?
The patch removes HiveConf dependency from HiveSqlAstBuilder. This is required in order to merge HiveSqlParser and SparkSqlAstBuilder, which would require getting rid of the Hive specific dependencies in HiveSqlParser.

This patch also accomplishes [SPARK-14778] Remove HiveSessionState.substitutor.

## How was this patch tested?
This should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12550 from rxin/SPARK-14782.
2016-04-20 21:20:51 -07:00
Reynold Xin 334c293ec0 [SPARK-14769][SQL] Create built-in functionality for variable substitution
## What changes were proposed in this pull request?
In order to fully merge the Hive parser and the SQL parser, we'd need to support variable substitution in Spark. The implementation of the substitute algorithm is mostly copied from Hive, but I simplified the overall structure quite a bit and added more comprehensive test coverage.

Note that this pull request does not yet use this functionality anywhere.

## How was this patch tested?
Added VariableSubstitutionSuite for unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12538 from rxin/SPARK-14769.
2016-04-20 16:32:38 -07:00
Subhobrata Dey fd82681945 [SPARK-14749][SQL, TESTS] PlannerSuite failed when it run individually
## What changes were proposed in this pull request?

3 testcases namely,

```
"count is partially aggregated"
"count distinct is partially aggregated"
"mixed aggregates are partially aggregated"
```

were failing when running PlannerSuite individually.
The PR provides a fix for this.

## How was this patch tested?

unit tests

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Subhobrata Dey <sbcd90@gmail.com>

Closes #12532 from sbcd90/plannersuitetestsfix.
2016-04-20 14:26:07 -07:00
Shixiong Zhu 7bc948557b [SPARK-14678][SQL] Add a file sink log to support versioning and compaction
## What changes were proposed in this pull request?

This PR adds a special log for FileStreamSink for two purposes:

- Versioning. A future Spark version should be able to read the metadata of an old FileStreamSink.
- Compaction. As reading from many small files is usually pretty slow, we should compact small metadata files into big files.

FileStreamSinkLog has a new log format instead of Java serialization format. It will write one log file for each batch. The first line of the log file is the version number, and there are multiple JSON lines following. Each JSON line is a JSON format of FileLog.

FileStreamSinkLog will compact log files every "spark.sql.sink.file.log.compactLen" batches into a big file. When doing a compact, it will read all history logs and merge them with the new batch. During the compaction, it will also delete the files that are deleted (marked by FileLog.action). When the reader uses allLogs to list all files, this method only returns the visible files (drops the deleted files).

## How was this patch tested?

FileStreamSinkLogSuite

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12435 from zsxwing/sink-log.
2016-04-20 13:33:04 -07:00
Andrew Or 8fc267ab33 [SPARK-14720][SPARK-13643] Move Hive-specific methods into HiveSessionState and Create a SparkSession class
## What changes were proposed in this pull request?
This PR has two main changes.
1. Move Hive-specific methods from HiveContext to HiveSessionState, which help the work of removing HiveContext.
2. Create a SparkSession Class, which will later be the entry point of Spark SQL users.

## How was this patch tested?
Existing tests

This PR is trying to fix test failures of https://github.com/apache/spark/pull/12485.

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12522 from yhuai/spark-session.
2016-04-20 12:58:48 -07:00
Tathagata Das cb8ea9e1f3 [SPARK-14741][SQL] Fixed error in reading json file stream inside a partitioned directory
## What changes were proposed in this pull request?

Consider the following directory structure
dir/col=X/some-files
If we create a text format streaming dataframe on `dir/col=X/`  then it should not consider as partitioning in columns. Even though the streaming dataframe does not do so, the generated batch dataframes pick up col as a partitioning columns, causing mismatch streaming source schema and generated df schema. This leads to runtime failure:
```
18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: Query query-0 terminated with error
java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8
```
The reason is that the partition inferring code has no idea of a base path, above which it should not search of partitions. This PR makes sure that the batch DF is generated with the basePath set as the original path on which the file stream source is defined.

## How was this patch tested?

New unit test

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12517 from tdas/SPARK-14741.
2016-04-20 12:22:51 -07:00
Burak Yavuz 80bf48f437 [SPARK-14555] First cut of Python API for Structured Streaming
## What changes were proposed in this pull request?

This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
 - ContinuousQuery
 - Trigger
 - ProcessingTime
in pyspark under `pyspark.sql.streaming`.

In addition, it contains the new methods added under:
 -  `DataFrameWriter`
     a) `startStream`
     b) `trigger`
     c) `queryName`

 -  `DataFrameReader`
     a) `stream`

 - `DataFrame`
    a) `isStreaming`

This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
 - `exception`
 - `sourceStatuses`
 - `sinkStatus`

They may be added in a follow up.

This PR also contains some very minor doc fixes in the Scala side.

## How was this patch tested?

Python doc tests

TODO:
 - [ ] verify Python docs look good

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

Closes #12320 from brkyvz/stream-python.
2016-04-20 10:32:01 -07:00
Liwei Lin 17db4bfeaa [SPARK-14687][CORE][SQL][MLLIB] Call path.getFileSystem(conf) instead of call FileSystem.get(conf)
## What changes were proposed in this pull request?

- replaced `FileSystem.get(conf)` calls with `path.getFileSystem(conf)`

## How was this patch tested?

N/A

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12450 from lw-lin/fix-fs-get.
2016-04-20 11:28:51 +01:00
Wenchen Fan 7abe9a6578 [SPARK-9013][SQL] generate MutableProjection directly instead of return a function
`MutableProjection` is not thread-safe and we won't use it in multiple threads. I think the reason that we return `() => MutableProjection` is not about thread safety, but to save the costs of generating code when we need same but individual mutable projections.

However, I only found one place that use this [feature](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala#L122-L123), and comparing to the troubles it brings, I think we should generate `MutableProjection` directly instead of return a function.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #7373 from cloud-fan/project.
2016-04-20 00:44:02 -07:00
Wenchen Fan 85d759ca3a [SPARK-14704][CORE] create accumulators in TaskMetrics
## What changes were proposed in this pull request?

Before this PR, we create accumulators at driver side(and register them) and send them to executor side, then we create `TaskMetrics` with these accumulators at executor side.
After this PR, we will create `TaskMetrics` at driver side and send it to executor side, so that we can create accumulators inside `TaskMetrics` directly, which is cleaner.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12472 from cloud-fan/acc.
2016-04-19 21:20:24 -07:00
Luciano Resende 78b38109ed [SPARK-13419] [SQL] Update SubquerySuite to use checkAnswer for validation
## What changes were proposed in this pull request?

Change SubquerySuite to validate test results utilizing checkAnswer helper method

## How was this patch tested?

Existing tests

Author: Luciano Resende <lresende@apache.org>

Closes #12269 from lresende/SPARK-13419.
2016-04-19 21:02:10 -07:00
Joan 3ae25f244b [SPARK-13929] Use Scala reflection for UDTs
## What changes were proposed in this pull request?

Enable ScalaReflection and User Defined Types for plain Scala classes.

This involves the move of `schemaFor` from `ScalaReflection` trait (which is Runtime and Compile time (macros) reflection) to the `ScalaReflection` object (runtime reflection only) as I believe this code wouldn't work at compile time anyway as it manipulates `Class`'s that are not compiled yet.

## How was this patch tested?

Unit test

Author: Joan <joan@goyeau.com>

Closes #12149 from joan38/SPARK-13929-Scala-reflection.
2016-04-19 17:36:31 -07:00
Cheng Lian 10f273d8db [SPARK-14407][SQL] Hides HadoopFsRelation related data source API into execution/datasources package #12178
## What changes were proposed in this pull request?

This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package.

Note that to avoid conflicts, this PR is based on #12153. Effective changes for this PR only consist of the last three commits. Will rebase after merging #12153.

## How was this patch tested?

Existing tests.

Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #12361 from liancheng/spark-14407-hide-hadoop-fs-relation.
2016-04-19 17:32:23 -07:00
Herman van Hovell da8859226e [SPARK-4226] [SQL] Support IN/EXISTS Subqueries
### What changes were proposed in this pull request?
This PR adds support for in/exists predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms:

- `[NOT] EXISTS(subquery)`
- `[NOT] IN (subquery)`

This PR is (loosely) based on the work of davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/9055). They should be credited for the work they did.

### How was this patch tested?
Modified parsing unit tests.
Added tests to `org.apache.spark.sql.SQLQuerySuite`

cc rxin, davies & chenghao-intel

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12306 from hvanhovell/SPARK-4226.
2016-04-19 15:16:02 -07:00
Wenchen Fan 5cb2e33609 [SPARK-14675][SQL] ClassFormatError when use Seq as Aggregator buffer type
## What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/12067, we now use expressions to do the aggregation in `TypedAggregateExpression`. To implement buffer merge, we produce a new buffer deserializer expression by replacing `AttributeReference` with right-side buffer attribute, like other `DeclarativeAggregate`s do, and finally combine the left and right buffer deserializer with `Invoke`.

However, after https://github.com/apache/spark/pull/12338, we will add loop variable to class members when codegen `MapObjects`. If the `Aggregator` buffer type is `Seq`, which is implemented by `MapObjects` expression, we will add the same loop variable to class members twice(by left and right buffer deserializer), which cause the `ClassFormatError`.

This PR fixes this issue by calling `distinct` before declare the class menbers.

## How was this patch tested?

new regression test in `DatasetAggregatorSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12468 from cloud-fan/bug.
2016-04-19 10:51:58 -07:00
Josh Rosen 947b9020b0 [SPARK-14676] Wrap and re-throw Await.result exceptions in order to capture full stacktrace
When `Await.result` throws an exception which originated from a different thread, the resulting stacktrace doesn't include the path leading to the `Await.result` call itself, making it difficult to identify the impact of these exceptions. For example, I've seen cases where broadcast cleaning errors propagate to the main thread and crash it but the resulting stacktrace doesn't include any of the main thread's code, making it difficult to pinpoint which exception crashed that thread.

This patch addresses this issue by explicitly catching, wrapping, and re-throwing exceptions that are thrown by `Await.result`.

I tested this manually using 16b31c8251, a patch which reproduces an issue where an RPC exception which occurs while unpersisting RDDs manages to crash the main thread without any useful stacktrace, and verified that informative, full stacktraces were generated after applying the fix in this PR.

/cc rxin nongli yhuai anabranch

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12433 from JoshRosen/wrap-and-rethrow-await-exceptions.
2016-04-19 10:38:10 -07:00
Wenchen Fan 9ee95b6ecc [SPARK-14491] [SQL] refactor object operator framework to make it easy to eliminate serializations
## What changes were proposed in this pull request?

This PR tries to separate the serialization and deserialization logic from object operators, so that it's easier to eliminate unnecessary serializations in optimizer.

Typed aggregate related operators are special, they will deserialize the input row to multiple objects and it's difficult to simply use a deserializer operator to abstract it, so we still mix the deserialization logic there.

## How was this patch tested?

existing tests and new test in `EliminateSerializationSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12260 from cloud-fan/encoder.
2016-04-19 10:00:44 -07:00
Cheng Lian 5e360c93be [SPARK-13681][SPARK-14458][SPARK-14566][SQL] Add back once removed CommitFailureTestRelationSuite and SimpleTextHadoopFsRelationSuite
## What changes were proposed in this pull request?

These test suites were removed while refactoring `HadoopFsRelation` related API. This PR brings them back.

This PR also fixes two regressions:

- SPARK-14458, which causes runtime error when saving partitioned tables using `FileFormat` data sources that are not able to infer their own schemata. This bug wasn't detected by any built-in data sources because all of them happen to have schema inference feature.

- SPARK-14566, which happens to be covered by SPARK-14458 and causes wrong query result or runtime error when
  - appending a Dataset `ds` to a persisted partitioned data source relation `t`, and
  - partition columns in `ds` don't all appear after data columns

## How was this patch tested?

`CommitFailureTestRelationSuite` uses a testing relation that always fails when committing write tasks to test write job cleanup.

`SimpleTextHadoopFsRelationSuite` uses a testing relation to test general `HadoopFsRelation` and `FileFormat` interfaces.

The two regressions are both covered by existing test cases.

Author: Cheng Lian <lian@databricks.com>

Closes #12179 from liancheng/spark-13681-commit-failure-test.
2016-04-19 09:37:00 -07:00