Commit graph

1773 commits

Author SHA1 Message Date
Dongjoon Hyun 88134e7368 [SPARK-16288][SQL] Implement inline table generating function
## What changes were proposed in this pull request?

This PR implements the `inline` table generating function.
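
For illustration, a minimal usage sketch (the output column names are assumed to follow Hive's `inline`):

```scala
// Explodes an array of structs into a table, one row per struct
// (column names col1/col2 assumed).
sql("SELECT inline(array(struct(1, 'a'), struct(2, 'b')))").show()
// +----+----+
// |col1|col2|
// +----+----+
// |   1|   a|
// |   2|   b|
// +----+----+
```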

## How was this patch tested?

Pass the Jenkins tests with a new test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13976 from dongjoon-hyun/SPARK-16288.
2016-07-04 01:57:45 +08:00
Dongjoon Hyun 54b27c1797 [SPARK-16278][SPARK-16279][SQL] Implement map_keys/map_values SQL functions
## What changes were proposed in this pull request?

This PR adds `map_keys` and `map_values` SQL functions in order to remove Hive fallback.
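
For illustration, a minimal usage sketch (output shown in the comment is abridged):

```scala
// map_keys returns the keys of a map, map_values returns its values.
sql("SELECT map_keys(map('a', 1, 'b', 2)) AS ks, map_values(map('a', 1, 'b', 2)) AS vs").show()
// ks = [a, b], vs = [1, 2]
```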

## How was this patch tested?

Pass the Jenkins tests, including new test cases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13967 from dongjoon-hyun/SPARK-16278.
2016-07-03 16:59:40 +08:00
gatorsmile ea990f9693 [SPARK-16329][SQL] Star Expansion over Table Containing No Column
#### What changes were proposed in this pull request?
Star expansion over a table containing zero columns has not worked since 1.6, although it works in Spark 1.5.1. This PR fixes the issue in the master branch.

For example,
```scala
val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => Row.empty)
val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
dfNoCols.registerTempTable("temp_table_no_cols")
sqlContext.sql("select * from temp_table_no_cols").show
```

Without the fix, users will get the following exception:
```
java.lang.IllegalArgumentException: requirement failed
        at scala.Predef$.require(Predef.scala:221)
        at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
```

#### How was this patch tested?
Tests are added

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14007 from gatorsmile/starExpansionTableWithZeroColumn.
2016-07-03 16:48:04 +08:00
Dongjoon Hyun 3000b4b29f [MINOR][BUILD] Fix Java linter errors
## What changes were proposed in this pull request?

This PR fixes minor Java linter errors like the following.
```
-    public int read(char cbuf[], int off, int len) throws IOException {
+    public int read(char[] cbuf, int off, int len) throws IOException {
```

## How was this patch tested?

Manual.
```
$ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14017 from dongjoon-hyun/minor_build_java_linter_error.
2016-07-02 16:31:06 +01:00
Dongjoon Hyun c55397652a [SPARK-16208][SQL] Add PropagateEmptyRelation optimizer
## What changes were proposed in this pull request?

This PR adds a new logical optimizer, `PropagateEmptyRelation`, to collapse logical plans consisting of only empty LocalRelations.

**Optimizer Targets**

1. Binary(or Higher)-node Logical Plans
   - Union with all empty children.
   - Join with one or two empty children (including Intersect/Except).
2. Unary-node Logical Plans
   - Project/Filter/Sample/Join/Limit/Repartition with all empty children.
   - Aggregate with all empty children and without AggregateFunction expressions such as COUNT.
   - Generate with Explode, because other UserDefinedGenerators, like Hive UDTFs, can return results even for empty input.

**Sample Query**
```sql
WITH t1 AS (SELECT a FROM VALUES 1 t(a)),
     t2 AS (SELECT b FROM VALUES 1 t(b) WHERE 1=2)
SELECT a,b
FROM t1, t2
WHERE a=b
GROUP BY a,b
HAVING a>1
ORDER BY a,b
```

**Before**
```scala
scala> sql("with t1 as (select a from values 1 t(a)), t2 as (select b from values 1 t(b) where 1=2) select a,b from t1, t2 where a=b group by a,b having a>1 order by a,b").explain
== Physical Plan ==
*Sort [a#0 ASC, b#1 ASC], true, 0
+- Exchange rangepartitioning(a#0 ASC, b#1 ASC, 200)
   +- *HashAggregate(keys=[a#0, b#1], functions=[])
      +- Exchange hashpartitioning(a#0, b#1, 200)
         +- *HashAggregate(keys=[a#0, b#1], functions=[])
            +- *BroadcastHashJoin [a#0], [b#1], Inner, BuildRight
               :- *Filter (isnotnull(a#0) && (a#0 > 1))
               :  +- LocalTableScan [a#0]
               +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
                  +- *Filter (isnotnull(b#1) && (b#1 > 1))
                     +- LocalTableScan <empty>, [b#1]
```

**After**
```scala
scala> sql("with t1 as (select a from values 1 t(a)), t2 as (select b from values 1 t(b) where 1=2) select a,b from t1, t2 where a=b group by a,b having a>1 order by a,b").explain
== Physical Plan ==
LocalTableScan <empty>, [a#0, b#1]
```

## How was this patch tested?

Pass the Jenkins tests (including a new test suite).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13906 from dongjoon-hyun/SPARK-16208.
2016-07-01 22:13:56 +08:00
Hiroshi Inoue 14cf61e909 [SPARK-16331][SQL] Reduce code generation time
## What changes were proposed in this pull request?
During the code generation, a `LocalRelation` often has a huge `Vector` object as `data`. In the simple example below, a `LocalRelation` has a Vector with 1000000 elements of `UnsafeRow`.

```
val numRows = 1000000
val ds = (1 to numRows).toDS().persist()
benchmark.addCase("filter+reduce") { iter =>
  ds.filter(a => (a & 1) == 0).reduce(_ + _)
}
```

In `TreeNode.transformChildren`, all elements of the vector are unnecessarily iterated to check whether any children exist in the vector, since `Vector` is `Traversable`. This part significantly increases code generation time.

This patch avoids this overhead by checking the number of children before iterating all elements; `LocalRelation` does not have children since it extends `LeafNode`.

The performance of the above example
```
without this patch
Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14 on Mac OS X 10.11.5
Intel(R) Core(TM) i5-5257U CPU  2.70GHz
compilationTime:                         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
filter+reduce                                 4426 / 4533          0.2        4426.0       1.0X

with this patch
compilationTime:                         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
filter+reduce                                 3117 / 3391          0.3        3116.6       1.0X
```

## How was this patch tested?

using existing unit tests

Author: Hiroshi Inoue <inouehrs@jp.ibm.com>

Closes #14000 from inouehrs/compilation-time-reduction.
2016-06-30 21:47:44 -07:00
petermaxlee 85f2303eca [SPARK-16276][SQL] Implement elt SQL function
## What changes were proposed in this pull request?
This patch implements the elt function, as it is implemented in Hive.
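
For illustration, a minimal usage sketch (result shown as a comment):

```scala
// elt(n, s1, s2, ...) returns the n-th string argument (1-based), as in Hive.
sql("SELECT elt(2, 'hello', 'world')").show()
// returns 'world'
```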

## How was this patch tested?
Added expression unit test in StringExpressionsSuite and end-to-end test in StringFunctionsSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #13966 from petermaxlee/SPARK-16276.
2016-07-01 07:57:48 +08:00
Dongjoon Hyun 46395db80e [SPARK-16289][SQL] Implement posexplode table generating function
## What changes were proposed in this pull request?

This PR implements `posexplode` table generating function. Currently, master branch raises the following exception for `map` argument. It's different from Hive.

**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```

**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
|  0|  a|    1|
|  1|  b|    2|
+---+---+-----+
```

For an `array` argument, `after` is the same as `before`.
```
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
|  0|  1|
|  1|  2|
|  2|  3|
+---+---+
```

## How was this patch tested?

Pass the Jenkins tests with newly added test cases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13971 from dongjoon-hyun/SPARK-16289.
2016-06-30 12:03:54 -07:00
Sital Kedia 07f46afc73 [SPARK-13850] Force the sorter to Spill when number of elements in th…
## What changes were proposed in this pull request?

Force the sorter to spill when the number of elements in the pointer array reaches a certain size. This is to work around the issue of TimSort failing on large buffer sizes.

## How was this patch tested?

Tested by running a job which was failing without this change due to the TimSort bug.

Author: Sital Kedia <skedia@fb.com>

Closes #13107 from sitalkedia/fix_TimSort.
2016-06-30 10:53:18 -07:00
Sean Zhong 5320adc863 [SPARK-16071][SQL] Checks size limit when doubling the array size in BufferHolder
## What changes were proposed in this pull request?

This PR checks the size limit when doubling the array size in BufferHolder to avoid integer overflow.
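
For illustration, a minimal sketch of the idea (not the actual BufferHolder code; the method name and exact limit handling are assumptions):

```scala
// Guard against integer overflow before doubling a byte buffer.
def grow(buffer: Array[Byte], cursor: Int, neededSize: Int): Array[Byte] = {
  val totalNeeded = cursor.toLong + neededSize
  if (totalNeeded > Integer.MAX_VALUE) {
    throw new UnsupportedOperationException(
      s"Cannot grow buffer: $totalNeeded bytes exceeds the maximum array size")
  }
  if (totalNeeded > buffer.length) {
    // Double the capacity, capped at Int range and at least totalNeeded.
    val newLength = math.max(math.min(buffer.length.toLong * 2, Integer.MAX_VALUE.toLong), totalNeeded)
    java.util.Arrays.copyOf(buffer, newLength.toInt)
  } else {
    buffer
  }
}
```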

## How was this patch tested?

Manual test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13829 from clockfly/SPARK-16071_2.
2016-06-30 21:56:34 +08:00
petermaxlee d3af6731fa [SPARK-16274][SQL] Implement xpath_boolean
## What changes were proposed in this pull request?
This patch implements the xpath_boolean expression for Spark SQL, an XPath function that returns true or false. The implementation is modelled after Hive's xpath_boolean, except for how the expression handles null inputs: Hive throws a NullPointerException at runtime if either input is null, while this implementation returns null.
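
For illustration, a minimal usage sketch (results shown as comments):

```scala
// True if the XPath expression matches a node or evaluates to true; false otherwise.
sql("SELECT xpath_boolean('<a><b>1</b></a>', 'a/b')").show()      // true
sql("SELECT xpath_boolean('<a><b>1</b></a>', 'a/b = 2')").show()  // false
```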

## How was this patch tested?
Created two new test suites. One for unit tests covering the expression, and the other for end-to-end test in SQL.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #13964 from petermaxlee/SPARK-16274.
2016-06-30 09:27:48 +08:00
Wenchen Fan d063898beb [SPARK-16134][SQL] optimizer rules for typed filter
## What changes were proposed in this pull request?

This PR adds 3 optimizer rules for typed filter:

1. push typed filter down through `SerializeFromObject` and eliminate the deserialization in filter condition.
2. pull typed filter up through `SerializeFromObject` and eliminate the deserialization in filter condition.
3. combine adjacent typed filters and share the deserialized object among all the condition expressions.

This PR also adds a `TypedFilter` logical plan to separate it from the normal filter, so that the concept is clearer and it's easier to write optimizer rules.
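
For illustration, a hypothetical example (class and data invented) of adjacent typed filters that rule 3 can combine, so the input object is deserialized only once:

```scala
import spark.implicits._

case class Person(name: String, age: Int)
val ds = Seq(Person("a", 20), Person("b", 5)).toDS()

// Two adjacent typed filters; the optimized plan should share one deserialization.
ds.filter(_.age > 10).filter(_.name.nonEmpty).explain(true)
```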

## How was this patch tested?

`TypedFilterOptimizationSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13846 from cloud-fan/filter.
2016-06-30 08:15:08 +08:00
Eric Liang 23c58653f9 [SPARK-16238] Metrics for generated method and class bytecode size
## What changes were proposed in this pull request?

This extends SPARK-15860 to include metrics for the actual bytecode size of janino-generated methods. They can be accessed in the same way as any other codahale metric, e.g.

```
scala> org.apache.spark.metrics.source.CodegenMetrics.METRIC_GENERATED_CLASS_BYTECODE_SIZE.getSnapshot().getValues()
res7: Array[Long] = Array(532, 532, 532, 542, 1479, 2670, 3585, 3585)

scala> org.apache.spark.metrics.source.CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.getSnapshot().getValues()
res8: Array[Long] = Array(5, 5, 5, 5, 10, 10, 10, 10, 15, 15, 15, 38, 63, 79, 88, 94, 94, 94, 132, 132, 165, 165, 220, 220)
```

## How was this patch tested?

Small unit test, also verified manually that the performance impact is minimal (<10%). hvanhovell

Author: Eric Liang <ekl@databricks.com>

Closes #13934 from ericl/spark-16238.
2016-06-29 15:07:32 -07:00
Yin Huai 8b5a8b25b9 [SPARK-16301] [SQL] The analyzer rule for resolving using joins should respect the case sensitivity setting.
## What changes were proposed in this pull request?
The analyzer rule for resolving using joins should respect the case sensitivity setting.

## How was this patch tested?
New tests in ResolveNaturalJoinSuite

Author: Yin Huai <yhuai@databricks.com>

Closes #13977 from yhuai/SPARK-16301.
2016-06-29 14:42:58 -07:00
gatorsmile 7ee9e39cb4 [SPARK-16157][SQL] Add New Methods for comments in StructField and StructType
#### What changes were proposed in this pull request?
Based on the previous discussion with cloud-fan hvanhovell in another related PR https://github.com/apache/spark/pull/13764#discussion_r67994276, it looks reasonable to add convenience methods for users to add `comment` when defining `StructField`.

Currently, the column-related `comment` attribute is stored in `Metadata` of `StructField`. For example, users can add the `comment` attribute using the following way:
```Scala
StructType(
  StructField(
    "cl1",
    IntegerType,
    nullable = false,
    new MetadataBuilder().putString("comment", "test").build()) :: Nil)
```
This PR is to add more user friendly methods for the `comment` attribute when defining a `StructField`. After the changes, users are provided three different ways to do it:
```Scala
val struct = (new StructType)
  .add("a", "int", true, "test1")

val struct = (new StructType)
  .add("c", StringType, true, "test3")

val struct = (new StructType)
  .add(StructField("d", StringType).withComment("test4"))
```

#### How was this patch tested?
Added test cases:
- `DataTypeSuite` is for testing three types of API changes,
- `DataFrameReaderWriterSuite` is for parquet, json and csv formats - using in-memory catalog
- `OrcQuerySuite.scala` is for orc format using Hive-metastore

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13860 from gatorsmile/newMethodForComment.
2016-06-29 19:36:21 +08:00
Cheng Lian d1e8108854 [SPARK-16291][SQL] CheckAnalysis should capture nested aggregate functions that reference no input attributes
## What changes were proposed in this pull request?

`MAX(COUNT(*))` is invalid since aggregate expression can't be nested within another aggregate expression. This case should be captured at analysis phase, but somehow sneaks off to runtime.

The reason is that when checking aggregate expressions in `CheckAnalysis`, a checking branch treats all expressions that reference no input attributes as valid ones. However, `MAX(COUNT(*))` is translated into `MAX(COUNT(1))` at analysis phase and also references no input attribute.

This PR fixes this issue by removing the aforementioned branch.
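
A hypothetical repro of what the fix catches (table, data, and the exact error message are assumed):

```scala
import spark.implicits._

Seq((1, 2)).toDF("a", "b").createOrReplaceTempView("t")
// After this fix, the nested aggregate is rejected during analysis instead of
// sneaking off to runtime.
sql("SELECT max(count(b)) FROM t GROUP BY a")
// org.apache.spark.sql.AnalysisException: It is not allowed to use an aggregate
// function in the argument of another aggregate function ... (message assumed)
```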

## How was this patch tested?

New test case added in `AnalysisErrorSuite`.

Author: Cheng Lian <lian@databricks.com>

Closes #13968 from liancheng/spark-16291-nested-agg-functions.
2016-06-29 19:08:36 +08:00
petermaxlee 153c2f9ac1 [SPARK-16271][SQL] Implement Hive's UDFXPathUtil
## What changes were proposed in this pull request?
This patch ports Hive's UDFXPathUtil over to Spark, which can be used to implement xpath functionality in Spark in the near future.

## How was this patch tested?
Added two new test suites UDFXPathUtilSuite and ReusableStringReaderSuite. They have been ported over from Hive (but rewritten in Scala in order to leverage ScalaTest).

Author: petermaxlee <petermaxlee@gmail.com>

Closes #13961 from petermaxlee/xpath.
2016-06-28 21:07:52 -07:00
Reynold Xin 363bcedeea [SPARK-16248][SQL] Whitelist the list of Hive fallback functions
## What changes were proposed in this pull request?
This patch removes the blind fallback into Hive for functions. Instead, it creates a whitelist and adds only a small number of functions to the whitelist, i.e. the ones we intend to support in the long run in Spark.

## How was this patch tested?
Updated tests to reflect the change.

Author: Reynold Xin <rxin@databricks.com>

Closes #13939 from rxin/hive-whitelist.
2016-06-28 19:36:53 -07:00
Burak Yavuz 5545b79109 [MINOR][DOCS][STRUCTURED STREAMING] Minor doc fixes around DataFrameWriter and DataStreamWriter
## What changes were proposed in this pull request?

Fixes a couple of old references from `DataFrameWriter.startStream` to `DataStreamWriter.start`.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #13952 from brkyvz/minor-doc-fix.
2016-06-28 17:02:16 -07:00
Wenchen Fan 8a977b0654 [SPARK-16100][SQL] fix bug when use Map as the buffer type of Aggregator
## What changes were proposed in this pull request?

The root cause is in `MapObjects`. Its parameter `loopVar` is not declared as a child, but can sometimes be the same as `lambdaFunction` (e.g. the function that takes `loopVar` and produces `lambdaFunction` may be `identity`), which is a child. This causes trouble when calling `withNewChildren`: it may mistakenly treat `loopVar` as a child and cause an `IndexOutOfBoundsException: 0` later.

This PR fixes this bug by simply pulling out the parameters from `LambdaVariable` and passing them to `MapObjects` directly.

## How was this patch tested?

new test in `DatasetAggregatorSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13835 from cloud-fan/map-objects.
2016-06-29 06:39:28 +08:00
Wenchen Fan 1f2776df6e [SPARK-16181][SQL] outer join with isNull filter may return wrong result
## What changes were proposed in this pull request?

The root cause is that the output attributes of an outer join are derived from its children, while they are actually different attributes (an outer join can return null).

We have already added some special logic to handle it, e.g. `PushPredicateThroughJoin` won't push down predicates through outer join side, `FixNullability`.

This PR adds one more special logic in `FoldablePropagation`.

## How was this patch tested?

new test in `DataFrameSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13884 from cloud-fan/bug.
2016-06-28 10:26:01 -07:00
Herman van Hovell 02a029df43 [SPARK-16220][SQL] Add scope to show functions
## What changes were proposed in this pull request?
Spark currently shows all functions when issuing a `SHOW FUNCTIONS` command. This PR refines the `SHOW FUNCTIONS` command by allowing users to select all functions, user-defined functions, or system functions. The following syntax can be used:

**ALL** (default)
```SHOW FUNCTIONS```
```SHOW ALL FUNCTIONS```

**SYSTEM**
```SHOW SYSTEM FUNCTIONS```

**USER**
```SHOW USER FUNCTIONS```

## How was this patch tested?
Updated tests and added tests to the DDLSuite

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #13929 from hvanhovell/SPARK-16220.
2016-06-27 16:57:34 -07:00
Bill Chambers c48c8ebc0a [SPARK-16220][SQL] Revert Change to Bring Back SHOW FUNCTIONS Functionality
## What changes were proposed in this pull request?

- Fix tests regarding show functions functionality
- Revert `catalog.ListFunctions` and `SHOW FUNCTIONS` to return to `Spark 1.X` functionality.

Cherry picked changes from this PR: https://github.com/apache/spark/pull/13413/files

## How was this patch tested?

Unit tests.

Author: Bill Chambers <bill@databricks.com>
Author: Bill Chambers <wchambers@ischool.berkeley.edu>

Closes #13916 from anabranch/master.
2016-06-27 11:50:34 -07:00
Takeshi YAMAMURO 3e4e868c85 [SPARK-16135][SQL] Remove hashCode and equals in ArrayBasedMapData
## What changes were proposed in this pull request?
This PR removes `hashCode` and `equals` in `ArrayBasedMapData` because the type cannot be used as join keys, grouping keys, or in equality tests.

## How was this patch tested?
Add a new test suite `MapDataSuite` for comparison tests.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13847 from maropu/UnsafeMapTest.
2016-06-27 21:45:22 +08:00
Sital Kedia bf665a9586 [SPARK-15958] Make initial buffer size for the Sorter configurable
## What changes were proposed in this pull request?

Currently the initial buffer size in the sorter is hard-coded and is too small for large workloads. As a result, the sorter spends significant time expanding the buffer size and copying the data. It would be useful to have it configurable.

## How was this patch tested?

Tested by running a job on the cluster.

Author: Sital Kedia <skedia@fb.com>

Closes #13699 from sitalkedia/config_sort_buffer_upstream.
2016-06-25 09:13:39 +01:00
Takeshi YAMAMURO d2e44d7db8 [SPARK-16192][SQL] Add type checks in CollectSet
## What changes were proposed in this pull request?
`CollectSet` cannot have map-typed data because MapTypeData does not implement `equals`.
So this PR adds type checks in `CheckAnalysis`.
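
A hypothetical repro (table, data, and the exact error message are assumed):

```scala
import spark.implicits._

Seq(1, 2).toDF("a").createOrReplaceTempView("t")
// With this check, map-typed input to collect_set fails at analysis time
// instead of silently producing an ill-defined set.
sql("SELECT collect_set(map('k', a)) FROM t")
// org.apache.spark.sql.AnalysisException ... (message assumed)
```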

## How was this patch tested?
Added tests to check failures when we found map-typed data in `CollectSet`.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13892 from maropu/SPARK-16192.
2016-06-24 21:07:03 -07:00
Sean Owen 158af162ea [SPARK-16129][CORE][SQL] Eliminate direct use of commons-lang classes in favor of commons-lang3
## What changes were proposed in this pull request?

Replace use of `commons-lang` in favor of `commons-lang3` and forbid the former via scalastyle; remove `NotImplementedException` from `commons-lang` in favor of JDK `UnsupportedOperationException`

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #13843 from srowen/SPARK-16129.
2016-06-24 10:35:54 +01:00
Wenchen Fan 6a3c6276f5 [SQL][MINOR] ParserUtils.operationNotAllowed should throw exception directly
## What changes were proposed in this pull request?

It's weird that `ParserUtils.operationNotAllowed` returns an exception and the caller throws it.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13874 from cloud-fan/style.
2016-06-23 20:20:55 -07:00
Dongjoon Hyun 91b1ef28d1 [SPARK-16164][SQL] Update CombineFilters to try to construct predicates with child predicate first
## What changes were proposed in this pull request?

This PR changes `CombineFilters` to compose the final predicate condition by using (`child predicate` AND `parent predicate`) instead of (`parent predicate` AND `child predicate`). This is a best effort approach. Some other optimization rules may destroy this order by reorganizing conjunctive predicates.

**Reported Error Scenario**
Chris McCubbin reported a bug when he used StringIndexer in an ML pipeline with additional filters. It seems that during filter pushdown, we changed the ordering in the logical plan.
```scala
import org.apache.spark.ml.feature._
val df1 = (0 until 3).map(_.toString).toDF
val indexer = new StringIndexer()
  .setInputCol("value")
  .setOutputCol("idx")
  .setHandleInvalid("skip")
  .fit(df1)
val df2 = (0 until 5).map(_.toString).toDF
val predictions = indexer.transform(df2)
predictions.show() // this is okay
predictions.where('idx > 2).show() // this will throw an exception
```

Please see the notebook at https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html for error messages.

## How was this patch tested?

Pass the Jenkins tests (including a new testcase).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13872 from dongjoon-hyun/SPARK-16164.
2016-06-23 15:27:43 -07:00
Davies Liu 10396d9505 [SPARK-16163] [SQL] Cache the statistics for logical plans
## What changes were proposed in this pull request?

The calculation of statistics is not trivial anymore; it can be very slow on large queries (for example, TPC-DS Q64 took several minutes to plan).

During the planning of a query, the statistics of any logical plan should not change (even InMemoryRelation), so we should use `lazy val` to cache the statistics.

For InMemoryRelation, the statistics could be updated after materialization; this is only useful when the relation is used in another query (before planning), because once we finish planning, the statistics will not be used anymore.

## How was this patch tested?

Tested with TPC-DS Q64; it could be planned in a second after the patch.

Author: Davies Liu <davies@databricks.com>

Closes #13871 from davies/fix_statistics.
2016-06-23 11:48:48 -07:00
Davies Liu 20d411bc5d [SPARK-16078][SQL] from_utc_timestamp/to_utc_timestamp should not depends on local timezone
## What changes were proposed in this pull request?

Currently, we use local timezone to parse or format a timestamp (TimestampType), then use Long as the microseconds since epoch UTC.

In from_utc_timestamp() and to_utc_timestamp(), we did not consider the local timezone, so they could return different results under different local timezones.

This PR does the conversion based on human time (in the local timezone), so it should return the same result in any timezone. But because the mapping from an absolute timestamp to human time is not exactly one-to-one, it will still return wrong results in some timezones (also at the beginning or end of DST).

This PR is a best-effort fix. In the long term, we should make TimestampType timezone-aware to fix this completely.
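
For illustration, the expected behavior after this patch (display formatting abridged):

```scala
// Should give 2016-06-22 09:00:00 regardless of the JVM's local timezone,
// since Asia/Seoul is UTC+9.
sql("SELECT from_utc_timestamp('2016-06-22 00:00:00', 'Asia/Seoul')").show()
```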

## How was this patch tested?

Tested these functions in all timezones.

Author: Davies Liu <davies@databricks.com>

Closes #13784 from davies/convert_tz.
2016-06-22 13:40:24 -07:00
Wenchen Fan 01277d4b25 [SPARK-16097][SQL] Encoders.tuple should handle null object correctly
## What changes were proposed in this pull request?

Although the top-level input object cannot be null, when we use `Encoders.tuple` to combine 2 encoders, their input objects are no longer top level and can be null. We should handle this case.

## How was this patch tested?

new test in DatasetSuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13807 from cloud-fan/bug.
2016-06-22 18:32:14 +08:00
Yin Huai 905f774b71 [SPARK-16030][SQL] Allow specifying static partitions when inserting to data source tables
## What changes were proposed in this pull request?
This PR adds the static partition support to INSERT statement when the target table is a data source table.
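
For illustration, a hedged sketch of the syntax this enables (the tables and columns are hypothetical):

```scala
// Assuming `t` is a partitioned data source table with data column `a` and
// partition column `b`, and `src` is an existing table: `b` is a static partition.
sql("INSERT INTO TABLE t PARTITION (b = '2016') SELECT a FROM src")
```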

## How was this patch tested?
New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite.

**Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change.**

Author: Yin Huai <yhuai@databricks.com>

Closes #13769 from yhuai/SPARK-16030-1.
2016-06-20 20:17:47 +08:00
Yin Huai 6d0f921aed [SPARK-16036][SPARK-16037][SPARK-16034][SQL] Follow up code clean up and improvement
## What changes were proposed in this pull request?
This PR is the follow-up PR for https://github.com/apache/spark/pull/13754/files and https://github.com/apache/spark/pull/13749. I will comment inline to explain my changes.

## How was this patch tested?
Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #13766 from yhuai/caseSensitivity.
2016-06-19 21:45:53 -07:00
Davies Liu 001a589603 [SPARK-15613] [SQL] Fix incorrect days to millis conversion due to Daylight Saving Time
## What changes were proposed in this pull request?

Internally, we use an Int to represent a date (the days since 1970-01-01). When we convert that into a unix timestamp (milliseconds since epoch in UTC), we get the offset of a timezone using local millis (the milliseconds since 1970-01-01 in that timezone), but TimeZone.getOffset() expects a unix timestamp, so the result could be off by one hour (depending on whether Daylight Saving Time (DST) is in effect).

This PR changes to using a best-effort approximation of the posix timestamp to look up the offset. At DST transitions, some times are not defined (for example, 2016-03-13 02:00:00 PST) or could map to multiple valid results in UTC (for example, 2016-11-06 01:00:00); this best-effort approximation should be enough in practice.

## How was this patch tested?

Added regression tests.

Author: Davies Liu <davies@databricks.com>

Closes #13652 from davies/fix_timezone.
2016-06-19 00:34:52 -07:00
Wenchen Fan 3d010c8375 [SPARK-16036][SPARK-16037][SQL] fix various table insertion problems
## What changes were proposed in this pull request?

The current table insertion has some weird behaviours:

1. inserting into a partitioned table with mismatched columns gives a confusing error message for Hive tables, and wrong results for data source tables
2. inserting into a partitioned table without a partition list gives wrong results for Hive tables.

This PR fixes these 2 problems.

## How was this patch tested?

new test in hive `SQLQuerySuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13754 from cloud-fan/insert2.
2016-06-18 10:32:27 -07:00
Reynold Xin 1a65e62a7f [SPARK-16014][SQL] Rename optimizer rules to be more consistent
## What changes were proposed in this pull request?
This small patch renames a few optimizer rules to make the naming more consistent, e.g. class names start with a verb. The most important "fix" is probably SamplePushDown -> PushProjectThroughSample. SamplePushDown is actually the wrong name, since the rule is not about pushing Sample down.

## How was this patch tested?
Updated test cases.

Author: Reynold Xin <rxin@databricks.com>

Closes #13732 from rxin/SPARK-16014.
2016-06-17 15:51:20 -07:00
gatorsmile e5d703bca8 [SPARK-15706][SQL] Fix Wrong Answer when using IF NOT EXISTS in INSERT OVERWRITE for DYNAMIC PARTITION
#### What changes were proposed in this pull request?
`IF NOT EXISTS` in `INSERT OVERWRITE` should not support dynamic partitions. If we specify `IF NOT EXISTS`, the inserted data is not shown in the table.

This PR issues an exception in this case, just like Hive does. It also issues an exception if users specify `IF NOT EXISTS` without any `PARTITION` specification.

#### How was this patch tested?
Added test cases into `PlanParserSuite` and `InsertIntoHiveTableSuite`

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13447 from gatorsmile/insertIfNotExist.
2016-06-16 22:54:02 -07:00
Pete Robbins 5ada606144 [SPARK-15822] [SQL] Prevent byte array backed classes from referencing freed memory
## What changes were proposed in this pull request?
`UTF8String` and all `Unsafe*` classes are backed by either on-heap or off-heap byte arrays. The code-generated version of `SortMergeJoin` buffers the left hand side join keys during iteration. This was actually problematic in off-heap mode when one of the keys is a `UTF8String` (or any other `Unsafe*` object) and the left hand side iterator was exhausted (and released its memory); the buffered keys would reference freed memory. This causes segfaults and all kinds of other undefined behavior when we use one of these buffered keys.

This PR fixes this problem by creating copies of the buffered variables. I have added a general method to the `CodeGenerator` for this. I have checked all places in which this could happen, and only `SortMergeJoin` had this problem.

This PR is largely based on the work of robbinspg and he should be credited for this.

closes https://github.com/apache/spark/pull/13707

## How was this patch tested?
Manually tested on problematic workloads.

Author: Pete Robbins <robbinspg@gmail.com>
Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #13723 from hvanhovell/SPARK-15822-2.
2016-06-16 22:27:32 -07:00
Dongjoon Hyun 2d27eb1e75 [MINOR][DOCS][SQL] Fix some comments about types(TypeCoercion,Partition) and exceptions.
## What changes were proposed in this pull request?

This PR contains a few changes on code comments.
- `HiveTypeCoercion` is renamed into `TypeCoercion`.
- `NoSuchDatabaseException` is only used for the absence of database.
- For partition type inference, only `DoubleType` is considered.

## How was this patch tested?

N/A

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13674 from dongjoon-hyun/minor_doc_types.
2016-06-16 14:27:09 -07:00
bomeng bbad4cb48d [SPARK-15978][SQL] improve 'show tables' command related codes
## What changes were proposed in this pull request?

I've found some minor issues in "show tables" command:

1. In `SessionCatalog.scala`, the `listTables(db: String)` method calls `listTables(formatDatabaseName(db), "*")` to list all the tables for a certain db, but in the method `listTables(db: String, pattern: String)`, this db name is formatted once more. So I think we should remove `formatDatabaseName()` in the caller.

2. I suggest adding a sort to `listTables(db: String)` in `InMemoryCatalog.scala`, just like `listDatabases()`.

## How was this patch tested?

The existing test cases should cover it.

Author: bomeng <bmeng@us.ibm.com>

Closes #13695 from bomeng/SPARK-15978.
2016-06-16 14:18:02 -07:00
gatorsmile 6451cf9270 [SPARK-15862][SQL] Better Error Message When Having Database Name in CACHE TABLE AS SELECT
#### What changes were proposed in this pull request?
~~If the temp table already exists, we should not silently replace it when doing `CACHE TABLE AS SELECT`. This is inconsistent with the behavior of `CREATE VIEW` or `CREATE TABLE`. This PR is to fix this silent drop.~~

~~Maybe, we also can introduce new syntax for replacing the existing one. For example, in Hive, to replace a view, the syntax should be like `ALTER VIEW AS SELECT` or `CREATE OR REPLACE VIEW AS SELECT`~~

The table name in `CACHE TABLE AS SELECT` should NOT contain database prefix like "database.table". Thus, this PR captures this in Parser and outputs a better error message, instead of reporting the view already exists.

In addition, this refactors the `Parser` to generate table identifiers instead of returning the table name string.

#### How was this patch tested?
- Added a test case for caching and uncaching qualified table names
- Fixed a few test cases that do not drop temp table at the end
- Added the related test case for the issue resolved in this PR

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #13572 from gatorsmile/cacheTableAsSelect.
2016-06-16 10:01:59 -07:00
Narine Kokhlikyan 7c6c692637 [SPARK-12922][SPARKR][WIP] Implement gapply() on DataFrame in SparkR
## What changes were proposed in this pull request?

gapply() applies an R function to groups defined by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.

Please let me know what you think and if you have any ideas to improve it.

Thank you!

## How was this patch tested?
Unit tests.
1. Primitive test with different column types
2. Add a boolean column
3. Compute average by a group

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Author: NarineK <narine.kokhlikyan@us.ibm.com>

Closes #12836 from NarineK/gapply2.
2016-06-15 21:42:05 -07:00
Sean Zhong 9bd80ad6bd [SPARK-15776][SQL] Divide Expression inside Aggregation function is casted to wrong type
## What changes were proposed in this pull request?

This PR fixes the problem that a Divide expression inside an aggregation function is cast to the wrong type, which causes `select 1/2` and `select sum(1/2)` to return different results.

**Before the change:**

```
scala> sql("select 1/2 as a").show()
+---+
|  a|
+---+
|0.5|
+---+

scala> sql("select sum(1/2) as a").show()
+---+
|  a|
+---+
|0  |
+---+

scala> sql("select sum(1 / 2) as a").schema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,LongType,true))
```

**After the change:**

```
scala> sql("select 1/2 as a").show()
+---+
|  a|
+---+
|0.5|
+---+

scala> sql("select sum(1/2) as a").show()
+---+
|  a|
+---+
|0.5|
+---+

scala> sql("select sum(1/2) as a").schema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,DoubleType,true))
```

## How was this patch tested?

Unit test.

This PR is based on https://github.com/apache/spark/pull/13524 by Sephiroth-Lin

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13651 from clockfly/SPARK-15776.
2016-06-15 14:34:15 -07:00
bomeng 42a28caf10 [SPARK-15952][SQL] fix "show databases" ordering issue
## What changes were proposed in this pull request?

Two issues I've found for "show databases" command:

1. The returned database name list was not sorted; it was only sorted when "like" was used with it. (Hive will always return a sorted list.)

2. When used as sql("show databases").show, it outputs a table with a column named "result", but sql("show tables").show outputs the column name "tableName", so I think we should be consistent and use "databaseName" at least.

## How was this patch tested?

Updated existing test case to test its ordering as well.

Author: bomeng <bmeng@us.ibm.com>

Closes #13671 from bomeng/SPARK-15952.
2016-06-14 18:35:29 -07:00
Takuya UESHIN c5b7355819 [SPARK-15915][SQL] Logical plans should use canonicalized plan when override sameResult.
## What changes were proposed in this pull request?

A `DataFrame` whose plan overrides `sameResult` without using the canonicalized plan for comparison can't be cached with cacheTable.

The example is like:

```
    val localRelation = Seq(1, 2, 3).toDF()
    localRelation.createOrReplaceTempView("localRelation")

    spark.catalog.cacheTable("localRelation")
    assert(
      localRelation.queryExecution.withCachedData.collect {
        case i: InMemoryRelation => i
      }.size == 1)
```

and this will fail as:

```
ArrayBuffer() had size 0 instead of expected size 1
```

The reason is that when we do `spark.catalog.cacheTable("localRelation")`, `CacheManager` caches the plan wrapped by `SubqueryAlias`, but when planning the DataFrame `localRelation`, `CacheManager` looks for a cached table for the unwrapped plan, because the plan for the DataFrame `localRelation` is not wrapped.
Some plans like `LocalRelation`, `LogicalRDD`, etc. override the `sameResult` method but do not use the canonicalized plan for comparison, so the `CacheManager` can't detect that the plans are the same.

This PR modifies them to use the canonicalized plan when overriding the `sameResult` method.

## How was this patch tested?

Added a test to check that a DataFrame whose plan overrides sameResult without using the canonicalized plan for comparison can now be cached with cacheTable.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #13638 from ueshin/issues/SPARK-15915.
2016-06-14 10:52:13 -07:00
Wenchen Fan 688b6ef9dc [SPARK-15932][SQL][DOC] document the contract of encoder serializer expressions
## What changes were proposed in this pull request?

In our encoder framework, we imply that serializer expressions should use `BoundReference` to refer to the input object, and a lot of code depends on this contract (e.g. ExpressionEncoder.tuple). This PR adds some documentation and asserts in `ExpressionEncoder` to make it clearer.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13648 from cloud-fan/comment.
2016-06-13 22:02:23 -07:00
Sandeep Singh 1842cdd4ee [SPARK-15663][SQL] SparkSession.catalog.listFunctions shouldn't include the list of built-in functions
## What changes were proposed in this pull request?
SparkSession.catalog.listFunctions currently returns all functions, including the list of built-in functions. This makes the method less useful, because any time it is run the result set contains over 100 built-in functions.

## How was this patch tested?
CatalogSuite

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13413 from techaddict/SPARK-15663.
2016-06-13 21:58:52 -07:00
Sean Zhong 7b9071eeaa [SPARK-15910][SQL] Check schema consistency when using Kryo encoder to convert DataFrame to Dataset
## What changes were proposed in this pull request?

This PR enforces a schema check when converting a DataFrame to a Dataset using the Kryo encoder. For example:

**Before the change:**

Schema is NOT checked when converting DataFrame to Dataset using kryo encoder.
```
scala> case class B(b: Int)
scala> implicit val encoder = Encoders.kryo[B]
scala> val df = Seq((1)).toDF("b")
scala> val ds = df.as[B] // Schema compatibility is NOT checked
```

**After the change:**
Report AnalysisException since the schema is NOT compatible.
```
scala> val ds = Seq((1)).toDF("b").as[B]
org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`b` AS BINARY)' due to data type mismatch: cannot cast IntegerType to BinaryType;
...
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13632 from clockfly/spark-15910.
2016-06-13 17:43:55 -07:00
Herman van Hövell tot Westerflier 1f8f2b5c2a [SPARK-15370][SQL] Fix count bug
# What changes were proposed in this pull request?
This pull request fixes the COUNT bug in the `RewriteCorrelatedScalarSubquery` rule.

After this change, the rule tests the expression at the root of the correlated subquery to determine whether the expression returns `NULL` on empty input. If the expression does not return `NULL`, the rule generates additional logic in the `Project` operator above the rewritten subquery. This additional logic intercepts `NULL` values coming from the outer join and replaces them with the value that the subquery's expression would return on empty input.

This PR takes over https://github.com/apache/spark/pull/13155. It only fixes an issue with `Literal` construction and style issues. All credit should go to frreiss.

# How was this patch tested?
Added regression tests to cover all branches of the updated rule (see changes to `SubquerySuite`).
Ran all existing automated regression tests after merging with latest trunk.

Author: frreiss <frreiss@us.ibm.com>
Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #13629 from hvanhovell/SPARK-15370-cleanup.
2016-06-12 21:30:32 -07:00
Wenchen Fan f5d38c3925 Revert "[SPARK-15753][SQL] Move Analyzer stuff to Analyzer from DataFrameWriter"
This reverts commit 0ec279ffdf.
2016-06-12 16:52:15 -07:00
Herman van Hovell 20b8f2c32a [SPARK-15370][SQL] Revert PR "Update RewriteCorrelatedSuquery rule"
This reverts commit 9770f6ee60.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #13626 from hvanhovell/SPARK-15370-revert.
2016-06-12 15:06:37 -07:00
Ioana Delaney 0ff8a68b9f [SPARK-15832][SQL] Embedded IN/EXISTS predicate subquery throws TreeNodeException
## What changes were proposed in this pull request?
Queries with embedded existential sub-query predicates throw an exception when building the physical plan.

Example failing query:
```SQL
scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
scala> sql("select c1 from t1 where (case when c2 in (select c2 from t2) then 2 else 3 end) IN (select c2 from t1)").show()

Binding attribute, tree: c2#239
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: c2#239
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)

  ...
  at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
  at org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
  at org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at org.apache.spark.sql.execution.joins.HashJoin$class.org$apache$spark$sql$execution$joins$HashJoin$$x$8(HashJoin.scala:66)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8$lzycompute(BroadcastHashJoinExec.scala:38)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8(BroadcastHashJoinExec.scala:38)
  at org.apache.spark.sql.execution.joins.HashJoin$class.buildKeys(HashJoin.scala:63)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys$lzycompute(BroadcastHashJoinExec.scala:38)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys(BroadcastHashJoinExec.scala:38)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.requiredChildDistribution(BroadcastHashJoinExec.scala:52)
```

**Problem description:**
When the left hand side expression of an existential sub-query predicate contains another embedded sub-query predicate, the RewritePredicateSubquery optimizer rule does not resolve the embedded sub-query expressions into existential joins. For example, the above query has the following optimized plan, which fails during the physical plan build.

```SQL
== Optimized Logical Plan ==
Project [_1#224 AS c1#227]
+- Join LeftSemi, (CASE WHEN predicate-subquery#255 [(_2#225 = c2#239)] THEN 2 ELSE 3 END = c2#228#262)
   :  +- SubqueryAlias predicate-subquery#255 [(_2#225 = c2#239)]
   :     +- LocalRelation [c2#239]
   :- LocalRelation [_1#224, _2#225]
   +- LocalRelation [c2#228#262]

== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: c2#239
```

**Solution:**
In RewritePredicateSubquery, before rewriting the outermost predicate sub-query, resolve any embedded existential sub-queries. The Optimized plan for the above query after the changes looks like below.

```SQL
== Optimized Logical Plan ==
Project [_1#224 AS c1#227]
+- Join LeftSemi, (CASE WHEN exists#285 THEN 2 ELSE 3 END = c2#228#284)
   :- Join ExistenceJoin(exists#285), (_2#225 = c2#239)
   :  :- LocalRelation [_1#224, _2#225]
   :  +- LocalRelation [c2#239]
   +- LocalRelation [c2#228#284]

== Physical Plan ==
*Project [_1#224 AS c1#227]
+- *BroadcastHashJoin [CASE WHEN exists#285 THEN 2 ELSE 3 END], [c2#228#284], LeftSemi, BuildRight
   :- *BroadcastHashJoin [_2#225], [c2#239], ExistenceJoin(exists#285), BuildRight
   :  :- LocalTableScan [_1#224, _2#225]
   :  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   :     +- LocalTableScan [c2#239]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- LocalTableScan [c2#228#284]
      +- LocalTableScan [c222#36], [[111],[222]]
```

## How was this patch tested?
Added new test cases in SubquerySuite.scala

Author: Ioana Delaney <ioanamdelaney@gmail.com>

Closes #13570 from ioana-delaney/fixEmbedSubPredV1.
2016-06-12 14:26:29 -07:00
frreiss 9770f6ee60 [SPARK-15370][SQL] Update RewriteCorrelatedScalarSubquery rule to fix COUNT bug
## What changes were proposed in this pull request?
This pull request fixes the COUNT bug in the `RewriteCorrelatedScalarSubquery` rule.

After this change, the rule tests the expression at the root of the correlated subquery to determine whether the expression returns NULL on empty input. If the expression does not return NULL, the rule generates additional logic in the Project operator above the rewritten subquery.  This additional logic intercepts NULL values coming from the outer join and replaces them with the value that the subquery's expression would return on empty input.

## How was this patch tested?
Added regression tests to cover all branches of the updated rule (see changes to `SubquerySuite.scala`).
Ran all existing automated regression tests after merging with latest trunk.

Author: frreiss <frreiss@us.ibm.com>

Closes #13155 from frreiss/master.
2016-06-12 14:21:10 -07:00
Eric Liang e1f986c7a3 [SPARK-15860] Metrics for codegen size and perf
## What changes were proposed in this pull request?

Adds codahale metrics for the codegen source text size and how long it takes to compile. The size is particularly interesting, since the JVM does have hard limits on how large methods can get.

To simplify, I added the metrics under a statically-initialized source that is always registered with SparkEnv.

## How was this patch tested?

Unit tests

Author: Eric Liang <ekl@databricks.com>

Closes #13586 from ericl/spark-15860.
2016-06-11 23:16:21 -07:00
Eric Liang c06c58bbbb [SPARK-14851][CORE] Support radix sort with nullable longs
## What changes were proposed in this pull request?

This adds support for radix sort of nullable long fields. When a sort field is null and radix sort is enabled, we keep nulls in a separate region of the sort buffer so that radix sort does not need to deal with them. This also has performance benefits when sorting smaller integer types, since the current representation of nulls in two's complement (Long.MIN_VALUE) otherwise forces a full-width radix sort.

This strategy for nulls does mean the sort is no longer stable. cc davies

## How was this patch tested?

Existing randomized sort tests for correctness. I also tested some TPCDS queries and there does not seem to be any significant regression for non-null sorts.

Some test queries (best of 5 runs each).
Before change:
scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6
start: Long = 3190437233227987
res3: Double = 4716.471091

After change:
scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6
start: Long = 3190367870952791
res4: Double = 2981.143045

Author: Eric Liang <ekl@databricks.com>

Closes #13161 from ericl/sc-2998.
2016-06-11 15:42:58 -07:00
Sameer Agarwal 468da03e23 [SPARK-15678] Add support to REFRESH data source paths
## What changes were proposed in this pull request?

Spark currently incorrectly continues to use cached data even if the underlying data is overwritten.

Current behavior:
```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
sqlContext.read.parquet(dir).count() // outputs 1000 <---- We are still using the cached dataset
```

This patch fixes this bug by adding support for `REFRESH path` that invalidates and refreshes all the cached data (and the associated metadata) for any dataframe that contains the given data source path.

Expected behavior:
```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
spark.catalog.refreshResource(dir)
sqlContext.read.parquet(dir).count() // outputs 10 <---- We are not using the cached dataset
```

## How was this patch tested?

Unit tests for overwrites and appends in `ParquetQuerySuite` and `CachedTableSuite`.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #13566 from sameeragarwal/refresh-path-2.
2016-06-10 20:43:18 -07:00
Cheng Lian 8e7b56f3d4 Revert "[SPARK-15639][SQL] Try to push down filter at RowGroups level for parquet reader"
This reverts commit bba5d7999f.
2016-06-10 20:41:48 -07:00
Liang-Chi Hsieh bba5d7999f [SPARK-15639][SQL] Try to push down filter at RowGroups level for parquet reader
## What changes were proposed in this pull request?

The base class `SpecificParquetRecordReaderBase`, used for the vectorized parquet reader, will try to get pushed-down filters from the given configuration. These pushed-down filters are used for RowGroups-level filtering. However, we don't set up the filters to push down into the configuration. In other words, the filters are not actually pushed down to do RowGroups-level filtering. This patch fixes this by setting up the filters for pushing down in the configuration for the reader.

## How was this patch tested?
Existing tests should pass.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13371 from viirya/vectorized-reader-push-down-filter.
2016-06-10 18:23:59 -07:00
Narine Kokhlikyan 54f758b5fc [SPARK-15884][SPARKR][SQL] Overriding stringArgs in MapPartitionsInR
## What changes were proposed in this pull request?

As discussed in https://github.com/apache/spark/pull/12836,
we need to override the stringArgs method in MapPartitionsInR in order to avoid overly large strings being generated by the "stringArgs" method from the input arguments.

In this case we exclude some of the input arguments: the serialized R objects.

## How was this patch tested?
Existing test cases

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>

Closes #13610 from NarineK/dapply_MapPartitionsInR_stringArgs.
2016-06-10 17:17:47 -07:00
Sela 127a6678d7 [SPARK-15489][SQL] Dataset kryo encoder won't load custom user settings
## What changes were proposed in this pull request?

Serializer instantiation will consider existing SparkConf

## How was this patch tested?
manual test with `ImmutableList` (Guava) and `kryo-serializers`'s `Immutable*Serializer` implementations.

Added Test Suite.

Author: Sela <ansela@paypal.com>

Closes #13424 from amitsela/SPARK-15489.
2016-06-10 14:36:51 -07:00
Herman van Hovell e05a2feebe [SPARK-15825] [SQL] Fix SMJ invalid results
## What changes were proposed in this pull request?
Code generated `SortMergeJoin` failed with wrong results when using structs as keys. This could (eventually) be traced back to the use of a wrong row reference when comparing structs.

## How was this patch tested?
TBD

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #13589 from hvanhovell/SPARK-15822.
2016-06-10 14:29:05 -07:00
wangyang 026eb90644 [SPARK-15875] Try to use Seq.isEmpty and Seq.nonEmpty instead of Seq.length == 0 and Seq.length > 0
## What changes were proposed in this pull request?

In Scala, immutable.List.length is an expensive operation, so we should
avoid using Seq.length == 0 or Seq.length > 0, and use Seq.isEmpty and Seq.nonEmpty instead.
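
A minimal illustration of the difference (list size chosen arbitrarily):

```scala
// For an immutable List, `length` walks the whole list, while
// `isEmpty`/`nonEmpty` only look at the head.
val xs: Seq[Int] = List.range(0, 1000000)
val slow = xs.length == 0   // O(n)
val fast = xs.isEmpty       // O(1)
assert(slow == fast)
```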

## How was this patch tested?
existing tests

Author: wangyang <wangyang@haizhi.com>

Closes #13601 from yangw1234/isEmpty.
2016-06-10 13:10:03 -07:00
Sandeep Singh 865ec32dd9 [MINOR][X][X] Replace all occurrences of None: Option with Option.empty
## What changes were proposed in this pull request?
Replace all occurrences of `None: Option[X]` with `Option.empty[X]`
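
A minimal illustration of the replacement:

```scala
// Both declare an empty Option[String]; Option.empty[X] avoids the type ascription.
val before: Option[String] = None
val after = Option.empty[String]
assert(before == after)
```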

## How was this patch tested?
Exisiting Tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13591 from techaddict/minor-7.
2016-06-10 13:06:51 -07:00
Takuya UESHIN 667d4ea7b3 [SPARK-6320][SQL] Move planLater method into GenericStrategy.
## What changes were proposed in this pull request?

This PR moves `QueryPlanner.planLater()` method into `GenericStrategy` for extra strategies to be able to use `planLater` in its strategy.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #13147 from ueshin/issues/SPARK-6320.
2016-06-10 13:06:18 -07:00
Liang-Chi Hsieh 0ec279ffdf [SPARK-15753][SQL] Move Analyzer stuff to Analyzer from DataFrameWriter
## What changes were proposed in this pull request?

This patch moves some codes in `DataFrameWriter.insertInto` that belongs to `Analyzer`.

## How was this patch tested?
Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13496 from viirya/move-analyzer-stuff.
2016-06-10 11:05:04 -07:00
Tathagata Das abdb5d42c5 [SPARK-15812][SQ][STREAMING] Added support for sorting after streaming aggregation with complete mode
## What changes were proposed in this pull request?

When the output mode is complete, the output of a streaming aggregation essentially contains the complete aggregates every time, so this is no different from a batch dataset within an incremental execution. Other non-streaming operations should be supported on this dataset. In this PR, I am just adding support for sorting, as it is commonly useful functionality. Support for other operations will come later.

## How was this patch tested?
Additional unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13549 from tdas/SPARK-15812.
2016-06-10 10:48:28 -07:00
Eric Liang b914e1930f [SPARK-15794] Should truncate toString() of very wide plans
## What changes were proposed in this pull request?

With very wide tables, e.g. thousands of fields, the plan output is unreadable and often causes OOMs due to inefficient string processing. This truncates all struct and operator field lists to a user configurable threshold to limit performance impact.

It would also be nice to optimize string generation to avoid these sorts of O(n^2) slowdowns entirely (i.e. use StringBuilder everywhere including expressions), but this is probably too large of a change for 2.0 at this point, and truncation has other benefits for usability.

## How was this patch tested?

Added a microbenchmark that covers this case particularly well. I also ran the microbenchmark while varying the truncation threshold.

```
numFields = 5
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)            2336 / 2558          0.0       23364.4       0.1X

numFields = 25
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)            4237 / 4465          0.0       42367.9       0.1X

numFields = 100
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)          10458 / 11223          0.0      104582.0       0.0X

numFields = Infinity
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
[info]   java.lang.OutOfMemoryError: Java heap space
```

Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>

Closes #13537 from ericl/truncated-string.
2016-06-09 18:05:16 -07:00
Herman van Hovell b0768538e5 [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions
## What changes were proposed in this pull request?
The current implementations of `UnixTime` and `FromUnixTime` do not cache their parser/formatter as much as they could. This PR resolves this issue.

This PR is a take-over from https://github.com/apache/spark/pull/13522 and further optimizes the re-use of the parser/formatter. It also improves exception handling (catching the actual exception instead of `Throwable`). All credit for this work should go to rajeshbalamohan.

This PR closes https://github.com/apache/spark/pull/13522
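A minimal sketch of the caching idea using plain `java.text` APIs (an illustrative class, not the actual `UnixTime`/`FromUnixTime` code):

```scala
import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}

class CachedFormatter(pattern: String) {
  // Created once per instance (e.g. per expression) instead of once per row.
  private lazy val formatter: SimpleDateFormat = {
    val f = new SimpleDateFormat(pattern, Locale.US)
    f.setTimeZone(TimeZone.getTimeZone("UTC"))
    f
  }

  // Catch the specific parse failure rather than Throwable.
  def parseToSeconds(s: String): Option[Long] =
    try Some(formatter.parse(s).getTime / 1000L)
    catch { case _: java.text.ParseException => None }
}
```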

## How was this patch tested?
Current tests.

Author: Herman van Hovell <hvanhovell@databricks.com>
Author: Rajesh Balamohan <rbalamohan@apache.org>

Closes #13581 from hvanhovell/SPARK-14321.
2016-06-09 16:37:18 -07:00
Kevin Yu 99386fe398 [SPARK-15804][SQL] Include metadata in the toStructType
## What changes were proposed in this pull request?
The helper function 'toStructType' in the AttributeSeq class doesn't include the metadata when it builds the StructField, which causes the problem reported in https://issues.apache.org/jira/browse/SPARK-15804?jql=project%20%3D%20SPARK when Spark writes a dataframe with metadata to the parquet datasource.

The code path: when Spark writes the dataframe to the parquet datasource through InsertIntoHadoopFsRelationCommand, it builds the WriteRelation container and calls the helper function 'toStructType' to create the StructType containing the StructFields. It should include the metadata there; otherwise, we lose the user-provided metadata.
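A hedged sketch of the shape of the fix using the public `StructField` API (variable names are illustrative; this is not the actual `toStructType` code):

```scala
import org.apache.spark.sql.types.{Metadata, MetadataBuilder, StringType, StructField, StructType}

val meta: Metadata = new MetadataBuilder().putString("comment", "user-supplied").build()

// Omitting the metadata argument silently drops the user-provided metadata.
val dropped = StructField("name", StringType, nullable = true)
// Passing it through preserves it all the way to the writer.
val kept = StructField("name", StringType, nullable = true, meta)

val schema = StructType(Seq(kept))
```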

## How was this patch tested?

added test case in ParquetQuerySuite.scala

Author: Kevin Yu <qyu@us.ibm.com>

Closes #13555 from kevinyu98/spark-15804.
2016-06-09 09:50:09 -07:00
Herman van Hovell 91fbc880b6 [SPARK-15789][SQL] Allow reserved keywords in most places
## What changes were proposed in this pull request?
The parser currently does not allow the use of some SQL keywords as table or field names. This PR adds support for all keywords as identifiers. The exception to this is table aliases; in this case most keywords are allowed, except for join keywords (```anti, full, inner, left, semi, right, natural, on, join, cross```) and set-operator keywords (```union, intersect, except```).

## How was this patch tested?
I have added/moved/renamed tests in the catalyst `*ParserSuite`s.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #13534 from hvanhovell/SPARK-15789.
2016-06-07 17:01:11 -07:00
Sean Zhong 890baaca50 [SPARK-15674][SQL] Deprecates "CREATE TEMPORARY TABLE USING...", uses "CREATE TEMPORARY VIEW USING..." instead
## What changes were proposed in this pull request?

The current implementation of "CREATE TEMPORARY TABLE USING datasource..." is NOT creating any intermediate temporary data directory like temporary HDFS folder, instead, it only stores a SQL string in memory. Probably we should use "TEMPORARY VIEW" instead.

This PR assumes a temporary table has to link with some temporary intermediate data. It follows the definition of temporary table like this (from [hortonworks doc](https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/temp-tables.html)):
> A temporary table is a convenient way for an application to automatically manage intermediate data generated during a complex query

**Example**:

```
scala> spark.sql("CREATE temporary view  my_tab7 (c1: String, c2: String)  USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat  OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')")
scala> spark.sql("select c1, c2 from my_tab7").show()
+----+-----+
|  c1|   c2|
+----+-----+
|year| make|
|2012|Tesla|
...
```

It NOW prints a **deprecation warning** if "CREATE TEMPORARY TABLE USING..." is used.

```
scala> spark.sql("CREATE temporary table  my_tab7 (c1: String, c2: String)  USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat  OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')")
16/05/31 10:39:27 WARN SparkStrategies$DDLStrategy: CREATE TEMPORARY TABLE tableName USING... is deprecated, please use CREATE TEMPORARY VIEW viewName USING... instead
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13414 from clockfly/create_temp_view_using.
2016-06-07 15:21:55 -07:00
Sean Zhong 5f731d6859 [SPARK-15792][SQL] Allows operator to change the verbosity in explain output
## What changes were proposed in this pull request?

This PR allows customization of verbosity in explain output. After this change, `dataframe.explain()` and `dataframe.explain(true)` have different verbosity in their physical plan output.

Currently, this PR only enables the verbosity string for the operators `HashAggregateExec` and `SortAggregateExec`. We will gradually enable the verbosity string for more operators in the future.

**Less verbose mode:** dataframe.explain(extended = false)

`output=[count(a)#85L]` is **NOT** displayed for HashAggregate.

```
scala> Seq((1,2,3)).toDF("a", "b", "c").createTempView("df2")
scala> spark.sql("select count(a) from df2").explain()
== Physical Plan ==
*HashAggregate(key=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *HashAggregate(key=[], functions=[partial_count(1)])
      +- LocalTableScan
```

**Verbose mode:** dataframe.explain(extended = true)

`output=[count(a)#85L]` is displayed for HashAggregate.

```
scala> spark.sql("select count(a) from df2").explain(true)  // "output=[count(a)#85L]" is added
...
== Physical Plan ==
*HashAggregate(key=[], functions=[count(1)], output=[count(a)#85L])
+- Exchange SinglePartition
   +- *HashAggregate(key=[], functions=[partial_count(1)], output=[count#87L])
      +- LocalTableScan
```

## How was this patch tested?

Manual test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13535 from clockfly/verbose_breakdown_2.
2016-06-06 22:59:25 -07:00
Sean Zhong 0e0904a2fc [SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema
## What changes were proposed in this pull request?

This PR makes sure the typed Filter doesn't change the Dataset schema.

**Before the change:**

```
scala> val df = spark.range(0,9)
scala> df.schema
res12: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
scala> val afterFilter = df.filter(_=>true)
scala> afterFilter.schema   // !!! schema is CHANGED!!! Column name is changed from id to value, nullable is changed from false to true.
res13: org.apache.spark.sql.types.StructType = StructType(StructField(value,LongType,true))

```

`SerializeFromObject` and `DeserializeToObject` are inserted to wrap the `Filter`, and these two can possibly change the schema of the Dataset.

**After the change:**

```
scala> afterFilter.schema   // schema is NOT changed.
res47: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13529 from clockfly/spark-15632.
2016-06-06 22:40:21 -07:00
Josh Rosen 0b8d694999 [SPARK-15764][SQL] Replace N^2 loop in BindReferences
BindReferences contains an n^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` sequence. Because `input` can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n).

Instead of performing a linear scan, we can convert the input into an array and build a hash map from expression ids to ordinals. The greater up-front cost of building the map is offset by the fact that an expression can contain multiple attribute references, so that cost is amortized across a number of lookups.
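A self-contained sketch of the lookup-table idea (hypothetical types and helper names, not the actual BindReferences code):

```scala
// Illustrative stand-in for an attribute reference keyed by expression id.
case class AttrRef(exprId: Long, nullable: Boolean)

// Build the exprId -> (ordinal, nullable) table once; lookups are then O(1).
def buildOrdinalMap(input: Seq[AttrRef]): Map[Long, (Int, Boolean)] = {
  val arr = input.toArray // constant-time indexing even if `input` was a List
  arr.iterator.zipWithIndex.map { case (a, i) => (a.exprId, (i, a.nullable)) }.toMap
}

// Binding n references now costs one O(n) pass plus n hash lookups,
// instead of n linear scans over `input`.
def bindOrdinals(refs: Seq[AttrRef], input: Seq[AttrRef]): Seq[Int] = {
  val ordinals = buildOrdinalMap(input)
  refs.map(r => ordinals(r.exprId)._1)
}
```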

Perf. benchmarks to follow. /cc ericl

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13505 from JoshRosen/bind-references-improvement.
2016-06-06 11:44:51 -07:00
Zheng RuiFeng fd8af39713 [MINOR] Fix Typos 'an -> a'
## What changes were proposed in this pull request?

`an -> a`

Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13515 from zhengruifeng/an_a.
2016-06-06 09:35:47 +01:00
Wenchen Fan 30c4774f33 [SPARK-15657][SQL] RowEncoder should validate the data type of input object
## What changes were proposed in this pull request?

This PR improves the error handling of `RowEncoder`. When we create a `RowEncoder` with a given schema, we should validate the data type of the input object, e.g. we should throw an exception when a field is a boolean but is declared as a string column.

This PR also removes support for using `Product` as a valid external type for struct type. This support was added in https://github.com/apache/spark/pull/9712, but it is incomplete, e.g. nested products and products in arrays both do not work. However, we never officially supported this feature and I think it's OK to ban it.

## How was this patch tested?

new tests in `RowEncoderSuite`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13401 from cloud-fan/bug.
2016-06-05 15:59:52 -07:00
Weiqing Yang 0f307db5e1 [SPARK-15707][SQL] Make Code Neat - Use map instead of if check.
## What changes were proposed in this pull request?
In the `forType` function of object `RandomDataGenerator`, code of the following shape:
```scala
if (maybeSqlTypeGenerator.isDefined) {
  // ...
  Some(generator)
} else {
  None
}
```
is replaced; `maybeSqlTypeGenerator.map` is used instead.
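A hedged sketch of the `Option.map` form (simplified generator type, not the actual `RandomDataGenerator` code):

```scala
// The generator is modeled as a plain () => Any here.
def forType(maybeSqlTypeGenerator: Option[() => Any]): Option[() => Any] =
  maybeSqlTypeGenerator.map { sqlTypeGenerator =>
    // Derive the wrapped generator from the underlying one; a None input simply stays None.
    () => sqlTypeGenerator()
  }
```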

## How was this patch tested?
All of the current unit tests passed.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #13448 from Sherry302/master.
2016-06-04 22:44:03 +01:00
Josh Rosen 091f81e1f7 [SPARK-15762][SQL] Cache Metadata & StructType hashCodes; use singleton Metadata.empty
We should cache `Metadata.hashCode` and use a singleton for `Metadata.empty` because calculating metadata hashCodes appears to be a bottleneck for certain workloads.

We should also cache `StructType.hashCode`.

In an optimizer stress-test benchmark run by ericl, these `hashCode` calls accounted for roughly 40% of the total CPU time and this bottleneck was completely eliminated by the caching added by this patch.
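A minimal sketch of the caching pattern (an illustrative class, not Spark's actual `Metadata` implementation):

```scala
class Metadata(val entries: Map[String, String]) {
  // Computed once on first use, then reused for every subsequent hashCode call.
  private lazy val cachedHashCode: Int = entries.hashCode()
  override def hashCode(): Int = cachedHashCode

  override def equals(other: Any): Boolean = other match {
    case m: Metadata => entries == m.entries
    case _           => false
  }
}

object Metadata {
  // A single shared instance instead of allocating a fresh empty Metadata each time.
  val empty: Metadata = new Metadata(Map.empty)
}
```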

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13504 from JoshRosen/metadata-fix.
2016-06-04 14:14:50 -07:00
Wenchen Fan 11c83f83d5 [SPARK-15140][SQL] make the semantics of null input object for encoder clear
## What changes were proposed in this pull request?

For an input object of non-flat type, we can't encode it to a row if it's null, as Spark SQL doesn't allow a row to be null; only its columns can be null.

This PR explicitly adds this constraint and throws an exception if users break it.

## How was this patch tested?

several new tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13469 from cloud-fan/null-object.
2016-06-03 14:28:19 -07:00
Wenchen Fan 61b80d552a [SPARK-15547][SQL] nested case class in encoder can have different number of fields from the real schema
## What changes were proposed in this pull request?

There are 2 kinds of `GetStructField`:

1. resolved from `UnresolvedExtractValue`, and it will have a `name` property.
2. created when we build the deserializer expression for a nested tuple; it has no `name` property.

When we want to validate the ordinals of a nested tuple, we should only catch `GetStructField` instances without the name property.

## How was this patch tested?

new test in `EncoderResolutionSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13474 from cloud-fan/ordinal-check.
2016-06-03 14:26:24 -07:00
gatorsmile eb10b481ca [SPARK-15286][SQL] Make the output readable for EXPLAIN CREATE TABLE and DESC EXTENDED
#### What changes were proposed in this pull request?
Before this PR, the output of EXPLAIN of following SQL is like

```SQL
CREATE EXTERNAL TABLE extTable_with_partitions (key INT, value STRING)
PARTITIONED BY (ds STRING, hr STRING)
LOCATION '/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-b39a6185-8981-403b-a4aa-36fb2f4ca8a9'
```
``ExecutedCommand CreateTableCommand CatalogTable(`extTable_with_partitions`,CatalogTableType(EXTERNAL),CatalogStorageFormat(Some(/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-dd234718-e85d-4c5a-8353-8f1834ac0323),Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(key,int,true,None), CatalogColumn(value,string,true,None), CatalogColumn(ds,string,true,None), CatalogColumn(hr,string,true,None)),List(ds, hr),List(),List(),-1,,1463026413544,-1,Map(),None,None,None), false``

After this PR, the output is like

```
ExecutedCommand
:  +- CreateTableCommand CatalogTable(
	Table:`extTable_with_partitions`
	Created:Thu Jun 02 21:30:54 PDT 2016
	Last Access:Wed Dec 31 15:59:59 PST 1969
	Type:EXTERNAL
	Schema:[`key` int, `value` string, `ds` string, `hr` string]
	Partition Columns:[`ds`, `hr`]
	Storage(Location:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-a06083b8-8e88-4d07-9ff0-d6bd8d943ad3, InputFormat:org.apache.hadoop.mapred.TextInputFormat, OutputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), false
```

This is also applicable to `DESC EXTENDED`. However, this does not have special handling for Data Source Tables. If needed, we need to move the logic of `DDLUtil`. Let me know if we should do it in this PR. Thanks! rxin liancheng

#### How was this patch tested?
Manual testing

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13070 from gatorsmile/betterExplainCatalogTable.
2016-06-03 13:56:22 -07:00
Josh Rosen e526913989 [SPARK-15742][SQL] Reduce temp collections allocations in TreeNode transform methods
In Catalyst's TreeNode transform methods we end up calling `productIterator.map(...).toArray` in a number of places, which is slightly inefficient because it needs to allocate an `ArrayBuilder` and grow a temporary array. Since we already know the size of the final output (`productArity`), we can simply allocate an array up-front and use a while loop to consume the iterator and populate the array.

For most workloads, this performance difference is negligible but it does make a measurable difference in optimizer performance for queries that operate over very wide schemas (such as the benchmark queries in #13456).
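A self-contained sketch of the pre-sized-array pattern (a generic `Product` helper, not the actual TreeNode code):

```scala
import scala.reflect.ClassTag

// Transform each element of a Product into a pre-sized array, avoiding the
// ArrayBuilder allocation and growth that map(...).toArray would incur.
def mapProductToArray[T: ClassTag](p: Product)(f: Any => T): Array[T] = {
  val result = new Array[T](p.productArity) // final size is known up front
  val it = p.productIterator
  var i = 0
  while (it.hasNext) {
    result(i) = f(it.next())
    i += 1
  }
  result
}

// e.g. mapProductToArray(("a", 1, true))(_.toString) yields Array("a", "1", "true")
```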

### Perf results (from #13456 benchmarks)

**Before**

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz

parsing large select:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
1 select expressions                            19 /   22          0.0    19119858.0       1.0X
10 select expressions                           23 /   25          0.0    23208774.0       0.8X
100 select expressions                          55 /   73          0.0    54768402.0       0.3X
1000 select expressions                        229 /  259          0.0   228606373.0       0.1X
2500 select expressions                        530 /  554          0.0   529938178.0       0.0X
```

**After**

```
parsing large select:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
1 select expressions                            15 /   21          0.0    14978203.0       1.0X
10 select expressions                           22 /   27          0.0    22492262.0       0.7X
100 select expressions                          48 /   64          0.0    48449834.0       0.3X
1000 select expressions                        189 /  208          0.0   189346428.0       0.1X
2500 select expressions                        429 /  449          0.0   428943897.0       0.0X
```

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13484 from JoshRosen/treenode-productiterator-map.
2016-06-03 13:53:02 -07:00
Ioana Delaney 9e2eb13ca5 [SPARK-15677][SQL] Query with scalar sub-query in the SELECT list throws UnsupportedOperationException
## What changes were proposed in this pull request?
Queries with a scalar sub-query in the SELECT list, run against a local, in-memory relation, throw an UnsupportedOperationException.

Problem repro:
```scala
scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
scala> sql("select (select min(c1) from t2) from t1").show()

java.lang.UnsupportedOperationException: Cannot evaluate expression: scalar-subquery#62 []
  at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:215)
  at org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:62)
  at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142)
  at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:45)
  at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:29)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$37.applyOrElse(Optimizer.scala:1473)
```
The problem is specific to local, in-memory relations. It is caused by the rule ConvertToLocalRelation, which attempts to push down a scalar-subquery expression to the local tables.

The solution prevents the rule from applying if the Project references scalar subqueries.

## How was this patch tested?
Added regression tests to SubquerySuite.scala

Author: Ioana Delaney <ioanamdelaney@gmail.com>

Closes #13418 from ioana-delaney/scalarSubV2.
2016-06-03 12:04:27 -07:00
Wenchen Fan 190ff274fd [SPARK-15494][SQL] encoder code cleanup
## What changes were proposed in this pull request?

Our encoder framework has evolved a lot; this PR tries to clean up the code to make it more readable and to emphasise the concept that an encoder should be used as a container of serde expressions.

1. move the validation logic to the analyzer instead of the encoder
2. only have a `resolveAndBind` method in the encoder instead of `resolve` and `bind`, as we don't have the encoder life-cycle concept anymore.
3. `Dataset` doesn't need to keep a resolved encoder, as there is no such concept anymore. A bound encoder is still needed to do serialization outside of the query framework.
4. Using `BoundReference` to represent an unresolved field in a deserializer expression is a bit odd, so this PR adds a `GetColumnByOrdinal` for this purpose. (Serializer expressions still use `BoundReference`; we can replace it with `GetColumnByOrdinal` in follow-ups.)

## How was this patch tested?

existing test

Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #13269 from cloud-fan/clean-encoder.
2016-06-03 00:43:02 -07:00
Sean Zhong 6dde27404c [SPARK-15733][SQL] Makes the explain output less verbose by hiding some verbose output like None, null, empty List, etc.
## What changes were proposed in this pull request?

This PR makes the explain output less verbose by hiding some verbose output such as `None`, `null`, the empty list `[]`, the empty set `{}`, etc.

**Before change**:

```
== Physical Plan ==
ExecutedCommand
:  +- ShowTablesCommand None, None
```

**After change**:

```
== Physical Plan ==
ExecutedCommand
:  +- ShowTablesCommand
```

## How was this patch tested?

Manual test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13470 from clockfly/verbose_breakdown_4.
2016-06-02 22:45:37 -07:00
Wenchen Fan 6323e4bd76 [SPARK-15732][SQL] better error message when use java reserved keyword as field name
## What changes were proposed in this pull request?

When users create a case class and use a Java reserved keyword as a field name, Spark SQL will generate illegal Java code and throw an exception at runtime.

This PR checks the field names when building the encoder and, if illegal field names are used, throws an exception immediately with a clear error message.
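A hypothetical example of the kind of field name the check guards against (not taken from the PR's tests): backticks make it legal Scala, but the name is still reserved in Java.

```scala
// `new` compiles fine as a Scala field name, yet it cannot appear as an identifier in generated Java code.
case class Order(`new`: Boolean, id: Long)
```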

## How was this patch tested?

new test in DatasetSuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13485 from cloud-fan/java.
2016-06-02 18:13:04 -07:00
Andrew Or d1c1fbc345 [SPARK-15715][SQL] Fix alter partition with storage information in Hive
## What changes were proposed in this pull request?

This command didn't work for Hive tables. Now it does:
```
ALTER TABLE boxes PARTITION (width=3)
    SET SERDE 'com.sparkbricks.serde.ColumnarSerDe'
    WITH SERDEPROPERTIES ('compress'='true')
```

## How was this patch tested?

`HiveExternalCatalogSuite`

Author: Andrew Or <andrew@databricks.com>

Closes #13453 from andrewor14/alter-partition-storage.
2016-06-02 17:44:48 -07:00
Sean Zhong 985d532812 [SPARK-15734][SQL] Avoids printing internal row in explain output
## What changes were proposed in this pull request?

This PR avoids printing internal rows in explain output for some operators.

**Before change:**

```
scala> (1 to 10).toSeq.map(_ => (1,2,3)).toDF().createTempView("df3")
scala> spark.sql("select * from df3 where 1=2").explain(true)
...
== Analyzed Logical Plan ==
_1: int, _2: int, _3: int
Project [_1#37,_2#38,_3#39]
+- Filter (1 = 2)
   +- SubqueryAlias df3
      +- LocalRelation [_1#37,_2#38,_3#39], [[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3]]
...
== Physical Plan ==
LocalTableScan [_1#37,_2#38,_3#39]
```

**After change:**

```
scala> spark.sql("select * from df3 where 1=2").explain(true)
...
== Analyzed Logical Plan ==
_1: int, _2: int, _3: int
Project [_1#58,_2#59,_3#60]
+- Filter (1 = 2)
   +- SubqueryAlias df3
      +- LocalRelation [_1#58,_2#59,_3#60]
...
== Physical Plan ==
LocalTableScan <empty>, [_1#58,_2#59,_3#60]
```

## How was this patch tested?
Manual test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13471 from clockfly/verbose_breakdown_5.
2016-06-02 16:21:33 -07:00
Sameer Agarwal 09b3c56c91 [SPARK-14752][SQL] Explicitly implement KryoSerialization for LazilyGenerateOrdering
## What changes were proposed in this pull request?

This patch fixes a number of `com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException` exceptions reported in [SPARK-15604], [SPARK-14752], etc. (while executing Spark SQL queries with the Kryo serializer) by explicitly implementing `KryoSerialization` for `LazilyGenerateOrdering`.

## How was this patch tested?

1. Modified `OrderingSuite` so that all tests in the suite also test kryo serialization (for both interpreted and generated ordering).
2. Manually verified TPC-DS q1.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #13466 from sameeragarwal/kryo.
2016-06-02 10:58:00 -07:00
Dongjoon Hyun 63b7f127ca [SPARK-15076][SQL] Add ReorderAssociativeOperator optimizer
## What changes were proposed in this pull request?

This issue adds a new optimizer rule, `ReorderAssociativeOperator`, that takes advantage of the associative property of integral arithmetic. Currently, Spark works like the following.

1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`.
2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`.

This PR can handle Case 2 for **Add/Multiply** expressions whose data types are `ByteType`, `ShortType`, `IntegerType`, and `LongType`. The following is a plan comparison between `before` and `after` this issue; a standalone sketch of the idea follows the comparison.

**Before**
```scala
scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain
== Physical Plan ==
WholeStageCodegen
:  +- Project [(((((((((a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS (((((((((a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8]
:     +- INPUT
+- Generate explode([1]), false, false, [a#7]
   +- Scan OneRowRelation[]
scala> sql("select a*1*2*3*4*5*6*7*8*9 from (select explode(array(1)) a)").explain
== Physical Plan ==
*Project [(((((((((a#18 * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9) AS (((((((((a * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9)#19]
+- Generate explode([1]), false, false, [a#18]
   +- Scan OneRowRelation[]
```

**After**
```scala
scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain
== Physical Plan ==
WholeStageCodegen
:  +- Project [(a#7 + 45) AS (((((((((a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8]
:     +- INPUT
+- Generate explode([1]), false, false, [a#7]
   +- Scan OneRowRelation[]
scala> sql("select a*1*2*3*4*5*6*7*8*9 from (select explode(array(1)) a)").explain
== Physical Plan ==
*Project [(a#18 * 362880) AS (((((((((a * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9)#19]
+- Generate explode([1]), false, false, [a#18]
   +- Scan OneRowRelation[]
```
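A hedged, self-contained sketch of the flatten-and-fold idea on a toy expression tree (an illustrative ADT, not Catalyst's `Expression` hierarchy or the actual rule):

```scala
sealed trait Expr
case class Lit(v: Long) extends Expr
case class Ref(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// Flatten a chain of Adds, fold every literal into a single constant, and re-attach
// the remaining operands, so a + 1 + 2 + ... + 9 takes the shape 45 + a.
// (Simplified: it always emits a folded literal, even Lit(0) when no literals are present.)
def reorderAdd(e: Expr): Expr = {
  def flatten(ex: Expr): Seq[Expr] = ex match {
    case Add(l, r) => flatten(l) ++ flatten(r)
    case other     => Seq(other)
  }
  val (lits, rest) = flatten(e).partition(_.isInstanceOf[Lit])
  val folded: Expr = Lit(lits.collect { case Lit(v) => v }.sum)
  rest.foldLeft(folded)((acc, operand) => Add(acc, operand))
}

// reorderAdd(Add(Add(Ref("a"), Lit(1)), Lit(2))) == Add(Lit(3), Ref("a"))
```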

This PR was greatly generalized by cloud-fan's key ideas; he should be credited for the work he did.

## How was this patch tested?

Pass the Jenkins tests including new testsuite.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12850 from dongjoon-hyun/SPARK-15076.
2016-06-02 09:48:58 -07:00
Takeshi YAMAMURO 5eea332307 [SPARK-13484][SQL] Prevent illegal NULL propagation when filtering outer-join results
## What changes were proposed in this pull request?
This PR adds a rule at the end of the analyzer to correct the nullable fields of attributes in a logical plan by using the nullable fields of the corresponding attributes in its children logical plans (these plans generate the input rows).

This is another approach for addressing SPARK-13484 (the first approach is https://github.com/apache/spark/pull/11371).

Closes #11371

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #13290 from yhuai/SPARK-13484.
2016-06-01 22:23:00 -07:00
jerryshao 8288e16a5a [SPARK-15620][SQL] Fix transformed dataset attributes revolve failure
## What changes were proposed in this pull request?

A join on a transformed dataset has attribute conflicts, which makes query execution fail. For example:

```
val dataset = Seq(1, 2, 3).toDS()
val mappedDs = dataset.map(_ + 1)

mappedDs.as("t1").joinWith(mappedDs.as("t2"), $"t1.value" === $"t2.value").show()
```

will throw an exception:

```
org.apache.spark.sql.AnalysisException: cannot resolve '`t1.value`' given input columns: [value];
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:62)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:59)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
```

## How was this patch tested?

Unit test.

Author: jerryshao <sshao@hortonworks.com>

Closes #13399 from jerryshao/SPARK-15620.
2016-06-01 21:58:05 -07:00
Sean Zhong c8fb776d4a [SPARK-15692][SQL] Improves the explain output of several physical plans by displaying embedded logical plan in tree style
## What changes were proposed in this pull request?

Improves the explain output of several physical plans by displaying embedded logical plan in tree style

Some physical plans contain an embedded logical plan; for example, `cache tableName query` maps to:

```
case class CacheTableCommand(
    tableName: String,
    plan: Option[LogicalPlan],
    isLazy: Boolean)
  extends RunnableCommand
```

It is easier to read the explain output if we can display the `plan` in tree style.

**Before change:**

Everything is crammed into one line.

```
scala> Seq((1,2)).toDF().createOrReplaceTempView("testView")
scala> spark.sql("cache table testView2 select * from testView").explain()
== Physical Plan ==
ExecutedCommand CacheTableCommand testView2, Some('Project [*]
+- 'UnresolvedRelation `testView`, None
), false
```

**After change:**

```
scala> spark.sql("cache table testView2 select * from testView").explain()
== Physical Plan ==
ExecutedCommand
:  +- CacheTableCommand testView2, false
:     :  +- 'Project [*]
:     :     +- 'UnresolvedRelation `testView`, None
```

## How was this patch tested?

Manual test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13433 from clockfly/verbose_breakdown_3_2.
2016-06-01 17:03:39 -07:00
Wenchen Fan 8640cdb836 [SPARK-15441][SQL] support null object in Dataset outer-join
## What changes were proposed in this pull request?

Currently we can't encode a top-level null object into an internal row, as Spark SQL doesn't allow a row to be null; only its columns can be null.

This was not a problem before, as we assumed the input object was never null. However, for outer joins, we do need the semantics of a null object.

This PR fixes the problem by making both join sides produce a single column, i.e. nesting the logical plan output (by `CreateStruct`), so that we have an extra level to represent a top-level null object.

## How was this patch tested?

new test in `DatasetSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13425 from cloud-fan/outer-join2.
2016-06-01 16:16:54 -07:00
Cheng Lian 7bb64aae27 [SPARK-15269][SQL] Removes unexpected empty table directories created while creating external Spark SQL data source tables.
This PR is an alternative to #13120 authored by xwu0226.

## What changes were proposed in this pull request?

When creating an external Spark SQL data source table and persisting its metadata to Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under database location, but doesn't remove it while dropping the table (because the table is external).

This PR works around this issue by explicitly setting `Table.dataLocation` and then manually removing the created directory after creating the external table.

Please refer to [this JIRA comment][1] for more details about why we chose this approach as a workaround.

[1]: https://issues.apache.org/jira/browse/SPARK-15269?focusedCommentId=15297408&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15297408

## How was this patch tested?

1. A new test case is added in `HiveQuerySuite` for this case
2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue in the first place.)

Author: Cheng Lian <lian@databricks.com>

Closes #13270 from liancheng/spark-15269-unpleasant-fix.
2016-06-01 16:02:27 -07:00
Reynold Xin a71d1364ae [SPARK-15686][SQL] Move user-facing streaming classes into sql.streaming
## What changes were proposed in this pull request?
This patch moves all user-facing structured streaming classes into sql.streaming. As part of this, I also added some since-version annotations to methods and classes that don't have them.

## How was this patch tested?
Updated tests to reflect the moves.

Author: Reynold Xin <rxin@databricks.com>

Closes #13429 from rxin/SPARK-15686.
2016-06-01 10:14:40 -07:00
Sean Zhong d5012c2740 [SPARK-15495][SQL] Improve the explain output for Aggregation operator
## What changes were proposed in this pull request?

This PR improves the explain output of the aggregation operator.

SQL:

```
Seq((1,2,3)).toDF("a", "b", "c").createTempView("df1")
spark.sql("cache table df1")
spark.sql("select count(a), count(c), b from df1 group by b").explain()
```

**Before change:**

```
*TungstenAggregate(key=[b#8], functions=[count(1),count(1)], output=[count(a)#79L,count(c)#80L,b#8])
+- Exchange hashpartitioning(b#8, 200), None
   +- *TungstenAggregate(key=[b#8], functions=[partial_count(1),partial_count(1)], output=[b#8,count#98L,count#99L])
      +- InMemoryTableScan [b#8], InMemoryRelation [a#7,b#8,c#9], true, 10000, StorageLevel(disk=true, memory=true, offheap=false, deserialized=true, replication=1), LocalTableScan [a#7,b#8,c#9], [[1,2,3]], Some(df1)
```

**After change:**

```
*Aggregate(key=[b#8], functions=[count(1),count(1)], output=[count(a)#79L,count(c)#80L,b#8])
+- Exchange hashpartitioning(b#8, 200), None
   +- *Aggregate(key=[b#8], functions=[partial_count(1),partial_count(1)], output=[b#8,count#98L,count#99L])
      +- InMemoryTableScan [b#8], InMemoryRelation [a#7,b#8,c#9], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), LocalTableScan [a#7,b#8,c#9], [[1,2,3]], Some(df1)
```

## How was this patch tested?

Manual test and existing UT.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13363 from clockfly/verbose3.
2016-06-01 09:58:01 -07:00
jerryshao e4ce1bc4f3 [SPARK-15659][SQL] Ensure FileSystem is gotten from path
## What changes were proposed in this pull request?

Currently `spark.sql.warehouse.dir` points to a local directory by default, which will throw an exception when HADOOP_CONF_DIR is configured and the default FS is HDFS.

```
java.lang.IllegalArgumentException: Wrong FS: file:/Users/sshao/projects/apache-spark/spark-warehouse, expected: hdfs://localhost:8020
```

So we should always get the `FileSystem` from the `Path` to avoid the wrong-FS problem.
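A short sketch of the pattern using the standard Hadoop API (paths and variable names are illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
val warehousePath = new Path("hdfs://localhost:8020/user/hive/warehouse")

// Resolve the FileSystem from the path's own scheme (file://, hdfs://, ...).
val fs: FileSystem = warehousePath.getFileSystem(hadoopConf)

// By contrast, FileSystem.get(hadoopConf) always returns the configured default FS,
// which can mismatch the path's scheme and trigger the "Wrong FS" error shown above.
fs.exists(warehousePath)
```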

## How was this patch tested?

Local test.

Author: jerryshao <sshao@hortonworks.com>

Closes #13405 from jerryshao/SPARK-15659.
2016-06-01 08:28:19 -05:00
Eric Liang 93e97147eb [MINOR] Slightly better error message when attempting to query hive tables w/in-mem catalog
andrewor14

Author: Eric Liang <ekl@databricks.com>

Closes #13427 from ericl/better-error-msg.
2016-05-31 17:39:03 -07:00