Commit graph

1831 commits

Author SHA1 Message Date
Wenchen Fan aaf632b213 revert PR#10896 and PR#14865
## What changes were proposed in this pull request?

according to the discussion in the original PR #10896 and the new approach PR #14876 , we decided to revert these 2 PRs and go with the new approach.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14909 from cloud-fan/revert.
2016-09-01 13:19:15 +08:00
Takeshi YAMAMURO 94922d79e9 [SPARK-17289][SQL] Fix a bug to satisfy sort requirements in partial aggregations
## What changes were proposed in this pull request?
Partial aggregations are generated in `EnsureRequirements`, but the planner fails to
check if partial aggregation satisfies sort requirements.
For the following query:
```
val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", "b").createOrReplaceTempView("t2")
spark.sql("select max(b) from t2 group by a").explain(true)
```
Now, the SortAggregator won't insert Sort operator before partial aggregation, this will break sort-based partial aggregation.
```
== Physical Plan ==
SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
+- *Sort [a#5 ASC], false, 0
   +- Exchange hashpartitioning(a#5, 200)
      +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19])
         +- LocalTableScan [a#5, b#6]
```
Actually, a correct plan is:
```
== Physical Plan ==
SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
+- *Sort [a#5 ASC], false, 0
   +- Exchange hashpartitioning(a#5, 200)
      +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19])
         +- *Sort [a#5 ASC], false, 0
            +- LocalTableScan [a#5, b#6]
```

## How was this patch tested?
Added tests in `PlannerSuite`.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #14865 from maropu/SPARK-17289.
2016-08-30 16:43:47 +08:00
Davies Liu 48caec2516 [SPARK-17063] [SQL] Improve performance of MSCK REPAIR TABLE with Hive metastore
## What changes were proposed in this pull request?

This PR split the the single `createPartitions()` call into smaller batches, which could prevent Hive metastore from OOM (caused by millions of partitions).

It will also try to gather all the fast stats (number of files and total size of all files) in parallel to avoid the bottle neck of listing the files in metastore sequential, which is controlled by spark.sql.gatherFastStats (enabled by default).

## How was this patch tested?

Tested locally with 10000 partitions and 100 files with embedded metastore, without gathering fast stats in parallel, adding partitions took 153 seconds, after enable that, gathering the fast stats took about 34 seconds, adding these partitions took 25 seconds (most of the time spent in object store), 59 seconds in total, 2.5X faster (with larger cluster, gathering will much faster).

Author: Davies Liu <davies@databricks.com>

Closes #14607 from davies/repair_batch.
2016-08-29 11:23:53 -07:00
Tejas Patil 095862a3cf [SPARK-17271][SQL] Planner adds un-necessary Sort even if child ordering is semantically same as required ordering
## What changes were proposed in this pull request?

Jira : https://issues.apache.org/jira/browse/SPARK-17271

Planner is adding un-needed SORT operation due to bug in the way comparison for `SortOrder` is done at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253
`SortOrder` needs to be compared semantically because `Expression` within two `SortOrder` can be "semantically equal" but not literally equal objects.

eg. In case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")`

Expression in required SortOrder:
```
      AttributeReference(
        name = "col1",
        dataType = LongType,
        nullable = false
      ) (exprId = exprId,
        qualifier = Some("a")
      )
```

Expression in child SortOrder:
```
      AttributeReference(
        name = "col1",
        dataType = LongType,
        nullable = false
      ) (exprId = exprId)
```

Notice that the output column has a qualifier but the child attribute does not but the inherent expression is the same and hence in this case we can say that the child satisfies the required sort order.

This PR includes following changes:
- Added a `semanticEquals` method to `SortOrder` so that it can compare underlying child expressions semantically (and not using default Object.equals)
- Fixed `EnsureRequirements` to use semantic comparison of SortOrder

## How was this patch tested?

- Added a test case to `PlannerSuite`. Ran rest tests in `PlannerSuite`

Author: Tejas Patil <tejasp@fb.com>

Closes #14841 from tejasapatil/SPARK-17271_sort_order_equals_bug.
2016-08-28 19:14:58 +02:00
Takeshi YAMAMURO cd0ed31ea9 [SPARK-15382][SQL] Fix a bug in sampling with replacement
## What changes were proposed in this pull request?
This pr to fix a bug below in sampling with replacement
```
val df = Seq((1, 0), (2, 0), (3, 0)).toDF("a", "b")
df.sample(true, 2.0).withColumn("c", monotonically_increasing_id).select($"c").show
+---+
|  c|
+---+
|  0|
|  1|
|  1|
|  1|
|  2|
+---+
```

## How was this patch tested?
Added a test in `DataFrameSuite`.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #14800 from maropu/FixSampleBug.
2016-08-27 08:42:41 +01:00
petermaxlee f64a1ddd09 [SPARK-17235][SQL] Support purging of old logs in MetadataLog
## What changes were proposed in this pull request?
This patch adds a purge interface to MetadataLog, and an implementation in HDFSMetadataLog. The purge function is currently unused, but I will use it to purge old execution and file source logs in follow-up patches. These changes are required in a production structured streaming job that runs for a long period of time.

## How was this patch tested?
Added a unit test case in HDFSMetadataLogSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14802 from petermaxlee/SPARK-17235.
2016-08-26 16:05:34 -07:00
Herman van Hovell a11d10f182 [SPARK-17246][SQL] Add BigDecimal literal
## What changes were proposed in this pull request?
This PR adds parser support for `BigDecimal` literals. If you append the suffix `BD` to a valid number then this will be interpreted as a `BigDecimal`, for example `12.0E10BD` will interpreted into a BigDecimal with scale -9 and precision 3. This is useful in situations where you need exact values.

## How was this patch tested?
Added tests to `ExpressionParserSuite`, `ExpressionSQLBuilderSuite` and `SQLQueryTestSuite`.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #14819 from hvanhovell/SPARK-17246.
2016-08-26 13:29:22 -07:00
petermaxlee 9812f7d538 [SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely
## What changes were proposed in this pull request?
Before this change, FileStreamSource uses an in-memory hash set to track the list of files processed by the engine. The list can grow indefinitely, leading to OOM or overflow of the hash set.

This patch introduces a new user-defined option called "maxFileAge", default to 24 hours. If a file is older than this age, FileStreamSource will purge it from the in-memory map that was used to track the list of files that have been processed.

## How was this patch tested?
Added unit tests for the underlying utility, and also added an end-to-end test to validate the purge in FileStreamSourceSuite. Also verified the new test cases would fail when the timeout was set to a very large number.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14728 from petermaxlee/SPARK-17165.
2016-08-26 11:30:23 -07:00
gatorsmile fd4ba3f626 [SPARK-17192][SQL] Issue Exception when Users Specify the Partitioning Columns without a Given Schema
### What changes were proposed in this pull request?
Address the comments by yhuai in the original PR: https://github.com/apache/spark/pull/14207

First, issue an exception instead of logging a warning when users specify the partitioning columns without a given schema.

Second, refactor the codes a little.

### How was this patch tested?
Fixed the test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14572 from gatorsmile/followup16552.
2016-08-26 11:13:38 -07:00
hyukjinkwon 6063d5963f [SPARK-16216][SQL][FOLLOWUP] Enable timestamp type tests for JSON and verify all unsupported types in CSV
## What changes were proposed in this pull request?

This PR enables the tests for `TimestampType` for JSON and unifies the logics for verifying schema when writing in CSV.

In more details, this PR,

- Enables the tests for `TimestampType` for JSON and

  This was disabled due to an issue in `DatatypeConverter.parseDateTime` which parses dates incorrectly, for example as below:

  ```scala
   val d = javax.xml.bind.DatatypeConverter.parseDateTime("0900-01-01T00:00:00.000").getTime
  println(d.toString)
  ```
  ```
  Fri Dec 28 00:00:00 KST 899
  ```

  However, since we use `FastDateFormat`, it seems we are safe now.

  ```scala
  val d = FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSS").parse("0900-01-01T00:00:00.000")
  println(d)
  ```
  ```
  Tue Jan 01 00:00:00 PST 900
  ```

- Verifies all unsupported types in CSV

  There is a separate logics to verify the schemas in `CSVFileFormat`. This is actually not quite correct enough because we don't support `NullType` and `CalanderIntervalType` as well `StructType`, `ArrayType`, `MapType`. So, this PR adds both types.

## How was this patch tested?

Tests in `JsonHadoopFsRelation` and `CSVSuite`

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14829 from HyukjinKwon/SPARK-16216-followup.
2016-08-26 17:29:37 +02:00
Sean Zhong d96d151563 [SPARK-17187][SQL] Supports using arbitrary Java object as internal aggregation buffer object
## What changes were proposed in this pull request?

This PR introduces an abstract class `TypedImperativeAggregate` so that an aggregation function of TypedImperativeAggregate can use  **arbitrary** user-defined Java object as intermediate aggregation buffer object.

**This has advantages like:**
1. It now can support larger category of aggregation functions. For example, it will be much easier to implement aggregation function `percentile_approx`, which has a complex aggregation buffer definition.
2. It can be used to avoid doing serialization/de-serialization for every call of `update` or `merge` when converting domain specific aggregation object to internal Spark-Sql storage format.
3. It is easier to integrate with other existing monoid libraries like algebird, and supports more aggregation functions with high performance.

Please see `org.apache.spark.sql.TypedImperativeAggregateSuite.TypedMaxAggregate` to find an example of how to defined a `TypedImperativeAggregate` aggregation function.
Please see Java doc of `TypedImperativeAggregate` and Jira ticket SPARK-17187 for more information.

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #14753 from clockfly/object_aggregation_buffer_try_2.
2016-08-25 16:36:16 -07:00
gatorsmile d2ae6399ee [SPARK-16991][SPARK-17099][SPARK-17120][SQL] Fix Outer Join Elimination when Filter's isNotNull Constraints Unable to Filter Out All Null-supplying Rows
### What changes were proposed in this pull request?
This PR is to fix an incorrect outer join elimination when filter's `isNotNull` constraints is unable to filter out all null-supplying rows. For example, `isnotnull(coalesce(b#227, c#238))`.

Users can hit this error when they try to use `using/natural outer join`, which is converted to a normal outer join with a `coalesce` expression on the `using columns`. For example,
```Scala
    val a = Seq((1, 2), (2, 3)).toDF("a", "b")
    val b = Seq((2, 5), (3, 4)).toDF("a", "c")
    val c = Seq((3, 1)).toDF("a", "d")
    val ab = a.join(b, Seq("a"), "fullouter")
    ab.join(c, "a").explain(true)
```
The dataframe `ab` is doing `using full-outer join`, which is converted to a normal outer join with a `coalesce` expression. Constraints inference generates a `Filter` with constraints `isnotnull(coalesce(b#227, c#238))`. Then, it triggers a wrong outer join elimination and generates a wrong result.
```
Project [a#251, b#227, c#237, d#247]
+- Join Inner, (a#251 = a#246)
   :- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237]
   :  +- Join FullOuter, (a#226 = a#236)
   :     :- Project [_1#223 AS a#226, _2#224 AS b#227]
   :     :  +- LocalRelation [_1#223, _2#224]
   :     +- Project [_1#233 AS a#236, _2#234 AS c#237]
   :        +- LocalRelation [_1#233, _2#234]
   +- Project [_1#243 AS a#246, _2#244 AS d#247]
      +- LocalRelation [_1#243, _2#244]

== Optimized Logical Plan ==
Project [a#251, b#227, c#237, d#247]
+- Join Inner, (a#251 = a#246)
   :- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237]
   :  +- Filter isnotnull(coalesce(a#226, a#236))
   :     +- Join FullOuter, (a#226 = a#236)
   :        :- LocalRelation [a#226, b#227]
   :        +- LocalRelation [a#236, c#237]
   +- LocalRelation [a#246, d#247]
```

**A note to the `Committer`**, please also give the credit to dongjoon-hyun who submitted another PR for fixing this issue. https://github.com/apache/spark/pull/14580

### How was this patch tested?
Added test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14661 from gatorsmile/fixOuterJoinElimination.
2016-08-25 14:18:58 +02:00
Takeshi YAMAMURO 2b0cc4e0df [SPARK-12978][SQL] Skip unnecessary final group-by when input data already clustered with group-by keys
This ticket targets the optimization to skip an unnecessary group-by operation below;

Without opt.:
```
== Physical Plan ==
TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)], output=[col0#159,sum#200,sum#201,count#202L])
   +- TungstenExchange hashpartitioning(col0#159,200), None
      +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
```

With opt.:
```
== Physical Plan ==
TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
```

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10896 from maropu/SkipGroupbySpike.
2016-08-25 12:39:58 +02:00
hyukjinkwon 29952ed096 [SPARK-16216][SQL] Read/write timestamps and dates in ISO 8601 and dateFormat/timestampFormat option for CSV and JSON
## What changes were proposed in this pull request?

### Default - ISO 8601

Currently, CSV datasource is writing `Timestamp` and `Date` as numeric form and JSON datasource is writing both as below:

- CSV
  ```
  // TimestampType
  1414459800000000
  // DateType
  16673
  ```

- Json

  ```
  // TimestampType
  1970-01-01 11:46:40.0
  // DateType
  1970-01-01
  ```

So, for CSV we can't read back what we write and for JSON it becomes ambiguous because the timezone is being missed.

So, this PR make both **write** `Timestamp` and `Date` in ISO 8601 formatted string (please refer the [ISO 8601 specification](https://www.w3.org/TR/NOTE-datetime)).

- For `Timestamp` it becomes as below: (`yyyy-MM-dd'T'HH:mm:ss.SSSZZ`)

  ```
  1970-01-01T02:00:01.000-01:00
  ```

- For `Date` it becomes as below (`yyyy-MM-dd`)

  ```
  1970-01-01
  ```

### Custom date format option - `dateFormat`

This PR also adds the support to write and read dates and timestamps in a formatted string as below:

- **DateType**

  - With `dateFormat` option (e.g. `yyyy/MM/dd`)

    ```
    +----------+
    |      date|
    +----------+
    |2015/08/26|
    |2014/10/27|
    |2016/01/28|
    +----------+
    ```

### Custom date format option - `timestampFormat`

- **TimestampType**

  - With `dateFormat` option (e.g. `dd/MM/yyyy HH:mm`)

    ```
    +----------------+
    |            date|
    +----------------+
    |2015/08/26 18:00|
    |2014/10/27 18:30|
    |2016/01/28 20:00|
    +----------------+
    ```

## How was this patch tested?

Unit tests were added in `CSVSuite` and `JsonSuite`. For JSON, existing tests cover the default cases.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14279 from HyukjinKwon/SPARK-16216-json-csv.
2016-08-24 22:16:20 +02:00
Sean Zhong cc33460a51 [SPARK-17188][SQL] Moves class QuantileSummaries to project catalyst for implementing percentile_approx
## What changes were proposed in this pull request?

This is a sub-task of [SPARK-16283](https://issues.apache.org/jira/browse/SPARK-16283) (Implement percentile_approx SQL function), which moves class QuantileSummaries to project catalyst so that it can be reused when implementing aggregation function `percentile_approx`.

## How was this patch tested?

This PR only does class relocation, class implementation is not changed.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14754 from clockfly/move_QuantileSummaries_to_catalyst.
2016-08-23 14:57:00 +08:00
Sean Zhong 929cb8beed [MINOR][SQL] Fix some typos in comments and test hints
## What changes were proposed in this pull request?

Fix some typos in comments and test hints

## How was this patch tested?

N/A.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14755 from clockfly/fix_minor_typo.
2016-08-22 13:31:38 -07:00
Davies Liu 8d35a6f68d [SPARK-17115][SQL] decrease the threshold when split expressions
## What changes were proposed in this pull request?

In 2.0, we change the threshold of splitting expressions from 16K to 64K, which cause very bad performance on wide table, because the generated method can't be JIT compiled by default (above the limit of 8K bytecode).

This PR will decrease it to 1K, based on the benchmark results for a wide table with 400 columns of LongType.

It also fix a bug around splitting expression in whole-stage codegen (it should not split them).

## How was this patch tested?

Added benchmark suite.

Author: Davies Liu <davies@databricks.com>

Closes #14692 from davies/split_exprs.
2016-08-22 16:16:03 +08:00
Wenchen Fan b2074b664a [SPARK-16498][SQL] move hive hack for data source table into HiveExternalCatalog
## What changes were proposed in this pull request?

Spark SQL doesn't have its own meta store yet, and use hive's currently. However, hive's meta store has some limitations(e.g. columns can't be too many, not case-preserving, bad decimal type support, etc.), so we have some hacks to successfully store data source table metadata into hive meta store, i.e. put all the information in table properties.

This PR moves these hacks to `HiveExternalCatalog`, tries to isolate hive specific logic in one place.

changes overview:

1.  **before this PR**: we need to put metadata(schema, partition columns, etc.) of data source tables to table properties before saving it to external catalog, even the external catalog doesn't use hive metastore(e.g. `InMemoryCatalog`)
**after this PR**: the table properties tricks are only in `HiveExternalCatalog`, the caller side doesn't need to take care of it anymore.

2. **before this PR**: because the table properties tricks are done outside of external catalog, so we also need to revert these tricks when we read the table metadata from external catalog and use it. e.g. in `DescribeTableCommand` we will read schema and partition columns from table properties.
**after this PR**: The table metadata read from external catalog is exactly the same with what we saved to it.

bonus: now we can create data source table using `SessionCatalog`, if schema is specified.
breaks: `schemaStringLengthThreshold` is not configurable anymore. `hive.default.rcfile.serde` is not configurable anymore.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14155 from cloud-fan/catalog-table.
2016-08-21 22:23:14 -07:00
Dongjoon Hyun 91c2397684 [SPARK-17098][SQL] Fix NullPropagation optimizer to handle COUNT(NULL) OVER correctly
## What changes were proposed in this pull request?

Currently, `NullPropagation` optimizer replaces `COUNT` on null literals in a bottom-up fashion. During that, `WindowExpression` is not covered properly. This PR adds the missing propagation logic.

**Before**
```scala
scala> sql("SELECT COUNT(1 + NULL) OVER ()").show
java.lang.UnsupportedOperationException: Cannot evaluate expression: cast(0 as bigint) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
```

**After**
```scala
scala> sql("SELECT COUNT(1 + NULL) OVER ()").show
+----------------------------------------------------------------------------------------------+
|count((1 + CAST(NULL AS INT))) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)|
+----------------------------------------------------------------------------------------------+
|                                                                                             0|
+----------------------------------------------------------------------------------------------+
```

## How was this patch tested?

Pass the Jenkins test with a new test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14689 from dongjoon-hyun/SPARK-17098.
2016-08-21 22:07:47 +02:00
petermaxlee 9560c8d295 [SPARK-17124][SQL] RelationalGroupedDataset.agg should preserve order and allow multiple aggregates per column
## What changes were proposed in this pull request?
This patch fixes a longstanding issue with one of the RelationalGroupedDataset.agg function. Even though the signature accepts vararg of pairs, the underlying implementation turns the seq into a map, and thus not order preserving nor allowing multiple aggregates per column.

This change also allows users to use this function to run multiple different aggregations for a single column, e.g.
```
agg("age" -> "max", "age" -> "count")
```

## How was this patch tested?
Added a test case in DataFrameAggregateSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14697 from petermaxlee/SPARK-17124.
2016-08-21 00:25:55 +08:00
Srinath Shankar ba1737c21a [SPARK-17158][SQL] Change error message for out of range numeric literals
## What changes were proposed in this pull request?

Modifies error message for numeric literals to
Numeric literal <literal> does not fit in range [min, max] for type <T>

## How was this patch tested?

Fixed up the error messages for literals.sql in  SqlQueryTestSuite and re-ran via sbt. Also fixed up error messages in ExpressionParserSuite

Author: Srinath Shankar <srinath@databricks.com>

Closes #14721 from srinathshankar/sc4296.
2016-08-19 19:54:26 -07:00
petermaxlee a117afa7c2 [SPARK-17149][SQL] array.sql for testing array related functions
## What changes were proposed in this pull request?
This patch creates array.sql in SQLQueryTestSuite for testing array related functions, including:

- indexing
- array creation
- size
- array_contains
- sort_array

## How was this patch tested?
The patch itself is about adding tests.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14708 from petermaxlee/SPARK-17149.
2016-08-19 18:14:45 -07:00
Reynold Xin 67e59d464f [SPARK-16994][SQL] Whitelist operators for predicate pushdown
## What changes were proposed in this pull request?
This patch changes predicate pushdown optimization rule (PushDownPredicate) from using a blacklist to a whitelist. That is to say, operators must be explicitly allowed. This approach is more future-proof: previously it was possible for us to introduce a new operator and then render the optimization rule incorrect.

This also fixes the bug that previously we allowed pushing filter beneath limit, which was incorrect. That is to say, before this patch, the optimizer would rewrite
```
select * from (select * from range(10) limit 5) where id > 3

to

select * from range(10) where id > 3 limit 5
```

## How was this patch tested?
- a unit test case in FilterPushdownSuite
- an end-to-end test in limit.sql

Author: Reynold Xin <rxin@databricks.com>

Closes #14713 from rxin/SPARK-16994.
2016-08-19 21:11:35 +08:00
petermaxlee f5472dda51 [SPARK-16947][SQL] Support type coercion and foldable expression for inline tables
## What changes were proposed in this pull request?
This patch improves inline table support with the following:

1. Support type coercion.
2. Support using foldable expressions. Previously only literals were supported.
3. Improve error message handling.
4. Improve test coverage.

## How was this patch tested?
Added a new unit test suite ResolveInlineTablesSuite and a new file-based end-to-end test inline-table.sql.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14676 from petermaxlee/SPARK-16947.
2016-08-19 09:19:47 +08:00
petermaxlee 68f5087d21 [SPARK-17117][SQL] 1 / NULL should not fail analysis
## What changes were proposed in this pull request?
This patch fixes the problem described in SPARK-17117, i.e. "SELECT 1 / NULL" throws an analysis exception:

```
org.apache.spark.sql.AnalysisException: cannot resolve '(1 / NULL)' due to data type mismatch: differing types in '(1 / NULL)' (int and null).
```

The problem is that division type coercion did not take null type into account.

## How was this patch tested?
A unit test for the type coercion, and a few end-to-end test cases using SQLQueryTestSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14695 from petermaxlee/SPARK-17117.
2016-08-18 13:44:13 +02:00
Eric Liang 412dba63b5 [SPARK-17069] Expose spark.range() as table-valued function in SQL
## What changes were proposed in this pull request?

This adds analyzer rules for resolving table-valued functions, and adds one builtin implementation for range(). The arguments for range() are the same as those of `spark.range()`.

## How was this patch tested?

Unit tests.

cc hvanhovell

Author: Eric Liang <ekl@databricks.com>

Closes #14656 from ericl/sc-4309.
2016-08-18 13:33:55 +02:00
Reynold Xin 1748f82410 [SPARK-16391][SQL] Support partial aggregation for reduceGroups
## What changes were proposed in this pull request?
This patch introduces a new private ReduceAggregator interface that is a subclass of Aggregator. ReduceAggregator only requires a single associative and commutative reduce function. ReduceAggregator is also used to implement KeyValueGroupedDataset.reduceGroups in order to support partial aggregation.

Note that the pull request was initially done by viirya.

## How was this patch tested?
Covered by original tests for reduceGroups, as well as a new test suite for ReduceAggregator.

Author: Reynold Xin <rxin@databricks.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14576 from rxin/reduceAggregator.
2016-08-18 16:37:25 +08:00
Liang-Chi Hsieh 10204b9d29 [SPARK-16995][SQL] TreeNodeException when flat mapping RelationalGroupedDataset created from DataFrame containing a column created with lit/expr
## What changes were proposed in this pull request?

A TreeNodeException is thrown when executing the following minimal example in Spark 2.0.

    import spark.implicits._
    case class test (x: Int, q: Int)

    val d = Seq(1).toDF("x")
    d.withColumn("q", lit(0)).as[test].groupByKey(_.x).flatMapGroups{case (x, iter) => List[Int]()}.show
    d.withColumn("q", expr("0")).as[test].groupByKey(_.x).flatMapGroups{case (x, iter) => List[Int]()}.show

The problem is at `FoldablePropagation`. The rule will do `transformExpressions` on `LogicalPlan`. The query above contains a `MapGroups` which has a parameter `dataAttributes:Seq[Attribute]`. One attributes in `dataAttributes` will be transformed to an `Alias(literal(0), _)` in `FoldablePropagation`. `Alias` is not an `Attribute` and causes the error.

We can't easily detect such type inconsistency during transforming expressions. A direct approach to this problem is to skip doing `FoldablePropagation` on object operators as they should not contain such expressions.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #14648 from viirya/flat-mapping.
2016-08-18 13:24:12 +08:00
Tathagata Das d60af8f6aa [SPARK-17096][SQL][STREAMING] Improve exception string reported through the StreamingQueryListener
## What changes were proposed in this pull request?

Currently, the stackTrace (as `Array[StackTraceElements]`) reported through StreamingQueryListener.onQueryTerminated is useless as it has the stack trace of where StreamingQueryException is defined, not the stack trace of underlying exception.  For example, if a streaming query fails because of a / by zero exception in a task, the `QueryTerminated.stackTrace` will have
```
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:211)
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:124)
```
This is basically useless, as it is location where the StreamingQueryException was defined. What we want is

Here is the right way to reason about what should be posted as through StreamingQueryListener.onQueryTerminated
- The actual exception could either be a SparkException, or an arbitrary exception.
  - SparkException reports the relevant executor stack trace of a failed task as a string in the the exception message. The `Array[StackTraceElements]` returned by `SparkException.stackTrace()` is mostly irrelevant.
  - For any arbitrary exception, the `Array[StackTraceElements]` returned by `exception.stackTrace()` may be relevant.
- When there is an error in a streaming query, it's hard to reason whether the `Array[StackTraceElements]` is useful or not. In fact, it is not clear whether it is even useful to report the stack trace as this array of Java objects. It may be sufficient to report the strack trace as a string, along with the message. This is how Spark reported executor stra
- Hence, this PR simplifies the API by removing the array `stackTrace` from `QueryTerminated`. Instead the `exception` returns a string containing the message and the stack trace of the actual underlying exception that failed the streaming query (i.e. not that of the StreamingQueryException). If anyone is interested in the actual stack trace as an array, can always access them through `streamingQuery.exception` which returns the exception object.

With this change, if a streaming query fails because of a / by zero exception in a task, the `QueryTerminated.exception` will be
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): java.lang.ArithmeticException: / by zero
	at org.apache.spark.sql.streaming.StreamingQueryListenerSuite$$anonfun$5$$anonfun$apply$mcV$sp$4$$anonfun$apply$mcV$sp$5.apply$mcII$sp(StreamingQueryListenerSuite.scala:153)
	at org.apache.spark.sql.streaming.StreamingQueryListenerSuite$$anonfun$5$$anonfun$apply$mcV$sp$4$$anonfun$apply$mcV$sp$5.apply(StreamingQueryListenerSuite.scala:153)
	at org.apache.spark.sql.streaming.StreamingQueryListenerSuite$$anonfun$5$$anonfun$apply$mcV$sp$4$$anonfun$apply$mcV$sp$5.apply(StreamingQueryListenerSuite.scala:153)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:226)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:86)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1429)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1417)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1416)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1416)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
...
```
It contains the relevant executor stack trace. In a case non-SparkException, if the streaming source MemoryStream throws an exception, exception message will have the relevant stack trace.
```
java.lang.RuntimeException: this is the exception message
	at org.apache.spark.sql.execution.streaming.MemoryStream.getBatch(memory.scala:103)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:316)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:313)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:197)
	at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:43)
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:187)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:124)
```

Note that this change in the public `QueryTerminated` class is okay as the APIs are still experimental.

## How was this patch tested?
Unit tests that test whether the right information is present in the exception message reported through QueryTerminated object.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #14675 from tdas/SPARK-17096.
2016-08-17 13:31:34 -07:00
Wenchen Fan 928ca1c6d1 [SPARK-17102][SQL] bypass UserDefinedGenerator for json format check
## What changes were proposed in this pull request?

We use reflection to convert `TreeNode` to json string, and currently don't support arbitrary object. `UserDefinedGenerator` takes a function object, so we should skip json format test for it, or the tests can be flacky, e.g. `DataFrameSuite.simple explode`, this test always fail with scala 2.10(branch 1.6 builds with scala 2.10 by default), but pass with scala 2.11(master branch builds with scala 2.11 by default).

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14679 from cloud-fan/json.
2016-08-17 09:31:22 -07:00
Herman van Hovell 0b0c8b95e3 [SPARK-17106] [SQL] Simplify the SubqueryExpression interface
## What changes were proposed in this pull request?
The current subquery expression interface contains a little bit of technical debt in the form of a few different access paths to get and set the query contained by the expression. This is confusing to anyone who goes over this code.

This PR unifies these access paths.

## How was this patch tested?
(Existing tests)

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #14685 from hvanhovell/SPARK-17106.
2016-08-17 07:03:24 -07:00
Kazuaki Ishizaki 56d86742d2 [SPARK-15285][SQL] Generated SpecificSafeProjection.apply method grows beyond 64 KB
## What changes were proposed in this pull request?

This PR splits the generated code for ```SafeProjection.apply``` by using ```ctx.splitExpressions()```. This is because the large code body for ```NewInstance``` may grow beyond 64KB bytecode size for ```apply()``` method.

Here is [the original PR](https://github.com/apache/spark/pull/13243) for SPARK-15285. However, it breaks a build with Scala 2.10 since Scala 2.10 does not a case class with large number of members. Thus, it was reverted by [this commit](fa244e5a90).

## How was this patch tested?

Added new tests by using `DefinedByConstructorParams` instead of case class for scala-2.10

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #14670 from kiszk/SPARK-15285-2.
2016-08-17 21:34:57 +08:00
Sean Zhong 7b65030e7a [SPARK-17034][SQL] adds expression UnresolvedOrdinal to represent the ordinals in GROUP BY or ORDER BY
## What changes were proposed in this pull request?

This PR adds expression `UnresolvedOrdinal` to represent the ordinal in GROUP BY or ORDER BY, and fixes the rules when resolving ordinals.

Ordinals in GROUP BY or ORDER BY like `1` in `order by 1` or `group by 1` should be considered as unresolved before analysis. But in current code, it uses `Literal` expression to store the ordinal. This is inappropriate as `Literal` itself is a resolved expression, it gives the user a wrong message that the ordinals has already been resolved.

### Before this change

Ordinal is stored as `Literal` expression

```
scala> sc.setLogLevel("TRACE")
scala> sql("select a from t group by 1 order by 1")
...
'Sort [1 ASC], true
 +- 'Aggregate [1], ['a]
     +- 'UnresolvedRelation `t
```

For query:

```
scala> Seq(1).toDF("a").createOrReplaceTempView("t")
scala> sql("select count(a), a from t group by 2 having a > 0").show
```

During analysis, the intermediate plan before applying rule `ResolveAggregateFunctions` is:

```
'Filter ('a > 0)
   +- Aggregate [2], [count(1) AS count(1)#83L, a#81]
        +- LocalRelation [value#7 AS a#9]
```

Before this PR, rule `ResolveAggregateFunctions` believes all expressions of `Aggregate` have already been resolved, and tries to resolve the expressions in `Filter` directly. But this is wrong, as ordinal `2` in Aggregate is not really resolved!

### After this change

Ordinals are stored as `UnresolvedOrdinal`.

```
scala> sc.setLogLevel("TRACE")
scala> sql("select a from t group by 1 order by 1")
...
'Sort [unresolvedordinal(1) ASC], true
 +- 'Aggregate [unresolvedordinal(1)], ['a]
      +- 'UnresolvedRelation `t`
```

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14616 from clockfly/spark-16955.
2016-08-16 15:51:30 +08:00
Wenchen Fan 7de30d6e9e [SPARK-16916][SQL] serde/storage properties should not have limitations
## What changes were proposed in this pull request?

`CatalogStorageFormat.properties` can be used in 2 ways:

1. for hive tables, it stores the serde properties.
2. for data source tables, it stores the data source options, e.g. `path`, `skipHiveMetadata`, etc.

however, both of them have nothing to do with data source properties, e.g. `spark.sql.sources.provider`, so they should not have limitations about data source properties.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14506 from cloud-fan/table-prop.
2016-08-15 21:43:41 -07:00
Marcelo Vanzin 5da6c4b24f [SPARK-16671][CORE][SQL] Consolidate code to do variable substitution.
Both core and sql have slightly different code that does variable substitution
of config values. This change refactors that code and encapsulates the logic
of reading config values and expading variables in a new helper class, which
can be configured so that both core and sql can use it without losing existing
functionality, and allows for easier testing and makes it easier to add more
features in the future.

Tested with existing and new unit tests, and by running spark-shell with
some configs referencing variables and making sure it behaved as expected.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14468 from vanzin/SPARK-16671.
2016-08-15 11:09:54 -07:00
Sean Owen cdaa562c9a [SPARK-16966][SQL][CORE] App Name is a randomUUID even when "spark.app.name" exists
## What changes were proposed in this pull request?

Don't override app name specified in `SparkConf` with a random app name. Only set it if the conf has no app name even after options have been applied.

See also https://github.com/apache/spark/pull/14602
This is similar to Sherry302 's original proposal in https://github.com/apache/spark/pull/14556

## How was this patch tested?

Jenkins test, with new case reproducing the bug

Author: Sean Owen <sowen@cloudera.com>

Closes #14630 from srowen/SPARK-16966.2.
2016-08-13 15:40:43 -07:00
GraceH 8c8acdec93 [SPARK-16968] Add additional options in jdbc when creating a new table
## What changes were proposed in this pull request?

In the PR, we just allow the user to add additional options when create a new table in JDBC writer.
The options can be table_options or partition_options.
E.g., "CREATE TABLE t (name string) ENGINE=InnoDB DEFAULT CHARSET=utf8"

Here is the usage example:
```
df.write.option("createTableOptions", "ENGINE=InnoDB DEFAULT CHARSET=utf8").jdbc(...)
```
## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
will apply test result soon.

Author: GraceH <93113783@qq.com>

Closes #14559 from GraceH/jdbc_options.
2016-08-13 11:39:58 +01:00
Dongjoon Hyun 2a105134e9 [SPARK-16771][SQL] WITH clause should not fall into infinite loop.
## What changes were proposed in this pull request?

This PR changes the CTE resolving rule to use only **forward-declared** tables in order to prevent infinite loops. More specifically, new logic is like the following.

* Resolve CTEs in `WITH` clauses first before replacing the main SQL body.
* When resolving CTEs, only forward-declared CTEs or base tables are referenced.
  - Self-referencing is not allowed any more.
  - Cross-referencing is not allowed any more.

**Reported Error Scenarios**
```scala
scala> sql("WITH t AS (SELECT 1 FROM t) SELECT * FROM t")
java.lang.StackOverflowError
...
scala> sql("WITH t1 AS (SELECT * FROM t2), t2 AS (SELECT 2 FROM t1) SELECT * FROM t1, t2")
java.lang.StackOverflowError
...
```
Note that `t`, `t1`, and `t2` are not declared in database. Spark falls into infinite loops before resolving table names.

## How was this patch tested?

Pass the Jenkins tests with new two testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14397 from dongjoon-hyun/SPARK-16771-TREENODE.
2016-08-12 19:07:34 +02:00
petermaxlee 00e103a6ed [SPARK-17013][SQL] Parse negative numeric literals
## What changes were proposed in this pull request?
This patch updates the SQL parser to parse negative numeric literals as numeric literals, instead of unary minus of positive literals.

This allows the parser to parse the minimal value for each data type, e.g. "-32768S".

## How was this patch tested?
Updated test cases.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14608 from petermaxlee/SPARK-17013.
2016-08-11 23:56:55 -07:00
Dongjoon Hyun abff92bfdc [SPARK-16975][SQL] Column-partition path starting '_' should be handled correctly
## What changes were proposed in this pull request?

Currently, Spark ignores path names starting with underscore `_` and `.`. This causes read-failures for the column-partitioned file data sources whose partition column names starts from '_', e.g. `_col`.

**Before**
```scala
scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
scala> spark.read.parquet("/tmp/parquet")
org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /tmp/parquet20. It must be specified manually;
```

**After**
```scala
scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
scala> spark.read.parquet("/tmp/parquet")
res2: org.apache.spark.sql.DataFrame = [id: bigint, _locality_code: int]
```

## How was this patch tested?

Pass the Jenkins with a new test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14585 from dongjoon-hyun/SPARK-16975-PARQUET.
2016-08-12 14:40:12 +08:00
hyukjinkwon ac84fb64dd [SPARK-16434][SQL] Avoid per-record type dispatch in JSON when reading
## What changes were proposed in this pull request?

Currently, `JacksonParser.parse` is doing type-based dispatch for each row to convert the tokens to appropriate values for Spark.
It might not have to be done like this because the schema is already kept.

So, appropriate converters can be created first according to the schema once, and then apply them to each row.

This PR corrects `JacksonParser` so that it creates all converters for the schema once and then applies them to each row rather than type dispatching for every row.

Benchmark was proceeded with the codes below:

#### Parser tests

**Before**

```scala
test("Benchmark for JSON converter") {
  val N = 500 << 8
  val row =
    """{"struct":{"field1": true, "field2": 92233720368547758070},
    "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
    "arrayOfString":["str1", "str2"],
    "arrayOfInteger":[1, 2147483647, -2147483648],
    "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
    "arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
    "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
    "arrayOfBoolean":[true, false, true],
    "arrayOfNull":[null, null, null, null],
    "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
    "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
    "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
   }"""
  val data = List.fill(N)(row)
  val dummyOption = new JSONOptions(Map.empty[String, String])
  val schema =
    InferSchema.infer(spark.sparkContext.parallelize(Seq(row)), "", dummyOption)
  val factory = new JsonFactory()

  val benchmark = new Benchmark("JSON converter", N)
  benchmark.addCase("convert JSON file", 10) { _ =>
    data.foreach { input =>
      val parser = factory.createParser(input)
      parser.nextToken()
      JacksonParser.convertRootField(factory, parser, schema)
    }
  }
  benchmark.run()
}
```

```
JSON converter:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
convert JSON file                             1697 / 1807          0.1       13256.9       1.0X
```

**After**

```scala
test("Benchmark for JSON converter") {
  val N = 500 << 8
  val row =
    """{"struct":{"field1": true, "field2": 92233720368547758070},
    "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
    "arrayOfString":["str1", "str2"],
    "arrayOfInteger":[1, 2147483647, -2147483648],
    "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
    "arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
    "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
    "arrayOfBoolean":[true, false, true],
    "arrayOfNull":[null, null, null, null],
    "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
    "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
    "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
   }"""
  val data = List.fill(N)(row)
  val dummyOption = new JSONOptions(Map.empty[String, String], new SQLConf())
  val schema =
    InferSchema.infer(spark.sparkContext.parallelize(Seq(row)), dummyOption)

  val benchmark = new Benchmark("JSON converter", N)
  benchmark.addCase("convert JSON file", 10) { _ =>
    val parser = new JacksonParser(schema, dummyOption)
    data.foreach { input =>
      parser.parse(input)
    }
  }
  benchmark.run()
}
```

```
JSON converter:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
convert JSON file                             1401 / 1461          0.1       10947.4       1.0X
```

It seems parsing time is improved by roughly ~20%

#### End-to-End test

```scala
test("Benchmark for JSON reader") {
  val N = 500 << 8
  val row =
    """{"struct":{"field1": true, "field2": 92233720368547758070},
    "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
    "arrayOfString":["str1", "str2"],
    "arrayOfInteger":[1, 2147483647, -2147483648],
    "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
    "arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
    "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
    "arrayOfBoolean":[true, false, true],
    "arrayOfNull":[null, null, null, null],
    "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
    "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
    "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
   }"""
  val df = spark.sqlContext.read.json(spark.sparkContext.parallelize(List.fill(N)(row)))
  withTempPath { path =>
    df.write.format("json").save(path.getCanonicalPath)

    val benchmark = new Benchmark("JSON reader", N)
    benchmark.addCase("reading JSON file", 10) { _ =>
      spark.read.format("json").load(path.getCanonicalPath).collect()
    }
    benchmark.run()
  }
}
```

**Before**

```
JSON reader:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading JSON file                             6485 / 6924          0.0       50665.0       1.0X
```

**After**

```
JSON reader:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading JSON file                             6350 / 6529          0.0       49609.3       1.0X
```

## How was this patch tested?

Existing test cases should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14102 from HyukjinKwon/SPARK-16434.
2016-08-12 11:09:42 +08:00
petermaxlee cf9367826c [SPARK-17018][SQL] literals.sql for testing literal parsing
## What changes were proposed in this pull request?
This patch adds literals.sql for testing literal parsing end-to-end in SQL.

## How was this patch tested?
The patch itself is only about adding test cases.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14598 from petermaxlee/SPARK-17018-2.
2016-08-11 13:55:10 -07:00
petermaxlee a7b02db457 [SPARK-17015][SQL] group-by/order-by ordinal and arithmetic tests
## What changes were proposed in this pull request?
This patch adds three test files:
1. arithmetic.sql.out
2. order-by-ordinal.sql
3. group-by-ordinal.sql

This includes https://github.com/apache/spark/pull/14594.

## How was this patch tested?
This is a test case change.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14595 from petermaxlee/SPARK-17015.
2016-08-11 01:43:08 -07:00
petermaxlee 0db373aaf8 [SPARK-17011][SQL] Support testing exceptions in SQLQueryTestSuite
## What changes were proposed in this pull request?
This patch adds exception testing to SQLQueryTestSuite. When there is an exception in query execution, the query result contains the the exception class along with the exception message.

As part of this, I moved some additional test cases for limit from SQLQuerySuite over to SQLQueryTestSuite.

## How was this patch tested?
This is a test harness change.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14592 from petermaxlee/SPARK-17011.
2016-08-10 23:22:14 -07:00
petermaxlee 665e175328 [SPARK-17007][SQL] Move test data files into a test-data folder
## What changes were proposed in this pull request?
This patch moves all the test data files in sql/core/src/test/resources to sql/core/src/test/resources/test-data, so we don't clutter the top level sql/core/src/test/resources. Also deleted sql/core/src/test/resources/old-repeated.parquet since it is no longer used.

The change will make it easier to spot sql-tests directory.

## How was this patch tested?
This is a test-only change.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14589 from petermaxlee/SPARK-17007.
2016-08-10 21:26:46 -07:00
petermaxlee 425c7c2dbd [SPARK-17008][SPARK-17009][SQL] Normalization and isolation in SQLQueryTestSuite.
## What changes were proposed in this pull request?
This patch enhances SQLQueryTestSuite in two ways:

1. SPARK-17009: Use a new SparkSession for each test case to provide stronger isolation (e.g. config changes in one test case does not impact another). That said, we do not currently isolate catalog changes.
2. SPARK-17008: Normalize query output using sorting, inspired by HiveComparisonTest.

I also ported a few new test cases over from SQLQuerySuite.

## How was this patch tested?
This is a test harness update.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14590 from petermaxlee/SPARK-17008.
2016-08-10 21:05:32 -07:00
Liang-Chi Hsieh 19af298bb6 [SPARK-15639] [SPARK-16321] [SQL] Push down filter at RowGroups level for parquet reader
## What changes were proposed in this pull request?

The base class `SpecificParquetRecordReaderBase` used for vectorized parquet reader will try to get pushed-down filters from the given configuration. This pushed-down filters are used for RowGroups-level filtering. However, we don't set up the filters to push down into the configuration. In other words, the filters are not actually pushed down to do RowGroups-level filtering. This patch is to fix this and tries to set up the filters for pushing down to configuration for the reader.

The benchmark that excludes the time of writing Parquet file:

    test("Benchmark for Parquet") {
      val N = 500 << 12
        withParquetTable((0 until N).map(i => (101, i)), "t") {
          val benchmark = new Benchmark("Parquet reader", N)
          benchmark.addCase("reading Parquet file", 10) { iter =>
            sql("SELECT _1 FROM t where t._1 < 100").collect()
          }
          benchmark.run()
      }
    }

`withParquetTable` in default will run tests for vectorized reader non-vectorized readers. I only let it run vectorized reader.

When we set the block size of parquet as 1024 to have multiple row groups. The benchmark is:

Before this patch:

The retrieved row groups: 8063

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
    Intel(R) Core(TM) i7-5557U CPU  3.10GHz
    Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    reading Parquet file                           825 / 1233          2.5         402.6       1.0X

After this patch:

The retrieved row groups: 0

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
    Intel(R) Core(TM) i7-5557U CPU  3.10GHz
    Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    reading Parquet file                           306 /  503          6.7         149.6       1.0X

Next, I run the benchmark for non-pushdown case using the same benchmark code but with disabled pushdown configuration. This time the parquet block size is default value.

Before this patch:

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
    Intel(R) Core(TM) i7-5557U CPU  3.10GHz
    Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    reading Parquet file                           136 /  238         15.0          66.5       1.0X

After this patch:

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
    Intel(R) Core(TM) i7-5557U CPU  3.10GHz
    Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    reading Parquet file                           124 /  193         16.5          60.7       1.0X

For non-pushdown case, from the results, I think this patch doesn't affect normal code path.

I've manually output the `totalRowCount` in `SpecificParquetRecordReaderBase` to see if this patch actually filter the row-groups. When running the above benchmark:

After this patch:
    `totalRowCount = 0`

Before this patch:
    `totalRowCount = 1024000`

## How was this patch tested?
Existing tests should be passed.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13701 from viirya/vectorized-reader-push-down-filter2.
2016-08-10 10:03:55 -07:00
avulanov 11a6844beb [SPARK-15899][SQL] Fix the construction of the file path with hadoop Path
## What changes were proposed in this pull request?

Fix the construction of the file path. Previous way of construction caused the creation of incorrect path on Windows.

## How was this patch tested?

Run SQL unit tests on Windows

Author: avulanov <nashb@yandex.ru>

Closes #13868 from avulanov/SPARK-15899-file.
2016-08-10 10:25:00 +01:00
petermaxlee b9f8a11709 [SPARK-16866][SQL] Infrastructure for file-based SQL end-to-end tests
## What changes were proposed in this pull request?
This patch introduces SQLQueryTestSuite, a basic framework for end-to-end SQL test cases defined in spark/sql/core/src/test/resources/sql-tests. This is a more standard way to test SQL queries end-to-end in different open source database systems, because it is more manageable to work with files.

This is inspired by HiveCompatibilitySuite, but simplified for general Spark SQL tests. Once this is merged, I can work towards porting SQLQuerySuite over, and eventually also move the existing HiveCompatibilitySuite to use this framework.

Unlike HiveCompatibilitySuite, SQLQueryTestSuite compares both the output schema and the output data (in string form).

When there is a mismatch, the error message looks like the following:

```
[info] - blacklist.sql !!! IGNORED !!!
[info] - number-format.sql *** FAILED *** (2 seconds, 405 milliseconds)
[info]   Expected "...147483648	-214748364[8]", but got "...147483648	-214748364[9]" Result should match for query #1 (SQLQueryTestSuite.scala:171)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.assertResult(Assertions.scala:1171)
```

## How was this patch tested?
This is a test infrastructure change.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14472 from petermaxlee/SPARK-16866.
2016-08-10 17:17:21 +08:00
Sean Owen 0578ff9681 [SPARK-16324][SQL] regexp_extract should doc that it returns empty string when match fails
## What changes were proposed in this pull request?

Doc that regexp_extract returns empty string when regex or group does not match

## How was this patch tested?

Jenkins test, with a few new test cases

Author: Sean Owen <sowen@cloudera.com>

Closes #14525 from srowen/SPARK-16324.
2016-08-10 10:14:43 +01:00