Commit graph

2192 commits

Author SHA1 Message Date
Subhobrata Dey fd82681945 [SPARK-14749][SQL, TESTS] PlannerSuite failed when it run individually
## What changes were proposed in this pull request?

3 testcases namely,

```
"count is partially aggregated"
"count distinct is partially aggregated"
"mixed aggregates are partially aggregated"
```

were failing when running PlannerSuite individually.
The PR provides a fix for this.

## How was this patch tested?

unit tests


Author: Subhobrata Dey <sbcd90@gmail.com>

Closes #12532 from sbcd90/plannersuitetestsfix.
2016-04-20 14:26:07 -07:00
Shixiong Zhu 7bc948557b [SPARK-14678][SQL] Add a file sink log to support versioning and compaction
## What changes were proposed in this pull request?

This PR adds a special log for FileStreamSink for two purposes:

- Versioning. A future Spark version should be able to read the metadata of an old FileStreamSink.
- Compaction. As reading from many small files is usually pretty slow, we should compact small metadata files into big files.

FileStreamSinkLog has a new log format instead of Java serialization format. It will write one log file for each batch. The first line of the log file is the version number, and there are multiple JSON lines following. Each JSON line is a JSON format of FileLog.

FileStreamSinkLog will compact log files every "spark.sql.sink.file.log.compactLen" batches into a big file. When compacting, it reads all history logs and merges them with the new batch. During compaction, it also drops the entries for files that have been deleted (marked by FileLog.action). When the reader uses allLogs to list all files, this method returns only the visible files (the deleted files are dropped).
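
The merge step described above can be sketched roughly as follows. This is a minimal Java illustration, not the code from the patch: the `Entry` class, the `"add"`/`"delete"` action values, and the `compact` method are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SinkLogCompactionSketch {
    static final String ADD = "add";        // hypothetical action value
    static final String DELETE = "delete";  // hypothetical action value

    /** Hypothetical log entry: a file path plus the action recorded for it. */
    static final class Entry {
        final String path;
        final String action;

        Entry(String path, String action) {
            this.path = path;
            this.action = action;
        }
    }

    /** Merges the history with the new batch and returns only the visible file paths. */
    static List<String> compact(List<Entry> history, List<Entry> newBatch) {
        Set<String> visible = new LinkedHashSet<>();  // preserves first-seen order
        List<Entry> all = new ArrayList<>(history);
        all.addAll(newBatch);
        for (Entry e : all) {
            if (DELETE.equals(e.action)) {
                visible.remove(e.path);  // a deleted file is dropped from the compacted view
            } else {
                visible.add(e.path);
            }
        }
        return new ArrayList<>(visible);
    }
}
```

With a history of `a` and `b` and a new batch that deletes `a` and adds `c`, the compacted view lists `b` and `c`.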

## How was this patch tested?

FileStreamSinkLogSuite

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12435 from zsxwing/sink-log.
2016-04-20 13:33:04 -07:00
Andrew Or 8fc267ab33 [SPARK-14720][SPARK-13643] Move Hive-specific methods into HiveSessionState and Create a SparkSession class
## What changes were proposed in this pull request?
This PR has two main changes.
1. Move Hive-specific methods from HiveContext to HiveSessionState, which helps the work of removing HiveContext.
2. Create a SparkSession class, which will later be the entry point for Spark SQL users.

## How was this patch tested?
Existing tests

This PR is trying to fix test failures of https://github.com/apache/spark/pull/12485.

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12522 from yhuai/spark-session.
2016-04-20 12:58:48 -07:00
Tathagata Das cb8ea9e1f3 [SPARK-14741][SQL] Fixed error in reading json file stream inside a partitioned directory
## What changes were proposed in this pull request?

Consider the following directory structure
dir/col=X/some-files
If we create a text format streaming dataframe on `dir/col=X/`, then `col` should not be treated as a partitioning column. Even though the streaming dataframe does not do so, the generated batch dataframes pick up `col` as a partitioning column, causing a mismatch between the streaming source schema and the generated df schema. This leads to a runtime failure:
```
18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: Query query-0 terminated with error
java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8
```
The reason is that the partition inference code has no notion of a base path above which it should not search for partitions. This PR makes sure that the batch DF is generated with the basePath set as the original path on which the file stream source is defined.
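
To illustrate why the base path matters, here is a small Java sketch of partition inference that only looks at directory segments below a given base path. It is an assumption-laden toy, not Spark's partition discovery: `inferPartitions` and its string-based path handling are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PartitionPathSketch {
    /**
     * Collects name=value directory segments of filePath that lie below basePath.
     * Segments that are part of basePath itself are never treated as partitions.
     */
    static Map<String, String> inferPartitions(String basePath, String filePath) {
        String base = basePath.endsWith("/") ? basePath : basePath + "/";
        if (!filePath.startsWith(base)) {
            throw new IllegalArgumentException(filePath + " is not under " + basePath);
        }
        String[] segments = filePath.substring(base.length()).split("/");
        Map<String, String> partitions = new LinkedHashMap<>();
        // The last segment is the file name, so it is skipped.
        for (int i = 0; i < segments.length - 1; i++) {
            int eq = segments[i].indexOf('=');
            if (eq > 0) {
                partitions.put(segments[i].substring(0, eq), segments[i].substring(eq + 1));
            }
        }
        return partitions;
    }
}
```

With base path `dir`, the file `dir/col=X/part-0` yields the partition `col=X`; with base path `dir/col=X/`, the same file yields no partition columns, which is the behaviour the fix aims for.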

## How was this patch tested?

New unit test

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12517 from tdas/SPARK-14741.
2016-04-20 12:22:51 -07:00
Burak Yavuz 80bf48f437 [SPARK-14555] First cut of Python API for Structured Streaming
## What changes were proposed in this pull request?

This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
 - ContinuousQuery
 - Trigger
 - ProcessingTime
in pyspark under `pyspark.sql.streaming`.

In addition, it contains the new methods added under:
 -  `DataFrameWriter`
     a) `startStream`
     b) `trigger`
     c) `queryName`

 -  `DataFrameReader`
     a) `stream`

 - `DataFrame`
    a) `isStreaming`

This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
 - `exception`
 - `sourceStatuses`
 - `sinkStatus`

They may be added in a follow up.

This PR also contains some very minor doc fixes in the Scala side.

## How was this patch tested?

Python doc tests

TODO:
 - [ ] verify Python docs look good

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

Closes #12320 from brkyvz/stream-python.
2016-04-20 10:32:01 -07:00
Liwei Lin 17db4bfeaa [SPARK-14687][CORE][SQL][MLLIB] Call path.getFileSystem(conf) instead of call FileSystem.get(conf)
## What changes were proposed in this pull request?

- replaced `FileSystem.get(conf)` calls with `path.getFileSystem(conf)`

## How was this patch tested?

N/A

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12450 from lw-lin/fix-fs-get.
2016-04-20 11:28:51 +01:00
Wenchen Fan 7abe9a6578 [SPARK-9013][SQL] generate MutableProjection directly instead of return a function
`MutableProjection` is not thread-safe and we won't use it in multiple threads. I think the reason that we return `() => MutableProjection` is not about thread safety, but to save the cost of generating code when we need identical but separate mutable projections.

However, I only found one place that uses this [feature](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala#L122-L123), and compared to the trouble it brings, I think we should generate `MutableProjection` directly instead of returning a function.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #7373 from cloud-fan/project.
2016-04-20 00:44:02 -07:00
Wenchen Fan 85d759ca3a [SPARK-14704][CORE] create accumulators in TaskMetrics
## What changes were proposed in this pull request?

Before this PR, we created accumulators on the driver side (and registered them) and sent them to the executor side, then created `TaskMetrics` with these accumulators on the executor side.
After this PR, we create `TaskMetrics` on the driver side and send it to the executor side, so that we can create accumulators inside `TaskMetrics` directly, which is cleaner.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12472 from cloud-fan/acc.
2016-04-19 21:20:24 -07:00
Luciano Resende 78b38109ed [SPARK-13419] [SQL] Update SubquerySuite to use checkAnswer for validation
## What changes were proposed in this pull request?

Change SubquerySuite to validate test results using the `checkAnswer` helper method.

## How was this patch tested?

Existing tests

Author: Luciano Resende <lresende@apache.org>

Closes #12269 from lresende/SPARK-13419.
2016-04-19 21:02:10 -07:00
Joan 3ae25f244b [SPARK-13929] Use Scala reflection for UDTs
## What changes were proposed in this pull request?

Enable ScalaReflection and User Defined Types for plain Scala classes.

This involves the move of `schemaFor` from `ScalaReflection` trait (which is Runtime and Compile time (macros) reflection) to the `ScalaReflection` object (runtime reflection only) as I believe this code wouldn't work at compile time anyway as it manipulates `Class`'s that are not compiled yet.

## How was this patch tested?

Unit test

Author: Joan <joan@goyeau.com>

Closes #12149 from joan38/SPARK-13929-Scala-reflection.
2016-04-19 17:36:31 -07:00
Cheng Lian 10f273d8db [SPARK-14407][SQL] Hides HadoopFsRelation related data source API into execution/datasources package #12178
## What changes were proposed in this pull request?

This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package.

Note that to avoid conflicts, this PR is based on #12153. Effective changes for this PR only consist of the last three commits. Will rebase after merging #12153.

## How was this patch tested?

Existing tests.

Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #12361 from liancheng/spark-14407-hide-hadoop-fs-relation.
2016-04-19 17:32:23 -07:00
Herman van Hovell da8859226e [SPARK-4226] [SQL] Support IN/EXISTS Subqueries
### What changes were proposed in this pull request?
This PR adds support for in/exists predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms:

- `[NOT] EXISTS(subquery)`
- `[NOT] IN (subquery)`

This PR is (loosely) based on the work of davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/9055). They should be credited for the work they did.

### How was this patch tested?
Modified parsing unit tests.
Added tests to `org.apache.spark.sql.SQLQuerySuite`

cc rxin, davies & chenghao-intel

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12306 from hvanhovell/SPARK-4226.
2016-04-19 15:16:02 -07:00
Wenchen Fan 5cb2e33609 [SPARK-14675][SQL] ClassFormatError when use Seq as Aggregator buffer type
## What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/12067, we now use expressions to do the aggregation in `TypedAggregateExpression`. To implement buffer merge, we produce a new buffer deserializer expression by replacing `AttributeReference` with right-side buffer attribute, like other `DeclarativeAggregate`s do, and finally combine the left and right buffer deserializer with `Invoke`.

However, after https://github.com/apache/spark/pull/12338, we add the loop variable to the class members when generating code for `MapObjects`. If the `Aggregator` buffer type is `Seq`, which is implemented by the `MapObjects` expression, we add the same loop variable to the class members twice (once for the left and once for the right buffer deserializer), which causes the `ClassFormatError`.

This PR fixes this issue by calling `distinct` before declaring the class members.
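
The idea of de-duplicating declarations before emitting them can be sketched in plain Java. The `declareMembers` helper and the sample declarations are hypothetical and stand in for the code generator's member list; they are not taken from the patch.

```java
import java.util.LinkedHashSet;
import java.util.List;

final class DistinctMembersSketch {
    /** Renders member declarations, emitting each distinct declaration only once. */
    static String declareMembers(List<String> declarations) {
        StringBuilder sb = new StringBuilder();
        // LinkedHashSet drops duplicates while keeping the original order.
        for (String declaration : new LinkedHashSet<>(declarations)) {
            sb.append(declaration).append('\n');
        }
        return sb.toString();
    }
}
```

If two deserializers both register `private int loopVar1;`, the rendered class body contains that field once, so the generated class no longer declares the same member twice.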

## How was this patch tested?

new regression test in `DatasetAggregatorSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12468 from cloud-fan/bug.
2016-04-19 10:51:58 -07:00
Josh Rosen 947b9020b0 [SPARK-14676] Wrap and re-throw Await.result exceptions in order to capture full stacktrace
When `Await.result` throws an exception which originated from a different thread, the resulting stacktrace doesn't include the path leading to the `Await.result` call itself, making it difficult to identify the impact of these exceptions. For example, I've seen cases where broadcast cleaning errors propagate to the main thread and crash it but the resulting stacktrace doesn't include any of the main thread's code, making it difficult to pinpoint which exception crashed that thread.

This patch addresses this issue by explicitly catching, wrapping, and re-throwing exceptions that are thrown by `Await.result`.
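
The catch, wrap, and re-throw pattern can be sketched as below. This is a Java analogue under stated assumptions rather than the Scala code in the patch: `awaitResult`, `AwaitException`, and the use of `Callable` as the blocking wait are all hypothetical stand-ins.

```java
import java.util.concurrent.Callable;

final class AwaitWrapSketch {
    /** Hypothetical wrapper exception; it is constructed on the waiting thread. */
    static final class AwaitException extends RuntimeException {
        AwaitException(Throwable cause) {
            super("Exception thrown while awaiting a result", cause);
        }
    }

    /** Runs the blocking wait and re-throws any failure wrapped in AwaitException. */
    static <T> T awaitResult(Callable<T> blockingWait) {
        try {
            return blockingWait.call();
        } catch (Exception e) {
            // The wrapper's stack trace is captured here, on the caller's thread,
            // while the original failure is preserved as the cause.
            throw new AwaitException(e);
        }
    }
}
```

The wrapper's stack trace points at the code that was waiting, and `getCause()` still exposes the exception raised on the other thread, so both halves of the story appear in the log.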

I tested this manually using 16b31c8251, a patch which reproduces an issue where an RPC exception which occurs while unpersisting RDDs manages to crash the main thread without any useful stacktrace, and verified that informative, full stacktraces were generated after applying the fix in this PR.

/cc rxin nongli yhuai anabranch

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12433 from JoshRosen/wrap-and-rethrow-await-exceptions.
2016-04-19 10:38:10 -07:00
Wenchen Fan 9ee95b6ecc [SPARK-14491] [SQL] refactor object operator framework to make it easy to eliminate serializations
## What changes were proposed in this pull request?

This PR tries to separate the serialization and deserialization logic from object operators, so that it's easier to eliminate unnecessary serializations in optimizer.

Typed aggregate related operators are special: they deserialize the input row to multiple objects, and it's difficult to abstract that with a single deserializer operator, so we still mix the deserialization logic there.

## How was this patch tested?

existing tests and new test in `EliminateSerializationSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12260 from cloud-fan/encoder.
2016-04-19 10:00:44 -07:00
Cheng Lian 5e360c93be [SPARK-13681][SPARK-14458][SPARK-14566][SQL] Add back once removed CommitFailureTestRelationSuite and SimpleTextHadoopFsRelationSuite
## What changes were proposed in this pull request?

These test suites were removed while refactoring `HadoopFsRelation` related API. This PR brings them back.

This PR also fixes two regressions:

- SPARK-14458, which causes runtime error when saving partitioned tables using `FileFormat` data sources that are not able to infer their own schemata. This bug wasn't detected by any built-in data sources because all of them happen to have schema inference feature.

- SPARK-14566, which happens to be covered by SPARK-14458 and causes wrong query result or runtime error when
  - appending a Dataset `ds` to a persisted partitioned data source relation `t`, and
  - partition columns in `ds` don't all appear after data columns

## How was this patch tested?

`CommitFailureTestRelationSuite` uses a testing relation that always fails when committing write tasks to test write job cleanup.

`SimpleTextHadoopFsRelationSuite` uses a testing relation to test general `HadoopFsRelation` and `FileFormat` interfaces.

The two regressions are both covered by existing test cases.

Author: Cheng Lian <lian@databricks.com>

Closes #12179 from liancheng/spark-13681-commit-failure-test.
2016-04-19 09:37:00 -07:00
Dongjoon Hyun 3d46d796a3 [SPARK-14577][SQL] Add spark.sql.codegen.maxCaseBranches config option
## What changes were proposed in this pull request?

We currently disable codegen for `CaseWhen` if the number of branches is greater than 20 (in CaseWhen.MAX_NUM_CASES_FOR_CODEGEN). It would be better if this value is a non-public config defined in SQLConf.

## How was this patch tested?

Pass the Jenkins tests (including a new testcase `Support spark.sql.codegen.maxCaseBranches option`)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12353 from dongjoon-hyun/SPARK-14577.
2016-04-19 21:38:15 +08:00
Wenchen Fan d4b94ead92 [SPARK-14595][SQL] add input metrics for FileScanRDD
## What changes were proposed in this pull request?

This is roughly based on the input metrics logic in `SqlNewHadoopRDD`

## How was this patch tested?

Not sure how to write a test, I manually verified it in Spark UI.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12352 from cloud-fan/metrics.
2016-04-18 23:48:22 -07:00
Sameer Agarwal 6f88006895 [SPARK-14722][SQL] Rename upstreams() -> inputRDDs() in WholeStageCodegen
## What changes were proposed in this pull request?

Per rxin's suggestions, this patch renames `upstreams()` to `inputRDDs()` in `WholeStageCodegen` for better implied semantics

## How was this patch tested?

N/A

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12486 from sameeragarwal/codegen-cleanup.
2016-04-18 20:28:58 -07:00
Sameer Agarwal 4eae1dbd7c [SPARK-14718][SQL] Avoid mutating ExprCode in doGenCode
## What changes were proposed in this pull request?

The `doGenCode` method currently takes in an `ExprCode`, mutates it and returns the java code to evaluate the given expression. It should instead just return a new `ExprCode` to avoid passing around mutable objects during code generation.

## How was this patch tested?

Existing Tests

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12483 from sameeragarwal/new-exprcode-2.
2016-04-18 20:28:22 -07:00
Reynold Xin 5e92583d38 [SPARK-14667] Remove HashShuffleManager
## What changes were proposed in this pull request?
The sort shuffle manager has been the default since Spark 1.2. It is time to remove the old hash shuffle manager.

## How was this patch tested?
Removed some tests related to the old manager.

Author: Reynold Xin <rxin@databricks.com>

Closes #12423 from rxin/SPARK-14667.
2016-04-18 19:30:00 -07:00
Sameer Agarwal 8bd8121329 [SPARK-14710][SQL] Rename gen/genCode to genCode/doGenCode to better reflect the semantics
## What changes were proposed in this pull request?

Per rxin's suggestions, this patch renames `s/gen/genCode` and `s/genCode/doGenCode` to better reflect the semantics of these 2 function calls.

## How was this patch tested?

N/A (refactoring only)

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12475 from sameeragarwal/gencode.
2016-04-18 14:03:40 -07:00
hyukjinkwon 6fc1e72d9b [MINOR] Revert removing explicit typing (changed in some examples and StatFunctions)
## What changes were proposed in this pull request?

This PR reverts some changes in https://github.com/apache/spark/pull/12413. (please see the discussion in that PR).

from
```scala
    words.foreachRDD { (rdd, time) =>
    ...
```

to
```scala
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
    ...
```

Also, this was discussed in dev-mailing list, [here](http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-Scala-style-explicit-typing-within-transformation-functions-and-anonymous-val-td17173.html)

## How was this patch tested?

This was tested with `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12452 from HyukjinKwon/revert-explicit-typing.
2016-04-18 13:45:03 -07:00
Andrew Or 28ee15702d [SPARK-14647][SQL] Group SQLContext/HiveContext state into SharedState
## What changes were proposed in this pull request?

This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future.

## How was this patch tested?
Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #12463 from yhuai/sharedState.
2016-04-18 13:15:23 -07:00
Tathagata Das 775cf17eaa [SPARK-14473][SQL] Define analysis rules to catch operations not supported in streaming
## What changes were proposed in this pull request?

There are many operations that are currently not supported in the streaming execution. For example:
 - joining two streams
 - unioning a stream and a batch source
 - sorting
 - window functions (not time windows)
 - distinct aggregates

Furthermore, executing a query with a stream source as a batch query should also fail.

This patch adds an additional step after analysis in the QueryExecution which checks whether all the operations in the analyzed logical plan are supported.
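
A check of this kind can be pictured as a walk over the plan tree that fails on the first unsupported operator. The Java sketch below is illustrative only: the `Node` class, the operator names, and the contents of `UNSUPPORTED_ON_STREAMS` are assumptions loosely based on the list above, not the rules added by the patch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class UnsupportedOpsSketch {
    /** Hypothetical plan node: an operator name, a streaming-source flag, and children. */
    static final class Node {
        final String op;
        final boolean streamingSource;
        final List<Node> children;

        Node(String op, boolean streamingSource, Node... children) {
            this.op = op;
            this.streamingSource = streamingSource;
            this.children = Arrays.asList(children);
        }

        /** A subtree is streaming if it contains at least one streaming source. */
        boolean isStreaming() {
            if (streamingSource) {
                return true;
            }
            for (Node child : children) {
                if (child.isStreaming()) {
                    return true;
                }
            }
            return false;
        }
    }

    // Illustrative subset of operators rejected on streaming plans.
    static final Set<String> UNSUPPORTED_ON_STREAMS =
        new HashSet<>(Arrays.asList("Sort", "Window", "Distinct"));

    /** Throws if any operator in a streaming subtree is in the unsupported set. */
    static void check(Node plan) {
        if (plan.isStreaming() && UNSUPPORTED_ON_STREAMS.contains(plan.op)) {
            throw new UnsupportedOperationException(
                plan.op + " is not supported on streaming data");
        }
        for (Node child : plan.children) {
            check(child);
        }
    }
}
```

Running the check once after analysis means an unsupported query fails with a clear message before any execution starts.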

## How was this patch tested?
unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12246 from tdas/SPARK-14473.
2016-04-18 11:09:33 -07:00
Dongjoon Hyun 432d1399cb [SPARK-14614] [SQL] Add bround function
## What changes were proposed in this pull request?

This PR aims to add the `bround` function (aka Banker's round) by extending the current `round` implementation. [Hive supports `bround` since 1.3.0.](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF)

**Hive (1.3 ~ 2.0)**
```
hive> select round(2.5), bround(2.5);
OK
3.0	2.0
```

**After this PR**
```scala
scala> sql("select round(2.5), bround(2.5)").head
res0: org.apache.spark.sql.Row = [3,2]
```
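
For readers unfamiliar with Banker's rounding, the difference between the two modes can be reproduced with the JDK's `BigDecimal`. This is only an illustration of half-up versus half-even rounding; the `round` and `bround` helpers here are hypothetical and are not Spark's implementation.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class BankersRoundingSketch {
    /** Half-up rounding: ties move away from zero. */
    static BigDecimal round(String value, int scale) {
        return new BigDecimal(value).setScale(scale, RoundingMode.HALF_UP);
    }

    /** Banker's rounding: ties move to the nearest even digit. */
    static BigDecimal bround(String value, int scale) {
        return new BigDecimal(value).setScale(scale, RoundingMode.HALF_EVEN);
    }

    public static void main(String[] args) {
        System.out.println(round("2.5", 0) + " " + bround("2.5", 0));  // prints: 3 2
        System.out.println(round("3.5", 0) + " " + bround("3.5", 0));  // prints: 4 4
    }
}
```

The two modes only disagree on exact ties, and then only when the preceding digit is even.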

## How was this patch tested?

Pass the Jenkins tests (with extended tests).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12376 from dongjoon-hyun/SPARK-14614.
2016-04-18 10:44:51 -07:00
Reynold Xin 1a3966472c [SPARK-14696][SQL] Add implicit encoders for boxed primitive types
## What changes were proposed in this pull request?
We currently only have implicit encoders for scala primitive types. We should also add implicit encoders for boxed primitives. Otherwise, the following code would not have an encoder:

```scala
sqlContext.range(1000).map { i => i }
```

## How was this patch tested?
Added a unit test case for this.

Author: Reynold Xin <rxin@databricks.com>

Closes #12466 from rxin/SPARK-14696.
2016-04-18 17:03:15 +08:00
Wenchen Fan 2f1d0320c9 [SPARK-13363][SQL] support Aggregator in RelationalGroupedDataset
## What changes were proposed in this pull request?

set the input encoder for `TypedColumn` in `RelationalGroupedDataset.agg`.

## How was this patch tested?

new tests in `DatasetAggregatorSuite`

close https://github.com/apache/spark/pull/11269

This PR brings https://github.com/apache/spark/pull/12359 up to date and fixes the compilation.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12451 from cloud-fan/agg.
2016-04-18 14:27:26 +08:00
Andrew Or 7de06a646d Revert "[SPARK-14647][SQL] Group SQLContext/HiveContext state into SharedState"
This reverts commit 5cefecc95a.
2016-04-17 17:35:41 -07:00
Subhobrata Dey 699a4dfd89 [SPARK-14632] randomSplit method fails on dataframes with maps in schema
## What changes were proposed in this pull request?

The patch fixes an issue where the randomSplit method is not able to split dataframes that have maps in their schema. The bug was introduced in Spark 1.6.1.

## How was this patch tested?

Tested with unit tests.


Author: Subhobrata Dey <sbcd90@gmail.com>

Closes #12438 from sbcd90/randomSplitIssue.
2016-04-17 15:18:32 -07:00
Andrew Or 5cefecc95a [SPARK-14647][SQL] Group SQLContext/HiveContext state into SharedState
## What changes were proposed in this pull request?

This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future.

## How was this patch tested?
Existing tests.

Closes #12405

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12447 from yhuai/sharedState.
2016-04-16 14:00:53 -07:00
Reynold Xin 7319fcc1cd [SPARK-14677][SQL] follow up: make max iter num config internal
## What changes were proposed in this pull request?
This is a follow-up to make the max iteration number an internal config.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #12441 from rxin/maxIterConfInternal.
2016-04-16 11:39:47 -07:00
hyukjinkwon 9f678e9754 [MINOR] Remove inappropriate type notation and extra anonymous closure within functional transformations
## What changes were proposed in this pull request?

This PR removes

- Inappropriate type notations
    For example, from
    ```scala
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
    ...
    ```
    to
    ```scala
    words.foreachRDD { (rdd, time) =>
    ...
    ```

- Extra anonymous closure within functional transformations.
    For example,
    ```scala
    .map(item => {
      ...
    })
    ```

    which can be written simply as:

    ```scala
    .map { item =>
      ...
    }
    ```

and corrects some obvious style nits.

## How was this patch tested?

This was tested after adding rules in `scalastyle-config.xml`, which did not catch every case.

The rules applied were below:

- For the first correction,

```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
</check>
```

```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
</check>
```

- For the second correction
```xml
<check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
</check>
```

**Those rules were not added**

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12413 from HyukjinKwon/SPARK-style.
2016-04-16 14:56:23 +01:00
Reynold Xin 527c780bb0 Revert "[SPARK-13363][SQL] support Aggregator in RelationalGroupedDataset"
This reverts commit 12854464c4.
2016-04-16 01:05:26 -07:00
Wenchen Fan 12854464c4 [SPARK-13363][SQL] support Aggregator in RelationalGroupedDataset
## What changes were proposed in this pull request?

set the input encoder for `TypedColumn` in `RelationalGroupedDataset.agg`.

## How was this patch tested?

new tests in `DatasetAggregatorSuite`

close https://github.com/apache/spark/pull/11269

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12359 from cloud-fan/agg.
2016-04-16 00:31:51 -07:00
Reynold Xin f4be0946af [SPARK-14677][SQL] Make the max number of iterations configurable for Catalyst
## What changes were proposed in this pull request?
We currently hard code the max number of optimizer/analyzer iterations to 100. This patch makes it configurable. While I'm at it, I also added the SessionCatalog to the optimizer, so we can use information there in optimization.
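
The role of an iteration cap can be shown with a generic fixed-point loop. The Java sketch below is a simplified illustration, assuming a single rule and a plan type with a meaningful `equals`; `runToFixedPoint` and its parameters are hypothetical names, not Catalyst's rule executor.

```java
import java.util.function.UnaryOperator;

final class FixedPointSketch {
    /**
     * Applies the rule until the plan stops changing or maxIterations is reached,
     * whichever comes first, and returns the last plan produced.
     */
    static <T> T runToFixedPoint(T plan, UnaryOperator<T> rule, int maxIterations) {
        T current = plan;
        for (int i = 0; i < maxIterations; i++) {
            T next = rule.apply(current);
            if (next.equals(current)) {
                return current;  // fixed point reached before the cap
            }
            current = next;
        }
        return current;  // cap reached; the result may not be a fixed point
    }
}
```

With a rule that decrements a counter towards zero, a cap of 100 reaches the fixed point from 5, while a cap of 3 stops early at 2, which is why making the cap configurable can change the outcome for plans that converge slowly.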

## How was this patch tested?
Updated unit tests to reflect the change.

Author: Reynold Xin <rxin@databricks.com>

Closes #12434 from rxin/SPARK-14677.
2016-04-15 20:28:09 -07:00
Yin Huai b2dfa84959 [SPARK-14668][SQL] Move CurrentDatabase to Catalyst
## What changes were proposed in this pull request?

This PR moves `CurrentDatabase` from sql/hive package to sql/catalyst. It also adds the function description, which looks like the following.

```
scala> sqlContext.sql("describe function extended current_database").collect.foreach(println)
[Function: current_database]
[Class: org.apache.spark.sql.execution.command.CurrentDatabase]
[Usage: current_database() - Returns the current database.]
[Extended Usage:
> SELECT current_database()]
```

## How was this patch tested?
Existing tests

Author: Yin Huai <yhuai@databricks.com>

Closes #12424 from yhuai/SPARK-14668.
2016-04-15 17:48:41 -07:00
Sameer Agarwal 4df65184b6 [SPARK-14620][SQL] Use/benchmark a better hash in VectorizedHashMap
## What changes were proposed in this pull request?

This PR uses a better hashing algorithm while probing the AggregateHashMap:

```java
long h = 0;
h = (h ^ (0x9e3779b9)) + key_1 + (h << 6) + (h >>> 2);
h = (h ^ (0x9e3779b9)) + key_2 + (h << 6) + (h >>> 2);
h = (h ^ (0x9e3779b9)) + key_3 + (h << 6) + (h >>> 2);
...
h = (h ^ (0x9e3779b9)) + key_n + (h << 6) + (h >>> 2);
return h;
```
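
The unrolled steps above fold each key into the running hash with the same expression, so they can be written as a loop. The following is a minimal sketch, assuming the keys are `long` values; the `hash` method and class name are made up for the example.

```java
final class KeyHashSketch {
    /** Folds every key into the running hash using the mixing step shown above. */
    static long hash(long... keys) {
        long h = 0;
        for (long key : keys) {
            // 0x9e3779b9 is an int literal in Java, so it is sign-extended
            // to a long before the XOR, exactly as in the unrolled form.
            h = (h ^ (0x9e3779b9)) + key + (h << 6) + (h >>> 2);
        }
        return h;
    }
}
```

Because each step depends on the previous value of `h`, the result is sensitive to key order: `hash(1, 2)` and `hash(2, 1)` differ.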

Depends on: https://github.com/apache/spark/pull/12345

## How was this patch tested?

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
    Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
    Aggregate w keys:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    codegen = F                              2417 / 2457          8.7         115.2       1.0X
    codegen = T hashmap = F                  1554 / 1581         13.5          74.1       1.6X
    codegen = T hashmap = T                   877 /  929         23.9          41.8       2.8X

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12379 from sameeragarwal/hash.
2016-04-15 15:55:31 -07:00
Wenchen Fan 297ba3f1b4 [SPARK-14275][SQL] Reimplement TypedAggregateExpression to DeclarativeAggregate
## What changes were proposed in this pull request?

`ExpressionEncoder` is just a container for serialization and deserialization expressions, we can use these expressions to build `TypedAggregateExpression` directly, so that it can fit in `DeclarativeAggregate`, which is more efficient.

One trick: each buffer serializer expression references the result object of serialization and function calls. To avoid re-calculating this result object, we can serialize the buffer object to a single struct field, so that we can use a special `Expression` to evaluate the result object only once.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12067 from cloud-fan/typed_udaf.
2016-04-15 12:10:00 +08:00
Sameer Agarwal b5c60bcdca [SPARK-14447][SQL] Speed up TungstenAggregate w/ keys using VectorizedHashMap
## What changes were proposed in this pull request?

This patch speeds up group-by aggregates by around 3-5x by leveraging an in-memory `AggregateHashMap` (please see https://github.com/apache/spark/pull/12161), an append-only aggregate hash map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates (and fall back to the `BytesToBytesMap` if a given key isn't found).

Architecturally, it is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the key-value pairs. The index lookups in the array rely on linear probing (with a small number of maximum tries) and use an inexpensive hash function which makes it really efficient for a majority of lookups. However, using linear probing and an inexpensive hash function also makes it less robust as compared to the `BytesToBytesMap` (especially for a large number of keys or even for certain distribution of keys) and requires us to fall back on the latter for correctness.
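
The lookup scheme described above (power-of-2 array, linear probing with a bounded number of tries, and a fallback signal) can be sketched as follows. This is a simplified Java illustration with plain arrays instead of a columnar batch; the class, its `findOrInsert` method, and the trivial hash are all assumptions made for the example.

```java
final class BoundedProbeMapSketch {
    private final long[] keys;
    private final long[] values;
    private final boolean[] used;
    private final int mask;
    private final int maxProbes;

    BoundedProbeMapSketch(int capacity, int maxProbes) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        this.keys = new long[capacity];
        this.values = new long[capacity];
        this.used = new boolean[capacity];
        this.mask = capacity - 1;  // index = hash & mask, valid for power-of-2 sizes
        this.maxProbes = maxProbes;
    }

    /**
     * Returns the slot holding the key, claiming an empty slot if needed.
     * Returns -1 when the probe budget is exhausted, which tells the caller
     * to fall back to a more robust map.
     */
    int findOrInsert(long key) {
        int idx = (int) key & mask;  // deliberately inexpensive hash
        for (int probes = 0; probes < maxProbes; probes++) {
            if (!used[idx]) {
                used[idx] = true;
                keys[idx] = key;
                return idx;
            }
            if (keys[idx] == key) {
                return idx;
            }
            idx = (idx + 1) & mask;  // linear probing
        }
        return -1;
    }

    void add(int slot, long delta) {
        values[slot] += delta;
    }

    long get(int slot) {
        return values[slot];
    }
}
```

With a capacity of 4 and a budget of 2 probes, the keys 0 and 4 fit in slots 0 and 1, while the colliding key 8 gets -1 and would be handled by the fallback map, which is the correctness escape hatch the description mentions.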

## How was this patch tested?

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
    Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
    Aggregate w keys:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    codegen = F                              2124 / 2204          9.9         101.3       1.0X
    codegen = T hashmap = F                  1198 / 1364         17.5          57.1       1.8X
    codegen = T hashmap = T                   369 /  600         56.8          17.6       5.8X

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12345 from sameeragarwal/tungsten-aggregate-integration.
2016-04-14 20:57:03 -07:00
Mark Grover ff9ae61a3b [SPARK-14601][DOC] Minor doc/usage changes related to removal of Spark assembly
## What changes were proposed in this pull request?

Removing references to assembly jar in documentation.
Adding an additional (previously undocumented) usage of spark-submit to run examples.

## How was this patch tested?

Ran spark-submit usage to ensure formatting was fine. Ran examples using SparkSubmit.

Author: Mark Grover <mark@apache.org>

Closes #12365 from markgrover/spark-14601.
2016-04-14 18:51:43 -07:00
Liang-Chi Hsieh 28efdd3fd7 [SPARK-14592][SQL] Native support for CREATE TABLE LIKE DDL command
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14592

This patch adds native support for DDL command `CREATE TABLE LIKE`.

The SQL syntax is like:

    CREATE TABLE table_name LIKE existing_table
    CREATE TABLE IF NOT EXISTS table_name LIKE existing_table

## How was this patch tested?
`HiveDDLCommandSuite`. `HiveQuerySuite` already tests `CREATE TABLE LIKE`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

This patch had conflicts when merged, resolved by
Committer: Andrew Or <andrew@databricks.com>

Closes #12362 from viirya/create-table-like.
2016-04-14 11:08:08 -07:00
Reynold Xin dac40b68dc [SPARK-14619] Track internal accumulators (metrics) by stage attempt
## What changes were proposed in this pull request?
When there are multiple attempts for a stage, we currently only reset internal accumulator values if all the tasks are resubmitted. It would make more sense to reset the accumulator values for each stage attempt. This will allow us to eventually get rid of the internal flag in the Accumulator class. This is part of my bigger effort to simplify accumulators and task metrics.

## How was this patch tested?
Covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12378 from rxin/SPARK-14619.
2016-04-14 10:54:57 -07:00
Liwei Lin 3e27940a19 [SPARK-14630][BUILD][CORE][SQL][STREAMING] Code style: public abstract methods should have explicit return types
## What changes were proposed in this pull request?

Currently many public abstract methods (in abstract classes as well as traits) don't declare return types explicitly, such as in [o.a.s.streaming.dstream.InputDStream](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/InputDStream.scala#L110):
```scala
def start() // should be: def start(): Unit
def stop()  // should be: def stop(): Unit
```

These methods exist in core, sql, streaming; this PR fixes them.
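
As a small illustration of the style being enforced, the sketch below declares every abstract member with an explicit return type. `Lifecycle` and `SimpleService` are hypothetical names, not classes touched by this PR.

```scala
// Hypothetical trait: every public abstract method states its return type.
trait Lifecycle {
  def start(): Unit
  def stop(): Unit
  def isRunning: Boolean
}

// A trivial implementation that only tracks whether it has been started.
final class SimpleService extends Lifecycle {
  private var running = false
  override def start(): Unit = { running = true }
  override def stop(): Unit = { running = false }
  override def isRunning: Boolean = running
}
```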

## How was this patch tested?

N/A

## Which Scala style rule led to the changes?

The rule was added separately in https://github.com/apache/spark/pull/12396

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12389 from lw-lin/public-abstract-methods.
2016-04-14 10:14:38 -07:00
gatorsmile 0d22092cd9 [SPARK-14125][SQL] Native DDL Support: Alter View
#### What changes were proposed in this pull request?
This PR provides native DDL support for the following three `ALTER VIEW` commands, based on the Hive DDL document:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
##### 1. ALTER VIEW RENAME
**Syntax:**
```SQL
ALTER VIEW view_name RENAME TO new_view_name
```
- changes the name of a view
- renaming a view through ALTER TABLE is not allowed

##### 2. ALTER VIEW SET TBLPROPERTIES
**Syntax:**
```SQL
ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment);
```
- adds metadata to a view
- setting a view's properties through ALTER TABLE is not allowed
- setting an existing property key to its current value is ignored
- setting an existing property key to a different value overwrites the old value

##### 3. ALTER VIEW UNSET TBLPROPERTIES
**Syntax:**
```SQL
ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
```
- removes metadata from a view
- unsetting a view's properties through ALTER TABLE is not allowed
- throws an exception when trying to unset a non-existent key
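
The SET and UNSET rules listed above can be sketched on a plain immutable map. `ViewProps` is a hypothetical helper, not the command implementation in this PR. The behavior of `IF EXISTS` shown here (silently skipping missing keys) is an assumption; the description above only states that unsetting a non-existent key raises an exception.

```scala
object ViewProps {
  // SET: new keys are added and existing keys are overwritten.
  // Setting a key to the value it already has leaves the map unchanged.
  def set(
      props: Map[String, String],
      updates: Map[String, String]): Map[String, String] =
    props ++ updates

  // UNSET: without IF EXISTS, a missing key is an error.
  // Assumption: with ifExists = true, missing keys are skipped.
  def unset(
      props: Map[String, String],
      keys: Seq[String],
      ifExists: Boolean): Map[String, String] = {
    if (!ifExists) {
      keys.find(k => !props.contains(k)).foreach { k =>
        throw new IllegalArgumentException(s"Property key '$k' does not exist")
      }
    }
    props -- keys
  }
}
```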

#### How was this patch tested?
Added test cases to verify if it works properly.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #12324 from gatorsmile/alterView.
2016-04-14 08:34:11 -07:00
hyukjinkwon 6fc3dc8839 [MINOR][SQL] Remove extra anonymous closure within functional transformations
## What changes were proposed in this pull request?

This PR removes extra anonymous closures within functional transformations.

For example,

```scala
.map(item => {
  ...
})
```

which can be written more simply as:

```scala
.map { item =>
  ...
}
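
A runnable sketch of the two forms, using a made-up `items` sequence, shows that the change is purely stylistic: both produce the same result.

```scala
object ClosureStyle {
  val items: Seq[Int] = Seq(1, 2, 3)

  // Old style: a block wrapped in parentheses.
  val withParens: Seq[Int] = items.map(item => {
    val doubled = item * 2
    doubled + 1
  })

  // Preferred style: braces only, no extra parentheses.
  val withBraces: Seq[Int] = items.map { item =>
    val doubled = item * 2
    doubled + 1
  }
}
```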
```

## How was this patch tested?

Related unit tests and `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12382 from HyukjinKwon/minor-extra-closers.
2016-04-14 09:43:41 +01:00
hyukjinkwon b4819404a6 [SPARK-14596][SQL] Remove not used SqlNewHadoopRDD and some more unused imports
## What changes were proposed in this pull request?

The old `HadoopFsRelation` API included `buildInternalScan()`, which used `SqlNewHadoopRDD` in `ParquetRelation`.
Because the old API has now been removed, `SqlNewHadoopRDD` is no longer used.

So, this PR removes `SqlNewHadoopRDD` and several unused imports.

This was discussed in https://github.com/apache/spark/pull/12326.

## How was this patch tested?

Several related existing unit tests and `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12354 from HyukjinKwon/SPARK-14596.
2016-04-14 15:43:44 +08:00
Davies Liu 62b7f306fb [SPARK-14607] [SPARK-14484] [SQL] fix case-insensitive predicates in FileSourceStrategy
## What changes were proposed in this pull request?

When pruning partitions or pushing down predicates, case sensitivity is not respected. To make this work in case-insensitive mode, this PR updates the AttributeReference inside each predicate to use the name from the schema.
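
The renaming step can be sketched without Spark, assuming a simplified predicate type. `PredicateNames`, `EqualTo`, and `normalize` are hypothetical stand-ins for illustration only, not the code in `FileSourceStrategy`.

```scala
object PredicateNames {
  // Hypothetical, simplified predicate: column = value.
  final case class EqualTo(column: String, value: String)

  // Replace the column name in the predicate with the schema's spelling when
  // the two match ignoring case; leave the predicate unchanged otherwise.
  def normalize(schema: Seq[String], predicate: EqualTo): EqualTo =
    schema
      .find(_.equalsIgnoreCase(predicate.column))
      .map(name => predicate.copy(column = name))
      .getOrElse(predicate)
}
```

After normalization, later steps can compare column names exactly, because every predicate uses the schema's own spelling.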

## How was this patch tested?

Added regression tests for case-insensitive resolution.

Author: Davies Liu <davies@databricks.com>

Closes #12371 from davies/case_insensi.
2016-04-13 17:17:19 -07:00
Andrew Or 7d2ed8cc03 [SPARK-14388][SQL] Implement CREATE TABLE
## What changes were proposed in this pull request?

This patch implements the `CREATE TABLE` command using the `SessionCatalog`. Previously we handled only `CTAS` and `CREATE TABLE ... USING`. This requires us to refactor `CatalogTable` to accept various fields (e.g. bucket and skew columns) and pass them to Hive.

WIP: Note that I haven't verified whether this actually works yet! But I believe it does.

## How was this patch tested?

Tests will come in a future commit.

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12271 from andrewor14/create-table-ddl.
2016-04-13 11:08:34 -07:00
Wenchen Fan a5f8c9b15b [SPARK-14554][SQL][FOLLOW-UP] use checkDataset to check the result
## What changes were proposed in this pull request?

Addresses this comment: https://github.com/apache/spark/pull/12322#discussion_r59417359

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12346 from cloud-fan/tmp.
2016-04-13 11:41:09 +08:00