Commit graph

1646 commits

Author SHA1 Message Date
gatorsmile 2c5b18fb0f [SPARK-12789][SQL] Support Order By Ordinal in SQL
#### What changes were proposed in this pull request?
This PR is to support order by position in SQL, e.g.
```SQL
select c1, c2, c3 from tbl order by 1 desc, 3
```
should be equivalent to
```SQL
select c1, c2, c3 from tbl order by c1 desc, c3 asc
```

This is controlled by config option `spark.sql.orderByOrdinal`.
- When true, the ordinal numbers are treated as the position in the select list.
- When false, the ordinal number in order/sort By clause are ignored.

- Only convert integer literals (not foldable expressions). If found foldable expressions, ignore them
- This also works with select *.

**Question**: Do we still need sort by columns that contain zero reference? In this case, it will have no impact on the sorting results. IMO, we should not allow users do it. rxin cloud-fan marmbrus yhuai hvanhovell
-- Update: In these cases, they are ignored in this case.

**Note**: This PR is taken from https://github.com/apache/spark/pull/10731. When merging this PR, please give the credit to zhichao-li

Also cc all the people who are involved in the previous discussion: adrian-wang chenghao-intel tejasapatil

#### How was this patch tested?
Added a few test cases for both positive and negative test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11815 from gatorsmile/orderByPosition.
2016-03-21 18:08:41 +08:00
Dongjoon Hyun 761c2d1b6e [MINOR][DOCS] Add proper periods and spaces for CLI help messages and config doc.
## What changes were proposed in this pull request?

This PR adds some proper periods and spaces to Spark CLI help messages and SQL/YARN conf docs for consistency.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11848 from dongjoon-hyun/add_proper_period_and_space.
2016-03-21 08:00:09 +00:00
Dongjoon Hyun 20fd254101 [SPARK-14011][CORE][SQL] Enable LineLength Java checkstyle rule
## What changes were proposed in this pull request?

[Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has 100-character limit on lines, but it's disabled for Java since 11/09/15. This PR enables **LineLength** checkstyle again. To help that, this also introduces **RedundantImport** and **RedundantModifier**, too. The following is the diff on `checkstyle.xml`.

```xml
-        <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places -->
-        <!--
         <module name="LineLength">
             <property name="max" value="100"/>
             <property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/>
         </module>
-        -->
         <module name="NoLineWrap"/>
         <module name="EmptyBlock">
             <property name="option" value="TEXT"/>
 -167,5 +164,7
         </module>
         <module name="CommentsIndentation"/>
         <module name="UnusedImports"/>
+        <module name="RedundantImport"/>
+        <module name="RedundantModifier"/>
```

## How was this patch tested?

Currently, `lint-java` is disabled in Jenkins. It needs a manual test.
After passing the Jenkins tests, `dev/lint-java` should passes locally.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11831 from dongjoon-hyun/SPARK-14011.
2016-03-21 07:58:57 +00:00
hyukjinkwon e474088144 [SPARK-13764][SQL] Parse modes in JSON data source
## What changes were proposed in this pull request?

Currently, there is no way to control the behaviour when fails to parse corrupt records in JSON data source .

This PR adds the support for parse modes just like CSV data source. There are three modes below:

- `PERMISSIVE` :  When it fails to parse, this sets `null` to to field. This is a default mode when it has been this mode.
- `DROPMALFORMED`: When it fails to parse, this drops the whole record.
- `FAILFAST`: When it fails to parse, it just throws an exception.

This PR also make JSON data source share the `ParseModes` in CSV data source.

## How was this patch tested?

Unit tests were used and `./dev/run_tests` for code style tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #11756 from HyukjinKwon/SPARK-13764.
2016-03-21 15:42:35 +08:00
Reynold Xin dcaa016610 [SPARK-13897][SQL] RelationalGroupedDataset and KeyValueGroupedDataset
## What changes were proposed in this pull request?
Previously, Dataset.groupBy returns a GroupedData, and Dataset.groupByKey returns a GroupedDataset. The naming is very similar, and unfortunately does not convey the real differences between the two.

Assume we are grouping by some keys (K). groupByKey is a key-value style group by, in which the schema of the returned dataset is a tuple of just two fields: key and value. groupBy, on the other hand, is a relational style group by, in which the schema of the returned dataset is flattened and contain |K| + |V| fields.

This pull request also removes the experimental tag from RelationalGroupedDataset. It has been with DataFrame since 1.3, and we have enough confidence now to stabilize it.

## How was this patch tested?
This is a rename to improve API understandability. Should be covered by all existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #11841 from rxin/SPARK-13897.
2016-03-19 11:23:14 -07:00
Sameer Agarwal b39594472b [SPARK-14012][SQL] Extract VectorizedColumnReader from VectorizedParquetRecordReader
## What changes were proposed in this pull request?

This is a minor followup on https://github.com/apache/spark/pull/11799 that extracts out the `VectorizedColumnReader` from `VectorizedParquetRecordReader` into its own file.

## How was this patch tested?

N/A (refactoring only)

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11834 from sameeragarwal/rename.
2016-03-18 22:33:43 -07:00
Sameer Agarwal 54794113a6 [SPARK-13989] [SQL] Remove non-vectorized/unsafe-row parquet record reader
## What changes were proposed in this pull request?

This PR cleans up the new parquet record reader with the following changes:

1. Removes the non-vectorized parquet reader code from `UnsafeRowParquetRecordReader`.
2. Removes the non-vectorized column reader code from `ColumnReader`.
3. Renames `UnsafeRowParquetRecordReader` to `VectorizedParquetRecordReader` and `ColumnReader` to `VectorizedColumnReader`
4. Deprecate `PARQUET_UNSAFE_ROW_RECORD_READER_ENABLED`

## How was this patch tested?

Refactoring only; Existing tests should reveal any problems.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11799 from sameeragarwal/vectorized-parquet.
2016-03-18 14:04:42 -07:00
Davies Liu 9c23c818ca [SPARK-13977] [SQL] Brings back Shuffled hash join
## What changes were proposed in this pull request?

ShuffledHashJoin (also outer join) is removed in 1.6, in favor of SortMergeJoin, which is more robust and also fast.

ShuffledHashJoin is still useful in this case: 1) one table is much smaller than the other one, then cost to build a hash table on smaller table is smaller than sorting the larger table 2) any partition of the small table could fit in memory.

This PR brings back ShuffledHashJoin, basically revert #9645, and fix the conflict. Also merging outer join and left-semi join into the same class. This PR does not implement full outer join, because it's not implemented efficiently (requiring build hash table on both side).

A simple benchmark (one table is 5x smaller than other one) show that ShuffledHashJoin could be 2X faster than SortMergeJoin.

## How was this patch tested?

Added new unit tests for ShuffledHashJoin.

Author: Davies Liu <davies@databricks.com>

Closes #11788 from davies/shuffle_join.
2016-03-18 10:32:53 -07:00
Reynold Xin bb1fda01fe [SPARK-13826][SQL] Addendum: update documentation for Datasets
## What changes were proposed in this pull request?
This patch updates documentations for Datasets. I also updated some internal documentation for exchange/broadcast.

## How was this patch tested?
Just documentation/api stability update.

Author: Reynold Xin <rxin@databricks.com>

Closes #11814 from rxin/dataset-docs.
2016-03-18 00:57:23 -07:00
Liang-Chi Hsieh 750ed64cd9 [SPARK-13930] [SQL] Apply fast serialization on collect limit operator
## What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-13930

Recently the fast serialization has been introduced to collecting DataFrame/Dataset (#11664). The same technology can be used on collect limit operator too.

## How was this patch tested?

Add a benchmark for collect limit to `BenchmarkWholeStageCodegen`.

Without this patch:

    model name      : Westmere E56xx/L56xx/X56xx (Nehalem-C)
    collect limit:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    collect limit 1 million                  3413 / 3768          0.3        3255.0       1.0X
    collect limit 2 millions                9728 / 10440          0.1        9277.3       0.4X

With this patch:

    model name      : Westmere E56xx/L56xx/X56xx (Nehalem-C)
    collect limit:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    collect limit 1 million                   833 / 1284          1.3         794.4       1.0X
    collect limit 2 millions                 3348 / 4005          0.3        3193.3       0.2X

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #11759 from viirya/execute-take.
2016-03-17 23:24:44 -07:00
Cheng Lian 10ef4f3e77 [SPARK-13826][SQL] Revises Dataset ScalaDoc
## What changes were proposed in this pull request?

This PR revises Dataset API ScalaDoc.  All public methods are divided into the following groups

* `groupname basic`: Basic Dataset functions
* `groupname action`: Actions
* `groupname untypedrel`: Untyped Language Integrated Relational Queries
* `groupname typedrel`: Typed Language Integrated Relational Queries
* `groupname func`: Functional Transformations
* `groupname rdd`: RDD Operations
* `groupname output`: Output Operations

`since` tag and sample code are also updated.  We may want to add more sample code for typed APIs.

## How was this patch tested?

Documentation change.  Checked by building unidoc locally.

Author: Cheng Lian <lian@databricks.com>

Closes #11769 from liancheng/spark-13826-ds-api-doc.
2016-03-17 21:31:11 -07:00
Dilip Biswal 637a78f1d3 [SPARK-13427][SQL] Support USING clause in JOIN.
## What changes were proposed in this pull request?

Support queries that JOIN tables with USING clause.
SELECT * from table1 JOIN table2 USING <column_list>

USING clause can be used as a means to simplify the join condition
when :

1) Equijoin semantics is desired and
2) The column names in the equijoin have the same name.

We already have the support for Natural Join in Spark. This PR makes
use of the already existing infrastructure for natural join to
form the join condition and also the projection list.

## How was the this patch tested?

Have added unit tests in SQLQuerySuite, CatalystQlSuite, ResolveNaturalJoinSuite

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #11297 from dilipbiswal/spark-13427.
2016-03-17 10:01:41 -07:00
Wenchen Fan 8ef3399aff [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging
## What changes were proposed in this pull request?

Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11764 from cloud-fan/logger.
2016-03-17 19:23:38 +08:00
Josh Rosen de1a84e56e [SPARK-13926] Automatically use Kryo serializer when shuffling RDDs with simple types
Because ClassTags are available when constructing ShuffledRDD we can use them to automatically use Kryo for shuffle serialization when the RDD's types are known to be compatible with Kryo.

This patch introduces `SerializerManager`, a component which picks the "best" serializer for a shuffle given the elements' ClassTags. It will automatically pick a Kryo serializer for ShuffledRDDs whose key, value, and/or combiner types are primitives, arrays of primitives, or strings. In the future we can use this class as a narrow extension point to integrate specialized serializers for other types, such as ByteBuffers.

In a planned followup patch, I will extend the BlockManager APIs so that we're able to use similar automatic serializer selection when caching RDDs (this is a little trickier because the ClassTags need to be threaded through many more places).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11755 from JoshRosen/automatically-pick-best-serializer.
2016-03-16 22:52:55 -07:00
Daoyuan Wang d1c193a2f1 [SPARK-12855][MINOR][SQL][DOC][TEST] remove spark.sql.dialect from doc and test
## What changes were proposed in this pull request?

Since developer API of plug-able parser has been removed in #10801 , docs should be updated accordingly.

## How was this patch tested?

This patch will not affect the real code path.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #11758 from adrian-wang/spark12855.
2016-03-16 22:52:10 -07:00
Davies Liu c100d31ddc [SPARK-13873] [SQL] Avoid copy of UnsafeRow when there is no join in whole stage codegen
## What changes were proposed in this pull request?

We need to copy the UnsafeRow since a Join could produce multiple rows from single input rows. We could avoid that if there is no join (or the join will not produce multiple rows) inside WholeStageCodegen.

Updated the benchmark for `collect`, we could see 20-30% speedup.

## How was this patch tested?

existing unit tests.

Author: Davies Liu <davies@databricks.com>

Closes #11740 from davies/avoid_copy2.
2016-03-16 21:46:04 -07:00
hyukjinkwon 917f4000b4 [SPARK-13719][SQL] Parse JSON rows having an array type and a struct type in the same fieild
## What changes were proposed in this pull request?

This https://github.com/apache/spark/pull/2400 added the support to parse JSON rows wrapped with an array. However, this throws an exception when the given data contains array data and struct data in the same field as below:

```json
{"a": {"b": 1}}
{"a": []}
```

and the schema is given as below:

```scala
val schema =
  StructType(
    StructField("a", StructType(
      StructField("b", StringType) :: Nil
    )) :: Nil)
```

- **Before**

```scala
sqlContext.read.schema(schema).json(path).show()
```

```scala
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
	at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
...
```

- **After**

```scala
sqlContext.read.schema(schema).json(path).show()
```

```bash
+----+
|   a|
+----+
| [1]|
|null|
+----+
```

For other data types, in this case it converts the given values are `null` but only this case emits an exception.

This PR makes the support for wrapped rows applied only at the top level.

## How was this patch tested?

Unit tests were used and `./dev/run_tests` for code style tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #11752 from HyukjinKwon/SPARK-3308-follow-up.
2016-03-16 18:20:30 -07:00
Jakob Odersky d4d84936fb [SPARK-11011][SQL] Narrow type of UDT serialization
## What changes were proposed in this pull request?

Narrow down the parameter type of `UserDefinedType#serialize()`. Currently, the parameter type is `Any`, however it would logically make more sense to narrow it down to the type of the actual user defined type.

## How was this patch tested?

Existing tests were successfully run on local machine.

Author: Jakob Odersky <jakob@odersky.com>

Closes #11379 from jodersky/SPARK-11011-udt-types.
2016-03-16 16:59:36 -07:00
Sameer Agarwal b90c0206fa [SPARK-13922][SQL] Filter rows with null attributes in vectorized parquet reader
# What changes were proposed in this pull request?

It's common for many SQL operators to not care about reading `null` values for correctness. Currently, this is achieved by performing `isNotNull` checks (for all relevant columns) on a per-row basis. Pushing these null filters in the vectorized parquet reader should bring considerable benefits (especially for cases when the underlying data doesn't contain any nulls or contains all nulls).

## How was this patch tested?

        Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
        String with Nulls Scan (0%):        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        -------------------------------------------------------------------------------------------
        SQL Parquet Vectorized                   1229 / 1648          8.5         117.2       1.0X
        PR Vectorized                             833 /  846         12.6          79.4       1.5X
        PR Vectorized (Null Filtering)            732 /  782         14.3          69.8       1.7X

        Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
        String with Nulls Scan (50%):       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        -------------------------------------------------------------------------------------------
        SQL Parquet Vectorized                    995 / 1053         10.5          94.9       1.0X
        PR Vectorized                             732 /  772         14.3          69.8       1.4X
        PR Vectorized (Null Filtering)            725 /  790         14.5          69.1       1.4X

        Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
        String with Nulls Scan (95%):       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        -------------------------------------------------------------------------------------------
        SQL Parquet Vectorized                    326 /  333         32.2          31.1       1.0X
        PR Vectorized                             190 /  200         55.1          18.2       1.7X
        PR Vectorized (Null Filtering)            168 /  172         62.2          16.1       1.9X

Author: Sameer Agarwal <sameer@databricks.com>

Closes #11749 from sameeragarwal/perf-testing.
2016-03-16 16:25:40 -07:00
Cheng Hao d9670f8473 [SPARK-13894][SQL] SqlContext.range return type from DataFrame to DataSet
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13894
Change the return type of the `SQLContext.range` API from `DataFrame` to `Dataset`.

## How was this patch tested?
No additional unit test required.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #11730 from chenghao-intel/range.
2016-03-16 11:20:15 -07:00
Wenchen Fan d9e8f26d03 [SPARK-13924][SQL] officially support multi-insert
## What changes were proposed in this pull request?

There is a feature of hive SQL called multi-insert. For example:
```
FROM src
INSERT OVERWRITE TABLE dest1
SELECT key + 1
INSERT OVERWRITE TABLE dest2
SELECT key WHERE key > 2
INSERT OVERWRITE TABLE dest3
SELECT col EXPLODE(arr) exp AS col
...
```

We partially support it currently, with some limitations: 1) WHERE can't reference columns produced by LATERAL VIEW. 2) It's not executed eagerly, i.e. `sql("...multi-insert clause...")` won't take place right away like other commands, e.g. CREATE TABLE.

This PR removes these limitations and make us fully support multi-insert.

## How was this patch tested?

new tests in `SQLQuerySuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11754 from cloud-fan/lateral-view.
2016-03-16 10:52:36 -07:00
Sean Owen 3b461d9ecd [SPARK-13823][SPARK-13397][SPARK-13395][CORE] More warnings, StandardCharset follow up
## What changes were proposed in this pull request?

Follow up to https://github.com/apache/spark/pull/11657

- Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8`
- And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests)
- And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #11725 from srowen/SPARK-13823.2.
2016-03-16 09:36:34 +00:00
hyukjinkwon 92024797a4 [SPARK-13899][SQL] Produce InternalRow instead of external Row at CSV data source
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13899

This PR makes CSV data source produce `InternalRow` instead of `Row`.

Basically, this resembles JSON data source. It uses the same codes for casting.

## How was this patch tested?

Unit tests were used within IDE and code style was checked by `./dev/run_tests`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #11717 from HyukjinKwon/SPARK-13899.
2016-03-15 23:31:46 -07:00
Davies Liu 421f6c20e8 [SPARK-13917] [SQL] generate broadcast semi join
## What changes were proposed in this pull request?

This PR brings codegen support for broadcast left-semi join.

## How was this patch tested?

Existing tests. Added benchmark, the result show 7X speedup.

Author: Davies Liu <davies@databricks.com>

Closes #11742 from davies/gen_semi.
2016-03-15 22:17:04 -07:00
Davies Liu bbd887f53c [SPARK-13918][SQL] Merge SortMergeJoin and SortMergerOuterJoin
## What changes were proposed in this pull request?

This PR just move some code from SortMergeOuterJoin into SortMergeJoin.

This is for support codegen for outer join.

## How was this patch tested?

existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #11743 from davies/gen_smjouter.
2016-03-15 19:58:49 -07:00
Reynold Xin 643649dcbf [SPARK-13895][SQL] DataFrameReader.text should return Dataset[String]
## What changes were proposed in this pull request?
This patch changes DataFrameReader.text()'s return type from DataFrame to Dataset[String].

Closes #11731.

## How was this patch tested?
Updated existing integration tests to reflect the change.

Author: Reynold Xin <rxin@databricks.com>

Closes #11739 from rxin/SPARK-13895.
2016-03-15 14:57:54 -07:00
Stavros Kontopoulos 50e3644d00 [SPARK-13896][SQL][STRING] Dataset.toJSON should return Dataset
## What changes were proposed in this pull request?
Change the return type of toJson in Dataset class
## How was this patch tested?
No additional unit test required.

Author: Stavros Kontopoulos <stavros.kontopoulos@typesafe.com>

Closes #11732 from skonto/fix_toJson.
2016-03-15 12:18:30 -07:00
Reynold Xin 5e6f2f4563 [SPARK-13893][SQL] Remove SQLContext.catalog/analyzer (internal method)
## What changes were proposed in this pull request?
Our internal code can go through SessionState.catalog and SessionState.analyzer. This brings two small benefits:
1. Reduces internal dependency on SQLContext.
2. Removes 2 public methods in Java (Java does not obey package private visibility).

More importantly, according to the design in SPARK-13485, we'd need to claim this catalog function for the user-facing public functions, rather than having an internal field.

## How was this patch tested?
Existing unit/integration test code.

Author: Reynold Xin <rxin@databricks.com>

Closes #11716 from rxin/SPARK-13893.
2016-03-15 10:12:32 -07:00
Reynold Xin 276c2d51a3 [SPARK-13890][SQL] Remove some internal classes' dependency on SQLContext
## What changes were proposed in this pull request?
In general it is better for internal classes to not depend on the external class (in this case SQLContext) to reduce coupling between user-facing APIs and the internal implementations. This patch removes SQLContext dependency from some internal classes such as SparkPlanner, SparkOptimizer.

As part of this patch, I also removed the following internal methods from SQLContext:
```
protected[sql] def functionRegistry: FunctionRegistry
protected[sql] def optimizer: Optimizer
protected[sql] def sqlParser: ParserInterface
protected[sql] def planner: SparkPlanner
protected[sql] def continuousQueryManager
protected[sql] def prepareForExecution: RuleExecutor[SparkPlan]
```

## How was this patch tested?
Existing unit/integration tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #11712 from rxin/sqlContext-planner.
2016-03-14 23:58:57 -07:00
Reynold Xin e64958001c [SPARK-13884][SQL] Remove DescribeCommand's dependency on LogicalPlan
## What changes were proposed in this pull request?
This patch removes DescribeCommand's dependency on LogicalPlan. After this patch, DescribeCommand simply accepts a TableIdentifier. It minimizes the dependency, and blocks my next patch (removes SQLContext dependency from SparkPlanner).

## How was this patch tested?
Should be covered by existing unit tests and Hive compatibility tests that run describe table.

Author: Reynold Xin <rxin@databricks.com>

Closes #11710 from rxin/SPARK-13884.
2016-03-14 23:09:10 -07:00
Davies Liu f72743d971 [SPARK-13353][SQL] fast serialization for collecting DataFrame/Dataset
## What changes were proposed in this pull request?

When we call DataFrame/Dataset.collect(), Java serializer (or Kryo Serializer) will be used to serialize the UnsafeRows in executor, then deserialize them into UnsafeRows in driver. Java serializer (and Kyro serializer) are slow on millions rows, because they try to find out the same rows, but usually there is no same rows.

This PR will serialize the UnsafeRows as byte array by packing them together, then Java serializer (or Kyro serializer) serialize the bytes very fast (there are fewer blocks and byte array are not compared by content).

The UnsafeRow format is highly compressible, the serialized bytes are also compressed (configurable by spark.io.compression.codec).

## How was this patch tested?

Existing unit tests.

Add a benchmark for collect, before this patch:
```
Intel(R) Core(TM) i7-4558U CPU  2.80GHz
collect:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
collect 1 million                      3991 / 4311          0.3        3805.7       1.0X
collect 2 millions                  10083 / 10637          0.1        9616.0       0.4X
collect 4 millions                  29551 / 30072          0.0       28182.3       0.1X
```

```
Intel(R) Core(TM) i7-4558U CPU  2.80GHz
collect:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
collect 1 million                        775 / 1170          1.4         738.9       1.0X
collect 2 millions                     1153 / 1758          0.9        1099.3       0.7X
collect 4 millions                     4451 / 5124          0.2        4244.9       0.2X
```

We can see about 5-7X speedup.

Author: Davies Liu <davies@databricks.com>

Closes #11664 from davies/serialize_row.
2016-03-14 22:32:22 -07:00
Davies Liu 9256840cb6 [SPARK-13661][SQL] avoid the copy in HashedRelation
## What changes were proposed in this pull request?

Avoid the copy in HashedRelation, since most of the HashedRelation are built with Array[Row], added the copy() for LeftSemiJoinHash. This could help to reduce the memory consumption for Broadcast join.

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #11666 from davies/remove_copy.
2016-03-14 22:25:57 -07:00
Reynold Xin e76679a814 [SPARK-13880][SPARK-13881][SQL] Rename DataFrame.scala Dataset.scala, and remove LegacyFunctions
## What changes were proposed in this pull request?
1. Rename DataFrame.scala Dataset.scala, since the class is now named Dataset.
2. Remove LegacyFunctions. It was introduced in Spark 1.6 for backward compatibility, and can be removed in Spark 2.0.

## How was this patch tested?
Should be covered by existing unit/integration tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #11704 from rxin/SPARK-13880.
2016-03-15 10:39:07 +08:00
Shixiong Zhu b5e3bd87f5 [SPARK-13791][SQL] Add MetadataLog and HDFSMetadataLog
## What changes were proposed in this pull request?

- Add a MetadataLog interface for  metadata reliably storage.
- Add HDFSMetadataLog as a MetadataLog implementation based on HDFS.
- Update FileStreamSource to use HDFSMetadataLog instead of managing metadata by itself.

## How was this patch tested?

unit tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11625 from zsxwing/metadata-log.
2016-03-14 19:28:13 -07:00
Reynold Xin 4bf4609795 [SPARK-13882][SQL] Remove org.apache.spark.sql.execution.local
## What changes were proposed in this pull request?
We introduced some local operators in org.apache.spark.sql.execution.local package but never fully wired the engine to actually use these. We still plan to implement a full local mode, but it's probably going to be fairly different from what the current iterator-based local mode would look like. Based on what we know right now, we might want a push-based columnar version of these operators.

Let's just remove them for now, and we can always re-introduced them in the future by looking at branch-1.6.

## How was this patch tested?
This is simply dead code removal.

Author: Reynold Xin <rxin@databricks.com>

Closes #11705 from rxin/SPARK-13882.
2016-03-14 19:22:11 -07:00
Michael Armbrust 17eec0a71b [SPARK-13664][SQL] Add a strategy for planning partitioned and bucketed scans of files
This PR adds a new strategy, `FileSourceStrategy`, that can be used for planning scans of collections of files that might be partitioned or bucketed.

Compared with the existing planning logic in `DataSourceStrategy` this version has the following desirable properties:
 - It removes the need to have `RDD`, `broadcastedHadoopConf` and other distributed concerns  in the public API of `org.apache.spark.sql.sources.FileFormat`
 - Partition column appending is delegated to the format to avoid an extra copy / devectorization when appending partition columns
 - It minimizes the amount of data that is shipped to each executor (i.e. it does not send the whole list of files to every worker in the form of a hadoop conf)
 - it natively supports bucketing files into partitions, and thus does not require coalescing / creating a `UnionRDD` with the correct partitioning.
 - Small files are automatically coalesced into fewer tasks using an approximate bin-packing algorithm.

Currently only a testing source is planned / tested using this strategy.  In follow-up PRs we will port the existing formats to this API.

A stub for `FileScanRDD` is also added, but most methods remain unimplemented.

Other minor cleanups:
 - partition pruning is pushed into `FileCatalog` so both the new and old code paths can use this logic.  This will also allow future implementations to use indexes or other tricks (i.e. a MySQL metastore)
 - The partitions from the `FileCatalog` now propagate information about file sizes all the way up to the planner so we can intelligently spread files out.
 - `Array` -> `Seq` in some internal APIs to avoid unnecessary `toArray` calls
 - Rename `Partition` to `PartitionDirectory` to differentiate partitions used earlier in pruning from those where we have already enumerated the files and their sizes.

Author: Michael Armbrust <michael@databricks.com>

Closes #11646 from marmbrus/fileStrategy.
2016-03-14 19:21:12 -07:00
Andrew Or 9a1680c2c8 [SPARK-13139][SQL] Follow-ups to #11573
Addressing outstanding comments in #11573.

Jenkins, new test case in `DDLCommandSuite`

Author: Andrew Or <andrew@databricks.com>

Closes #11667 from andrewor14/ddl-parser-followups.
2016-03-14 09:59:22 -07:00
Yin Huai 250832c733 [SPARK-13207][SQL] Make partitioning discovery ignore _SUCCESS files.
If a _SUCCESS appears in the inner partitioning dir, partition discovery will treat that _SUCCESS file as a data file. Then, partition discovery will fail because it finds that the dir structure is not valid. We should ignore those `_SUCCESS` files.

In future, it is better to ignore all files/dirs starting with `_` or `.`. This PR does not make this change. I am thinking about making this change simple, so we can consider of getting it in branch 1.6.

To ignore all files/dirs starting with `_` or `, the main change is to let ParquetRelation have another way to get metadata files. Right now, it relies on FileStatusCache's cachedLeafStatuses, which returns file statuses of both metadata files (e.g. metadata files used by parquet) and data files, which requires more changes.

https://issues.apache.org/jira/browse/SPARK-13207

Author: Yin Huai <yhuai@databricks.com>

Closes #11088 from yhuai/SPARK-13207.
2016-03-14 09:03:13 -07:00
Dongjoon Hyun acdf219703 [MINOR][DOCS] Fix more typos in comments/strings.
## What changes were proposed in this pull request?

This PR fixes 135 typos over 107 files:
* 121 typos in comments
* 11 typos in testcase name
* 3 typos in log messages

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11689 from dongjoon-hyun/fix_more_typos.
2016-03-14 09:07:39 +00:00
Sean Owen 1840852841 [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> byte[] conversions (and remaining Coverity items)
## What changes were proposed in this pull request?

- Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8
- Same for `InputStreamReader` and `OutputStreamWriter` constructors
- Standardizes on UTF-8 everywhere
- Standardizes specifying the encoding with `StandardCharsets.UTF-8`, not the Guava constant or "UTF-8" (which means handling `UnuspportedEncodingException`)
- (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit 1deecd8d9c )

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #11657 from srowen/SPARK-13823.
2016-03-13 21:03:49 -07:00
Jacky Li f3daa099bf [SQL] fix typo in DataSourceRegister
## What changes were proposed in this pull request?
fix typo in DataSourceRegister

## How was this patch tested?

found when going through latest code

Author: Jacky Li <jacky.likun@huawei.com>

Closes #11686 from jackylk/patch-12.
2016-03-13 18:44:02 -07:00
Cheng Lian c079420d7c [SPARK-13841][SQL] Removes Dataset.collectRows()/takeRows()
## What changes were proposed in this pull request?

This PR removes two methods, `collectRows()` and `takeRows()`, from `Dataset[T]`. These methods were added in PR #11443, and were later considered not useful.

## How was this patch tested?

Existing tests should do the work.

Author: Cheng Lian <lian@databricks.com>

Closes #11678 from liancheng/remove-collect-rows-and-take-rows.
2016-03-13 12:02:52 +08:00
Cheng Lian 4eace4d384 [SPARK-13828][SQL] Bring back stack trace of AnalysisException thrown from QueryExecution.assertAnalyzed
PR #11443 added an extra `plan: Option[LogicalPlan]` argument to `AnalysisException` and attached partially analyzed plan to thrown `AnalysisException` in `QueryExecution.assertAnalyzed()`.  However, the original stack trace wasn't properly inherited.  This PR fixes this issue by inheriting the stack trace.

A test case is added to verify that the first entry of `AnalysisException` stack trace isn't from `QueryExecution`.

Author: Cheng Lian <lian@databricks.com>

Closes #11677 from liancheng/analysis-exception-stacktrace.
2016-03-12 11:25:15 -08:00
Davies Liu ba8c86d06f [SPARK-13671] [SPARK-13311] [SQL] Use different physical plans for RDD and data sources
## What changes were proposed in this pull request?

This PR split the PhysicalRDD into two classes, PhysicalRDD and PhysicalScan. PhysicalRDD is used for DataFrames that is created from existing RDD. PhysicalScan is used for DataFrame that is created from data sources. This enable use to apply different optimization on both of them.

Also fix the problem for sameResult() on two DataSourceScan.

Also fix the equality check to toString for `In`. It's better to use Seq there, but we can't break this public API (sad).

## How was this patch tested?

Existing tests. Manually tested with TPCDS query Q59 and Q64, all those duplicated exchanges can be re-used now, also saw there are 40+% performance improvement (saving half of the scan).

Author: Davies Liu <davies@databricks.com>

Closes #11514 from davies/existing_rdd.
2016-03-12 00:48:36 -08:00
Andrew Or 66d9d0edfe [SPARK-13139][SQL] Parse Hive DDL commands ourselves
## What changes were proposed in this pull request?

This patch is ported over from viirya's changes in #11048. Currently for most DDLs we just pass the query text directly to Hive. Instead, we should parse these commands ourselves and in the future (not part of this patch) use the `HiveCatalog` to process these DDLs. This is a pretext to merging `SQLContext` and `HiveContext`.

Note: As of this patch we still pass the query text to Hive. The difference is that we now parse the commands ourselves so in the future we can just use our own catalog.

## How was this patch tested?

Jenkins, new `DDLCommandSuite`, which comprises of about 40% of the changes here.

Author: Andrew Or <andrew@databricks.com>

Closes #11573 from andrewor14/parser-plus-plus.
2016-03-11 15:13:48 -08:00
Cheng Lian 6d37e1eb90 [SPARK-13817][BUILD][SQL] Re-enable MiMA and removes object DataFrame
## What changes were proposed in this pull request?

PR #11443 temporarily disabled MiMA check, this PR re-enables it.

One extra change is that `object DataFrame` is also removed. The only purpose of introducing `object DataFrame` was to use it as an internal factory for creating `Dataset[Row]`. By replacing this internal factory with `Dataset.newDataFrame`, both `DataFrame` and `DataFrame$` are entirely removed from the API, so that we can simply put a `MissingClassProblem` filter in `MimaExcludes.scala` for most DataFrame API  changes.

## How was this patch tested?

Tested by MiMA check triggered by Jenkins.

Author: Cheng Lian <lian@databricks.com>

Closes #11656 from liancheng/re-enable-mima.
2016-03-11 22:17:50 +08:00
gatorsmile 560489f4e1 [SPARK-13732][SPARK-13797][SQL] Remove projectList from Window and Eliminate useless Window
#### What changes were proposed in this pull request?

`projectList` is useless. Its value is always the same as the child.output. Remove it from the class `Window`. Removal can simplify the codes in Analyzer and Optimizer.

This PR is based on the discussion started by cloud-fan in a separate PR:
https://github.com/apache/spark/pull/5604#discussion_r55140466

This PR also eliminates useless `Window`.

cloud-fan yhuai

#### How was this patch tested?

Existing test cases cover it.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #11565 from gatorsmile/removeProjListWindow.
2016-03-11 11:59:18 +08:00
Cheng Lian 1d542785b9 [SPARK-13244][SQL] Migrates DataFrame to Dataset
## What changes were proposed in this pull request?

This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and make `DataFrame` a type alias of `Dataset[Row]`.

Most Scala code changes are source compatible, but Java API is broken as Java knows nothing about Scala type alias (mostly replacing `DataFrame` with `Dataset<Row>`).

There are several noticeable API changes related to those returning arrays:

1.  `collect`/`take`

    -   Old APIs in class `DataFrame`:

        ```scala
        def collect(): Array[Row]
        def take(n: Int): Array[Row]
        ```

    -   New APIs in class `Dataset[T]`:

        ```scala
        def collect(): Array[T]
        def take(n: Int): Array[T]

        def collectRows(): Array[Row]
        def takeRows(n: Int): Array[Row]
        ```

    Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `DataFrame.collect(): Array[T]` actually returns `Object` instead of `Array<T>` from Java side.

    Normally, Java users may fall back to `collectAsList` and `takeAsList`.  The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here).

1.  `randomSplit`

    -   Old APIs in class `DataFrame`:

        ```scala
        def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]
        def randomSplit(weights: Array[Double]): Array[DataFrame]
        ```

    -   New APIs in class `Dataset[T]`:

        ```scala
        def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
        def randomSplit(weights: Array[Double]): Array[Dataset[T]]
        ```

    Similar problem as above, but hasn't been addressed for Java API yet.  We can probably add `randomSplitAsList` to fix this one.

1.  `groupBy`

    Some original `DataFrame.groupBy` methods have conflicting signature with original `Dataset.groupBy` methods.  To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`.

Other noticeable changes:

1.  Dataset always do eager analysis now

    We used to support disabling DataFrame eager analysis to help reporting partially analyzed malformed logical plan on analysis failure.  However, Dataset encoders requires eager analysi during Dataset construction.  To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures.  This plan is passed by `QueryExecution.assertAnalyzed`.

## How was this patch tested?

Existing tests do the work.

## TODO

- [ ] Fix all tests
- [ ] Re-enable MiMA check
- [ ] Update ScalaDoc (`since`, `group`, and example code)

Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <liancheng@users.noreply.github.com>

Closes #11443 from liancheng/ds-to-df.
2016-03-10 17:00:17 -08:00
Davies Liu 020ff8cd34 [SPARK-13751] [SQL] generate better code for Filter
## What changes were proposed in this pull request?

This PR improve the codegen of Filter by:

1. filter out the rows early if it have null value in it that will cause the condition result in null or false. After this, we could simplify the condition, because the input are not nullable anymore.

2. Split the condition as conjunctive predicates, then check them one by one.

Here is a piece of generated code for Filter in TPCDS Q55:
```java
/* 109 */       /*** CONSUME: Filter ((((isnotnull(d_moy#149) && isnotnull(d_year#147)) && (d_moy#149 = 11)) && (d_year#147 = 1999)) && isnotnull(d_date_sk#141)) */
/* 110 */       /* input[0, int] */
/* 111 */       boolean project_isNull2 = rdd_row.isNullAt(0);
/* 112 */       int project_value2 = project_isNull2 ? -1 : (rdd_row.getInt(0));
/* 113 */       /* input[1, int] */
/* 114 */       boolean project_isNull3 = rdd_row.isNullAt(1);
/* 115 */       int project_value3 = project_isNull3 ? -1 : (rdd_row.getInt(1));
/* 116 */       /* input[2, int] */
/* 117 */       boolean project_isNull4 = rdd_row.isNullAt(2);
/* 118 */       int project_value4 = project_isNull4 ? -1 : (rdd_row.getInt(2));
/* 119 */
/* 120 */       if (project_isNull3) continue;
/* 121 */       if (project_isNull4) continue;
/* 122 */       if (project_isNull2) continue;
/* 123 */
/* 124 */       /* (input[1, int] = 11) */
/* 125 */       boolean filter_value6 = false;
/* 126 */       filter_value6 = project_value3 == 11;
/* 127 */       if (!filter_value6) continue;
/* 128 */
/* 129 */       /* (input[2, int] = 1999) */
/* 130 */       boolean filter_value9 = false;
/* 131 */       filter_value9 = project_value4 == 1999;
/* 132 */       if (!filter_value9) continue;
/* 133 */
/* 134 */       filter_metricValue1.add(1);
/* 135 */
/* 136 */       /*** CONSUME: Project [d_date_sk#141] */
/* 137 */
/* 138 */       project_rowWriter1.write(0, project_value2);
/* 139 */       append(project_result1.copy());
```

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #11585 from davies/gen_filter.
2016-03-10 16:40:16 -08:00
Dongjoon Hyun 91fed8e9c5 [SPARK-3854][BUILD] Scala style: require spaces before {.
## What changes were proposed in this pull request?

Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule to prevent '){' pattern  for the following majority pattern and fixes the code accordingly. If we enforce this in ScalaStyle from now, it will improve the Scala code quality and reduce review time.
```
// Correct:
if (true) {
  println("Wow!")
}

// Incorrect:
if (true){
   println("Wow!")
}
```
IntelliJ also shows new warnings based on this.

## How was this patch tested?

Pass the Jenkins ScalaStyle test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11637 from dongjoon-hyun/SPARK-3854.
2016-03-10 15:57:22 -08:00