2028 commits

Author SHA1 Message Date
Reynold Xin b87825310a [SPARK-9692] Remove SqlNewHadoopRDD's generated Tuple2 and InterruptibleIterator.
A small performance optimization – we don't need to generate a Tuple2 and then immediately discard the key. We also don't need an extra wrapper from InterruptibleIterator.

Author: Reynold Xin <rxin@databricks.com>

Closes #8000 from rxin/SPARK-9692 and squashes the following commits:

1d4d0b3 [Reynold Xin] [SPARK-9692] Remove SqlNewHadoopRDD's generated Tuple2 and InterruptibleIterator.
2015-08-06 18:25:38 -07:00
Davies Liu 49b1504fe3 Revert "[SPARK-9228] [SQL] use tungsten.enabled in public for both of codegen/unsafe"
This reverts commit 4e70e8256c.
2015-08-06 17:36:12 -07:00
Michael Armbrust 0867b23c74 [SPARK-9650][SQL] Fix quoting behavior on interpolated column names
Make sure that `$"column"` is consistent with other methods with respect to backticks.  Adds a bunch of tests for various ways of constructing columns.

Author: Michael Armbrust <michael@databricks.com>

Closes #7969 from marmbrus/namesWithDots and squashes the following commits:

53ef3d7 [Michael Armbrust] [SPARK-9650][SQL] Fix quoting behavior on interpolated column names
2bf7a92 [Michael Armbrust] WIP
2015-08-06 17:31:16 -07:00
Davies Liu 4e70e8256c [SPARK-9228] [SQL] use tungsten.enabled in public for both of codegen/unsafe
`spark.sql.tungsten.enabled` will be the default value for both codegen and unsafe; those two settings are kept internally for debugging and testing.

cc marmbrus rxin

Author: Davies Liu <davies@databricks.com>

Closes #7998 from davies/tungsten and squashes the following commits:

c1c16da [Davies Liu] update doc
1a47be1 [Davies Liu] use tungsten.enabled for both of codegen/unsafe
2015-08-06 17:30:31 -07:00
Yin Huai 3504bf3aa9 [SPARK-9630] [SQL] Clean up new aggregate operators (SPARK-9240 follow up)
This is the follow-up of https://github.com/apache/spark/pull/7813. It renames `HybridUnsafeAggregationIterator` to `TungstenAggregationIterator` and makes it work only with `UnsafeRow`. It also adds a `TungstenAggregate` that uses `TungstenAggregationIterator`, and makes `SortBasedAggregate` (renamed from `SortBasedAggregate`) work only with `SafeRow`.

Author: Yin Huai <yhuai@databricks.com>

Closes #7954 from yhuai/agg-followUp and squashes the following commits:

4d2f4fc [Yin Huai] Add comments and free map.
0d7ddb9 [Yin Huai] Add TungstenAggregationQueryWithControlledFallbackSuite to test fall back process.
91d69c2 [Yin Huai] Rename UnsafeHybridAggregationIterator to  TungstenAggregateIteraotr and make it only work with UnsafeRow.
2015-08-06 15:04:44 -07:00
Liang-Chi Hsieh 21fdfd7d6f [SPARK-9548][SQL] Add a destructive iterator for BytesToBytesMap
This pull request adds a destructive iterator to BytesToBytesMap. When used, the iterator frees pages as it traverses them. This is part of the effort to avoid starvation when more than one operator can exhaust memory.
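
The idea of a destructive iterator can be sketched in plain Scala. This is a minimal illustration, not the `BytesToBytesMap` code: `PagedStore`, `pageCount`, and the use of `Array[Int]` as a "page" are hypothetical names chosen for the example. The store drops its reference to each page as soon as the iterator starts reading it, so a fully consumed page can be reclaimed.

```scala
object DestructiveIteratorSketch {
  // Hypothetical page store; a "page" here is just an array of records.
  final class PagedStore(initial: Seq[Array[Int]]) {
    private val pages = scala.collection.mutable.Queue.empty[Array[Int]]
    pages ++= initial

    // Number of pages the store still holds a reference to.
    def pageCount: Int = pages.size

    // Iterating removes each page from the store when reading of it begins.
    def destructiveIterator(): Iterator[Int] = new Iterator[Int] {
      private var current: Iterator[Int] = Iterator.empty

      def hasNext: Boolean = {
        // Advance to the next non-empty page, releasing pages as we go.
        while (!current.hasNext && pages.nonEmpty) {
          current = pages.dequeue().iterator
        }
        current.hasNext
      }

      def next(): Int = {
        if (!hasNext) throw new NoSuchElementException("no more records")
        current.next()
      }
    }
  }
}
```

After the iterator is exhausted, `pageCount` is zero, which is why such an iterator is only appropriate when the caller no longer needs the map's contents.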

This is based on #7924, but fixes a bug there (Don't use destructive iterator in UnsafeKVExternalSorter).

Closes #7924.

Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #8003 from rxin/map-destructive-iterator and squashes the following commits:

6b618c3 [Reynold Xin] Don't use destructive iterator in UnsafeKVExternalSorter.
a7bd8ec [Reynold Xin] Merge remote-tracking branch 'viirya/destructive_iter' into map-destructive-iterator
7652083 [Liang-Chi Hsieh] For comments: add destructiveIterator(), modify unit test, remove code block.
4a3e9de [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into destructive_iter
581e9e3 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into destructive_iter
f0ff783 [Liang-Chi Hsieh] No need to free last page.
9e9d2a3 [Liang-Chi Hsieh] Add a destructive iterator for BytesToBytesMap.
2015-08-06 14:33:29 -07:00
Christian Kadner abfedb9cd7 [SPARK-9211] [SQL] [TEST] normalize line separators before generating MD5 hash
The golden answer file names for the existing Hive comparison tests were generated using an MD5 hash of the query text, which uses the Unix-style line separator `\n` (LF).
This PR ensures that all occurrences of the Windows-style line separator `\r\n` (CRLF) are replaced with `\n` (LF) before generating the MD5 hash, so that golden answer file names generated on Windows get an identical MD5 hash.
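
The normalization step can be sketched as follows. This is an illustration under assumptions, not the code from the PR: the object name `GoldenFileName` and the choice of UTF-8 for encoding the query text are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object GoldenFileName {
  // Replace Windows-style CRLF with LF so the same query hashes identically on every OS.
  def normalize(query: String): String = query.replace("\r\n", "\n")

  // Hex-encoded MD5 of the normalized query text (UTF-8 is an assumption of this sketch).
  def md5Hex(query: String): String = {
    val digest = MessageDigest.getInstance("MD5")
      .digest(normalize(query).getBytes(StandardCharsets.UTF_8))
    digest.map(b => f"${b & 0xff}%02x").mkString
  }
}
```

With this, `GoldenFileName.md5Hex("SELECT 1\r\nFROM src")` and `GoldenFileName.md5Hex("SELECT 1\nFROM src")` produce the same string.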

Author: Christian Kadner <ckadner@us.ibm.com>

Closes #7563 from ckadner/SPARK-9211_working and squashes the following commits:

d541db0 [Christian Kadner] [SPARK-9211][SQL] normalize line separators before MD5 hash
2015-08-06 14:15:42 -07:00
Wenchen Fan 1f62f104c7 [SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info
This re-applies #7955, which was reverted (due to a race condition) to fix the build breakage.

Author: Wenchen Fan <cloud0fan@outlook.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #8002 from rxin/InternalRow-toSeq and squashes the following commits:

332416a [Reynold Xin] Merge pull request #7955 from cloud-fan/toSeq
21665e2 [Wenchen Fan] fix hive again...
4addf29 [Wenchen Fan] fix hive
bc16c59 [Wenchen Fan] minor fix
33d802c [Wenchen Fan] pass data type info to InternalRow.toSeq
3dd033e [Wenchen Fan] move the default special getters implementation from InternalRow to BaseGenericInternalRow
2015-08-06 13:11:59 -07:00
Yin Huai cdd53b762b [SPARK-9632] [SQL] [HOT-FIX] Fix build.
seems https://github.com/apache/spark/pull/7955 breaks the build.

Author: Yin Huai <yhuai@databricks.com>

Closes #8001 from yhuai/SPARK-9632-fixBuild and squashes the following commits:

6c257dd [Yin Huai] Fix build.
2015-08-06 11:15:54 -07:00
Davies Liu 2eca46a17a Revert "[SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info"
This reverts commit 6e009cb9c4.
2015-08-06 11:15:37 -07:00
Wenchen Fan 6e009cb9c4 [SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7955 from cloud-fan/toSeq and squashes the following commits:

21665e2 [Wenchen Fan] fix hive again...
4addf29 [Wenchen Fan] fix hive
bc16c59 [Wenchen Fan] minor fix
33d802c [Wenchen Fan] pass data type info to InternalRow.toSeq
3dd033e [Wenchen Fan] move the default special getters implementation from InternalRow to BaseGenericInternalRow
2015-08-06 10:40:54 -07:00
Reynold Xin 5e1b0ef079 [SPARK-9659][SQL] Rename inSet to isin to match Pandas function.
Inspiration drawn from this blog post: https://lab.getbase.com/pandarize-spark-dataframes/

Author: Reynold Xin <rxin@databricks.com>

Closes #7977 from rxin/isin and squashes the following commits:

9b1d3d6 [Reynold Xin] Added return.
2197d37 [Reynold Xin] Fixed test case.
7c1b6cf [Reynold Xin] Import warnings.
4f4a35d [Reynold Xin] [SPARK-9659][SQL] Rename inSet to isin to match Pandas function.
2015-08-06 10:39:16 -07:00
Burak Yavuz 98e69467d4 [SPARK-9615] [SPARK-9616] [SQL] [MLLIB] Bugs related to FrequentItems when merging and with Tungsten
In short:
1- FrequentItems should not use the InternalRow representation, because the keys in the map get messed up. For example, when the elements are strings, every key in the Map corresponds to the very last element observed in the partition.

2- Merging two partitions had a bug:

**Existing behavior with size 3**
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map()

**Correct Behavior:**
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map(3 -> 1, 4 -> 22)
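
The correct result above is what a bounded frequent-items counter produces when the second partition's entry is added to the first one. The following is a sketch of that counting scheme, assuming a map capped at `size` keys; `FrequentItemsSketch`, `add`, and `merge` are hypothetical names, and this is not presented as the code in the PR.

```scala
object FrequentItemsSketch {
  // Add `count` occurrences of `key` to a counter that holds at most `size` keys.
  def add(counts: Map[Int, Long], key: Int, count: Long, size: Int): Map[Int, Long] = {
    require(size > 0, "size must be positive")
    if (counts.contains(key)) {
      counts.updated(key, counts(key) + count)
    } else if (counts.size < size) {
      counts.updated(key, count)
    } else {
      val minCount = counts.values.min
      if (count >= minCount) {
        // The new key enters; subtract minCount everywhere and drop entries that reach zero.
        counts.updated(key, count).collect {
          case (k, v) if v > minCount => k -> (v - minCount)
        }
      } else {
        // The new key is too rare to enter; it only erodes the existing counts.
        counts.map { case (k, v) => k -> (v - count) }
      }
    }
  }

  // Merge partition `b` into partition `a` one entry at a time.
  def merge(a: Map[Int, Long], b: Map[Int, Long], size: Int): Map[Int, Long] =
    b.foldLeft(a) { case (acc, (k, c)) => add(acc, k, c, size) }
}
```

Merging `Map(1 -> 3L, 2 -> 3L, 3 -> 4L)` with `Map(4 -> 25L)` at size 3 subtracts the minimum count 3 from every entry, leaving `Map(3 -> 1, 4 -> 22)`.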

cc mengxr rxin JoshRosen

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #7945 from brkyvz/freq-fix and squashes the following commits:

07fa001 [Burak Yavuz] address 2
1dc61a8 [Burak Yavuz] address 1
506753e [Burak Yavuz] fixed and added reg test
47bfd50 [Burak Yavuz] pushing
2015-08-06 10:29:40 -07:00
Cheng Lian 9f94c85ff3 [SPARK-9593] [SQL] [HOTFIX] Makes the Hadoop shims loading fix more robust
This is a follow-up of #7929.

We found that the Jenkins SBT master build still fails because of the Hadoop shims loading issue, but the failure doesn't appear to be deterministic. My suspicion is that the Hadoop `VersionInfo` class may fail to inspect the Hadoop version, so the shims loading branch is skipped.

This PR tries to make the fix more robust:

1. When the Hadoop version is available, we load `Hadoop20SShims` for versions <= 2.0.x, as srowen suggested in PR #7929.
2. Otherwise, we use `Path.getPathWithoutSchemeAndAuthority` as a probe method, which doesn't exist in Hadoop 1.x or 2.0.x. If this method is not found, `Hadoop20SShims` is also loaded.
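
The two-step decision above can be sketched with plain reflection. This is an illustration only: `ShimProbe`, `hasMethod`, and `useLegacyShims` are hypothetical helpers, and the sketch treats an unparsable version string the same as a missing one (an assumption, not something the PR states).

```scala
object ShimProbe {
  private val MajorMinor = """(\d+)\.(\d+).*""".r

  // True if the named class is loadable and has a public method with this name.
  def hasMethod(className: String, methodName: String): Boolean =
    try Class.forName(className).getMethods.exists(_.getName == methodName)
    catch { case _: ClassNotFoundException => false }

  // Step 1: decide from the version when it is known (<= 2.0.x means legacy shims).
  // Step 2: otherwise fall back to the probe; a missing probe method means legacy shims.
  def useLegacyShims(version: Option[String], probeMethodExists: => Boolean): Boolean =
    version match {
      case Some(MajorMinor(major, minor)) =>
        major.toInt < 2 || (major.toInt == 2 && minor.toInt == 0)
      case _ => !probeMethodExists
    }
}
```

A caller would pass something like `ShimProbe.hasMethod("org.apache.hadoop.fs.Path", "getPathWithoutSchemeAndAuthority")` as the probe; because the parameter is by-name, the reflection lookup only runs when the version is unavailable.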

Author: Cheng Lian <lian@databricks.com>

Closes #7994 from liancheng/spark-9593/fix-hadoop-shims and squashes the following commits:

e1d3d70 [Cheng Lian] Fixes typo in comments
8d971da [Cheng Lian] Makes the Hadoop shims loading fix more robust
2015-08-06 09:53:53 -07:00
Davies Liu 93085c992e [SPARK-9482] [SQL] Fix thread-safey issue of using UnsafeProjection in join
This PR also changes UnsafeProjection to use `def` instead of `lazy val`, because it is not thread-safe.
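
Why `def` helps here can be shown with a small stand-in. In this sketch, `BufferedProjection` and `Operator` are hypothetical classes, not Spark's; the only assumption carried over is that the projection reuses a mutable buffer, so sharing one instance between threads is unsafe.

```scala
object ProjectionSharing {
  // Hypothetical stand-in for a projection that reuses one mutable output buffer.
  final class BufferedProjection {
    private val buffer = new Array[Int](1)
    def apply(value: Int): Array[Int] = { buffer(0) = value; buffer }
  }

  final class Operator {
    // Created once; every caller (and every thread) gets the same instance and buffer.
    lazy val sharedProjection: BufferedProjection = new BufferedProjection

    // Evaluated on each call; each caller gets its own instance and buffer.
    def freshProjection: BufferedProjection = new BufferedProjection
  }
}
```

With the `lazy val`, two tasks that call the projection concurrently write into the same buffer; with the `def`, each task that obtains its own projection has a private buffer.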

TODO: clean up the debug code once the flaky test has passed 100 times.

Author: Davies Liu <davies@databricks.com>

Closes #7940 from davies/semijoin and squashes the following commits:

93baac7 [Davies Liu] fix outerjoin
5c40ded [Davies Liu] address comments
aa3de46 [Davies Liu] Merge branch 'master' of github.com:apache/spark into semijoin
7590a25 [Davies Liu] Merge branch 'master' of github.com:apache/spark into semijoin
2d4085b [Davies Liu] use def for resultProjection
0833407 [Davies Liu] Merge branch 'semijoin' of github.com:davies/spark into semijoin
e0d8c71 [Davies Liu] use lazy val
6a59e8f [Davies Liu] Update HashedRelation.scala
0fdacaf [Davies Liu] fix broadcast and thread-safety of UnsafeProjection
2fc3ef6 [Davies Liu] reproduce failure in semijoin
2015-08-06 09:12:41 -07:00
Davies Liu 5b965d64ee [SPARK-9644] [SQL] Support update DecimalType with precision > 18 in UnsafeRow
To support updating a variable-length (actually fixed-length) object, its space must be preserved even when the value is null. We can no longer call setNullAt(i) for it, because setNullAt(i) removes the offset of the preserved space; call setDecimal(i, null, precision) instead.

After this, we can do hash-based aggregation on DecimalType with precision > 18. In one test, this decreased the end-to-end run time of an aggregation query from 37 seconds (sort based) to 24 seconds (hash based).
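
A likely reason for the 18-digit boundary (an assumption; the commit message does not spell it out) is that every unscaled value with at most 18 decimal digits fits in a signed 64-bit long, while 19-digit values may not. The check below is a sketch; `DecimalStorage` and its members are hypothetical names.

```scala
object DecimalStorage {
  // Largest precision whose every unscaled value fits in a signed 64-bit long.
  val MaxLongDigits = 18

  // True if the unscaled value can be stored in a single long.
  def fitsInLong(unscaled: BigInt): Boolean = unscaled.isValidLong

  // Largest unscaled value with the given number of decimal digits, e.g. 999 for 3.
  def maxUnscaled(precision: Int): BigInt = BigInt(10).pow(precision) - 1
}
```

`fitsInLong(maxUnscaled(18))` holds, while `fitsInLong(maxUnscaled(19))` does not, so wider decimals need a separate, larger storage area in the row.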

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #7978 from davies/update_decimal and squashes the following commits:

bed8100 [Davies Liu] isSettable -> isMutable
923c9eb [Davies Liu] address comments and fix bug
385891d [Davies Liu] Merge branch 'master' of github.com:apache/spark into update_decimal
36a1872 [Davies Liu] fix tests
cd6c524 [Davies Liu] support set decimal with precision > 18
2015-08-06 09:10:57 -07:00
zhichao.li aead18ffca [SPARK-8266] [SQL] add function translate
![translate](http://www.w3resource.com/PostgreSQL/postgresql-translate-function.png)

Author: zhichao.li <zhichao.li@intel.com>

Closes #7709 from zhichao-li/translate and squashes the following commits:

9418088 [zhichao.li] refine checking condition
f2ab77a [zhichao.li] clone string
9d88f2d [zhichao.li] fix indent
6aa2962 [zhichao.li] style
e575ead [zhichao.li] add python api
9d4bab0 [zhichao.li] add special case for fodable and refactor unittest
eda7ad6 [zhichao.li] update to use TernaryExpression
cdfd4be [zhichao.li] add function translate
2015-08-06 09:02:30 -07:00
Yin Huai d5a9af3230 [SPARK-9664] [SQL] Remove UDAFRegistration and add apply to UserDefinedAggregateFunction.
https://issues.apache.org/jira/browse/SPARK-9664

Author: Yin Huai <yhuai@databricks.com>

Closes #7982 from yhuai/udafRegister and squashes the following commits:

0cc2287 [Yin Huai] Remove UDAFRegistration and add apply to UserDefinedAggregateFunction.
2015-08-05 21:50:35 -07:00
Reynold Xin 9270bd06fd [SPARK-9674][SQL] Remove GeneratedAggregate.
The new aggregate replaces the old GeneratedAggregate.

Author: Reynold Xin <rxin@databricks.com>

Closes #7983 from rxin/remove-generated-agg and squashes the following commits:

8334aae [Reynold Xin] [SPARK-9674][SQL] Remove GeneratedAggregate.
2015-08-05 21:50:14 -07:00
Cheng Hao 119b590538 [SPARK-6923] [SPARK-7550] [SQL] Persists data source relations in Hive compatible format when possible
This PR is a fork of PR #5733 authored by chenghao-intel.  Committers who merge this PR: please set the author to "Cheng Hao <hao.cheng@intel.com>".

----

When a data source relation meets the following requirements, we persist it in Hive compatible format, so that other systems like Hive can access it:

1. It's a `HadoopFsRelation`
2. It has only one input path
3. It's non-partitioned
4. Its data source provider can be naturally mapped to a Hive builtin SerDe (e.g. ORC and Parquet)
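
The four requirements amount to a simple predicate. The sketch below uses a hypothetical `RelationInfo` summary rather than Spark's relation classes, and it assumes ORC and Parquet (the two examples named above) are the only providers with a matching SerDe.

```scala
object HiveCompatibility {
  // Hypothetical summary of a data source relation; field names are illustrative.
  final case class RelationInfo(
      isHadoopFsRelation: Boolean,
      inputPaths: Seq[String],
      partitionColumns: Seq[String],
      provider: String)

  // Providers assumed to map onto a Hive builtin SerDe (only the two named in the text).
  private val hiveSerDeProviders = Set("orc", "parquet")

  // All four requirements must hold for the Hive compatible format to be used.
  def canPersistAsHiveCompatible(r: RelationInfo): Boolean =
    r.isHadoopFsRelation &&
      r.inputPaths.size == 1 &&
      r.partitionColumns.isEmpty &&
      hiveSerDeProviders.contains(r.provider.toLowerCase)
}
```

A relation that fails any one check would instead be persisted in a format that only Spark SQL can read back.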

Author: Cheng Lian <lian@databricks.com>
Author: Cheng Hao <hao.cheng@intel.com>

Closes #7967 from liancheng/spark-6923/refactoring-pr-5733 and squashes the following commits:

5175ee6 [Cheng Lian] Fixes an oudated comment
3870166 [Cheng Lian] Fixes build error and comments
864acee [Cheng Lian] Refactors PR #5733
3490cdc [Cheng Hao] update the scaladoc
6f57669 [Cheng Hao] write schema info to hivemetastore for data source
2015-08-06 11:13:44 +08:00
Yin Huai 4581badbc8 [SPARK-9611] [SQL] Fixes a few corner cases when we spill a UnsafeFixedWidthAggregationMap
This PR has the following three small fixes.

1. UnsafeKVExternalSorter does not use 0 as the initialSize to create an UnsafeInMemorySorter if its BytesToBytesMap is empty.
2. We will not spill an InMemorySorter if it is empty.
3. We will not add a SpillReader to a SpillMerger if this SpillReader is empty.

JIRA: https://issues.apache.org/jira/browse/SPARK-9611

Author: Yin Huai <yhuai@databricks.com>

Closes #7948 from yhuai/unsafeEmptyMap and squashes the following commits:

9727abe [Yin Huai] Address Josh's comments.
34b6f76 [Yin Huai] 1. UnsafeKVExternalSorter does not use 0 as the initialSize to create an UnsafeInMemorySorter if its BytesToBytesMap is empty. 2. Do not spill a InMemorySorter if it is empty. 3. Do not add spill to SpillMerger if this spill is empty.
2015-08-05 19:19:09 -07:00
Josh Rosen 9c878923db [SPARK-9054] [SQL] Rename RowOrdering to InterpretedOrdering; use newOrdering in SMJ
This patch renames `RowOrdering` to `InterpretedOrdering` and updates SortMergeJoin to use the `SparkPlan` methods for constructing its ordering so that it may benefit from codegen.

This is an updated version of #7408.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7973 from JoshRosen/SPARK-9054 and squashes the following commits:

e610655 [Josh Rosen] Add comment RE: Ascending ordering
34b8e0c [Josh Rosen] Import ordering
be19a0f [Josh Rosen] [SPARK-9054] [SQL] Rename RowOrdering to InterpretedOrdering; use newOrdering in more places.
2015-08-05 16:33:42 -07:00
Liang-Chi Hsieh e1e05873fc [SPARK-9403] [SQL] Add codegen support in In and InSet
This continues tarekauel's work in #7778.

Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7893 from viirya/codegen_in and squashes the following commits:

81ff97b [Liang-Chi Hsieh] For comments.
47761c6 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into codegen_in
cf4bf41 [Liang-Chi Hsieh] For comments.
f532b3c [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into codegen_in
446bbcd [Liang-Chi Hsieh] Fix bug.
b3d0ab4 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into codegen_in
4610eff [Liang-Chi Hsieh] Relax the types of references and update optimizer test.
224f18e [Liang-Chi Hsieh] Beef up the test cases for In and InSet to include all primitive data types.
86dc8aa [Liang-Chi Hsieh] Only convert In to InSet when the number of items in set is more than the threshold.
b7ded7e [Tarek Auel] [SPARK-9403][SQL] codeGen in / inSet
2015-08-05 11:38:56 -07:00
Yin Huai 1f8c364b9c [SPARK-9141] [SQL] [MINOR] Fix comments of PR #7920
This is a follow-up of https://github.com/apache/spark/pull/7920 to fix comments.

Author: Yin Huai <yhuai@databricks.com>

Closes #7964 from yhuai/SPARK-9141-follow-up and squashes the following commits:

4d0ee80 [Yin Huai] Fix comments.
2015-08-05 11:03:02 -07:00
Michael Armbrust 23d982204b [SPARK-9141] [SQL] Remove project collapsing from DataFrame API
Currently we collapse successive projections that are added by `withColumn`.  However, this optimization violates the constraint that adding nodes to a plan will never change its analyzed form and thus breaks caching.  Instead of doing early optimization, in this PR I just fix some low-hanging slowness in the analyzer.  In particular, I add a mechanism for skipping already analyzed subplans, `resolveOperators` and `resolveExpression`.  Since trees are generally immutable after construction, it's safe to annotate a plan as already analyzed as any transformation will create a new tree with this bit no longer set.
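
The skipping mechanism can be sketched on a toy tree. `Plan`, `Relation`, `Project`, `markAnalyzed`, and `withChildren` below are hypothetical and far simpler than Spark's `LogicalPlan`; the sketch only illustrates why an "already analyzed" bit is safe on immutable trees: any rewrite builds new nodes, and new nodes start with the bit unset.

```scala
object AnalyzedFlagSketch {
  // Hypothetical miniature plan tree; not Spark's LogicalPlan.
  sealed abstract class Plan {
    private var analyzed = false

    def children: Seq[Plan]
    def withChildren(newChildren: Seq[Plan]): Plan

    def isAnalyzed: Boolean = analyzed

    // Mark this node and everything below it as analyzed.
    def markAnalyzed(): Unit = { children.foreach(_.markAnalyzed()); analyzed = true }

    // Like a bottom-up transform, but skips subtrees already marked as analyzed.
    def resolveOperators(rule: PartialFunction[Plan, Plan]): Plan =
      if (analyzed) this
      else {
        val rewritten = withChildren(children.map(_.resolveOperators(rule)))
        rule.applyOrElse(rewritten, (p: Plan) => p)
      }
  }

  final case class Relation(name: String) extends Plan {
    def children: Seq[Plan] = Nil
    def withChildren(newChildren: Seq[Plan]): Plan = this
  }

  final case class Project(column: String, child: Plan) extends Plan {
    def children: Seq[Plan] = Seq(child)
    // copy() builds a fresh node, so the analyzed bit starts out unset again.
    def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
  }
}
```

Wrapping an analyzed plan in a new `Project` and calling `resolveOperators` visits only the new node; the analyzed subtree is returned as-is.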

Together these result in a faster analyzer than before, even with added timing instrumentation.

```
Original Code
[info] 3430ms
[info] 2205ms
[info] 1973ms
[info] 1982ms
[info] 1916ms

Without Project Collapsing in DataFrame
[info] 44610ms
[info] 45977ms
[info] 46423ms
[info] 46306ms
[info] 54723ms

With analyzer optimizations
[info] 6394ms
[info] 4630ms
[info] 4388ms
[info] 4093ms
[info] 4113ms

With resolveOperators
[info] 2495ms
[info] 1380ms
[info] 1685ms
[info] 1414ms
[info] 1240ms
```

Author: Michael Armbrust <michael@databricks.com>

Closes #7920 from marmbrus/withColumnCache and squashes the following commits:

2145031 [Michael Armbrust] fix hive udfs tests
5a5a525 [Michael Armbrust] remove wrong comment
7a507d5 [Michael Armbrust] style
b59d710 [Michael Armbrust] revert small change
1fa5949 [Michael Armbrust] move logic into LogicalPlan, add tests
0e2cb43 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into withColumnCache
c926e24 [Michael Armbrust] naming
e593a2d [Michael Armbrust] style
f5a929e [Michael Armbrust] [SPARK-9141][SQL] Remove project collapsing from DataFrame API
38b1c83 [Michael Armbrust] WIP
2015-08-05 09:01:45 -07:00
Cheng Hao 519cf6d3f7 [SPARK-9381] [SQL] Migrate JSON data source to the new partitioning data source
Support partitioning for the JSON data source.

There are still 2 open issues for the `HadoopFsRelation`:
- `refresh()` will invoke `discoveryPartition()`, which will auto-infer the data type for the partition columns and may conflict with the given partition columns. (TODO: enable `HadoopFsRelationSuite.Partition column type casting`)
- When inserting data into a cached HadoopFsRelation-based table, we need to invalidate the cache after the insertion. (TODO: enable `InsertSuite.Caching`)

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7696 from chenghao-intel/json and squashes the following commits:

d90b104 [Cheng Hao] revert the change for JacksonGenerator.apply
307111d [Cheng Hao] fix bug in the unit test
8738c8a [Cheng Hao] fix bug in unit testing
35f2cde [Cheng Hao] support partition for json format
2015-08-05 22:35:55 +08:00
Nathan Howell eb8bfa3eaa [SPARK-9618] [SQL] Use the specified schema when reading Parquet files
The user specified schema is currently ignored when loading Parquet files.

One workaround is to use the `format` and `load` methods instead of `parquet`, e.g.:

```
val schema = ???

// schema is ignored
sqlContext.read.schema(schema).parquet("hdfs:///test")

// schema is retained
sqlContext.read.schema(schema).format("parquet").load("hdfs:///test")
```

The fix is simple, but I wonder if the `parquet` method should instead be written in a similar fashion to `orc`:

```
def parquet(path: String): DataFrame = format("parquet").load(path)
```

Author: Nathan Howell <nhowell@godaddy.com>

Closes #7947 from NathanHowell/SPARK-9618 and squashes the following commits:

d1ea62c [Nathan Howell] [SPARK-9618] [SQL] Use the specified schema when reading Parquet files
2015-08-05 22:16:56 +08:00
Cheng Lian 70112ff22b [SPARK-9593] [SQL] Fixes Hadoop shims loading
This PR works around CDH Hadoop versions like 2.0.0-mr1-cdh4.1.1.

Internally, Hive `ShimLoader` tries to load different versions of Hadoop shims by checking version information gathered from Hadoop jar files.  If the major version number is 1, `Hadoop20SShims` will be loaded.  Otherwise, if the major version number is 2, `Hadoop23Shims` will be chosen.  However, CDH Hadoop versions like 2.0.0-mr1-cdh4.1.1 have 2 as the major version number but contain Hadoop 1 code.  This confuses Hive `ShimLoader`, which then loads the wrong version of the shims.

In this PR we check for existence of the `Path.getPathWithoutSchemeAndAuthority` method, which doesn't exist in Hadoop 1 (it's also the method that reveals this shims loading issue), and load `Hadoop20SShims` when it doesn't exist.

Author: Cheng Lian <lian@databricks.com>

Closes #7929 from liancheng/spark-9593/fix-hadoop-shims-loading and squashes the following commits:

c99b497 [Cheng Lian] Narrows down the fix to handle "2.0.0-*cdh4*" Hadoop versions only
b17e955 [Cheng Lian] Updates comments
490d8f2 [Cheng Lian] Fixes Scala style issue
9c6c12d [Cheng Lian] Fixes Hadoop shims loading
2015-08-05 20:03:54 +08:00
Yijie Shen 84ca3183b6 [SPARK-9628][SQL]Rename int to SQLDate, long to SQLTimestamp for better readability
JIRA: https://issues.apache.org/jira/browse/SPARK-9628

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7953 from yjshen/datetime_alias and squashes the following commits:

3cac3cc [Yijie Shen] rename int to SQLDate, long to SQLTimestamp for better readability
2015-08-05 02:04:28 -07:00
zsxwing 1b0317f64c [SPARK-8861][SPARK-8862][SQL] Add basic instrumentation to each SparkPlan operator and add a new SQL tab
This PR includes the following changes:

### SPARK-8861: Add basic instrumentation to each SparkPlan operator

A SparkPlan can override `def accumulators: Map[String, Accumulator[_]]` to expose its metrics that can be displayed in UI. The UI will use them to track the updates and show them in the web page in real-time.

### SparkSQLExecution and SQLSparkListener

`SparkSQLExecution.withNewExecutionId` will set `spark.sql.execution.id` to the local properties so that we can use it to track all jobs that belong to the same query.

SQLSparkListener is a listener that tracks all accumulator updates of all tasks for a query. It receives them from heartbeats so the UI can query them in real-time.

When running a query, `SQLSparkListener.onExecutionStart` will be called. When a query is finished,  `SQLSparkListener.onExecutionEnd` will be called. And the Spark jobs with the same execution id will be tracked and stored with this query.

`SQLSparkListener` has to store all accumulator updates for tasks separately. When a task fails and starts to retry, we need to drop the old accumulator updates. Because we can not revert our changes to an accumulator, we have to maintain these accumulator updates by ourselves so as to drop accumulator updates for a failed task.
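
The bookkeeping described above can be sketched as follows. `MetricTracker`, `Attempt`, and the method names are hypothetical, not the listener's API; the sketch assumes each heartbeat reports the latest cumulative value for one task attempt. Keeping one entry per attempt makes it possible to discard a failed attempt's contribution without touching the others.

```scala
object TaskMetricsSketch {
  // Identifies one attempt of one task (hypothetical key type).
  final case class Attempt(taskId: Long, attemptId: Int)

  final class MetricTracker {
    // Latest value reported by each task attempt, kept separately per attempt.
    private val updates = scala.collection.mutable.Map.empty[Attempt, Long]

    // A heartbeat replaces the previous value reported by the same attempt.
    def onHeartbeat(attempt: Attempt, value: Long): Unit = updates(attempt) = value

    // A failed attempt's updates are dropped; the retry reports under a new attempt id.
    def onTaskFailed(attempt: Attempt): Unit = { updates.remove(attempt); () }

    // The value shown for the query is the sum over all surviving attempts.
    def total: Long = updates.values.sum
  }
}
```

Had the updates been folded directly into a single running total, there would be no way to subtract what a failed attempt had already contributed.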

### SPARK-8862: A new SQL tab
Includes two pages:
#### A page for all DataFrame/SQL queries
It will show the running, completed and failed queries in 3 tables. It also displays the jobs and their links for a query in each row.
#### A detail page for a DataFrame/SQL query
In this page, it also shows the SparkPlan metrics in real-time. Run a long-running query, such as
```
val testData = sc.parallelize((1 to 1000000).map(i => (i, i.toString))).toDF()
testData.select($"_1").filter($"_1" < 1000).foreach(_ => Thread.sleep(60))
```
and you will see the metrics keep updating in real-time.

Author: zsxwing <zsxwing@gmail.com>

Closes #7774 from zsxwing/sql-ui and squashes the following commits:

5a2bc99 [zsxwing] Remove UISeleniumSuite and its dependency
57d4cd2 [zsxwing] Use VisibleForTesting annotation
cc1c736 [zsxwing] Add SparkPlan.trackNumOfRowsEnabled to make subclasses easy to track the number of rows; fix the issue that the "save" action cannot collect metrics
3771ab0 [zsxwing] Register SQL metrics accmulators
3a101c0 [zsxwing] Change prepareCalled's type to AtomicBoolean for thread-safety
b8d5605 [zsxwing] Make prepare idempotent; call children's prepare in SparkPlan.prepare; change doPrepare to def
4ed11a1 [zsxwing] var -> val
332639c [zsxwing] Ignore UISeleniumSuite and SQLListenerSuite."no memory leak" because of SPARK-9580
bb52359 [zsxwing] Address other commens in SQLListener
c4d0f5d [zsxwing] Move newPredicate out of the iterator loop
957473c [zsxwing] Move STATIC_RESOURCE_DIR to object SQLTab
7ab4816 [zsxwing] Make SparkPlan accumulator API private[sql]
dae195e [zsxwing] Fix the code style and comments
3a66207 [zsxwing] Ignore irrelevant accumulators
b8484a1 [zsxwing] Merge branch 'master' into sql-ui
9406592 [zsxwing] Implement the SparkPlan viz
4ebce68 [zsxwing] Add SparkPlan.prepare to support BroadcastHashJoin to run background work in parallel
ca1811f [zsxwing] Merge branch 'master' into sql-ui
fef6fc6 [zsxwing] Fix a corner case
25f335c [zsxwing] Fix the code style
6eae828 [zsxwing] SQLSparkListener -> SQLListener; SparkSQLExecutionUIData -> SQLExecutionUIData; SparkSQLExecution -> SQLExecution
822af75 [zsxwing] Add SQLSparkListenerSuite and fix the issue about onExecutionEnd and onJobEnd
6be626f [zsxwing] Add UISeleniumSuite to test UI
d02a24d [zsxwing] Make ExecutionPage private
23abf73 [zsxwing] [SPARK-8862][SPARK-8862][SQL] Add basic instrumentation to each SparkPlan operator and add a new SQL tab
2015-08-05 01:51:22 -07:00
Takeshi YAMAMURO 6d8a6e4161 [SPARK-9360] [SQL] Support BinaryType in PrefixComparators for UnsafeExternalSort
The current implementation of UnsafeExternalSort uses NoOpPrefixComparator for binary-typed data.
So, we need to add BinaryPrefixComparator in PrefixComparators.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #7676 from maropu/BinaryTypePrefixComparator and squashes the following commits:

fe6f31b [Takeshi YAMAMURO] Apply comments
d943c04 [Takeshi YAMAMURO] Add a codegen'd entry for BinaryType in SortPrefix
ecf3ac5 [Takeshi YAMAMURO] Support BinaryType in PrefixComparator
2015-08-05 00:56:35 -07:00
Emiliano Leporati 1d1a76c8c5 [SPARK-9581][SQL] Add unit test for JSON UDT
This brings #7416 by drubbo up to date.

Author: Emiliano Leporati <emiliano.leporati@gmail.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #7917 from rxin/udt-json-test and squashes the following commits:

93e3954 [Reynold Xin] Fix test.
7035308 [Reynold Xin] Merge pull request #7416 from drubbo/master
b5bcd94 [Emiliano Leporati] removed unneded case in MyDenseVector::equals
508a399 [Emiliano Leporati] Merge remote branch 'upstream/master'
7569e42 [Emiliano Leporati] using checkAnswer
62daccd [Emiliano Leporati] added coverage for UDTs in JSON RDDs
2015-08-05 00:42:08 -07:00
Davies Liu 781c8d71a0 [SPARK-9119] [SPARK-8359] [SQL] match Decimal.precision/scale with DecimalType
Let Decimal carry the correct precision and scale with DecimalType.

cc rxin yhuai

Author: Davies Liu <davies@databricks.com>

Closes #7925 from davies/decimal_scale and squashes the following commits:

e19701a [Davies Liu] some tweaks
57d78d2 [Davies Liu] fix tests
5d5bc69 [Davies Liu] match precision and scale with DecimalType
2015-08-04 23:12:49 -07:00
Pedro Rodriguez d34548587a [SPARK-8231] [SQL] Add array_contains
This PR is based on #7580, thanks to EntilZha.

PR for work on https://issues.apache.org/jira/browse/SPARK-8231

Currently, I have an initial implementation for contains. Based on the discussion on JIRA, it should behave the same as Hive: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFArrayContains.java#L102-L128

Main points are:
1. If the array is empty, null, or the value is null, return false
2. If there is a type mismatch, throw error
3. If comparison is not supported, throw error
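
Point 1 can be sketched with `Option` standing in for SQL null. `ArrayContainsSketch` is a hypothetical helper, not the expression added by the PR; points 2 and 3 are not modelled at runtime here because the shared type parameter `T` makes mismatched element and value types a compile-time concern in this sketch.

```scala
object ArrayContainsSketch {
  // None models SQL null for either argument.
  def arrayContains[T](array: Option[Seq[T]], value: Option[T]): Boolean =
    (array, value) match {
      // Both present: an empty array naturally yields false.
      case (Some(elements), Some(v)) => elements.contains(v)
      // Null array or null value: false, per point 1.
      case _ => false
    }
}
```

For example, `arrayContains(Some(Seq("a", "b")), Some("b"))` is true, while a `None` array, a `None` value, or an empty array all give false.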

Closes #7580

Author: Pedro Rodriguez <prodriguez@trulia.com>
Author: Pedro Rodriguez <ski.rodriguez@gmail.com>
Author: Davies Liu <davies@databricks.com>

Closes #7949 from davies/array_contains and squashes the following commits:

d3c08bc [Davies Liu] use foreach() to avoid copy
bc3d1fe [Davies Liu] fix array_contains
719e37d [Davies Liu] Merge branch 'master' of github.com:apache/spark into array_contains
e352cf9 [Pedro Rodriguez] fixed diff from master
4d5b0ff [Pedro Rodriguez] added docs and another type check
ffc0591 [Pedro Rodriguez] fixed unit test
7a22deb [Pedro Rodriguez] Changed test to use strings instead of long/ints which are different between python 2 an 3
b5ffae8 [Pedro Rodriguez] fixed pyspark test
4e7dce3 [Pedro Rodriguez] added more docs
3082399 [Pedro Rodriguez] fixed unit test
46f9789 [Pedro Rodriguez] reverted change
d3ca013 [Pedro Rodriguez] Fixed type checking to match hive behavior, then added tests to insure this
8528027 [Pedro Rodriguez] added more tests
686e029 [Pedro Rodriguez] fix scala style
d262e9d [Pedro Rodriguez] reworked type checking code and added more tests
2517a58 [Pedro Rodriguez] removed unused import
28b4f71 [Pedro Rodriguez] fixed bug with type conversions and re-added tests
12f8795 [Pedro Rodriguez] fix scala style checks
e8a20a9 [Pedro Rodriguez] added python df (broken atm)
65b562c [Pedro Rodriguez] made array_contains nullable false
33b45aa [Pedro Rodriguez] reordered test
9623c64 [Pedro Rodriguez] fixed test
4b4425b [Pedro Rodriguez] changed Arrays in tests to Seqs
72cb4b1 [Pedro Rodriguez] added checkInputTypes and docs
69c46fb [Pedro Rodriguez] added tests and codegen
9e0bfc4 [Pedro Rodriguez] initial attempt at implementation
2015-08-04 22:34:02 -07:00
Davies Liu 2b67fdb60b [SPARK-9513] [SQL] [PySpark] Add python API for DataFrame functions
This adds the Python API for the DataFrame functions that were introduced in 1.5.

There is an issue with serializing byte_array in Python 3, so some of the functions (for BinaryType) do not have tests.

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #7922 from davies/python_functions and squashes the following commits:

8ad942f [Davies Liu] fix test
5fb6ec3 [Davies Liu] fix bugs
3495ed3 [Davies Liu] fix issues
ea5f7bb [Davies Liu] Add python API for DataFrame functions
2015-08-04 19:25:24 -07:00
zhichao.li 6f8f0e265a [SPARK-7119] [SQL] Give script a default serde with the user specific types
This addresses an incompatible-type exception that is raised when running a query such as:
`from (from src select transform(key, value) using 'cat' as (thing1 int, thing2 string)) t select thing1 + 2;`

15/04/24 00:58:55 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: org.apache.spark.sql.types.UTF8String cannot be cast to java.lang.Integer
	at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
	at scala.math.Numeric$IntIsIntegral$.plus(Numeric.scala:57)
	at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:127)
	at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118)
	at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
	at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
	at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
	at org.apache.spark.scheduler.Task.run(Task.scala:64)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:722)

chenghao-intel marmbrus

Author: zhichao.li <zhichao.li@intel.com>

Closes #6638 from zhichao-li/transDataType2 and squashes the following commits:

a36cc7c [zhichao.li] style
b9252a8 [zhichao.li] delete cacheRow
f6968a4 [zhichao.li] give script a default serde
2015-08-04 18:26:05 -07:00
Yijie Shen a7fe48f687 [SPARK-9432][SQL] Audit expression unit tests to make sure we pass the proper numeric ranges
JIRA: https://issues.apache.org/jira/browse/SPARK-9432

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7933 from yjshen/numeric_ranges and squashes the following commits:

e719f78 [Yijie Shen] proper integral range check
2015-08-04 18:19:26 -07:00
Wenchen Fan 7c8fc1f7cb [SPARK-9598][SQL] do not expose generic getter in internal row
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7932 from cloud-fan/generic-getter and squashes the following commits:

c60de4c [Wenchen Fan] do not expose generic getter in internal row
2015-08-04 17:05:19 -07:00
Josh Rosen ab8ee1a3b9 [SPARK-9452] [SQL] Support records larger than page size in UnsafeExternalSorter
This patch extends UnsafeExternalSorter to support records larger than the page size. The basic strategy is the same as in #7762: store large records in their own overflow pages.
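
The overflow-page strategy can be illustrated with a minimal Python sketch. This is not the sorter's actual code: `place_records`, the dictionary layout, and tracking only record sizes are all assumptions made for illustration.

```python
from typing import Dict, List


def place_records(record_sizes: List[int], page_size: int) -> List[Dict]:
    """Assign records to pages; oversized records get a dedicated overflow page."""
    pages: List[Dict] = []
    current = None  # the shared page currently being filled, if any
    for size in record_sizes:
        if size > page_size:
            # Too big for a regular page: allocate an overflow page sized to fit.
            pages.append({"capacity": size, "records": [size], "overflow": True})
            continue
        if current is None or sum(current["records"]) + size > page_size:
            # The record does not fit in the current page, so start a new one.
            current = {"capacity": page_size, "records": [], "overflow": False}
            pages.append(current)
        current["records"].append(size)
    return pages
```

With a page size of 100, the sizes `[40, 50, 300, 30]` would land on a regular page, a 300-byte overflow page, and a second regular page.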

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7891 from JoshRosen/large-records-in-sql-sorter and squashes the following commits:

967580b [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter
948c344 [Josh Rosen] Add large records tests for KV sorter.
3c17288 [Josh Rosen] Combine memory and disk cleanup into general cleanupResources() method
380f217 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter
27eafa0 [Josh Rosen] Fix page size in PackedRecordPointerSuite
a49baef [Josh Rosen] Address initial round of review comments
3edb931 [Josh Rosen] Remove accidentally-committed debug statements.
2b164e2 [Josh Rosen] Support large records in UnsafeExternalSorter.
2015-08-04 14:42:11 -07:00
Wenchen Fan f4b1ac08a1 [SPARK-9553][SQL] remove the no-longer-necessary createCode and createStructCode, and replace the usage
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7890 from cloud-fan/minor and squashes the following commits:

c3b1be3 [Wenchen Fan] fix style
b0cbe2e [Wenchen Fan] remove the createCode and createStructCode, and replace the usage of them by createStructCode
2015-08-04 14:40:46 -07:00
Michael Armbrust a0cc01759b [SPARK-9606] [SQL] Ignore flaky thrift server tests
Author: Michael Armbrust <michael@databricks.com>

Closes #7939 from marmbrus/turnOffThriftTests and squashes the following commits:

80d618e [Michael Armbrust] [SPARK-9606][SQL] Ignore flaky thrift server tests
2015-08-04 12:19:52 -07:00
Michael Armbrust 34a0eb2e89 [SPARK-9512][SQL] Revert SPARK-9251, Allow evaluation while sorting
The analysis rule has a bug, and we ended up making the sorter still capable of doing evaluation, so let's revert this for now.

Author: Michael Armbrust <michael@databricks.com>

Closes #7906 from marmbrus/revertSortProjection and squashes the following commits:

2da6972 [Michael Armbrust] unrevert unrelated changes
4f2b00c [Michael Armbrust] Revert "[SPARK-9251][SQL] do not order by expressions which still need evaluation"
2015-08-04 10:07:53 -07:00
Yijie Shen b5034c9c59 [SPARK-9541] [SQL] DataTimeUtils cleanup
JIRA: https://issues.apache.org/jira/browse/SPARK-9541

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7870 from yjshen/datetime_cleanup and squashes the following commits:

9203e33 [Yijie Shen] revert getMonth & getDayOfMonth
5cad119 [Yijie Shen] rebase code
7d62a74 [Yijie Shen] remove tmp tuple inside split date
e98aaac [Yijie Shen] DataTimeUtils cleanup
2015-08-04 09:09:52 -07:00
Davies Liu 73dedb589d [SPARK-8246] [SQL] Implement get_json_object
This is based on #7485 , thanks to NathanHowell

Tests were copied from Hive, but do not seem to be super comprehensive. I've generally replicated Hive's unusual behavior rather than following a JSONPath reference, except for one case (as noted in the comments). I don't know if there is a way of fully replicating Hive's behavior without a slower TreeNode implementation, so I've erred on the side of performance instead.

Author: Davies Liu <davies@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Nathan Howell <nhowell@godaddy.com>

Closes #7901 from davies/get_json_object and squashes the following commits:

3ace9b9 [Davies Liu] Merge branch 'get_json_object' of github.com:davies/spark into get_json_object
98766fc [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object
a7dc6d0 [Davies Liu] Update JsonExpressionsSuite.scala
c818519 [Yin Huai] new results.
18ce26b [Davies Liu] fix tests
6ac29fb [Yin Huai] Golden files.
25eebef [Davies Liu] use HiveQuerySuite
e0ac6ec [Yin Huai] Golden answer files.
940c060 [Davies Liu] tweak code style
44084c5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object
9192d09 [Nathan Howell] Match Hive’s behavior for unwrapping arrays of one element
8dab647 [Nathan Howell] [SPARK-8246] [SQL] Implement get_json_object
2015-08-04 09:07:09 -07:00
Tarek Auel b1f88a38d5 [SPARK-8244] [SQL] string function: find in set
This PR is based on #7186 (just fix the conflict), thanks to tarekauel .

find_in_set(string str, string strList): int

Returns the position of the first occurrence of str in strList, where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. For example, find_in_set('ab', 'abc,b,ab,c,def') returns 3.
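
The semantics described above can be sketched in plain Python. This is only an illustration, not the Spark implementation; returning 0 when the string is not found is an assumption based on Hive's behavior and is not stated in this message.

```python
from typing import Optional


def find_in_set(needle: Optional[str], str_list: Optional[str]) -> Optional[int]:
    """1-based position of needle in a comma-delimited list."""
    if needle is None or str_list is None:
        return None  # null in, null out
    if "," in needle:
        return 0  # a needle containing a comma can never match a single item
    for position, item in enumerate(str_list.split(","), start=1):
        if item == needle:
            return position
    return 0  # assumption: "not found" is reported as 0, as in Hive
```

Note that matching is on whole items, so `'ab'` does not match the first item `'abc'`.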

Only add this to SQL, not DataFrame.

Closes #7186

Author: Tarek Auel <tarek.auel@googlemail.com>
Author: Davies Liu <davies@databricks.com>

Closes #7900 from davies/find_in_set and squashes the following commits:

4334209 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set
8f00572 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set
243ede4 [Tarek Auel] [SPARK-8244][SQL] hive compatibility
1aaf64e [Tarek Auel] [SPARK-8244][SQL] unit test fix
e4093a4 [Tarek Auel] [SPARK-8244][SQL] final modifier for COMMA_UTF8
0d05df5 [Tarek Auel] Merge branch 'master' into SPARK-8244
208d710 [Tarek Auel] [SPARK-8244] address comments & bug fix
71b2e69 [Tarek Auel] [SPARK-8244] find_in_set
66c7fda [Tarek Auel] Merge branch 'master' into SPARK-8244
61b8ca2 [Tarek Auel] [SPARK-8224] removed loop and split; use unsafe String comparison
4f75a65 [Tarek Auel] Merge branch 'master' into SPARK-8244
e3b20c8 [Tarek Auel] [SPARK-8244] added type check
1c2bbb7 [Tarek Auel] [SPARK-8244] findInSet
2015-08-04 08:59:42 -07:00
Sean Owen 76d74090d6 [SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition
Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.

I'll explain several of the changes inline in comments.

Author: Sean Owen <sowen@cloudera.com>

Closes #7862 from srowen/SPARK-9534 and squashes the following commits:

ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
2015-08-04 12:02:26 +01:00
Reynold Xin 5eb89f67e3 [SPARK-9577][SQL] Surface concrete iterator types in various sort classes.
We often return abstract iterator types in various sort-related classes (e.g. UnsafeKVExternalSorter). It is actually better to return a more concrete type, so the callsite uses that type and JIT can inline the iterator calls.

Author: Reynold Xin <rxin@databricks.com>

Closes #7911 from rxin/surface-concrete-type and squashes the following commits:

0422add [Reynold Xin] [SPARK-9577][SQL] Surface concrete iterator types in various sort classes.
2015-08-03 18:47:02 -07:00
Steve Loughran a2409d1c8e [SPARK-8064] [SQL] Build against Hive 1.2.1
Cherry picked the parts of the initial SPARK-8064 WiP branch needed to get sql/hive to compile against hive 1.2.1. That's the ASF release packaged under org.apache.hive, not any fork.

Tests not run yet: that's what the machines are for

Author: Steve Loughran <stevel@hortonworks.com>
Author: Cheng Lian <lian@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Author: Patrick Wendell <patrick@databricks.com>

Closes #7191 from steveloughran/stevel/feature/SPARK-8064-hive-1.2-002 and squashes the following commits:

7556d85 [Cheng Lian] Updates .q files and corresponding golden files
ef4af62 [Steve Loughran] Merge commit '6a92bb09f46a04d6cd8c41bdba3ecb727ebb9030' into stevel/feature/SPARK-8064-hive-1.2-002
6a92bb0 [Cheng Lian] Overrides HiveConf time vars
dcbb391 [Cheng Lian] Adds com.twitter:parquet-hadoop-bundle:1.6.0 for Hive Parquet SerDe
0bbe475 [Steve Loughran] SPARK-8064 scalastyle rejects the standard Hadoop ASF license header...
fdf759b [Steve Loughran] SPARK-8064 classpath dependency suite to be in sync with shading in final (?) hive-exec spark
7a6c727 [Steve Loughran] SPARK-8064 switch to second staging repo of the spark-hive artifacts. This one has the protobuf-shaded hive-exec jar
376c003 [Steve Loughran] SPARK-8064 purge duplicate protobuf declaration
2c74697 [Steve Loughran] SPARK-8064 switch to the protobuf shaded hive-exec jar with tests to chase it down
cc44020 [Steve Loughran] SPARK-8064 remove hadoop.version from runtest.py, as profile will fix that automatically.
6901fa9 [Steve Loughran] SPARK-8064 explicit protobuf import
da310dc [Michael Armbrust] Fixes for Hive tests.
a775a75 [Steve Loughran] SPARK-8064 cherry-pick-incomplete
7404f34 [Patrick Wendell] Add spark-hive staging repo
832c164 [Steve Loughran] SPARK-8064 try to suppress compiler warnings on Complex.java pasted-thrift-code
312c0d4 [Steve Loughran] SPARK-8064  maven/ivy dependency purge; calcite declaration needed
fa5ae7b [Steve Loughran] HIVE-8064 fix up hive-thriftserver dependencies and cut back on evicted references in the hive- packages; this keeps mvn and ivy resolution compatible, as the reconciliation policy is "by hand"
c188048 [Steve Loughran] SPARK-8064 manage the Hive dependencies so that -things that aren't needed are excluded -sql/hive built with ivy is in sync with the maven reconciliation policy, rather than latest-first
4c8be8d [Cheng Lian] WIP: Partial fix for Thrift server and CLI tests
314eb3c [Steve Loughran] SPARK-8064 deprecation warning  noise in one of the tests
17b0341 [Steve Loughran] SPARK-8064 IDE-hinted cleanups of Complex.java to reduce compiler warnings. It's all autogenerated code, so still ugly.
d029b92 [Steve Loughran] SPARK-8064 rely on unescaping to have already taken place, so go straight to map of serde options
23eca7e [Steve Loughran] HIVE-8064 handle raw and escaped property tokens
54d9b06 [Steve Loughran] SPARK-8064 fix compilation regression surfacing from rebase
0b12d5f [Steve Loughran] HIVE-8064 use subset of hive complex type whose types deserialize
fce73b6 [Steve Loughran] SPARK-8064 poms rely implicitly on the version of kryo chill provides
fd3aa5d [Steve Loughran] SPARK-8064 version of hive to d/l from ivy is 1.2.1
dc73ece [Steve Loughran] SPARK-8064 revert to master's deterministic pushdown strategy
d3c1e4a [Steve Loughran] SPARK-8064 purge UnionType
051cc21 [Steve Loughran] SPARK-8064 switch to an unshaded version of hive-exec-core, which must have been built with Kryo 2.21. This currently looks for a (locally built) version 1.2.1.spark
6684c60 [Steve Loughran] SPARK-8064 ignore RTE raised in blocking process.exitValue() call
e6121e5 [Steve Loughran] SPARK-8064 address review comments
aa43dc6 [Steve Loughran] SPARK-8064  more robust teardown on JavaMetastoreDatasourcesSuite
f2bff01 [Steve Loughran] SPARK-8064 better takeup of asynchronously caught error text
8b1ef38 [Steve Loughran] SPARK-8064: on failures executing spark-submit in HiveSparkSubmitSuite, print command line and all logged output.
5a9ce6b [Steve Loughran] SPARK-8064 add explicit reason for kv split failure, rather than array OOB. *does not address the issue*
642b63a [Steve Loughran] SPARK-8064 reinstate something cut briefly during rebasing
97194dc [Steve Loughran] SPARK-8064 add extra logging to the YarnClusterSuite classpath test. There should be no reason why this is failing on jenkins, but as it is (and presumably its CP-related), improve the logging including any exception raised.
335357f [Steve Loughran] SPARK-8064 fail fast on thrive process spawning tests on exit codes and/or error string patterns seen in log.
3ed872f [Steve Loughran] SPARK-8064 rename field double to  dbl
bca55e5 [Steve Loughran] SPARK-8064 missed one of the `date` escapes
41d6479 [Steve Loughran] SPARK-8064 wrap tests with withTable() calls to avoid table-exists exceptions
2bc29a4 [Steve Loughran] SPARK-8064 ParquetSuites to escape `date` field name
1ab9bc4 [Steve Loughran] SPARK-8064 TestHive to use sered2.thrift.test.Complex
bf3a249 [Steve Loughran] SPARK-8064: more resubmit than fix; tighten startup timeout to 60s. Still no obvious reason why jersey server code in spark-assembly isn't being picked up -it hasn't been shaded
c829b8f [Steve Loughran] SPARK-8064: reinstate yarn-rm-server dependencies to hive-exec to ensure that jersey server is on classpath on hadoop versions < 2.6
0b0f738 [Steve Loughran] SPARK-8064: thrift server startup to fail fast on any exception in the main thread
13abaf1 [Steve Loughran] SPARK-8064 Hive compatibility tests in sync with explain/show output from Hive 1.2.1
d14d5ea [Steve Loughran] SPARK-8064: DATE is now a predicate; you can't use it as a field in select ops
26eef1c [Steve Loughran] SPARK-8064: HIVE-9039 renamed TOK_UNION => TOK_UNIONALL while adding TOK_UNIONDISTINCT
3d64523 [Steve Loughran] SPARK-8064 improve diagnostics on unknown token; fix scalastyle failure
d0360f6 [Steve Loughran] SPARK-8064: delicate merge in of the branch vanzin/hive-1.1
1126e5a [Steve Loughran] SPARK-8064: name of unrecognized file format wasn't appearing in error text
8cb09c4 [Steve Loughran] SPARK-8064: test resilience/assertion improvements. Independent of the rest of the work; can be backported to earlier versions
dec12cb [Steve Loughran] SPARK-8064: when a CLI suite test fails include the full output text in the raised exception; this ensures that the stdout/stderr is included in jenkins reports, so it becomes possible to diagnose the cause.
463a670 [Steve Loughran] SPARK-8064 run-tests.py adds a hadoop-2.6 profile, and changes info messages to say "w/Hive 1.2.1" in console output
2531099 [Steve Loughran] SPARK-8064 successful attempt to get rid of pentaho as a transitive dependency of hive-exec
1d59100 [Steve Loughran] SPARK-8064 (unsuccessful) attempt to get rid of pentaho as a transitive dependency of hive-exec
75733fc [Steve Loughran] SPARK-8064 change thrift binary startup message to "Starting ThriftBinaryCLIService on port"
3ebc279 [Steve Loughran] SPARK-8064 move strings used to check for http/bin thrift services up into constants
c80979d [Steve Loughran] SPARK-8064: SparkSQLCLIDriver drops remote mode support. CLISuite Tests pass instead of timing out: undetected regression?
27e8370 [Steve Loughran] SPARK-8064 fix some style & IDE warnings
00e50d6 [Steve Loughran] SPARK-8064 stop excluding hive shims from dependency (commented out , for now)
cb4f142 [Steve Loughran] SPARK-8054 cut pentaho dependency from calcite
f7aa9cb [Steve Loughran] SPARK-8064 everything compiles with some commenting and moving of classes into a hive package
6c310b4 [Steve Loughran] SPARK-8064 subclass  Hive ServerOptionsProcessor to make it public again
f61a675 [Steve Loughran] SPARK-8064 thrift server switched to Hive 1.2.1, though it doesn't compile everywhere
4890b9d [Steve Loughran] SPARK-8064, build against Hive 1.2.1
2015-08-03 15:24:42 -07:00
Reynold Xin b2e4b85d2d Revert "[SPARK-9372] [SQL] Filter nulls in join keys"
This reverts commit 687c8c3715.
2015-08-03 14:51:15 -07:00
Andrew Or 702aa9d7fb [SPARK-8735] [SQL] Expose memory usage for shuffles, joins and aggregations
This patch exposes the memory used by internal data structures on the SparkUI. This tracks memory used by all spilling operations and SQL operators backed by Tungsten, e.g. `BroadcastHashJoin`, `ExternalSort`, `GeneratedAggregate` etc. The metric exposed is "peak execution memory", which broadly refers to the peak in-memory sizes of each of these data structures.

A separate patch will extend this by linking the new information to the SQL operators themselves.

<img width="950" alt="screen shot 2015-07-29 at 7 43 17 pm" src="https://cloud.githubusercontent.com/assets/2133137/8974776/b90fc980-362a-11e5-9e2b-842da75b1641.png">
<img width="802" alt="screen shot 2015-07-29 at 7 43 05 pm" src="https://cloud.githubusercontent.com/assets/2133137/8974777/baa76492-362a-11e5-9b77-e364a6a6b64e.png">

Author: Andrew Or <andrew@databricks.com>

Closes #7770 from andrewor14/expose-memory-metrics and squashes the following commits:

9abecb9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
f5b0d68 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
d7df332 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
8eefbc5 [Andrew Or] Fix non-failing tests
9de2a12 [Andrew Or] Fix tests due to another logical merge conflict
876bfa4 [Andrew Or] Fix failing test after logical merge conflict
361a359 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
40b4802 [Andrew Or] Fix style?
d0fef87 [Andrew Or] Fix tests?
b3b92f6 [Andrew Or] Address comments
0625d73 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
c00a197 [Andrew Or] Fix potential NPEs
10da1cd [Andrew Or] Fix compile
17f4c2d [Andrew Or] Fix compile?
a87b4d0 [Andrew Or] Fix compile?
d70874d [Andrew Or] Fix test compile + address comments
2840b7d [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
6aa2f7a [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
b889a68 [Andrew Or] Minor changes: comments, spacing, style
663a303 [Andrew Or] UnsafeShuffleWriter: update peak memory before close
d090a94 [Andrew Or] Fix style
2480d84 [Andrew Or] Expand test coverage
5f1235b [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
1ecf678 [Andrew Or] Minor changes: comments, style, unused imports
0b6926c [Andrew Or] Oops
111a05e [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
a7a39a5 [Andrew Or] Strengthen presence check for accumulator
a919eb7 [Andrew Or] Add tests for unsafe shuffle writer
23c845d [Andrew Or] Add tests for SQL operators
a757550 [Andrew Or] Address comments
b5c51c1 [Andrew Or] Re-enable test in JavaAPISuite
5107691 [Andrew Or] Add tests for internal accumulators
59231e4 [Andrew Or] Fix tests
9528d09 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
5b5e6f3 [Andrew Or] Add peak execution memory to summary table + tooltip
92b4b6b [Andrew Or] Display peak execution memory on the UI
eee5437 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
d9b9015 [Andrew Or] Track execution memory in unsafe shuffles
770ee54 [Andrew Or] Track execution memory in broadcast joins
9c605a4 [Andrew Or] Track execution memory in GeneratedAggregate
9e824f2 [Andrew Or] Add back execution memory tracking for *ExternalSort
4ef4cb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
e6c3e2f [Andrew Or] Move internal accumulators creation to Stage
a417592 [Andrew Or] Expose memory metrics in UnsafeExternalSorter
3c4f042 [Andrew Or] Track memory usage in ExternalAppendOnlyMap / ExternalSorter
bd7ab3f [Andrew Or] Add internal accumulators to TaskContext
2015-08-03 14:22:07 -07:00
Cheng Lian 703e44bff1 [SPARK-9554] [SQL] Enables in-memory partition pruning by default
Author: Cheng Lian <lian@databricks.com>

Closes #7895 from liancheng/spark-9554/enable-in-memory-partition-pruning and squashes the following commits:

67c403e [Cheng Lian] Enables in-memory partition pruning by default
2015-08-03 12:06:58 -07:00
Reynold Xin 7a9d09f0bb [SQL][minor] Simplify UnsafeRow.calculateBitSetWidthInBytes.
Author: Reynold Xin <rxin@databricks.com>

Closes #7897 from rxin/calculateBitSetWidthInBytes and squashes the following commits:

2e73b3a [Reynold Xin] [SQL][minor] Simplify UnsafeRow.calculateBitSetWidthInBytes.
2015-08-03 11:22:02 -07:00
Joseph Batchik dfe7bd168d [SPARK-9511] [SQL] Fixed Table Name Parsing
The issue was that the tokenizer was parsing "1one" into the numeric 1 using the code on line 110. I added another case to accept strings that start with a number and then have a letter somewhere else in it as well.
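
A toy Python lexer can show the effect of adding such a case. This is an illustrative sketch, not the Spark SQL lexer: the regular expressions and the names `NAIVE`, `FIXED`, and `tokenize` are assumptions.

```python
import re

# Naive ordering: a leading run of digits is always consumed as a number.
NAIVE = re.compile(r"(?P<number>[0-9]+)|(?P<identifier>[A-Za-z_][A-Za-z0-9_]*)")

# Fixed ordering: first try an identifier that may start with digits but
# must contain a letter or underscore, and only then fall back to a number.
FIXED = re.compile(r"(?P<identifier>[0-9]*[A-Za-z_][A-Za-z0-9_]*)|(?P<number>[0-9]+)")


def tokenize(pattern, text):
    """Return (kind, lexeme) pairs for every token found in text."""
    return [(m.lastgroup, m.group()) for m in pattern.finditer(text)]
```

With the naive pattern, `1one` splits into the number `1` and the identifier `one`; with the extra case it stays a single identifier, while a plain `42` is still read as a number.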

Author: Joseph Batchik <joseph.batchik@cloudera.com>

Closes #7844 from JDrit/parse_error and squashes the following commits:

b8ca12f [Joseph Batchik] fixed parsing issue by adding another case
2015-08-03 11:17:38 -07:00
Reynold Xin 8be198c869 Two minor comments from code review on 191bf2689.
2015-08-03 04:26:18 -07:00
Davies Liu 191bf2689d [SPARK-9518] [SQL] cleanup generated UnsafeRowJoiner and fix bug
Currently, when copying the bitsets, we don't consider that row1 may not sit at the beginning of the byte array.

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #7892 from davies/clean_join and squashes the following commits:

14cce9e [Davies Liu] cleanup generated UnsafeRowJoiner and fix bug
2015-08-03 04:23:26 -07:00
Wenchen Fan 137f47865d [SPARK-9551][SQL] add a cheap version of copy for UnsafeRow to reuse a copy buffer
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7885 from cloud-fan/cheap-copy and squashes the following commits:

0900ca1 [Wenchen Fan] replace == with ===
73f4ada [Wenchen Fan] add tests
07b865a [Wenchen Fan] add a cheap version of copy
2015-08-03 04:21:15 -07:00
Yin Huai 1ebd41b141 [SPARK-9240] [SQL] Hybrid aggregate operator using unsafe row
This PR adds a base aggregation iterator `AggregationIterator`, which is used to create `SortBasedAggregationIterator` (for sort-based aggregation) and `UnsafeHybridAggregationIterator` (which first tries hash-based aggregation and falls back to sort-based aggregation, using an external sorter, if we cannot allocate memory for the map). With these two iterators, we no longer need the existing iterators, so I am removing them. Also, we can use a single physical `Aggregate` operator that internally determines which iterator to use.
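
The hash-first, sort-fallback idea can be sketched in Python as follows. This is a simplified illustration, not the iterator from this PR: `hybrid_sum`, `sort_based_sum`, and the `max_groups` limit (standing in for "cannot allocate memory for the map") are assumptions, and the sketch restarts from the full input on fallback rather than continuing from a partially built map.

```python
from itertools import groupby
from operator import itemgetter


def sort_based_sum(pairs):
    """Sort by key, then sum the values of each run of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {key: sum(value for _, value in group)
            for key, group in groupby(ordered, key=itemgetter(0))}


def hybrid_sum(pairs, max_groups):
    """Try hash aggregation; fall back to sorting if the map grows too large."""
    pairs = list(pairs)
    table = {}
    for key, value in pairs:
        if key not in table and len(table) >= max_groups:
            # Hypothetical memory limit reached: switch strategies.
            return sort_based_sum(pairs), "sort"
        table[key] = table.get(key, 0) + value
    return table, "hash"
```

Both paths produce the same sums; only the strategy label in the returned tuple differs.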

https://issues.apache.org/jira/browse/SPARK-9240

Author: Yin Huai <yhuai@databricks.com>

Closes #7813 from yhuai/AggregateOperator and squashes the following commits:

e317e2b [Yin Huai] Remove unnecessary change.
74d93c5 [Yin Huai] Merge remote-tracking branch 'upstream/master' into AggregateOperator
ba6afbc [Yin Huai] Add a little bit more comments.
c9cf3b6 [Yin Huai] update
0f1b06f [Yin Huai] Remove unnecessary code.
21fd15f [Yin Huai] Remove unnecessary change.
964f88b [Yin Huai] Implement fallback strategy.
b1ea5cf [Yin Huai] wip
7fcbd87 [Yin Huai] Add a flag to control what iterator to use.
533d5b2 [Yin Huai] Prepare for fallback!
33b7022 [Yin Huai] wip
bd9282b [Yin Huai] UDAFs now supports UnsafeRow.
f52ee53 [Yin Huai] wip
3171f44 [Yin Huai] wip
d2c45a0 [Yin Huai] wip
f60cc83 [Yin Huai] Also check input schema.
af32210 [Yin Huai] Check iter.hasNext before we create an iterator because the constructor of the iterator will read at least one row from a non-empty input iter.
299008c [Yin Huai] First round cleanup.
3915bac [Yin Huai] Create a base iterator class for aggregation iterators and add the initial version of the hybrid iterator.
2015-08-03 00:23:08 -07:00
Yijie Shen 98d6d9c7a9 [SPARK-9549][SQL] fix bugs in expressions
JIRA: https://issues.apache.org/jira/browse/SPARK-9549

This PR fixes the following bugs:
1.  `UnaryMinus`'s codegen version would fail to compile when the input is `Long.MinValue`.
2.  `BinaryComparison` would fail to compile in codegen mode when comparing Boolean types.
3.  `AddMonth` would fail if passed a huge negative month, which would lead to accessing a negative index of the `monthDays` array.
4.  `Nanvl` with operands of different types (type coercion).

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7882 from yjshen/minor_bug_fix and squashes the following commits:

41bbd2c [Yijie Shen] fix bug in Nanvl type coercion
3dee204 [Yijie Shen] address comments
4fa5de0 [Yijie Shen] fix bugs in expressions
2015-08-03 00:15:24 -07:00
Wenchen Fan 608353c8e8 [SPARK-9404][SPARK-9542][SQL] unsafe array data and map data
This PR adds `UnsafeArrayData`. Currently we encode it in this way:

The first 4 bytes hold the number of elements.
Then each 4-byte slot holds the start offset of an element, unless it is negative, in which case the element is null.
The elements themselves follow.

an example:  [10, 11, 12, 13, null, 14] will be encoded as:
6, 28, 32, 36, 40, -44, 44, 10, 11, 12, 13, 14

Note that when we read an UnsafeArrayData from bytes, we can read the first 4 bytes as numElements and take the rest (with the first 4 bytes skipped) as the value region.
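
The layout can be reproduced with a short Python sketch using the `struct` module. This is an illustration only, not the Spark code: it assumes 4-byte little-endian integer elements, and `encode_int_array` and `decode_int_array` are hypothetical names. The sketch writes the element count as 6, the actual length of the example array, which is also what makes the first offset come out as 28 (4 + 6 × 4).

```python
import struct


def encode_int_array(values):
    """Encode optional ints as: count, one offset per element, then the elements."""
    count = len(values)
    cursor = 4 + 4 * count  # first byte after the count and the offset slots
    offsets, payload = [], []
    for value in values:
        if value is None:
            offsets.append(-cursor)  # a negative offset marks a null element
        else:
            offsets.append(cursor)
            payload.append(value)
            cursor += 4
    words = [count] + offsets + payload
    return struct.pack("<%di" % len(words), *words)


def decode_int_array(buf):
    """Inverse of encode_int_array."""
    (count,) = struct.unpack_from("<i", buf, 0)
    offsets = struct.unpack_from("<%di" % count, buf, 4)
    return [None if off < 0 else struct.unpack_from("<i", buf, off)[0]
            for off in offsets]
```

Unpacking `encode_int_array([10, 11, 12, 13, None, 14])` as twelve 4-byte integers gives the count followed by the offsets and values listed in the example.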

Unsafe map data just uses 2 unsafe array data: the first 4 bytes are the number of elements, the second 4 bytes are the numBytes of the key array, followed by the key array data and the value array data.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7752 from cloud-fan/unsafe-array and squashes the following commits:

3269bd7 [Wenchen Fan] fix a bug
6445289 [Wenchen Fan] add unit tests
49adf26 [Wenchen Fan] add unsafe map
20d1039 [Wenchen Fan] add comments and unsafe converter
821b8db [Wenchen Fan] add unsafe array
2015-08-02 23:41:16 -07:00
Yin Huai 687c8c3715 [SPARK-9372] [SQL] Filter nulls in join keys
This PR adds an optimization rule, `FilterNullsInJoinKey`, to add `Filter` before join operators to filter out rows having null values for join keys.
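
Why this filter is safe for equi-joins can be shown with a small Python sketch. It is not the optimizer rule itself: `equi_join` and `filter_null_keys` are hypothetical helpers, rows are modeled as dictionaries, and `None` stands in for SQL null, which never compares equal in a join condition.

```python
def equi_join(left, right, key):
    """Inner equi-join with SQL semantics: a null key never matches anything."""
    return [(l, r) for l in left for r in right
            if l[key] is not None and l[key] == r[key]]


def filter_null_keys(rows, key):
    """Drop rows whose join key is null before they reach the join."""
    return [row for row in rows if row[key] is not None]
```

Joining the pre-filtered inputs returns the same pairs as joining the raw inputs, while fewer rows reach the join.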

This optimization is guarded by a new SQL conf, `spark.sql.advancedOptimization`.

The code in this PR was authored by yhuai; I'm opening this PR to factor out this change from #7685, a larger pull request which contains two other optimizations.

Author: Yin Huai <yhuai@databricks.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #7768 from JoshRosen/filter-nulls-in-join-key and squashes the following commits:

c02fc3f [Yin Huai] Address Josh's comments.
0a8e096 [Yin Huai] Update comments.
ea7d5a6 [Yin Huai] Make sure we do not keep adding filters.
be88760 [Yin Huai] Make it clear that FilterNullsInJoinKeySuite.scala is used to test FilterNullsInJoinKey.
8bb39ad [Yin Huai] Fix non-deterministic tests.
303236b [Josh Rosen] Revert changes that are unrelated to null join key filtering
40eeece [Josh Rosen] Merge remote-tracking branch 'origin/master' into filter-nulls-in-join-key
c57a954 [Yin Huai] Bug fix.
d3d2e64 [Yin Huai] First round of cleanup.
f9516b0 [Yin Huai] Style
c6667e7 [Yin Huai] Add PartitioningCollection.
e616d3b [Yin Huai] wip
7c2d2d8 [Yin Huai] Bug fix and refactoring.
69bb072 [Yin Huai] Introduce NullSafeHashPartitioning and NullUnsafePartitioning.
d5b84c3 [Yin Huai] Do not add unnecessary filters.
2201129 [Yin Huai] Filter out rows that will not be joined in equal joins early.
2015-08-02 23:32:09 -07:00
Yin Huai 114ff926fc [SPARK-2205] [SQL] Avoid unnecessary exchange operators in multi-way joins
This PR adds `PartitioningCollection`, which is used to represent the `outputPartitioning` for SparkPlans with multiple children (e.g. `ShuffledHashJoin`). So, a `SparkPlan` can have multiple descriptions of its partitioning schemes. Taking `ShuffledHashJoin` as an example, it has two descriptions of its partitioning schemes, i.e. `left.outputPartitioning` and `right.outputPartitioning`. So a query like `select * from t1 join t2 on (t1.x = t2.x) join t3 on (t2.x = t3.x)` will only have three Exchange operators (when shuffled joins are needed) instead of four.

The code in this PR was authored by yhuai; I'm opening this PR to factor out this change from #7685, a larger pull request which contains two other optimizations.

Author: Yin Huai <yhuai@databricks.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #7773 from JoshRosen/multi-way-join-planning-improvements and squashes the following commits:

5c45924 [Josh Rosen] Merge remote-tracking branch 'origin/master' into multi-way-join-planning-improvements
cd8269b [Josh Rosen] Refactor test to use SQLTestUtils
2963857 [Yin Huai] Revert unnecessary SqlConf change.
73913f7 [Yin Huai] Add comments and test. Also, revert the change in ShuffledHashOuterJoin for now.
4a99204 [Josh Rosen] Delete unrelated expression change
884ab95 [Josh Rosen] Carve out only SPARK-2205 changes.
247e5fa [Josh Rosen] Merge remote-tracking branch 'origin/master' into multi-way-join-planning-improvements
c57a954 [Yin Huai] Bug fix.
d3d2e64 [Yin Huai] First round of cleanup.
f9516b0 [Yin Huai] Style
c6667e7 [Yin Huai] Add PartitioningCollection.
e616d3b [Yin Huai] wip
7c2d2d8 [Yin Huai] Bug fix and refactoring.
69bb072 [Yin Huai] Introduce NullSafeHashPartitioning and NullUnsafePartitioning.
d5b84c3 [Yin Huai] Do not add unnecessary filters.
2201129 [Yin Huai] Filter out rows that will not be joined in equal joins early.
2015-08-02 20:44:23 -07:00
Reynold Xin 30e89111d6 [SPARK-9546][SQL] Centralize orderable data type checking.
This pull request creates two isOrderable functions in RowOrdering that can be used to check whether a data type or a sequence of expressions can be used in sorting.

Author: Reynold Xin <rxin@databricks.com>

Closes #7880 from rxin/SPARK-9546 and squashes the following commits:

f9e322d [Reynold Xin] Fixed tests.
0439b43 [Reynold Xin] [SPARK-9546][SQL] Centralize orderable data type checking.
2015-08-02 20:12:03 -07:00
Reynold Xin 9d03ad910b [SPARK-9543][SQL] Add randomized testing for UnsafeKVExternalSorter.
The detailed approach is documented in UnsafeKVExternalSorterSuite.testKVSorter(), working as follows:

1. Create input by generating data randomly based on the given key/value schema (which is also randomly drawn from a list of candidate types)
2. Run UnsafeKVExternalSorter on the generated data
3. Collect the output from the sorter, and make sure the keys are sorted in ascending order
4. Sort the input by both key and value, and sort the sorter output also by both key and value. Compare the sorted input and sorted output together to make sure all the key/values match.
5. Check memory allocation to make sure there is no memory leak.

There is also a spill flag. When set to true, the sorter will spill probabilistically roughly every 100 records.
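
The checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `sort_with_spills` is a hypothetical stand-in for the sorter under test (it is not UnsafeKVExternalSorter), integer pairs stand in for the randomly drawn key/value schema, and the memory-leak check in step 5 is not modeled.

```python
import heapq
import random


def random_records(rng, n):
    """Step 1: generate random key/value pairs (ints stand in for a random schema)."""
    return [(rng.randint(-50, 50), rng.randint(-50, 50)) for _ in range(n)]


def sort_with_spills(records, rng, spill=False):
    """Hypothetical stand-in for the sorter under test.

    Buffers records in memory. When `spill` is set, it flushes a sorted run
    with probability 1/100 per record, then merges all runs by key.
    """
    runs, buf = [], []
    for rec in records:
        buf.append(rec)
        if spill and rng.random() < 0.01:
            runs.append(sorted(buf, key=lambda kv: kv[0]))
            buf = []
    runs.append(sorted(buf, key=lambda kv: kv[0]))
    return list(heapq.merge(*runs, key=lambda kv: kv[0]))


def check_sorter(seed, n=1000, spill=False):
    rng = random.Random(seed)
    data = random_records(rng, n)
    out = sort_with_spills(data, rng, spill)  # step 2
    # Step 3: keys must come out in ascending order.
    assert all(a[0] <= b[0] for a, b in zip(out, out[1:]))
    # Step 4: sorting both sides by (key, value) must give identical lists.
    assert sorted(data) == sorted(out)
    return True
```

Fixing the seed keeps a failing run reproducible, which matters more for randomized tests than for hand-written cases.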

Author: Reynold Xin <rxin@databricks.com>

Closes #7873 from rxin/kvsorter-randomized-test and squashes the following commits:

a08c251 [Reynold Xin] Resource cleanup.
0488b5c [Reynold Xin] [SPARK-9543][SQL] Add randomized testing for UnsafeKVExternalSorter.
2015-08-02 17:54:30 -07:00
Liang-Chi Hsieh 0722f43316 [SPARK-7937][SQL] Support comparison on StructType
This brings #6519 up-to-date with master branch.

Closes #6519.

Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #7877 from rxin/sort-struct and squashes the following commits:

4968231 [Reynold Xin] Minor fixes.
2537813 [Reynold Xin] Merge branch 'compare_named_struct' of github.com:viirya/spark-1 into sort-struct
d2ba8ad [Liang-Chi Hsieh] Remove unused import.
3a3f40e [Liang-Chi Hsieh] Don't need to add compare to InternalRow because we can use RowOrdering.
dae6aad [Liang-Chi Hsieh] Fix nested struct.
d5349c7 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
43d4354 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
1f66196 [Liang-Chi Hsieh] Reuse RowOrdering and GenerateOrdering.
f8b2e9c [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
1187a65 [Liang-Chi Hsieh] Fix scala style.
9d67f68 [Liang-Chi Hsieh] Fix wrongly merging.
8f4d775 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
94b27d5 [Liang-Chi Hsieh] Remove test for error on complex type comparison.
2071693 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
3c142e4 [Liang-Chi Hsieh] Fix scala style.
cf58dc3 [Liang-Chi Hsieh] Use checkAnswer.
f651b8d [Liang-Chi Hsieh] Remove Either and move orderings to BinaryComparison to reuse it.
b6e1009 [Liang-Chi Hsieh] Fix scala style.
3922b54 [Liang-Chi Hsieh] Support ordering on named_struct.
2015-08-02 17:53:44 -07:00
Reynold Xin 2e981b7bfa [SPARK-9531] [SQL] UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter
This pull request adds a destructAndCreateExternalSorter method to UnsafeFixedWidthAggregationMap. The new method does the following:

1. Creates a new external sorter UnsafeKVExternalSorter
2. Adds all the data into an in-memory sorter, sorts them
3. Spills the sorted in-memory data to disk

This method can be used to fall back to sort-based aggregation when under memory pressure.
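
As a rough illustration of the three steps, here is a Python sketch. Everything in it is hypothetical: a plain dict stands in for the aggregation map, `pickle` stands in for the on-disk record format, and `destruct_and_spill` / `read_spill` are made-up names rather than anything in the patch.

```python
import os
import pickle
import tempfile


def destruct_and_spill(agg_map):
    """Drain a dict of aggregation buffers into a sorted spill file."""
    # Step 2: move every entry into an in-memory list and sort it by key.
    records = sorted(agg_map.items())
    # The map is destroyed here, so its memory can be released.
    agg_map.clear()
    # Step 3: spill the sorted run to disk.
    fd, path = tempfile.mkstemp(suffix=".spill")
    with os.fdopen(fd, "wb") as f:
        for rec in records:
            pickle.dump(rec, f)
    return path


def read_spill(path):
    """Read back the records of one spill file in their stored order."""
    out = []
    with open(path, "rb") as f:
        while True:
            try:
                out.append(pickle.load(f))
            except EOFError:
                break
    return out
```

Because the run on disk is already sorted by key, a later sort-based aggregation only has to merge runs instead of re-sorting them.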

The pull request also includes accounting fixes from JoshRosen.

TODOs (that can be done in follow-up PRs)
- [x] Address Josh's feedback from #7849
- [x] More documentation and test cases
- [x] Make sure we are doing memory accounting correctly with test cases (e.g. did we release the memory in BytesToBytesMap twice?)
- [ ] Look harder at possible memory leaks and exception handling
- [ ] Randomized tester for the KV sorter as well as the aggregation map

Author: Reynold Xin <rxin@databricks.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #7860 from rxin/kvsorter and squashes the following commits:

986a58c [Reynold Xin] Bug fix.
599317c [Reynold Xin] Style fix and slightly more compact code.
fe7bd4e [Reynold Xin] Bug fixes.
fd71bef [Reynold Xin] Merge remote-tracking branch 'josh/large-records-in-sql-sorter' into kvsorter-with-josh-fix
3efae38 [Reynold Xin] More fixes and documentation.
45f1b09 [Josh Rosen] Ensure that spill files are cleaned up
f6a9bd3 [Reynold Xin] Josh feedback.
9be8139 [Reynold Xin] Remove testSpillFrequency.
7cbe759 [Reynold Xin] [SPARK-9531][SQL] UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter.
ae4a8af [Josh Rosen] Detect leaked unsafe memory in UnsafeExternalSorterSuite.
52f9b06 [Josh Rosen] Detect ShuffleMemoryManager leaks in UnsafeExternalSorter.
2015-08-02 12:32:14 -07:00
Reynold Xin 8eafa2aeb6 [SPARK-9208][SQL] Sort DataFrame functions alphabetically.
Author: Reynold Xin <rxin@databricks.com>

Closes #7861 from rxin/api-audit and squashes the following commits:

7200256 [Reynold Xin] [SPARK-9208][SQL] Sort DataFrame functions alphabetically.
2015-08-02 11:36:11 -07:00
Davies Liu 16b928c543 [SPARK-9529] [SQL] improve TungstenSort on DecimalType
Generate prefix for DecimalType, fix the random generator of decimal

cc JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #7857 from davies/sort_decimal and squashes the following commits:

2433959 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sort_decimal
de24253 [Davies Liu] fix style
0a54c1a [Davies Liu] sort decimal
2015-08-01 23:36:06 -07:00
Davies Liu 57084e0c7c [SPARK-9459] [SQL] use generated FromUnsafeProjection to do deep copy for UTF8String and struct
Accessing a column in an UnsafeRow should avoid copying, so a deep copy is needed when the UnsafeRow is turned into a generic Row. This PR adds a generated FromUnsafeProjection to do that.

This PR also fixes the expressions that cache a UTF8String, which should copy it as well.

Author: Davies Liu <davies@databricks.com>

Closes #7840 from davies/avoid_copy and squashes the following commits:

230c8a1 [Davies Liu] address comment
fd797c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into avoid_copy
e095dd0 [Davies Liu] rollback rename
8ef5b0b [Davies Liu] copy String in Columnar
81360b8 [Davies Liu] fix class name
9aecb88 [Davies Liu] use FromUnsafeProjection to do deep copy for UTF8String and struct
2015-08-01 21:50:42 -07:00
Davies Liu c1b0cbd762 [SPARK-8185] [SPARK-8188] [SPARK-8191] [SQL] function datediff, to_utc_timestamp, from_utc_timestamp
This PR is based on #7643 , thanks to adrian-wang

Author: Davies Liu <davies@databricks.com>
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #7847 from davies/datediff and squashes the following commits:

74333d7 [Davies Liu] fix bug
22d8a8c [Davies Liu] optimize
85cdd21 [Davies Liu] remove unnecessary tests
241d90c [Davies Liu] Merge branch 'master' of github.com:apache/spark into datediff
e9dc0f5 [Davies Liu] fix datediff/to_utc_timestamp/from_utc_timestamp
c360447 [Daoyuan Wang] function datediff, to_utc_timestamp, from_utc_timestamp (commits merged)
2015-08-01 21:46:46 -07:00
HuJiayin 00cd92f32f [SPARK-8269] [SQL] string function: initcap
This PR is based on #7208 , thanks to HuJiayin

Closes #7208

Author: HuJiayin <jiayin.hu@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7850 from davies/initcap and squashes the following commits:

54472e9 [Davies Liu] fix python test
17ffe51 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap
ca46390 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap
3a906e4 [Davies Liu] implement title case in UTF8String
8b2506a [HuJiayin] Update functions.py
2cd43e5 [HuJiayin] fix python style check
b616c0e [HuJiayin] add python api
1f5a0ef [HuJiayin] add codegen
7e0c604 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark into initcap
6a0b958 [HuJiayin] add column
c79482d [HuJiayin] support soundex
7ce416b [HuJiayin] support initcap rebase code
2015-08-01 21:44:57 -07:00
Davies Liu 5d9e33d9a2 [SPARK-9495] prefix of DateType/TimestampType
cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #7856 from davies/sort_improve and squashes the following commits:

5fc81bd [Davies Liu] support DateType/TimestampType
2015-08-01 18:22:46 -07:00
Reynold Xin 3d1535d488 [SPARK-9520] [SQL] Support in-place sort in UnsafeFixedWidthAggregationMap
This pull request adds a sortedIterator method to UnsafeFixedWidthAggregationMap that sorts its data in-place by the grouping key.

This is needed so we can fall back to external sorting for aggregation.

Author: Reynold Xin <rxin@databricks.com>

Closes #7849 from rxin/bytes2bytes-sorting and squashes the following commits:

75018c6 [Reynold Xin] Updated documentation.
81a8694 [Reynold Xin] [SPARK-9520][SQL] Support in-place sort in UnsafeFixedWidthAggregationMap.
2015-08-01 13:20:26 -07:00
zhichao.li c5166f7a69 [SPARK-8263] [SQL] substr/substring should also support binary type
This is based on #7641, thanks to zhichao-li

Closes #7641

Author: zhichao.li <zhichao.li@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7848 from davies/substr and squashes the following commits:

461b709 [Davies Liu] remove bytearry from tests
b45377a [Davies Liu] Merge branch 'master' of github.com:apache/spark into substr
01d795e [zhichao.li] scala style
99aa130 [zhichao.li] add substring to dataframe
4f68bfe [zhichao.li] add binary type support for substring
2015-08-01 08:48:46 -07:00
Cheng Hao cf6c9ca32a [SPARK-8232] [SQL] Add sort_array support
This PR is based on #7581 and only fixes the merge conflict.

Author: Cheng Hao <hao.cheng@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7851 from davies/sort_array and squashes the following commits:

a80ef66 [Davies Liu] fix conflict
7cfda65 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sort_array
664c960 [Cheng Hao] update the sort_array by using the ArrayData
276d2d5 [Cheng Hao] add empty line
0edab9c [Cheng Hao] Add ascending/descending support for sort_array
80fc0f8 [Cheng Hao] Add type checking
a42b678 [Cheng Hao] Add sort_array support
2015-08-01 08:32:29 -07:00
Davies Liu 60ea7ab4bb Revert "[SPARK-8232] [SQL] Add sort_array support"
This reverts commit 67ad4e21fc.
2015-08-01 00:41:15 -07:00
Wenchen Fan 1d59a4162b [SPARK-9480][SQL] add MapData and cleanup internal row stuff
This PR adds a `MapData` as internal representation of map type in Spark SQL, and provides a default implementation with just 2 `ArrayData`.

After that, we have specialized getters for all internal types, so I removed the generic getter in `ArrayData` and added a specialized `toArray` for it.
This PR also does some refactoring and cleanup of `InternalRow` and its subclasses.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7799 from cloud-fan/map-data and squashes the following commits:

77d482f [Wenchen Fan] fix python
e8f6682 [Wenchen Fan] skip MapData equality check in HiveInspectorSuite
40cc9db [Wenchen Fan] add toString
6e06ec9 [Wenchen Fan] some more cleanup
a90aca1 [Wenchen Fan] add MapData
2015-08-01 00:17:15 -07:00
Reynold Xin d90f2cf7a2 [SPARK-9517][SQL] BytesToBytesMap should encode data the same way as UnsafeExternalSorter
BytesToBytesMap currently encodes key/value data in the following format:
```
8B key length, key data, 8B value length, value data
```

UnsafeExternalSorter, on the other hand, encodes data this way:
```
4B record length, data
```

As a result, we cannot pass records encoded by BytesToBytesMap directly into UnsafeExternalSorter for sorting. However, if we rearrange data slightly, we can then pass the key/value records directly into UnsafeExternalSorter:
```
4B key+value length, 4B key length, key data, value data
```
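
A small Python sketch of the rearranged layout may make the compatibility argument concrete. Two details are assumptions that the description above does not settle: the leading length is taken to count everything that follows it (the 4-byte key length plus the key and value bytes), and the integers are written little-endian. The function names are hypothetical.

```python
import struct


def encode_kv(key: bytes, value: bytes) -> bytes:
    """Pack one record as: 4B record length, 4B key length, key, value."""
    # Assumption: the record length covers the key-length field, key, and value.
    record_len = 4 + len(key) + len(value)
    return struct.pack("<ii", record_len, len(key)) + key + value


def decode_kv(buf: bytes):
    """Recover (key, value) from a record produced by encode_kv."""
    record_len, key_len = struct.unpack_from("<ii", buf, 0)
    key = buf[8:8 + key_len]
    value = buf[8 + key_len:4 + record_len]
    return key, value


def as_sorter_record(buf: bytes) -> bytes:
    """Read the same bytes as `4B record length, data` and return the data part."""
    (record_len,) = struct.unpack_from("<i", buf, 0)
    return buf[4:4 + record_len]
```

A reader that only understands the sorter format sees one opaque record, while a key/value-aware reader can still split it using the embedded key length.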

Author: Reynold Xin <rxin@databricks.com>

Closes #7845 from rxin/kvsort-rebase and squashes the following commits:

5716b59 [Reynold Xin] Fixed test.
2e62ccb [Reynold Xin] Updated BytesToBytesMap's data encoding to put the key first.
a51b641 [Reynold Xin] Added a KV sorter interface.
2015-07-31 23:55:16 -07:00
Cheng Hao 67ad4e21fc [SPARK-8232] [SQL] Add sort_array support
Add expression `sort_array` support.

Author: Cheng Hao <hao.cheng@intel.com>

This patch had conflicts when merged, resolved by
Committer: Davies Liu <davies.liu@gmail.com>

Closes #7581 from chenghao-intel/sort_array and squashes the following commits:

664c960 [Cheng Hao] update the sort_array by using the ArrayData
276d2d5 [Cheng Hao] add empty line
0edab9c [Cheng Hao] Add ascending/descending support for sort_array
80fc0f8 [Cheng Hao] Add type checking
a42b678 [Cheng Hao] Add sort_array support
2015-07-31 23:11:22 -07:00
Liang-Chi Hsieh 3320b0ba26 [SPARK-9415][SQL] Throw AnalysisException when using MapType on Join and Aggregate
JIRA: https://issues.apache.org/jira/browse/SPARK-9415

Follow-up to #7787. MapType should not be used as a grouping key or a join key either.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7819 from viirya/map_join_groupby and squashes the following commits:

005ee0c [Liang-Chi Hsieh] For comments.
7463398 [Liang-Chi Hsieh] MapType can't be used as join keys, grouping keys.
2015-07-31 22:26:30 -07:00
zhichao.li 6996bd2e81 [SPARK-8264][SQL]add substring_index function
This PR is based on #7533 , thanks to zhichao-li

Closes #7533

Author: zhichao.li <zhichao.li@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7843 from davies/str_index and squashes the following commits:

391347b [Davies Liu] add python api
3ce7802 [Davies Liu] fix substringIndex
f2d29a1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into str_index
515519b [zhichao.li] add foldable and remove null checking
9546991 [zhichao.li] scala style
67c253a [zhichao.li] hide some apis and clean code
b19b013 [zhichao.li] add codegen and clean code
ac863e9 [zhichao.li] reduce the calling of numChars
12e108f [zhichao.li] refine unittest
d92951b [zhichao.li] add lastIndexOf
52d7b03 [zhichao.li] add substring_index function
2015-07-31 21:18:01 -07:00
Reynold Xin 03377d2522 [SPARK-9358][SQL] Code generation for UnsafeRow joiner.
This patch creates a code generated unsafe row concatenator that can be used to concatenate/join two UnsafeRows into a single UnsafeRow.

Since this kind of low-level code is inherently hard to test, the test suites rely heavily on randomized testing to check correctness.

Author: Reynold Xin <rxin@databricks.com>

Closes #7821 from rxin/rowconcat and squashes the following commits:

8717f35 [Reynold Xin] Rebase and code review.
72c5d8e [Reynold Xin] Fixed a bug.
a84ed2e [Reynold Xin] Fixed offset.
40c3fb2 [Reynold Xin] Reset random data generator.
f0913aa [Reynold Xin] Test fixes.
6687b6f [Reynold Xin] Updated documentation.
00354b9 [Reynold Xin] Support concat data as well.
e9a4347 [Reynold Xin] Updated.
6269f96 [Reynold Xin] Fixed a bug .
0f89716 [Reynold Xin] [SPARK-9358][SQL][WIP] Code generation for UnsafeRow concat.
2015-07-31 21:09:00 -07:00
Josh Rosen 8cb415a4b9 [SPARK-9451] [SQL] Support entries larger than default page size in BytesToBytesMap & integrate with ShuffleMemoryManager
This patch adds support for entries larger than the default page size in BytesToBytesMap.  These large rows are handled by allocating special overflow pages to hold individual entries.

In addition, this patch integrates BytesToBytesMap with the ShuffleMemoryManager:

- Move BytesToBytesMap from `unsafe` to `core` so that it can import `ShuffleMemoryManager`.
- Before allocating new data pages, ask the ShuffleMemoryManager to reserve the memory:
  - `putNewKey()` now returns a boolean to indicate whether the insert succeeded or failed due to a lack of memory.  The caller can use this value to respond to the memory pressure (e.g. by spilling).
- `UnsafeFixedWidthAggregationMap.getAggregationBuffer()` now returns `null` to signal failure due to a lack of memory.
- Updated all uses of these classes to handle these error conditions.
- Added new tests for allocating large records and for allocations which fail due to memory pressure.
- Extended the `afterAll()` test teardown methods to detect ShuffleMemoryManager leaks.
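
The caller-side contract described above (a boolean result instead of an exception, with the caller reacting by spilling) can be sketched as follows. `BoundedMap` and `insert_all` are hypothetical names with a simple byte budget; this is not the BytesToBytesMap or ShuffleMemoryManager API, and overflow pages for oversized entries are not modeled.

```python
class BoundedMap:
    """Hypothetical map with a fixed byte budget."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.data = {}

    def put_new_key(self, key: bytes, value: bytes) -> bool:
        needed = len(key) + len(value)
        if self.used + needed > self.budget:
            return False  # signal memory pressure instead of raising
        self.data[key] = value
        self.used += needed
        return True

    def drain(self):
        """Return all entries sorted by key and release the budget."""
        items = sorted(self.data.items())
        self.data.clear()
        self.used = 0
        return items


def insert_all(records, budget_bytes):
    m, spills = BoundedMap(budget_bytes), []
    for key, value in records:
        if not m.put_new_key(key, value):
            spills.append(m.drain())  # respond to pressure by spilling
            if not m.put_new_key(key, value):  # retry once on the empty map
                raise MemoryError("record larger than the whole budget")
    return m, spills
```

Returning a status rather than throwing keeps the common path cheap and leaves the policy (spill, retry, or fail) to the caller.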

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7762 from JoshRosen/large-rows and squashes the following commits:

ae7bc56 [Josh Rosen] Fix compilation
82fc657 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-rows
34ab943 [Josh Rosen] Remove semi
31a525a [Josh Rosen] Integrate BytesToBytesMap with ShuffleMemoryManager.
626b33c [Josh Rosen] Move code to sql/core and spark/core packages so that ShuffleMemoryManager can be integrated
ec4484c [Josh Rosen] Move BytesToBytesMap from unsafe package to core.
642ed69 [Josh Rosen] Rename size to numElements
bea1152 [Josh Rosen] Add basic test.
2cd3570 [Josh Rosen] Remove accidental duplicated code
07ff9ef [Josh Rosen] Basic support for large rows in BytesToBytesMap.
2015-07-31 19:19:27 -07:00
HuJiayin 4d5a6e7b60 [SPARK-8271][SQL]string function: soundex
This PR brings SQL function soundex(), see https://issues.apache.org/jira/browse/HIVE-9738

It's based on #7115 , thanks to HuJiayin

Author: HuJiayin <jiayin.hu@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7812 from davies/soundex and squashes the following commits:

fa75941 [Davies Liu] Merge branch 'master' of github.com:apache/spark into soundex
a4bd6d8 [Davies Liu] fix soundex
2538908 [HuJiayin] add codegen soundex
d15d329 [HuJiayin] add back ut
ded1a14 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark
e2dec2c [HuJiayin] support soundex rebase code
2015-07-31 16:05:26 -07:00
Yin Huai 3fc0cb9200 [SPARK-9233] [SQL] Enable code-gen in window function unit tests
Since code-gen is enabled by default, it is better to run window function tests with code-gen.

https://issues.apache.org/jira/browse/SPARK-9233

Author: Yin Huai <yhuai@databricks.com>

Closes #7832 from yhuai/SPARK-9233 and squashes the following commits:

4e4e4cc [Yin Huai] style
ca80e07 [Yin Huai] Test window function with codegen.
2015-07-31 14:13:06 -07:00
Yin Huai 815c8245f4 [SPARK-9466] [SQL] Increase two timeouts in CliSuite.
Hopefully this can resolve the flakiness of this suite.

JIRA: https://issues.apache.org/jira/browse/SPARK-9466

Author: Yin Huai <yhuai@databricks.com>

Closes #7777 from yhuai/SPARK-9466 and squashes the following commits:

e0e3a86 [Yin Huai] Increase the timeout.
2015-07-31 13:45:12 -07:00
Herman van Hovell 39ab199a3f [SPARK-8640] [SQL] Enable Processing of Multiple Window Frames in a Single Window Operator
This PR enables the processing of multiple window frames in a single window operator. This should improve the performance of processing multiple window expressions which share partition by/order by clauses, because it will be more efficient with respect to memory use and group processing.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #7515 from hvanhovell/SPARK-8640 and squashes the following commits:

f0e1c21 [Herman van Hovell] Changed Window Logical/Physical plans to use partition by/order by specs directly instead of using WindowSpec.
e1711c2 [Herman van Hovell] Enabled the processing of multiple window frames in a single Window operator.
2015-07-31 12:08:25 -07:00
Davies Liu 0024da9157 [SQL] address comments for to_date/trunc
This PR addresses the comments in #7805

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #7817 from davies/trunc and squashes the following commits:

f729d5f [Davies Liu] rollback
cb7f7832 [Davies Liu] genCode() is protected
31e52ef [Davies Liu] fix style
ed1edc7 [Davies Liu] address comments for #7805
2015-07-31 11:07:34 -07:00
Davies Liu 6bba7509a9 [SPARK-9500] add TernaryExpression to simplify ternary expressions
There is a lot of duplicated code in ternary expressions. This PR creates a TernaryExpression for them to reduce the duplication.

cc chenghao-intel

Author: Davies Liu <davies@databricks.com>

Closes #7816 from davies/ternary and squashes the following commits:

ed2bf76 [Davies Liu] add TernaryExpression
2015-07-31 08:28:05 -07:00
WangTaoTheTonic a3a85d73da [SPARK-9496][SQL]do not print the password in config
https://issues.apache.org/jira/browse/SPARK-9496

We should not print the password in the log.

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #7815 from WangTaoTheTonic/master and squashes the following commits:

c7a5145 [WangTaoTheTonic] do not print the password in config
2015-07-30 23:50:06 -07:00
Liang-Chi Hsieh 0244170b66 [SPARK-9152][SQL] Implement code generation for Like and RLike
JIRA: https://issues.apache.org/jira/browse/SPARK-9152

This PR implements code generation for `Like` and `RLike`.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7561 from viirya/like_rlike_codegen and squashes the following commits:

fe5641b [Liang-Chi Hsieh] Add test for NonFoldableLiteral.
ccd1b43 [Liang-Chi Hsieh] For comments.
0086723 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into like_rlike_codegen
50df9a8 [Liang-Chi Hsieh] Use nullSafeCodeGen.
8092a68 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into like_rlike_codegen
696d451 [Liang-Chi Hsieh] Check expression foldable.
48e5536 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into like_rlike_codegen
aea58e0 [Liang-Chi Hsieh] For comments.
46d946f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into like_rlike_codegen
a0fb76e [Liang-Chi Hsieh] For comments.
6cffe3c [Liang-Chi Hsieh] For comments.
69f0fb6 [Liang-Chi Hsieh] Add code generation for Like and RLike.
2015-07-30 23:05:58 -07:00
Daoyuan Wang 83670fc9e6 [SPARK-8176] [SPARK-8197] [SQL] function to_date/ trunc
This PR is based on #6988 , thanks to adrian-wang .

This brings two SQL functions: to_date() and trunc().

Closes #6988

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7805 from davies/to_date and squashes the following commits:

2c7beba [Davies Liu] Merge branch 'master' of github.com:apache/spark into to_date
310dd55 [Daoyuan Wang] remove dup test in rebase
980b092 [Daoyuan Wang] resolve rebase conflict
a476c5a [Daoyuan Wang] address comments from davies
d44ea5f [Daoyuan Wang] function to_date, trunc
2015-07-30 19:22:38 -07:00
Josh Rosen 3c66ff727d [SPARK-9489] Remove unnecessary compatibility and requirements checks from Exchange
While reviewing yhuai's patch for SPARK-2205 (#7773), I noticed that Exchange's `compatible` check may be incorrectly returning `false` in many cases.  As far as I know, this is not actually a problem because the `compatible`, `meetsRequirements`, and `needsAnySort` checks are serving only as short-circuit performance optimizations that are not necessary for correctness.

In order to reduce code complexity, I think that we should remove these checks and unconditionally rewrite the operator's children.  This should be safe because we rewrite the tree in a single bottom-up pass.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7807 from JoshRosen/SPARK-9489 and squashes the following commits:

9d76ce9 [Josh Rosen] [SPARK-9489] Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange
2015-07-30 17:38:48 -07:00
Liang-Chi Hsieh 351eda0e2f [SPARK-6319][SQL] Throw AnalysisException when using BinaryType on Join and Aggregate
JIRA: https://issues.apache.org/jira/browse/SPARK-6319

Spark SQL uses plain byte arrays to represent binary values. However, the arrays are compared by reference rather than by value. Thus, we should not use BinaryType on Join and Aggregate in the current implementation.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7787 from viirya/agg_no_binary_type and squashes the following commits:

4f76cac [Liang-Chi Hsieh] Throw AnalysisException when using BinaryType on Join and Aggregate.
2015-07-30 17:22:51 -07:00
Davies Liu 0b1a464b6e [SPARK-9425] [SQL] support DecimalType in UnsafeRow
This PR adds support for DecimalType in UnsafeRow. A decimal with precision <= 18 is settable; one with a larger precision is not.
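
One likely reason for the 18-digit cutoff, stated here as an assumption rather than something this message spells out, is that any unscaled value with at most 18 decimal digits fits in a signed 64-bit integer, so it can live in a fixed 8-byte slot and be overwritten in place. The Python sketch below illustrates that arithmetic; the function names and the little-endian slot layout are hypothetical.

```python
import struct
from decimal import Decimal

MAX_LONG_DIGITS = 18  # 10**18 - 1 is below 2**63 - 1; 10**19 - 1 is not


def fits_in_long(precision: int) -> bool:
    """True if every unscaled value of this precision fits in a signed 64-bit int."""
    return 10 ** precision - 1 <= 2 ** 63 - 1


def set_decimal(slot: bytearray, value: Decimal, scale: int) -> None:
    """Overwrite an 8-byte slot in place with the unscaled value."""
    unscaled = int(value.scaleb(scale))
    struct.pack_into("<q", slot, 0, unscaled)


def get_decimal(slot: bytearray, scale: int) -> Decimal:
    """Read the unscaled value back and reapply the scale."""
    (unscaled,) = struct.unpack_from("<q", slot, 0)
    return Decimal(unscaled).scaleb(-scale)
```

A value that needs more than 8 bytes has no fixed-width slot to overwrite, which is consistent with larger precisions not being settable.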

Author: Davies Liu <davies@databricks.com>

Closes #7758 from davies/unsafe_decimal and squashes the following commits:

478b1ba [Davies Liu] address comments
536314c [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_decimal
7c2e77a [Davies Liu] fix JoinedRow
76d6fa4 [Davies Liu] fix tests
99d3151 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_decimal
d49c6ae [Davies Liu] support DecimalType in UnsafeRow
2015-07-30 17:18:32 -07:00
Reynold Xin e7a0976e99 [SPARK-9458][SPARK-9469][SQL] Code generate prefix computation in sorting & moves unsafe conversion out of TungstenSort.
Author: Reynold Xin <rxin@databricks.com>

Closes #7803 from rxin/SPARK-9458 and squashes the following commits:

5b032dc [Reynold Xin] Fix string.
b670dbb [Reynold Xin] [SPARK-9458][SPARK-9469][SQL] Code generate prefix computation in sorting & moves unsafe conversion out of TungstenSort.
2015-07-30 17:17:27 -07:00
Xiangrui Meng df32669514 [SPARK-7157][SQL] add sampleBy to DataFrame
This was previously committed but then reverted due to test failures (see #6769).

Author: Xiangrui Meng <meng@databricks.com>

Closes #7755 from rxin/SPARK-7157 and squashes the following commits:

fbf9044 [Xiangrui Meng] fix python test
542bd37 [Xiangrui Meng] update test
604fe6d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7157
f051afd [Xiangrui Meng] use udf instead of building expression
f4e9425 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7157
8fb990b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7157
103beb3 [Xiangrui Meng] add Java-friendly sampleBy
991f26f [Xiangrui Meng] fix seed
4a14834 [Xiangrui Meng] move sampleBy to stat
832f7cc [Xiangrui Meng] add sampleBy to DataFrame
2015-07-30 17:16:03 -07:00
Daoyuan Wang 1abf7dc16c [SPARK-8186] [SPARK-8187] [SPARK-8194] [SPARK-8198] [SPARK-9133] [SPARK-9290] [SQL] functions: date_add, date_sub, add_months, months_between, time-interval calculation
This PR is based on #7589 , thanks to adrian-wang

Adds the SQL functions date_add, date_sub, add_months, and months_between, and also adds a rule for
adding or subtracting an interval to or from a date/timestamp.

Closes #7589

cc rxin

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7754 from davies/date_add and squashes the following commits:

e8c633a [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
9e8e085 [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
6224ce4 [Davies Liu] fix conflict
bd18cd4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
e47ff2c [Davies Liu] add python api, fix date functions
01943d0 [Davies Liu] Merge branch 'master' into date_add
522e91a [Daoyuan Wang] fix
e8a639a [Daoyuan Wang] fix
42df486 [Daoyuan Wang] fix style
87c4b77 [Daoyuan Wang] function add_months, months_between and some fixes
1a68e03 [Daoyuan Wang] poc of time interval calculation
c506661 [Daoyuan Wang] function date_add , date_sub
2015-07-30 13:21:46 -07:00
Daoyuan Wang 6d94bf6ac1 [SPARK-8174] [SPARK-8175] [SQL] function unix_timestamp, from_unixtime
unix_timestamp(): long
Gets current Unix timestamp in seconds.

unix_timestamp(string|date): long
Converts a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds), using the default timezone and the default locale. Returns null on failure. Example: unix_timestamp('2009-03-20 11:30:01') = 1237573801

unix_timestamp(string date, string pattern): long
Converts a time string with the given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to a Unix timestamp (in seconds). Returns null on failure. Example: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.

from_unixtime(bigint unixtime[, string format]): string
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00".
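
The semantics above can be mimicked in plain Python for illustration. Two assumptions: the default timezone is modeled as a fixed UTC-7 offset, which is what the two example values above work out to, and Python `strptime` patterns are used in place of the Java SimpleDateFormat patterns. This is a sketch of the described behavior, not the Spark implementation.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the "default timezone" of the examples is a fixed UTC-7 offset.
DEFAULT_TZ = timezone(timedelta(hours=-7))
DEFAULT_PATTERN = "%Y-%m-%d %H:%M:%S"  # counterpart of yyyy-MM-dd HH:mm:ss


def unix_timestamp(text, pattern=DEFAULT_PATTERN, tz=DEFAULT_TZ):
    """Parse `text` as local time in `tz`; return epoch seconds or None on failure."""
    try:
        local = datetime.strptime(text, pattern).replace(tzinfo=tz)
    except ValueError:
        return None
    return int(local.timestamp())


def from_unixtime(seconds, pattern=DEFAULT_PATTERN, tz=DEFAULT_TZ):
    """Format epoch seconds as a local-time string in `tz`."""
    return datetime.fromtimestamp(seconds, tz).strftime(pattern)
```

Under those assumptions, `unix_timestamp('2009-03-20 11:30:01')` gives 1237573801 and `unix_timestamp('2009-03-20', '%Y-%m-%d')` gives 1237532400, matching the examples above, and `from_unixtime` inverts the first call.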

Jira:
https://issues.apache.org/jira/browse/SPARK-8174
https://issues.apache.org/jira/browse/SPARK-8175

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #7644 from adrian-wang/udfunixtime and squashes the following commits:

2fe20c4 [Daoyuan Wang] util.Date
ea2ec16 [Daoyuan Wang] use util.Date for better performance
a2cf929 [Daoyuan Wang] doc return null instead of 0
f6f070a [Daoyuan Wang] address comments from davies
6a4cbb3 [Daoyuan Wang] temp
56ded53 [Daoyuan Wang] rebase and address comments
14a8b37 [Daoyuan Wang] function unix_timestamp, from_unixtime
2015-07-30 11:14:09 -07:00
Josh Rosen 520ec0ff9d [SPARK-8850] [SQL] Enable Unsafe mode by default
This pull request enables Unsafe mode by default in Spark SQL. In order to do this, we had to fix a number of small issues:

**List of fixed blockers**:

- [x] Make some default buffer sizes configurable so that HiveCompatibilitySuite can run properly (#7741).
- [x] Memory leak on grouped aggregation of empty input (fixed by #7560)
- [x] Update planner to also check whether codegen is enabled before planning unsafe operators.
- [x] Investigate failing HiveThriftBinaryServerSuite test.  This turns out to be caused by a ClassCastException that occurs when Exchange tries to apply an interpreted RowOrdering to an UnsafeRow when range partitioning an RDD.  This could be fixed by #7408, but a shorter-term fix is to just skip the Unsafe exchange path when RangePartitioner is used.
- [x] Memory leak exceptions masking exceptions that actually caused tasks to fail (will be fixed by #7603).
- [x]  ~~https://issues.apache.org/jira/browse/SPARK-9162, to implement code generation for ScalaUDF.  This is necessary for `UDFSuite` to pass.  For now, I've just ignored this test in order to try to find other problems while we wait for a fix.~~ This is no longer necessary as of #7682.
- [x] Memory leaks from Limit after UnsafeExternalSort cause the memory leak detector to fail tests. This is a huge problem in the HiveCompatibilitySuite (fixed by f4ac642a4e5b2a7931c5e04e086bb10e263b1db6).
- [x] Tests in `AggregationQuerySuite` are failing due to NaN-handling issues in UnsafeRow, which were fixed in #7736.
- [x] `org.apache.spark.sql.ColumnExpressionSuite.rand` needs to be updated so that the planner check also matches `TungstenProject`.
- [x] After having lowered the buffer sizes to 4MB so that most of HiveCompatibilitySuite runs:
  - [x] Wrong answer in `join_1to1` (fixed by #7680)
  - [x] Wrong answer in `join_nulls` (fixed by #7680)
  - [x] Managed memory OOM / leak in `lateral_view`
  - [x] Seems to hang indefinitely in `partcols1`.  This might be a deadlock in script transformation or a bug in error-handling code? The hang was fixed by #7710.
  - [x] Error while freeing memory in `partcols1`: will be fixed by #7734.
- [x] After fixing the `partcols1` hang, it appears that a number of later tests have issues as well.
- [x] Fix thread-safety bug in codegen fallback expression evaluation (#7759).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7564 from JoshRosen/unsafe-by-default and squashes the following commits:

83c0c56 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default
f4cc859 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default
963f567 [Josh Rosen] Reduce buffer size for R tests
d6986de [Josh Rosen] Lower page size in PySpark tests
013b9da [Josh Rosen] Also match TungstenProject in checkNumProjects
5d0b2d3 [Josh Rosen] Add task completion callback to avoid leak in limit after sort
ea250da [Josh Rosen] Disable unsafe Exchange path when RangePartitioning is used
715517b [Josh Rosen] Enable Unsafe by default
2015-07-30 10:45:32 -07:00
Liang-Chi Hsieh 5363ed7156 [SPARK-9361] [SQL] Refactor new aggregation code to reduce the times of checking compatibility
JIRA: https://issues.apache.org/jira/browse/SPARK-9361

Currently, we call `aggregate.Utils.tryConvert` in many places to check whether the logical.Aggregate can be run with the new aggregation. However, `aggregate.Utils.tryConvert` appears to take considerable time to run. We should call `tryConvert` only once, keep its value in `logical.Aggregate`, and reuse it.

In `org.apache.spark.sql.execution.aggregate.Utils`, the code involving `tryConvert` should be moved to catalyst because it doesn't actually deal with execution details.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7677 from viirya/refactor_aggregate and squashes the following commits:

babea30 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into refactor_aggregate
9a589d7 [Liang-Chi Hsieh] Fix scala style.
0a91329 [Liang-Chi Hsieh] Refactor new aggregation code to reduce the times to call tryConvert.
2015-07-30 10:32:12 -07:00
Wenchen Fan c0cc0eaec6 [SPARK-9390][SQL] create a wrapper for array type
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7724 from cloud-fan/array-data and squashes the following commits:

d0408a1 [Wenchen Fan] fix python
661e608 [Wenchen Fan] rebase
f39256c [Wenchen Fan] fix hive...
6dbfa6f [Wenchen Fan] fix hive again...
8cb8842 [Wenchen Fan] remove element type parameter from getArray
43e9816 [Wenchen Fan] fix mllib
e719afc [Wenchen Fan] fix hive
4346290 [Wenchen Fan] address comment
d4a38da [Wenchen Fan] remove sizeInBytes and add license
7e283e2 [Wenchen Fan] create a wrapper for array type
2015-07-30 10:04:30 -07:00
Liang-Chi Hsieh 6175d6cfe7 [SPARK-8838] [SQL] Add config to enable/disable merging part-files when merging parquet schema
JIRA: https://issues.apache.org/jira/browse/SPARK-8838

Currently, all part-files are merged when merging the Parquet schema. However, when there are many part-files and we can be sure that all of them have the same schema as their summary file, this merging is unnecessary. For that case, we provide a configuration to disable merging part-files when merging the Parquet schema.

In short, we need to merge the Parquet schema because different summary files may contain different schemas, but the part-files are known to have the same schema as their summary files.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7238 from viirya/option_partfile_merge and squashes the following commits:

71d5b5f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
8816f44 [Liang-Chi Hsieh] For comments.
dbc8e6b [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
afc2fa1 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
d4ed7e6 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
df43027 [Liang-Chi Hsieh] Get dataStatuses' partitions based on all paths.
4eb2f00 [Liang-Chi Hsieh] Use given parameter.
ea8f6e5 [Liang-Chi Hsieh] Correct the code comments.
a57be0e [Liang-Chi Hsieh] Merge part-files if there are no summary files.
47df981 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
4caf293 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
0e734e0 [Liang-Chi Hsieh] Use correct API.
3b6be5b [Liang-Chi Hsieh] Fix key not found.
4bdd7e0 [Liang-Chi Hsieh] Don't read footer files if we can skip them.
8bbebcb [Liang-Chi Hsieh] Figure out how to test the config.
bbd4ce7 [Liang-Chi Hsieh] Add config to enable/disable merging part-files when merging parquet schema.
2015-07-30 17:45:30 +08:00
Reynold Xin 5ba2d44068 Fix flaky HashedRelationSuite
SparkEnv might not have been set in local unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #7784 from rxin/HashedRelationSuite and squashes the following commits:

435d64b [Reynold Xin] Fix flaky HashedRelationSuite
2015-07-30 01:21:39 -07:00
Reynold Xin 4a8bb9d00d Revert "[SPARK-9458] Avoid object allocation in prefix generation."
This reverts commit 9514d874f0.
2015-07-30 01:04:24 -07:00
Joseph Batchik 1221849f91 [SPARK-8005][SQL] Input file name
Users can now get the file name of the partition being read in. A thread local variable is in `SQLNewHadoopRDD` and is set when the partition is computed. `SQLNewHadoopRDD` is moved to core so that the catalyst package can reach it.

This supports:

`df.select(inputFileName())`

and

`sqlContext.sql("select input_file_name() from table")`
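
As a rough illustration of the thread-local approach described above, here is a small Python sketch. It is not the `SQLNewHadoopRDD` code: `compute_partition` and `input_file_name` are hypothetical helpers, and the empty-string default mirrors the default value mentioned in the squashed commits.

```python
import threading

_input_file = threading.local()  # one slot per task thread


def input_file_name():
    # Empty string when this thread is not currently reading a partition.
    return getattr(_input_file, "name", "")


def compute_partition(path, rows):
    _input_file.name = path  # set when the partition is computed
    try:
        return [(input_file_name(), row) for row in rows]
    finally:
        _input_file.name = ""  # reset so the name does not leak to later work


results = {}


def _work(path):
    results[path] = compute_partition(path, [1, 2])


threads = [threading.Thread(target=_work, args=(p,)) for p in ("part-0", "part-1")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the value is thread-local, two partitions computed concurrently never see each other's file name.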

Author: Joseph Batchik <josephbatchik@gmail.com>

Closes #7743 from JDrit/input_file_name and squashes the following commits:

abb8609 [Joseph Batchik] fixed failing test and changed the default value to be an empty string
d2f323d [Joseph Batchik] updates per review
102061f [Joseph Batchik] updates per review
75313f5 [Joseph Batchik] small fixes
c7f7b5a [Joseph Batchik] addeding input file name to Spark SQL
2015-07-29 23:35:55 -07:00
Yijie Shen e127ec34d5 [SPARK-9428] [SQL] Add test cases for null inputs for expression unit tests
JIRA: https://issues.apache.org/jira/browse/SPARK-9428

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7748 from yjshen/string_cleanup and squashes the following commits:

e0c2b3d [Yijie Shen] update codegen in RegExpExtract and RegExpReplace
26614d2 [Yijie Shen] MathFunctionSuite
a402859 [Yijie Shen] complex_create, conditional and cast
6e4e608 [Yijie Shen] arithmetic and cast
52593c1 [Yijie Shen] null input test cases for StringExpressionSuite
2015-07-29 23:24:20 -07:00
Reynold Xin 712465b68e HOTFIX: disable HashedRelationSuite. 2015-07-29 22:51:06 -07:00
Davies Liu e044705b44 [SPARK-9116] [SQL] [PYSPARK] support Python only UDT in __main__
This also makes it possible to create a Python UDT without a corresponding Scala one, which is important for Python users.

cc mengxr JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #7453 from davies/class_in_main and squashes the following commits:

4dfd5e1 [Davies Liu] add tests for Python and Scala UDT
793d9b2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
dc65f19 [Davies Liu] address comment
a9a3c40 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
a86e1fc [Davies Liu] fix serialization
ad528ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
63f52ef [Davies Liu] fix pylint check
655b8a9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
316a394 [Davies Liu] support Python UDT with UTF
0bcb3ef [Davies Liu] fix bug in mllib
de986d6 [Davies Liu] fix test
83d65ac [Davies Liu] fix bug in StructType
55bb86e [Davies Liu] support Python UDT in __main__ (without Scala one)
2015-07-29 22:30:49 -07:00
Reynold Xin 27850af525 [SPARK-9462][SQL] Initialize nondeterministic expressions in code gen fallback mode.
Author: Reynold Xin <rxin@databricks.com>

Closes #7767 from rxin/SPARK-9462 and squashes the following commits:

ef3e2d9 [Reynold Xin] Removed println
713ac3a [Reynold Xin] More unit tests.
bb5c334 [Reynold Xin] [SPARK-9462][SQL] Initialize nondeterministic expressions in code gen fallback mode.
2015-07-29 21:24:47 -07:00
Reynold Xin 9514d874f0 [SPARK-9458] Avoid object allocation in prefix generation.
In our existing sort prefix generation code, we use expression's eval method to generate the prefix, which results in object allocation for every prefix. We can use the specialized getters available on InternalRow directly to avoid the object allocation.

I also removed the FLOAT prefix, opting for converting float directly to double.

Author: Reynold Xin <rxin@databricks.com>

Closes #7763 from rxin/sort-prefix and squashes the following commits:

5dc2f06 [Reynold Xin] [SPARK-9458] Avoid object allocation in prefix generation.
2015-07-29 20:46:03 -07:00
Reynold Xin 86505962e6 [SPARK-9448][SQL] GenerateUnsafeProjection should not share expressions across instances.
We accidentally moved the list of expressions from the generated code instance to the class wrapper, and as a result, different threads are sharing the same set of expressions, which causes problems for expressions with mutable state.

This pull request fixes that problem, and also adds unit tests for all codegen classes except GeneratedOrdering (which will never need any expressions, since sort now only accepts bound references).

Author: Reynold Xin <rxin@databricks.com>

Closes #7759 from rxin/SPARK-9448 and squashes the following commits:

c09b50f [Reynold Xin] [SPARK-9448][SQL] GenerateUnsafeProjection should not share expressions across instances.
2015-07-29 16:49:02 -07:00
Josh Rosen 1b0099fc62 [SPARK-9411] [SQL] Make Tungsten page sizes configurable
We need to make page sizes configurable so we can reduce them in unit tests and increase them in real production workloads.  These sizes are now controlled by a new configuration, `spark.buffer.pageSize`.  The new default is 64 megabytes.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7741 from JoshRosen/SPARK-9411 and squashes the following commits:

a43c4db [Josh Rosen] Fix pow
2c0eefc [Josh Rosen] Fix MAXIMUM_PAGE_SIZE_BYTES comment + value
bccfb51 [Josh Rosen] Lower page size to 4MB in TestHive
ba54d4b [Josh Rosen] Make UnsafeExternalSorter's page size configurable
0045aa2 [Josh Rosen] Make UnsafeShuffle's page size configurable
bc734f0 [Josh Rosen] Rename configuration
e614858 [Josh Rosen] Makes BytesToBytesMap page size configurable
2015-07-29 16:00:30 -07:00
Reynold Xin 5340dfaf94 [SPARK-9430][SQL] Rename IntervalType to CalendarIntervalType.
We want to introduce a new IntervalType in 1.6 that is based only on the number of microseconds,
so that intervals can be compared.

Renaming the existing IntervalType to CalendarIntervalType so we can do that in the future.

Author: Reynold Xin <rxin@databricks.com>

Closes #7745 from rxin/calendarintervaltype and squashes the following commits:

99f64e8 [Reynold Xin] One more line ...
13466c8 [Reynold Xin] Fixed tests.
e20f24e [Reynold Xin] [SPARK-9430][SQL] Rename IntervalType to CalendarIntervalType.
2015-07-29 13:49:22 -07:00
Reynold Xin 97906944e1 [SPARK-9127][SQL] Rand/Randn codegen fails with long seed.
Author: Reynold Xin <rxin@databricks.com>

Closes #7747 from rxin/SPARK-9127 and squashes the following commits:

e851418 [Reynold Xin] [SPARK-9127][SQL] Rand/Randn codegen fails with long seed.
2015-07-29 09:36:22 -07:00
Wenchen Fan 708794e8aa [SPARK-9251][SQL] do not order by expressions which still need evaluation
Following an offline discussion with rxin: it's weird to be computing things while sorting, so we should only order by bound references during execution.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7593 from cloud-fan/sort and squashes the following commits:

7b1bef7 [Wenchen Fan] add test
daf206d [Wenchen Fan] add more comments
289bee0 [Wenchen Fan] do not order by expressions which still need evaluation
2015-07-29 00:08:45 -07:00
Davies Liu 15667a0afa [SPARK-9281] [SQL] use decimal or double when parsing SQL
Right now, we use double to parse all floating-point numbers in SQL. When such a number is used in an expression together with DecimalType, it turns the decimal into a double as well. Parsing as double also loses some precision.

This PR changes the parser to parse a floating-point number as decimal or double, depending on whether it uses scientific notation; see https://msdn.microsoft.com/en-us/library/ms179899.aspx

This is a breaking change; should we document it somewhere?
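
The parsing rule can be illustrated with a short Python sketch. It is not the `SqlParser` implementation: `parse_numeric_literal` is a hypothetical helper, and Python's `Decimal` and `float` stand in for Spark's decimal and double types.

```python
from decimal import Decimal


def parse_numeric_literal(text):
    # Scientific notation is parsed as a double; a plain literal stays exact.
    if "e" in text.lower():
        return float(text)
    return Decimal(text)


exact = parse_numeric_literal("0.1") + parse_numeric_literal("0.2")
approximate = parse_numeric_literal("1e-1") + parse_numeric_literal("2e-1")
```

Here `exact` is a `Decimal` that compares equal to `Decimal("0.3")`, while `approximate` is a binary floating-point value that does not equal `0.3` exactly.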

Author: Davies Liu <davies@databricks.com>

Closes #7642 from davies/parse_decimal and squashes the following commits:

1f576d9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into parse_decimal
5e142b6 [Davies Liu] fix scala style
eca99de [Davies Liu] fix tests
2afe702 [Davies Liu] Merge branch 'master' of github.com:apache/spark into parse_decimal
f4a320b [Davies Liu] Update SqlParser.scala
1c48e34 [Davies Liu] use decimal or double when parsing SQL
2015-07-28 22:51:08 -07:00
Yijie Shen 6309b93467 [SPARK-9398] [SQL] Datetime cleanup
JIRA: https://issues.apache.org/jira/browse/SPARK-9398

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7725 from yjshen/date_null_check and squashes the following commits:

b4eade1 [Yijie Shen] inline daysToMonthEnd
d09acc1 [Yijie Shen] implement getLastDayOfMonth to avoid repeated evaluation
d857ec3 [Yijie Shen] add null check in DateExpressionSuite
2015-07-28 22:38:28 -07:00
Wenchen Fan 429b2f0df4 [SPARK-8608][SPARK-8609][SPARK-9083][SQL] reset mutable states of nondeterministic expression before evaluation and fix PullOutNondeterministic
We do local projection for LocalRelation, and thus reuse the same Expression object across multiple evaluations. We should reset the mutable state of an Expression before evaluating it.

Fix `PullOutNondeterministic` rule to make it work for `Sort`.

Also took the chance to clean up the DataFrame test suite.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7674 from cloud-fan/show and squashes the following commits:

888934f [Wenchen Fan] fix sort
c0e93e8 [Wenchen Fan] local DataFrame with random columns should return same value when call `show`
2015-07-28 21:37:50 -07:00
Yin Huai 3744b7fd42 [SPARK-9422] [SQL] Remove the placeholder attributes used in the aggregation buffers
https://issues.apache.org/jira/browse/SPARK-9422

Author: Yin Huai <yhuai@databricks.com>

Closes #7737 from yhuai/removePlaceHolder and squashes the following commits:

ec29b44 [Yin Huai]  Remove placeholder attributes.
2015-07-28 19:01:25 -07:00
Josh Rosen e78ec1a8fa [SPARK-9421] Fix null-handling bugs in UnsafeRow.getDouble, getFloat(), and get(ordinal, dataType)
UnsafeRow.getDouble and getFloat() return NaN when called on columns that are null, which is inconsistent with the behavior of other row classes (which is to return 0.0).

In addition, the generic get(ordinal, dataType) method should always return null for a null literal, but currently it handles nulls by calling the type-specific accessors.

This patch addresses both of these issues and adds a regression test.
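
The intended null semantics can be sketched as follows. This is a plain Python model, not `UnsafeRow`: the `Row` class, its string type tags, and the method names are assumptions made for illustration.

```python
class Row:
    """Hypothetical row with typed getters and a generic getter."""

    def __init__(self, values):
        self._values = list(values)

    def is_null_at(self, ordinal):
        return self._values[ordinal] is None

    def get_double(self, ordinal):
        # Typed getter: a null column reads as 0.0, never NaN.
        return 0.0 if self.is_null_at(ordinal) else float(self._values[ordinal])

    def get_int(self, ordinal):
        return 0 if self.is_null_at(ordinal) else int(self._values[ordinal])

    def get(self, ordinal, data_type):
        # Generic getter: check for null first instead of delegating the null
        # to a type-specific accessor.
        if self.is_null_at(ordinal):
            return None
        getters = {"double": self.get_double, "int": self.get_int}
        return getters[data_type](ordinal)


row = Row([1.5, None])
```

With this layout, `row.get_double(1)` yields `0.0` and `row.get(1, "double")` yields `None`.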

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7736 from JoshRosen/unsafe-row-null-fixes and squashes the following commits:

c8eb2ee [Josh Rosen] Fix test in UnsafeRowConverterSuite
6214682 [Josh Rosen] Fixes to null handling in UnsafeRow
2015-07-28 17:51:58 -07:00
Reynold Xin 6662ee2124 [SPARK-9418][SQL] Use sort-merge join as the default shuffle join.
Sort-merge join is more robust in Spark since sorting can be done using the Tungsten sort operator.

Author: Reynold Xin <rxin@databricks.com>

Closes #7733 from rxin/smj and squashes the following commits:

61e4d34 [Reynold Xin] Fixed test case.
5ffd731 [Reynold Xin] Fixed JoinSuite.
a137dc0 [Reynold Xin] [SPARK-9418][SQL] Use sort-merge join as the default shuffle join.
2015-07-28 17:42:35 -07:00
Reynold Xin b7f54119f8 [SPARK-9420][SQL] Move expressions in sql/core package to catalyst.
Since catalyst package already depends on Spark core, we can move those expressions
into catalyst, and simplify function registry.

This is a followup of #7478.

Author: Reynold Xin <rxin@databricks.com>

Closes #7735 from rxin/SPARK-8003 and squashes the following commits:

2ffbdc3 [Reynold Xin] [SPARK-8003][SQL] Move expressions in sql/core package to catalyst.
2015-07-28 17:03:59 -07:00
Josh Rosen 59b92add7c [SPARK-9393] [SQL] Fix several error-handling bugs in ScriptTransform operator
SparkSQL's ScriptTransform operator has several serious bugs which make debugging fairly difficult:

- If exceptions are thrown in the writing thread then the child process will not be killed, leading to a deadlock because the reader thread will block while waiting for input that will never arrive.
- TaskContext is not propagated to the writer thread, which may cause errors in upstream pipelined operators.
- Exceptions which occur in the writer thread are not propagated to the main reader thread, which may cause upstream errors to be silently ignored instead of killing the job.  This can lead to silently incorrect query results.
- The writer thread is not a daemon thread, but it should be.

In addition, the code in this file is extremely messy:

- Lots of fields are nullable but the nullability isn't clearly explained.
- Many confusing variable names: for instance, there are variables named `ite` and `iterator` that are defined in the same scope.
- Some code was misindented.
- The `*serdeClass` variables are actually expected to be single-quoted strings, which is really confusing: I feel that this parsing / extraction should be performed in the analyzer, not in the operator itself.
- There were no unit tests for the operator itself, only end-to-end tests.

This pull request addresses these issues, borrowing some error-handling techniques from PySpark's PythonRDD.
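
The error-handling pattern for the writer thread can be sketched roughly like this. It is a Python illustration under stated assumptions, not the ScriptTransformation operator: `WriterThread`, `transform`, and the `CAT` command (a small Python one-liner standing in for the external script) are all hypothetical.

```python
import subprocess
import sys
import threading

# Stand-in for an external script: echoes stdin to stdout, like `cat`.
CAT = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]


class WriterThread(threading.Thread):
    """Feeds input rows to the child process and remembers any failure."""

    def __init__(self, rows, proc):
        super().__init__(daemon=True)  # a daemon thread never blocks shutdown
        self.rows = rows
        self.proc = proc
        self.error = None

    def run(self):
        try:
            for row in self.rows:
                self.proc.stdin.write(row + "\n")
        except BaseException as exc:
            self.error = exc
            self.proc.kill()  # unblock the reader instead of deadlocking
        finally:
            try:
                self.proc.stdin.close()
            except OSError:
                pass  # the child may already be gone


def transform(rows, command=CAT):
    proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    writer = WriterThread(rows, proc)
    writer.start()
    output = [line.rstrip("\n") for line in proc.stdout]
    writer.join()
    proc.stdout.close()
    proc.wait()
    if writer.error is not None:
        # Surface the upstream failure rather than returning partial output.
        raise RuntimeError("input writer failed") from writer.error
    return output
```

If the input iterator raises, the child is killed, the reader sees end of output, and the caller gets an exception instead of silently truncated results.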

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7710 from JoshRosen/script-transform and squashes the following commits:

16c44e2 [Josh Rosen] Update some comments
983f200 [Josh Rosen] Use unescapeSQLString instead of stripQuotes
6a06a8c [Josh Rosen] Clean up handling of quotes in serde class name
494cde0 [Josh Rosen] Propagate TaskContext to writer thread
323bb2b [Josh Rosen] Fix error-swallowing bug
b31258d [Josh Rosen] Rename iterator variables to disambiguate.
88278de [Josh Rosen] Split ScriptTransformation writer thread into own class.
8b162b6 [Josh Rosen] Add failing test which demonstrates exception masking issue
4ee36a2 [Josh Rosen] Kill script transform subprocess when error occurs in input writer.
bd4c948 [Josh Rosen] Skip launching of external command for empty partitions.
b43e4ec [Josh Rosen] Clean up nullability in ScriptTransformation
fa18d26 [Josh Rosen] Add basic unit test for script transform with 'cat' command.
2015-07-28 16:04:48 -07:00
Davies Liu 21825529ea [SPARK-9247] [SQL] Use BytesToBytesMap for broadcast join
This PR introduces BytesToBytesMap to UnsafeHashedRelation and uses it in executors for better performance.

It serializes all the keys and values from the Java HashMap and puts them into a BytesToBytesMap while deserializing. All the values for the same key are stored contiguously for better memory locality.
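
The contiguous layout can be pictured with a small Python sketch. It only models the idea of storing all values for a key next to each other; `build_contiguous` and `lookup` are hypothetical helpers and have nothing to do with the real BytesToBytesMap API.

```python
def build_contiguous(hash_map):
    values = []  # one flat buffer holding every value
    index = {}   # key -> (offset, count) into the flat buffer
    for key, rows in hash_map.items():
        index[key] = (len(values), len(rows))
        values.extend(rows)  # all rows for this key land side by side
    return index, values


def lookup(index, values, key):
    offset, count = index.get(key, (0, 0))
    return values[offset:offset + count]


index, values = build_contiguous({"a": [1, 2, 3], "b": [4]})
```

A probe for key `"a"` then reads one contiguous slice rather than chasing separate objects.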

This PR also addresses the comments for #7480 and does some cleanup.

Author: Davies Liu <davies@databricks.com>

Closes #7592 from davies/unsafe_map2 and squashes the following commits:

42c578a [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_map2
fd09528 [Davies Liu] remove thread local cache and update docs
1c5ad8d [Davies Liu] fix test
5eb1b5a [Davies Liu] address comments in #7480
46f1f22 [Davies Liu] fix style
fc221e0 [Davies Liu] use BytesToBytesMap for broadcast join
2015-07-28 15:56:19 -07:00
Joseph Batchik b88b868eb3 [SPARK-8003][SQL] Added virtual column support to Spark
Added virtual column support by adding a new resolution rule to the query analyzer. Additional virtual columns can be added by adding case expressions to [the new rule](https://github.com/JDrit/spark/blob/virt_columns/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L1026) and by modifying the [logical plan](https://github.com/JDrit/spark/blob/virt_columns/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L216) to resolve them.

This also solves [SPARK-8003](https://issues.apache.org/jira/browse/SPARK-8003)

This allows you to perform queries such as:
```sql
select spark__partition__id, count(*) as c from table group by spark__partition__id;
```

Author: Joseph Batchik <josephbatchik@gmail.com>
Author: JD <jd@csh.rit.edu>

Closes #7478 from JDrit/virt_columns and squashes the following commits:

7932bf0 [Joseph Batchik] adding spark__partition__id to hive as well
f8a9c6c [Joseph Batchik] merging in master
e49da48 [JD] fixes for @rxin's suggestions
60e120b [JD] fixing test in merge
4bf8554 [JD] merging in master
c68bc0f [Joseph Batchik] Adding function register ability to SQLContext and adding a function for spark__partition__id()
2015-07-28 14:39:25 -07:00
Yin Huai 6cdcc21fe6 [SPARK-9196] [SQL] Ignore test DatetimeExpressionsSuite: function current_timestamp.
This test is flaky. https://issues.apache.org/jira/browse/SPARK-9196 will track the fix of it. For now, let's disable this test.

Author: Yin Huai <yhuai@databricks.com>

Closes #7727 from yhuai/SPARK-9196-ignore and squashes the following commits:

f92bded [Yin Huai] Ignore current_timestamp.
2015-07-28 13:16:48 -07:00
Aaron Davidson 35ef853b3f [SPARK-9397] DataFrame should provide an API to find source data files if applicable
Certain applications would benefit from being able to inspect DataFrames that are straightforwardly produced by data sources that stem from files, and find out their source data. For example, one might want to display to a user the size of the data underlying a table, or to copy or mutate it.

This PR exposes an `inputFiles` method on DataFrame which attempts to discover the source data in a best-effort manner, by inspecting HadoopFsRelations and JSONRelations.

Author: Aaron Davidson <aaron@databricks.com>

Closes #7717 from aarondav/paths and squashes the following commits:

ff67430 [Aaron Davidson] inputFiles
0acd3ad [Aaron Davidson] [SPARK-9397] DataFrame should provide an API to find source data files if applicable
2015-07-28 10:12:09 -07:00
Reynold Xin 9bbe0171cb [SPARK-8196][SQL] Fix null handling & documentation for next_day.
The original patch didn't handle nulls correctly for next_day.

Author: Reynold Xin <rxin@databricks.com>

Closes #7718 from rxin/next_day and squashes the following commits:

616a425 [Reynold Xin] Merged DatetimeExpressionsSuite into DateFunctionsSuite.
faa78cf [Reynold Xin] Merged DatetimeFunctionsSuite into DateExpressionsSuite.
6c4fb6a [Reynold Xin] [SPARK-8196][SQL] Fix null handling & documentation for next_day.
2015-07-28 09:43:39 -07:00
Reynold Xin c740bed172 [SPARK-9373][SQL] follow up for StructType support in Tungsten projection.
Author: Reynold Xin <rxin@databricks.com>

Closes #7720 from rxin/struct-followup and squashes the following commits:

d9757f5 [Reynold Xin] [SPARK-9373][SQL] follow up for StructType support in Tungsten projection.
2015-07-28 09:43:12 -07:00
Reynold Xin 5a2330e546 [SPARK-9402][SQL] Remove CodegenFallback from Abs / FormatNumber.
Both expressions already implement code generation.

Author: Reynold Xin <rxin@databricks.com>

Closes #7723 from rxin/abs-formatnum and squashes the following commits:

31ed765 [Reynold Xin] [SPARK-9402][SQL] Remove CodegenFallback from Abs / FormatNumber.
2015-07-28 09:42:35 -07:00
Reynold Xin 15724fac56 [SPARK-9394][SQL] Handle parentheses in CodeFormatter.
Our CodeFormatter currently does not handle parentheses, and as a result in code dump, we see code formatted this way:

```
foo(
a,
b,
c)
```

With this patch, it is formatted this way:
```
foo(
  a,
  b,
  c)
```
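
A formatter with this behaviour can be approximated in a few lines of Python. This is only a sketch of depth-based indentation, not the actual CodeFormatter: `format_code` is a hypothetical function, and treating braces the same way as parentheses is an assumption.

```python
def format_code(code, indent="  "):
    depth = 0
    out = []
    for raw in code.splitlines():
        line = raw.strip()
        # A line that starts with a closer is dedented before it is emitted.
        leading_close = 1 if line[:1] in (")", "}") else 0
        out.append(indent * max(depth - leading_close, 0) + line if line else "")
        opens = sum(line.count(c) for c in "({")
        closes = sum(line.count(c) for c in ")}")
        depth = max(depth + opens - closes, 0)
    return "\n".join(out)


formatted = format_code("foo(\na,\nb,\nc)")
```

Applied to the unformatted example above, the arguments `a`, `b`, and `c` each gain one level of indentation.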

Author: Reynold Xin <rxin@databricks.com>

Closes #7712 from rxin/codeformat-parentheses and squashes the following commits:

c2b1c5f [Reynold Xin] Took square bracket out
3cfb174 [Reynold Xin] Code review feedback.
91f5bb1 [Reynold Xin] [SPARK-9394][SQL] Handle parentheses in CodeFormatter.
2015-07-28 00:52:26 -07:00
Cheng Hao 9c5612f4e1 [MINOR] [SQL] Support mutable expression unit test with codegen projection
This actually contains a few minor changes:
1) Enable the unit test(codegen) for mutable expressions (FormatNumber, Regexp_Replace/Regexp_Extract)
2) Use the `PlatformDependent.copyMemory` instead of the `System.arrayCopy`

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7566 from chenghao-intel/codegen_ut and squashes the following commits:

24f43ea [Cheng Hao] enable codegen for mutable expression & UTF8String performance
2015-07-27 23:02:23 -07:00
Reynold Xin 60f08c7c87 [SPARK-9373][SQL] Support StructType in Tungsten projection
This pull request updates GenerateUnsafeProjection to support StructType. If an input struct type is backed already by an UnsafeRow, GenerateUnsafeProjection copies the bytes directly into its buffer space without any conversion. However, if the input is not an UnsafeRow, GenerateUnsafeProjection runs the code generated recursively to convert the input into an UnsafeRow and then copies it into the buffer space.

Also create a TungstenProject operator that projects data directly into UnsafeRow. Note that I'm not sure if this is the way we want to structure Unsafe+codegen operators, but we can defer that decision to follow-up pull requests.

Author: Reynold Xin <rxin@databricks.com>

Closes #7689 from rxin/tungsten-struct-type and squashes the following commits:

9162f42 [Reynold Xin] Support IntervalType in UnsafeRow's getter.
be9f377 [Reynold Xin] Fixed tests.
10c4b7c [Reynold Xin] Format generated code.
77e8d0e [Reynold Xin] Fixed NondeterministicSuite.
ac4951d [Reynold Xin] Yay.
ac203bf [Reynold Xin] More comments.
9f36216 [Reynold Xin] Updated comment.
6b781fe [Reynold Xin] Reset the change in DataFrameSuite.
525b95b [Reynold Xin] Merged with master, more documentation & test cases.
321859a [Reynold Xin] [SPARK-9373][SQL] Support StructType in Tungsten projection [WIP]
2015-07-27 22:51:15 -07:00
Yijie Shen 63a492b931 [SPARK-8828] [SQL] Revert SPARK-5680
JIRA: https://issues.apache.org/jira/browse/SPARK-8828

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7667 from yjshen/revert_combinesum_2 and squashes the following commits:

c37ccb1 [Yijie Shen] add test case
8377214 [Yijie Shen] revert spark.sql.useAggregate2 to its default value
e2305ac [Yijie Shen] fix bug - avg on decimal column
7cb0e95 [Yijie Shen] [wip] resolving bugs
1fadb5a [Yijie Shen] remove occurance
17c6248 [Yijie Shen] revert SPARK-5680
2015-07-27 22:47:33 -07:00
Reynold Xin 3bc7055e26 Fixed a test failure. 2015-07-27 22:04:54 -07:00
Reynold Xin 84da8792e2 [SPARK-9395][SQL] Create a SpecializedGetters interface to track all the specialized getters.
As we are adding more and more specialized getters to more classes (coming soon ArrayData), this interface can help us prevent missing a method in some interfaces.

Author: Reynold Xin <rxin@databricks.com>

Closes #7713 from rxin/SpecializedGetters and squashes the following commits:

3b39be1 [Reynold Xin] Added override modifier.
567ba9c [Reynold Xin] [SPARK-9395][SQL] Create a SpecializedGetters interface to track all the specialized getters.
2015-07-27 21:41:15 -07:00
Daoyuan Wang 2e7f99a004 [SPARK-8195] [SPARK-8196] [SQL] udf next_day last_day
next_day returns the first date after the given date that falls on the specified day of the week.
last_day returns the last day of the month to which the given date belongs.
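
The two functions can be sketched in Python as follows. This illustrates the semantics only, not the Spark expressions; matching the day name by its first two letters is an assumption made for brevity.

```python
import calendar
from datetime import date, timedelta

_DAYS = {"MO": 0, "TU": 1, "WE": 2, "TH": 3, "FR": 4, "SA": 5, "SU": 6}


def next_day(start, day_of_week):
    target = _DAYS[day_of_week[:2].upper()]
    # Always move forward by 1..7 days, so the result is strictly after start.
    delta = (target - start.weekday() - 1) % 7 + 1
    return start + timedelta(days=delta)


def last_day(d):
    # monthrange returns (weekday of the first day, number of days in month).
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])
```

For example, 2015-07-27 is a Monday, so `next_day(date(2015, 7, 27), "Sunday")` is 2015-08-02 and `next_day(date(2015, 7, 27), "Monday")` is 2015-08-03.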

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #6986 from adrian-wang/udfnlday and squashes the following commits:

ef7e3da [Daoyuan Wang] fix
02b3426 [Daoyuan Wang] address 2 comments
dc69630 [Daoyuan Wang] address comments from rxin
8846086 [Daoyuan Wang] address comments from rxin
d09bcce [Daoyuan Wang] multi fix
1a9de3d [Daoyuan Wang] function next_day and last_day
2015-07-27 21:08:56 -07:00
Michael Armbrust ce89ff477a [SPARK-9386] [SQL] Feature flag for metastore partition pruning
Since we have been seeing a lot of failures related to this new feature, let's put it behind a flag and turn it off by default.

Author: Michael Armbrust <michael@databricks.com>

Closes #7703 from marmbrus/optionalMetastorePruning and squashes the following commits:

6ad128c [Michael Armbrust] style
8447835 [Michael Armbrust] [SPARK-9386][SQL] Feature flag for metastore partition pruning
fd37b87 [Michael Armbrust] add config flag
2015-07-27 17:32:34 -07:00
Wenchen Fan 3ab7525dce [SPARK-9355][SQL] Remove InternalRow.get generic getter call in columnar cache code
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7673 from cloud-fan/row-generic-getter-columnar and squashes the following commits:

88b1170 [Wenchen Fan] fix style
eeae712 [Wenchen Fan] Remove Internal.get generic getter call in columnar cache code
2015-07-27 13:40:50 -07:00
Cheng Lian 8e7d2bee23 [SPARK-9378] [SQL] Fixes test case "CTAS with serde"
This is a proper version of PR #7693 authored by viirya

The reason why "CTAS with serde" fails is that the `MetastoreRelation` gets converted to a Parquet data source relation by default.

Author: Cheng Lian <lian@databricks.com>

Closes #7700 from liancheng/spark-9378-fix-ctas-test and squashes the following commits:

4413af0 [Cheng Lian] Fixes test case "CTAS with serde"
2015-07-27 13:28:03 -07:00
Yin Huai 55946e76fd [SPARK-9349] [SQL] UDAF cleanup
https://issues.apache.org/jira/browse/SPARK-9349

With this PR, we only expose `UserDefinedAggregateFunction` (an abstract class) and `MutableAggregationBuffer` (an interface). Other internal wrappers and helper classes are moved to `org.apache.spark.sql.execution.aggregate` and marked as `private[sql]`.

Author: Yin Huai <yhuai@databricks.com>

Closes #7687 from yhuai/UDAF-cleanup and squashes the following commits:

db36542 [Yin Huai] Add comments to UDAF examples.
ae17f66 [Yin Huai] Address comments.
9c9fa5f [Yin Huai] UDAF cleanup.
2015-07-27 13:26:57 -07:00
Wenchen Fan 75438422c2 [SPARK-9369][SQL] Support IntervalType in UnsafeRow
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7688 from cloud-fan/interval and squashes the following commits:

5b36b17 [Wenchen Fan] fix codegen
a99ed50 [Wenchen Fan] address comment
9e6d319 [Wenchen Fan] Support IntervalType in UnsafeRow
2015-07-27 11:28:22 -07:00
Wenchen Fan dd9ae7945a [SPARK-9351] [SQL] remove literals from grouping expressions in Aggregate
Literals in grouping expressions have no effect at all; they only make our grouping key bigger, so we should remove them in the Optimizer.

I also make the old and new aggregation code consistent about literals in grouping here. The old aggregation already removes literals in grouping, but the new aggregation does not, so I explicitly make it a rule in the Optimizer.
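
As a toy model of such a rule, the Python sketch below filters literal expressions out of a grouping list. `Literal`, `Column`, and `remove_literal_grouping` are hypothetical names, not Catalyst classes, and the sketch ignores any special handling the real rule may need when every grouping expression is a literal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class Column:
    name: str


def remove_literal_grouping(grouping):
    # A literal has the same value for every row, so it never splits a group.
    return [expr for expr in grouping if not isinstance(expr, Literal)]


pruned = remove_literal_grouping([Column("a"), Literal(1), Column("b")])
```

The pruned grouping key contains only `Column("a")` and `Column("b")`, which produces the same groups with a smaller key.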

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7583 from cloud-fan/minor and squashes the following commits:

471adff [Wenchen Fan] add test
0839925 [Wenchen Fan] use transformDown when rewrite final result expressions
2015-07-27 11:23:29 -07:00
Wenchen Fan e2f38167f8 [SPARK-9376] [SQL] use a seed in RandomDataGeneratorSuite
Make this test deterministic, i.e. make sure this test passes no matter how many times we run it.

The original implementation uses a random seed, which leaves a chance that we break the null check assertion `assert(Iterator.fill(100)(generator()).contains(null))`.
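
The effect of fixing the seed can be shown with a small Python sketch. `make_generator` and `sample` are hypothetical helpers, and the 10% null probability is an arbitrary assumption rather than the value used by the suite.

```python
import random


def make_generator(seed, null_probability=0.1):
    rng = random.Random(seed)  # private RNG, unaffected by global state

    def generate():
        return None if rng.random() < null_probability else rng.randint(0, 100)

    return generate


def sample(seed, n=100):
    generate = make_generator(seed)
    return [generate() for _ in range(n)]
```

Two calls to `sample` with the same seed return identical lists, so whether the sample contains a null is decided once and stays the same on every run.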

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7691 from cloud-fan/seed and squashes the following commits:

eae7281 [Wenchen Fan] use a seed in RandomDataGeneratorSuite
2015-07-27 11:02:16 -07:00
Josh Rosen ecad9d4346 [SPARK-9364] Fix array out of bounds and use-after-free bugs in UnsafeExternalSorter
This patch fixes two bugs in UnsafeExternalSorter and UnsafeExternalRowSorter:

- UnsafeExternalSorter does not properly update freeSpaceInCurrentPage, which can cause it to write past the end of memory pages and trigger segfaults.
- UnsafeExternalRowSorter has a use-after-free bug when returning the last row from an iterator.
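
The first bug is a bookkeeping error, which the following Python sketch models. `PagedWriter` is a hypothetical class using byte arrays in place of memory pages; it is not the UnsafeExternalSorter code.

```python
class PagedWriter:
    """Hypothetical page-based record store with free-space tracking."""

    def __init__(self, page_size):
        self.page_size = page_size
        self.pages = []
        self.free_space_in_current_page = 0

    def insert(self, record):
        if len(record) > self.page_size:
            raise ValueError("record larger than a page")
        if len(record) > self.free_space_in_current_page:
            # Not enough room left: start a fresh page.
            self.pages.append(bytearray())
            self.free_space_in_current_page = self.page_size
        self.pages[-1].extend(record)
        # The essential step: without this decrement, later inserts would
        # believe the page is still empty and write past its end.
        self.free_space_in_current_page -= len(record)


writer = PagedWriter(page_size=8)
for _ in range(3):
    writer.insert(b"aaaaa")
writer.insert(b"bb")
```

With the decrement in place, no page ever grows beyond `page_size`.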

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7680 from JoshRosen/SPARK-9364 and squashes the following commits:

590f311 [Josh Rosen] null out row
f4cf91d [Josh Rosen] Fix use-after-free bug in UnsafeExternalRowSorter.
8abcf82 [Josh Rosen] Properly decrement freeSpaceInCurrentPage in UnsafeExternalSorter
2015-07-27 09:34:49 -07:00
Rene Treffer aa19c696e2 [SPARK-4176] [SQL] Supports decimal types with precision > 18 in Parquet
This PR is based on #6796 authored by rtreffer.

To support large decimal precisions (> 18), we do the following things in this PR:

1. Making `CatalystSchemaConverter` support large decimal precision

   Decimal types with large precision are always converted to fixed-length byte array.

2. Making `CatalystRowConverter` support reading decimal values with large precision

   When the precision is > 18, constructs `Decimal` values with an unscaled `BigInteger` rather than an unscaled `Long`.

3. Making `RowWriteSupport` support writing decimal values with large precision

   In this PR we always write decimals as fixed-length byte array, because the Parquet write path hasn't been refactored to conform to the Parquet format spec (see SPARK-6774 & SPARK-8848).
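
The fixed-length encoding can be sketched in Python as below. This is an illustration under assumptions, not the `CatalystSchemaConverter` or `RowWriteSupport` code: the helper names are hypothetical, and the layout assumed here is a big-endian two's-complement unscaled integer.

```python
from decimal import Decimal


def min_bytes_for_precision(precision):
    # Smallest n such that a signed n-byte integer can hold 10**precision - 1.
    n = 1
    while 2 ** (8 * n - 1) < 10 ** precision:
        n += 1
    return n


def unscaled_value(value, scale):
    # Work on the digit tuple so no context rounding can occur.
    sign, digits, exponent = value.as_tuple()
    shift = exponent + scale
    if shift < 0:
        raise ValueError("value has more fractional digits than scale")
    magnitude = int("".join(map(str, digits))) * 10 ** shift
    return -magnitude if sign else magnitude


def encode_decimal(value, precision, scale):
    size = min_bytes_for_precision(precision)
    return unscaled_value(value, scale).to_bytes(size, "big", signed=True)


def decode_decimal(raw, scale):
    unscaled = int.from_bytes(raw, "big", signed=True)
    digits = tuple(int(c) for c in str(abs(unscaled)))
    return Decimal((1 if unscaled < 0 else 0, digits, -scale))
```

Precision 18 still fits in 8 bytes, which is why an unscaled `Long` suffices up to that point, while precision 19 already needs 9 bytes and therefore an arbitrary-precision integer.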

Two follow-up tasks should be done in future PRs:

- [ ] Writing decimals as `INT32`, `INT64` when possible while fixing SPARK-8848
- [ ] Adding compatibility tests as part of SPARK-5463

Author: Cheng Lian <lian@databricks.com>

Closes #7455 from liancheng/spark-4176 and squashes the following commits:

a543d10 [Cheng Lian] Fixes errors introduced while rebasing
9e31cdf [Cheng Lian] Supports decimals with precision > 18 for Parquet
2015-07-27 23:29:40 +08:00
Cheng Lian 72981bc8f0 [SPARK-7943] [SPARK-8105] [SPARK-8435] [SPARK-8714] [SPARK-8561] Fixes multi-database support
This PR fixes a set of issues related to multi-database. A new data structure `TableIdentifier` is introduced to identify a table among multiple databases. We should stop using a single `String` (table name without database name), or `Seq[String]` (optional database name plus table name) to identify tables internally.
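
To make the idea concrete, here is a hypothetical Python model of such an identifier. It only mirrors the concept of pairing a table name with an optional database name; the field and method names are assumptions, not the API introduced by this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableIdentifier:
    table: str
    database: Optional[str] = None  # None means "use the current database"

    def with_database(self, database):
        return TableIdentifier(self.table, database)

    def quoted(self):
        parts = ([self.database] if self.database else []) + [self.table]
        return ".".join(f"`{part}`" for part in parts)


unqualified = TableIdentifier("events")
qualified = unqualified.with_database("analytics")
```

Unlike a bare `String` or a `Seq[String]`, the structured value makes it explicit which part is the database and which is the table.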

Author: Cheng Lian <lian@databricks.com>

Closes #7623 from liancheng/spark-8131-multi-db and squashes the following commits:

f3bcd4b [Cheng Lian] Addresses PR comments
e0eb76a [Cheng Lian] Fixes styling issues
41e2207 [Cheng Lian] Fixes multi-database support
d4d1ec2 [Cheng Lian] Adds multi-database test cases
2015-07-27 17:15:35 +08:00
Wenchen Fan 4ffd3a1db5 [SPARK-9371][SQL] fix the support for special chars in column names for hive context
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7684 from cloud-fan/hive and squashes the following commits:

da21ffe [Wenchen Fan] fix the support for special chars in column names for hive context
2015-07-26 23:58:03 -07:00
Reynold Xin aa80c64fcf [SPARK-9368][SQL] Support get(ordinal, dataType) generic getter in UnsafeRow.
Author: Reynold Xin <rxin@databricks.com>

Closes #7682 from rxin/unsaferow-generic-getter and squashes the following commits:

3063788 [Reynold Xin] Reset the change for real this time.
0f57c55 [Reynold Xin] Reset the changes in ExpressionEvalHelper.
fb6ca30 [Reynold Xin] Support BinaryType.
24a3e46 [Reynold Xin] Added support for DateType/TimestampType.
9989064 [Reynold Xin] JoinedRow.
11f80a3 [Reynold Xin] [SPARK-9368][SQL] Support get(ordinal, dataType) generic getter in UnsafeRow.
2015-07-26 23:01:04 -07:00
Liang-Chi Hsieh 945d8bcbf6 [SPARK-9306] [SQL] Don't use SortMergeJoin when joining on unsortable columns
JIRA: https://issues.apache.org/jira/browse/SPARK-9306

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7645 from viirya/smj_unsortable and squashes the following commits:

a240707 [Liang-Chi Hsieh] Use forall instead of exists for readability.
55221fa [Liang-Chi Hsieh] Shouldn't use SortMergeJoin when joining on unsortable columns.
2015-07-26 22:13:37 -07:00
Cheng Hao 1efe97dc9e [SPARK-8867][SQL] Support list / describe function usage
As Hive does, we need to list all registered UDFs and their usage for users.

We add the annotation to describe a UDF, so we can get the literal description info while registering the UDF.
e.g.
```scala
@ExpressionDescription(
    usage = "_FUNC_(expr) - Returns the absolute value of the numeric value",
    extended = """> SELECT _FUNC_('-1')
                  1""")
case class Abs(child: Expression) extends UnaryArithmetic {
...
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7259 from chenghao-intel/desc_function and squashes the following commits:

cf29bba [Cheng Hao] fixing the code style issue
5193855 [Cheng Hao] Add more powerful parser for show functions
c645a6b [Cheng Hao] fix bug in unit test
78d40f1 [Cheng Hao] update the padding issue for usage
48ee4b3 [Cheng Hao] update as feedback
70eb4e9 [Cheng Hao] add show/describe function support
2015-07-26 18:34:19 -07:00
Cheng Lian c025c3d0a1 [SPARK-9095] [SQL] Removes the old Parquet support
This PR removes the old Parquet support:

- Removes the old `ParquetRelation` together with related SQL configuration, plan nodes, strategies, utility classes, and test suites.

- Renames `ParquetRelation2` to `ParquetRelation`

- Renames `RowReadSupport` and `RowRecordMaterializer` to `CatalystReadSupport` and `CatalystRecordMaterializer` respectively, and moves them to separate files.

  This follows the naming convention used in other Parquet data models implemented in parquet-mr. It should be easier for developers who are familiar with Parquet to follow.

There's still some other code that can be cleaned up, especially `RowWriteSupport`, but I'd like to leave this part to SPARK-8848.

Author: Cheng Lian <lian@databricks.com>

Closes #7441 from liancheng/spark-9095 and squashes the following commits:

c7b6e38 [Cheng Lian] Removes WriteToFile
2d688d6 [Cheng Lian] Renames ParquetRelation2 to ParquetRelation
ca9e1b7 [Cheng Lian] Removes old Parquet support
2015-07-26 16:49:19 -07:00
Yijie Shen fb5d43fb25 [SPARK-9356][SQL]Remove the internal use of DecimalType.Unlimited
JIRA: https://issues.apache.org/jira/browse/SPARK-9356

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7671 from yjshen/deprecated_unlimit and squashes the following commits:

c707f56 [Yijie Shen] remove pattern matching in changePrecision
4a1823c [Yijie Shen] remove internal occurrence of Decimal.Unlimited
2015-07-26 10:29:22 -07:00
Reynold Xin 6c400b4f39 [SPARK-9354][SQL] Remove InternalRow.get generic getter call in Hive integration code.
Replaced them with get(ordinal, datatype) so we can use UnsafeRow here.

I passed the data types throughout.

Author: Reynold Xin <rxin@databricks.com>

Closes #7669 from rxin/row-generic-getter-hive and squashes the following commits:

3467d8e [Reynold Xin] [SPARK-9354][SQL] Remove Internal.get generic getter call in Hive integration code.
2015-07-26 10:27:39 -07:00
Reynold Xin 4a01bfc2a2 [SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType
Currently UnsafeRow cannot support a generic getter. However, if the data type is known, we can support a generic getter.
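
To illustrate why knowing the data type makes a generic getter possible, here is a minimal sketch. `FieldType`, `RawRow` and the case objects are hypothetical, heavily simplified stand-ins, not Spark's `DataType` or `UnsafeRow`.

```scala
// Hypothetical, heavily simplified stand-ins; not Spark's DataType or UnsafeRow.
sealed trait FieldType
case object IntField extends FieldType
case object LongField extends FieldType
case object DoubleField extends FieldType

// Every field is stored as one raw 64-bit word, with no per-field type tag.
final class RawRow(words: Array[Long]) {
  def getInt(ordinal: Int): Int = words(ordinal).toInt
  def getLong(ordinal: Int): Long = words(ordinal)
  def getDouble(ordinal: Int): Double =
    java.lang.Double.longBitsToDouble(words(ordinal))

  // The caller-supplied type selects the decoder; the row alone cannot.
  def get(ordinal: Int, fieldType: FieldType): Any = fieldType match {
    case IntField    => getInt(ordinal)
    case LongField   => getLong(ordinal)
    case DoubleField => getDouble(ordinal)
  }
}
```

Because the row in this sketch holds only raw bits, a getter without the type argument would have no way to decide how to decode a field.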

Author: Reynold Xin <rxin@databricks.com>

Closes #7666 from rxin/generic-getter-with-datatype and squashes the following commits:

ee2874c [Reynold Xin] Add a default implementation for getStruct.
1e109a0 [Reynold Xin] [SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType.
033ee88 [Reynold Xin] Removed getAs in non test code.
2015-07-25 23:52:37 -07:00
Reynold Xin b1f4b4abfd [SPARK-9348][SQL] Remove apply method on InternalRow.
Author: Reynold Xin <rxin@databricks.com>

Closes #7665 from rxin/remove-row-apply and squashes the following commits:

0b43001 [Reynold Xin] support getString in UnsafeRow.
176d633 [Reynold Xin] apply -> get.
2941324 [Reynold Xin] [SPARK-9348][SQL] Remove apply method on InternalRow.
2015-07-25 18:41:51 -07:00
Wenchen Fan 2c94d0f24a [SPARK-9192][SQL] add initialization phase for nondeterministic expression
Currently, nondeterministic expressions are broken without an explicit initialization phase.

Take `MonotonicallyIncreasingID` as an example. This expression needs mutable state to remember how many times it has been evaluated, so we use `transient var count: Long` there. Because it is transient, `count` is reset to 0, and **only** to 0, when the expression is serialized and deserialized, since deserializing a transient variable yields its default value. There is *no way* to use another initial value for `count` until we add the explicit initialization phase.

Another use case is local execution for `LocalRelation`: there is no serialization and deserialization phase, so we can't reset mutable state for it.
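
The transient behavior described above can be reproduced with plain Java serialization. This is a minimal sketch: `CountingExpr`, its `initialize` hook and `TransientDemo` are hypothetical names for illustration, not the classes touched by this PR.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for an expression that carries mutable state.
class CountingExpr extends Serializable {
  // The initializer runs only in the constructor, never during deserialization.
  @transient private var count: Long = 100L

  // Hypothetical explicit initialization hook.
  def initialize(start: Long): Unit = { count = start }

  // Returns the current count, then increments it.
  def next(): Long = { val value = count; count += 1; value }
}

object TransientDemo {
  // Serializes and deserializes a value in memory.
  def roundTrip[T <: java.io.Serializable](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

After `TransientDemo.roundTrip`, the copy starts counting from 0 regardless of the constructor's initial value; only a call to the explicit hook can set a different starting point.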

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7535 from cloud-fan/init and squashes the following commits:

6c6f332 [Wenchen Fan] add test
ef68ff4 [Wenchen Fan] fix comments
9eac85e [Wenchen Fan] move init code to interpreted class
bb7d838 [Wenchen Fan] pulls out nondeterministic expressions into a project
b4a4fc7 [Wenchen Fan] revert a refactor
86fee36 [Wenchen Fan] add initialization phase for nondeterministic expression
2015-07-25 12:10:02 -07:00
Cheng Lian e2ec018e37 [SPARK-9285] [SQL] Fixes Row/InternalRow conversion for HadoopFsRelation
This is a follow-up of #7626. It fixes `Row`/`InternalRow` conversion for data sources extending `HadoopFsRelation` with `needConversion` being `true`.

Author: Cheng Lian <lian@databricks.com>

Closes #7649 from liancheng/spark-9285-conversion-fix and squashes the following commits:

036a50c [Cheng Lian] Addresses PR comment
f6d7c6a [Cheng Lian] Fixes Row/InternalRow conversion for HadoopFsRelation
2015-07-25 11:42:49 -07:00
Reynold Xin 215713e199 [SPARK-9334][SQL] Remove UnsafeRowConverter in favor of UnsafeProjection.
The two are redundant.

Once this patch is merged, I plan to remove the inbound conversions from unsafe aggregates.

Author: Reynold Xin <rxin@databricks.com>

Closes #7658 from rxin/unsafeconverters and squashes the following commits:

ed19e6c [Reynold Xin] Updated support types.
2a56d7e [Reynold Xin] [SPARK-9334][SQL] Remove UnsafeRowConverter in favor of UnsafeProjection.
2015-07-25 01:37:41 -07:00
Reynold Xin f0ebab3f6d [SPARK-9336][SQL] Remove extra JoinedRows
They were added to improve performance (so the JIT can inline the JoinedRow calls). However, we can also improve performance by projecting the output to UnsafeRow in the Tungsten variants of the operators.

Author: Reynold Xin <rxin@databricks.com>

Closes #7659 from rxin/remove-joinedrows and squashes the following commits:

7510447 [Reynold Xin] [SPARK-9336][SQL] Remove extra JoinedRows
2015-07-25 01:28:46 -07:00
JD 723db13e06 [Spark-8668][SQL] Adding expr to functions
Author: JD <jd@csh.rit.edu>
Author: Joseph Batchik <josephbatchik@gmail.com>

Closes #7606 from JDrit/expr and squashes the following commits:

ad7f607 [Joseph Batchik] fixing python linter error
9d6daea [Joseph Batchik] removed order by per @rxin's comment
707d5c6 [Joseph Batchik] Added expr to fuctions.py
79df83c [JD] added example to the docs
b89eec8 [JD] moved function up as per @rxin's comment
4960909 [JD] updated per @JoshRosen's comment
2cb329c [JD] updated per @rxin's comment
9a9ad0c [JD] removing unused import
6dc26d0 [JD] removed split
7f2222c [JD] Adding expr function as per SPARK-8668
2015-07-25 00:34:59 -07:00
Reynold Xin c84acd4aa4 [SPARK-9331][SQL] Add a code formatter to auto-format generated code.
The generated expression code can be hard to read since it is not indented well. This patch adds a code formatter that formats the code automatically when we output it to the screen.

Author: Reynold Xin <rxin@databricks.com>

Closes #7656 from rxin/codeformatter and squashes the following commits:

5ba0e90 [Reynold Xin] [SPARK-9331][SQL] Add a code formatter to auto-format generated code.
2015-07-24 19:35:24 -07:00
Reynold Xin f99cb5615c [SPARK-9330][SQL] Create specialized getStruct getter in InternalRow.
Also took the chance to rearrange some of the methods in UnsafeRow to group static/private/public things together.

Author: Reynold Xin <rxin@databricks.com>

Closes #7654 from rxin/getStruct and squashes the following commits:

b491a09 [Reynold Xin] Fixed typo.
48d77e5 [Reynold Xin] [SPARK-9330][SQL] Create specialized getStruct getter in InternalRow.
2015-07-24 19:29:01 -07:00
Liang-Chi Hsieh 64135cbb33 [SPARK-9067] [SQL] Close reader in NewHadoopRDD early if there is no more data
JIRA: https://issues.apache.org/jira/browse/SPARK-9067

According to the description of the JIRA ticket, calling `reader.close()` only after the task is finished causes memory and open-file-limit problems, since these resources stay occupied even when we no longer need them.

This PR simply closes the reader early when we know there is no more data to read.
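
The idea can be sketched as an iterator wrapper that releases its resource as soon as the input is exhausted. `EagerClosingIterator` is a hypothetical name for this example and is not code taken from NewHadoopRDD.

```scala
import java.io.Closeable

// Hypothetical wrapper; the names here are illustrative, not taken from NewHadoopRDD.
final class EagerClosingIterator[A](underlying: Iterator[A], resource: Closeable)
    extends Iterator[A] {
  private var closed = false

  // Idempotent, so a later task-completion callback can call it again safely.
  def close(): Unit = if (!closed) { closed = true; resource.close() }

  override def hasNext: Boolean = {
    val more = !closed && underlying.hasNext
    if (!more) close() // release the resource as soon as the input is exhausted
    more
  }

  override def next(): A = {
    if (!hasNext) throw new NoSuchElementException("end of input")
    underlying.next()
  }
}
```

Keeping `close()` idempotent lets the existing end-of-task cleanup remain in place as a fallback for iterators that are never fully consumed.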

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7424 from viirya/close_reader and squashes the following commits:

3ff64e5 [Liang-Chi Hsieh] For comments.
3d20267 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into close_reader
e152182 [Liang-Chi Hsieh] For comments.
5116cbe [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into close_reader
3ceb755 [Liang-Chi Hsieh] For comments.
e34d98e [Liang-Chi Hsieh] For comments.
50ed729 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into close_reader
216912f [Liang-Chi Hsieh] Fix it.
f429016 [Liang-Chi Hsieh] Release reader if we don't need it.
a305621 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into close_reader
67569da [Liang-Chi Hsieh] Close reader early if there is no more data.
2015-07-24 12:36:44 -07:00
Josh Rosen 6aceaf3d62 [SPARK-9295] Analysis should detect sorting on unsupported column types
This patch extends CheckAnalysis to throw errors for queries that try to sort on unsupported column types, such as ArrayType.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7633 from JoshRosen/SPARK-9295 and squashes the following commits:

23b2fbf [Josh Rosen] Embed function in foreach
bfe1451 [Josh Rosen] Update to allow sorting by null literals
2f1b802 [Josh Rosen] Add analysis rule to detect sorting on unsupported column types (SPARK-9295)
2015-07-24 11:34:23 -07:00
Josh Rosen c2b50d693e [SPARK-9292] Analysis should check that join conditions' data types are BooleanType
This patch adds an analysis check to ensure that join conditions' data types are BooleanType. This check is necessary in order to report proper errors for non-boolean DataFrame join conditions.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7630 from JoshRosen/SPARK-9292 and squashes the following commits:

aec6c7b [Josh Rosen] Check condition type in resolved()
75a3ea6 [Josh Rosen] Fix SPARK-9292.
2015-07-24 09:49:50 -07:00
Reynold Xin c8d71a4183 [SPARK-9305] Rename org.apache.spark.Row to Item.
It's a thing used in test cases, but named Row. Pretty annoying because every time I search for Row, it shows up before the Spark SQL Row, which is what a developer wants most of the time.

Author: Reynold Xin <rxin@databricks.com>

Closes #7638 from rxin/remove-row and squashes the following commits:

aeda52d [Reynold Xin] [SPARK-9305] Rename org.apache.spark.Row to Item.
2015-07-24 09:38:13 -07:00
Reynold Xin 431ca39be5 [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
I also changed InternalRow's size/length function to numFields, to make it more obvious that it is not about bytes, but the number of fields.

Author: Reynold Xin <rxin@databricks.com>

Closes #7626 from rxin/internalRow and squashes the following commits:

e124daf [Reynold Xin] Fixed test case.
805ceb7 [Reynold Xin] Commented out the failed test suite.
f8a9ca5 [Reynold Xin] Fixed more bugs. Still at least one more remaining.
76d9081 [Reynold Xin] Fixed data sources.
7807f70 [Reynold Xin] Fixed DataFrameSuite.
cb60cd2 [Reynold Xin] Code review & small bug fixes.
0a2948b [Reynold Xin] Fixed style.
3280d03 [Reynold Xin] [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
2015-07-24 09:37:36 -07:00
Davies Liu dfb18be036 [SPARK-9069] [SQL] follow up
Address comments for #7605

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #7634 from davies/decimal_unlimited2 and squashes the following commits:

b2d8b0d [Davies Liu] add doc and test for DecimalType.isWiderThan
65b251c [Davies Liu] fix test
6a91f32 [Davies Liu] fix style
ca9c973 [Davies Liu] address comments
2015-07-24 08:24:13 -07:00
Liang-Chi Hsieh 6a7e537f3a [SPARK-8756] [SQL] Keep cached information and avoid re-calculating footers in ParquetRelation2
JIRA: https://issues.apache.org/jira/browse/SPARK-8756

Currently, in ParquetRelation2, footers are re-read every time refresh() is called. However, we can check whether a footer may have changed before reading it, because reading all footers is expensive when there are many partitions. This PR does so by keeping some cached information to check against.
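
A minimal sketch of such a cache, keyed by path and validated by modification time, might look like the following. `FileStat` and `FooterCache` are hypothetical simplifications for this example; the actual patch works with Hadoop's `FileStatus`.

```scala
// Hypothetical simplification of a file status: just a path and a modification time.
final case class FileStat(path: String, modificationTime: Long)

// Caches one footer per path and re-reads it only when the modification time changes.
final class FooterCache[F](readFooter: FileStat => F) {
  private var cached = Map.empty[String, (Long, F)]

  def refresh(current: Seq[FileStat]): Map[String, F] = {
    cached = current.map { stat =>
      val entry = cached.get(stat.path) match {
        // Same modification time: reuse the cached footer.
        case Some(hit @ (mtime, _)) if mtime == stat.modificationTime => hit
        // New or modified file: read the footer again.
        case _ => (stat.modificationTime, readFooter(stat))
      }
      stat.path -> entry
    }.toMap
    cached.map { case (path, (_, footer)) => path -> footer }
  }
}
```

Files that disappear between refreshes drop out of the cache in this sketch, because the map is rebuilt from the current listing each time.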

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7154 from viirya/cached_footer_parquet_relation and squashes the following commits:

92e9347 [Liang-Chi Hsieh] Fix indentation.
ae0ec64 [Liang-Chi Hsieh] Fix wrong assignment.
c8fdfb7 [Liang-Chi Hsieh] Fix it.
a52b6d1 [Liang-Chi Hsieh] For comments.
c2a2420 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cached_footer_parquet_relation
fa5458f [Liang-Chi Hsieh] Use Map to cache FileStatus and do merging previously loaded schema and newly loaded one.
6ae0911 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cached_footer_parquet_relation
21bbdec [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cached_footer_parquet_relation
12a0ed9 [Liang-Chi Hsieh] Add check of FileStatus's modification time.
186429d [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cached_footer_parquet_relation
0ef8caf [Liang-Chi Hsieh] Keep cached information and avoid re-calculating footers.
2015-07-24 17:39:57 +08:00
Reynold Xin cb8c241f05 [SPARK-9200][SQL] Don't implicitly cast non-atomic types to string type.
Author: Reynold Xin <rxin@databricks.com>

Closes #7636 from rxin/complex-string-implicit-cast and squashes the following commits:

3e67327 [Reynold Xin] [SPARK-9200][SQL] Don't implicitly cast non-atomic types to string type.
2015-07-24 01:18:43 -07:00
Wenchen Fan 408e64b284 [SPARK-9294][SQL] cleanup comments, code style, naming typo for the new aggregation
fix some comments and code style for https://github.com/apache/spark/pull/7458

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7619 from cloud-fan/agg-clean and squashes the following commits:

3925457 [Wenchen Fan] one more...
cc78357 [Wenchen Fan] one more cleanup
26f6a93 [Wenchen Fan] some minor cleanup for the new aggregation
2015-07-23 23:40:01 -07:00
Davies Liu 8a94eb23d5 [SPARK-9069] [SPARK-9264] [SQL] remove unlimited precision support for DecimalType
Remove Decimal.Unlimited (change to support precision up to 38, to match Hive and other databases).

To keep backward source compatibility, Decimal.Unlimited is still there, but is changed to Decimal(38, 18).

If no precision and scale are provided, it's Decimal(10, 0) as before.

Author: Davies Liu <davies@databricks.com>

Closes #7605 from davies/decimal_unlimited and squashes the following commits:

aa3f115 [Davies Liu] fix tests and style
fb0d20d [Davies Liu] address comments
bfaae35 [Davies Liu] fix style
df93657 [Davies Liu] address comments and clean up
06727fd [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_unlimited
4c28969 [Davies Liu] fix tests
8d783cc [Davies Liu] fix tests
788631c [Davies Liu] fix double with decimal in Union/except
1779bde [Davies Liu] fix scala style
c9c7c78 [Davies Liu] remove Decimal.Unlimited
2015-07-23 18:31:13 -07:00
Cheng Lian bebe3f7b45 [SPARK-9207] [SQL] Enables Parquet filter push-down by default
PARQUET-136 and PARQUET-173 have been fixed in parquet-mr 1.7.0. It's time to enable filter push-down by default now.

Author: Cheng Lian <lian@databricks.com>

Closes #7612 from liancheng/spark-9207 and squashes the following commits:

77e6b5e [Cheng Lian] Enables Parquet filter push-down by default
2015-07-23 17:49:33 -07:00
Josh Rosen b2f3aca1e8 [SPARK-9286] [SQL] Methods in Unevaluable should be final and AlgebraicAggregate should extend Unevaluable.
This patch marks the Unevaluable.eval() and Unevaluable.genCode() methods as final and fixes two cases where they were overridden. It also updates AggregateFunction2 to extend Unevaluable.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7627 from JoshRosen/unevaluable-fix and squashes the following commits:

8d9ed22 [Josh Rosen] AlgebraicAggregate should extend Unevaluable
65329c2 [Josh Rosen] Do not have AggregateFunction1 inherit from AggregateExpression1
fa68a22 [Josh Rosen] Make eval() and genCode() final
2015-07-23 16:08:07 -07:00
David Arroyo Cazorla 662d60db3f [SPARK-5447][SQL] Replace reference 'schema rdd' with DataFrame @rxin.
Author: David Arroyo Cazorla <darroyo@stratio.com>

Closes #7618 from darroyocazorla/master and squashes the following commits:

5f91379 [David Arroyo Cazorla] [SPARK-5447][SQL] Replace reference 'schema rdd' with DataFrame
2015-07-23 10:34:32 -07:00
Xiangrui Meng ecfb312767 [SPARK-9243] [Documentation] null -> zero in crosstab doc
We forgot to update the doc. brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #7608 from mengxr/SPARK-9243 and squashes the following commits:

0ea3236 [Xiangrui Meng] null -> zero in crosstab doc
2015-07-23 10:32:11 -07:00
Cheng Hao 19aeab57c1 [Build][Minor] Fix building error & performance
1. When building the latest code with sbt, it throws an exception like:
[error] /home/hcheng/git/catalyst/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala:78: match may not be exhaustive.
[error] It would fail on the following input: UNKNOWN
[error]       val classNameByStatus = status match {
[error]

2. Potential performance issue when implicitly converting an Array[Any] to Seq[Any]

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7611 from chenghao-intel/toseq and squashes the following commits:

cab75c5 [Cheng Hao] remove the toArray
24df682 [Cheng Hao] fix building error & performance
2015-07-23 10:28:20 -07:00
Wenchen Fan 52ef76de21 [SPARK-9082] [SQL] [FOLLOW-UP] use partition in PushPredicateThroughProject
a follow up of https://github.com/apache/spark/pull/7446

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7607 from cloud-fan/tmp and squashes the following commits:

7106989 [Wenchen Fan] use `partition` in `PushPredicateThroughProject`
2015-07-23 09:38:02 -07:00
Reynold Xin fb36397b3c Revert "[SPARK-8579] [SQL] support arbitrary object in UnsafeRow"
Reverts ObjectPool. As it stands, it has a few problems:

1. ObjectPool doesn't work with spilling and memory accounting.
2. I don't think in the long run the idea of an object pool is what we want to support, since it essentially goes back to unmanaged memory, and creates pressure on GC, and is hard to account for the total in memory size.
3. The ObjectPool patch removed the specialized getters for strings and binary, and as a result, actually introduced branches when reading non primitive data types.

If we do want to support arbitrary user defined types in the future, I think we can just add an object array in UnsafeRow, rather than relying on indirect memory addressing through a pool. We also need to pick execution strategies that are optimized for those, rather than keeping a lot of unserialized JVM objects in memory during aggregation.

This is probably the hardest thing I had to revert in Spark, due to recent patches that also change the same part of the code. Would be great to get a careful look.

Author: Reynold Xin <rxin@databricks.com>

Closes #7591 from rxin/revert-object-pool and squashes the following commits:

01db0bc [Reynold Xin] Scala style.
eda89fc [Reynold Xin] Fixed describe.
2967118 [Reynold Xin] Fixed accessor for JoinedRow.
e3294eb [Reynold Xin] Merge branch 'master' into revert-object-pool
657855f [Reynold Xin] Temp commit.
c20f2c8 [Reynold Xin] Style fix.
fe37079 [Reynold Xin] Revert "[SPARK-8579] [SQL] support arbitrary object in UnsafeRow"
2015-07-23 01:51:34 -07:00
Yijie Shen 6d0d8b4069 [SPARK-8935] [SQL] Implement code generation for all casts
JIRA: https://issues.apache.org/jira/browse/SPARK-8935

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7365 from yjshen/cast_codegen and squashes the following commits:

ef6e8b5 [Yijie Shen] getColumn and setColumn in struct cast, autounboxing in array and map
eaece18 [Yijie Shen] remove null case in cast code gen
fd7eba4 [Yijie Shen] resolve comments
80378a5 [Yijie Shen] the missing self cast
611d66e [Yijie Shen] Bug fix: NullType & primitive object unboxing
6d5c0fe [Yijie Shen] rebase and add Interval codegen
9424b65 [Yijie Shen] tiny style fix
4a1c801 [Yijie Shen] remove CodeHolder class, use function instead.
3f5df88 [Yijie Shen] CodeHolder for complex dataTypes
c286f13 [Yijie Shen] moved all the cast code into class body
4edfd76 [Yijie Shen] [WIP] finished primitive part
2015-07-22 23:44:08 -07:00
Josh Rosen b217230f2a [SPARK-9144] Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
Spark has an option called spark.localExecution.enabled; according to the docs:

> Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver.

This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5.

This pull request simply brings #7484 up to date.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #7585 from rxin/remove-local-exec and squashes the following commits:

84bd10e [Reynold Xin] Python fix.
1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution
eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it.
b0835dc [Josh Rosen] Remove local execution code in DAGScheduler
8975d96 [Josh Rosen] Remove local execution tests.
ffa8c9b [Josh Rosen] Remove documentation for configuration
2015-07-22 21:04:04 -07:00
Reynold Xin d71a13f475 [SPARK-9262][build] Treat Scala compiler warnings as errors
I've seen a few cases in the past few weeks in which the compiler throws warnings that are caused by legitimate bugs. This patch upgrades warnings to errors, except deprecation warnings.

Note that ideally we should be able to mark deprecation warnings as errors as well. However, due to the lack of ability to suppress individual warning messages in the Scala compiler, we cannot do that (since we do need to access deprecated APIs in Hadoop).

Most of the work was done by ericl.

Author: Reynold Xin <rxin@databricks.com>
Author: Eric Liang <ekl@databricks.com>

Closes #7598 from rxin/warnings and squashes the following commits:

beb311b [Reynold Xin] Fixed tests.
542c031 [Reynold Xin] Fixed one more warning.
87c354a [Reynold Xin] Fixed all non-deprecation warnings.
78660ac [Eric Liang] first effort to fix warnings
2015-07-22 21:02:19 -07:00
Matei Zaharia fe26584a1f [SPARK-9244] Increase some memory defaults
There are a few memory limits that people hit often and that we could
make higher, especially now that memory sizes have grown.

- spark.akka.frameSize: This defaults to 10 but is often hit for map
  output statuses in large shuffles. This memory is not fully allocated
  up-front, so we can just make this larger and still not affect jobs
  that never sent a status that large. We increase it to 128.

- spark.executor.memory: Defaults to 512m, which is really small. We
  increase it to 1g.

Author: Matei Zaharia <matei@databricks.com>

Closes #7586 from mateiz/configs and squashes the following commits:

ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
2015-07-22 15:28:09 -07:00
Yin Huai cf21d05f8b [SPARK-4366] [SQL] [Follow-up] Fix SqlParser compiling warning.
Author: Yin Huai <yhuai@databricks.com>

Closes #7588 from yhuai/SPARK-4366-update1 and squashes the following commits:

25f5f36 [Yin Huai] Fix SqlParser Warning.
2015-07-22 13:28:09 -07:00
Davies Liu e0b7ba59a1 [SPARK-9024] Unsafe HashJoin/HashOuterJoin/HashSemiJoin
This PR introduces an unsafe version (using UnsafeRow) of HashJoin, HashOuterJoin and HashSemiJoin, including the broadcast and shuffle variants (except FullOuterJoin, which is better implemented using SortMergeJoin).

It uses a HashMap to store UnsafeRows for now; this will change to BytesToBytesMap for better performance (in another PR).
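
For readers unfamiliar with the build and probe phases of a hash join, here is a minimal generic sketch. It uses ordinary Scala collections and tuples rather than UnsafeRow, and `HashJoinSketch.hashJoin` is a hypothetical helper written for this example, not the operator added in this PR.

```scala
object HashJoinSketch {
  // Build phase: index the build side by key in an in-memory map.
  // Probe phase: stream the other side and look up matches per row.
  def hashJoin[K, L, R](build: Seq[L], stream: Iterator[R])(
      buildKey: L => K,
      streamKey: R => K): Iterator[(L, R)] = {
    val table: Map[K, Seq[L]] = build.groupBy(buildKey)
    stream.flatMap { right =>
      table.getOrElse(streamKey(right), Nil).map(left => (left, right))
    }
  }
}
```

This sketch performs an inner join only; outer and semi variants would differ in what they emit for stream rows with no match.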

Author: Davies Liu <davies@databricks.com>

Closes #7480 from davies/unsafe_join and squashes the following commits:

6294b1e [Davies Liu] fix projection
10583f1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
dede020 [Davies Liu] fix test
84c9807 [Davies Liu] address comments
a05b4f6 [Davies Liu] support UnsafeRow in LeftSemiJoinBNL and BroadcastNestedLoopJoin
611d2ed [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
9481ae8 [Davies Liu] return UnsafeRow after join()
ca2b40f [Davies Liu] revert unrelated change
68f5cd9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
0f4380d [Davies Liu] ada a comment
69e38f5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
1a40f02 [Davies Liu] refactor
ab1690f [Davies Liu] address comments
60371f2 [Davies Liu] use UnsafeRow in SemiJoin
a6c0b7d [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
184b852 [Davies Liu] fix style
6acbb11 [Davies Liu] fix tests
95d0762 [Davies Liu] remove println
bea4a50 [Davies Liu] Unsafe HashJoin
2015-07-22 13:02:43 -07:00
Yijie Shen 86f80e2b47 [SPARK-9165] [SQL] codegen for CreateArray, CreateStruct and CreateNamedStruct
JIRA: https://issues.apache.org/jira/browse/SPARK-9165

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7537 from yjshen/array_struct_codegen and squashes the following commits:

3a6dce6 [Yijie Shen] use infix notion in createArray test
5e90f0a [Yijie Shen] resolve comments: classOf
39cefb8 [Yijie Shen] codegen for createArray createStruct & createNamedStruct
2015-07-22 12:19:59 -07:00
Wenchen Fan 76520955fd [SPARK-9082] [SQL] Filter using non-deterministic expressions should not be pushed down
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7446 from cloud-fan/filter and squashes the following commits:

330021e [Wenchen Fan] add exists to tree node
2cab68c [Wenchen Fan] more enhance
949be07 [Wenchen Fan] push down part of predicate if possible
3912f84 [Wenchen Fan] address comments
8ce15ca [Wenchen Fan] fix bug
557158e [Wenchen Fan] Filter using non-deterministic expressions should not be pushed down
2015-07-22 11:45:51 -07:00
Yin Huai c03299a18b [SPARK-4233] [SPARK-4367] [SPARK-3947] [SPARK-3056] [SQL] Aggregation Improvement
This is the first PR for the aggregation improvement, which is tracked by https://issues.apache.org/jira/browse/SPARK-4366 (umbrella JIRA). This PR contains work for its subtasks, SPARK-3056, SPARK-3947, SPARK-4233, and SPARK-4367.

This PR introduces a new code path for evaluating aggregate functions. This code path is guarded by `spark.sql.useAggregate2` and by default the value of this flag is true.

This new code path contains:
* A new aggregate function interface (`AggregateFunction2`) and 7 built-in aggregate functions based on this new interface (`AVG`, `COUNT`, `FIRST`, `LAST`, `MAX`, `MIN`, `SUM`)
* A UDAF interface (`UserDefinedAggregateFunction`) based on the new code path and two example UDAFs (`MyDoubleAvg` and `MyDoubleSum`).
* A sort-based aggregate operator (`Aggregate2Sort`) for the new aggregate function interface.
* A sort-based aggregate operator (`FinalAndCompleteAggregate2Sort`) for distinct aggregations (for distinct aggregations the query plan will use `Aggregate2Sort` and `FinalAndCompleteAggregate2Sort` together).
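
To give a feel for what a buffer-based aggregate interface with partial aggregation looks like, here is a minimal sketch. `SketchAggregate` and `DoubleAvgSketch` are hypothetical and much simpler than the `AggregateFunction2` and `UserDefinedAggregateFunction` interfaces introduced by this PR; the method names are assumptions made for this example.

```scala
// Hypothetical, simplified aggregate interface; not Spark's AggregateFunction2.
trait SketchAggregate[In, Buf, Out] {
  def zero: Buf                            // empty aggregation buffer
  def update(buffer: Buf, input: In): Buf  // fold one input row into a buffer
  def merge(left: Buf, right: Buf): Buf    // combine two partial buffers
  def eval(buffer: Buf): Out               // produce the final result
}

// Average of doubles, kept as a (sum, count) buffer so partial results can be merged.
object DoubleAvgSketch extends SketchAggregate[Double, (Double, Long), Option[Double]] {
  def zero: (Double, Long) = (0.0, 0L)
  def update(buffer: (Double, Long), input: Double): (Double, Long) =
    (buffer._1 + input, buffer._2 + 1)
  def merge(left: (Double, Long), right: (Double, Long)): (Double, Long) =
    (left._1 + right._1, left._2 + right._2)
  def eval(buffer: (Double, Long)): Option[Double] =
    if (buffer._2 == 0) None else Some(buffer._1 / buffer._2)
}
```

Each partition can fold its rows with `update`, and the partial buffers can then be combined with `merge` before `eval` produces the final value.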

With this change, when `spark.sql.useAggregate2` is `true`, the flow of compiling an aggregation query is:
1. Our analyzer looks up functions and returns aggregate functions built based on the old aggregate function interface.
2. When our planner is compiling the physical plan, it tries to convert all aggregate functions to the ones built based on the new interface. The planner will fall back to the old code path if any of the following conditions is true:
* code-gen is disabled.
* there is any function that cannot be converted (right now, Hive UDAFs).
* the schema of grouping expressions contain any complex data type.
* There are multiple distinct columns.

Right now, the new code path handles a single distinct column in the query (you can have multiple aggregate functions using that distinct column). For a query having an aggregate function with DISTINCT and regular aggregate functions, the generated plan will do partial aggregations for those regular aggregate functions.

Thanks chenghao-intel for his initial work on it.

Author: Yin Huai <yhuai@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #7458 from yhuai/UDAF and squashes the following commits:

7865f5e [Yin Huai] Put the catalyst expression in the comment of the generated code for it.
b04d6c8 [Yin Huai] Remove unnecessary change.
f1d5901 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
35b0520 [Yin Huai] Use semanticEquals to replace grouping expressions in the output of the aggregate operator.
3b43b24 [Yin Huai] bug fix.
00eb298 [Yin Huai] Make it compile.
a3ca551 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
e0afca3 [Yin Huai] Gracefully fallback to old aggregation code path.
8a8ac4a [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
88c7d4d [Yin Huai] Enable spark.sql.useAggregate2 by default for testing purpose.
dc96fd1 [Yin Huai] Many updates:
85c9c4b [Yin Huai] newline.
43de3de [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
c3614d7 [Yin Huai] Handle single distinct column.
68b8ee9 [Yin Huai] Support single distinct column set. WIP
3013579 [Yin Huai] Format.
d678aee [Yin Huai] Remove AggregateExpressionSuite.scala since our built-in aggregate functions will be based on AlgebraicAggregate and we need to have another way to test it.
e243ca6 [Yin Huai] Add aggregation iterators.
a101960 [Yin Huai] Change MyJavaUDAF to MyDoubleSum.
594cdf5 [Yin Huai] Change existing AggregateExpression to AggregateExpression1 and add an AggregateExpression as the common interface for both AggregateExpression1 and AggregateExpression2.
380880f [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
0a827b3 [Yin Huai] Add comments and doc. Move some classes to the right places.
a19fea6 [Yin Huai] Add UDAF interface.
262d4c4 [Yin Huai] Make it compile.
b2e358e [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
6edb5ac [Yin Huai] Format update.
70b169c [Yin Huai] Remove groupOrdering.
4721936 [Yin Huai] Add CheckAggregateFunction to extendedCheckRules.
d821a34 [Yin Huai] Cleanup.
32aea9c [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
5b46d41 [Yin Huai] Bug fix.
aff9534 [Yin Huai] Make Aggregate2Sort work with both algebraic AggregateFunctions and non-algebraic AggregateFunctions.
2857b55 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
4435f20 [Yin Huai] Add ConvertAggregateFunction to HiveContext's analyzer.
1b490ed [Michael Armbrust] make hive test
8cfa6a9 [Michael Armbrust] add test
1b0bb3f [Yin Huai] Do not bind references in AlgebraicAggregate and use code gen for all places.
072209f [Yin Huai] Bug fix: Handle expressions in grouping columns that are not attribute references.
f7d9e54 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into UDAF
39ee975 [Yin Huai] Code cleanup: Remove unnecesary AttributeReferences.
b7720ba [Yin Huai] Add an analysis rule to convert aggregate function to the new version.
5c00f3f [Michael Armbrust] First draft of codegen
6bbc6ba [Michael Armbrust] now with correct answers!
f7996d0 [Michael Armbrust] Add AlgebraicAggregate
dded1c5 [Yin Huai] wip
2015-07-21 23:26:11 -07:00
Andrew Or f4785f5b82 [SPARK-9232] [SQL] Duplicate code in JSONRelation
Author: Andrew Or <andrew@databricks.com>

Closes #7576 from andrewor14/clean-up-json-relation and squashes the following commits:

ea80803 [Andrew Or] Clean up duplicate code
2015-07-21 23:00:13 -07:00
Reynold Xin a4c83cb1e4 [SPARK-9154][SQL] Rename formatString to format_string.
Also make format_string the canonical form, rather than printf.

Author: Reynold Xin <rxin@databricks.com>

Closes #7579 from rxin/format_strings and squashes the following commits:

53ee54f [Reynold Xin] Fixed unit tests.
52357e1 [Reynold Xin] Add format_string alias.
b40a42a [Reynold Xin] [SPARK-9154][SQL] Rename formatString to format_string.
2015-07-21 19:14:07 -07:00
Tarek Auel d4c7a7a364 [SPARK-9154] [SQL] codegen StringFormat
Jira: https://issues.apache.org/jira/browse/SPARK-9154

fixes bug of #7546

marmbrus I can't reopen the other PR, because I didn't close it. Can you trigger Jenkins?

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7571 from tarekauel/SPARK-9154 and squashes the following commits:

dcae272 [Tarek Auel] [SPARK-9154][SQL] build fix
1487602 [Tarek Auel] Merge remote-tracking branch 'upstream/master' into SPARK-9154
f512c5f [Tarek Auel] [SPARK-9154][SQL] build fix
a943d3e [Tarek Auel] [SPARK-9154] implicit input cast, added tests for null, support for null primitives
10b4de8 [Tarek Auel] [SPARK-9154][SQL] codegen removed fallback trait
cd8322b [Tarek Auel] [SPARK-9154][SQL] codegen string format
086caba [Tarek Auel] [SPARK-9154][SQL] codegen string format
2015-07-21 15:47:40 -07:00
Dennis Huo c07838b5a9 [SPARK-9206] [SQL] Fix HiveContext classloading for GCS connector.
IsolatedClientLoader.isSharedClass includes all of com.google.*, presumably
for Guava, protobuf, and/or other shared Google libraries, but needs to
count com.google.cloud.* as "hive classes" when determining which ClassLoader
to use. Otherwise, things like HiveContext.parquetFile will throw a
ClassCastException when fs.defaultFS is set to a Google Cloud Storage (gs://)
path. On StackOverflow: http://stackoverflow.com/questions/31478955
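
The prefix rule described above can be sketched as follows. This is a plain-Python model covering only the com.google prefixes named in this message; `is_shared_class` is a hypothetical name, and the real check covers other packages that are not modeled here.

```python
def is_shared_class(name: str) -> bool:
    """Model of the rule: com.google.* is shared, except com.google.cloud.*."""
    if name.startswith("com.google.cloud."):
        # GCS connector classes are treated as "hive classes", not shared.
        return False
    return name.startswith("com.google.")
```

Checking the more specific prefix first keeps the exception from being shadowed by the broader com.google rule.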

EDIT: Adding yhuai who worked on the relevant classloading isolation pieces.

Author: Dennis Huo <dhuo@google.com>

Closes #7549 from dennishuo/dhuo-fix-hivecontext-gcs and squashes the following commits:

1f8db07 [Dennis Huo] Fix HiveContext classloading for GCS connector.
2015-07-21 13:12:11 -07:00
Reynold Xin 60c0ce134d [SPARK-8906][SQL] Move all internal data source classes into execution.datasources.
This way, the sources package contains only public facing interfaces.

Author: Reynold Xin <rxin@databricks.com>

Closes #7565 from rxin/move-ds and squashes the following commits:

7661aff [Reynold Xin] Mima
9d5196a [Reynold Xin] Rearranged imports.
3dd7174 [Reynold Xin] [SPARK-8906][SQL] Move all internal data source classes into execution.datasources.
2015-07-21 11:56:38 -07:00
navis.ryu 9ba7c64dec [SPARK-8357] Fix unsafe memory leak on empty inputs in GeneratedAggregate
This patch fixes a managed memory leak in GeneratedAggregate.  The leak occurs when the unsafe aggregation path is used to perform grouped aggregation on an empty input; in this case, GeneratedAggregate allocates an UnsafeFixedWidthAggregationMap that is never cleaned up because `next()` is never called on the aggregate result iterator.

This patch fixes this by short-circuiting on empty inputs.
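
A minimal sketch of the short-circuit idea, in plain Python rather than the actual operator code: `AggregationMap` is a hypothetical stand-in for a manually managed buffer, and `grouped_sum` is an illustrative aggregation that returns before allocating anything when the input is empty.

```python
from itertools import chain


class AggregationMap:
    """Hypothetical stand-in for a manually managed aggregation buffer."""
    live = 0  # buffers allocated but not yet freed

    def __init__(self):
        AggregationMap.live += 1
        self.sums = {}

    def free(self):
        AggregationMap.live -= 1


def grouped_sum(rows):
    """Sum values per key; rows is an iterable of (key, value) pairs."""
    rows = iter(rows)
    try:
        first = next(rows)
    except StopIteration:
        # Short-circuit: the input is empty, so never allocate the buffer.
        return iter(())
    buffer = AggregationMap()
    try:
        for key, value in chain([first], rows):
            buffer.sums[key] = buffer.sums.get(key, 0) + value
        result = sorted(buffer.sums.items())
    finally:
        buffer.free()  # released even if aggregation raises
    return iter(result)
```

Because the empty case returns before any allocation, cleanup no longer depends on the caller pulling from the result iterator.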

This patch is an updated version of #6810.

Closes #6810.

Author: navis.ryu <navis@apache.org>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #7560 from JoshRosen/SPARK-8357 and squashes the following commits:

3486ce4 [Josh Rosen] Some minor cleanup
c649310 [Josh Rosen] Revert SparkPlan change:
3c7db0f [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-8357
adc8239 [Josh Rosen] Back out Projection changes.
c5419b3 [navis.ryu] addressed comments
143e1ef [navis.ryu] fixed format & added test for CCE case
735972f [navis.ryu] used new conf apis
1a02a55 [navis.ryu] Rolled-back test-conf cleanup & fixed possible CCE & added more tests
51178e8 [navis.ryu] addressed comments
4d326b9 [navis.ryu] fixed test fails
15c5afc [navis.ryu] added a test as suggested by JoshRosen
d396589 [navis.ryu] added comments
1b07556 [navis.ryu] [SPARK-8357] [SQL] Memory leakage on unsafe aggregation path with empty input
2015-07-21 11:52:52 -07:00
Michael Armbrust 87d890cc10 Revert "[SPARK-9154] [SQL] codegen StringFormat"
This reverts commit 7f072c3d5e.

Revert #7546

Author: Michael Armbrust <michael@databricks.com>

Closes #7570 from marmbrus/revert9154 and squashes the following commits:

ed2c32a [Michael Armbrust] Revert "[SPARK-9154] [SQL] codegen StringFormat"
2015-07-21 11:18:39 -07:00
Tarek Auel 7f072c3d5e [SPARK-9154] [SQL] codegen StringFormat
Jira: https://issues.apache.org/jira/browse/SPARK-9154

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7546 from tarekauel/SPARK-9154 and squashes the following commits:

a943d3e [Tarek Auel] [SPARK-9154] implicit input cast, added tests for null, support for null primitives
10b4de8 [Tarek Auel] [SPARK-9154][SQL] codegen removed fallback trait
cd8322b [Tarek Auel] [SPARK-9154][SQL] codegen string format
086caba [Tarek Auel] [SPARK-9154][SQL] codegen string format
2015-07-21 09:58:16 -07:00
Yijie Shen be5c5d3741 [SPARK-9081] [SPARK-9168] [SQL] nanvl & dropna/fillna supporting nan as well
JIRA:
https://issues.apache.org/jira/browse/SPARK-9081
https://issues.apache.org/jira/browse/SPARK-9168

This PR targets three modifications:
1.  Change `isNaN` to return `false` on `null` input
2.  Make `dropna` and `fillna` drop/fill NaN values as well
3.  Implement `nanvl`
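
The intended semantics can be modeled in plain Python, with `None` standing in for SQL `null`. This is a sketch of the behavior listed above, not the Catalyst implementation; the function names are illustrative, and dropping a row when any value is missing is an assumption made for the example.

```python
import math


def is_nan(x):
    """False for None (null); True only for a floating-point NaN."""
    return isinstance(x, float) and math.isnan(x)


def nanvl(a, b):
    """Return b when a is NaN, otherwise a (a null a stays null)."""
    return b if is_nan(a) else a


def _missing(x):
    return x is None or is_nan(x)


def dropna(rows):
    """Drop rows containing a null or NaN (assumed 'any' behavior)."""
    return [row for row in rows if not any(_missing(v) for v in row)]


def fillna(rows, value):
    """Replace both nulls and NaNs with the given value."""
    return [[value if _missing(v) else v for v in row] for row in rows]
```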

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7523 from yjshen/fillna_dropna and squashes the following commits:

f0a51db [Yijie Shen] make coalesce untouched and implement nanvl
1d3e35f [Yijie Shen] make Coalesce aware of NaN in order to support fillna
2760cbc [Yijie Shen] change isNaN(null) to false as well as implement dropna
2015-07-21 08:25:50 -07:00
Yijie Shen ae230596b8 [SPARK-9173][SQL]UnionPushDown should also support Intersect and Except
JIRA: https://issues.apache.org/jira/browse/SPARK-9173

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7540 from yjshen/union_pushdown and squashes the following commits:

278510a [Yijie Shen] rename UnionPushDown to SetOperationPushDown
91741c1 [Yijie Shen] Add UnionPushDown support for intersect and except
2015-07-21 00:56:57 -07:00
Pedro Rodriguez 560c658a74 [SPARK-8230][SQL] Add array/map size method
Pull Request for: https://issues.apache.org/jira/browse/SPARK-8230

The primary issue resolved is implementing array/map size for Spark SQL. Code is ready for review by a committer. Chen Hao is on the JIRA ticket, but I don't know his username on GitHub; rxin is also on the JIRA ticket.

Things to review:
1. Where to put the added functions namespace-wise? They seem to be part of a few operations on collections, which include `sort_array` and `array_contains`. Hence the names `collectionOperations.scala` and `_collection_functions` in Python.
2. In Python code, should it be in a `1.5.0` function array or in a collections array?
3. Are there any missing methods on the `Size` case class? It looks like many of these functions have generated Java code; is that also needed in this case?
4. Something else?

Author: Pedro Rodriguez <ski.rodriguez@gmail.com>
Author: Pedro Rodriguez <prodriguez@trulia.com>

Closes #7462 from EntilZha/SPARK-8230 and squashes the following commits:

9a442ae [Pedro Rodriguez] fixed functions and sorted __all__
9aea3bb [Pedro Rodriguez] removed imports from python docs
15d4bf1 [Pedro Rodriguez] Added null test case and changed to nullSafeCodeGen
d88247c [Pedro Rodriguez] removed python code
bd5f0e4 [Pedro Rodriguez] removed duplicate function from rebase/merge
59931b4 [Pedro Rodriguez] fixed compile bug instroduced when merging
c187175 [Pedro Rodriguez] updated code to add size to __all__ directly and removed redundent pretty print
130839f [Pedro Rodriguez] fixed failing test
aa9bade [Pedro Rodriguez] fix style
e093473 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
0449377 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
9a1a2ff [Pedro Rodriguez] added unit tests for map size
2bfbcb6 [Pedro Rodriguez] added unit test for size
20df2b4 [Pedro Rodriguez] Finished working version of size function and added it to python
b503e75 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
99a6a5c [Pedro Rodriguez] fixed failing test
cac75ac [Pedro Rodriguez] fix style
933d843 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
42bb7d4 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
f9c3b8a [Pedro Rodriguez] added unit tests for map size
2515d9f [Pedro Rodriguez] added documentation
0e60541 [Pedro Rodriguez] added unit test for size
acf9853 [Pedro Rodriguez] Finished working version of size function and added it to python
84a5d38 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
2015-07-21 00:53:20 -07:00