Commit graph

576 commits

Author SHA1 Message Date
Reynold Xin d931b01dca [SQL] Two DataFrame fixes.
- Removed DataFrame.apply for projection & filtering since they are extremely confusing.
- Added implicits for RDD[Int], RDD[Long], and RDD[String]

Author: Reynold Xin <rxin@databricks.com>

Closes #4543 from rxin/df-cleanup and squashes the following commits:

81ec915 [Reynold Xin] [SQL] More DataFrame fixes.
2015-02-11 18:32:48 -08:00
Michael Armbrust a60d2b70ad [SPARK-5454] More robust handling of self joins
Also I fix a bunch of bad output in test cases.

Author: Michael Armbrust <michael@databricks.com>

Closes #4520 from marmbrus/selfJoin and squashes the following commits:

4f4a85c [Michael Armbrust] comments
49c8e26 [Michael Armbrust] fix tests
6fc38de [Michael Armbrust] fix style
55d64b3 [Michael Armbrust] fix dataframe selfjoins
2015-02-11 12:31:56 -08:00
Davies Liu b694eb9c2f [SPARK-5677] [SPARK-5734] [SQL] [PySpark] Python DataFrame API remaining tasks
1. DataFrame.renameColumn

2. DataFrame.show() and _repr_

3. Use simpleString() rather than jsonValue in DataFrame.dtypes

4. createDataFrame from local Python data, including pandas.DataFrame

Author: Davies Liu <davies@databricks.com>

Closes #4528 from davies/df3 and squashes the following commits:

014acea [Davies Liu] fix typo
6ba526e [Davies Liu] fix tests
46f5f95 [Davies Liu] address comments
6cbc154 [Davies Liu] dataframe.show() and improve dtypes
6f94f25 [Davies Liu] create DataFrame from local Python data
2015-02-11 12:13:16 -08:00
Patrick Wendell 7e2f8821e0 HOTFIX: Java 6 compilation error in Spark SQL 2015-02-10 22:43:32 -08:00
Reynold Xin b8f88d3272 [SPARK-5702][SQL] Allow short names for built-in data sources.
Also took the chance to fixed up some style ...

Author: Reynold Xin <rxin@databricks.com>

Closes #4489 from rxin/SPARK-5702 and squashes the following commits:

74f42e3 [Reynold Xin] [SPARK-5702][SQL] Allow short names for built-in data sources.
2015-02-10 20:40:21 -08:00
Reynold Xin 7e24249af1 [SQL][DataFrame] Fix column computability bug.
Do not recursively strip out projects. Only strip the first level project.

```scala
df("colA") + df("colB").as("colC")
```

Previously, the above would construct an invalid plan.

Author: Reynold Xin <rxin@databricks.com>

Closes #4519 from rxin/computability and squashes the following commits:

87ff763 [Reynold Xin] Code review feedback.
015c4fc [Reynold Xin] [SQL][DataFrame] Fix column computability.
2015-02-10 19:50:44 -08:00
Cheng Hao 45df77b841 [SPARK-5709] [SQL] Add EXPLAIN support in DataFrame API for debugging purpose
Author: Cheng Hao <hao.cheng@intel.com>

Closes #4496 from chenghao-intel/df_explain and squashes the following commits:

552aa58 [Cheng Hao] Add explain support for DF
2015-02-10 19:40:51 -08:00
Davies Liu ea60284095 [SPARK-5704] [SQL] [PySpark] createDataFrame from RDD with columns
Deprecate inferSchema() and applySchema(), use createDataFrame() instead, which could take an optional `schema` to create an DataFrame from an RDD. The `schema` could be StructType or list of names of columns.

Author: Davies Liu <davies@databricks.com>

Closes #4498 from davies/create and squashes the following commits:

08469c1 [Davies Liu] remove Scala/Java API for now
c80a7a9 [Davies Liu] fix hive test
d1bd8f2 [Davies Liu] cleanup applySchema
9526e97 [Davies Liu] createDataFrame from RDD with columns
2015-02-10 19:40:12 -08:00
Cheng Hao a60aea86b4 [SPARK-5683] [SQL] Avoid multiple json generator created
Author: Cheng Hao <hao.cheng@intel.com>

Closes #4468 from chenghao-intel/json and squashes the following commits:

aeb7801 [Cheng Hao] avoid multiple json generator created
2015-02-10 18:19:56 -08:00
Michael Armbrust 6195e2473b [SQL] Add an exception for analysis errors.
Also start from the bottom so we show the first error instead of the top error.

Author: Michael Armbrust <michael@databricks.com>

Closes #4439 from marmbrus/analysisException and squashes the following commits:

45862a0 [Michael Armbrust] fix hive test
a773bba [Michael Armbrust] Merge remote-tracking branch 'origin/master' into analysisException
f88079f [Michael Armbrust] update more cases
fede90a [Michael Armbrust] newline
fbf4bc3 [Michael Armbrust] move to sql
6235db4 [Michael Armbrust] [SQL] Add an exception for analysis errors.
2015-02-10 17:32:42 -08:00
Yin Huai aaf50d05c7 [SPARK-5658][SQL] Finalize DDL and write support APIs
https://issues.apache.org/jira/browse/SPARK-5658

Author: Yin Huai <yhuai@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>

Closes #4446 from yhuai/writeSupportFollowup and squashes the following commits:

f3a96f7 [Yin Huai] davies's comments.
225ff71 [Yin Huai] Use Scala TestHiveContext to initialize the Python HiveContext in Python tests.
2306f93 [Yin Huai] Style.
2091fcd [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
537e28f [Yin Huai] Correctly clean up temp data.
ae4649e [Yin Huai] Fix Python test.
609129c [Yin Huai] Doc format.
92b6659 [Yin Huai] Python doc and other minor updates.
cbc717f [Yin Huai] Rename dataSourceName to source.
d1c12d3 [Yin Huai] No need to delete the duplicate rule since it has been removed in master.
22cfa70 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
d91ecb8 [Yin Huai] Fix test.
4c76d78 [Yin Huai] Simplify APIs.
3abc215 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
0832ce4 [Yin Huai] Fix test.
98e7cdb [Yin Huai] Python style.
2bf44ef [Yin Huai] Python APIs.
c204967 [Yin Huai] Format
a10223d [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
9ff97d8 [Yin Huai] Add SaveMode to saveAsTable.
9b6e570 [Yin Huai] Update doc.
c2be775 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
99950a2 [Yin Huai] Use Java enum for SaveMode.
4679665 [Yin Huai] Remove duplicate rule.
77d89dc [Yin Huai] Update doc.
e04d908 [Yin Huai] Move import and add (Scala-specific) to scala APIs.
cf5703d [Yin Huai] Add checkAnswer to Java tests.
7db95ff [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
6dfd386 [Yin Huai] Add java test.
f2f33ef [Yin Huai] Fix test.
e702386 [Yin Huai] Apache header.
b1e9b1b [Yin Huai] Format.
ed4e1b4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
af9e9b3 [Yin Huai] DDL and write support API followup.
2a6213a [Yin Huai] Update API names.
e6a0b77 [Yin Huai] Update test.
43bae01 [Yin Huai] Remove createTable from HiveContext.
5ffc372 [Yin Huai] Add more load APIs to SQLContext.
5390743 [Yin Huai] Add more save APIs to DataFrame.
2015-02-10 17:29:52 -08:00
Yin Huai e28b6bdbb5 [SQL] Make Options in the data source API CREATE TABLE statements optional.
Users will not need to put `Options()` in a CREATE TABLE statement when there is not option provided.

Author: Yin Huai <yhuai@databricks.com>

Closes #4515 from yhuai/makeOptionsOptional and squashes the following commits:

1a898d3 [Yin Huai] Make options optional.
2015-02-10 17:06:12 -08:00
Cheng Lian 2d50a010ff [SPARK-5725] [SQL] Fixes ParquetRelation2.equals
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4513)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #4513 from liancheng/spark-5725 and squashes the following commits:

bf6a087 [Cheng Lian] Fixes ParquetRelation2.equals
2015-02-10 17:02:44 -08:00
Sheng, Li 91e3512544 [SQL][Minor] correct some comments
Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com>
Author: OopsOutOfMemory <victorshengli@126.com>

Closes #4508 from OopsOutOfMemory/cmt and squashes the following commits:

d8a68c6 [Sheng, Li] Update ddl.scala
f24aeaf [OopsOutOfMemory] correct style
2015-02-11 00:59:46 +00:00
Michael Armbrust de80b1ba4d [SQL] Add toString to DataFrame/Column
Author: Michael Armbrust <michael@databricks.com>

Closes #4436 from marmbrus/dfToString and squashes the following commits:

8a3c35f [Michael Armbrust] Merge remote-tracking branch 'origin/master' into dfToString
b72a81b [Michael Armbrust] add toString
2015-02-10 13:14:01 -08:00
Sean Owen 2d1e916730 SPARK-5239 [CORE] JdbcRDD throws "java.lang.AbstractMethodError: oracle.jdbc.driver.xxxxxx.isClosed()Z"
This is a completion of https://github.com/apache/spark/pull/4033 which was withdrawn for some reason.

Author: Sean Owen <sowen@cloudera.com>

Closes #4470 from srowen/SPARK-5239.2 and squashes the following commits:

2398bde [Sean Owen] Avoid use of JDBC4-only isClosed()
2015-02-10 09:19:01 +00:00
Cheng Hao bd0b5ea708 [SQL] Remove the duplicated code
Author: Cheng Hao <hao.cheng@intel.com>

Closes #4494 from chenghao-intel/tiny_code_change and squashes the following commits:

450dfe7 [Cheng Hao] remove the duplicated code
2015-02-09 21:33:34 -08:00
Davies Liu 08488c175f [SPARK-5469] restructure pyspark.sql into multiple files
All the DataTypes moved into pyspark.sql.types

The changes can be tracked by `--find-copies-harder -M25`
```
davieslocalhost:~/work/spark/python$ git diff --find-copies-harder -M25 --numstat master..
2       5       python/docs/pyspark.ml.rst
0       3       python/docs/pyspark.mllib.rst
10      2       python/docs/pyspark.sql.rst
1       1       python/pyspark/mllib/linalg.py
21      14      python/pyspark/{mllib => sql}/__init__.py
14      2108    python/pyspark/{sql.py => sql/context.py}
10      1772    python/pyspark/{sql.py => sql/dataframe.py}
7       6       python/pyspark/{sql_tests.py => sql/tests.py}
8       1465    python/pyspark/{sql.py => sql/types.py}
4       2       python/run-tests
1       1       sql/core/src/main/scala/org/apache/spark/sql/test/ExamplePointUDT.scala
```

Also `git blame -C -C python/pyspark/sql/context.py` to track the history.

Author: Davies Liu <davies@databricks.com>

Closes #4479 from davies/sql and squashes the following commits:

1b5f0a5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sql
2b2b983 [Davies Liu] restructure pyspark.sql
2015-02-09 20:49:22 -08:00
Wenchen Fan 0ee53ebce9 [SPARK-2096][SQL] support dot notation on array of struct
~~The rule is simple: If you want `a.b` work, then `a` must be some level of nested array of struct(level 0 means just a StructType). And the result of `a.b` is same level of nested array of b-type.
An optimization is: the resolve chain looks like `Attribute -> GetItem -> GetField -> GetField ...`, so we could transmit the nested array information between `GetItem` and `GetField` to avoid repeated computation of `innerDataType` and `containsNullList` of that nested array.~~
marmbrus Could you take a look?

to evaluate `a.b`, if `a` is array of struct, then `a.b` means get field `b` on each element of `a`, and return a result of array.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #2405 from cloud-fan/nested-array-dot and squashes the following commits:

08a228a [Wenchen Fan] support dot notation on array of struct
2015-02-09 16:39:34 -08:00
Michael Armbrust 68b25cf695 [SQL] Add some missing DataFrame functions.
- as with a `Symbol`
- distinct
- sqlContext.emptyDataFrame
- move add/remove col out of RDDApi section

Author: Michael Armbrust <michael@databricks.com>

Closes #4437 from marmbrus/dfMissingFuncs and squashes the following commits:

2004023 [Michael Armbrust] Add missing functions
2015-02-09 16:02:56 -08:00
Hung Lin 4575c5643a [SPARK-5472][SQL] Fix Scala code style
Fix Scala code style.

Author: Hung Lin <hung@zoomdata.com>

Closes #4464 from hunglin/SPARK-5472 and squashes the following commits:

ef7a3b3 [Hung Lin] SPARK-5472: fix scala style
2015-02-08 22:36:42 -08:00
Reynold Xin a052ed4250 [SPARK-5643][SQL] Add a show method to print the content of a DataFrame in tabular format.
An example:
```
year  month AVG('Adj Close) MAX('Adj Close)
1980  12    0.503218        0.595103
1981  01    0.523289        0.570307
1982  02    0.436504        0.475256
1983  03    0.410516        0.442194
1984  04    0.450090        0.483521
```

Author: Reynold Xin <rxin@databricks.com>

Closes #4416 from rxin/SPARK-5643 and squashes the following commits:

d0e0d6e [Reynold Xin] [SQL] Minor update to data source and statistics documentation.
269da83 [Reynold Xin] Updated isLocal comment.
2cf3c27 [Reynold Xin] Moved logic into optimizer.
1a04d8b [Reynold Xin] [SPARK-5643][SQL] Add a show method to print the content of a DataFrame in columnar format.
2015-02-08 18:56:51 -08:00
Cheng Lian c4021401e3 [SQL] [Minor] HiveParquetSuite was disabled by mistake, re-enable them
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4440)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #4440 from liancheng/parquet-oops and squashes the following commits:

f21ede4 [Cheng Lian] HiveParquetSuite was disabled by mistake, re-enable them.
2015-02-06 15:23:42 -08:00
Michael Armbrust 76c4bf59f6 [SQL] Use TestSQLContext in Java tests
Sometimes tests were failing due to the creation of multiple `SparkContext`s in a single JVM.

Author: Michael Armbrust <michael@databricks.com>

Closes #4441 from marmbrus/javaTests and squashes the following commits:

657b1e0 [Michael Armbrust] [SQL] Use TestSQLContext in Java tests
2015-02-06 15:11:02 -08:00
Wenchen Fan 4793c8402a [SPARK-5278][SQL] Introduce UnresolvedGetField and complete the check of ambiguous reference to fields
When the `GetField` chain(`a.b.c.d.....`) is interrupted by `GetItem` like `a.b[0].c.d....`, then the check of ambiguous reference to fields is broken.
The reason is that: for something like `a.b[0].c.d`, we first parse it to `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")`. Then in `LogicalPlan#resolve`, we resolve `"a.b"` and build a `GetField` chain from bottom(the relation). But for the 2 outer `GetFiled`, we have to resolve them in `Analyzer` or do it in `GetField` lazily, check data type of child, search needed field, etc. which is similar to what we have done in `LogicalPlan#resolve`.
So in this PR, the fix is just copy the same logic in `LogicalPlan#resolve` to `Analyzer`, which is simple and quick, but I do suggest introduce `UnresolvedGetFiled` like I explained in https://github.com/apache/spark/pull/2405.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #4068 from cloud-fan/simple and squashes the following commits:

a6857b5 [Wenchen Fan] fix import order
8411c40 [Wenchen Fan] use UnresolvedGetField
2015-02-06 13:08:09 -08:00
Yin Huai 3eccf29ce0 [SPARK-5595][SPARK-5603][SQL] Add a rule to do PreInsert type casting and field renaming and invalidating in memory cache after INSERT
This PR adds a rule to Analyzer that will add preinsert data type casting and field renaming to the select clause in an `INSERT INTO/OVERWRITE` statement. Also, with the change of this PR, we always invalidate our in memory data cache after inserting into a BaseRelation.

cc marmbrus liancheng

Author: Yin Huai <yhuai@databricks.com>

Closes #4373 from yhuai/insertFollowUp and squashes the following commits:

08237a7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertFollowUp
316542e [Yin Huai] Doc update.
c9ccfeb [Yin Huai] Revert a unnecessary change.
84aecc4 [Yin Huai] Address comments.
1951fe1 [Yin Huai] Merge remote-tracking branch 'upstream/master'
c18da34 [Yin Huai] Invalidate cache after insert.
727f21a [Yin Huai] Preinsert casting and renaming.
2015-02-06 12:38:07 -08:00
OopsOutOfMemory 0b7eb3f3b7 [SPARK-5324][SQL] Results of describe can't be queried
Make below code works.
```
sql("DESCRIBE test").registerTempTable("describeTest")
sql("SELECT * FROM describeTest").collect()
```

Author: OopsOutOfMemory <victorshengli@126.com>
Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com>

Closes #4249 from OopsOutOfMemory/desc_query and squashes the following commits:

6fee13d [OopsOutOfMemory] up-to-date
e71430a [Sheng, Li] Update HiveOperatorQueryableSuite.scala
3ba1058 [OopsOutOfMemory] change to default argument
aac7226 [OopsOutOfMemory] Merge branch 'master' into desc_query
68eb6dd [OopsOutOfMemory] Merge branch 'desc_query' of github.com:OopsOutOfMemory/spark into desc_query
354ad71 [OopsOutOfMemory] query describe command
d541a35 [OopsOutOfMemory] refine test suite
e1da481 [OopsOutOfMemory] refine test suite
a780539 [OopsOutOfMemory] Merge branch 'desc_query' of github.com:OopsOutOfMemory/spark into desc_query
0015f82 [OopsOutOfMemory] code style
dd0aaef [OopsOutOfMemory] code style
c7d606d [OopsOutOfMemory] rename test suite
75f2342 [OopsOutOfMemory] refine code and test suite
f942c9b [OopsOutOfMemory] initial
11559ae [OopsOutOfMemory] code style
c5fdecf [OopsOutOfMemory] code style
aeaea5f [OopsOutOfMemory] rename test suite
ac2c3bb [OopsOutOfMemory] refine code and test suite
544573e [OopsOutOfMemory] initial
2015-02-06 12:33:20 -08:00
Reynold Xin 7dc4965f34 [SPARK-5639][SQL] Support DataFrame.renameColumn.
Author: Reynold Xin <rxin@databricks.com>

Closes #4410 from rxin/df-renameCol and squashes the following commits:

a6a796e [Reynold Xin] [SPARK-5639][SQL] Support DataFrame.renameColumn.
2015-02-05 23:02:40 -08:00
Reynold Xin e8a5d50a96 [SPARK-5638][SQL] Add a config flag to disable eager analysis of DataFrames
Author: Reynold Xin <rxin@databricks.com>

Closes #4408 from rxin/df-config-eager and squashes the following commits:

c0204cf [Reynold Xin] [SPARK-5638][SQL] Add a config flag to disable eager analysis of DataFrames.
2015-02-05 18:07:10 -08:00
Cheng Lian a9ed51178c [SPARK-5182] [SPARK-5528] [SPARK-5509] [SPARK-3575] [SQL] Parquet data source improvements
This PR adds three major improvements to Parquet data source:

1.  Partition discovery

    While reading Parquet files resides in Hive style partition directories, `ParquetRelation2` automatically discovers partitioning information and infers partition column types.

    This is also a partial work for [SPARK-5182] [1], which aims to provide first class partitioning support for the data source API.  Related code in this PR can be easily extracted to the data source API level in future versions.

1.  Schema merging

    When enabled, Parquet data source collects schema information from all Parquet part-files and tries to merge them.  Exceptions are thrown when incompatible schemas are detected.  This feature is controlled by data source option `parquet.mergeSchema`, and is enabled by default.

1.  Metastore Parquet table conversion moved to analysis phase

    This greatly simplifies the conversion logic.  `ParquetConversion` strategy can be removed once the old Parquet implementation is removed in the future.

This version of Parquet data source aims to entirely replace the old Parquet implementation.  However, the old version hasn't been removed yet.  Users can fall back to the old version by turning off SQL configuration `spark.sql.parquet.useDataSourceApi`.

Other JIRA tickets fixed as side effects in this PR:

- [SPARK-5509] [3]: `EqualTo` now uses a proper `Ordering` to compare binary types.

- [SPARK-3575] [4]: Metastore schema is now preserved and passed to `ParquetRelation2` via data source option `parquet.metastoreSchema`.

TODO:

- [ ] More test cases for partition discovery
- [x] Fix write path after data source write support (#4294) is merged

      It turned out to be non-trivial to fall back to old Parquet implementation on the write path when Parquet data source is enabled.  Since we're planning to include data source write support in 1.3.0, I simply ignored two test cases involving Parquet insertion for now.

- [ ] Fix outdated comments and documentations

PS: This PR looks big, but more than a half of the changed lines in this PR are trivial changes to test cases. To test Parquet with and without the new data source, almost all Parquet test cases are moved into wrapper driver functions. This introduces hundreds of lines of changes.

[1]: https://issues.apache.org/jira/browse/SPARK-5182
[2]: https://issues.apache.org/jira/browse/SPARK-5528
[3]: https://issues.apache.org/jira/browse/SPARK-5509
[4]: https://issues.apache.org/jira/browse/SPARK-3575

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4308)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #4308 from liancheng/parquet-partition-discovery and squashes the following commits:

b6946e6 [Cheng Lian] Fixes MiMA issues, addresses comments
8232e17 [Cheng Lian] Write support for Parquet data source
a49bd28 [Cheng Lian] Fixes spelling typo in trait name "CreateableRelationProvider"
808380f [Cheng Lian] Fixes issues introduced while rebasing
50dd8d1 [Cheng Lian] Addresses @rxin's comment, fixes UDT schema merging
adf2aae [Cheng Lian] Fixes compilation error introduced while rebasing
4e0175f [Cheng Lian] Fixes Python Parquet API, we need Py4J array to call varargs method
0d8ec1d [Cheng Lian] Adds more test cases
b35c8c6 [Cheng Lian] Fixes some typos and outdated comments
dd704fd [Cheng Lian] Fixes Python Parquet API
596c312 [Cheng Lian] Uses switch to control whether use Parquet data source or not
7d0f7a2 [Cheng Lian] Fixes Metastore Parquet table conversion
a1896c7 [Cheng Lian] Fixes all existing Parquet test suites except for ParquetMetastoreSuite
5654c9d [Cheng Lian] Draft version of Parquet partition discovery and schema merging
2015-02-05 15:29:56 -08:00
OopsOutOfMemory 4d8d070c4f [SPARK-5135][SQL] Add support for describe table to DDL in SQLContext
Hi, rxin marmbrus
I considered your suggestion (in #4127) and now re-write it. This is now up-to-date.
Could u please review it ?

Author: OopsOutOfMemory <victorshengli@126.com>

Closes #4227 from OopsOutOfMemory/describe and squashes the following commits:

053826f [OopsOutOfMemory] describe
2015-02-05 13:07:48 -08:00
wangfei a83936e109 [SPARK-5617][SQL] fix test failure of SQLQuerySuite
SQLQuerySuite test failure:
[info] - simple select (22 milliseconds)
[info] - sorting (722 milliseconds)
[info] - external sorting (728 milliseconds)
[info] - limit (95 milliseconds)
[info] - date row *** FAILED *** (35 milliseconds)
[info]   Results do not match for query:
[info]   'Limit 1
[info]    'Project [CAST(2015-01-28, DateType) AS c0#3630]
[info]     'UnresolvedRelation [testData], None
[info]
[info]   == Analyzed Plan ==
[info]   Limit 1
[info]    Project [CAST(2015-01-28, DateType) AS c0#3630]
[info]     LogicalRDD [key#0,value#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35
[info]
[info]   == Physical Plan ==
[info]   Limit 1
[info]    Project [16463 AS c0#3630]
[info]     PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35
[info]
[info]   == Results ==
[info]   !== Correct Answer - 1 ==   == Spark Answer - 1 ==
[info]   ![2015-01-28]               [2015-01-27] (QueryTest.scala:77)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
[info]   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
[info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:77)
[info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:95)
[info]   at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply$mcV$sp(SQLQuerySuite.scala:300)
[info]   at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply(SQLQuerySuite.scala:300)
[info]   at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply(SQLQuerySuite.scala:300)
[info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
[info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
[info]   at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNode

Author: wangfei <wangfei1@huawei.com>

Closes #4395 from scwf/SQLQuerySuite and squashes the following commits:

1431a2d [wangfei] fix conflicts
c35fe5e [wangfei] minor fix
01dab3a [wangfei] fix test failure of SQLQuerySuite
2015-02-05 12:44:12 -08:00
Reynold Xin 7d789e117d [SPARK-5612][SQL] Move DataFrame implicit functions into SQLContext.implicits.
Author: Reynold Xin <rxin@databricks.com>

Closes #4386 from rxin/df-implicits and squashes the following commits:

9d96606 [Reynold Xin] style fix
edd296b [Reynold Xin] ReplSuite
1c946ab [Reynold Xin] [SPARK-5612][SQL] Move DataFrame implicit functions into SQLContext.implicits.
2015-02-04 23:44:34 -08:00
Reynold Xin 84acd08e08 [SPARK-5602][SQL] Better support for creating DataFrame from local data collection
1. Added methods to create DataFrames from Seq[Product]
2. Added executeTake to avoid running a Spark job on LocalRelations.

Author: Reynold Xin <rxin@databricks.com>

Closes #4372 from rxin/localDataFrame and squashes the following commits:

f696858 [Reynold Xin] style checker.
839ef7f [Reynold Xin] [SPARK-5602][SQL] Better support for creating DataFrame from local data collection.
2015-02-04 19:53:57 -08:00
Reynold Xin 206f9bc362 [SPARK-5538][SQL] Fix flaky CachedTableSuite
Author: Reynold Xin <rxin@databricks.com>

Closes #4379 from rxin/CachedTableSuite and squashes the following commits:

f2b44ce [Reynold Xin] [SQL] Fix flaky CachedTableSuite.
2015-02-04 19:52:41 -08:00
Reynold Xin 6b4c7f0806 [SQL][DataFrame] Minor cleanup.
1. Removed LocalHiveContext in Python.
2. Reduced DSL UDF support from 22 arguments to 10 arguments so JavaDoc/ScalaDoc look nicer.

Author: Reynold Xin <rxin@databricks.com>

Closes #4374 from rxin/df-style and squashes the following commits:

e493342 [Reynold Xin] [SQL][DataFrame] Minor cleanup.
2015-02-04 19:51:48 -08:00
Sadhan Sood dba98bf698 [SPARK-4520] [SQL] This pr fixes the ArrayIndexOutOfBoundsException as r...
...aised in SPARK-4520.

The exception is thrown only for a thrift generated parquet file. The array element schema name is assumed as "array" as per ParquetAvro but for thrift generated parquet files, it is array_name + "_tuple". This leads to missing child of array group type and hence when the parquet rows are being materialized leads to the exception.

Author: Sadhan Sood <sadhan@tellapart.com>

Closes #4148 from sadhan/SPARK-4520 and squashes the following commits:

c5ccde8 [Sadhan Sood] [SPARK-4520] [SQL] This pr fixes the ArrayIndexOutOfBoundsException as raised in SPARK-4520.
2015-02-04 19:18:06 -08:00
Reynold Xin 1fbd124b1b [SPARK-5605][SQL][DF] Allow using String to specify colum name in DSL aggregate functions
Author: Reynold Xin <rxin@databricks.com>

Closes #4376 from rxin/SPARK-5605 and squashes the following commits:

c55f5fa [Reynold Xin] Added a Python test.
f4b8dbb [Reynold Xin] [SPARK-5605][SQL][DF] Allow using String to specify colum name in DSL aggregate functions.
2015-02-04 18:35:51 -08:00
Davies Liu dc101b0e4e [SPARK-5577] Python udf for DataFrame
Author: Davies Liu <davies@databricks.com>

Closes #4351 from davies/python_udf and squashes the following commits:

d250692 [Davies Liu] fix conflict
34234d4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python_udf
440f769 [Davies Liu] address comments
f0a3121 [Davies Liu] track life cycle of broadcast
f99b2e1 [Davies Liu] address comments
462b334 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python_udf
7bccc3b [Davies Liu] python udf
58dee20 [Davies Liu] clean up
2015-02-04 15:55:09 -08:00
kul 424cb699ee [SPARK-5426][SQL] Add SparkSQL Java API helper methods.
Right now the PR adds few helper methods for java apis. But the issue was opened mainly to get rid of transformations in java api like `.rdd` and `.toJavaRDD` while working with `SQLContext` or `HiveContext`.

Author: kul <kuldeep.bora@gmail.com>

Closes #4243 from kul/master and squashes the following commits:

2390fba [kul] [SPARK-5426][SQL] Add SparkSQL Java API helper methods.
2015-02-04 15:08:37 -08:00
Davies Liu ac0b2b788f [SPARK-5588] [SQL] support select/filter by SQL expression
```
df.selectExpr('a + 1', 'abs(age)')
df.filter('age > 3')
df[ df.age > 3 ]
df[ ['age', 'name'] ]
```

Author: Davies Liu <davies@databricks.com>

Closes #4359 from davies/select_expr and squashes the following commits:

d99856b [Davies Liu] support select/filter by SQL expression
2015-02-04 11:34:46 -08:00
Reynold Xin 40c4cb2fe7 [SPARK-5579][SQL][DataFrame] Support for project/filter using SQL expressions
```scala
df.selectExpr("abs(colA)", "colB")
df.filter("age > 21")
```

Author: Reynold Xin <rxin@databricks.com>

Closes #4348 from rxin/SPARK-5579 and squashes the following commits:

2baeef2 [Reynold Xin] Fix Python.
b416372 [Reynold Xin] [SPARK-5579][SQL][DataFrame] Support for project/filter using SQL expressions.
2015-02-03 22:15:35 -08:00
Reynold Xin 1077f2e1de [SPARK-5578][SQL][DataFrame] Provide a convenient way for Scala users to use UDFs
A more convenient way to define user-defined functions.

Author: Reynold Xin <rxin@databricks.com>

Closes #4345 from rxin/defineUDF and squashes the following commits:

639c0f8 [Reynold Xin] udf tests.
0a0b339 [Reynold Xin] defineUDF -> udf.
b452b8d [Reynold Xin] Fix UDF registration.
d2e42c3 [Reynold Xin] SQLContext.udf.register() returns a UserDefinedFunction also.
4333605 [Reynold Xin] [SQL][DataFrame] defineUDF.
2015-02-03 20:07:46 -08:00
Davies Liu 068c0e2ee0 [SPARK-5554] [SQL] [PySpark] add more tests for DataFrame Python API
Add more tests and docs for DataFrame Python API, improve test coverage, fix bugs.

Author: Davies Liu <davies@databricks.com>

Closes #4331 from davies/fix_df and squashes the following commits:

dd9919f [Davies Liu] fix tests
467332c [Davies Liu] support string in cast()
83c92fe [Davies Liu] address comments
c052f6f [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_df
8dd19a9 [Davies Liu] fix tests in python 2.6
35ccb9f [Davies Liu] fix build
78ebcfa [Davies Liu] add sql_test.py in run_tests
9ab78b4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_df
6040ba7 [Davies Liu] fix docs
3ab2661 [Davies Liu] add more tests for DataFrame
2015-02-03 16:01:56 -08:00
Daoyuan Wang db821ed2ed [SPARK-4508] [SQL] build native date type to conform behavior to Hive
The previous #3732 is reverted due to some test failure.
Have fixed that.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4325 from adrian-wang/datenative and squashes the following commits:

096e20d [Daoyuan Wang] fix for mixed timezone
0ed0fdc [Daoyuan Wang] fix test data
a2fdd4e [Daoyuan Wang] getDate
c37832b [Daoyuan Wang] row to catalyst
f0005b1 [Daoyuan Wang] add date in sql parser and java type conversion
024c9a6 [Daoyuan Wang] clean some import order
d6715fc [Daoyuan Wang] refactoring Date as Primitive Int internally
374abd5 [Daoyuan Wang] spark native date type support
2015-02-03 12:21:45 -08:00
Cheng Hao ca7a6cdff0 [SPARK-5550] [SQL] Support the case insensitive for UDF
SQL in HiveContext, should be case insensitive, however, the following query will fail.

```scala
udf.register("random0", ()  => { Math.random()})
assert(sql("SELECT RANDOM0() FROM src LIMIT 1").head().getDouble(0) >= 0.0)
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #4326 from chenghao-intel/udf_case_sensitive and squashes the following commits:

485cf66 [Cheng Hao] Support the case insensitive for UDF
2015-02-03 12:12:26 -08:00
Daoyuan Wang 0c20ce69fb [SPARK-4987] [SQL] parquet timestamp type support
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3820 from adrian-wang/parquettimestamp and squashes the following commits:

b1e2a0d [Daoyuan Wang] fix for nanos
4dadef1 [Daoyuan Wang] fix wrong read
93f438d [Daoyuan Wang] parquet timestamp support
2015-02-03 12:06:06 -08:00
Reynold Xin 4204a1271d [SQL] DataFrame API update
1. Added Java-friendly version of the expression operators (i.e. gt, geq)
2. Added JavaDoc for most operators
3. Simplified expression operators by having only one version of the function (that accepts Any). Previously we had two methods for each expression operator, one accepting Any and another accepting Column.
4. agg function now accepts varargs of (String, String).

Author: Reynold Xin <rxin@databricks.com>

Closes #4332 from rxin/df-update and squashes the following commits:

ab0aa69 [Reynold Xin] Added Java friendly expression methods. Added JavaDoc. For each expression operator, have only one version of the function (that accepts Any). Previously we had two methods for each expression operator, one accepting Any and another accepting Column.
576d07a [Reynold Xin] random commit.
2015-02-03 10:34:56 -08:00
Reynold Xin 523a93523d [SPARK-5551][SQL] Create type alias for SchemaRDD for source backward compatibility
Author: Reynold Xin <rxin@databricks.com>

Closes #4327 from rxin/schemarddTypeAlias and squashes the following commits:

e5a8ff3 [Reynold Xin] [SPARK-5551][SQL] Create type alias for SchemaRDD for source backward compatibility
2015-02-03 00:29:23 -08:00
Reynold Xin 37df330135 [SQL][DataFrame] Remove DataFrameApi, ExpressionApi, and GroupedDataFrameApi
They were there mostly for code review and easier check of the API. I don't think they need to be there anymore.

Author: Reynold Xin <rxin@databricks.com>

Closes #4328 from rxin/remove-df-api and squashes the following commits:

723d600 [Reynold Xin] [SQL][DataFrame] Remove DataFrameApi and ColumnApi.
2015-02-03 00:29:04 -08:00
Yin Huai 13531dd97c [SPARK-5501][SPARK-5420][SQL] Write support for the data source API
This PR aims to support `INSERT INTO/OVERWRITE TABLE tableName` and `CREATE TABLE tableName AS SELECT` for the data source API (partitioned tables are not supported).

In this PR, I am also adding the support of `IF NOT EXISTS` for our ddl parser. The current semantic of `IF NOT EXISTS` is explained as follows.
* For a `CREATE TEMPORARY TABLE` statement, it does not `IF NOT EXISTS` for now.
* For a `CREATE TABLE` statement (we are creating a metastore table), if there is an existing table having the same name ...
  * when `IF NOT EXISTS` clause is used, we will do nothing.
  * when `IF NOT EXISTS` clause is not used, the user will see an exception saying the table already exists.

TODOs:
- [x] CTAS support
- [x] Programmatic APIs
- [ ] Python API (another PR)
- [x] More unit tests
- [ ] Documents (another PR)

marmbrus liancheng rxin

Author: Yin Huai <yhuai@databricks.com>

Closes #4294 from yhuai/writeSupport and squashes the following commits:

3db1539 [Yin Huai] save does not take overwrite.
1c98881 [Yin Huai] Fix test.
142372a [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupport
34e1bfb [Yin Huai] Address comments.
1682ca6 [Yin Huai] Better support for CTAS statements.
e789d64 [Yin Huai] For the Scala API, let users to use tuples to provide options.
0128065 [Yin Huai] Short hand versions of save and load.
66ebd74 [Yin Huai] Formatting.
9203ec2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupport
e5d29f2 [Yin Huai] Programmatic APIs.
1a719a5 [Yin Huai] CREATE TEMPORARY TABLE with IF NOT EXISTS is not allowed for now.
909924f [Yin Huai] Add saveAsTable for the data source API to DataFrame.
95a7c71 [Yin Huai] Fix bug when handling IF NOT EXISTS clause in a CREATE TEMPORARY TABLE statement.
d37b19c [Yin Huai] Cheng's comments.
fd6758c [Yin Huai] Use BeforeAndAfterAll.
7880891 [Yin Huai] Support CREATE TABLE AS SELECT STATEMENT and the IF NOT EXISTS clause.
cb85b05 [Yin Huai] Initial write support.
2f91354 [Yin Huai] Make INSERT OVERWRITE/INTO statements consistent between HiveQL and SqlParser.
2015-02-02 23:30:44 -08:00
Tor Myklebust 8f471a66db [SPARK-5472][SQL] A JDBC data source for Spark SQL.
This pull request contains a Spark SQL data source that can pull data from, and can put data into, a JDBC database.

I have tested both read and write support with H2, MySQL, and Postgres.  It would surprise me if both read and write support worked flawlessly out-of-the-box for any other database; different databases have different names for different JDBC data types and different meanings for SQL types with the same name.  However, this code is designed (see `DriverQuirks.scala`) to make it *relatively* painless to add support for another database by augmenting the type mapping contained in this PR.

Author: Tor Myklebust <tmyklebu@gmail.com>

Closes #4261 from tmyklebu/master and squashes the following commits:

cf167ce [Tor Myklebust] Work around other Java tests ruining TestSQLContext.
67893bf [Tor Myklebust] Move the jdbcRDD methods into SQLContext itself.
585f95b [Tor Myklebust] Dependencies go into the project's pom.xml.
829d5ba [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
41647ef [Tor Myklebust] Hide a couple things that don't need to be public.
7318aea [Tor Myklebust] Fix scalastyle warnings.
a09eeac [Tor Myklebust] JDBC data source for Spark SQL.
176bb98 [Tor Myklebust] Add test deps for JDBC support.
2015-02-02 19:50:14 -08:00
Reynold Xin 554403fd91 [SQL] Improve DataFrame API error reporting
1. Throw UnsupportedOperationException if a Column is not computable.
2. Perform eager analysis on DataFrame so we can catch errors when they happen (not when an action is run).

Author: Reynold Xin <rxin@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #4296 from rxin/col-computability and squashes the following commits:

6527b86 [Reynold Xin] Merge pull request #8 from davies/col-computability
fd92bc7 [Reynold Xin] Merge branch 'master' into col-computability
f79034c [Davies Liu] fix python tests
5afe1ff [Reynold Xin] Fix scala test.
17f6bae [Reynold Xin] Various fixes.
b932e86 [Reynold Xin] Added eager analysis for error reporting.
e6f00b8 [Reynold Xin] [SQL][API] ComputableColumn vs IncomputableColumn
2015-02-02 19:01:47 -08:00
Patrick Wendell eccb9fbb2d Revert "[SPARK-4508] [SQL] build native date type to conform behavior to Hive"
This reverts commit 1646f89d96.
2015-02-02 17:52:17 -08:00
Reynold Xin 8aa3cfff66 [SPARK-5514] DataFrame.collect should call executeCollect
Author: Reynold Xin <rxin@databricks.com>

Closes #4313 from rxin/SPARK-5514 and squashes the following commits:

e34e91b [Reynold Xin] [SPARK-5514] DataFrame.collect should call executeCollect
2015-02-02 16:55:36 -08:00
Daoyuan Wang 1646f89d96 [SPARK-4508] [SQL] build native date type to conform behavior to Hive
Store daysSinceEpoch as an Int value(4 bytes) to represent DateType, instead of using java.sql.Date(8 bytes as Long) in catalyst row. This ensures the same comparison behavior of Hive and Catalyst.
Subsumes #3381
I thinks there are already some tests in JavaSQLSuite, and for python it will not affect python's datetime class.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3732 from adrian-wang/datenative and squashes the following commits:

0ed0fdc [Daoyuan Wang] fix test data
a2fdd4e [Daoyuan Wang] getDate
c37832b [Daoyuan Wang] row to catalyst
f0005b1 [Daoyuan Wang] add date in sql parser and java type conversion
024c9a6 [Daoyuan Wang] clean some import order
d6715fc [Daoyuan Wang] refactoring Date as Primitive Int internally
374abd5 [Daoyuan Wang] spark native date type support
2015-02-02 15:49:22 -08:00
Cheng Lian ec1003219b [SPARK-5465] [SQL] Fixes filter push-down for Parquet data source
Not all Catalyst filter expressions can be converted to Parquet filter predicates. We should try to convert each individual predicate and then collect those convertible ones.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4255)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #4255 from liancheng/spark-5465 and squashes the following commits:

14ccd37 [Cheng Lian] Fixes filter push-down for Parquet data source
2015-02-01 18:52:39 -08:00
Daoyuan Wang 8cf4a1f02e [SPARK-5262] [SPARK-5244] [SQL] add coalesce in SQLParser and widen types for parameters of coalesce
I'll add test case in #4040

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4057 from adrian-wang/coal and squashes the following commits:

4d0111a [Daoyuan Wang] address Yin's comments
c393e18 [Daoyuan Wang] fix rebase conflicts
e47c03a [Daoyuan Wang] add coalesce in parser
c74828d [Daoyuan Wang] cast types for coalesce
2015-02-01 18:51:38 -08:00
OopsOutOfMemory 1b56f1d6bb [SPARK-5196][SQL] Support comment in Create Table Field DDL
Support `comment` in create a table field.
__CREATE TEMPORARY TABLE people(name string `comment` "the name of a person")__

Author: OopsOutOfMemory <victorshengli@126.com>

Closes #3999 from OopsOutOfMemory/meta_comment and squashes the following commits:

39150d4 [OopsOutOfMemory] add comment and refine test suite
2015-02-01 18:41:58 -08:00
Liang-Chi Hsieh ef89b82d83 [Minor][SQL] Little refactor DataFrame related codes
Simplify some codes related to DataFrame.

*  Calling `toAttributes` instead of a `map`.
*  Original `createDataFrame` creates the `StructType` and its attributes in a redundant way. Refactored it to create `StructType` and call `toAttributes` on it directly.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4298 from viirya/refactor_df and squashes the following commits:

1d61c64 [Liang-Chi Hsieh] Revert it.
f36efb5 [Liang-Chi Hsieh] Relax the constraint of toDataFrame.
2c9f370 [Liang-Chi Hsieh] Just refactor DataFrame codes.
2015-02-01 17:52:18 -08:00
kai f54c9f607b [SQL] remove redundant field "childOutput" from execution.Aggregate, use child.output instead
Author: kai <kaizeng@eecs.berkeley.edu>

Closes #4291 from kai-zeng/aggregate-fix and squashes the following commits:

78658ef [kai] remove redundant field "childOutput"
2015-01-30 23:19:10 -08:00
Takuya UESHIN 6f21dce5f4 [SPARK-5457][SQL] Add missing DSL for ApproxCountDistinct.
Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #4250 from ueshin/issues/SPARK-5457 and squashes the following commits:

3c05e59 [Takuya UESHIN] Remove parameter to use default value of ApproxCountDistinct.
faea19d [Takuya UESHIN] Use overload instead of default value for Java support.
d1cca38 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-5457
663d43d [Takuya UESHIN] Add missing DSL for ApproxCountDistinct.
2015-01-30 01:21:35 -08:00
Reynold Xin 80def9deb3 [SQL] Support df("*") to select all columns in a data frame.
This PR makes Star a trait, and provides two implementations: UnresolvedStar (used for *, tblName.*) and ResolvedStar (used for df("*")).

Author: Reynold Xin <rxin@databricks.com>

Closes #4283 from rxin/df-star and squashes the following commits:

c9cba3e [Reynold Xin] Removed mapFunction in UnresolvedStar.
1a3a1d7 [Reynold Xin] [SQL] Support df("*") to select all columns in a data frame.
2015-01-29 19:09:08 -08:00
Josh Rosen 22271f9693 [SPARK-5462] [SQL] Use analyzed query plan in DataFrame.apply()
This patch changes DataFrame's `apply()` method to use an analyzed query plan when resolving column names.  This fixes a bug where `apply` would throw "invalid call to qualifiers on unresolved object" errors when called on DataFrames constructed via `SQLContext.sql()`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #4282 from JoshRosen/SPARK-5462 and squashes the following commits:

b9e6da2 [Josh Rosen] [SPARK-5462] Use analyzed query plan in DataFrame.apply().
2015-01-29 18:23:05 -08:00
Reynold Xin ce9c43ba8c [SQL] DataFrame API improvements
1. Added Dsl.column in case Dsl.col is shadowed.
2. Allow using String to specify the target data type in cast.
3. Support sorting on multiple columns using column names.
4. Added Java API test file.

Author: Reynold Xin <rxin@databricks.com>

Closes #4280 from rxin/dsl1 and squashes the following commits:

33ecb7a [Reynold Xin] Add the Java test.
d06540a [Reynold Xin] [SQL] DataFrame API improvements.
2015-01-29 17:24:00 -08:00
wangfei c1b3eebf97 [SPARK-5373][SQL] Literal in agg grouping expressions leads to incorrect result
`select key, count( * ) from src group by key, 1`  will get the wrong answer.

e.g. for this table
```
  val testData2 =
    TestSQLContext.sparkContext.parallelize(
      TestData2(1, 1) ::
      TestData2(1, 2) ::
      TestData2(2, 1) ::
      TestData2(2, 2) ::
      TestData2(3, 1) ::
      TestData2(3, 2) :: Nil, 2).toSchemaRDD
  testData2.registerTempTable("testData2")
```
result of `SELECT a, count(1) FROM testData2 GROUP BY a, 1`  is

```
                     [1,1]
                     [2,2]
                     [3,1]
```

Author: wangfei <wangfei1@huawei.com>

Closes #4169 from scwf/agg-bug and squashes the following commits:

05751db [wangfei] fix bugs when literal in agg grouping expressioons
2015-01-29 15:47:18 -08:00
Yash Datta de221ea032 [SPARK-4786][SQL]: Parquet filter pushdown for castable types
Enable parquet filter pushdown of castable types like short, byte that can be cast to integer

Author: Yash Datta <Yash.Datta@guavus.com>

Closes #4156 from saucam/filter_short and squashes the following commits:

a403979 [Yash Datta] SPARK-4786: Fix styling issues
d029866 [Yash Datta] SPARK-4786: Add test case
cb2e0d9 [Yash Datta] SPARK-4786: Parquet filter pushdown for castable types
2015-01-29 15:42:23 -08:00
Michael Davies 940f375611 [SPARK-5309][SQL] Add support for dictionaries in PrimitiveConverter for Strin...
...gs.

Parquet Converters allow developers to take advantage of dictionary encoding of column data to reduce Column Binary decoding.

The Spark PrimitiveConverter was not using that API and consequently for String columns that used dictionary compression repeated Binary to String conversions for the same String.

In measurements this could account for over 25% of entire query time.
For example a 500M row table split across 16 blocks was aggregated and summed in a litte under 30s before this change and a little under 20s after the change.

Author: Michael Davies <Michael.BellDavies@gmail.com>

Closes #4187 from MickDavies/SPARK-5309-2 and squashes the following commits:

327287e [Michael Davies] SPARK-5309: Add support for dictionaries in PrimitiveConverter for Strings.
33c002c [Michael Davies] SPARK-5309: Add support for dictionaries in PrimitiveConverter for Strings.
2015-01-29 15:40:59 -08:00
Reynold Xin 715632232d [SPARK-5445][SQL] Consolidate Java and Scala DSL static methods.
Turns out Scala does generate static methods for ones defined in a companion object. Finally no need to separate api.java.dsl and api.scala.dsl.

Author: Reynold Xin <rxin@databricks.com>

Closes #4276 from rxin/dsl and squashes the following commits:

30aa611 [Reynold Xin] Add all files.
1a9d215 [Reynold Xin] [SPARK-5445][SQL] Consolidate Java and Scala DSL static methods.
2015-01-29 15:13:09 -08:00
Reynold Xin 5ad78f6205 [SQL] Various DataFrame DSL update.
1. Added foreach, foreachPartition, flatMap to DataFrame.
2. Added col() in dsl.
3. Support renaming columns in toDataFrame.
4. Support type inference on arrays (in addition to Seq).
5. Updated mllib to use the new DSL.

Author: Reynold Xin <rxin@databricks.com>

Closes #4260 from rxin/sql-dsl-update and squashes the following commits:

73466c1 [Reynold Xin] Fixed LogisticRegression. Also added better error message for resolve.
fab3ccc [Reynold Xin] Bug fix.
d31fcd2 [Reynold Xin] Style fix.
62608c4 [Reynold Xin] [SQL] Various DataFrame DSL update.
2015-01-29 00:01:10 -08:00
Reynold Xin 5b9760de8d [SPARK-5445][SQL] Made DataFrame dsl usable in Java
Also removed the literal implicit transformation since it is pretty scary for API design. Instead, created a new lit method for creating literals. This doesn't break anything from a compatibility perspective because Literal was added two days ago.

Author: Reynold Xin <rxin@databricks.com>

Closes #4241 from rxin/df-docupdate and squashes the following commits:

c0f4810 [Reynold Xin] Fix Python merge conflict.
094c7d7 [Reynold Xin] Minor style fix. Reset Python tests.
3c89f4a [Reynold Xin] Package.
dfe6962 [Reynold Xin] Updated Python aggregate.
5dd4265 [Reynold Xin] Made dsl Java callable.
14b3c27 [Reynold Xin] Fix literal expression for symbols.
68b31cb [Reynold Xin] Literal.
4cfeb78 [Reynold Xin] [SPARK-5097][SQL] Address DataFrame code review feedback.
2015-01-28 19:10:32 -08:00
Reynold Xin c8e934ef3c [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.
and

[SPARK-5448][SQL] Make CacheManager a concrete class and field in SQLContext

Author: Reynold Xin <rxin@databricks.com>

Closes #4242 from rxin/sqlCleanup and squashes the following commits:

e351cb2 [Reynold Xin] Fixed toDataFrame.
6545c42 [Reynold Xin] More changes.
728c017 [Reynold Xin] [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.
2015-01-28 12:10:01 -08:00
Reynold Xin d74373225e [SPARK-5097][SQL] Test cases for DataFrame expressions.
Author: Reynold Xin <rxin@databricks.com>

Closes #4235 from rxin/df-tests1 and squashes the following commits:

f341db6 [Reynold Xin] [SPARK-5097][SQL] Test cases for DataFrame expressions.
2015-01-27 18:10:49 -08:00
Reynold Xin 119f45d61d [SPARK-5097][SQL] DataFrame
This pull request redesigns the existing Spark SQL dsl, which already provides data frame like functionalities.

TODOs:
With the exception of Python support, other tasks can be done in separate, follow-up PRs.
- [ ] Audit of the API
- [ ] Documentation
- [ ] More test cases to cover the new API
- [x] Python support
- [ ] Type alias SchemaRDD

Author: Reynold Xin <rxin@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #4173 from rxin/df1 and squashes the following commits:

0a1a73b [Reynold Xin] Merge branch 'df1' of github.com:rxin/spark into df1
23b4427 [Reynold Xin] Mima.
828f70d [Reynold Xin] Merge pull request #7 from davies/df
257b9e6 [Davies Liu] add repartition
6bf2b73 [Davies Liu] fix collect with UDT and tests
e971078 [Reynold Xin] Missing quotes.
b9306b4 [Reynold Xin] Remove removeColumn/updateColumn for now.
a728bf2 [Reynold Xin] Example rename.
e8aa3d3 [Reynold Xin] groupby -> groupBy.
9662c9e [Davies Liu] improve DataFrame Python API
4ae51ea [Davies Liu] python API for dataframe
1e5e454 [Reynold Xin] Fixed a bug with symbol conversion.
2ca74db [Reynold Xin] Couple minor fixes.
ea98ea1 [Reynold Xin] Documentation & literal expressions.
2b22684 [Reynold Xin] Got rid of IntelliJ problems.
02bbfbc [Reynold Xin] Tightening imports.
ffbce66 [Reynold Xin] Fixed compilation error.
59b6d8b [Reynold Xin] Style violation.
b85edfb [Reynold Xin] ALS.
8c37f0a [Reynold Xin] Made MLlib and examples compile
6d53134 [Reynold Xin] Hive module.
d35efd5 [Reynold Xin] Fixed compilation error.
ce4a5d2 [Reynold Xin] Fixed test cases in SQL except ParquetIOSuite.
66d5ef1 [Reynold Xin] SQLContext minor patch.
c9bcdc0 [Reynold Xin] Checkpoint: SQL module compiles!
2015-01-27 16:08:24 -08:00
Cheng Lian ba19689fe7 [SQL] [Minor] Remove deprecated parquet tests
This PR removes the deprecated `ParquetQuerySuite`, renamed `ParquetQuerySuite2` to `ParquetQuerySuite`, and refactored changes introduced in #4115 to `ParquetFilterSuite` . It is a follow-up of #3644.

Notice that test cases in the old `ParquetQuerySuite` have already been well covered by other test suites introduced in #3644.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4116)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #4116 from liancheng/remove-deprecated-parquet-tests and squashes the following commits:

f73b8f9 [Cheng Lian] Removes deprecated Parquet test suite
2015-01-21 14:38:10 -08:00
Josh Rosen b328ac6c8c Revert "[SPARK-5244] [SQL] add coalesce() in sql parser"
This reverts commit 812d3679f5.
2015-01-21 14:27:43 -08:00
Cheng Hao 8361078efa [SPARK-5009] [SQL] Long keyword support in SQL Parsers
* The `SqlLexical.allCaseVersions` will cause `StackOverflowException` if the key word is too long, the patch will fix that by normalizing all of the keywords in `SqlLexical`.
* And make a unified SparkSQLParser for sharing the common code.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3926 from chenghao-intel/long_keyword and squashes the following commits:

686660f [Cheng Hao] Support Long Keyword and Refactor the SQLParsers
2015-01-21 13:05:56 -08:00
Daoyuan Wang 812d3679f5 [SPARK-5244] [SQL] add coalesce() in sql parser
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4040 from adrian-wang/coalesce and squashes the following commits:

0ac8e8f [Daoyuan Wang] add coalesce() in sql parser
2015-01-21 12:59:41 -08:00
Reynold Xin d181c2a1fc [SPARK-5323][SQL] Remove Row's Seq inheritance.
Author: Reynold Xin <rxin@databricks.com>

Closes #4115 from rxin/row-seq and squashes the following commits:

e33abd8 [Reynold Xin] Fixed compilation error.
cceb650 [Reynold Xin] Python test fixes, and removal of WrapDynamic.
0334a52 [Reynold Xin] mkString.
9cdeb7d [Reynold Xin] Hive tests.
15681c2 [Reynold Xin] Fix more test cases.
ea9023a [Reynold Xin] Fixed a catalyst test.
c5e2cb5 [Reynold Xin] Minor patch up.
b9cab7c [Reynold Xin] [SPARK-5323][SQL] Remove Row's Seq inheritance.
2015-01-20 15:16:14 -08:00
Yin Huai bc20a52b34 [SPARK-5287][SQL] Add defaultSizeOf to every data type.
JIRA: https://issues.apache.org/jira/browse/SPARK-5287

This PR only add `defaultSizeOf` to data types and make those internal type classes `protected[sql]`. I will use another PR to cleanup the type hierarchy of data types.

Author: Yin Huai <yhuai@databricks.com>

Closes #4081 from yhuai/SPARK-5287 and squashes the following commits:

90cec75 [Yin Huai] Update unit test.
e1c600c [Yin Huai] Make internal classes protected[sql].
7eaba68 [Yin Huai] Add `defaultSize` method to data types.
fd425e0 [Yin Huai] Add all native types to NativeType.defaultSizeOf.
2015-01-20 13:26:36 -08:00
Jacky Li 7dbf1fdb81 [SQL] fix typo in class description
Author: Jacky Li <jacky.likun@gmail.com>

Closes #4100 from jackylk/patch-9 and squashes the following commits:

b13b9d6 [Jacky Li] Update SQLConf.scala
4d3f83d [Jacky Li] Update SQLConf.scala
fcc8c85 [Jacky Li] [SQL] fix typo in class description
2015-01-18 23:59:08 -08:00
Reynold Xin 1727e0841c [SPARK-5279][SQL] Use java.math.BigDecimal as the exposed Decimal type.
Author: Reynold Xin <rxin@databricks.com>

Closes #4092 from rxin/bigdecimal and squashes the following commits:

27b08c9 [Reynold Xin] Fixed test.
10cb496 [Reynold Xin] [SPARK-5279][SQL] Use java.math.BigDecimal as the exposed Decimal type.
2015-01-18 11:01:42 -08:00
Reynold Xin 61b427d4b1 [SPARK-5193][SQL] Remove Spark SQL Java-specific API.
After the following patches, the main (Scala) API is now usable for Java users directly.

https://github.com/apache/spark/pull/4056
https://github.com/apache/spark/pull/4054
https://github.com/apache/spark/pull/4049
https://github.com/apache/spark/pull/4030
https://github.com/apache/spark/pull/3965
https://github.com/apache/spark/pull/3958

Author: Reynold Xin <rxin@databricks.com>

Closes #4065 from rxin/sql-java-api and squashes the following commits:

b1fd860 [Reynold Xin] Fix Mima
6d86578 [Reynold Xin] Ok one more attempt in fixing Python...
e8f1455 [Reynold Xin] Fix Python again...
3e53f91 [Reynold Xin] Fixed Python.
83735da [Reynold Xin] Fix BigDecimal test.
e9f1de3 [Reynold Xin] Use scala BigDecimal.
500d2c4 [Reynold Xin] Fix Decimal.
ba3bfa2 [Reynold Xin] Updated javadoc for RowFactory.
c4ae1c5 [Reynold Xin] [SPARK-5193][SQL] Remove Spark SQL Java-specific API.
2015-01-16 21:09:06 -08:00
Reynold Xin 1881431dd5 [SPARK-5274][SQL] Reconcile Java and Scala UDFRegistration.
As part of SPARK-5193:

1. Removed UDFRegistration as a mixin in SQLContext and made it a field ("udf").
2. For Java UDFs, renamed dataType to returnType.
3. For Scala UDFs, added type tags.
4. Added all Java UDF registration methods to Scala's UDFRegistration.
5. Documentation

Author: Reynold Xin <rxin@databricks.com>

Closes #4056 from rxin/udf-registration and squashes the following commits:

ae9c556 [Reynold Xin] Updated example.
675a3c9 [Reynold Xin] Style fix
47c24ff [Reynold Xin] Python fix.
5f00c45 [Reynold Xin] Restore data type position in java udf and added typetags.
032f006 [Reynold Xin] [SPARK-5193][SQL] Reconcile Java and Scala UDFRegistration.
2015-01-15 16:15:12 -08:00
Reynold Xin cfa397c126 [SPARK-5193][SQL] Tighten up SQLContext API
1. Removed 2 implicits (logicalPlanToSparkQuery and baseRelationToSchemaRDD)
2. Moved extraStrategies into ExperimentalMethods.
3. Made private methods protected[sql] so they don't show up in javadocs.
4. Removed createParquetFile.
5. Added Java version of applySchema to SQLContext.

Author: Reynold Xin <rxin@databricks.com>

Closes #4049 from rxin/sqlContext-refactor and squashes the following commits:

a326a1a [Reynold Xin] Remove createParquetFile and add applySchema for Java to SQLContext.
ecd6685 [Reynold Xin] Added baseRelationToSchemaRDD back.
4a38c9b [Reynold Xin] [SPARK-5193][SQL] Tighten up SQLContext API
2015-01-14 18:36:15 -08:00
Alex Baretta 2fd7f72b6b [SPARK-5235] Make SQLConf Serializable
Declare SQLConf to be serializable to fix "Task not serializable" exceptions in SparkSQL

Author: Alex Baretta <alexbaretta@gmail.com>

Closes #4031 from alexbaretta/SPARK-5235-SQLConf and squashes the following commits:

c2103f5 [Alex Baretta] [SPARK-5235] Make SQLConf Serializable
2015-01-14 11:51:55 -08:00
Josh Rosen 259936be71 [SPARK-4014] Add TaskContext.attemptNumber and deprecate TaskContext.attemptId
`TaskContext.attemptId` is misleadingly-named, since it currently returns a taskId, which uniquely identifies a particular task attempt within a particular SparkContext, instead of an attempt number, which conveys how many times a task has been attempted.

This patch deprecates `TaskContext.attemptId` and add `TaskContext.taskId` and `TaskContext.attemptNumber` fields.  Prior to this change, it was impossible to determine whether a task was being re-attempted (or was a speculative copy), which made it difficult to write unit tests for tasks that fail on early attempts or speculative tasks that complete faster than original tasks.

Earlier versions of the TaskContext docs suggest that `attemptId` behaves like `attemptNumber`, so there's an argument to be made in favor of changing this method's implementation.  Since we've decided against making that change in maintenance branches, I think it's simpler to add better-named methods and retain the old behavior for `attemptId`; if `attemptId` behaved differently in different branches, then this would cause confusing build-breaks when backporting regression tests that rely on the new `attemptId` behavior.

Most of this patch is fairly straightforward, but there is a bit of trickiness related to Mesos tasks: since there's no field in MesosTaskInfo to encode the attemptId, I packed it into the `data` field alongside the task binary.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3849 from JoshRosen/SPARK-4014 and squashes the following commits:

89d03e0 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
5cfff05 [Josh Rosen] Introduce wrapper for serializing Mesos task launch data.
38574d4 [Josh Rosen] attemptId -> taskAttemptId in PairRDDFunctions
a180b88 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
1d43aa6 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
eee6a45 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
0b10526 [Josh Rosen] Use putInt instead of putLong (silly mistake)
8c387ce [Josh Rosen] Use local with maxRetries instead of local-cluster.
cbe4d76 [Josh Rosen] Preserve attemptId behavior and deprecate it:
b2dffa3 [Josh Rosen] Address some of Reynold's minor comments
9d8d4d1 [Josh Rosen] Doc typo
1e7a933 [Josh Rosen] [SPARK-4014] Change TaskContext.attemptId to return attempt number instead of task ID.
fd515a5 [Josh Rosen] Add failing test for SPARK-4014
2015-01-14 11:45:40 -08:00
Daoyuan Wang a3f7421b42 [SPARK-5248] [SQL] move sql.types.decimal.Decimal to sql.types.Decimal
rxin follow up of #3732

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4041 from adrian-wang/decimal and squashes the following commits:

aa3d738 [Daoyuan Wang] fix auto refactor
7777a58 [Daoyuan Wang] move sql.types.decimal.Decimal to sql.types.Decimal
2015-01-14 09:36:59 -08:00
Reynold Xin d5eeb35167 [SPARK-5167][SQL] Move Row into sql package and make it usable for Java.
Mostly just moving stuff around. This should still be source compatible since we type aliased Row previously in org.apache.spark.sql.Row.

Added the following APIs to Row:
```scala
def getMap[K, V](i: Int): scala.collection.Map[K, V]
def getJavaMap[K, V](i: Int): java.util.Map[K, V]
def getSeq[T](i: Int): Seq[T]
def getList[T](i: Int): java.util.List[T]
def getStruct(i: Int): StructType
```

Author: Reynold Xin <rxin@databricks.com>

Closes #4030 from rxin/sql-row and squashes the following commits:

6c85c29 [Reynold Xin] Fixed style violation by adding a new line to Row.scala.
82b064a [Reynold Xin] [SPARK-5167][SQL] Move Row into sql package and make it usable for Java.
2015-01-14 00:38:55 -08:00
Reynold Xin f9969098c8 [SPARK-5123][SQL] Reconcile Java/Scala API for data types.
Having two versions of the data type APIs (one for Java, one for Scala) requires downstream libraries to also have two versions of the APIs if the library wants to support both Java and Scala. I took a look at the Scala version of the data type APIs - it can actually work out pretty well for Java out of the box.

As part of the PR, I created a sql.types package and moved all type definitions there. I then removed the Java specific data type API along with a lot of the conversion code.

This subsumes https://github.com/apache/spark/pull/3925

Author: Reynold Xin <rxin@databricks.com>

Closes #3958 from rxin/SPARK-5123-datatype-2 and squashes the following commits:

66505cc [Reynold Xin] [SPARK-5123] Expose only one version of the data type APIs (i.e. remove the Java-specific API).
2015-01-13 17:16:41 -08:00
Reynold Xin 14e3f114ef [SPARK-5168] Make SQLConf a field rather than mixin in SQLContext
This change should be binary and source backward compatible since we didn't change any user facing APIs.

Author: Reynold Xin <rxin@databricks.com>

Closes #3965 from rxin/SPARK-5168-sqlconf and squashes the following commits:

42eec09 [Reynold Xin] Fix default conf value.
0ef86cc [Reynold Xin] Fix constructor ordering.
4d7f910 [Reynold Xin] Properly override config.
ccc8e6a [Reynold Xin] [SPARK-5168] Make SQLConf a field rather than mixin in SQLContext
2015-01-13 13:30:35 -08:00
Yin Huai 6463e0b9e8 [SPARK-4912][SQL] Persistent tables for the Spark SQL data sources api
With changes in this PR, users can persist metadata of tables created based on the data source API in metastore through DDLs.

Author: Yin Huai <yhuai@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #3960 from yhuai/persistantTablesWithSchema2 and squashes the following commits:

069c235 [Yin Huai] Make exception messages user friendly.
c07cbc6 [Yin Huai] Get the location of test file in a correct way.
4456e98 [Yin Huai] Test data.
5315dfc [Yin Huai] rxin's comments.
7fc4b56 [Yin Huai] Add DDLStrategy and HiveDDLStrategy to plan DDLs based on the data source API.
aeaf4b3 [Yin Huai] Add comments.
06f9b0c [Yin Huai] Revert unnecessary changes.
feb88aa [Yin Huai] Merge remote-tracking branch 'apache/master' into persistantTablesWithSchema2
172db80 [Yin Huai] Fix unit test.
49bf1ac [Yin Huai] Unit tests.
8f8f1a1 [Yin Huai] [SPARK-4574][SQL] Adding support for defining schema in foreign DDL commands. #3431
f47fda1 [Yin Huai] Unit tests.
2b59723 [Michael Armbrust] Set external when creating tables
c00bb1b [Michael Armbrust] Don't use reflection to read options
1ea6e7b [Michael Armbrust] Don't fail when trying to uncache a table that doesn't exist
6edc710 [Michael Armbrust] Add tests.
d7da491 [Michael Armbrust] First draft of persistent tables.
2015-01-13 13:01:27 -08:00
Michael Armbrust 5d9fa55082 [SPARK-5049][SQL] Fix ordering of partition columns in ParquetTableScan
Followup to #3870.  Props to rahulaggarwalguavus for identifying the issue.

Author: Michael Armbrust <michael@databricks.com>

Closes #3990 from marmbrus/SPARK-5049 and squashes the following commits:

dd03e4e [Michael Armbrust] Fill in the partition values of parquet scans instead of using JoinedRow
2015-01-12 15:19:09 -08:00
scwf d22a31f5e8 [SPARK-5029][SQL] Enable from follow multiple brackets
Enable from follow multiple brackets:
```
select key from ((select * from testData limit 1) union all (select * from testData limit 1)) x limit 1
```

Author: scwf <wangfei1@huawei.com>

Closes #3853 from scwf/from and squashes the following commits:

14f110a [scwf] enable from follow multiple brackets
2015-01-10 17:07:34 -08:00
Yanbo Liang 77106df691 SPARK-4963 [SQL] Add copy to SQL's Sample operator
https://issues.apache.org/jira/browse/SPARK-4963
SchemaRDD.sample() return wrong results due to GapSamplingIterator operating on mutable row.
HiveTableScan make RDD with SpecificMutableRow and SchemaRDD.sample() will return GapSamplingIterator for iterating.

override def next(): T = {
    val r = data.next()
    advance
    r
  }

GapSamplingIterator.next() return the current underlying element and assigned it to r.
However if the underlying iterator is mutable row just like what HiveTableScan returned, underlying iterator and r will point to the same object.
After advance operation, we drop some underlying elments and it also changed r which is not expected. Then we return the wrong value different from initial r.

To fix this issue, the most direct way is to make HiveTableScan return mutable row with copy just like the initial commit that I have made. This solution will make HiveTableScan can not get the full advantage of reusable MutableRow, but it can make sample operation return correct result.
Further more, we need to investigate  GapSamplingIterator.next() and make it can implement copy operation inside it. To achieve this, we should define every elements that RDD can store implement the function like cloneable and it will make huge change.

Author: Yanbo Liang <yanbohappy@gmail.com>

Closes #3827 from yanbohappy/spark-4963 and squashes the following commits:

0912ca0 [Yanbo Liang] code format keep
65c4e7c [Yanbo Liang] import file and clear annotation
55c7c56 [Yanbo Liang] better output of test case
cea7e2e [Yanbo Liang] SchemaRDD add copy operation before Sample operator
e840829 [Yanbo Liang] HiveTableScan return mutable row with copy
2015-01-10 14:19:32 -08:00
scwf b3e86dc624 [SPARK-4861][SQL] Refactory command in spark sql
Follow up for #3712.
This PR finally remove ```CommandStrategy``` and make all commands follow ```RunnableCommand``` so they can go with ```case r: RunnableCommand => ExecutedCommand(r) :: Nil```.

One exception is the ```DescribeCommand``` of hive, which is a special case and need to distinguish hive table and temporary table, so still keep ```HiveCommandStrategy``` here.

Author: scwf <wangfei1@huawei.com>

Closes #3948 from scwf/followup-SPARK-4861 and squashes the following commits:

6b48e64 [scwf] minor style fix
2c62e9d [scwf] fix for hive module
5a7a819 [scwf] Refactory command in spark sql
2015-01-10 14:08:04 -08:00
scwf 693a323a70 [SPARK-4574][SQL] Adding support for defining schema in foreign DDL commands.
Adding support for defining schema in foreign DDL commands. Now foreign DDL support commands like:
```
CREATE TEMPORARY TABLE avroTable
USING org.apache.spark.sql.avro
OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")
```
With this PR user can define schema instead of infer from file, so  support ddl command as follows:
```
CREATE TEMPORARY TABLE avroTable(a int, b string)
USING org.apache.spark.sql.avro
OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")
```

Author: scwf <wangfei1@huawei.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Fei Wang <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>

Closes #3431 from scwf/ddl and squashes the following commits:

7e79ce5 [Fei Wang] Merge pull request #22 from yhuai/pr3431yin
38f634e [Yin Huai] Remove Option from createRelation.
65e9c73 [Yin Huai] Revert all changes since applying a given schema has not been testd.
a852b10 [scwf] remove cleanIdentifier
f336a16 [Fei Wang] Merge pull request #21 from yhuai/pr3431yin
baf79b5 [Yin Huai] Test special characters quoted by backticks.
50a03b0 [Yin Huai] Use JsonRDD.nullTypeToStringType to convert NullType to StringType.
1eeb769 [Fei Wang] Merge pull request #20 from yhuai/pr3431yin
f5c22b0 [Yin Huai] Refactor code and update test cases.
f1cffe4 [Yin Huai] Revert "minor refactory"
b621c8f [scwf] minor refactory
d02547f [scwf] fix HiveCompatibilitySuite test failure
8dfbf7a [scwf] more tests for complex data type
ddab984 [Fei Wang] Merge pull request #19 from yhuai/pr3431yin
91ad91b [Yin Huai] Parse data types in DDLParser.
cf982d2 [scwf] fixed test failure
445b57b [scwf] address comments
02a662c [scwf] style issue
44eb70c [scwf] fix decimal parser issue
83b6fc3 [scwf] minor fix
9bf12f8 [wangfei] adding test case
7787ec7 [wangfei] added SchemaRelationProvider
0ba70df [wangfei] draft version
2015-01-10 13:53:21 -08:00
Alex Liu 4b39fd1e63 [SPARK-4943][SQL] Allow table name having dot for db/catalog
The pull only fixes the parsing error and changes API to use tableIdentifier. Joining different catalog datasource related change is not done in this pull.

Author: Alex Liu <alex_liu68@yahoo.com>

Closes #3941 from alexliu68/SPARK-SQL-4943-3 and squashes the following commits:

343ae27 [Alex Liu] [SPARK-4943][SQL] refactoring according to review
29e5e55 [Alex Liu] [SPARK-4943][SQL] fix failed Hive CTAS tests
6ae77ce [Alex Liu] [SPARK-4943][SQL] fix TestHive matching error
3652997 [Alex Liu] [SPARK-4943][SQL] Allow table name having dot to support db/catalog ...
2015-01-10 13:23:09 -08:00
luogankun 545dfcb92f [SPARK-5141][SQL]CaseInsensitiveMap throws java.io.NotSerializableException
CaseInsensitiveMap throws java.io.NotSerializableException.

Author: luogankun <luogankun@gmail.com>

Closes #3944 from luogankun/SPARK-5141 and squashes the following commits:

b6d63d5 [luogankun] [SPARK-5141]CaseInsensitiveMap throws java.io.NotSerializableException
2015-01-09 20:38:41 -08:00
Marcelo Vanzin 48cecf673c [SPARK-4048] Enhance and extend hadoop-provided profile.
This change does a few things to make the hadoop-provided profile more useful:

- Create new profiles for other libraries / services that might be provided by the infrastructure
- Simplify and fix the poms so that the profiles are only activated while building assemblies.
- Fix tests so that they're able to run when the profiles are activated
- Add a new env variable to be used by distributions that use these profiles to provide the runtime
  classpath for Spark jobs and daemons.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2982 from vanzin/SPARK-4048 and squashes the following commits:

82eb688 [Marcelo Vanzin] Add a comment.
eb228c0 [Marcelo Vanzin] Fix borked merge.
4e38f4e [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
9ef79a3 [Marcelo Vanzin] Alternative way to propagate test classpath to child processes.
371ebee [Marcelo Vanzin] Review feedback.
52f366d [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
83099fc [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
7377e7b [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
322f882 [Marcelo Vanzin] Fix merge fail.
f24e9e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
8b00b6a [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
9640503 [Marcelo Vanzin] Cleanup child process log message.
115fde5 [Marcelo Vanzin] Simplify a comment (and make it consistent with another pom).
e3ab2da [Marcelo Vanzin] Fix hive-thriftserver profile.
7820d58 [Marcelo Vanzin] Fix CliSuite with provided profiles.
1be73d4 [Marcelo Vanzin] Restore flume-provided profile.
d1399ed [Marcelo Vanzin] Restore jetty dependency.
82a54b9 [Marcelo Vanzin] Remove unused profile.
5c54a25 [Marcelo Vanzin] Fix HiveThriftServer2Suite with *-provided profiles.
1fc4d0b [Marcelo Vanzin] Update dependencies for hive-thriftserver.
f7b3bbe [Marcelo Vanzin] Add snappy to hadoop-provided list.
9e4e001 [Marcelo Vanzin] Remove duplicate hive profile.
d928d62 [Marcelo Vanzin] Redirect child stderr to parent's log.
4d67469 [Marcelo Vanzin] Propagate SPARK_DIST_CLASSPATH on Yarn.
417d90e [Marcelo Vanzin] Introduce "SPARK_DIST_CLASSPATH".
2f95f0d [Marcelo Vanzin] Propagate classpath to child processes during testing.
1adf91c [Marcelo Vanzin] Re-enable maven-install-plugin for a few projects.
284dda6 [Marcelo Vanzin] Rework the "hadoop-provided" profile, add new ones.
2015-01-08 17:15:13 -08:00
Sean Owen 4cba6eb420 SPARK-4159 [CORE] Maven build doesn't run JUnit test suites
This PR:

- Reenables `surefire`, and copies config from `scalatest` (which is itself an old fork of `surefire`, so similar)
- Tells `surefire` to test only Java tests
- Enables `surefire` and `scalatest` for all children, and in turn eliminates some duplication.

For me this causes the Scala and Java tests to be run once each, it seems, as desired. It doesn't affect the SBT build but works for Maven. I still need to verify that all of the Scala tests and Java tests are being run.

Author: Sean Owen <sowen@cloudera.com>

Closes #3651 from srowen/SPARK-4159 and squashes the following commits:

2e8a0af [Sean Owen] Remove specialized SPARK_HOME setting for REPL, YARN tests as it appears to be obsolete
12e4558 [Sean Owen] Append to unit-test.log instead of overwriting, so that both surefire and scalatest output is preserved. Also standardize/correct comments a bit.
e6f8601 [Sean Owen] Reenable Java tests by reenabling surefire with config cloned from scalatest; centralize test config in the parent
2015-01-06 12:02:08 -08:00
Reynold Xin 04d55d8e8e [SPARK-5040][SQL] Support expressing unresolved attributes using $"attribute name" notation in SQL DSL.
Author: Reynold Xin <rxin@databricks.com>

Closes #3862 from rxin/stringcontext-attr and squashes the following commits:

9b10f57 [Reynold Xin] Rename StrongToAttributeConversionHelper
72121af [Reynold Xin] [SPARK-5040][SQL] Support expressing unresolved attributes using $"attribute name" notation in SQL DSL.
2015-01-05 15:34:22 -08:00
Reynold Xin c88a3d7fca [SPARK-5038][SQL] Add explicit return type for implicit functions in Spark SQL
As we learned in https://github.com/apache/spark/pull/3580, not explicitly typing implicit functions can lead to compiler bugs and potentially unexpected runtime behavior.

Author: Reynold Xin <rxin@databricks.com>

Closes #3859 from rxin/sql-implicits and squashes the following commits:

30c2c24 [Reynold Xin] [SPARK-5038] Add explicit return type for implicit functions in Spark SQL.
2014-12-31 14:25:03 -08:00
wangxiaojing 07fa1910d9 [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
JIRA issue: [SPARK-4570](https://issues.apache.org/jira/browse/SPARK-4570)
We are planning to create a `BroadcastLeftSemiJoinHash` to implement the broadcast join for `left semijoin`
In left semijoin :
If the size of data from right side is smaller than the user-settable threshold `AUTO_BROADCASTJOIN_THRESHOLD`,
the planner would mark it as the `broadcast` relation and mark the other relation as the stream side. The broadcast table will be broadcasted to all of the executors involved in the join, as a `org.apache.spark.broadcast.Broadcast` object. It will use `joins.BroadcastLeftSemiJoinHash`.,else it will use `joins.LeftSemiJoinHash`.

The benchmark suggests these  made the optimized version 4x faster  when `left semijoin`
<pre><code>
Original:
left semi join : 9288 ms
Optimized:
left semi join : 1963 ms
</code></pre>
The micro benchmark load `data1/kv3.txt` into a normal Hive table.
Benchmark code:
<pre><code>
 def benchmark(f: => Unit) = {
    val begin = System.currentTimeMillis()
    f
    val end = System.currentTimeMillis()
    end - begin
  }
  val sc = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))
  val hiveContext = new HiveContext(sc)
  import hiveContext._
  sql("drop table if exists left_table")
  sql("drop table if exists right_table")
  sql( """create table left_table (key int, value string)
       """.stripMargin)
  sql( s"""load data local inpath "/data1/kv3.txt" into table left_table""")
  sql( """create table right_table (key int, value string)
       """.stripMargin)
  sql(
    """
      |from left_table
      |insert overwrite table right_table
      |select left_table.key, left_table.value
    """.stripMargin)

  val leftSimeJoin = sql(
    """select a.key from left_table a
      |left semi join right_table b on a.key = b.key""".stripMargin)
  val leftSemiJoinDuration = benchmark(leftSimeJoin.count())
  println(s"left semi join : $leftSemiJoinDuration ms ")
</code></pre>

Author: wangxiaojing <u9jing@gmail.com>

Closes #3442 from wangxiaojing/SPARK-4570 and squashes the following commits:

a4a43c9 [wangxiaojing] rebase
f103983 [wangxiaojing] change style
fbe4887 [wangxiaojing] change style
ff2e618 [wangxiaojing] add testsuite
1a8da2a [wangxiaojing] add BroadcastLeftSemiJoinHash
2014-12-30 13:54:12 -08:00
Michael Davies 7425bec320 [SPARK-4386] Improve performance when writing Parquet files
Convert type of RowWriteSupport.attributes to Array.

Analysis of performance for writing very wide tables shows that time is spent predominantly in apply method on  attributes var. Type of attributes previously was LinearSeqOptimized and apply is O(N) which made write O(N squared).

Measurements on 575 column table showed this change made a 6x improvement in write times.

Author: Michael Davies <Michael.BellDavies@gmail.com>

Closes #3843 from MickDavies/SPARK-4386 and squashes the following commits:

892519d [Michael Davies] [SPARK-4386] Improve performance when writing Parquet files
2014-12-30 13:40:51 -08:00
Cheng Lian 61a99f6a11 [SPARK-4937][SQL] Normalizes conjunctions and disjunctions to eliminate common predicates
This PR is a simplified version of several filter optimization rules introduced in #3778 authored by scwf. Newly introduced optimizations include:

1. `a && a` => `a`
2. `a || a` => `a`
3. `(a || b || c || ...) && (a || b || d || ...)` => `a && b && (c || d || ...)`

The 3rd rule is particularly useful for optimizing the following query, which is planned into a cartesian product

```sql
SELECT *
  FROM t1, t2
 WHERE (t1.key = t2.key AND t1.value > 10)
    OR (t1.key = t2.key AND t2.value < 20)
```

to the following one, which is planned into an equi-join:

```sql
SELECT *
  FROM t1, t2
 WHERE t1.key = t2.key
   AND (t1.value > 10 OR t2.value < 20)
```

The example above is quite artificial, but common predicates are likely to appear in real life complex queries (like the one mentioned in #3778).

A difference between this PR and #3778 is that these optimizations are not limited to `Filter`, but are generalized to all logical plan nodes. Thanks to scwf for bringing up these optimizations, and chenghao-intel for the generalization suggestion.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3784)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3784 from liancheng/normalize-filters and squashes the following commits:

caca560 [Cheng Lian] Moves filter normalization into BooleanSimplification rule
4ab3a58 [Cheng Lian] Fixes test failure, adds more tests
5d54349 [Cheng Lian] Fixes typo in comment
2abbf8e [Cheng Lian] Forgot our sacred Apache licence header...
cf95639 [Cheng Lian] Adds an optimization rule for filter normalization
2014-12-30 13:38:27 -08:00
Cheng Lian 19a8802e70 [SPARK-4493][SQL] Tests for IsNull / IsNotNull in the ParquetFilterSuite
This is a follow-up of #3367 and #3644.

At the time #3644 was written, #3367 hadn't been merged yet, thus `IsNull` and `IsNotNull` filters are not covered in the first version of `ParquetFilterSuite`. This PR adds corresponding test cases.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3748)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3748 from liancheng/test-null-filters and squashes the following commits:

1ab943f [Cheng Lian] IsNull and IsNotNull Parquet filter test case for boolean type
bcd616b [Cheng Lian] Adds Parquet filter pushedown tests for IsNull and IsNotNull
2014-12-30 12:16:45 -08:00
Cheng Hao 53f0a00b60 [Spark-4512] [SQL] Unresolved Attribute Exception in Sort By
It will cause exception while do query like:
SELECT key+key FROM src sort by value;

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3386 from chenghao-intel/sort and squashes the following commits:

38c78cc [Cheng Hao] revert the SortPartition in SparkStrategies
7e9dd15 [Cheng Hao] update the typo
fcd1d64 [Cheng Hao] rebase the latest master and update the SortBy unit test
2014-12-30 12:11:44 -08:00
wangfei daac221302 [SPARK-5002][SQL] Using ascending by default when not specify order in order by
spark sql does not support ```SELECT a, b FROM testData2 ORDER BY a desc, b```.

Author: wangfei <wangfei1@huawei.com>

Closes #3838 from scwf/orderby and squashes the following commits:

114b64a [wangfei] remove nouse methods
48145d3 [wangfei] fix order, using asc by default
2014-12-30 12:07:24 -08:00
Sean Owen 29fabb1b52 SPARK-4297 [BUILD] Build warning fixes omnibus
There are a number of warnings generated in a normal, successful build right now. They're mostly Java unchecked cast warnings, which can be suppressed. But there's a grab bag of other Scala language warnings and so on that can all be easily fixed. The forthcoming PR fixes about 90% of the build warnings I see now.

Author: Sean Owen <sowen@cloudera.com>

Closes #3157 from srowen/SPARK-4297 and squashes the following commits:

8c9e469 [Sean Owen] Suppress unchecked cast warnings, and several other build warning fixes
2014-12-24 13:32:51 -08:00
jbencook fd41eb9574 [SPARK-4860][pyspark][sql] speeding up sample() and takeSample()
This PR modifies the python `SchemaRDD` to use `sample()` and `takeSample()` from Scala instead of the slower python implementations from `rdd.py`. This is worthwhile because the `Row`'s are already serialized as Java objects.

In order to use the faster `takeSample()`, a `takeSampleToPython()` method was implemented in `SchemaRDD.scala` following the pattern of `collectToPython()`.

Author: jbencook <jbenjamincook@gmail.com>
Author: J. Benjamin Cook <jbenjamincook@gmail.com>

Closes #3764 from jbencook/master and squashes the following commits:

6fbc769 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing sloppy indentation for takeSampleToPython() arguments
5170da2 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing typo: from RDD to SchemaRDD
de22f70 [jbencook] [SPARK-4860][pyspark][sql] using sample() method from JavaSchemaRDD
b916442 [jbencook] [SPARK-4860][pyspark][sql] adding sample() to JavaSchemaRDD
020cbdf [jbencook] [SPARK-4860][pyspark][sql] using Scala implementations of `sample()` and `takeSample()`
2014-12-23 17:46:24 -08:00
wangfei c3d91da5ea [SPARK-4861][SQL] Refactory command in spark sql
Remove ```Command``` and use ```RunnableCommand``` instead.

Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>

Closes #3712 from scwf/cmd and squashes the following commits:

51a82f2 [wangfei] fix test failure
0e03be8 [wangfei] address comments
4033bed [scwf] remove CreateTableAsSelect in hivestrategy
5d20010 [wangfei] address comments
125f542 [scwf] factory command in spark sql
2014-12-18 20:24:56 -08:00
Thu Kyaw b68bc6d264 [SPARK-3928][SQL] Support wildcard matches on Parquet files.
...arquetFile accept hadoop glob pattern in path.

Author: Thu Kyaw <trk007@gmail.com>

Closes #3407 from tkyaw/master and squashes the following commits:

19115ad [Thu Kyaw] Merge https://github.com/apache/spark
ceded32 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
d322c28 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
ce677c6 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
2014-12-18 20:08:32 -08:00
Cheng Hao f728e0fe7e [SPARK-2663] [SQL] Support the Grouping Set
Add support for `GROUPING SETS`, `ROLLUP`, `CUBE` and the the virtual column `GROUPING__ID`.

More details on how to use the `GROUPING SETS" can be found at: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup
https://issues.apache.org/jira/secure/attachment/12676811/grouping_set.pdf

The generic idea of the implementations are :
1 Replace the `ROLLUP`, `CUBE` with `GROUPING SETS`
2 Explode each of the input row, and then feed them to `Aggregate`
  * Each grouping set are represented as the bit mask for the `GroupBy Expression List`, for each bit, `1` means the expression is selected, otherwise `0` (left is the lower bit, and right is the higher bit in the `GroupBy Expression List`)
  * Several of projections are constructed according to the grouping sets, and within each projection(Seq[Expression), we replace those expressions with `Literal(null)` if it's not selected in the grouping set (based on the bit mask)
  * Output Schema of `Explode` is `child.output :+ grouping__id`
  * GroupBy Expressions of `Aggregate` is `GroupBy Expression List :+ grouping__id`
  * Keep the `Aggregation expressions` the same for the `Aggregate`

The expressions substitutions happen in Logic Plan analyzing, so we will benefit from the Logical Plan optimization (e.g. expression constant folding, and map side aggregation etc.), Only an `Explosive` operator added for Physical Plan, which will explode the rows according the pre-set projections.

A known issue will be done in the follow up PR:
* Optimization `ColumnPruning` is not supported yet for `Explosive` node.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1567 from chenghao-intel/grouping_sets and squashes the following commits:

fe65fcc [Cheng Hao] Remove the extra space
3547056 [Cheng Hao] Add more doc and Simplify the Expand
a7c869d [Cheng Hao] update code as feedbacks
d23c672 [Cheng Hao] Add GroupingExpression to replace the Seq[Expression]
414b165 [Cheng Hao] revert the unnecessary changes
ec276c6 [Cheng Hao] Support Rollup/Cube/GroupingSets
2014-12-18 18:58:29 -08:00
Cheng Hao 8d0d2a65eb [SPARK-4856] [SQL] NullType instead of StringType when sampling against empty string or nul...
```
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
```
As empty string (the "headers") will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type "headers" in line 1), and also take the line 1 (the "headers") as String Type, which is not our expected.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3708 from chenghao-intel/json and squashes the following commits:

e7a72e9 [Cheng Hao] add more concise unit test
853de51 [Cheng Hao] NullType instead of StringType when sampling against empty string or null value
2014-12-17 15:01:59 -08:00
Michael Armbrust 19c0faad6d [HOTFIX][SQL] Fix parquet filter suite
Author: Michael Armbrust <michael@databricks.com>

Closes #3727 from marmbrus/parquetNotEq and squashes the following commits:

2157bfc [Michael Armbrust] Fix parquet filter suite
2014-12-17 14:27:02 -08:00
Cheng Lian 6277135376 [SPARK-4493][SQL] Don't pushdown Eq, NotEq, Lt, LtEq, Gt and GtEq predicates with nulls for Parquet
Predicates like `a = NULL` and `a < NULL` can't be pushed down since Parquet `Lt`, `LtEq`, `Gt`, `GtEq` doesn't accept null value. Note that `Eq` and `NotEq` can only be used with `null` to represent predicates like `a IS NULL` and `a IS NOT NULL`.

However, normally this issue doesn't cause NPE because any value compared to `NULL` results `NULL`, and Spark SQL automatically optimizes out `NULL` predicate in the `SimplifyFilters` rule. Only testing code that intentionally disables the optimizer may trigger this issue. (That's why this issue is not marked as blocker and I do **NOT** think we need to backport this to branch-1.1

This PR restricts `Lt`, `LtEq`, `Gt` and `GtEq` to non-null values only, and only uses `Eq` with null value to pushdown `IsNull` and `IsNotNull`. Also, added support for Parquet `NotEq` filter for completeness and (tiny) performance gain, it's also used to pushdown `IsNotNull`.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3367)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3367 from liancheng/filters-with-null and squashes the following commits:

cc41281 [Cheng Lian] Fixes several styling issues
de7de28 [Cheng Lian] Adds stricter rules for Parquet filters with null
2014-12-17 12:48:04 -08:00
Cheng Hao 5fdcbdc0c9 [SPARK-4625] [SQL] Add sort by for DSL & SimpleSqlParser
Add `sort by` support for both DSL & SqlParser.

This PR is relevant with #3386, either one merged, will cause the other rebased.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3481 from chenghao-intel/sortby and squashes the following commits:

041004f [Cheng Hao] Add sort by for DSL & SimpleSqlParser
2014-12-17 12:01:57 -08:00
scwf 60698801eb [SPARK-4618][SQL] Make foreign DDL commands options case-insensitive
Using lowercase for ```options``` key to make it case-insensitive, then we should use lower case to get value from parameters.
So flowing cmd work
```
      create temporary table normal_parquet
      USING org.apache.spark.sql.parquet
      OPTIONS (
        PATH '/xxx/data'
      )
```

Author: scwf <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>

Closes #3470 from scwf/ddl-ulcase and squashes the following commits:

ae78509 [scwf] address comments
8f4f585 [wangfei] address comments
3c132ef [scwf] minor fix
a0fc20b [scwf] Merge branch 'master' of https://github.com/apache/spark into ddl-ulcase
4f86401 [scwf] adding CaseInsensitiveMap
e244e8d [wangfei] using lower case in json
e0cb017 [wangfei] make options in-casesensitive
2014-12-16 21:26:36 -08:00
Davies Liu ec5c4279ed [SPARK-4866] support StructType as key in MapType
This PR brings support of using StructType(and other hashable types) as key in MapType.

Author: Davies Liu <davies@databricks.com>

Closes #3714 from davies/fix_struct_in_map and squashes the following commits:

68585d7 [Davies Liu] fix primitive types in MapType
9601534 [Davies Liu] support StructType as key in MapType
2014-12-16 21:23:28 -08:00
Cheng Hao 770d8153a5 [SPARK-4375] [SQL] Add 0 argument support for udf
Author: Cheng Hao <hao.cheng@intel.com>

Closes #3595 from chenghao-intel/udf0 and squashes the following commits:

a858973 [Cheng Hao] Add 0 arguments support for udf
2014-12-16 21:21:11 -08:00
Cheng Lian 3b395e1051 [SPARK-4798][SQL] A new set of Parquet testing API and test suites
This PR provides a set Parquet testing API (see trait `ParquetTest`) that enables developers to write more concise test cases. A new set of Parquet test suites built upon this API  are added and aim to replace the old `ParquetQuerySuite`. To avoid potential merge conflicts, old testing code are not removed yet. The following classes can be safely removed after most Parquet related PRs are handled:

- `ParquetQuerySuite`
- `ParquetTestData`

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3644)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3644 from liancheng/parquet-tests and squashes the following commits:

800e745 [Cheng Lian] Enforces ordering of test output
3bb8731 [Cheng Lian] Refactors HiveParquetSuite
aa2cb2e [Cheng Lian] Decouples ParquetTest and TestSQLContext
7b43a68 [Cheng Lian] Updates ParquetTest Scaladoc
7f07af0 [Cheng Lian] Adds a new set of Parquet test suites
2014-12-16 21:16:03 -08:00
Jacky Li fa66ef6c97 [SPARK-4269][SQL] make wait time configurable in BroadcastHashJoin
In BroadcastHashJoin, currently it is using a hard coded value (5 minutes) to wait for the execution and broadcast of the small table.
In my opinion, it should be a configurable value since broadcast may exceed 5 minutes in some case, like in a busy/congested network environment.

Author: Jacky Li <jacky.likun@huawei.com>

Closes #3133 from jackylk/timeout-config and squashes the following commits:

733ac08 [Jacky Li] add spark.sql.broadcastTimeout in SQLConf.scala
557acd4 [Jacky Li] switch to sqlContext.getConf
81a5e20 [Jacky Li] make wait time configurable in BroadcastHashJoin
2014-12-16 15:34:59 -08:00
tianyi 30f6b85c81 [SPARK-4483][SQL]Optimization about reduce memory costs during the HashOuterJoin
In `HashOuterJoin.scala`, spark read data from both side of join operation before zip them together. It is a waste for memory. We are trying to read data from only one side, put them into a hashmap, and then generate the `JoinedRow` with data from other side one by one.
Currently, we could only do this optimization for `left outer join` and `right outer join`. For `full outer join`, we will do something in another issue.

for
table test_csv contains 1 million records
table dim_csv contains 10 thousand records

SQL:
`select * from test_csv a left outer join dim_csv b on a.key = b.key`

the result is:
master:
```
CSV: 12671 ms
CSV: 9021 ms
CSV: 9200 ms
Current Mem Usage:787788984
```
after patch:
```
CSV: 10382 ms
CSV: 7543 ms
CSV: 7469 ms
Current Mem Usage:208145728
```

Author: tianyi <tianyi@asiainfo-linkage.com>
Author: tianyi <tianyi.asiainfo@gmail.com>

Closes #3375 from tianyi/SPARK-4483 and squashes the following commits:

72a8aec [tianyi] avoid having mutable state stored inside of the task
99c5c97 [tianyi] performance optimization
d2f94d7 [tianyi] fix bug: missing output when the join-key is null.
2be45d1 [tianyi] fix spell bug
1f2c6f1 [tianyi] remove commented codes
a676de6 [tianyi] optimize some codes
9e7d5b5 [tianyi] remove commented old codes
838707d [tianyi] Optimization about reduce memory costs during the HashOuterJoin
2014-12-16 15:22:29 -08:00
wangxiaojing ea1315e3e2 [SPARK-4527][SQl]Add BroadcastNestedLoopJoin operator selection testsuite
In `JoinSuite` add BroadcastNestedLoopJoin operator selection testsuite

Author: wangxiaojing <u9jing@gmail.com>

Closes #3395 from wangxiaojing/SPARK-4527 and squashes the following commits:

ea0e495 [wangxiaojing] change style
53c3952 [wangxiaojing] Add BroadcastNestedLoopJoin operator selection testsuite
2014-12-16 14:45:56 -08:00
zsxwing 6530243a52 [SPARK-4812][SQL] Fix the initialization issue of 'codegenEnabled'
The problem is `codegenEnabled` is `val`, but it uses a `val` `sqlContext`, which can be override by subclasses. Here is a simple example to show this issue.

```Scala
scala> :paste
// Entering paste mode (ctrl-D to finish)

abstract class Foo {

  protected val sqlContext = "Foo"

  val codegenEnabled: Boolean = {
    println(sqlContext) // it will call subclass's `sqlContext` which has not yet been initialized.
    if (sqlContext != null) {
      true
    } else {
      false
    }
  }
}

class Bar extends Foo {
  override val sqlContext = "Bar"
}

println(new Bar().codegenEnabled)

// Exiting paste mode, now interpreting.

null
false
defined class Foo
defined class Bar
```

We should make `sqlContext` `final` to prevent subclasses from overriding it incorrectly.

Author: zsxwing <zsxwing@gmail.com>

Closes #3660 from zsxwing/SPARK-4812 and squashes the following commits:

1cbb623 [zsxwing] Make `sqlContext` final to prevent subclasses from overriding it incorrectly
2014-12-16 14:13:40 -08:00
jerryshao dc8280dcca [SPARK-4847][SQL]Fix "extraStrategies cannot take effect in SQLContext" issue
Author: jerryshao <saisai.shao@intel.com>

Closes #3698 from jerryshao/SPARK-4847 and squashes the following commits:

4741130 [jerryshao] Make later added extraStrategies effect when calling strategies
2014-12-16 14:08:28 -08:00
Sasaki Toru 8091dd62ea [SPARK-4742][SQL] The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded
When I use Parquet File as a output file using ParquetOutputFormat#getDefaultWorkFile, the file name is not zero padded while RDD#saveAsText does zero padding.

Author: Sasaki Toru <sasakitoa@nttdata.co.jp>

Closes #3602 from sasakitoa/parquet-zeroPadding and squashes the following commits:

6b0e58f [Sasaki Toru] Merge branch 'master' of git://github.com/apache/spark into parquet-zeroPadding
20dc79d [Sasaki Toru] Fixed the name of Parquet File generated by AppendingParquetOutputFormat
2014-12-11 22:54:21 -08:00
Cheng Hao bf40cf89e3 [SPARK-4713] [SQL] SchemaRDD.unpersist() should not raise exception if it is not persisted
Unpersist a uncached RDD, will not raise exception, for example:
```
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
distData.unpersist(true)
```

But the `SchemaRDD` will raise exception if the `SchemaRDD` is not cached. Since `SchemaRDD` is the subclasses of the `RDD`, we should follow the same behavior.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3572 from chenghao-intel/try_uncache and squashes the following commits:

50a7a89 [Cheng Hao] SchemaRDD.unpersist() should not raise exception if it is not persisted
2014-12-11 22:41:36 -08:00
Jacky Li 944384363d [SQL] remove unnecessary import in spark-sql
Author: Jacky Li <jacky.likun@huawei.com>

Closes #3630 from jackylk/remove and squashes the following commits:

150e7e0 [Jacky Li] remove unnecessary import
2014-12-08 17:27:46 -08:00
Michael Armbrust f5801e813f [SPARK-4753][SQL] Use catalyst for partition pruning in newParquet.
Author: Michael Armbrust <michael@databricks.com>

Closes #3613 from marmbrus/parquetPartitionPruning and squashes the following commits:

4f138f8 [Michael Armbrust] Use catalyst for partition pruning in newParquet.
2014-12-04 22:25:21 -08:00
Jacky Li ed88db4cb2 [SQL] remove unnecessary import
Author: Jacky Li <jacky.likun@huawei.com>

Closes #3585 from jackylk/remove and squashes the following commits:

045423d [Jacky Li] remove unnecessary import
2014-12-04 00:43:55 -08:00
Michael Armbrust 513ef82e85 [SPARK-4552][SQL] Avoid exception when reading empty parquet data through Hive
This is a very small fix that catches one specific exception and returns an empty table.  #3441 will address this in a more principled way.

Author: Michael Armbrust <michael@databricks.com>

Closes #3586 from marmbrus/fixEmptyParquet and squashes the following commits:

2781d9f [Michael Armbrust] Handle empty lists for newParquet
04dd376 [Michael Armbrust] Avoid exception when reading empty parquet data through Hive
2014-12-03 14:13:35 -08:00
YanTangZhai 1066427600 [SPARK-4676][SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
val jsc = new org.apache.spark.api.java.JavaSparkContext(sc)
val jhc = new org.apache.spark.sql.hive.api.java.JavaHiveContext(jsc)
val nrdd = jhc.hql("select null from spark_test.for_test")
println(nrdd.schema)
Then the error is thrown as follows:
scala.MatchError: NullType (of class org.apache.spark.sql.catalyst.types.NullType$)
at org.apache.spark.sql.types.util.DataTypeConversions$.asJavaDataType(DataTypeConversions.scala:43)

Author: YanTangZhai <hakeemzhai@tencent.com>
Author: yantangzhai <tyz0303@163.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #3538 from YanTangZhai/MatchNullType and squashes the following commits:

e052dff [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
4b4bb34 [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
896c7b7 [yantangzhai] fix NullType MatchError in JavaSchemaRDD when sql has null
6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
e249846 [YanTangZhai] Merge pull request #10 from apache/master
d26d982 [YanTangZhai] Merge pull request #9 from apache/master
76d4027 [YanTangZhai] Merge pull request #8 from apache/master
03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
2014-12-02 14:15:12 -08:00
baishuo 69b6fed206 [SPARK-4663][sql]add finally to avoid resource leak
Author: baishuo <vc_java@hotmail.com>

Closes #3526 from baishuo/master-trycatch and squashes the following commits:

d446e14 [baishuo] correct the code style
b36bf96 [baishuo] correct the code style
ae0e447 [baishuo] add finally to avoid resource leak
2014-12-02 12:12:03 -08:00
Kousuke Saruta e75e04f980 [SPARK-4536][SQL] Add sqrt and abs to Spark SQL DSL
Spark SQL has embeded sqrt and abs but DSL doesn't support those functions.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3401 from sarutak/dsl-missing-operator and squashes the following commits:

07700cf [Kousuke Saruta] Modified Literal(null, NullType) to Literal(null) in DslQuerySuite
8f366f8 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into dsl-missing-operator
1b88e2e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into dsl-missing-operator
0396f89 [Kousuke Saruta] Added sqrt and abs to Spark SQL DSL
2014-12-02 12:07:52 -08:00
Reynold Xin b1f8fe316a Indent license header properly for interfaces.scala.
A very small nit update.

Author: Reynold Xin <rxin@databricks.com>

Closes #3552 from rxin/license-header and squashes the following commits:

df8d1a4 [Reynold Xin] Indent license header properly for interfaces.scala.
2014-12-02 11:59:15 -08:00
wangfei 7b79957879 [SQL] Minor fix for doc and comment
Author: wangfei <wangfei1@huawei.com>

Closes #3533 from scwf/sql-doc1 and squashes the following commits:

962910b [wangfei] doc and comment fix
2014-12-01 14:02:02 -08:00
ravipesala bc353819cc [SPARK-4658][SQL] Code documentation issue in DDL of datasource API
Author: ravipesala <ravindra.pesala@huawei.com>

Closes #3516 from ravipesala/ddl_doc and squashes the following commits:

d101fdf [ravipesala] Style issues fixed
d2238cd [ravipesala] Corrected documentation
2014-12-01 13:31:27 -08:00
ravipesala 6a9ff19dc0 [SPARK-4650][SQL] Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL
Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL

Author: ravipesala <ravindra.pesala@huawei.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #3511 from ravipesala/countdistinct and squashes the following commits:

cc4dbb1 [ravipesala] style
070e12a [ravipesala] Supporting multi column support in count(distinct c1,c2..) in Spark SQL
2014-12-01 13:28:04 -08:00
Jacky Li bafee67eba [SQL] add @group tab in limit() and count()
group tab is missing for scaladoc

Author: Jacky Li <jacky.likun@gmail.com>

Closes #3458 from jackylk/patch-7 and squashes the following commits:

0121a70 [Jacky Li] add @group tab in limit() and count()
2014-12-01 13:12:30 -08:00
Davies Liu 6cf507685e [SPARK-4548] []SPARK-4517] improve performance of python broadcast
Re-implement the Python broadcast using file:

1) serialize the python object using cPickle, write into disks.
2) Create a wrapper in JVM (for the dumped file), it read data from during serialization
3) Using TorrentBroadcast or HttpBroadcast to transfer the data (compressed) into executors
4) During deserialization, writing the data into disk.
5) Passing the path into Python worker, read data from disk and unpickle it into python object, until the first access.

It fixes the performance regression introduced in #2659, has similar performance as 1.1, but support object larger than 2G, also improve the memory efficiency (only one compressed copy in driver and executor).

Testing with a 500M broadcast and 4 tasks (excluding the benefit from reused worker in 1.2):

         name |   1.1   | 1.2 with this patch |  improvement
---------|--------|---------|--------
      python-broadcast-w-bytes  |	25.20  |	9.33   |	170.13% |
        python-broadcast-w-set	  |     4.13	   |    4.50  |	-8.35%  |

Testing with 100 tasks (16 CPUs):

         name |   1.1   | 1.2 with this patch |  improvement
---------|--------|---------|--------
     python-broadcast-w-bytes	| 38.16	| 8.40	 | 353.98%
        python-broadcast-w-set	| 23.29	| 9.59 |	142.80%

Author: Davies Liu <davies@databricks.com>

Closes #3417 from davies/pybroadcast and squashes the following commits:

50a58e0 [Davies Liu] address comments
b98de1d [Davies Liu] disable gc while unpickle
e5ee6b9 [Davies Liu] support large string
09303b8 [Davies Liu] read all data into memory
dde02dd [Davies Liu] improve performance of python broadcast
2014-11-24 17:17:03 -08:00
Kousuke Saruta dd1c9cb36c [SPARK-4487][SQL] Fix attribute reference resolution error when using ORDER BY.
When we use ORDER BY clause, at first, attributes referenced by projection are resolved (1).
And then, attributes referenced at ORDER BY clause are resolved (2).
 But when resolving attributes referenced at ORDER BY clause, the resolution result generated in (1) is discarded so for example, following query fails.

    SELECT c1 + c2 FROM mytable ORDER BY c1;

The query above fails because when resolving the attribute reference 'c1', the resolution result of 'c2' is discarded.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3363 from sarutak/SPARK-4487 and squashes the following commits:

fd314f3 [Kousuke Saruta] Fixed attribute resolution logic in Analyzer
6e60c20 [Kousuke Saruta] Fixed conflicts
cb5b7e9 [Kousuke Saruta] Added test case for SPARK-4487
282d529 [Kousuke Saruta] Fixed attributes reference resolution error
b6123e6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into concat-feature
317b7fb [Kousuke Saruta] WIP
2014-11-24 12:54:37 -08:00
Cheng Lian a6d7b61f92 [SPARK-4479][SQL] Avoids unnecessary defensive copies when sort based shuffle is on
This PR is a workaround for SPARK-4479. Two changes are introduced: when merge sort is bypassed in `ExternalSorter`,

1. also bypass RDD elements buffering as buffering is the reason that `MutableRow` backed row objects must be copied, and
2. avoids defensive copies in `Exchange` operator

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3422)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3422 from liancheng/avoids-defensive-copies and squashes the following commits:

591f2e9 [Cheng Lian] Passes all shuffle suites
0c3c91e [Cheng Lian] Fixes shuffle write metrics when merge sort is bypassed
ed5df3c [Cheng Lian] Fixes styling changes
f75089b [Cheng Lian] Avoids unnecessary defensive copies when sort based shuffle is on
2014-11-24 12:43:45 -08:00
Michael Armbrust 02ec058efe [SPARK-4413][SQL] Parquet support through datasource API
Goals:
 - Support for accessing parquet using SQL but not requiring Hive (thus allowing support of parquet tables with decimal columns)
 - Support for folder based partitioning with automatic discovery of available partitions
 - Caching of file metadata

See scaladoc of `ParquetRelation2` for more details.

Author: Michael Armbrust <michael@databricks.com>

Closes #3269 from marmbrus/newParquet and squashes the following commits:

1dd75f1 [Michael Armbrust] Pass all paths for FileInputFormat at once.
645768b [Michael Armbrust] Review comments.
abd8e2f [Michael Armbrust] Alternative implementation of parquet based on the datasources API.
938019e [Michael Armbrust] Add an experimental interface to data sources that exposes catalyst expressions.
e9d2641 [Michael Armbrust] logging / formatting improvements.
2014-11-20 18:31:02 -08:00
Jacky Li ad5f1f3ca2 [SQL] fix function description mistake
Sample code in the description of SchemaRDD.where is not correct

Author: Jacky Li <jacky.likun@gmail.com>

Closes #3344 from jackylk/patch-6 and squashes the following commits:

62cd126 [Jacky Li] [SQL] fix function description mistake
2014-11-20 15:48:36 -08:00
Takuya UESHIN 2c2e7a44db [SPARK-4318][SQL] Fix empty sum distinct.
Executing sum distinct for empty table throws `java.lang.UnsupportedOperationException: empty.reduceLeft`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3184 from ueshin/issues/SPARK-4318 and squashes the following commits:

8168c42 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4318
66fdb0a [Takuya UESHIN] Re-refine aggregate functions.
6186eb4 [Takuya UESHIN] Fix Sum of GeneratedAggregate.
d2975f6 [Takuya UESHIN] Refine Sum and Average of GeneratedAggregate.
1bba675 [Takuya UESHIN] Refine Sum, SumDistinct and Average functions.
917e533 [Takuya UESHIN] Use aggregate instead of groupBy().
1a5f874 [Takuya UESHIN] Add tests to be executed as non-partial aggregation.
a5a57d2 [Takuya UESHIN] Fix empty Average.
22799dc [Takuya UESHIN] Fix empty Sum and SumDistinct.
65b7dd2 [Takuya UESHIN] Fix empty sum distinct.
2014-11-20 15:41:24 -08:00
ravipesala 98e9419784 [SPARK-4513][SQL] Support relational operator '<=>' in Spark SQL
The relational operator '<=>' is not working in Spark SQL. Same works in Spark HiveQL

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #3387 from ravipesala/<=> and squashes the following commits:

7198e90 [ravipesala] Supporting relational operator '<=>' in Spark SQL
2014-11-20 15:34:03 -08:00
Dan McClary b8e6886fb8 [SPARK-4228][SQL] SchemaRDD to JSON
Here's a simple fix for SchemaRDD to JSON.

Author: Dan McClary <dan.mcclary@gmail.com>

Closes #3213 from dwmclary/SPARK-4228 and squashes the following commits:

d714e1d [Dan McClary] fixed PEP 8 error
cac2879 [Dan McClary] move pyspark comment and doctest to correct location
f9471d3 [Dan McClary] added pyspark doc and doctest
6598cee [Dan McClary] adding complex type queries
1a5fd30 [Dan McClary] removing SPARK-4228 from SQLQuerySuite
4a651f0 [Dan McClary] cleaned PEP and Scala style failures.  Moved tests to JsonSuite
47ceff6 [Dan McClary] cleaned up scala style issues
2ee1e70 [Dan McClary] moved rowToJSON to JsonRDD
4387dd5 [Dan McClary] Added UserDefinedType, cleaned up case formatting
8f7bfb6 [Dan McClary] Map type added to SchemaRDD.toJSON
1b11980 [Dan McClary] Map and UserDefinedTypes partially done
11d2016 [Dan McClary] formatting and unicode deserialization default fixed
6af72d1 [Dan McClary] deleted extaneous comment
4d11c0c [Dan McClary] JsonFactory rewrite of toJSON for SchemaRDD
149dafd [Dan McClary] wrapped scala toJSON in sql.py
5e5eb1b [Dan McClary] switched to Jackson for JSON processing
6c94a54 [Dan McClary] added toJSON to pyspark SchemaRDD
aaeba58 [Dan McClary] added toJSON to pyspark SchemaRDD
1d171aa [Dan McClary] upated missing brace on if statement
319e3ba [Dan McClary] updated to upstream master with merged SPARK-4228
424f130 [Dan McClary] tests pass, ready for pull and PR
626a5b1 [Dan McClary] added toJSON to SchemaRDD
f7d166a [Dan McClary] added toJSON method
5d34e37 [Dan McClary] merge resolved
d6d19e9 [Dan McClary] pr example
2014-11-20 13:44:19 -08:00
Cheng Lian abf29187f0 [SPARK-3938][SQL] Names in-memory columnar RDD with corresponding table name
This PR enables the Web UI storage tab to show the in-memory table name instead of the mysterious query plan string as the name of the in-memory columnar RDD.

Note that after #2501, a single columnar RDD can be shared by multiple in-memory tables, as long as their query results are the same. In this case, only the first cached table name is shown. For example:

```sql
CACHE TABLE first AS SELECT * FROM src;
CACHE TABLE second AS SELECT * FROM src;
```

The Web UI only shows "In-memory table first".

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3383)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3383 from liancheng/columnar-rdd-name and squashes the following commits:

071907f [Cheng Lian] Fixes tests
12ddfa6 [Cheng Lian] Names in-memory columnar RDD with corresponding table name
2014-11-20 13:12:24 -08:00