Commit graph

231 commits

Author SHA1 Message Date
Reynold Xin c8e934ef3c [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.
and

[SPARK-5448][SQL] Make CacheManager a concrete class and field in SQLContext

Author: Reynold Xin <rxin@databricks.com>

Closes #4242 from rxin/sqlCleanup and squashes the following commits:

e351cb2 [Reynold Xin] Fixed toDataFrame.
6545c42 [Reynold Xin] More changes.
728c017 [Reynold Xin] [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.
2015-01-28 12:10:01 -08:00
Reynold Xin d74373225e [SPARK-5097][SQL] Test cases for DataFrame expressions.
Author: Reynold Xin <rxin@databricks.com>

Closes #4235 from rxin/df-tests1 and squashes the following commits:

f341db6 [Reynold Xin] [SPARK-5097][SQL] Test cases for DataFrame expressions.
2015-01-27 18:10:49 -08:00
Reynold Xin 119f45d61d [SPARK-5097][SQL] DataFrame
This pull request redesigns the existing Spark SQL dsl, which already provides data frame like functionalities.

TODOs:
With the exception of Python support, other tasks can be done in separate, follow-up PRs.
- [ ] Audit of the API
- [ ] Documentation
- [ ] More test cases to cover the new API
- [x] Python support
- [ ] Type alias SchemaRDD

Author: Reynold Xin <rxin@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #4173 from rxin/df1 and squashes the following commits:

0a1a73b [Reynold Xin] Merge branch 'df1' of github.com:rxin/spark into df1
23b4427 [Reynold Xin] Mima.
828f70d [Reynold Xin] Merge pull request #7 from davies/df
257b9e6 [Davies Liu] add repartition
6bf2b73 [Davies Liu] fix collect with UDT and tests
e971078 [Reynold Xin] Missing quotes.
b9306b4 [Reynold Xin] Remove removeColumn/updateColumn for now.
a728bf2 [Reynold Xin] Example rename.
e8aa3d3 [Reynold Xin] groupby -> groupBy.
9662c9e [Davies Liu] improve DataFrame Python API
4ae51ea [Davies Liu] python API for dataframe
1e5e454 [Reynold Xin] Fixed a bug with symbol conversion.
2ca74db [Reynold Xin] Couple minor fixes.
ea98ea1 [Reynold Xin] Documentation & literal expressions.
2b22684 [Reynold Xin] Got rid of IntelliJ problems.
02bbfbc [Reynold Xin] Tightening imports.
ffbce66 [Reynold Xin] Fixed compilation error.
59b6d8b [Reynold Xin] Style violation.
b85edfb [Reynold Xin] ALS.
8c37f0a [Reynold Xin] Made MLlib and examples compile
6d53134 [Reynold Xin] Hive module.
d35efd5 [Reynold Xin] Fixed compilation error.
ce4a5d2 [Reynold Xin] Fixed test cases in SQL except ParquetIOSuite.
66d5ef1 [Reynold Xin] SQLContext minor patch.
c9bcdc0 [Reynold Xin] Checkpoint: SQL module compiles!
2015-01-27 16:08:24 -08:00
Cheng Lian ba19689fe7 [SQL] [Minor] Remove deprecated parquet tests
This PR removes the deprecated `ParquetQuerySuite`, renamed `ParquetQuerySuite2` to `ParquetQuerySuite`, and refactored changes introduced in #4115 to `ParquetFilterSuite` . It is a follow-up of #3644.

Notice that test cases in the old `ParquetQuerySuite` have already been well covered by other test suites introduced in #3644.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4116)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #4116 from liancheng/remove-deprecated-parquet-tests and squashes the following commits:

f73b8f9 [Cheng Lian] Removes deprecated Parquet test suite
2015-01-21 14:38:10 -08:00
Josh Rosen b328ac6c8c Revert "[SPARK-5244] [SQL] add coalesce() in sql parser"
This reverts commit 812d3679f5.
2015-01-21 14:27:43 -08:00
Daoyuan Wang 812d3679f5 [SPARK-5244] [SQL] add coalesce() in sql parser
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4040 from adrian-wang/coalesce and squashes the following commits:

0ac8e8f [Daoyuan Wang] add coalesce() in sql parser
2015-01-21 12:59:41 -08:00
Reynold Xin d181c2a1fc [SPARK-5323][SQL] Remove Row's Seq inheritance.
Author: Reynold Xin <rxin@databricks.com>

Closes #4115 from rxin/row-seq and squashes the following commits:

e33abd8 [Reynold Xin] Fixed compilation error.
cceb650 [Reynold Xin] Python test fixes, and removal of WrapDynamic.
0334a52 [Reynold Xin] mkString.
9cdeb7d [Reynold Xin] Hive tests.
15681c2 [Reynold Xin] Fix more test cases.
ea9023a [Reynold Xin] Fixed a catalyst test.
c5e2cb5 [Reynold Xin] Minor patch up.
b9cab7c [Reynold Xin] [SPARK-5323][SQL] Remove Row's Seq inheritance.
2015-01-20 15:16:14 -08:00
Yin Huai bc20a52b34 [SPARK-5287][SQL] Add defaultSizeOf to every data type.
JIRA: https://issues.apache.org/jira/browse/SPARK-5287

This PR only add `defaultSizeOf` to data types and make those internal type classes `protected[sql]`. I will use another PR to cleanup the type hierarchy of data types.

Author: Yin Huai <yhuai@databricks.com>

Closes #4081 from yhuai/SPARK-5287 and squashes the following commits:

90cec75 [Yin Huai] Update unit test.
e1c600c [Yin Huai] Make internal classes protected[sql].
7eaba68 [Yin Huai] Add `defaultSize` method to data types.
fd425e0 [Yin Huai] Add all native types to NativeType.defaultSizeOf.
2015-01-20 13:26:36 -08:00
Reynold Xin 1727e0841c [SPARK-5279][SQL] Use java.math.BigDecimal as the exposed Decimal type.
Author: Reynold Xin <rxin@databricks.com>

Closes #4092 from rxin/bigdecimal and squashes the following commits:

27b08c9 [Reynold Xin] Fixed test.
10cb496 [Reynold Xin] [SPARK-5279][SQL] Use java.math.BigDecimal as the exposed Decimal type.
2015-01-18 11:01:42 -08:00
Reynold Xin 61b427d4b1 [SPARK-5193][SQL] Remove Spark SQL Java-specific API.
After the following patches, the main (Scala) API is now usable for Java users directly.

https://github.com/apache/spark/pull/4056
https://github.com/apache/spark/pull/4054
https://github.com/apache/spark/pull/4049
https://github.com/apache/spark/pull/4030
https://github.com/apache/spark/pull/3965
https://github.com/apache/spark/pull/3958

Author: Reynold Xin <rxin@databricks.com>

Closes #4065 from rxin/sql-java-api and squashes the following commits:

b1fd860 [Reynold Xin] Fix Mima
6d86578 [Reynold Xin] Ok one more attempt in fixing Python...
e8f1455 [Reynold Xin] Fix Python again...
3e53f91 [Reynold Xin] Fixed Python.
83735da [Reynold Xin] Fix BigDecimal test.
e9f1de3 [Reynold Xin] Use scala BigDecimal.
500d2c4 [Reynold Xin] Fix Decimal.
ba3bfa2 [Reynold Xin] Updated javadoc for RowFactory.
c4ae1c5 [Reynold Xin] [SPARK-5193][SQL] Remove Spark SQL Java-specific API.
2015-01-16 21:09:06 -08:00
Reynold Xin 1881431dd5 [SPARK-5274][SQL] Reconcile Java and Scala UDFRegistration.
As part of SPARK-5193:

1. Removed UDFRegistration as a mixin in SQLContext and made it a field ("udf").
2. For Java UDFs, renamed dataType to returnType.
3. For Scala UDFs, added type tags.
4. Added all Java UDF registration methods to Scala's UDFRegistration.
5. Documentation

Author: Reynold Xin <rxin@databricks.com>

Closes #4056 from rxin/udf-registration and squashes the following commits:

ae9c556 [Reynold Xin] Updated example.
675a3c9 [Reynold Xin] Style fix
47c24ff [Reynold Xin] Python fix.
5f00c45 [Reynold Xin] Restore data type position in java udf and added typetags.
032f006 [Reynold Xin] [SPARK-5193][SQL] Reconcile Java and Scala UDFRegistration.
2015-01-15 16:15:12 -08:00
Reynold Xin cfa397c126 [SPARK-5193][SQL] Tighten up SQLContext API
1. Removed 2 implicits (logicalPlanToSparkQuery and baseRelationToSchemaRDD)
2. Moved extraStrategies into ExperimentalMethods.
3. Made private methods protected[sql] so they don't show up in javadocs.
4. Removed createParquetFile.
5. Added Java version of applySchema to SQLContext.

Author: Reynold Xin <rxin@databricks.com>

Closes #4049 from rxin/sqlContext-refactor and squashes the following commits:

a326a1a [Reynold Xin] Remove createParquetFile and add applySchema for Java to SQLContext.
ecd6685 [Reynold Xin] Added baseRelationToSchemaRDD back.
4a38c9b [Reynold Xin] [SPARK-5193][SQL] Tighten up SQLContext API
2015-01-14 18:36:15 -08:00
Daoyuan Wang a3f7421b42 [SPARK-5248] [SQL] move sql.types.decimal.Decimal to sql.types.Decimal
rxin follow up of #3732

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4041 from adrian-wang/decimal and squashes the following commits:

aa3d738 [Daoyuan Wang] fix auto refactor
7777a58 [Daoyuan Wang] move sql.types.decimal.Decimal to sql.types.Decimal
2015-01-14 09:36:59 -08:00
Reynold Xin f9969098c8 [SPARK-5123][SQL] Reconcile Java/Scala API for data types.
Having two versions of the data type APIs (one for Java, one for Scala) requires downstream libraries to also have two versions of the APIs if the library wants to support both Java and Scala. I took a look at the Scala version of the data type APIs - it can actually work out pretty well for Java out of the box.

As part of the PR, I created a sql.types package and moved all type definitions there. I then removed the Java specific data type API along with a lot of the conversion code.

This subsumes https://github.com/apache/spark/pull/3925

Author: Reynold Xin <rxin@databricks.com>

Closes #3958 from rxin/SPARK-5123-datatype-2 and squashes the following commits:

66505cc [Reynold Xin] [SPARK-5123] Expose only one version of the data type APIs (i.e. remove the Java-specific API).
2015-01-13 17:16:41 -08:00
Reynold Xin 14e3f114ef [SPARK-5168] Make SQLConf a field rather than mixin in SQLContext
This change should be binary and source backward compatible since we didn't change any user facing APIs.

Author: Reynold Xin <rxin@databricks.com>

Closes #3965 from rxin/SPARK-5168-sqlconf and squashes the following commits:

42eec09 [Reynold Xin] Fix default conf value.
0ef86cc [Reynold Xin] Fix constructor ordering.
4d7f910 [Reynold Xin] Properly override config.
ccc8e6a [Reynold Xin] [SPARK-5168] Make SQLConf a field rather than mixin in SQLContext
2015-01-13 13:30:35 -08:00
Yin Huai 6463e0b9e8 [SPARK-4912][SQL] Persistent tables for the Spark SQL data sources api
With changes in this PR, users can persist metadata of tables created based on the data source API in metastore through DDLs.

Author: Yin Huai <yhuai@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #3960 from yhuai/persistantTablesWithSchema2 and squashes the following commits:

069c235 [Yin Huai] Make exception messages user friendly.
c07cbc6 [Yin Huai] Get the location of test file in a correct way.
4456e98 [Yin Huai] Test data.
5315dfc [Yin Huai] rxin's comments.
7fc4b56 [Yin Huai] Add DDLStrategy and HiveDDLStrategy to plan DDLs based on the data source API.
aeaf4b3 [Yin Huai] Add comments.
06f9b0c [Yin Huai] Revert unnecessary changes.
feb88aa [Yin Huai] Merge remote-tracking branch 'apache/master' into persistantTablesWithSchema2
172db80 [Yin Huai] Fix unit test.
49bf1ac [Yin Huai] Unit tests.
8f8f1a1 [Yin Huai] [SPARK-4574][SQL] Adding support for defining schema in foreign DDL commands. #3431
f47fda1 [Yin Huai] Unit tests.
2b59723 [Michael Armbrust] Set external when creating tables
c00bb1b [Michael Armbrust] Don't use reflection to read options
1ea6e7b [Michael Armbrust] Don't fail when trying to uncache a table that doesn't exist
6edc710 [Michael Armbrust] Add tests.
d7da491 [Michael Armbrust] First draft of persistent tables.
2015-01-13 13:01:27 -08:00
scwf d22a31f5e8 [SPARK-5029][SQL] Enable from follow multiple brackets
Enable from follow multiple brackets:
```
select key from ((select * from testData limit 1) union all (select * from testData limit 1)) x limit 1
```

Author: scwf <wangfei1@huawei.com>

Closes #3853 from scwf/from and squashes the following commits:

14f110a [scwf] enable from follow multiple brackets
2015-01-10 17:07:34 -08:00
scwf 693a323a70 [SPARK-4574][SQL] Adding support for defining schema in foreign DDL commands.
Adding support for defining schema in foreign DDL commands. Now foreign DDL support commands like:
```
CREATE TEMPORARY TABLE avroTable
USING org.apache.spark.sql.avro
OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")
```
With this PR user can define schema instead of infer from file, so  support ddl command as follows:
```
CREATE TEMPORARY TABLE avroTable(a int, b string)
USING org.apache.spark.sql.avro
OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")
```

Author: scwf <wangfei1@huawei.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Fei Wang <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>

Closes #3431 from scwf/ddl and squashes the following commits:

7e79ce5 [Fei Wang] Merge pull request #22 from yhuai/pr3431yin
38f634e [Yin Huai] Remove Option from createRelation.
65e9c73 [Yin Huai] Revert all changes since applying a given schema has not been testd.
a852b10 [scwf] remove cleanIdentifier
f336a16 [Fei Wang] Merge pull request #21 from yhuai/pr3431yin
baf79b5 [Yin Huai] Test special characters quoted by backticks.
50a03b0 [Yin Huai] Use JsonRDD.nullTypeToStringType to convert NullType to StringType.
1eeb769 [Fei Wang] Merge pull request #20 from yhuai/pr3431yin
f5c22b0 [Yin Huai] Refactor code and update test cases.
f1cffe4 [Yin Huai] Revert "minor refactory"
b621c8f [scwf] minor refactory
d02547f [scwf] fix HiveCompatibilitySuite test failure
8dfbf7a [scwf] more tests for complex data type
ddab984 [Fei Wang] Merge pull request #19 from yhuai/pr3431yin
91ad91b [Yin Huai] Parse data types in DDLParser.
cf982d2 [scwf] fixed test failure
445b57b [scwf] address comments
02a662c [scwf] style issue
44eb70c [scwf] fix decimal parser issue
83b6fc3 [scwf] minor fix
9bf12f8 [wangfei] adding test case
7787ec7 [wangfei] added SchemaRelationProvider
0ba70df [wangfei] draft version
2015-01-10 13:53:21 -08:00
Alex Liu 4b39fd1e63 [SPARK-4943][SQL] Allow table name having dot for db/catalog
The pull only fixes the parsing error and changes API to use tableIdentifier. Joining different catalog datasource related change is not done in this pull.

Author: Alex Liu <alex_liu68@yahoo.com>

Closes #3941 from alexliu68/SPARK-SQL-4943-3 and squashes the following commits:

343ae27 [Alex Liu] [SPARK-4943][SQL] refactoring according to review
29e5e55 [Alex Liu] [SPARK-4943][SQL] fix failed Hive CTAS tests
6ae77ce [Alex Liu] [SPARK-4943][SQL] fix TestHive matching error
3652997 [Alex Liu] [SPARK-4943][SQL] Allow table name having dot to support db/catalog ...
2015-01-10 13:23:09 -08:00
Reynold Xin 04d55d8e8e [SPARK-5040][SQL] Support expressing unresolved attributes using $"attribute name" notation in SQL DSL.
Author: Reynold Xin <rxin@databricks.com>

Closes #3862 from rxin/stringcontext-attr and squashes the following commits:

9b10f57 [Reynold Xin] Rename StrongToAttributeConversionHelper
72121af [Reynold Xin] [SPARK-5040][SQL] Support expressing unresolved attributes using $"attribute name" notation in SQL DSL.
2015-01-05 15:34:22 -08:00
wangxiaojing 07fa1910d9 [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
JIRA issue: [SPARK-4570](https://issues.apache.org/jira/browse/SPARK-4570)
We are planning to create a `BroadcastLeftSemiJoinHash` to implement the broadcast join for `left semijoin`
In left semijoin :
If the size of data from right side is smaller than the user-settable threshold `AUTO_BROADCASTJOIN_THRESHOLD`,
the planner would mark it as the `broadcast` relation and mark the other relation as the stream side. The broadcast table will be broadcasted to all of the executors involved in the join, as a `org.apache.spark.broadcast.Broadcast` object. It will use `joins.BroadcastLeftSemiJoinHash`.,else it will use `joins.LeftSemiJoinHash`.

The benchmark suggests these  made the optimized version 4x faster  when `left semijoin`
<pre><code>
Original:
left semi join : 9288 ms
Optimized:
left semi join : 1963 ms
</code></pre>
The micro benchmark load `data1/kv3.txt` into a normal Hive table.
Benchmark code:
<pre><code>
 def benchmark(f: => Unit) = {
    val begin = System.currentTimeMillis()
    f
    val end = System.currentTimeMillis()
    end - begin
  }
  val sc = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))
  val hiveContext = new HiveContext(sc)
  import hiveContext._
  sql("drop table if exists left_table")
  sql("drop table if exists right_table")
  sql( """create table left_table (key int, value string)
       """.stripMargin)
  sql( s"""load data local inpath "/data1/kv3.txt" into table left_table""")
  sql( """create table right_table (key int, value string)
       """.stripMargin)
  sql(
    """
      |from left_table
      |insert overwrite table right_table
      |select left_table.key, left_table.value
    """.stripMargin)

  val leftSimeJoin = sql(
    """select a.key from left_table a
      |left semi join right_table b on a.key = b.key""".stripMargin)
  val leftSemiJoinDuration = benchmark(leftSimeJoin.count())
  println(s"left semi join : $leftSemiJoinDuration ms ")
</code></pre>

Author: wangxiaojing <u9jing@gmail.com>

Closes #3442 from wangxiaojing/SPARK-4570 and squashes the following commits:

a4a43c9 [wangxiaojing] rebase
f103983 [wangxiaojing] change style
fbe4887 [wangxiaojing] change style
ff2e618 [wangxiaojing] add testsuite
1a8da2a [wangxiaojing] add BroadcastLeftSemiJoinHash
2014-12-30 13:54:12 -08:00
Cheng Lian 61a99f6a11 [SPARK-4937][SQL] Normalizes conjunctions and disjunctions to eliminate common predicates
This PR is a simplified version of several filter optimization rules introduced in #3778 authored by scwf. Newly introduced optimizations include:

1. `a && a` => `a`
2. `a || a` => `a`
3. `(a || b || c || ...) && (a || b || d || ...)` => `a && b && (c || d || ...)`

The 3rd rule is particularly useful for optimizing the following query, which is planned into a cartesian product

```sql
SELECT *
  FROM t1, t2
 WHERE (t1.key = t2.key AND t1.value > 10)
    OR (t1.key = t2.key AND t2.value < 20)
```

to the following one, which is planned into an equi-join:

```sql
SELECT *
  FROM t1, t2
 WHERE t1.key = t2.key
   AND (t1.value > 10 OR t2.value < 20)
```

The example above is quite artificial, but common predicates are likely to appear in real life complex queries (like the one mentioned in #3778).

A difference between this PR and #3778 is that these optimizations are not limited to `Filter`, but are generalized to all logical plan nodes. Thanks to scwf for bringing up these optimizations, and chenghao-intel for the generalization suggestion.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3784)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3784 from liancheng/normalize-filters and squashes the following commits:

caca560 [Cheng Lian] Moves filter normalization into BooleanSimplification rule
4ab3a58 [Cheng Lian] Fixes test failure, adds more tests
5d54349 [Cheng Lian] Fixes typo in comment
2abbf8e [Cheng Lian] Forgot our sacred Apache licence header...
cf95639 [Cheng Lian] Adds an optimization rule for filter normalization
2014-12-30 13:38:27 -08:00
Cheng Lian 19a8802e70 [SPARK-4493][SQL] Tests for IsNull / IsNotNull in the ParquetFilterSuite
This is a follow-up of #3367 and #3644.

At the time #3644 was written, #3367 hadn't been merged yet, thus `IsNull` and `IsNotNull` filters are not covered in the first version of `ParquetFilterSuite`. This PR adds corresponding test cases.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3748)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3748 from liancheng/test-null-filters and squashes the following commits:

1ab943f [Cheng Lian] IsNull and IsNotNull Parquet filter test case for boolean type
bcd616b [Cheng Lian] Adds Parquet filter pushedown tests for IsNull and IsNotNull
2014-12-30 12:16:45 -08:00
Cheng Hao 53f0a00b60 [Spark-4512] [SQL] Unresolved Attribute Exception in Sort By
It will cause exception while do query like:
SELECT key+key FROM src sort by value;

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3386 from chenghao-intel/sort and squashes the following commits:

38c78cc [Cheng Hao] revert the SortPartition in SparkStrategies
7e9dd15 [Cheng Hao] update the typo
fcd1d64 [Cheng Hao] rebase the latest master and update the SortBy unit test
2014-12-30 12:11:44 -08:00
wangfei daac221302 [SPARK-5002][SQL] Using ascending by default when not specify order in order by
spark sql does not support ```SELECT a, b FROM testData2 ORDER BY a desc, b```.

Author: wangfei <wangfei1@huawei.com>

Closes #3838 from scwf/orderby and squashes the following commits:

114b64a [wangfei] remove nouse methods
48145d3 [wangfei] fix order, using asc by default
2014-12-30 12:07:24 -08:00
Sean Owen 29fabb1b52 SPARK-4297 [BUILD] Build warning fixes omnibus
There are a number of warnings generated in a normal, successful build right now. They're mostly Java unchecked cast warnings, which can be suppressed. But there's a grab bag of other Scala language warnings and so on that can all be easily fixed. The forthcoming PR fixes about 90% of the build warnings I see now.

Author: Sean Owen <sowen@cloudera.com>

Closes #3157 from srowen/SPARK-4297 and squashes the following commits:

8c9e469 [Sean Owen] Suppress unchecked cast warnings, and several other build warning fixes
2014-12-24 13:32:51 -08:00
Thu Kyaw b68bc6d264 [SPARK-3928][SQL] Support wildcard matches on Parquet files.
...arquetFile accept hadoop glob pattern in path.

Author: Thu Kyaw <trk007@gmail.com>

Closes #3407 from tkyaw/master and squashes the following commits:

19115ad [Thu Kyaw] Merge https://github.com/apache/spark
ceded32 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
d322c28 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
ce677c6 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
2014-12-18 20:08:32 -08:00
Cheng Hao 8d0d2a65eb [SPARK-4856] [SQL] NullType instead of StringType when sampling against empty string or nul...
```
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
```
As empty string (the "headers") will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type "headers" in line 1), and also take the line 1 (the "headers") as String Type, which is not our expected.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3708 from chenghao-intel/json and squashes the following commits:

e7a72e9 [Cheng Hao] add more concise unit test
853de51 [Cheng Hao] NullType instead of StringType when sampling against empty string or null value
2014-12-17 15:01:59 -08:00
Michael Armbrust 19c0faad6d [HOTFIX][SQL] Fix parquet filter suite
Author: Michael Armbrust <michael@databricks.com>

Closes #3727 from marmbrus/parquetNotEq and squashes the following commits:

2157bfc [Michael Armbrust] Fix parquet filter suite
2014-12-17 14:27:02 -08:00
Cheng Lian 6277135376 [SPARK-4493][SQL] Don't pushdown Eq, NotEq, Lt, LtEq, Gt and GtEq predicates with nulls for Parquet
Predicates like `a = NULL` and `a < NULL` can't be pushed down since Parquet `Lt`, `LtEq`, `Gt`, `GtEq` doesn't accept null value. Note that `Eq` and `NotEq` can only be used with `null` to represent predicates like `a IS NULL` and `a IS NOT NULL`.

However, normally this issue doesn't cause NPE because any value compared to `NULL` results `NULL`, and Spark SQL automatically optimizes out `NULL` predicate in the `SimplifyFilters` rule. Only testing code that intentionally disables the optimizer may trigger this issue. (That's why this issue is not marked as blocker and I do **NOT** think we need to backport this to branch-1.1

This PR restricts `Lt`, `LtEq`, `Gt` and `GtEq` to non-null values only, and only uses `Eq` with null value to pushdown `IsNull` and `IsNotNull`. Also, added support for Parquet `NotEq` filter for completeness and (tiny) performance gain, it's also used to pushdown `IsNotNull`.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3367)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3367 from liancheng/filters-with-null and squashes the following commits:

cc41281 [Cheng Lian] Fixes several styling issues
de7de28 [Cheng Lian] Adds stricter rules for Parquet filters with null
2014-12-17 12:48:04 -08:00
Cheng Hao 5fdcbdc0c9 [SPARK-4625] [SQL] Add sort by for DSL & SimpleSqlParser
Add `sort by` support for both DSL & SqlParser.

This PR is relevant with #3386, either one merged, will cause the other rebased.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3481 from chenghao-intel/sortby and squashes the following commits:

041004f [Cheng Hao] Add sort by for DSL & SimpleSqlParser
2014-12-17 12:01:57 -08:00
scwf 60698801eb [SPARK-4618][SQL] Make foreign DDL commands options case-insensitive
Using lowercase for ```options``` key to make it case-insensitive, then we should use lower case to get value from parameters.
So flowing cmd work
```
      create temporary table normal_parquet
      USING org.apache.spark.sql.parquet
      OPTIONS (
        PATH '/xxx/data'
      )
```

Author: scwf <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>

Closes #3470 from scwf/ddl-ulcase and squashes the following commits:

ae78509 [scwf] address comments
8f4f585 [wangfei] address comments
3c132ef [scwf] minor fix
a0fc20b [scwf] Merge branch 'master' of https://github.com/apache/spark into ddl-ulcase
4f86401 [scwf] adding CaseInsensitiveMap
e244e8d [wangfei] using lower case in json
e0cb017 [wangfei] make options in-casesensitive
2014-12-16 21:26:36 -08:00
Cheng Hao 770d8153a5 [SPARK-4375] [SQL] Add 0 argument support for udf
Author: Cheng Hao <hao.cheng@intel.com>

Closes #3595 from chenghao-intel/udf0 and squashes the following commits:

a858973 [Cheng Hao] Add 0 arguments support for udf
2014-12-16 21:21:11 -08:00
Cheng Lian 3b395e1051 [SPARK-4798][SQL] A new set of Parquet testing API and test suites
This PR provides a set Parquet testing API (see trait `ParquetTest`) that enables developers to write more concise test cases. A new set of Parquet test suites built upon this API  are added and aim to replace the old `ParquetQuerySuite`. To avoid potential merge conflicts, old testing code are not removed yet. The following classes can be safely removed after most Parquet related PRs are handled:

- `ParquetQuerySuite`
- `ParquetTestData`

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3644)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3644 from liancheng/parquet-tests and squashes the following commits:

800e745 [Cheng Lian] Enforces ordering of test output
3bb8731 [Cheng Lian] Refactors HiveParquetSuite
aa2cb2e [Cheng Lian] Decouples ParquetTest and TestSQLContext
7b43a68 [Cheng Lian] Updates ParquetTest Scaladoc
7f07af0 [Cheng Lian] Adds a new set of Parquet test suites
2014-12-16 21:16:03 -08:00
wangxiaojing ea1315e3e2 [SPARK-4527][SQl]Add BroadcastNestedLoopJoin operator selection testsuite
In `JoinSuite` add BroadcastNestedLoopJoin operator selection testsuite

Author: wangxiaojing <u9jing@gmail.com>

Closes #3395 from wangxiaojing/SPARK-4527 and squashes the following commits:

ea0e495 [wangxiaojing] change style
53c3952 [wangxiaojing] Add BroadcastNestedLoopJoin operator selection testsuite
2014-12-16 14:45:56 -08:00
Cheng Hao bf40cf89e3 [SPARK-4713] [SQL] SchemaRDD.unpersist() should not raise exception if it is not persisted
Unpersist a uncached RDD, will not raise exception, for example:
```
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
distData.unpersist(true)
```

But the `SchemaRDD` will raise exception if the `SchemaRDD` is not cached. Since `SchemaRDD` is the subclasses of the `RDD`, we should follow the same behavior.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3572 from chenghao-intel/try_uncache and squashes the following commits:

50a7a89 [Cheng Hao] SchemaRDD.unpersist() should not raise exception if it is not persisted
2014-12-11 22:41:36 -08:00
Jacky Li ed88db4cb2 [SQL] remove unnecessary import
Author: Jacky Li <jacky.likun@huawei.com>

Closes #3585 from jackylk/remove and squashes the following commits:

045423d [Jacky Li] remove unnecessary import
2014-12-04 00:43:55 -08:00
YanTangZhai 1066427600 [SPARK-4676][SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
val jsc = new org.apache.spark.api.java.JavaSparkContext(sc)
val jhc = new org.apache.spark.sql.hive.api.java.JavaHiveContext(jsc)
val nrdd = jhc.hql("select null from spark_test.for_test")
println(nrdd.schema)
Then the error is thrown as follows:
scala.MatchError: NullType (of class org.apache.spark.sql.catalyst.types.NullType$)
at org.apache.spark.sql.types.util.DataTypeConversions$.asJavaDataType(DataTypeConversions.scala:43)

Author: YanTangZhai <hakeemzhai@tencent.com>
Author: yantangzhai <tyz0303@163.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #3538 from YanTangZhai/MatchNullType and squashes the following commits:

e052dff [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
4b4bb34 [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
896c7b7 [yantangzhai] fix NullType MatchError in JavaSchemaRDD when sql has null
6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
e249846 [YanTangZhai] Merge pull request #10 from apache/master
d26d982 [YanTangZhai] Merge pull request #9 from apache/master
76d4027 [YanTangZhai] Merge pull request #8 from apache/master
03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
2014-12-02 14:15:12 -08:00
Kousuke Saruta e75e04f980 [SPARK-4536][SQL] Add sqrt and abs to Spark SQL DSL
Spark SQL has embeded sqrt and abs but DSL doesn't support those functions.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3401 from sarutak/dsl-missing-operator and squashes the following commits:

07700cf [Kousuke Saruta] Modified Literal(null, NullType) to Literal(null) in DslQuerySuite
8f366f8 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into dsl-missing-operator
1b88e2e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into dsl-missing-operator
0396f89 [Kousuke Saruta] Added sqrt and abs to Spark SQL DSL
2014-12-02 12:07:52 -08:00
ravipesala 6a9ff19dc0 [SPARK-4650][SQL] Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL
Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL

Author: ravipesala <ravindra.pesala@huawei.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #3511 from ravipesala/countdistinct and squashes the following commits:

cc4dbb1 [ravipesala] style
070e12a [ravipesala] Supporting multi column support in count(distinct c1,c2..) in Spark SQL
2014-12-01 13:28:04 -08:00
Kousuke Saruta dd1c9cb36c [SPARK-4487][SQL] Fix attribute reference resolution error when using ORDER BY.
When we use ORDER BY clause, at first, attributes referenced by projection are resolved (1).
And then, attributes referenced at ORDER BY clause are resolved (2).
 But when resolving attributes referenced at ORDER BY clause, the resolution result generated in (1) is discarded so for example, following query fails.

    SELECT c1 + c2 FROM mytable ORDER BY c1;

The query above fails because when resolving the attribute reference 'c1', the resolution result of 'c2' is discarded.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3363 from sarutak/SPARK-4487 and squashes the following commits:

fd314f3 [Kousuke Saruta] Fixed attribute resolution logic in Analyzer
6e60c20 [Kousuke Saruta] Fixed conflicts
cb5b7e9 [Kousuke Saruta] Added test case for SPARK-4487
282d529 [Kousuke Saruta] Fixed attributes reference resolution error
b6123e6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into concat-feature
317b7fb [Kousuke Saruta] WIP
2014-11-24 12:54:37 -08:00
Takuya UESHIN 2c2e7a44db [SPARK-4318][SQL] Fix empty sum distinct.
Executing sum distinct for empty table throws `java.lang.UnsupportedOperationException: empty.reduceLeft`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3184 from ueshin/issues/SPARK-4318 and squashes the following commits:

8168c42 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4318
66fdb0a [Takuya UESHIN] Re-refine aggregate functions.
6186eb4 [Takuya UESHIN] Fix Sum of GeneratedAggregate.
d2975f6 [Takuya UESHIN] Refine Sum and Average of GeneratedAggregate.
1bba675 [Takuya UESHIN] Refine Sum, SumDistinct and Average functions.
917e533 [Takuya UESHIN] Use aggregate instead of groupBy().
1a5f874 [Takuya UESHIN] Add tests to be executed as non-partial aggregation.
a5a57d2 [Takuya UESHIN] Fix empty Average.
22799dc [Takuya UESHIN] Fix empty Sum and SumDistinct.
65b7dd2 [Takuya UESHIN] Fix empty sum distinct.
2014-11-20 15:41:24 -08:00
ravipesala 98e9419784 [SPARK-4513][SQL] Support relational operator '<=>' in Spark SQL
The relational operator '<=>' is not working in Spark SQL. Same works in Spark HiveQL

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #3387 from ravipesala/<=> and squashes the following commits:

7198e90 [ravipesala] Supporting relational operator '<=>' in Spark SQL
2014-11-20 15:34:03 -08:00
Dan McClary b8e6886fb8 [SPARK-4228][SQL] SchemaRDD to JSON
Here's a simple fix for SchemaRDD to JSON.

Author: Dan McClary <dan.mcclary@gmail.com>

Closes #3213 from dwmclary/SPARK-4228 and squashes the following commits:

d714e1d [Dan McClary] fixed PEP 8 error
cac2879 [Dan McClary] move pyspark comment and doctest to correct location
f9471d3 [Dan McClary] added pyspark doc and doctest
6598cee [Dan McClary] adding complex type queries
1a5fd30 [Dan McClary] removing SPARK-4228 from SQLQuerySuite
4a651f0 [Dan McClary] cleaned PEP and Scala style failures.  Moved tests to JsonSuite
47ceff6 [Dan McClary] cleaned up scala style issues
2ee1e70 [Dan McClary] moved rowToJSON to JsonRDD
4387dd5 [Dan McClary] Added UserDefinedType, cleaned up case formatting
8f7bfb6 [Dan McClary] Map type added to SchemaRDD.toJSON
1b11980 [Dan McClary] Map and UserDefinedTypes partially done
11d2016 [Dan McClary] formatting and unicode deserialization default fixed
6af72d1 [Dan McClary] deleted extaneous comment
4d11c0c [Dan McClary] JsonFactory rewrite of toJSON for SchemaRDD
149dafd [Dan McClary] wrapped scala toJSON in sql.py
5e5eb1b [Dan McClary] switched to Jackson for JSON processing
6c94a54 [Dan McClary] added toJSON to pyspark SchemaRDD
aaeba58 [Dan McClary] added toJSON to pyspark SchemaRDD
1d171aa [Dan McClary] upated missing brace on if statement
319e3ba [Dan McClary] updated to upstream master with merged SPARK-4228
424f130 [Dan McClary] tests pass, ready for pull and PR
626a5b1 [Dan McClary] added toJSON to SchemaRDD
f7d166a [Dan McClary] added toJSON method
5d34e37 [Dan McClary] merge resolved
d6d19e9 [Dan McClary] pr example
2014-11-20 13:44:19 -08:00
Cheng Lian abf29187f0 [SPARK-3938][SQL] Names in-memory columnar RDD with corresponding table name
This PR enables the Web UI storage tab to show the in-memory table name instead of the mysterious query plan string as the name of the in-memory columnar RDD.

Note that after #2501, a single columnar RDD can be shared by multiple in-memory tables, as long as their query results are the same. In this case, only the first cached table name is shown. For example:

```sql
CACHE TABLE first AS SELECT * FROM src;
CACHE TABLE second AS SELECT * FROM src;
```

The Web UI only shows "In-memory table first".

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3383)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3383 from liancheng/columnar-rdd-name and squashes the following commits:

071907f [Cheng Lian] Fixes tests
12ddfa6 [Cheng Lian] Names in-memory columnar RDD with corresponding table name
2014-11-20 13:12:24 -08:00
Cheng Lian 423baea953 [SPARK-4468][SQL] Fixes Parquet filter creation for inequality predicates with literals on the left hand side
For expressions like `10 < someVar`, we should create an `Operators.Gt` filter, but right now an `Operators.Lt` is created. This issue affects all inequality predicates with literals on the left hand side.

(This bug existed before #3317 and affects branch-1.1. #3338 was opened to backport this to branch-1.1.)

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3334)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3334 from liancheng/fix-parquet-comp-filter and squashes the following commits:

0130897 [Cheng Lian] Fixes Parquet comparison filter generation
2014-11-18 17:41:54 -08:00
Cheng Lian 36b0956a3e [SPARK-4453][SPARK-4213][SQL] Simplifies Parquet filter generation code
While reviewing PR #3083 and #3161, I noticed that Parquet record filter generation code can be simplified significantly according to the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4213). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.

While generating `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generate the record filter, and then call `findExpression` to traverse the generated filter to find out all pushed down predicates [[1](64c6b9bad5/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (L213-L228))]. In this way, we have to introduce the `CatalystFilter` class hierarchy to bind the Catalyst predicates together with their generated Parquet filter, and complicate the code base a lot.

The basic idea of this PR is that, we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3317)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3317 from liancheng/simplify-parquet-filters and squashes the following commits:

d6a9499 [Cheng Lian] Fixes import styling issue
43760e8 [Cheng Lian] Simplifies Parquet filter generation logic
2014-11-17 16:55:12 -08:00
Cheng Lian 5ce7dae859 [SQL] Makes conjunction pushdown more aggressive for in-memory table
This is inspired by the [Parquet record filter generation code](64c6b9bad5/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetFilters.scala (L387-L400)).

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3318)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3318 from liancheng/aggresive-conj-pushdown and squashes the following commits:

78b69d2 [Cheng Lian] Makes conjunction pushdown more aggressive
2014-11-17 15:33:13 -08:00
Michael Armbrust 64c6b9bad5 [SPARK-4410][SQL] Add support for external sort
Adds a new operator that uses Spark's `ExternalSort` class.  It is off by default now, but we might consider making it the default if benchmarks show that it does not regress performance.

Author: Michael Armbrust <michael@databricks.com>

Closes #3268 from marmbrus/externalSort and squashes the following commits:

48b9726 [Michael Armbrust] comments
b98799d [Michael Armbrust] Add test
afd7562 [Michael Armbrust] Add support for external sort.
2014-11-16 21:55:57 -08:00
Cheng Lian 0c7b66bd44 [SPARK-4322][SQL] Enables struct fields as sub expressions of grouping fields
While resolving struct fields, the resulted `GetField` expression is wrapped with an `Alias` to make it a named expression. Assume `a` is a struct instance with a field `b`, then `"a.b"` will be resolved as `Alias(GetField(a, "b"), "b")`. Thus, for this following SQL query:

```sql
SELECT a.b + 1 FROM t GROUP BY a.b + 1
```

the grouping expression is

```scala
Add(GetField(a, "b"), Literal(1, IntegerType))
```

while the aggregation expression is

```scala
Add(Alias(GetField(a, "b"), "b"), Literal(1, IntegerType))
```

This mismatch makes the above SQL query fail during the both analysis and execution phases. This PR fixes this issue by removing the alias when substituting aggregation expressions.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3248)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3248 from liancheng/spark-4322 and squashes the following commits:

23a46ea [Cheng Lian] Code simplification
dd20a79 [Cheng Lian] Should only trim aliases around `GetField`s
7f46532 [Cheng Lian] Enables struct fields as sub expressions of grouping fields
2014-11-14 15:09:36 -08:00