Commit graph

1379 commits

Author SHA1 Message Date
Davies Liu bcb47ad771 [SPARK-6917] [SQL] DecimalType is not read back when non-native type exists
cc yhuai

Author: Davies Liu <davies@databricks.com>

Closes #6558 from davies/decimalType and squashes the following commits:

c877ca8 [Davies Liu] Update ParquetConverter.scala
48cc57c [Davies Liu] Update ParquetConverter.scala
b43845c [Davies Liu] add test
3b4a94f [Davies Liu] DecimalType is not read back when non-native type exists
2015-06-01 23:12:29 -07:00
Reynold Xin b53a011647 Fixed typo in the previous commit. 2015-06-01 21:41:53 -07:00
Yin Huai e797dba58e [SPARK-7965] [SPARK-7972] [SQL] Handle expressions containing multiple window expressions and make parser match window frames in case insensitive way
JIRAs:
https://issues.apache.org/jira/browse/SPARK-7965
https://issues.apache.org/jira/browse/SPARK-7972

Author: Yin Huai <yhuai@databricks.com>

Closes #6524 from yhuai/7965-7972 and squashes the following commits:

c12c79c [Yin Huai] Add doc for returned value.
de64328 [Yin Huai] Address rxin's comments.
fc9b1ad [Yin Huai] wip
2996da4 [Yin Huai] scala style
20b65b7 [Yin Huai] Handle expressions containing multiple window expressions.
9568b21 [Yin Huai] case insensitive matches
41f633d [Yin Huai] Failed test case.
2015-06-01 21:40:17 -07:00
Reynold Xin 75dda33f3e Revert "[SPARK-8020] Spark SQL in spark-defaults.conf make metadataHive get constructed too early"
This reverts commit 91f6be87bc.
2015-06-01 21:35:55 -07:00
Yin Huai 91f6be87bc [SPARK-8020] Spark SQL in spark-defaults.conf make metadataHive get constructed too early
https://issues.apache.org/jira/browse/SPARK-8020

Author: Yin Huai <yhuai@databricks.com>

Closes #6563 from yhuai/SPARK-8020 and squashes the following commits:

4e5addc [Yin Huai] style
bf766c6 [Yin Huai] Failed test.
0398f5b [Yin Huai] First populate the SQLConf and then construct executionHive and metadataHive.
2015-06-01 21:33:57 -07:00
Reynold Xin 4c868b9943 [minor doc] Add exploratory data analysis warning for DataFrame.stat.freqItem API
Author: Reynold Xin <rxin@databricks.com>

Closes #6569 from rxin/freqItemsWarning and squashes the following commits:

7eec145 [Reynold Xin] [minor doc] Add exploratory data analysis warning for DataFrame.stat.freqItem API.
2015-06-01 21:29:39 -07:00
Reynold Xin 89f642a0e8 [SPARK-8026][SQL] Add Column.alias to Scala/Java DataFrame API
Author: Reynold Xin <rxin@databricks.com>

Closes #6565 from rxin/alias and squashes the following commits:

286d880 [Reynold Xin] [SPARK-8026][SQL] Add Column.alias to Scala/Java DataFrame API
2015-06-01 21:13:15 -07:00
Reynold Xin 6396cc0303 [SPARK-7982][SQL] DataFrame.stat.crosstab should use 0 instead of null for pairs that don't appear
Author: Reynold Xin <rxin@databricks.com>

Closes #6566 from rxin/crosstab and squashes the following commits:

e0ace1c [Reynold Xin] [SPARK-7982][SQL] DataFrame.stat.crosstab should use 0 instead of null for pairs that don't appear
2015-06-01 21:11:19 -07:00
Wenchen Fan a0e46a0d2a [SPARK-7952][SPARK-7984][SQL] equality check between boolean type and numeric type is broken.
The origin code has several problems:
* `true <=> 1` will return false as we didn't set a rule to handle it.
* `true = a` where `a` is not `Literal` and its value is 1, will return false as we only handle literal values.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6505 from cloud-fan/tmp1 and squashes the following commits:

77f0f39 [Wenchen Fan] minor fix
b6401ba [Wenchen Fan] add type coercion for CaseKeyWhen and address comments
ebc8c61 [Wenchen Fan] use SQLTestUtils and If
625973c [Wenchen Fan] improve
9ba2130 [Wenchen Fan] address comments
fc0d741 [Wenchen Fan] fix style
2846a04 [Wenchen Fan] fix 7952
2015-05-31 21:01:46 -07:00
Reynold Xin 866652c903 [SPARK-3850] Turn style checker on for trailing whitespaces.
Author: Reynold Xin <rxin@databricks.com>

Closes #6541 from rxin/trailing-whitespace-on and squashes the following commits:

f72ebe4 [Reynold Xin] [SPARK-3850] Turn style checker on for trailing whitespaces.
2015-05-31 14:23:42 -07:00
Reynold Xin 63a50be13d [SPARK-3850] Trim trailing spaces for SQL.
Author: Reynold Xin <rxin@databricks.com>

Closes #6535 from rxin/whitespace-sql and squashes the following commits:

de50316 [Reynold Xin] [SPARK-3850] Trim trailing spaces for SQL.
2015-05-31 00:48:49 -07:00
Reynold Xin 7896e99b2a [SPARK-7975] Add style checker to disallow overriding equals covariantly.
Author: Reynold Xin <rxin@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@databricks.com>

Closes #6527 from rxin/covariant-equals and squashes the following commits:

e7d7784 [Reynold Xin] [SPARK-7975] Enforce CovariantEqualsChecker
2015-05-31 00:05:55 -07:00
Cheng Lian 8764dccebd [SQL] [MINOR] Adds @deprecated Scaladoc entry for SchemaRDD
Author: Cheng Lian <lian@databricks.com>

Closes #6529 from liancheng/schemardd-deprecation-fix and squashes the following commits:

49765c2 [Cheng Lian] Adds @deprecated Scaladoc entry for SchemaRDD
2015-05-30 23:49:42 -07:00
Cheng Lian f7fe9e4744 [SQL] [MINOR] Fixes a minor comment mistake in IsolatedClientLoader
Author: Cheng Lian <lian@databricks.com>

Closes #6521 from liancheng/classloader-comment-fix and squashes the following commits:

fc09606 [Cheng Lian] Addresses @srowen's comment
59945c5 [Cheng Lian] Fixes a minor comment mistake in IsolatedClientLoader
2015-05-31 12:56:41 +08:00
Reynold Xin c63e1a742b [SPARK-7971] Add JavaDoc style deprecation for deprecated DataFrame methods
Scala deprecated annotation actually doesn't show up in JavaDoc.

Author: Reynold Xin <rxin@databricks.com>

Closes #6523 from rxin/df-deprecated-javadoc and squashes the following commits:

26da2b2 [Reynold Xin] [SPARK-7971] Add JavaDoc style deprecation for deprecated DataFrame methods.
2015-05-30 19:51:53 -07:00
Reynold Xin 14b314dc2c [SQL] Tighten up visibility for JavaDoc.
I went through all the JavaDocs and tightened up visibility.

Author: Reynold Xin <rxin@databricks.com>

Closes #6526 from rxin/sql-1.4-visibility-for-docs and squashes the following commits:

bc37d1e [Reynold Xin] Tighten up visibility for JavaDoc.
2015-05-30 19:50:52 -07:00
Wenchen Fan 0978aec9cd [SPARK-7964][SQL] remove unnecessary type coercion rule
We have defined these logics in `Cast` already, I think we should remove this rule.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6516 from cloud-fan/tmp2 and squashes the following commits:

d5035a4 [Wenchen Fan] remove useless rule
2015-05-30 00:26:46 -07:00
Andrew Or 8c9979337f [HOTFIX] [SQL] Maven test compilation issue
Tests compile in SBT but not Maven.
2015-05-29 15:26:49 -07:00
Andrew Or 9eb222c139 [SPARK-7558] Demarcate tests in unit-tests.log
Right now `unit-tests.log` are not of much value because we can't tell where the test boundaries are easily. This patch adds log statements before and after each test to outline the test boundaries, e.g.:

```
===== TEST OUTPUT FOR o.a.s.serializer.KryoSerializerSuite: 'kryo with parallelize for primitive arrays' =====

15/05/27 12:36:39.596 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO SparkContext: Starting job: count at KryoSerializerSuite.scala:230
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Got job 3 (count at KryoSerializerSuite.scala:230) with 4 output partitions (allowLocal=false)
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Final stage: ResultStage 3(count at KryoSerializerSuite.scala:230)
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Parents of final stage: List()
15/05/27 12:36:39.597 dag-scheduler-event-loop INFO DAGScheduler: Missing parents: List()
15/05/27 12:36:39.597 dag-scheduler-event-loop INFO DAGScheduler: Submitting ResultStage 3 (ParallelCollectionRDD[5] at parallelize at KryoSerializerSuite.scala:230), which has no missing parents

...

15/05/27 12:36:39.624 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO DAGScheduler: Job 3 finished: count at KryoSerializerSuite.scala:230, took 0.028563 s
15/05/27 12:36:39.625 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO KryoSerializerSuite:

***** FINISHED o.a.s.serializer.KryoSerializerSuite: 'kryo with parallelize for primitive arrays' *****

...
```

Author: Andrew Or <andrew@databricks.com>

Closes #6441 from andrewor14/demarcate-tests and squashes the following commits:

879b060 [Andrew Or] Fix compile after rebase
d622af7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
017c8ba [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
7790b6c [Andrew Or] Fix tests after logical merge conflict
c7460c0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
c43ffc4 [Andrew Or] Fix tests?
8882581 [Andrew Or] Fix tests
ee22cda [Andrew Or] Fix log message
fa9450e [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
12d1e1b [Andrew Or] Various whitespace changes (minor)
69cbb24 [Andrew Or] Make all test suites extend SparkFunSuite instead of FunSuite
bbce12e [Andrew Or] Fix manual things that cannot be covered through automation
da0b12f [Andrew Or] Add core tests as dependencies in all modules
f7d29ce [Andrew Or] Introduce base abstract class for all test suites
2015-05-29 14:03:12 -07:00
Reynold Xin 94f62a4979 [SPARK-7940] Enforce whitespace checking for DO, TRY, CATCH, FINALLY, MATCH, LARROW, RARROW in style checker.
…

Author: Reynold Xin <rxin@databricks.com>

Closes #6491 from rxin/more-whitespace and squashes the following commits:

f6e63dc [Reynold Xin] [SPARK-7940] Enforce whitespace checking for DO, TRY, CATCH, FINALLY, MATCH, LARROW, RARROW in style checker.
2015-05-29 13:38:37 -07:00
Cheng Lian 4782e13040 [SQL] [TEST] [MINOR] Uses a temporary log4j.properties in HiveThriftServer2Test to ensure expected logging behavior
The `HiveThriftServer2Test` relies on proper logging behavior to assert whether the Thrift server daemon process is started successfully. However, some other jar files listed in the classpath may potentially contain an unexpected Log4J configuration file which overrides the logging behavior.

This PR writes a temporary `log4j.properties` and prepend it to driver classpath before starting the testing Thrift server process to ensure proper logging behavior.

cc andrewor14 yhuai

Author: Cheng Lian <lian@databricks.com>

Closes #6493 from liancheng/override-log4j and squashes the following commits:

c489e0e [Cheng Lian] Fixes minor Scala styling issue
b46ef0d [Cheng Lian] Uses a temporary log4j.properties in HiveThriftServer2Test to ensure expected logging behavior
2015-05-29 11:11:40 -07:00
Cheng Lian e7b6177557 [SPARK-7950] [SQL] Sets spark.sql.hive.version in HiveThriftServer2.startWithContext()
When starting `HiveThriftServer2` via `startWithContext`, property `spark.sql.hive.version` isn't set. This causes Simba ODBC driver 1.0.8.1006 behaves differently and fails simple queries.

Hive2 JDBC driver works fine in this case. Also, when starting the server with `start-thriftserver.sh`, both Hive2 JDBC driver and Simba ODBC driver works fine.

Please refer to [SPARK-7950] [1] for details.

[1]: https://issues.apache.org/jira/browse/SPARK-7950

Author: Cheng Lian <lian@databricks.com>

Closes #6500 from liancheng/odbc-bugfix and squashes the following commits:

051e3a3 [Cheng Lian] Fixes import order
3a97376 [Cheng Lian] Sets spark.sql.hive.version in HiveThriftServer2.startWithContext()
2015-05-29 10:43:34 -07:00
Reynold Xin 97a60cf75d [SPARK-7929] Turn whitespace checker on for more token types.
This is the last batch of changes to complete SPARK-7929.

Previous related PRs:
https://github.com/apache/spark/pull/6480
https://github.com/apache/spark/pull/6478
https://github.com/apache/spark/pull/6477
https://github.com/apache/spark/pull/6476
https://github.com/apache/spark/pull/6475
https://github.com/apache/spark/pull/6474
https://github.com/apache/spark/pull/6473

Author: Reynold Xin <rxin@databricks.com>

Closes #6487 from rxin/whitespace-lint and squashes the following commits:

b33d43d [Reynold Xin] [SPARK-7929] Turn whitespace checker on for more token types.
2015-05-28 23:00:02 -07:00
Reynold Xin 8da560d7de [SPARK-7927] whitespace fixes for Catalyst module.
So we can enable a whitespace enforcement rule in the style checker to save code review time.

Author: Reynold Xin <rxin@databricks.com>

Closes #6476 from rxin/whitespace-catalyst and squashes the following commits:

650409d [Reynold Xin] Fixed tests.
51a9e5d [Reynold Xin] [SPARK-7927] whitespace fixes for Catalyst module.
2015-05-28 20:11:57 -07:00
Reynold Xin ff44c711ab [SPARK-7927] whitespace fixes for SQL core.
So we can enable a whitespace enforcement rule in the style checker to save code review time.

Author: Reynold Xin <rxin@databricks.com>

Closes #6477 from rxin/whitespace-sql-core and squashes the following commits:

ce6e369 [Reynold Xin] Fixed tests.
6095fed [Reynold Xin] [SPARK-7927] whitespace fixes for SQL core.
2015-05-28 20:10:21 -07:00
Reynold Xin ee6a0e12fb [SPARK-7927] whitespace fixes for Hive and ThriftServer.
So we can enable a whitespace enforcement rule in the style checker to save code review time.

Author: Reynold Xin <rxin@databricks.com>

Closes #6478 from rxin/whitespace-hive and squashes the following commits:

e01b0e0 [Reynold Xin] Fixed tests.
a3bba22 [Reynold Xin] [SPARK-7927] whitespace fixes for Hive and ThriftServer.
2015-05-28 18:08:56 -07:00
Yin Huai 572b62cafe [SPARK-7853] [SQL] Fix HiveContext in Spark Shell
https://issues.apache.org/jira/browse/SPARK-7853

This fixes the problem introduced by my change in https://github.com/apache/spark/pull/6435, which causes that Hive Context fails to create in spark shell because of the class loader issue.

Author: Yin Huai <yhuai@databricks.com>

Closes #6459 from yhuai/SPARK-7853 and squashes the following commits:

37ad33e [Yin Huai] Do not use hiveQlTable at all.
47cdb6d [Yin Huai] Move hiveconf.set to the end of setConf.
005649b [Yin Huai] Update comment.
35d86f3 [Yin Huai] Access TTable directly to make sure Hive will not internally use any metastore utility functions.
3737766 [Yin Huai] Recursively find all jars.
2015-05-28 17:12:30 -07:00
Yin Huai 3c1f1baaf0 [SPARK-7907] [SQL] [UI] Rename tab ThriftServer to SQL.
This PR has three changes:
1. Renaming the table of `ThriftServer` to `SQL`;
2. Renaming the title of the tab from `ThriftServer` to `JDBC/ODBC Server`; and
3. Renaming the title of the session page from `ThriftServer` to `JDBC/ODBC Session`.

https://issues.apache.org/jira/browse/SPARK-7907

Author: Yin Huai <yhuai@databricks.com>

Closes #6448 from yhuai/JDBCServer and squashes the following commits:

eadcc3d [Yin Huai] Update test.
9168005 [Yin Huai] Use SQL as the tab name.
221831e [Yin Huai] Rename ThriftServer to JDBCServer.
2015-05-27 20:04:29 -07:00
Liang-Chi Hsieh a1e092eae5 [SPARK-7897][SQL] Use DecimalType to represent unsigned bigint in JDBCRDD
JIRA: https://issues.apache.org/jira/browse/SPARK-7897

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6438 from viirya/jdbc_unsigned_bigint and squashes the following commits:

ccb3c3f [Liang-Chi Hsieh] Use DecimalType to represent unsigned bigint.
2015-05-27 18:51:36 -07:00
Cheng Hao db3fd054f2 [SPARK-7853] [SQL] Fixes a class loader issue in Spark SQL
This PR is based on PR #6396 authored by chenghao-intel. Essentially, Spark SQL should use context classloader to load SerDe classes.

yhuai helped updating the test case, and I fixed a bug in the original `CliSuite`: while testing the CLI tool with `runCliWithin`, we don't append `\n` to the last query, thus the last query is never executed.

Original PR description is pasted below.

----

```
bin/spark-sql --jars ./sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar
CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
```

Throws exception like

```
15/05/26 00:16:33 ERROR SparkSQLDriver: Failed in [CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe']
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.apache.hive.hcatalog.data.JsonSerDe
        at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:333)
        at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:310)
        at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:139)
        at org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:310)
        at org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:300)
        at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:457)
        at org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
        at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:922)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:922)
        at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:147)
        at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
        at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
        at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:727)
        at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
```

Author: Cheng Hao <hao.cheng@intel.com>
Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #6435 from liancheng/classLoader and squashes the following commits:

d4c4845 [Cheng Lian] Fixes CliSuite
75e80e2 [Yin Huai] Update the fix.
fd26533 [Cheng Hao] scalastyle
dd78775 [Cheng Hao] workaround for classloader of IsolatedClientLoader
2015-05-27 14:21:00 -07:00
Cheng Lian b97ddff000 [SPARK-7684] [SQL] Refactoring MetastoreDataSourcesSuite to workaround SPARK-7684
As stated in SPARK-7684, currently `TestHive.reset` has some execution order specific bug, which makes running specific test suites locally pretty frustrating. This PR refactors `MetastoreDataSourcesSuite` (which relies on `TestHive.reset` heavily) using various `withXxx` utility methods in `SQLTestUtils` to ask each test case to cleanup their own mess so that we can avoid calling `TestHive.reset`.

Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #6353 from liancheng/workaround-spark-7684 and squashes the following commits:

26939aa [Yin Huai] Move the initialization of jsonFilePath to beforeAll.
a423d48 [Cheng Lian] Fixes Scala style issue
dfe45d0 [Cheng Lian] Refactors MetastoreDataSourcesSuite to workaround SPARK-7684
92a116d [Cheng Lian] Fixes minor styling issues
2015-05-27 13:09:33 -07:00
Daoyuan Wang 8161562eab [SPARK-7790] [SQL] date and decimal conversion for dynamic partition key
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #6318 from adrian-wang/dynpart and squashes the following commits:

ad73b61 [Daoyuan Wang] not use sqlTestUtils for try catch because dont have sqlcontext here
6c33b51 [Daoyuan Wang] fix according to liancheng
f0f8074 [Daoyuan Wang] some specific types as dynamic partition
2015-05-27 12:42:13 -07:00
Reynold Xin 6fec1a9409 Removed Guava dependency from JavaTypeInference's type signature.
This should also close #6243.

Author: Reynold Xin <rxin@databricks.com>

Closes #6431 from rxin/JavaTypeInference-guava and squashes the following commits:

e58df3c [Reynold Xin] Removed Gauva dependency from JavaTypeInference's type signature.
2015-05-27 11:54:35 -07:00
Cheng Lian 15459db4f6 [SPARK-7847] [SQL] Fixes dynamic partition directory escaping
Please refer to [SPARK-7847] [1] for details.

[1]: https://issues.apache.org/jira/browse/SPARK-7847

Author: Cheng Lian <lian@databricks.com>

Closes #6389 from liancheng/spark-7847 and squashes the following commits:

935c652 [Cheng Lian] Adds test case for writing various data types as dynamic partition value
f4fc398 [Cheng Lian] Converts partition columns to Scala type when writing dynamic partitions
d0aeca0 [Cheng Lian] Fixes dynamic partition directory escaping
2015-05-27 10:09:12 -07:00
Reynold Xin 3e7d7d6b3d [SQL] Rename MathematicalExpression UnaryMathExpression, and specify BinaryMathExpression's output data type as DoubleType.
Two minor changes.

cc brkyvz

Author: Reynold Xin <rxin@databricks.com>

Closes #6428 from rxin/math-func-cleanup and squashes the following commits:

5910df5 [Reynold Xin] [SQL] Rename MathematicalExpression UnaryMathExpression, and specify BinaryMathExpression's output data type as DoubleType.
2015-05-27 01:13:57 -07:00
Reynold Xin 9f48bf6b37 [SPARK-7887][SQL] Remove EvaluatedType from SQL Expression.
This type is not really used. Might as well remove it.

Author: Reynold Xin <rxin@databricks.com>

Closes #6427 from rxin/evalutedType and squashes the following commits:

51a319a [Reynold Xin] [SPARK-7887][SQL] Remove EvaluatedType from SQL Expression.
2015-05-27 01:12:59 -07:00
Liang-Chi Hsieh 4f98d7a7f1 [SPARK-7697][SQL] Use LongType for unsigned int in JDBCRDD
JIRA: https://issues.apache.org/jira/browse/SPARK-7697

The reported problem case is mysql. But for h2 db, there is no unsigned int. So it is not able to add corresponding test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6229 from viirya/unsignedint_as_long and squashes the following commits:

dc4b5d8 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into unsignedint_as_long
608695b [Liang-Chi Hsieh] Use LongType for unsigned int in JDBCRDD.
2015-05-27 00:27:39 -07:00
Cheolsoo Park 6dd645870d [SPARK-7850][BUILD] Hive 0.12.0 profile in POM should be removed
I grep'ed hive-0.12.0 in the source code and removed all the profiles and doc references.

Author: Cheolsoo Park <cheolsoop@netflix.com>

Closes #6393 from piaozhexiu/SPARK-7850 and squashes the following commits:

fb429ce [Cheolsoo Park] Remove hive-0.13.1 profile
82bf09a [Cheolsoo Park] Remove hive 0.12.0 shim code
f3722da [Cheolsoo Park] Remove hive-0.12.0 profile and references from POM and build docs
2015-05-27 00:18:42 -07:00
Cheng Lian b463e6d618 [SPARK-7868] [SQL] Ignores _temporary directories in HadoopFsRelation
So that potential partial/corrupted data files left by failed tasks/jobs won't affect normal data scan.

Author: Cheng Lian <lian@databricks.com>

Closes #6411 from liancheng/spark-7868 and squashes the following commits:

273ea36 [Cheng Lian] Ignores _temporary directories
2015-05-26 20:48:56 -07:00
Josh Rosen 0c33c7b4a6 [SPARK-7858] [SQL] Use output schema, not relation schema, for data source input conversion
In `DataSourceStrategy.createPhysicalRDD`, we use the relation schema as the target schema for converting incoming rows into Catalyst rows.  However, we should be using the output schema instead, since our scan might return a subset of the relation's columns.

This patch incorporates #6414 by liancheng, which fixes an issue in `SimpleTestRelation` that prevented this bug from being caught by our old tests:

> In `SimpleTextRelation`, we specified `needsConversion` to `true`, indicating that values produced by this testing relation should be of Scala types, and need to be converted to Catalyst types when necessary. However, we also used `Cast` to convert strings to expected data types. And `Cast` always produces values of Catalyst types, thus no conversion is done at all. This PR makes `SimpleTextRelation` produce Scala values so that data conversion code paths can be properly tested.

Closes #5986.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Author: Cheng Lian <liancheng@users.noreply.github.com>

Closes #6400 from JoshRosen/SPARK-7858 and squashes the following commits:

e71c866 [Josh Rosen] Re-fix bug so that the tests pass again
56b13e5 [Josh Rosen] Add regression test to hadoopFsRelationSuites
2169a0f [Josh Rosen] Remove use of SpecificMutableRow and BufferedIterator
6cd7366 [Josh Rosen] Fix SPARK-7858 by using output types for conversion.
5a00e66 [Josh Rosen] Add assertions in order to reproduce SPARK-7858
8ba195c [Cheng Lian] Merge 9968fba9979287aaa1f141ba18bfb9d4c116a3b3 into 61664732b2
9968fba [Cheng Lian] Tests the data type conversion code paths
2015-05-26 20:24:35 -07:00
rowan 03668348e2 [SPARK-7637] [SQL] O(N) merge implementation for StructType merge
Contribution is my original work and I license the work to the project under the projects open source license.

Author: rowan <rowan.chattaway@googlemail.com>

Closes #6259 from rowan000/SPARK-7637 and squashes the following commits:

c479df4 [rowan] SPARK-7637: rename mapFields to fieldsMap as per comments on github.
8d2e419 [rowan] SPARK-7637: fix up whitespace changes
0e9d662 [rowan] SPARK-7637: O(N) merge implementatio for StructType merge
2015-05-26 18:17:16 -07:00
Reynold Xin c9adcad81a [SQL][minor] Removed unused Catalyst logical plan DSL.
The Catalyst DSL is no longer used as a public facing API. This pull request removes the UDF and writeToFile feature from it since they are not used in unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #6350 from rxin/unused-logical-dsl and squashes the following commits:

90b3de6 [Reynold Xin] [SQL][minor] Removed unused Catalyst logical plan DSL.
2015-05-25 23:09:22 -07:00
Cheng Lian 8af1bf10b7 [SPARK-7842] [SQL] Makes task committing/aborting in InsertIntoHadoopFsRelation more robust
When committing/aborting a write task issued in `InsertIntoHadoopFsRelation`, if an exception is thrown from `OutputWriter.close()`, the committing/aborting process will be interrupted, and leaves messy stuff behind (e.g., the `_temporary` directory created by `FileOutputCommitter`).

This PR makes these two process more robust by catching potential exceptions and falling back to normal task committment/abort.

Author: Cheng Lian <lian@databricks.com>

Closes #6378 from liancheng/spark-7838 and squashes the following commits:

f18253a [Cheng Lian] Makes task committing/aborting in InsertIntoHadoopFsRelation more robust
2015-05-26 00:28:47 +08:00
Cheng Lian bfeedc69a2 [SPARK-7684] [SQL] Invoking HiveContext.newTemporaryConfiguration() shouldn't create new metastore directory
The "Database does not exist" error reported in SPARK-7684 was caused by `HiveContext.newTemporaryConfiguration()`, which always creates a new temporary metastore directory and returns a metastore configuration pointing that directory. This makes `TestHive.reset()` always replaces old temporary metastore with an empty new one.

Author: Cheng Lian <lian@databricks.com>

Closes #6359 from liancheng/spark-7684 and squashes the following commits:

95d2eb8 [Cheng Lian] Addresses @marmbrust's comment
042769d [Cheng Lian] Don't create new temp directory in HiveContext.newTemporaryConfiguration()
2015-05-26 00:16:06 +08:00
Yin Huai ed21476bc0 [SPARK-7805] [SQL] Move SQLTestUtils.scala and ParquetTest.scala to src/test
https://issues.apache.org/jira/browse/SPARK-7805

Because `sql/hive`'s tests depend on the test jar of `sql/core`, we do not need to store `SQLTestUtils` and `ParquetTest` in `src/main`. We should only add stuff that will be needed by `sql/console` or Python tests (for Python, we need it in `src/main`, right? davies).

Author: Yin Huai <yhuai@databricks.com>

Closes #6334 from yhuai/SPARK-7805 and squashes the following commits:

af6d0c9 [Yin Huai] mima
b86746a [Yin Huai] Move SQLTestUtils.scala and ParquetTest.scala to src/test.
2015-05-24 09:51:37 -07:00
Yin Huai 2b7e63585d [SPARK-7654] [SQL] Move insertInto into reader/writer interface.
This one continues the work of https://github.com/apache/spark/pull/6216.

Author: Yin Huai <yhuai@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #6366 from yhuai/insert and squashes the following commits:

3d717fb [Yin Huai] Use insertInto to handle the casue when table exists and Append is used for saveAsTable.
56d2540 [Yin Huai] Add PreWriteCheck to HiveContext's analyzer.
c636e35 [Yin Huai] Remove unnecessary empty lines.
cf83837 [Yin Huai] Move insertInto to write. Also, remove the partition columns from InsertIntoHadoopFsRelation.
0841a54 [Reynold Xin] Removed experimental tag for deprecated methods.
33ed8ef [Reynold Xin] [SPARK-7654][SQL] Move insertInto into reader/writer interface.
2015-05-23 09:48:20 -07:00
Davies Liu efe3bfdf49 [SPARK-7322, SPARK-7836, SPARK-7822][SQL] DataFrame window function related updates
1. ntile should take an integer as parameter.
2. Added Python API (based on #6364)
3. Update documentation of various DataFrame Python functions.

Author: Davies Liu <davies@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #6374 from rxin/window-final and squashes the following commits:

69004c7 [Reynold Xin] Style fix.
288cea9 [Reynold Xin] Update documentaiton.
7cb8985 [Reynold Xin] Merge pull request #6364 from davies/window
66092b4 [Davies Liu] update docs
ed73cb4 [Reynold Xin] [SPARK-7322][SQL] Improve DataFrame window function documentation.
ef55132 [Davies Liu] Merge branch 'master' of github.com:apache/spark into window4
8936ade [Davies Liu] fix maxint in python 3
2649358 [Davies Liu] update docs
778e2c0 [Davies Liu] SPARK-7836 and SPARK-7822: Python API of window functions
2015-05-23 08:30:05 -07:00
Michael Armbrust 3c1305107a [SPARK-7834] [SQL] Better window error messages
Author: Michael Armbrust <michael@databricks.com>

Closes #6363 from marmbrus/windowErrors and squashes the following commits:

516b02d [Michael Armbrust] [SPARK-7834] [SQL] Better window error messages
2015-05-22 17:23:12 -07:00
Liang-Chi Hsieh 126d7235de [SPARK-7270] [SQL] Consider dynamic partition when inserting into hive table
JIRA: https://issues.apache.org/jira/browse/SPARK-7270

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5864 from viirya/dyn_partition_insert and squashes the following commits:

b5627df [Liang-Chi Hsieh] For comments.
3b21e4b [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into dyn_partition_insert
8a4352d [Liang-Chi Hsieh] Consider dynamic partition when inserting into hive table.
2015-05-22 15:39:58 -07:00
Santiago M. Mola e4aef91fe7 [SPARK-7724] [SQL] Support Intersect/Except in Catalyst DSL.
Author: Santiago M. Mola <santi@mola.io>

Closes #6327 from smola/feature/catalyst-dsl-set-ops and squashes the following commits:

11db778 [Santiago M. Mola] [SPARK-7724] [SQL] Support Intersect/Except in Catalyst DSL.
2015-05-22 15:10:27 -07:00
WangTaoTheTonic 31d5d463e7 [SPARK-7758] [SQL] Override more configs to avoid failure when connect to a postgre sql
https://issues.apache.org/jira/browse/SPARK-7758

When initializing `executionHive`, we only masks
`javax.jdo.option.ConnectionURL` to override metastore location.  However,
other properties that relates to the actual Hive metastore data source are not
masked.  For example, when using Spark SQL with a PostgreSQL backed Hive
metastore, `executionHive` actually tries to use settings read from
`hive-site.xml`, which talks about PostgreSQL, to connect to the temporary
Derby metastore, thus causes error.

To fix this, we need to mask all metastore data source properties.
Specifically, according to the code of [Hive `ObjectStore.getDataSourceProps()`
method] [1], all properties whose name mentions "jdo" and "datanucleus" must be
included.

[1]: https://github.com/apache/hive/blob/release-0.13.1/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L288

Have tested using postgre sql as metastore, it worked fine.

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #6314 from WangTaoTheTonic/SPARK-7758 and squashes the following commits:

ca7ae7c [WangTaoTheTonic] add comments
86caf2c [WangTaoTheTonic] delete unused import
e4f0feb [WangTaoTheTonic] block more data source related property
92a81fa [WangTaoTheTonic] fix style check
e3e683d [WangTaoTheTonic] override more configs to avoid failuer connecting to postgre sql
2015-05-22 14:43:16 -07:00
Michael Armbrust 3b68cb0430 [SPARK-6743] [SQL] Fix empty projections of cached data
Author: Michael Armbrust <michael@databricks.com>

Closes #6165 from marmbrus/wrongColumn and squashes the following commits:

4fad158 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into wrongColumn
aad7eab [Michael Armbrust] rxins comments
f1e8df1 [Michael Armbrust] [SPARK-6743][SQL] Fix empty projections of cached data
2015-05-22 09:43:46 -07:00
Cheng Lian 4e5220c317 [MINOR] [SQL] Ignores Thrift server UISeleniumSuite
This Selenium test case has been flaky for a while and led to frequent Jenkins build failure. Let's disable it temporarily until we figure out a proper solution.

Author: Cheng Lian <lian@databricks.com>

Closes #6345 from liancheng/ignore-selenium-test and squashes the following commits:

09996fe [Cheng Lian] Ignores Thrift server UISeleniumSuite
2015-05-22 16:25:52 +08:00
Cheng Hao f6f2eeb179 [SPARK-7322][SQL] Window functions in DataFrame
This closes #6104.

Author: Cheng Hao <hao.cheng@intel.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #6343 from rxin/window-df and squashes the following commits:

026d587 [Reynold Xin] Address code review feedback.
dc448fe [Reynold Xin] Fixed Hive tests.
9794d9d [Reynold Xin] Moved Java test package.
9331605 [Reynold Xin] Refactored API.
3313e2a [Reynold Xin] Merge pull request #6104 from chenghao-intel/df_window
d625a64 [Cheng Hao] Update the dataframe window API as suggsted
c141fb1 [Cheng Hao] hide all of properties of the WindowFunctionDefinition
3b1865f [Cheng Hao] scaladoc typos
f3fd2d0 [Cheng Hao] polish the unit test
6847825 [Cheng Hao] Add additional analystcs functions
57e3bc0 [Cheng Hao] typos
24a08ec [Cheng Hao] scaladoc
28222ed [Cheng Hao] fix bug of range/row Frame
1d91865 [Cheng Hao] style issue
53f89f2 [Cheng Hao] remove the over from the functions.scala
964c013 [Cheng Hao] add more unit tests and window functions
64e18a7 [Cheng Hao] Add Window Function support for DataFrame
2015-05-22 01:00:16 -07:00
Yin Huai 347b50106b [SPARK-7737] [SQL] Use leaf dirs having data files to discover partitions.
https://issues.apache.org/jira/browse/SPARK-7737

cc liancheng

Author: Yin Huai <yhuai@databricks.com>

Closes #6329 from yhuai/spark-7737 and squashes the following commits:

7e0dfc7 [Yin Huai] Use leaf dirs having data files to discover partitions.
2015-05-22 07:10:26 +08:00
Andrew Or 5287eec5a6 [SPARK-7718] [SQL] Speed up partitioning by avoiding closure cleaning
According to yhuai we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark we know for sure they are serializable, so we can bypass the cleaning.

Author: Andrew Or <andrew@databricks.com>

Closes #6256 from andrewor14/sql-partition-speed-up and squashes the following commits:

a82b451 [Andrew Or] Fix style
10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures
17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up
523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too
f7fe143 [Andrew Or] Avoid unnecessary closure cleaning
2015-05-21 14:33:11 -07:00
Tathagata Das 3d0cccc858 [SPARK-7478] [SQL] Added SQLContext.getOrCreate
Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like

1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing.

2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. Also to get around this problem I had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala

This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf.

rxin marmbrus

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6006 from tdas/SPARK-7478 and squashes the following commits:

25f4da9 [Tathagata Das] Addressed comments.
79fe069 [Tathagata Das] Added comments.
c66ca76 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7478
48adb14 [Tathagata Das] Removed HiveContext.getOrCreate
bf8cf50 [Tathagata Das] Fix more bug
dec5594 [Tathagata Das] Fixed bug
b4e9721 [Tathagata Das] Remove unnecessary import
4ef513b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7478
d3ea8e4 [Tathagata Das] Added HiveContext
83bc950 [Tathagata Das] Updated tests
f82ae81 [Tathagata Das] Fixed test
bc72868 [Tathagata Das] Added SQLContext.getOrCreate
2015-05-21 14:08:20 -07:00
Yin Huai 30f3f556f7 [SPARK-7763] [SPARK-7616] [SQL] Persists partition columns into metastore
Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #6285 from liancheng/spark-7763 and squashes the following commits:

bb2829d [Yin Huai] Fix hashCode.
d677f7d [Cheng Lian] Fixes Scala style issue
44b283f [Cheng Lian] Adds test case for SPARK-7616
6733276 [Yin Huai] Fix a bug that potentially causes https://issues.apache.org/jira/browse/SPARK-7616.
6cabf3c [Yin Huai] Update unit test.
7e02910 [Yin Huai] Use metastore partition columns and do not hijack maybePartitionSpec.
e9a03ec [Cheng Lian] Persists partition columns into metastore
2015-05-21 13:51:40 -07:00
scwf f6c486aa4b [SQL] [TEST] udf_java_method failed due to jdk version
java.lang.Math.exp(1.0) has different result between jdk versions. so do not use createQueryTest, write a separate test for it.
```
jdk version   	result
1.7.0_11		2.7182818284590455
1.7.0_05        2.7182818284590455
1.7.0_71		2.718281828459045
```

Author: scwf <wangfei1@huawei.com>

Closes #6274 from scwf/java_method and squashes the following commits:

3dd2516 [scwf] address comments
5fa1459 [scwf] style
df46445 [scwf] fix test error
fcb6d22 [scwf] fix udf_java_method
2015-05-21 12:31:58 -07:00
Cheng Lian 8730fbb47b [SPARK-7749] [SQL] Fixes partition discovery for non-partitioned tables
When no partition columns can be found, we should have an empty `PartitionSpec`, rather than a `PartitionSpec` with empty partition columns.

This PR together with #6285 should fix SPARK-7749.

Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #6287 from liancheng/spark-7749 and squashes the following commits:

a799ff3 [Cheng Lian] Adds test cases for SPARK-7749
c4949be [Cheng Lian] Minor refactoring, and tolerant _TEMPORARY directory name
5aa87ea [Yin Huai] Make parsePartitions more robust.
fc56656 [Cheng Lian] Returns empty PartitionSpec if no partition columns can be inferred
19ae41e [Cheng Lian] Don't list base directory as leaf directory
2015-05-21 10:56:17 -07:00
Davies Liu a25c1ab8f0 [SPARK-7565] [SQL] fix MapType in JsonRDD
The key of Map in JsonRDD should be converted into UTF8String (also failed records), Thanks to yhuai viirya

Closes #6084

Author: Davies Liu <davies@databricks.com>

Closes #6299 from davies/string_in_json and squashes the following commits:

0dbf559 [Davies Liu] improve test, fix corrupt record
6836a80 [Davies Liu] move unit tests into Scala
b97af11 [Davies Liu] fix MapType in JsonRDD
2015-05-21 09:58:47 -07:00
Cheng Hao feb3a9d3f8 [SPARK-7320] [SQL] [Minor] Move the testData into beforeAll()
Follow up of #6340, to avoid the test report missing once it fails.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #6312 from chenghao-intel/rollup_minor and squashes the following commits:

b03a25f [Cheng Hao] simplify the testData instantiation
09b7e8b [Cheng Hao] move the testData into beforeAll()
2015-05-21 09:28:00 -07:00
Liang-Chi Hsieh d0eb9ffe97 [SPARK-7746][SQL] Add FetchSize parameter for JDBC driver
JIRA: https://issues.apache.org/jira/browse/SPARK-7746

Looks like an easy to add parameter but can show significant performance improvement if the JDBC driver accepts it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6283 from viirya/jdbc_fetchsize and squashes the following commits:

de47f94 [Liang-Chi Hsieh] Don't keep fetchSize as single parameter.
b7bff2f [Liang-Chi Hsieh] Add FetchSize parameter for JDBC driver.
2015-05-20 22:23:49 -07:00
Cheng Hao 42c592adb3 [SPARK-7320] [SQL] Add Cube / Rollup for dataframe
This is a follow up for #6257, which broke the maven test.

Add cube & rollup for DataFrame
For example:
```scala
testData.rollup($"a" + $"b", $"b").agg(sum($"a" - $"b"))
testData.cube($"a" + $"b", $"b").agg(sum($"a" - $"b"))
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #6304 from chenghao-intel/rollup and squashes the following commits:

04bb1de [Cheng Hao] move the table register/unregister into beforeAll/afterAll
a6069f1 [Cheng Hao] cancel the implicit keyword
ced4b8f [Cheng Hao] remove the unnecessary code changes
9959dfa [Cheng Hao] update the code as comments
e1d88aa [Cheng Hao] update the code as suggested
03bc3d9 [Cheng Hao] Remove the CubedData & RollupedData
5fd62d0 [Cheng Hao] hiden the CubedData & RollupedData
5ffb196 [Cheng Hao] Add Cube / Rollup for dataframe
2015-05-20 19:58:22 -07:00
Patrick Wendell 6338c40da6 Revert "[SPARK-7320] [SQL] Add Cube / Rollup for dataframe"
This reverts commit 10698e1131.
2015-05-20 13:39:04 -07:00
Yin Huai b631bf73b9 [SPARK-7713] [SQL] Use shared broadcast hadoop conf for partitioned table scan.
https://issues.apache.org/jira/browse/SPARK-7713

I tested the performance with the following code:
```scala
import sqlContext._
import sqlContext.implicits._

(1 to 5000).foreach { i =>
  val df = (1 to 1000).map(j => (j, s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i")
}

sqlContext.sql("""
CREATE TEMPORARY TABLE partitionedParquet
USING org.apache.spark.sql.parquet
OPTIONS (
  path '/tmp/partitioned'
)""")

table("partitionedParquet").explain(true)
```

In our master `explain` takes 40s in my laptop. With this PR, `explain` takes 14s.

Author: Yin Huai <yhuai@databricks.com>

Closes #6252 from yhuai/broadcastHadoopConf and squashes the following commits:

6fa73df [Yin Huai] Address comments of Josh and Andrew.
807fbf9 [Yin Huai] Make the new buildScan and SqlNewHadoopRDD private sql.
e393555 [Yin Huai] Cheng's comments.
2eb53bb [Yin Huai] Use a shared broadcast Hadoop Configuration for partitioned HadoopFsRelations.
2015-05-20 11:23:40 -07:00
Cheng Hao 09265ad7c8 [SPARK-7320] [SQL] Add Cube / Rollup for dataframe
Add `cube` & `rollup` for DataFrame
For example:
```scala
testData.rollup($"a" + $"b", $"b").agg(sum($"a" - $"b"))
testData.cube($"a" + $"b", $"b").agg(sum($"a" - $"b"))
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #6257 from chenghao-intel/rollup and squashes the following commits:

7302319 [Cheng Hao] cancel the implicit keyword
a66e38f [Cheng Hao] remove the unnecessary code changes
a2869d4 [Cheng Hao] update the code as comments
c441777 [Cheng Hao] update the code as suggested
84c9564 [Cheng Hao] Remove the CubedData & RollupedData
279584c [Cheng Hao] hiden the CubedData & RollupedData
ef357e1 [Cheng Hao] Add Cube / Rollup for dataframe
2015-05-20 19:09:47 +08:00
scwf 60336e3bc0 [SPARK-7656] [SQL] use CatalystConf in FunctionRegistry
follow up for #5806

Author: scwf <wangfei1@huawei.com>

Closes #6164 from scwf/FunctionRegistry and squashes the following commits:

15e6697 [scwf] use catalogconf in FunctionRegistry
2015-05-19 17:36:00 -07:00
Cheng Hao bcb1ff8146 [SPARK-7662] [SQL] Resolve correct names for generator in projection
```
select explode(map(value, key)) from src;
```
Throws exception
```
org.apache.spark.sql.AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got _c0 ;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #6178 from chenghao-intel/explode and squashes the following commits:

916fbe9 [Cheng Hao] add more strict rules for TGF alias
5c3f2c5 [Cheng Hao] fix bug in unit test
e1d93ab [Cheng Hao] Add more unit test
19db09e [Cheng Hao] resolve names for generator in projection
2015-05-19 15:20:46 -07:00
Patrick Wendell 9ebb44f8ab [HOTFIX]: Java 6 Build Breaks
These were blocking RC1 so I fixed them manually.
2015-05-19 06:01:16 +00:00
Josh Rosen c9fa870a6d [SPARK-7687] [SQL] DataFrame.describe() should cast all aggregates to String
In `DataFrame.describe()`, the `count` aggregate produces an integer, the `avg` and `stdev` aggregates produce doubles, and `min` and `max` aggregates can produce varying types depending on what type of column they're applied to.  As a result, we should cast all aggregate results to String so that `describe()`'s output types match its declared output schema.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6218 from JoshRosen/SPARK-7687 and squashes the following commits:

146b615 [Josh Rosen] Fix R test.
2974bd5 [Josh Rosen] Cast to string type instead
f206580 [Josh Rosen] Cast to double to fix SPARK-7687
307ecbf [Josh Rosen] Add failing regression test for SPARK-7687
2015-05-18 21:53:44 -07:00
Daoyuan Wang c2437de189 [SPARK-7150] SparkContext.range() and SQLContext.range()
This PR is based on #6081, thanks adrian-wang.

Closes #6081

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #6230 from davies/range and squashes the following commits:

d3ce5fe [Davies Liu] add tests
789eda5 [Davies Liu] add range() in Python
4590208 [Davies Liu] Merge commit 'refs/pull/6081/head' of github.com:apache/spark into range
cbf5200 [Daoyuan Wang] let's add python support in a separate PR
f45e3b2 [Daoyuan Wang] remove redundant toLong
617da76 [Daoyuan Wang] fix safe marge for corner cases
867c417 [Daoyuan Wang] fix
13dbe84 [Daoyuan Wang] update
bd998ba [Daoyuan Wang] update comments
d3a0c1b [Daoyuan Wang] add range api()
2015-05-18 21:43:12 -07:00
Michael Armbrust eb4632f282 [SQL] Fix serializability of ORC table scan
A follow-up to #6244.

Author: Michael Armbrust <michael@databricks.com>

Closes #6247 from marmbrus/fixOrcTests and squashes the following commits:

e39ee1b [Michael Armbrust] [SQL] Fix serializability of ORC table scan
2015-05-18 15:24:31 -07:00
Michael Armbrust fcf90b75cc [HOTFIX] Fix ORC build break
Fix break caused by merging #6225 and #6194.

Author: Michael Armbrust <michael@databricks.com>

Closes #6244 from marmbrus/fixOrcBuildBreak and squashes the following commits:

b10e47b [Michael Armbrust] [HOTFIX] Fix ORC Build break
2015-05-18 14:04:04 -07:00
Davies Liu 32fbd297dd [SPARK-6216] [PYSPARK] check python version of worker with driver
This PR revert #5404, change to pass the version of python in driver into JVM, check it in worker before deserializing closure, then it can works with different major version of Python.

Author: Davies Liu <davies@databricks.com>

Closes #6203 from davies/py_version and squashes the following commits:

b8fb76e [Davies Liu] fix test
6ce5096 [Davies Liu] use string for version
47c6278 [Davies Liu] check python version of worker with driver
2015-05-18 12:55:13 -07:00
Cheng Lian 9dadf019b9 [SPARK-7673] [SQL] WIP: HadoopFsRelation and ParquetRelation2 performance optimizations
This PR introduces several performance optimizations to `HadoopFsRelation` and `ParquetRelation2`:

1.  Moving `FileStatus` listing from `DataSourceStrategy` into a cache within `HadoopFsRelation`.

    This new cache generalizes and replaces the one used in `ParquetRelation2`.

    This also introduces an interface change: to reuse cached `FileStatus` objects, `HadoopFsRelation.buildScan` methods now receive `Array[FileStatus]` instead of `Array[String]`.

1.  When Parquet task side metadata reading is enabled, skip reading row group information when reading Parquet footers.

    This is basically what PR #5334 does. Also, now we uses `ParquetFileReader.readAllFootersInParallel` to read footers in parallel.

Another optimization in question is, instead of asking `HadoopFsRelation.buildScan` to return an `RDD[Row]` for a single selected partition and then union them all, we ask it to return an `RDD[Row]` for all selected partitions. This optimization is based on the fact that Hadoop configuration broadcasting used in `NewHadoopRDD` takes 34% time in the following microbenchmark.  However, this complicates data source user code because user code must merge partition values manually.

To check the cost of broadcasting in `NewHadoopRDD`, I also did microbenchmark after removing the `broadcast` call in `NewHadoopRDD`.  All results are shown below.

### Microbenchmark

#### Preparation code

Generating a partitioned table with 50k partitions, 1k rows per partition:

```scala
import sqlContext._
import sqlContext.implicits._

for (n <- 0 until 500) {
  val data = for {
    p <- (n * 10) until ((n + 1) * 10)
    i <- 0 until 1000
  } yield (i, f"val_$i%04d", f"$p%04d")

  data.
    toDF("a", "b", "p").
    write.
    partitionBy("p").
    mode("append").
    parquet(path)
}
```

#### Benchmarking code

```scala
import sqlContext._
import sqlContext.implicits._

import org.apache.spark.sql.types._
import com.google.common.base.Stopwatch

val path = "hdfs://localhost:9000/user/lian/5k"

def benchmark(n: Int)(f: => Unit) {
  val stopwatch = new Stopwatch()

  def run() = {
    stopwatch.reset()
    stopwatch.start()
    f
    stopwatch.stop()
    stopwatch.elapsedMillis()
  }

  val records = (0 until n).map(_ => run())

  (0 until n).foreach(i => println(s"Round $i: ${records(i)} ms"))
  println(s"Average: ${records.sum / n.toDouble} ms")
}

benchmark(3) { read.parquet(path).explain(extended = true) }
```

#### Results

Before:

```
Round 0: 72528 ms
Round 1: 68938 ms
Round 2: 65372 ms
Average: 68946.0 ms
```

After:

```
Round 0: 59499 ms
Round 1: 53645 ms
Round 2: 53844 ms
Round 3: 49093 ms
Round 4: 50555 ms
Average: 53327.2 ms
```

Also removing Hadoop configuration broadcasting:

(Note that I was testing on a local laptop, thus network cost is pretty low.)

```
Round 0: 15806 ms
Round 1: 14394 ms
Round 2: 14699 ms
Round 3: 15334 ms
Round 4: 14123 ms
Average: 14871.2 ms
```

Author: Cheng Lian <lian@databricks.com>

Closes #6225 from liancheng/spark-7673 and squashes the following commits:

2d58a2b [Cheng Lian] Skips reading row group information when using task side metadata reading
7aa3748 [Cheng Lian] Optimizes FileStatusCache by introducing a map from parent directories to child files
ba41250 [Cheng Lian] Reuses HadoopFsRelation FileStatusCache in ParquetRelation2
3d278f7 [Cheng Lian] Fixes a bug when reading a single Parquet data file
b84612a [Cheng Lian] Fixes Scala style issue
6a08b02 [Cheng Lian] WIP: Moves file status cache into HadoopFSRelation
2015-05-18 12:45:37 -07:00
Yin Huai 530397ba2f [SPARK-7567] [SQL] [follow-up] Use a new flag to set output committer based on mapreduce apis
cc liancheng marmbrus

Author: Yin Huai <yhuai@databricks.com>

Closes #6130 from yhuai/directOutput and squashes the following commits:

312b07d [Yin Huai] A data source can use spark.sql.sources.outputCommitterClass to override the output committer.
2015-05-18 12:17:10 -07:00
Wenchen Fan 103c863c2e [SPARK-7269] [SQL] Incorrect analysis for aggregation(use semanticEquals)
A modified version of https://github.com/apache/spark/pull/6110, use `semanticEquals` to make it more efficient.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6173 from cloud-fan/7269 and squashes the following commits:

e4a3cc7 [Wenchen Fan] address comments
cc02045 [Wenchen Fan] consider elements length equal
d7ff8f4 [Wenchen Fan] fix 7269
2015-05-18 12:12:55 -07:00
scwf fc2480ed13 [SPARK-7631] [SQL] treenode argString should not print children
spark-sql>
> explain extended
> select * from (
> select key from src union all
> select key from src) t;

now the spark plan will print children in argString
```
== Physical Plan ==
Union[ HiveTableScan key#1, (MetastoreRelation default, src, None), None,
HiveTableScan key#3, (MetastoreRelation default, src, None), None]
HiveTableScan key#1, (MetastoreRelation default, src, None), None
HiveTableScan key#3, (MetastoreRelation default, src, None), None
```

after this patch:
```
== Physical Plan ==
Union
 HiveTableScan [key#1], (MetastoreRelation default, src, None), None
 HiveTableScan [key#3], (MetastoreRelation default, src, None), None
```

I have tested this locally

Author: scwf <wangfei1@huawei.com>

Closes #6144 from scwf/fix-argString and squashes the following commits:

1a642e0 [scwf] fix treenode argString
2015-05-18 12:05:14 -07:00
Zhan Zhang aa31e431fc [SPARK-2883] [SQL] ORC data source for Spark SQL
This PR updates PR #6135 authored by zhzhan from Hortonworks.

----

This PR implements a Spark SQL data source for accessing ORC files.

> **NOTE**
>
> Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive.  That's why the new ORC data source is under `org.apache.spark.sql.hive` package, and must be used with `HiveContext`.  However, it doesn't require existing Hive installation to access ORC files.

1.  Saving/loading ORC files without contacting Hive metastore

1.  Support for complex data types (i.e. array, map, and struct)

1.  Aware of common optimizations provided by Spark SQL:

    - Column pruning
    - Partitioning pruning
    - Filter push-down

1.  Schema evolution support
1.  Hive metastore table conversion

This PR also include initial work done by scwf from Huawei (PR #3753).

Author: Zhan Zhang <zhazhan@gmail.com>
Author: Cheng Lian <lian@databricks.com>

Closes #6194 from liancheng/polishing-orc and squashes the following commits:

55ecd96 [Cheng Lian] Reorganizes ORC test suites
d4afeed [Cheng Lian] Addresses comments
21ada22 [Cheng Lian] Adds @since and @Experimental annotations
128bd3b [Cheng Lian] ORC filter bug fix
d734496 [Cheng Lian] Polishes the ORC data source
2650a42 [Zhan Zhang] resolve review comments
3c9038e [Zhan Zhang] resolve review comments
7b3c7c5 [Zhan Zhang] save mode fix
f95abfd [Zhan Zhang] reuse test suite
7cc2c64 [Zhan Zhang] predicate fix
4e61c16 [Zhan Zhang] minor change
305418c [Zhan Zhang] orc data source support
2015-05-18 12:03:40 -07:00
Wenchen Fan 56ede88485 [SQL] [MINOR] [THIS] use private for internal field in ScalaUdf
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6235 from cloud-fan/tmp and squashes the following commits:

8f16367 [Wenchen Fan] use private[this]
2015-05-18 12:01:30 -07:00
Cheng Lian 010a1c2780 [SPARK-7570] [SQL] Ignores _temporary during partition discovery
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/6091)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #6091 from liancheng/spark-7570 and squashes the following commits:

8ff07e8 [Cheng Lian] Ignores _temporary during partition discovery
2015-05-18 11:59:44 -07:00
Rene Treffer e1ac2a955b [SPARK-6888] [SQL] Make the jdbc driver handling user-definable
Replace the DriverQuirks with JdbcDialect(s) (and MySQLDialect/PostgresDialect)
and allow developers to change the dialects on the fly (for new JDBCRRDs only).

Some types (like an unsigned 64bit number) can be trivially mapped to java.
The status quo is that the RRD will fail to load.
This patch makes it possible to overwrite the type mapping to read e.g.
64Bit numbers as strings and handle them afterwards in software.

JDBCSuite has an example that maps all types to String, which should always
work (at the cost of extra code afterwards).

As a side effect it should now be possible to develop simple dialects
out-of-tree and even with spark-shell.

Author: Rene Treffer <treffer@measite.de>

Closes #5555 from rtreffer/jdbc-dialects and squashes the following commits:

3cbafd7 [Rene Treffer] [SPARK-6888] ignore classes belonging to changed API in MIMA report
fe7e2e8 [Rene Treffer] [SPARK-6888] Make the jdbc driver handling user-definable
2015-05-18 11:55:36 -07:00
Liang-Chi Hsieh e32c0f69f3 [SPARK-7299][SQL] Set precision and scale for Decimal according to JDBC metadata instead of returned BigDecimal
JIRA: https://issues.apache.org/jira/browse/SPARK-7299

When connecting with oracle db through jdbc, the precision and scale of `BigDecimal` object returned by `ResultSet.getBigDecimal` is not correctly matched to the table schema reported by `ResultSetMetaData.getPrecision` and `ResultSetMetaData.getScale`.

So in case you insert a value like `19999` into a column with `NUMBER(12, 2)` type, you get through a `BigDecimal` object with scale as 0. But the dataframe schema has correct type as `DecimalType(12, 2)`. Thus, after you save the dataframe into parquet file and then retrieve it, you will get wrong result `199.99`.

Because it is reported to be problematic on jdbc connection with oracle db. It might be difficult to add test case for it. But according to the user's test on JIRA, it solves this problem.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5833 from viirya/jdbc_decimal_precision and squashes the following commits:

69bc2b5 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into jdbc_decimal_precision
928f864 [Liang-Chi Hsieh] Add comments.
5f9da94 [Liang-Chi Hsieh] Set up Decimal's precision and scale according to table schema instead of returned BigDecimal.
2015-05-18 01:10:55 -07:00
zsxwing ff71d34e00 [SPARK-7693][Core] Remove "import scala.concurrent.ExecutionContext.Implicits.global"
Learnt a lesson from SPARK-7655: Spark should avoid to use `scala.concurrent.ExecutionContext.Implicits.global` because the user may submit blocking actions to `scala.concurrent.ExecutionContext.Implicits.global` and exhaust all threads in it. This could crash Spark. So Spark should always use its own thread pools for safety.

This PR removes all usages of `scala.concurrent.ExecutionContext.Implicits.global` and uses proper thread pools to replace them.

Author: zsxwing <zsxwing@gmail.com>

Closes #6223 from zsxwing/SPARK-7693 and squashes the following commits:

a33ff06 [zsxwing] Decrease the max thread number from 1024 to 128
cf4b3fc [zsxwing] Remove "import scala.concurrent.ExecutionContext.Implicits.global"
2015-05-17 20:37:19 -07:00
Wenchen Fan 2f22424e9f [SQL] [MINOR] use catalyst type converter in ScalaUdf
It's a follow-up of https://github.com/apache/spark/pull/5154, we can speed up scala udf evaluation by create type converter in advance.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6182 from cloud-fan/tmp and squashes the following commits:

241cfe9 [Wenchen Fan] use converter in ScalaUdf
2015-05-17 16:51:57 -07:00
Michael Armbrust 2ca60ace8f [SPARK-7491] [SQL] Allow configuration of classloader isolation for hive
Author: Michael Armbrust <michael@databricks.com>

Closes #6167 from marmbrus/configureIsolation and squashes the following commits:

6147cbe [Michael Armbrust] filter other conf
22cc3bc7 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into configureIsolation
07476ee [Michael Armbrust] filter empty prefixes
dfdf19c [Michael Armbrust] [SPARK-6906][SQL] Allow configuration of classloader isolation for hive
2015-05-17 12:43:15 -07:00
Josh Rosen 564562874f [SPARK-7686] [SQL] DescribeCommand is assigned wrong output attributes in SparkStrategies
In `SparkStrategies`, `RunnableDescribeCommand` is called with the output attributes of the table being described rather than the attributes for the `describe` command's output.  I discovered this issue because it caused type conversion errors in some UnsafeRow conversion code that I'm writing.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6217 from JoshRosen/SPARK-7686 and squashes the following commits:

953a344 [Josh Rosen] Fix SPARK-7686 with a simple change in SparkStrategies.
a4eec9f [Josh Rosen] Add failing regression test for SPARK-7686
2015-05-17 11:59:28 -07:00
Liang-Chi Hsieh 3399055787 [SPARK-7447] [SQL] Don't re-merge Parquet schema when the relation is deserialized
JIRA: https://issues.apache.org/jira/browse/SPARK-7447

`MetadataCache` in `ParquetRelation2` is annotated as `transient`. When `ParquetRelation2` is deserialized, we ask `MetadataCache` to refresh and perform schema merging again. It is time-consuming especially for very many parquet files.

With the new `FSBasedParquetRelation`, although `MetadataCache` is not `transient` now, `MetadataCache.refresh()` still performs schema merging again when the relation is deserialized.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6012 from viirya/without_remerge_schema and squashes the following commits:

2663957 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into without_remerge_schema
6ac7d93 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into without_remerge_schema
b0fc09b [Liang-Chi Hsieh] Don't generate and merge parquetSchema multiple times.
2015-05-17 15:42:21 +08:00
scwf edf09ea1bd [SQL] [MINOR] Skip unresolved expression for InConversion
Author: scwf <wangfei1@huawei.com>

Closes #6145 from scwf/InConversion and squashes the following commits:

5c8ac6b [scwf] minir fix for InConversion
2015-05-17 15:17:11 +08:00
Cheng Lian ba4f8ca0d9 [MINOR] [SQL] Removes an unreachable case clause
This case clause is already covered by the one above, and generates a compilation warning.

Author: Cheng Lian <lian@databricks.com>

Closes #6214 from liancheng/remove-unreachable-code and squashes the following commits:

c38ca7c [Cheng Lian] Removes an unreachable case clause
2015-05-16 23:20:09 -07:00
Reynold Xin 517eb37a85 [SPARK-7654][SQL] Move JDBC into DataFrame's reader/writer interface.
Also moved all the deprecated functions into one place for SQLContext and DataFrame, and updated tests to use the new API.

Author: Reynold Xin <rxin@databricks.com>

Closes #6210 from rxin/df-writer-reader-jdbc and squashes the following commits:

7465c2c [Reynold Xin] Fixed unit test.
118e609 [Reynold Xin] Updated tests.
3441b57 [Reynold Xin] Updated javadoc.
13cdd1c [Reynold Xin] [SPARK-7654][SQL] Move JDBC into DataFrame's reader/writer interface.
2015-05-16 22:01:53 -07:00
Cheng Lian ce6391296a [HOTFIX] [SQL] Fixes DataFrameWriter.mode(String)
We forgot an assignment there.

/cc rxin

Author: Cheng Lian <lian@databricks.com>

Closes #6212 from liancheng/fix-df-writer and squashes the following commits:

711fbb0 [Cheng Lian] Adds a test case
3b72d78 [Cheng Lian] Fixes DataFrameWriter.mode(String)
2015-05-16 20:55:17 +08:00
zsxwing 47e7ffe36b [SPARK-7655][Core][SQL] Remove 'scala.concurrent.ExecutionContext.Implicits.global' in 'ask' and 'BroadcastHashJoin'
Because both `AkkaRpcEndpointRef.ask` and `BroadcastHashJoin` uses `scala.concurrent.ExecutionContext.Implicits.global`. However, because the tasks in `BroadcastHashJoin` are usually long-running tasks, which will occupy all threads in `global`. Then `ask` cannot get a chance to process the replies.

For `ask`, actually the tasks are very simple, so we can use `MoreExecutors.sameThreadExecutor()`. For `BroadcastHashJoin`, it's better to use `ThreadUtils.newDaemonCachedThreadPool`.

Author: zsxwing <zsxwing@gmail.com>

Closes #6200 from zsxwing/SPARK-7655-2 and squashes the following commits:

cfdc605 [zsxwing] Remove redundant imort and minor doc fix
cf83153 [zsxwing] Add "sameThread" and "newDaemonCachedThreadPool with maxThreadNumber" to ThreadUtils
08ad0ee [zsxwing] Remove 'scala.concurrent.ExecutionContext.Implicits.global' in 'ask' and 'BroadcastHashJoin'
2015-05-16 00:44:29 -07:00
Reynold Xin 578bfeeff5 [SPARK-7654][SQL] DataFrameReader and DataFrameWriter for input/output API
This patch introduces DataFrameWriter and DataFrameReader.

DataFrameReader interface, accessible through SQLContext.read, contains methods that create DataFrames. These methods used to reside in SQLContext. Example usage:
```scala
sqlContext.read.json("...")
sqlContext.read.parquet("...")
```

DataFrameWriter interface, accessible through DataFrame.write, implements a builder pattern to avoid the proliferation of options in writing DataFrame out. It currently implements:
- mode
- format (e.g. "parquet", "json")
- options (generic options passed down into data sources)
- partitionBy (partitioning columns)
Example usage:
```scala
df.write.mode("append").format("json").partitionBy("date").saveAsTable("myJsonTable")
```

TODO:

- [ ] Documentation update
- [ ] Move JDBC into reader / writer?
- [ ] Deprecate the old interfaces
- [ ] Move the generic load interface into reader.
- [ ] Update example code and documentation

Author: Reynold Xin <rxin@databricks.com>

Closes #6175 from rxin/reader-writer and squashes the following commits:

b146c95 [Reynold Xin] Deprecation of old APIs.
bd8abdf [Reynold Xin] Fixed merge conflict.
26abea2 [Reynold Xin] Added general load methods.
244fbec [Reynold Xin] Added equivalent to example.
4f15d92 [Reynold Xin] Added documentation for partitionBy.
7e91611 [Reynold Xin] [SPARK-7654][SQL] DataFrameReader and DataFrameWriter for input/output API.
2015-05-15 22:00:31 -07:00
Cheng Lian fdf5bba35d [SPARK-7591] [SQL] Partitioning support API tweaks
Please see [SPARK-7591] [1] for the details.

/cc rxin marmbrus yhuai

[1]: https://issues.apache.org/jira/browse/SPARK-7591

Author: Cheng Lian <lian@databricks.com>

Closes #6150 from liancheng/spark-7591 and squashes the following commits:

af422e7 [Cheng Lian] Addresses @rxin's comments
37d1738 [Cheng Lian] Fixes HadoopFsRelation partition columns initialization
2fc680a [Cheng Lian] Fixes Scala style issue
189ad23 [Cheng Lian] Removes HadoopFsRelation constructor arguments
522c24e [Cheng Lian] Adds OutputWriterFactory
047d40d [Cheng Lian] Renames FSBased* to HadoopFs*, also renamed FSBasedParquetRelation back to ParquetRelation2
2015-05-15 16:20:49 +08:00
Yin Huai e8f0e016ea [SQL] When creating partitioned table scan, explicitly create UnionRDD.
Otherwise, it will cause stack overflow when there are many partitions.

Author: Yin Huai <yhuai@databricks.com>

Closes #6162 from yhuai/partitionUnionedRDD and squashes the following commits:

fa016d8 [Yin Huai] Explicitly create UnionRDD.
2015-05-15 12:04:26 +08:00
Liang-Chi Hsieh f9705d4613 [SPARK-7098][SQL] Make the WHERE clause with timestamp show consistent result
JIRA: https://issues.apache.org/jira/browse/SPARK-7098

The WHERE clause with timstamp shows inconsistent results. This pr fixes it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5682 from viirya/consistent_timestamp and squashes the following commits:

171445a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into consistent_timestamp
4e98520 [Liang-Chi Hsieh] Make the WHERE clause with timestamp show consistent result.
2015-05-14 20:49:21 -07:00
Michael Armbrust 6d0633e3ec [SPARK-7548] [SQL] Add explode function for DataFrames
Add an `explode` function for dataframes and modify the analyzer so that single table generating functions can be present in a select clause along with other expressions.   There are currently the following restrictions:
 - only top level TGFs are allowed (i.e. no `select(explode('list) + 1)`)
 - only one may be present in a single select to avoid potentially confusing implicit Cartesian products.

TODO:
 - [ ] Python

Author: Michael Armbrust <michael@databricks.com>

Closes #6107 from marmbrus/explodeFunction and squashes the following commits:

7ee2c87 [Michael Armbrust] whitespace
6f80ba3 [Michael Armbrust] Update dataframe.py
c176c89 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explodeFunction
81b5da3 [Michael Armbrust] style
d3faa05 [Michael Armbrust] fix self join case
f9e1e3e [Michael Armbrust] fix python, add since
4f0d0a9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explodeFunction
e710fe4 [Michael Armbrust] add java and python
52ca0dc [Michael Armbrust] [SPARK-7548][SQL] Add explode function for dataframes.
2015-05-14 19:49:44 -07:00
Wenchen Fan f2cd00be35 [SQL][minor] rename apply for QueryPlanner
A follow-up of https://github.com/apache/spark/pull/5624

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6142 from cloud-fan/tmp and squashes the following commits:

971a92b [Wenchen Fan] use plan instead of execute
24c5ffe [Wenchen Fan] rename apply
2015-05-14 10:25:18 -07:00