The API signatire for join requires the JoinType to be the third parameter. The code examples provided for join show JoinType being provided as the 2nd parater resuling in errors (i.e. "df1.join(df2, "outer", $"df1Key" === $"df2Key") ). The correct sample code is df1.join(df2, $"df1Key" === $"df2Key", "outer")
Author: Paul Power <paul.power@peerside.com>
Closes#4847 from peerside/master and squashes the following commits:
ebc1efa [Paul Power] Merge pull request #1 from peerside/peerside-patch-1
e353340 [Paul Power] Updated comments use correct sample code for Dataframe joins
Always set `containsNull = true` when infer the schema of JSON datasets. If we set `containsNull` based on records we scanned, we may miss arrays with null values when we do sampling. Also, because future data can have arrays with null values, if we convert JSON data to parquet, always setting `containsNull = true` is a more robust way to go.
JIRA: https://issues.apache.org/jira/browse/SPARK-6052
Author: Yin Huai <yhuai@databricks.com>
Closes#4806 from yhuai/jsonArrayContainsNull and squashes the following commits:
05eab9d [Yin Huai] Change containsNull to true.
JIRA: https://issues.apache.org/jira/browse/SPARK-6024
Author: Yin Huai <yhuai@databricks.com>
Closes#4795 from yhuai/wideSchema and squashes the following commits:
4882e6f [Yin Huai] Address comments.
73e71b4 [Yin Huai] Address comments.
143927a [Yin Huai] Simplify code.
cc1d472 [Yin Huai] Make the schema wider.
12bacae [Yin Huai] If the JSON string of a schema is too large, split it before storing it in metastore.
e9b4f70 [Yin Huai] Failed test.
`FilteringParquetRowInputFormat` manually merges Parquet schemas before computing splits. However, it is duplicate because the schemas are already merged in `ParquetRelation2`. We don't need to re-merge them at `InputFormat`.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4786 from viirya/dup_parquet_schemas_merge and squashes the following commits:
ef78a5a [Liang-Chi Hsieh] Avoiding duplicate Parquet schema merging.
It is useful to let the user decide the number of rows to show in DataFrame.show
Author: Jacky Li <jacky.likun@huawei.com>
Closes#4767 from jackylk/show and squashes the following commits:
a0e0f4b [Jacky Li] fix testcase
7cdbe91 [Jacky Li] modify according to comment
bb54537 [Jacky Li] for Java compatibility
d7acc18 [Jacky Li] modify according to comments
981be52 [Jacky Li] add numRows param in DataFrame.show()
Please see JIRA (https://issues.apache.org/jira/browse/SPARK-6016) for details of the bug.
Author: Yin Huai <yhuai@databricks.com>
Closes#4775 from yhuai/parquetFooterCache and squashes the following commits:
78787b1 [Yin Huai] Remove footerCache in FilteringParquetRowInputFormat.
dff6fba [Yin Huai] Failed unit test.
DataFrame.explain return wrong result when the query is DDL command.
For example, the following two queries should print out the same execution plan, but it not.
sql("create table tb as select * from src where key > 490").explain(true)
sql("explain extended create table tb as select * from src where key > 490")
This is because DataFrame.explain leverage logicalPlan which had been forced executed, we should use the unexecuted plan queryExecution.logical.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#4707 from yanboliang/spark-5926 and squashes the following commits:
fa6db63 [Yanbo Liang] logicalPlan is not lazy
0e40a1b [Yanbo Liang] make DataFrame.explain leverage queryExecution.logical
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4760 from viirya/dup_literal and squashes the following commits:
06e7516 [Liang-Chi Hsieh] Remove duplicate Literal matching block.
`ReadContext.init` calls `InitContext.getMergedKeyValueMetadata`, which doesn't know how to merge conflicting user defined key-value metadata and throws exception. In our case, when dealing with different but compatible schemas, we have different Spark SQL schema JSON strings in different Parquet part-files, thus causes this problem. Reading similar Parquet files generated by Hive doesn't suffer from this issue.
In this PR, we manually merge the schemas before passing it to `ReadContext` to avoid the exception.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4768)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4768 from liancheng/spark-6010 and squashes the following commits:
9002f0a [Cheng Lian] Fixes SPARK-6010
Author: Michael Armbrust <michael@databricks.com>
Closes#4757 from marmbrus/udtConversions and squashes the following commits:
3714aad [Michael Armbrust] [SPARK-5996][SQL] Fix specialized outbound conversions
Also added desc/asc function for constructing sorting expressions more conveniently. And added a small fix to lift alias out of cast expression.
Author: Reynold Xin <rxin@databricks.com>
Closes#4752 from rxin/SPARK-5985 and squashes the following commits:
aeda5ae [Reynold Xin] Added Experimental flag to ColumnName.
047ad03 [Reynold Xin] Lift alias out of cast.
c9cf17c [Reynold Xin] [SPARK-5985][SQL] DataFrame sortBy -> orderBy in Python.
Author: Michael Armbrust <michael@databricks.com>
Closes#4738 from marmbrus/udtRepart and squashes the following commits:
c06d7b5 [Michael Armbrust] fix compilation
91c8829 [Michael Armbrust] [SQL][SPARK-5532] Repartition should not use external rdd representation
1. Column is no longer a DataFrame to simplify class hierarchy.
2. Don't use varargs on abstract methods (see Scala compiler bug SI-9013).
Author: Reynold Xin <rxin@databricks.com>
Closes#4686 from rxin/SPARK-5904 and squashes the following commits:
fd9b199 [Reynold Xin] Fixed Python tests.
df25cef [Reynold Xin] Non final.
5221530 [Reynold Xin] [SPARK-5904][SQL] DataFrame API fixes.
The `int` is 64-bit on 64-bit machine (very common now), we should infer it as LongType for it in Spark SQL.
Also, LongType in SQL will come back as `int`.
Author: Davies Liu <davies@databricks.com>
Closes#4666 from davies/long and squashes the following commits:
6bc6cc4 [Davies Liu] infer int as LongType
Although we've migrated to the DataFrame API, lots of code still uses `rdd` or `srdd` as local variable names. This PR tries to address these naming inconsistencies and some other minor DataFrame related style issues.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4670)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4670 from liancheng/df-cleanup and squashes the following commits:
3e14448 [Cheng Lian] Cleans up DataFrame variable names and toDF() calls
The problem is that after we create an empty hive metastore parquet table (e.g. `CREATE TABLE test (a int) STORED AS PARQUET`), Hive will create an empty dir for us, which cause our data source `ParquetRelation2` fail to get the schema of the table. See JIRA for the case to reproduce the bug and the exception.
This PR is based on #4562 from chenghao-intel.
JIRA: https://issues.apache.org/jira/browse/SPARK-5852
Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Hao <hao.cheng@intel.com>
Closes#4655 from yhuai/CTASParquet and squashes the following commits:
b8b3450 [Yin Huai] Update tests.
2ac94f7 [Yin Huai] Update tests.
3db3d20 [Yin Huai] Minor update.
d7e2308 [Yin Huai] Revert changes in HiveMetastoreCatalog.scala.
36978d1 [Cheng Hao] Update the code as feedback
a04930b [Cheng Hao] fix bug of scan an empty parquet based table
442ffe0 [Cheng Hao] passdown the schema for Parquet File in HiveContext
Author: Michael Armbrust <michael@databricks.com>
Closes#4657 from marmbrus/pythonUdfs and squashes the following commits:
a7823a8 [Michael Armbrust] [SPARK-5868][SQL] Fix python UDFs in HiveContext and checks in SQLContext
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4649 from viirya/use_checkpath and squashes the following commits:
0f9a1a1 [Liang-Chi Hsieh] Use same function to check path parameter.
Added a bunch of tags.
Also changed parquetFile to take varargs rather than a string followed by varargs.
Author: Reynold Xin <rxin@databricks.com>
Closes#4636 from rxin/df-doc and squashes the following commits:
651f80c [Reynold Xin] Fixed parquetFile in PySpark.
8dc3024 [Reynold Xin] [SQL] Various DataFrame doc changes.
This PR adds a `ShowTablesCommand` to support `SHOW TABLES [IN databaseName]` SQL command. The result of `SHOW TABLE` has two columns, `tableName` and `isTemporary`. For temporary tables, the value of `isTemporary` column will be `false`.
JIRA: https://issues.apache.org/jira/browse/SPARK-4865
Author: Yin Huai <yhuai@databricks.com>
Closes#4618 from yhuai/showTablesCommand and squashes the following commits:
0c09791 [Yin Huai] Use ShowTablesCommand.
85ee76d [Yin Huai] Since SHOW TABLES is not a Hive native command any more and we will not see "OK" (originally generated by Hive's driver), use SHOW DATABASES in the test.
94bacac [Yin Huai] Add SHOW TABLES to the list of noExplainCommands.
d71ed09 [Yin Huai] Fix test.
a4a6ec3 [Yin Huai] Add SHOW TABLE command.
Existing implementation of arithmetic operators and BinaryComparison operators have redundant type checking codes, e.g.:
Expression.n2 is used by Add/Subtract/Multiply.
(1) n2 always checks left.dataType == right.dataType. However, this checking should be done once when we resolve expression types;
(2) n2 requires dataType is a NumericType. This can be done once.
This PR optimizes arithmetic and predicate operators by removing such redundant type-checking codes.
Some preliminary benchmarking on 10G TPC-H data over 5 r3.2xlarge EC2 machines shows that this PR can reduce the query time by 5.5% to 11%.
The benchmark queries follow the template below, where OP is plus/minus/times/divide/remainder/bitwise and/bitwise or/bitwise xor.
SELECT l_returnflag, l_linestatus, SUM(l_quantity OP cnt1), SUM(l_quantity OP cnt2), ...., SUM(l_quantity OP cnt700)
FROM (
SELECT l_returnflag, l_linestatus, l_quantity, 1 AS cnt1, 2 AS cnt2, ..., 700 AS cnt700
FROM lineitem
WHERE l_shipdate <= '1998-09-01'
)
GROUP BY l_returnflag, l_linestatus;
Author: kai <kaizeng@eecs.berkeley.edu>
Closes#4472 from kai-zeng/arithmetic-optimize and squashes the following commits:
fef0cf1 [kai] Merge branch 'master' of github.com:apache/spark into arithmetic-optimize
4b3a1bb [kai] chmod a-x
5a41e49 [kai] chmod a-x Expression.scala
cb37c94 [kai] rebase onto spark master
7f6e968 [kai] chmod 100755 -> 100644
6cddb46 [kai] format
7490dbc [kai] fix unresolved-expression exception for EqualTo
9c40bc0 [kai] fix bitwisenot
3cbd363 [kai] clean up test code
ca47801 [kai] override evalInternal for bitwise ops
8fa84a1 [kai] add bitwise or and xor
6892fc4 [kai] revert override evalInternal
f8eba24 [kai] override evalInternal
31ccdd4 [kai] rewrite all bitwise op and remove evalInternal
86297e2 [kai] generalized
cb92ae1 [kai] bitwise-and: override eval
97a7d6c [kai] bitwise-and: override evalInternal using and func
0906c39 [kai] add bitwise test
62abbbc [kai] clean up predicate and arithmetic
b34d58d [kai] add caching and benmark option
12c5b32 [kai] override eval
1cd7571 [kai] fix sqrt and maxof
03fd0c3 [kai] fix predicate
16fd84c [kai] optimize + - * / % -(unary) abs < > <= >=
fd95823 [kai] remove unnecessary type checking
24d062f [kai] test suite
JIRA: https://issues.apache.org/jira/browse/SPARK-5746
liancheng marmbrus
Author: Yin Huai <yhuai@databricks.com>
Closes#4617 from yhuai/insertOverwrite and squashes the following commits:
8e3019d [Yin Huai] Fix compilation error.
499e8e7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertOverwrite
e76e85a [Yin Huai] Address comments.
ac31b3c [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertOverwrite
f30bdad [Yin Huai] Use toDF.
99da57e [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertOverwrite
6b7545c [Yin Huai] Add a pre write check to the data source API.
a88c516 [Yin Huai] DDLParser will take a parsering function to take care CTAS statements.
Lifts `HiveMetastoreCatalog.refreshTable` to `Catalog`. Adds `RefreshTable` command to refresh (possibly cached) metadata in external data sources tables.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4624)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4624 from liancheng/refresh-table and squashes the following commits:
8d1aa4c [Cheng Lian] Adds REFRESH TABLE command
This PR adds the following filter types for data sources API:
- `IsNull`
- `IsNotNull`
- `Not`
- `And`
- `Or`
The code which converts Catalyst predicate expressions to data sources filters is very similar to filter conversion logics in `ParquetFilters` which converts Catalyst predicates to Parquet filter predicates. In this way we can support nested AND/OR/NOT predicates without changing current `BaseScan` type hierarchy.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4623)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>
Closes#4623 from liancheng/more-fiters and squashes the following commits:
1b296f4 [Cheng Lian] Add more filter types for data sources API
This PR migrates the Parquet data source to the new data source write support API. Now users can also overwriting and appending to existing tables. Notice that inserting into partitioned tables is not supported yet.
When Parquet data source is enabled, insertion to Hive Metastore Parquet tables is also fullfilled by the Parquet data source. This is done by the newly introduced `HiveMetastoreCatalog.ParquetConversions` rule, which is a "proper" implementation of the original hacky `HiveStrategies.ParquetConversion`. The latter is still preserved, and can be removed together with the old Parquet support in the future.
TODO:
- [x] Update outdated comments in `newParquet.scala`.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4563)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4563 from liancheng/parquet-refining and squashes the following commits:
fa98d27 [Cheng Lian] Fixes test cases which should disable off Parquet data source
2476e82 [Cheng Lian] Fixes compilation error introduced during rebasing
a83d290 [Cheng Lian] Passes Hive Metastore partitioning information to ParquetRelation2
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4613)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4613 from liancheng/df-implicit-rename and squashes the following commits:
db8bdd3 [Cheng Lian] Renames stringRddToDataFrame to stringRddToDataFrameHolder for consistency
If one tries an example by using copy&paste, throw an exception.
Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
Closes#4615 from maropu/AddMissingImportInSqlContext and squashes the following commits:
ab21b66 [Takeshi Yamamuro] Add missing import in the example of SqlContext
Author: Yin Huai <yhuai@databricks.com>
Closes#4582 from yhuai/jsonErrorMessage and squashes the following commits:
152dbd4 [Yin Huai] Update error message.
1466256 [Yin Huai] Throw a better error message when a JSON object in the input dataset span multiple records (lines for files or strings for an RDD of strings).
Author: Yin Huai <yhuai@databricks.com>
Closes#4542 from yhuai/moveSaveMode and squashes the following commits:
65a4425 [Yin Huai] Move SaveMode to sql package.
Author: Yin Huai <yhuai@databricks.com>
Closes#4544 from yhuai/jsonUseLongTypeByDefault and squashes the following commits:
6e2ffc2 [Yin Huai] Use LongType as the default type for integers in JSON schema inference.
Eases use in the spark-shell.
Author: Michael Armbrust <michael@databricks.com>
Closes#4545 from marmbrus/serialization and squashes the following commits:
04748e6 [Michael Armbrust] @scala.annotation.varargs
b36e219 [Michael Armbrust] moreFixes
- Removed DataFrame.apply for projection & filtering since they are extremely confusing.
- Added implicits for RDD[Int], RDD[Long], and RDD[String]
Author: Reynold Xin <rxin@databricks.com>
Closes#4543 from rxin/df-cleanup and squashes the following commits:
81ec915 [Reynold Xin] [SQL] More DataFrame fixes.
Also I fix a bunch of bad output in test cases.
Author: Michael Armbrust <michael@databricks.com>
Closes#4520 from marmbrus/selfJoin and squashes the following commits:
4f4a85c [Michael Armbrust] comments
49c8e26 [Michael Armbrust] fix tests
6fc38de [Michael Armbrust] fix style
55d64b3 [Michael Armbrust] fix dataframe selfjoins
1. DataFrame.renameColumn
2. DataFrame.show() and _repr_
3. Use simpleString() rather than jsonValue in DataFrame.dtypes
4. createDataFrame from local Python data, including pandas.DataFrame
Author: Davies Liu <davies@databricks.com>
Closes#4528 from davies/df3 and squashes the following commits:
014acea [Davies Liu] fix typo
6ba526e [Davies Liu] fix tests
46f5f95 [Davies Liu] address comments
6cbc154 [Davies Liu] dataframe.show() and improve dtypes
6f94f25 [Davies Liu] create DataFrame from local Python data
Also took the chance to fixed up some style ...
Author: Reynold Xin <rxin@databricks.com>
Closes#4489 from rxin/SPARK-5702 and squashes the following commits:
74f42e3 [Reynold Xin] [SPARK-5702][SQL] Allow short names for built-in data sources.
Do not recursively strip out projects. Only strip the first level project.
```scala
df("colA") + df("colB").as("colC")
```
Previously, the above would construct an invalid plan.
Author: Reynold Xin <rxin@databricks.com>
Closes#4519 from rxin/computability and squashes the following commits:
87ff763 [Reynold Xin] Code review feedback.
015c4fc [Reynold Xin] [SQL][DataFrame] Fix column computability.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#4496 from chenghao-intel/df_explain and squashes the following commits:
552aa58 [Cheng Hao] Add explain support for DF
Deprecate inferSchema() and applySchema(), use createDataFrame() instead, which could take an optional `schema` to create an DataFrame from an RDD. The `schema` could be StructType or list of names of columns.
Author: Davies Liu <davies@databricks.com>
Closes#4498 from davies/create and squashes the following commits:
08469c1 [Davies Liu] remove Scala/Java API for now
c80a7a9 [Davies Liu] fix hive test
d1bd8f2 [Davies Liu] cleanup applySchema
9526e97 [Davies Liu] createDataFrame from RDD with columns
Author: Cheng Hao <hao.cheng@intel.com>
Closes#4468 from chenghao-intel/json and squashes the following commits:
aeb7801 [Cheng Hao] avoid multiple json generator created
Users will not need to put `Options()` in a CREATE TABLE statement when there is not option provided.
Author: Yin Huai <yhuai@databricks.com>
Closes#4515 from yhuai/makeOptionsOptional and squashes the following commits:
1a898d3 [Yin Huai] Make options optional.
Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com>
Author: OopsOutOfMemory <victorshengli@126.com>
Closes#4508 from OopsOutOfMemory/cmt and squashes the following commits:
d8a68c6 [Sheng, Li] Update ddl.scala
f24aeaf [OopsOutOfMemory] correct style
Author: Michael Armbrust <michael@databricks.com>
Closes#4436 from marmbrus/dfToString and squashes the following commits:
8a3c35f [Michael Armbrust] Merge remote-tracking branch 'origin/master' into dfToString
b72a81b [Michael Armbrust] add toString
This is a completion of https://github.com/apache/spark/pull/4033 which was withdrawn for some reason.
Author: Sean Owen <sowen@cloudera.com>
Closes#4470 from srowen/SPARK-5239.2 and squashes the following commits:
2398bde [Sean Owen] Avoid use of JDBC4-only isClosed()
Author: Cheng Hao <hao.cheng@intel.com>
Closes#4494 from chenghao-intel/tiny_code_change and squashes the following commits:
450dfe7 [Cheng Hao] remove the duplicated code
- as with a `Symbol`
- distinct
- sqlContext.emptyDataFrame
- move add/remove col out of RDDApi section
Author: Michael Armbrust <michael@databricks.com>
Closes#4437 from marmbrus/dfMissingFuncs and squashes the following commits:
2004023 [Michael Armbrust] Add missing functions
Fix Scala code style.
Author: Hung Lin <hung@zoomdata.com>
Closes#4464 from hunglin/SPARK-5472 and squashes the following commits:
ef7a3b3 [Hung Lin] SPARK-5472: fix scala style
An example:
```
year month AVG('Adj Close) MAX('Adj Close)
1980 12 0.503218 0.595103
1981 01 0.523289 0.570307
1982 02 0.436504 0.475256
1983 03 0.410516 0.442194
1984 04 0.450090 0.483521
```
Author: Reynold Xin <rxin@databricks.com>
Closes#4416 from rxin/SPARK-5643 and squashes the following commits:
d0e0d6e [Reynold Xin] [SQL] Minor update to data source and statistics documentation.
269da83 [Reynold Xin] Updated isLocal comment.
2cf3c27 [Reynold Xin] Moved logic into optimizer.
1a04d8b [Reynold Xin] [SPARK-5643][SQL] Add a show method to print the content of a DataFrame in columnar format.
When the `GetField` chain(`a.b.c.d.....`) is interrupted by `GetItem` like `a.b[0].c.d....`, then the check of ambiguous reference to fields is broken.
The reason is that: for something like `a.b[0].c.d`, we first parse it to `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")`. Then in `LogicalPlan#resolve`, we resolve `"a.b"` and build a `GetField` chain from bottom(the relation). But for the 2 outer `GetFiled`, we have to resolve them in `Analyzer` or do it in `GetField` lazily, check data type of child, search needed field, etc. which is similar to what we have done in `LogicalPlan#resolve`.
So in this PR, the fix is just copy the same logic in `LogicalPlan#resolve` to `Analyzer`, which is simple and quick, but I do suggest introduce `UnresolvedGetFiled` like I explained in https://github.com/apache/spark/pull/2405.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#4068 from cloud-fan/simple and squashes the following commits:
a6857b5 [Wenchen Fan] fix import order
8411c40 [Wenchen Fan] use UnresolvedGetField
This PR adds a rule to Analyzer that will add preinsert data type casting and field renaming to the select clause in an `INSERT INTO/OVERWRITE` statement. Also, with the change of this PR, we always invalidate our in memory data cache after inserting into a BaseRelation.
cc marmbrus liancheng
Author: Yin Huai <yhuai@databricks.com>
Closes#4373 from yhuai/insertFollowUp and squashes the following commits:
08237a7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertFollowUp
316542e [Yin Huai] Doc update.
c9ccfeb [Yin Huai] Revert a unnecessary change.
84aecc4 [Yin Huai] Address comments.
1951fe1 [Yin Huai] Merge remote-tracking branch 'upstream/master'
c18da34 [Yin Huai] Invalidate cache after insert.
727f21a [Yin Huai] Preinsert casting and renaming.
Make below code works.
```
sql("DESCRIBE test").registerTempTable("describeTest")
sql("SELECT * FROM describeTest").collect()
```
Author: OopsOutOfMemory <victorshengli@126.com>
Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com>
Closes#4249 from OopsOutOfMemory/desc_query and squashes the following commits:
6fee13d [OopsOutOfMemory] up-to-date
e71430a [Sheng, Li] Update HiveOperatorQueryableSuite.scala
3ba1058 [OopsOutOfMemory] change to default argument
aac7226 [OopsOutOfMemory] Merge branch 'master' into desc_query
68eb6dd [OopsOutOfMemory] Merge branch 'desc_query' of github.com:OopsOutOfMemory/spark into desc_query
354ad71 [OopsOutOfMemory] query describe command
d541a35 [OopsOutOfMemory] refine test suite
e1da481 [OopsOutOfMemory] refine test suite
a780539 [OopsOutOfMemory] Merge branch 'desc_query' of github.com:OopsOutOfMemory/spark into desc_query
0015f82 [OopsOutOfMemory] code style
dd0aaef [OopsOutOfMemory] code style
c7d606d [OopsOutOfMemory] rename test suite
75f2342 [OopsOutOfMemory] refine code and test suite
f942c9b [OopsOutOfMemory] initial
11559ae [OopsOutOfMemory] code style
c5fdecf [OopsOutOfMemory] code style
aeaea5f [OopsOutOfMemory] rename test suite
ac2c3bb [OopsOutOfMemory] refine code and test suite
544573e [OopsOutOfMemory] initial
Author: Reynold Xin <rxin@databricks.com>
Closes#4410 from rxin/df-renameCol and squashes the following commits:
a6a796e [Reynold Xin] [SPARK-5639][SQL] Support DataFrame.renameColumn.
Author: Reynold Xin <rxin@databricks.com>
Closes#4408 from rxin/df-config-eager and squashes the following commits:
c0204cf [Reynold Xin] [SPARK-5638][SQL] Add a config flag to disable eager analysis of DataFrames.
This PR adds three major improvements to Parquet data source:
1. Partition discovery
While reading Parquet files resides in Hive style partition directories, `ParquetRelation2` automatically discovers partitioning information and infers partition column types.
This is also a partial work for [SPARK-5182] [1], which aims to provide first class partitioning support for the data source API. Related code in this PR can be easily extracted to the data source API level in future versions.
1. Schema merging
When enabled, Parquet data source collects schema information from all Parquet part-files and tries to merge them. Exceptions are thrown when incompatible schemas are detected. This feature is controlled by data source option `parquet.mergeSchema`, and is enabled by default.
1. Metastore Parquet table conversion moved to analysis phase
This greatly simplifies the conversion logic. `ParquetConversion` strategy can be removed once the old Parquet implementation is removed in the future.
This version of Parquet data source aims to entirely replace the old Parquet implementation. However, the old version hasn't been removed yet. Users can fall back to the old version by turning off SQL configuration `spark.sql.parquet.useDataSourceApi`.
Other JIRA tickets fixed as side effects in this PR:
- [SPARK-5509] [3]: `EqualTo` now uses a proper `Ordering` to compare binary types.
- [SPARK-3575] [4]: Metastore schema is now preserved and passed to `ParquetRelation2` via data source option `parquet.metastoreSchema`.
TODO:
- [ ] More test cases for partition discovery
- [x] Fix write path after data source write support (#4294) is merged
It turned out to be non-trivial to fall back to old Parquet implementation on the write path when Parquet data source is enabled. Since we're planning to include data source write support in 1.3.0, I simply ignored two test cases involving Parquet insertion for now.
- [ ] Fix outdated comments and documentations
PS: This PR looks big, but more than a half of the changed lines in this PR are trivial changes to test cases. To test Parquet with and without the new data source, almost all Parquet test cases are moved into wrapper driver functions. This introduces hundreds of lines of changes.
[1]: https://issues.apache.org/jira/browse/SPARK-5182
[2]: https://issues.apache.org/jira/browse/SPARK-5528
[3]: https://issues.apache.org/jira/browse/SPARK-5509
[4]: https://issues.apache.org/jira/browse/SPARK-3575
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4308)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4308 from liancheng/parquet-partition-discovery and squashes the following commits:
b6946e6 [Cheng Lian] Fixes MiMA issues, addresses comments
8232e17 [Cheng Lian] Write support for Parquet data source
a49bd28 [Cheng Lian] Fixes spelling typo in trait name "CreateableRelationProvider"
808380f [Cheng Lian] Fixes issues introduced while rebasing
50dd8d1 [Cheng Lian] Addresses @rxin's comment, fixes UDT schema merging
adf2aae [Cheng Lian] Fixes compilation error introduced while rebasing
4e0175f [Cheng Lian] Fixes Python Parquet API, we need Py4J array to call varargs method
0d8ec1d [Cheng Lian] Adds more test cases
b35c8c6 [Cheng Lian] Fixes some typos and outdated comments
dd704fd [Cheng Lian] Fixes Python Parquet API
596c312 [Cheng Lian] Uses switch to control whether use Parquet data source or not
7d0f7a2 [Cheng Lian] Fixes Metastore Parquet table conversion
a1896c7 [Cheng Lian] Fixes all existing Parquet test suites except for ParquetMetastoreSuite
5654c9d [Cheng Lian] Draft version of Parquet partition discovery and schema merging
Hi, rxin marmbrus
I considered your suggestion (in #4127) and now re-write it. This is now up-to-date.
Could u please review it ?
Author: OopsOutOfMemory <victorshengli@126.com>
Closes#4227 from OopsOutOfMemory/describe and squashes the following commits:
053826f [OopsOutOfMemory] describe
1. Added methods to create DataFrames from Seq[Product]
2. Added executeTake to avoid running a Spark job on LocalRelations.
Author: Reynold Xin <rxin@databricks.com>
Closes#4372 from rxin/localDataFrame and squashes the following commits:
f696858 [Reynold Xin] style checker.
839ef7f [Reynold Xin] [SPARK-5602][SQL] Better support for creating DataFrame from local data collection.
1. Removed LocalHiveContext in Python.
2. Reduced DSL UDF support from 22 arguments to 10 arguments so JavaDoc/ScalaDoc look nicer.
Author: Reynold Xin <rxin@databricks.com>
Closes#4374 from rxin/df-style and squashes the following commits:
e493342 [Reynold Xin] [SQL][DataFrame] Minor cleanup.
...aised in SPARK-4520.
The exception is thrown only for a thrift generated parquet file. The array element schema name is assumed as "array" as per ParquetAvro but for thrift generated parquet files, it is array_name + "_tuple". This leads to missing child of array group type and hence when the parquet rows are being materialized leads to the exception.
Author: Sadhan Sood <sadhan@tellapart.com>
Closes#4148 from sadhan/SPARK-4520 and squashes the following commits:
c5ccde8 [Sadhan Sood] [SPARK-4520] [SQL] This pr fixes the ArrayIndexOutOfBoundsException as raised in SPARK-4520.
Author: Reynold Xin <rxin@databricks.com>
Closes#4376 from rxin/SPARK-5605 and squashes the following commits:
c55f5fa [Reynold Xin] Added a Python test.
f4b8dbb [Reynold Xin] [SPARK-5605][SQL][DF] Allow using String to specify colum name in DSL aggregate functions.
Right now the PR adds few helper methods for java apis. But the issue was opened mainly to get rid of transformations in java api like `.rdd` and `.toJavaRDD` while working with `SQLContext` or `HiveContext`.
Author: kul <kuldeep.bora@gmail.com>
Closes#4243 from kul/master and squashes the following commits:
2390fba [kul] [SPARK-5426][SQL] Add SparkSQL Java API helper methods.
```scala
df.selectExpr("abs(colA)", "colB")
df.filter("age > 21")
```
Author: Reynold Xin <rxin@databricks.com>
Closes#4348 from rxin/SPARK-5579 and squashes the following commits:
2baeef2 [Reynold Xin] Fix Python.
b416372 [Reynold Xin] [SPARK-5579][SQL][DataFrame] Support for project/filter using SQL expressions.
The previous #3732 is reverted due to some test failure.
Have fixed that.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#4325 from adrian-wang/datenative and squashes the following commits:
096e20d [Daoyuan Wang] fix for mixed timezone
0ed0fdc [Daoyuan Wang] fix test data
a2fdd4e [Daoyuan Wang] getDate
c37832b [Daoyuan Wang] row to catalyst
f0005b1 [Daoyuan Wang] add date in sql parser and java type conversion
024c9a6 [Daoyuan Wang] clean some import order
d6715fc [Daoyuan Wang] refactoring Date as Primitive Int internally
374abd5 [Daoyuan Wang] spark native date type support
SQL in HiveContext, should be case insensitive, however, the following query will fail.
```scala
udf.register("random0", () => { Math.random()})
assert(sql("SELECT RANDOM0() FROM src LIMIT 1").head().getDouble(0) >= 0.0)
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes#4326 from chenghao-intel/udf_case_sensitive and squashes the following commits:
485cf66 [Cheng Hao] Support the case insensitive for UDF
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3820 from adrian-wang/parquettimestamp and squashes the following commits:
b1e2a0d [Daoyuan Wang] fix for nanos
4dadef1 [Daoyuan Wang] fix wrong read
93f438d [Daoyuan Wang] parquet timestamp support
1. Added Java-friendly version of the expression operators (i.e. gt, geq)
2. Added JavaDoc for most operators
3. Simplified expression operators by having only one version of the function (that accepts Any). Previously we had two methods for each expression operator, one accepting Any and another accepting Column.
4. agg function now accepts varargs of (String, String).
Author: Reynold Xin <rxin@databricks.com>
Closes#4332 from rxin/df-update and squashes the following commits:
ab0aa69 [Reynold Xin] Added Java friendly expression methods. Added JavaDoc. For each expression operator, have only one version of the function (that accepts Any). Previously we had two methods for each expression operator, one accepting Any and another accepting Column.
576d07a [Reynold Xin] random commit.
Author: Reynold Xin <rxin@databricks.com>
Closes#4327 from rxin/schemarddTypeAlias and squashes the following commits:
e5a8ff3 [Reynold Xin] [SPARK-5551][SQL] Create type alias for SchemaRDD for source backward compatibility
They were there mostly for code review and easier check of the API. I don't think they need to be there anymore.
Author: Reynold Xin <rxin@databricks.com>
Closes#4328 from rxin/remove-df-api and squashes the following commits:
723d600 [Reynold Xin] [SQL][DataFrame] Remove DataFrameApi and ColumnApi.
This PR aims to support `INSERT INTO/OVERWRITE TABLE tableName` and `CREATE TABLE tableName AS SELECT` for the data source API (partitioned tables are not supported).
In this PR, I am also adding the support of `IF NOT EXISTS` for our ddl parser. The current semantic of `IF NOT EXISTS` is explained as follows.
* For a `CREATE TEMPORARY TABLE` statement, it does not `IF NOT EXISTS` for now.
* For a `CREATE TABLE` statement (we are creating a metastore table), if there is an existing table having the same name ...
* when `IF NOT EXISTS` clause is used, we will do nothing.
* when `IF NOT EXISTS` clause is not used, the user will see an exception saying the table already exists.
TODOs:
- [x] CTAS support
- [x] Programmatic APIs
- [ ] Python API (another PR)
- [x] More unit tests
- [ ] Documents (another PR)
marmbrus liancheng rxin
Author: Yin Huai <yhuai@databricks.com>
Closes#4294 from yhuai/writeSupport and squashes the following commits:
3db1539 [Yin Huai] save does not take overwrite.
1c98881 [Yin Huai] Fix test.
142372a [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupport
34e1bfb [Yin Huai] Address comments.
1682ca6 [Yin Huai] Better support for CTAS statements.
e789d64 [Yin Huai] For the Scala API, let users to use tuples to provide options.
0128065 [Yin Huai] Short hand versions of save and load.
66ebd74 [Yin Huai] Formatting.
9203ec2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupport
e5d29f2 [Yin Huai] Programmatic APIs.
1a719a5 [Yin Huai] CREATE TEMPORARY TABLE with IF NOT EXISTS is not allowed for now.
909924f [Yin Huai] Add saveAsTable for the data source API to DataFrame.
95a7c71 [Yin Huai] Fix bug when handling IF NOT EXISTS clause in a CREATE TEMPORARY TABLE statement.
d37b19c [Yin Huai] Cheng's comments.
fd6758c [Yin Huai] Use BeforeAndAfterAll.
7880891 [Yin Huai] Support CREATE TABLE AS SELECT STATEMENT and the IF NOT EXISTS clause.
cb85b05 [Yin Huai] Initial write support.
2f91354 [Yin Huai] Make INSERT OVERWRITE/INTO statements consistent between HiveQL and SqlParser.
This pull request contains a Spark SQL data source that can pull data from, and can put data into, a JDBC database.
I have tested both read and write support with H2, MySQL, and Postgres. It would surprise me if both read and write support worked flawlessly out-of-the-box for any other database; different databases have different names for different JDBC data types and different meanings for SQL types with the same name. However, this code is designed (see `DriverQuirks.scala`) to make it *relatively* painless to add support for another database by augmenting the type mapping contained in this PR.
Author: Tor Myklebust <tmyklebu@gmail.com>
Closes#4261 from tmyklebu/master and squashes the following commits:
cf167ce [Tor Myklebust] Work around other Java tests ruining TestSQLContext.
67893bf [Tor Myklebust] Move the jdbcRDD methods into SQLContext itself.
585f95b [Tor Myklebust] Dependencies go into the project's pom.xml.
829d5ba [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
41647ef [Tor Myklebust] Hide a couple things that don't need to be public.
7318aea [Tor Myklebust] Fix scalastyle warnings.
a09eeac [Tor Myklebust] JDBC data source for Spark SQL.
176bb98 [Tor Myklebust] Add test deps for JDBC support.
1. Throw UnsupportedOperationException if a Column is not computable.
2. Perform eager analysis on DataFrame so we can catch errors when they happen (not when an action is run).
Author: Reynold Xin <rxin@databricks.com>
Author: Davies Liu <davies@databricks.com>
Closes#4296 from rxin/col-computability and squashes the following commits:
6527b86 [Reynold Xin] Merge pull request #8 from davies/col-computability
fd92bc7 [Reynold Xin] Merge branch 'master' into col-computability
f79034c [Davies Liu] fix python tests
5afe1ff [Reynold Xin] Fix scala test.
17f6bae [Reynold Xin] Various fixes.
b932e86 [Reynold Xin] Added eager analysis for error reporting.
e6f00b8 [Reynold Xin] [SQL][API] ComputableColumn vs IncomputableColumn
Author: Reynold Xin <rxin@databricks.com>
Closes#4313 from rxin/SPARK-5514 and squashes the following commits:
e34e91b [Reynold Xin] [SPARK-5514] DataFrame.collect should call executeCollect
Store daysSinceEpoch as an Int value(4 bytes) to represent DateType, instead of using java.sql.Date(8 bytes as Long) in catalyst row. This ensures the same comparison behavior of Hive and Catalyst.
Subsumes #3381
I thinks there are already some tests in JavaSQLSuite, and for python it will not affect python's datetime class.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3732 from adrian-wang/datenative and squashes the following commits:
0ed0fdc [Daoyuan Wang] fix test data
a2fdd4e [Daoyuan Wang] getDate
c37832b [Daoyuan Wang] row to catalyst
f0005b1 [Daoyuan Wang] add date in sql parser and java type conversion
024c9a6 [Daoyuan Wang] clean some import order
d6715fc [Daoyuan Wang] refactoring Date as Primitive Int internally
374abd5 [Daoyuan Wang] spark native date type support
Not all Catalyst filter expressions can be converted to Parquet filter predicates. We should try to convert each individual predicate and then collect those convertible ones.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4255)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4255 from liancheng/spark-5465 and squashes the following commits:
14ccd37 [Cheng Lian] Fixes filter push-down for Parquet data source
Support `comment` in create a table field.
__CREATE TEMPORARY TABLE people(name string `comment` "the name of a person")__
Author: OopsOutOfMemory <victorshengli@126.com>
Closes#3999 from OopsOutOfMemory/meta_comment and squashes the following commits:
39150d4 [OopsOutOfMemory] add comment and refine test suite
Simplify some codes related to DataFrame.
* Calling `toAttributes` instead of a `map`.
* Original `createDataFrame` creates the `StructType` and its attributes in a redundant way. Refactored it to create `StructType` and call `toAttributes` on it directly.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4298 from viirya/refactor_df and squashes the following commits:
1d61c64 [Liang-Chi Hsieh] Revert it.
f36efb5 [Liang-Chi Hsieh] Relax the constraint of toDataFrame.
2c9f370 [Liang-Chi Hsieh] Just refactor DataFrame codes.
Author: kai <kaizeng@eecs.berkeley.edu>
Closes#4291 from kai-zeng/aggregate-fix and squashes the following commits:
78658ef [kai] remove redundant field "childOutput"
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#4250 from ueshin/issues/SPARK-5457 and squashes the following commits:
3c05e59 [Takuya UESHIN] Remove parameter to use default value of ApproxCountDistinct.
faea19d [Takuya UESHIN] Use overload instead of default value for Java support.
d1cca38 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-5457
663d43d [Takuya UESHIN] Add missing DSL for ApproxCountDistinct.
This PR makes Star a trait, and provides two implementations: UnresolvedStar (used for *, tblName.*) and ResolvedStar (used for df("*")).
Author: Reynold Xin <rxin@databricks.com>
Closes#4283 from rxin/df-star and squashes the following commits:
c9cba3e [Reynold Xin] Removed mapFunction in UnresolvedStar.
1a3a1d7 [Reynold Xin] [SQL] Support df("*") to select all columns in a data frame.
This patch changes DataFrame's `apply()` method to use an analyzed query plan when resolving column names. This fixes a bug where `apply` would throw "invalid call to qualifiers on unresolved object" errors when called on DataFrames constructed via `SQLContext.sql()`.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#4282 from JoshRosen/SPARK-5462 and squashes the following commits:
b9e6da2 [Josh Rosen] [SPARK-5462] Use analyzed query plan in DataFrame.apply().