Avoid `UnsupportedOperationException` from JsonRDD.inferSchema on empty RDD.
Not sure if this is supposed to be an error (but a better one), but it seems like this case can come up if the input is down-sampled so much that nothing is sampled.
Now stuff like this:
```
sqlContext.jsonRDD(sc.parallelize(List[String]()))
```
just results in
```
org.apache.spark.sql.DataFrame = []
```
Author: Sean Owen <sowen@cloudera.com>
Closes#4971 from srowen/SPARK-6245 and squashes the following commits:
3699964 [Sean Owen] Set() -> Set.empty
3c619e1 [Sean Owen] Avoid UnsupportedOperationException from JsonRDD.inferSchema on empty RDD
Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
Author: Sean Owen <sowen@cloudera.com>
Closes#4950 from srowen/SPARK-6225 and squashes the following commits:
3080972 [Sean Owen] Ordered imports: Java, Scala, 3rd party, Spark
c67985b [Sean Owen] Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
Author: Michael Armbrust <michael@databricks.com>
Closes#4920 from marmbrus/openStrategies and squashes the following commits:
cbc35c0 [Michael Armbrust] [SQL] Make Strategies a public developer API
jira: https://issues.apache.org/jira/browse/SPARK-6163
Author: Yin Huai <yhuai@databricks.com>
Closes#4896 from yhuai/SPARK-6163 and squashes the following commits:
45e023e [Yin Huai] Address @chenghao-intel's comment.
2e8734e [Yin Huai] Use JSON data source for jsonFile.
92a4a33 [Yin Huai] Test.
Based on #4904 with style errors fixed.
`LogicalPlan#resolve` will not only produce `Attribute`, but also "`GetField` chain".
So in `ResolveSortReferences`, after resolve the ordering expressions, we should not just collect the `Attribute` results, but also `Attribute` at the bottom of "`GetField` chain".
Author: Wenchen Fan <cloud0fan@outlook.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#4918 from marmbrus/pr/4904 and squashes the following commits:
997f84e [Michael Armbrust] fix style
3eedbfc [Wenchen Fan] fix 6145
Option 1 of 2: Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
Author: Sean Owen <sowen@cloudera.com>
Closes#4912 from srowen/SPARK-6182.1 and squashes the following commits:
eff60de [Sean Owen] Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
Integration test suites in the JDBC data source (`MySQLIntegration` and `PostgresIntegration`) depend on docker-client 2.7.5, which transitively depends on Guava 17.0. Unfortunately, Guava 17.0 is causing test runtime binary compatibility issues when Spark is compiled against Hive 0.12.0, or Hadoop 2.4.
Considering `MySQLIntegration` and `PostgresIntegration` are ignored right now, I'd suggest moving them from the Spark project to the [Spark integration tests] [1] project. This PR removes both the JDBC data source integration tests and the docker-client test dependency.
[1]: |https://github.com/databricks/spark-integration-tests
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4872)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4872 from liancheng/remove-docker-client and squashes the following commits:
1f4169e [Cheng Lian] Removes DockerHacks
159b24a [Cheng Lian] Removed JDBC integration tests which depends on docker-client
- Various Fixes to docs
- Make data source traits actually interfaces
Based on #4862 but with fixed conflicts.
Author: Reynold Xin <rxin@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#4868 from marmbrus/pr/4862 and squashes the following commits:
fe091ea [Michael Armbrust] Merge remote-tracking branch 'origin/master' into pr/4862
0208497 [Reynold Xin] Test fixes.
34e0a28 [Reynold Xin] [SPARK-5310][SQL] Various fixes to Spark SQL docs.
This PR contains the following changes:
1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is the middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it does `equalsIgnoreNullability` as well as if the nullability of `from` is compatible with that of `to`. For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However, the nullability of `ArrayType(IntegerType, containsNull = true)` is incompatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values).
2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` to replace the equality check of the data types.
3. For our data source write path, when appending data, we always use the schema of existing table to write the data. This is important for parquet, since nullability direct impacts the way to encode/decode values. If we do not do this, we may see corrupted values when reading values from a set of parquet files generated with different nullability settings.
4. When generating a new parquet table, we always set nullable/containsNull/valueContainsNull to true. So, we will not face situations that we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust.
5. Update the equality check of JSON relation. Since JSON does not really cares nullability, `equalsIgnoreNullability` seems a better choice to compare schemata from to JSON tables.
JIRA: https://issues.apache.org/jira/browse/SPARK-5950
Thanks viirya for the initial work in #4729.
cc marmbrus liancheng
Author: Yin Huai <yhuai@databricks.com>
Closes#4826 from yhuai/insertNullabilityCheck and squashes the following commits:
3b61a04 [Yin Huai] Revert change on equals.
80e487e [Yin Huai] asNullable in UDT.
587d88b [Yin Huai] Make methods private.
0cb7ea2 [Yin Huai] marmbrus's comments.
3cec464 [Yin Huai] Cheng's comments.
486ed08 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
d3747d1 [Yin Huai] Remove unnecessary change.
8360817 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
8a3f237 [Yin Huai] Use equalsIgnoreNullability instead of equality check.
0eb5578 [Yin Huai] Fix tests.
f6ed813 [Yin Huai] Update old parquet path.
e4f397c [Yin Huai] Unit tests.
b2c06f8 [Yin Huai] Ignore nullability in JSON relation's equality check.
8bd008b [Yin Huai] nullable, containsNull, and valueContainsNull will be always true for parquet data.
bf50d73 [Yin Huai] When appending data, we use the schema of the existing table instead of the schema of the new data.
0a703e7 [Yin Huai] Test failed again since we cannot read correct content.
9a26611 [Yin Huai] Make InsertIntoTable happy.
8f19fe5 [Yin Huai] equalsIgnoreCompatibleNullability
4ec17fd [Yin Huai] Failed test.
Constructs like Hive `TRANSFORM` may generate malformed rows (via badly authored external scripts for example). I'm a bit hesitant to have this feature, since it introduces per-tuple cost when caching tables. However, considering caching tables is usually a one-time cost, this is probably worth having.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4842)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4842 from liancheng/spark-6082 and squashes the following commits:
b05dbff [Cheng Lian] Provides better error message for malformed rows when caching tables
The API signatire for join requires the JoinType to be the third parameter. The code examples provided for join show JoinType being provided as the 2nd parater resuling in errors (i.e. "df1.join(df2, "outer", $"df1Key" === $"df2Key") ). The correct sample code is df1.join(df2, $"df1Key" === $"df2Key", "outer")
Author: Paul Power <paul.power@peerside.com>
Closes#4847 from peerside/master and squashes the following commits:
ebc1efa [Paul Power] Merge pull request #1 from peerside/peerside-patch-1
e353340 [Paul Power] Updated comments use correct sample code for Dataframe joins
Always set `containsNull = true` when infer the schema of JSON datasets. If we set `containsNull` based on records we scanned, we may miss arrays with null values when we do sampling. Also, because future data can have arrays with null values, if we convert JSON data to parquet, always setting `containsNull = true` is a more robust way to go.
JIRA: https://issues.apache.org/jira/browse/SPARK-6052
Author: Yin Huai <yhuai@databricks.com>
Closes#4806 from yhuai/jsonArrayContainsNull and squashes the following commits:
05eab9d [Yin Huai] Change containsNull to true.
This is needed for the SQL bindings to work on Yarn.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#4822 from vanzin/SPARK-6074 and squashes the following commits:
fb52001 [Marcelo Vanzin] [SPARK-6074] [sql] Package pyspark sql bindings.
JIRA: https://issues.apache.org/jira/browse/SPARK-6024
Author: Yin Huai <yhuai@databricks.com>
Closes#4795 from yhuai/wideSchema and squashes the following commits:
4882e6f [Yin Huai] Address comments.
73e71b4 [Yin Huai] Address comments.
143927a [Yin Huai] Simplify code.
cc1d472 [Yin Huai] Make the schema wider.
12bacae [Yin Huai] If the JSON string of a schema is too large, split it before storing it in metastore.
e9b4f70 [Yin Huai] Failed test.
`FilteringParquetRowInputFormat` manually merges Parquet schemas before computing splits. However, it is duplicate because the schemas are already merged in `ParquetRelation2`. We don't need to re-merge them at `InputFormat`.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4786 from viirya/dup_parquet_schemas_merge and squashes the following commits:
ef78a5a [Liang-Chi Hsieh] Avoiding duplicate Parquet schema merging.
It is useful to let the user decide the number of rows to show in DataFrame.show
Author: Jacky Li <jacky.likun@huawei.com>
Closes#4767 from jackylk/show and squashes the following commits:
a0e0f4b [Jacky Li] fix testcase
7cdbe91 [Jacky Li] modify according to comment
bb54537 [Jacky Li] for Java compatibility
d7acc18 [Jacky Li] modify according to comments
981be52 [Jacky Li] add numRows param in DataFrame.show()
Please see JIRA (https://issues.apache.org/jira/browse/SPARK-6016) for details of the bug.
Author: Yin Huai <yhuai@databricks.com>
Closes#4775 from yhuai/parquetFooterCache and squashes the following commits:
78787b1 [Yin Huai] Remove footerCache in FilteringParquetRowInputFormat.
dff6fba [Yin Huai] Failed unit test.
DataFrame.explain return wrong result when the query is DDL command.
For example, the following two queries should print out the same execution plan, but it not.
sql("create table tb as select * from src where key > 490").explain(true)
sql("explain extended create table tb as select * from src where key > 490")
This is because DataFrame.explain leverage logicalPlan which had been forced executed, we should use the unexecuted plan queryExecution.logical.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#4707 from yanboliang/spark-5926 and squashes the following commits:
fa6db63 [Yanbo Liang] logicalPlan is not lazy
0e40a1b [Yanbo Liang] make DataFrame.explain leverage queryExecution.logical
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4760 from viirya/dup_literal and squashes the following commits:
06e7516 [Liang-Chi Hsieh] Remove duplicate Literal matching block.
`ReadContext.init` calls `InitContext.getMergedKeyValueMetadata`, which doesn't know how to merge conflicting user defined key-value metadata and throws exception. In our case, when dealing with different but compatible schemas, we have different Spark SQL schema JSON strings in different Parquet part-files, thus causes this problem. Reading similar Parquet files generated by Hive doesn't suffer from this issue.
In this PR, we manually merge the schemas before passing it to `ReadContext` to avoid the exception.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4768)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4768 from liancheng/spark-6010 and squashes the following commits:
9002f0a [Cheng Lian] Fixes SPARK-6010
Author: Michael Armbrust <michael@databricks.com>
Closes#4757 from marmbrus/udtConversions and squashes the following commits:
3714aad [Michael Armbrust] [SPARK-5996][SQL] Fix specialized outbound conversions
Also added desc/asc function for constructing sorting expressions more conveniently. And added a small fix to lift alias out of cast expression.
Author: Reynold Xin <rxin@databricks.com>
Closes#4752 from rxin/SPARK-5985 and squashes the following commits:
aeda5ae [Reynold Xin] Added Experimental flag to ColumnName.
047ad03 [Reynold Xin] Lift alias out of cast.
c9cf17c [Reynold Xin] [SPARK-5985][SQL] DataFrame sortBy -> orderBy in Python.
Added a new test suite to make sure Java DF programs can use varargs properly.
Also moved all suites into test.org.apache.spark package to make sure the suites also test for method visibility.
Author: Reynold Xin <rxin@databricks.com>
Closes#4751 from rxin/df-tests and squashes the following commits:
1e8b8e4 [Reynold Xin] Fixed imports and renamed JavaAPISuite.
a6ca53b [Reynold Xin] [SPARK-5904][SQL] DataFrame Java API test suites.
Author: Michael Armbrust <michael@databricks.com>
Closes#4738 from marmbrus/udtRepart and squashes the following commits:
c06d7b5 [Michael Armbrust] fix compilation
91c8829 [Michael Armbrust] [SQL][SPARK-5532] Repartition should not use external rdd representation
Author: Michael Armbrust <michael@databricks.com>
Closes#4736 from marmbrus/asExprs and squashes the following commits:
5ba97e4 [Michael Armbrust] [SPARK-5910][SQL] Support for as in selectExpr
1. Column is no longer a DataFrame to simplify class hierarchy.
2. Don't use varargs on abstract methods (see Scala compiler bug SI-9013).
Author: Reynold Xin <rxin@databricks.com>
Closes#4686 from rxin/SPARK-5904 and squashes the following commits:
fd9b199 [Reynold Xin] Fixed Python tests.
df25cef [Reynold Xin] Non final.
5221530 [Reynold Xin] [SPARK-5904][SQL] DataFrame API fixes.
The `int` is 64-bit on 64-bit machine (very common now), we should infer it as LongType for it in Spark SQL.
Also, LongType in SQL will come back as `int`.
Author: Davies Liu <davies@databricks.com>
Closes#4666 from davies/long and squashes the following commits:
6bc6cc4 [Davies Liu] infer int as LongType
Also added test cases for checking the serializability of HiveContext and SQLContext.
Author: Reynold Xin <rxin@databricks.com>
Closes#4628 from rxin/SPARK-5840 and squashes the following commits:
ecb3bcd [Reynold Xin] test cases and reviews.
55eb822 [Reynold Xin] [SPARK-5840][SQL] HiveContext cannot be serialized due to tuple extraction.
This pull request replaces calls to deprecated methods from `java.util.Date` with near-equivalents in `java.util.Calendar`.
Author: Tor Myklebust <tmyklebu@gmail.com>
Closes#4668 from tmyklebu/master and squashes the following commits:
66215b1 [Tor Myklebust] Use GregorianCalendar instead of Timestamp get methods.
Although we've migrated to the DataFrame API, lots of code still uses `rdd` or `srdd` as local variable names. This PR tries to address these naming inconsistencies and some other minor DataFrame related style issues.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4670)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4670 from liancheng/df-cleanup and squashes the following commits:
3e14448 [Cheng Lian] Cleans up DataFrame variable names and toDF() calls
The problem is that after we create an empty hive metastore parquet table (e.g. `CREATE TABLE test (a int) STORED AS PARQUET`), Hive will create an empty dir for us, which cause our data source `ParquetRelation2` fail to get the schema of the table. See JIRA for the case to reproduce the bug and the exception.
This PR is based on #4562 from chenghao-intel.
JIRA: https://issues.apache.org/jira/browse/SPARK-5852
Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Hao <hao.cheng@intel.com>
Closes#4655 from yhuai/CTASParquet and squashes the following commits:
b8b3450 [Yin Huai] Update tests.
2ac94f7 [Yin Huai] Update tests.
3db3d20 [Yin Huai] Minor update.
d7e2308 [Yin Huai] Revert changes in HiveMetastoreCatalog.scala.
36978d1 [Cheng Hao] Update the code as feedback
a04930b [Cheng Hao] fix bug of scan an empty parquet based table
442ffe0 [Cheng Hao] passdown the schema for Parquet File in HiveContext
Author: Michael Armbrust <michael@databricks.com>
Closes#4657 from marmbrus/pythonUdfs and squashes the following commits:
a7823a8 [Michael Armbrust] [SPARK-5868][SQL] Fix python UDFs in HiveContext and checks in SQLContext
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4649 from viirya/use_checkpath and squashes the following commits:
0f9a1a1 [Liang-Chi Hsieh] Use same function to check path parameter.
Author: Reynold Xin <rxin@databricks.com>
Closes#4640 from rxin/SPARK-5853 and squashes the following commits:
9c6f569 [Reynold Xin] [SPARK-5853][SQL] Schema support in Row.
Added a bunch of tags.
Also changed parquetFile to take varargs rather than a string followed by varargs.
Author: Reynold Xin <rxin@databricks.com>
Closes#4636 from rxin/df-doc and squashes the following commits:
651f80c [Reynold Xin] Fixed parquetFile in PySpark.
8dc3024 [Reynold Xin] [SQL] Various DataFrame doc changes.
This PR adds a `ShowTablesCommand` to support `SHOW TABLES [IN databaseName]` SQL command. The result of `SHOW TABLE` has two columns, `tableName` and `isTemporary`. For temporary tables, the value of `isTemporary` column will be `false`.
JIRA: https://issues.apache.org/jira/browse/SPARK-4865
Author: Yin Huai <yhuai@databricks.com>
Closes#4618 from yhuai/showTablesCommand and squashes the following commits:
0c09791 [Yin Huai] Use ShowTablesCommand.
85ee76d [Yin Huai] Since SHOW TABLES is not a Hive native command any more and we will not see "OK" (originally generated by Hive's driver), use SHOW DATABASES in the test.
94bacac [Yin Huai] Add SHOW TABLES to the list of noExplainCommands.
d71ed09 [Yin Huai] Fix test.
a4a6ec3 [Yin Huai] Add SHOW TABLE command.
Existing implementation of arithmetic operators and BinaryComparison operators have redundant type checking codes, e.g.:
Expression.n2 is used by Add/Subtract/Multiply.
(1) n2 always checks left.dataType == right.dataType. However, this checking should be done once when we resolve expression types;
(2) n2 requires dataType is a NumericType. This can be done once.
This PR optimizes arithmetic and predicate operators by removing such redundant type-checking codes.
Some preliminary benchmarking on 10G TPC-H data over 5 r3.2xlarge EC2 machines shows that this PR can reduce the query time by 5.5% to 11%.
The benchmark queries follow the template below, where OP is plus/minus/times/divide/remainder/bitwise and/bitwise or/bitwise xor.
SELECT l_returnflag, l_linestatus, SUM(l_quantity OP cnt1), SUM(l_quantity OP cnt2), ...., SUM(l_quantity OP cnt700)
FROM (
SELECT l_returnflag, l_linestatus, l_quantity, 1 AS cnt1, 2 AS cnt2, ..., 700 AS cnt700
FROM lineitem
WHERE l_shipdate <= '1998-09-01'
)
GROUP BY l_returnflag, l_linestatus;
Author: kai <kaizeng@eecs.berkeley.edu>
Closes#4472 from kai-zeng/arithmetic-optimize and squashes the following commits:
fef0cf1 [kai] Merge branch 'master' of github.com:apache/spark into arithmetic-optimize
4b3a1bb [kai] chmod a-x
5a41e49 [kai] chmod a-x Expression.scala
cb37c94 [kai] rebase onto spark master
7f6e968 [kai] chmod 100755 -> 100644
6cddb46 [kai] format
7490dbc [kai] fix unresolved-expression exception for EqualTo
9c40bc0 [kai] fix bitwisenot
3cbd363 [kai] clean up test code
ca47801 [kai] override evalInternal for bitwise ops
8fa84a1 [kai] add bitwise or and xor
6892fc4 [kai] revert override evalInternal
f8eba24 [kai] override evalInternal
31ccdd4 [kai] rewrite all bitwise op and remove evalInternal
86297e2 [kai] generalized
cb92ae1 [kai] bitwise-and: override eval
97a7d6c [kai] bitwise-and: override evalInternal using and func
0906c39 [kai] add bitwise test
62abbbc [kai] clean up predicate and arithmetic
b34d58d [kai] add caching and benmark option
12c5b32 [kai] override eval
1cd7571 [kai] fix sqrt and maxof
03fd0c3 [kai] fix predicate
16fd84c [kai] optimize + - * / % -(unary) abs < > <= >=
fd95823 [kai] remove unnecessary type checking
24d062f [kai] test suite
JIRA: https://issues.apache.org/jira/browse/SPARK-5839
Author: Yin Huai <yhuai@databricks.com>
Closes#4626 from yhuai/SPARK-5839 and squashes the following commits:
f779d85 [Yin Huai] Use subqeury to wrap replaced ParquetRelation.
2695f13 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-5839
f1ba6ca [Yin Huai] Address comment.
2c7fa08 [Yin Huai] Use Subqueries to wrap a data source table.
JIRA: https://issues.apache.org/jira/browse/SPARK-5746
liancheng marmbrus
Author: Yin Huai <yhuai@databricks.com>
Closes#4617 from yhuai/insertOverwrite and squashes the following commits:
8e3019d [Yin Huai] Fix compilation error.
499e8e7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertOverwrite
e76e85a [Yin Huai] Address comments.
ac31b3c [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertOverwrite
f30bdad [Yin Huai] Use toDF.
99da57e [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertOverwrite
6b7545c [Yin Huai] Add a pre write check to the data source API.
a88c516 [Yin Huai] DDLParser will take a parsering function to take care CTAS statements.
Lifts `HiveMetastoreCatalog.refreshTable` to `Catalog`. Adds `RefreshTable` command to refresh (possibly cached) metadata in external data sources tables.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4624)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4624 from liancheng/refresh-table and squashes the following commits:
8d1aa4c [Cheng Lian] Adds REFRESH TABLE command
This PR adds the following filter types for data sources API:
- `IsNull`
- `IsNotNull`
- `Not`
- `And`
- `Or`
The code which converts Catalyst predicate expressions to data sources filters is very similar to filter conversion logics in `ParquetFilters` which converts Catalyst predicates to Parquet filter predicates. In this way we can support nested AND/OR/NOT predicates without changing current `BaseScan` type hierarchy.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4623)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>
Closes#4623 from liancheng/more-fiters and squashes the following commits:
1b296f4 [Cheng Lian] Add more filter types for data sources API
This PR migrates the Parquet data source to the new data source write support API. Now users can also overwriting and appending to existing tables. Notice that inserting into partitioned tables is not supported yet.
When Parquet data source is enabled, insertion to Hive Metastore Parquet tables is also fullfilled by the Parquet data source. This is done by the newly introduced `HiveMetastoreCatalog.ParquetConversions` rule, which is a "proper" implementation of the original hacky `HiveStrategies.ParquetConversion`. The latter is still preserved, and can be removed together with the old Parquet support in the future.
TODO:
- [x] Update outdated comments in `newParquet.scala`.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4563)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4563 from liancheng/parquet-refining and squashes the following commits:
fa98d27 [Cheng Lian] Fixes test cases which should disable off Parquet data source
2476e82 [Cheng Lian] Fixes compilation error introduced during rebasing
a83d290 [Cheng Lian] Passes Hive Metastore partitioning information to ParquetRelation2
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4613)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4613 from liancheng/df-implicit-rename and squashes the following commits:
db8bdd3 [Cheng Lian] Renames stringRddToDataFrame to stringRddToDataFrameHolder for consistency
If one tries an example by using copy&paste, throw an exception.
Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
Closes#4615 from maropu/AddMissingImportInSqlContext and squashes the following commits:
ab21b66 [Takeshi Yamamuro] Add missing import in the example of SqlContext
Author: Yin Huai <yhuai@databricks.com>
Closes#4582 from yhuai/jsonErrorMessage and squashes the following commits:
152dbd4 [Yin Huai] Update error message.
1466256 [Yin Huai] Throw a better error message when a JSON object in the input dataset span multiple records (lines for files or strings for an RDD of strings).
Author: Yin Huai <yhuai@databricks.com>
Closes#4542 from yhuai/moveSaveMode and squashes the following commits:
65a4425 [Yin Huai] Move SaveMode to sql package.
Author: Yin Huai <yhuai@databricks.com>
Closes#4544 from yhuai/jsonUseLongTypeByDefault and squashes the following commits:
6e2ffc2 [Yin Huai] Use LongType as the default type for integers in JSON schema inference.
Author: Michael Armbrust <michael@databricks.com>
Author: wangfei <wangfei1@huawei.com>
Closes#4558 from marmbrus/errorMessages and squashes the following commits:
5e5ab50 [Michael Armbrust] Merge pull request #15 from scwf/errorMessages
fa38881 [wangfei] fix for grouping__id
f279a71 [wangfei] make right references for ScriptTransformation
d29fbde [Michael Armbrust] extra case
1a797b4 [Michael Armbrust] comments
d4e9015 [Michael Armbrust] add comment
af9e668 [Michael Armbrust] no braces
34eb3a4 [Michael Armbrust] more work
6197cd5 [Michael Armbrust] [SQL] Better error messages for analysis failures
Eases use in the spark-shell.
Author: Michael Armbrust <michael@databricks.com>
Closes#4545 from marmbrus/serialization and squashes the following commits:
04748e6 [Michael Armbrust] @scala.annotation.varargs
b36e219 [Michael Armbrust] moreFixes
- Removed DataFrame.apply for projection & filtering since they are extremely confusing.
- Added implicits for RDD[Int], RDD[Long], and RDD[String]
Author: Reynold Xin <rxin@databricks.com>
Closes#4543 from rxin/df-cleanup and squashes the following commits:
81ec915 [Reynold Xin] [SQL] More DataFrame fixes.
Also I fix a bunch of bad output in test cases.
Author: Michael Armbrust <michael@databricks.com>
Closes#4520 from marmbrus/selfJoin and squashes the following commits:
4f4a85c [Michael Armbrust] comments
49c8e26 [Michael Armbrust] fix tests
6fc38de [Michael Armbrust] fix style
55d64b3 [Michael Armbrust] fix dataframe selfjoins
1. DataFrame.renameColumn
2. DataFrame.show() and _repr_
3. Use simpleString() rather than jsonValue in DataFrame.dtypes
4. createDataFrame from local Python data, including pandas.DataFrame
Author: Davies Liu <davies@databricks.com>
Closes#4528 from davies/df3 and squashes the following commits:
014acea [Davies Liu] fix typo
6ba526e [Davies Liu] fix tests
46f5f95 [Davies Liu] address comments
6cbc154 [Davies Liu] dataframe.show() and improve dtypes
6f94f25 [Davies Liu] create DataFrame from local Python data
Also took the chance to fixed up some style ...
Author: Reynold Xin <rxin@databricks.com>
Closes#4489 from rxin/SPARK-5702 and squashes the following commits:
74f42e3 [Reynold Xin] [SPARK-5702][SQL] Allow short names for built-in data sources.
Do not recursively strip out projects. Only strip the first level project.
```scala
df("colA") + df("colB").as("colC")
```
Previously, the above would construct an invalid plan.
Author: Reynold Xin <rxin@databricks.com>
Closes#4519 from rxin/computability and squashes the following commits:
87ff763 [Reynold Xin] Code review feedback.
015c4fc [Reynold Xin] [SQL][DataFrame] Fix column computability.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#4496 from chenghao-intel/df_explain and squashes the following commits:
552aa58 [Cheng Hao] Add explain support for DF
Deprecate inferSchema() and applySchema(), use createDataFrame() instead, which could take an optional `schema` to create an DataFrame from an RDD. The `schema` could be StructType or list of names of columns.
Author: Davies Liu <davies@databricks.com>
Closes#4498 from davies/create and squashes the following commits:
08469c1 [Davies Liu] remove Scala/Java API for now
c80a7a9 [Davies Liu] fix hive test
d1bd8f2 [Davies Liu] cleanup applySchema
9526e97 [Davies Liu] createDataFrame from RDD with columns
Author: Cheng Hao <hao.cheng@intel.com>
Closes#4468 from chenghao-intel/json and squashes the following commits:
aeb7801 [Cheng Hao] avoid multiple json generator created
Also start from the bottom so we show the first error instead of the top error.
Author: Michael Armbrust <michael@databricks.com>
Closes#4439 from marmbrus/analysisException and squashes the following commits:
45862a0 [Michael Armbrust] fix hive test
a773bba [Michael Armbrust] Merge remote-tracking branch 'origin/master' into analysisException
f88079f [Michael Armbrust] update more cases
fede90a [Michael Armbrust] newline
fbf4bc3 [Michael Armbrust] move to sql
6235db4 [Michael Armbrust] [SQL] Add an exception for analysis errors.
Users will not need to put `Options()` in a CREATE TABLE statement when there is not option provided.
Author: Yin Huai <yhuai@databricks.com>
Closes#4515 from yhuai/makeOptionsOptional and squashes the following commits:
1a898d3 [Yin Huai] Make options optional.
Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com>
Author: OopsOutOfMemory <victorshengli@126.com>
Closes#4508 from OopsOutOfMemory/cmt and squashes the following commits:
d8a68c6 [Sheng, Li] Update ddl.scala
f24aeaf [OopsOutOfMemory] correct style
Author: Michael Armbrust <michael@databricks.com>
Closes#4436 from marmbrus/dfToString and squashes the following commits:
8a3c35f [Michael Armbrust] Merge remote-tracking branch 'origin/master' into dfToString
b72a81b [Michael Armbrust] add toString
This is a completion of https://github.com/apache/spark/pull/4033 which was withdrawn for some reason.
Author: Sean Owen <sowen@cloudera.com>
Closes#4470 from srowen/SPARK-5239.2 and squashes the following commits:
2398bde [Sean Owen] Avoid use of JDBC4-only isClosed()
Author: Cheng Hao <hao.cheng@intel.com>
Closes#4494 from chenghao-intel/tiny_code_change and squashes the following commits:
450dfe7 [Cheng Hao] remove the duplicated code
~~The rule is simple: If you want `a.b` work, then `a` must be some level of nested array of struct(level 0 means just a StructType). And the result of `a.b` is same level of nested array of b-type.
An optimization is: the resolve chain looks like `Attribute -> GetItem -> GetField -> GetField ...`, so we could transmit the nested array information between `GetItem` and `GetField` to avoid repeated computation of `innerDataType` and `containsNullList` of that nested array.~~
marmbrus Could you take a look?
to evaluate `a.b`, if `a` is array of struct, then `a.b` means get field `b` on each element of `a`, and return a result of array.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#2405 from cloud-fan/nested-array-dot and squashes the following commits:
08a228a [Wenchen Fan] support dot notation on array of struct
- as with a `Symbol`
- distinct
- sqlContext.emptyDataFrame
- move add/remove col out of RDDApi section
Author: Michael Armbrust <michael@databricks.com>
Closes#4437 from marmbrus/dfMissingFuncs and squashes the following commits:
2004023 [Michael Armbrust] Add missing functions
Fix Scala code style.
Author: Hung Lin <hung@zoomdata.com>
Closes#4464 from hunglin/SPARK-5472 and squashes the following commits:
ef7a3b3 [Hung Lin] SPARK-5472: fix scala style
An example:
```
year month AVG('Adj Close) MAX('Adj Close)
1980 12 0.503218 0.595103
1981 01 0.523289 0.570307
1982 02 0.436504 0.475256
1983 03 0.410516 0.442194
1984 04 0.450090 0.483521
```
Author: Reynold Xin <rxin@databricks.com>
Closes#4416 from rxin/SPARK-5643 and squashes the following commits:
d0e0d6e [Reynold Xin] [SQL] Minor update to data source and statistics documentation.
269da83 [Reynold Xin] Updated isLocal comment.
2cf3c27 [Reynold Xin] Moved logic into optimizer.
1a04d8b [Reynold Xin] [SPARK-5643][SQL] Add a show method to print the content of a DataFrame in columnar format.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4440)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4440 from liancheng/parquet-oops and squashes the following commits:
f21ede4 [Cheng Lian] HiveParquetSuite was disabled by mistake, re-enable them.
Sometimes tests were failing due to the creation of multiple `SparkContext`s in a single JVM.
Author: Michael Armbrust <michael@databricks.com>
Closes#4441 from marmbrus/javaTests and squashes the following commits:
657b1e0 [Michael Armbrust] [SQL] Use TestSQLContext in Java tests
When the `GetField` chain(`a.b.c.d.....`) is interrupted by `GetItem` like `a.b[0].c.d....`, then the check of ambiguous reference to fields is broken.
The reason is that: for something like `a.b[0].c.d`, we first parse it to `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")`. Then in `LogicalPlan#resolve`, we resolve `"a.b"` and build a `GetField` chain from bottom(the relation). But for the 2 outer `GetFiled`, we have to resolve them in `Analyzer` or do it in `GetField` lazily, check data type of child, search needed field, etc. which is similar to what we have done in `LogicalPlan#resolve`.
So in this PR, the fix is just copy the same logic in `LogicalPlan#resolve` to `Analyzer`, which is simple and quick, but I do suggest introduce `UnresolvedGetFiled` like I explained in https://github.com/apache/spark/pull/2405.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#4068 from cloud-fan/simple and squashes the following commits:
a6857b5 [Wenchen Fan] fix import order
8411c40 [Wenchen Fan] use UnresolvedGetField
This PR adds a rule to Analyzer that will add preinsert data type casting and field renaming to the select clause in an `INSERT INTO/OVERWRITE` statement. Also, with the change of this PR, we always invalidate our in memory data cache after inserting into a BaseRelation.
cc marmbrus liancheng
Author: Yin Huai <yhuai@databricks.com>
Closes#4373 from yhuai/insertFollowUp and squashes the following commits:
08237a7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertFollowUp
316542e [Yin Huai] Doc update.
c9ccfeb [Yin Huai] Revert a unnecessary change.
84aecc4 [Yin Huai] Address comments.
1951fe1 [Yin Huai] Merge remote-tracking branch 'upstream/master'
c18da34 [Yin Huai] Invalidate cache after insert.
727f21a [Yin Huai] Preinsert casting and renaming.
Make below code works.
```
sql("DESCRIBE test").registerTempTable("describeTest")
sql("SELECT * FROM describeTest").collect()
```
Author: OopsOutOfMemory <victorshengli@126.com>
Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com>
Closes#4249 from OopsOutOfMemory/desc_query and squashes the following commits:
6fee13d [OopsOutOfMemory] up-to-date
e71430a [Sheng, Li] Update HiveOperatorQueryableSuite.scala
3ba1058 [OopsOutOfMemory] change to default argument
aac7226 [OopsOutOfMemory] Merge branch 'master' into desc_query
68eb6dd [OopsOutOfMemory] Merge branch 'desc_query' of github.com:OopsOutOfMemory/spark into desc_query
354ad71 [OopsOutOfMemory] query describe command
d541a35 [OopsOutOfMemory] refine test suite
e1da481 [OopsOutOfMemory] refine test suite
a780539 [OopsOutOfMemory] Merge branch 'desc_query' of github.com:OopsOutOfMemory/spark into desc_query
0015f82 [OopsOutOfMemory] code style
dd0aaef [OopsOutOfMemory] code style
c7d606d [OopsOutOfMemory] rename test suite
75f2342 [OopsOutOfMemory] refine code and test suite
f942c9b [OopsOutOfMemory] initial
11559ae [OopsOutOfMemory] code style
c5fdecf [OopsOutOfMemory] code style
aeaea5f [OopsOutOfMemory] rename test suite
ac2c3bb [OopsOutOfMemory] refine code and test suite
544573e [OopsOutOfMemory] initial
Author: Reynold Xin <rxin@databricks.com>
Closes#4410 from rxin/df-renameCol and squashes the following commits:
a6a796e [Reynold Xin] [SPARK-5639][SQL] Support DataFrame.renameColumn.
Author: Reynold Xin <rxin@databricks.com>
Closes#4408 from rxin/df-config-eager and squashes the following commits:
c0204cf [Reynold Xin] [SPARK-5638][SQL] Add a config flag to disable eager analysis of DataFrames.
This PR adds three major improvements to Parquet data source:
1. Partition discovery
While reading Parquet files resides in Hive style partition directories, `ParquetRelation2` automatically discovers partitioning information and infers partition column types.
This is also a partial work for [SPARK-5182] [1], which aims to provide first class partitioning support for the data source API. Related code in this PR can be easily extracted to the data source API level in future versions.
1. Schema merging
When enabled, Parquet data source collects schema information from all Parquet part-files and tries to merge them. Exceptions are thrown when incompatible schemas are detected. This feature is controlled by data source option `parquet.mergeSchema`, and is enabled by default.
1. Metastore Parquet table conversion moved to analysis phase
This greatly simplifies the conversion logic. `ParquetConversion` strategy can be removed once the old Parquet implementation is removed in the future.
This version of Parquet data source aims to entirely replace the old Parquet implementation. However, the old version hasn't been removed yet. Users can fall back to the old version by turning off SQL configuration `spark.sql.parquet.useDataSourceApi`.
Other JIRA tickets fixed as side effects in this PR:
- [SPARK-5509] [3]: `EqualTo` now uses a proper `Ordering` to compare binary types.
- [SPARK-3575] [4]: Metastore schema is now preserved and passed to `ParquetRelation2` via data source option `parquet.metastoreSchema`.
TODO:
- [ ] More test cases for partition discovery
- [x] Fix write path after data source write support (#4294) is merged
It turned out to be non-trivial to fall back to old Parquet implementation on the write path when Parquet data source is enabled. Since we're planning to include data source write support in 1.3.0, I simply ignored two test cases involving Parquet insertion for now.
- [ ] Fix outdated comments and documentations
PS: This PR looks big, but more than a half of the changed lines in this PR are trivial changes to test cases. To test Parquet with and without the new data source, almost all Parquet test cases are moved into wrapper driver functions. This introduces hundreds of lines of changes.
[1]: https://issues.apache.org/jira/browse/SPARK-5182
[2]: https://issues.apache.org/jira/browse/SPARK-5528
[3]: https://issues.apache.org/jira/browse/SPARK-5509
[4]: https://issues.apache.org/jira/browse/SPARK-3575
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4308)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4308 from liancheng/parquet-partition-discovery and squashes the following commits:
b6946e6 [Cheng Lian] Fixes MiMA issues, addresses comments
8232e17 [Cheng Lian] Write support for Parquet data source
a49bd28 [Cheng Lian] Fixes spelling typo in trait name "CreateableRelationProvider"
808380f [Cheng Lian] Fixes issues introduced while rebasing
50dd8d1 [Cheng Lian] Addresses @rxin's comment, fixes UDT schema merging
adf2aae [Cheng Lian] Fixes compilation error introduced while rebasing
4e0175f [Cheng Lian] Fixes Python Parquet API, we need Py4J array to call varargs method
0d8ec1d [Cheng Lian] Adds more test cases
b35c8c6 [Cheng Lian] Fixes some typos and outdated comments
dd704fd [Cheng Lian] Fixes Python Parquet API
596c312 [Cheng Lian] Uses switch to control whether use Parquet data source or not
7d0f7a2 [Cheng Lian] Fixes Metastore Parquet table conversion
a1896c7 [Cheng Lian] Fixes all existing Parquet test suites except for ParquetMetastoreSuite
5654c9d [Cheng Lian] Draft version of Parquet partition discovery and schema merging
Hi, rxin marmbrus
I considered your suggestion (in #4127) and now re-write it. This is now up-to-date.
Could u please review it ?
Author: OopsOutOfMemory <victorshengli@126.com>
Closes#4227 from OopsOutOfMemory/describe and squashes the following commits:
053826f [OopsOutOfMemory] describe
SQLQuerySuite test failure:
[info] - simple select (22 milliseconds)
[info] - sorting (722 milliseconds)
[info] - external sorting (728 milliseconds)
[info] - limit (95 milliseconds)
[info] - date row *** FAILED *** (35 milliseconds)
[info] Results do not match for query:
[info] 'Limit 1
[info] 'Project [CAST(2015-01-28, DateType) AS c0#3630]
[info] 'UnresolvedRelation [testData], None
[info]
[info] == Analyzed Plan ==
[info] Limit 1
[info] Project [CAST(2015-01-28, DateType) AS c0#3630]
[info] LogicalRDD [key#0,value#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35
[info]
[info] == Physical Plan ==
[info] Limit 1
[info] Project [16463 AS c0#3630]
[info] PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35
[info]
[info] == Results ==
[info] !== Correct Answer - 1 == == Spark Answer - 1 ==
[info] ![2015-01-28] [2015-01-27] (QueryTest.scala:77)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info] at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
[info] at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
[info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:77)
[info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:95)
[info] at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply$mcV$sp(SQLQuerySuite.scala:300)
[info] at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply(SQLQuerySuite.scala:300)
[info] at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply(SQLQuerySuite.scala:300)
[info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
[info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
[info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
[info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info] at org.scalatest.SuperEngine$$anonfun$traverseSubNode
Author: wangfei <wangfei1@huawei.com>
Closes#4395 from scwf/SQLQuerySuite and squashes the following commits:
1431a2d [wangfei] fix conflicts
c35fe5e [wangfei] minor fix
01dab3a [wangfei] fix test failure of SQLQuerySuite
1. Added methods to create DataFrames from Seq[Product]
2. Added executeTake to avoid running a Spark job on LocalRelations.
Author: Reynold Xin <rxin@databricks.com>
Closes#4372 from rxin/localDataFrame and squashes the following commits:
f696858 [Reynold Xin] style checker.
839ef7f [Reynold Xin] [SPARK-5602][SQL] Better support for creating DataFrame from local data collection.
Author: Reynold Xin <rxin@databricks.com>
Closes#4379 from rxin/CachedTableSuite and squashes the following commits:
f2b44ce [Reynold Xin] [SQL] Fix flaky CachedTableSuite.
1. Removed LocalHiveContext in Python.
2. Reduced DSL UDF support from 22 arguments to 10 arguments so JavaDoc/ScalaDoc look nicer.
Author: Reynold Xin <rxin@databricks.com>
Closes#4374 from rxin/df-style and squashes the following commits:
e493342 [Reynold Xin] [SQL][DataFrame] Minor cleanup.
...aised in SPARK-4520.
The exception is thrown only for a thrift generated parquet file. The array element schema name is assumed as "array" as per ParquetAvro but for thrift generated parquet files, it is array_name + "_tuple". This leads to missing child of array group type and hence when the parquet rows are being materialized leads to the exception.
Author: Sadhan Sood <sadhan@tellapart.com>
Closes#4148 from sadhan/SPARK-4520 and squashes the following commits:
c5ccde8 [Sadhan Sood] [SPARK-4520] [SQL] This pr fixes the ArrayIndexOutOfBoundsException as raised in SPARK-4520.
Author: Reynold Xin <rxin@databricks.com>
Closes#4376 from rxin/SPARK-5605 and squashes the following commits:
c55f5fa [Reynold Xin] Added a Python test.
f4b8dbb [Reynold Xin] [SPARK-5605][SQL][DF] Allow using String to specify colum name in DSL aggregate functions.