Commit graph

620 commits

Author SHA1 Message Date
Herman van Hovell 39ab199a3f [SPARK-8640] [SQL] Enable Processing of Multiple Window Frames in a Single Window Operator
This PR enables the processing of multiple window frames in a single window operator. This should improve the performance of processing multiple window expressions wich share partition by/order by clauses, because it will be more efficient with respect to memory use and group processing.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #7515 from hvanhovell/SPARK-8640 and squashes the following commits:

f0e1c21 [Herman van Hovell] Changed Window Logical/Physical plans to use partition by/order by specs directly instead of using WindowSpec.
e1711c2 [Herman van Hovell] Enabled the processing of multiple window frames in a single Window operator.
2015-07-31 12:08:25 -07:00
WangTaoTheTonic a3a85d73da [SPARK-9496][SQL]do not print the password in config
https://issues.apache.org/jira/browse/SPARK-9496

We better do not print the password in log.

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #7815 from WangTaoTheTonic/master and squashes the following commits:

c7a5145 [WangTaoTheTonic] do not print the password in config
2015-07-30 23:50:06 -07:00
Wenchen Fan c0cc0eaec6 [SPARK-9390][SQL] create a wrapper for array type
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7724 from cloud-fan/array-data and squashes the following commits:

d0408a1 [Wenchen Fan] fix python
661e608 [Wenchen Fan] rebase
f39256c [Wenchen Fan] fix hive...
6dbfa6f [Wenchen Fan] fix hive again...
8cb8842 [Wenchen Fan] remove element type parameter from getArray
43e9816 [Wenchen Fan] fix mllib
e719afc [Wenchen Fan] fix hive
4346290 [Wenchen Fan] address comment
d4a38da [Wenchen Fan] remove sizeInBytes and add license
7e283e2 [Wenchen Fan] create a wrapper for array type
2015-07-30 10:04:30 -07:00
Joseph Batchik 1221849f91 [SPARK-8005][SQL] Input file name
Users can now get the file name of the partition being read in. A thread local variable is in `SQLNewHadoopRDD` and is set when the partition is computed. `SQLNewHadoopRDD` is moved to core so that the catalyst package can reach it.

This supports:

`df.select(inputFileName())`

and

`sqlContext.sql("select input_file_name() from table")`

Author: Joseph Batchik <josephbatchik@gmail.com>

Closes #7743 from JDrit/input_file_name and squashes the following commits:

abb8609 [Joseph Batchik] fixed failing test and changed the default value to be an empty string
d2f323d [Joseph Batchik] updates per review
102061f [Joseph Batchik] updates per review
75313f5 [Joseph Batchik] small fixes
c7f7b5a [Joseph Batchik] addeding input file name to Spark SQL
2015-07-29 23:35:55 -07:00
Josh Rosen 1b0099fc62 [SPARK-9411] [SQL] Make Tungsten page sizes configurable
We need to make page sizes configurable so we can reduce them in unit tests and increase them in real production workloads.  These sizes are now controlled by a new configuration, `spark.buffer.pageSize`.  The new default is 64 megabytes.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7741 from JoshRosen/SPARK-9411 and squashes the following commits:

a43c4db [Josh Rosen] Fix pow
2c0eefc [Josh Rosen] Fix MAXIMUM_PAGE_SIZE_BYTES comment + value
bccfb51 [Josh Rosen] Lower page size to 4MB in TestHive
ba54d4b [Josh Rosen] Make UnsafeExternalSorter's page size configurable
0045aa2 [Josh Rosen] Make UnsafeShuffle's page size configurable
bc734f0 [Josh Rosen] Rename configuration
e614858 [Josh Rosen] Makes BytesToBytesMap page size configurable
2015-07-29 16:00:30 -07:00
Reynold Xin 6662ee2124 [SPARK-9418][SQL] Use sort-merge join as the default shuffle join.
Sort-merge join is more robust in Spark since sorting can be made using the Tungsten sort operator.

Author: Reynold Xin <rxin@databricks.com>

Closes #7733 from rxin/smj and squashes the following commits:

61e4d34 [Reynold Xin] Fixed test case.
5ffd731 [Reynold Xin] Fixed JoinSuite.
a137dc0 [Reynold Xin] [SPARK-9418][SQL] Use sort-merge join as the default shuffle join.
2015-07-28 17:42:35 -07:00
Reynold Xin b7f54119f8 [SPARK-9420][SQL] Move expressions in sql/core package to catalyst.
Since catalyst package already depends on Spark core, we can move those expressions
into catalyst, and simplify function registry.

This is a followup of #7478.

Author: Reynold Xin <rxin@databricks.com>

Closes #7735 from rxin/SPARK-8003 and squashes the following commits:

2ffbdc3 [Reynold Xin] [SPARK-8003][SQL] Move expressions in sql/core package to catalyst.
2015-07-28 17:03:59 -07:00
Josh Rosen 59b92add7c [SPARK-9393] [SQL] Fix several error-handling bugs in ScriptTransform operator
SparkSQL's ScriptTransform operator has several serious bugs which make debugging fairly difficult:

- If exceptions are thrown in the writing thread then the child process will not be killed, leading to a deadlock because the reader thread will block while waiting for input that will never arrive.
- TaskContext is not propagated to the writer thread, which may cause errors in upstream pipelined operators.
- Exceptions which occur in the writer thread are not propagated to the main reader thread, which may cause upstream errors to be silently ignored instead of killing the job.  This can lead to silently incorrect query results.
- The writer thread is not a daemon thread, but it should be.

In addition, the code in this file is extremely messy:

- Lots of fields are nullable but the nullability isn't clearly explained.
- Many confusing variable names: for instance, there are variables named `ite` and `iterator` that are defined in the same scope.
- Some code was misindented.
- The `*serdeClass` variables are actually expected to be single-quoted strings, which is really confusing: I feel that this parsing / extraction should be performed in the analyzer, not in the operator itself.
- There were no unit tests for the operator itself, only end-to-end tests.

This pull request addresses these issues, borrowing some error-handling techniques from PySpark's PythonRDD.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7710 from JoshRosen/script-transform and squashes the following commits:

16c44e2 [Josh Rosen] Update some comments
983f200 [Josh Rosen] Use unescapeSQLString instead of stripQuotes
6a06a8c [Josh Rosen] Clean up handling of quotes in serde class name
494cde0 [Josh Rosen] Propagate TaskContext to writer thread
323bb2b [Josh Rosen] Fix error-swallowing bug
b31258d [Josh Rosen] Rename iterator variables to disambiguate.
88278de [Josh Rosen] Split ScriptTransformation writer thread into own class.
8b162b6 [Josh Rosen] Add failing test which demonstrates exception masking issue
4ee36a2 [Josh Rosen] Kill script transform subprocess when error occurs in input writer.
bd4c948 [Josh Rosen] Skip launching of external command for empty partitions.
b43e4ec [Josh Rosen] Clean up nullability in ScriptTransformation
fa18d26 [Josh Rosen] Add basic unit test for script transform with 'cat' command.
2015-07-28 16:04:48 -07:00
Joseph Batchik b88b868eb3 [SPARK-8003][SQL] Added virtual column support to Spark
Added virtual column support by adding a new resolution role to the query analyzer. Additional virtual columns can be added by adding case expressions to [the new rule](https://github.com/JDrit/spark/blob/virt_columns/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L1026) and my modifying the [logical plan](https://github.com/JDrit/spark/blob/virt_columns/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L216) to resolve them.

This also solves [SPARK-8003](https://issues.apache.org/jira/browse/SPARK-8003)

This allows you to perform queries such as:
```sql
select spark__partition__id, count(*) as c from table group by spark__partition__id;
```

Author: Joseph Batchik <josephbatchik@gmail.com>
Author: JD <jd@csh.rit.edu>

Closes #7478 from JDrit/virt_columns and squashes the following commits:

7932bf0 [Joseph Batchik] adding spark__partition__id to hive as well
f8a9c6c [Joseph Batchik] merging in master
e49da48 [JD] fixes for @rxin's suggestions
60e120b [JD] fixing test in merge
4bf8554 [JD] merging in master
c68bc0f [Joseph Batchik] Adding function register ability to SQLContext and adding a function for spark__partition__id()
2015-07-28 14:39:25 -07:00
Aaron Davidson 35ef853b3f [SPARK-9397] DataFrame should provide an API to find source data files if applicable
Certain applications would benefit from being able to inspect DataFrames that are straightforwardly produced by data sources that stem from files, and find out their source data. For example, one might want to display to a user the size of the data underlying a table, or to copy or mutate it.

This PR exposes an `inputFiles` method on DataFrame which attempts to discover the source data in a best-effort manner, by inspecting HadoopFsRelations and JSONRelations.

Author: Aaron Davidson <aaron@databricks.com>

Closes #7717 from aarondav/paths and squashes the following commits:

ff67430 [Aaron Davidson] inputFiles
0acd3ad [Aaron Davidson] [SPARK-9397] DataFrame should provide an API to find source data files if applicable
2015-07-28 10:12:09 -07:00
Yijie Shen 63a492b931 [SPARK-8828] [SQL] Revert SPARK-5680
JIRA: https://issues.apache.org/jira/browse/SPARK-8828

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7667 from yjshen/revert_combinesum_2 and squashes the following commits:

c37ccb1 [Yijie Shen] add test case
8377214 [Yijie Shen] revert spark.sql.useAggregate2 to its default value
e2305ac [Yijie Shen] fix bug - avg on decimal column
7cb0e95 [Yijie Shen] [wip] resolving bugs
1fadb5a [Yijie Shen] remove occurance
17c6248 [Yijie Shen] revert SPARK-5680
2015-07-27 22:47:33 -07:00
Michael Armbrust ce89ff477a [SPARK-9386] [SQL] Feature flag for metastore partition pruning
Since we have been seeing a lot of failures related to this new feature, lets put it behind a flag and turn it off by default.

Author: Michael Armbrust <michael@databricks.com>

Closes #7703 from marmbrus/optionalMetastorePruning and squashes the following commits:

6ad128c [Michael Armbrust] style
8447835 [Michael Armbrust] [SPARK-9386][SQL] Feature flag for metastore partition pruning
fd37b87 [Michael Armbrust] add config flag
2015-07-27 17:32:34 -07:00
Cheng Lian 8e7d2bee23 [SPARK-9378] [SQL] Fixes test case "CTAS with serde"
This is a proper version of PR #7693 authored by viirya

The reason why "CTAS with serde" fails is that the `MetastoreRelation` gets converted to a Parquet data source relation by default.

Author: Cheng Lian <lian@databricks.com>

Closes #7700 from liancheng/spark-9378-fix-ctas-test and squashes the following commits:

4413af0 [Cheng Lian] Fixes test case "CTAS with serde"
2015-07-27 13:28:03 -07:00
Yin Huai 55946e76fd [SPARK-9349] [SQL] UDAF cleanup
https://issues.apache.org/jira/browse/SPARK-9349

With this PR, we only expose `UserDefinedAggregateFunction` (an abstract class) and `MutableAggregationBuffer` (an interface). Other internal wrappers and helper classes are moved to `org.apache.spark.sql.execution.aggregate` and marked as `private[sql]`.

Author: Yin Huai <yhuai@databricks.com>

Closes #7687 from yhuai/UDAF-cleanup and squashes the following commits:

db36542 [Yin Huai] Add comments to UDAF examples.
ae17f66 [Yin Huai] Address comments.
9c9fa5f [Yin Huai] UDAF cleanup.
2015-07-27 13:26:57 -07:00
Cheng Lian 72981bc8f0 [SPARK-7943] [SPARK-8105] [SPARK-8435] [SPARK-8714] [SPARK-8561] Fixes multi-database support
This PR fixes a set of issues related to multi-database. A new data structure `TableIdentifier` is introduced to identify a table among multiple databases. We should stop using a single `String` (table name without database name), or `Seq[String]` (optional database name plus table name) to identify tables internally.

Author: Cheng Lian <lian@databricks.com>

Closes #7623 from liancheng/spark-8131-multi-db and squashes the following commits:

f3bcd4b [Cheng Lian] Addresses PR comments
e0eb76a [Cheng Lian] Fixes styling issues
41e2207 [Cheng Lian] Fixes multi-database support
d4d1ec2 [Cheng Lian] Adds multi-database test cases
2015-07-27 17:15:35 +08:00
Wenchen Fan 4ffd3a1db5 [SPARK-9371][SQL] fix the support for special chars in column names for hive context
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7684 from cloud-fan/hive and squashes the following commits:

da21ffe [Wenchen Fan] fix the support for special chars in column names for hive context
2015-07-26 23:58:03 -07:00
Cheng Hao 1efe97dc9e [SPARK-8867][SQL] Support list / describe function usage
As Hive does, we need to list all of the registered UDF and its usage for user.

We add the annotation to describe a UDF, so we can get the literal description info while registering the UDF.
e.g.
```scala
ExpressionDescription(
    usage = "_FUNC_(expr) - Returns the absolute value of the numeric value",
    extended = """> SELECT _FUNC_('-1')
                  1""")
 case class Abs(child: Expression) extends UnaryArithmetic {
...
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7259 from chenghao-intel/desc_function and squashes the following commits:

cf29bba [Cheng Hao] fixing the code style issue
5193855 [Cheng Hao] Add more powerful parser for show functions
c645a6b [Cheng Hao] fix bug in unit test
78d40f1 [Cheng Hao] update the padding issue for usage
48ee4b3 [Cheng Hao] update as feedback
70eb4e9 [Cheng Hao] add show/describe function support
2015-07-26 18:34:19 -07:00
Cheng Lian c025c3d0a1 [SPARK-9095] [SQL] Removes the old Parquet support
This PR removes the old Parquet support:

- Removes the old `ParquetRelation` together with related SQL configuration, plan nodes, strategies, utility classes, and test suites.

- Renames `ParquetRelation2` to `ParquetRelation`

- Renames `RowReadSupport` and `RowRecordMaterializer` to `CatalystReadSupport` and `CatalystRecordMaterializer` respectively, and moved them to separate files.

  This follows naming convention used in other Parquet data models implemented in parquet-mr. It should be easier for developers who are familiar with Parquet to follow.

There's still some other code that can be cleaned up. Especially `RowWriteSupport`. But I'd like to leave this part to SPARK-8848.

Author: Cheng Lian <lian@databricks.com>

Closes #7441 from liancheng/spark-9095 and squashes the following commits:

c7b6e38 [Cheng Lian] Removes WriteToFile
2d688d6 [Cheng Lian] Renames ParquetRelation2 to ParquetRelation
ca9e1b7 [Cheng Lian] Removes old Parquet support
2015-07-26 16:49:19 -07:00
Reynold Xin 6c400b4f39 [SPARK-9354][SQL] Remove InternalRow.get generic getter call in Hive integration code.
Replaced them with get(ordinal, datatype) so we can use UnsafeRow here.

I passed the data types throughout.

Author: Reynold Xin <rxin@databricks.com>

Closes #7669 from rxin/row-generic-getter-hive and squashes the following commits:

3467d8e [Reynold Xin] [SPARK-9354][SQL] Remove Internal.get generic getter call in Hive integration code.
2015-07-26 10:27:39 -07:00
Reynold Xin 4a01bfc2a2 [SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType
Currently UnsafeRow cannot support a generic getter. However, if the data type is known, we can support a generic getter.

Author: Reynold Xin <rxin@databricks.com>

Closes #7666 from rxin/generic-getter-with-datatype and squashes the following commits:

ee2874c [Reynold Xin] Add a default implementation for getStruct.
1e109a0 [Reynold Xin] [SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType.
033ee88 [Reynold Xin] Removed getAs in non test code.
2015-07-25 23:52:37 -07:00
Reynold Xin b1f4b4abfd [SPARK-9348][SQL] Remove apply method on InternalRow.
Author: Reynold Xin <rxin@databricks.com>

Closes #7665 from rxin/remove-row-apply and squashes the following commits:

0b43001 [Reynold Xin] support getString in UnsafeRow.
176d633 [Reynold Xin] apply -> get.
2941324 [Reynold Xin] [SPARK-9348][SQL] Remove apply method on InternalRow.
2015-07-25 18:41:51 -07:00
Cheng Lian e2ec018e37 [SPARK-9285] [SQL] Fixes Row/InternalRow conversion for HadoopFsRelation
This is a follow-up of #7626. It fixes `Row`/`InternalRow` conversion for data sources extending `HadoopFsRelation` with `needConversion` being `true`.

Author: Cheng Lian <lian@databricks.com>

Closes #7649 from liancheng/spark-9285-conversion-fix and squashes the following commits:

036a50c [Cheng Lian] Addresses PR comment
f6d7c6a [Cheng Lian] Fixes Row/InternalRow conversion for HadoopFsRelation
2015-07-25 11:42:49 -07:00
Reynold Xin 431ca39be5 [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
I also changed InternalRow's size/length function to numFields, to make it more obvious that it is not about bytes, but the number of fields.

Author: Reynold Xin <rxin@databricks.com>

Closes #7626 from rxin/internalRow and squashes the following commits:

e124daf [Reynold Xin] Fixed test case.
805ceb7 [Reynold Xin] Commented out the failed test suite.
f8a9ca5 [Reynold Xin] Fixed more bugs. Still at least one more remaining.
76d9081 [Reynold Xin] Fixed data sources.
7807f70 [Reynold Xin] Fixed DataFrameSuite.
cb60cd2 [Reynold Xin] Code review & small bug fixes.
0a2948b [Reynold Xin] Fixed style.
3280d03 [Reynold Xin] [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
2015-07-24 09:37:36 -07:00
Davies Liu 8a94eb23d5 [SPARK-9069] [SPARK-9264] [SQL] remove unlimited precision support for DecimalType
Romove Decimal.Unlimited (change to support precision up to 38, to match with Hive and other databases).

In order to keep backward source compatibility, Decimal.Unlimited is still there, but change to Decimal(38, 18).

If no precision and scale is provide, it's Decimal(10, 0) as before.

Author: Davies Liu <davies@databricks.com>

Closes #7605 from davies/decimal_unlimited and squashes the following commits:

aa3f115 [Davies Liu] fix tests and style
fb0d20d [Davies Liu] address comments
bfaae35 [Davies Liu] fix style
df93657 [Davies Liu] address comments and clean up
06727fd [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_unlimited
4c28969 [Davies Liu] fix tests
8d783cc [Davies Liu] fix tests
788631c [Davies Liu] fix double with decimal in Union/except
1779bde [Davies Liu] fix scala style
c9c7c78 [Davies Liu] remove Decimal.Unlimited
2015-07-23 18:31:13 -07:00
Reynold Xin d71a13f475 [SPARK-9262][build] Treat Scala compiler warnings as errors
I've seen a few cases in the past few weeks that the compiler is throwing warnings that are caused by legitimate bugs. This patch upgrades warnings to errors, except deprecation warnings.

Note that ideally we should be able to mark deprecation warnings as errors as well. However, due to the lack of ability to suppress individual warning messages in the Scala compiler, we cannot do that (since we do need to access deprecated APIs in Hadoop).

Most of the work are done by ericl.

Author: Reynold Xin <rxin@databricks.com>
Author: Eric Liang <ekl@databricks.com>

Closes #7598 from rxin/warnings and squashes the following commits:

beb311b [Reynold Xin] Fixed tests.
542c031 [Reynold Xin] Fixed one more warning.
87c354a [Reynold Xin] Fixed all non-deprecation warnings.
78660ac [Eric Liang] first effort to fix warnings
2015-07-22 21:02:19 -07:00
Matei Zaharia fe26584a1f [SPARK-9244] Increase some memory defaults
There are a few memory limits that people hit often and that we could
make higher, especially now that memory sizes have grown.

- spark.akka.frameSize: This defaults at 10 but is often hit for map
  output statuses in large shuffles. This memory is not fully allocated
  up-front, so we can just make this larger and still not affect jobs
  that never sent a status that large. We increase it to 128.

- spark.executor.memory: Defaults at 512m, which is really small. We
  increase it to 1g.

Author: Matei Zaharia <matei@databricks.com>

Closes #7586 from mateiz/configs and squashes the following commits:

ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
2015-07-22 15:28:09 -07:00
Yin Huai c03299a18b [SPARK-4233] [SPARK-4367] [SPARK-3947] [SPARK-3056] [SQL] Aggregation Improvement
This is the first PR for the aggregation improvement, which is tracked by https://issues.apache.org/jira/browse/SPARK-4366 (umbrella JIRA). This PR contains work for its subtasks, SPARK-3056, SPARK-3947, SPARK-4233, and SPARK-4367.

This PR introduces a new code path for evaluating aggregate functions. This code path is guarded by `spark.sql.useAggregate2` and by default the value of this flag is true.

This new code path contains:
* A new aggregate function interface (`AggregateFunction2`) and 7 built-int aggregate functions based on this new interface (`AVG`, `COUNT`, `FIRST`, `LAST`, `MAX`, `MIN`, `SUM`)
* A UDAF interface (`UserDefinedAggregateFunction`) based on the new code path and two example UDAFs (`MyDoubleAvg` and `MyDoubleSum`).
* A sort-based aggregate operator (`Aggregate2Sort`) for the new aggregate function interface .
* A sort-based aggregate operator (`FinalAndCompleteAggregate2Sort`) for distinct aggregations (for distinct aggregations the query plan will use `Aggregate2Sort` and `FinalAndCompleteAggregate2Sort` together).

With this change, `spark.sql.useAggregate2` is `true`, the flow of compiling an aggregation query is:
1. Our analyzer looks up functions and returns aggregate functions built based on the old aggregate function interface.
2. When our planner is compiling the physical plan, it tries try to convert all aggregate functions to the ones built based on the new interface. The planner will fallback to the old code path if any of the following two conditions is true:
* code-gen is disabled.
* there is any function that cannot be converted (right now, Hive UDAFs).
* the schema of grouping expressions contain any complex data type.
* There are multiple distinct columns.

Right now, the new code path handles a single distinct column in the query (you can have multiple aggregate functions using that distinct column). For a query having a aggregate function with DISTINCT and regular aggregate functions, the generated plan will do partial aggregations for those regular aggregate function.

Thanks chenghao-intel for his initial work on it.

Author: Yin Huai <yhuai@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #7458 from yhuai/UDAF and squashes the following commits:

7865f5e [Yin Huai] Put the catalyst expression in the comment of the generated code for it.
b04d6c8 [Yin Huai] Remove unnecessary change.
f1d5901 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
35b0520 [Yin Huai] Use semanticEquals to replace grouping expressions in the output of the aggregate operator.
3b43b24 [Yin Huai] bug fix.
00eb298 [Yin Huai] Make it compile.
a3ca551 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
e0afca3 [Yin Huai] Gracefully fallback to old aggregation code path.
8a8ac4a [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
88c7d4d [Yin Huai] Enable spark.sql.useAggregate2 by default for testing purpose.
dc96fd1 [Yin Huai] Many updates:
85c9c4b [Yin Huai] newline.
43de3de [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
c3614d7 [Yin Huai] Handle single distinct column.
68b8ee9 [Yin Huai] Support single distinct column set. WIP
3013579 [Yin Huai] Format.
d678aee [Yin Huai] Remove AggregateExpressionSuite.scala since our built-in aggregate functions will be based on AlgebraicAggregate and we need to have another way to test it.
e243ca6 [Yin Huai] Add aggregation iterators.
a101960 [Yin Huai] Change MyJavaUDAF to MyDoubleSum.
594cdf5 [Yin Huai] Change existing AggregateExpression to AggregateExpression1 and add an AggregateExpression as the common interface for both AggregateExpression1 and AggregateExpression2.
380880f [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
0a827b3 [Yin Huai] Add comments and doc. Move some classes to the right places.
a19fea6 [Yin Huai] Add UDAF interface.
262d4c4 [Yin Huai] Make it compile.
b2e358e [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
6edb5ac [Yin Huai] Format update.
70b169c [Yin Huai] Remove groupOrdering.
4721936 [Yin Huai] Add CheckAggregateFunction to extendedCheckRules.
d821a34 [Yin Huai] Cleanup.
32aea9c [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
5b46d41 [Yin Huai] Bug fix.
aff9534 [Yin Huai] Make Aggregate2Sort work with both algebraic AggregateFunctions and non-algebraic AggregateFunctions.
2857b55 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
4435f20 [Yin Huai] Add ConvertAggregateFunction to HiveContext's analyzer.
1b490ed [Michael Armbrust] make hive test
8cfa6a9 [Michael Armbrust] add test
1b0bb3f [Yin Huai] Do not bind references in AlgebraicAggregate and use code gen for all places.
072209f [Yin Huai] Bug fix: Handle expressions in grouping columns that are not attribute references.
f7d9e54 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into UDAF
39ee975 [Yin Huai] Code cleanup: Remove unnecesary AttributeReferences.
b7720ba [Yin Huai] Add an analysis rule to convert aggregate function to the new version.
5c00f3f [Michael Armbrust] First draft of codegen
6bbc6ba [Michael Armbrust] now with correct answers\!
f7996d0 [Michael Armbrust] Add AlgebraicAggregate
dded1c5 [Yin Huai] wip
2015-07-21 23:26:11 -07:00
Dennis Huo c07838b5a9 [SPARK-9206] [SQL] Fix HiveContext classloading for GCS connector.
IsolatedClientLoader.isSharedClass includes all of com.google.\*, presumably
for Guava, protobuf, and/or other shared Google libraries, but needs to
count com.google.cloud.\* as "hive classes" when determining which ClassLoader
to use. Otherwise, things like HiveContext.parquetFile will throw a
ClassCastException when fs.defaultFS is set to a Google Cloud Storage (gs://)
path. On StackOverflow: http://stackoverflow.com/questions/31478955

EDIT: Adding yhuai who worked on the relevant classloading isolation pieces.

Author: Dennis Huo <dhuo@google.com>

Closes #7549 from dennishuo/dhuo-fix-hivecontext-gcs and squashes the following commits:

1f8db07 [Dennis Huo] Fix HiveContext classloading for GCS connector.
2015-07-21 13:12:11 -07:00
Reynold Xin 60c0ce134d [SPARK-8906][SQL] Move all internal data source classes into execution.datasources.
This way, the sources package contains only public facing interfaces.

Author: Reynold Xin <rxin@databricks.com>

Closes #7565 from rxin/move-ds and squashes the following commits:

7661aff [Reynold Xin] Mima
9d5196a [Reynold Xin] Rearranged imports.
3dd7174 [Reynold Xin] [SPARK-8906][SQL] Move all internal data source classes into execution.datasources.
2015-07-21 11:56:38 -07:00
Cheng Lian d38c5029a2 [SPARK-9100] [SQL] Adds DataFrame reader/writer shortcut methods for ORC
This PR adds DataFrame reader/writer shortcut methods for ORC in both Scala and Python.

Author: Cheng Lian <lian@databricks.com>

Closes #7444 from liancheng/spark-9100 and squashes the following commits:

284d043 [Cheng Lian] Fixes PySpark test cases and addresses PR comments
e0b09fb [Cheng Lian] Adds DataFrame reader/writer shortcut methods for ORC
2015-07-21 15:08:44 +08:00
Cheng Lian dde0e12f32 [SPARK-6910] [SQL] Support for pushing predicates down to metastore for partition pruning
This PR forks PR #7421 authored by piaozhexiu and adds [a workaround] [1] for fixing the occasional test failures occurred in PR #7421. Please refer to these [two] [2] [comments] [3] for details.

[1]: 536ac41a7e
[2]: https://github.com/apache/spark/pull/7421#issuecomment-122527391
[3]: https://github.com/apache/spark/pull/7421#issuecomment-122528059

Author: Cheolsoo Park <cheolsoop@netflix.com>
Author: Cheng Lian <lian@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #7492 from liancheng/pr-7421-workaround and squashes the following commits:

5599cc4 [Cheolsoo Park] Predicate pushdown to hive metastore
536ac41 [Cheng Lian] Sets hive.metastore.integral.jdo.pushdown to true to workaround test failures caused by in #7421
2015-07-20 15:12:14 -07:00
Herman van Hovell 7a81245345 [SPARK-8638] [SQL] Window Function Performance Improvements - Cleanup
This PR contains a few clean-ups that are a part of SPARK-8638: a few style issues got fixed, and a few tests were moved.

Git commit message is wrong BTW :(...

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #7513 from hvanhovell/SPARK-8638-cleanup and squashes the following commits:

4e69d08 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase)
2015-07-19 16:29:50 -07:00
Herman van Hovell a9a0d0cebf [SPARK-8638] [SQL] Window Function Performance Improvements
## Description
Performance improvements for Spark Window functions. This PR will also serve as the basis for moving away from Hive UDAFs to Spark UDAFs. See JIRA tickets SPARK-8638 and SPARK-7712 for more information.

## Improvements
* Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUDED FOLLOWING cases. The current implementation in spark uses a sliding window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). This also means that all these aggregates all need to be updated separately, this takes N*(N-1)/2 updates. The running case differs from the Sliding case because we are only adding data to an aggregate function (no reset is required), we only need to maintain one aggregate (like in the UNBOUNDED PRECEDING AND UNBOUNDED case), update the aggregate for each row, and get the aggregate value after each update. This is what the new implementation does. This approach only uses 1 buffer, and only requires N updates; I am currently working on data with window sizes of 500-1000 doing running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case also uses this approach and the fact that aggregate operations are communitative, there is one twist though it will process the input buffer in reverse.
* Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement.
* A single Window node is able to process all types of Frames for the same Partitioning/Ordering. This saves a little time/memory spent buffering and managing partitions. This will be enabled in a follow-up PR.
* A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code.

## Benchmarking
I have done a small benchmark using [on time performance](http://www.transtats.bts.gov) data of the month april. I have used the origin as a partioning key, as a result there is quite some variation in window sizes. The code for the benchmark can be found in the JIRA ticket. These are the results per Frame type:

Frame | Master | SPARK-8638
----- | ------ | ----------
Entire Frame | 2 s | 1 s
Sliding | 18 s | 1 s
Growing | 14 s | 0.9 s
Shrinking | 13 s | 1 s

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #7057 from hvanhovell/SPARK-8638 and squashes the following commits:

3bfdc49 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase)
2eb3b33 [Herman van Hovell] Corrected reverse range frame processing.
2cd2d5b [Herman van Hovell] Corrected reverse range frame processing.
b0654d7 [Herman van Hovell] Tests for exotic frame specifications.
e75b76e [Herman van Hovell] More docs, added support for reverse sliding range frames, and some reorganization of code.
1fdb558 [Herman van Hovell] Changed Data In HiveDataFrameWindowSuite.
ac2f682 [Herman van Hovell] Added a few more comments.
1938312 [Herman van Hovell] Added Documentation to the createBoundOrdering methods.
bb020e6 [Herman van Hovell] Major overhaul of Window operator.
2015-07-18 23:44:38 -07:00
Reynold Xin 9914b1b2c5 [SPARK-9150][SQL] Create CodegenFallback and Unevaluable trait
It is very hard to track which expressions have code gen implemented or not. This patch removes the default fallback gencode implementation from Expression, and moves that into a new trait called CodegenFallback. Each concrete expression needs to either implement code generation, or mix in CodegenFallback. This makes it very easy to track which expressions have code generation implemented already.

Additionally, this patch creates an Unevaluable trait that can be used to track expressions that don't support evaluation (e.g. Star).

Author: Reynold Xin <rxin@databricks.com>

Closes #7487 from rxin/codegenfallback and squashes the following commits:

14ebf38 [Reynold Xin] Fixed Conv
6c1c882 [Reynold Xin] Fixed Alias.
b42611b [Reynold Xin] [SPARK-9150][SQL] Create a trait to track code generation for expressions.
cb5c066 [Reynold Xin] Removed extra import.
39cbe40 [Reynold Xin] [SPARK-8240][SQL] string function: concat
2015-07-18 18:18:19 -07:00
Reynold Xin b2aa490bb6 [SPARK-9142] [SQL] Removing unnecessary self types in Catalyst.
Just a small change to add Product type to the base expression/plan abstract classes, based on suggestions on #7434 and offline discussions.

Author: Reynold Xin <rxin@databricks.com>

Closes #7479 from rxin/remove-self-types and squashes the following commits:

e407ffd [Reynold Xin] [SPARK-9142][SQL] Removing unnecessary self types in Catalyst.
2015-07-17 15:02:13 -07:00
Cheng Hao e27212317c [SPARK-8972] [SQL] Incorrect result for rollup
We don't support the complex expression keys in the rollup/cube, and we even will not report it if we have the complex group by keys, that will cause very confusing/incorrect result.

e.g. `SELECT key%100 FROM src GROUP BY key %100 with ROLLUP`

This PR adds an additional project during the analyzing for the complex GROUP BY keys, and that projection will be the child of `Expand`, so to `Expand`, the GROUP BY KEY are always the simple key(attribute names).

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7343 from chenghao-intel/expand and squashes the following commits:

1ebbb59 [Cheng Hao] update the comment
827873f [Cheng Hao] update as feedback
34def69 [Cheng Hao] Add more unit test and comments
c695760 [Cheng Hao] fix bug of incorrect result for rollup
2015-07-15 23:35:27 -07:00
Steve Loughran ec9b621647 SPARK-9070 JavaDataFrameSuite teardown NPEs if setup failed
fix teardown to skip table delete if hive context is null

Author: Steve Loughran <stevel@hortonworks.com>

Closes #7425 from steveloughran/stevel/patches/SPARK-9070-JavaDataFrameSuite-NPE and squashes the following commits:

1982d38 [Steve Loughran] SPARK-9070 JavaDataFrameSuite teardown NPEs if setup failed
2015-07-15 12:15:35 -07:00
Michael Armbrust c6b1a9e74e Revert SPARK-6910 and SPARK-9027
Revert #7216 and #7386.  These patch seems to be causing quite a few test failures:

```
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.GeneratedMethodAccessor322.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:351)
	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getPartitionsByFilter$1.apply(ClientWrapper.scala:320)
	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getPartitionsByFilter$1.apply(ClientWrapper.scala:318)
	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:180)
	at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:135)
	at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:172)
	at org.apache.spark.sql.hive.client.ClientWrapper.getPartitionsByFilter(ClientWrapper.scala:318)
	at org.apache.spark.sql.hive.client.HiveTable.getPartitions(ClientInterface.scala:78)
	at org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(HiveMetastoreCatalog.scala:670)
	at org.apache.spark.sql.hive.execution.HiveTableScan.doExecute(HiveTableScan.scala:137)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:90)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:90)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:89)
	at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:164)
	at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:151)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
	... 85 more
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$FilterBuilder.setError(ExpressionTree.java:185)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.getJdoFilterPushdownParam(ExpressionTree.java:452)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilterOverPartitions(ExpressionTree.java:357)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilter(ExpressionTree.java:279)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$TreeNode.generateJDOFilter(ExpressionTree.java:243)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree.generateJDOFilterFragment(ExpressionTree.java:590)
	at org.apache.hadoop.hive.metastore.ObjectStore.makeQueryFilterString(ObjectStore.java:2417)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsViaOrmFilter(ObjectStore.java:2029)
	at org.apache.hadoop.hive.metastore.ObjectStore.access$500(ObjectStore.java:146)
	at org.apache.hadoop.hive.metastore.ObjectStore$4.getJdoResult(ObjectStore.java:2332)
```
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-Maven-with-YARN/2945/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/junit/org.apache.spark.sql.hive.execution/SortMergeCompatibilitySuite/auto_sortmerge_join_16/

Author: Michael Armbrust <michael@databricks.com>

Closes #7409 from marmbrus/revertMetastorePushdown and squashes the following commits:

92fabd3 [Michael Armbrust] Revert SPARK-6910 and SPARK-9027
5d3bdf2 [Michael Armbrust] Revert "[SPARK-9027] [SQL] Generalize metastore predicate pushdown"
2015-07-14 22:57:39 -07:00
Josh Rosen 11e5c37286 [SPARK-8962] Add Scalastyle rule to ban direct use of Class.forName; fix existing uses
This pull request adds a Scalastyle regex rule which fails the style check if `Class.forName` is used directly.  `Class.forName` always loads classes from the default / system classloader, but in a majority of cases, we should be using Spark's own `Utils.classForName` instead, which tries to load classes from the current thread's context classloader and falls back to the classloader which loaded Spark when the context classloader is not defined.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/7350)
<!-- Reviewable:end -->

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7350 from JoshRosen/ban-Class.forName and squashes the following commits:

e3e96f7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into ban-Class.forName
c0b7885 [Josh Rosen] Hopefully fix the last two cases
d707ba7 [Josh Rosen] Fix uses of Class.forName that I missed in my first cleanup pass
046470d [Josh Rosen] Merge remote-tracking branch 'origin/master' into ban-Class.forName
62882ee [Josh Rosen] Fix uses of Class.forName or add exclusion.
d9abade [Josh Rosen] Add stylechecker rule to ban uses of Class.forName
2015-07-14 16:08:17 -07:00
Michael Armbrust 37f2d9635f [SPARK-9027] [SQL] Generalize metastore predicate pushdown
Add support for pushing down metastore filters that are in different orders and add some unit tests.

Author: Michael Armbrust <michael@databricks.com>

Closes #7386 from marmbrus/metastoreFilters and squashes the following commits:

05a4524 [Michael Armbrust] [SPARK-9027][SQL] Generalize metastore predicate pushdown
2015-07-14 11:22:09 -07:00
Cheolsoo Park 408b384de9 [SPARK-6910] [SQL] Support for pushing predicates down to metastore for partition pruning
This PR supersedes my old one #6921. Since my patch has changed quite a bit, I am opening a new PR to make it easier to review.

The changes include-
* Implement `toMetastoreFilter()` function in `HiveShim` that takes `Seq[Expression]` and converts them into a filter string for Hive metastore.
 * This functions matches all the `AttributeReference` + `BinaryComparisonOp` + `Integral/StringType` patterns in `Seq[Expression]` and fold them into a string.
* Change `hiveQlPartitions` field in `MetastoreRelation` to `getHiveQlPartitions()` function that takes a filter string parameter.
* Call `getHiveQlPartitions()` in `HiveTableScan` with a filter string.

But there are some cases in which predicate pushdown is disabled-

Case | Predicate pushdown
------- | -----------------------------
Hive integral and string types | Yes
Hive varchar type | No
Hive 0.13 and newer | Yes
Hive 0.12 and older | No
convertMetastoreParquet=false | Yes
convertMetastoreParquet=true | No

In case of `convertMetastoreParquet=true`, predicates are not pushed down because this conversion happens in an `Analyzer` rule (`HiveMetastoreCatalog.ParquetConversions`). At this point, `HiveTableScan` hasn't run, so predicates are not available. But reading the source code, I think it is intentional to convert the entire Hive table w/ all the partitions into `ParquetRelation` because then `ParquetRelation` can be cached and reused for any query against that table. Please correct me if I am wrong.

cc marmbrus

Author: Cheolsoo Park <cheolsoop@netflix.com>

Closes #7216 from piaozhexiu/SPARK-6910-2 and squashes the following commits:

aa1490f [Cheolsoo Park] Fix ordering of imports
c212c4d [Cheolsoo Park] Incorporate review comments
5e93f9d [Cheolsoo Park] Predicate pushdown into Hive metastore
2015-07-13 19:45:10 -07:00
Jonathan Alter e14b545d2d [SPARK-7977] [BUILD] Disallowing println
Author: Jonathan Alter <jonalter@users.noreply.github.com>

Closes #7093 from jonalter/SPARK-7977 and squashes the following commits:

ccd44cc [Jonathan Alter] Changed println to log in ThreadingSuite
7fcac3e [Jonathan Alter] Reverting to println in ThreadingSuite
10724b6 [Jonathan Alter] Changing some printlns to logs in tests
eeec1e7 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
0b1dcb4 [Jonathan Alter] More println cleanup
aedaf80 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
925fd98 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
0c16fa3 [Jonathan Alter] Replacing some printlns with logs
45c7e05 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
5c8e283 [Jonathan Alter] Allowing println in audit-release examples
5b50da1 [Jonathan Alter] Allowing printlns in example files
ca4b477 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
83ab635 [Jonathan Alter] Fixing new printlns
54b131f [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
1cd8a81 [Jonathan Alter] Removing some unnecessary comments and printlns
b837c3a [Jonathan Alter] Disallowing println
2015-07-10 11:34:01 +01:00
Michael Armbrust 768907eb7b [SPARK-8926][SQL] Good errors for ExpectsInputType expressions
For example: `cannot resolve 'testfunction(null)' due to data type mismatch: argument 1 is expected to be of type int, however, null is of type datetype.`

Author: Michael Armbrust <michael@databricks.com>

Closes #7303 from marmbrus/expectsTypeErrors and squashes the following commits:

c654a0e [Michael Armbrust] fix udts and make errors pretty
137160d [Michael Armbrust] style
5428fda [Michael Armbrust] style
10fac82 [Michael Armbrust] [SPARK-8926][SQL] Good errors for ExpectsInputType expressions
2015-07-08 22:05:58 -07:00
Andrew Or 47ef423f86 [SPARK-8910] Fix MiMa flaky due to port contention issue
Due to the way MiMa works, we currently start a `SQLContext` pretty early on. This causes us to start a `SparkUI` that attempts to bind to port 4040. Because many tests run in parallel on the Jenkins machines, this  causes port contention sometimes and fails the MiMa tests.

Note that we already disabled the SparkUI for scalatests. However, the MiMa test is run before we even have a chance to load the default scalatest settings, so we need to explicitly disable the UI ourselves.

Author: Andrew Or <andrew@databricks.com>

Closes #7300 from andrewor14/mima-flaky and squashes the following commits:

b55a547 [Andrew Or] Do not enable SparkUI during tests
2015-07-08 20:29:08 -07:00
Yijie Shen a290814877 [SPARK-8866][SQL] use 1us precision for timestamp type
JIRA: https://issues.apache.org/jira/browse/SPARK-8866

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7283 from yijieshen/micro_timestamp and squashes the following commits:

dc735df [Yijie Shen] update CastSuite to avoid round error
714eaea [Yijie Shen] add timestamp_udf into blacklist due to precision lose
c3ca2f4 [Yijie Shen] fix unhandled case in CurrentTimestamp
8d4aa6b [Yijie Shen] use 1us precision for timestamp type
2015-07-08 20:20:17 -07:00
Cheng Lian 4ffc27caaf [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] Refactors Parquet read path for interoperability and backwards-compatibility
This PR is a follow-up of #6617 and is part of [SPARK-6774] [2], which aims to ensure interoperability and backwards-compatibility for Spark SQL Parquet support.  And this one fixes the read path.  Now Spark SQL is expected to be able to read legacy Parquet data files generated by most (if not all) common libraries/tools like parquet-thrift, parquet-avro, and parquet-hive. However, we still need to refactor the write path to write standard Parquet LISTs and MAPs ([SPARK-8848] [4]).

### Major changes

1. `CatalystConverter` class hierarchy refactoring

   - Replaces `CatalystConverter` trait with a much simpler `ParentContainerUpdater`.

     Now instead of extending the original `CatalystConverter` trait, every converter class accepts an updater which is responsible for propagating the converted value to some parent container. For example, appending array elements to a parent array buffer, appending a key-value pairs to a parent mutable map, or setting a converted value to some specific field of a parent row. Root converter doesn't have a parent and thus uses a `NoopUpdater`.

     This simplifies the design since converters don't need to care about details of their parent converters anymore.

   - Unifies `CatalystRootConverter`, `CatalystGroupConverter` and `CatalystPrimitiveRowConverter` into `CatalystRowConverter`

     Specifically, now all row objects are represented by `SpecificMutableRow` during conversion.

   - Refactors `CatalystArrayConverter`, and removes `CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter`

     `CatalystNativeArrayConverter` was probably designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal.

     The new `CatalystArrayConverter` handles both nullable and non-nullable array elements in a consistent way.

   - Implements backwards-compatibility rules in `CatalystArrayConverter`

     When Parquet records are being converted, schema of Parquet files should have already been verified. So we only need to care about the structure rather than field names in the Parquet schema. Since all map objects represented in legacy systems have the same structure as the standard one (see [backwards-compatibility rules for MAP] [1]), we only need to deal with LIST (namely array) in `CatalystArrayConverter`.

2. Requested columns handling

   When specifying requested columns in `RowReadSupport`, we used to use a Parquet `MessageType` converted from a Catalyst `StructType` which contains all requested columns.  This is not preferable when taking compatibility and interoperability into consideration.  Because the actual Parquet file may have different physical structure from the converted schema.

   In this PR, the schema for requested columns is constructed using the following method:

   - For a column that exists in the target Parquet file, we extract the column type by name from the full file schema, and construct a single-field `MessageType` for that column.
   - For a column that doesn't exist in the target Parquet file, we create a single-field `StructType` and convert it to a `MessageType` using `CatalystSchemaConverter`.
   - Unions all single-field `MessageType`s into a full schema containing all requested fields

   With this change, we also fix [SPARK-6123] [3] by validating the global schema against each individual Parquet part-files.

### Testing

This PR also adds compatibility tests for parquet-avro, parquet-thrift, and parquet-hive. Please refer to `README.md` under `sql/core/src/test` for more information about these tests. To avoid build time code generation and adding extra complexity to the build system, Java code generated from testing Thrift schema and Avro IDL is also checked in.

[1]: https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
[2]: https://issues.apache.org/jira/browse/SPARK-6774
[3]: https://issues.apache.org/jira/browse/SPARK-6123
[4]: https://issues.apache.org/jira/browse/SPARK-8848

Author: Cheng Lian <lian@databricks.com>

Closes #7231 from liancheng/spark-6776 and squashes the following commits:

360fe18 [Cheng Lian] Adds ParquetHiveCompatibilitySuite
c6fbc06 [Cheng Lian] Removes WIP file committed by mistake
b8c1295 [Cheng Lian] Excludes the whole parquet package from MiMa
598c3e8 [Cheng Lian] Adds extra Maven repo for hadoop-lzo, which is a transitive dependency of parquet-thrift
926af87 [Cheng Lian] Simplifies Parquet compatibility test suites
7946ee1 [Cheng Lian] Fixes Scala styling issues
3d7ab36 [Cheng Lian] Fixes .rat-excludes
a8f13bb [Cheng Lian] Using Parquet writer API to do compatibility tests
f2208cd [Cheng Lian] Adds README.md for Thrift/Avro code generation
1d390aa [Cheng Lian] Adds parquet-thrift compatibility test
440f7b3 [Cheng Lian] Adds generated files to .rat-excludes
13b9121 [Cheng Lian] Adds ParquetAvroCompatibilitySuite
06cfe9d [Cheng Lian] Adds comments about TimestampType handling
a099d3e [Cheng Lian] More comments
0cc1b37 [Cheng Lian] Fixes MiMa checks
884d3e6 [Cheng Lian] Fixes styling issue and reverts unnecessary changes
802cbd7 [Cheng Lian] Fixes bugs related to schema merging and empty requested columns
38fe1e7 [Cheng Lian] Adds explicit return type
7fb21f1 [Cheng Lian] Reverts an unnecessary debugging change
1781dff [Cheng Lian] Adds test case for SPARK-8811
6437d4b [Cheng Lian] Assembles requested schema from Parquet file schema
bcac49f [Cheng Lian] Removes the 16-byte restriction of decimals
a74fb2c [Cheng Lian] More comments
0525346 [Cheng Lian] Removes old Parquet record converters
03c3bd9 [Cheng Lian] Refactors Parquet read path to implement backwards-compatibility rules
2015-07-08 15:51:01 -07:00
Keuntae Park f031543782 [SPARK-8783] [SQL] CTAS with WITH clause does not work
Currently, CTESubstitution only handles the case that WITH is on the top of the plan.
I think it SHOULD handle the case that WITH is child of CTAS.
This patch simply changes 'match' to 'transform' for recursive search of WITH in the plan.

Author: Keuntae Park <sirpkt@apache.org>

Closes #7180 from sirpkt/SPARK-8783 and squashes the following commits:

e4428f0 [Keuntae Park] Merge remote-tracking branch 'upstream/master' into CTASwithWITH
1671c77 [Keuntae Park] WITH clause can be inside CTAS
2015-07-08 14:29:52 -07:00
Takeshi YAMAMURO 3e831a2696 [SPARK-6912] [SQL] Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF
To make UDF developers understood, throw an exception when unsupported Map<K,V> types used in Hive UDF. This fix is the same with #7248.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #7257 from maropu/ThrowExceptionWhenMapUsed and squashes the following commits:

916099a [Takeshi YAMAMURO] Fix style errors
7886dcc [Takeshi YAMAMURO] Throw an exception when Map<> used in Hive UDF
2015-07-08 10:33:27 -07:00
Cheng Hao 351a36d0c5 [SPARK-8883][SQL]Remove the OverrideFunctionRegistry
Remove the `OverrideFunctionRegistry` from the Spark SQL, as the subclasses of `FunctionRegistry` have their own way to the delegate to the right underlying `FunctionRegistry`.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7260 from chenghao-intel/override and squashes the following commits:

164d093 [Cheng Hao] enable the function registry
2ca8459 [Cheng Hao] remove the OverrideFunctionRegistry
2015-07-08 00:10:24 -07:00
Reynold Xin 770ff1025e [SPARK-8876][SQL] Remove InternalRow type alias in expressions package.
The type alias was there because initially when I moved Row around, I didn't want to do massive changes to the expression code. But now it should be pretty easy to just remove it. One less concept to worry about.

Author: Reynold Xin <rxin@databricks.com>

Closes #7270 from rxin/internalrow and squashes the following commits:

72fc842 [Reynold Xin] [SPARK-8876][SQL] Remove InternalRow type alias in expressions package.
2015-07-07 17:40:14 -07:00