* The `SqlLexical.allCaseVersions` will cause `StackOverflowException` if the key word is too long, the patch will fix that by normalizing all of the keywords in `SqlLexical`.
* And make a unified SparkSQLParser for sharing the common code.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3926 from chenghao-intel/long_keyword and squashes the following commits:
686660f [Cheng Hao] Support Long Keyword and Refactor the SQLParsers
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#4040 from adrian-wang/coalesce and squashes the following commits:
0ac8e8f [Daoyuan Wang] add coalesce() in sql parser
JIRA: https://issues.apache.org/jira/browse/SPARK-5287
This PR only add `defaultSizeOf` to data types and make those internal type classes `protected[sql]`. I will use another PR to cleanup the type hierarchy of data types.
Author: Yin Huai <yhuai@databricks.com>
Closes#4081 from yhuai/SPARK-5287 and squashes the following commits:
90cec75 [Yin Huai] Update unit test.
e1c600c [Yin Huai] Make internal classes protected[sql].
7eaba68 [Yin Huai] Add `defaultSize` method to data types.
fd425e0 [Yin Huai] Add all native types to NativeType.defaultSizeOf.
This is a follow-up of #4090. The original deeply nested `reduceOption` code is hard to grasp.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4091)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4091 from liancheng/refactor-boolean-simplification and squashes the following commits:
cd8860b [Cheng Lian] Improves `compareConditions` to handle more subtle cases
1bf3258 [Cheng Lian] Avoids converting predicate sets to lists
e833ca4 [Cheng Lian] Refactors deeply nested FP style code
Author: Reynold Xin <rxin@databricks.com>
Closes#4117 from rxin/catalyst-test-log4j and squashes the following commits:
8ad610b [Reynold Xin] [SQL][minor] Add a log4j file for catalyst test.
JIRA: https://issues.apache.org/jira/browse/SPARK-5286
Author: Yin Huai <yhuai@databricks.com>
Closes#4076 from yhuai/SPARK-5286 and squashes the following commits:
6b69ed1 [Yin Huai] Catch all exception when we try to uncache a query.
JIRA: https://issues.apache.org/jira/browse/SPARK-5284
Author: Yin Huai <yhuai@databricks.com>
Closes#4077 from yhuai/SPARK-5284 and squashes the following commits:
fceacd6 [Yin Huai] Check if a value is null when the field has a complex type.
Author: Jacky Li <jacky.likun@gmail.com>
Closes#4100 from jackylk/patch-9 and squashes the following commits:
b13b9d6 [Jacky Li] Update SQLConf.scala
4d3f83d [Jacky Li] Update SQLConf.scala
fcc8c85 [Jacky Li] [SQL] fix typo in class description
Author: Reynold Xin <rxin@databricks.com>
Closes#4097 from rxin/javarename and squashes the following commits:
c5ce96a [Reynold Xin] [SQL][minor] Put DataTypes.java in java dir.
Author: Reynold Xin <rxin@databricks.com>
Closes#4092 from rxin/bigdecimal and squashes the following commits:
27b08c9 [Reynold Xin] Fixed test.
10cb496 [Reynold Xin] [SPARK-5279][SQL] Use java.math.BigDecimal as the exposed Decimal type.
Author: Reynold Xin <rxin@databricks.com>
Closes#4090 from rxin/booleanSimplification and squashes the following commits:
68c8986 [Reynold Xin] [SQL][Minor] Added comments and examples to explain BooleanSimplification.
Follow up of #3778
/cc rxin
Author: scwf <wangfei1@huawei.com>
Closes#4086 from scwf/commentforspark-4937 and squashes the following commits:
aaf89f6 [scwf] code style issue
2d3406e [scwf] added comment for spark-4937
Author: Reynold Xin <rxin@databricks.com>
Closes#4085 from rxin/row-doc and squashes the following commits:
f77cb27 [Reynold Xin] [SQL][minor] Improved Row documentation.
Adding optimization to simplify the And/Or condition in spark sql.
There are two kinds of Optimization
1 Numeric condition optimization, such as:
a < 3 && a > 5 ---- False
a < 1 || a > 0 ---- True
a > 3 && a > 5 => a > 5
(a < 2 || b > 5) && a < 2 => a < 2
2 optimizing the some query from a cartesian product into equi-join, such as this sql (one of hive-testbench):
```
select
sum(l_extendedprice* (1 - l_discount)) as revenue
from
lineitem,
part
where
(
p_partkey = l_partkey
and p_brand = 'Brand#32'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and l_quantity >= 7 and l_quantity <= 7 + 10
and p_size between 1 and 5
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#35'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and l_quantity >= 15 and l_quantity <= 15 + 10
and p_size between 1 and 10
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#24'
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and l_quantity >= 26 and l_quantity <= 26 + 10
and p_size between 1 and 15
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
```
It has a repeated expression in Or, so we can optimize it by ``` (a && b) || (a && c) = a && (b || c)```
Before optimization, this sql hang in my locally test, and the physical plan is:
![image](https://cloud.githubusercontent.com/assets/7018048/5539175/31cf38e8-8af9-11e4-95e3-336f9b3da4a4.png)
After optimization, this sql run successfully in 20+ seconds, and its physical plan is:
![image](https://cloud.githubusercontent.com/assets/7018048/5539176/39a558e0-8af9-11e4-912b-93de94b20075.png)
This PR focus on the second optimization and some simple ones of the first. For complex Numeric condition optimization, I will make a follow up PR.
Author: scwf <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>
Closes#3778 from scwf/filter1 and squashes the following commits:
58bcbc2 [scwf] minor format fix
9570211 [scwf] conflicts fix
527e6ce [scwf] minor comment improvements
5c6f134 [scwf] remove numeric optimizations and move to BooleanSimplification
546a82b [wangfei] style fix
825fa69 [wangfei] adding more tests
a001e8c [wangfei] revert pom changes
32a595b [scwf] improvement and test fix
e99a26c [wangfei] refactory And/Or optimization to make it more readable and clean
As part of SPARK-5193:
1. Removed UDFRegistration as a mixin in SQLContext and made it a field ("udf").
2. For Java UDFs, renamed dataType to returnType.
3. For Scala UDFs, added type tags.
4. Added all Java UDF registration methods to Scala's UDFRegistration.
5. Documentation
Author: Reynold Xin <rxin@databricks.com>
Closes#4056 from rxin/udf-registration and squashes the following commits:
ae9c556 [Reynold Xin] Updated example.
675a3c9 [Reynold Xin] Style fix
47c24ff [Reynold Xin] Python fix.
5f00c45 [Reynold Xin] Restore data type position in java udf and added typetags.
032f006 [Reynold Xin] [SPARK-5193][SQL] Reconcile Java and Scala UDFRegistration.
1. Removed the deprecated LocalHiveContext
2. Made private[sql] fields protected[sql] so they don't show up in javadoc.
3. Added javadoc to refreshTable.
4. Added Experimental tag to analyze command.
Author: Reynold Xin <rxin@databricks.com>
Closes#4054 from rxin/hivecontext-api and squashes the following commits:
25cc00a [Reynold Xin] Add implicit conversion back.
cbca886 [Reynold Xin] [SPARK-5193][SQL] Tighten up HiveContext API
1. Removed 2 implicits (logicalPlanToSparkQuery and baseRelationToSchemaRDD)
2. Moved extraStrategies into ExperimentalMethods.
3. Made private methods protected[sql] so they don't show up in javadocs.
4. Removed createParquetFile.
5. Added Java version of applySchema to SQLContext.
Author: Reynold Xin <rxin@databricks.com>
Closes#4049 from rxin/sqlContext-refactor and squashes the following commits:
a326a1a [Reynold Xin] Remove createParquetFile and add applySchema for Java to SQLContext.
ecd6685 [Reynold Xin] Added baseRelationToSchemaRDD back.
4a38c9b [Reynold Xin] [SPARK-5193][SQL] Tighten up SQLContext API
Declare SQLConf to be serializable to fix "Task not serializable" exceptions in SparkSQL
Author: Alex Baretta <alexbaretta@gmail.com>
Closes#4031 from alexbaretta/SPARK-5235-SQLConf and squashes the following commits:
c2103f5 [Alex Baretta] [SPARK-5235] Make SQLConf Serializable
`TaskContext.attemptId` is misleadingly-named, since it currently returns a taskId, which uniquely identifies a particular task attempt within a particular SparkContext, instead of an attempt number, which conveys how many times a task has been attempted.
This patch deprecates `TaskContext.attemptId` and add `TaskContext.taskId` and `TaskContext.attemptNumber` fields. Prior to this change, it was impossible to determine whether a task was being re-attempted (or was a speculative copy), which made it difficult to write unit tests for tasks that fail on early attempts or speculative tasks that complete faster than original tasks.
Earlier versions of the TaskContext docs suggest that `attemptId` behaves like `attemptNumber`, so there's an argument to be made in favor of changing this method's implementation. Since we've decided against making that change in maintenance branches, I think it's simpler to add better-named methods and retain the old behavior for `attemptId`; if `attemptId` behaved differently in different branches, then this would cause confusing build-breaks when backporting regression tests that rely on the new `attemptId` behavior.
Most of this patch is fairly straightforward, but there is a bit of trickiness related to Mesos tasks: since there's no field in MesosTaskInfo to encode the attemptId, I packed it into the `data` field alongside the task binary.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#3849 from JoshRosen/SPARK-4014 and squashes the following commits:
89d03e0 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
5cfff05 [Josh Rosen] Introduce wrapper for serializing Mesos task launch data.
38574d4 [Josh Rosen] attemptId -> taskAttemptId in PairRDDFunctions
a180b88 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
1d43aa6 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
eee6a45 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
0b10526 [Josh Rosen] Use putInt instead of putLong (silly mistake)
8c387ce [Josh Rosen] Use local with maxRetries instead of local-cluster.
cbe4d76 [Josh Rosen] Preserve attemptId behavior and deprecate it:
b2dffa3 [Josh Rosen] Address some of Reynold's minor comments
9d8d4d1 [Josh Rosen] Doc typo
1e7a933 [Josh Rosen] [SPARK-4014] Change TaskContext.attemptId to return attempt number instead of task ID.
fd515a5 [Josh Rosen] Add failing test for SPARK-4014
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#4000 from adrian-wang/comment and squashes the following commits:
9c24fc4 [Daoyuan Wang] some comments
jira: https://issues.apache.org/jira/browse/SPARK-5211
Author: Yin Huai <yhuai@databricks.com>
Closes#4026 from yhuai/SPARK-5211 and squashes the following commits:
15ee32b [Yin Huai] Remove extra line.
c6c1651 [Yin Huai] Get back HiveMetastoreTypes.toDataType.
rxin follow up of #3732
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#4041 from adrian-wang/decimal and squashes the following commits:
aa3d738 [Daoyuan Wang] fix auto refactor
7777a58 [Daoyuan Wang] move sql.types.decimal.Decimal to sql.types.Decimal
Mostly just moving stuff around. This should still be source compatible since we type aliased Row previously in org.apache.spark.sql.Row.
Added the following APIs to Row:
```scala
def getMap[K, V](i: Int): scala.collection.Map[K, V]
def getJavaMap[K, V](i: Int): java.util.Map[K, V]
def getSeq[T](i: Int): Seq[T]
def getList[T](i: Int): java.util.List[T]
def getStruct(i: Int): StructType
```
Author: Reynold Xin <rxin@databricks.com>
Closes#4030 from rxin/sql-row and squashes the following commits:
6c85c29 [Reynold Xin] Fixed style violation by adding a new line to Row.scala.
82b064a [Reynold Xin] [SPARK-5167][SQL] Move Row into sql package and make it usable for Java.
Having two versions of the data type APIs (one for Java, one for Scala) requires downstream libraries to also have two versions of the APIs if the library wants to support both Java and Scala. I took a look at the Scala version of the data type APIs - it can actually work out pretty well for Java out of the box.
As part of the PR, I created a sql.types package and moved all type definitions there. I then removed the Java specific data type API along with a lot of the conversion code.
This subsumes https://github.com/apache/spark/pull/3925
Author: Reynold Xin <rxin@databricks.com>
Closes#3958 from rxin/SPARK-5123-datatype-2 and squashes the following commits:
66505cc [Reynold Xin] [SPARK-5123] Expose only one version of the data type APIs (i.e. remove the Java-specific API).
This change should be binary and source backward compatible since we didn't change any user facing APIs.
Author: Reynold Xin <rxin@databricks.com>
Closes#3965 from rxin/SPARK-5168-sqlconf and squashes the following commits:
42eec09 [Reynold Xin] Fix default conf value.
0ef86cc [Reynold Xin] Fix constructor ordering.
4d7f910 [Reynold Xin] Properly override config.
ccc8e6a [Reynold Xin] [SPARK-5168] Make SQLConf a field rather than mixin in SQLContext
With changes in this PR, users can persist metadata of tables created based on the data source API in metastore through DDLs.
Author: Yin Huai <yhuai@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#3960 from yhuai/persistantTablesWithSchema2 and squashes the following commits:
069c235 [Yin Huai] Make exception messages user friendly.
c07cbc6 [Yin Huai] Get the location of test file in a correct way.
4456e98 [Yin Huai] Test data.
5315dfc [Yin Huai] rxin's comments.
7fc4b56 [Yin Huai] Add DDLStrategy and HiveDDLStrategy to plan DDLs based on the data source API.
aeaf4b3 [Yin Huai] Add comments.
06f9b0c [Yin Huai] Revert unnecessary changes.
feb88aa [Yin Huai] Merge remote-tracking branch 'apache/master' into persistantTablesWithSchema2
172db80 [Yin Huai] Fix unit test.
49bf1ac [Yin Huai] Unit tests.
8f8f1a1 [Yin Huai] [SPARK-4574][SQL] Adding support for defining schema in foreign DDL commands. #3431
f47fda1 [Yin Huai] Unit tests.
2b59723 [Michael Armbrust] Set external when creating tables
c00bb1b [Michael Armbrust] Don't use reflection to read options
1ea6e7b [Michael Armbrust] Don't fail when trying to uncache a table that doesn't exist
6edc710 [Michael Armbrust] Add tests.
d7da491 [Michael Armbrust] First draft of persistent tables.
Followup to #3870. Props to rahulaggarwalguavus for identifying the issue.
Author: Michael Armbrust <michael@databricks.com>
Closes#3990 from marmbrus/SPARK-5049 and squashes the following commits:
dd03e4e [Michael Armbrust] Fill in the partition values of parquet scans instead of using JoinedRow
Disables the Spark web UI in HiveThriftServer2Suite in order to prevent Jenkins test failures due to port contention.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#3998 from JoshRosen/SPARK-5200 and squashes the following commits:
a384416 [Josh Rosen] [SPARK-5200] Disable web UI in Hive Thriftserver tests.
Enable from follow multiple brackets:
```
select key from ((select * from testData limit 1) union all (select * from testData limit 1)) x limit 1
```
Author: scwf <wangfei1@huawei.com>
Closes#3853 from scwf/from and squashes the following commits:
14f110a [scwf] enable from follow multiple brackets
Author: wangfei <wangfei1@huawei.com>
Closes#3718 from scwf/sparksqlui and squashes the following commits:
e0d6b5d [wangfei] format fix
383b505 [wangfei] fix conflicts
4d2038a [wangfei] using setJobDescription
df79837 [wangfei] fix compile error
92ce834 [wangfei] show sql statement in spark ui when run sql use spark-sql
Author: Michael Armbrust <michael@databricks.com>
Closes#3987 from marmbrus/hiveUdfCaching and squashes the following commits:
8bca2fa [Michael Armbrust] [SPARK-5187][SQL] Fix caching of tables with HiveUDFs in the WHERE clause
https://issues.apache.org/jira/browse/SPARK-4963
SchemaRDD.sample() return wrong results due to GapSamplingIterator operating on mutable row.
HiveTableScan make RDD with SpecificMutableRow and SchemaRDD.sample() will return GapSamplingIterator for iterating.
override def next(): T = {
val r = data.next()
advance
r
}
GapSamplingIterator.next() return the current underlying element and assigned it to r.
However if the underlying iterator is mutable row just like what HiveTableScan returned, underlying iterator and r will point to the same object.
After advance operation, we drop some underlying elments and it also changed r which is not expected. Then we return the wrong value different from initial r.
To fix this issue, the most direct way is to make HiveTableScan return mutable row with copy just like the initial commit that I have made. This solution will make HiveTableScan can not get the full advantage of reusable MutableRow, but it can make sample operation return correct result.
Further more, we need to investigate GapSamplingIterator.next() and make it can implement copy operation inside it. To achieve this, we should define every elements that RDD can store implement the function like cloneable and it will make huge change.
Author: Yanbo Liang <yanbohappy@gmail.com>
Closes#3827 from yanbohappy/spark-4963 and squashes the following commits:
0912ca0 [Yanbo Liang] code format keep
65c4e7c [Yanbo Liang] import file and clear annotation
55c7c56 [Yanbo Liang] better output of test case
cea7e2e [Yanbo Liang] SchemaRDD add copy operation before Sample operator
e840829 [Yanbo Liang] HiveTableScan return mutable row with copy
Follow up for #3712.
This PR finally remove ```CommandStrategy``` and make all commands follow ```RunnableCommand``` so they can go with ```case r: RunnableCommand => ExecutedCommand(r) :: Nil```.
One exception is the ```DescribeCommand``` of hive, which is a special case and need to distinguish hive table and temporary table, so still keep ```HiveCommandStrategy``` here.
Author: scwf <wangfei1@huawei.com>
Closes#3948 from scwf/followup-SPARK-4861 and squashes the following commits:
6b48e64 [scwf] minor style fix
2c62e9d [scwf] fix for hive module
5a7a819 [scwf] Refactory command in spark sql
Adding support for defining schema in foreign DDL commands. Now foreign DDL support commands like:
```
CREATE TEMPORARY TABLE avroTable
USING org.apache.spark.sql.avro
OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")
```
With this PR user can define schema instead of infer from file, so support ddl command as follows:
```
CREATE TEMPORARY TABLE avroTable(a int, b string)
USING org.apache.spark.sql.avro
OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")
```
Author: scwf <wangfei1@huawei.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Fei Wang <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>
Closes#3431 from scwf/ddl and squashes the following commits:
7e79ce5 [Fei Wang] Merge pull request #22 from yhuai/pr3431yin
38f634e [Yin Huai] Remove Option from createRelation.
65e9c73 [Yin Huai] Revert all changes since applying a given schema has not been testd.
a852b10 [scwf] remove cleanIdentifier
f336a16 [Fei Wang] Merge pull request #21 from yhuai/pr3431yin
baf79b5 [Yin Huai] Test special characters quoted by backticks.
50a03b0 [Yin Huai] Use JsonRDD.nullTypeToStringType to convert NullType to StringType.
1eeb769 [Fei Wang] Merge pull request #20 from yhuai/pr3431yin
f5c22b0 [Yin Huai] Refactor code and update test cases.
f1cffe4 [Yin Huai] Revert "minor refactory"
b621c8f [scwf] minor refactory
d02547f [scwf] fix HiveCompatibilitySuite test failure
8dfbf7a [scwf] more tests for complex data type
ddab984 [Fei Wang] Merge pull request #19 from yhuai/pr3431yin
91ad91b [Yin Huai] Parse data types in DDLParser.
cf982d2 [scwf] fixed test failure
445b57b [scwf] address comments
02a662c [scwf] style issue
44eb70c [scwf] fix decimal parser issue
83b6fc3 [scwf] minor fix
9bf12f8 [wangfei] adding test case
7787ec7 [wangfei] added SchemaRelationProvider
0ba70df [wangfei] draft version
The pull only fixes the parsing error and changes API to use tableIdentifier. Joining different catalog datasource related change is not done in this pull.
Author: Alex Liu <alex_liu68@yahoo.com>
Closes#3941 from alexliu68/SPARK-SQL-4943-3 and squashes the following commits:
343ae27 [Alex Liu] [SPARK-4943][SQL] refactoring according to review
29e5e55 [Alex Liu] [SPARK-4943][SQL] fix failed Hive CTAS tests
6ae77ce [Alex Liu] [SPARK-4943][SQL] fix TestHive matching error
3652997 [Alex Liu] [SPARK-4943][SQL] Allow table name having dot to support db/catalog ...
Author: Alex Liu <alex_liu68@yahoo.com>
Closes#3766 from alexliu68/SPARK-SQL-4925 and squashes the following commits:
3137b51 [Alex Liu] [SPARK-4925][SQL] Remove sql/hive-thriftserver module from pom.xml
15f2e38 [Alex Liu] [SPARK-4925][SQL] Publish Spark SQL hive-thriftserver maven artifact
CaseInsensitiveMap throws java.io.NotSerializableException.
Author: luogankun <luogankun@gmail.com>
Closes#3944 from luogankun/SPARK-5141 and squashes the following commits:
b6d63d5 [luogankun] [SPARK-5141]CaseInsensitiveMap throws java.io.NotSerializableException
This change does a few things to make the hadoop-provided profile more useful:
- Create new profiles for other libraries / services that might be provided by the infrastructure
- Simplify and fix the poms so that the profiles are only activated while building assemblies.
- Fix tests so that they're able to run when the profiles are activated
- Add a new env variable to be used by distributions that use these profiles to provide the runtime
classpath for Spark jobs and daemons.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#2982 from vanzin/SPARK-4048 and squashes the following commits:
82eb688 [Marcelo Vanzin] Add a comment.
eb228c0 [Marcelo Vanzin] Fix borked merge.
4e38f4e [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
9ef79a3 [Marcelo Vanzin] Alternative way to propagate test classpath to child processes.
371ebee [Marcelo Vanzin] Review feedback.
52f366d [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
83099fc [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
7377e7b [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
322f882 [Marcelo Vanzin] Fix merge fail.
f24e9e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
8b00b6a [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
9640503 [Marcelo Vanzin] Cleanup child process log message.
115fde5 [Marcelo Vanzin] Simplify a comment (and make it consistent with another pom).
e3ab2da [Marcelo Vanzin] Fix hive-thriftserver profile.
7820d58 [Marcelo Vanzin] Fix CliSuite with provided profiles.
1be73d4 [Marcelo Vanzin] Restore flume-provided profile.
d1399ed [Marcelo Vanzin] Restore jetty dependency.
82a54b9 [Marcelo Vanzin] Remove unused profile.
5c54a25 [Marcelo Vanzin] Fix HiveThriftServer2Suite with *-provided profiles.
1fc4d0b [Marcelo Vanzin] Update dependencies for hive-thriftserver.
f7b3bbe [Marcelo Vanzin] Add snappy to hadoop-provided list.
9e4e001 [Marcelo Vanzin] Remove duplicate hive profile.
d928d62 [Marcelo Vanzin] Redirect child stderr to parent's log.
4d67469 [Marcelo Vanzin] Propagate SPARK_DIST_CLASSPATH on Yarn.
417d90e [Marcelo Vanzin] Introduce "SPARK_DIST_CLASSPATH".
2f95f0d [Marcelo Vanzin] Propagate classpath to child processes during testing.
1adf91c [Marcelo Vanzin] Re-enable maven-install-plugin for a few projects.
284dda6 [Marcelo Vanzin] Rework the "hadoop-provided" profile, add new ones.
This PR:
- Reenables `surefire`, and copies config from `scalatest` (which is itself an old fork of `surefire`, so similar)
- Tells `surefire` to test only Java tests
- Enables `surefire` and `scalatest` for all children, and in turn eliminates some duplication.
For me this causes the Scala and Java tests to be run once each, it seems, as desired. It doesn't affect the SBT build but works for Maven. I still need to verify that all of the Scala tests and Java tests are being run.
Author: Sean Owen <sowen@cloudera.com>
Closes#3651 from srowen/SPARK-4159 and squashes the following commits:
2e8a0af [Sean Owen] Remove specialized SPARK_HOME setting for REPL, YARN tests as it appears to be obsolete
12e4558 [Sean Owen] Append to unit-test.log instead of overwriting, so that both surefire and scalatest output is preserved. Also standardize/correct comments a bit.
e6f8601 [Sean Owen] Reenable Java tests by reenabling surefire with config cloned from scalatest; centralize test config in the parent
Author: Reynold Xin <rxin@databricks.com>
Closes#3862 from rxin/stringcontext-attr and squashes the following commits:
9b10f57 [Reynold Xin] Rename StrongToAttributeConversionHelper
72121af [Reynold Xin] [SPARK-5040][SQL] Support expressing unresolved attributes using $"attribute name" notation in SQL DSL.
As we learned in https://github.com/apache/spark/pull/3580, not explicitly typing implicit functions can lead to compiler bugs and potentially unexpected runtime behavior.
Author: Reynold Xin <rxin@databricks.com>
Closes#3859 from rxin/sql-implicits and squashes the following commits:
30c2c24 [Reynold Xin] [SPARK-5038] Add explicit return type for implicit functions in Spark SQL.
JIRA issue: [SPARK-4570](https://issues.apache.org/jira/browse/SPARK-4570)
We are planning to create a `BroadcastLeftSemiJoinHash` to implement the broadcast join for `left semijoin`
In left semijoin :
If the size of data from right side is smaller than the user-settable threshold `AUTO_BROADCASTJOIN_THRESHOLD`,
the planner would mark it as the `broadcast` relation and mark the other relation as the stream side. The broadcast table will be broadcasted to all of the executors involved in the join, as a `org.apache.spark.broadcast.Broadcast` object. It will use `joins.BroadcastLeftSemiJoinHash`.,else it will use `joins.LeftSemiJoinHash`.
The benchmark suggests these made the optimized version 4x faster when `left semijoin`
<pre><code>
Original:
left semi join : 9288 ms
Optimized:
left semi join : 1963 ms
</code></pre>
The micro benchmark load `data1/kv3.txt` into a normal Hive table.
Benchmark code:
<pre><code>
def benchmark(f: => Unit) = {
val begin = System.currentTimeMillis()
f
val end = System.currentTimeMillis()
end - begin
}
val sc = new SparkContext(
new SparkConf()
.setMaster("local")
.setAppName(getClass.getSimpleName.stripSuffix("$")))
val hiveContext = new HiveContext(sc)
import hiveContext._
sql("drop table if exists left_table")
sql("drop table if exists right_table")
sql( """create table left_table (key int, value string)
""".stripMargin)
sql( s"""load data local inpath "/data1/kv3.txt" into table left_table""")
sql( """create table right_table (key int, value string)
""".stripMargin)
sql(
"""
|from left_table
|insert overwrite table right_table
|select left_table.key, left_table.value
""".stripMargin)
val leftSimeJoin = sql(
"""select a.key from left_table a
|left semi join right_table b on a.key = b.key""".stripMargin)
val leftSemiJoinDuration = benchmark(leftSimeJoin.count())
println(s"left semi join : $leftSemiJoinDuration ms ")
</code></pre>
Author: wangxiaojing <u9jing@gmail.com>
Closes#3442 from wangxiaojing/SPARK-4570 and squashes the following commits:
a4a43c9 [wangxiaojing] rebase
f103983 [wangxiaojing] change style
fbe4887 [wangxiaojing] change style
ff2e618 [wangxiaojing] add testsuite
1a8da2a [wangxiaojing] add BroadcastLeftSemiJoinHash
If we passed in a wrong sql like ```abdcdfsfs```, the spark-sql script aborted.
Author: wangfei <wangfei1@huawei.com>
Author: Fei Wang <wangfei1@huawei.com>
Closes#3761 from scwf/patch-10 and squashes the following commits:
46dc344 [Fei Wang] revert console.printError(rc.getErrorMessage())
0330e07 [wangfei] avoid to print error message repeatedly
1614a11 [wangfei] spark-sql abort when passed in a wrong sql
Convert type of RowWriteSupport.attributes to Array.
Analysis of performance for writing very wide tables shows that time is spent predominantly in apply method on attributes var. Type of attributes previously was LinearSeqOptimized and apply is O(N) which made write O(N squared).
Measurements on 575 column table showed this change made a 6x improvement in write times.
Author: Michael Davies <Michael.BellDavies@gmail.com>
Closes#3843 from MickDavies/SPARK-4386 and squashes the following commits:
892519d [Michael Davies] [SPARK-4386] Improve performance when writing Parquet files
This PR is a simplified version of several filter optimization rules introduced in #3778 authored by scwf. Newly introduced optimizations include:
1. `a && a` => `a`
2. `a || a` => `a`
3. `(a || b || c || ...) && (a || b || d || ...)` => `a && b && (c || d || ...)`
The 3rd rule is particularly useful for optimizing the following query, which is planned into a cartesian product
```sql
SELECT *
FROM t1, t2
WHERE (t1.key = t2.key AND t1.value > 10)
OR (t1.key = t2.key AND t2.value < 20)
```
to the following one, which is planned into an equi-join:
```sql
SELECT *
FROM t1, t2
WHERE t1.key = t2.key
AND (t1.value > 10 OR t2.value < 20)
```
The example above is quite artificial, but common predicates are likely to appear in real life complex queries (like the one mentioned in #3778).
A difference between this PR and #3778 is that these optimizations are not limited to `Filter`, but are generalized to all logical plan nodes. Thanks to scwf for bringing up these optimizations, and chenghao-intel for the generalization suggestion.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3784)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3784 from liancheng/normalize-filters and squashes the following commits:
caca560 [Cheng Lian] Moves filter normalization into BooleanSimplification rule
4ab3a58 [Cheng Lian] Fixes test failure, adds more tests
5d54349 [Cheng Lian] Fixes typo in comment
2abbf8e [Cheng Lian] Forgot our sacred Apache licence header...
cf95639 [Cheng Lian] Adds an optimization rule for filter normalization
case operator with decimal between different precision, we need change them to unlimited
Author: guowei2 <guowei2@asiainfo.com>
Closes#3767 from guowei2/SPARK-4928 and squashes the following commits:
c6a6e3e [guowei2] fix code style
3214e0a [guowei2] add test case
b4985a2 [guowei2] fix code style
27adf42 [guowei2] Fix: Operation '>,<,>=,<=' with Decimal report error
This is a follow-up of #3367 and #3644.
At the time #3644 was written, #3367 hadn't been merged yet, thus `IsNull` and `IsNotNull` filters are not covered in the first version of `ParquetFilterSuite`. This PR adds corresponding test cases.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3748)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3748 from liancheng/test-null-filters and squashes the following commits:
1ab943f [Cheng Lian] IsNull and IsNotNull Parquet filter test case for boolean type
bcd616b [Cheng Lian] Adds Parquet filter pushedown tests for IsNull and IsNotNull
It will cause exception while do query like:
SELECT key+key FROM src sort by value;
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3386 from chenghao-intel/sort and squashes the following commits:
38c78cc [Cheng Hao] revert the SortPartition in SparkStrategies
7e9dd15 [Cheng Hao] update the typo
fcd1d64 [Cheng Hao] rebase the latest master and update the SortBy unit test
spark sql does not support ```SELECT a, b FROM testData2 ORDER BY a desc, b```.
Author: wangfei <wangfei1@huawei.com>
Closes#3838 from scwf/orderby and squashes the following commits:
114b64a [wangfei] remove nouse methods
48145d3 [wangfei] fix order, using asc by default
Since #3429 has been merged, the bug of wrapping to Writable for HiveGenericUDF is resolved, we can safely remove the foldable checking in `HiveGenericUdf.eval`, which discussed in #2802.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3745 from chenghao-intel/generic_udf and squashes the following commits:
622ad03 [Cheng Hao] Remove the unnecessary code change in Generic UDF
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3796 from chenghao-intel/spark_4959 and squashes the following commits:
3ec08f8 [Cheng Hao] Replace the attribute in comparing its exprId other than itself
HiveInspectorSuite test failure:
[info] - wrap / unwrap null, constant null and writables *** FAILED *** (21 milliseconds)
[info] 1 did not equal 0 (HiveInspectorSuite.scala:136)
this is because the origin date(is 3914-10-23) not equals the date returned by ```unwrap```(is 3914-10-22).
Setting TimeZone and Locale fix this.
Another minor change here is rename ```def checkValues(v1: Any, v2: Any): Unit``` to ```def checkValue(v1: Any, v2: Any): Unit ``` to make the code more clear
Author: scwf <wangfei1@huawei.com>
Author: Fei Wang <wangfei1@huawei.com>
Closes#3814 from scwf/fix-inspectorsuite and squashes the following commits:
d8531ef [Fei Wang] Delete test.log
72b19a9 [scwf] fix HiveInspectorSuite test error
This is a follow up of #3396 , just add a test to white list.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3826 from adrian-wang/viewtest and squashes the following commits:
f105f68 [Daoyuan Wang] enable view test
This is just a quick fix that locks when calling `runHive`. If we can find a way to avoid the error without a global lock that would be better.
Author: Michael Armbrust <michael@databricks.com>
Closes#3834 from marmbrus/hiveConcurrency and squashes the following commits:
bf25300 [Michael Armbrust] prevent multiple concurrent hive native commands
Creates a top level directory script (as `build/mvn`) to automatically download zinc and the specific version of scala used to easily build spark. This will also download and install maven if the user doesn't already have it and all packages are hosted under the `build/` directory. Tested on both Linux and OSX OS's and both work. All commands pass through to the maven binary so it acts exactly as a traditional maven call would.
Author: Brennon York <brennon.york@capitalone.com>
Closes#3707 from brennonyork/SPARK-4501 and squashes the following commits:
0e5a0e4 [Brennon York] minor incorrect doc verbage (with -> this)
9b79e38 [Brennon York] fixed merge conflicts with dev/run-tests, properly quoted args in sbt/sbt, fixed bug where relative paths would fail if passed in from build/mvn
d2d41b6 [Brennon York] added blurb about leverging zinc with build/mvn
b979c58 [Brennon York] updated the merge conflict
c5634de [Brennon York] updated documentation to overview build/mvn, updated all points where sbt/sbt was referenced with build/sbt
b8437ba [Brennon York] set progress bars for curl and wget when not run on jenkins, no progress bar when run on jenkins, moved sbt script to build/sbt, wrote stub and warning under sbt/sbt which calls build/sbt, modified build/sbt to use the correct directory, fixed bug in build/sbt-launch-lib.bash to correctly pull the sbt version
be11317 [Brennon York] added switch to silence download progress only if AMPLAB_JENKINS is set
28d0a99 [Brennon York] updated to remove the python dependency, uses grep instead
7e785a6 [Brennon York] added silent and quiet flags to curl and wget respectively, added single echo output to denote start of a download if download is needed
14a5da0 [Brennon York] removed unnecessary zinc output on startup
1af4a94 [Brennon York] fixed bug with uppercase vs lowercase variable
3e8b9b3 [Brennon York] updated to properly only restart zinc if it was freshly installed
a680d12 [Brennon York] Added comments to functions and tested various mvn calls
bb8cc9d [Brennon York] removed package files
ef017e6 [Brennon York] removed OS complexities, setup generic install_app call, removed extra file complexities, removed help, removed forced install (defaults now), removed double-dash from cli
07bf018 [Brennon York] Updated to specifically handle pulling down the correct scala version
f914dea [Brennon York] Beginning final portions of localized scala home
69c4e44 [Brennon York] working linux and osx installers for purely local mvn build
4a1609c [Brennon York] finalizing working linux install for maven to local ./build/apache-maven folder
cbfcc68 [Brennon York] Changed the default sbt/sbt to build/sbt and added a build/mvn which will automatically download, install, and execute maven with zinc for easier build capability
There are a number of warnings generated in a normal, successful build right now. They're mostly Java unchecked cast warnings, which can be suppressed. But there's a grab bag of other Scala language warnings and so on that can all be easily fixed. The forthcoming PR fixes about 90% of the build warnings I see now.
Author: Sean Owen <sowen@cloudera.com>
Closes#3157 from srowen/SPARK-4297 and squashes the following commits:
8c9e469 [Sean Owen] Suppress unchecked cast warnings, and several other build warning fixes
This PR modifies the python `SchemaRDD` to use `sample()` and `takeSample()` from Scala instead of the slower python implementations from `rdd.py`. This is worthwhile because the `Row`'s are already serialized as Java objects.
In order to use the faster `takeSample()`, a `takeSampleToPython()` method was implemented in `SchemaRDD.scala` following the pattern of `collectToPython()`.
Author: jbencook <jbenjamincook@gmail.com>
Author: J. Benjamin Cook <jbenjamincook@gmail.com>
Closes#3764 from jbencook/master and squashes the following commits:
6fbc769 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing sloppy indentation for takeSampleToPython() arguments
5170da2 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing typo: from RDD to SchemaRDD
de22f70 [jbencook] [SPARK-4860][pyspark][sql] using sample() method from JavaSchemaRDD
b916442 [jbencook] [SPARK-4860][pyspark][sql] adding sample() to JavaSchemaRDD
020cbdf [jbencook] [SPARK-4860][pyspark][sql] using Scala implementations of `sample()` and `takeSample()`
Minor fix for an obvious scala doc error.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#3751 from viirya/fix_scaladoc and squashes the following commits:
03fddaa [Liang-Chi Hsieh] Fix scala doc.
HiveInspectors.scala failed in compiling with Hadoop 1, as the BytesWritable.copyBytes is not available in Hadoop 1.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3742 from chenghao-intel/settable_oi_hotfix and squashes the following commits:
bb04d1f [Cheng Hao] hot fix for ByteWritables.copyBytes
Hive UDAF may create an customized object constructed by SettableStructObjectInspector, this is critical when integrate Hive UDAF with the refactor-ed UDAF interface.
Performance issue in `wrap/unwrap` since more match cases added, will do it in another PR.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3429 from chenghao-intel/settable_oi and squashes the following commits:
9f0aff3 [Cheng Hao] update code style issues as feedbacks
2b0561d [Cheng Hao] Add more scala doc
f5a40e8 [Cheng Hao] add scala doc
2977e9b [Cheng Hao] remove the timezone setting for test suite
3ed284c [Cheng Hao] fix the date type comparison
f1b6749 [Cheng Hao] Update the comment
932940d [Cheng Hao] Add more unit test
72e4332 [Cheng Hao] Add settable StructObjectInspector support
Adding support to the partial aggregation of SumDistinct
Author: ravipesala <ravindra.pesala@huawei.com>
Closes#3348 from ravipesala/SPARK-2554 and squashes the following commits:
fd28e4d [ravipesala] Fixed review comments
e60e67f [ravipesala] Fixed test cases and made it as nullable
32fe234 [ravipesala] Supporting SumDistinct partial aggregation Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala
The sql "select * from spark_test::for_test where abs(20141202) is not null" has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)) and
partitionKeyIds=AttributeSet(). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)). Then the exception "java.lang.IllegalArgumentException: requirement failed: Partition pruning predicates only supported for partitioned tables." is thrown.
The sql "select * from spark_test::for_test_partitioned_table where abs(20141202) is not null and type_id=11 and platform = 3" with partitioned key insert_date has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202), (type_id#12 = 11), (platform#8 = 3)) and partitionKeyIds=AttributeSet(insert_date#24). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)).
Author: YanTangZhai <hakeemzhai@tencent.com>
Author: yantangzhai <tyz0303@163.com>
Closes#3556 from YanTangZhai/SPARK-4693 and squashes the following commits:
620ebe3 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
37cfdf5 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
70a3544 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
efa9b03 [YanTangZhai] Update HiveQuerySuite.scala
72accf1 [YanTangZhai] Update HiveQuerySuite.scala
e572b9a [YanTangZhai] Update HiveStrategies.scala
6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
e249846 [YanTangZhai] Merge pull request #10 from apache/master
d26d982 [YanTangZhai] Merge pull request #9 from apache/master
76d4027 [YanTangZhai] Merge pull request #8 from apache/master
03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
**sessionToActivePool** in **SparkSQLOperationManager** grow infinitely, even as sessions expire.
we should remove the pool value when the session closed, even though **sessionToActivePool** would not exist in all of sessions.
Author: guowei2 <guowei2@asiainfo.com>
Closes#3617 from guowei2/SPARK-4756 and squashes the following commits:
e9b97b8 [guowei2] fix compile bug with Shim12
cf0f521 [guowei2] Merge remote-tracking branch 'apache/master' into SPARK-4756
e070998 [guowei2] fix: remove active pool of the session when it expired
...arquetFile accept hadoop glob pattern in path.
Author: Thu Kyaw <trk007@gmail.com>
Closes#3407 from tkyaw/master and squashes the following commits:
19115ad [Thu Kyaw] Merge https://github.com/apache/spark
ceded32 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
d322c28 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
ce677c6 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
Add support for `GROUPING SETS`, `ROLLUP`, `CUBE` and the the virtual column `GROUPING__ID`.
More details on how to use the `GROUPING SETS" can be found at: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rolluphttps://issues.apache.org/jira/secure/attachment/12676811/grouping_set.pdf
The generic idea of the implementations are :
1 Replace the `ROLLUP`, `CUBE` with `GROUPING SETS`
2 Explode each of the input row, and then feed them to `Aggregate`
* Each grouping set are represented as the bit mask for the `GroupBy Expression List`, for each bit, `1` means the expression is selected, otherwise `0` (left is the lower bit, and right is the higher bit in the `GroupBy Expression List`)
* Several of projections are constructed according to the grouping sets, and within each projection(Seq[Expression), we replace those expressions with `Literal(null)` if it's not selected in the grouping set (based on the bit mask)
* Output Schema of `Explode` is `child.output :+ grouping__id`
* GroupBy Expressions of `Aggregate` is `GroupBy Expression List :+ grouping__id`
* Keep the `Aggregation expressions` the same for the `Aggregate`
The expressions substitutions happen in Logic Plan analyzing, so we will benefit from the Logical Plan optimization (e.g. expression constant folding, and map side aggregation etc.), Only an `Explosive` operator added for Physical Plan, which will explode the rows according the pre-set projections.
A known issue will be done in the follow up PR:
* Optimization `ColumnPruning` is not supported yet for `Explosive` node.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#1567 from chenghao-intel/grouping_sets and squashes the following commits:
fe65fcc [Cheng Hao] Remove the extra space
3547056 [Cheng Hao] Add more doc and Simplify the Expand
a7c869d [Cheng Hao] update code as feedbacks
d23c672 [Cheng Hao] Add GroupingExpression to replace the Seq[Expression]
414b165 [Cheng Hao] revert the unnecessary changes
ec276c6 [Cheng Hao] Support Rollup/Cube/GroupingSets
```
TestSQLContext.sparkContext.parallelize(
"""{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
"""{"ip":"27.31.100.29","headers":{}}""" ::
"""{"ip":"27.31.100.29","headers":""}""" :: Nil)
```
As empty string (the "headers") will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type "headers" in line 1), and also take the line 1 (the "headers") as String Type, which is not our expected.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3708 from chenghao-intel/json and squashes the following commits:
e7a72e9 [Cheng Hao] add more concise unit test
853de51 [Cheng Hao] NullType instead of StringType when sampling against empty string or null value
Author: Michael Armbrust <michael@databricks.com>
Closes#3727 from marmbrus/parquetNotEq and squashes the following commits:
2157bfc [Michael Armbrust] Fix parquet filter suite
In local mode, Hadoop/Hive will ignore the "mapred.map.tasks", hence for small table file, it's always a single input split, however, SparkSQL doesn't honor that in table scanning, and we will get different result when do the Hive Compatibility test. This PR will fix that.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#2589 from chenghao-intel/source_split and squashes the following commits:
dff38e7 [Cheng Hao] Remove the extra blank line
160a2b6 [Cheng Hao] fix the compiling bug
04d67f7 [Cheng Hao] Keep 1 split for small file in table scanning
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3616 from adrian-wang/sqrt and squashes the following commits:
d877439 [Daoyuan Wang] fix NULLTYPE
3effa2c [Daoyuan Wang] sqrt(negative value) should return null
Predicates like `a = NULL` and `a < NULL` can't be pushed down since Parquet `Lt`, `LtEq`, `Gt`, `GtEq` doesn't accept null value. Note that `Eq` and `NotEq` can only be used with `null` to represent predicates like `a IS NULL` and `a IS NOT NULL`.
However, normally this issue doesn't cause NPE because any value compared to `NULL` results `NULL`, and Spark SQL automatically optimizes out `NULL` predicate in the `SimplifyFilters` rule. Only testing code that intentionally disables the optimizer may trigger this issue. (That's why this issue is not marked as blocker and I do **NOT** think we need to backport this to branch-1.1
This PR restricts `Lt`, `LtEq`, `Gt` and `GtEq` to non-null values only, and only uses `Eq` with null value to pushdown `IsNull` and `IsNotNull`. Also, added support for Parquet `NotEq` filter for completeness and (tiny) performance gain, it's also used to pushdown `IsNotNull`.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3367)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3367 from liancheng/filters-with-null and squashes the following commits:
cc41281 [Cheng Lian] Fixes several styling issues
de7de28 [Cheng Lian] Adds stricter rules for Parquet filters with null
Based on #2543.
Author: Michael Armbrust <michael@databricks.com>
Closes#3724 from marmbrus/resolveGetField and squashes the following commits:
0a47aae [Michael Armbrust] Fix case insensitive resolution of GetField.
HiveThriftServer2 can not exit automactic when changing the standy resource manager in Yarn HA mode.
The scheduler backend was aware of the AM had been exited so it call sc.stop to exit the driver process but there was a user thread(HiveThriftServer2 ) which was still alive and cause this problem.
To fix it, make a demo thread to detect the sparkContext is null or not.If the sc is stopped, call the ThriftServer.stop to stop the user thread.
Author: carlmartin <carlmartinmax@gmail.com>
Closes#3576 from SaintBacchus/ThriftServer2ExitBug and squashes the following commits:
2890b4a [carlmartin] Use SparkListener instead of the demo thread to stop the hive server.
c15da0e [carlmartin] HiveThriftServer2 can not exit automactic when changing the standy resource manager in Yarn HA mode
Add `sort by` support for both DSL & SqlParser.
This PR is relevant with #3386, either one merged, will cause the other rebased.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3481 from chenghao-intel/sortby and squashes the following commits:
041004f [Cheng Hao] Add sort by for DSL & SimpleSqlParser
Using lowercase for ```options``` key to make it case-insensitive, then we should use lower case to get value from parameters.
So flowing cmd work
```
create temporary table normal_parquet
USING org.apache.spark.sql.parquet
OPTIONS (
PATH '/xxx/data'
)
```
Author: scwf <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>
Closes#3470 from scwf/ddl-ulcase and squashes the following commits:
ae78509 [scwf] address comments
8f4f585 [wangfei] address comments
3c132ef [scwf] minor fix
a0fc20b [scwf] Merge branch 'master' of https://github.com/apache/spark into ddl-ulcase
4f86401 [scwf] adding CaseInsensitiveMap
e244e8d [wangfei] using lower case in json
e0cb017 [wangfei] make options in-casesensitive
This PR brings support of using StructType(and other hashable types) as key in MapType.
Author: Davies Liu <davies@databricks.com>
Closes#3714 from davies/fix_struct_in_map and squashes the following commits:
68585d7 [Davies Liu] fix primitive types in MapType
9601534 [Davies Liu] support StructType as key in MapType
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3595 from chenghao-intel/udf0 and squashes the following commits:
a858973 [Cheng Hao] Add 0 arguments support for udf
This is a follow-up of SPARK-4593 (#3443).
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#3581 from ueshin/issues/SPARK-4720 and squashes the following commits:
c3959d4 [Takuya UESHIN] Make Remainder return null if the divider is 0.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3606 from chenghao-intel/codegen_short_circuit and squashes the following commits:
f466303 [Cheng Hao] short circuit for AND & OR
This PR provides a set Parquet testing API (see trait `ParquetTest`) that enables developers to write more concise test cases. A new set of Parquet test suites built upon this API are added and aim to replace the old `ParquetQuerySuite`. To avoid potential merge conflicts, old testing code are not removed yet. The following classes can be safely removed after most Parquet related PRs are handled:
- `ParquetQuerySuite`
- `ParquetTestData`
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3644)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3644 from liancheng/parquet-tests and squashes the following commits:
800e745 [Cheng Lian] Enforces ordering of test output
3bb8731 [Cheng Lian] Refactors HiveParquetSuite
aa2cb2e [Cheng Lian] Decouples ParquetTest and TestSQLContext
7b43a68 [Cheng Lian] Updates ParquetTest Scaladoc
7f07af0 [Cheng Lian] Adds a new set of Parquet test suites
In BroadcastHashJoin, currently it is using a hard coded value (5 minutes) to wait for the execution and broadcast of the small table.
In my opinion, it should be a configurable value since broadcast may exceed 5 minutes in some case, like in a busy/congested network environment.
Author: Jacky Li <jacky.likun@huawei.com>
Closes#3133 from jackylk/timeout-config and squashes the following commits:
733ac08 [Jacky Li] add spark.sql.broadcastTimeout in SQLConf.scala
557acd4 [Jacky Li] switch to sqlContext.getConf
81a5e20 [Jacky Li] make wait time configurable in BroadcastHashJoin
Since `AttributeReference` resolution and `*` expansion are currently in separate rules, each pair requires a full iteration instead of being able to resolve in a single pass. Since its pretty easy to construct queries that have many of these in a row, I combine them into a single rule in this PR.
Author: Michael Armbrust <michael@databricks.com>
Closes#3674 from marmbrus/projectStars and squashes the following commits:
d83d6a1 [Michael Armbrust] Fix resolution of deeply nested Project(attr, Project(Star,...)).
In `HashOuterJoin.scala`, spark read data from both side of join operation before zip them together. It is a waste for memory. We are trying to read data from only one side, put them into a hashmap, and then generate the `JoinedRow` with data from other side one by one.
Currently, we could only do this optimization for `left outer join` and `right outer join`. For `full outer join`, we will do something in another issue.
for
table test_csv contains 1 million records
table dim_csv contains 10 thousand records
SQL:
`select * from test_csv a left outer join dim_csv b on a.key = b.key`
the result is:
master:
```
CSV: 12671 ms
CSV: 9021 ms
CSV: 9200 ms
Current Mem Usage:787788984
```
after patch:
```
CSV: 10382 ms
CSV: 7543 ms
CSV: 7469 ms
Current Mem Usage:208145728
```
Author: tianyi <tianyi@asiainfo-linkage.com>
Author: tianyi <tianyi.asiainfo@gmail.com>
Closes#3375 from tianyi/SPARK-4483 and squashes the following commits:
72a8aec [tianyi] avoid having mutable state stored inside of the task
99c5c97 [tianyi] performance optimization
d2f94d7 [tianyi] fix bug: missing output when the join-key is null.
2be45d1 [tianyi] fix spell bug
1f2c6f1 [tianyi] remove commented codes
a676de6 [tianyi] optimize some codes
9e7d5b5 [tianyi] remove commented old codes
838707d [tianyi] Optimization about reduce memory costs during the HashOuterJoin
The problem is `codegenEnabled` is `val`, but it uses a `val` `sqlContext`, which can be override by subclasses. Here is a simple example to show this issue.
```Scala
scala> :paste
// Entering paste mode (ctrl-D to finish)
abstract class Foo {
protected val sqlContext = "Foo"
val codegenEnabled: Boolean = {
println(sqlContext) // it will call subclass's `sqlContext` which has not yet been initialized.
if (sqlContext != null) {
true
} else {
false
}
}
}
class Bar extends Foo {
override val sqlContext = "Bar"
}
println(new Bar().codegenEnabled)
// Exiting paste mode, now interpreting.
null
false
defined class Foo
defined class Bar
```
We should make `sqlContext` `final` to prevent subclasses from overriding it incorrectly.
Author: zsxwing <zsxwing@gmail.com>
Closes#3660 from zsxwing/SPARK-4812 and squashes the following commits:
1cbb623 [zsxwing] Make `sqlContext` final to prevent subclasses from overriding it incorrectly
Author: jerryshao <saisai.shao@intel.com>
Closes#3698 from jerryshao/SPARK-4847 and squashes the following commits:
4741130 [jerryshao] Make later added extraStrategies effect when calling strategies
Add HTTP protocol support and test cases to spark thrift server, so users can deploy thrift server in both TCP and http mode.
Author: Judy Nash <judynash@microsoft.com>
Author: judynash <judynash@microsoft.com>
Closes#3672 from judynash/master and squashes the following commits:
526315d [Judy Nash] correct spacing on startThriftServer method
31a6520 [Judy Nash] fix code style issues and update sql programming guide format issue
47bf87e [Judy Nash] modify withJdbcStatement method definition to meet less than 100 line length
2e9c11c [Judy Nash] add thrift server in http mode documentation on sql programming guide
1cbd305 [Judy Nash] Merge remote-tracking branch 'upstream/master'
2b1d312 [Judy Nash] updated http thrift server support based on feedback
377532c [judynash] add HTTP protocol spark thrift server
This enables assertions for the Maven and SBT build, but overrides the Hive module to not enable assertions.
Author: Sean Owen <sowen@cloudera.com>
Closes#3692 from srowen/SPARK-4814 and squashes the following commits:
caca704 [Sean Owen] Disable assertions just for Hive
f71e783 [Sean Owen] Enable assertions for SBT and Maven build
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3676 from adrian-wang/countexpr and squashes the following commits:
dc5765b [Daoyuan Wang] add rule to fold count(expr) if expr is not null
When I use Parquet File as a output file using ParquetOutputFormat#getDefaultWorkFile, the file name is not zero padded while RDD#saveAsText does zero padding.
Author: Sasaki Toru <sasakitoa@nttdata.co.jp>
Closes#3602 from sasakitoa/parquet-zeroPadding and squashes the following commits:
6b0e58f [Sasaki Toru] Merge branch 'master' of git://github.com/apache/spark into parquet-zeroPadding
20dc79d [Sasaki Toru] Fixed the name of Parquet File generated by AppendingParquetOutputFormat
Fix bug when query like:
```
test("save join to table") {
val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString))
sql("CREATE TABLE test1 (key INT, value STRING)")
testData.insertInto("test1")
sql("CREATE TABLE test2 (key INT, value STRING)")
testData.insertInto("test2")
testData.insertInto("test2")
sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").saveAsTable("test")
checkAnswer(
table("test"),
sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq)
}
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3673 from chenghao-intel/spark_4825 and squashes the following commits:
e8cbd56 [Cheng Hao] alternate the pattern matching order for logical plan:CTAS
e004895 [Cheng Hao] fix bug
This is fixed by SPARK-4318 #3184
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3445 from adrian-wang/emptyaggr and squashes the following commits:
982575e [Daoyuan Wang] enable empty aggr test case
So the optimizations are not valid. Also I think the optimization here is rarely encounter, so removing them will not have influence on performance.
Can we merge #3445 before I add a comparison test case from this?
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3675 from adrian-wang/sumempty and squashes the following commits:
42df763 [Daoyuan Wang] sum and avg on empty table should always return null
a follow up of #3547
/cc marmbrus
Author: scwf <wangfei1@huawei.com>
Closes#3563 from scwf/rnc and squashes the following commits:
9395661 [scwf] remove unnecessary condition
Inserting data of type including `ArrayType.containsNull == false` or `MapType.valueContainsNull == false` or `StructType.fields.exists(_.nullable == false)` into Hive table will fail because `Cast` inserted by `HiveMetastoreCatalog.PreInsertionCasts` rule of `Analyzer` can't handle these types correctly.
Complex type cast rule proposal:
- Cast for non-complex types should be able to cast the same as before.
- Cast for `ArrayType` can evaluate if
- Element type can cast
- Nullability rule doesn't break
- Cast for `MapType` can evaluate if
- Key type can cast
- Nullability for casted key type is `false`
- Value type can cast
- Nullability rule for value type doesn't break
- Cast for `StructType` can evaluate if
- The field size is the same
- Each field can cast
- Nullability rule for each field doesn't break
- The nested structure should be the same.
Nullability rule:
- If the casted type is `nullable == true`, the target nullability should be `true`
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#3150 from ueshin/issues/SPARK-4293 and squashes the following commits:
e935939 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4293
ba14003 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4293
8999868 [Takuya UESHIN] Fix a test title.
f677c30 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4293
287f410 [Takuya UESHIN] Add tests to insert data of types ArrayType / MapType / StructType with nullability is false into Hive table.
4f71bb8 [Takuya UESHIN] Make Cast be able to handle complex types.
fix a TODO in Analyzer:
// TODO: pass this in as a parameter
val fixedPoint = FixedPoint(100)
Author: Jacky Li <jacky.likun@huawei.com>
Closes#3499 from jackylk/config and squashes the following commits:
4c1252c [Jacky Li] fix scalastyle
820f460 [Jacky Li] pass maxIterations in as a parameter
Whitelist more hive unit test:
"create_like_tbl_props"
"udf5"
"udf_java_method"
"decimal_1"
"udf_pmod"
"udf_to_double"
"udf_to_float"
"udf7" (this will fail in Hive 0.12)
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3522 from chenghao-intel/unittest and squashes the following commits:
f54e4c7 [Cheng Hao] work around to clean up the hive.table.parameters.default in reset
16fee22 [Cheng Hao] Whitelist more unittest
Unpersist a uncached RDD, will not raise exception, for example:
```
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
distData.unpersist(true)
```
But the `SchemaRDD` will raise exception if the `SchemaRDD` is not cached. Since `SchemaRDD` is the subclasses of the `RDD`, we should follow the same behavior.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3572 from chenghao-intel/try_uncache and squashes the following commits:
50a7a89 [Cheng Hao] SchemaRDD.unpersist() should not raise exception if it is not persisted
Modified ScalaReflection.schemaFor to take primary constructor of Product when there are multiple constructors. Added test to suite which failed before but works now.
Needed for [https://github.com/apache/spark/pull/3637]
CC: marmbrus
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#3646 from jkbradley/sql-reflection and squashes the following commits:
796b2e4 [Joseph K. Bradley] Modified ScalaReflection.schemaFor to take primary constructor of Product when there are multiple constructors. Added test to suite which failed before but works now.
Different from Hive 0.12.0, in Hive 0.13.1 UDF/UDAF/UDTF (aka Hive function) objects should only be initialized once on the driver side and then serialized to executors. However, not all function objects are serializable (e.g. GenericUDF doesn't implement Serializable). Hive 0.13.1 solves this issue with Kryo or XML serializer. Several utility ser/de methods are provided in class o.a.h.h.q.e.Utilities for this purpose. In this PR we chose Kryo for efficiency. The Kryo serializer used here is created in Hive. Spark Kryo serializer wasn't used because there's no available SparkConf instance.
Author: Cheng Hao <hao.cheng@intel.com>
Author: Cheng Lian <lian@databricks.com>
Closes#3640 from chenghao-intel/udf_serde and squashes the following commits:
8e13756 [Cheng Hao] Update the comment
74466a3 [Cheng Hao] refactor as feedbacks
396c0e1 [Cheng Hao] avoid Simple UDF to be serialized
e9c3212 [Cheng Hao] update the comment
19cbd46 [Cheng Hao] support udf instance ser/de after initialization
This is the code refactor and follow ups for #2570
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3336 from chenghao-intel/createtbl and squashes the following commits:
3563142 [Cheng Hao] remove the unused variable
e215187 [Cheng Hao] eliminate the compiling warning
4f97f14 [Cheng Hao] fix bug in unittest
5d58812 [Cheng Hao] revert the API changes
b85b620 [Cheng Hao] fix the regression of temp tabl not found in CTAS
Author: Jacky Li <jacky.likun@huawei.com>
Closes#3630 from jackylk/remove and squashes the following commits:
150e7e0 [Jacky Li] remove unnecessary import
Enables Kryo and disables reference tracking by default in Spark SQL Thrift server. Configurations explicitly defined by users in `spark-defaults.conf` are respected (the Thrift server is started by `spark-submit`, which handles configuration properties properly).
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3621)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3621 from liancheng/kryo-by-default and squashes the following commits:
70c2775 [Cheng Lian] Enables Kryo by default in Spark SQL Thrift server
Author: Michael Armbrust <michael@databricks.com>
Closes#3613 from marmbrus/parquetPartitionPruning and squashes the following commits:
4f138f8 [Michael Armbrust] Use catalyst for partition pruning in newParquet.
Just found this instance while doing some jstack-based profiling of a Spark SQL job. It is very unlikely that this is causing much of a perf issue anywhere, but it is unnecessarily suboptimal.
Author: Aaron Davidson <aaron@databricks.com>
Closes#3593 from aarondav/seq-opt and squashes the following commits:
962cdfc [Aaron Davidson] [SQL] Minor: Avoid calling Seq#size in a loop
Author: Jacky Li <jacky.likun@huawei.com>
Closes#3585 from jackylk/remove and squashes the following commits:
045423d [Jacky Li] remove unnecessary import
This is a very small fix that catches one specific exception and returns an empty table. #3441 will address this in a more principled way.
Author: Michael Armbrust <michael@databricks.com>
Closes#3586 from marmbrus/fixEmptyParquet and squashes the following commits:
2781d9f [Michael Armbrust] Handle empty lists for newParquet
04dd376 [Michael Armbrust] Avoid exception when reading empty parquet data through Hive
Using ```executeCollect``` to collect the result, because executeCollect is a custom implementation of collect in spark sql which better than rdd's collect
Author: wangfei <wangfei1@huawei.com>
Closes#3547 from scwf/executeCollect and squashes the following commits:
a5ab68e [wangfei] Revert "adding debug info"
a60d680 [wangfei] fix test failure
0db7ce8 [wangfei] adding debug info
184c594 [wangfei] using executeCollect instead collect
We should use `~` instead of `-` for bitwise NOT.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3528 from adrian-wang/symbol and squashes the following commits:
affd4ad [Daoyuan Wang] fix code gen test case
56efb79 [Daoyuan Wang] ensure bitwise NOT over byte and short persist data type
f55fbae [Daoyuan Wang] wrong symbol for bitwise not
SELECT max(1/0) FROM src
would return a very large number, which is obviously not right.
For hive-0.12, hive would return `Infinity` for 1/0, while for hive-0.13.1, it is `NULL` for 1/0.
I think it is better to keep our behavior with newer Hive version.
This PR ensures that when the divider is 0, the result of expression should be NULL, same with hive-0.13.1
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3443 from adrian-wang/div and squashes the following commits:
2e98677 [Daoyuan Wang] fix code gen for divide 0
85c28ba [Daoyuan Wang] temp
36236a5 [Daoyuan Wang] add test cases
6f5716f [Daoyuan Wang] fix comments
cee92bd [Daoyuan Wang] avoid evaluation 2 times
22ecd9a [Daoyuan Wang] fix style
cf28c58 [Daoyuan Wang] divide fix
2dfe50f [Daoyuan Wang] return null when divider is 0 of Double type
val jsc = new org.apache.spark.api.java.JavaSparkContext(sc)
val jhc = new org.apache.spark.sql.hive.api.java.JavaHiveContext(jsc)
val nrdd = jhc.hql("select null from spark_test.for_test")
println(nrdd.schema)
Then the error is thrown as follows:
scala.MatchError: NullType (of class org.apache.spark.sql.catalyst.types.NullType$)
at org.apache.spark.sql.types.util.DataTypeConversions$.asJavaDataType(DataTypeConversions.scala:43)
Author: YanTangZhai <hakeemzhai@tencent.com>
Author: yantangzhai <tyz0303@163.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#3538 from YanTangZhai/MatchNullType and squashes the following commits:
e052dff [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
4b4bb34 [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
896c7b7 [yantangzhai] fix NullType MatchError in JavaSchemaRDD when sql has null
6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
e249846 [YanTangZhai] Merge pull request #10 from apache/master
d26d982 [YanTangZhai] Merge pull request #9 from apache/master
76d4027 [YanTangZhai] Merge pull request #8 from apache/master
03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
Author: baishuo <vc_java@hotmail.com>
Closes#3526 from baishuo/master-trycatch and squashes the following commits:
d446e14 [baishuo] correct the code style
b36bf96 [baishuo] correct the code style
ae0e447 [baishuo] add finally to avoid resource leak
Spark SQL has embeded sqrt and abs but DSL doesn't support those functions.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#3401 from sarutak/dsl-missing-operator and squashes the following commits:
07700cf [Kousuke Saruta] Modified Literal(null, NullType) to Literal(null) in DslQuerySuite
8f366f8 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into dsl-missing-operator
1b88e2e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into dsl-missing-operator
0396f89 [Kousuke Saruta] Added sqrt and abs to Spark SQL DSL
A very small nit update.
Author: Reynold Xin <rxin@databricks.com>
Closes#3552 from rxin/license-header and squashes the following commits:
df8d1a4 [Reynold Xin] Indent license header properly for interfaces.scala.
In addition, using `s.isEmpty` to eliminate the string comparison.
Author: zsxwing <zsxwing@gmail.com>
Closes#3132 from zsxwing/SPARK-4268 and squashes the following commits:
358e235 [zsxwing] Improvement of allCaseVersions
Support view definition like
CREATE VIEW view3(valoo)
TBLPROPERTIES ("fear" = "factor")
AS SELECT upper(value) FROM src WHERE key=86;
[valoo as the alias of upper(value)]. This is missing part of SPARK-4239, for a fully view support.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3396 from adrian-wang/viewcolumn and squashes the following commits:
4d001d0 [Daoyuan Wang] support view with column alias
Author: ravipesala <ravindra.pesala@huawei.com>
Closes#3516 from ravipesala/ddl_doc and squashes the following commits:
d101fdf [ravipesala] Style issues fixed
d2238cd [ravipesala] Corrected documentation
Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL
Author: ravipesala <ravindra.pesala@huawei.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#3511 from ravipesala/countdistinct and squashes the following commits:
cc4dbb1 [ravipesala] style
070e12a [ravipesala] Supporting multi column support in count(distinct c1,c2..) in Spark SQL
Remove hardcoding max and min values for types. Let BigDecimal do checking type compatibility.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#3208 from viirya/more_numericLit and squashes the following commits:
e9834b4 [Liang-Chi Hsieh] Remove byte and short types for number literal.
1bd1825 [Liang-Chi Hsieh] Fix Indentation and make the modification clearer.
cf1a997 [Liang-Chi Hsieh] Modified for comment to add a rule of analysis that adds a cast.
91fe489 [Liang-Chi Hsieh] add Byte and Short.
1bdc69d [Liang-Chi Hsieh] Let BigDecimal do checking type compatibility.
group tab is missing for scaladoc
Author: Jacky Li <jacky.likun@gmail.com>
Closes#3458 from jackylk/patch-7 and squashes the following commits:
0121a70 [Jacky Li] add @group tab in limit() and count()
Author: zsxwing <zsxwing@gmail.com>
Closes#3521 from zsxwing/SPARK-4661 and squashes the following commits:
03cbe3f [zsxwing] Minor code and docs cleanup
This PR disables HiveThriftServer2 asynchronous execution by setting `runInBackground` argument in `ExecuteStatementOperation` to `false`, and reverting `SparkExecuteStatementOperation.run` in Hive 13 shim to Hive 12 version. This change makes Simba ODBC driver v1.0.0.1000 work.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3506)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3506 from liancheng/disable-async-exec and squashes the following commits:
593804d [Cheng Lian] Disables asynchronous execution in Hive 0.13.1 HiveThriftServer2
```timeTaken``` should not count the time of printing result.
Author: w00228970 <wangfei1@huawei.com>
Closes#3423 from scwf/time-taken-bug and squashes the following commits:
da7e102 [w00228970] compute time taken correctly
Re-implement the Python broadcast using file:
1) serialize the python object using cPickle, write into disks.
2) Create a wrapper in JVM (for the dumped file), it read data from during serialization
3) Using TorrentBroadcast or HttpBroadcast to transfer the data (compressed) into executors
4) During deserialization, writing the data into disk.
5) Passing the path into Python worker, read data from disk and unpickle it into python object, until the first access.
It fixes the performance regression introduced in #2659, has similar performance as 1.1, but support object larger than 2G, also improve the memory efficiency (only one compressed copy in driver and executor).
Testing with a 500M broadcast and 4 tasks (excluding the benefit from reused worker in 1.2):
name | 1.1 | 1.2 with this patch | improvement
---------|--------|---------|--------
python-broadcast-w-bytes | 25.20 | 9.33 | 170.13% |
python-broadcast-w-set | 4.13 | 4.50 | -8.35% |
Testing with 100 tasks (16 CPUs):
name | 1.1 | 1.2 with this patch | improvement
---------|--------|---------|--------
python-broadcast-w-bytes | 38.16 | 8.40 | 353.98%
python-broadcast-w-set | 23.29 | 9.59 | 142.80%
Author: Davies Liu <davies@databricks.com>
Closes#3417 from davies/pybroadcast and squashes the following commits:
50a58e0 [Davies Liu] address comments
b98de1d [Davies Liu] disable gc while unpickle
e5ee6b9 [Davies Liu] support large string
09303b8 [Davies Liu] read all data into memory
dde02dd [Davies Liu] improve performance of python broadcast
When we use ORDER BY clause, at first, attributes referenced by projection are resolved (1).
And then, attributes referenced at ORDER BY clause are resolved (2).
But when resolving attributes referenced at ORDER BY clause, the resolution result generated in (1) is discarded so for example, following query fails.
SELECT c1 + c2 FROM mytable ORDER BY c1;
The query above fails because when resolving the attribute reference 'c1', the resolution result of 'c2' is discarded.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#3363 from sarutak/SPARK-4487 and squashes the following commits:
fd314f3 [Kousuke Saruta] Fixed attribute resolution logic in Analyzer
6e60c20 [Kousuke Saruta] Fixed conflicts
cb5b7e9 [Kousuke Saruta] Added test case for SPARK-4487
282d529 [Kousuke Saruta] Fixed attributes reference resolution error
b6123e6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into concat-feature
317b7fb [Kousuke Saruta] WIP
This file is for Hive 0.13.1 I think.
Author: Daniel Darabos <darabos.daniel@gmail.com>
Closes#3432 from darabos/patch-2 and squashes the following commits:
4fd22ed [Daniel Darabos] Fix comment. This file is for Hive 0.13.1.
This PR is a workaround for SPARK-4479. Two changes are introduced: when merge sort is bypassed in `ExternalSorter`,
1. also bypass RDD elements buffering as buffering is the reason that `MutableRow` backed row objects must be copied, and
2. avoids defensive copies in `Exchange` operator
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3422)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3422 from liancheng/avoids-defensive-copies and squashes the following commits:
591f2e9 [Cheng Lian] Passes all shuffle suites
0c3c91e [Cheng Lian] Fixes shuffle write metrics when merge sort is bypassed
ed5df3c [Cheng Lian] Fixes styling changes
f75089b [Cheng Lian] Avoids unnecessary defensive copies when sort based shuffle is on
This is just a quick fix for 1.2. SPARK-4523 describes a more complete solution.
Author: Michael Armbrust <michael@databricks.com>
Closes#3392 from marmbrus/parquetMetadata and squashes the following commits:
bcc6626 [Michael Armbrust] Parse schema with missing metadata.
Goals:
- Support for accessing parquet using SQL but not requiring Hive (thus allowing support of parquet tables with decimal columns)
- Support for folder based partitioning with automatic discovery of available partitions
- Caching of file metadata
See scaladoc of `ParquetRelation2` for more details.
Author: Michael Armbrust <michael@databricks.com>
Closes#3269 from marmbrus/newParquet and squashes the following commits:
1dd75f1 [Michael Armbrust] Pass all paths for FileInputFormat at once.
645768b [Michael Armbrust] Review comments.
abd8e2f [Michael Armbrust] Alternative implementation of parquet based on the datasources API.
938019e [Michael Armbrust] Add an experimental interface to data sources that exposes catalyst expressions.
e9d2641 [Michael Armbrust] logging / formatting improvements.
Query `SELECT named_struct(lower("AA"), "12", lower("Bb"), "13") FROM src LIMIT 1` will throw exception, some of the Hive Generic UDF/UDAF requires the input object inspector is `ConstantObjectInspector`, however, we won't get that before the expression optimization executed. (Constant Folding).
This PR is a work around to fix this. (As ideally, the `output` of LogicalPlan should be identical before and after Optimization).
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3109 from chenghao-intel/optimized and squashes the following commits:
487ff79 [Cheng Hao] rebase to the latest master & update the unittest
Sample code in the description of SchemaRDD.where is not correct
Author: Jacky Li <jacky.likun@gmail.com>
Closes#3344 from jackylk/patch-6 and squashes the following commits:
62cd126 [Jacky Li] [SQL] fix function description mistake
Hive supports the `explain` the CTAS, which was supported by Spark SQL previously, however, seems it was reverted after the code refactoring in HiveQL.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3357 from chenghao-intel/explain and squashes the following commits:
7aace63 [Cheng Hao] Support the CTAS in EXPLAIN command
Executing sum distinct for empty table throws `java.lang.UnsupportedOperationException: empty.reduceLeft`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#3184 from ueshin/issues/SPARK-4318 and squashes the following commits:
8168c42 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4318
66fdb0a [Takuya UESHIN] Re-refine aggregate functions.
6186eb4 [Takuya UESHIN] Fix Sum of GeneratedAggregate.
d2975f6 [Takuya UESHIN] Refine Sum and Average of GeneratedAggregate.
1bba675 [Takuya UESHIN] Refine Sum, SumDistinct and Average functions.
917e533 [Takuya UESHIN] Use aggregate instead of groupBy().
1a5f874 [Takuya UESHIN] Add tests to be executed as non-partial aggregation.
a5a57d2 [Takuya UESHIN] Fix empty Average.
22799dc [Takuya UESHIN] Fix empty Sum and SumDistinct.
65b7dd2 [Takuya UESHIN] Fix empty sum distinct.
The relational operator '<=>' is not working in Spark SQL. Same works in Spark HiveQL
Author: ravipesala <ravindra.pesala@huawei.com>
Closes#3387 from ravipesala/<=> and squashes the following commits:
7198e90 [ravipesala] Supporting relational operator '<=>' in Spark SQL
This PR enables the Web UI storage tab to show the in-memory table name instead of the mysterious query plan string as the name of the in-memory columnar RDD.
Note that after #2501, a single columnar RDD can be shared by multiple in-memory tables, as long as their query results are the same. In this case, only the first cached table name is shown. For example:
```sql
CACHE TABLE first AS SELECT * FROM src;
CACHE TABLE second AS SELECT * FROM src;
```
The Web UI only shows "In-memory table first".
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3383)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3383 from liancheng/columnar-rdd-name and squashes the following commits:
071907f [Cheng Lian] Fixes tests
12ddfa6 [Cheng Lian] Names in-memory columnar RDD with corresponding table name
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#3277 from vanzin/version-1.3 and squashes the following commits:
7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
5f404ff [Marcelo Vanzin] Add another exclusion.
19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
For expressions like `10 < someVar`, we should create an `Operators.Gt` filter, but right now an `Operators.Lt` is created. This issue affects all inequality predicates with literals on the left hand side.
(This bug existed before #3317 and affects branch-1.1. #3338 was opened to backport this to branch-1.1.)
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3334)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3334 from liancheng/fix-parquet-comp-filter and squashes the following commits:
0130897 [Cheng Lian] Fixes Parquet comparison filter generation
This patch will bring support for broadcasting objects larger than 2G.
pickle, zlib, FrameSerializer and Array[Byte] all can not support objects larger than 2G, so this patch introduce LargeObjectSerializer to serialize broadcast objects, the object will be serialized and compressed into small chunks, it also change the type of Broadcast[Array[Byte]]] into Broadcast[Array[Array[Byte]]]].
Testing for support broadcast objects larger than 2G is slow and memory hungry, so this is tested manually, could be added into SparkPerf.
Author: Davies Liu <davies@databricks.com>
Author: Davies Liu <davies.liu@gmail.com>
Closes#2659 from davies/huge and squashes the following commits:
7b57a14 [Davies Liu] add more tests for broadcast
28acff9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
a2f6a02 [Davies Liu] bug fix
4820613 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
5875c73 [Davies Liu] address comments
10a349b [Davies Liu] address comments
0c33016 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
6182c8f [Davies Liu] Merge branch 'master' into huge
d94b68f [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
2514848 [Davies Liu] address comments
fda395b [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
1c2d928 [Davies Liu] fix scala style
091b107 [Davies Liu] broadcast objects larger than 2G
Author: Michael Armbrust <michael@databricks.com>
Closes#3272 from marmbrus/keyInPartitionedTable and squashes the following commits:
447f08c [Michael Armbrust] Support partitioned parquet tables that have the key in both the directory and the file
While reviewing PR #3083 and #3161, I noticed that Parquet record filter generation code can be simplified significantly according to the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4213). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.
While generating `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generate the record filter, and then call `findExpression` to traverse the generated filter to find out all pushed down predicates [[1](64c6b9bad5/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (L213-L228))]. In this way, we have to introduce the `CatalystFilter` class hierarchy to bind the Catalyst predicates together with their generated Parquet filter, and complicate the code base a lot.
The basic idea of this PR is that, we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3317)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3317 from liancheng/simplify-parquet-filters and squashes the following commits:
d6a9499 [Cheng Lian] Fixes import styling issue
43760e8 [Cheng Lian] Simplifies Parquet filter generation logic
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3308 from chenghao-intel/unwrap_constant_oi and squashes the following commits:
156b500 [Cheng Hao] rebase the master
c5b20ab [Cheng Hao] unwrap for the ConstantObjectInspector
The `totalSize` of external table is always zero, which will influence join strategy(always use broadcast join for external table).
Author: w00228970 <wangfei1@huawei.com>
Closes#3304 from scwf/statistics and squashes the following commits:
568f321 [w00228970] fix statistics for external table
This PR is exactly the same as #3178 except it reverts the `FileStatus.isDir` to `FileStatus.isDirectory` change, since it doesn't compile with Hadoop 1.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3298)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3298 from liancheng/date-for-thriftserver and squashes the following commits:
866037e [Cheng Lian] Revers isDirectory to isDir (it breaks Hadoop 1 profile)
6f71d0b [Cheng Lian] Makes toHiveString static
26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim
a92882a [Cheng Lian] Updates HiveShim for 0.13.1
73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0)
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3217 from chenghao-intel/mutablerow and squashes the following commits:
e8a10bd [Cheng Hao] revert the change of Row object
4681aea [Cheng Hao] Add toMutableRow method in object Row
a751838 [Cheng Hao] Construct the MutableRow from an existed row
`Cast` from `NaN` or `Infinity` of `Double` or `Float` to `TimestampType` throws `NumberFormatException`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#3283 from ueshin/issues/SPARK-4425 and squashes the following commits:
14def0c [Takuya UESHIN] Fix Cast to be able to handle NaN or Infinity to TimestampType.
This is follow-up of [SPARK-4390](https://issues.apache.org/jira/browse/SPARK-4390) (#3256).
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#3278 from ueshin/issues/SPARK-4420 and squashes the following commits:
7fea558 [Takuya UESHIN] Add some tests.
cb2301a [Takuya UESHIN] Fix tests.
133bad5 [Takuya UESHIN] Change nullability of Cast from DoubleType/FloatType to DecimalType.
Adds a new operator that uses Spark's `ExternalSort` class. It is off by default now, but we might consider making it the default if benchmarks show that it does not regress performance.
Author: Michael Armbrust <michael@databricks.com>
Closes#3268 from marmbrus/externalSort and squashes the following commits:
48b9726 [Michael Armbrust] comments
b98799d [Michael Armbrust] Add test
afd7562 [Michael Armbrust] Add support for external sort.
Author: Michael Armbrust <michael@databricks.com>
Closes#3292 from marmbrus/revert4309 and squashes the following commits:
808e96e [Michael Armbrust] Revert "[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types"
SPARK-4407 was detected while working on SPARK-4309. Merged these two into a single PR since 1.2.0 RC is approaching.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3178)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3178 from liancheng/date-for-thriftserver and squashes the following commits:
6f71d0b [Cheng Lian] Makes toHiveString static
26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim
a92882a [Cheng Lian] Updates HiveShim for 0.13.1
73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0)
The symbol of BitwiseOr is defined as '&' but I think it's wrong. It should be '|'.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#3284 from sarutak/bitwise-or-symbol-fix and squashes the following commits:
aff4be5 [Kousuke Saruta] Fixed symbol of BitwiseOr
Add contains(key) to org.apache.spark.sql.catalyst.util.Metadata to test the existence of a key. Otherwise, Class Metadata's get methods may throw NoSuchElement exception if the key does not exist.
Testcases are added to MetadataSuite as well.
Author: kai <kaizeng@eecs.berkeley.edu>
Closes#3273 from kai-zeng/metadata-fix and squashes the following commits:
74b3d03 [kai] Added contains(key) to Metadata
The Spark ParquetRelation.scala code makes the assumption that the parquet.Log class has already been loaded. If ParquetRelation.enableLogForwarding executes prior to the parquet.Log class being loaded then the code in enableLogForwarding has no affect.
ParquetRelation.scala attempts to override the parquet logger but, at least currently (and if your application simply reads a parquet file before it does anything else with Parquet), the parquet.Log class hasn't been loaded yet. Therefore the code in ParquetRelation.enableLogForwarding has no affect. If you look at the code in parquet.Log there's a static initializer that needs to be called prior to enableLogForwarding or whatever enableLogForwarding does gets undone by this static initializer.
The "fix" would be to force the static initializer to get called in parquet.Log as part of enableForwardLogging.
Author: Jim Carroll <jim@dontcallme.com>
Closes#3271 from jimfcarroll/parquet-logging and squashes the following commits:
37bdff7 [Jim Carroll] Fix Spark's control of Parquet logging.
Since parquet library has been updated , we no longer need to filter the records returned from parquet library for null records , as now the library skips those :
from parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java
public boolean nextKeyValue() throws IOException, InterruptedException {
boolean recordFound = false;
while (!recordFound) {
// no more records left
if (current >= total)
{ return false; }
try {
checkRead();
currentValue = recordReader.read();
current ++;
if (recordReader.shouldSkipCurrentRecord())
{
// this record is being filtered via the filter2 package
if (DEBUG) LOG.debug("skipping record");
continue;
}
if (currentValue == null)
{
// only happens with FilteredRecordReader at end of block current = totalCountLoadedSoFar;
if (DEBUG) LOG.debug("filtered record reader reached end of block");
continue;
}
recordFound = true;
if (DEBUG) LOG.debug("read value: " + currentValue);
} catch (RuntimeException e)
{ throw new ParquetDecodingException(format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e); }
}
return true;
}
Author: Yash Datta <Yash.Datta@guavus.com>
Closes#3229 from saucam/remove_filter and squashes the following commits:
8909ae9 [Yash Datta] SPARK-4365: Remove unnecessary filter call on records returned from parquet library
If you profile the writing of a Parquet file, the single worst time consuming call inside of org.apache.spark.sql.parquet.MutableRowWriteSupport.write is actually in the scala.collection.AbstractSequence.size call. This is because the size call actually ends up COUNTING the elements in a scala.collection.LinearSeqOptimized.length ("optimized?").
This doesn't need to be done. "size" is called repeatedly where needed rather than called once at the top of the method and stored in a 'val'.
Author: Jim Carroll <jim@dontcallme.com>
Closes#3254 from jimfcarroll/parquet-perf and squashes the following commits:
30cc0b5 [Jim Carroll] Improve performance when writing Parquet files.
While resolving struct fields, the resulted `GetField` expression is wrapped with an `Alias` to make it a named expression. Assume `a` is a struct instance with a field `b`, then `"a.b"` will be resolved as `Alias(GetField(a, "b"), "b")`. Thus, for this following SQL query:
```sql
SELECT a.b + 1 FROM t GROUP BY a.b + 1
```
the grouping expression is
```scala
Add(GetField(a, "b"), Literal(1, IntegerType))
```
while the aggregation expression is
```scala
Add(Alias(GetField(a, "b"), "b"), Literal(1, IntegerType))
```
This mismatch makes the above SQL query fail during the both analysis and execution phases. This PR fixes this issue by removing the alias when substituting aggregation expressions.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3248)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#3248 from liancheng/spark-4322 and squashes the following commits:
23a46ea [Cheng Lian] Code simplification
dd20a79 [Cheng Lian] Should only trim aliases around `GetField`s
7f46532 [Cheng Lian] Enables struct fields as sub expressions of grouping fields
When sort based shuffle and code gen are on we were trying to ship the code generated rows during a shuffle. This doesn't work because the classes don't exist on the other side. Instead we now copy into a generic row before shipping.
Author: Michael Armbrust <michael@databricks.com>
Closes#3263 from marmbrus/aggCodeGen and squashes the following commits:
f6ba8cf [Michael Armbrust] fix and test
Author: Michael Armbrust <michael@databricks.com>
Closes#3257 from marmbrus/minorCleanup and squashes the following commits:
d8b5abc [Michael Armbrust] Use interpolation.
2fdf903 [Michael Armbrust] Better error message when coalesce can't be resolved.
f9fa6cf [Michael Armbrust] Methods in a final class do not also need to be final, use override.
199fd98 [Michael Armbrust] Fix typo
This is more uniform with the rest of SQL configuration and allows it to be turned on and off without restarting the SparkContext. In this PR I also turn off filter pushdown by default due to a number of outstanding issues (in particular SPARK-4258). When those are fixed we should turn it back on by default.
Author: Michael Armbrust <michael@databricks.com>
Closes#3258 from marmbrus/parquetFilters and squashes the following commits:
5655bfe [Michael Armbrust] Remove extra line.
15e9a98 [Michael Armbrust] Enable filters for tests
75afd39 [Michael Armbrust] Fix comments
78fa02d [Michael Armbrust] off by default
e7f9e16 [Michael Armbrust] First draft of correctly configuring parquet filter pushdown
Author: Michael Armbrust <michael@databricks.com>
Closes#3256 from marmbrus/NanDecimal and squashes the following commits:
4c3ba46 [Michael Armbrust] fix style
d360f83 [Michael Armbrust] Handle NaN cast to decimal
When iterator of RuleExecutor breaks, the num of iterator should be (iteration - 1) not (iteration ).Because log looks like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it did 2 iterations really!
Author: DoingDone9 <799203320@qq.com>
Closes#3180 from DoingDone9/issue_01 and squashes the following commits:
571e2ed [DoingDone9] Update RuleExecutor.scala
46514b6 [DoingDone9] When iterator of RuleExecutor breaks, the num of iterator should be iteration - 1 not iteration.
It seems like the winds might have moved away from this approach, but wanted to post the PR anyway because I got it working and to show what it would look like.
Author: Sandy Ryza <sandy@cloudera.com>
Closes#3239 from sryza/sandy-spark-4375 and squashes the following commits:
0ffbe95 [Sandy Ryza] Enable -Dscala-2.11 in sbt
cd42d94 [Sandy Ryza] Update doc
f6644c3 [Sandy Ryza] SPARK-4375 take 2
The `containsNull` of the result `ArrayType` of `CreateArray` should be `true` only if the children is empty or there exists nullable child.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#3110 from ueshin/issues/SPARK-4245 and squashes the following commits:
6f64746 [Takuya UESHIN] Move equalsIgnoreNullability method into DataType.
5a90e02 [Takuya UESHIN] Refine InsertIntoHiveType and add some comments.
cbecba8 [Takuya UESHIN] Fix a test title.
884ec37 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4245
3c5274b [Takuya UESHIN] Add tests to insert data of types ArrayType / MapType / StructType with nullability is false into Hive table.
41a94a9 [Takuya UESHIN] Replace InsertIntoTable with InsertIntoHiveTable if data types ignoring nullability are same.
43e6ef5 [Takuya UESHIN] Fix containsNull for empty array.
778e997 [Takuya UESHIN] Fix containsNull of the result ArrayType of CreateArray expression.
Currently still not support view like
CREATE VIEW view3(valoo)
TBLPROPERTIES ("fear" = "factor")
AS SELECT upper(value) FROM src WHERE key=86;
because the text in metastore for this view is like
select \`_c0\` as \`valoo\` from (select upper(\`src\`.\`value\`) from \`default\`.\`src\` where ...) \`view3\`
while catalyst cannot resolve \`_c0\` for this query.
For view without colname definition in parentheses, it works fine.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3131 from adrian-wang/view and squashes the following commits:
8a56fd6 [Daoyuan Wang] michael's comments
e46c056 [Daoyuan Wang] add some golden file
079290a [Daoyuan Wang] remove useless import
88afcad [Daoyuan Wang] support view in HiveQl
This PR adds two features to the data sources API:
- Support for pushing down `IN` filters
- The ability for relations to optionally provide information about their `sizeInBytes`.
Author: Michael Armbrust <michael@databricks.com>
Closes#3260 from marmbrus/sourcesImprovements and squashes the following commits:
9a5e171 [Michael Armbrust] Use method instead of configuration directly
99c0e6b [Michael Armbrust] Add support for sizeInBytes.
416f167 [Michael Armbrust] Support for IN in data sources API.
2a04ab3 [Michael Armbrust] Simplify implementation of InSet.
Let's give this another go using a version of Hive that shades its JLine dependency.
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>
Closes#3159 from pwendell/scala-2.11-prashant and squashes the following commits:
e93aa3e [Patrick Wendell] Restoring -Phive-thriftserver profile and cleaning up build script.
f65d17d [Patrick Wendell] Fixing build issue due to merge conflict
a8c41eb [Patrick Wendell] Reverting dev/run-tests back to master state.
7a6eb18 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into scala-2.11-prashant
583aa07 [Prashant Sharma] REVERT ME: removed hive thirftserver
3680e58 [Prashant Sharma] Revert "REVERT ME: Temporarily removing some Cli tests."
935fb47 [Prashant Sharma] Revert "Fixed by disabling a few tests temporarily."
925e90f [Prashant Sharma] Fixed by disabling a few tests temporarily.
2fffed3 [Prashant Sharma] Exclude groovy from sbt build, and also provide a way for such instances in future.
8bd4e40 [Prashant Sharma] Switched to gmaven plus, it fixes random failures observer with its predecessor gmaven.
5272ce5 [Prashant Sharma] SPARK_SCALA_VERSION related bugs.
2121071 [Patrick Wendell] Migrating version detection to PySpark
b1ed44d [Patrick Wendell] REVERT ME: Temporarily removing some Cli tests.
1743a73 [Patrick Wendell] Removing decimal test that doesn't work with Scala 2.11
f5cad4e [Patrick Wendell] Add Scala 2.11 docs
210d7e1 [Patrick Wendell] Revert "Testing new Hive version with shaded jline"
48518ce [Patrick Wendell] Remove association of Hive and Thriftserver profiles.
e9d0a06 [Patrick Wendell] Revert "Enable thritfserver for Scala 2.10 only"
67ec364 [Patrick Wendell] Guard building of thriftserver around Scala 2.10 check
8502c23 [Patrick Wendell] Enable thritfserver for Scala 2.10 only
e22b104 [Patrick Wendell] Small fix in pom file
ec402ab [Patrick Wendell] Various fixes
0be5a9d [Patrick Wendell] Testing new Hive version with shaded jline
4eaec65 [Prashant Sharma] Changed scripts to ignore target.
5167bea [Prashant Sharma] small correction
a4fcac6 [Prashant Sharma] Run against scala 2.11 on jenkins.
80285f4 [Prashant Sharma] MAven equivalent of setting spark.executor.extraClasspath during tests.
034b369 [Prashant Sharma] Setting test jars on executor classpath during tests from sbt.
d4874cb [Prashant Sharma] Fixed Python Runner suite. null check should be first case in scala 2.11.
6f50f13 [Prashant Sharma] Fixed build after rebasing with master. We should use ${scala.binary.version} instead of just 2.10
e56ca9d [Prashant Sharma] Print an error if build for 2.10 and 2.11 is spotted.
937c0b8 [Prashant Sharma] SCALA_VERSION -> SPARK_SCALA_VERSION
cb059b0 [Prashant Sharma] Code review
0476e5e [Prashant Sharma] Scala 2.11 support with repl and all build changes.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3139 from chenghao-intel/comparison_test and squashes the following commits:
f5d7146 [Cheng Hao] avoid exception in printing the codegen enabled
This implement the feature davies mentioned in https://github.com/apache/spark/pull/2901#discussion-diff-19313312
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#3012 from adrian-wang/iso8601 and squashes the following commits:
50df6e7 [Daoyuan Wang] json data timestamp ISO8601 support
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3114 from chenghao-intel/constant_null_oi and squashes the following commits:
e603bda [Cheng Hao] fix the bug of null value for primitive types
50a13ba [Cheng Hao] fix the timezone issue
f54f369 [Cheng Hao] fix bug of constant null value for ObjectInspector
it generates warnings at compile time marmbrus
Author: Xiangrui Meng <meng@databricks.com>
Closes#3192 from mengxr/dtc-decimal and squashes the following commits:
955e9fb [Xiangrui Meng] remove a decimal case branch that has no effect
In `HiveThriftServer2`, when an exception is thrown during a SQL execution, the SQL operation state should be set to `ERROR`, but now it remains `RUNNING`. This affects the result of the `GetOperationStatus` Thrift API.
Author: Cheng Lian <lian@databricks.com>
Closes#3175 from liancheng/fix-op-state and squashes the following commits:
6d4c1fe [Cheng Lian] Sets SQL operation state to ERROR when exception is thrown
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#3185 from ueshin/issues/SPARK-4319 and squashes the following commits:
a44a38e [Takuya UESHIN] Enable an ignored test "null count".
marmbrus
Author: Xiangrui Meng <meng@databricks.com>
Closes#3164 from mengxr/hive-udt and squashes the following commits:
57c7519 [Xiangrui Meng] support udt->hive types (hive->udt is not supported)
andrewor14 Another try at SPARK-1209, to address https://github.com/apache/spark/pull/2814#issuecomment-61197619
I successfully tested with `mvn -Dhadoop.version=1.0.4 -DskipTests clean package; mvn -Dhadoop.version=1.0.4 test` I assume that is what failed Jenkins last time. I also tried `-Dhadoop.version1.2.1` and `-Phadoop-2.4 -Pyarn -Phive` for more coverage.
So this is why the class was put in `org.apache.hadoop` to begin with, I assume. One option is to leave this as-is for now and move it only when Hadoop 1.0.x support goes away.
This is the other option, which adds a call to force the constructor to be public at run-time. It's probably less surprising than putting Spark code in `org.apache.hadoop`, but, does involve reflection. A `SecurityManager` might forbid this, but it would forbid a lot of stuff Spark does. This would also only affect Hadoop 1.0.x it seems.
Author: Sean Owen <sowen@cloudera.com>
Closes#3048 from srowen/SPARK-1209 and squashes the following commits:
0d48f4b [Sean Owen] For Hadoop 1.0.x, make certain constructors public, which were public in later versions
466e179 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though?
eb61820 [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark
When doing an insert into hive table with partitions the folders written to the file system are in a random order instead of the order defined in table creation. Seems that the loadPartition method in Hive.java has a Map<String,String> parameter but expects to be called with a map that has a defined ordering such as LinkedHashMap. Working on a test but having intillij problems
Author: Matthew Taylor <matthew.t@tbfe.net>
Closes#3076 from tbfenet/partition_dir_order_problem and squashes the following commits:
f1b9a52 [Matthew Taylor] Comment format fix
bca709f [Matthew Taylor] review changes
0e50f6b [Matthew Taylor] test fix
99f1a31 [Matthew Taylor] partition ordering fix
369e618 [Matthew Taylor] partition ordering fix
`Cast` from `DateType` to `DecimalType` throws `NullPointerException`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#3134 from ueshin/issues/SPARK-4270 and squashes the following commits:
7394e4b [Takuya UESHIN] Fix Cast from DateType to DecimalType.
Currently, the data "unwrap" only support couple of primitive types, not all, it will not cause exception, but may get some performance in table scanning for the type like binary, date, timestamp, decimal etc.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3136 from chenghao-intel/table_reader and squashes the following commits:
fffb729 [Cheng Hao] fix bug for retrieving the timestamp object
e9c97a4 [Cheng Hao] Add more unwrapper functions for primitive type in TableReader
Following description is quoted from JIRA:
When I issue a hql query against a HiveContext where my predicate uses a column of string type with one of LT, LTE, GT, or GTE operator, I get the following error:
scala.MatchError: StringType (of class org.apache.spark.sql.catalyst.types.StringType$)
Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType is absent from the corresponding functions for creating these filters.
To reproduce, in a Hive 0.13.1 shell, I created the following table (at a specified DB):
create table sparkbug (
id int,
event string
) stored as parquet;
Insert some sample data:
insert into table sparkbug select 1, '2011-06-18' from <some table> limit 1;
insert into table sparkbug select 2, '2012-01-01' from <some table> limit 1;
Launch a spark shell and create a HiveContext to the metastore where the table above is located.
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
hc.setConf("spark.sql.shuffle.partitions", "10")
hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
hc.setConf("spark.sql.parquet.compression.codec", "snappy")
import hc._
hc.hql("select * from <db>.sparkbug where event >= '2011-12-01'")
A scala.MatchError will appear in the output.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#3083 from sarutak/SPARK-4213 and squashes the following commits:
4ab6e56 [Kousuke Saruta] WIP
b6890c6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4213
9a1fae7 [Kousuke Saruta] Fixed ParquetFilters so that compare Strings
'DOUBLE' should be moved before 'ELSE' according to the ordering convension
Author: Jacky Li <jacky.likun@gmail.com>
Closes#3080 from jackylk/patch-5 and squashes the following commits:
3c11df7 [Jacky Li] [SQL] Modify keyword val location according to ordering
Author: Michael Armbrust <michael@databricks.com>
Closes#3096 from marmbrus/reflectionContext and squashes the following commits:
adc221f [Michael Armbrust] Support ScalaReflection of schema in different universes
This PR resorts to `SparkContext.version` rather than META-INF/MANIFEST.MF in the assembly jar to inspect Spark version. Currently, when built with Maven, the MANIFEST.MF file in the assembly jar is incorrectly replaced by Guava 15.0 MANIFEST.MF, probably because of the assembly/shading tricks.
Another related PR is #3103, which tries to fix the MANIFEST issue.
Author: Cheng Lian <lian@databricks.com>
Closes#3105 from liancheng/spark-4225 and squashes the following commits:
d9585e1 [Cheng Lian] Resorts to SparkContext.version to inspect Spark version
marmbrus
Author: Xiangrui Meng <meng@databricks.com>
Closes#3125 from mengxr/SPARK-4262 and squashes the following commits:
307695e [Xiangrui Meng] add .schemaRDD to JavaSchemaRDD
Author: Michael Armbrust <michael@databricks.com>
Closes#3097 from marmbrus/asString and squashes the following commits:
6430520 [Michael Armbrust] Add String option for DSL AS
This PR simplify serializer, always use batched serializer (AutoBatchedSerializer as default), even batch size is 1.
Author: Davies Liu <davies@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Josh Rosen <joshrosen@databricks.com>
Closes#2920 from davies/fix_autobatch and squashes the following commits:
e544ef9 [Davies Liu] revert unrelated change
6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
1d557fc [Davies Liu] fix tests
8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
76abdce [Davies Liu] clean up
53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
b4292ce [Davies Liu] fix bug in master
d79744c [Davies Liu] recover hive tests
be37ece [Davies Liu] refactor
eb3938d [Davies Liu] refactor serializer in scala
8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we need to convert user-type instances into SQL recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A following PR will add VectorUDT in MLlib for both Scala and Python.
marmbrus jkbradley davies
Author: Xiangrui Meng <meng@databricks.com>
Closes#3068 from mengxr/SPARK-4192-sql and squashes the following commits:
acff637 [Xiangrui Meng] merge master
dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well
2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion
7c4a6a9 [Xiangrui Meng] address comments
75223db [Xiangrui Meng] minor update
f740379 [Xiangrui Meng] remove UDT from default imports
e98d9d0 [Xiangrui Meng] fix py style
4e84fce [Xiangrui Meng] remove local hive tests and add more tests
39f19e0 [Xiangrui Meng] add tests
b7f666d [Xiangrui Meng] add Python UDT
Author: Michael Armbrust <michael@databricks.com>
Closes#3077 from marmbrus/udfsWithUdts and squashes the following commits:
34b5f27 [Michael Armbrust] style
504adef [Michael Armbrust] Convert arguments to Scala UDFs
- Turns on compression for in-memory cached data by default
- Changes the default parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory)
- Ups the batch size to 10,000 rows
- Increases the broadcast threshold to 10mb.
- Uses our parquet implementation instead of the hive one by default.
- Cache parquet metadata by default.
Author: Michael Armbrust <michael@databricks.com>
Closes#3064 from marmbrus/fasterDefaults and squashes the following commits:
97ee9f8 [Michael Armbrust] parquet codec docs
e641694 [Michael Armbrust] Remote also
a12866a [Michael Armbrust] Cache metadata.
2d73acc [Michael Armbrust] Update docs defaults.
d63d2d5 [Michael Armbrust] document parquet option
da373f9 [Michael Armbrust] More aggressive defaults
CREATE TABLE t1 (a String);
CREATE TABLE t1 AS SELECT key FROM src; – throw exception
CREATE TABLE if not exists t1 AS SELECT key FROM src; – expect do nothing, currently it will overwrite the t1, which is incorrect.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3013 from chenghao-intel/ctas_unittest and squashes the following commits:
194113e [Cheng Hao] fix bug in CTAS when table already existed
This feature is based on an offline discussion with mengxr, hopefully can be useful for the new MLlib pipeline API.
For the following test snippet
```scala
case class KeyValue(key: Int, value: String)
val testData = sc.parallelize(1 to 10).map(i => KeyValue(i, i.toString)).toSchemaRDD
def foo(a: Int, b: String) => a.toString + b
```
the newly introduced DSL enables the following syntax
```scala
import org.apache.spark.sql.catalyst.dsl._
testData.select(Star(None), foo.call('key, 'value) as 'result)
```
which is equivalent to
```scala
testData.registerTempTable("testData")
sqlContext.registerFunction("foo", foo)
sql("SELECT *, foo(key, value) AS result FROM testData")
```
Author: Cheng Lian <lian@databricks.com>
Closes#3067 from liancheng/udf-dsl and squashes the following commits:
f132818 [Cheng Lian] Adds DSL support for Scala UDF
This patch will try to infer schema for RDD which has empty value (None, [], {}) in the first row. It will try first 100 rows and merge the types into schema, also merge fields of StructType together. If there is still NullType in schema, then it will show an warning, tell user to try with sampling.
If sampling is presented, it will infer schema from all the rows after sampling.
Also, add samplingRatio for jsonFile() and jsonRDD()
Author: Davies Liu <davies.liu@gmail.com>
Author: Davies Liu <davies@databricks.com>
Closes#2716 from davies/infer and squashes the following commits:
e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
567dc60 [Davies Liu] update docs
9767b27 [Davies Liu] Merge branch 'master' into infer
e48d7fb [Davies Liu] fix tests
29e94d5 [Davies Liu] let NullType inherit from PrimitiveType
ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
540d1d5 [Davies Liu] merge fields for StructType
f93fd84 [Davies Liu] add more tests
3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by sampling the RDD
Queries which has 'not like' is not working spark sql.
sql("SELECT * FROM records where value not like 'val%'")
same query works in Spark HiveQL
Author: ravipesala <ravindra.pesala@huawei.com>
Closes#3075 from ravipesala/SPARK-4207 and squashes the following commits:
35c11e7 [ravipesala] Supported 'not like' syntax in sql
This PR adds User-Defined Types (UDTs) to SQL. It is a precursor to using SchemaRDD as a Dataset for the new MLlib API. Currently, the UDT API is private since there is incomplete support (e.g., no Java or Python support yet).
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#3063 from marmbrus/udts and squashes the following commits:
7ccfc0d [Michael Armbrust] remove println
46a3aee [Michael Armbrust] Slightly easier to read test output.
6cc434d [Michael Armbrust] Recursively convert rows.
e369b91 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udts
15c10a6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into sql-udt2
f3c72fe [Joseph K. Bradley] Fixing merge
e13cd8a [Joseph K. Bradley] Removed Vector UDTs
5817b2b [Joseph K. Bradley] style edits
30ce5b2 [Joseph K. Bradley] updates based on code review
d063380 [Joseph K. Bradley] Cleaned up Java UDT Suite, and added warning about element ordering when creating schema from Java Bean
a571bb6 [Joseph K. Bradley] Removed old UDT code (registry and Java UDTs). Cleaned up other code. Extended JavaUserDefinedTypeSuite
6fddc1c [Joseph K. Bradley] Made MyLabeledPoint into a Java Bean
20630bc [Joseph K. Bradley] fixed scalastyle
fa86b20 [Joseph K. Bradley] Removed Java UserDefinedType, and made UDTs private[spark] for now
8de957c [Joseph K. Bradley] Modified UserDefinedType to store Java class of user type so that registerUDT takes only the udt argument.
8b242ea [Joseph K. Bradley] Fixed merge error after last merge. Note: Last merge commit also removed SQL UDT examples from mllib.
7f29656 [Joseph K. Bradley] Moved udt case to top of all matches. Small cleanups
b028675 [Xiangrui Meng] allow any type in UDT
4500d8a [Xiangrui Meng] update example code
87264a5 [Xiangrui Meng] remove debug code
3143ac3 [Xiangrui Meng] remove unnecessary changes
cfbc321 [Xiangrui Meng] support UDT in parquet
db16139 [Joseph K. Bradley] Added more doc for UserDefinedType. Removed unused code in Suite
759af7a [Joseph K. Bradley] Added more doc to UserDefineType
63626a4 [Joseph K. Bradley] Updated ScalaReflectionsSuite per @marmbrus suggestions
51e5282 [Joseph K. Bradley] fixed 1 test
f025035 [Joseph K. Bradley] Cleanups before PR. Added new tests
85872f6 [Michael Armbrust] Allow schema calculation to be lazy, but ensure its available on executors.
dff99d6 [Joseph K. Bradley] Added UDTs for Vectors in MLlib, plus DatasetExample using the UDTs
cd60cb4 [Joseph K. Bradley] Trying to get other SQL tests to run
34a5831 [Joseph K. Bradley] Added MLlib dependency on SQL.
e1f7b9c [Joseph K. Bradley] blah
2f40c02 [Joseph K. Bradley] renamed UDT types
3579035 [Joseph K. Bradley] udt annotation now working
b226b9e [Joseph K. Bradley] Changing UDT to annotation
fea04af [Joseph K. Bradley] more cleanups
964b32e [Joseph K. Bradley] some cleanups
893ee4c [Joseph K. Bradley] udt finallly working
50f9726 [Joseph K. Bradley] udts
04303c9 [Joseph K. Bradley] udts
39f8707 [Joseph K. Bradley] removed old udt suite
273ac96 [Joseph K. Bradley] basic UDT is working, but deserialization has yet to be done
8bebf24 [Joseph K. Bradley] commented out convertRowToScala for debugging
53de70f [Joseph K. Bradley] more udts...
982c035 [Joseph K. Bradley] still working on UDTs
19b2f60 [Joseph K. Bradley] still working on UDTs
0eaeb81 [Joseph K. Bradley] Still working on UDTs
105c5a3 [Joseph K. Bradley] Adding UserDefinedType to SQL, not done yet.