When doing an insert into hive table with partitions the folders written to the file system are in a random order instead of the order defined in table creation. Seems that the loadPartition method in Hive.java has a Map<String,String> parameter but expects to be called with a map that has a defined ordering such as LinkedHashMap. Working on a test but having intillij problems
Author: Matthew Taylor <matthew.t@tbfe.net>
Closes#3076 from tbfenet/partition_dir_order_problem and squashes the following commits:
f1b9a52 [Matthew Taylor] Comment format fix
bca709f [Matthew Taylor] review changes
0e50f6b [Matthew Taylor] test fix
99f1a31 [Matthew Taylor] partition ordering fix
369e618 [Matthew Taylor] partition ordering fix
`Cast` from `DateType` to `DecimalType` throws `NullPointerException`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#3134 from ueshin/issues/SPARK-4270 and squashes the following commits:
7394e4b [Takuya UESHIN] Fix Cast from DateType to DecimalType.
Currently, the data "unwrap" only support couple of primitive types, not all, it will not cause exception, but may get some performance in table scanning for the type like binary, date, timestamp, decimal etc.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3136 from chenghao-intel/table_reader and squashes the following commits:
fffb729 [Cheng Hao] fix bug for retrieving the timestamp object
e9c97a4 [Cheng Hao] Add more unwrapper functions for primitive type in TableReader
Following description is quoted from JIRA:
When I issue a hql query against a HiveContext where my predicate uses a column of string type with one of LT, LTE, GT, or GTE operator, I get the following error:
scala.MatchError: StringType (of class org.apache.spark.sql.catalyst.types.StringType$)
Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType is absent from the corresponding functions for creating these filters.
To reproduce, in a Hive 0.13.1 shell, I created the following table (at a specified DB):
create table sparkbug (
id int,
event string
) stored as parquet;
Insert some sample data:
insert into table sparkbug select 1, '2011-06-18' from <some table> limit 1;
insert into table sparkbug select 2, '2012-01-01' from <some table> limit 1;
Launch a spark shell and create a HiveContext to the metastore where the table above is located.
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
hc.setConf("spark.sql.shuffle.partitions", "10")
hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
hc.setConf("spark.sql.parquet.compression.codec", "snappy")
import hc._
hc.hql("select * from <db>.sparkbug where event >= '2011-12-01'")
A scala.MatchError will appear in the output.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#3083 from sarutak/SPARK-4213 and squashes the following commits:
4ab6e56 [Kousuke Saruta] WIP
b6890c6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4213
9a1fae7 [Kousuke Saruta] Fixed ParquetFilters so that compare Strings
'DOUBLE' should be moved before 'ELSE' according to the ordering convension
Author: Jacky Li <jacky.likun@gmail.com>
Closes#3080 from jackylk/patch-5 and squashes the following commits:
3c11df7 [Jacky Li] [SQL] Modify keyword val location according to ordering
Author: Michael Armbrust <michael@databricks.com>
Closes#3096 from marmbrus/reflectionContext and squashes the following commits:
adc221f [Michael Armbrust] Support ScalaReflection of schema in different universes
This PR resorts to `SparkContext.version` rather than META-INF/MANIFEST.MF in the assembly jar to inspect Spark version. Currently, when built with Maven, the MANIFEST.MF file in the assembly jar is incorrectly replaced by Guava 15.0 MANIFEST.MF, probably because of the assembly/shading tricks.
Another related PR is #3103, which tries to fix the MANIFEST issue.
Author: Cheng Lian <lian@databricks.com>
Closes#3105 from liancheng/spark-4225 and squashes the following commits:
d9585e1 [Cheng Lian] Resorts to SparkContext.version to inspect Spark version
marmbrus
Author: Xiangrui Meng <meng@databricks.com>
Closes#3125 from mengxr/SPARK-4262 and squashes the following commits:
307695e [Xiangrui Meng] add .schemaRDD to JavaSchemaRDD
Author: Michael Armbrust <michael@databricks.com>
Closes#3097 from marmbrus/asString and squashes the following commits:
6430520 [Michael Armbrust] Add String option for DSL AS
This PR simplify serializer, always use batched serializer (AutoBatchedSerializer as default), even batch size is 1.
Author: Davies Liu <davies@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Josh Rosen <joshrosen@databricks.com>
Closes#2920 from davies/fix_autobatch and squashes the following commits:
e544ef9 [Davies Liu] revert unrelated change
6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
1d557fc [Davies Liu] fix tests
8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
76abdce [Davies Liu] clean up
53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
b4292ce [Davies Liu] fix bug in master
d79744c [Davies Liu] recover hive tests
be37ece [Davies Liu] refactor
eb3938d [Davies Liu] refactor serializer in scala
8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we need to convert user-type instances into SQL recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A following PR will add VectorUDT in MLlib for both Scala and Python.
marmbrus jkbradley davies
Author: Xiangrui Meng <meng@databricks.com>
Closes#3068 from mengxr/SPARK-4192-sql and squashes the following commits:
acff637 [Xiangrui Meng] merge master
dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well
2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion
7c4a6a9 [Xiangrui Meng] address comments
75223db [Xiangrui Meng] minor update
f740379 [Xiangrui Meng] remove UDT from default imports
e98d9d0 [Xiangrui Meng] fix py style
4e84fce [Xiangrui Meng] remove local hive tests and add more tests
39f19e0 [Xiangrui Meng] add tests
b7f666d [Xiangrui Meng] add Python UDT
Author: Michael Armbrust <michael@databricks.com>
Closes#3077 from marmbrus/udfsWithUdts and squashes the following commits:
34b5f27 [Michael Armbrust] style
504adef [Michael Armbrust] Convert arguments to Scala UDFs
- Turns on compression for in-memory cached data by default
- Changes the default parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory)
- Ups the batch size to 10,000 rows
- Increases the broadcast threshold to 10mb.
- Uses our parquet implementation instead of the hive one by default.
- Cache parquet metadata by default.
Author: Michael Armbrust <michael@databricks.com>
Closes#3064 from marmbrus/fasterDefaults and squashes the following commits:
97ee9f8 [Michael Armbrust] parquet codec docs
e641694 [Michael Armbrust] Remote also
a12866a [Michael Armbrust] Cache metadata.
2d73acc [Michael Armbrust] Update docs defaults.
d63d2d5 [Michael Armbrust] document parquet option
da373f9 [Michael Armbrust] More aggressive defaults
CREATE TABLE t1 (a String);
CREATE TABLE t1 AS SELECT key FROM src; – throw exception
CREATE TABLE if not exists t1 AS SELECT key FROM src; – expect do nothing, currently it will overwrite the t1, which is incorrect.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3013 from chenghao-intel/ctas_unittest and squashes the following commits:
194113e [Cheng Hao] fix bug in CTAS when table already existed
This feature is based on an offline discussion with mengxr, hopefully can be useful for the new MLlib pipeline API.
For the following test snippet
```scala
case class KeyValue(key: Int, value: String)
val testData = sc.parallelize(1 to 10).map(i => KeyValue(i, i.toString)).toSchemaRDD
def foo(a: Int, b: String) => a.toString + b
```
the newly introduced DSL enables the following syntax
```scala
import org.apache.spark.sql.catalyst.dsl._
testData.select(Star(None), foo.call('key, 'value) as 'result)
```
which is equivalent to
```scala
testData.registerTempTable("testData")
sqlContext.registerFunction("foo", foo)
sql("SELECT *, foo(key, value) AS result FROM testData")
```
Author: Cheng Lian <lian@databricks.com>
Closes#3067 from liancheng/udf-dsl and squashes the following commits:
f132818 [Cheng Lian] Adds DSL support for Scala UDF
This patch will try to infer schema for RDD which has empty value (None, [], {}) in the first row. It will try first 100 rows and merge the types into schema, also merge fields of StructType together. If there is still NullType in schema, then it will show an warning, tell user to try with sampling.
If sampling is presented, it will infer schema from all the rows after sampling.
Also, add samplingRatio for jsonFile() and jsonRDD()
Author: Davies Liu <davies.liu@gmail.com>
Author: Davies Liu <davies@databricks.com>
Closes#2716 from davies/infer and squashes the following commits:
e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
567dc60 [Davies Liu] update docs
9767b27 [Davies Liu] Merge branch 'master' into infer
e48d7fb [Davies Liu] fix tests
29e94d5 [Davies Liu] let NullType inherit from PrimitiveType
ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
540d1d5 [Davies Liu] merge fields for StructType
f93fd84 [Davies Liu] add more tests
3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by sampling the RDD
Queries which has 'not like' is not working spark sql.
sql("SELECT * FROM records where value not like 'val%'")
same query works in Spark HiveQL
Author: ravipesala <ravindra.pesala@huawei.com>
Closes#3075 from ravipesala/SPARK-4207 and squashes the following commits:
35c11e7 [ravipesala] Supported 'not like' syntax in sql
This PR adds User-Defined Types (UDTs) to SQL. It is a precursor to using SchemaRDD as a Dataset for the new MLlib API. Currently, the UDT API is private since there is incomplete support (e.g., no Java or Python support yet).
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#3063 from marmbrus/udts and squashes the following commits:
7ccfc0d [Michael Armbrust] remove println
46a3aee [Michael Armbrust] Slightly easier to read test output.
6cc434d [Michael Armbrust] Recursively convert rows.
e369b91 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udts
15c10a6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into sql-udt2
f3c72fe [Joseph K. Bradley] Fixing merge
e13cd8a [Joseph K. Bradley] Removed Vector UDTs
5817b2b [Joseph K. Bradley] style edits
30ce5b2 [Joseph K. Bradley] updates based on code review
d063380 [Joseph K. Bradley] Cleaned up Java UDT Suite, and added warning about element ordering when creating schema from Java Bean
a571bb6 [Joseph K. Bradley] Removed old UDT code (registry and Java UDTs). Cleaned up other code. Extended JavaUserDefinedTypeSuite
6fddc1c [Joseph K. Bradley] Made MyLabeledPoint into a Java Bean
20630bc [Joseph K. Bradley] fixed scalastyle
fa86b20 [Joseph K. Bradley] Removed Java UserDefinedType, and made UDTs private[spark] for now
8de957c [Joseph K. Bradley] Modified UserDefinedType to store Java class of user type so that registerUDT takes only the udt argument.
8b242ea [Joseph K. Bradley] Fixed merge error after last merge. Note: Last merge commit also removed SQL UDT examples from mllib.
7f29656 [Joseph K. Bradley] Moved udt case to top of all matches. Small cleanups
b028675 [Xiangrui Meng] allow any type in UDT
4500d8a [Xiangrui Meng] update example code
87264a5 [Xiangrui Meng] remove debug code
3143ac3 [Xiangrui Meng] remove unnecessary changes
cfbc321 [Xiangrui Meng] support UDT in parquet
db16139 [Joseph K. Bradley] Added more doc for UserDefinedType. Removed unused code in Suite
759af7a [Joseph K. Bradley] Added more doc to UserDefineType
63626a4 [Joseph K. Bradley] Updated ScalaReflectionsSuite per @marmbrus suggestions
51e5282 [Joseph K. Bradley] fixed 1 test
f025035 [Joseph K. Bradley] Cleanups before PR. Added new tests
85872f6 [Michael Armbrust] Allow schema calculation to be lazy, but ensure its available on executors.
dff99d6 [Joseph K. Bradley] Added UDTs for Vectors in MLlib, plus DatasetExample using the UDTs
cd60cb4 [Joseph K. Bradley] Trying to get other SQL tests to run
34a5831 [Joseph K. Bradley] Added MLlib dependency on SQL.
e1f7b9c [Joseph K. Bradley] blah
2f40c02 [Joseph K. Bradley] renamed UDT types
3579035 [Joseph K. Bradley] udt annotation now working
b226b9e [Joseph K. Bradley] Changing UDT to annotation
fea04af [Joseph K. Bradley] more cleanups
964b32e [Joseph K. Bradley] some cleanups
893ee4c [Joseph K. Bradley] udt finallly working
50f9726 [Joseph K. Bradley] udts
04303c9 [Joseph K. Bradley] udts
39f8707 [Joseph K. Bradley] removed old udt suite
273ac96 [Joseph K. Bradley] basic UDT is working, but deserialization has yet to be done
8bebf24 [Joseph K. Bradley] commented out convertRowToScala for debugging
53de70f [Joseph K. Bradley] more udts...
982c035 [Joseph K. Bradley] still working on UDTs
19b2f60 [Joseph K. Bradley] still working on UDTs
0eaeb81 [Joseph K. Bradley] Still working on UDTs
105c5a3 [Joseph K. Bradley] Adding UserDefinedType to SQL, not done yet.
This PR adds an API for unregistering temporary tables. If a temporary table has been cached before, it's unpersisted as well.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#3039 from liancheng/unregister-temp-table and squashes the following commits:
54ae99f [Cheng Lian] Fixes Scala styling issue
1948c14 [Cheng Lian] Removes the unpersist argument
aca41d3 [Cheng Lian] Ensures thread safety
7d4fb2b [Cheng Lian] Adds unregisterTempTable API
JIRA: https://issues.apache.org/jira/browse/SPARK-4185.
This PR also has the fix of #3052.
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes#3056 from yhuai/SPARK-4185 and squashes the following commits:
ed3a5a8 [Yin Huai] Correctly handle type conflicts between structs and primitive types in an array.
Move wrapperFor in InsertIntoHiveTable to HiveInspectors to reuse them, this method can be reused when writing date with ObjectInspector(such as orc support)
Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>
Closes#3057 from scwf/reuse-wraperfor and squashes the following commits:
7ccf932 [scwf] fix conflicts
d44f4da [wangfei] fix imports
9bf1b50 [wangfei] revert no related change
9a5276a [wangfei] move wrapfor to hiveinspector to reuse them
This PR overrides the `GetInfo` Hive Thrift API to provide correct version information. Another property `spark.sql.hive.version` is added to reveal the underlying Hive version. These are generally useful for Spark SQL ODBC driver providers. The Spark version information is extracted from the jar manifest. Also took the chance to remove the `SET -v` hack, which was a workaround for Simba ODBC driver connectivity.
TODO
- [x] Find a general way to figure out Hive (or even any dependency) version.
This [blog post](http://blog.soebes.de/blog/2014/01/02/version-information-into-your-appas-with-maven/) suggests several methods to inspect application version. In the case of Spark, this can be tricky because the chosen method:
1. must applies to both Maven build and SBT build
For Maven builds, we can retrieve the version information from the META-INF/maven directory within the assembly jar. But this doesn't work for SBT builds.
2. must not rely on the original jars of dependencies to extract specific dependency version, because Spark uses assembly jar.
This implies we can't read Hive version from Hive jar files since standard Spark distribution doesn't include them.
3. should play well with `SPARK_PREPEND_CLASSES` to ease local testing during development.
`SPARK_PREPEND_CLASSES` prevents classes to be loaded from the assembly jar, thus we can't locate the jar file and read its manifest.
Given these, maybe the only reliable method is to generate a source file containing version information at build time. pwendell Do you have any suggestions from the perspective of the build process?
**Update** Hive version is now retrieved from the newly introduced `HiveShim` object.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: Cheng Lian <lian@databricks.com>
Closes#2843 from liancheng/get-info and squashes the following commits:
a873d0f [Cheng Lian] Updates test case
53f43cd [Cheng Lian] Retrieves underlying Hive verson via HiveShim
1d282b8 [Cheng Lian] Removes the Simba ODBC "SET -v" hack
f857fce [Cheng Lian] Overrides Hive GetInfo Thrift API and adds Hive version property
`CliSuite` has been flaky for a while, this PR tries to improve this situation by fixing a race condition in `CliSuite`. The `captureOutput` function is used to capture both stdout and stderr output of the forked external process in two background threads and search for expected strings, but wasn't been properly synchronized before.
Author: Cheng Lian <lian@databricks.com>
Closes#3060 from liancheng/fix-cli-suite and squashes the following commits:
a70569c [Cheng Lian] Fixes race condition in CliSuite
`NoopColumnStats` was once used for binary, boolean and complex data types. This `ColumnStats` doesn't return properly shaped column statistics and causes caching failure if a table contains columns of the aforementioned types.
This PR adds `BooleanColumnStats`, `BinaryColumnStats` and `GenericColumnStats`, used for boolean, binary and all complex data types respectively. In addition, `NoopColumnStats` returns properly shaped column statistics containing null count and row count, but this class is now used for testing purpose only.
Author: Cheng Lian <lian@databricks.com>
Closes#3059 from liancheng/spark-4182 and squashes the following commits:
b398cfd [Cheng Lian] Fixes failed test case
fb3ee85 [Cheng Lian] Fixes SPARK-4182
This PR introduces a new set of APIs to Spark SQL to allow other developers to add support for reading data from new sources in `org.apache.spark.sql.sources`.
New sources must implement the interface `BaseRelation`, which is responsible for describing the schema of the data. BaseRelations have three `Scan` subclasses, which are responsible for producing an RDD containing row objects. The [various Scan interfaces](https://github.com/marmbrus/spark/blob/foreign/sql/core/src/main/scala/org/apache/spark/sql/sources/package.scala#L50) allow for optimizations such as column pruning and filter push down, when the underlying data source can handle these operations.
By implementing a class that inherits from RelationProvider these data sources can be accessed using using pure SQL. I've used the functionality to update the JSON support so it can now be used in this way as follows:
```sql
CREATE TEMPORARY TABLE jsonTableSQL
USING org.apache.spark.sql.json
OPTIONS (
path '/home/michael/data.json'
)
```
Further example usage can be found in the test cases: https://github.com/marmbrus/spark/tree/foreign/sql/core/src/test/scala/org/apache/spark/sql/sources
There is also a library that uses this new API to read avro data available here:
https://github.com/marmbrus/sql-avro
Author: Michael Armbrust <michael@databricks.com>
Closes#2475 from marmbrus/foreign and squashes the following commits:
1ed6010 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
ab2c31f [Michael Armbrust] fix test
1d41bb5 [Michael Armbrust] unify argument names
5b47901 [Michael Armbrust] Remove sealed, more filter types
fab154a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
e3e690e [Michael Armbrust] Add hook for extraStrategies
a70d602 [Michael Armbrust] Fix style, more tests, FilteredSuite => PrunedFilteredSuite
70da6d9 [Michael Armbrust] Modify API to ease binary compatibility and interop with Java
7d948ae [Michael Armbrust] Fix equality of AttributeReference.
5545491 [Michael Armbrust] Address comments
5031ac3 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
22963ef [Michael Armbrust] package objects compile wierdly...
b069146 [Michael Armbrust] traits => abstract classes
34f836a [Michael Armbrust] Make @DeveloperApi
0d74bcf [Michael Armbrust] Add documention on object life cycle
3e06776 [Michael Armbrust] remove line wraps
de3b68c [Michael Armbrust] Remove empty file
360cb30 [Michael Armbrust] style and java api
2957875 [Michael Armbrust] add override
0fd3a07 [Michael Armbrust] Draft of data sources API
- Adds optional precision and scale to Spark SQL's decimal type, which behave similarly to those in Hive 13 (https://cwiki.apache.org/confluence/download/attachments/27362075/Hive_Decimal_Precision_Scale_Support.pdf)
- Replaces our internal representation of decimals with a Decimal class that can store small values in a mutable Long, saving memory in this situation and letting some operations happen directly on Longs
This is still marked WIP because there are a few TODOs, but I'll remove that tag when done.
Author: Matei Zaharia <matei@databricks.com>
Closes#2983 from mateiz/decimal-1 and squashes the following commits:
35e6b02 [Matei Zaharia] Fix issues after merge
227f24a [Matei Zaharia] Review comments
31f915e [Matei Zaharia] Implement Davies's suggestions in Python
eb84820 [Matei Zaharia] Support reading/writing decimals as fixed-length binary in Parquet
4dc6bae [Matei Zaharia] Fix decimal support in PySpark
d1d9d68 [Matei Zaharia] Fix compile error and test issues after rebase
b28933d [Matei Zaharia] Support decimal precision/scale in Hive metastore
2118c0d [Matei Zaharia] Some test and bug fixes
81db9cb [Matei Zaharia] Added mutable Decimal that will be more efficient for small precisions
7af0c3b [Matei Zaharia] Add optional precision and scale to DecimalType, but use Unlimited for now
ec0a947 [Matei Zaharia] Make the result of AVG on Decimals be Decimal, not Double
`HiveThriftServer2` creates a global singleton `SessionState` instance and overrides `HiveContext` to inject the `SessionState` object. This messes up `SessionState` initialization and causes problems.
This PR replaces the global `SessionState` with `HiveContext.sessionState` to avoid the initialization conflict. Also `HiveContext` reuses existing started `SessionState` if any (this is required by `SparkSQLCLIDriver`, which uses specialized `CliSessionState`).
Author: Cheng Lian <lian@databricks.com>
Closes#2887 from liancheng/spark-4037 and squashes the following commits:
8446675 [Cheng Lian] Removes redundant Driver initialization
a28fef5 [Cheng Lian] Avoid starting HiveContext.sessionState multiple times
49b1c5b [Cheng Lian] Reuses existing started SessionState if any
3cd6fab [Cheng Lian] Fixes SPARK-4037
Add `metadata: Metadata` to `StructField` to store extra information of columns. `Metadata` is a simple wrapper over `Map[String, Any]` with value types restricted to Boolean, Long, Double, String, Metadata, and arrays of those types. SerDe is via JSON.
Metadata is preserved through simple operations like `SELECT`.
marmbrus liancheng
Author: Xiangrui Meng <meng@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#2701 from mengxr/structfield-metadata and squashes the following commits:
dedda56 [Xiangrui Meng] merge remote
5ef930a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
c35203f [Xiangrui Meng] Merge pull request #1 from marmbrus/pr/2701
886b85c [Michael Armbrust] Expose Metadata and MetadataBuilder through the public scala and java packages.
589f314 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
1e2abcf [Xiangrui Meng] change default value of metadata to None in python
611d3c2 [Xiangrui Meng] move metadata from Expr to NamedExpr
ddfcfad [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
a438440 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
4266f4d [Xiangrui Meng] add StructField.toString back for backward compatibility
3f49aab [Xiangrui Meng] remove StructField.toString
24a9f80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
473a7c5 [Xiangrui Meng] merge master
c9d7301 [Xiangrui Meng] organize imports
1fcbf13 [Xiangrui Meng] change metadata type in StructField for Scala/Java
60cc131 [Xiangrui Meng] add doc and header
60614c7 [Xiangrui Meng] add metadata
e42c452 [Xiangrui Meng] merge master
93518fb [Xiangrui Meng] support metadata in python
905bb89 [Xiangrui Meng] java conversions
618e349 [Xiangrui Meng] make tests work in scala
61b8e0f [Xiangrui Meng] merge master
7e5a322 [Xiangrui Meng] do not output metadata in StructField.toString
c41a664 [Xiangrui Meng] merge master
d8af0ed [Xiangrui Meng] move tests to SQLQuerySuite
67fdebb [Xiangrui Meng] add test on join
d65072e [Xiangrui Meng] remove Map.empty
367d237 [Xiangrui Meng] add test
c194d5e [Xiangrui Meng] add metadata field to StructField and Attribute
if the query contains "not between" does not work like.
SELECT * FROM src where key not between 10 and 20'
Author: ravipesala <ravindra.pesala@huawei.com>
Closes#3017 from ravipesala/SPARK-4154 and squashes the following commits:
65fc89e [ravipesala] Handled admin comments
32e6d42 [ravipesala] 'not between' is not working
In org.apache.hadoop.hive.serde2.io.TimestampWritable.set , if the next entry is null then current time stamp object is being reset.
However because of this hiveinspectors:unwrap cannot use the same timestamp object without creating a copy.
Author: Venkata Ramana G <ramana.gollamudihuawei.com>
Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
Closes#3019 from gvramana/spark_4077 and squashes the following commits:
32d818f [Venkata Ramana Gollamudi] fixed check style
fa01e71 [Venkata Ramana Gollamudi] cloned timestamp object as org.apache.hadoop.hive.serde2.io.TimestampWritable.set will reset current time object
In #2241 hive-thriftserver is not enabled. This patch enable hive-thriftserver to support hive-0.13.1 by using a shim layer refer to #2241.
1 A light shim layer(code in sql/hive-thriftserver/hive-version) for each different hive version to handle api compatibility
2 New pom profiles "hive-default" and "hive-versions"(copy from #2241) to activate different hive version
3 SBT cmd for different version as follows:
hive-0.12.0 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.12.0 assembly
hive-0.13.1 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.13.1 assembly
4 Since hive-thriftserver depend on hive subproject, this patch should be merged with #2241 to enable hive-0.13.1 for hive-thriftserver
Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>
Closes#2685 from scwf/shim-thriftserver1 and squashes the following commits:
f26f3be [wangfei] remove clean to save time
f5cac74 [wangfei] remove local hivecontext test
578234d [wangfei] use new shaded hive
18fb1ff [wangfei] exclude kryo in hive pom
fa21d09 [wangfei] clean package assembly/assembly
8a4daf2 [wangfei] minor fix
0d7f6cf [wangfei] address comments
f7c93ae [wangfei] adding build with hive 0.13 before running tests
bcf943f [wangfei] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
c359822 [wangfei] reuse getCommandProcessor in hiveshim
52674a4 [scwf] sql/hive included since examples depend on it
3529e98 [scwf] move hive module to hive profile
f51ff4e [wangfei] update and fix conflicts
f48d3a5 [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
41f727b [scwf] revert pom changes
13afde0 [scwf] fix small bug
4b681f4 [scwf] enable thriftserver in profile hive-0.13.1
0bc53aa [scwf] fixed when result filed is null
dfd1c63 [scwf] update run-tests to run hive-0.12.0 default now
c6da3ce [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver
7c66b8e [scwf] update pom according spark-2706
ae47489 [scwf] update and fix conflicts
The class DeferredObjectAdapter is the inner class of HiveGenericUdf, which may cause some overhead in closure ser/de-ser. Move it to top level.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#3007 from chenghao-intel/move_deferred and squashes the following commits:
3a139b1 [Cheng Hao] Move inner class DeferredObjectAdapter to top level
Fixed usage of deprecated in sql/catalyst/types/datatypes to have versio...n parameter
Author: Anant <anant.asty@gmail.com>
Closes#2970 from anantasty/SPARK-4108 and squashes the following commits:
e92cb01 [Anant] Fixed usage of deprecated in sql/catalyst/types/datatypes to have version parameter
The parquet-mr project has introduced a new filter api (https://github.com/apache/incubator-parquet-mr/pull/4), along with several fixes . It can also eliminate entire RowGroups depending on certain statistics like min/max
We can leverage that to further improve performance of queries with filters.
Also filter2 api introduces ability to create custom filters. We can create a custom filter for the optimized In clause (InSet) , so that elimination happens in the ParquetRecordReader itself
Author: Yash Datta <Yash.Datta@guavus.com>
Closes#2841 from saucam/master and squashes the following commits:
8282ba0 [Yash Datta] SPARK-3968: fix scala code style and add some more tests for filtering on optional columns
515df1c [Yash Datta] SPARK-3968: Add a test case for filter pushdown on optional column
5f4530e [Yash Datta] SPARK-3968: Fix scala code style
f304667 [Yash Datta] SPARK-3968: Using task metadata strategy for row group filtering
ec53e92 [Yash Datta] SPARK-3968: No push down should result in case we are unable to create a record filter
48163c3 [Yash Datta] SPARK-3968: Code cleanup
cc7b596 [Yash Datta] SPARK-3968: 1. Fix RowGroupFiltering not working 2. Use the serialization/deserialization from Parquet library for filter pushdown
caed851 [Yash Datta] Revert "SPARK-3968: Not pushing the filters in case of OPTIONAL columns" since filtering on optional columns is now supported in filter2 api
49703c9 [Yash Datta] SPARK-3968: Not pushing the filters in case of OPTIONAL columns
9d09741 [Yash Datta] SPARK-3968: Change parquet filter pushdown to use filter2 api of parquet-mr
Right now it works for only 2 tables like below query.
sql("SELECT * FROM records1 as a,records2 as b where a.key=b.key ")
But it does not work for more than 2 tables like below query
sql("SELECT * FROM records1 as a,records2 as b,records3 as c where a.key=b.key and a.key=c.key").
Author: ravipesala <ravindra.pesala@huawei.com>
Closes#2987 from ravipesala/multijoin and squashes the following commits:
429b005 [ravipesala] Support multiple joins
(This is just a look at what completely moving the classes would look like. I know Patrick flagged that as maybe not OK, although, it's private?)
Author: Sean Owen <sowen@cloudera.com>
Closes#2814 from srowen/SPARK-1209 and squashes the following commits:
ead1115 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though?
2d42c1d [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark
In JavaSqlContext, we need to let java program use big decimal, timestamp, date types.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#2850 from adrian-wang/javacontext and squashes the following commits:
4c4292c [Daoyuan Wang] change underlying type of JavaSchemaRDD as scala
bb0508f [Daoyuan Wang] add test cases
3c58b0d [Daoyuan Wang] add 3 types for java SQL context
Call Python UDF on ArrayType/MapType/PrimitiveType, the returnType can also be ArrayType/MapType/PrimitiveType.
For StructType, it will act as tuple (without attributes). If returnType is StructType, it also should be tuple.
Author: Davies Liu <davies@databricks.com>
Closes#2973 from davies/udf_array and squashes the following commits:
306956e [Davies Liu] Merge branch 'master' of github.com:apache/spark into udf_array
2c00e43 [Davies Liu] fix merge
11395fa [Davies Liu] Merge branch 'master' of github.com:apache/spark into udf_array
9df50a2 [Davies Liu] address comments
79afb4e [Davies Liu] type conversionfor python udf
In HQL, we convert all of the data type into normal `ObjectInspector`s for UDFs, most of cases it works, however, some of the UDF actually requires its children `ObjectInspector` to be the `ConstantObjectInspector`, which will cause exception.
e.g.
select named_struct("x", "str") from src limit 1;
I updated the method `wrap` by adding the one more parameter `ObjectInspector`(to describe what it expects to wrap to, for example: java.lang.Integer or IntWritable).
As well as the `unwrap` method by providing the input `ObjectInspector`.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#2762 from chenghao-intel/udf_coi and squashes the following commits:
bcacfd7 [Cheng Hao] Shim for both Hive 0.12 & 0.13.1
2416e5d [Cheng Hao] revert to hive 0.12
5793c01 [Cheng Hao] add space before while
4e56e1b [Cheng Hao] style issue
683d3fd [Cheng Hao] Add golden files
fe591e4 [Cheng Hao] update HiveGenericUdf for set the ObjectInspector while constructing the DeferredObject
f6740fe [Cheng Hao] Support Constant ObjectInspector for Map & List
8814c3a [Cheng Hao] Passing ContantObjectInspector(when necessary) for UDF initializing
Currently, `CTAS` (Create Table As Select) doesn't support specifying the `SerDe` in HQL. This PR will pass down the `ASTNode` into the physical operator `execution.CreateTableAsSelect`, which will extract the `CreateTableDesc` object via Hive `SemanticAnalyzer`. In the meantime, I also update the `HiveMetastoreCatalog.createTable` to optionally support the `CreateTableDesc` for table creation.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#2570 from chenghao-intel/ctas_serde and squashes the following commits:
e011ef5 [Cheng Hao] shim for both 0.12 & 0.13.1
cfb3662 [Cheng Hao] revert to hive 0.12
c8a547d [Cheng Hao] Support SerDe properties within CTAS
Add json and python api for date type.
By using Pickle, `java.sql.Date` was serialized as calendar, and recognized in python as `datetime.datetime`.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#2901 from adrian-wang/spark3988 and squashes the following commits:
c51a24d [Daoyuan Wang] convert datetime to date
5670626 [Daoyuan Wang] minor line combine
f760d8e [Daoyuan Wang] fix indent
444f100 [Daoyuan Wang] fix a typo
1d74448 [Daoyuan Wang] fix scala style
8d7dd22 [Daoyuan Wang] add json and python api for date type
Currently there is no support of Bitwise & , | in Spark HiveQl and Spark SQL as well. So this PR support the same.
I am closing https://github.com/apache/spark/pull/2926 as it has conflicts to merge. And also added support for Bitwise AND(&), OR(|) ,XOR(^), NOT(~) And I handled all review comments in that PR
Author: ravipesala <ravindra.pesala@huawei.com>
Closes#2961 from ravipesala/SPARK-3814-NEW4 and squashes the following commits:
a391c7a [ravipesala] Rebase with master
JIRA issue: [SPARK-3907]https://issues.apache.org/jira/browse/SPARK-3907
Add turncate table support
TRUNCATE TABLE table_name [PARTITION partition_spec];
partition_spec:
: (partition_col = partition_col_value, partition_col = partiton_col_value, ...)
Removes all rows from a table or partition(s). Currently target table should be native/managed table or exception will be thrown. User can specify partial partition_spec for truncating multiple partitions at once and omitting partition_spec will truncate all partitions in the table.
Author: wangxiaojing <u9jing@gmail.com>
Closes#2770 from wangxiaojing/spark-3907 and squashes the following commits:
63dbd81 [wangxiaojing] change hive scalastyle
7a03707 [wangxiaojing] add comment
f6e710e [wangxiaojing] change truncate table
a1f692c [wangxiaojing] Correct spelling mistakes
3b20007 [wangxiaojing] add truncate can not support column err message
e483547 [wangxiaojing] add golden file
77b1f20 [wangxiaojing] add truncate table support
`schemaRDD2` is not tested because `schemaRDD1` is registered again.
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes#2869 from yhuai/JavaApplySchemaSuite and squashes the following commits:
95fe894 [Yin Huai] Correct variable name.
In ```MetastoreRelation``` the attributes name is lowercase because of hive using lowercase for fields name, so we should convert attributes name in table scan lowercase in ```indexWhere(_.name == a.name)```.
```neededColumnIDs``` may be not correct if not convert to lowercase.
Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>
Closes#2884 from scwf/fixColumnIds and squashes the following commits:
6174046 [scwf] use AttributeMap for this issue
dc74a24 [wangfei] use lowerName and add a test case for this issue
3ff3a80 [wangfei] more safer change
294fcb7 [scwf] attributes names in table scan should convert lowercase in neededColumnsIDs
...ob conf in SparkHadoopWriter class
Author: Alex Liu <alex_liu68@yahoo.com>
Closes#2677 from alexliu68/SPARK-SQL-3816 and squashes the following commits:
79c269b [Alex Liu] [SPARK-3816][SQL] Add table properties from storage handler to job conf