Commit graph

892 commits

Author SHA1 Message Date
Yijie Shen 3ca995b78f [SPARK-6212] [SQL] The EXPLAIN output of CTAS only shows the analyzed plan
JIRA: https://issues.apache.org/jira/browse/SPARK-6212

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7986 from yjshen/ctas_explain and squashes the following commits:

bb6fee5 [Yijie Shen] refine test
f731041 [Yijie Shen] address comment
b2cf8ab [Yijie Shen] bug fix
bd7eb20 [Yijie Shen] ctas explain
2015-08-08 21:05:50 -07:00
Yijie Shen 23695f1d2d [SPARK-9728][SQL]Support CalendarIntervalType in HiveQL
This PR enables converting interval term in HiveQL to CalendarInterval Literal.

JIRA: https://issues.apache.org/jira/browse/SPARK-9728

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #8034 from yjshen/interval_hiveql and squashes the following commits:

7fe9a5e [Yijie Shen] declare throw exception and add unit test
fce7795 [Yijie Shen] convert hiveql interval term into CalendarInterval literal
2015-08-08 11:01:25 -07:00
Reynold Xin 05d04e10a8 [SPARK-9733][SQL] Improve physical plan explain for data sources
All data sources show up as "PhysicalRDD" in physical plan explain. It'd be better if we can show the name of the data source.

Without this patch:
```
== Physical Plan ==
NewAggregate with UnsafeHybridAggregationIterator ArrayBuffer(date#0, cat#1) ArrayBuffer((sum(CAST((CAST(count#2, IntegerType) + 1), LongType))2,mode=Final,isDistinct=false))
 Exchange hashpartitioning(date#0,cat#1)
  NewAggregate with UnsafeHybridAggregationIterator ArrayBuffer(date#0, cat#1) ArrayBuffer((sum(CAST((CAST(count#2, IntegerType) + 1), LongType))2,mode=Partial,isDistinct=false))
   PhysicalRDD [date#0,cat#1,count#2], MapPartitionsRDD[3] at
```

With this patch:
```
== Physical Plan ==
TungstenAggregate(key=[date#0,cat#1], value=[(sum(CAST((CAST(count#2, IntegerType) + 1), LongType)),mode=Final,isDistinct=false)]
 Exchange hashpartitioning(date#0,cat#1)
  TungstenAggregate(key=[date#0,cat#1], value=[(sum(CAST((CAST(count#2, IntegerType) + 1), LongType)),mode=Partial,isDistinct=false)]
   ConvertToUnsafe
    Scan ParquetRelation[file:/scratch/rxin/spark/sales4][date#0,cat#1,count#2]
```

Author: Reynold Xin <rxin@databricks.com>

Closes #8024 from rxin/SPARK-9733 and squashes the following commits:

811b90e [Reynold Xin] Fixed Python test case.
52cab77 [Reynold Xin] Cast.
eea9ccc [Reynold Xin] Fix test case.
fcecb22 [Reynold Xin] [SPARK-9733][SQL] Improve explain message for data source scan node.
2015-08-07 13:41:45 -07:00
Yin Huai 3504bf3aa9 [SPARK-9630] [SQL] Clean up new aggregate operators (SPARK-9240 follow up)
This is the followup of https://github.com/apache/spark/pull/7813. It renames `HybridUnsafeAggregationIterator` to `TungstenAggregationIterator` and makes it only work with `UnsafeRow`. Also, I add a `TungstenAggregate` that uses `TungstenAggregationIterator` and make `SortBasedAggregate` (renamed from `SortBasedAggregate`) only works with `SafeRow`.

Author: Yin Huai <yhuai@databricks.com>

Closes #7954 from yhuai/agg-followUp and squashes the following commits:

4d2f4fc [Yin Huai] Add comments and free map.
0d7ddb9 [Yin Huai] Add TungstenAggregationQueryWithControlledFallbackSuite to test fall back process.
91d69c2 [Yin Huai] Rename UnsafeHybridAggregationIterator to  TungstenAggregateIteraotr and make it only work with UnsafeRow.
2015-08-06 15:04:44 -07:00
Christian Kadner abfedb9cd7 [SPARK-9211] [SQL] [TEST] normalize line separators before generating MD5 hash
The golden answer file names for the existing Hive comparison tests were generated using a MD5 hash of the query text which uses Unix-style line separator characters `\n` (LF).
This PR ensures that all occurrences of the Windows-style line separator `\r\n` (CR) are replaced with `\n` (LF) before generating the MD5 hash to produce an identical MD5 hash for golden answer file names generated on Windows.

Author: Christian Kadner <ckadner@us.ibm.com>

Closes #7563 from ckadner/SPARK-9211_working and squashes the following commits:

d541db0 [Christian Kadner] [SPARK-9211][SQL] normalize line separators before MD5 hash
2015-08-06 14:15:42 -07:00
Wenchen Fan 1f62f104c7 [SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info
This re-applies #7955, which was reverted due to a race condition to fix build breaking.

Author: Wenchen Fan <cloud0fan@outlook.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #8002 from rxin/InternalRow-toSeq and squashes the following commits:

332416a [Reynold Xin] Merge pull request #7955 from cloud-fan/toSeq
21665e2 [Wenchen Fan] fix hive again...
4addf29 [Wenchen Fan] fix hive
bc16c59 [Wenchen Fan] minor fix
33d802c [Wenchen Fan] pass data type info to InternalRow.toSeq
3dd033e [Wenchen Fan] move the default special getters implementation from InternalRow to BaseGenericInternalRow
2015-08-06 13:11:59 -07:00
Davies Liu 2eca46a17a Revert "[SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info"
This reverts commit 6e009cb9c4.
2015-08-06 11:15:37 -07:00
Wenchen Fan 6e009cb9c4 [SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7955 from cloud-fan/toSeq and squashes the following commits:

21665e2 [Wenchen Fan] fix hive again...
4addf29 [Wenchen Fan] fix hive
bc16c59 [Wenchen Fan] minor fix
33d802c [Wenchen Fan] pass data type info to InternalRow.toSeq
3dd033e [Wenchen Fan] move the default special getters implementation from InternalRow to BaseGenericInternalRow
2015-08-06 10:40:54 -07:00
Yin Huai d5a9af3230 [SPARK-9664] [SQL] Remove UDAFRegistration and add apply to UserDefinedAggregateFunction.
https://issues.apache.org/jira/browse/SPARK-9664

Author: Yin Huai <yhuai@databricks.com>

Closes #7982 from yhuai/udafRegister and squashes the following commits:

0cc2287 [Yin Huai] Remove UDAFRegistration and add apply to UserDefinedAggregateFunction.
2015-08-05 21:50:35 -07:00
Cheng Hao 119b590538 [SPARK-6923] [SPARK-7550] [SQL] Persists data source relations in Hive compatible format when possible
This PR is a fork of PR #5733 authored by chenghao-intel.  For committers who's going to merge this PR, please set the author to "Cheng Hao <hao.chengintel.com>".

----

When a data source relation meets the following requirements, we persist it in Hive compatible format, so that other systems like Hive can access it:

1. It's a `HadoopFsRelation`
2. It has only one input path
3. It's non-partitioned
4. It's data source provider can be naturally mapped to a Hive builtin SerDe (e.g. ORC and Parquet)

Author: Cheng Lian <lian@databricks.com>
Author: Cheng Hao <hao.cheng@intel.com>

Closes #7967 from liancheng/spark-6923/refactoring-pr-5733 and squashes the following commits:

5175ee6 [Cheng Lian] Fixes an oudated comment
3870166 [Cheng Lian] Fixes build error and comments
864acee [Cheng Lian] Refactors PR #5733
3490cdc [Cheng Hao] update the scaladoc
6f57669 [Cheng Hao] write schema info to hivemetastore for data source
2015-08-06 11:13:44 +08:00
Michael Armbrust 23d982204b [SPARK-9141] [SQL] Remove project collapsing from DataFrame API
Currently we collapse successive projections that are added by `withColumn`.  However, this optimization violates the constraint that adding nodes to a plan will never change its analyzed form and thus breaks caching.  Instead of doing early optimization, in this PR I just fix some low-hanging slowness in the analyzer.  In particular, I add a mechanism for skipping already analyzed subplans, `resolveOperators` and `resolveExpression`.  Since trees are generally immutable after construction, it's safe to annotate a plan as already analyzed as any transformation will create a new tree with this bit no longer set.

Together these result in a faster analyzer than before, even with added timing instrumentation.

```
Original Code
[info] 3430ms
[info] 2205ms
[info] 1973ms
[info] 1982ms
[info] 1916ms

Without Project Collapsing in DataFrame
[info] 44610ms
[info] 45977ms
[info] 46423ms
[info] 46306ms
[info] 54723ms

With analyzer optimizations
[info] 6394ms
[info] 4630ms
[info] 4388ms
[info] 4093ms
[info] 4113ms

With resolveOperators
[info] 2495ms
[info] 1380ms
[info] 1685ms
[info] 1414ms
[info] 1240ms
```

Author: Michael Armbrust <michael@databricks.com>

Closes #7920 from marmbrus/withColumnCache and squashes the following commits:

2145031 [Michael Armbrust] fix hive udfs tests
5a5a525 [Michael Armbrust] remove wrong comment
7a507d5 [Michael Armbrust] style
b59d710 [Michael Armbrust] revert small change
1fa5949 [Michael Armbrust] move logic into LogicalPlan, add tests
0e2cb43 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into withColumnCache
c926e24 [Michael Armbrust] naming
e593a2d [Michael Armbrust] style
f5a929e [Michael Armbrust] [SPARK-9141][SQL] Remove project collapsing from DataFrame API
38b1c83 [Michael Armbrust] WIP
2015-08-05 09:01:45 -07:00
Cheng Hao 519cf6d3f7 [SPARK-9381] [SQL] Migrate JSON data source to the new partitioning data source
Support partitioning for the JSON data source.

Still 2 open issues for the `HadoopFsRelation`
- `refresh()` will invoke the `discoveryPartition()`, which will auto infer the data type for the partition columns, and maybe conflict with the given partition columns. (TODO enable `HadoopFsRelationSuite.Partition column type casting"
- When insert data into a cached HadoopFsRelation based table, we need to invalidate the cache after the insertion (TODO enable `InsertSuite.Caching`)

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7696 from chenghao-intel/json and squashes the following commits:

d90b104 [Cheng Hao] revert the change for JacksonGenerator.apply
307111d [Cheng Hao] fix bug in the unit test
8738c8a [Cheng Hao] fix bug in unit testing
35f2cde [Cheng Hao] support partition for json format
2015-08-05 22:35:55 +08:00
zhichao.li 6f8f0e265a [SPARK-7119] [SQL] Give script a default serde with the user specific types
This is to address this issue that there would be not compatible type exception when running this:
`from (from src select transform(key, value) using 'cat' as (thing1 int, thing2 string)) t select thing1 + 2;`

15/04/24 00:58:55 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: org.apache.spark.sql.types.UTF8String cannot be cast to java.lang.Integer
	at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
	at scala.math.Numeric$IntIsIntegral$.plus(Numeric.scala:57)
	at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:127)
	at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118)
	at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
	at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
	at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
	at org.apache.spark.scheduler.Task.run(Task.scala:64)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:722)

chenghao-intel marmbrus

Author: zhichao.li <zhichao.li@intel.com>

Closes #6638 from zhichao-li/transDataType2 and squashes the following commits:

a36cc7c [zhichao.li] style
b9252a8 [zhichao.li] delete cacheRow
f6968a4 [zhichao.li] give script a default serde
2015-08-04 18:26:05 -07:00
Davies Liu 73dedb589d [SPARK-8246] [SQL] Implement get_json_object
This is based on #7485 , thanks to NathanHowell

Tests were copied from Hive, but do not seem to be super comprehensive. I've generally replicated Hive's unusual behavior rather than following a JSONPath reference, except for one case (as noted in the comments). I don't know if there is a way of fully replicating Hive's behavior without a slower TreeNode implementation, so I've erred on the side of performance instead.

Author: Davies Liu <davies@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Nathan Howell <nhowell@godaddy.com>

Closes #7901 from davies/get_json_object and squashes the following commits:

3ace9b9 [Davies Liu] Merge branch 'get_json_object' of github.com:davies/spark into get_json_object
98766fc [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object
a7dc6d0 [Davies Liu] Update JsonExpressionsSuite.scala
c818519 [Yin Huai] new results.
18ce26b [Davies Liu] fix tests
6ac29fb [Yin Huai] Golden files.
25eebef [Davies Liu] use HiveQuerySuite
e0ac6ec [Yin Huai] Golden answer files.
940c060 [Davies Liu] tweat code style
44084c5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object
9192d09 [Nathan Howell] Match Hive’s behavior for unwrapping arrays of one element
8dab647 [Nathan Howell] [SPARK-8246] [SQL] Implement get_json_object
2015-08-04 09:07:09 -07:00
Sean Owen 76d74090d6 [SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition
Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.

I'll explain several of the changes inline in comments.

Author: Sean Owen <sowen@cloudera.com>

Closes #7862 from srowen/SPARK-9534 and squashes the following commits:

ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
2015-08-04 12:02:26 +01:00
Steve Loughran a2409d1c8e [SPARK-8064] [SQL] Build against Hive 1.2.1
Cherry picked the parts of the initial SPARK-8064 WiP branch needed to get sql/hive to compile against hive 1.2.1. That's the ASF release packaged under org.apache.hive, not any fork.

Tests not run yet: that's what the machines are for

Author: Steve Loughran <stevel@hortonworks.com>
Author: Cheng Lian <lian@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Author: Patrick Wendell <patrick@databricks.com>

Closes #7191 from steveloughran/stevel/feature/SPARK-8064-hive-1.2-002 and squashes the following commits:

7556d85 [Cheng Lian] Updates .q files and corresponding golden files
ef4af62 [Steve Loughran] Merge commit '6a92bb09f46a04d6cd8c41bdba3ecb727ebb9030' into stevel/feature/SPARK-8064-hive-1.2-002
6a92bb0 [Cheng Lian] Overrides HiveConf time vars
dcbb391 [Cheng Lian] Adds com.twitter:parquet-hadoop-bundle:1.6.0 for Hive Parquet SerDe
0bbe475 [Steve Loughran] SPARK-8064 scalastyle rejects the standard Hadoop ASF license header...
fdf759b [Steve Loughran] SPARK-8064 classpath dependency suite to be in sync with shading in final (?) hive-exec spark
7a6c727 [Steve Loughran] SPARK-8064 switch to second staging repo of the spark-hive artifacts. This one has the protobuf-shaded hive-exec jar
376c003 [Steve Loughran] SPARK-8064 purge duplicate protobuf declaration
2c74697 [Steve Loughran] SPARK-8064 switch to the protobuf shaded hive-exec jar with tests to chase it down
cc44020 [Steve Loughran] SPARK-8064 remove hadoop.version from runtest.py, as profile will fix that automatically.
6901fa9 [Steve Loughran] SPARK-8064 explicit protobuf import
da310dc [Michael Armbrust] Fixes for Hive tests.
a775a75 [Steve Loughran] SPARK-8064 cherry-pick-incomplete
7404f34 [Patrick Wendell] Add spark-hive staging repo
832c164 [Steve Loughran] SPARK-8064 try to supress compiler warnings on Complex.java pasted-thrift-code
312c0d4 [Steve Loughran] SPARK-8064  maven/ivy dependency purge; calcite declaration needed
fa5ae7b [Steve Loughran] HIVE-8064 fix up hive-thriftserver dependencies and cut back on evicted references in the hive- packages; this keeps mvn and ivy resolution compatible, as the reconciliation policy is "by hand"
c188048 [Steve Loughran] SPARK-8064 manage the Hive depencencies to that -things that aren't needed are excluded -sql/hive built with ivy is in sync with the maven reconciliation policy, rather than latest-first
4c8be8d [Cheng Lian] WIP: Partial fix for Thrift server and CLI tests
314eb3c [Steve Loughran] SPARK-8064 deprecation warning  noise in one of the tests
17b0341 [Steve Loughran] SPARK-8064 IDE-hinted cleanups of Complex.java to reduce compiler warnings. It's all autogenerated code, so still ugly.
d029b92 [Steve Loughran] SPARK-8064 rely on unescaping to have already taken place, so go straight to map of serde options
23eca7e [Steve Loughran] HIVE-8064 handle raw and escaped property tokens
54d9b06 [Steve Loughran] SPARK-8064 fix compilation regression surfacing from rebase
0b12d5f [Steve Loughran] HIVE-8064 use subset of hive complex type whose types deserialize
fce73b6 [Steve Loughran] SPARK-8064 poms rely implicitly on the version of kryo chill provides
fd3aa5d [Steve Loughran] SPARK-8064 version of hive to d/l from ivy is 1.2.1
dc73ece [Steve Loughran] SPARK-8064 revert to master's determinstic pushdown strategy
d3c1e4a [Steve Loughran] SPARK-8064 purge UnionType
051cc21 [Steve Loughran] SPARK-8064 switch to an unshaded version of hive-exec-core, which must have been built with Kryo 2.21. This currently looks for a (locally built) version 1.2.1.spark
6684c60 [Steve Loughran] SPARK-8064 ignore RTE raised in blocking process.exitValue() call
e6121e5 [Steve Loughran] SPARK-8064 address review comments
aa43dc6 [Steve Loughran] SPARK-8064  more robust teardown on JavaMetastoreDatasourcesSuite
f2bff01 [Steve Loughran] SPARK-8064 better takeup of asynchronously caught error text
8b1ef38 [Steve Loughran] SPARK-8064: on failures executing spark-submit in HiveSparkSubmitSuite, print command line and all logged output.
5a9ce6b [Steve Loughran] SPARK-8064 add explicit reason for kv split failure, rather than array OOB. *does not address the issue*
642b63a [Steve Loughran] SPARK-8064 reinstate something cut briefly during rebasing
97194dc [Steve Loughran] SPARK-8064 add extra logging to the YarnClusterSuite classpath test. There should be no reason why this is failing on jenkins, but as it is (and presumably its CP-related), improve the logging including any exception raised.
335357f [Steve Loughran] SPARK-8064 fail fast on thrive process spawning tests on exit codes and/or error string patterns seen in log.
3ed872f [Steve Loughran] SPARK-8064 rename field double to  dbl
bca55e5 [Steve Loughran] SPARK-8064 missed one of the `date` escapes
41d6479 [Steve Loughran] SPARK-8064 wrap tests with withTable() calls to avoid table-exists exceptions
2bc29a4 [Steve Loughran] SPARK-8064 ParquetSuites to escape `date` field name
1ab9bc4 [Steve Loughran] SPARK-8064 TestHive to use sered2.thrift.test.Complex
bf3a249 [Steve Loughran] SPARK-8064: more resubmit than fix; tighten startup timeout to 60s. Still no obvious reason why jersey server code in spark-assembly isn't being picked up -it hasn't been shaded
c829b8f [Steve Loughran] SPARK-8064: reinstate yarn-rm-server dependencies to hive-exec to ensure that jersey server is on classpath on hadoop versions < 2.6
0b0f738 [Steve Loughran] SPARK-8064: thrift server startup to fail fast on any exception in the main thread
13abaf1 [Steve Loughran] SPARK-8064 Hive compatibilty tests sin sync with explain/show output from Hive 1.2.1
d14d5ea [Steve Loughran] SPARK-8064: DATE is now a predicate; you can't use it as a field in select ops
26eef1c [Steve Loughran] SPARK-8064: HIVE-9039 renamed TOK_UNION => TOK_UNIONALL while adding TOK_UNIONDISTINCT
3d64523 [Steve Loughran] SPARK-8064 improve diagns on uknown token; fix scalastyle failure
d0360f6 [Steve Loughran] SPARK-8064: delicate merge in of the branch vanzin/hive-1.1
1126e5a [Steve Loughran] SPARK-8064: name of unrecognized file format wasn't appearing in error text
8cb09c4 [Steve Loughran] SPARK-8064: test resilience/assertion improvements. Independent of the rest of the work; can be backported to earlier versions
dec12cb [Steve Loughran] SPARK-8064: when a CLI suite test fails include the full output text in the raised exception; this ensures that the stdout/stderr is included in jenkins reports, so it becomes possible to diagnose the cause.
463a670 [Steve Loughran] SPARK-8064 run-tests.py adds a hadoop-2.6 profile, and changes info messages to say "w/Hive 1.2.1" in console output
2531099 [Steve Loughran] SPARK-8064 successful attempt to get rid of pentaho as a transitive dependency of hive-exec
1d59100 [Steve Loughran] SPARK-8064 (unsuccessful) attempt to get rid of pentaho as a transitive dependency of hive-exec
75733fc [Steve Loughran] SPARK-8064 change thrift binary startup message to "Starting ThriftBinaryCLIService on port"
3ebc279 [Steve Loughran] SPARK-8064 move strings used to check for http/bin thrift services up into constants
c80979d [Steve Loughran] SPARK-8064: SparkSQLCLIDriver drops remote mode support. CLISuite Tests pass instead of timing out: undetected regression?
27e8370 [Steve Loughran] SPARK-8064 fix some style & IDE warnings
00e50d6 [Steve Loughran] SPARK-8064 stop excluding hive shims from dependency (commented out , for now)
cb4f142 [Steve Loughran] SPARK-8054 cut pentaho dependency from calcite
f7aa9cb [Steve Loughran] SPARK-8064 everything compiles with some commenting and moving of classes into a hive package
6c310b4 [Steve Loughran] SPARK-8064 subclass  Hive ServerOptionsProcessor to make it public again
f61a675 [Steve Loughran] SPARK-8064 thrift server switched to Hive 1.2.1, though it doesn't compile everywhere
4890b9d [Steve Loughran] SPARK-8064, build against Hive 1.2.1
2015-08-03 15:24:42 -07:00
Yin Huai 1ebd41b141 [SPARK-9240] [SQL] Hybrid aggregate operator using unsafe row
This PR adds a base aggregation iterator `AggregationIterator`, which is used to create `SortBasedAggregationIterator` (for sort-based aggregation) and `UnsafeHybridAggregationIterator` (first it tries hash-based aggregation and falls back to the sort-based aggregation (using external sorter) if we cannot allocate memory for the map). With these two iterators, we will not need existing iterators and I am removing those. Also, we can use a single physical `Aggregate` operator and it internally determines what iterators to used.

https://issues.apache.org/jira/browse/SPARK-9240

Author: Yin Huai <yhuai@databricks.com>

Closes #7813 from yhuai/AggregateOperator and squashes the following commits:

e317e2b [Yin Huai] Remove unnecessary change.
74d93c5 [Yin Huai] Merge remote-tracking branch 'upstream/master' into AggregateOperator
ba6afbc [Yin Huai] Add a little bit more comments.
c9cf3b6 [Yin Huai] update
0f1b06f [Yin Huai] Remove unnecessary code.
21fd15f [Yin Huai] Remove unnecessary change.
964f88b [Yin Huai] Implement fallback strategy.
b1ea5cf [Yin Huai] wip
7fcbd87 [Yin Huai] Add a flag to control what iterator to use.
533d5b2 [Yin Huai] Prepare for fallback!
33b7022 [Yin Huai] wip
bd9282b [Yin Huai] UDAFs now supports UnsafeRow.
f52ee53 [Yin Huai] wip
3171f44 [Yin Huai] wip
d2c45a0 [Yin Huai] wip
f60cc83 [Yin Huai] Also check input schema.
af32210 [Yin Huai] Check iter.hasNext before we create an iterator because the constructor of the iterato will read at least one row from a non-empty input iter.
299008c [Yin Huai] First round cleanup.
3915bac [Yin Huai] Create a base iterator class for aggregation iterators and add the initial version of the hybrid iterator.
2015-08-03 00:23:08 -07:00
Liang-Chi Hsieh 0722f43316 [SPARK-7937][SQL] Support comparison on StructType
This brings #6519 up-to-date with master branch.

Closes #6519.

Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #7877 from rxin/sort-struct and squashes the following commits:

4968231 [Reynold Xin] Minor fixes.
2537813 [Reynold Xin] Merge branch 'compare_named_struct' of github.com:viirya/spark-1 into sort-struct
d2ba8ad [Liang-Chi Hsieh] Remove unused import.
3a3f40e [Liang-Chi Hsieh] Don't need to add compare to InternalRow because we can use RowOrdering.
dae6aad [Liang-Chi Hsieh] Fix nested struct.
d5349c7 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
43d4354 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
1f66196 [Liang-Chi Hsieh] Reuse RowOrdering and GenerateOrdering.
f8b2e9c [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
1187a65 [Liang-Chi Hsieh] Fix scala style.
9d67f68 [Liang-Chi Hsieh] Fix wrongly merging.
8f4d775 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
94b27d5 [Liang-Chi Hsieh] Remove test for error on complex type comparison.
2071693 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
3c142e4 [Liang-Chi Hsieh] Fix scala style.
cf58dc3 [Liang-Chi Hsieh] Use checkAnswer.
f651b8d [Liang-Chi Hsieh] Remove Either and move orderings to BinaryComparison to reuse it.
b6e1009 [Liang-Chi Hsieh] Fix scala style.
3922b54 [Liang-Chi Hsieh] Support ordering on named_struct.
2015-08-02 17:53:44 -07:00
Wenchen Fan 1d59a4162b [SPARK-9480][SQL] add MapData and cleanup internal row stuff
This PR adds a `MapData` as internal representation of map type in Spark SQL, and provides a default implementation with just 2 `ArrayData`.

After that, we have specialized getters for all internal type, so I removed generic getter in `ArrayData` and added specialized `toArray` for it.
Also did some refactor and cleanup for `InternalRow` and its subclasses.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7799 from cloud-fan/map-data and squashes the following commits:

77d482f [Wenchen Fan] fix python
e8f6682 [Wenchen Fan] skip MapData equality check in HiveInspectorSuite
40cc9db [Wenchen Fan] add toString
6e06ec9 [Wenchen Fan] some more cleanup
a90aca1 [Wenchen Fan] add MapData
2015-08-01 00:17:15 -07:00
Herman van Hovell 39ab199a3f [SPARK-8640] [SQL] Enable Processing of Multiple Window Frames in a Single Window Operator
This PR enables the processing of multiple window frames in a single window operator. This should improve the performance of processing multiple window expressions wich share partition by/order by clauses, because it will be more efficient with respect to memory use and group processing.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #7515 from hvanhovell/SPARK-8640 and squashes the following commits:

f0e1c21 [Herman van Hovell] Changed Window Logical/Physical plans to use partition by/order by specs directly instead of using WindowSpec.
e1711c2 [Herman van Hovell] Enabled the processing of multiple window frames in a single Window operator.
2015-07-31 12:08:25 -07:00
Wenchen Fan c0cc0eaec6 [SPARK-9390][SQL] create a wrapper for array type
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7724 from cloud-fan/array-data and squashes the following commits:

d0408a1 [Wenchen Fan] fix python
661e608 [Wenchen Fan] rebase
f39256c [Wenchen Fan] fix hive...
6dbfa6f [Wenchen Fan] fix hive again...
8cb8842 [Wenchen Fan] remove element type parameter from getArray
43e9816 [Wenchen Fan] fix mllib
e719afc [Wenchen Fan] fix hive
4346290 [Wenchen Fan] address comment
d4a38da [Wenchen Fan] remove sizeInBytes and add license
7e283e2 [Wenchen Fan] create a wrapper for array type
2015-07-30 10:04:30 -07:00
Joseph Batchik 1221849f91 [SPARK-8005][SQL] Input file name
Users can now get the file name of the partition being read in. A thread local variable is in `SQLNewHadoopRDD` and is set when the partition is computed. `SQLNewHadoopRDD` is moved to core so that the catalyst package can reach it.

This supports:

`df.select(inputFileName())`

and

`sqlContext.sql("select input_file_name() from table")`

Author: Joseph Batchik <josephbatchik@gmail.com>

Closes #7743 from JDrit/input_file_name and squashes the following commits:

abb8609 [Joseph Batchik] fixed failing test and changed the default value to be an empty string
d2f323d [Joseph Batchik] updates per review
102061f [Joseph Batchik] updates per review
75313f5 [Joseph Batchik] small fixes
c7f7b5a [Joseph Batchik] addeding input file name to Spark SQL
2015-07-29 23:35:55 -07:00
Reynold Xin 6662ee2124 [SPARK-9418][SQL] Use sort-merge join as the default shuffle join.
Sort-merge join is more robust in Spark since sorting can be made using the Tungsten sort operator.

Author: Reynold Xin <rxin@databricks.com>

Closes #7733 from rxin/smj and squashes the following commits:

61e4d34 [Reynold Xin] Fixed test case.
5ffd731 [Reynold Xin] Fixed JoinSuite.
a137dc0 [Reynold Xin] [SPARK-9418][SQL] Use sort-merge join as the default shuffle join.
2015-07-28 17:42:35 -07:00
Reynold Xin b7f54119f8 [SPARK-9420][SQL] Move expressions in sql/core package to catalyst.
Since catalyst package already depends on Spark core, we can move those expressions
into catalyst, and simplify function registry.

This is a followup of #7478.

Author: Reynold Xin <rxin@databricks.com>

Closes #7735 from rxin/SPARK-8003 and squashes the following commits:

2ffbdc3 [Reynold Xin] [SPARK-8003][SQL] Move expressions in sql/core package to catalyst.
2015-07-28 17:03:59 -07:00
Josh Rosen 59b92add7c [SPARK-9393] [SQL] Fix several error-handling bugs in ScriptTransform operator
SparkSQL's ScriptTransform operator has several serious bugs which make debugging fairly difficult:

- If exceptions are thrown in the writing thread then the child process will not be killed, leading to a deadlock because the reader thread will block while waiting for input that will never arrive.
- TaskContext is not propagated to the writer thread, which may cause errors in upstream pipelined operators.
- Exceptions which occur in the writer thread are not propagated to the main reader thread, which may cause upstream errors to be silently ignored instead of killing the job.  This can lead to silently incorrect query results.
- The writer thread is not a daemon thread, but it should be.

In addition, the code in this file is extremely messy:

- Lots of fields are nullable but the nullability isn't clearly explained.
- Many confusing variable names: for instance, there are variables named `ite` and `iterator` that are defined in the same scope.
- Some code was misindented.
- The `*serdeClass` variables are actually expected to be single-quoted strings, which is really confusing: I feel that this parsing / extraction should be performed in the analyzer, not in the operator itself.
- There were no unit tests for the operator itself, only end-to-end tests.

This pull request addresses these issues, borrowing some error-handling techniques from PySpark's PythonRDD.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7710 from JoshRosen/script-transform and squashes the following commits:

16c44e2 [Josh Rosen] Update some comments
983f200 [Josh Rosen] Use unescapeSQLString instead of stripQuotes
6a06a8c [Josh Rosen] Clean up handling of quotes in serde class name
494cde0 [Josh Rosen] Propagate TaskContext to writer thread
323bb2b [Josh Rosen] Fix error-swallowing bug
b31258d [Josh Rosen] Rename iterator variables to disambiguate.
88278de [Josh Rosen] Split ScriptTransformation writer thread into own class.
8b162b6 [Josh Rosen] Add failing test which demonstrates exception masking issue
4ee36a2 [Josh Rosen] Kill script transform subprocess when error occurs in input writer.
bd4c948 [Josh Rosen] Skip launching of external command for empty partitions.
b43e4ec [Josh Rosen] Clean up nullability in ScriptTransformation
fa18d26 [Josh Rosen] Add basic unit test for script transform with 'cat' command.
2015-07-28 16:04:48 -07:00
Joseph Batchik b88b868eb3 [SPARK-8003][SQL] Added virtual column support to Spark
Added virtual column support by adding a new resolution role to the query analyzer. Additional virtual columns can be added by adding case expressions to [the new rule](https://github.com/JDrit/spark/blob/virt_columns/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L1026) and my modifying the [logical plan](https://github.com/JDrit/spark/blob/virt_columns/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L216) to resolve them.

This also solves [SPARK-8003](https://issues.apache.org/jira/browse/SPARK-8003)

This allows you to perform queries such as:
```sql
select spark__partition__id, count(*) as c from table group by spark__partition__id;
```

Author: Joseph Batchik <josephbatchik@gmail.com>
Author: JD <jd@csh.rit.edu>

Closes #7478 from JDrit/virt_columns and squashes the following commits:

7932bf0 [Joseph Batchik] adding spark__partition__id to hive as well
f8a9c6c [Joseph Batchik] merging in master
e49da48 [JD] fixes for @rxin's suggestions
60e120b [JD] fixing test in merge
4bf8554 [JD] merging in master
c68bc0f [Joseph Batchik] Adding function register ability to SQLContext and adding a function for spark__partition__id()
2015-07-28 14:39:25 -07:00
Yijie Shen 63a492b931 [SPARK-8828] [SQL] Revert SPARK-5680
JIRA: https://issues.apache.org/jira/browse/SPARK-8828

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7667 from yjshen/revert_combinesum_2 and squashes the following commits:

c37ccb1 [Yijie Shen] add test case
8377214 [Yijie Shen] revert spark.sql.useAggregate2 to its default value
e2305ac [Yijie Shen] fix bug - avg on decimal column
7cb0e95 [Yijie Shen] [wip] resolving bugs
1fadb5a [Yijie Shen] remove occurance
17c6248 [Yijie Shen] revert SPARK-5680
2015-07-27 22:47:33 -07:00
Cheng Lian 8e7d2bee23 [SPARK-9378] [SQL] Fixes test case "CTAS with serde"
This is a proper version of PR #7693 authored by viirya

The reason why "CTAS with serde" fails is that the `MetastoreRelation` gets converted to a Parquet data source relation by default.

Author: Cheng Lian <lian@databricks.com>

Closes #7700 from liancheng/spark-9378-fix-ctas-test and squashes the following commits:

4413af0 [Cheng Lian] Fixes test case "CTAS with serde"
2015-07-27 13:28:03 -07:00
Yin Huai 55946e76fd [SPARK-9349] [SQL] UDAF cleanup
https://issues.apache.org/jira/browse/SPARK-9349

With this PR, we only expose `UserDefinedAggregateFunction` (an abstract class) and `MutableAggregationBuffer` (an interface). Other internal wrappers and helper classes are moved to `org.apache.spark.sql.execution.aggregate` and marked as `private[sql]`.

Author: Yin Huai <yhuai@databricks.com>

Closes #7687 from yhuai/UDAF-cleanup and squashes the following commits:

db36542 [Yin Huai] Add comments to UDAF examples.
ae17f66 [Yin Huai] Address comments.
9c9fa5f [Yin Huai] UDAF cleanup.
2015-07-27 13:26:57 -07:00
Cheng Lian 72981bc8f0 [SPARK-7943] [SPARK-8105] [SPARK-8435] [SPARK-8714] [SPARK-8561] Fixes multi-database support
This PR fixes a set of issues related to multi-database. A new data structure `TableIdentifier` is introduced to identify a table among multiple databases. We should stop using a single `String` (table name without database name), or `Seq[String]` (optional database name plus table name) to identify tables internally.

Author: Cheng Lian <lian@databricks.com>

Closes #7623 from liancheng/spark-8131-multi-db and squashes the following commits:

f3bcd4b [Cheng Lian] Addresses PR comments
e0eb76a [Cheng Lian] Fixes styling issues
41e2207 [Cheng Lian] Fixes multi-database support
d4d1ec2 [Cheng Lian] Adds multi-database test cases
2015-07-27 17:15:35 +08:00
Wenchen Fan 4ffd3a1db5 [SPARK-9371][SQL] fix the support for special chars in column names for hive context
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7684 from cloud-fan/hive and squashes the following commits:

da21ffe [Wenchen Fan] fix the support for special chars in column names for hive context
2015-07-26 23:58:03 -07:00
Cheng Hao 1efe97dc9e [SPARK-8867][SQL] Support list / describe function usage
As Hive does, we need to list all of the registered UDF and its usage for user.

We add the annotation to describe a UDF, so we can get the literal description info while registering the UDF.
e.g.
```scala
ExpressionDescription(
    usage = "_FUNC_(expr) - Returns the absolute value of the numeric value",
    extended = """> SELECT _FUNC_('-1')
                  1""")
 case class Abs(child: Expression) extends UnaryArithmetic {
...
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7259 from chenghao-intel/desc_function and squashes the following commits:

cf29bba [Cheng Hao] fixing the code style issue
5193855 [Cheng Hao] Add more powerful parser for show functions
c645a6b [Cheng Hao] fix bug in unit test
78d40f1 [Cheng Hao] update the padding issue for usage
48ee4b3 [Cheng Hao] update as feedback
70eb4e9 [Cheng Hao] add show/describe function support
2015-07-26 18:34:19 -07:00
Cheng Lian c025c3d0a1 [SPARK-9095] [SQL] Removes the old Parquet support
This PR removes the old Parquet support:

- Removes the old `ParquetRelation` together with related SQL configuration, plan nodes, strategies, utility classes, and test suites.

- Renames `ParquetRelation2` to `ParquetRelation`

- Renames `RowReadSupport` and `RowRecordMaterializer` to `CatalystReadSupport` and `CatalystRecordMaterializer` respectively, and moved them to separate files.

  This follows naming convention used in other Parquet data models implemented in parquet-mr. It should be easier for developers who are familiar with Parquet to follow.

There's still some other code that can be cleaned up. Especially `RowWriteSupport`. But I'd like to leave this part to SPARK-8848.

Author: Cheng Lian <lian@databricks.com>

Closes #7441 from liancheng/spark-9095 and squashes the following commits:

c7b6e38 [Cheng Lian] Removes WriteToFile
2d688d6 [Cheng Lian] Renames ParquetRelation2 to ParquetRelation
ca9e1b7 [Cheng Lian] Removes old Parquet support
2015-07-26 16:49:19 -07:00
Reynold Xin 6c400b4f39 [SPARK-9354][SQL] Remove InternalRow.get generic getter call in Hive integration code.
Replaced them with get(ordinal, datatype) so we can use UnsafeRow here.

I passed the data types throughout.

Author: Reynold Xin <rxin@databricks.com>

Closes #7669 from rxin/row-generic-getter-hive and squashes the following commits:

3467d8e [Reynold Xin] [SPARK-9354][SQL] Remove Internal.get generic getter call in Hive integration code.
2015-07-26 10:27:39 -07:00
Cheng Lian e2ec018e37 [SPARK-9285] [SQL] Fixes Row/InternalRow conversion for HadoopFsRelation
This is a follow-up of #7626. It fixes `Row`/`InternalRow` conversion for data sources extending `HadoopFsRelation` with `needConversion` being `true`.

Author: Cheng Lian <lian@databricks.com>

Closes #7649 from liancheng/spark-9285-conversion-fix and squashes the following commits:

036a50c [Cheng Lian] Addresses PR comment
f6d7c6a [Cheng Lian] Fixes Row/InternalRow conversion for HadoopFsRelation
2015-07-25 11:42:49 -07:00
Reynold Xin 431ca39be5 [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
I also changed InternalRow's size/length function to numFields, to make it more obvious that it is not about bytes, but the number of fields.

Author: Reynold Xin <rxin@databricks.com>

Closes #7626 from rxin/internalRow and squashes the following commits:

e124daf [Reynold Xin] Fixed test case.
805ceb7 [Reynold Xin] Commented out the failed test suite.
f8a9ca5 [Reynold Xin] Fixed more bugs. Still at least one more remaining.
76d9081 [Reynold Xin] Fixed data sources.
7807f70 [Reynold Xin] Fixed DataFrameSuite.
cb60cd2 [Reynold Xin] Code review & small bug fixes.
0a2948b [Reynold Xin] Fixed style.
3280d03 [Reynold Xin] [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
2015-07-24 09:37:36 -07:00
Reynold Xin d71a13f475 [SPARK-9262][build] Treat Scala compiler warnings as errors
I've seen a few cases in the past few weeks that the compiler is throwing warnings that are caused by legitimate bugs. This patch upgrades warnings to errors, except deprecation warnings.

Note that ideally we should be able to mark deprecation warnings as errors as well. However, due to the lack of ability to suppress individual warning messages in the Scala compiler, we cannot do that (since we do need to access deprecated APIs in Hadoop).

Most of the work are done by ericl.

Author: Reynold Xin <rxin@databricks.com>
Author: Eric Liang <ekl@databricks.com>

Closes #7598 from rxin/warnings and squashes the following commits:

beb311b [Reynold Xin] Fixed tests.
542c031 [Reynold Xin] Fixed one more warning.
87c354a [Reynold Xin] Fixed all non-deprecation warnings.
78660ac [Eric Liang] first effort to fix warnings
2015-07-22 21:02:19 -07:00
Matei Zaharia fe26584a1f [SPARK-9244] Increase some memory defaults
There are a few memory limits that people hit often and that we could
make higher, especially now that memory sizes have grown.

- spark.akka.frameSize: This defaults at 10 but is often hit for map
  output statuses in large shuffles. This memory is not fully allocated
  up-front, so we can just make this larger and still not affect jobs
  that never sent a status that large. We increase it to 128.

- spark.executor.memory: Defaults at 512m, which is really small. We
  increase it to 1g.

Author: Matei Zaharia <matei@databricks.com>

Closes #7586 from mateiz/configs and squashes the following commits:

ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
2015-07-22 15:28:09 -07:00
Yin Huai c03299a18b [SPARK-4233] [SPARK-4367] [SPARK-3947] [SPARK-3056] [SQL] Aggregation Improvement
This is the first PR for the aggregation improvement, which is tracked by https://issues.apache.org/jira/browse/SPARK-4366 (umbrella JIRA). This PR contains work for its subtasks, SPARK-3056, SPARK-3947, SPARK-4233, and SPARK-4367.

This PR introduces a new code path for evaluating aggregate functions. This code path is guarded by `spark.sql.useAggregate2` and by default the value of this flag is true.

This new code path contains:
* A new aggregate function interface (`AggregateFunction2`) and 7 built-int aggregate functions based on this new interface (`AVG`, `COUNT`, `FIRST`, `LAST`, `MAX`, `MIN`, `SUM`)
* A UDAF interface (`UserDefinedAggregateFunction`) based on the new code path and two example UDAFs (`MyDoubleAvg` and `MyDoubleSum`).
* A sort-based aggregate operator (`Aggregate2Sort`) for the new aggregate function interface .
* A sort-based aggregate operator (`FinalAndCompleteAggregate2Sort`) for distinct aggregations (for distinct aggregations the query plan will use `Aggregate2Sort` and `FinalAndCompleteAggregate2Sort` together).

With this change, `spark.sql.useAggregate2` is `true`, the flow of compiling an aggregation query is:
1. Our analyzer looks up functions and returns aggregate functions built based on the old aggregate function interface.
2. When our planner is compiling the physical plan, it tries try to convert all aggregate functions to the ones built based on the new interface. The planner will fallback to the old code path if any of the following two conditions is true:
* code-gen is disabled.
* there is any function that cannot be converted (right now, Hive UDAFs).
* the schema of grouping expressions contain any complex data type.
* There are multiple distinct columns.

Right now, the new code path handles a single distinct column in the query (you can have multiple aggregate functions using that distinct column). For a query having a aggregate function with DISTINCT and regular aggregate functions, the generated plan will do partial aggregations for those regular aggregate function.

Thanks chenghao-intel for his initial work on it.

Author: Yin Huai <yhuai@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #7458 from yhuai/UDAF and squashes the following commits:

7865f5e [Yin Huai] Put the catalyst expression in the comment of the generated code for it.
b04d6c8 [Yin Huai] Remove unnecessary change.
f1d5901 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
35b0520 [Yin Huai] Use semanticEquals to replace grouping expressions in the output of the aggregate operator.
3b43b24 [Yin Huai] bug fix.
00eb298 [Yin Huai] Make it compile.
a3ca551 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
e0afca3 [Yin Huai] Gracefully fallback to old aggregation code path.
8a8ac4a [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
88c7d4d [Yin Huai] Enable spark.sql.useAggregate2 by default for testing purpose.
dc96fd1 [Yin Huai] Many updates:
85c9c4b [Yin Huai] newline.
43de3de [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
c3614d7 [Yin Huai] Handle single distinct column.
68b8ee9 [Yin Huai] Support single distinct column set. WIP
3013579 [Yin Huai] Format.
d678aee [Yin Huai] Remove AggregateExpressionSuite.scala since our built-in aggregate functions will be based on AlgebraicAggregate and we need to have another way to test it.
e243ca6 [Yin Huai] Add aggregation iterators.
a101960 [Yin Huai] Change MyJavaUDAF to MyDoubleSum.
594cdf5 [Yin Huai] Change existing AggregateExpression to AggregateExpression1 and add an AggregateExpression as the common interface for both AggregateExpression1 and AggregateExpression2.
380880f [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
0a827b3 [Yin Huai] Add comments and doc. Move some classes to the right places.
a19fea6 [Yin Huai] Add UDAF interface.
262d4c4 [Yin Huai] Make it compile.
b2e358e [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
6edb5ac [Yin Huai] Format update.
70b169c [Yin Huai] Remove groupOrdering.
4721936 [Yin Huai] Add CheckAggregateFunction to extendedCheckRules.
d821a34 [Yin Huai] Cleanup.
32aea9c [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
5b46d41 [Yin Huai] Bug fix.
aff9534 [Yin Huai] Make Aggregate2Sort work with both algebraic AggregateFunctions and non-algebraic AggregateFunctions.
2857b55 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
4435f20 [Yin Huai] Add ConvertAggregateFunction to HiveContext's analyzer.
1b490ed [Michael Armbrust] make hive test
8cfa6a9 [Michael Armbrust] add test
1b0bb3f [Yin Huai] Do not bind references in AlgebraicAggregate and use code gen for all places.
072209f [Yin Huai] Bug fix: Handle expressions in grouping columns that are not attribute references.
f7d9e54 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into UDAF
39ee975 [Yin Huai] Code cleanup: Remove unnecesary AttributeReferences.
b7720ba [Yin Huai] Add an analysis rule to convert aggregate function to the new version.
5c00f3f [Michael Armbrust] First draft of codegen
6bbc6ba [Michael Armbrust] now with correct answers\!
f7996d0 [Michael Armbrust] Add AlgebraicAggregate
dded1c5 [Yin Huai] wip
2015-07-21 23:26:11 -07:00
Reynold Xin 60c0ce134d [SPARK-8906][SQL] Move all internal data source classes into execution.datasources.
This way, the sources package contains only public facing interfaces.

Author: Reynold Xin <rxin@databricks.com>

Closes #7565 from rxin/move-ds and squashes the following commits:

7661aff [Reynold Xin] Mima
9d5196a [Reynold Xin] Rearranged imports.
3dd7174 [Reynold Xin] [SPARK-8906][SQL] Move all internal data source classes into execution.datasources.
2015-07-21 11:56:38 -07:00
Cheng Lian d38c5029a2 [SPARK-9100] [SQL] Adds DataFrame reader/writer shortcut methods for ORC
This PR adds DataFrame reader/writer shortcut methods for ORC in both Scala and Python.

Author: Cheng Lian <lian@databricks.com>

Closes #7444 from liancheng/spark-9100 and squashes the following commits:

284d043 [Cheng Lian] Fixes PySpark test cases and addresses PR comments
e0b09fb [Cheng Lian] Adds DataFrame reader/writer shortcut methods for ORC
2015-07-21 15:08:44 +08:00
Cheng Lian dde0e12f32 [SPARK-6910] [SQL] Support for pushing predicates down to metastore for partition pruning
This PR forks PR #7421 authored by piaozhexiu and adds [a workaround] [1] for fixing the occasional test failures occurred in PR #7421. Please refer to these [two] [2] [comments] [3] for details.

[1]: 536ac41a7e
[2]: https://github.com/apache/spark/pull/7421#issuecomment-122527391
[3]: https://github.com/apache/spark/pull/7421#issuecomment-122528059

Author: Cheolsoo Park <cheolsoop@netflix.com>
Author: Cheng Lian <lian@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #7492 from liancheng/pr-7421-workaround and squashes the following commits:

5599cc4 [Cheolsoo Park] Predicate pushdown to hive metastore
536ac41 [Cheng Lian] Sets hive.metastore.integral.jdo.pushdown to true to workaround test failures caused by in #7421
2015-07-20 15:12:14 -07:00
Herman van Hovell 7a81245345 [SPARK-8638] [SQL] Window Function Performance Improvements - Cleanup
This PR contains a few clean-ups that are a part of SPARK-8638: a few style issues got fixed, and a few tests were moved.

Git commit message is wrong BTW :(...

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #7513 from hvanhovell/SPARK-8638-cleanup and squashes the following commits:

4e69d08 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase)
2015-07-19 16:29:50 -07:00
Herman van Hovell a9a0d0cebf [SPARK-8638] [SQL] Window Function Performance Improvements
## Description
Performance improvements for Spark Window functions. This PR will also serve as the basis for moving away from Hive UDAFs to Spark UDAFs. See JIRA tickets SPARK-8638 and SPARK-7712 for more information.

## Improvements
* Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUDED FOLLOWING cases. The current implementation in spark uses a sliding window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). This also means that all these aggregates all need to be updated separately, this takes N*(N-1)/2 updates. The running case differs from the Sliding case because we are only adding data to an aggregate function (no reset is required), we only need to maintain one aggregate (like in the UNBOUNDED PRECEDING AND UNBOUNDED case), update the aggregate for each row, and get the aggregate value after each update. This is what the new implementation does. This approach only uses 1 buffer, and only requires N updates; I am currently working on data with window sizes of 500-1000 doing running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case also uses this approach and the fact that aggregate operations are communitative, there is one twist though it will process the input buffer in reverse.
* Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement.
* A single Window node is able to process all types of Frames for the same Partitioning/Ordering. This saves a little time/memory spent buffering and managing partitions. This will be enabled in a follow-up PR.
* A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code.

## Benchmarking
I have done a small benchmark using [on time performance](http://www.transtats.bts.gov) data of the month april. I have used the origin as a partioning key, as a result there is quite some variation in window sizes. The code for the benchmark can be found in the JIRA ticket. These are the results per Frame type:

Frame | Master | SPARK-8638
----- | ------ | ----------
Entire Frame | 2 s | 1 s
Sliding | 18 s | 1 s
Growing | 14 s | 0.9 s
Shrinking | 13 s | 1 s

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #7057 from hvanhovell/SPARK-8638 and squashes the following commits:

3bfdc49 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase)
2eb3b33 [Herman van Hovell] Corrected reverse range frame processing.
2cd2d5b [Herman van Hovell] Corrected reverse range frame processing.
b0654d7 [Herman van Hovell] Tests for exotic frame specifications.
e75b76e [Herman van Hovell] More docs, added support for reverse sliding range frames, and some reorganization of code.
1fdb558 [Herman van Hovell] Changed Data In HiveDataFrameWindowSuite.
ac2f682 [Herman van Hovell] Added a few more comments.
1938312 [Herman van Hovell] Added Documentation to the createBoundOrdering methods.
bb020e6 [Herman van Hovell] Major overhaul of Window operator.
2015-07-18 23:44:38 -07:00
Cheng Hao e27212317c [SPARK-8972] [SQL] Incorrect result for rollup
We don't support the complex expression keys in the rollup/cube, and we even will not report it if we have the complex group by keys, that will cause very confusing/incorrect result.

e.g. `SELECT key%100 FROM src GROUP BY key %100 with ROLLUP`

This PR adds an additional project during the analyzing for the complex GROUP BY keys, and that projection will be the child of `Expand`, so to `Expand`, the GROUP BY KEY are always the simple key(attribute names).

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7343 from chenghao-intel/expand and squashes the following commits:

1ebbb59 [Cheng Hao] update the comment
827873f [Cheng Hao] update as feedback
34def69 [Cheng Hao] Add more unit test and comments
c695760 [Cheng Hao] fix bug of incorrect result for rollup
2015-07-15 23:35:27 -07:00
Steve Loughran ec9b621647 SPARK-9070 JavaDataFrameSuite teardown NPEs if setup failed
fix teardown to skip table delete if hive context is null

Author: Steve Loughran <stevel@hortonworks.com>

Closes #7425 from steveloughran/stevel/patches/SPARK-9070-JavaDataFrameSuite-NPE and squashes the following commits:

1982d38 [Steve Loughran] SPARK-9070 JavaDataFrameSuite teardown NPEs if setup failed
2015-07-15 12:15:35 -07:00
Michael Armbrust c6b1a9e74e Revert SPARK-6910 and SPARK-9027
Revert #7216 and #7386.  These patch seems to be causing quite a few test failures:

```
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.GeneratedMethodAccessor322.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:351)
	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getPartitionsByFilter$1.apply(ClientWrapper.scala:320)
	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getPartitionsByFilter$1.apply(ClientWrapper.scala:318)
	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:180)
	at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:135)
	at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:172)
	at org.apache.spark.sql.hive.client.ClientWrapper.getPartitionsByFilter(ClientWrapper.scala:318)
	at org.apache.spark.sql.hive.client.HiveTable.getPartitions(ClientInterface.scala:78)
	at org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(HiveMetastoreCatalog.scala:670)
	at org.apache.spark.sql.hive.execution.HiveTableScan.doExecute(HiveTableScan.scala:137)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:90)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:90)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:89)
	at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:164)
	at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:151)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
	... 85 more
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$FilterBuilder.setError(ExpressionTree.java:185)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.getJdoFilterPushdownParam(ExpressionTree.java:452)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilterOverPartitions(ExpressionTree.java:357)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilter(ExpressionTree.java:279)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$TreeNode.generateJDOFilter(ExpressionTree.java:243)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree.generateJDOFilterFragment(ExpressionTree.java:590)
	at org.apache.hadoop.hive.metastore.ObjectStore.makeQueryFilterString(ObjectStore.java:2417)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsViaOrmFilter(ObjectStore.java:2029)
	at org.apache.hadoop.hive.metastore.ObjectStore.access$500(ObjectStore.java:146)
	at org.apache.hadoop.hive.metastore.ObjectStore$4.getJdoResult(ObjectStore.java:2332)
```
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-Maven-with-YARN/2945/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/junit/org.apache.spark.sql.hive.execution/SortMergeCompatibilitySuite/auto_sortmerge_join_16/

Author: Michael Armbrust <michael@databricks.com>

Closes #7409 from marmbrus/revertMetastorePushdown and squashes the following commits:

92fabd3 [Michael Armbrust] Revert SPARK-6910 and SPARK-9027
5d3bdf2 [Michael Armbrust] Revert "[SPARK-9027] [SQL] Generalize metastore predicate pushdown"
2015-07-14 22:57:39 -07:00
Josh Rosen 11e5c37286 [SPARK-8962] Add Scalastyle rule to ban direct use of Class.forName; fix existing uses
This pull request adds a Scalastyle regex rule which fails the style check if `Class.forName` is used directly.  `Class.forName` always loads classes from the default / system classloader, but in a majority of cases, we should be using Spark's own `Utils.classForName` instead, which tries to load classes from the current thread's context classloader and falls back to the classloader which loaded Spark when the context classloader is not defined.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/7350)
<!-- Reviewable:end -->

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7350 from JoshRosen/ban-Class.forName and squashes the following commits:

e3e96f7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into ban-Class.forName
c0b7885 [Josh Rosen] Hopefully fix the last two cases
d707ba7 [Josh Rosen] Fix uses of Class.forName that I missed in my first cleanup pass
046470d [Josh Rosen] Merge remote-tracking branch 'origin/master' into ban-Class.forName
62882ee [Josh Rosen] Fix uses of Class.forName or add exclusion.
d9abade [Josh Rosen] Add stylechecker rule to ban uses of Class.forName
2015-07-14 16:08:17 -07:00
Michael Armbrust 37f2d9635f [SPARK-9027] [SQL] Generalize metastore predicate pushdown
Add support for pushing down metastore filters that are in different orders and add some unit tests.

Author: Michael Armbrust <michael@databricks.com>

Closes #7386 from marmbrus/metastoreFilters and squashes the following commits:

05a4524 [Michael Armbrust] [SPARK-9027][SQL] Generalize metastore predicate pushdown
2015-07-14 11:22:09 -07:00
Cheolsoo Park 408b384de9 [SPARK-6910] [SQL] Support for pushing predicates down to metastore for partition pruning
This PR supersedes my old one #6921. Since my patch has changed quite a bit, I am opening a new PR to make it easier to review.

The changes include-
* Implement `toMetastoreFilter()` function in `HiveShim` that takes `Seq[Expression]` and converts them into a filter string for Hive metastore.
 * This functions matches all the `AttributeReference` + `BinaryComparisonOp` + `Integral/StringType` patterns in `Seq[Expression]` and fold them into a string.
* Change `hiveQlPartitions` field in `MetastoreRelation` to `getHiveQlPartitions()` function that takes a filter string parameter.
* Call `getHiveQlPartitions()` in `HiveTableScan` with a filter string.

But there are some cases in which predicate pushdown is disabled-

Case | Predicate pushdown
------- | -----------------------------
Hive integral and string types | Yes
Hive varchar type | No
Hive 0.13 and newer | Yes
Hive 0.12 and older | No
convertMetastoreParquet=false | Yes
convertMetastoreParquet=true | No

In case of `convertMetastoreParquet=true`, predicates are not pushed down because this conversion happens in an `Analyzer` rule (`HiveMetastoreCatalog.ParquetConversions`). At this point, `HiveTableScan` hasn't run, so predicates are not available. But reading the source code, I think it is intentional to convert the entire Hive table w/ all the partitions into `ParquetRelation` because then `ParquetRelation` can be cached and reused for any query against that table. Please correct me if I am wrong.

cc marmbrus

Author: Cheolsoo Park <cheolsoop@netflix.com>

Closes #7216 from piaozhexiu/SPARK-6910-2 and squashes the following commits:

aa1490f [Cheolsoo Park] Fix ordering of imports
c212c4d [Cheolsoo Park] Incorporate review comments
5e93f9d [Cheolsoo Park] Predicate pushdown into Hive metastore
2015-07-13 19:45:10 -07:00
Jonathan Alter e14b545d2d [SPARK-7977] [BUILD] Disallowing println
Author: Jonathan Alter <jonalter@users.noreply.github.com>

Closes #7093 from jonalter/SPARK-7977 and squashes the following commits:

ccd44cc [Jonathan Alter] Changed println to log in ThreadingSuite
7fcac3e [Jonathan Alter] Reverting to println in ThreadingSuite
10724b6 [Jonathan Alter] Changing some printlns to logs in tests
eeec1e7 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
0b1dcb4 [Jonathan Alter] More println cleanup
aedaf80 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
925fd98 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
0c16fa3 [Jonathan Alter] Replacing some printlns with logs
45c7e05 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
5c8e283 [Jonathan Alter] Allowing println in audit-release examples
5b50da1 [Jonathan Alter] Allowing printlns in example files
ca4b477 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
83ab635 [Jonathan Alter] Fixing new printlns
54b131f [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
1cd8a81 [Jonathan Alter] Removing some unnecessary comments and printlns
b837c3a [Jonathan Alter] Disallowing println
2015-07-10 11:34:01 +01:00
Cheng Lian 4ffc27caaf [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] Refactors Parquet read path for interoperability and backwards-compatibility
This PR is a follow-up of #6617 and is part of [SPARK-6774] [2], which aims to ensure interoperability and backwards-compatibility for Spark SQL Parquet support.  And this one fixes the read path.  Now Spark SQL is expected to be able to read legacy Parquet data files generated by most (if not all) common libraries/tools like parquet-thrift, parquet-avro, and parquet-hive. However, we still need to refactor the write path to write standard Parquet LISTs and MAPs ([SPARK-8848] [4]).

### Major changes

1. `CatalystConverter` class hierarchy refactoring

   - Replaces `CatalystConverter` trait with a much simpler `ParentContainerUpdater`.

     Now instead of extending the original `CatalystConverter` trait, every converter class accepts an updater which is responsible for propagating the converted value to some parent container. For example, appending array elements to a parent array buffer, appending a key-value pairs to a parent mutable map, or setting a converted value to some specific field of a parent row. Root converter doesn't have a parent and thus uses a `NoopUpdater`.

     This simplifies the design since converters don't need to care about details of their parent converters anymore.

   - Unifies `CatalystRootConverter`, `CatalystGroupConverter` and `CatalystPrimitiveRowConverter` into `CatalystRowConverter`

     Specifically, now all row objects are represented by `SpecificMutableRow` during conversion.

   - Refactors `CatalystArrayConverter`, and removes `CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter`

     `CatalystNativeArrayConverter` was probably designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal.

     The new `CatalystArrayConverter` handles both nullable and non-nullable array elements in a consistent way.

   - Implements backwards-compatibility rules in `CatalystArrayConverter`

     When Parquet records are being converted, schema of Parquet files should have already been verified. So we only need to care about the structure rather than field names in the Parquet schema. Since all map objects represented in legacy systems have the same structure as the standard one (see [backwards-compatibility rules for MAP] [1]), we only need to deal with LIST (namely array) in `CatalystArrayConverter`.

2. Requested columns handling

   When specifying requested columns in `RowReadSupport`, we used to use a Parquet `MessageType` converted from a Catalyst `StructType` which contains all requested columns.  This is not preferable when taking compatibility and interoperability into consideration.  Because the actual Parquet file may have different physical structure from the converted schema.

   In this PR, the schema for requested columns is constructed using the following method:

   - For a column that exists in the target Parquet file, we extract the column type by name from the full file schema, and construct a single-field `MessageType` for that column.
   - For a column that doesn't exist in the target Parquet file, we create a single-field `StructType` and convert it to a `MessageType` using `CatalystSchemaConverter`.
   - Unions all single-field `MessageType`s into a full schema containing all requested fields

   With this change, we also fix [SPARK-6123] [3] by validating the global schema against each individual Parquet part-files.

### Testing

This PR also adds compatibility tests for parquet-avro, parquet-thrift, and parquet-hive. Please refer to `README.md` under `sql/core/src/test` for more information about these tests. To avoid build time code generation and adding extra complexity to the build system, Java code generated from testing Thrift schema and Avro IDL is also checked in.

[1]: https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
[2]: https://issues.apache.org/jira/browse/SPARK-6774
[3]: https://issues.apache.org/jira/browse/SPARK-6123
[4]: https://issues.apache.org/jira/browse/SPARK-8848

Author: Cheng Lian <lian@databricks.com>

Closes #7231 from liancheng/spark-6776 and squashes the following commits:

360fe18 [Cheng Lian] Adds ParquetHiveCompatibilitySuite
c6fbc06 [Cheng Lian] Removes WIP file committed by mistake
b8c1295 [Cheng Lian] Excludes the whole parquet package from MiMa
598c3e8 [Cheng Lian] Adds extra Maven repo for hadoop-lzo, which is a transitive dependency of parquet-thrift
926af87 [Cheng Lian] Simplifies Parquet compatibility test suites
7946ee1 [Cheng Lian] Fixes Scala styling issues
3d7ab36 [Cheng Lian] Fixes .rat-excludes
a8f13bb [Cheng Lian] Using Parquet writer API to do compatibility tests
f2208cd [Cheng Lian] Adds README.md for Thrift/Avro code generation
1d390aa [Cheng Lian] Adds parquet-thrift compatibility test
440f7b3 [Cheng Lian] Adds generated files to .rat-excludes
13b9121 [Cheng Lian] Adds ParquetAvroCompatibilitySuite
06cfe9d [Cheng Lian] Adds comments about TimestampType handling
a099d3e [Cheng Lian] More comments
0cc1b37 [Cheng Lian] Fixes MiMa checks
884d3e6 [Cheng Lian] Fixes styling issue and reverts unnecessary changes
802cbd7 [Cheng Lian] Fixes bugs related to schema merging and empty requested columns
38fe1e7 [Cheng Lian] Adds explicit return type
7fb21f1 [Cheng Lian] Reverts an unnecessary debugging change
1781dff [Cheng Lian] Adds test case for SPARK-8811
6437d4b [Cheng Lian] Assembles requested schema from Parquet file schema
bcac49f [Cheng Lian] Removes the 16-byte restriction of decimals
a74fb2c [Cheng Lian] More comments
0525346 [Cheng Lian] Removes old Parquet record converters
03c3bd9 [Cheng Lian] Refactors Parquet read path to implement backwards-compatibility rules
2015-07-08 15:51:01 -07:00
Keuntae Park f031543782 [SPARK-8783] [SQL] CTAS with WITH clause does not work
Currently, CTESubstitution only handles the case that WITH is on the top of the plan.
I think it SHOULD handle the case that WITH is child of CTAS.
This patch simply changes 'match' to 'transform' for recursive search of WITH in the plan.

Author: Keuntae Park <sirpkt@apache.org>

Closes #7180 from sirpkt/SPARK-8783 and squashes the following commits:

e4428f0 [Keuntae Park] Merge remote-tracking branch 'upstream/master' into CTASwithWITH
1671c77 [Keuntae Park] WITH clause can be inside CTAS
2015-07-08 14:29:52 -07:00
Takeshi YAMAMURO 3e831a2696 [SPARK-6912] [SQL] Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF
To make UDF developers understood, throw an exception when unsupported Map<K,V> types used in Hive UDF. This fix is the same with #7248.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #7257 from maropu/ThrowExceptionWhenMapUsed and squashes the following commits:

916099a [Takeshi YAMAMURO] Fix style errors
7886dcc [Takeshi YAMAMURO] Throw an exception when Map<> used in Hive UDF
2015-07-08 10:33:27 -07:00
Reynold Xin 770ff1025e [SPARK-8876][SQL] Remove InternalRow type alias in expressions package.
The type alias was there because initially when I moved Row around, I didn't want to do massive changes to the expression code. But now it should be pretty easy to just remove it. One less concept to worry about.

Author: Reynold Xin <rxin@databricks.com>

Closes #7270 from rxin/internalrow and squashes the following commits:

72fc842 [Reynold Xin] [SPARK-8876][SQL] Remove InternalRow type alias in expressions package.
2015-07-07 17:40:14 -07:00
Takeshi YAMAMURO 1821fc1658 [SPARK-6747] [SQL] Throw an AnalysisException when unsupported Java list types used in Hive UDF
The current implementation can't handle List<> as a return type in Hive UDF and
throws meaningless Match Error.
We assume an UDF below;
public class UDFToListString extends UDF {
public List<String> evaluate(Object o)
{ return Arrays.asList("xxx", "yyy", "zzz"); }
}
An exception of scala.MatchError is thrown as follows when the UDF used;
scala.MatchError: interface java.util.List (of class java.lang.Class)
at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
...
To make udf developers more understood, we need to throw a more suitable exception.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #7248 from maropu/FixBugInHiveInspectors and squashes the following commits:

1c3df2a [Takeshi YAMAMURO] Fix comments
56305de [Takeshi YAMAMURO] Fix conflicts
92ed7a6 [Takeshi YAMAMURO] Throw an exception when java list type used
2844a8e [Takeshi YAMAMURO] Apply comments
7114a47 [Takeshi YAMAMURO] Add TODO comments in UDFToListString of HiveUdfSuite
fdb2ae4 [Takeshi YAMAMURO] Add StringToUtf8 to comvert String into UTF8String
af61f2e [Takeshi YAMAMURO] Remove a new type
7f812fd [Takeshi YAMAMURO] Fix code-style errors
6984bf4 [Takeshi YAMAMURO] Apply review comments
93e3d4e [Takeshi YAMAMURO] Add a blank line at the end of UDFToListString
ee232db [Takeshi YAMAMURO] Support List as a return type in Hive UDF
1e82316 [Takeshi YAMAMURO] Apply comments
21e8763 [Takeshi YAMAMURO] Add TODO comments in UDFToListString of HiveUdfSuite
a488712 [Takeshi YAMAMURO] Add StringToUtf8 to comvert String into UTF8String
1c7b9d1 [Takeshi YAMAMURO] Remove a new type
f965c34 [Takeshi YAMAMURO] Fix code-style errors
9406416 [Takeshi YAMAMURO] Apply review comments
e21ce7e [Takeshi YAMAMURO] Add a blank line at the end of UDFToListString
e553f10 [Takeshi YAMAMURO] Support List as a return type in Hive UDF
2015-07-06 19:44:31 -07:00
Yin Huai 7b467cc934 [SPARK-8588] [SQL] Regression test
This PR adds regression test for https://issues.apache.org/jira/browse/SPARK-8588 (fixed by 457d07eaa0).

Author: Yin Huai <yhuai@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>

Closes #7103 from yhuai/SPARK-8588-test and squashes the following commits:

eb5f418 [Yin Huai] Add a query test.
c61a173 [Yin Huai] Regression test for SPARK-8588.
2015-07-06 16:28:47 -07:00
Cheng Lian 20a4d7dbd1 [SPARK-8501] [SQL] Avoids reading schema from empty ORC files
ORC writes empty schema (`struct<>`) to ORC files containing zero rows.  This is OK for Hive since the table schema is managed by the metastore. But it causes trouble when reading raw ORC files via Spark SQL since we have to discover the schema from the files.

Notice that the ORC data source always avoids writing empty ORC files, but it's still problematic when reading Hive tables which contain empty part-files.

Author: Cheng Lian <lian@databricks.com>

Closes #7199 from liancheng/spark-8501 and squashes the following commits:

bb8cd95 [Cheng Lian] Addresses comments
a290221 [Cheng Lian] Avoids reading schema from empty ORC files
2015-07-02 21:30:57 -07:00
Yijie Shen 52302a8039 [SPARK-8407] [SQL] complex type constructors: struct and named_struct
This is a follow up of [SPARK-8283](https://issues.apache.org/jira/browse/SPARK-8283) ([PR-6828](https://github.com/apache/spark/pull/6828)), to support both `struct` and `named_struct` in Spark SQL.

After [#6725](https://github.com/apache/spark/pull/6828), the semantic of [`CreateStruct`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypes.scala#L56) methods have changed a little and do not limited to cols of `NamedExpressions`, it will name non-NamedExpression fields following the hive convention, col1, col2 ...

This PR would both loosen [`struct`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L723) to take children of `Expression` type and add `named_struct` support.

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #6874 from yijieshen/SPARK-8283 and squashes the following commits:

4cd3375ac [Yijie Shen] change struct documentation
d599d0b [Yijie Shen] rebase code
9a7039e [Yijie Shen] fix reviews and regenerate golden answers
b487354 [Yijie Shen] replace assert using checkAnswer
f07e114 [Yijie Shen] tiny fix
9613be9 [Yijie Shen] review fix
7fef712 [Yijie Shen] Fix checkInputTypes' implementation using foldable and nullable
60812a7 [Yijie Shen] Fix type check
828d694 [Yijie Shen] remove unnecessary resolved assertion inside dataType method
fd3cd8e [Yijie Shen] remove type check from eval
7a71255 [Yijie Shen] tiny fix
ccbbd86 [Yijie Shen] Fix reviews
47da332 [Yijie Shen] remove nameStruct API from DataFrame
917e680 [Yijie Shen] Fix reviews
4bd75ad [Yijie Shen] loosen struct method in functions.scala to take Expression children
0acb7be [Yijie Shen] Add CreateNamedStruct in both DataFrame function API and FunctionRegistery
2015-07-02 10:12:25 -07:00
Christian Kadner 1e1f339976 [SPARK-6785] [SQL] fix DateTimeUtils for dates before 1970
Hi Michael,
this Pull-Request is a follow-up to [PR-6242](https://github.com/apache/spark/pull/6242). I removed the two obsolete test cases from the HiveQuerySuite and deleted the corresponding golden answer files.
Thanks for your review!

Author: Christian Kadner <ckadner@us.ibm.com>

Closes #6983 from ckadner/SPARK-6785 and squashes the following commits:

ab1e79b [Christian Kadner] Merge remote-tracking branch 'origin/SPARK-6785' into SPARK-6785
1fed877 [Christian Kadner] [SPARK-6785][SQL] failed Scala style test, remove spaces on empty line DateTimeUtils.scala:61
9d8021d [Christian Kadner] [SPARK-6785][SQL] merge recent changes in DateTimeUtils & MiscFunctionsSuite
b97c3fb [Christian Kadner] [SPARK-6785][SQL] move test case for DateTimeUtils to DateTimeUtilsSuite
a451184 [Christian Kadner] [SPARK-6785][SQL] fix DateTimeUtils.fromJavaDate(java.util.Date) for Dates before 1970
2015-06-30 12:22:34 -07:00
Yin Huai fbf75738fe [SPARK-7287] [SPARK-8567] [TEST] Add sc.stop to applications in SparkSubmitSuite
Hopefully, this suite will not be flaky anymore.

Author: Yin Huai <yhuai@databricks.com>

Closes #7027 from yhuai/SPARK-8567 and squashes the following commits:

c0167e2 [Yin Huai] Add sc.stop().
2015-06-29 17:20:05 -07:00
BenFradet 931da5c8ab [SPARK-8478] [SQL] Harmonize UDF-related code to use uniformly UDF instead of Udf
Follow-up of #6902 for being coherent between ```Udf``` and ```UDF```

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #6920 from BenFradet/SPARK-8478 and squashes the following commits:

c500f29 [BenFradet] renamed a few variables in functions to use UDF
8ab0f2d [BenFradet] renamed idUdf to idUDF in SQLQuerySuite
98696c2 [BenFradet] renamed originalUdfs in TestHive to originalUDFs
7738f74 [BenFradet] modified HiveUDFSuite to use only UDF
c52608d [BenFradet] renamed HiveUdfSuite to HiveUDFSuite
e51b9ac [BenFradet] renamed ExtractPythonUdfs to ExtractPythonUDFs
8c756f1 [BenFradet] renamed Hive UDF related code
2a1ca76 [BenFradet] renamed pythonUdfs to pythonUDFs
261e6fb [BenFradet] renamed ScalaUdf to ScalaUDF
2015-06-29 15:27:13 -07:00
Cheng Hao c6ba2ea341 [SPARK-7862] [SQL] Disable the error message redirect to stderr
This is a follow up of #6404, the ScriptTransformation prints the error msg into stderr directly, probably be a disaster for application log.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #6882 from chenghao-intel/verbose and squashes the following commits:

bfedd77 [Cheng Hao] revert the write
76ff46b [Cheng Hao] update the CircularBuffer
692b19e [Cheng Hao] check the process exitValue for ScriptTransform
47e0970 [Cheng Hao] Use the RedirectThread instead
1de771d [Cheng Hao] naming the threads in ScriptTransformation
8536e81 [Cheng Hao] disable the error message redirection for stderr
2015-06-29 12:46:33 -07:00
Marcelo Vanzin 3664ee25f0 [SPARK-8066, SPARK-8067] [hive] Add support for Hive 1.0, 1.1 and 1.2.
Allow HiveContext to connect to metastores of those versions; some new shims
had to be added to account for changing internal APIs.

A new test was added to exercise the "reset()" path which now also requires
a shim; and the test code was changed to use a directory under the build's
target to store ivy dependencies. Without that, at least I consistently run
into issues with Ivy messing up (or being confused) by my existing caches.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7026 from vanzin/SPARK-8067 and squashes the following commits:

3e2e67b [Marcelo Vanzin] [SPARK-8066, SPARK-8067] [hive] Add support for Hive 1.0, 1.1 and 1.2.
2015-06-29 11:53:17 -07:00
Davies Liu 77da5be6f1 [SPARK-8610] [SQL] Separate Row and InternalRow (part 2)
Currently, we use GenericRow both for Row and InternalRow, which is confusing because it could contain Scala type also Catalyst types.

This PR changes to use GenericInternalRow for InternalRow (contains catalyst types), GenericRow for Row (contains Scala types).

Also fixes some incorrect use of InternalRow or Row.

Author: Davies Liu <davies@databricks.com>

Closes #7003 from davies/internalrow and squashes the following commits:

d05866c [Davies Liu] fix test: rollback changes for pyspark
72878dd [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
efd0b25 [Davies Liu] fix copy of MutableRow
87b13cf [Davies Liu] fix test
d2ebd72 [Davies Liu] fix style
eb4b473 [Davies Liu] mark expensive API as final
bd4e99c [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
bdfb78f [Davies Liu] remove BaseMutableRow
6f99a97 [Davies Liu] fix catalyst test
defe931 [Davies Liu] remove BaseRow
288b31f [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
9d24350 [Davies Liu] separate Row and InternalRow (part 2)
2015-06-28 08:03:58 -07:00
Yin Huai f9b397f54d [SPARK-8567] [SQL] Add logs to record the progress of HiveSparkSubmitSuite.
Author: Yin Huai <yhuai@databricks.com>

Closes #7009 from yhuai/SPARK-8567 and squashes the following commits:

62fb1f9 [Yin Huai] Add sc.stop().
b22cf7d [Yin Huai] Add logs.
2015-06-25 06:52:03 -07:00
Cheng Lian c337844ed7 [SPARK-8604] [SQL] HadoopFsRelation subclasses should set their output format class
`HadoopFsRelation` subclasses, especially `ParquetRelation2` should set its own output format class, so that the default output committer can be setup correctly when doing appending (where we ignore user defined output committers).

Author: Cheng Lian <lian@databricks.com>

Closes #6998 from liancheng/spark-8604 and squashes the following commits:

9be51d1 [Cheng Lian] Adds more comments
6db1368 [Cheng Lian] HadoopFsRelation subclasses should set their output format class
2015-06-25 00:06:23 -07:00
Wenchen Fan b71d3254e5 [SPARK-8075] [SQL] apply type check interface to more expressions
a follow up of https://github.com/apache/spark/pull/6405.
Note: It's not a big change, a lot of changing is due to I swap some code in `aggregates.scala` to make aggregate functions right below its corresponding aggregate expressions.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6723 from cloud-fan/type-check and squashes the following commits:

2124301 [Wenchen Fan] fix tests
5a658bb [Wenchen Fan] add tests
287d3bb [Wenchen Fan] apply type check interface to more expressions
2015-06-24 16:26:00 -07:00
Yin Huai 7daa70292e [SPARK-8567] [SQL] Increase the timeout of HiveSparkSubmitSuite
https://issues.apache.org/jira/browse/SPARK-8567

Author: Yin Huai <yhuai@databricks.com>

Closes #6957 from yhuai/SPARK-8567 and squashes the following commits:

62dff5b [Yin Huai] Increase the timeout.
2015-06-24 15:52:58 -07:00
Cheng Lian 8ab50765cd [SPARK-6777] [SQL] Implements backwards compatibility rules in CatalystSchemaConverter
This PR introduces `CatalystSchemaConverter` for converting Parquet schema to Spark SQL schema and vice versa.  Original conversion code in `ParquetTypesConverter` is removed. Benefits of the new version are:

1. When converting Spark SQL schemas, it generates standard Parquet schemas conforming to [the most updated Parquet format spec] [1]. Converting to old style Parquet schemas is also supported via feature flag `spark.sql.parquet.followParquetFormatSpec` (which is set to `false` for now, and should be set to `true` after both read and write paths are fixed).

   Note that although this version of Parquet format spec hasn't been officially release yet, Parquet MR 1.7.0 already sticks to it. So it should be safe to follow.

1. It implements backwards-compatibility rules described in the most updated Parquet format spec. Thus can recognize more schema patterns generated by other/legacy systems/tools.
1. Code organization follows convention used in [parquet-mr] [2], which is easier to follow. (Structure of `CatalystSchemaConverter` is similar to `AvroSchemaConverter`).

To fully implement backwards-compatibility rules in both read and write path, we also need to update `CatalystRowConverter` (which is responsible for converting Parquet records to `Row`s), `RowReadSupport`, and `RowWriteSupport`. These would be done in follow-up PRs.

TODO

- [x] More schema conversion test cases for legacy schema patterns.

[1]: ea09522659/LogicalTypes.md
[2]: https://github.com/apache/parquet-mr/

Author: Cheng Lian <lian@databricks.com>

Closes #6617 from liancheng/spark-6777 and squashes the following commits:

2a2062d [Cheng Lian] Don't convert decimals without precision information
b60979b [Cheng Lian] Adds a constructor which accepts a Configuration, and fixes default value of assumeBinaryIsString
743730f [Cheng Lian] Decimal scale shouldn't be larger than precision
a104a9e [Cheng Lian] Fixes Scala style issue
1f71d8d [Cheng Lian] Adds feature flag to allow falling back to old style Parquet schema conversion
ba84f4b [Cheng Lian] Fixes MapType schema conversion bug
13cb8d5 [Cheng Lian] Fixes MiMa failure
81de5b0 [Cheng Lian] Fixes UDT, workaround read path, and add tests
28ef95b [Cheng Lian] More AnalysisExceptions
b10c322 [Cheng Lian] Replaces require() with analysisRequire() which throws AnalysisException
cceaf3f [Cheng Lian] Implements backwards compatibility rules in CatalystSchemaConverter
2015-06-24 15:03:43 -07:00
Yin Huai bba6699d0e [SPARK-8578] [SQL] Should ignore user defined output committer when appending data
https://issues.apache.org/jira/browse/SPARK-8578

It is not very safe to use a custom output committer when append data to an existing dir. This changes adds the logic to check if we are appending data, and if so, we use the output committer associated with the file output format.

Author: Yin Huai <yhuai@databricks.com>

Closes #6964 from yhuai/SPARK-8578 and squashes the following commits:

43544c4 [Yin Huai] Do not use a custom output commiter when appendiing data.
2015-06-24 09:50:03 -07:00
Cheng Lian 9d36ec2431 [SPARK-8567] [SQL] Debugging flaky HiveSparkSubmitSuite
Using similar approach used in `HiveThriftServer2Suite` to print stdout/stderr of the spawned process instead of logging them to see what happens on Jenkins. (This test suite only fails on Jenkins and doesn't spill out any log...)

cc yhuai

Author: Cheng Lian <lian@databricks.com>

Closes #6978 from liancheng/debug-hive-spark-submit-suite and squashes the following commits:

b031647 [Cheng Lian] Prints process stdout/stderr instead of logging them
2015-06-24 09:49:20 -07:00
Cheng Hao 13321e6555 [SPARK-7859] [SQL] Collect_set() behavior differences which fails the unit test under jdk8
To reproduce that:
```
JAVA_HOME=/home/hcheng/Java/jdk1.8.0_45 | build/sbt -Phadoop-2.3 -Phive  'test-only org.apache.spark.sql.hive.execution.HiveWindowFunctionQueryWithoutCodeGenSuite'
```

A simple workaround to fix that is update the original query, for getting the output size instead of the exact elements of the array (output by collect_set())

Author: Cheng Hao <hao.cheng@intel.com>

Closes #6402 from chenghao-intel/windowing and squashes the following commits:

99312ad [Cheng Hao] add order by for the select clause
edf8ce3 [Cheng Hao] update the code as suggested
7062da7 [Cheng Hao] fix the collect_set() behaviour differences under different versions of JDK
2015-06-22 20:04:49 -07:00
Cheng Lian 0818fdec37 [SPARK-8406] [SQL] Adding UUID to output file name to avoid accidental overwriting
This PR fixes a Parquet output file name collision bug which may cause data loss.  Changes made:

1.  Identify each write job issued by `InsertIntoHadoopFsRelation` with a UUID

    All concrete data sources which extend `HadoopFsRelation` (Parquet and ORC for now) must use this UUID to generate task output file path to avoid name collision.

2.  Make `TestHive` use a local mode `SparkContext` with 32 threads to increase parallelism

    The major reason for this is that, the original parallelism of 2 is too low to reproduce the data loss issue.  Also, higher concurrency may potentially caught more concurrency bugs during testing phase. (It did help us spotted SPARK-8501.)

3. `OrcSourceSuite` was updated to workaround SPARK-8501, which we detected along the way.

NOTE: This PR is made a little bit more complicated than expected because we hit two other bugs on the way and have to work them around. See [SPARK-8501] [1] and [SPARK-8513] [2].

[1]: https://github.com/liancheng/spark/tree/spark-8501
[2]: https://github.com/liancheng/spark/tree/spark-8513

----

Some background and a summary of offline discussion with yhuai about this issue for better understanding:

In 1.4.0, we added `HadoopFsRelation` to abstract partition support of all data sources that are based on Hadoop `FileSystem` interface.  Specifically, this makes partition discovery, partition pruning, and writing dynamic partitions for data sources much easier.

To support appending, the Parquet data source tries to find out the max part number of part-files in the destination directory (i.e., `<id>` in output file name `part-r-<id>.gz.parquet`) at the beginning of the write job.  In 1.3.0, this step happens on driver side before any files are written.  However, in 1.4.0, this is moved to task side.  Unfortunately, for tasks scheduled later, they may see wrong max part number generated of files newly written by other finished tasks within the same job.  This actually causes a race condition.  In most cases, this only causes nonconsecutive part numbers in output file names.  But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks may choose the same part number, then one of them gets overwritten by the other.

Before `HadoopFsRelation`, Spark SQL already supports appending data to Hive tables.  From a user's perspective, these two look similar.  However, they differ a lot internally.  When data are inserted into Hive tables via Spark SQL, `InsertIntoHiveTable` simulates Hive's behaviors:

1.  Write data to a temporary location

2.  Move data in the temporary location to the final destination location using

    -   `Hive.loadTable()` for non-partitioned table
    -   `Hive.loadPartition()` for static partitions
    -   `Hive.loadDynamicPartitions()` for dynamic partitions

The important part is that, `Hive.copyFiles()` is invoked in step 2 to move the data to the destination directory (I found the name is kinda confusing since no "copying" occurs here, we are just moving and renaming stuff).  If a file in the source directory and another file in the destination directory happen to have the same name, say `part-r-00001.parquet`, the former is moved to the destination directory and renamed with a `_copy_N` postfix (`part-r-00001_copy_1.parquet`).  That's how Hive handles appending and avoids name collision between different write jobs.

Some alternatives fixes considered for this issue:

1.  Use a similar approach as Hive

    This approach is not preferred in Spark 1.4.0 mainly because file metadata operations in S3 tend to be slow, especially for tables with lots of file and/or partitions.  That's why `InsertIntoHadoopFsRelation` just inserts to destination directory directly, and is often used together with `DirectParquetOutputCommitter` to reduce latency when working with S3.  This means, we don't have the chance to do renaming, and must avoid name collision from the very beginning.

2.  Same as 1.3, just move max part number detection back to driver side

    This isn't doable because unlike 1.3, 1.4 also takes dynamic partitioning into account.  When inserting into dynamic partitions, we don't know which partition directories will be touched on driver side before issuing the write job.  Checking all partition directories is simply too expensive for tables with thousands of partitions.

3.  Add extra component to output file names to avoid name collision

    This seems to be the only reasonable solution for now.  To be more specific, we need a JOB level unique identifier to identify all write jobs issued by `InsertIntoHadoopFile`.  Notice that TASK level unique identifiers can NOT be used.  Because in this way a speculative task will write to a different output file from the original task.  If both tasks succeed, duplicate output will be left behind.  Currently, the ORC data source adds `System.currentTimeMillis` to the output file name for uniqueness.  This doesn't work because of exactly the same reason.

    That's why this PR adds a job level random UUID in `BaseWriterContainer` (which is used by `InsertIntoHadoopFsRelation` to issue write jobs).  The drawback is that record order is not preserved any more (output files of a later job may be listed before those of a earlier job).  However, we never promise to preserve record order when writing data, and Hive doesn't promise this either because the `_copy_N` trick breaks the order.

Author: Cheng Lian <lian@databricks.com>

Closes #6864 from liancheng/spark-8406 and squashes the following commits:

db7a46a [Cheng Lian] More comments
f5c1133 [Cheng Lian] Addresses comments
85c478e [Cheng Lian] Workarounds SPARK-8513
088c76c [Cheng Lian] Adds comment about SPARK-8501
99a5e7e [Cheng Lian] Uses job level UUID in SimpleTextRelation and avoids double task abortion
4088226 [Cheng Lian] Works around SPARK-8501
1d7d206 [Cheng Lian] Adds more logs
8966bbb [Cheng Lian] Fixes Scala style issue
18b7003 [Cheng Lian] Uses job level UUID to take speculative tasks into account
3806190 [Cheng Lian] Lets TestHive use all cores by default
748dbd7 [Cheng Lian] Adding UUID to output file name to avoid accidental overwriting
2015-06-22 10:03:57 -07:00
Cheng Lian 83cdfd84f8 [SPARK-8508] [SQL] Ignores a test case to cleanup unnecessary testing output until #6882 is merged
Currently [the test case for SPARK-7862] [1] writes 100,000 lines of integer triples to stderr and makes Jenkins build output unnecessarily large and it's hard to debug other build errors. A proper fix is on the way in #6882. This PR ignores this test case temporarily until #6882 is merged.

[1]: https://github.com/apache/spark/pull/6404/files#diff-1ea02a6fab84e938582f7f87cc4d9ea1R641

Author: Cheng Lian <lian@databricks.com>

Closes #6925 from liancheng/spark-8508 and squashes the following commits:

41e5b47 [Cheng Lian] Ignores the test case until #6882 is merged
2015-06-21 13:20:28 -07:00
Andrew Or bec40e52be [HOTFIX] [SPARK-8489] Correct JIRA number in previous commit
It should be SPARK-8489, not SPARK-8498.
2015-06-19 17:39:26 -07:00
Andrew Or 093c34838d [SPARK-8498] [SQL] Add regression test for SPARK-8470
**Summary of the problem in SPARK-8470.** When using `HiveContext` to create a data frame of a user case class, Spark throws `scala.reflect.internal.MissingRequirementError` when it tries to infer the schema using reflection. This is caused by `HiveContext` silently overwriting the context class loader containing the user classes.

**What this issue is about.** This issue adds regression tests for SPARK-8470, which is already fixed in #6891. We closed SPARK-8470 as a duplicate because it is a different manifestation of the same problem in SPARK-8368. Due to the complexity of the reproduction, this requires us to pre-package a special test jar and include it in the Spark project itself.

I tested this with and without the fix in #6891 and verified that it passes only if the fix is present.

Author: Andrew Or <andrew@databricks.com>

Closes #6909 from andrewor14/SPARK-8498 and squashes the following commits:

5e9d688 [Andrew Or] Add regression test for SPARK-8470
2015-06-19 17:34:09 -07:00
Yin Huai c5876e529b [SPARK-8368] [SPARK-8058] [SQL] HiveContext may override the context class loader of the current thread
https://issues.apache.org/jira/browse/SPARK-8368

Also, I add tests according https://issues.apache.org/jira/browse/SPARK-8058.

Author: Yin Huai <yhuai@databricks.com>

Closes #6891 from yhuai/SPARK-8368 and squashes the following commits:

37bb3db [Yin Huai] Update test timeout and comment.
8762eec [Yin Huai] Style.
695cd2d [Yin Huai] Correctly set the class loader in the conf of the state in client wrapper.
b3378fe [Yin Huai] Failed tests.
2015-06-19 11:11:58 -07:00
zsxwing 78a430ea4d [SPARK-7961][SQL]Refactor SQLConf to display better error message
1. Add `SQLConfEntry` to store the information about a configuration. For those configurations that cannot be found in `sql-programming-guide.md`, I left the doc as `<TODO>`.
2. Verify the value when setting a configuration if this is in SQLConf.
3. Use `SET -v` to display all public configurations.

Author: zsxwing <zsxwing@gmail.com>

Closes #6747 from zsxwing/sqlconf and squashes the following commits:

7d09bad [zsxwing] Use SQLConfEntry in HiveContext
49f6213 [zsxwing] Add getConf, setConf to SQLContext and HiveContext
e014f53 [zsxwing] Merge branch 'master' into sqlconf
93dad8e [zsxwing] Fix the unit tests
cf950c1 [zsxwing] Fix the code style and tests
3c5f03e [zsxwing] Add unsetConf(SQLConfEntry) and fix the code style
a2f4add [zsxwing] getConf will return the default value if a config is not set
037b1db [zsxwing] Add schema to SetCommand
0520c3c [zsxwing] Merge branch 'master' into sqlconf
7afb0ec [zsxwing] Fix the configurations about HiveThriftServer
7e728e3 [zsxwing] Add doc for SQLConfEntry and fix 'toString'
5e95b10 [zsxwing] Add enumConf
c6ba76d [zsxwing] setRawString => setConfString, getRawString => getConfString
4abd807 [zsxwing] Fix the test for 'set -v'
6e47e56 [zsxwing] Fix the compilation error
8973ced [zsxwing] Remove floatConf
1fc3a8b [zsxwing] Remove the 'conf' command and use 'set -v' instead
99c9c16 [zsxwing] Fix tests that use SQLConfEntry as a string
88a03cc [zsxwing] Add new lines between confs and return types
ce7c6c8 [zsxwing] Remove seqConf
f3c1b33 [zsxwing] Refactor SQLConf to display better error message
2015-06-17 23:22:54 -07:00
Yin Huai 302556ff99 [SPARK-8306] [SQL] AddJar command needs to set the new class loader to the HiveConf inside executionHive.state.
https://issues.apache.org/jira/browse/SPARK-8306

I will try to add a test later.

marmbrus aarondav

Author: Yin Huai <yhuai@databricks.com>

Closes #6758 from yhuai/SPARK-8306 and squashes the following commits:

1292346 [Yin Huai] [SPARK-8306] AddJar command needs to set the new class loader to the HiveConf inside executionHive.state.
2015-06-17 14:52:43 -07:00
baishuo 0b8c8fdc12 [SPARK-8156] [SQL] create table to specific database by 'use dbname'
when i test the following code:
hiveContext.sql("""use testdb""")
val df = (1 to 3).map(i => (i, s"val_$i", i * 2)).toDF("a", "b", "c")
df.write
.format("parquet")
.mode(SaveMode.Overwrite)
.saveAsTable("ttt3")
hiveContext.sql("show TABLES in default")

found that the table ttt3 will be created under the database "default"

Author: baishuo <vc_java@hotmail.com>

Closes #6695 from baishuo/SPARK-8516-use-database and squashes the following commits:

9e155f9 [baishuo] remove no use comment
cb9f027 [baishuo] modify testcase
00a7a2d [baishuo] modify testcase
4df48c7 [baishuo] modify testcase
b742e69 [baishuo] modify testcase
3d19ad9 [baishuo] create table to specific database
2015-06-16 16:40:02 -07:00
Marcelo Vanzin 4eb48ed1da [SPARK-8065] [SQL] Add support for Hive 0.14 metastores
This change has two parts.

The first one gets rid of "ReflectionMagic". That worked well for the differences between 0.12 and
0.13, but breaks in 0.14, since some of the APIs that need to be used have primitive types. I could
not figure out a way to make that class work with primitive types. So instead I wrote some shims
 (I can already hear the collective sigh) that find the appropriate methods via reflection. This should
be faster since the method instances are cached, and the code is not much uglier than before,
with the advantage that all the ugliness is local to one file (instead of multiple switch statements on
the version being used scattered in ClientWrapper).

The second part is simple: add code to handle Hive 0.14. A few new methods had to be added
to the new shims.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6627 from vanzin/SPARK-8065 and squashes the following commits:

3fa4270 [Marcelo Vanzin] Indentation style.
4b8a3d4 [Marcelo Vanzin] Fix dep exclusion.
be3d0cc [Marcelo Vanzin] Merge branch 'master' into SPARK-8065
ca3fb1e [Marcelo Vanzin] Merge branch 'master' into SPARK-8065
b43f13e [Marcelo Vanzin] Since exclusions seem to work, clean up some of the code.
73bd161 [Marcelo Vanzin] Botched merge.
d2ddf01 [Marcelo Vanzin] Comment about excluded dep.
0c929d1 [Marcelo Vanzin] Merge branch 'master' into SPARK-8065
2c3c02e [Marcelo Vanzin] Try to fix tests by adding support for exclusions.
0a03470 [Marcelo Vanzin] Try to fix tests by upgrading calcite dependency.
13b2dfa [Marcelo Vanzin] Fix NPE.
6439d88 [Marcelo Vanzin] Minor style thing.
69b017b [Marcelo Vanzin] Style.
a21cad8 [Marcelo Vanzin] Part II: Add shims / version for Hive 0.14.
ae98c87 [Marcelo Vanzin] PART I: Get rid of reflection magic.
2015-06-14 11:49:22 -07:00
Liang-Chi Hsieh ddec45279e [SPARK-8052] [SQL] Use java.math.BigDecimal for casting String to Decimal instead of using toDouble
JIRA: https://issues.apache.org/jira/browse/SPARK-8052

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6645 from viirya/cast_string_integraltype and squashes the following commits:

e19c6a3 [Liang-Chi Hsieh] For comment.
c3e472a [Liang-Chi Hsieh] Add test.
7ced9b0 [Liang-Chi Hsieh] Use java.math.BigDecimal for casting String to Decimal instead of using toDouble.
2015-06-13 16:39:52 -07:00
Davies Liu d46f8e5d4b [SPARK-7186] [SQL] Decouple internal Row from external Row
Currently, we use o.a.s.sql.Row both internally and externally. The external interface is wider than what the internal needs because it is designed to facilitate end-user programming. This design has proven to be very error prone and cumbersome for internal Row implementations.

As a first step, we create an InternalRow interface in the catalyst module, which is identical to the current Row interface. And we switch all internal operators/expressions to use this InternalRow instead. When we need to expose Row, we convert the InternalRow implementation into Row for users.

For all public API, we use Row (for example, data source APIs), which will be converted into/from InternalRow by CatalystTypeConverters.

For all internal data sources (Json, Parquet, JDBC, Hive), we use InternalRow for better performance, casted into Row in buildScan() (without change the public API). When create a PhysicalRDD, we cast them back to InternalRow.

cc rxin marmbrus JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #6792 from davies/internal_row and squashes the following commits:

f2abd13 [Davies Liu] fix scalastyle
a7e025c [Davies Liu] move InternalRow into catalyst
30db8ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into internal_row
7cbced8 [Davies Liu] separate Row and InternalRow
2015-06-12 23:06:31 -07:00
zhichao.li 2dd7f93080 [SPARK-7862] [SQL] Fix the deadlock in script transformation for stderr
[Related PR SPARK-7044] (https://github.com/apache/spark/pull/5671)

Author: zhichao.li <zhichao.li@intel.com>

Closes #6404 from zhichao-li/transform and squashes the following commits:

8418c97 [zhichao.li] add comments and remove useless failAfter logic
d9677e1 [zhichao.li] redirect the error desitination to be the same as the current process
2015-06-11 22:28:28 -07:00
Cheng Hao 040f223c5b [SPARK-7915] [SQL] Support specifying the column list for target table in CTAS
```
create table t1 (a int, b string) as select key, value from src;

desc t1;
key	int	NULL
value	string	NULL
```

Thus Hive doesn't support specifying the column list for target table in CTAS, however, we should either throwing exception explicity, or supporting the this feature, we just pick up the later one, which seems useful and straightforward.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #6458 from chenghao-intel/ctas_column and squashes the following commits:

d1fa9b6 [Cheng Hao] bug in unittest
4e701aa [Cheng Hao] update as feedback
f305ec1 [Cheng Hao] support specifying the column list for target table in CTAS
2015-06-11 14:03:08 -07:00
Davies Liu 37719e0cd0 [SPARK-8189] [SQL] use Long for TimestampType in SQL
This PR change to use Long as internal type for TimestampType for efficiency, which means it will the precision below 100ns.

Author: Davies Liu <davies@databricks.com>

Closes #6733 from davies/timestamp and squashes the following commits:

d9565fa [Davies Liu] remove print
65cf2f1 [Davies Liu] fix Timestamp in SparkR
86fecfb [Davies Liu] disable two timestamp tests
8f77ee0 [Davies Liu] fix scala style
246ee74 [Davies Liu] address comments
309d2e1 [Davies Liu] use Long for TimestampType in SQL
2015-06-10 16:55:39 -07:00
Cheng Lian 16fc49617e [SPARK-8079] [SQL] Makes InsertIntoHadoopFsRelation job/task abortion more robust
As described in SPARK-8079, when writing a DataFrame to a `HadoopFsRelation`, if `HadoopFsRelation.prepareForWriteJob` throws exception, an unexpected NPE will be thrown during job abortion. (This issue doesn't bring much damage since the job is failing anyway.)

This PR makes the job/task abortion logic in `InsertIntoHadoopFsRelation` more robust to avoid such confusing exceptions.

Author: Cheng Lian <lian@databricks.com>

Closes #6612 from liancheng/spark-8079 and squashes the following commits:

87cd81e [Cheng Lian] Addresses @rxin's comment
1864c75 [Cheng Lian] Addresses review comments
9e6dbb3 [Cheng Lian] Makes InsertIntoHadoopFsRelation job/task abortion more robust
2015-06-06 17:23:12 +08:00
Reynold Xin a71be0a36d [SPARK-8114][SQL] Remove some wildcard import on TestSQLContext._ round 3.
Author: Reynold Xin <rxin@databricks.com>

Closes #6677 from rxin/test-wildcard and squashes the following commits:

8a17b33 [Reynold Xin] Fixed line length.
6663813 [Reynold Xin] [SPARK-8114][SQL] Remove some wildcard import on TestSQLContext._ round 3.
2015-06-05 23:15:10 -07:00
Reynold Xin 6ebe419f33 [SPARK-8114][SQL] Remove some wildcard import on TestSQLContext._ cont'd.
Fixed the following packages:
sql.columnar
sql.jdbc
sql.json
sql.parquet

Author: Reynold Xin <rxin@databricks.com>

Closes #6667 from rxin/testsqlcontext_wildcard and squashes the following commits:

134a776 [Reynold Xin] Fixed compilation break.
6da7b69 [Reynold Xin] [SPARK-8114][SQL] Remove some wildcard import on TestSQLContext._ cont'd.
2015-06-05 13:57:21 -07:00
Cheolsoo Park 0526fea483 [SPARK-6909][SQL] Remove Hive Shim code
This is a follow-up on #6393. I am removing the following files in this PR.
```
./sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala
./sql/hive-thriftserver/v0.13.1/src/main/scala/org/apache/spark/sql/hive/thriftserver/Shim13.scala
```
Basically, I re-factored the shim code as follows-
* Rewrote code directly with Hive 0.13 methods, or
* Converted code into private methods, or
* Extracted code into separate classes

But for leftover code that didn't fit in any of these cases, I created a HiveShim object. For eg, helper functions which wrap Hive 0.13 methods to work around Hive bugs are placed here.

Author: Cheolsoo Park <cheolsoop@netflix.com>

Closes #6604 from piaozhexiu/SPARK-6909 and squashes the following commits:

5dccc20 [Cheolsoo Park] Remove hive shim code
2015-06-04 13:27:35 -07:00
Cheng Lian 686a45f0b9 [SPARK-8014] [SQL] Avoid premature metadata discovery when writing a HadoopFsRelation with a save mode other than Append
The current code references the schema of the DataFrame to be written before checking save mode. This triggers expensive metadata discovery prematurely. For save mode other than `Append`, this metadata discovery is useless since we either ignore the result (for `Ignore` and `ErrorIfExists`) or delete existing files (for `Overwrite`) later.

This PR fixes this issue by deferring metadata discovery after save mode checking.

Author: Cheng Lian <lian@databricks.com>

Closes #6583 from liancheng/spark-8014 and squashes the following commits:

1aafabd [Cheng Lian] Updates comments
088abaa [Cheng Lian] Avoids schema merging and partition discovery when data schema and partition schema are defined
8fbd93f [Cheng Lian] Fixes SPARK-8014
2015-06-02 13:32:13 -07:00
Yin Huai e797dba58e [SPARK-7965] [SPARK-7972] [SQL] Handle expressions containing multiple window expressions and make parser match window frames in case insensitive way
JIRAs:
https://issues.apache.org/jira/browse/SPARK-7965
https://issues.apache.org/jira/browse/SPARK-7972

Author: Yin Huai <yhuai@databricks.com>

Closes #6524 from yhuai/7965-7972 and squashes the following commits:

c12c79c [Yin Huai] Add doc for returned value.
de64328 [Yin Huai] Address rxin's comments.
fc9b1ad [Yin Huai] wip
2996da4 [Yin Huai] scala style
20b65b7 [Yin Huai] Handle expressions containing multiple window expressions.
9568b21 [Yin Huai] case insensitive matches
41f633d [Yin Huai] Failed test case.
2015-06-01 21:40:17 -07:00
Reynold Xin 75dda33f3e Revert "[SPARK-8020] Spark SQL in spark-defaults.conf make metadataHive get constructed too early"
This reverts commit 91f6be87bc.
2015-06-01 21:35:55 -07:00
Yin Huai 91f6be87bc [SPARK-8020] Spark SQL in spark-defaults.conf make metadataHive get constructed too early
https://issues.apache.org/jira/browse/SPARK-8020

Author: Yin Huai <yhuai@databricks.com>

Closes #6563 from yhuai/SPARK-8020 and squashes the following commits:

4e5addc [Yin Huai] style
bf766c6 [Yin Huai] Failed test.
0398f5b [Yin Huai] First populate the SQLConf and then construct executionHive and metadataHive.
2015-06-01 21:33:57 -07:00
Reynold Xin 866652c903 [SPARK-3850] Turn style checker on for trailing whitespaces.
Author: Reynold Xin <rxin@databricks.com>

Closes #6541 from rxin/trailing-whitespace-on and squashes the following commits:

f72ebe4 [Reynold Xin] [SPARK-3850] Turn style checker on for trailing whitespaces.
2015-05-31 14:23:42 -07:00
Reynold Xin 63a50be13d [SPARK-3850] Trim trailing spaces for SQL.
Author: Reynold Xin <rxin@databricks.com>

Closes #6535 from rxin/whitespace-sql and squashes the following commits:

de50316 [Reynold Xin] [SPARK-3850] Trim trailing spaces for SQL.
2015-05-31 00:48:49 -07:00
Andrew Or 9eb222c139 [SPARK-7558] Demarcate tests in unit-tests.log
Right now `unit-tests.log` are not of much value because we can't tell where the test boundaries are easily. This patch adds log statements before and after each test to outline the test boundaries, e.g.:

```
===== TEST OUTPUT FOR o.a.s.serializer.KryoSerializerSuite: 'kryo with parallelize for primitive arrays' =====

15/05/27 12:36:39.596 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO SparkContext: Starting job: count at KryoSerializerSuite.scala:230
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Got job 3 (count at KryoSerializerSuite.scala:230) with 4 output partitions (allowLocal=false)
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Final stage: ResultStage 3(count at KryoSerializerSuite.scala:230)
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Parents of final stage: List()
15/05/27 12:36:39.597 dag-scheduler-event-loop INFO DAGScheduler: Missing parents: List()
15/05/27 12:36:39.597 dag-scheduler-event-loop INFO DAGScheduler: Submitting ResultStage 3 (ParallelCollectionRDD[5] at parallelize at KryoSerializerSuite.scala:230), which has no missing parents

...

15/05/27 12:36:39.624 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO DAGScheduler: Job 3 finished: count at KryoSerializerSuite.scala:230, took 0.028563 s
15/05/27 12:36:39.625 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO KryoSerializerSuite:

***** FINISHED o.a.s.serializer.KryoSerializerSuite: 'kryo with parallelize for primitive arrays' *****

...
```

Author: Andrew Or <andrew@databricks.com>

Closes #6441 from andrewor14/demarcate-tests and squashes the following commits:

879b060 [Andrew Or] Fix compile after rebase
d622af7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
017c8ba [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
7790b6c [Andrew Or] Fix tests after logical merge conflict
c7460c0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
c43ffc4 [Andrew Or] Fix tests?
8882581 [Andrew Or] Fix tests
ee22cda [Andrew Or] Fix log message
fa9450e [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
12d1e1b [Andrew Or] Various whitespace changes (minor)
69cbb24 [Andrew Or] Make all test suites extend SparkFunSuite instead of FunSuite
bbce12e [Andrew Or] Fix manual things that cannot be covered through automation
da0b12f [Andrew Or] Add core tests as dependencies in all modules
f7d29ce [Andrew Or] Introduce base abstract class for all test suites
2015-05-29 14:03:12 -07:00
Reynold Xin 97a60cf75d [SPARK-7929] Turn whitespace checker on for more token types.
This is the last batch of changes to complete SPARK-7929.

Previous related PRs:
https://github.com/apache/spark/pull/6480
https://github.com/apache/spark/pull/6478
https://github.com/apache/spark/pull/6477
https://github.com/apache/spark/pull/6476
https://github.com/apache/spark/pull/6475
https://github.com/apache/spark/pull/6474
https://github.com/apache/spark/pull/6473

Author: Reynold Xin <rxin@databricks.com>

Closes #6487 from rxin/whitespace-lint and squashes the following commits:

b33d43d [Reynold Xin] [SPARK-7929] Turn whitespace checker on for more token types.
2015-05-28 23:00:02 -07:00
Cheng Lian b97ddff000 [SPARK-7684] [SQL] Refactoring MetastoreDataSourcesSuite to workaround SPARK-7684
As stated in SPARK-7684, currently `TestHive.reset` has some execution order specific bug, which makes running specific test suites locally pretty frustrating. This PR refactors `MetastoreDataSourcesSuite` (which relies on `TestHive.reset` heavily) using various `withXxx` utility methods in `SQLTestUtils` to ask each test case to cleanup their own mess so that we can avoid calling `TestHive.reset`.

Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #6353 from liancheng/workaround-spark-7684 and squashes the following commits:

26939aa [Yin Huai] Move the initialization of jsonFilePath to beforeAll.
a423d48 [Cheng Lian] Fixes Scala style issue
dfe45d0 [Cheng Lian] Refactors MetastoreDataSourcesSuite to workaround SPARK-7684
92a116d [Cheng Lian] Fixes minor styling issues
2015-05-27 13:09:33 -07:00
Daoyuan Wang 8161562eab [SPARK-7790] [SQL] date and decimal conversion for dynamic partition key
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #6318 from adrian-wang/dynpart and squashes the following commits:

ad73b61 [Daoyuan Wang] not use sqlTestUtils for try catch because dont have sqlcontext here
6c33b51 [Daoyuan Wang] fix according to liancheng
f0f8074 [Daoyuan Wang] some specific types as dynamic partition
2015-05-27 12:42:13 -07:00
Cheng Lian b463e6d618 [SPARK-7868] [SQL] Ignores _temporary directories in HadoopFsRelation
So that potential partial/corrupted data files left by failed tasks/jobs won't affect normal data scan.

Author: Cheng Lian <lian@databricks.com>

Closes #6411 from liancheng/spark-7868 and squashes the following commits:

273ea36 [Cheng Lian] Ignores _temporary directories
2015-05-26 20:48:56 -07:00
Josh Rosen 0c33c7b4a6 [SPARK-7858] [SQL] Use output schema, not relation schema, for data source input conversion
In `DataSourceStrategy.createPhysicalRDD`, we use the relation schema as the target schema for converting incoming rows into Catalyst rows.  However, we should be using the output schema instead, since our scan might return a subset of the relation's columns.

This patch incorporates #6414 by liancheng, which fixes an issue in `SimpleTestRelation` that prevented this bug from being caught by our old tests:

> In `SimpleTextRelation`, we specified `needsConversion` to `true`, indicating that values produced by this testing relation should be of Scala types, and need to be converted to Catalyst types when necessary. However, we also used `Cast` to convert strings to expected data types. And `Cast` always produces values of Catalyst types, thus no conversion is done at all. This PR makes `SimpleTextRelation` produce Scala values so that data conversion code paths can be properly tested.

Closes #5986.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Author: Cheng Lian <liancheng@users.noreply.github.com>

Closes #6400 from JoshRosen/SPARK-7858 and squashes the following commits:

e71c866 [Josh Rosen] Re-fix bug so that the tests pass again
56b13e5 [Josh Rosen] Add regression test to hadoopFsRelationSuites
2169a0f [Josh Rosen] Remove use of SpecificMutableRow and BufferedIterator
6cd7366 [Josh Rosen] Fix SPARK-7858 by using output types for conversion.
5a00e66 [Josh Rosen] Add assertions in order to reproduce SPARK-7858
8ba195c [Cheng Lian] Merge 9968fba9979287aaa1f141ba18bfb9d4c116a3b3 into 61664732b2
9968fba [Cheng Lian] Tests the data type conversion code paths
2015-05-26 20:24:35 -07:00
Cheng Lian 8af1bf10b7 [SPARK-7842] [SQL] Makes task committing/aborting in InsertIntoHadoopFsRelation more robust
When committing/aborting a write task issued in `InsertIntoHadoopFsRelation`, if an exception is thrown from `OutputWriter.close()`, the committing/aborting process will be interrupted, and leaves messy stuff behind (e.g., the `_temporary` directory created by `FileOutputCommitter`).

This PR makes these two process more robust by catching potential exceptions and falling back to normal task committment/abort.

Author: Cheng Lian <lian@databricks.com>

Closes #6378 from liancheng/spark-7838 and squashes the following commits:

f18253a [Cheng Lian] Makes task committing/aborting in InsertIntoHadoopFsRelation more robust
2015-05-26 00:28:47 +08:00
Yin Huai 2b7e63585d [SPARK-7654] [SQL] Move insertInto into reader/writer interface.
This one continues the work of https://github.com/apache/spark/pull/6216.

Author: Yin Huai <yhuai@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #6366 from yhuai/insert and squashes the following commits:

3d717fb [Yin Huai] Use insertInto to handle the casue when table exists and Append is used for saveAsTable.
56d2540 [Yin Huai] Add PreWriteCheck to HiveContext's analyzer.
c636e35 [Yin Huai] Remove unnecessary empty lines.
cf83837 [Yin Huai] Move insertInto to write. Also, remove the partition columns from InsertIntoHadoopFsRelation.
0841a54 [Reynold Xin] Removed experimental tag for deprecated methods.
33ed8ef [Reynold Xin] [SPARK-7654][SQL] Move insertInto into reader/writer interface.
2015-05-23 09:48:20 -07:00
Davies Liu efe3bfdf49 [SPARK-7322, SPARK-7836, SPARK-7822][SQL] DataFrame window function related updates
1. ntile should take an integer as parameter.
2. Added Python API (based on #6364)
3. Update documentation of various DataFrame Python functions.

Author: Davies Liu <davies@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #6374 from rxin/window-final and squashes the following commits:

69004c7 [Reynold Xin] Style fix.
288cea9 [Reynold Xin] Update documentaiton.
7cb8985 [Reynold Xin] Merge pull request #6364 from davies/window
66092b4 [Davies Liu] update docs
ed73cb4 [Reynold Xin] [SPARK-7322][SQL] Improve DataFrame window function documentation.
ef55132 [Davies Liu] Merge branch 'master' of github.com:apache/spark into window4
8936ade [Davies Liu] fix maxint in python 3
2649358 [Davies Liu] update docs
778e2c0 [Davies Liu] SPARK-7836 and SPARK-7822: Python API of window functions
2015-05-23 08:30:05 -07:00
Liang-Chi Hsieh 126d7235de [SPARK-7270] [SQL] Consider dynamic partition when inserting into hive table
JIRA: https://issues.apache.org/jira/browse/SPARK-7270

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5864 from viirya/dyn_partition_insert and squashes the following commits:

b5627df [Liang-Chi Hsieh] For comments.
3b21e4b [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into dyn_partition_insert
8a4352d [Liang-Chi Hsieh] Consider dynamic partition when inserting into hive table.
2015-05-22 15:39:58 -07:00
Cheng Hao f6f2eeb179 [SPARK-7322][SQL] Window functions in DataFrame
This closes #6104.

Author: Cheng Hao <hao.cheng@intel.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #6343 from rxin/window-df and squashes the following commits:

026d587 [Reynold Xin] Address code review feedback.
dc448fe [Reynold Xin] Fixed Hive tests.
9794d9d [Reynold Xin] Moved Java test package.
9331605 [Reynold Xin] Refactored API.
3313e2a [Reynold Xin] Merge pull request #6104 from chenghao-intel/df_window
d625a64 [Cheng Hao] Update the dataframe window API as suggsted
c141fb1 [Cheng Hao] hide all of properties of the WindowFunctionDefinition
3b1865f [Cheng Hao] scaladoc typos
f3fd2d0 [Cheng Hao] polish the unit test
6847825 [Cheng Hao] Add additional analystcs functions
57e3bc0 [Cheng Hao] typos
24a08ec [Cheng Hao] scaladoc
28222ed [Cheng Hao] fix bug of range/row Frame
1d91865 [Cheng Hao] style issue
53f89f2 [Cheng Hao] remove the over from the functions.scala
964c013 [Cheng Hao] add more unit tests and window functions
64e18a7 [Cheng Hao] Add Window Function support for DataFrame
2015-05-22 01:00:16 -07:00
Yin Huai 30f3f556f7 [SPARK-7763] [SPARK-7616] [SQL] Persists partition columns into metastore
Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #6285 from liancheng/spark-7763 and squashes the following commits:

bb2829d [Yin Huai] Fix hashCode.
d677f7d [Cheng Lian] Fixes Scala style issue
44b283f [Cheng Lian] Adds test case for SPARK-7616
6733276 [Yin Huai] Fix a bug that potentially causes https://issues.apache.org/jira/browse/SPARK-7616.
6cabf3c [Yin Huai] Update unit test.
7e02910 [Yin Huai] Use metastore partition columns and do not hijack maybePartitionSpec.
e9a03ec [Cheng Lian] Persists partition columns into metastore
2015-05-21 13:51:40 -07:00
scwf f6c486aa4b [SQL] [TEST] udf_java_method failed due to jdk version
java.lang.Math.exp(1.0) has different result between jdk versions. so do not use createQueryTest, write a separate test for it.
```
jdk version   	result
1.7.0_11		2.7182818284590455
1.7.0_05        2.7182818284590455
1.7.0_71		2.718281828459045
```

Author: scwf <wangfei1@huawei.com>

Closes #6274 from scwf/java_method and squashes the following commits:

3dd2516 [scwf] address comments
5fa1459 [scwf] style
df46445 [scwf] fix test error
fcb6d22 [scwf] fix udf_java_method
2015-05-21 12:31:58 -07:00
Cheng Lian 8730fbb47b [SPARK-7749] [SQL] Fixes partition discovery for non-partitioned tables
When no partition columns can be found, we should have an empty `PartitionSpec`, rather than a `PartitionSpec` with empty partition columns.

This PR together with #6285 should fix SPARK-7749.

Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #6287 from liancheng/spark-7749 and squashes the following commits:

a799ff3 [Cheng Lian] Adds test cases for SPARK-7749
c4949be [Cheng Lian] Minor refactoring, and tolerant _TEMPORARY directory name
5aa87ea [Yin Huai] Make parsePartitions more robust.
fc56656 [Cheng Lian] Returns empty PartitionSpec if no partition columns can be inferred
19ae41e [Cheng Lian] Don't list base directory as leaf directory
2015-05-21 10:56:17 -07:00
Cheng Hao feb3a9d3f8 [SPARK-7320] [SQL] [Minor] Move the testData into beforeAll()
Follow up of #6340, to avoid the test report missing once it fails.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #6312 from chenghao-intel/rollup_minor and squashes the following commits:

b03a25f [Cheng Hao] simplify the testData instantiation
09b7e8b [Cheng Hao] move the testData into beforeAll()
2015-05-21 09:28:00 -07:00
Cheng Hao 42c592adb3 [SPARK-7320] [SQL] Add Cube / Rollup for dataframe
This is a follow up for #6257, which broke the maven test.

Add cube & rollup for DataFrame
For example:
```scala
testData.rollup($"a" + $"b", $"b").agg(sum($"a" - $"b"))
testData.cube($"a" + $"b", $"b").agg(sum($"a" - $"b"))
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #6304 from chenghao-intel/rollup and squashes the following commits:

04bb1de [Cheng Hao] move the table register/unregister into beforeAll/afterAll
a6069f1 [Cheng Hao] cancel the implicit keyword
ced4b8f [Cheng Hao] remove the unnecessary code changes
9959dfa [Cheng Hao] update the code as comments
e1d88aa [Cheng Hao] update the code as suggested
03bc3d9 [Cheng Hao] Remove the CubedData & RollupedData
5fd62d0 [Cheng Hao] hiden the CubedData & RollupedData
5ffb196 [Cheng Hao] Add Cube / Rollup for dataframe
2015-05-20 19:58:22 -07:00
Patrick Wendell 6338c40da6 Revert "[SPARK-7320] [SQL] Add Cube / Rollup for dataframe"
This reverts commit 10698e1131.
2015-05-20 13:39:04 -07:00
Cheng Hao 09265ad7c8 [SPARK-7320] [SQL] Add Cube / Rollup for dataframe
Add `cube` & `rollup` for DataFrame
For example:
```scala
testData.rollup($"a" + $"b", $"b").agg(sum($"a" - $"b"))
testData.cube($"a" + $"b", $"b").agg(sum($"a" - $"b"))
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #6257 from chenghao-intel/rollup and squashes the following commits:

7302319 [Cheng Hao] cancel the implicit keyword
a66e38f [Cheng Hao] remove the unnecessary code changes
a2869d4 [Cheng Hao] update the code as comments
c441777 [Cheng Hao] update the code as suggested
84c9564 [Cheng Hao] Remove the CubedData & RollupedData
279584c [Cheng Hao] hiden the CubedData & RollupedData
ef357e1 [Cheng Hao] Add Cube / Rollup for dataframe
2015-05-20 19:09:47 +08:00
Cheng Hao bcb1ff8146 [SPARK-7662] [SQL] Resolve correct names for generator in projection
```
select explode(map(value, key)) from src;
```
Throws exception
```
org.apache.spark.sql.AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got _c0 ;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #6178 from chenghao-intel/explode and squashes the following commits:

916fbe9 [Cheng Hao] add more strict rules for TGF alias
5c3f2c5 [Cheng Hao] fix bug in unit test
e1d93ab [Cheng Hao] Add more unit test
19db09e [Cheng Hao] resolve names for generator in projection
2015-05-19 15:20:46 -07:00
Cheng Lian 9dadf019b9 [SPARK-7673] [SQL] WIP: HadoopFsRelation and ParquetRelation2 performance optimizations
This PR introduces several performance optimizations to `HadoopFsRelation` and `ParquetRelation2`:

1.  Moving `FileStatus` listing from `DataSourceStrategy` into a cache within `HadoopFsRelation`.

    This new cache generalizes and replaces the one used in `ParquetRelation2`.

    This also introduces an interface change: to reuse cached `FileStatus` objects, `HadoopFsRelation.buildScan` methods now receive `Array[FileStatus]` instead of `Array[String]`.

1.  When Parquet task side metadata reading is enabled, skip reading row group information when reading Parquet footers.

    This is basically what PR #5334 does. Also, now we uses `ParquetFileReader.readAllFootersInParallel` to read footers in parallel.

Another optimization in question is, instead of asking `HadoopFsRelation.buildScan` to return an `RDD[Row]` for a single selected partition and then union them all, we ask it to return an `RDD[Row]` for all selected partitions. This optimization is based on the fact that Hadoop configuration broadcasting used in `NewHadoopRDD` takes 34% time in the following microbenchmark.  However, this complicates data source user code because user code must merge partition values manually.

To check the cost of broadcasting in `NewHadoopRDD`, I also did microbenchmark after removing the `broadcast` call in `NewHadoopRDD`.  All results are shown below.

### Microbenchmark

#### Preparation code

Generating a partitioned table with 50k partitions, 1k rows per partition:

```scala
import sqlContext._
import sqlContext.implicits._

for (n <- 0 until 500) {
  val data = for {
    p <- (n * 10) until ((n + 1) * 10)
    i <- 0 until 1000
  } yield (i, f"val_$i%04d", f"$p%04d")

  data.
    toDF("a", "b", "p").
    write.
    partitionBy("p").
    mode("append").
    parquet(path)
}
```

#### Benchmarking code

```scala
import sqlContext._
import sqlContext.implicits._

import org.apache.spark.sql.types._
import com.google.common.base.Stopwatch

val path = "hdfs://localhost:9000/user/lian/5k"

def benchmark(n: Int)(f: => Unit) {
  val stopwatch = new Stopwatch()

  def run() = {
    stopwatch.reset()
    stopwatch.start()
    f
    stopwatch.stop()
    stopwatch.elapsedMillis()
  }

  val records = (0 until n).map(_ => run())

  (0 until n).foreach(i => println(s"Round $i: ${records(i)} ms"))
  println(s"Average: ${records.sum / n.toDouble} ms")
}

benchmark(3) { read.parquet(path).explain(extended = true) }
```

#### Results

Before:

```
Round 0: 72528 ms
Round 1: 68938 ms
Round 2: 65372 ms
Average: 68946.0 ms
```

After:

```
Round 0: 59499 ms
Round 1: 53645 ms
Round 2: 53844 ms
Round 3: 49093 ms
Round 4: 50555 ms
Average: 53327.2 ms
```

Also removing Hadoop configuration broadcasting:

(Note that I was testing on a local laptop, thus network cost is pretty low.)

```
Round 0: 15806 ms
Round 1: 14394 ms
Round 2: 14699 ms
Round 3: 15334 ms
Round 4: 14123 ms
Average: 14871.2 ms
```

Author: Cheng Lian <lian@databricks.com>

Closes #6225 from liancheng/spark-7673 and squashes the following commits:

2d58a2b [Cheng Lian] Skips reading row group information when using task side metadata reading
7aa3748 [Cheng Lian] Optimizes FileStatusCache by introducing a map from parent directories to child files
ba41250 [Cheng Lian] Reuses HadoopFsRelation FileStatusCache in ParquetRelation2
3d278f7 [Cheng Lian] Fixes a bug when reading a single Parquet data file
b84612a [Cheng Lian] Fixes Scala style issue
6a08b02 [Cheng Lian] WIP: Moves file status cache into HadoopFSRelation
2015-05-18 12:45:37 -07:00
Wenchen Fan 103c863c2e [SPARK-7269] [SQL] Incorrect analysis for aggregation(use semanticEquals)
A modified version of https://github.com/apache/spark/pull/6110, use `semanticEquals` to make it more efficient.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6173 from cloud-fan/7269 and squashes the following commits:

e4a3cc7 [Wenchen Fan] address comments
cc02045 [Wenchen Fan] consider elements length equal
d7ff8f4 [Wenchen Fan] fix 7269
2015-05-18 12:12:55 -07:00
Zhan Zhang aa31e431fc [SPARK-2883] [SQL] ORC data source for Spark SQL
This PR updates PR #6135 authored by zhzhan from Hortonworks.

----

This PR implements a Spark SQL data source for accessing ORC files.

> **NOTE**
>
> Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive.  That's why the new ORC data source is under `org.apache.spark.sql.hive` package, and must be used with `HiveContext`.  However, it doesn't require existing Hive installation to access ORC files.

1.  Saving/loading ORC files without contacting Hive metastore

1.  Support for complex data types (i.e. array, map, and struct)

1.  Aware of common optimizations provided by Spark SQL:

    - Column pruning
    - Partitioning pruning
    - Filter push-down

1.  Schema evolution support
1.  Hive metastore table conversion

This PR also include initial work done by scwf from Huawei (PR #3753).

Author: Zhan Zhang <zhazhan@gmail.com>
Author: Cheng Lian <lian@databricks.com>

Closes #6194 from liancheng/polishing-orc and squashes the following commits:

55ecd96 [Cheng Lian] Reorganizes ORC test suites
d4afeed [Cheng Lian] Addresses comments
21ada22 [Cheng Lian] Adds @since and @Experimental annotations
128bd3b [Cheng Lian] ORC filter bug fix
d734496 [Cheng Lian] Polishes the ORC data source
2650a42 [Zhan Zhang] resolve review comments
3c9038e [Zhan Zhang] resolve review comments
7b3c7c5 [Zhan Zhang] save mode fix
f95abfd [Zhan Zhang] reuse test suite
7cc2c64 [Zhan Zhang] predicate fix
4e61c16 [Zhan Zhang] minor change
305418c [Zhan Zhang] orc data source support
2015-05-18 12:03:40 -07:00
Reynold Xin 517eb37a85 [SPARK-7654][SQL] Move JDBC into DataFrame's reader/writer interface.
Also moved all the deprecated functions into one place for SQLContext and DataFrame, and updated tests to use the new API.

Author: Reynold Xin <rxin@databricks.com>

Closes #6210 from rxin/df-writer-reader-jdbc and squashes the following commits:

7465c2c [Reynold Xin] Fixed unit test.
118e609 [Reynold Xin] Updated tests.
3441b57 [Reynold Xin] Updated javadoc.
13cdd1c [Reynold Xin] [SPARK-7654][SQL] Move JDBC into DataFrame's reader/writer interface.
2015-05-16 22:01:53 -07:00
Reynold Xin 578bfeeff5 [SPARK-7654][SQL] DataFrameReader and DataFrameWriter for input/output API
This patch introduces DataFrameWriter and DataFrameReader.

DataFrameReader interface, accessible through SQLContext.read, contains methods that create DataFrames. These methods used to reside in SQLContext. Example usage:
```scala
sqlContext.read.json("...")
sqlContext.read.parquet("...")
```

DataFrameWriter interface, accessible through DataFrame.write, implements a builder pattern to avoid the proliferation of options in writing DataFrame out. It currently implements:
- mode
- format (e.g. "parquet", "json")
- options (generic options passed down into data sources)
- partitionBy (partitioning columns)
Example usage:
```scala
df.write.mode("append").format("json").partitionBy("date").saveAsTable("myJsonTable")
```

TODO:

- [ ] Documentation update
- [ ] Move JDBC into reader / writer?
- [ ] Deprecate the old interfaces
- [ ] Move the generic load interface into reader.
- [ ] Update example code and documentation

Author: Reynold Xin <rxin@databricks.com>

Closes #6175 from rxin/reader-writer and squashes the following commits:

b146c95 [Reynold Xin] Deprecation of old APIs.
bd8abdf [Reynold Xin] Fixed merge conflict.
26abea2 [Reynold Xin] Added general load methods.
244fbec [Reynold Xin] Added equivalent to example.
4f15d92 [Reynold Xin] Added documentation for partitionBy.
7e91611 [Reynold Xin] [SPARK-7654][SQL] DataFrameReader and DataFrameWriter for input/output API.
2015-05-15 22:00:31 -07:00
Cheng Lian fdf5bba35d [SPARK-7591] [SQL] Partitioning support API tweaks
Please see [SPARK-7591] [1] for the details.

/cc rxin marmbrus yhuai

[1]: https://issues.apache.org/jira/browse/SPARK-7591

Author: Cheng Lian <lian@databricks.com>

Closes #6150 from liancheng/spark-7591 and squashes the following commits:

af422e7 [Cheng Lian] Addresses @rxin's comments
37d1738 [Cheng Lian] Fixes HadoopFsRelation partition columns initialization
2fc680a [Cheng Lian] Fixes Scala style issue
189ad23 [Cheng Lian] Removes HadoopFsRelation constructor arguments
522c24e [Cheng Lian] Adds OutputWriterFactory
047d40d [Cheng Lian] Renames FSBased* to HadoopFs*, also renamed FSBasedParquetRelation back to ParquetRelation2
2015-05-15 16:20:49 +08:00
linweizhong 13e652b61a [SPARK-7595] [SQL] Window will cause resolve failed with self join
for example:
table: src(key string, value string)
sql: with v1 as(select key, count(value) over (partition by key) cnt_val from src), v2 as(select v1.key, v1_lag.cnt_val from v1, v1 v1_lag where v1.key = v1_lag.key) select * from v2 limit 5;
then will analyze fail when resolving conflicting references in Join:
'Limit 5
 'Project [*]
  'Subquery v2
   'Project ['v1.key,'v1_lag.cnt_val]
    'Filter ('v1.key = 'v1_lag.key)
     'Join Inner, None
      Subquery v1
       Project [key#95,cnt_val#94L]
        Window [key#95,value#96], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(value#96) WindowSpecDefinition [key#95], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS cnt_val#94L], WindowSpecDefinition [key#95], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
         Project [key#95,value#96]
          MetastoreRelation default, src, None
      Subquery v1_lag
       Subquery v1
        Project [key#97,cnt_val#94L]
         Window [key#97,value#98], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(value#98) WindowSpecDefinition [key#97], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS cnt_val#94L], WindowSpecDefinition [key#97], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
          Project [key#97,value#98]
           MetastoreRelation default, src, None

Conflicting attributes: cnt_val#94L

Author: linweizhong <linweizhong@huawei.com>

Closes #6114 from Sephiroth-Lin/spark-7595 and squashes the following commits:

f8f2637 [linweizhong] Add unit test
dfe9169 [linweizhong] Handle windowExpression with self join
2015-05-14 00:23:27 -07:00
Reynold Xin e683182c3e [SQL] Move some classes into packages that are more appropriate.
JavaTypeInference into catalyst
types.DateUtils into catalyst
CacheManager into execution
DefaultParserDialect into catalyst

Author: Reynold Xin <rxin@databricks.com>

Closes #6108 from rxin/sql-rename and squashes the following commits:

3fc9613 [Reynold Xin] Fixed import ordering.
83d9ff4 [Reynold Xin] Fixed codegen tests.
e271e86 [Reynold Xin] mima
f4e24a6 [Reynold Xin] [SQL] Move some classes into packages that are more appropriate.
2015-05-13 16:15:31 -07:00
Cheng Lian 7ff16e8abe [SPARK-7567] [SQL] Migrating Parquet data source to FSBasedRelation
This PR migrates Parquet data source to the newly introduced `FSBasedRelation`. `FSBasedParquetRelation` is created to replace `ParquetRelation2`. Major differences are:

1. Partition discovery code has been factored out to `FSBasedRelation`
1. `AppendingParquetOutputFormat` is not used now. Instead, an anonymous subclass of `ParquetOutputFormat` is used to handle appending and writing dynamic partitions
1. When scanning partitioned tables, `FSBasedParquetRelation.buildScan` only builds an `RDD[Row]` for a single selected partition
1. `FSBasedParquetRelation` doesn't rely on Catalyst expressions for filter push down, thus it doesn't extend `CatalystScan` anymore

   After migrating `JSONRelation` (which extends `CatalystScan`), we can remove `CatalystScan`.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/6090)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #6090 from liancheng/parquet-migration and squashes the following commits:

6063f87 [Cheng Lian] Casts to OutputCommitter rather than FileOutputCommtter
bfd1cf0 [Cheng Lian] Fixes compilation error introduced while rebasing
f9ea56e [Cheng Lian] Adds ParquetRelation2 related classes to MiMa check whitelist
261d8c1 [Cheng Lian] Minor bug fix and more tests
db65660 [Cheng Lian] Migrates Parquet data source to FSBasedRelation
2015-05-13 11:04:10 -07:00
Cheng Hao 0da254fb29 [SPARK-6734] [SQL] Add UDTF.close support in Generate
Some third-party UDTF extensions generate additional rows in the "GenericUDTF.close()" method, which is supported / documented by Hive.
https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
However, Spark SQL ignores the "GenericUDTF.close()", and it causes bug while porting job from Hive to Spark SQL.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #5383 from chenghao-intel/udtf_close and squashes the following commits:

98b4e4b [Cheng Hao] Support UDTF.close
2015-05-14 00:14:59 +08:00
Cheng Lian aa6ba3f216 [MINOR] [SQL] Removes debugging println
Author: Cheng Lian <lian@databricks.com>

Closes #6123 from liancheng/remove-println and squashes the following commits:

03356b6 [Cheng Lian] Removes debugging println
2015-05-13 23:40:13 +08:00
Cheng Lian 0595b6de8f [SPARK-3928] [SPARK-5182] [SQL] Partitioning support for the data sources API
This PR adds partitioning support for the external data sources API. It aims to simplify development of file system based data sources, and provide first class partitioning support for both read path and write path.  Existing data sources like JSON and Parquet can be simplified with this work.

## New features provided

1. Hive compatible partition discovery

   This actually generalizes the partition discovery strategy used in Parquet data source in Spark 1.3.0.

1. Generalized partition pruning optimization

   Now partition pruning is handled during physical planning phase.  Specific data sources don't need to worry about this harness anymore.

   (This also implies that we can remove `CatalystScan` after migrating the Parquet data source, since now we don't need to pass Catalyst expressions to data source implementations.)

1. Insertion with dynamic partitions

   When inserting data to a `FSBasedRelation`, data can be partitioned dynamically by specified partition columns.

## New structures provided

### Developer API

1. `FSBasedRelation`

   Base abstract class for file system based data sources.

1. `OutputWriter`

   Base abstract class for output row writers, responsible for writing a single row object.

1. `FSBasedRelationProvider`

   A new relation provider for `FSBasedRelation` subclasses. Note that data sources extending `FSBasedRelation` don't need to extend `RelationProvider` and `SchemaRelationProvider`.

### User API

New overloaded versions of

1. `DataFrame.save()`
1. `DataFrame.saveAsTable()`
1. `SQLContext.load()`

are provided to allow users to save/load DataFrames with user defined dynamic partition columns.

### Spark SQL query planning

1. `InsertIntoFSBasedRelation`

   Used to implement write path for `FSBasedRelation`s.

1. New rules for `FSBasedRelation` in `DataSourceStrategy`

   These are added to hook `FSBasedRelation` into physical query plan in read path, and perform partition pruning.

## TODO

- [ ] Use scratch directories when overwriting a table with data selected from itself.

      Currently, this is not supported, because the table been overwritten is always deleted before writing any data to it.

- [ ] When inserting with dynamic partition columns, use external sorter to group the data first.

      This ensures that we only need to open a single `OutputWriter` at a time.  For data sources like Parquet, `OutputWriter`s can be quite memory consuming.  One issue is that, this approach breaks the row distribution in the original DataFrame.  However, we did't promise to preserve data distribution when writing a DataFrame.

- [x] More tests.  Specifically, test cases for

      - [x] Self-join
      - [x] Loading partitioned relations with a subset of partition columns stored in data files.
      - [x] `SQLContext.load()` with user defined dynamic partition columns.

## Parquet data source migration

Parquet data source migration is covered in PR https://github.com/liancheng/spark/pull/6, which is against this PR branch and for preview only. A formal PR need to be made after this one is merged.

Author: Cheng Lian <lian@databricks.com>

Closes #5526 from liancheng/partitioning-support and squashes the following commits:

5351a1b [Cheng Lian] Fixes compilation error introduced while rebasing
1f9b1a5 [Cheng Lian] Tweaks data schema passed to FSBasedRelations
43ba50e [Cheng Lian] Avoids serializing generated projection code
edf49e7 [Cheng Lian] Removed commented stale code block
348a922 [Cheng Lian] Adds projection in FSBasedRelation.buildScan(requiredColumns, inputPaths)
ad4d4de [Cheng Lian] Enables HDFS style globbing
8d12e69 [Cheng Lian] Fixes compilation error
c71ac6c [Cheng Lian] Addresses comments from @marmbrus
7552168 [Cheng Lian] Fixes typo in MimaExclude.scala
0349e09 [Cheng Lian] Fixes compilation error introduced while rebasing
52b0c9b [Cheng Lian] Adjusts project/MimaExclude.scala
c466de6 [Cheng Lian] Addresses comments
bc3f9b4 [Cheng Lian] Uses projection to separate partition columns and data columns while inserting rows
795920a [Cheng Lian] Fixes compilation error after rebasing
0b8cd70 [Cheng Lian] Adds Scala/Catalyst row conversion when writing non-partitioned tables
fa543f3 [Cheng Lian] Addresses comments
5849dd0 [Cheng Lian] Fixes doc typos.  Fixes partition discovery refresh.
51be443 [Cheng Lian] Replaces FSBasedRelation.outputCommitterClass with FSBasedRelation.prepareForWrite
c4ed4fe [Cheng Lian] Bug fixes and a new test suite
a29e663 [Cheng Lian] Bug fix: should only pass actuall data files to FSBaseRelation.buildScan
5f423d3 [Cheng Lian] Bug fixes. Lets data source to customize OutputCommitter rather than OutputFormat
54c3d7b [Cheng Lian] Enforces that FileOutputFormat must be used
be0c268 [Cheng Lian] Uses TaskAttempContext rather than Configuration in OutputWriter.init
0bc6ad1 [Cheng Lian] Resorts to new Hadoop API, and now FSBasedRelation can customize output format class
f320766 [Cheng Lian] Adds prepareForWrite() hook, refactored writer containers
422ff4a [Cheng Lian] Fixes style issue
ce52353 [Cheng Lian] Adds new SQLContext.load() overload with user defined dynamic partition columns
8d2ff71 [Cheng Lian] Merges partition columns when reading partitioned relations
ca1805b [Cheng Lian] Removes duplicated partition discovery code in new Parquet
f18dec2 [Cheng Lian] More strict schema checking
b746ab5 [Cheng Lian] More tests
9b487bf [Cheng Lian] Fixes compilation errors introduced while rebasing
ea6c8dd [Cheng Lian] Removes remote debugging stuff
327bb1d [Cheng Lian] Implements partitioning support for data sources API
3c5073a [Cheng Lian] Fixes SaveModes used in test cases
fb5a607 [Cheng Lian] Fixes compilation error
9d17607 [Cheng Lian] Adds the contract that OutputWriter should have zero-arg constructor
5de194a [Cheng Lian] Forgot Apache licence header
95d0b4d [Cheng Lian] Renames PartitionedSchemaRelationProvider to FSBasedRelationProvider
770b5ba [Cheng Lian] Adds tests for FSBasedRelation
3ba9bbf [Cheng Lian] Adds DataFrame.saveAsTable() overrides which support partitioning
1b8231f [Cheng Lian] Renames FSBasedPrunedFilteredScan to FSBasedRelation
aa8ba9a [Cheng Lian] Javadoc fix
012ed2d [Cheng Lian] Adds PartitioningOptions
7dd8dd5 [Cheng Lian] Adds new interfaces and stub methods for data sources API partitioning support
2015-05-13 01:32:28 +08:00
Reynold Xin 16696759e9 [SQL] Rename Dialect -> ParserDialect.
Author: Reynold Xin <rxin@databricks.com>

Closes #6071 from rxin/parserdialect and squashes the following commits:

ca2eb31 [Reynold Xin] Rename Dialect -> ParserDialect.
2015-05-11 22:06:56 -07:00
Cheng Hao e35d878be3 [SPARK-7411] [SQL] Support SerDe for HiveQl in CTAS
This is a follow up of #5876 and should be merged after #5876.

Let's wait for unit testing result from Jenkins.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #5963 from chenghao-intel/useIsolatedClient and squashes the following commits:

f87ace6 [Cheng Hao] remove the TODO and add `resolved condition` for HiveTable
a8260e8 [Cheng Hao] Update code as feedback
f4e243f [Cheng Hao] remove the serde setting for SequenceFile
d166afa [Cheng Hao] style issue
d25a4aa [Cheng Hao] Add SerDe support for CTAS
2015-05-11 19:21:16 -07:00
Jacky Li 6dad76e5eb [SPARK-4699] [SQL] Make caseSensitive configurable in spark sql analyzer
based on #3558

Author: Jacky Li <jacky.likun@huawei.com>
Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>

Closes #5806 from scwf/case and squashes the following commits:

cd51712 [wangfei] fix compile
d4b724f [wangfei] address michael's comment
af512c7 [wangfei] fix conflicts
4ef1be7 [wangfei] fix conflicts
269cf21 [scwf] fix conflicts
b73df6c [scwf] style issue
9e11752 [scwf] improve SimpleCatalystConf
b35529e [scwf] minor style
a3f7659 [scwf] remove unsed imports
2a56515 [scwf] fix conflicts
6db4bf5 [scwf] also fix for HiveContext
7fc4a98 [scwf] fix test case
d5a9933 [wangfei] fix style
eee75ba [wangfei] fix EmptyConf
6ef31cf [wangfei] revert pom changes
5d7c456 [wangfei] set CASE_SENSITIVE false in TestHive
966e719 [wangfei] set CASE_SENSITIVE false in hivecontext
fd30e25 [wangfei] added override
69b3b70 [wangfei] fix AnalysisSuite
5472b08 [wangfei] fix compile issue
56034ca [wangfei] fix conflicts and improve for catalystconf
664d1e9 [Jacky Li] Merge branch 'master' of https://github.com/apache/spark into case
12eca9a [Jacky Li] solve conflict with master
39e369c [Jacky Li] fix confilct after DataFrame PR
dee56e9 [Jacky Li] fix test case failure
05b09a3 [Jacky Li] fix conflict base on the latest master branch
73c16b1 [Jacky Li] fix bug in sql/hive
9bf4cc7 [Jacky Li] fix bug in catalyst
005c56d [Jacky Li] make SQLContext caseSensitivity configurable
6332e0f [Jacky Li] fix bug
fcbf0d9 [Jacky Li] fix scalastyle check
e7bca31 [Jacky Li] make caseSensitive configuration in Analyzer and Catalog
91b1b96 [Jacky Li] make caseSensitive configurable in Analyzer
f57f15c [Jacky Li] add testcase
578d167 [Jacky Li] make caseSensitive configurable
2015-05-08 15:25:54 -07:00
Michael Armbrust cd1d4110cf [SPARK-6908] [SQL] Use isolated Hive client
This PR switches Spark SQL's Hive support to use the isolated hive client interface introduced by #5851, instead of directly interacting with the client.  By using this isolated client we can now allow users to dynamically configure the version of Hive that they are connecting to by setting `spark.sql.hive.metastore.version` without the need recompile.  This also greatly reduces the surface area for our interaction with the hive libraries, hopefully making it easier to support other versions in the future.

Jars for the desired hive version can be configured using `spark.sql.hive.metastore.jars`, which accepts the following options:
 - a colon-separated list of jar files or directories for hive and hadoop.
 - `builtin` - attempt to discover the jars that were used to load Spark SQL and use those. This
            option is only valid when using the execution version of Hive.
 - `maven` - download the correct version of hive on demand from maven.

By default, `builtin` is used for Hive 13.

This PR also removes the test step for building against Hive 12, as this will no longer be required to talk to Hive 12 metastores.  However, the full removal of the Shim is deferred until a later PR.

Remaining TODOs:
 - Remove the Hive Shims and inline code for Hive 13.
 - Several HiveCompatibility tests are not yet passing.
  - `nullformatCTAS` - As detailed below, we now are handling CTAS parsing ourselves instead of hacking into the Hive semantic analyzer.  However, we currently only handle the common cases and not things like CTAS where the null format is specified.
  - `combine1` now leaks state about compression somehow, breaking all subsequent tests.  As such we currently add it to the blacklist
  - `part_inherit_tbl_props` and `part_inherit_tbl_props_with_star` do not work anymore.  We are correctly propagating the information
  - "load_dyn_part14.*" - These tests pass when run on their own, but fail when run with all other tests.  It seems our `RESET` mechanism may not be as robust as it used to be?

Other required changes:
 -  `CreateTableAsSelect` no longer carries parts of the HiveQL AST with it through the query execution pipeline.  Instead, we parse CTAS during the HiveQL conversion and construct a `HiveTable`.  The full parsing here is not yet complete as detailed above in the remaining TODOs.  Since the operator is Hive specific, it is moved to the hive package.
 - `Command` is simplified to be a trait that simply acts as a marker for a LogicalPlan that should be eagerly evaluated.

Author: Michael Armbrust <michael@databricks.com>

Closes #5876 from marmbrus/useIsolatedClient and squashes the following commits:

258d000 [Michael Armbrust] really really correct path handling
e56fd4a [Michael Armbrust] getAbsolutePath
5a259f5 [Michael Armbrust] fix typos
81bb366 [Michael Armbrust] comments from vanzin
5f3945e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
4b5cd41 [Michael Armbrust] yin's comments
f5de7de [Michael Armbrust] cleanup
11e9c72 [Michael Armbrust] better coverage in versions suite
7e8f010 [Michael Armbrust] better error messages and jar handling
e7b3941 [Michael Armbrust] more permisive checking for function registration
da91ba7 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
5fe5894 [Michael Armbrust] fix serialization suite
81711c4 [Michael Armbrust] Initial support for running without maven
1d8ae44 [Michael Armbrust] fix final tests?
1c50813 [Michael Armbrust] more comments
a3bee70 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
a6f5df1 [Michael Armbrust] style
ab07f7e [Michael Armbrust] WIP
4d8bf02 [Michael Armbrust] Remove hive 12 compilation
8843a25 [Michael Armbrust] [SPARK-6908] [SQL] Use isolated Hive client
2015-05-07 19:36:24 -07:00
Wenchen Fan 35f0173b8f [SPARK-2155] [SQL] [WHEN D THEN E] [ELSE F] add CaseKeyWhen for "CASE a WHEN b THEN c * END"
Avoid translating to CaseWhen and evaluate the key expression many times.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #5979 from cloud-fan/condition and squashes the following commits:

3ce54e1 [Wenchen Fan] add CaseKeyWhen
2015-05-07 16:26:49 -07:00
Cheng Hao 074d75d4c8 [SPARK-5213] [SQL] Remove the duplicated SparkSQLParser
This is a follow up of #5827 to remove the additional `SparkSQLParser`

Author: Cheng Hao <hao.cheng@intel.com>

Closes #5965 from chenghao-intel/remove_sparksqlparser and squashes the following commits:

509a233 [Cheng Hao] Remove the HiveQlQueryExecution
a5f9e3b [Cheng Hao] Remove the duplicated SparkSQLParser
2015-05-07 12:09:54 -07:00
Yin Huai 7740996700 [HOT-FIX] Move HiveWindowFunctionQuerySuite.scala to hive compatibility dir.
Author: Yin Huai <yhuai@databricks.com>

Closes #5951 from yhuai/fixBuildMaven and squashes the following commits:

fdde183 [Yin Huai] Move HiveWindowFunctionQuerySuite.scala to hive compatibility dir.
2015-05-06 14:48:25 -07:00
Yin Huai f2c47082c3 [SPARK-1442] [SQL] Window Function Support for Spark SQL
Adding more information about the implementation...

This PR is adding the support of window functions to Spark SQL (specifically OVER and WINDOW clause). For every expression having a OVER clause, we use a WindowExpression as the container of a WindowFunction and the corresponding WindowSpecDefinition (the definition of a window frame, i.e. partition specification, order specification, and frame specification appearing in a OVER clause).
# Implementation #
The high level work flow of the implementation is described as follows.

*	Query parsing: In the query parse process, all WindowExpressions are originally placed in the projectList of a Project operator or the aggregateExpressions of an Aggregate operator. It makes our changes to simple and keep all of parsing rules for window functions at a single place (nodesToWindowSpecification). For the WINDOWclause in a query, we use a WithWindowDefinition as the container as the mapping from the name of a window specification to a WindowSpecDefinition. This changes is similar with our common table expression support.

*	Analysis: The query analysis process has three steps for window functions.

 *	Resolve all WindowSpecReferences by replacing them with WindowSpecReferences according to the mapping table stored in the node of WithWindowDefinition.
 *	Resolve WindowFunctions in the projectList of a Project operator or the aggregateExpressions of an Aggregate operator. For this PR, we use Hive's functions for window functions because we will have a major refactoring of our internal UDAFs and it is better to switch our UDAFs after that refactoring work.
 *	Once we have resolved all WindowFunctions, we will use ResolveWindowFunction to extract WindowExpressions from projectList and aggregateExpressions and then create a Window operator for every distinct WindowSpecDefinition. With this choice, at the execution time, we can rely on the Exchange operator to do all of work on reorganizing the table and we do not need to worry about it in the physical Window operator. An example analyzed plan is shown as follows

```
sql("""
SELECT
  year, country, product, sales,
  avg(sales) over(partition by product) avg_product,
  sum(sales) over(partition by country) sum_country
FROM sales
ORDER BY year, country, product
""").explain(true)

== Analyzed Logical Plan ==
Sort [year#34 ASC,country#35 ASC,product#36 ASC], true
 Project [year#34,country#35,product#36,sales#37,avg_product#27,sum_country#28]
  Window [year#34,country#35,product#36,sales#37,avg_product#27], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum(sales#37) WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS sum_country#28], WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
   Window [year#34,country#35,product#36,sales#37], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage(sales#37) WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS avg_product#27], WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    Project [year#34,country#35,product#36,sales#37]
     MetastoreRelation default, sales, None
```

*	Query planning: In the process of query planning, we simple generate the physical Window operator based on the logical Window operator. Then, to prepare the executedPlan, the EnsureRequirements rule will add Exchange and Sort operators if necessary. The EnsureRequirements rule will analyze the data properties and try to not add unnecessary shuffle and sort. The physical plan for the above example query is shown below.

```
== Physical Plan ==
Sort [year#34 ASC,country#35 ASC,product#36 ASC], true
 Exchange (RangePartitioning [year#34 ASC,country#35 ASC,product#36 ASC], 200), []
  Window [year#34,country#35,product#36,sales#37,avg_product#27], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum(sales#37) WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS sum_country#28], WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
   Exchange (HashPartitioning [country#35], 200), [country#35 ASC]
    Window [year#34,country#35,product#36,sales#37], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage(sales#37) WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS avg_product#27], WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
     Exchange (HashPartitioning [product#36], 200), [product#36 ASC]
      HiveTableScan [year#34,country#35,product#36,sales#37], (MetastoreRelation default, sales, None), None
```

*	Execution time: At execution time, a physical Window operator buffers all rows in a partition specified in the partition spec of a OVER clause. If necessary, it also maintains a sliding window frame. The current implementation tries to buffer the input parameters of a window function according to the window frame to avoid evaluating a row multiple times.

# Future work #

Here are three improvements that are not hard to add:
*	Taking advantage of the window frame specification to reduce the number of rows buffered in the physical Window operator. For some cases, we only need to buffer the rows appearing in the sliding window. But for other cases, we will not be able to reduce the number of rows buffered (e.g. ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).

*	When aRAGEN frame is used, for <value> PRECEDING and <value> FOLLOWING, it will be great if the <value> part is an expression (we can start with Literal). So, when the data type of ORDER BY expression is a FractionalType, we can support FractionalType as the type <value> (<value> still needs to be evaluated as a positive value).

*	When aRAGEN frame is used, we need to support DateType and TimestampType as the data type of the expression appearing in the order specification. Then, the <value> part of <value> PRECEDING and <value> FOLLOWING can support interval types (once we support them).

This is a joint work with guowei2 and yhuai
Thanks hbutani hvanhovell for his comments
Thanks scwf for his comments and unit tests

Author: Yin Huai <yhuai@databricks.com>

Closes #5604 from guowei2/windowImplement and squashes the following commits:

76fe1c8 [Yin Huai] Implementation.
aa2b0ae [Yin Huai] Tests.
2015-05-06 10:43:00 -07:00
Reynold Xin 1fd31ba089 [SPARK-6231][SQL/DF] Automatically resolve join condition ambiguity for self-joins.
See the comment in join function for more information.

Author: Reynold Xin <rxin@databricks.com>

Closes #5919 from rxin/self-join-resolve and squashes the following commits:

e2fb0da [Reynold Xin] Updated SQLConf comment.
7233a86 [Reynold Xin] Updated comment.
6be2b4d [Reynold Xin] Removed println
9f6b72f [Reynold Xin] [SPARK-6231][SQL/DF] Automatically resolve ambiguity in join condition for self-joins.
2015-05-05 18:59:46 -07:00
Michael Armbrust daa70bf135 [SPARK-6907] [SQL] Isolated client for HiveMetastore
This PR adds initial support for loading multiple versions of Hive in a single JVM and provides a common interface for extracting metadata from the `HiveMetastoreClient` for a given version.  This is accomplished by creating an isolated `ClassLoader` that operates according to the following rules:

 - __Shared Classes__: Java, Scala, logging, and Spark classes are delegated to `baseClassLoader`
  allowing the results of calls to the `ClientInterface` to be visible externally.
 - __Hive Classes__: new instances are loaded from `execJars`.  These classes are not
  accessible externally due to their custom loading.
 - __Barrier Classes__: Classes such as `ClientWrapper` are defined in Spark but must link to a specific version of Hive.  As a result, the bytecode is acquired from the Spark `ClassLoader` but a new copy is created for each instance of `IsolatedClientLoader`.
  This new instance is able to see a specific version of hive without using reflection where ever hive is consistent across versions. Since
  this is a unique instance, it is not visible externally other than as a generic
  `ClientInterface`, unless `isolationOn` is set to `false`.

In addition to the unit tests, I have also tested this locally against mysql instances of the Hive Metastore.  I've also successfully ported Spark SQL to run with this client, but due to the size of the changes, that will come in a follow-up PR.

By default, Hive jars are currently downloaded from Maven automatically for a given version to ease packaging and testing.  However, there is also support for specifying their location manually for deployments without internet.

Author: Michael Armbrust <michael@databricks.com>

Closes #5851 from marmbrus/isolatedClient and squashes the following commits:

c72f6ac [Michael Armbrust] rxins comments
1e271fa [Michael Armbrust] [SPARK-6907][SQL] Isolated client for HiveMetastore
2015-05-03 13:12:50 -07:00
Cheng Hao 5d6b90d939 [SPARK-5213] [SQL] Pluggable SQL Parser Support
based on #4015, we should not delete `sqlParser` from sqlcontext, that leads to mima failed. Users implement dialect to give a fallback for `sqlParser`  and we should construct `sqlParser` in sqlcontext according to the dialect
`protected[sql] val sqlParser = new SparkSQLParser(getSQLDialect().parse(_))`

Author: Cheng Hao <hao.cheng@intel.com>
Author: scwf <wangfei1@huawei.com>

Closes #5827 from scwf/sqlparser1 and squashes the following commits:

81b9737 [scwf] comment fix
0878bd1 [scwf] remove comments
c19780b [scwf] fix mima tests
c2895cf [scwf] Merge branch 'master' of https://github.com/apache/spark into sqlparser1
493775c [Cheng Hao] update the code as feedback
81a731f [Cheng Hao] remove the unecessary comment
aab0b0b [Cheng Hao] polish the code a little bit
49b9d81 [Cheng Hao] shrink the comment for rebasing
2015-05-02 15:20:07 -07:00
Marcelo Vanzin 82c8c37c09 [MINOR] [HIVE] Fix QueryPartitionSuite.
At least in the version of Hive I tested on, the test was deleting
a temp directory generated by Hive instead of one containing partition
data. So fix the filter to only consider partition directories when
deciding what to delete.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5854 from vanzin/hive-test-fix and squashes the following commits:

7594ae9 [Marcelo Vanzin] Fix typo.
729fa80 [Marcelo Vanzin] [minor] [hive] Fix QueryPartitionSuite.
2015-05-02 23:10:35 +01:00
Patrick Wendell beeafcfd6e Revert "[SPARK-5213] [SQL] Pluggable SQL Parser Support"
This reverts commit 3ba5aaab82.
2015-04-30 20:33:36 -07:00
Cheng Hao 3ba5aaab82 [SPARK-5213] [SQL] Pluggable SQL Parser Support
This PR aims to make the SQL Parser Pluggable, and user can register it's own parser via Spark SQL CLI.

```
# add the jar into the classpath
$hchengmydesktop:spark>bin/spark-sql --jars sql99.jar

-- switch to "hiveql" dialect
   spark-sql>SET spark.sql.dialect=hiveql;
   spark-sql>SELECT * FROM src LIMIT 1;

-- switch to "sql" dialect
   spark-sql>SET spark.sql.dialect=sql;
   spark-sql>SELECT * FROM src LIMIT 1;

-- switch to a custom dialect
   spark-sql>SET spark.sql.dialect=com.xxx.xxx.SQL99Dialect;
   spark-sql>SELECT * FROM src LIMIT 1;

-- register the non-exist SQL dialect
   spark-sql> SET spark.sql.dialect=NotExistedClass;
   spark-sql> SELECT * FROM src LIMIT 1;
-- Exception will be thrown and switch to default sql dialect ("sql" for SQLContext and "hiveql" for HiveContext)
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #4015 from chenghao-intel/sqlparser and squashes the following commits:

493775c [Cheng Hao] update the code as feedback
81a731f [Cheng Hao] remove the unecessary comment
aab0b0b [Cheng Hao] polish the code a little bit
49b9d81 [Cheng Hao] shrink the comment for rebasing
2015-04-30 18:49:06 -07:00
Sean Owen ab5adb7a97 [SPARK-7145] [CORE] commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency
Remove use of commons-lang in favor of commons-lang3 classes; remove commons-io use in favor of Guava

Author: Sean Owen <sowen@cloudera.com>

Closes #5703 from srowen/SPARK-7145 and squashes the following commits:

21fbe03 [Sean Owen] Remove use of commons-lang in favor of commons-lang3 classes; remove commons-io use in favor of Guava
2015-04-27 19:50:55 -04:00
Cheng Hao cc48e6387a [SPARK-7044] [SQL] Fix the deadlock in script transformation
Author: Cheng Hao <hao.cheng@intel.com>

Closes #5625 from chenghao-intel/transform and squashes the following commits:

5ec1dd2 [Cheng Hao] fix the deadlock issue in ScriptTransform
2015-04-23 10:35:22 -07:00
Cheng Hao 7662ec23bb [SPARK-5817] [SQL] Fix bug of udtf with column names
It's a bug while do query like:
```sql
select d from (select explode(array(1,1)) d from src limit 1) t
```
And it will throws exception like:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'd' given input columns _c0; line 1 pos 7
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:116)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
```

To solve the bug, it requires code refactoring for UDTF
The major changes are about:
* Simplifying the UDTF development, UDTF will manage the output attribute names any more, instead, the `logical.Generate` will handle that properly.
* UDTF will be asked for the output schema (data types) during the logical plan analyzing.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #4602 from chenghao-intel/explode_bug and squashes the following commits:

c2a5132 [Cheng Hao] add back resolved for Alias
556e982 [Cheng Hao] revert the unncessary change
002c361 [Cheng Hao] change the rule of resolved for Generate
04ae500 [Cheng Hao] add qualifier only for generator output
5ee5d2c [Cheng Hao] prepend the new qualifier
d2e8b43 [Cheng Hao] Update the code as feedback
ca5e7f4 [Cheng Hao] shrink the commits
2015-04-21 15:11:15 -07:00
Yin Huai 6265cba00f [SPARK-6969][SQL] Refresh the cached table when REFRESH TABLE is used
https://issues.apache.org/jira/browse/SPARK-6969

Author: Yin Huai <yhuai@databricks.com>

Closes #5583 from yhuai/refreshTableRefreshDataCache and squashes the following commits:

1e5142b [Yin Huai] Add todo.
92b2498 [Yin Huai] Minor updates.
367df92 [Yin Huai] Recache data in the command of REFRESH TABLE.
2015-04-21 14:48:42 -07:00
Daoyuan Wang b45059d0d7 [SPARK-5794] [SQL] fix add jar
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4586 from adrian-wang/addjar and squashes the following commits:

efdd602 [Daoyuan Wang] move jar to another place
6c707e8 [Daoyuan Wang] restrict hive version for test
32c4fb8 [Daoyuan Wang] fix style and add a test
9957d87 [Daoyuan Wang] use sessionstate classloader in makeRDDforTable
0810e71 [Daoyuan Wang] remove variable substitution
1898309 [Daoyuan Wang] fix classnotfound
95a40da [Daoyuan Wang] support env argus in add jar, and set add jar ret to 0
2015-04-13 18:26:00 -07:00
Cheng Hao c5602bdc31 [SPARK-5941] [SQL] Unit Test loads the table src twice for leftsemijoin.q
In `leftsemijoin.q`, there is a data loading command for table `sales` already, but in `TestHive`, it also created the table `sales`, which causes duplicated records inserted into the `sales`.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #4506 from chenghao-intel/df_table and squashes the following commits:

0be05f7 [Cheng Hao] Remove the table `sales` creating from TestHive
2015-04-13 16:02:18 -07:00
Daoyuan Wang 85ee0cabe8 [SPARK-6130] [SQL] support if not exists for insert overwrite into partition in hiveQl
Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
 
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
 
Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4865 from adrian-wang/insertoverwrite and squashes the following commits:

2fce94f [Daoyuan Wang] add assert
10ea6f3 [Daoyuan Wang] add name for boolean parameter
0bbe9b9 [Daoyuan Wang] fix failure
4391154 [Daoyuan Wang] support if not exists for insert overwrite into partition in hiveQl
2015-04-13 14:29:07 -07:00
Reynold Xin c5b0b296b8 [SPARK-6765] Enable scalastyle on test code.
Turn scalastyle on for all test code. Most of the violations have been resolved in my previous pull requests:

Core: https://github.com/apache/spark/pull/5484
SQL: https://github.com/apache/spark/pull/5412
MLlib: https://github.com/apache/spark/pull/5411
GraphX: https://github.com/apache/spark/pull/5410
Streaming: https://github.com/apache/spark/pull/5409

Author: Reynold Xin <rxin@databricks.com>

Closes #5486 from rxin/test-style-enable and squashes the following commits:

01683de [Reynold Xin] Fixed new code.
a4ab46e [Reynold Xin] Fixed tests.
20adbc8 [Reynold Xin] Missed one violation.
5e36521 [Reynold Xin] [SPARK-6765] Enable scalastyle on test code.
2015-04-13 09:29:04 -07:00
lazymam500 1f39a61118 [Spark-5068][SQL]Fix bug query data when path doesn't exist for HiveContext
This PR follow up PR #3907 & #3891 & #4356.
According to  marmbrus  liancheng 's comments, I try to use fs.globStatus to retrieve all FileStatus objects under path(s), and then do the filtering locally.

[1]. get pathPattern by path, and put it into pathPatternSet. (hdfs://cluster/user/demo/2016/08/12 -> hdfs://cluster/user/demo/*/*/*)
[2]. retrieve all FileStatus objects ,and cache them by undating existPathSet.
[3]. do the filtering locally
[4]. if we have new pathPattern,do 1,2 step again. (external table maybe have more than one partition pathPattern)

chenghao-intel jeanlyn

Author: lazymam500 <lazyman500@gmail.com>
Author: lazyman <lazyman500@gmail.com>

Closes #5059 from lazyman500/SPARK-5068 and squashes the following commits:

5bfcbfd [lazyman] move spark.sql.hive.verifyPartitionPath to SQLConf,fix scala style
e1d6386 [lazymam500] fix scala style
f23133f [lazymam500] bug fix
47e0023 [lazymam500] fix scala style,add config flag,break the chaining
04c443c [lazyman] SPARK-5068: fix bug when partition path doesn't exists #2
41f60ce [lazymam500] Merge pull request #1 from apache/master
2015-04-11 18:33:14 -07:00
haiyang 2f53588738 [SPARK-6199] [SQL] Support CTE in HiveContext and SQLContext
Author: haiyang <huhaiyang@huawei.com>

Closes #4929 from haiyangsea/cte and squashes the following commits:

220b67d [haiyang] add golden files for cte test
d3c7681 [haiyang] Merge branch 'master' into cte-repair
0ba2070 [haiyang] modify code style
9ce6b58 [haiyang] fix conflict
ff74741 [haiyang] add comment for With plan
0d56af4 [haiyang] code indention
776a440 [haiyang] add comments for resolve relation strategy
2fccd7e [haiyang] add comments for resolve relation strategy
241bbe2 [haiyang] fix cte problem of view
e9e1237 [haiyang] fix test case problem
614182f [haiyang] add test cases for CTE feature
32e415b [haiyang] add comment
1cc8c15 [haiyang] support with
03f1097 [haiyang] support with
e960099 [haiyang] support with
9aaa874 [haiyang] support with
0566978 [haiyang] support with
a99ecd2 [haiyang] support with
c3fa4c2 [haiyang] support with
3b6077f [haiyang] support with
5f8abe3 [haiyang] support with
4572b05 [haiyang] support with
f801f54 [haiyang] support with
2015-04-11 18:30:17 -07:00
Cheng Hao 3ceb810aa8 [SPARK-6835] [SQL] Fix bug of Hive UDTF in Lateral View (ClassNotFound)
```SQL
select key, v from src lateral view stack(3, 1+1, 2+2, 3) d as v;
```
Will cause exception
```
java.lang.ClassNotFoundException: stack
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at org.apache.spark.sql.hive.HiveFunctionWrapper.createFunction(Shim13.scala:148)
at org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:274)
at org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:274)
at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector$lzycompute(hiveUdfs.scala:280)
at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector(hiveUdfs.scala:280)
at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes$lzycompute(hiveUdfs.scala:285)
at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes(hiveUdfs.scala:285)
at org.apache.spark.sql.hive.HiveGenericUdtf.makeOutput(hiveUdfs.scala:291)
at org.apache.spark.sql.catalyst.expressions.Generator.output(generators.scala:60)
at org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$2.apply(basicOperators.scala:60)
at org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$2.apply(basicOperators.scala:60)
at scala.Option.map(Option.scala:145)
at org.apache.spark.sql.catalyst.plans.logical.Generate.generatorOutput(basicOperators.scala:60)
at org.apache.spark.sql.catalyst.plans.logical.Generate.output(basicOperators.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:117)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:117)
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #5444 from chenghao-intel/hive_udtf and squashes the following commits:

065a98c [Cheng Hao] fix bug of Hive UDTF in Lateral View (ClassNotFound)
2015-04-11 22:11:03 +08:00
Michael Armbrust 23d5f8864f [SPARK-6851][SQL] Create new instance for each converted parquet relation
Otherwise we end up rewriting predicates to be trivially equal (i.e. `a#1 = a#2` -> `a#3 = a#3`), at which point the query is no longer valid.

Author: Michael Armbrust <michael@databricks.com>

Closes #5458 from marmbrus/selfJoinParquet and squashes the following commits:

22df77c [Michael Armbrust] [SPARK-6851][SQL] Create new instance for each converted parquet relation
2015-04-10 16:05:14 -07:00
Reynold Xin 1b2aab8d5b [SPARK-6765] Fix test code style for SQL
So we can turn style checker on for test code.

Author: Reynold Xin <rxin@databricks.com>

Closes #5412 from rxin/test-style-sql and squashes the following commits:

9098a31 [Reynold Xin] One more compilation error ...
8c7250a [Reynold Xin] Fix compilation.
82d0944 [Reynold Xin] Indentation.
0b03fbb [Reynold Xin] code review.
f2f4348 [Reynold Xin] oops.
ef4ec48 [Reynold Xin] Hive module.
7e0db5e [Reynold Xin] sql module
04ec7ac [Reynold Xin] catalyst module
2015-04-08 20:35:29 -07:00
Liang-Chi Hsieh 7bca62f790 [SPARK-6607][SQL] Check invalid characters for Parquet schema and show error messages
'(' and ')' are special characters used in Parquet schema for type annotation. When we run an aggregation query, we will obtain attribute name such as "MAX(a)".

If we directly store the generated DataFrame as Parquet file, it causes failure when reading and parsing the stored schema string.

Several methods can be adopted to solve this. This pr uses a simplest one to just replace attribute names before generating Parquet schema based on these attributes.

Another possible method might be modifying all aggregation expression names from "func(column)" to "func[column]".

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5263 from viirya/parquet_aggregation_name and squashes the following commits:

2d70542 [Liang-Chi Hsieh] Address comment.
463dff4 [Liang-Chi Hsieh] Instead of replacing special chars, showing error message to user to suggest using Alias.
1de001d [Liang-Chi Hsieh] Replace special characters '(' and ')' of Parquet schema.
2015-04-05 00:20:43 +08:00
guowei2 c23ba81b8c [SPARK-5203][SQL] fix union with different decimal type
When union non-decimal types with decimals, we use the following rules:
      - FIRST `intTypeToFixed`, then fixed union decimals with precision/scale p1/s2 and p2/s2  will be promoted to
      DecimalType(max(p1, p2), max(s1, s2))
      - FLOAT and DOUBLE cause fixed-length decimals to turn into DOUBLE (this is the same as Hive,
      but note that unlimited decimals are considered bigger than doubles in WidenTypes)

Author: guowei2 <guowei2@asiainfo.com>

Closes #4004 from guowei2/SPARK-5203 and squashes the following commits:

ff50f5f [guowei2] fix code style
11df1bf [guowei2] fix decimal union with double, double->Decimal(15,15)
0f345f9 [guowei2] fix structType merge with decimal
101ed4d [guowei2] fix build error after rebase
0b196e4 [guowei2] code style
fe2c2ca [guowei2] handle union decimal precision in 'DecimalPrecision'
421d840 [guowei2] fix union types for decimal precision
ef2c661 [guowei2] fix union with different decimal type
2015-04-04 02:02:30 +08:00
Yin Huai c42c3fc7f7 [SPARK-6575][SQL] Converted Parquet Metastore tables no longer cache metadata
https://issues.apache.org/jira/browse/SPARK-6575

Author: Yin Huai <yhuai@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Cheng Lian <lian@databricks.com>

Closes #5339 from yhuai/parquetRelationCache and squashes the following commits:

b0e1a42 [Yin Huai] Address comments.
83d9846 [Yin Huai] Remove unnecessary change.
c0dc7a4 [Yin Huai] Cache converted parquet relations.
2015-04-03 14:40:36 +08:00
Yin Huai 4b82bd730a [SPARK-6575][SQL] Converted Parquet Metastore tables no longer cache metadata
https://issues.apache.org/jira/browse/SPARK-6575

Author: Yin Huai <yhuai@databricks.com>

Closes #5339 from yhuai/parquetRelationCache and squashes the following commits:

83d9846 [Yin Huai] Remove unnecessary change.
c0dc7a4 [Yin Huai] Cache converted parquet relations.
2015-04-02 20:23:08 -07:00
Cheng Hao dfd2982bc7 [SQL][Minor] Use analyzed logical instead of unresolved in HiveComparisonTest
Some internal unit test failed due to the logical plan node in pattern matching in `HiveComparisonTest`,  e.g.
https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala#L137

Which will may call the `output` function on an unresolved logical plan.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #4946 from chenghao-intel/logical and squashes the following commits:

432ecb3 [Cheng Hao] Use analyzed instead of logical in HiveComparisonTest
2015-04-02 17:20:31 -07:00
Yin Huai 5db89127e7 [SPARK-6618][SPARK-6669][SQL] Lock Hive metastore client correctly.
Author: Yin Huai <yhuai@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #5333 from yhuai/lookupRelationLock and squashes the following commits:

59c884f [Michael Armbrust] [SQL] Lock metastore client in analyzeTable
7667030 [Yin Huai] Merge pull request #2 from marmbrus/pr/5333
e4a9b0b [Michael Armbrust] Correctly lock on MetastoreCatalog
d6fc32f [Yin Huai] Missing `)`.
1e241af [Yin Huai] Protect InsertIntoHive.
fee7e9c [Yin Huai] A test?
5416b0f [Yin Huai] Just protect client.
2015-04-02 16:46:50 -07:00
Yin Huai 251698fb73 [SPARK-6655][SQL] We need to read the schema of a data source table stored in spark.sql.sources.schema property
https://issues.apache.org/jira/browse/SPARK-6655

Author: Yin Huai <yhuai@databricks.com>

Closes #5313 from yhuai/SPARK-6655 and squashes the following commits:

1e00c03 [Yin Huai] Unnecessary change.
f131bd9 [Yin Huai] Fix.
f1218c1 [Yin Huai] Failed test.
2015-04-02 16:02:31 -07:00
Michael Armbrust 4214e50fc3 [SQL] Throw UnsupportedOperationException instead of NotImplementedError
NotImplementedError in scala 2.10 is a fatal exception, which is not very nice to throw when not actually fatal.

Author: Michael Armbrust <michael@databricks.com>

Closes #5315 from marmbrus/throwUnsupported and squashes the following commits:

c29e03b [Michael Armbrust] [SQL] Throw UnsupportedOperationException instead of NotImplementedError
052e05b [Michael Armbrust] [SQL] Throw UnsupportedOperationException instead of NotImplementedError
2015-04-02 16:01:03 -07:00
Davies Liu 40df5d49bb [SPARK-6663] [SQL] use Literal.create instread of constructor
In order to do inbound checking and type conversion, we should use Literal.create() instead of  constructor.

Author: Davies Liu <davies@databricks.com>

Closes #5320 from davies/literal and squashes the following commits:

1667604 [Davies Liu] fix style and add comment
5f8c0fd [Davies Liu] use Literal.create instread of constructor
2015-04-01 23:11:38 -07:00
Cheng Lian 2bc7fe7f7e Revert "[SPARK-6618][SQL] HiveMetastoreCatalog.lookupRelation should use fine-grained lock"
This reverts commit 314afd0e2f.
2015-04-02 12:56:34 +08:00
Steve Loughran ee11be2582 SPARK-6433 hive tests to import spark-sql test JAR for QueryTest access
1. Test JARs are built & published
1. log4j.resources is explicitly excluded. Without this, downstream test run logging depends on the order the JARs are listed/loaded
1. sql/hive pulls in spark-sql &...spark-catalyst for its test runs
1. The copied in test classes were rm'd, and a test edited to remove its now duplicate assert method
1. Spark streaming is now build with the same plugin/phase as the rest, but its shade plugin declaration is kept in (so different from the rest of the test plugins). Due to (#2), this means the test JAR no longer includes its log4j file.

Outstanding issues:
* should the JARs be shaded? `spark-streaming-test.jar` does, but given these are test jars for developers only, especially in the same spark source tree, it's hard to justify.
* `maven-jar-plugin` v 2.6 was explicitly selected; without this the apache-1.4 parent template JAR version (2.4) chosen.
* Are there any other resources to exclude?

Author: Steve Loughran <stevel@hortonworks.com>

Closes #5119 from steveloughran/stevel/patches/SPARK-6433-test-jars and squashes the following commits:

81ceb01 [Steve Loughran] SPARK-6433 add a clearer comment explaining what the plugin is doing & why
a6dca33 [Steve Loughran] SPARK-6433 : pull configuration section form archive plugin
c2b5f89 [Steve Loughran] SPARK-6433 omit "jar" goal from jar plugin
fdac51b [Steve Loughran] SPARK-6433 -002; indentation & delegate plugin version to parent
650f442 [Steve Loughran] SPARK-6433 patch 001: test JARs are built; sql/hive pulls in spark-sql & spark-catalyst for its test runs
2015-04-01 16:26:54 +01:00
Michael Armbrust beebb7ffc2 [SPARK-5371][SQL] Propagate types after function conversion, before futher resolution
Before it was possible for a query to flip back and forth from a resolved state, allowing resolution to propagate up before coercion had stabilized.  The issue was that `ResolvedReferences` would run after `FunctionArgumentConversion`, but before `PropagateTypes` had run.  This PR ensures we correctly `PropagateTypes` after any coercion has applied.

Author: Michael Armbrust <michael@databricks.com>

Closes #5278 from marmbrus/unionNull and squashes the following commits:

dc3581a [Michael Armbrust] [SPARK-5371][SQL] Propogate types after function conversion / before futher resolution
2015-03-31 11:34:52 -07:00
Cheng Lian a7992ffaf1 [SPARK-6555] [SQL] Overrides equals() and hashCode() for MetastoreRelation
Also removes temporary workarounds made in #5183 and #5251.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5289)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #5289 from liancheng/spark-6555 and squashes the following commits:

d0095ac [Cheng Lian] Removes unused imports
cfafeeb [Cheng Lian] Removes outdated comment
75a2746 [Cheng Lian] Overrides equals() and hashCode() for MetastoreRelation
2015-03-31 11:18:25 -07:00
Yin Huai 314afd0e2f [SPARK-6618][SQL] HiveMetastoreCatalog.lookupRelation should use fine-grained lock
JIRA: https://issues.apache.org/jira/browse/SPARK-6618

Author: Yin Huai <yhuai@databricks.com>

Closes #5281 from yhuai/lookupRelationLock and squashes the following commits:

591b4be [Yin Huai] A test?
b3a9625 [Yin Huai] Just protect client.
2015-03-31 16:28:40 +08:00
Michael Armbrust fe81f6c779 [SPARK-6595][SQL] MetastoreRelation should be a MultiInstanceRelation
Now that we have `DataFrame`s it is possible to have multiple copies in a single query plan.  As such, it needs to inherit from `MultiInstanceRelation` or self joins will break.  I also add better debugging errors when our self join handling fails in case there are future bugs.

Author: Michael Armbrust <michael@databricks.com>

Closes #5251 from marmbrus/multiMetaStore and squashes the following commits:

4272f6d [Michael Armbrust] [SPARK-6595][SQL] MetastoreRelation should be MuliInstanceRelation
2015-03-30 22:24:12 +08:00
DoingDone9 855cba8fe5 [SPARK-6546][Build] Using the wrong code that will make spark compile failed!!
wrong code : val tmpDir = Files.createTempDir()
not Files should Utils

Author: DoingDone9 <799203320@qq.com>

Closes #5198 from DoingDone9/FilesBug and squashes the following commits:

6e0140d [DoingDone9] Update InsertIntoHiveTableSuite.scala
e57d23f [DoingDone9] Update InsertIntoHiveTableSuite.scala
802261c [DoingDone9] Merge pull request #7 from apache/master
d00303b [DoingDone9] Merge pull request #6 from apache/master
98b134f [DoingDone9] Merge pull request #5 from apache/master
161cae3 [DoingDone9] Merge pull request #4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master
2015-03-26 17:04:19 +08:00
KaiXinXiaoLei e87bf3713e The UT test of spark is failed. Because there is a test in SQLQuerySuite about creating table “test”
If the tests in "sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala" are  running before CachedTableSuite.scala, the test("Drop cached table") will failed. Because the table test is created in SQLQuerySuite.scala  ,and this table not droped. So when running "drop cached table", table test already exists.

There is error info:
01:18:35.738 ERROR hive.ql.exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: AlreadyExistsException(message:Table test already exists)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:616)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4189)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:281)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)test”

And the test about "create table test" in "sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala,is:

  test("SPARK-4825 save join to table") {
    val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString)).toDF()
    sql("CREATE TABLE test1 (key INT, value STRING)")
    testData.insertInto("test1")
    sql("CREATE TABLE test2 (key INT, value STRING)")
    testData.insertInto("test2")
    testData.insertInto("test2")
    sql("CREATE TABLE test AS SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key =   b.key")
    checkAnswer(
      table("test"),
      sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq)
  }

Author: KaiXinXiaoLei <huleilei1@huawei.com>

Closes #5150 from KaiXinXiaoLei/testFailed and squashes the following commits:

7534b02 [KaiXinXiaoLei] The UT test of spark is failed.
2015-03-25 19:15:30 -07:00
jeanlyn e6d1406abd [SPARK-5498][SQL]fix query exception when partition schema does not match table schema
In hive,the schema of partition may be difference from  the table schema.When we use spark-sql to query the data of partition which schema is difference from the table schema,we will get the exceptions as the description of the [jira](https://issues.apache.org/jira/browse/SPARK-5498) .For example:
* We take a look of the schema for the partition and the table

```sql
DESCRIBE partition_test PARTITION (dt='1');
id                  	int              	None
name                	string              	None
dt                  	string              	None

# Partition Information
# col_name            	data_type           	comment

dt                  	string              	None
```
```
DESCRIBE partition_test;
OK
id                  	bigint              	None
name                	string              	None
dt                  	string              	None

# Partition Information
# col_name            	data_type           	comment

dt                  	string              	None
```
*  run the sql
```sql
SELECT * FROM partition_test where dt='1';
```
we will get the cast exception `java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt`

Author: jeanlyn <jeanlyn92@gmail.com>

Closes #4289 from jeanlyn/schema and squashes the following commits:

9c8da74 [jeanlyn] fix style
b41d6b9 [jeanlyn] fix compile errors
07d84b6 [jeanlyn] Merge branch 'master' into schema
535b0b6 [jeanlyn] reduce conflicts
d6c93c5 [jeanlyn] fix bug
1e8b30c [jeanlyn] fix code style
0549759 [jeanlyn] fix code style
c879aa1 [jeanlyn] clean the code
2a91a87 [jeanlyn] add more test case and clean the code
12d800d [jeanlyn] fix code style
63d170a [jeanlyn] fix compile problem
7470901 [jeanlyn] reduce conflicts
afc7da5 [jeanlyn] make getConvertedOI compatible between 0.12.0 and 0.13.1
b1527d5 [jeanlyn] fix type mismatch
10744ca [jeanlyn] Insert a space after the start of the comment
3b27af3 [jeanlyn] SPARK-5498:fix bug when query the data when partition schema does not match table schema
2015-03-25 17:47:45 -07:00
Cheng Lian 8c3b0052f4 [SPARK-6450] [SQL] Fixes metastore Parquet table conversion
The `ParquetConversions` analysis rule generates a hash map, which maps from the original `MetastoreRelation` instances to the newly created `ParquetRelation2` instances. However, `MetastoreRelation.equals` doesn't compare output attributes. Thus, if a single metastore Parquet table appears multiple times in a query, only a single entry ends up in the hash map, and the conversion is not correctly performed.

Proper fix for this issue should be overriding `equals` and `hashCode` for MetastoreRelation. Unfortunately, this breaks more tests than expected. It's possible that these tests are ill-formed from the very beginning. As 1.3.1 release is approaching, we'd like to make the change more surgical to avoid potential regressions. The proposed fix here is to make both the metastore relations and their output attributes as keys in the hash map used in ParquetConversions.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5183)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #5183 from liancheng/spark-6450 and squashes the following commits:

3536780 [Cheng Lian] Fixes metastore Parquet table conversion
2015-03-25 17:40:19 -07:00
DoingDone9 968408b345 [SPARK-6409][SQL] It is not necessary that avoid old inteface of hive, because this will make some UDAF can not work.
spark avoid old inteface of hive, then some udaf can not work like "org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage"

Author: DoingDone9 <799203320@qq.com>

Closes #5131 from DoingDone9/udaf and squashes the following commits:

9de08d0 [DoingDone9] Update HiveUdfSuite.scala
49c62dc [DoingDone9] Update hiveUdfs.scala
98b134f [DoingDone9] Merge pull request #5 from apache/master
161cae3 [DoingDone9] Merge pull request #4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master
2015-03-25 11:11:52 -07:00
Michael Armbrust cbeaf9ebab [SPARK-6376][SQL] Avoid eliminating subqueries until optimization
Previously it was okay to throw away subqueries after analysis, as we would never try to use that tree for resolution again.  However, with eager analysis in `DataFrame`s this can cause errors for queries such as:

```scala
val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
df.as('x).join(df.as('y), $"x.str" === $"y.str").groupBy("x.str").count()
```

As a result, in this PR we defer the elimination of subqueries until the optimization phase.

Author: Michael Armbrust <michael@databricks.com>

Closes #5160 from marmbrus/subqueriesInDfs and squashes the following commits:

a9bb262 [Michael Armbrust] Update Optimizer.scala
27d25bf [Michael Armbrust] fix hive tests
9137e03 [Michael Armbrust] add type
81cd597 [Michael Armbrust] Avoid eliminating subqueries until optimization
2015-03-24 14:08:20 -07:00
Michael Armbrust 046c1e2aa4 [SPARK-6375][SQL] Fix formatting of error messages.
Author: Michael Armbrust <michael@databricks.com>

Closes #5155 from marmbrus/errorMessages and squashes the following commits:

b898188 [Michael Armbrust] Fix formatting of error messages.
2015-03-24 13:22:46 -07:00
Venkata Ramana Gollamudi ee569a0c71 [SPARK-5680][SQL] Sum function on all null values, should return zero
SELECT sum('a'), avg('a'), variance('a'), std('a') FROM src;
Should give output as
0.0	NULL	NULL	NULL
This fixes hive udaf_number_format.q

Author: Venkata Ramana G <ramana.gollamudihuawei.com>

Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>

Closes #4466 from gvramana/sum_fix and squashes the following commits:

42e14d1 [Venkata Ramana Gollamudi] Added comments
39415c0 [Venkata Ramana Gollamudi] Handled the partitioned Sum expression scenario
df66515 [Venkata Ramana Gollamudi] code style fix
4be2606 [Venkata Ramana Gollamudi] Add udaf_number_format to whitelist and golden answer
330fd64 [Venkata Ramana Gollamudi] fix sum function for all null data
2015-03-21 13:24:24 -07:00
Sean Owen 6f80c3e888 SPARK-6338 [CORE] Use standard temp dir mechanisms in tests to avoid orphaned temp files
Use `Utils.createTempDir()` to replace other temp file mechanisms used in some tests, to further ensure they are cleaned up, and simplify

Author: Sean Owen <sowen@cloudera.com>

Closes #5029 from srowen/SPARK-6338 and squashes the following commits:

27b740a [Sean Owen] Fix hive-thriftserver tests that don't expect an existing dir
4a212fa [Sean Owen] Standardize a bit more temp dir management
9004081 [Sean Owen] Revert some added recursive-delete calls
57609e4 [Sean Owen] Use Utils.createTempDir() to replace other temp file mechanisms used in some tests, to further ensure they are cleaned up, and simplify
2015-03-20 14:16:21 +00:00
Michael Armbrust 3579003115 [SPARK-6247][SQL] Fix resolution of ambiguous joins caused by new aliases
We need to handle ambiguous `exprId`s that are produced by new aliases as well as those caused by leaf nodes (`MultiInstanceRelation`).

Attempting to fix this revealed a bug in `equals` for `Alias` as these objects were comparing equal even when the expression ids did not match. Additionally, `LocalRelation` did not correctly provide statistics, and some tests in `catalyst` and `hive` were not using the helper functions for comparing plans.

Based on #4991 by chenghao-intel

Author: Michael Armbrust <michael@databricks.com>

Closes #5062 from marmbrus/selfJoins and squashes the following commits:

8e9b84b [Michael Armbrust] check qualifier too
8038a36 [Michael Armbrust] handle aggs too
0b9c687 [Michael Armbrust] fix more tests
c3c574b [Michael Armbrust] revert change.
725f1ab [Michael Armbrust] add statistics
a925d08 [Michael Armbrust] check for conflicting attributes in join resolution
b022ef7 [Michael Armbrust] Handle project aliases.
d8caa40 [Michael Armbrust] test case: SPARK-6247
f9c67c2 [Michael Armbrust] Check for duplicate attributes in join resolution.
898af73 [Michael Armbrust] Fix Alias equality.
2015-03-17 19:47:51 -07:00
watermen a6ee2f7940 [SPARK-5651][SQL] Add input64 in blacklist and add test suit for create table within backticks
Now spark version is only support
```create table table_in_database_creation.test1 as select * from src limit 1;``` in HiveContext.

This patch is used to support
```create table `table_in_database_creation.test2` as select * from src limit 1;``` in HiveContext.

Author: watermen <qiyadong2010@gmail.com>
Author: q00251598 <qiyadong@huawei.com>

Closes #4427 from watermen/SPARK-5651 and squashes the following commits:

c5c8ed1 [watermen] add the generated golden files
1f0e42e [q00251598] add input64 in blacklist and add test suit
2015-03-17 19:35:18 -07:00
Liang-Chi Hsieh 5c80643d13 [SPARK-5908][SQL] Resolve UdtfsAlias when only single Alias is used
`ResolveUdtfsAlias` in `hiveUdfs` only considers the `HiveGenericUdtf` with multiple alias. When only single alias is used with `HiveGenericUdtf`, the alias is not working.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4692 from viirya/udft_alias and squashes the following commits:

8a3bae4 [Liang-Chi Hsieh] No need to test selected column from DataFrame since DataFrame API is updated.
160a379 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into udft_alias
e6531cc [Liang-Chi Hsieh] Selected column from DataFrame should not re-analyze logical plan.
a45cc2a [Liang-Chi Hsieh] Resolve UdtfsAlias when only single Alias is used.
2015-03-17 18:58:52 -07:00
Daoyuan Wang 9667b9f9c3 [SPARK-5712] [SQL] fix comment with semicolon at end
---- comment;

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4500 from adrian-wang/semicolon and squashes the following commits:

70b8abb [Daoyuan Wang] use mkstring instead of reduce
2d49738 [Daoyuan Wang] remove outdated golden file
317346e [Daoyuan Wang] only skip comment with semicolon at end of line, to avoid golden file outdated
d3ae01e [Daoyuan Wang] fix error
a11602d [Daoyuan Wang] fix comment with semicolon at end
2015-03-17 12:29:15 +08:00
Sean Owen 6e94c4eadf SPARK-6225 [CORE] [SQL] [STREAMING] Resolve most build warnings, 1.3.0 edition
Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.

Author: Sean Owen <sowen@cloudera.com>

Closes #4950 from srowen/SPARK-6225 and squashes the following commits:

3080972 [Sean Owen] Ordered imports: Java, Scala, 3rd party, Spark
c67985b [Sean Owen] Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
2015-03-11 13:15:19 +00:00
Reynold Xin 54d19689ff [SPARK-5310][SQL] Fixes to Docs and Datasources API
- Various Fixes to docs
 - Make data source traits actually interfaces

Based on #4862 but with fixed conflicts.

Author: Reynold Xin <rxin@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #4868 from marmbrus/pr/4862 and squashes the following commits:

fe091ea [Michael Armbrust] Merge remote-tracking branch 'origin/master' into pr/4862
0208497 [Reynold Xin] Test fixes.
34e0a28 [Reynold Xin] [SPARK-5310][SQL] Various fixes to Spark SQL docs.
2015-03-02 22:14:08 -08:00
Yin Huai 12599942e6 [SPARK-5950][SQL]Insert array into a metastore table saved as parquet should work when using datasource api
This PR contains the following changes:
1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is the middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it does `equalsIgnoreNullability` as well as if the nullability of `from` is compatible with that of `to`. For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However,  the nullability of `ArrayType(IntegerType, containsNull = true)` is incompatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values).
2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` to replace the equality check of the data types.
3. For our data source write path, when appending data, we always use the schema of existing table to write the data. This is important for parquet, since nullability direct impacts the way to encode/decode values. If we do not do this, we may see corrupted values when reading values from a set of parquet files generated with different nullability settings.
4. When generating a new parquet table, we always set nullable/containsNull/valueContainsNull to true. So, we will not face situations that we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust.
5. Update the equality check of JSON relation. Since JSON does not really cares nullability,  `equalsIgnoreNullability` seems a better choice to compare schemata from to JSON tables.

JIRA: https://issues.apache.org/jira/browse/SPARK-5950

Thanks viirya for the initial work in #4729.

cc marmbrus liancheng

Author: Yin Huai <yhuai@databricks.com>

Closes #4826 from yhuai/insertNullabilityCheck and squashes the following commits:

3b61a04 [Yin Huai] Revert change on equals.
80e487e [Yin Huai] asNullable in UDT.
587d88b [Yin Huai] Make methods private.
0cb7ea2 [Yin Huai] marmbrus's comments.
3cec464 [Yin Huai] Cheng's comments.
486ed08 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
d3747d1 [Yin Huai] Remove unnecessary change.
8360817 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
8a3f237 [Yin Huai] Use equalsIgnoreNullability instead of equality check.
0eb5578 [Yin Huai] Fix tests.
f6ed813 [Yin Huai] Update old parquet path.
e4f397c [Yin Huai] Unit tests.
b2c06f8 [Yin Huai] Ignore nullability in JSON relation's equality check.
8bd008b [Yin Huai] nullable, containsNull, and valueContainsNull will be always true for parquet data.
bf50d73 [Yin Huai] When appending data, we use the schema of the existing table instead of the schema of the new data.
0a703e7 [Yin Huai] Test failed again since we cannot read correct content.
9a26611 [Yin Huai] Make InsertIntoTable happy.
8f19fe5 [Yin Huai] equalsIgnoreCompatibleNullability
4ec17fd [Yin Huai] Failed test.
2015-03-02 19:31:55 -08:00
Michael Armbrust 8223ce6a81 [SPARK-6114][SQL] Avoid metastore conversions before plan is resolved
Author: Michael Armbrust <michael@databricks.com>

Closes #4855 from marmbrus/explodeBug and squashes the following commits:

a712249 [Michael Armbrust] [SPARK-6114][SQL] Avoid metastore conversions before plan is resolved
2015-03-02 16:10:54 -08:00
q00251598 582e5a24c5 [SPARK-6040][SQL] Fix the percent bug in tablesample
HiveQL expression like `select count(1) from src tablesample(1 percent);` means take 1% sample to select. But it means 100% in the current version of the Spark.

Author: q00251598 <qiyadong@huawei.com>

Closes #4789 from watermen/SPARK-6040 and squashes the following commits:

2453ebe [q00251598] check and adjust the fraction.
2015-03-02 13:16:29 -08:00
q00251598 9ce12aaf28 [SPARK-5741][SQL] Support the path contains comma in HiveContext
When run ```select * from nzhang_part where hr = 'file,';```, it throws exception ```java.lang.IllegalArgumentException: Can not create a Path from an empty string```
. Because the path of hdfs contains comma, and FileInputFormat.setInputPaths will split path by comma.

### SQL
```
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

create table nzhang_part like srcpart;

insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select key, value, hr from srcpart where ds='2008-04-08';

insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select key, value from srcpart where ds='2008-04-08';

insert overwrite table nzhang_part partition (ds='2010-08-15', hr)
select * from (
select key, value, hr from srcpart where ds='2008-04-08'
union all
select '1' as key, '1' as value, 'file,' as hr from src limit 1) s;

select * from nzhang_part where hr = 'file,';
```

### Error Log
```
15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part where hr = 'file,']
java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
at org.apache.hadoop.fs.Path.<init>(Path.java:135)
at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400)
at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)

Author: q00251598 <qiyadong@huawei.com>

Closes #4532 from watermen/SPARK-5741 and squashes the following commits:

9758ab1 [q00251598] fix bug
1db1a1c [q00251598] use setInputPaths(Job job, Path... inputPaths)
b788a72 [q00251598] change FileInputFormat.setInputPaths to jobConf.set and add test suite
2015-03-02 10:13:11 -08:00
Yin Huai 39a54b40af [SPARK-6073][SQL] Need to refresh metastore cache after append data in CreateMetastoreDataSourceAsSelect
JIRA: https://issues.apache.org/jira/browse/SPARK-6073

liancheng

Author: Yin Huai <yhuai@databricks.com>

Closes #4824 from yhuai/refreshCache and squashes the following commits:

b9542ef [Yin Huai] Refresh metadata cache in the Catalog in CreateMetastoreDataSourceAsSelect.
2015-03-02 22:42:18 +08:00
Cheng Lian e6003f0a57 [SPARK-5775] [SQL] BugFix: GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
This PR adapts anselmevignon's #4697 to master and branch-1.3. Please refer to PR description of #4697 for details.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4792)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>
Author: Cheng Lian <liancheng@users.noreply.github.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #4792 from liancheng/spark-5775 and squashes the following commits:

538f506 [Cheng Lian] Addresses comments
cee55cf [Cheng Lian] Merge pull request #4 from yhuai/spark-5775-yin
b0b74fb [Yin Huai] Remove runtime pattern matching.
ca6e038 [Cheng Lian] Fixes SPARK-5775
2015-02-28 21:15:43 +08:00
Yin Huai 5e5ad6558d [SPARK-6024][SQL] When a data source table has too many columns, it's schema cannot be stored in metastore.
JIRA: https://issues.apache.org/jira/browse/SPARK-6024

Author: Yin Huai <yhuai@databricks.com>

Closes #4795 from yhuai/wideSchema and squashes the following commits:

4882e6f [Yin Huai] Address comments.
73e71b4 [Yin Huai] Address comments.
143927a [Yin Huai] Simplify code.
cc1d472 [Yin Huai] Make the schema wider.
12bacae [Yin Huai] If the JSON string of a schema is too large, split it before storing it in metastore.
e9b4f70 [Yin Huai] Failed test.
2015-02-26 20:46:05 -08:00
Yin Huai 192e42a293 [SPARK-6016][SQL] Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true
Please see JIRA (https://issues.apache.org/jira/browse/SPARK-6016) for details of the bug.

Author: Yin Huai <yhuai@databricks.com>

Closes #4775 from yhuai/parquetFooterCache and squashes the following commits:

78787b1 [Yin Huai] Remove footerCache in FilteringParquetRowInputFormat.
dff6fba [Yin Huai] Failed unit test.
2015-02-27 01:01:32 +08:00
Yin Huai f02394d064 [SPARK-6023][SQL] ParquetConversions fails to replace the destination MetastoreRelation of an InsertIntoTable node to ParquetRelation2
JIRA: https://issues.apache.org/jira/browse/SPARK-6023

Author: Yin Huai <yhuai@databricks.com>

Closes #4782 from yhuai/parquetInsertInto and squashes the following commits:

ae7e806 [Yin Huai] Convert MetastoreRelation in InsertIntoTable and InsertIntoHiveTable.
ba543cd [Yin Huai] More tests.
50b6d0f [Yin Huai] Update error messages.
346780c [Yin Huai] Failed test.
2015-02-26 22:39:49 +08:00
Reynold Xin f0e3b71077 [SPARK-5840][SQL] HiveContext cannot be serialized due to tuple extraction
Also added test cases for checking the serializability of HiveContext and SQLContext.

Author: Reynold Xin <rxin@databricks.com>

Closes #4628 from rxin/SPARK-5840 and squashes the following commits:

ecb3bcd [Reynold Xin] test cases and reviews.
55eb822 [Reynold Xin] [SPARK-5840][SQL] HiveContext cannot be serialized due to tuple extraction.
2015-02-18 14:02:32 -08:00
Cheng Lian 61ab08549c [Minor] [SQL] Cleans up DataFrame variable names and toDF() calls
Although we've migrated to the DataFrame API, lots of code still uses `rdd` or `srdd` as local variable names. This PR tries to address these naming inconsistencies and some other minor DataFrame related style issues.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4670)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #4670 from liancheng/df-cleanup and squashes the following commits:

3e14448 [Cheng Lian] Cleans up DataFrame variable names and toDF() calls
2015-02-17 23:36:20 -08:00
Yin Huai e50934f11e [SPARK-5723][SQL]Change the default file format to Parquet for CTAS statements.
JIRA: https://issues.apache.org/jira/browse/SPARK-5723

Author: Yin Huai <yhuai@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>

Closes #4639 from yhuai/defaultCTASFileFormat and squashes the following commits:

a568137 [Yin Huai] Merge remote-tracking branch 'upstream/master' into defaultCTASFileFormat
ad2b07d [Yin Huai] Update tests and error messages.
8af5b2a [Yin Huai] Update conf key and unit test.
5a67903 [Yin Huai] Use data source write path for Hive's CTAS statements when no storage format/handler is specified.
2015-02-17 18:14:33 -08:00
Yin Huai d5f12bfe8f [SPARK-5875][SQL]logical.Project should not be resolved if it contains aggregates or generators
https://issues.apache.org/jira/browse/SPARK-5875 has a case to reproduce the bug and explain the root cause.

Author: Yin Huai <yhuai@databricks.com>

Closes #4663 from yhuai/projectResolved and squashes the following commits:

472f7b6 [Yin Huai] If a logical.Project has any AggregateExpression or Generator, it's resolved field should be false.
2015-02-17 17:50:39 -08:00
Yin Huai 117121a4ec [SPARK-5852][SQL]Fail to convert a newly created empty metastore parquet table to a data source parquet table.
The problem is that after we create an empty hive metastore parquet table (e.g. `CREATE TABLE test (a int) STORED AS PARQUET`), Hive will create an empty dir for us, which cause our data source `ParquetRelation2` fail to get the schema of the table. See JIRA for the case to reproduce the bug and the exception.

This PR is based on #4562 from chenghao-intel.

JIRA: https://issues.apache.org/jira/browse/SPARK-5852

Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Hao <hao.cheng@intel.com>

Closes #4655 from yhuai/CTASParquet and squashes the following commits:

b8b3450 [Yin Huai] Update tests.
2ac94f7 [Yin Huai] Update tests.
3db3d20 [Yin Huai] Minor update.
d7e2308 [Yin Huai] Revert changes in HiveMetastoreCatalog.scala.
36978d1 [Cheng Hao] Update the code as feedback
a04930b [Cheng Hao] fix bug of scan an empty parquet based table
442ffe0 [Cheng Hao] passdown the schema for Parquet File in HiveContext
2015-02-17 15:47:59 -08:00
Cheng Hao 9d281fa560 [SQL] [Minor] Update the HiveContext Unittest
In unit test, the table src(key INT, value STRING) is not the same as HIVE src(key STRING, value STRING)
https://github.com/apache/hive/blob/branch-0.13/data/scripts/q_test_init.sql

And in the reflect.q, test failed for expression `reflect("java.lang.Integer", "valueOf", key, 16)`, which expect the argument `key` as STRING not INT.

This PR doesn't aim to change the `src` schema, we can do that after 1.3 released, however, we probably need to re-generate all the golden files.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #4584 from chenghao-intel/reflect and squashes the following commits:

e5bdc3a [Cheng Hao] Move the test case reflect into blacklist
184abfd [Cheng Hao] revert the change to table src1
d9bcf92 [Cheng Hao] Update the HiveContext Unittest
2015-02-17 12:25:35 -08:00
Liang-Chi Hsieh ac506b7c28 [Minor][SQL] Use same function to check path parameter in JSONRelation
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4649 from viirya/use_checkpath and squashes the following commits:

0f9a1a1 [Liang-Chi Hsieh] Use same function to check path parameter.
2015-02-17 12:24:13 -08:00
Yin Huai e189cbb052 [SPARK-4865][SQL]Include temporary tables in SHOW TABLES
This PR adds a `ShowTablesCommand` to support `SHOW TABLES [IN databaseName]` SQL command. The result of `SHOW TABLE` has two columns, `tableName` and `isTemporary`. For temporary tables, the value of `isTemporary` column will be `false`.

JIRA: https://issues.apache.org/jira/browse/SPARK-4865

Author: Yin Huai <yhuai@databricks.com>

Closes #4618 from yhuai/showTablesCommand and squashes the following commits:

0c09791 [Yin Huai] Use ShowTablesCommand.
85ee76d [Yin Huai] Since SHOW TABLES is not a Hive native command any more and we will not see "OK" (originally generated by Hive's driver), use SHOW DATABASES in the test.
94bacac [Yin Huai] Add SHOW TABLES to the list of noExplainCommands.
d71ed09 [Yin Huai] Fix test.
a4a6ec3 [Yin Huai] Add SHOW TABLE command.
2015-02-16 15:59:23 -08:00
Yin Huai f3ff1eb298 [SPARK-5839][SQL]HiveMetastoreCatalog does not recognize table names and aliases of data source tables.
JIRA: https://issues.apache.org/jira/browse/SPARK-5839

Author: Yin Huai <yhuai@databricks.com>

Closes #4626 from yhuai/SPARK-5839 and squashes the following commits:

f779d85 [Yin Huai] Use subqeury to wrap replaced ParquetRelation.
2695f13 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-5839
f1ba6ca [Yin Huai] Address comment.
2c7fa08 [Yin Huai] Use Subqueries to wrap a data source table.
2015-02-16 15:54:01 -08:00
Cheng Lian c51ab37fad [SPARK-5833] [SQL] Adds REFRESH TABLE command
Lifts `HiveMetastoreCatalog.refreshTable` to `Catalog`. Adds `RefreshTable` command to refresh (possibly cached) metadata in external data sources tables.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4624)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #4624 from liancheng/refresh-table and squashes the following commits:

8d1aa4c [Cheng Lian] Adds REFRESH TABLE command
2015-02-16 12:52:05 -08:00
Michael Armbrust 104b2c4580 [SQL] Initial support for reporting location of error in sql string
Author: Michael Armbrust <michael@databricks.com>

Closes #4587 from marmbrus/position and squashes the following commits:

0810052 [Michael Armbrust] fix tests
395c019 [Michael Armbrust] Merge remote-tracking branch 'marmbrus/position' into position
e155dce [Michael Armbrust] more errors
f3efa51 [Michael Armbrust] Update AnalysisException.scala
d45ff60 [Michael Armbrust] [SQL] Initial support for reporting location of error in sql string
2015-02-16 12:32:56 -08:00
Daoyuan Wang 275a0c0813 [SPARK-5824] [SQL] add null format in ctas and set default col comment to null
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4609 from adrian-wang/ctas and squashes the following commits:

0a75d5a [Daoyuan Wang] reorder import
93d1863 [Daoyuan Wang] add null format in ctas and set default col comment to null
2015-02-16 12:31:36 -08:00
Cheng Lian 3ce58cf9c0 [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet data source with the newly introduced write support for data source API
This PR migrates the Parquet data source to the new data source write support API.  Now users can also overwriting and appending to existing tables. Notice that inserting into partitioned tables is not supported yet.

When Parquet data source is enabled, insertion to Hive Metastore Parquet tables is also fullfilled by the Parquet data source. This is done by the newly introduced `HiveMetastoreCatalog.ParquetConversions` rule, which is a "proper" implementation of the original hacky `HiveStrategies.ParquetConversion`. The latter is still preserved, and can be removed together with the old Parquet support in the future.

TODO:

- [x] Update outdated comments in `newParquet.scala`.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4563)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #4563 from liancheng/parquet-refining and squashes the following commits:

fa98d27 [Cheng Lian] Fixes test cases which should disable off Parquet data source
2476e82 [Cheng Lian] Fixes compilation error introduced during rebasing
a83d290 [Cheng Lian] Passes Hive Metastore partitioning information to ParquetRelation2
2015-02-16 01:38:31 -08:00
Reynold Xin e98dfe627c [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
- The old implicit would convert RDDs directly to DataFrames, and that added too many methods.
- toDataFrame -> toDF
- Dsl -> functions
- implicits moved into SQLContext.implicits
- addColumn -> withColumn
- renameColumn -> withColumnRenamed

Python changes:
- toDataFrame -> toDF
- Dsl -> functions package
- addColumn -> withColumn
- renameColumn -> withColumnRenamed
- add toDF functions to RDD on SQLContext init
- add flatMap to DataFrame

Author: Reynold Xin <rxin@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #4556 from rxin/SPARK-5752 and squashes the following commits:

5ef9910 [Reynold Xin] More fix
61d3fca [Reynold Xin] Merge branch 'df5' of github.com:davies/spark into SPARK-5752
ff5832c [Reynold Xin] Fix python
749c675 [Reynold Xin] count(*) fixes.
5806df0 [Reynold Xin] Fix build break again.
d941f3d [Reynold Xin] Fixed explode compilation break.
fe1267a [Davies Liu] flatMap
c4afb8e [Reynold Xin] style
d9de47f [Davies Liu] add comment
b783994 [Davies Liu] add comment for toDF
e2154e5 [Davies Liu] schema() -> schema
3a1004f [Davies Liu] Dsl -> functions, toDF()
fb256af [Reynold Xin] - toDataFrame -> toDF - Dsl -> functions - implicits moved into SQLContext.implicits - addColumn -> withColumn - renameColumn -> withColumnRenamed
0dd74eb [Reynold Xin] [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
97dd47c [Davies Liu] fix mistake
6168f74 [Davies Liu] fix test
1fc0199 [Davies Liu] fix test
a075cd5 [Davies Liu] clean up, toPandas
663d314 [Davies Liu] add test for agg('*')
9e214d5 [Reynold Xin] count(*) fixes.
1ed7136 [Reynold Xin] Fix build break again.
921b2e3 [Reynold Xin] Fixed explode compilation break.
14698d4 [Davies Liu] flatMap
ba3e12d [Reynold Xin] style
d08c92d [Davies Liu] add comment
5c8b524 [Davies Liu] add comment for toDF
a4e5e66 [Davies Liu] schema() -> schema
d377fc9 [Davies Liu] Dsl -> functions, toDF()
6b3086c [Reynold Xin] - toDataFrame -> toDF - Dsl -> functions - implicits moved into SQLContext.implicits - addColumn -> withColumn - renameColumn -> withColumnRenamed
807e8b1 [Reynold Xin] [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
2015-02-13 23:03:22 -08:00
Yin Huai 1d0596a16e [SPARK-3299][SQL]Public API in SQLContext to list tables
https://issues.apache.org/jira/browse/SPARK-3299

Author: Yin Huai <yhuai@databricks.com>

Closes #4547 from yhuai/tables and squashes the following commits:

6c8f92e [Yin Huai] Add tableNames.
acbb281 [Yin Huai] Update Python test.
7793dcb [Yin Huai] Fix scala test.
572870d [Yin Huai] Address comments.
aba2e88 [Yin Huai] Format.
12c86df [Yin Huai] Add tables() to SQLContext to return a DataFrame containing existing tables.
2015-02-12 18:08:01 -08:00
Yin Huai c025a46882 [SQL] Move SaveMode to SQL package.
Author: Yin Huai <yhuai@databricks.com>

Closes #4542 from yhuai/moveSaveMode and squashes the following commits:

65a4425 [Yin Huai] Move SaveMode to sql package.
2015-02-12 15:32:17 -08:00
tianyi 44b2311d94 [SPARK-3688][SQL]LogicalPlan can't resolve column correctlly
This PR fixed the resolving problem described in https://issues.apache.org/jira/browse/SPARK-3688
```
CREATE TABLE t1(x INT);
CREATE TABLE t2(a STRUCT<x: INT>, k INT);
SELECT a.x FROM t1 a JOIN t2 b ON a.x = b.k;
```

Author: tianyi <tianyi.asiainfo@gmail.com>

Closes #4524 from tianyi/SPARK-3688 and squashes the following commits:

237a256 [tianyi] resolve a name with table.column pattern first.
2015-02-11 12:50:17 -08:00
Michael Armbrust a60d2b70ad [SPARK-5454] More robust handling of self joins
Also I fix a bunch of bad output in test cases.

Author: Michael Armbrust <michael@databricks.com>

Closes #4520 from marmbrus/selfJoin and squashes the following commits:

4f4a85c [Michael Armbrust] comments
49c8e26 [Michael Armbrust] fix tests
6fc38de [Michael Armbrust] fix style
55d64b3 [Michael Armbrust] fix dataframe selfjoins
2015-02-11 12:31:56 -08:00
Patrick Wendell 7e2f8821e0 HOTFIX: Java 6 compilation error in Spark SQL 2015-02-10 22:43:32 -08:00
Davies Liu ea60284095 [SPARK-5704] [SQL] [PySpark] createDataFrame from RDD with columns
Deprecate inferSchema() and applySchema(), use createDataFrame() instead, which could take an optional `schema` to create an DataFrame from an RDD. The `schema` could be StructType or list of names of columns.

Author: Davies Liu <davies@databricks.com>

Closes #4498 from davies/create and squashes the following commits:

08469c1 [Davies Liu] remove Scala/Java API for now
c80a7a9 [Davies Liu] fix hive test
d1bd8f2 [Davies Liu] cleanup applySchema
9526e97 [Davies Liu] createDataFrame from RDD with columns
2015-02-10 19:40:12 -08:00
Michael Armbrust 6195e2473b [SQL] Add an exception for analysis errors.
Also start from the bottom so we show the first error instead of the top error.

Author: Michael Armbrust <michael@databricks.com>

Closes #4439 from marmbrus/analysisException and squashes the following commits:

45862a0 [Michael Armbrust] fix hive test
a773bba [Michael Armbrust] Merge remote-tracking branch 'origin/master' into analysisException
f88079f [Michael Armbrust] update more cases
fede90a [Michael Armbrust] newline
fbf4bc3 [Michael Armbrust] move to sql
6235db4 [Michael Armbrust] [SQL] Add an exception for analysis errors.
2015-02-10 17:32:42 -08:00
Yin Huai aaf50d05c7 [SPARK-5658][SQL] Finalize DDL and write support APIs
https://issues.apache.org/jira/browse/SPARK-5658

Author: Yin Huai <yhuai@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>

Closes #4446 from yhuai/writeSupportFollowup and squashes the following commits:

f3a96f7 [Yin Huai] davies's comments.
225ff71 [Yin Huai] Use Scala TestHiveContext to initialize the Python HiveContext in Python tests.
2306f93 [Yin Huai] Style.
2091fcd [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
537e28f [Yin Huai] Correctly clean up temp data.
ae4649e [Yin Huai] Fix Python test.
609129c [Yin Huai] Doc format.
92b6659 [Yin Huai] Python doc and other minor updates.
cbc717f [Yin Huai] Rename dataSourceName to source.
d1c12d3 [Yin Huai] No need to delete the duplicate rule since it has been removed in master.
22cfa70 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
d91ecb8 [Yin Huai] Fix test.
4c76d78 [Yin Huai] Simplify APIs.
3abc215 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
0832ce4 [Yin Huai] Fix test.
98e7cdb [Yin Huai] Python style.
2bf44ef [Yin Huai] Python APIs.
c204967 [Yin Huai] Format
a10223d [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
9ff97d8 [Yin Huai] Add SaveMode to saveAsTable.
9b6e570 [Yin Huai] Update doc.
c2be775 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
99950a2 [Yin Huai] Use Java enum for SaveMode.
4679665 [Yin Huai] Remove duplicate rule.
77d89dc [Yin Huai] Update doc.
e04d908 [Yin Huai] Move import and add (Scala-specific) to scala APIs.
cf5703d [Yin Huai] Add checkAnswer to Java tests.
7db95ff [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
6dfd386 [Yin Huai] Add java test.
f2f33ef [Yin Huai] Fix test.
e702386 [Yin Huai] Apache header.
b1e9b1b [Yin Huai] Format.
ed4e1b4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
af9e9b3 [Yin Huai] DDL and write support API followup.
2a6213a [Yin Huai] Update API names.
e6a0b77 [Yin Huai] Update test.
43bae01 [Yin Huai] Remove createTable from HiveContext.
5ffc372 [Yin Huai] Add more load APIs to SQLContext.
5390743 [Yin Huai] Add more save APIs to DataFrame.
2015-02-10 17:29:52 -08:00
Yin Huai e28b6bdbb5 [SQL] Make Options in the data source API CREATE TABLE statements optional.
Users will not need to put `Options()` in a CREATE TABLE statement when there is not option provided.

Author: Yin Huai <yhuai@databricks.com>

Closes #4515 from yhuai/makeOptionsOptional and squashes the following commits:

1a898d3 [Yin Huai] Make options optional.
2015-02-10 17:06:12 -08:00
wangfei 59272dad77 [SPARK-5592][SQL] java.net.URISyntaxException when insert data to a partitioned table
flowing sql get URISyntaxException:
```
create table sc as select *
from (select '2011-01-11', '2011-01-11+14:18:26' from src tablesample (1 rows)
union all
select '2011-01-11', '2011-01-11+15:18:26' from src tablesample (1 rows)
union all
select '2011-01-11', '2011-01-11+16:18:26' from src tablesample (1 rows) ) s;
create table sc_part (key string) partitioned by (ts string) stored as rcfile;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table sc_part partition(ts) select * from sc;
```
java.net.URISyntaxException: Relative path in absolute URI: ts=2011-01-11+15:18:26
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.hadoop.fs.Path.<init>(Path.java:94)
at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.org$apache$spark$sql$hive$SparkHiveDynamicPartitionWriterContainer$$newWriter$1(hiveWriterContainers.scala:230)
at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243)
at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243)
at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.getLocalFileWriter(hiveWriterContainers.scala:243)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:113)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:105)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:105)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: ts=2011-01-11+15:18:26
at java.net.URI.checkPath(URI.java:1804)
at java.net.URI.<init>(URI.java:752)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)

Author: wangfei <wangfei1@huawei.com>
Author: Fei Wang <wangfei1@huawei.com>

Closes #4368 from scwf/SPARK-5592 and squashes the following commits:

aa55ef4 [Fei Wang] comments addressed
f8f8bb1 [wangfei] added test case
f24624f [wangfei] Merge branch 'master' of https://github.com/apache/spark into SPARK-5592
9998177 [wangfei] added test case
ea81daf [wangfei] fix URISyntaxException
2015-02-10 11:54:30 -08:00
Daoyuan Wang c7ad80ae42 [SPARK-5716] [SQL] Support TOK_CHARSETLITERAL in HiveQl
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4502 from adrian-wang/utf8 and squashes the following commits:

4d7b0ee [Daoyuan Wang] remove useless import
606f981 [Daoyuan Wang] support TOK_CHARSETLITERAL in HiveQl
2015-02-10 11:08:21 -08:00
Yin Huai 5f0b30e59c [SQL] Code cleanup.
I added an unnecessary line of code in 13531dd97c.

My bad. Let's delete it.

Author: Yin Huai <yhuai@databricks.com>

Closes #4482 from yhuai/unnecessaryCode and squashes the following commits:

3645af0 [Yin Huai] Code cleanup.
2015-02-09 16:20:42 -08:00
Cheng Lian c4021401e3 [SQL] [Minor] HiveParquetSuite was disabled by mistake, re-enable them
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4440)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #4440 from liancheng/parquet-oops and squashes the following commits:

f21ede4 [Cheng Lian] HiveParquetSuite was disabled by mistake, re-enable them.
2015-02-06 15:23:42 -08:00
Wenchen Fan 4793c8402a [SPARK-5278][SQL] Introduce UnresolvedGetField and complete the check of ambiguous reference to fields
When the `GetField` chain(`a.b.c.d.....`) is interrupted by `GetItem` like `a.b[0].c.d....`, then the check of ambiguous reference to fields is broken.
The reason is that: for something like `a.b[0].c.d`, we first parse it to `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")`. Then in `LogicalPlan#resolve`, we resolve `"a.b"` and build a `GetField` chain from bottom(the relation). But for the 2 outer `GetFiled`, we have to resolve them in `Analyzer` or do it in `GetField` lazily, check data type of child, search needed field, etc. which is similar to what we have done in `LogicalPlan#resolve`.
So in this PR, the fix is just copy the same logic in `LogicalPlan#resolve` to `Analyzer`, which is simple and quick, but I do suggest introduce `UnresolvedGetFiled` like I explained in https://github.com/apache/spark/pull/2405.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #4068 from cloud-fan/simple and squashes the following commits:

a6857b5 [Wenchen Fan] fix import order
8411c40 [Wenchen Fan] use UnresolvedGetField
2015-02-06 13:08:09 -08:00
OopsOutOfMemory 0b7eb3f3b7 [SPARK-5324][SQL] Results of describe can't be queried
Make below code works.
```
sql("DESCRIBE test").registerTempTable("describeTest")
sql("SELECT * FROM describeTest").collect()
```

Author: OopsOutOfMemory <victorshengli@126.com>
Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com>

Closes #4249 from OopsOutOfMemory/desc_query and squashes the following commits:

6fee13d [OopsOutOfMemory] up-to-date
e71430a [Sheng, Li] Update HiveOperatorQueryableSuite.scala
3ba1058 [OopsOutOfMemory] change to default argument
aac7226 [OopsOutOfMemory] Merge branch 'master' into desc_query
68eb6dd [OopsOutOfMemory] Merge branch 'desc_query' of github.com:OopsOutOfMemory/spark into desc_query
354ad71 [OopsOutOfMemory] query describe command
d541a35 [OopsOutOfMemory] refine test suite
e1da481 [OopsOutOfMemory] refine test suite
a780539 [OopsOutOfMemory] Merge branch 'desc_query' of github.com:OopsOutOfMemory/spark into desc_query
0015f82 [OopsOutOfMemory] code style
dd0aaef [OopsOutOfMemory] code style
c7d606d [OopsOutOfMemory] rename test suite
75f2342 [OopsOutOfMemory] refine code and test suite
f942c9b [OopsOutOfMemory] initial
11559ae [OopsOutOfMemory] code style
c5fdecf [OopsOutOfMemory] code style
aeaea5f [OopsOutOfMemory] rename test suite
ac2c3bb [OopsOutOfMemory] refine code and test suite
544573e [OopsOutOfMemory] initial
2015-02-06 12:33:20 -08:00
Liang-Chi Hsieh d433816157 [SPARK-5650][SQL] Support optional 'FROM' clause
In Hive, 'FROM' clause is optional. This pr supports it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4426 from viirya/optional_from and squashes the following commits:

fe81f31 [Liang-Chi Hsieh] Support optional 'FROM' clause.
2015-02-06 12:13:44 -08:00
Cheng Lian 7c0a648fb5 [HOTFIX] [SQL] Disables Metastore Parquet table conversion for "SQLQuerySuite.CTAS with serde"
Ideally we should convert Metastore Parquet tables with our own Parquet implementation on both read path and write path. However, the write path is not well covered, and causes this test failure. This PR is a hotfix to bring back Jenkins PR builder. A proper fix will be delivered in a follow-up PR.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4413)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #4413 from liancheng/hotfix-parquet-ctas and squashes the following commits:

5291289 [Cheng Lian] Hot fix for "SQLQuerySuite.CTAS with serde"
2015-02-05 18:09:18 -08:00
Cheng Lian a9ed51178c [SPARK-5182] [SPARK-5528] [SPARK-5509] [SPARK-3575] [SQL] Parquet data source improvements
This PR adds three major improvements to Parquet data source:

1.  Partition discovery

    While reading Parquet files resides in Hive style partition directories, `ParquetRelation2` automatically discovers partitioning information and infers partition column types.

    This is also a partial work for [SPARK-5182] [1], which aims to provide first class partitioning support for the data source API.  Related code in this PR can be easily extracted to the data source API level in future versions.

1.  Schema merging

    When enabled, Parquet data source collects schema information from all Parquet part-files and tries to merge them.  Exceptions are thrown when incompatible schemas are detected.  This feature is controlled by data source option `parquet.mergeSchema`, and is enabled by default.

1.  Metastore Parquet table conversion moved to analysis phase

    This greatly simplifies the conversion logic.  `ParquetConversion` strategy can be removed once the old Parquet implementation is removed in the future.

This version of Parquet data source aims to entirely replace the old Parquet implementation.  However, the old version hasn't been removed yet.  Users can fall back to the old version by turning off SQL configuration `spark.sql.parquet.useDataSourceApi`.

Other JIRA tickets fixed as side effects in this PR:

- [SPARK-5509] [3]: `EqualTo` now uses a proper `Ordering` to compare binary types.

- [SPARK-3575] [4]: Metastore schema is now preserved and passed to `ParquetRelation2` via data source option `parquet.metastoreSchema`.

TODO:

- [ ] More test cases for partition discovery
- [x] Fix write path after data source write support (#4294) is merged

      It turned out to be non-trivial to fall back to old Parquet implementation on the write path when Parquet data source is enabled.  Since we're planning to include data source write support in 1.3.0, I simply ignored two test cases involving Parquet insertion for now.

- [ ] Fix outdated comments and documentations

PS: This PR looks big, but more than a half of the changed lines in this PR are trivial changes to test cases. To test Parquet with and without the new data source, almost all Parquet test cases are moved into wrapper driver functions. This introduces hundreds of lines of changes.

[1]: https://issues.apache.org/jira/browse/SPARK-5182
[2]: https://issues.apache.org/jira/browse/SPARK-5528
[3]: https://issues.apache.org/jira/browse/SPARK-5509
[4]: https://issues.apache.org/jira/browse/SPARK-3575

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4308)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #4308 from liancheng/parquet-partition-discovery and squashes the following commits:

b6946e6 [Cheng Lian] Fixes MiMA issues, addresses comments
8232e17 [Cheng Lian] Write support for Parquet data source
a49bd28 [Cheng Lian] Fixes spelling typo in trait name "CreateableRelationProvider"
808380f [Cheng Lian] Fixes issues introduced while rebasing
50dd8d1 [Cheng Lian] Addresses @rxin's comment, fixes UDT schema merging
adf2aae [Cheng Lian] Fixes compilation error introduced while rebasing
4e0175f [Cheng Lian] Fixes Python Parquet API, we need Py4J array to call varargs method
0d8ec1d [Cheng Lian] Adds more test cases
b35c8c6 [Cheng Lian] Fixes some typos and outdated comments
dd704fd [Cheng Lian] Fixes Python Parquet API
596c312 [Cheng Lian] Uses switch to control whether use Parquet data source or not
7d0f7a2 [Cheng Lian] Fixes Metastore Parquet table conversion
a1896c7 [Cheng Lian] Fixes all existing Parquet test suites except for ParquetMetastoreSuite
5654c9d [Cheng Lian] Draft version of Parquet partition discovery and schema merging
2015-02-05 15:29:56 -08:00
OopsOutOfMemory 4d8d070c4f [SPARK-5135][SQL] Add support for describe table to DDL in SQLContext
Hi, rxin marmbrus
I considered your suggestion (in #4127) and now re-write it. This is now up-to-date.
Could u please review it ?

Author: OopsOutOfMemory <victorshengli@126.com>

Closes #4227 from OopsOutOfMemory/describe and squashes the following commits:

053826f [OopsOutOfMemory] describe
2015-02-05 13:07:48 -08:00
Reynold Xin 7d789e117d [SPARK-5612][SQL] Move DataFrame implicit functions into SQLContext.implicits.
Author: Reynold Xin <rxin@databricks.com>

Closes #4386 from rxin/df-implicits and squashes the following commits:

9d96606 [Reynold Xin] style fix
edd296b [Reynold Xin] ReplSuite
1c946ab [Reynold Xin] [SPARK-5612][SQL] Move DataFrame implicit functions into SQLContext.implicits.
2015-02-04 23:44:34 -08:00
guowei2 e0490e271d [SPARK-5118][SQL] Fix: create table test stored as parquet as select ..
Author: guowei2 <guowei2@asiainfo.com>

Closes #3921 from guowei2/SPARK-5118 and squashes the following commits:

b1ba3be [guowei2] add table file check in test case
9da56f8 [guowei2] test case only run in Shim13
112a0b6 [guowei2] add test case
187c7d8 [guowei2] Fix: create table test stored as parquet as select ..
2015-02-04 15:26:10 -08:00
wangfei 417d1118cd [SPARK-5367][SQL] Support star expression in udfs
A follow up for #4163: support  `select array(key, *) from src`

Since  array(key, *)  will not go into this case
```
case Alias(f  UnresolvedFunction(_, args), name) if containsStar(args) =>
              val expandedArgs = args.flatMap {
                case s: Star => s.expand(child.output, resolver)
                case o => o :: Nil
              }
```
here added a case to cover the corner case of array.

/cc liancheng

Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>

Closes #4353 from scwf/udf-star1 and squashes the following commits:

4350d17 [wangfei] minor fix
a7cd191 [wangfei] minor fix
0942fb1 [wangfei] follow up: support select array(key, *) from src
6ae00db [wangfei] also fix problem with array
da1da09 [scwf] minor fix
f87b5f9 [scwf] added test case
587bf7e [wangfei] compile fix
eb93c16 [wangfei] fix star resolve issue in udf
2015-02-04 15:12:07 -08:00
Daoyuan Wang db821ed2ed [SPARK-4508] [SQL] build native date type to conform behavior to Hive
The previous #3732 is reverted due to some test failure.
Have fixed that.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4325 from adrian-wang/datenative and squashes the following commits:

096e20d [Daoyuan Wang] fix for mixed timezone
0ed0fdc [Daoyuan Wang] fix test data
a2fdd4e [Daoyuan Wang] getDate
c37832b [Daoyuan Wang] row to catalyst
f0005b1 [Daoyuan Wang] add date in sql parser and java type conversion
024c9a6 [Daoyuan Wang] clean some import order
d6715fc [Daoyuan Wang] refactoring Date as Primitive Int internally
374abd5 [Daoyuan Wang] spark native date type support
2015-02-03 12:21:45 -08:00
wangfei 5adbb39482 [SPARK-5383][SQL] Support alias for udtfs
Add support for alias of udtfs, such as
```
select stack(2, key, value, key, value) as (a, b) from src limit 5;

select a, b from (select stack(2, key, value, key, value) as (a, b) from src) t limit 5

```

Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>
Author: Fei Wang <wangfei1@huawei.com>

Closes #4186 from scwf/multi-alias-names and squashes the following commits:

c35e922 [wangfei] fix conflicts
adc8311 [wangfei] minor format fix
2783aed [wangfei] convert it to a Generate instead of leaving it inside of a Project clause
a87668a [wangfei] minor improvement
b25d9b3 [wangfei] resolve conflicts
d38f041 [wangfei] style fix
8cfcebf [wangfei] minor improvement
12a239e [wangfei] fix test case
050177f [wangfei] added extendedCheckRules
3d69329 [wangfei] added CheckMultiAlias to analyzer
324150d [wangfei] added multi alias node
74f5a81 [Fei Wang] imports order fix
5bc3f59 [scwf] style fix
3daec28 [scwf] support alias for udfs with multi output columns
2015-02-03 12:16:31 -08:00
Cheng Hao ca7a6cdff0 [SPARK-5550] [SQL] Support the case insensitive for UDF
SQL in HiveContext, should be case insensitive, however, the following query will fail.

```scala
udf.register("random0", ()  => { Math.random()})
assert(sql("SELECT RANDOM0() FROM src LIMIT 1").head().getDouble(0) >= 0.0)
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #4326 from chenghao-intel/udf_case_sensitive and squashes the following commits:

485cf66 [Cheng Hao] Support the case insensitive for UDF
2015-02-03 12:12:26 -08:00
Yin Huai 13531dd97c [SPARK-5501][SPARK-5420][SQL] Write support for the data source API
This PR aims to support `INSERT INTO/OVERWRITE TABLE tableName` and `CREATE TABLE tableName AS SELECT` for the data source API (partitioned tables are not supported).

In this PR, I am also adding the support of `IF NOT EXISTS` for our ddl parser. The current semantic of `IF NOT EXISTS` is explained as follows.
* For a `CREATE TEMPORARY TABLE` statement, it does not `IF NOT EXISTS` for now.
* For a `CREATE TABLE` statement (we are creating a metastore table), if there is an existing table having the same name ...
  * when `IF NOT EXISTS` clause is used, we will do nothing.
  * when `IF NOT EXISTS` clause is not used, the user will see an exception saying the table already exists.

TODOs:
- [x] CTAS support
- [x] Programmatic APIs
- [ ] Python API (another PR)
- [x] More unit tests
- [ ] Documents (another PR)

marmbrus liancheng rxin

Author: Yin Huai <yhuai@databricks.com>

Closes #4294 from yhuai/writeSupport and squashes the following commits:

3db1539 [Yin Huai] save does not take overwrite.
1c98881 [Yin Huai] Fix test.
142372a [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupport
34e1bfb [Yin Huai] Address comments.
1682ca6 [Yin Huai] Better support for CTAS statements.
e789d64 [Yin Huai] For the Scala API, let users to use tuples to provide options.
0128065 [Yin Huai] Short hand versions of save and load.
66ebd74 [Yin Huai] Formatting.
9203ec2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupport
e5d29f2 [Yin Huai] Programmatic APIs.
1a719a5 [Yin Huai] CREATE TEMPORARY TABLE with IF NOT EXISTS is not allowed for now.
909924f [Yin Huai] Add saveAsTable for the data source API to DataFrame.
95a7c71 [Yin Huai] Fix bug when handling IF NOT EXISTS clause in a CREATE TEMPORARY TABLE statement.
d37b19c [Yin Huai] Cheng's comments.
fd6758c [Yin Huai] Use BeforeAndAfterAll.
7880891 [Yin Huai] Support CREATE TABLE AS SELECT STATEMENT and the IF NOT EXISTS clause.
cb85b05 [Yin Huai] Initial write support.
2f91354 [Yin Huai] Make INSERT OVERWRITE/INTO statements consistent between HiveQL and SqlParser.
2015-02-02 23:30:44 -08:00
Patrick Wendell eccb9fbb2d Revert "[SPARK-4508] [SQL] build native date type to conform behavior to Hive"
This reverts commit 1646f89d96.
2015-02-02 17:52:17 -08:00
seayi dca6faa29a [SPARK-5195][sql]Update HiveMetastoreCatalog.scala(override the MetastoreRelation's sameresult method only compare databasename and table name)
override  the MetastoreRelation's  sameresult method only compare databasename and table name

because in previous :
cache table t1;
select count(*) from t1;
it will read data from memory  but the sql below will not,instead it read from hdfs:
select count(*) from t1 t;

because cache data is keyed by logical plan and compare with sameResult ,so  when table with alias  the same table 's logicalplan is not the same logical plan with out alias  so modify  the sameresult method only compare databasename and table name

Author: seayi <405078363@qq.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #3898 from seayi/branch-1.2 and squashes the following commits:

8f0c7d2 [seayi] Update CachedTableSuite.scala
a277120 [seayi] Update HiveMetastoreCatalog.scala
8d910aa [seayi] Update HiveMetastoreCatalog.scala
2015-02-02 16:18:55 -08:00
Daoyuan Wang 1646f89d96 [SPARK-4508] [SQL] build native date type to conform behavior to Hive
Store daysSinceEpoch as an Int value(4 bytes) to represent DateType, instead of using java.sql.Date(8 bytes as Long) in catalyst row. This ensures the same comparison behavior of Hive and Catalyst.
Subsumes #3381
I thinks there are already some tests in JavaSQLSuite, and for python it will not affect python's datetime class.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3732 from adrian-wang/datenative and squashes the following commits:

0ed0fdc [Daoyuan Wang] fix test data
a2fdd4e [Daoyuan Wang] getDate
c37832b [Daoyuan Wang] row to catalyst
f0005b1 [Daoyuan Wang] add date in sql parser and java type conversion
024c9a6 [Daoyuan Wang] clean some import order
d6715fc [Daoyuan Wang] refactoring Date as Primitive Int internally
374abd5 [Daoyuan Wang] spark native date type support
2015-02-02 15:49:22 -08:00
Liang-Chi Hsieh 683e938242 [SPARK-5212][SQL] Add support of schema-less, custom field delimiter and SerDe for HiveQL transform
This pr adds the support of schema-less syntax, custom field delimiter and SerDe for HiveQL's transform.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4014 from viirya/schema_less_trans and squashes the following commits:

ac2d1fe [Liang-Chi Hsieh] Refactor codes for comments.
a137933 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into schema_less_trans
aa10fbd [Liang-Chi Hsieh] Add Hive golden answer files again.
575f695 [Liang-Chi Hsieh] Add Hive golden answer files for new unit tests.
a422562 [Liang-Chi Hsieh] Use createQueryTest for unit tests and remove unnecessary imports.
ccb71e3 [Liang-Chi Hsieh] Refactor codes for comments.
37bd391 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into schema_less_trans
6000889 [Liang-Chi Hsieh] Wrap input and output schema into ScriptInputOutputSchema.
21727f7 [Liang-Chi Hsieh] Move schema-less output to proper place. Use multilines instead of a long line SQL.
9a6dc04 [Liang-Chi Hsieh] setRecordReaderID is introduced in 0.13.1, use reflection API to call it.
7a14f31 [Liang-Chi Hsieh] Fix bug.
799b5e1 [Liang-Chi Hsieh] Call getSerializedClass instead of using Text.
be2c3fc [Liang-Chi Hsieh] Fix style.
32d3046 [Liang-Chi Hsieh] Add SerDe support.
ab22f7b [Liang-Chi Hsieh] Fix style.
7a48e42 [Liang-Chi Hsieh] Add support of custom field delimiter.
b1729d9 [Liang-Chi Hsieh] Fix style.
ccee49e [Liang-Chi Hsieh] Add unit test.
f561c37 [Liang-Chi Hsieh] Add support of schema-less script transformation.
2015-02-02 13:53:55 -08:00
Daoyuan Wang 8cf4a1f02e [SPARK-5262] [SPARK-5244] [SQL] add coalesce in SQLParser and widen types for parameters of coalesce
I'll add test case in #4040

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4057 from adrian-wang/coal and squashes the following commits:

4d0111a [Daoyuan Wang] address Yin's comments
c393e18 [Daoyuan Wang] fix rebase conflicts
e47c03a [Daoyuan Wang] add coalesce in parser
c74828d [Daoyuan Wang] cast types for coalesce
2015-02-01 18:51:38 -08:00
Yin Huai c00d517d66 [SPARK-4296][SQL] Trims aliases when resolving and checking aggregate expressions
I believe that SPARK-4296 has been fixed by 3684fd21e1. I am adding tests based #3910 (change the udf to HiveUDF instead).

Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #4010 from yhuai/SPARK-4296-yin and squashes the following commits:

6343800 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-4296-yin
6cfadd2 [Yin Huai] Actually, this issue has been fixed by 3684fd21e1.
d42b707 [Yin Huai] Update comment.
8b3a274 [Yin Huai] Since expressions in grouping expressions can have aliases, which can be used by the outer query block,     revert this change.
443538d [Cheng Lian] Trims aliases when resolving and checking aggregate expressions
2015-01-29 15:49:34 -08:00
wangfei fbaf9e0896 [SPARK-5367][SQL] Support star expression in udf
now spark sql does not support star expression in udf, run the following sql by spark-sql will get error
```
select concat(*) from src
```

Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>

Closes #4163 from scwf/udf-star and squashes the following commits:

9db7b39 [wangfei] addressed comments
da1da09 [scwf] minor fix
f87b5f9 [scwf] added test case
587bf7e [wangfei] compile fix
eb93c16 [wangfei] fix star resolve issue in udf
2015-01-29 15:44:53 -08:00
Reynold Xin 715632232d [SPARK-5445][SQL] Consolidate Java and Scala DSL static methods.
Turns out Scala does generate static methods for ones defined in a companion object. Finally no need to separate api.java.dsl and api.scala.dsl.

Author: Reynold Xin <rxin@databricks.com>

Closes #4276 from rxin/dsl and squashes the following commits:

30aa611 [Reynold Xin] Add all files.
1a9d215 [Reynold Xin] [SPARK-5445][SQL] Consolidate Java and Scala DSL static methods.
2015-01-29 15:13:09 -08:00
Reynold Xin 5b9760de8d [SPARK-5445][SQL] Made DataFrame dsl usable in Java
Also removed the literal implicit transformation since it is pretty scary for API design. Instead, created a new lit method for creating literals. This doesn't break anything from a compatibility perspective because Literal was added two days ago.

Author: Reynold Xin <rxin@databricks.com>

Closes #4241 from rxin/df-docupdate and squashes the following commits:

c0f4810 [Reynold Xin] Fix Python merge conflict.
094c7d7 [Reynold Xin] Minor style fix. Reset Python tests.
3c89f4a [Reynold Xin] Package.
dfe6962 [Reynold Xin] Updated Python aggregate.
5dd4265 [Reynold Xin] Made dsl Java callable.
14b3c27 [Reynold Xin] Fix literal expression for symbols.
68b31cb [Reynold Xin] Literal.
4cfeb78 [Reynold Xin] [SPARK-5097][SQL] Address DataFrame code review feedback.
2015-01-28 19:10:32 -08:00
Reynold Xin c8e934ef3c [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.
and

[SPARK-5448][SQL] Make CacheManager a concrete class and field in SQLContext

Author: Reynold Xin <rxin@databricks.com>

Closes #4242 from rxin/sqlCleanup and squashes the following commits:

e351cb2 [Reynold Xin] Fixed toDataFrame.
6545c42 [Reynold Xin] More changes.
728c017 [Reynold Xin] [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.
2015-01-28 12:10:01 -08:00
Reynold Xin 119f45d61d [SPARK-5097][SQL] DataFrame
This pull request redesigns the existing Spark SQL dsl, which already provides data frame like functionalities.

TODOs:
With the exception of Python support, other tasks can be done in separate, follow-up PRs.
- [ ] Audit of the API
- [ ] Documentation
- [ ] More test cases to cover the new API
- [x] Python support
- [ ] Type alias SchemaRDD

Author: Reynold Xin <rxin@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #4173 from rxin/df1 and squashes the following commits:

0a1a73b [Reynold Xin] Merge branch 'df1' of github.com:rxin/spark into df1
23b4427 [Reynold Xin] Mima.
828f70d [Reynold Xin] Merge pull request #7 from davies/df
257b9e6 [Davies Liu] add repartition
6bf2b73 [Davies Liu] fix collect with UDT and tests
e971078 [Reynold Xin] Missing quotes.
b9306b4 [Reynold Xin] Remove removeColumn/updateColumn for now.
a728bf2 [Reynold Xin] Example rename.
e8aa3d3 [Reynold Xin] groupby -> groupBy.
9662c9e [Davies Liu] improve DataFrame Python API
4ae51ea [Davies Liu] python API for dataframe
1e5e454 [Reynold Xin] Fixed a bug with symbol conversion.
2ca74db [Reynold Xin] Couple minor fixes.
ea98ea1 [Reynold Xin] Documentation & literal expressions.
2b22684 [Reynold Xin] Got rid of IntelliJ problems.
02bbfbc [Reynold Xin] Tightening imports.
ffbce66 [Reynold Xin] Fixed compilation error.
59b6d8b [Reynold Xin] Style violation.
b85edfb [Reynold Xin] ALS.
8c37f0a [Reynold Xin] Made MLlib and examples compile
6d53134 [Reynold Xin] Hive module.
d35efd5 [Reynold Xin] Fixed compilation error.
ce4a5d2 [Reynold Xin] Fixed test cases in SQL except ParquetIOSuite.
66d5ef1 [Reynold Xin] SQLContext minor patch.
c9bcdc0 [Reynold Xin] Checkpoint: SQL module compiles!
2015-01-27 16:08:24 -08:00
Cheng Hao 27bccc5ea9 [SPARK-5202] [SQL] Add hql variable substitution support
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution

This is a block issue for the CLI user, it impacts the existed hql scripts from Hive.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #4003 from chenghao-intel/substitution and squashes the following commits:

bb41fd6 [Cheng Hao] revert the removed the implicit conversion
af7c31a [Cheng Hao] add hql variable substitution support
2015-01-21 17:34:18 -08:00
Reynold Xin d181c2a1fc [SPARK-5323][SQL] Remove Row's Seq inheritance.
Author: Reynold Xin <rxin@databricks.com>

Closes #4115 from rxin/row-seq and squashes the following commits:

e33abd8 [Reynold Xin] Fixed compilation error.
cceb650 [Reynold Xin] Python test fixes, and removal of WrapDynamic.
0334a52 [Reynold Xin] mkString.
9cdeb7d [Reynold Xin] Hive tests.
15681c2 [Reynold Xin] Fix more test cases.
ea9023a [Reynold Xin] Fixed a catalyst test.
c5e2cb5 [Reynold Xin] Minor patch up.
b9cab7c [Reynold Xin] [SPARK-5323][SQL] Remove Row's Seq inheritance.
2015-01-20 15:16:14 -08:00
Yin Huai 2604bc35d7 [SPARK-5286][SQL] Fail to drop an invalid table when using the data source API
JIRA: https://issues.apache.org/jira/browse/SPARK-5286

Author: Yin Huai <yhuai@databricks.com>

Closes #4076 from yhuai/SPARK-5286 and squashes the following commits:

6b69ed1 [Yin Huai] Catch all exception when we try to uncache a query.
2015-01-19 10:45:29 -08:00
Yin Huai cd5da42853 [SPARK-5284][SQL] Insert into Hive throws NPE when a inner complex type field has a null value
JIRA: https://issues.apache.org/jira/browse/SPARK-5284

Author: Yin Huai <yhuai@databricks.com>

Closes #4077 from yhuai/SPARK-5284 and squashes the following commits:

fceacd6 [Yin Huai] Check if a value is null when the field has a complex type.
2015-01-19 10:44:12 -08:00
Reynold Xin 61b427d4b1 [SPARK-5193][SQL] Remove Spark SQL Java-specific API.
After the following patches, the main (Scala) API is now usable for Java users directly.

https://github.com/apache/spark/pull/4056
https://github.com/apache/spark/pull/4054
https://github.com/apache/spark/pull/4049
https://github.com/apache/spark/pull/4030
https://github.com/apache/spark/pull/3965
https://github.com/apache/spark/pull/3958

Author: Reynold Xin <rxin@databricks.com>

Closes #4065 from rxin/sql-java-api and squashes the following commits:

b1fd860 [Reynold Xin] Fix Mima
6d86578 [Reynold Xin] Ok one more attempt in fixing Python...
e8f1455 [Reynold Xin] Fix Python again...
3e53f91 [Reynold Xin] Fixed Python.
83735da [Reynold Xin] Fix BigDecimal test.
e9f1de3 [Reynold Xin] Use scala BigDecimal.
500d2c4 [Reynold Xin] Fix Decimal.
ba3bfa2 [Reynold Xin] Updated javadoc for RowFactory.
c4ae1c5 [Reynold Xin] [SPARK-5193][SQL] Remove Spark SQL Java-specific API.
2015-01-16 21:09:06 -08:00
Reynold Xin 1881431dd5 [SPARK-5274][SQL] Reconcile Java and Scala UDFRegistration.
As part of SPARK-5193:

1. Removed UDFRegistration as a mixin in SQLContext and made it a field ("udf").
2. For Java UDFs, renamed dataType to returnType.
3. For Scala UDFs, added type tags.
4. Added all Java UDF registration methods to Scala's UDFRegistration.
5. Documentation

Author: Reynold Xin <rxin@databricks.com>

Closes #4056 from rxin/udf-registration and squashes the following commits:

ae9c556 [Reynold Xin] Updated example.
675a3c9 [Reynold Xin] Style fix
47c24ff [Reynold Xin] Python fix.
5f00c45 [Reynold Xin] Restore data type position in java udf and added typetags.
032f006 [Reynold Xin] [SPARK-5193][SQL] Reconcile Java and Scala UDFRegistration.
2015-01-15 16:15:12 -08:00
Yin Huai 81f72a0df2 [SPARK-5211][SQL]Restore HiveMetastoreTypes.toDataType
jira: https://issues.apache.org/jira/browse/SPARK-5211

Author: Yin Huai <yhuai@databricks.com>

Closes #4026 from yhuai/SPARK-5211 and squashes the following commits:

15ee32b [Yin Huai] Remove extra line.
c6c1651 [Yin Huai] Get back HiveMetastoreTypes.toDataType.
2015-01-14 09:47:30 -08:00
Daoyuan Wang a3f7421b42 [SPARK-5248] [SQL] move sql.types.decimal.Decimal to sql.types.Decimal
rxin follow up of #3732

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4041 from adrian-wang/decimal and squashes the following commits:

aa3d738 [Daoyuan Wang] fix auto refactor
7777a58 [Daoyuan Wang] move sql.types.decimal.Decimal to sql.types.Decimal
2015-01-14 09:36:59 -08:00
Reynold Xin f9969098c8 [SPARK-5123][SQL] Reconcile Java/Scala API for data types.
Having two versions of the data type APIs (one for Java, one for Scala) requires downstream libraries to also have two versions of the APIs if the library wants to support both Java and Scala. I took a look at the Scala version of the data type APIs - it can actually work out pretty well for Java out of the box.

As part of the PR, I created a sql.types package and moved all type definitions there. I then removed the Java specific data type API along with a lot of the conversion code.

This subsumes https://github.com/apache/spark/pull/3925

Author: Reynold Xin <rxin@databricks.com>

Closes #3958 from rxin/SPARK-5123-datatype-2 and squashes the following commits:

66505cc [Reynold Xin] [SPARK-5123] Expose only one version of the data type APIs (i.e. remove the Java-specific API).
2015-01-13 17:16:41 -08:00
Reynold Xin 14e3f114ef [SPARK-5168] Make SQLConf a field rather than mixin in SQLContext
This change should be binary and source backward compatible since we didn't change any user facing APIs.

Author: Reynold Xin <rxin@databricks.com>

Closes #3965 from rxin/SPARK-5168-sqlconf and squashes the following commits:

42eec09 [Reynold Xin] Fix default conf value.
0ef86cc [Reynold Xin] Fix constructor ordering.
4d7f910 [Reynold Xin] Properly override config.
ccc8e6a [Reynold Xin] [SPARK-5168] Make SQLConf a field rather than mixin in SQLContext
2015-01-13 13:30:35 -08:00
Yin Huai 6463e0b9e8 [SPARK-4912][SQL] Persistent tables for the Spark SQL data sources api
With changes in this PR, users can persist metadata of tables created based on the data source API in metastore through DDLs.

Author: Yin Huai <yhuai@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #3960 from yhuai/persistantTablesWithSchema2 and squashes the following commits:

069c235 [Yin Huai] Make exception messages user friendly.
c07cbc6 [Yin Huai] Get the location of test file in a correct way.
4456e98 [Yin Huai] Test data.
5315dfc [Yin Huai] rxin's comments.
7fc4b56 [Yin Huai] Add DDLStrategy and HiveDDLStrategy to plan DDLs based on the data source API.
aeaf4b3 [Yin Huai] Add comments.
06f9b0c [Yin Huai] Revert unnecessary changes.
feb88aa [Yin Huai] Merge remote-tracking branch 'apache/master' into persistantTablesWithSchema2
172db80 [Yin Huai] Fix unit test.
49bf1ac [Yin Huai] Unit tests.
8f8f1a1 [Yin Huai] [SPARK-4574][SQL] Adding support for defining schema in foreign DDL commands. #3431
f47fda1 [Yin Huai] Unit tests.
2b59723 [Michael Armbrust] Set external when creating tables
c00bb1b [Michael Armbrust] Don't use reflection to read options
1ea6e7b [Michael Armbrust] Don't fail when trying to uncache a table that doesn't exist
6edc710 [Michael Armbrust] Add tests.
d7da491 [Michael Armbrust] First draft of persistent tables.
2015-01-13 13:01:27 -08:00
Michael Armbrust 5d9fa55082 [SPARK-5049][SQL] Fix ordering of partition columns in ParquetTableScan
Followup to #3870.  Props to rahulaggarwalguavus for identifying the issue.

Author: Michael Armbrust <michael@databricks.com>

Closes #3990 from marmbrus/SPARK-5049 and squashes the following commits:

dd03e4e [Michael Armbrust] Fill in the partition values of parquet scans instead of using JoinedRow
2015-01-12 15:19:09 -08:00
YanTangZhai 0ca51cc31d [SPARK-4692] [SQL] Support ! boolean logic operator like NOT
Support ! boolean logic operator like NOT in sql as follows
select * from for_test where !(col1 > col2)

Author: YanTangZhai <hakeemzhai@tencent.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #3555 from YanTangZhai/SPARK-4692 and squashes the following commits:

1a9f605 [YanTangZhai] Update HiveQuerySuite.scala
7c03c68 [YanTangZhai] Merge pull request #23 from apache/master
992046e [YanTangZhai] Update HiveQuerySuite.scala
ea618f4 [YanTangZhai] Update HiveQuerySuite.scala
192411d [YanTangZhai] Merge pull request #17 from YanTangZhai/master
e4c2c0a [YanTangZhai] Merge pull request #15 from apache/master
1e1ebb4 [YanTangZhai] Update HiveQuerySuite.scala
efc4210 [YanTangZhai] Update HiveQuerySuite.scala
bd2c444 [YanTangZhai] Update HiveQuerySuite.scala
1893956 [YanTangZhai] Merge pull request #14 from marmbrus/pr/3555
59e4de9 [Michael Armbrust] make hive test
718afeb [YanTangZhai] Merge pull request #12 from apache/master
950b21e [YanTangZhai] Update HiveQuerySuite.scala
74175b4 [YanTangZhai] Update HiveQuerySuite.scala
92242c7 [YanTangZhai] Update HiveQl.scala
6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
e249846 [YanTangZhai] Merge pull request #10 from apache/master
d26d982 [YanTangZhai] Merge pull request #9 from apache/master
76d4027 [YanTangZhai] Merge pull request #8 from apache/master
03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
2015-01-10 15:05:23 -08:00
Michael Armbrust 3684fd21e1 [SPARK-5187][SQL] Fix caching of tables with HiveUDFs in the WHERE clause
Author: Michael Armbrust <michael@databricks.com>

Closes #3987 from marmbrus/hiveUdfCaching and squashes the following commits:

8bca2fa [Michael Armbrust] [SPARK-5187][SQL] Fix caching of tables with HiveUDFs in the WHERE clause
2015-01-10 14:25:45 -08:00
Yanbo Liang 77106df691 SPARK-4963 [SQL] Add copy to SQL's Sample operator
https://issues.apache.org/jira/browse/SPARK-4963
SchemaRDD.sample() return wrong results due to GapSamplingIterator operating on mutable row.
HiveTableScan make RDD with SpecificMutableRow and SchemaRDD.sample() will return GapSamplingIterator for iterating.

override def next(): T = {
    val r = data.next()
    advance
    r
  }

GapSamplingIterator.next() return the current underlying element and assigned it to r.
However if the underlying iterator is mutable row just like what HiveTableScan returned, underlying iterator and r will point to the same object.
After advance operation, we drop some underlying elments and it also changed r which is not expected. Then we return the wrong value different from initial r.

To fix this issue, the most direct way is to make HiveTableScan return mutable row with copy just like the initial commit that I have made. This solution will make HiveTableScan can not get the full advantage of reusable MutableRow, but it can make sample operation return correct result.
Further more, we need to investigate  GapSamplingIterator.next() and make it can implement copy operation inside it. To achieve this, we should define every elements that RDD can store implement the function like cloneable and it will make huge change.

Author: Yanbo Liang <yanbohappy@gmail.com>

Closes #3827 from yanbohappy/spark-4963 and squashes the following commits:

0912ca0 [Yanbo Liang] code format keep
65c4e7c [Yanbo Liang] import file and clear annotation
55c7c56 [Yanbo Liang] better output of test case
cea7e2e [Yanbo Liang] SchemaRDD add copy operation before Sample operator
e840829 [Yanbo Liang] HiveTableScan return mutable row with copy
2015-01-10 14:19:32 -08:00
scwf b3e86dc624 [SPARK-4861][SQL] Refactory command in spark sql
Follow up for #3712.
This PR finally remove ```CommandStrategy``` and make all commands follow ```RunnableCommand``` so they can go with ```case r: RunnableCommand => ExecutedCommand(r) :: Nil```.

One exception is the ```DescribeCommand``` of hive, which is a special case and need to distinguish hive table and temporary table, so still keep ```HiveCommandStrategy``` here.

Author: scwf <wangfei1@huawei.com>

Closes #3948 from scwf/followup-SPARK-4861 and squashes the following commits:

6b48e64 [scwf] minor style fix
2c62e9d [scwf] fix for hive module
5a7a819 [scwf] Refactory command in spark sql
2015-01-10 14:08:04 -08:00
scwf 693a323a70 [SPARK-4574][SQL] Adding support for defining schema in foreign DDL commands.
Adding support for defining schema in foreign DDL commands. Now foreign DDL support commands like:
```
CREATE TEMPORARY TABLE avroTable
USING org.apache.spark.sql.avro
OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")
```
With this PR user can define schema instead of infer from file, so  support ddl command as follows:
```
CREATE TEMPORARY TABLE avroTable(a int, b string)
USING org.apache.spark.sql.avro
OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")
```

Author: scwf <wangfei1@huawei.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Fei Wang <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>

Closes #3431 from scwf/ddl and squashes the following commits:

7e79ce5 [Fei Wang] Merge pull request #22 from yhuai/pr3431yin
38f634e [Yin Huai] Remove Option from createRelation.
65e9c73 [Yin Huai] Revert all changes since applying a given schema has not been testd.
a852b10 [scwf] remove cleanIdentifier
f336a16 [Fei Wang] Merge pull request #21 from yhuai/pr3431yin
baf79b5 [Yin Huai] Test special characters quoted by backticks.
50a03b0 [Yin Huai] Use JsonRDD.nullTypeToStringType to convert NullType to StringType.
1eeb769 [Fei Wang] Merge pull request #20 from yhuai/pr3431yin
f5c22b0 [Yin Huai] Refactor code and update test cases.
f1cffe4 [Yin Huai] Revert "minor refactory"
b621c8f [scwf] minor refactory
d02547f [scwf] fix HiveCompatibilitySuite test failure
8dfbf7a [scwf] more tests for complex data type
ddab984 [Fei Wang] Merge pull request #19 from yhuai/pr3431yin
91ad91b [Yin Huai] Parse data types in DDLParser.
cf982d2 [scwf] fixed test failure
445b57b [scwf] address comments
02a662c [scwf] style issue
44eb70c [scwf] fix decimal parser issue
83b6fc3 [scwf] minor fix
9bf12f8 [wangfei] adding test case
7787ec7 [wangfei] added SchemaRelationProvider
0ba70df [wangfei] draft version
2015-01-10 13:53:21 -08:00
Alex Liu 4b39fd1e63 [SPARK-4943][SQL] Allow table name having dot for db/catalog
The pull only fixes the parsing error and changes API to use tableIdentifier. Joining different catalog datasource related change is not done in this pull.

Author: Alex Liu <alex_liu68@yahoo.com>

Closes #3941 from alexliu68/SPARK-SQL-4943-3 and squashes the following commits:

343ae27 [Alex Liu] [SPARK-4943][SQL] refactoring according to review
29e5e55 [Alex Liu] [SPARK-4943][SQL] fix failed Hive CTAS tests
6ae77ce [Alex Liu] [SPARK-4943][SQL] fix TestHive matching error
3652997 [Alex Liu] [SPARK-4943][SQL] Allow table name having dot to support db/catalog ...
2015-01-10 13:23:09 -08:00
wangxiaojing 07fa1910d9 [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
JIRA issue: [SPARK-4570](https://issues.apache.org/jira/browse/SPARK-4570)
We are planning to create a `BroadcastLeftSemiJoinHash` to implement the broadcast join for `left semijoin`
In left semijoin :
If the size of data from right side is smaller than the user-settable threshold `AUTO_BROADCASTJOIN_THRESHOLD`,
the planner would mark it as the `broadcast` relation and mark the other relation as the stream side. The broadcast table will be broadcasted to all of the executors involved in the join, as a `org.apache.spark.broadcast.Broadcast` object. It will use `joins.BroadcastLeftSemiJoinHash`.,else it will use `joins.LeftSemiJoinHash`.

The benchmark suggests these  made the optimized version 4x faster  when `left semijoin`
<pre><code>
Original:
left semi join : 9288 ms
Optimized:
left semi join : 1963 ms
</code></pre>
The micro benchmark load `data1/kv3.txt` into a normal Hive table.
Benchmark code:
<pre><code>
 def benchmark(f: => Unit) = {
    val begin = System.currentTimeMillis()
    f
    val end = System.currentTimeMillis()
    end - begin
  }
  val sc = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))
  val hiveContext = new HiveContext(sc)
  import hiveContext._
  sql("drop table if exists left_table")
  sql("drop table if exists right_table")
  sql( """create table left_table (key int, value string)
       """.stripMargin)
  sql( s"""load data local inpath "/data1/kv3.txt" into table left_table""")
  sql( """create table right_table (key int, value string)
       """.stripMargin)
  sql(
    """
      |from left_table
      |insert overwrite table right_table
      |select left_table.key, left_table.value
    """.stripMargin)

  val leftSimeJoin = sql(
    """select a.key from left_table a
      |left semi join right_table b on a.key = b.key""".stripMargin)
  val leftSemiJoinDuration = benchmark(leftSimeJoin.count())
  println(s"left semi join : $leftSemiJoinDuration ms ")
</code></pre>

Author: wangxiaojing <u9jing@gmail.com>

Closes #3442 from wangxiaojing/SPARK-4570 and squashes the following commits:

a4a43c9 [wangxiaojing] rebase
f103983 [wangxiaojing] change style
fbe4887 [wangxiaojing] change style
ff2e618 [wangxiaojing] add testsuite
1a8da2a [wangxiaojing] add BroadcastLeftSemiJoinHash
2014-12-30 13:54:12 -08:00
Cheng Hao 53f0a00b60 [Spark-4512] [SQL] Unresolved Attribute Exception in Sort By
It will cause exception while do query like:
SELECT key+key FROM src sort by value;

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3386 from chenghao-intel/sort and squashes the following commits:

38c78cc [Cheng Hao] revert the SortPartition in SparkStrategies
7e9dd15 [Cheng Hao] update the typo
fcd1d64 [Cheng Hao] rebase the latest master and update the SortBy unit test
2014-12-30 12:11:44 -08:00
Cheng Hao 5595eaa74f [SPARK-4959] [SQL] Attributes are case sensitive when using a select query from a projection
Author: Cheng Hao <hao.cheng@intel.com>

Closes #3796 from chenghao-intel/spark_4959 and squashes the following commits:

3ec08f8 [Cheng Hao] Replace the attribute in comparing its exprId other than itself
2014-12-30 11:33:47 -08:00
scwf 65357f11c2 [SPARK-4975][SQL] Fix HiveInspectorSuite test failure
HiveInspectorSuite test failure:
[info] - wrap / unwrap null, constant null and writables *** FAILED *** (21 milliseconds)
[info] 1 did not equal 0 (HiveInspectorSuite.scala:136)
this is because the origin date(is 3914-10-23) not equals the date returned by ```unwrap```(is 3914-10-22).

Setting TimeZone and Locale fix this.
Another minor change here is rename ```def checkValues(v1: Any, v2: Any): Unit```  to  ```def checkValue(v1: Any, v2: Any): Unit ``` to make the code more clear

Author: scwf <wangfei1@huawei.com>
Author: Fei Wang <wangfei1@huawei.com>

Closes #3814 from scwf/fix-inspectorsuite and squashes the following commits:

d8531ef [Fei Wang] Delete test.log
72b19a9 [scwf] fix HiveInspectorSuite test error
2014-12-30 11:30:47 -08:00
Daoyuan Wang 94d60b7021 [SQL] enable view test
This is a follow up of #3396 , just add a test to white list.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3826 from adrian-wang/viewtest and squashes the following commits:

f105f68 [Daoyuan Wang] enable view test
2014-12-30 11:29:13 -08:00
Michael Armbrust 480bd1d2ed [SPARK-4908][SQL] Prevent multiple concurrent hive native commands
This is just a quick fix that locks when calling `runHive`.  If we can find a way to avoid the error without a global lock that would be better.

Author: Michael Armbrust <michael@databricks.com>

Closes #3834 from marmbrus/hiveConcurrency and squashes the following commits:

bf25300 [Michael Armbrust] prevent multiple concurrent hive native commands
2014-12-30 11:24:46 -08:00
Sean Owen 29fabb1b52 SPARK-4297 [BUILD] Build warning fixes omnibus
There are a number of warnings generated in a normal, successful build right now. They're mostly Java unchecked cast warnings, which can be suppressed. But there's a grab bag of other Scala language warnings and so on that can all be easily fixed. The forthcoming PR fixes about 90% of the build warnings I see now.

Author: Sean Owen <sowen@cloudera.com>

Closes #3157 from srowen/SPARK-4297 and squashes the following commits:

8c9e469 [Sean Owen] Suppress unchecked cast warnings, and several other build warning fixes
2014-12-24 13:32:51 -08:00
wangfei c3d91da5ea [SPARK-4861][SQL] Refactory command in spark sql
Remove ```Command``` and use ```RunnableCommand``` instead.

Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>

Closes #3712 from scwf/cmd and squashes the following commits:

51a82f2 [wangfei] fix test failure
0e03be8 [wangfei] address comments
4033bed [scwf] remove CreateTableAsSelect in hivestrategy
5d20010 [wangfei] address comments
125f542 [scwf] factory command in spark sql
2014-12-18 20:24:56 -08:00
Cheng Hao ae9f128608 [SPARK-4573] [SQL] Add SettableStructObjectInspector support in "wrap" function
Hive UDAF may create an customized object constructed by SettableStructObjectInspector, this is critical when integrate Hive UDAF with the refactor-ed UDAF interface.

Performance issue in `wrap/unwrap` since more match cases added, will do it in another PR.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3429 from chenghao-intel/settable_oi and squashes the following commits:

9f0aff3 [Cheng Hao] update code style issues as feedbacks
2b0561d [Cheng Hao] Add more scala doc
f5a40e8 [Cheng Hao] add scala doc
2977e9b [Cheng Hao] remove the timezone setting for test suite
3ed284c [Cheng Hao] fix the date type comparison
f1b6749 [Cheng Hao] Update the comment
932940d [Cheng Hao] Add more unit test
72e4332 [Cheng Hao] Add settable StructObjectInspector support
2014-12-18 20:21:52 -08:00
ravipesala 7687415c25 [SPARK-2554][SQL] Supporting SumDistinct partial aggregation
Adding support to the partial aggregation of SumDistinct

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #3348 from ravipesala/SPARK-2554 and squashes the following commits:

fd28e4d [ravipesala] Fixed review comments
e60e67f [ravipesala] Fixed test cases and made it as nullable
32fe234 [ravipesala] Supporting SumDistinct partial aggregation Conflicts: 	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala
2014-12-18 20:19:10 -08:00
YanTangZhai e7de7e5f46 [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
The sql "select * from spark_test::for_test where abs(20141202) is not null" has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)) and
partitionKeyIds=AttributeSet(). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)). Then the exception "java.lang.IllegalArgumentException: requirement failed: Partition pruning predicates only supported for partitioned tables." is thrown.
The sql "select * from spark_test::for_test_partitioned_table where abs(20141202) is not null and type_id=11 and platform = 3" with partitioned key insert_date has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202), (type_id#12 = 11), (platform#8 = 3)) and partitionKeyIds=AttributeSet(insert_date#24). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)).

Author: YanTangZhai <hakeemzhai@tencent.com>
Author: yantangzhai <tyz0303@163.com>

Closes #3556 from YanTangZhai/SPARK-4693 and squashes the following commits:

620ebe3 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
37cfdf5 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
70a3544 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
efa9b03 [YanTangZhai] Update HiveQuerySuite.scala
72accf1 [YanTangZhai] Update HiveQuerySuite.scala
e572b9a [YanTangZhai] Update HiveStrategies.scala
6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
e249846 [YanTangZhai] Merge pull request #10 from apache/master
d26d982 [YanTangZhai] Merge pull request #9 from apache/master
76d4027 [YanTangZhai] Merge pull request #8 from apache/master
03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
2014-12-18 20:13:46 -08:00
Cheng Hao f728e0fe7e [SPARK-2663] [SQL] Support the Grouping Set
Add support for `GROUPING SETS`, `ROLLUP`, `CUBE` and the the virtual column `GROUPING__ID`.

More details on how to use the `GROUPING SETS" can be found at: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup
https://issues.apache.org/jira/secure/attachment/12676811/grouping_set.pdf

The generic idea of the implementations are :
1 Replace the `ROLLUP`, `CUBE` with `GROUPING SETS`
2 Explode each of the input row, and then feed them to `Aggregate`
  * Each grouping set are represented as the bit mask for the `GroupBy Expression List`, for each bit, `1` means the expression is selected, otherwise `0` (left is the lower bit, and right is the higher bit in the `GroupBy Expression List`)
  * Several of projections are constructed according to the grouping sets, and within each projection(Seq[Expression), we replace those expressions with `Literal(null)` if it's not selected in the grouping set (based on the bit mask)
  * Output Schema of `Explode` is `child.output :+ grouping__id`
  * GroupBy Expressions of `Aggregate` is `GroupBy Expression List :+ grouping__id`
  * Keep the `Aggregation expressions` the same for the `Aggregate`

The expressions substitutions happen in Logic Plan analyzing, so we will benefit from the Logical Plan optimization (e.g. expression constant folding, and map side aggregation etc.), Only an `Explosive` operator added for Physical Plan, which will explode the rows according the pre-set projections.

A known issue will be done in the follow up PR:
* Optimization `ColumnPruning` is not supported yet for `Explosive` node.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1567 from chenghao-intel/grouping_sets and squashes the following commits:

fe65fcc [Cheng Hao] Remove the extra space
3547056 [Cheng Hao] Add more doc and Simplify the Expand
a7c869d [Cheng Hao] update code as feedbacks
d23c672 [Cheng Hao] Add GroupingExpression to replace the Seq[Expression]
414b165 [Cheng Hao] revert the unnecessary changes
ec276c6 [Cheng Hao] Support Rollup/Cube/GroupingSets
2014-12-18 18:58:29 -08:00
Venkata Ramana Gollamudi f33d550464 [SPARK-3891][SQL] Add array support to percentile, percentile_approx and constant inspectors support
Supported passing array to percentile and percentile_approx UDAFs
To support percentile_approx,  constant inspectors are supported for GenericUDAF
Constant folding support added to CreateArray expression
Avoided constant udf expression re-evaluation

Author: Venkata Ramana G <ramana.gollamudihuawei.com>

Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>

Closes #2802 from gvramana/percentile_array_support and squashes the following commits:

a0182e5 [Venkata Ramana Gollamudi] fixed review comment
a18f917 [Venkata Ramana Gollamudi] avoid constant udf expression re-evaluation - fixes failure due to return iterator and value type mismatch
c46db0f [Venkata Ramana Gollamudi] Removed TestHive reset
4d39105 [Venkata Ramana Gollamudi] Unified inspector creation, style check fixes
f37fd69 [Venkata Ramana Gollamudi] Fixed review comments
47f6365 [Venkata Ramana Gollamudi] fixed test
cb7c61e [Venkata Ramana Gollamudi] Supported ConstantInspector for UDAF Fixed HiveUdaf wrap object issue.
7f94aff [Venkata Ramana Gollamudi] Added foldable support to CreateArray
2014-12-17 15:41:35 -08:00
Cheng Hao 636d9fc450 [SPARK-3739] [SQL] Update the split num base on block size for table scanning
In local mode, Hadoop/Hive will ignore the "mapred.map.tasks", hence for small table file, it's always a single input split, however, SparkSQL doesn't honor that in table scanning, and we will get different result when do the Hive Compatibility test. This PR will fix that.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2589 from chenghao-intel/source_split and squashes the following commits:

dff38e7 [Cheng Hao] Remove the extra blank line
160a2b6 [Cheng Hao] fix the compiling bug
04d67f7 [Cheng Hao] Keep 1 split for small file in table scanning
2014-12-17 13:39:36 -08:00
Michael Armbrust 7ad579ee97 [SPARK-3698][SQL] Fix case insensitive resolution of GetField.
Based on #2543.

Author: Michael Armbrust <michael@databricks.com>

Closes #3724 from marmbrus/resolveGetField and squashes the following commits:

0a47aae [Michael Armbrust] Fix case insensitive resolution of GetField.
2014-12-17 12:43:51 -08:00
Cheng Lian 3b395e1051 [SPARK-4798][SQL] A new set of Parquet testing API and test suites
This PR provides a set Parquet testing API (see trait `ParquetTest`) that enables developers to write more concise test cases. A new set of Parquet test suites built upon this API  are added and aim to replace the old `ParquetQuerySuite`. To avoid potential merge conflicts, old testing code are not removed yet. The following classes can be safely removed after most Parquet related PRs are handled:

- `ParquetQuerySuite`
- `ParquetTestData`

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3644)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3644 from liancheng/parquet-tests and squashes the following commits:

800e745 [Cheng Lian] Enforces ordering of test output
3bb8731 [Cheng Lian] Refactors HiveParquetSuite
aa2cb2e [Cheng Lian] Decouples ParquetTest and TestSQLContext
7b43a68 [Cheng Lian] Updates ParquetTest Scaladoc
7f07af0 [Cheng Lian] Adds a new set of Parquet test suites
2014-12-16 21:16:03 -08:00
Cheng Hao 0abbff2862 [SPARK-4825] [SQL] CTAS fails to resolve when created using saveAsTable
Fix bug when query like:
```
  test("save join to table") {
    val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString))
    sql("CREATE TABLE test1 (key INT, value STRING)")
    testData.insertInto("test1")
    sql("CREATE TABLE test2 (key INT, value STRING)")
    testData.insertInto("test2")
    testData.insertInto("test2")
    sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").saveAsTable("test")
    checkAnswer(
      table("test"),
      sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq)
  }
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3673 from chenghao-intel/spark_4825 and squashes the following commits:

e8cbd56 [Cheng Hao] alternate the pattern matching order for logical plan:CTAS
e004895 [Cheng Hao] fix bug
2014-12-11 22:51:49 -08:00
Daoyuan Wang cbb634ae69 [SQL] enable empty aggr test case
This is fixed by SPARK-4318 #3184

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3445 from adrian-wang/emptyaggr and squashes the following commits:

982575e [Daoyuan Wang] enable empty aggr test case
2014-12-11 22:50:18 -08:00
Cheng Hao a7f07f511c [SPARK-4662] [SQL] Whitelist more unittest
Whitelist more hive unit test:

"create_like_tbl_props"
"udf5"
"udf_java_method"
"decimal_1"
"udf_pmod"
"udf_to_double"
"udf_to_float"
"udf7" (this will fail in Hive 0.12)

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3522 from chenghao-intel/unittest and squashes the following commits:

f54e4c7 [Cheng Hao] work around to clean up the hive.table.parameters.default in reset
16fee22 [Cheng Hao] Whitelist more unittest
2014-12-11 22:43:02 -08:00
Cheng Hao 383c5555c9 [SPARK-4785][SQL] Initilize Hive UDFs on the driver and serialize them with a wrapper
Different from Hive 0.12.0, in Hive 0.13.1 UDF/UDAF/UDTF (aka Hive function) objects should only be initialized once on the driver side and then serialized to executors. However, not all function objects are serializable (e.g. GenericUDF doesn't implement Serializable). Hive 0.13.1 solves this issue with Kryo or XML serializer. Several utility ser/de methods are provided in class o.a.h.h.q.e.Utilities for this purpose. In this PR we chose Kryo for efficiency. The Kryo serializer used here is created in Hive. Spark Kryo serializer wasn't used because there's no available SparkConf instance.

Author: Cheng Hao <hao.cheng@intel.com>
Author: Cheng Lian <lian@databricks.com>

Closes #3640 from chenghao-intel/udf_serde and squashes the following commits:

8e13756 [Cheng Hao] Update the comment
74466a3 [Cheng Hao] refactor as feedbacks
396c0e1 [Cheng Hao] avoid Simple UDF to be serialized
e9c3212 [Cheng Hao] update the comment
19cbd46 [Cheng Hao] support udf instance ser/de after initialization
2014-12-09 10:28:33 -08:00
Cheng Hao 51b1fe1426 [SPARK-4769] [SQL] CTAS does not work when reading from temporary tables
This is the code refactor and follow ups for #2570

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3336 from chenghao-intel/createtbl and squashes the following commits:

3563142 [Cheng Hao] remove the unused variable
e215187 [Cheng Hao] eliminate the compiling warning
4f97f14 [Cheng Hao] fix bug in unittest
5d58812 [Cheng Hao] revert the API changes
b85b620 [Cheng Hao] fix the regression of temp tabl not found in CTAS
2014-12-08 17:39:12 -08:00
Michael Armbrust 513ef82e85 [SPARK-4552][SQL] Avoid exception when reading empty parquet data through Hive
This is a very small fix that catches one specific exception and returns an empty table.  #3441 will address this in a more principled way.

Author: Michael Armbrust <michael@databricks.com>

Closes #3586 from marmbrus/fixEmptyParquet and squashes the following commits:

2781d9f [Michael Armbrust] Handle empty lists for newParquet
04dd376 [Michael Armbrust] Avoid exception when reading empty parquet data through Hive
2014-12-03 14:13:35 -08:00
Michael Armbrust 02ec058efe [SPARK-4413][SQL] Parquet support through datasource API
Goals:
 - Support for accessing parquet using SQL but not requiring Hive (thus allowing support of parquet tables with decimal columns)
 - Support for folder based partitioning with automatic discovery of available partitions
 - Caching of file metadata

See scaladoc of `ParquetRelation2` for more details.

Author: Michael Armbrust <michael@databricks.com>

Closes #3269 from marmbrus/newParquet and squashes the following commits:

1dd75f1 [Michael Armbrust] Pass all paths for FileInputFormat at once.
645768b [Michael Armbrust] Review comments.
abd8e2f [Michael Armbrust] Alternative implementation of parquet based on the datasources API.
938019e [Michael Armbrust] Add an experimental interface to data sources that exposes catalyst expressions.
e9d2641 [Michael Armbrust] logging / formatting improvements.
2014-11-20 18:31:02 -08:00
Cheng Hao 84d79ee9ec [SPARK-4244] [SQL] Support Hive Generic UDFs with constant object inspector parameters
Query `SELECT named_struct(lower("AA"), "12", lower("Bb"), "13") FROM src LIMIT 1` will throw exception, some of the Hive Generic UDF/UDAF requires the input object inspector is `ConstantObjectInspector`, however, we won't get that before the expression optimization executed. (Constant Folding).

This PR is a work around to fix this. (As ideally, the `output` of LogicalPlan should be identical before and after Optimization).

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3109 from chenghao-intel/optimized and squashes the following commits:

487ff79 [Cheng Hao] rebase to the latest master & update the unittest
2014-11-20 16:50:59 -08:00
Cheng Hao 6aa0fc9f4d [SPARK-2918] [SQL] Support the CTAS in EXPLAIN command
Hive supports the `explain` the CTAS, which was supported by Spark SQL previously, however, seems it was reverted after the code refactoring in HiveQL.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3357 from chenghao-intel/explain and squashes the following commits:

7aace63 [Cheng Hao] Support the CTAS in EXPLAIN command
2014-11-20 15:46:00 -08:00
Michael Armbrust 90d72ec850 [SQL] Support partitioned parquet tables that have the key in both the directory and the file
Author: Michael Armbrust <michael@databricks.com>

Closes #3272 from marmbrus/keyInPartitionedTable and squashes the following commits:

447f08c [Michael Armbrust] Support partitioned parquet tables that have the key in both the directory and the file
2014-11-18 12:13:23 -08:00
Michael Armbrust a0300ea32a [SPARK-4390][SQL] Handle NaN cast to decimal correctly
Author: Michael Armbrust <michael@databricks.com>

Closes #3256 from marmbrus/NanDecimal and squashes the following commits:

4c3ba46 [Michael Armbrust] fix style
d360f83 [Michael Armbrust] Handle NaN cast to decimal
2014-11-14 14:56:57 -08:00
Takuya UESHIN bbd8f5bee8 [SPARK-4245][SQL] Fix containsNull of the result ArrayType of CreateArray expression.
The `containsNull` of the result `ArrayType` of `CreateArray` should be `true` only if the children is empty or there exists nullable child.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3110 from ueshin/issues/SPARK-4245 and squashes the following commits:

6f64746 [Takuya UESHIN] Move equalsIgnoreNullability method into DataType.
5a90e02 [Takuya UESHIN] Refine InsertIntoHiveType and add some comments.
cbecba8 [Takuya UESHIN] Fix a test title.
884ec37 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4245
3c5274b [Takuya UESHIN] Add tests to insert data of types ArrayType / MapType / StructType with nullability is false into Hive table.
41a94a9 [Takuya UESHIN] Replace InsertIntoTable with InsertIntoHiveTable if data types ignoring nullability are same.
43e6ef5 [Takuya UESHIN] Fix containsNull for empty array.
778e997 [Takuya UESHIN] Fix containsNull of the result ArrayType of CreateArray expression.
2014-11-14 14:21:16 -08:00
Daoyuan Wang ade72c4362 [SPARK-4239] [SQL] support view in HiveQl
Currently still not support view like

CREATE VIEW view3(valoo)
TBLPROPERTIES ("fear" = "factor")
AS SELECT upper(value) FROM src WHERE key=86;

because the text in metastore for this view is like

select \`_c0\` as \`valoo\` from (select upper(\`src\`.\`value\`) from \`default\`.\`src\` where ...) \`view3\`

while catalyst cannot resolve \`_c0\` for this query.
For view without colname definition in parentheses, it works fine.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3131 from adrian-wang/view and squashes the following commits:

8a56fd6 [Daoyuan Wang] michael's comments
e46c056 [Daoyuan Wang] add some golden file
079290a [Daoyuan Wang] remove useless import
88afcad [Daoyuan Wang] support view in HiveQl
2014-11-14 13:51:20 -08:00
Cheng Hao fa777833b5 [SPARK-4250] [SQL] Fix bug of constant null value mapping to ConstantObjectInspector
Author: Cheng Hao <hao.cheng@intel.com>

Closes #3114 from chenghao-intel/constant_null_oi and squashes the following commits:

e603bda [Cheng Hao] fix the bug of null value for primitive types
50a13ba [Cheng Hao] fix the timezone issue
f54f369 [Cheng Hao] fix bug of constant null value for ObjectInspector
2014-11-10 17:22:57 -08:00
Xiangrui Meng 894a7245c3 [SQL] support udt to hive types conversion (hive->udt is not supported)
marmbrus

Author: Xiangrui Meng <meng@databricks.com>

Closes #3164 from mengxr/hive-udt and squashes the following commits:

57c7519 [Xiangrui Meng] support udt->hive types (hive->udt is not supported)
2014-11-10 11:04:12 -08:00
Matthew Taylor ac70c972a5 [SPARK-4203][SQL] Partition directories in random order when inserting into hive table
When doing an insert into hive table with partitions the folders written to the file system are in a random order instead of the order defined in table creation. Seems that the loadPartition method in Hive.java has a Map<String,String> parameter but expects to be called with a map that has a defined ordering such as LinkedHashMap. Working on a test but having intillij problems

Author: Matthew Taylor <matthew.t@tbfe.net>

Closes #3076 from tbfenet/partition_dir_order_problem and squashes the following commits:

f1b9a52 [Matthew Taylor] Comment format fix
bca709f [Matthew Taylor] review changes
0e50f6b [Matthew Taylor] test fix
99f1a31 [Matthew Taylor] partition ordering fix
369e618 [Matthew Taylor] partition ordering fix
2014-11-07 12:53:08 -08:00
Cheng Hao e83f13e8d3 [SPARK-4152] [SQL] Avoid data change in CTAS while table already existed
CREATE TABLE t1 (a String);
CREATE TABLE t1 AS SELECT key FROM src; – throw exception
CREATE TABLE if not exists t1 AS SELECT key FROM src; – expect do nothing, currently it will overwrite the t1, which is incorrect.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3013 from chenghao-intel/ctas_unittest and squashes the following commits:

194113e [Cheng Hao] fix bug in CTAS when table already existed
2014-11-03 13:59:43 -08:00
Cheng Lian c9f840046f [SPARK-3791][SQL] Provides Spark version and Hive version in HiveThriftServer2
This PR overrides the `GetInfo` Hive Thrift API to provide correct version information. Another property `spark.sql.hive.version` is added to reveal the underlying Hive version. These are generally useful for Spark SQL ODBC driver providers. The Spark version information is extracted from the jar manifest. Also took the chance to remove the `SET -v` hack, which was a workaround for Simba ODBC driver connectivity.

TODO

- [x] Find a general way to figure out Hive (or even any dependency) version.

  This [blog post](http://blog.soebes.de/blog/2014/01/02/version-information-into-your-appas-with-maven/) suggests several methods to inspect application version. In the case of Spark, this can be tricky because the chosen method:

  1. must applies to both Maven build and SBT build

    For Maven builds, we can retrieve the version information from the META-INF/maven directory within the assembly jar. But this doesn't work for SBT builds.

  2. must not rely on the original jars of dependencies to extract specific dependency version, because Spark uses assembly jar.

    This implies we can't read Hive version from Hive jar files since standard Spark distribution doesn't include them.

  3. should play well with `SPARK_PREPEND_CLASSES` to ease local testing during development.

     `SPARK_PREPEND_CLASSES` prevents classes to be loaded from the assembly jar, thus we can't locate the jar file and read its manifest.

  Given these, maybe the only reliable method is to generate a source file containing version information at build time. pwendell Do you have any suggestions from the perspective of the build process?

**Update** Hive version is now retrieved from the newly introduced `HiveShim` object.

Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: Cheng Lian <lian@databricks.com>

Closes #2843 from liancheng/get-info and squashes the following commits:

a873d0f [Cheng Lian] Updates test case
53f43cd [Cheng Lian] Retrieves underlying Hive verson via HiveShim
1d282b8 [Cheng Lian] Removes the Simba ODBC "SET -v" hack
f857fce [Cheng Lian] Overrides Hive GetInfo Thrift API and adds Hive version property
2014-11-02 15:18:29 -08:00
wangfei f0a4b630ab [HOTFIX][SQL] hive test missing some golden files
cc marmbrus

Author: wangfei <wangfei1@huawei.com>

Closes #3055 from scwf/hotfix and squashes the following commits:

d881bd7 [wangfei] miss golden files
2014-11-02 14:59:41 -08:00
Cheng Lian 23468e7e96 [SPARK-2220][SQL] Fixes remaining Hive commands
This PR adds support for the `ADD FILE` Hive command, and removes `ShellCommand` and `SourceCommand`. The reason is described in [this SPARK-2220 comment](https://issues.apache.org/jira/browse/SPARK-2220?focusedCommentId=14191841&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14191841).

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #3038 from liancheng/hive-commands and squashes the following commits:

6db61e0 [Cheng Lian] Fixes remaining Hive commands
2014-10-31 11:34:51 -07:00
ravipesala ea465af12d [SPARK-4154][SQL] Query does not work if it has "not between " in Spark SQL and HQL
if the query contains "not between" does not work like.
SELECT * FROM src where key not between 10 and 20'

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #3017 from ravipesala/SPARK-4154 and squashes the following commits:

65fc89e [ravipesala] Handled admin comments
32e6d42 [ravipesala] 'not between' is not working
2014-10-31 11:33:20 -07:00
Venkata Ramana Gollamudi fa712b309c [SPARK-4077][SQL] Spark SQL return wrong values for valid string timestamp values
In org.apache.hadoop.hive.serde2.io.TimestampWritable.set , if the next entry is null then current time stamp object is being reset.
However because of this hiveinspectors:unwrap cannot use the same timestamp object without creating a copy.

Author: Venkata Ramana G <ramana.gollamudihuawei.com>

Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>

Closes #3019 from gvramana/spark_4077 and squashes the following commits:

32d818f [Venkata Ramana Gollamudi] fixed check style
fa01e71 [Venkata Ramana Gollamudi] cloned timestamp object as org.apache.hadoop.hive.serde2.io.TimestampWritable.set will reset current time object
2014-10-31 11:30:28 -07:00
Cheng Hao b5e79bf889 [SPARK-3904] [SQL] add constant objectinspector support for udfs
In HQL, we convert all of the data type into normal `ObjectInspector`s for UDFs, most of cases it works, however, some of the UDF actually requires its children `ObjectInspector` to be the `ConstantObjectInspector`, which will cause exception.
e.g.
select named_struct("x", "str") from src limit 1;

I updated the method `wrap` by adding the one more parameter `ObjectInspector`(to describe what it expects to wrap to, for example: java.lang.Integer or IntWritable).

As well as the `unwrap` method by providing the input `ObjectInspector`.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2762 from chenghao-intel/udf_coi and squashes the following commits:

bcacfd7 [Cheng Hao] Shim for both Hive 0.12 & 0.13.1
2416e5d [Cheng Hao] revert to hive 0.12
5793c01 [Cheng Hao] add space before while
4e56e1b [Cheng Hao] style issue
683d3fd [Cheng Hao] Add golden files
fe591e4 [Cheng Hao] update HiveGenericUdf for set the ObjectInspector while constructing the DeferredObject
f6740fe [Cheng Hao] Support Constant ObjectInspector for Map & List
8814c3a [Cheng Hao] Passing ContantObjectInspector(when necessary) for UDF initializing
2014-10-28 19:11:57 -07:00
Cheng Hao 4b55482abf [SPARK-3343] [SQL] Add serde support for CTAS
Currently, `CTAS` (Create Table As Select) doesn't support specifying the `SerDe` in HQL. This PR will pass down the `ASTNode` into the physical operator `execution.CreateTableAsSelect`, which will extract the `CreateTableDesc` object via Hive `SemanticAnalyzer`. In the meantime, I also update the `HiveMetastoreCatalog.createTable` to optionally support the `CreateTableDesc` for table creation.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2570 from chenghao-intel/ctas_serde and squashes the following commits:

e011ef5 [Cheng Hao] shim for both 0.12 & 0.13.1
cfb3662 [Cheng Hao] revert to hive 0.12
c8a547d [Cheng Hao] Support SerDe properties within CTAS
2014-10-28 14:36:06 -07:00
ravipesala 5807cb40ae [SPARK-3814][SQL] Support for Bitwise AND(&), OR(|) ,XOR(^), NOT(~) in Spark HQL and SQL
Currently there is no support of Bitwise & , | in Spark HiveQl and Spark SQL as well. So this PR support the same.
I am closing https://github.com/apache/spark/pull/2926 as it has conflicts to merge. And also added support for Bitwise AND(&), OR(|) ,XOR(^), NOT(~) And I handled all review comments in that PR

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #2961 from ravipesala/SPARK-3814-NEW4 and squashes the following commits:

a391c7a [ravipesala] Rebase with master
2014-10-28 13:36:06 -07:00
wangxiaojing 0c34fa5b4b [SPARK-3907][SQL] Add truncate table support
JIRA issue: [SPARK-3907]https://issues.apache.org/jira/browse/SPARK-3907

Add turncate table support
TRUNCATE TABLE table_name [PARTITION partition_spec];
partition_spec:
  : (partition_col = partition_col_value, partition_col = partiton_col_value, ...)
Removes all rows from a table or partition(s). Currently target table should be native/managed table or exception will be thrown. User can specify partial partition_spec for truncating multiple partitions at once and omitting partition_spec will truncate all partitions in the table.

Author: wangxiaojing <u9jing@gmail.com>

Closes #2770 from wangxiaojing/spark-3907 and squashes the following commits:

63dbd81 [wangxiaojing] change hive scalastyle
7a03707 [wangxiaojing] add comment
f6e710e [wangxiaojing] change truncate table
a1f692c [wangxiaojing] Correct spelling mistakes
3b20007 [wangxiaojing] add truncate can not support column err message
e483547 [wangxiaojing] add golden file
77b1f20 [wangxiaojing]  add truncate table support
2014-10-27 22:02:52 -07:00
wangfei 89af6dfc3a [SPARK-4041][SQL] Attributes names in table scan should converted to lowercase when compare with relation attributes
In ```MetastoreRelation``` the attributes name is lowercase because of hive using lowercase for fields name, so we should convert attributes name in table scan lowercase in ```indexWhere(_.name == a.name)```.
```neededColumnIDs``` may be not correct if not convert to lowercase.

Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>

Closes #2884 from scwf/fixColumnIds and squashes the following commits:

6174046 [scwf] use AttributeMap for this issue
dc74a24 [wangfei] use lowerName and add a test case for this issue
3ff3a80 [wangfei] more safer change
294fcb7 [scwf] attributes names in table scan should convert lowercase in neededColumnsIDs
2014-10-27 20:46:26 -07:00
Cheng Hao 418ad83fe1 [SPARK-3911] [SQL] HiveSimpleUdf can not be optimized in constant folding
```
explain extended select cos(null) from src limit 1;
```
outputs:
```
 Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5]
  MetastoreRelation default, src, None

== Optimized Logical Plan ==
Limit 1
 Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5]
  MetastoreRelation default, src, None

== Physical Plan ==
Limit 1
 Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5]
  HiveTableScan [], (MetastoreRelation default, src, None), None
```
After patching this PR it outputs
```
== Parsed Logical Plan ==
Limit 1
 Project ['cos(null) AS c_0#0]
  UnresolvedRelation None, src, None

== Analyzed Logical Plan ==
Limit 1
 Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#0]
  MetastoreRelation default, src, None

== Optimized Logical Plan ==
Limit 1
 Project [null AS c_0#0]
  MetastoreRelation default, src, None

== Physical Plan ==
Limit 1
 Project [null AS c_0#0]
  HiveTableScan [], (MetastoreRelation default, src, None), None
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2771 from chenghao-intel/hive_udf_constant_folding and squashes the following commits:

1379c73 [Cheng Hao] duplicate the PlanTest with catalyst/plans/PlanTest
1e52dda [Cheng Hao] add unit test for hive simple udf constant folding
01609ff [Cheng Hao] support constant folding for HiveSimpleUdf
2014-10-27 20:42:05 -07:00
Yin Huai 05308426f0 [SPARK-4052][SQL] Use scala.collection.Map for pattern matching instead of using Predef.Map (it is scala.collection.immutable.Map)
Please check https://issues.apache.org/jira/browse/SPARK-4052 for cases triggering this bug.

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #2899 from yhuai/SPARK-4052 and squashes the following commits:

1188f70 [Yin Huai] Address liancheng's comments.
b6712be [Yin Huai] Use scala.collection.Map instead of Predef.Map (scala.collection.immutable.Map).
2014-10-26 16:30:15 -07:00
Michael Armbrust 3a845d3c04 [SQL] Update Hive test harness for Hive 12 and 13
As part of the upgrade I also copy the newest version of the query tests, and whitelist a bunch of new ones that are now passing.

Author: Michael Armbrust <michael@databricks.com>

Closes #2936 from marmbrus/fix13tests and squashes the following commits:

d9cbdab [Michael Armbrust] Remove user specific tests
65801cd [Michael Armbrust] style and rat
8f6b09a [Michael Armbrust] Update test harness to work with both Hive 12 and 13.
f044843 [Michael Armbrust] Update Hive query tests and golden files to 0.13
2014-10-24 18:36:35 -07:00
Zhan Zhang 7c89a8f0c8 [SPARK-2706][SQL] Enable Spark to support Hive 0.13
Given that a lot of users are trying to use hive 0.13 in spark, and the incompatibility between hive-0.12 and hive-0.13 on the API level I want to propose following approach, which has no or minimum impact on existing hive-0.12 support, but be able to jumpstart the development of hive-0.13 and future version support.

Approach: Introduce “hive-version” property,  and manipulate pom.xml files to support different hive version at compiling time through shim layer, e.g., hive-0.12.0 and hive-0.13.1. More specifically,

1. For each different hive version, there is a very light layer of shim code to handle API differences, sitting in sql/hive/hive-version, e.g., sql/hive/v0.12.0 or sql/hive/v0.13.1

2. Add a new profile hive-default active by default, which picks up all existing configuration and hive-0.12.0 shim (v0.12.0)  if no hive.version is specified.

3. If user specifies different version (currently only 0.13.1 by -Dhive.version = 0.13.1), hive-versions profile will be activated, which pick up hive-version specific shim layer and configuration, mainly the hive jars and hive-version shim, e.g., v0.13.1.

4. With this approach, nothing is changed with current hive-0.12 support.

No change by default: sbt/sbt -Phive
For example: sbt/sbt -Phive -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly

To enable hive-0.13: sbt/sbt -Dhive.version=0.13.1
For example: sbt/sbt -Dhive.version=0.13.1 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly

Note that in hive-0.13, hive-thriftserver is not enabled, which should be fixed by other Jira, and we don’t need -Phive with -Dhive.version in building (probably we should use -Phive -Dhive.version=xxx instead after thrift server is also supported in hive-0.13.1).

Author: Zhan Zhang <zhazhan@gmail.com>
Author: zhzhan <zhazhan@gmail.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #2241 from zhzhan/spark-2706 and squashes the following commits:

3ece905 [Zhan Zhang] minor fix
410b668 [Zhan Zhang] solve review comments
cbb4691 [Zhan Zhang] change run-test for new options
0d4d2ed [Zhan Zhang] rebase
497b0f4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
8fad1cf [Zhan Zhang] change the pom file and make hive-0.13.1 as the default
ab028d1 [Zhan Zhang] rebase
4a2e36d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
4cb1b93 [zhzhan] Merge pull request #1 from pwendell/pr-2241
b0478c0 [Patrick Wendell] Changes to simplify the build of SPARK-2706
2b50502 [Zhan Zhang] rebase
a72c0d4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
cb22863 [Zhan Zhang] correct the typo
20f6cf7 [Zhan Zhang] solve compatability issue
f7912a9 [Zhan Zhang] rebase and solve review feedback
301eb4a [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
10c3565 [Zhan Zhang] address review comments
6bc9204 [Zhan Zhang] rebase and remove temparory repo
d3aa3f2 [Zhan Zhang] Merge branch 'master' into spark-2706
cedcc6f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
3ced0d7 [Zhan Zhang] rebase
d9b981d [Zhan Zhang] rebase and fix error due to rollback
adf4924 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
3dd50e8 [Zhan Zhang] solve conflicts and remove unnecessary implicts
d10bf00 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
dc7bdb3 [Zhan Zhang] solve conflicts
7e0cc36 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
d7c3e1e [Zhan Zhang] Merge branch 'master' into spark-2706
68deb11 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
d48bd18 [Zhan Zhang] address review comments
3ee3b2b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
57ea52e [Zhan Zhang] Merge branch 'master' into spark-2706
2b0d513 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
9412d24 [Zhan Zhang] address review comments
f4af934 [Zhan Zhang] rebase
1ccd7cc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
128b60b [Zhan Zhang] ignore 0.12.0 test cases for the time being
af9feb9 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
5f5619f [Zhan Zhang] restructure the directory and different hive version support
05d3683 [Zhan Zhang] solve conflicts
e4c1982 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
94b4fdc [Zhan Zhang] Spark-2706: hive-0.13.1 support on spark
87ebf3b [Zhan Zhang] Merge branch 'master' into spark-2706
921e914 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
f896b2a [Zhan Zhang] Merge branch 'master' into spark-2706
789ea21 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
cb53a2c [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
f6a8a40 [Zhan Zhang] revert
ba14f28 [Zhan Zhang] test
dbedff3 [Zhan Zhang] Merge remote-tracking branch 'upstream/master'
70964fe [Zhan Zhang] revert
fe0f379 [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark
70ffd93 [Zhan Zhang] revert
42585ec [Zhan Zhang] test
7d5fce2 [Zhan Zhang] test
2014-10-24 11:03:17 -07:00
chirag e6e37701f1 SPARK-3807: SparkSql does not work for tables created using custom serde
SparkSql crashes on selecting tables using custom serde.

Example:
----------------

CREATE EXTERNAL TABLE table_name PARTITIONED BY ( a int) ROW FORMAT 'SERDE "org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer" with serdeproperties("serialization.format"="org.apache.thrift.protocol.TBinaryProtocol","serialization.class"="ser_class") STORED AS SEQUENCEFILE;

The following exception is seen on running a query like 'select * from table_name limit 1':

ERROR CliDriver: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException
at org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.initialize(ThriftDeserializer.java:68)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:80)
at org.apache.spark.sql.hive.execution.HiveTableScan.addColumnMetadataToConf(HiveTableScan.scala:86)
at org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:100)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:280)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:406)
at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException

Author: chirag <chirag.aggarwal@guavus.com>

Closes #2674 from chiragaggarwal/branch-1.1 and squashes the following commits:

370c31b [chirag] SPARK-3807: Add a test case to validate the fix.
1f26805 [chirag] SPARK-3807: SparkSql does not work for tables created using custom serde (Incorporated Review Comments)
ba4bc0c [chirag] SPARK-3807: SparkSql does not work for tables created using custom serde
5c73b72 [chirag] SPARK-3807: SparkSql does not work for tables created using custom serde

(cherry picked from commit 925e22d313)
Signed-off-by: Michael Armbrust <michael@databricks.com>
2014-10-13 13:47:51 -07:00
Daoyuan Wang 2ac40da3f9 [SPARK-3407][SQL]Add Date type support
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2344 from adrian-wang/date and squashes the following commits:

f15074a [Daoyuan Wang] remove outdated lines
2038085 [Daoyuan Wang] update return type
00fe81f [Daoyuan Wang] address lian cheng's comments
0df6ea1 [Daoyuan Wang] rebase and remove simple string
bb1b1ef [Daoyuan Wang] remove failing test
aa96735 [Daoyuan Wang] not cast for same type compare
30bf48b [Daoyuan Wang] resolve rebase conflict
617d1a8 [Daoyuan Wang] add date_udf case to white list
c37e848 [Daoyuan Wang] comment update
5429212 [Daoyuan Wang] change to long
f8f219f [Daoyuan Wang] revise according to Cheng Hao
0e0a4f5 [Daoyuan Wang] minor format
4ddcb92 [Daoyuan Wang] add java api for date
0e3110e [Daoyuan Wang] try to fix timezone issue
17fda35 [Daoyuan Wang] set test list
2dfbb5b [Daoyuan Wang] support date type
2014-10-13 13:33:12 -07:00
ravipesala 6f98902a3d [SPARK-3834][SQL] Backticks not correctly handled in subquery aliases
The queries like SELECT a.key FROM (SELECT key FROM src) \`a\` does not work as backticks in subquery aliases are not handled properly. This PR fixes that.

Author : ravipesala ravindra.pesalahuawei.com

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #2737 from ravipesala/SPARK-3834 and squashes the following commits:

0e0ab98 [ravipesala] Fixing issue in backtick handling for subquery aliases
2014-10-09 18:41:36 -07:00
Vida Ha b77a02f41c [SPARK-3752][SQL]: Add tests for different UDF's
Author: Vida Ha <vida@databricks.com>

Closes #2621 from vidaha/vida/SPARK-3752 and squashes the following commits:

d7fdbbc [Vida Ha] Add tests for different UDF's
2014-10-09 13:13:31 -07:00
Reynold Xin bcb1ae049b [SPARK-3857] Create joins package for various join operators.
Author: Reynold Xin <rxin@apache.org>

Closes #2719 from rxin/sql-join-break and squashes the following commits:

0c0082b [Reynold Xin] Fix line length.
cbc664c [Reynold Xin] Rename join -> joins package.
a070d44 [Reynold Xin] Fix line length in HashJoin
a39be8c [Reynold Xin] [SPARK-3857] Create a join package for various join operators.
2014-10-08 18:17:01 -07:00
Cheng Lian e703357233 [SPARK-3810][SQL] Makes PreInsertionCasts handle partitions properly
Includes partition keys into account when applying `PreInsertionCasts` rule.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2672 from liancheng/fix-pre-insert-casts and squashes the following commits:

def1a1a [Cheng Lian] Makes PreInsertionCasts handle partitions properly
2014-10-08 18:11:18 -07:00
Cheng Lian 34b97a067d [SPARK-3645][SQL] Makes table caching eager by default and adds syntax for lazy caching
Although lazy caching for in-memory table seems consistent with the `RDD.cache()` API, it's relatively confusing for users who mainly work with SQL and not familiar with Spark internals. The `CACHE TABLE t; SELECT COUNT(*) FROM t;` pattern is also commonly seen just to ensure predictable performance.

This PR makes both the `CACHE TABLE t [AS SELECT ...]` statement and the `SQLContext.cacheTable()` API eager by default, and adds a new `CACHE LAZY TABLE t [AS SELECT ...]` syntax to provide lazy in-memory table caching.

Also, took the chance to make some refactoring: `CacheCommand` and `CacheTableAsSelectCommand` are now merged and renamed to `CacheTableCommand` since the former is strictly a special case of the latter. A new `UncacheTableCommand` is added for the `UNCACHE TABLE t` statement.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2513 from liancheng/eager-caching and squashes the following commits:

fe92287 [Cheng Lian] Makes table caching eager by default and adds syntax for lazy caching
2014-10-05 17:51:59 -07:00
scwf 58f5361caa [SPARK-3792][SQL] Enable JavaHiveQLSuite
Do not use TestSQLContext in JavaHiveQLSuite, that may lead to two SparkContexts in one jvm and enable JavaHiveQLSuite

Author: scwf <wangfei1@huawei.com>

Closes #2652 from scwf/fix-JavaHiveQLSuite and squashes the following commits:

be35c91 [scwf] enable JavaHiveQLSuite
2014-10-05 17:49:41 -07:00
Michael Armbrust 6a1d48f4f0 [SPARK-3212][SQL] Use logical plan matching instead of temporary tables for table caching
_Also addresses: SPARK-1671, SPARK-1379 and SPARK-3641_

This PR introduces a new trait, `CacheManger`, which replaces the previous temporary table based caching system.  Instead of creating a temporary table that shadows an existing table with and equivalent cached representation, the cached manager maintains a separate list of logical plans and their cached data.  After optimization, this list is searched for any matching plan fragments.  When a matching plan fragment is found it is replaced with the cached data.

There are several advantages to this approach:
 - Calling .cache() on a SchemaRDD now works as you would expect, and uses the more efficient columnar representation.
 - Its now possible to provide a list of temporary tables, without having to decide if a given table is actually just a  cached persistent table. (To be done in a follow-up PR)
 - In some cases it is possible that cached data will be used, even if a cached table was not explicitly requested.  This is because we now look at the logical structure instead of the table name.
 - We now correctly invalidate when data is inserted into a hive table.

Author: Michael Armbrust <michael@databricks.com>

Closes #2501 from marmbrus/caching and squashes the following commits:

63fbc2c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into caching.
0ea889e [Michael Armbrust] Address comments.
1e23287 [Michael Armbrust] Add support for cache invalidation for hive inserts.
65ed04a [Michael Armbrust] fix tests.
bdf9a3f [Michael Armbrust] Merge remote-tracking branch 'origin/master' into caching
b4b77f2 [Michael Armbrust] Address comments
6923c9d [Michael Armbrust] More comments / tests
80f26ac [Michael Armbrust] First draft of improved semantics for Spark SQL caching.
2014-10-03 12:34:27 -07:00
Cheng Lian bec0d0eaa3 [SPARK-3007][SQL] Adds dynamic partitioning support
PR #2226 was reverted because it broke Jenkins builds for unknown reason. This debugging PR aims to fix the Jenkins build.

This PR also fixes two bugs:

1. Compression configurations in `InsertIntoHiveTable` are disabled by mistake

   The `FileSinkDesc` object passed to the writer container doesn't have compression related configurations. These configurations are not taken care of until `saveAsHiveFile` is called. This PR moves compression code forward, right after instantiation of the `FileSinkDesc` object.

1. `PreInsertionCasts` doesn't take table partitions into account

   In `castChildOutput`, `table.attributes` only contains non-partition columns, thus for partitioned table `childOutputDataTypes` never equals to `tableOutputDataTypes`. This results funny analyzed plan like this:

   ```
   == Analyzed Logical Plan ==
   InsertIntoTable Map(partcol1 -> None, partcol2 -> None), false
    MetastoreRelation default, dynamic_part_table, None
    Project [c_0#1164,c_1#1165,c_2#1166]
     Project [c_0#1164,c_1#1165,c_2#1166]
      Project [c_0#1164,c_1#1165,c_2#1166]
       ... (repeats 99 times) ...
        Project [c_0#1164,c_1#1165,c_2#1166]
         Project [c_0#1164,c_1#1165,c_2#1166]
          Project [1 AS c_0#1164,1 AS c_1#1165,1 AS c_2#1166]
           Filter (key#1170 = 150)
            MetastoreRelation default, src, None
   ```

   Awful though this logical plan looks, it's harmless because all projects will be eliminated by optimizer. Guess that's why this issue hasn't been caught before.

Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: baishuo(白硕) <vc_java@hotmail.com>
Author: baishuo <vc_java@hotmail.com>

Closes #2616 from liancheng/dp-fix and squashes the following commits:

21935b6 [Cheng Lian] Adds back deleted trailing space
f471c4b [Cheng Lian] PreInsertionCasts should take table partitions into account
a132c80 [Cheng Lian] Fixes output compression
9c6eb2d [Cheng Lian] Adds tests to verify dynamic partitioning folder layout
0eed349 [Cheng Lian] Addresses @yhuai's comments
26632c3 [Cheng Lian] Adds more tests
9227181 [Cheng Lian] Minor refactoring
c47470e [Cheng Lian] Refactors InsertIntoHiveTable to a Command
6fb16d7 [Cheng Lian] Fixes typo in test name, regenerated golden answer files
d53daa5 [Cheng Lian] Refactors dynamic partitioning support
b821611 [baishuo] pass check style
997c990 [baishuo] use HiveConf.DEFAULTPARTITIONNAME to replace hive.exec.default.partition.name
761ecf2 [baishuo] modify according micheal's advice
207c6ac [baishuo] modify for some bad indentation
caea6fb [baishuo] modify code to pass scala style checks
b660e74 [baishuo] delete a empty else branch
cd822f0 [baishuo] do a little modify
8e7268c [baishuo] update file after test
3f91665 [baishuo(白硕)] Update Cast.scala
8ad173c [baishuo(白硕)] Update InsertIntoHiveTable.scala
051ba91 [baishuo(白硕)] Update Cast.scala
d452eb3 [baishuo(白硕)] Update HiveQuerySuite.scala
37c603b [baishuo(白硕)] Update InsertIntoHiveTable.scala
98cfb1f [baishuo(白硕)] Update HiveCompatibilitySuite.scala
6af73f4 [baishuo(白硕)] Update InsertIntoHiveTable.scala
adf02f1 [baishuo(白硕)] Update InsertIntoHiveTable.scala
1867e23 [baishuo(白硕)] Update SparkHadoopWriter.scala
6bb5880 [baishuo(白硕)] Update HiveQl.scala
2014-10-03 12:26:02 -07:00
ravipesala 22f8e1ee7c [SPARK-2693][SQL] Supported for UDAF Hive Aggregates like PERCENTILE
Implemented UDAF Hive aggregates by adding wrapper to Spark Hive.

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #2620 from ravipesala/SPARK-2693 and squashes the following commits:

a8df326 [ravipesala] Removed resolver from constructor arguments
caf25c6 [ravipesala] Fixed style issues
5786200 [ravipesala] Supported for UDAF Hive Aggregates like PERCENTILE
2014-10-03 11:25:18 -07:00
ravipesala 1c90347a4b [SPARK-3654][SQL] Implement all extended HiveQL statements/commands with a separate parser combinator
Created separate parser for hql. It preparses the commands like cache,uncache,add jar etc.. and then parses with HiveQl

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #2590 from ravipesala/SPARK-3654 and squashes the following commits:

bbca7dd [ravipesala] Fixed code as per admin comments.
ae9290a [ravipesala] Fixed style issues as per Admin comments
898ed81 [ravipesala] Removed spaces
fb24edf [ravipesala] Updated the code as per admin comments
8947d37 [ravipesala] Removed duplicate code
ba26cd1 [ravipesala] Created seperate parser for hql.It pre parses the commands like cache,uncache,add jar etc.. and then parses with HiveQl
2014-10-02 20:04:33 -07:00
ravipesala 3508ce8a5a [SPARK-3708][SQL] Backticks aren't handled correctly is aliases
The below query gives error
sql("SELECT k FROM (SELECT \`key\` AS \`k\` FROM src) a")
It gives error because the aliases are not cleaned so it could not be resolved in further processing.

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #2594 from ravipesala/SPARK-3708 and squashes the following commits:

d55db54 [ravipesala] Fixed SPARK-3708 (Backticks aren't handled correctly is aliases)
2014-10-01 15:44:41 -07:00
Cheng Lian 8cc70e7e15 [SQL] Kill dangerous trailing space in query string
MD5 of query strings in `createQueryTest` calls are used to generate golden files, leaving trailing spaces there can be really dangerous. Got bitten by this while working on #2616: my "smart" IDE automatically removed a trailing space and makes Jenkins fail.

(Really should add "no trailing space" to our coding style guidelines!)

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2619 from liancheng/kill-trailing-space and squashes the following commits:

034f119 [Cheng Lian] Kill dangerous trailing space in query string
2014-10-01 12:40:37 -07:00
Reynold Xin 3888ee2f38 [SPARK-3748] Log thread name in unit test logs
Thread names are useful for correlating failures.

Author: Reynold Xin <rxin@apache.org>

Closes #2600 from rxin/log4j and squashes the following commits:

83ffe88 [Reynold Xin] [SPARK-3748] Log thread name in unit test logs
2014-10-01 01:03:49 -07:00
Patrick Wendell b64fcbd2dc Revert "[SPARK-3007][SQL]Add Dynamic Partition support to Spark Sql hive"
This reverts commit 0bbe7faeff.
2014-09-30 09:43:46 -07:00
baishuo(白硕) 0bbe7faeff [SPARK-3007][SQL]Add Dynamic Partition support to Spark Sql hive
a new PR base on new master.  changes are the same as https://github.com/apache/spark/pull/1919

Author: baishuo(白硕) <vc_java@hotmail.com>
Author: baishuo <vc_java@hotmail.com>
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2226 from baishuo/patch-3007 and squashes the following commits:

e69ce88 [Cheng Lian] Adds tests to verify dynamic partitioning folder layout
b20a3dc [Cheng Lian] Addresses @yhuai's comments
096bbbc [baishuo(白硕)] Merge pull request #1 from liancheng/refactor-dp
1093c20 [Cheng Lian] Adds more tests
5004542 [Cheng Lian] Minor refactoring
fae9eff [Cheng Lian] Refactors InsertIntoHiveTable to a Command
528e84c [Cheng Lian] Fixes typo in test name, regenerated golden answer files
c464b26 [Cheng Lian] Refactors dynamic partitioning support
5033928 [baishuo] pass check style
2201c75 [baishuo] use HiveConf.DEFAULTPARTITIONNAME to replace hive.exec.default.partition.name
b47c9bf [baishuo] modify according micheal's advice
c3ab36d [baishuo] modify for some bad indentation
7ce2d9f [baishuo] modify code to pass scala style checks
37c1c43 [baishuo] delete a empty else branch
66e33fc [baishuo] do a little modify
88d0110 [baishuo] update file after test
a3961d9 [baishuo(白硕)] Update Cast.scala
f7467d0 [baishuo(白硕)] Update InsertIntoHiveTable.scala
c1a59dd [baishuo(白硕)] Update Cast.scala
0e18496 [baishuo(白硕)] Update HiveQuerySuite.scala
60f70aa [baishuo(白硕)] Update InsertIntoHiveTable.scala
0a50db9 [baishuo(白硕)] Update HiveCompatibilitySuite.scala
491c7d0 [baishuo(白硕)] Update InsertIntoHiveTable.scala
a2374a8 [baishuo(白硕)] Update InsertIntoHiveTable.scala
701a814 [baishuo(白硕)] Update SparkHadoopWriter.scala
dc24c41 [baishuo(白硕)] Update HiveQl.scala
2014-09-29 15:51:55 -07:00
Michael Armbrust f0c7e19550 [SPARK-3680][SQL] Fix bug caused by eager typing of HiveGenericUDFs
Typing of UDFs should be lazy as it is often not valid to call `dataType` on an expression until after all of its children are `resolved`.

Author: Michael Armbrust <michael@databricks.com>

Closes #2525 from marmbrus/concatBug and squashes the following commits:

5b8efe7 [Michael Armbrust] fix bug with eager typing of udfs
2014-09-27 12:10:16 -07:00
w00228970 0800881051 [SPARK-3676][SQL] Fix hive test suite failure due to diffs in JDK 1.6/1.7
This is a bug in JDK6: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4428022

this is because jdk get different result to operate ```double```,
```System.out.println(1/500d)``` in different jdk get different result
jdk 1.6.0(_31) ---- 0.0020
jdk 1.7.0(_05) ---- 0.002
this leads to HiveQuerySuite failed when generate golden answer in jdk 1.7 and run tests in jdk 1.6, result did not match

Author: w00228970 <wangfei1@huawei.com>

Closes #2517 from scwf/HiveQuerySuite and squashes the following commits:

0cb5e8d [w00228970] delete golden answer of division-0 and timestamp cast #1
1df3964 [w00228970] Jdk version leads to different query output for Double, this make HiveQuerySuite failed
2014-09-27 12:06:16 -07:00
Daoyuan Wang 0ec2d2e8f0 [SPARK-3531][SQL]select null from table would throw a MatchError
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2396 from adrian-wang/selectnull and squashes the following commits:

2458229 [Daoyuan Wang] rebase solution
2014-09-26 12:04:37 -07:00
Venkata Ramana Gollamudi 1c62f97e94 [SPARK-3268][SQL] DoubleType, FloatType and DecimalType modulus support
Supported modulus operation using % operator on fractional datatypes FloatType, DoubleType and DecimalType
Example:
SELECT 1388632775.0 % 60 from tablename LIMIT 1

Author : Venkata Ramana Gollamudi ramana.gollamudihuawei.com

Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>

Closes #2457 from gvramana/double_modulus_support and squashes the following commits:

79172a8 [Venkata Ramana Gollamudi] Add hive cache to testcase
c09bd5b [Venkata Ramana Gollamudi] Added a HiveQuerySuite testcase
193fa81 [Venkata Ramana Gollamudi] corrected testcase
3624471 [Venkata Ramana Gollamudi] modified testcase
e112c09 [Venkata Ramana Gollamudi] corrected the testcase
513d0e0 [Venkata Ramana Gollamudi] modified to add modulus support to fractional types float,double,decimal
296d253 [Venkata Ramana Gollamudi] modified to add modulus support to fractional types float,double,decimal
2014-09-23 12:17:47 -07:00
wangfei ae60f8fb2d [SPARK-3481][SQL] removes the evil MINOR HACK
a follow up of https://github.com/apache/spark/pull/2377 and https://github.com/apache/spark/pull/2352, see detail there.

Author: wangfei <wangfei1@huawei.com>

Closes #2505 from scwf/patch-6 and squashes the following commits:

4874ec8 [wangfei] removes the evil MINOR HACK
2014-09-23 11:59:44 -07:00
Daoyuan Wang 66bc0f2d67 [SPARK-3598][SQL]cast to timestamp should be the same as hive
this patch fixes timestamp smaller than 0 and cast int as timestamp

select cast(1000 as timestamp) from src limit 1;

should return 1970-01-01 00:00:01, but we now take it as 1000 seconds.
also, current implementation has bug when the time is before 1970-01-01 00:00:00.
rxin marmbrus chenghao-intel

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2458 from adrian-wang/timestamp and squashes the following commits:

4274b1d [Daoyuan Wang] set test not related to timezone
1234f66 [Daoyuan Wang] fix timestamp smaller than 0 and cast int as timestamp
2014-09-23 11:45:44 -07:00
Michael Armbrust 293ce85145 [SPARK-3414][SQL] Replace LowerCaseSchema with Resolver
**This PR introduces a subtle change in semantics for HiveContext when using the results in Python or Scala.  Specifically, while resolution remains case insensitive, it is now case preserving.**

_This PR is a follow up to #2293 (and to a lesser extent #2262 #2334)._

In #2293 the catalog was changed to store analyzed logical plans instead of unresolved ones.  While this change fixed the reported bug (which was caused by yet another instance of us forgetting to put in a `LowerCaseSchema` operator) it had the consequence of breaking assumptions made by `MultiInstanceRelation`.  Specifically, we can't replace swap out leaf operators in a tree without rewriting changed expression ids (which happens when you self join the same RDD that has been registered as a temp table).

In this PR, I instead remove the need to insert `LowerCaseSchema` operators at all, by moving the concern of matching up identifiers completely into analysis.  Doing so allows the test cases from both #2293 and #2262 to pass at the same time (and likely fixes a slew of other "unknown unknown" bugs).

While it is rolled back in this PR, storing the analyzed plan might actually be a good idea.  For instance, it is kind of confusing if you register a temporary table, change the case sensitivity of resolution and now you can't query that table anymore.  This can be addressed in a follow up PR.

Follow-ups:
 - Configurable case sensitivity
 - Consider storing analyzed plans for temp tables

Author: Michael Armbrust <michael@databricks.com>

Closes #2382 from marmbrus/lowercase and squashes the following commits:

c21171e [Michael Armbrust] Ensure the resolver is used for field lookups and ensure that case insensitive resolution is still case preserving.
d4320f1 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into lowercase
2de881e [Michael Armbrust] Address comments.
219805a [Michael Armbrust] style
5b93711 [Michael Armbrust] Replace LowerCaseSchema with Resolver.
2014-09-20 16:41:14 -07:00
Cheng Hao 2c3cc7641d [SPARK-3501] [SQL] Fix the bug of Hive SimpleUDF creates unnecessary type cast
When do the query like:
```
select datediff(cast(value as timestamp), cast('2002-03-21 00:00:00' as timestamp)) from src;
```
SparkSQL will raise exception:
```
[info] scala.MatchError: TimestampType (of class org.apache.spark.sql.catalyst.types.TimestampType$)
[info] at org.apache.spark.sql.catalyst.expressions.Cast.castToTimestamp(Cast.scala:77)
[info] at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:251)
[info] at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
[info] at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
[info] at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:217)
[info] at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:210)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:180)
[info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2368 from chenghao-intel/cast_exception and squashes the following commits:

5c9c3a5 [Cheng Hao] make more clear code
49dfc50 [Cheng Hao] Add no-op for Cast and revert the position of SimplifyCasts
b804abd [Cheng Hao] Add unit test to show the failure in identical data type casting
330a5c8 [Cheng Hao] Update Code based on comments
b834ed4 [Cheng Hao] Fix bug of HiveSimpleUDF with unnecessary type cast which cause exception in constant folding
2014-09-19 15:29:22 -07:00
Michael Armbrust 30f288ae34 [SPARK-2890][SQL] Allow reading of data when case insensitive resolution could cause possible ambiguity.
Throwing an error in the constructor makes it possible to run queries, even when there is no actual ambiguity.  Remove this check in favor of throwing an error in analysis when they query is actually is ambiguous.

Also took the opportunity to add test cases that would have caught a subtle bug in my first attempt at fixing this and refactor some other test code.

Author: Michael Armbrust <michael@databricks.com>

Closes #2209 from marmbrus/sameNameStruct and squashes the following commits:

729cca4 [Michael Armbrust] Better tests.
a003aeb [Michael Armbrust] Remove error (it'll be caught in analysis).
2014-09-16 11:42:26 -07:00
Cheng Lian 74049249ab [SPARK-3294][SQL] Eliminates boxing costs from in-memory columnar storage
This is a major refactoring of the in-memory columnar storage implementation, aims to eliminate boxing costs from critical paths (building/accessing column buffers) as much as possible. The basic idea is to refactor all major interfaces into a row-based form and use them together with `SpecificMutableRow`. The difficult part is how to adapt all compression schemes, esp. `RunLengthEncoding` and `DictionaryEncoding`, to this design. Since in-memory compression is disabled by default for now, and this PR should be strictly better than before no matter in-memory compression is enabled or not, maybe I'll finish that part in another PR.

**UPDATE** This PR also took the chance to optimize `HiveTableScan` by

1. leveraging `SpecificMutableRow` to avoid boxing cost, and
1. building specific `Writable` unwrapper functions a head of time to avoid per row pattern matching and branching costs.

TODO

- [x] Benchmark
- [ ] ~~Eliminate boxing costs in `RunLengthEncoding`~~ (left to future PRs)
- [ ] ~~Eliminate boxing costs in `DictionaryEncoding` (seems not easy to do without specializing `DictionaryEncoding` for every supported column type)~~  (left to future PRs)

## Micro benchmark

The benchmark uses a 10 million line CSV table consists of bytes, shorts, integers, longs, floats and doubles, measures the time to build the in-memory version of this table, and the time to scan the whole in-memory table.

Benchmark code can be found [here](https://gist.github.com/liancheng/fe70a148de82e77bd2c8#file-hivetablescanbenchmark-scala). Script used to generate the input table can be found [here](https://gist.github.com/liancheng/fe70a148de82e77bd2c8#file-tablegen-scala).

Speedup:

- Hive table scanning + column buffer building: **18.74%**

  The original benchmark uses 1K as in-memory batch size, when increased to 10K, it can be 28.32% faster.

- In-memory table scanning: **7.95%**

Before:

        | Building | Scanning
------- | -------- | --------
1       | 16472    | 525
2       | 16168    | 530
3       | 16386    | 529
4       | 16184    | 538
5       | 16209    | 521
Average | 16283.8  | 528.6

After:

        | Building | Scanning
------- | -------- | --------
1       | 13124    | 458
2       | 13260    | 529
3       | 12981    | 463
4       | 13214    | 483
5       | 13583    | 500
Average | 13232.4  | 486.6

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2327 from liancheng/prevent-boxing/unboxing and squashes the following commits:

4419fe4 [Cheng Lian] Addressing comments
e5d2cf2 [Cheng Lian] Bug fix: should call setNullAt when field value is null to avoid NPE
8b8552b [Cheng Lian] Only checks for partition batch pruning flag once
489f97b [Cheng Lian] Bug fix: TableReader.fillObject uses wrong ordinals
97bbc4e [Cheng Lian] Optimizes hive.TableReader by by providing specific Writable unwrappers a head of time
3dc1f94 [Cheng Lian] Minor changes to eliminate row object creation
5b39cb9 [Cheng Lian] Lowers log level of compression scheme details
f2a7890 [Cheng Lian] Use SpecificMutableRow in InMemoryColumnarTableScan to avoid boxing
9cf30b0 [Cheng Lian] Added row based ColumnType.append/extract
456c366 [Cheng Lian] Made compression decoder row based
edac3cd [Cheng Lian] Makes ColumnAccessor.extractSingle row based
8216936 [Cheng Lian] Removes boxing cost in IntDelta and LongDelta by providing specialized implementations
b70d519 [Cheng Lian] Made some in-memory columnar storage interfaces row-based
2014-09-13 15:08:30 -07:00
Cheng Lian 184cd51c42 [SPARK-3481][SQL] Removes the evil MINOR HACK
This is a follow up of #2352. Now we can finally remove the evil "MINOR HACK", which covered up the eldest bug in the history of Spark SQL (see details [here](https://github.com/apache/spark/pull/2352#issuecomment-55440621)).

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2377 from liancheng/remove-evil-minor-hack and squashes the following commits:

0869c78 [Cheng Lian] Removes the evil MINOR HACK
2014-09-13 12:35:40 -07:00
Cheng Lian 6d887db789 [SPARK-3515][SQL] Moves test suite setup code to beforeAll rather than in constructor
Please refer to the JIRA ticket for details.

**NOTE** We should check all test suites that do similar initialization-like side effects in their constructors. This PR only fixes `ParquetMetastoreSuite` because it breaks our Jenkins Maven build.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2375 from liancheng/say-no-to-constructor and squashes the following commits:

0ceb75b [Cheng Lian] Moves test suite setup code to beforeAll rather than in constructor
2014-09-12 20:14:09 -07:00
Cheng Hao ca83f1e2c4 [SPARK-2917] [SQL] Avoid table creation in logical plan analyzing for CTAS
Author: Cheng Hao <hao.cheng@intel.com>

Closes #1846 from chenghao-intel/ctas and squashes the following commits:

56a0578 [Cheng Hao] remove the unused imports
9a57abc [Cheng Hao] Avoid table creation in logical plan analyzing
2014-09-11 11:57:01 -07:00
Michael Armbrust 84e2c8bfe4 [SQL] Add test case with workaround for reading partitioned Avro files
In order to read from partitioned Avro files we need to also set the `SERDEPROPERTIES` since `TBLPROPERTIES` are not passed to the initialization.  This PR simply adds a test to make sure we don't break this workaround.

Author: Michael Armbrust <michael@databricks.com>

Closes #2340 from marmbrus/avroPartitioned and squashes the following commits:

6b969d6 [Michael Armbrust] fix style
fea2124 [Michael Armbrust] Add test case with workaround for reading partitioned avro files.
2014-09-10 20:57:38 -07:00
Wenchen Fan e4f4886d71 [SPARK-2096][SQL] Correctly parse dot notations
First let me write down the current `projections` grammar of spark sql:

    expression                : orExpression
    orExpression              : andExpression {"or" andExpression}
    andExpression             : comparisonExpression {"and" comparisonExpression}
    comparisonExpression      : termExpression | termExpression "=" termExpression | termExpression ">" termExpression | ...
    termExpression            : productExpression {"+"|"-" productExpression}
    productExpression         : baseExpression {"*"|"/"|"%" baseExpression}
    baseExpression            : expression "[" expression "]" | ... | ident | ...
    ident                     : identChar {identChar | digit} | delimiters | ...
    identChar                 : letter | "_" | "."
    delimiters                : "," | ";" | "(" | ")" | "[" | "]" | ...
    projection                : expression [["AS"] ident]
    projections               : projection { "," projection}

For something like `a.b.c[1]`, it will be parsed as:
<img src="http://img51.imgspice.com/i/03008/4iltjsnqgmtt_t.jpg" border=0>
But for something like `a[1].b`, the current grammar can't parse it correctly.
A simple solution is written in `ParquetQuerySuite#NestedSqlParser`, changed grammars are:

    delimiters                : "." | "," | ";" | "(" | ")" | "[" | "]" | ...
    identChar                 : letter | "_"
    baseExpression            : expression "[" expression "]" | expression "." ident | ... | ident | ...
This works well, but can't cover some corner case like `select t.a.b from table as t`:
<img src="http://img51.imgspice.com/i/03008/v2iau3hoxoxg_t.jpg" border=0>
`t.a.b` parsed as `GetField(GetField(UnResolved("t"), "a"), "b")` instead of `GetField(UnResolved("t.a"), "b")` using this new grammar.
However, we can't resolve `t` as it's not a filed, but the whole table.(if we could do this, then `select t from table as t` is legal, which is unexpected)
My solution is:

    dotExpressionHeader       : ident "." ident
    baseExpression            : expression "[" expression "]" | expression "." ident | ... | dotExpressionHeader  | ident | ...
I passed all test cases under sql locally and add a more complex case.
"arrayOfStruct.field1 to access all values of field1" is not supported yet. Since this PR has changed a lot of code, I will open another PR for it.
I'm not familiar with the latter optimize phase, please correct me if I missed something.

Author: Wenchen Fan <cloud0fan@163.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #2230 from cloud-fan/dot and squashes the following commits:

e1a8898 [Wenchen Fan] remove support for arbitrary nested arrays
ee8a724 [Wenchen Fan] rollback LogicalPlan, support dot operation on nested array type
a58df40 [Michael Armbrust] add regression test for doubly nested data
16bc4c6 [Wenchen Fan] some enhance
95d733f [Wenchen Fan] split long line
dc31698 [Wenchen Fan] SPARK-2096 Correctly parse dot notations
2014-09-10 12:56:59 -07:00
Daoyuan Wang a0283300c4 [SPARK-3362][SQL] Fix resolution for casewhen with nulls.
Current implementation will ignore else val type.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2245 from adrian-wang/casewhenbug and squashes the following commits:

3332f6e [Daoyuan Wang] remove wrong comment
83b536c [Daoyuan Wang] a comment to trigger retest
d7315b3 [Daoyuan Wang] code improve
eed35fc [Daoyuan Wang] bug in casewhen resolve
2014-09-10 10:45:24 -07:00
William Benton 2b7ab814f9 [SPARK-3329][SQL] Don't depend on Hive SET pair ordering in tests.
This fixes some possible spurious test failures in `HiveQuerySuite` by comparing sets of key-value pairs as sets, rather than as lists.

Author: William Benton <willb@redhat.com>
Author: Aaron Davidson <aaron@databricks.com>

Closes #2220 from willb/spark-3329 and squashes the following commits:

3b3e205 [William Benton] Collapse collectResults case match in HiveQuerySuite
6525d8e [William Benton] Handle cases where SET returns Rows of (single) strings
cf11b0e [Aaron Davidson] Fix flakey HiveQuerySuite test
2014-09-08 19:29:23 -07:00
Cheng Lian dc1dbf206e [SPARK-3414][SQL] Stores analyzed logical plan when registering a temp table
Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names, because we store unanalyzed logical plan when registering temp tables while the `CaseInsensitivityAttributeReferences` batch runs before the `Resolution` batch. To fix this issue, we need to store analyzed logical plan.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2293 from liancheng/spark-3414 and squashes the following commits:

d9fa1d6 [Cheng Lian] Stores analyzed logical plan when registering a temp table
2014-09-08 19:08:05 -07:00
Cheng Lian ee575f12f2 [SPARK-2219][SQL] Added support for the "add jar" command
Adds logical and physical command classes for the "add jar" command.

Note that this PR conflicts with and should be merged after #2215.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2242 from liancheng/add-jar and squashes the following commits:

e43a2f1 [Cheng Lian] Updates AddJar according to conventions introduced in #2215
b99107f [Cheng Lian] Added test case for ADD JAR command
095b2c7 [Cheng Lian] Also forward ADD JAR command to Hive
9be031b [Cheng Lian] Trims Jar path string
8195056 [Cheng Lian] Added support for the "add jar" command
2014-09-04 18:47:45 -07:00
qiping.lqp 634d04b87c [SPARK-3291][SQL]TestcaseName in createQueryTest should not contain ":"
":" is not allowed to appear in a file name of Windows system. If file name contains ":", this file can't be checked out in a Windows system and developers using Windows must be careful to not commit the deletion of such files, Which is very inconvenient.

Author: qiping.lqp <qiping.lqp@alibaba-inc.com>

Closes #2191 from chouqin/querytest and squashes the following commits:

0e943a1 [qiping.lqp] rename golden file
60a863f [qiping.lqp] TestcaseName in createQueryTest should not contain ":"
2014-08-29 15:37:43 -07:00
viirya 28d41d6279 [SPARK-3252][SQL] Add missing condition for test
According to the text message, both relations should be tested. So add the missing condition.

Author: viirya <viirya@gmail.com>

Closes #2159 from viirya/fix_test and squashes the following commits:

b1c0f52 [viirya] add missing condition.
2014-08-27 14:55:05 -07:00
u0jing 3b5eb7083d [SPARK-3118][SQL]add "SHOW TBLPROPERTIES tblname;" and "SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name]" support
JIRA issue: [SPARK-3118] https://issues.apache.org/jira/browse/SPARK-3118

eg:
> SHOW TBLPROPERTIES test;
SHOW TBLPROPERTIES test;
numPartitions	0
numFiles	1
transient_lastDdlTime	1407923642
numRows	0
totalSize	82
rawDataSize	0

eg:
> SHOW COLUMNS  in test;
SHOW COLUMNS  in test;
OK
Time taken: 0.304 seconds
id
stid
bo

Author: u0jing <u9jing@gmail.com>

Closes #2034 from u0jing/spark-3118 and squashes the following commits:

b231d87 [u0jing] add golden answer files
35f4885 [u0jing] add 'show columns' and 'show tblproperties' support
2014-08-27 12:47:14 -07:00
Michael Armbrust c4787a3690 [SPARK-3194][SQL] Add AttributeSet to fix bugs with invalid comparisons of AttributeReferences
It is common to want to describe sets of attributes that are in various parts of a query plan.  However, the semantics of putting `AttributeReference` objects into a standard Scala `Set` result in subtle bugs when references differ cosmetically.  For example, with case insensitive resolution it is possible to have two references to the same attribute whose names are not equal.

In this PR I introduce a new abstraction, an `AttributeSet`, which performs all comparisons using the globally unique `ExpressionId` instead of case class equality.  (There is already a related class, [`AttributeMap`](https://github.com/marmbrus/spark/blob/inMemStats/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala#L32))  This new type of set is used to fix a bug in the optimizer where needed attributes were getting projected away underneath join operators.

I also took this opportunity to refactor the expression and query plan base classes.  In all but one instance the logic for computing the `references` of an `Expression` were the same.  Thus, I moved this logic into the base class.

For query plans the semantics of  the `references` method were ill defined (is it the references output? or is it those used by expression evaluation? or what?).  As a result, this method wasn't really used very much.  So, I removed it.

TODO:
 - [x] Finish scala doc for `AttributeSet`
 - [x] Scan the code for other instances of `Set[Attribute]` and refactor them.
 - [x] Finish removing `references` from `QueryPlan`

Author: Michael Armbrust <michael@databricks.com>

Closes #2109 from marmbrus/attributeSets and squashes the following commits:

1c0dae5 [Michael Armbrust] work on serialization bug.
9ba868d [Michael Armbrust] Merge remote-tracking branch 'origin/master' into attributeSets
3ae5288 [Michael Armbrust] review comments
40ce7f6 [Michael Armbrust] style
d577cc7 [Michael Armbrust] Scaladoc
cae5d22 [Michael Armbrust] remove more references implementations
d6e16be [Michael Armbrust] Remove more instances of "def references" and normal sets of attributes.
fc26b49 [Michael Armbrust] Add AttributeSet class, remove references from Expression.
2014-08-26 16:29:14 -07:00
Daoyuan Wang 52fbdc2ded [Spark-3222] [SQL] Cross join support in HiveQL
We can simple treat cross join as inner join without join conditions.

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: adrian-wang <daoyuanwong@gmail.com>

Closes #2124 from adrian-wang/crossjoin and squashes the following commits:

8c9b7c5 [Daoyuan Wang] add a test
7d47bbb [adrian-wang] add cross join support for hql
2014-08-25 22:56:35 -07:00
Cheng Hao 156eb39661 [SPARK-3058] [SQL] Support EXTENDED for EXPLAIN
Provide `extended` keyword support for `explain` command in SQL. e.g.
```
explain extended select key as a1, value as a2 from src where key=1;
== Parsed Logical Plan ==
Project ['key AS a1#3,'value AS a2#4]
 Filter ('key = 1)
  UnresolvedRelation None, src, None

== Analyzed Logical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
 Filter (CAST(key#8, DoubleType) = CAST(1, DoubleType))
  MetastoreRelation default, src, None

== Optimized Logical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
 Filter (CAST(key#8, DoubleType) = 1.0)
  MetastoreRelation default, src, None

== Physical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
 Filter (CAST(key#8, DoubleType) = 1.0)
  HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None

Code Generation: false
== RDD ==
(2) MappedRDD[14] at map at HiveContext.scala:350
  MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:42
  MapPartitionsRDD[12] at mapPartitions at basicOperators.scala:57
  MapPartitionsRDD[11] at mapPartitions at TableReader.scala:112
  MappedRDD[10] at map at TableReader.scala:240
  HadoopRDD[9] at HadoopRDD at TableReader.scala:230
```

It's the sub task of #1847. But can go without any dependency.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1962 from chenghao-intel/explain_extended and squashes the following commits:

295db74 [Cheng Hao] Fix bug in printing the simple execution plan
48bc989 [Cheng Hao] Support EXTENDED for EXPLAIN
2014-08-25 17:43:56 -07:00
Michael Armbrust 7e191fe29b [SPARK-2554][SQL] CountDistinct partial aggregation and object allocation improvements
Author: Michael Armbrust <michael@databricks.com>
Author: Gregory Owen <greowen@gmail.com>

Closes #1935 from marmbrus/countDistinctPartial and squashes the following commits:

5c7848d [Michael Armbrust] turn off caching in the constructor
8074a80 [Michael Armbrust] fix tests
32d216f [Michael Armbrust] reynolds comments
c122cca [Michael Armbrust] Address comments, add tests
b2e8ef3 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
fae38f4 [Michael Armbrust] Fix style
fdca896 [Michael Armbrust] cleanup
93d0f64 [Michael Armbrust] metastore concurrency fix.
db44a30 [Michael Armbrust] JIT hax.
3868f6c [Michael Armbrust] Merge pull request #9 from GregOwen/countDistinctPartial
c9e67de [Gregory Owen] Made SpecificRow and types serializable by Kryo
2b46c4b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
8ff6402 [Michael Armbrust] Add specific row.
58d15f1 [Michael Armbrust] disable codegen logging
87d101d [Michael Armbrust] Fix isNullAt bug
abee26d [Michael Armbrust] WIP
27984d0 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
57ae3b1 [Michael Armbrust] Fix order dependent test
b3d0f64 [Michael Armbrust] Add golden files.
c1f7114 [Michael Armbrust] Improve tests / fix serialization.
f31b8ad [Michael Armbrust] more fixes
38c7449 [Michael Armbrust] comments and style
9153652 [Michael Armbrust] better toString
d494598 [Michael Armbrust] Fix tests now that the planner is better
41fbd1d [Michael Armbrust] Never try and create an empty hash set.
050bb97 [Michael Armbrust] Skip no-arg constructors for kryo,
bd08239 [Michael Armbrust] WIP
213ada8 [Michael Armbrust] First draft of partially aggregated and code generated count distinct / max
2014-08-23 16:19:10 -07:00
Michael Armbrust 3abd0c1cda [SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables.
This PR adds an experimental flag `spark.sql.hive.convertMetastoreParquet` that when true causes the planner to detects tables that use Hive's Parquet SerDe and instead plans them using Spark SQL's native `ParquetTableScan`.

Author: Michael Armbrust <michael@databricks.com>
Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1819 from marmbrus/parquetMetastore and squashes the following commits:

1620079 [Michael Armbrust] Revert "remove hive parquet bundle"
cc30430 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4f3d54f [Michael Armbrust] fix style
41ebc5f [Michael Armbrust] remove hive parquet bundle
a43e0da [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4c4dc19 [Michael Armbrust] Fix bug with tree splicing.
ebb267e [Michael Armbrust] include parquet hive to tests pass (Remove this later).
c0d9b72 [Michael Armbrust] Avoid creating a HadoopRDD per partition.  Add dirty hacks to retrieve partition values from the InputSplit.
8cdc93c [Michael Armbrust] Merge pull request #8 from yhuai/parquetMetastore
a0baec7 [Yin Huai] Partitioning columns can be resolved.
1161338 [Michael Armbrust] Add a test to make sure conversion is actually happening
212d5cd [Michael Armbrust] Initial support for using ParquetTableScan to read HiveMetaStore tables.
2014-08-18 13:17:10 -07:00
tianyi 13f54e2b97 [SPARK-2817] [SQL] add "show create table" support
In spark sql component, the "show create table" syntax had been disabled.
We thought it is a useful funciton to describe a hive table.

Author: tianyi <tianyi@asiainfo-linkage.com>
Author: tianyi <tianyi@asiainfo.com>
Author: tianyi <tianyi.asiainfo@gmail.com>

Closes #1760 from tianyi/spark-2817 and squashes the following commits:

7d28b15 [tianyi] [SPARK-2817] fix too short prefix problem
cbffe8b [tianyi] [SPARK-2817] fix the case problem
565ec14 [tianyi] [SPARK-2817] fix the case problem
60d48a9 [tianyi] [SPARK-2817] use system temporary folder instead of temporary files in the source tree, and also clean some empty line
dbe1031 [tianyi] [SPARK-2817] move some code out of function rewritePaths, as it may be called multiple times
9b2ba11 [tianyi] [SPARK-2817] fix the line length problem
9f97586 [tianyi] [SPARK-2817] remove test.tmp.dir from pom.xml
bfc2999 [tianyi] [SPARK-2817] add "File.separator" support, create a "testTmpDir" outside the rewritePaths
bde800a [tianyi] [SPARK-2817] add "${system:test.tmp.dir}" support add "last_modified_by" to nonDeterministicLineIndicators in HiveComparisonTest
bb82726 [tianyi] [SPARK-2817] remove test which requires a system from the whitelist.
bbf6b42 [tianyi] [SPARK-2817] add a systemProperties named "test.tmp.dir" to pass the test which contains "${system:test.tmp.dir}"
a337bd6 [tianyi] [SPARK-2817] add "show create table" support
a03db77 [tianyi] [SPARK-2817] add "show create table" support
2014-08-13 16:50:02 -07:00
Yin Huai 45d8f4deab [SPARK-2919] [SQL] Basic support for analyze command in HiveQl
The command we will support is
```
ANALYZE TABLE tablename COMPUTE STATISTICS noscan
```
Other cases shown in https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ExistingTables will still be treated as Hive native commands.

JIRA: https://issues.apache.org/jira/browse/SPARK-2919

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1848 from yhuai/sqlAnalyze and squashes the following commits:

0b79d36 [Yin Huai] Typo and format.
c59d94b [Yin Huai] Support "ANALYZE TABLE tableName COMPUTE STATISTICS noscan".
2014-08-08 11:23:58 -07:00
Reynold Xin b70bae40eb [SQL] Tighten the visibility of various SQLConf methods and renamed setter/getters
Author: Reynold Xin <rxin@apache.org>

Closes #1794 from rxin/sql-conf and squashes the following commits:

3ac11ef [Reynold Xin] getAllConfs return an immutable Map instead of an Array.
4b19d6c [Reynold Xin] Tighten the visibility of various SQLConf methods and renamed setter/getters.
2014-08-05 22:29:19 -07:00
Michael Armbrust 1d70c4f66d [SPARK-2866][SQL] Support attributes in ORDER BY that aren't in SELECT
Minor refactoring to allow resolution either using a nodes input or output.

Author: Michael Armbrust <michael@databricks.com>

Closes #1795 from marmbrus/ordering and squashes the following commits:

237f580 [Michael Armbrust] style
74d833b [Michael Armbrust] newline
705d963 [Michael Armbrust] Add a rule for resolving ORDER BY expressions that reference attributes not present in the SELECT clause.
82cabda [Michael Armbrust] Generalize attribute resolution.
2014-08-05 20:55:02 -07:00
Michael Armbrust 6e821e3d1a [SPARK-2860][SQL] Fix coercion of CASE WHEN.
Author: Michael Armbrust <michael@databricks.com>

Closes #1785 from marmbrus/caseNull and squashes the following commits:

126006d [Michael Armbrust] better error message
2fe357f [Michael Armbrust] Fix coercion of CASE WHEN.
2014-08-05 11:17:50 -07:00
Yin Huai e139e2be60 [SPARK-2783][SQL] Basic support for analyze in HiveContext
JIRA: https://issues.apache.org/jira/browse/SPARK-2783

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1741 from yhuai/analyzeTable and squashes the following commits:

7bb5f02 [Yin Huai] Use sql instead of hql.
4d09325 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
e3ebcd4 [Yin Huai] Renaming.
c170f4e [Yin Huai] Do not use getContentSummary.
62393b6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
db233a6 [Yin Huai] Trying to debug jenkins...
fee84f0 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
f0501f3 [Yin Huai] Fix compilation error.
24ad391 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
8918140 [Yin Huai] Wording.
23df227 [Yin Huai] Add a simple analyze method to get the size of a table and update the "totalSize" property of this table in the Hive metastore.
2014-08-03 14:54:41 -07:00
Michael Armbrust 236dfac676 [SPARK-2784][SQL] Deprecate hql() method in favor of a config option, 'spark.sql.dialect'
Many users have reported being confused by the distinction between the `sql` and `hql` methods.  Specifically, many users think that `sql(...)` cannot be used to read hive tables.  In this PR I introduce a new configuration option `spark.sql.dialect` that picks which dialect with be used for parsing.  For SQLContext this must be set to `sql`.  In `HiveContext` it defaults to `hiveql` but can also be set to `sql`.

The `hql` and `hiveql` methods continue to act the same but are now marked as deprecated.

**This is a possibly breaking change for some users unless they set the dialect manually, though this is unlikely.**

For example: `hiveContex.sql("SELECT 1")` will now throw a parsing exception by default.

Author: Michael Armbrust <michael@databricks.com>

Closes #1746 from marmbrus/sqlLanguageConf and squashes the following commits:

ad375cc [Michael Armbrust] Merge remote-tracking branch 'apache/master' into sqlLanguageConf
20c43f8 [Michael Armbrust] override function instead of just setting the value
7e4ae93 [Michael Armbrust] Deprecate hql() method in favor of a config option, 'spark.sql.dialect'
2014-08-03 12:28:29 -07:00
Michael Armbrust 1a8043739d [SPARK-2739][SQL] Rename registerAsTable to registerTempTable
There have been user complaints that the difference between `registerAsTable` and `saveAsTable` is too subtle.  This PR addresses this by renaming `registerAsTable` to `registerTempTable`, which more clearly reflects what is happening.  `registerAsTable` remains, but will cause a deprecation warning.

Author: Michael Armbrust <michael@databricks.com>

Closes #1743 from marmbrus/registerTempTable and squashes the following commits:

d031348 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
4dff086 [Michael Armbrust] Fix .java files too
89a2f12 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
0b7b71e [Michael Armbrust] Rename registerAsTable to registerTempTable
2014-08-02 18:27:04 -07:00
Michael Armbrust 158ad0bba9 [SPARK-2097][SQL] UDF Support
This patch adds the ability to register lambda functions written in Python, Java or Scala as UDFs for use in SQL or HiveQL.

Scala:
```scala
registerFunction("strLenScala", (_: String).length)
sql("SELECT strLenScala('test')")
```
Python:
```python
sqlCtx.registerFunction("strLenPython", lambda x: len(x), IntegerType())
sqlCtx.sql("SELECT strLenPython('test')")
```
Java:
```java
sqlContext.registerFunction("stringLengthJava", new UDF1<String, Integer>() {
  Override
  public Integer call(String str) throws Exception {
    return str.length();
  }
}, DataType.IntegerType);

sqlContext.sql("SELECT stringLengthJava('test')");
```

Author: Michael Armbrust <michael@databricks.com>

Closes #1063 from marmbrus/udfs and squashes the following commits:

9eda0fe [Michael Armbrust] newline
747c05e [Michael Armbrust] Add some scala UDF tests.
d92727d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into udfs
005d684 [Michael Armbrust] Fix naming and formatting.
d14dac8 [Michael Armbrust] Fix last line of autogened java files.
8135c48 [Michael Armbrust] Move UDF unit tests to pyspark.
40b0ffd [Michael Armbrust] Merge remote-tracking branch 'apache/master' into udfs
6a36890 [Michael Armbrust] Switch logging so that SQLContext can be serializable.
7a83101 [Michael Armbrust] Drop toString
795fd15 [Michael Armbrust] Try to avoid capturing SQLContext.
e54fb45 [Michael Armbrust] Docs and tests.
437cbe3 [Michael Armbrust] Update use of dataTypes, fix some python tests, address review comments.
01517d6 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udfs
8e6c932 [Michael Armbrust] WIP
3f96a52 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udfs
6237c8d [Michael Armbrust] WIP
2766f0b [Michael Armbrust] Move udfs support to SQL from hive. Add support for Java UDFs.
0f7d50c [Michael Armbrust] Draft of native Spark SQL UDFs for Scala and Python.
2014-08-02 16:33:48 -07:00
GuoQiang Li 4c477117bb SPARK-2804: Remove scalalogging-slf4j dependency
This also Closes #1701.

Author: GuoQiang Li <witgo@qq.com>

Closes #1208 from witgo/SPARK-1470 and squashes the following commits:

422646b [GuoQiang Li] Remove scalalogging-slf4j dependency
2014-08-02 13:59:58 -07:00
Patrick Wendell dab37966b0 Revert "[SPARK-1470][SPARK-1842] Use the scala-logging wrapper instead of the directly sfl4j api"
This reverts commit adc8303294.
2014-08-01 23:55:30 -07:00
GuoQiang Li adc8303294 [SPARK-1470][SPARK-1842] Use the scala-logging wrapper instead of the directly sfl4j api
Author: GuoQiang Li <witgo@qq.com>

Closes #1369 from witgo/SPARK-1470_new and squashes the following commits:

66a1641 [GuoQiang Li] IncompatibleResultTypeProblem
73a89ba [GuoQiang Li] Use the scala-logging wrapper instead of the directly sfl4j api.
2014-08-01 23:55:11 -07:00
Yin Huai 49b361298b [SPARK-2523] [SQL] Hadoop table scan bug fixing (fix failing Jenkins maven test)
This PR tries to resolve the broken Jenkins maven test issue introduced by #1439. Now, we create a single query test to run both the setup work and the test query.

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1669 from yhuai/SPARK-2523-fixTest and squashes the following commits:

358af1a [Yin Huai] Make partition_based_table_scan_with_different_serde run atomically.
2014-07-31 13:05:24 -07:00
Michael Armbrust 3072b96026 [SPARK-2743][SQL] Resolve original attributes in ParquetTableScan
Author: Michael Armbrust <michael@databricks.com>

Closes #1647 from marmbrus/parquetCase and squashes the following commits:

a1799b7 [Michael Armbrust] move comment
2a2a68b [Michael Armbrust] Merge remote-tracking branch 'apache/master' into parquetCase
bb35d5b [Michael Armbrust] Fix test case that produced an invalid plan.
e6870bf [Michael Armbrust] Better error message.
539a2e1 [Michael Armbrust] Resolve original attributes in ParquetTableScan
2014-07-31 11:15:25 -07:00
Michael Armbrust 88a519db90 [SPARK-2734][SQL] Remove tables from cache when DROP TABLE is run.
Author: Michael Armbrust <michael@databricks.com>

Closes #1650 from marmbrus/dropCached and squashes the following commits:

e6ab80b [Michael Armbrust] Support if exists.
83426c6 [Michael Armbrust] Remove tables from cache when DROP TABLE is run.
2014-07-30 17:30:51 -07:00
Michael Armbrust 84467468d4 [SPARK-2054][SQL] Code Generation for Expression Evaluation
Adds a new method for evaluating expressions using code that is generated though Scala reflection.  This functionality is configured by the SQLConf option `spark.sql.codegen` and is currently turned off by default.

Evaluation can be done in several specialized ways:
 - *Projection* - Given an input row, produce a new row from a set of expressions that define each column in terms of the input row.  This can either produce a new Row object or perform the projection in-place on an existing Row (MutableProjection).
 - *Ordering* - Compares two rows based on a list of `SortOrder` expressions
 - *Condition* - Returns `true` or `false` given an input row.

For each of the above operations there is both a Generated and Interpreted version.  When generation for a given expression type is undefined, the code generator falls back on calling the `eval` function of the expression class.  Even without custom code, there is still a potential speed up, as loops are unrolled and code can still be inlined by JIT.

This PR also contains a new type of Aggregation operator, `GeneratedAggregate`, that performs aggregation by using generated `Projection` code.  Currently the required expression rewriting only works for simple aggregations like `SUM` and `COUNT`.  This functionality will be extended in a future PR.

This PR also performs several clean ups that simplified the implementation:
 - The notion of `Binding` all expressions in a tree automatically before query execution has been removed.  Instead it is the responsibly of an operator to provide the input schema when creating one of the specialized evaluators defined above.  In cases when the standard eval method is going to be called, binding can still be done manually using `BindReferences`.  There are a few reasons for this change:  First, there were many operators where it just didn't work before.  For example, operators with more than one child, and operators like aggregation that do significant rewriting of the expression. Second, the semantics of equality with `BoundReferences` are broken.  Specifically, we have had a few bugs where partitioning breaks because of the binding.
 - A copy of the current `SQLContext` is automatically propagated to all `SparkPlan` nodes by the query planner.  Before this was done ad-hoc for the nodes that needed this.  However, this required a lot of boilerplate as one had to always remember to make it `transient` and also had to modify the `otherCopyArgs`.

Author: Michael Armbrust <michael@databricks.com>

Closes #993 from marmbrus/newCodeGen and squashes the following commits:

96ef82c [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
f34122d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
67b1c48 [Michael Armbrust] Use conf variable in SQLConf object
4bdc42c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
41a40c9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
de22aac [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
fed3634 [Michael Armbrust] Inspectors are not serializable.
ef8d42b [Michael Armbrust] comments
533fdfd [Michael Armbrust] More logging of expression rewriting for GeneratedAggregate.
3cd773e [Michael Armbrust] Allow codegen for Generate.
64b2ee1 [Michael Armbrust] Implement copy
3587460 [Michael Armbrust] Drop unused string builder function.
9cce346 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
1a61293 [Michael Armbrust] Address review comments.
0672e8a [Michael Armbrust] Address comments.
1ec2d6e [Michael Armbrust] Address comments
033abc6 [Michael Armbrust] off by default
4771fab [Michael Armbrust] Docs, more test coverage.
d30fee2 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
d2ad5c5 [Michael Armbrust] Refactor putting SQLContext into SparkPlan. Fix ordering, other test cases.
be2cd6b [Michael Armbrust] WIP: Remove old method for reference binding, more work on configuration.
bc88ecd [Michael Armbrust] Style
6cc97ca [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
4220f1e [Michael Armbrust] Better config, docs, etc.
ca6cc6b [Michael Armbrust] WIP
9d67d85 [Michael Armbrust] Fix hive planner
fc522d5 [Michael Armbrust] Hook generated aggregation in to the planner.
e742640 [Michael Armbrust] Remove unneeded changes and code.
675e679 [Michael Armbrust] Upgrade paradise.
0093376 [Michael Armbrust] Comment / indenting cleanup.
d81f998 [Michael Armbrust] include schema for binding.
0e889e8 [Michael Armbrust] Use typeOf instead tq
f623ffd [Michael Armbrust] Quiet logging from test suite.
efad14f [Michael Armbrust] Remove some half finished functions.
92e74a4 [Michael Armbrust] add overrides
a2b5408 [Michael Armbrust] WIP: Code generation with scala reflection.
2014-07-29 20:58:05 -07:00
Michael Armbrust 39b8193102 [SPARK-2716][SQL] Don't check resolved for having filters.
For queries like `... HAVING COUNT(*) > 9` the expression is always resolved since it contains no attributes.  This was causing us to avoid doing the Having clause aggregation rewrite.

Author: Michael Armbrust <michael@databricks.com>

Closes #1640 from marmbrus/havingNoRef and squashes the following commits:

92d3901 [Michael Armbrust] Don't check resolved for having filters.
2014-07-29 18:14:20 -07:00
Zongheng Yang c7db274be7 [SPARK-2393][SQL] Cost estimation optimization framework for Catalyst logical plans & sample usage.
The idea is that every Catalyst logical plan gets hold of a Statistics class, the usage of which provides useful estimations on various statistics. See the implementations of `MetastoreRelation`.

This patch also includes several usages of the estimation interface in the planner. For instance, we now use physical table sizes from the estimate interface to convert an equi-join to a broadcast join (when doing so is beneficial, as determined by a size threshold).

Finally, there are a couple minor accompanying changes including:
- Remove the not-in-use `BaseRelation`.
- Make SparkLogicalPlan take a `SQLContext` in the second param list.

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes #1238 from concretevitamin/estimates and squashes the following commits:

329071d [Zongheng Yang] Address review comments; turn config name from string to field in SQLConf.
8663e84 [Zongheng Yang] Use BigInt for stat; for logical leaves, by default throw an exception.
2f2fb89 [Zongheng Yang] Fix statistics for SparkLogicalPlan.
9951305 [Zongheng Yang] Remove childrenStats.
16fc60a [Zongheng Yang] Avoid calling statistics on plans if auto join conversion is disabled.
8bd2816 [Zongheng Yang] Add a note on performance of statistics.
6e594b8 [Zongheng Yang] Get size info from metastore for MetastoreRelation.
01b7a3e [Zongheng Yang] Update scaladoc for a field and move it to @param section.
549061c [Zongheng Yang] Remove numTuples in Statistics for now.
729a8e2 [Zongheng Yang] Update docs to be more explicit.
573e644 [Zongheng Yang] Remove singleton SQLConf and move back `settings` to the trait.
2d99eb5 [Zongheng Yang] {Cleanup, use synchronized in, enrich} StatisticsSuite.
ca5b825 [Zongheng Yang] Inject SQLContext into SparkLogicalPlan, removing SQLConf mixin from it.
43d38a6 [Zongheng Yang] Revert optimization for BroadcastNestedLoopJoin (this fixes tests).
0ef9e5b [Zongheng Yang] Use multiplication instead of sum for default estimates.
4ef0d26 [Zongheng Yang] Make Statistics a case class.
3ba8f3e [Zongheng Yang] Add comment.
e5bcf5b [Zongheng Yang] Fix optimization conditions & update scala docs to explain.
7d9216a [Zongheng Yang] Apply estimation to planning ShuffleHashJoin & BroadcastNestedLoopJoin.
73cde01 [Zongheng Yang] Move SQLConf back. Assign default sizeInBytes to SparkLogicalPlan.
73412be [Zongheng Yang] Move SQLConf to Catalyst & add default val for sizeInBytes.
7a60ab7 [Zongheng Yang] s/Estimates/Statistics, s/cardinality/numTuples.
de3ae13 [Zongheng Yang] Add parquetAfter() properly in test.
dcff9bd [Zongheng Yang] Cleanups.
84301a4 [Zongheng Yang] Refactors.
5bf5586 [Zongheng Yang] Typo.
56a8e6e [Zongheng Yang] Prototype impl of estimations for Catalyst logical plans.
2014-07-29 15:32:50 -07:00
Cheng Lian a7a9d14479 [SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix)
JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Another try for #1399 & #1600. Those two PR breaks Jenkins builds because we made a separate profile `hive-thriftserver` in sub-project `assembly`, but the `hive-thriftserver` module is defined outside the `hive-thriftserver` profile. Thus every time a pull request that doesn't touch SQL code will also execute test suites defined in `hive-thriftserver`, but tests fail because related .class files are not included in the assembly jar.

In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the following commits:

629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile
ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-28 12:07:30 -07:00
Cheng Hao 2b8d89e30e [SPARK-2523] [SQL] Hadoop table scan bug fixing
In HiveTableScan.scala, ObjectInspector was created for all of the partition based records, which probably causes ClassCastException if the object inspector is not identical among table & partitions.

This is the follow up with:
https://github.com/apache/spark/pull/1408
https://github.com/apache/spark/pull/1390

I've run a micro benchmark in my local with 15000000 records totally, and got the result as below:

With This Patch  |  Partition-Based Table  |  Non-Partition-Based Table
------------ | ------------- | -------------
No  |  1927 ms  |  1885 ms
Yes  | 1541 ms  |  1524 ms

It showed this patch will also improve the performance.

PS:  the benchmark code is also attached. (thanks liancheng )
```
package org.apache.spark.sql.hive

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql._

object HiveTableScanPrepare extends App {
  case class Record(key: String, value: String)

  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  val rdd = sparkContext.parallelize((1 to 3000000).map(i => Record(s"$i", s"val_$i")))

  import hiveContext._

  hql("SHOW TABLES")
  hql("DROP TABLE if exists part_scan_test")
  hql("DROP TABLE if exists scan_test")
  hql("DROP TABLE if exists records")
  rdd.registerAsTable("records")

  hql("""CREATE TABLE part_scan_test (key STRING, value STRING) PARTITIONED BY (part1 string, part2 STRING)
                 | ROW FORMAT SERDE
                 | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                 | STORED AS RCFILE
               """.stripMargin)
  hql("""CREATE TABLE scan_test (key STRING, value STRING)
                 | ROW FORMAT SERDE
                 | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                 | STORED AS RCFILE
               """.stripMargin)

  for (part1 <- 2000 until 2001) {
    for (part2 <- 1 to 5) {
      hql(s"""from records
                 | insert into table part_scan_test PARTITION (part1='$part1', part2='2010-01-$part2')
                 | select key, value
               """.stripMargin)
      hql(s"""from records
                 | insert into table scan_test select key, value
               """.stripMargin)
    }
  }
}

object HiveTableScanTest extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  hql("SHOW TABLES")
  val part_scan_test = hql("select key, value from part_scan_test")
  val scan_test = hql("select key, value from scan_test")

  val r_part_scan_test = (0 to 5).map(i => benchmark(part_scan_test))
  val r_scan_test = (0 to 5).map(i => benchmark(scan_test))
  println("Scanning Partition-Based Table")
  r_part_scan_test.foreach(printResult)
  println("Scanning Non-Partition-Based Table")
  r_scan_test.foreach(printResult)

  def printResult(result: (Long, Long)) {
    println(s"Duration: ${result._1} ms Result: ${result._2}")
  }

  def benchmark(srdd: SchemaRDD) = {
    val begin = System.currentTimeMillis()
    val result = srdd.count()
    val end = System.currentTimeMillis()
    ((end - begin), result)
  }
}
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1439 from chenghao-intel/hadoop_table_scan and squashes the following commits:

888968f [Cheng Hao] Fix issues in code style
27540ba [Cheng Hao] Fix the TableScan Bug while partition serde differs
40a24a7 [Cheng Hao] Add Unit Test
2014-07-28 10:59:53 -07:00
Patrick Wendell e5bbce9a60 Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
This reverts commit f6ff2a61d0.
2014-07-27 18:46:58 -07:00
Cheng Lian f6ff2a61d0 [SPARK-2410][SQL] Merging Hive Thrift/JDBC server
(This is a replacement of #1399, trying to fix potential `HiveThriftServer2` port collision between parallel builds. Please refer to [these comments](https://github.com/apache/spark/pull/1399#issuecomment-50212572) for details.)

JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Merging the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1600 from liancheng/jdbc and squashes the following commits:

ac4618b [Cheng Lian] Uses random port for HiveThriftServer2 to avoid collision with parallel builds
090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-27 13:03:38 -07:00
Michael Armbrust 8904791230 [SPARK-2659][SQL] Fix division semantics for hive
Author: Michael Armbrust <michael@databricks.com>

Closes #1557 from marmbrus/fixDivision and squashes the following commits:

b85077f [Michael Armbrust] Fix unit tests.
af98f29 [Michael Armbrust] Change DIV to long type
0c29ae8 [Michael Armbrust] Fix division semantics for hive
2014-07-25 19:17:49 -07:00
Michael Armbrust afd757a241 Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
This reverts commit 06dc0d2c6b.

#1399 is making Jenkins fail.  We should investigate and put this back after its passing tests.

Author: Michael Armbrust <michael@databricks.com>

Closes #1594 from marmbrus/revertJDBC and squashes the following commits:

59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
2014-07-25 15:36:57 -07:00
Cheng Lian 06dc0d2c6b [SPARK-2410][SQL] Merging Hive Thrift/JDBC server
JIRA issue:

- Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
- Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)

Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

(Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)

TODO

- [x] Use `spark-submit` to launch the server, the CLI and beeline
- [x] Migration guideline draft for Shark users

----

Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:

```bash
$ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
```

This actually shows usage information of `SparkSubmit` rather than `BeeLine`.

~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~

**UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert changes to this bug since it involves more subtle considerations and worth a separate PR.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1399 from liancheng/thriftserver and squashes the following commits:

090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-25 12:20:49 -07:00
Cheng Hao 184aa1c6c0 [SPARK-2665] [SQL] Add EqualNS & Unit Tests
Hive Supports the operator "<=>", which returns same result with EQUAL(=) operator for non-null operands, but returns TRUE if both are NULL, FALSE if one of the them is NULL.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1570 from chenghao-intel/equalns and squashes the following commits:

8d6c789 [Cheng Hao] Remove the test case orc_predicate_pushdown
5b2ca88 [Cheng Hao] Add cases into whitelist
8e66cdd [Cheng Hao] Rename the EqualNSTo ==> EqualNullSafe
7af4b0b [Cheng Hao] Add EqualNS & Unit Tests
2014-07-25 01:30:22 -07:00
Michael Armbrust 78d18fdbaa [SPARK-2658][SQL] Add rule for true = 1.
Author: Michael Armbrust <michael@databricks.com>

Closes #1556 from marmbrus/fixBooleanEqualsOne and squashes the following commits:

ad8edd4 [Michael Armbrust] Add rule for true = 1 and false = 0.
2014-07-23 22:52:49 -07:00
witgo 60f0ae3d87 [SPARK-2484][SQL] Build should not run hivecompatibility tests by default.
Author: witgo <witgo@qq.com>

Closes #1403 from witgo/hive_compatibility and squashes the following commits:

4e5ecdb [witgo] The default does not run hive compatibility tests
2014-07-23 18:17:05 -07:00
William Benton e060d3ee2d SPARK-2226: [SQL] transform HAVING clauses with aggregate expressions that aren't in the aggregation list
This change adds an analyzer rule to
  1. find expressions in `HAVING` clause filters that depend on unresolved attributes,
  2. push these expressions down to the underlying aggregates, and then
  3. project them away above the filter.

It also enables the `HAVING` queries in the Hive compatibility suite.

Author: William Benton <willb@redhat.com>

Closes #1497 from willb/spark-2226 and squashes the following commits:

92c9a93 [William Benton] Removed unnecessary import
f1d4f34 [William Benton] Cleanups missed in prior commit
0e1624f [William Benton] Incorporated suggestions from @marmbrus; thanks!
541d4ee [William Benton] Cleanups from review
5a12647 [William Benton] Explanatory comments and stylistic cleanups.
c7f2b2c [William Benton] Whitelist HAVING queries.
29a26e3 [William Benton] Added rule to handle unresolved attributes in HAVING clauses (SPARK-2226)
2014-07-23 16:25:32 -07:00
Cheng Hao 79fe7634f6 [SPARK-2615] [SQL] Add Equal Sign "==" Support for HiveQl
Currently, the "==" in HiveQL expression will cause exception thrown, this patch will fix it.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1522 from chenghao-intel/equal and squashes the following commits:

f62a0ff [Cheng Hao] Add == Support for HiveQl
2014-07-22 18:13:28 -07:00
Aaron Davidson abeacffb7b Fix flakey HiveQuerySuite test
Result may not be returned in the expected order, so relax that constraint.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1514 from aarondav/flakey and squashes the following commits:

e5af823 [Aaron Davidson] Fix flakey HiveQuerySuite test
2014-07-21 14:35:15 -07:00
Cheng Lian cd273a2381 [SPARK-2190][SQL] Specialized ColumnType for Timestamp
JIRA issue: [SPARK-2190](https://issues.apache.org/jira/browse/SPARK-2190)

Added specialized in-memory column type for `Timestamp`. Whitelisted all timestamp related Hive tests except `timestamp_udf`, which is timezone sensitive.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1440 from liancheng/timestamp-column-type and squashes the following commits:

e682175 [Cheng Lian] Enabled more timezone sensitive Hive tests.
53a358f [Cheng Lian] Fixed failed test suites
01b592d [Cheng Lian] Fixed SimpleDateFormat thread safety issue
2a59343 [Cheng Lian] Removed timezone sensitive Hive timestamp tests
45dd05d [Cheng Lian] Added Timestamp specific in-memory columnar representation
2014-07-21 00:46:28 -07:00
Yin Huai df95d82da7 [SPARK-2525][SQL] Remove as many compilation warning messages as possible in Spark SQL
JIRA: https://issues.apache.org/jira/browse/SPARK-2525.

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1444 from yhuai/SPARK-2517 and squashes the following commits:

edbac3f [Yin Huai] Removed some compiler type erasure warnings.
2014-07-16 10:53:59 -07:00
Michael Armbrust bcd0c30c7e [SQL] Whitelist more Hive tests.
Author: Michael Armbrust <michael@databricks.com>

Closes #1396 from marmbrus/moreTests and squashes the following commits:

6660b60 [Michael Armbrust] Blacklist a test that requires DFS command.
8b6001c [Michael Armbrust] Add golden files.
ccd8f97 [Michael Armbrust] Whitelist more tests.
2014-07-15 14:04:01 -07:00
Michael Armbrust 0f98ef1a2c [SPARK-2483][SQL] Fix parsing of repeated, nested data access.
Author: Michael Armbrust <michael@databricks.com>

Closes #1411 from marmbrus/nestedRepeated and squashes the following commits:

044fa09 [Michael Armbrust] Fix parsing of repeated, nested data access.
2014-07-15 14:01:48 -07:00
Yin Huai c0b4cf097d [SPARK-2339][SQL] SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery
Reported by http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
After we get the table from the catalog, because the table has an alias, we will temporarily insert a Subquery. Then, we convert the table alias to lower case no matter if the parser is case sensitive or not.
To see the issue ...
```
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Person(name: String, age: Int)

val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

sqlContext.sql("select PEOPLE.name from people PEOPLE")
```
The plan is ...
```
== Query Plan ==
Project ['PEOPLE.name]
 ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:176
```
You can find that `PEOPLE.name` is not resolved.

This PR introduces three changes.
1.  If a table has an alias, the catalog will not lowercase the alias. If a lowercase alias is needed, the analyzer will do the work.
2.  A catalog has a new val caseSensitive that indicates if this catalog is case sensitive or not. For example, a SimpleCatalog is case sensitive, but
3.  Corresponding unit tests.
With this PR, case sensitivity of database names and table names is handled by the catalog. Case sensitivity of other identifiers are handled by the analyzer.

JIRA: https://issues.apache.org/jira/browse/SPARK-2339

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1317 from yhuai/SPARK-2339 and squashes the following commits:

12d8006 [Yin Huai] Handling case sensitivity correctly. This patch introduces three changes. 1. If a table has an alias, the catalog will not lowercase the alias. If a lowercase alias is needed, the analyzer will do the work. 2. A catalog has a new val caseSensitive that indicates if this catalog is case sensitive or not. For example, a SimpleCatalog is case sensitive, but 3. Corresponding unit tests. With this patch, case sensitivity of database names and table names is handled by the catalog. Case sensitivity of other identifiers is handled by the analyzer.
2014-07-07 17:01:44 -07:00
Takuya UESHIN 1e2c26c83d [SPARK-2328] [SQL] Add execution of SHOW TABLES before TestHive.reset().
`PruningSuite` is executed first of Hive tests unfortunately, `TestHive.reset()` breaks the test environment.
To prevent this, we must run a query before calling reset the first time.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1268 from ueshin/issues/SPARK-2328 and squashes the following commits:

043ceac [Takuya UESHIN] Add execution of `SHOW TABLES` before `TestHive.reset()`.
2014-07-02 10:07:01 -07:00
Cheng Lian 7f196b009d [SPARK-2283][SQL] Reset test environment before running PruningSuite
JIRA issue: [SPARK-2283](https://issues.apache.org/jira/browse/SPARK-2283)

If `PruningSuite` is run right after `HiveCompatibilitySuite`, the first test case fails because `srcpart` table is cached in-memory by `HiveCompatibilitySuite`, but column pruning is not implemented for `InMemoryColumnarTableScan` operator yet.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1221 from liancheng/spark-2283 and squashes the following commits:

dc0b663 [Cheng Lian] SPARK-2283: reset test environment before running PruningSuite
2014-06-25 18:41:47 -07:00
Cheng Lian 8fade8973e [SPARK-2263][SQL] Support inserting MAP<K, V> to Hive tables
JIRA issue: [SPARK-2263](https://issues.apache.org/jira/browse/SPARK-2263)

Map objects were not converted to Hive types before inserting into Hive tables.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1205 from liancheng/spark-2263 and squashes the following commits:

c7a4373 [Cheng Lian] Addressed @concretevitamin's comment
784940b [Cheng Lian] SARPK-2263: support inserting MAP<K, V> to Hive tables
2014-06-25 00:14:34 -07:00
Reynold Xin 0ac71d1284 [SPARK-2225] Turn HAVING without GROUP BY into WHERE.
@willb

Author: Reynold Xin <rxin@apache.org>

Closes #1161 from rxin/having-filter and squashes the following commits:

fa8359a [Reynold Xin] [SPARK-2225] Turn HAVING without GROUP BY into WHERE.
2014-06-20 15:38:02 -07:00
William Benton 171ebb3a82 SPARK-2180: support HAVING clauses in Hive queries
This PR extends Spark's HiveQL support to handle HAVING clauses in aggregations.  The HAVING test from the Hive compatibility suite doesn't appear to be runnable from within Spark, so I added a simple comparable test to `HiveQuerySuite`.

Author: William Benton <willb@redhat.com>

Closes #1136 from willb/SPARK-2180 and squashes the following commits:

3bbaf26 [William Benton] Added casts to HAVING expressions
83f1340 [William Benton] scalastyle fixes
18387f1 [William Benton] Add test for HAVING without GROUP BY
b880bef [William Benton] Added semantic error for HAVING without GROUP BY
942428e [William Benton] Added test coverage for SPARK-2180.
56084cc [William Benton] Add support for HAVING clauses in Hive queries.
2014-06-20 13:41:38 -07:00
Reynold Xin 2f6a835e1a [SPARK-2218] rename Equals to EqualTo in Spark SQL expressions.
Due to the existence of scala.Equals, it is very error prone to name the expression Equals, especially because we use a lot of partial functions and pattern matching in the optimizer.

Note that this sits on top of #1144.

Author: Reynold Xin <rxin@apache.org>

Closes #1146 from rxin/equals and squashes the following commits:

f8583fd [Reynold Xin] Merge branch 'master' of github.com:apache/spark into equals
326b388 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into equals
bd19807 [Reynold Xin] Rename EqualsTo to EqualTo.
81148d1 [Reynold Xin] [SPARK-2218] rename Equals to EqualsTo in Spark SQL expressions.
c4e543d [Reynold Xin] [SPARK-2210] boolean cast on boolean value should be removed.
2014-06-20 00:34:59 -07:00
Reynold Xin 6175640973 [SPARK-2210] cast to boolean on boolean value gets turned into NOT((boolean_condition) = 0)
```
explain select cast(cast(key=0 as boolean) as boolean) aaa from src
```
should be
```
[Physical execution plan:]
[Project [(key#10:0 = 0) AS aaa#7]]
[ HiveTableScan [key#10], (MetastoreRelation default, src, None), None]
```

However, it is currently
```
[Physical execution plan:]
[Project [NOT((key#10=0) = 0) AS aaa#7]]
[ HiveTableScan [key#10], (MetastoreRelation default, src, None), None]
```

Author: Reynold Xin <rxin@apache.org>

Closes #1144 from rxin/booleancast and squashes the following commits:

c4e543d [Reynold Xin] [SPARK-2210] boolean cast on boolean value should be removed.
2014-06-19 23:58:23 -07:00
Yin Huai f397e92eb2 [SPARK-2177][SQL] describe table result contains only one column
```
scala> hql("describe src").collect().foreach(println)

[key                 	string              	None                ]
[value               	string              	None                ]
```

The result should contain 3 columns instead of one. This screws up JDBC or even the downstream consumer of the Scala/Java/Python APIs.

I am providing a workaround. We handle a subset of describe commands in Spark SQL, which are defined by ...
```
DESCRIBE [EXTENDED] [db_name.]table_name
```
All other cases are treated as Hive native commands.

Also, if we upgrade Hive to 0.13, we need to check the results of context.sessionState.isHiveServerQuery() to determine how to split the result. This method is introduced by https://issues.apache.org/jira/browse/HIVE-4545. We may want to set Hive to use JsonMetaDataFormatter for the output of a DDL statement (`set hive.ddl.output.format=json` introduced by https://issues.apache.org/jira/browse/HIVE-2822).

The link to JIRA: https://issues.apache.org/jira/browse/SPARK-2177

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1118 from yhuai/SPARK-2177 and squashes the following commits:

fd2534c [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
b9b9aa5 [Yin Huai] rxin's comments.
e7c4e72 [Yin Huai] Fix unit test.
656b068 [Yin Huai] 100 characters.
6387217 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
8003cf3 [Yin Huai] Generate strings with the format like Hive for unit tests.
9787fff [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
440c5af [Yin Huai] rxin's comments.
f1a417e [Yin Huai] Update doc.
83adb2f [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
366f891 [Yin Huai] Add describe command.
74bd1d4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
342fdf7 [Yin Huai] Split to up to 3 parts.
725e88c [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
bb8bbef [Yin Huai] Split every string in the result of a describe command.
2014-06-19 23:41:38 -07:00
Michael Armbrust 777c5958c4 [SPARK-2191][SQL] Make sure InsertIntoHiveTable doesn't execute more than once.
Author: Michael Armbrust <michael@databricks.com>

Closes #1129 from marmbrus/doubleCreateAs and squashes the following commits:

9c6d9e4 [Michael Armbrust] Fix typo.
5128fe2 [Michael Armbrust] Make sure InsertIntoHiveTable doesn't execute each time you ask for its result.
2014-06-19 14:14:03 -07:00
Reynold Xin 640c294369 [SPARK-2187] Explain should not run the optimizer twice.
@yhuai @marmbrus @concretevitamin

Author: Reynold Xin <rxin@apache.org>

Closes #1123 from rxin/explain and squashes the following commits:

def83b0 [Reynold Xin] Update unit tests for explain.
a9d3ba8 [Reynold Xin] [SPARK-2187] Explain should not run the optimizer twice.
2014-06-18 22:44:12 -07:00
Zongheng Yang e243c5ffac [SPARK-2053][SQL] Add Catalyst expressions for CASE WHEN.
JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2053

This PR adds support for two types of CASE statements present in Hive. The first type is of the form `CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END`, with the semantics like a chain of if statements. The second type is of the form `CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END`, with the semantics like a switch statement on key `a`. Both forms are implemented in `CaseWhen`.

[This link](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ConditionalFunctions) contains more detailed descriptions on their semantics.

Notes / Open issues:

* Please check if any implicit contracts / invariants are broken in the implementations (especially for the operators). I am not very familiar with them and I currently find them tricky to spot.
* We should decide whether or not a non-boolean condition is allowed in a branch of `CaseWhen`. Hive throws a `SemanticException` for this situation and I think it'd be good to mimic it -- the question is where in the whole Spark SQL pipeline should we signal an exception for such a query.

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes #1055 from concretevitamin/caseWhen and squashes the following commits:

4226eb9 [Zongheng Yang] Comment.
79d26fc [Zongheng Yang] Merge branch 'master' into caseWhen
caf9383 [Zongheng Yang] Update a FIXME.
9d26ab8 [Zongheng Yang] Add @transient marker.
788a0d9 [Zongheng Yang] Implement CastNulls, which fixes udf_case and udf_when.
7ef284f [Zongheng Yang] Refactors: remove redundant passes, improve toString, mark transient.
f47ae7b [Zongheng Yang] Modify queries in tests to have shorter golden files.
1c1fbfc [Zongheng Yang] Cleanups per review comments.
7d2b7e2 [Zongheng Yang] Translate CaseKeyWhen to CaseWhen at parsing time.
47d406a [Zongheng Yang] Do toArray once and lazily outside of eval().
bb3d109 [Zongheng Yang] Update scaladoc of a method.
aea3195 [Zongheng Yang] Fix bug that branchesArr is not used; remove unused import.
96870a8 [Zongheng Yang] Turn off scalastyle for some comments.
7392f3a [Zongheng Yang] Minor cleanup.
2cf08bb [Zongheng Yang] Merge branch 'master' into caseWhen
9f84b40 [Zongheng Yang] Add golden outputs from Hive.
db51a85 [Zongheng Yang] Add allCondBooleans check; uncomment tests.
3f9ef0a [Zongheng Yang] Cleanups and bug fixes (mainly in eval() and resolved).
be54bc8 [Zongheng Yang] Rewrite eval() to a low-level implementation. Separate two CASE stmts.
f2bcb9d [Zongheng Yang] WIP
5906f75 [Zongheng Yang] WIP
efd019b [Zongheng Yang] eval() and toString() bug fixes.
7d81e95 [Zongheng Yang] Clean up resolved.
a31d782 [Zongheng Yang] Finish up Case.
2014-06-17 13:30:17 +02:00
Xi Liu f5a4049e53 [SPARK-2164][SQL] Allow Hive UDF on columns of type struct
Author: Xi Liu <xil@conviva.com>

Closes #796 from xiliu82/sqlbug and squashes the following commits:

328dfc4 [Xi Liu] [Spark SQL] remove a temporary function after test
354386a [Xi Liu] [Spark SQL] add test suite for UDF on struct
8fc6f51 [Xi Liu] [SparkSQL] allow UDF on struct
2014-06-17 13:16:02 +02:00
Cheng Lian 237b96bc59 Minor fix: made "EXPLAIN" output to play well with JDBC output format
Fixed the broken JDBC output. Test from Shark `beeline`:

```
beeline> !connect jdbc:hive2://localhost:10000/
scan complete in 2ms
Connecting to jdbc:hive2://localhost:10000/
Enter username for jdbc:hive2://localhost:10000/: lian
Enter password for jdbc:hive2://localhost:10000/:
Connected to: Hive (version 0.12.0)
Driver: Hive (version 0.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000/>
0: jdbc:hive2://localhost:10000/> explain select * from src;
+-------------------------------------------------------------------------------+
|                                     plan                                      |
+-------------------------------------------------------------------------------+
| ExplainCommand [plan#2:0]                                                     |
|  HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None  |
+-------------------------------------------------------------------------------+
2 rows selected (1.386 seconds)
```

Before this change, the output looked something like this:

```
+-------------------------------------------------------------------------------+
|                                     plan                                      |
+-------------------------------------------------------------------------------+
| ExplainCommand [plan#2:0]
 HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None  |
+-------------------------------------------------------------------------------+
```

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1097 from liancheng/multiLineExplain and squashes the following commits:

eb37967 [Cheng Lian] Made output of "EXPLAIN" play well with JDBC output format
2014-06-16 16:42:17 -07:00
Cheng Lian 273afcb254 [SQL][SPARK-2094] Follow up of PR #1071 for Java API
Updated `JavaSQLContext` and `JavaHiveContext` similar to what we've done to `SQLContext` and `HiveContext` in PR #1071. Added corresponding test case for Spark SQL Java API.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1085 from liancheng/spark-2094-java and squashes the following commits:

29b8a51 [Cheng Lian] Avoided instantiating JavaSparkContext & JavaHiveContext to workaround test failure
92bb4fb [Cheng Lian] Marked test cases in JavaHiveQLSuite with "ignore"
22aec97 [Cheng Lian] Follow up of PR #1071 for Java API
2014-06-16 21:32:51 +02:00
Yin Huai 8919685091 [Spark-2137][SQL] Timestamp UDFs broken
https://issues.apache.org/jira/browse/SPARK-2137

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1081 from yhuai/SPARK-2137 and squashes the following commits:

c04f910 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2137
205f17b [Yin Huai] Make Hive UDF wrapper support Timestamp.
2014-06-13 23:28:57 -07:00
Cheng Lian ac96d9657c [SPARK-2094][SQL] "Exactly once" semantics for DDL and command statements
## Related JIRA issues

- Main issue:

  - [SPARK-2094](https://issues.apache.org/jira/browse/SPARK-2094): Ensure exactly once semantics for DDL/Commands

- Issues resolved as dependencies:

  - [SPARK-2081](https://issues.apache.org/jira/browse/SPARK-2081): Undefine output() from the abstract class Command and implement it in concrete subclasses
  - [SPARK-2128](https://issues.apache.org/jira/browse/SPARK-2128): No plan for DESCRIBE
  - [SPARK-1852](https://issues.apache.org/jira/browse/SPARK-1852): SparkSQL Queries with Sorts run before the user asks them to

- Other related issue:

  - [SPARK-2129](https://issues.apache.org/jira/browse/SPARK-2129): NPE thrown while lookup a view

    Two test cases, `join_view` and `mergejoin_mixed`, within the `HiveCompatibilitySuite` are removed from the whitelist to workaround this issue.

## PR Overview

This PR defines physical plans for DDL statements and commands and wraps their side effects in a lazy field `PhysicalCommand.sideEffectResult`, so that they are executed eagerly and exactly once.  Also, as a positive side effect, now DDL statements and commands can be turned into proper `SchemaRDD`s and let user query the execution results.

This PR defines schemas for the following DDL/commands:

- EXPLAIN command

  - `plan`: String, the plan explanation

- SET command

  - `key`: String, the key(s) of the propert(y/ies) being set or queried
  - `value`: String, the value(s) of the propert(y/ies) being queried

- Other Hive native command

  - `result`: String, execution result returned by Hive

  **NOTE**: We should refine schemas for different native commands by defining physical plans for them in the future.

## Examples

### EXPLAIN command

Take the "EXPLAIN" command as an example, we first execute the command and obtain a `SchemaRDD` at the same time, then query the `plan` field with the schema DSL:

```
scala> loadTestTable("src")
...

scala> val q0 = hql("EXPLAIN SELECT key, COUNT(*) FROM src GROUP BY key")
...
q0: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExplainCommandPhysical [plan#11:0]
 Aggregate false, [key#4], [key#4,SUM(PartialCount#6L) AS c_1#2L]
  Exchange (HashPartitioning [key#4:0], 200)
   Exchange (HashPartitioning [key#4:0], 200)
    Aggregate true, [key#4], [key#4,COUNT(1) AS PartialCount#6L]
     HiveTableScan [key#4], (MetastoreRelation default, src, None), None

scala> q0.select('plan).collect()
...
[ExplainCommandPhysical [plan#24:0]
 Aggregate false, [key#17], [key#17,SUM(PartialCount#19L) AS c_1#2L]
  Exchange (HashPartitioning [key#17:0], 200)
   Exchange (HashPartitioning [key#17:0], 200)
    Aggregate true, [key#17], [key#17,COUNT(1) AS PartialCount#19L]
     HiveTableScan [key#17], (MetastoreRelation default, src, None), None]

scala>
```

### SET command

In this example we query all the properties set in `SQLConf`, register the result as a table, and then query the table with HiveQL:

```
scala> val q1 = hql("SET")
...
q1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[7] at RDD at SchemaRDD.scala:98
== Query Plan ==
<SET command: executed by Hive, and noted by SQLContext>

scala> q1.registerAsTable("properties")

scala> hql("SELECT key, value FROM properties ORDER BY key LIMIT 10").foreach(println)
...
== Query Plan ==
TakeOrdered 10, [key#51:0 ASC]
 Project [key#51:0,value#52:1]
  SetCommandPhysical None, None, [key#55:0,value#56:1]), which has no missing parents
14/06/12 12:19:27 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 5 (SchemaRDD[21] at RDD at SchemaRDD.scala:98
== Query Plan ==
TakeOrdered 10, [key#51:0 ASC]
 Project [key#51:0,value#52:1]
  SetCommandPhysical None, None, [key#55:0,value#56:1])
...
[datanucleus.autoCreateSchema,true]
[datanucleus.autoStartMechanismMode,checked]
[datanucleus.cache.level2,false]
[datanucleus.cache.level2.type,none]
[datanucleus.connectionPoolingType,BONECP]
[datanucleus.fixedDatastore,false]
[datanucleus.identifierFactory,datanucleus1]
[datanucleus.plugin.pluginRegistryBundleCheck,LOG]
[datanucleus.rdbms.useLegacyNativeValueStrategy,true]
[datanucleus.storeManagerType,rdbms]

scala>
```

### "Exactly once" semantics

At last, an example of the "exactly once" semantics:

```
scala> val q2 = hql("CREATE TABLE t1(key INT, value STRING)")
...
q2: org.apache.spark.sql.SchemaRDD =
SchemaRDD[28] at RDD at SchemaRDD.scala:98
== Query Plan ==
<Native command: executed by Hive>

scala> table("t1")
...
res9: org.apache.spark.sql.SchemaRDD =
SchemaRDD[32] at RDD at SchemaRDD.scala:98
== Query Plan ==
HiveTableScan [key#58,value#59], (MetastoreRelation default, t1, None), None

scala> q2.collect()
...
res10: Array[org.apache.spark.sql.Row] = Array([])

scala>
```

As we can see, the "CREATE TABLE" command is executed eagerly right after the `SchemaRDD` is created, and referencing the `SchemaRDD` again won't trigger a duplicated execution.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1071 from liancheng/exactlyOnceCommand and squashes the following commits:

d005b03 [Cheng Lian] Made "SET key=value" returns the newly set key value pair
f6c7715 [Cheng Lian] Added test cases for DDL/command statement RDDs
1d00937 [Cheng Lian] Makes SchemaRDD DSLs work for DDL/command statement RDDs
5c7e680 [Cheng Lian] Bug fix: wrong type used in pattern matching
48aa2e5 [Cheng Lian] Refined SQLContext.emptyResult as an empty RDD[Row]
cc64f32 [Cheng Lian] Renamed physical plan classes for DDL/commands
74789c1 [Cheng Lian] Fixed failing test cases
0ad343a [Cheng Lian] Added physical plan for DDL and commands to ensure the "exactly once" semantics
2014-06-13 12:59:48 -07:00
Michael Armbrust 13f8cfdc04 [SPARK-2135][SQL] Use planner for in-memory scans
Author: Michael Armbrust <michael@databricks.com>

Closes #1072 from marmbrus/cachedStars and squashes the following commits:

8757c8e [Michael Armbrust] Use planner for in-memory scans.
2014-06-12 23:09:41 -07:00
Daoyuan ce6deb1e5b [SQL] Code Cleanup: Left Semi Hash Join
Some improvement for PR #837, add another case to white list and use `filter` to build result iterator.

Author: Daoyuan <daoyuan.wang@intel.com>

Closes #1049 from adrian-wang/clean-LeftSemiJoinHash and squashes the following commits:

b314d5a [Daoyuan] change hashSet name
27579a9 [Daoyuan] add semijoin to white list and use filter to create new iterator in LeftSemiJoinBNL

Signed-off-by: Michael Armbrust <michael@databricks.com>
2014-06-11 12:09:42 -07:00
Cheng Lian 0266a0c8a7 [SPARK-1968][SQL] SQL/HiveQL command for caching/uncaching tables
JIRA issue: [SPARK-1968](https://issues.apache.org/jira/browse/SPARK-1968)

This PR added support for SQL/HiveQL command for caching/uncaching tables:

```
scala> sql("CACHE TABLE src")
...
res0: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:98
== Query Plan ==
CacheCommandPhysical src, true

scala> table("src")
...
res1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[3] at RDD at SchemaRDD.scala:98
== Query Plan ==
InMemoryColumnarTableScan [key#0,value#1], (HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None), false

scala> isCached("src")
res2: Boolean = true

scala> sql("CACHE TABLE src")
...
res3: org.apache.spark.sql.SchemaRDD =
SchemaRDD[4] at RDD at SchemaRDD.scala:98
== Query Plan ==
CacheCommandPhysical src, false

scala> table("src")
...
res4: org.apache.spark.sql.SchemaRDD =
SchemaRDD[11] at RDD at SchemaRDD.scala:98
== Query Plan ==
HiveTableScan [key#2,value#3], (MetastoreRelation default, src, None), None

scala> isCached("src")
res5: Boolean = false
```

Things also work for `hql`.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1038 from liancheng/sqlCacheTable and squashes the following commits:

ecb7194 [Cheng Lian] Trimmed the SQL string before parsing special commands
6f4ce42 [Cheng Lian] Moved logical command classes to a separate file
3458a24 [Cheng Lian] Added comment for public API
f0ffacc [Cheng Lian] Added isCached() predicate
15ec6d2 [Cheng Lian] Added "(UN)CACHE TABLE" SQL/HiveQL statements
2014-06-11 00:06:50 -07:00
Zongheng Yang 08ed9ad813 [SPARK-1508][SQL] Add SQLConf to SQLContext.
This PR (1) introduces a new class SQLConf that stores key-value properties for a SQLContext (2) clean up the semantics of various forms of SET commands.

The SQLConf class unlocks user-controllable optimization opportunities; for example, user can now override the number of partitions used during an Exchange. A SQLConf can be accessed and modified programmatically through its getters and setters. It can also be modified through SET commands executed by `sql()` or `hql()`. Note that users now have the ability to change a particular property for different queries inside the same Spark job, unlike settings configured in SparkConf.

For SET commands: "SET" will return all properties currently set in a SQLConf, "SET key" will return the key-value pair (if set) or an undefined message, and "SET key=value" will call the setter on SQLConf, and if a HiveContext is used, it will be executed in Hive as well.

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes #956 from concretevitamin/sqlconf and squashes the following commits:

4968c11 [Zongheng Yang] Very minor cleanup.
d74dde5 [Zongheng Yang] Remove the redundant mkQueryExecution() method.
c129b86 [Zongheng Yang] Merge remote-tracking branch 'upstream/master' into sqlconf
26c40eb [Zongheng Yang] Make SQLConf a trait and have SQLContext mix it in.
dd19666 [Zongheng Yang] Update a comment.
baa5d29 [Zongheng Yang] Remove default param for shuffle partitions accessor.
5f7e6d8 [Zongheng Yang] Add default num partitions.
22d9ed7 [Zongheng Yang] Fix output() of Set physical. Add SQLConf param accessor method.
e9856c4 [Zongheng Yang] Use java.util.Collections.synchronizedMap on a Java HashMap.
88dd0c8 [Zongheng Yang] Remove redundant SET Keyword.
271f0b1 [Zongheng Yang] Minor change.
f8983d1 [Zongheng Yang] Minor changes per review comments.
1ce8a5e [Zongheng Yang] Invoke runSqlHive() in SQLConf#get for the HiveContext case.
b766af9 [Zongheng Yang] Remove a test.
d52e1bd [Zongheng Yang] De-hardcode number of shuffle partitions for BasicOperators (read from SQLConf).
555599c [Zongheng Yang] Bullet-proof (relatively) parsing SET per review comment.
c2067e8 [Zongheng Yang] Mark SQLContext transient and put it in a second param list.
2ea8cdc [Zongheng Yang] Wrap long line.
41d7f09 [Zongheng Yang] Fix imports.
13279e6 [Zongheng Yang] Refactor the logic of eagerly processing SET commands.
b14b83e [Zongheng Yang] In a HiveContext, make SQLConf a subset of HiveConf.
6983180 [Zongheng Yang] Move a SET test to SQLQuerySuite and make it complete.
5b67985 [Zongheng Yang] New line at EOF.
c651797 [Zongheng Yang] Add commands.scala.
efd82db [Zongheng Yang] Clean up semantics of several cases of SET.
c1017c2 [Zongheng Yang] WIP in changing SetCommand to take two Options (for different semantics of SETs).
0f00d86 [Zongheng Yang] Add a test for singleton set command in SQL.
41acd75 [Zongheng Yang] Add a test for hql() in HiveQuerySuite.
2276929 [Zongheng Yang] Fix default hive result for set commands in HiveComparisonTest.
3b0c71b [Zongheng Yang] Remove Parser for set commands. A few other fixes.
d0c4578 [Zongheng Yang] Tmux typo.
0ecea46 [Zongheng Yang] Changes for HiveQl and HiveContext.
ce22d80 [Zongheng Yang] Fix parsing issues.
cb722c1 [Zongheng Yang] Finish up SQLConf patch.
4ebf362 [Zongheng Yang] First cut at SQLConf inside SQLContext.
2014-06-10 00:49:09 -07:00
Zongheng Yang a9ec033c8c [SPARK-1704][SQL] Fully support EXPLAIN commands as SchemaRDD.
This PR attempts to resolve [SPARK-1704](https://issues.apache.org/jira/browse/SPARK-1704) by introducing a physical plan for EXPLAIN commands, which just prints out the debug string (containing various SparkSQL's plans) of the corresponding QueryExecution for the actual query.

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes #1003 from concretevitamin/explain-cmd and squashes the following commits:

5b7911f [Zongheng Yang] Add a regression test.
1bfa379 [Zongheng Yang] Modify output().
719ada9 [Zongheng Yang] Override otherCopyArgs for ExplainCommandPhysical.
4318fd7 [Zongheng Yang] Make all output one Row.
439c6ab [Zongheng Yang] Minor cleanups.
408f574 [Zongheng Yang] SPARK-1704: Add CommandStrategy and ExplainCommandPhysical.
2014-06-09 16:47:44 -07:00
Daoyuan 0cf6002801 [SPARK-1495][SQL]add support for left semi join
Just submit another solution for #395

Author: Daoyuan <daoyuan.wang@intel.com>
Author: Michael Armbrust <michael@databricks.com>
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #837 from adrian-wang/left-semi-join-support and squashes the following commits:

d39cd12 [Daoyuan Wang] Merge pull request #1 from marmbrus/pr/837
6713c09 [Michael Armbrust] Better debugging for failed query tests.
035b73e [Michael Armbrust] Add test for left semi that can't be done with a hash join.
5ec6fa4 [Michael Armbrust] Add left semi to SQL Parser.
4c726e5 [Daoyuan] improvement according to Michael
8d4a121 [Daoyuan] add golden files for leftsemijoin
83a3c8a [Daoyuan] scala style fix
14cff80 [Daoyuan] add support for left semi join
2014-06-09 11:31:36 -07:00
Michael Armbrust 8d210560be [SPARK-2050 - 2][SQL] DIV and BETWEEN should not be case sensitive.
Followup: #989

Author: Michael Armbrust <michael@databricks.com>

Closes #994 from marmbrus/caseSensitiveFunctions2 and squashes the following commits:

9d9c8ed [Michael Armbrust] Fix DIV and BETWEEN.
2014-06-06 11:31:37 -07:00
Cheng Lian 862283e9cc Avoid dynamic dispatching when unwrapping Hive data.
This is a follow up of PR #758.

The `unwrapHiveData` function is now composed statically before actual rows are scanned according to the field object inspector to avoid dynamic dispatching cost.

According to the same micro benchmark used in PR #758, this simple change brings slight performance boost: 2.5% for CSV table and 1% for RCFile table.

```
Optimized version:

CSV: 6870 ms, RCFile: 5687 ms
CSV: 6832 ms, RCFile: 5800 ms
CSV: 6822 ms, RCFile: 5679 ms
CSV: 6704 ms, RCFile: 5758 ms
CSV: 6819 ms, RCFile: 5725 ms

Original version:

CSV: 7042 ms, RCFile: 5667 ms
CSV: 6883 ms, RCFile: 5703 ms
CSV: 7115 ms, RCFile: 5665 ms
CSV: 7020 ms, RCFile: 5981 ms
CSV: 6871 ms, RCFile: 5906 ms
```

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #935 from liancheng/staticUnwrapping and squashes the following commits:

c49c70c [Cheng Lian] Avoid dynamic dispatching when unwrapping Hive data.
2014-06-02 19:20:23 -07:00
Cheng Lian 8f7141fbc0 [SPARK-1368][SQL] Optimized HiveTableScan
JIRA issue: [SPARK-1368](https://issues.apache.org/jira/browse/SPARK-1368)

This PR introduces two major updates:

- Replaced FP style code with `while` loop and reusable `GenericMutableRow` object in critical path of `HiveTableScan`.
- Using `ColumnProjectionUtils` to help optimizing RCFile and ORC column pruning.

My quick micro benchmark suggests these two optimizations made the optimized version 2x and 2.5x faster when scanning CSV table and RCFile table respectively:

```
Original:

[info] CSV: 27676 ms, RCFile: 26415 ms
[info] CSV: 27703 ms, RCFile: 26029 ms
[info] CSV: 27511 ms, RCFile: 25962 ms

Optimized:

[info] CSV: 13820 ms, RCFile: 10402 ms
[info] CSV: 14158 ms, RCFile: 10691 ms
[info] CSV: 13606 ms, RCFile: 10346 ms
```

The micro benchmark loads a 609MB CVS file (structurally similar to the `src` test table) into a normal Hive table with `LazySimpleSerDe` and a RCFile table, then scans these tables respectively.

Preparation code:

```scala
package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanPrepare extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  hql("drop table scan_csv")
  hql("drop table scan_rcfile")

  hql("""create table scan_csv (key int, value string)
        |  row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
        |  with serdeproperties ('field.delim'=',')
      """.stripMargin)

  hql(s"""load data local inpath "${args(0)}" into table scan_csv""")

  hql("""create table scan_rcfile (key int, value string)
        |  row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
        |stored as
        |  inputformat 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
        |  outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
      """.stripMargin)

  hql(
    """
      |from scan_csv
      |insert overwrite table scan_rcfile
      |select scan_csv.key, scan_csv.value
    """.stripMargin)
}
```

Benchmark code:

```scala
package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanBenchmark extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  val scanCsv = hql("select key from scan_csv")
  val scanRcfile = hql("select key from scan_rcfile")

  val csvDuration = benchmark(scanCsv.count())
  val rcfileDuration = benchmark(scanRcfile.count())

  println(s"CSV: $csvDuration ms, RCFile: $rcfileDuration ms")

  def benchmark(f: => Unit) = {
    val begin = System.currentTimeMillis()
    f
    val end = System.currentTimeMillis()
    end - begin
  }
}
```

@marmbrus Please help review, thanks!

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #758 from liancheng/fastHiveTableScan and squashes the following commits:

4241a19 [Cheng Lian] Distinguishes sorted and possibly not sorted operations more accurately in HiveComparisonTest
cf640d8 [Cheng Lian] More HiveTableScan optimisations:
bf0e7dc [Cheng Lian] Added SortedOperation pattern to match *some* definitely sorted operations and avoid some sorting cost in HiveComparisonTest.
6d1c642 [Cheng Lian] Using ColumnProjectionUtils to optimise RCFile and ORC column pruning
eb62fd3 [Cheng Lian] [SPARK-1368] Optimized HiveTableScan
2014-05-29 15:24:03 -07:00
LY Lai 0682567450 [SQL] SPARK-1922
Allow underscore in column name of a struct field https://issues.apache.org/jira/browse/SPARK-1922 .

Author: LY Lai <ly.lai@vpon.com>

Closes #873 from lyuanlai/master and squashes the following commits:

2253263 [LY Lai] Allow underscore in struct field column name
2014-05-27 16:08:38 -07:00
Tathagata Das 7f0cfe47f4 [Hotfix] Blacklisted flaky HiveCompatibility test
`lateral_view_outer` query sometimes returns a different set of 10 rows.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #838 from tdas/hive-test-fix2 and squashes the following commits:

9128a0d [Tathagata Das] Blacklisted flaky HiveCompatibility test.
2014-05-20 10:27:12 -07:00
Michael Armbrust 032d6632ad [SQL] Implement between in hql
Author: Michael Armbrust <michael@databricks.com>

Closes #804 from marmbrus/between and squashes the following commits:

ae24672 [Michael Armbrust] add golden answer.
d9997ef [Michael Armbrust] Implement between in hql.
9bd4433 [Michael Armbrust] Better error on parse failures.
2014-05-16 11:47:00 -07:00
Stevo Slavić e66e31be51 SPARK-1803 Replaced colon in filenames with a dash
This patch replaces colon in several filenames with dash to make these filenames Windows compatible.

Author: Stevo Slavić <sslavic@gmail.com>
Author: Stevo Slavic <sslavic@gmail.com>

Closes #739 from sslavic/SPARK-1803 and squashes the following commits:

3ec66eb [Stevo Slavic] Removed extra empty line which was causing test to fail
b967cc3 [Stevo Slavić] Aligned tests and names of test resources
2b12776 [Stevo Slavić] Fixed a typo in file name
1c5dfff [Stevo Slavić] Replaced colon in file name with dash
8f5bf7f [Stevo Slavić] Replaced colon in file name with dash
c5b5083 [Stevo Slavić] Replaced colon in file name with dash
a49801f [Stevo Slavić] Replaced colon in file name with dash
401d99e [Stevo Slavić] Replaced colon in file name with dash
40a9621 [Stevo Slavić] Replaced colon in file name with dash
4774580 [Stevo Slavić] Replaced colon in file name with dash
004f8bb [Stevo Slavić] Replaced colon in file name with dash
d6a3e2c [Stevo Slavić] Replaced colon in file name with dash
b585126 [Stevo Slavić] Replaced colon in file name with dash
028e48a [Stevo Slavić] Replaced colon in file name with dash
ece0507 [Stevo Slavić] Replaced colon in file name with dash
84f5d2f [Stevo Slavić] Replaced colon in file name with dash
2fc7854 [Stevo Slavić] Replaced colon in file name with dash
9e1467d [Stevo Slavić] Replaced colon in file name with dash
2014-05-15 16:44:14 -07:00
Cheng Hao 3eb53bd59e [WIP][Spark-SQL] Optimize the Constant Folding for Expression
Currently, expression does not support the "constant null" well in constant folding.
e.g. Sum(a, 0) actually always produces Literal(0, NumericType) in runtime.

For example:
```
explain select isnull(key+null)  from src;
== Logical Plan ==
Project [HiveGenericUdf#isnull((key#30 + CAST(null, IntegerType))) AS c_0#28]
 MetastoreRelation default, src, None

== Optimized Logical Plan ==
Project [true AS c_0#28]
 MetastoreRelation default, src, None

== Physical Plan ==
Project [true AS c_0#28]
 HiveTableScan [], (MetastoreRelation default, src, None), None
```

I've create a new Optimization rule called NullPropagation for such kind of constant folding.

Author: Cheng Hao <hao.cheng@intel.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #482 from chenghao-intel/optimize_constant_folding and squashes the following commits:

2f14b50 [Cheng Hao] Fix code style issues
68b9fad [Cheng Hao] Remove the Literal pattern matching for NullPropagation
29c8166 [Cheng Hao] Update the code for feedback of code review
50444cc [Cheng Hao] Remove the unnecessary null checking
80f9f18 [Cheng Hao] Update the UnitTest for aggregation constant folding
27ea3d7 [Cheng Hao] Fix Constant Folding Bugs & Add More Unittests
b28e03a [Cheng Hao] Merge pull request #1 from marmbrus/pr/482
9ccefdb [Michael Armbrust] Add tests for optimized expression evaluation.
543ef9d [Cheng Hao] fix code style issues
9cf0396 [Cheng Hao] update code according to the code review comment
536c005 [Cheng Hao] Add Exceptional case for constant folding
3c045c7 [Cheng Hao] Optimize the Constant Folding by adding more rules
2645d4f [Cheng Hao] Constant Folding(null propagation)
2014-05-07 03:37:12 -04:00
Michael Armbrust 92b2902ca0 Whitelist Hive Tests
This is ready when Jenkins is.

Author: Michael Armbrust <michael@databricks.com>

Closes #596 from marmbrus/moreTests and squashes the following commits:

85be703 [Michael Armbrust] Blacklist MR required tests.
35bc311 [Michael Armbrust] Add hive golden answers.
ede98fd [Michael Armbrust] More hive gitignore
da096ea [Michael Armbrust] update whitelist
2014-05-03 23:13:53 -07:00
Michael Armbrust a43d9c14f2 [SQL] SPARK-1661 - Fix regex_serde test
The JIRA in question is actually reporting a bug with Shark, but I wanted to make sure Spark SQL did not have similar problems.  This fixes a bug in our parsing code that was preventing the test from executing, but it looks like the RegexSerDe is working in Spark SQL.

Author: Michael Armbrust <michael@databricks.com>

Closes #595 from marmbrus/fixRegexSerdeTest and squashes the following commits:

a4dc612 [Michael Armbrust] Add files created by hive to gitignore.
efa6402 [Michael Armbrust] Fix Hive serde_regex test.
2014-05-01 21:32:43 -07:00
Mridul Muralidharan 968c0187a1 SPARK-1586 Windows build fixes
Unfortunately, this is not exhaustive - particularly hive tests still fail due to path issues.

Author: Mridul Muralidharan <mridulm80@apache.org>

This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <matei@databricks.com>

Closes #505 from mridulm/windows_fixes and squashes the following commits:

ef12283 [Mridul Muralidharan] Move to org.apache.commons.lang3 for StringEscapeUtils. Earlier version was buggy appparently
cdae406 [Mridul Muralidharan] Remove leaked changes from > 2G fix branch
3267f4b [Mridul Muralidharan] Fix build failures
35b277a [Mridul Muralidharan] Fix Scalastyle failures
bc69d14 [Mridul Muralidharan] Change from hardcoded path separator
10c4d78 [Mridul Muralidharan] Use explicit encoding while using getBytes
1337abd [Mridul Muralidharan] fix classpath while running in windows
2014-04-24 20:48:33 -07:00
Michael Armbrust aa77f8a6a6 SPARK-1562 Fix visibility / annotation of Spark SQL APIs
Author: Michael Armbrust <michael@databricks.com>

Closes #489 from marmbrus/sqlDocFixes and squashes the following commits:

acee4f3 [Michael Armbrust] Fix visibility / annotation of Spark SQL APIs
2014-04-22 20:02:33 -07:00
Michael Armbrust 273c2fd08d [SQL] SPARK-1424 Generalize insertIntoTable functions on SchemaRDDs
This makes it possible to create tables and insert into them using the DSL and SQL for the scala and java apis.

Author: Michael Armbrust <michael@databricks.com>

Closes #354 from marmbrus/insertIntoTable and squashes the following commits:

6c6f227 [Michael Armbrust] Create random temporary files in python parquet unit tests.
f5e6d5c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into insertIntoTable
765c506 [Michael Armbrust] Add to JavaAPI.
77b512c [Michael Armbrust] typos.
5c3ef95 [Michael Armbrust] use names for boolean args.
882afdf [Michael Armbrust] Change createTableAs to saveAsTable.  Clean up api annotations.
d07d94b [Michael Armbrust] Add tests, support for creating parquet files and hive tables.
fa3fe81 [Michael Armbrust] Make insertInto available on JavaSchemaRDD as well.  Add createTableAs function.
2014-04-15 20:40:40 -07:00
Ahir Reddy c99bcb7fea SPARK-1374: PySpark API for SparkSQL
An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries, with string keys and primitive values (boolean, float, int, long, string) can be converted into a SchemaRDD that supports sql queries.

```
from pyspark.context import SQLContext
sqlCtx = SQLContext(sc)
rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"}, {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
srdd = sqlCtx.applySchema(rdd)
sqlCtx.registerRDDAsTable(srdd, "table1")
srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1")
srdd2.collect()
```
The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": "row2"}, {"f1" : 3, "f2": "row3"}]```

Author: Ahir Reddy <ahirreddy@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #363 from ahirreddy/pysql and squashes the following commits:

0294497 [Ahir Reddy] Updated log4j properties to supress Hive Warns
307d6e0 [Ahir Reddy] Style fix
6f7b8f6 [Ahir Reddy] Temporary fix MIMA checker. Since we now assemble Spark jar with Hive, we don't want to check the interfaces of all of our hive dependencies
3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py
29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and caching PythonRDD
f2312c7 [Ahir Reddy] Moved everything into sql.py
a19afe4 [Ahir Reddy] Doc fixes
6d658ba [Ahir Reddy] Remove the metastore directory created by the HiveContext tests in SparkSQL
521ff6d [Ahir Reddy] Trying to get spark to build with hive
ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins
ded03e7 [Ahir Reddy] Added doc test for HiveContext
22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency
e4da06c [Ahir Reddy] Display message if hive is not built into spark
227a0be [Michael Armbrust] Update API links. Fix Hive example.
58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api.  Minor fixes.
4285340 [Michael Armbrust] Fix building of Hive API Docs.
38a92b0 [Michael Armbrust] Add note to future non-python developers about python docs.
337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build
40491c9 [Ahir Reddy] PR Changes + Method Visibility
1836944 [Michael Armbrust] Fix comments.
e00980f [Michael Armbrust] First draft of python sql programming guide.
b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit test
f98a422 [Ahir Reddy] HiveContexts
79621cf [Ahir Reddy] cleaning up cruft
b406ba0 [Ahir Reddy] doctest formatting
20936a5 [Ahir Reddy] Added tests and documentation
e4d21b4 [Ahir Reddy] Added pyrolite dependency
79f739d [Ahir Reddy] added more tests
7515ba0 [Ahir Reddy] added more tests :)
d26ec5e [Ahir Reddy] added test
e9f5b8d [Ahir Reddy] adding tests
906d180 [Ahir Reddy] added todo explaining cost of creating Row object in python
251f99d [Ahir Reddy] for now only allow dictionaries as input
09b9980 [Ahir Reddy] made jrdd explicitly lazy
c608947 [Ahir Reddy] SchemaRDD now has all RDD operations
725c91e [Ahir Reddy] awesome row objects
55d1c76 [Ahir Reddy] return row objects
4fe1319 [Ahir Reddy] output dictionaries correctly
be079de [Ahir Reddy] returning dictionaries works
cd5f79f [Ahir Reddy] Switched to using Scala SQLContext
e948bd9 [Ahir Reddy] yippie
4886052 [Ahir Reddy] even better
c0fb1c6 [Ahir Reddy] more working
043ca85 [Ahir Reddy] working
5496f9f [Ahir Reddy] doesn't crash
b8b904b [Ahir Reddy] Added schema rdd class
67ba875 [Ahir Reddy] java to python, and python to java
bcc0f23 [Ahir Reddy] Java to python
ab6025d [Ahir Reddy] compiling
2014-04-15 00:07:55 -07:00
Cheng Lian 7dbca68e92 [BUGFIX] In-memory columnar storage bug fixes
Fixed several bugs of in-memory columnar storage to make `HiveInMemoryCompatibilitySuite` pass.

@rxin @marmbrus It is reasonable to include `HiveInMemoryCompatibilitySuite` in this PR, but I didn't, since it significantly increases test execution time. What do you think?

**UPDATE** `HiveCompatibilitySuite` has been made to cache tables in memory. `HiveInMemoryCompatibilitySuite` was removed.

Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #374 from liancheng/inMemBugFix and squashes the following commits:

6ad6d9b [Cheng Lian] Merged HiveCompatibilitySuite and HiveInMemoryCompatibilitySuite
5bdbfe7 [Cheng Lian] Revert 882c538 & 8426ddc, which introduced regression
882c538 [Cheng Lian] Remove attributes field from InMemoryColumnarTableScan
32cc9ce [Cheng Lian] Code style cleanup
99382bf [Cheng Lian] Enable compression by default
4390bcc [Cheng Lian] Report error for any Throwable in HiveComparisonTest
d1df4fd [Michael Armbrust] Remove test tables that might always get created anyway?
ab9e807 [Michael Armbrust] Fix the logged console version of failed test cases to use the new syntax.
1965123 [Michael Armbrust] Don't use coalesce for gathering all data to a single partition, as it does not work correctly with mutable rows.
e36cdd0 [Michael Armbrust] Spelling.
2d0e168 [Michael Armbrust] Run Hive tests in-memory too.
6360723 [Cheng Lian] Made PreInsertionCasts support SparkLogicalPlan and InMemoryColumnarTableScan
c9b0f6f [Cheng Lian] Let InsertIntoTable support InMemoryColumnarTableScan
9c8fc40 [Cheng Lian] Disable compression by default
e619995 [Cheng Lian] Bug fix: incorrect byte order in CompressionScheme.columnHeaderSize
8426ddc [Cheng Lian] Bug fix: InMemoryColumnarTableScan should cache columns specified by the attributes argument
036cd09 [Cheng Lian] Clean up unused imports
44591a5 [Cheng Lian] Bug fix: NullableColumnAccessor.hasNext must take nulls into account
052bf41 [Cheng Lian] Bug fix: should only gather compressibility info for non-null values
95b3301 [Cheng Lian] Fixed bugs in IntegralDelta
2014-04-14 15:22:43 -07:00
Michael Armbrust b5bae849db [SQL] SPARK-1427 Fix toString for SchemaRDD NativeCommands.
Author: Michael Armbrust <michael@databricks.com>

Closes #343 from marmbrus/toStringFix and squashes the following commits:

37198fe [Michael Armbrust] Fix toString for SchemaRDD NativeCommands.
2014-04-07 01:46:50 -07:00
Michael Armbrust 8de038eb36 [SQL] SPARK-1366 Consistent sql function across different types of SQLContexts
Now users who want to use HiveQL should explicitly say `hiveql` or `hql`.

Author: Michael Armbrust <michael@databricks.com>

Closes #319 from marmbrus/standardizeSqlHql and squashes the following commits:

de68d0e [Michael Armbrust] Fix sampling test.
fbe4a54 [Michael Armbrust] Make `sql` always use spark sql parser, users of hive context can now use hql or hiveql to run queries using HiveQL instead.
2014-04-04 21:15:33 -07:00
Michael Armbrust b8f534196f [SQL] SPARK-1333 First draft of java API
WIP: Some work remains...
 * [x] Hive support
 * [x] Tests
 * [x] Update docs

Feedback welcome!

Author: Michael Armbrust <michael@databricks.com>

Closes #248 from marmbrus/javaSchemaRDD and squashes the following commits:

b393913 [Michael Armbrust] @srowen 's java style suggestions.
f531eb1 [Michael Armbrust] Address matei's comments.
33a1b1a [Michael Armbrust] Ignore JavaHiveSuite.
822f626 [Michael Armbrust] improve docs.
ab91750 [Michael Armbrust] Improve Java SQL API: * Change JavaRow => Row * Add support for querying RDDs of JavaBeans * Docs * Tests * Hive support
0b859c8 [Michael Armbrust] First draft of java API.
2014-04-03 15:45:34 -07:00
Cheng Hao 5d1feda217 [SPARK-1360] Add Timestamp Support for SQL
This PR includes:
1) Add new data type Timestamp
2) Add more data type casting base on Hive's Rule
3) Fix bug missing data type in both parsers (HiveQl & SQLParser).

Author: Cheng Hao <hao.cheng@intel.com>

Closes #275 from chenghao-intel/timestamp and squashes the following commits:

df709e5 [Cheng Hao] Move orc_ends_with_nulls to blacklist
24b04b0 [Cheng Hao] Put 3 cases into the black lists(describe_pretty,describe_syntax,lateral_view_outer)
fc512c2 [Cheng Hao] remove the unnecessary data type equality check in data casting
d0d1919 [Cheng Hao] Add more data type for scala reflection
3259808 [Cheng Hao] Add the new Golden files
3823b97 [Cheng Hao] Update the UnitTest cases & add timestamp type for HiveQL
54a0489 [Cheng Hao] fix bug mapping to 0 (which is supposed to be null) when NumberFormatException occurs
9cb505c [Cheng Hao] Fix issues according to PR comments
e529168 [Cheng Hao] Fix bug of converting from String
6fc8100 [Cheng Hao] Update Unit Test & CodeStyle
8a1d4d6 [Cheng Hao] Add DataType for SqlParser
ce4385e [Cheng Hao] Add TimestampType Support
2014-04-03 15:33:17 -07:00
Andre Schumacher fbebaedf26 Spark parquet improvements
A few improvements to the Parquet support for SQL queries:
- Instead of files a ParquetRelation is now backed by a directory, which simplifies importing data from other
  sources
- InsertIntoParquetTable operation now supports switching between overwriting or appending (at least in
  HiveQL)
- tests now use the new API
- Parquet logging can be set to WARNING level (Default)
- Default compression for Parquet files (GZIP, as in parquet-mr)

Author: Andre Schumacher <andre.schumacher@iki.fi>

Closes #195 from AndreSchumacher/spark_parquet_improvements and squashes the following commits:

54df314 [Andre Schumacher] SPARK-1383 [SQL] Improvements to ParquetRelation
2014-04-03 15:31:47 -07:00
Michael Armbrust ea9de658a3 Remove * from test case golden filename.
@rxin mentioned this might cause issues on windows machines.

Author: Michael Armbrust <michael@databricks.com>

Closes #297 from marmbrus/noStars and squashes the following commits:

263122a [Michael Armbrust] Remove * from test case golden filename.
2014-04-01 23:54:38 -07:00
Michael Armbrust f5c418da04 [SQL] SPARK-1372 Support for caching and uncaching tables in a SQLContext.
This doesn't yet support different databases in Hive (though you can probably workaround this by calling `USE <dbname>`).  However, given the time constraints for 1.0 I think its probably worth including this now and extending the functionality in the next release.

Author: Michael Armbrust <michael@databricks.com>

Closes #282 from marmbrus/cacheTables and squashes the following commits:

83785db [Michael Armbrust] Support for caching and uncaching tables in a SQLContext.
2014-04-01 14:45:44 -07:00
Michael Armbrust 32cbdfd288 [SQL] Un-ignore a test that is now passing.
Add golden answer for aforementioned test.

Also, fix golden test generation from sbt/sbt by setting the classpath correctly.

Author: Michael Armbrust <michael@databricks.com>

Closes #244 from marmbrus/partTest and squashes the following commits:

37a33c9 [Michael Armbrust] Un-ignore a test that is now passing, add golden answer for aforementioned test.  Fix golden test generation from sbt/sbt.
2014-03-26 18:19:15 -07:00
Cheng Lian 345825d979 Unified package definition format in Spark SQL
According to discussions in comments of PR #208, this PR unifies package definition format in Spark SQL.

Some broken links in ScalaDoc and typos detected along the way are also fixed.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #225 from liancheng/packageDefinition and squashes the following commits:

75c47b3 [Cheng Lian] Fixed file line length
4f87968 [Cheng Lian] Unified package definition format in Spark SQL
2014-03-26 15:36:18 -07:00
Michael Armbrust 134ace7fea Add more hive compatability tests to whitelist
Author: Michael Armbrust <michael@databricks.com>

Closes #220 from marmbrus/moreTests and squashes the following commits:

223ec35 [Michael Armbrust] Blacklist machine specific test
9c966cc [Michael Armbrust] add more hive compatability tests to whitelist
2014-03-25 09:57:26 -07:00
Michael Armbrust b637f2d91a Unify the logic for column pruning, projection, and filtering of table scans.
This removes duplicated logic, dead code and casting when planning parquet table scans and hive table scans.

Other changes:
 - Fix tests now that we are doing a better job of column pruning (i.e., since pruning predicates are applied before we even start scanning tuples, columns required by these predicates do not need to be included in the output of the scan unless they are also included in the final output of this logical plan fragment).
 - Add rule to simplify trivial filters.  This was required to avoid `WHERE false` from getting pushed into table scans, since `HiveTableScan` (reasonably) refuses to apply partition pruning predicates to non-partitioned tables.

Author: Michael Armbrust <michael@databricks.com>

Closes #213 from marmbrus/strategyCleanup and squashes the following commits:

48ce403 [Michael Armbrust] Move one more bit of parquet stuff into the core SQLContext.
834ce08 [Michael Armbrust] Address comments.
0f2c6f5 [Michael Armbrust] Unify the logic for column pruning, projection, and filtering of table scans for both Hive and Parquet relations.  Fix tests now that we are doing a better job of column pruning.
2014-03-24 22:15:51 -07:00
Michael Armbrust 8043b7bc74 SPARK-1294 Fix resolution of uppercase field names using a HiveContext.
Fixing this bug required the following:
 - Creation of a new logical node that converts a schema to lowercase.
 - Generalization of the subquery eliding rule to also elide this new node
 - Fixing of several places where too tight assumptions were made on the types of `InsertIntoTable` children.
 - I also removed an API that was left in by accident that exposed catalyst data structures, and fix the logic that pushes down filters into hive tables scans to correctly compare attribute references.

Author: Michael Armbrust <michael@databricks.com>

Closes #202 from marmbrus/upperCaseFieldNames and squashes the following commits:

15e5265 [Michael Armbrust] Support for resolving mixed case fields from a reflected schema using HiveQL.
5aa5035 [Michael Armbrust] Remove API that exposes internal catalyst data structures.
9d99cb6 [Michael Armbrust] Attributes should be compared using exprId, not TreeNode.id.
2014-03-24 19:24:22 -07:00
Cheng Lian 8265dc7739 Fixed coding style issues in Spark SQL
This PR addresses various coding style issues in Spark SQL, including but not limited to those mentioned by @mateiz in PR #146.

As this PR affects lots of source files and may cause potential conflicts, it would be better to merge this as soon as possible *after* PR #205 (In-memory columnar representation for Spark SQL) is merged.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #208 from liancheng/fixCodingStyle and squashes the following commits:

fc2b528 [Cheng Lian] Merge branch 'master' into fixCodingStyle
b531273 [Cheng Lian] Fixed coding style issues in sql/hive
0b56f77 [Cheng Lian] Fixed coding style issues in sql/core
fae7b02 [Cheng Lian] Addressed styling issues mentioned by @marmbrus
9265366 [Cheng Lian] Fixed coding style issues in sql/core
3dcbbbd [Cheng Lian] Fixed relative package imports for package catalyst
2014-03-23 15:21:40 -07:00
Michael Armbrust 7e17fe69f9 Add hive test files to repository. Remove download script.
This PR removes our test dependence on files hosted at Berkeley by checking the test queries and answers into the repository.  This should also fix the maven Jenkins build.

I realize this is a *giant* commit.  But size wise its actually pretty small.  We are only looking at ~1.2Mb compressed (~30Mb uncompressed).  Given that we already have a ~80Mb file permanently added to the spark code lineage, I do not think that this will change the developer experience significantly.

Furthermore, I think it is good engineering practice to consider such test support files as "code", since changes to them would indicate a change in functionality.  These files were only excluded from the initial PR as I wanted the diff to be readable.

Author: Michael Armbrust <michael@databricks.com>

Closes #199 from marmbrus/hiveTestFiles and squashes the following commits:

b9b9b17 [Michael Armbrust] Add hive test files to repository.  Remove download script.
2014-03-21 15:05:45 -07:00
Michael Armbrust 9aadcffabd SPARK-1251 Support for optimizing and executing structured queries
This pull request adds support to Spark for working with structured data using a simple SQL dialect, HiveQL and a Scala Query DSL.

*This is being contributed as a new __alpha component__ to Spark and does not modify Spark core or other components.*

The code is broken into three primary components:
 - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
 - Execution (sql/core) - A query planner / execution engine for translating Catalyst’s logical query plans into Spark RDDs.  This component also includes a new public interface, SqlContext, that allows users to execute SQL or structured scala queries against existing RDDs and Parquet files.
 - Hive Metastore Support (sql/hive) - An extension of SqlContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes.  There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.

A more complete design of this new component can be found in [the associated JIRA](https://spark-project.atlassian.net/browse/SPARK-1251).

[An updated version of the Spark documentation, including API Docs for all three sub-components,](http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html) is also available for review.

With this PR comes support for inferring the schema of existing RDDs that contain case classes.  Using this information, developers can now express structured queries that are automatically compiled into RDD operations.

```scala
// Define the schema using a case class.
case class Person(name: String, age: Int)
val people: RDD[Person] =
  sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).toInt))

// The following is the same as 'SELECT name FROM people WHERE age >= 10 && age <= 19'
val teenagers = people.where('age >= 10).where('age <= 19).select('name).toRdd
```

RDDs can also be registered as Tables, allowing SQL queries to be written over them.
```scala
people.registerAsTable("people")
val teenagers = sql("SELECT name FROM people WHERE age >= 10 && age <= 19")
```

The results of queries are themselves RDDs and support standard RDD operations:
```scala
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
```

Finally, with the optional Hive support, users can read and write data located in existing Apache Hive deployments using HiveQL.
```scala
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH 'src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sql("SELECT key, value FROM src").collect().foreach(println)
```

## Relationship to Shark

Unlike Shark, Spark SQL does not act as a drop in replacement for Hive or the HiveServer. Instead this new feature is intended to make it easier for Spark developers to run queries over structured data, using either SQL or the query DSL. After this sub-project graduates from Alpha status it will likely become a new optimizer/backend for the Shark project.

Author: Michael Armbrust <michael@databricks.com>
Author: Yin Huai <huaiyin.thu@gmail.com>
Author: Reynold Xin <rxin@apache.org>
Author: Lian, Cheng <rhythm.mail@gmail.com>
Author: Andre Schumacher <andre.schumacher@iki.fi>
Author: Yin Huai <huai@cse.ohio-state.edu>
Author: Timothy Chen <tnachen@gmail.com>
Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: Timothy Chen <tnachen@apache.org>
Author: Henry Cook <henry.m.cook+github@gmail.com>
Author: Mark Hamstra <markhamstra@gmail.com>

Closes #146 from marmbrus/catalyst and squashes the following commits:

458bd1b [Michael Armbrust] Update people.txt
0d638c3 [Michael Armbrust] Typo fix from @ash211.
bdab185 [Michael Armbrust] Address another round of comments: * Doc examples can now copy/paste into spark-shell. * SQLContext is serializable * Minor parser bugs fixed * Self-joins of RDDs now handled correctly. * Removed deprecated examples * Removed deprecated parquet docs * Made more of the API private * Copied all the DSLQuery tests and rewrote them as SQLQueryTests
778299a [Michael Armbrust] Fix some old links to spark-project.org
fead0b6 [Michael Armbrust] Create a new RDD type, SchemaRDD, that is now the return type for all SQL operations.  This improves the old API by reducing the number of implicits that are required, and avoids throwing away schema information when returning an RDD to the user.  This change also makes it slightly less verbose to run language integrated queries.
fee847b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into catalyst, integrating changes to serialization for ShuffledRDD.
48a99bc [Michael Armbrust] Address first round of feedback.
461581c [Michael Armbrust] Blacklist test that depends on JVM specific rounding behaviour
adcf1a4 [Henry Cook] Update sql-programming-guide.md
9dffbfa [Michael Armbrust] Style fixes. Add downloading of test cases to jenkins.
6978dd8 [Michael Armbrust] update docs, add apache license
1d0eb63 [Michael Armbrust] update changes with spark core
e5e1d6b [Michael Armbrust] Remove travis configuration.
c2efad6 [Michael Armbrust] First draft of SQL documentation.
013f62a [Michael Armbrust] Fix documentation / code style.
c01470f [Michael Armbrust] Clean up example
2f22454 [Michael Armbrust] WIP: Parquet example.
ce8073b [Michael Armbrust] clean up implicits.
f7d992d [Michael Armbrust] Naming / spelling.
9eb0294 [Michael Armbrust] Bring expressions implicits into SqlContext.
d2d9678 [Michael Armbrust] Make sure hive isn't in the assembly jar.  Create a separate, optional Hive assembly that is used when present.
8b35e0a [Michael Armbrust] address feedback, work on DSL
5d71074 [Michael Armbrust] Merge pull request #62 from AndreSchumacher/parquet_file_fixes
f93aa39 [Andre Schumacher] Better handling of path names in ParquetRelation
1a4bbd9 [Michael Armbrust] Merge pull request #60 from marmbrus/maven
3386e4f [Michael Armbrust] Merge pull request #58 from AndreSchumacher/parquet_fixes
3447c3e [Michael Armbrust] Don't override the metastore / warehouse in non-local/test hive context.
7233a74 [Michael Armbrust] initial support for maven builds
f0ba39e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into maven
7386a9f [Michael Armbrust] Initial example programs using spark sql.
aeaef54 [Andre Schumacher] Removing unnecessary Row copying and reverting some changes to MutableRow
7ca4b4e [Andre Schumacher] Improving checks in Parquet tests
5bacdc0 [Andre Schumacher] Moving towards mutable rows inside ParquetRowSupport
54637ec [Andre Schumacher] First part of second round of code review feedback
c2a658d [Michael Armbrust] Merge pull request #55 from marmbrus/mutableRows
ba28849 [Michael Armbrust] code review comments.
d994333 [Michael Armbrust] Remove copies before shuffle, this required changing the default shuffle serialization.
9049cf0 [Michael Armbrust] Extend MutablePair interface to support easy syntax for in-place updates.  Also add a constructor so that it can be serialized out-of-the-box.
959bdf0 [Michael Armbrust] Don't silently swallow all KryoExceptions, only the one that indicates the end of a stream.
d371393 [Michael Armbrust] Add a framework for dealing with mutable rows to reduce the number of object allocations that occur in the critical path.
c9f8fb3 [Michael Armbrust] Merge pull request #53 from AndreSchumacher/parquet_support
3c3f962 [Michael Armbrust] Fix a bug due to array reuse.  This will need to be revisited after we merge the mutable row PR.
7d0f13e [Michael Armbrust] Update parquet support with master.
9d419a6 [Michael Armbrust] Merge remote-tracking branch 'catalyst/catalystIntegration' into parquet_support
0040ae6 [Andre Schumacher] Feedback from code review
1ce01c7 [Michael Armbrust] Merge pull request #56 from liancheng/unapplySeqForRow
70e489d [Cheng Lian] Fixed a spelling typo
6d315bb [Cheng Lian] Added Row.unapplySeq to extract fields from a Row object.
8d5da5e [Michael Armbrust] modify compute-classpath.sh to include datanucleus jars explicitly
99e61fb [Michael Armbrust] Merge pull request #51 from marmbrus/expressionEval
7b9d142 [Michael Armbrust] Update travis to increase permgen size.
da9afbd [Michael Armbrust] Add byte wrappers for hive UDFS.
6fdefe6 [Michael Armbrust] Port sbt improvements from master.
296fe50 [Michael Armbrust] Address review feedback.
d7fbc3a [Michael Armbrust] Several performance enhancements and simplifications of the expression evaluation framework.
3bda72d [Andre Schumacher] Adding license banner to new files
3ac9eb0 [Andre Schumacher] Rebasing to new main branch
c863bed [Andre Schumacher] Codestyle checks
61e3bfb [Andre Schumacher] Adding WriteToFile operator and rewriting ParquetQuerySuite
3321195 [Andre Schumacher] Fixing one import in ParquetQueryTests.scala
3a0a552 [Andre Schumacher] Reorganizing Parquet table operations
18fdc44 [Andre Schumacher] Reworking Parquet metadata in relation and adding CREATE TABLE AS for Parquet tables
75262ee [Andre Schumacher] Integrating operations on Parquet files into SharkStrategies
f347273 [Andre Schumacher] Adding ParquetMetaData extraction, fixing schema projection
6a6bf98 [Andre Schumacher] Added column projections to ParquetTableScan
0f17d7b [Andre Schumacher] Rewriting ParquetRelation tests with RowWriteSupport
a11e364 [Andre Schumacher] Adding Parquet RowWriteSupport
6ad05b3 [Andre Schumacher] Moving ParquetRelation to spark.sql core
eb0e521 [Andre Schumacher] Fixing package names and other problems that came up after the rebase
99a9209 [Andre Schumacher] Expanding ParquetQueryTests to cover all primitive types
b33e47e [Andre Schumacher] First commit of Parquet import of primitive column types
c334386 [Michael Armbrust] Initial support for generating schema's based on case classes.
608a29e [Michael Armbrust] Add hive as a repl dependency
7413ac2 [Michael Armbrust] make test downloading quieter.
4d57d0e [Michael Armbrust] Fix test execution on travis.
5f2963c [Michael Armbrust] naming and continuous compilation fixes.
f5e7492 [Michael Armbrust] Add Apache license.  Make naming more consistent.
3ac9416 [Michael Armbrust] Merge support for working with schema-ed RDDs using catalyst in as a spark subproject.
2225431 [Michael Armbrust] Merge pull request #48 from marmbrus/minorFixes
d393d2a [Michael Armbrust] Review Comments: Add comment to map that adds a sub query.
24eaa79 [Michael Armbrust] fix > 100 chars
6e04e5b [Michael Armbrust] Add insertIntoTable to the DSL.
df88f01 [Michael Armbrust] add a simple test for aggregation
18a861b [Michael Armbrust] Correctly convert nested products into nested rows when turning scala data into catalyst data.
b922511 [Michael Armbrust] Fix insertion of nested types into hive tables.
5fe7de4 [Michael Armbrust] Move table creation out of rule into a separate function.
a430895 [Michael Armbrust] Planning for logical Repartition operators.
532dd37 [Michael Armbrust] Allow the local warehouse path to be specified.
4905b2b [Michael Armbrust] Add more efficient TopK that avoids global sort for logical Sort => StopAfter.
8c01c24 [Michael Armbrust] Move definition of Row out of execution to top level sql package.
c9116a6 [Michael Armbrust] Add combiner to avoid NPE when spark performs external aggregation.
29effad [Michael Armbrust] Include alias in attributes that are produced by overridden tables.
9990ec7 [Michael Armbrust] Merge pull request #28 from liancheng/columnPruning
f22df3a [Michael Armbrust] Merge pull request #37 from yhuai/SerDe
cf4db59 [Lian, Cheng] Added golden answers for PruningSuite
54f165b [Lian, Cheng] Fixed spelling typo in two golden answer file names
2682f72 [Lian, Cheng] Merge remote-tracking branch 'origin/master' into columnPruning
c5a4fab [Lian, Cheng] Merge branch 'master' into columnPruning
f670c8c [Yin Huai] Throw a NotImplementedError for not supported clauses in a CTAS query.
128a9f8 [Yin Huai] Minor changes.
017872c [Yin Huai] Remove stats20 from whitelist.
a1a4776 [Yin Huai] Update comments.
feb022c [Yin Huai] Partitioning key should be case insensitive.
555fb1d [Yin Huai] Correctly set the extension for a text file.
d00260b [Yin Huai] Strips backticks from partition keys.
334aace [Yin Huai] New golden files.
a40d6d6 [Yin Huai] Loading the static partition specified in a INSERT INTO/OVERWRITE query.
428aff5 [Yin Huai] Distinguish `INSERT INTO` and `INSERT OVERWRITE`.
eea75c5 [Yin Huai] Correctly set codec.
45ffb86 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
e089627 [Yin Huai] Code style.
563bb22 [Yin Huai] Set compression info in FileSinkDesc.
35c9a8a [Michael Armbrust] Merge pull request #46 from marmbrus/reviewFeedback
bdab5ed [Yin Huai] Add a TODO for loading data into partitioned tables.
5495fab [Yin Huai] Remove cloneRecords which is no longer needed.
1596e1b [Yin Huai] Cleanup imports to make IntelliJ happy.
3bb272d [Michael Armbrust] move org.apache.spark.sql package.scala to the correct location.
8506c17 [Michael Armbrust] Address review feedback.
3cb4f2e [Michael Armbrust] Merge pull request #45 from tnachen/master
9ad474d [Michael Armbrust] Merge pull request #44 from marmbrus/sampling
566fd66 [Timothy Chen] Whitelist tests and add support for Binary type
69adf72 [Yin Huai] Set cloneRecords to false.
a9c3188 [Timothy Chen] Fix udaf struct return
346f828 [Yin Huai] Move SharkHadoopWriter to the correct location.
59e37a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
ed3a1d1 [Yin Huai] Load data directly into Hive.
7f206b5 [Michael Armbrust] Add support for hive TABLESAMPLE PERCENT.
b6de691 [Michael Armbrust] Merge pull request #43 from liancheng/fixMakefile
1f6260d [Lian, Cheng] Fixed package name and test suite name in Makefile
5ae010f [Michael Armbrust] Merge pull request #42 from markhamstra/non-ascii
678341a [Mark Hamstra] Replaced non-ascii text
887f928 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
1f7d00a [Reynold Xin] Merge pull request #41 from marmbrus/splitComponents
7588a57 [Michael Armbrust] Break into 3 major components and move everything into the org.apache.spark.sql package.
bc9a12c [Michael Armbrust] Move hive test files.
5720d2b [Lian, Cheng] Fixed comment typo
f0c3742 [Lian, Cheng] Refactored PhysicalOperation
f235914 [Lian, Cheng] Test case udf_regex and udf_like need BooleanWritable registered
cf691df [Lian, Cheng] Added the PhysicalOperation to generalize ColumnPrunings
2407a21 [Lian, Cheng] Added optimized logical plan to debugging output
a7ad058 [Michael Armbrust] Merge pull request #40 from marmbrus/includeGoldens
9329820 [Michael Armbrust] add golden answer files to repository
dce0593 [Michael Armbrust] move golden answer to the source code directory.
964368f [Michael Armbrust] Merge pull request #39 from marmbrus/lateralView
7785ee6 [Michael Armbrust] Tighten visibility based on comments.
341116c [Michael Armbrust] address comments.
0e6c1d7 [Reynold Xin] Merge pull request #38 from yhuai/parseDBNameInCTAS
2897deb [Michael Armbrust] fix scaladoc
7123225 [Yin Huai] Correctly parse the db name and table name in INSERT queries.
b376d15 [Michael Armbrust] fix newlines at EOF
5cc367c [Michael Armbrust] use berkeley instead of cloudbees
ff5ea3f [Michael Armbrust] new golden
db92adc [Michael Armbrust] more tests passing. clean up logging.
740febb [Michael Armbrust] Tests for tgfs.
0ce61b0 [Michael Armbrust] Docs for GenericHiveUdtf.
ba8897f [Michael Armbrust] Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView
dd00b7e [Michael Armbrust] initial implementation of generators.
ea76cf9 [Michael Armbrust] Add NoRelation to planner.
bea4b7f [Michael Armbrust] Add SumDistinct.
016b489 [Michael Armbrust] fix typo.
acb9566 [Michael Armbrust] Correctly type attributes of CTAS.
8841eb8 [Michael Armbrust] Rename Transform -> ScriptTransformation.
02ff8e4 [Yin Huai] Correctly parse the db name and table name in a CTAS query.
5e4d9b4 [Michael Armbrust] Merge pull request #35 from marmbrus/smallFixes
5479066 [Reynold Xin] Merge pull request #36 from marmbrus/partialAgg
8017afb [Michael Armbrust] fix copy paste error.
dc6353b [Michael Armbrust] turn off deprecation
cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance.
883006d [Michael Armbrust] improve tests.
32b615b [Michael Armbrust] add override to asPartial.
e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe.
f94345c [Michael Armbrust] fix doc link
d8cb805 [Michael Armbrust] Implement partial aggregation.
ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings.  Remove a blank line.
b4be6a5 [Michael Armbrust] better logging when applying rules.
67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex
cb57459 [Michael Armbrust] blacklist machine specific test.
2f27604 [Michael Armbrust] Address comments / style errors.
389525d [Michael Armbrust] update golden, blacklist mr.
e3c10bd [Michael Armbrust] update whitelist.
44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex
42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs.
ab5bff3 [Michael Armbrust] Support for get item of map types.
1679554 [Michael Armbrust] add toString for if and IS NOT NULL.
ab9a131 [Michael Armbrust] when UDFs fail they should return null.
25288d0 [Michael Armbrust] Implement [] for arrays and maps.
e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions.
010accb [Michael Armbrust] add tinyint to metastore type parser.
7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes.
ac9d7de [Michael Armbrust] Resolve *s in Transform clauses.
692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables.
92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects
9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector.
72a003d [Michael Armbrust] revert regex change
7661b6c [Michael Armbrust] blacklist machines specific tests
aa430e7 [Michael Armbrust] Update .travis.yml
e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs.
5e54aa6 [Michael Armbrust] quotes for struct field names.
bbec500 [Michael Armbrust] update test coverage, new golden
3734a94 [Michael Armbrust] only quote string types.
3f9e519 [Michael Armbrust] use names w/ boolean args
5b3d2c8 [Michael Armbrust] implement distinct.
5b33216 [Michael Armbrust] work on decimal support.
2c6deb3 [Michael Armbrust] improve printing compatibility.
35a70fb [Michael Armbrust] multi-letter field names.
a9388fb [Michael Armbrust] printing for map types.
c3feda7 [Michael Armbrust] use toArray.
c654f19 [Michael Armbrust] Support for list and maps in hive table scan.
cf8d992 [Michael Armbrust] Use built in functions for creating temp directory.
1579eec [Michael Armbrust] Only cast unresolved inserts.
6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression.
da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test.  Not sure how this was passing before...
6709441 [Michael Armbrust] Evaluation for accessing nested fields.
dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation.
d670e41 [Michael Armbrust] Print nested fields like hive does.
efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan.
9c22b4e [Michael Armbrust] Support for parsing nested types.
82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables
ea6f37f [Michael Armbrust] fix style.
7845364 [Michael Armbrust] deactivate concurrent test.
b649c20 [Michael Armbrust] fix test logging / caching.
1590568 [Michael Armbrust] add log4j.properties
19bfd74 [Michael Armbrust] store hive output in circular buffer
dfb67aa [Michael Armbrust] add test case
cb775ac [Michael Armbrust] get rid of SharkContext singleton
2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master
63003e9 [Michael Armbrust] Fix spacing.
41b41f3 [Michael Armbrust] Only cast unresolved inserts.
6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs
5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator
b1151a8 [Timothy Chen] Fix load data regex
8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API.
e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction
45b334b [Yin Huai] fix comments
235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning.
6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style
271e483 [Michael Armbrust] Update build status icon.
d3a3d48 [Michael Armbrust] add testing to travis
807b2d7 [Michael Armbrust] check style and publish docs with travis
d20b565 [Michael Armbrust] fix if style
bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide.
d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests.  This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive.  These are used by default when HIVE_HOME is not set.
f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin
41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file).
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference
7213a2c [Reynold Xin] style fix for Hive.scala.
08e4d05 [Reynold Xin] First round of style cleanup.
605255e [Reynold Xin] Added scalastyle checker.
61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases
2486fb7 [Lian, Cheng] Fixed spelling
8ee41be [Lian, Cheng] Minor refactoring
ebb56fa [Michael Armbrust] add travis config
4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests
d4f539a [Michael Armbrust] blacklist mr and user specific tests.
677eb07 [Michael Armbrust] Update test whitelist.
5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning
c263c84 [Michael Armbrust] Only push predicates into partitioned table scans.
ab77882 [Michael Armbrust] upgrade spark to RC5.
c98ede5 [Lian, Cheng] Response to comments from @marmbrus
83d4520 [Yin Huai] marmbrus's comments
70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes
9ebff47 [Yin Huai] remove unnecessary .toSeq
e811d1a [Yin Huai] markhamstra's comments
4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations.
040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators.
9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions.
e9347fc [Michael Armbrust] Remove broken scaladoc links.
99c6707 [Michael Armbrust] upgrade spark
57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable
cb49af0 [Lian, Cheng] Fixed Scaladoc links
4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion
111ffdc [Lian, Cheng] More comments and minor reformatting
9e0d840 [Lian, Cheng] Added partition pruning optimization
761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan
04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
9dd3b26 [Michael Armbrust] Fix scaladoc.
6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst
7c92a41 [Lian, Cheng] Added Hive SerDe support
ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
2957f31 [Yin Huai] addressed comments on PR
907db68 [Michael Armbrust] Space after while.
04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts
4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile
5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23
fd084a4 [Michael Armbrust] implement casts binary <=> string.
0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type
548e479 [Yin Huai] merge master into exchangeOperator and fix code style
5b11db0 [Reynold Xin] Added Void to Boolean type widening.
9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear.
9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion
a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20
b20a4d4 [Lian, Cheng] Fix issue #20
6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc.
4285962 [Michael Armbrust] Remove temporary test cases
167162f [Michael Armbrust] more merge errors, cleanup.
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge.
6377d0b [Michael Armbrust] Drop empty files, fix if ().
c0b0e60 [Michael Armbrust] cleanup broken doc links.
330a88b [Michael Armbrust] Fix bugs in AddExchange.
4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering.
043e296 [Michael Armbrust] Make physical union nodes variadic.
ece15e1 [Michael Armbrust] update unit tests
5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width.
9804eb5 [Michael Armbrust] upgrade spark
053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow
5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow
ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes
bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg
563053f [Michael Armbrust] Address @rxin's comments.
6537c66 [Michael Armbrust] Address @rxin's comments.
2a76fc6 [Michael Armbrust] add notes from @rxin.
685bfa1 [Michael Armbrust] fix spelling
69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions.
7859a86 [Michael Armbrust] Remove SparkAggregate.  Its kinda broken and breaks RDD lineage.
fc22e01 [Michael Armbrust] whitelist newly passing union test.
3f547b8 [Michael Armbrust] Add support for widening types in unions.
53b95f8 [Michael Armbrust] coercion should not occur until children are resolved.
b892e32 [Michael Armbrust] Union is not resolved until the types match up.
95ab382 [Michael Armbrust] Use resolved instead of custom function.  This is better because some nodes override the notion of resolved.
81a109d [Michael Armbrust] fix link.
f143f61 [Michael Armbrust] Implement sampling.  Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!"
6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar.
c800798 [Michael Armbrust] Add build status icon.
0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown
05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row.
f2fdd77 [Michael Armbrust] fix required distribtion for aggregate.
658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala
583a337 [Michael Armbrust] break apart distribution and partitioning.
e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator
0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links.
73c70de [Yin Huai] add a first set of unit tests for data properties.
fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements.
2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans.
fcbc03b [Michael Armbrust] Fix if ().
7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators.
b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes
b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization.
e286d20 [Michael Armbrust] address code review comments.
80d0681 [Michael Armbrust] fix scaladoc links.
de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString.
3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode.
404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation.
fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution.
2abb0bc [Michael Armbrust] better debug messages, use exists.
098dfc4 [Michael Armbrust] Implement Long sorting again.
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable.
a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString.
dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode.
037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation.
ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes.
b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen
83adb9d [Yin Huai] add DataProperty
5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics.
f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator.  This can probably also be use for codegen...
6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet.
ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts.
058ec15 [Michael Armbrust] handle more writeables.
ffa9f25 [Michael Armbrust] blacklist some more MR tests.
aa2239c [Michael Armbrust] filter test lines containing Owner:
f71a325 [Michael Armbrust] Update golden jar.
a3003ae [Michael Armbrust] Update makefile to use better sharding support.
568d150 [Michael Armbrust] Updates to white/blacklist.
8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right.
c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure.  See comments for details.
09c6300 [Michael Armbrust] Add nullability information to StructFields.
5460b2d [Michael Armbrust] load srcpart by default.
3695141 [Michael Armbrust] Lots of parser improvements.
965ac9a [Michael Armbrust] Add expressions that allow access into complex types.
3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences.
8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning.
e57f97a [Michael Armbrust] more decimal/null support.
e1440ed [Michael Armbrust] Initial support for function specific type conversions.
1814ed3 [Michael Armbrust] use childrenResolved function.
f2ec57e [Michael Armbrust] Begin supporting decimal.
6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs
7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters.
d0124f3 [Michael Armbrust] Correctly type null literals.
b65626e [Michael Armbrust] Initial support for parsing BigDecimal.
a90efda [Michael Armbrust] utility function for outputing string stacktraces.
7102f33 [Michael Armbrust] methods with side-effects should use ().
3ccaef7 [Michael Armbrust] add renaming TODO.
bc282c7 [Michael Armbrust] fix bug in getNodeNumbered
c8e89d5 [Michael Armbrust] memoize inputSet calculation.
6aefa46 [Michael Armbrust] Skip folding literals.
a72e540 [Michael Armbrust] Add IN operator.
04f885b [Michael Armbrust] literals are only non-nullable if they are not null.
35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output.
12fd52d [Michael Armbrust] support for sorting longs.
0606520 [Michael Armbrust] drop old comment.
859200a [Michael Armbrust] support for reading more types from the metastore.
1fedd18 [Michael Armbrust] coercion from null to numeric types
71e902d [Michael Armbrust] fix test cases.
cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer
8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment
86355a6 [Michael Armbrust] throw error if there are unexpected join clauses.
c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute.
0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling
a92919d [Michael Armbrust] add alter view as to native commands
f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT
f0faa26 [Michael Armbrust] add sample and distinct operators.
ef7b943 [Michael Armbrust] add metastore support for float
e9f4588 [Michael Armbrust] fix > 100 char.
755b229 [Michael Armbrust] blacklist some ddl tests.
9ae740a [Michael Armbrust] blacklist more tests that require MR.
4cfc11a [Michael Armbrust] more test coverage.
0d9d56a [Michael Armbrust] add more native commands to parser
78d730d [Michael Armbrust] Load src test table on RESET.
8364ec2 [Michael Armbrust] whitelist all possible partition values.
b01468d [Michael Armbrust] support path rewrites when the query begins with a comment.
4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail.
4c5fb0f [Michael Armbrust] makefile target for building new whitelist.
4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO.
516481c [Michael Armbrust] Ignore requests to explain native commands.
68aa2e6 [Michael Armbrust] Stronger type for Token extractor.
ca4ea26 [Michael Armbrust] Support for parsing UDF(*).
1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset.
9627616 [Michael Armbrust] Use current database as default database.
9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode.
6f64cee [Michael Armbrust] don't line wrap string literal
eafaeed [Michael Armbrust] add type documentation
f54c94c [Michael Armbrust] make golden answers file a test dependency
5362365 [Michael Armbrust] push conditions into join
0d2388b [Michael Armbrust] Point at databricks hosted scaladoc.
73b29cd [Michael Armbrust] fix bad casting
9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes
7eff191 [Michael Armbrust] link all the expression names.
83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules
9de6b74 [Michael Armbrust] fix language feature and deprecation warnings.
0b1960a [Michael Armbrust] Fix broken scala doc links / warnings.
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions
01c00c2 [Michael Armbrust] new golden
5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching
66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork
1a393da [Yin Huai] folded -> foldable
1e964ea [Yin Huai] update
a43d41c [Michael Armbrust] more tests passing!
8ca38d0 [Michael Armbrust] begin support for varchar / binary types.
ab8bbd1 [Michael Armbrust] parsing % operator
c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests.
3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline.
5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
367fb9e [Yin Huai] update
0cd5cc6 [Michael Armbrust] add BIGINT cast parsing
61b266f [Michael Armbrust] comment for eliminate subqueries.
d72a5a2 [Michael Armbrust] add long to literal factory object.
b3bd15f [Michael Armbrust] blacklist more mr requiring tests.
e06fd38 [Michael Armbrust] black list map reduce tests.
8e7ce30 [Michael Armbrust] blacklist some env specific tests.
6250cbd [Michael Armbrust] Do not exit on test failure
b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath.
b6e4899 [Yin Huai] formatting
e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12
5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing?
9e190f5 [Michael Armbrust] drop unneeded ()
68b58c1 [Michael Armbrust] drop a few more tests.
b0aa400 [Michael Armbrust] update whitelist.
c99012c [Michael Armbrust] skip tests with hooks
db00ebf [Michael Armbrust] more types for hive udfs
dbc3678 [Michael Armbrust] update ghpages repo
138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure.
6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests.
46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel.  usage: "make -j 8 -i"
8d47ed4 [Yin Huai] comment
2795f05 [Yin Huai] comment
e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer
2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
0bd1688 [Yin Huai] update
6a7bd75 [Michael Armbrust] fix partition column delimiter configuration.
e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0.
b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean
52864da [Reynold Xin] Added executeCollect method to SharkPlan.
f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan.
b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException.
38124bd [Yin Huai] formatting
2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively
555d839 [Reynold Xin] More cleaning ...
d48d0e1 [Reynold Xin] Code review feedback.
aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency.
479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if)
da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename
e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope.
72426ed [Reynold Xin] Rename shark2 package to execution.
0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename
e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore
3f9fee1 [Michael Armbrust] rewrite push filter through join optimization.
c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory.
c9777d8 [Reynold Xin] Put all source files in a catalyst directory.
019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files.
80ca4be [Timothy Chen] Address comments
0079392 [Michael Armbrust] support for multiple insert commands in a single query
75b5a01 [Michael Armbrust] remove space.
4283400 [Timothy Chen] Add limited predicate push down
e547e50 [Michael Armbrust] implement First.
e77c9b6 [Michael Armbrust] more work on unique join.
c795e06 [Michael Armbrust] improve star expansion
a26494e [Michael Armbrust] allow aliases to have qualifiers
d078333 [Michael Armbrust] remove extra space
a75c023 [Michael Armbrust] implement Coalesce
3a018b6 [Michael Armbrust] fix up docs.
ab6f67d [Michael Armbrust] import the string "null" as actual null.
5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved.
191ce3e [Michael Armbrust] analyze rewrite test query.
60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved.
2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues.
e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD
c086a35 [Michael Armbrust] docs, spacing
c4060e4 [Michael Armbrust] cleanup
3b85462 [Michael Armbrust] more tests passing
bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data.
c944a95 [Michael Armbrust] First aggregate expression.
1e28311 [Michael Armbrust] make tests execute in alpha order again
a287481 [Michael Armbrust] spelling
8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing.
a6ab6c7 [Michael Armbrust] add !=
4529594 [Michael Armbrust] draft of coalesce
70f253f [Michael Armbrust] more tests passing!
7349e7b [Michael Armbrust] initial support for test thrift table
d3c9305 [Michael Armbrust] fix > 100 char line
93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE"
06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths
6355d0e [Michael Armbrust] match actual return type of count with expected
cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty.
fd4b096 [Michael Armbrust] fix casing of null strings as well.
4632695 [Michael Armbrust] support for megastore bigint
67b88cf [Michael Armbrust] more verbose debugging of evaluation return types
c680e0d [Michael Armbrust] Failed string => number conversion should return null.
2326be1 [Michael Armbrust] make getClauses case insensitive.
dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types.
045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
fb5ddfd [Michael Armbrust] move ViewExamples to examples/
83833e8 [Michael Armbrust] more tests passing!
47c98d6 [Michael Armbrust] add query tests for like and hash.
1724c16 [Michael Armbrust] clear lines that contain last updated times.
cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse.
9b2642b [Michael Armbrust] make the blacklist support regexes
1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs
910e33e [Michael Armbrust] basic support for building an assembly jar.
d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore.
495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf.
65f4e69 [Michael Armbrust] remove incorrect comments
0831a3c [Michael Armbrust] support for parsing some operator udfs.
6c27aa7 [Michael Armbrust] more cast parsing.
43db061 [Michael Armbrust] significant generalization of hive udf functionality.
3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines.
e5690a6 [Michael Armbrust] add BinaryType
adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called.
d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]).
8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage.
21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function.
0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons.
2b70abf [Michael Armbrust] true and false literals.
ef8b0a5 [Michael Armbrust] more tests.
14d070f [Michael Armbrust] add support for correctly extracting partition keys.
0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
69a0bd4 [Michael Armbrust] promote strings in predicates with number too.
3946e31 [Michael Armbrust] don't build strings unless assertion fails.
90c453d [Michael Armbrust] more tests passing!
6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting.
8000504 [Michael Armbrust] Improve type coercion.
9087152 [Michael Armbrust] fix toString of Not.
58b111c [Michael Armbrust] fix bad scaladoc tag.
d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there.
ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness.
1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types.
d873b2b [Yin Huai] Remove numbers associated with test cases.
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown
d1e7b8e [Michael Armbrust] Update README.md
c8b1553 [Michael Armbrust] Update README.md
9307ef9 [Michael Armbrust] update list of passing tests.
934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers.
a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now.
ae0024a [Yin Huai] update
cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions
21976ae [Yin Huai] update
b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions
dedbf0c [Yin Huai] support Boolean literals
eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals
37817b5 [Yin Huai] add a comment to EvaluateLiterals.
468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is.
b1d1843 [Michael Armbrust] more work on big data benchmark tests.
cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark
7d7fa9f [Michael Armbrust] support for create table as
5f54f03 [Michael Armbrust] parsing for ASC
d42b725 [Michael Armbrust] Sum of strings requires cast
34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.)
81659cb [Michael Armbrust] implement transform operator.
5cd76d6 [Michael Armbrust] break up the file based test case code for reuse
1031b65 [Michael Armbrust] support for case insensitive resolution.
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots)
b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt
d9d18b4 [Michael Armbrust] debug logging implicit.
669089c [Yin Huai] support Boolean literals
ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals
73a05fd [Yin Huai] add a comment to EvaluateLiterals.
191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is.
80039cc [Yin Huai] Merge pull request #1 from yhuai/master
cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide
5c518e4 [Michael Armbrust] fix bug in test.
b50dd0e [Michael Armbrust] fix return type of overloaded method
05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview.
8c60cc0 [Michael Armbrust] Update README.md
03b9526 [Michael Armbrust] First draft of optimizer tests.
f392755 [Michael Armbrust] Add flatMap to TreeNode
6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings
15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions.
06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression.
4b0a888 [Michael Armbrust] implement string comparison and more casts.
356b321 [Michael Armbrust] Update README.md
3776395 [Michael Armbrust] Update README.md
304d17d [Michael Armbrust] Create README.md
b7d8be0 [Michael Armbrust] more tests passing.
b82481f [Michael Armbrust] add todo comment.
02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist.
cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support.
c43a259 [Michael Armbrust] comments
15ff448 [Michael Armbrust] better error message when a dsl test throws an exception
76ec650 [Michael Armbrust] fix join conditions
e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan.
91573a4 [Michael Armbrust] initial type promotion
e2ef4a5 [Michael Armbrust] logging
e43dc1e [Michael Armbrust] add string => int cast evaluation
f1f7e96 [Michael Armbrust] fix incorrect generation of join keys
2b27230 [Michael Armbrust] add depth based subtree access
0f6279f [Michael Armbrust] broken tests.
389bc0b [Michael Armbrust] support for partitioned columns in output.
12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name.
b67a225 [Michael Armbrust] better errors when types don't match up.
9e74808 [Michael Armbrust] add children resolved.
6d03ce9 [Michael Armbrust] defaults for unresolved relation
2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions
be5ae2c [Michael Armbrust] better resolution logging
cb7b5af [Michael Armbrust] views example
420e05b [Michael Armbrust] more tests passing!
6916c63 [Michael Armbrust] Reading from partitioned hive tables.
a1245f9 [Michael Armbrust] more tests passing
956e760 [Michael Armbrust] extended explain
5f14c35 [Michael Armbrust] more test tables supported
175c43e [Michael Armbrust] better errors for parse exceptions
480ade5 [Michael Armbrust] don't use partial cached results.
8a9d21c [Michael Armbrust] fix evaluation
7aee69c [Michael Armbrust] parsing for joins, boolean logic
7fcf480 [Michael Armbrust] test for and logic
3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines.
6902490 [Michael Armbrust] fix boolean logic evaluation
4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic
8b2a2ee [Michael Armbrust] more tests passing!
ad1f3b4 [Michael Armbrust] toString for null literals
a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work)
60ec19d [Michael Armbrust] initial support for udfs
c45b440 [Michael Armbrust] support for is (not) null and boolean logic
7f4a1dc [Michael Armbrust] add NoRelation logical operator
72e183b [Michael Armbrust] support for null values in tree node args.
ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs
e5c9d1a [Michael Armbrust] use nonEmpty
dcc4fe1 [Michael Armbrust] support for src1 test table.
c78b587 [Michael Armbrust] casting.
75c3f3f [Michael Armbrust] add support for logging with scalalogging.
da2c011 [Michael Armbrust] make it more obvious when results are being truncated.
96b73ba [Michael Armbrust] more docs in TestShark
18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive.
e6d063b [Michael Armbrust] more join tests.
664c1c3 [Michael Armbrust] make parsing of function names case insensitive.
0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome.
1a6db68 [Michael Armbrust] spelling
7638cb4 [Michael Armbrust] simple join execution with dsl tests.  no hive tests yes.
859d4c9 [Michael Armbrust] better argString printing of nested trees.
fc53615 [Michael Armbrust] add same instance comparisons for tree nodes.
a026e6b [Michael Armbrust] move out hive specific operators
fff4d1c [Michael Armbrust] add simple query execution debugging
e2120ab [Michael Armbrust] sorting for strings
da06eb6 [Michael Armbrust] Parsing for sortby and joins
9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId.
8eb2460 [Michael Armbrust] add system property to override whitelist.
88124bb [Michael Armbrust] make strategy evaluation lazy.
74a3a21 [Michael Armbrust] implement outputSet
d25b171 [Michael Armbrust] Add AND and OR expressions
67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll
12acf0a [Michael Armbrust] add .DS_Store for macs
f7da6ce [Michael Armbrust] add agg with grouping expr in select test
36805b3 [Michael Armbrust] pull out and improve aggregation
75613e1 [Michael Armbrust] better evaluations failure messages.
4789a35 [Michael Armbrust] weaken type since its hard to create pure references.
e89dd36 [Michael Armbrust] no newline for online trees
d0590d4 [Michael Armbrust] include stack trace for catalyst failures.
081c0d9 [Michael Armbrust] more generic computation of agg functions.
31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser
ecd45b2 [Michael Armbrust] Add more passing tests.
97d5419 [Michael Armbrust] fix alignment.
565cc13 [Michael Armbrust] make the canary query optional.
a95e65c [Michael Armbrust] support for resolving qualified attribute references.
e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails.
4640a0b [Michael Armbrust] handle test tables when database is specified.
bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis.
fec5158 [Michael Armbrust] add hive / idea files to .gitignore
3f97ffe [Michael Armbrust] Rename Hive => HiveQl
656b836 [Michael Armbrust] Support for parsing select clause aliases.
3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs.
3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception.
8cbef8a [Michael Armbrust] spelling
aa8c37c [Michael Armbrust] Better toString for SortOrder
1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions
a2e0327 [Michael Armbrust] add a bunch of tests.
4a3e1ea [Michael Armbrust] docs and use shark for data loading.
339bb8f [Michael Armbrust] better docs, Not support
1d7b2d9 [Michael Armbrust] Add NaN conversions.
46a2534 [Michael Armbrust] only run canary query on failure.
8996066 [Michael Armbrust] remove protected from makeCopy
53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass
04a372a [Michael Armbrust] add a flag for running all tests.
3b2235b [Michael Armbrust] More general implementation of arithmetic.
edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results
da6c577 [Michael Armbrust] add string <==> file utility functions.
3adf5ca [Michael Armbrust] Initial support for groupBy and count.
7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ.
a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product.
d66ba7e [Michael Armbrust] drop printlns.
88f2efd [Michael Armbrust] Add sum / count distinct expressions.
05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark
07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands.
b8a9910 [Michael Armbrust] quote tests passing.
8e5e267 [Michael Armbrust] handle aliased select expressions.
4286a96 [Michael Armbrust] drop debugging println
ac34aeb [Michael Armbrust] proof of concept for hive ast transformations.
2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments
ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general.
74a6337 [Michael Armbrust] use fastEquals when doing transformations.
1184a23 [Michael Armbrust] add native test for escapes.
b972b18 [Michael Armbrust] create BaseRelation class
fa6bce9 [Michael Armbrust] implement union
6391a87 [Michael Armbrust] count aggregate.
d47c317 [Michael Armbrust] add unary minus, more tests passing.
c7114e4 [Michael Armbrust] first draft of star expansion.
044c43d [Michael Armbrust] better support for numeric literal parsing.
1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view.
61503c5 [Michael Armbrust] add cached toRdd
2036883 [Michael Armbrust] skip explain queries when testing.
ebac4b1 [Michael Armbrust] fix bug in sort reference calculation
ca0dee0 [Michael Armbrust] docs.
1ee0471 [Michael Armbrust] string literal parsing.
357278b [Michael Armbrust] add limit support
9b3e479 [Michael Armbrust] creation of string literals.
02efa30 [Michael Armbrust] alias evaluation
cb68b33 [Michael Armbrust] parsing for random sample in hive ql.
126dd36 [Michael Armbrust] include query plans in failure output
bb59ae9 [Michael Armbrust] doc fixes
7e68286 [Michael Armbrust] fix confusing naming
768bb25 [Michael Armbrust] handle errors in shark query toString
829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark.  Make test shark a singleton to avoid weirdness with the hive megastore.
ad02e41 [Michael Armbrust] comment jdo dependency
7bc89fe [Michael Armbrust] add collect to TreeNode.
438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs.
09679ee [Michael Armbrust] fix bug in TreeNode foreach
2930b27 [Michael Armbrust] more specific name for del query tests.
8842549 [Michael Armbrust] docs.
da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL.
a8969b9 [Michael Armbrust] Factor out hive query comparison test framework.
1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations.
a36dd9a [Michael Armbrust] evaluation for other > data types.
cae729b [Michael Armbrust] remove unnecessary lazy vals.
d8e12af [Michael Armbrust] docs
3a60d67 [Michael Armbrust] implement average, placeholder for count
f05c106 [Michael Armbrust] checkAnswer handles single row results.
2730534 [Michael Armbrust] implement inputSet
a9aa79d [Michael Armbrust] debugging for sort exec
8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors.
554b4b2 [Michael Armbrust] BoundAttribute pretty printing.
754f5fa [Michael Armbrust] dsl for setting nullability
a206d7a [Michael Armbrust] clean up query tests.
84ad6ef [Michael Armbrust] better sort implementation and tests.
de24923 [Michael Armbrust] add double type.
9611a2c [Michael Armbrust] literal creation for doubles.
7358313 [Michael Armbrust] sort order returns child type.
b544715 [Michael Armbrust] implement eval for rand, and > for doubles
7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols)
1c1a35e [Michael Armbrust] add simple Rand expression.
3ca51de [Michael Armbrust] add orderBy to dsl
7ae41ab [Michael Armbrust] more literal implicit conversions
b18b675 [Michael Armbrust] First cut at native query tests for shark.
d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark.
5eac895 [Michael Armbrust] better error when descending is specified.
2b16f86 [Michael Armbrust] add todo
e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization
9dde3c8 [Michael Armbrust] add project and filter operations.
ad9037b [Michael Armbrust] Add support for local relations.
6227143 [Michael Armbrust] evaluation of Equals.
7526290 [Michael Armbrust] BoundReference should also be an Attribute.
bd33e26 [Michael Armbrust] more documentation
5de0ea3 [Michael Armbrust] Move all shark specific into a separate package.  Lots of documentation improvements.
0ae292b [Michael Armbrust] implement calculation of sort expressions.
9fd5011 [Michael Armbrust] First cut at expression evaluation.
6259e3a [Michael Armbrust] cleanup
787e5a2 [Michael Armbrust] use fastEquals
f90da36 [Michael Armbrust] better printing of optimization exceptions
b05dd67 [Michael Armbrust] Application of rules to fixed point.
bb2e0db [Michael Armbrust] pretty print for literals.
1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals.
d3a3687 [Michael Armbrust] add fastEquals
2b4935b [Michael Armbrust] set sbt.version explicitly
46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests.
c79f2fd [Michael Armbrust] insert operator should return an empty rdd.
14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input.
ae7b4c3 [Michael Armbrust] remove implicit dependencies.  now compiles without copying things into lib/ manually.
84082f9 [Michael Armbrust] add sbt binaries and scripts
15371a8 [Michael Armbrust] First draft of simple Hive DDL parser.
063bf44 [Michael Armbrust] Periods should end all comments.
e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack.
ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing!
b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust.
e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected.
26f410a [Michael Armbrust] physical traits should extend PhysicalPlan.
dc72469 [Michael Armbrust] beginning of hive compatibility testing framework.
0763490 [Michael Armbrust] support for hive native command pass-through.
d8a924f [Michael Armbrust] scaladoc
29a7163 [Michael Armbrust] Insert into hive table physical operator.
633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy.
59ac444 [Michael Armbrust] add unary expression
3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName'
665f7d0 [Michael Armbrust] add logical nodes for hive data sinks.
64d2923 [Michael Armbrust] Add classes for representing sorts.
f72b7ce [Michael Armbrust] first trivial end to end query execution.
5c7d244 [Michael Armbrust] first draft of references implementation.
7bff274 [Michael Armbrust] point at new shark.
c7cd57f [Michael Armbrust] docs for util function.
910811c [Michael Armbrust] check each item of the sequence
ef21a0b [Michael Armbrust] line up comments.
4b765d5 [Michael Armbrust] docs, drop println
6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution.
a703c49 [Michael Armbrust] this order works better until fixed point is implemented.
ec1d7c0 [Michael Armbrust] Simple attribute resolution.
069df02 [Michael Armbrust] parsing binary predicates
a1cf754 [Michael Armbrust] add joins and equality.
3f5bc98 [Michael Armbrust] add optiq to sbt.
54f3460 [Michael Armbrust] initial optiq parsing.
d9161ce [Michael Armbrust] add join operator
1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs
24ef6fb [Michael Armbrust] toString for alias.
ae7d776 [Michael Armbrust] add nullability changing function
d49dc02 [Michael Armbrust] scaladoc for named exprs
7c45dd7 [Michael Armbrust] pretty printing of trees.
78e34bf [Michael Armbrust] simple git ignore.
7ba19be [Michael Armbrust] First draft of interface to hive metastore.
7e7acf0 [Michael Armbrust] physical placeholder.
1c11136 [Michael Armbrust] first draft of error handling / plans for debugging.
3766a41 [Michael Armbrust] rearrange utility functions.
7fb3d5e [Michael Armbrust] docs and equality improvements.
45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions.
002d4d4 [Michael Armbrust] default to no alias.
be25003 [Michael Armbrust] add repl initialization to sbt.
0608a00 [Michael Armbrust] tighten public interface
a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms.
daa71ca [Michael Armbrust] foreach, maps, and scaladoc
6a158cb [Michael Armbrust] simple transform working.
db0299f [Michael Armbrust] basic analysis of relations minus transform function.
f74c4ee [Michael Armbrust] parsing a simple query.
08e4f57 [Michael Armbrust] upgrade scala include shark.
d3c6404 [Michael Armbrust] initial commit
2014-03-20 18:03:20 -07:00