Commit graph

46 commits

Author SHA1 Message Date
witgo 030f2c2126 Improved build configuration
1, Fix SPARK-1441: compile spark core error with hadoop 0.23.x
2, Fix SPARK-1491: maven hadoop-provided profile fails to build
3, Fix org.scala-lang: * ,org.apache.avro:* inconsistent versions dependency
4, A modified on the sql/catalyst/pom.xml,sql/hive/pom.xml,sql/core/pom.xml (Four spaces formatted into two spaces)

Author: witgo <witgo@qq.com>

Closes #480 from witgo/format_pom and squashes the following commits:

03f652f [witgo] review commit
b452680 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
bee920d [witgo] revert fix SPARK-1629: Spark Core missing commons-lang dependence
7382a07 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
6902c91 [witgo] fix SPARK-1629: Spark Core missing commons-lang dependence
0da4bc3 [witgo] merge master
d1718ed [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
e345919 [witgo] add avro dependency to yarn-alpha
77fad08 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
62d0862 [witgo] Fix org.scala-lang: * inconsistent versions dependency
1a162d7 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
934f24d [witgo] review commit
cf46edc [witgo] exclude jruby
06e7328 [witgo] Merge branch 'SparkBuild' into format_pom
99464d2 [witgo] fix maven hadoop-provided profile fails to build
0c6c1fc [witgo] Fix compile spark core error with hadoop 0.23.x
6851bec [witgo] Maintain consistent SparkBuild.scala, pom.xml
2014-04-28 22:51:46 -07:00
Michael Armbrust 86ff8b1027 Generalize pattern for planning hash joins.
This will be helpful for [SPARK-1495](https://issues.apache.org/jira/browse/SPARK-1495) and other cases where we want to have custom hash join implementations but don't want to repeat the logic for finding the join keys.

Author: Michael Armbrust <michael@databricks.com>

Closes #418 from marmbrus/hashFilter and squashes the following commits:

d5cc79b [Michael Armbrust] Address @rxin 's comments.
366b6d9 [Michael Armbrust] style fixes
14560eb [Michael Armbrust] Generalize pattern for planning hash joins.
f4809c1 [Michael Armbrust] Move common functions to PredicateHelper.
2014-04-24 21:42:33 -07:00
Mridul Muralidharan 968c0187a1 SPARK-1586 Windows build fixes
Unfortunately, this is not exhaustive - particularly hive tests still fail due to path issues.

Author: Mridul Muralidharan <mridulm80@apache.org>

This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <matei@databricks.com>

Closes #505 from mridulm/windows_fixes and squashes the following commits:

ef12283 [Mridul Muralidharan] Move to org.apache.commons.lang3 for StringEscapeUtils. Earlier version was buggy appparently
cdae406 [Mridul Muralidharan] Remove leaked changes from > 2G fix branch
3267f4b [Mridul Muralidharan] Fix build failures
35b277a [Mridul Muralidharan] Fix Scalastyle failures
bc69d14 [Mridul Muralidharan] Change from hardcoded path separator
10c4d78 [Mridul Muralidharan] Use explicit encoding while using getBytes
1337abd [Mridul Muralidharan] fix classpath while running in windows
2014-04-24 20:48:33 -07:00
Michael Armbrust 4660991e67 [SQL] Add support for parsing indexing into arrays in SQL.
Author: Michael Armbrust <michael@databricks.com>

Closes #518 from marmbrus/parseArrayIndex and squashes the following commits:

afd2d6b [Michael Armbrust] 100 chars
c3d6026 [Michael Armbrust] Add support for parsing indexing into arrays in SQL.
2014-04-24 18:21:00 -07:00
Arun Ramakrishnan 35e3d199f0 SPARK-1438 RDD.sample() make seed param optional
copying form previous pull request https://github.com/apache/spark/pull/462

Its probably better to let the underlying language implementation take care of the default . This was easier to do with python as the default value for seed in random and numpy random is None.

In Scala/Java side it might mean propagating an Option or null(oh no!) down the chain until where the Random is constructed. But, looks like the convention in some other methods was to use System.nanoTime. So, followed that convention.

Conflict with overloaded method in sql.SchemaRDD.sample which also defines default params.
sample(fraction, withReplacement=false, seed=math.random)
Scala does not allow more than one overloaded to have default params. I believe the author intended to override the RDD.sample method and not overload it. So, changed it.

If backward compatible is important, 3 new method can be introduced (without default params) like this
sample(fraction)
sample(fraction, withReplacement)
sample(fraction, withReplacement, seed)

Added some tests for the scala RDD takeSample method.

Author: Arun Ramakrishnan <smartnut007@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <matei@databricks.com>

Closes #477 from smartnut007/master and squashes the following commits:

07bb06e [Arun Ramakrishnan] SPARK-1438 fixing more space formatting issues
b9ebfe2 [Arun Ramakrishnan] SPARK-1438 removing redundant import of random in python rddsampler
8d05b1a [Arun Ramakrishnan] SPARK-1438 RDD . Replace System.nanoTime with a Random generated number. python: use a separate instance of Random instead of seeding language api global Random instance.
69619c6 [Arun Ramakrishnan] SPARK-1438 fix spacing issue
0c247db [Arun Ramakrishnan] SPARK-1438 RDD language apis to support optional seed in RDD methods sample/takeSample
2014-04-24 17:27:16 -07:00
Sandeep a03ac222d8 Fix Scala Style
Any comments are welcome

Author: Sandeep <sandeep@techaddict.me>

Closes #531 from techaddict/stylefix-1 and squashes the following commits:

7492730 [Sandeep] Pass 4
98b2428 [Sandeep] fix rxin suggestions
b5e2e6f [Sandeep] Pass 3
05932d7 [Sandeep] fix if else styling 2
08690e5 [Sandeep] fix if else styling
2014-04-24 15:07:23 -07:00
Michael Armbrust aa77f8a6a6 SPARK-1562 Fix visibility / annotation of Spark SQL APIs
Author: Michael Armbrust <michael@databricks.com>

Closes #489 from marmbrus/sqlDocFixes and squashes the following commits:

acee4f3 [Michael Armbrust] Fix visibility / annotation of Spark SQL APIs
2014-04-22 20:02:33 -07:00
Kan Zhang ea8cea82a0 [SPARK-1570] Fix classloading in JavaSQLContext.applySchema
I think I hit a class loading issue when running JavaSparkSQL example using spark-submit in local mode.

Author: Kan Zhang <kzhang@apache.org>

Closes #484 from kanzhang/SPARK-1570 and squashes the following commits:

feaaeba [Kan Zhang] [SPARK-1570] Fix classloading in JavaSQLContext.applySchema
2014-04-22 15:05:12 -07:00
Andrew Or b3e5366f69 [Fix #274] Document + fix annotation usages
... so that we don't follow an unspoken set of forbidden rules for adding **@AlphaComponent**, **@DeveloperApi**, and **@Experimental** annotations in the code.

In addition, this PR
(1) removes unnecessary `:: * ::` tags,
(2) adds missing `:: * ::` tags, and
(3) removes annotations for internal APIs.

Author: Andrew Or <andrewor14@gmail.com>

Closes #470 from andrewor14/annotations-fix and squashes the following commits:

92a7f42 [Andrew Or] Document + fix annotation usages
2014-04-21 22:24:44 -07:00
Cheng Lian 89f47434e2 Reuses Row object in ExistingRdd.productToRowRdd()
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #432 from liancheng/reuseRow and squashes the following commits:

9e6d083 [Cheng Lian] Simplified code with BufferedIterator
52acec9 [Cheng Lian] Reuses Row object in ExistingRdd.productToRowRdd()
2014-04-18 10:02:27 -07:00
Michael Armbrust 273c2fd08d [SQL] SPARK-1424 Generalize insertIntoTable functions on SchemaRDDs
This makes it possible to create tables and insert into them using the DSL and SQL for the scala and java apis.

Author: Michael Armbrust <michael@databricks.com>

Closes #354 from marmbrus/insertIntoTable and squashes the following commits:

6c6f227 [Michael Armbrust] Create random temporary files in python parquet unit tests.
f5e6d5c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into insertIntoTable
765c506 [Michael Armbrust] Add to JavaAPI.
77b512c [Michael Armbrust] typos.
5c3ef95 [Michael Armbrust] use names for boolean args.
882afdf [Michael Armbrust] Change createTableAs to saveAsTable.  Clean up api annotations.
d07d94b [Michael Armbrust] Add tests, support for creating parquet files and hive tables.
fa3fe81 [Michael Armbrust] Make insertInto available on JavaSchemaRDD as well.  Add createTableAs function.
2014-04-15 20:40:40 -07:00
Ahir Reddy c99bcb7fea SPARK-1374: PySpark API for SparkSQL
An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries, with string keys and primitive values (boolean, float, int, long, string) can be converted into a SchemaRDD that supports sql queries.

```
from pyspark.context import SQLContext
sqlCtx = SQLContext(sc)
rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"}, {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
srdd = sqlCtx.applySchema(rdd)
sqlCtx.registerRDDAsTable(srdd, "table1")
srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1")
srdd2.collect()
```
The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": "row2"}, {"f1" : 3, "f2": "row3"}]```

Author: Ahir Reddy <ahirreddy@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #363 from ahirreddy/pysql and squashes the following commits:

0294497 [Ahir Reddy] Updated log4j properties to supress Hive Warns
307d6e0 [Ahir Reddy] Style fix
6f7b8f6 [Ahir Reddy] Temporary fix MIMA checker. Since we now assemble Spark jar with Hive, we don't want to check the interfaces of all of our hive dependencies
3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py
29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and caching PythonRDD
f2312c7 [Ahir Reddy] Moved everything into sql.py
a19afe4 [Ahir Reddy] Doc fixes
6d658ba [Ahir Reddy] Remove the metastore directory created by the HiveContext tests in SparkSQL
521ff6d [Ahir Reddy] Trying to get spark to build with hive
ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins
ded03e7 [Ahir Reddy] Added doc test for HiveContext
22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency
e4da06c [Ahir Reddy] Display message if hive is not built into spark
227a0be [Michael Armbrust] Update API links. Fix Hive example.
58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api.  Minor fixes.
4285340 [Michael Armbrust] Fix building of Hive API Docs.
38a92b0 [Michael Armbrust] Add note to future non-python developers about python docs.
337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build
40491c9 [Ahir Reddy] PR Changes + Method Visibility
1836944 [Michael Armbrust] Fix comments.
e00980f [Michael Armbrust] First draft of python sql programming guide.
b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit test
f98a422 [Ahir Reddy] HiveContexts
79621cf [Ahir Reddy] cleaning up cruft
b406ba0 [Ahir Reddy] doctest formatting
20936a5 [Ahir Reddy] Added tests and documentation
e4d21b4 [Ahir Reddy] Added pyrolite dependency
79f739d [Ahir Reddy] added more tests
7515ba0 [Ahir Reddy] added more tests :)
d26ec5e [Ahir Reddy] added test
e9f5b8d [Ahir Reddy] adding tests
906d180 [Ahir Reddy] added todo explaining cost of creating Row object in python
251f99d [Ahir Reddy] for now only allow dictionaries as input
09b9980 [Ahir Reddy] made jrdd explicitly lazy
c608947 [Ahir Reddy] SchemaRDD now has all RDD operations
725c91e [Ahir Reddy] awesome row objects
55d1c76 [Ahir Reddy] return row objects
4fe1319 [Ahir Reddy] output dictionaries correctly
be079de [Ahir Reddy] returning dictionaries works
cd5f79f [Ahir Reddy] Switched to using Scala SQLContext
e948bd9 [Ahir Reddy] yippie
4886052 [Ahir Reddy] even better
c0fb1c6 [Ahir Reddy] more working
043ca85 [Ahir Reddy] working
5496f9f [Ahir Reddy] doesn't crash
b8b904b [Ahir Reddy] Added schema rdd class
67ba875 [Ahir Reddy] java to python, and python to java
bcc0f23 [Ahir Reddy] Java to python
ab6025d [Ahir Reddy] compiling
2014-04-15 00:07:55 -07:00
Cheng Lian 7dbca68e92 [BUGFIX] In-memory columnar storage bug fixes
Fixed several bugs of in-memory columnar storage to make `HiveInMemoryCompatibilitySuite` pass.

@rxin @marmbrus It is reasonable to include `HiveInMemoryCompatibilitySuite` in this PR, but I didn't, since it significantly increases test execution time. What do you think?

**UPDATE** `HiveCompatibilitySuite` has been made to cache tables in memory. `HiveInMemoryCompatibilitySuite` was removed.

Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #374 from liancheng/inMemBugFix and squashes the following commits:

6ad6d9b [Cheng Lian] Merged HiveCompatibilitySuite and HiveInMemoryCompatibilitySuite
5bdbfe7 [Cheng Lian] Revert 882c538 & 8426ddc, which introduced regression
882c538 [Cheng Lian] Remove attributes field from InMemoryColumnarTableScan
32cc9ce [Cheng Lian] Code style cleanup
99382bf [Cheng Lian] Enable compression by default
4390bcc [Cheng Lian] Report error for any Throwable in HiveComparisonTest
d1df4fd [Michael Armbrust] Remove test tables that might always get created anyway?
ab9e807 [Michael Armbrust] Fix the logged console version of failed test cases to use the new syntax.
1965123 [Michael Armbrust] Don't use coalesce for gathering all data to a single partition, as it does not work correctly with mutable rows.
e36cdd0 [Michael Armbrust] Spelling.
2d0e168 [Michael Armbrust] Run Hive tests in-memory too.
6360723 [Cheng Lian] Made PreInsertionCasts support SparkLogicalPlan and InMemoryColumnarTableScan
c9b0f6f [Cheng Lian] Let InsertIntoTable support InMemoryColumnarTableScan
9c8fc40 [Cheng Lian] Disable compression by default
e619995 [Cheng Lian] Bug fix: incorrect byte order in CompressionScheme.columnHeaderSize
8426ddc [Cheng Lian] Bug fix: InMemoryColumnarTableScan should cache columns specified by the attributes argument
036cd09 [Cheng Lian] Clean up unused imports
44591a5 [Cheng Lian] Bug fix: NullableColumnAccessor.hasNext must take nulls into account
052bf41 [Cheng Lian] Bug fix: should only gather compressibility info for non-null values
95b3301 [Cheng Lian] Fixed bugs in IntegralDelta
2014-04-14 15:22:43 -07:00
Patrick Wendell 4bc07eebbf SPARK-1480: Clean up use of classloaders
The Spark codebase is a bit fast-and-loose when accessing classloaders and this has caused a few bugs to surface in master.

This patch defines some utility methods for accessing classloaders. This makes the intention when accessing a classloader much more explicit in the code and fixes a few cases where the wrong one was chosen.

case (a) -> We want the classloader that loaded Spark
case (b) -> We want the context class loader, or if not present, we want (a)

This patch provides a better fix for SPARK-1403 (https://issues.apache.org/jira/browse/SPARK-1403) than the current work around, which it reverts. It also fixes a previously unreported bug that the `./spark-submit` script did not work for running with `local` master. It didn't work because the executor classloader did not properly delegate to the context class loader (if it is defined) and in local mode the context class loader is set by the `./spark-submit` script. A unit test is added for that case.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #398 from pwendell/class-loaders and squashes the following commits:

b4a1a58 [Patrick Wendell] Minor clean up
14f1272 [Patrick Wendell] SPARK-1480: Clean up use of classloaders
2014-04-13 08:58:37 -07:00
Sandeep 930b70f052 Remove Unnecessary Whitespace's
stack these together in a commit else they show up chunk by chunk in different commits.

Author: Sandeep <sandeep@techaddict.me>

Closes #380 from techaddict/white_space and squashes the following commits:

b58f294 [Sandeep] Remove Unnecessary Whitespace's
2014-04-10 15:04:13 -07:00
witgo a74fbbbca8 Fix SPARK-1413: Parquet messes up stdout and stdin when used in Spark REPL
Author: witgo <witgo@qq.com>

Closes #325 from witgo/SPARK-1413 and squashes the following commits:

e57cd8e [witgo] use scala reflection to access and call the SLF4JBridgeHandler  methods
45c8f40 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
5e35d87 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
0d5f819 [witgo] review commit
45e5b70 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
fa69dcf [witgo] Merge branch 'master' into SPARK-1413
3c98dc4 [witgo] Merge branch 'master' into SPARK-1413
38160cb [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
ba09bcd [witgo] remove set the parquet log level
a63d574 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
5231ecd [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
3feb635 [witgo] parquet logger use parent handler
fa00d5d [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
8bb6ffd [witgo] enableLogForwarding note fix
edd9630 [witgo]  move to
f447f50 [witgo] merging master
5ad52bd [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
76670c1 [witgo] review commit
70f3c64 [witgo] Fix SPARK-1413
2014-04-10 10:36:20 -07:00
Patrick Wendell 87bd1f9ef7 SPARK-1093: Annotate developer and experimental API's
This patch marks some existing classes as private[spark] and adds two types of API annotations:
- `EXPERIMENTAL API` = experimental user-facing module
- `DEVELOPER API - UNSTABLE` = developer-facing API that might change

There is some discussion of the different mechanisms for doing this here:
https://issues.apache.org/jira/browse/SPARK-1081

I was pretty aggressive with marking things private. Keep in mind that if we want to open something up in the future we can, but we can never reduce visibility.

A few notes here:
- In the past we've been inconsistent with the visiblity of the X-RDD classes. This patch marks them private whenever there is an existing function in RDD that can directly creat them (e.g. CoalescedRDD and rdd.coalesce()). One trade-off here is users can't subclass them.
- Noted that compression and serialization formats don't have to be wire compatible across versions.
- Compression codecs and serialization formats are semi-private as users typically don't instantiate them directly.
- Metrics sources are made private - user only interacts with them through Spark's reflection

Author: Patrick Wendell <pwendell@gmail.com>
Author: Andrew Or <andrewor14@gmail.com>

Closes #274 from pwendell/private-apis and squashes the following commits:

44179e4 [Patrick Wendell] Merge remote-tracking branch 'apache-github/master' into private-apis
042c803 [Patrick Wendell] spark.annotations -> spark.annotation
bfe7b52 [Patrick Wendell] Adding experimental for approximate counts
8d0c873 [Patrick Wendell] Warning in SparkEnv
99b223a [Patrick Wendell] Cleaning up annotations
e849f64 [Patrick Wendell] Merge pull request #2 from andrewor14/annotations
982a473 [Andrew Or] Generalize jQuery matching for non Spark-core API docs
a01c076 [Patrick Wendell] Merge pull request #1 from andrewor14/annotations
c1bcb41 [Andrew Or] DeveloperAPI -> DeveloperApi
0d48908 [Andrew Or] Comments and new lines (minor)
f3954e0 [Andrew Or] Add identifier tags in comments to work around scaladocs bug
99192ef [Andrew Or] Dynamically add badges based on annotations
824011b [Andrew Or] Add support for injecting arbitrary JavaScript to API docs
037755c [Patrick Wendell] Some changes after working with andrew or
f7d124f [Patrick Wendell] Small fixes
c318b24 [Patrick Wendell] Use CSS styles
e4c76b9 [Patrick Wendell] Logging
f390b13 [Patrick Wendell] Better visibility for workaround constructors
d6b0afd [Patrick Wendell] Small chang to existing constructor
403ba52 [Patrick Wendell] Style fix
870a7ba [Patrick Wendell] Work around for SI-8479
7fb13b2 [Patrick Wendell] Changes to UnionRDD and EmptyRDD
4a9e90c [Patrick Wendell] EXPERIMENTAL API --> EXPERIMENTAL
c581dce [Patrick Wendell] Changes after building against Shark.
8452309 [Patrick Wendell] Style fixes
1ed27d2 [Patrick Wendell] Formatting and coloring of badges
cd7a465 [Patrick Wendell] Code review feedback
2f706f1 [Patrick Wendell] Don't use floats
542a736 [Patrick Wendell] Small fixes
cf23ec6 [Patrick Wendell] Marking GraphX as alpha
d86818e [Patrick Wendell] Another naming change
5a76ed6 [Patrick Wendell] More visiblity clean-up
42c1f09 [Patrick Wendell] Using better labels
9d48cbf [Patrick Wendell] Initial pass
2014-04-09 01:14:46 -07:00
Cheng Lian 0d0493fcf7 [SPARK-1402] Added 3 more compression schemes
JIRA issue: [SPARK-1402](https://issues.apache.org/jira/browse/SPARK-1402)

This PR provides 3 more compression schemes for Spark SQL in-memory columnar storage:

* `BooleanBitSet`
* `IntDelta`
* `LongDelta`

Now there are 6 compression schemes in total, including the no-op `PassThrough` scheme.

Also fixed a bug in PR #286: not all compression schemes are added as available schemes when accessing an in-memory column, and when a column is compressed with an unrecognised scheme, `ColumnAccessor` throws exception.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #330 from liancheng/moreCompressionSchemes and squashes the following commits:

1d037b8 [Cheng Lian] Fixed SPARK-1436: in-memory column byte buffer must be able to be accessed multiple times
d7c0e8f [Cheng Lian] Added test suite for IntegralDelta (IntDelta & LongDelta)
3c1ad7a [Cheng Lian] Added test suite for BooleanBitSet, refactored other test suites
44fe4b2 [Cheng Lian] Refactored CompressionScheme, added 3 more compression schemes.
2014-04-07 22:24:12 -07:00
Reynold Xin 14c9238aa7 [sql] Rename execution/aggregates.scala Aggregate.scala, and added a bunch of private[this] to variables.
Author: Reynold Xin <rxin@apache.org>

Closes #348 from rxin/aggregate and squashes the following commits:

f4bc36f [Reynold Xin] Rename execution/aggregates.scala Aggregate.scala, and added a bunch of private[this] to variables.
2014-04-07 18:38:44 -07:00
Reynold Xin 83f2a2f14e [sql] Rename Expression.apply to eval for better readability.
Also used this opportunity to add a bunch of override's and made some members private.

Author: Reynold Xin <rxin@apache.org>

Closes #340 from rxin/eval and squashes the following commits:

a7c7ca7 [Reynold Xin] Fixed conflicts in merge.
9069de6 [Reynold Xin] Merge branch 'master' into eval
3ccc313 [Reynold Xin] Merge branch 'master' into eval
1a47e10 [Reynold Xin] Renamed apply to eval for generators and added a bunch of override's.
ea061de [Reynold Xin] Rename Expression.apply to eval for better readability.
2014-04-07 10:45:31 -07:00
Michael Armbrust b5bae849db [SQL] SPARK-1427 Fix toString for SchemaRDD NativeCommands.
Author: Michael Armbrust <michael@databricks.com>

Closes #343 from marmbrus/toStringFix and squashes the following commits:

37198fe [Michael Armbrust] Fix toString for SchemaRDD NativeCommands.
2014-04-07 01:46:50 -07:00
Michael Armbrust accd0999f9 [SQL] SPARK-1371 Hash Aggregation Improvements
Given:
```scala
case class Data(a: Int, b: Int)
val rdd =
  sparkContext
    .parallelize(1 to 200)
    .flatMap(_ => (1 to 50000).map(i => Data(i % 100, i)))
rdd.registerAsTable("data")
cacheTable("data")
```
Before:
```
SELECT COUNT(*) FROM data:[10000000]
16795.567ms
SELECT a, SUM(b) FROM data GROUP BY a
7536.436ms
SELECT SUM(b) FROM data
10954.1ms
```

After:
```
SELECT COUNT(*) FROM data:[10000000]
1372.175ms
SELECT a, SUM(b) FROM data GROUP BY a
2070.446ms
SELECT SUM(b) FROM data
958.969ms
```

Author: Michael Armbrust <michael@databricks.com>

Closes #295 from marmbrus/hashAgg and squashes the following commits:

ec63575 [Michael Armbrust] Add comment.
d0495a9 [Michael Armbrust] Use scaladoc instead.
b4a6887 [Michael Armbrust] Address review comments.
a2d90ba [Michael Armbrust] Capture child output statically to avoid issues with generators and serialization.
7c13112 [Michael Armbrust] Rewrite Aggregate operator to stream input and use projections.  Remove unused local RDD functions implicits.
5096f99 [Michael Armbrust] Make HiveUDAF fields transient since object inspectors are not serializable.
6a4b671 [Michael Armbrust] Add option to avoid binding operators expressions automatically.
92cca08 [Michael Armbrust] Always include serialization debug info when running tests.
1279df2 [Michael Armbrust] Increase default number of partitions.
2014-04-07 00:14:00 -07:00
Sean Owen 890d63bd4e Fix for PR #195 for Java 6
Use Java 6's recommended equivalent of Java 7's Logger.getGlobal() to retain Java 6 compatibility. See PR #195

Author: Sean Owen <sowen@cloudera.com>

Closes #334 from srowen/FixPR195ForJava6 and squashes the following commits:

f92fbd3 [Sean Owen] Use Java 6's recommended equivalent of Java 7's Logger.getGlobal() to retain Java 6 compatibility
2014-04-05 19:08:24 -07:00
Michael Armbrust d956cc2516 [SQL] Minor fixes.
Author: Michael Armbrust <michael@databricks.com>

Closes #315 from marmbrus/minorFixes and squashes the following commits:

b23a15d [Michael Armbrust] fix scaladoc
11062ac [Michael Armbrust] Fix registering "SELECT *" queries as tables and caching them.  As some tests for this and self-joins.
3997dc9 [Michael Armbrust] Move Row extractor to catalyst.
208bf5e [Michael Armbrust] More idiomatic naming of DSL functions. * subquery => as * for join condition => on, i.e., `r.join(s, condition = 'a == 'b)` =>`r.join(s, on = 'a == 'b)`
87211ce [Michael Armbrust] Correctly handle self joins of in-memory cached tables.
69e195e [Michael Armbrust] Change != to !== in the DSL since != will always translate to != on Any.
01f2dd5 [Michael Armbrust] Correctly assign aliases to tables in SqlParser.
2014-04-04 17:23:17 -07:00
Michael Armbrust d94826be6d [BUILD FIX] Fix compilation of Spark SQL Java API.
The JavaAPI and the Parquet improvements PRs didn't conflict, but broke the build.

Author: Michael Armbrust <michael@databricks.com>

Closes #316 from marmbrus/hotFixJavaApi and squashes the following commits:

0b84c2d [Michael Armbrust] Fix compilation of Spark SQL Java API.
2014-04-03 16:12:08 -07:00
Michael Armbrust b8f534196f [SQL] SPARK-1333 First draft of java API
WIP: Some work remains...
 * [x] Hive support
 * [x] Tests
 * [x] Update docs

Feedback welcome!

Author: Michael Armbrust <michael@databricks.com>

Closes #248 from marmbrus/javaSchemaRDD and squashes the following commits:

b393913 [Michael Armbrust] @srowen 's java style suggestions.
f531eb1 [Michael Armbrust] Address matei's comments.
33a1b1a [Michael Armbrust] Ignore JavaHiveSuite.
822f626 [Michael Armbrust] improve docs.
ab91750 [Michael Armbrust] Improve Java SQL API: * Change JavaRow => Row * Add support for querying RDDs of JavaBeans * Docs * Tests * Hive support
0b859c8 [Michael Armbrust] First draft of java API.
2014-04-03 15:45:34 -07:00
Cheng Hao 5d1feda217 [SPARK-1360] Add Timestamp Support for SQL
This PR includes:
1) Add new data type Timestamp
2) Add more data type casting base on Hive's Rule
3) Fix bug missing data type in both parsers (HiveQl & SQLParser).

Author: Cheng Hao <hao.cheng@intel.com>

Closes #275 from chenghao-intel/timestamp and squashes the following commits:

df709e5 [Cheng Hao] Move orc_ends_with_nulls to blacklist
24b04b0 [Cheng Hao] Put 3 cases into the black lists(describe_pretty,describe_syntax,lateral_view_outer)
fc512c2 [Cheng Hao] remove the unnecessary data type equality check in data casting
d0d1919 [Cheng Hao] Add more data type for scala reflection
3259808 [Cheng Hao] Add the new Golden files
3823b97 [Cheng Hao] Update the UnitTest cases & add timestamp type for HiveQL
54a0489 [Cheng Hao] fix bug mapping to 0 (which is supposed to be null) when NumberFormatException occurs
9cb505c [Cheng Hao] Fix issues according to PR comments
e529168 [Cheng Hao] Fix bug of converting from String
6fc8100 [Cheng Hao] Update Unit Test & CodeStyle
8a1d4d6 [Cheng Hao] Add DataType for SqlParser
ce4385e [Cheng Hao] Add TimestampType Support
2014-04-03 15:33:17 -07:00
Andre Schumacher fbebaedf26 Spark parquet improvements
A few improvements to the Parquet support for SQL queries:
- Instead of files a ParquetRelation is now backed by a directory, which simplifies importing data from other
  sources
- InsertIntoParquetTable operation now supports switching between overwriting or appending (at least in
  HiveQL)
- tests now use the new API
- Parquet logging can be set to WARNING level (Default)
- Default compression for Parquet files (GZIP, as in parquet-mr)

Author: Andre Schumacher <andre.schumacher@iki.fi>

Closes #195 from AndreSchumacher/spark_parquet_improvements and squashes the following commits:

54df314 [Andre Schumacher] SPARK-1383 [SQL] Improvements to ParquetRelation
2014-04-03 15:31:47 -07:00
Michael Armbrust 47ebea5468 [SQL] SPARK-1364 Improve datatype and test coverage for ScalaReflection schema inference.
Author: Michael Armbrust <michael@databricks.com>

Closes #293 from marmbrus/reflectTypes and squashes the following commits:

f54e8e8 [Michael Armbrust] Improve datatype and test coverage for ScalaReflection schema inference.
2014-04-02 18:14:31 -07:00
Reynold Xin ed730c9502 StopAfter / TopK related changes
1. Renamed StopAfter to Limit to be more consistent with naming in other relational databases.
2. Renamed TopK to TakeOrdered to be more consistent with Spark RDD API.
3. Avoid breaking lineage in Limit.
4. Added a bunch of override's to execution/basicOperators.scala.

@marmbrus @liancheng

Author: Reynold Xin <rxin@apache.org>
Author: Michael Armbrust <michael@databricks.com>

Closes #233 from rxin/limit and squashes the following commits:

13eb12a [Reynold Xin] Merge pull request #1 from marmbrus/limit
92b9727 [Michael Armbrust] More hacks to make Maps serialize with Kryo.
4fc8b4e [Reynold Xin] Merge branch 'master' of github.com:apache/spark into limit
87b7d37 [Reynold Xin] Use the proper serializer in limit.
9b79246 [Reynold Xin] Updated doc for Limit.
47d3327 [Reynold Xin] Copy tuples in Limit before shuffle.
231af3a [Reynold Xin] Limit/TakeOrdered: 1. Renamed StopAfter to Limit to be more consistent with naming in other relational databases. 2. Renamed TopK to TakeOrdered to be more consistent with Spark RDD API. 3. Avoid breaking lineage in Limit. 4. Added a bunch of override's to execution/basicOperators.scala.
2014-04-02 12:48:04 -07:00
Cheng Lian 1faa579711 [SPARK-1371][WIP] Compression support for Spark SQL in-memory columnar storage
JIRA issue: [SPARK-1373](https://issues.apache.org/jira/browse/SPARK-1373)

(Although tagged as WIP, this PR is structurally complete. The only things left unimplemented are 3 more compression algorithms: `BooleanBitSet`, `IntDelta` and `LongDelta`, which are trivial to add later in this or another separate PR.)

This PR contains compression support for Spark SQL in-memory columnar storage. Main interfaces include:

*   `CompressionScheme`

    Each `CompressionScheme` represents a concrete compression algorithm, which basically consists of an `Encoder` for compression and a `Decoder` for decompression. Algorithms implemented include:

    * `RunLengthEncoding`
    * `DictionaryEncoding`

    Algorithms to be implemented include:

    * `BooleanBitSet`
    * `IntDelta`
    * `LongDelta`

*   `CompressibleColumnBuilder`

    A stackable `ColumnBuilder` trait used to build byte buffers for compressible columns.  A best `CompressionScheme` that exhibits lowest compression ratio is chosen for each column according to statistical information gathered while elements are appended into the `ColumnBuilder`. However, if no `CompressionScheme` can achieve a compression ratio better than 80%, no compression will be done for this column to save CPU time.

    Memory layout of the final byte buffer is showed below:

    ```
     .--------------------------- Column type ID (4 bytes)
     |   .----------------------- Null count N (4 bytes)
     |   |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
     |   |   |     .------------- Compression scheme ID (4 bytes)
     |   |   |     |   .--------- Compressed non-null elements
     V   V   V     V   V
    +---+---+-----+---+---------+
    |   |   | ... |   | ... ... |
    +---+---+-----+---+---------+
     \-----------/ \-----------/
        header         body
    ```

*   `CompressibleColumnAccessor`

    A stackable `ColumnAccessor` trait used to iterate (possibly) compressed data column.

*   `ColumnStats`

    Used to collect statistical information while loading data into in-memory columnar table. Optimizations like partition pruning rely on this information.

    Strictly speaking, `ColumnStats` related code is not part of the compression support. It's contained in this PR to ensure and validate the row-based API design (which is used to avoid boxing/unboxing cost whenever possible).

A major refactoring change since PR #205 is:

* Refactored all getter/setter methods for primitive types in various places into `ColumnType` classes to remove duplicated code.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #285 from liancheng/memColumnarCompression and squashes the following commits:

ed71bbd [Cheng Lian] Addressed all PR comments by @marmbrus
d3a4fa9 [Cheng Lian] Removed Ordering[T] in ColumnStats for better performance
5034453 [Cheng Lian] Bug fix, more tests, and more refactoring
c298b76 [Cheng Lian] Test suites refactored
2780d6a [Cheng Lian] [WIP] in-memory columnar compression support
211331c [Cheng Lian] WIP: in-memory columnar compression support
85cc59b [Cheng Lian] Refactored ColumnAccessors & ColumnBuilders to remove duplicate code
2014-04-02 12:47:22 -07:00
Michael Armbrust f5c418da04 [SQL] SPARK-1372 Support for caching and uncaching tables in a SQLContext.
This doesn't yet support different databases in Hive (though you can probably workaround this by calling `USE <dbname>`).  However, given the time constraints for 1.0 I think its probably worth including this now and extending the functionality in the next release.

Author: Michael Armbrust <michael@databricks.com>

Closes #282 from marmbrus/cacheTables and squashes the following commits:

83785db [Michael Armbrust] Support for caching and uncaching tables in a SQLContext.
2014-04-01 14:45:44 -07:00
Michael Armbrust 5731af5be6 [SQL] Rewrite join implementation to allow streaming of one relation.
Before we were materializing everything in memory.  This also uses the projection interface so will be easier to plug in code gen (its ported from that branch).

@rxin @liancheng

Author: Michael Armbrust <michael@databricks.com>

Closes #250 from marmbrus/hashJoin and squashes the following commits:

1ad873e [Michael Armbrust] Change hasNext logic back to the correct version.
8e6f2a2 [Michael Armbrust] Review comments.
1e9fb63 [Michael Armbrust] style
bc0cb84 [Michael Armbrust] Rewrite join implementation to allow streaming of one relation.
2014-03-31 15:23:46 -07:00
Prashant Sharma d666053679 SPARK-1352 - Comment style single space before ending */ check.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #261 from ScrapCodes/comment-style-check2 and squashes the following commits:

6cde61e [Prashant Sharma] comment style space before ending */ check.
2014-03-30 10:06:56 -07:00
jerryshao 95d7d2a3fc [SPARK-1354][SQL] Add tableName as a qualifier for SimpleCatelogy
Fix attribute unresolved when query with table name as a qualifier in SQLContext with SimplCatelog, details please see [SPARK-1354](https://issues.apache.org/jira/browse/SPARK-1354?jql=project%20%3D%20SPARK).

Author: jerryshao <saisai.shao@intel.com>

Closes #272 from jerryshao/qualifier-fix and squashes the following commits:

7950170 [jerryshao] Add tableName as a qualifier for SimpleCatelogy
2014-03-30 10:04:28 -07:00
Michael Armbrust 2861b07bb0 [SQL] SPARK-1354 Fix self-joins of parquet relations
@AndreSchumacher, please take a look.

https://spark-project.atlassian.net/browse/SPARK-1354

Author: Michael Armbrust <michael@databricks.com>

Closes #269 from marmbrus/parquetJoin and squashes the following commits:

4081e77 [Michael Armbrust] Create new instances of Parquet relation when multiple copies are in a single plan.
2014-03-29 22:02:53 -07:00
Thomas Graves 3738f24421 SPARK-1345 adding missing dependency on avro for hadoop 0.23 to the new ...
...sql pom files

Author: Thomas Graves <tgraves@apache.org>

Closes #263 from tgravescs/SPARK-1345 and squashes the following commits:

b43a2a0 [Thomas Graves] SPARK-1345 adding missing dependency on avro for hadoop 0.23 to the new sql pom files
2014-03-28 23:09:29 -07:00
Michael Armbrust e15e57413e [SQL] Add a custom serializer for maps since they do not have a no-arg constructor.
Author: Michael Armbrust <michael@databricks.com>

Closes #243 from marmbrus/mapSer and squashes the following commits:

54045f7 [Michael Armbrust] Add a custom serializer for maps since they do not have a no-arg constructor.
2014-03-26 18:19:49 -07:00
Cheng Lian 345825d979 Unified package definition format in Spark SQL
According to discussions in comments of PR #208, this PR unifies package definition format in Spark SQL.

Some broken links in ScalaDoc and typos detected along the way are also fixed.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #225 from liancheng/packageDefinition and squashes the following commits:

75c47b3 [Cheng Lian] Fixed file line length
4f87968 [Cheng Lian] Unified package definition format in Spark SQL
2014-03-26 15:36:18 -07:00
Michael Armbrust b637f2d91a Unify the logic for column pruning, projection, and filtering of table scans.
This removes duplicated logic, dead code and casting when planning parquet table scans and hive table scans.

Other changes:
 - Fix tests now that we are doing a better job of column pruning (i.e., since pruning predicates are applied before we even start scanning tuples, columns required by these predicates do not need to be included in the output of the scan unless they are also included in the final output of this logical plan fragment).
 - Add rule to simplify trivial filters.  This was required to avoid `WHERE false` from getting pushed into table scans, since `HiveTableScan` (reasonably) refuses to apply partition pruning predicates to non-partitioned tables.

Author: Michael Armbrust <michael@databricks.com>

Closes #213 from marmbrus/strategyCleanup and squashes the following commits:

48ce403 [Michael Armbrust] Move one more bit of parquet stuff into the core SQLContext.
834ce08 [Michael Armbrust] Address comments.
0f2c6f5 [Michael Armbrust] Unify the logic for column pruning, projection, and filtering of table scans for both Hive and Parquet relations.  Fix tests now that we are doing a better job of column pruning.
2014-03-24 22:15:51 -07:00
Prashant Sharma 21109fbab0 SPARK-1144 Added license and RAT to check licenses.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #125 from ScrapCodes/rat-integration and squashes the following commits:

64f7c7d [Prashant Sharma] added license headers.
fcf28b1 [Prashant Sharma] Review feedback.
c0648db [Prashant Sharma] SPARK-1144 Added license and RAT to check licenses.
2014-03-24 08:44:20 -07:00
Cheng Lian 8265dc7739 Fixed coding style issues in Spark SQL
This PR addresses various coding style issues in Spark SQL, including but not limited to those mentioned by @mateiz in PR #146.

As this PR affects lots of source files and may cause potential conflicts, it would be better to merge this as soon as possible *after* PR #205 (In-memory columnar representation for Spark SQL) is merged.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #208 from liancheng/fixCodingStyle and squashes the following commits:

fc2b528 [Cheng Lian] Merge branch 'master' into fixCodingStyle
b531273 [Cheng Lian] Fixed coding style issues in sql/hive
0b56f77 [Cheng Lian] Fixed coding style issues in sql/core
fae7b02 [Cheng Lian] Addressed styling issues mentioned by @marmbrus
9265366 [Cheng Lian] Fixed coding style issues in sql/core
3dcbbbd [Cheng Lian] Fixed relative package imports for package catalyst
2014-03-23 15:21:40 -07:00
Cheng Lian 57a4379c03 [SPARK-1292] In-memory columnar representation for Spark SQL
This PR is rebased from the Catalyst repository, and contains the first version of in-memory columnar representation for Spark SQL. Compression support is not included yet and will be added later in a separate PR.

Author: Cheng Lian <lian@databricks.com>
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #205 from liancheng/memColumnarSupport and squashes the following commits:

99dba41 [Cheng Lian] Restricted new objects/classes to `private[sql]'
0892ad8 [Cheng Lian] Addressed ScalaStyle issues
af1ad5e [Cheng Lian] Fixed some minor issues introduced during rebasing
0dbf2fb [Cheng Lian] Make necessary renaming due to rebase
a162d4d [Cheng Lian] Removed the unnecessary InMemoryColumnarRelation class
9bcae4b [Cheng Lian] Added Apache license
220ee1e [Cheng Lian] Added table scan operator for in-memory columnar support.
c701c7a [Cheng Lian] Using SparkSqlSerializer for generic object SerDe causes error, made a workaround
ed8608e [Cheng Lian] Added implicit conversion from DataType to ColumnType
b8a645a [Cheng Lian] Replaced KryoSerializer with an updated SparkSqlSerializer
b6c0a49 [Cheng Lian] Minor test suite refactoring
214be73 [Cheng Lian] Refactored BINARY and GENERIC to reduce duplicate code
da2f4d5 [Cheng Lian] Added Apache license
dbf7a38 [Cheng Lian] Added ColumnAccessor and test suite, refactored ColumnBuilder
c01a177 [Cheng Lian] Added column builder classes and test suite
f18ddc6 [Cheng Lian] Added ColumnTypes and test suite
2d09066 [Cheng Lian] Added KryoSerializer
34f3c19 [Cheng Lian] Added TypeTag field to all NativeTypes
acc5c48 [Cheng Lian] Added Hive test files to .gitignore
2014-03-23 12:08:55 -07:00
Matei Zaharia dab5439a08 Make SQL keywords case-insensitive
This is a bit of a hack that allows all variations of a keyword, but it still seems to produce valid error messages and such.

Author: Matei Zaharia <matei@databricks.com>

Closes #193 from mateiz/case-insensitive-sql and squashes the following commits:

0ee4ace [Matei Zaharia] Removed unnecessary `+ ""`
e3ed773 [Matei Zaharia] Make SQL keywords case-insensitive
2014-03-21 16:53:18 -07:00
Michael Armbrust e09139d9ca Fix maven jenkins: Add explicit init for required tables in SQLQuerySuite
Sorry! I added this test at the last minute and failed to run it in maven as well.

Note that, this will probably not be sufficient to actually fix the maven jenkins build, as that does not use the dev/run-tests scripts.  We will need to configure it to also run dev/download-hive-tests.sh.  The other option would be to check in the tests as I suggested in the original PR. (I can do this if we agree its the right thing to do).

Long term it would probably be a good idea to also have maven run some sort of test env setup script so that we can decouple the test environment from the jenkins configuration.

Author: Michael Armbrust <michael@databricks.com>

Closes #191 from marmbrus/fixMaven and squashes the following commits:

3366e37 [Michael Armbrust] Fix maven jenkins: Add explicit init for required tables in SQLQuerySuite
2014-03-20 22:31:11 -07:00
Michael Armbrust 9aadcffabd SPARK-1251 Support for optimizing and executing structured queries
This pull request adds support to Spark for working with structured data using a simple SQL dialect, HiveQL and a Scala Query DSL.

*This is being contributed as a new __alpha component__ to Spark and does not modify Spark core or other components.*

The code is broken into three primary components:
 - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
 - Execution (sql/core) - A query planner / execution engine for translating Catalyst’s logical query plans into Spark RDDs.  This component also includes a new public interface, SqlContext, that allows users to execute SQL or structured scala queries against existing RDDs and Parquet files.
 - Hive Metastore Support (sql/hive) - An extension of SqlContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes.  There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.

A more complete design of this new component can be found in [the associated JIRA](https://spark-project.atlassian.net/browse/SPARK-1251).

[An updated version of the Spark documentation, including API Docs for all three sub-components,](http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html) is also available for review.

With this PR comes support for inferring the schema of existing RDDs that contain case classes.  Using this information, developers can now express structured queries that are automatically compiled into RDD operations.

```scala
// Define the schema using a case class.
case class Person(name: String, age: Int)
val people: RDD[Person] =
  sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).toInt))

// The following is the same as 'SELECT name FROM people WHERE age >= 10 && age <= 19'
val teenagers = people.where('age >= 10).where('age <= 19).select('name).toRdd
```

RDDs can also be registered as Tables, allowing SQL queries to be written over them.
```scala
people.registerAsTable("people")
val teenagers = sql("SELECT name FROM people WHERE age >= 10 && age <= 19")
```

The results of queries are themselves RDDs and support standard RDD operations:
```scala
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
```

Finally, with the optional Hive support, users can read and write data located in existing Apache Hive deployments using HiveQL.
```scala
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH 'src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sql("SELECT key, value FROM src").collect().foreach(println)
```

## Relationship to Shark

Unlike Shark, Spark SQL does not act as a drop in replacement for Hive or the HiveServer. Instead this new feature is intended to make it easier for Spark developers to run queries over structured data, using either SQL or the query DSL. After this sub-project graduates from Alpha status it will likely become a new optimizer/backend for the Shark project.

Author: Michael Armbrust <michael@databricks.com>
Author: Yin Huai <huaiyin.thu@gmail.com>
Author: Reynold Xin <rxin@apache.org>
Author: Lian, Cheng <rhythm.mail@gmail.com>
Author: Andre Schumacher <andre.schumacher@iki.fi>
Author: Yin Huai <huai@cse.ohio-state.edu>
Author: Timothy Chen <tnachen@gmail.com>
Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: Timothy Chen <tnachen@apache.org>
Author: Henry Cook <henry.m.cook+github@gmail.com>
Author: Mark Hamstra <markhamstra@gmail.com>

Closes #146 from marmbrus/catalyst and squashes the following commits:

458bd1b [Michael Armbrust] Update people.txt
0d638c3 [Michael Armbrust] Typo fix from @ash211.
bdab185 [Michael Armbrust] Address another round of comments: * Doc examples can now copy/paste into spark-shell. * SQLContext is serializable * Minor parser bugs fixed * Self-joins of RDDs now handled correctly. * Removed deprecated examples * Removed deprecated parquet docs * Made more of the API private * Copied all the DSLQuery tests and rewrote them as SQLQueryTests
778299a [Michael Armbrust] Fix some old links to spark-project.org
fead0b6 [Michael Armbrust] Create a new RDD type, SchemaRDD, that is now the return type for all SQL operations.  This improves the old API by reducing the number of implicits that are required, and avoids throwing away schema information when returning an RDD to the user.  This change also makes it slightly less verbose to run language integrated queries.
fee847b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into catalyst, integrating changes to serialization for ShuffledRDD.
48a99bc [Michael Armbrust] Address first round of feedback.
461581c [Michael Armbrust] Blacklist test that depends on JVM specific rounding behaviour
adcf1a4 [Henry Cook] Update sql-programming-guide.md
9dffbfa [Michael Armbrust] Style fixes. Add downloading of test cases to jenkins.
6978dd8 [Michael Armbrust] update docs, add apache license
1d0eb63 [Michael Armbrust] update changes with spark core
e5e1d6b [Michael Armbrust] Remove travis configuration.
c2efad6 [Michael Armbrust] First draft of SQL documentation.
013f62a [Michael Armbrust] Fix documentation / code style.
c01470f [Michael Armbrust] Clean up example
2f22454 [Michael Armbrust] WIP: Parquet example.
ce8073b [Michael Armbrust] clean up implicits.
f7d992d [Michael Armbrust] Naming / spelling.
9eb0294 [Michael Armbrust] Bring expressions implicits into SqlContext.
d2d9678 [Michael Armbrust] Make sure hive isn't in the assembly jar.  Create a separate, optional Hive assembly that is used when present.
8b35e0a [Michael Armbrust] address feedback, work on DSL
5d71074 [Michael Armbrust] Merge pull request #62 from AndreSchumacher/parquet_file_fixes
f93aa39 [Andre Schumacher] Better handling of path names in ParquetRelation
1a4bbd9 [Michael Armbrust] Merge pull request #60 from marmbrus/maven
3386e4f [Michael Armbrust] Merge pull request #58 from AndreSchumacher/parquet_fixes
3447c3e [Michael Armbrust] Don't override the metastore / warehouse in non-local/test hive context.
7233a74 [Michael Armbrust] initial support for maven builds
f0ba39e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into maven
7386a9f [Michael Armbrust] Initial example programs using spark sql.
aeaef54 [Andre Schumacher] Removing unnecessary Row copying and reverting some changes to MutableRow
7ca4b4e [Andre Schumacher] Improving checks in Parquet tests
5bacdc0 [Andre Schumacher] Moving towards mutable rows inside ParquetRowSupport
54637ec [Andre Schumacher] First part of second round of code review feedback
c2a658d [Michael Armbrust] Merge pull request #55 from marmbrus/mutableRows
ba28849 [Michael Armbrust] code review comments.
d994333 [Michael Armbrust] Remove copies before shuffle, this required changing the default shuffle serialization.
9049cf0 [Michael Armbrust] Extend MutablePair interface to support easy syntax for in-place updates.  Also add a constructor so that it can be serialized out-of-the-box.
959bdf0 [Michael Armbrust] Don't silently swallow all KryoExceptions, only the one that indicates the end of a stream.
d371393 [Michael Armbrust] Add a framework for dealing with mutable rows to reduce the number of object allocations that occur in the critical path.
c9f8fb3 [Michael Armbrust] Merge pull request #53 from AndreSchumacher/parquet_support
3c3f962 [Michael Armbrust] Fix a bug due to array reuse.  This will need to be revisited after we merge the mutable row PR.
7d0f13e [Michael Armbrust] Update parquet support with master.
9d419a6 [Michael Armbrust] Merge remote-tracking branch 'catalyst/catalystIntegration' into parquet_support
0040ae6 [Andre Schumacher] Feedback from code review
1ce01c7 [Michael Armbrust] Merge pull request #56 from liancheng/unapplySeqForRow
70e489d [Cheng Lian] Fixed a spelling typo
6d315bb [Cheng Lian] Added Row.unapplySeq to extract fields from a Row object.
8d5da5e [Michael Armbrust] modify compute-classpath.sh to include datanucleus jars explicitly
99e61fb [Michael Armbrust] Merge pull request #51 from marmbrus/expressionEval
7b9d142 [Michael Armbrust] Update travis to increase permgen size.
da9afbd [Michael Armbrust] Add byte wrappers for hive UDFS.
6fdefe6 [Michael Armbrust] Port sbt improvements from master.
296fe50 [Michael Armbrust] Address review feedback.
d7fbc3a [Michael Armbrust] Several performance enhancements and simplifications of the expression evaluation framework.
3bda72d [Andre Schumacher] Adding license banner to new files
3ac9eb0 [Andre Schumacher] Rebasing to new main branch
c863bed [Andre Schumacher] Codestyle checks
61e3bfb [Andre Schumacher] Adding WriteToFile operator and rewriting ParquetQuerySuite
3321195 [Andre Schumacher] Fixing one import in ParquetQueryTests.scala
3a0a552 [Andre Schumacher] Reorganizing Parquet table operations
18fdc44 [Andre Schumacher] Reworking Parquet metadata in relation and adding CREATE TABLE AS for Parquet tables
75262ee [Andre Schumacher] Integrating operations on Parquet files into SharkStrategies
f347273 [Andre Schumacher] Adding ParquetMetaData extraction, fixing schema projection
6a6bf98 [Andre Schumacher] Added column projections to ParquetTableScan
0f17d7b [Andre Schumacher] Rewriting ParquetRelation tests with RowWriteSupport
a11e364 [Andre Schumacher] Adding Parquet RowWriteSupport
6ad05b3 [Andre Schumacher] Moving ParquetRelation to spark.sql core
eb0e521 [Andre Schumacher] Fixing package names and other problems that came up after the rebase
99a9209 [Andre Schumacher] Expanding ParquetQueryTests to cover all primitive types
b33e47e [Andre Schumacher] First commit of Parquet import of primitive column types
c334386 [Michael Armbrust] Initial support for generating schema's based on case classes.
608a29e [Michael Armbrust] Add hive as a repl dependency
7413ac2 [Michael Armbrust] make test downloading quieter.
4d57d0e [Michael Armbrust] Fix test execution on travis.
5f2963c [Michael Armbrust] naming and continuous compilation fixes.
f5e7492 [Michael Armbrust] Add Apache license.  Make naming more consistent.
3ac9416 [Michael Armbrust] Merge support for working with schema-ed RDDs using catalyst in as a spark subproject.
2225431 [Michael Armbrust] Merge pull request #48 from marmbrus/minorFixes
d393d2a [Michael Armbrust] Review Comments: Add comment to map that adds a sub query.
24eaa79 [Michael Armbrust] fix > 100 chars
6e04e5b [Michael Armbrust] Add insertIntoTable to the DSL.
df88f01 [Michael Armbrust] add a simple test for aggregation
18a861b [Michael Armbrust] Correctly convert nested products into nested rows when turning scala data into catalyst data.
b922511 [Michael Armbrust] Fix insertion of nested types into hive tables.
5fe7de4 [Michael Armbrust] Move table creation out of rule into a separate function.
a430895 [Michael Armbrust] Planning for logical Repartition operators.
532dd37 [Michael Armbrust] Allow the local warehouse path to be specified.
4905b2b [Michael Armbrust] Add more efficient TopK that avoids global sort for logical Sort => StopAfter.
8c01c24 [Michael Armbrust] Move definition of Row out of execution to top level sql package.
c9116a6 [Michael Armbrust] Add combiner to avoid NPE when spark performs external aggregation.
29effad [Michael Armbrust] Include alias in attributes that are produced by overridden tables.
9990ec7 [Michael Armbrust] Merge pull request #28 from liancheng/columnPruning
f22df3a [Michael Armbrust] Merge pull request #37 from yhuai/SerDe
cf4db59 [Lian, Cheng] Added golden answers for PruningSuite
54f165b [Lian, Cheng] Fixed spelling typo in two golden answer file names
2682f72 [Lian, Cheng] Merge remote-tracking branch 'origin/master' into columnPruning
c5a4fab [Lian, Cheng] Merge branch 'master' into columnPruning
f670c8c [Yin Huai] Throw a NotImplementedError for not supported clauses in a CTAS query.
128a9f8 [Yin Huai] Minor changes.
017872c [Yin Huai] Remove stats20 from whitelist.
a1a4776 [Yin Huai] Update comments.
feb022c [Yin Huai] Partitioning key should be case insensitive.
555fb1d [Yin Huai] Correctly set the extension for a text file.
d00260b [Yin Huai] Strips backticks from partition keys.
334aace [Yin Huai] New golden files.
a40d6d6 [Yin Huai] Loading the static partition specified in a INSERT INTO/OVERWRITE query.
428aff5 [Yin Huai] Distinguish `INSERT INTO` and `INSERT OVERWRITE`.
eea75c5 [Yin Huai] Correctly set codec.
45ffb86 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
e089627 [Yin Huai] Code style.
563bb22 [Yin Huai] Set compression info in FileSinkDesc.
35c9a8a [Michael Armbrust] Merge pull request #46 from marmbrus/reviewFeedback
bdab5ed [Yin Huai] Add a TODO for loading data into partitioned tables.
5495fab [Yin Huai] Remove cloneRecords which is no longer needed.
1596e1b [Yin Huai] Cleanup imports to make IntelliJ happy.
3bb272d [Michael Armbrust] move org.apache.spark.sql package.scala to the correct location.
8506c17 [Michael Armbrust] Address review feedback.
3cb4f2e [Michael Armbrust] Merge pull request #45 from tnachen/master
9ad474d [Michael Armbrust] Merge pull request #44 from marmbrus/sampling
566fd66 [Timothy Chen] Whitelist tests and add support for Binary type
69adf72 [Yin Huai] Set cloneRecords to false.
a9c3188 [Timothy Chen] Fix udaf struct return
346f828 [Yin Huai] Move SharkHadoopWriter to the correct location.
59e37a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
ed3a1d1 [Yin Huai] Load data directly into Hive.
7f206b5 [Michael Armbrust] Add support for hive TABLESAMPLE PERCENT.
b6de691 [Michael Armbrust] Merge pull request #43 from liancheng/fixMakefile
1f6260d [Lian, Cheng] Fixed package name and test suite name in Makefile
5ae010f [Michael Armbrust] Merge pull request #42 from markhamstra/non-ascii
678341a [Mark Hamstra] Replaced non-ascii text
887f928 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
1f7d00a [Reynold Xin] Merge pull request #41 from marmbrus/splitComponents
7588a57 [Michael Armbrust] Break into 3 major components and move everything into the org.apache.spark.sql package.
bc9a12c [Michael Armbrust] Move hive test files.
5720d2b [Lian, Cheng] Fixed comment typo
f0c3742 [Lian, Cheng] Refactored PhysicalOperation
f235914 [Lian, Cheng] Test case udf_regex and udf_like need BooleanWritable registered
cf691df [Lian, Cheng] Added the PhysicalOperation to generalize ColumnPrunings
2407a21 [Lian, Cheng] Added optimized logical plan to debugging output
a7ad058 [Michael Armbrust] Merge pull request #40 from marmbrus/includeGoldens
9329820 [Michael Armbrust] add golden answer files to repository
dce0593 [Michael Armbrust] move golden answer to the source code directory.
964368f [Michael Armbrust] Merge pull request #39 from marmbrus/lateralView
7785ee6 [Michael Armbrust] Tighten visibility based on comments.
341116c [Michael Armbrust] address comments.
0e6c1d7 [Reynold Xin] Merge pull request #38 from yhuai/parseDBNameInCTAS
2897deb [Michael Armbrust] fix scaladoc
7123225 [Yin Huai] Correctly parse the db name and table name in INSERT queries.
b376d15 [Michael Armbrust] fix newlines at EOF
5cc367c [Michael Armbrust] use berkeley instead of cloudbees
ff5ea3f [Michael Armbrust] new golden
db92adc [Michael Armbrust] more tests passing. clean up logging.
740febb [Michael Armbrust] Tests for tgfs.
0ce61b0 [Michael Armbrust] Docs for GenericHiveUdtf.
ba8897f [Michael Armbrust] Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView
dd00b7e [Michael Armbrust] initial implementation of generators.
ea76cf9 [Michael Armbrust] Add NoRelation to planner.
bea4b7f [Michael Armbrust] Add SumDistinct.
016b489 [Michael Armbrust] fix typo.
acb9566 [Michael Armbrust] Correctly type attributes of CTAS.
8841eb8 [Michael Armbrust] Rename Transform -> ScriptTransformation.
02ff8e4 [Yin Huai] Correctly parse the db name and table name in a CTAS query.
5e4d9b4 [Michael Armbrust] Merge pull request #35 from marmbrus/smallFixes
5479066 [Reynold Xin] Merge pull request #36 from marmbrus/partialAgg
8017afb [Michael Armbrust] fix copy paste error.
dc6353b [Michael Armbrust] turn off deprecation
cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance.
883006d [Michael Armbrust] improve tests.
32b615b [Michael Armbrust] add override to asPartial.
e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe.
f94345c [Michael Armbrust] fix doc link
d8cb805 [Michael Armbrust] Implement partial aggregation.
ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings.  Remove a blank line.
b4be6a5 [Michael Armbrust] better logging when applying rules.
67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex
cb57459 [Michael Armbrust] blacklist machine specific test.
2f27604 [Michael Armbrust] Address comments / style errors.
389525d [Michael Armbrust] update golden, blacklist mr.
e3c10bd [Michael Armbrust] update whitelist.
44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex
42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs.
ab5bff3 [Michael Armbrust] Support for get item of map types.
1679554 [Michael Armbrust] add toString for if and IS NOT NULL.
ab9a131 [Michael Armbrust] when UDFs fail they should return null.
25288d0 [Michael Armbrust] Implement [] for arrays and maps.
e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions.
010accb [Michael Armbrust] add tinyint to metastore type parser.
7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes.
ac9d7de [Michael Armbrust] Resolve *s in Transform clauses.
692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables.
92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects
9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector.
72a003d [Michael Armbrust] revert regex change
7661b6c [Michael Armbrust] blacklist machines specific tests
aa430e7 [Michael Armbrust] Update .travis.yml
e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs.
5e54aa6 [Michael Armbrust] quotes for struct field names.
bbec500 [Michael Armbrust] update test coverage, new golden
3734a94 [Michael Armbrust] only quote string types.
3f9e519 [Michael Armbrust] use names w/ boolean args
5b3d2c8 [Michael Armbrust] implement distinct.
5b33216 [Michael Armbrust] work on decimal support.
2c6deb3 [Michael Armbrust] improve printing compatibility.
35a70fb [Michael Armbrust] multi-letter field names.
a9388fb [Michael Armbrust] printing for map types.
c3feda7 [Michael Armbrust] use toArray.
c654f19 [Michael Armbrust] Support for list and maps in hive table scan.
cf8d992 [Michael Armbrust] Use built in functions for creating temp directory.
1579eec [Michael Armbrust] Only cast unresolved inserts.
6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression.
da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test.  Not sure how this was passing before...
6709441 [Michael Armbrust] Evaluation for accessing nested fields.
dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation.
d670e41 [Michael Armbrust] Print nested fields like hive does.
efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan.
9c22b4e [Michael Armbrust] Support for parsing nested types.
82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables
ea6f37f [Michael Armbrust] fix style.
7845364 [Michael Armbrust] deactivate concurrent test.
b649c20 [Michael Armbrust] fix test logging / caching.
1590568 [Michael Armbrust] add log4j.properties
19bfd74 [Michael Armbrust] store hive output in circular buffer
dfb67aa [Michael Armbrust] add test case
cb775ac [Michael Armbrust] get rid of SharkContext singleton
2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master
63003e9 [Michael Armbrust] Fix spacing.
41b41f3 [Michael Armbrust] Only cast unresolved inserts.
6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs
5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator
b1151a8 [Timothy Chen] Fix load data regex
8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API.
e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction
45b334b [Yin Huai] fix comments
235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning.
6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style
271e483 [Michael Armbrust] Update build status icon.
d3a3d48 [Michael Armbrust] add testing to travis
807b2d7 [Michael Armbrust] check style and publish docs with travis
d20b565 [Michael Armbrust] fix if style
bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide.
d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests.  This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive.  These are used by default when HIVE_HOME is not set.
f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin
41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file).
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference
7213a2c [Reynold Xin] style fix for Hive.scala.
08e4d05 [Reynold Xin] First round of style cleanup.
605255e [Reynold Xin] Added scalastyle checker.
61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases
2486fb7 [Lian, Cheng] Fixed spelling
8ee41be [Lian, Cheng] Minor refactoring
ebb56fa [Michael Armbrust] add travis config
4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests
d4f539a [Michael Armbrust] blacklist mr and user specific tests.
677eb07 [Michael Armbrust] Update test whitelist.
5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning
c263c84 [Michael Armbrust] Only push predicates into partitioned table scans.
ab77882 [Michael Armbrust] upgrade spark to RC5.
c98ede5 [Lian, Cheng] Response to comments from @marmbrus
83d4520 [Yin Huai] marmbrus's comments
70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes
9ebff47 [Yin Huai] remove unnecessary .toSeq
e811d1a [Yin Huai] markhamstra's comments
4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations.
040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators.
9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions.
e9347fc [Michael Armbrust] Remove broken scaladoc links.
99c6707 [Michael Armbrust] upgrade spark
57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable
cb49af0 [Lian, Cheng] Fixed Scaladoc links
4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion
111ffdc [Lian, Cheng] More comments and minor reformatting
9e0d840 [Lian, Cheng] Added partition pruning optimization
761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan
04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
9dd3b26 [Michael Armbrust] Fix scaladoc.
6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst
7c92a41 [Lian, Cheng] Added Hive SerDe support
ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
2957f31 [Yin Huai] addressed comments on PR
907db68 [Michael Armbrust] Space after while.
04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts
4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile
5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23
fd084a4 [Michael Armbrust] implement casts binary <=> string.
0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type
548e479 [Yin Huai] merge master into exchangeOperator and fix code style
5b11db0 [Reynold Xin] Added Void to Boolean type widening.
9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear.
9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion
a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20
b20a4d4 [Lian, Cheng] Fix issue #20
6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc.
4285962 [Michael Armbrust] Remove temporary test cases
167162f [Michael Armbrust] more merge errors, cleanup.
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge.
6377d0b [Michael Armbrust] Drop empty files, fix if ().
c0b0e60 [Michael Armbrust] cleanup broken doc links.
330a88b [Michael Armbrust] Fix bugs in AddExchange.
4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering.
043e296 [Michael Armbrust] Make physical union nodes variadic.
ece15e1 [Michael Armbrust] update unit tests
5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width.
9804eb5 [Michael Armbrust] upgrade spark
053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow
5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow
ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes
bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg
563053f [Michael Armbrust] Address @rxin's comments.
6537c66 [Michael Armbrust] Address @rxin's comments.
2a76fc6 [Michael Armbrust] add notes from @rxin.
685bfa1 [Michael Armbrust] fix spelling
69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions.
7859a86 [Michael Armbrust] Remove SparkAggregate.  Its kinda broken and breaks RDD lineage.
fc22e01 [Michael Armbrust] whitelist newly passing union test.
3f547b8 [Michael Armbrust] Add support for widening types in unions.
53b95f8 [Michael Armbrust] coercion should not occur until children are resolved.
b892e32 [Michael Armbrust] Union is not resolved until the types match up.
95ab382 [Michael Armbrust] Use resolved instead of custom function.  This is better because some nodes override the notion of resolved.
81a109d [Michael Armbrust] fix link.
f143f61 [Michael Armbrust] Implement sampling.  Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!"
6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar.
c800798 [Michael Armbrust] Add build status icon.
0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown
05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row.
f2fdd77 [Michael Armbrust] fix required distribtion for aggregate.
658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala
583a337 [Michael Armbrust] break apart distribution and partitioning.
e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator
0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links.
73c70de [Yin Huai] add a first set of unit tests for data properties.
fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements.
2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans.
fcbc03b [Michael Armbrust] Fix if ().
7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators.
b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes
b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization.
e286d20 [Michael Armbrust] address code review comments.
80d0681 [Michael Armbrust] fix scaladoc links.
de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString.
3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode.
404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation.
fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution.
2abb0bc [Michael Armbrust] better debug messages, use exists.
098dfc4 [Michael Armbrust] Implement Long sorting again.
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable.
a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString.
dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode.
037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation.
ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes.
b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen
83adb9d [Yin Huai] add DataProperty
5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics.
f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator.  This can probably also be use for codegen...
6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet.
ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts.
058ec15 [Michael Armbrust] handle more writeables.
ffa9f25 [Michael Armbrust] blacklist some more MR tests.
aa2239c [Michael Armbrust] filter test lines containing Owner:
f71a325 [Michael Armbrust] Update golden jar.
a3003ae [Michael Armbrust] Update makefile to use better sharding support.
568d150 [Michael Armbrust] Updates to white/blacklist.
8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right.
c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure.  See comments for details.
09c6300 [Michael Armbrust] Add nullability information to StructFields.
5460b2d [Michael Armbrust] load srcpart by default.
3695141 [Michael Armbrust] Lots of parser improvements.
965ac9a [Michael Armbrust] Add expressions that allow access into complex types.
3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences.
8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning.
e57f97a [Michael Armbrust] more decimal/null support.
e1440ed [Michael Armbrust] Initial support for function specific type conversions.
1814ed3 [Michael Armbrust] use childrenResolved function.
f2ec57e [Michael Armbrust] Begin supporting decimal.
6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs
7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters.
d0124f3 [Michael Armbrust] Correctly type null literals.
b65626e [Michael Armbrust] Initial support for parsing BigDecimal.
a90efda [Michael Armbrust] utility function for outputing string stacktraces.
7102f33 [Michael Armbrust] methods with side-effects should use ().
3ccaef7 [Michael Armbrust] add renaming TODO.
bc282c7 [Michael Armbrust] fix bug in getNodeNumbered
c8e89d5 [Michael Armbrust] memoize inputSet calculation.
6aefa46 [Michael Armbrust] Skip folding literals.
a72e540 [Michael Armbrust] Add IN operator.
04f885b [Michael Armbrust] literals are only non-nullable if they are not null.
35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output.
12fd52d [Michael Armbrust] support for sorting longs.
0606520 [Michael Armbrust] drop old comment.
859200a [Michael Armbrust] support for reading more types from the metastore.
1fedd18 [Michael Armbrust] coercion from null to numeric types
71e902d [Michael Armbrust] fix test cases.
cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer
8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment
86355a6 [Michael Armbrust] throw error if there are unexpected join clauses.
c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute.
0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling
a92919d [Michael Armbrust] add alter view as to native commands
f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT
f0faa26 [Michael Armbrust] add sample and distinct operators.
ef7b943 [Michael Armbrust] add metastore support for float
e9f4588 [Michael Armbrust] fix > 100 char.
755b229 [Michael Armbrust] blacklist some ddl tests.
9ae740a [Michael Armbrust] blacklist more tests that require MR.
4cfc11a [Michael Armbrust] more test coverage.
0d9d56a [Michael Armbrust] add more native commands to parser
78d730d [Michael Armbrust] Load src test table on RESET.
8364ec2 [Michael Armbrust] whitelist all possible partition values.
b01468d [Michael Armbrust] support path rewrites when the query begins with a comment.
4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail.
4c5fb0f [Michael Armbrust] makefile target for building new whitelist.
4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO.
516481c [Michael Armbrust] Ignore requests to explain native commands.
68aa2e6 [Michael Armbrust] Stronger type for Token extractor.
ca4ea26 [Michael Armbrust] Support for parsing UDF(*).
1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset.
9627616 [Michael Armbrust] Use current database as default database.
9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode.
6f64cee [Michael Armbrust] don't line wrap string literal
eafaeed [Michael Armbrust] add type documentation
f54c94c [Michael Armbrust] make golden answers file a test dependency
5362365 [Michael Armbrust] push conditions into join
0d2388b [Michael Armbrust] Point at databricks hosted scaladoc.
73b29cd [Michael Armbrust] fix bad casting
9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes
7eff191 [Michael Armbrust] link all the expression names.
83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules
9de6b74 [Michael Armbrust] fix language feature and deprecation warnings.
0b1960a [Michael Armbrust] Fix broken scala doc links / warnings.
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions
01c00c2 [Michael Armbrust] new golden
5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching
66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork
1a393da [Yin Huai] folded -> foldable
1e964ea [Yin Huai] update
a43d41c [Michael Armbrust] more tests passing!
8ca38d0 [Michael Armbrust] begin support for varchar / binary types.
ab8bbd1 [Michael Armbrust] parsing % operator
c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests.
3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline.
5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
367fb9e [Yin Huai] update
0cd5cc6 [Michael Armbrust] add BIGINT cast parsing
61b266f [Michael Armbrust] comment for eliminate subqueries.
d72a5a2 [Michael Armbrust] add long to literal factory object.
b3bd15f [Michael Armbrust] blacklist more mr requiring tests.
e06fd38 [Michael Armbrust] black list map reduce tests.
8e7ce30 [Michael Armbrust] blacklist some env specific tests.
6250cbd [Michael Armbrust] Do not exit on test failure
b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath.
b6e4899 [Yin Huai] formatting
e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12
5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing?
9e190f5 [Michael Armbrust] drop unneeded ()
68b58c1 [Michael Armbrust] drop a few more tests.
b0aa400 [Michael Armbrust] update whitelist.
c99012c [Michael Armbrust] skip tests with hooks
db00ebf [Michael Armbrust] more types for hive udfs
dbc3678 [Michael Armbrust] update ghpages repo
138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure.
6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests.
46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel.  usage: "make -j 8 -i"
8d47ed4 [Yin Huai] comment
2795f05 [Yin Huai] comment
e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer
2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
0bd1688 [Yin Huai] update
6a7bd75 [Michael Armbrust] fix partition column delimiter configuration.
e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0.
b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean
52864da [Reynold Xin] Added executeCollect method to SharkPlan.
f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan.
b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException.
38124bd [Yin Huai] formatting
2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively
555d839 [Reynold Xin] More cleaning ...
d48d0e1 [Reynold Xin] Code review feedback.
aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency.
479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if)
da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename
e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope.
72426ed [Reynold Xin] Rename shark2 package to execution.
0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename
e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore
3f9fee1 [Michael Armbrust] rewrite push filter through join optimization.
c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory.
c9777d8 [Reynold Xin] Put all source files in a catalyst directory.
019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files.
80ca4be [Timothy Chen] Address comments
0079392 [Michael Armbrust] support for multiple insert commands in a single query
75b5a01 [Michael Armbrust] remove space.
4283400 [Timothy Chen] Add limited predicate push down
e547e50 [Michael Armbrust] implement First.
e77c9b6 [Michael Armbrust] more work on unique join.
c795e06 [Michael Armbrust] improve star expansion
a26494e [Michael Armbrust] allow aliases to have qualifiers
d078333 [Michael Armbrust] remove extra space
a75c023 [Michael Armbrust] implement Coalesce
3a018b6 [Michael Armbrust] fix up docs.
ab6f67d [Michael Armbrust] import the string "null" as actual null.
5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved.
191ce3e [Michael Armbrust] analyze rewrite test query.
60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved.
2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues.
e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD
c086a35 [Michael Armbrust] docs, spacing
c4060e4 [Michael Armbrust] cleanup
3b85462 [Michael Armbrust] more tests passing
bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data.
c944a95 [Michael Armbrust] First aggregate expression.
1e28311 [Michael Armbrust] make tests execute in alpha order again
a287481 [Michael Armbrust] spelling
8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing.
a6ab6c7 [Michael Armbrust] add !=
4529594 [Michael Armbrust] draft of coalesce
70f253f [Michael Armbrust] more tests passing!
7349e7b [Michael Armbrust] initial support for test thrift table
d3c9305 [Michael Armbrust] fix > 100 char line
93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE"
06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths
6355d0e [Michael Armbrust] match actual return type of count with expected
cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty.
fd4b096 [Michael Armbrust] fix casing of null strings as well.
4632695 [Michael Armbrust] support for megastore bigint
67b88cf [Michael Armbrust] more verbose debugging of evaluation return types
c680e0d [Michael Armbrust] Failed string => number conversion should return null.
2326be1 [Michael Armbrust] make getClauses case insensitive.
dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types.
045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
fb5ddfd [Michael Armbrust] move ViewExamples to examples/
83833e8 [Michael Armbrust] more tests passing!
47c98d6 [Michael Armbrust] add query tests for like and hash.
1724c16 [Michael Armbrust] clear lines that contain last updated times.
cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse.
9b2642b [Michael Armbrust] make the blacklist support regexes
1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs
910e33e [Michael Armbrust] basic support for building an assembly jar.
d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore.
495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf.
65f4e69 [Michael Armbrust] remove incorrect comments
0831a3c [Michael Armbrust] support for parsing some operator udfs.
6c27aa7 [Michael Armbrust] more cast parsing.
43db061 [Michael Armbrust] significant generalization of hive udf functionality.
3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines.
e5690a6 [Michael Armbrust] add BinaryType
adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called.
d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]).
8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage.
21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function.
0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons.
2b70abf [Michael Armbrust] true and false literals.
ef8b0a5 [Michael Armbrust] more tests.
14d070f [Michael Armbrust] add support for correctly extracting partition keys.
0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
69a0bd4 [Michael Armbrust] promote strings in predicates with number too.
3946e31 [Michael Armbrust] don't build strings unless assertion fails.
90c453d [Michael Armbrust] more tests passing!
6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting.
8000504 [Michael Armbrust] Improve type coercion.
9087152 [Michael Armbrust] fix toString of Not.
58b111c [Michael Armbrust] fix bad scaladoc tag.
d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there.
ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness.
1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types.
d873b2b [Yin Huai] Remove numbers associated with test cases.
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown
d1e7b8e [Michael Armbrust] Update README.md
c8b1553 [Michael Armbrust] Update README.md
9307ef9 [Michael Armbrust] update list of passing tests.
934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers.
a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now.
ae0024a [Yin Huai] update
cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions
21976ae [Yin Huai] update
b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions
dedbf0c [Yin Huai] support Boolean literals
eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals
37817b5 [Yin Huai] add a comment to EvaluateLiterals.
468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is.
b1d1843 [Michael Armbrust] more work on big data benchmark tests.
cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark
7d7fa9f [Michael Armbrust] support for create table as
5f54f03 [Michael Armbrust] parsing for ASC
d42b725 [Michael Armbrust] Sum of strings requires cast
34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.)
81659cb [Michael Armbrust] implement transform operator.
5cd76d6 [Michael Armbrust] break up the file based test case code for reuse
1031b65 [Michael Armbrust] support for case insensitive resolution.
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots)
b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt
d9d18b4 [Michael Armbrust] debug logging implicit.
669089c [Yin Huai] support Boolean literals
ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals
73a05fd [Yin Huai] add a comment to EvaluateLiterals.
191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is.
80039cc [Yin Huai] Merge pull request #1 from yhuai/master
cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide
5c518e4 [Michael Armbrust] fix bug in test.
b50dd0e [Michael Armbrust] fix return type of overloaded method
05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview.
8c60cc0 [Michael Armbrust] Update README.md
03b9526 [Michael Armbrust] First draft of optimizer tests.
f392755 [Michael Armbrust] Add flatMap to TreeNode
6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings
15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions.
06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression.
4b0a888 [Michael Armbrust] implement string comparison and more casts.
356b321 [Michael Armbrust] Update README.md
3776395 [Michael Armbrust] Update README.md
304d17d [Michael Armbrust] Create README.md
b7d8be0 [Michael Armbrust] more tests passing.
b82481f [Michael Armbrust] add todo comment.
02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist.
cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support.
c43a259 [Michael Armbrust] comments
15ff448 [Michael Armbrust] better error message when a dsl test throws an exception
76ec650 [Michael Armbrust] fix join conditions
e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan.
91573a4 [Michael Armbrust] initial type promotion
e2ef4a5 [Michael Armbrust] logging
e43dc1e [Michael Armbrust] add string => int cast evaluation
f1f7e96 [Michael Armbrust] fix incorrect generation of join keys
2b27230 [Michael Armbrust] add depth based subtree access
0f6279f [Michael Armbrust] broken tests.
389bc0b [Michael Armbrust] support for partitioned columns in output.
12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name.
b67a225 [Michael Armbrust] better errors when types don't match up.
9e74808 [Michael Armbrust] add children resolved.
6d03ce9 [Michael Armbrust] defaults for unresolved relation
2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions
be5ae2c [Michael Armbrust] better resolution logging
cb7b5af [Michael Armbrust] views example
420e05b [Michael Armbrust] more tests passing!
6916c63 [Michael Armbrust] Reading from partitioned hive tables.
a1245f9 [Michael Armbrust] more tests passing
956e760 [Michael Armbrust] extended explain
5f14c35 [Michael Armbrust] more test tables supported
175c43e [Michael Armbrust] better errors for parse exceptions
480ade5 [Michael Armbrust] don't use partial cached results.
8a9d21c [Michael Armbrust] fix evaluation
7aee69c [Michael Armbrust] parsing for joins, boolean logic
7fcf480 [Michael Armbrust] test for and logic
3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines.
6902490 [Michael Armbrust] fix boolean logic evaluation
4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic
8b2a2ee [Michael Armbrust] more tests passing!
ad1f3b4 [Michael Armbrust] toString for null literals
a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work)
60ec19d [Michael Armbrust] initial support for udfs
c45b440 [Michael Armbrust] support for is (not) null and boolean logic
7f4a1dc [Michael Armbrust] add NoRelation logical operator
72e183b [Michael Armbrust] support for null values in tree node args.
ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs
e5c9d1a [Michael Armbrust] use nonEmpty
dcc4fe1 [Michael Armbrust] support for src1 test table.
c78b587 [Michael Armbrust] casting.
75c3f3f [Michael Armbrust] add support for logging with scalalogging.
da2c011 [Michael Armbrust] make it more obvious when results are being truncated.
96b73ba [Michael Armbrust] more docs in TestShark
18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive.
e6d063b [Michael Armbrust] more join tests.
664c1c3 [Michael Armbrust] make parsing of function names case insensitive.
0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome.
1a6db68 [Michael Armbrust] spelling
7638cb4 [Michael Armbrust] simple join execution with dsl tests.  no hive tests yes.
859d4c9 [Michael Armbrust] better argString printing of nested trees.
fc53615 [Michael Armbrust] add same instance comparisons for tree nodes.
a026e6b [Michael Armbrust] move out hive specific operators
fff4d1c [Michael Armbrust] add simple query execution debugging
e2120ab [Michael Armbrust] sorting for strings
da06eb6 [Michael Armbrust] Parsing for sortby and joins
9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId.
8eb2460 [Michael Armbrust] add system property to override whitelist.
88124bb [Michael Armbrust] make strategy evaluation lazy.
74a3a21 [Michael Armbrust] implement outputSet
d25b171 [Michael Armbrust] Add AND and OR expressions
67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll
12acf0a [Michael Armbrust] add .DS_Store for macs
f7da6ce [Michael Armbrust] add agg with grouping expr in select test
36805b3 [Michael Armbrust] pull out and improve aggregation
75613e1 [Michael Armbrust] better evaluations failure messages.
4789a35 [Michael Armbrust] weaken type since its hard to create pure references.
e89dd36 [Michael Armbrust] no newline for online trees
d0590d4 [Michael Armbrust] include stack trace for catalyst failures.
081c0d9 [Michael Armbrust] more generic computation of agg functions.
31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser
ecd45b2 [Michael Armbrust] Add more passing tests.
97d5419 [Michael Armbrust] fix alignment.
565cc13 [Michael Armbrust] make the canary query optional.
a95e65c [Michael Armbrust] support for resolving qualified attribute references.
e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails.
4640a0b [Michael Armbrust] handle test tables when database is specified.
bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis.
fec5158 [Michael Armbrust] add hive / idea files to .gitignore
3f97ffe [Michael Armbrust] Rename Hive => HiveQl
656b836 [Michael Armbrust] Support for parsing select clause aliases.
3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs.
3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception.
8cbef8a [Michael Armbrust] spelling
aa8c37c [Michael Armbrust] Better toString for SortOrder
1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions
a2e0327 [Michael Armbrust] add a bunch of tests.
4a3e1ea [Michael Armbrust] docs and use shark for data loading.
339bb8f [Michael Armbrust] better docs, Not support
1d7b2d9 [Michael Armbrust] Add NaN conversions.
46a2534 [Michael Armbrust] only run canary query on failure.
8996066 [Michael Armbrust] remove protected from makeCopy
53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass
04a372a [Michael Armbrust] add a flag for running all tests.
3b2235b [Michael Armbrust] More general implementation of arithmetic.
edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results
da6c577 [Michael Armbrust] add string <==> file utility functions.
3adf5ca [Michael Armbrust] Initial support for groupBy and count.
7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ.
a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product.
d66ba7e [Michael Armbrust] drop printlns.
88f2efd [Michael Armbrust] Add sum / count distinct expressions.
05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark
07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands.
b8a9910 [Michael Armbrust] quote tests passing.
8e5e267 [Michael Armbrust] handle aliased select expressions.
4286a96 [Michael Armbrust] drop debugging println
ac34aeb [Michael Armbrust] proof of concept for hive ast transformations.
2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments
ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general.
74a6337 [Michael Armbrust] use fastEquals when doing transformations.
1184a23 [Michael Armbrust] add native test for escapes.
b972b18 [Michael Armbrust] create BaseRelation class
fa6bce9 [Michael Armbrust] implement union
6391a87 [Michael Armbrust] count aggregate.
d47c317 [Michael Armbrust] add unary minus, more tests passing.
c7114e4 [Michael Armbrust] first draft of star expansion.
044c43d [Michael Armbrust] better support for numeric literal parsing.
1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view.
61503c5 [Michael Armbrust] add cached toRdd
2036883 [Michael Armbrust] skip explain queries when testing.
ebac4b1 [Michael Armbrust] fix bug in sort reference calculation
ca0dee0 [Michael Armbrust] docs.
1ee0471 [Michael Armbrust] string literal parsing.
357278b [Michael Armbrust] add limit support
9b3e479 [Michael Armbrust] creation of string literals.
02efa30 [Michael Armbrust] alias evaluation
cb68b33 [Michael Armbrust] parsing for random sample in hive ql.
126dd36 [Michael Armbrust] include query plans in failure output
bb59ae9 [Michael Armbrust] doc fixes
7e68286 [Michael Armbrust] fix confusing naming
768bb25 [Michael Armbrust] handle errors in shark query toString
829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark.  Make test shark a singleton to avoid weirdness with the hive megastore.
ad02e41 [Michael Armbrust] comment jdo dependency
7bc89fe [Michael Armbrust] add collect to TreeNode.
438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs.
09679ee [Michael Armbrust] fix bug in TreeNode foreach
2930b27 [Michael Armbrust] more specific name for del query tests.
8842549 [Michael Armbrust] docs.
da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL.
a8969b9 [Michael Armbrust] Factor out hive query comparison test framework.
1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations.
a36dd9a [Michael Armbrust] evaluation for other > data types.
cae729b [Michael Armbrust] remove unnecessary lazy vals.
d8e12af [Michael Armbrust] docs
3a60d67 [Michael Armbrust] implement average, placeholder for count
f05c106 [Michael Armbrust] checkAnswer handles single row results.
2730534 [Michael Armbrust] implement inputSet
a9aa79d [Michael Armbrust] debugging for sort exec
8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors.
554b4b2 [Michael Armbrust] BoundAttribute pretty printing.
754f5fa [Michael Armbrust] dsl for setting nullability
a206d7a [Michael Armbrust] clean up query tests.
84ad6ef [Michael Armbrust] better sort implementation and tests.
de24923 [Michael Armbrust] add double type.
9611a2c [Michael Armbrust] literal creation for doubles.
7358313 [Michael Armbrust] sort order returns child type.
b544715 [Michael Armbrust] implement eval for rand, and > for doubles
7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols)
1c1a35e [Michael Armbrust] add simple Rand expression.
3ca51de [Michael Armbrust] add orderBy to dsl
7ae41ab [Michael Armbrust] more literal implicit conversions
b18b675 [Michael Armbrust] First cut at native query tests for shark.
d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark.
5eac895 [Michael Armbrust] better error when descending is specified.
2b16f86 [Michael Armbrust] add todo
e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization
9dde3c8 [Michael Armbrust] add project and filter operations.
ad9037b [Michael Armbrust] Add support for local relations.
6227143 [Michael Armbrust] evaluation of Equals.
7526290 [Michael Armbrust] BoundReference should also be an Attribute.
bd33e26 [Michael Armbrust] more documentation
5de0ea3 [Michael Armbrust] Move all shark specific into a separate package.  Lots of documentation improvements.
0ae292b [Michael Armbrust] implement calculation of sort expressions.
9fd5011 [Michael Armbrust] First cut at expression evaluation.
6259e3a [Michael Armbrust] cleanup
787e5a2 [Michael Armbrust] use fastEquals
f90da36 [Michael Armbrust] better printing of optimization exceptions
b05dd67 [Michael Armbrust] Application of rules to fixed point.
bb2e0db [Michael Armbrust] pretty print for literals.
1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals.
d3a3687 [Michael Armbrust] add fastEquals
2b4935b [Michael Armbrust] set sbt.version explicitly
46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests.
c79f2fd [Michael Armbrust] insert operator should return an empty rdd.
14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input.
ae7b4c3 [Michael Armbrust] remove implicit dependencies.  now compiles without copying things into lib/ manually.
84082f9 [Michael Armbrust] add sbt binaries and scripts
15371a8 [Michael Armbrust] First draft of simple Hive DDL parser.
063bf44 [Michael Armbrust] Periods should end all comments.
e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack.
ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing!
b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust.
e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected.
26f410a [Michael Armbrust] physical traits should extend PhysicalPlan.
dc72469 [Michael Armbrust] beginning of hive compatibility testing framework.
0763490 [Michael Armbrust] support for hive native command pass-through.
d8a924f [Michael Armbrust] scaladoc
29a7163 [Michael Armbrust] Insert into hive table physical operator.
633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy.
59ac444 [Michael Armbrust] add unary expression
3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName'
665f7d0 [Michael Armbrust] add logical nodes for hive data sinks.
64d2923 [Michael Armbrust] Add classes for representing sorts.
f72b7ce [Michael Armbrust] first trivial end to end query execution.
5c7d244 [Michael Armbrust] first draft of references implementation.
7bff274 [Michael Armbrust] point at new shark.
c7cd57f [Michael Armbrust] docs for util function.
910811c [Michael Armbrust] check each item of the sequence
ef21a0b [Michael Armbrust] line up comments.
4b765d5 [Michael Armbrust] docs, drop println
6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution.
a703c49 [Michael Armbrust] this order works better until fixed point is implemented.
ec1d7c0 [Michael Armbrust] Simple attribute resolution.
069df02 [Michael Armbrust] parsing binary predicates
a1cf754 [Michael Armbrust] add joins and equality.
3f5bc98 [Michael Armbrust] add optiq to sbt.
54f3460 [Michael Armbrust] initial optiq parsing.
d9161ce [Michael Armbrust] add join operator
1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs
24ef6fb [Michael Armbrust] toString for alias.
ae7d776 [Michael Armbrust] add nullability changing function
d49dc02 [Michael Armbrust] scaladoc for named exprs
7c45dd7 [Michael Armbrust] pretty printing of trees.
78e34bf [Michael Armbrust] simple git ignore.
7ba19be [Michael Armbrust] First draft of interface to hive metastore.
7e7acf0 [Michael Armbrust] physical placeholder.
1c11136 [Michael Armbrust] first draft of error handling / plans for debugging.
3766a41 [Michael Armbrust] rearrange utility functions.
7fb3d5e [Michael Armbrust] docs and equality improvements.
45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions.
002d4d4 [Michael Armbrust] default to no alias.
be25003 [Michael Armbrust] add repl initialization to sbt.
0608a00 [Michael Armbrust] tighten public interface
a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms.
daa71ca [Michael Armbrust] foreach, maps, and scaladoc
6a158cb [Michael Armbrust] simple transform working.
db0299f [Michael Armbrust] basic analysis of relations minus transform function.
f74c4ee [Michael Armbrust] parsing a simple query.
08e4f57 [Michael Armbrust] upgrade scala include shark.
d3c6404 [Michael Armbrust] initial commit
2014-03-20 18:03:20 -07:00