Commit graph

550 commits

Author SHA1 Message Date
Sean Owen 2b7bd29eb6 SPARK-1789. Multiple versions of Netty dependencies cause FlumeStreamSuite failure
TL;DR is there is a bit of JAR hell trouble with Netty, that can be mostly resolved and will resolve a test failure.

I hit the error described at http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html while running FlumeStreamingSuite, and have for a short while (is it just me?)

velvia notes:
"I have found a workaround.  If you add akka 2.2.4 to your dependencies, then everything works, probably because akka 2.2.4 brings in newer version of Jetty."

There are at least 3 versions of Netty in play in the build:

- the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and that is the immediate problem
- the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6.
- but, Spark Core directly uses io.netty:netty-all:4.0.17.Final

The POMs try to exclude other versions of netty, but are excluding org.jboss.netty:netty, when in fact older versions of io.netty:netty (not netty-all) are also an issue.

The org.jboss.netty:netty excludes are largely unnecessary. I replaced many of them with io.netty:netty exclusions until everything agreed on io.netty:netty-all:4.0.17.Final.

But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. Down-grading to 3.6.6.Final across the board made some Spark code not compile.

If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to work. Part of the reason seems to be that Netty 3.x used the old `org.jboss.netty` packages. This is less than ideal, but is no worse than the current situation.

So this PR resolves the issue and improves the JAR hell, even if it leaves the existing theoretical Netty 3-vs-4 conflict:

- Remove org.jboss.netty excludes where possible, for clarity; they're not needed except with Hadoop artifacts
- Add io.netty:netty excludes where needed -- except, let akka keep its io.netty:netty
- Change a bit of test code that actually depended on Netty 3.x, to use 4.x equivalent
- Update SBT build accordingly

A better change would be to update Akka far enough such that it agrees on Netty 4.x, but I don't know if that's feasible.

Author: Sean Owen <sowen@cloudera.com>

Closes #723 from srowen/SPARK-1789 and squashes the following commits:

43661b7 [Sean Owen] Update and add Netty excludes to prevent some JAR conflicts that cause test issues
2014-05-10 20:50:40 -07:00
Michael Armbrust 4d60553298 [SQL] Upgrade parquet library.
I think we are hitting this issue in some perf tests: 6aed5288fd

Credit to @aarondav !

Author: Michael Armbrust <michael@databricks.com>

Closes #684 from marmbrus/upgradeParquet and squashes the following commits:

e10a619 [Michael Armbrust] Upgrade parquet library.
2014-05-10 11:48:01 -07:00
witgo 561510867a [SPARK-1644] The org.datanucleus:* should not be packaged into spark-assembly-*.jar
Author: witgo <witgo@qq.com>

Closes #688 from witgo/SPARK-1644 and squashes the following commits:

56ad6ac [witgo] review commit
87c03e4 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1644
6ffa7e4 [witgo] review commit
a597414 [witgo] The org.datanucleus:* should not be packaged into spark-assembly-*.jar
2014-05-10 10:15:04 -07:00
Matei Zaharia 951a5d9398 [SPARK-1549] Add Python support to spark-submit
This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that's all that was supported before) and updates the PySpark code to properly find various paths, etc. One significant change is that we assume we can always find the Python files either from the Spark assembly JAR (which will happen with the Maven assembly build in make-distribution.sh) or from SPARK_HOME (which will exist in local mode even if you use sbt assembly, and should be enough for testing). This means we no longer need a weird hack to modify the environment for YARN.

This patch also updates the Python worker manager to run python with -u, which means unbuffered output (send it to our logs right away instead of waiting a while after stuff was written); this should simplify debugging.

In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709, setting the main class from a JAR's Main-Class attribute if not specified by the user, and fixes a few help strings and style issues in spark-submit.

In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0.

Author: Matei Zaharia <matei@databricks.com>

Closes #664 from mateiz/py-submit and squashes the following commits:

15e9669 [Matei Zaharia] Fix some uses of path.separator property
051278c [Matei Zaharia] Small style fixes
0afe886 [Matei Zaharia] Add license headers
4650412 [Matei Zaharia] Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests
15f8e1e [Matei Zaharia] Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside
47c0655 [Matei Zaharia] More work to make spark-submit work with Python:
d4375bd [Matei Zaharia] Clean up description of spark-submit args a bit and add Python ones
2014-05-06 15:12:35 -07:00
Thomas Graves 1e829905c7 SPARK-1474: Spark on yarn assembly doesn't include AmIpFilter
We use org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter in spark on yarn but are not included it in the assembly jar.

I tested this on yarn cluster by removing the yarn jars from the classpath and spark runs fine now.

Author: Thomas Graves <tgraves@apache.org>

Closes #406 from tgravescs/SPARK-1474 and squashes the following commits:

1548bf9 [Thomas Graves] SPARK-1474: Spark on yarn assembly doesn't include org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
2014-05-06 12:00:09 -07:00
Sean Owen 73b0cbcc24 SPARK-1556. jets3t dep doesn't update properly with newer Hadoop versions
See related discussion at https://github.com/apache/spark/pull/468

This PR may still overstep what you have in mind, but let me put it on the table to start. Besides fixing the issue, it has one substantive change, and that is to manage Hadoop-specific things only in Hadoop-related profiles. This does _not_ remove `yarn.version`.

- Moves the YARN and Hadoop profiles together in pom.xml. Sorry that this makes the diff a little hard to grok but the changes are only as follows.
- Removes `hadoop.major.version`
- Introduce `hadoop-2.2` and `hadoop-2.3` profiles to control Hadoop-specific changes:
  - like the protobuf version issue - this was only 'solved' now by enabling YARN for 2.2+, which is really an orthogonal issue
  - like the jets3t version issue now
- Hadoop profiles set an appropriate default `hadoop.version`, that can be overridden
- _(YARN profiles in the parent now only exist to add the sub-module)_
- Fixes the jets3t dependency issue
 - and makes it a runtime dependency
 - and centralizes config of this guy in the parent pom
- Updates build docs
- Updates SBT build too
  - and fixes a regex problem along the way

Author: Sean Owen <sowen@cloudera.com>

Closes #629 from srowen/SPARK-1556 and squashes the following commits:

c3fa967 [Sean Owen] Fix hadoop-2.4 profile typo in doc
a2105fd [Sean Owen] Add hadoop-2.4 profile and don't set hadoop.version in profiles
274f4f9 [Sean Owen] Make jets3t a runtime dependency, and bring its exclusion up into parent config
bbed826 [Sean Owen] Use jets3t 0.9.0 for Hadoop 2.3+ (and correct similar regex issue in SBT build)
f21f356 [Sean Owen] Build changes to set up for jets3t fix
2014-05-05 10:33:49 -07:00
Sean Owen f5041579ff SPARK-1629. Addendum: Depend on commons lang3 (already used by tachyon) as it's used in ReplSuite, and return to use lang3 utility in Utils.scala
For consideration. This was proposed in related discussion: https://github.com/apache/spark/pull/569

Author: Sean Owen <sowen@cloudera.com>

Closes #635 from srowen/SPARK-1629.2 and squashes the following commits:

a442b98 [Sean Owen] Depend on commons lang3 (already used by tachyon) as it's used in ReplSuite, and return to use lang3 utility in Utils.scala
2014-05-04 17:43:35 -07:00
Xiangrui Meng 3f38334f44 [SPARK-1636][MLLIB] Move main methods to examples
* `NaiveBayes` -> `SparseNaiveBayes`
* `KMeans` -> `DenseKMeans`
* `SVMWithSGD` and `LogisticRegerssionWithSGD` -> `BinaryClassification`
* `ALS` -> `MovieLensALS`
* `LinearRegressionWithSGD`, `LassoWithSGD`, and `RidgeRegressionWithSGD` -> `LinearRegression`
* `DecisionTree` -> `DecisionTreeRunner`

`scopt` is used for parsing command-line parameters. `scopt` has MIT license and it only depends on `scala-library`.

Example help message:

~~~
BinaryClassification: an example app for binary classification.
Usage: BinaryClassification [options] <input>

  --numIterations <value>
        number of iterations
  --stepSize <value>
        initial step size, default: 1.0
  --algorithm <value>
        algorithm (SVM,LR), default: LR
  --regType <value>
        regularization type (L1,L2), default: L2
  --regParam <value>
        regularization parameter, default: 0.1
  <input>
        input paths to labeled examples in LIBSVM format
~~~

Author: Xiangrui Meng <meng@databricks.com>

Closes #584 from mengxr/mllib-main and squashes the following commits:

7b58c60 [Xiangrui Meng] minor
6e35d7e [Xiangrui Meng] make imports explicit and fix code style
c6178c9 [Xiangrui Meng] update TS PCA/SVD to use new spark-submit
6acff75 [Xiangrui Meng] use scopt for DecisionTreeRunner
be86069 [Xiangrui Meng] use main instead of extending App
b3edf68 [Xiangrui Meng] move DecisionTree's main method to examples
8bfaa5a [Xiangrui Meng] change NaiveBayesParams to Params
fe23dcb [Xiangrui Meng] remove main from KMeans and add DenseKMeans as an example
67f4448 [Xiangrui Meng] remove main methods from linear regression algorithms and add LinearRegression example
b066bbc [Xiangrui Meng] remove main from ALS and add MovieLensALS example
b040f3b [Xiangrui Meng] change BinaryClassificationParams to Params
577945b [Xiangrui Meng] remove unused imports from NB
3d299bc [Xiangrui Meng] remove main from LR/SVM and add an example app for binary classification
f70878e [Xiangrui Meng] remove main from NaiveBayes and add an example NaiveBayes app
01ec2cd [Xiangrui Meng] Merge branch 'master' into mllib-main
9420692 [Xiangrui Meng] add scopt to examples dependencies
2014-04-29 00:41:03 -07:00
witgo 030f2c2126 Improved build configuration
1, Fix SPARK-1441: compile spark core error with hadoop 0.23.x
2, Fix SPARK-1491: maven hadoop-provided profile fails to build
3, Fix org.scala-lang: * ,org.apache.avro:* inconsistent versions dependency
4, A modified on the sql/catalyst/pom.xml,sql/hive/pom.xml,sql/core/pom.xml (Four spaces formatted into two spaces)

Author: witgo <witgo@qq.com>

Closes #480 from witgo/format_pom and squashes the following commits:

03f652f [witgo] review commit
b452680 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
bee920d [witgo] revert fix SPARK-1629: Spark Core missing commons-lang dependence
7382a07 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
6902c91 [witgo] fix SPARK-1629: Spark Core missing commons-lang dependence
0da4bc3 [witgo] merge master
d1718ed [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
e345919 [witgo] add avro dependency to yarn-alpha
77fad08 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
62d0862 [witgo] Fix org.scala-lang: * inconsistent versions dependency
1a162d7 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
934f24d [witgo] review commit
cf46edc [witgo] exclude jruby
06e7328 [witgo] Merge branch 'SparkBuild' into format_pom
99464d2 [witgo] fix maven hadoop-provided profile fails to build
0c6c1fc [witgo] Fix compile spark core error with hadoop 0.23.x
6851bec [witgo] Maintain consistent SparkBuild.scala, pom.xml
2014-04-28 22:51:46 -07:00
Cheng Hao ea01affc34 Update the import package name for TestHive in sbt shell
sbt/sbt hive/console will fail as TestHive changed its package from "org.apache.spark.sql.hive" to "org.apache.spark.sql.hive.test".

Author: Cheng Hao <hao.cheng@intel.com>

Closes #574 from chenghao-intel/hive_console and squashes the following commits:

de14035 [Cheng Hao] Update the import package name for TestHive in sbt shell
2014-04-27 23:57:29 -07:00
Matei Zaharia a24d918c71 SPARK-1621 Upgrade Chill to 0.3.6
It registers more Scala classes, including things like Ranges that we had to register manually before. See https://github.com/twitter/chill/releases for Chill's change log.

Author: Matei Zaharia <matei@databricks.com>

Closes #543 from mateiz/chill-0.3.6 and squashes the following commits:

a1dc5e0 [Matei Zaharia] Upgrade Chill to 0.3.6 and remove our special registration of Ranges
2014-04-25 11:12:41 -07:00
tmalaska d5c6ae6cc3 SPARK-1584: Upgrade Flume dependency to 1.4.0
Updated the Flume dependency in the maven pom file and the scala build file.

Author: tmalaska <ted.malaska@cloudera.com>

Closes #507 from tmalaska/master and squashes the following commits:

79492c8 [tmalaska] excluded all thrift
159c3f1 [tmalaska] fixed the flume pom file issues
5bf56a7 [tmalaska] Upgrade flume version
2014-04-24 20:31:17 -07:00
Patrick Wendell cd4ed29326 SPARK-1119 and other build improvements
1. Makes assembly and examples jar naming consistent in maven/sbt.
2. Updates make-distribution.sh to use Maven and fixes some bugs.
3. Updates the create-release script to call make-distribution script.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #502 from pwendell/make-distribution and squashes the following commits:

1a97f0d [Patrick Wendell] SPARK-1119 and other build improvements
2014-04-23 10:19:32 -07:00
Michael Armbrust aa77f8a6a6 SPARK-1562 Fix visibility / annotation of Spark SQL APIs
Author: Michael Armbrust <michael@databricks.com>

Closes #489 from marmbrus/sqlDocFixes and squashes the following commits:

acee4f3 [Michael Armbrust] Fix visibility / annotation of Spark SQL APIs
2014-04-22 20:02:33 -07:00
Ahir Reddy 0f87e6ad43 [SPARK-1560]: Updated Pyrolite Dependency to be Java 6 compatible
Changed the Pyrolite dependency to a build which targets Java 6.

Author: Ahir Reddy <ahirreddy@gmail.com>

Closes #479 from ahirreddy/java6-pyrolite and squashes the following commits:

8ea25d3 [Ahir Reddy] Updated maven build to use java 6 compatible pyrolite
dabc703 [Ahir Reddy] Updated Pyrolite dependency to be Java 6 compatible
2014-04-22 09:44:41 -07:00
Matei Zaharia fc78384704 [SPARK-1439, SPARK-1440] Generate unified Scaladoc across projects and Javadocs
I used the sbt-unidoc plugin (https://github.com/sbt/sbt-unidoc) to create a unified Scaladoc of our public packages, and generate Javadocs as well. One limitation is that I haven't found an easy way to exclude packages in the Javadoc; there is a SBT task that identifies Java sources to run javadoc on, but it's been very difficult to modify it from outside to change what is set in the unidoc package. Some SBT-savvy people should help with this. The Javadoc site also lacks package-level descriptions and things like that, so we may want to look into that. We may decide not to post these right now if it's too limited compared to the Scala one.

Example of the built doc site: http://people.csail.mit.edu/matei/spark-unified-docs/

Author: Matei Zaharia <matei@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Patrick Wendell <pwendell@gmail.com>

Closes #457 from mateiz/better-docs and squashes the following commits:

a63d4a3 [Matei Zaharia] Skip Java/Scala API docs for Python package
5ea1f43 [Matei Zaharia] Fix links to Java classes in Java guide, fix some JS for scrolling to anchors on page load
f05abc0 [Matei Zaharia] Don't include java.lang package names
995e992 [Matei Zaharia] Skip internal packages and class names with $ in JavaDoc
a14a93c [Matei Zaharia] typo
76ce64d [Matei Zaharia] Add groups to Javadoc index page, and a first package-info.java
ed6f994 [Matei Zaharia] Generate JavaDoc as well, add titles, update doc site to use unified docs
acb993d [Matei Zaharia] Add Unidoc plugin for the projects we want Unidoced
2014-04-21 21:57:40 -07:00
Xiangrui Meng aa17f022c5 [SPARK-1520] remove fastutil from dependencies
A quick fix for https://issues.apache.org/jira/browse/SPARK-1520

By excluding fastutil, we bring the number of files in the assembly jar back under 65536, so Java 7 won't create the assembly jar in zip64 format, which cannot be read by Java 6.

With this change, the assembly jar now has about 60000 entries (58000 files), tested with both sbt and maven.

Author: Xiangrui Meng <meng@databricks.com>

Closes #437 from mengxr/remove-fastutil and squashes the following commits:

00f9beb [Xiangrui Meng] remove fastutil from dependencies
2014-04-18 10:03:15 -07:00
Cheng Lian 6a10d80162 [SPARK-959] Updated SBT from 0.13.1 to 0.13.2
JIRA issue: [SPARK-959](https://spark-project.atlassian.net/browse/SPARK-959)

SBT 0.13.2 has been officially released. This version updated Ivy 2.0 to Ivy 2.3, which fixes [IVY-899](https://issues.apache.org/jira/browse/IVY-899). This PR also removed previous workaround.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #426 from liancheng/updateSbt and squashes the following commits:

95e3dc8 [Cheng Lian] Updated SBT from 0.13.1 to 0.13.2 to fix SPARK-959
2014-04-16 08:52:14 -07:00
Ahir Reddy c99bcb7fea SPARK-1374: PySpark API for SparkSQL
An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries, with string keys and primitive values (boolean, float, int, long, string) can be converted into a SchemaRDD that supports sql queries.

```
from pyspark.context import SQLContext
sqlCtx = SQLContext(sc)
rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"}, {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
srdd = sqlCtx.applySchema(rdd)
sqlCtx.registerRDDAsTable(srdd, "table1")
srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1")
srdd2.collect()
```
The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": "row2"}, {"f1" : 3, "f2": "row3"}]```

Author: Ahir Reddy <ahirreddy@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #363 from ahirreddy/pysql and squashes the following commits:

0294497 [Ahir Reddy] Updated log4j properties to supress Hive Warns
307d6e0 [Ahir Reddy] Style fix
6f7b8f6 [Ahir Reddy] Temporary fix MIMA checker. Since we now assemble Spark jar with Hive, we don't want to check the interfaces of all of our hive dependencies
3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py
29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and caching PythonRDD
f2312c7 [Ahir Reddy] Moved everything into sql.py
a19afe4 [Ahir Reddy] Doc fixes
6d658ba [Ahir Reddy] Remove the metastore directory created by the HiveContext tests in SparkSQL
521ff6d [Ahir Reddy] Trying to get spark to build with hive
ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins
ded03e7 [Ahir Reddy] Added doc test for HiveContext
22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency
e4da06c [Ahir Reddy] Display message if hive is not built into spark
227a0be [Michael Armbrust] Update API links. Fix Hive example.
58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api.  Minor fixes.
4285340 [Michael Armbrust] Fix building of Hive API Docs.
38a92b0 [Michael Armbrust] Add note to future non-python developers about python docs.
337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build
40491c9 [Ahir Reddy] PR Changes + Method Visibility
1836944 [Michael Armbrust] Fix comments.
e00980f [Michael Armbrust] First draft of python sql programming guide.
b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit test
f98a422 [Ahir Reddy] HiveContexts
79621cf [Ahir Reddy] cleaning up cruft
b406ba0 [Ahir Reddy] doctest formatting
20936a5 [Ahir Reddy] Added tests and documentation
e4d21b4 [Ahir Reddy] Added pyrolite dependency
79f739d [Ahir Reddy] added more tests
7515ba0 [Ahir Reddy] added more tests :)
d26ec5e [Ahir Reddy] added test
e9f5b8d [Ahir Reddy] adding tests
906d180 [Ahir Reddy] added todo explaining cost of creating Row object in python
251f99d [Ahir Reddy] for now only allow dictionaries as input
09b9980 [Ahir Reddy] made jrdd explicitly lazy
c608947 [Ahir Reddy] SchemaRDD now has all RDD operations
725c91e [Ahir Reddy] awesome row objects
55d1c76 [Ahir Reddy] return row objects
4fe1319 [Ahir Reddy] output dictionaries correctly
be079de [Ahir Reddy] returning dictionaries works
cd5f79f [Ahir Reddy] Switched to using Scala SQLContext
e948bd9 [Ahir Reddy] yippie
4886052 [Ahir Reddy] even better
c0fb1c6 [Ahir Reddy] more working
043ca85 [Ahir Reddy] working
5496f9f [Ahir Reddy] doesn't crash
b8b904b [Ahir Reddy] Added schema rdd class
67ba875 [Ahir Reddy] java to python, and python to java
bcc0f23 [Ahir Reddy] Java to python
ab6025d [Ahir Reddy] compiling
2014-04-15 00:07:55 -07:00
Sean Owen 0247b5c546 SPARK-1488. Resolve scalac feature warnings during build
For your consideration: scalac currently notes a number of feature warnings during compilation:

```
[warn] there were 65 feature warning(s); re-run with -feature for details
```

Warnings are like:

```
[warn] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:1261: implicit conversion method rddToPairRDDFunctions should be enabled
[warn] by making the implicit value scala.language.implicitConversions visible.
[warn] This can be achieved by adding the import clause 'import scala.language.implicitConversions'
[warn] or by setting the compiler option -language:implicitConversions.
[warn] See the Scala docs for value scala.language.implicitConversions for a discussion
[warn] why the feature should be explicitly enabled.
[warn]   implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) =
[warn]                ^
```

scalac is suggesting that it's just best practice to explicitly enable certain language features by importing them where used.

This PR simply adds the imports it suggests (and squashes one other Java warning along the way). This leaves just deprecation warnings in the build.

Author: Sean Owen <sowen@cloudera.com>

Closes #404 from srowen/SPARK-1488 and squashes the following commits:

8598980 [Sean Owen] Quiet scalac warnings about language features by explicitly importing language features.
39bc831 [Sean Owen] Enable -feature in scalac to emit language feature warnings
2014-04-14 19:50:00 -07:00
Sean Owen 165e06a74c SPARK-1057 (alternative) Remove fastutil
(This is for discussion at this point -- I'm not suggesting this should be committed.)

This is what removing fastutil looks like. Much of it is straightforward, like using `java.io` buffered stream classes, and Guava for murmurhash3.

Uses of the `FastByteArrayOutputStream` were a little trickier. In only one case though do I think the change to use `java.io` actually entails an extra array copy.

The rest is using `OpenHashMap` and `OpenHashSet`.  These are now written in terms of more scala-like operations.

`OpenHashMap` is where I made three non-trivial changes to make it work, and they need review:

- It is no longer private
- The key must be a `ClassTag`
- Unless a lot of other code changes, the key type can't enforce being a supertype of `Null`

It all works and tests pass, and I think there is reason to believe it's OK from a speed perspective.

But what about those last changes?

Author: Sean Owen <sowen@cloudera.com>

Closes #266 from srowen/SPARK-1057-alternate and squashes the following commits:

2601129 [Sean Owen] Fix Map return type error not previously caught
ec65502 [Sean Owen] Updates from matei's review
00bc81e [Sean Owen] Remove use of fastutil and replace with use of java.io, spark.util and Guava classes
2014-04-11 22:46:47 -07:00
Patrick Wendell 7b52b66312 Revert "SPARK-1433: Upgrade Mesos dependency to 0.17.0"
This reverts commit 12c077d5aa.
2014-04-10 14:43:29 -07:00
Holden Karau fa0524fd02 Spark-939: allow user jars to take precedence over spark jars
I still need to do a small bit of re-factoring [mostly the one Java file I'll switch it back to a Scala file and use it in both the close loaders], but comments on other things I should do would be great.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #217 from holdenk/spark-939-allow-user-jars-to-take-precedence-over-spark-jars and squashes the following commits:

cf0cac9 [Holden Karau] Fix the executorclassloader
1955232 [Holden Karau] Fix long line in TestUtils
8f89965 [Holden Karau] Fix tests for new class name
7546549 [Holden Karau] CR feedback, merge some of the testutils methods down, rename the classloader
644719f [Holden Karau] User the class generator for the repl class loader tests too
f0b7114 [Holden Karau] Fix the core/src/test/scala/org/apache/spark/executor/ExecutorURLClassLoaderSuite.scala tests
204b199 [Holden Karau] Fix the generated classes
9f68f10 [Holden Karau] Start rewriting the ExecutorURLClassLoaderSuite to not use the hard coded classes
858aba2 [Holden Karau] Remove a bunch of test junk
261aaee [Holden Karau] simplify executorurlclassloader a bit
7a7bf5f [Holden Karau] CR feedback
d4ae848 [Holden Karau] rewrite component into scala
aa95083 [Holden Karau] CR feedback
7752594 [Holden Karau] re-add https comment
a0ef85a [Holden Karau] Fix style issues
125ea7f [Holden Karau] Easier to just remove those files, we don't need them
bb8d179 [Holden Karau] Fix issues with the repl class loader
241b03d [Holden Karau] fix my rat excludes
a343350 [Holden Karau] Update rat-excludes and remove a useless file
d90d217 [Holden Karau] Fix fall back with custom class loader and add a test for it
4919bf9 [Holden Karau] Fix parent calling class loader issue
8a67302 [Holden Karau] Test are good
9e2d236 [Holden Karau] It works comrade
691ee00 [Holden Karau] It works ish
dc4fe44 [Holden Karau] Does not depend on being in my home directory
47046ff [Holden Karau] Remove bad import'
22d83cb [Holden Karau] Add a test suite for the executor url class loader suite
7ef4628 [Holden Karau] Clean up
792d961 [Holden Karau] Almost works
16aecd1 [Holden Karau] Doesn't quite work
8d2241e [Holden Karau] Adda FakeClass for testing ClassLoader precedence options
648b559 [Holden Karau] Both class loaders compile. Now for testing
e1d9f71 [Holden Karau] One loader workers.
2014-04-08 22:30:03 -07:00
Sandeep 12c077d5aa SPARK-1433: Upgrade Mesos dependency to 0.17.0
Mesos 0.13.0 was released 6 months ago.
Upgrade Mesos dependency to 0.17.0

Author: Sandeep <sandeep@techaddict.me>

Closes #355 from techaddict/mesos_update and squashes the following commits:

f1abeee [Sandeep] SPARK-1433: Upgrade Mesos dependency to 0.17.0 Mesos 0.13.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0
2014-04-08 16:19:22 -07:00
Michael Armbrust accd0999f9 [SQL] SPARK-1371 Hash Aggregation Improvements
Given:
```scala
case class Data(a: Int, b: Int)
val rdd =
  sparkContext
    .parallelize(1 to 200)
    .flatMap(_ => (1 to 50000).map(i => Data(i % 100, i)))
rdd.registerAsTable("data")
cacheTable("data")
```
Before:
```
SELECT COUNT(*) FROM data:[10000000]
16795.567ms
SELECT a, SUM(b) FROM data GROUP BY a
7536.436ms
SELECT SUM(b) FROM data
10954.1ms
```

After:
```
SELECT COUNT(*) FROM data:[10000000]
1372.175ms
SELECT a, SUM(b) FROM data GROUP BY a
2070.446ms
SELECT SUM(b) FROM data
958.969ms
```

Author: Michael Armbrust <michael@databricks.com>

Closes #295 from marmbrus/hashAgg and squashes the following commits:

ec63575 [Michael Armbrust] Add comment.
d0495a9 [Michael Armbrust] Use scaladoc instead.
b4a6887 [Michael Armbrust] Address review comments.
a2d90ba [Michael Armbrust] Capture child output statically to avoid issues with generators and serialization.
7c13112 [Michael Armbrust] Rewrite Aggregate operator to stream input and use projections.  Remove unused local RDD functions implicits.
5096f99 [Michael Armbrust] Make HiveUDAF fields transient since object inspectors are not serializable.
6a4b671 [Michael Armbrust] Add option to avoid binding operators expressions automatically.
92cca08 [Michael Armbrust] Always include serialization debug info when running tests.
1279df2 [Michael Armbrust] Increase default number of partitions.
2014-04-07 00:14:00 -07:00
Aaron Davidson 4106558435 SPARK-1314: Use SPARK_HIVE to determine if we include Hive in packaging
Previously, we based our decision regarding including datanucleus jars based on the existence of a spark-hive-assembly jar, which was incidentally built whenever "sbt assembly" is run. This means that a typical and previously supported pathway would start using hive jars.

This patch has the following features/bug fixes:

- Use of SPARK_HIVE (default false) to determine if we should include Hive in the assembly jar.
- Analagous feature in Maven with -Phive (previously, there was no support for adding Hive to any of our jars produced by Maven)
- assemble-deps fixed since we no longer use a different ASSEMBLY_DIR
- avoid adding log message in compute-classpath.sh to the classpath :)

Still TODO before mergeable:
- We need to download the datanucleus jars outside of sbt. Perhaps we can have spark-class download them if SPARK_HIVE is set similar to how sbt downloads itself.
- Spark SQL documentation updates.

Author: Aaron Davidson <aaron@databricks.com>

Closes #237 from aarondav/master and squashes the following commits:

5dc4329 [Aaron Davidson] Typo fixes
dd4f298 [Aaron Davidson] Doc update
dd1a365 [Aaron Davidson] Eliminate need for SPARK_HIVE at runtime by d/ling datanucleus from Maven
a9269b5 [Aaron Davidson] [WIP] Use SPARK_HIVE to determine if we include Hive in packaging
2014-04-06 17:48:41 -07:00
Sean Owen 856c50f59b SPARK-1387. Update build plugins, avoid plugin version warning, centralize versions
Another handful of small build changes to organize and standardize a bit, and avoid warnings:

- Update Maven plugin versions for good measure
- Since plugins need maven 3.0.4 already, require it explicitly (<3.0.4 had some bugs anyway)
- Use variables to define versions across dependencies where they should move in lock step
- ... and make this consistent between Maven/SBT

OK, I also updated the JIRA URL while I was at it here.

Author: Sean Owen <sowen@cloudera.com>

Closes #291 from srowen/SPARK-1387 and squashes the following commits:

461eca1 [Sean Owen] Couldn't resist also updating JIRA location to new one
c2d5cc5 [Sean Owen] Update plugins and Maven version; use variables consistently across Maven/SBT to define dependency versions that should stay in step.
2014-04-06 17:41:01 -07:00
Haoyuan Li b50ddfde03 SPARK-1305: Support persisting RDD's directly to Tachyon
Move the PR#468 of apache-incubator-spark to the apache-spark
"Adding an option to persist Spark RDD blocks into Tachyon."

Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
Author: RongGu <gurongwalker@gmail.com>

Closes #158 from RongGu/master and squashes the following commits:

72b7768 [Haoyuan Li] merge master
9f7fa1b [Haoyuan Li] fix code style
ae7834b [Haoyuan Li] minor cleanup
a8b3ec6 [Haoyuan Li] merge master branch
e0f4891 [Haoyuan Li] better check offheap.
55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel
7cd4600 [RongGu] remove some logic code for tachyonstore's replication
51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore
8adfcfa [RongGu] address arron's comment on inTachyonSize
120e48a [RongGu] changed the root-level dir name in Tachyon
5cc041c [Haoyuan Li] address aaron's comments
9b97935 [Haoyuan Li] address aaron's comments
d9a6438 [Haoyuan Li] fix for pspark
77d2703 [Haoyuan Li] change python api.git status
3dcace4 [Haoyuan Li] address matei's comments
91fa09d [Haoyuan Li] address patrick's comments
589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE
64348b2 [Haoyuan Li] update conf docs.
ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1
619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore
be79d77 [RongGu] find a way to clean up some unnecessay metods and classed to make the code simpler
49cc724 [Haoyuan Li] update docs with off_headp option
4572f9f [RongGu] reserving the old apply function API of StorageLevel
04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP
c9aeabf [RongGu] rename the StorgeLevel.TACHYON as StorageLevel.OFF_HEAP
76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md
e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments
fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix
939e467 [Haoyuan Li] 0.4.1-thrift from maven central
86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1
16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift
eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem
6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
d827250 [RongGu] fix JsonProtocolSuie test failure
716e93b [Haoyuan Li] revert the version
ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift
2825a13 [RongGu] up-merging to the current master branch of the apache spark
6a22c1a [Haoyuan Li] fix scalastyle
8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client.
77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice.
1dcadf9 [Haoyuan Li] typo
bf278fa [Haoyuan Li] fix python tests
e82909c [Haoyuan Li] minor cleanup
776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR
8859371 [Haoyuan Li] various minor fixes and clean up
e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode.
fcaeab2 [Haoyuan Li] address Aaron's comment
e554b1e [Haoyuan Li] add python code
47304b3 [Haoyuan Li] make tachyonStore in BlockMananger lazy val; add more comments StorageLevels.
dc8ef24 [Haoyuan Li] add old storelevel constructor
e01a271 [Haoyuan Li] update tachyon 0.4.1
8011a96 [RongGu] fix a brought-in mistake in StorageLevel
70ca182 [RongGu] a bit change in comment
556978b [RongGu] fix the scalastyle errors
791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
2014-04-04 20:38:20 -07:00
Patrick Wendell 33e63618d0 Revert "[SPARK-1398] Removed findbugs jsr305 dependency"
This reverts commit 92a86b285f.
2014-04-03 17:00:06 -07:00
Mark Hamstra 92a86b285f [SPARK-1398] Removed findbugs jsr305 dependency
Should be a painless upgrade, and does offer some significant advantages should we want to leverage FindBugs more during the 1.0 lifecycle. http://findbugs.sourceforge.net/findbugs2.html

Author: Mark Hamstra <markhamstra@gmail.com>

Closes #307 from markhamstra/findbugs and squashes the following commits:

99f2d09 [Mark Hamstra] Removed unnecessary findbugs jsr305 dependency
2014-04-03 14:08:47 -07:00
Mark Hamstra 764353d2c5 [SPARK-1342] Scala 2.10.4
Just a Scala version increment

Author: Mark Hamstra <markhamstra@gmail.com>

Closes #259 from markhamstra/scala-2.10.4 and squashes the following commits:

fbec547 [Mark Hamstra] [SPARK-1342] Bumped Scala version to 2.10.4
2014-04-01 18:35:50 -07:00
Andrew Or 94fe7fd4fa [SPARK-1377] Upgrade Jetty to 8.1.14v20131031
Previous version was 7.6.8v20121106. The only difference between Jetty 7 and Jetty 8 is that the former uses Servlet API 2.5, while the latter uses Servlet API 3.0.

Author: Andrew Or <andrewor14@gmail.com>

Closes #280 from andrewor14/jetty-upgrade and squashes the following commits:

dd57104 [Andrew Or] Merge github.com:apache/spark into jetty-upgrade
e75fa85 [Andrew Or] Upgrade Jetty to 8.1.14v20131031
2014-03-31 21:42:36 -07:00
Prashant Sharma 60abc25254 SPARK-1096, a space after comment start style checker.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #124 from ScrapCodes/SPARK-1096/scalastyle-comment-check and squashes the following commits:

214135a [Prashant Sharma] Review feedback.
5eba88c [Prashant Sharma] Fixed style checks for ///+ comments.
e54b2f8 [Prashant Sharma] improved message, work around.
83e7144 [Prashant Sharma] removed dependency on scalastyle in plugin, since scalastyle sbt plugin already depends on the right version. Incase we update the plugin we will have to adjust our spark-style project to depend on right scalastyle version.
810a1d6 [Prashant Sharma] SPARK-1096, a space after comment style checker.
ba33193 [Prashant Sharma] scala style as a project
2014-03-28 00:21:49 -07:00
Sean Owen 1fa48d9422 SPARK-1325. The maven build error for Spark Tools
This is just a slight variation on https://github.com/apache/spark/pull/234 and alternative suggestion for SPARK-1325. `scala-actors` is not necessary. `SparkBuild.scala` should be updated to reflect the direct dependency on `scala-reflect` and `scala-compiler`. And the `repl` build, which has the same dependencies, should also be consistent between Maven / SBT.

Author: Sean Owen <sowen@cloudera.com>
Author: witgo <witgo@qq.com>

Closes #240 from srowen/SPARK-1325 and squashes the following commits:

25bd7db [Sean Owen] Add necessary dependencies scala-reflect and scala-compiler to tools. Update repl dependencies, which are similar, to be consistent between Maven / SBT in this regard too.
2014-03-26 18:32:14 -07:00
Sean Owen 71d4ed271b SPARK-1316. Remove use of Commons IO
(This follows from a side point on SPARK-1133, in discussion of the PR: https://github.com/apache/spark/pull/164 )

Commons IO is barely used in the project, and can easily be replaced with equivalent calls to Guava or the existing Spark `Utils.scala` class.

Removing a dependency feels good, and this one in particular can get a little problematic since Hadoop uses it too.

Author: Sean Owen <sowen@cloudera.com>

Closes #226 from srowen/SPARK-1316 and squashes the following commits:

21efef3 [Sean Owen] Remove use of Commons IO
2014-03-25 10:21:25 -07:00
Patrick Wendell dc126f2121 SPARK-1094 Support MiMa for reporting binary compatibility accross versions.
This adds some changes on top of the initial work by @scrapcodes in #20:

The goal here is to do automated checking of Spark commits to determine whether they break binary compatibility.

1. Special case for inner classes of package-private objects.
2. Made tools classes accessible when running `spark-class`.
3. Made some declared types in MLLib more general.
4. Various other improvements to exclude-generation script.
5. In-code documentation.

Author: Patrick Wendell <pwendell@gmail.com>
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Prashant Sharma <scrapcodes@gmail.com>

Closes #207 from pwendell/mima and squashes the following commits:

22ae267 [Patrick Wendell] New binary changes after upmerge
6c2030d [Patrick Wendell] Merge remote-tracking branch 'apache/master' into mima
3666cf1 [Patrick Wendell] Minor style change
0e0f570 [Patrick Wendell] Small fix and removing directory listings
647c547 [Patrick Wendell] Reveiw feedback.
c39f3b5 [Patrick Wendell] Some enhancements to binary checking.
4c771e0 [Prashant Sharma] Added a tool to generate mima excludes and also adapted build to pick automatically.
b551519 [Prashant Sharma] adding a new exclude after rebasing with master
651844c [Prashant Sharma] Support MiMa for reporting binary compatibility accross versions.
2014-03-24 21:20:23 -07:00
Xiangrui Meng 80c29689ae [SPARK-1212] Adding sparse data support and update KMeans
Continue our discussions from https://github.com/apache/incubator-spark/pull/575

This PR is WIP because it depends on a SNAPSHOT version of breeze.

Per previous discussions and benchmarks, I switched to breeze for linear algebra operations. @dlwh and I made some improvements to breeze to keep its performance comparable to the bare-bone implementation, including norm computation and squared distance. This is why this PR needs to depend on a SNAPSHOT version of breeze.

@fommil , please find the notice of using netlib-core in `NOTICE`. This is following Apache's instructions on appropriate labeling.

I'm going to update this PR to include:

1. Fast distance computation: using `\|a\|_2^2 + \|b\|_2^2 - 2 a^T b` when it doesn't introduce too much numerical error. The squared norms are pre-computed. Otherwise, computing the distance between the center (dense) and a point (possibly sparse) always takes O(n) time.

2. Some numbers about the performance.

3. A released version of breeze. @dlwh, a minor release of breeze will help this PR get merged early. Do you mind sharing breeze's release plan? Thanks!

Author: Xiangrui Meng <meng@databricks.com>

Closes #117 from mengxr/sparse-kmeans and squashes the following commits:

67b368d [Xiangrui Meng] fix SparseVector.toArray
5eda0de [Xiangrui Meng] update NOTICE
67abe31 [Xiangrui Meng] move ArrayRDDs to mllib.rdd
1da1033 [Xiangrui Meng] remove dependency on commons-math3 and compute EPSILON directly
9bb1b31 [Xiangrui Meng] optimize SparseVector.toArray
226d2cd [Xiangrui Meng] update Java friendly methods in Vectors
238ba34 [Xiangrui Meng] add VectorRDDs with a converter from RDD[Array[Double]]
b28ba2f [Xiangrui Meng] add toArray to Vector
e69b10c [Xiangrui Meng] remove examples/JavaKMeans.java, which is replaced by mllib/examples/JavaKMeans.java
72bde33 [Xiangrui Meng] clean up code for distance computation
712cb88 [Xiangrui Meng] make Vectors.sparse Java friendly
27858e4 [Xiangrui Meng] update breeze version to 0.7
07c3cf2 [Xiangrui Meng] change Mahout to breeze in doc use a simple lower bound to avoid unnecessary distance computation
6f5cdde [Xiangrui Meng] fix a bug in filtering finished runs
42512f2 [Xiangrui Meng] Merge branch 'master' into sparse-kmeans
d6e6c07 [Xiangrui Meng] add predict(RDD[Vector]) to KMeansModel
42b4e50 [Xiangrui Meng] line feed at the end
a4ace73 [Xiangrui Meng] Merge branch 'fast-dist' into sparse-kmeans
3ed1a24 [Xiangrui Meng] add doc to BreezeVectorWithSquaredNorm
0107e19 [Xiangrui Meng] update NOTICE
87bc755 [Xiangrui Meng] tuned the KMeans code: changed some for loops to while, use view to avoid copying arrays
0ff8046 [Xiangrui Meng] update KMeans to use fastSquaredDistance
f355411 [Xiangrui Meng] add BreezeVectorWithSquaredNorm case class
ab74f67 [Xiangrui Meng] add fastSquaredDistance for KMeans
4e7d5ca [Xiangrui Meng] minor style update
07ffaf2 [Xiangrui Meng] add dense/sparse vector data models and conversions to/from breeze vectors use breeze to implement KMeans in order to support both dense and sparse data
2014-03-23 17:34:02 -07:00
Sean Owen abf6714e27 SPARK-1254. Supplemental fix for HTTPS on Maven Central
It seems that HTTPS does not necessarily work on Maven Central, as it does not today at least. Back to HTTP. Both builds works from a clean repo.

Author: Sean Owen <sowen@cloudera.com>

Closes #209 from srowen/SPARK-1254Fix and squashes the following commits:

bb7be47 [Sean Owen] Revert to HTTP for Maven Central repo, as it seems HTTPS does not necessarily work
2014-03-23 10:57:01 -07:00
Michael Armbrust 9aadcffabd SPARK-1251 Support for optimizing and executing structured queries
This pull request adds support to Spark for working with structured data using a simple SQL dialect, HiveQL and a Scala Query DSL.

*This is being contributed as a new __alpha component__ to Spark and does not modify Spark core or other components.*

The code is broken into three primary components:
 - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
 - Execution (sql/core) - A query planner / execution engine for translating Catalyst’s logical query plans into Spark RDDs.  This component also includes a new public interface, SqlContext, that allows users to execute SQL or structured scala queries against existing RDDs and Parquet files.
 - Hive Metastore Support (sql/hive) - An extension of SqlContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes.  There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.

A more complete design of this new component can be found in [the associated JIRA](https://spark-project.atlassian.net/browse/SPARK-1251).

[An updated version of the Spark documentation, including API Docs for all three sub-components,](http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html) is also available for review.

With this PR comes support for inferring the schema of existing RDDs that contain case classes.  Using this information, developers can now express structured queries that are automatically compiled into RDD operations.

```scala
// Define the schema using a case class.
case class Person(name: String, age: Int)
val people: RDD[Person] =
  sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).toInt))

// The following is the same as 'SELECT name FROM people WHERE age >= 10 && age <= 19'
val teenagers = people.where('age >= 10).where('age <= 19).select('name).toRdd
```

RDDs can also be registered as Tables, allowing SQL queries to be written over them.
```scala
people.registerAsTable("people")
val teenagers = sql("SELECT name FROM people WHERE age >= 10 && age <= 19")
```

The results of queries are themselves RDDs and support standard RDD operations:
```scala
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
```

Finally, with the optional Hive support, users can read and write data located in existing Apache Hive deployments using HiveQL.
```scala
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH 'src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sql("SELECT key, value FROM src").collect().foreach(println)
```

## Relationship to Shark

Unlike Shark, Spark SQL does not act as a drop in replacement for Hive or the HiveServer. Instead this new feature is intended to make it easier for Spark developers to run queries over structured data, using either SQL or the query DSL. After this sub-project graduates from Alpha status it will likely become a new optimizer/backend for the Shark project.

Author: Michael Armbrust <michael@databricks.com>
Author: Yin Huai <huaiyin.thu@gmail.com>
Author: Reynold Xin <rxin@apache.org>
Author: Lian, Cheng <rhythm.mail@gmail.com>
Author: Andre Schumacher <andre.schumacher@iki.fi>
Author: Yin Huai <huai@cse.ohio-state.edu>
Author: Timothy Chen <tnachen@gmail.com>
Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: Timothy Chen <tnachen@apache.org>
Author: Henry Cook <henry.m.cook+github@gmail.com>
Author: Mark Hamstra <markhamstra@gmail.com>

Closes #146 from marmbrus/catalyst and squashes the following commits:

458bd1b [Michael Armbrust] Update people.txt
0d638c3 [Michael Armbrust] Typo fix from @ash211.
bdab185 [Michael Armbrust] Address another round of comments: * Doc examples can now copy/paste into spark-shell. * SQLContext is serializable * Minor parser bugs fixed * Self-joins of RDDs now handled correctly. * Removed deprecated examples * Removed deprecated parquet docs * Made more of the API private * Copied all the DSLQuery tests and rewrote them as SQLQueryTests
778299a [Michael Armbrust] Fix some old links to spark-project.org
fead0b6 [Michael Armbrust] Create a new RDD type, SchemaRDD, that is now the return type for all SQL operations.  This improves the old API by reducing the number of implicits that are required, and avoids throwing away schema information when returning an RDD to the user.  This change also makes it slightly less verbose to run language integrated queries.
fee847b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into catalyst, integrating changes to serialization for ShuffledRDD.
48a99bc [Michael Armbrust] Address first round of feedback.
461581c [Michael Armbrust] Blacklist test that depends on JVM specific rounding behaviour
adcf1a4 [Henry Cook] Update sql-programming-guide.md
9dffbfa [Michael Armbrust] Style fixes. Add downloading of test cases to jenkins.
6978dd8 [Michael Armbrust] update docs, add apache license
1d0eb63 [Michael Armbrust] update changes with spark core
e5e1d6b [Michael Armbrust] Remove travis configuration.
c2efad6 [Michael Armbrust] First draft of SQL documentation.
013f62a [Michael Armbrust] Fix documentation / code style.
c01470f [Michael Armbrust] Clean up example
2f22454 [Michael Armbrust] WIP: Parquet example.
ce8073b [Michael Armbrust] clean up implicits.
f7d992d [Michael Armbrust] Naming / spelling.
9eb0294 [Michael Armbrust] Bring expressions implicits into SqlContext.
d2d9678 [Michael Armbrust] Make sure hive isn't in the assembly jar.  Create a separate, optional Hive assembly that is used when present.
8b35e0a [Michael Armbrust] address feedback, work on DSL
5d71074 [Michael Armbrust] Merge pull request #62 from AndreSchumacher/parquet_file_fixes
f93aa39 [Andre Schumacher] Better handling of path names in ParquetRelation
1a4bbd9 [Michael Armbrust] Merge pull request #60 from marmbrus/maven
3386e4f [Michael Armbrust] Merge pull request #58 from AndreSchumacher/parquet_fixes
3447c3e [Michael Armbrust] Don't override the metastore / warehouse in non-local/test hive context.
7233a74 [Michael Armbrust] initial support for maven builds
f0ba39e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into maven
7386a9f [Michael Armbrust] Initial example programs using spark sql.
aeaef54 [Andre Schumacher] Removing unnecessary Row copying and reverting some changes to MutableRow
7ca4b4e [Andre Schumacher] Improving checks in Parquet tests
5bacdc0 [Andre Schumacher] Moving towards mutable rows inside ParquetRowSupport
54637ec [Andre Schumacher] First part of second round of code review feedback
c2a658d [Michael Armbrust] Merge pull request #55 from marmbrus/mutableRows
ba28849 [Michael Armbrust] code review comments.
d994333 [Michael Armbrust] Remove copies before shuffle, this required changing the default shuffle serialization.
9049cf0 [Michael Armbrust] Extend MutablePair interface to support easy syntax for in-place updates.  Also add a constructor so that it can be serialized out-of-the-box.
959bdf0 [Michael Armbrust] Don't silently swallow all KryoExceptions, only the one that indicates the end of a stream.
d371393 [Michael Armbrust] Add a framework for dealing with mutable rows to reduce the number of object allocations that occur in the critical path.
c9f8fb3 [Michael Armbrust] Merge pull request #53 from AndreSchumacher/parquet_support
3c3f962 [Michael Armbrust] Fix a bug due to array reuse.  This will need to be revisited after we merge the mutable row PR.
7d0f13e [Michael Armbrust] Update parquet support with master.
9d419a6 [Michael Armbrust] Merge remote-tracking branch 'catalyst/catalystIntegration' into parquet_support
0040ae6 [Andre Schumacher] Feedback from code review
1ce01c7 [Michael Armbrust] Merge pull request #56 from liancheng/unapplySeqForRow
70e489d [Cheng Lian] Fixed a spelling typo
6d315bb [Cheng Lian] Added Row.unapplySeq to extract fields from a Row object.
8d5da5e [Michael Armbrust] modify compute-classpath.sh to include datanucleus jars explicitly
99e61fb [Michael Armbrust] Merge pull request #51 from marmbrus/expressionEval
7b9d142 [Michael Armbrust] Update travis to increase permgen size.
da9afbd [Michael Armbrust] Add byte wrappers for hive UDFS.
6fdefe6 [Michael Armbrust] Port sbt improvements from master.
296fe50 [Michael Armbrust] Address review feedback.
d7fbc3a [Michael Armbrust] Several performance enhancements and simplifications of the expression evaluation framework.
3bda72d [Andre Schumacher] Adding license banner to new files
3ac9eb0 [Andre Schumacher] Rebasing to new main branch
c863bed [Andre Schumacher] Codestyle checks
61e3bfb [Andre Schumacher] Adding WriteToFile operator and rewriting ParquetQuerySuite
3321195 [Andre Schumacher] Fixing one import in ParquetQueryTests.scala
3a0a552 [Andre Schumacher] Reorganizing Parquet table operations
18fdc44 [Andre Schumacher] Reworking Parquet metadata in relation and adding CREATE TABLE AS for Parquet tables
75262ee [Andre Schumacher] Integrating operations on Parquet files into SharkStrategies
f347273 [Andre Schumacher] Adding ParquetMetaData extraction, fixing schema projection
6a6bf98 [Andre Schumacher] Added column projections to ParquetTableScan
0f17d7b [Andre Schumacher] Rewriting ParquetRelation tests with RowWriteSupport
a11e364 [Andre Schumacher] Adding Parquet RowWriteSupport
6ad05b3 [Andre Schumacher] Moving ParquetRelation to spark.sql core
eb0e521 [Andre Schumacher] Fixing package names and other problems that came up after the rebase
99a9209 [Andre Schumacher] Expanding ParquetQueryTests to cover all primitive types
b33e47e [Andre Schumacher] First commit of Parquet import of primitive column types
c334386 [Michael Armbrust] Initial support for generating schema's based on case classes.
608a29e [Michael Armbrust] Add hive as a repl dependency
7413ac2 [Michael Armbrust] make test downloading quieter.
4d57d0e [Michael Armbrust] Fix test execution on travis.
5f2963c [Michael Armbrust] naming and continuous compilation fixes.
f5e7492 [Michael Armbrust] Add Apache license.  Make naming more consistent.
3ac9416 [Michael Armbrust] Merge support for working with schema-ed RDDs using catalyst in as a spark subproject.
2225431 [Michael Armbrust] Merge pull request #48 from marmbrus/minorFixes
d393d2a [Michael Armbrust] Review Comments: Add comment to map that adds a sub query.
24eaa79 [Michael Armbrust] fix > 100 chars
6e04e5b [Michael Armbrust] Add insertIntoTable to the DSL.
df88f01 [Michael Armbrust] add a simple test for aggregation
18a861b [Michael Armbrust] Correctly convert nested products into nested rows when turning scala data into catalyst data.
b922511 [Michael Armbrust] Fix insertion of nested types into hive tables.
5fe7de4 [Michael Armbrust] Move table creation out of rule into a separate function.
a430895 [Michael Armbrust] Planning for logical Repartition operators.
532dd37 [Michael Armbrust] Allow the local warehouse path to be specified.
4905b2b [Michael Armbrust] Add more efficient TopK that avoids global sort for logical Sort => StopAfter.
8c01c24 [Michael Armbrust] Move definition of Row out of execution to top level sql package.
c9116a6 [Michael Armbrust] Add combiner to avoid NPE when spark performs external aggregation.
29effad [Michael Armbrust] Include alias in attributes that are produced by overridden tables.
9990ec7 [Michael Armbrust] Merge pull request #28 from liancheng/columnPruning
f22df3a [Michael Armbrust] Merge pull request #37 from yhuai/SerDe
cf4db59 [Lian, Cheng] Added golden answers for PruningSuite
54f165b [Lian, Cheng] Fixed spelling typo in two golden answer file names
2682f72 [Lian, Cheng] Merge remote-tracking branch 'origin/master' into columnPruning
c5a4fab [Lian, Cheng] Merge branch 'master' into columnPruning
f670c8c [Yin Huai] Throw a NotImplementedError for not supported clauses in a CTAS query.
128a9f8 [Yin Huai] Minor changes.
017872c [Yin Huai] Remove stats20 from whitelist.
a1a4776 [Yin Huai] Update comments.
feb022c [Yin Huai] Partitioning key should be case insensitive.
555fb1d [Yin Huai] Correctly set the extension for a text file.
d00260b [Yin Huai] Strips backticks from partition keys.
334aace [Yin Huai] New golden files.
a40d6d6 [Yin Huai] Loading the static partition specified in a INSERT INTO/OVERWRITE query.
428aff5 [Yin Huai] Distinguish `INSERT INTO` and `INSERT OVERWRITE`.
eea75c5 [Yin Huai] Correctly set codec.
45ffb86 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
e089627 [Yin Huai] Code style.
563bb22 [Yin Huai] Set compression info in FileSinkDesc.
35c9a8a [Michael Armbrust] Merge pull request #46 from marmbrus/reviewFeedback
bdab5ed [Yin Huai] Add a TODO for loading data into partitioned tables.
5495fab [Yin Huai] Remove cloneRecords which is no longer needed.
1596e1b [Yin Huai] Cleanup imports to make IntelliJ happy.
3bb272d [Michael Armbrust] move org.apache.spark.sql package.scala to the correct location.
8506c17 [Michael Armbrust] Address review feedback.
3cb4f2e [Michael Armbrust] Merge pull request #45 from tnachen/master
9ad474d [Michael Armbrust] Merge pull request #44 from marmbrus/sampling
566fd66 [Timothy Chen] Whitelist tests and add support for Binary type
69adf72 [Yin Huai] Set cloneRecords to false.
a9c3188 [Timothy Chen] Fix udaf struct return
346f828 [Yin Huai] Move SharkHadoopWriter to the correct location.
59e37a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
ed3a1d1 [Yin Huai] Load data directly into Hive.
7f206b5 [Michael Armbrust] Add support for hive TABLESAMPLE PERCENT.
b6de691 [Michael Armbrust] Merge pull request #43 from liancheng/fixMakefile
1f6260d [Lian, Cheng] Fixed package name and test suite name in Makefile
5ae010f [Michael Armbrust] Merge pull request #42 from markhamstra/non-ascii
678341a [Mark Hamstra] Replaced non-ascii text
887f928 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
1f7d00a [Reynold Xin] Merge pull request #41 from marmbrus/splitComponents
7588a57 [Michael Armbrust] Break into 3 major components and move everything into the org.apache.spark.sql package.
bc9a12c [Michael Armbrust] Move hive test files.
5720d2b [Lian, Cheng] Fixed comment typo
f0c3742 [Lian, Cheng] Refactored PhysicalOperation
f235914 [Lian, Cheng] Test case udf_regex and udf_like need BooleanWritable registered
cf691df [Lian, Cheng] Added the PhysicalOperation to generalize ColumnPrunings
2407a21 [Lian, Cheng] Added optimized logical plan to debugging output
a7ad058 [Michael Armbrust] Merge pull request #40 from marmbrus/includeGoldens
9329820 [Michael Armbrust] add golden answer files to repository
dce0593 [Michael Armbrust] move golden answer to the source code directory.
964368f [Michael Armbrust] Merge pull request #39 from marmbrus/lateralView
7785ee6 [Michael Armbrust] Tighten visibility based on comments.
341116c [Michael Armbrust] address comments.
0e6c1d7 [Reynold Xin] Merge pull request #38 from yhuai/parseDBNameInCTAS
2897deb [Michael Armbrust] fix scaladoc
7123225 [Yin Huai] Correctly parse the db name and table name in INSERT queries.
b376d15 [Michael Armbrust] fix newlines at EOF
5cc367c [Michael Armbrust] use berkeley instead of cloudbees
ff5ea3f [Michael Armbrust] new golden
db92adc [Michael Armbrust] more tests passing. clean up logging.
740febb [Michael Armbrust] Tests for tgfs.
0ce61b0 [Michael Armbrust] Docs for GenericHiveUdtf.
ba8897f [Michael Armbrust] Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView
dd00b7e [Michael Armbrust] initial implementation of generators.
ea76cf9 [Michael Armbrust] Add NoRelation to planner.
bea4b7f [Michael Armbrust] Add SumDistinct.
016b489 [Michael Armbrust] fix typo.
acb9566 [Michael Armbrust] Correctly type attributes of CTAS.
8841eb8 [Michael Armbrust] Rename Transform -> ScriptTransformation.
02ff8e4 [Yin Huai] Correctly parse the db name and table name in a CTAS query.
5e4d9b4 [Michael Armbrust] Merge pull request #35 from marmbrus/smallFixes
5479066 [Reynold Xin] Merge pull request #36 from marmbrus/partialAgg
8017afb [Michael Armbrust] fix copy paste error.
dc6353b [Michael Armbrust] turn off deprecation
cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance.
883006d [Michael Armbrust] improve tests.
32b615b [Michael Armbrust] add override to asPartial.
e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe.
f94345c [Michael Armbrust] fix doc link
d8cb805 [Michael Armbrust] Implement partial aggregation.
ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings.  Remove a blank line.
b4be6a5 [Michael Armbrust] better logging when applying rules.
67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex
cb57459 [Michael Armbrust] blacklist machine specific test.
2f27604 [Michael Armbrust] Address comments / style errors.
389525d [Michael Armbrust] update golden, blacklist mr.
e3c10bd [Michael Armbrust] update whitelist.
44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex
42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs.
ab5bff3 [Michael Armbrust] Support for get item of map types.
1679554 [Michael Armbrust] add toString for if and IS NOT NULL.
ab9a131 [Michael Armbrust] when UDFs fail they should return null.
25288d0 [Michael Armbrust] Implement [] for arrays and maps.
e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions.
010accb [Michael Armbrust] add tinyint to metastore type parser.
7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes.
ac9d7de [Michael Armbrust] Resolve *s in Transform clauses.
692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables.
92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects
9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector.
72a003d [Michael Armbrust] revert regex change
7661b6c [Michael Armbrust] blacklist machines specific tests
aa430e7 [Michael Armbrust] Update .travis.yml
e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs.
5e54aa6 [Michael Armbrust] quotes for struct field names.
bbec500 [Michael Armbrust] update test coverage, new golden
3734a94 [Michael Armbrust] only quote string types.
3f9e519 [Michael Armbrust] use names w/ boolean args
5b3d2c8 [Michael Armbrust] implement distinct.
5b33216 [Michael Armbrust] work on decimal support.
2c6deb3 [Michael Armbrust] improve printing compatibility.
35a70fb [Michael Armbrust] multi-letter field names.
a9388fb [Michael Armbrust] printing for map types.
c3feda7 [Michael Armbrust] use toArray.
c654f19 [Michael Armbrust] Support for list and maps in hive table scan.
cf8d992 [Michael Armbrust] Use built in functions for creating temp directory.
1579eec [Michael Armbrust] Only cast unresolved inserts.
6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression.
da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test.  Not sure how this was passing before...
6709441 [Michael Armbrust] Evaluation for accessing nested fields.
dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation.
d670e41 [Michael Armbrust] Print nested fields like hive does.
efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan.
9c22b4e [Michael Armbrust] Support for parsing nested types.
82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables
ea6f37f [Michael Armbrust] fix style.
7845364 [Michael Armbrust] deactivate concurrent test.
b649c20 [Michael Armbrust] fix test logging / caching.
1590568 [Michael Armbrust] add log4j.properties
19bfd74 [Michael Armbrust] store hive output in circular buffer
dfb67aa [Michael Armbrust] add test case
cb775ac [Michael Armbrust] get rid of SharkContext singleton
2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master
63003e9 [Michael Armbrust] Fix spacing.
41b41f3 [Michael Armbrust] Only cast unresolved inserts.
6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs
5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator
b1151a8 [Timothy Chen] Fix load data regex
8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API.
e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction
45b334b [Yin Huai] fix comments
235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning.
6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style
271e483 [Michael Armbrust] Update build status icon.
d3a3d48 [Michael Armbrust] add testing to travis
807b2d7 [Michael Armbrust] check style and publish docs with travis
d20b565 [Michael Armbrust] fix if style
bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide.
d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests.  This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive.  These are used by default when HIVE_HOME is not set.
f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin
41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file).
5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference
7213a2c [Reynold Xin] style fix for Hive.scala.
08e4d05 [Reynold Xin] First round of style cleanup.
605255e [Reynold Xin] Added scalastyle checker.
61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases
2486fb7 [Lian, Cheng] Fixed spelling
8ee41be [Lian, Cheng] Minor refactoring
ebb56fa [Michael Armbrust] add travis config
4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests
d4f539a [Michael Armbrust] blacklist mr and user specific tests.
677eb07 [Michael Armbrust] Update test whitelist.
5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning
c263c84 [Michael Armbrust] Only push predicates into partitioned table scans.
ab77882 [Michael Armbrust] upgrade spark to RC5.
c98ede5 [Lian, Cheng] Response to comments from @marmbrus
83d4520 [Yin Huai] marmbrus's comments
70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes
9ebff47 [Yin Huai] remove unnecessary .toSeq
e811d1a [Yin Huai] markhamstra's comments
4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations.
040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators.
9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as inptu expressions.
e9347fc [Michael Armbrust] Remove broken scaladoc links.
99c6707 [Michael Armbrust] upgrade spark
57799ad [Lian, Cheng] Added special treat for HiveVarchar in InsertIntoHiveTable
cb49af0 [Lian, Cheng] Fixed Scaladoc links
4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion
111ffdc [Lian, Cheng] More comments and minor reformatting
9e0d840 [Lian, Cheng] Added partition pruning optimization
761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan
04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
9dd3b26 [Michael Armbrust] Fix scaladoc.
6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst
7c92a41 [Lian, Cheng] Added Hive SerDe support
ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
2957f31 [Yin Huai] addressed comments on PR
907db68 [Michael Armbrust] Space after while.
04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts
4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile
5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23
fd084a4 [Michael Armbrust] implement casts binary <=> string.
0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type
548e479 [Yin Huai] merge master into exchangeOperator and fix code style
5b11db0 [Reynold Xin] Added Void to Boolean type widening.
9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear.
9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion
a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20
b20a4d4 [Lian, Cheng] Fix issue #20
6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc.
4285962 [Michael Armbrust] Remove temporary test cases
167162f [Michael Armbrust] more merge errors, cleanup.
e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge.
6377d0b [Michael Armbrust] Drop empty files, fix if ().
c0b0e60 [Michael Armbrust] cleanup broken doc links.
330a88b [Michael Armbrust] Fix bugs in AddExchange.
4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering.
043e296 [Michael Armbrust] Make physical union nodes variadic.
ece15e1 [Michael Armbrust] update unit tests
5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width.
9804eb5 [Michael Armbrust] upgrade spark
053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow
5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow
ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes
bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg
563053f [Michael Armbrust] Address @rxin's comments.
6537c66 [Michael Armbrust] Address @rxin's comments.
2a76fc6 [Michael Armbrust] add notes from @rxin.
685bfa1 [Michael Armbrust] fix spelling
69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions.
7859a86 [Michael Armbrust] Remove SparkAggregate.  Its kinda broken and breaks RDD lineage.
fc22e01 [Michael Armbrust] whitelist newly passing union test.
3f547b8 [Michael Armbrust] Add support for widening types in unions.
53b95f8 [Michael Armbrust] coercion should not occur until children are resolved.
b892e32 [Michael Armbrust] Union is not resolved until the types match up.
95ab382 [Michael Armbrust] Use resolved instead of custom function.  This is better because some nodes override the notion of resolved.
81a109d [Michael Armbrust] fix link.
f143f61 [Michael Armbrust] Implement sampling.  Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!"
6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar.
c800798 [Michael Armbrust] Add build status icon.
0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown
05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row.
f2fdd77 [Michael Armbrust] fix required distribtion for aggregate.
658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala
583a337 [Michael Armbrust] break apart distribution and partitioning.
e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator
0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links.
73c70de [Yin Huai] add a first set of unit tests for data properties.
fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements.
2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans.
fcbc03b [Michael Armbrust] Fix if ().
7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators.
b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes
b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization.
e286d20 [Michael Armbrust] address code review comments.
80d0681 [Michael Armbrust] fix scaladoc links.
de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString.
3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode.
404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation.
fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution.
2abb0bc [Michael Armbrust] better debug messages, use exists.
098dfc4 [Michael Armbrust] Implement Long sorting again.
60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable.
a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString.
dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode.
037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation.
ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes.
b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen
83adb9d [Yin Huai] add DataProperty
5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics.
f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator.  This can probably also be use for codegen...
6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet.
ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts.
058ec15 [Michael Armbrust] handle more writeables.
ffa9f25 [Michael Armbrust] blacklist some more MR tests.
aa2239c [Michael Armbrust] filter test lines containing Owner:
f71a325 [Michael Armbrust] Update golden jar.
a3003ae [Michael Armbrust] Update makefile to use better sharding support.
568d150 [Michael Armbrust] Updates to white/blacklist.
8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right.
c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure.  See comments for details.
09c6300 [Michael Armbrust] Add nullability information to StructFields.
5460b2d [Michael Armbrust] load srcpart by default.
3695141 [Michael Armbrust] Lots of parser improvements.
965ac9a [Michael Armbrust] Add expressions that allow access into complex types.
3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences.
8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning.
e57f97a [Michael Armbrust] more decimal/null support.
e1440ed [Michael Armbrust] Initial support for function specific type conversions.
1814ed3 [Michael Armbrust] use childrenResolved function.
f2ec57e [Michael Armbrust] Begin supporting decimal.
6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs
7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters.
d0124f3 [Michael Armbrust] Correctly type null literals.
b65626e [Michael Armbrust] Initial support for parsing BigDecimal.
a90efda [Michael Armbrust] utility function for outputing string stacktraces.
7102f33 [Michael Armbrust] methods with side-effects should use ().
3ccaef7 [Michael Armbrust] add renaming TODO.
bc282c7 [Michael Armbrust] fix bug in getNodeNumbered
c8e89d5 [Michael Armbrust] memoize inputSet calculation.
6aefa46 [Michael Armbrust] Skip folding literals.
a72e540 [Michael Armbrust] Add IN operator.
04f885b [Michael Armbrust] literals are only non-nullable if they are not null.
35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output.
12fd52d [Michael Armbrust] support for sorting longs.
0606520 [Michael Armbrust] drop old comment.
859200a [Michael Armbrust] support for reading more types from the metastore.
1fedd18 [Michael Armbrust] coercion from null to numeric types
71e902d [Michael Armbrust] fix test cases.
cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer
8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment
86355a6 [Michael Armbrust] throw error if there are unexpected join clauses.
c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute.
0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling
a92919d [Michael Armbrust] add alter view as to native commands
f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT
f0faa26 [Michael Armbrust] add sample and distinct operators.
ef7b943 [Michael Armbrust] add metastore support for float
e9f4588 [Michael Armbrust] fix > 100 char.
755b229 [Michael Armbrust] blacklist some ddl tests.
9ae740a [Michael Armbrust] blacklist more tests that require MR.
4cfc11a [Michael Armbrust] more test coverage.
0d9d56a [Michael Armbrust] add more native commands to parser
78d730d [Michael Armbrust] Load src test table on RESET.
8364ec2 [Michael Armbrust] whitelist all possible partition values.
b01468d [Michael Armbrust] support path rewrites when the query begins with a comment.
4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail.
4c5fb0f [Michael Armbrust] makefile target for building new whitelist.
4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO.
516481c [Michael Armbrust] Ignore requests to explain native commands.
68aa2e6 [Michael Armbrust] Stronger type for Token extractor.
ca4ea26 [Michael Armbrust] Support for parsing UDF(*).
1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset.
9627616 [Michael Armbrust] Use current database as default database.
9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode.
6f64cee [Michael Armbrust] don't line wrap string literal
eafaeed [Michael Armbrust] add type documentation
f54c94c [Michael Armbrust] make golden answers file a test dependency
5362365 [Michael Armbrust] push conditions into join
0d2388b [Michael Armbrust] Point at databricks hosted scaladoc.
73b29cd [Michael Armbrust] fix bad casting
9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes
7eff191 [Michael Armbrust] link all the expression names.
83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules
9de6b74 [Michael Armbrust] fix language feature and deprecation warnings.
0b1960a [Michael Armbrust] Fix broken scala doc links / warnings.
b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions
01c00c2 [Michael Armbrust] new golden
5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching
66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork
1a393da [Yin Huai] folded -> foldable
1e964ea [Yin Huai] update
a43d41c [Michael Armbrust] more tests passing!
8ca38d0 [Michael Armbrust] begin support for varchar / binary types.
ab8bbd1 [Michael Armbrust] parsing % operator
c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests.
3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline.
5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
367fb9e [Yin Huai] update
0cd5cc6 [Michael Armbrust] add BIGINT cast parsing
61b266f [Michael Armbrust] comment for eliminate subqueries.
d72a5a2 [Michael Armbrust] add long to literal factory object.
b3bd15f [Michael Armbrust] blacklist more mr requiring tests.
e06fd38 [Michael Armbrust] black list map reduce tests.
8e7ce30 [Michael Armbrust] blacklist some env specific tests.
6250cbd [Michael Armbrust] Do not exit on test failure
b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath.
b6e4899 [Yin Huai] formatting
e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12
5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing?
9e190f5 [Michael Armbrust] drop unneeded ()
68b58c1 [Michael Armbrust] drop a few more tests.
b0aa400 [Michael Armbrust] update whitelist.
c99012c [Michael Armbrust] skip tests with hooks
db00ebf [Michael Armbrust] more types for hive udfs
dbc3678 [Michael Armbrust] update ghpages repo
138f53d [Yin Huai] addressed comments and added a space after a space after the defining keyword of every control structure.
6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests.
46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel.  usage: "make -j 8 -i"
8d47ed4 [Yin Huai] comment
2795f05 [Yin Huai] comment
e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer
2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
0bd1688 [Yin Huai] update
6a7bd75 [Michael Armbrust] fix partition column delimiter configuration.
e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0.
b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean
52864da [Reynold Xin] Added executeCollect method to SharkPlan.
f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan.
b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException.
38124bd [Yin Huai] formatting
2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively
555d839 [Reynold Xin] More cleaning ...
d48d0e1 [Reynold Xin] Code review feedback.
aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency.
479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if)
da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename
e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope.
72426ed [Reynold Xin] Rename shark2 package to execution.
0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename
e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore
3f9fee1 [Michael Armbrust] rewrite push filter through join optimization.
c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory.
c9777d8 [Reynold Xin] Put all source files in a catalyst directory.
019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files.
80ca4be [Timothy Chen] Address comments
0079392 [Michael Armbrust] support for multiple insert commands in a single query
75b5a01 [Michael Armbrust] remove space.
4283400 [Timothy Chen] Add limited predicate push down
e547e50 [Michael Armbrust] implement First.
e77c9b6 [Michael Armbrust] more work on unique join.
c795e06 [Michael Armbrust] improve star expansion
a26494e [Michael Armbrust] allow aliases to have qualifiers
d078333 [Michael Armbrust] remove extra space
a75c023 [Michael Armbrust] implement Coalesce
3a018b6 [Michael Armbrust] fix up docs.
ab6f67d [Michael Armbrust] import the string "null" as actual null.
5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved.
191ce3e [Michael Armbrust] analyze rewrite test query.
60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved.
2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues.
e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD
c086a35 [Michael Armbrust] docs, spacing
c4060e4 [Michael Armbrust] cleanup
3b85462 [Michael Armbrust] more tests passing
bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data.
c944a95 [Michael Armbrust] First aggregate expression.
1e28311 [Michael Armbrust] make tests execute in alpha order again
a287481 [Michael Armbrust] spelling
8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing.
a6ab6c7 [Michael Armbrust] add !=
4529594 [Michael Armbrust] draft of coalesce
70f253f [Michael Armbrust] more tests passing!
7349e7b [Michael Armbrust] initial support for test thrift table
d3c9305 [Michael Armbrust] fix > 100 char line
93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE"
06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths
6355d0e [Michael Armbrust] match actual return type of count with expected
cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty.
fd4b096 [Michael Armbrust] fix casing of null strings as well.
4632695 [Michael Armbrust] support for megastore bigint
67b88cf [Michael Armbrust] more verbose debugging of evaluation return types
c680e0d [Michael Armbrust] Failed string => number conversion should return null.
2326be1 [Michael Armbrust] make getClauses case insensitive.
dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types.
045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
fb5ddfd [Michael Armbrust] move ViewExamples to examples/
83833e8 [Michael Armbrust] more tests passing!
47c98d6 [Michael Armbrust] add query tests for like and hash.
1724c16 [Michael Armbrust] clear lines that contain last updated times.
cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse.
9b2642b [Michael Armbrust] make the blacklist support regexes
1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs
910e33e [Michael Armbrust] basic support for building an assembly jar.
d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore.
495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf.
65f4e69 [Michael Armbrust] remove incorrect comments
0831a3c [Michael Armbrust] support for parsing some operator udfs.
6c27aa7 [Michael Armbrust] more cast parsing.
43db061 [Michael Armbrust] significant generalization of hive udf functionality.
3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines.
e5690a6 [Michael Armbrust] add BinaryType
adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called.
d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]).
8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage.
21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function.
0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons.
2b70abf [Michael Armbrust] true and false literals.
ef8b0a5 [Michael Armbrust] more tests.
14d070f [Michael Armbrust] add support for correctly extracting partition keys.
0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
69a0bd4 [Michael Armbrust] promote strings in predicates with number too.
3946e31 [Michael Armbrust] don't build strings unless assertion fails.
90c453d [Michael Armbrust] more tests passing!
6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting.
8000504 [Michael Armbrust] Improve type coercion.
9087152 [Michael Armbrust] fix toString of Not.
58b111c [Michael Armbrust] fix bad scaladoc tag.
d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there.
ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness.
1d0ae1e [Michael Armbrust] Switch from IndexSeq[Any] to Row interface that will allow us unboxed access to primitive types.
d873b2b [Yin Huai] Remove numbers associated with test cases.
8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown
d1e7b8e [Michael Armbrust] Update README.md
c8b1553 [Michael Armbrust] Update README.md
9307ef9 [Michael Armbrust] update list of passing tests.
934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers.
a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now.
ae0024a [Yin Huai] update
cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions
21976ae [Yin Huai] update
b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions
dedbf0c [Yin Huai] support Boolean literals
eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals
37817b5 [Yin Huai] add a comment to EvaluateLiterals.
468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is.
b1d1843 [Michael Armbrust] more work on big data benchmark tests.
cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark
7d7fa9f [Michael Armbrust] support for create table as
5f54f03 [Michael Armbrust] parsing for ASC
d42b725 [Michael Armbrust] Sum of strings requires cast
34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.)
81659cb [Michael Armbrust] implement transform operator.
5cd76d6 [Michael Armbrust] break up the file based test case code for reuse
1031b65 [Michael Armbrust] support for case insensitive resolution.
320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots)
b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt
d9d18b4 [Michael Armbrust] debug logging implicit.
669089c [Yin Huai] support Boolean literals
ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals
73a05fd [Yin Huai] add a comment to EvaluateLiterals.
191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is.
80039cc [Yin Huai] Merge pull request #1 from yhuai/master
cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide
5c518e4 [Michael Armbrust] fix bug in test.
b50dd0e [Michael Armbrust] fix return type of overloaded method
05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview.
8c60cc0 [Michael Armbrust] Update README.md
03b9526 [Michael Armbrust] First draft of optimizer tests.
f392755 [Michael Armbrust] Add flatMap to TreeNode
6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings
15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions.
06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression.
4b0a888 [Michael Armbrust] implement string comparison and more casts.
356b321 [Michael Armbrust] Update README.md
3776395 [Michael Armbrust] Update README.md
304d17d [Michael Armbrust] Create README.md
b7d8be0 [Michael Armbrust] more tests passing.
b82481f [Michael Armbrust] add todo comment.
02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist.
cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support.
c43a259 [Michael Armbrust] comments
15ff448 [Michael Armbrust] better error message when a dsl test throws an exception
76ec650 [Michael Armbrust] fix join conditions
e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan.
91573a4 [Michael Armbrust] initial type promotion
e2ef4a5 [Michael Armbrust] logging
e43dc1e [Michael Armbrust] add string => int cast evaluation
f1f7e96 [Michael Armbrust] fix incorrect generation of join keys
2b27230 [Michael Armbrust] add depth based subtree access
0f6279f [Michael Armbrust] broken tests.
389bc0b [Michael Armbrust] support for partitioned columns in output.
12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name.
b67a225 [Michael Armbrust] better errors when types don't match up.
9e74808 [Michael Armbrust] add children resolved.
6d03ce9 [Michael Armbrust] defaults for unresolved relation
2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coersions
be5ae2c [Michael Armbrust] better resolution logging
cb7b5af [Michael Armbrust] views example
420e05b [Michael Armbrust] more tests passing!
6916c63 [Michael Armbrust] Reading from partitioned hive tables.
a1245f9 [Michael Armbrust] more tests passing
956e760 [Michael Armbrust] extended explain
5f14c35 [Michael Armbrust] more test tables supported
175c43e [Michael Armbrust] better errors for parse exceptions
480ade5 [Michael Armbrust] don't use partial cached results.
8a9d21c [Michael Armbrust] fix evaluation
7aee69c [Michael Armbrust] parsing for joins, boolean logic
7fcf480 [Michael Armbrust] test for and logic
3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines.
6902490 [Michael Armbrust] fix boolean logic evaluation
4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic
8b2a2ee [Michael Armbrust] more tests passing!
ad1f3b4 [Michael Armbrust] toString for null literals
a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work)
60ec19d [Michael Armbrust] initial support for udfs
c45b440 [Michael Armbrust] support for is (not) null and boolean logic
7f4a1dc [Michael Armbrust] add NoRelation logical operator
72e183b [Michael Armbrust] support for null values in tree node args.
ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs
e5c9d1a [Michael Armbrust] use nonEmpty
dcc4fe1 [Michael Armbrust] support for src1 test table.
c78b587 [Michael Armbrust] casting.
75c3f3f [Michael Armbrust] add support for logging with scalalogging.
da2c011 [Michael Armbrust] make it more obvious when results are being truncated.
96b73ba [Michael Armbrust] more docs in TestShark
18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive.
e6d063b [Michael Armbrust] more join tests.
664c1c3 [Michael Armbrust] make parsing of function names case insensitive.
0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome.
1a6db68 [Michael Armbrust] spelling
7638cb4 [Michael Armbrust] simple join execution with dsl tests.  no hive tests yes.
859d4c9 [Michael Armbrust] better argString printing of nested trees.
fc53615 [Michael Armbrust] add same instance comparisons for tree nodes.
a026e6b [Michael Armbrust] move out hive specific operators
fff4d1c [Michael Armbrust] add simple query execution debugging
e2120ab [Michael Armbrust] sorting for strings
da06eb6 [Michael Armbrust] Parsing for sortby and joins
9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId.
8eb2460 [Michael Armbrust] add system property to override whitelist.
88124bb [Michael Armbrust] make strategy evaluation lazy.
74a3a21 [Michael Armbrust] implement outputSet
d25b171 [Michael Armbrust] Add AND and OR expressions
67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll
12acf0a [Michael Armbrust] add .DS_Store for macs
f7da6ce [Michael Armbrust] add agg with grouping expr in select test
36805b3 [Michael Armbrust] pull out and improve aggregation
75613e1 [Michael Armbrust] better evaluations failure messages.
4789a35 [Michael Armbrust] weaken type since its hard to create pure references.
e89dd36 [Michael Armbrust] no newline for online trees
d0590d4 [Michael Armbrust] include stack trace for catalyst failures.
081c0d9 [Michael Armbrust] more generic computation of agg functions.
31af3a0 [Michael Armbrust] fail when clauses are unhandeled in the parser
ecd45b2 [Michael Armbrust] Add more passing tests.
97d5419 [Michael Armbrust] fix alignment.
565cc13 [Michael Armbrust] make the canary query optional.
a95e65c [Michael Armbrust] support for resolving qualified attribute references.
e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails.
4640a0b [Michael Armbrust] handle test tables when database is specified.
bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis.
fec5158 [Michael Armbrust] add hive / idea files to .gitignore
3f97ffe [Michael Armbrust] Rename Hive => HiveQl
656b836 [Michael Armbrust] Support for parsing select clause aliases.
3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs.
3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception.
8cbef8a [Michael Armbrust] spelling
aa8c37c [Michael Armbrust] Better toString for SortOrder
1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions
a2e0327 [Michael Armbrust] add a bunch of tests.
4a3e1ea [Michael Armbrust] docs and use shark for data loading.
339bb8f [Michael Armbrust] better docs, Not support
1d7b2d9 [Michael Armbrust] Add NaN conversions.
46a2534 [Michael Armbrust] only run canary query on failure.
8996066 [Michael Armbrust] remove protected from makeCopy
53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass
04a372a [Michael Armbrust] add a flag for running all tests.
3b2235b [Michael Armbrust] More general implementation of arithmetic.
edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results
da6c577 [Michael Armbrust] add string <==> file utility functions.
3adf5ca [Michael Armbrust] Initial support for groupBy and count.
7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ.
a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product.
d66ba7e [Michael Armbrust] drop printlns.
88f2efd [Michael Armbrust] Add sum / count distinct expressions.
05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark
07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands.
b8a9910 [Michael Armbrust] quote tests passing.
8e5e267 [Michael Armbrust] handle aliased select expressions.
4286a96 [Michael Armbrust] drop debugging println
ac34aeb [Michael Armbrust] proof of concept for hive ast transformations.
2238b00 [Michael Armbrust] better error when makeCopy functions fails due to incorrect arguments
ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general.
74a6337 [Michael Armbrust] use fastEquals when doing transformations.
1184a23 [Michael Armbrust] add native test for escapes.
b972b18 [Michael Armbrust] create BaseRelation class
fa6bce9 [Michael Armbrust] implement union
6391a87 [Michael Armbrust] count aggregate.
d47c317 [Michael Armbrust] add unary minus, more tests passing.
c7114e4 [Michael Armbrust] first draft of star expansion.
044c43d [Michael Armbrust] better support for numeric literal parsing.
1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view.
61503c5 [Michael Armbrust] add cached toRdd
2036883 [Michael Armbrust] skip explain queries when testing.
ebac4b1 [Michael Armbrust] fix bug in sort reference calculation
ca0dee0 [Michael Armbrust] docs.
1ee0471 [Michael Armbrust] string literal parsing.
357278b [Michael Armbrust] add limit support
9b3e479 [Michael Armbrust] creation of string literals.
02efa30 [Michael Armbrust] alias evaluation
cb68b33 [Michael Armbrust] parsing for random sample in hive ql.
126dd36 [Michael Armbrust] include query plans in failure output
bb59ae9 [Michael Armbrust] doc fixes
7e68286 [Michael Armbrust] fix confusing naming
768bb25 [Michael Armbrust] handle errors in shark query toString
829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark.  Make test shark a singleton to avoid weirdness with the hive megastore.
ad02e41 [Michael Armbrust] comment jdo dependency
7bc89fe [Michael Armbrust] add collect to TreeNode.
438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs.
09679ee [Michael Armbrust] fix bug in TreeNode foreach
2930b27 [Michael Armbrust] more specific name for del query tests.
8842549 [Michael Armbrust] docs.
da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL.
a8969b9 [Michael Armbrust] Factor out hive query comparison test framework.
1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations.
a36dd9a [Michael Armbrust] evaluation for other > data types.
cae729b [Michael Armbrust] remove unnecessary lazy vals.
d8e12af [Michael Armbrust] docs
3a60d67 [Michael Armbrust] implement average, placeholder for count
f05c106 [Michael Armbrust] checkAnswer handles single row results.
2730534 [Michael Armbrust] implement inputSet
a9aa79d [Michael Armbrust] debugging for sort exec
8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors.
554b4b2 [Michael Armbrust] BoundAttribute pretty printing.
754f5fa [Michael Armbrust] dsl for setting nullability
a206d7a [Michael Armbrust] clean up query tests.
84ad6ef [Michael Armbrust] better sort implementation and tests.
de24923 [Michael Armbrust] add double type.
9611a2c [Michael Armbrust] literal creation for doubles.
7358313 [Michael Armbrust] sort order returns child type.
b544715 [Michael Armbrust] implement eval for rand, and > for doubles
7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols)
1c1a35e [Michael Armbrust] add simple Rand expression.
3ca51de [Michael Armbrust] add orderBy to dsl
7ae41ab [Michael Armbrust] more literal implicit conversions
b18b675 [Michael Armbrust] First cut at native query tests for shark.
d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark.
5eac895 [Michael Armbrust] better error when descending is specified.
2b16f86 [Michael Armbrust] add todo
e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization
9dde3c8 [Michael Armbrust] add project and filter operations.
ad9037b [Michael Armbrust] Add support for local relations.
6227143 [Michael Armbrust] evaluation of Equals.
7526290 [Michael Armbrust] BoundReference should also be an Attribute.
bd33e26 [Michael Armbrust] more documentation
5de0ea3 [Michael Armbrust] Move all shark specific into a separate package.  Lots of documentation improvements.
0ae292b [Michael Armbrust] implement calculation of sort expressions.
9fd5011 [Michael Armbrust] First cut at expression evaluation.
6259e3a [Michael Armbrust] cleanup
787e5a2 [Michael Armbrust] use fastEquals
f90da36 [Michael Armbrust] better printing of optimization exceptions
b05dd67 [Michael Armbrust] Application of rules to fixed point.
bb2e0db [Michael Armbrust] pretty print for literals.
1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals.
d3a3687 [Michael Armbrust] add fastEquals
2b4935b [Michael Armbrust] set sbt.version explicitly
46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests.
c79f2fd [Michael Armbrust] insert operator should return an empty rdd.
14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input.
ae7b4c3 [Michael Armbrust] remove implicit dependencies.  now compiles without copying things into lib/ manually.
84082f9 [Michael Armbrust] add sbt binaries and scripts
15371a8 [Michael Armbrust] First draft of simple Hive DDL parser.
063bf44 [Michael Armbrust] Periods should end all comments.
e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack.
ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing!
b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust.
e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected.
26f410a [Michael Armbrust] physical traits should extend PhysicalPlan.
dc72469 [Michael Armbrust] beginning of hive compatibility testing framework.
0763490 [Michael Armbrust] support for hive native command pass-through.
d8a924f [Michael Armbrust] scaladoc
29a7163 [Michael Armbrust] Insert into hive table physical operator.
633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy.
59ac444 [Michael Armbrust] add unary expression
3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName'
665f7d0 [Michael Armbrust] add logical nodes for hive data sinks.
64d2923 [Michael Armbrust] Add classes for representing sorts.
f72b7ce [Michael Armbrust] first trivial end to end query execution.
5c7d244 [Michael Armbrust] first draft of references implementation.
7bff274 [Michael Armbrust] point at new shark.
c7cd57f [Michael Armbrust] docs for util function.
910811c [Michael Armbrust] check each item of the sequence
ef21a0b [Michael Armbrust] line up comments.
4b765d5 [Michael Armbrust] docs, drop println
6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution.
a703c49 [Michael Armbrust] this order works better until fixed point is implemented.
ec1d7c0 [Michael Armbrust] Simple attribute resolution.
069df02 [Michael Armbrust] parsing binary predicates
a1cf754 [Michael Armbrust] add joins and equality.
3f5bc98 [Michael Armbrust] add optiq to sbt.
54f3460 [Michael Armbrust] initial optiq parsing.
d9161ce [Michael Armbrust] add join operator
1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs
24ef6fb [Michael Armbrust] toString for alias.
ae7d776 [Michael Armbrust] add nullability changing function
d49dc02 [Michael Armbrust] scaladoc for named exprs
7c45dd7 [Michael Armbrust] pretty printing of trees.
78e34bf [Michael Armbrust] simple git ignore.
7ba19be [Michael Armbrust] First draft of interface to hive metastore.
7e7acf0 [Michael Armbrust] physical placeholder.
1c11136 [Michael Armbrust] first draft of error handling / plans for debugging.
3766a41 [Michael Armbrust] rearrange utility functions.
7fb3d5e [Michael Armbrust] docs and equality improvements.
45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions.
002d4d4 [Michael Armbrust] default to no alias.
be25003 [Michael Armbrust] add repl initialization to sbt.
0608a00 [Michael Armbrust] tighten public interface
a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms.
daa71ca [Michael Armbrust] foreach, maps, and scaladoc
6a158cb [Michael Armbrust] simple transform working.
db0299f [Michael Armbrust] basic analysis of relations minus transform function.
f74c4ee [Michael Armbrust] parsing a simple query.
08e4f57 [Michael Armbrust] upgrade scala include shark.
d3c6404 [Michael Armbrust] initial commit
2014-03-20 18:03:20 -07:00
Patrick Wendell e7423d4040 Revert "SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225."
This reverts commit ca4bf8c572.

Jetty 9 requires JDK7 which is probably not a dependency we want to bump right now. Before Spark 1.0 we should consider upgrading to Jetty 8. However, in the mean time to ease some pain let's revert this. Sorry for not catching this during the initial review. cc/ @rxin

Author: Patrick Wendell <pwendell@gmail.com>

Closes #167 from pwendell/jetty-revert and squashes the following commits:

811b1c5 [Patrick Wendell] Revert "SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225."
2014-03-18 00:46:03 -07:00
Sean Owen 97e4459e1e SPARK-1254. Consolidate, order, and harmonize repository declarations in Maven/SBT builds
This suggestion addresses a few minor suboptimalities with how repositories are handled.

1) Use HTTPS consistently to access repos, instead of HTTP

2) Consolidate repository declarations in the parent POM file, in the case of the Maven build, so that their ordering can be controlled to put the fully optional Cloudera repo at the end, after required repos. (This was prompted by the untimely failure of the Cloudera repo this week, which made the Spark build fail. #2 would have prevented that.)

3) Update SBT build to match Maven build in this regard

4) Update SBT build to not refer to Sonatype snapshot repos. This wasn't in Maven, and a build generally would not refer to external snapshots, but I'm not 100% sure on this one.

Author: Sean Owen <sowen@cloudera.com>

Closes #145 from srowen/SPARK-1254 and squashes the following commits:

42f9bfc [Sean Owen] Use HTTPS for repos; consolidate repos in parent in order to put optional Cloudera repo last; harmonize SBT build repos with Maven; remove snapshot repos from SBT build which weren't in Maven
2014-03-15 16:44:34 -07:00
Reynold Xin ca4bf8c572 SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225.
Author: Reynold Xin <rxin@apache.org>

Closes #113 from rxin/jetty9 and squashes the following commits:

867a2ce [Reynold Xin] Updated Jetty version to 9.1.3.v20140225 in Maven build file.
d7c97ca [Reynold Xin] Return the correctly bound port.
d14706f [Reynold Xin] Upgrade Jetty to 9.1.3.v20140225.
2014-03-13 12:16:04 -07:00
Patrick Wendell 16788a6542 SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues...
This patch removes Ganglia integration from the default build. It
allows users willing to link against LGPL code to use Ganglia
by adding build flags or linking against a new Spark artifact called
spark-ganglia-lgpl.

This brings Spark in line with the Apache policy on LGPL code
enumerated here:

https://www.apache.org/legal/3party.html#options-optional

Author: Patrick Wendell <pwendell@gmail.com>

Closes #108 from pwendell/ganglia and squashes the following commits:

326712a [Patrick Wendell] Responding to review feedback
5f28ee4 [Patrick Wendell] SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues.
2014-03-11 11:16:59 -07:00
Patrick Wendell b9be160951 SPARK-782 Clean up for ASM dependency.
This makes two changes.

1) Spark uses the shaded version of asm that is (conveniently) published
   with Kryo.
2) Existing exclude rules around asm are updated to reflect the new groupId
   of `org.ow2.asm`. This made all of the old rules not work with newer Hadoop
   versions that pull in new asm versions.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #100 from pwendell/asm and squashes the following commits:

9235f3f [Patrick Wendell] SPARK-782 Clean up for ASM dependency.
2014-03-09 13:17:07 -07:00
Thomas Graves 7edbea41b4 SPARK-1189: Add Security to Spark - Akka, Http, ConnectionManager, UI use servlets
resubmit pull request.  was https://github.com/apache/incubator-spark/pull/332.

Author: Thomas Graves <tgraves@apache.org>

Closes #33 from tgravescs/security-branch-0.9-with-client-rebase and squashes the following commits:

dfe3918 [Thomas Graves] Fix merge conflict since startUserClass now using runAsUser
05eebed [Thomas Graves] Fix dependency lost in upmerge
d1040ec [Thomas Graves] Fix up various imports
05ff5e0 [Thomas Graves] Fix up imports after upmerging to master
ac046b3 [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase
13733e1 [Thomas Graves] Pass securityManager and SparkConf around where we can. Switch to use sparkConf for reading config whereever possible. Added ConnectionManagerSuite unit tests.
4a57acc [Thomas Graves] Change UI createHandler routines to createServlet since they now return servlets
2f77147 [Thomas Graves] Rework from comments
50dd9f2 [Thomas Graves] fix header in SecurityManager
ecbfb65 [Thomas Graves] Fix spacing and formatting
b514bec [Thomas Graves] Fix reference to config
ed3d1c1 [Thomas Graves] Add security.md
6f7ddf3 [Thomas Graves] Convert SaslClient and SaslServer to scala, change spark.authenticate.ui to spark.ui.acls.enable, and fix up various other things from review comments
2d9e23e [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase_rework
5721c5a [Thomas Graves] update AkkaUtilsSuite test for the actorSelection changes, fix typos based on comments, and remove extra lines I missed in rebase from AkkaUtils
f351763 [Thomas Graves] Add Security to Spark - Akka, Http, ConnectionManager, UI to use servlets
2014-03-06 18:27:50 -06:00
Prashant Sharma 181ec50307 [java8API] SPARK-964 Investigate the potential for using JDK 8 lambda expressions for the Java/Scala APIs
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #17 from ScrapCodes/java8-lambdas and squashes the following commits:

95850e6 [Patrick Wendell] Some doc improvements and build changes to the Java 8 patch.
85a954e [Prashant Sharma] Nit. import orderings.
673f7ac [Prashant Sharma] Added support for -java-home as well
80a13e8 [Prashant Sharma] Used fake class tag syntax
26eb3f6 [Prashant Sharma] Patrick's comments on PR.
35d8d79 [Prashant Sharma] Specified java 8 building in the docs
31d4cd6 [Prashant Sharma] Maven build to support -Pjava8-tests flag.
4ab87d3 [Prashant Sharma] Review feedback on the pr
c33dc2c [Prashant Sharma] SPARK-964, Java 8 API Support.
2014-03-03 22:31:30 -08:00
Sean Owen fd31adbf27 SPARK-1084.2 (resubmitted)
(Ported from https://github.com/apache/incubator-spark/pull/650 )

This adds one more change though, to fix the scala version warning introduced by json4s recently.

Author: Sean Owen <sowen@cloudera.com>

Closes #32 from srowen/SPARK-1084.2 and squashes the following commits:

9240abd [Sean Owen] Avoid scala version conflict in scalap induced by json4s dependency
1561cec [Sean Owen] Remove "exclude *" dependencies that are causing Maven warnings, and that are apparently unneeded anyway
2014-03-02 14:27:53 -08:00
Patrick Wendell 1fd2bfd3dd Remove remaining references to incubation
This removes some loose ends not caught by the other (incubating -> tlp) patches. @markhamstra this updates the version as you mentioned earlier.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #51 from pwendell/tlp and squashes the following commits:

d553b1b [Patrick Wendell] Remove remaining references to incubation
2014-03-02 01:00:16 -08:00
Binh Nguyen b70823c91d Update io.netty from 4.0.13 Final to 4.0.17.Final
This update contains a lot of bug fixes and some new perf improvements.
It is also binary compatible with the current 4.0.13.Final

For more information: http://netty.io/news/2014/02/25/4-0-17-Final.html

Author: Binh Nguyen <ngbinh@gmail.com>

Author: Binh Nguyen <ngbinh@gmail.com>

Closes #41 from ngbinh/master and squashes the following commits:

a9498f4 [Binh Nguyen] update io.netty to 4.0.17.Final
2014-03-02 00:48:50 -08:00
Michael Armbrust 012bd5fbc9 Merge the old sbt-launch-lib.bash with the new sbt-launcher jar downloading logic.
This allows developers to pass options (such as -D) to sbt.  I also modified the SparkBuild to ensure spark specific properties are propagated to forked test JVMs.

Author: Michael Armbrust <michael@databricks.com>

Closes #14 from marmbrus/sbtScripts and squashes the following commits:

c008b18 [Michael Armbrust] Merge the old sbt-launch-lib.bash with the new sbt-launcher jar downloading logic.
2014-03-02 00:35:23 -08:00
Prashant Sharma 6ccd6c55bd SPARK-1121 Only add avro if the build is for Hadoop 0.23.X and SPARK_YARN is set
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #6 from ScrapCodes/SPARK-1121/avro-dep-fix and squashes the following commits:

9b29e34 [Prashant Sharma] Review feedback on PR
46ed2ad [Prashant Sharma] SPARK-1121-Only add avro if the build is for Hadoop 0.23.X and SPARK_YARN is set
2014-02-26 23:40:49 -08:00
William Benton fbedc8eff2 SPARK-1078: Replace lift-json with json4s-jackson.
The aim of the Json4s project is to provide a common API for
Scala JSON libraries.  It is Apache-licensed, easier for
downstream distributions to package, and mostly API-compatible
with lift-json.  Furthermore, the Jackson-backed implementation
parses faster than lift-json on all but the smallest inputs.

Author: William Benton <willb@redhat.com>

Closes #582 from willb/json4s and squashes the following commits:

7ca62c4 [William Benton] Replace lift-json with json4s-jackson.
2014-02-26 10:09:50 -08:00
Raymond Liu c852201ce9 For SPARK-1082, Use Curator for ZK interaction in standalone cluster
Author: Raymond Liu <raymond.liu@intel.com>

Closes #611 from colorant/curator and squashes the following commits:

7556aa1 [Raymond Liu] Address review comments
af92e1f [Raymond Liu] Fix coding style
964f3c2 [Raymond Liu] Ignore NodeExists exception
6df2966 [Raymond Liu] Rewrite zookeeper client code with curator
2014-02-24 23:20:38 -08:00
Sean Owen c0ef3afa82 SPARK-1071: Tidy logging strategy and use of log4j
Prompted by a recent thread on the mailing list, I tried and failed to see if Spark can be made independent of log4j. There are a few cases where control of the underlying logging is pretty useful, and to do that, you have to bind to a specific logger.

Instead I propose some tidying that leaves Spark's use of log4j, but gets rid of warnings and should still enable downstream users to switch. The idea is to pipe everything (except log4j) through SLF4J, and have Spark use SLF4J directly when logging, and where Spark needs to output info (REPL and tests), bind from SLF4J to log4j.

This leaves the same behavior in Spark. It means that downstream users who want to use something except log4j should:

- Exclude dependencies on log4j, slf4j-log4j12 from Spark
- Include dependency on log4j-over-slf4j
- Include dependency on another logger X, and another slf4j-X
- Recreate any log config that Spark does, that is needed, in the other logger's config

That sounds about right.

Here are the key changes:

- Include the jcl-over-slf4j shim everywhere by depending on it in core.
- Exclude dependencies on commons-logging from third-party libraries.
- Include the jul-to-slf4j shim everywhere by depending on it in core.
- Exclude slf4j-* dependencies from third-party libraries to prevent collision or warnings
- Added missing slf4j-log4j12 binding to GraphX, Bagel module tests

And minor/incidental changes:

- Update to SLF4J 1.7.5, which happily matches Hadoop 2’s version and is a recommended update over 1.7.2
- (Remove a duplicate HBase dependency declaration in SparkBuild.scala)
- (Remove a duplicate mockito dependency declaration that was causing warnings and bugging me)

Author: Sean Owen <sowen@cloudera.com>

Closes #570 from srowen/SPARK-1071 and squashes the following commits:

52eac9f [Sean Owen] Add slf4j-over-log4j12 dependency to core (non-test) and remove it from things that depend on core.
77a7fa9 [Sean Owen] SPARK-1071: Tidy logging strategy and use of log4j
2014-02-23 11:40:55 -08:00
Bijay Bisht a3bb86179e Ported hadoopClient jar for < 1.0.1 fix
#522 got messed after i rewrote the branch hadoop_jar_name. So created a new one.

Author: Bijay Bisht <bijay.bisht@gmail.com>

Closes #584 from bijaybisht/hadoop_jar_name_on_0.9.0 and squashes the following commits:

1b6fb3c [Bijay Bisht] Ported hadoopClient jar for < 1.0.1 fix
(cherry picked from commit 8093de1bb3)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
2014-02-12 23:42:25 -08:00
Patrick Wendell b69f8b2a01 Merge pull request #557 from ScrapCodes/style. Closes #557.
SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build.

Author: Patrick Wendell <pwendell@gmail.com>
Author: Prashant Sharma <scrapcodes@gmail.com>

== Merge branch commits ==

commit 1a8bd1c059b842cb95cc246aaea74a79fec684f4
Author: Prashant Sharma <scrapcodes@gmail.com>
Date:   Sun Feb 9 17:39:07 2014 +0530

    scala style fixes

commit f91709887a8e0b608c5c2b282db19b8a44d53a43
Author: Patrick Wendell <pwendell@gmail.com>
Date:   Fri Jan 24 11:22:53 2014 -0800

    Adding scalastyle snapshot
2014-02-09 10:09:19 -08:00
Mark Hamstra c2341c92bb Merge pull request #542 from markhamstra/versionBump. Closes #542.
Version number to 1.0.0-SNAPSHOT

Since 0.9.0-incubating is done and out the door, we shouldn't be building 0.9.0-incubating-SNAPSHOT anymore.

@pwendell

Author: Mark Hamstra <markhamstra@gmail.com>

== Merge branch commits ==

commit 1b00a8a7c1a7f251b4bb3774b84b9e64758eaa71
Author: Mark Hamstra <markhamstra@gmail.com>
Date:   Wed Feb 5 09:30:32 2014 -0800

    Version number to 1.0.0-SNAPSHOT
2014-02-08 16:00:43 -08:00
Josh Rosen 531d9d7576 Increase JUnit test verbosity under SBT.
Upgrade junit-interface plugin from 0.9 to 0.10.

I noticed that the JavaAPISuite tests didn't
appear to display any output locally or under
Jenkins, making it difficult to know whether they
were running.  This change increases the verbosity
to more closely match the ScalaTest tests.
2014-01-25 16:32:44 -08:00
Jianping J Wang 19a01c1b1d Add jblas dependency 2014-01-23 19:54:01 +08:00
Sean Owen 4476398f7d Also add graphx commons-math3 dependeny in sbt build 2014-01-22 22:40:41 +00:00
Patrick Wendell bf5699543b Merge pull request #462 from mateiz/conf-file-fix
Remove Typesafe Config usage and conf files to fix nested property names

With Typesafe Config we had the subtle problem of no longer allowing
nested property names, which are used for a few of our properties:
http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html

This PR is for branch 0.9 but should be added into master too.
(cherry picked from commit 34e911ce9a)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
2014-01-18 16:20:00 -08:00
Patrick Wendell 4a805aff5e Merge pull request #367 from ankurdave/graphx
GraphX: Unifying Graphs and Tables

GraphX extends Spark's distributed fault-tolerant collections API and interactive console with a new graph API which leverages recent advances in graph systems (e.g., [GraphLab](http://graphlab.org)) to enable users to easily and interactively build, transform, and reason about graph structured data at scale. See http://amplab.github.io/graphx/.

Thanks to @jegonzal, @rxin, @ankurdave, @dcrankshaw, @jianpingjwang, @amatsukawa, @kellrott, and @adamnovak.

Tasks left:
- [x] Graph-level uncache
- [x] Uncache previous iterations in Pregel
- [x] ~~Uncache previous iterations in GraphLab~~ (postponed to post-release)
- [x] - Describe GC issue with GraphLab
- [ ] Write `docs/graphx-programming-guide.md`
- [x] - Mention future Bagel support in docs
- [ ] - Section on caching/uncaching in docs: As with Spark, cache something that is used more than once. In an iterative algorithm, try to cache and force (i.e., materialize) something every iteration, then uncache the cached things that depended on the newly materialized RDD but that won't be referenced again.
- [x] Undo modifications to core collections and instead copy them to org.apache.spark.graphx
- [x] Make Graph serializable to work around capture in Spark shell
- [x] Rename graph -> graphx in package name and subproject
- [x] Remove standalone PageRank
- [x] ~~Fix amplab/graphx#52 by checking `iter.hasNext`~~
2014-01-13 22:58:38 -08:00
Reynold Xin 33022d6656 Adjusted visibility of various components. 2014-01-13 19:58:53 -08:00
Reynold Xin e2d25d2dfe Merge branch 'master' into graphx 2014-01-13 16:21:26 -08:00
Patrick Wendell 4216178d5e Merge pull request #373 from jerryshao/kafka-upgrade
Upgrade Kafka dependecy to 0.8.0 release version
2014-01-11 09:46:48 -08:00
jerryshao dd7fd0c443 Upgrade Kafka dependecy to 0.8.0 release version 2014-01-10 08:56:11 +08:00
Ankur Dave 731f56f309 graph -> graphx 2014-01-09 14:31:33 -08:00
Ankur Dave 91227566bc Merge remote-tracking branch 'spark-upstream/master' into HEAD
Conflicts:
	README.md
	core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala
	core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
	core/src/main/scala/org/apache/spark/util/collection/PrimitiveKeyOpenHashMap.scala
	pom.xml
	project/SparkBuild.scala
	repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala
2014-01-08 21:19:08 -08:00
Patrick Wendell bc81ce040d Merge remote-tracking branch 'apache-github/master' into standalone-driver
Conflicts:
	core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala
	pom.xml
2014-01-08 00:38:31 -08:00
Patrick Wendell c0f0155eca Merge pull request #313 from tdas/project-refactor
Refactored the streaming project to separate external libraries like Twitter, Kafka, Flume, etc.

At a high level, these are the following changes.

1. All the external code was put in `SPARK_HOME/external/` as separate SBT projects and Maven modules. Their artifact names are `spark-streaming-twitter`, `spark-streaming-kafka`, etc. Both SparkBuild.scala and pom.xml files have been updated. References to external libraries and repositories have been removed from the settings of root and streaming projects/modules.

2. To avail the external functionality (say, creating a Twitter stream), the developer has to `import org.apache.spark.streaming.twitter._` . For Scala API, the developer has to call `TwitterUtils.createStream(streamingContext, ...)`. For the Java API, the developer has to call `TwitterUtils.createStream(javaStreamingContext, ...)`.

3.  Each external project has its own scala and java unit tests. Note the unit tests of each external library use classes of the streaming unit tests (`TestSuiteBase`, `LocalJavaStreamingContext`, etc.). To enable this code sharing among test classes, `dependsOn(streaming % "compile->compile,test->test")` was used in the SparkBuild.scala . In the streaming/pom.xml, an additional `maven-jar-plugin` was necessary to capture this dependency (see comment inside the pom.xml for more information).

4. Jars of the external projects have been added to examples project but not to the assembly project.

5. In some files, imports have been rearrange to conform to the Spark coding guidelines.
2014-01-07 22:21:52 -08:00
Patrick Wendell e21a707a13 Adding unit tests and some refactoring to promote testability. 2014-01-07 15:39:47 -08:00
Patrick Wendell 93bf96205d Merge pull request #340 from ScrapCodes/sbt-fixes
Made java options to be applied during tests so that they become self explanatory.
2014-01-06 11:42:41 -08:00
Tathagata Das 3b4c4c7f4d Merge remote-tracking branch 'apache/master' into project-refactor
Conflicts:
	examples/src/main/java/org/apache/spark/streaming/examples/JavaFlumeEventCount.java
	streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
	streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
	streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java
	streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
	streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
2014-01-06 03:05:52 -08:00
Prashant Sharma 2d0825e9f4 Made java options to be applied during tests so that they become self explanatory. 2014-01-06 16:03:31 +05:30
Prashant Sharma 355a033893 SPARK-1005 Ning upgrade 2014-01-06 14:38:27 +05:30
Patrick Wendell 604fad9c39 Merge remote-tracking branch 'apache-github/master' into remove-binaries
Conflicts:
	core/src/test/scala/org/apache/spark/DriverSuite.scala
	docs/python-programming-guide.md
2014-01-03 21:29:33 -08:00
Patrick Wendell 9e6f3bdcda Changes on top of Prashant's patch.
Closes #316
2014-01-03 18:30:17 -08:00
Prashant Sharma 94f2fffa23 fixed review comments 2014-01-03 14:43:37 +05:30
Prashant Sharma b4bb80002b Merge branch 'master' into spark-1002-remove-jars 2014-01-03 12:12:04 +05:30
Raymond Liu ebdfa6bb97 Using name yarn-alpha/yarn instead of yarn-2.0/yarn-2.2 2014-01-03 12:14:38 +08:00
Raymond Liu a47ebf7228 Add yarn/common/src/test dir in building script 2014-01-03 12:14:38 +08:00
Raymond Liu d1a6f7aabc Use unmanaged source dir to include common yarn code 2014-01-03 12:14:37 +08:00
Raymond Liu 3dc379ce5a Reorganize yarn related codes into sub projects to remove duplicate files. 2014-01-03 12:12:37 +08:00
Prashant Sharma 8821c3a526 Deleted py4j jar and added to assembly dependency 2014-01-02 13:09:46 +05:30
Matei Zaharia 0e5b2adb5c Merge remote-tracking branch 'apache/master' into conf2
Conflicts:
	project/SparkBuild.scala
2014-01-01 13:28:54 -05:00
Reynold Xin 8b8e70ebde Merge pull request #73 from falaki/ApproximateDistinctCount
Approximate distinct count

Added countApproxDistinct() to RDD and countApproxDistinctByKey() to PairRDDFunctions to approximately count distinct number of elements and distinct number of values per key, respectively. Both functions use HyperLogLog from stream-lib for counting. Both functions take a parameter that controls the trade-off between accuracy and memory consumption. Also added Scala docs and test suites for both methods.
2013-12-31 17:48:24 -08:00
Matei Zaharia ba9338f104 Merge remote-tracking branch 'apache/master' into conf2
Conflicts:
	core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
	streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
	streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
2013-12-31 18:23:14 -05:00
Tathagata Das 97630849ff Added pom.xml for external projects and removed unnecessary dependencies and repositoris from other poms and sbt. 2013-12-31 00:28:57 -08:00
Hossein Falaki b75d7c98bc Added stream 2.5.1 jar depenency 2013-12-30 19:29:17 -08:00
Tathagata Das f4e4066191 Refactored kafka, flume, zeromq, mqtt as separate external projects, with their own self-contained scala API, java API, scala unit tests and java unit tests. Updated examples to use the external projects. 2013-12-30 11:13:24 -08:00
Matei Zaharia b4ceed40d6 Merge remote-tracking branch 'origin/master' into conf2
Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
	core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
	core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala
	core/src/main/scala/org/apache/spark/util/MetadataCleaner.scala
	core/src/test/scala/org/apache/spark/scheduler/TaskResultGetterSuite.scala
	core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
	new-yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
	streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
	streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
	streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
	streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala
	streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala
	streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
	streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
	streaming/src/test/scala/org/apache/spark/streaming/WindowOperationsSuite.scala
2013-12-29 15:08:08 -05:00
Tathagata Das 6e43039614 Refactored streaming project to separate out the twitter functionality. 2013-12-26 18:02:49 -08:00
Binh Nguyen 040dd3ecd5 upgrade Netty from 4.0.0.Beta2 to 4.0.13.Final 2013-12-24 14:58:18 -08:00
Prashant Sharma 2573add94c spark-544, introducing SparkConf and related configuration overhaul. 2013-12-25 00:09:36 +05:30
Reynold Xin fc80b2e693 Show full stack trace and time taken in unit tests. 2013-12-23 21:20:20 -08:00
Aaron Davidson eaf6a269b1 [SPARK-959] Explicitly depend on org.eclipse.jetty.orbit jar
Without this, in some cases, Ivy attempts to download the wrong file
and fails, stopping the whole build. See bug for more details.

(This is probably also the beginning of the slow death of our
recently prettified dependencies. Form follow function.)
2013-12-18 23:37:31 -08:00
Patrick Wendell c6f95e603e Attempt with extra repositories 2013-12-16 21:53:51 -08:00
Prashant Sharma a854cc536d Review comments on the PR for scala 2.10 migration. 2013-12-13 15:19:51 +05:30
Prashant Sharma 589b83a18f Disabled yarn 2.2 and added a message in the sbt build 2013-12-12 16:25:30 +05:30
Prashant Sharma f4c73df5c9 Merge branch 'akka-bug-fix' of github.com:ScrapCodes/incubator-spark into akka-bug-fix 2013-12-11 10:22:44 +05:30
Prashant Sharma 603af51bb5 Merge branch 'master' into akka-bug-fix
Conflicts:
	core/pom.xml
	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
	pom.xml
	project/SparkBuild.scala
	streaming/pom.xml
	yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
2013-12-11 10:21:53 +05:30
Prashant Sharma 0b82b5af1e added eclipse repository for spark streaming. 2013-12-11 08:17:02 +05:30
Harvey Feng 1b6e450771 Use published "org.spark-project.akka-*" in sbt build for Hadoop-2.2 dependencies.
This also includes:
-Change `isNewYarn` to `isNewHadoop`, since the protobuf-2.5 dependency is from Hadoop-2.2 itself.
-Regexp bugix

Credits to @alig for this patch.
2013-12-03 00:28:33 -08:00
Harvey Feng afe4fe7f5e Merge remote-tracking branch 'origin/master' into yarn-2.2
Conflicts:
	yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
2013-11-26 15:03:03 -08:00
Harvey Feng a1a1c62a3e Add optional Hadoop 2.2 settings in sbt build.
If the Hadoop used is version 2.2 or derived from it, then Spark
will be compiled against protobuf-2.5 and a protobuf-2.5 version of
Akka 2.0.5.
2013-11-26 14:58:41 -08:00
Prashant Sharma 44fd30d3fb Merge branch 'master' into scala-2.10-wip
Conflicts:
	core/src/main/scala/org/apache/spark/rdd/RDD.scala
	project/SparkBuild.scala
2013-11-25 18:10:54 +05:30
Reynold Xin 6bcac986b2 Merge branch 'master' of github.com:apache/incubator-spark 2013-11-25 15:47:47 +08:00
Matei Zaharia 859d62dc2a Merge pull request #151 from russellcardullo/add-graphite-sink
Add graphite sink for metrics

This adds a metrics sink for graphite.  The sink must
be configured with the host and port of a graphite node
and optionally may be configured with a prefix that will
be prepended to all metrics that are sent to graphite.
2013-11-24 16:19:51 -08:00
Aaron Davidson ce1d2af7e4 Use Kafka 2.10 (again) 2013-11-14 23:00:39 -08:00
Aaron Davidson f629ba95b6 Various merge corrections
I've diff'd this patch against my own -- since they were both created
independently, this means that two sets of eyes have gone over all the
merge conflicts that were created, so I'm feeling significantly more
confident in the resulting PR.

@rxin has looked at the changes to the repl and is resoundingly
confident that they are correct.
2013-11-14 22:13:09 -08:00
Raymond Liu d4cd32330e Some fixes for previous master merge commits 2013-11-15 10:22:31 +08:00
Raymond Liu a60620b76a Merge branch 'master' into scala-2.10 2013-11-14 12:44:19 +08:00
Matei Zaharia 9290e5bcd2 Merge pull request #165 from NathanHowell/kerberos-master
spark-assembly.jar fails to authenticate with YARN ResourceManager

The META-INF/services/ sbt MergeStrategy was discarding support for Kerberos, among others. This pull request changes to a merge strategy similar to sbt-assembly's default. I've also included an update to sbt-assembly 0.9.2, a minor fix to it's zip file handling.
2013-11-13 16:48:44 -08:00
Raymond Liu 0f2e3c6e31 Merge branch 'master' into scala-2.10 2013-11-13 16:55:11 +08:00
Matei Zaharia f49ea28d25 Merge pull request #137 from tgravescs/sparkYarnJarsHdfsRebase
Allow spark on yarn to be run from HDFS.

Allows the spark.jar, app.jar, and log4j.properties to be put into hdfs.  Allows you to specify the files on a different hdfs cluster and it will copy them over. It makes sure permissions are correct and makes sure to put things into public distributed cache so they can be reused amongst users if their permissions are appropriate.  Also add a bit of error handling for missing arguments.
2013-11-12 19:13:39 -08:00
Nathan Howell 23146a6705 spark-assembly.jar fails to authenticate with YARN ResourceManager
sbt-assembly is setup to pick the first META-INF/services/org.apache.hadoop.security.SecurityInfo file instead of merging them. This causes Kerberos authentication to fail, this manifests itself in the "info:null" debug log statement:

    DEBUG SaslRpcClient: Get token info proto:interface org.apache.hadoop.yarn.api.ApplicationClientProtocolPB info:null
    DEBUG SaslRpcClient: Get kerberos info proto:interface org.apache.hadoop.yarn.api.ApplicationClientProtocolPB info:null
    ERROR UserGroupInformation: PriviledgedActionException as:foo@BAR (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
    DEBUG UserGroupInformation: PrivilegedAction as:foo@BAR (auth:KERBEROS) from:org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:583)
    WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
    ERROR UserGroupInformation: PriviledgedActionException as:foo@BAR (auth:KERBEROS) cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]

This previously would just contain a single class:

$ unzip -c assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-SNAPSHOT-hadoop2.2.0.jar META-INF/services/org.apache.hadoop.security.SecurityInfo
Archive:  assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-SNAPSHOT-hadoop2.2.0.jar
  inflating: META-INF/services/org.apache.hadoop.security.SecurityInfo

    org.apache.hadoop.security.AnnotatedSecurityInfo

And now has the full list of classes:

$ unzip -c assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-SNAPSHOT-hadoop2.2.0.jar META-INF/services/org.apache.hadoop.security.SecurityInfoArchive:  assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-SNAPSHOT-hadoop2.2.0.jar
  inflating: META-INF/services/org.apache.hadoop.security.SecurityInfo

    org.apache.hadoop.security.AnnotatedSecurityInfo
    org.apache.hadoop.mapreduce.v2.app.MRClientSecurityInfo
    org.apache.hadoop.mapreduce.v2.security.client.ClientHSSecurityInfo
    org.apache.hadoop.yarn.security.client.ClientRMSecurityInfo
    org.apache.hadoop.yarn.security.ContainerManagerSecurityInfo
    org.apache.hadoop.yarn.security.SchedulerSecurityInfo
    org.apache.hadoop.yarn.security.admin.AdminSecurityInfo
    org.apache.hadoop.yarn.server.RMNMSecurityInfoClass
2013-11-12 13:27:50 -08:00
tgravescs 17bb9a27b2 Add mockito to the sbt build 2013-11-11 10:01:23 -06:00
Josh Rosen a37ff0f1db Add spark-tools assembly to spark-class classpath.
This allows the JavaAPICompletenessChecker to be
run with Spark 0.8+.
2013-11-09 13:42:45 -08:00
Russell Cardullo ef85a51f85 Add graphite sink for metrics
This adds a metrics sink for graphite.  The sink must
be configured with the host and port of a graphite node
and optionally may be configured with a prefix that will
be prepended to all metrics that are sent to graphite.
2013-11-08 16:36:03 -08:00
Ankur Dave 5064f9b2d2 Merge remote-tracking branch 'spark-upstream/master'
Conflicts:
	project/SparkBuild.scala
2013-10-30 15:59:09 -07:00
Patrick Wendell af4a529f6e Exclude jopt from kafka dependency.
Kafka uses an older version of jopt that causes bad conflicts with the version
used by spark-perf. It's not easy to remove this downstream because of the way
that spark-perf uses Spark (by including a spark assembly as an unmanaged jar).
This fixes the problem at its source by just never including it.
2013-10-25 09:20:30 -07:00
Prashant Sharma c77ca1fed9 Updating to latest akka 2.2.3, which fixes our only failing Driver Suite 2013-10-24 16:11:40 +05:30
Matei Zaharia dadfc63b03 Fix Maven build to use MQTT repository 2013-10-23 15:29:22 -07:00
Matei Zaharia dd659642e7 Merge pull request #64 from prabeesh/master
MQTT Adapter for Spark Streaming

MQTT is a machine-to-machine (M2M)/Internet of Things connectivity protocol.
It was designed as an extremely lightweight publish/subscribe messaging transport. You may read more about it here http://mqtt.org/

Message Queue Telemetry Transport (MQTT) is an open message protocol for M2M communications. It enables the transfer of telemetry-style data in the form of messages from devices like sensors and actuators, to mobile phones, embedded systems on vehicles, or laptops and full scale computers.

The protocol was invented by Andy Stanford-Clark of IBM, and Arlen Nipper of Cirrus Link Solutions

This protocol enables a publish/subscribe messaging model in an extremely lightweight way. It is useful for connections with remote locations where line of code and network bandwidth is a constraint.

MQTT is one of the widely used protocol for 'Internet of Things'. This protocol is getting much attraction as anything and everything is getting connected to internet and they all produce data. Researchers and companies predict some 25 billion devices will be connected to the internet by 2015.

Plugin/Support for MQTT is available in popular MQs like RabbitMQ, ActiveMQ etc.

Support for MQTT in Spark will help people with Internet of Things (IoT) projects to use Spark Streaming for their real time data processing needs (from sensors and other embedded devices etc).
2013-10-23 15:07:59 -07:00
Matei Zaharia 731c94e91d Merge pull request #56 from jerryshao/kafka-0.8-dev
Upgrade Kafka 0.7.2 to Kafka 0.8.0-beta1 for Spark Streaming

Conflicts:
	streaming/pom.xml
2013-10-21 23:31:38 -07:00
Matei Zaharia 8de9706b86 Merge pull request #66 from shivaram/sbt-assembly-deps
Add SBT target to assemble dependencies

This pull request is an attempt to address the long assembly build times during development. Instead of rebuilding the assembly jar for every Spark change, this pull request adds a new SBT target `spark` that packages all the Spark modules and builds an assembly of the dependencies.

So the work flow that should work now would be something like

```
./sbt/sbt spark # Doing this once should suffice
## Make changes
./sbt/sbt compile
./sbt/sbt test or ./spark-shell
```
2013-10-18 20:32:39 -07:00
Joseph E. Gonzalez 1856b37e9d Merge branch 'master' of https://github.com/apache/incubator-spark into indexedrdd_graphx 2013-10-18 12:21:19 -07:00
prabeesh 29245605bf remove unused dependency 2013-10-17 09:57:30 +05:30
Shivaram Venkataraman 0a4b76fcc2 Rename SBT target to assemble-deps. 2013-10-16 17:05:46 -07:00
prabeesh 06de3d516d added mqtt adapter library dependencies 2013-10-16 13:38:37 +05:30
Patrick Wendell 35befe07bb Fixing spark streaming example and a bug in examples build.
- Examples assembly included a log4j.properties which clobbered Spark's
- Example had an error where some classes weren't serializable
- Did some other clean-up in this example
2013-10-15 22:55:43 -07:00
Shivaram Venkataraman 051cd960d9 Merge branch 'master' of https://github.com/apache/incubator-spark into sbt-assembly-deps 2013-10-15 13:26:40 -07:00
Joseph E. Gonzalez ef7c369092 merged with upstream changes 2013-10-14 22:56:42 -07:00
jerryshao c23cd72b4b Upgrade Kafka 0.7.2 to Kafka 0.8.0-beta1 for Spark Streaming 2013-10-12 20:00:42 +08:00
Shivaram Venkataraman c441904bce Add a comment and exclude tools 2013-10-11 18:23:15 -07:00
Matei Zaharia c71499b779 Merge pull request #19 from aarondav/master-zk
Standalone Scheduler fault tolerance using ZooKeeper

This patch implements full distributed fault tolerance for standalone scheduler Masters.
There is only one master Leader at a time, which is actively serving scheduling
requests. If this Leader crashes, another master will eventually be elected, reconstruct
the state from the first Master, and continue serving scheduling requests.

Leader election is performed using the ZooKeeper leader election pattern. We try to minimize
the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of
retries and session monitoring on top of the ZooKeeper client.

Master failover follows directly from the single-node Master recovery via the file
system (patch d5a96fe), save that the Master state is stored in ZooKeeper instead.

Configuration:
By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE).
By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url
to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled.
By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory
to an appropriate directory accessible by the Master, we will keep the behavior of from d5a96fe.

Additionally, places where a Master could be specificied by a spark:// url can now take
comma-delimited lists to specify backup masters. Note that this is only used for registration
of NEW Workers and application Clients. Once a Worker or Client has registered with the
Master Leader, it is "in the system" and will never need to register again.
2013-10-10 17:16:42 -07:00
Prashant Sharma 26860639c5 Merge branch 'scala-2.10' of github.com:ScrapCodes/spark into scala-2.10
Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
	project/SparkBuild.scala
2013-10-10 09:42:23 +05:30
Shivaram Venkataraman 484166d520 Add new SBT target for dependency assembly 2013-10-09 04:24:34 -07:00
Prashant Sharma 7be75682b9 Merge branch 'master' into wip-merge-master
Conflicts:
	bagel/pom.xml
	core/pom.xml
	core/src/test/scala/org/apache/spark/ui/UISuite.scala
	examples/pom.xml
	mllib/pom.xml
	pom.xml
	project/SparkBuild.scala
	repl/pom.xml
	streaming/pom.xml
	tools/pom.xml

In scala 2.10, a shorter representation is used for naming artifacts
 so changed to shorter scala version for artifacts and made it a property in pom.
2013-10-08 11:29:40 +05:30
Reynold Xin 213b70a2db Merge pull request #31 from sundeepn/branch-0.8
Resolving package conflicts with hadoop 0.23.9

Hadoop 0.23.9 is having a package conflict with easymock's dependencies.

(cherry picked from commit 023e3fdf00)
Signed-off-by: Reynold Xin <rxin@apache.org>
2013-10-07 10:54:22 -07:00
Martin Weindel 9b0c9c893d scala 2.10 requires Java 1.6,
using Scala 2.10.3,
resolved maven-scala-plugin warning
2013-10-05 21:41:09 +02:00
Prashant Sharma c810ee0690 Merge branch 'master' into scala-2.10
Conflicts:
	core/src/test/scala/org/apache/spark/DistributedSuite.scala
	project/SparkBuild.scala
2013-10-05 15:52:57 +05:30
Du Li 9fd6bba60d ask ivy/sbt to check local maven repo under ~/.m2 2013-10-01 15:46:51 -07:00
Prashant Sharma 5829692885 Merge branch 'master' into scala-2.10
Conflicts:
	core/src/main/scala/org/apache/spark/ui/jobs/JobProgressUI.scala
	docs/_config.yml
	project/SparkBuild.scala
	repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala
2013-10-01 11:57:24 +05:30
Aaron Davidson f549ea33d3 Standalone Scheduler fault tolerance using ZooKeeper
This patch implements full distributed fault tolerance for standalone scheduler Masters.
There is only one master Leader at a time, which is actively serving scheduling
requests. If this Leader crashes, another master will eventually be elected, reconstruct
the state from the first Master, and continue serving scheduling requests.

Leader election is performed using the ZooKeeper leader election pattern. We try to minimize
the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of
retries and session monitoring on top of the ZooKeeper client.

Master failover follows directly from the single-node Master recovery via the file
system (patch 194ba4b8), save that the Master state is stored in ZooKeeper instead.

Configuration:
By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE).
By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url
to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled.
By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory
to an appropriate directory accessible by the Master, we will keep the behavior of from 194ba4b8.

Additionally, places where a Master could be specificied by a spark:// url can now take
comma-delimited lists to specify backup masters. Note that this is only used for registration
of NEW Workers and application Clients. Once a Worker or Client has registered with the
Master Leader, it is "in the system" and will never need to register again.

Forthcoming:
Documentation, tests (! - only ad hoc testing has been performed so far)
I do not intend for this commit to be merged until tests are added, but this patch should
still be mostly reviewable until then.
2013-09-26 15:04:23 -07:00
Reynold Xin 3f283278b0 Removed scala -optimize flag. 2013-09-26 13:58:10 -07:00
Reynold Xin c514cd1587 Merge pull request #930 from holdenk/master
Add mapPartitionsWithIndex
2013-09-26 13:48:20 -07:00
Prashant Sharma 604dc40996 Sync with master and some build fixes 2013-09-26 11:40:02 +05:30
Prashant Sharma 7ff4c2d399 fixed maven build for scala 2.10 2013-09-26 10:48:24 +05:30
Patrick Wendell 6079721fa1 Update build version in master 2013-09-24 11:41:51 -07:00
Prashant Sharma 276c37a51c Akka 2.2 migration 2013-09-22 08:20:12 +05:30
Joseph E. Gonzalez 55696e2584 GraphX now builds with all merged changes. 2013-09-17 22:42:12 -07:00
Joseph E. Gonzalez 8b59fb72c4 Merging latest changes from spark main branch 2013-09-17 20:56:12 -07:00
Patrick Wendell c856860c5b Bumping Mesos version to 0.13.0 2013-09-15 12:46:26 -07:00
Prashant Sharma 383e151fd7 Merge branch 'master' of git://github.com/mesos/spark into scala-2.10
Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
	project/SparkBuild.scala
2013-09-15 10:55:12 +05:30
Prashant Sharma 20c65bc334 Fixed repl suite 2013-09-15 10:43:06 +05:30
Holden Karau 68068977b8 Fix build on ubuntu 2013-09-14 20:51:11 -07:00
Patrick Wendell 91a59e6b10 Merge pull request #919 from mateiz/jets3t
Add explicit jets3t dependency, which is excluded in hadoop-client
2013-09-11 10:21:48 -07:00
Patrick Wendell 0c1985b153 Fix HDFS access bug with assembly build.
Due to this change in HDFS:
https://issues.apache.org/jira/browse/HADOOP-7549

there is a bug when using the new assembly builds. The symptom is that any HDFS access
results in an exception saying "No filesystem for scheme 'hdfs'". This adds a merge
strategy in the assembly build which fixes the problem.
2013-09-10 22:05:13 -07:00
Matei Zaharia f117dc6d0d Add explicit jets3t dependency, which is excluded in hadoop-client 2013-09-10 06:39:25 +00:00
Patrick Wendell f68848d95d Merge pull request #906 from pwendell/ganglia-sink
Clean-up of Metrics Code/Docs and Add Ganglia Sink
2013-09-08 18:32:16 -07:00
Matei Zaharia 0b957997ad Merge pull request #908 from pwendell/master
Fix target JVM version in scala build
2013-09-08 15:30:16 -07:00
Patrick Wendell 27bd74c8ad Fix target JVM version in scala build 2013-09-08 14:37:45 -07:00
Patrick Wendell 8de8ee5d3c Ganglia sink 2013-09-08 10:08:18 -07:00
Jey Kottalam 30a32c8335 Minor YARN build cleanups 2013-09-06 11:31:16 -07:00
Prashant Sharma 4106ae9fbf Merged with master 2013-09-06 17:53:01 +05:30
Matei Zaharia 59218bdd49 Add Apache parent POM 2013-09-02 18:34:03 -07:00
Matei Zaharia 5701eb92c7 Fix some URLs 2013-09-01 14:13:16 -07:00
Matei Zaharia 46eecd110a Initial work to rename package to org.apache.spark 2013-09-01 14:13:13 -07:00
Matei Zaharia 666d93c294 Update Maven build to create assemblies expected by new scripts
This includes the following changes:
- The "assembly" package now builds in Maven by default, and creates an
  assembly containing both hadoop-client and Spark, unlike the old
  BigTop distribution assembly that skipped hadoop-client
- There is now a bigtop-dist package to build the old BigTop assembly
- The repl-bin package is no longer built by default since the scripts
  don't reply on it; instead it can be enabled with -Prepl-bin
- Py4J is now included in the assembly/lib folder as a local Maven repo,
  so that the Maven package can link to it
- run-example now adds the original Spark classpath as well because the
  Maven examples assembly lists spark-core and such as provided
- The various Maven projects add a spark-yarn dependency correctly
2013-08-29 21:19:06 -07:00
Matei Zaharia 8d81358a05 Provide more memory for tests 2013-08-29 21:19:06 -07:00
Matei Zaharia 53cd50c069 Change build and run instructions to use assemblies
This commit makes Spark invocation saner by using an assembly JAR to
find all of Spark's dependencies instead of adding all the JARs in
lib_managed. It also packages the examples into an assembly and uses
that as SPARK_EXAMPLES_JAR. Finally, it replaces the old "run" script
with two better-named scripts: "run-examples" for examples, and
"spark-class" for Spark internal classes (e.g. REPL, master, etc). This
is also designed to minimize the confusion people have in trying to use
"run" to run their own classes; it's not meant to do that, but now at
least if they look at it, they can modify run-examples to do a decent
job for them.

As part of this, Bagel's examples are also now properly moved to the
examples package instead of bagel.
2013-08-29 21:19:04 -07:00
Reynold Xin 9db1e50344 Revert "Merge pull request #841 from rxin/json"
This reverts commit 1fb1b09928, reversing
changes made to c69c48947d.
2013-08-26 11:05:14 -07:00
Jey Kottalam b7f9e6374a Fix SBT generation of IDE project files 2013-08-23 10:26:37 -07:00
Jey Kottalam 281b6c5f28 Re-add removed dependency on 'commons-daemon'
Fixes SBT build under Hadoop 0.23.9 and 2.0.4
2013-08-22 15:45:45 -07:00
Matei Zaharia ae8ba83ef2 Merge pull request #855 from jey/update-build-docs
Update build docs
2013-08-22 10:14:54 -07:00
Matei Zaharia 8a36fd09dd Merge pull request #854 from markhamstra/pomUpdate
Synced sbt and maven builds to use the same dependencies, etc.
2013-08-22 10:13:35 -07:00
Jey Kottalam f9cc1fbf27 Remove references to unsupported Hadoop versions 2013-08-21 17:14:36 -07:00
Mark Hamstra ff6f1b0500 Synced sbt and maven builds 2013-08-21 13:50:24 -07:00
Reynold Xin af602ba9d3 Downgraded default build hadoop version to 1.0.4. 2013-08-21 11:38:24 -07:00
Matei Zaharia aa2b89d98d Merge remote-tracking branch 'jey/hadoop-agnostic'
Conflicts:
	core/src/main/scala/spark/PairRDDFunctions.scala
2013-08-20 10:14:15 -07:00
Jey Kottalam 6f6944c807 Update SBT build to use simpler fix for Hadoop 0.23.9 2013-08-19 12:33:13 -07:00
Jey Kottalam 67b593607c Rename YARN build flag to SPARK_WITH_YARN 2013-08-16 14:00:05 -07:00
Jey Kottalam b1d99744a8 Fix SBT build under Hadoop 0.23.x 2013-08-16 13:50:12 -07:00
Jey Kottalam 8add2d7a59 Fix repl/assembly when YARN enabled 2013-08-16 13:50:12 -07:00
Jey Kottalam 3f98eff63a Allow make-distribution.sh to specify Hadoop version used 2013-08-16 13:50:09 -07:00
Reynold Xin c961c19b7b Use the JSON formatter from Scala library and removed dependency on lift-json.
It made the JSON creation slightly more complicated, but reduces one external dependency. The scala library also properly escape "/" (which lift-json doesn't).
2013-08-15 18:23:01 -07:00
Jey Kottalam a0f0848463 Update default version of Hadoop to 1.2.1 2013-08-15 16:50:37 -07:00
Jey Kottalam cb4ef19214 yarn support 2013-08-15 16:50:37 -07:00
Jey Kottalam 273b499b9a yarn sbt 2013-08-15 16:50:37 -07:00
Jey Kottalam 69c3bbf688 dynamically detect hadoop version 2013-08-15 16:50:37 -07:00
Matei Zaharia d9588183fa Update to Mesos 0.12.1 2013-08-13 18:51:35 -07:00
jerryshao 320e87e7ab Add MetricsServlet for Spark metrics system 2013-08-12 13:23:23 +08:00
Matei Zaharia dce5e47435 Merge pull request #800 from dlyubimov/HBASE_VERSION
Pull HBASE_VERSION in the head of sbt build
2013-08-09 21:53:45 -07:00
Matei Zaharia cd247ba5bb Merge pull request #786 from shivaram/mllib-java
Java fixes, tests and examples for ALS, KMeans
2013-08-09 20:41:13 -07:00
Dmitriy Lyubimov 27f674f82b fewer words 2013-08-09 13:54:41 -07:00
Dmitriy Lyubimov ae95b57469 Pull HBASE_VERSION in the head of sbt build 2013-08-09 12:45:18 -07:00
Matei Zaharia 5a4003c1ac Update to Chill 0.3.1 2013-08-08 13:30:27 -07:00
Shivaram Venkataraman 471fbadd0c Java examples, tests for KMeans and ALS
- Changes ALS to accept RDD[Rating] instead of (Int, Int, Double) making it
  easier to call from Java
- Renames class methods from `train` to `run` to enable static methods to be
  called from Java.
- Add unit tests which check if both static / class methods can be called.
- Also add examples which port the main() function in ALS, KMeans to the
  examples project.

Couple of minor changes to existing code:
- Add a toJavaRDD method in RDD to convert scala RDD to java RDD easily
- Workaround a bug where using double[] from Java leads to class cast exception in
  KMeans init
2013-08-06 15:43:46 -07:00
Joseph E. Gonzalez 499a0d8383 Merged graphx from @rxin into master 2013-08-06 12:28:29 -07:00