Commit graph

8846 commits

Author SHA1 Message Date
Andrew Or 6e3c5a296c [Test] Better exception message from SparkSubmitSuite
Before:
```
Exception in thread "main" java.lang.Exception: Could not load user defined classes inside of executors
	at org.apache.spark.deploy.JarCreationTest$.main(SparkSubmitSuite.scala:471)
	at org.apache.spark.deploy.JarCreationTest.main(SparkSubmitSuite.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
```
After:
```
Exception in thread "main" java.lang.Exception: Could not load user class from jar:
java.lang.UnsupportedClassVersionError: SparkSubmitClassA : Unsupported major.minor version 51.0
	java.lang.ClassLoader.defineClass1(Native Method)
	java.lang.ClassLoader.defineClass(ClassLoader.java:643)
	...
	at org.apache.spark.deploy.JarCreationTest$.main(SparkSubmitSuite.scala:472)
	at org.apache.spark.deploy.JarCreationTest.main(SparkSubmitSuite.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
```

Author: Andrew Or <andrew@databricks.com>

Closes #3212 from andrewor14/submit-suite-message and squashes the following commits:

7779248 [Andrew Or] Format exception
8fe6719 [Andrew Or] Better exception message from failed test
2014-11-12 13:35:48 -08:00
Soumitra Kumar 36ddeb7bf8 [SPARK-3660][STREAMING] Initial RDD for updateStateByKey transformation
SPARK-3660 : Initial RDD for updateStateByKey transformation

I have added a sample StatefulNetworkWordCountWithInitial inspired by StatefulNetworkWordCount.

Please let me know if any changes are required.

Author: Soumitra Kumar <kumar.soumitra@gmail.com>

Closes #2665 from soumitrak/master and squashes the following commits:

ee8980b [Soumitra Kumar] Fixed copy/paste issue.
304f636 [Soumitra Kumar] Added simpler version of updateStateByKey API with initialRDD and test.
9781135 [Soumitra Kumar] Fixed test, and renamed variable.
3da51a2 [Soumitra Kumar] Adding updateStateByKey with initialRDD API to JavaPairDStream.
2f78f7e [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
d4fdd18 [Soumitra Kumar] Renamed variable and moved method.
d0ce2cd [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
31399a4 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
4efa58b [Soumitra Kumar] [SPARK-3660][STREAMING] Initial RDD for updateStateByKey transformation
8f40ca0 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
dde4271 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
fdd7db3 [Soumitra Kumar] Adding support of initial value for state update. SPARK-3660 : Initial RDD for updateStateByKey transformation
2014-11-12 12:25:31 -08:00
Xiangrui Meng 4b736dbab3 [SPARK-3530][MLLIB] pipeline and parameters with examples
This PR adds package "org.apache.spark.ml" with pipeline and parameters, as discussed on the JIRA. This is a joint work of jkbradley etrain shivaram and many others who helped on the design, also with help from  marmbrus and liancheng on the Spark SQL side. The design doc can be found at:

https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing

**org.apache.spark.ml**

This is a new package with new set of ML APIs that address practical machine learning pipelines. (Sorry for taking so long!) It will be an alpha component, so this is definitely not something set in stone. The new set of APIs, inspired by the MLI project from AMPLab and scikit-learn, takes leverage on Spark SQL's schema support and execution plan optimization. It introduces the following components that help build a practical pipeline:

1. Transformer, which transforms a dataset into another
2. Estimator, which fits models to data, where models are transformers
3. Evaluator, which evaluates model output and returns a scalar metric
4. Pipeline, a simple pipeline that consists of transformers and estimators

Parameters could be supplied at fit/transform or embedded with components.

1. Param: a strong-typed parameter key with self-contained doc
2. ParamMap: a param -> value map
3. Params: trait for components with parameters

For any component that implements `Params`, user can easily check the doc by calling `explainParams`:

~~~
> val lr = new LogisticRegression
> lr.explainParams
maxIter: max number of iterations (default: 100)
regParam: regularization constant (default: 0.1)
labelCol: label column name (default: label)
featuresCol: features column name (default: features)
~~~

or user can check individual param:

~~~
> lr.maxIter
maxIter: max number of iterations (default: 100)
~~~

**Please start with the example code in test suites and under `org.apache.spark.examples.ml`, where I put several examples:**

1. run a simple logistic regression job

~~~
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(1.0)
    val model = lr.fit(dataset)
    model.transform(dataset, model.threshold -> 0.8) // overwrite threshold
      .select('label, 'score, 'prediction).collect()
      .foreach(println)
~~~

2. run logistic regression with cross-validation and grid search using areaUnderROC (default) as the metric

~~~
    val lr = new LogisticRegression
    val lrParamMaps = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.1, 100.0))
      .addGrid(lr.maxIter, Array(0, 5))
      .build()
    val eval = new BinaryClassificationEvaluator
    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEstimatorParamMaps(lrParamMaps)
      .setEvaluator(eval)
      .setNumFolds(3)
    val bestModel = cv.fit(dataset)
~~~

3. run a pipeline that consists of a standard scaler and a logistic regression component

~~~
    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
    val lr = new LogisticRegression()
      .setFeaturesCol(scaler.getOutputCol)
    val pipeline = new Pipeline()
      .setStages(Array(scaler, lr))
    val model = pipeline.fit(dataset)
    val predictions = model.transform(dataset)
      .select('label, 'score, 'prediction)
      .collect()
      .foreach(println)
~~~

4. a simple text classification pipeline, which recognizes "spark":

~~~
    val training = sparkContext.parallelize(Seq(
      LabeledDocument(0L, "a b c d e spark", 1.0),
      LabeledDocument(1L, "b d", 0.0),
      LabeledDocument(2L, "spark f g h", 1.0),
      LabeledDocument(3L, "hadoop mapreduce", 0.0)))
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)
    val test = sparkContext.parallelize(Seq(
      Document(4L, "spark i j k"),
      Document(5L, "l m"),
      Document(6L, "mapreduce spark"),
      Document(7L, "apache hadoop")))
    model.transform(test)
      .select('id, 'text, 'prediction, 'score)
      .collect()
      .foreach(println)
~~~

Java examples are very similar. I put example code that creates a simple text classification pipeline in Scala and Java, where a simple tokenizer is defined as a transformer outside `org.apache.spark.ml`.

**What are missing now and will be added soon:**

1. ~~Runtime check of schemas. So before we touch the data, we will go through the schema and make sure column names and types match the input parameters.~~
2. ~~Java examples.~~
3. ~~Store training parameters in trained models.~~
4. (later) Serialization and Python API.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3099 from mengxr/SPARK-3530 and squashes the following commits:

2cc93fd [Xiangrui Meng] hide APIs as much as I can
34319ba [Xiangrui Meng] use local instead local[2] for unit tests
2524251 [Xiangrui Meng] rename PipelineStage.transform to transformSchema
c9daab4 [Xiangrui Meng] remove mockito version
1397ab5 [Xiangrui Meng] use sqlContext from LocalSparkContext instead of TestSQLContext
6ffc389 [Xiangrui Meng] try to fix unit test
a59d8b7 [Xiangrui Meng] doc updates
977fd9d [Xiangrui Meng] add scala ml package object
6d97fe6 [Xiangrui Meng] add AlphaComponent annotation
731f0e4 [Xiangrui Meng] update package doc
0435076 [Xiangrui Meng] remove ;this from setters
fa21d9b [Xiangrui Meng] update extends indentation
f1091b3 [Xiangrui Meng] typo
228a9f4 [Xiangrui Meng] do not persist before calling binary classification metrics
f51cd27 [Xiangrui Meng] rename default to defaultValue
b3be094 [Xiangrui Meng] refactor schema transform in lr
8791e8e [Xiangrui Meng] rename copyValues to inheritValues and make it do the right thing
51f1c06 [Xiangrui Meng] remove leftover code in Transformer
494b632 [Xiangrui Meng] compure score once
ad678e9 [Xiangrui Meng] more doc for Transformer
4306ed4 [Xiangrui Meng] org imports in text pipeline
6e7c1c7 [Xiangrui Meng] update pipeline
4f9e34f [Xiangrui Meng] more doc for pipeline
aa5dbd4 [Xiangrui Meng] fix typo
11be383 [Xiangrui Meng] fix unit tests
3df7952 [Xiangrui Meng] clean up
986593e [Xiangrui Meng] re-org java test suites
2b11211 [Xiangrui Meng] remove external data deps
9fd4933 [Xiangrui Meng] add unit test for pipeline
2a0df46 [Xiangrui Meng] update tests
2d52e4d [Xiangrui Meng] add @AlphaComponent to package-info
27582a4 [Xiangrui Meng] doc changes
73a000b [Xiangrui Meng] add schema transformation layer
6736e87 [Xiangrui Meng] more doc / remove HasMetricName trait
80a8b5e [Xiangrui Meng] rename SimpleTransformer to UnaryTransformer
62ca2bb [Xiangrui Meng] check param parent in set/get
1622349 [Xiangrui Meng] add getModel to PipelineModel
a0e0054 [Xiangrui Meng] update StandardScaler to use SimpleTransformer
d0faa04 [Xiangrui Meng] remove implicit mapping from ParamMap
c7f6921 [Xiangrui Meng] move ParamGridBuilder test to ParamGridBuilderSuite
e246f29 [Xiangrui Meng] re-org:
7772430 [Xiangrui Meng] remove modelParams add a simple text classification pipeline
b95c408 [Xiangrui Meng] remove implicits add unit tests to params
bab3e5b [Xiangrui Meng] update params
fe0ee92 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530
6e86d98 [Xiangrui Meng] some code clean-up
2d040b3 [Xiangrui Meng] implement setters inside each class, add Params.copyValues [ci skip]
fd751fc [Xiangrui Meng] add java-friendly versions of fit and tranform
3f810cd [Xiangrui Meng] use multi-model training api in cv
5b8f413 [Xiangrui Meng] rename model to modelParams
9d2d35d [Xiangrui Meng] test varargs and chain model params
f46e927 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530
1ef26e0 [Xiangrui Meng] specialize methods/types for Java
df293ed [Xiangrui Meng] switch to setter/getter
376db0a [Xiangrui Meng] pipeline and parameters
2014-11-12 10:38:57 -08:00
Xiangrui Meng 84324fbcb9 [SPARK-4355][MLLIB] fix OnlineSummarizer.merge when other.mean is zero
See inline comment about the bug. I also did some code clean-up. dbtsai I moved `update` to a private method of `MultivariateOnlineSummarizer`. I don't think it will cause performance regression, but it would be great if you have some time to test.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3220 from mengxr/SPARK-4355 and squashes the following commits:

5ef601f [Xiangrui Meng] fix OnlineSummarizer.merge when other.mean is zero and some code clean-up
2014-11-12 01:50:11 -08:00
Ankur Dave faeb41de21 [SPARK-3936] Add aggregateMessages, which supersedes mapReduceTriplets
aggregateMessages enables neighborhood computation similarly to mapReduceTriplets, but it introduces two API improvements:

1. Messages are sent using an imperative interface based on EdgeContext rather than by returning an iterator of messages.

2. Rather than attempting bytecode inspection, the required triplet fields must be explicitly specified by the user by passing a TripletFields object. This fixes SPARK-3936.

Additionally, this PR includes the following optimizations for aggregateMessages and EdgePartition:

1. EdgePartition now stores local vertex ids instead of global ids. This avoids hash lookups when looking up vertex attributes and aggregating messages.

2. Internal iterators in aggregateMessages are inlined into a while loop.

In total, these optimizations were tested to provide a 37% speedup on PageRank (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge machines, sped up from 513 s to 322 s).

Subsumes apache/spark#2815. Also fixes SPARK-4173.

Author: Ankur Dave <ankurdave@gmail.com>

Closes #3100 from ankurdave/aggregateMessages and squashes the following commits:

f5b65d0 [Ankur Dave] Address @rxin comments on apache/spark#3054 and apache/spark#3100
1e80aca [Ankur Dave] Add aggregateMessages, which supersedes mapReduceTriplets
194a2df [Ankur Dave] Test triplet iterator in EdgePartition serialization test
e0f8ecc [Ankur Dave] Take activeSet in ExistingEdgePartitionBuilder
c85076d [Ankur Dave] Readability improvements
b567be2 [Ankur Dave] iter.foreach -> while loop
4a566dc [Ankur Dave] Optimizations for mapReduceTriplets and EdgePartition
2014-11-11 23:38:27 -08:00
Manish Amde 2ef016b130 [MLLIB] SPARK-4347: Reducing GradientBoostingSuite run time.
Before:
[info] GradientBoostingSuite:
[info] - Regression with continuous features: SquaredError (22 seconds, 115 milliseconds)
[info] - Regression with continuous features: Absolute Error (19 seconds, 330 milliseconds)
[info] - Binary classification with continuous features: Log Loss (19 seconds, 17 milliseconds)

After:
[info] - Regression with continuous features: SquaredError (7 seconds, 69 milliseconds)
[info] - Regression with continuous features: Absolute Error (4 seconds, 617 milliseconds)
[info] - Binary classification with continuous features: Log Loss (4 seconds, 658 milliseconds)

cc: mengxr, jkbradley

Author: Manish Amde <manish9ue@gmail.com>

Closes #3214 from manishamde/gbt_test_speedup and squashes the following commits:

8994552 [Manish Amde] reducing gbt test run times
2014-11-11 22:47:53 -08:00
Prashant Sharma daaca14c16 Support cross building for Scala 2.11
Let's give this another go using a version of Hive that shades its JLine dependency.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #3159 from pwendell/scala-2.11-prashant and squashes the following commits:

e93aa3e [Patrick Wendell] Restoring -Phive-thriftserver profile and cleaning up build script.
f65d17d [Patrick Wendell] Fixing build issue due to merge conflict
a8c41eb [Patrick Wendell] Reverting dev/run-tests back to master state.
7a6eb18 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into scala-2.11-prashant
583aa07 [Prashant Sharma] REVERT ME: removed hive thirftserver
3680e58 [Prashant Sharma] Revert "REVERT ME: Temporarily removing some Cli tests."
935fb47 [Prashant Sharma] Revert "Fixed by disabling a few tests temporarily."
925e90f [Prashant Sharma] Fixed by disabling a few tests temporarily.
2fffed3 [Prashant Sharma] Exclude groovy from sbt build, and also provide a way for such instances in future.
8bd4e40 [Prashant Sharma] Switched to gmaven plus, it fixes random failures observer with its predecessor gmaven.
5272ce5 [Prashant Sharma] SPARK_SCALA_VERSION related bugs.
2121071 [Patrick Wendell] Migrating version detection to PySpark
b1ed44d [Patrick Wendell] REVERT ME: Temporarily removing some Cli tests.
1743a73 [Patrick Wendell] Removing decimal test that doesn't work with Scala 2.11
f5cad4e [Patrick Wendell] Add Scala 2.11 docs
210d7e1 [Patrick Wendell] Revert "Testing new Hive version with shaded jline"
48518ce [Patrick Wendell] Remove association of Hive and Thriftserver profiles.
e9d0a06 [Patrick Wendell] Revert "Enable thritfserver for Scala 2.10 only"
67ec364 [Patrick Wendell] Guard building of thriftserver around Scala 2.10 check
8502c23 [Patrick Wendell] Enable thritfserver for Scala 2.10 only
e22b104 [Patrick Wendell] Small fix in pom file
ec402ab [Patrick Wendell] Various fixes
0be5a9d [Patrick Wendell] Testing new Hive version with shaded jline
4eaec65 [Prashant Sharma] Changed scripts to ignore target.
5167bea [Prashant Sharma] small correction
a4fcac6 [Prashant Sharma] Run against scala 2.11 on jenkins.
80285f4 [Prashant Sharma] MAven equivalent of setting spark.executor.extraClasspath during tests.
034b369 [Prashant Sharma] Setting test jars on executor classpath during tests from sbt.
d4874cb [Prashant Sharma] Fixed Python Runner suite. null check should be first case in scala 2.11.
6f50f13 [Prashant Sharma] Fixed build after rebasing with master. We should use ${scala.binary.version} instead of just 2.10
e56ca9d [Prashant Sharma] Print an error if build for 2.10 and 2.11 is spotted.
937c0b8 [Prashant Sharma] SCALA_VERSION -> SPARK_SCALA_VERSION
cb059b0 [Prashant Sharma] Code review
0476e5e [Prashant Sharma] Scala 2.11 support with repl and all build changes.
2014-11-11 21:36:48 -08:00
Andrew Or 2ddb1415e2 [Release] Log build output for each distribution 2014-11-11 18:02:59 -08:00
Timothy Chen a878660d2d SPARK-2269 Refactor mesos scheduler resourceOffers and add unit test
Author: Timothy Chen <tnachen@gmail.com>

Closes #1487 from tnachen/resource_offer_refactor and squashes the following commits:

4ea5dec [Timothy Chen] Rebase from master and address comments
9ccab09 [Timothy Chen] Address review comments
e6494dc [Timothy Chen] Refactor class loading
8207428 [Timothy Chen] Refactor mesos scheduler resourceOffers and add unit test
2014-11-11 14:29:18 -08:00
Kousuke Saruta 7f3718842c [SPARK-4282][YARN] Stopping flag in YarnClientSchedulerBackend should be volatile
In YarnClientSchedulerBackend, a variable "stopping" is used as a flag and it's accessed by some threads so it should be volatile.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3143 from sarutak/stopping-flag-volatile and squashes the following commits:

58fdcc9 [Kousuke Saruta] Marked stoppig flag as volatile
2014-11-11 12:33:53 -06:00
Sean Owen f820b563d8 SPARK-4305 [BUILD] yarn-alpha profile won't build due to network/yarn module
SPARK-3797 introduced the `network/yarn` module, but its YARN code depends on YARN APIs not present in older versions covered by the `yarn-alpha` profile. As a result builds like `mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package` fail.

The solution is just to not build `network/yarn` with profile `yarn-alpha`.

Author: Sean Owen <sowen@cloudera.com>

Closes #3167 from srowen/SPARK-4305 and squashes the following commits:

88938cb [Sean Owen] Don't build network/yarn in yarn-alpha profile as it won't compile
2014-11-11 12:30:35 -06:00
Prashant Sharma deefd9d737 SPARK-1830 Deploy failover, Make Persistence engine and LeaderAgent Pluggable
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #771 from ScrapCodes/deploy-failover-pluggable and squashes the following commits:

29ba440 [Prashant Sharma] fixed a compilation error
fef35ec [Prashant Sharma] Code review
57ee6f0 [Prashant Sharma] SPARK-1830 Deploy failover, Make Persistence engine and LeaderAgent Pluggable.
2014-11-11 09:29:48 -08:00
huangzhaowei 6e03de304e [Streaming][Minor]Replace some 'if-else' in Clock
Replace some 'if-else' statement by math.min and math.max in Clock.scala

Author: huangzhaowei <carlmartinmax@gmail.com>

Closes #3088 from SaintBacchus/StreamingClock and squashes the following commits:

7b7f8e7 [huangzhaowei] [Streaming][Minor]Replace some 'if-else' in Clock
2014-11-11 03:02:12 -08:00
jerryshao c8850a3d6d [SPARK-2492][Streaming] kafkaReceiver minor changes to align with Kafka 0.8
Update the KafkaReceiver's behavior when auto.offset.reset is set.

In Kafka 0.8, `auto.offset.reset` is a hint for out-range offset to seek to the beginning or end of the partition. While in the previous code `auto.offset.reset` is a enforcement to seek to the beginning or end immediately, this is different from Kafka 0.8 defined behavior.

Also deleting extesting ZK metadata in Receiver when multiple consumers are launched will introduce issue as mentioned in [SPARK-2383](https://issues.apache.org/jira/browse/SPARK-2383).

So Here we change to offer user to API to explicitly reset offset before create Kafka stream, while in the meantime keep the same behavior as Kafka 0.8 for parameter `auto.offset.reset`.

@tdas, would you please review this PR? Thanks a lot.

Author: jerryshao <saisai.shao@intel.com>

Closes #1420 from jerryshao/kafka-fix and squashes the following commits:

d6ae94d [jerryshao] Address the comment to remove the resetOffset() function
de3a4c8 [jerryshao] Fix compile error
4a1c3f9 [jerryshao] Doc changes
b2c1430 [jerryshao] Move offset reset to a helper function to let user explicitly delete ZK metadata by calling this API
fac8fd6 [jerryshao] Changes to align with Kafka 0.8
2014-11-11 02:22:23 -08:00
maji2014 f8811a5695 [SPARK-4295][External]Fix exception in SparkSinkSuite
Handle exception in SparkSinkSuite, please refer to [SPARK-4295]

Author: maji2014 <maji3@asiainfo.com>

Closes #3177 from maji2014/spark-4295 and squashes the following commits:

312620a [maji2014] change a new statement for spark-4295
24c3d21 [maji2014] add log4j.properties for SparkSinkSuite and spark-4295
c807bf6 [maji2014] Fix exception in SparkSinkSuite
2014-11-11 02:18:27 -08:00
Reynold Xin ef29a9a9aa [SPARK-4307] Initialize FileDescriptor lazily in FileRegion.
Netty's DefaultFileRegion requires a FileDescriptor in its constructor, which means we need to have a opened file handle. In super large workloads, this could lead to too many open files due to the way these file descriptors are cleaned. This pull request creates a new LazyFileRegion that initializes the FileDescriptor when we are sending data for the first time.

Author: Reynold Xin <rxin@databricks.com>
Author: Reynold Xin <rxin@apache.org>

Closes #3172 from rxin/lazyFD and squashes the following commits:

0bdcdc6 [Reynold Xin] Added reference to Netty's DefaultFileRegion
d4564ae [Reynold Xin] Added SparkConf to the ctor argument of IndexShuffleBlockManager.
6ed369e [Reynold Xin] Code review feedback.
04cddc8 [Reynold Xin] [SPARK-4307] Initialize FileDescriptor lazily in FileRegion.
2014-11-11 00:25:31 -08:00
Davies Liu 65083e93dd [SPARK-4324] [PySpark] [MLlib] support numpy.array for all MLlib API
This PR check all of the existing Python MLlib API to make sure that numpy.array is supported as Vector (also RDD of numpy.array).

It also improve some docstring and doctest.

cc mateiz mengxr

Author: Davies Liu <davies@databricks.com>

Closes #3189 from davies/numpy and squashes the following commits:

d5057c4 [Davies Liu] fix tests
6987611 [Davies Liu] support numpy.array for all MLlib API
2014-11-10 22:26:16 -08:00
Kousuke Saruta 3c07b8f082 [SPARK-4330][Doc] Link to proper URL for YARN overview
In running-on-yarn.md, a link to YARN overview is here.
But the URL is to YARN alpha's.
It should be stable's.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3196 from sarutak/SPARK-4330 and squashes the following commits:

30baa21 [Kousuke Saruta] Fixed running-on-yarn.md to point proper URL for YARN
2014-11-10 22:18:00 -08:00
Ankur Dave 300887bd76 [SPARK-3649] Remove GraphX custom serializers
As [reported][1] on the mailing list, GraphX throws

```
java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2
        at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195)
        at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329)
```

when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle [violates this assumption][2].

GraphX uses the custom serializers to compress vertex ID keys using variable-length integer encoding. However, since the serializer can no longer rely on the key and value being serialized and deserialized together, performing such encoding would either require writing a tag byte (costly) or maintaining state in the serializer and assuming that serialization calls will alternate between key and value (fragile).

Instead, this PR simply removes the custom serializers. This causes a **10% slowdown** (494 s to 543 s) and **16% increase in per-iteration communication** (2176 MB to 2518 MB) for PageRank (averages across 3 trials, 10 iterations per trial, uk-2007-05 graph, 16 r3.2xlarge nodes).

[1]: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501
[2]: f9d6220c79/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala (L329)

Author: Ankur Dave <ankurdave@gmail.com>

Closes #2503 from ankurdave/SPARK-3649 and squashes the following commits:

a49c2ad [Ankur Dave] [SPARK-3649] Remove GraphX custom serializers
2014-11-10 19:31:52 -08:00
Cheng Hao c764d0ac1c [SPARK-4274] [SQL] Fix NPE in printing the details of the query plan
Author: Cheng Hao <hao.cheng@intel.com>

Closes #3139 from chenghao-intel/comparison_test and squashes the following commits:

f5d7146 [Cheng Hao] avoid exception in printing the codegen enabled
2014-11-10 17:46:05 -08:00
surq ce6ed2abd1 [SPARK-3954][Streaming] Optimization to FileInputDStream
about convert files to RDDS there are 3 loops with files sequence in spark source.
loops files sequence:
1.files.map(...)
2.files.zip(fileRDDs)
3.files-size.foreach
It's will very time consuming when lots of files.So I do the following correction:
3 loops with files sequence => only one loop

Author: surq <surq@asiainfo.com>

Closes #2811 from surq/SPARK-3954 and squashes the following commits:

321bbe8 [surq]  updated the code style.The style from [for...yield]to [files.map(file=>{})]
88a2c20 [surq] Merge branch 'master' of https://github.com/apache/spark into SPARK-3954
178066f [surq] modify code's style. [Exceeds 100 columns]
626ef97 [surq] remove redundant import(ArrayBuffer)
739341f [surq] promote the speed of convert files to RDDS
2014-11-10 17:37:16 -08:00
Daoyuan Wang a1fc059b69 [SPARK-4149][SQL] ISO 8601 support for json date time strings
This implement the feature davies mentioned in https://github.com/apache/spark/pull/2901#discussion-diff-19313312

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3012 from adrian-wang/iso8601 and squashes the following commits:

50df6e7 [Daoyuan Wang] json data timestamp ISO8601 support
2014-11-10 17:26:03 -08:00
Cheng Hao fa777833b5 [SPARK-4250] [SQL] Fix bug of constant null value mapping to ConstantObjectInspector
Author: Cheng Hao <hao.cheng@intel.com>

Closes #3114 from chenghao-intel/constant_null_oi and squashes the following commits:

e603bda [Cheng Hao] fix the bug of null value for primitive types
50a13ba [Cheng Hao] fix the timezone issue
f54f369 [Cheng Hao] fix bug of constant null value for ObjectInspector
2014-11-10 17:22:57 -08:00
Xiangrui Meng d793d80c80 [SQL] remove a decimal case branch that has no effect at runtime
it generates warnings at compile time marmbrus

Author: Xiangrui Meng <meng@databricks.com>

Closes #3192 from mengxr/dtc-decimal and squashes the following commits:

955e9fb [Xiangrui Meng] remove a decimal case branch that has no effect
2014-11-10 17:20:52 -08:00
Cheng Lian acb55aeddb [SPARK-4308][SQL] Sets SQL operation state to ERROR when exception is thrown
In `HiveThriftServer2`, when an exception is thrown during a SQL execution, the SQL operation state should be set to `ERROR`, but now it remains `RUNNING`. This affects the result of the `GetOperationStatus` Thrift API.

Author: Cheng Lian <lian@databricks.com>

Closes #3175 from liancheng/fix-op-state and squashes the following commits:

6d4c1fe [Cheng Lian] Sets SQL operation state to ERROR when exception is thrown
2014-11-10 16:56:36 -08:00
Cheng Lian 534b231417 [SPARK-4000][Build] Uploads HiveCompatibilitySuite logs
This is a follow up of #2845. In addition to unit-tests.log files, also upload failure output files generated by `HiveCompatibilitySuite` to Jenkins master. These files can be very helpful to debug Hive compatibility test failures.

/cc pwendell marmbrus

Author: Cheng Lian <lian@databricks.com>

Closes #2993 from liancheng/upload-hive-compat-logs and squashes the following commits:

8e6247f [Cheng Lian] Uploads HiveCompatibilitySuite logs
2014-11-10 16:17:52 -08:00
Takuya UESHIN dbf10588de [SPARK-4319][SQL] Enable an ignored test "null count".
Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3185 from ueshin/issues/SPARK-4319 and squashes the following commits:

a44a38e [Takuya UESHIN] Enable an ignored test "null count".
2014-11-10 15:55:15 -08:00
Patrick Wendell 6e7a309b81 Revert "[SPARK-2703][Core]Make Tachyon related unit tests execute without deploying a Tachyon system locally."
This reverts commit bd86cb1738.
2014-11-10 14:56:06 -08:00
Varadharajan Mukundan 974d334cf0 [SPARK-4047] - Generate runtime warnings for example implementation of PageRank
Based on SPARK-2434, this PR generates runtime warnings for example implementations (Python, Scala) of PageRank.

Author: Varadharajan Mukundan <srinathsmn@gmail.com>

Closes #2894 from varadharajan/SPARK-4047 and squashes the following commits:

5f9406b [Varadharajan Mukundan] [SPARK-4047] - Point users to LogisticRegressionWithSGD and LogisticRegressionWithLBFGS instead of LogisticRegressionModel
252f595 [Varadharajan Mukundan] a. Generate runtime warnings for
05a018b [Varadharajan Mukundan] Fix PageRank implementation's package reference
5c2bf54 [Varadharajan Mukundan] [SPARK-4047] - Generate runtime warnings for example implementation of PageRank
2014-11-10 14:32:29 -08:00
tedyu b32734e12d SPARK-1297 Upgrade HBase dependency to 0.98
pwendell rxin
Please take a look

Author: tedyu <yuzhihong@gmail.com>

Closes #3115 from tedyu/master and squashes the following commits:

2b079c8 [tedyu] SPARK-1297 Upgrade HBase dependency to 0.98
2014-11-10 13:23:33 -08:00
Sandy Ryza c6f4e70421 SPARK-4230. Doc for spark.default.parallelism is incorrect
Author: Sandy Ryza <sandy@cloudera.com>

Closes #3107 from sryza/sandy-spark-4230 and squashes the following commits:

37a1d19 [Sandy Ryza] Clear up a couple things
34d53de [Sandy Ryza] SPARK-4230. Doc for spark.default.parallelism is incorrect
2014-11-10 12:40:41 -08:00
Jey Kottalam c5db8e2c07 [SPARK-4312] bash doesn't have "die"
sbt-launch-lib.bash includes `die` command but it's not valid command for Linux, MacOS X or Windows.

Closes #2898

Author: Jey Kottalam <jey@kottalam.net>

Closes #3182 from sarutak/SPARK-4312 and squashes the following commits:

24c6677 [Jey Kottalam] bash doesn't have "die"
2014-11-10 12:37:56 -08:00
comcmipi 0340c56a92 Update RecoverableNetworkWordCount.scala
Trying this example, I missed the moment when the checkpoint was iniciated

Author: comcmipi <pitonak@fns.uniba.sk>

Closes #2735 from comcmipi/patch-1 and squashes the following commits:

b6d8001 [comcmipi] Update RecoverableNetworkWordCount.scala
96fe274 [comcmipi] Update RecoverableNetworkWordCount.scala
2014-11-10 12:33:48 -08:00
Sean Owen 3a02d416cd SPARK-2548 [STREAMING] JavaRecoverableWordCount is missing
Here's my attempt to re-port `RecoverableNetworkWordCount` to Java, following the example of its Scala and Java siblings. I fixed a few minor doc/formatting issues along the way I believe.

Author: Sean Owen <sowen@cloudera.com>

Closes #2564 from srowen/SPARK-2548 and squashes the following commits:

0d0bf29 [Sean Owen] Update checkpoint call as in https://github.com/apache/spark/pull/2735
35f23e3 [Sean Owen] Remove old comment about running in standalone mode
179b3c2 [Sean Owen] Re-port RecoverableNetworkWordCount to Java example, and touch up doc / formatting in related examples
2014-11-10 11:47:27 -08:00
Niklas Wilcke ed8bf1eac5 [SPARK-4169] [Core] Accommodate non-English Locales in unit tests
For me the core tests failed because there are two locale dependent parts in the code.
Look at the Jira ticket for details.

Why is it necessary to check the exception message in isBindCollision in
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1686
?

Author: Niklas Wilcke <1wilcke@informatik.uni-hamburg.de>

Closes #3036 from numbnut/core-test-fix and squashes the following commits:

1fb0d04 [Niklas Wilcke] Fixing locale dependend code and tests
2014-11-10 11:37:38 -08:00
Xiangrui Meng 894a7245c3 [SQL] support udt to hive types conversion (hive->udt is not supported)
marmbrus

Author: Xiangrui Meng <meng@databricks.com>

Closes #3164 from mengxr/hive-udt and squashes the following commits:

57c7519 [Xiangrui Meng] support udt->hive types (hive->udt is not supported)
2014-11-10 11:04:12 -08:00
RongGu bd86cb1738 [SPARK-2703][Core]Make Tachyon related unit tests execute without deploying a Tachyon system locally.
Make Tachyon related unit tests execute without deploying a Tachyon system locally.

Author: RongGu <gurongwalker@gmail.com>

Closes #3030 from RongGu/SPARK-2703 and squashes the following commits:

ad08827 [RongGu] Make Tachyon related unit tests execute without deploying a Tachyon system locally
2014-11-09 23:48:15 -08:00
Patrick Wendell 227488d8cd MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on Github:

Closes #2898 (close requested by 'pwendell')
Closes #2212 (close requested by 'pwendell')
Closes #2102 (close requested by 'pwendell')
2014-11-09 23:07:14 -08:00
Sandy Ryza 3c2cff4b94 SPARK-3179. Add task OutputMetrics.
Author: Sandy Ryza <sandy@cloudera.com>

This patch had conflicts when merged, resolved by
Committer: Kay Ousterhout <kayousterhout@gmail.com>

Closes #2968 from sryza/sandy-spark-3179 and squashes the following commits:

dce4784 [Sandy Ryza] More review feedback
8d350d1 [Sandy Ryza] Fix test against Hadoop 2.5+
e7c74d0 [Sandy Ryza] More review feedback
6cff9c4 [Sandy Ryza] Review feedback
fb2dde0 [Sandy Ryza] SPARK-3179
2014-11-09 22:29:03 -08:00
Sean Owen f8e5732307 SPARK-1209 [CORE] (Take 2) SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop
andrewor14 Another try at SPARK-1209, to address https://github.com/apache/spark/pull/2814#issuecomment-61197619

I successfully tested with `mvn -Dhadoop.version=1.0.4 -DskipTests clean package; mvn -Dhadoop.version=1.0.4 test` I assume that is what failed Jenkins last time. I also tried `-Dhadoop.version1.2.1` and `-Phadoop-2.4 -Pyarn -Phive` for more coverage.

So this is why the class was put in `org.apache.hadoop` to begin with, I assume. One option is to leave this as-is for now and move it only when Hadoop 1.0.x support goes away.

This is the other option, which adds a call to force the constructor to be public at run-time. It's probably less surprising than putting Spark code in `org.apache.hadoop`, but, does involve reflection. A `SecurityManager` might forbid this, but it would forbid a lot of stuff Spark does. This would also only affect Hadoop 1.0.x it seems.

Author: Sean Owen <sowen@cloudera.com>

Closes #3048 from srowen/SPARK-1209 and squashes the following commits:

0d48f4b [Sean Owen] For Hadoop 1.0.x, make certain constructors public, which were public in later versions
466e179 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though?
eb61820 [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark
2014-11-09 22:11:20 -08:00
Patrick Wendell f73b56f5e5 MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on Github:

Closes #464 (close requested by 'JoshRosen')
Closes #283 (close requested by 'pwendell')
Closes #449 (close requested by 'pwendell')
Closes #907 (close requested by 'pwendell')
Closes #2478 (close requested by 'JoshRosen')
Closes #2192 (close requested by 'tdas')
Closes #918 (close requested by 'pwendell')
Closes #1465 (close requested by 'pwendell')
Closes #3135 (close requested by 'JoshRosen')
Closes #1693 (close requested by 'tdas')
Closes #1279 (close requested by 'pwendell')
2014-11-09 18:16:20 -08:00
Sean Owen d1362659ef SPARK-1344 [DOCS] Scala API docs for top methods
Use "k" in javadoc of top and takeOrdered to avoid confusion with type K in pair RDDs. I think this resolves the discussion in SPARK-1344.

Author: Sean Owen <sowen@cloudera.com>

Closes #3168 from srowen/SPARK-1344 and squashes the following commits:

6963fcc [Sean Owen] Use "k" in javadoc of top and takeOrdered to avoid confusion with type K in pair RDDs
2014-11-09 17:42:08 -08:00
Sean Owen 8c99a47a4f SPARK-971 [DOCS] Link to Confluence wiki from project website / documentation
This is a trivial change to add links to the wiki from `README.md` and the main docs page. It is already linked to from spark.apache.org.

Author: Sean Owen <sowen@cloudera.com>

Closes #3169 from srowen/SPARK-971 and squashes the following commits:

dcb84d0 [Sean Owen] Add link to wiki from README, docs home page
2014-11-09 17:40:48 -08:00
Josh Rosen 7b41b17f32 [SPARK-4301] StreamingContext should not allow start() to be called after calling stop()
In Spark 1.0.0+, calling `stop()` on a StreamingContext that has not been started is a no-op which has no side-effects. This allows users to call `stop()` on a fresh StreamingContext followed by `start()`. I believe that this almost always indicates an error and is not behavior that we should support. Since we don't allow `start() stop() start()` then I don't think it makes sense to allow `stop() start()`.

The current behavior can lead to resource leaks when StreamingContext constructs its own SparkContext: if I call `stop(stopSparkContext=True)`, then I expect StreamingContext's underlying SparkContext to be stopped irrespective of whether the StreamingContext has been started. This is useful when writing unit test fixtures.

Prior discussions:
- https://github.com/apache/spark/pull/3053#discussion-diff-19710333R490
- https://github.com/apache/spark/pull/3121#issuecomment-61927353

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3160 from JoshRosen/SPARK-4301 and squashes the following commits:

dbcc929 [Josh Rosen] Address more review comments
bdbe5da [Josh Rosen] Stop SparkContext after stopping scheduler, not before.
03e9c40 [Josh Rosen] Always stop SparkContext, even if stop(false) has already been called.
832a7f4 [Josh Rosen] Address review comment
5142517 [Josh Rosen] Add tests; improve Scaladoc.
813e471 [Josh Rosen] Revert workaround added in https://github.com/apache/spark/pull/3053/files#diff-e144dbee130ed84f9465853ddce65f8eR49
5558e70 [Josh Rosen] StreamingContext.stop() should stop SparkContext even if StreamingContext has not been started yet.
2014-11-08 18:10:23 -08:00
Aaron Davidson 4af5c7e244 [Minor] [Core] Don't NPE on closeQuietly(null)
Author: Aaron Davidson <aaron@databricks.com>

Closes #3166 from aarondav/closeQuietlyer and squashes the following commits:

78096b5 [Aaron Davidson] Don't NPE on closeQuietly(null)
2014-11-08 13:03:51 -08:00
Andrew Or 7afc8564f3 [SPARK-4291][Build] Rename network module projects
The names of the recently introduced network modules are inconsistent with those of the other modules in the project. We should just drop the "Code" suffix since it doesn't sacrifice any meaning, especially before they get into an official release.

```
[INFO] Reactor Build Order:
[INFO]
[INFO] Spark Project Parent POM
[INFO] Spark Project Common Network Code
[INFO] Spark Project Shuffle Streaming Service Code
[INFO] Spark Project Core
[INFO] Spark Project Bagel
[INFO] Spark Project GraphX
[INFO] Spark Project Streaming
[INFO] Spark Project Catalyst
[INFO] Spark Project SQL
[INFO] Spark Project ML Library
[INFO] Spark Project Tools
[INFO] Spark Project Hive
[INFO] Spark Project REPL
[INFO] Spark Project YARN Parent POM
[INFO] Spark Project YARN Stable API
[INFO] Spark Project Assembly
[INFO] Spark Project External Twitter
[INFO] Spark Project External Kafka
[INFO] Spark Project External Flume Sink
[INFO] Spark Project External Flume
[INFO] Spark Project External ZeroMQ
[INFO] Spark Project External MQTT
[INFO] Spark Project Examples
[INFO] Spark Project Yarn Shuffle Service Code
```

Author: Andrew Or <andrew@databricks.com>

Closes #3148 from andrewor14/build-drop-code and squashes the following commits:

eac839b [Andrew Or] Network -> Networking
d01ad47 [Andrew Or] Rename network module project names
2014-11-07 23:16:13 -08:00
Michelangelo D'Agostino 7e9d975676 [MLLIB] [PYTHON] SPARK-4221: Expose nonnegative ALS in the python API
SPARK-1553 added alternating nonnegative least squares to MLLib, however it's not possible to access it via the python API.  This pull request resolves that.

Author: Michelangelo D'Agostino <mdagostino@civisanalytics.com>

Closes #3095 from mdagost/python_nmf and squashes the following commits:

a6743ad [Michelangelo D'Agostino] Use setters instead of static methods in PythonMLLibAPI.  Remove the new static methods I added.  Set seed in tests.  Change ratings to ratingsRDD in both train and trainImplicit for consistency.
7cffd39 [Michelangelo D'Agostino] Swapped nonnegative and seed in a few more places.
3fdc851 [Michelangelo D'Agostino] Moved seed to the end of the python parameter list.
bdcc154 [Michelangelo D'Agostino] Change seed type to java.lang.Long so that it can handle null.
cedf043 [Michelangelo D'Agostino] Added in ability to set the seed from python and made that play nice with the nonnegative changes.  Also made the python ALS tests more exact.
a72fdc9 [Michelangelo D'Agostino] Expose nonnegative ALS in the python API.
2014-11-07 22:53:01 -08:00
Davies Liu 7779109796 [SPARK-4304] [PySpark] Fix sort on empty RDD
This PR fix sortBy()/sortByKey() on empty RDD.

This should be back ported into 1.1/1.2

Author: Davies Liu <davies@databricks.com>

Closes #3162 from davies/fix_sort and squashes the following commits:

84f64b7 [Davies Liu] add tests
52995b5 [Davies Liu] fix sortByKey() on empty RDD
2014-11-07 20:53:03 -08:00
Patrick Wendell 5923dd986b MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on Github:

Closes #3016 (close requested by 'andrewor14')
Closes #2798 (close requested by 'andrewor14')
Closes #2864 (close requested by 'andrewor14')
Closes #3154 (close requested by 'JoshRosen')
Closes #3156 (close requested by 'JoshRosen')
Closes #214 (close requested by 'kayousterhout')
Closes #2584 (close requested by 'andrewor14')
2014-11-07 13:08:25 -08:00
xiao321 7c9ec529a3 Update JavaCustomReceiver.java
数组下标越界

Author: xiao321 <1042460381@qq.com>

Closes #3153 from xiao321/patch-1 and squashes the following commits:

0ed17b5 [xiao321] Update JavaCustomReceiver.java
2014-11-07 12:56:49 -08:00