Commit graph

7062 commits

Author SHA1 Message Date
witgo 6bee01dd04 remove outdated runtime Information scala home
Author: witgo <witgo@qq.com>

Closes #728 from witgo/scala_home and squashes the following commits:

cdfd8be [witgo] Merge branch 'master' of https://github.com/apache/spark into scala_home
fac094a [witgo] remove outdated runtime Information scala home
2014-05-11 14:34:27 -07:00
Prashant Sharma 70bcdef48a Enabled incremental build that comes with sbt 0.13.2
More info at https://github.com/sbt/sbt/issues/1010

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #525 from ScrapCodes/sbt-inc-opt and squashes the following commits:

ba8fa42 [Prashant Sharma] Enabled incremental build that comes with sbt 0.13.2
2014-05-10 21:08:04 -07:00
Andrew Or 83e0424d87 [SPARK-1774] Respect SparkSubmit --jars on YARN (client)
SparkSubmit ignores `--jars` for YARN client. This is a bug.

This PR also automatically adds the application jar to `spark.jar`. Previously, when running as yarn-client, you had to specify the jar additionally through `--files` (because `--jars` didn't work). Now you don't have to explicitly specify it through either.

Tested on a YARN cluster.

Author: Andrew Or <andrewor14@gmail.com>

Closes #710 from andrewor14/yarn-jars and squashes the following commits:

35d1928 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-jars
c27bf6c [Andrew Or] For yarn-cluster and python, do not add primaryResource to spark.jar
c92c5bf [Andrew Or] Minor cleanups
269f9f3 [Andrew Or] Fix format
013d840 [Andrew Or] Fix tests
1407474 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-jars
3bb75e8 [Andrew Or] Allow SparkSubmit --jars to take effect in yarn-client mode
2014-05-10 20:58:02 -07:00
Sean Owen 2b7bd29eb6 SPARK-1789. Multiple versions of Netty dependencies cause FlumeStreamSuite failure
TL;DR: there is a bit of JAR hell with Netty that can be mostly resolved, and resolving it fixes a test failure.

I hit the error described at http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html while running FlumeStreamSuite, and have for a short while (is it just me?)

velvia notes:
"I have found a workaround.  If you add akka 2.2.4 to your dependencies, then everything works, probably because akka 2.2.4 brings in newer version of Jetty."

There are at least 3 versions of Netty in play in the build:

- the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and that is the immediate problem
- the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6.Final
- but, Spark Core directly uses io.netty:netty-all:4.0.17.Final

The POMs try to exclude other versions of netty, but are excluding org.jboss.netty:netty, when in fact older versions of io.netty:netty (not netty-all) are also an issue.

The org.jboss.netty:netty excludes are largely unnecessary. I replaced many of them with io.netty:netty exclusions until everything agreed on io.netty:netty-all:4.0.17.Final.

But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. Downgrading to 3.6.6.Final across the board made some Spark code not compile.

If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to work. Part of the reason seems to be that Netty 3.x used the old `org.jboss.netty` packages. This is less than ideal, but is no worse than the current situation.

So this PR resolves the issue and improves the JAR hell, even if it leaves the existing theoretical Netty 3-vs-4 conflict:

- Remove org.jboss.netty excludes where possible, for clarity; they're not needed except with Hadoop artifacts
- Add io.netty:netty excludes where needed -- except, let akka keep its io.netty:netty
- Change a bit of test code that actually depended on Netty 3.x, to use 4.x equivalent
- Update SBT build accordingly

A better change would be to update Akka far enough such that it agrees on Netty 4.x, but I don't know if that's feasible.

Author: Sean Owen <sowen@cloudera.com>

Closes #723 from srowen/SPARK-1789 and squashes the following commits:

43661b7 [Sean Owen] Update and add Netty excludes to prevent some JAR conflicts that cause test issues
2014-05-10 20:50:40 -07:00
Ankur Dave 905173df57 Unify GraphImpl RDDs + other graph load optimizations
This PR makes the following changes, primarily in e4fbd329aef85fe2c38b0167255d2a712893d683:

1. *Unify RDDs to avoid zipPartitions.* A graph used to be four RDDs: vertices, edges, routing table, and triplet view. This commit merges them down to two: vertices (with routing table), and edges (with replicated vertices).

2. *Avoid duplicate shuffle in graph building.* We used to do two shuffles when building a graph: one to extract routing information from the edges and move it to the vertices, and another to find nonexistent vertices referred to by edges. With this commit, the latter is done as a side effect of the former.

3. *Avoid no-op shuffle when joins are fully eliminated.* This is a side effect of unifying the edges and the triplet view.

4. *Join elimination for mapTriplets.*

5. *Ship only the needed vertex attributes when upgrading the triplet view.* If the triplet view already contains source attributes, and we now need both attributes, only ship destination attributes rather than re-shipping both. This is done in `ReplicatedVertexView#upgrade`.

Author: Ankur Dave <ankurdave@gmail.com>

Closes #497 from ankurdave/unify-rdds and squashes the following commits:

332ab43 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into unify-rdds
4933e2e [Ankur Dave] Exclude RoutingTable from binary compatibility check
5ba8789 [Ankur Dave] Add GraphX upgrade guide from Spark 0.9.1
13ac845 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into unify-rdds
a04765c [Ankur Dave] Remove unnecessary toOps call
57202e8 [Ankur Dave] Replace case with pair parameter
75af062 [Ankur Dave] Add explicit return types
04d3ae5 [Ankur Dave] Convert implicit parameter to context bound
c88b269 [Ankur Dave] Revert upgradeIterator to if-in-a-loop
0d3584c [Ankur Dave] EdgePartition.size should be val
2a928b2 [Ankur Dave] Set locality wait
10b3596 [Ankur Dave] Clean up public API
ae36110 [Ankur Dave] Fix style errors
e4fbd32 [Ankur Dave] Unify GraphImpl RDDs + other graph load optimizations
d6d60e2 [Ankur Dave] In GraphLoader, coalesce to minEdgePartitions
62c7b78 [Ankur Dave] In Analytics, take PageRank numIter
d64e8d4 [Ankur Dave] Log current Pregel iteration
2014-05-10 14:48:07 -07:00
Kan Zhang 6c2691d0a0 [SPARK-1690] Tolerating empty elements when saving Python RDD to text files
Tolerate empty strings in PythonRDD

Author: Kan Zhang <kzhang@apache.org>

Closes #644 from kanzhang/SPARK-1690 and squashes the following commits:

c62ad33 [Kan Zhang] Adding Python doctest
473ec4b [Kan Zhang] [SPARK-1690] Tolerating empty elements when saving Python RDD to text files
2014-05-10 14:01:08 -07:00
Bouke van der Bijl 3776f2f283 Add Python includes to path before depickling broadcast values
This fixes https://issues.apache.org/jira/browse/SPARK-1731 by adding the Python includes to the PYTHONPATH before depickling the broadcast values

@airhorns

Author: Bouke van der Bijl <boukevanderbijl@gmail.com>

Closes #656 from bouk/python-includes-before-broadcast and squashes the following commits:

7b0dfe4 [Bouke van der Bijl] Add Python includes to path before depickling broadcast values
2014-05-10 13:02:13 -07:00
Andy Konwinski c05d11bb30 fix broken link in python docs
Author: Andy Konwinski <andykonwinski@gmail.com>

Closes #650 from andyk/python-docs-link-fix and squashes the following commits:

a1f9d51 [Andy Konwinski] fix broken link in python docs
2014-05-10 12:46:51 -07:00
Matei Zaharia 7eefc9d2b3 SPARK-1708. Add a ClassTag on Serializer and things that depend on it
This pull request contains a rebased patch from @heathermiller (https://github.com/heathermiller/spark/pull/1) to add ClassTags on Serializer and types that depend on it (Broadcast and AccumulableCollection). Putting these in the public API signatures now will allow us to use Scala Pickling for serialization down the line without breaking binary compatibility.

One question remaining is whether we also want them on Accumulator -- Accumulator is passed as part of a bigger Task or TaskResult object via the closure serializer so it doesn't seem super useful to add the ClassTag there. Broadcast and AccumulableCollection in contrast were being serialized directly.

CC @rxin, @pwendell, @heathermiller

Author: Matei Zaharia <matei@databricks.com>

Closes #700 from mateiz/spark-1708 and squashes the following commits:

1a3d8b0 [Matei Zaharia] Use fake ClassTag in Java
3b449ed [Matei Zaharia] test fix
2209a27 [Matei Zaharia] Code style fixes
9d48830 [Matei Zaharia] Add a ClassTag on Serializer and things that depend on it
2014-05-10 12:10:24 -07:00
Takuya UESHIN 8e94d2721a [SPARK-1778] [SQL] Add 'limit' transformation to SchemaRDD.
Add `limit` transformation to `SchemaRDD`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #711 from ueshin/issues/SPARK-1778 and squashes the following commits:

33169df [Takuya UESHIN] Add 'limit' transformation to SchemaRDD.
2014-05-10 12:03:27 -07:00
Michael Armbrust 4d60553298 [SQL] Upgrade parquet library.
I think we are hitting this issue in some perf tests: 6aed5288fd

Credit to @aarondav !

Author: Michael Armbrust <michael@databricks.com>

Closes #684 from marmbrus/upgradeParquet and squashes the following commits:

e10a619 [Michael Armbrust] Upgrade parquet library.
2014-05-10 11:48:01 -07:00
witgo 561510867a [SPARK-1644] The org.datanucleus:* should not be packaged into spark-assembly-*.jar
Author: witgo <witgo@qq.com>

Closes #688 from witgo/SPARK-1644 and squashes the following commits:

56ad6ac [witgo] review commit
87c03e4 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1644
6ffa7e4 [witgo] review commit
a597414 [witgo] The org.datanucleus:* should not be packaged into spark-assembly-*.jar
2014-05-10 10:15:04 -07:00
CodingCat 2f452cbaf3 SPARK-1686: keep schedule() calling in the main thread
https://issues.apache.org/jira/browse/SPARK-1686

moved from original JIRA (by @markhamstra):

In deploy.master.Master, the completeRecovery method is the last thing to be called when a standalone Master is recovering from failure. It is responsible for resetting some state, relaunching drivers, and eventually resuming its scheduling duties.

There are currently four places in Master.scala where completeRecovery is called. Three of them are from within the actor's receive method, and aren't problems. The last starts from within receive when the ElectedLeader message is received, but the actual completeRecovery() call is made from the Akka scheduler. That means that it will execute on a different scheduler thread, and Master itself will end up running (i.e., schedule() ) from that Akka scheduler thread.

In this PR, I added a new master message TriggerSchedule to trigger the "local" call of schedule() in the scheduler thread
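The idea of routing the call through a message can be sketched in Python. This is a toy model, not Spark's Akka code: the `Master` class, its mailbox, and the `"Stop"` message are hypothetical stand-ins, and only the `TriggerSchedule` name comes from the description above. A timer thread never calls `schedule()` itself; it only posts a message, so `schedule()` always runs on the thread that drains the mailbox.

```python
import queue
import threading


class Master:
    """Toy actor: schedule() only ever runs on the thread in receive_loop()."""

    def __init__(self):
        self.mailbox = queue.Queue()
        self.scheduled_on = []  # names of the threads that ran schedule()

    def schedule(self):
        self.scheduled_on.append(threading.current_thread().name)

    def complete_recovery_later(self, delay=0.01):
        def fire():
            # Runs on the timer thread: post a message instead of calling
            # schedule() directly.
            self.mailbox.put("TriggerSchedule")
            self.mailbox.put("Stop")  # demo only: lets the loop exit

        timer = threading.Timer(delay, fire)
        timer.start()
        return timer

    def receive_loop(self):
        while True:
            message = self.mailbox.get()
            if message == "TriggerSchedule":
                self.schedule()  # executes on the actor thread
            elif message == "Stop":
                return


def run_demo():
    master = Master()
    actor = threading.Thread(target=master.receive_loop, name="master-actor")
    actor.start()
    master.complete_recovery_later()
    actor.join()
    return master.scheduled_on
```

In this sketch `run_demo()` reports which thread executed `schedule()`, which is the property the change is meant to guarantee.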

Author: CodingCat <zhunansjtu@gmail.com>

Closes #639 from CodingCat/SPARK-1686 and squashes the following commits:

81bb4ca [CodingCat] rename variable
69e0a2a [CodingCat] style fix
36a2ac0 [CodingCat] address Aaron's comments
ec9b7bb [CodingCat] address the comments
02b37ca [CodingCat] keep schedule() calling in the main thread
2014-05-09 21:50:23 -07:00
Aaron Davidson 59577df14c SPARK-1770: Revert accidental(?) fix
Looks like this change was accidentally committed here: 06b15baab2
but the change does not show up in the PR itself (#704).

Other than not intending to go in with that PR, this also broke the test JavaAPISuite.repartition.

Author: Aaron Davidson <aaron@databricks.com>

Closes #716 from aarondav/shufflerand and squashes the following commits:

b1cf70b [Aaron Davidson] SPARK-1770: Revert accidental(?) fix
2014-05-09 14:51:34 -07:00
witgo bd67551ee7 [SPARK-1760]: fix building spark with maven documentation
Author: witgo <witgo@qq.com>

Closes #712 from witgo/building-with-maven and squashes the following commits:

215523b [witgo] fix building spark with maven documentation
2014-05-09 01:51:26 -07:00
Tathagata Das 32868f31f8 Converted bang to ask to avoid scary warning when a block is removed
Removing a block through the blockmanager gave scary warning messages in the driver.
```
2014-05-08 20:16:19,172 WARN BlockManagerMasterActor: Got unknown message: true
2014-05-08 20:16:19,172 WARN BlockManagerMasterActor: Got unknown message: true
2014-05-08 20:16:19,172 WARN BlockManagerMasterActor: Got unknown message: true
```

This is because the [BlockManagerSlaveActor](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerSlaveActor.scala#L44) would send back an acknowledgement ("true"). But the BlockManagerMasterActor would have sent the RemoveBlock message as a send, not as ask(), so it would reject the received "true" as an unknown message.
@pwendell

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #708 from tdas/bm-fix and squashes the following commits:

ed4ef15 [Tathagata Das] Converted bang to ask to avoid scary warning when a block is removed.
2014-05-08 22:34:08 -07:00
Patrick Wendell 4c60fd1e8c MINOR: Removing dead code.
Meant to do this when patching up the last merge.
2014-05-08 22:33:06 -07:00
Sandeep 7db47c463f SPARK-1775: Unneeded lock in ShuffleMapTask.deserializeInfo
This was used in the past to have a cache of deserialized ShuffleMapTasks, but that's been removed, so there's no need for a lock. It slows down Spark when task descriptions are large, e.g. due to large lineage graphs or local variables.

Author: Sandeep <sandeep@techaddict.me>

Closes #707 from techaddict/SPARK-1775 and squashes the following commits:

18d8ebf [Sandeep] SPARK-1775: Unneeded lock in ShuffleMapTask.deserializeInfo This was used in the past to have a cache of deserialized ShuffleMapTasks, but that's been removed, so there's no need for a lock. It slows down Spark when task descriptions are large, e.g. due to large lineage graphs or local variables.
2014-05-08 22:30:17 -07:00
Patrick Wendell 06b15baab2 SPARK-1565 (Addendum): Replace run-example with spark-submit.
Gives a nicely formatted message to the user when `run-example` is run to
tell them to use `spark-submit`.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #704 from pwendell/examples and squashes the following commits:

1996ee8 [Patrick Wendell] Feedback from Andrew
3eb7803 [Patrick Wendell] Suggestions from TD
2474668 [Patrick Wendell] SPARK-1565 (Addendum): Replace `run-example` with `spark-submit`.
2014-05-08 22:26:36 -07:00
Marcelo Vanzin 3f779d872d [SPARK-1631] Correctly set the Yarn app name when launching the AM.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #539 from vanzin/yarn-app-name and squashes the following commits:

7d1ca4f [Marcelo Vanzin] [SPARK-1631] Correctly set the Yarn app name when launching the AM.
2014-05-08 20:46:11 -07:00
Andrew Or 8b78412994 [SPARK-1755] Respect SparkSubmit --name on YARN
Right now, SparkSubmit ignores the `--name` flag for both yarn-client and yarn-cluster. This is a bug.

In client mode, SparkSubmit treats `--name` as a [cluster config](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L170) and does not propagate this to SparkContext.

In cluster mode, SparkSubmit passes this flag to `org.apache.spark.deploy.yarn.Client`, which only uses it for the [YARN ResourceManager](https://github.com/apache/spark/blob/master/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L80), but does not propagate this to SparkContext.

This PR ensures that `spark.app.name` is always set if SparkSubmit receives the `--name` flag, which is what the usage promises. This makes it possible for applications to start a SparkContext with an empty conf `val sc = new SparkContext(new SparkConf)`, and inherit the app name from SparkSubmit.

Tested both modes on a YARN cluster.

Author: Andrew Or <andrewor14@gmail.com>

Closes #699 from andrewor14/yarn-app-name and squashes the following commits:

98f6a79 [Andrew Or] Fix tests
dea932f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-app-name
c86d9ca [Andrew Or] Respect SparkSubmit --name on YARN
2014-05-08 20:45:29 -07:00
Bouke van der Bijl 2fd2752e57 Include the sbin/spark-config.sh in spark-executor
This is needed because broadcast values are broken on pyspark on Mesos: it tries to import pyspark but can't, because the PYTHONPATH is not set due to changes in ff5be9a4

https://issues.apache.org/jira/browse/SPARK-1725

Author: Bouke van der Bijl <boukevanderbijl@gmail.com>

Closes #651 from bouk/include-spark-config-in-mesos-executor and squashes the following commits:

b2f1295 [Bouke van der Bijl] Inline PYTHONPATH in spark-executor
eedbbcc [Bouke van der Bijl] Include the sbin/spark-config.sh in spark-executor
2014-05-08 20:43:37 -07:00
Funes 191279ce4e Bug fix of sparse vector conversion
Fixed a small bug caused by the inconsistency of index/data array size and vector length.

Author: Funes <tianshaocun@gmail.com>
Author: funes <tianshaocun@gmail.com>

Closes #661 from funes/bugfix and squashes the following commits:

edb2b9d [funes] remove unused import
75dced3 [Funes] update test case
d129a66 [Funes] Add test for sparse breeze by vector builder
64e7198 [Funes] Copy data only when necessary
b85806c [Funes] Bug fix of sparse vector conversion
2014-05-08 17:54:10 -07:00
DB Tsai 910a13b3c5 [SPARK-1157][MLlib] Bug fix: lossHistory should exclude rejection steps, and remove miniBatch
Get the lossHistory from Breeze's API, which already excludes the rejection steps in line search. Also, remove the miniBatch in LBFGS: quasi-Newton methods approximate the inverse of the Hessian, which doesn't make sense if the gradients are computed from a varying objective.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #582 from dbtsai/dbtsai-lbfgs-bug and squashes the following commits:

9cc6cf9 [DB Tsai] Removed the miniBatch in LBFGS.
1ba6a33 [DB Tsai] Formatting the code.
d72c679 [DB Tsai] Using Breeze's states to get the loss.
2014-05-08 17:53:22 -07:00
DB Tsai d38febee46 MLlib documentation fix
Fixed the documentation to reflect that `loadLibSVMData` was renamed to `loadLibSVMFile`.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #703 from dbtsai/dbtsai-docfix and squashes the following commits:

71dd508 [DB Tsai] loadLibSVMData is changed to loadLibSVMFile
2014-05-08 17:52:32 -07:00
Takuya UESHIN 322b1808d2 [SPARK-1754] [SQL] Add missing arithmetic DSL operations.
Add missing arithmetic DSL operations: `unary_-`, `%`.
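As a rough analogue of what such DSL operators do, the Python sketch below overloads unary minus and `%` so that ordinary-looking arithmetic builds an expression tree instead of computing a value. The class names (`Attr`, `UnaryMinus`, `Remainder`) are illustrative assumptions for this sketch, not the actual Scala DSL definitions.

```python
from dataclasses import dataclass


class Expr:
    """Base class: operators build tree nodes rather than evaluating."""

    def __neg__(self):
        return UnaryMinus(self)

    def __mod__(self, other):
        return Remainder(self, other)


@dataclass(frozen=True)
class Attr(Expr):
    name: str


@dataclass(frozen=True)
class UnaryMinus(Expr):
    child: Expr


@dataclass(frozen=True)
class Remainder(Expr):
    left: Expr
    right: Expr
```

With these definitions, `-Attr("a") % Attr("b")` yields `Remainder(UnaryMinus(Attr("a")), Attr("b"))`, because unary minus binds tighter than `%` in Python.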

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #689 from ueshin/issues/SPARK-1754 and squashes the following commits:

a09ef69 [Takuya UESHIN] Add also missing ! (not) operation.
f73ae2c [Takuya UESHIN] Remove redundant tests.
5b3f087 [Takuya UESHIN] Add tests relating DSL operations.
e09c5b8 [Takuya UESHIN] Add missing arithmetic DSL operations.
2014-05-08 15:31:47 -07:00
Evan Sparks 5c5e7d5809 Fixing typo in als.py
XtY should be Xty.

Author: Evan Sparks <evan.sparks@gmail.com>

Closes #696 from etrain/patch-2 and squashes the following commits:

634cb8d [Evan Sparks] Fixing typo in als.py
2014-05-08 13:07:30 -07:00
Andrew Or c3f8b78c21 [SPARK-1745] Move interrupted flag from TaskContext constructor (minor)
It makes little sense to start a TaskContext that is interrupted. Indeed, I searched for all use cases of it and didn't find a single instance in which `interrupted` is true on construction.

This was inspired by reviewing #640, which adds an additional `@volatile var completed` that is similar. These are not the most urgent changes, but I wanted to push them out before I forget.

Author: Andrew Or <andrewor14@gmail.com>

Closes #675 from andrewor14/task-context and squashes the following commits:

9575e02 [Andrew Or] Add space
69455d1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into task-context
c471490 [Andrew Or] Oops, removed one flag too many. Adding it back.
85311f8 [Andrew Or] Move interrupted flag from TaskContext constructor
2014-05-08 12:13:07 -07:00
Prashant Sharma 44dd57fb66 SPARK-1565, update examples to be used with spark-submit script.
Commit for initial feedback. I am curious whether we should prompt the user to provide args, especially when they are mandatory, and whether we can skip the prompt when they are not.

Also, a few other things did not work, such as
`bin/spark-submit examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop1.0.4.jar --class org.apache.spark.examples.SparkALS --arg 100 500 10 5 2`

Not all the args get passed properly. Maybe I have messed something up; I will try to sort it out.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #552 from ScrapCodes/SPARK-1565/update-examples and squashes the following commits:

669dd23 [Prashant Sharma] Review comments
2727e70 [Prashant Sharma] SPARK-1565, update examples to be used with spark-submit script.
2014-05-08 10:23:05 -07:00
Michael Armbrust 19c8fb02bc [SQL] Improve SparkSQL Aggregates
* Add native min/max (was using hive before).
* Handle nulls correctly in Avg and Sum.
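The commit does not spell out the exact null semantics, so the following Python sketch assumes the usual SQL convention: aggregates skip null inputs and return null when no non-null input exists. The function names are hypothetical.

```python
from typing import Iterable, Optional


def sql_sum(values: Iterable[Optional[float]]) -> Optional[float]:
    """Sum that ignores None and yields None when every input is None."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


def sql_avg(values: Iterable[Optional[float]]) -> Optional[float]:
    """Average over non-None inputs only; None if there are none."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) / len(non_null) if non_null else None
```

Note that under this convention the average divides by the count of non-null values, not by the total row count.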

Author: Michael Armbrust <michael@databricks.com>

Closes #683 from marmbrus/aggFixes and squashes the following commits:

64fe30b [Michael Armbrust] Improve SparkSQL Aggregates * Add native min/max (was using hive before). * Handle nulls correctly in Avg and Sum.
2014-05-08 01:08:43 -04:00
Evan Sparks 6ed7e2cd01 Use numpy directly for matrix multiply.
Using matrix multiply to compute XtX and XtY yields a 5-20x speedup depending on problem size.

For example - the following takes 19s locally after this change vs. 5m21s before the change. (16x speedup).
bin/pyspark examples/src/main/python/als.py local[8] 1000 1000 50 10 10

Author: Evan Sparks <evan.sparks@gmail.com>

Closes #687 from etrain/patch-1 and squashes the following commits:

e094dbc [Evan Sparks] Touching only diagonals on update.
d1ab9b6 [Evan Sparks] Use numpy directly for matrix multiply.
2014-05-08 00:24:36 -04:00
Sandeep 108c4c16cc SPARK-1668: Add implicit preference as an option to examples/MovieLensALS
Add --implicitPrefs as a command-line option to the example app MovieLensALS under examples/

Author: Sandeep <sandeep@techaddict.me>

Closes #597 from techaddict/SPARK-1668 and squashes the following commits:

8b371dc [Sandeep] Second Pass on reviews by mengxr
eca9d37 [Sandeep] based on mengxr's suggestions
937e54c [Sandeep] Changes
5149d40 [Sandeep] Changes based on review
1dd7657 [Sandeep] use mean()
42444d7 [Sandeep] Based on Suggestions by mengxr
e3082fa [Sandeep] SPARK-1668: Add implicit preference as an option to examples/MovieLensALS Add --implicitPrefs as a command-line option to the example app MovieLensALS under examples/
2014-05-08 00:15:05 -04:00
Manish Amde f269b016ac SPARK-1544 Add support for deep decision trees.
@etrain and I came up with a PR for arbitrarily deep decision trees at the cost of multiple passes over the data at deep tree levels.

To summarize:
1) We take a parameter that indicates the amount of memory users want to reserve for computation on each worker (and 2x that at the driver).
2) Using that information, we calculate two things - the maximum depth to which we train as usual (which is, implicitly, the maximum number of nodes we want to train in parallel), and the size of the groups we should use in the case where we exceed this depth.
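The two quantities in the summary can be illustrated with a small Python sketch. The formulas here are assumptions made for illustration, not the PR's actual calculation: it supposes a binary tree with `2**level` nodes at each level and a fixed, hypothetical `bytes_per_node` cost for the per-node statistics.

```python
def plan_levels(max_memory_bytes: int, bytes_per_node: int):
    """Return (max_parallel_level, group_size) under the stated assumptions."""
    nodes_that_fit = max(1, max_memory_bytes // bytes_per_node)
    # floor(log2(nodes_that_fit)): deepest level whose nodes all fit at once
    max_parallel_level = nodes_that_fit.bit_length() - 1
    group_size = 2 ** max_parallel_level
    return max_parallel_level, group_size


def passes_at_level(level: int, max_parallel_level: int) -> int:
    """Number of passes over the data needed to train one tree level."""
    if level <= max_parallel_level:
        return 1
    # Deeper levels are split into groups of 2**max_parallel_level nodes.
    return 2 ** (level - max_parallel_level)
```

For example, with room for 128 nodes' worth of statistics, levels 0 through 7 take one pass each in this model, while level 10 is split into 8 groups and therefore takes 8 passes.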

cc: @atalwalkar, @hirakendu, @mengxr

Author: Manish Amde <manish9ue@gmail.com>
Author: manishamde <manish9ue@gmail.com>
Author: Evan Sparks <sparks@cs.berkeley.edu>

Closes #475 from manishamde/deep_tree and squashes the following commits:

968ca9d [Manish Amde] merged master
7fc9545 [Manish Amde] added docs
ce004a1 [Manish Amde] minor formatting
b27ad2c [Manish Amde] formatting
426bb28 [Manish Amde] programming guide blurb
8053fed [Manish Amde] more formatting
5eca9e4 [Manish Amde] grammar
4731cda [Manish Amde] formatting
5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation
cbd9f14 [Manish Amde] modified scala.math to math
dad9652 [Manish Amde] removed unused imports
e0426ee [Manish Amde] renamed parameter
718506b [Manish Amde] added unit test
1517155 [Manish Amde] updated documentation
9dbdabe [Manish Amde] merge from master
719d009 [Manish Amde] updating user documentation
fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree
0287772 [Evan Sparks] Fixing scalastyle issue.
2f1e093 [Manish Amde] minor: added doc for maxMemory parameter
2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree
abc5a23 [Evan Sparks] Parameterizing max memory.
50b143a [Manish Amde] adding support for very deep trees
2014-05-07 17:08:38 -07:00
baishuo(白硕) 0c19bb161b Update GradientDescentSuite.scala
Use a faster way to construct an array

Author: baishuo(白硕) <vc_java@hotmail.com>

Closes #588 from baishuo/master and squashes the following commits:

45b95fb [baishuo(白硕)] Update GradientDescentSuite.scala
c03b61c [baishuo(白硕)] Update GradientDescentSuite.scala
b666d27 [baishuo(白硕)] Update GradientDescentSuite.scala
2014-05-07 16:02:55 -07:00
Xiangrui Meng 3188553f73 [SPARK-1743][MLLIB] add loadLibSVMFile and saveAsLibSVMFile to pyspark
Make loading/saving labeled data easier for pyspark users.

Also changed type check in `SparseVector` to allow numpy integers.

Author: Xiangrui Meng <meng@databricks.com>

Closes #672 from mengxr/pyspark-mllib-util and squashes the following commits:

2943fa7 [Xiangrui Meng] format docs
d61668d [Xiangrui Meng] add loadLibSVMFile and saveAsLibSVMFile to pyspark
2014-05-07 16:01:11 -07:00
Thomas Graves 4bec84b6a2 SPARK-1569 Spark on Yarn, authentication broken by pr299
Pass the configs as java options since the executor needs to know before it registers whether to create the connection using authentication or not. We could see about passing only the authentication configs but for now I just had it pass them all.

I also updated it to use a list to construct the command, to make it the same as ClientBase and avoid any issues with spaces.
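Why a list helps with spaces can be shown with a short Python sketch. The helper names and the `org.example.Main` class are hypothetical; the point is only that an argument containing a space survives as one element of a list but is torn apart once the command is flattened to a string and split on whitespace.

```python
def build_command(java_opts, main_class):
    """Keep every argument as its own list element."""
    return ["java", *java_opts, main_class]


def naive_round_trip(command):
    """Flatten to one string and split on whitespace, as a naive launcher might."""
    return " ".join(command).split()
```

Given `["-Dspark.app.name=My App", "-Dspark.authenticate=true"]` as options, the list form keeps four arguments, while the round trip through a string produces five, with the app name broken in two.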

Author: Thomas Graves <tgraves@apache.org>

Closes #649 from tgravescs/SPARK-1569 and squashes the following commits:

0178ab8 [Thomas Graves] add akka settings
22a8735 [Thomas Graves] Change to only pass spark.auth* configs
8ccc1d4 [Thomas Graves] SPARK-1569 Spark on Yarn, authentication broken
2014-05-07 15:51:53 -07:00
Andrew Or 5200872243 [SPARK-1688] Propagate PySpark worker stderr to driver
When at least one of the following conditions is true, PySpark cannot be loaded:

1. PYTHONPATH is not set
2. PYTHONPATH does not contain the python directory (or jar, in the case of YARN)
3. The jar does not contain pyspark files (YARN)
4. The jar does not contain py4j files (YARN)

However, we currently throw the same random `java.io.EOFException` for all of the above cases, when trying to read from the python daemon's output. This message is super unhelpful.

This PR includes the python stderr and the PYTHONPATH in the exception propagated to the driver. Now, the exception message looks something like:

```
Error from python worker:
  : No module named pyspark
PYTHONPATH was:
  /path/to/spark/python:/path/to/some/jar
java.io.EOFException
  <stack trace>
```

whereas before it was just

```
java.io.EOFException
  <stack trace>
```
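The shape of the richer message can be mimicked with a small Python helper. This is only a sketch of the formatting shown above; the function name and parameters are hypothetical and do not correspond to the Scala code in the PR.

```python
def format_worker_error(stderr_text, pythonpath,
                        original_exception="java.io.EOFException"):
    """Prefix the original exception with the worker's stderr and PYTHONPATH."""
    # Indent each stderr line by two spaces, matching the layout above.
    indented = "\n".join("  " + line for line in stderr_text.splitlines())
    return (
        "Error from python worker:\n"
        f"{indented}\n"
        "PYTHONPATH was:\n"
        f"  {pythonpath}\n"
        f"{original_exception}"
    )
```

Keeping the original exception as the last line means existing log searches for `java.io.EOFException` still match, while the lines above it carry the cause.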

Author: Andrew Or <andrewor14@gmail.com>

Closes #603 from andrewor14/pyspark-exception and squashes the following commits:

10d65d3 [Andrew Or] Throwable -> Exception, worker -> daemon
862d1d7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-exception
a5ed798 [Andrew Or] Use block string and interpolation instead of var (minor)
cc09c45 [Andrew Or] Account for the fact that the python daemon may not have terminated yet
444f019 [Andrew Or] Use the new RedirectThread + include system PYTHONPATH
aab00ae [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-exception
0cc2402 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-exception
783efe2 [Andrew Or] Make python daemon stderr indentation consistent
9524172 [Andrew Or] Avoid potential NPE / error stream contention + Move things around
29f9688 [Andrew Or] Add back original exception type
e92d36b [Andrew Or] Include python worker stderr in the exception propagated to the driver
7c69360 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-exception
cdbc185 [Andrew Or] Fix python attribute not found exception when PYTHONPATH is not set
dcc0353 [Andrew Or] Check both python and system environment variables for PYTHONPATH
6c09c21 [Andrew Or] Validate PYTHONPATH and PySpark modules before starting python workers
2014-05-07 14:35:22 -07:00
Andrew Ash d00981a951 Typo fix: fetchting -> fetching
Author: Andrew Ash <andrew@andrewash.com>

Closes #680 from ash211/patch-3 and squashes the following commits:

9ce3746 [Andrew Ash] Typo fix: fetchting -> fetching
2014-05-07 17:24:49 -04:00
Andrew Ash 7f6f4a1035 Nicer logging for SecurityManager startup
Happy to open a jira ticket if you'd like to track one there.

Author: Andrew Ash <andrew@andrewash.com>

Closes #678 from ash211/SecurityManagerLogging and squashes the following commits:

2aa0b7a [Andrew Ash] Nicer logging for SecurityManager startup
2014-05-07 17:24:12 -04:00
Cheng Hao ca43186867 [SQL] Fix Performance Issue in data type casting
Use a lazy val instead of a function in the class Cast, which improved performance by nearly 2x in my local micro-benchmark.
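The effect of computing the conversion once instead of on every call can be approximated in Python with `functools.cached_property`. This is an analogy rather than the Scala change itself: the `Cast` class below and its `builds` counter are hypothetical, and the lookup table stands in for whatever work selects the conversion function.

```python
from functools import cached_property


class Cast:
    """Toy cast expression that resolves its conversion function once."""

    def __init__(self, target_type):
        self.target_type = target_type
        self.builds = 0  # how many times the conversion was resolved

    @cached_property
    def cast_fn(self):
        # Evaluated on first access only; the result is cached on the instance.
        self.builds += 1
        return {"int": int, "float": float, "str": str}[self.target_type]

    def eval(self, value):
        return self.cast_fn(value)
```

Evaluating many rows through one `Cast("int")` instance resolves the conversion a single time, whereas a plain method would repeat the lookup per row.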

Author: Cheng Hao <hao.cheng@intel.com>

Closes #679 from chenghao-intel/fix_type_casting and squashes the following commits:

71b0902 [Cheng Hao] using lazy val object instead of function for data type casting
2014-05-07 16:54:58 -04:00
Aaron Davidson 3308722ca0 SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions
This patch includes several cleanups to PythonRDD, focused around fixing [SPARK-1579](https://issues.apache.org/jira/browse/SPARK-1579) cleanly. Listed in order of approximate importance:

- The Python daemon waits for Spark to close the socket before exiting,
  in order to avoid causing spurious IOExceptions in Spark's
  `PythonRDD::WriterThread`.
- Removes the Python Monitor Thread, which polled for task cancellations
  in order to kill the Python worker. Instead, we do this in the
  onCompleteCallback, since this is guaranteed to be called during
  cancellation.
- Adds a "completed" variable to TaskContext to avoid the issue noted in
  [SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), where onCompleteCallbacks may be execution-order dependent.
  Along with this, I removed the "context.interrupted = true" flag in
  the onCompleteCallback.
- Extracts PythonRDD::WriterThread to its own class.

Since this patch provides an alternative solution to [SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), I did test it with

```
sc.textFile("latlon.tsv").take(5)
```

many times without error.

Additionally, in order to test the unswallowed exceptions, I performed

```
sc.textFile("s3n://<big file>").count()
```

and cut my internet during execution. Prior to this patch, we got the "stdin writer exited early" message, which was unhelpful. Now, we get the SocketExceptions propagated through Spark to the user and get proper (though unsuccessful) task retries.

Author: Aaron Davidson <aaron@databricks.com>

Closes #640 from aarondav/pyspark-io and squashes the following commits:

b391ff8 [Aaron Davidson] Detect "clean socket shutdowns" and stop waiting on the socket
c0c49da [Aaron Davidson] SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions
2014-05-07 09:48:31 -07:00
Kan Zhang 967635a242 [SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations...
... that do not change schema

Author: Kan Zhang <kzhang@apache.org>

Closes #448 from kanzhang/SPARK-1460 and squashes the following commits:

111e388 [Kan Zhang] silence MiMa errors in EdgeRDD and VertexRDD
91dc787 [Kan Zhang] Taking into account newly added Ordering param
79ed52a [Kan Zhang] [SPARK-1460] Returning SchemaRDD on Set operations that do not change schema
2014-05-07 09:41:31 -07:00
Cheng Hao 3eb53bd59e [WIP][Spark-SQL] Optimize the Constant Folding for Expression
Currently, expressions do not handle the "constant null" case well in constant folding.
For example, Sum(a, 0) actually always produces Literal(0, NumericType) at runtime.

For example:
```
explain select isnull(key+null)  from src;
== Logical Plan ==
Project [HiveGenericUdf#isnull((key#30 + CAST(null, IntegerType))) AS c_0#28]
 MetastoreRelation default, src, None

== Optimized Logical Plan ==
Project [true AS c_0#28]
 MetastoreRelation default, src, None

== Physical Plan ==
Project [true AS c_0#28]
 HiveTableScan [], (MetastoreRelation default, src, None), None
```

I've created a new optimization rule called NullPropagation for this kind of constant folding.
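
As a rough illustration of the idea, here is a minimal Python sketch of null propagation over a toy expression tree. It is not the Catalyst implementation: the classes `Literal`, `Attribute`, `Add`, and `IsNull` and the function `propagate_nulls` are hypothetical stand-ins, and only two folding cases are covered.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Literal:
    value: Any  # None models SQL NULL


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: Any
    right: Any


@dataclass(frozen=True)
class IsNull:
    child: Any


NULL = Literal(None)


def propagate_nulls(expr):
    """Fold sub-expressions that are known to be NULL, bottom-up."""
    if isinstance(expr, Add):
        left = propagate_nulls(expr.left)
        right = propagate_nulls(expr.right)
        # NULL + anything is NULL, so the whole sum folds to a NULL literal.
        if left == NULL or right == NULL:
            return NULL
        return Add(left, right)
    if isinstance(expr, IsNull):
        child = propagate_nulls(expr.child)
        # isnull of a literal can be decided at planning time.
        if isinstance(child, Literal):
            return Literal(child.value is None)
        return IsNull(child)
    return expr


# isnull(key + null) folds to the literal true, as in the optimized plan above.
folded = propagate_nulls(IsNull(Add(Attribute("key"), NULL)))
```

Expressions with no known-NULL operand, such as `IsNull(Attribute("key"))`, are returned unchanged, since their value can only be decided per row.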

Author: Cheng Hao <hao.cheng@intel.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #482 from chenghao-intel/optimize_constant_folding and squashes the following commits:

2f14b50 [Cheng Hao] Fix code style issues
68b9fad [Cheng Hao] Remove the Literal pattern matching for NullPropagation
29c8166 [Cheng Hao] Update the code for feedback of code review
50444cc [Cheng Hao] Remove the unnecessary null checking
80f9f18 [Cheng Hao] Update the UnitTest for aggregation constant folding
27ea3d7 [Cheng Hao] Fix Constant Folding Bugs & Add More Unittests
b28e03a [Cheng Hao] Merge pull request #1 from marmbrus/pr/482
9ccefdb [Michael Armbrust] Add tests for optimized expression evaluation.
543ef9d [Cheng Hao] fix code style issues
9cf0396 [Cheng Hao] update code according to the code review comment
536c005 [Cheng Hao] Add Exceptional case for constant folding
3c045c7 [Cheng Hao] Optimize the Constant Folding by adding more rules
2645d4f [Cheng Hao] Constant Folding(null propagation)
2014-05-07 03:37:12 -04:00
Patrick Wendell 913a0a9c0a SPARK-1746: Support setting SPARK_JAVA_OPTS on executors for backwards compatibility
Author: Patrick Wendell <pwendell@gmail.com>

Closes #676 from pwendell/worker-opts and squashes the following commits:

54456c4 [Patrick Wendell] SPARK-1746: Support setting SPARK_JAVA_OPTS on executors for backwards compatibility
2014-05-07 00:11:05 -07:00
Sandeep fdae095de2 [HOTFIX] SPARK-1637: There are some Streaming examples added after the PR #571 was last updated.
This resulted in Compilation Errors.
cc @mateiz: the project is not compiling currently.

Author: Sandeep <sandeep@techaddict.me>

Closes #673 from techaddict/SPARK-1637-HOTFIX and squashes the following commits:

b512f4f [Sandeep] [SPARK-1637][HOTFIX] There are some Streaming examples added after the PR #571 was last updated. This resulted in Compilation Errors.
2014-05-06 21:55:05 -07:00
Ethan Jewett 48ba3b8cdc Proposal: clarify Scala programming guide on caching ...
... with regards to saved map output. Wording taken partially from Matei Zaharia's email to the Spark user list. http://apache-spark-user-list.1001560.n3.nabble.com/performance-improvement-on-second-operation-without-caching-td5227.html

Author: Ethan Jewett <esjewett@gmail.com>

Closes #668 from esjewett/Doc-update and squashes the following commits:

11793ce [Ethan Jewett] Update based on feedback
171e670 [Ethan Jewett] Clarify Scala programming guide on caching ...
2014-05-06 20:50:08 -07:00
Sean Owen 25ad8f9301 SPARK-1727. Correct small compile errors, typos, and markdown issues in (primarily) MLlib docs
While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs.

Then in the course of building the site docs to check the result, I found a few small suggestions for the build instructions. I also found a few more formatting and markdown issues uncovered when I accidentally used maruku instead of kramdown.

Author: Sean Owen <sowen@cloudera.com>

Closes #653 from srowen/SPARK-1727 and squashes the following commits:

6e7c38a [Sean Owen] Final doc updates - one more compile error, and use of mean instead of sum and count
8f5e847 [Sean Owen] Fix markdown syntax issues that maruku flags, even though we use kramdown (but only those that do not affect kramdown's output)
99966a9 [Sean Owen] Update issue tracker URL in docs
23c9ac3 [Sean Owen] Add Scala Naive Bayes example, to use existing example data file (whose format needed a tweak)
8c81982 [Sean Owen] Fix small compile errors and typos across MLlib docs
2014-05-06 20:07:22 -07:00
Sandeep a000b5c3b0 SPARK-1637: Clean up examples for 1.0
- [x] Move all of them into subpackages of org.apache.spark.examples (right now some are in org.apache.spark.streaming.examples, for instance, and others are in org.apache.spark.examples.mllib)
- [x] Move Python examples into examples/src/main/python
- [x] Update docs to reflect these changes

Author: Sandeep <sandeep@techaddict.me>

This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <matei@databricks.com>

Closes #571 from techaddict/SPARK-1637 and squashes the following commits:

47ef86c [Sandeep] Changes based on Discussions on PR, removing use of RawTextHelper from examples
8ed2d3f [Sandeep] Docs Updated for changes, Change for java examples
5f96121 [Sandeep] Move Python examples into examples/src/main/python
0a8dd77 [Sandeep] Move all Scala Examples to org.apache.spark.examples (some are in org.apache.spark.streaming.examples, for instance, and others are in org.apache.spark.examples.mllib)
2014-05-06 17:27:52 -07:00
Patrick Wendell 39b8b1489f SPARK-1737: Warn rather than fail when Java 7+ is used to create distributions
Also moves a few lines of code around in make-distribution.sh.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #669 from pwendell/make-distribution and squashes the following commits:

8bfac49 [Patrick Wendell] Small fix
46918ec [Patrick Wendell] SPARK-1737: Warn rather than fail when Java 7+ is used to create distributions.
2014-05-06 15:41:46 -07:00
Matei Zaharia 951a5d9398 [SPARK-1549] Add Python support to spark-submit
This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that's all that was supported before) and updates the PySpark code to properly find various paths, etc. One significant change is that we assume we can always find the Python files either from the Spark assembly JAR (which will happen with the Maven assembly build in make-distribution.sh) or from SPARK_HOME (which will exist in local mode even if you use sbt assembly, and should be enough for testing). This means we no longer need a weird hack to modify the environment for YARN.

This patch also updates the Python worker manager to run python with -u, which means unbuffered output (send it to our logs right away instead of waiting a while after stuff was written); this should simplify debugging.

In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709, setting the main class from a JAR's Main-Class attribute if not specified by the user, and fixes a few help strings and style issues in spark-submit.
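
The Main-Class fallback can be sketched in Python as follows. This is not the spark-submit code: the helper names `main_class_from_jar` and `resolve_main_class` and the class name `com.example.Main` are hypothetical. The sketch assumes only that a JAR is a zip archive whose manifest lives at `META-INF/MANIFEST.MF`, and it ignores manifest line continuations.

```python
import io
import zipfile
from typing import Optional


def main_class_from_jar(jar_bytes: bytes) -> Optional[str]:
    """Return the Main-Class manifest attribute, or None if absent.

    Simplification: ignores the manifest rule that long lines may be
    continued on the next line with a leading space.
    """
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        try:
            manifest = jar.read("META-INF/MANIFEST.MF").decode("utf-8")
        except KeyError:
            return None  # no manifest in the archive
    for line in manifest.splitlines():
        name, sep, value = line.partition(":")
        # Manifest attribute names are case-insensitive.
        if sep and name.strip().lower() == "main-class":
            return value.strip()
    return None


def resolve_main_class(user_main_class: Optional[str], jar_bytes: bytes) -> Optional[str]:
    # An explicit user choice wins; the manifest is only a fallback.
    return user_main_class or main_class_from_jar(jar_bytes)


# Build a tiny in-memory JAR for demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as jar:
    jar.writestr(
        "META-INF/MANIFEST.MF",
        "Manifest-Version: 1.0\nMain-Class: com.example.Main\n",
    )
demo_jar = buffer.getvalue()
```

Calling `resolve_main_class(None, demo_jar)` falls back to the manifest entry, while passing an explicit class name leaves the manifest unused.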

In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0.

Author: Matei Zaharia <matei@databricks.com>

Closes #664 from mateiz/py-submit and squashes the following commits:

15e9669 [Matei Zaharia] Fix some uses of path.separator property
051278c [Matei Zaharia] Small style fixes
0afe886 [Matei Zaharia] Add license headers
4650412 [Matei Zaharia] Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests
15f8e1e [Matei Zaharia] Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside
47c0655 [Matei Zaharia] More work to make spark-submit work with Python:
d4375bd [Matei Zaharia] Clean up description of spark-submit args a bit and add Python ones
2014-05-06 15:12:35 -07:00