Commit graph

9423 commits

Author SHA1 Message Date
Kousuke Saruta e902dc443d [SPARK-5188][BUILD] make-distribution.sh should support curl, not only wget to get Tachyon
When we use `make-distribution.sh` with the `--with-tachyon` option, Tachyon is downloaded with the `wget` command, but some systems don't have `wget` by default (Mac OS X, for example).
Other scripts such as build/mvn and build/sbt support not only `wget` but also `curl`, so `make-distribution.sh` should support `curl` too.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3988 from sarutak/SPARK-5188 and squashes the following commits:

0f546e0 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5188
010e884 [Kousuke Saruta] Merge branch 'SPARK-5188' of github.com:sarutak/spark into SPARK-5188
163687e [Kousuke Saruta] Fixed a merge conflict
e24e01b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5188
3daf1f1 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5188
3caa4cb [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5188
7cc8255 [Kousuke Saruta] Fix to use \$MVN instead of mvn
a3e908b [Kousuke Saruta] Fixed style
2db9fbf [Kousuke Saruta] Removed redirection from the logic which checks the existence of commands
1e4c7e0 [Kousuke Saruta] Used "command" command instead of "type" command
83b49b5 [Kousuke Saruta] Modified make-distribution.sh so that we use curl, not only wget to get tachyon
2015-01-28 12:43:22 -08:00
Sandy Ryza 406f6d3070 SPARK-5458. Refer to aggregateByKey instead of combineByKey in docs
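For reference, a minimal sketch of the `aggregateByKey` form the docs now point to (a hypothetical averaging example, assuming an existing SparkContext named `sc`; not taken from the Spark docs themselves):

```
val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 2)))
// aggregateByKey takes a zero value plus seqOp/combOp, so callers avoid writing
// combineByKey's createCombiner function by hand.
val sumsAndCounts = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // fold a value into the per-partition accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2))   // merge accumulators across partitions
val averages = sumsAndCounts.mapValues { case (sum, count) => sum.toDouble / count }
```
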
Author: Sandy Ryza <sandy@cloudera.com>

Closes #4251 from sryza/sandy-spark-5458 and squashes the following commits:

460827a [Sandy Ryza] Python too
d2dc160 [Sandy Ryza] SPARK-5458. Refer to aggregateByKey instead of combineByKey in docs
2015-01-28 12:41:23 -08:00
Reynold Xin c8e934ef3c [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.
and

[SPARK-5448][SQL] Make CacheManager a concrete class and field in SQLContext

Author: Reynold Xin <rxin@databricks.com>

Closes #4242 from rxin/sqlCleanup and squashes the following commits:

e351cb2 [Reynold Xin] Fixed toDataFrame.
6545c42 [Reynold Xin] More changes.
728c017 [Reynold Xin] [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.
2015-01-28 12:10:01 -08:00
Winston Chen 453d7999b8 [SPARK-5361]Multiple Java RDD <-> Python RDD conversions not working correctly
This is found through reading RDD from `sc.newAPIHadoopRDD` and writing it back using `rdd.saveAsNewAPIHadoopFile` in pyspark.

It turns out that whenever there are multiple RDD conversions from JavaRDD to PythonRDD then back to JavaRDD, the exception below happens:

```
15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7)
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
```

The test case code below reproduces it:

```
from pyspark.rdd import RDD

dl = [
    (u'2', {u'director': u'David Lean'}),
    (u'7', {u'director': u'Andrew Dominik'})
]

dl_rdd = sc.parallelize(dl)
tmp = dl_rdd._to_java_object_rdd()
tmp2 = sc._jvm.SerDe.javaToPython(tmp)
t = RDD(tmp2, sc)
t.count()

tmp = t._to_java_object_rdd()
tmp2 = sc._jvm.SerDe.javaToPython(tmp)
t = RDD(tmp2, sc)
t.count() # it blows up here during the 2nd time of conversion
```

Author: Winston Chen <wchen@quid.com>

Closes #4146 from wingchen/master and squashes the following commits:

903df7d [Winston Chen] SPARK-5361, update to toSeq based on the PR
5d90a83 [Winston Chen] SPARK-5361, make python pretty, so to pass PEP 8 checks
126be6b [Winston Chen] SPARK-5361, add in test case
4cf1187 [Winston Chen] SPARK-5361, add in test case
9f1a097 [Winston Chen] add in tuple handling while converting form python RDD back to JavaRDD
2015-01-28 11:08:44 -08:00
Kousuke Saruta 0b35fcd7f0 [SPARK-5291][CORE] Add timestamp and reason why an executor is removed to SparkListenerExecutorAdded and SparkListenerExecutorRemoved
`SparkListenerExecutorAdded` and `SparkListenerExecutorRemoved` were added recently.
I think it's useful if they carry a timestamp and, for removals, the reason why the executor was removed.
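A rough sketch of how a listener could consume the enriched events; the field names (`time`, `reason`) follow the description above and are assumptions, not confirmed signatures:

```
import org.apache.spark.scheduler._

// Illustrative listener only; the `time` and `reason` fields are assumed from the description.
class ExecutorLifecycleLogger extends SparkListener {
  override def onExecutorAdded(added: SparkListenerExecutorAdded): Unit =
    println(s"executor ${added.executorId} added at ${added.time}")
  override def onExecutorRemoved(removed: SparkListenerExecutorRemoved): Unit =
    println(s"executor ${removed.executorId} removed at ${removed.time}, reason: ${removed.reason}")
}
```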

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #4082 from sarutak/SPARK-5291 and squashes the following commits:

a026ff2 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5291
979dfe1 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5291
cf9f9080 [Kousuke Saruta] Fixed test case
1f2a89b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5291
243f2a60 [Kousuke Saruta] Modified MesosSchedulerBackendSuite
a527c35 [Kousuke Saruta] Added timestamp to SparkListenerExecutorAdded
2015-01-28 11:02:51 -08:00
Burak Yavuz eeb53bf90e [SPARK-3974][MLlib] Distributed Block Matrix Abstractions
This pull request includes the abstractions for the distributed BlockMatrix representation.
`BlockMatrix` will allow users to store very large matrices in small blocks of local matrices. Specific partitioners, such as `RowBasedPartitioner` and `ColumnBasedPartitioner`, are implemented in order to optimize addition and multiplication operations that will be added in a following PR.

This work is based on the ml-matrix repo developed at the AMPLab at UC Berkeley, CA.
https://github.com/amplab/ml-matrix

Additional thanks to rezazadeh, shivaram, and mengxr for guidance on the design.
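A hedged sketch of what constructing a `BlockMatrix` from blocks of local matrices could look like under this abstraction; since the API was still under review in this PR, the constructor arguments and method names here are assumptions, and `sc` is an existing SparkContext:

```
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// Two 2x2 diagonal blocks of a 4x4 matrix, keyed by their (blockRow, blockCol) index.
val blocks = sc.parallelize(Seq(
  ((0, 0), Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))),
  ((1, 1), Matrices.dense(2, 2, Array(2.0, 0.0, 0.0, 2.0)))))
val matrix = new BlockMatrix(blocks, rowsPerBlock = 2, colsPerBlock = 2)
matrix.validate()  // sanity-check block sizes against the declared block dimensions
```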

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Burak Yavuz <brkyvz@dn51t42l.sunet>
Author: Burak Yavuz <brkyvz@dn51t4rd.sunet>
Author: Burak Yavuz <brkyvz@dn0a221430.sunet>

Closes #3200 from brkyvz/SPARK-3974 and squashes the following commits:

a8eace2 [Burak Yavuz] Merge pull request #2 from mengxr/brkyvz-SPARK-3974
feb32a7 [Xiangrui Meng] update tests
e1d3ee8 [Xiangrui Meng] minor updates
24ec7b8 [Xiangrui Meng] update grid partitioner
5eecd48 [Burak Yavuz] fixed gridPartitioner and added tests
140f20e [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into SPARK-3974
1694c9e [Burak Yavuz] almost finished addressing comments
f9d664b [Burak Yavuz] updated API and modified partitioning scheme
eebbdf7 [Burak Yavuz] preliminary changes addressing code review
1a63b20 [Burak Yavuz] [SPARK-3974] Remove setPartition method. Isn't required
1e8bb2a [Burak Yavuz] [SPARK-3974] Change return type of cache and persist
239ab4b [Burak Yavuz] [SPARK-3974] Addressed @jkbradley's comments
ba414d2 [Burak Yavuz] [SPARK-3974] fixed frobenius norm
ab6cde0 [Burak Yavuz] [SPARK-3974] Modifications cleaning code up, making size calculation more robust
9ae85aa [Burak Yavuz] [SPARK-3974] Made partitioner a variable inside BlockMatrix instead of a constructor variable
d033861 [Burak Yavuz] [SPARK-3974] Removed SubMatrixInfo and added constructor without partitioner
49b9586 [Burak Yavuz] [SPARK-3974] Updated testing utils from master
645afbe [Burak Yavuz] [SPARK-3974] Pull latest master
b05aabb [Burak Yavuz] [SPARK-3974] Updated tests to reflect changes
19c17e8 [Burak Yavuz] [SPARK-3974] Changed blockIdRow and blockIdCol
589fbb6 [Burak Yavuz] [SPARK-3974] Code review feedback addressed
aa8f086 [Burak Yavuz] [SPARK-3974] Additional comments added
f378e16 [Burak Yavuz] [SPARK-3974] Block Matrix Abstractions ready
b693209 [Burak Yavuz] Ready for Pull request
2015-01-28 10:06:37 -08:00
Patrick Wendell 622ff09d03 MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on Github:

Closes #1480 (close requested by 'pwendell')
Closes #4205 (close requested by 'kdatta')
Closes #4114 (close requested by 'pwendell')
Closes #3382 (close requested by 'mengxr')
Closes #3933 (close requested by 'mengxr')
Closes #3870 (close requested by 'yhuai')
2015-01-28 02:15:14 -08:00
Ryan Williams 661d3f9f3e [SPARK-5415] bump sbt to version to 0.13.7
Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #4211 from ryan-williams/sbt0.13.7 and squashes the following commits:

e28476d [Ryan Williams] bump sbt to version to 0.13.7
2015-01-28 02:13:06 -08:00
Marcelo Vanzin 37a5e272f8 [SPARK-4809] Rework Guava library shading.
The current way of shading Guava is a little problematic. Code that
depends on "spark-core" does not see the transitive dependency, yet
classes in "spark-core" actually depend on Guava. So it's tricky to run
unit tests that use spark-core classes, since you need a compatible
version of Guava in your dependencies when running the tests, which
makes for a bad user experience.

This change modifies the way Guava is shaded so that it's applied
uniformly across the Spark build. This means Guava is shaded inside
spark-core itself, so that the dependency issues above are solved.
Aside from that, all Spark sub-modules have their Guava references
relocated, so that they refer to the relocated classes now packaged
inside spark-core. Before, this was only done by the time the assembly
was built, so projects that did not end up inside the assembly (such
as streaming backends) could still reference the original location
of Guava classes.

The Guava classes are added to the "first" artifact Spark generates
(network-common), so that all downstream modules have the needed
classes available. Since "network-common" is a dependency of spark-core,
all Spark apps should get the relocated classes automatically.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3658 from vanzin/SPARK-4809 and squashes the following commits:

3c93e42 [Marcelo Vanzin] Shade Guava in the network-common artifact.
5d69ec9 [Marcelo Vanzin] Merge branch 'master' into SPARK-4809
b3104fc [Marcelo Vanzin] Add comment.
941848f [Marcelo Vanzin] Merge branch 'master' into SPARK-4809
f78c48a [Marcelo Vanzin] Merge branch 'master' into SPARK-4809
8053dd4 [Marcelo Vanzin] Merge branch 'master' into SPARK-4809
107d7da [Marcelo Vanzin] Add fix for SPARK-5052 (PR #3874).
40b8723 [Marcelo Vanzin] Merge branch 'master' into SPARK-4809
4a4ed42 [Marcelo Vanzin] [SPARK-4809] Rework Guava library shading.
2015-01-28 00:29:29 -08:00
Reynold Xin d74373225e [SPARK-5097][SQL] Test cases for DataFrame expressions.
Author: Reynold Xin <rxin@databricks.com>

Closes #4235 from rxin/df-tests1 and squashes the following commits:

f341db6 [Reynold Xin] [SPARK-5097][SQL] Test cases for DataFrame expressions.
2015-01-27 18:10:49 -08:00
Reynold Xin 119f45d61d [SPARK-5097][SQL] DataFrame
This pull request redesigns the existing Spark SQL DSL, which already provides data-frame-like functionality.
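A hedged sketch of the data-frame-style DSL this redesign moves toward; the method names below (`filter`, `groupBy`, `agg`) are illustrative of the direction described here rather than a quote of the merged surface, and `sqlContext` plus the "people" table are assumed context:

```
val df = sqlContext.table("people")          // assumes a registered "people" table
df.filter(df("age") > 21)                    // column expressions instead of raw SQL strings
  .groupBy("dept")
  .agg(Map("salary" -> "avg"))
  .show()
```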

TODOs:
With the exception of Python support, other tasks can be done in separate, follow-up PRs.
- [ ] Audit of the API
- [ ] Documentation
- [ ] More test cases to cover the new API
- [x] Python support
- [ ] Type alias SchemaRDD

Author: Reynold Xin <rxin@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #4173 from rxin/df1 and squashes the following commits:

0a1a73b [Reynold Xin] Merge branch 'df1' of github.com:rxin/spark into df1
23b4427 [Reynold Xin] Mima.
828f70d [Reynold Xin] Merge pull request #7 from davies/df
257b9e6 [Davies Liu] add repartition
6bf2b73 [Davies Liu] fix collect with UDT and tests
e971078 [Reynold Xin] Missing quotes.
b9306b4 [Reynold Xin] Remove removeColumn/updateColumn for now.
a728bf2 [Reynold Xin] Example rename.
e8aa3d3 [Reynold Xin] groupby -> groupBy.
9662c9e [Davies Liu] improve DataFrame Python API
4ae51ea [Davies Liu] python API for dataframe
1e5e454 [Reynold Xin] Fixed a bug with symbol conversion.
2ca74db [Reynold Xin] Couple minor fixes.
ea98ea1 [Reynold Xin] Documentation & literal expressions.
2b22684 [Reynold Xin] Got rid of IntelliJ problems.
02bbfbc [Reynold Xin] Tightening imports.
ffbce66 [Reynold Xin] Fixed compilation error.
59b6d8b [Reynold Xin] Style violation.
b85edfb [Reynold Xin] ALS.
8c37f0a [Reynold Xin] Made MLlib and examples compile
6d53134 [Reynold Xin] Hive module.
d35efd5 [Reynold Xin] Fixed compilation error.
ce4a5d2 [Reynold Xin] Fixed test cases in SQL except ParquetIOSuite.
66d5ef1 [Reynold Xin] SQLContext minor patch.
c9bcdc0 [Reynold Xin] Checkpoint: SQL module compiles!
2015-01-27 16:08:24 -08:00
Sandy Ryza b1b35ca2e4 SPARK-5199. FS read metrics should support CombineFileSplits and track bytes from all FSs

Author: Sandy Ryza <sandy@cloudera.com>

Closes #4050 from sryza/sandy-spark-5199 and squashes the following commits:

864514b [Sandy Ryza] Add tests and fix bug
0d504f1 [Sandy Ryza] Prettify
915c7e6 [Sandy Ryza] Get metrics from all filesystems
cdbc3e8 [Sandy Ryza] SPARK-5199. Input metrics should show up for InputFormats that return CombineFileSplits
2015-01-27 15:42:55 -08:00
Davies Liu fdaad4eb03 [MLlib] fix python example of ALS in guide
fix python example of ALS in guide, use Rating instead of np.array.

Author: Davies Liu <davies@databricks.com>

Closes #4226 from davies/fix_als_guide and squashes the following commits:

1433d76 [Davies Liu] fix python example of als in guide
2015-01-27 15:33:01 -08:00
Sean Owen ff356e2a21 SPARK-5308 [BUILD] MD5 / SHA1 hash format doesn't match standard Maven output
Here's one way to make the hashes match what Maven's plugins would create. It takes a little extra footwork since OS X doesn't have the same command line tools. An alternative is just to make Maven output these of course - would that be better? I ask in case there is a reason I'm missing, like, we need to hash files that Maven doesn't build.

Author: Sean Owen <sowen@cloudera.com>

Closes #4161 from srowen/SPARK-5308 and squashes the following commits:

70d09d0 [Sean Owen] Use $(...) syntax
e25eff8 [Sean Owen] Generate MD5, SHA1 hashes in a format like Maven's plugin
2015-01-27 10:22:50 -08:00
Burak Yavuz 914267484a [SPARK-5321] Support for transposing local matrices
Support for transposing local matrices has been added. The `.transpose` function creates a new object that re-uses the backing array(s) but swaps `numRows` and `numCols`. Operations check the `.isTransposed` flag to see whether the indexing into `values` should be adjusted.

This PR will pave the way for transposing `BlockMatrix`.
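A minimal usage sketch of the behavior described above, assuming the MLlib local `DenseMatrix` API; the values are illustrative only:

```
import org.apache.spark.mllib.linalg.DenseMatrix

val m = new DenseMatrix(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))  // column-major 2x3
val mT = m.transpose   // re-uses the same backing array, swaps numRows and numCols
// Indexing consults isTransposed under the hood, so mT(i, j) == m(j, i).
assert(mT.numRows == 3 && mT.numCols == 2)
assert(mT(2, 1) == m(1, 2))
```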

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #4109 from brkyvz/SPARK-5321 and squashes the following commits:

87ab83c [Burak Yavuz] fixed scalastyle
caf4438 [Burak Yavuz] addressed code review v3
c524770 [Burak Yavuz] address code review comments 2
77481e8 [Burak Yavuz] fixed MiMa
f1c1742 [Burak Yavuz] small refactoring
ccccdec [Burak Yavuz] fixed failed test
dd45c88 [Burak Yavuz] addressed code review
a01bd5f [Burak Yavuz] [SPARK-5321] Fixed MiMa issues
2a63593 [Burak Yavuz] [SPARK-5321] fixed bug causing failed gemm test
c55f29a [Burak Yavuz] [SPARK-5321] Support for transposing local matrices cleaned up
c408c05 [Burak Yavuz] [SPARK-5321] Support for transposing local matrices added
2015-01-27 01:46:17 -08:00
Liang-Chi Hsieh 7b0ed79795 [SPARK-5419][Mllib] Fix the logic in Vectors.sqdist
The current implementation of Vectors.sqdist is not efficient because it allocates temp arrays. There is also a bug in the code `v1.indices.length / v1.size < 0.5` (integer division, so the ratio is always 0). This PR fixes the bug and refactors sqdist so that no new arrays are allocated.
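A usage sketch of the code path this touches (illustrative values, assuming the public `Vectors.sqdist` helper):

```
import org.apache.spark.mllib.linalg.Vectors

val dense  = Vectors.dense(1.0, 0.0, 3.0)
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 1.0))
// Squared Euclidean distance: (1-1)^2 + (0-0)^2 + (3-1)^2 = 4.0
val d2 = Vectors.sqdist(dense, sparse)
```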

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4217 from viirya/fix_sqdist and squashes the following commits:

e8b0b3d [Liang-Chi Hsieh] For review comments.
314c424 [Liang-Chi Hsieh] Fix sqdist bug.
2015-01-27 01:29:14 -08:00
MechCoder d6894b1c53 [SPARK-3726] [MLlib] Allow sampling_rate not equal to 1.0 in RandomForests
I've added support for sampling_rate not equal to 1.0. I have two major questions.

1. A Scala style test is failing, since the number of parameters now exceeds 10.
2. I would like suggestions to understand how to test this.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4073 from MechCoder/spark-3726 and squashes the following commits:

8012fb2 [MechCoder] Add test in Strategy
e0e0d9c [MechCoder] TST: Add better test
d1df1b2 [MechCoder] Add test to verify subsampling behavior
a7bfc70 [MechCoder] [SPARK-3726] Allow sampling_rate not equal to 1.0
2015-01-26 19:46:17 -08:00
lewuathe f2ba5c6fc3 [SPARK-5119] java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model

Labels loaded from libsvm files are mapped to 0.0 if they are negative, because labels should be nonnegative values.

Author: lewuathe <lewuathe@me.com>

Closes #3975 from Lewuathe/map-negative-label-to-positive and squashes the following commits:

12d1d59 [lewuathe] [SPARK-5119] Fix code styles
6d9a18a [lewuathe] [SPARK-5119] Organize test codes
62a150c [lewuathe] [SPARK-5119] Modify Impurities throw exceptions with negatie labels
3336c21 [lewuathe] [SPARK-5119] java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model
2015-01-26 18:03:21 -08:00
Elmer Garduno 661e0fca5d [SPARK-5052] Add common/base classes to fix guava methods signatures.
Fixes problems with incorrect method signatures related to shaded classes. For discussion see the jira issue.

Author: Elmer Garduno <elmerg@google.com>

Closes #3874 from elmer-garduno/fix_guava_signatures and squashes the following commits:

aa5d8e0 [Elmer Garduno] Unshade common/base[Function|Supplier] classes to fix guava methods signatures.
2015-01-26 17:40:48 -08:00
Sean Owen 0497ea51ac SPARK-960 [CORE] [TEST] JobCancellationSuite "two jobs sharing the same stage" is broken
This reenables and fixes this test, after addressing two issues:

- The Semaphore that was intended to be shared locally was being serialized and copied; it's now a static member in the companion object as in other tests
- Later changes to Spark means that cancelling the first task will not cancel the shared stage and therefore the second task should succeed

Author: Sean Owen <sowen@cloudera.com>

Closes #4180 from srowen/SPARK-960 and squashes the following commits:

43da66f [Sean Owen] Fix 'two jobs sharing the same stage' test and reenable it: truly share a Semaphore locally as intended, and update expectation of failure in non-cancelled task
2015-01-26 14:32:27 -08:00
David Y. Ross b38034e878 Fix command spaces issue in make-distribution.sh
Storing commands in variables is tricky in bash; use an array
to handle issues with spaces, quoting, etc.
See: http://mywiki.wooledge.org/BashFAQ/050

Author: David Y. Ross <dyross@gmail.com>

Closes #4126 from dyross/dyr-fix-make-distribution and squashes the following commits:

4ce522b [David Y. Ross] Fix command spaces issue in make-distribution.sh
2015-01-26 14:26:10 -08:00
Sean Owen 54e7b456dd SPARK-4147 [CORE] Reduce log4j dependency
Defer use of log4j class until it's known that log4j 1.2 is being used. This may avoid dealing with log4j dependencies for callers that reroute slf4j to another logging framework. The only change is to push one half of the check in the original `if` condition inside. This is a trivial change, may or may not actually solve a problem, but I think it's all that makes sense to do for SPARK-4147.

Author: Sean Owen <sowen@cloudera.com>

Closes #4190 from srowen/SPARK-4147 and squashes the following commits:

4e99942 [Sean Owen] Defer use of log4j class until it's known that log4j 1.2 is being used. This may avoid dealing with log4j dependencies for callers that reroute slf4j to another logging framework.
2015-01-26 14:23:42 -08:00
Kousuke Saruta c094c73270 [SPARK-5339][BUILD] build/mvn doesn't work because of invalid URL for maven's tgz.
build/mvn automatically downloads a tarball of Maven, but currently the URL is invalid.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #4124 from sarutak/SPARK-5339 and squashes the following commits:

6e96121 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5339
0e012d1 [Kousuke Saruta] Updated Maven version to 3.2.5
ca26499 [Kousuke Saruta] Fixed URL of the tarball of Maven
2015-01-26 13:07:49 -08:00
Davies Liu 142093179a [SPARK-5355] use j.u.c.ConcurrentHashMap instead of TrieMap
j.u.c.ConcurrentHashMap is more battle tested.

cc rxin JoshRosen pwendell
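For context, a small sketch of the kind of thread-safe settings map this moves to (illustrative only, not the actual SparkConf internals):

```
import java.util.concurrent.ConcurrentHashMap

// ConcurrentHashMap gives the same thread-safe get/put semantics TrieMap was providing.
val settings = new ConcurrentHashMap[String, String]()
settings.put("spark.app.name", "demo")
val appName = Option(settings.get("spark.app.name")).getOrElse("<unset>")
```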

Author: Davies Liu <davies@databricks.com>

Closes #4208 from davies/safe-conf and squashes the following commits:

c2182dc [Davies Liu] address comments, fix tests
3a1d821 [Davies Liu] fix test
da14ced [Davies Liu] Merge branch 'master' of github.com:apache/spark into safe-conf
ae4d305 [Davies Liu] change to j.u.c.ConcurrentMap
f8fa1cf [Davies Liu] change to TrieMap
a1d769a [Davies Liu] make SparkConf thread-safe
2015-01-26 12:51:32 -08:00
Yuhao Yang 81251682ed [SPARK-5384][mllib] Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths
JIRA issue: https://issues.apache.org/jira/browse/SPARK-5384
Currently `Vectors.sqdist` returns inconsistent results for sparse/dense vectors when the vectors have different lengths; please refer to the JIRA for a sample.

PR scope:
Unify the sqdist logic for dense/sparse vectors and fix the inconsistency, also remove the possible sparse to dense conversion in the original code.

For reviewers:
Maybe we should first discuss what's the correct behavior.
1. Vectors for sqdist must have the same length, like in breeze?
2. If they can have different lengths, what's the correct result for sqdist? (should the extra part get into calculation?)

I'll update the PR with further optimizations and additional unit tests afterwards. Thanks.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #4183 from hhbyyh/fixDouble and squashes the following commits:

1f17328 [Yuhao Yang] limit PR scope to size constraints only
54cbf97 [Yuhao Yang] fix Vectors.sqdist inconsistence
2015-01-25 22:18:09 -08:00
CodingCat 8df9435512 [SPARK-5268] don't stop CoarseGrainedExecutorBackend for irrelevant DisassociatedEvent
https://issues.apache.org/jira/browse/SPARK-5268

In CoarseGrainedExecutorBackend, we subscribe to DisassociatedEvent in the executor backend actor and exit the program upon receiving such an event...

let's consider the following case

The user may develop an Akka-based program which starts an actor with Spark's actor system and communicates with an external actor system (e.g. an Akka-based receiver in Spark Streaming which communicates with an external system). If the external actor system fails or deliberately disassociates from the actor within Spark's system, we may receive a DisassociatedEvent and the executor is restarted.

This is not the expected behavior.

----

This is a simple fix to check the event's relevance before making the quit decision.
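A hedged sketch of the kind of relevance check described here (illustrative, not the exact patch): compare the event's remote address against the driver's before deciding to exit.

```
import akka.actor.Address
import akka.remote.DisassociatedEvent

// Only treat the event as fatal when it concerns the driver's actor system.
def handleDisassociated(event: DisassociatedEvent, driverAddress: Address): Unit = {
  if (event.remoteAddress == driverAddress) {
    // the driver is gone, so the executor backend should shut down
    sys.exit(1)
  } else {
    println(s"Ignoring DisassociatedEvent from unrelated remote ${event.remoteAddress}")
  }
}
```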

Author: CodingCat <zhunansjtu@gmail.com>

Closes #4063 from CodingCat/SPARK-5268 and squashes the following commits:

4d7d48e [CodingCat] simplify the log
18c36f4 [CodingCat] more descriptive log
f299e0b [CodingCat] clean log
1632e79 [CodingCat] check whether DisassociatedEvent is relevant before quit
2015-01-25 19:28:53 -08:00
Sean Owen 0528b85cf9 SPARK-4430 [STREAMING] [TEST] Apache RAT Checks fail spuriously on test files
Another trivial one. The RAT failure was due to temp files from `FailureSuite` not being cleaned up. This just makes the cleanup more reliable by using the standard temp dir mechanism.

Author: Sean Owen <sowen@cloudera.com>

Closes #4189 from srowen/SPARK-4430 and squashes the following commits:

9ea63ff [Sean Owen] Properly acquire a temp directory to ensure it is cleaned up at shutdown, which helps avoid a RAT check failure
2015-01-25 19:16:44 -08:00
Kay Ousterhout fc2168f04e [SPARK-5326] Show fetch wait time as optional metric in the UI
With this change, here's what the UI looks like:

![image](https://cloud.githubusercontent.com/assets/1108612/5809994/1ec8a904-9ff4-11e4-8f24-6a59a1a858f7.png)

If you want to locally test this, you need to spin up multiple executors, because the shuffle read metrics are only shown for data read remotely.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #4110 from kayousterhout/SPARK-5326 and squashes the following commits:

610051e [Kay Ousterhout] Josh style comments
5feaa28 [Kay Ousterhout] What is the difference here??
aa129cb [Kay Ousterhout] Removed inadvertent change
721c742 [Kay Ousterhout] Improved tooltip
f3a7111 [Kay Ousterhout] Style fix
679b4e9 [Kay Ousterhout] [SPARK-5326] Show fetch wait time as optional metric in the UI
2015-01-25 16:48:26 -08:00
Kousuke Saruta 8f5c827b01 [SPARK-5344][WebUI] HistoryServer cannot recognize that inprogress file was renamed to completed file
`FsHistoryProvider` tries to update the application status, but if `checkForLogs` is called before the `.inprogress` file is renamed to the completed file, the file is not recognized as completed.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #4132 from sarutak/SPARK-5344 and squashes the following commits:

9658008 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5344
d2c72b6 [Kousuke Saruta] Fixed update issue of FsHistoryProvider
2015-01-25 15:34:20 -08:00
Sean Owen 9f6435763d SPARK-4506 [DOCS] Addendum: Update more docs to reflect that standalone works in cluster mode
This is a trivial addendum to SPARK-4506, which was already resolved, as noted by Asim Jalis in SPARK-4506.

Author: Sean Owen <sowen@cloudera.com>

Closes #4160 from srowen/SPARK-4506 and squashes the following commits:

5f5f7df [Sean Owen] Update more docs to reflect that standalone works in cluster mode
2015-01-25 15:25:05 -08:00
Jacek Lewandowski 1c30afdf94 SPARK-5382: Use SPARK_CONF_DIR in spark-class if it is defined
Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>

Closes #4179 from jacek-lewandowski/SPARK-5382-1.3 and squashes the following commits:

55d7791 [Jacek Lewandowski] SPARK-5382: Use SPARK_CONF_DIR in spark-class if it is defined
2015-01-25 15:15:09 -08:00
Sean Owen 383425ab70 SPARK-3782 [CORE] Direct use of log4j in AkkaUtils interferes with certain logging configurations
Although the underlying issue can, I think, be solved by having user code use slf4j 1.7.6+, it might be helpful and consistent to update Spark's slf4j too. I see no reason to believe it would be incompatible with other 1.7.x releases: http://www.slf4j.org/news.html  Lots of different versions of slf4j are in use in the wild, and anecdotally I have never seen an issue mixing them.

Author: Sean Owen <sowen@cloudera.com>

Closes #4184 from srowen/SPARK-3782 and squashes the following commits:

5608d28 [Sean Owen] Update slf4j to 1.7.10
2015-01-25 15:11:57 -08:00
Sean Owen c586b45dd2 SPARK-3852 [DOCS] Document spark.driver.extra* configs
As per the JIRA. I copied the `spark.executor.extra*` text, but removed info that appears to be specific to the `executor` config and not `driver`.

Author: Sean Owen <sowen@cloudera.com>

Closes #4185 from srowen/SPARK-3852 and squashes the following commits:

f60a8a1 [Sean Owen] Document spark.driver.extra* configs
2015-01-25 15:08:35 -08:00
Ryan Williams aea25482c3 [SPARK-5402] log executor ID at executor-construction time
also rename "slaveHostname" to "executorHostname"

Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #4195 from ryan-williams/exec and squashes the following commits:

e60a7bb [Ryan Williams] log executor ID at executor-construction time
2015-01-25 14:20:02 -08:00
Ryan Williams 2d9887bae5 [SPARK-5401] set executor ID before creating MetricsSystem
Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #4194 from ryan-williams/metrics and squashes the following commits:

7c5a33f [Ryan Williams] set executor ID before creating MetricsSystem
2015-01-25 14:17:59 -08:00
Idan Zalzberg 412a58e118 Add comment about defaultMinPartitions
Added a comment about using math.min for choosing default partition count
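For reference, a one-line sketch of the choice the added comment documents (mirroring, but not copied from, SparkContext.defaultMinPartitions):

```
// Cap the default at 2 so that small inputs are not split across every core of a large cluster.
def defaultMinPartitions(defaultParallelism: Int): Int = math.min(defaultParallelism, 2)
```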

Author: Idan Zalzberg <idanzalz@gmail.com>

Closes #4102 from idanz/patch-2 and squashes the following commits:

50e9d58 [Idan Zalzberg] Update SparkContext.scala
2015-01-25 11:28:05 -08:00
Reynold Xin d22ca1e921 Closes #4157 2015-01-25 00:24:59 -08:00
zsxwing 0d1e67ee9b [SPARK-5214][Test] Add a test to demonstrate EventLoop can be stopped in the event thread
Author: zsxwing <zsxwing@gmail.com>

Closes #4174 from zsxwing/SPARK-5214-unittest and squashes the following commits:

443e564 [zsxwing] Change the check interval to 5ms
7aaa2d7 [zsxwing] Add a test to demonstrate EventLoop can be stopped in the event thread
2015-01-24 11:00:35 -08:00
Jongyoul Lee 09e09c548e [SPARK-5058] Part 2. Typos and broken URL
- Also fixed java link

Author: Jongyoul Lee <jongyoul@gmail.com>

Closes #4172 from jongyoul/SPARK-FIXDOC and squashes the following commits:

6be03e5 [Jongyoul Lee] [SPARK-5058] Part 2. Typos and broken URL - Also fixed java link
2015-01-23 23:34:11 -08:00
Takeshi Yamamuro e224dbb011 [SPARK-5351][GraphX] Do not use Partitioner.defaultPartitioner as a partitioner of EdgeRDDImp...
If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition (EdgeRDDImpl),
the following error occurs in ReplicatedVertexView.scala:72:

```
object GraphTest extends Logging {
  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = {
    graph.aggregateMessages(
      ctx => {
        ctx.sendToSrc(1)
        ctx.sendToDst(2)
      },
      _ + _)
  }
}

val g = GraphLoader.edgeListFile(sc, "graph.txt")
val rdd = GraphTest.run(g)
```

```
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
	at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
	at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82)
	at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
	at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193)
	at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191)
    ...
```

Author: Takeshi Yamamuro <linguin.m.s@gmail.com>

Closes #4136 from maropu/EdgePartitionBugFix and squashes the following commits:

0cd8942 [Ankur Dave] Use more concise getOrElse
aad4a2c [Ankur Dave] Add unit test for non-default number of edge partitions
0a2f32b [Takeshi Yamamuro] Do not use Partitioner.defaultPartitioner as a partitioner of EdgeRDDImpl
2015-01-23 19:26:39 -08:00
Josh Rosen cef1f092a6 [SPARK-5063] More helpful error messages for several invalid operations
This patch adds more helpful error messages for invalid programs that define nested RDDs, broadcast RDDs, perform actions inside of transformations (e.g. calling `count()` from inside of `map()`), and call certain methods on stopped SparkContexts.  Currently, these invalid programs lead to confusing NullPointerExceptions at runtime and have been a major source of questions on the mailing list and StackOverflow.

In a few cases, I chose to log warnings instead of throwing exceptions in order to avoid any chance that this patch breaks programs that worked "by accident" in earlier Spark releases (e.g. programs that define nested RDDs but never run any jobs with them).

In SparkContext, the new `assertNotStopped()` method is used to check whether methods are being invoked on a stopped SparkContext.  In some cases, user programs will not crash in spite of calling methods on stopped SparkContexts, so I've only added `assertNotStopped()` calls to methods that always throw exceptions when called on stopped contexts (e.g. by dereferencing a null `dagScheduler` pointer).
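An example of the kind of invalid program these messages now flag (a hypothetical snippet, assuming an existing SparkContext named `sc`):

```
val rdd = sc.parallelize(1 to 10)
// Referencing the SparkContext, or running an action, inside a transformation is invalid;
// previously this surfaced as a confusing NullPointerException on the executors.
val broken = rdd.map { x => sc.parallelize(Seq(x)).count() }
broken.collect()
```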

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3884 from JoshRosen/SPARK-5063 and squashes the following commits:

a38774b [Josh Rosen] Fix spelling typo
a943e00 [Josh Rosen] Convert two exceptions into warnings in order to avoid breaking user programs in some edge-cases.
2d0d7f7 [Josh Rosen] Fix test to reflect 1.2.1 compatibility
3f0ea0c [Josh Rosen] Revert two unintentional formatting changes
8e5da69 [Josh Rosen] Remove assertNotStopped() calls for methods that were sometimes safe to call on stopped SC's in Spark 1.2
8cff41a [Josh Rosen] IllegalStateException fix
6ef68d0 [Josh Rosen] Fix Python line length issues.
9f6a0b8 [Josh Rosen] Add improved error messages to PySpark.
13afd0f [Josh Rosen] SparkException -> IllegalStateException
8d404f3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-5063
b39e041 [Josh Rosen] Fix BroadcastSuite test which broadcasted an RDD
99cc09f [Josh Rosen] Guard against calling methods on stopped SparkContexts.
34833e8 [Josh Rosen] Add more descriptive error message.
57cc8a1 [Josh Rosen] Add error message when directly broadcasting RDD.
15b2e6b [Josh Rosen] [SPARK-5063] Useful error messages for nested RDDs and actions inside of transformations
2015-01-23 17:53:15 -08:00
Xiangrui Meng ea74365b7c [SPARK-3541][MLLIB] New ALS implementation with improved storage
This PR adds a new ALS implementation to `spark.ml` using the pipeline API, which should be able to scale to billions of ratings. Compared with the ALS under `spark.mllib`, the new implementation

1. uses the same algorithm,
2. uses float type for ratings,
3. uses primitive arrays to avoid GC,
4. sorts and compresses ratings on each block so that we can solve least squares subproblems one by one using only one normal equation instance.

The following figure shows performance comparison on copies of the Amazon Reviews dataset using a 16-node (m3.2xlarge) EC2 cluster (the same setup as in http://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html):
![als-wip](https://cloud.githubusercontent.com/assets/829644/5659447/4c4ff8e0-96c7-11e4-87a9-73c1c63d07f3.png)

I keep the `spark.mllib`'s ALS untouched for easy comparison. If the new implementation works well, I'm going to match the features of the ALS under `spark.mllib` and then make it a wrapper of the new implementation, in a separate PR.
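A hedged sketch of the pipeline-API usage this new implementation targets; the setter names below follow `spark.ml` conventions and are assumptions rather than a quote of the merged API, and `trainingData` is an assumed DataFrame of (userId, itemId, rating) rows:

```
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setRank(10)
  .setMaxIter(10)
  .setRegParam(0.1)
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
val model = als.fit(trainingData)  // fit on the assumed (userId, itemId, rating) DataFrame
```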

TODO:
- [X] Add unit tests for implicit preferences.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3720 from mengxr/SPARK-3541 and squashes the following commits:

1b9e852 [Xiangrui Meng] fix compile
5129be9 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3541
dd0d0e8 [Xiangrui Meng] simplify test code
c627de3 [Xiangrui Meng] add tests for implicit feedback
b84f41c [Xiangrui Meng] address comments
a76da7b [Xiangrui Meng] update ALS tests
2a8deb3 [Xiangrui Meng] add some ALS tests
857e876 [Xiangrui Meng] add tests for rating block and encoded block
d3c1ac4 [Xiangrui Meng] rename some classes for better code readability add more doc and comments
213d163 [Xiangrui Meng] org imports
771baf3 [Xiangrui Meng] chol doc update
ca9ad9d [Xiangrui Meng] add unit tests for chol
b4fd17c [Xiangrui Meng] add unit tests for NormalEquation
d0f99d3 [Xiangrui Meng] add tests for LocalIndexEncoder
80b8e61 [Xiangrui Meng] fix imports
4937fd4 [Xiangrui Meng] update ALS example
56c253c [Xiangrui Meng] rename product to item
bce8692 [Xiangrui Meng] doc for parameters and project the output columns
3f2d81a [Xiangrui Meng] add doc
1efaecf [Xiangrui Meng] add example code
8ae86b5 [Xiangrui Meng] add a working copy of the new ALS implementation
2015-01-22 22:09:13 -08:00
jerryshao e0f7fb7f9f [SPARK-5315][Streaming] Fix reduceByWindow Java API not work bug
The Java API's `reduceByWindow` is actually not Java compatible; this change makes it Java compatible.

The current solution is to deprecate the old API and add a new one, but since the old API is actually incorrect, is keeping it around meaningful just to preserve binary compatibility? Also, even adding a new API still requires a Mima exclusion. I'm not sure whether changing the API, or deprecating the old one and adding a new one, is the better solution.

Author: jerryshao <saisai.shao@intel.com>

Closes #4104 from jerryshao/SPARK-5315 and squashes the following commits:

5bc8987 [jerryshao] Address the comment
c7aa1b4 [jerryshao] Deprecate the old one to keep binary compatible
8e9dc67 [jerryshao] Fix JavaDStream reduceByWindow signature error
2015-01-22 22:04:21 -08:00
jerryshao 3c3fa632e6 [SPARK-5233][Streaming] Fix error replaying of WAL introduced bug
Because `BlockAllocationEvent` is missing from WAL recovery, the dangling events mix into the new batch, which leads to wrong results. Details can be seen in [SPARK-5233](https://issues.apache.org/jira/browse/SPARK-5233).

Author: jerryshao <saisai.shao@intel.com>

Closes #4032 from jerryshao/SPARK-5233 and squashes the following commits:

f0b0c0b [jerryshao] Further address the comments
a237c75 [jerryshao] Address the comments
e356258 [jerryshao] Fix bug in unit test
558bdc3 [jerryshao] Correctly replay the WAL log when recovering from failure
2015-01-22 21:58:53 -08:00
Sandy Ryza 820ce03597 SPARK-5370. [YARN] Remove some unnecessary synchronization in YarnAllocator

Author: Sandy Ryza <sandy@cloudera.com>

Closes #4164 from sryza/sandy-spark-5370 and squashes the following commits:

0c8d736 [Sandy Ryza] SPARK-5370. [YARN] Remove some unnecessary synchronization in YarnAllocator
2015-01-22 13:49:35 -06:00
Liang-Chi Hsieh 246111d179 [SPARK-5365][MLlib] Refactor KMeans to reduce redundant data
If a point is selected as a new center for many runs, it collects a lot of redundant data. This PR refactors that.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4159 from viirya/small_refactor_kmeans and squashes the following commits:

25487e6 [Liang-Chi Hsieh] Refactor codes to reduce redundant data.
2015-01-22 08:16:35 -08:00
Tathagata Das 3027f06b41 [SPARK-5147][Streaming] Delete the received data WAL log periodically
This is a refactored fix based on jerryshao 's PR #4037
This enabled deletion of old WAL files containing the received block data.
Improvements over #4037
- Respecting the rememberDuration of all receiver streams. In #4037, if there were two receiver streams with different remember durations, deletion would have been based on the shortest remember duration, thus deleting data prematurely for the receiver stream with the longer remember duration.
- Added unit test to test creation of receiver WAL, automatic deletion, and respecting of remember duration.

jerryshao I am going to merge this ASAP to make it 1.2.1 Thanks for the initial draft of this PR. Made my job much easier.

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: jerryshao <saisai.shao@intel.com>

Closes #4149 from tdas/SPARK-5147 and squashes the following commits:

730798b [Tathagata Das] Added comments.
c4cf067 [Tathagata Das] Minor fixes
2579b27 [Tathagata Das] Refactored the fix to make sure that the cleanup respects the remember duration of all the receiver streams
2736fd1 [jerryshao] Delete the old WAL log periodically
2015-01-21 23:41:44 -08:00
Basin fcb3e1862f [SPARK-5317]Set BoostingStrategy.defaultParams With Enumeration Algo.Classification or Algo.Regression
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-5317
When setting BoostingStrategy.defaultParams("Classification"), it's more straightforward to set it with the enumeration Algo.Classification, as in BoostingStrategy.defaultParams(Algo.Classification).
I overload the method BoostingStrategy.defaultParams().
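A short sketch of the two call forms described above (assuming the MLlib tree configuration API of this era):

```
import org.apache.spark.mllib.tree.configuration.{Algo, BoostingStrategy}

val byString = BoostingStrategy.defaultParams("Classification")    // existing string-based form
val byEnum   = BoostingStrategy.defaultParams(Algo.Classification) // new enumeration-based overload
```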

Author: Basin <jpsachilles@gmail.com>

Closes #4103 from Peishen-Jia/stragetyAlgo and squashes the following commits:

87bab1c [Basin] Docs and Code documentations updated.
3b72875 [Basin] defaultParams(algoStr: String) call defaultParams(algo: Algo).
7c1e6ee [Basin] Doc of Java updated. algo -> algoStr instead.
d5c8a2e [Basin] Merge branch 'stragetyAlgo' of github.com:Peishen-Jia/spark into stragetyAlgo
65f96ce [Basin] mllib-ensembles doc modified.
e04a5aa [Basin] boostingstrategy.defaultParam string algo to enumeration.
68cf544 [Basin] mllib-ensembles doc modified.
a4aea51 [Basin] boostingstrategy.defaultParam string algo to enumeration.
2015-01-21 23:06:34 -08:00
Xiangrui Meng ca7910d6dd [SPARK-3424][MLLIB] cache point distances during k-means|| init
This PR ports the following feature implemented in #2634 by derrickburns:

* During k-means|| initialization, we should cache costs (squared distances) previously computed.

It also contains the following optimizations:

* aggregate sumCosts directly
* run multiple (#runs) k-means++ initializations in parallel

I compared the performance locally on mnist-digit. Before this patch:

![before](https://cloud.githubusercontent.com/assets/829644/5845647/93080862-a172-11e4-9a35-044ec711afc4.png)

with this patch:

![after](https://cloud.githubusercontent.com/assets/829644/5845653/a47c29e8-a172-11e4-8e9f-08db57fe3502.png)

It is clear that each k-means|| iteration takes about the same amount of time with this patch.

Authors:
  Derrick Burns <derrickburns@gmail.com>
  Xiangrui Meng <meng@databricks.com>

Closes #4144 from mengxr/SPARK-3424-kmeans-parallel and squashes the following commits:

0a875ec [Xiangrui Meng] address comments
4341bb8 [Xiangrui Meng] do not re-compute point distances during k-means||
2015-01-21 21:21:07 -08:00
Cheng Hao 27bccc5ea9 [SPARK-5202] [SQL] Add hql variable substitution support
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution

This is a blocking issue for CLI users; it impacts existing hql scripts from Hive.
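A hedged sketch of the Hive-style variable substitution this enables; the `hivevar` namespace follows the linked Hive documentation, while the Spark-side entry point (`hiveContext.sql`) is assumed context for illustration:

```
// Set a substitution variable, then reference it from an hql statement.
hiveContext.sql("SET hivevar:target_table=src")
hiveContext.sql("SELECT key, value FROM ${hivevar:target_table} LIMIT 10")
```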

Author: Cheng Hao <hao.cheng@intel.com>

Closes #4003 from chenghao-intel/substitution and squashes the following commits:

bb41fd6 [Cheng Hao] revert the removed the implicit conversion
af7c31a [Cheng Hao] add hql variable substitution support
2015-01-21 17:34:18 -08:00