Commit graph

4249 commits

Author SHA1 Message Date
Prashant Sharma bfbd7e5d9f Fixed some scala warnings in core. 2013-10-10 15:22:31 +05:30
Prashant Sharma 34da58ae50 Changed message-frame-size to maximum-frame-size as property.
Removed a test accidentally added during merge.
2013-10-10 15:13:44 +05:30
Prashant Sharma 026ab75661 Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10 2013-10-10 09:42:55 +05:30
Prashant Sharma 26860639c5 Merge branch 'scala-2.10' of github.com:ScrapCodes/spark into scala-2.10
Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
	project/SparkBuild.scala
2013-10-10 09:42:23 +05:30
Reynold Xin 320418f7c8 Merge pull request #49 from mateiz/kryo-fix-2
Fix Chill serialization of Range objects

It used to write out each element one by one, creating very large objects.
2013-10-09 16:55:30 -07:00
Reynold Xin 215238cb39 Merge pull request #50 from kayousterhout/SPARK-908
Fix race condition in SparkListenerSuite (fixes SPARK-908).
2013-10-09 16:49:44 -07:00
Matei Zaharia c84c205289 Fix Chill serialization of Range objects, which used to write out each
element, and register user and Spark classes before Chill's serializers
to let them override Chill's behavior in general.
2013-10-09 16:23:40 -07:00
Kay Ousterhout 36966f65df Style fixes 2013-10-09 15:36:34 -07:00
Kay Ousterhout 3f7e9b265c Fixed comment to use javadoc style 2013-10-09 15:23:04 -07:00
Kay Ousterhout a34a4e8174 Fix race condition in SparkListenerSuite (fixes SPARK-908). 2013-10-09 15:07:53 -07:00
Matei Zaharia 7827efc87b Merge pull request #46 from mateiz/py-sort-update
Fix PySpark docs and an overly long line of code after #38

Just noticed these after merging that commit (https://github.com/apache/incubator-spark/pull/38).
2013-10-09 15:07:25 -07:00
Patrick Wendell 7b3ae04ea7 Merge pull request #45 from pwendell/metrics_units
Use standard abbreviation in metrics description (MBytes -> MB)

This is a small change - older commits are shown here because Github hasn't sync'ed yet with apache.
2013-10-09 12:14:19 -07:00
Matei Zaharia 478b2b7edc Fix PySpark docs and an overly long line of code after fdbae41e 2013-10-09 12:08:04 -07:00
Matei Zaharia b4fa11f6c9 Merge pull request #38 from AndreSchumacher/pyspark_sorting
SPARK-705: implement sortByKey() in PySpark

This PR contains the implementation of a RangePartitioner in Python and uses its partition ID's to get a global sort in PySpark.
2013-10-09 11:59:47 -07:00
Patrick Wendell bd3bcc5f8e Use standard abbreviations in metrics labels 2013-10-09 11:16:24 -07:00
Patrick Wendell 19d445d37c Merge pull request #22 from GraceH/metrics-naming
SPARK-900 Use coarser grained naming for metrics

see SPARK-900 Use coarser grained naming for metrics.
Now the new metric name is formatted as {XXX.YYY.ZZZ.COUNTER_UNIT}, XXX.YYY.ZZZ represents the group name, which can group several metrics under the same Ganglia view.
2013-10-09 11:08:34 -07:00
Matei Zaharia 7d50f9f87b Merge pull request #35 from MartinWeindel/scala-2.10
Fixing inconsistencies and warnings on Scala 2.10 branch

- scala 2.10 requires Java 1.6
- using newest Scala release: 2.10.3
- resolved maven-scala-plugin warning
- fixed various compiler warnings
2013-10-09 10:32:42 -07:00
Matei Zaharia 3218fa795f Merge pull request #4 from MLnick/implicit-als
Adding algorithm for implicit feedback data to ALS

This PR adds the commonly used "implicit feedack" variant to ALS.

The implementation is based in part on Mahout's implementation, which is in turn based on [Collaborative Filtering for Implicit Feedback Datasets](http://research.yahoo.com/pub/2433). It has been adapted for the blocked approach used in MLlib.

I have tested this implementation against the MovieLens 100k, 1m and 10m datasets, and confirmed that it produces the same RMSE score as Mahout, as well as my own port of Mahout's implicit ALS implementation to Spark (not that RMSE is necessarily the best metric to judge by for implicit feedback, but it provides a consistent metric for comparison).

It turned out to be more straightforward than I had thought to add this. The main additions are:
1. Adding `implicitPrefs` boolean flag and `alpha` parameter
2. Added the `computeYtY` method. In each least-squares step, the algorithm requires the computation of `YtY`, where `Y` is the {user, item} factor matrix. Since the factors are already block-distributed in an `RDD`, this is quite straightforward to compute but does add an extra operation over the explicit version (but only twice per iteration)
3. Finally the actual solve step in `updateBlock` boils down to:
    * a multiplication of the `XtX` matrix by `alpha * rating`
    * a multiplication of the `Xty` vector by `1 + alpha * rating`
    * when solving for the factor vector, the implicit variant adds the `YtY` matrix to the LHS
4. Added `trainImplicit` methods in the `ALS` object
5. Added test cases for both Scala and Java - based on achieving a confidence-weighted RMSE score < 0.4 (this is taken from Mahout's test cases)

It would be great to get some feedback on this and have people test things out against some datasets (MovieLens and others and perhaps proprietary datasets) both locally and on a cluster if possible. I have not yet tested on a cluster but will try to do that soon.

I have tried to make things as efficient as possible but if there are potential improvements let me know.

The results of a run against ml-1m are below (note the vanilla RMSE scores will be very different from the explicit variant):

**INPUTS**
```
iterations=10
factors=10
lambda=0.01
alpha=1
implicitPrefs=true
```

**RESULTS**

```
Spark MLlib 0.8.0-SNAPSHOT

RMSE = 3.1544
Time: 24.834 sec
```
```
My own port of Mahout's ALS to Spark (updated to 0.8.0-SNAPSHOT)

RMSE = 3.1543
Time: 58.708 sec
```
```
Mahout 0.8

time ./factorize-movielens-1M.sh /path/to/ratings/ml-1m/ratings.dat

real	3m48.648s
user	6m39.254s
sys	0m14.505s

RMSE = 3.1539
```

Results of a run against ml-10m

```
Spark MLlib

RMSE = 3.1200
Time: 162.348 sec
```
```
Mahout 0.8

real	23m2.220s
user	43m39.185s
sys	0m25.316s

RMSE = 3.1187
```
2013-10-08 23:44:55 -07:00
Reynold Xin e67d5b962a Merge pull request #43 from mateiz/kryo-fix
Don't allocate Kryo buffers unless needed

I noticed that the Kryo serializer could be slower than the Java one by 2-3x on small shuffles because it spend a lot of time initializing Kryo Input and Output objects. This is because our default buffer size for them is very large. Since the serializer is often used on streams, I made the initialization lazy for that, and used a smaller buffer (auto-managed by Kryo) for input.
2013-10-08 22:57:38 -07:00
Grace Huang f7628e4033 remove those futile suffixes like number/count 2013-10-09 08:36:41 +08:00
Grace Huang 22bed59d2d create metrics name manually. 2013-10-08 18:01:11 +08:00
Grace Huang 188abbf8f1 Revert "SPARK-900 Use coarser grained naming for metrics"
This reverts commit 4b68be5f3c.
2013-10-08 17:45:14 +08:00
Grace Huang a2af6b543a Revert "remedy the line-wrap while exceeding 100 chars"
This reverts commit 892fb8ffa8.
2013-10-08 17:44:56 +08:00
Prashant Sharma 7be75682b9 Merge branch 'master' into wip-merge-master
Conflicts:
	bagel/pom.xml
	core/pom.xml
	core/src/test/scala/org/apache/spark/ui/UISuite.scala
	examples/pom.xml
	mllib/pom.xml
	pom.xml
	project/SparkBuild.scala
	repl/pom.xml
	streaming/pom.xml
	tools/pom.xml

In scala 2.10, a shorter representation is used for naming artifacts
 so changed to shorter scala version for artifacts and made it a property in pom.
2013-10-08 11:29:40 +05:30
Reynold Xin ea34c52102 Merge pull request #42 from pwendell/shuffle-read-perf
Fix inconsistent and incorrect log messages in shuffle read path

The user-facing messages generated by the CacheManager are currently wrong and somewhat misleading. This patch makes the messages more accurate. It also uses a consistent representation of the partition being fetched (`rdd_xx_yy`) so that it's easier for users to trace what is going on when reading logs.
2013-10-07 20:45:58 -07:00
Patrick Wendell 8b377718b8 Responses to review 2013-10-07 20:03:35 -07:00
Matei Zaharia a8725bf8f8 Don't allocate Kryo buffers unless needed 2013-10-07 19:16:35 -07:00
Patrick Wendell 391133f66a Fix inconsistent and incorrect log messages in shuffle read path 2013-10-07 17:24:18 -07:00
Patrick Wendell 02f37ee853 Merge pull request #39 from pwendell/master
Adding Shark 0.7.1 to EC2 scripts

This adds a newer version of Shark to the ec2 scripts. I've tested this for both Hadoop1 and Hadoop2 clusters.
2013-10-07 15:48:52 -07:00
Patrick Wendell 3745a1827f Adding Shark 0.7.1 to EC2 scripts 2013-10-07 15:03:42 -07:00
Andre Schumacher fdbae41e88 SPARK-705: implement sortByKey() in PySpark 2013-10-07 12:16:33 -07:00
Reynold Xin 213b70a2db Merge pull request #31 from sundeepn/branch-0.8
Resolving package conflicts with hadoop 0.23.9

Hadoop 0.23.9 is having a package conflict with easymock's dependencies.

(cherry picked from commit 023e3fdf00)
Signed-off-by: Reynold Xin <rxin@apache.org>
2013-10-07 10:54:22 -07:00
Nick Pentreath a5e58b8f98 Merge branch 'master' into implicit-als 2013-10-07 11:46:17 +02:00
Nick Pentreath b0f5f4d441 Bumping up test matrix size to eliminate random failures 2013-10-07 11:44:22 +02:00
Patrick Wendell d585613ee2 Merge pull request #37 from pwendell/merge-0.8
merge in remaining changes from `branch-0.8`

This merges in the following changes from `branch-0.8`:

- The scala version is included in the published maven artifact names
- A unit tests which had non-deterministic failures is ignored (see SPARK-908)
- A minor documentation change shows the short version instead of the full version
- Moving the kafka jar to be "provided"
- Changing the default spark ec2 version.
- Some spacing changes caused by Maven's release plugin

Note that I've squashed this into a single commit rather than pull in the branch-0.8 history. There are a bunch of release/revert commits there that make the history super ugly.
2013-10-05 22:57:05 -07:00
Patrick Wendell aa9fb84994 Merging build changes in from 0.8 2013-10-05 22:07:00 -07:00
Matei Zaharia 4a25b116d4 Merge pull request #20 from harveyfeng/hadoop-config-cache
Allow users to pass broadcasted Configurations and cache InputFormats across Hadoop file reads.

Note: originally from https://github.com/mesos/spark/pull/942

Currently motivated by Shark queries on Hive-partitioned tables, where there's a JobConf broadcast for every Hive-partition (i.e., every subdirectory read). The only thing different about those JobConfs is the input path - the Hadoop Configuration that the JobConfs are constructed from remain the same.
This PR only modifies the old Hadoop API RDDs, but similar additions to the new API might reduce computation latencies a little bit for high-frequency FileInputDStreams (which only uses the new API right now).

As a small bonus, added InputFormats caching, to avoid reflection calls for every RDD#compute().

Few other notes:

Added a general soft-reference hashmap in SparkHadoopUtil because I wanted to avoid adding another class to SparkEnv.
SparkContext default hadoopConfiguration isn't cached. There's no equals() method for Configuration, so there isn't a good way to determine when configuration properties have changed.
2013-10-05 19:28:55 -07:00
Harvey Feng 6a2bbec5e3 Some comments regarding JobConf and InputFormat caching for HadoopRDDs. 2013-10-05 17:53:58 -07:00
Reynold Xin 8fc68d04bd Merge pull request #36 from pwendell/versions
Bumping EC2 default version in master to .

This change was already made on . This PR ports the change up to master.
2013-10-05 17:24:35 -07:00
Harvey Feng 96929f28bb Make HadoopRDD object Spark private. 2013-10-05 17:14:19 -07:00
Patrick Wendell 2484b84678 Bumping EC2 default version in master to 0.8.0. 2013-10-05 16:59:11 -07:00
Harvey Feng b5e93c1227 Fix API changes; lines > 100 chars. 2013-10-05 16:57:08 -07:00
Martin Weindel e09f4a9601 fixed some warnings 2013-10-05 23:08:23 +02:00
Matei Zaharia 100222b048 Merge pull request #27 from davidmccauley/master
SPARK-920/921 - JSON endpoint updates

920 - Removal of duplicate scheme part of Spark URI, it was appearing as spark://spark//host:port in the JSON field.

JSON now delivered as:
url:spark://127.0.0.1:7077

921 - Adding the URL of the Main Application UI will allow custom interfaces (that use the JSON output) to redirect from the standalone UI.
2013-10-05 13:38:59 -07:00
Matei Zaharia 08641932bd Merge pull request #33 from AndreSchumacher/pyspark_partition_key_change
Fixing SPARK-602: PythonPartitioner

Currently PythonPartitioner determines partition ID by hashing a
byte-array representation of PySpark's key. This PR lets
PythonPartitioner use the actual partition ID, which is required e.g.
for sorting via PySpark.
2013-10-05 13:25:18 -07:00
Martin Weindel 9b0c9c893d scala 2.10 requires Java 1.6,
using Scala 2.10.3,
resolved maven-scala-plugin warning
2013-10-05 21:41:09 +02:00
Prashant Sharma 3e41495288 Fixed tests, changed property akka.remote.netty.x to akka.remote.netty.tcp.x 2013-10-05 16:39:25 +05:30
Prashant Sharma c810ee0690 Merge branch 'master' into scala-2.10
Conflicts:
	core/src/test/scala/org/apache/spark/DistributedSuite.scala
	project/SparkBuild.scala
2013-10-05 15:52:57 +05:30
Andre Schumacher c84946fe21 Fixing SPARK-602: PythonPartitioner
Currently PythonPartitioner determines partition ID by hashing a
byte-array representation of PySpark's key. This PR lets
PythonPartitioner use the actual partition ID, which is required e.g.
for sorting via PySpark.
2013-10-04 11:56:47 -07:00
Matei Zaharia 3fe12cc2ca Merge pull request #946 from ScrapCodes/scala-2.10
Fixed non termination of Executor backend, when sc.stop is not called and system.exit instead.
2013-10-04 10:51:28 -07:00