Here's a modification to PageRank which does personalized PageRank. The approach is basically similar to that outlined by Bahmani et al. from 2010 (http://arxiv.org/pdf/1006.2880.pdf).
I'm sure this needs tuning up or other considerations, so let me know how I can improve this.
Author: Dan McClary <dan.mcclary@gmail.com>
Author: dwmclary <dan.mcclary@gmail.com>
Closes#4774 from dwmclary/SPARK-5854-Personalized-PageRank and squashes the following commits:
8b907db [dwmclary] fixed scalastyle errors in PageRankSuite
2c20e5d [dwmclary] merged with upstream master
d6cebac [dwmclary] updated as per style requests
7d00c23 [Dan McClary] fixed line overrun in personalizedVertexPageRank
d711677 [Dan McClary] updated vertexProgram to restore binary compatibility for inner method
bb8d507 [Dan McClary] Merge branch 'master' of https://github.com/apache/spark into SPARK-5854-Personalized-PageRank
fba0edd [Dan McClary] fixed silly mistakes
de51be2 [Dan McClary] cleaned up whitespace between comments and methods
0c30d0c [Dan McClary] updated to maintain binary compatibility
aaf0b4b [Dan McClary] Merge branch 'master' of https://github.com/apache/spark into SPARK-5854-Personalized-PageRank
76773f6 [Dan McClary] Merge branch 'master' of https://github.com/apache/spark into SPARK-5854-Personalized-PageRank
44ada8e [Dan McClary] updated tolerance on chain PPR
1ffed95 [Dan McClary] updated tolerance on chain PPR
b67ac69 [Dan McClary] updated tolerance on chain PPR
a560942 [Dan McClary] rolled PPR into pregel code for PageRank
6dc2c29 [Dan McClary] initial implementation of personalized page rank
Author: Michael Malak <michaelmalak@yahoo.com>
Closes#5464 from michaelmalak/master and squashes the following commits:
9d942ba [Michael Malak] SPARK-6710 GraphX Fixed Wrong initial bias in GraphX SVDPlusPlus
https://issues.apache.org/jira/browse/SPARK-6758
I am not sure if it is ok to block them in test resources too (as we shade jetty in assembly?).
Author: WangTaoTheTonic <wangtao111@huawei.com>
Closes#5406 from WangTaoTheTonic/SPARK-6758 and squashes the following commits:
e09605b [WangTaoTheTonic] block the right jetty package
So we can turn style checker on for test code.
Author: Reynold Xin <rxin@databricks.com>
Closes#5410 from rxin/test-style-graphx and squashes the following commits:
89e253a [Reynold Xin] [SPARK-6765] Fix test code style for graphx.
Example of Graph#aggregateMessages has error.
Since aggregateMessages is a method of Graph, It should be written "rawGraph.aggregateMessages"
Author: Sasaki Toru <sasakitoa@nttdata.co.jp>
Closes#5388 from sasakitoa/aggregateMessagesExample and squashes the following commits:
b1d631b [Sasaki Toru] Example of Graph#aggregateMessages has error
This builds on my earlier pull requests and turns on the explicit type checking in scalastyle.
Author: Reynold Xin <rxin@databricks.com>
Closes#5342 from rxin/SPARK-6428 and squashes the following commits:
7b531ab [Reynold Xin] import ordering
2d9a8a5 [Reynold Xin] jl
e668b1c [Reynold Xin] override
9b9e119 [Reynold Xin] Parenthesis.
82e0cf5 [Reynold Xin] [SPARK-6428] Turn on explicit type checking for public methods.
Adds a `Graph#minus` method which will return only unique `VertexId`'s from the calling `VertexRDD`.
To demonstrate a basic example with pseudocode:
```
Set((0L,0),(1L,1)).minus(Set((1L,1),(2L,2)))
> Set((0L,0))
```
Author: Brennon York <brennon.york@capitalone.com>
Closes#5175 from brennonyork/SPARK-6510 and squashes the following commits:
248d5c8 [Brennon York] added minus(VertexRDD[VD]) method to avoid createUsingIndex and updated the mask operations to simplify with andNot call
3fb7cce [Brennon York] updated graphx doc to reflect the addition of minus method
6575d92 [Brennon York] updated mima exclude
aaa030b [Brennon York] completed graph#minus functionality
7227c0f [Brennon York] beginning work on minus functionality
Correct some typos. Correct a mistake in lib/PageRank.scala. The first PageRank implementation uses standalone Graph interface, but the second uses Pregel interface. It may mislead the code viewers.
Author: Hangchen Yu <yuhc@gitcafe.com>
Closes#5128 from yuhc/master and squashes the following commits:
53e5432 [Hangchen Yu] Merge branch 'master' of https://github.com/yuhc/spark
67b77b5 [Hangchen Yu] [SPARK-6455] [docs] Correct some mistakes and typos
206f2dc [Hangchen Yu] Correct some mistakes and typos.
Use `Utils.createTempDir()` to replace other temp file mechanisms used in some tests, to further ensure they are cleaned up, and simplify
Author: Sean Owen <sowen@cloudera.com>
Closes#5029 from srowen/SPARK-6338 and squashes the following commits:
27b740a [Sean Owen] Fix hive-thriftserver tests that don't expect an existing dir
4a212fa [Sean Owen] Standardize a bit more temp dir management
9004081 [Sean Owen] Revert some added recursive-delete calls
57609e4 [Sean Owen] Use Utils.createTempDir() to replace other temp file mechanisms used in some tests, to further ensure they are cleaned up, and simplify
This extractor is mainly used for Graph#aggregateMessages*.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#5047 from maropu/AddUnapplyInEdgeContext and squashes the following commits:
87e04df [Takeshi YAMAMURO] Add unapply in EdgeContext
Changed method invocation of 'diff' to match that of 'innerJoin' and 'leftJoin' from VertexRDD[VD] to RDD[(VertexId, VD)]. This change maintains backwards compatibility and better unifies the VertexRDD methods to match each other.
Author: Brennon York <brennon.york@capitalone.com>
Closes#4733 from brennonyork/SPARK-5922 and squashes the following commits:
e800f08 [Brennon York] fixed merge conflicts
b9274af [Brennon York] fixed merge conflicts
f86375c [Brennon York] fixed minor include line
398ddb4 [Brennon York] fixed merge conflicts
aac1810 [Brennon York] updated to aggregateUsingIndex and added test to ensure that method works properly
2af0b88 [Brennon York] removed deprecation line
753c963 [Brennon York] fixed merge conflicts and set preference to use the diff(other: VertexRDD[VD]) method
2c678c6 [Brennon York] added mima exclude to exclude new public diff method from VertexRDD
93186f3 [Brennon York] added back the original diff method to sustain binary compatibility
f18356e [Brennon York] changed method invocation of 'diff' to match that of 'innerJoin' and 'leftJoin' from VertexRDD[VD] to RDD[(VertexId, VD)]
Added tests that maropu [created](1f64794b2c/graphx/src/test/scala/org/apache/spark/graphx/VertexRDDSuite.scala) for vertices with differing partition counts. Wanted to make sure his work got captured /merged as its not in the master branch and I don't believe there's a PR out already for it.
Author: Brennon York <brennon.york@capitalone.com>
Closes#5023 from brennonyork/SPARK-5790 and squashes the following commits:
83bbd29 [Brennon York] added maropu's tests for vertices with differing partition counts
Turns out, per the [convo on the JIRA](https://issues.apache.org/jira/browse/SPARK-4600), `diff` is acting exactly as should. It became a large misconception as I thought it meant set difference, when in fact it does not. To that extent I merely updated the `diff` documentation to, hopefully, better reflect its true intentions moving forward.
Author: Brennon York <brennon.york@capitalone.com>
Closes#5015 from brennonyork/SPARK-4600 and squashes the following commits:
1e1d1e5 [Brennon York] reverted internal diff docs
92288f7 [Brennon York] reverted both the test suite and the diff function back to its origin functionality
f428623 [Brennon York] updated diff documentation to better represent its function
cc16d65 [Brennon York] Merge remote-tracking branch 'upstream/master' into SPARK-4600
66818b9 [Brennon York] added small secondary diff test
99ad412 [Brennon York] Merge remote-tracking branch 'upstream/master' into SPARK-4600
74b8c95 [Brennon York] corrected method by leveraging bitmask operations to correctly return only the portions of that are different from the calling VertexRDD
9717120 [Brennon York] updated diff impl to cause fewer objects to be created
710a21c [Brennon York] working diff given test case
aa57f83 [Brennon York] updated to set ShortestPaths to run 'forward' rather than 'backward'
The issue is discussed in https://issues.apache.org/jira/browse/SPARK-5669. Replacing all JBLAS usage by netlib-java gives us a simpler dependency tree and less license issues to worry about. I didn't touch the test scope in this PR. The user guide is not modified to avoid merge conflicts with branch-1.3. srowen ankurdave pwendell
Author: Xiangrui Meng <meng@databricks.com>
Closes#4699 from mengxr/SPARK-5814 and squashes the following commits:
48635c6 [Xiangrui Meng] move netlib-java version to parent pom
ca21c74 [Xiangrui Meng] remove jblas from ml-guide
5f7767a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5814
c5c4183 [Xiangrui Meng] merge master
0f20cad [Xiangrui Meng] add mima excludes
e53e9f4 [Xiangrui Meng] remove jblas from mllib runtime
ceaa14d [Xiangrui Meng] replace jblas by netlib-java in graphx
fa7c2ca [Xiangrui Meng] move jblas to test scope
Option 1 of 2: Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
Author: Sean Owen <sowen@cloudera.com>
Closes#4912 from srowen/SPARK-6182.1 and squashes the following commits:
eff60de [Sean Owen] Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
Class TaskContext is unused in EdgeRDDImpl, so we need to remove it from import list.
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Closes#4846 from lianhuiwang/SPARK-6103 and squashes the following commits:
31aed64 [Lianhui Wang] remove unused class to import in EdgeRDDImpl
Fixes the issue whereby when VertexRDD's are `diff`ed, `innerJoin`ed, or `leftJoin`ed and have different partition sizes they fail under the `zipPartitions` method. This fix tests whether the partitions are equal or not and, if not, will repartition the other to match the partition size of the calling VertexRDD.
Author: Brennon York <brennon.york@capitalone.com>
Closes#4705 from brennonyork/SPARK-1955 and squashes the following commits:
0882590 [Brennon York] updated to properly handle differently-partitioned vertexRDDs
Now, deprecated runSVDPlusPlus and update run, for 1.4.0 / master only
Author: Sean Owen <sowen@cloudera.com>
Closes#4625 from srowen/SPARK-5815.2 and squashes the following commits:
6fd2ca5 [Sean Owen] Now, deprecated runSVDPlusPlus and update run, for 1.4.0 / master only
Deprecate SVDPlusPlus.run and introduce SVDPlusPlus.runSVDPlusPlus with return type that doesn't include DoubleMatrix
CC mengxr
Author: Sean Owen <sowen@cloudera.com>
Closes#4614 from srowen/SPARK-5815 and squashes the following commits:
288cb05 [Sean Owen] Clarify deprecation plans in scaladoc
497458e [Sean Owen] Deprecate SVDPlusPlus.run and introduce SVDPlusPlus.runSVDPlusPlus with return type that doesn't include DoubleMatrix
This just unpersist()s each RDD in this code that was cache()ed.
Author: Sean Owen <sowen@cloudera.com>
Closes#4234 from srowen/SPARK-3290 and squashes the following commits:
66c1e11 [Sean Owen] unpersist() each RDD that was cache()ed
Corrected the logic with ShortestPaths so that the calculation will run forward rather than backwards. Output before looked like:
```scala
import org.apache.spark.graphx._
val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,""))))
lib.ShortestPaths.run(g,Array(3)).vertices.collect
// res0: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3 -> 0)), (2,Map()))
lib.ShortestPaths.run(g,Array(1)).vertices.collect
// res1: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), (3,Map(1 -> 2)), (2,Map(1 -> 1)))
```
And new output after the changes looks like:
```scala
import org.apache.spark.graphx._
val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,""))))
lib.ShortestPaths.run(g,Array(3)).vertices.collect
// res0: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(3 -> 2)), (2,Map(3 -> 1)), (3,Map(3 -> 0)))
lib.ShortestPaths.run(g,Array(1)).vertices.collect
// res1: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), (2,Map()), (3,Map()))
```
Author: Brennon York <brennon.york@capitalone.com>
Closes#4478 from brennonyork/SPARK-5343 and squashes the following commits:
aa57f83 [Brennon York] updated to set ShortestPaths to run 'forward' rather than 'backward'
When I build a graph with a file format error, there will be an ArrayIndexOutOfBoundsException
Author: Leolh <leosandylh@gmail.com>
Closes#4176 from Leolh/patch-1 and squashes the following commits:
94f6d22 [Leolh] Update GraphLoader.scala
23767f1 [Leolh] [SPARK-3650][GraphX] There will be an ArrayIndexOutOfBoundsException if the format of the source file is wrong