Commit graph

412 commits

Author SHA1 Message Date
DB Tsai 5596086935 [SPARK-1969][MLlib] Online summarizer APIs for mean, variance, min, and max
It basically moved the private ColumnStatisticsAggregator class from RowMatrix to public available DeveloperApi with documentation and unitests.

Changes:
1) Moved the private implementation from org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
2) When creating OnlineSummarizer object, the number of columns is not needed in the constructor. It's determined when users add the first sample.
3) Added the APIs documentation for MultivariateOnlineSummarizer.
4) Added the unittests for MultivariateOnlineSummarizer.

Author: DB Tsai <dbtsai@dbtsai.com>

Closes #955 from dbtsai/dbtsai-summarizer and squashes the following commits:

b13ac90 [DB Tsai] dbtsai-summarizer
2014-07-11 23:04:43 -07:00
tmalaska 40a8fef4e6 [SPARK-1478].3: Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
This is a modified version of this PR https://github.com/apache/spark/pull/1168 done by @tmalaska
Adds MIMA binary check exclusions.

Author: tmalaska <ted.malaska@cloudera.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #1347 from tdas/FLUME-1915 and squashes the following commits:

96065df [Tathagata Das] Added Mima exclusion for FlumeReceiver.
41d5338 [tmalaska] Address line 57 that was too long
12617e5 [tmalaska] SPARK-1478: Upgrade FlumeInputDStream's Flume...
2014-07-10 13:15:02 -07:00
Prashant Sharma 628932b8d0 [SPARK-1776] Have Spark's SBT build read dependencies from Maven.
Patch introduces the new way of working also retaining the existing ways of doing things.

For example build instruction for yarn in maven is
`mvn -Pyarn -PHadoop2.2 clean package -DskipTests`
in sbt it can become
`MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly`
Also supports
`sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly`

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #772 from ScrapCodes/sbt-maven and squashes the following commits:

a8ac951 [Prashant Sharma] Updated sbt version.
62b09bb [Prashant Sharma] Improvements.
fa6221d [Prashant Sharma] Excluding sql from mima
4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default.
72651ca [Prashant Sharma] Addresses code reivew comments.
acab73d [Prashant Sharma] Revert "Small fix to run-examples script."
ac4312c [Prashant Sharma] Revert "minor fix"
6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit.
65cf06c [Prashant Sharma] Servelet API jars mess up with the other servlet jars on the class path.
446768e [Prashant Sharma] minor fix
89b9777 [Prashant Sharma] Merge conflicts
d0a02f2 [Prashant Sharma] Bumped up pom versions, Since the build now depends on pom it is better updated there. + general cleanups.
dccc8ac [Prashant Sharma] updated mima to check against 1.0
a49c61b [Prashant Sharma] Fix for tools jar
a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies.
cf88758 [Prashant Sharma] cleanup
9439ea3 [Prashant Sharma] Small fix to run-examples script.
96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven.
36efa62 [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins.
4973dbd [Patrick Wendell] Example build using pom reader.
2014-07-10 11:03:37 -07:00
Marcelo Vanzin 21ddd7d1e9 [SPARK-1768] History server enhancements.
Two improvements to the history server:

- Separate the HTTP handling from history fetching, so that it's easy to add
  new backends later (thinking about SPARK-1537 in the long run)

- Avoid loading all UIs in memory. Do lazy loading instead, keeping a few in
  memory for faster access. This allows the app limit to go away, since holding
  just the listing in memory shouldn't be too expensive unless the user has millions
  of completed apps in the history (at which point I'd expect other issues to arise
  aside from history server memory usage, such as FileSystem.listStatus()
  starting to become ridiculously expensive).

I also fixed a few minor things along the way which aren't really worth mentioning.
I also removed the app's log path from the UI since that information may not even
exist depending on which backend is used (even though there is only one now).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #718 from vanzin/hist-server and squashes the following commits:

53620c9 [Marcelo Vanzin] Add mima exclude, fix scaladoc wording.
c21f8d8 [Marcelo Vanzin] Feedback: formatting, docs.
dd8cc4b [Marcelo Vanzin] Standardize on using spark.history.* configuration.
4da3a52 [Marcelo Vanzin] Remove UI from ApplicationHistoryInfo.
2a7f68d [Marcelo Vanzin] Address review feedback.
4e72c77 [Marcelo Vanzin] Remove comment about ordering.
249bcea [Marcelo Vanzin] Remove offset / count from provider interface.
ca5d320 [Marcelo Vanzin] Remove code that deals with unfinished apps.
6e2432f [Marcelo Vanzin] Second round of feedback.
b2c570a [Marcelo Vanzin] Make class package-private.
4406f61 [Marcelo Vanzin] Cosmetic change to listing header.
e852149 [Marcelo Vanzin] Initialize new app array to expected size.
e8026f4 [Marcelo Vanzin] Review feedback.
49d2fd3 [Marcelo Vanzin] Fix a comment.
91e96ca [Marcelo Vanzin] Fix scalastyle issues.
6fbe0d8 [Marcelo Vanzin] Better handle failures when loading app info.
eee2f5a [Marcelo Vanzin] Ensure server.stop() is called when shutting down.
bda2fa1 [Marcelo Vanzin] Rudimentary paging support for the history UI.
b284478 [Marcelo Vanzin] Separate history server from history backend.
2014-06-23 13:53:44 -07:00
Patrick Wendell 0a432d6a05 HOTFIX: Fix missing MIMA ignore 2014-06-21 13:02:49 -07:00
Andrew Or 44daec5abd [Minor] Fix style, formatting and naming in BlockManager etc.
This is a precursor to a bigger change. I wanted to separate out the relatively insignificant changes so the ultimate PR is not inflated.

(Warning: this PR is full of unimportant nitpicks)

Author: Andrew Or <andrewor14@gmail.com>

Closes #1058 from andrewor14/bm-minor and squashes the following commits:

8e12eaf [Andrew Or] SparkException -> BlockException
c36fd53 [Andrew Or] Make parts of BlockManager more readable
0a5f378 [Andrew Or] Entry -> MemoryEntry
e9762a5 [Andrew Or] Tone down string interpolation (minor reverts)
c4de9ac [Andrew Or] Merge branch 'master' of github.com:apache/spark into bm-minor
b3470f1 [Andrew Or] More string interpolation (minor)
7f9dcab [Andrew Or] Use string interpolation (minor)
94a425b [Andrew Or] Refactor against duplicate code + minor changes
8a6a7dc [Andrew Or] Exception -> SparkException
97c410f [Andrew Or] Deal with MIMA excludes
2480f1d [Andrew Or] Fixes in StorgeLevel.scala
abb0163 [Andrew Or] Style, formatting and naming fixes
2014-06-12 20:40:58 -07:00
Sandy Ryza ce92a9c18f SPARK-554. Add aggregateByKey.
Author: Sandy Ryza <sandy@cloudera.com>

Closes #705 from sryza/sandy-spark-554 and squashes the following commits:

2302b8f [Sandy Ryza] Add MIMA exclude
f52e0ad [Sandy Ryza] Fix Python tests for real
2f3afa3 [Sandy Ryza] Fix Python test
0b735e9 [Sandy Ryza] Fix line lengths
ae56746 [Sandy Ryza] Fix doc (replace T with V)
c2be415 [Sandy Ryza] Java and Python aggregateByKey
23bf400 [Sandy Ryza] SPARK-554.  Add aggregateByKey.
2014-06-12 08:14:25 -07:00
Tor Myklebust d9203350b0 [SPARK-1672][MLLIB] Separate user and product partitioning in ALS
Some clean up work following #593.

1. Allow to set different number user blocks and number product blocks in `ALS`.
2. Update `MovieLensALS` to reflect the change.

Author: Tor Myklebust <tmyklebu@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #1014 from mengxr/SPARK-1672 and squashes the following commits:

0e910dd [Xiangrui Meng] change private[this] to private[recommendation]
36420c7 [Xiangrui Meng] set exclusion rules for ALS
9128b77 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-1672
294efe9 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-1672
9bab77b [Xiangrui Meng] clean up add numUserBlocks and numProductBlocks to MovieLensALS
84c8e8c [Xiangrui Meng] Merge branch 'master' into SPARK-1672
d17a8bf [Xiangrui Meng] merge master
a4925fd [Tor Myklebust] Style.
bd8a75c [Tor Myklebust] Merge branch 'master' of github.com:apache/spark into alsseppar
021f54b [Tor Myklebust] Separate user and product blocks.
dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
674933a [Tor Myklebust] Fix style.
40edc23 [Tor Myklebust] Fix missing space.
f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
36a0f43 [Tor Myklebust] Make the partitioner private.
d872b09 [Tor Myklebust] Add negative id ALS test.
df27697 [Tor Myklebust] Support custom partitioners.  Currently we use the same partitioner for users and products.
c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
2014-06-11 18:16:33 -07:00
Kan Zhang c402a4a685 [SPARK-1817] RDD.zip() should verify partition sizes for each partition
RDD.zip() will throw an exception if it finds partition sizes are not the same.

Author: Kan Zhang <kzhang@apache.org>

Closes #944 from kanzhang/SPARK-1817 and squashes the following commits:

c073848 [Kan Zhang] [SPARK-1817] Cosmetic updates
524c670 [Kan Zhang] [SPARK-1817] RDD.zip() should verify partition sizes for each partition
2014-06-03 22:47:18 -07:00
Reynold Xin 1faef149f7 SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog.
I also corrected some errors made in the previous HLL count approximate API, including relativeSD wasn't really a measure for error (and we used it to test error bounds in test results).

Author: Reynold Xin <rxin@apache.org>

Closes #897 from rxin/hll and squashes the following commits:

4d83f41 [Reynold Xin] New error bound and non-randomness.
f154ea0 [Reynold Xin] Added a comment on the value bound for testing.
e367527 [Reynold Xin] One more round of code review.
41e649a [Reynold Xin] Update final mima list.
9e320c8 [Reynold Xin] Incorporate code review feedback.
e110d70 [Reynold Xin] Merge branch 'master' into hll
354deb8 [Reynold Xin] Added comment on the Mima exclude rules.
acaa524 [Reynold Xin] Added the right exclude rules in MimaExcludes.
6555bfe [Reynold Xin] Added a default method and re-arranged MimaExcludes.
1db1522 [Reynold Xin] Excluded util.SerializableHyperLogLog from MIMA check.
9221b27 [Reynold Xin] Merge branch 'master' into hll
88cfe77 [Reynold Xin] Updated documentation and restored the old incorrect API to maintain API compatibility.
1294be6 [Reynold Xin] Updated HLL+.
e7786cb [Reynold Xin] Merge branch 'master' into hll
c0ef0c2 [Reynold Xin] SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog.
2014-06-03 18:37:40 -07:00
Joseph E. Gonzalez 894ecde04f Synthetic GraphX Benchmark
This PR accomplishes two things:

1. It introduces a Synthetic Benchmark application that generates an arbitrarily large log-normal graph and executes either PageRank or connected components on the graph.  This can be used to profile GraphX system on arbitrary clusters without access to large graph datasets

2. This PR improves the implementation of the log-normal graph generator.

Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Author: Ankur Dave <ankurdave@gmail.com>

Closes #720 from jegonzal/graphx_synth_benchmark and squashes the following commits:

e40812a [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0
bccccad [Ankur Dave] Fix long lines
374678a [Ankur Dave] Bugfix and style changes
1bdf39a [Joseph E. Gonzalez] updating options
d943972 [Joseph E. Gonzalez] moving the benchmark application into the examples folder.
f4f839a [Joseph E. Gonzalez] Creating a synthetic benchmark script.
2014-06-03 14:14:48 -07:00
Patrick Wendell d17d221487 Better explanation for how to use MIMA excludes.
This patch does a few things:
1. We have a file MimaExcludes.scala exclusively for excludes.
2. The test runner tells users about that file if a test fails.
3. I've added back the excludes used from 0.9->1.0. We should keep
   these in the project as an official audit trail of times where
   we decided to make exceptions.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #937 from pwendell/mima and squashes the following commits:

7ee0db2 [Patrick Wendell] Better explanation for how to use MIMA excludes.
2014-06-01 17:27:05 -07:00