Commit graph

5335 commits

Author SHA1 Message Date
Patrick Wendell 5c1b4f6405 Minor fixes 2013-12-26 14:39:39 -08:00
Tathagata Das 5fde4566ea Added Apache boilerplate and class docs to PartitionerAwareUnionRDD. 2013-12-26 14:33:37 -08:00
Tathagata Das 577c8cc834 Removed unnecessary options from WindowedDStream. 2013-12-26 14:17:16 -08:00
Tathagata Das 3618d70b2a Added warning if filestream adds files with no data in them (file RDDs have 0 partitions). 2013-12-26 12:45:40 -08:00
Lian, Cheng 654f42174a Reformatted some lines commented by Matei 2013-12-27 04:45:04 +08:00
Patrick Wendell c23d640516 Addressing smaller changes from Aaron's review 2013-12-26 12:38:39 -08:00
Tathagata Das be64719138 Changed file stream to not catch any exceptions related to finding new files (FileNotFound exception is still caught and ignored). 2013-12-26 12:33:12 -08:00
Tathagata Das 3579647cdc Merge branch 'apache-master' into window-improvement 2013-12-26 12:12:10 -08:00
Patrick Wendell da20270b83 Merge pull request #1 from aarondav/driver
Refactor DriverClient to be more Actor-based
2013-12-26 12:11:52 -08:00
Patrick Wendell a97ad55c45 Removing accidental file 2013-12-26 12:11:28 -08:00
Tathagata Das c4a54f51b5 Merge branch 'master' into window-improvement 2013-12-26 12:03:11 -08:00
Patrick Wendell 5938cfc153 Updated approach to driver restarting 2013-12-26 12:02:19 -08:00
Matei Zaharia e240bad03b Merge pull request #296 from witgo/master
Renamed ClusterScheduler to TaskSchedulerImpl for yarn and new-yarn package
2013-12-26 12:30:48 -05:00
Tathagata Das 069cb14bdc Updated groupByKeyAndWindow to be computed incrementally, and added mapSideCombine to combineByKeyAndWindow. 2013-12-26 02:58:29 -08:00
Tathagata Das bacc65cf28 Removed slack time in file stream and added better handling of failures due to FileNotFound exceptions. 2013-12-26 10:18:46 +00:00
liguoqiang b662c88a24 fix this import order 2013-12-26 15:49:33 +08:00
Mark Hamstra c529dceaff Avoid a lump of coal (NPE) in JobProgressListener's stocking. 2013-12-25 23:10:02 -08:00
Matei Zaharia c344ed04c7 Merge pull request #283 from tmyklebu/master
Python bindings for mllib

This pull request contains Python bindings for the regression, clustering, classification, and recommendation tools in mllib.

For each 'train' frontend exposed, there is a Scala stub in PythonMLLibAPI.scala and a Python stub in mllib.py.  The Python stub serialises the input RDD and any vector/matrix arguments into a mutually-understood format and calls the Scala stub.  The Scala stub deserialises the RDD and the vector/matrix arguments, calls the appropriate 'train' function, serialises the resulting model, and returns the serialised model.

ALSModel is slightly different since a MatrixFactorizationModel has RDDs inside.  The Scala stub returns a handle to a Scala MatrixFactorizationModel; prediction is done by calling the Scala predict method.

I have tested these bindings on an x86_64 machine running Linux.  There is a risk that these bindings may fail on some choose-your-own-endian platform if Python's endianness differs from java.nio.ByteBuffer's idea of the native byte order.
2013-12-26 01:31:06 -05:00
liguoqiang 2bd76f693d Renamed ClusterScheduler to TaskSchedulerImpl for yarn and new-yarn 2013-12-26 11:10:35 +08:00
liguoqiang 14fcef72db Renamed ClusterScheduler to TaskSchedulerImpl for yarn and new-yarn 2013-12-26 11:05:07 +08:00
Tathagata Das 94479673eb Fixed bug in PartitionAwareUnionRDD 2013-12-26 00:07:45 +00:00
Tor Myklebust 9cbcf81453 Remove commented code in __init__.py. 2013-12-25 14:12:42 -05:00
Tor Myklebust 5e71354cb7 Fix copypasta in __init__.py. Don't import anything directly into pyspark.mllib. 2013-12-25 14:10:55 -05:00
Aaron Davidson 61372b11f4 Refactor DriverClient to be more Actor-based 2013-12-25 10:55:25 -08:00
Matei Zaharia 56094bcd8d Merge pull request #290 from ash211/patch-3
Typo: avaiable -> available
2013-12-25 13:14:33 -05:00
Lian, Cheng c0337c5bbf Let reduceByKey take care of local combine
Also refactored some heavy FP code to improve readability and reduce memory footprint.
2013-12-25 22:45:57 +08:00
walker 0af4b4f3e8 Bug fixes for updating the RDD block's memory and disk usage information 2013-12-25 20:07:01 +08:00
Reynold Xin 4842a07da8 Merge pull request #287 from azuryyu/master
Fixed job name in the java streaming example.
2013-12-25 01:52:15 -08:00
Patrick Wendell bbc362833b Removing un-used variable 2013-12-25 01:38:57 -08:00
Patrick Wendell 18ad419b52 Small fix from rebase 2013-12-25 01:22:38 -08:00
Patrick Wendell 55f833803a Minor bug fix 2013-12-25 01:19:25 -08:00
Patrick Wendell c9c0f745af Minor style clean-up 2013-12-25 01:19:25 -08:00
Patrick Wendell b2b7514ba3 Import clean-up (yay Aaron) 2013-12-25 01:19:25 -08:00
Patrick Wendell d5f23e0083 Adding scheduling and reporting based on cores 2013-12-25 01:19:01 -08:00
Patrick Wendell 760823d393 Adding better option parsing 2013-12-25 01:19:01 -08:00
Patrick Wendell 6a4acc4c2d Initial cut at driver submission. 2013-12-25 01:19:01 -08:00
Patrick Wendell 1070b566d4 Renaming Client => AppClient 2013-12-25 01:17:01 -08:00
Lian, Cheng 3bb714eaa3 Refactored NaiveBayes
* Minimized shuffle output with mapPartitions.
* Reduced RDD actions from 3 to 1.
2013-12-25 17:15:38 +08:00
Frank Dai 3dc655aa19 standard Naive Bayes classifier 2013-12-25 16:50:42 +08:00
Tor Myklebust 02208a175c Initial weights in Scala are ones; do that too. Also fix some errors. 2013-12-25 00:53:48 -05:00
Tor Myklebust 4e821390bc Scala stubs for updated Python bindings. 2013-12-25 00:09:00 -05:00
Tor Myklebust 05163057a1 Split the mllib bindings into a whole bunch of modules and rename some things. 2013-12-25 00:08:05 -05:00
Andrew Ash 3665c722b5 Typo: avaiable -> available 2013-12-24 17:25:04 -08:00
Patrick Wendell 85a344b4f0 Merge pull request #127 from kayousterhout/consolidate_schedulers
Deduplicate Local and Cluster schedulers.

The code in LocalScheduler/LocalTaskSetManager was nearly identical
to the code in ClusterScheduler/ClusterTaskSetManager. The redundancy
made updating the schedulers unnecessarily painful and error-prone.
This commit combines the two into a single TaskScheduler/TaskSetManager.

Unfortunately the diff makes this change look much more invasive than it is -- TaskScheduler.scala is only superficially changed (names updated, overrides removed) from the old ClusterScheduler.scala, and the same with TaskSetManager.scala.

Thanks @rxin for suggesting this change!
2013-12-24 16:35:06 -08:00
Binh Nguyen 786f393a98 Fix imports order 2013-12-24 14:59:30 -08:00
Binh Nguyen 9115a5de62 Remove import * and fix some formatting 2013-12-24 14:59:30 -08:00
Binh Nguyen 040dd3ecd5 upgrade Netty from 4.0.0.Beta2 to 4.0.13.Final 2013-12-24 14:58:18 -08:00
Patrick Wendell c2dd6bcd6e Merge pull request #279 from aarondav/shuffle-cleanup0
Clean up shuffle files once their metadata is gone

Previously, we would only clean the in-memory metadata for consolidated shuffle files.

Additionally, fixes a bug where the Metadata Cleaner was ignoring type-specific TTLs.
2013-12-24 14:36:47 -08:00
Kay Ousterhout 1efe3adf56 Responded to Reynold's style comments 2013-12-24 14:18:39 -08:00
Tathagata Das d4dfab503a Fixed Python API for sc.setCheckpointDir. Also other fixes based on Reynold's comments on PR 289. 2013-12-24 14:01:13 -08:00