4881 commits

Author SHA1 Message Date
Tor Myklebust 4e821390bc Scala stubs for updated Python bindings. 2013-12-25 00:09:00 -05:00
Tor Myklebust 05163057a1 Split the mllib bindings into a whole bunch of modules and rename some things. 2013-12-25 00:08:05 -05:00
Tor Myklebust 86e38c4942 Remove useless line from test stub. 2013-12-24 16:49:31 -05:00
Tor Myklebust 4efec6eb94 Python change for move of PythonMLLibAPI. 2013-12-24 16:49:03 -05:00
Tor Myklebust 58e2a7d6d4 Move PythonMLLibAPI into its own package. 2013-12-24 16:48:40 -05:00
Tor Myklebust 2402180b32 Fix error message ugliness. 2013-12-24 16:18:33 -05:00
Tor Myklebust cbb2811189 Release JVM reference to the ALSModel when done. 2013-12-22 15:03:58 -05:00
Tor Myklebust 20f85eca3d Java stubs for ALSModel. 2013-12-21 14:54:13 -05:00
Tor Myklebust 076fc16221 Python stubs for ALSModel. 2013-12-21 14:54:01 -05:00
Tor Myklebust b454fdc2eb Javadocs; also, declare some things private. 2013-12-20 02:10:21 -05:00
Tor Myklebust 0b494c2167 Un-semicolon mllib.py. 2013-12-20 02:05:55 -05:00
Tor Myklebust 0a5cacb961 Change some docstrings and add some others. 2013-12-20 02:05:15 -05:00
Tor Myklebust b835ddf3df Licence notice. 2013-12-20 01:55:03 -05:00
Tor Myklebust d89cc1e28a Whitespace. 2013-12-20 01:50:42 -05:00
Tor Myklebust 319520b9bb Remove gigantic endian-specific test and exception tests. 2013-12-20 01:48:44 -05:00
Tor Myklebust 2940201ad8 Tests for the Python side of the mllib bindings. 2013-12-20 01:33:32 -05:00
Tor Myklebust 73e17064c6 Python stubs for classification and clustering. 2013-12-20 00:12:48 -05:00
Tor Myklebust f99970e8cd Scala classification and clustering stubs; matrix serialization/deserialization. 2013-12-20 00:12:22 -05:00
Tor Myklebust 2328bdd00f Python side of python bindings for linear, Lasso, and ridge regression 2013-12-19 22:45:16 -05:00
Tor Myklebust ded67ee90c Bindings for linear, Lasso, and ridge regression. 2013-12-19 22:42:12 -05:00
Tor Myklebust 2a41c9aad3 Un-semicolon PythonMLLibAPI. 2013-12-19 21:27:11 -05:00
Tor Myklebust bf20591a00 Incorporate most of Josh's style suggestions. I don't want to deal with the type and length checking errors until we've got at least one working stub that we're all happy with. 2013-12-19 03:40:57 -05:00
Tor Myklebust bf491bb3c0 The rest of the Python side of those bindings. 2013-12-19 01:29:51 -05:00
Tor Myklebust 95915f8b3b First cut at python mllib bindings. Only LinearRegression is supported. 2013-12-19 01:29:09 -05:00
Tor Myklebust d3b1af4b6c Add a serialisation time column to the StagePage. 2013-12-18 14:25:56 -05:00
Tor Myklebust 717c7fddb2 objectSer -> valueSer in a test. 2013-12-17 23:02:21 -05:00
Tor Myklebust b2f0329511 Missed a spot; had an objectSer here too. 2013-12-17 00:18:46 -05:00
Tor Myklebust 25fa976580 Merge branch 'master' of git://github.com/apache/incubator-spark 2013-12-16 23:48:37 -05:00
Tor Myklebust 963d6f065a Incorporate pwendell's code review suggestions. 2013-12-16 23:14:52 -05:00
Patrick Wendell 964a3b6971 Merge pull request #270 from ewencp/really-force-ssh-pseudo-tty-master
Force pseudo-tty allocation in spark-ec2 script.

ssh commands need the -t argument repeated twice if there is no local
tty, e.g. if the process running spark-ec2 uses nohup and the parent
process exits.

Without this change, if you run the script this way (e.g. using nohup from a cron job), it will fail setting up the nodes because some of the ssh commands complain about missing ttys and then fail.

(This version is for the master branch. I've filed a separate request for the 0.8 branch, since changes to the script caused the patches to be different.)
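The behaviour described above can be sketched in Python as follows. This is only an illustration, not the actual spark-ec2 code: the helper name `build_ssh_command`, its parameters, and the default `root` user are assumptions. The one detail taken from the message is that `-t` is passed twice, which forces tty allocation even when ssh has no local tty.

```python
def build_ssh_command(host, remote_command, identity_file=None, user="root"):
    """Build an ssh argv list that forces pseudo-tty allocation.

    Hypothetical helper for illustration; spark-ec2's real code may differ.
    """
    # A single -t requests a tty; repeating it forces allocation even when
    # the calling process has no local tty (e.g. nohup from a cron job).
    cmd = ["ssh", "-t", "-t"]
    if identity_file is not None:
        cmd += ["-i", identity_file]
    cmd += [f"{user}@{host}", remote_command]
    return cmd
```

The resulting list can be handed to `subprocess.check_call` unchanged.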
2013-12-16 15:23:51 -08:00
Reynold Xin 883e034aeb Merge pull request #245 from gregakespret/task-maxfailures-fix
Fix for spark.task.maxFailures not enforced correctly.

Docs at http://spark.incubator.apache.org/docs/latest/configuration.html say:

```
spark.task.maxFailures

Number of individual task failures before giving up on the job. Should be greater than or equal to 1. Number of allowed retries = this value - 1.
```

The previous implementation worked incorrectly: when, for example, `spark.task.maxFailures` was set to 1, the job was aborted only after the second task failure, not after the first.
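A minimal Python sketch of the off-by-one described above. The commit message does not show the scheduler code, so the comparison operators and all function names here are assumptions chosen to reproduce the reported symptom, not the actual Spark implementation.

```python
def old_check(failures, max_failures):
    # Assumed buggy form: aborts one failure too late.
    return failures > max_failures


def new_check(failures, max_failures):
    # Assumed fixed form: aborts once the limit itself is reached.
    return failures >= max_failures


def failures_until_abort(max_failures, should_abort):
    """Count task failures until the given check aborts the job."""
    failures = 0
    while True:
        failures += 1
        if should_abort(failures, max_failures):
            return failures
```

With `max_failures=1`, the first check lets the job survive until the second failure, while the second check aborts on the first failure, which matches the documented "allowed retries = this value - 1".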
2013-12-16 14:16:02 -08:00
Tor Myklebust 882d544856 UI to display serialisation time of a stage. 2013-12-16 13:27:03 -05:00
Tor Myklebust 8a397a959b Track task value serialisation time in TaskMetrics. 2013-12-16 12:07:39 -05:00
Ewen Cheslack-Postava d17c142615 Force pseudo-tty allocation in spark-ec2 script.
ssh commands need the -t argument repeated twice if there is no local
tty, e.g. if the process running spark-ec2 uses nohup and the parent
process exits.
2013-12-16 08:09:37 -08:00
Patrick Wendell a51f3404ad Merge pull request #265 from markhamstra/scala.binary.version
DRY out the POMs with scala.binary.version

...instead of hard-coding 2.10 repeatedly.

As long as it's not a `<project>`-level `<artifactId>`, I think that we are okay parameterizing these.
2013-12-15 22:02:30 -08:00
Josh Rosen d2ced6d58c Merge pull request #256 from MLnick/master
Fix 'IPYTHON=1 ./pyspark' throwing ValueError

This fixes an annoying issue where running ```IPYTHON=1 ./pyspark``` resulted in:

```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 0.8.0
      /_/

Using Python version 2.7.5 (default, Jun 20 2013 11:06:30)
Spark context avaiable as sc.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/local/lib/python2.7/site-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
    202             else:
    203                 filename = fname
--> 204             __builtin__.execfile(filename, *where)

/Users/Nick/workspace/scala/spark-0.8.0-incubating-bin-hadoop1/python/pyspark/shell.py in <module>()
     30 add_files = os.environ.get("ADD_FILES").split(',') if os.environ.get("ADD_FILES") != None else None
     31
---> 32 sc = SparkContext(os.environ.get("MASTER", "local"), "PySparkShell", pyFiles=add_files)
     33
     34 print """Welcome to

/Users/Nick/workspace/scala/spark-0.8.0-incubating-bin-hadoop1/python/pyspark/context.pyc in __init__(self, master, jobName, sparkHome, pyFiles, environment, batchSize)
     70         with SparkContext._lock:
     71             if SparkContext._active_spark_context:
---> 72                 raise ValueError("Cannot run multiple SparkContexts at once")
     73             else:
     74                 SparkContext._active_spark_context = self

ValueError: Cannot run multiple SparkContexts at once
```

The issue arises since previously IPython didn't seem to respect ```$PYTHONSTARTUP```, but since at least 1.0.0 it has. Technically this might break for older versions of IPython, but most users should be able to upgrade IPython to at least 1.0.0 (and should be encouraged to do so :).
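The failure mode can be reproduced with a small stand-in for the guard visible in the traceback. This is a sketch only: `Context` and `run_startup_script` are hypothetical names, and the assumption is that the shell startup script ended up being executed twice in the same process once IPython began honouring `$PYTHONSTARTUP`.

```python
import threading


class Context(object):
    """Stand-in for the single-active-context guard shown in the traceback."""

    _lock = threading.Lock()
    _active = None

    def __init__(self, name):
        with Context._lock:
            if Context._active is not None:
                raise ValueError("Cannot run multiple SparkContexts at once")
            Context._active = self
        self.name = name


def run_startup_script():
    # Plays the role of shell.py, which creates the shared context.
    return Context("PySparkShell")
```

Calling `run_startup_script()` once succeeds; a second call in the same process raises the `ValueError` shown above.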

New behaviour:
```
Nicks-MacBook-Pro:incubator-spark-mlnick Nick$ IPYTHON=1 ./pyspark
Python 2.7.5 (default, Jun 20 2013, 11:06:30)
Type "copyright", "credits" or "license" for more information.

IPython 1.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/Nick/workspace/scala/incubator-spark-mlnick/tools/target/scala-2.9.3/spark-tools-assembly-0.9.0-incubating-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/Nick/workspace/scala/incubator-spark-mlnick/assembly/target/scala-2.9.3/spark-assembly-0.9.0-incubating-SNAPSHOT-hadoop1.0.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
13/12/12 13:08:15 WARN Utils: Your hostname, Nicks-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 10.0.0.4 instead (on interface en0)
13/12/12 13:08:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
13/12/12 13:08:15 INFO Slf4jEventHandler: Slf4jEventHandler started
13/12/12 13:08:15 INFO SparkEnv: Registering BlockManagerMaster
13/12/12 13:08:15 INFO DiskBlockManager: Created local directory at /var/folders/_l/06wxljt13wqgm7r08jlc44_r0000gn/T/spark-local-20131212130815-0e76
13/12/12 13:08:15 INFO MemoryStore: MemoryStore started with capacity 326.7 MB.
13/12/12 13:08:15 INFO ConnectionManager: Bound socket to port 53732 with id = ConnectionManagerId(10.0.0.4,53732)
13/12/12 13:08:15 INFO BlockManagerMaster: Trying to register BlockManager
13/12/12 13:08:15 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager 10.0.0.4:53732 with 326.7 MB RAM
13/12/12 13:08:15 INFO BlockManagerMaster: Registered BlockManager
13/12/12 13:08:15 INFO HttpBroadcast: Broadcast server started at http://10.0.0.4:53733
13/12/12 13:08:15 INFO SparkEnv: Registering MapOutputTracker
13/12/12 13:08:15 INFO HttpFileServer: HTTP File server directory is /var/folders/_l/06wxljt13wqgm7r08jlc44_r0000gn/T/spark-8f40e897-8211-4628-a7a8-755562d5244c
13/12/12 13:08:16 INFO SparkUI: Started Spark Web UI at http://10.0.0.4:4040
2013-12-12 13:08:16.337 java[56801:4003] Unable to load realm info from SCDynamicStore
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 0.9.0-SNAPSHOT
      /_/

Using Python version 2.7.5 (default, Jun 20 2013 11:06:30)
Spark context avaiable as sc.
```
2013-12-15 14:11:34 -08:00
Reynold Xin c55e698559 Merge pull request #257 from tgravescs/sparkYarnFixName
Fix the --name option for Spark on Yarn

Looks like the --name option accidentally got broken in one of the merges.  The Client hangs if the --name option is used right now.
2013-12-15 12:49:02 -08:00
Reynold Xin ab85f88fd7 Merge pull request #264 from shivaram/spark-class-fix
Use CoarseGrainedExecutorBackend in spark-class
2013-12-15 12:48:32 -08:00
Mark Hamstra 09ed7ddfa0 Use scala.binary.version in POMs 2013-12-15 12:39:58 -08:00
Shivaram Venkataraman fc96ca9f62 Use CoarseGrainedExecutorBackend in spark-class 2013-12-15 11:53:44 -08:00
Nick Pentreath bb5277b10a Making IPython PySpark compatible across versions <1.0.0. Also cleaned up '-i' option and made IPYTHON_OPTS work 2013-12-15 09:39:45 +02:00
Nick Pentreath d36ee3b159 Merge remote-tracking branch 'upstream/master' 2013-12-15 08:34:05 +02:00
Reynold Xin 7db9165961 Merge pull request #251 from pwendell/master
Fix list rendering in YARN markdown docs.

This is some minor clean-up which makes the list render correctly.
2013-12-14 14:16:34 -08:00
Josh Rosen 2fd781d347 Merge pull request #249 from ngbinh/partitionInJavaSortByKey
Expose numPartitions parameter in JavaPairRDD.sortByKey()

This change makes the Java and Scala APIs for sortByKey() the same.
2013-12-14 12:59:37 -08:00
Patrick Wendell 97ac060182 Merge pull request #259 from pwendell/scala-2.10
Migration to Scala 2.10

== Below description was written by Prashant Sharma ==

This PR migrates spark to scala 2.10.

Summary of changes apart from the Scala 2.10 migration:
(These have no implications for users.)
1. Migrated Akka to 2.2.3.
2. Does not use remote death watch, because it has a bug where it tries to send messages to a dead node indefinitely.
3. Uses an indestructible actor system which tolerates errors only on executors.

(These might be useful for users.)
4. New configuration settings introduced:

System.getProperty("spark.akka.heartbeat.pauses", "600")
System.getProperty("spark.akka.failure-detector.threshold", "300.0")
System.getProperty("spark.akka.heartbeat.interval", "1000")

The defaults for these are fairly large, which effectively disables the failure detector that comes with Akka. The reason is that Spark has its own failure-detector-like mechanism in place, so Akka's detector is just overhead on top of that, and it leads to a lot of false positives. With these properties it is possible to enable it. A good use case for enabling it is when someone wants Spark to be sensitive (in a controllable manner, of course) to GC pauses or network lag and to quickly evict executors that experience them. More information is included in configuration.md.
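As a rough illustration, the three properties and their defaults listed above could be turned into JVM `-D` flags like this. The helper name `akka_jvm_flags` is hypothetical, and passing the values as `-D` system properties is an assumption based on the fact that the code reads them with `System.getProperty`; units are not stated in the message, so none are implied here.

```python
# Defaults copied from the System.getProperty calls in the PR description.
AKKA_DEFAULTS = {
    "spark.akka.heartbeat.pauses": "600",
    "spark.akka.failure-detector.threshold": "300.0",
    "spark.akka.heartbeat.interval": "1000",
}


def akka_jvm_flags(overrides=None):
    """Return -Dkey=value flags for the Akka settings, applying overrides."""
    settings = dict(AKKA_DEFAULTS)
    if overrides:
        settings.update(overrides)
    # Sort for a stable, reproducible flag order.
    return ["-D%s=%s" % (key, value) for key, value in sorted(settings.items())]
```

Lowering a value through `overrides` would be the way to make the Akka failure detector more sensitive, as the paragraph above suggests.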

Once SPARK-544 is merged, I would like to deprecate at least these Akka properties, and maybe others too.

This PR is a duplicate of #221 (where all the discussion happened): that one pointed to master, while this one points to the scala-2.10 branch.
2013-12-14 00:22:45 -08:00
Patrick Wendell 7ac944fc27 Merge pull request #262 from pwendell/mvn-fix
Fix maven build issues in 2.10 branch

Found some issues when locally testing maven.
2013-12-13 23:22:08 -08:00
Patrick Wendell 6e8a96c7e7 Fix maven build issues in 2.10 branch 2013-12-13 23:14:08 -08:00
Reynold Xin 6defb061f0 Merge pull request #261 from ScrapCodes/scala-2.10
Added a comment about ActorRef and ActorSelection difference.
2013-12-13 21:18:57 -08:00
Prashant Sharma 1ae3c0fc5e Added a comment about ActorRef and ActorSelection difference. 2013-12-14 10:44:24 +05:30
Reynold Xin 76566b1fc9 Merge pull request #260 from ScrapCodes/scala-2.10
Review comments on the PR for scala 2.10 migration.
2013-12-13 10:11:02 -08:00