Commit graph

5761 commits

Author SHA1 Message Date
Joseph E. Gonzalez b1eeefb401 WIP. Updating figures and cleaning up initial skeleton for GraphX Programming guide. 2014-01-10 00:39:08 -08:00
Ankur Dave b5b0de2de5 Start fixing formatting of graphx-programming-guide 2014-01-09 13:24:25 -08:00
Ankur Dave e4483582fc Add docs/graphx-programming-guide.md from 7210257ba3038d5e22d4b60fe9c3113dc45c3dff:README.md 2014-01-09 10:24:43 -08:00
Ankur Dave 7309a29c75 Removed Kryo dependency and graphx-shell 2014-01-09 00:13:23 -08:00
Ankur Dave 22374559a2 Remove GraphX README 2014-01-08 22:48:54 -08:00
Ankur Dave 74fdfac112 Fix AbstractMethodError by inlining zip{Edge,Vertex}Partitions
The zip{Edge,Vertex}Partitions methods created doubly-nested closures
and passed them to zipPartitions. For some reason this caused an
AbstractMethodError when zipPartitions tried to invoke the closure. This
commit works around the problem by inlining these methods wherever they
are called, eliminating the doubly-nested closure.
2014-01-08 21:19:14 -08:00
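For illustration, the call shape involved looks roughly like this (a hedged sketch with made-up names, not the actual GraphX code):

  import scala.reflect.ClassTag
  import org.apache.spark.rdd.RDD

  // A helper like this wraps the user's function `f` in a second closure before
  // handing it to zipPartitions; this is the doubly-nested shape described above:
  def zipElemPartitions[A: ClassTag, B: ClassTag, C: ClassTag](
      a: RDD[A], b: RDD[B])(f: (Iterator[A], Iterator[B]) => Iterator[C]): RDD[C] =
    a.zipPartitions(b) { (aIter, bIter) => f(aIter, bIter) }

  // The workaround inlines the zipPartitions call at each call site, so only a
  // single closure is passed:
  def inlinedUnion(a: RDD[Int], b: RDD[Int]): RDD[Int] =
    a.zipPartitions(b) { (aIter, bIter) => aIter ++ bIter }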
Ankur Dave ab861d8450 Take SparkConf in constructor of Serializer subclasses 2014-01-08 21:19:14 -08:00
Ankur Dave 0ad75cdfb0 Manifest -> Tag in variable names 2014-01-08 21:19:14 -08:00
Ankur Dave ac536345f8 ClassManifest -> ClassTag 2014-01-08 21:19:14 -08:00
Ankur Dave 78d6b13ac8 Fix mis-merge in 44fd30d3fb 2014-01-08 21:19:14 -08:00
Ankur Dave 91227566bc Merge remote-tracking branch 'spark-upstream/master' into HEAD
Conflicts:
	README.md
	core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala
	core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
	core/src/main/scala/org/apache/spark/util/collection/PrimitiveKeyOpenHashMap.scala
	pom.xml
	project/SparkBuild.scala
	repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala
2014-01-08 21:19:08 -08:00
Reynold Xin 04d83fc37f Merge pull request #360 from witgo/master
fix make-distribution.sh show version: command not found
2014-01-08 11:55:37 -08:00
Reynold Xin 56ebfeaa52 Merge pull request #357 from hsaputra/set_boolean_paramname
Set boolean param name for call to SparkHadoopMapReduceUtil.newTaskAttemptID

Set boolean param name for call to SparkHadoopMapReduceUtil.newTaskAttemptID to make it clear which param is being set.
2014-01-08 11:50:06 -08:00
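For illustration, a named boolean argument makes such a call self-documenting (a sketch with a hypothetical signature loosely mirroring newTaskAttemptID, not Spark's actual code):

  // Hypothetical signature; the body is illustrative only.
  def newTaskAttemptID(jtIdentifier: String, jobId: Int,
                       isMap: Boolean, taskId: Int, attemptId: Int): String =
    s"attempt_${jtIdentifier}_${jobId}_${if (isMap) "m" else "r"}_${taskId}_$attemptId"

  // A bare `true` says nothing at the call site; a named argument does.
  val before = newTaskAttemptID("201401081148", 0, true, 0, 0)
  val after  = newTaskAttemptID("201401081148", 0, isMap = true, taskId = 0, attemptId = 0)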
Patrick Wendell bdeaeafbda Merge pull request #358 from pwendell/add-cdh
Add CDH Repository to Maven Build

At some point this was removed from the Maven build... so I'm adding it back. It's needed for the Hadoop2 tests we run on Jenkins and it's also included in the SBT build.
2014-01-08 11:48:39 -08:00
Reynold Xin 5cae05f59e Merge pull request #356 from hsaputra/remove_deprecated_cleanup_method
Remove calls to the deprecated mapred OutputCommitter.cleanupJob

Since Hadoop 1.0.4, the mapred OutputCommitter.commitJob does the cleanup itself via a call to OutputCommitter.cleanupJob.

Also remove SparkHadoopWriter.cleanup, since it is used only by PairRDDFunctions.

In fact, the implementation of the mapred OutputCommitter.commitJob looks like this:

  public void commitJob(JobContext jobContext) throws IOException {
    cleanupJob(jobContext);
  }
2014-01-08 11:47:28 -08:00
liguoqiang cf4aaf92d6 fix make-distribution.sh show version: command not found 2014-01-09 00:34:53 +08:00
Thomas Graves 6eef78d769 Merge pull request #345 from colorant/yarn
Support distributing extra files to workers for YARN client mode

So that the user doesn't need to package every dependency into one assembly jar as the Spark app jar
2014-01-08 08:49:20 -06:00
Patrick Wendell 3209a86f39 Add CDH Repository to Maven Build 2014-01-08 01:21:17 -08:00
Henry Saputra aa56585d21 Resolve PR review over 100 chars 2014-01-08 00:38:29 -08:00
Henry Saputra f6b6f88367 Set boolean param name in two files' calls to SparkHadoopMapReduceUtil.newTaskAttemptID to make it clear which param is being set.
2014-01-07 23:23:17 -08:00
Henry Saputra 4517326ec6 Remove calls to the deprecated mapred OutputCommitter.cleanupJob, because since Hadoop 1.0.4 the mapred OutputCommitter.commitJob does the cleanup itself.

In fact, the implementation of the mapred OutputCommitter.commitJob looks like this:

  public void commitJob(JobContext jobContext) throws IOException {
    cleanupJob(jobContext);
  }

(The jobContext input argument is of type org.apache.hadoop.mapred.JobContext.)
2014-01-07 22:55:56 -08:00
Patrick Wendell bb6a39a687 Merge pull request #322 from falaki/MLLibDocumentationImprovement
SPARK-1009 Updated MLlib docs to show how to use it in Python

In addition, added detailed examples for regression, clustering, and recommendation algorithms in a separate Scala section, and fixed a few minor issues with the existing documentation.
2014-01-07 22:32:18 -08:00
Patrick Wendell cb1b927399 Merge pull request #355 from ScrapCodes/patch-1
Update README.md

The link does not work otherwise.
2014-01-07 22:26:28 -08:00
Patrick Wendell c0f0155eca Merge pull request #313 from tdas/project-refactor
Refactored the streaming project to separate external libraries like Twitter, Kafka, Flume, etc.

At a high level, these are the following changes.

1. All the external code was put in `SPARK_HOME/external/` as separate SBT projects and Maven modules. Their artifact names are `spark-streaming-twitter`, `spark-streaming-kafka`, etc. Both SparkBuild.scala and pom.xml files have been updated. References to external libraries and repositories have been removed from the settings of root and streaming projects/modules.

2. To use the external functionality (say, creating a Twitter stream), the developer has to `import org.apache.spark.streaming.twitter._`. For the Scala API, the developer calls `TwitterUtils.createStream(streamingContext, ...)`; for the Java API, `TwitterUtils.createStream(javaStreamingContext, ...)`. (See the sketch after this entry.)

3. Each external project has its own Scala and Java unit tests. Note that the unit tests of each external library reuse classes from the streaming unit tests (`TestSuiteBase`, `LocalJavaStreamingContext`, etc.). To enable this code sharing among test classes, `dependsOn(streaming % "compile->compile,test->test")` was used in SparkBuild.scala. In streaming/pom.xml, an additional `maven-jar-plugin` was necessary to capture this dependency (see the comment inside the pom.xml for more information).

4. Jars of the external projects have been added to the examples project but not to the assembly project.

5. In some files, imports have been rearranged to conform to the Spark coding guidelines.
2014-01-07 22:21:52 -08:00
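For example, the new Scala entry point can be used roughly like this (a minimal sketch; the master URL and app name are placeholders, and `None` falls back to Twitter4j's default OAuth credentials):

  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.twitter.TwitterUtils

  val ssc = new StreamingContext("local[2]", "TwitterExample", Seconds(10))
  val tweets = TwitterUtils.createStream(ssc, None)  // DStream of twitter4j Status objects
  tweets.map(_.getText).print()                      // print a sample of incoming tweet text
  ssc.start()
  ssc.awaitTermination()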
Prashant Sharma d1f2805712 Update README.md
The link does not work otherwise.
2014-01-08 11:36:26 +05:30
Patrick Wendell f5f12dc282 Merge pull request #336 from liancheng/akka-remote-lookup
Get rid of `Either[ActorRef, ActorSelection]`

In this pull request, instead of returning an `Either[ActorRef, ActorSelection]`, `registerOrLookup` resolves the remote actor blockingly to obtain an `ActorRef`, or throws an exception if the remote actor doesn't exist or the lookup times out (configured by `spark.akka.lookupTimeout`). This function is only called when a `SparkEnv` is constructed (when instantiating a driver or executor), so the blocking call is considered acceptable. Executor-side `ActorSelection`s/`ActorRef`s to the driver-side `MapOutputTrackerMasterActor` and `BlockManagerMasterActor` are affected by this pull request.

`ActorSelection` is dangerous and should be used with care. It's only absolutely safe to send messages via an `ActorSelection` when the remote actor is stateless, so that actor incarnation is irrelevant. But as pointed out by @ScrapCodes in the comments below, the executor exits immediately once the connection to the driver is lost, so `ActorSelection`s are not harmful in this scenario. So this pull request is mostly a code style patch.
2014-01-07 21:56:35 -08:00
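The shape of such a blocking lookup is roughly the following (a hedged sketch using Akka's ActorSelection.resolveOne with illustrative names, not Spark's actual code):

  import scala.concurrent.Await
  import scala.concurrent.duration._
  import akka.actor.{ActorRef, ActorSystem}

  // Resolve a remote actor eagerly, failing fast if it does not exist or the
  // lookup times out (cf. spark.akka.lookupTimeout above).
  def lookupActor(system: ActorSystem, url: String, timeout: FiniteDuration): ActorRef =
    Await.result(system.actorSelection(url).resolveOne(timeout), timeout)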
Matei Zaharia 11891e68c3 Merge pull request #327 from lucarosellini/master
Added ‘-i’ command line option to Spark REPL

We had to create a new implementation of both scala.tools.nsc.CompilerCommand and scala.tools.nsc.Settings, because using scala.tools.nsc.GenericRunnerSettings would bring in other options (-howtorun, -save and -execute) which don’t make sense in Spark.
Any new Spark specific command line option could now be added to org.apache.spark.repl.SparkRunnerSettings class.

Since the behavior of loading a script from the command line should be the same as loading it using the “:load” command inside the shell, the script should be loaded when the SparkContext is available; that's why we had to move the call to ‘loadfiles(settings)’ _after_ the call to postInitialization(). This still doesn't work if ‘isAsync = true’.
2014-01-08 00:32:18 -05:00
Matei Zaharia 7d0aac917b Merge pull request #354 from hsaputra/addasfheadertosbt
Add ASF header to the new sbt script.

Add ASF header to the new sbt script.
2014-01-08 00:30:45 -05:00
Matei Zaharia d75dc428da Merge pull request #350 from mateiz/standalone-limit
Add way to limit default # of cores used by apps in standalone mode

Also documents the spark.deploy.spreadOut option, and fixes a config option that had a dash in its name.
2014-01-08 00:30:03 -05:00
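On the application side, capping an individual app looks roughly like this (a minimal sketch; the master URL is a placeholder, and `spark.cores.max` is the existing per-app cap that the new cluster-wide default complements):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setMaster("spark://master:7077")  // placeholder standalone master URL
    .setAppName("CappedApp")
    .set("spark.cores.max", "4")       // this app never takes more than 4 cores
  val sc = new SparkContext(conf)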
Hossein Falaki 46cb980a5f Fixed merge conflict 2014-01-07 21:28:26 -08:00
Henry Saputra 226b58ada2 Add ASF header to the new sbt script. 2014-01-07 21:07:27 -08:00
Patrick Wendell 61674bcadf Merge pull request #352 from markhamstra/oldArch
Don't leave os.arch unset after BlockManagerSuite

Recent SparkConf changes meant that BlockManagerSuite was now leaving the os.arch system property unset. That's a problem for any subsequent tests that rely upon having a valid os.arch. This is true for CompressionCodecSuite in the usual Maven build test order, even though it isn't usually true for the sbt build.
2014-01-07 18:32:13 -08:00
Patrick Wendell b2e690f839 Merge pull request #328 from falaki/MatrixFactorizationModel-fix
SPARK-1012: DAGScheduler Exception Fix

Added a predict method to MatrixFactorizationModel to enable bulk prediction. This method takes an RDD[(Int, Int)] of users and products and returns an RDD with a Rating element for each element in the input RDD.

Also added Python bindings for the new bulk prediction methods to address the SPARK-1011 issue.

This is ready to be merged now.
2014-01-07 16:57:08 -08:00
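Usage of the new method looks roughly like this (a minimal sketch; `sc` is assumed to be an existing SparkContext and the ratings are toy data):

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  val ratings = sc.parallelize(Seq(
    Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)))
  val model = ALS.train(ratings, 10, 20, 0.01)             // rank, iterations, lambda
  val userProducts = ratings.map(r => (r.user, r.product)) // RDD[(Int, Int)]
  val predictions = model.predict(userProducts)            // bulk prediction: RDD[Rating]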
Mark Hamstra 86ed1ad252 Fix BlockManagerSuite#after 2014-01-07 16:39:37 -08:00
Matei Zaharia 2c421749ea Address review comments 2014-01-07 19:30:23 -05:00
Patrick Wendell 6ccf8ce705 Merge pull request #351 from pwendell/maven-fix
Add log4j exclusion rule to maven.

To make this work I had to rename the defaults file. Otherwise
maven's pattern matching rules included it when trying to match
other log4j.properties files.

I also fixed a bug in the existing maven build where two
<transformers> tags were present in assembly/pom.xml
such that one overwrote the other.
2014-01-07 15:49:14 -08:00
Hossein Falaki 3a8beb46cb Merge branch 'master' into MatrixFactorizationModel-fix 2014-01-07 15:22:42 -08:00
Matei Zaharia 044c8ad3a4 Fix unit test compilation 2014-01-07 16:12:20 -05:00
Patrick Wendell e688e11206 Add log4j exclusion rule to maven.
To make this work I had to rename the defaults file. Otherwise
maven's pattern matching rules included it when trying to match
other log4j.properties files.

I also fixed a bug in the existing maven build where two
<transformers> tags were present in assembly/pom.xml
such that one overwrote the other.
2014-01-07 12:56:24 -08:00
Matei Zaharia d8bcc8e9a0 Add way to limit default # of cores used by applications in standalone mode
Also documents the spark.deploy.spreadOut option.
2014-01-07 14:35:52 -05:00
Reynold Xin 7d5fa175ca Merge pull request #337 from yinxusen/mllib-16-bugfix
Mllib 16 bugfix

Bug fix: https://spark-project.atlassian.net/browse/MLLIB-16

Hi, I fixed the bug and added a test suite for `GradientDescent`. There are two checks in the test case. First, the final loss must be lower than the initial one. Second, the trend of the loss sequence should be decreasing, i.e., at least 80% of iterations have lower losses than their prior iterations.

Thanks!
2014-01-07 11:31:34 -08:00
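The two checks amount to something like the following (a sketch over a plain loss sequence, not the actual test suite code):

  // Returns true iff the loss history passes both checks described above.
  def lossLooksGood(loss: Seq[Double]): Boolean = {
    val finalLower = loss.last < loss.head                 // check 1: net improvement
    val improvedSteps = loss.sliding(2).count { case Seq(prev, cur) => cur < prev }
    val mostlyDecreasing =
      improvedSteps.toDouble / (loss.size - 1) >= 0.8      // check 2: >= 80% of steps improve
    finalLower && mostlyDecreasing
  }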
Reynold Xin 71fc113574 Merge pull request #349 from CodingCat/support-worker_dir
add the comments about SPARK_WORKER_DIR

This env variable seems to have been forgotten.

In many cases we need to set this variable; e.g. on EC2, we have to move the large application log files from EBS to the ephemeral storage.
2014-01-07 11:30:35 -08:00
Tathagata Das 8f02f1c3d4 Fixed examples/pom.xml and run-example based on Patrick's suggestions. 2014-01-07 11:02:29 -08:00
CodingCat 3633172e30 add the comments about SPARK_WORKER_DIR
this env variable seems to be forgotten …
2014-01-07 12:53:04 -05:00
Reynold Xin 15d9534501 Merge pull request #318 from srowen/master
Suggested small changes to Java code for slightly more standard style, encapsulation and in some cases performance

Sorry if this is too abrupt or not a welcome set of changes, but I thought I'd see if I could contribute a little. I'm a Java developer just getting seriously into Spark, so I thought I'd suggest a number of small changes to the couple of Java parts of the code to make them a little tighter, more standard, and even a bit faster.

Feel free to take all, some or none of this. Happy to explain any of it.
2014-01-07 08:10:02 -08:00
Reynold Xin 468af0fa03 Merge pull request #348 from prabeesh/master
spark -> org.apache.spark

Changed the package name spark to org.apache.spark in some files where the rename was missing
2014-01-07 08:09:01 -08:00
Tathagata Das aa99f226a6 Removed XYZFunctions and added XYZUtils as a common Scala and Java interface for creating XYZ streams. 2014-01-07 01:56:15 -08:00
Sean Owen 4b92a20232 Issue #318 : minor style updates per review from Reynold Xin 2014-01-07 09:38:45 +00:00
Patrick Wendell c3cf0475e8 Merge pull request #339 from ScrapCodes/conf-improvements
Conf improvements

There are two new features.

1. Allow users to set arbitrary Akka configurations via SparkConf.

2. Allow the configuration to be printed in the logs for diagnosis.
2014-01-07 00:54:25 -08:00
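Both features hang off SparkConf; exercising them might look like this (a sketch; the property names are historical Spark settings, used for illustration):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.akka.askTimeout", "30") // spark.akka.* values reach the underlying Akka config
    .set("spark.logConf", "true")       // print the effective configuration in the logs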
Luca Rosellini 4689ce29fd Added license header and removed @author tag 2014-01-07 09:44:24 +01:00