ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Patrick Wendell	ff44732171	Minor fix	2014-01-23 19:23:12 -08:00
Patrick Wendell	c3196171f3	Merge pull request #502 from pwendell/clone-1 Remove Hadoop object cloning and warn users making Hadoop RDD's. The code introduced in #359 used Hadoop's WritableUtils.clone() to duplicate objects when reading from Hadoop files. Some users have reported exceptions when cloning data in various file formats, including Avro and another custom format. This patch removes that functionality to ensure stability for the 0.9 release. Instead, it puts a clear warning in the documentation that copying may be necessary for Hadoop data sets.	2014-01-23 19:11:59 -08:00
Patrick Wendell	cad3002fea	Merge pull request #501 from JoshRosen/cartesian-rdd-fixes Fix two bugs in PySpark cartesian(): SPARK-978 and SPARK-1034 This pull request fixes two bugs in PySpark's `cartesian()` method: - [SPARK-978](https://spark-project.atlassian.net/browse/SPARK-978): PySpark's cartesian method throws ClassCastException exception - [SPARK-1034](https://spark-project.atlassian.net/browse/SPARK-1034): Py4JException on PySpark Cartesian Result The JIRAs have more details describing the fixes.	2014-01-23 19:08:34 -08:00
Patrick Wendell	268ecbd231	Minor changes after auditing diff from earlier version	2014-01-23 18:30:11 -08:00
Josh Rosen	f83068497b	Fix for SPARK-1025: PySpark hang on missing files.	2014-01-23 18:24:51 -08:00
Patrick Wendell	c58d4ea3d4	Response to Matei's review	2014-01-23 18:12:40 -08:00
Patrick Wendell	0213b4032a	Fix bug on read-side of external sort when using Snappy. This case wasn't handled correctly and this patch fixes it.	2014-01-23 18:04:55 -08:00
Patrick Wendell	7101017803	Remove Hadoop object cloning and warn users making Hadoop RDD's. The code introduced in #359 used Hadoop's WritableUtils.clone() to duplicate objects when reading from Hadoop files. Some users have reported exceptions when cloning data in verious file formats, including Avro and another custom format. This patch removes that functionality to ensure stability for the 0.9 release. Instead, it puts a clear warning in the documentation that copying may be necessary for Hadoop data sets.	2014-01-23 17:39:23 -08:00
Josh Rosen	61569906cc	Fix SPARK-978: ClassCastException in PySpark cartesian.	2014-01-23 15:09:19 -08:00
Josh Rosen	0035dbbc81	Fix SPARK-1034: Py4JException on PySpark Cartesian Result	2014-01-23 13:05:59 -08:00
Josh Rosen	fad6aacfb0	Merge pull request #406 from eklavya/master Extending Java API coverage Hi, I have added three new methods to JavaRDD. Please review and merge.	2014-01-23 11:14:15 -08:00
Reynold Xin	a2b47dae66	Merge pull request #499 from jianpingjwang/dev1 Replace commons-math with jblas in SVDPlusPlus	2014-01-23 10:48:26 -08:00
eklavya	60e7457266	fixed ClassTag in mapPartitions	2014-01-23 17:40:36 +05:30
Jianping J Wang	19a01c1b1d	Add jblas dependency	2014-01-23 19:54:01 +08:00
Jianping J Wang	a5a513e25e	Add jblas dependency	2014-01-23 19:48:39 +08:00
Jianping J Wang	cc0fd33177	Replace commons-math with jblas	2014-01-23 19:44:30 +08:00
Patrick Wendell	a1cd185122	Merge pull request #496 from pwendell/master Fix bug in worker clean-up in UI Introduced in `d5a96fec` (/cc @aarondav). This should be picked into 0.8 and 0.9 as well. The bug causes old (zombie) workers on a node to not disappear immediately from the UI when a new one registers.	2014-01-22 19:37:29 -08:00
Patrick Wendell	034dce2a7e	Merge pull request #447 from CodingCat/SPARK-1027 fix for SPARK-1027 fix for SPARK-1027 (https://spark-project.atlassian.net/browse/SPARK-1027) FIXES 1. change sparkhome from String to Option(String) in ApplicationDesc 2. remove sparkhome parameter in LaunchExecutor message 3. adjust involved files	2014-01-22 18:58:02 -08:00
Patrick Wendell	6285513147	Fix bug in worker clean-up in UI Introduced in `d5a96fec`. This should be picked into 0.8 and 0.9 as well.	2014-01-22 18:19:52 -08:00
CodingCat	2b3c461451	refactor sparkHome to val clean code	2014-01-22 20:20:46 -05:00
Patrick Wendell	3184facdc5	Merge pull request #495 from srowen/GraphXCommonsMathDependency Fix graphx Commons Math dependency `graphx` depends on Commons Math (2.x) in `SVDPlusPlus.scala`. However the module doesn't declare this dependency. It happens to work because it is included by Hadoop artifacts. But, I can tell you this isn't true as of a month or so ago. Building versus recent Hadoop would fail. (That's how we noticed.) The simple fix is to declare the dependency, as it should be. But it's also worth noting that `commons-math` is the old-ish 2.x line, while `commons-math3` is where newer 3.x releases are. Drop-in replacement, but different artifact and package name. Changing this only usage to `commons-math3` works, tests pass, and isn't surprising that it does, so is probably also worth changing. (A comment in some test code also references `commons-math3`, FWIW.) It does raise another question though: `mllib` looks like it uses the `jblas` `DoubleMatrix` for general purpose vector/matrix stuff. Should `graphx` really use Commons Math for this? Beyond the tiny scope here but worth asking.	2014-01-22 15:45:04 -08:00
Sean Owen	4476398f7d	Also add graphx commons-math3 dependeny in sbt build	2014-01-22 22:40:41 +00:00
Patrick Wendell	a1238bb5fc	Merge pull request #492 from skicavs/master fixed job name and usage information for the JavaSparkPi example	2014-01-22 14:32:59 -08:00
Sean Owen	fd0c5b8c57	Depend on Commons Math explicitly instead of accidentally getting it from Hadoop (which stops working in 2.2.x) and also use the newer commons-math3	2014-01-22 22:25:49 +00:00
Patrick Wendell	576c4a4c50	Merge pull request #478 from sryza/sandy-spark-1033 SPARK-1033. Ask for cores in Yarn container requests Tested on a pseudo-distributed cluster against the Fair Scheduler and observed a worker taking more than a single core.	2014-01-22 14:10:07 -08:00
Matei Zaharia	5bcfd79811	Merge pull request #493 from kayousterhout/double_add Fixed bug where task set managers are added to queue twice @mateiz can you verify that this is a bug and wasn't intentional? (`90a04dab8d (diff-7fa4f84a961750c374f2120ca70e96edR551)`) This bug leads to a small performance hit because task set managers will get offered each rejected resource offer twice, but doesn't lead to any incorrect functionality. Thanks to @hdc1112 for pointing this out.	2014-01-22 14:05:48 -08:00
Matei Zaharia	d009b17d13	Merge pull request #315 from rezazadeh/sparsesvd Sparse SVD # Singular Value Decomposition Given an m x n matrix A, compute matrices U, S, V such that A = U S * V^T* There is no restriction on m, but we require n^2 doubles to fit in memory. Further, n should be less than m. The decomposition is computed by first computing A^TA = V S^2 V^T, computing svd locally on that (since n x n is small), from which we recover S and V. Then we compute U via easy matrix multiplication as U = A V * S^-1* Only singular vectors associated with the largest k singular values If there are k such values, then the dimensions of the return will be: * S is k x k and diagonal, holding the singular values on diagonal. * U is m x k and satisfies U^TU = eye(k). V is n x k and satisfies V^TV = eye(k). All input and output is expected in sparse matrix format, 0-indexed as tuples of the form ((i,j),value) all in RDDs. # Testing Tests included. They test: - Decomposition promise (A = USV^T) - For small matrices, output is compared to that of jblas - Rank 1 matrix test included - Full Rank matrix test included - Middle-rank matrix forced via k included # Example Usage import org.apache.spark.SparkContext import org.apache.spark.mllib.linalg.SVD import org.apache.spark.mllib.linalg.SparseMatrix import org.apache.spark.mllib.linalg.MatrixyEntry // Load and parse the data file val data = sc.textFile("mllib/data/als/test.data").map { line => val parts = line.split(',') MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble) } val m = 4 val n = 4 // recover top 1 singular vector val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), 1) println("singular values = " + decomposed.S.data.toArray.mkString) # Documentation Added to docs/mllib-guide.md	2014-01-22 14:01:30 -08:00
Kay Ousterhout	19da82c50f	Fixed bug where task set managers are added to queue twice This bug leads to a small performance hit because task set managers will get offered each rejected resource offer twice, but doesn't lead to any incorrect functionality.	2014-01-22 09:52:12 -08:00
Kevin Mader	36f9a64ec9	fixed job name and usage information for the JavaSparkPi example	2014-01-22 15:58:23 +01:00
Henry Saputra	90ea9d5a8f	Replace the code to check for Option != None with Option.isDefined call in Scala code. This hopefully will make the code cleaner.	2014-01-21 23:22:10 -08:00
Reynold Xin	749f842827	Merge pull request #489 from ash211/patch-6 Clarify spark.default.parallelism It's the task count across the cluster, not per worker, per machine, per core, or anything else.	2014-01-21 14:53:49 -08:00
Andrew Ash	069bb94206	Clarify spark.default.parallelism It's the task count across the cluster, not per worker, per machine, per core, or anything else.	2014-01-21 14:49:35 -08:00
Reynold Xin	f8544981a6	Merge pull request #469 from ajtulloch/use-local-spark-context-in-tests-for-mllib [MLlib] Use a LocalSparkContext trait in test suites Replaces the 9 instances of ```scala class XXXSuite extends FunSuite with BeforeAndAfterAll { @transient private var sc: SparkContext = _ override def beforeAll() { sc = new SparkContext("local", "test") } override def afterAll() { sc.stop() System.clearProperty("spark.driver.port") } ``` with ```scala class XXXSuite extends FunSuite with LocalSparkContext { ```	2014-01-21 10:49:54 -08:00
Andrew Tulloch	3a067b4a76	Fixed import order	2014-01-21 13:36:53 +00:00
Sandy Ryza	adf42611f1	Incorporate Tom's comments - update doc and code to reflect that core requests may not always be honored	2014-01-21 00:38:02 -08:00
Patrick Wendell	77b986f661	Merge pull request #480 from pwendell/0.9-fixes Handful of 0.9 fixes This patch addresses a few fixes for Spark 0.9.0 based on the last release candidate. @mridulm gets credit for reporting most of the issues here. Many of the fixes here are based on his work in #477 and follow up discussion with him.	2014-01-21 00:09:42 -08:00
Patrick Wendell	a9bcc980b6	Style clean-up	2014-01-21 00:05:28 -08:00
Patrick Wendell	c67d3d8beb	Merge pull request #484 from tdas/run-example-fix Made run-example respect SPARK_JAVA_OPTS and SPARK_MEM. bin/run-example scripts was not passing Java properties set through the SPARK_JAVA_OPTS to the example. This is important for examples like Twitter** as the Twitter authentication information must be set through java properties. Hence added the same JAVA_OPTS code in run-example as it is in bin/spark-class script. Also added SPARK_MEM, in case someone wants to run the example with different amounts of memory. This can be removed if it is not tune with the intended semantics of the run-example scripts. @matei Please check this soon I want this to go in 0.9-rc4	2014-01-20 23:34:35 -08:00
Tathagata Das	65869f843d	Removed SPARK_MEM from run-examples.	2014-01-20 23:15:28 -08:00
Patrick Wendell	a917a87e02	Adding small code comment	2014-01-20 23:11:45 -08:00
Reynold Xin	6b4eed779b	Merge pull request #449 from CrazyJvm/master SPARK-1028 : fix "set MASTER automatically fails" bug. spark-shell intends to set MASTER automatically if we do not provide the option when we start the shell , but there's a problem. The condition is "if [[ "x" != "x$SPARK_MASTER_IP" && "y" != "y$SPARK_MASTER_PORT" ]];" we sure will set SPARK_MASTER_IP explicitly, the SPARK_MASTER_PORT option, however, we probably do not set just using spark default port 7077. So if we do not set SPARK_MASTER_PORT, the condition will never be true. We should just use default port if users do not set port explicitly I think.	2014-01-20 22:35:45 -08:00
Patrick Wendell	0367981d47	Merge pull request #482 from tdas/streaming-example-fix Added StreamingContext.awaitTermination to streaming examples StreamingContext.start() currently starts a non-daemon thread which prevents termination of a Spark Streaming program even if main function has exited. Since the expected behavior of a streaming program is to run until explicitly killed, this was sort of fine when spark streaming applications are launched from the command line. However, when launched in Yarn-standalone mode, this did not work as the driver effectively got terminated when the main function exits. So SparkStreaming examples did not work on Yarn. This addition to the examples ensures that the examples work on Yarn and also ensures that everyone learns that StreamingContext.awaitTermination() being necessary for SparkStreaming programs to wait. The true bug-fix of making sure all threads by Spark Streaming are daemon threads is left for post-0.9.	2014-01-20 22:25:50 -08:00
Reynold Xin	7373ffb5e7	Merge pull request #483 from pwendell/gitignore Restricting /lib to top level directory in .gitignore This patch was proposed by Sean Mackrory.	2014-01-20 21:44:29 -08:00
Tathagata Das	e0b741d07a	Made run-example respect SPARK_JAVA_OPTS and SPARK_MEM.	2014-01-20 20:48:59 -08:00
Patrick Wendell	e437069dce	Restricting /lib to top level directory in .gitignore This patch was proposed by Sean Mackrory.	2014-01-20 20:39:30 -08:00
Tathagata Das	2e95174c45	Added StreamingContext.awaitTermination to streaming examples.	2014-01-20 20:25:04 -08:00
Patrick Wendell	d46df96de3	Avoid matching attempt files in the checkpoint	2014-01-20 20:03:23 -08:00
Patrick Wendell	de526ad527	Remove shuffle files if they are still present on a machine.	2014-01-20 19:11:22 -08:00
Patrick Wendell	f84400e86c	Fixing speculation bug	2014-01-20 19:05:03 -08:00
Patrick Wendell	c324ac10ee	Force use of LZF when spilling data	2014-01-20 19:00:48 -08:00

... 2 3 4 5 6 ...

6456 commits