Author: Michael Armbrust <michael@databricks.com>
Closes #447 from marmbrus/pythonInsert and squashes the following commits:
c7ab692 [Michael Armbrust] Keep docstrings < 72 chars.
ff62870 [Michael Armbrust] Add insertInto and saveAsTable to Python API.
This gets rid of a warning when compiling core (since we were depending on a deprecated interface with a non-deprecated function). I also tested with javac, and this does the right thing when compiling java code.
Author: Michael Armbrust <michael@databricks.com>
Closes #452 from marmbrus/scalaDeprecation and squashes the following commits:
f628b4d [Michael Armbrust] Use scala deprecation instead of java.
Author: Reynold Xin <rxin@apache.org>
Closes #443 from rxin/readme and squashes the following commits:
16853de [Reynold Xin] Updated SBT and Scala instructions.
3ac3ceb [Reynold Xin] README update
SPARK-1482: Fix potential resource leaks in saveAsHadoopDataset and saveAsNewAPIHadoopDataset
`writer.close` should be put in the `finally` block to avoid potential resource leaks.
JIRA: https://issues.apache.org/jira/browse/SPARK-1482
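The shape of the fix, as a minimal runnable sketch with a toy file writer standing in for the Hadoop RecordWriter:
```
import java.io.{BufferedWriter, FileWriter}

// Toy stand-in for the RecordWriter used by saveAsHadoopDataset.
val records = Iterator(("k1", "v1"), ("k2", "v2"))
val writer = new BufferedWriter(new FileWriter("/tmp/out.txt"))
try {
  // An exception thrown while writing still propagates to the caller...
  records.foreach { case (k, v) => writer.write(s"$k\t$v\n") }
} finally {
  // ...but the writer is always closed, so no file handle leaks.
  writer.close()
}
```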
Author: zsxwing <zsxwing@gmail.com>
Closes #400 from zsxwing/SPARK-1482 and squashes the following commits:
06b197a [zsxwing] SPARK-1482: Fix potential resource leaks in saveAsHadoopDataset and saveAsNewAPIHadoopDataset
This doesn't require creating new Ordering objects per row. Additionally, [view bounds are going to be deprecated](https://issues.scala-lang.org/browse/SI-7629), so we should get rid of them while APIs are still flexible.
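A small illustration of the difference (signatures illustrative, not the Spark source):
```
// Before (view bound, soon deprecated): def maxKey[K <% Ordered[K]](keys: Seq[K]): K
// wraps each element in an Ordered on every comparison.
// After (context bound): one implicit Ordering[K] is resolved per call,
// with no per-element wrapper objects.
def maxKey[K: Ordering](keys: Seq[K]): K = {
  val ord = implicitly[Ordering[K]]
  keys.reduceLeft((a, b) => if (ord.lt(a, b)) b else a)
}

maxKey(Seq(3, 1, 2)) // == 3
```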
Author: Michael Armbrust <michael@databricks.com>
Closes #410 from marmbrus/viewBounds and squashes the following commits:
c574221 [Michael Armbrust] fix example.
812008e [Michael Armbrust] Update Java API.
1b9b85c [Michael Armbrust] Update scala doc.
35798a8 [Michael Armbrust] Remove view bounds on Ordered in favor of a context bound on Ordering.
Author: Reynold Xin <rxin@apache.org>
Closes #444 from rxin/pyspark and squashes the following commits:
fc11356 [Reynold Xin] Made the PySpark shell version checking compatible with Python 2.6.
571830b [Reynold Xin] Fixed broken pyspark shell.
Per discussion, this is my suggestion to make ALS Rating, ClassificationModel, RegressionModel experimental for now, to reserve the right to possibly change after 1.0. See what you think of this much.
Author: Sean Owen <sowen@cloudera.com>
Closes #372 from srowen/SPARK-1357Addendum and squashes the following commits:
17cf1ea [Sean Owen] Remove (another) blank line after ":: Experimental ::"
6800e4c [Sean Owen] Remove blank line after ":: Experimental ::"
b3a88d2 [Sean Owen] Make ALS Rating, ClassificationModel, RegressionModel experimental for now, to reserve the right to possibly change after 1.0
A quick fix for https://issues.apache.org/jira/browse/SPARK-1520
By excluding fastutil, we bring the number of files in the assembly jar back under 65536, so Java 7 won't create the assembly jar in zip64 format, which cannot be read by Java 6.
With this change, the assembly jar now has about 60000 entries (58000 files), tested with both sbt and maven.
Author: Xiangrui Meng <meng@databricks.com>
Closes #437 from mengxr/remove-fastutil and squashes the following commits:
00f9beb [Xiangrui Meng] remove fastutil from dependencies
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #432 from liancheng/reuseRow and squashes the following commits:
9e6d083 [Cheng Lian] Simplified code with BufferedIterator
52acec9 [Cheng Lian] Reuses Row object in ExistingRdd.productToRowRdd()
https://issues.apache.org/jira/browse/SPARK-1483
From the original JIRA: "The parameter name is part of the public API in Scala and Python, since you can pass named parameters to a method, so we should name it to this more descriptive term. Everywhere else we refer to 'splits' as partitions." - @mateiz
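A hedged sketch of the rename pattern (method bodies illustrative): the old name survives as a deprecated forwarder so existing callers keep compiling:
```
object SparkContextSketch {
  def defaultParallelism: Int = Runtime.getRuntime.availableProcessors()

  // New, more descriptive name.
  def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

  // Old name kept as a deprecated forwarder, per the commit log below.
  @deprecated("use defaultMinPartitions", "1.0.0")
  def defaultMinSplits: Int = defaultMinPartitions
}
```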
Author: CodingCat <zhunansjtu@gmail.com>
Closes #430 from CodingCat/SPARK-1483 and squashes the following commits:
4b60541 [CodingCat] deprecate defaultMinSplits
ba2c663 [CodingCat] Rename minSplits to minPartitions in public APIs
This is currently causing many builds to hang.
https://issues.apache.org/jira/browse/SPARK-1530
Author: Patrick Wendell <pwendell@gmail.com>
Closes #440 from pwendell/uitest-fix and squashes the following commits:
9a143dc [Patrick Wendell] Ignore streaming UI test
This will make the tests more stable when not running SQL tests.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #439 from pwendell/hive-tests and squashes the following commits:
88a6032 [Patrick Wendell] FIX: Don't build Hive in assembly unless running Hive tests.
SPARK-1408 Modify Spark on Yarn to point to the history server when app finishes
Note this is dependent on https://github.com/apache/spark/pull/204 to have a working history server, but there are no code dependencies.
This also fixes SPARK-1288 (YARN stable finishApplicationMaster incomplete). Since I was in there, I also made the diagnostic message get passed through properly.
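A hedged sketch of the idea (helper shape and URL format illustrative; the YARN-specific config key is an assumption based on the "Make config yarn specific" commit below):
```
import org.apache.spark.SparkConf

// Illustrative: if the yarn-specific history server address is configured,
// build a tracking URL for the finished application from it.
def historyUrl(conf: SparkConf, appId: String): Option[String] =
  conf.getOption("spark.yarn.historyServer.address")
    .map(addr => s"http://$addr/history/$appId")
```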
Author: Thomas Graves <tgraves@apache.org>
Closes #362 from tgravescs/SPARK-1408 and squashes the following commits:
ec89705 [Thomas Graves] Fix typo.
446122d [Thomas Graves] Make config yarn specific
f5d5373 [Thomas Graves] SPARK-1408 Modify Spark on Yarn to point to the history server when app finishes
This only works for the three paths defined in the environment
(SPARK_JAR, SPARK_YARN_APP_JAR and SPARK_LOG4J_CONF).
Tested by running SparkPi with local: and file: URIs against Yarn cluster (no "upload" shows up in logs in the local case).
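The idea in miniature (helper name hypothetical): a `local:` URI names a file already present on every node, so it is used in place rather than uploaded:
```
import java.net.URI

// Hypothetical helper: true for URIs like "local:/opt/libs/app.jar".
def isLocalUri(uri: String): Boolean = "local" == new URI(uri).getScheme

isLocalUri("local:/opt/libs/app.jar")  // true  -> use the node-local path
isLocalUri("hdfs:///user/me/app.jar")  // false -> distribute as before
```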
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #303 from vanzin/yarn-local and squashes the following commits:
82219c1 [Marcelo Vanzin] [SPARK-1395] Allow "local:" URIs to work on Yarn.
Python alternative for https://github.com/apache/spark/pull/392; managed from shell.py
Author: AbhishekKr <abhikumar163@gmail.com>
Closes #399 from abhishekkr/pyspark_shell and squashes the following commits:
134bdc9 [AbhishekKr] pyspark require Python2, failing if system default is Py3 from shell.py
This will also fix SPARK-1464: Update MLLib Examples to Use Breeze.
Author: Sandeep <sandeep@techaddict.me>
Closes #416 from techaddict/1462 and squashes the following commits:
a43638e [Sandeep] Some Style Changes
3ce69c3 [Sandeep] Fix Ordering and Naming of Imports in Examples
6c7e543 [Sandeep] SPARK-1462: Examples of ML algorithms are using deprecated APIs
It is very confusing when your code throws an exception, but the only stack trace shown is from the DAGScheduler. This is a simple patch to include the stack trace for the actual failure in the error message. Suggestions on formatting welcome.
Before:
```
scala> sc.parallelize(1 :: Nil).map(_ => sys.error("Ahh!")).collect()
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:3 failed 1 times (most recent failure: Exception failure in TID 3 on host localhost: java.lang.RuntimeException: Ahh!)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1055)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1039)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1037)
...
```
After:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:3 failed 1 times, most recent failure: Exception failure in TID 3 on host localhost: java.lang.RuntimeException: Ahh!
scala.sys.package$.error(package.scala:27)
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13)
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$6.apply(RDD.scala:676)
org.apache.spark.rdd.RDD$$anonfun$6.apply(RDD.scala:676)
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1048)
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1048)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:110)
org.apache.spark.scheduler.Task.run(Task.scala:50)
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:46)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1055)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1039)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1037)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1037)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:614)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:614)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:614)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:143)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
```
Author: Michael Armbrust <michael@databricks.com>
Closes #409 from marmbrus/stacktraces and squashes the following commits:
3e4eb65 [Michael Armbrust] indent. include header for driver stack trace.
018b06b [Michael Armbrust] Include stack trace for exceptions in user code.
change _slideDuration to _windowDuration
Author: baishuo(白硕) <vc_java@hotmail.com>
Closes #425 from baishuo/master and squashes the following commits:
6f09ea1 [baishuo(白硕)] Update ReducedWindowedDStream.scala
"By default, this uses only 8 parallel tasks to do the grouping." is a big misleading. Please refer to https://github.com/apache/spark/pull/389
detail is as following code :
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
for (r <- bySize if r.partitioner.isDefined) {
return r.partitioner.get
}
if (rdd.context.conf.contains("spark.default.parallelism")) {
new HashPartitioner(rdd.context.defaultParallelism)
} else {
new HashPartitioner(bySize.head.partitions.size)
}
}
Author: Chen Chao <crazyjvm@gmail.com>
Closes #403 from CrazyJvm/patch-4 and squashes the following commits:
42f6c9e [Chen Chao] fix format
829a995 [Chen Chao] fix format
1568336 [Chen Chao] misleading task number of groupByKey
Author: Kan Zhang <kzhang@apache.org>
Closes #401 from kanzhang/fix-1475 and squashes the following commits:
c6058bd [Kan Zhang] Fixing a race condition in event listener unit test
Each vertex partition is co-located with a pid2vid array created in RoutingTable.scala. This array maps edge partition IDs to the list of vertices in the current vertex partition that are mentioned by edges in that partition. Therefore the pid2vid array should have one entry per edge partition.
GraphX currently creates one entry per *vertex* partition, which is a bug that leads to an ArrayIndexOutOfBoundsException when there are more edge partitions than vertex partitions. This commit fixes the bug and adds a test for this case.
Resolves SPARK-1329. Thanks to Daniel Darabos for reporting this bug.
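A hedged sketch of the sizing bug (array construction illustrative, not the GraphX source):
```
import scala.collection.mutable.ArrayBuffer

val numVertexPartitions = 4
val numEdgePartitions = 8 // more edge partitions than vertex partitions

// pid2vid is indexed by *edge* partition id, so it needs one slot per edge
// partition. Sizing it by vertex partitions (the bug) makes pid2vid(5)
// throw ArrayIndexOutOfBoundsException in this scenario.
val pid2vidBuggy = Array.fill(numVertexPartitions)(new ArrayBuffer[Long])
val pid2vidFixed = Array.fill(numEdgePartitions)(new ArrayBuffer[Long])
```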
Author: Ankur Dave <ankurdave@gmail.com>
Closes #368 from ankurdave/fix-pid2vid-size and squashes the following commits:
5a5c52a [Ankur Dave] SPARK-1329: Create pid2vid with correct number of partitions
GraphImpl.reverse used to reverse edges in each partition of the edge RDD but preserve the routing table and replicated vertex view, since reversing should not affect partitioning.
However, the old routing table would then have incorrect information for srcAttrOnly and dstAttrOnly. These RDDs should be switched.
A simple fix is for Graph.reverse to rebuild the routing table and replicated vertex view.
Thanks to Bogdan Ghidireac for reporting this issue on the [mailing list](http://apache-spark-user-list.1001560.n3.nabble.com/graph-reverse-amp-Pregel-API-td4338.html).
Author: Ankur Dave <ankurdave@gmail.com>
Closes #431 from ankurdave/fix-reverse-bug and squashes the following commits:
75d63cb [Ankur Dave] Rebuild routing table after Graph.reverse
JIRA issue:[SPARK-1511](https://issues.apache.org/jira/browse/SPARK-1511)
The TestUtils.createCompiledClass method uses renameTo() to move files, which fails when the src and dest files are on different disks or partitions. This PR uses Files.move() instead. The move method tries renameTo() first and then falls back to copy() and delete(), which should handle this issue.
I didn't find a test suite for this file, so I added a file existence check after the move.
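A minimal sketch of the change (paths hypothetical), using Guava's Files.move, which attempts a rename and falls back to copy-and-delete across filesystems:
```
import com.google.common.io.Files
import java.io.File

val source = new File("/tmp/build/Foo.class")      // hypothetical paths
val dest   = new File("/home/user/out/Foo.class")  // possibly another disk

// renameTo() silently returns false across filesystems; Files.move()
// tries a rename first, then falls back to copy + delete.
Files.move(source, dest)
assert(dest.exists, s"Failed to move $source to $dest")
```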
Author: Ye Xianjin <advancedxy@gmail.com>
Closes #427 from advancedxy/SPARK-1511 and squashes the following commits:
a2b97c7 [Ye Xianjin] Based on @srowen's comment, assert file existence.
6f95550 [Ye Xianjin] use Files.move instead of renameTo to handle the src and dest files are in different disks or partitions.
YARN-1824 changes the APIs (addToEnvironment, setEnvFromInputString) in Apps, which causes the Spark build to break when built against version 2.4.0. To fix this, Spark gets its own functions providing that functionality, which keep compilation working against 2.3 and the other 2.x versions.
Author: xuan <xuan@MacBook-Pro.local>
Author: xuan <xuan@macbook-pro.home>
Closes #396 from xgong/master and squashes the following commits:
42b5984 [xuan] Remove two extra imports
bc0926f [xuan] Remove usage of org.apache.hadoop.util.Shell
be89fa7 [xuan] fix Spark compilation is broken with the latest hadoop-2.4.0 release
SPARK-1469: Scheduler mode should accept lower-case definitions and have nicer error messages
There are two improvements to Scheduler Mode (see the sketch after this list):
1. Made the built-in ones case-insensitive (fair/FAIR, fifo/FIFO).
2. If an invalid mode is given, print a better error message.
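A hedged sketch of the parsing logic (names illustrative, not the exact patch):
```
object SchedulingMode extends Enumeration {
  val FAIR, FIFO, NONE = Value
}

// Accept "fair", "FAIR", "fifo", "FIFO"; fail clearly on anything else.
def parseSchedulingMode(name: String): SchedulingMode.Value =
  try {
    SchedulingMode.withName(name.toUpperCase)
  } catch {
    case _: NoSuchElementException =>
      throw new IllegalArgumentException(
        s"Unrecognized spark.scheduler.mode: $name (expected FAIR or FIFO)")
  }
```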
Author: Sandeep <sandeep@techaddict.me>
Closes #388 from techaddict/1469 and squashes the following commits:
a31bbd5 [Sandeep] SPARK-1469: Scheduler mode should accept lower-case definitions and have nicer error messages There are two improvements to Scheduler Mode: 1. Made the built in ones case insensitive (fair/FAIR, fifo/FIFO). 2. If an invalid mode is given we should print a better error message.
(I wasn't sure how to automatically set `SPARK_YARN=true` and `SPARK_HIVE=true` when running scalastyle, but these are the errors that turn up.)
Author: Sean Owen <sowen@cloudera.com>
Closes #413 from srowen/SPARK-1497 and squashes the following commits:
f0c9318 [Sean Owen] Fix more scalastyle warnings in yarn
80bf4c3 [Sean Owen] Add YARN alpha / YARN profile to scalastyle check
026319c [Sean Owen] Fix scalastyle warnings in YARN, Hive code
Author: Holden Karau <holden@pigscanfly.ca>
Closes #18 from holdenk/addkfoldcrossvalidation and squashes the following commits:
208db9b [Holden Karau] Fix a bad space
e84f2fc [Holden Karau] Fix the test, we should be looking at the second element instead
6ddbf05 [Holden Karau] swap training and validation order
7157ae9 [Holden Karau] CR feedback
90896c7 [Holden Karau] New line
150889c [Holden Karau] Fix up error messages in the MLUtilsSuite
2cb90b3 [Holden Karau] Fix the names in kFold
c702a96 [Holden Karau] Fix imports in MLUtils
e187e35 [Holden Karau] Move { up to same line as whenExecuting(random) in RandomSamplerSuite.scala
c5b723f [Holden Karau] clean up
7ebe4d5 [Holden Karau] CR feedback, remove unecessary learners (came back during merge mistake) and insert an empty line
bb5fa56 [Holden Karau] extra line sadness
163c5b1 [Holden Karau] code review feedback 1.to -> 1 to and folds -> numFolds
5a33f1d [Holden Karau] Code review follow up.
e8741a7 [Holden Karau] CR feedback
b78804e [Holden Karau] Remove cross validation [TODO in another pull request]
91eae64 [Holden Karau] Consolidate things in mlutils
264502a [Holden Karau] Add a test for the bug that was found with BernoulliSampler not copying the complement param
dd0b737 [Holden Karau] Wrap long lines (oops)
c0b7fa4 [Holden Karau] Switch FoldedRDD to use BernoulliSampler and PartitionwiseSampledRDD
08f8e4d [Holden Karau] Fix BernoulliSampler to respect complement
a751ec6 [Holden Karau] Add k-fold cross validation to MLLib
Actually, the value 8 is only valid in Mesos fine-grained mode:
```
override def defaultParallelism() = sc.conf.getInt("spark.default.parallelism", 8)
```
while in coarse-grained mode, including Mesos coarse-grained, the value of the property depends on the number of cores:
```
override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}
```
Author: Chen Chao <crazyjvm@gmail.com>
Closes #389 from CrazyJvm/patch-2 and squashes the following commits:
84a7fe4 [Chen Chao] miss </li> at the end of every single line
04a9796 [Chen Chao] change format
ee0fae0 [Chen Chao] update spark.default.parallelism
When running Hive tests, the working directory is `$SPARK_HOME/sql/hive`, while when running `sbt hive/console`, it becomes `$SPARK_HOME`, and test tables are not loaded if `HIVE_DEV_HOME` is not defined.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #417 from liancheng/loadTestTables and squashes the following commits:
7cea8d6 [Cheng Lian] Loads test tables when running "sbt hive/console" without HIVE_DEV_HOME
This is not an issue with the driver UI, but when you fire
up the history server, there's currently no way to go back to
the app listing page without editing the browser's location
field (since the logo's link points to the root of the
application's own UI - i.e. the "stages" tab).
The change just points the logo link to "/", which is the app
listing for the history server, and the stages tab for the
driver's UI.
Tested with both history server and live driver.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #408 from vanzin/web-ui-root and squashes the following commits:
1b60cb6 [Marcelo Vanzin] Make "spark logo" link refer to "/".
JIRA issue: [SPARK-959](https://spark-project.atlassian.net/browse/SPARK-959)
SBT 0.13.2 has been officially released. This version updated Ivy 2.0 to Ivy 2.3, which fixes [IVY-899](https://issues.apache.org/jira/browse/IVY-899). This PR also removed previous workaround.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #426 from liancheng/updateSbt and squashes the following commits:
95e3dc8 [Cheng Lian] Updated SBT from 0.13.1 to 0.13.2 to fix SPARK-959
This makes it possible to create tables and insert into them using the DSL and SQL for the Scala and Java APIs.
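A hedged usage sketch (setup illustrative; `saveAsTable` and `insertInto` come from the commit log below, and persistent tables assume Hive support is compiled in):
```
// Assumes an existing SparkContext `sc` and a Hive-enabled build.
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
import hiveContext._

case class Person(name: String, age: Int)
sc.parallelize(Seq(Person("Alice", 30))).saveAsTable("people") // create + fill
sc.parallelize(Seq(Person("Bob", 25))).insertInto("people")    // append rows
hql("SELECT name FROM people").collect()
```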
Author: Michael Armbrust <michael@databricks.com>
Closes #354 from marmbrus/insertIntoTable and squashes the following commits:
6c6f227 [Michael Armbrust] Create random temporary files in python parquet unit tests.
f5e6d5c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into insertIntoTable
765c506 [Michael Armbrust] Add to JavaAPI.
77b512c [Michael Armbrust] typos.
5c3ef95 [Michael Armbrust] use names for boolean args.
882afdf [Michael Armbrust] Change createTableAs to saveAsTable. Clean up api annotations.
d07d94b [Michael Armbrust] Add tests, support for creating parquet files and hive tables.
fa3fe81 [Michael Armbrust] Make insertInto available on JavaSchemaRDD as well. Add createTableAs function.
This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
Some to-do items left:
- [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
- [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
- [x] Explain how to use these in the Python MLlib docs.
CC @mengxr, @joshrosen
Author: Matei Zaharia <matei@databricks.com>
Closes #341 from mateiz/py-ml-update and squashes the following commits:
d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
b9f97a3 [Matei Zaharia] Fix test
1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
154f45d [Matei Zaharia] Update docs, name some magic values
881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
I saw a `No "scala-library*.jar" in Scala compiler library` error in IDEA. It seems that upgrading `sbt-idea` to 1.6.0 fixed the problem.
Author: Xiangrui Meng <meng@databricks.com>
Closes #419 from mengxr/idea-plugin and squashes the following commits:
fb3c35f [Xiangrui Meng] update sbt-idea to version 1.6.0
This is a simple first step towards avoiding running the Hive tests
whenever possible.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #420 from pwendell/test-isolation and squashes the following commits:
350c8af [Patrick Wendell] SPARK-1455: Better isolation for unit tests.
Added documentation for users of the decision tree algorithms for classification and regression in the Spark 1.0 release.
Apart from a general review, I need specific input on the following:
* I had to move a lot of the existing documentation under the *linear methods* umbrella to accommodate decision trees. I wonder if there is a better way to organize the programming guide given we are so close to the release.
* I have not looked closely at pyspark, but I am wondering whether new MLlib algorithms are automatically plugged in or whether we need to do some extra work to call MLlib functions from pyspark. I will add to the pyspark examples based upon the advice I get.
cc: @mengxr, @hirakendu, @etrain, @atalwalkar
Author: Manish Amde <manish9ue@gmail.com>
Closes #402 from manishamde/tree_doc and squashes the following commits:
022485a [Manish Amde] more documentation
865826e [Manish Amde] minor: grammar
dbb0e5e [Manish Amde] minor improvements to text
b9ef6c4 [Manish Amde] basic decision tree code examples
6e297d7 [Manish Amde] added subsections
f427e84 [Manish Amde] renaming sections
9c0c4be [Manish Amde] split candidate
6925275 [Manish Amde] impurity and information gain
94fd2f9 [Manish Amde] more reorg
b93125c [Manish Amde] more subsection reorg
3ecb2ad [Manish Amde] minor text addition
1537dd3 [Manish Amde] added placeholders and some doc
d06511d [Manish Amde] basic skeleton
This PR uses Breeze's L-BFGS implementation; the Breeze dependency has already been introduced by Xiangrui's sparse input format work in SPARK-1212. Nice work, @mengxr!
When used with a regularized updater, we need to compute regVal and regGradient (the gradient of the regularization part of the cost function); in the current updater design, we can compute those two values in the following way.
Let's review how the updater works when returning newWeights given the input parameters:
w' = w - thisIterStepSize * (gradient + regGradient(w))
Note that regGradient is a function of w! If we set gradient = 0 and thisIterStepSize = 1, then
regGradient(w) = w - w'
As a result, regVal can be computed by
```
val regVal = updater.compute(
  weights,
  new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2
```
and regGradient can be obtained by
```
val regGradient = weights.sub(
  updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1)
```
The PR includes the tests which compare the result with SGD with/without regularization.
We did a comparison between LBFGS and SGD, and we often saw 10x fewer steps with LBFGS while the cost per step is the same (just computing the gradient).
The following paper by Prof. Ng at Stanford compares different optimizers, including LBFGS and SGD. They use them in the context of deep learning, but it is worth reading as a reference.
http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf
Author: DB Tsai <dbtsai@alpinenow.com>
Closes #353 from dbtsai/dbtsai-LBFGS and squashes the following commits:
984b18e [DB Tsai] L-BFGS Optimizer based on Breeze's implementation. Also fixed indentation issue in GradientDescent optimizer.
The Graph.apply test in GraphSuite had some assertions in a closure in
a graph transformation. As a consequence, these assertions never
actually executed. Furthermore, these closures had a reference to
(non-serializable) test harness classes because they called assert(),
which could be a problem if we proactively check closure serializability
in the future.
This commit simply changes the Graph.apply test to collect the graph
triplets so it can assert about each triplet from a map method.
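The pattern in a small runnable sketch (a plain RDD rather than a graph, for brevity):
```
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("sketch"))
val rdd = sc.parallelize(1 to 10)

// Anti-pattern: an assert inside a lazy transformation never executes unless
// something materializes the RDD, and assert() can drag non-serializable
// test-harness classes into the closure:
//   rdd.map { x => assert(x > 0); x }   // does nothing on its own

// Pattern used by the fix: collect first, then assert locally.
val doubled = rdd.map(_ * 2).collect()
doubled.foreach(v => assert(v % 2 == 0))
```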
Author: William Benton <willb@redhat.com>
Closes #415 from willb/graphsuite-nop-fix and squashes the following commits:
0b63658 [William Benton] Ensure assertions in Graph.apply are asserted.
Currently it requires NumPy 1.7 due to using the copyto method (http://docs.scipy.org/doc/numpy/reference/generated/numpy.copyto.html) for extracting data out of an array. This replaces it with a fallback that works on older NumPy versions.
Author: Sandeep <sandeep@techaddict.me>
Closes #391 from techaddict/1426 and squashes the following commits:
d365962 [Sandeep] SPARK-1426: Make MLlib work with NumPy versions older than 1.7 Currently it requires NumPy 1.7 due to using the copyto method (http://docs.scipy.org/doc/numpy/reference/generated/numpy.copyto.html) for extracting data out of an array. Replace it with a fallback
An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries, with string keys and primitive values (boolean, float, int, long, string) can be converted into a SchemaRDD that supports sql queries.
```
from pyspark.context import SQLContext
sqlCtx = SQLContext(sc)
rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"}, {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
srdd = sqlCtx.applySchema(rdd)
sqlCtx.registerRDDAsTable(srdd, "table1")
srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1")
srdd2.collect()
```
The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": "row2"}, {"f1" : 3, "f2": "row3"}]```
Author: Ahir Reddy <ahirreddy@gmail.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #363 from ahirreddy/pysql and squashes the following commits:
0294497 [Ahir Reddy] Updated log4j properties to supress Hive Warns
307d6e0 [Ahir Reddy] Style fix
6f7b8f6 [Ahir Reddy] Temporary fix MIMA checker. Since we now assemble Spark jar with Hive, we don't want to check the interfaces of all of our hive dependencies
3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py
29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and caching PythonRDD
f2312c7 [Ahir Reddy] Moved everything into sql.py
a19afe4 [Ahir Reddy] Doc fixes
6d658ba [Ahir Reddy] Remove the metastore directory created by the HiveContext tests in SparkSQL
521ff6d [Ahir Reddy] Trying to get spark to build with hive
ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins
ded03e7 [Ahir Reddy] Added doc test for HiveContext
22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency
e4da06c [Ahir Reddy] Display message if hive is not built into spark
227a0be [Michael Armbrust] Update API links. Fix Hive example.
58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api. Minor fixes.
4285340 [Michael Armbrust] Fix building of Hive API Docs.
38a92b0 [Michael Armbrust] Add note to future non-python developers about python docs.
337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build
40491c9 [Ahir Reddy] PR Changes + Method Visibility
1836944 [Michael Armbrust] Fix comments.
e00980f [Michael Armbrust] First draft of python sql programming guide.
b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit test
f98a422 [Ahir Reddy] HiveContexts
79621cf [Ahir Reddy] cleaning up cruft
b406ba0 [Ahir Reddy] doctest formatting
20936a5 [Ahir Reddy] Added tests and documentation
e4d21b4 [Ahir Reddy] Added pyrolite dependency
79f739d [Ahir Reddy] added more tests
7515ba0 [Ahir Reddy] added more tests :)
d26ec5e [Ahir Reddy] added test
e9f5b8d [Ahir Reddy] adding tests
906d180 [Ahir Reddy] added todo explaining cost of creating Row object in python
251f99d [Ahir Reddy] for now only allow dictionaries as input
09b9980 [Ahir Reddy] made jrdd explicitly lazy
c608947 [Ahir Reddy] SchemaRDD now has all RDD operations
725c91e [Ahir Reddy] awesome row objects
55d1c76 [Ahir Reddy] return row objects
4fe1319 [Ahir Reddy] output dictionaries correctly
be079de [Ahir Reddy] returning dictionaries works
cd5f79f [Ahir Reddy] Switched to using Scala SQLContext
e948bd9 [Ahir Reddy] yippie
4886052 [Ahir Reddy] even better
c0fb1c6 [Ahir Reddy] more working
043ca85 [Ahir Reddy] working
5496f9f [Ahir Reddy] doesn't crash
b8b904b [Ahir Reddy] Added schema rdd class
67ba875 [Ahir Reddy] java to python, and python to java
bcc0f23 [Ahir Reddy] Java to python
ab6025d [Ahir Reddy] compiling
For your consideration: scalac currently notes a number of feature warnings during compilation:
```
[warn] there were 65 feature warning(s); re-run with -feature for details
```
Warnings are like:
```
[warn] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:1261: implicit conversion method rddToPairRDDFunctions should be enabled
[warn] by making the implicit value scala.language.implicitConversions visible.
[warn] This can be achieved by adding the import clause 'import scala.language.implicitConversions'
[warn] or by setting the compiler option -language:implicitConversions.
[warn] See the Scala docs for value scala.language.implicitConversions for a discussion
[warn] why the feature should be explicitly enabled.
[warn] implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) =
[warn] ^
```
scalac is suggesting that it's just best practice to explicitly enable certain language features by importing them where used.
This PR simply adds the imports it suggests (and squashes one other Java warning along the way). This leaves just deprecation warnings in the build.
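For example, a file that defines an implicit conversion now carries the import scalac asks for:
```
import scala.language.implicitConversions

object Conversions {
  // Explicitly enabling the feature silences the compiler warning.
  implicit def intToLabel(i: Int): String = s"label-$i"
}
```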
Author: Sean Owen <sowen@cloudera.com>
Closes #404 from srowen/SPARK-1488 and squashes the following commits:
8598980 [Sean Owen] Quiet scalac warnings about language features by explicitly importing language features.
39bc831 [Sean Owen] Enable -feature in scalac to emit language feature warnings
Fixed several bugs in the in-memory columnar storage to make `HiveInMemoryCompatibilitySuite` pass.
@rxin @marmbrus It is reasonable to include `HiveInMemoryCompatibilitySuite` in this PR, but I didn't, since it significantly increases test execution time. What do you think?
**UPDATE** `HiveCompatibilitySuite` has been made to cache tables in memory. `HiveInMemoryCompatibilitySuite` was removed.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #374 from liancheng/inMemBugFix and squashes the following commits:
6ad6d9b [Cheng Lian] Merged HiveCompatibilitySuite and HiveInMemoryCompatibilitySuite
5bdbfe7 [Cheng Lian] Revert 882c538 & 8426ddc, which introduced regression
882c538 [Cheng Lian] Remove attributes field from InMemoryColumnarTableScan
32cc9ce [Cheng Lian] Code style cleanup
99382bf [Cheng Lian] Enable compression by default
4390bcc [Cheng Lian] Report error for any Throwable in HiveComparisonTest
d1df4fd [Michael Armbrust] Remove test tables that might always get created anyway?
ab9e807 [Michael Armbrust] Fix the logged console version of failed test cases to use the new syntax.
1965123 [Michael Armbrust] Don't use coalesce for gathering all data to a single partition, as it does not work correctly with mutable rows.
e36cdd0 [Michael Armbrust] Spelling.
2d0e168 [Michael Armbrust] Run Hive tests in-memory too.
6360723 [Cheng Lian] Made PreInsertionCasts support SparkLogicalPlan and InMemoryColumnarTableScan
c9b0f6f [Cheng Lian] Let InsertIntoTable support InMemoryColumnarTableScan
9c8fc40 [Cheng Lian] Disable compression by default
e619995 [Cheng Lian] Bug fix: incorrect byte order in CompressionScheme.columnHeaderSize
8426ddc [Cheng Lian] Bug fix: InMemoryColumnarTableScan should cache columns specified by the attributes argument
036cd09 [Cheng Lian] Clean up unused imports
44591a5 [Cheng Lian] Bug fix: NullableColumnAccessor.hasNext must take nulls into account
052bf41 [Cheng Lian] Bug fix: should only gather compressibility info for non-null values
95b3301 [Cheng Lian] Fixed bugs in IntegralDelta
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-1415).
The new Hadoop API of `InputFormat` does not provide the `minSplits` parameter, which makes the API incompatible between `HadoopRDD` and `NewHadoopRDD`. This PR constructs compatible APIs.
Though `minSplits` is deprecated by the new Hadoop API, we think it is better to make the APIs compatible here.
**Note** that `minSplits` in `wholeTextFiles` could only be treated as a *suggestion*, the real number of splits may not be greater than `minSplits` due to `isSplitable()=false`.
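A hedged usage sketch, assuming an existing SparkContext `sc` (path hypothetical; the parameter is the `minSplits` this PR adds):
```
// Each record is (fileName, fileContent); the actual number of splits may
// be smaller than minSplits because a whole file is never split.
val files = sc.wholeTextFiles("hdfs:///data/small-texts", minSplits = 10)
files.map { case (name, content) => (name, content.length) }.collect()
```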
Author: Xusen Yin <yinxusen@gmail.com>
Closes #376 from yinxusen/hadoop-min-split and squashes the following commits:
76417f6 [Xusen Yin] refine comments
c10af60 [Xusen Yin] refine comments and rewrite new class for wholeTextFile
766d05b [Xusen Yin] refine Java API and comments
4875755 [Xusen Yin] add minSplits for WholeTextFiles
The Spark codebase is a bit fast-and-loose when accessing classloaders and this has caused a few bugs to surface in master.
This patch defines some utility methods for accessing classloaders. This makes the intention when accessing a classloader much more explicit in the code and fixes a few cases where the wrong one was chosen.
case (a) -> We want the classloader that loaded Spark
case (b) -> We want the context class loader, or if not present, we want (a)
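A hedged sketch of the two utilities described above (names illustrative):
```
object ClassLoaderUtils {
  // (a) the classloader that loaded Spark (here: this utility object):
  def getSparkClassLoader: ClassLoader = getClass.getClassLoader

  // (b) the thread's context classloader when set, otherwise (a):
  def getContextOrSparkClassLoader: ClassLoader =
    Option(Thread.currentThread().getContextClassLoader)
      .getOrElse(getSparkClassLoader)
}
```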
This patch provides a better fix for SPARK-1403 (https://issues.apache.org/jira/browse/SPARK-1403) than the current work around, which it reverts. It also fixes a previously unreported bug that the `./spark-submit` script did not work for running with `local` master. It didn't work because the executor classloader did not properly delegate to the context class loader (if it is defined) and in local mode the context class loader is set by the `./spark-submit` script. A unit test is added for that case.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #398 from pwendell/class-loaders and squashes the following commits:
b4a1a58 [Patrick Wendell] Minor clean up
14f1272 [Patrick Wendell] SPARK-1480: Clean up use of classloaders
[SPARK-1403] I investigated why Spark 0.9.0 loads fine on Mesos while Spark 1.0.0 fails. What I found was that in SparkEnv.scala, while creating the SparkEnv object, the current thread's classloader is null. But in 0.9.0, at the same place, it is set to org.apache.spark.repl.ExecutorClassLoader. I saw that 7edbea41b4 moved it to its current place. I moved it back and saw that 1.0.0 started working fine on Mesos.
I just created a minimal patch that allows me to run spark on mesos correctly. It seems like SecurityManager's creation needs to be taken into account for a correct fix. Also moving the creation of the serializer out of SparkEnv might be a part of the right solution. PTAL.
Author: Bharath Bhushan <manku.timma@outlook.com>
Closes #322 from manku-timma/spark-1403 and squashes the following commits:
606c2b9 [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
ec8f870 [Bharath Bhushan] revert the logger change for java 6 compatibility as PR 334 is doing it
728beca [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
044027d [Bharath Bhushan] fix compile error
6f260a4 [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
b3a053f [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
04b9662 [Bharath Bhushan] add missing line
4803c19 [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
f3c9a14 [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
42d3d6a [Bharath Bhushan] used code fragment from @ueshin to fix the problem in a better way
89109d7 [Bharath Bhushan] move the class loader creation back to where it was in 0.9.0