Commit graph

1162 commits

Author SHA1 Message Date
Vincenzo Selvaggio 814b3dabdf [SPARK-7272] [MLLIB] User guide for PMML model export
https://issues.apache.org/jira/browse/SPARK-7272

Author: Vincenzo Selvaggio <vselvaggio@hotmail.it>

Closes #6219 from selvinsource/mllib_pmml_model_export_SPARK-7272 and squashes the following commits:

c866fb8 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
1beda98 [Vincenzo Selvaggio] [SPARK-7272] Initial user guide for pmml export
d670662 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
2731375 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
680dc33 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
2e298b5 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
a932f51 [Vincenzo Selvaggio] Create mllib-pmml-model-export.md
2015-05-18 08:46:33 -07:00
Sean Owen 1fd33815f4 [SPARK-4556] [BUILD] binary distribution assembly can't run in local mode
Add note on building a runnable distribution with make-distribution.sh

Author: Sean Owen <sowen@cloudera.com>

Closes #6186 from srowen/SPARK-4556 and squashes the following commits:

4002966 [Sean Owen] Add pointer to --help flag
9fa7883 [Sean Owen] Add note on building a runnable distribution with make-distribution.sh
2015-05-16 08:18:41 +01:00
FavioVazquez d41ae4344c [SPARK-7671] Fix wrong URLs in MLlib Data Types Documentation
There is a mistake in the URL of Matrices in the MLlib Data Types documentation (Local matrix scala section), the URL points to https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices which is a mistake, since Matrices is an object that implements factory methods for Matrix that does not have a companion class. The correct link should point to https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$

There is another mistake, in the Local Vector section in Scala, Java and Python

In the Scala section the URL of Vectors points to the trait Vector (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and not to the factory methods implemented in Vectors.

The correct link should be: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$

In the Java section the URL of Vectors points to the Interface Vector (https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/Vector.html) and not to the Class Vectors

The correct link should be:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/Vectors.html

In the Python section the URL of Vectors points to the class Vector (https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vector) and not the Class Vectors

The correct link should be:
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors

Author: FavioVazquez <favio.vazquezp@gmail.com>

Closes #6196 from FavioVazquez/fix-typo-matrices-mllib-datatypes and squashes the following commits:

3e9efd5 [FavioVazquez] - Fixed wrong URLs in the MLlib Data Types Documentation
9af7074 [FavioVazquez] Merge remote-tracking branch 'upstream/master'
edab1ef [FavioVazquez] Merge remote-tracking branch 'upstream/master'
b2e2f8c [FavioVazquez] Merge remote-tracking branch 'upstream/master'
2015-05-16 08:07:03 +01:00
Liang-Chi Hsieh c8696337e2 [SPARK-7556] [ML] [DOC] Add user guide for spark.ml Binarizer, including Scala, Java and Python examples
JIRA: https://issues.apache.org/jira/browse/SPARK-7556

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6116 from viirya/binarizer_doc and squashes the following commits:

40cb677 [Liang-Chi Hsieh] Better print out.
5b7ef1d [Liang-Chi Hsieh] Make examples more clear.
1bf9c09 [Liang-Chi Hsieh] For comments.
6cf8cba [Liang-Chi Hsieh] Add user guide for Binarizer.
2015-05-15 15:05:04 -07:00
FavioVazquez 7fb715de6d [SPARK-7249] Updated Hadoop dependencies due to inconsistency in the versions
Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons.

Changes proposed by vanzin resulting from previous pull-request https://github.com/apache/spark/pull/5783 that did not fixed the problem correctly.

Please let me know if this is the correct way of doing this, the comments of vanzin are in the pull-request mentioned.

Author: FavioVazquez <favio.vazquezp@gmail.com>

Closes #5786 from FavioVazquez/update-hadoop-dependencies and squashes the following commits:

11670e5 [FavioVazquez] - Added missing instance of -Phadoop-2.2 in create-release.sh
379f50d [FavioVazquez] - Added instances of -Phadoop-2.2 in create-release.sh, run-tests, scalastyle and building-spark.md - Reconstructed docs to not ask users to rely on default behavior
3f9249d [FavioVazquez] Merge branch 'master' of https://github.com/apache/spark into update-hadoop-dependencies
31bdafa [FavioVazquez] - Added missing instances in -Phadoop-1 in create-release.sh, run-tests and in the building-spark documentation
cbb93e8 [FavioVazquez] - Added comment related to SPARK-3710 about  hadoop-yarn-server-tests in Hadoop 2.2 that fails to pull some needed dependencies
83dc332 [FavioVazquez] - Cleaned up the main POM concerning the yarn profile - Erased hadoop-2.2 profile from yarn/pom.xml and its content was integrated into yarn/pom.xml
93f7624 [FavioVazquez] - Deleted unnecessary comments and <activation> tag on the YARN profile in the main POM
668d126 [FavioVazquez] - Moved <dependencies> <activation> and <properties> sections of the hadoop-2.2 profile in the YARN POM to the YARN profile in the root POM - Erased unnecessary hadoop-2.2 profile from the YARN POM
fda6a51 [FavioVazquez] - Updated hadoop1 releases in create-release.sh  due to changes in the default hadoop version set - Erased unnecessary instance of -Dyarn.version=2.2.0 in create-release.sh - Prettify comment in yarn/pom.xml
0470587 [FavioVazquez] - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in create-release.sh - Updated how the releases are made in the create-release.sh no that the default hadoop version is the 2.2.0 - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in scalastyle - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in run-tests - Better example given in the hadoop-third-party-distributions.md now that the default hadoop version is 2.2.0
a650779 [FavioVazquez] - Default value of avro.mapred.classifier has been set to hadoop2 in pom.xml - Cleaned up hadoop-2.3 and 2.4 profiles due to change in the default set in avro.mapred.classifier in pom.xml
199f40b [FavioVazquez] - Erased unnecessary CDH5-specific note in docs/building-spark.md - Remove example of instance -Phadoop-2.2 -Dhadoop.version=2.2.0 in docs/building-spark.md - Enabled hadoop-2.2 profile when the Hadoop version is 2.2.0, which is now the default .Added comment in the yarn/pom.xml to specify that.
88a8b88 [FavioVazquez] - Simplified Hadoop profiles due to new setting of global properties in the pom.xml file - Added comment to specify that the hadoop-2.2 profile is now the default hadoop profile in the pom.xml file - Erased hadoop-2.2 from related hadoop profiles now that is a no-op in the make-distribution.sh file
70b8344 [FavioVazquez] - Fixed typo in the make-distribution.sh file and added hadoop-1 in the Related profiles
287fa2f [FavioVazquez] - Updated documentation about specifying the hadoop version in building-spark. Now is clear that Spark will build against Hadoop 2.2.0 by default. - Added Cloudera CDH 5.3.3 without MapReduce example in the building-spark doc.
1354292 [FavioVazquez] - Fixed hadoop-1 version to match jenkins build profile in hadoop1.0 tests and documentation
6b4bfaf [FavioVazquez] - Cleanup in hadoop-2.x profiles since they contained mostly redundant stuff.
7e9955d [FavioVazquez] - Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons
660decc [FavioVazquez] - Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons
ec91ce3 [FavioVazquez] - Updated protobuf-java version of com.google.protobuf dependancy to fix blocking error when connecting to HDFS via the Hadoop Cloudera HDFS CDH5 (fix for 2.5.0-cdh5.3.3 version)
2015-05-14 15:22:58 +01:00
Joseph K. Bradley f0c1bc3472 [SPARK-7557] [ML] [DOC] User guide for spark.ml HashingTF, Tokenizer
Added feature transformer subsection to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide.

I've run Scala, Python examples in the Spark/PySpark shells.  I ran the Java examples via the test suite (with small modifications for printing).

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6093 from jkbradley/hashingtf-guide and squashes the following commits:

d5d213f [Joseph K. Bradley] small fix
dd6e91a [Joseph K. Bradley] fixes from code review of user guide
33c3ff9 [Joseph K. Bradley] small fix
bc6058c [Joseph K. Bradley] fix link
361a174 [Joseph K. Bradley] Added subsection for feature transformers to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide
2015-05-12 16:39:56 -07:00
Yuhao Yang 1d703660d4 [SPARK-7496] [MLLIB] Update Programming guide with Online LDA
jira: https://issues.apache.org/jira/browse/SPARK-7496

Update LDA subsection of clustering section of MLlib programming guide to include OnlineLDA.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #6046 from hhbyyh/ldaDocument and squashes the following commits:

4b6fbfa [Yuhao Yang] add online paper and some comparison
fd4c983 [Yuhao Yang] update lda document for optimizers
2015-05-12 15:12:29 -07:00
vidmantas zemleris 640f63b959 [SPARK-6994][SQL] Update docs for fetching Row fields by name
add docs for https://issues.apache.org/jira/browse/SPARK-6994

Author: vidmantas zemleris <vidmantas@vinted.com>

Closes #6030 from vidma/docs/row-with-named-fields and squashes the following commits:

241b401 [vidmantas zemleris] [SPARK-6994][SQL] Update docs for fetching Row fields by name
2015-05-11 22:29:24 -07:00
Reynold Xin 3a9b6997df [SPARK-7462][SQL] Update documentation for retaining grouping columns in DataFrames.
Author: Reynold Xin <rxin@databricks.com>

Closes #6062 from rxin/agg-retain-doc and squashes the following commits:

43e511e [Reynold Xin] [SPARK-7462][SQL] Update documentation for retaining grouping columns in DataFrames.
2015-05-11 18:07:12 -07:00
gchen 8e674331d9 [SPARK-7516] [Minor] [DOC] Replace depreciated inferSchema() with createDataFrame()
JIRA: https://issues.apache.org/jira/browse/SPARK-7516

In sql-programming-guide, deprecated python data frame api inferSchema() should be replaced by createDataFrame():

schemaPeople = sqlContext.inferSchema(people) ->
schemaPeople = sqlContext.createDataFrame(people)

Author: gchen <chenguancheng@gmail.com>

Closes #6041 from gchen/python-docs and squashes the following commits:

c27eb7c [gchen] replace inferSchema() with createDataFrame()
2015-05-11 14:37:18 -07:00
Kousuke Saruta 6e9910c21a [SPARK-7515] [DOC] Update documentation for PySpark on YARN with cluster mode
Now PySpark on YARN with cluster mode is supported so let's update doc.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #6040 from sarutak/update-doc-for-pyspark-on-yarn and squashes the following commits:

ad9f88c [Kousuke Saruta] Brushed up sentences
469fd2e [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into update-doc-for-pyspark-on-yarn
fcfdb92 [Kousuke Saruta] Updated doc for PySpark on YARN with cluster mode
2015-05-11 14:19:11 -07:00
Sandy Ryza 82fee9d9aa [SPARK-6470] [YARN] Add support for YARN node labels.
This is difficult to write a test for because it relies on the latest version of YARN, but I verified manually that the patch does pass along the label expression on this version and containers are successfully launched.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #5242 from sryza/sandy-spark-6470 and squashes the following commits:

6af87b9 [Sandy Ryza] Change info to warning
6e22d99 [Sandy Ryza] [YARN] SPARK-6470.  Add support for YARN node labels.
2015-05-11 12:09:39 -07:00
Kirill A. Korinskiy 8c07c75c98 [SPARK-5521] PCA wrapper for easy transform vectors
I implement a simple PCA wrapper for easy transform of vectors by PCA for example LabeledPoint or another complicated structure.

Example of usage:
```
  import org.apache.spark.mllib.regression.LinearRegressionWithSGD
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.feature.PCA

  val data = sc.textFile("data/mllib/ridge-data/lpsa.data").map { line =>
    val parts = line.split(',')
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
  }.cache()

  val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
  val training = splits(0).cache()
  val test = splits(1)

  val pca = PCA.create(training.first().features.size/2, data.map(_.features))
  val training_pca = training.map(p => p.copy(features = pca.transform(p.features)))
  val test_pca = test.map(p => p.copy(features = pca.transform(p.features)))

  val numIterations = 100
  val model = LinearRegressionWithSGD.train(training, numIterations)
  val model_pca = LinearRegressionWithSGD.train(training_pca, numIterations)

  val valuesAndPreds = test.map { point =>
    val score = model.predict(point.features)
    (score, point.label)
  }

  val valuesAndPreds_pca = test_pca.map { point =>
    val score = model_pca.predict(point.features)
    (score, point.label)
  }

  val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
  val MSE_pca = valuesAndPreds_pca.map{case(v, p) => math.pow((v - p), 2)}.mean()

  println("Mean Squared Error = " + MSE)
  println("PCA Mean Squared Error = " + MSE_pca)
```

Author: Kirill A. Korinskiy <catap@catap.ru>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4304 from catap/pca and squashes the following commits:

501bcd9 [Joseph K. Bradley] Small updates: removed k from Java-friendly PCA fit().  In PCASuite, converted results to set for comparison. Added an error message for bad k in PCA.
9dcc02b [Kirill A. Korinskiy] [SPARK-5521] fix scala style
1892a06 [Kirill A. Korinskiy] [SPARK-5521] PCA wrapper for easy transform vectors
2015-05-10 13:34:00 -07:00
dobashim 7d0f17208c [STREAMING] [DOCS] Fix wrong url about API docs of StreamingListener
A little fix about wrong url of the API document. (org.apache.spark.streaming.scheduler.StreamingListener)

Author: dobashim <dobashim@oss.nttdata.co.jp>

Closes #6024 from dobashim/master and squashes the following commits:

ac9a955 [dobashim] [STREAMING][DOCS] Fix wrong url about API docs of StreamingListener
2015-05-09 10:14:46 +01:00
Imran Rashid c796be70f3 [SPARK-3454] separate json endpoints for data in the UI
Exposes data available in the UI as json over http.  Key points:

* new endpoints, handled independently of existing XyzPage classes.  Root entrypoint is `JsonRootResource`
* Uses jersey + jackson for routing & converting POJOs into json
* tests against known results in `HistoryServerSuite`
* also fixes some minor issues w/ the UI -- synchronizing on access to `StorageListener` & `StorageStatusListener`, and fixing some inconsistencies w/ the way we handle retained jobs & stages.

Author: Imran Rashid <irashid@cloudera.com>

Closes #5940 from squito/SPARK-3454_better_test_files and squashes the following commits:

1a72ed6 [Imran Rashid] rats
85fdb3e [Imran Rashid] Merge branch 'no_php' into SPARK-3454
1fc65b0 [Imran Rashid] Revert "Revert "[SPARK-3454] separate json endpoints for data in the UI""
1276900 [Imran Rashid] get rid of giant event file, replace w/ smaller one; check both shuffle read & shuffle write
4e12013 [Imran Rashid] just use test case name for expectation file name
863ef64 [Imran Rashid] rename json files to avoid strange file names and not look like php
2015-05-08 16:54:32 +01:00
Octavian Geagla 658a478d3f [SPARK-5726] [MLLIB] Elementwise (Hadamard) Vector Product Transformer
See https://issues.apache.org/jira/browse/SPARK-5726

Author: Octavian Geagla <ogeagla@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4580 from ogeagla/spark-mllib-weighting and squashes the following commits:

fac12ad [Octavian Geagla] [SPARK-5726] [MLLIB] Use new createTransformFunc.
90f7e39 [Joseph K. Bradley] small cleanups
4595165 [Octavian Geagla] [SPARK-5726] [MLLIB] Remove erroneous test case.
ded3ac6 [Octavian Geagla] [SPARK-5726] [MLLIB] Pass style checks.
37d4705 [Octavian Geagla] [SPARK-5726] [MLLIB] Incorporated feedback.
1dffeee [Octavian Geagla] [SPARK-5726] [MLLIB] Pass style checks.
e436896 [Octavian Geagla] [SPARK-5726] [MLLIB] Remove 'TF' from 'ElementwiseProductTF'
cb520e6 [Octavian Geagla] [SPARK-5726] [MLLIB] Rename HadamardProduct to ElementwiseProduct
4922722 [Octavian Geagla] [SPARK-5726] [MLLIB] Hadamard Vector Product Transformer
2015-05-07 14:49:55 -07:00
ksonj fae4e2d609 [SPARK-7035] Encourage __getitem__ over __getattr__ on column access in the Python DataFrame API
Author: ksonj <kson@siberie.de>

Closes #5971 from ksonj/doc and squashes the following commits:

dadfebb [ksonj] __getitem__ is cleaner than __getattr__
2015-05-07 01:02:00 -07:00
Reynold Xin 51b3d41e16 Revert "[SPARK-3454] separate json endpoints for data in the UI"
This reverts commit d49735800d.

The commit broke Spark on Windows.
2015-05-05 19:27:30 -07:00
zsxwing fec7b29f55 [SPARK-7351] [STREAMING] [DOCS] Add spark.streaming.ui.retainedBatches to docs
The default value will be changed to `1000` in #5533. So here I just used `1000`.

Author: zsxwing <zsxwing@gmail.com>

Closes #5899 from zsxwing/SPARK-7351 and squashes the following commits:

e1ec515 [zsxwing] [SPARK-7351][Streaming][Docs] Add spark.streaming.ui.retainedBatches to docs
2015-05-05 13:42:23 -07:00
Imran Rashid d49735800d [SPARK-3454] separate json endpoints for data in the UI
Exposes data available in the UI as json over http.  Key points:

* new endpoints, handled independently of existing XyzPage classes.  Root entrypoint is `JsonRootResource`
* Uses jersey + jackson for routing & converting POJOs into json
* tests against known results in `HistoryServerSuite`
* also fixes some minor issues w/ the UI -- synchronizing on access to `StorageListener` & `StorageStatusListener`, and fixing some inconsistencies w/ the way we handle retained jobs & stages.

Author: Imran Rashid <irashid@cloudera.com>

Closes #4435 from squito/SPARK-3454 and squashes the following commits:

da1e35f [Imran Rashid] typos etc.
5e78b4f [Imran Rashid] fix rendering problems
5ae02ad [Imran Rashid] Merge branch 'master' into SPARK-3454
f016182 [Imran Rashid] change all constructors json-pojo class constructors to be private[spark] to protect us from mima-false-positives if we add fields
3347b72 [Imran Rashid] mark EnumUtil as @Private
ec140a2 [Imran Rashid] create @Private
cc1febf [Imran Rashid] add docs on the metrics-as-json api
cbaf287 [Imran Rashid] Merge branch 'master' into SPARK-3454
56db31e [Imran Rashid] update tests for mulit-attempt
7f3bc4e [Imran Rashid] Revert "add sbt-revolved plugin, to make it easier to start & stop http servers in sbt"
67008b4 [Imran Rashid] rats
9e51400 [Imran Rashid] style
c9bae1c [Imran Rashid] handle multiple attempts per app
b87cd63 [Imran Rashid] add sbt-revolved plugin, to make it easier to start & stop http servers in sbt
188762c [Imran Rashid] multi-attempt
2af11e5 [Imran Rashid] Merge branch 'master' into SPARK-3454
befff0c [Imran Rashid] review feedback
14ac3ed [Imran Rashid] jersey-core needs to be explicit; move version & scope to parent pom.xml
f90680e [Imran Rashid] Merge branch 'master' into SPARK-3454
dc8a7fe [Imran Rashid] style, fix errant comments
acb7ef6 [Imran Rashid] fix indentation
7bf1811 [Imran Rashid] move MetricHelper so mima doesnt think its exposed; comments
9d889d6 [Imran Rashid] undo some unnecessary changes
f48a7b0 [Imran Rashid] docs
52bbae8 [Imran Rashid] StorageListener & StorageStatusListener needs to synchronize internally to be thread-safe
31c79ce [Imran Rashid] asm no longer needed for SPARK_PREPEND_CLASSES
b2f8b91 [Imran Rashid] @DeveloperApi
2e19be2 [Imran Rashid] lazily convert ApplicationInfo to avoid memory overhead
ba3d9d2 [Imran Rashid] upper case enums
39ac29c [Imran Rashid] move EnumUtil
d2bde77 [Imran Rashid] update error handling & scoping
4a234d3 [Imran Rashid] avoid jersey-media-json-jackson b/c of potential version conflicts
a157a2f [Imran Rashid] style
7bd4d15 [Imran Rashid] delete security test, since it doesnt do anything
a325563 [Imran Rashid] style
a9c5cf1 [Imran Rashid] undo changes superceeded by master
0c6f968 [Imran Rashid] update deps
1ed0d07 [Imran Rashid] Merge branch 'master' into SPARK-3454
4c92af6 [Imran Rashid] style
f2e63ad [Imran Rashid] Merge branch 'master' into SPARK-3454
c22b11f [Imran Rashid] fix compile error
9ea682c [Imran Rashid] go back to good ol' java enums
cf86175 [Imran Rashid] style
d493b38 [Imran Rashid] Merge branch 'master' into SPARK-3454
f05ae89 [Imran Rashid] add in ExecutorSummaryInfo for MiMa :(
101a698 [Imran Rashid] style
d2ef58d [Imran Rashid] revert changes that had HistoryServer refresh the application listing more often
b136e39b [Imran Rashid] Revert "add sbt-revolved plugin, to make it easier to start & stop http servers in sbt"
e031719 [Imran Rashid] fixes from review
1f53a66 [Imran Rashid] style
b4a7863 [Imran Rashid] fix compile error
2c8b7ee [Imran Rashid] rats
1578a4a [Imran Rashid] doc
674f8dc [Imran Rashid] more explicit about total numbers of jobs & stages vs. number retained
9922be0 [Imran Rashid] Merge branch 'master' into stage_distributions
f5a5196 [Imran Rashid] undo removal of renderJson from MasterPage, since there is no substitute yet
db61211 [Imran Rashid] get JobProgressListener directly from UI
fdfc181 [Imran Rashid] stage/taskList
63eb4a6 [Imran Rashid] tests for taskSummary
ad27de8 [Imran Rashid] error handling on quantile values
b2efcaf [Imran Rashid] cleanup, combine stage-related paths into one resource
aaba896 [Imran Rashid] wire up task summary
a4b1397 [Imran Rashid] stage metric distributions
e48ba32 [Imran Rashid] rename
eaf3bbb [Imran Rashid] style
25cd894 [Imran Rashid] if only given day, assume GMT
51eaedb [Imran Rashid] more visibility fixes
9f28b7e [Imran Rashid] ack, more cleanup
99764e1 [Imran Rashid] Merge branch 'SPARK-3454_w_jersey' into SPARK-3454
a61a43c [Imran Rashid] oops, remove accidental checkin
a066055 [Imran Rashid] set visibility on a lot of classes
1f361c8 [Imran Rashid] update rat-excludes
0be5120 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
2382bef [Imran Rashid] switch to using new "enum"
fef6605 [Imran Rashid] some utils for working w/ new "enum" format
dbfc7bf [Imran Rashid] style
b86bcb0 [Imran Rashid] update test to look at one stage attempt
5f9df24 [Imran Rashid] style
7fd156a [Imran Rashid] refactor jsonDiff to avoid code duplication
73f1378 [Imran Rashid] test json; also add test cases for cleaned stages & jobs
97d411f [Imran Rashid] json endpoint for one job
0c96147 [Imran Rashid] better error msgs for bad stageId vs bad attemptId
dddbd29 [Imran Rashid] stages have attempt; jobs are sorted; resource for all attempts for one stage
190c17a [Imran Rashid] StagePage should distinguish no task data, from unknown stage
84cd497 [Imran Rashid] AllJobsPage should still report correct completed & failed job count, even if some have been cleaned, to make it consistent w/ AllStagesPage
36e4062 [Imran Rashid] SparkUI needs to know about startTime, so it can list its own applicationInfo
b4c75ed [Imran Rashid] fix merge conflicts; need to widen visibility in a few cases
e91750a [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
56d2fc7 [Imran Rashid] jersey needs asm for SPARK_PREPEND_CLASSES to work
f7df095 [Imran Rashid] add test for accumulables, and discover that I need update after all
9c0c125 [Imran Rashid] add accumulableInfo
00e9cc5 [Imran Rashid] more style
3377e61 [Imran Rashid] scaladoc
d05f7a9 [Imran Rashid] dont use case classes for status api POJOs, since they have binary compatibility issues
654cecf [Imran Rashid] move all the status api POJOs to one file
b86e2b0 [Imran Rashid] style
18a8c45 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
5598f19 [Imran Rashid] delete some unnecessary code, more to go
56edce0 [Imran Rashid] style
017c755 [Imran Rashid] add in metrics now available
1b78cb7 [Imran Rashid] fix some import ordering
0dc3ea7 [Imran Rashid] if app isnt found, reload apps from FS before giving up
c7d884f [Imran Rashid] fix merge conflicts
0c12b50 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
b6a96a8 [Imran Rashid] compare json by AST, not string
cd37845 [Imran Rashid] switch to using java.util.Dates for times
a4ab5aa [Imran Rashid] add in explicit dependency on jersey 1.9 -- maven wasn't happy before this
4fdc39f [Imran Rashid] refactor case insensitive enum parsing
cba1ef6 [Imran Rashid] add security (maybe?) for metrics json
f0264a7 [Imran Rashid] switch to using jersey for metrics json
bceb3a9 [Imran Rashid] set http response code on error, some testing
e0356b6 [Imran Rashid] put new test expectation files in rat excludes (is this OK?)
b252e7a [Imran Rashid] small cleanup of accidental changes
d1a8c92 [Imran Rashid] add sbt-revolved plugin, to make it easier to start & stop http servers in sbt
4b398d0 [Imran Rashid] expose UI data as json in new endpoints
2015-05-05 07:25:40 -05:00
Sandy Ryza 4222da68dc [SPARK-5112] Expose SizeEstimator as a developer api
"The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD."
-the Tuning Spark page

This is a pain. It would be much nicer to expose simply functionality for understanding the memory footprint of a Java object.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3913 from sryza/sandy-spark-5112 and squashes the following commits:

8d9e082 [Sandy Ryza] Add SizeEstimator in org.apache.spark
2e1a906 [Sandy Ryza] Revert "Move SizeEstimator out of util"
93f4cd0 [Sandy Ryza] Move SizeEstimator out of util
e21c1f4 [Sandy Ryza] Remove unused import
798ab88 [Sandy Ryza] Update documentation and add to SparkContext
34c523c [Sandy Ryza] SPARK-5112. Expose SizeEstimator as a developer api
2015-05-05 12:38:46 +01:00
shekhar.bansal fc8feaa8e9 [SPARK-6653] [YARN] New config to specify port for sparkYarnAM actor system
Author: shekhar.bansal <shekhar.bansal@guavus.com>

Closes #5719 from zuxqoj/master and squashes the following commits:

5574ff7 [shekhar.bansal] [SPARK-6653][yarn] New config to specify port for sparkYarnAM actor system
5117258 [shekhar.bansal] [SPARK-6653][yarn] New config to specify port for sparkYarnAM actor system
9de5330 [shekhar.bansal] [SPARK-6653][yarn] New config to specify port for sparkYarnAM actor system
456a592 [shekhar.bansal] [SPARK-6653][yarn] New configuration property to specify port for sparkYarnAM actor system
803e93e [shekhar.bansal] [SPARK-6653][yarn] New configuration property to specify port for sparkYarnAM actor system
2015-05-05 11:09:51 +01:00
Sean Owen 9e25b09f88 [SPARK-7302] [DOCS] SPARK building documentation still mentions building for yarn 0.23
Remove references to Hadoop 0.23

CC tgravescs Is this what you had in mind? basically all refs to 0.23?
We don't support YARN 0.23, but also don't support Hadoop 0.23 anymore AFAICT. There are no builds or releases for it.

In fact, on a related note, refs to CDH3 (Hadoop 0.20.2) should be removed as this certainly isn't supported either.

Author: Sean Owen <sowen@cloudera.com>

Closes #5863 from srowen/SPARK-7302 and squashes the following commits:

42f5d1e [Sean Owen] Remove CDH3 (Hadoop 0.20.2) refs too
dad02e3 [Sean Owen] Remove references to Hadoop 0.23
2015-05-03 21:22:31 +01:00
BenFradet ea841efc5a [SPARK-7255] [STREAMING] [DOCUMENTATION] Added documentation for spark.streaming.kafka.maxRetries
Added documentation for spark.streaming.kafka.maxRetries

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #5808 from BenFradet/master and squashes the following commits:

cc72e7a [BenFradet] updated doc for spark.streaming.kafka.maxRetries to explain the default value
18f823e [BenFradet] Added "consecutive" to the spark.streaming.kafka.maxRetries doc
597fdeb [BenFradet] Mention that spark.streaming.kafka.maxRetries only applies to the direct kafka api
0efad39 [BenFradet] Added documentation for spark.streaming.kafka.maxRetries
2015-05-02 23:41:14 +01:00
Chris Heller 8f50a07d21 [SPARK-2691] [MESOS] Support for Mesos DockerInfo
This patch adds partial support for running spark on mesos inside of a docker container. Only fine-grained mode is presently supported, and there is no checking done to ensure that the version of libmesos is recent enough to have a DockerInfo structure in the protobuf (other than pinning a mesos version in the pom.xml).

Author: Chris Heller <hellertime@gmail.com>

Closes #3074 from hellertime/SPARK-2691 and squashes the following commits:

d504af6 [Chris Heller] Assist type inference
f64885d [Chris Heller] Fix errant line length
17c41c0 [Chris Heller] Base Dockerfile on mesosphere/mesos image
8aebda4 [Chris Heller] Simplfy Docker image docs
1ae7f4f [Chris Heller] Style points
974bd56 [Chris Heller] Convert map to flatMap
5d8bdf7 [Chris Heller] Factor out the DockerInfo construction.
7b75a3d [Chris Heller] Align to styleguide
80108e7 [Chris Heller] Bend to the will of RAT
ba77056 [Chris Heller] Explicit RAT exclude
abda5e5 [Chris Heller] Wildcard .rat-excludes
2f2873c [Chris Heller] Exclude spark-mesos from RAT
a589a5b [Chris Heller] Add example Dockerfile
b6825ce [Chris Heller] Remove use of EasyMock
eae1b86 [Chris Heller] Move properties under 'spark.mesos.'
c184d00 [Chris Heller] Use map on Option to be consistent with non-coarse code
fb9501a [Chris Heller] Bumped mesos version to current release
fa11879 [Chris Heller] Add listenerBus to EasyMock
882151e [Chris Heller] Changes to scala style
b22d42d [Chris Heller] Exclude template from RAT
db536cf [Chris Heller] Remove unneeded mocks
dea1bd5 [Chris Heller] Force default protocol
7dac042 [Chris Heller] Add test for DockerInfo
5456c0c [Chris Heller] Adjust syntax style
521c194 [Chris Heller] Adjust version info
6e38f70 [Chris Heller] Document Mesos Docker properties
29572ab [Chris Heller] Support all DockerInfo fields
b8c0dea [Chris Heller] Support for mesos DockerInfo in coarse-mode.
482a9fd [Chris Heller] Support for mesos DockerInfo in fine-grained mode.
2015-05-01 18:41:22 -07:00
Hari Shreedharan b1f4ca82d1 [SPARK-5342] [YARN] Allow long running Spark apps to run on secure YARN/HDFS
Take 2. Does the same thing as #4688, but fixes Hadoop-1 build.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #5823 from harishreedharan/kerberos-longrunning and squashes the following commits:

3c86bba [Hari Shreedharan] Import fixes. Import postfixOps explicitly.
4d04301 [Hari Shreedharan] Minor formatting fixes.
b5e7a72 [Hari Shreedharan] Remove reflection, use a method in SparkHadoopUtil to update the token renewer.
7bff6e9 [Hari Shreedharan] Make sure all required classes are present in the jar. Fix import order.
e851f70 [Hari Shreedharan] Move the ExecutorDelegationTokenRenewer to yarn module. Use reflection to use it.
36eb8a9 [Hari Shreedharan] Change the renewal interval config param. Fix a bunch of comments.
611923a [Hari Shreedharan] Make sure the namenodes are listed correctly for creating tokens.
09fe224 [Hari Shreedharan] Use token.renew to get token's renewal interval rather than using hdfs-site.xml
6963bbc [Hari Shreedharan] Schedule renewal in AM before starting user class. Else, a restarted AM cannot access HDFS if the user class tries to.
072659e [Hari Shreedharan] Fix build failure caused by thread factory getting moved to ThreadUtils.
f041dd3 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
42eead4 [Hari Shreedharan] Remove RPC part. Refactor and move methods around, use renewal interval rather than max lifetime to create new tokens.
ebb36f5 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
bc083e3 [Hari Shreedharan] Overload RegisteredExecutor to send tokens. Minor doc updates.
7b19643 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
8a4f268 [Hari Shreedharan] Added docs in the security guide. Changed some code to ensure that the renewer objects are created only if required.
e800c8b [Hari Shreedharan] Restore original RegisteredExecutor message, and send new tokens via NewTokens message.
0e9507e [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
7f1bc58 [Hari Shreedharan] Minor fixes, cleanup.
bcd11f9 [Hari Shreedharan] Refactor AM and Executor token update code into separate classes, also send tokens via akka on executor startup.
f74303c [Hari Shreedharan] Move the new logic into specialized classes. Add cleanup for old credentials files.
2f9975c [Hari Shreedharan] Ensure new tokens are written out immediately on AM restart. Also, pikc up the latest suffix from HDFS if the AM is restarted.
61b2b27 [Hari Shreedharan] Account for AM restarts by making sure lastSuffix is read from the files on HDFS.
62c45ce [Hari Shreedharan] Relogin from keytab periodically.
fa233bd [Hari Shreedharan] Adding logging, fixing minor formatting and ordering issues.
42813b4 [Hari Shreedharan] Remove utils.sh, which was re-added due to merge with master.
0de27ee [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
55522e3 [Hari Shreedharan] Fix failure caused by Preconditions ambiguity.
9ef5f1b [Hari Shreedharan] Added explanation of how the credentials refresh works, some other minor fixes.
f4fd711 [Hari Shreedharan] Fix SparkConf usage.
2debcea [Hari Shreedharan] Change the file structure for credentials files. I will push a followup patch which adds a cleanup mechanism for old credentials files. The credentials files are small and few enough for it to cause issues on HDFS.
af6d5f0 [Hari Shreedharan] Cleaning up files where changes weren't required.
f0f54cb [Hari Shreedharan] Be more defensive when updating the credentials file.
f6954da [Hari Shreedharan] Got rid of Akka communication to renew, instead the executors check a known file's modification time to read the credentials.
5c11c3e [Hari Shreedharan] Move tests to YarnSparkHadoopUtil to fix compile issues.
b4cb917 [Hari Shreedharan] Send keytab to AM via DistributedCache rather than directly via HDFS
0985b4e [Hari Shreedharan] Write tokens to HDFS and read them back when required, rather than sending them over the wire.
d79b2b9 [Hari Shreedharan] Make sure correct credentials are passed to FileSystem#addDelegationTokens()
8c6928a [Hari Shreedharan] Fix issue caused by direct creation of Actor object.
fb27f46 [Hari Shreedharan] Make sure principal and keytab are set before CoarseGrainedSchedulerBackend is started. Also schedule re-logins in CoarseGrainedSchedulerBackend#start()
41efde0 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
d282d7a [Hari Shreedharan] Fix ClientSuite to set YARN mode, so that the correct class is used in tests.
bcfc374 [Hari Shreedharan] Fix Hadoop-1 build by adding no-op methods in SparkHadoopUtil, with impl in YarnSparkHadoopUtil.
f8fe694 [Hari Shreedharan] Handle None if keytab-login is not scheduled.
2b0d745 [Hari Shreedharan] [SPARK-5342][YARN] Allow long running Spark apps to run on secure YARN/HDFS.
ccba5bc [Hari Shreedharan] WIP: More changes wrt kerberos
77914dd [Hari Shreedharan] WIP: Add kerberos principal and keytab to YARN client.
2015-05-01 15:32:09 -05:00
Marcelo Vanzin 7b5dd3e3c0 [SPARK-7281] [YARN] Add option to set AM's lib path in client mode.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5813 from vanzin/SPARK-7281 and squashes the following commits:

1cb6f42 [Marcelo Vanzin] [SPARK-7281] [yarn] Add option to set AM's lib path in client mode.
2015-05-01 21:20:46 +01:00
MechCoder c24aeb6a31 [SPARK-6257] [PYSPARK] [MLLIB] MLlib API missing items in Recommendation
Adds

rank, recommendUsers and RecommendProducts to MatrixFactorizationModel in PySpark.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #5807 from MechCoder/spark-6257 and squashes the following commits:

09629c6 [MechCoder] doc
953b326 [MechCoder] [SPARK-6257] MLlib API missing items in Recommendation
2015-04-30 23:51:00 -07:00
Zhan Zhang 36a7a6807e [SPARK-6479] [BLOCK MANAGER] Create off-heap block storage API
This is the classes for creating off-heap block storage API. It also includes the migration for Tachyon. The diff seems to be big, but it mainly just rename tachyon to offheap. New implementation for hdfs will be submit for review in spark-6112.

Author: Zhan Zhang <zhazhan@gmail.com>

Closes #5430 from zhzhan/SPARK-6479 and squashes the following commits:

60acd84 [Zhan Zhang] minor change to kickoff the test
12f54c9 [Zhan Zhang] solve merge conflicts
a54132c [Zhan Zhang] solve review comments
ffb8e00 [Zhan Zhang] rebase to sparkcontext change
6e121e0 [Zhan Zhang] resolve review comments and restructure blockmanasger code
a7aed6c [Zhan Zhang] add Tachyon migration code
186de31 [Zhan Zhang] initial commit for off-heap block storage api
2015-04-30 22:24:31 -07:00
Patrick Wendell e0628f2fae Revert "[SPARK-5342] [YARN] Allow long running Spark apps to run on secure YARN/HDFS"
This reverts commit 6c65da6bb7.
2015-04-30 14:59:20 -07:00
Hari Shreedharan 6c65da6bb7 [SPARK-5342] [YARN] Allow long running Spark apps to run on secure YARN/HDFS
Current Spark apps running on Secure YARN/HDFS would not be able to write data
to HDFS after 7 days, since delegation tokens cannot be renewed beyond that. This
means Spark Streaming apps will not be able to run on Secure YARN.

This commit adds basic functionality to fix this issue. In this patch:
- new parameters are added - principal and keytab, which can be used to login to a KDC
- the client logs in, and then get tokens to start the AM
- the keytab is copied to the staging directory
- the AM waits for 60% of the time till expiry of the tokens and then logs in using the keytab
- each time after 60% of the time, new tokens are created and sent to the executors

Currently, to avoid complicating the architecture, we set the keytab and principal in the
SparkHadoopUtil singleton, and schedule a login. Once the login is completed, a callback is scheduled.

This is being posted for feedback, so I can gather feedback on the general implementation.

There are currently a bunch of things to do:
- [x] logging
- [x] testing - I plan to manually test this soon. If you have ideas of how to add unit tests, comment.
- [x] add code to ensure that if these params are set in non-YARN cluster mode, we complain
- [x] documentation
- [x] Have the executors request for credentials from the AM, so that retries are possible.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #4688 from harishreedharan/kerberos-longrunning and squashes the following commits:

36eb8a9 [Hari Shreedharan] Change the renewal interval config param. Fix a bunch of comments.
611923a [Hari Shreedharan] Make sure the namenodes are listed correctly for creating tokens.
09fe224 [Hari Shreedharan] Use token.renew to get token's renewal interval rather than using hdfs-site.xml
6963bbc [Hari Shreedharan] Schedule renewal in AM before starting user class. Else, a restarted AM cannot access HDFS if the user class tries to.
072659e [Hari Shreedharan] Fix build failure caused by thread factory getting moved to ThreadUtils.
f041dd3 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
42eead4 [Hari Shreedharan] Remove RPC part. Refactor and move methods around, use renewal interval rather than max lifetime to create new tokens.
ebb36f5 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
bc083e3 [Hari Shreedharan] Overload RegisteredExecutor to send tokens. Minor doc updates.
7b19643 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
8a4f268 [Hari Shreedharan] Added docs in the security guide. Changed some code to ensure that the renewer objects are created only if required.
e800c8b [Hari Shreedharan] Restore original RegisteredExecutor message, and send new tokens via NewTokens message.
0e9507e [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
7f1bc58 [Hari Shreedharan] Minor fixes, cleanup.
bcd11f9 [Hari Shreedharan] Refactor AM and Executor token update code into separate classes, also send tokens via akka on executor startup.
f74303c [Hari Shreedharan] Move the new logic into specialized classes. Add cleanup for old credentials files.
2f9975c [Hari Shreedharan] Ensure new tokens are written out immediately on AM restart. Also, pikc up the latest suffix from HDFS if the AM is restarted.
61b2b27 [Hari Shreedharan] Account for AM restarts by making sure lastSuffix is read from the files on HDFS.
62c45ce [Hari Shreedharan] Relogin from keytab periodically.
fa233bd [Hari Shreedharan] Adding logging, fixing minor formatting and ordering issues.
42813b4 [Hari Shreedharan] Remove utils.sh, which was re-added due to merge with master.
0de27ee [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
55522e3 [Hari Shreedharan] Fix failure caused by Preconditions ambiguity.
9ef5f1b [Hari Shreedharan] Added explanation of how the credentials refresh works, some other minor fixes.
f4fd711 [Hari Shreedharan] Fix SparkConf usage.
2debcea [Hari Shreedharan] Change the file structure for credentials files. I will push a followup patch which adds a cleanup mechanism for old credentials files. The credentials files are small and few enough for it to cause issues on HDFS.
af6d5f0 [Hari Shreedharan] Cleaning up files where changes weren't required.
f0f54cb [Hari Shreedharan] Be more defensive when updating the credentials file.
f6954da [Hari Shreedharan] Got rid of Akka communication to renew, instead the executors check a known file's modification time to read the credentials.
5c11c3e [Hari Shreedharan] Move tests to YarnSparkHadoopUtil to fix compile issues.
b4cb917 [Hari Shreedharan] Send keytab to AM via DistributedCache rather than directly via HDFS
0985b4e [Hari Shreedharan] Write tokens to HDFS and read them back when required, rather than sending them over the wire.
d79b2b9 [Hari Shreedharan] Make sure correct credentials are passed to FileSystem#addDelegationTokens()
8c6928a [Hari Shreedharan] Fix issue caused by direct creation of Actor object.
fb27f46 [Hari Shreedharan] Make sure principal and keytab are set before CoarseGrainedSchedulerBackend is started. Also schedule re-logins in CoarseGrainedSchedulerBackend#start()
41efde0 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
d282d7a [Hari Shreedharan] Fix ClientSuite to set YARN mode, so that the correct class is used in tests.
bcfc374 [Hari Shreedharan] Fix Hadoop-1 build by adding no-op methods in SparkHadoopUtil, with impl in YarnSparkHadoopUtil.
f8fe694 [Hari Shreedharan] Handle None if keytab-login is not scheduled.
2b0d745 [Hari Shreedharan] [SPARK-5342][YARN] Allow long running Spark apps to run on secure YARN/HDFS.
ccba5bc [Hari Shreedharan] WIP: More changes wrt kerberos
77914dd [Hari Shreedharan] WIP: Add kerberos principal and keytab to YARN client.
2015-04-30 13:03:23 -05:00
Timothy Chen 53befacced [SPARK-5338] [MESOS] Add cluster mode support for Mesos
This patch adds the support for cluster mode to run on Mesos.
It introduces a new Mesos framework dedicated to launch new apps/drivers, and can be called with the spark-submit script and specifying --master flag to the cluster mode REST interface instead of Mesos master.

Example:
./bin/spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi --master mesos://10.0.0.206:8077 --executor-memory 1G --total-executor-cores 100 examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar 30

Part of this patch is also to abstract the StandaloneRestServer so it can have different implementations of the REST endpoints.

Features of the cluster mode in this PR:
- Supports supervise mode where scheduler will keep trying to reschedule exited job.
- Adds a new UI for the cluster mode scheduler to see all the running jobs, finished jobs, and supervise jobs waiting to be retried
- Supports state persistence to ZK, so when the cluster scheduler fails over it can pick up all the queued and running jobs

Author: Timothy Chen <tnachen@gmail.com>
Author: Luc Bourlier <luc.bourlier@typesafe.com>

Closes #5144 from tnachen/mesos_cluster_mode and squashes the following commits:

069e946 [Timothy Chen] Fix rebase.
e24b512 [Timothy Chen] Persist submitted driver.
390c491 [Timothy Chen] Fix zk conf key for mesos zk engine.
e324ac1 [Timothy Chen] Fix merge.
fd5259d [Timothy Chen] Address review comments.
1553230 [Timothy Chen] Address review comments.
c6c6b73 [Timothy Chen] Pass spark properties to mesos cluster tasks.
f7d8046 [Timothy Chen] Change app name to spark cluster.
17f93a2 [Timothy Chen] Fix head of line blocking in scheduling drivers.
6ff8e5c [Timothy Chen] Address comments and add logging.
df355cd [Timothy Chen] Add metrics to mesos cluster scheduler.
20f7284 [Timothy Chen] Address review comments
7252612 [Timothy Chen] Fix tests.
a46ad66 [Timothy Chen] Allow zk cli param override.
920fc4b [Timothy Chen] Fix scala style issues.
862b5b5 [Timothy Chen] Support asking driver status when it's retrying.
7f214c2 [Timothy Chen] Fix RetryState visibility
e0f33f7 [Timothy Chen] Add supervise support and persist retries.
371ce65 [Timothy Chen] Handle cluster mode recovery and state persistence.
3d4dfa1 [Luc Bourlier] Adds support to kill submissions
febfaba [Timothy Chen] Bound the finished drivers in memory
543a98d [Timothy Chen] Schedule multiple jobs
6887e5e [Timothy Chen] Support looking at SPARK_EXECUTOR_URI env variable in schedulers
8ec76bc [Timothy Chen] Fix Mesos dispatcher UI.
d57d77d [Timothy Chen] Add documentation
825afa0 [Luc Bourlier] Supports more spark-submit parameters
b8e7181 [Luc Bourlier] Adds a shutdown latch to keep the deamon running
0fa7780 [Luc Bourlier] Launch task through the mesos scheduler
5b7a12b [Timothy Chen] WIP: Making a cluster mode a mesos framework.
4b2f5ef [Timothy Chen] Specify user jar in command to be replaced with local.
e775001 [Timothy Chen] Support fetching remote uris in driver runner.
7179495 [Timothy Chen] Change Driver page output and add logging
880bc27 [Timothy Chen] Add Mesos Cluster UI to display driver results
9986731 [Timothy Chen] Kill drivers when shutdown
67cbc18 [Timothy Chen] Rename StandaloneRestClient to RestClient and add sbin scripts
e3facdd [Timothy Chen] Add Mesos Cluster dispatcher
2015-04-28 13:33:57 -07:00
Ilya Ganelin 2d222fb39d [SPARK-5932] [CORE] Use consistent naming for size properties
I've added an interface to JavaUtils to do byte conversion and added hooks within Utils.scala to handle conversion within Spark code (like for time strings). I've added matching tests for size conversion, and then updated all deprecated configs and documentation as per SPARK-5933.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #5574 from ilganeli/SPARK-5932 and squashes the following commits:

11f6999 [Ilya Ganelin] Nit fixes
49a8720 [Ilya Ganelin] Whitespace fix
2ab886b [Ilya Ganelin] Scala style
fc85733 [Ilya Ganelin] Got rid of floating point math
852a407 [Ilya Ganelin] [SPARK-5932] Added much improved overflow handling. Can now handle sizes up to Long.MAX_VALUE Petabytes instead of being capped at Long.MAX_VALUE Bytes
9ee779c [Ilya Ganelin] Simplified fraction matches
22413b1 [Ilya Ganelin] Made MAX private
3dfae96 [Ilya Ganelin] Fixed some nits. Added automatic conversion of old paramter for kryoserializer.mb to new values.
e428049 [Ilya Ganelin] resolving merge conflict
8b43748 [Ilya Ganelin] Fixed error in pattern matching for doubles
84a2581 [Ilya Ganelin] Added smoother handling of fractional values for size parameters. This now throws an exception and added a warning for old spark.kryoserializer.buffer
d3d09b6 [Ilya Ganelin] [SPARK-5932] Fixing error in KryoSerializer
fe286b4 [Ilya Ganelin] Resolved merge conflict
c7803cd [Ilya Ganelin] Empty lines
54b78b4 [Ilya Ganelin] Simplified byteUnit class
69e2f20 [Ilya Ganelin] Updates to code
f32bc01 [Ilya Ganelin] [SPARK-5932] Fixed error in API in SparkConf.scala where Kb conversion wasn't being done properly (was Mb). Added test cases for both timeUnit and ByteUnit conversion
f15f209 [Ilya Ganelin] Fixed conversion of kryo buffer size
0f4443e [Ilya Ganelin]     Merge remote-tracking branch 'upstream/master' into SPARK-5932
35a7fa7 [Ilya Ganelin] Minor formatting
928469e [Ilya Ganelin] [SPARK-5932] Converted some longs to ints
5d29f90 [Ilya Ganelin] [SPARK-5932] Finished documentation updates
7a6c847 [Ilya Ganelin] [SPARK-5932] Updated spark.shuffle.file.buffer
afc9a38 [Ilya Ganelin] [SPARK-5932] Updated spark.broadcast.blockSize and spark.storage.memoryMapThreshold
ae7e9f6 [Ilya Ganelin] [SPARK-5932] Updated spark.io.compression.snappy.block.size
2d15681 [Ilya Ganelin] [SPARK-5932] Updated spark.executor.logs.rolling.size.maxBytes
1fbd435 [Ilya Ganelin] [SPARK-5932] Updated spark.broadcast.blockSize
eba4de6 [Ilya Ganelin] [SPARK-5932] Updated spark.shuffle.file.buffer.kb
b809a78 [Ilya Ganelin] [SPARK-5932] Updated spark.kryoserializer.buffer.max
0cdff35 [Ilya Ganelin] [SPARK-5932] Updated to use bibibytes in method names. Updated spark.kryoserializer.buffer.mb and spark.reducer.maxMbInFlight
475370a [Ilya Ganelin] [SPARK-5932] Simplified ByteUnit code, switched to using longs. Updated docs to clarify that we use kibi, mebi etc instead of kilo, mega
851d691 [Ilya Ganelin] [SPARK-5932] Updated memoryStringToMb to use new interfaces
a9f4fcf [Ilya Ganelin] [SPARK-5932] Added unit tests for unit conversion
747393a [Ilya Ganelin] [SPARK-5932] Added unit tests for ByteString conversion
09ea450 [Ilya Ganelin] [SPARK-5932] Added byte string conversion to Jav utils
5390fd9 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5932
db9a963 [Ilya Ganelin] Closing second spark context
1dc0444 [Ilya Ganelin] Added ref equality check
8c884fa [Ilya Ganelin] Made getOrCreate synchronized
cb0c6b7 [Ilya Ganelin] Doc updates and code cleanup
270cfe3 [Ilya Ganelin] [SPARK-6703] Documentation fixes
15e8dea [Ilya Ganelin] Updated comments and added MiMa Exclude
0e1567c [Ilya Ganelin] Got rid of unecessary option for AtomicReference
dfec4da [Ilya Ganelin] Changed activeContext to AtomicReference
733ec9f [Ilya Ganelin] Fixed some bugs in test code
8be2f83 [Ilya Ganelin] Replaced match with if
e92caf7 [Ilya Ganelin] [SPARK-6703] Added test to ensure that getOrCreate both allows creation, retrieval, and a second context if desired
a99032f [Ilya Ganelin] Spacing fix
d7a06b8 [Ilya Ganelin] Updated SparkConf class to add getOrCreate method. Started test suite implementation
2015-04-28 12:18:55 -07:00
Iulian Dragos 8aab94d898 [SPARK-4286] Add an external shuffle service that can be run as a daemon.
This allows Mesos deployments to use the shuffle service (and implicitly dynamic allocation). It does so by adding a new "main" class and two corresponding scripts in `sbin`:

- `sbin/start-shuffle-service.sh`
- `sbin/stop-shuffle-service.sh`

Specific options can be passed in `SPARK_SHUFFLE_OPTS`.

This is picking up work from #3861 /cc tnachen

Author: Iulian Dragos <jaguarul@gmail.com>

Closes #4990 from dragos/feature/external-shuffle-service and squashes the following commits:

6c2b148 [Iulian Dragos] Import order and wrong name fixup.
07804ad [Iulian Dragos] Moved ExternalShuffleService to the `deploy` package + other minor tweaks.
4dc1f91 [Iulian Dragos] Reviewer’s comments:
8145429 [Iulian Dragos] Add an external shuffle service that can be run as a daemon.
2015-04-28 12:08:18 -07:00
Deborah Siegel 59b7cfc41b [SPARK-7136][Docs] Spark SQL and DataFrame Guide fix example file and paths
Changes example file for Generic Load/Save Functions to users.parquet rather than people.parquet which doesn't exist unless a later example has already been executed. Also adds filepaths.

Author: Deborah Siegel <deborah.siegel@gmail.com>
Author: DEBORAH SIEGEL <deborahsiegel@d-140-142-0-49.dhcp4.washington.edu>
Author: DEBORAH SIEGEL <deborahsiegel@DEBORAHs-MacBook-Pro.local>
Author: DEBORAH SIEGEL <deborahsiegel@d-69-91-154-197.dhcp4.washington.edu>

Closes #5693 from d3borah/master and squashes the following commits:

4d5e43b [Deborah Siegel] sparkSQL doc change
b15a497 [Deborah Siegel] Revert "sparkSQL doc change"
5a2863c [DEBORAH SIEGEL] Merge remote-tracking branch 'upstream/master'
91972fc [DEBORAH SIEGEL] sparkSQL doc change
f000e59 [DEBORAH SIEGEL] Merge remote-tracking branch 'upstream/master'
db54173 [DEBORAH SIEGEL] fixed aggregateMessages example in graphX doc
2015-04-24 20:25:07 -07:00
Ken Geis 67bccbda1e Update sql-programming-guide.md
fix typo

Author: Ken Geis <geis.ken@gmail.com>

Closes #5674 from kgeis/patch-1 and squashes the following commits:

5ae67de [Ken Geis] Update sql-programming-guide.md
2015-04-23 20:45:33 -07:00
zsxwing 3a3f7100f4 [SPARK-6490][Docs] Add docs for rpc configurations
Added docs for rpc configurations and also fixed two places that should have been fixed in #5595.

Author: zsxwing <zsxwing@gmail.com>

Closes #5607 from zsxwing/SPARK-6490-docs and squashes the following commits:

25a6736 [zsxwing] Increase the default timeout to 120s
6e37c30 [zsxwing] Update docs
5577540 [zsxwing] Use spark.network.timeout as the default timeout if it presents
4f07174 [zsxwing] Fix unit tests
1c2cf26 [zsxwing] Add docs for rpc configurations
2015-04-21 18:37:53 -07:00
BenFradet 517bdf36ae [doc][streaming] Fixed broken link in mllib section
The commit message is pretty self-explanatory.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #5600 from BenFradet/master and squashes the following commits:

108492d [BenFradet] [doc][streaming] Fixed broken link in mllib section
2015-04-20 13:46:55 -07:00
Eric Chiang 97fda73db4 fixed doc
The contribution is my original work. I license the work to the project under the project's open source license.

Small typo in the programming guide.

Author: Eric Chiang <eric.chiang.m@gmail.com>

Closes #5599 from ericchiang/docs-typo and squashes the following commits:

1177942 [Eric Chiang] fixed doc
2015-04-20 13:11:21 -07:00
dobashim 6fe690d5a8 [doc][mllib] Fix typo of the page title in Isotonic regression documents
* Fix the page title in Isotonic regression documents (Naive Bayes -> Isotonic regression)
* Add a newline character at the end of the file

Author: dobashim <dobashim@oss.nttdata.co.jp>

Closes #5581 from dobashim/master and squashes the following commits:

d54a041 [dobashim] Fix typo of the page title in Isotonic regression documents
2015-04-20 00:03:23 -04:00
Gaurav Nanda 729885ec6b Fixed doc
Just fixed a doc.

Author: Gaurav Nanda <gaurav324@gmail.com>

Closes #5576 from gaurav324/master and squashes the following commits:

8a7323f [Gaurav Nanda] Fixed doc
2015-04-18 17:20:46 -07:00
Olivier Girardot 5f095d5605 SPARK-6992 : Fix documentation example for Spark SQL on StructType
This patch is fixing the Java examples for Spark SQL when defining
programmatically a Schema and mapping Rows.

Author: Olivier Girardot <o.girardot@lateral-thoughts.com>

Closes #5569 from ogirardot/branch-1.3 and squashes the following commits:

c29e58d [Olivier Girardot] SPARK-6992 : Fix documentation example for Spark SQL on StructType

(cherry picked from commit c9b1ba4b16a7afe93d45bf75b128cc0dd287ded0)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-04-18 00:31:15 -07:00
Marcelo Vanzin 1991337336 [SPARK-5933] [core] Move config deprecation warnings to SparkConf.
I didn't find many deprecated configs after a grep-based search,
but the ones I could find were moved to the centralized location
in SparkConf.

While there, I deprecated a couple more HS configs that mentioned
time units.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5562 from vanzin/SPARK-5933 and squashes the following commits:

dcb617e7 [Marcelo Vanzin] [SPARK-5933] [core] Move config deprecation warnings to SparkConf.
2015-04-17 19:02:07 -07:00
Jongyoul Lee 6fbeb82e13 [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode
- Defined executorCores from "spark.mesos.executor.cores"
- Changed the amount of mesosExecutor's cores to executorCores.
- Added new configuration option on running-on-mesos.md

Author: Jongyoul Lee <jongyoul@gmail.com>

Closes #5063 from jongyoul/SPARK-6350 and squashes the following commits:

9238d6e [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Fixed docs - Changed configuration name - Made mesosExecutorCores private
2d41241 [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Fixed docs
89edb4f [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Fixed docs
8ba7694 [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Fixed docs
7549314 [Jongyoul Lee] [SPARK-6453][Mesos] Some Mesos*Suite have a different package with their classes - Fixed docs
4ae7b0c [Jongyoul Lee] [SPARK-6453][Mesos] Some Mesos*Suite have a different package with their classes - Removed TODO
c27efce [Jongyoul Lee] [SPARK-6453][Mesos] Some Mesos*Suite have a different package with their classes - Fixed Mesos*Suite for supporting integer WorkerOffers - Fixed Documentation
1fe4c03 [Jongyoul Lee] [SPARK-6453][Mesos] Some Mesos*Suite have a different package with their classes - Change available resources of cpus to integer value beacuse WorkerOffer support the amount cpus as integer value
5f3767e [Jongyoul Lee] Revert "[SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode"
4b7c69e [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Changed configruation name and description from "spark.mesos.executor.cores" to "spark.executor.frameworkCores"
0556792 [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Defined executorCores from "spark.mesos.executor.cores" - Changed the amount of mesosExecutor's cores to executorCores. - Added new configuration option on running-on-mesos.md
2015-04-17 18:30:55 -07:00
Olivier Girardot d305e686b3 SPARK-6988 : Fix documentation regarding DataFrames using the Java API
This patch includes :
 * adding how to use map after an sql query using javaRDD
 * fixing the first few java examples that were written in Scala

Thank you for your time,

Olivier.

Author: Olivier Girardot <o.girardot@lateral-thoughts.com>

Closes #5564 from ogirardot/branch-1.3 and squashes the following commits:

9f8d60e [Olivier Girardot] SPARK-6988 : Fix documentation regarding DataFrames using the Java API

(cherry picked from commit 6b528dc139da594ef2e651d84bd91fe0f738a39d)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-04-17 16:27:02 -05:00
Marcelo Vanzin 50ab8a6543 [SPARK-2669] [yarn] Distribute client configuration to AM.
Currently, when Spark launches the Yarn AM, the process will use
the local Hadoop configuration on the node where the AM launches,
if one is present. A more correct approach is to use the same
configuration used to launch the Spark job, since the user may
have made modifications (such as adding app-specific configs).

The approach taken here is to use the distributed cache to make
all files in the Hadoop configuration directory available to the
AM. This is a little overkill since only the AM needs them (the
executors use the broadcast Hadoop configuration from the driver),
but is the easier approach.

Even though only a few files in that directory may end up being
used, all of them are uploaded. This allows supporting use cases
such as when auxiliary configuration files are used for SSL
configuration, or when uploading a Hive configuration directory.
Not all of these may be reflected in a o.a.h.conf.Configuration object,
but may be needed when a driver in cluster mode instantiates, for
example, a HiveConf object instead.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #4142 from vanzin/SPARK-2669 and squashes the following commits:

f5434b9 [Marcelo Vanzin] Merge branch 'master' into SPARK-2669
013f0fb [Marcelo Vanzin] Review feedback.
f693152 [Marcelo Vanzin] Le sigh.
ed45b7d [Marcelo Vanzin] Zip all config files and upload them as an archive.
5927b6b [Marcelo Vanzin] Merge branch 'master' into SPARK-2669
cbb9fb3 [Marcelo Vanzin] Remove stale test.
e3e58d0 [Marcelo Vanzin] Merge branch 'master' into SPARK-2669
e3d0613 [Marcelo Vanzin] Review feedback.
34bdbd8 [Marcelo Vanzin] Fix test.
022a688 [Marcelo Vanzin] Merge branch 'master' into SPARK-2669
a77ddd5 [Marcelo Vanzin] Merge branch 'master' into SPARK-2669
79221c7 [Marcelo Vanzin] [SPARK-2669] [yarn] Distribute client configuration to AM.
2015-04-17 14:21:51 -05:00
Marcelo Vanzin 4527761bcd [SPARK-6046] [core] Reorganize deprecated config support in SparkConf.
This change tries to follow the chosen way for handling deprecated
configs in SparkConf: all values (old and new) are kept in the conf
object, and newer names take precedence over older ones when
retrieving the value.

Warnings are logged when config options are set, which generally happens
on the driver node (where the logs are most visible).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5514 from vanzin/SPARK-6046 and squashes the following commits:

9371529 [Marcelo Vanzin] Avoid math.
6cf3f11 [Marcelo Vanzin] Review feedback.
2445d48 [Marcelo Vanzin] Fix (and cleanup) update interval initialization.
b6824be [Marcelo Vanzin] Clean up the other deprecated config use also.
ab20351 [Marcelo Vanzin] Update FsHistoryProvider to only retrieve new config key.
2c93209 [Marcelo Vanzin] [SPARK-6046] [core] Reorganize deprecated config support in SparkConf.
2015-04-17 11:06:01 +01:00
Liang-Chi Hsieh e3e4e9a38b [SPARK-6800][SQL] Update doc for JDBCRelation's columnPartition
JIRA https://issues.apache.org/jira/browse/SPARK-6800

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5488 from viirya/fix_jdbc_where and squashes the following commits:

51386c8 [Liang-Chi Hsieh] Update code comment.
1dcc929 [Liang-Chi Hsieh] Update document.
3eb74d6 [Liang-Chi Hsieh] Revert and modify doc.
df11783 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into fix_jdbc_where
3e7db15 [Liang-Chi Hsieh] Fix wrong logic to generate WHERE clause for JDBC.
2015-04-15 13:02:25 -07:00
CodingCat 8f8dc45f6d SPARK-1706: Allow multiple executors per worker in Standalone mode
resubmit of https://github.com/apache/spark/pull/636  for a totally different algorithm

https://issues.apache.org/jira/browse/SPARK-1706

In current implementation, the user has to start multiple workers in a server for starting multiple executors in a server, which introduces additional overhead due to the more JVM processes...

In this patch, I changed the scheduling logic in master to enable the user to start multiple executor processes within the same JVM process.

1. user configure spark.executor.maxCoreNumPerExecutor to suggest the maximum core he/she would like to allocate to each executor

2. Master assigns the executors to the workers with the major consideration on the memoryPerExecutor and the worker.freeMemory, and tries to allocate as many as possible cores to the executor ```min(min(memoryPerExecutor, worker.freeCore), maxLeftCoreToAssign)``` where ```maxLeftCoreToAssign = maxExecutorCanAssign * maxCoreNumPerExecutor```

---------------------------------------

Other small changes include

change memoryPerSlave in ApplicationDescription to memoryPerExecutor, as "Slave" is overrided to represent both worker and executor in the documents... (we have some discussion on this before?)

Author: CodingCat <zhunansjtu@gmail.com>

Closes #731 from CodingCat/SPARK-1706-2 and squashes the following commits:

6dee808 [CodingCat] change filter predicate
fbeb7e5 [CodingCat] address the comments
940cb42 [CodingCat] avoid unnecessary allocation
b8ca561 [CodingCat] revert a change
45967b4 [CodingCat] remove unused method
2eeff77 [CodingCat] stylistic fixes
12a1b32 [CodingCat] change the semantic of coresPerExecutor to exact core number
f035423 [CodingCat] stylistic fix
d9c1685 [CodingCat] remove unused var
f595bd6 [CodingCat] recover some unintentional changes
63b3df9 [CodingCat] change the description of the parameter in the submit script
4cf61f1 [CodingCat] improve the code and docs
ff011e2 [CodingCat] start multiple executors on the worker by rewriting startExeuctor logic
2c2bcc5 [CodingCat] fix wrong usage info
497ec2c [CodingCat] address andrew's comments
878402c [CodingCat] change the launching executor code
f64a28d [CodingCat] typo fix
387f4ec [CodingCat] bug fix
35c462c [CodingCat] address Andrew's comments
0b64fea [CodingCat] fix compilation issue
19d3da7 [CodingCat] address the comments
5b81466 [CodingCat] remove outdated comments
ec7d421 [CodingCat] test commit
e5efabb [CodingCat] more java docs and consolidate canUse function
a26096d [CodingCat] stylistic fix
a5d629a [CodingCat] java doc
b34ec0c [CodingCat] make master support multiple executors per worker
2015-04-14 13:32:06 -07:00
Ilya Ganelin c4ab255e94 [SPARK-5931][CORE] Use consistent naming for time properties
I've added new utility methods to do the conversion from times specified as e.g. 120s, 240ms, 360us to convert to a consistent internal representation. I've updated usage of these constants throughout the code to be consistent.

I believe I've captured all usages of time-based properties throughout the code. I've also updated variable names in a number of places to reflect their units for clarity and updated documentation where appropriate.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
Author: Ilya Ganelin <ilganeli@gmail.com>

Closes #5236 from ilganeli/SPARK-5931 and squashes the following commits:

4526c81 [Ilya Ganelin] Update configuration.md
de3bff9 [Ilya Ganelin] Fixing style errors
f5fafcd [Ilya Ganelin] Doc updates
951ca2d [Ilya Ganelin] Made the most recent round of changes
bc04e05 [Ilya Ganelin] Minor fixes and doc updates
25d3f52 [Ilya Ganelin] Minor nit fixes
642a06d [Ilya Ganelin] Fixed logic for invalid suffixes and addid matching test
8927e66 [Ilya Ganelin] Fixed handling of -1
69fedcc [Ilya Ganelin] Added test for zero
dc7bd08 [Ilya Ganelin] Fixed error in exception handling
7d19cdd [Ilya Ganelin] Added fix for possible NPE
6f651a8 [Ilya Ganelin] Now using regexes to simplify code in parseTimeString. Introduces getTimeAsSec and getTimeAsMs methods in SparkConf. Updated documentation
cbd2ca6 [Ilya Ganelin] Formatting error
1a1122c [Ilya Ganelin] Formatting fixes and added m for use as minute formatter
4e48679 [Ilya Ganelin] Fixed priority order and mixed up conversions in a couple spots
d4efd26 [Ilya Ganelin] Added time conversion for yarn.scheduler.heartbeat.interval-ms
cbf41db [Ilya Ganelin] Got rid of thrown exceptions
1465390 [Ilya Ganelin] Nit
28187bf [Ilya Ganelin] Convert straight to seconds
ff40bfe [Ilya Ganelin] Updated tests to fix small bugs
19c31af [Ilya Ganelin] Added cleaner computation of time conversions in tests
6387772 [Ilya Ganelin] Updated suffix handling to handle overlap of units more gracefully
5193d5f [Ilya Ganelin] Resolved merge conflicts
76cfa27 [Ilya Ganelin] [SPARK-5931] Minor nit fixes'
bf779b0 [Ilya Ganelin] Special handling of overlapping usffixes for java
dd0a680 [Ilya Ganelin] Updated scala code to call into java
b2fc965 [Ilya Ganelin] replaced get or default since it's not present in this version of java
39164f9 [Ilya Ganelin] [SPARK-5931] Updated Java conversion to be similar to scala conversion. Updated conversions to clean up code a little using TimeUnit.convert. Added Unit tests
3b126e1 [Ilya Ganelin] Fixed conversion to US from seconds
1858197 [Ilya Ganelin] Fixed bug where all time was being converted to us instead of the appropriate units
bac9edf [Ilya Ganelin] More whitespace
8613631 [Ilya Ganelin] Whitespace
1c0c07c [Ilya Ganelin] Updated Java code to add day, minutes, and hours
647b5ac [Ilya Ganelin] Udpated time conversion to use map iterator instead of if fall through
70ac213 [Ilya Ganelin] Fixed remaining usages to be consistent. Updated Java-side time conversion
68f4e93 [Ilya Ganelin] Updated more files to clean up usage of default time strings
3a12dd8 [Ilya Ganelin] Updated host revceiver
5232a36 [Ilya Ganelin] [SPARK-5931] Changed default behavior of time string conversion.
499bdf0 [Ilya Ganelin] Merge branch 'SPARK-5931' of github.com:ilganeli/spark into SPARK-5931
9e2547c [Ilya Ganelin] Reverting doc changes
8f741e1 [Ilya Ganelin] Update JavaUtils.java
34f87c2 [Ilya Ganelin] Update Utils.scala
9a29d8d [Ilya Ganelin] Fixed misuse of time in streaming context test
42477aa [Ilya Ganelin] Updated configuration doc with note on specifying time properties
cde9bff [Ilya Ganelin] Updated spark.streaming.blockInterval
c6a0095 [Ilya Ganelin] Updated spark.core.connection.auth.wait.timeout
5181597 [Ilya Ganelin] Updated spark.dynamicAllocation.schedulerBacklogTimeout
2fcc91c [Ilya Ganelin] Updated spark.dynamicAllocation.executorIdleTimeout
6d1518e [Ilya Ganelin] Upated spark.speculation.interval
3f1cfc8 [Ilya Ganelin] Updated spark.scheduler.revive.interval
3352d34 [Ilya Ganelin] Updated spark.scheduler.maxRegisteredResourcesWaitingTime
272c215 [Ilya Ganelin] Updated spark.locality.wait
7320c87 [Ilya Ganelin] updated spark.akka.heartbeat.interval
064ebd6 [Ilya Ganelin] Updated usage of spark.cleaner.ttl
21ef3dd [Ilya Ganelin] updated spark.shuffle.sasl.timeout
c9f5cad [Ilya Ganelin] Updated spark.shuffle.io.retryWait
4933fda [Ilya Ganelin] Updated usage of spark.storage.blockManagerSlaveTimeout
7db6d2a [Ilya Ganelin] Updated usage of spark.akka.timeout
404f8c3 [Ilya Ganelin] Updated usage of spark.core.connection.ack.wait.timeout
59bf9e1 [Ilya Ganelin] [SPARK-5931] Updated Utils and JavaUtils classes to add helper methods to handle time strings. Updated time strings in a few places to properly parse time
2015-04-13 16:28:07 -07:00
Cheolsoo Park 6cc5b3ed3c [SPARK-6662][YARN] Allow variable substitution in spark.yarn.historyServer.address
In Spark on YARN, explicit hostname and port number need to be set for "spark.yarn.historyServer.address" in SparkConf to make the HISTORY link. If the history server address is known and static, this is usually not a problem.

But in cloud, that is usually not true. Particularly in EMR, the history server always runs on the same node as with RM. So I could simply set it to ${yarn.resourcemanager.hostname}:18080 if variable substitution is allowed.

In fact, Hadoop configuration already implements variable substitution, so if this property is read via YarnConf, this can be easily achievable.

Author: Cheolsoo Park <cheolsoop@netflix.com>

Closes #5321 from piaozhexiu/SPARK-6662 and squashes the following commits:

e37de75 [Cheolsoo Park] Preserve the space between the Hadoop and Spark imports
79757c6 [Cheolsoo Park] Incorporate review comments
10e2917 [Cheolsoo Park] Add helper function that substitutes hadoop vars to SparkHadoopUtil
589b52c [Cheolsoo Park] Revert "Allow variable substitution for spark.yarn. properties"
ff9c35d [Cheolsoo Park] Allow variable substitution for spark.yarn. properties
2015-04-13 13:45:10 -05:00
Santiago M. Mola 6437e7cc3b [SPARK-6863] Fix formatting on SQL programming guide.
https://issues.apache.org/jira/browse/SPARK-6863

Author: Santiago M. Mola <santiago.mola@sap.com>

Closes #5472 from smola/fix/sql-docs and squashes the following commits:

42503d4 [Santiago M. Mola] [SPARK-6863] Fix formatting on SQL programming guide.
2015-04-11 15:42:03 -07:00
Peter Parente b9c51c0493 [SPARK-6343] Doc driver-worker network reqs
Attempt at making the driver-worker networking requirement more explicit and up-front in the documentation (see https://issues.apache.org/jira/browse/SPARK-6343).

Update cluster overview diagram to show connections from workers to driver. Add a bullet below about how driver listens / accepts connections from workers.

Author: Peter Parente <pparent@us.ibm.com>

Closes #5382 from parente/SPARK-6343 and squashes the following commits:

0b2fb9d [Peter Parente] [SPARK-6343] Doc driver-worker network reqs
2015-04-09 06:37:20 -04:00
Shivaram Venkataraman 2fe0a1aaee [SPARK-5654] Integrate SparkR
This pull requests integrates SparkR, an R frontend for Spark. The SparkR package contains both RDD and DataFrame APIs in R and is integrated with Spark's submission scripts to work on different cluster managers.

Some integration points that would be great to get feedback on:

1. Build procedure: SparkR requires R to be installed on the machine to be built. Right now we have a new Maven profile `-PsparkR` that can be used to enable SparkR builds

2. YARN cluster mode: The R package that is built needs to be present on the driver and all the worker nodes during execution. The R package location is currently set using SPARK_HOME, but this might not work on YARN cluster mode.

The SparkR package represents the work of many contributors and attached below is a list of people along with areas they worked on

edwardt (edwart) - Documentation improvements
Felix Cheung (felixcheung) - Documentation improvements
Hossein Falaki (falaki)  - Documentation improvements
Chris Freeman (cafreeman) - DataFrame API, Programming Guide
Todd Gao (7c00) - R worker Internals
Ryan Hafen (hafen) - SparkR Internals
Qian Huang (hqzizania) - RDD API
Hao Lin (hlin09) - RDD API, Closure cleaner
Evert Lammerts (evertlammerts) - DataFrame API
Davies Liu (davies) - DataFrame API, R worker internals, Merging with Spark
Yi Lu (lythesia) - RDD API, Worker internals
Matt Massie (massie) - Jenkins build
Harihar Nahak (hnahak87) - SparkR examples
Oscar Olmedo (oscaroboto) - Spark configuration
Antonio Piccolboni (piccolbo) - SparkR examples, Namespace bug fixes
Dan Putler (dputler) - Dataframe API, SparkR Install Guide
Ashutosh Raina (ashutoshraina) - Build improvements
Josh Rosen (joshrosen) - Travis CI build
Sun Rui (sun-rui)- RDD API, JVM Backend, Shuffle improvements
Shivaram Venkataraman (shivaram) - RDD API, JVM Backend, Worker Internals
Zongheng Yang (concretevitamin) - RDD API, Pipelined RDDs, Examples and EC2 guide

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
Author: Zongheng Yang <zongheng.y@gmail.com>
Author: cafreeman <cfreeman@alteryx.com>
Author: Shivaram Venkataraman <shivaram@eecs.berkeley.edu>
Author: Davies Liu <davies@databricks.com>
Author: Davies Liu <davies.liu@gmail.com>
Author: hlin09 <hlin09pu@gmail.com>
Author: Sun Rui <rui.sun@intel.com>
Author: lythesia <iranaikimi@gmail.com>
Author: oscaroboto <oscarjr@gmail.com>
Author: Antonio Piccolboni <antonio@piccolboni.info>
Author: root <edward>
Author: edwardt <edwardt.tril@gmail.com>
Author: hqzizania <qian.huang@intel.com>
Author: dputler <dan.putler@gmail.com>
Author: Todd Gao <todd.gao.2013@gmail.com>
Author: Chris Freeman <cfreeman@alteryx.com>
Author: Felix Cheung <fcheung@AVVOMAC-119.local>
Author: Hossein <hossein@databricks.com>
Author: Evert Lammerts <evert@apache.org>
Author: Felix Cheung <fcheung@avvomac-119.t-mobile.com>
Author: felixcheung <felixcheung_m@hotmail.com>
Author: Ryan Hafen <rhafen@gmail.com>
Author: Ashutosh Raina <ashutoshraina@users.noreply.github.com>
Author: Oscar Olmedo <oscarjr@gmail.com>
Author: Josh Rosen <rosenville@gmail.com>
Author: Yi Lu <iranaikimi@gmail.com>
Author: Harihar Nahak <hnahak87@users.noreply.github.com>

Closes #5096 from shivaram/R and squashes the following commits:

da64742 [Davies Liu] fix Date serialization
59266d1 [Davies Liu] check exclusive of primary-py-file and primary-r-file
55808e4 [Davies Liu] fix tests
5581c75 [Davies Liu] update author of SparkR
f731b48 [Shivaram Venkataraman] Only run SparkR tests if R is installed
64eda24 [Shivaram Venkataraman] Merge branch 'R' of https://github.com/amplab-extras/spark into R
d7c3f22 [Shivaram Venkataraman] Address code review comments Changes include 1. Adding SparkR docs to API docs generated 2. Style fixes in SparkR scala files 3. Clean up of shell scripts and explanation of install-dev.sh
377151f [Shivaram Venkataraman] Merge remote-tracking branch 'apache/master' into R
eb5da53 [Shivaram Venkataraman] Merge pull request #3 from davies/R2
a18ff5c [Davies Liu] Update sparkR.R
5133f3a [Shivaram Venkataraman] Merge pull request #7 from hqzizania/R3
940b631 [hqzizania] [SPARKR-92] Phase 2: implement sum(rdd)
0e788c0 [Shivaram Venkataraman] Merge pull request #5 from hlin09/doc-fix
3487461 [hlin09] Add tests log in .gitignore.
1d1802e [Shivaram Venkataraman] Merge pull request #4 from felixcheung/r-require
11981b7 [felixcheung] Update R to fail early if SparkR package is missing
c300e08 [Davies Liu] remove duplicated file
b045701 [Davies Liu] Merge branch 'remote_r' into R
19c9368 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into remote_r
f8fa8af [Davies Liu] mute logging when start/stop context
e7104b6 [Davies Liu] remove ::: in SparkR
a1777eb [Davies Liu] move rules into R/.gitignore
e88b649 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
6e20e71 [Davies Liu] address comments
b433817 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R
a1cedad [Shivaram Venkataraman] Merge pull request #228 from felixcheung/doc
e089151 [Davies Liu] Merge pull request #225 from sun-rui/SPARKR-154_2
463e28c [Davies Liu] Merge pull request #2 from shivaram/doc-fixes
bc2d6d8 [Shivaram Venkataraman] Remove arg from sparkR.stop and update docs
d425363 [Shivaram Venkataraman] Some doc fixes for column, generics, group
1f1a7e0 [Shivaram Venkataraman] Some fixes to DataFrame, RDD, SQLContext docs
104ad4e [Shivaram Venkataraman] Check the right env in exists
cf5cd99 [Shivaram Venkataraman] Remove unused numCols argument
85a50ec [Shivaram Venkataraman] Merge pull request #226 from RevolutionAnalytics/master
3eacfc0 [Davies Liu] fix flaky test
733380d [Davies Liu] update R examples (remove master from args)
b21a0da [Davies Liu] Merge pull request #1 from shivaram/log4j-tests
a1493d7 [Shivaram Venkataraman] Address comments
e1f83ab [Shivaram Venkataraman] Send Spark INFO logs to a file in SparkR tests
58276f5 [Shivaram Venkataraman] Merge branch 'R' of https://github.com/amplab-extras/spark into R
52cc92d [Shivaram Venkataraman] Add license to create-docs.sh
6ff5ea2 [Shivaram Venkataraman] Add instructions to generate docs
1f478c5 [Shivaram Venkataraman] Merge branch 'R' of https://github.com/amplab-extras/spark into R
02b4833 [Shivaram Venkataraman] Add a script to generate R docs (Rd, html) Also fix some issues with our documentation
d6d3729 [Davies Liu] enable spark and pyspark tests
0e5a83f [Davies Liu] fix code style
afd8a77 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
d87a181 [Davies Liu] fix flaky tests
7100fb9 [Shivaram Venkataraman] Fix libPaths in README
bdf3a14 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
05e7375 [Davies Liu] sort generics
b44e371 [Shivaram Venkataraman] Include RStudio instructions in README
855537f [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
9fb6af3 [Davies Liu] mark R classes/objects are private
423ea3c [Shivaram Venkataraman] Ignore unknown jobj in cleanup
974e4ea [Davies Liu] fix flaky test
410ec18 [Davies Liu] fix zipRDD() tests
d8b24fc [Davies Liu] disable spark and python tests temporary
ce3ca62 [Davies Liu] fix license check
7da0049 [Davies Liu] fix build
2892e29 [Davies Liu] support R in YARN cluster
ebd4d07 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
38cbf59 [Davies Liu] fix test of zipRDD()
756ece0 [Shivaram Venkataraman] Update README remove outdated TODO
d436f26 [Davies Liu] add missing files
40d193a [Shivaram Venkataraman] Merge pull request #224 from sun-rui/SPARKR-224-new
1a16cd6 [Davies Liu] rm PROJECT_HOME
56670ef [Davies Liu] rm man page
ba4b80b [Davies Liu] Merge branch 'remote_r' into R
f04080c [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into remote_r
028cbfb [Davies Liu] fix exit code of sparkr unit test
42d8b4c [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
ef26015 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
a1870e8 [Shivaram Venkataraman] Merge pull request #214 from sun-rui/SPARKR-156_3
cb6e5e3 [Shivaram Venkataraman] Add scripts to start SparkR on windows
8030847 [Shivaram Venkataraman] Set windows file separators, install dirs
05afef0 [Shivaram Venkataraman] Only stop backend JVM if R launched it
95d2de3 [Davies Liu] fix spark-submit with R scripot
baefd9e [Shivaram Venkataraman] Make bin/sparkR use spark-submit As a part of this move the R initialization functions into first.R and first-submit.R
d6f2bdd [Shivaram Venkataraman] Fix run-tests path
ea90fab [Davies Liu] fix spark-submit with R path and sparkR -h
0e2412c [Davies Liu] fix bin/sparkR
9f6aa1f [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
479e3fe [Davies Liu] change println() to logging
52ca6e5 [Shivaram Venkataraman] Add missing comma
716b16f [Shivaram Venkataraman] Merge branch 'R' of https://github.com/amplab-extras/spark into R
2d235d4 [Shivaram Venkataraman] Build SparkR with Maven profile
aae881b [Davies Liu] fix rat
ff776aa [Shivaram Venkataraman] Fix style
e4f1937 [Shivaram Venkataraman] Remove DFC example
f7b6936 [Davies Liu] remove Spark prefix for class
043959e [Davies Liu] cleanup
ba53b09 [Davies Liu] support R in spark-submit
f403b4a [Davies Liu] rm .travis.yml
c4a5bdf [Davies Liu] run sparkr tests in Spark
e8fc7ca [Davies Liu] fix .gitignore
35e5755 [Davies Liu] reduce size of example data
50bff63 [Davies Liu] add LICENSE header for R sources
facb6e0 [Davies Liu] add .gitignore for .o, .so, .Rd
18e5eed [Davies Liu] update docs
0a0e632 [Davies Liu] move sparkR into bin/
a76472f [Davies Liu] fix path of assembly jar
df3eeea [Davies Liu] move R/examples into examples/src/main/r
3415cc7 [Davies Liu] move Scala source into core/ and sql/
180fc9c [Davies Liu] move scala
014d253 [Davies Liu] delete man pages
49a8133 [Davies Liu] Merge branch 'remote_r' into R
44994c2 [Davies Liu] Moved files to R/
2fc553f [Shivaram Venkataraman] Merge pull request #222 from davies/column2
b043876 [Davies Liu] fix test
5e610cb [Davies Liu] add more API for Column
6f95d49 [Shivaram Venkataraman] Merge pull request #221 from shivaram/sparkr-stop-start
3214c6d [Shivaram Venkataraman] Merge pull request #217 from hlin09/cleanClosureFix
f5d3355 [Shivaram Venkataraman] Merge pull request #218 from davies/merge
70f620c [Davies Liu] address comments
4b1628d [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into merge
3139325 [Shivaram Venkataraman] Merge pull request #212 from davies/toDF
6122e0e [Davies Liu] handle NULL
bc2ff38 [Davies Liu] handle NULL
7f5e70c [Davies Liu] Update SerDe.scala
46454e4 [Davies Liu] address comments
dd52cbc [Shivaram Venkataraman] Merge pull request #220 from shivaram/sparkr-utils-include
662938a [Shivaram Venkataraman] Include utils before SparkR for `head` to work Before this change calling `head` on a DataFrame would not work from the sparkR script as utils would be loaded after SparkR and placed ahead in the search list. This change requires utils to be loaded before SparkR
1bc2998 [Shivaram Venkataraman] Merge pull request #179 from evertlammerts/sparkr-sql
7695d36 [Evert Lammerts] added tests
8190127 [Evert Lammerts] fixed parquetFile signature
d8c8fcc [Shivaram Venkataraman] Merge pull request #219 from shivaram/sparkr-build-final
963c7ee [Davies Liu] Merge branch 'master' into merge
8bff523 [Shivaram Venkataraman] Remove staging repo now that 1.3 is released
e52258f [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into toDF
05b9126 [Shivaram Venkataraman] Merge pull request #215 from davies/agg
8e1497d [Davies Liu] Update DataFrame.R
72adb14 [Davies Liu] Update SQLContext.R
66cc92a [Davies Liu] address commets
55c38bc [Shivaram Venkataraman] Merge pull request #216 from davies/select2
3e0555d [Shivaram Venkataraman] Merge pull request #193 from davies/daemon
0467474 [Davies Liu] add more selecter for DataFrame
9a6be74 [Davies Liu] include grouping columns in agg()
e87bb98 [Davies Liu] improve comment and logging
a6dc435 [Davies Liu] remove dependency of jsonlite
26a3621 [Davies Liu] support date.frame and Date/Time
4e4908a [Davies Liu] createDataFrame from rdd
5757b95 [Shivaram Venkataraman] Merge pull request #196 from davies/die
90f2692 [Shivaram Venkataraman] Merge pull request #211 from hlin09/generics
8583968 [Davies Liu] readFully()
46cea3d [Davies Liu] retry
01aa5ee [Davies Liu] add config for using daemon, refactor
ff948db [hlin09] Remove missingOrInteger.
ecdfda1 [hlin09] Remove duplication.
411b751 [Davies Liu] make RStudio happy
8f8813f [Davies Liu] switch back to use parallel
6bccbbf [hlin09] Move roxygen doc back to implementation.
ffd6e8e [Shivaram Venkataraman] Merge pull request #210 from hlin09/hlin09
471c794 [hlin09] Move getJRDD and broadcast's value to 00-generic.R.
89b886d [hlin09] Move setGeneric() to 00-generics.R.
97dde1a [hlin09] Add a test for access operators.
09ff163 [Shivaram Venkataraman] Merge pull request #204 from cafreeman/sparkr-sql
15a713f [cafreeman] Fix example for `dropTempTable`
dc1291b [hlin09] Add checks for namespace access operators in cleanClosure.
b4c0b2e [Davies Liu] use fork package
3db5649 [cafreeman] Merge branch 'sparkr-sql' of https://github.com/amplab-extras/SparkR-pkg into sparkr-sql
789be97 [Shivaram Venkataraman] Merge pull request #207 from shivaram/err-remove
e60578a [cafreeman] update tests to guarantee row order
5eec6fc [Shivaram Venkataraman] Merge pull request #206 from sun-rui/SPARKR-156_2
3f7aed6 [Sun Rui] Fix minor typos in the function description.
a8cebf0 [Shivaram Venkataraman] Remove print statement in SparkRBackendHandler This print statement is noisy for SQL methods which have multiple APIs (like loadDF). We already have a better error message when no valid methods are found
5e3a576 [Sun Rui] Fix indentation.
f3d99a6 [Sun Rui] [SPARKR-156] phase 2: implement zipWithIndex() of the RDD class.
a582810 [cafreeman] Merge branch 'dfMethods' into sparkr-sql
7a5d6fd [cafreeman] `withColumn` and `withColumnRenamed`
c5fa3b9 [cafreeman] New `select` method
bcb0bf5 [Shivaram Venkataraman] Merge pull request #180 from davies/group
9dd6a5a [Davies Liu] Update SparkRBackendHandler.scala
e6fb8d8 [Davies Liu] improve logging
428a99a [Davies Liu] remove test, catch exception
fef99de [cafreeman] `intersect`, `subtract`, `unionAll`
befbd32 [cafreeman] `insertInto`
9d01bcd [cafreeman] `dropTempTable`
d8c1c09 [Davies Liu] add test to start and stop context multiple times
18c6004 [Shivaram Venkataraman] Merge pull request #201 from sun-rui/SPARKR-156_1
dfb399a [Davies Liu] address comments
f06ccec [Sun Rui] Use mapply() instead of for statement.
3c7674f [Davies Liu] Merge branch 'die' of github.com:davies/SparkR-pkg into die
ac8a852 [Davies Liu] close monitor connection in sparkR.stop()
4d0fb56 [Shivaram Venkataraman] Merge pull request #203 from shivaram/sparkr-hive-fix
62b0760 [Shivaram Venkataraman] Fix test hive context package name
47a613f [Shivaram Venkataraman] Fix HiveContext package name
fb3b139 [Davies Liu] fix tests
d0d4626 [Shivaram Venkataraman] Merge pull request #199 from davies/load
8b7fb67 [Davies Liu] fix HiveContext
bb46832 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into load
e9e2a03 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into group
b875b4f [Davies Liu] fix style
de2abfa [Shivaram Venkataraman] Merge pull request #202 from cafreeman/sparkr-sql
3675fcf [cafreeman] Update `explain` and fixed doc for `toJSON`
5fd9575 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into load
6fac596 [Davies Liu] support Column expression in agg()
f10a24e [Davies Liu] address comments
ff8b005 [cafreeman] 'saveAsParquetFile`
a5c2887 [cafreeman] fix test
3fab0f8 [cafreeman] `showDF`
779c102 [cafreeman] `isLocal`
68b11cf [cafreeman] `toJSON`
0ac4abc [cafreeman] 'explain`
20242c4 [cafreeman] clean up docs
6a1fe64 [Shivaram Venkataraman] Merge pull request #198 from cafreeman/sparkr-sql
198c130 [Shivaram Venkataraman] Merge pull request #200 from shivaram/sparkr-sql-build
870acd4 [Shivaram Venkataraman] Use rc2 explicitly
8b9a963 [cafreeman] Merge branch 'sparkr-sql' of https://github.com/amplab-extras/SparkR-pkg into sparkr-sql
bc90115 [cafreeman] Fixed docs
3865f39 [Sun Rui] [SPARKR-156] phase 1: implement zipWithUniqueId() of the RDD class.
a37fd80 [Davies Liu] Update sparkR.R
d18f9d3 [Shivaram Venkataraman] Remove SparkR snapshot build We now have 1.3.0 RC2 on Apache Staging
8de958d [Davies Liu] Update SparkRBackend.scala
4e0becc [Shivaram Venkataraman] Merge pull request #194 from davies/api
197a79b [Davies Liu] add HiveContext (commented)
32aa01d [Shivaram Venkataraman] Merge pull request #191 from felixcheung/doc
5073e07 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into load
7918634 [cafreeman] Fix test
acea146 [cafreeman] remove extra line
74269f3 [cafreeman] Merge branch 'dfMethods' into sparkr-sql
cd7ac8a [Shivaram Venkataraman] Merge pull request #197 from cafreeman/sparkr-sql
494a4dd [cafreeman] update export
e14c328 [cafreeman] `selectExpr`
32b37d1 [cafreeman] Fixed indent in `join` test.
2e7b190 [Felix Cheung] small update on yarn deploy mode.
8ff29d6 [Davies Liu] fix tests
12a6db2 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into api
294ca4a [cafreeman] `join`, `sort`, and `filter`
4fa6343 [cafreeman] Refactor `join` generic for use with `DataFrame`
3f22c8d [Shivaram Venkataraman] Merge pull request #195 from cafreeman/sparkr-sql
2b6f980 [Davies Liu] shutdown the JVM after R process die
e8639c3 [cafreeman] New 1.3 repo and updates to `column.R`
ed9a89f [Davies Liu] address comments
03bcf20 [Davies Liu] Merge branch 'group' of github.com:davies/SparkR-pkg into group
39c253d [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into group
98cc97a [Davies Liu] fix test and docs
e2d144a [Felix Cheung] Fixed small typos
3beadcf [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into api
06cbc2d [Davies Liu] launch R worker by a daemon
8a676b1 [Shivaram Venkataraman] Merge pull request #188 from davies/column
524c122 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into column
f798402 [Davies Liu] Update column.R
1d0f2ae [Davies Liu] Update DataFrame.R
03402eb [Felix Cheung] Updates as per feedback on sparkR-submit
76cf2e0 [Shivaram Venkataraman] Merge pull request #192 from cafreeman/sparkr-sql
1955a09 [cafreeman] return object instead of a list of one object
f585929 [cafreeman] Fix brackets
e998356 [cafreeman] define generic for 'first' in RDD API
71d66a1 [Davies Liu] fix first(0
8ec21af [Davies Liu] fix signature
acae527 [Davies Liu] refactor
d7b17a4 [Davies Liu] fix approxCountDistinct
7dfe27d [Davies Liu] fix cyclic namespace dependency
8caf5bb [Davies Liu] use S4 methods
5c0bb24 [Felix Cheung] Doc updates: build and running on YARN
773baf0 [Zongheng Yang] Merge pull request #178 from davies/random
862f07c [Shivaram Venkataraman] Merge pull request #190 from shivaram/SPARKR-79
b457833 [Shivaram Venkataraman] Merge pull request #189 from shivaram/stdErrFix
f7caeb8 [Davies Liu] Update SparkRBackend.scala
8c4deae [Shivaram Venkataraman] Remove unused function
6e51c7f [Shivaram Venkataraman] Fix stderr redirection on executors
7afa4c9 [Shivaram Venkataraman] Merge pull request #186 from hlin09/funcDep3
4d36ab1 [hlin09] Add tests for broadcast variables.
3f57e56 [hlin09] Fix comments.
7b72487 [hlin09] Fix comments.
ae05bf1 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into column
abb4bb9 [Davies Liu] add Column and expression
eb8ac11 [Shivaram Venkataraman] Set Spark version 1.3.0 in Windows build
5c72e73 [Davies Liu] wait atmost 100 seconds
e425437 [Shivaram Venkataraman] Merge pull request #177 from lythesia/master
a00f502 [lythesia] fix indents
0346e5f [Davies Liu] address comment
6134649 [Shivaram Venkataraman] Merge pull request #187 from cafreeman/sparkr-sql
ad0935e [lythesia] minor fixes
b0e7f73 [cafreeman] Update `sampleDF` test
7b0d070 [lythesia] keep partitions check
889c265 [cafreeman] numToInt utility function
27dd3a0 [lythesia] modify tests for repartition
cad0f0c [cafreeman] Fix docs and indents
2808dcf [cafreeman] Three more DataFrame methods
5ef66fb [Davies Liu] send back the port via temporary file
3b46429 [Davies Liu] Merge branch 'master' of github.com:amplab-extras/SparkR-pkg into random
798f453 [cafreeman] Merge branch 'sparkr-sql' into dev
9aa4acf [Shivaram Venkataraman] Merge pull request #184 from davies/socket
020bce8 [Shivaram Venkataraman] Merge pull request #183 from cafreeman/sparkr-sql
222e06b [cafreeman] Lazy evaluation and formatting changes
e776324 [Davies Liu] fix import
211cc15 [cafreeman] Merge branch 'sparkr-sql' into dev
3351afd [hlin09] Replaces getDependencies with cleanClosure, to serialize UDFs to workers.
e7c56d6 [lythesia] fix random partition key
50c74b1 [Davies Liu] address comments
083c89f [cafreeman] Remove commented lines an unused import
dfa119b [hlin09] Improve the coverage of processClosure.
a41c9b9 [cafreeman] Merge branch 'wrapper' into sparkr-sql
1cd714f [cafreeman] Wrapper function docs.
db0cd9e [cafreeman] Clean up for wrapper functions
818c19f [cafreeman] Update schema-related functions
a57884e [cafreeman] Remove unused import
d72e830 [cafreeman] Add wrapper for `StructField` and `StructType`
2ea2ecf [lythesia] use generic arg
09b9512 [hlin09] add docs
f4f077c [hlin09] Add recursive cleanClosure for function access.
f84ad27 [hlin09] Merge remote-tracking branch 'upstream/master' into funcDep2
5300766 [Shivaram Venkataraman] Merge pull request #185 from hlin09/hlin09
07aa7c0 [hlin09] Unifies the implementation of lapply with lapplyParitionsWithIndex.
f4dbb0b [Davies Liu] use socket in worker
8282c59 [Davies Liu] Update DataFrame.R
ba495a8 [Davies Liu] Update NAMESPACE
36dffb3 [cafreeman] Add 'head` and `first`
534a95f [cafreeman] Schema-related methods
64f488d [cafreeman] Cache and Persist Methods
30d71fd [cafreeman] Standardize method arguments for DataFrame methods
785898b [Shivaram Venkataraman] Merge pull request #182 from cafreeman/sparkr-sql
2619003 [Shivaram Venkataraman] Merge pull request #181 from cafreeman/master
a9bbe0b [cafreeman] Update existing SparkSQL functions
8c241a3 [cafreeman] Merge with master, include changes to method args
68d6de4 [cafreeman] Fix typos
8d2ec6e [Davies Liu] add sum/max/min/avg/mean
774e687 [Davies Liu] add missing API in SQLContext
1e72b4b [Davies Liu] missing API in SQLContext
3294949 [Chris Freeman] Restore `rdd` argument to `getJRDD`
3a58ebc [Davies Liu] rm unrelated file
8bd93b5 [Davies Liu] fix signature
c652b4c [cafreeman] Update method signatures to use generic arg
48c8827 [Davies Liu] update NAMESPACE
84e2d8c [Davies Liu] groupBy and agg()
7c3ddbd [Davies Liu] create jmode in JVM
9465426 [Davies Liu] load and save
982f342 [lythesia] fix numeric issue
7651d84 [lythesia] fix coalesce
4e712e1 [Davies Liu] use random port in backend
041d22b [Shivaram Venkataraman] Merge pull request #172 from cafreeman/sparkr-sql
0d07770 [cafreeman] Added `limit` and updated `take`
301d8e5 [cafreeman] Remove extraneous map functions
0387db2 [cafreeman] Remove colNames
04c4b65 [lythesia] add repartition/coalesce
231deab [cafreeman] Change reserialize to serializeToBytes
acf7e1a [cafreeman] Rework the Scala to R DataFrame Conversion
481ae37 [cafreeman] Updated stale comments and standardized arg names
21d4a97 [hlin09] Adds cleanClosure to capture the function closures.
d24ffb4 [hlin09] Merge remote-tracking branch 'upstream/master' into funcDep2
8be02de [hlin09] Revert "loop 1-12 test pass."
fddb9cc [hlin09] Revert "add docs"
f8ef0ab [hlin09] Revert "More docs"
8e4b3da [hlin09] Revert "More docs"
57e005b [hlin09] Revert "fix tests."
c10148e [Shivaram Venkataraman] Merge pull request #174 from shivaram/sparkr-runner
910e3be [Shivaram Venkataraman] Add a timeout for initialization Also move sparkRBackend.stop into a finally block
bf52b17 [Shivaram Venkataraman] Merge remote-tracking branch 'amplab-sparkr/master' into sparkr-runner
08102b0 [Shivaram Venkataraman] Merge pull request #176 from lythesia/master
9c77b20 [Chris Freeman] Merge pull request #2 from shivaram/sparkr-sql
179ab38 [lythesia] add try counts and increase time interval
71a73b2 [Shivaram Venkataraman] Use a getter for serialization mode This change encapsulates the semantics of serialization mode for RDDs inside a getter function. For PipelinedRDDs if a backing JavaRDD is available we use that else we fall back to a default serialization mode
06bf250 [Shivaram Venkataraman] Merge pull request #173 from shivaram/windows-space-fix
88bf97f [Shivaram Venkataraman] Create SparkContext for R shell launch
f9268d9 [Shivaram Venkataraman] Fix code review comments
e6ad12d [Shivaram Venkataraman] Update comment describing sparkR-submit
17eda4c [Shivaram Venkataraman] Merge pull request #175 from falaki/docfix
ba2b72b [Hossein] Spark 1.1.0 is default
4cd7d3f [lythesia] retry backend connection
749e2d0 [Hossein] Updated README
bc04cf4 [Shivaram Venkataraman] Use SPARKR_BACKEND_PORT in sparkR.R as default Change SparkRRunner to use EXISTING_SPARKR_BACKEND_PORT to differentiate between the two
22a19ac [Shivaram Venkataraman] Use a semaphore to wait for backend to initalize Also pick a random port to avoid collisions
7f1f0f8 [cafreeman] Move comments to fit 100 char line length
8b84e4e [cafreeman] Make if statements more explicit
ce5d5ab [cafreeman] New tests for Union and Object File
b063320 [cafreeman] Changed 'serialized' to 'serializedMode'
0981dff [Zongheng Yang] Merge pull request #168 from sun-rui/SPARKR-153_2
86fc639 [Shivaram Venkataraman] Move sparkR-submit into pkg/inst
fd8f8a9 [Shivaram Venkataraman] Merge branch 'hqzizania-master'
a33dbea [Shivaram Venkataraman] Merge branch 'master' of https://github.com/hqzizania/SparkR-pkg into hqzizania-master
384e6e2 [Shivaram Venkataraman] Merge pull request #171 from hlin09/hlin09
1f5a6ac [hlin09] fixed comments
7f7596a [cafreeman] Additional handling for "row" serialization
8c3b8c5 [cafreeman] Add test for UnionRDD on "row" serialization
b1141f8 [cafreeman] Fixed formatting issues.
5db30bf [cafreeman] Changed serialized from bool to string
2f0c0b8 [cafreeman] Add check for serialized type
d243dfb [cafreeman] Clean up code
5ff63a2 [cafreeman] Change test from boolean to string
77fec1a [cafreeman] Updated .Rd files
9224989 [cafreeman] Various updates for DataFrame to RRDD
26af62b [cafreeman] DataFrame to RRDD
e004481 [cafreeman] Update UnionRDD test
5292be7 [hlin09] Adds support of pipeRDD().
e2a7560 [Shivaram Venkataraman] Merge pull request #170 from cafreeman/sparkr-sql
5d537f4 [cafreeman] Add pairRDD to Description
b6fa88e [cafreeman] Updating to current master
0cda231 [Sun Rui] [SPARKR-153] phase 2: implement aggregateByKey() and foldByKey().
95ee6b4 [Shivaram Venkataraman] Merge remote-tracking branch 'amplab-sparkr/master' into sparkr-runner
67fbc60 [Shivaram Venkataraman] Add support for SparkR shell to use spark-submit This ensures that SparkConf options are read in both in batch and interactive modes
2271030 [Shivaram Venkataraman] Merge pull request #167 from sun-rui/removePartionByInRDD
7fcb46a [Sun Rui] Remove partitionBy() in RDD.
52f94c4 [Shivaram Venkataraman] Merge pull request #160 from lythesia/master
59e2d54 [lythesia] merge with upstream
5836650 [Zongheng Yang] Merge pull request #163 from sun-rui/SPARKR-153_1
141723e [Sun Rui] fix comments.
f73a07e [Shivaram Venkataraman] Merge pull request #165 from shivaram/sparkr-sql-build
10ffc6d [Shivaram Venkataraman] Set Spark version to 1.3 using staging dependency Also fix the maven build
c91ede2 [Shivaram Venkataraman] Merge pull request #164 from hlin09/hlin09
9d335a9 [hlin09] Makes git to ignore Eclipse meta files.
94066bf [Sun Rui] [SPARKR-153] phase 1: implement fold() and aggregate().
9c391c7 [hqzizania] Merge remote-tracking branch 'upstream/master'
5f29551 [hqzizania] 	modified:   pkg/R/RDD.R 	modified:   pkg/R/context.R
d968664 [lythesia] fix comment
7972858 [Shivaram Venkataraman] Merge pull request #159 from sun-rui/SPARKR-150_2
7690878 [lythesia] separate out pair RDD functions
f4573c1 [Sun Rui] Use reduce() instead of sortBy().take() to get the ordered elements.
63e62ed [Sun Rui] [SPARKR-150] phase 2: implement takeOrdered() and top().
050390b [Shivaram Venkataraman] Fix bugs in inferring R file
8398f2e [Shivaram Venkataraman] Add sparkR-submit helper script Also adjust R file path for YARN cluster mode
bd6705b [Zongheng Yang] Merge pull request #154 from sun-rui/SPARKR-150
c7964c9 [Sun Rui] Merge with upstream master.
7feac38 [Sun Rui] Use default arguments for sortBy() and sortKeyBy().
de2bfb3 [Sun Rui] Fix minor comments and add more test cases.
0c6e071 [Zongheng Yang] Merge pull request #157 from lythesia/master
f5038c0 [lythesia] pull out anonymous functions in groupByKey
ba6f044 [lythesia] fixes for reduceByKeyLocally
343b6ab [Oscar Olmedo] Export sparkR.stop Closes #156 from oscaroboto/master
25639cf [Shivaram Venkataraman] Replace tabs with spaces
bb25920 [Shivaram Venkataraman] Merge branch 'dputler-master'
fd836db [hlin09] fix tests.
24a7f13 [hlin09] More docs
a465165 [hlin09] More docs
6ad4fc3 [hlin09] add docs
b082a35 [lythesia] add reduceByKeyLocally
7ca6512 [Shivaram Venkataraman] First cut of SparkRRunner
193f5fe [hlin09] loop 1-12 test pass.
345f1b8 [dputler] [SPARKR-195] Implemented project style guidelines for if-else statements
8043559 [Sun Rui] Add a TODO to use binary search in the range partitioner.
91b2fd6 [Sun Rui] Add more test cases.
e8ebbe4 [Shivaram Venkataraman] Merge pull request #152 from cafreeman/sparkr-sql
0c53d6c [dputler] Data frames now coerced to lists, and messages issued for a data frame or matrix on how they are parallelized
6d57ec0 [cafreeman] Remove json test file since we're using a temp
ac1ef09 [cafreeman] Update registerTempTable test
d9da451 [Sun Rui] [SPARKR-150] phase 1: implement sortBy() and sortByKey().
08ff30b [Shivaram Venkataraman] Merge pull request #153 from hqzizania/master
9767e8e [hqzizania] 	modified:   pkg/man/collect-methods.Rd
5d69f0a [hqzizania] 	modified:   pkg/R/RDD.R
4914091 [hqzizania] 	modified:   pkg/inst/tests/test_rdd.R
742a68b [cafreeman] Update test_sparkRSQL.R
a95823e [hqzizania] 	modified:   pkg/R/RDD.R
2d04526 [cafreeman] Formatting
fae9bdd [cafreeman] Renamed to SQLUtils.scala
39888ea [Chris Freeman] Update test_sparkSQL.R
fce2453 [cafreeman] Updated documentation for SQLContext
13fbf12 [cafreeman] Regenerated .Rd files
51ecf41 [cafreeman] Updated Scala object
30d7337 [cafreeman] Added SparkSQL test
74b3ed6 [cafreeman] Incorporate code feedback
554bda0 [Zongheng Yang] Merge pull request #147 from shivaram/sparkr-ec2-fixes
a5f4f8f [cafreeman] Squashed commit of the following:
f34bb88 [Shivaram Venkataraman] Remove profiling information from this PR
c662f29 [Zongheng Yang] Merge pull request #146 from shivaram/spark-1.2-build
21e9b74 [Zongheng Yang] Merge pull request #145 from lythesia/master
76f6b9e [Shivaram Venkataraman] Merge pull request #149 from hqzizania/master
1c2dbec [lythesia] minor fix for refactoring join code
5b380d3 [hqzizania] 	modified:   pkg/man/combineByKey.Rd 	modified:   pkg/man/groupByKey.Rd 	modified:   pkg/man/partitionBy.Rd 	modified:   pkg/man/reduceByKey.Rd
98794fe [hqzizania] 	modified:   pkg/R/RDD.R
b66534d [Zongheng Yang] Merge pull request #144 from shivaram/fix-rd-files
60da1df [Shivaram Venkataraman] Initialize timing variables
179aa75 [Shivaram Venkataraman] Bunch of fixes for longer running jobs 1. Increase the timeout for socket connection to wait for long jobs 2. Add some profiling information in worker.R 3. Put temp file writes before stdin writes in RRDD.scala
06d99f0 [Shivaram Venkataraman] Fix URI to have right number of slashes
add97f5 [Shivaram Venkataraman] Use URL encode to create valid URIs for jars
4eec962 [lythesia] refactor join functions
73430c6 [Shivaram Venkataraman] Make SparkR work on paths with spaces on Windows
aaf8f47 [Shivaram Venkataraman] Exclude hadoop client from Spark dependency
227ee42 [Zongheng Yang] Merge pull request #141 from shivaram/SPARKR-140
ac5ceb1 [Shivaram Venkataraman] Fix code review comments
32394de [Shivaram Venkataraman] Regenerate Rd files for SparkR This fixes a number of issues in SparkR man pages. The main changes are 1. Don't export or generate docs for PipelineRDD 2. Fix variable names for Filter, count to match base methods 3. Document missing arguments for sparkR.init, print.jobj etc.
e157bf6 [Shivaram Venkataraman] Use prev_serialized to track if JRDD is serialized This changes introduces a new variable in PipelineRDD environment to track if the prev_jrdd is serialized or not.
7428a7e [Zongheng Yang] Merge pull request #143 from shivaram/SPARKR-181
7dd1797 [Shivaram Venkataraman] Address code review comments
8f81c45 [Shivaram Venkataraman] Remove roxygen export for PipelinedRDD
0cb90f1 [Zongheng Yang] Merge pull request #142 from shivaram/SPARKR-169
d1c6e6c [Shivaram Venkataraman] Buffer stderr from R and return it on Exception This change buffers the last 100 lines from R process and passes these lines back to the driver if we have an exception. This will help users debug why their tasks failed on the cluster
d6c1393 [Shivaram Venkataraman] Suppress warnings from normalizePath
a382835 [Shivaram Venkataraman] Fix serialization tracking in pipelined RDDs When creating a pipeline RDD, we need to check if the JavaRDD belonging to the parent is serialized.
da39529 [Zongheng Yang] Merge pull request #140 from sun-rui/SPARKR-183
2814caa [Sun Rui] Merge with upstream master.
cd2a5b3 [Sun Rui] Add reference to Nagle's algorithm and clean code.
52356b6 [Shivaram Venkataraman] Merge pull request #139 from shivaram/fix-backend-exit
97e5a1f [Sun Rui] [SPARKR-183] Fix the issue that parallelize collect tests are slow.
a9f8e8e [Shivaram Venkataraman] Merge pull request #138 from concretevitamin/fix-collect-test
125ae43 [Shivaram Venkataraman] Fix SparkR backend to exit in more cases This change has two fixes 1. When the workspace is saved (from R or RStudio) the backend connection seems to be closed before the finalizer is run. In such cases we reopen the connection and stop the backend 2. With RStudio when R is restarted, there are port-conflicts which appear due to a race condition between the JVM and rsession restart. This change adds a 1 sec sleep to avoid this race.
12c102a [Zongheng Yang] Simplify a unit test.
9c0637a [Zongheng Yang] Merge pull request #137 from shivaram/fix-docs
0df0e18 [Shivaram Venkataraman] Fix documentation for includePackage
7549f88 [Zongheng Yang] Merge pull request #136 from shivaram/man-updates
7edbe46 [Shivaram Venkataraman] Add missing man pages
9cb9567 [Shivaram Venkataraman] Merge pull request #131 from shivaram/rJavaExpt
1fa722e [Shivaram Venkataraman] Rename to SerDe now
2fcb051 [Shivaram Venkataraman] Rename to SerDeJVMR
d112cf0 [Shivaram Venkataraman] Style fixes
9fd01cc [Shivaram Venkataraman] Remove unnecessary braces
0881931 [Shivaram Venkataraman] Some more style fixes
f00b531 [Shivaram Venkataraman] Address code review comments. Big changes include style fixes throughout for named arguments
c09ba05 [Shivaram Venkataraman] Change jobj id to be just an integer Add a new print.jobj that gets the class name and prints it Also add a utility function isInstanceOf
be05b16 [Shivaram Venkataraman] Check if context, connection exist before stopping
d596a23 [Shivaram Venkataraman] Address code review comments
396e7ac [Shivaram Venkataraman] Changes to make new backend work on Windows This change uses file.path to construct the Java binary path in a OS agnostic way and uses system2 to handle quoting binary paths correctly. Tests pass on Mac OSX and a Windows EC2 instance.
e7a4e03 [Shivaram Venkataraman] Remove unused file BACKEND.md
62f380b [Shivaram Venkataraman] Update worker.R to use new deserialization call
8b9c4e6 [Shivaram Venkataraman] Change RDD name, setName to use new backend
6dcd5c5 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into rJavaExpt
0873397 [Shivaram Venkataraman] Refactor java object tracking into a new singleton. Also add comments describing each class
95db964 [Shivaram Venkataraman] Add comments, cleanup new R code
bcd4258 [Zongheng Yang] Merge pull request #130 from lythesia/master
74dbc5e [Sun Rui] Match method using parameter types.
7ad4a4d [Sun Rui] Use 1 char to represent types on the backend->client direction.
bace887 [Sun Rui] Use an integer count for the backend java object ID because Uniqueness isn't guaranteed by System.identityHashCode().
b38d04f [Sun Rui] Use 1 char to represent types on the client -> backend direction.
f88bc68 [lythesia] Merge branch 'master' of github.com:lythesia/SparkR-pkg
71d41f5 [lythesia] add test case for fullOuterJoin
eb4f423 [lythesia] --amend
cffecc5 [lythesia] add test case for fullOuterJoin
a547dd2 [Shivaram Venkataraman] Move classTag, rddRef into newJObject call This avoids them getting eagerly garbage collected
1255391 [Shivaram Venkataraman] Add a finalizer for jobj objects This enables Java objects to be garbage collected on the backend when they are no longer referenced in R. Also rename newJava to newJObject to be more consistent with callJMethod
70fa409 [Sun Rui] Add YARN Conf Dir to the class path when launching the backend.
a1108ca [lythesia] add fullOuterJoin in RDD.R
2152727 [Shivaram Venkataraman] Remove empty file
cd08bee [Shivaram Venkataraman] Update all functions to use new backend All unit tests pass.
9de49b7 [Shivaram Venkataraman] Add high level calls for methods, constructors Also update BACKEND.md
5a97ea4 [Shivaram Venkataraman] Add jobj S3 class that holds backend refs
e071d3e [Shivaram Venkataraman] Change SparkRBackend to use general method calls This change uses a custom protocl + JNI to invoke any method on a given object type. Also update serializers, deserializers to make code more concise
49f0404 [Shivaram Venkataraman] Merge pull request #129 from lythesia/master
7f8cd82 [lythesia] update man
4715ed2 [Yi Lu] Update RDD.R
5a53801 [lythesia] fix name,setName
4f3870b [lythesia] add name,setName in RDD.R
1c25700 [Shivaram Venkataraman] Merge pull request #128 from sun-rui/SPARKR-165
c8507d8 [Sun Rui] [SPARKR-165] IS_SCALAR is not present in R before 3.1
2cff2bd [Sun Rui] Add function to invoke Java method.
7a31da1 [Shivaram Venkataraman] Merge branch 'dputler-master'. Closes #119
0ceba82 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/dputler/SparkR-pkg into dputler-master
735f70c [Shivaram Venkataraman] Merge pull request #125 from 7c00/rawcon
fccfe6c [Shivaram Venkataraman] Merge pull request #127 from sun-rui/SPARKR-164
387bd57 [Sun Rui] [SPARKR-164] Temporary files used by SparkR accumulat as time goes on.
5f2268f [Shivaram Venkataraman] Add support to stop backend
5f745c0 [Shivaram Venkataraman] Update notes in backend
22015c1 [Shivaram Venkataraman] Add first cut of SparkR Backend
52821da [Todd Gao] switch the order of packages and function deps
d7b0007 [Todd Gao] remove memCompress
cb6873e [Shivaram Venkataraman] Merge pull request #126 from sun-rui/SPARKR-147
c5962eb [Todd Gao] further optimize using rawConnection
f04c6e0 [Sun Rui] [SPARKR-147] Support multiple directories as input to textFile.
b7de604 [Todd Gao] optimize execFunctionDeps loading in worker.R
4d4fc30 [Shivaram Venkataraman] Merge pull request #122 from cafreeman/master
b508877 [cafreeman] Update SparkR_IDE_Setup.sh
21ed9d7 [cafreeman] Update build.sbt
f73ec16 [cafreeman] Delete SparkR_IDE_Setup_Guide.md
d63b026 [cafreeman] Delete SparkR_Quick_Start_Guide.md
6e6cb62 [cafreeman] Update SparkR_IDE_Setup.sh
bc6042b [cafreeman] Update build.sbt
a8197d5 [cafreeman] Merge remote-tracking branch 'upstream/master'
d671564 [Zongheng Yang] Merge pull request #123 from shivaram/jcheck-void
76b8d00 [Zongheng Yang] Merge pull request #124 from shivaram/master
b690d58 [Shivaram Venkataraman] Specify how to change Spark versions in README
0fb003d [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into jcheck-void
1c227b4 [Shivaram Venkataraman] Also add a check in context.R
96812b6 [Shivaram Venkataraman] Check for exceptions after void method calls
f5c216d [cafreeman] Merge remote-tracking branch 'upstream/master'
90c8933 [Zongheng Yang] Merge pull request #121 from shivaram/fix-sort-order
bd0e3b4 [Shivaram Venkataraman] Fix saveAsTextFile test case
2e55f67 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into fix-sort-order
f10c607 [Shivaram Venkataraman] Merge pull request #118 from sun-rui/saveAsTextFile
6c9bfc0 [Sun Rui] Merge remote-tracking branch 'SparkR_upstream/master' into saveAsTextFile
6faedbe [cafreeman] Update SparkR_IDE_Setup_Guide.md
57008bc [cafreeman] Update SparkR_IDE_Setup.sh
bb1c17d [cafreeman] Update SparkR_IDE_Setup.sh
538bfdb [cafreeman] Update SparkR_Quick_Start_Guide.md
31322c6 [cafreeman] Update SparkR_IDE_Setup.sh
ca3f593 [Sun Rui] Refactor RRDD code.
df58d95 [cafreeman] Update SparkR_Quick_Start_Guide.md
b488c88 [cafreeman] Rename Spark_IDE_Setup.sh to SparkR_IDE_Setup.sh
b2545a4 [cafreeman] Added IDE Setup Guide
0ffb5de [cafreeman] Merge branch 'master' of https://github.com/cafreeman/SparkR-pkg
bd8fbfb [cafreeman] Merge remote-tracking branch 'upstream/master'
98efa5b [cafreeman] Added Quick Start Guide
3cf88f2 [Shivaram Venkataraman] Sort lists before comparing in unit tests Since Spark doesn't guarantee that shuffle results will always be in the same order, we need to sort the results before comparing for deterministic behavior
d621dbc [Shivaram Venkataraman] Merge pull request #120 from sun-rui/objectFile
c4a44d7 [Sun Rui] Add @seealso in comments and extract some common code into a function.
724e3a4 [cafreeman] Update Spark_IDE_Setup.sh
8153e5a [Sun Rui] [SPARKR-146] Support read/save object files in SparkR.
17f9909 [cafreeman] Update Spark_IDE_Setup.sh
a9eb080 [cafreeman] IDE Shell Script
64d800c [dputler] Merge remote branch 'upstream/master'
1fbdb2e [dputler] Added the ability for the user to specify a text file location throught the use of tilde expansion or just the file name if it is in the working directory.
d83c017 [Shivaram Venkataraman] Merge pull request #113 from sun-rui/stringHashCodeInC
a7d9cdb [Sun Rui] Fix build on Windows.
7d81b05 [Shivaram Venkataraman] Merge pull request #114 from hlin09/hlin09
47c4bb7 [hlin09] fix reviews
a457f7f [Shivaram Venkataraman] Merge pull request #116 from dputler/master
0fa48d1 [Shivaram Venkataraman] Merge pull request #117 from sun-rui/keyBy
85cfeb4 [Sun Rui] [SPARKR-144] Implement saveAsTextFile() in the RDD class.
09083d9 [Sun Rui] Add keyBy() to the RDD class.
caad5d7 [dputler] Adding the script to install software on the Cloudera Quick Start VM.
dca3d05 [hlin09] Minor fix.
ece5f7d [hlin09] Merge remote-tracking branch 'upstream/master' into hlin09
a40874b [hlin09] Use extendible accumulators aggregate the cogroup values.
d0347ce [Zongheng Yang] Merge pull request #112 from sun-rui/outer_join
492f76e [Sun Rui] Refine code and add description.
ba01358 [Shivaram Venkataraman] Merge pull request #115 from sun-rui/SPARKR-130
5c8e46e [Sun Rui] Fix per the review comments.
7190a2c [Sun Rui] Update comment to add a reference to storage levels.
1da705e [hlin09] Fix the review comments.
c4b77be [Sun Rui] [SPARKR-130] Add persist(storageLevel) API to RDD.
b424a1a [hlin09] Add function cogroup().
9770312 [Shivaram Venkataraman] Merge pull request #111 from hlin09/hlin09
cead7df [hlin09] fix review comments.
54f712e [Sun Rui] Implement string hash code in C.
425f0c6 [Sun Rui] Add leftOuterJoin() and rightOuterJoin() to the RDD class.
39509c7 [hlin09] add Rd file for foreach and foreachPartition.
63d6ac7 [hlin09] Adds function foreach() and foreachPartition().
9c954df [Zongheng Yang] Merge pull request #105 from sun-rui/join
c71228d [Sun Rui] Pre-allocate list with fixed length. Add test case for join() using string key.
bc3e9f6 [Shivaram Venkataraman] Merge pull request #108 from concretevitamin/take-optimize
c06fc90 [Zongheng Yang] Fix: only optimize for unserialized dataset case.
d399aeb [Zongheng Yang] Apply size-capping on logical representation instead of physical.
e4217dd [Zongheng Yang] Merge pull request #107 from shivaram/master
7952180 [Shivaram Venkataraman] Copy, use getLocalDirs from Spark Utils.scala
08e24c3 [Zongheng Yang] Merge pull request #109 from hlin09/hlin09
97d4e02 [Zongheng Yang] Min() upper-bound size with actual size.
bb779bf [hlin09] Rename the filter function to filterRDD to follow the API consistency. Filter() is also kept.
ce1661f [Zongheng Yang] Fix slow take(): deserialize only up to necessary # of elements.
4dca9b1 [Shivaram Venkataraman] Merge pull request #106 from hlin09/hlin09
1220d92 [hlin09] Adds function numPartitions().
2326a65 [Shivaram Venkataraman] Use SPARK_LOCAL_DIRS to create tmp files
e119757 [hlin09] Minor fix.
9c24c8b [hlin09] Adds function countByKey().
48fce67 [hlin09] Adds countByValue().
6679eef [Sun Rui] Update documentation for join().
70586b4 [Sun Rui] Add join() to the RDD class.
e6fb999 [Zongheng Yang] Merge pull request #103 from shivaram/rlibdir-fix
a21f146 [Shivaram Venkataraman] Merge pull request #102 from hlin09/hlin09
32eb619 [Shivaram Venkataraman] Merge pull request #104 from sun-rui/add_keys_values
d8692e9 [Sun Rui] Add keys() and values() for the RDD class.
18b9be1 [Shivaram Venkataraman] Allow users to set where SparkR is installed This also adds a warning if somebody tries to call sparkR.init multiple times.
a17f135 [hlin09] Adds tests for flatMap and flatMapValues.
4bcf59b [hlin09] Adds function flatMapValues.
4a193ef [Zongheng Yang] Merge pull request #101 from ashutoshraina/master
60d22f2 [Ashutosh Raina] changed sbt version
5400793 [Zongheng Yang] Merge pull request #98 from shivaram/windows-fixes-build
36d61a7 [Shivaram Venkataraman] Merge pull request #97 from hlin09/hlin09
f7d7d89 [hlin09] Remove redundant code in test.
6bbe823 [hlin09] minor style fix.
9b47f3a [Shivaram Venkataraman] Merge pull request #100 from hnahak87/patch-1
7f6e4ea [Harihar Nahak] Update logistic_regression.R
a605047 [Shivaram Venkataraman] Merge pull request #99 from hlin09/makefile
323151d [hlin09] Fix yar flag in Makefile to remove build error in Maven.
8911897 [hlin09] Make reserialize() private function in package.
79aee73 [Shivaram Venkataraman] Add notes on how to build SparkR on windows
49a99e7 [Shivaram Venkataraman] Clean up some commented code
ddc271b [Shivaram Venkataraman] Only append file:/// to non empty jar paths
a53952e [Shivaram Venkataraman] Add windows build scripts
325b179 [hlin09] Merge remote-tracking branch 'upstream/master' into hlin09
daf5040 [hlin09] Add reserialize() before union if two RDDs are not both serialized.
536afb1 [hlin09] Add new function of union().
7044677 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into windows-fixes
d22a02d [Zongheng Yang] Merge pull request #94 from shivaram/windows-fixes-stdin
51924f7 [Shivaram Venkataraman] Merge pull request #90 from oscaroboto/master
eb97d85 [Shivaram Venkataraman] Merge pull request #96 from sun-rui/add_clarification_readme
5a128f4 [Sun Rui] Add clarification on setting Spark master when launching the SparkR shell.
187526a [oscaroboto] Update sparkR.R
32c567b [Shivaram Venkataraman] Merge pull request #95 from concretevitamin/master
4cd2d5e [Zongheng Yang] Notes about spark-ec2.
1c28e3b [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into windows-fixes
8e8a029 [Zongheng Yang] Merge pull request #92 from shivaram/sparkr-yarn
721043b [Zongheng Yang] Update README.md with YARN instructions.
1681f58 [Shivaram Venkataraman] Use temporary files for input instead of stdin This fixes a bug for Windows where stdin would get truncated
b084314 [oscaroboto] removed ... from example
44c93d4 [oscaroboto] Added example to SparkR.R
be82dcc [Shivaram Venkataraman] Merge pull request #93 from hlin09/hlin09
868554d [oscaroboto] Update sparkR.R
488ac47 [hlin09] Add generated Rd file of previous added functions, distinct() and mapValues().
b2740ad [hlin09] Add test for filter all elements. Add filter() as alias.
08d3631 [hlin09] Minor style fixes.
2c0e34f [hlin09] Adds function Filter(), which extracts the elements that satisfy a predicate.
5951d3b [Shivaram Venkataraman] Remove SBT plugin
4e70ced [oscaroboto] changed ExecutorEnv to sparkExecutorEnvMap, to make it consistent with sparkEnvirMap
903d18a [oscaroboto] changed executorEnv to sparkExecutorEnvMap,  will do the same in R
f97346e [oscaroboto] executorEnv to lower-case e
88a524e [oscaroboto] Added LD_LIBRARY_PATH to the ExecutorEnv. This is need so that the nodes can find libjvm.so, or if the master has a different LD_LIBRARY_PATH then the nodes. Make sure to export LD_LIBRARY_PATH  that includes the path to libjvm.so in the nodes.
1d208ae [oscaroboto] added the YARN_CONF_DIR to the classpath
8a9b75c [oscaroboto] forgot to change hm and ee inside the for loops
579db58 [Shivaram Venkataraman] Merge pull request #91 from sun-rui/add_max_min
4381efa [Sun Rui] use reduce() to implemement max() and min().
a5459c5 [Shivaram Venkataraman] Consolidate yarn flags
86b04eb [Shivaram Venkataraman] Don't use quotes around yarn
bf0797f [Shivaram Venkataraman] Add dependency on spark yarn module
af5fe77 [Shivaram Venkataraman] Fix SBT build, add dependency tree plugin
4917607 [Sun Rui] Add maximum() and minimum() API to RDD.
51bbbe4 [Shivaram Venkataraman] Changes to make SparkR work with YARN
9d5e3ab [oscaroboto] a few stylistic changes. Also change vars to sparkEnvirMap and eevars to ExecutorEnv, to match sparkR.R
578f545 [oscaroboto] a few stylistic changes
39eea2f [oscaroboto] Modification to dynamically create a sparkContext with YARN. Added .setExecutorEnv to the sparkConf in createSparkContext within the RRDD object. This modification was made together with sparkR.R
17ec42e [oscaroboto] A modification to dynamically create a sparkContext with YARN. sparkR.R modified to pass custom Jar file names and EnvironmentEnv to the sparkConf. RRDD.scala was also modified to accept the new inputs to creatSparkContext.
624ac9d [Shivaram Venkataraman] Merge pull request #87 from sun-rui/SPARKR-125
4f213db [Shivaram Venkataraman] Merge pull request #89 from sun-rui/SPARKR-108
eb833c5 [Shivaram Venkataraman] Merge pull request #88 from hlin09/hlin09
07bf971 [Sun Rui] [SPARKR-108] Implement map-side reduction for reduceByKey().
4accba1 [hlin09] Fixes style and adds an optional param 'numPartition' in distinct().
80d303a [hlin09] typo fixed.
e37a9b5 [hlin09] Adds function distinct() and mapValues().
08dac06 [Sun Rui] [SPARKR-125] Get the iterator of the parent RDD before launching a R worker process in compute() of RRDD/PairwiseRRDD
c4ba53c [Shivaram Venkataraman] Merge pull request #85 from edwardt/master
72a9d27 [root] reorder to keep relative ordering the same
f3fcb10 [root] fix up build.sbt also to match pom.xml
5ecbe3e [root] Make spark verison configurable in build script per ISSUE122
a44e63d [Shivaram Venkataraman] Merge pull request #84 from sun-rui/SPARKR-94
fbb5663 [Sun Rui] Add {} to one-line functions and add a test case for lookup where no match is found.
95beb4e [Shivaram Venkataraman] Merge pull request #82 from edwardt/master
36776c5 [edwardt] missed one 0.9.0 revert
b26deec [Sun Rui] [SPARKR-94] Add a  method to get an element of a pair RDD object by key.
1ba256e [edwardt] Keep 0.9.0 and says uses 1.1.0 by default
5380c43 [root] missed one version
21f74da [root] upgrade to spark version 1.1.0 to match lastest merge list
ddfcde9 [root] merge
67d067a [Shivaram Venkataraman] Merge pull request #81 from sun-rui/SparkR-117
993868f [Sun Rui] [SPARKR-117] Update Spark dependency to 1.1.0
d20661a [Zongheng Yang] Merge pull request #80 from sun-rui/master
0b2da9f [Sun Rui] Update Rd file and add a test case for mapPartitions.
5879648 [Sun Rui] Add mapPartitions() method to RDD for API consistency.
c033461 [Shivaram Venkataraman] Merge pull request #79 from sun-rui/fix-kmeans
f62b77e [Sun Rui] Adjust coding style.
b40911d [Sun Rui] Fix syntax error in examples/kmeans.R.
5304451 [Shivaram Venkataraman] Merge pull request #78 from sun-rui/master
70ffbfb [Sun Rui] Fix a bug that modifications to build.sbt won't trigger rebuilding.
a25696c [Shivaram Venkataraman] Merge pull request #76 from edwardt/addjira
b8bbd93 [edwardt] Update README.md
615d930 [edwardt] Update README.md
e522e69 [edwardt] Update README.md
03e6ced [edwardt] Update README.md
3007015 [root] don't check in gedit buffer file'
c35c9a6 [root] Add where to enter bugs ad feeback
469eae3 [edwardt] Update README.md
61b4a43 [edwardt] Update Makefile (style uniformity)
ce3337d [edwardt] Update README.md
7ff68fc [root] Merge branch 'master' of https://github.com/edwardt/SparkR-pkg
16353f5 [root] add links to devtools and install_github
513b9e5 [Shivaram Venkataraman] Merge pull request #72 from edwardt/master
31608a4 [edwardt] Update Makefile (style uniformity)
4ffe146 [root] Makefile: factor out SPARKR_VERSION to reduce potential copy&paste error; cp & rm called with -f in build/clean phase; .gitignore includes checkpoints and unit test log generated by run-tests.sh
715275f [Zongheng Yang] Merge pull request #68 from shivaram/master
90e2083 [Shivaram Venkataraman] Add return type to hasNext
8eb983d [Shivaram Venkataraman] Fix up comment
2206164 [Shivaram Venkataraman] Delete temporary files after they are read This change deletes temporary files used for communication between Rscript and the JVM once they have been completely read.
5881da7 [Zongheng Yang] Merge pull request #67 from shivaram/improve-shuffle
81251e2 [Shivaram Venkataraman] Address code review comments
a5f573f [Shivaram Venkataraman] Use a better list append in shuffles This is helpful in scenarios where we have a large number of values in a bucket
388e64d [Shivaram Venkataraman] Merge pull request #55 from RevolutionAnalytics/master
e1f95b6 [Zongheng Yang] Merge pull request #65 from concretevitamin/parallelize-fix
fc1a71a [Zongheng Yang] Fix that collect(parallelize(sc,1:72,15)) drops elements.
b8204c5 [Zongheng Yang] Minor: update a URL in README.
86f30c3 [Antonio Piccolboni] better fix for amplab-extras/SparkR-pkg#53
b3c318d [Antonio Piccolboni] delayed loading to have all namespaces available.
f323e97 [Antonio Piccolboni] tentative fix for amplab-extras/SparkR-pkg#53
6f82269 [Zongheng Yang] Merge pull request #48 from shivaram/master
8f433e5 [Shivaram Venkataraman] Move up Hadoop in pom.xml and add back protobufs As Hadoop 1.0.4 doesn't use protobufs, we can't exclude protobufs from Spark always. This change tries to order the dependencies so that the shader first picks up Hadoop's protobufs over Mesos.
bfe7e26 [Shivaram Venkataraman] Merge pull request #36 from RevolutionAnalytics/vectorize-examples
059ae41 [Antonio Piccolboni] and more formatting
9dbd531 [Antonio Piccolboni] more formatting per committer request
948738a [Antonio Piccolboni] converted tabs to spaces per project request
49f5f5a [Shivaram Venkataraman] Merge pull request #35 from shivaram/master
3eb5ad3 [Shivaram Venkataraman] on_failure -> after_failure in travis.yml
139bdee [Shivaram Venkataraman] Cache sbt, maven, ivy dependencies
4ebced2 [Shivaram Venkataraman] Merge pull request #34 from shivaram/master
8437061 [Shivaram Venkataraman] Exclude protobuf from Spark dependency in Maven This avoids pulling in multiple versions of protobuf from Mesos and Hadoop.
91aa527 [Antonio Piccolboni] vectorized version, 36s 10 slices 10^6 per slice. The older version takes 30 sec on 1/10th of data.
f137a57 [Antonio Piccolboni] for rstudio users
1f7ffb0 [Antonio Piccolboni] implemented using matrices and vectorized calls wherever possible
46b23df [Antonio Piccolboni] replace require with library
b15d7db [Antonio Piccolboni] faster parsing
8b7aeb3 [Antonio Piccolboni] 22x speed improvement, 3X mem impovement
c5bce07 [Zongheng Yang] Merge pull request #30 from shivaram/string-tests
21fa2d8 [Shivaram Venkataraman] Fix bug where serialized was not changed for RRRD Reason: When an RRDD is created in getJRDD we have converted any possibly unserialized RDD to a serialized RDD.
9d1ea20 [Shivaram Venkataraman] Merge branch 'master' of github.com:amplab/SparkR-pkg into string-tests
7b9348c [Shivaram Venkataraman] Add tests for partition with string keys Add two tests one with a string array and one from a textFile to test both codepaths
aacd726 [Shivaram Venkataraman] Update README with maven proxy instructions
803e62c [Shivaram Venkataraman] Merge pull request #28 from concretevitamin/master
7c093e6 [Zongheng Yang] Use inherits() to test an object's class.
061c591 [Shivaram Venkataraman] Merge pull request #26 from hafen/master
90f9fda [Ryan Hafen] Fix isRdd() to properly check for class
5b10cc7 [Zongheng Yang] Merge pull request #24 from shivaram/master
7014f83 [Shivaram Venkataraman] Remove unused transformers in maven's pom.xml
b00cea5 [Shivaram Venkataraman] Add support for a Maven build
11ec9b2 [Shivaram Venkataraman] Merge pull request #12 from concretevitamin/pipelined
6b18a90 [Zongheng Yang] Merge branch 'master' into pipelined
57127b8 [Zongheng Yang] Merge pull request #23 from shivaram/master
1ac3940 [Zongheng Yang] Review feedback.
a06fb34 [Zongheng Yang] Remove outdated comment.
0a1fc13 [Shivaram Venkataraman] Fixes for using SparkR with Hadoop2. 1. Exclude ASM, Netty from Hadoop similar to Spark. 2. Concat services files to ensure HDFS filesystems work. 3. Update README with an example
9a1db44 [Zongheng Yang] Merge pull request #22 from shivaram/master
e462448 [Shivaram Venkataraman] Use `$` for calling `put` instead of .jrcall
ed4559a [Shivaram Venkataraman] Add support for passing Spark environment vars This change creates a new `createSparkContext` method in RRDD as we can't pass Map<String, String> through rJava. Also use SPARK_MEM in local mode to increase heap size and update the README with some examples.
10228fb [Shivaram Venkataraman] Merge pull request #20 from concretevitamin/digit-ex
1398d9f [Zongheng Yang] Add linear_solver_mnist to examples/.
d484c2a [Zongheng Yang] Add tests for actions on PipelinedRDD.
d9cb95c [Zongheng Yang] Add setCheckpointDir() to context.R; comment fix.
f8bc8a9 [Zongheng Yang] Minor edits per Shivaram's comments.
8cd67f7 [Shivaram Venkataraman] Merge pull request #15 from shivaram/master
d4468a9 [Shivaram Venkataraman] Remove trailing comma
e2714b8 [Shivaram Venkataraman] Remove Apache Staging repo and update README
334eace [Zongheng Yang] Add a multi-transformation test to benchmark on pipelining.
5650ad7 [Zongheng Yang] Put serialized field inside env for both RDD and PipelinedRDD.
0b9e8bb [Zongheng Yang] First cut at PipelinedRDD.
a4c431e [Zongheng Yang] Add `isCheckpointed` field and checkpoint().
dac0795 [Zongheng Yang] Minor inline comment style fix.
bfb8e26 [Zongheng Yang] Add isCached field (inside an env) and unpersist().
295bff6 [Zongheng Yang] Merge pull request #11 from shivaram/master
4cb209c [Shivaram Venkataraman] Search rLibDir in worker before libPaths This ensures we pick up the SparkR intended and not an older version installed on the same machine
ef198ff [Zongheng Yang] Merge pull request #10 from shivaram/unit-tests
e0557a8 [Shivaram Venkataraman] Update travis to install plyr
8b18bc1 [Shivaram Venkataraman] Merge branch 'master' of github.com:amplab/SparkR-pkg into unit-tests
4a9ca31 [Shivaram Venkataraman] Use smaller broadcast and plyr instead of Matrix Matrix package takes around 2s to load and slows down unit tests.
21c6a61 [Zongheng Yang] Merge pull request #8 from shivaram/master
08c2947 [Shivaram Venkataraman] Move dev install directory to front of libPaths
bda42ee [Shivaram Venkataraman] Merge pull request #7 from JoshRosen/travis
cc5f5c0 [Josh Rosen] Add Travis CI integration (using craigcitro/r-travis)
b6c864b [Shivaram Venkataraman] Merge pull request #6 from concretevitamin/env-style-fix
4fcef22 [Zongheng Yang] Use one style ($) for accessing names in environments.
8a948c6 [Shivaram Venkataraman] Merge pull request #4 from shivaram/master
24978eb [Shivaram Venkataraman] Update README to use install_github
8899db4 [Shivaram Venkataraman] Update TODO.md
91792de [Shivaram Venkataraman] Update Spark requirements
f34f4bf [Shivaram Venkataraman] Check tests for failures and output error msg
cd750d3 [Shivaram Venkataraman] Update run-tests to use new path
1877b7c [Shivaram Venkataraman] Unset R_TESTS to make tests work with R CMD check Also silence Akka remoting logs and update Makefile to build on log4j changes
e60e18a [Shivaram Venkataraman] Update README to remove Spark installation notes
4450189 [Shivaram Venkataraman] Add Spark 0.9 dependency from Apache Staging Also clean up assembly jar from inst on make clean
5eb2131 [Shivaram Venkataraman] Update repo path in README
ec8210e [Shivaram Venkataraman] Remove broadcastId hack as it is public in Spark
9f0e080 [Shivaram Venkataraman] Merge branch 'install-github'
5c88fbd [Shivaram Venkataraman] Add helper script to run tests
77450a1 [Shivaram Venkataraman] Remove dependency on Spark Logging
6cb00d1 [Shivaram Venkataraman] Update README and add helper script install-dev.sh
28346ca [Shivaram Venkataraman] Only normalize if SPARK_HOME is not empty
0fd6571 [Shivaram Venkataraman] Normalize SPARK_HOME before passing it
ff96d5c [Shivaram Venkataraman] Pass in SPARK_HOME and jar file path
34c4dce [Shivaram Venkataraman] Move src into pkg and update Makefile This enables the package to be installed using install_github using devtools and automates the build procedure.
b25afed [Shivaram Venkataraman] Change package name to edu.berkeley.cs.amplab
c691464 [Shivaram Venkataraman] Add Apache 2.0 License file
27a4a4b [Shivaram Venkataraman] Add notes on how to compile roxygen2 docs
ca63844 [Shivaram Venkataraman] Add broadcast documentation Also generate documentation for sample, takeSample etc.
e4dd976 [Shivaram Venkataraman] Update TODO.md
e42d435 [Shivaram Venkataraman] Add support for broadcast variables
6b638e7 [Shivaram Venkataraman] Add the assembly jar to SparkContext
bf24e32 [Shivaram Venkataraman] Merge branch 'master' of github.com:amplab/SparkR-pkg
43c05ce [Zongheng Yang] Fix a flaky/incorrect test for sampleRDD().
c6a9dfc [Zongheng Yang] Initial port of the kmeans example.
6885581 [Zongheng Yang] Implement element-level sampleRDD() and takeSample() with tests.
d3a4987 [Zongheng Yang] Add a test for lapplyPartitionsWithIndex on pairwise RDD.
c7899c1 [Zongheng Yang] Add lapplyPartitionsWithIndex, with a test and an alias function.
a9a7436 [Shivaram Venkataraman] Add DFC example from Tselil, Benjamin and Jonah
fbc5a95 [Zongheng Yang] Implement take() and takeSample().
c4a3409 [Shivaram Venkataraman] Use RDD instead of RRDD
dfad3f5 [Zongheng Yang] Add test_utils.R: a unit test for convertJListToRList().
a45227d [Zongheng Yang] Update .gitignore.
238fe6e [Zongheng Yang] Add a unit test for textFile().
a88898b [Zongheng Yang] Rename test_rrd to test_rrdd
10c8baa [Shivaram Venkataraman] Make SparkR work as a standalone package. Changes include: 1. Adding a new `sbt` project that builds RRDD.scala 2. Change the onLoad functions to load the assembly jar for SparkR 3. Set rLibDir in RRDD.scala and worker.R to load things correctly
78adcd8 [Shivaram Venkataraman] Add a gitignore
ca6108f [Shivaram Venkataraman] Merge branch 'SparkR-scalacode' of ../SparkR
999bd61 [Shivaram Venkataraman] Update collectPartition in R and use ClassTag
c58f63e [Shivaram Venkataraman] Update collectPartition in R and use ClassTag
48265fd [Shivaram Venkataraman] Use new version of collectPartitions in take
d4fe086 [Shivaram Venkataraman] Move collectPartitions to JavaRDDLike Also remove numPartitions in JavaRDD and update R code
bfecd7b [Shivaram Venkataraman] Scala 2.10 changes 1. Update sparkR script 2. Use classTag instead of classManifest
092a4b3 [Shivaram Venkataraman] Add combineByKey, update TODO
ac0d81d [Shivaram Venkataraman] Add more documentation
d1dc3fa [Shivaram Venkataraman] Add more documentation
c515e3a [Shivaram Venkataraman] Update TODO
db56a34 [Shivaram Venkataraman] Add a test case for include package
41cea51 [Shivaram Venkataraman] Ensure all parent environments are serialized. Also add a test case with an inline function
a978e84 [Shivaram Venkataraman] Add support to include packages in the worker
12bf8ce [Shivaram Venkataraman] Add support to include packages in the worker
fb7e72c [Shivaram Venkataraman] Cleanup TODO
16ac314 [Shivaram Venkataraman] Add documentation for functions in context, sparkR
85b1d25 [Shivaram Venkataraman] Set license to Apache
88f1101 [Shivaram Venkataraman] Add unit test running instructions
c40768e [Shivaram Venkataraman] Update TODO
0c7efbf [Shivaram Venkataraman] Refactor RRDD.scala and add comments to functions
5880d42 [Shivaram Venkataraman] Refactor RRDD.scala and add comments to functions
2dee36c [Shivaram Venkataraman] Remove empty test file
a82219b [Shivaram Venkataraman] Update TODOs
5db00dc [Shivaram Venkataraman] Add reduceByKey, groupByKey and refactor shuffle Other changes include 1. Adding unit tests for basic RDD functions and shuffle 2. Add a word count example 3. Change the dependency serialization to handle double loading of SparkR    package 4. Allow partitionBy to operate on any RDDs to create pair-wise RDD.
f196479 [Shivaram Venkataraman] Add reduceByKey, groupByKey and refactor shuffle Other changes include 1. Adding unit tests for basic RDD functions and shuffle 2. Add a word count example 3. Change the dependency serialization to handle double loading of SparkR    package 4. Allow partitionBy to operate on any RDDs to create pair-wise RDD.
987e36f [Shivaram Venkataraman] Add perf todo
0b03265 [Shivaram Venkataraman] Update TODO with testing, docs todo
685aaad [Zongheng Yang] First cut at refactoring worker.R. Remove pairwiseWorker.R.
95b9ddc [Zongheng Yang] First cut at refactoring worker.R. Remove pairwiseWorker.R.
4f00895 [Zongheng Yang] Remove the unnecessary `pairwise' flag in RRDD class. Reasons:
75d36d9 [Zongheng Yang] Working versions: partitionBy() and collectPartition() for RRDD.
e3fbd9d [Zongheng Yang] Working versions: partitionBy() and collectPartition() for RRDD.
67a4335 [Zongheng Yang] Add unit test for parallelize() and collect() pairwise data.
100ae65 [Zongheng Yang] Properly parallelize() and collect() pairwise data.
cd0a5e2 [Zongheng Yang] Properly parallelize() and collect() pairwise data.
aea16c3 [Zongheng Yang] WIP: second cut at partitionBy. Running into R/Scala communication issues.
45eb943 [Zongheng Yang] WIP: second cut at partitionBy. Running into R/Scala communication issues.
11c893b [Zongheng Yang] WIP: need to figure out the logic of (whether or not) shipping a hash func
82c201a [Zongheng Yang] WIP: need to figure out the logic of (whether or not) shipping a hash func
b3bfad2 [Zongheng Yang] Update TODO: take() done.
0e45293 [Zongheng Yang] Add ability to parallelize key-val collections in R.
f60406a [Zongheng Yang] Add ability to parallelize key-val collections in R.
7d7fe3b [Zongheng Yang] Re-implement take(): take a partition at a time and append.
a054e55 [Zongheng Yang] Fix take() tests(): mode difference.
9de0935 [Zongheng Yang] Implement take() for RRDD.
1e4427e [Zongheng Yang] Implement take() for RRDD.
ec3cd67 [Shivaram Venkataraman] Use temp file in Spark to pipe output
417aaed [Shivaram Venkataraman] Use temp file in Spark to pipe output
bb0a3c3 [Shivaram Venkataraman] Add conf directory to classpath
9594d8a [Shivaram Venkataraman] Clean up LR example
3b26b58 [Shivaram Venkataraman] Add a list of things to do.
cabce68 [Shivaram Venkataraman] Fix warnings from package check
fde3f9c [Shivaram Venkataraman] Flatten by default and disable recursive unlist
ab2e061 [Shivaram Venkataraman] Create LIB_DIR before installing SparkR package
555220a [Shivaram Venkataraman] Add readme and update Makefile
1319cda [Shivaram Venkataraman] Make standalone programs run with sparkR
ae19fa8 [Shivaram Venkataraman] Add support for cache and use `tempfile`
4e89ca4 [Shivaram Venkataraman] Add support for apply, reduce, count Also serialize closures using `save` and add two examples
25a0bea [Shivaram Venkataraman] Add support for apply, reduce, count Also serialize closures using `save` and add two examples
f50223f [Zongheng Yang] Make parallelize() and collect() use lists. Add a few more tests for them.
fc7693f [Zongheng Yang] Refactor and enhance the previously added unit test a little bit.
6de9b81 [Zongheng Yang] Add a simple unit test for parallelize().
8b95155 [Zongheng Yang] Add testthat skeleton infrastructure
ef305bf [Zongheng Yang] parallelize() followed by collect() now work for vectors/lists of strings and numerics (should work for other primitives as well).
dc16af4 [Zongheng Yang] Comment: toArray() allocates memory for a copy
f50121e [Zongheng Yang] Make parallelize() return JavaRDD[Array[Byte]]. Add RRDD.scala with a helper function in the singleton object.
46eb063 [Zongheng Yang] Make parallelize() return JavaRDD[Array[Byte]]. Add RRDD.scala with a helper function in the singleton object.
6b4938a [Zongheng Yang] parallelize(): a raw can be parallelized by JavaSparkContext and get back JavaRDD
978aa0f [Zongheng Yang] Add parallelize() skeleton: only return serialized slices now
84c1fd2 [Zongheng Yang] Use .jsimplify() to get around generic List's get() type erasure problem
f16b891 [Zongheng Yang] Convert a few reflectionc alls to .jcall
1284c13 [Zongheng Yang] WIP on collect(): JavaListToRList() failed with errors.
4c2e516 [Zongheng Yang] Add simple prototype of S4 class RRDD. Make TextFile() returns an RRDD.
82aa17a [Zongheng Yang] Add textFile()
83ce63f [Zongheng Yang] Create a JavaSparkContext and save it in .sparkEnv using sparkR.init()
01cdf0e [Zongheng Yang] Add Makefile for SparkR
fc9cae2 [Shivaram Venkataraman] Add skeleton R package
2015-04-08 22:45:40 -07:00
Davies Liu 6ada4f6f52 [SPARK-6781] [SQL] use sqlContext in python shell
Use `sqlContext` in PySpark shell, make it consistent with SQL programming guide. `sqlCtx` is also kept for compatibility.

Author: Davies Liu <davies@databricks.com>

Closes #5425 from davies/sqlCtx and squashes the following commits:

af67340 [Davies Liu] sqlCtx -> sqlContext
15a278f [Davies Liu] use sqlContext in python shell
2015-04-08 13:31:45 -07:00
WangTaoTheTonic b65bad65c3 [SPARK-3591][YARN]fire and forget for YARN cluster mode
https://issues.apache.org/jira/browse/SPARK-3591

The output after this patch:
>doggie153:/opt/oss/spark-1.3.0-bin-hadoop2.4/bin # ./spark-submit  --class org.apache.spark.examples.SparkPi --master yarn-cluster ../lib/spark-examples*.jar
15/03/31 21:15:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/31 21:15:25 INFO RMProxy: Connecting to ResourceManager at doggie153/10.177.112.153:8032
15/03/31 21:15:25 INFO Client: Requesting a new application from cluster with 4 NodeManagers
15/03/31 21:15:25 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/03/31 21:15:25 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/03/31 21:15:25 INFO Client: Setting up container launch context for our AM
15/03/31 21:15:25 INFO Client: Preparing resources for our AM container
15/03/31 21:15:26 INFO Client: Uploading resource file:/opt/oss/spark-1.3.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-SNAPSHOT-hadoop2.4.1.jar -> hdfs://doggie153:9000/user/root/.sparkStaging/application_1427257505534_0016/spark-assembly-1.4.0-SNAPSHOT-hadoop2.4.1.jar
15/03/31 21:15:27 INFO Client: Uploading resource file:/opt/oss/spark-1.3.0-bin-hadoop2.4/lib/spark-examples-1.3.0-hadoop2.4.0.jar -> hdfs://doggie153:9000/user/root/.sparkStaging/application_1427257505534_0016/spark-examples-1.3.0-hadoop2.4.0.jar
15/03/31 21:15:28 INFO Client: Setting up the launch environment for our AM container
15/03/31 21:15:28 INFO SecurityManager: Changing view acls to: root
15/03/31 21:15:28 INFO SecurityManager: Changing modify acls to: root
15/03/31 21:15:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/03/31 21:15:28 INFO Client: Submitting application 16 to ResourceManager
15/03/31 21:15:28 INFO YarnClientImpl: Submitted application application_1427257505534_0016
15/03/31 21:15:28 INFO Client: ... waiting before polling ResourceManager for application state
15/03/31 21:15:33 INFO Client: ... polling ResourceManager for application state
15/03/31 21:15:33 INFO Client: Application report for application_1427257505534_0016 (state: RUNNING)
15/03/31 21:15:33 INFO Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: doggie157
         ApplicationMaster RPC port: 0
         queue: default
         start time: 1427807728307
         final status: UNDEFINED
         tracking URL: http://doggie153:8088/proxy/application_1427257505534_0016/
         user: root

/cc andrewor14

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #5297 from WangTaoTheTonic/SPARK-3591 and squashes the following commits:

c76d232 [WangTaoTheTonic] wrap lines
16c90a8 [WangTaoTheTonic] move up lines to avoid duplicate
fea390d [WangTaoTheTonic] log failed/killed report, style and comment
be1cc2e [WangTaoTheTonic] reword
f0bc54f [WangTaoTheTonic] minor: expose appid in excepiton messages
ba9b22b [WangTaoTheTonic] wrong config name
e1a4013 [WangTaoTheTonic] revert to the old version and do some robust
19706c0 [WangTaoTheTonic] add a config to control whether to forget
0cbdce8 [WangTaoTheTonic] fire and forget for YARN cluster mode
2015-04-07 08:36:25 -05:00
jayson 0358b08db8 SPARK-6626 [DOCS]: Corrected Scala:TwitterUtils parameters
Per Sean Owen's request, here is the update call for TwitterUtils using Scala :)

Author: jayson <jayson@ziprecruiter.com>

Closes #5295 from JaysonSunshine/master and squashes the following commits:

df1d056 [jayson] Corrected Scala:TwitterUtils parameters
2015-04-01 11:12:55 +01:00
leahmcguire d01a6d8c33 [SPARK-4894][mllib] Added Bernoulli option to NaiveBayes model in mllib
Added optional model type parameter for  NaiveBayes training. Can be either Multinomial or Bernoulli.

When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction as per: http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html.

 Default for model is original Multinomial fit and predict.

Added additional testing for Bernoulli and Multinomial models.

Author: leahmcguire <lmcguire@salesforce.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Leah McGuire <lmcguire@salesforce.com>

Closes #4087 from leahmcguire/master and squashes the following commits:

f3c8994 [leahmcguire] changed checks on model type to requires
acb69af [leahmcguire] removed enum type and replaces all modelType parameters with strings
2224b15 [Leah McGuire] Merge pull request #2 from jkbradley/leahmcguire-master
9ad89ca [Joseph K. Bradley] removed old code
6a8f383 [Joseph K. Bradley] Added new model save/load format 2.0 for NaiveBayesModel after modelType parameter was added.  Updated tests.  Also updated ModelType enum-like type.
852a727 [leahmcguire] merged with upstream master
a22d670 [leahmcguire] changed NaiveBayesModel modelType parameter back to NaiveBayes.ModelType, made NaiveBayes.ModelType serializable, fixed getter method in NavieBayes
18f3219 [leahmcguire] removed private from naive bayes constructor for lambda only
bea62af [leahmcguire] put back in constructor for NaiveBayes
01baad7 [leahmcguire] made fixes from code review
fb0a5c7 [leahmcguire] removed typo
e2d925e [leahmcguire] fixed nonserializable error that was causing naivebayes test failures
2d0c1ba [leahmcguire] fixed typo in NaiveBayes
c298e78 [leahmcguire] fixed scala style errors
b85b0c9 [leahmcguire] Merge remote-tracking branch 'upstream/master'
900b586 [leahmcguire] fixed model call so that uses type argument
ea09b28 [leahmcguire] Merge remote-tracking branch 'upstream/master'
e016569 [leahmcguire] updated test suite with model type fix
85f298f [leahmcguire] Merge remote-tracking branch 'upstream/master'
dc65374 [leahmcguire] integrated model type fix
7622b0c [leahmcguire] added comments and fixed style as per rb
b93aaf6 [Leah McGuire] Merge pull request #1 from jkbradley/nb-model-type
3730572 [Joseph K. Bradley] modified NB model type to be more Java-friendly
b61b5e2 [leahmcguire] added back compatable constructor to NaiveBayesModel to fix MIMA test failure
5a4a534 [leahmcguire] fixed scala style error in NaiveBayes
3891bf2 [leahmcguire] synced with apache spark and resolved merge conflict
d9477ed [leahmcguire] removed old inaccurate comment from test suite for mllib naive bayes
76e5b0f [leahmcguire] removed unnecessary sort from test
0313c0c [leahmcguire] fixed style error in NaiveBayes.scala
4a3676d [leahmcguire] Updated changes re-comments. Got rid of verbose populateMatrix method. Public api now has string instead of enumeration. Docs are updated."
ce73c63 [leahmcguire] added Bernoulli option to niave bayes model in mllib, added optional model type parameter for training. When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html
2015-03-31 11:16:55 -07:00
Ilya Ganelin 4bdfb7bab3 [SPARK-5750][SPARK-3441][SPARK-5836][CORE] Added documentation explaining shuffle
I've updated the Spark Programming Guide to add a section on the shuffle operation providing some background on what it does. I've also addressed some of its performance impacts.

I've included documentation to address the following issues:
https://issues.apache.org/jira/browse/SPARK-5836
https://issues.apache.org/jira/browse/SPARK-3441
https://issues.apache.org/jira/browse/SPARK-5750

https://issues.apache.org/jira/browse/SPARK-4227 is related but can be addressed in a separate PR since it involves updates to the Spark Configuration Guide.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
Author: Ilya Ganelin <ilganeli@gmail.com>

Closes #5074 from ilganeli/SPARK-5750 and squashes the following commits:

6178e24 [Ilya Ganelin] Update programming-guide.md
7a0b96f [Ilya Ganelin] Update programming-guide.md
2c5df08 [Ilya Ganelin] Merge branch 'SPARK-5750' of github.com:ilganeli/spark into SPARK-5750
dffbd2d [Ilya Ganelin] [SPARK-5750] Slight wording update
1ff4eb4 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5750
85f9c6e [Ilya Ganelin] Update programming-guide.md
349d1fa [Ilya Ganelin] Added cross linkf or configuration page
eeb5a7a [Ilya Ganelin] [SPARK-5750] Added some minor fixes
dd5cc9d [Ilya Ganelin] [SPARK-5750] Fixed some factual inaccuracies with regards to shuffle internals.
a8adb57 [Ilya Ganelin] [SPARK-5750] Incoporated feedback from Sean Owen
9954bbe [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5750
159dd1c [Ilya Ganelin] [SPARK-5750] Style fixes from rxin.
75ef67b [Ilya Ganelin] [SPARK-5750][SPARK-3441][SPARK-5836] Added documentation explaining the shuffle operation and included errata from a number of other JIRAs
2015-03-30 11:54:01 +01:00
CodingCat de6733036e [SPARK-6596] fix the instruction on building scaladoc
In README.md under docs/ directory, it says that

> You can build just the Spark scaladoc by running build/sbt doc from the SPARK_PROJECT_ROOT directory.

I guess the right approach is build/sbt unidoc

Author: CodingCat <zhunansjtu@gmail.com>

Closes #5253 from CodingCat/SPARK-6596 and squashes the following commits:

af379ed [CodingCat] fix the instruction on building scaladoc
2015-03-30 11:41:43 +01:00
WangTaoTheTonic 99631438c0 [SPARK-6552][Deploy][Doc]expose start-slave.sh to user and update outdated doc
https://issues.apache.org/jira/browse/SPARK-6552

/cc srowen

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #5205 from WangTaoTheTonic/SPARK-6552 and squashes the following commits:

b02263c [WangTaoTheTonic] use less than rather than less equal
f0fa408 [WangTaoTheTonic] expose start-slave.sh
2015-03-28 12:32:35 +00:00
Brennon York 39fb579683 [SPARK-6510][GraphX]: Add Graph#minus method to act as Set#difference
Adds a `Graph#minus` method which will return only unique `VertexId`'s from the calling `VertexRDD`.

To demonstrate a basic example with pseudocode:

```
Set((0L,0),(1L,1)).minus(Set((1L,1),(2L,2)))
> Set((0L,0))
```

Author: Brennon York <brennon.york@capitalone.com>

Closes #5175 from brennonyork/SPARK-6510 and squashes the following commits:

248d5c8 [Brennon York] added minus(VertexRDD[VD]) method to avoid createUsingIndex and updated the mask operations to simplify with andNot call
3fb7cce [Brennon York] updated graphx doc to reflect the addition of minus method
6575d92 [Brennon York] updated mima exclude
aaa030b [Brennon York] completed graph#minus functionality
7227c0f [Brennon York] beginning work on minus functionality
2015-03-26 19:08:09 -07:00
Michael Armbrust aad0032276 [DOCS][SQL] Fix JDBC example
Author: Michael Armbrust <michael@databricks.com>

Closes #5192 from marmbrus/fixJDBCDocs and squashes the following commits:

b48a33d [Michael Armbrust] [DOCS][SQL] Fix JDBC example
2015-03-26 14:51:46 -07:00
MechCoder 4fc4d0369e [SPARK-5987] [MLlib] Save/load for GaussianMixtureModels
Should be self explanatory.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4986 from MechCoder/spark-5987 and squashes the following commits:

7d2cd56 [MechCoder] Iterate over dataframe in a better way
e7a14cb [MechCoder] Minor
33c84f9 [MechCoder] Store as Array[Data] instead of Data[Array]
505bd57 [MechCoder] Rebased over master and used MatrixUDT
7422bb4 [MechCoder] Store sigmas as Array[Double] instead of Array[Array[Double]]
b9794e4 [MechCoder] Minor
cb77095 [MechCoder] [SPARK-5987] Save/load for GaussianMixtureModels
2015-03-25 14:45:23 -07:00
Bill Chambers c5cc41468e [DOCUMENTATION]Fixed Missing Type Import in Documentation
Needed to import the types specifically, not the more general pyspark.sql

Author: Bill Chambers <wchambers@ischool.berkeley.edu>
Author: anabranch <wac.chambers@gmail.com>

Closes #5179 from anabranch/master and squashes the following commits:

8fa67bf [anabranch] Corrected SqlContext Import
603b080 [Bill Chambers] [DOCUMENTATION]Fixed Missing Type Import in Documentation
2015-03-24 22:24:35 -07:00
Christophe Préaud 05c2214b41 [SPARK-6469] Improving documentation on YARN local directories usage
Clarify the local directories usage in YARN

Author: Christophe Préaud <christophe.preaud@kelkoo.com>

Closes #5165 from preaudc/yarn-doc-local-dirs and squashes the following commits:

6912b90 [Christophe Préaud] Fix some formatting issues.
4fa8ec2 [Christophe Préaud] Merge remote-tracking branch 'upstream/master' into yarn-doc-local-dirs
eaaf519 [Christophe Préaud] Clarify the local directories usage in YARN
436fb7d [Christophe Préaud] Revert "Clarify the local directories usage in YARN"
876ae5e [Christophe Préaud] Clarify the local directories usage in YARN
608dbfa [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
a49a2ce [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
9ba89ca [Christophe Préaud] Ensure that files are fetched atomically
54419ae [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
c6a5590 [Christophe Préaud] Revert commit 8ea871f8130b2490f1bad7374a819bf56f0ccbbd
7456a33 [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
8ea871f [Christophe Préaud] Ensure that files are fetched atomically
2015-03-24 17:05:49 -07:00
Peter Rudenko 08d4528011 [ML][docs][minor] Define LabeledDocument/Document classes in CV example
To easier copy/paste Cross-Validation example code snippet need to define LabeledDocument/Document in it, since they difined in a previous example.

Author: Peter Rudenko <petro.rudenko@gmail.com>

Closes #5135 from petro-rudenko/patch-3 and squashes the following commits:

5190c75 [Peter Rudenko] Fix primitive types for java examples.
1d35383 [Peter Rudenko] [SQL][docs][minor] Define LabeledDocument/Document classes in CV example
2015-03-24 16:33:38 +00:00
Cong Yue c12312f8b1 Update the command to use IPython notebook
As for "notebook --pylab inline" is not supported any more, update the related documentation for this.

Author: Cong Yue <yuecong1104@gmail.com>

Closes #5111 from yuecong/patch-1 and squashes the following commits:

872df76 [Cong Yue] Update the command to use IPython notebook
2015-03-24 12:58:58 +00:00
vinodkc 2bf40c58e6 [SPARK-6337][Documentation, SQL]Spark 1.3 doc fixes
Author: vinodkc <vinod.kc.in@gmail.com>

Closes #5112 from vinodkc/spark_1.3_doc_fixes and squashes the following commits:

2c6aee6 [vinodkc] Spark 1.3 doc fixes
2015-03-22 20:00:08 +00:00
Kamil Smuga 6ef48632fb SPARK-6454 [DOCS] Fix links to pyspark api
Author: Kamil Smuga <smugakamil@gmail.com>
Author: stderr <smugakamil@gmail.com>

Closes #5120 from kamilsmuga/master and squashes the following commits:

fee3281 [Kamil Smuga] more python api links fixed for docs
13240cb [Kamil Smuga] resolved merge conflicts with upstream/master
6649b3b [Kamil Smuga] fix broken docs links to Python API
92f03d7 [stderr] Fix links to pyspark api
2015-03-22 15:56:25 +00:00
MechCoder 25e271d9fb [SPARK-6025] [MLlib] Add helper method evaluateEachIteration to extract learning curve
Added evaluateEachIteration to allow the user to manually extract the error for each iteration of GradientBoosting. The internal optimisation can be dealt with later.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4906 from MechCoder/spark-6025 and squashes the following commits:

67146ab [MechCoder] Minor
352001f [MechCoder] Minor
6e8aa10 [MechCoder] Made the following changes Used mapPartition instead of map Refactored computeError and unpersisted broadcast variables
bc99ac6 [MechCoder] Refactor the method and stuff
dbda033 [MechCoder] [SPARK-6025] Add helper method evaluateEachIteration to extract learning curve
2015-03-20 17:14:09 -07:00
Jongyoul Lee 49a01c7ea2 [SPARK-6423][Mesos] MemoryUtils should use memoryOverhead if it's set
- Fixed calculateTotalMemory to use spark.mesos.executor.memoryOverhead
- Added testCase

Author: Jongyoul Lee <jongyoul@gmail.com>

Closes #5099 from jongyoul/SPARK-6423 and squashes the following commits:

6747fce [Jongyoul Lee] [SPARK-6423][Mesos] MemoryUtils should use memoryOverhead if it's set - Changed a description of spark.mesos.executor.memoryOverhead
475a7c8 [Jongyoul Lee] [SPARK-6423][Mesos] MemoryUtils should use memoryOverhead if it's set - Fit the import rules
453c5a2 [Jongyoul Lee] [SPARK-6423][Mesos] MemoryUtils should use memoryOverhead if it's set - Fixed calculateTotalMemory to use spark.mesos.executor.memoryOverhead - Added testCase
2015-03-20 19:14:35 +00:00
Marcelo Vanzin a74564591f [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5056 from vanzin/SPARK-6371 and squashes the following commits:

63220df [Marcelo Vanzin] Merge branch 'master' into SPARK-6371
6506f75 [Marcelo Vanzin] Use more fine-grained exclusion.
178ba71 [Marcelo Vanzin] Oops.
75b2375 [Marcelo Vanzin] Exclude VertexRDD in MiMA.
a45a62c [Marcelo Vanzin] Work around MIMA warning.
1d8a670 [Marcelo Vanzin] Re-group jetty exclusion.
0e8e909 [Marcelo Vanzin] Ignore ml, don't ignore graphx.
cef4603 [Marcelo Vanzin] Indentation.
296cf82 [Marcelo Vanzin] [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT.
2015-03-20 18:43:57 +00:00
WangTaoTheTonic 385b2ff10d [SPARK-6426][Doc]User could also point the yarn cluster config directory via YARN_CONF_DI...
...R

https://issues.apache.org/jira/browse/SPARK-6426

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #5103 from WangTaoTheTonic/SPARK-6426 and squashes the following commits:

e6dd78d [WangTaoTheTonic] User could also point the yarn cluster config directory via YARN_CONF_DIR
2015-03-20 18:42:18 +00:00
Pierre Borckmans 797f8a0007 [SPARK-6402][DOC] - Remove some refererences to shark in docs and ec2
EC2 script and job scheduling documentation still refered to Shark.
I removed these references.

I also removed a remaining `SHARK_VERSION` variable from `ec2-variables.sh`.

Author: Pierre Borckmans <pierre.borckmans@realimpactanalytics.com>

Closes #5083 from pierre-borckmans/remove_refererences_to_shark_in_docs and squashes the following commits:

4e90ffc [Pierre Borckmans] Removed deprecated SHARK_VERSION
caea407 [Pierre Borckmans] Remove shark reference from ec2 script doc
196c744 [Pierre Borckmans] Removed references to Shark
2015-03-19 08:02:06 -04:00
Jongyoul Lee 3db1387425 SPARK-6085 Part. 2 Increase default value for memory overhead
- fixed a description of spark.mesos.executor.memoryOverhead from 7% to 10%
- This is a second part of SPARK-6085

Author: Jongyoul Lee <jongyoul@gmail.com>

Closes #5065 from jongyoul/SPARK-6085-1 and squashes the following commits:

c5af84c [Jongyoul Lee] SPARK-6085 Part. 2 Increase default value for memory overhead - Changed "MiB" to "MB"
dbac1c0 [Jongyoul Lee] SPARK-6085 Part. 2 Increase default value for memory overhead - fixed a description of spark.mesos.executor.memoryOverhead from 7% to 10%
2015-03-18 20:54:22 -04:00
Tijo Thomas a012e08635 [SPARK-6383][SQL]Fixed compiler and errors in Dataframe examples
Author: Tijo Thomas <tijoparacka@gmail.com>

Closes #5068 from tijoparacka/fix_sql_dataframe_example and squashes the following commits:

6953ac1 [Tijo Thomas] Handled Java and Python example sections
0751a74 [Tijo Thomas] Fixed compiler and errors in Dataframe examples
2015-03-17 18:50:19 -07:00
lewuathe d9f3e01688 [SPARK-6336] LBFGS should document what convergenceTol means
LBFGS uses convergence tolerance. This value should be written in document as an argument.

Author: lewuathe <lewuathe@me.com>

Closes #5033 from Lewuathe/SPARK-6336 and squashes the following commits:

e738b33 [lewuathe] Modify text to be more natural
ac03c3a [lewuathe] Modify documentations
6ccb304 [lewuathe] [SPARK-6336] LBFGS should document what convergenceTol means
2015-03-17 12:11:57 -07:00
nemccarthy 4cca3917dc [SPARK-6313] Add config option to disable file locks/fetchFile cache to ...
...support NFS mounts.

This is a work around for now with the goal to find a more permanent solution.
https://issues.apache.org/jira/browse/SPARK-6313

Author: nemccarthy <nathan@nemccarthy.me>

Closes #5036 from nemccarthy/master and squashes the following commits:

2eaaf42 [nemccarthy] [SPARK-6313] Update config wording doc for spark.files.useFetchCache
5de7eb4 [nemccarthy] [SPARK-6313] Add config option to disable file locks/fetchFile cache to support NFS mounts
2015-03-17 09:33:11 -07:00
Theodore Vasiloudis e26db9be47 [docs] [SPARK-4820] Spark build encounters "File name too long" on some encrypted filesystems
Added a note instructing users how to build Spark in an encrypted file system.

Author: Theodore Vasiloudis <tvas@sics.se>

Closes #5041 from thvasilo/patch-2 and squashes the following commits:

09d890b [Theodore Vasiloudis] Workaroung for buiding in an encrypted filesystem
2015-03-17 11:26:11 +00:00
Jongyoul Lee aa6536fa3c [SPARK-3619] Part 2. Upgrade to Mesos 0.21 to work around MESOS-1688
- MESOS_NATIVE_LIBRARY become deprecated
- Chagned MESOS_NATIVE_LIBRARY to MESOS_NATIVE_JAVA_LIBRARY

Author: Jongyoul Lee <jongyoul@gmail.com>

Closes #4361 from jongyoul/SPARK-3619-1 and squashes the following commits:

f1ea91f [Jongyoul Lee] Merge branch 'SPARK-3619-1' of https://github.com/jongyoul/spark into SPARK-3619-1
a6a00c2 [Jongyoul Lee] [SPARK-3619] Upgrade to Mesos 0.21 to work around MESOS-1688 - Removed 'Known issues' section
2e15a21 [Jongyoul Lee] [SPARK-3619] Upgrade to Mesos 0.21 to work around MESOS-1688 - MESOS_NATIVE_LIBRARY become deprecated - Chagned MESOS_NATIVE_LIBRARY to MESOS_NATIVE_JAVA_LIBRARY
0dace7b [Jongyoul Lee] [SPARK-3619] Upgrade to Mesos 0.21 to work around MESOS-1688 - MESOS_NATIVE_LIBRARY become deprecated - Chagned MESOS_NATIVE_LIBRARY to MESOS_NATIVE_JAVA_LIBRARY
2015-03-15 15:46:55 +00:00
Brennon York 127268bc39 [SPARK-6329][Docs]: Minor doc changes for Mesos and TOC
Updated the configuration docs from the minor items that Reynold had left over from SPARK-1182; specifically I updated the `running-on-mesos` link to point directly to `running-on-mesos#configuration` and upgraded the `yarn`, `mesos`, etc. bullets to `<h5>` tags in hopes that they'll get pushed into the TOC.

Author: Brennon York <brennon.york@capitalone.com>

Closes #5022 from brennonyork/SPARK-6329 and squashes the following commits:

42a10a9 [Brennon York] minor doc fixes
2015-03-14 17:28:13 +00:00
Xiangrui Meng 7f13434a5c [SPARK-6278][MLLIB] Mention the change of objective in linear regression
As discussed in the RC3 vote thread, we should mention the change of objective in linear regression in the migration guide. srowen

Author: Xiangrui Meng <meng@databricks.com>

Closes #4978 from mengxr/SPARK-6278 and squashes the following commits:

fb3bbe6 [Xiangrui Meng] mention regularization parameter
bfd6cff [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-6278
375fd09 [Xiangrui Meng] address Sean's comments
f87ae71 [Xiangrui Meng] mention step size change
2015-03-13 10:27:28 -07:00
Cheng Lian 69ff8e8cfb [SPARK-5310] [SQL] [DOC] Parquet section for the SQL programming guide
Also fixed a bunch of minor styling issues.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5001)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #5001 from liancheng/parquet-doc and squashes the following commits:

89ad3db [Cheng Lian] Addresses @rxin's comments
7eb6955 [Cheng Lian] Docs for the new Parquet data source
415eefb [Cheng Lian] Some minor formatting improvements
2015-03-13 21:34:50 +08:00
zzcclp 304366c465 [SPARK-6275][Documentation]Miss toDF() function in docs/sql-programming-guide.md
Miss `toDF()` function in docs/sql-programming-guide.md

Author: zzcclp <xm_zzc@sina.com>

Closes #4977 from zzcclp/SPARK-6275 and squashes the following commits:

9a96c7b [zzcclp] Miss toDF()
2015-03-12 15:07:15 +00:00
Xiangrui Meng 0cba802adf [SPARK-5814][MLLIB][GRAPHX] Remove JBLAS from runtime
The issue is discussed in https://issues.apache.org/jira/browse/SPARK-5669. Replacing all JBLAS usage by netlib-java gives us a simpler dependency tree and less license issues to worry about. I didn't touch the test scope in this PR. The user guide is not modified to avoid merge conflicts with branch-1.3. srowen ankurdave pwendell

Author: Xiangrui Meng <meng@databricks.com>

Closes #4699 from mengxr/SPARK-5814 and squashes the following commits:

48635c6 [Xiangrui Meng] move netlib-java version to parent pom
ca21c74 [Xiangrui Meng] remove jblas from ml-guide
5f7767a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5814
c5c4183 [Xiangrui Meng] merge master
0f20cad [Xiangrui Meng] add mima excludes
e53e9f4 [Xiangrui Meng] remove jblas from mllib runtime
ceaa14d [Xiangrui Meng] replace jblas by netlib-java in graphx
fa7c2ca [Xiangrui Meng] move jblas to test scope
2015-03-12 01:39:04 -07:00
Tathagata Das cd3b68d93a [SPARK-6128][Streaming][Documentation] Updates to Spark Streaming Programming Guide
Updates to the documentation are as follows:

- Added information on Kafka Direct API and Kafka Python API
- Added joins to the main streaming guide
- Improved details on the fault-tolerance semantics

Generated docs located here
http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#fault-tolerance-semantics

More things to add:
- Configuration for Kafka receive rate
- May be add concurrentJobs

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #4956 from tdas/streaming-guide-update-1.3 and squashes the following commits:

819408c [Tathagata Das] Minor fixes.
debe484 [Tathagata Das] Added DataFrames and MLlib
380cf8d [Tathagata Das] Fix link
04167a6 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-guide-update-1.3
0b77486 [Tathagata Das] Updates based on Josh's comments.
86c4c2a [Tathagata Das] Updated streaming guides
82de92a [Tathagata Das] Add Kafka to Python api docs
2015-03-11 18:48:21 -07:00
Sandy Ryza 2d87a415f2 SPARK-3642. Document the nuances of shared variables.
Author: Sandy Ryza <sandy@cloudera.com>

Closes #2490 from sryza/sandy-spark-3642 and squashes the following commits:

aae3340 [Sandy Ryza] SPARK-3642. Document the nuances of broadcast variables
2015-03-11 13:22:05 +00:00
Ilya Ganelin 548643a9e4 [SPARK-4423] Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior
Hi all - I've added a writeup on how closures work within Spark to help clarify the general case for this problem and similar problems. I hope this addresses the issue and would love any feedback.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #4696 from ilganeli/SPARK-4423 and squashes the following commits:

c5dc498 [Ilya Ganelin] Fixed typo
07b78e8 [Ilya Ganelin] Updated to fix capitalization
48c1983 [Ilya Ganelin] Updated to fix capitalization and clarify wording
2fd2a07 [Ilya Ganelin] Incoporated a few more minor fixes. Fixed a bug in python code. Added semicolons for java
4772f99 [Ilya Ganelin] Incorporated latest feedback
448bd79 [Ilya Ganelin] Updated some verbage and added section links
5dbbda5 [Ilya Ganelin] Improved some wording
d374d3a [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-4423
2600668 [Ilya Ganelin] Minor edits
c768ab2 [Ilya Ganelin] Updated documentation to add a section on closures. This helps understand confusing behavior of foreach and map functions when attempting to modify variables outside of the scope of an RDD action or transformation
2015-03-11 13:20:15 +00:00
Sean Owen 35b25640a4 [MINOR] [DOCS] Fix map -> mapToPair in Streaming Java example
Fix map -> mapToPair in Java example. (And zap some unneeded "throws Exception" while here)

Author: Sean Owen <sowen@cloudera.com>

Closes #4967 from srowen/MapToPairFix and squashes the following commits:

ded2bc0 [Sean Owen] Fix map -> mapToPair in Java example. (And zap some unneeded "throws Exception" while here)
2015-03-11 12:16:32 +00:00
Marcelo Vanzin 517975d89d [SPARK-4924] Add a library for launching Spark jobs programmatically.
This change encapsulates all the logic involved in launching a Spark job
into a small Java library that can be easily embedded into other applications.

The overall goal of this change is twofold, as described in the bug:

- Provide a public API for launching Spark processes. This is a common request
  from users and currently there's no good answer for it.

- Remove a lot of the duplicated code and other coupling that exists in the
  different parts of Spark that deal with launching processes.

A lot of the duplication was due to different code needed to build an
application's classpath (and the bootstrapper needed to run the driver in
certain situations), and also different code needed to parse spark-submit
command line options in different contexts. The change centralizes those
as much as possible so that all code paths can rely on the library for
handling those appropriately.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3916 from vanzin/SPARK-4924 and squashes the following commits:

18c7e4d [Marcelo Vanzin] Fix make-distribution.sh.
2ce741f [Marcelo Vanzin] Add lots of quotes.
3b28a75 [Marcelo Vanzin] Update new pom.
a1b8af1 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
897141f [Marcelo Vanzin] Review feedback.
e2367d2 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
28cd35e [Marcelo Vanzin] Remove stale comment.
b1d86b0 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
00505f9 [Marcelo Vanzin] Add blurb about new API in the programming guide.
5f4ddcc [Marcelo Vanzin] Better usage messages.
92a9cfb [Marcelo Vanzin] Fix Win32 launcher, usage.
6184c07 [Marcelo Vanzin] Rename field.
4c19196 [Marcelo Vanzin] Update comment.
7e66c18 [Marcelo Vanzin] Fix pyspark tests.
0031a8e [Marcelo Vanzin] Review feedback.
c12d84b [Marcelo Vanzin] Review feedback. And fix spark-submit on Windows.
e2d4d71 [Marcelo Vanzin] Simplify some code used to launch pyspark.
43008a7 [Marcelo Vanzin] Don't make builder extend SparkLauncher.
b4d6912 [Marcelo Vanzin] Use spark-submit script in SparkLauncher.
28b1434 [Marcelo Vanzin] Add a comment.
304333a [Marcelo Vanzin] Fix propagation of properties file arg.
bb67b93 [Marcelo Vanzin] Remove unrelated Yarn change (that is also wrong).
8ec0243 [Marcelo Vanzin] Add missing newline.
95ddfa8 [Marcelo Vanzin] Fix handling of --help for spark-class command builder.
72da7ec [Marcelo Vanzin] Rename SparkClassLauncher.
62978e4 [Marcelo Vanzin] Minor cleanup of Windows code path.
9cd5b44 [Marcelo Vanzin] Make all non-public APIs package-private.
e4c80b6 [Marcelo Vanzin] Reorganize the code so that only SparkLauncher is public.
e50dc5e [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
de81da2 [Marcelo Vanzin] Fix CommandUtils.
86a87bf [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
2061967 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
46d46da [Marcelo Vanzin] Clean up a test and make it more future-proof.
b93692a [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
ad03c48 [Marcelo Vanzin] Revert "Fix a thread-safety issue in "local" mode."
0b509d0 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
23aa2a9 [Marcelo Vanzin] Read java-opts from conf dir, not spark home.
7cff919 [Marcelo Vanzin] Javadoc updates.
eae4d8e [Marcelo Vanzin] Fix new unit tests on Windows.
e570fb5 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
44cd5f7 [Marcelo Vanzin] Add package-info.java, clean up javadocs.
f7cacff [Marcelo Vanzin] Remove "launch Spark in new thread" feature.
7ed8859 [Marcelo Vanzin] Some more feedback.
54cd4fd [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
61919df [Marcelo Vanzin] Clean leftover debug statement.
aae5897 [Marcelo Vanzin] Use launcher classes instead of jars in non-release mode.
e584fc3 [Marcelo Vanzin] Rework command building a little bit.
525ef5b [Marcelo Vanzin] Rework Unix spark-class to handle argument with newlines.
8ac4e92 [Marcelo Vanzin] Minor test cleanup.
e946a99 [Marcelo Vanzin] Merge PySparkLauncher into SparkSubmitCliLauncher.
c617539 [Marcelo Vanzin] Review feedback round 1.
fc6a3e2 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
f26556b [Marcelo Vanzin] Fix a thread-safety issue in "local" mode.
2f4e8b4 [Marcelo Vanzin] Changes needed to make this work with SPARK-4048.
799fc20 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
bb5d324 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
53faef1 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
a7936ef [Marcelo Vanzin] Fix pyspark tests.
656374e [Marcelo Vanzin] Mima fixes.
4d511e7 [Marcelo Vanzin] Fix tools search code.
7a01e4a [Marcelo Vanzin] Fix pyspark on Yarn.
1b3f6e9 [Marcelo Vanzin] Call SparkSubmit from spark-class launcher for unknown classes.
25c5ae6 [Marcelo Vanzin] Centralize SparkSubmit command line parsing.
27be98a [Marcelo Vanzin] Modify Spark to use launcher lib.
6f70eea [Marcelo Vanzin] [SPARK-4924] Add a library for launching Spark jobs programatically.
2015-03-11 01:03:01 -07:00
Michael Armbrust 2672374110 [SPARK-5183][SQL] Update SQL Docs with JDBC and Migration Guide
Author: Michael Armbrust <michael@databricks.com>

Closes #4958 from marmbrus/sqlDocs and squashes the following commits:

9351dbc [Michael Armbrust] fix parquet example
6877e13 [Michael Armbrust] add sql examples
d81b7e7 [Michael Armbrust] rxins comments
e393528 [Michael Armbrust] fix order
19c2735 [Michael Armbrust] more on data source load/store
00d5914 [Michael Armbrust] Update SQL Docs with JDBC and Migration Guide
2015-03-10 18:13:09 -07:00
Reynold Xin 3cac1991a1 [SPARK-5310][Doc] Update SQL Programming Guide to include DataFrames.
Author: Reynold Xin <rxin@databricks.com>

Closes #4954 from rxin/df-docs and squashes the following commits:

c592c70 [Reynold Xin] [SPARK-5310][Doc] Update SQL Programming Guide to include DataFrames.
2015-03-09 16:16:16 -07:00
RobertZK 48a723c986 Fix python typo (+ Scala, Java typos)
Author: RobertZK <technoguyrob@gmail.com>
Author: Robert Krzyzanowski <technoguyrob@gmail.com>

Closes #4840 from robertzk/patch-1 and squashes the following commits:

d286215 [RobertZK] lambda fix per @laserson
5937989 [Robert Krzyzanowski] Fix python typo
2015-03-07 00:39:24 +00:00
tedyu 8d3e2414d4 SPARK-6085 Increase default value for memory overhead
Author: tedyu <yuzhihong@gmail.com>

Closes #4836 from tedyu/master and squashes the following commits:

d65b495 [tedyu] SPARK-6085 Increase default value for memory overhead
1fdd4df [tedyu] SPARK-6085 Increase default value for memory overhead
2015-03-04 11:00:52 +00:00
Sean Owen e750a6bfdd SPARK-1911 [DOCS] Warn users if their assembly jars are not built with Java 6
Add warning about building with Java 7+ and running the JAR on early Java 6.

CC andrewor14

Author: Sean Owen <sowen@cloudera.com>

Closes #4874 from srowen/SPARK-1911 and squashes the following commits:

79fa2f6 [Sean Owen] Add warning about building with Java 7+ and running the JAR on early Java 6.
2015-03-03 13:40:11 -08:00
DB Tsai b196056190 [SPARK-5537][MLlib][Docs] Add user guide for multinomial logistic regression
Adding more description on top of #4861.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #4866 from dbtsai/doc and squashes the following commits:

37e9d07 [DB Tsai] doc
2015-03-02 22:37:12 -08:00
Xiangrui Meng 7e53a79c30 [SPARK-6097][MLLIB] Support tree model save/load in PySpark/MLlib
Similar to `MatrixFactorizaionModel`, we only need wrappers to support save/load for tree models in Python.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4854 from mengxr/SPARK-6097 and squashes the following commits:

4586a4d [Xiangrui Meng] fix more typos
8ebcac2 [Xiangrui Meng] fix python style
91172d8 [Xiangrui Meng] fix typos
201b3b9 [Xiangrui Meng] update user guide
b5158e2 [Xiangrui Meng] support tree model save/load in PySpark/MLlib
2015-03-02 22:27:01 -08:00
Xiangrui Meng 9d6c5aeebd [SPARK-5537] Add user guide for multinomial logistic regression
This is based on #4801 from dbtsai. The linear method guide is re-organized a little bit for this change.

Closes #4801

Author: Xiangrui Meng <meng@databricks.com>
Author: DB Tsai <dbtsai@alpinenow.com>

Closes #4861 from mengxr/SPARK-5537 and squashes the following commits:

47af0ac [Xiangrui Meng] update user guide for multinomial logistic regression
cdc2e15 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into AlpineNow-mlor-doc
096d0ca [DB Tsai] first commit
2015-03-02 18:10:50 -08:00
Andrew Or 258d154c9f [SPARK-6048] SparkConf should not translate deprecated configs on set
There are multiple issues with translating on set outlined in the JIRA.

This PR reverts the translation logic added to `SparkConf`. In the future, after the 1.3.0 release we will figure out a way to reorganize the internal structure more elegantly. For now, let's preserve the existing semantics of `SparkConf` since it's a public interface. Unfortunately this means duplicating some code for now, but this is all internal and we can always clean it up later.

Author: Andrew Or <andrew@databricks.com>

Closes #4799 from andrewor14/conf-set-translate and squashes the following commits:

11c525b [Andrew Or] Move warning to driver
10e77b5 [Andrew Or] Add documentation for deprecation precedence
a369cb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into conf-set-translate
c26a9e3 [Andrew Or] Revert all translate logic in SparkConf
fef6c9c [Andrew Or] Restore deprecation logic for spark.executor.userClassPathFirst
94b4dfa [Andrew Or] Translate on get, not set
2015-03-02 16:36:42 -08:00
Sean Owen 0b472f60cd SPARK-5390 [DOCS] Encourage users to post on Stack Overflow in Community Docs
Point "Community" to main Spark Community page; mention SO tag apache-spark.

Separately, the Apache site can be updated to mention, under Mailing Lists:
"StackOverflow also has an apache-spark tag for Spark Q&A." or similar.

Author: Sean Owen <sowen@cloudera.com>

Closes #4843 from srowen/SPARK-5390 and squashes the following commits:

3508ac6 [Sean Owen] Point "Community" to main Spark Community page; mention SO tag apache-spark
2015-03-02 21:10:08 +00:00
DEBORAH SIEGEL e7d8ae444f aggregateMessages example in graphX doc
Examples illustrating difference between legacy mapReduceTriplets usage and aggregateMessages usage has type issues on the reduce for both operators.

Being just an example-  changed example to reduce the message String by concatenation. Although non-optimal for performance.

Author: DEBORAH SIEGEL <deborahsiegel@DEBORAHs-MacBook-Pro.local>

Closes #4853 from d3borah/master and squashes the following commits:

db54173 [DEBORAH SIEGEL] fixed aggregateMessages example in graphX doc
2015-03-02 10:15:32 -08:00
MechCoder 3f00bb3ef1 [SPARK-6083] [MLLib] [DOC] Make Python API example consistent in NaiveBayes
Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4834 from MechCoder/spark-6083 and squashes the following commits:

1cdd7b5 [MechCoder] Add parse function
65bbbe9 [MechCoder] [SPARK-6083] Make Python API example consistent in NaiveBayes
2015-03-01 16:28:15 -08:00
Xiangrui Meng aedbbaa3dd [SPARK-6053][MLLIB] support save/load in PySpark's ALS
A simple wrapper to save/load `MatrixFactorizationModel` in Python. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4811 from mengxr/SPARK-5991 and squashes the following commits:

f135dac [Xiangrui Meng] update save doc
57e5200 [Xiangrui Meng] address comments
06140a4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5991
282ec8d [Xiangrui Meng] support save/load in PySpark's ALS
2015-03-01 16:26:57 -08:00
Joseph K. Bradley d17cb2ba33 [SPARK-4587] [mllib] [docs] Fixed save,load calls in ML guide examples
Should pass spark context to save/load

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4816 from jkbradley/ml-io-doc-fix and squashes the following commits:

83d369d [Joseph K. Bradley] added comment to save,load parts of ML guide examples
2841170 [Joseph K. Bradley] Fixed save,load calls in ML guide examples
2015-02-27 13:00:36 -08:00
许鹏 0375a413b8 fix spark-6033, clarify the spark.worker.cleanup behavior in standalone mode
jira case spark-6033 https://issues.apache.org/jira/browse/SPARK-6033

In standalone deploy mode, the cleanup will only remove the stopped application's directories.

The original description about the cleanup behavior is incorrect.

Author: 许鹏 <peng.xu@fraudmetrix.cn>

Closes #4803 from hseagle/spark-6033 and squashes the following commits:

927a6a0 [许鹏] fix the incorrect description about the spark.worker.cleanup in standalone mode
2015-02-26 23:06:34 -08:00
moussa taifi c871e2dae0 Add a note for context termination for History server on Yarn
The history server on Yarn only shows completed jobs. This adds a note concerning the needed explicit context termination at the end of a spark job which is a best practice anyway.
Related to SPARK-2972 and SPARK-3458

Author: moussa taifi <moutai10@gmail.com>

Closes #4721 from moutai/add-history-server-note-for-closing-the-spark-context and squashes the following commits:

9f5b6c3 [moussa taifi] Fix upper case typo for YARN
3ad3db4 [moussa taifi] Add context termination for History server on Yarn
2015-02-26 14:20:30 -08:00
xukun 00228947 8942b522d8 [SPARK-3562]Periodic cleanup event logs
Author: xukun 00228947 <xukun.xu@huawei.com>

Closes #4214 from viper-kun/cleaneventlog and squashes the following commits:

7a5b9c5 [xukun 00228947] fix issue
31674ee [xukun 00228947] fix issue
6e3d06b [xukun 00228947] fix issue
373f3b9 [xukun 00228947] fix issue
71782b5 [xukun 00228947] fix issue
5b45035 [xukun 00228947] fix issue
70c28d6 [xukun 00228947] fix issues
adcfe86 [xukun 00228947] Periodic cleanup event logs
2015-02-26 13:24:00 -08:00
Li Zhihui 10094a523e Modify default value description for spark.scheduler.minRegisteredResourcesRatio on docs.
The configuration is not supported in mesos mode now.
See https://github.com/apache/spark/pull/1462

Author: Li Zhihui <zhihui.li@intel.com>

Closes #4781 from li-zhihui/fixdocconf and squashes the following commits:

63e7a44 [Li Zhihui] Modify default value description for spark.scheduler.minRegisteredResourcesRatio on docs.
2015-02-26 13:07:49 -08:00
Joseph K. Bradley d20559b157 [SPARK-5974] [SPARK-5980] [mllib] [python] [docs] Update ML guide with save/load, Python GBT
* Add GradientBoostedTrees Python examples to ML guide
  * I ran these in the pyspark shell, and they worked.
* Add save/load to examples in ML guide
* Added note to python docs about predict,transform not working within RDD actions,transformations in some cases (See SPARK-5981)

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4750 from jkbradley/SPARK-5974 and squashes the following commits:

c410e38 [Joseph K. Bradley] Added note to LabeledPoint about attributes
bcae18b [Joseph K. Bradley] Added import of models for save/load examples in ml guide.  Fixed line length for tree.py, feature.py (but not other ML Pyspark files yet).
6d81c3e [Joseph K. Bradley] completed python GBT examples
9903309 [Joseph K. Bradley] Added note to python docs about predict,transform not working within RDD actions,transformations in some cases
c7dfad8 [Joseph K. Bradley] Added model save/load to ML guide.  Added GBT examples to ML guide
2015-02-25 16:13:17 -08:00
Brennon York 46a044a36a [SPARK-1182][Docs] Sort the configuration parameters in configuration.md
Sorts all configuration options present on the `configuration.md` page to ease readability.

Author: Brennon York <brennon.york@capitalone.com>

Closes #3863 from brennonyork/SPARK-1182 and squashes the following commits:

5696f21 [Brennon York] fixed merge conflict with port comments
81a7b10 [Brennon York] capitalized A in Allocation
e240486 [Brennon York] moved all spark.mesos properties into the running-on-mesos doc
7de5f75 [Brennon York] moved serialization from application to compression and serialization section
a16fec0 [Brennon York] moved shuffle settings from network to shuffle
f8fa286 [Brennon York] sorted encryption category
1023f15 [Brennon York] moved initialExecutors
e9d62aa [Brennon York] fixed akka.heartbeat.interval
25e6f6f [Brennon York] moved spark.executer.user*
4625ade [Brennon York] added spark.executor.extra* items
4ee5648 [Brennon York] fixed merge conflicts
1b49234 [Brennon York] sorting mishap
2b5758b [Brennon York] sorting mishap
6fbdf42 [Brennon York] sorting mishap
55dc6f8 [Brennon York] sorted security
ec34294 [Brennon York] sorted dynamic allocation
2a7c4a3 [Brennon York] sorted scheduling
aa9acdc [Brennon York] sorted networking
a4380b8 [Brennon York] sorted execution behavior
27f3919 [Brennon York] sorted compression and serialization
80a5bbb [Brennon York] sorted spark ui
3f32e5b [Brennon York] sorted shuffle behavior
6c51b38 [Brennon York] sorted runtime environment
efe9d6f [Brennon York] sorted application properties
2015-02-25 16:12:56 -08:00
Sean Owen 7d8e6a2e44 SPARK-5930 [DOCS] Documented default of spark.shuffle.io.retryWait is confusing
Clarify default max wait in spark.shuffle.io.retryWait docs

CC andrewor14

Author: Sean Owen <sowen@cloudera.com>

Closes #4769 from srowen/SPARK-5930 and squashes the following commits:

ae2792b [Sean Owen] Clarify default max wait in spark.shuffle.io.retryWait docs
2015-02-25 12:20:44 -08:00
Benedikt Linse 5b8480e035 [GraphX] fixing 3 typos in the graphx programming guide
Corrected 3 Typos in the GraphX programming guide. I hope this is the correct way to contribute.

Author: Benedikt Linse <benedikt.linse@gmail.com>

Closes #4766 from 1123/master and squashes the following commits:

8a63812 [Benedikt Linse] fixing 3 typos in the graphx programming guide
2015-02-25 14:46:17 +00:00
Davies Liu d641fbb39c [SPARK-5994] [SQL] Python DataFrame documentation fixes
select empty should NOT be the same as select. make sure selectExpr is behaving the same.
join param documentation
link to source doesn't work in jekyll generated file
cross reference of columns (i.e. enabling linking)
show(): move df example before df.show()
move tests in SQLContext out of docstring otherwise doc is too long
Column.desc and .asc doesn't have any documentation
in documentation, sort functions.*)

Author: Davies Liu <davies@databricks.com>

Closes #4756 from davies/df_docs and squashes the following commits:

f30502c [Davies Liu] fix doc
32f0d46 [Davies Liu] fix DataFrame docs
2015-02-24 20:51:55 -08:00
MechCoder 2a0fe34891 [SPARK-5436] [MLlib] Validate GradientBoostedTrees using runWithValidation
One can early stop if the decrease in error rate is lesser than a certain tol or if the error increases if the training data is overfit.

This introduces a new method runWithValidation which takes in a pair of RDD's , one for the training data and the other for the validation.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4677 from MechCoder/spark-5436 and squashes the following commits:

1bb21d4 [MechCoder] Combine regression and classification tests into a single one
e4d799b [MechCoder] Addresses indentation and doc comments
b48a70f [MechCoder] COSMIT
b928a19 [MechCoder] Move validation while training section under usage tips
fad9b6e [MechCoder] Made the following changes 1. Add section to documentation 2. Return corresponding to bestValidationError 3. Allow negative tolerance.
55e5c3b [MechCoder] One liner for prevValidateError
3e74372 [MechCoder] TST: Add test for classification
77549a9 [MechCoder] [SPARK-5436] Validate GradientBoostedTrees using runWithValidation
2015-02-24 15:13:22 -08:00
Judy c5ba975ee8 [Spark-5708] Add Slf4jSink to Spark Metrics
Add Slf4jSink to Spark Metrics using Coda Hale's SlfjReporter.
This sends metrics to log4j, allowing spark users to reuse log4j pipeline for metrics collection.

Reviewed existing unit tests and didn't see any sink-related tests. Please advise on if tests should be added.

Author: Judy <judynash@microsoft.com>
Author: judynash <judynash@microsoft.com>

Closes #4644 from judynash/master and squashes the following commits:

57ef214 [judynash] doc clarification and indent fixes
a751a66 [Judy] Spark-5708: Add Slf4jSink to Spark Metrics
2015-02-24 20:50:16 +00:00
Xiangrui Meng 105791e35c [MLLIB] Change x_i to y_i in Variance's user guide
Variance is calculated on labels/responses.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4740 from mengxr/patch-1 and squashes the following commits:

673317b [Xiangrui Meng] [MLLIB] Change x_i to y_i in Variance's user guide
2015-02-24 11:38:59 -08:00
Xiangrui Meng cf2e41653d [SPARK-5958][MLLIB][DOC] update block matrix user guide
* Removed SVD code from examples.
* Corrected Java API doc link.
* Updated variable names: `AtransposeA` -> `ata`.
* Minor changes.

brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #4737 from mengxr/update-block-matrix-user-guide and squashes the following commits:

70f53ac [Xiangrui Meng] update block matrix user guide
2015-02-23 22:08:44 -08:00
Joseph K. Bradley 59536cc87e [SPARK-5912] [docs] [mllib] Small fixes to ChiSqSelector docs
Fixes:
* typo in Scala example
* Removed comment "usually applied on sparse data" since that is debatable
* small edits to text for clarity

CC: avulanov  I noticed a typo post-hoc and ended up making a few small edits.  Do the changes look OK?

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4732 from jkbradley/chisqselector-docs and squashes the following commits:

9656a3b [Joseph K. Bradley] added Java example for ChiSqSelector to guide
3f3f9f4 [Joseph K. Bradley] small fixes to ChiSqSelector docs
2015-02-23 16:15:57 -08:00
Alexander Ulanov 28ccf5ee76 [MLLIB] SPARK-5912 Programming guide for feature selection
Added description of ChiSqSelector and few words about feature selection in general. I could add a code example, however it would not look reasonable in the absence of feature discretizer or a dataset in the `data` folder that has redundant features.

Author: Alexander Ulanov <nashb@yandex.ru>

Closes #4709 from avulanov/SPARK-5912 and squashes the following commits:

19a8a4e [Alexander Ulanov] Addressing reviewers comments @jkbradley
58d9e4d [Alexander Ulanov] Addressing reviewers comments @jkbradley
eb6b9fe [Alexander Ulanov] Typo
2921a1d [Alexander Ulanov] ChiSqSelector example of use
c845350 [Alexander Ulanov] ChiSqSelector docs
2015-02-23 12:09:40 -08:00
CodingCat 242d49584c [SPARK-5724] fix the misconfiguration in AkkaUtils
https://issues.apache.org/jira/browse/SPARK-5724

In AkkaUtil, we set several failure detector related the parameters as following

```
al akkaConf = ConfigFactory.parseMap(conf.getAkkaConf.toMap[String, String])
      .withFallback(akkaSslConfig).withFallback(ConfigFactory.parseString(
      s"""
      |akka.daemonic = on
      |akka.loggers = [""akka.event.slf4j.Slf4jLogger""]
      |akka.stdout-loglevel = "ERROR"
      |akka.jvm-exit-on-fatal-error = off
      |akka.remote.require-cookie = "$requireCookie"
      |akka.remote.secure-cookie = "$secureCookie"
      |akka.remote.transport-failure-detector.heartbeat-interval = $akkaHeartBeatInterval s
      |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = $akkaHeartBeatPauses s
      |akka.remote.transport-failure-detector.threshold = $akkaFailureDetector
      |akka.actor.provider = "akka.remote.RemoteActorRefProvider"
      |akka.remote.netty.tcp.transport-class = "akka.remote.transport.netty.NettyTransport"
      |akka.remote.netty.tcp.hostname = "$host"
      |akka.remote.netty.tcp.port = $port
      |akka.remote.netty.tcp.tcp-nodelay = on
      |akka.remote.netty.tcp.connection-timeout = $akkaTimeout s
      |akka.remote.netty.tcp.maximum-frame-size = ${akkaFrameSize}B
      |akka.remote.netty.tcp.execution-pool-size = $akkaThreads
      |akka.actor.default-dispatcher.throughput = $akkaBatchSize
      |akka.log-config-on-start = $logAkkaConfig
      |akka.remote.log-remote-lifecycle-events = $lifecycleEvents
      |akka.log-dead-letters = $lifecycleEvents
      |akka.log-dead-letters-during-shutdown = $lifecycleEvents
      """.stripMargin))

```

Actually, we do not have any parameter naming "akka.remote.transport-failure-detector.threshold"
see: http://doc.akka.io/docs/akka/2.3.4/general/configuration.html
what we have is "akka.remote.watch-failure-detector.threshold"

Author: CodingCat <zhunansjtu@gmail.com>

Closes #4512 from CodingCat/SPARK-5724 and squashes the following commits:

bafe56e [CodingCat] fix the grammar in configuration doc
338296e [CodingCat] remove failure-detector related info
8bfcfd4 [CodingCat] fix the misconfiguration in AkkaUtils
2015-02-23 11:29:25 +00:00
Alexander a7f9039025 [DOCS] Fix typo in API for custom InputFormats based on the “new” MapReduce API
This looks like a simple typo ```SparkContext.newHadoopRDD``` instead of ```SparkContext.newAPIHadoopRDD``` as in actual http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.SparkContext

Author: Alexander <abezzubov@nflabs.com>

Closes #4718 from bzz/hadoop-InputFormats-doc-fix and squashes the following commits:

680a4c4 [Alexander] Fix typo in docs on custom Hadoop InputFormats
2015-02-22 08:53:05 +00:00
Joseph K. Bradley 4a17eedb16 [SPARK-5867] [SPARK-5892] [doc] [ml] [mllib] Doc cleanups for 1.3 release
For SPARK-5867:
* The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
* It should also include Python examples now.

For SPARK-5892:
* Fix Python docs
* Various other cleanups

BTW, I accidentally merged this with master.  If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]

CC: mengxr  (ML),  davies  (Python docs)

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4675 from jkbradley/doc-review-1.3 and squashes the following commits:

f191bb0 [Joseph K. Bradley] small cleanups
e786efa [Joseph K. Bradley] small doc corrections
6b1ab4a [Joseph K. Bradley] fixed python lint test
946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example.  Changed spark.ml Java examples to use DataFrames API instead of sql()
da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
34b067f [Joseph K. Bradley] small doc correction
da16aef [Joseph K. Bradley] Fixed python mllib docs
8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
b05a80d [Joseph K. Bradley] organize imports. doc cleanups
e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
2015-02-20 02:31:32 -08:00
Xiangrui Meng 0cfd2cebde [SPARK-5900][MLLIB] make PIC and FPGrowth Java-friendly
In the previous version, PIC stores clustering assignments as an `RDD[(Long, Int)]`. This is mapped to `RDD<Tuple2<Object, Object>>` in Java and hence Java users have to cast types manually. We should either create a new method called `javaAssignments` that returns `JavaRDD[(java.lang.Long, java.lang.Int)]` or wrap the result pair in a class. I chose the latter approach in this PR. Now assignments are stored as an `RDD[Assignment]`, where `Assignment` is a class with `id` and `cluster`.

Similarly, in FPGrowth, the frequent itemsets are stored as an `RDD[(Array[Item], Long)]`, which is mapped to `RDD<Tuple2<Object, Object>>`. Though we provide a "Java-friendly" method `javaFreqItemsets` that returns `JavaRDD[(Array[Item], java.lang.Long)]`. It doesn't really work because `Array[Item]` is mapped to `Object` in Java. So in this PR I created a class `FreqItemset` to wrap the results. It has `items` and `freq`, as well as a `javaItems` method that returns `List<Item>` in Java.

I'm not certain that the names I chose are proper: `Assignment`/`id`/`cluster` and `FreqItemset`/`items`/`freq`. Please let me know if there are better suggestions.

CC: jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4695 from mengxr/SPARK-5900 and squashes the following commits:

865b5ca [Xiangrui Meng] make Assignment serializable
cffa96e [Xiangrui Meng] fix test
9c0e590 [Xiangrui Meng] remove unused Tuple2
1b9db3d [Xiangrui Meng] make PIC and FPGrowth Java-friendly
2015-02-19 18:06:16 -08:00
Ilya Ganelin 6bddc40353 SPARK-5570: No docs stating that `new SparkConf().set("spark.driver.memory", ...) will not work
I've updated documentation to reflect true behavior of this setting in client vs. cluster mode.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #4665 from ilganeli/SPARK-5570 and squashes the following commits:

5d1c8dd [Ilya Ganelin] Added example configuration code
a51700a [Ilya Ganelin] Getting rid of extra spaces
85f7a08 [Ilya Ganelin] Reworded note
5889d43 [Ilya Ganelin] Formatting adjustment
f149ba1 [Ilya Ganelin] Minor updates
1fec7a5 [Ilya Ganelin] Updated to add clarification for other driver properties
db47595 [Ilya Ganelin] Slight formatting update
c899564 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5570
17b751d [Ilya Ganelin] Updated documentation for driver-memory to reflect its true behavior in client vs cluster mode
2015-02-19 15:53:20 -08:00
Xiangrui Meng d12d2ad76e [SPARK-5879][MLLIB] update PIC user guide and add a Java example
Updated PIC user guide to reflect API changes and added a simple Java example. The API is still not very Java-friendly. I created SPARK-5990 for this issue.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4680 from mengxr/SPARK-5897 and squashes the following commits:

847d216 [Xiangrui Meng] apache header
87719a2 [Xiangrui Meng] remove PIC image
2dd921f [Xiangrui Meng] update PIC user guide and add a Java example
2015-02-18 16:29:32 -08:00
Burak Yavuz a8eb92dcb9 [SPARK-5507] Added documentation for BlockMatrix
Docs for BlockMatrix. mengxr

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #4664 from brkyvz/SPARK-5507PR and squashes the following commits:

4db30b0 [Burak Yavuz] [SPARK-5507] Added documentation for BlockMatrix
2015-02-18 10:11:08 -08:00
Xiangrui Meng 85e9d091d5 [SPARK-5519][MLLIB] add user guide with example code for fp-growth
The API is still not very Java-friendly because `Array[Item]` in `freqItemsets` is recognized as `Object` in Java. We might want to define a case class to wrap the return pair to make it Java friendly.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4661 from mengxr/SPARK-5519 and squashes the following commits:

58ccc25 [Xiangrui Meng] add user guide with example code for fp-growth
2015-02-18 10:09:56 -08:00
MechCoder e79a7a626d SPARK-4610 addendum: [Minor] [MLlib] Minor doc fix in GBT classification example
numClassesForClassification has been renamed to numClasses.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4672 from MechCoder/minor-doc and squashes the following commits:

d2ddb7f [MechCoder] Minor doc fix in GBT classification example
2015-02-18 10:13:40 +00:00
Burak Yavuz ae6cfb3acd [SPARK-5811] Added documentation for maven coordinates and added Spark Packages support
Documentation for maven coordinates + Spark Package support. Added pyspark tests for `--packages`

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Davies Liu <davies@databricks.com>

Closes #4662 from brkyvz/SPARK-5811 and squashes the following commits:

56ccccd [Burak Yavuz] fixed broken test
64cb8ee [Burak Yavuz] passed pep8 on local
c07b81e [Burak Yavuz] fixed pep8
a8bd6b7 [Burak Yavuz] submit PR
4ef4046 [Burak Yavuz] ready for PR
8fb02e5 [Burak Yavuz] merged master
25c9b9f [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into python-jar
560d13b [Burak Yavuz] before PR
17d3f76 [Davies Liu] support .jar as python package
a3eb717 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into SPARK-5811
c60156d [Burak Yavuz] [SPARK-5811] Added documentation for maven coordinates
2015-02-17 17:23:22 -08:00
CodingCat 31efb39c1d [Minor] fix typo in SQL document
Author: CodingCat <zhunansjtu@gmail.com>

Closes #4656 from CodingCat/fix_typo and squashes the following commits:

b41d15c [CodingCat] recover
689fe46 [CodingCat] fix typo
2015-02-17 12:16:52 -08:00
Patrick Wendell a51d51ffac SPARK-5850: Remove experimental label for Scala 2.11 and FlumePollingStream
Author: Patrick Wendell <patrick@databricks.com>

Closes #4638 from pwendell/SPARK-5850 and squashes the following commits:

386126f [Patrick Wendell] SPARK-5850: Remove experimental label for Scala 2.11 and FlumePollingStream.
2015-02-16 20:33:33 -08:00
Patrick Wendell 04b401da81 HOTFIX: Break in Jekyll build from #4589
That patch had a line break in the middle of a {{ }} expression, which is not allowed.
2015-02-16 15:44:01 -08:00
martinzapletal 61eb12674b [MLLIB][SPARK-5502] User guide for isotonic regression
User guide for isotonic regression added to docs/mllib-regression.md including code examples for Scala and Java.

Author: martinzapletal <zapletal-martin@email.cz>

Closes #4536 from zapletal-martin/SPARK-5502 and squashes the following commits:

67fe773 [martinzapletal] SPARK-5502 reworded model prediction rules to use more general language rather than the code/implementation specific terms
80bd4c3 [martinzapletal] SPARK-5502 created docs page for isotonic regression, added links to the page, updated data and examples
7d8136e [martinzapletal] SPARK-5502 Added documentation for Isotonic regression including examples for Scala and Java
504b5c3 [martinzapletal] SPARK-5502 Added documentation for Isotonic regression including examples for Scala and Java
2015-02-15 09:10:03 -08:00
gasparms f80e2629bb [SPARK-5800] Streaming Docs. Change linked files according the selected language
Currently, Spark Streaming Programming Guide after updateStateByKey  explanation links to file stateful_network_wordcount.py and note "For the complete Scala code ..." for any language tab selected. This is an incoherence.

I've changed the guide and link its pertinent example file. JavaStatefulNetworkWordCount.java example was not created so I added to the commit.

Author: gasparms <gmunoz@stratio.com>

Closes #4589 from gasparms/feature/streaming-guide and squashes the following commits:

7f37f89 [gasparms] More style changes
ec202b0 [gasparms] Follow spark style guide
f527328 [gasparms] Improve example to look like scala example
4d8785c [gasparms] Remove throw exception
e92e6b8 [gasparms] Fix incoherence
92db405 [gasparms] Fix Streaming Programming Guide. Change files according the selected language
2015-02-14 20:10:29 +00:00
Xiangrui Meng cc56c8729a [SPARK-5806] re-organize sections in mllib-clustering.md
Put example code close to the algorithm description.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4598 from mengxr/SPARK-5806 and squashes the following commits:

a137872 [Xiangrui Meng] re-organize sections in mllib-clustering.md
2015-02-13 15:09:27 -08:00
Emre Sevinç 9f31db0610 SPARK-5805 Fixed the type error in documentation.
Fixes SPARK-5805 : Fix the type error in the final example given in MLlib - Clustering documentation.

Author: Emre Sevinç <emre.sevinc@gmail.com>

Closes #4596 from emres/SPARK-5805 and squashes the following commits:

1029f66 [Emre Sevinç] SPARK-5805 Fixed the type error in documentation.
2015-02-13 12:31:27 -08:00
Antonio Navarro Perez 6a1be026cf [SQL][DOCS] Update sql documentation
Updated examples using the new api and added DataFrame concept

Author: Antonio Navarro Perez <ajnavarro@users.noreply.github.com>

Closes #4560 from ajnavarro/ajnavarro-doc-sql-update and squashes the following commits:

82ebcf3 [Antonio Navarro Perez] Changed a missing JavaSQLContext to SQLContext.
8d5376a [Antonio Navarro Perez] fixed typo
8196b6b [Antonio Navarro Perez] [SQL][DOCS] Update sql documentation
2015-02-12 12:46:17 -08:00
Sean Owen 9a3ea49f74 SPARK-5727 [BUILD] Remove Debian packaging
(for master / 1.4 only)

Author: Sean Owen <sowen@cloudera.com>

Closes #4526 from srowen/SPARK-5727.2 and squashes the following commits:

83ba49c [Sean Owen] Remove Debian packaging
2015-02-12 12:36:26 +00:00
Daniel Darabos 03bf704bf4 Remove outdated remark about take(n).
Looking at the code, I believe this remark about `take(n)` computing partitions on the driver is no longer correct. Apologies if I'm wrong.

This came up in http://stackoverflow.com/q/28436559/3318517.

Author: Daniel Darabos <darabos.daniel@gmail.com>

Closes #4533 from darabos/patch-2 and squashes the following commits:

cc80f3a [Daniel Darabos] Remove outdated remark about take(n).
2015-02-11 20:24:17 +00:00
Sean Owen bd0d6e0cc3 SPARK-5727 [BUILD] Deprecate Debian packaging
This just adds a deprecation message. It's intended for backporting to branch 1.3 but can go in master too, to be followed by another PR that removes it for 1.4.

Author: Sean Owen <sowen@cloudera.com>

Closes #4516 from srowen/SPARK-5727.1 and squashes the following commits:

d48989f [Sean Owen] Refer to Spark 1.4
6c1c8b3 [Sean Owen] Deprecate Debian packaging
2015-02-11 08:30:16 +00:00
Davies Liu ea60284095 [SPARK-5704] [SQL] [PySpark] createDataFrame from RDD with columns
Deprecate inferSchema() and applySchema(), use createDataFrame() instead, which could take an optional `schema` to create an DataFrame from an RDD. The `schema` could be StructType or list of names of columns.

Author: Davies Liu <davies@databricks.com>

Closes #4498 from davies/create and squashes the following commits:

08469c1 [Davies Liu] remove Scala/Java API for now
c80a7a9 [Davies Liu] fix hive test
d1bd8f2 [Davies Liu] cleanup applySchema
9526e97 [Davies Liu] createDataFrame from RDD with columns
2015-02-10 19:40:12 -08:00
Marcelo Vanzin 20a6013106 [SPARK-2996] Implement userClassPathFirst for driver, yarn.
Yarn's config option `spark.yarn.user.classpath.first` does not work the same way as
`spark.files.userClassPathFirst`; Yarn's version is a lot more dangerous, in that it
modifies the system classpath, instead of restricting the changes to the user's class
loader. So this change implements the behavior of the latter for Yarn, and deprecates
the more dangerous choice.

To be able to achieve feature-parity, I also implemented the option for drivers (the existing
option only applies to executors). So now there are two options, each controlling whether
to apply userClassPathFirst to the driver or executors. The old option was deprecated, and
aliased to the new one (`spark.executor.userClassPathFirst`).

The existing "child-first" class loader also had to be fixed. It didn't handle resources, and it
was also doing some things that ended up causing JVM errors depending on how things
were being called.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3233 from vanzin/SPARK-2996 and squashes the following commits:

9cf9cf1 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
a1499e2 [Marcelo Vanzin] Remove SPARK_HOME propagation.
fa7df88 [Marcelo Vanzin] Remove 'test.resource' file, create it dynamically.
a8c69f1 [Marcelo Vanzin] Review feedback.
cabf962 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
a1b8d7e [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
3f768e3 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
2ce3c7a [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
0e6d6be [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
70d4044 [Marcelo Vanzin] Fix pyspark/yarn-cluster test.
0fe7777 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
0e6ef19 [Marcelo Vanzin] Move class loaders around and make names more meaninful.
fe970a7 [Marcelo Vanzin] Review feedback.
25d4fed [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
3cb6498 [Marcelo Vanzin] Call the right loadClass() method on the parent.
fbb8ab5 [Marcelo Vanzin] Add locking in loadClass() to avoid deadlocks.
2e6c4b7 [Marcelo Vanzin] Mention new setting in documentation.
b6497f9 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
a10f379 [Marcelo Vanzin] Some feedback.
3730151 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
f513871 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
44010b6 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
7b57cba [Marcelo Vanzin] Remove now outdated message.
5304d64 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
35949c8 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
54e1a98 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
d1273b2 [Marcelo Vanzin] Add test file to rat exclude.
fa1aafa [Marcelo Vanzin] Remove write check on user jars.
89d8072 [Marcelo Vanzin] Cleanups.
a963ea3 [Marcelo Vanzin] Implement spark.driver.userClassPathFirst for standalone cluster mode.
50afa5f [Marcelo Vanzin] Fix Yarn executor command line.
7d14397 [Marcelo Vanzin] Register user jars in executor up front.
7f8603c [Marcelo Vanzin] Fix yarn-cluster mode without userClassPathFirst.
20373f5 [Marcelo Vanzin] Fix ClientBaseSuite.
55c88fa [Marcelo Vanzin] Run all Yarn integration tests via spark-submit.
0b64d92 [Marcelo Vanzin] Add deprecation warning to yarn option.
4a84d87 [Marcelo Vanzin] Fix the child-first class loader.
d0394b8 [Marcelo Vanzin] Add "deprecated configs" to SparkConf.
46d8cf2 [Marcelo Vanzin] Update doc with new option, change name to "userClassPathFirst".
a314f2d [Marcelo Vanzin] Enable driver class path isolation in SparkSubmit.
91f7e54 [Marcelo Vanzin] [yarn] Enable executor class path isolation.
a853e74 [Marcelo Vanzin] Re-work CoarseGrainedExecutorBackend command line arguments.
89522ef [Marcelo Vanzin] Add class path isolation support for Yarn cluster mode.
2015-02-09 21:17:28 -08:00
Xiangrui Meng 855d12ac0a [SPARK-5539][MLLIB] LDA guide
This is the LDA user guide from jkbradley with Java and Scala code example.

Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4465 from mengxr/lda-guide and squashes the following commits:

6dcb7d1 [Xiangrui Meng] update java example in the user guide
76169ff [Xiangrui Meng] update java example
36c3ae2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into lda-guide
c2a1efe [Joseph K. Bradley] Added LDA programming guide, plus Java example (which is in the guide and probably should be removed).
2015-02-08 23:40:36 -08:00
Sam Halliday 56aff4bd6c SPARK-5665 [DOCS] Update netlib-java documentation
I am the author of netlib-java and I found this documentation to be out of date. Some main points:

1. Breeze has not depended on jBLAS for some time
2. netlib-java provides a pure JVM implementation as the fallback (the original docs did not appear to be aware of this, claiming that gfortran was necessary)
3. The licensing issue is not just about LGPL: optimised natives have proprietary licenses. Building with the LGPL flag turned on really doesn't help you get past this.
4. I really think it's best to direct people to my detailed setup guide instead of trying to compress it into one sentence. It is different for each architecture, each OS, and for each backend.

I hope this helps to clear things up 😄

Author: Sam Halliday <sam.halliday@Gmail.com>
Author: Sam Halliday <sam.halliday@gmail.com>

Closes #4448 from fommil/patch-1 and squashes the following commits:

18cda11 [Sam Halliday] remove link to skillsmatters at request of @mengxr
a35e4a9 [Sam Halliday] reword netlib-java/breeze docs
2015-02-08 16:34:26 -08:00
WangTaoTheTonic d34f79c8db [SPARK-2945][YARN][Doc]add doc for spark.executor.instances
https://issues.apache.org/jira/browse/SPARK-2945

spark.executor.instances works. As this JIRA recommended, we should add docs for this common config.

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #4350 from WangTaoTheTonic/SPARK-2945 and squashes the following commits:

4c3913a [WangTaoTheTonic] not compatible with dynamic allocation
5fa9c46 [WangTaoTheTonic] add doc for spark.executor.instances
2015-02-06 11:58:22 -08:00
Andrew Or fe3740c4c8 [SPARK-5636] Ramp up faster in dynamic allocation
A recent patch #4051 made the initial number default to 0. With this change, any Spark application using dynamic allocation's default settings will ramp up very slowly. Since we never request more executors than needed to saturate the pending tasks, it is safe to ramp up quickly. The current default of 60 may be too slow.

Author: Andrew Or <andrew@databricks.com>

Closes #4409 from andrewor14/dynamic-allocation-interval and squashes the following commits:

d3cc485 [Andrew Or] Lower request interval
2015-02-06 10:55:13 -08:00
Travis Galoppo 9ad56ad2a2 [SPARK-5013] [MLlib] Added documentation and sample data file for GaussianMixture
Simple description and code samples (and sample data) for GaussianMixture

Author: Travis Galoppo <tjg2107@columbia.edu>

Closes #4401 from tgaloppo/spark-5013 and squashes the following commits:

c9ff9a5 [Travis Galoppo] Fixed link in mllib-clustering.md Added Gaussian mixture and power iteration as available clustering techniques in mllib-guide
2368690 [Travis Galoppo] Minor fixes
3eb41fa [Travis Galoppo] [SPARK-5013] Added documentation and sample data file for GaussianMixture
2015-02-06 10:26:51 -08:00
Miguel Peralvo f827ef4d7e Update ec2-scripts.md
Change spark-version from 1.1.0 to 1.2.0 in the example for spark-ec2/Launch Cluster.

Author: Miguel Peralvo <miguel.peralvo@gmail.com>

Closes #4300 from MiguelPeralvo/patch-1 and squashes the following commits:

38adf0b [Miguel Peralvo] Update ec2-scripts.md
1850869 [Miguel Peralvo] Update ec2-scripts.md
2015-02-06 11:04:48 +00:00
Daoyuan Wang 6fa4ac1b00 [Branch-1.3] [DOC] doc fix for date
Trivial fix.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4400 from adrian-wang/docdate and squashes the following commits:

31bbe40 [Daoyuan Wang] doc fix for date
2015-02-05 12:42:27 -08:00