As is, to specify this option on the command line, you have to escape the angle brackets.
Author: Bryan Cutler <bjcutler@us.ibm.com>
Closes #6049 from BryanCutler/dataFormat-option-7522 and squashes the following commits:
b34afb4 [Bryan Cutler] [SPARK-7522] Removed angle brackets from dataFormat option
(cherry picked from commit 4f8a155192)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
tdas
https://issues.apache.org/jira/browse/SPARK-7326
The problem most likely resides in the DStream.slice() implementation, as shown below.
```scala
def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
  if (!isInitialized) {
    throw new SparkException(this + " has not been initialized")
  }
  if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration ("
      + slideDuration + ")")
  }
  if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("toTime (" + toTime + ") is not a multiple of slideDuration ("
      + slideDuration + ")")
  }
  val alignedToTime = toTime.floor(slideDuration)
  val alignedFromTime = fromTime.floor(slideDuration)
  logInfo("Slicing from " + fromTime + " to " + toTime +
    " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")
  alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
    if (time >= zeroTime) getOrCompute(time) else None
  })
}
```
Here, after performing floor() on both fromTime and toTime, the results (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiples of the slideDuration, making the isTimeValid() check fail for all the remaining computation.
The fix is to add a new floor() function in Time.scala that respects the zeroTime while performing the floor:
```scala
def floor(that: Duration, zeroTime: Time): Time = {
  val t = that.milliseconds
  new Time(((this.millis - zeroTime.milliseconds) / t) * t + zeroTime.milliseconds)
}
```
And then change DStream.slice to call this new floor function, passing in its zeroTime:
```scala
val alignedToTime = toTime.floor(slideDuration, zeroTime)
val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
```
This way alignedToTime and alignedFromTime are *really* aligned with respect to zeroTime, whose value is not necessarily 0.
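To see the misalignment concretely, here is a minimal sketch with plain Longs standing in for Spark's Time/Duration classes (the values are hypothetical):
```scala
// floorOld is the original behavior; floorNew respects zeroTime.
object FloorAlignment {
  def floorOld(t: Long, slide: Long): Long = (t / slide) * slide
  def floorNew(t: Long, slide: Long, zero: Long): Long =
    ((t - zero) / slide) * slide + zero

  def main(args: Array[String]): Unit = {
    val zeroTime = 1000L // stream started at t = 1000 ms
    val slide    = 3000L // slideDuration = 3 s; valid batch times: 1000, 4000, 7000, ...
    val toTime   = 8000L

    val old   = floorOld(toTime, slide)           // 6000: (6000 - 1000) is not a multiple of 3000
    val fixed = floorNew(toTime, slide, zeroTime) // 7000: (7000 - 1000) is a multiple of 3000
    println(s"old = $old, aligned = ${(old - zeroTime) % slide == 0}")     // false
    println(s"new = $fixed, aligned = ${(fixed - zeroTime) % slide == 0}") // true
  }
}
```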
Author: Wesley Miao <wesley.miao@gmail.com>
Author: Wesley <wesley.miao@autodesk.com>
Closes #5871 from wesleymiao/spark-7326 and squashes the following commits:
82a4d8c [Wesley Miao] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time
48b4dc0 [Wesley] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time
6ade399 [Wesley] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time
2611745 [Wesley Miao] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time
(cherry picked from commit d70a076892)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Bug descriptions:
1. There are extra commas at the top of the session list.
2. The time format in the "Start at:" field is not the same as the others.
3. The total number of online sessions is wrong.
Author: tianyi <tianyi.asiainfo@gmail.com>
Closes #6048 from tianyi/SPARK-7519 and squashes the following commits:
ed366b7 [tianyi] fix bug
(cherry picked from commit 2242ab31e9)
Signed-off-by: Cheng Lian <lian@databricks.com>
Modified 2 files:
python/pyspark/ml/param/_shared_params_code_gen.py
python/pyspark/ml/param/shared.py
Generated shared.py on Linux using Python 2.6.6 on Red Hat Enterprise Linux Server 6.6.
python _shared_params_code_gen.py > shared.py
Only changed maxIter, regParam, rawPredictionCol based on strings from SharedParamsCodeGen.scala. Note: a warning was displayed when committing shared.py:
warning: LF will be replaced by CRLF in python/pyspark/ml/param/shared.py.
Author: Glenn Weidner <gweidner@us.ibm.com>
Closes #6023 from gweidner/br-7427 and squashes the following commits:
db72e32 [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
825e4a9 [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
e6a865e [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
1eee702 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
1ac10e5 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
cafd104 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
9bea1eb [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
4a35c20 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
9790cbe [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
d9c30f4 [Glenn Weidner] [SPARK-7275] [SQL] [WIP] Make LogicalRelation public
(cherry picked from commit c5aca0c27b)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
I implemented a simple PCA wrapper to make it easy to transform vectors by PCA, for example within a LabeledPoint or another more complicated structure.
Example of usage:
```
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.feature.PCA
val data = sc.textFile("data/mllib/ridge-data/lpsa.data").map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

val pca = PCA.create(training.first().features.size / 2, data.map(_.features))
val training_pca = training.map(p => p.copy(features = pca.transform(p.features)))
val test_pca = test.map(p => p.copy(features = pca.transform(p.features)))

val numIterations = 100
val model = LinearRegressionWithSGD.train(training, numIterations)
val model_pca = LinearRegressionWithSGD.train(training_pca, numIterations)

val valuesAndPreds = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
val valuesAndPreds_pca = test_pca.map { point =>
  val score = model_pca.predict(point.features)
  (score, point.label)
}

val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
val MSE_pca = valuesAndPreds_pca.map { case (v, p) => math.pow(v - p, 2) }.mean()

println("Mean Squared Error = " + MSE)
println("PCA Mean Squared Error = " + MSE_pca)
```
Author: Kirill A. Korinskiy <catap@catap.ru>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #4304 from catap/pca and squashes the following commits:
501bcd9 [Joseph K. Bradley] Small updates: removed k from Java-friendly PCA fit(). In PCASuite, converted results to set for comparison. Added an error message for bad k in PCA.
9dcc02b [Kirill A. Korinskiy] [SPARK-5521] fix scala style
1892a06 [Kirill A. Korinskiy] [SPARK-5521] PCA wrapper for easy transform vectors
(cherry picked from commit 8c07c75c98)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Fixes bug with PySpark cvModel not having UID
Also made small PySpark fixes: Evaluator should inherit from Params. MockModel should inherit from Model.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #5968 from jkbradley/pyspark-cv-uid and squashes the following commits:
57f13cd [Joseph K. Bradley] Made CrossValidatorModel call parent init in PySpark
(cherry picked from commit 3038443e58)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Closes #6038 from liancheng/fix-typo and squashes the following commits:
572c2a4 [Cheng Lian] Fixes variable name typo
(cherry picked from commit 6bf9352fa5)
Signed-off-by: Cheng Lian <lian@databricks.com>
The issue appears when one tries to create a DataFrame using a sqlContext.load("jdbc", ...) statement whose "dbtable" contains a query with renamed columns.
If the original column is used in the SQL query once, the resulting DataFrame will contain the non-renamed column.
If the original column is used in the SQL query several times with different aliases, sqlContext.load will fail.
The original implementation of JDBCRDD.resolveTable uses getColumnName to detect column names in the RDD schema.
The suggested implementation uses getColumnLabel instead, which is aware of the SQL AS clause and therefore handles column renames in the SQL statement.
Readings:
http://stackoverflow.com/questions/4271152/getcolumnlabel-vs-getcolumnname
http://stackoverflow.com/questions/12259829/jdbc-getcolumnname-getcolumnlabel-db2
The official documentation is unfortunately a bit misleading in its definition of the "suggested title", but it clearly defines the behavior of the AS keyword in SQL statements.
http://docs.oracle.com/javase/7/docs/api/java/sql/ResultSetMetaData.html
getColumnLabel - Gets the designated column's suggested title for use in printouts and displays. The suggested title is usually specified by the SQL AS clause. If a SQL AS is not specified, the value returned from getColumnLabel will be the same as the value returned by the getColumnName method.
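As an illustration, here is a minimal sketch (with hypothetical table and column names) of the distinction the fix relies on; exact behavior varies by driver, but drivers such as DB2's report the underlying column name from getColumnName and the alias from getColumnLabel:
```scala
import java.sql.ResultSet

// For a query like "SELECT name AS n1, name AS n2 FROM people", getColumnName
// may return "name" for both columns, while getColumnLabel returns the
// aliases "n1" and "n2"; this is why resolveTable should use the label.
def describeColumns(rs: ResultSet): Seq[(String, String)] = {
  val meta = rs.getMetaData
  (1 to meta.getColumnCount).map { i =>
    (meta.getColumnName(i), meta.getColumnLabel(i))
  }
}
```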
Author: Oleg Sidorkin <oleg.sidorkin@gmail.com>
Closes #6032 from osidorkin/master and squashes the following commits:
10fc44b [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite (resolved scala style test error)
2aaf6f7 [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite (renamed fields in JDBC query)
b7d5b22 [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite
09559a0 [Oleg Sidorkin] [SPARK-7345][SQL] Spark cannot detect renamed columns using JDBC connector
(cherry picked from commit d7a37bcaf1)
Signed-off-by: Reynold Xin <rxin@databricks.com>
jira: https://issues.apache.org/jira/browse/SPARK-7475
Add a new argument to specify the algorithm applied to LDA, to exhibit the basic usage of LDAOptimizer.
cc jkbradley
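For reference, a minimal sketch of selecting the optimizer on MLlib's LDA (assuming the setOptimizer overload that takes an optimizer name):
```scala
import org.apache.spark.mllib.clustering.LDA

val lda = new LDA()
  .setK(10)
  .setMaxIterations(20)
  .setOptimizer("online") // OnlineLDAOptimizer; "em" selects the original EMLDAOptimizer
```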
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #6000 from hhbyyh/ldaExample and squashes the following commits:
0a7e2bc [Yuhao Yang] fix according to comments
5810b0f [Yuhao Yang] adjust ldaExample for online LDA
(cherry picked from commit b13162b364)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Author: tedyu <yuzhihong@gmail.com>
Closes #6031 from tedyu/master and squashes the following commits:
5c2580c [tedyu] Reference fasterxml.jackson.version in sql/core/pom.xml
ff2a44f [tedyu] Merge branch 'master' of github.com:apache/spark
28c8394 [tedyu] Upgrade version of jackson-databind in sql/core/pom.xml
(cherry picked from commit bd74301ff8)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Currently the version of jackson-databind in sql/core/pom.xml is 2.3.0, which is older than the version specified in the root pom.xml.
This PR upgrades the version in sql/core/pom.xml so that they're consistent.
Author: tedyu <yuzhihong@gmail.com>
Closes #6028 from tedyu/master and squashes the following commits:
28c8394 [tedyu] Upgrade version of jackson-databind in sql/core/pom.xml
(cherry picked from commit 3071aac387)
Signed-off-by: Michael Armbrust <michael@databricks.com>
A small fix for a wrong URL in the API documentation (org.apache.spark.streaming.scheduler.StreamingListener).
Author: dobashim <dobashim@oss.nttdata.co.jp>
Closes #6024 from dobashim/master and squashes the following commits:
ac9a955 [dobashim] [STREAMING][DOCS] Fix wrong url about API docs of StreamingListener
(cherry picked from commit 7d0f17208c)
Signed-off-by: Sean Owen <sowen@cloudera.com>
When we use Spark on YARN and access AllJobsPage via the ResourceManager's proxy, the link URL in the objects that represent each job on the timeline view is wrong.
In timeline-view.js, the link is generated as follows.
```
window.location.href = "job/?id=" + getJobId(this);
```
This assumes the URL displayed in the web browser ends with "jobs/", but when we access AllJobsPage via the proxy, the displayed URL does not end with "jobs/".
The proxy doesn't return status code 301 or 302, so the displayed URL still indicates the base URL, not "/jobs", even though AllJobsPage is being displayed.
![2015-05-07 3 34 37](https://cloud.githubusercontent.com/assets/4736016/7501079/a8507ad6-f46c-11e4-9bed-62abea170f4c.png)
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #5947 from sarutak/fix-link-in-timeline and squashes the following commits:
aaf40e1 [Kousuke Saruta] Added Copyright for vis.js
01bee7b [Kousuke Saruta] Fixed timeline-view.js in order to get correct href
(cherry picked from commit 12b95abc70)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Author: Vinod K C <vinod.kc@huawei.com>
Closes #5974 from vinodkc/fix_countApproxDistinct_Validation and squashes the following commits:
3a3d59c [Vinod K C] Reverted removal of validation relativeSD<0.000017
799976e [Vinod K C] Removed testcase to assert IAE when relativeSD>3.7
8ddbfae [Vinod K C] Remove blank line
b1b00a3 [Vinod K C] Removed relativeSD validation from python API,RDD.scala will do validation
122d378 [Vinod K C] Fixed validation of relativeSD in countApproxDistinct
(cherry picked from commit dda6d9f404)
Signed-off-by: Sean Owen <sowen@cloudera.com>
In SPARK-7429 and PR https://github.com/apache/spark/pull/5960, I added the varargs annotation to Params.setDefault, which takes a variable number of ParamPairs. It worked locally and on Jenkins for me.
However, mengxr reported issues compiling on his machine, so I'm reverting the change introduced in https://github.com/apache/spark/pull/5960 by removing varargs.
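For context, a minimal sketch (hypothetical class, not Spark's actual Params) of what the varargs annotation does: it makes a Scala repeated parameter callable from Java as a true varargs method.
```scala
import scala.annotation.varargs

case class ParamPair[T](name: String, value: T)

class Params {
  // @varargs generates a Java-friendly overload taking ParamPair...;
  // removing the annotation (this revert) leaves only the Seq-based method.
  @varargs
  def setDefault(pairs: ParamPair[_]*): this.type = {
    pairs.foreach(p => println(s"default ${p.name} = ${p.value}"))
    this
  }
}
```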
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #6021 from jkbradley/revert-varargs and squashes the following commits:
098ed39 [Joseph K. Bradley] removed varargs annotation from Params.setDefaults taking multiple ParamPairs
(cherry picked from commit 2992623841)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
1) Handle scaling and addBias internally.
2) L1/L2 elastic net using the OWLQN optimizer (see the sketch below).
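A minimal sketch of what the elastic-net support looks like from the user's side, assuming Spark's ml.classification API: elasticNetParam = 0.0 is pure L2, 1.0 is pure L1, and values in between mix the two (solved with OWLQN).
```scala
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.1)        // overall regularization strength
  .setElasticNetParam(0.5) // 0.5 * L1 + 0.5 * L2
```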
Author: DB Tsai <dbt@netflix.com>
Closes #5967 from dbtsai/lor and squashes the following commits:
fa029bb [DB Tsai] made the bound smaller
0806002 [DB Tsai] better initial intercept and more test
5c31824 [DB Tsai] fix import
c387e25 [DB Tsai] Merge branch 'master' into lor
c84e931 [DB Tsai] Made MultiClassSummarizer private
f98e711 [DB Tsai] address feedback
a784321 [DB Tsai] fix style
8ec65d2 [DB Tsai] remove new line
f3f8c88 [DB Tsai] add more tests and they match R which is good. fix a bug
34705bc [DB Tsai] first commit
(cherry picked from commit 86ef4cfd43)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
This patch refactors the SQL `Exchange` operator's logic for determining whether map outputs need to be copied before being shuffled. As part of this change, we'll now avoid unnecessary copies in cases where sort-based shuffle operates on serialized map outputs (as in #4450 / SPARK-4550).
This patch also includes a change to copy the input to RangePartitioner partition bounds calculation, which is necessary because this calculation buffers mutable Java objects.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #5948 from JoshRosen/SPARK-7375 and squashes the following commits:
f305ff3 [Josh Rosen] Reduce scope of some variables in Exchange
899e1d7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-7375
6a6bfce [Josh Rosen] Fix issue related to RangePartitioning:
ad006a4 [Josh Rosen] [SPARK-7375] Avoid defensive copying in exchange operator when sort.serializeMapOutputs takes effect.
(cherry picked from commit cde5483884)
Signed-off-by: Yin Huai <yhuai@databricks.com>
Changes include
1. Rename sortDF to arrange
2. Add new aliases `group_by` and `sample_frac`, `summarize`
3. Add more user friendly column addition (mutate), rename
4. Support mean as an alias for avg in Scala and also support n_distinct, n as in dplyr
Using these changes we can pretty much run the examples as described in http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html with the same syntax
The only thing missing in SparkR is auto-resolving column names when used in an expression, i.e. making something like `select(flights, delay)` work as it does in dplyr; right now we need `select(flights, flights$delay)` or `select(flights, "delay")`. But this is a complicated change and I'll file a new issue for it.
cc sun-rui rxin
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #6005 from shivaram/sparkr-df-api and squashes the following commits:
5e0716a [Shivaram Venkataraman] Fix some roxygen bugs
1254953 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into sparkr-df-api
0521149 [Shivaram Venkataraman] Changes to make SparkR DataFrame dplyr friendly. Changes include 1. Rename sortDF to arrange 2. Add new aliases `group_by` and `sample_frac`, `summarize` 3. Add more user friendly column addition (mutate), rename 4. Support mean as an alias for avg in Scala and also support n_distinct, n as in dplyr
(cherry picked from commit 0a901dd3a1)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Added a check to handle container exit status for the preemption scenario, log an INFO message in such cases and move on.
andrewor14
Author: Ashwin Shankar <ashankar@netflix.com>
Closes #5993 from ashwinshankar77/SPARK-7451 and squashes the following commits:
90900cf [Ashwin Shankar] Fix log info message
cf8b6cf [Ashwin Shankar] Stop counting preemption of executors as failure
(cherry picked from commit b6c797b08c)
Signed-off-by: Sandy Ryza <sandy@cloudera.com>
Adds a Python API for `ALS` under `ml.recommendation` in PySpark. Also adds seed as a settable parameter in the Scala implementation of ALS.
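A minimal sketch (assuming the ml.recommendation API this PR extends) of the Scala side, with the newly settable seed:
```scala
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setRank(10)
  .setMaxIter(10)
  .setSeed(42L) // reproducible factor initialization
```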
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #6015 from brkyvz/ml-rec and squashes the following commits:
be6e931 [Burak Yavuz] addressed comments
eaed879 [Burak Yavuz] readd numFeatures
0bd66b1 [Burak Yavuz] fixed seed
7f6d964 [Burak Yavuz] merged master
52e2bda [Burak Yavuz] added ALS
(cherry picked from commit 84bf931f36)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Author: tedyu <yuzhihong@gmail.com>
Closes #5959 from ted-yu/master and squashes the following commits:
f83d445 [tedyu] Move cleaning outside of mapPartitionsWithIndex
56d7c92 [tedyu] Consolidate import of Random
f6014c0 [tedyu] Remove cleaning in RDD#filterWith
36feb6c [tedyu] Try to get correct syntax
55d01eb [tedyu] Try to get correct syntax
c2786df [tedyu] Correct syntax
d92bfcf [tedyu] Correct syntax in test
164d3e4 [tedyu] Correct variable name
8b50d93 [tedyu] Address Andrew's review comments
0c8d47e [tedyu] Add test for mapWith()
6846e40 [tedyu] Add test for flatMapWith()
6c124a9 [tedyu] Clean function in several RDD methods
(cherry picked from commit 54e6fa0563)
Signed-off-by: Andrew Or <andrew@databricks.com>
The DAG visualization currently displays only low-level Spark primitives (e.g. `map`, `reduceByKey`, `filter` etc.). For SQL, these aren't particularly useful. Instead, we should display higher level physical operators (e.g. `Filter`, `Exchange`, `ShuffleHashJoin`). cc marmbrus
-----------------
**Before**
<img src="https://issues.apache.org/jira/secure/attachment/12731586/before.png" width="600px"/>
-----------------
**After** (Pay attention to the words)
<img src="https://issues.apache.org/jira/secure/attachment/12731587/after.png" width="600px"/>
-----------------
Author: Andrew Or <andrew@databricks.com>
Closes #5999 from andrewor14/dag-viz-sql and squashes the following commits:
0db23a4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-sql
1e211db [Andrew Or] Update comment
0d49fd6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-sql
ffd237a [Andrew Or] Fix style
202dac1 [Andrew Or] Make ignoreParent false by default
e61b1ab [Andrew Or] Visualize SQL operators, not low-level Spark primitives
569034a [Andrew Or] Add a flag to ignore parent settings and scopes
(cherry picked from commit bd61f07039)
Signed-off-by: Andrew Or <andrew@databricks.com>
Currently we're doing port retries at the TransportServer level, but this is not specified by the TransportContext API, and it has other further-reaching impacts like causing undesirable behavior for the YARN and Standalone shuffle services.
Author: Aaron Davidson <aaron@databricks.com>
Closes #5575 from aarondav/port-bind and squashes the following commits:
3c2d6ed [Aaron Davidson] Oops, never do it.
a5d9432 [Aaron Davidson] Remove shouldHostShuffleServiceIfEnabled
e901eb2 [Aaron Davidson] fix local-cluster mode for ExternalShuffleServiceSuite
59e5e38 [Aaron Davidson] [SPARK-6955] Perform port retries at NettyBlockTransferService level
(cherry picked from commit ffdc40ce7a)
Signed-off-by: Andrew Or <andrew@databricks.com>
Add a Python API for mllib.feature.ChiSqSelector
https://issues.apache.org/jira/browse/SPARK-5913
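A minimal sketch of the existing Scala API that the new Python wrapper mirrors; `labeledPoints` is a hypothetical RDD[LabeledPoint] with categorical features.
```scala
import org.apache.spark.mllib.feature.ChiSqSelector

// Keep the 50 features most predictive of the label by the chi-squared test.
val selector = new ChiSqSelector(numTopFeatures = 50)
val model = selector.fit(labeledPoints)
val filtered = labeledPoints.map(lp => lp.copy(features = model.transform(lp.features)))
```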
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #5939 from yanboliang/spark-5913 and squashes the following commits:
cdaac99 [Yanbo Liang] Python API for ChiSqSelector
(cherry picked from commit 35c9599b94)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
JIRA: https://issues.apache.org/jira/browse/SPARK-7390
Also fix a minor typo.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #5931 from viirya/fix_covariancecounter and squashes the following commits:
352eda6 [Liang-Chi Hsieh] Only merge other CovarianceCounter when its count is greater than zero.
(cherry picked from commit 90527f5604)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
The code was treating deep links as if they were attempt IDs, so
for example if you tried to load "/history/app1/jobs" directly,
that would fail because the code would treat "jobs" as an attempt id.
This change modifies the code to try both cases - first without an
attempt id, then with it, so that deep links are handled correctly.
This assumes that the links in the Spark UI do not clash with the
attempt id namespace, though, which is the case for YARN at least,
which is the only backend that currently publishes attempt IDs.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #5922 from vanzin/SPARK-7378 and squashes the following commits:
96f648b [Marcelo Vanzin] Fix comparison.
ed3bcd4 [Marcelo Vanzin] Merge branch 'master' into SPARK-7378
23483e4 [Marcelo Vanzin] Fat fingers.
b728f08 [Marcelo Vanzin] [SPARK-7378] [core] Handle deep links to unloaded apps.
(cherry picked from commit 5467c34c3d)
Signed-off-by: Andrew Or <andrew@databricks.com>
Order of initialization code was wrong.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #5998 from vanzin/hs-conf-fix and squashes the following commits:
00b6b6b [Marcelo Vanzin] [minor] [core] Allow History Server to read kerberos opts from config file.
(cherry picked from commit 9042f8f378)
Signed-off-by: Andrew Or <andrew@databricks.com>
The JVM is free to collect references to variables that no longer participate in a computation. This simple patch adds an operation to the variable 'rdd' to ensure it is not collected early in the test suite's explicit calls to GC.
ref: http://bugs.java.com/view_bug.do?bug_id=6721588
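A minimal sketch of the pattern (a hypothetical shell-style test body, not the suite's actual code): add a use of `rdd` after the explicit GC so an aggressive JVM cannot prove the local reference dead and collect it early.
```scala
val rdd = sc.parallelize(1 to 100).cache()
rdd.count()

System.gc()
// ... assertions that expect rdd's state to still be tracked ...

rdd.count() // keeps `rdd` reachable across the GC call above
```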
Author: Tim Ellison <t.p.ellison@gmail.com>
Closes #6010 from tellison/master and squashes the following commits:
77d1c8f [Tim Ellison] Defeat early garbage collection of test suite variable by aggressive JVMs
(cherry picked from commit 31da40dfee)
Signed-off-by: Andrew Or <andrew@databricks.com>
The Spark shell crashes when compiled with Scala 2.11 and SPARK_PREPEND_CLASSES=true.
There is a similar resolved JIRA issue, SPARK-7470, and a PR https://github.com/apache/spark/pull/5997, which handled the same issue only in Scala 2.10.
Author: vinodkc <vinod.kc.in@gmail.com>
Closes #6013 from vinodkc/fix_sqlcontext_exception_scala_2.11 and squashes the following commits:
119061c [vinodkc] Spark shell crashes when compiled with scala 2.11
(cherry picked from commit 4e7360e12d)
Signed-off-by: Andrew Or <andrew@databricks.com>
`vis.min.js` refers to `vis.map`, which in turn refers to `vis.js`; it is used for debugging `vis.js`, but this debug feature is not needed for Spark itself.
This issue is really minor, so I didn't file it in JIRA.
/CC andrewor14
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #5994 from sarutak/remove-debug-feature-for-vis and squashes the following commits:
8be038f [Kousuke Saruta] Remove vis.map entry from .rat-exclude
7404945 [Kousuke Saruta] Removed debug feature for vis.js
(cherry picked from commit c45c09b015)
Signed-off-by: Andrew Or <andrew@databricks.com>
Add `python/lib/pyspark.zip` to `.gitignore`. After merging #5580, `python/lib/pyspark.zip` will be generated when building Spark.
Author: zsxwing <zsxwing@gmail.com>
Closes #6017 from zsxwing/gitignore and squashes the following commits:
39b10c4 [zsxwing] Ignore python/lib/pyspark.zip
(cherry picked from commit dc71e47f04)
Signed-off-by: Andrew Or <andrew@databricks.com>
GZIPInputStream allocates native memory that is not freed until close() or
when the finalizer runs. It is best to close() these streams explicitly.
stephenh made the same change for serializeMapStatuses in commit b0d884f0. This is the same change for deserialize.
(I ran the unit test suite! It seems to have passed. I did not file a JIRA since this seems "trivial", and the guidelines suggest one is not required for trivial changes.)
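A minimal sketch of the pattern the fix applies, assuming the usual java.util.zip API: close the stream explicitly instead of waiting for the finalizer to release its native zlib memory.
```scala
import java.io.{ByteArrayInputStream, ObjectInputStream}
import java.util.zip.GZIPInputStream

def deserialize[T](bytes: Array[Byte]): T = {
  val in = new ObjectInputStream(new GZIPInputStream(new ByteArrayInputStream(bytes)))
  try {
    in.readObject().asInstanceOf[T]
  } finally {
    in.close() // also closes the wrapped GZIPInputStream, freeing native memory now
  }
}
```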
Author: Evan Jones <ejones@twitter.com>
Closes #5982 from evanj/master and squashes the following commits:
0d76e85 [Evan Jones] [CORE] MapOutputTracker.deserializeMapStatuses: close input streams
(cherry picked from commit 25889d8d97)
Signed-off-by: Sean Owen <sowen@cloudera.com>
The previous cleanup-commit for SPARK-6627 renamed ShuffleBlockManager
to ShuffleBlockResolver, but didn't rename the associated subclasses and
variables; this commit does that.
I'm unsure whether it's ok to rename ExternalShuffleBlockManager, since that's technically a public class?
cc pwendell
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes #5764 from kayousterhout/SPARK-6627 and squashes the following commits:
43add1e [Kay Ousterhout] Spacing fix
96080bf [Kay Ousterhout] Test fixes
d8a5d36 [Kay Ousterhout] [SPARK-6627] Finished rename to ShuffleBlockResolver
(cherry picked from commit 4b3bb0e43c)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
It's the first step: generalize UnresolvedGetField to support map, struct, and array.
TODO: add `apply` in Scala and `__getitem__` in Python, and unify the `getItem` and `getField` methods into one single API (or should we keep them for compatibility?).
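A shell-style sketch (hypothetical data, assuming an available sqlContext) of the accessors this change generalizes: getItem for arrays and maps, getField for struct fields.
```scala
import org.apache.spark.sql.functions.col

val df = sqlContext.createDataFrame(Seq(
  (Seq(1, 2, 3), Map("a" -> 1), ("x", 42))
)).toDF("arr", "m", "s")

df.select(
  col("arr").getItem(0),   // array element by index
  col("m").getItem("a"),   // map value by key
  col("s").getField("_2")  // struct field by name
).show()
```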
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #5744 from cloud-fan/generalize and squashes the following commits:
715c589 [Wenchen Fan] address comments
7ea5b31 [Wenchen Fan] fix python test
4f0833a [Wenchen Fan] add python test
f515d69 [Wenchen Fan] add apply method and test cases
8df6199 [Wenchen Fan] fix python test
239730c [Wenchen Fan] fix test compile
2a70526 [Wenchen Fan] use _bin_op in dataframe.py
6bf72bc [Wenchen Fan] address comments
3f880c3 [Wenchen Fan] add java doc
ab35ab5 [Wenchen Fan] fix python test
b5961a9 [Wenchen Fan] fix style
c9d85f5 [Wenchen Fan] generalize UnresolvedGetField to support all map, struct, and array
(cherry picked from commit 2d05f325dc)
Signed-off-by: Michael Armbrust <michael@databricks.com>
- Colors on the timeline now match the rest of the UI
- The expandable buttons to show timeline view, DAG, etc are now more visible
- Timeline text is smaller
- DAG visualization text and colors are more consistent throughout
- Fix some JavaScript style issues
- Various small fixes throughout (e.g. inconsistent capitalization, some confusing names, HTML escaping, etc)
Author: Matei Zaharia <matei@databricks.com>
Closes #5942 from mateiz/ui and squashes the following commits:
def38d0 [Matei Zaharia] Add some tooltips
4c5a364 [Matei Zaharia] Reduce stage and rank separation slightly
43dcbe3 [Matei Zaharia] Some updates to DAG
fac734a [Matei Zaharia] tweaks
6a6705d [Matei Zaharia] More fixes
67629f5 [Matei Zaharia] Various small tweaks
(cherry picked from commit a1ec08f7ed)
Signed-off-by: Matei Zaharia <matei@databricks.com>
Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
Closes #5976 from jacek-lewandowski/SPARK-7436-1.4 and squashes the following commits:
6298313 [Jacek Lewandowski] SPARK-7436: Fixed instantiation of custom recovery mode factory and added tests
This patch also removes the RDD docs from being built as part of roxygen, simply by deleting the `'` of the `#'` roxygen comment markers.
Author: hqzizania <qian.huang@intel.com>
Author: qhuang <qian.huang@intel.com>
Closes #5969 from hqzizania/R1 and squashes the following commits:
6d27696 [qhuang] fixes in NAMESPACE
eb4b095 [qhuang] remove more docs
6394579 [qhuang] remove RDD docs in generics.R
6813860 [hqzizania] Fill the docs for DataFrame API in SparkR
857220f [hqzizania] remove the pairRDD docs from being built as a part of roxygen
c045d64 [hqzizania] remove the RDD docs from being built as a part of roxygen
(cherry picked from commit 008a60dd37)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Multiline commands are properly handled in this PR. oefirouz
![screen shot 2015-05-07 at 10 53 25 pm](https://cloud.githubusercontent.com/assets/829644/7531290/02ad2fd4-f50c-11e4-8c04-e58d1a61ad69.png)
Author: Xiangrui Meng <meng@databricks.com>
Closes #6001 from mengxr/SPARK-7474 and squashes the following commits:
b94b11d [Xiangrui Meng] update ParamGridBuilder doctest
(cherry picked from commit 65afd3ce8b)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Implemented Python wrappers for Scala functions that don't yet exist in PySpark's `ml.feature`.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5991 from brkyvz/ml-feat-PR and squashes the following commits:
adcca55 [Burak Yavuz] add regex tokenizer to __all__
b91cb44 [Burak Yavuz] addressed comments
bd39fd2 [Burak Yavuz] remove addition
b82bd7c [Burak Yavuz] Parity in PySpark for ml.features
(cherry picked from commit f5ff4a84c4)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Exposes data available in the UI as json over http. Key points:
* new endpoints, handled independently of existing XyzPage classes. Root entrypoint is `JsonRootResource`
* Uses jersey + jackson for routing & converting POJOs into json
* tests against known results in `HistoryServerSuite`
* also fixes some minor issues w/ the UI -- synchronizing on access to `StorageListener` & `StorageStatusListener`, and fixing some inconsistencies w/ the way we handle retained jobs & stages.
Author: Imran Rashid <irashid@cloudera.com>
Closes #5940 from squito/SPARK-3454_better_test_files and squashes the following commits:
1a72ed6 [Imran Rashid] rats
85fdb3e [Imran Rashid] Merge branch 'no_php' into SPARK-3454
1fc65b0 [Imran Rashid] Revert "Revert "[SPARK-3454] separate json endpoints for data in the UI""
1276900 [Imran Rashid] get rid of giant event file, replace w/ smaller one; check both shuffle read & shuffle write
4e12013 [Imran Rashid] just use test case name for expectation file name
863ef64 [Imran Rashid] rename json files to avoid strange file names and not look like php
(cherry picked from commit c796be70f3)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
Based on https://github.com/apache/spark/pull/5478, which provides a PYSPARK_ARCHIVES_PATH env variable. With this PR, we just need to export PYSPARK_ARCHIVES_PATH=/user/spark/pyspark.zip,/user/spark/python/lib/py4j-0.8.2.1-src.zip in conf/spark-env.sh when we don't install PySpark on each node of YARN. I ran a Python application successfully on yarn-client and yarn-cluster with this PR.
andrewor14 sryza Sephiroth-Lin Can you take a look at this? Thanks.
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Closes #5580 from lianhuiwang/SPARK-6869 and squashes the following commits:
66ffa43 [Lianhui Wang] Update Client.scala
c2ad0f9 [Lianhui Wang] Update Client.scala
1c8f664 [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
008850a [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
f0b4ed8 [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
150907b [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
20402cd [Lianhui Wang] use ZipEntry
9d87c3f [Lianhui Wang] update scala style
e7bd971 [Lianhui Wang] address vanzin's comments
4b8a3ed [Lianhui Wang] use pyArchivesEnvOpt
e6b573b [Lianhui Wang] address vanzin's comments
f11f84a [Lianhui Wang] zip pyspark archives
5192cca [Lianhui Wang] update import path
3b1e4c8 [Lianhui Wang] address tgravescs's comments
9396346 [Lianhui Wang] put zip to make-distribution.sh
0d2baf7 [Lianhui Wang] update import paths
e0179be [Lianhui Wang] add zip pyspark archives in build or sparksubmit
31e8e06 [Lianhui Wang] update code style
9f31dac [Lianhui Wang] update code and add comments
f72987c [Lianhui Wang] add archives path to PYTHONPATH
(cherry picked from commit ebff7327af)
Signed-off-by: Thomas Graves <tgraves@apache.org>
Added a new batch named `Substitution` before the `Resolution` batch. The motivation for this is that there are kinds of cases where we want to do some substitution on the parsed logical plan before resolving it.
Consider these two cases:
1. CTE: for a CTE we first build a raw logical plan:
```
'With Map(q1 -> 'Subquery q1
                  'Project ['key]
                   'UnresolvedRelation [src], None)
 'Project [*]
  'Filter ('key = 5)
   'UnresolvedRelation [q1], None
```
In the `With` logical plan there is a map storing (`q1 -> subquery`); we first want to take off the With command and substitute the `q1` of `UnresolvedRelation` with the `subquery` (a SQL sketch of this case follows below).
2. Another example is window functions: the user may define some windows, and we also need to substitute the window name in the child with the concrete window definition. This should also be done in the Substitution batch.
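A SQL sketch of the CTE case (hypothetical table `src`): before resolution, the Substitution batch replaces `UnresolvedRelation [q1]` with the subquery defined in the WITH clause.
```scala
val df = sqlContext.sql(
  """WITH q1 AS (SELECT key FROM src)
    |SELECT * FROM q1 WHERE key = 5
  """.stripMargin)
```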
Author: wangfei <wangfei1@huawei.com>
Closes #5776 from scwf/addbatch and squashes the following commits:
d4b962f [wangfei] added WindowsSubstitution
70f6932 [wangfei] Merge branch 'master' of https://github.com/apache/spark into addbatch
ecaeafb [wangfei] address yhuai's comments
553005a [wangfei] fix test case
0c54798 [wangfei] address comments
29aaaaf [wangfei] fix compile
1c9a092 [wangfei] added Substitution bastch
(cherry picked from commit f496bf3c53)
Signed-off-by: Yin Huai <yhuai@databricks.com>
This only happens if you have `SPARK_PREPEND_CLASSES` set. Then I built it with `build/sbt clean assembly compile` and just ran it with `bin/spark-shell`.
```
...
15/05/07 17:07:30 INFO EventLoggingListener: Logging events to file:/tmp/spark-events/local-1431043649919
15/05/07 17:07:30 INFO SparkILoop: Created spark context..
Spark context available as sc.
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2493)
at java.lang.Class.getConstructor0(Class.java:2803)
...
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 52 more
<console>:10: error: not found: value sqlContext
import sqlContext.implicits._
^
<console>:10: error: not found: value sqlContext
import sqlContext.sql
^
```
yhuai marmbrus
Author: Andrew Or <andrew@databricks.com>
Closes #5997 from andrewor14/sql-shell-crash and squashes the following commits:
61147e6 [Andrew Or] Also expect NoClassDefFoundError
(cherry picked from commit 714db2ef52)
Signed-off-by: Yin Huai <yhuai@databricks.com>