add docs for https://issues.apache.org/jira/browse/SPARK-6994
Author: vidmantas zemleris <vidmantas@vinted.com>
Closes#6030 from vidma/docs/row-with-named-fields and squashes the following commits:
241b401 [vidmantas zemleris] [SPARK-6994][SQL] Update docs for fetching Row fields by name
Author: Reynold Xin <rxin@databricks.com>
Closes#6071 from rxin/parserdialect and squashes the following commits:
ca2eb31 [Reynold Xin] Rename Dialect -> ParserDialect.
Author: Reynold Xin <rxin@databricks.com>
Closes#6068 from rxin/drop-column and squashes the following commits:
9d7d5ec [Reynold Xin] [SPARK-7509][SQL] DataFrame.drop in Python for dropping columns.
This is a follow up of #5876 and should be merged after #5876.
Let's wait for unit testing result from Jenkins.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#5963 from chenghao-intel/useIsolatedClient and squashes the following commits:
f87ace6 [Cheng Hao] remove the TODO and add `resolved condition` for HiveTable
a8260e8 [Cheng Hao] Update code as feedback
f4e243f [Cheng Hao] remove the serde setting for SequenceFile
d166afa [Cheng Hao] style issue
d25a4aa [Cheng Hao] Add SerDe support for CTAS
This should also close https://github.com/apache/spark/pull/5870
Author: Reynold Xin <rxin@databricks.com>
Closes#6066 from rxin/dropDups and squashes the following commits:
130692f [Reynold Xin] [SPARK-7324][SQL] DataFrame.dropDuplicates
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893).
One thing to make clear, the `buckets` parameter, which is an array of `Double`, performs as split points. Say,
```scala
buckets = Array(-0.5, 0.0, 0.5)
```
splits the real number into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which is encoded as 0, 1, 2, 3.
Author: Xusen Yin <yinxusen@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#5980 from yinxusen/SPARK-5893 and squashes the following commits:
dc8c843 [Xusen Yin] Merge pull request #4 from jkbradley/yinxusen-SPARK-5893
1ca973a [Joseph K. Bradley] one more bucketizer test
34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead.
eacfcfa [Xusen Yin] change ML attribute from splits into buckets
c3cc770 [Xusen Yin] add more unit test for binary search
3a16cc2 [Xusen Yin] refine comments and names
ac77859 [Xusen Yin] fix style error
fb30d79 [Xusen Yin] fix and test binary search
2466322 [Xusen Yin] refactor Bucketizer
11fb00a [Xusen Yin] change it into an Estimator
998bc87 [Xusen Yin] check buckets
4024cf1 [Xusen Yin] add test suite
5fe190e [Xusen Yin] add bucketizer
So users that are interested in this can track it easily.
Author: Reynold Xin <rxin@databricks.com>
Closes#6067 from rxin/SPARK-7550 and squashes the following commits:
ee0e34c [Reynold Xin] Updated DataFrame.saveAsTable Hive warning to include SPARK-7550 ticket.
Author: Reynold Xin <rxin@databricks.com>
Closes#6062 from rxin/agg-retain-doc and squashes the following commits:
43e511e [Reynold Xin] [SPARK-7462][SQL] Update documentation for retaining grouping columns in DataFrames.
Author: madhukar <phatak.dev@gmail.com>
Closes#5654 from phatak-dev/master and squashes the following commits:
386f407 [madhukar] #5654 updated for all the methods
2c997c5 [madhukar] Merge branch 'master' of https://github.com/apache/spark
00bc819 [madhukar] Merge branch 'master' of https://github.com/apache/spark
2a802c6 [madhukar] #5654 updated the doc according to comments
866e8df [madhukar] [SPARK-7084] improve saveAsTable documentation
As a follow-up to https://github.com/apache/spark/pull/5944
Author: Reynold Xin <rxin@databricks.com>
Closes#6064 from rxin/jointype-better-error and squashes the following commits:
7629bf7 [Reynold Xin] [SQL] Show better error messages for incorrect join types in DataFrames.
Author: Sean Owen <sowen@cloudera.com>
Closes#6063 from srowen/FixRunningTestsLink and squashes the following commits:
db62018 [Sean Owen] Fix the link to test building info on the wiki
Currently there's no chance to close the file correctly after the iteration is finished, change to `CompletionIterator` to avoid resource leakage.
Author: jerryshao <saisai.shao@intel.com>
Closes#6050 from jerryshao/close-file-correctly and squashes the following commits:
52dfaf5 [jerryshao] Close files correctly when iterator is finished
JIRA: https://issues.apache.org/jira/browse/SPARK-7516
In sql-programming-guide, deprecated python data frame api inferSchema() should be replaced by createDataFrame():
schemaPeople = sqlContext.inferSchema(people) ->
schemaPeople = sqlContext.createDataFrame(people)
Author: gchen <chenguancheng@gmail.com>
Closes#6041 from gchen/python-docs and squashes the following commits:
c27eb7c [gchen] replace inferSchema() with createDataFrame()
Now PySpark on YARN with cluster mode is supported so let's update doc.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#6040 from sarutak/update-doc-for-pyspark-on-yarn and squashes the following commits:
ad9f88c [Kousuke Saruta] Brushed up sentences
469fd2e [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into update-doc-for-pyspark-on-yarn
fcfdb92 [Kousuke Saruta] Updated doc for PySpark on YARN with cluster mode
Patch for SPARK-7508
This logs warn then generates a response which include the message body and stack trace as text/plain, no-cache. The exit code is 500.
In practise (in some tests in SPARK-1537 to be precise), jetty is getting in between this servlet and the web response the user sees —the body of the response is lost for any error response (500, even 404 and bad request). The standard Jetty handlers must be getting in the way.
This patch doesn't address that, it ensures that
1. if the jetty handlers were put to one side the users would see the errors
2. at least the exceptions appear in the server-side logs.
This is better to users saying "I saw a 500 error" and you not having anything in the logs to see what went wrong.
Author: Steve Loughran <stevel@hortonworks.com>
Closes#6033 from steveloughran/stevel/feature/SPARK-7508-JettyUtils and squashes the following commits:
584836f [Steve Loughran] SPARK-7508 drop trailing semicolon
ad6f185 [Steve Loughran] SPARK-7508: jetty handles exception reporting itself; spark just sets this up and logs exceptions before being relayed
258d9f9 [Steve Loughran] SPARK-7508 fix typo manually-edited before patch pushed
69c8263 [Steve Loughran] SPARK-7508 JettyUtils-generated servlets to log & report all errors
This is difficult to write a test for because it relies on the latest version of YARN, but I verified manually that the patch does pass along the label expression on this version and containers are successfully launched.
Author: Sandy Ryza <sandy@cloudera.com>
Closes#5242 from sryza/sandy-spark-6470 and squashes the following commits:
6af87b9 [Sandy Ryza] Change info to warning
6e22d99 [Sandy Ryza] [YARN] SPARK-6470. Add support for YARN node labels.
Updated Java, Scala, Python, and R.
Author: Reynold Xin <rxin@databricks.com>
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#5996 from rxin/groupby-retain and squashes the following commits:
aac7119 [Reynold Xin] Merge branch 'groupby-retain' of github.com:rxin/spark into groupby-retain
f6858f6 [Reynold Xin] Merge branch 'master' into groupby-retain
5f923c0 [Reynold Xin] Merge pull request #15 from shivaram/sparkr-groupby-retrain
c1de670 [Shivaram Venkataraman] Revert workaround in SparkR to retain grouped cols Based on reverting code added in commit 9a6be746ef
b8b87e1 [Reynold Xin] Fixed DataFrameJoinSuite.
d910141 [Reynold Xin] Updated rest of the files
1e6e666 [Reynold Xin] [SPARK-7462] By default retain group by columns in aggregate
Currently attempt to start a streamingContext while another one is started throws a confusing exception that the action name JobScheduler is already registered. Instead its best to throw a proper exception as it is not supported.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#5907 from tdas/SPARK-7361 and squashes the following commits:
fb81c4a [Tathagata Das] Fix typo
a9cd5bb [Tathagata Das] Added startSite to StreamingContext
5fdfc0d [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7361
5870e2b [Tathagata Das] Added check for multiple streaming contexts
As is, to specify this option on command line, you have to escape the angle brackets.
Author: Bryan Cutler <bjcutler@us.ibm.com>
Closes#6049 from BryanCutler/dataFormat-option-7522 and squashes the following commits:
b34afb4 [Bryan Cutler] [SPARK-7522] Removed angle brackets from dataFormat option
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#6044 from yanboliang/spark-6092 and squashes the following commits:
726a9b1 [Yanbo Liang] add newRankingMetrics
33f649c [Yanbo Liang] Add RankingMetrics in PySpark/MLlib
tdas
https://issues.apache.org/jira/browse/SPARK-7326
The problem most likely resides in DStream.slice() implementation, as shown below.
def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
if (!isInitialized) {
throw new SparkException(this + " has not been initialized")
}
if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration ("
+ slideDuration + ")")
}
if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
logWarning("toTime (" + fromTime + ") is not a multiple of slideDuration ("
+ slideDuration + ")")
}
val alignedToTime = toTime.floor(slideDuration, zeroTime)
val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
logInfo("Slicing from " + fromTime + " to " + toTime +
" (aligned to " + alignedFromTime + " and " + alignedToTime + ")")
alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
if (time >= zeroTime) getOrCompute(time) else None
})
}
Here after performing floor() on both fromTime and toTime, the result (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiple of the slidingDuration, thus making isTimeValid() check failed for all the remaining computation.
The fix is to add a new floor() function in Time.scala to respect the zeroTime while performing the floor :
def floor(that: Duration, zeroTime: Time): Time = {
val t = that.milliseconds
new Time(((this.millis - zeroTime.milliseconds) / t) * t + zeroTime.milliseconds)
}
And then change the DStream.slice to call this new floor function by passing in its zeroTime.
val alignedToTime = toTime.floor(slideDuration, zeroTime)
val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
This way the alignedToTime and alignedFromTime are *really* aligned in respect to zeroTime whose value is not really a 0.
Author: Wesley Miao <wesley.miao@gmail.com>
Author: Wesley <wesley.miao@autodesk.com>
Closes#5871 from wesleymiao/spark-7326 and squashes the following commits:
82a4d8c [Wesley Miao] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream dosen't work all the time
48b4dc0 [Wesley] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time
6ade399 [Wesley] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time
2611745 [Wesley Miao] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time
Bugs description:
1. There are extra commas on the top of session list.
2. The format of time in "Start at:" part is not the same as others.
3. The total number of online sessions is wrong.
Author: tianyi <tianyi.asiainfo@gmail.com>
Closes#6048 from tianyi/SPARK-7519 and squashes the following commits:
ed366b7 [tianyi] fix bug
Modified 2 files:
python/pyspark/ml/param/_shared_params_code_gen.py
python/pyspark/ml/param/shared.py
Generated shared.py on Linux using Python 2.6.6 on Redhat Enterprise Linux Server 6.6.
python _shared_params_code_gen.py > shared.py
Only changed maxIter, regParam, rawPredictionCol based on strings from SharedParamsCodeGen.scala. Note warning was displayed when committing shared.py:
warning: LF will be replaced by CRLF in python/pyspark/ml/param/shared.py.
Author: Glenn Weidner <gweidner@us.ibm.com>
Closes#6023 from gweidner/br-7427 and squashes the following commits:
db72e32 [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
825e4a9 [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
e6a865e [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
1eee702 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
1ac10e5 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
cafd104 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
9bea1eb [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
4a35c20 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
9790cbe [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
d9c30f4 [Glenn Weidner] [SPARK-7275] [SQL] [WIP] Make LogicalRelation public
I implement a simple PCA wrapper for easy transform of vectors by PCA for example LabeledPoint or another complicated structure.
Example of usage:
```
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.feature.PCA
val data = sc.textFile("data/mllib/ridge-data/lpsa.data").map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
val pca = PCA.create(training.first().features.size/2, data.map(_.features))
val training_pca = training.map(p => p.copy(features = pca.transform(p.features)))
val test_pca = test.map(p => p.copy(features = pca.transform(p.features)))
val numIterations = 100
val model = LinearRegressionWithSGD.train(training, numIterations)
val model_pca = LinearRegressionWithSGD.train(training_pca, numIterations)
val valuesAndPreds = test.map { point =>
val score = model.predict(point.features)
(score, point.label)
}
val valuesAndPreds_pca = test_pca.map { point =>
val score = model_pca.predict(point.features)
(score, point.label)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
val MSE_pca = valuesAndPreds_pca.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("Mean Squared Error = " + MSE)
println("PCA Mean Squared Error = " + MSE_pca)
```
Author: Kirill A. Korinskiy <catap@catap.ru>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4304 from catap/pca and squashes the following commits:
501bcd9 [Joseph K. Bradley] Small updates: removed k from Java-friendly PCA fit(). In PCASuite, converted results to set for comparison. Added an error message for bad k in PCA.
9dcc02b [Kirill A. Korinskiy] [SPARK-5521] fix scala style
1892a06 [Kirill A. Korinskiy] [SPARK-5521] PCA wrapper for easy transform vectors
Fixes bug with PySpark cvModel not having UID
Also made small PySpark fixes: Evaluator should inherit from Params. MockModel should inherit from Model.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#5968 from jkbradley/pyspark-cv-uid and squashes the following commits:
57f13cd [Joseph K. Bradley] Made CrossValidatorModel call parent init in PySpark
Issue appears when one tries to create DataFrame using sqlContext.load("jdbc"...) statement when "dbtable" contains query with renamed columns.
If original column is used in SQL query once the resulting DataFrame will contain non-renamed column.
If original column is used in SQL query several times with different aliases, sqlContext.load will fail.
Original implementation of JDBCRDD.resolveTable uses getColumnName to detect column names in RDD schema.
Suggested implementation uses getColumnLabel to handle column renames in SQL statement which is aware of SQL "AS" statement.
Readings:
http://stackoverflow.com/questions/4271152/getcolumnlabel-vs-getcolumnnamehttp://stackoverflow.com/questions/12259829/jdbc-getcolumnname-getcolumnlabel-db2
Official documentation unfortunately a bit misleading in definition of "suggested title" purpose however clearly defines behavior of AS keyword in SQL statement.
http://docs.oracle.com/javase/7/docs/api/java/sql/ResultSetMetaData.html
getColumnLabel - Gets the designated column's suggested title for use in printouts and displays. The suggested title is usually specified by the SQL AS clause. If a SQL AS is not specified, the value returned from getColumnLabel will be the same as the value returned by the getColumnName method.
Author: Oleg Sidorkin <oleg.sidorkin@gmail.com>
Closes#6032 from osidorkin/master and squashes the following commits:
10fc44b [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite (resolved scala style test error)
2aaf6f7 [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite (renamed fields in JDBC query)
b7d5b22 [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite
09559a0 [Oleg Sidorkin] [SPARK-7345][SQL] Spark cannot detect renamed columns using JDBC connector
jira: https://issues.apache.org/jira/browse/SPARK-7475
Add a new argument to specify the algorithm applied to LDA, to exhibit the basic usage of LDAOptimizer.
cc jkbradley
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#6000 from hhbyyh/ldaExample and squashes the following commits:
0a7e2bc [Yuhao Yang] fix according to comments
5810b0f [Yuhao Yang] adjust ldaExample for online LDA
Author: tedyu <yuzhihong@gmail.com>
Closes#6031 from tedyu/master and squashes the following commits:
5c2580c [tedyu] Reference fasterxml.jackson.version in sql/core/pom.xml
ff2a44f [tedyu] Merge branch 'master' of github.com:apache/spark
28c8394 [tedyu] Upgrade version of jackson-databind in sql/core/pom.xml
Currently version of jackson-databind in sql/core/pom.xml is 2.3.0
This is older than the version specified in root pom.xml
This PR upgrades the version in sql/core/pom.xml so that they're consistent.
Author: tedyu <yuzhihong@gmail.com>
Closes#6028 from tedyu/master and squashes the following commits:
28c8394 [tedyu] Upgrade version of jackson-databind in sql/core/pom.xml
A little fix about wrong url of the API document. (org.apache.spark.streaming.scheduler.StreamingListener)
Author: dobashim <dobashim@oss.nttdata.co.jp>
Closes#6024 from dobashim/master and squashes the following commits:
ac9a955 [dobashim] [STREAMING][DOCS] Fix wrong url about API docs of StreamingListener
When we use Spark on YARN and have AllJobPage via ResourceManager's proxy, the link URL in objects which represent each job on timeline view is wrong.
In timeline-view.js, the link is generated as follows.
```
window.location.href = "job/?id=" + getJobId(this);
```
This assumes the URL displayed on the web browser ends with "jobs/" but when we access AllJobPage via the proxy, the url displayed does not end with "jobs/"
The proxy doesn't return status code 301 or 302 so the url displayed still indicates the base url, not "/jobs" even though displaying AllJobPages.
![2015-05-07 3 34 37](https://cloud.githubusercontent.com/assets/4736016/7501079/a8507ad6-f46c-11e4-9bed-62abea170f4c.png)
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#5947 from sarutak/fix-link-in-timeline and squashes the following commits:
aaf40e1 [Kousuke Saruta] Added Copyright for vis.js
01bee7b [Kousuke Saruta] Fixed timeline-view.js in order to get correct href
Author: Vinod K C <vinod.kc@huawei.com>
Closes#5974 from vinodkc/fix_countApproxDistinct_Validation and squashes the following commits:
3a3d59c [Vinod K C] Reverted removal of validation relativeSD<0.000017
799976e [Vinod K C] Removed testcase to assert IAE when relativeSD>3.7
8ddbfae [Vinod K C] Remove blank line
b1b00a3 [Vinod K C] Removed relativeSD validation from python API,RDD.scala will do validation
122d378 [Vinod K C] Fixed validation of relativeSD in countApproxDistinct
In SPARK-7429 and PR https://github.com/apache/spark/pull/5960, I added the varargs annotation to Params.setDefault which takes a variable number of ParamPairs. It worked locally and on Jenkins for me.
However, mengxr reported issues compiling on his machine. So I'm reverting the change introduced in https://github.com/apache/spark/pull/5960 by removing varargs.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#6021 from jkbradley/revert-varargs and squashes the following commits:
098ed39 [Joseph K. Bradley] removed varargs annotation from Params.setDefaults taking multiple ParamPairs
1) Handle scaling and addBias internally.
2) L1/L2 elasticnet using OWLQN optimizer.
Author: DB Tsai <dbt@netflix.com>
Closes#5967 from dbtsai/lor and squashes the following commits:
fa029bb [DB Tsai] made the bound smaller
0806002 [DB Tsai] better initial intercept and more test
5c31824 [DB Tsai] fix import
c387e25 [DB Tsai] Merge branch 'master' into lor
c84e931 [DB Tsai] Made MultiClassSummarizer private
f98e711 [DB Tsai] address feedback
a784321 [DB Tsai] fix style
8ec65d2 [DB Tsai] remove new line
f3f8c88 [DB Tsai] add more tests and they match R which is good. fix a bug
34705bc [DB Tsai] first commit
This patch refactors the SQL `Exchange` operator's logic for determining whether map outputs need to be copied before being shuffled. As part of this change, we'll now avoid unnecessary copies in cases where sort-based shuffle operates on serialized map outputs (as in #4450 /
SPARK-4550).
This patch also includes a change to copy the input to RangePartitioner partition bounds calculation, which is necessary because this calculation buffers mutable Java objects.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5948)
<!-- Reviewable:end -->
Author: Josh Rosen <joshrosen@databricks.com>
Closes#5948 from JoshRosen/SPARK-7375 and squashes the following commits:
f305ff3 [Josh Rosen] Reduce scope of some variables in Exchange
899e1d7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-7375
6a6bfce [Josh Rosen] Fix issue related to RangePartitioning:
ad006a4 [Josh Rosen] [SPARK-7375] Avoid defensive copying in exchange operator when sort.serializeMapOutputs takes effect.
Changes include
1. Rename sortDF to arrange
2. Add new aliases `group_by` and `sample_frac`, `summarize`
3. Add more user friendly column addition (mutate), rename
4. Support mean as an alias for avg in Scala and also support n_distinct, n as in dplyr
Using these changes we can pretty much run the examples as described in http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html with the same syntax
The only thing missing in SparkR is auto resolving column names when used in an expression i.e. making something like `select(flights, delay)` works in dply but we right now need `select(flights, flights$delay)` or `select(flights, "delay")`. But this is a complicated change and I'll file a new issue for it
cc sun-rui rxin
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#6005 from shivaram/sparkr-df-api and squashes the following commits:
5e0716a [Shivaram Venkataraman] Fix some roxygen bugs
1254953 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into sparkr-df-api
0521149 [Shivaram Venkataraman] Changes to make SparkR DataFrame dplyr friendly. Changes include 1. Rename sortDF to arrange 2. Add new aliases `group_by` and `sample_frac`, `summarize` 3. Add more user friendly column addition (mutate), rename 4. Support mean as an alias for avg in Scala and also support n_distinct, n as in dplyr
Added a check to handle container exit status for the preemption scenario, log an INFO message in such cases and move on.
andrewor14
Author: Ashwin Shankar <ashankar@netflix.com>
Closes#5993 from ashwinshankar77/SPARK-7451 and squashes the following commits:
90900cf [Ashwin Shankar] Fix log info message
cf8b6cf [Ashwin Shankar] Stop counting preemption of executors as failure
Adds Python Api for `ALS` under `ml.recommendation` in PySpark. Also adds seed as a settable parameter in the Scala Implementation of ALS.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#6015 from brkyvz/ml-rec and squashes the following commits:
be6e931 [Burak Yavuz] addressed comments
eaed879 [Burak Yavuz] readd numFeatures
0bd66b1 [Burak Yavuz] fixed seed
7f6d964 [Burak Yavuz] merged master
52e2bda [Burak Yavuz] added ALS
Author: tedyu <yuzhihong@gmail.com>
Closes#5959 from ted-yu/master and squashes the following commits:
f83d445 [tedyu] Move cleaning outside of mapPartitionsWithIndex
56d7c92 [tedyu] Consolidate import of Random
f6014c0 [tedyu] Remove cleaning in RDD#filterWith
36feb6c [tedyu] Try to get correct syntax
55d01eb [tedyu] Try to get correct syntax
c2786df [tedyu] Correct syntax
d92bfcf [tedyu] Correct syntax in test
164d3e4 [tedyu] Correct variable name
8b50d93 [tedyu] Address Andrew's review comments
0c8d47e [tedyu] Add test for mapWith()
6846e40 [tedyu] Add test for flatMapWith()
6c124a9 [tedyu] Clean function in several RDD methods
The DAG visualization currently displays only low-level Spark primitives (e.g. `map`, `reduceByKey`, `filter` etc.). For SQL, these aren't particularly useful. Instead, we should display higher level physical operators (e.g. `Filter`, `Exchange`, `ShuffleHashJoin`). cc marmbrus
-----------------
**Before**
<img src="https://issues.apache.org/jira/secure/attachment/12731586/before.png" width="600px"/>
-----------------
**After** (Pay attention to the words)
<img src="https://issues.apache.org/jira/secure/attachment/12731587/after.png" width="600px"/>
-----------------
Author: Andrew Or <andrew@databricks.com>
Closes#5999 from andrewor14/dag-viz-sql and squashes the following commits:
0db23a4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-sql
1e211db [Andrew Or] Update comment
0d49fd6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-sql
ffd237a [Andrew Or] Fix style
202dac1 [Andrew Or] Make ignoreParent false by default
e61b1ab [Andrew Or] Visualize SQL operators, not low-level Spark primitives
569034a [Andrew Or] Add a flag to ignore parent settings and scopes
Currently we're doing port retries in the TransportServer level, but this is not specified by the TransportContext API and it has other further-reaching impacts like causing undesirable behavior for the Yarn and Standalone shuffle services.
Author: Aaron Davidson <aaron@databricks.com>
Closes#5575 from aarondav/port-bind and squashes the following commits:
3c2d6ed [Aaron Davidson] Oops, never do it.
a5d9432 [Aaron Davidson] Remove shouldHostShuffleServiceIfEnabled
e901eb2 [Aaron Davidson] fix local-cluster mode for ExternalShuffleServiceSuite
59e5e38 [Aaron Davidson] [SPARK-6955] Perform port retries at NettyBlockTransferService level
I needed to run some d2 instances, so I updated the spark_ec2.py accordingly
Author: Brendan Collins <bcollins@blueraster.com>
Closes#6014 from brendancol/ec2-instance-types-update and squashes the following commits:
d7b4191 [Brendan Collins] Merge branch 'ec2-instance-types-update' of github.com:brendancol/spark into ec2-instance-types-update
6366c45 [Brendan Collins] added back cc1.4xlarge
fc2931f [Brendan Collins] updated ec2 instance types
80c2aa6 [Brendan Collins] vertically aligned whitespace
85c6236 [Brendan Collins] vertically aligned whitespace
1657c26 [Brendan Collins] updated ec2 instance types