Commit graph

12056 commits

Author SHA1 Message Date
Rekha Joshi 1017908205 [SPARK-9118] [ML] Implement IntArrayParam in mllib
Implement IntArrayParam in mllib

Author: Rekha Joshi <rekhajoshm@gmail.com>
Author: Joshi <rekhajoshm@gmail.com>

Closes #7481 from rekhajoshm/SPARK-9118 and squashes the following commits:

d3b1766 [Joshi] Implement IntArrayParam
0be142d [Rekha Joshi] Merge pull request #3 from apache/master
106fd8e [Rekha Joshi] Merge pull request #2 from apache/master
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
2015-07-17 20:02:05 -07:00
Yu ISHIKAWA 34a889db85 [SPARK-7879] [MLLIB] KMeans API for spark.ml Pipelines
I implemented the KMeans API for spark.ml Pipelines. It does not include the clustering abstractions for spark.ml (SPARK-7610); those fit better in a separate issue. I'll try that later, since we are trying to add the hierarchical clustering algorithms in another issue. Thanks.

[SPARK-7879] KMeans API for spark.ml Pipelines - ASF JIRA https://issues.apache.org/jira/browse/SPARK-7879

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6756 from yu-iskw/SPARK-7879 and squashes the following commits:

be752de [Yu ISHIKAWA] Add assertions
a14939b [Yu ISHIKAWA] Fix the dashed line's length in pyspark.ml.rst
4c61693 [Yu ISHIKAWA] Remove the test about whether "features" and "prediction" columns exist or not in Python
fb2417c [Yu ISHIKAWA] Use getInt, instead of get
f397be4 [Yu ISHIKAWA] Switch the comparisons.
ca78b7d [Yu ISHIKAWA] Add the Scala docs about the constraints of each parameter.
effc650 [Yu ISHIKAWA] Using expertSetParam and expertGetParam
c8dc6e6 [Yu ISHIKAWA] Remove an unnecessary test
19a9d63 [Yu ISHIKAWA] Include spark.ml.clustering to python tests
1abb19c [Yu ISHIKAWA] Add the statements about spark.ml.clustering into pyspark.ml.rst
f8338bc [Yu ISHIKAWA] Add the placeholders in Python
4a03003 [Yu ISHIKAWA] Test for contains in Python
6566c8b [Yu ISHIKAWA] Use `get`, instead of `apply`
288e8d5 [Yu ISHIKAWA] Using `contains` to check the column names
5a7d574 [Yu ISHIKAWA] Rename `validateInitializationMode` to `validateInitMode` and remove throwing exception
97cfae3 [Yu ISHIKAWA] Fix the type of return value of `KMeans.copy`
e933723 [Yu ISHIKAWA] Remove the default value of seed from the Model class
978ee2c [Yu ISHIKAWA] Modify the docs of KMeans, according to mllib's KMeans
2ec80bc [Yu ISHIKAWA] Fit on 1 line
e186be1 [Yu ISHIKAWA] Make a few variables, setters and getters be expert ones
b2c205c [Yu ISHIKAWA] Rename the method `getInitializationSteps` to `getInitSteps` and `setInitializationSteps` to `setInitSteps` in Scala and Python
f43f5b4 [Yu ISHIKAWA] Rename the method `getInitializationMode` to `getInitMode` and `setInitializationMode` to `setInitMode` in Scala and Python
3cb5ba4 [Yu ISHIKAWA] Modify the description about epsilon and the validation
4fa409b [Yu ISHIKAWA] Add a comment about the default value of epsilon
2f392e1 [Yu ISHIKAWA] Make some variables `final` and Use `IntParam` and `DoubleParam`
19326f8 [Yu ISHIKAWA] Use `udf`, instead of callUDF
4d2ad1e [Yu ISHIKAWA] Modify the indentations
0ae422f [Yu ISHIKAWA] Add a test for `setParams`
4ff7913 [Yu ISHIKAWA] Add "ml.clustering" to `javacOptions` in SparkBuild.scala
11ffdf1 [Yu ISHIKAWA] Use `===` and the variable
220a176 [Yu ISHIKAWA] Set a random seed in the unit testing
92c3efc [Yu ISHIKAWA] Make the points for a test be fewer
c758692 [Yu ISHIKAWA] Modify the parameters of KMeans in Python
6aca147 [Yu ISHIKAWA] Add some unit testings to validate the setter methods
687cacc [Yu ISHIKAWA] Alias mllib.KMeans as MLlibKMeans in KMeansSuite.scala
a4dfbef [Yu ISHIKAWA] Modify the last brace and indentations
5bedc51 [Yu ISHIKAWA] Remove an extra new line
444c289 [Yu ISHIKAWA] Add the validation for `runs`
e41989c [Yu ISHIKAWA] Modify how to validate `initStep`
7ea133a [Yu ISHIKAWA] Change how to validate `initMode`
7991e15 [Yu ISHIKAWA] Add a validation for `k`
c2df35d [Yu ISHIKAWA] Make `predict` private
93aa2ff [Yu ISHIKAWA] Use `withColumn` in `transform`
d3a79f7 [Yu ISHIKAWA] Remove the inherited docs
e9532e1 [Yu ISHIKAWA] make `parentModel` of KMeansModel private
8559772 [Yu ISHIKAWA] Remove the `paramMap` parameter of KMeans
6684850 [Yu ISHIKAWA] Rename `initializationSteps` to `initSteps`
99b1b96 [Yu ISHIKAWA] Rename `initializationMode` to `initMode`
79ea82b [Yu ISHIKAWA] Modify the parameters of KMeans docs
6569bcd [Yu ISHIKAWA] Change how to set the default values with `setDefault`
20a795a [Yu ISHIKAWA] Change how to set the default values with `setDefault`
11c2a12 [Yu ISHIKAWA] Limit the imports
badb481 [Yu ISHIKAWA] Alias spark.mllib.{KMeans, KMeansModel}
f80319a [Yu ISHIKAWA] Rebase master branch and add copy methods
85d92b1 [Yu ISHIKAWA] Add `KMeans.setPredictionCol`
aa9469d [Yu ISHIKAWA] Fix a python test suite error caused by python 3.x
c2d6bcb [Yu ISHIKAWA] ADD Java test suites of the KMeans API for spark.ml Pipeline
598ed2e [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Python
63ad785 [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Scala
2015-07-17 18:30:04 -07:00
Yijie Shen 529a2c2d92 [SPARK-8280][SPARK-8281][SQL]Handle NaN, null and Infinity in math
JIRA:
https://issues.apache.org/jira/browse/SPARK-8280
https://issues.apache.org/jira/browse/SPARK-8281

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7451 from yijieshen/nan_null2 and squashes the following commits:

47a529d [Yijie Shen] style fix
63dee44 [Yijie Shen] handle log expressions similar to Hive
188be51 [Yijie Shen] null to nan in Math Expression
2015-07-17 17:33:19 -07:00
Daoyuan Wang 1707238601 [SPARK-7026] [SQL] fix left semi join with equi key and non-equi condition
When the `condition` extracted by `ExtractEquiJoinKeys` contains a join predicate for a left semi join, we cannot plan it as a semiJoin. For example:

    SELECT * FROM testData2 x
    LEFT SEMI JOIN testData2 y
    ON x.b = y.b
    AND x.a >= y.a + 2
The condition `x.a >= y.a + 2` cannot be evaluated on table `x` alone, so it throws errors.
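
To make the intended semantics of the query above concrete, here is a minimal sketch over plain Scala collections. It is not Spark's planner code: `Row`, `SemiJoinSketch`, and `leftSemiJoin` are hypothetical names used only for illustration. A left row is kept when at least one right row shares its key `b` and also satisfies the non-equi condition, which necessarily reads columns from both sides.

```scala
// Hypothetical illustration of LEFT SEMI JOIN semantics, not Spark code.
object SemiJoinSketch {
  final case class Row(a: Int, b: Int)

  // Keep each left row x for which some right row y has
  // x.b == y.b (equi key) and x.a >= y.a + 2 (non-equi condition).
  def leftSemiJoin(left: Seq[Row], right: Seq[Row]): Seq[Row] = {
    val rightByKey: Map[Int, Seq[Row]] = right.groupBy(_.b)
    left.filter { x =>
      rightByKey.getOrElse(x.b, Seq.empty).exists(y => x.a >= y.a + 2)
    }
  }
}
```

The non-equi condition is checked per candidate pair after the key lookup, which is why it cannot be pushed down to either input on its own.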

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #5643 from adrian-wang/spark7026 and squashes the following commits:

cc09809 [Daoyuan Wang] refactor semijoin and add plan test
575a7c8 [Daoyuan Wang] fix notserializable
27841de [Daoyuan Wang] fix rebase
10bf124 [Daoyuan Wang] fix style
72baa02 [Daoyuan Wang] fix style
8e0afca [Daoyuan Wang] merge commits for rebase
2015-07-17 16:45:46 -07:00
Tathagata Das b13ef7723f [SPARK-9030] [STREAMING] Add Kinesis.createStream unit tests that actual sends data
Current Kinesis unit tests do not test createStream by sending data. This PR adds such a unit test. Note that this unit test will not run by default. It will only run when the relevant environment variables are set.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #7413 from tdas/kinesis-tests and squashes the following commits:

0e16db5 [Tathagata Das] Added more comments regarding testOrIgnore
1ea5ce0 [Tathagata Das] Added more comments
c7caef7 [Tathagata Das] Address comments
a297b59 [Tathagata Das] Reverted unnecessary change in KafkaStreamSuite
90c9bde [Tathagata Das] Removed scalatest.FunSuite
deb7f4f [Tathagata Das] Removed scalatest.FunSuite
18c2208 [Tathagata Das] Changed how SparkFunSuite is inherited
dbb33a5 [Tathagata Das] Added license
88f6dab [Tathagata Das] Added scala docs
c6be0d7 [Tathagata Das] minor changes
24a992b [Tathagata Das] Moved KinesisTestUtils to src instead of test for future python usage
465b55d [Tathagata Das] Made unit tests optional in a nice way
4d70703 [Tathagata Das] Added license
129d436 [Tathagata Das] Minor updates
cc36510 [Tathagata Das] Added KinesisStreamSuite
2015-07-17 16:43:18 -07:00
Wenchen Fan bd903ee89f [SPARK-9117] [SQL] fix BooleanSimplification in case-insensitive
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7452 from cloud-fan/boolean-simplify and squashes the following commits:

2a6e692 [Wenchen Fan] fix style
d3cfd26 [Wenchen Fan] fix BooleanSimplification in case-insensitive
2015-07-17 16:28:24 -07:00
Wenchen Fan fd6b3101fb [SPARK-9113] [SQL] enable analysis check code for self join
The check was unreachable before, as `case operator: LogicalPlan` catches everything already.
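
The general pitfall can be sketched in a few lines of Scala. The types below (`Plan`, `Join`, `Scan`) are hypothetical stand-ins, not Catalyst classes: a typed pattern that matches every value of the scrutinee type shadows all cases after it, and the compiler is expected to flag the later case as unreachable with a warning.

```scala
// Hypothetical stand-ins for plan nodes; not Catalyst's LogicalPlan hierarchy.
object MatchOrderSketch {
  sealed trait Plan
  final case class Join(selfJoin: Boolean) extends Plan
  final case class Scan(table: String) extends Plan

  // Shadowed ordering: the first case matches every Plan,
  // so the self-join check below it can never run.
  def checkShadowed(plan: Plan): String = plan match {
    case operator: Plan => "generic check"
    case Join(true)     => "self join rejected" // unreachable
  }

  // Fixed ordering: the specific case comes before the catch-all.
  def checkFixed(plan: Plan): String = plan match {
    case Join(true)     => "self join rejected"
    case operator: Plan => "generic check"
  }
}
```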

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7449 from cloud-fan/tmp and squashes the following commits:

2bb6637 [Wenchen Fan] add test
5493aea [Wenchen Fan] add the check back
27221a7 [Wenchen Fan] remove unnecessary analysis check code for self join
2015-07-17 16:03:33 -07:00
Yijie Shen 15fc2ffe55 [SPARK-9080][SQL] add isNaN predicate expression
JIRA: https://issues.apache.org/jira/browse/SPARK-9080

cc rxin

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7464 from yijieshen/isNaN and squashes the following commits:

11ae039 [Yijie Shen] add isNaN in functions
666718e [Yijie Shen] add isNaN predicate expression
2015-07-17 15:49:31 -07:00
Reynold Xin b2aa490bb6 [SPARK-9142] [SQL] Removing unnecessary self types in Catalyst.
Just a small change to add Product type to the base expression/plan abstract classes, based on suggestions on #7434 and offline discussions.

Author: Reynold Xin <rxin@databricks.com>

Closes #7479 from rxin/remove-self-types and squashes the following commits:

e407ffd [Reynold Xin] [SPARK-9142][SQL] Removing unnecessary self types in Catalyst.
2015-07-17 15:02:13 -07:00
Joshi 42d8a012f6 [SPARK-8593] [CORE] Sort app attempts by start time.
This makes sure attempts are listed in the order they were executed, and that the
app's state matches the state of the most current attempt.
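
As a rough sketch of the idea, assuming newest-first ordering as the squashed commit titles below suggest: `Attempt`, `newestFirst`, and `appCompleted` are hypothetical names, not the History Server's actual classes or methods.

```scala
// Hypothetical model of application attempts; not the History Server's types.
object AttemptOrderSketch {
  final case class Attempt(id: String, startTime: Long, completed: Boolean)

  // Order attempts by descending start time, so the most recent one is first.
  def newestFirst(attempts: Seq[Attempt]): Seq[Attempt] =
    attempts.sortBy(_.startTime)(Ordering[Long].reverse)

  // The application's state follows its most recent attempt.
  def appCompleted(attempts: Seq[Attempt]): Boolean =
    newestFirst(attempts).headOption.exists(_.completed)
}
```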

Author: Joshi <rekhajoshm@gmail.com>
Author: Rekha Joshi <rekhajoshm@gmail.com>

Closes #7253 from rekhajoshm/SPARK-8593 and squashes the following commits:

874dd80 [Joshi] History Server: updated order for multiple attempts(logcleaner)
716e0b1 [Joshi] History Server: updated order for multiple attempts(descending start time works everytime)
548c753 [Joshi] History Server: updated order for multiple attempts(descending start time works everytime)
83306a8 [Joshi] History Server: updated order for multiple attempts(descending start time)
b0fc922 [Joshi] History Server: updated order for multiple attempts(updated comment)
cc0fda7 [Joshi] History Server: updated order for multiple attempts(updated test)
304cb0b [Joshi] History Server: updated order for multiple attempts(reverted HistoryPage)
85024e8 [Joshi] History Server: updated order for multiple attempts
a41ac4b [Joshi] History Server: updated order for multiple attempts
ab65fa1 [Joshi] History Server: some attempt completed to work with showIncomplete
0be142d [Rekha Joshi] Merge pull request #3 from apache/master
106fd8e [Rekha Joshi] Merge pull request #2 from apache/master
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
2015-07-17 22:47:28 +01:00
Bryan Cutler 8b8be1f5d6 [SPARK-7127] [MLLIB] Adding broadcast of model before prediction for ensembles
Broadcast of ensemble models in transformImpl before call to predict

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #6300 from BryanCutler/bcast-ensemble-models-7127 and squashes the following commits:

86e73de [Bryan Cutler] [SPARK-7127] Replaced deprecated callUDF with udf
40a139d [Bryan Cutler] Merge branch 'master' into bcast-ensemble-models-7127
9afad56 [Bryan Cutler] [SPARK-7127] Simplified calls by overriding transformImpl and using broadcasted model in callUDF to make prediction
1f34be4 [Bryan Cutler] [SPARK-7127] Removed accidental newline
171a6ce [Bryan Cutler] [SPARK-7127] Used modelAccessor parameter in predictImpl to access broadcasted model
6fd153c [Bryan Cutler] [SPARK-7127] Applied broadcasting to remaining ensemble models
aaad77b [Bryan Cutler] [SPARK-7127] Removed abstract class for broadcasting model, instead passing a prediction function as param to transform
83904bb [Bryan Cutler] [SPARK-7127] Adding broadcast of model before prediction in RandomForestClassifier
2015-07-17 14:10:16 -07:00
Yanbo Liang 830666f6fe [SPARK-8792] [ML] Add Python API for PCA transformer
Add Python API for PCA transformer

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7190 from yanboliang/spark-8792 and squashes the following commits:

8f4ac31 [Yanbo Liang] address comments
8a79cc0 [Yanbo Liang] Add Python API for PCA transformer
2015-07-17 14:08:06 -07:00
Feynman Liang 6da1069696 [SPARK-9090] [ML] Fix definition of residual in LinearRegressionSummary, EnsembleTestHelper, and SquaredError
Make the definition of residuals in Spark consistent with literature. We have been using `prediction - label` for residuals, but literature usually defines `residual = label - prediction`.
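
A small sketch of the convention, using a hypothetical helper rather than the actual `LinearRegressionSummary` code: each residual is the observed label minus the model's prediction, so a positive residual means the model under-predicted.

```scala
// Hypothetical helper illustrating residual = label - prediction.
object ResidualSketch {
  def residuals(labels: Seq[Double], predictions: Seq[Double]): Seq[Double] = {
    require(labels.length == predictions.length, "labels and predictions must align")
    labels.zip(predictions).map { case (label, prediction) => label - prediction }
  }
}
```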

Author: Feynman Liang <fliang@databricks.com>

Closes #7435 from feynmanliang/SPARK-9090-Fix-LinearRegressionSummary-Residuals and squashes the following commits:

f4b39d8 [Feynman Liang] Fix doc
bc12a92 [Feynman Liang] Tweak EnsembleTestHelper and SquaredError residuals
63f0d60 [Feynman Liang] Fix definition of residual
2015-07-17 14:00:53 -07:00
zsxwing ad0954f6de [SPARK-5681] [STREAMING] Move 'stopReceivers' to the event loop to resolve the race condition
This is an alternative way to fix `SPARK-5681`. It minimizes the changes.

Closes #4467

Author: zsxwing <zsxwing@gmail.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6294 from zsxwing/pr4467 and squashes the following commits:

709ac1f [zsxwing] Fix the comment
e103e8a [zsxwing] Move ReceiverTracker.stop into ReceiverTracker.stop
f637142 [zsxwing] Address minor code style comments
a178d37 [zsxwing] Move 'stopReceivers' to the event loop to resolve the race condition
51fb07e [zsxwing] Fix the code style
3cb19a3 [zsxwing] Merge branch 'master' into pr4467
b4c29e7 [zsxwing] Stop receiver only if we start it
c41ee94 [zsxwing] Make stopReceivers private
7c73c1f [zsxwing] Use trackerStateLock to protect trackerState
a8120c0 [zsxwing] Merge branch 'master' into pr4467
7b1d9af [zsxwing] "case Throwable" => "case NonFatal"
15ed4a1 [zsxwing] Register before starting the receiver
fff63f9 [zsxwing] Use a lock to eliminate the race condition when stopping receivers and registering receivers happen at the same time.
e0ef72a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into tracker_status_timeout
19b76d9 [Liang-Chi Hsieh] Remove timeout.
34c18dc [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into tracker_status_timeout
c419677 [Liang-Chi Hsieh] Fix style.
9e1a760 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into tracker_status_timeout
355f9ce [Liang-Chi Hsieh] Separate register and start events for receivers.
3d568e8 [Liang-Chi Hsieh] Let receivers get registered first before going started.
ae0d9fd [Liang-Chi Hsieh] Merge branch 'master' into tracker_status_timeout
77983f3 [Liang-Chi Hsieh] Add tracker status and stop to receive messages when stopping tracker.
2015-07-17 14:00:31 -07:00
Wenchen Fan 074085d678 [SPARK-9136] [SQL] fix several bugs in DateTimeUtils.stringToTimestamp
a follow up of https://github.com/apache/spark/pull/7353

1. we should use `Calendar.HOUR_OF_DAY` instead of `Calendar.HOUR`(this is for AM, PM).
2. we should call `c.set(Calendar.MILLISECOND, 0)` after `Calendar.getInstance`
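
Both points can be illustrated with `java.util.Calendar` directly. This is a sketch, not the `DateTimeUtils` implementation: `CalendarFieldsSketch` and `toUtcMillis` are hypothetical names, and the fixed UTC zone and `Locale.ROOT` are assumptions chosen to keep the example deterministic. `Calendar.getInstance` starts from the current instant, so its millisecond field is non-zero unless it is reset, and `Calendar.HOUR` is the 12-hour field that is interpreted together with `AM_PM`.

```scala
import java.util.{Calendar, Locale, TimeZone}

// Hypothetical helper; not Spark's DateTimeUtils.
object CalendarFieldsSketch {
  // Converts a UTC wall-clock time (24-hour clock, 1-based month) to epoch millis.
  def toUtcMillis(year: Int, month: Int, day: Int,
                  hour: Int, minute: Int, second: Int): Long = {
    val c = Calendar.getInstance(TimeZone.getTimeZone("UTC"), Locale.ROOT)
    c.set(year, month - 1, day)       // Calendar months are 0-based
    c.set(Calendar.HOUR_OF_DAY, hour) // 24-hour field, unlike Calendar.HOUR
    c.set(Calendar.MINUTE, minute)
    c.set(Calendar.SECOND, second)
    c.set(Calendar.MILLISECOND, 0)    // drop the leftover millis from "now"
    c.getTimeInMillis
  }
}
```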

I'm not sure why the tests didn't fail in jenkins, but I ran latest spark master branch locally and `DateTimeUtilsSuite` failed.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7473 from cloud-fan/datetime and squashes the following commits:

66cdaf2 [Wenchen Fan] fix several bugs in DateTimeUtils.stringToTimestamp
2015-07-17 13:57:31 -07:00
Yanbo Liang 9974642870 [SPARK-8600] [ML] Naive Bayes API for spark.ml Pipelines
Naive Bayes API for spark.ml Pipelines

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7284 from yanboliang/spark-8600 and squashes the following commits:

bc890f7 [Yanbo Liang] remove labels valid check
c3de687 [Yanbo Liang] remove labels from ml.NaiveBayesModel
a2b3088 [Yanbo Liang] address comments
3220b82 [Yanbo Liang] trigger jenkins
3018a41 [Yanbo Liang] address comments
208e166 [Yanbo Liang] Naive Bayes API for spark.ml Pipelines
2015-07-17 13:55:17 -07:00
Yuhao Yang 806c579f43 [SPARK-9062] [ML] Change output type of Tokenizer to Array(String, true)
jira: https://issues.apache.org/jira/browse/SPARK-9062

Currently output type of Tokenizer is Array(String, false), which is not compatible with Word2Vec and Other transformers since their input type is Array(String, true). Seq[String] in udf will be treated as Array(String, true) by default.

I'm not sure what's the recommended way for Tokenizer to handle the null value in the input. Any suggestion will be welcome.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #7414 from hhbyyh/tokenizer and squashes the following commits:

c01bd7a [Yuhao Yang] change output type of tokenizer
2015-07-17 13:43:19 -07:00
Davies Liu f9a82a884e [SPARK-9138] [MLLIB] fix Vectors.dense
Vectors.dense() should accept numbers directly, like the one in Scala. We already use it that way in doctests, where it worked by luck.

cc mengxr jkbradley

Author: Davies Liu <davies@databricks.com>

Closes #7476 from davies/fix_vectors_dense and squashes the following commits:

e0fd292 [Davies Liu] fix Vectors.dense
2015-07-17 12:43:58 -07:00
tien-dungle 587c315b20 [SPARK-9109] [GRAPHX] Keep the cached edge in the graph
The change here is to keep the cached RDDs in the graph object so that when the graph.unpersist() is called these RDDs are correctly unpersisted.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.slf4j.LoggerFactory
import org.apache.spark.graphx.util.GraphGenerators

// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(3L, 7L, "collab"),    Edge(5L, 3L, "advisor"),
                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
// Define a default user in case there are relationships with a missing user
val defaultUser = ("John Doe", "Missing")
// Build the initial Graph
val graph = Graph(users, relationships, defaultUser)
graph.cache().numEdges

graph.unpersist()

sc.getPersistentRDDs.foreach( r => println( r._2.toString))
```

Author: tien-dungle <tien-dung.le@realimpactanalytics.com>

Closes #7469 from tien-dungle/SPARK-9109_Graphx-unpersist and squashes the following commits:

8d87997 [tien-dungle] Keep the cached edge in the graph
2015-07-17 12:11:32 -07:00
Liang-Chi Hsieh eba6a1af4c [SPARK-8945][SQL] Add add and subtract expressions for IntervalType
JIRA: https://issues.apache.org/jira/browse/SPARK-8945

Add add and subtract expressions for IntervalType.

Author: Liang-Chi Hsieh <viirya@appier.com>

This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@databricks.com>

Closes #7398 from viirya/interval_add_subtract and squashes the following commits:

acd1f1e [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into interval_add_subtract
5abae28 [Liang-Chi Hsieh] For comments.
6f5b72e [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into interval_add_subtract
dbe3906 [Liang-Chi Hsieh] For comments.
13a2fc5 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into interval_add_subtract
83ec129 [Liang-Chi Hsieh] Remove intervalMethod.
acfe1ab [Liang-Chi Hsieh] Fix scala style.
d3e9d0e [Liang-Chi Hsieh] Add add and subtract expressions for IntervalType.
2015-07-17 09:38:08 -07:00
zhichao.li 305e77cd83 [SPARK-8209][SQL]Add function conv
cc chenghao-intel  adrian-wang

Author: zhichao.li <zhichao.li@intel.com>

Closes #6872 from zhichao-li/conv and squashes the following commits:

6ef3b37 [zhichao.li] add unittest and comments
78d9836 [zhichao.li] polish dataframe api and add unittest
e2bace3 [zhichao.li] update to use ImplicitCastInputTypes
cbcad3f [zhichao.li] add function conv
2015-07-17 09:32:27 -07:00
Wenchen Fan 59d24c226a [SPARK-9130][SQL] throw exception when check equality between external and internal row
Instead of returning false, it is better to throw an exception when checking equality between an external and an internal row.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7460 from cloud-fan/row-compare and squashes the following commits:

8a20911 [Wenchen Fan] improve equals
402daa8 [Wenchen Fan] throw exception when check equality between external and internal row
2015-07-17 09:31:13 -07:00
Yanbo Liang 441e072a22 [MINOR] [ML] fix wrong annotation of RFormula.formula
fix wrong annotation of RFormula.formula

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7470 from yanboliang/RFormula and squashes the following commits:

61f1919 [Yanbo Liang] fix wrong annotation
2015-07-17 09:00:41 -07:00
Hari Shreedharan c043a3e9df [SPARK-8851] [YARN] In Client mode, make sure the client logs in and updates tokens
On the client side, the flow is SparkSubmit -> SparkContext -> yarn/Client. Since the yarn client only gets a cloned config and the staging dir is set here, it is not really possible to do re-logins in the SparkContext. So, do the initial logins in SparkSubmit and do re-logins as we do now in the AM, but the Client behaves like an executor in this specific context and reads the credentials file to update the tokens. This way, even if the streaming context is started up from a checkpoint, it is fine, since we have logged in from SparkSubmit itself.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #7394 from harishreedharan/yarn-client-login and squashes the following commits:

9a2166f [Hari Shreedharan] make it possible to use command line args and config parameters together.
de08f57 [Hari Shreedharan] Fix import order.
5c4fa63 [Hari Shreedharan] Add a comment explaining what is being done in YarnClientSchedulerBackend.
c872caa [Hari Shreedharan] Fix typo in log message.
2c80540 [Hari Shreedharan] Move token renewal to YarnClientSchedulerBackend.
0c48ac2 [Hari Shreedharan] Remove direct use of ExecutorDelegationTokenUpdater in Client.
26f8bfa [Hari Shreedharan] [SPARK-8851][YARN] In Client mode, make sure the client logs in and updates tokens.
58b1969 [Hari Shreedharan] Simple attempt 1.
2015-07-17 09:38:08 -05:00
Davies Liu ec8973d124 [SPARK-9022] [SQL] Generated projections for UnsafeRow
Added two projections: GenerateUnsafeProjection and FromUnsafeProjection, which could be used to convert UnsafeRow from/to GenericInternalRow.

They will re-use the buffer during projection, similar to MutableProjection (without all the interface MutableProjection has).

cc rxin JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #7437 from davies/unsafe_proj2 and squashes the following commits:

dbf538e [Davies Liu] test with all the expression (only for supported types)
dc737b2 [Davies Liu] address comment
e424520 [Davies Liu] fix scala style
70e231c [Davies Liu] address comments
729138d [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_proj2
5a26373 [Davies Liu] unsafe projections
2015-07-17 01:27:14 -07:00
Yu ISHIKAWA 5a3c1ad087 [SPARK-9093] [SPARKR] Fix single-quotes strings in SparkR
[[SPARK-9093] Fix single-quotes strings in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9093)

This is the result of lintr at the revision:011551620faa87107a787530f074af3d9be7e695
[[SPARK-9093] The result of lintr at 011551620f](https://gist.github.com/yu-iskw/8c47acf3202796da4d01)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #7439 from yu-iskw/SPARK-9093 and squashes the following commits:

61c391e [Yu ISHIKAWA] [SPARK-9093][SparkR] Fix single-quotes strings in SparkR
2015-07-17 17:00:50 +09:00
Wenchen Fan 3f6d28a5ca [SPARK-9102] [SQL] Improve project collapse with nondeterministic expressions
Currently we stop project collapse when the lower projection has nondeterministic expressions. However, that is sometimes overkill: we should be able to optimize `df.select(Rand(10)).select('a)` to `df.select('a)`.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7445 from cloud-fan/non-deterministic and squashes the following commits:

0deaef6 [Wenchen Fan] Improve project collapse with nondeterministic expressions
2015-07-17 00:59:15 -07:00
Reynold Xin 111c05538d Added inline comment for the canEqual PR by @cloud-fan.
2015-07-16 23:13:06 -07:00
Xiangrui Meng 358e7bf652 [SPARK-9126] [MLLIB] do not assert on time taken by Thread.sleep()
Measure lower and upper bounds for task time and use them for validation. This PR also implements `Stopwatch.toString`. This suite should finish in less than 1 second.
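
The bounding idea can be sketched as follows. `SimpleStopwatch` and `measure` are hypothetical stand-ins, not the MLlib `Stopwatch` API. Rather than asserting that a sleep took some expected time, the code takes one timestamp pair strictly inside the measured region and another strictly outside it, then checks that the stopwatch reading falls between the two.

```scala
// Hypothetical sketch of bounding a measured duration; not MLlib's Stopwatch.
object TimingBoundsSketch {
  final class SimpleStopwatch {
    private var startNs = 0L
    def start(): Unit = { startNs = System.nanoTime() }
    def stop(): Long = System.nanoTime() - startNs // elapsed nanoseconds
  }

  // Returns (lowerBound, measured, upperBound) in nanoseconds.
  def measure(sleepMs: Long): (Long, Long, Long) = {
    val sw = new SimpleStopwatch
    val upperStart = System.nanoTime() // taken before the stopwatch starts
    sw.start()
    val lowerStart = System.nanoTime() // taken after the stopwatch starts
    Thread.sleep(sleepMs)
    val lower = System.nanoTime() - lowerStart // ends before the stopwatch stops
    val measured = sw.stop()
    val upper = System.nanoTime() - upperStart // ends after the stopwatch stops
    (lower, measured, upper)
  }
}
```

Because the bounds are observed in the same run, the check does not depend on how long `Thread.sleep` actually takes on a loaded machine.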

jkbradley pwendell

Author: Xiangrui Meng <meng@databricks.com>

Closes #7457 from mengxr/SPARK-9126 and squashes the following commits:

4b40faa [Xiangrui Meng] simplify tests
739f5bd [Xiangrui Meng] do not assert on time taken by Thread.sleep()
2015-07-16 23:02:06 -07:00
Joseph K. Bradley 322d286bb7 [SPARK-7131] [ML] Copy Decision Tree, Random Forest impl to spark.ml
This PR copies the RandomForest implementation from spark.mllib to spark.ml.  Note that this includes the DecisionTree implementation, but not the GradientBoostedTrees one (which will come later).

I essentially copied a minimal amount of code to spark.ml, removed the use of bins (and only used splits), and modified code only as much as necessary to get it to compile.  The spark.ml implementation still uses some spark.mllib classes (privately), which can be moved in future PRs.

This refactoring will be helpful in extending the node representation to include more information, such as class probabilities.

Specifically:
* Copied code from spark.mllib to spark.ml:
  * mllib.tree.DecisionTree, mllib.tree.RandomForest copied to ml.tree.impl.RandomForest (main implementation)
  * NodeIdCache (needed to use splits instead of bins)
  * TreePoint (use splits instead of bins)
* Added ml.tree.LearningNode used in RandomForest training (needed vars)
* Removed bins from implementation, and only used splits
* Small fix in JavaDecisionTreeRegressorSuite

CC: mengxr  manishamde  codedeft chouqin

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7294 from jkbradley/dt-move-impl and squashes the following commits:

48749be [Joseph K. Bradley] cleanups based on code review, mostly style
bea9703 [Joseph K. Bradley] scala style fixes.  added some scala doc
4e6d2a4 [Joseph K. Bradley] removed unnecessary use of copyValues, setParent for trees
9a4d721 [Joseph K. Bradley] cleanups. removed InfoGainStats from ml, using old one for now.
836e7d4 [Joseph K. Bradley] Fixed test suite failures
bd5e063 [Joseph K. Bradley] fixed bucketizing issue
0df3759 [Joseph K. Bradley] Need to remove use of Bucketizer
d5224a9 [Joseph K. Bradley] modified tree and forest to use moved impl
cc01823 [Joseph K. Bradley] still editing RF to get it to work
19143fb [Joseph K. Bradley] More progress, but not done yet.  Rebased with master after 1.4 release.
2015-07-16 22:26:59 -07:00
Wenchen Fan f893955b9c [SPARK-8899] [SQL] remove duplicated equals method for Row
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7291 from cloud-fan/row and squashes the following commits:

a11addf [Wenchen Fan] move hashCode back to internal row
2de6180 [Wenchen Fan] making apply() call to get()
fbe1b24 [Wenchen Fan] add null check
ebdf148 [Wenchen Fan] address comments
25ef087 [Wenchen Fan] remove duplicated equals method for Row
2015-07-16 21:41:36 -07:00
zsxwing 812b63bbee [SPARK-8857][SPARK-8859][Core]Add an internal flag to Accumulable and send internal accumulator updates to the driver via heartbeats
This PR includes the following changes:

1. Remove the thread local `Accumulators.localAccums`. Instead, all Accumulators in the executors will register with its TaskContext.
2. Add an internal flag to Accumulable. For internal Accumulators, their updates will be sent to the driver via heartbeats.

Author: zsxwing <zsxwing@gmail.com>

Closes #7448 from zsxwing/accumulators and squashes the following commits:

c24bc5b [zsxwing] Add comments
bd7dcf1 [zsxwing] Add an internal flag to Accumulable and send internal accumulator updates to the driver via heartbeats
2015-07-16 21:09:09 -07:00
Andrew Or 96aa3340f4 [SPARK-8119] HeartbeatReceiver should replace executors, not kill
**Symptom.** If an executor in an application times out, `HeartbeatReceiver` attempts to kill it. After this happens, however, the application never gets an executor back even when there are cluster resources available.

**Cause.** The issue is that `sc.killExecutor` automatically assumes that the application wishes to adjust its resource requirements permanently downwards. This is not the intention in `HeartbeatReceiver`, however, which simply wants a replacement for the expired executor.

**Fix.** Differentiate between the intention to kill and the intention to replace an executor with a fresh one. More details can be found in the commit message.
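
The distinction can be modelled with a small sketch. `Allocation`, `remove`, and `missing` are hypothetical names, and this is not the scheduler backend's actual bookkeeping. Killing lowers the target executor count along with removing the executor, while replacing removes the executor but leaves the target unchanged, so the cluster manager still owes the application one executor.

```scala
// Hypothetical model of executor bookkeeping; not Spark's scheduler backend.
object ExecutorTargetSketch {
  final case class Allocation(target: Int, running: Set[String])

  // Remove an executor. With replace = false the target shrinks as well (a kill);
  // with replace = true the target is kept so a fresh executor can be requested.
  def remove(a: Allocation, executorId: String, replace: Boolean): Allocation =
    if (!a.running.contains(executorId)) a
    else {
      val newTarget = if (replace) a.target else math.max(0, a.target - 1)
      a.copy(target = newTarget, running = a.running - executorId)
    }

  // Number of executors still owed to the application.
  def missing(a: Allocation): Int = math.max(0, a.target - a.running.size)
}
```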

Author: Andrew Or <andrew@databricks.com>

Closes #7107 from andrewor14/heartbeat-no-kill and squashes the following commits:

1cd2cd7 [Andrew Or] Add regression test for SPARK-8119
25a347d [Andrew Or] Reuse more code in scheduler backend
31ebd40 [Andrew Or] Differentiate between kill and replace
2015-07-16 19:39:54 -07:00
Timothy Chen d86bbb4e28 [SPARK-6284] [MESOS] Add mesos role, principal and secret
Mesos supports framework authentication and a role to be set per framework. The role identifies the framework and affects its sharing weight in resource allocation, and the optional authentication information allows the framework to connect to the master.

Author: Timothy Chen <tnachen@gmail.com>

Closes #4960 from tnachen/mesos_fw_auth and squashes the following commits:

0f9f03e [Timothy Chen] Fix review comments.
8f9488a [Timothy Chen] Fix rebase
f7fc2a9 [Timothy Chen] Add mesos role, auth and secret.
2015-07-16 19:37:15 -07:00
Lianhui Wang 49351c7f59 [SPARK-8646] PySpark does not run on YARN if master not provided in command line
andrewor14 davies vanzin can you take a look at this? thanks

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #7438 from lianhuiwang/SPARK-8646 and squashes the following commits:

cb3f12d [Lianhui Wang] add whitespace
6d874a6 [Lianhui Wang] support pyspark for yarn-client
2015-07-16 19:31:45 -07:00
Aaron Davidson 57e9b13bf9 [SPARK-8644] Include call site in SparkException stack traces thrown by job failures
Example exception (new part at bottom, clearly demarcated):

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.RuntimeException: uh-oh!
	at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37$$anonfun$38$$anonfun$apply$mcJ$sp$2.apply(DAGSchedulerSuite.scala:880)
	at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37$$anonfun$38$$anonfun$apply$mcJ$sp$2.apply(DAGSchedulerSuite.scala:880)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1640)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1777)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1777)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
	at org.apache.spark.scheduler.Task.run(Task.scala:70)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1298)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1289)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1288)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1288)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:755)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:755)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:755)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1509)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1470)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1459)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:560)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1744)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1762)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1777)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1791)
	at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
	at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37$$anonfun$38.apply$mcJ$sp(DAGSchedulerSuite.scala:880)
	at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37$$anonfun$38.apply(DAGSchedulerSuite.scala:880)
	at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37$$anonfun$38.apply(DAGSchedulerSuite.scala:880)
	at org.scalatest.Assertions$class.intercept(Assertions.scala:997)
	at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
	at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37.apply$mcV$sp(DAGSchedulerSuite.scala:879)
	at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37.apply(DAGSchedulerSuite.scala:878)
	at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$37.apply(DAGSchedulerSuite.scala:878)
	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
	at org.apache.spark.scheduler.DAGSchedulerSuite.org$scalatest$BeforeAndAfter$$super$runTest(DAGSchedulerSuite.scala:70)
	at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
	at org.apache.spark.scheduler.DAGSchedulerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(DAGSchedulerSuite.scala:70)
	at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
	at org.apache.spark.scheduler.DAGSchedulerSuite.runTest(DAGSchedulerSuite.scala:70)
	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
	at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
	at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
	at org.scalatest.Suite$class.run(Suite.scala:1424)
	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
	at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
	at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
	at org.apache.spark.scheduler.DAGSchedulerSuite.org$scalatest$BeforeAndAfter$$super$run(DAGSchedulerSuite.scala:70)
	at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
	at org.apache.spark.scheduler.DAGSchedulerSuite.org$scalatest$BeforeAndAfterAll$$super$run(DAGSchedulerSuite.scala:70)
	at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
	at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
	at org.apache.spark.scheduler.DAGSchedulerSuite.run(DAGSchedulerSuite.scala:70)
	at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
	at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
	at sbt.ForkMain$Run$2.call(ForkMain.java:294)
	at sbt.ForkMain$Run$2.call(ForkMain.java:284)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
```

Author: Aaron Davidson <aaron@databricks.com>

Closes #7028 from aarondav/stack-trace and squashes the following commits:

4714664 [Aaron Davidson] [SPARK-8644] Include call site in SparkException stack traces thrown by job failures
2015-07-16 18:14:45 -07:00
jerryshao 031d7d4143 [SPARK-6304] [STREAMING] Fix checkpointing doesn't retain driver port issue.
Author: jerryshao <saisai.shao@intel.com>
Author: Saisai Shao <saisai.shao@intel.com>

Closes #5060 from jerryshao/SPARK-6304 and squashes the following commits:

89b01f5 [jerryshao] Update the unit test to add more cases
275d252 [jerryshao] Address the comments
7cc146d [jerryshao] Address the comments
2624723 [jerryshao] Fix rebase conflict
45befaa [Saisai Shao] Update the unit test
bbc1c9c [Saisai Shao] Fix checkpointing doesn't retain driver port issue
2015-07-16 16:55:46 -07:00
Reynold Xin fec10f0c63 [SPARK-9085][SQL] Remove LeafNode, UnaryNode, BinaryNode from TreeNode.
This builds on #7433 but also removes LeafNode/UnaryNode. These are slightly more complicated to remove. I had to change some abstract classes to traits in order for it to work.

The problem with LeafNode/UnaryNode is that they are often mixed in at the end of an Expression, and then the toString function actually gets resolved to the one defined in TreeNode, rather than the one in Expression.

Author: Reynold Xin <rxin@databricks.com>

Closes #7434 from rxin/remove-binary-unary-leaf-node and squashes the following commits:

9e8a4de [Reynold Xin] Generator should not be foldable.
3135a8b [Reynold Xin] SortOrder should not be foldable.
9c589cf [Reynold Xin] Fixed one more test case...
2225331 [Reynold Xin] Aggregate expressions should not be foldable.
16b5c90 [Reynold Xin] [SPARK-9085][SQL] Remove LeafNode, UnaryNode, BinaryNode from TreeNode.
2015-07-16 13:58:39 -07:00
Yijie Shen 43dac2c880 [SPARK-6941] [SQL] Provide a better error message when inserting into RDD based table
JIRA: https://issues.apache.org/jira/browse/SPARK-6941

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7342 from yijieshen/SPARK-6941 and squashes the following commits:

f82cbe7 [Yijie Shen] reorder import
dd67e40 [Yijie Shen] resolve comments
09518af [Yijie Shen] fix import order in DataframeSuite
0c635d4 [Yijie Shen] make match more specific
9df388d [Yijie Shen] move check into PreWriteCheck
847ab20 [Yijie Shen] Detect insertion error in DataSourceStrategy
2015-07-16 10:52:09 -07:00
Jan Prach b536d5dc6c [SPARK-9015] [BUILD] Clean project import in scala ide
Clean up the Maven build for a clean import in scala-ide / eclipse.

* remove groovy plugin which is really not needed at all
* add-source from build-helper-maven-plugin is not needed as recent versions of scala-maven-plugin do it automatically
* add lifecycle-mapping plugin to hide a few useless warnings from ide

Author: Jan Prach <jendap@gmail.com>

Closes #7375 from jendap/clean-project-import-in-scala-ide and squashes the following commits:

c4b4c0f [Jan Prach] fix whitespaces
5a83e07 [Jan Prach] Revert "remove java compiler warnings from java tests"
312007e [Jan Prach] scala-maven-plugin itself add scala sources by default
f47d856 [Jan Prach] remove spark-1.4-staging repository
c8a54db [Jan Prach] remove java compiler warnings from java tests
999a068 [Jan Prach] remove some maven warnings in scala ide
80fbdc5 [Jan Prach] remove groovy and gmavenplus plugin
2015-07-16 18:42:41 +01:00
Tarek Auel 4ea6480a3b [SPARK-8995] [SQL] cast date strings like '2015-01-01 12:15:31' to date
Jira https://issues.apache.org/jira/browse/SPARK-8995

In PR #6981 we noticed that we cannot cast date strings that contain a time, like '2015-03-18 12:39:40', to date. Besides, it's not possible to cast a string like '18:03:20' to a timestamp.

If a time is passed without a date, today's date is inferred.
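
The casting rules described above can be sketched in plain Python. This is only an illustration under assumptions, not Spark's parser: the function names `cast_to_date` and `cast_to_timestamp` are hypothetical, only the `YYYY-MM-DD` and `HH:MM:SS` layouts are handled, and `today` is passed in explicitly so the result is deterministic.

```python
from datetime import date, datetime
from typing import Optional


def cast_to_date(s: str) -> Optional[date]:
    """Cast 'YYYY-MM-DD' or 'YYYY-MM-DD HH:MM:SS' to a date; None if unparsable."""
    # Keep only the date part; a trailing time component is ignored.
    date_part = s.strip().split(" ", 1)[0]
    try:
        return datetime.strptime(date_part, "%Y-%m-%d").date()
    except ValueError:
        return None


def cast_to_timestamp(s: str, today: date) -> Optional[datetime]:
    """Cast a date-time, date-only, or time-only string to a datetime."""
    s = s.strip()
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            pass
    try:
        # Time without a date: take the date from `today` (hypothetical parameter).
        return datetime.combine(today, datetime.strptime(s, "%H:%M:%S").time())
    except ValueError:
        return None


print(cast_to_date("2015-03-18 12:39:40"))
print(cast_to_timestamp("18:03:20", today=date(2015, 7, 16)))
```

Returning `None` for unparsable input is a choice made for this sketch; the commit message does not say how invalid strings are handled.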

Author: Tarek Auel <tarek.auel@googlemail.com>
Author: Tarek Auel <tarek.auel@gmail.com>

Closes #7353 from tarekauel/SPARK-8995 and squashes the following commits:

14f333b [Tarek Auel] [SPARK-8995] added tests for daylight saving time
ca1ae69 [Tarek Auel] [SPARK-8995] style fix
d20b8b4 [Tarek Auel] [SPARK-8995] bug fix: distinguish between 0 and null
ef05753 [Tarek Auel] [SPARK-8995] added check for year >= 1000
01c9ff3 [Tarek Auel] [SPARK-8995] support for time strings
34ec573 [Tarek Auel] fixed style
71622c0 [Tarek Auel] improved timestamp and date parsing
0e30c0a [Tarek Auel] Hive compatibility
cfbaed7 [Tarek Auel] fixed wrong checks
71f89c1 [Tarek Auel] [SPARK-8995] minor style fix
f7452fa [Tarek Auel] [SPARK-8995] removed old timestamp parsing
30e5aec [Tarek Auel] [SPARK-8995] date and timestamp cast
c1083fb [Tarek Auel] [SPARK-8995] cast date strings like '2015-01-01 12:15:31' to date or timestamp
2015-07-16 08:26:39 -07:00
Daniel Darabos 011551620f [SPARK-8893] Add runtime checks against non-positive number of partitions
https://issues.apache.org/jira/browse/SPARK-8893

> What does `sc.parallelize(1 to 3).repartition(p).collect` return? I would expect `Array(1, 2, 3)` regardless of `p`. But if `p` < 1, it returns `Array()`. I think instead it should throw an `IllegalArgumentException`.

> I think the case is pretty clear for `p` < 0. But the behavior for `p` = 0 is also error prone. In fact that's how I found this strange behavior. I used `rdd.repartition(a/b)` with positive `a` and `b`, but `a/b` was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results.
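
A minimal plain-Python sketch of the check proposed in the quote, assuming a hypothetical `hash_partition` helper rather than Spark's `HashPartitioner`. `ValueError` stands in for `IllegalArgumentException`. The squashed commits listed below suggest zero partitions ended up being allowed in some cases, which this sketch does not model.

```python
from typing import Hashable, Iterable, List


def hash_partition(items: Iterable[Hashable], num_partitions: int) -> List[list]:
    """Distribute items into num_partitions buckets by hash (hypothetical helper)."""
    # Fail fast instead of silently returning no data.
    if num_partitions < 1:
        raise ValueError(
            f"Number of partitions ({num_partitions}) must be positive."
        )
    buckets: List[list] = [[] for _ in range(num_partitions)]
    for item in items:
        buckets[hash(item) % num_partitions].append(item)
    return buckets


a, b = 3, 4
p = a // b  # integer division of two positive numbers can still round down to 0
try:
    hash_partition([1, 2, 3], p)
except ValueError as err:
    print(err)
```

With the check in place, the `a/b` mistake described in the quote surfaces as an exception at the call site instead of an empty result.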

Author: Daniel Darabos <darabos.daniel@gmail.com>

Closes #7285 from darabos/patch-1 and squashes the following commits:

decba82 [Daniel Darabos] Allow repartitioning empty RDDs to zero partitions.
97de852 [Daniel Darabos] Allow zero partition count in HashPartitioner
f6ba5fb [Daniel Darabos] Use require() for simpler syntax.
d5e3df8 [Daniel Darabos] Require positive number of partitions in HashPartitioner
897c628 [Daniel Darabos] Require positive maxPartitions in CoalescedRDD
2015-07-16 08:16:54 +01:00
Liang-Chi Hsieh 0a795336df [SPARK-8807] [SPARKR] Add between operator in SparkR
JIRA: https://issues.apache.org/jira/browse/SPARK-8807

Add between operator in SparkR.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7356 from viirya/add_r_between and squashes the following commits:

7f51b44 [Liang-Chi Hsieh] Add test for non-numeric column.
c6a25c5 [Liang-Chi Hsieh] Add between function.
2015-07-15 23:36:57 -07:00
Cheng Hao e27212317c [SPARK-8972] [SQL] Incorrect result for rollup
We don't support complex expression keys in rollup/cube, and we don't even report an error when the GROUP BY keys are complex expressions, which causes very confusing or incorrect results.

e.g. `SELECT key%100 FROM src GROUP BY key %100 with ROLLUP`

This PR adds an additional projection during analysis for complex GROUP BY keys. That projection becomes the child of `Expand`, so from the point of view of `Expand` the GROUP BY keys are always simple keys (attribute names).
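
The idea of projecting first and expanding second can be sketched in plain Python. This models only the shape of the plan and is not Catalyst code: the `project`, `expand`, and `rollup_count` helpers, the column name `k`, and the added `COUNT` are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def project(rows: Iterable[Dict[str, int]]) -> List[Dict[str, int]]:
    """Evaluate the complex key once and expose it as a plain column `k`."""
    return [{"k": row["key"] % 100} for row in rows]


def expand(rows: Iterable[Dict[str, int]]) -> List[Tuple[Optional[int]]]:
    """Emit one copy of each row per grouping set; only the simple column is read."""
    out: List[Tuple[Optional[int]]] = []
    for row in rows:
        out.append((row["k"],))  # grouping set (k)
        out.append((None,))      # grouping set (): key nulled out for the grand total
    return out


def rollup_count(rows: Iterable[Dict[str, int]]) -> Counter:
    """Count rows per rollup group, with the projection as the child of expand."""
    return Counter(expand(project(rows)))


src = [{"key": 101}, {"key": 201}, {"key": 5}]
result = rollup_count(src)
print(result[(1,)], result[(5,)], result[(None,)])
```

Because `expand` only ever sees the already-computed column, it never has to evaluate `key % 100` itself or null out part of an expression.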

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7343 from chenghao-intel/expand and squashes the following commits:

1ebbb59 [Cheng Hao] update the comment
827873f [Cheng Hao] update as feedback
34def69 [Cheng Hao] Add more unit test and comments
c695760 [Cheng Hao] fix bug of incorrect result for rollup
2015-07-15 23:35:27 -07:00
Wenchen Fan ba33096846 [SPARK-9068][SQL] refactor the implicit type cast code
based on https://github.com/apache/spark/pull/7348

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7420 from cloud-fan/type-check and squashes the following commits:

7633fa9 [Wenchen Fan] revert
fe169b0 [Wenchen Fan] improve test
03b70da [Wenchen Fan] enhance implicit type cast
2015-07-15 22:27:39 -07:00
Cheng Hao 42dea3acf9 [SPARK-8245][SQL] FormatNumber/Length Support for Expression
- `BinaryType` for `Length`
- `FormatNumber`

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7034 from chenghao-intel/expression and squashes the following commits:

e534b87 [Cheng Hao] python api style issue
601bbf5 [Cheng Hao] add python API support
3ebe288 [Cheng Hao] update as feedback
52274f7 [Cheng Hao] add support for udf_format_number and length for binary
2015-07-15 21:47:21 -07:00
Yin Huai 9c64a75bfc [SPARK-9060] [SQL] Revert SPARK-8359, SPARK-8800, and SPARK-8677
JIRA: https://issues.apache.org/jira/browse/SPARK-9060

This PR reverts:
* 31bd30687b (SPARK-8359)
* 24fda73811 (SPARK-8677)
* 4b5cfc988f (SPARK-8800)

Author: Yin Huai <yhuai@databricks.com>

Closes #7426 from yhuai/SPARK-9060 and squashes the following commits:

651264d [Yin Huai] Revert "[SPARK-8359] [SQL] Fix incorrect decimal precision after multiplication"
cfda7e4 [Yin Huai] Revert "[SPARK-8677] [SQL] Fix non-terminating decimal expansion for decimal divide operation"
2de9afe [Yin Huai] Revert "[SPARK-8800] [SQL] Fix inaccurate precision/scale of Decimal division operation"
2015-07-15 21:08:30 -07:00
Xiangrui Meng 73d92b00b9 [SPARK-9018] [MLLIB] add stopwatches
Add stopwatches for easy instrumentation of MLlib algorithms. This is based on the `TimeTracker` used in decision trees. The distributed version uses a Spark accumulator. jkbradley
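
A local stopwatch of this kind might look roughly like the following plain-Python sketch. The class name, method names, and injectable `clock` parameter are assumptions for illustration, not the MLlib API. The distributed variant mentioned above, which accumulates time across tasks, is not shown.

```python
import time
from typing import Callable, Optional


class Stopwatch:
    """Accumulates elapsed time over repeated start/stop calls (hypothetical API)."""

    def __init__(self, name: str, clock: Callable[[], float] = time.perf_counter):
        self.name = name
        self._clock = clock  # injectable so the behavior can be tested deterministically
        self._started_at: Optional[float] = None
        self._total = 0.0

    @property
    def is_running(self) -> bool:
        return self._started_at is not None

    def start(self) -> None:
        if self.is_running:
            raise RuntimeError(f"start() called but {self.name} is already running")
        self._started_at = self._clock()

    def stop(self) -> float:
        """Stop the watch and return the duration of the interval just ended."""
        if not self.is_running:
            raise RuntimeError(f"stop() called but {self.name} is not running")
        duration = self._clock() - self._started_at
        self._started_at = None
        self._total += duration
        return duration

    def elapsed(self) -> float:
        """Total time over all completed start/stop intervals."""
        return self._total


sw = Stopwatch("training")
sw.start()
sum(range(1000))  # stand-in for the work being timed
sw.stop()
print(f"{sw.name}: {sw.elapsed():.6f}s")
```

Raising on a double `start()` or a `stop()` without `start()` is a design choice of this sketch; it makes mismatched instrumentation calls visible rather than silently skewing the totals.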

Author: Xiangrui Meng <meng@databricks.com>

Closes #7415 from mengxr/SPARK-9018 and squashes the following commits:

40b4347 [Xiangrui Meng] == -> ===
c477745 [Xiangrui Meng] address Joseph's comments
f981a49 [Xiangrui Meng] add stopwatches
2015-07-15 21:02:42 -07:00
Eric Liang 6960a7938c [SPARK-8774] [ML] Add R model formula with basic support as a transformer
This implements minimal R formula support as a feature transformer. Both numeric and string labels are supported, but features must be numeric for now.

cc mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #7381 from ericl/spark-8774-1 and squashes the following commits:

d1959d2 [Eric Liang] clarify comment
2db68aa [Eric Liang] second round of comments
dc3c943 [Eric Liang] address comments
5765ec6 [Eric Liang] fix style checks
1f361b0 [Eric Liang] doc
fb0826b [Eric Liang] [SPARK-8774] Add R model formula with basic support as a transformer
2015-07-15 20:33:06 -07:00
Reynold Xin b0645195d0 [SPARK-9086][SQL] Remove BinaryNode from TreeNode.
These traits are not super useful, and yet cause problems with toString in expressions due to the order in which they are mixed in.

Author: Reynold Xin <rxin@databricks.com>

Closes #7433 from rxin/remove-binary-node and squashes the following commits:

1881f78 [Reynold Xin] [SPARK-9086][SQL] Remove BinaryNode from TreeNode.
2015-07-15 17:50:11 -07:00