Commit graph

8931 commits

Kousuke Saruta 97eb6d7f51 Fix wrong file name pattern in .gitignore
In .gitignore, there is an entry for spark-*-bin.tar.gz, but given how make-distribution.sh names the distribution archive, the pattern should be spark-*-bin-*.tgz.

This change is really small, so I didn't open a JIRA issue. If one is needed, please let me know.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3529 from sarutak/fix-wrong-tgz-pattern and squashes the following commits:

de3c70a [Kousuke Saruta] Fixed wrong file name pattern in .gitignore
2014-12-01 00:29:28 -08:00
Prabeesh K 5e7a6dcb8f [SPARK-4632] version update
Author: Prabeesh K <prabsmails@gmail.com>

Closes #3495 from prabeesh/master and squashes the following commits:

ab03d50 [Prabeesh K] Update pom.xml
8c6437e [Prabeesh K] Revert
e10b40a [Prabeesh K] version update
dbac9eb [Prabeesh K] Revert
ec0b1c3 [Prabeesh K] [SPARK-4632] version update
a835505 [Prabeesh K] [SPARK-4632] version update
831391b [Prabeesh K]  [SPARK-4632] version update
2014-11-30 20:51:53 -08:00
Patrick Wendell 06dc1b15e4 MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on Github:

Closes #2915 (close requested by 'JoshRosen')
Closes #3140 (close requested by 'JoshRosen')
Closes #3366 (close requested by 'JoshRosen')
2014-11-30 20:51:13 -08:00
Cheng Lian 2a4d389f70 [DOC] Fixes formatting typo in SQL programming guide
Author: Cheng Lian <lian@databricks.com>

Closes #3498 from liancheng/fix-sql-doc-typo and squashes the following commits:

865ecd7 [Cheng Lian] Fixes formatting typo in SQL programming guide
2014-11-30 19:04:07 -08:00
lewuathe a217ec5fd5 [SPARK-4656][Doc] Typo in Programming Guide markdown
Fixes a grammatical error in the Programming Guide document.

Author: lewuathe <lewuathe@me.com>

Closes #3412 from Lewuathe/typo-programming-guide and squashes the following commits:

a3e2f00 [lewuathe] Typo in Programming Guide markdown
2014-11-30 17:18:50 -08:00
carlmartin aea7a99761 [SPARK-4623] Add an error message when using spark-sql in yarn-cluster mode
If spark-sql is used in yarn-cluster mode, print an error message, just as the Spark shell does in yarn-cluster mode.

Author: carlmartin <carlmartinmax@gmail.com>
Author: huangzhaowei <carlmartinmax@gmail.com>

Closes #3479 from SaintBacchus/sparkSqlShell and squashes the following commits:

35829a9 [carlmartin] improve the description of comment
e6c1eb7 [carlmartin] add a comment in bin/spark-sql to remind user who wants to change the class
f1c5c8d [carlmartin] Merge branch 'master' into sparkSqlShell
8e112c5 [huangzhaowei] singular form
ec957bc [carlmartin] Add some error information when using spark-sql in yarn-cluster mode
7bcecc2 [carlmartin] Merge branch 'master' of https://github.com/apache/spark into codereview
4fad75a [carlmartin] Add the error information when using spark-sql in yarn-cluster mode
2014-11-30 16:19:41 -08:00
Sean Owen 048ecca625 SPARK-2143 [WEB UI] Add Spark version to UI footer
This PR adds the Spark version number to the UI footer; this is how it looks:

![screen shot 2014-11-21 at 22 58 40](https://cloud.githubusercontent.com/assets/822522/5157738/f4822094-7316-11e4-98f1-333a535fdcfa.png)

Author: Sean Owen <sowen@cloudera.com>

Closes #3410 from srowen/SPARK-2143 and squashes the following commits:

e9b3a7a [Sean Owen] Add Spark version to footer
2014-11-30 11:41:38 -08:00
Takuya UESHIN 0fcd24cc54 [DOCS][BUILD] Add instruction to use change-version-to-2.11.sh in 'Building for Scala 2.11'.
To build with Scala 2.11, we have to execute `change-version-to-2.11.sh` before running Maven; otherwise inter-module dependencies are broken.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3361 from ueshin/docs/building-spark_2.11 and squashes the following commits:

1d29126 [Takuya UESHIN] Add instruction to use change-version-to-2.11.sh in 'Building for Scala 2.11'.
2014-11-30 00:10:31 -05:00
Takayuki Hasegawa 4316a7b010 SPARK-4507: PR merge script should support closing multiple JIRA tickets
This will fix SPARK-4507.

For pull requests that reference multiple JIRAs in their titles, it would be helpful if the PR merge script offered to close all of them.

Author: Takayuki Hasegawa <takayuki.hasegawa0311@gmail.com>

Closes #3428 from hase1031/SPARK-4507 and squashes the following commits:

bf6d64b [Takayuki Hasegawa] SPARK-4507: try to resolve issue when no JIRAs in title
401224c [Takayuki Hasegawa] SPARK-4507: moved codes as before
ce89021 [Takayuki Hasegawa] SPARK-4507: PR merge script should support closing multiple JIRA tickets
2014-11-29 23:12:10 -05:00
zsxwing c06222427f [SPARK-4505][Core] Add a ClassTag parameter to CompactBuffer[T]
Added a ClassTag parameter to CompactBuffer so that CompactBuffer[T] can create primitive arrays for primitive types. This reduces memory usage significantly for primitive types, at only a minor performance cost.

Here is my test code:
```scala
  // Call org.apache.spark.util.SizeEstimator.estimate
  def estimateSize(obj: AnyRef): Long = {
    val c = Class.forName("org.apache.spark.util.SizeEstimator$")
    val f = c.getField("MODULE$")
    val o = f.get(c)
    val m = c.getMethod("estimate", classOf[Object])
    m.setAccessible(true)
    m.invoke(o, obj).asInstanceOf[Long]
  }

  sc.parallelize(1 to 10000).groupBy(_ => 1).foreach {
    case (k, v) =>
      println(v.getClass() + " size: " + estimateSize(v))
  }
```

With the previous CompactBuffer, this outputs
```
class org.apache.spark.util.collection.CompactBuffer size: 313358
```

With the new CompactBuffer, this outputs
```
class org.apache.spark.util.collection.CompactBuffer size: 65712
```

In this case, the new `CompactBuffer` used only 20% of the memory of the previous one. It's really helpful for `groupByKey` when the values are primitives.
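
A minimal sketch of the mechanism, using a made-up `PrimBuffer` rather than Spark's actual `CompactBuffer`: with a ClassTag in scope, `new Array[T]` allocates a primitive array (e.g. an int[]) when `T` is primitive, instead of an `Object[]` of boxed values, which is where the savings come from.

```scala
import scala.reflect.ClassTag

class PrimBuffer[T: ClassTag] {
  private var elems = new Array[T](8)   // primitive array for primitive T
  private var curSize = 0

  def +=(t: T): this.type = {
    if (curSize == elems.length) {      // grow by doubling
      val bigger = new Array[T](elems.length * 2)
      System.arraycopy(elems, 0, bigger, 0, curSize)
      elems = bigger
    }
    elems(curSize) = t
    curSize += 1
    this
  }

  def apply(i: Int): T = elems(i)
  def size: Int = curSize
}
```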

Author: zsxwing <zsxwing@gmail.com>

Closes #3378 from zsxwing/SPARK-4505 and squashes the following commits:

4abdbba [zsxwing] Add a ClassTag parameter to reduce the memory usage of CompactBuffer[T] when T is a primitive type
2014-11-29 20:23:08 -05:00
Kousuke Saruta 938dc141ee [SPARK-4057] Use -agentlib instead of -Xdebug in sbt-launch-lib.bash for debugging
In sbt-launch-lib.bash, the -Xdebug option is used for debugging. We should use the -agentlib option for Java 6+.
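
For reference, the legacy and Java 6+ forms of the JDWP debug options look roughly like this (the transport, suspend mode, and port here are arbitrary example values, not taken from this commit):

```scala
// Legacy (pre-Java 6) vs. -agentlib form of the same JDWP settings:
val legacyDebugOpts = "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005"
val modernDebugOpts = "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"
```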

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2904 from sarutak/SPARK-4057 and squashes the following commits:

39b5320 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4057
26b4af8 [Kousuke Saruta] Improved java option for debugging
2014-11-29 20:14:14 -05:00
Stephen Haberman 95290bf4c4 Include the key name when failing on an invalid value.
Admittedly a really small tweak.

Author: Stephen Haberman <stephen@exigencecorp.com>

Closes #3514 from stephenh/include-key-name-in-npe and squashes the following commits:

937740a [Stephen Haberman] Include the key name when failing on an invalid value.
2014-11-29 20:12:05 -05:00
Nicholas Chammas 317e114e11 [SPARK-3398] [SPARK-4325] [EC2] Use EC2 status checks.
This PR re-introduces [0e648bc](0e648bc2be) from PR #2339, which somehow never made it into the codebase.

Additionally, it removes a now-unnecessary linear backoff on the SSH checks since we are blocking on EC2 status checks before testing SSH.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #3195 from nchammas/remove-ec2-ssh-backoff and squashes the following commits:

efb29e1 [Nicholas Chammas] Revert "Remove linear backoff."
ef3ca99 [Nicholas Chammas] reuse conn
adb4eaa [Nicholas Chammas] Remove linear backoff.
55caa24 [Nicholas Chammas] Check EC2 status checks before SSH.
2014-11-29 00:31:06 -08:00
Patrick Wendell 047ff573f7 MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on Github:

Closes #3451 (close requested by 'pwendell')
Closes #1310 (close requested by 'pwendell')
Closes #3207 (close requested by 'JoshRosen')
2014-11-29 00:24:35 -05:00
Liang-Chi Hsieh 49fe8797e6 [SPARK-4597] Use proper exception and reset variable in Utils.createTempDir()
`File.exists()` and `File.mkdirs()` only throw `SecurityException`, not `IOException`. Also, when an exception is thrown, `dir` should be reset.
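
A simplified sketch of the corrected pattern (illustrative code, not the exact `Utils.createTempDir` implementation): catch `SecurityException`, and reset `dir` to null so a failed attempt is retried with a fresh candidate directory.

```scala
import java.io.{File, IOException}
import java.util.UUID

def createTempDir(root: String, maxAttempts: Int = 10): File = {
  var attempts = 0
  var dir: File = null
  while (dir == null) {
    attempts += 1
    if (attempts > maxAttempts) {
      throw new IOException(
        "Failed to create a temp directory after " + maxAttempts + " attempts under " + root)
    }
    try {
      dir = new File(root, "spark-" + UUID.randomUUID.toString)
      if (dir.exists() || !dir.mkdirs()) {
        dir = null                              // name taken or not creatable: try again
      }
    } catch {
      case _: SecurityException => dir = null   // reset before the next attempt
    }
  }
  dir
}
```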

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #3449 from viirya/fix_createtempdir and squashes the following commits:

36cacbd [Liang-Chi Hsieh] Use proper exception and reset variable.
2014-11-28 18:04:15 -08:00
Sean Owen 48223d8815 SPARK-1450 [EC2] Specify the default zone in the EC2 script help
This looks like a one-liner, so I took a shot at it. There can be no fixed default availability zone since the names are different per region. But the default behavior can be documented:

```
    if opts.zone == "":
        opts.zone = random.choice(conn.get_all_zones()).name
```

Author: Sean Owen <sowen@cloudera.com>

Closes #3454 from srowen/SPARK-1450 and squashes the following commits:

9193cf3 [Sean Owen] Document that --zone defaults to a single random zone
2014-11-28 17:43:38 -05:00
Marcelo Vanzin 915f8eeb3a [SPARK-4584] [yarn] Remove security manager from Yarn AM.
The security manager adds a lot of overhead to the runtime of the
app, and causes a severe performance regression. Even stubbing out
all unneeded methods (all except checkExit()) does not help.

So, instead, penalize users who do an explicit System.exit() by leaving
them in "undefined behavior" territory: if they do that, the Yarn
backend won't be able to report the final app status to the RM.
The result is that the final status of the application might not match
the user's expectations.

One side-effect of the change is that users who do an explicit
System.exit() will lose the AM retry functionality. Since there is
no way to know if the exit was because of success or failure, the
AM right now errs on the side of it being a successful exit.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3484 from vanzin/SPARK-4584 and squashes the following commits:

21f2502 [Marcelo Vanzin] Do not retry apps that use System.exit().
4198b3b [Marcelo Vanzin] [SPARK-4584] [yarn] Remove security manager from Yarn AM.
2014-11-28 15:16:05 -05:00
Takuya UESHIN e464f0ac2d [SPARK-4193][BUILD] Disable doclint in Java 8 to prevent from build error.
Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3058 from ueshin/issues/SPARK-4193 and squashes the following commits:

e096bb1 [Takuya UESHIN] Add a plugin declaration to pluginManagement.
6762ec2 [Takuya UESHIN] Fix usage of -Xdoclint javadoc option.
fdb280a [Takuya UESHIN] Fix Javadoc errors.
4745f3c [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4193
923e2f0 [Takuya UESHIN] Use doclint option `-missing` instead of `none`.
30d6718 [Takuya UESHIN] Fix Javadoc errors.
b548017 [Takuya UESHIN] Disable doclint in Java 8 to prevent from build error.
2014-11-28 13:00:15 -05:00
Daoyuan Wang 53ed7f1c7f [SPARK-4643] [Build] Remove unneeded staging repositories from build
The old location will return a 404.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3504 from adrian-wang/repo and squashes the following commits:

f604e05 [Daoyuan Wang] already in maven central, remove at all
f494fac [Daoyuan Wang] spark staging repo outdated
2014-11-28 12:41:51 -05:00
KaiXinXiaoLei 052e65815f Delete unnecessary function
When building Spark with sbt, the function `runAlternateBoot` in sbt/sbt-launch-lib.bash is not used, and nothing else in the Spark code uses it either, so the function is unnecessary. The sbt.boot.properties option can still be configured on the command line when building Spark, e.g.:
sbt/sbt assembly -Dsbt.boot.properties=$bootpropsfile.

The file comes from https://github.com/sbt/sbt-launcher-package, where `runAlternateBoot` has already been deleted upstream. The Spark project should delete this function from sbt/sbt-launch-lib.bash as well. Thanks.

Author: KaiXinXiaoLei <huleilei1@huawei.com>

Closes #3224 from KaiXinXiaoLei/deleteFunction and squashes the following commits:

e8eac49 [KaiXinXiaoLei] Delete blank lines.
efe36d4 [KaiXinXiaoLei] Delete unnecessary function
2014-11-28 12:34:07 -05:00
Cheng Lian 5b99bf243e [SPARK-4645][SQL] Disables asynchronous execution in Hive 0.13.1 HiveThriftServer2
This PR disables HiveThriftServer2 asynchronous execution by setting the `runInBackground` argument in `ExecuteStatementOperation` to `false`, and by reverting `SparkExecuteStatementOperation.run` in the Hive 13 shim to the Hive 12 version. This change makes the Simba ODBC driver v1.0.0.1000 work.

Author: Cheng Lian <lian@databricks.com>

Closes #3506 from liancheng/disable-async-exec and squashes the following commits:

593804d [Cheng Lian] Disables asynchronous execution in Hive 0.13.1 HiveThriftServer2
2014-11-28 11:42:40 -05:00
maji2014 ceb6281970 [SPARK-4619][Storage] Delete redundant time suffix
The time suffix is already included by Utils.getUsedTimeMs(startTime), so there is no need to append it again; delete it.

Author: maji2014 <maji3@asiainfo.com>

Closes #3475 from maji2014/SPARK-4619 and squashes the following commits:

df0da4e [maji2014] delete redundant time suffix
2014-11-28 00:36:22 -08:00
Cheng Lian 120a350240 [SPARK-4613][Core] Java API for JdbcRDD
This PR introduces a set of Java APIs for using `JdbcRDD`:

1. Trait (interface) `JdbcRDD.ConnectionFactory`: equivalent to the `getConnection: () => Connection` parameter in the `JdbcRDD` constructor.
2. Two overloaded versions of `JdbcRDD.create`: used to create a `JavaRDD` that wraps a `JdbcRDD`.
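
A hedged usage sketch, written in Scala for brevity (the connection URL and query are placeholders, and the exact overload signatures should be checked against the PR; the Java entry point mirrors this shape):

```scala
import java.sql.DriverManager
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.rdd.JdbcRDD

def exampleRdd(jsc: JavaSparkContext) = {
  val connectionFactory = new JdbcRDD.ConnectionFactory {
    override def getConnection = DriverManager.getConnection("jdbc:derby:memory:testdb")
  }
  // The lower/upper bounds are bound to the two '?' placeholders; 2 partitions.
  JdbcRDD.create(jsc, connectionFactory,
    "SELECT ID, NAME FROM PEOPLE WHERE ID >= ? AND ID <= ?",
    1L, 100L, 2)
}
```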

Author: Cheng Lian <lian@databricks.com>

Closes #3478 from liancheng/japi-jdbc-rdd and squashes the following commits:

9a54625 [Cheng Lian] Only shutdowns a single DB rather than the whole Derby driver
d4cedc5 [Cheng Lian] Moves Java JdbcRDD test case to a separate test suite
ffcdf2e [Cheng Lian] Java API for JdbcRDD
2014-11-27 18:01:14 -08:00
roxchkplusony 84376d3139 [SPARK-4626] Kill a task only if the executorId is (still) registered with the scheduler
Author: roxchkplusony <roxchkplusony@gmail.com>

Closes #3483 from roxchkplusony/bugfix/4626 and squashes the following commits:

aba9184 [roxchkplusony] replace warning message per review
5e7fdea [roxchkplusony] [SPARK-4626] Kill a task only if the executorId is (still) registered with the scheduler
2014-11-27 15:54:40 -08:00
Sean Owen 5d7fe178b3 SPARK-4170 [CORE] Closure problems when running Scala app that "extends App"
Warn against subclassing scala.App, and remove one instance of this in examples
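
A short illustration of the pitfall (the `sc` calls are commented placeholders for a real SparkContext): vals in an object that extends scala.App are initialized via delayedInit rather than as ordinary constructor fields, so a closure serialized and shipped to executors can observe them in their default (zero/null) state.

```scala
object BadApp extends App {
  val factor = 3
  // sc.parallelize(1 to 10).map(_ * factor)   // may compute with factor == 0 on executors
}

object GoodApp {                               // prefer an explicit main method
  def main(args: Array[String]): Unit = {
    val factor = 3                             // a plain local, captured safely
    // sc.parallelize(1 to 10).map(_ * factor)
  }
}
```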

Author: Sean Owen <sowen@cloudera.com>

Closes #3497 from srowen/SPARK-4170 and squashes the following commits:

4a6131f [Sean Owen] Restore multiline string formatting
a8ca895 [Sean Owen] Warn against subclassing scala.App, and remove one instance of this in examples
2014-11-27 09:03:17 -08:00
Andrew Or c86e9bc4fd [Release] Automate generation of contributors list
This commit provides a script that computes the contributors list
by linking the github commits with JIRA issues. Automatically
translating github usernames remains a TODO at this point.
2014-11-26 23:16:23 -08:00
CodingCat 5af53ada65 [SPARK-732][SPARK-3628][CORE][RESUBMIT] eliminate duplicate update on accmulator
https://issues.apache.org/jira/browse/SPARK-3628

In the current implementation, the accumulator is updated for every successfully finished task, even when the task is from a resubmitted stage, which makes the accumulator counter-intuitive.

In this patch, I changed the way the DAGScheduler updates accumulators.

The DAGScheduler maintains a hash table mapping each stage id to the received <accumulator_id, value> pairs. Only when a stage becomes independent (no job needs it any more) do we accumulate the values of its <accumulator_id, value> pairs. When a task finishes, we check whether the hash table already contains that stage id, and save the <accumulator_id, value> pair only when the task is the first finished task of a new stage or the stage is running its first attempt...
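
A simplified sketch of the de-duplication idea (illustrative code, not the actual DAGScheduler implementation): accumulator updates from a given (stage, partition) pair are applied at most once, so tasks re-run by a resubmitted stage do not double-count.

```scala
import scala.collection.mutable

class AccumDedup {
  private val seen = mutable.HashSet.empty[(Int, Int)]           // (stageId, partitionId)
  private val totals = mutable.HashMap.empty[Long, Long].withDefaultValue(0L)

  def taskFinished(stageId: Int, partitionId: Int, updates: Map[Long, Long]): Unit = {
    if (seen.add((stageId, partitionId))) {                      // first completion only
      for ((accumId, value) <- updates) totals(accumId) += value
    }
  }

  def value(accumId: Long): Long = totals(accumId)
}
```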

Author: CodingCat <zhunansjtu@gmail.com>

Closes #2524 from CodingCat/SPARK-732-1 and squashes the following commits:

701a1e8 [CodingCat] roll back change on Accumulator.scala
1433e6f [CodingCat] make MIMA happy
b233737 [CodingCat] address Matei's comments
02261b8 [CodingCat] rollback  some changes
6b0aff9 [CodingCat] update document
2b2e8cf [CodingCat] updateAccumulator
83b75f8 [CodingCat] style fix
84570d2 [CodingCat] re-enable  the bad accumulator guard
1e9e14d [CodingCat] add NPE guard
21b6840 [CodingCat] simplify the patch
88d1f03 [CodingCat] fix rebase error
f74266b [CodingCat] add test case for resubmitted result stage
5cf586f [CodingCat] de-duplicate on task level
138f9b3 [CodingCat] make MIMA happy
67593d2 [CodingCat] make if allowing duplicate update as an option of accumulator
2014-11-26 16:52:04 -08:00
Xiangrui Meng 561d31d2f1 [SPARK-4614][MLLIB] Slight API changes in Matrix and Matrices
Before we have a full picture of the operators we want to add, it might be safer to hide `Matrix.transposeMultiply` in 1.2.0. Another change we want to make is to `Matrix.randn` and `Matrix.rand`, both of which should take a `Random` implementation; otherwise they are very likely to produce inconsistent RDDs. I also added some unit tests for the matrix factory methods. All of these APIs are new in 1.2, so there are no incompatible changes.
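
A hedged sketch of the resulting factory calls (the exact signatures should be checked against the MLlib source):

```scala
import java.util.Random
import org.apache.spark.mllib.linalg.Matrices

val rng = new Random(42L)
val a = Matrices.rand(3, 2, rng)    // uniform entries, reproducible via the supplied rng
val b = Matrices.randn(3, 2, rng)   // standard normal entries
```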

brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #3468 from mengxr/SPARK-4614 and squashes the following commits:

3b0e4e2 [Xiangrui Meng] add mima excludes
6bfd8a4 [Xiangrui Meng] hide transposeMultiply; add rng to rand and randn; add unit tests
2014-11-26 08:22:50 -08:00
Joseph E. Gonzalez 288ce583b0 Removing confusing TripletFields
After additional discussion with rxin, I think having all the possible `TripletField` options is confusing.  This pull request reduces the triplet fields to:

```java
  /**
   * None of the triplet fields are exposed.
   */
  public static final TripletFields None = new TripletFields(false, false, false);

  /**
   * Expose only the edge field and not the source or destination field.
   */
  public static final TripletFields EdgeOnly = new TripletFields(false, false, true);

  /**
   * Expose the source and edge fields but not the destination field. (Same as Src)
   */
  public static final TripletFields Src = new TripletFields(true, false, true);

  /**
   * Expose the destination and edge fields but not the source field. (Same as Dst)
   */
  public static final TripletFields Dst = new TripletFields(false, true, true);

  /**
   * Expose all the fields (source, edge, and destination).
   */
  public static final TripletFields All = new TripletFields(true, true, true);
```

Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>

Closes #3472 from jegonzal/SimplifyTripletFields and squashes the following commits:

91796b5 [Joseph E. Gonzalez] removing confusing triplet fields
2014-11-26 00:55:28 -08:00
Tathagata Das e7f4d2534b [SPARK-4612] Reduce task latency and increase scheduling throughput by making configuration initialization lazy
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L337 creates a configuration object for every task that is launched, even if there is no new dependent file/JAR to update. This heavyweight creation should be avoided when there is nothing new to update. This PR makes that creation lazy. A quick local test with the spark-perf scheduling throughput tests gives the following numbers in local standalone scheduler mode.
1 job with 10000 tasks: before 7.8395 seconds, after 2.6415 seconds = a 3x increase in task scheduling throughput
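
A minimal sketch of the pattern, with illustrative names rather than Executor's actual fields: `lazy val` defers the heavy construction until first use, so a task with no new files or JARs to fetch never pays for it.

```scala
class TaskDependencies(newFiles: Seq[String]) {
  lazy val hadoopConf: java.util.Properties = {   // stand-in for a Hadoop Configuration
    println("building configuration")             // the expensive step
    new java.util.Properties()
  }

  def update(): Unit = {
    if (newFiles.nonEmpty) {
      val conf = hadoopConf                       // materialized only on this path
      // ... fetch the new files using conf ...
    }
  }
}
```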

pwendell JoshRosen

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #3463 from tdas/lazy-config and squashes the following commits:

c791c1e [Tathagata Das] Reduce task latency by making configuration initialization lazy
2014-11-25 23:15:58 -08:00
Aaron Davidson 346bc17a2e [SPARK-4516] Avoid allocating Netty PooledByteBufAllocators unnecessarily
It turns out we are allocating an allocator pool for every TransportClient (which means the number of pools increases with the number of nodes in the cluster), when really we should just reuse one for all clients.

This patch, as expected, greatly decreases off-heap memory allocation, and appears to make allocation only proportional to the number of cores.
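
A sketch of the shape of the fix (illustrative, not the actual transport code): keep a single shared pooled allocator instead of constructing one per client.

```scala
import io.netty.buffer.PooledByteBufAllocator

object SharedAllocator {
  // One process-wide pooled allocator; preferDirect = true for off-heap buffers.
  lazy val pooled: PooledByteBufAllocator = new PooledByteBufAllocator(true)
  // e.g. bootstrap.option(ChannelOption.ALLOCATOR, SharedAllocator.pooled)
}
```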

Author: Aaron Davidson <aaron@databricks.com>

Closes #3465 from aarondav/fewer-pools and squashes the following commits:

36c49da [Aaron Davidson] [SPARK-4516] Avoid allocating unnecessarily Netty PooledByteBufAllocators
2014-11-26 00:32:45 -05:00
Aaron Davidson f5f2d27385 [SPARK-4516] Cap default number of Netty threads at 8
In practice, only 2-4 cores should be required to transfer roughly 10 Gb/s, and each core that we use will have an initial overhead of roughly 32 MB of off-heap memory, which comes at a premium.

Thus, this value should still retain maximum throughput and reduce wasted off-heap memory allocation. It can be overridden by setting the number of serverThreads and clientThreads manually in Spark's configuration.
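
A one-line sketch of the assumed shape of the new default (serverThreads and clientThreads remain the manual overrides mentioned above):

```scala
// Use every available core, but never more than 8 threads by default.
val defaultNettyThreads = math.min(Runtime.getRuntime.availableProcessors(), 8)
```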

Author: Aaron Davidson <aaron@databricks.com>

Closes #3469 from aarondav/fewer-pools2 and squashes the following commits:

087c59f [Aaron Davidson] [SPARK-4516] Cap default number of Netty threads at 8
2014-11-25 23:57:04 -05:00
Xiangrui Meng b5fb1410c5 [SPARK-4604][MLLIB] make MatrixFactorizationModel public
Users can now construct an MF model directly. I added a note about the performance.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3459 from mengxr/SPARK-4604 and squashes the following commits:

f64bcd3 [Xiangrui Meng] organize imports
ed08214 [Xiangrui Meng] check preconditions and unit tests
a624c12 [Xiangrui Meng] make MatrixFactorizationModel public
2014-11-25 20:11:40 -08:00
Patrick Wendell 4d95526a75 [HOTFIX]: Adding back without-hive dist 2014-11-25 23:10:46 -05:00
Joseph K. Bradley c251fd7405 [SPARK-4583] [mllib] LogLoss for GradientBoostedTrees fix + doc updates
Currently, the LogLoss used by GradientBoostedTrees has 2 issues:
* the gradient (and therefore loss) does not match that used by Friedman (1999)
* the error computation uses 0/1 accuracy, not log loss

This PR updates LogLoss.
It also adds some doc for boosting and forests.

I tested it on sample data and made sure the log loss is monotonically decreasing with each boosting iteration.

CC: mengxr manishamde codedeft

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3439 from jkbradley/gbt-loss-fix and squashes the following commits:

cfec17e [Joseph K. Bradley] removed forgotten temp comments
a27eb6d [Joseph K. Bradley] corrections to last log loss commit
ed5da2c [Joseph K. Bradley] updated LogLoss (boosting) for numerical stability
5e52bff [Joseph K. Bradley] * Removed the 1/2 from SquaredError.  This also required updating the test suite since it effectively doubles the gradient and loss. * Added doc for developers within RandomForest. * Small cleanup in test suite (generating data only once)
e57897a [Joseph K. Bradley] Fixed LogLoss for GradientBoostedTrees, and updated doc for losses, forests, and boosting
2014-11-25 20:10:15 -08:00
Xiangrui Meng 7eba0fbe45 [Spark-4509] Revert EC2 tag-based cluster membership patch
This PR reverts changes related to tag-based cluster membership. As discussed in SPARK-3332, we didn't figure out a safe strategy to use tags to determine cluster membership, because tagging is not atomic. The following changes are reverted:

SPARK-2333: 94053a7b76
SPARK-3213: 7faf755ae4
SPARK-3608: 78d4220fa0.

I tested launch, login, and destroy. It is easy to check the diff by comparing it to Josh's patch for branch-1.1:

https://github.com/apache/spark/pull/2225/files

JoshRosen I sent the PR to master. It might be easier for us to keep master and branch-1.2 the same at this time. We can always re-apply the patch once we figure out a stable solution.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3453 from mengxr/SPARK-4509 and squashes the following commits:

f0b708b [Xiangrui Meng] revert 94053a7b76
4298ea5 [Xiangrui Meng] revert 7faf755ae4
35963a1 [Xiangrui Meng] Revert "SPARK-3608 Break if the instance tag naming succeeds"
2014-11-25 16:07:09 -08:00
hushan[胡珊] 9bdf5da590 Fix SPARK-4471: blockManagerIdFromJson function throws exception while B...
Fix [SPARK-4471](https://issues.apache.org/jira/browse/SPARK-4471): the blockManagerIdFromJson function throws an exception when BlockManagerId is null in MetadataFetchFailedException

Author: hushan[胡珊] <hushan@xiaomi.com>

Closes #3340 from suyanNone/fix-blockmanagerId-jnothing-2 and squashes the following commits:

159f9a3 [hushan[胡珊]] Refine test code for blockmanager is null
4380d73 [hushan[胡珊]] remove useless blank line
3ccf651 [hushan[胡珊]] Fix SPARK-4471: blockManagerIdFromJson function throws exception while metadata fetch failed
2014-11-25 15:51:08 -08:00
Andrew Or 9afcbe494a [SPARK-4546] Improve HistoryServer first time user experience
The documentation points the user to run the following
```
sbin/start-history-server.sh
```
The first thing this does is throw an exception complaining that a log directory is not specified. The exception message itself does not say anything about what to set. Instead, we should have a default and a landing page with a better message. The new default log directory is `file:/tmp/spark-events`.

This is what it looks like as of this PR:

![after](https://issues.apache.org/jira/secure/attachment/12682985/after.png)

Author: Andrew Or <andrew@databricks.com>

Closes #3411 from andrewor14/minor-history-improvements and squashes the following commits:

f33d6b3 [Andrew Or] Point user to set config if default log dir does not exist
fc4c17a [Andrew Or] Improve HistoryServer UX
2014-11-25 15:48:02 -08:00
Andrew Or 1b2ab1cd1b [SPARK-4592] Avoid duplicate worker registrations in standalone mode
**Summary.** On failover, the Master may receive duplicate registrations from the same worker, causing the worker to exit. This is caused by this commit 4afe9a4852, which adds logic for the worker to re-register with the master in case of failures. However, the following race condition may occur:

(1) Master A fails and Worker attempts to reconnect to all masters
(2) Master B takes over and notifies Worker
(3) Worker responds by registering with Master B
(4) Meanwhile, Worker's previous reconnection attempt reaches Master B, causing the same Worker to register with Master B twice

**Fix.** Instead of attempting to register with all known masters, the worker should re-register with only the one that it has been communicating with. This is safe because the fact that a failover has occurred means the old master must have died. Then, when the worker is finally notified of a new master, it gives up on the old one in favor of the new one.

**Caveat.** Even this fix is subject to more obscure race conditions. For instance, if Master B fails and Master A recovers immediately, then Master A may still observe duplicate worker registrations. However, this and other potential race conditions summarized in [SPARK-4592](https://issues.apache.org/jira/browse/SPARK-4592), are much, much less likely than the one described above, which is deterministically reproducible.

Author: Andrew Or <andrew@databricks.com>

Closes #3447 from andrewor14/standalone-failover and squashes the following commits:

0d9716c [Andrew Or] Move re-registration logic to actor for thread-safety
79286dc [Andrew Or] Preserve old behavior for initial retries
83b321c [Andrew Or] Tweak wording
1fce6a9 [Andrew Or] Active master actor could be null in the beginning
b6f269e [Andrew Or] Avoid duplicate worker registrations
2014-11-25 15:46:26 -08:00
Tathagata Das 8838ad7c13 [SPARK-4196][SPARK-4602][Streaming] Fix serialization issue in PairDStreamFunctions.saveAsNewAPIHadoopFiles
Solves two JIRAs in one shot
- Makes the ForEachDStream created by saveAsNewAPIHadoopFiles serializable for checkpoints
- Makes the default configuration object used by saveAsNewAPIHadoopFiles be Spark's Hadoop configuration

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #3457 from tdas/savefiles-fix and squashes the following commits:

bb4729a [Tathagata Das] Same treatment for saveAsHadoopFiles
b382ea9 [Tathagata Das] Fix serialization issue in PairDStreamFunctions.saveAsNewAPIHadoopFiles.
2014-11-25 14:16:27 -08:00
DB Tsai bf1a6aaac5 [SPARK-4581][MLlib] Refactorize StandardScaler to improve the transformation performance
The following optimizations are done to improve the performance of the StandardScaler model
transformation.

1) Convert the Breeze dense vector to a primitive vector to reduce the overhead.
2) Since the mean can potentially be a sparse vector, we explicitly convert it to a dense primitive vector.
3) Keep a local reference to the `shift` and `factor` arrays so the JVM can locate the values with one operation.
4) In the pattern matching part, we use the mllib SparseVector/DenseVector instead of Breeze's vectors to
make the codebase cleaner.

Benchmark with mnist8m dataset:

Before,
DenseVector withMean and withStd: 50.97secs
DenseVector withMean and withoutStd: 42.11secs
DenseVector withoutMean and withStd: 8.75secs
SparseVector withoutMean and withStd: 5.437secs

With this PR,
DenseVector withMean and withStd: 5.76secs
DenseVector withMean and withoutStd: 5.28secs
DenseVector withoutMean and withStd: 5.30secs
SparseVector withoutMean and withStd: 1.27secs

Note that without the local reference copies of the `factor` and `shift` arrays,
the runtime is almost three times slower.

DenseVector withMean and withStd: 18.15secs
DenseVector withMean and withoutStd: 18.05secs
DenseVector withoutMean and withStd: 18.54secs
SparseVector withoutMean and withStd: 2.01secs

The following code,
```scala
while (i < size) {
   values(i) = (values(i) - shift(i)) * factor(i)
   i += 1
}
```
will generate the bytecode
```
   L13
    LINENUMBER 106 L13
   FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I] []
    ILOAD 7
    ILOAD 6
    IF_ICMPGE L14
   L15
    LINENUMBER 107 L15
    ALOAD 5
    ILOAD 7
    ALOAD 5
    ILOAD 7
    DALOAD
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.shift ()[D
    ILOAD 7
    DALOAD
    DSUB
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
    ILOAD 7
    DALOAD
    DMUL
    DASTORE
   L16
    LINENUMBER 108 L16
    ILOAD 7
    ICONST_1
    IADD
    ISTORE 7
    GOTO L13
```
, while with local references to the `shift` and `factor` arrays, the bytecode will be
```
   L14
    LINENUMBER 107 L14
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
    ASTORE 9
   L15
    LINENUMBER 108 L15
   FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector [D org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I [D] []
    ILOAD 8
    ILOAD 7
    IF_ICMPGE L16
   L17
    LINENUMBER 109 L17
    ALOAD 6
    ILOAD 8
    ALOAD 6
    ILOAD 8
    DALOAD
    ALOAD 2
    ILOAD 8
    DALOAD
    DSUB
    ALOAD 9
    ILOAD 8
    DALOAD
    DMUL
    DASTORE
   L18
    LINENUMBER 110 L18
    ILOAD 8
    ICONST_1
    IADD
    ISTORE 8
    GOTO L15
```

You can see that with local references, both arrays are on the stack, so the JVM can access the values without calling `INVOKESPECIAL`.
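
For comparison, the loop rewritten with the local references in place; a sketch of the pattern, not the exact StandardScalerModel code:

```scala
class ScalerSketch(shift: Array[Double], factor: Array[Double]) {
  def transformInPlace(values: Array[Double]): Unit = {
    val localShift = shift     // copying the field into a local val keeps the
    val localFactor = factor   // array reference on the stack frame, so each
    var i = 0                  // element access is a plain DALOAD
    while (i < values.length) {
      values(i) = (values(i) - localShift(i)) * localFactor(i)
      i += 1
    }
  }
}
```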

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3435 from dbtsai/standardscaler and squashes the following commits:

85885a9 [DB Tsai] revert to have lazy in shift array.
daf2b06 [DB Tsai] Address the feedback
cdb5cef [DB Tsai] small change
9c51eef [DB Tsai] style
fc795e4 [DB Tsai] update
5bffd3d [DB Tsai] first commit
2014-11-25 11:07:11 -08:00
Tathagata Das 69cd53eae2 [SPARK-4601][Streaming] Set correct call site for streaming jobs so that it is displayed correctly on the Spark UI
When running NetworkWordCount, the description of the word count jobs is set to "getCallsite at DStream:xxx". It should instead be set to the line of the streaming application containing the output operation that led to the job being created. This happens because the callsite is incorrectly set in the thread that launches the jobs. This PR fixes that.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #3455 from tdas/streaming-callsite-fix and squashes the following commits:

69fc26f [Tathagata Das] Set correct call site for streaming jobs so that it is displayed correctly on the Spark UI
2014-11-25 06:50:36 -08:00
arahuja d240760191 [SPARK-4344][DOCS] adding documentation on spark.yarn.user.classpath.first
The documentation for the two parameters is the same, with a pointer from the standalone parameter to the YARN parameter.

Author: arahuja <aahuja11@gmail.com>

Closes #3209 from arahuja/yarn-classpath-first-param and squashes the following commits:

51cb9b2 [arahuja] [SPARK-4344][DOCS] adding documentation for YARN on userClassPathFirst
2014-11-25 08:23:41 -06:00
jerryshao fef27b2943 [SPARK-4381][Streaming]Add warning log when user set spark.master to local in Spark Streaming and there's no job executed
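
A simplified sketch of the check this adds (the real log message and its placement live in the streaming code): a receiver occupies a core, so "local" or "local[1]" leaves no core for processing and jobs silently never run.

```scala
def warnOnStarvedMaster(master: String, numReceivers: Int): Unit = {
  if (numReceivers > 0 && (master == "local" || master == "local[1]")) {
    println("WARN: spark.master should be set as local[n], n > 1 " +
      "when running receivers; otherwise no data will be processed")
  }
}
```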
Author: jerryshao <saisai.shao@intel.com>

Closes #3244 from jerryshao/SPARK-4381 and squashes the following commits:

d2486c7 [jerryshao] Improve the warning log
d726e85 [jerryshao] Add local[1] to the filter condition
eca428b [jerryshao] Add warning log
2014-11-25 05:36:29 -08:00
q00251598 a51118a34a [SPARK-4535][Streaming] Fix the error in comments
change `NetworkInputDStream` to `ReceiverInputDStream`
change `ReceiverInputTracker` to `ReceiverTracker`

Author: q00251598 <qiyadong@huawei.com>

Closes #3400 from watermen/fix-comments and squashes the following commits:

75d795c [q00251598] change 'NetworkInputDStream' to 'ReceiverInputDStream' && change 'ReceiverInputTracker' to 'ReceiverTracker'
2014-11-25 04:01:56 -08:00
GuoQiang Li f515f9432b [SPARK-4526][MLLIB]GradientDescent get a wrong gradient value according to the gradient formula.
This is caused by the miniBatchSize parameter: the number of records `RDD.sample` returns is not fixed.
cc mengxr

Author: GuoQiang Li <witgo@qq.com>

Closes #3399 from witgo/GradientDescent and squashes the following commits:

13cb228 [GuoQiang Li] review commit
668ab66 [GuoQiang Li] Double to Long
b6aa11a [GuoQiang Li] Check miniBatchSize is greater than 0
0b5c3e3 [GuoQiang Li] Minor fix
12e7424 [GuoQiang Li] GradientDescent get a wrong gradient value according to the gradient formula, which is caused by the miniBatchSize parameter.
2014-11-25 02:01:19 -08:00
DB Tsai 89f9122646 [SPARK-4596][MLLib] Refactorize Normalizer to make code cleaner
In this refactoring, performance is slightly improved by removing the
overhead of the Breeze vector. The bottleneck is still the Breeze norm,
which is implemented via activeIterator.

This inefficiency in the Breeze norm will be addressed in the next PR. At least,
this PR makes the code more consistent across the codebase.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3446 from dbtsai/normalizer and squashes the following commits:

e20a2b9 [DB Tsai] first commit
2014-11-25 01:57:34 -08:00
wangfei 0fe54cff19 [DOC][Build] Wrong command for building Spark with Apache Hadoop 2.4.X and Hive 12
Author: wangfei <wangfei1@huawei.com>

Closes #3335 from scwf/patch-10 and squashes the following commits:

d343113 [wangfei] add '-Phive'
60d595e [wangfei] [DOC] Wrong cmd for build spark with apache hadoop 2.4.X and Hive 12 support
2014-11-24 22:32:39 -08:00
w00228970 723be60e23 [SQL] Compute timeTaken correctly
```timeTaken``` should not include the time spent printing the result.

Author: w00228970 <wangfei1@huawei.com>

Closes #3423 from scwf/time-taken-bug and squashes the following commits:

da7e102 [w00228970] compute time taken correctly
2014-11-24 21:17:24 -08:00
tkaessmann 9ce2bf3821 [SPARK-4582][MLLIB] get raw vectors for further processing in Word2Vec
This is #3309 for the master branch.

e.g. clustering

Author: tkaessmann <tobias.kaessmann@s24.com>

Closes #3309 from tkaessmann/branch-1.2 and squashes the following commits:

e3a3142 [tkaessmann] changes the comment for getVectors
58d3d83 [tkaessmann] removes sign from comment
a5be213 [tkaessmann] fixes getVectors to fit code guidelines
3782fa9 [tkaessmann] get raw vectors for further processing

Author: tkaessmann <tobias.kaessmann@s24.com>

Closes #3437 from mengxr/SPARK-4582 and squashes the following commits:

6c666b4 [tkaessmann] get raw vectors for further processing in Word2Vec
2014-11-24 19:58:01 -08:00