Commit graph

9030 commits

Author SHA1 Message Date
Patrick Wendell 36bdb5b748 MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on GitHub:

Closes #2883 (close requested by 'pwendell')
Closes #3364 (close requested by 'pwendell')
Closes #4458 (close requested by 'pwendell')
Closes #1574 (close requested by 'andrewor14')
Closes #2546 (close requested by 'andrewor14')
Closes #2516 (close requested by 'andrewor14')
Closes #154 (close requested by 'andrewor14')
2014-12-10 14:41:16 -08:00
Andrew Or 4f93d0cabe [SPARK-4759] Fix driver hanging from coalescing partitions
The driver hangs sometimes when we coalesce RDD partitions. See JIRA for more details and reproduction.

This is because our use of the empty string as the default preferred location in `CoalescedRDDPartition` causes the `TaskSetManager` to schedule the corresponding task on host `""` (the empty string). The intended semantics, however, are that the partition has no preferred location, and the `TaskSetManager` should schedule the corresponding task accordingly.
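
A minimal sketch of the intended semantics, using hypothetical names rather than the real `CoalescedRDDPartition`:

```
// Hypothetical simplification of the fix: an unset Option yields an empty
// Seq of preferred locations, so the scheduler sees "no preference"
// instead of the bogus host "".
case class CoalescedPartitionSketch(index: Int, preferredLocation: Option[String]) {
  // Before the fix, a default of Some("") leaked host "" to the TaskSetManager.
  def preferredLocations: Seq[String] = preferredLocation.filter(_.nonEmpty).toSeq
}
```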

Author: Andrew Or <andrew@databricks.com>

Closes #3633 from andrewor14/coalesce-preferred-loc and squashes the following commits:

e520d6b [Andrew Or] Oops
3ebf8bd [Andrew Or] A few comments
f370a4e [Andrew Or] Fix tests
2f7dfb6 [Andrew Or] Avoid using empty string as default preferred location
2014-12-10 14:27:53 -08:00
Ilya Ganelin 447ae2de5d [SPARK-4569] Rename 'externalSorting' in Aggregator
Hi all - I've renamed the unhelpfully named variable and added a comment clarifying what's actually happening.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #3666 from ilganeli/SPARK-4569B and squashes the following commits:

1810394 [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator
e2d2092 [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator
d7cefec [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator
5b3f39c [Ilya Ganelin] [SPARK-4569] Rename  in Aggregator
2014-12-10 14:19:37 -08:00
Daoyuan Wang e230da18f8 [SPARK-4793] [Deploy] ensure .jar at end of line
Sometimes I switch between different versions and do not want to rebuild Spark, so I rename assembly.jar to .jar.bak, but it is still picked up by `compute-classpath.sh`.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3641 from adrian-wang/jar and squashes the following commits:

45cbfd0 [Daoyuan Wang] ensure .jar at end of line
2014-12-10 13:30:45 -08:00
Andrew Or faa8fd8178 [SPARK-4215] Allow requesting / killing executors only in YARN mode
Currently this doesn't do anything in other modes, so we might as well just disable it rather than having the user mistakenly rely on it.
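
A hedged sketch of the guard (method name and message are illustrative, not Spark's exact code):

```
// Reject executor requests outside YARN instead of silently doing nothing;
// killing executors would get the same check.
def requestExecutors(master: String, numAdditionalExecutors: Int): Unit = {
  if (!master.startsWith("yarn")) {
    throw new UnsupportedOperationException(
      "Requesting executors is currently only supported in YARN mode")
  }
  // ... forward the request to the cluster manager backend ...
}
```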

Author: Andrew Or <andrew@databricks.com>

Closes #3615 from andrewor14/dynamic-allocation-yarn-only and squashes the following commits:

ce6487a [Andrew Or] Allow requesting / killing executors only in YARN mode
2014-12-10 12:48:24 -08:00
Andrew Or 56212831c6 [SPARK-4771][Docs] Document standalone cluster supervise mode
tdas It looks like the streaming docs already refer to the supervise mode, but the link from there is broken.

Author: Andrew Or <andrew@databricks.com>

Closes #3627 from andrewor14/document-supervise and squashes the following commits:

9ca0908 [Andrew Or] Wording changes
2b55ed2 [Andrew Or] Document standalone cluster supervise mode
2014-12-10 12:41:36 -08:00
Kousuke Saruta 0fc637b4c2 [SPARK-4329][WebUI] HistoryPage pagenation
The current HistoryPage has links only to the previous and next pages.
I suggest adding an index so that history pages can be accessed easily.

I implemented it as shown in the following screenshots.

If there are many pages, the current page +/- N pages, the first page, and the last page are indexed.
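
A toy sketch of that indexing rule (my own illustration, not the patch's code):

```
// Pages shown in the index: the first page, the last page, and a window
// of n pages on either side of the current page.
def pageIndices(current: Int, last: Int, n: Int = 2): Seq[Int] =
  (Seq(1, last) ++ (current - n to current + n))
    .filter(p => 1 <= p && p <= last)
    .distinct.sorted

// pageIndices(current = 10, last = 50) == Seq(1, 8, 9, 10, 11, 12, 50)
```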

![2014-11-10 16 13 25](https://cloud.githubusercontent.com/assets/4736016/4986246/9c7bbac4-6937-11e4-8695-8634d039d5b6.png)
![2014-11-10 16 03 21](https://cloud.githubusercontent.com/assets/4736016/4986210/3951bb74-6937-11e4-8b4e-9f90d266d736.png)
![2014-11-10 16 03 39](https://cloud.githubusercontent.com/assets/4736016/4986211/3b196ad8-6937-11e4-9f81-74bc0a6dad5b.png)
![2014-11-10 16 03 49](https://cloud.githubusercontent.com/assets/4736016/4986213/40686138-6937-11e4-86c0-41100f0404f6.png)
![2014-11-10 16 04 04](https://cloud.githubusercontent.com/assets/4736016/4986215/4326c9b4-6937-11e4-87ac-0f30c86ec6e3.png)

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3194 from sarutak/history-page-indexing and squashes the following commits:

15d3d2d [Kousuke Saruta] Simplified code
c93932e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into history-page-indexing
1c2f605 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into history-page-indexing
76b05e3 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into history-page-indexing
b2240f8 [Kousuke Saruta] Fixed style
ec7922e [Kousuke Saruta] Simplified code
755a004 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into history-page-indexing
cfa242b [Kousuke Saruta] Added index to HistoryPage
2014-12-10 12:30:45 -08:00
GuoQiang Li 742e7093ec [SPARK-4161] Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf
Author: GuoQiang Li <witgo@qq.com>

Closes #3050 from witgo/SPARK-4161 and squashes the following commits:

abb6fa4 [GuoQiang Li] move usejavacp opt to spark-shell
89e39e7 [GuoQiang Li] review commit
c2a6f04 [GuoQiang Li] Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf
2014-12-10 12:26:42 -08:00
Nathan Kronenfeld 94b377f944 [SPARK-4772] Clear local copies of accumulators as soon as we're done with them
Accumulators keep thread-local copies of themselves.  These copies were only cleared at the beginning of a task.  This meant that (a) the memory they used was tied up until the next task ran on that thread, and (b) if a thread died, the memory it had used for accumulators was locked up forever on that worker.

This PR clears the thread-local copies of accumulators at the end of each task, in the task's `finally` block, to make sure they are cleaned up between tasks. It also stores them in a `ThreadLocal` object, so that if, for some reason, the thread dies, any memory they are using at the time should be freed up.
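
A minimal sketch of the pattern, assuming a simplified registry rather than Spark's actual `Accumulators` object:

```
import scala.collection.mutable

// Per-thread registry of accumulator copies; clear() is meant to be called
// from the task's finally block so entries never outlive the task.
object LocalAccumulators {
  private val localAccums = new ThreadLocal[mutable.Map[Long, Any]] {
    override def initialValue(): mutable.Map[Long, Any] = mutable.Map.empty
  }
  def register(id: Long, acc: Any): Unit = localAccums.get()(id) = acc
  def clear(): Unit = localAccums.get().clear()
}

// Task runner skeleton:
// try { runTask() } finally { LocalAccumulators.clear() }
```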

Author: Nathan Kronenfeld <nkronenfeld@oculusinfo.com>

Closes #3570 from nkronenfeld/Accumulator-Improvements and squashes the following commits:

a581f3f [Nathan Kronenfeld] Change Accumulators to private[spark] instead of adding mima exclude to get around false positive in mima tests
b6c2180 [Nathan Kronenfeld] Include MiMa exclude as per build error instructions - this version incompatibility should be irrelevent, as it will only surface if a master is talking to a worker running a different version of spark.
537baad [Nathan Kronenfeld] Fuller refactoring as intended, incorporating JR's suggestions for ThreadLocal localAccums, and keeping clear(), but also calling it in tasks' finally block, rather than just at the beginning of the task.
39a82f2 [Nathan Kronenfeld] Clear local copies of accumulators as soon as we're done with them
2014-12-09 23:53:17 -08:00
Josh Rosen f79c1cfc99 [Minor] Use <sup> tag for help icon in web UI page header
This small commit makes the `(?)` web UI help link into a superscript, which should address feedback that the current design makes it look like an error occurred or like information is missing.

Before:

![image](https://cloud.githubusercontent.com/assets/50748/5370611/a3ed0034-7fd9-11e4-870f-05bd9faad5b9.png)

After:

![image](https://cloud.githubusercontent.com/assets/50748/5370602/6c5ca8d6-7fd9-11e4-8d1a-568d71290aa7.png)

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3659 from JoshRosen/webui-help-sup and squashes the following commits:

bd72899 [Josh Rosen] Use <sup> tag for help icon in web UI page header.
2014-12-09 23:47:05 -08:00
Reynold Xin 9bd9334f58 Config updates for the new shuffle transport.
Author: Reynold Xin <rxin@databricks.com>

Closes #3657 from rxin/conf-update and squashes the following commits:

7370eab [Reynold Xin] Config updates for the new shuffle transport.
2014-12-09 19:29:09 -08:00
Reynold Xin 2b9b72682e [SPARK-4740] Create multiple concurrent connections between two peer nodes in Netty.
It's been reported that when the number of disks is large and the number of nodes is small, Netty network throughput is low compared with NIO. We suspect the problem is that only a small number of disks are utilized to serve shuffle files at any given point, due to connection reuse. This patch adds a new config parameter to specify the number of concurrent connections between two peer nodes, defaulting to 2.
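
A hedged sketch of the selection logic (simplified; the locking in the real patch is finer-grained, with double-checked synchronization per slot):

```
import java.util.concurrent.ThreadLocalRandom

// Keep numConnectionsPerPeer sockets per remote node and pick one at
// random per request, so several disks can serve shuffle blocks at once.
class ClientPool[C <: AnyRef](numConnectionsPerPeer: Int, connect: () => C) {
  private val clients = new Array[AnyRef](numConnectionsPerPeer)
  def borrow(): C = {
    val i = ThreadLocalRandom.current().nextInt(numConnectionsPerPeer)
    clients.synchronized {
      if (clients(i) == null) clients(i) = connect()  // lazily fill the slot
      clients(i).asInstanceOf[C]
    }
  }
}
```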

Author: Reynold Xin <rxin@databricks.com>

Closes #3625 from rxin/SPARK-4740 and squashes the following commits:

ad4241a [Reynold Xin] Updated javadoc.
f33c72b [Reynold Xin] Code review feedback.
0fefabb [Reynold Xin] Use double check in synchronization.
41dfcb2 [Reynold Xin] Added test case.
9076b4a [Reynold Xin] Fixed two NPEs.
3e1306c [Reynold Xin] Minor style fix.
4f21673 [Reynold Xin] [SPARK-4740] Create multiple concurrent connections between two peer nodes in Netty.
2014-12-09 17:49:59 -08:00
Sean Owen d8f84f26e3 SPARK-4805 [CORE] BlockTransferMessage.toByteArray() trips assertion
Allocate enough room for type byte as well as message, to avoid tripping assertion about capacity of the buffer
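
A sketch of the fix under simplified assumptions (Netty heap buffer; the encoder is passed in as a plain function):

```
import io.netty.buffer.{ByteBuf, Unpooled}

// Allocate encodedLength + 1 bytes: one for the message-type tag and the
// rest for the body, so writing never exceeds the buffer's capacity.
def toByteArray(typeByte: Byte, encodedLength: Int)(encode: ByteBuf => Unit): Array[Byte] = {
  val buf = Unpooled.buffer(encodedLength + 1)  // the +1 was the missing room
  buf.writeByte(typeByte)
  encode(buf)
  assert(buf.writableBytes() == 0)  // the assertion that used to trip
  buf.array()
}
```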

Author: Sean Owen <sowen@cloudera.com>

Closes #3650 from srowen/SPARK-4805 and squashes the following commits:

9e1d502 [Sean Owen] Allocate enough room for type byte as well as message, to avoid tripping assertion about capacity of the buffer
2014-12-09 16:38:27 -08:00
Sandy Ryza 5e4c06f8e5 SPARK-4567. Make SparkJobInfo and SparkStageInfo serializable
Author: Sandy Ryza <sandy@cloudera.com>

Closes #3426 from sryza/sandy-spark-4567 and squashes the following commits:

cb4b8d2 [Sandy Ryza] SPARK-4567. Make SparkJobInfo and SparkStageInfo serializable
2014-12-09 16:26:07 -08:00
hushan[胡珊] 30dca924df [SPARK-4714] BlockManager.dropFromMemory() should check whether block has been removed after synchronizing on BlockInfo instance.
After synchronizing on the `info` lock in the `removeBlock`/`dropOldBlocks`/`dropFromMemory` methods in BlockManager, the block that `info` represents may have already been removed.

The three methods have the same logic to get the `info` lock:
```
val info = blockInfo.get(id)
if (info != null) {
  info.synchronized {
    // do something
  }
}
```

So there is a chance that, when a thread enters the `info.synchronized` block, `info` has already been removed from the `blockInfo` map by some other thread that entered `info.synchronized` first.

The `removeBlock` and `dropOldBlocks` methods are idempotent, so it's safe for them to run on blocks that have already been removed.
But in `dropFromMemory` this may be problematic, since it may drop to the disk store block data that has already been removed, and this invokes data store operations that are not designed to handle missing blocks.

This patch fixes this issue by adding a check to `dropFromMemory` to test whether blocks have been removed by a racing thread.
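
A minimal sketch of the added guard, with deliberately simplified types standing in for `BlockManager`'s internals:

```
import scala.collection.concurrent.TrieMap

class BlockInfo
object DropFromMemorySketch {
  val blockInfo = TrieMap.empty[String, BlockInfo]

  def dropFromMemory(id: String): Unit =
    blockInfo.get(id).foreach { info =>
      info.synchronized {
        // Re-check after acquiring the lock: a racing thread may have
        // removed the block between the lookup and entering this section.
        if (!blockInfo.contains(id)) {
          println(s"Block $id already removed; nothing to drop")
        } else {
          // ... write the block to the disk store, then drop it ...
          blockInfo.remove(id)
        }
      }
    }
}
```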

Author: hushan[胡珊] <hushan@xiaomi.com>

Closes #3574 from suyanNone/refine-block-concurrency and squashes the following commits:

edb989d [hushan[胡珊]] Refine code style and comments position
55fa4ba [hushan[胡珊]] refine code
e57e270 [hushan[胡珊]] add check info is already remove or not while having gotten info.syn
2014-12-09 15:54:40 -08:00
Kay Ousterhout 1f5110630c [SPARK-4765] Make GC time always shown in UI.
This commit removes the GC time for each task from the set of
optional, additional metrics, and instead always shows it for
each task.

cc pwendell

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #3622 from kayousterhout/gc_time and squashes the following commits:

15ac242 [Kay Ousterhout] Make TaskDetailsClassNames private[spark]
e71d893 [Kay Ousterhout] [SPARK-4765] Make GC time always shown in UI.
2014-12-09 15:10:36 -08:00
maji2014 b31074466a [SPARK-4691][shuffle] Restructure a few lines in shuffle code
In HashShuffleReader.scala and HashShuffleWriter.scala, there is no need to check `dep.aggregator.isEmpty` again, as this is already covered by the `dep.aggregator.isDefined` check.

In SortShuffleWriter.scala, `dep.aggregator.isEmpty` reads better than `!dep.aggregator.isDefined`.
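
A sketch of the simplified branching, with the types reduced to bare essentials:

```
// One pattern match on the Option covers both the "defined" and "empty"
// cases, instead of testing isDefined and then isEmpty again.
def combineIfDefined[K, V](agg: Option[Iterator[(K, V)] => Iterator[(K, V)]],
                           records: Iterator[(K, V)]): Iterator[(K, V)] =
  agg match {
    case Some(combine) => combine(records)  // aggregation requested
    case None          => records           // pass records through
  }
```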

Author: maji2014 <maji3@asiainfo.com>

Closes #3553 from maji2014/spark-4691 and squashes the following commits:

bf7b14d [maji2014] change a elegant way for SortShuffleWriter.scala
10d0cf0 [maji2014] change a elegant way
d8f52dc [maji2014] code optimization for judgement
2014-12-09 13:13:12 -08:00
jbencook 61f1a70227 [SPARK-874] adding a --wait flag
This PR adds a --wait flag to the `./sbin/stop-all.sh` script.

Author: jbencook <jbenjamincook@gmail.com>

Closes #3567 from jbencook/master and squashes the following commits:

d05c5bb [jbencook] [SPARK-874] adding a --wait flag
2014-12-09 12:16:19 -08:00
Sandy Ryza 912563aa35 SPARK-4338. [YARN] Ditch yarn-alpha.
Sorry if this is a little premature with 1.2 still not out the door, but it will make other work like SPARK-4136 and SPARK-2089 a lot easier.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3215 from sryza/sandy-spark-4338 and squashes the following commits:

1c5ac08 [Sandy Ryza] Update building Spark docs and remove unnecessary newline
9c1421c [Sandy Ryza] SPARK-4338. Ditch yarn-alpha.
2014-12-09 11:02:43 -08:00
Cheng Hao 383c5555c9 [SPARK-4785][SQL] Initialize Hive UDFs on the driver and serialize them with a wrapper
Unlike Hive 0.12.0, Hive 0.13.1 expects UDF/UDAF/UDTF (a.k.a. Hive function) objects to be initialized only once on the driver side and then serialized to executors. However, not all function objects are serializable (e.g. GenericUDF doesn't implement Serializable). Hive 0.13.1 solves this issue with a Kryo or XML serializer, and several utility ser/de methods are provided in class o.a.h.h.q.e.Utilities for this purpose. In this PR we chose Kryo for efficiency. The Kryo serializer used here is the one created in Hive; Spark's Kryo serializer wasn't used because there is no SparkConf instance available.
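
A hedged sketch of the wrapper idea, with the Kryo ser/de injected as plain functions (the real code calls Hive's `Utilities` helpers instead):

```
// Holds the initialized function transiently; on Java serialization the
// Kryo-encoded bytes travel instead, and the executor rebuilds the object.
class HiveFunctionWrapperSketch(
    @transient var function: AnyRef,
    serialize: AnyRef => Array[Byte],
    deserialize: Array[Byte] => AnyRef) extends Serializable {

  private var functionBytes: Array[Byte] = _

  private def writeObject(out: java.io.ObjectOutputStream): Unit = {
    functionBytes = serialize(function)  // Kryo ser on the driver
    out.defaultWriteObject()
  }
  private def readObject(in: java.io.ObjectInputStream): Unit = {
    in.defaultReadObject()
    function = deserialize(functionBytes)  // rebuilt on the executor
  }
}
```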

Author: Cheng Hao <hao.cheng@intel.com>
Author: Cheng Lian <lian@databricks.com>

Closes #3640 from chenghao-intel/udf_serde and squashes the following commits:

8e13756 [Cheng Hao] Update the comment
74466a3 [Cheng Hao] refactor as feedbacks
396c0e1 [Cheng Hao] avoid Simple UDF to be serialized
e9c3212 [Cheng Hao] update the comment
19cbd46 [Cheng Hao] support udf instance ser/de after initialization
2014-12-09 10:28:33 -08:00
zsxwing bcb5cdad61 [SPARK-3154][STREAMING] Replace ConcurrentHashMap with mutable.HashMap and remove @volatile from 'stopped'
Since `sequenceNumberToProcessor` and `stopped` are both protected by the lock on `sequenceNumberToProcessor`, the `ConcurrentHashMap` and the `volatile` modifier are unnecessary. This PR updates them accordingly.
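
A minimal sketch of the locking discipline (illustrative class, not the actual receiver code):

```
import scala.collection.mutable

// Every access goes through sequenceNumberToProcessor's monitor, so a plain
// HashMap and a non-volatile Boolean are enough.
class ProcessorRegistry {
  private val sequenceNumberToProcessor = mutable.HashMap.empty[Long, String]
  private var stopped = false  // guarded by sequenceNumberToProcessor

  def stop(): Unit = sequenceNumberToProcessor.synchronized { stopped = true }

  def register(seq: Long, processor: String): Boolean =
    sequenceNumberToProcessor.synchronized {
      if (stopped) false
      else { sequenceNumberToProcessor(seq) = processor; true }
    }
}
```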

Author: zsxwing <zsxwing@gmail.com>

Closes #3634 from zsxwing/SPARK-3154 and squashes the following commits:

0d087ac [zsxwing] Replace ConcurrentHashMap with mutable.HashMap and remove @volatile from 'stopped'
2014-12-08 23:54:15 -08:00
Cheng Hao 51b1fe1426 [SPARK-4769] [SQL] CTAS does not work when reading from temporary tables
This is the code refactor and follow ups for #2570

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3336 from chenghao-intel/createtbl and squashes the following commits:

3563142 [Cheng Hao] remove the unused variable
e215187 [Cheng Hao] eliminate the compiling warning
4f97f14 [Cheng Hao] fix bug in unittest
5d58812 [Cheng Hao] revert the API changes
b85b620 [Cheng Hao] fix the regression of temp tabl not found in CTAS
2014-12-08 17:39:12 -08:00
Jacky Li 944384363d [SQL] remove unnecessary import in spark-sql
Author: Jacky Li <jacky.likun@huawei.com>

Closes #3630 from jackylk/remove and squashes the following commits:

150e7e0 [Jacky Li] remove unnecessary import
2014-12-08 17:27:46 -08:00
Sandy Ryza cda94d15ea SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3624 from sryza/sandy-spark-4770 and squashes the following commits:

bd81a3a [Sandy Ryza] SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN
2014-12-08 16:28:36 -08:00
Sean Owen e829bfa1ab SPARK-3926 [CORE] Reopened: result of JavaRDD collectAsMap() is not serializable
My original 'fix' didn't actually fix anything. Now there's a unit test to check whether it works. Of the two options to really fix it -- copy the `Map` to a `java.util.HashMap`, or copy and modify Scala's implementation in `Wrappers.MapWrapper` -- I went with the latter.
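
A condensed sketch of the approach (the real class adapts Scala's `Wrappers.MapWrapper` and notes the copied code in LICENSE):

```
import java.util.{AbstractMap, HashSet => JHashSet, Set => JSet}

// A java.util.Map view of a Scala Map that is also Serializable, which the
// wrapper returned by collectAsMap() was not. Assumes keys and values are
// themselves serializable.
class SerializableMapWrapperSketch[K, V](underlying: Map[K, V])
  extends AbstractMap[K, V] with java.io.Serializable {

  override def entrySet(): JSet[java.util.Map.Entry[K, V]] = {
    val set = new JHashSet[java.util.Map.Entry[K, V]]()
    underlying.foreach { case (k, v) =>
      set.add(new AbstractMap.SimpleImmutableEntry[K, V](k, v))
    }
    set
  }
}
```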

Author: Sean Owen <sowen@cloudera.com>

Closes #3587 from srowen/SPARK-3926 and squashes the following commits:

8586bb9 [Sean Owen] Remove unneeded no-arg constructor, and add additional note about copied code in LICENSE
7bb0e66 [Sean Owen] Make SerializableMapWrapper actually serialize, and add unit test
2014-12-08 16:13:03 -08:00
Andrew Or 65f929d5b3 [SPARK-4750] Dynamic allocation - synchronize kills
Simple omission on my part.

Author: Andrew Or <andrew@databricks.com>

Closes #3612 from andrewor14/dynamic-allocation-synchronization and squashes the following commits:

1f03b60 [Andrew Or] Synchronize kills
2014-12-08 16:02:33 -08:00
Kostas Sakellis d6a972b3e4 [SPARK-4774] [SQL] Makes HiveFromSpark more portable
HiveFromSpark read the kv1.txt file from SPARK_HOME/examples/src/main/resources/kv1.txt, which assumed
you had a source tree checked out. Now we copy kv1.txt to a temporary file and delete it when
the JVM shuts down. This allows us to run this example outside of a Spark source tree.
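
A sketch of the technique, assuming the file is available as a classpath resource:

```
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

object ResourceCopySketch {
  // Copy a classpath resource to a temp file that is removed at JVM
  // shutdown, so the example no longer needs the source tree on disk.
  def resourceToTempFile(resource: String): File = {
    val in = getClass.getResourceAsStream(resource)  // e.g. "/kv1.txt"
    require(in != null, s"resource $resource not found on classpath")
    val tmp = File.createTempFile("kv1", ".txt")
    tmp.deleteOnExit()
    Files.copy(in, tmp.toPath, StandardCopyOption.REPLACE_EXISTING)
    tmp
  }
}
```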

Author: Kostas Sakellis <kostas@cloudera.com>

Closes #3628 from ksakellis/kostas-spark-4774 and squashes the following commits:

6770f83 [Kostas Sakellis] [SPARK-4774] [SQL] Makes HiveFromSpark more portable
2014-12-08 15:44:18 -08:00
Christophe Préaud ab2abcb5ef [SPARK-4764] Ensure that files are fetched atomically
tempFile is created in the same directory as targetFile, so that the
move from tempFile to targetFile is always atomic.
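
A hedged sketch of that invariant, written with NIO for brevity:

```
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Download into a temp file in the *same* directory as the target; a rename
// within one directory is atomic, so readers never observe a partial file.
def fetchAtomically(targetFile: File)(download: File => Unit): Unit = {
  val tempFile = File.createTempFile("fetch", ".tmp", targetFile.getParentFile)
  download(tempFile)
  Files.move(tempFile.toPath, targetFile.toPath, StandardCopyOption.ATOMIC_MOVE)
}
```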

Author: Christophe Préaud <christophe.preaud@kelkoo.com>

Closes #2855 from preaudc/master and squashes the following commits:

9ba89ca [Christophe Préaud] Ensure that files are fetched atomically
54419ae [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
c6a5590 [Christophe Préaud] Revert commit 8ea871f8130b2490f1bad7374a819bf56f0ccbbd
7456a33 [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
8ea871f [Christophe Préaud] Ensure that files are fetched atomically
2014-12-08 11:44:54 -08:00
Takeshi Yamamuro 8817fc7fe8 [SPARK-4620] Add unpersist in Graph and GraphImpl
Add an interface to uncache both vertices and edges of Graph/GraphImpl.
This interface is useful when iterative graph operations build a new graph in each iteration, and the vertices and edges of previous iterations are no longer needed in later iterations.

Author: Takeshi Yamamuro <linguin.m.s@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Ankur Dave <ankurdave@gmail.com>

Closes #3476 from maropu/UnpersistInGraphSpike and squashes the following commits:

77a006a [Takeshi Yamamuro] Add unpersist in Graph and GraphImpl
2014-12-07 19:42:02 -08:00
Takeshi Yamamuro 2e6b736b0e [SPARK-4646] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark
This patch just replaces the native quicksort with Sorter (TimSort) in Spark.
It yielded performance gains of roughly 8% in my quick experiments.
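
A toy comparison of the two sorters (my own illustration; Spark's `Sorter` wraps a ported TimSort rather than calling `java.util.Arrays` directly):

```
// java.util.Arrays.sort on object arrays is TimSort, which exploits
// pre-sorted runs; scala.util.Sorting.quickSort is the replaced sorter.
val data: Array[Long] = Array.tabulate(1000000)(i => (i % 1024).toLong)

val boxed: Array[java.lang.Long] = data.map(Long.box)
java.util.Arrays.sort(boxed)        // TimSort path

scala.util.Sorting.quickSort(data)  // quicksort path
```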

Author: Takeshi Yamamuro <linguin.m.s@gmail.com>

Closes #3507 from maropu/TimSortInEdgePartitionBuilderSpike and squashes the following commits:

8d4e5d2 [Takeshi Yamamuro] Remove a wildcard import
3527e00 [Takeshi Yamamuro] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark
2014-12-07 19:37:14 -08:00
GuoQiang Li e895e0cbec [SPARK-3623][GraphX] GraphX should support the checkpoint operation
Author: GuoQiang Li <witgo@qq.com>

Closes #2631 from witgo/SPARK-3623 and squashes the following commits:

a70c500 [GuoQiang Li] Remove java related
4d1e249 [GuoQiang Li] Add comments
e682724 [GuoQiang Li] Graph should support the checkpoint operation
2014-12-06 00:56:51 -08:00
CrazyJvm 6eb1b6f620 Streaming doc: do you mean inadvertently?
Author: CrazyJvm <crazyjvm@gmail.com>

Closes #3620 from CrazyJvm/streaming-foreachRDD and squashes the following commits:

b72886b [CrazyJvm] do you mean inadvertently?
2014-12-05 13:42:13 -08:00
Zhang, Liye 98a7d09978 [SPARK-4005][CORE] handle message replies in receive instead of in the individual private methods
In BlockManagerMasterActor, when handling the UpdateBlockInfo message type, the replies are sent from individual private methods; they should instead be handled in the actor's Akka `receive` method.
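
A hedged sketch of the restructuring, using a stripped-down actor (message type and handler are illustrative):

```
import akka.actor.Actor

case class UpdateBlockInfo(blockId: String)

class BlockManagerMasterLike extends Actor {
  def receive = {
    case u: UpdateBlockInfo =>
      // Reply here in receive, using the handler's return value, rather
      // than sending from inside the private method.
      sender() ! handleUpdateBlockInfo(u)
  }

  private def handleUpdateBlockInfo(u: UpdateBlockInfo): Boolean = {
    // ... update block metadata ...
    true
  }
}
```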

Author: Zhang, Liye <liye.zhang@intel.com>

Closes #2853 from liyezhang556520/akkaRecv and squashes the following commits:

9b06f0a [Zhang, Liye] remove the unreachable code
bf518cd [Zhang, Liye] change the indent
242166b [Zhang, Liye] modified accroding to the comments
d4b929b [Zhang, Liye] [SPARK-4005][CORE] handle message replies in receive instead of in the individual private methods
2014-12-05 12:00:32 -08:00
Cheng Lian 6f61e1f961 [SPARK-4761][SQL] Enables Kryo by default in Spark SQL Thrift server
Enables Kryo and disables reference tracking by default in Spark SQL Thrift server. Configurations explicitly defined by users in `spark-defaults.conf` are respected (the Thrift server is started by `spark-submit`, which handles configuration properties properly).
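
A sketch of how such defaults can be applied without clobbering user settings; `setIfMissing` is a real `SparkConf` method, though the Thrift server's actual startup path is more involved:

```
import org.apache.spark.SparkConf

// spark-submit has already folded spark-defaults.conf into system
// properties, so loadDefaults = true picks up the user's choices first.
val conf = new SparkConf(loadDefaults = true)
conf.setIfMissing("spark.serializer",
  "org.apache.spark.serializer.KryoSerializer")
conf.setIfMissing("spark.kryo.referenceTracking", "false")
```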

Author: Cheng Lian <lian@databricks.com>

Closes #3621 from liancheng/kryo-by-default and squashes the following commits:

70c2775 [Cheng Lian] Enables Kryo by default in Spark SQL Thrift server
2014-12-05 10:27:40 -08:00
Michael Armbrust f5801e813f [SPARK-4753][SQL] Use catalyst for partition pruning in newParquet.
Author: Michael Armbrust <michael@databricks.com>

Closes #3613 from marmbrus/parquetPartitionPruning and squashes the following commits:

4f138f8 [Michael Armbrust] Use catalyst for partition pruning in newParquet.
2014-12-04 22:25:21 -08:00
Andrew Or fd8525334c Revert "SPARK-2624 add datanucleus jars to the container in yarn-cluster"
This reverts commit a975dc3279.
2014-12-04 21:53:49 -08:00
Andrew Or 87437df036 Revert "[HOT FIX] [YARN] Check whether /lib exists before listing its files"
This reverts commit 90ec643e9a.
2014-12-04 21:53:38 -08:00
Masayoshi TSUZUKI ca379039f7 [SPARK-4464] Description about configuration options need to be modified in docs.
Added a description of -h and -host.
Modified the description of -i and -ip, which are now deprecated.
Added a description of --properties-file.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #3329 from tsudukim/feature/SPARK-4464 and squashes the following commits:

6c07caf [Masayoshi TSUZUKI] [SPARK-4464] Description about configuration options need to be modified in docs.
2014-12-04 19:33:02 -08:00
Andy Konwinski 15cf3b0125 Fix typo in Spark SQL docs.
Author: Andy Konwinski <andykonwinski@gmail.com>

Closes #3611 from andyk/patch-3 and squashes the following commits:

7bab333 [Andy Konwinski] Fix typo in Spark SQL docs.
2014-12-04 18:27:02 -08:00
Masayoshi TSUZUKI ddfc09c363 [SPARK-4421] Wrong link in spark-standalone.html
Modified the link to the Building Spark documentation.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #3279 from tsudukim/feature/SPARK-4421 and squashes the following commits:

56e31c1 [Masayoshi TSUZUKI] Modified the link of building Spark.
2014-12-04 18:14:36 -08:00
Reynold Xin ed92b47e83 [SPARK-4397] Move object RDD to the front of RDD.scala.
I ran into multiple cases where the SBT/Scala compiler was confused by the implicits in continuous compilation mode. Adding explicit return types fixes the problem.

Author: Reynold Xin <rxin@databricks.com>

Closes #3580 from rxin/rdd-implicit and squashes the following commits:

ee32fcd [Reynold Xin] Move object RDD to the end of the file.
b8562c9 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into rdd-implicit
d4e9f85 [Reynold Xin] Code review.
a836a37 [Reynold Xin] Move object RDD to the front of RDD.scala.
2014-12-04 16:32:20 -08:00
lewuathe ab8177da2d [SPARK-4652][DOCS] Add docs about spark-git-repo option
There are cases when a work-in-progress Spark version needs to be run
on an EC2 cluster. To make setting up this type of cluster easier,
this adds a description of the --spark-git-repo option to the EC2 documentation.

Author: lewuathe <lewuathe@me.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #3513 from Lewuathe/doc-for-development-spark-cluster and squashes the following commits:

6dae8ee [lewuathe] Wrap consistent with other descriptions
cfaf9be [lewuathe] Add docs about spark-git-repo option

(Editing / cleanup by Josh Rosen)
2014-12-04 15:24:36 -08:00
Saldanha 743a889d27 [SPARK-4459] Change groupBy type parameter from K to U
Please see https://issues.apache.org/jira/browse/SPARK-4459

Author: Saldanha <saldaal1@phusca-l24858.wlan.na.novartis.net>

Closes #3327 from alokito/master and squashes the following commits:

54b1095 [Saldanha] [SPARK-4459] changed type parameter for keyBy from K to U
d5f73c3 [Saldanha] [SPARK-4459] added keyBy test
316ad77 [Saldanha] SPARK-4459 changed type parameter for groupBy from K to U.
62ddd4b [Saldanha] SPARK-4459 added failing unit test
2014-12-04 14:57:41 -08:00
alexdebrie 794f3aec24 [SPARK-4745] Fix get_existing_cluster() function with multiple security groups
The current get_existing_cluster() function would only find an instance belonging to a cluster if the instance's security groups == cluster_name + "-master" (or "-slaves"). This fix allows for multiple security groups by checking whether the cluster_name + "-master" security group is in the list of groups for a particular instance.
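
The script itself is Python; rendered in Scala for illustration, the membership test reduces to roughly this:

```
// Before: an instance belonged to the cluster only if its groups exactly
// matched cluster_name + "-master"/"-slaves". After: the named group merely
// has to be among the instance's (possibly many) security groups.
def inCluster(instanceGroups: Set[String], clusterName: String): Boolean =
  instanceGroups.contains(s"$clusterName-master") ||
    instanceGroups.contains(s"$clusterName-slaves")
```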

Author: alexdebrie <alexdebrie1@gmail.com>

Closes #3596 from alexdebrie/master and squashes the following commits:

9d51232 [alexdebrie] Fix get_existing_cluster() function with multiple security groups
2014-12-04 14:14:39 -08:00
Patrick Wendell 8dae26f838 [HOTFIX] Fixing two issues with the release script.
1. The version replacement was still producing some spurious changes.
2. Uploads now go to the staging repo specifically.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #3608 from pwendell/release-script and squashes the following commits:

3c63294 [Patrick Wendell] Fixing two issues with the release script:
2014-12-04 12:11:41 -08:00
WangTaoTheTonic 8106b1e36b [SPARK-4253] Ignore spark.driver.host in yarn-cluster and standalone-cluster modes
In yarn-cluster and standalone-cluster modes, we don't know where the driver will run until it is launched. If the `spark.driver.host` property is set on the submitting machine and propagated to the driver through SparkConf, this will lead to errors when the driver launches.

This patch fixes this issue by dropping the `spark.driver.host` property in SparkSubmit when running in a cluster deploy mode.
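
A minimal sketch of the drop, assuming the deploy mode has already been resolved (`SparkConf.remove` is a real method):

```
import org.apache.spark.SparkConf

// In cluster deploy modes the driver's host is unknown at submit time, so
// a spark.driver.host inherited from the submitter must not be shipped.
def sanitizeForClusterMode(conf: SparkConf, deployMode: String): SparkConf =
  if (deployMode == "cluster") conf.remove("spark.driver.host") else conf
```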

Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>

Closes #3112 from WangTaoTheTonic/SPARK4253 and squashes the following commits:

ed1a25c [WangTaoTheTonic] revert unrelated formatting issue
02c4e49 [WangTao] add comment
32a3f3f [WangTaoTheTonic] ingore it in SparkSubmit instead of SparkContext
667cf24 [WangTaoTheTonic] document fix
ff8d5f7 [WangTaoTheTonic] also ignore it in standalone cluster mode
2286e6b [WangTao] ignore spark.driver.host in yarn-cluster mode
2014-12-04 11:53:23 -08:00
Cheng Lian 28c7acacef [SPARK-4683][SQL] Add a beeline.cmd to run on Windows
Tested locally with a Win7 VM. Connected to a Spark SQL Thrift server instance running on Mac OS X with the following command line:

```
bin\beeline.cmd -u jdbc:hive2://10.0.2.2:10000 -n lian
```

Author: Cheng Lian <lian@databricks.com>

Closes #3599 from liancheng/beeline.cmd and squashes the following commits:

79092e7 [Cheng Lian] Windows script for BeeLine
2014-12-04 10:21:03 -08:00
Xiangrui Meng 7e758d7092 [FIX][DOC] Fix broken links in ml-guide.md
and some minor changes in ScalaDoc.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits:

c559768 [Xiangrui Meng] minor code update
ce94da8 [Xiangrui Meng] Java Bean -> JavaBean
0b5c182 [Xiangrui Meng] fix links in ml-guide
2014-12-04 20:16:35 +08:00
Joseph K. Bradley 469a6e5f3b [SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes
Documentation:
* Added ml-guide.md, linked from mllib-guide.md
* Updated mllib-guide.md with small section pointing to ml-guide.md

Examples:
* CrossValidatorExample
* SimpleParamsExample
* (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)

Bug fixes:
* PipelineModel: did not use ParamMaps correctly
* UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!)

CC: mengxr shivaram etrain. Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml, and to add more docs once the package is more established/complete.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: jkbradley <joseph.kurata.bradley@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #3588 from jkbradley/ml-package-docs and squashes the following commits:

d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit).  updated examples for CV and Params for spark.ml
c38469c [Joseph K. Bradley] Updated ml-guide with CV examples
99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params.  Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs
3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype
41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version.  CrossValidatorExample not working yet.  Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.
2014-12-04 17:00:06 +08:00
Joseph K. Bradley 529439bd50 [docs] Fix outdated comment in tuning guide
When you use the SPARK_JAVA_OPTS env variable, Spark complains:

```
SPARK_JAVA_OPTS was detected (set to ' -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps ').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with conf/spark-defaults.conf to set defaults for an application
 - ./spark-submit with --driver-java-options to set -X options for a driver
 - spark.executor.extraJavaOptions to set -X options for executors
 - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
```

This updates the docs to redirect the user to the relevant part of the configuration docs.

CC: mengxr, but please CC someone else as needed.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3592 from jkbradley/tuning-doc and squashes the following commits:

0760ce1 [Joseph K. Bradley] fixed outdated comment in tuning guide
2014-12-04 00:59:32 -08:00