Commit graph

10026 commits

Author SHA1 Message Date
Xusen Yin 2d4e00efe2 [SPARK-5986][MLLib] Add save/load for k-means
This PR adds save/load for k-means as described in SPARK-5986. The Python version will be added in another PR.
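
A minimal usage sketch of the new Scala API, assuming an active SparkContext `sc` (the data and path are illustrative):

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Train a toy model, persist it, and load it back.
val data = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
val model = KMeans.train(data, k = 2, maxIterations = 10)
model.save(sc, "/tmp/kmeans-model")
val sameModel = KMeansModel.load(sc, "/tmp/kmeans-model")
```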

Author: Xusen Yin <yinxusen@gmail.com>

Closes #4951 from yinxusen/SPARK-5986 and squashes the following commits:

6dd74a0 [Xusen Yin] rewrite some functions and classes
cd390fd [Xusen Yin] add indexed point
b144216 [Xusen Yin] remove invalid comments
dce7055 [Xusen Yin] add save/load for k-means for SPARK-5986
2015-03-11 00:24:55 -07:00
Michael Armbrust 2672374110 [SPARK-5183][SQL] Update SQL Docs with JDBC and Migration Guide
Author: Michael Armbrust <michael@databricks.com>

Closes #4958 from marmbrus/sqlDocs and squashes the following commits:

9351dbc [Michael Armbrust] fix parquet example
6877e13 [Michael Armbrust] add sql examples
d81b7e7 [Michael Armbrust] rxins comments
e393528 [Michael Armbrust] fix order
19c2735 [Michael Armbrust] more on data source load/store
00d5914 [Michael Armbrust] Update SQL Docs with JDBC and Migration Guide
2015-03-10 18:13:09 -07:00
Reynold Xin 74fb433702 Minor doc: Remove the extra blank line in data types javadoc.
The extra blank line is preventing the first lines from showing up in the package summary page.

Author: Reynold Xin <rxin@databricks.com>

Closes #4955 from rxin/datatype-docs and squashes the following commits:

1621114 [Reynold Xin] Minor doc: Remove the extra blank line in data types javadoc.
2015-03-10 17:25:04 -07:00
cheng chang 7c7d2d5e09 [SPARK-6186] [EC2] Make Tachyon version configurable in EC2 deployment script
This PR comes from Tachyon community to solve the issue:
https://tachyon.atlassian.net/browse/TACHYON-11

An accompanying PR is in mesos/spark-ec2:
https://github.com/mesos/spark-ec2/pull/101

Author: cheng chang <myairia@gmail.com>

Closes #4901 from uronce-cc/master and squashes the following commits:

313aa36 [cheng chang] minor re-wording
fd2a48e [cheng chang] Remove Tachyon when deploying through git hash
1d53c5c [cheng chang] add default value to --tachyon-version
6f8887e [cheng chang] make tachyon version configurable
2015-03-10 11:02:54 +00:00
Nicholas Chammas d14df06c05 [SPARK-6191] [EC2] Generalize ability to download libs
Right now we have a method to specifically download boto. This PR generalizes it so it's easy to download additional libraries if we want.

For example, adding new external libraries for spark-ec2 is now as simple as:

```python
external_libs = [
    {
         "name": "boto",
         "version": "2.34.0",
         "md5": "5556223d2d0cc4d06dd4829e671dcecd"
    },
    {
        "name": "PyYAML",
        "version": "3.11",
        "md5": "f50e08ef0fe55178479d3a618efe21db"
    },
    {
        "name": "argparse",
        "version": "1.3.0",
        "md5": "9bcf7f612190885c8c85e30ba41db3c7"
    }
]
```
Likely use cases:
* Downloading PyYAML to allow spark-ec2 configs to be persisted as a YAML file. ([SPARK-925](https://issues.apache.org/jira/browse/SPARK-925))
* Downloading argparse to clean up / modernize our option parsing.

First run output, with PyYAML and argparse added just for demonstration purposes:

```shell
$ ./spark-ec2 --version
Downloading external libraries that spark-ec2 needs from PyPI to /path/to/spark/ec2/lib...
This should be a one-time operation.
 - Downloading boto...
 - Finished downloading boto.
 - Downloading PyYAML...
 - Finished downloading PyYAML.
 - Downloading argparse...
 - Finished downloading argparse.
spark-ec2 1.2.1
```

Output thereafter:

```shell
$ ./spark-ec2 --version
spark-ec2 1.2.1
```

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #4919 from nchammas/setup-ec2-libs and squashes the following commits:

a077955 [Nicholas Chammas] print default region
c95fb7d [Nicholas Chammas] to docstring
5448845 [Nicholas Chammas] remove libs added for demo purposes
60d8c23 [Nicholas Chammas] generalize ability to download libs
2015-03-10 10:58:31 +00:00
Lev Khomich c4c4b07bf6 [SPARK-6087][CORE] Provide actionable exception if Kryo buffer is not large enough
A simple try-catch wraps the KryoException to produce a more informative, actionable error.
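
A sketch of the pattern, assuming `kryo`, `output`, and `obj` are in scope (illustrative, not Spark's exact code; the config key is the Spark 1.x setting for the maximum Kryo buffer size):

```scala
import com.esotericsoftware.kryo.KryoException
import org.apache.spark.SparkException

try {
  kryo.writeClassAndObject(output, obj) // can overflow the fixed-size buffer
} catch {
  case e: KryoException if e.getMessage.startsWith("Buffer overflow") =>
    throw new SparkException("Kryo serialization failed: " + e.getMessage +
      ". To avoid this, increase spark.kryoserializer.buffer.max.mb.", e)
}
```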

Author: Lev Khomich <levkhomich@gmail.com>

Closes #4947 from levkhomich/master and squashes the following commits:

0f7a947 [Lev Khomich] [SPARK-6087][CORE] Provide actionable exception if Kryo buffer is not large enough
2015-03-10 10:55:42 +00:00
Yuhao Yang 9a0272fbb3 [SPARK-6177][MLlib]Add note in LDA example to remind possible coalesce
JIRA: https://issues.apache.org/jira/browse/SPARK-6177
Add a comment introducing coalesce to the LDA example, to avoid the potentially massive number of partitions created by `sc.textFile`.

`sc.textFile` creates an RDD with one partition per file, and a massive number of partitions degrades LDA performance.
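
A sketch of the suggested adjustment, assuming an active SparkContext `sc` (the path and partition count are illustrative):

```scala
// One partition per input file means thousands of tiny partitions for a
// corpus of small documents, so coalesce before the expensive LDA iterations.
val corpus = sc.textFile("data/mllib/lda-docs/*.txt").coalesce(8)
```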

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #4899 from hhbyyh/adjustPartition and squashes the following commits:

a499630 [Yuhao Yang] update comment
9a2d7b6 [Yuhao Yang] move to comment
f7fd5d4 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into adjustPartition
26a564a [Yuhao Yang] add coalesce to LDAExample
2015-03-10 10:52:21 +00:00
Davies Liu 8767565cef [SPARK-6194] [SPARK-677] [PySpark] fix memory leak in collect()
Because of a circular reference between JavaObject and JavaMember, a Java object cannot be released until the Python GC kicks in. This causes a memory leak in collect(), which may consume lots of memory in the JVM.

This PR changes the way we send collected data back into Python, from a local file to a socket. This avoids any disk IO during collect and avoids keeping any Python-side referrers to the Java object.
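
A rough sketch of the JVM-side idea, serving the collected bytes over an ephemeral loopback socket instead of a temp file (names are illustrative, not the PR's actual API):

```scala
import java.net.{InetAddress, ServerSocket}

def serveBytes(data: Array[Byte]): Int = {
  val server = new ServerSocket(0, 1, InetAddress.getByName("localhost"))
  new Thread("serve-collected-data") {
    setDaemon(true)
    override def run(): Unit = {
      val sock = server.accept() // Python side connects and reads once
      try sock.getOutputStream.write(data)
      finally { sock.close(); server.close() }
    }
  }.start()
  server.getLocalPort // returned to Python through Py4J
}
```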

cc JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #4923 from davies/fix_collect and squashes the following commits:

d730286 [Davies Liu] address comments
24c92a4 [Davies Liu] fix style
ba54614 [Davies Liu] use socket to transfer data from JVM
9517c8f [Davies Liu] fix memory leak in collect()
2015-03-09 16:24:06 -07:00
Reynold Xin 3cac1991a1 [SPARK-5310][Doc] Update SQL Programming Guide to include DataFrames.
Author: Reynold Xin <rxin@databricks.com>

Closes #4954 from rxin/df-docs and squashes the following commits:

c592c70 [Reynold Xin] [SPARK-5310][Doc] Update SQL Programming Guide to include DataFrames.
2015-03-09 16:16:16 -07:00
Reynold Xin 70f88148bb [Docs] Replace references to SchemaRDD with DataFrame
Author: Reynold Xin <rxin@databricks.com>

Closes #4952 from rxin/schemardd-df-reference and squashes the following commits:

b2b1dbe [Reynold Xin] [Docs] Replace references to SchemaRDD with DataFrame
2015-03-09 13:29:19 -07:00
Theodore Vasiloudis f7c7992043 [EC2] [SPARK-6188] Instance types can be mislabeled when re-starting cluster with default arguments
As described in https://issues.apache.org/jira/browse/SPARK-6188 and discovered in https://issues.apache.org/jira/browse/SPARK-5838.

When restarting a cluster, if the user does not provide the instance types (currently the recommended behavior in the docs), the instances are assigned the default type m1.large. This then affects the setup of the machines.

This PR solves the problem by getting the instance types from the existing instances and overwriting the default options.

EDIT: Further clarification of the issue:

In short, while the instances themselves are the same as launched, their setup is done assuming the default instance type, m1.large.

This means the machines are assumed to have 2 disks, which leads to the problems described in issue [5838](https://issues.apache.org/jira/browse/SPARK-5838): machines that have one disk end up putting shuffle spills in the small (8GB) snapshot partition, which quickly fills up and results in failing jobs due to "No space left on device" errors.

Other instance-specific settings made in the spark_ec2.py script are likely to be wrong as well.

Author: Theodore Vasiloudis <thvasilo@users.noreply.github.com>
Author: Theodore Vasiloudis <tvas@sics.se>

Closes #4916 from thvasilo/SPARK-6188]-Instance-types-can-be-mislabeled-when-re-starting-cluster-with-default-arguments and squashes the following commits:

6705b98 [Theodore Vasiloudis] Added comment to clarify setting master instance type to the empty string.
a3d29fe [Theodore Vasiloudis] More trailing whitespace
7b32429 [Theodore Vasiloudis] Removed trailing whitespace
3ebd52a [Theodore Vasiloudis] Make sure that the instance type is correct when relaunching a cluster.
2015-03-09 14:16:07 +00:00
Jacky Li 55b1b32dc8 [GraphX] Improve LiveJournalPageRank example
1. Removed unnecessary import
2. Modified the usage message, since the user must specify the --numEPart parameter, which is required by Analytics.main

Author: Jacky Li <jacky.likun@huawei.com>

Closes #4917 from jackylk/import and squashes the following commits:

6c07682 [Jacky Li] fix comment
c0df8f2 [Jacky Li] fix scalastyle
b6235e6 [Jacky Li] fix for comment
87be83b [Jacky Li] remove default value description
5caae76 [Jacky Li] remove import and modify usage
2015-03-08 19:47:35 +00:00
Sean Owen f16b7b031f SPARK-6205 [CORE] UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError
Add xml-apis to core test deps to work around the UISeleniumSuite classpath issue

Author: Sean Owen <sowen@cloudera.com>

Closes #4933 from srowen/SPARK-6205 and squashes the following commits:

ddd4d32 [Sean Owen] Add xml-apis to core test deps to work around UISeleniumSuite classpath issue
2015-03-08 14:09:40 +00:00
Nicholas Chammas 52ed7da12e [SPARK-6193] [EC2] Push group filter up to EC2
When looking for a cluster, spark-ec2 currently pulls down [info for all instances](eb48fd6e9d/ec2/spark_ec2.py (L620)) and filters locally. When working on an AWS account with hundreds of active instances, this step alone can take over 10 seconds.

This PR improves how spark-ec2 searches for clusters by pushing the filter up to EC2.

Basically, the problem (and solution) look like this:

```python
>>> timeit.timeit('blah = conn.get_all_reservations()', setup='from __main__ import conn', number=10)
116.96390509605408
>>> timeit.timeit('blah = conn.get_all_reservations(filters={"instance.group-name": ["my-cluster-master"]})', setup='from __main__ import conn', number=10)
4.629754066467285
```

Translated to a user-visible action, this looks like (against an AWS account with ~200 active instances):

```shell
# master
$ python -m timeit -n 3 --setup 'import subprocess' 'subprocess.call("./spark-ec2 get-master my-cluster --region us-west-2", shell=True)'
...
3 loops, best of 3: 9.83 sec per loop

# this PR
$ python -m timeit -n 3 --setup 'import subprocess' 'subprocess.call("./spark-ec2 get-master my-cluster --region us-west-2", shell=True)'
...
3 loops, best of 3: 1.47 sec per loop
```

This PR also refactors `get_existing_cluster()` to make it, I hope, simpler.

Finally, this PR fixes some minor grammar issues related to printing status to the user. 🎩 👏

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #4922 from nchammas/get-existing-cluster-faster and squashes the following commits:

18802f1 [Nicholas Chammas] ignore shutting-down
f2a5b9f [Nicholas Chammas] fix grammar
d96a489 [Nicholas Chammas] push group filter up to EC2
2015-03-08 14:01:26 +00:00
Florian Verhein 334c5bd1ae [SPARK-5641] [EC2] Allow spark_ec2.py to copy arbitrary files to cluster
Give users an easy way to rcp a directory structure to the master's / as part of the cluster launch, at a useful point in the workflow (before setup.sh is called on the master).

This is an alternative approach to meeting requirements discussed in https://github.com/apache/spark/pull/4487

Author: Florian Verhein <florian.verhein@gmail.com>

Closes #4583 from florianverhein/master and squashes the following commits:

49dee88 [Florian Verhein] removed addition of trailing / in rsync to give user this option, added documentation in help
7b8e3d8 [Florian Verhein] remove unused args
87d922c [Florian Verhein] [SPARK-5641] [EC2] implement --deploy-root-dir
2015-03-07 12:56:59 +00:00
WangTaoTheTonic 729c05bda8 [Minor] Fix the wrong description
Found it by accident. I'm not going to file a JIRA for this, as it is a very tiny fix.

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #4936 from WangTaoTheTonic/wrongdesc and squashes the following commits:

fb8a8ec [WangTaoTheTonic] fix the wrong description
aca5596 [WangTaoTheTonic] fix the wrong description
2015-03-07 12:35:26 +00:00
Nicholas Chammas 2646794ffb [EC2] Reorder print statements on termination
The PR reorders some print statements slightly on cluster termination so that they read better.

For example, from this:

```
Are you sure you want to destroy the cluster spark-cluster-test?
The following instances will be terminated:
Searching for existing cluster spark-cluster-test in region us-west-2...
Found 1 master(s), 2 slaves
> ...
ALL DATA ON ALL NODES WILL BE LOST!!
Destroy cluster spark-cluster-test (y/N):
```

To this:

```
Searching for existing cluster spark-cluster-test in region us-west-2...
Found 1 master(s), 2 slaves
The following instances will be terminated:
> ...
ALL DATA ON ALL NODES WILL BE LOST!!
Are you sure you want to destroy the cluster spark-cluster-test? (y/N)
```

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #4932 from nchammas/termination-print-order and squashes the following commits:

c23711d [Nicholas Chammas] reorder prints on termination
2015-03-07 12:33:41 +00:00
RobertZK 48a723c986 Fix python typo (+ Scala, Java typos)
Author: RobertZK <technoguyrob@gmail.com>
Author: Robert Krzyzanowski <technoguyrob@gmail.com>

Closes #4840 from robertzk/patch-1 and squashes the following commits:

d286215 [RobertZK] lambda fix per @laserson
5937989 [Robert Krzyzanowski] Fix python typo
2015-03-07 00:39:24 +00:00
Vinod K C dba0b2eadb [SPARK-6178][Shuffle] Removed unused imports
Author: Vinod K C <vinod.kc@huawei.com>

Closes #4900 from vinodkc/unused_imports and squashes the following commits:

5373456 [Vinod K C] Removed empty lines
9da7438 [Vinod K C] Changed order of import
594d471 [Vinod K C] Removed unused imports
2015-03-06 14:43:09 +00:00
GuoQiang Li 05cb6b34d8 [Minor] Resolve sbt warnings: postfix operator second should be enabled
Resolve sbt warnings:

```
[warn] spark/streaming/src/main/scala/org/apache/spark/streaming/util/WriteAheadLogManager.scala:155: postfix operator second should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn] This can be achieved by adding the import clause 'import scala.language.postfixOps'
[warn] or by setting the compiler option -language:postfixOps.
[warn] See the Scala docs for value scala.language.postfixOps for a discussion
[warn] why the feature should be explicitly enabled.
[warn]         Await.ready(f, 1 second)
[warn]                          ^
```
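
For reference, a self-contained snippet showing the suggested fix (the future here is just a stand-in):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.language.postfixOps // enables the postfix `1 second` syntax without a warning

val f = Future { 42 }
Await.ready(f, 1 second) // compiles cleanly now
```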

Author: GuoQiang Li <witgo@qq.com>

Closes #4908 from witgo/sbt_warnings and squashes the following commits:

0629af4 [GuoQiang Li] Resolve sbt warnings: postfix operator second should be enabled
2015-03-06 13:20:20 +00:00
Marcelo Vanzin cd7594ca6a [core] [minor] Don't pollute source directory when running UtilsSuite.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #4921 from vanzin/utils-suite and squashes the following commits:

7795dd4 [Marcelo Vanzin] [core] [minor] Don't pollute source directory when running UtilsSuite.
2015-03-06 09:43:24 +00:00
Zhang, Liye d8b3da9ddf [CORE, DEPLOY][minor] align arguments order with docs of worker
The help message for starting `worker` is `Usage: Worker [options] <master>`, while in `start-slaves.sh` the arguments are not in that order, which is confusing at first glance.

Author: Zhang, Liye <liye.zhang@intel.com>

Closes #4924 from liyezhang556520/startSlaves and squashes the following commits:

7fd5deb [Zhang, Liye] align arguments order with docs of worker
2015-03-06 09:34:07 +00:00
Michael Armbrust eb48fd6e9d [SQL] Make Strategies a public developer API
Author: Michael Armbrust <michael@databricks.com>

Closes #4920 from marmbrus/openStrategies and squashes the following commits:

cbc35c0 [Michael Armbrust] [SQL] Make Strategies a public developer API
2015-03-05 14:50:25 -08:00
Yin Huai 1b4bb25c10 [SPARK-6163][SQL] jsonFile should be backed by the data source API
jira: https://issues.apache.org/jira/browse/SPARK-6163

Author: Yin Huai <yhuai@databricks.com>

Closes #4896 from yhuai/SPARK-6163 and squashes the following commits:

45e023e [Yin Huai] Address @chenghao-intel's comment.
2e8734e [Yin Huai] Use JSON data source for jsonFile.
92a4a33 [Yin Huai] Test.
2015-03-05 14:49:44 -08:00
Wenchen Fan 5873c713cc [SPARK-6145][SQL] fix ORDER BY on nested fields
Based on #4904 with style errors fixed.

`LogicalPlan#resolve` produces not only `Attribute`s but also `GetField` chains.
So in `ResolveSortReferences`, after resolving the ordering expressions, we should collect not just the `Attribute` results but also the `Attribute`s at the bottom of each `GetField` chain.
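
An illustrative Spark 1.3-style reproduction of the case this fixes, assuming an active SparkContext `sc` (names and data are hypothetical):

```scala
import org.apache.spark.sql.SQLContext

case class Address(city: String)
case class Person(name: String, address: Address)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

sc.parallelize(Seq(Person("a", Address("sf")), Person("b", Address("ny"))))
  .toDF().registerTempTable("people")

// The ordering expression resolves through a GetField chain (address.city)
// rather than a plain Attribute, which ResolveSortReferences must handle.
sqlContext.sql("SELECT name FROM people ORDER BY address.city").show()
```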

Author: Wenchen Fan <cloud0fan@outlook.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #4918 from marmbrus/pr/4904 and squashes the following commits:

997f84e [Michael Armbrust] fix style
3eedbfc [Wenchen Fan] fix 6145
2015-03-05 14:49:01 -08:00
Josh Rosen 424a86a1ed [SPARK-6175] Fix standalone executor log links when ephemeral ports or SPARK_PUBLIC_DNS are used
This patch fixes two issues with the executor log viewing links added in Spark 1.3.  In standalone mode, the log URLs might include a port value of 0 rather than the actual bound port of the UI, which broke the ability to view logs from workers whose web UIs had been configured to bind to ephemeral ports.  In addition, the URLs used workers' local hostnames instead of respecting SPARK_PUBLIC_DNS, which prevented this feature from working properly on Spark EC2 clusters because the links would point to internal DNS names instead of external ones.

I included tests for both of these bugs:

- We now browse to the URLs and verify that they point to the expected pages.
- To test SPARK_PUBLIC_DNS, I changed the code that reads the environment variable to do so via `SparkConf.getenv`, then used a custom SparkConf subclass to mock the environment variable (this pattern is used elsewhere in Spark's tests); a sketch of the pattern is shown below.
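
A hypothetical sketch of that test pattern (not Spark's actual classes): route environment lookups through an overridable method so tests can inject SPARK_PUBLIC_DNS.

```scala
class EnvAwareConf {
  def getenv(name: String): String = System.getenv(name)
}

class MockEnvConf(env: Map[String, String]) extends EnvAwareConf {
  override def getenv(name: String): String = env.getOrElse(name, super.getenv(name))
}

// A test can pretend SPARK_PUBLIC_DNS is set without touching the real environment.
val conf = new MockEnvConf(Map("SPARK_PUBLIC_DNS" -> "ec2-198-51-100-1.compute-1.amazonaws.com"))
assert(conf.getenv("SPARK_PUBLIC_DNS") == "ec2-198-51-100-1.compute-1.amazonaws.com")
```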

Author: Josh Rosen <joshrosen@databricks.com>

Closes #4903 from JoshRosen/SPARK-6175 and squashes the following commits:

5577f41 [Josh Rosen] Remove println
cfec135 [Josh Rosen] Use webUi.boundPort and publicAddress in log links
27918c7 [Josh Rosen] Add failing unit tests for standalone log URL viewing
c250fbe [Josh Rosen] Respect SparkConf in local-cluster Workers.
422a2ef [Josh Rosen] Use conf.getenv to read SPARK_PUBLIC_DNS
2015-03-05 12:04:00 -08:00
Xiangrui Meng 0bfacd5c5d [SPARK-6090][MLLIB] add a basic BinaryClassificationMetrics to PySpark/MLlib
A simple wrapper around the Scala implementation. `DataFrame` is used for serialization/deserialization. Methods that return `RDD`s are not supported in this PR.

davies If we recognize Scala's `Product`s in Py4J, we can easily add wrappers for Scala methods that return `RDD[(Double, Double)]`. Is it easy to register a serializer for `Product` in PySpark?
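
For context, the Scala API being wrapped, with toy data (a minimal sketch assuming an active SparkContext `sc`):

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// (score, label) pairs, where labels are 0.0 or 1.0
val scoreAndLabels = sc.parallelize(Seq((0.9, 1.0), (0.6, 0.0), (0.2, 1.0), (0.1, 0.0)))
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(metrics.areaUnderROC())
println(metrics.areaUnderPR())
```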

Author: Xiangrui Meng <meng@databricks.com>

Closes #4863 from mengxr/SPARK-6090 and squashes the following commits:

009a3a3 [Xiangrui Meng] provide schema
dcddab5 [Xiangrui Meng] add a basic BinaryClassificationMetrics to PySpark/MLlib
2015-03-05 11:50:09 -08:00
Sean Owen c9cfba0ceb SPARK-6182 [BUILD] spark-parent pom needs to be published for both 2.10 and 2.11
Option 1 of 2: Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11

Author: Sean Owen <sowen@cloudera.com>

Closes #4912 from srowen/SPARK-6182.1 and squashes the following commits:

eff60de [Sean Owen] Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
2015-03-05 11:31:48 -08:00
Daoyuan Wang e06c7dfbc2 [SPARK-6153] [SQL] promote guava dep for hive-thriftserver
For the thriftserver package, Guava is used at runtime.

/cc pwendell

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4884 from adrian-wang/test and squashes the following commits:

4600ae7 [Daoyuan Wang] only promote for thriftserver
44dda18 [Daoyuan Wang] promote guava dep for hive
2015-03-05 16:35:17 +08:00
Sean Owen 7ac072f74b SPARK-5143 [BUILD] [WIP] spark-network-yarn 2.11 depends on spark-network-shuffle 2.10
Update `<scala.binary.version>` prop in POM when switching between Scala 2.10/2.11

ScrapCodes for review. This `sed` command is supposed to replace just the first occurrence, but it replaces them all. Are you more of a `sed` wizard than I am? It may be a GNU/BSD difference that is throwing me off. Really, just the first instance should be replaced, hence the `[WIP]`.

NB on OS X the original `sed` command here will create files like `pom.xml-e` through the source tree though it otherwise works. It's like `-e` is also the arg to `-i`. I couldn't get rid of that even with `-i""`. No biggie.

Author: Sean Owen <sowen@cloudera.com>

Closes #4876 from srowen/SPARK-5143 and squashes the following commits:

b060c44 [Sean Owen] Oops, fixed reversed version numbers!
e875d4a [Sean Owen] Add note about non-GNU sed; fix new pom.xml update to work as intended on GNU sed
703e1eb [Sean Owen] Update scala.binary.version prop in POM when switching between Scala 2.10/2.11
2015-03-04 21:00:51 -08:00
Cheng Lian 1aa90e39e3 [SPARK-6149] [SQL] [Build] Excludes Guava 15 referenced by jackson-module-scala_2.10
This PR excludes Guava 15.0 from the SBT build, to make Spark SQL CLI (`bin/spark-sql`) work when compiled against Hive 0.12.0.

Author: Cheng Lian <lian@databricks.com>

Closes #4890 from liancheng/exclude-guava-15 and squashes the following commits:

91ae9fa [Cheng Lian] Moves Guava 15 exclusion from SBT build to POM
282bd2a [Cheng Lian] Excludes Guava 15 referenced by jackson-module-scala_2.10
2015-03-04 20:52:58 -08:00
Marcelo Vanzin 3a35a0dfe9 [SPARK-6144] [core] Fix addFile when source files are on "hdfs:"
The code failed in two modes: it complained when it tried to re-create a directory that already existed, and it was placing some files in the wrong parent directory. The patch fixes both issues.

Author: Marcelo Vanzin <vanzin@cloudera.com>
Author: trystanleftwich <trystan@atscale.com>

Closes #4894 from vanzin/SPARK-6144 and squashes the following commits:

100b3a1 [Marcelo Vanzin] Style fix.
58266aa [Marcelo Vanzin] Fix fetchHcfs file for directories.
91733b7 [trystanleftwich] [SPARK-6144]When in cluster mode using ADD JAR with a hdfs:// sourced jar will fail
2015-03-04 12:58:39 -08:00
Zhang, Liye f6773edce0 [SPARK-6107][CORE] Display inprogress application information for event log history for standalone mode
When an application finishes running abnormally (Ctrl-C, for example), the history event log file still ends with the `.inprogress` suffix, and the application state cannot be shown on the web UI. The user can only see "*Application history not found xxxx, Application xxx is still in progress*".

For an application that did not finish normally, the history will show:
![image](https://cloud.githubusercontent.com/assets/4716022/6437137/184f9fc0-c0f5-11e4-88cc-a2eb087e4561.png)

Author: Zhang, Liye <liye.zhang@intel.com>

Closes #4848 from liyezhang556520/showLogInprogress and squashes the following commits:

03589ac [Zhang, Liye] change inprogress to in progress
b55f19f [Zhang, Liye] scala modify after rebase
8aa66a2 [Zhang, Liye] use softer wording
b030bd4 [Zhang, Liye] clean code
79c8cb1 [Zhang, Liye] fix some mistakes
11cdb68 [Zhang, Liye] add a missing space
c29205b [Zhang, Liye] refine code according to sean owen's comments
e9952a7 [Zhang, Liye] scala style fix again
150502d [Zhang, Liye] scala style fix
f11a5da [Zhang, Liye] small fix for file path
22e878b [Zhang, Liye] enable in progress eventlog file
2015-03-04 12:28:27 +00:00
Liang-Chi Hsieh aef8a84e42 [SPARK-6134][SQL] Fix wrong datatype for casting FloatType and default LongType value in defaultPrimitive
In `CodeGenerator`, the casting on `FloatType` should use `FloatType` instead of `IntegerType`.

Besides, `defaultPrimitive` for `LongType` should be `-1L` instead of `1L`.
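
A schematic of the two fixes (illustrative helpers, not the actual `CodeGenerator` code):

```scala
import org.apache.spark.sql.types._

// defaultPrimitive: LongType's default literal becomes -1L (it was 1L).
def defaultPrimitive(dt: DataType): String = dt match {
  case LongType => "-1L"
  case _        => "null"
}

// Cast generation: casting to FloatType now emits a float conversion,
// not the integer conversion it emitted before the fix.
def castTo(dt: DataType, expr: String): String = dt match {
  case FloatType => s"$expr.toFloat"
  case _         => expr
}
```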

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4870 from viirya/codegen_type and squashes the following commits:

76311dd [Liang-Chi Hsieh] Fix wrong datatype for casting on FloatType. Fix the wrong value for LongType in defaultPrimitive.
2015-03-04 20:23:43 +08:00
Cheng Lian 76b472f12a [SPARK-6136] [SQL] Removed JDBC integration tests which depend on docker-client
Integration test suites in the JDBC data source (`MySQLIntegration` and `PostgresIntegration`) depend on docker-client 2.7.5, which transitively depends on Guava 17.0. Unfortunately, Guava 17.0 is causing test runtime binary compatibility issues when Spark is compiled against Hive 0.12.0, or Hadoop 2.4.

Considering `MySQLIntegration` and `PostgresIntegration` are ignored right now, I'd suggest moving them from the Spark project to the [Spark integration tests] [1] project. This PR removes both the JDBC data source integration tests and the docker-client test dependency.

[1]: https://github.com/databricks/spark-integration-tests

Author: Cheng Lian <lian@databricks.com>

Closes #4872 from liancheng/remove-docker-client and squashes the following commits:

1f4169e [Cheng Lian] Removes DockerHacks
159b24a [Cheng Lian] Removed JDBC integration tests which depends on docker-client
2015-03-04 19:39:02 +08:00
Brennon York 418f38d92f [SPARK-3355][Core]: Allow running maven tests in run-tests
Added an AMPLAB_JENKINS_BUILD_TOOL env variable to allow differentiation between the maven and sbt build / test suites. The only issue I found with this is that, when running maven builds, I wasn't able to get individual package tests running without running a `mvn install` first. Not sure what Jenkins is doing wrt its env, but I figured it's much better to just test everything than to install packages in the "~/.m2/" directory and only test individual items, esp. if this is predominantly for the Jenkins build. Thoughts / comments would be great!

Author: Brennon York <brennon.york@capitalone.com>

Closes #4734 from brennonyork/SPARK-3355 and squashes the following commits:

c813d32 [Brennon York] changed mvn call from 'clean compile
616ce30 [Brennon York] fixed merge conflicts
3540de9 [Brennon York] added an AMPLAB_JENKINS_BUILD_TOOL env. variable to allow differentiation between maven and sbt build / test suites
2015-03-04 11:02:33 +00:00
tedyu 8d3e2414d4 SPARK-6085 Increase default value for memory overhead
Author: tedyu <yuzhihong@gmail.com>

Closes #4836 from tedyu/master and squashes the following commits:

d65b495 [tedyu] SPARK-6085 Increase default value for memory overhead
1fdd4df [tedyu] SPARK-6085 Increase default value for memory overhead
2015-03-04 11:00:52 +00:00
Xiangrui Meng 76e20a0a03 [SPARK-6141][MLlib] Upgrade Breeze from 0.10 to 0.11 to fix convergence bug
LBFGS and OWLQN in Breeze 0.10 have a convergence-check bug.
This is fixed in 0.11; see the description in the Breeze project for details:

https://github.com/scalanlp/breeze/pull/373#issuecomment-76879760

Author: Xiangrui Meng <meng@databricks.com>
Author: DB Tsai <dbtsai@alpinenow.com>
Author: DB Tsai <dbtsai@dbtsai.com>

Closes #4879 from dbtsai/breeze and squashes the following commits:

d848f65 [DB Tsai] Merge pull request #1 from mengxr/AlpineNow-breeze
c2ca6ac [Xiangrui Meng] upgrade to breeze-0.11.1
35c2f26 [Xiangrui Meng] fix LRSuite
397a208 [DB Tsai] upgrade breeze
2015-03-03 23:52:02 -08:00
Andrew Or d334bfbcf3 [SPARK-6132][HOTFIX] ContextCleaner InterruptedException should be quiet
If the cleaner is stopped, we shouldn't print a huge stack trace when the cleaner thread is interrupted, because we interrupted it on purpose.
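
A sketch of the idea (illustrative, not the actual ContextCleaner code):

```scala
object QuietShutdownSketch {
  @volatile private var stopped = false

  private val cleaningThread = new Thread("cleaning-thread") {
    override def run(): Unit =
      try {
        while (!stopped) Thread.sleep(100) // stand-in for the cleaning loop
      } catch {
        case _: InterruptedException if stopped => // deliberate shutdown: stay quiet
        case e: InterruptedException => throw e    // unexpected: surface loudly
      }
  }

  def start(): Unit = cleaningThread.start()
  def stop(): Unit = { stopped = true; cleaningThread.interrupt() }
}
```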

Author: Andrew Or <andrew@databricks.com>

Closes #4882 from andrewor14/cleaner-interrupt and squashes the following commits:

8652120 [Andrew Or] Just a hot fix
2015-03-03 20:49:45 -08:00
Imran Rashid 1f1fccc5ce [SPARK-5949] HighlyCompressedMapStatus needs more classes registered w/ kryo
https://issues.apache.org/jira/browse/SPARK-5949

Author: Imran Rashid <irashid@cloudera.com>

Closes #4877 from squito/SPARK-5949_register_roaring_bitmap and squashes the following commits:

7e13316 [Imran Rashid] style style style
5f6bb6d [Imran Rashid] more style
709bfe0 [Imran Rashid] style
a5cb744 [Imran Rashid] update tests to cover both types of RoaringBitmapContainers
09610c6 [Imran Rashid] formatting
f9a0b7c [Imran Rashid] put primitive array registrations together
97beaf8 [Imran Rashid] SPARK-5949 HighlyCompressedMapStatus needs more classes registered w/ kryo
2015-03-03 15:33:19 -08:00
Andrew Or 6c20f35290 [SPARK-6133] Make sc.stop() idempotent
Previously, we would get the following (benign) error if we called `sc.stop()` twice, because the listener bus would try to post the end event again even after it had already stopped. This happens occasionally when flaky tests fail, usually as a result of other sources of error. Either way, we shouldn't be logging this error when it is not the cause of the failure.
```
ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerApplicationEnd(1425348445682)
```
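
A minimal sketch of the idempotency guard (illustrative; Spark's actual stop() tears down many more components):

```scala
import java.util.concurrent.atomic.AtomicBoolean

object StopGuardSketch {
  private val stopped = new AtomicBoolean(false)

  def stop(): Unit = {
    // Only the first caller proceeds; later calls return silently instead of
    // re-posting the application-end event to an already-stopped listener bus.
    if (!stopped.compareAndSet(false, true)) return
    println("posting SparkListenerApplicationEnd, stopping listener bus ...")
  }
}

StopGuardSketch.stop() // does the work
StopGuardSketch.stop() // no-op, no error logged
```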

Author: Andrew Or <andrew@databricks.com>

Closes #4871 from andrewor14/sc-stop and squashes the following commits:

a14afc5 [Andrew Or] Move code after code
915db16 [Andrew Or] Move code into code
2015-03-03 15:09:57 -08:00
Andrew Or fe63e82291 [SPARK-6132] ContextCleaner race condition across SparkContexts
The problem is that `ContextCleaner` may clean variables that belong to a different `SparkContext`. This can happen if the `SparkContext` to which the cleaner belongs stops, and a new one is started immediately afterwards in the same JVM. In this case, if the cleaner is in the middle of cleaning a broadcast, for instance, it will do so through `SparkEnv.get.blockManager`, which could be one that belongs to a different `SparkContext`.

JoshRosen and I suspect that this is the cause of many flaky tests, most notably the `JavaAPISuite`. We were able to reproduce the failure locally (though it is not deterministic and very hard to reproduce).
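
A schematic of the guard (illustrative; the actual change synchronizes ContextCleaner's stop with in-flight cleanup):

```scala
object CleanerGuardSketch {
  @volatile private var stopped = false
  private val stopLock = new Object

  def stop(): Unit = stopLock.synchronized { stopped = true }

  def cleanBroadcast(id: Long): Unit = stopLock.synchronized {
    // Holding the lock means stop() cannot complete mid-cleanup, so the
    // cleaner never touches a SparkEnv belonging to a newer SparkContext.
    if (!stopped) println(s"cleaning broadcast $id via this context's SparkEnv")
  }
}
```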

Author: Andrew Or <andrew@databricks.com>

Closes #4869 from andrewor14/cleaner-masquerade and squashes the following commits:

29168c0 [Andrew Or] Synchronize ContextCleaner stop
2015-03-03 13:44:05 -08:00
Sean Owen e750a6bfdd SPARK-1911 [DOCS] Warn users if their assembly jars are not built with Java 6
Add warning about building with Java 7+ and running the JAR on early Java 6.

CC andrewor14

Author: Sean Owen <sowen@cloudera.com>

Closes #4874 from srowen/SPARK-1911 and squashes the following commits:

79fa2f6 [Sean Owen] Add warning about building with Java 7+ and running the JAR on early Java 6.
2015-03-03 13:40:11 -08:00
Andrew Or 9af001749a Revert "[SPARK-5423][Core] Cleanup resources in DiskMapIterator.finalize to ensure deleting the temp file"
This reverts commit 90095bf3ce.
2015-03-03 13:03:52 -08:00
Wenchen Fan e359794cec [SPARK-6138][CORE][minor] enhance the toArray method in SizeTrackingVector
Use array copy instead of `Iterator#toArray` to make it more efficient.
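
A sketch of the change (illustrative names; the vector's internals differ in detail):

```scala
// Before: build an Iterator over the backing array and call toArray on it.
// After: a single System.arraycopy into a right-sized destination array.
def toArray(backing: Array[AnyRef], numElements: Int): Array[AnyRef] = {
  val dest = new Array[AnyRef](numElements)
  System.arraycopy(backing, 0, dest, 0, numElements)
  dest
}
```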

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #4825 from cloud-fan/minor and squashes the following commits:

c933ee5 [Wenchen Fan] make toArray method just in parent
946a35b [Wenchen Fan] minor enhance
2015-03-03 12:12:23 +00:00
CodingCat 975643c256 [SPARK-6118] making package name of deploy.worker.CommandUtils and deploy.CommandUtilsSuite consistent
https://issues.apache.org/jira/browse/SPARK-6118

I found that the object CommandUtils is placed under the deploy.worker package, while CommandUtilsSuite is under deploy.

Conventionally, we put the implementation and its unit test class under the same package.

Here, to minimize the change, I move CommandUtilsSuite to the worker package.

**However, CommandUtils seems to contain some general methods (though they are only used by worker.* classes currently)**, so we may also consider relocating CommandUtils.

Author: CodingCat <zhunansjtu@gmail.com>

Closes #4856 from CodingCat/SPARK-6118 and squashes the following commits:

cb93700 [CodingCat] making package name consistent
2015-03-03 10:32:57 +00:00
Patrick Wendell 0c9a8eaed7 BUILD: Minor tweaks to internal build scripts
This adds two features:
1. The ability to publish with a different maven version than
   that specified in the release source.
2. Forking of different Zinc instances during the parallel dist
   creation (to help with some stability issues).
2015-03-03 01:53:48 -08:00
Patrick Wendell 165ff36426 HOTFIX: Bump HBase version in MapR profiles.
After #2982 (SPARK-4048) we rely on the newer HBase packaging format.
2015-03-03 01:38:50 -08:00
DB Tsai b196056190 [SPARK-5537][MLlib][Docs] Add user guide for multinomial logistic regression
Adding more description on top of #4861.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #4866 from dbtsai/doc and squashes the following commits:

37e9d07 [DB Tsai] doc
2015-03-02 22:37:12 -08:00
Joseph K. Bradley c2fe3a6ff1 [SPARK-6120] [mllib] Warnings about memory in tree, ensemble model save
Issue: When the Python DecisionTree example in the programming guide is run, it runs out of Java Heap Space when using the default memory settings for the spark shell.

This PR adds a warning for this case.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4864 from jkbradley/dt-save-heap and squashes the following commits:

02e8daf [Joseph K. Bradley] fixed based on code review
7ecb1ed [Joseph K. Bradley] Added warnings about memory when calling tree and ensemble model save with too small a Java heap size
2015-03-02 22:33:51 -08:00