Commit graph

9661 commits

Author SHA1 Message Date
mcheah 6fe70d8432 [SPARK-5691] Fixing wrong data structure lookup for dupe app registratio...
In Master's registerApplication method, it checks if the application had
already registered by examining the addressToWorker hash map. In reality,
it should refer to the addressToApp data structure, as this is what
really tracks which apps have been registered.
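
A minimal sketch of the lookup described above, written in Python rather than Spark's Scala. The `Master` class, its method names, and the snake_case map names are hypothetical stand-ins for illustration, not Spark's actual code.

```python
class Master:
    """Toy stand-in for a cluster master; all names here are hypothetical."""

    def __init__(self):
        self.address_to_worker = {}  # worker address -> worker info
        self.address_to_app = {}     # app driver address -> app info

    def register_worker(self, address, worker):
        self.address_to_worker[address] = worker

    def register_application(self, address, app):
        # Look in the map that tracks applications, not the worker map.
        if address in self.address_to_app:
            return False  # duplicate registration, ignore it
        self.address_to_app[address] = app
        return True


master = Master()
print(master.register_application("host:5000", "app-1"))  # True
print(master.register_application("host:5000", "app-1"))  # False
```

Checking the worker map instead would let the second call through, because no worker was ever registered at that address.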

Author: mcheah <mcheah@palantir.com>

Closes #4477 from mccheah/spark-5691 and squashes the following commits:

efdc573 [mcheah] [SPARK-5691] Fixing wrong data structure lookup for dupe app registration
2015-02-09 13:20:14 -08:00
Liang-Chi Hsieh dae216147f [SPARK-5664][BUILD] Restore stty settings when exiting from SBT's spark-shell
For launching spark-shell from SBT.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4451 from viirya/restore_stty and squashes the following commits:

fdfc480 [Liang-Chi Hsieh] Restore stty settings when exit (for launching spark-shell from SBT).
2015-02-09 11:45:12 -08:00
Davies Liu afb131637d [SPARK-5678] Convert DataFrame to pandas.DataFrame and Series
```
pyspark.sql.DataFrame.to_pandas = to_pandas(self) unbound pyspark.sql.DataFrame method
    Collect all the rows and return a `pandas.DataFrame`.

    >>> df.to_pandas()  # doctest: +SKIP
       age   name
    0    2  Alice
    1    5    Bob

pyspark.sql.Column.to_pandas = to_pandas(self) unbound pyspark.sql.Column method
    Return a pandas.Series from the column

    >>> df.age.to_pandas()  # doctest: +SKIP
    0    2
    1    5
    dtype: int64
```

Not tested by Jenkins (they depend on pandas).

Author: Davies Liu <davies@databricks.com>

Closes #4476 from davies/to_pandas and squashes the following commits:

6276fb6 [Davies Liu] Convert DataFrame to pandas.DataFrame and Series
2015-02-09 11:42:52 -08:00
Sean Owen de7806048a SPARK-4267 [YARN] Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
Before passing to YARN, escape arguments in "extraJavaOptions" args, in order to correctly handle cases like -Dfoo="one two three". Also standardize how these args are handled and ensure that individual args are treated as stand-alone args, not one string.
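
The quoting problem can be illustrated with Python's standard `shlex` module. This is only an analogy under the assumption of POSIX shell quoting rules; it is not the escaping routine the patch adds, and the `-Dbar=baz` option is made up for the example.

```python
import shlex

java_opts = '-Dfoo="one two three" -Dbar=baz'

# Split the option string the way a POSIX shell would, so each option
# becomes a stand-alone argument.
args = shlex.split(java_opts)

# Quote each argument before embedding it in a shell command line.
escaped = [shlex.quote(a) for a in args]
command = " ".join(["java"] + escaped)

# Round trip: a shell-level parse recovers the original arguments.
assert shlex.split(command) == ["java"] + args
```

Without the quoting step, `one`, `two`, and `three` would reach the JVM as three separate arguments.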

vanzin andrewor14

Author: Sean Owen <sowen@cloudera.com>

Closes #4452 from srowen/SPARK-4267.2 and squashes the following commits:

c8297d2 [Sean Owen] Before passing to YARN, escape arguments in "extraJavaOptions" args, in order to correctly handle cases like -Dfoo="one two three". Also standardize how these args are handled and ensure that individual args are treated as stand-alone args, not one string.
2015-02-09 10:33:57 -08:00
Sandy Ryza 0793ee1b4d SPARK-2149. [MLLIB] Univariate kernel density estimation
Author: Sandy Ryza <sandy@cloudera.com>

Closes #1093 from sryza/sandy-spark-2149 and squashes the following commits:

5f06b33 [Sandy Ryza] More review comments
0f73060 [Sandy Ryza] Respond to Sean's review comments
0dfa005 [Sandy Ryza] SPARK-2149. Univariate kernel density estimation
2015-02-09 10:12:12 +00:00
Nicholas Chammas 4dfe180fc8 [SPARK-5473] [EC2] Expose SSH failures after status checks pass
If there is some fatal problem with launching a cluster, `spark-ec2` just hangs without giving the user useful feedback on what the problem is.

This PR exposes the output of the SSH calls to the user if the SSH test fails during cluster launch for any reason but the instance status checks are all green. It also removes the growing trail of dots while waiting in favor of a fixed 3 dots.

For example:

```
$ ./ec2/spark-ec2 -k key -i /incorrect/path/identity.pem --instance-type m3.medium --slaves 1 --zone us-east-1c launch "spark-test"
Setting up security groups...
Searching for existing cluster spark-test...
Spark AMI: ami-35b1885c
Launching instances...
Launched 1 slaves in us-east-1c, regid = r-7dadd096
Launched master in us-east-1c, regid = r-fcadd017
Waiting for cluster to enter 'ssh-ready' state...
Warning: SSH connection error. (This could be temporary.)
Host: 127.0.0.1
SSH return code: 255
SSH output: Warning: Identity file /incorrect/path/identity.pem not accessible: No such file or directory.
Warning: Permanently added '127.0.0.1' (RSA) to the list of known hosts.
Permission denied (publickey).
```

This should give users enough information when some unrecoverable error occurs during launch so they can know to abort the launch. This will help avoid situations like the ones reported [here on Stack Overflow](http://stackoverflow.com/q/28002443/) and [here on the user list](http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3C1422323829398-21381.postn3.nabble.com%3E), where the users couldn't tell what the problem was because it was being hidden by `spark-ec2`.
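
A rough sketch of the idea in Python: run the check, and if it fails, print the return code and captured output instead of hiding them. The function name `run_check` and the fake failing command are assumptions made so the example runs without a remote host; this is not the `spark-ec2` code itself.

```python
import subprocess
import sys

def run_check(command):
    """Run a connectivity check; on failure, show its return code and output."""
    proc = subprocess.run(command, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    if proc.returncode != 0:
        print("Warning: SSH connection error. (This could be temporary.)")
        print("SSH return code:", proc.returncode)
        print("SSH output:", proc.stdout.strip())
    return proc.returncode == 0

# Stand-in for a failing ssh call so the sketch runs without a remote host.
fake_ssh = [sys.executable, "-c",
            "import sys; print('Permission denied (publickey).'); sys.exit(255)"]
run_check(fake_ssh)
```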

This is a usability improvement that should be backported to 1.2.

Resolves [SPARK-5473](https://issues.apache.org/jira/browse/SPARK-5473).

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #4262 from nchammas/expose-ssh-failure and squashes the following commits:

8bda6ed [Nicholas Chammas] default to print SSH output
2b92534 [Nicholas Chammas] show SSH output after status check pass
2015-02-09 09:44:53 +00:00
Xiangrui Meng 855d12ac0a [SPARK-5539][MLLIB] LDA guide
This is the LDA user guide from jkbradley with Java and Scala code examples.

Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4465 from mengxr/lda-guide and squashes the following commits:

6dcb7d1 [Xiangrui Meng] update java example in the user guide
76169ff [Xiangrui Meng] update java example
36c3ae2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into lda-guide
c2a1efe [Joseph K. Bradley] Added LDA programming guide, plus Java example (which is in the guide and probably should be removed).
2015-02-08 23:40:36 -08:00
Hung Lin 4575c5643a [SPARK-5472][SQL] Fix Scala code style
Fix Scala code style.

Author: Hung Lin <hung@zoomdata.com>

Closes #4464 from hunglin/SPARK-5472 and squashes the following commits:

ef7a3b3 [Hung Lin] SPARK-5472: fix scala style
2015-02-08 22:36:42 -08:00
Sean Owen 4396dfb37f SPARK-4405 [MLLIB] Matrices.* construction methods should check for rows x cols overflow
Check that size of dense matrix array is not beyond Int.MaxValue in Matrices.* methods. jkbradley this should be an easy one. Review and/or merge as you see fit.
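
As an illustration of the check, here is a small Python sketch. The helper name `dense_size` and the error text are hypothetical; only the limit (the value of `Int.MaxValue`, 2^31 - 1) comes from the description above.

```python
INT_MAX = 2**31 - 1  # value of Scala's Int.MaxValue

def dense_size(num_rows, num_cols):
    """Return num_rows * num_cols, or raise if it exceeds INT_MAX."""
    size = num_rows * num_cols  # Python ints do not overflow
    if size > INT_MAX:
        raise ValueError(
            f"{num_rows} x {num_cols} dense matrix is too large to allocate")
    return size
```

For example, `dense_size(3, 4)` returns 12, while `dense_size(50000, 50000)` raises because 2,500,000,000 exceeds the limit.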

Author: Sean Owen <sowen@cloudera.com>

Closes #4461 from srowen/SPARK-4405 and squashes the following commits:

c67574e [Sean Owen] Check that size of dense matrix array is not beyond Int.MaxValue in Matrices.* methods
2015-02-08 21:08:50 -08:00
Joseph K. Bradley c17161189d [SPARK-5660][MLLIB] Make Matrix apply public
This is #4447 with `override`.

Closes #4447

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #4462 from mengxr/SPARK-5660 and squashes the following commits:

f82c8d6 [Xiangrui Meng] add override to matrix.apply
91cedde [Joseph K. Bradley] made matrix apply public
2015-02-08 21:07:36 -08:00
Reynold Xin a052ed4250 [SPARK-5643][SQL] Add a show method to print the content of a DataFrame in tabular format.
An example:
```
year  month AVG('Adj Close) MAX('Adj Close)
1980  12    0.503218        0.595103
1981  01    0.523289        0.570307
1982  02    0.436504        0.475256
1983  03    0.410516        0.442194
1984  04    0.450090        0.483521
```

Author: Reynold Xin <rxin@databricks.com>

Closes #4416 from rxin/SPARK-5643 and squashes the following commits:

d0e0d6e [Reynold Xin] [SQL] Minor update to data source and statistics documentation.
269da83 [Reynold Xin] Updated isLocal comment.
2cf3c27 [Reynold Xin] Moved logic into optimizer.
1a04d8b [Reynold Xin] [SPARK-5643][SQL] Add a show method to print the content of a DataFrame in columnar format.
2015-02-08 18:56:51 -08:00
Sam Halliday 56aff4bd6c SPARK-5665 [DOCS] Update netlib-java documentation
I am the author of netlib-java and I found this documentation to be out of date. Some main points:

1. Breeze has not depended on jBLAS for some time
2. netlib-java provides a pure JVM implementation as the fallback (the original docs did not appear to be aware of this, claiming that gfortran was necessary)
3. The licensing issue is not just about LGPL: optimised natives have proprietary licenses. Building with the LGPL flag turned on really doesn't help you get past this.
4. I really think it's best to direct people to my detailed setup guide instead of trying to compress it into one sentence. It is different for each architecture, each OS, and for each backend.

I hope this helps to clear things up 😄

Author: Sam Halliday <sam.halliday@Gmail.com>
Author: Sam Halliday <sam.halliday@gmail.com>

Closes #4448 from fommil/patch-1 and squashes the following commits:

18cda11 [Sam Halliday] remove link to skillsmatters at request of @mengxr
a35e4a9 [Sam Halliday] reword netlib-java/breeze docs
2015-02-08 16:34:26 -08:00
Xiangrui Meng 5c299c58fb [SPARK-5598][MLLIB] model save/load for ALS
following #4233. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4422 from mengxr/SPARK-5598 and squashes the following commits:

a059394 [Xiangrui Meng] SaveLoad not extending Loader
14b7ea6 [Xiangrui Meng] address comments
f487cb2 [Xiangrui Meng] add unit tests
62fc43c [Xiangrui Meng] implement save/load for MFM
2015-02-08 16:26:20 -08:00
Yin Huai 804949d519 [SQL] Set sessionState in QueryExecution.
This PR sets the SessionState in HiveContext's QueryExecution. So, we can make sure that SessionState.get can return the SessionState every time.

Author: Yin Huai <yhuai@databricks.com>

Closes #4445 from yhuai/setSessionState and squashes the following commits:

769c9f1 [Yin Huai] Remove unused import.
439f329 [Yin Huai] Try again.
427a0c9 [Yin Huai] Set SessionState everytime when we create a QueryExecution in HiveContext.
a3b7793 [Yin Huai] Set sessionState when dealing with CreateTableAsSelect.
2015-02-08 14:55:07 -08:00
medale 75fdccca32 [SPARK-3039] [BUILD] Spark assembly for new hadoop API (hadoop 2) contai...
...ns avro-mapred for

hadoop 1 API had been marked as resolved but did not work for at least some
builds due to version conflicts using avro-mapred-1.7.5.jar and
avro-mapred-1.7.6-hadoop2.jar (the correct version) when building for hadoop2.

In sql/hive/pom.xml, org.spark-project.hive:hive-exec depends on avro-mapred 1.7.5:

Building Spark Project Hive 1.2.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli)  spark-hive_2.10 ---
[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.0
[INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
[INFO] |  \- org.apache.avro:avro-mapred:jar:1.7.5:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
[INFO]

Excluding this dependency allows the explicitly listed avro-mapred dependency
to be picked up.

Author: medale <medale94@yahoo.com>

Closes #4315 from medale/avro-hadoop2 and squashes the following commits:

1ab4fa3 [medale] Merge branch 'master' into avro-hadoop2
9d85e2a [medale] Merge remote-tracking branch 'upstream/master' into avro-hadoop2
51b9c2a [medale] [SPARK-3039] [BUILD] Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API had been marked as resolved but did not work for at least some builds due to version conflicts using avro-mapred-1.7.5.jar and avro-mapred-1.7.6-hadoop2.jar (the correct version) when building for hadoop2.
2015-02-08 10:35:29 +00:00
Kirill A. Korinskiy 23a99dabf1 [SPARK-5672][Web UI] Don't return ERROR 500 when have missing args
The Spark web UI returns `HTTP ERROR 500` when GET arguments are missing.
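
A minimal sketch of the intended behaviour (the squashed commit below mentions returning `HTTP ERROR 400`). The handler name, the `id` parameter, and the message text are hypothetical and not taken from the Spark UI code.

```python
def handle_request(params, required=("id",)):
    """Return an (HTTP status, body) pair for a page that needs query args."""
    missing = [name for name in required if name not in params]
    if missing:
        # A bad request from the client, not a server-side failure.
        return 400, "Missing parameter(s): " + ", ".join(missing)
    return 200, "Details for " + params["id"]
```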

Author: Kirill A. Korinskiy <catap@catap.ru>

Closes #4239 from catap/ui_500 and squashes the following commits:

520e180 [Kirill A. Korinskiy] [SPARK-5672][Web UI] Return `HTTP ERROR 400` when have missing args
2015-02-08 10:31:46 +00:00
mbittmann 4878313695 [SPARK-5656] Fail gracefully for large values of k and/or n that will ex...
...ceed max int.

Large values of k and/or n in EigenValueDecomposition.symmetricEigs will result in array initialization to a value larger than Integer.MAX_VALUE in the following: var v = new Array[Double](n * ncv)
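
To see why `n * ncv` is a problem, the following Python sketch emulates signed 32-bit multiplication. The helper `to_int32` and the sample values of `n` and `ncv` are chosen for illustration only.

```python
def to_int32(value):
    """Wrap an integer to the signed 32-bit range (two's complement)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value

n, ncv = 60000, 40000
print(n * ncv)            # 2400000000
print(to_int32(n * ncv))  # -1894967296
```

A product that wraps to a negative number cannot be used as an array length, so checking the sizes up front allows a clear error message instead.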

Author: mbittmann <mbittmann@gmail.com>
Author: bittmannm <mark.bittmann@agilex.com>

Closes #4433 from mbittmann/master and squashes the following commits:

ee56e05 [mbittmann] [SPARK-5656] Combine checks into simple message
e49cbbb [mbittmann] [SPARK-5656] Simply error message
860836b [mbittmann] Array size check updates based on code review
a604816 [bittmannm] [SPARK-5656] Fail gracefully for large values of k and/or n that will exceed max int.
2015-02-08 10:13:29 +00:00
liuchang0812 6fb141e2a9 [SPARK-5366][EC2] Check the mode of private key
Check the mode of private key file.
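
One way such a check could look in Python, assuming a POSIX file system and assuming the rule is "no permissions for group or others" (the rule ssh itself enforces for identity files). The function name `key_is_private` is hypothetical.

```python
import os
import stat
import tempfile

def key_is_private(path):
    """True if neither group nor others have any permission on the file."""
    mode = os.stat(path).st_mode
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0

# Demonstrate on a throwaway file instead of a real key.
with tempfile.NamedTemporaryFile(delete=False) as f:
    key_path = f.name
os.chmod(key_path, 0o644)
print(key_is_private(key_path))  # False
os.chmod(key_path, 0o600)
print(key_is_private(key_path))  # True
os.remove(key_path)
```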

Author: liuchang0812 <liuchang0812@gmail.com>

Closes #4162 from Liuchang0812/ec2-script and squashes the following commits:

fc37355 [liuchang0812] quota file name
01ed464 [liuchang0812] more output
ce2a207 [liuchang0812] pep8
f44efd2 [liuchang0812] move code to real_main
8475a54 [liuchang0812] fix bug
cd61a1a [liuchang0812] import stat
c106cb2 [liuchang0812] fix trivis bug
89c9953 [liuchang0812] more output about checking private key
1177a90 [liuchang0812] remove commet
41188ab [liuchang0812] check the mode of private key
2015-02-08 10:08:51 +00:00
Josh Rosen 5de14cc276 [SPARK-5671] Upgrade jets3t to 0.9.2 in hadoop-2.3 and 2.4 profiles
Upgrading from jets3t 0.9.0 to 0.9.2 fixes a dependency issue that was
causing UISeleniumSuite to fail with ClassNotFoundExceptions when run
with the hadoop-2.3 or hadoop-2.4 profiles.

The jets3t release notes can be found at http://www.jets3t.org/RELEASE_NOTES.html

Author: Josh Rosen <joshrosen@databricks.com>

Closes #4454 from JoshRosen/SPARK-5671 and squashes the following commits:

fa6cb3e [Josh Rosen] [SPARK-5671] Upgrade jets3t to 0.9.2 in hadoop-2.3 and 2.4 profiles
2015-02-07 17:19:08 -08:00
Zhan Zhang ecbbed2e4e [SPARK-5108][BUILD] Jackson dependency management for Hadoop-2.6.0 support
There is a dependency compatibility issue: hadoop-2.6.0 currently uses Jackson 1.9.13. Upgrade to the same version to make it consistent.

Author: Zhan Zhang <zhazhan@gmail.com>

Closes #3938 from zhzhan/spark5108 and squashes the following commits:

0080a84 [Zhan Zhang] change to upgrade jackson version only in hadoop-2.x
0b9bad6 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark into spark5108
917600a [Zhan Zhang] solve conflicts
f7064d0 [Zhan Zhang] hadoop2.6 dependency management fix
fc56b25 [Zhan Zhang] squash all commits
3bf966c [Zhan Zhang] test
2015-02-07 19:41:30 +00:00
Jacek Lewandowski dd4cb33a27 SPARK-5408: Use -XX:MaxPermSize specified by user instead of default in ...
...ExecutorRunner and DriverRunner

Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>

Closes #4203 from jacek-lewandowski/SPARK-5408-1.3 and squashes the following commits:

d913686 [Jacek Lewandowski] SPARK-5408: Use -XX:MaxPermSize specified by used instead of default in ExecutorRunner and DriverRunner
2015-02-07 15:58:04 +00:00
Michael Armbrust e9a4fe12d3 [BUILD] Add the ability to launch spark-shell from SBT.
Now you can quickly launch the spark-shell without building an assembly. For quick development iteration, run `build/sbt ~sparkShell`; calling exit will relaunch with any changes.

Author: Michael Armbrust <michael@databricks.com>

Closes #4438 from marmbrus/sparkShellSbt and squashes the following commits:

b4e44fe [Michael Armbrust] [BUILD] Add the ability to launch spark-shell from SBT.
2015-02-07 00:14:38 -08:00
Andrew Or 1390e56fa8 [SPARK-5388] Provide a stable application submission gateway for standalone cluster mode
The goal is to provide a stable, REST-based application submission gateway that is not inherently based on Akka, which is unstable across versions. This PR targets standalone cluster mode, but is implemented in a general enough manner that can be potentially extended to other modes in the future. Client mode is currently not included in the changes here because there are many more Akka messages exchanged there.

As of the changes here, the Master will advertise two ports, 7077 and 6066. We need to keep around the old one (7077) for client mode and older versions of Spark submit. However, all new versions of Spark submit will use the REST gateway (6066).

By the way this includes ~700 lines of tests and ~200 lines of license.

Author: Andrew Or <andrew@databricks.com>

Closes #4216 from andrewor14/rest and squashes the following commits:

8d7ce07 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
6f0c597 [Andrew Or] Use nullable fields for integer and boolean values
dfe4bd7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
b9e2a08 [Andrew Or] Minor comments
02b5cea [Andrew Or] Fix tests
d2b1ef8 [Andrew Or] Comment changes + minor code refactoring across the board
9c82a36 [Andrew Or] Minor comment and wording updates
b4695e7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
c9a8ad7 [Andrew Or] Do not include appResource and mainClass as properties
6fc7670 [Andrew Or] Report REST server response back to the user
40e6095 [Andrew Or] Pass submit parameters through system properties
cbd670b [Andrew Or] Include unknown fields, if any, in server response
9fee16f [Andrew Or] Include server protocol version on mismatch
09f873a [Andrew Or] Fix style
8188e61 [Andrew Or] Upgrade Jackson from 2.3.0 to 2.4.4
37538e0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
9165ae8 [Andrew Or] Fall back to Akka if endpoint was not REST
252d53c [Andrew Or] Clean up server error handling behavior further
c643f64 [Andrew Or] Fix style
bbbd329 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
792e112 [Andrew Or] Use specific HTTP response codes on error
f98660b [Andrew Or] Version the protocol and include it in REST URL
721819f [Andrew Or] Provide more REST-like interface for submit/kill/status
581f7bf [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
9e0d1af [Andrew Or] Move some classes around to reduce number of files (minor)
42e5de4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
1f1c03f [Andrew Or] Use Jackson's DefaultScalaModule to simplify messages
9229433 [Andrew Or] Reduce duplicate naming in REST field
ade28fd [Andrew Or] Clean up REST response output in Spark submit
b2fef8b [Andrew Or] Abstract the success field to the general response
6c57b4b [Andrew Or] Increase timeout in end-to-end tests
bf696ff [Andrew Or] Add checks for enabling REST when using kill/status
7ee6737 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
e2f7f5f [Andrew Or] Provide more safeguard against missing fields
9581df7 [Andrew Or] Clean up uses of exceptions
914fdff [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
e2104e6 [Andrew Or] stable -> rest
3db7379 [Andrew Or] Fix comments and name fields for better error messages
8d43486 [Andrew Or] Replace SubmitRestProtocolAction with class name
df90e8b [Andrew Or] Use Jackson for JSON de/serialization
d7a1f9f [Andrew Or] Fix local cluster tests
efa5e18 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
e42c131 [Andrew Or] Add end-to-end tests for standalone REST protocol
837475b [Andrew Or] Show the REST port on the Master UI
d8d3717 [Andrew Or] Use a daemon thread pool for REST server
6568ca5 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
77774ba [Andrew Or] Minor fixes
206cae4 [Andrew Or] Refactor and add tests for the REST protocol
63c05b3 [Andrew Or] Remove MASTER as a field (minor)
9e21b72 [Andrew Or] Action -> SparkSubmitAction (minor)
51c5ca6 [Andrew Or] Distinguish client and server side Spark versions
b44e103 [Andrew Or] Implement status requests + fix validation behavior
120ab9d [Andrew Or] Support kill and request driver status through SparkSubmit
544de1d [Andrew Or] Major clean ups in code and comments
e958cae [Andrew Or] Supported nested values in messages
484bd21 [Andrew Or] Specify an ordering for fields in SubmitDriverRequestMessage
6ff088d [Andrew Or] Rename classes to generalize REST protocol
af9d9cb [Andrew Or] Integrate REST protocol in standalone mode
53e7c0e [Andrew Or] Initial client, server, and all the messages
2015-02-06 15:57:06 -08:00
Grzegorz Dubicki e772b4e4e1 SPARK-5403: Ignore UserKnownHostsFile in SSH calls
See https://issues.apache.org/jira/browse/SPARK-5403

Author: Grzegorz Dubicki <grzegorz.dubicki@gmail.com>

Closes #4196 from grzegorz-dubicki/SPARK-5403 and squashes the following commits:

a7d863f [Grzegorz Dubicki] Resolve start command hanging issue
2015-02-06 15:43:58 -08:00
Xiangrui Meng 0e23ca9f80 [SPARK-5601][MLLIB] make streaming linear algorithms Java-friendly
Overload `trainOn`, `predictOn`, and `predictOnValues`.

CC freeman-lab

Author: Xiangrui Meng <meng@databricks.com>

Closes #4432 from mengxr/streaming-java and squashes the following commits:

6a79b85 [Xiangrui Meng] add java test for streaming logistic regression
2d7b357 [Xiangrui Meng] organize imports
1f662b3 [Xiangrui Meng] make streaming linear algorithms Java-friendly
2015-02-06 15:42:59 -08:00
Cheng Lian c4021401e3 [SQL] [Minor] HiveParquetSuite was disabled by mistake, re-enable them
Author: Cheng Lian <lian@databricks.com>

Closes #4440 from liancheng/parquet-oops and squashes the following commits:

f21ede4 [Cheng Lian] HiveParquetSuite was disabled by mistake, re-enable them.
2015-02-06 15:23:42 -08:00
Michael Armbrust 76c4bf59f6 [SQL] Use TestSQLContext in Java tests
Sometimes tests were failing due to the creation of multiple `SparkContext`s in a single JVM.

Author: Michael Armbrust <michael@databricks.com>

Closes #4441 from marmbrus/javaTests and squashes the following commits:

657b1e0 [Michael Armbrust] [SQL] Use TestSQLContext in Java tests
2015-02-06 15:11:02 -08:00
lianhuiwang 61073f8321 [SPARK-4994][network]Cleanup removed executors' ShuffleInfo in yarn shuffle service
When an application completes, YARN's NodeManager can remove the application's local dirs, but the metadata for the completed application's executors is not removed. As a result, the YARN ShuffleService needs more and more memory to store executors' ShuffleInfo, so this metadata needs to be removed as well.
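
The cleanup can be pictured with a small Python sketch. The `ShuffleRegistry` class and its methods are hypothetical; the real shuffle service is part of Spark's network module and is not written in Python.

```python
class ShuffleRegistry:
    """Toy registry of per-executor shuffle metadata (hypothetical names)."""

    def __init__(self):
        self.executors = {}  # (app_id, exec_id) -> shuffle info

    def register_executor(self, app_id, exec_id, info):
        self.executors[(app_id, exec_id)] = info

    def application_removed(self, app_id):
        """Drop metadata for every executor of a finished application."""
        stale = [key for key in self.executors if key[0] == app_id]
        for key in stale:
            del self.executors[key]
        return len(stale)
```

Calling `application_removed` when an application finishes keeps the registry from growing with entries that will never be read again.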

Author: lianhuiwang <lianhuiwang09@gmail.com>

Closes #3828 from lianhuiwang/SPARK-4994 and squashes the following commits:

f3ba1d2 [lianhuiwang] Cleanup removed executors' ShuffleInfo
2015-02-06 14:48:30 -08:00
huangzhaowei 2bda1c1d37 [SPARK-5444][Network]Add a retry to deal with the conflict port in netty server.
If `spark.blockManager.port` conflicts with a port that is already in use, Spark throws an exception and exits.
Add a retry to avoid this situation.
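
A sketch of a bind-with-retry loop in Python. It assumes the retry simply tries the next port number each time; the function name, the `max_retries` default, and that policy are illustrative assumptions, not details taken from the patch.

```python
import socket

def bind_with_retry(start_port, max_retries=16, host="127.0.0.1"):
    """Try start_port, start_port + 1, ... until a bind succeeds."""
    for offset in range(max_retries + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, start_port + offset))
            return sock
        except OSError:
            sock.close()  # port in use, try the next one
    raise OSError(
        f"no free port in {start_port}..{start_port + max_retries}")
```

The caller can read the port that was actually bound from `sock.getsockname()[1]`.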

Author: huangzhaowei <carlmartinmax@gmail.com>

Closes #4240 from SaintBacchus/NettyPortConflict and squashes the following commits:

cc926d2 [huangzhaowei] Add a retry to deal with the conflict port in netty server.
2015-02-06 14:36:58 -08:00
Kostas Sakellis dcd1e42d6b [SPARK-4874] [CORE] Collect record count metrics
Collects record counts for both Input/Output and Shuffle Metrics. For the input/output metrics, it just increments the counter every time the iterators get accessed.

For shuffle on the write side, we count the metrics post aggregation (after a map side combine) and on the read side we count the metrics pre aggregation. This allows both the bytes read/written metrics and the records read/written to line up.

For backwards compatibility, if we deserialize an older event that doesn't have record metrics, we set the metric to -1.
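
Both ideas, counting records as an iterator is consumed and defaulting to -1 for older events, can be sketched in Python. The class name, function name, and the `"Records Read"` key used here are illustrative assumptions.

```python
class CountingIterator:
    """Wrap an iterator and count the records pulled through it."""

    def __init__(self, inner):
        self._inner = iter(inner)
        self.records_read = 0

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._inner)  # StopIteration propagates unchanged
        self.records_read += 1
        return item


def records_from_event(event):
    """Read a record count from a logged event, -1 if the field is absent."""
    return event.get("Records Read", -1)
```

Using -1 rather than 0 lets a reader distinguish "no records" from "this event predates the metric".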

Author: Kostas Sakellis <kostas@cloudera.com>

Closes #4067 from ksakellis/kostas-spark-4874 and squashes the following commits:

bd919be [Kostas Sakellis] Changed 'Records Read' in shuffleReadMetrics json output to 'Total Records Read'
dad4d57 [Kostas Sakellis] Add a comment and check to BlockObjectWriter so that it cannot be reopend.
6f236a1 [Kostas Sakellis] Renamed _recordsWritten in ShuffleWriteMetrics to be more consistent
70620a0 [Kostas Sakellis] CR Feedback
17faa3a [Kostas Sakellis] Removed AtomicLong in favour of using Long
b6f9923 [Kostas Sakellis] Merge AfterNextInterceptingIterator with InterruptableIterator to save a function call
46c8186 [Kostas Sakellis] Combined Bytes and # records into one column
57551c1 [Kostas Sakellis] Conforms to SPARK-3288
6cdb44e [Kostas Sakellis] Removed the generic InterceptingIterator and repalced it with specific implementation
1aa273c [Kostas Sakellis] CR Feedback
1bb78b1 [Kostas Sakellis] [SPARK-4874] [CORE] Collect record count metrics
2015-02-06 14:31:20 -08:00
Michael Armbrust 57961567ef [HOTFIX] Fix the maven build after adding sqlContext to spark-shell
Follow up to #4387 to fix the build break.

Author: Michael Armbrust <michael@databricks.com>

Closes #4443 from marmbrus/fixMaven and squashes the following commits:

1eeba7d [Michael Armbrust] try again
7f5fb15 [Michael Armbrust] [HOTFIX] Fix the maven build after adding sqlContext to spark-shell
2015-02-06 14:27:06 -08:00
Marcelo Vanzin 5687bab8fd [SPARK-5600] [core] Clean up FsHistoryProvider test, fix app sort order.
Clean up some test setup code to remove duplicate instantiation of the
provider. Also make sure unfinished apps are sorted correctly.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #4370 from vanzin/SPARK-5600 and squashes the following commits:

0d048d5 [Marcelo Vanzin] Cleanup test code a bit.
2585119 [Marcelo Vanzin] Review feedback.
8b97544 [Marcelo Vanzin] Merge branch 'master' into SPARK-5600
be979e9 [Marcelo Vanzin] Merge branch 'master' into SPARK-5600
298371c [Marcelo Vanzin] [SPARK-5600] [core] Clean up FsHistoryProvider test, fix app sort order.
2015-02-06 14:23:09 -08:00
Kashish Jain ca66159a4f SPARK-5613: Catch the ApplicationNotFoundException exception to avoid thread from getting killed on yarn restart.
[SPARK-5613] Added a catch block for ApplicationNotFoundException. Without this catch block, the thread gets killed when the exception occurs. The exception occurs when YARN restarts and tries to find the application ID of a Spark job that was interrupted when YARN stopped.
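
The pattern, sketched in Python with a made-up exception class standing in for YARN's `ApplicationNotFoundException`. Whether the real monitor loop stops or keeps polling after the exception is not stated here, so the `break` below is an assumption.

```python
class ApplicationNotFound(Exception):
    """Hypothetical stand-in for YARN's ApplicationNotFoundException."""


def monitor(get_report, polls):
    """Poll get_report up to `polls` times; stop cleanly if the app is gone."""
    states = []
    for _ in range(polls):
        try:
            states.append(get_report())
        except ApplicationNotFound:
            states.append("NOT_FOUND")
            break  # handled here instead of killing the thread
    return states
```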
See the stacktrace in the bug for more details.

Author: Kashish Jain <kashish.jain@guavus.com>

Closes #4392 from kasjain/branch-1.2 and squashes the following commits:

4831000 [Kashish Jain] SPARK-5613: Catch the ApplicationNotFoundException exception to avoid thread from getting killed on yarn restart.
2015-02-06 13:59:11 -08:00
Vladimir Vladimirov b3872e00d1 SPARK-5633 pyspark saveAsTextFile support for compression codec
See https://issues.apache.org/jira/browse/SPARK-5633 for details

Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com>

Closes #4403 from smartkiwi/master and squashes the following commits:

94c014e [Vladimir Vladimirov] SPARK-5633 pyspark saveAsTextFile support for compression codec
2015-02-06 13:55:02 -08:00
Xiangrui Meng 65181b7512 [HOTFIX][MLLIB] fix a compilation error with java 6
Author: Xiangrui Meng <meng@databricks.com>

Closes #4442 from mengxr/java6-fix and squashes the following commits:

2098500 [Xiangrui Meng] fix a compilation error with java 6
2015-02-06 13:52:35 -08:00
GenTang 0f3a36071a [SPARK-4983] Insert waiting time before tagging EC2 instances
The boto API doesn't support tagging EC2 instances in the same call that launches them.
We add a five-second wait so EC2 has enough time to propagate the information so that
the tagging can succeed.

Author: GenTang <gen.tang86@gmail.com>
Author: Gen TANG <gen.tang86@gmail.com>

Closes #3986 from GenTang/spark-4983 and squashes the following commits:

13e257d [Gen TANG] modification of comments
47f06755 [GenTang] print the information
ab7a931 [GenTang] solve the issus spark-4983 by inserting waiting time
3179737 [GenTang] Revert "handling exceptions about adding tags to ec2"
6a8b53b [GenTang] Revert "the improvement of exception handling"
13e97a6 [GenTang] Revert "typo"
63fd360 [GenTang] typo
692fc2b [GenTang] the improvement of exception handling
6adcf6d [GenTang] handling exceptions about adding tags to ec2
2015-02-06 13:27:40 -08:00
OopsOutOfMemory 3d3ecd7741 [SPARK-5586][Spark Shell][SQL] Make sqlContext available in spark shell
The result looks like this:
```
15/02/05 13:41:22 INFO SparkILoop: Created spark context..
Spark context available as sc.
15/02/05 13:41:22 INFO SparkILoop: Created sql context..
SQLContext available as sqlContext.

scala> sq
sql          sqlContext   sqlParser    sqrt
```

Author: OopsOutOfMemory <victorshengli@126.com>

Closes #4387 from OopsOutOfMemory/sqlContextInShell and squashes the following commits:

c7f5203 [OopsOutOfMemory] auto-import sql() function
e160697 [OopsOutOfMemory] Merge branch 'sqlContextInShell' of https://github.com/OopsOutOfMemory/spark into sqlContextInShell
37c0a16 [OopsOutOfMemory] auto detect hive support
a9c59d9 [OopsOutOfMemory] rename and reduce range of imports
6b9e309 [OopsOutOfMemory] Merge branch 'master' into sqlContextInShell
cae652f [OopsOutOfMemory] make sqlContext available in spark shell
2015-02-06 13:20:10 -08:00
Wenchen Fan 4793c8402a [SPARK-5278][SQL] Introduce UnresolvedGetField and complete the check of ambiguous reference to fields
When the `GetField` chain (`a.b.c.d.....`) is interrupted by `GetItem`, as in `a.b[0].c.d....`, the check for ambiguous references to fields is broken.
The reason is that for something like `a.b[0].c.d`, we first parse it to `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")`. Then in `LogicalPlan#resolve`, we resolve `"a.b"` and build a `GetField` chain from the bottom (the relation). But for the 2 outer `GetField`s, we have to resolve them in `Analyzer` or do it in `GetField` lazily: check the data type of the child, search for the needed field, etc., which is similar to what we have done in `LogicalPlan#resolve`.
So in this PR, the fix is just to copy the same logic from `LogicalPlan#resolve` to `Analyzer`, which is simple and quick, but I do suggest introducing `UnresolvedGetField` as I explained in https://github.com/apache/spark/pull/2405.
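
To make the parse shape concrete, here is a toy Python function that produces the same nesting for a dotted path with bracketed indexes. It uses plain tuples and a regular expression; it is an illustration only and has nothing to do with how the Catalyst parser is implemented.

```python
import re

def parse_path(path):
    """Build a nested expression for a path such as 'a.b[0].c.d'."""
    # Everything before the first '[' stays one unresolved name.
    head, bracket, rest = path.partition("[")
    expr = ("Unresolved", head)
    if not bracket:
        return expr
    rest = "[" + rest
    # Tokens are either an index like [0] or a field like .c
    for index, field in re.findall(r"\[(\d+)\]|\.(\w+)", rest):
        if index:
            expr = ("GetItem", expr, int(index))
        else:
            expr = ("GetField", expr, field)
    return expr
```

For `a.b[0].c.d` this yields `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")` in tuple form: the two outer `GetField` nodes sit above the `GetItem`, which is why they are not covered when only the `"a.b"` prefix is resolved.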

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #4068 from cloud-fan/simple and squashes the following commits:

a6857b5 [Wenchen Fan] fix import order
8411c40 [Wenchen Fan] use UnresolvedGetField
2015-02-06 13:08:09 -08:00
wangfei bc36356080 [SQL][Minor] Remove cache keyword in SqlParser
Since the cache keyword is already defined in `SparkSQLParser`, and catalyst's `SqlParser` is a more general parser that should not cover keywords related to the underlying compute engine, remove the cache keyword from `SqlParser`.

Author: wangfei <wangfei1@huawei.com>

Closes #4393 from scwf/remove-cache-keyword and squashes the following commits:

10ade16 [wangfei] remove cache keyword in sql parser
2015-02-06 12:42:23 -08:00
OopsOutOfMemory b62c35245a [SQL][HiveConsole][DOC] HiveConsole correct hiveconsole imports
Sorry, PR #4330 had some mistakes.

This corrects them, so it works correctly now.

Author: OopsOutOfMemory <victorshengli@126.com>

Closes #4389 from OopsOutOfMemory/doc and squashes the following commits:

843eed9 [OopsOutOfMemory] correct hiveconsole imports
2015-02-06 12:41:28 -08:00
Yin Huai 3eccf29ce0 [SPARK-5595][SPARK-5603][SQL] Add a rule to do PreInsert type casting and field renaming and invalidating in memory cache after INSERT
This PR adds a rule to Analyzer that will add preinsert data type casting and field renaming to the select clause in an `INSERT INTO/OVERWRITE` statement. Also, with the change of this PR, we always invalidate our in memory data cache after inserting into a BaseRelation.

cc marmbrus liancheng

Author: Yin Huai <yhuai@databricks.com>

Closes #4373 from yhuai/insertFollowUp and squashes the following commits:

08237a7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertFollowUp
316542e [Yin Huai] Doc update.
c9ccfeb [Yin Huai] Revert a unnecessary change.
84aecc4 [Yin Huai] Address comments.
1951fe1 [Yin Huai] Merge remote-tracking branch 'upstream/master'
c18da34 [Yin Huai] Invalidate cache after insert.
727f21a [Yin Huai] Preinsert casting and renaming.
2015-02-06 12:38:07 -08:00
OopsOutOfMemory 0b7eb3f3b7 [SPARK-5324][SQL] Results of describe can't be queried
Make the code below work.
```
sql("DESCRIBE test").registerTempTable("describeTest")
sql("SELECT * FROM describeTest").collect()
```

Author: OopsOutOfMemory <victorshengli@126.com>
Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com>

Closes #4249 from OopsOutOfMemory/desc_query and squashes the following commits:

6fee13d [OopsOutOfMemory] up-to-date
e71430a [Sheng, Li] Update HiveOperatorQueryableSuite.scala
3ba1058 [OopsOutOfMemory] change to default argument
aac7226 [OopsOutOfMemory] Merge branch 'master' into desc_query
68eb6dd [OopsOutOfMemory] Merge branch 'desc_query' of github.com:OopsOutOfMemory/spark into desc_query
354ad71 [OopsOutOfMemory] query describe command
d541a35 [OopsOutOfMemory] refine test suite
e1da481 [OopsOutOfMemory] refine test suite
a780539 [OopsOutOfMemory] Merge branch 'desc_query' of github.com:OopsOutOfMemory/spark into desc_query
0015f82 [OopsOutOfMemory] code style
dd0aaef [OopsOutOfMemory] code style
c7d606d [OopsOutOfMemory] rename test suite
75f2342 [OopsOutOfMemory] refine code and test suite
f942c9b [OopsOutOfMemory] initial
11559ae [OopsOutOfMemory] code style
c5fdecf [OopsOutOfMemory] code style
aeaea5f [OopsOutOfMemory] rename test suite
ac2c3bb [OopsOutOfMemory] refine code and test suite
544573e [OopsOutOfMemory] initial
2015-02-06 12:33:20 -08:00
q00251598 a958d60975 [SPARK-5619][SQL] Support 'show roles' in HiveContext
Author: q00251598 <qiyadong@huawei.com>

Closes #4397 from watermen/SPARK-5619 and squashes the following commits:

f819b6c [q00251598] Support show roles in HiveContext.
2015-02-06 12:29:26 -08:00
Tobias Schlatter 500dc2b4b3 [SPARK-5640] Synchronize ScalaReflection where necessary
Author: Tobias Schlatter <tobias@meisch.ch>

Closes #4431 from gzm0/sync-scala-refl and squashes the following commits:

c5da21e [Tobias Schlatter] [SPARK-5640] Synchronize ScalaReflection where necessary
2015-02-06 12:15:02 -08:00
Liang-Chi Hsieh d433816157 [SPARK-5650][SQL] Support optional 'FROM' clause
In Hive, the 'FROM' clause is optional. This PR supports it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4426 from viirya/optional_from and squashes the following commits:

fe81f31 [Liang-Chi Hsieh] Support optional 'FROM' clause.
2015-02-06 12:13:44 -08:00
Nicholas Chammas 70e5b030a7 [SPARK-5628] Add version option to spark-ec2
Every proper command line tool should include a `--version` option or something similar.

This PR adds this to `spark-ec2` using the standard functionality provided by `optparse`.

One thing we don't do here is follow the Python convention of setting `__version__`, since it seems awkward given how `spark-ec2` is laid out.
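
As a minimal sketch of the `optparse` functionality mentioned above (not the PR's actual code): passing `version=` to `OptionParser` makes it register a `--version` option automatically. The version string and the `build_parser` helper below are hypothetical.

```python
from optparse import OptionParser

# Hypothetical version string, for illustration only.
SPARK_EC2_VERSION = "1.2.1"


def build_parser():
    # Supplying `version` makes optparse add a `--version` option itself;
    # "%prog" is expanded to the program name when the version is printed.
    return OptionParser(
        prog="spark-ec2",
        version="%prog {v}".format(v=SPARK_EC2_VERSION),
    )


parser = build_parser()
version_text = parser.get_version()
```

Calling `parser.parse_args(["--version"])` would print the version text and exit, which is the standard `optparse` behavior.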

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #4414 from nchammas/spark-ec2-show-version and squashes the following commits:

914cab5 [Nicholas Chammas] add version info
2015-02-06 12:08:22 -08:00
WangTaoTheTonic d34f79c8db [SPARK-2945][YARN][Doc]add doc for spark.executor.instances
https://issues.apache.org/jira/browse/SPARK-2945

spark.executor.instances works. As this JIRA recommended, we should add docs for this common config.

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #4350 from WangTaoTheTonic/SPARK-2945 and squashes the following commits:

4c3913a [WangTaoTheTonic] not compatible with dynamic allocation
5fa9c46 [WangTaoTheTonic] add doc for spark.executor.instances
2015-02-06 11:58:22 -08:00
zsxwing af2a2a263a [SPARK-4361][Doc] Add more docs for Hadoop Configuration
I'm trying to point out reusing a Configuration in these APIs is dangerous. Any better idea?

Author: zsxwing <zsxwing@gmail.com>

Closes #3225 from zsxwing/SPARK-4361 and squashes the following commits:

fe4e3d5 [zsxwing] Add more docs for Hadoop Configuration
2015-02-06 11:51:09 -08:00
Josh Rosen fb6c0cbac4 [HOTFIX] Fix test build break in ExecutorAllocationManagerSuite.
This was caused because #3486 added a new field to ExecutorInfo and #4369
added new tests that created ExecutorInfos.  These patches were merged in
quick succession and were never tested together, hence the compilation error.
2015-02-06 11:48:52 -08:00
Liang-Chi Hsieh 80f3bcb58f [SPARK-5652][Mllib] Use broadcasted weights in LogisticRegressionModel
`LogisticRegressionModel`'s `predictPoint` should directly use broadcasted weights. This PR also fixes the compilation errors in two unit test suites: `JavaLogisticRegressionSuite` and `JavaLinearRegressionSuite`.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4429 from viirya/use_bcvalue and squashes the following commits:

5a797e5 [Liang-Chi Hsieh] Use broadcasted weights. Fix compilation error.
2015-02-06 11:22:11 -08:00