Author: nchammas <nicholas.chammas@gmail.com>
Closes #923 from nchammas/patch-1 and squashes the following commits:
65c4d18 [nchammas] updated link to mailing list
https://issues.apache.org/jira/browse/SPARK-1901
Author: Zhen Peng <zhenpeng01@baidu.com>
Closes #854 from zhpengg/bugfix-worker-kills-executor and squashes the following commits:
21d380b [Zhen Peng] add some error messages
506cea6 [Zhen Peng] add some docs for killProcess()
a0b9860 [Zhen Peng] [SPARK-1901] worker should make sure executor has exited before updating executor's info
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #910 from ScrapCodes/enable-mima/spark-core and squashes the following commits:
79f3687 [Prashant Sharma] updated Mima to check against version 1.0
1e8969c [Prashant Sharma] Spark core missed out on Mima settings. So in effect we never tested spark core for mima related errors.
This is a fairly large PR to clean up and update the docs for 1.0. The major changes are:
* A unified programming guide for all languages replaces language-specific ones and shows language-specific info in tabs
* New programming guide sections on key-value pairs, unit testing, input formats beyond text, migrating from 0.9, and passing functions to Spark
* Spark-submit guide moved to a separate page and expanded slightly
* Various cleanups of the menu system, security docs, and others
* Updated look of title bar to differentiate the docs from previous Spark versions
You can find the updated docs at http://people.apache.org/~matei/1.0-docs/_site/ and in particular http://people.apache.org/~matei/1.0-docs/_site/programming-guide.html.
Author: Matei Zaharia <matei@databricks.com>
Closes #896 from mateiz/1.0-docs and squashes the following commits:
03e6853 [Matei Zaharia] Some tweaks to configuration and YARN docs
0779508 [Matei Zaharia] tweak
ef671d4 [Matei Zaharia] Keep frames in JavaDoc links, and other small tweaks
1bf4112 [Matei Zaharia] Review comments
4414f88 [Matei Zaharia] tweaks
d04e979 [Matei Zaharia] Fix some old links to Java guide
a34ed33 [Matei Zaharia] tweak
541bb3b [Matei Zaharia] miscellaneous changes
fcefdec [Matei Zaharia] Moved submitting apps to separate doc
61d72b4 [Matei Zaharia] stuff
181f217 [Matei Zaharia] migration guide, remove old language guides
e11a0da [Matei Zaharia] Add more API functions
6a030a9 [Matei Zaharia] tweaks
8db0ae3 [Matei Zaharia] Added key-value pairs section
318d2c9 [Matei Zaharia] tweaks
1c81477 [Matei Zaharia] New section on basics and function syntax
e38f559 [Matei Zaharia] Actually added programming guide to Git
a33d6fe [Matei Zaharia] First pass at updating programming guide to support all languages, plus other tweaks throughout
3b6a876 [Matei Zaharia] More CSS tweaks
01ec8bf [Matei Zaharia] More CSS tweaks
e6d252e [Matei Zaharia] Change color of doc title bar to differentiate from 0.9.0
We add all the classes annotated as `DeveloperApi` to `~/.mima-excludes`.
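A minimal sketch of the idea, assuming a runtime-retained annotation and a classpath scanner that yields the loaded classes (the helper names here are hypothetical, not the actual `GenerateMimaIgnore` code):
```scala
import java.io.{File, PrintWriter}

object DeveloperApiExcludes {
  // Keep every class that carries a @DeveloperApi-style annotation
  // (assumed to be retained at runtime) and record its fully qualified name.
  def ignoredClassNames(allClasses: Seq[Class[_]]): Seq[String] =
    allClasses
      .filter(_.getAnnotations.exists(_.annotationType.getSimpleName == "DeveloperApi"))
      .map(_.getName)

  // Write one class name per line so the Mima build can read them as excludes.
  def writeExcludes(names: Seq[String], file: File): Unit = {
    val out = new PrintWriter(file)
    try names.foreach(out.println) finally out.close()
  }
}
```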
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: nikhil7sh <nikhilsharmalnmiit@gmail.com>
Closes #904 from ScrapCodes/SPARK-1820/ignore-Developer-Api and squashes the following commits:
de944f9 [Prashant Sharma] Code review.
e3c5215 [Prashant Sharma] Incorporated patrick's suggestions and fixed the scalastyle build.
9983a42 [nikhil7sh] [SPARK-1820] Make GenerateMimaIgnore @DeveloperApi annotation aware
A straightforward implementation of the LPA algorithm for detecting graph communities using the Pregel framework. Amongst the growing literature on community detection algorithms in networks, LPA is perhaps the most elementary, and despite its flaws it remains a nice and simple approach.
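For reference, a minimal sketch of label propagation on top of the GraphX Pregel API (simplified; the message type and tie-breaking via `maxBy` are illustrative choices):
```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx._

object LabelPropagationSketch {
  // Each vertex starts in its own community (labeled by its own id) and
  // repeatedly adopts the most frequent label among its neighbors.
  def run[VD, ED: ClassTag](graph: Graph[VD, ED], maxSteps: Int): Graph[VertexId, ED] = {
    val lpaGraph = graph.mapVertices { case (vid, _) => vid }

    // Send each endpoint's current label to the other endpoint, with count 1.
    def sendMessage(e: EdgeTriplet[VertexId, ED]) =
      Iterator((e.srcId, Map(e.dstAttr -> 1L)), (e.dstId, Map(e.srcAttr -> 1L)))

    // Merge label histograms coming from different neighbors.
    def mergeMessage(a: Map[VertexId, Long], b: Map[VertexId, Long]) =
      (a.keySet ++ b.keySet).map { k =>
        k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))
      }.toMap

    // Adopt the most frequent neighboring label.
    def vertexProgram(vid: VertexId, attr: VertexId, msg: Map[VertexId, Long]) =
      if (msg.isEmpty) attr else msg.maxBy(_._2)._1

    Pregel(lpaGraph, Map.empty[VertexId, Long], maxIterations = maxSteps)(
      vertexProgram, sendMessage, mergeMessage)
  }
}
```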
Author: Ankur Dave <ankurdave@gmail.com>
Author: haroldsultan <haroldsultan@gmail.com>
Author: Harold Sultan <haroldsultan@gmail.com>
Closes #905 from haroldsultan/master and squashes the following commits:
327aee0 [haroldsultan] Merge pull request #2 from ankurdave/label-propagation
227a4d0 [Ankur Dave] Untabify
0ac574c [haroldsultan] Merge pull request #1 from ankurdave/label-propagation
0e24303 [Ankur Dave] Add LabelPropagationSuite
84aa061 [Ankur Dave] LabelPropagation: Fix compile errors and style; rename from LPA
9830342 [Harold Sultan] initial version of LPA
JIRA issue: [SPARK-1368](https://issues.apache.org/jira/browse/SPARK-1368)
This PR introduces two major updates:
- Replaced FP-style code with a `while` loop and a reusable `GenericMutableRow` object in the critical path of `HiveTableScan` (see the sketch after this list).
- Used `ColumnProjectionUtils` to help optimize RCFile and ORC column pruning.
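To illustrate the first point, here is a self-contained sketch of the pattern (with a simplified stand-in for Catalyst's mutable row type, not the actual `HiveTableScan` code): one row object is allocated up front and refilled with a `while` loop instead of allocating a fresh row per record:
```scala
// Simplified stand-in for Catalyst's GenericMutableRow.
final class MutableRow(size: Int) {
  private val values = new Array[Any](size)
  def update(i: Int, v: Any): Unit = values(i) = v
  def apply(i: Int): Any = values(i)
}

def scan(rawRows: Iterator[Array[Any]], numColumns: Int): Iterator[MutableRow] = {
  val mutableRow = new MutableRow(numColumns)  // allocated once, reused per record
  rawRows.map { raw =>
    var i = 0
    while (i < numColumns) {  // while loop avoids per-field closure allocation
      mutableRow(i) = raw(i)
      i += 1
    }
    mutableRow  // the same object is returned for every record
  }
}
```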
My quick micro benchmark suggests these two optimizations made the optimized version 2x and 2.5x faster when scanning the CSV table and the RCFile table respectively:
```
Original:
[info] CSV: 27676 ms, RCFile: 26415 ms
[info] CSV: 27703 ms, RCFile: 26029 ms
[info] CSV: 27511 ms, RCFile: 25962 ms
Optimized:
[info] CSV: 13820 ms, RCFile: 10402 ms
[info] CSV: 14158 ms, RCFile: 10691 ms
[info] CSV: 13606 ms, RCFile: 10346 ms
```
The micro benchmark loads a 609MB CSV file (structurally similar to the `src` test table) into both a normal Hive table backed by `LazySimpleSerDe` and an RCFile table, then scans each of them.
Preparation code:
```scala
package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanPrepare extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  hql("drop table scan_csv")
  hql("drop table scan_rcfile")

  hql("""create table scan_csv (key int, value string)
        | row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
        | with serdeproperties ('field.delim'=',')
      """.stripMargin)

  hql(s"""load data local inpath "${args(0)}" into table scan_csv""")

  hql("""create table scan_rcfile (key int, value string)
        | row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
        |stored as
        | inputformat 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
        | outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
      """.stripMargin)

  hql(
    """
      |from scan_csv
      |insert overwrite table scan_rcfile
      |select scan_csv.key, scan_csv.value
    """.stripMargin)
}
```
Benchmark code:
```scala
package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanBenchmark extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  val scanCsv = hql("select key from scan_csv")
  val scanRcfile = hql("select key from scan_rcfile")

  val csvDuration = benchmark(scanCsv.count())
  val rcfileDuration = benchmark(scanRcfile.count())

  println(s"CSV: $csvDuration ms, RCFile: $rcfileDuration ms")

  def benchmark(f: => Unit) = {
    val begin = System.currentTimeMillis()
    f
    val end = System.currentTimeMillis()
    end - begin
  }
}
```
@marmbrus Please help review, thanks!
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #758 from liancheng/fastHiveTableScan and squashes the following commits:
4241a19 [Cheng Lian] Distinguishes sorted and possibly not sorted operations more accurately in HiveComparisonTest
cf640d8 [Cheng Lian] More HiveTableScan optimisations:
bf0e7dc [Cheng Lian] Added SortedOperation pattern to match *some* definitely sorted operations and avoid some sorting cost in HiveComparisonTest.
6d1c642 [Cheng Lian] Using ColumnProjectionUtils to optimise RCFile and ORC column pruning
eb62fd3 [Cheng Lian] [SPARK-1368] Optimized HiveTableScan
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #889 from yhuai/SPARK-1935 and squashes the following commits:
7d50ef1 [Yin Huai] Explicitly add commons-codec 1.5 as a dependency.
Added a doctest for the `textFile` method and descriptions for the `_initialize_context` and `_ensure_initialized` methods in context.py.
Author: Jyotiska NK <jyotiska123@gmail.com>
Closes #187 from jyotiska/pyspark_context and squashes the following commits:
356f945 [Jyotiska NK] Added doctest for textFile method in context.py
5b23686 [Jyotiska NK] Updated context.py with method descriptions
The changes could be ported back to 0.9 as well.
Changed `in.read` to `in.readFully` to read the whole input stream rather than just the first 1020 bytes.
This should be OK considering that Flume caps the body size to 32K by default.
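The difference, sketched minimally (hypothetical stream handling, not the actual `SparkFlumeEvent` code): `InputStream.read` may return after filling only part of the buffer, while `DataInputStream.readFully` blocks until the whole array is filled:
```scala
import java.io.DataInputStream

// in.read(body) may stop after a single network read (e.g. the first 1020
// bytes), silently leaving the rest of the array unfilled; readFully keeps
// reading until body.length bytes have arrived, or throws EOFException.
def readBody(in: DataInputStream, bodyLength: Int): Array[Byte] = {
  val body = new Array[Byte](bodyLength)
  in.readFully(body)
  body
}
```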
Author: David Lemieux <david.lemieux@radialpoint.com>
Closes #865 from lemieud/SPARK-1916 and squashes the following commits:
a265673 [David Lemieux] Updated SparkFlumeEvent to read the whole stream rather than the first X bytes.
(cherry picked from commit 0b769b73fb)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
This PR improves and organizes the config option page
and makes a few other changes to config docs. See a preview here:
http://people.apache.org/~pwendell/config-improvements/configuration.html
The biggest changes are:
1. The configs for the standalone master/workers were moved to the
standalone page and out of the general config doc.
2. SPARK_LOCAL_DIRS was missing from the standalone docs.
3. Expanded discussion of injecting configs with spark-submit, including an
example.
4. Config options were organized into the following categories:
- Runtime Environment
- Shuffle Behavior
- Spark UI
- Compression and Serialization
- Execution Behavior
- Networking
- Scheduling
- Security
- Spark Streaming
Author: Patrick Wendell <pwendell@gmail.com>
Closes #880 from pwendell/config-cleanup and squashes the following commits:
93f56c3 [Patrick Wendell] Feedback from Matei
6f66efc [Patrick Wendell] More feedback
16ae776 [Patrick Wendell] Adding back header section
d9c264f [Patrick Wendell] Small fix
e0c1728 [Patrick Wendell] Response to Matei's review
27d57db [Patrick Wendell] Reverting changes to index.html (covered in #896)
e230ef9 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
a374369 [Patrick Wendell] Line wrapping fixes
fdff7fc [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
3289ea4 [Patrick Wendell] Pulling in changes from #856
106ee31 [Patrick Wendell] Small link fix
f7e79bc [Patrick Wendell] Re-organizing config options.
54b184d [Patrick Wendell] Adding standalone configs to the standalone page
592e94a [Patrick Wendell] Stash
29b5446 [Patrick Wendell] Better discussion of spark-submit in configuration docs
2d719ef [Patrick Wendell] Small fix
4af9e07 [Patrick Wendell] Adding SPARK_LOCAL_DIRS docs
204b248 [Patrick Wendell] Small fixes
`ApproxCountDistinctMergeFunction` should return an `Int` value because the `dataType` of `ApproxCountDistinct` is `IntegerType`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #893 from ueshin/issues/SPARK-1938 and squashes the following commits:
3970e88 [Takuya UESHIN] Remove a superfluous line.
5ad7ec1 [Takuya UESHIN] Make dataType for each of CountDistinct, ApproxCountDistinctMerge and ApproxCountDistinct LongType.
cbe7c71 [Takuya UESHIN] Revert a change.
fc3ac0f [Takuya UESHIN] Fix evaluated value type of ApproxCountDistinctMergeFunction to Int.
Allow underscores in the column names of struct fields. See https://issues.apache.org/jira/browse/SPARK-1922.
Author: LY Lai <ly.lai@vpon.com>
Closes #873 from lyuanlai/master and squashes the following commits:
2253263 [LY Lai] Allow underscore in struct field column name
Average values differ depending on whether the calculation is done partially or not, because `AverageFunction` (used in the non-partial calculation) counts a row even if the evaluated value is null.
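A minimal illustration of the intended semantics in plain Scala (not the Catalyst code): null values must be excluded from both the sum and the count, otherwise the non-partial path divides by a larger count than the partial path:
```scala
// SQL AVG semantics over a column with nulls: average the non-null values only.
def average(values: Seq[Option[Double]]): Option[Double] = {
  val nonNull = values.flatten             // drop nulls before summing
  if (nonNull.isEmpty) None                // AVG over all-null input is null
  else Some(nonNull.sum / nonNull.size)    // divide by the non-null count
}

// average(Seq(Some(1.0), None, Some(3.0))) == Some(2.0), not Some(4.0 / 3)
```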
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #862 from ueshin/issues/SPARK-1915 and squashes the following commits:
b1ff3c0 [Takuya UESHIN] Modify AverageFunction not to count if the evaluated value is null.
Nullability of `Max`/`Min`/`First` should be `true` because they return `null` if there are no rows.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #881 from ueshin/issues/SPARK-1926 and squashes the following commits:
322610f [Takuya UESHIN] Fix nullability of Min/Max/First.
Bugfix: the worker's `DriverStateChanged` state should match `DriverState.FAILED`.
Author: lianhuiwang <lianhuiwang09@gmail.com>
Closes #864 from lianhuiwang/master and squashes the following commits:
480ce94 [lianhuiwang] address aarondav comments
f2b5970 [lianhuiwang] bugfix worker DriverStateChanged state should match DriverState.FAILED
`var cachedPeers: Seq[BlockManagerId] = null` is used in `def replicate(blockId: BlockId, data: ByteBuffer, level: StorageLevel)` without proper protection.
There are two places that call `replicate(blockId, bytesAfterPut, level)`:
* 17f3075bc4/core/src/main/scala/org/apache/spark/storage/BlockManager.scala (L644) runs in `connectionManager.futureExecContext`
* 17f3075bc4/core/src/main/scala/org/apache/spark/storage/BlockManager.scala (L752) `doPut` runs in `connectionManager.handleMessageExecutor`. `org.apache.spark.storage.BlockManagerWorker` calls `blockManager.putBytes` in `connectionManager.handleMessageExecutor`.
As they run in different `Executor`s, this is a race condition that may leave the memory pointed to by `cachedPeers` in an inconsistent state even when `cachedPeers != null`.
The race condition of `onReceiveCallback` is that it's set in `BlockManagerWorker` but read in a different thread in `ConnectionManager.handleMessageExecutor`.
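Both are instances of unsynchronized shared mutable state read from different threads. A sketch of the general fix pattern, assuming a lock is acceptable on this path (illustrative only, not the actual patch):
```scala
// A lazily-filled cache read by several threads. Without synchronization (or
// @volatile plus safe publication), a reader may observe a non-null reference
// to a value whose construction it cannot yet see completely.
class ThreadSafeCache[T <: AnyRef](fetch: () => T) {
  private var cached: T = _

  def get: T = synchronized {
    if (cached == null) {
      cached = fetch()
    }
    cached
  }
}
```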
Author: zsxwing <zsxwing@gmail.com>
Closes #887 from zsxwing/SPARK-1932 and squashes the following commits:
524f69c [zsxwing] SPARK-1932: Fix race conditions in onReceiveCallback and cachedPeers
https://issues.apache.org/jira/browse/SPARK-1933
Author: Reynold Xin <rxin@apache.org>
Closes #888 from rxin/addfile and squashes the following commits:
8c402a3 [Reynold Xin] Updated comment.
ff6c162 [Reynold Xin] SPARK-1933: Throw a more meaningful exception when a directory is passed to addJar/addFile.
Author: Reynold Xin <rxin@apache.org>
Closes #875 from rxin/pep8-dev-scripts and squashes the following commits:
04b084f [Reynold Xin] Made dev Python scripts PEP8 compliant.
DAGScheduler does not handle local task OOM properly, and will wait for the job result forever.
Author: Zhen Peng <zhenpeng01@baidu.com>
Closes #883 from zhpengg/bugfix-dag-scheduler-oom and squashes the following commits:
76f7eda [Zhen Peng] remove redundant memory allocations
aa63161 [Zhen Peng] SPARK-1929 DAGScheduler suspended by local task OOM
905173df57 introduced a bug in `partitionBy` where, after repartitioning the edges, it reuses the `VertexRDD` without updating the routing tables to reflect the new edge layout. Subsequent accesses of the triplets then contain nulls for many vertex properties.
This commit adds a test for this bug and fixes it by introducing `VertexRDD#withEdges` and calling it in `partitionBy`.
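To make the failure mode concrete, a hedged usage sketch (the partition strategy and property types here are illustrative):
```scala
import org.apache.spark.graphx._

// After repartitioning the edges, the routing tables that tell each edge
// partition where to find its vertices must be rebuilt (VertexRDD#withEdges
// in the fix); otherwise the joins behind `triplets` silently miss vertices.
def inspectAfterRepartition(graph: Graph[String, Int]): Unit = {
  val repartitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)
  // Before the fix, this could print null for many srcAttr/dstAttr values.
  repartitioned.triplets.take(5).foreach { t =>
    println(s"${t.srcAttr} -> ${t.dstAttr}")
  }
}
```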
Author: Ankur Dave <ankurdave@gmail.com>
Closes #885 from ankurdave/SPARK-1931 and squashes the following commits:
3930cdd [Ankur Dave] Note how to set up VertexRDD for efficient joins
9bdbaa4 [Ankur Dave] [SPARK-1931] Reconstruct routing tables in Graph.partitionBy
JIRA: https://issues.apache.org/jira/browse/SPARK-1925
Author: zsxwing <zsxwing@gmail.com>
Closes #879 from zsxwing/SPARK-1925 and squashes the following commits:
5cf5a6d [zsxwing] SPARK-1925: Replace '&' with '&&'
Author: witgo <witgo@qq.com>
Closes #884 from witgo/scalastyle and squashes the following commits:
4b08ae4 [witgo] Fix scalastyle warnings in yarn alpha
`CountFunction` should count up only if the child's evaluated value is not null.
Currently, because it traverses and evaluates all child expressions, it counts up whenever any of the children is not null, even if the child being counted is null.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #861 from ueshin/issues/SPARK-1914 and squashes the following commits:
3b37315 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-1914
2afa238 [Takuya UESHIN] Simplify CountFunction not to traverse to evaluate all child expressions.
Self-explanatory.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #878 from pwendell/java-constructor and squashes the following commits:
2cc1605 [Patrick Wendell] HOTFIX: Add no-arg SparkContext constructor in Java
```scala
rdd.aggregate(Sum('val))
```
is just shorthand for
```scala
rdd.groupBy()(Sum('val))
```
but seems more natural than doing a `groupBy` with no grouping expressions when you really just want an aggregation over all rows.
Did not add a JavaSchemaRDD or Python API, as these seem to be lacking several other methods like groupBy() already -- leaving that cleanup for future patches.
Author: Aaron Davidson <aaron@databricks.com>
Closes #874 from aarondav/schemardd and squashes the following commits:
e9e68ee [Aaron Davidson] Add comment
db6afe2 [Aaron Davidson] Introduce SchemaRDD#aggregate() for simple aggregations
https://issues.apache.org/jira/browse/SPARK-1903
Author: Andrew Ash <andrew@andrewash.com>
Closes #856 from ash211/SPARK-1903 and squashes the following commits:
6e7782a [Andrew Ash] Add the technology used on each port
1d9b5d3 [Andrew Ash] Document port for history server
56193ee [Andrew Ash] spark.ui.port becomes worker.ui.port and master.ui.port
a774c07 [Andrew Ash] Wording in network section
90e8237 [Andrew Ash] Use real :toc instead of the hand-written one
edaa337 [Andrew Ash] Master -> Standalone Cluster Master
57e8869 [Andrew Ash] Port -> Default Port
3d4d289 [Andrew Ash] Title to title case
c7d42d9 [Andrew Ash] [WIP] SPARK-1903 Add initial port listing for documentation
a416ae9 [Andrew Ash] Word wrap to 100 lines
Author: Reynold Xin <rxin@apache.org>
Closes #871 from rxin/mllib-pep8 and squashes the following commits:
848416f [Reynold Xin] Fixed a typo in the previous cleanup (c -> sc).
a8db4cd [Reynold Xin] Fix PEP8 violations in Python mllib.
Mostly related to the following two rules in PEP8 and PEP257:
- Line length < 72 chars.
- First line should be a concise description of the function/class.
Author: Reynold Xin <rxin@apache.org>
Closes #869 from rxin/docstring-schemardd and squashes the following commits:
7cf0cbc [Reynold Xin] Updated sql.py for pep8 docstring.
0a4aef9 [Reynold Xin] Merge branch 'master' into docstring-schemardd
6678937 [Reynold Xin] Python docstring update for sql.py.
Author: Reynold Xin <rxin@apache.org>
Closes #870 from rxin/examples-python-pep8 and squashes the following commits:
2829e84 [Reynold Xin] Fix PEP8 violations in examples/src/main/python.
Minor cleanup following #841.
Author: Reynold Xin <rxin@apache.org>
Closes #868 from rxin/schema-count and squashes the following commits:
5442651 [Reynold Xin] SPARK-1822: Some minor cleanup work on SchemaRDD.count()
This sets the max line length to 100 as a PEP8 exception.
Author: Reynold Xin <rxin@apache.org>
Closes #872 from rxin/pep8 and squashes the following commits:
2f26029 [Reynold Xin] Added PEP8 style configuration file.
Author: Kan Zhang <kzhang@apache.org>
Closes #841 from kanzhang/SPARK-1822 and squashes the following commits:
2f8072a [Kan Zhang] [SPARK-1822] Minor style update
cf4baa4 [Kan Zhang] [SPARK-1822] Adding Scaladoc
e67c910 [Kan Zhang] [SPARK-1822] SchemaRDD.count() should use optimizer
Add an 'exec' at the end of the spark-submit script, to avoid keeping a
bash process hanging around while it runs. This makes ps look a little
bit nicer.
Author: Colin Patrick Mccabe <cmccabe@cloudera.com>
Closes #858 from cmccabe/SPARK-1907 and squashes the following commits:
7023b64 [Colin Patrick Mccabe] spark-submit: add exec at the end of the script
JIRA issue: [SPARK-1913](https://issues.apache.org/jira/browse/SPARK-1913)
When scanning Parquet tables, attributes referenced only in predicates that are pushed down are not passed to the `ParquetTableScan` operator, which causes an exception.
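For example (hypothetical table and column names), a query in which a column appears only in the pushed-down predicate and not in the projection:
```scala
// `key` occurs only in the WHERE clause. After predicate push-down it is absent
// from the plan's output attributes, yet ParquetTableScan still needs it to
// evaluate the filter; dropping it triggered the exception this PR fixes.
val result = hql("SELECT value FROM parquet_table WHERE key = 1")
```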
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #863 from liancheng/spark-1913 and squashes the following commits:
f976b73 [Cheng Lian] Addressed the readability issue commented on by @rxin
f5b257d [Cheng Lian] Added back comments deleted by mistake
ae60ab3 [Cheng Lian] [SPARK-1913] Attributes referenced only in predicates pushed down should remain in ParquetTableScan operator
Author: Zhen Peng <zhenpeng01@baidu.com>
Closes #827 from zhpengg/bugfix-executor-id-not-found and squashes the following commits:
cd8bb65 [Zhen Peng] bugfix: check executor id existence when executor exit
This commit requires the user to manually say "yes" when building Spark
without Java 6. The prompt can be bypassed with a flag (e.g. if the user
is scripting around make-distribution).
Author: Patrick Wendell <pwendell@gmail.com>
Closes #859 from pwendell/java6 and squashes the following commits:
4921133 [Patrick Wendell] Adding Pyspark Notice
fee8c9e [Patrick Wendell] SPARK-1911: Emphasize that Spark jars should be built with Java 6.
If I run the following on a YARN cluster
```
bin/spark-submit sheep.py --master yarn-client
```
it fails because of a mismatch in paths: `spark-submit` thinks that `sheep.py` resides on HDFS, and balks when it can't find the file there. A natural workaround is to add the `file:` prefix to the file:
```
bin/spark-submit file:/path/to/sheep.py --master yarn-client
```
However, this also fails. This time it is because Python does not understand URI schemes.
This PR fixes this by automatically resolving all paths passed as command line argument to `spark-submit` properly. This has the added benefit of keeping file and jar paths consistent across different cluster modes. For python, we strip the URI scheme before we actually try to run it.
Much of the code was originally written by @mengxr. Tested on a YARN cluster. More tests pending.
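The gist of the resolution step, sketched under simplifying assumptions (the real logic, moved into `Utils.resolveURIs` by this PR, also handles Windows paths and lists of paths):
```scala
import java.io.File
import java.net.URI

// Give each path an explicit scheme, so a local "sheep.py" becomes
// "file:/abs/path/sheep.py" instead of being looked up on HDFS.
def resolveURI(path: String): URI = {
  val uri = new URI(path)
  if (uri.getScheme != null) uri              // already "file:", "hdfs:", ...
  else new File(path).getAbsoluteFile.toURI   // assume a local file
}

// For Python, the scheme is stripped again before the interpreter sees the
// path, since Python itself does not understand URI schemes.
def localPathForPython(uri: URI): String =
  if (uri.getScheme == "file") uri.getPath else uri.toString
```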
Author: Andrew Or <andrewor14@gmail.com>
Closes #853 from andrewor14/submit-paths and squashes the following commits:
0bb097a [Andrew Or] Format path correctly before adding it to PYTHONPATH
323b45c [Andrew Or] Include --py-files on PYTHONPATH for pyspark shell
3c36587 [Andrew Or] Improve error messages (minor)
854aa6a [Andrew Or] Guard against NPE if user gives pathological paths
6638a6b [Andrew Or] Fix spark-shell jar paths after #849 went in
3bb0359 [Andrew Or] Update more comments (minor)
2a1f8a0 [Andrew Or] Update comments (minor)
6af2c77 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
a68c4d1 [Andrew Or] Handle Windows python file path correctly
427a250 [Andrew Or] Resolve paths properly for Windows
a591a4a [Andrew Or] Update tests for resolving URIs
6c8621c [Andrew Or] Move resolveURIs to Utils
db8255e [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
f542dce [Andrew Or] Fix outdated tests
691c4ce [Andrew Or] Ignore special primary resource names
5342ac7 [Andrew Or] Add missing space in error message
02f77f3 [Andrew Or] Resolve command line arguments to spark-submit properly
The same reason as https://github.com/apache/spark/pull/588.
Author: baishuo(白硕) <vc_java@hotmail.com>
Closes #815 from baishuo/master and squashes the following commits:
6876c1e [baishuo(白硕)] Update LBFGSSuite.scala
- Added script to automatically generate change list CHANGES.txt
- Added test for verifying linking against maven distributions of `spark-sql` and `spark-hive`
- Added SBT projects for testing functionality of `spark-sql` and `spark-hive`
- Fixed issues in existing tests that might have come up because of changes in Spark 1.0
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #844 from tdas/update-dev-scripts and squashes the following commits:
25090ba [Tathagata Das] Added missing license
e2e20b3 [Tathagata Das] Updated tests for auditing releases.
The hierarchy for configuring the Spark master in the shell is as follows:
```
MASTER > --master > spark.master (spark-defaults.conf)
```
This is inconsistent with the way we run normal applications, which is:
```
--master > spark.master (spark-defaults.conf) > MASTER
```
I was trying to run a shell locally on a standalone cluster launched through the ec2 scripts, which automatically set `MASTER` in spark-env.sh. It was surprising to me that `--master` didn't take effect, considering that this is the way we tell users to set their masters [here](http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/scala-programming-guide.html#initializing-spark).
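A minimal sketch of the intended precedence (hypothetical helper; the option, property, and environment-variable names are those above):
```scala
// Resolve the master the same way normal applications do:
// --master flag, then spark.master from spark-defaults.conf, then MASTER.
def resolveMaster(
    masterFlag: Option[String],
    sparkMasterProp: Option[String],
    masterEnv: Option[String]): Option[String] =
  masterFlag.orElse(sparkMasterProp).orElse(masterEnv)
```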
Author: Andrew Or <andrewor14@gmail.com>
Closes #846 from andrewor14/shell-master and squashes the following commits:
2cb81c9 [Andrew Or] Respect spark.master before MASTER in REPL
Spark shell currently overwrites `spark.jars` with `ADD_JARS`. In all modes except yarn-cluster, this means the `--jars` flag passed to `bin/spark-shell` is also discarded. However, in the [docs](http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/scala-programming-guide.html#initializing-spark), we explicitly tell the users to add the jars this way.
Author: Andrew Or <andrewor14@gmail.com>
Closes #849 from andrewor14/shell-jars and squashes the following commits:
928a7e6 [Andrew Or] ',' -> "," (minor)
afc357c [Andrew Or] Handle spark.jars == "" in SparkILoop, not SparkSubmit
c6da113 [Andrew Or] Do not set spark.jars to ""
d8549f7 [Andrew Or] Respect spark.jars and --jars in spark-shell
Due perhaps to zombie processes on Jenkins, it seems that at least 10
Spark ports are in use. It also doesn't matter whether the chosen port
increases on retry; it could in fact go down. The only thing that matters
is that a different port is selected rather than failing to bind.
Changed the test to match this.
Thanks to @andrewor14 for helping diagnose this.
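The behavior under test, sketched (illustrative; the real test drives the web UI's port selection): bind to some free port, and on contention move to a different port, in either direction, rather than failing:
```scala
import java.net.ServerSocket
import scala.util.{Success, Try}

// Try candidate ports in turn; all that matters is binding to *some* port
// different from the contended one, not that port numbers only increase.
def bindToAnyPort(candidates: Seq[Int]): Option[ServerSocket] =
  candidates.iterator
    .map(p => Try(new ServerSocket(p)))
    .collectFirst { case Success(socket) => socket }
```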
Author: Aaron Davidson <aaron@databricks.com>
Closes #857 from aarondav/tiny and squashes the following commits:
c199ec8 [Aaron Davidson] Fix UISuite unit test that fails under Jenkins contention
Sent secondary jars to the distributed cache of all containers and added the cached jars to the classpath before executors start. Tested on a YARN cluster (CDH-5.0).
`spark-submit --jars` also works in standalone mode and `yarn-client`. Thanks to @andrewor14 for testing!
I removed "Doesn't work for drivers in standalone mode with "cluster" deploy mode." from `spark-submit`'s help message, though we haven't tested mesos yet.
CC: @dbtsai @sryza
Author: Xiangrui Meng <meng@databricks.com>
Closes #848 from mengxr/yarn-classpath and squashes the following commits:
23e7df4 [Xiangrui Meng] rename spark.jar to __spark__.jar and app.jar to __app__.jar to avoid conflicts; append $CWD/ and $CWD/* to the classpath; remove unused methods
a40f6ed [Xiangrui Meng] standalone -> cluster
65e04ad [Xiangrui Meng] update spark-submit help message and add a comment for yarn-client
11e5354 [Xiangrui Meng] minor changes
3e7e1c4 [Xiangrui Meng] use sparkConf instead of hadoop conf
dc3c825 [Xiangrui Meng] add secondary jars to classpath in yarn
1. Added `<code>` tags to configuration options.
2. Listed env variables in tabular format to be consistent with other pages.
3. Moved the Viewing Spark Properties section up.
This is against branch-1.0, but should be cherry-picked into master as well.
Author: Reynold Xin <rxin@apache.org>
Closes #851 from rxin/doc-config and squashes the following commits:
28ac0d3 [Reynold Xin] Add <code> to configuration options, and list env variables in a table.
(cherry picked from commit 75af8bd333)
Signed-off-by: Reynold Xin <rxin@apache.org>