Netty's DefaultFileRegion requires a FileDescriptor in its constructor, which means we need to have an open file handle. In very large workloads, this could lead to too many open files due to the way these file descriptors are cleaned up. This pull request creates a new LazyFileRegion that initializes the FileDescriptor only when we send data for the first time.
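For illustration, a minimal sketch of the lazy-open idea (this is not Spark's actual LazyFileRegion; the names and the simplified interface here are assumptions):
```scala
import java.io.File
import java.nio.channels.{FileChannel, WritableByteChannel}
import java.nio.file.StandardOpenOption

// Holds only the File until the first transfer, so queued transfers don't pin open file handles.
class LazyFileRegionSketch(val file: File, val position: Long, val count: Long) {
  private var channel: FileChannel = _

  private def openedChannel(): FileChannel = {
    if (channel == null) {
      channel = FileChannel.open(file.toPath, StandardOpenOption.READ)
    }
    channel
  }

  // Mirrors FileRegion.transferTo: write up to `count - transferred` remaining bytes to the target.
  def transferTo(target: WritableByteChannel, transferred: Long): Long =
    openedChannel().transferTo(position + transferred, count - transferred, target)

  // Only close if we ever opened the file.
  def deallocate(): Unit = if (channel != null) channel.close()
}
```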
Author: Reynold Xin <rxin@databricks.com>
Author: Reynold Xin <rxin@apache.org>
Closes#3172 from rxin/lazyFD and squashes the following commits:
0bdcdc6 [Reynold Xin] Added reference to Netty's DefaultFileRegion
d4564ae [Reynold Xin] Added SparkConf to the ctor argument of IndexShuffleBlockManager.
6ed369e [Reynold Xin] Code review feedback.
04cddc8 [Reynold Xin] [SPARK-4307] Initialize FileDescriptor lazily in FileRegion.
For me, the core tests failed because there are two locale-dependent parts in the code.
Look at the Jira ticket for details.
Why is it necessary to check the exception message in isBindCollision in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1686?
Author: Niklas Wilcke <1wilcke@informatik.uni-hamburg.de>
Closes#3036 from numbnut/core-test-fix and squashes the following commits:
1fb0d04 [Niklas Wilcke] Fixing locale dependend code and tests
Make Tachyon-related unit tests execute without deploying a Tachyon system locally.
Author: RongGu <gurongwalker@gmail.com>
Closes#3030 from RongGu/SPARK-2703 and squashes the following commits:
ad08827 [RongGu] Make Tachyon related unit tests execute without deploying a Tachyon system locally
Author: Sandy Ryza <sandy@cloudera.com>
This patch had conflicts when merged, resolved by
Committer: Kay Ousterhout <kayousterhout@gmail.com>
Closes#2968 from sryza/sandy-spark-3179 and squashes the following commits:
dce4784 [Sandy Ryza] More review feedback
8d350d1 [Sandy Ryza] Fix test against Hadoop 2.5+
e7c74d0 [Sandy Ryza] More review feedback
6cff9c4 [Sandy Ryza] Review feedback
fb2dde0 [Sandy Ryza] SPARK-3179
andrewor14: another try at SPARK-1209, to address https://github.com/apache/spark/pull/2814#issuecomment-61197619
I successfully tested with `mvn -Dhadoop.version=1.0.4 -DskipTests clean package; mvn -Dhadoop.version=1.0.4 test`. I assume that is what failed Jenkins last time. I also tried `-Dhadoop.version=1.2.1` and `-Phadoop-2.4 -Pyarn -Phive` for more coverage.
So this is why the class was put in `org.apache.hadoop` to begin with, I assume. One option is to leave this as-is for now and move it only when Hadoop 1.0.x support goes away.
This is the other option, which adds a call to force the constructor to be public at run-time. It's probably less surprising than putting Spark code in `org.apache.hadoop`, but it does involve reflection. A `SecurityManager` might forbid this, but it would forbid a lot of stuff Spark does. It also seems this would only affect Hadoop 1.0.x.
Author: Sean Owen <sowen@cloudera.com>
Closes#3048 from srowen/SPARK-1209 and squashes the following commits:
0d48f4b [Sean Owen] For Hadoop 1.0.x, make certain constructors public, which were public in later versions
466e179 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though?
eb61820 [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark
Use "k" in javadoc of top and takeOrdered to avoid confusion with type K in pair RDDs. I think this resolves the discussion in SPARK-1344.
Author: Sean Owen <sowen@cloudera.com>
Closes#3168 from srowen/SPARK-1344 and squashes the following commits:
6963fcc [Sean Owen] Use "k" in javadoc of top and takeOrdered to avoid confusion with type K in pair RDDs
This PR resorts to `SparkContext.version` rather than META-INF/MANIFEST.MF in the assembly jar to inspect the Spark version. Currently, when built with Maven, the MANIFEST.MF file in the assembly jar is incorrectly replaced by Guava 15.0's MANIFEST.MF, probably because of the assembly/shading tricks.
Another related PR is #3103, which tries to fix the MANIFEST issue.
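For a hedged illustration of the difference, e.g. in a spark-shell session where `sc` is predefined:
```scala
// Reliable: the version string compiled into Spark itself.
println(sc.version)

// Fragile alternative: whatever MANIFEST.MF ended up in the assembly jar
// (may be Guava 15.0's manifest after shading, as described above).
println(classOf[org.apache.spark.SparkContext].getPackage.getImplementationVersion)
```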
Author: Cheng Lian <lian@databricks.com>
Closes#3105 from liancheng/spark-4225 and squashes the following commits:
d9585e1 [Cheng Lian] Resorts to SparkContext.version to inspect Spark version
This PR eliminates the network package's usage of the Java serializer and replaces it with Encodable, a lightweight binary protocol. Each message is preceded by a type id, which will allow us to change messages (by only adding new ones), or to change the format entirely by switching to a special id (such as -1).
This protocol has the advantage over Java that we can guarantee that messages will remain compatible across compiled versions and JVMs, though it does not provide a clean way to do schema migration. In the future, it may be good to use a more heavy-weight serialization format like protobuf, thrift, or avro, but these all add several dependencies which are unnecessary at the present time.
Additionally this unifies the RPC messages of NettyBlockTransferService and ExternalShuffleClient.
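A rough, hedged sketch of the type-id-prefixed scheme (the message and field layout below are made up for illustration; this is not the actual Encodable interface):
```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Illustrative message, not one of the real RPC messages in the network package.
case class ExampleMessage(appId: String, execId: String) {
  private def utf8(s: String) = s.getBytes(StandardCharsets.UTF_8)

  // Layout: [1 byte type id][4 byte len][appId bytes][4 byte len][execId bytes]
  def encode(): ByteBuffer = {
    val a = utf8(appId); val e = utf8(execId)
    val buf = ByteBuffer.allocate(1 + 4 + a.length + 4 + e.length)
    buf.put(ExampleMessage.TypeId).putInt(a.length).put(a).putInt(e.length).put(e)
    buf.flip()
    buf
  }
}

object ExampleMessage {
  val TypeId: Byte = 0  // new message types get new ids; a special id like -1 could signal a format change

  def decode(buf: ByteBuffer): ExampleMessage = {
    require(buf.get() == TypeId, "unexpected message type")
    def readString(): String = {
      val bytes = new Array[Byte](buf.getInt()); buf.get(bytes)
      new String(bytes, StandardCharsets.UTF_8)
    }
    ExampleMessage(readString(), readString())
  }
}
```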
Author: Aaron Davidson <aaron@databricks.com>
Closes#3146 from aarondav/free and squashes the following commits:
ed1102a [Aaron Davidson] Remove some unused imports
b8e2a49 [Aaron Davidson] Add appId to test
538f2a3 [Aaron Davidson] [SPARK-4187] [Core] Switch to binary protocol for external shuffle service messages
This PR fixes `Utils.exceptionString` to output the full exception information. However, the stack trace may become very large, so I also updated the Web UI to collapse the error information by default (only the first line is displayed; clicking `+detail` will display the full info).
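For reference, a hedged sketch of producing the full string, including the "Caused by" chain, with plain JDK calls (`Utils.exceptionString` does something along these lines):
```scala
import java.io.{PrintWriter, StringWriter}

// printStackTrace already walks the cause chain, so the resulting string
// contains the "Caused by: ..." sections like the example below.
def fullExceptionString(e: Throwable): String = {
  val sw = new StringWriter()
  e.printStackTrace(new PrintWriter(sw, true))
  sw.toString
}
```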
Here are the screenshots:
Stages:
![stages](https://cloud.githubusercontent.com/assets/1000778/4882441/66d8cc68-6356-11e4-8346-6318677d9470.png)
Details for one stage:
![stage](https://cloud.githubusercontent.com/assets/1000778/4882513/1311043c-6357-11e4-8804-ca14240a9145.png)
The full information in the gray text field is:
```Java
org.apache.spark.shuffle.FetchFailedException: Connection reset by peer
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:189)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
at sun.nio.ch.IOUtil.read(IOUtil.java:166)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
... 1 more
```
/cc aarondav
Author: zsxwing <zsxwing@gmail.com>
Closes#3073 from zsxwing/SPARK-4204 and squashes the following commits:
176d1e3 [zsxwing] Add comments to explain the stack trace difference
ca509d3 [zsxwing] Add fullStackTrace to the constructor of ExceptionFailure
a07057b [zsxwing] Core style fix
dfb0032 [zsxwing] Backward compatibility for old history server
1e50f71 [zsxwing] Update as per review and increase the max height of the stack trace details
94f2566 [zsxwing] Change Utils.exceptionString to contain the inner exceptions and make the error information in Web UI more friendly
This relies on a hook from whoever is hosting the shuffle service to invoke removeApplication() when the application is completed. Once invoked, we will clean up all the executors' shuffle directories we know about.
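A rough sketch of the bookkeeping involved (the directory tracking and names below are simplified assumptions, not the service's real data structures):
```scala
import java.io.File
import scala.collection.mutable

// Remember each application's executor-local directories so they can be removed on demand.
class ShuffleCleanupSketch {
  private val dirsByApp = mutable.Map.empty[String, mutable.Buffer[File]]

  def registerExecutorDirs(appId: String, dirs: Seq[File]): Unit =
    dirsByApp.getOrElseUpdate(appId, mutable.Buffer.empty) ++= dirs

  // The hook the host process is expected to call once the application finishes.
  def removeApplication(appId: String): Unit =
    dirsByApp.remove(appId).foreach(_.foreach(deleteRecursively))

  private def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    f.delete()
  }
}
```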
Author: Aaron Davidson <aaron@databricks.com>
Closes#3126 from aarondav/cleanup and squashes the following commits:
33a64a9 [Aaron Davidson] Missing brace
e6e428f [Aaron Davidson] Address comments
16a0d27 [Aaron Davidson] Cleanup
e4df3e7 [Aaron Davidson] [SPARK-4236] Cleanup removed applications' files in shuffle service
This adds a RetryingBlockFetcher to the NettyBlockTransferService which is wrapped around our typical OneForOneBlockFetcher, adding retry logic in the event of an IOException.
This sort of retry allows us to avoid marking an entire executor as failed due to garbage collection or high network load.
TODO:
- [x] unit tests
- [x] put in ExternalShuffleClient too
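A minimal sketch of the retry idea (generic and synchronous; the real fetcher lives in the Java network package and works with async listeners):
```scala
import java.io.IOException

// Retry only on IOException, which usually signals a transient problem
// (GC pause, network load) rather than genuinely missing data.
def fetchWithRetries[T](maxRetries: Int, retryWaitMs: Long)(fetch: () => T): T = {
  try {
    fetch()
  } catch {
    case _: IOException if maxRetries > 0 =>
      Thread.sleep(retryWaitMs)
      fetchWithRetries(maxRetries - 1, retryWaitMs)(fetch)
  }
}

// Hypothetical usage: fetchWithRetries(3, 5000) { () => fetchBlock("shuffle_0_1_2") }
```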
Author: Aaron Davidson <aaron@databricks.com>
Closes#3101 from aarondav/retry and squashes the following commits:
72a2a32 [Aaron Davidson] Add that we should remove the condition around the retry thingy
c7fd107 [Aaron Davidson] Fix unit tests
e80e4c2 [Aaron Davidson] Address initial comments
6f594cd [Aaron Davidson] Fix unit test
05ff43c [Aaron Davidson] Add to external shuffle client and add unit test
66e5a24 [Aaron Davidson] [SPARK-4238] [Core] Perform network-level retry of shuffle file fetches
Author: Aaron Davidson <aaron@databricks.com>
Closes#3142 from aarondav/worker and squashes the following commits:
3780bd7 [Aaron Davidson] Address comments
2dcdfc1 [Aaron Davidson] Add private[worker]
47f49d3 [Aaron Davidson] NettyBlockTransferService shouldn't care about app ids (it's only b/t executors)
258417c [Aaron Davidson] [SPARK-4277] Support external shuffle service on executor
Author: Aaron Davidson <aaron@databricks.com>
Closes#3128 from aarondav/compiter and squashes the following commits:
698e4be [Aaron Davidson] [SPARK-4264] Completion iterator should only invoke callback once
add binaryFiles() and binaryRecords() in Python
```
binaryFiles(self, path, minPartitions=None):
:: Developer API ::
Read a directory of binary files from HDFS, a local file system
(available on all nodes), or any Hadoop-supported file system URI
as a byte array. Each file is read as a single record and returned
in a key-value pair, where the key is the path of each file, the
value is the content of each file.
Note: Small files are preferred, large file is also allowable, but
may cause bad performance.
binaryRecords(self, path, recordLength):
Load data from a flat binary file, assuming each record is a set of numbers
with the specified numerical format (see ByteBuffer), and the number of
bytes per record is constant.
:param path: Directory to the input data files
:param recordLength: The length at which to split the records
```
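For reference, a hedged sketch of the corresponding Scala-side calls these Python methods expose (the HDFS paths are placeholders):
```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits for Spark 1.2-era APIs

object BinaryFilesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("binary-example").setMaster("local[2]"))

    // Each file becomes one (path, stream) record; small files preferred, as noted above.
    val files = sc.binaryFiles("hdfs:///data/blobs")
    files.mapValues(_.toArray().length).collect().foreach(println)

    // Fixed-length records: every 512 bytes of the file becomes one Array[Byte] element.
    val records = sc.binaryRecords("hdfs:///data/fixed.bin", 512)
    println(records.count())

    sc.stop()
  }
}
```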
Author: Davies Liu <davies@databricks.com>
Closes#3078 from davies/binary and squashes the following commits:
cd0bdbd [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
3aa349b [Davies Liu] add experimental notes
24e84b6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
5ceaa8a [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
1900085 [Davies Liu] bugfix
bb22442 [Davies Liu] add binaryFiles and binaryRecords in Python
This commit stripes table rows after hiding some rows, to ensure that rows are correctly striped to alternate white and grey even when rows are hidden by default.
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes#3117 from kayousterhout/striping and squashes the following commits:
be6e10a [Kay Ousterhout] [SPARK-4255] Fix incorrect table striping
Mesos offers may not contain all resources, and Spark needs to check to
ensure they are present and sufficient. Spark may throw an erroneous
exception when resources aren't present.
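The gist of the fix, as a hedged sketch with simplified types (not Mesos' protobuf API): treat a missing resource as zero instead of assuming every offer carries it.
```scala
// An offer is modeled here as a simple name -> amount map.
def hasSufficientResources(offer: Map[String, Double], needed: Map[String, Double]): Boolean =
  needed.forall { case (name, amount) => offer.getOrElse(name, 0.0) >= amount }

// e.g. hasSufficientResources(Map("cpus" -> 4.0), Map("cpus" -> 1.0, "mem" -> 512.0)) == false
```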
Author: Brenden Matthews <brenden@diddyinc.com>
Closes#3024 from brndnmtthws/fix-mesos-resource-misuse and squashes the following commits:
e5f9580 [Brenden Matthews] [SPARK-4158] Fix for missing resources.
This creates a new module `network/yarn` that depends on `network/shuffle` recently created in #3001. This PR introduces a custom Yarn auxiliary service that runs the external shuffle service. As of the changes here this shuffle service is required for using dynamic allocation with Spark.
This is still WIP mainly because it doesn't handle security yet. I have tested this on a stable Yarn cluster.
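As a usage sketch (the property names are the standard Spark ones; the yarn-site.xml wiring for the aux service is not shown), enabling dynamic allocation with the external shuffle service looks roughly like:
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("dynamic-allocation-example")
  .set("spark.dynamicAllocation.enabled", "true")
  // Required with dynamic allocation, since executors may be removed while their
  // shuffle output is still needed; the YARN aux service keeps serving those files.
  .set("spark.shuffle.service.enabled", "true")
```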
Author: Andrew Or <andrew@databricks.com>
Closes#3082 from andrewor14/yarn-shuffle-service and squashes the following commits:
ef3ddae [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
0ee67a2 [Andrew Or] Minor wording suggestions
1c66046 [Andrew Or] Remove unused provided dependencies
0eb6233 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
6489db5 [Andrew Or] Try catch at the right places
7b71d8f [Andrew Or] Add detailed java docs + reword a few comments
d1124e4 [Andrew Or] Add security to shuffle service (INCOMPLETE)
5f8a96f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
9b6e058 [Andrew Or] Address various feedback
f48b20c [Andrew Or] Fix tests again
f39daa6 [Andrew Or] Do not make network-yarn an assembly module
761f58a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
15a5b37 [Andrew Or] Fix build for Hadoop 1.x
baff916 [Andrew Or] Fix tests
5bf9b7e [Andrew Or] Address a few minor comments
5b419b8 [Andrew Or] Add missing license header
804e7ff [Andrew Or] Include the Yarn shuffle service jar in the distribution
cd076a4 [Andrew Or] Require external shuffle service for dynamic allocation
ea764e0 [Andrew Or] Connect to Yarn shuffle service only if it's enabled
1bf5109 [Andrew Or] Use the shuffle service port specified through hadoop config
b4b1f0c [Andrew Or] 4 tabs -> 2 tabs
43dcb96 [Andrew Or] First cut integration of shuffle service with Yarn aux service
b54a0c4 [Andrew Or] Initial skeleton for Yarn shuffle service
Replaces the existing read() call with readFully().
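The reason, as a small hedged sketch: read() may legally return after filling only part of the buffer, while readFully() keeps reading (or throws EOFException), so a fixed-length record is never silently truncated.
```scala
import java.io.DataInputStream

def readRecord(in: DataInputStream, recordLength: Int): Array[Byte] = {
  val buf = new Array[Byte](recordLength)
  // in.read(buf) could return having filled only part of buf;
  // readFully keeps reading until the buffer is full or throws EOFException.
  in.readFully(buf)
  buf
}
```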
Author: industrial-sloth <industrial-sloth@users.noreply.github.com>
Closes#3093 from industrial-sloth/branch-1.2-fixedLenRecRdr and squashes the following commits:
a245c8a [industrial-sloth] use readFully in FixedLengthBinaryRecordReader
(cherry picked from commit 6844e7a821)
Signed-off-by: Matei Zaharia <matei@databricks.com>
This commit fixes the scheduler delay in the UI (which previously
included things that are not scheduler delay, like time to
deserialize the task and serialize the result), and also
adds information about time to deserialize tasks to the optional
additional metrics. Time to deserialize the task can be large relative
to task time for short jobs, and understanding when it is high can help
developers realize that they should try to reduce closure size (e.g., by including
less data in the task description).
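Roughly, the corrected computation subtracts the pieces that are not scheduler delay (a hedged sketch; the argument names are illustrative, not Spark's exact metric fields):
```scala
def schedulerDelay(totalTaskTime: Long,
                   executorRunTime: Long,
                   taskDeserializationTime: Long,
                   resultSerializationTime: Long): Long = {
  // Whatever time is left over after the task actually ran and (de)serialization happened.
  math.max(0L, totalTaskTime - executorRunTime - taskDeserializationTime - resultSerializationTime)
}
```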
cc shivaram etrain
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes#2832 from kayousterhout/SPARK-3983 and squashes the following commits:
0c1398e [Kay Ousterhout] Fixed ordering
531575d [Kay Ousterhout] Removed executor launch time
1f13afe [Kay Ousterhout] Minor spacing fixes
335be4b [Kay Ousterhout] Made metrics hideable
5bc3cba [Kay Ousterhout] [SPARK-3984] [SPARK-3983] Improve UI task metrics.
This does three things: (1) adds SASL to ExternalShuffleClient, (2) puts SecurityManager in BlockManager's constructor, and (3) adds a unit test.
Author: Aaron Davidson <aaron@databricks.com>
Closes#3108 from aarondav/sasl-client and squashes the following commits:
48b622d [Aaron Davidson] Screw it, let's just get LimitedInputStream
3543b70 [Aaron Davidson] Back out of pom change due to unknown test issue?
b58518a [Aaron Davidson] ByteStreams.limit() not available :(
cbe451a [Aaron Davidson] Address comments
2bf2908 [Aaron Davidson] [SPARK-4242] [Core] Add SASL to external shuffle service
This PR simplifies the serializer: always use a batched serializer (AutoBatchedSerializer by default), even when the batch size is 1.
Author: Davies Liu <davies@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Josh Rosen <joshrosen@databricks.com>
Closes#2920 from davies/fix_autobatch and squashes the following commits:
e544ef9 [Davies Liu] revert unrelated change
6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
1d557fc [Davies Liu] fix tests
8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
76abdce [Davies Liu] clean up
53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
b4292ce [Davies Liu] fix bug in master
d79744c [Davies Liu] recover hive tests
be37ece [Davies Liu] refactor
eb3938d [Davies Liu] refactor serializer in scala
8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
Author: zsxwing <zsxwing@gmail.com>
Closes#3085 from zsxwing/SPARK-4166-back-comp and squashes the following commits:
89329f4 [zsxwing] Add a backward compatibility test for ExecutorLostFailure
/cc aarondav
Author: zsxwing <zsxwing@gmail.com>
Closes#3086 from zsxwing/SPARK-4163-back-comp and squashes the following commits:
21cb2a8 [zsxwing] Add a backward compatibility test for FetchFailed
This patch allows executor thread dumps to be collected on-demand and viewed in the Spark web UI.
The thread dumps are collected using Thread.getAllStackTraces(). To allow remote thread dumps to be triggered from the web UI, I added a new `ExecutorActor` that runs inside of the Executor actor system and responds to RPCs from the driver. The driver's mechanism for obtaining a reference to this actor is a little bit hacky: it uses the block manager master actor to determine the host/port of the executor actor systems in order to construct ActorRefs to ExecutorActor. Unfortunately, I couldn't find a much cleaner way to do this without a big refactoring of the executor -> driver communication.
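The collection step itself is simple; a hedged sketch using Thread.getAllStackTraces (the RPC plumbing through the executor actor described above is omitted):
```scala
import scala.collection.JavaConverters._

case class ThreadDump(id: Long, name: String, state: Thread.State, stackTrace: String)

def localThreadDump(): Seq[ThreadDump] =
  Thread.getAllStackTraces.asScala.toSeq.map { case (thread, frames) =>
    ThreadDump(thread.getId, thread.getName, thread.getState,
      frames.map(_.toString).mkString("\n"))
  }
```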
Screenshots:
![image](https://cloud.githubusercontent.com/assets/50748/4781793/7e7a0776-5cbf-11e4-874d-a91cd04620bd.png)
![image](https://cloud.githubusercontent.com/assets/50748/4781794/8bce76aa-5cbf-11e4-8d13-8477748c9f7e.png)
![image](https://cloud.githubusercontent.com/assets/50748/4781797/bd11a8b8-5cbf-11e4-9ad7-a7459467ec8e.png)
Author: Josh Rosen <joshrosen@databricks.com>
Closes#2944 from JoshRosen/jstack-in-web-ui and squashes the following commits:
3c21a5d [Josh Rosen] Address review comments:
880f7f7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
f719266 [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
19707b0 [Josh Rosen] Add one comment.
127a130 [Josh Rosen] Update to use SparkContext.DRIVER_IDENTIFIER
b8e69aa [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
3dfc2d4 [Josh Rosen] Add missing file.
bc1e675 [Josh Rosen] Undo some leftover changes from the earlier approach.
f4ac1c1 [Josh Rosen] Switch to on-demand collection of thread dumps
dfec08b [Josh Rosen] Add option to disable thread dumps in UI.
4c87d7f [Josh Rosen] Use separate RPC for sending thread dumps.
2b8bdf3 [Josh Rosen] Enable thread dumps from the driver when running in non-local mode.
cc3e6b3 [Josh Rosen] Fix test code in DAGSchedulerSuite.
87b8b65 [Josh Rosen] Add new listener event for thread dumps.
8c10216 [Josh Rosen] Add missing file.
0f198ac [Josh Rosen] [SPARK-611] Display executor thread dumps in web UI
The number of completed and failed stages shown on the web UI will always be less than 1000. This is really misleading when thousands of stages have already completed or failed. The number should be correct even when only some of the stages are listed on the web UI (stage info is removed if the number grows too large).
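The underlying idea, sketched with assumed names:
```scala
import scala.collection.mutable

// Keep a running total separately from the retained StageInfo list, so the UI can
// report the true count even after old entries are dropped to bound memory use.
class CompletedStageTracker(retainedStages: Int) {
  private val retained = mutable.Queue.empty[Int]   // stage ids we still show in detail
  private var numCompleted = 0

  def onStageCompleted(stageId: Int): Unit = {
    numCompleted += 1
    retained.enqueue(stageId)
    while (retained.size > retainedStages) retained.dequeue()
  }

  def totalCompleted: Int = numCompleted   // correct even when retained.size < numCompleted
}
```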
Author: Zhang, Liye <liye.zhang@intel.com>
Closes#3035 from liyezhang556520/webStageNum and squashes the following commits:
d9e29fb [Zhang, Liye] add detailed comments for variables
4ea8fd1 [Zhang, Liye] change variable name accroding to comments
f4c404d [Zhang, Liye] [SPARK-4168][WebUI] web statges number should show correctly when stages are more than 1000
A leak of event loops may be causing test failures.
Author: Aaron Davidson <aaron@databricks.com>
Closes#3053 from aarondav/leak and squashes the following commits:
e676d18 [Aaron Davidson] Typo!
8f96475 [Aaron Davidson] Keep original ssc semantics
7e49f10 [Aaron Davidson] A leak of event loops may be causing test failures.
This PR overrides the `GetInfo` Hive Thrift API to provide correct version information. Another property `spark.sql.hive.version` is added to reveal the underlying Hive version. These are generally useful for Spark SQL ODBC driver providers. The Spark version information is extracted from the jar manifest. Also took the chance to remove the `SET -v` hack, which was a workaround for Simba ODBC driver connectivity.
TODO
- [x] Find a general way to figure out Hive (or even any dependency) version.
This [blog post](http://blog.soebes.de/blog/2014/01/02/version-information-into-your-appas-with-maven/) suggests several methods to inspect the application version. In the case of Spark, this can be tricky because the chosen method:
1. must apply to both Maven and SBT builds
For Maven builds, we can retrieve the version information from the META-INF/maven directory within the assembly jar. But this doesn't work for SBT builds.
2. must not rely on the original jars of dependencies to extract a specific dependency's version, because Spark uses an assembly jar.
This implies we can't read Hive version from Hive jar files since standard Spark distribution doesn't include them.
3. should play well with `SPARK_PREPEND_CLASSES` to ease local testing during development.
`SPARK_PREPEND_CLASSES` prevents classes from being loaded from the assembly jar, so we can't locate the jar file and read its manifest.
Given these, maybe the only reliable method is to generate a source file containing version information at build time. pwendell Do you have any suggestions from the perspective of the build process?
**Update** Hive version is now retrieved from the newly introduced `HiveShim` object.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: Cheng Lian <lian@databricks.com>
Closes#2843 from liancheng/get-info and squashes the following commits:
a873d0f [Cheng Lian] Updates test case
53f43cd [Cheng Lian] Retrieves underlying Hive verson via HiveShim
1d282b8 [Cheng Lian] Removes the Simba ODBC "SET -v" hack
f857fce [Cheng Lian] Overrides Hive GetInfo Thrift API and adds Hive version property
Now when ExecutorLostFailure happens, it only displays `ExecutorLostFailure (executor lost)`. Adding the executor id will help locate the failed executor.
Author: zsxwing <zsxwing@gmail.com>
Closes#3033 from zsxwing/SPARK-4166 and squashes the following commits:
ff4664c [zsxwing] Backward-compatible support
c5c4cf2 [zsxwing] Display the executor ID in the Web UI when ExecutorLostFailure happens
Right now, operations like collect() and take() can crash the driver with an OOM if they bring back too much data.
This PR introduces spark.driver.maxResultSize; after setting it, the driver will abort a job if its result is bigger than that limit.
By default, it's 1g (for backward compatibility in most cases).
In local mode, the driver and executors share the same JVM, so the default setting cannot protect the JVM from OOM.
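A hedged usage sketch (the property name is the one this PR introduces; the "2g" value and local master are just examples):
```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("bounded-collect")
  .setMaster("local[*]")
  // Jobs whose serialized results exceed this are aborted instead of OOMing the driver.
  .set("spark.driver.maxResultSize", "2g")
val sc = new SparkContext(conf)
```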
cc mateiz
Author: Davies Liu <davies@databricks.com>
Closes#3003 from davies/collect and squashes the following commits:
248ed5e [Davies Liu] fix compile
272522e [Davies Liu] address comments
2c35773 [Davies Liu] add sizes in message of abort()
5d62303 [Davies Liu] address comments
bc3c077 [Davies Liu] Merge branch 'master' of github.com:apache/spark into collect
11f97c5 [Davies Liu] address comments
47b144f [Davies Liu] check the size of result before send and fetch
3d81af2 [Davies Liu] address comments
ca8267d [Davies Liu] limit the size of data by collect
This patch introduces the tooling necessary to construct an external shuffle service which is independent of Spark executors, and then to use this service inside Spark. An example (just for the sake of this PR) of the service creation can be found in Worker, and the service itself is used by plugging in the StandaloneShuffleClient as Spark's ShuffleClient (set up in BlockManager).
This PR continues the work from #2753, which extracted out the transport layer of Spark's block transfer into an independent package within Spark. A new package was created which contains the Spark business logic necessary to retrieve the actual shuffle data, which is completely independent of the transport layer introduced in the previous patch. Similar to the transport layer, this package must not depend on Spark as we anticipate plugging this service as a lightweight process within, say, the YARN NodeManager, and do not wish to include Spark's dependencies (including Scala itself).
There are several outstanding tasks which must be complete before this PR can be merged:
- [x] Complete unit testing of network/shuffle package.
- [x] Performance and correctness testing on a real cluster.
- [x] Remove example service instantiation from Worker.scala.
There are even more shortcomings of this PR which should be addressed in follow-up patches:
- Don't use Java serializer for RPC layer! It is not cross-version compatible.
- Handle shuffle file cleanup for dead executors once the application terminates or the ContextCleaner triggers.
- Documentation of the feature in the Spark docs.
- Improve behavior if the shuffle service itself goes down (right now we don't blacklist it, and new executors cannot spawn on that machine).
- SSL and SASL integration
- Nice to have: Handle shuffle file consolidation (this would require changes to Spark's implementation).
Author: Aaron Davidson <aaron@databricks.com>
Closes#3001 from aarondav/shuffle-service and squashes the following commits:
4d1f8c1 [Aaron Davidson] Remove changes to Worker
705748f [Aaron Davidson] Rename Standalone* to External*
fd3928b [Aaron Davidson] Do not unregister executor outputs unduly
9883918 [Aaron Davidson] Make suggested build changes
3d62679 [Aaron Davidson] Add Spark integration test
7fe51d5 [Aaron Davidson] Fix SBT integration
56caa50 [Aaron Davidson] Address comments
c8d1ac3 [Aaron Davidson] Add unit tests
2f70c0c [Aaron Davidson] Fix unit tests
5483e96 [Aaron Davidson] Fix unit tests
46a70bf [Aaron Davidson] Whoops, bracket
5ea4df6 [Aaron Davidson] [SPARK-3796] Create external service which can serve shuffle files
Note that we're turning this on for at least the first part of the QA period as a trial. We want to enable this (and deprecate the NioBlockTransferService) as soon as possible in the hopes that NettyBlockTransferService will be more stable and easier to maintain. We will turn it off if we run into major issues.
Author: Aaron Davidson <aaron@databricks.com>
Closes#3049 from aarondav/enable-netty and squashes the following commits:
bb981cc [Aaron Davidson] [SPARK-4183] Enable NettyBlockTransferService by default
This adds the abstract BinaryFileInputFormat and BinaryRecordReader classes for reading in data as a byte stream and converting it to another format using the ```def parseByteArray(inArray: Array[Byte]): T``` function.
As a trivial example ```ByteInputFormat``` and ```ByteRecordReader``` are included which just return the Array[Byte] from a given file.
Finally, an RDD for ```BinaryFileInputFormat``` (to allow for easier partitioning changes, as was done for WholeFileInput) was added, along with the appropriate byteFiles methods on the ```SparkContext```, so the functions can be easily used by others.
A common use case might be to read in a folder
```
sc.byteFiles("s3://mydrive/tif/*.tif").map(rawData => ReadTiffFromByteArray(rawData))
```
Author: Kevin Mader <kevinmader@gmail.com>
Author: Kevin Mader <kmader@users.noreply.github.com>
Closes#1658 from kmader/master and squashes the following commits:
3c49a30 [Kevin Mader] fixing wholetextfileinput to it has the same setMinPartitions function as in BinaryData files
359a096 [Kevin Mader] making the final corrections suggested by @mateiz and renaming a few functions to make their usage clearer
6379be4 [Kevin Mader] reorganizing code
7b9d181 [Kevin Mader] removing developer API, cleaning up imports
8ac288b [Kevin Mader] fixed a single slightly over 100 character line
92bda0d [Kevin Mader] added new tests, renamed files, fixed several of the javaapi functions, formatted code more nicely
a32fef7 [Kevin Mader] removed unneeded classes added DeveloperApi note to portabledatastreams since the implementation might change
49174d9 [Kevin Mader] removed unneeded classes added DeveloperApi note to portabledatastreams since the implementation might change
c27a8f1 [Kevin Mader] jenkins crashed before running anything last time, so making minor change
b348ce1 [Kevin Mader] fixed order in check (prefix only appears on jenkins not when I run unit tests locally)
0588737 [Kevin Mader] filename check in "binary file input as byte array" test now ignores prefixes and suffixes which might get added by Hadoop
4163e38 [Kevin Mader] fixing line length and output from FSDataInputStream to DataInputStream to minimize sensitivity to Hadoop API changes
19812a8 [Kevin Mader] Fixed the serialization issue with PortableDataStream since neither CombineFileSplit nor TaskAttemptContext implement the Serializable interface, by using ByteArrays for storing both and then recreating the objects from these bytearrays as needed.
238c83c [Kevin Mader] fixed several scala-style issues, changed structure of binaryFiles, removed excessive classes added new tests. The caching tests still have a serialization issue, but that should be easily fixed as well.
932a206 [Kevin Mader] Update RawFileInput.scala
a01c9cf [Kevin Mader] Update RawFileInput.scala
441f79a [Kevin Mader] fixed a few small comments and dependency
12e7be1 [Kevin Mader] removing imglib from maven (definitely not ready yet)
5deb79e [Kevin Mader] added new portabledatastream to code so that it can be serialized correctly
f032bc0 [Kevin Mader] fixed bug in path name, renamed tests
bc5c0b9 [Kevin Mader] made minor stylistic adjustments from mateiz
df8e528 [Kevin Mader] fixed line lengths and changed java test
9a313d5 [Kevin Mader] making classes that needn't be public private, adding automatic file closure, adding new tests
edf5829 [Kevin Mader] fixing line lengths, adding new lines
f4841dc [Kevin Mader] un-optimizing imports, silly intellij
eacfaa6 [Kevin Mader] Added FixedLengthBinaryInputFormat and RecordReader from freeman-lab and added them to both the JavaSparkContext and the SparkContext as fixedLengthBinaryFile
1622935 [Kevin Mader] changing the line lengths to make jenkins happy
1cfa38a [Kevin Mader] added apache headers, added datainputstream directly as an output option for more complicated readers (HDF5 perhaps), and renamed several of the functions and files to be more consistent. Also added parallel functions to the java api
84035f1 [Kevin Mader] adding binary and byte file support spark
81c5f12 [Kevin Mader] Merge pull request #1 from apache/master
Author: Sandy Ryza <sandy@cloudera.com>
Closes#3043 from sryza/sandy-spark-4175 and squashes the following commits:
e327340 [Sandy Ryza] SPARK-4175. Exception on stage page
WebUI
Author: Mark Mims <mark.mims@canonical.com>
This patch had conflicts when merged, resolved by
Committer: Josh Rosen <joshrosen@databricks.com>
Closes#3031 from mmm/remove-accumulators-col and squashes the following commits:
6141cb3 [Mark Mims] reformat to satisfy scalastyle linelength. build failed from jenkins https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22604/
390893b [Mark Mims] cleanup
c28c449 [Mark Mims] looking much better now... minimal explicit formatting. Now, see if any sort keys make sense
fb72156 [Mark Mims] mimic hasInput. The basics work here, but wanna clean this up with maybeAccumulators for column content
This commit adds a set of checkboxes to the stage detail
page that the user can use to show additional task metrics,
including the GC time, result serialization time, result fetch
time, and scheduler delay. All of these metrics are now
hidden by default. This allows advanced users to look at more
detailed metrics, without distracting the average user.
This change also cleans up the stage detail page so that metrics
are shown in the same order in the summary table as in the task table,
and updates the metrics in both tables such that they contain the same
set of metrics.
The ability to remember a user's preferences for which metrics
should be shown has been filed as SPARK-4024.
Here's what the stage detail page looks like by default:
![image](https://cloud.githubusercontent.com/assets/1108612/4744322/3ebe319e-5a2f-11e4-891f-c792be79caa2.png)
and once a user clicks "Show additional metrics" (note that all the metrics get checked by default):
![image](https://cloud.githubusercontent.com/assets/1108612/4744332/51e5abda-5a2f-11e4-8994-d0d3705ee05d.png)
cc shivaram andrewor14
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes#2867 from kayousterhout/SPARK-checkboxes and squashes the following commits:
6015913 [Kay Ousterhout] Added comment
08dee73 [Kay Ousterhout] Josh's usability comments
0940d61 [Kay Ousterhout] Style updates based on Andrew's review
ef05ccd [Kay Ousterhout] Added tooltips
d7cfaaf [Kay Ousterhout] Made list of add'l metrics collapsible.
70c1fb5 [Kay Ousterhout] [SPARK-4016] Allow user to show/hide UI metrics.
More efficient sampling, based on the Gap Sampling optimization:
http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/
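A hedged sketch of the gap sampling idea from the linked post (not Spark's actual implementation): instead of drawing one random number per element, draw the size of the gap to the next kept element from a geometric distribution, which is much cheaper when the sampling fraction f is small.
```scala
import scala.collection.mutable
import scala.util.Random

def gapSample[T](data: IndexedSeq[T], f: Double, rng: Random = new Random()): Seq[T] = {
  require(f > 0.0 && f < 1.0, "sampling fraction must be in (0, 1)")
  // Gap between kept elements ~ Geometric(f): floor(log(u) / log(1 - f)) with u in (0, 1].
  def nextGap(): Int = (math.log(1.0 - rng.nextDouble()) / math.log(1.0 - f)).toInt
  val out = mutable.ArrayBuffer.empty[T]
  var i = nextGap()                 // index of the first kept element
  while (i < data.length) {
    out += data(i)
    i += nextGap() + 1              // skip the gap, then keep the next element
  }
  out.toSeq
}

// Expected size is roughly f * data.length, e.g. gapSample(Vector.range(0, 100000), 0.01).size ~ 1000
```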
Author: Erik Erlandson <eerlands@redhat.com>
Closes#2455 from erikerlandson/spark-3250-pr and squashes the following commits:
72496bc [Erik Erlandson] [SPARK-3250] Implement Gap Sampling optimization for random sampling
This is currently breaking the package build for some people (including me).
This patch does some general clean-up which also fixes the current issue.
- Uses consistent artifact naming
- Adds sbt support for this module
- Changes tests to use scalatest (fixes the original issue[1])
One thing to note: it turns out that scalatest, when invoked in the Maven build, doesn't successfully detect JUnit Java tests. This is a long-standing issue; I noticed it applies to all of our current test suites as well. I've created SPARK-4159 to fix this.
[1] The original issue is that we need to allocate extra memory for the tests, which happens by default in our scalatest configuration.
Author: Patrick Wendell <pwendell@gmail.com>
Closes#3025 from pwendell/hotfix and squashes the following commits:
faa9053 [Patrick Wendell] HOTFIX: Clean up build in network module.
(This is just a look at what completely moving the classes would look like. I know Patrick flagged that as maybe not OK, although it's private?)
Author: Sean Owen <sowen@cloudera.com>
Closes#2814 from srowen/SPARK-1209 and squashes the following commits:
ead1115 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though?
2d42c1d [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark
This also includes minor re-organization of the code. Tested locally in both client and cluster deploy modes.
Author: Andrew Or <andrew@databricks.com>
Author: Andrew Or <andrewor14@gmail.com>
Closes#2697 from andrewor14/memory-cluster-mode and squashes the following commits:
01d78bc [Andrew Or] Merge branch 'master' of github.com:apache/spark into memory-cluster-mode
ccd468b [Andrew Or] Add some comments per Patrick
c956577 [Andrew Or] Tweak wording
2b4afa0 [Andrew Or] Unused import
47a5a88 [Andrew Or] Correct Spark properties precedence order
bf64717 [Andrew Or] Merge branch 'master' of github.com:apache/spark into memory-cluster-mode
dd452d0 [Andrew Or] Respect spark.*.memory in cluster mode
Sort "Started", "Completed", "Duration" and "Last Updated" by time.
Author: zsxwing <zsxwing@gmail.com>
Closes#3014 from zsxwing/SPARK-4153 and squashes the following commits:
ec8b9ad [zsxwing] Sort "Started", "Completed", "Duration" and "Last Updated" by time
I had planned to fix this when I merged it but I forgot to. witgo
Author: Andrew Or <andrew@databricks.com>
Closes#3018 from andrewor14/command-utils-style and squashes the following commits:
c2959fb [Andrew Or] Style hot fix
We use "\<driver\>" everywhere. Let's not do that.
Author: Andrew Or <andrew@databricks.com>
Closes#3020 from andrewor14/consolidate-driver and squashes the following commits:
c1c2204 [Andrew Or] Just use "<driver>" for local executor ID
3d751e9 [Andrew Or] Consolidate usages of <driver>
Author: Andrew Or <andrewor14@gmail.com>
Author: Andrew Or <andrew@databricks.com>
Closes#3021 from andrewor14/typos and squashes the following commits:
daaf417 [Andrew Or] Merge branch 'master' of github.com:apache/spark into typos
4838ae4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into typos
026d426 [Andrew Or] Merge branch 'master' of github.com:andrewor14/spark into typos
a81ae8f [Andrew Or] Some typos