Commit graph

1138 commits

Author SHA1 Message Date
Reynold Xin ef29a9a9aa [SPARK-4307] Initialize FileDescriptor lazily in FileRegion.
Netty's DefaultFileRegion requires a FileDescriptor in its constructor, which means we need to have a opened file handle. In super large workloads, this could lead to too many open files due to the way these file descriptors are cleaned. This pull request creates a new LazyFileRegion that initializes the FileDescriptor when we are sending data for the first time.

Author: Reynold Xin <rxin@databricks.com>
Author: Reynold Xin <rxin@apache.org>

Closes #3172 from rxin/lazyFD and squashes the following commits:

0bdcdc6 [Reynold Xin] Added reference to Netty's DefaultFileRegion
d4564ae [Reynold Xin] Added SparkConf to the ctor argument of IndexShuffleBlockManager.
6ed369e [Reynold Xin] Code review feedback.
04cddc8 [Reynold Xin] [SPARK-4307] Initialize FileDescriptor lazily in FileRegion.
2014-11-11 00:25:31 -08:00
Patrick Wendell 6e7a309b81 Revert "[SPARK-2703][Core]Make Tachyon related unit tests execute without deploying a Tachyon system locally."
This reverts commit bd86cb1738.
2014-11-10 14:56:06 -08:00
Niklas Wilcke ed8bf1eac5 [SPARK-4169] [Core] Accommodate non-English Locales in unit tests
For me the core tests failed because there are two locale dependent parts in the code.
Look at the Jira ticket for details.

Why is it necessary to check the exception message in isBindCollision in
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1686
?

Author: Niklas Wilcke <1wilcke@informatik.uni-hamburg.de>

Closes #3036 from numbnut/core-test-fix and squashes the following commits:

1fb0d04 [Niklas Wilcke] Fixing locale dependend code and tests
2014-11-10 11:37:38 -08:00
RongGu bd86cb1738 [SPARK-2703][Core]Make Tachyon related unit tests execute without deploying a Tachyon system locally.
Make Tachyon related unit tests execute without deploying a Tachyon system locally.

Author: RongGu <gurongwalker@gmail.com>

Closes #3030 from RongGu/SPARK-2703 and squashes the following commits:

ad08827 [RongGu] Make Tachyon related unit tests execute without deploying a Tachyon system locally
2014-11-09 23:48:15 -08:00
Sandy Ryza 3c2cff4b94 SPARK-3179. Add task OutputMetrics.
Author: Sandy Ryza <sandy@cloudera.com>

This patch had conflicts when merged, resolved by
Committer: Kay Ousterhout <kayousterhout@gmail.com>

Closes #2968 from sryza/sandy-spark-3179 and squashes the following commits:

dce4784 [Sandy Ryza] More review feedback
8d350d1 [Sandy Ryza] Fix test against Hadoop 2.5+
e7c74d0 [Sandy Ryza] More review feedback
6cff9c4 [Sandy Ryza] Review feedback
fb2dde0 [Sandy Ryza] SPARK-3179
2014-11-09 22:29:03 -08:00
Aaron Davidson d4fa04e50d [SPARK-4187] [Core] Switch to binary protocol for external shuffle service messages
This PR elimiantes the network package's usage of the Java serializer and replaces it with Encodable, which is a lightweight binary protocol. Each message is preceded by a type id, which will allow us to change messages (by only adding new ones), or to change the format entirely by switching to a special id (such as -1).

This protocol has the advantage over Java that we can guarantee that messages will remain compatible across compiled versions and JVMs, though it does not provide a clean way to do schema migration. In the future, it may be good to use a more heavy-weight serialization format like protobuf, thrift, or avro, but these all add several dependencies which are unnecessary at the present time.

Additionally this unifies the RPC messages of NettyBlockTransferService and ExternalShuffleClient.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3146 from aarondav/free and squashes the following commits:

ed1102a [Aaron Davidson] Remove some unused imports
b8e2a49 [Aaron Davidson] Add appId to test
538f2a3 [Aaron Davidson] [SPARK-4187] [Core] Switch to binary protocol for external shuffle service messages
2014-11-07 09:42:21 -08:00
zsxwing 3abdb1b24a [SPARK-4204][Core][WebUI] Change Utils.exceptionString to contain the inner exceptions and make the error information in Web UI more friendly
This PR fixed `Utils.exceptionString` to output the full exception information. However, the stack trace may become very huge, so I also updated the Web UI to collapse the error information by default (display the first line and clicking `+detail` will display the full info).

Here are the screenshots:

Stages:
![stages](https://cloud.githubusercontent.com/assets/1000778/4882441/66d8cc68-6356-11e4-8346-6318677d9470.png)

Details for one stage:
![stage](https://cloud.githubusercontent.com/assets/1000778/4882513/1311043c-6357-11e4-8804-ca14240a9145.png)

The full information in the gray text field is:
```Java
org.apache.spark.shuffle.FetchFailedException: Connection reset by peer
	at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
	at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
	at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
	at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
	at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
	at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:189)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcher.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
	at sun.nio.ch.IOUtil.read(IOUtil.java:166)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
	at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	... 1 more
```

/cc aarondav

Author: zsxwing <zsxwing@gmail.com>

Closes #3073 from zsxwing/SPARK-4204 and squashes the following commits:

176d1e3 [zsxwing] Add comments to explain the stack trace difference
ca509d3 [zsxwing] Add fullStackTrace to the constructor of ExceptionFailure
a07057b [zsxwing] Core style fix
dfb0032 [zsxwing] Backward compatibility for old history server
1e50f71 [zsxwing] Update as per review and increase the max height of the stack trace details
94f2566 [zsxwing] Change Utils.exceptionString to contain the inner exceptions and make the error information in Web UI more friendly
2014-11-06 21:52:38 -08:00
Aaron Davidson 48a19a6dba [SPARK-4236] Cleanup removed applications' files in shuffle service
This relies on a hook from whoever is hosting the shuffle service to invoke removeApplication() when the application is completed. Once invoked, we will clean up all the executors' shuffle directories we know about.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3126 from aarondav/cleanup and squashes the following commits:

33a64a9 [Aaron Davidson] Missing brace
e6e428f [Aaron Davidson] Address comments
16a0d27 [Aaron Davidson] Cleanup
e4df3e7 [Aaron Davidson] [SPARK-4236] Cleanup removed applications' files in shuffle service
2014-11-06 19:54:32 -08:00
Aaron Davidson 6e9ef10fd7 [SPARK-4277] Support external shuffle service on Standalone Worker
Author: Aaron Davidson <aaron@databricks.com>

Closes #3142 from aarondav/worker and squashes the following commits:

3780bd7 [Aaron Davidson] Address comments
2dcdfc1 [Aaron Davidson] Add private[worker]
47f49d3 [Aaron Davidson] NettyBlockTransferService shouldn't care about app ids (it's only b/t executors)
258417c [Aaron Davidson] [SPARK-4277] Support external shuffle service on executor
2014-11-06 17:20:46 -08:00
Aaron Davidson 23eaf0e12f [SPARK-4264] Completion iterator should only invoke callback once
Author: Aaron Davidson <aaron@databricks.com>

Closes #3128 from aarondav/compiter and squashes the following commits:

698e4be [Aaron Davidson] [SPARK-4264] Completion iterator should only invoke callback once
2014-11-06 10:45:46 -08:00
Aaron Davidson 4c42986cc0 [SPARK-4242] [Core] Add SASL to external shuffle service
Does three things: (1) Adds SASL to ExternalShuffleClient, (2) puts SecurityManager in BlockManager's constructor, and (3) adds unit test.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3108 from aarondav/sasl-client and squashes the following commits:

48b622d [Aaron Davidson] Screw it, let's just get LimitedInputStream
3543b70 [Aaron Davidson] Back out of pom change due to unknown test issue?
b58518a [Aaron Davidson] ByteStreams.limit() not available :(
cbe451a [Aaron Davidson] Address comments
2bf2908 [Aaron Davidson] [SPARK-4242] [Core] Add SASL to external shuffle service
2014-11-05 14:38:43 -08:00
Aaron Davidson 5e73138a01 [SPARK-2938] Support SASL authentication in NettyBlockTransferService
Also lays the groundwork for supporting it inside the external shuffle service.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3087 from aarondav/sasl and squashes the following commits:

3481718 [Aaron Davidson] Delete rogue println
44f8410 [Aaron Davidson] Delete documentation - muahaha!
eb9f065 [Aaron Davidson] Improve documentation and add end-to-end test at Spark-level
a6b95f1 [Aaron Davidson] Address comments
785bbde [Aaron Davidson] Cleanup
79973cb [Aaron Davidson] Remove unused file
151b3c5 [Aaron Davidson] Add docs, timeout config, better failure handling
f6177d7 [Aaron Davidson] Cleanup SASL state upon connection termination
7b42adb [Aaron Davidson] Add unit tests
8191bcb [Aaron Davidson] [SPARK-2938] Support SASL authentication in NettyBlockTransferService
2014-11-04 16:15:38 -08:00
zsxwing b671ce047d [SPARK-4166][Core] Add a backward compatibility test for ExecutorLostFailure
Author: zsxwing <zsxwing@gmail.com>

Closes #3085 from zsxwing/SPARK-4166-back-comp and squashes the following commits:

89329f4 [zsxwing] Add a backward compatibility test for ExecutorLostFailure
2014-11-03 22:47:45 -08:00
zsxwing 9bdc8412a0 [SPARK-4163][Core] Add a backward compatibility test for FetchFailed
/cc aarondav

Author: zsxwing <zsxwing@gmail.com>

Closes #3086 from zsxwing/SPARK-4163-back-comp and squashes the following commits:

21cb2a8 [zsxwing] Add a backward compatibility test for FetchFailed
2014-11-03 22:40:43 -08:00
Sandy Ryza 28128150e7 SPARK-4178. Hadoop input metrics ignore bytes read in RecordReader insta...
...ntiation

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3045 from sryza/sandy-spark-4178 and squashes the following commits:

8d2e70e [Sandy Ryza] Kostas's review feedback
e5b27c0 [Sandy Ryza] SPARK-4178. Hadoop input metrics ignore bytes read in RecordReader instantiation
2014-11-03 15:19:01 -08:00
zsxwing 76386e1a23 [SPARK-4163][Core][WebUI] Send the fetch failure message back to Web UI
This is a PR to send the fetch failure message back to Web UI.
Before:
![f1](https://cloud.githubusercontent.com/assets/1000778/4856595/1f036c80-60be-11e4-956f-335147fbccb7.png)
![f2](https://cloud.githubusercontent.com/assets/1000778/4856596/1f11cbea-60be-11e4-8fe9-9f9b2b35c884.png)

After (Please ignore the meaning of exception, I threw it in the code directly because it's hard to simulate a fetch failure):
![e1](https://cloud.githubusercontent.com/assets/1000778/4856600/2657ea38-60be-11e4-9f2d-d56c5f900f10.png)
![e2](https://cloud.githubusercontent.com/assets/1000778/4856601/26595008-60be-11e4-912b-2744af786991.png)

Author: zsxwing <zsxwing@gmail.com>

Closes #3032 from zsxwing/SPARK-4163 and squashes the following commits:

f7e1faf [zsxwing] Discard changes for FetchFailedException and minor modification
4e946f7 [zsxwing] Add e as the cause of SparkException
316767d [zsxwing] Add private[storage] to FetchResult
d51b0b6 [zsxwing] Set e as the cause of FetchFailedException
b88c919 [zsxwing] Use 'private[storage]' for case classes instead of 'sealed'
62103fd [zsxwing] Update as per review
0c07d1f [zsxwing] Backward-compatible support
a3bca65 [zsxwing] Send the fetch failure message back to Web UI
2014-11-02 23:20:22 -08:00
Aaron Davidson 2ebd1df3f1 [SPARK-4183] Close transport-related resources between SparkContexts
A leak of event loops may be causing test failures.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3053 from aarondav/leak and squashes the following commits:

e676d18 [Aaron Davidson] Typo!
8f96475 [Aaron Davidson] Keep original ssc semantics
7e49f10 [Aaron Davidson] A leak of event loops may be causing test failures.
2014-11-02 16:26:24 -08:00
zsxwing 4e6a7a0b3e [SPARK-4166][Core][WebUI] Display the executor ID in the Web UI when ExecutorLostFailure happens
Now when ExecutorLostFailure happens, it only displays `ExecutorLostFailure (executor lost)`. Adding the executor id will help locate the faulted executor.

Author: zsxwing <zsxwing@gmail.com>

Closes #3033 from zsxwing/SPARK-4166 and squashes the following commits:

ff4664c [zsxwing] Backward-compatible support
c5c4cf2 [zsxwing] Display the executor ID in the Web UI when ExecutorLostFailure happens
2014-11-02 10:44:52 -08:00
Davies Liu 6181577e99 [SPARK-3466] Limit size of results that a driver collects for each action
Right now, operations like collect() and take() can crash the driver with an OOM if they bring back too many data.

This PR will introduce spark.driver.maxResultSize, after setting it, the driver will abort a job if its result is bigger than it.

By default, it's 1g (for backward compatibility for most the cases).

In local mode, the driver and executor share the same JVM, the default setting can not protect JVM from OOM.

cc mateiz

Author: Davies Liu <davies@databricks.com>

Closes #3003 from davies/collect and squashes the following commits:

248ed5e [Davies Liu] fix compile
272522e [Davies Liu] address comments
2c35773 [Davies Liu] add sizes in message of abort()
5d62303 [Davies Liu] address comments
bc3c077 [Davies Liu] Merge branch 'master' of github.com:apache/spark into collect
11f97c5 [Davies Liu] address comments
47b144f [Davies Liu] check the size of result before send and fetch
3d81af2 [Davies Liu] address comments
ca8267d [Davies Liu] limit the size of data by collect
2014-11-02 00:03:51 -07:00
Aaron Davidson f55218aeb1 [SPARK-3796] Create external service which can serve shuffle files
This patch introduces the tooling necessary to construct an external shuffle service which is independent of Spark executors, and then use this service inside Spark. An example (just for the sake of this PR) of the service creation can be found in Worker, and the service itself is used by plugging in the StandaloneShuffleClient as Spark's ShuffleClient (setup in BlockManager).

This PR continues the work from #2753, which extracted out the transport layer of Spark's block transfer into an independent package within Spark. A new package was created which contains the Spark business logic necessary to retrieve the actual shuffle data, which is completely independent of the transport layer introduced in the previous patch. Similar to the transport layer, this package must not depend on Spark as we anticipate plugging this service as a lightweight process within, say, the YARN NodeManager, and do not wish to include Spark's dependencies (including Scala itself).

There are several outstanding tasks which must be complete before this PR can be merged:
- [x] Complete unit testing of network/shuffle package.
- [x] Performance and correctness testing on a real cluster.
- [x] Remove example service instantiation from Worker.scala.

There are even more shortcomings of this PR which should be addressed in followup patches:
- Don't use Java serializer for RPC layer! It is not cross-version compatible.
- Handle shuffle file cleanup for dead executors once the application terminates or the ContextCleaner triggers.
- Documentation of the feature in the Spark docs.
- Improve behavior if the shuffle service itself goes down (right now we don't blacklist it, and new executors cannot spawn on that machine).
- SSL and SASL integration
- Nice to have: Handle shuffle file consolidation (this would requires changes to Spark's implementation).

Author: Aaron Davidson <aaron@databricks.com>

Closes #3001 from aarondav/shuffle-service and squashes the following commits:

4d1f8c1 [Aaron Davidson] Remove changes to Worker
705748f [Aaron Davidson] Rename Standalone* to External*
fd3928b [Aaron Davidson] Do not unregister executor outputs unduly
9883918 [Aaron Davidson] Make suggested build changes
3d62679 [Aaron Davidson] Add Spark integration test
7fe51d5 [Aaron Davidson] Fix SBT integration
56caa50 [Aaron Davidson] Address comments
c8d1ac3 [Aaron Davidson] Add unit tests
2f70c0c [Aaron Davidson] Fix unit tests
5483e96 [Aaron Davidson] Fix unit tests
46a70bf [Aaron Davidson] Whoops, bracket
5ea4df6 [Aaron Davidson] [SPARK-3796] Create external service which can serve shuffle files
2014-11-01 14:37:45 -07:00
Kevin Mader 7136719b7d [SPARK-2759][CORE] Generic Binary File Support in Spark
The additions add the abstract BinaryFileInputFormat and BinaryRecordReader classes for reading in data as a byte stream and converting it to another format using the ```def parseByteArray(inArray: Array[Byte]): T``` function.
As a trivial example ```ByteInputFormat``` and ```ByteRecordReader``` are included which just return the Array[Byte] from a given file.
Finally a RDD for ```BinaryFileInputFormat``` (to allow for easier partitioning changes as was done for WholeFileInput) was added and the appropriate byteFiles to the ```SparkContext``` so the functions can be easily used by others.
A common use case might be to read in a folder
```
sc.byteFiles("s3://mydrive/tif/*.tif").map(rawData => ReadTiffFromByteArray(rawData))
```

Author: Kevin Mader <kevinmader@gmail.com>
Author: Kevin Mader <kmader@users.noreply.github.com>

Closes #1658 from kmader/master and squashes the following commits:

3c49a30 [Kevin Mader] fixing wholetextfileinput to it has the same setMinPartitions function as in BinaryData files
359a096 [Kevin Mader] making the final corrections suggested by @mateiz and renaming a few functions to make their usage clearer
6379be4 [Kevin Mader] reorganizing code
7b9d181 [Kevin Mader] removing developer API, cleaning up imports
8ac288b [Kevin Mader] fixed a single slightly over 100 character line
92bda0d [Kevin Mader] added new tests, renamed files, fixed several of the javaapi functions, formatted code more nicely
a32fef7 [Kevin Mader] removed unneeded classes added DeveloperApi note to portabledatastreams since the implementation might change
49174d9 [Kevin Mader] removed unneeded classes added DeveloperApi note to portabledatastreams since the implementation might change
c27a8f1 [Kevin Mader] jenkins crashed before running anything last time, so making minor change
b348ce1 [Kevin Mader] fixed order in check (prefix only appears on jenkins not when I run unit tests locally)
0588737 [Kevin Mader] filename check in "binary file input as byte array" test now ignores prefixes and suffixes which might get added by Hadoop
4163e38 [Kevin Mader] fixing line length and output from FSDataInputStream to DataInputStream to minimize sensitivity to Hadoop API changes
19812a8 [Kevin Mader] Fixed the serialization issue with PortableDataStream since neither CombineFileSplit nor TaskAttemptContext implement the Serializable interface, by using ByteArrays for storing both and then recreating the objects from these bytearrays as needed.
238c83c [Kevin Mader] fixed several scala-style issues, changed structure of binaryFiles, removed excessive classes added new tests. The caching tests still have a serialization issue, but that should be easily fixed as well.
932a206 [Kevin Mader] Update RawFileInput.scala
a01c9cf [Kevin Mader] Update RawFileInput.scala
441f79a [Kevin Mader] fixed a few small comments and dependency
12e7be1 [Kevin Mader] removing imglib from maven (definitely not ready yet)
5deb79e [Kevin Mader] added new portabledatastream to code so that it can be serialized correctly
f032bc0 [Kevin Mader] fixed bug in path name, renamed tests
bc5c0b9 [Kevin Mader] made minor stylistic adjustments from mateiz
df8e528 [Kevin Mader] fixed line lengths and changed java test
9a313d5 [Kevin Mader] making classes that needn't be public private, adding automatic file closure, adding new tests
edf5829 [Kevin Mader] fixing line lengths, adding new lines
f4841dc [Kevin Mader] un-optimizing imports, silly intellij
eacfaa6 [Kevin Mader] Added FixedLengthBinaryInputFormat and RecordReader from freeman-lab and added them to both the JavaSparkContext and the SparkContext as fixedLengthBinaryFile
1622935 [Kevin Mader] changing the line lengths to make jenkins happy
1cfa38a [Kevin Mader] added apache headers, added datainputstream directly as an output option for more complicated readers (HDF5 perhaps), and renamed several of the functions and files to be more consistent. Also added parallel functions to the java api
84035f1 [Kevin Mader] adding binary and byte file support spark
81c5f12 [Kevin Mader] Merge pull request #1 from apache/master
2014-11-01 11:59:39 -07:00
Erik Erlandson ad3bd0dff8 [SPARK-3250] Implement Gap Sampling optimization for random sampling
More efficient sampling, based on Gap Sampling optimization:
http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/

Author: Erik Erlandson <eerlands@redhat.com>

Closes #2455 from erikerlandson/spark-3250-pr and squashes the following commits:

72496bc [Erik Erlandson] [SPARK-3250] Implement Gap Sampling optimization for random sampling
2014-10-30 22:30:52 -07:00
Andrew Or 9334d69967 [SPARK-4155] Consolidate usages of <driver>
We use "\<driver\>" everywhere. Let's not do that.

Author: Andrew Or <andrew@databricks.com>

Closes #3020 from andrewor14/consolidate-driver and squashes the following commits:

c1c2204 [Andrew Or] Just use "<driver>" for local executor ID
3d751e9 [Andrew Or] Consolidate usages of <driver>
2014-10-30 15:32:46 -07:00
Andrew Or 24c5129257 [SPARK-3319] [SPARK-3338] Resolve Spark submit config paths
The bulk of this PR is comprised of tests. All changes in functionality are made in `SparkSubmit.scala` (~20 lines).

**SPARK-3319.** There is currently a divergence in behavior when the user passes in additional jars through `--jars` and through setting `spark.jars` in the default properties file. The former will happily resolve the paths (e.g. convert `my.jar` to `file:/absolute/path/to/my.jar`), while the latter does not. We should resolve paths consistently in both cases. This also applies to the following pairs of command line arguments and Spark configs:

- `--jars` ~ `spark.jars`
- `--files` ~ `spark.files` / `spark.yarn.dist.files`
- `--archives` ~ `spark.yarn.dist.archives`
- `--py-files` ~ `spark.submit.pyFiles`

**SPARK-3338.** This PR also fixes the following bug: if the user sets `spark.submit.pyFiles` in his/her properties file, it does not actually get picked up even if `--py-files` is not set. This is simply because the config is overridden by an empty string.

Author: Andrew Or <andrewor14@gmail.com>
Author: Andrew Or <andrew@databricks.com>

Closes #2232 from andrewor14/resolve-config-paths and squashes the following commits:

fff2869 [Andrew Or] Add spark.yarn.jar
da3a1c1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into resolve-config-paths
f0fae64 [Andrew Or] Merge branch 'master' of github.com:apache/spark into resolve-config-paths
05e03d6 [Andrew Or] Add tests for resolving both command line and config paths
460117e [Andrew Or] Resolve config paths properly
fe039d3 [Andrew Or] Beef up tests to test fixed-pointed-ness of Utils.resolveURI(s)
2014-10-30 15:29:07 -07:00
GuoQiang Li cd739bd756 [SPARK-1720][SPARK-1719] use LD_LIBRARY_PATH instead of -Djava.library.path
- [X] Standalone
- [X] YARN
- [X] Mesos
- [X]  Mac OS X
- [X] Linux
- [ ]  Windows

This is another implementation about #1031

Author: GuoQiang Li <witgo@qq.com>

Closes #2711 from witgo/SPARK-1719 and squashes the following commits:

c7b26f6 [GuoQiang Li] review commits
4488e41 [GuoQiang Li] Refactoring CommandUtils
a444094 [GuoQiang Li] review commits
40c0b4a [GuoQiang Li] Add buildLocalCommand method
c1a0ddd [GuoQiang Li] fix comments
156ce88 [GuoQiang Li] review commit
38aa377 [GuoQiang Li] Refactor CommandUtils.scala
4269e00 [GuoQiang Li] Refactor SparkSubmitDriverBootstrapper.scala
7a1d634 [GuoQiang Li] use LD_LIBRARY_PATH instead of -Djava.library.path
2014-10-29 23:02:58 -07:00
Andrew Or 8d59b37b02 [SPARK-3795] Heuristics for dynamically scaling executors
This is part of a bigger effort to provide elastic scaling of executors within a Spark application ([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This PR does not provide any functionality by itself; it is a skeleton that is missing a mechanism to be added later in [SPARK-3822](https://issues.apache.org/jira/browse/SPARK-3822).

Comments and feedback are most welcome. For those of you reviewing this in detail, I highly recommend doing it through your favorite IDE instead of through the diff here.

Author: Andrew Or <andrewor14@gmail.com>
Author: Andrew Or <andrew@databricks.com>

Closes #2746 from andrewor14/scaling-heuristics and squashes the following commits:

8a4fdaa [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
e045df8 [Andrew Or] Add warning message (minor)
dfa31ec [Andrew Or] Fix tests
c0becc4 [Andrew Or] Merging with SPARK-3822
4784f93 [Andrew Or] Reword an awkward log message
181f27f [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
c79e907 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
4672b90 [Andrew Or] It's nano time.
a6a30f2 [Andrew Or] Do not allow min/max executors of 0
c60ec33 [Andrew Or] Rewrite test logic with clocks
b00b680 [Andrew Or] Fix style
c3caa65 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
7f9da14 [Andrew Or] Factor out logic to verify bounds on # executors (minor)
f279019 [Andrew Or] Add time mocking tests for polling loop
685e347 [Andrew Or] Factor out clock in polling loop to facilitate testing
3cea7f7 [Andrew Or] Use PrivateMethodTester to keep original class private
3156d81 [Andrew Or] Update comments and exception messages
92f36f9 [Andrew Or] Address minor review comments
abdea61 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
2aefd09 [Andrew Or] Correct listener behavior
9fe6e44 [Andrew Or] Rename variables and configs + update comments and log messages
149cc32 [Andrew Or] Fix style
254c958 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
5ff829b [Andrew Or] Add tests for ExecutorAllocationManager
19c6c4b [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
5896515 [Andrew Or] Move ExecutorAllocationManager out of scheduler package
9ca8945 [Andrew Or] Rewrite callbacks through the listener interface
5e336b9 [Andrew Or] Remove code from backend to avoid conflict with SPARK-3822
092d1fd [Andrew Or] Remove timeout logic for pending requests
1309fab [Andrew Or] Request executors by specifying the number pending
8bc0e9d [Andrew Or] Add logic to expire pending requests after timeouts
b750ee1 [Andrew Or] Express timers in terms of expiration times + remove retry logic
7f8dd47 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
9d516cc [Andrew Or] Bug fix: Actually trigger the add timer / add retry timer
44f1832 [Andrew Or] Rename configs to include time units
eaae7ef [Andrew Or] Address various review comments
6f8be6c [Andrew Or] Beef up comments on what each of the timers mean
baaa403 [Andrew Or] Simplify variable names (minor)
42beec8 [Andrew Or] Reset whether the add threshold is crossed on cancellation
9bcc0bc [Andrew Or] ExecutorScalingManager -> ExecutorAllocationManager
2784398 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
5a97d9e [Andrew Or] Log retry attempts in INFO + clean up logging
2f55c9f [Andrew Or] Do not keep requesting executors even after max attempts
0acd1cb [Andrew Or] Rewrite timer logic with polling
b3c7d44 [Andrew Or] Start the retry timer for adding executors at the right time
9b5f2ea [Andrew Or] Wording changes in comments and log messages
c2203a5 [Andrew Or] Simplify code to access the scheduler backend
e519d08 [Andrew Or] Simplify initialization code
2cc87a7 [Andrew Or] Add retry logic for removing executors
d0b34a6 [Andrew Or] Add retry logic for adding executors
9cc4649 [Andrew Or] Simplifying synchronization logic
67c03c7 [Andrew Or] Correct semantics of adding executors + update comments
6c48ab0 [Andrew Or] Update synchronization comment
8901900 [Andrew Or] Simplify remove policy + change the semantics of add policy
1cc8444 [Andrew Or] Minor wording change
ae5b64a [Andrew Or] Add synchronization
20ec6b9 [Andrew Or] First cut implementation of removing executors dynamically
4077ae2 [Andrew Or] Minor code re-organization
6f1fa66 [Andrew Or] First cut implementation of adding executors dynamically
b2e6dcc [Andrew Or] Add skeleton interface for requesting / killing executors
2014-10-29 17:48:59 -07:00
Andrew Or 1df05a40eb [SPARK-3822] Executor scaling mechanism for Yarn
This is part of a broader effort to enable dynamic scaling of executors ([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This is intended to work alongside SPARK-3795 (#2746), SPARK-3796 and SPARK-3797, but is functionally independently of these other issues.

The logic is built on top of PraveenSeluka's changes at #2798. This is different from the changes there in a few major ways: (1) the mechanism is implemented within the existing scheduler backend framework rather than in new `Actor` classes. This also introduces a parent abstract class `YarnSchedulerBackend` to encapsulate common logic to communicate with the Yarn `ApplicationMaster`. (2) The interface of requesting executors exposed to the `SparkContext` is the same, but the communication between the scheduler backend and the AM uses total number executors desired instead of an incremental number. This is discussed in #2746 and explained in the comments in the code.

I have tested this significantly on a stable Yarn cluster.

------------
A remaining task for this issue is to tone down the error messages emitted when an executor is removed.
Currently, `SparkContext` and its components react as if the executor has failed, resulting in many scary error messages and eventual timeouts. While it's not strictly necessary to fix this as of the first-cut implementation of this mechanism, it would be good to add logic to distinguish this case. I prefer to address this in a separate PR. I have filed a separate JIRA for this task at SPARK-4134.

Author: Andrew Or <andrew@databricks.com>
Author: Andrew Or <andrewor14@gmail.com>

Closes #2840 from andrewor14/yarn-scaling-mechanism and squashes the following commits:

485863e [Andrew Or] Minor log message changes
4920be8 [Andrew Or] Clarify that public API is only for Yarn mode for now
1c57804 [Andrew Or] Reword a few comments + other review comments
6321140 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
02836c0 [Andrew Or] Limit scope of synchronization
4e2ed7f [Andrew Or] Fix bug: keep track of removed executors properly
73ade46 [Andrew Or] Wording changes (minor)
2a7a6da [Andrew Or] Add `sc.killExecutor` as a shorthand (minor)
665f229 [Andrew Or] Mima excludes
79aa2df [Andrew Or] Simplify the request interface by asking for a total
04f625b [Andrew Or] Fix race condition that causes over-allocation of executors
f4783f8 [Andrew Or] Change the semantics of requesting executors
005a124 [Andrew Or] Fix tests
4628b16 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
db4a679 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
572f5c5 [Andrew Or] Unused import (minor)
f30261c [Andrew Or] Kill multiple executors rather than one at a time
de260d9 [Andrew Or] Simplify by skipping useless null check
9c52542 [Andrew Or] Simplify by skipping the TaskSchedulerImpl
97dd1a8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
d987b3e [Andrew Or] Move addWebUIFilters to Yarn scheduler backend
7b76d0a [Andrew Or] Expose mechanism in SparkContext as developer API
47466cd [Andrew Or] Refactor common Yarn scheduler backend logic
c4dfaac [Andrew Or] Avoid thrashing when removing executors
53e8145 [Andrew Or] Start yarn actor early to listen for AM registration message
bbee669 [Andrew Or] Add mechanism in yarn client mode
2014-10-29 14:01:00 -07:00
Reynold Xin dff015533d [SPARK-3453] Netty-based BlockTransferService, extracted from Spark core
This PR encapsulates #2330, which is itself a continuation of #2240. The first goal of this PR is to provide an alternate, simpler implementation of the ConnectionManager which is based on Netty.

In addition to this goal, however, we want to resolve [SPARK-3796](https://issues.apache.org/jira/browse/SPARK-3796), which calls for a standalone shuffle service which can be integrated into the YARN NodeManager, Standalone Worker, or on its own. This PR makes the first step in this direction by ensuring that the actual Netty service is as small as possible and extracted from Spark core. Given this, we should be able to construct this standalone jar which can be included in other JVMs without incurring significant dependency or runtime issues. The actual work to ensure that such a standalone shuffle service would work in Spark will be left for a future PR, however.

In order to minimize dependencies and allow for the service to be long-running (possibly much longer-running than Spark, and possibly having to support multiple version of Spark simultaneously), the entire service has been ported to Java, where we have full control over the binary compatibility of the components and do not depend on the Scala runtime or version.

These issues: have been addressed by folding in #2330:

SPARK-3453: Refactor Netty module to use BlockTransferService interface
SPARK-3018: Release all buffers upon task completion/failure
SPARK-3002: Create a connection pool and reuse clients across different threads
SPARK-3017: Integration tests and unit tests for connection failures
SPARK-3049: Make sure client doesn't block when server/connection has error(s)
SPARK-3502: SO_RCVBUF and SO_SNDBUF should be bootstrap childOption, not option
SPARK-3503: Disable thread local cache in PooledByteBufAllocator

TODO before mergeable:
- [x] Implement uploadBlock()
- [x] Unit tests for RPC side of code
- [x] Performance testing (see comments [here](https://github.com/apache/spark/pull/2753#issuecomment-59475022))
- [x] Turn OFF by default (currently on for unit testing)

Author: Reynold Xin <rxin@apache.org>
Author: Aaron Davidson <aaron@databricks.com>
Author: cocoatomo <cocoatomo77@gmail.com>
Author: Patrick Wendell <pwendell@gmail.com>
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Davies Liu <davies.liu@gmail.com>
Author: Anand Avati <avati@redhat.com>

Closes #2753 from aarondav/netty and squashes the following commits:

cadfd28 [Aaron Davidson] Turn netty off by default
d7be11b [Aaron Davidson] Turn netty on by default
4a204b8 [Aaron Davidson] Fail block fetches if client connection fails
2b0d1c0 [Aaron Davidson] 100ch
0c5bca2 [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty
14e37f7 [Aaron Davidson] Address Reynold's comments
8dfcceb [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty
322dfc1 [Aaron Davidson] Address Reynold's comments, including major rename
e5675a4 [Aaron Davidson] Fail outstanding RPCs as well
ccd4959 [Aaron Davidson] Don't throw exception if client immediately fails
9da0bc1 [Aaron Davidson] Add RPC unit tests
d236dfd [Aaron Davidson] Remove no-op serializer :)
7b7a26c [Aaron Davidson] Fix Nio compile issue
dd420fd [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty-test
939f276 [Aaron Davidson] Attempt to make comm. bidirectional
aa58f67 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
8dc1ded [cocoatomo] [SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
5b5dbe6 [Prashant Sharma] [SPARK-2924] Required by scala 2.11, only one fun/ctor amongst overriden alternatives, can have default argument(s).
2c5d9dc [Patrick Wendell] HOTFIX: Fix build issue with Akka 2.3.4 upgrade.
020691e [Davies Liu] [SPARK-3886] [PySpark] use AutoBatchedSerializer by default
ae4083a [Anand Avati] [SPARK-2805] Upgrade Akka to 2.3.4
29c6dcf [Aaron Davidson] [SPARK-3453] Netty-based BlockTransferService, extracted from Spark core
f7e7568 [Reynold Xin] Fixed spark.shuffle.io.receiveBuffer setting.
5d98ce3 [Reynold Xin] Flip buffer.
f6c220d [Reynold Xin] Merge with latest master.
407e59a [Reynold Xin] Fix style violation.
a0518c7 [Reynold Xin] Implemented block uploads.
4b18db2 [Reynold Xin] Copy the buffer in fetchBlockSync.
bec4ea2 [Reynold Xin] Removed OIO and added num threads settings.
1bdd7ee [Reynold Xin] Fixed tests.
d68f328 [Reynold Xin] Logging close() in case close() fails.
f63fb4c [Reynold Xin] Add more debug message.
6afc435 [Reynold Xin] Added logging.
c066309 [Reynold Xin] Implement java.io.Closeable interface.
519d64d [Reynold Xin] Mark private package visibility and MimaExcludes.
f0a16e9 [Reynold Xin] Fixed test hanging.
14323a5 [Reynold Xin] Removed BlockManager.getLocalShuffleFromDisk.
b2f3281 [Reynold Xin] Added connection pooling.
d23ed7b [Reynold Xin] Incorporated feedback from Norman: - use same pool for boss and worker - remove ioratio - disable caching of byte buf allocator - childoption sendbuf/receivebuf - fire exception through pipeline
9e0cb87 [Reynold Xin] Fixed BlockClientHandlerSuite
5cd33d7 [Reynold Xin] Fixed style violation.
cb589ec [Reynold Xin] Added more test cases covering cleanup when fault happens in ShuffleBlockFetcherIteratorSuite
1be4e8e [Reynold Xin] Shorten NioManagedBuffer and NettyManagedBuffer class names.
108c9ed [Reynold Xin] Forgot to add TestSerializer to the commit list.
b5c8d1f [Reynold Xin] Fixed ShuffleBlockFetcherIteratorSuite.
064747b [Reynold Xin] Reference count buffers and clean them up properly.
2b44cf1 [Reynold Xin] Added more documentation.
1760d32 [Reynold Xin] Use Epoll.isAvailable in BlockServer as well.
165eab1 [Reynold Xin] [SPARK-3453] Refactor Netty module to use BlockTransferService.
2014-10-29 11:27:07 -07:00
Xiangrui Meng 1559495dd9 [FIX] disable benchmark code
I forgot to disable the benchmark code in #2937, which increased the Jenkins build time by couple minutes.

aarondav

Author: Xiangrui Meng <meng@databricks.com>

Closes #2990 from mengxr/disable-benchmark and squashes the following commits:

c58f070 [Xiangrui Meng] disable benchmark code
2014-10-28 22:00:10 -07:00
zsxwing 1536d70331 [SPARK-4008] Fix "kryo with fold" in KryoSerializerSuite
`zeroValue` will be serialized by `spark.closure.serializer` but `spark.closure.serializer` only supports the default Java serializer. So it must not be `ClassWithoutNoArgConstructor`, which can not be serialized by the Java serializer.

This PR changed `zeroValue` to null and updated the test to make it work correctly.

Author: zsxwing <zsxwing@gmail.com>

Closes #2856 from zsxwing/SPARK-4008 and squashes the following commits:

51da655 [zsxwing] [SPARK-4008] Fix "kryo with fold" in KryoSerializerSuite
2014-10-28 17:59:10 -07:00
Xiangrui Meng 84e5da87e3 [SPARK-4084] Reuse sort key in Sorter
Sorter uses generic-typed key for sorting. When data is large, it creates lots of key objects, which is not efficient. We should reuse the key in Sorter for memory efficiency. This change is part of the petabyte sort implementation from rxin .

The `Sorter` class was written in Java and marked package private. So it is only available to `org.apache.spark.util.collection`. I renamed it to `TimSort` and add a simple wrapper of it, still called `Sorter`, in Scala, which is `private[spark]`.

The benchmark code is updated, which now resets the array before each run. Here is the result on sorting primitive Int arrays of size 25 million using Sorter:

~~~
[info] - Sorter benchmark for key-value pairs !!! IGNORED !!!
Java Arrays.sort() on non-primitive int array: Took 13237 ms
Java Arrays.sort() on non-primitive int array: Took 13320 ms
Java Arrays.sort() on non-primitive int array: Took 15718 ms
Java Arrays.sort() on non-primitive int array: Took 13283 ms
Java Arrays.sort() on non-primitive int array: Took 13267 ms
Java Arrays.sort() on non-primitive int array: Took 15122 ms
Java Arrays.sort() on non-primitive int array: Took 15495 ms
Java Arrays.sort() on non-primitive int array: Took 14877 ms
Java Arrays.sort() on non-primitive int array: Took 16429 ms
Java Arrays.sort() on non-primitive int array: Took 14250 ms
Java Arrays.sort() on non-primitive int array: (13878 ms first try, 14499 ms average)
Java Arrays.sort() on primitive int array: Took 2683 ms
Java Arrays.sort() on primitive int array: Took 2683 ms
Java Arrays.sort() on primitive int array: Took 2701 ms
Java Arrays.sort() on primitive int array: Took 2746 ms
Java Arrays.sort() on primitive int array: Took 2685 ms
Java Arrays.sort() on primitive int array: Took 2735 ms
Java Arrays.sort() on primitive int array: Took 2669 ms
Java Arrays.sort() on primitive int array: Took 2693 ms
Java Arrays.sort() on primitive int array: Took 2680 ms
Java Arrays.sort() on primitive int array: Took 2642 ms
Java Arrays.sort() on primitive int array: (2948 ms first try, 2691 ms average)
Sorter without key reuse on primitive int array: Took 10732 ms
Sorter without key reuse on primitive int array: Took 12482 ms
Sorter without key reuse on primitive int array: Took 10718 ms
Sorter without key reuse on primitive int array: Took 12650 ms
Sorter without key reuse on primitive int array: Took 10747 ms
Sorter without key reuse on primitive int array: Took 10783 ms
Sorter without key reuse on primitive int array: Took 12721 ms
Sorter without key reuse on primitive int array: Took 10604 ms
Sorter without key reuse on primitive int array: Took 10622 ms
Sorter without key reuse on primitive int array: Took 11843 ms
Sorter without key reuse on primitive int array: (11089 ms first try, 11390 ms average)
Sorter with key reuse on primitive int array: Took 5141 ms
Sorter with key reuse on primitive int array: Took 5298 ms
Sorter with key reuse on primitive int array: Took 5066 ms
Sorter with key reuse on primitive int array: Took 5164 ms
Sorter with key reuse on primitive int array: Took 5203 ms
Sorter with key reuse on primitive int array: Took 5274 ms
Sorter with key reuse on primitive int array: Took 5186 ms
Sorter with key reuse on primitive int array: Took 5159 ms
Sorter with key reuse on primitive int array: Took 5164 ms
Sorter with key reuse on primitive int array: Took 5078 ms
Sorter with key reuse on primitive int array: (5311 ms first try, 5173 ms average)
~~~

So with key reuse, it is faster and less likely to trigger GC.

Author: Xiangrui Meng <meng@databricks.com>
Author: Reynold Xin <rxin@apache.org>

Closes #2937 from mengxr/SPARK-4084 and squashes the following commits:

d73c3d0 [Xiangrui Meng] address comments
0b7b682 [Xiangrui Meng] fix mima
a72f53c [Xiangrui Meng] update timeIt
38ba50c [Xiangrui Meng] update timeIt
720f731 [Xiangrui Meng] add doc about JIT specialization
78f2879 [Xiangrui Meng] update tests
7de2efd [Xiangrui Meng] update the Sorter benchmark code to be correct
8626356 [Xiangrui Meng] add prepare to timeIt and update testsin SorterSuite
5f0d530 [Xiangrui Meng] update method modifiers of SortDataFormat
6ffbe66 [Xiangrui Meng] rename Sorter to TimSort and add a Scala wrapper that is private[spark]
b00db4d [Xiangrui Meng] doc and tests
cf94e8a [Xiangrui Meng] renaming
464ddce [Reynold Xin] cherry-pick rxin's commit
2014-10-28 15:14:41 -07:00
zsxwing abcafcfba3 [Spark 3922] Refactor spark-core to use Utils.UTF_8
A global UTF8 constant is very helpful to handle encoding problems when converting between String and bytes. There are several solutions here:

1. Add `val UTF_8 = Charset.forName("UTF-8")` to Utils.scala
2. java.nio.charset.StandardCharsets.UTF_8 (require JDK7)
3. io.netty.util.CharsetUtil.UTF_8
4. com.google.common.base.Charsets.UTF_8
5. org.apache.commons.lang.CharEncoding.UTF_8
6. org.apache.commons.lang3.CharEncoding.UTF_8

IMO, I prefer option 1) because people can find it easily.

This is a PR for option 1) and only fixes Spark Core.

Author: zsxwing <zsxwing@gmail.com>

Closes #2781 from zsxwing/SPARK-3922 and squashes the following commits:

f974edd [zsxwing] Merge branch 'master' into SPARK-3922
2d27423 [zsxwing] Refactor spark-core to use Refactor spark-core to use Utils.UTF_8
2014-10-28 14:26:57 -07:00
Shivaram Venkataraman 7768a800d4 [SPARK-4031] Make torrent broadcast read blocks on use.
This avoids reading torrent broadcast variables when they are referenced in the closure but not used in the closure. This is done by using a `lazy val` to read broadcast blocks

cc rxin JoshRosen for review

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #2871 from shivaram/broadcast-read-value and squashes the following commits:

1456d65 [Shivaram Venkataraman] Use getUsedTimeMs and remove readObject
d6c5ee9 [Shivaram Venkataraman] Use laxy val to implement readBroadcastBlock
0b34df7 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into broadcast-read-value
9cec507 [Shivaram Venkataraman] Test if broadcast variables are read lazily
768b40b [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into broadcast-read-value
8792ed8 [Shivaram Venkataraman] Make torrent broadcast read blocks on use. This avoids reading broadcast variables when they are referenced in the closure but not used by the code.
2014-10-28 10:14:16 -07:00
Josh Rosen 46c63417c1 [SPARK-4107] Fix incorrect handling of read() and skip() return values
`read()` may return fewer bytes than requested; when this occurred, the old code would silently return less data than requested, which might cause stream corruption errors.  `skip()` faces similar issues, too.

This patch fixes several cases where we mis-handle these methods' return values.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #2969 from JoshRosen/file-channel-read-fix and squashes the following commits:

e724a9f [Josh Rosen] Fix similar issue of not checking skip() return value.
cbc03ce [Josh Rosen] Update the other log message, too.
01e6015 [Josh Rosen] file.getName -> file.getAbsolutePath
d961d95 [Josh Rosen] Fix another issue in FileServerSuite.
b9265d2 [Josh Rosen] Fix a similar (minor) issue in TestUtils.
cd9d76f [Josh Rosen] Fix a similar error in Tachyon:
3db0008 [Josh Rosen] Fix a similar read() error in Utils.offsetBytes().
db985ed [Josh Rosen] Fix unsafe usage of FileChannel.read():
2014-10-28 00:04:16 -07:00
Sean Owen bfa614b127 SPARK-4022 [CORE] [MLLIB] Replace colt dependency (LGPL) with commons-math
This change replaces usages of colt with commons-math3 equivalents, and makes some minor necessary adjustments to related code and tests to match.

Author: Sean Owen <sowen@cloudera.com>

Closes #2928 from srowen/SPARK-4022 and squashes the following commits:

61a232f [Sean Owen] Fix failure due to different sampling in JavaAPISuite.sample()
16d66b8 [Sean Owen] Simplify seeding with call to reseedRandomGenerator
a1a78e0 [Sean Owen] Use Well19937c
31c7641 [Sean Owen] Fix Python Poisson test by choosing a different seed; about 88% of seeds should work but 1 didn't, it seems
5c9c67f [Sean Owen] Additional test fixes from review
d8f88e0 [Sean Owen] Replace colt with commons-math3. Some tests do not pass yet.
2014-10-27 10:53:15 -07:00
Sandy Ryza dea302ddbd SPARK-2621. Update task InputMetrics incrementally
The patch takes advantage an API provided in Hadoop 2.5 that allows getting accurate data on Hadoop FileSystem bytes read.  It eliminates the old method, which naively accepts the split size as the input bytes.  An impact of this change will be that input metrics go away when using against Hadoop versions earlier thatn 2.5.  I can add this back in, but my opinion is that no metrics are better than inaccurate metrics.

This is difficult to write a test for because we don't usually build against a version of Hadoop that contains the function we need.  I've tested it manually on a pseudo-distributed cluster.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #2087 from sryza/sandy-spark-2621 and squashes the following commits:

23010b8 [Sandy Ryza] Missing style fixes
74fc9bb [Sandy Ryza] Make getFSBytesReadOnThreadCallback private
1ab662d [Sandy Ryza] Clear things up a bit
984631f [Sandy Ryza] Switch from pull to push model and add test
7ef7b22 [Sandy Ryza] Add missing curly braces
219abc9 [Sandy Ryza] Fall back to split size
90dbc14 [Sandy Ryza] SPARK-2621. Update task InputMetrics incrementally
2014-10-27 10:04:24 -07:00
Shivaram Venkataraman 9aa340a23f [SPARK-4030] Make destroy public for broadcast variables
This change makes the destroy function public for broadcast variables. Motivation for the change is described in https://issues.apache.org/jira/browse/SPARK-4030.
This patch also logs where destroy was called from if a broadcast variable is used after destruction.

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #2922 from shivaram/broadcast-destroy and squashes the following commits:

a11abab [Shivaram Venkataraman] Fix scala style in Utils.scala
bed9c9d [Shivaram Venkataraman] Make destroy blocking by default
e80c1ab [Shivaram Venkataraman] Make destroy public for broadcast variables Also log where destroy was called from if a broadcast variable is used after destruction.
2014-10-27 08:45:36 -07:00
Josh Rosen bf589fc717 [SPARK-3616] Add basic Selenium tests to WebUISuite
This patch adds Selenium tests for Spark's web UI.  To avoid adding extra
dependencies to the test environment, the tests use Selenium's HtmlUnitDriver,
which is pure-Java, instead of, say, ChromeDriver.

I added new tests to try to reproduce a few UI bugs reported on JIRA, namely
SPARK-3021, SPARK-2105, and SPARK-2527.  I wasn't able to reproduce these bugs;
I suspect that the older ones might have been fixed by other patches.

In order to use HtmlUnitDriver, I added an explicit dependency on the
org.apache.httpcomponents version of httpclient in order to prevent jets3t's
older version from taking precedence on the classpath.

I also upgraded ScalaTest to 2.2.1.

Author: Josh Rosen <joshrosen@apache.org>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #2474 from JoshRosen/webui-selenium-tests and squashes the following commits:

fcc9e83 [Josh Rosen] scalautils -> scalactic package rename
510e54a [Josh Rosen] [SPARK-3616] Add basic Selenium tests to WebUISuite.
2014-10-26 11:29:27 -07:00
Josh Rosen 9530316887 [SPARK-2321] Stable pull-based progress / status API
This pull request is a first step towards the implementation of a stable, pull-based progress / status API for Spark (see [SPARK-2321](https://issues.apache.org/jira/browse/SPARK-2321)).  For now, I'd like to discuss the basic implementation, API names, and overall interface design.  Once we arrive at a good design, I'll go back and add additional methods to expose more information via these API.

#### Design goals:

- Pull-based API
- Usable from Java / Scala / Python (eventually, likely with a wrapper)
- Can be extended to expose more information without introducing binary incompatibilities.
- Returns immutable objects.
- Don't leak any implementation details, preserving our freedom to change the implementation.

#### Implementation:

- Add public methods (`getJobInfo`, `getStageInfo`) to SparkContext to allow status / progress information to be retrieved.
- Add public interfaces (`SparkJobInfo`, `SparkStageInfo`) for our API return values.  These interfaces consist entirely of Java-style getter methods.  The interfaces are currently implemented in Java.  I decided to explicitly separate the interface from its implementation (`SparkJobInfoImpl`, `SparkStageInfoImpl`) in order to prevent users from constructing these responses themselves.
-Allow an existing JobProgressListener to be used when constructing a live SparkUI.  This allows us to re-use this listeners in the implementation of this status API.  There are a few reasons why this listener re-use makes sense:
   - The status API and web UI are guaranteed to show consistent information.
   - These listeners are already well-tested.
   - The same garbage-collection / information retention configurations can apply to both this API and the web UI.
- Extend JobProgressListener to maintain `jobId -> Job` and `stageId -> Stage` mappings.

The progress API methods are implemented in a separate trait that's mixed into SparkContext.  This helps to avoid SparkContext.scala from becoming larger and more difficult to read.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <joshrosen@apache.org>

Closes #2696 from JoshRosen/progress-reporting-api and squashes the following commits:

e6aa78d [Josh Rosen] Add tests.
b585c16 [Josh Rosen] Accept SparkListenerBus instead of more specific subclasses.
c96402d [Josh Rosen] Address review comments.
2707f98 [Josh Rosen] Expose current stage attempt id
c28ba76 [Josh Rosen] Update demo code:
646ff1d [Josh Rosen] Document spark.ui.retainedJobs.
7f47d6d [Josh Rosen] Clean up SparkUI constructors, per Andrew's feedback.
b77b3d8 [Josh Rosen] Merge remote-tracking branch 'origin/master' into progress-reporting-api
787444c [Josh Rosen] Move status API methods into trait that can be mixed into SparkContext.
f9a9a00 [Josh Rosen] More review comments:
3dc79af [Josh Rosen] Remove creation of unused listeners in SparkContext.
249ca16 [Josh Rosen] Address several review comments:
da5648e [Josh Rosen] Add example of basic progress reporting in Java.
7319ffd [Josh Rosen] Add getJobIdsForGroup() and num*Tasks() methods.
cc568e5 [Josh Rosen] Add note explaining that interfaces should not be implemented outside of Spark.
6e840d4 [Josh Rosen] Remove getter-style names and "consistent snapshot" semantics:
08cbec9 [Josh Rosen] Begin to sketch the interfaces for a stable, public status API.
ac2d13a [Josh Rosen] Add jobId->stage, stageId->stage mappings in JobProgressListener
24de263 [Josh Rosen] Create UI listeners in SparkContext instead of in Tabs:
2014-10-25 00:06:57 -07:00
Kousuke Saruta 098f83c7cc [SPARK-4075] [Deploy] Jar url validation is not enough for Jar file
In deploy.ClientArguments.isValidJarUrl, the url is checked as follows.

    def isValidJarUrl(s: String): Boolean = s.matches("(.+):(.+)jar")

So, it allows like 'hdfs:file.jar' (no authority).

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2925 from sarutak/uri-syntax-check-improvement and squashes the following commits:

cf06173 [Kousuke Saruta] Improved URI syntax checking
2014-10-24 13:08:59 -07:00
Josh Rosen 83b7a1c650 [SPARK-4019] [SPARK-3740] Fix MapStatus compression bug that could lead to empty results or Snappy errors
This commit fixes a bug in MapStatus that could cause jobs to wrongly return
empty results if those jobs contained stages with more than 2000 partitions
where most of those partitions were empty.

For jobs with > 2000 partitions, MapStatus uses HighlyCompressedMapStatus,
which only stores the average size of blocks.  If the average block size is
zero, then this will cause all blocks to be reported as empty, causing
BlockFetcherIterator to mistakenly skip them.

For example, this would return an empty result:

    sc.makeRDD(0 until 10, 1000).repartition(2001).collect()

This can also lead to deserialization errors (e.g. Snappy decoding errors)
for jobs with > 2000 partitions where the average block size is non-zero but
there is at least one empty block.  In this case, the BlockFetcher attempts to
fetch empty blocks and fails when trying to deserialize them.

The root problem here is that MapStatus has a (previously undocumented)
correctness property that was violated by HighlyCompressedMapStatus:

    If a block is non-empty, then getSizeForBlock must be non-zero.

I fixed this by modifying HighlyCompressedMapStatus to store the average size
of _non-empty_ blocks and to use a compressed bitmap to track which blocks are
empty.

I also removed a test which was broken as originally written: it attempted
to check that HighlyCompressedMapStatus's size estimation error was < 10%,
but this was broken because HighlyCompressedMapStatus is only used for map
statuses with > 2000 partitions, but the test only created 50.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #2866 from JoshRosen/spark-4019 and squashes the following commits:

fc8b490 [Josh Rosen] Roll back hashset change, which didn't improve performance.
5faa0a4 [Josh Rosen] Incorporate review feedback
c8b8cae [Josh Rosen] Two performance fixes:
3b892dd [Josh Rosen] Address Reynold's review comments
ba2e71c [Josh Rosen] Add missing newline
609407d [Josh Rosen] Use Roaring Bitmap to track non-empty blocks.
c23897a [Josh Rosen] Use sets when comparing collect() results
91276a3 [Josh Rosen] [SPARK-4019] Fix MapStatus compression bug that could lead to empty results.
2014-10-23 16:39:32 -07:00
Josh Rosen 813effc701 [SPARK-3426] Fix sort-based shuffle error when spark.shuffle.compress and spark.shuffle.spill.compress settings are different
This PR fixes SPARK-3426, an issue where sort-based shuffle crashes if the
`spark.shuffle.spill.compress` and `spark.shuffle.compress` settings have
different values.

The problem is that sort-based shuffle's read and write paths use different
settings for determining whether to apply compression.  ExternalSorter writes
runs to files using `TempBlockId` ids, which causes
`spark.shuffle.spill.compress` to be used for enabling compression, but these
spilled files end up being shuffled over the network and read as shuffle files
using `ShuffleBlockId` by BlockStoreShuffleFetcher, which causes
`spark.shuffle.compress` to be used for enabling decompression.  As a result,
this leads to errors when these settings disagree.

Based on the discussions in #2247 and #2178, it sounds like we don't want to
remove the `spark.shuffle.spill.compress` setting.  Therefore, I've tried to
come up with a fix where `spark.shuffle.spill.compress` is used to compress
data that's read and written locally and `spark.shuffle.compress` is used to
compress any data that will be fetched / read as shuffle blocks.

To do this, I split `TempBlockId` into two new id types, `TempLocalBlockId` and
`TempShuffleBlockId`, which map to `spark.shuffle.spill.compress` and
`spark.shuffle.compress`, respectively.  ExternalAppendOnlyMap also used temp
blocks for spilling data.  It looks like ExternalSorter was designed to be
a generic sorter but its configuration already happens to be tied to sort-based
shuffle, so I think it's fine if we use `spark.shuffle.compress` to compress
its spills; we can move the compression configuration to the constructor in
a later commit if we find that ExternalSorter is being used in other contexts
where we want different configuration options to control compression.  To
summarize:

**Before:**

|       | ExternalAppendOnlyMap        | ExternalSorter               |
|-------|------------------------------|------------------------------|
| Read  | spark.shuffle.spill.compress | spark.shuffle.compress       |
| Write | spark.shuffle.spill.compress | spark.shuffle.spill.compress |

**After:**

|       | ExternalAppendOnlyMap        | ExternalSorter         |
|-------|------------------------------|------------------------|
| Read  | spark.shuffle.spill.compress | spark.shuffle.compress |
| Write | spark.shuffle.spill.compress | spark.shuffle.compress |

Thanks to andrewor14 for debugging this with me!

Author: Josh Rosen <joshrosen@databricks.com>

Closes #2890 from JoshRosen/SPARK-3426 and squashes the following commits:

1921cf6 [Josh Rosen] Minor edit for clarity.
c8dd8f2 [Josh Rosen] Add comment explaining use of createTempShuffleBlock().
2c687b9 [Josh Rosen] Fix SPARK-3426.
91e7e40 [Josh Rosen] Combine tests into single test of all combinations
76ca65e [Josh Rosen] Add regression test for SPARK-3426.
2014-10-22 14:49:58 -07:00
Sandy Ryza 6bb56faea8 SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy
Author: Sandy Ryza <sandy@cloudera.com>

Closes #789 from sryza/sandy-spark-1813 and squashes the following commits:

48b05e9 [Sandy Ryza] Simplify
b824932 [Sandy Ryza] Allow both spark.kryo.classesToRegister and spark.kryo.registrator at the same time
6a15bb7 [Sandy Ryza] Small fix
a2278c0 [Sandy Ryza] Respond to review comments
6ef592e [Sandy Ryza] SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy
2014-10-21 21:53:09 -07:00
Josh Rosen 5a8f64f336 [SPARK-3958] TorrentBroadcast cleanup / debugging improvements.
This PR makes several changes to TorrentBroadcast in order to make
it easier to reason about, which should help when debugging SPARK-3958.
The key changes:

- Remove all state from the global TorrentBroadcast object.  This state
  consisted mainly of configuration options, like the block size and
  compression codec, and was read by the blockify / unblockify methods.
  Unfortunately, the use of `lazy val` for `BLOCK_SIZE` meant that the block
  size was always determined by the first SparkConf that TorrentBroadast was
  initialized with; as a result, unit tests could not properly test
  TorrentBroadcast with different block sizes.

  Instead, blockifyObject and unBlockifyObject now accept compression codecs
  and blockSizes as arguments.  These arguments are supplied at the call sites
  inside of TorrentBroadcast instances.  Each TorrentBroadcast instance
  determines these values from SparkEnv's SparkConf.  I was careful to ensure
  that we do not accidentally serialize CompressionCodec or SparkConf objects
  as part of the TorrentBroadcast object.

- Remove special-case handling of local-mode in TorrentBroadcast.  I don't
  think that broadcast implementations should know about whether we're running
  in local mode.  If we want to optimize the performance of broadcast in local
  mode, then we should detect this at a higher level and use a dummy
  LocalBroadcastFactory implementation instead.

  Removing this code fixes a subtle error condition: in the old local mode
  code, a failure to find the broadcast in the local BlockManager would lead
  to an attempt to deblockify zero blocks, which could lead to confusing
  deserialization or decompression errors when we attempted to decompress
  an empty byte array.  This should never have happened, though: a failure to
  find the block in local mode is evidence of some other error.  The changes
  here will make it easier to debug those errors if they ever happen.

- Add a check that throws an exception when attempting to deblockify an
  empty array.

- Use ScalaCheck to add a test to check that TorrentBroadcast's
  blockifyObject and unBlockifyObject methods are inverses.

- Misc. cleanup and logging improvements.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #2844 from JoshRosen/torrentbroadcast-bugfix and squashes the following commits:

1e8268d [Josh Rosen] Address Reynold's review comments
2a9fdfd [Josh Rosen] Address Reynold's review comments.
c3b08f9 [Josh Rosen] Update TorrentBroadcast tests to reflect removal of special local-mode optimizations.
5c22782 [Josh Rosen] Store broadcast variable's value in the driver.
33fc754 [Josh Rosen] Change blockify/unblockifyObject to accept serializer as argument.
618a872 [Josh Rosen] [SPARK-3958] TorrentBroadcast cleanup / debugging improvements.
2014-10-21 00:49:11 -07:00
Josh Rosen d1966f3a8b [SPARK-3902] [SPARK-3590] Stabilize AsynRDDActions and add Java API
This PR adds a Java API for AsyncRDDActions and promotes the API from `Experimental` to stable.

Author: Josh Rosen <joshrosen@apache.org>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #2760 from JoshRosen/async-rdd-actions-in-java and squashes the following commits:

0d45fbc [Josh Rosen] Whitespace fix.
ad3ae53 [Josh Rosen] Merge remote-tracking branch 'origin/master' into async-rdd-actions-in-java
c0153a5 [Josh Rosen] Remove unused variable.
e8e2867 [Josh Rosen] Updates based on Marcelo's review feedback
7a1417f [Josh Rosen] Removed unnecessary java.util import.
6f8f6ac [Josh Rosen] Fix import ordering.
ff28e49 [Josh Rosen] Add MiMa excludes and fix a scalastyle error.
346e46e [Josh Rosen] [SPARK-3902] Stabilize AsyncRDDActions; add Java API.
2014-10-19 20:02:31 -07:00
Prashant Sharma 2fe0ba9561 SPARK-3874: Provide stable TaskContext API
This is a small number of clean-up changes on top of #2782. Closes #2782.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #2803 from pwendell/pr-2782 and squashes the following commits:

56d5b7a [Patrick Wendell] Minor clean-up
44089ec [Patrick Wendell] Clean-up the TaskContext API.
ed551ce [Prashant Sharma] Fixed a typo
df261d0 [Prashant Sharma] Josh's suggestion
facf3b1 [Prashant Sharma] Fixed the mima issue.
7ecc2fe [Prashant Sharma] CR, Moved implementations to TaskContextImpl
bbd9e05 [Prashant Sharma] adding missed out files to git.
ef633f5 [Prashant Sharma] SPARK-3874, Provide stable TaskContext API
2014-10-16 21:38:45 -04:00
GuoQiang Li 293a0b5dbb [SPARK-2098] All Spark processes should support spark-defaults.conf, config file
This is another implementation about #1256
cc andrewor14 vanzin

Author: GuoQiang Li <witgo@qq.com>

Closes #2379 from witgo/SPARK-2098-new and squashes the following commits:

4ef1cbd [GuoQiang Li] review commit
49ef70e [GuoQiang Li] Refactor getDefaultPropertiesFile
c45d20c [GuoQiang Li] All Spark processes should support spark-defaults.conf, config file
2014-10-14 22:16:38 -07:00
Bill Bejeck 9b6de6fbc0 SPARK-3178 setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero
Validate the memory is greater than zero when set from the SPARK_WORKER_MEMORY environment variable or command line without a g or m label.  Added unit tests. If memory is 0 an IllegalStateException is thrown. Updated unit tests to mock environment variables by subclassing SparkConf (tip provided by Josh Rosen).   Updated WorkerArguments to use SparkConf.getenv instead of System.getenv for reading the SPARK_WORKER_MEMORY environment variable.

Author: Bill Bejeck <bbejeck@gmail.com>

Closes #2309 from bbejeck/spark-memory-worker and squashes the following commits:

51cf915 [Bill Bejeck] SPARK-3178 - Validate the memory is greater than zero when set from the SPARK_WORKER_MEMORY environment variable or command line without a g or m label.  Added unit tests. If memory is 0 an IllegalStateException is thrown. Updated unit tests to mock environment variables by subclassing SparkConf (tip provided by Josh Rosen).   Updated WorkerArguments to use SparkConf.getenv instead of System.getenv for reading the SPARK_WORKER_MEMORY environment variable.
2014-10-14 12:12:38 -07:00
Aaron Davidson 186b497c94 [SPARK-3921] Fix CoarseGrainedExecutorBackend's arguments for Standalone mode
The goal of this patch is to fix the swapped arguments in standalone mode, which was caused by  79e45c9323 (diff-79391110e9f26657e415aa169a004998R153).

More details can be found in the JIRA: [SPARK-3921](https://issues.apache.org/jira/browse/SPARK-3921)

Tested in Standalone mode, but not in Mesos.

Author: Aaron Davidson <aaron@databricks.com>

Closes #2779 from aarondav/fix-standalone and squashes the following commits:

725227a [Aaron Davidson] Fix ExecutorRunnerTest
9d703fe [Aaron Davidson] [SPARK-3921] Fix CoarseGrainedExecutorBackend's arguments for Standalone mode
2014-10-13 23:31:37 -07:00
Jakub Dubovský fc616d51a5 [SPARK-3121] Wrong implementation of implicit bytesWritableConverter
val path = ... //path to seq file with BytesWritable as type of both key and value
val file = sc.sequenceFile[Array[Byte],Array[Byte]](path)
file.take(1)(0)._1

This prints incorrect content of byte array. Actual content starts with correct one and some "random" bytes and zeros are appended. BytesWritable has two methods:

getBytes() - return content of all internal array which is often longer then actual value stored. It usually contains the rest of previous longer values

copyBytes() - return just begining of internal array determined by internal length property

It looks like in implicit conversion between BytesWritable and Array[byte] getBytes is used instead of correct copyBytes.

dbtsai

Author: Jakub Dubovský <james64@inMail.sk>
Author: Dubovsky Jakub <dubovsky@avast.com>

Closes #2712 from james64/3121-bugfix and squashes the following commits:

f85d24c [Jakub Dubovský] Test name changed, comments added
1b20d51 [Jakub Dubovský] Import placed correctly
406e26c [Jakub Dubovský] Scala style fixed
f92ffa6 [Dubovsky Jakub] performance tuning
480f9cd [Dubovsky Jakub] Bug 3121 fixed
2014-10-12 22:03:26 -07:00