Commit graph

19 commits

Author SHA1 Message Date
Marcelo Vanzin 38d4e9e446 [SPARK-6229] Add SASL encryption to network library.
There are two main parts of this change:

- Extending the bootstrap mechanism in the network library to add a server-side
  bootstrap (which works a little bit differently than the client-side bootstrap), and
  to allow the  bootstraps to modify the underlying channel.

- Use SASL to encrypt data going through the RPC channel.

The second item requires some non-optimal code to be able to work around the
fact that the outbound path in netty is not thread-safe, and ordering is very important
when encryption is in the picture.

A lot of the changes outside the network/common library are just to adjust to the
changed API for initializing the RPC server.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5377 from vanzin/SPARK-6229 and squashes the following commits:

ff01966 [Marcelo Vanzin] Use fancy new size config style.
be53f32 [Marcelo Vanzin] Merge branch 'master' into SPARK-6229
47d4aff [Marcelo Vanzin] Merge branch 'master' into SPARK-6229
7a2a805 [Marcelo Vanzin] Clean up some unneeded changes.
2f92237 [Marcelo Vanzin] Add comment.
67bb0c6 [Marcelo Vanzin] Revert "Avoid exposing ByteArrayWritableChannel outside of test code."
065f684 [Marcelo Vanzin] Add test to verify chunking.
3d1695d [Marcelo Vanzin] Minor cleanups.
73cff0e [Marcelo Vanzin] Skip bytes in decode path too.
318ad23 [Marcelo Vanzin] Avoid exposing ByteArrayWritableChannel outside of test code.
346f829 [Marcelo Vanzin] Avoid trip through channel selector by not reporting 0 bytes written.
a4a5938 [Marcelo Vanzin] Review feedback.
4797519 [Marcelo Vanzin] Remove unused import.
9908ada [Marcelo Vanzin] Fix test, SASL backend disposal.
7fe1489 [Marcelo Vanzin] Add a test that makes sure encryption is actually enabled.
adb6f9d [Marcelo Vanzin] Review feedback.
cf2a605 [Marcelo Vanzin] Clean up some code.
8584323 [Marcelo Vanzin] Fix a comment.
e98bc55 [Marcelo Vanzin] Add option to only allow encrypted connections to the server.
dad42fc [Marcelo Vanzin] Make encryption thread-safe, less memory-intensive.
b00999a [Marcelo Vanzin] Consolidate ByteArrayWritableChannel, fix SASL code to match master changes.
b923cae [Marcelo Vanzin] Make SASL encryption handler thread-safe, handle FileRegion messages.
39539a7 [Marcelo Vanzin] Add config option to enable SASL encryption.
351a86f [Marcelo Vanzin] Add SASL encryption to network library.
fbe6ccb [Marcelo Vanzin] Add TransportServerBootstrap, make SASL code use it.
2015-05-01 19:01:46 -07:00
Marcelo Vanzin 5b335bdda3 [SPARK-6228] [network] Move SASL classes from network/shuffle to network...
.../common.

No code changes. Left the shuffle-related files in the shuffle module.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #4953 from vanzin/SPARK-6228 and squashes the following commits:

664ef30 [Marcelo Vanzin] [SPARK-6228] [network] Move SASL classes from network/shuffle to network/common.
2015-03-11 13:16:22 +00:00
Vinod K C dba0b2eadb [SPARK-6178][Shuffle] Removed unused imports
Author: Vinod K C <vinod.kchuawei.com>

Author: Vinod K C <vinod.kc@huawei.com>

Closes #4900 from vinodkc/unused_imports and squashes the following commits:

5373456 [Vinod K C] Removed empty lines
9da7438 [Vinod K C] Changed order of import
594d471 [Vinod K C] Removed unused imports
2015-03-06 14:43:09 +00:00
Aaron Davidson b4034c3f88 [Minor] Fix test RetryingBlockFetcherSuite after changed config name
Flakey due to the default retry interval being the same as our test's wait timeout.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3972 from aarondav/fix-test and squashes the following commits:

db77cab [Aaron Davidson] [Minor] Fix test after changed config name
2015-01-09 09:20:16 -08:00
zsxwing 8773705fd4 [SPARK-4883][Shuffle] Add a name to the directoryCleaner thread
Author: zsxwing <zsxwing@gmail.com>

Closes #3734 from zsxwing/SPARK-4883 and squashes the following commits:

e6f2b61 [zsxwing] Fix the name
cc74727 [zsxwing] Add a name to the directoryCleaner thread
2014-12-22 12:11:36 -08:00
Reynold Xin 9bd9334f58 Config updates for the new shuffle transport.
Author: Reynold Xin <rxin@databricks.com>

Closes #3657 from rxin/conf-update and squashes the following commits:

7370eab [Reynold Xin] Config updates for the new shuffle transport.
2014-12-09 19:29:09 -08:00
Sean Owen d8f84f26e3 SPARK-4805 [CORE] BlockTransferMessage.toByteArray() trips assertion
Allocate enough room for type byte as well as message, to avoid tripping assertion about capacity of the buffer

Author: Sean Owen <sowen@cloudera.com>

Closes #3650 from srowen/SPARK-4805 and squashes the following commits:

9e1d502 [Sean Owen] Allocate enough room for type byte as well as message, to avoid tripping assertion about capacity of the buffer
2014-12-09 16:38:27 -08:00
Xiangrui Meng 4b0c1edfdf [SPARK-4326] fix unidoc
There are two issues:

1. specifying guava 11.0.2 will cause hashInt not found in unidoc (any reason to force the version here?)
2. unidoc doesn't recognize static class defined in a base class

aarondav srowen vanzin

Author: Xiangrui Meng <meng@databricks.com>

Closes #3253 from mengxr/SPARK-4326 and squashes the following commits:

53967bf [Xiangrui Meng] fix unidoc
2014-11-13 13:16:20 -08:00
Reynold Xin ef29a9a9aa [SPARK-4307] Initialize FileDescriptor lazily in FileRegion.
Netty's DefaultFileRegion requires a FileDescriptor in its constructor, which means we need to have a opened file handle. In super large workloads, this could lead to too many open files due to the way these file descriptors are cleaned. This pull request creates a new LazyFileRegion that initializes the FileDescriptor when we are sending data for the first time.

Author: Reynold Xin <rxin@databricks.com>
Author: Reynold Xin <rxin@apache.org>

Closes #3172 from rxin/lazyFD and squashes the following commits:

0bdcdc6 [Reynold Xin] Added reference to Netty's DefaultFileRegion
d4564ae [Reynold Xin] Added SparkConf to the ctor argument of IndexShuffleBlockManager.
6ed369e [Reynold Xin] Code review feedback.
04cddc8 [Reynold Xin] [SPARK-4307] Initialize FileDescriptor lazily in FileRegion.
2014-11-11 00:25:31 -08:00
Aaron Davidson d4fa04e50d [SPARK-4187] [Core] Switch to binary protocol for external shuffle service messages
This PR elimiantes the network package's usage of the Java serializer and replaces it with Encodable, which is a lightweight binary protocol. Each message is preceded by a type id, which will allow us to change messages (by only adding new ones), or to change the format entirely by switching to a special id (such as -1).

This protocol has the advantage over Java that we can guarantee that messages will remain compatible across compiled versions and JVMs, though it does not provide a clean way to do schema migration. In the future, it may be good to use a more heavy-weight serialization format like protobuf, thrift, or avro, but these all add several dependencies which are unnecessary at the present time.

Additionally this unifies the RPC messages of NettyBlockTransferService and ExternalShuffleClient.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3146 from aarondav/free and squashes the following commits:

ed1102a [Aaron Davidson] Remove some unused imports
b8e2a49 [Aaron Davidson] Add appId to test
538f2a3 [Aaron Davidson] [SPARK-4187] [Core] Switch to binary protocol for external shuffle service messages
2014-11-07 09:42:21 -08:00
Aaron Davidson 48a19a6dba [SPARK-4236] Cleanup removed applications' files in shuffle service
This relies on a hook from whoever is hosting the shuffle service to invoke removeApplication() when the application is completed. Once invoked, we will clean up all the executors' shuffle directories we know about.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3126 from aarondav/cleanup and squashes the following commits:

33a64a9 [Aaron Davidson] Missing brace
e6e428f [Aaron Davidson] Address comments
16a0d27 [Aaron Davidson] Cleanup
e4df3e7 [Aaron Davidson] [SPARK-4236] Cleanup removed applications' files in shuffle service
2014-11-06 19:54:32 -08:00
Aaron Davidson f165b2bbf5 [SPARK-4188] [Core] Perform network-level retry of shuffle file fetches
This adds a RetryingBlockFetcher to the NettyBlockTransferService which is wrapped around our typical OneForOneBlockFetcher, adding retry logic in the event of an IOException.

This sort of retry allows us to avoid marking an entire executor as failed due to garbage collection or high network load.

TODO:
- [x] unit tests
- [x] put in ExternalShuffleClient too

Author: Aaron Davidson <aaron@databricks.com>

Closes #3101 from aarondav/retry and squashes the following commits:

72a2a32 [Aaron Davidson] Add that we should remove the condition around the retry thingy
c7fd107 [Aaron Davidson] Fix unit tests
e80e4c2 [Aaron Davidson] Address initial comments
6f594cd [Aaron Davidson] Fix unit test
05ff43c [Aaron Davidson] Add to external shuffle client and add unit test
66e5a24 [Aaron Davidson] [SPARK-4238] [Core] Perform network-level retry of shuffle file fetches
2014-11-06 18:39:14 -08:00
Aaron Davidson 6e9ef10fd7 [SPARK-4277] Support external shuffle service on Standalone Worker
Author: Aaron Davidson <aaron@databricks.com>

Closes #3142 from aarondav/worker and squashes the following commits:

3780bd7 [Aaron Davidson] Address comments
2dcdfc1 [Aaron Davidson] Add private[worker]
47f49d3 [Aaron Davidson] NettyBlockTransferService shouldn't care about app ids (it's only b/t executors)
258417c [Aaron Davidson] [SPARK-4277] Support external shuffle service on executor
2014-11-06 17:20:46 -08:00
Andrew Or 96136f222a [SPARK-3797] Minor addendum to Yarn shuffle service
I did not realize there was a `network.util.JavaUtils` when I wrote this code. This PR moves the `ByteBuffer` string conversion to the appropriate place. I tested the changes on a stable yarn cluster.

Author: Andrew Or <andrew@databricks.com>

Closes #3144 from andrewor14/yarn-shuffle-util and squashes the following commits:

b6c08bf [Andrew Or] Remove unused import
94e205c [Andrew Or] Use netty Unpooled
85202a5 [Andrew Or] Use guava Charsets
057135b [Andrew Or] Reword comment
adf186d [Andrew Or] Move byte buffer String conversion logic to JavaUtils
2014-11-06 17:18:49 -08:00
Andrew Or 61a5cced04 [SPARK-3797] Run external shuffle service in Yarn NM
This creates a new module `network/yarn` that depends on `network/shuffle` recently created in #3001. This PR introduces a custom Yarn auxiliary service that runs the external shuffle service. As of the changes here this shuffle service is required for using dynamic allocation with Spark.

This is still WIP mainly because it doesn't handle security yet. I have tested this on a stable Yarn cluster.

Author: Andrew Or <andrew@databricks.com>

Closes #3082 from andrewor14/yarn-shuffle-service and squashes the following commits:

ef3ddae [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
0ee67a2 [Andrew Or] Minor wording suggestions
1c66046 [Andrew Or] Remove unused provided dependencies
0eb6233 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
6489db5 [Andrew Or] Try catch at the right places
7b71d8f [Andrew Or] Add detailed java docs + reword a few comments
d1124e4 [Andrew Or] Add security to shuffle service (INCOMPLETE)
5f8a96f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
9b6e058 [Andrew Or] Address various feedback
f48b20c [Andrew Or] Fix tests again
f39daa6 [Andrew Or] Do not make network-yarn an assembly module
761f58a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
15a5b37 [Andrew Or] Fix build for Hadoop 1.x
baff916 [Andrew Or] Fix tests
5bf9b7e [Andrew Or] Address a few minor comments
5b419b8 [Andrew Or] Add missing license header
804e7ff [Andrew Or] Include the Yarn shuffle service jar in the distribution
cd076a4 [Andrew Or] Require external shuffle service for dynamic allocation
ea764e0 [Andrew Or] Connect to Yarn shuffle service only if it's enabled
1bf5109 [Andrew Or] Use the shuffle service port specified through hadoop config
b4b1f0c [Andrew Or] 4 tabs -> 2 tabs
43dcb96 [Andrew Or] First cut integration of shuffle service with Yarn aux service
b54a0c4 [Andrew Or] Initial skeleton for Yarn shuffle service
2014-11-05 15:42:05 -08:00
Aaron Davidson 4c42986cc0 [SPARK-4242] [Core] Add SASL to external shuffle service
Does three things: (1) Adds SASL to ExternalShuffleClient, (2) puts SecurityManager in BlockManager's constructor, and (3) adds unit test.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3108 from aarondav/sasl-client and squashes the following commits:

48b622d [Aaron Davidson] Screw it, let's just get LimitedInputStream
3543b70 [Aaron Davidson] Back out of pom change due to unknown test issue?
b58518a [Aaron Davidson] ByteStreams.limit() not available :(
cbe451a [Aaron Davidson] Address comments
2bf2908 [Aaron Davidson] [SPARK-4242] [Core] Add SASL to external shuffle service
2014-11-05 14:38:43 -08:00
Aaron Davidson 5e73138a01 [SPARK-2938] Support SASL authentication in NettyBlockTransferService
Also lays the groundwork for supporting it inside the external shuffle service.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3087 from aarondav/sasl and squashes the following commits:

3481718 [Aaron Davidson] Delete rogue println
44f8410 [Aaron Davidson] Delete documentation - muahaha!
eb9f065 [Aaron Davidson] Improve documentation and add end-to-end test at Spark-level
a6b95f1 [Aaron Davidson] Address comments
785bbde [Aaron Davidson] Cleanup
79973cb [Aaron Davidson] Remove unused file
151b3c5 [Aaron Davidson] Add docs, timeout config, better failure handling
f6177d7 [Aaron Davidson] Cleanup SASL state upon connection termination
7b42adb [Aaron Davidson] Add unit tests
8191bcb [Aaron Davidson] [SPARK-2938] Support SASL authentication in NettyBlockTransferService
2014-11-04 16:15:38 -08:00
Aaron Davidson 2ebd1df3f1 [SPARK-4183] Close transport-related resources between SparkContexts
A leak of event loops may be causing test failures.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3053 from aarondav/leak and squashes the following commits:

e676d18 [Aaron Davidson] Typo!
8f96475 [Aaron Davidson] Keep original ssc semantics
7e49f10 [Aaron Davidson] A leak of event loops may be causing test failures.
2014-11-02 16:26:24 -08:00
Aaron Davidson f55218aeb1 [SPARK-3796] Create external service which can serve shuffle files
This patch introduces the tooling necessary to construct an external shuffle service which is independent of Spark executors, and then use this service inside Spark. An example (just for the sake of this PR) of the service creation can be found in Worker, and the service itself is used by plugging in the StandaloneShuffleClient as Spark's ShuffleClient (setup in BlockManager).

This PR continues the work from #2753, which extracted out the transport layer of Spark's block transfer into an independent package within Spark. A new package was created which contains the Spark business logic necessary to retrieve the actual shuffle data, which is completely independent of the transport layer introduced in the previous patch. Similar to the transport layer, this package must not depend on Spark as we anticipate plugging this service as a lightweight process within, say, the YARN NodeManager, and do not wish to include Spark's dependencies (including Scala itself).

There are several outstanding tasks which must be complete before this PR can be merged:
- [x] Complete unit testing of network/shuffle package.
- [x] Performance and correctness testing on a real cluster.
- [x] Remove example service instantiation from Worker.scala.

There are even more shortcomings of this PR which should be addressed in followup patches:
- Don't use Java serializer for RPC layer! It is not cross-version compatible.
- Handle shuffle file cleanup for dead executors once the application terminates or the ContextCleaner triggers.
- Documentation of the feature in the Spark docs.
- Improve behavior if the shuffle service itself goes down (right now we don't blacklist it, and new executors cannot spawn on that machine).
- SSL and SASL integration
- Nice to have: Handle shuffle file consolidation (this would requires changes to Spark's implementation).

Author: Aaron Davidson <aaron@databricks.com>

Closes #3001 from aarondav/shuffle-service and squashes the following commits:

4d1f8c1 [Aaron Davidson] Remove changes to Worker
705748f [Aaron Davidson] Rename Standalone* to External*
fd3928b [Aaron Davidson] Do not unregister executor outputs unduly
9883918 [Aaron Davidson] Make suggested build changes
3d62679 [Aaron Davidson] Add Spark integration test
7fe51d5 [Aaron Davidson] Fix SBT integration
56caa50 [Aaron Davidson] Address comments
c8d1ac3 [Aaron Davidson] Add unit tests
2f70c0c [Aaron Davidson] Fix unit tests
5483e96 [Aaron Davidson] Fix unit tests
46a70bf [Aaron Davidson] Whoops, bracket
5ea4df6 [Aaron Davidson] [SPARK-3796] Create external service which can serve shuffle files
2014-11-01 14:37:45 -07:00