ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Gabor Somogyi	e516f7e09e	[SPARK-28928][SS] Use Kafka delegation token protocol on sources/sinks ### What changes were proposed in this pull request? At the moment there are 3 places where the communication protocol with the Kafka cluster has to be set when a delegation token is used: * On delegation token * On source * On sink Most of the time users use the same protocol in all these places (within one Kafka cluster). It would be better to declare it in one place (delegation token side) and let Kafka sources/sinks take this config over. In this PR I've modified the code so that Kafka sources/sinks take over the delegation token side `security.protocol` configuration when the token and the source/sink match in `bootstrap.servers` configuration. This default configuration can be overridden on each source/sink independently by using the `kafka.security.protocol` configuration. ### Why are the changes needed? The current default behavior covers only a minority of the use cases and is inconvenient. ### Does this PR introduce any user-facing change? Yes, with this change users need to provide fewer configuration parameters by default. ### How was this patch tested? Existing + additional unit tests. Closes #25631 from gaborgsomogyi/SPARK-28928. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-09-09 15:41:51 -07:00
Gabor Somogyi	d502c80404	[SPARK-28922][SS] Safe Kafka parameter redaction ### What changes were proposed in this pull request? At the moment Kafka parameter redaction expects `SparkEnv`. This must exist in normal queries, but several unit tests do not provide it to keep things simple. As a result, such tests throw exceptions similar to this: ``` java.lang.NullPointerException at org.apache.spark.kafka010.KafkaRedactionUtil$.redactParams(KafkaRedactionUtil.scala:29) at org.apache.spark.kafka010.KafkaRedactionUtilSuite.$anonfun$new$1(KafkaRedactionUtilSuite.scala:33) at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149) at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56) at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214) at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56) at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396) at scala.collection.immutable.List.foreach(List.scala:392) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229) at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228) at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) at org.scalatest.Suite.run(Suite.scala:1147) at org.scalatest.Suite.run$(Suite.scala:1129) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233) at org.scalatest.SuperEngine.runImpl(Engine.scala:521) at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233) at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232) at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56) at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56) at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45) at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1346) at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1340) at scala.collection.immutable.List.foreach(List.scala:392) at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1340) at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:1031) at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:1010) at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1506) at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1010) at org.scalatest.tools.Runner$.run(Runner.scala:850) at org.scalatest.tools.Runner.run(Runner.scala) at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:131) at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28) ``` These are annoying and only red herrings, so I would like to make them disappear. There are basically 2 ways to handle this situation: * Add default value for `SparkEnv` in `KafkaRedactionUtil` * Add `SparkEnv` to all such tests => I think it would be overkill and would just increase the number of lines without real value Considering this I've chosen the first approach. ### Why are the changes needed? A couple of tests throw exceptions even though there is no real problem. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New + additional unit tests. Closes #25621 from gaborgsomogyi/safe-reduct. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-29 19:17:48 -07:00
Gabor Somogyi	7d72c073dd	[SPARK-28760][SS][TESTS] Add Kafka delegation token end-to-end test with mini KDC ### What changes were proposed in this pull request? At the moment no end-to-end Kafka delegation token test exists, mainly because an embedded KDC was missing. KDC is missing from the testing side in general, so I've explored what possibilities exist. The most obvious choice is the MiniKDC inside the Hadoop library, where Apache Kerby runs in the background. What this PR contains: * Added MiniKDC as test dependency from Hadoop * Added `maven-bundle-plugin` because a couple of dependencies come in bundle format * Added security mode to `KafkaTestUtils`. Namely start KDC -> start Zookeeper in secure mode -> start Kafka in secure mode * Added a roundtrip test (saves and reads back data from Kafka) ### Why are the changes needed? No such test exists + security testing with KDC is completely missing. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing + additional unit tests. I've put the additional test into a loop and it took ~10 sec on average. Closes #25477 from gaborgsomogyi/SPARK-28760. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-08-29 11:52:35 -07:00
Gabor Somogyi	d47c219f94	[SPARK-28055][SS][DSTREAMS] Add delegation token custom AdminClient configurations. ## What changes were proposed in this pull request? At the moment Kafka delegation tokens are fetched through `AdminClient` but there is no possibility to add custom configuration parameters. In [options](https://spark.apache.org/docs/2.4.3/structured-streaming-kafka-integration.html#kafka-specific-configurations) there is already a possibility to add custom configurations. In this PR I've added a similar possibility to `AdminClient`. ## How was this patch tested? Existing + added unit tests. ``` cd docs/ SKIP_API=1 jekyll build ``` Manual webpage check. Closes #24875 from gaborgsomogyi/SPARK-28055. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-11 09:36:24 -07:00
Gabor Somogyi	911fadf33a	[SPARK-27748][SS] Kafka consumer/producer password/token redaction. ## What changes were proposed in this pull request? Kafka parameters are logged in several places and the following parameters have to be redacted: * Delegation token * `ssl.truststore.password` * `ssl.keystore.password` * `ssl.key.password` This PR contains: * Spark central redaction framework used to redact passwords (`spark.redaction.regex`) * Custom redaction added to handle `sasl.jaas.config` (delegation token) * Redaction code added into consumer/producer code * Test refactor ## How was this patch tested? Existing + additional unit tests. Closes #24627 from gaborgsomogyi/SPARK-27748. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-06-03 15:43:08 -07:00
Sean Owen	a10608cb82	[SPARK-27680][CORE][SQL][GRAPHX] Remove usage of Traversable ## What changes were proposed in this pull request? This removes usage of `Traversable`, which is removed in Scala 2.13. This is mostly an internal change, except for the change in the `SparkConf.setAll` method. See additional comments below. ## How was this patch tested? Existing tests. Closes #24584 from srowen/SPARK-27680. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-05-14 09:14:56 -05:00
Gabor Somogyi	2f55809425	[SPARK-27294][SS] Add multi-cluster Kafka delegation token ## What changes were proposed in this pull request? The current implementation doesn't support multi-cluster Kafka connections with delegation tokens. In this PR I've added this functionality. What this PR contains: * New way of configuration * Multiple delegation token obtain/store/use functionality * Documentation * The change also works on DStreams ## How was this patch tested? Existing + additional unit tests. Additionally tested on cluster. Test scenario: * 2 * 4 node clusters * The 4-4 nodes are in different Kerberos realms * Cross-Realm trust between the 2 realms * Yarn * Kafka broker version 2.1.0 * security.protocol = SASL_SSL * sasl.mechanism = SCRAM-SHA-512 * Artificial exceptions during processing * Source reads from realm1, sink writes to realm2 Kafka broker settings: * delegation.token.expiry.time.ms=600000 (10 min) * delegation.token.max.lifetime.ms=1200000 (20 min) * delegation.token.expiry.check.interval.ms=300000 (5 min) Closes #24305 from gaborgsomogyi/SPARK-27294. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-05-07 11:40:43 -07:00
Sean Owen	8a17d26784	[SPARK-27536][CORE][ML][SQL][STREAMING] Remove most use of scala.language.existentials ## What changes were proposed in this pull request? I want to get rid of as much use of `scala.language.existentials` as possible for 3.0. It's a complicated language feature that generates warnings unless this value is imported. It might even be on the way out of Scala: https://contributors.scala-lang.org/t/proposal-to-remove-existential-types-from-the-language/2785 For Spark, it comes up mostly where the code plays fast and loose with generic types, not the advanced situations you'll often see referenced where this feature is explained. For example, it comes up in cases where a function returns something like `(String, Class[_])`. Scala doesn't like matching this to any other instance of `(String, Class[_])` because doing so requires inferring the existence of some type that satisfies both. Seems obvious if the generic type is a wildcard, but, not technically something Scala likes to let you get away with. This is a large PR, and it only gets rid of _most_ instances of `scala.language.existentials`. The change should be all compile-time and shouldn't affect APIs or logic. Many of the changes simply touch up sloppiness about generic types, making the known correct value explicit in the code. Some fixes involve being more explicit about the existence of generic types in methods. For instance, `def foo(arg: Class[_])` seems innocent enough but should really be declared `def foo[T](arg: Class[T])` to let Scala select and fix a single type when evaluating calls to `foo`. For kind of surprising reasons, this comes up in places where code evaluates a tuple of things that involve a generic type, but is OK if the two parts of the tuple are evaluated separately. One key change was altering `Utils.classForName(...): Class[_]` to the more correct `Utils.classForName[T](...): Class[T]`. 
This caused a number of small but positive changes to callers that otherwise had to cast the result. In several tests, `Dataset[_]` was used where `DataFrame` seems to be the clear intent. Finally, in a few cases in MLlib, the return type `this.type` was used where there are no subclasses of the class that uses it. This really isn't needed and causes issues for Scala reasoning about the return type. These are just changed to be concrete classes as return types. After this change, we have only a few classes that still import `scala.language.existentials` (because modifying them would require extensive rewrites to fix) and no build warnings. ## How was this patch tested? Existing tests. Closes #24431 from srowen/SPARK-27536. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-29 11:02:01 -05:00
Koert Kuipers	7b367bfc86	[SPARK-27477][BUILD] Kafka token provider should have provided dependency on Spark ## What changes were proposed in this pull request? Change spark-token-provider-kafka-0-10 dependency on spark-core to be provided ## How was this patch tested? Ran existing unit tests Closes #24384 from koertkuipers/feat-kafka-token-provider-fix-deps. Authored-by: Koert Kuipers <koert@tresata.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-04-26 11:52:08 -07:00
Gabor Somogyi	94adffa8b1	[SPARK-27270][SS] Add Kafka dynamic JAAS authentication debug possibility ## What changes were proposed in this pull request? `Krb5LoginModule` supports a debug parameter which is not yet supported from the Spark side. This configuration makes it easier to debug authentication issues against Kafka. In this PR the `Krb5LoginModule` debug flag is controlled by either `sun.security.krb5.debug` or `com.ibm.security.krb5.Krb5Debug`. Additionally, I found some hardcoded values like `ssl.truststore.location`, etc., which could be error-prone if Kafka changes them, so in such cases the Kafka-defined constants are used. ## How was this patch tested? Existing + additional unit tests + on cluster. Closes #24204 from gaborgsomogyi/SPARK-27270. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-04-11 16:39:40 -07:00
Gabor Somogyi	98a8725e66	[SPARK-27022][DSTREAMS] Add kafka delegation token support. ## What changes were proposed in this pull request? It adds Kafka delegation token support for DStreams. Please be aware that, as the Kafka native sink is not available for DStreams, this PR contains delegation token usage only on the consumer side. What this PR contains: * Usage of token through dynamic JAAS configuration * `KafkaConfigUpdater` moved to `kafka-0-10-token-provider` * `KafkaSecurityHelper` functionality moved into `KafkaTokenUtil` * Documentation ## How was this patch tested? Existing unit tests + on cluster. Long-running Kafka-to-file tests on a 4 node cluster with randomly thrown artificial exceptions. Test scenario: * 4 node cluster * Yarn * Kafka broker version 2.1.0 * security.protocol = SASL_SSL * sasl.mechanism = SCRAM-SHA-512 Kafka broker settings: * delegation.token.expiry.time.ms=600000 (10 min) * delegation.token.max.lifetime.ms=1200000 (20 min) * delegation.token.expiry.check.interval.ms=300000 (5 min) Every 7.5 minutes a new delegation token is obtained from the Kafka broker (10 min * 0.75). When the token expired after 10 minutes (Spark obtains a new one and doesn't renew the old one), the broker's expiring thread ran every 5 minutes (invalidating expired tokens), and an artificial exception was thrown inside the Spark application (in such a case Spark closes the connection), the latest delegation token was picked up correctly. cd docs/ SKIP_API=1 jekyll build Manual webpage check. Closes #23929 from gaborgsomogyi/SPARK-27022. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-07 11:36:37 -08:00
Gabor Somogyi	28ced387b9	[SPARK-26772][YARN] Delete ServiceCredentialProvider and make HadoopDelegationTokenProvider a developer API ## What changes were proposed in this pull request? `HadoopDelegationTokenProvider` has basically the same functionality as `ServiceCredentialProvider`, so the interfaces can be merged. `YARNHadoopDelegationTokenManager` currently loads `ServiceCredentialProvider`s in one step. The drawback of this is that if one provider fails, the others are not loaded. `HadoopDelegationTokenManager` loads `HadoopDelegationTokenProvider`s independently, so it provides more robust behaviour. In this PR I've made the following changes: * Deleted `YARNHadoopDelegationTokenManager` and `ServiceCredentialProvider` * Made `HadoopDelegationTokenProvider` a `DeveloperApi` ## How was this patch tested? Existing unit tests. Closes #23686 from gaborgsomogyi/SPARK-26772. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-15 14:43:13 -08:00
Gabor Somogyi	d0443a74d1	[SPARK-26766][CORE] Remove the list of filesystems from HadoopDelegationTokenProvider.obtainDelegationTokens ## What changes were proposed in this pull request? Delegation token providers interface now has a parameter `fileSystems` but this is needed only for `HadoopFSDelegationTokenProvider`. In this PR I've addressed this issue in the following way: * Removed `fileSystems` parameter from `HadoopDelegationTokenProvider` * Moved `YarnSparkHadoopUtil.hadoopFSsToAccess` into `HadoopFSDelegationTokenProvider` * Moved `spark.yarn.stagingDir` into core * Moved `spark.yarn.access.namenodes` into core and renamed to `spark.kerberos.access.namenodes` * Moved `spark.yarn.access.hadoopFileSystems` into core and renamed to `spark.kerberos.access.hadoopFileSystems` ## How was this patch tested? Existing unit tests. Closes #23698 from gaborgsomogyi/SPARK-26766. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-08 13:41:52 -08:00
Gabor Somogyi	773efede20	[SPARK-26254][CORE] Extract Hive + Kafka dependencies from Core. ## What changes were proposed in this pull request? There are ugly provided dependencies inside core for the following: * Hive * Kafka In this PR I've extracted them out. This PR contains the following: * Token providers are now loaded with service loader * Hive token provider moved to hive project * Kafka token provider extracted into a new project ## How was this patch tested? Existing + newly added unit tests. Additionally tested on cluster. Closes #23499 from gaborgsomogyi/SPARK-26254. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-01-25 10:36:00 -08:00

14 commits
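
The protocol takeover described in SPARK-28928 (e516f7e09e) can be illustrated with a minimal sketch. This is not Spark's implementation: `TokenClusterConf` and `resolveSecurityProtocol` are hypothetical names, and plain string equality on the bootstrap servers is an assumption standing in for whatever matching Spark actually performs. It only shows the precedence the commit message describes: an explicit `kafka.security.protocol` on the source/sink wins, otherwise the protocol of the matching delegation token configuration is taken over.

```scala
// Hypothetical sketch; none of these names are Spark APIs.
object ProtocolResolutionSketch {
  // Stand-in for the delegation token side configuration of one Kafka cluster.
  final case class TokenClusterConf(bootstrapServers: String, securityProtocol: String)

  def resolveSecurityProtocol(
      sourceParams: Map[String, String],
      tokenConfs: Seq[TokenClusterConf]): Option[String] = {
    // An explicit setting on the source/sink always takes precedence.
    sourceParams.get("kafka.security.protocol").orElse {
      // Otherwise take over the protocol of the token whose servers match.
      // Assumption: exact string equality is used here purely for illustration.
      sourceParams.get("kafka.bootstrap.servers").flatMap { servers =>
        tokenConfs.find(_.bootstrapServers == servers).map(_.securityProtocol)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val tokens = Seq(TokenClusterConf("broker1:9093", "SASL_SSL"))
    val source = Map("kafka.bootstrap.servers" -> "broker1:9093")

    // Protocol inherited from the matching token configuration.
    println(resolveSecurityProtocol(source, tokens))
    // Protocol overridden explicitly on the source/sink.
    println(resolveSecurityProtocol(source + ("kafka.security.protocol" -> "SASL_PLAINTEXT"), tokens))
    // No matching token configuration and no explicit setting.
    println(resolveSecurityProtocol(Map("kafka.bootstrap.servers" -> "other:9093"), tokens))
  }
}
```

Returning an `Option` keeps the "no match, nothing configured" case explicit instead of silently falling back to some default protocol.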