ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Liang-Chi Hsieh	83013752e3	[SPARK-23455][ML] Default Params in ML should be saved separately in metadata ## What changes were proposed in this pull request? We save ML's user-supplied params and default params as one entity in metadata. During loading the saved models, we set all the loaded params into created ML model instances as user-supplied params. It causes some problems, e.g., if we strictly disallow some params to be set at the same time, a default param can fail the param check because it is treated as user-supplied param after loading. The loaded default params should not be set as user-supplied params. We should save ML default params separately in metadata. For backward compatibility, when loading metadata, if it is a metadata file from previous Spark, we shouldn't raise error if we can't find the default param field. ## How was this patch tested? Pass existing tests and added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20633 from viirya/save-ml-default-params.	2018-04-24 10:40:25 -07:00
Marcelo Vanzin	3cb82047f2	[SPARK-22941][CORE] Do not exit JVM when submit fails with in-process launcher. The current in-process launcher implementation just calls the SparkSubmit object, which, in case of errors, will more often than not exit the JVM. This is not desirable since this launcher is meant to be used inside other applications, and that would kill the application. The change turns SparkSubmit into a class, and abstracts aways some of the functionality used to print error messages and abort the submission process. The default implementation uses the logging system for messages, and throws exceptions for errors. As part of that I also moved some code that doesn't really belong in SparkSubmit to a better location. The command line invocation of spark-submit now uses a special implementation of the SparkSubmit class that overrides those behaviors to do what is expected from the command line version (print to the terminal, exit the JVM, etc). A lot of the changes are to replace calls to methods such as "printErrorAndExit" with the new API. As part of adding tests for this, I had to fix some small things in the launcher option parser so that things like "--version" can work when used in the launcher library. There is still code that prints directly to the terminal, like all the Ivy-related code in SparkSubmitUtils, and other areas where some re-factoring would help, like the CommandLineUtils class, but I chose to leave those alone to keep this change more focused. Aside from existing and added unit tests, I ran command line tools with a bunch of different arguments to make sure messages and errors behave like before. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20925 from vanzin/SPARK-22941.	2018-04-11 10:13:44 -05:00
WeichenXu	252468a744	[SPARK-14681][ML] Provide label/impurity stats for spark.ml decision tree nodes ## What changes were proposed in this pull request? API: ``` trait ClassificationNode extends Node def getLabelCount(label: Int): Double trait RegressionNode extends Node def getCount(): Double def getSum(): Double def getSquareSum(): Double // turn LeafNode to be trait trait LeafNode extends Node { def prediction: Double def impurity: Double ... } class ClassificationLeafNode extends ClassificationNode with LeafNode class RegressionLeafNode extends RegressionNode with LeafNode // turn InternalNode to be trait trait InternalNode extends Node{ def gain: Double def leftChild: Node def rightChild: Node def split: Split ... } class ClassificationInternalNode extends ClassificationNode with InternalNode override def leftChild: ClassificationNode override def rightChild: ClassificationNode class RegressionInternalNode extends RegressionNode with InternalNode override val leftChild: RegressionNode override val rightChild: RegressionNode class DecisionTreeClassificationModel override val rootNode: ClassificationNode class DecisionTreeRegressionModel override val rootNode: RegressionNode ``` Closes #17466 ## How was this patch tested? UT will be added soon. Author: WeichenXu <weichen.xu@databricks.com> Author: jkbradley <joseph.kurata.bradley@gmail.com> Closes #20786 from WeichenXu123/tree_stat_api_2.	2018-04-09 12:18:07 -07:00
Marco Gaido	567bd31e0a	[SPARK-23412][ML] Add cosine distance to BisectingKMeans ## What changes were proposed in this pull request? The PR adds the option to specify a distance measure in BisectingKMeans. Moreover, it introduces the ability to use the cosine distance measure in it. ## How was this patch tested? added UTs + existing UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #20600 from mgaido91/SPARK-23412.	2018-03-12 14:53:15 -05:00
“attilapiros”	116c581d26	[SPARK-20659][CORE] Removing sc.getExecutorStorageStatus and making StorageStatus private ## What changes were proposed in this pull request? In this PR StorageStatus is made to private and simplified a bit moreover SparkContext.getExecutorStorageStatus method is removed. The reason of keeping StorageStatus is that it is usage from SparkContext.getRDDStorageInfo. Instead of the method SparkContext.getExecutorStorageStatus executor infos are extended with additional memory metrics such as usedOnHeapStorageMemory, usedOffHeapStorageMemory, totalOnHeapStorageMemory, totalOffHeapStorageMemory. ## How was this patch tested? By running existing unit tests. Author: “attilapiros” <piros.attila.zsolt@gmail.com> Author: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com> Closes #20546 from attilapiros/SPARK-20659.	2018-02-13 06:54:15 -08:00
gatorsmile	bd08a9e7af	[SPARK-23070] Bump previousSparkVersion in MimaBuild.scala to be 2.2.0 ## What changes were proposed in this pull request? Bump previousSparkVersion in MimaBuild.scala to be 2.2.0 and add the missing exclusions to `v23excludes` in `MimaExcludes`. No item can be un-excluded in `v23excludes`. ## How was this patch tested? The existing tests. Author: gatorsmile <gatorsmile@gmail.com> Closes #20264 from gatorsmile/bump22.	2018-01-15 22:32:38 +08:00
gatorsmile	651f76153f	[SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT ## What changes were proposed in this pull request? This patch bumps the master branch version to `2.4.0-SNAPSHOT`. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #20222 from gatorsmile/bump24.	2018-01-13 00:37:59 +08:00
Xianjin YE	a6fc300e91	[SPARK-22897][CORE] Expose stageAttemptId in TaskContext ## What changes were proposed in this pull request? stageAttemptId added in TaskContext and corresponding construction modification ## How was this patch tested? Added a new test in TaskContextSuite, two cases are tested: 1. Normal case without failure 2. Exception case with resubmitted stages Link to [SPARK-22897](https://issues.apache.org/jira/browse/SPARK-22897) Author: Xianjin YE <advancedxy@gmail.com> Closes #20082 from advancedxy/SPARK-22897.	2018-01-02 23:30:38 +08:00
Jose Torres	8941a4abca	[SPARK-22789] Map-only continuous processing execution ## What changes were proposed in this pull request? Basic continuous execution, supporting map/flatMap/filter, with commits and advancement through RPC. ## How was this patch tested? new unit-ish tests (exercising execution end to end) Author: Jose Torres <jose@databricks.com> Closes #19984 from jose-torres/continuous-impl.	2017-12-22 23:05:03 -08:00
Erik LaBianca	4c2efde931	[SPARK-22855][BUILD] Add -no-java-comments to sbt docs/scalacOptions Prevents Scala 2.12 scaladoc from blowing up attempting to parse java comments. ## What changes were proposed in this pull request? Adds -no-java-comments to docs/scalacOptions under Scala 2.12. Also moves scaladoc configs out of the TestSettings and into the standard sharedSettings section in SparkBuild.scala. ## How was this patch tested? SBT_OPTS=-Dscala-2.12 sbt ++2.12.4 tags/publishLocal Author: Erik LaBianca <erik.labianca@gmail.com> Closes #20042 from easel/scaladoc-212.	2017-12-21 10:08:38 -06:00
Erik LaBianca	0abaf31be7	[SPARK-22852][BUILD] Exclude -Xlint:unchecked from sbt javadoc flags ## What changes were proposed in this pull request? Moves the -Xlint:unchecked flag in the sbt build configuration from Compile to (Compile, compile) scope, allowing publish and publishLocal commands to work. ## How was this patch tested? Successfully published the spark-launcher subproject from within sbt successfully, where it fails without this patch. Author: Erik LaBianca <erik.labianca@gmail.com> Closes #20040 from easel/javadoc-xlint.	2017-12-21 09:38:21 -06:00
Yanbo Liang	1e44dd0044	[SPARK-3181][ML] Implement huber loss for LinearRegression. ## What changes were proposed in this pull request? MLlib ```LinearRegression``` supports _huber_ loss addition to _leastSquares_ loss. The huber loss objective function is: ![image](https://user-images.githubusercontent.com/1962026/29554124-9544d198-8750-11e7-8afa-33579ec419d5.png) Refer Eq.(6) and Eq.(8) in [A robust hybrid of lasso and ridge regression](http://statweb.stanford.edu/~owen/reports/hhu.pdf). This objective is jointly convex as a function of (w, σ) ∈ R × (0,∞), we can use L-BFGS-B to solve it. The current implementation is a straight forward porting for Python scikit-learn [```HuberRegressor```](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html). There are some differences: * We use mean loss (```lossSum/weightSum```), but sklearn uses total loss (```lossSum```). * We multiply the loss function and L2 regularization by 1/2. It does not affect the result if we multiply the whole formula by a factor, we just keep consistent with _leastSquares_ loss. So if fitting w/o regularization, MLlib and sklearn produce the same output. If fitting w/ regularization, MLlib should set ```regParam``` divide by the number of instances to match the output of sklearn. ## How was this patch tested? Unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #19020 from yanboliang/spark-3181.	2017-12-13 21:19:14 -08:00
Marcelo Vanzin	e1dd03e42c	[SPARK-22372][CORE, YARN] Make cluster submission use SparkApplication. The main goal of this change is to allow multiple cluster-mode submissions from the same JVM, without having them end up with mixed configuration. That is done by extending the SparkApplication trait, and doing so was reasonably trivial for standalone and mesos modes. For YARN mode, there was a complication. YARN used a "SPARK_YARN_MODE" system property to control behavior indirectly in a whole bunch of places, mainly in the SparkHadoopUtil / YarnSparkHadoopUtil classes. Most of the changes here are removing that. Since we removed support for Hadoop 1.x, some methods that lived in YarnSparkHadoopUtil can now live in SparkHadoopUtil. The remaining methods don't need to be part of the class, and can be called directly from the YarnSparkHadoopUtil object, so now there's a single implementation of SparkHadoopUtil. There were two places in the code that relied on SPARK_YARN_MODE to make decisions about YARN-specific functionality, and now explicitly check the master from the configuration for that instead: * fetching the external shuffle service port, which can come from the YARN configuration. * propagation of the authentication secret using Hadoop credentials. This also was cleaned up a little to not need so many methods in `SparkHadoopUtil`. With those out of the way, actually changing the YARN client to extend SparkApplication was easy. Tested with existing unit tests, and also by running YARN apps with auth and kerberos both on and off in a real cluster. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #19631 from vanzin/SPARK-22372.	2017-12-04 11:05:03 -08:00
Marcelo Vanzin	8ff474f6e5	[SPARK-20650][CORE] Remove JobProgressListener. The only remaining use of this class was the SparkStatusTracker, which was modified to use the new status store. The test code to wait for executors was moved to TestUtils and now uses the SparkStatusTracker API. Indirectly, ConsoleProgressBar also uses this data. Because it has some lower latency requirements, a shortcut to efficiently get the active stages from the active listener was added to the AppStateStore. Now that all UI code goes through the status store to get its data, the FsHistoryProvider can be cleaned up to only replay event logs when needed - that is, when there is no pre-existing disk store for the application. As part of this change I also modified the streaming UI to read the needed data from the store, which was missed in the previous patch that made JobProgressListener redundant. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #19750 from vanzin/SPARK-20650.	2017-11-29 14:34:41 -08:00
Yinan Li	e9b2070ab2	[SPARK-18278][SCHEDULER] Spark on Kubernetes - Basic Scheduler Backend ## What changes were proposed in this pull request? This is a stripped down version of the `KubernetesClusterSchedulerBackend` for Spark with the following components: - Static Allocation of Executors - Executor Pod Factory - Executor Recovery Semantics It's step 1 from the step-wise plan documented [here](https://github.com/apache-spark-on-k8s/spark/issues/441#issuecomment-330802935). This addition is covered by the [SPIP vote](http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-td22147.html) which passed on Aug 31 . ## How was this patch tested? - The patch contains unit tests which are passing. - Manual testing: `./build/mvn -Pkubernetes clean package` succeeded. - It is a subset of the entire changelist hosted in http://github.com/apache-spark-on-k8s/spark which is in active use in several organizations. - There is integration testing enabled in the fork currently [hosted by PepperData](spark-k8s-jenkins.pepperdata.org:8080) which is being moved over to RiseLAB CI. - Detailed documentation on trying out the patch in its entirety is in: https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html cc rxin felixcheung mateiz (shepherd) k8s-big-data SIG members & contributors: mccheah ash211 ssuchter varunkatta kimoonkim erikerlandson liyinan926 tnachen ifilonenko Author: Yinan Li <liyinan926@gmail.com> Author: foxish <ramanathana@google.com> Author: mcheah <mcheah@palantir.com> Closes #19468 from foxish/spark-kubernetes-3.	2017-11-28 23:02:09 -08:00
Sean Owen	fba63c1a7b	[SPARK-22607][BUILD] Set large stack size consistently for tests to avoid StackOverflowError ## What changes were proposed in this pull request? Set `-ea` and `-Xss4m` consistently for tests, to fix in particular: ``` OrderingSuite: ... - GenerateOrdering with ShortType * RUN ABORTED * java.lang.StackOverflowError: at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) ... ``` ## How was this patch tested? Existing tests. Manually verified it resolves the StackOverflowError this intends to resolve. Author: Sean Owen <sowen@cloudera.com> Closes #19820 from srowen/SPARK-22607.	2017-11-26 07:42:44 -06:00
WeichenXu	774398045b	[SPARK-21087][ML] CrossValidator, TrainValidationSplit expose sub models after fitting: Scala ## What changes were proposed in this pull request? We add a parameter whether to collect the full model list when CrossValidator/TrainValidationSplit training (Default is NOT), avoid the change cause OOM) - Add a method in CrossValidatorModel/TrainValidationSplitModel, allow user to get the model list - CrossValidatorModelWriter add a “option”, allow user to control whether to persist the model list to disk (will persist by default). - Note: when persisting the model list, use indices as the sub-model path ## How was this patch tested? Test cases added. Author: WeichenXu <weichen.xu@databricks.com> Closes #19208 from WeichenXu123/expose-model-list.	2017-11-14 16:48:26 -08:00
Marcelo Vanzin	4741c07809	[SPARK-20648][CORE] Port JobsTab and StageTab to the new UI backend. This change is a little larger because there's a whole lot of logic behind these pages, all really tied to internal types and listeners, and some of that logic had to be implemented in the new listener and the needed data exposed through the API types. - Added missing StageData and ExecutorStageSummary fields which are used by the UI. Some json golden files needed to be updated to account for new fields. - Save RDD graph data in the store. This tries to re-use existing types as much as possible, so that the code doesn't need to be re-written. So it's probably not very optimal. - Some old classes (e.g. JobProgressListener) still remain, since they're used in other parts of the code; they're not used by the UI anymore, though, and will be cleaned up in a separate change. - Save information about active pools in the store. This data is not really used in the SHS, but it's not a lot of data so it's still recorded when replaying applications. - Because the new store sorts things slightly differently from the previous code, some json golden files had some elements within them shuffled around. - The retention unit test in UISeleniumSuite was disabled because the code to throw away old stages / tasks hasn't been added yet. - The job description field in the API tries to follow the old behavior, which makes it be empty most of the time, even though there's information to fill it in. For stages, a new field was added to hold the description (which is basically the job description), so that the UI can be rendered in the old way. - A new stage status ("SKIPPED") was added to account for the fact that the API couldn't represent that state before. Without this, the stage would show up as "PENDING" in the UI, which is now based on API types. - The API used to expose "executorRunTime" as the value of the task's duration, which wasn't really correct (also because that value was easily available from the metrics object); this change fixes that by storing the correct duration, which also means a few expectation files needed to be updated to account for the new durations and sorting differences due to the changed values. - Added changes to implement SPARK-20713 and SPARK-21922 in the new code. Tested with existing unit tests (and by using the UI a lot). Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #19698 from vanzin/SPARK-20648.	2017-11-14 10:34:32 -06:00
Dongjoon Hyun	64c989495a	[SPARK-22485][BUILD] Use `exclude[Problem]` instead `excludePackage` in MiMa ## What changes were proposed in this pull request? `excludePackage` is deprecated like the [following](https://github.com/lightbend/migration-manager/blob/master/core/src/main/scala/com/typesafe/tools/mima/core/Filters.scala#L33-L36) and shows deprecation warnings now. This PR uses `exclude[Problem](packageName + ".")` instead. ```scala deprecated("Replace with ProblemFilters.exclude[Problem](\"my.package.\")", "0.1.15") def excludePackage(packageName: String): ProblemFilter = { exclude[Problem](packageName + ".*") } ``` ## How was this patch tested? Pass the Jenkins MiMa. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19710 from dongjoon-hyun/SPARK-22485.	2017-11-09 16:40:19 -08:00
Marcelo Vanzin	6ae12715c7	[SPARK-20647][CORE] Port StorageTab to the new UI backend. This required adding information about StreamBlockId to the store, which is not available yet via the API. So an internal type was added until there's a need to expose that information in the API. The UI only lists RDDs that have cached partitions, and that information wasn't being correctly captured in the listener, so that's also fixed, along with some minor (internal) API adjustments so that the UI can get the correct data. Because of the way partitions are cached, some optimizations w.r.t. how often the data is flushed to the store could not be applied to this code; because of that, some different ways to make the code more performant were added to the data structures tracking RDD blocks, with the goal of avoiding expensive copies when lots of blocks are being updated. Tested with existing and updated unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #19679 from vanzin/SPARK-20647.	2017-11-09 15:46:16 -06:00
Marcelo Vanzin	11eea1a4ce	[SPARK-20646][CORE] Port executors page to new UI backend. The executors page is built on top of the REST API, so the page itself was easy to hook up to the new code. Some other pages depend on the `ExecutorListener` class that is being removed, though, so they needed to be modified to use data from the new store. Fortunately, all they seemed to need is the map of executor logs, so that was somewhat easy too. The executor timeline graph required adding some properties to the ExecutorSummary API type. Instead of following the previous code, which stored all the listener events in memory, the timeline is now created based on the data available from the API. I had to change some of the test golden files because the old code would return executors in "random" order (since it used a mutable Map instead of something that returns a sorted list), and the new code returns executors in id order. Tested with existing unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #19678 from vanzin/SPARK-20646.	2017-11-07 23:14:29 -06:00
Marcelo Vanzin	7475a9655c	[SPARK-20645][CORE] Port environment page to new UI backend. This change modifies the status listener to collect the information needed to render the envionment page, and populates that page and the API with information collected by the listener. Tested with existing and added unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #19677 from vanzin/SPARK-20645.	2017-11-07 16:03:24 -06:00
Marcelo Vanzin	c7f38e5adb	[SPARK-20644][core] Initial ground work for kvstore UI backend. There are two somewhat unrelated things going on in this patch, but both are meant to make integration of individual UI pages later on much easier. The first part is some tweaking of the code in the listener so that it does less updates of the kvstore for data that changes fast; for example, it avoids writing changes down to the store for every task-related event, since those can arrive very quickly at times. Instead, for these kinds of events, it chooses to only flush things if a certain interval has passed. The interval is based on how often the current spark-shell code updates the progress bar for jobs, so that users can get reasonably accurate data. The code also delays as much as possible hitting the underlying kvstore when replaying apps in the history server. This is to avoid unnecessary writes to disk. The second set of changes prepare the history server and SparkUI for integrating with the kvstore. A new class, AppStatusStore, is used for translating between the stored data and the types used in the UI / API. The SHS now populates a kvstore with data loaded from event logs when an application UI is requested. Because this store can hold references to disk-based resources, the code was modified to retrieve data from the store under a read lock. This allows the SHS to detect when the store is still being used, and only update it (e.g. because an updated event log was detected) when there is no other thread using the store. This change ended up creating a lot of churn in the ApplicationCache code, which was cleaned up a lot in the process. I also removed some metrics which don't make too much sense with the new code. Tested with existing and added unit tests, and by making sure the SHS still works on a real cluster. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #19582 from vanzin/SPARK-20644.	2017-11-06 08:45:40 -06:00
pj.fanning	1ff41d8693	[SPARK-21708][BUILD] update some sbt plugins ## What changes were proposed in this pull request? These are just some straightforward upgrades to use the latest versions of some sbt plugins that also support sbt 1.0. The remaining sbt plugins that need upgrading will require bigger changes. ## How was this patch tested? Tested sbt use manually. Author: pj.fanning <pj.fanning@workday.com> Closes #19609 from pjfanning/SPARK-21708.	2017-10-31 08:16:54 +00:00
Marcelo Vanzin	0e9a750a8d	[SPARK-20643][CORE] Add listener implementation to collect app state. The initial listener code is based on the existing JobProgressListener (and others), and tries to mimic their behavior as much as possible. The change also includes some minor code movement so that some types and methods from the initial history server code code can be reused. The code introduces a few mutable versions of public API types, used internally, to make it easier to update information without ugly copy methods, and also to make certain updates cheaper. Note the code here is not 100% correct. This is meant as a building ground for the UI integration in the next milestones. As different parts of the UI are ported, fixes will be made to the different parts of this code to account for the needed behavior. I also added annotations to API types so that Jackson is able to correctly deserialize options, sequences and maps that store primitive types. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #19383 from vanzin/SPARK-20643.	2017-10-26 11:05:16 -05:00
Sean Owen	0c03297bf0	[SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile, take 2 ## What changes were proposed in this pull request? Move flume behind a profile, take 2. See https://github.com/apache/spark/pull/19365 for most of the back-story. This change should fix the problem by removing the examples module dependency and moving Flume examples to the module itself. It also adds deprecation messages, per a discussion on dev about deprecating for 2.3.0. ## How was this patch tested? Existing tests, which still enable flume integration. Author: Sean Owen <sowen@cloudera.com> Closes #19412 from srowen/SPARK-22142.2.	2017-10-06 15:08:28 +01:00
gatorsmile	472864014c	Revert "[SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile" This reverts commit `a2516f41ae`.	2017-09-29 11:45:58 -07:00
Sean Owen	a2516f41ae	[SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile ## What changes were proposed in this pull request? Add 'flume' profile to enable Flume-related integration modules ## How was this patch tested? Existing tests; no functional change Author: Sean Owen <sowen@cloudera.com> Closes #19365 from srowen/SPARK-22142.	2017-09-29 08:26:53 +01:00
Sean Owen	4fbf748bf8	[SPARK-21893][BUILD][STREAMING][WIP] Put Kafka 0.8 behind a profile ## What changes were proposed in this pull request? Put Kafka 0.8 support behind a kafka-0-8 profile. ## How was this patch tested? Existing tests, but, until PR builder and Jenkins configs are updated the effect here is to not build or test Kafka 0.8 support at all. Author: Sean Owen <sowen@cloudera.com> Closes #19134 from srowen/SPARK-21893.	2017-09-13 10:10:40 +01:00
hyukjinkwon	64936c14a7	[SPARK-21903][BUILD][FOLLOWUP] Upgrade scalastyle-maven-plugin and scalastyle as well in POM and SparkBuild.scala ## What changes were proposed in this pull request? This PR proposes to match scalastyle version in POM and SparkBuild.scala ## How was this patch tested? Manual builds. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19146 from HyukjinKwon/SPARK-21903-follow-up.	2017-09-06 23:28:12 +09:00
hyukjinkwon	7f3c6ff4ff	[SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0. ## What changes were proposed in this pull request? 1.0.0 fixes an issue with import order, explicit type for public methods, line length limitation and comment validation: ``` [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala:50:16: Are you sure you want to println? If yes, wrap the code block with [error] // scalastyle:off println [error] println(...) [error] // scalastyle:on println [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:49: File line length exceeds 100 characters [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:22:21: Are you sure you want to println? If yes, wrap the code block with [error] // scalastyle:off println [error] println(...) [error] // scalastyle:on println [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:35:6: Public method must have explicit type [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:51:6: Public method must have explicit type [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:93:15: Public method must have explicit type [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:98:15: Public method must have explicit type [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:47:2: Insert a space after the start of the comment [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:26:43: JavaDStream should come before JavaDStreamLike. ``` This PR also fixes the workaround added in SPARK-16877 for `org.scalastyle.scalariform.OverrideJavaChecker` feature, added from 0.9.0. ## How was this patch tested? Manually tested. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19116 from HyukjinKwon/scalastyle-1.0.0.	2017-09-05 19:40:05 +09:00
Sean Owen	12ab7f7e89	[SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation …build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure ## What changes were proposed in this pull request? This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts. In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11. It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release. - Scalatest 2.x -> 3.0.3 - Chill 0.8.0 -> 0.8.4 - Clapper 1.0.x -> 1.1.2 - json4s 3.2.x -> 3.4.2 - Jackson 2.6.x -> 2.7.9 (required by json4s) This change does _not_ fully enable a Scala 2.12 build: - It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here - It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too. What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build. ## How was this patch tested? Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above. Author: Sean Owen <sowen@cloudera.com> Closes #18645 from srowen/SPARK-14280.	2017-09-01 19:21:21 +01:00
WeichenXu	96028e36b4	[SPARK-17139][ML][FOLLOW-UP] Add convenient method `asBinary` for casting to BinaryLogisticRegressionSummary ## What changes were proposed in this pull request? add an "asBinary" method to LogisticRegressionSummary for convenient casting to BinaryLogisticRegressionSummary. ## How was this patch tested? Testcase updated. Author: WeichenXu <weichen.xu@databricks.com> Closes #19072 from WeichenXu123/mlor_summary_as_binary.	2017-08-31 16:22:40 -07:00
Weichen Xu	c7270a46fc	[SPARK-17139][ML] Add model summary for MultinomialLogisticRegression ## What changes were proposed in this pull request? Add 4 traits, using the following hierarchy: LogisticRegressionSummary LogisticRegressionTrainingSummary: LogisticRegressionSummary BinaryLogisticRegressionSummary: LogisticRegressionSummary BinaryLogisticRegressionTrainingSummary: LogisticRegressionTrainingSummary, BinaryLogisticRegressionSummary and the public method such as `def summary` only return trait type listed above. and then implement 4 concrete classes: LogisticRegressionSummaryImpl (multiclass case) LogisticRegressionTrainingSummaryImpl (multiclass case) BinaryLogisticRegressionSummaryImpl (binary case). BinaryLogisticRegressionTrainingSummaryImpl (binary case). ## How was this patch tested? Existing tests & added tests. Author: WeichenXu <WeichenXu123@outlook.com> Closes #15435 from WeichenXu123/mlor_summary.	2017-08-28 13:31:01 -07:00
Herman van Hovell	05af2de0fd	[SPARK-21830][SQL] Bump ANTLR version and fix a few issues. ## What changes were proposed in this pull request? This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump. The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse: ```sql SELECT * FROM RANGE(1000) WHERE TRUE AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' ``` This is caused by a know bug in ANTLR (https://github.com/antlr/antlr4/issues/994), which was fixed in version 4.6. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #19042 from hvanhovell/SPARK-21830.	2017-08-24 16:33:55 -07:00
Peng Meng	a0345cbebe	[SPARK-21680][ML][MLLIB] optimize Vector compress ## What changes were proposed in this pull request? When use Vector.compressed to change a Vector to SparseVector, the performance is very low comparing with Vector.toSparse. This is because you have to scan the value three times using Vector.compressed, but you just need two times when use Vector.toSparse. When the length of the vector is large, there is significant performance difference between this two method. ## How was this patch tested? The existing UT Author: Peng Meng <peng.meng@intel.com> Closes #18899 from mpjlu/optVectorCompress.	2017-08-16 19:05:20 +01:00
Marcelo Vanzin	3f958a9992	[SPARK-21731][BUILD] Upgrade scalastyle to 0.9. This version fixes a few issues in the import order checker; it provides better error messages, and detects more improper ordering (thus the need to change a lot of files in this patch). The main fix is that it correctly complains about the order of packages vs. classes. As part of the above, I moved some "SparkSession" import in ML examples inside the "$example on$" blocks; that didn't seem consistent across different source files to start with, and avoids having to add more on/off blocks around specific imports. The new scalastyle also seems to have a better header detector, so a few license headers had to be updated to match the expected indentation. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18943 from vanzin/SPARK-21731.	2017-08-15 13:59:00 -07:00
pj.fanning	c0e333dbed	[SPARK-21709][BUILD] sbt 0.13.16 and some plugin updates ## What changes were proposed in this pull request? Update sbt version to 0.13.16. I think this is a useful stepping stone to getting to sbt 1.0.0. ## How was this patch tested? Existing Build. Author: pj.fanning <pj.fanning@workday.com> Closes #18921 from pjfanning/SPARK-21709.	2017-08-12 20:01:20 +01:00
Takeshi Yamamuro	b78cf13bf0	[SPARK-21276][CORE] Update lz4-java to the latest (v1.4.0) ## What changes were proposed in this pull request? This pr updated `lz4-java` to the latest (v1.4.0) and removed custom `LZ4BlockInputStream`. We currently use custom `LZ4BlockInputStream` to read concatenated byte stream in shuffle. But, this functionality has been implemented in the latest lz4-java (https://github.com/lz4/lz4-java/pull/105). So, we might update the latest to remove the custom `LZ4BlockInputStream`. Major diffs between the latest release and v1.3.0 in the master are as follows (`62f7547abb...6d4693f562`); - fixed NPE in XXHashFactory similarly - Don't place resources in default package to support shading - Fixes ByteBuffer methods failing to apply arrayOffset() for array-backed - Try to load lz4-java from java.library.path, then fallback to bundled - Add ppc64le binary - Add s390x JNI binding - Add basic LZ4 Frame v1.5.0 support - enable aarch64 support for lz4-java - Allow unsafeInstance() for ppc64le archiecture - Add unsafeInstance support for AArch64 - Support 64-bit JNI build on Solaris - Avoid over-allocating a buffer - Allow EndMark to be incompressible for LZ4FrameInputStream. - Concat byte stream ## How was this patch tested? Existing tests. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18883 from maropu/SPARK-21276.	2017-08-09 17:31:52 +02:00
Marcelo Vanzin	979bf946d5	[SPARK-20655][CORE] In-memory KVStore implementation. This change adds an in-memory implementation of KVStore that can be used by the live UI. The implementation is not fully optimized, neither for speed nor space, but should be fast enough for using in the listener bus. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18395 from vanzin/SPARK-20655.	2017-08-08 11:02:54 -07:00
Sean Owen	e26dac5feb	[SPARK-21415] Triage scapegoat warnings, part 1 ## What changes were proposed in this pull request? Address scapegoat warnings for: - BigDecimal double constructor - Catching NPE - Finalizer without super - List.size is O(n) - Prefer Seq.empty - Prefer Set.empty - reverse.map instead of reverseMap - Type shadowing - Unnecessary if condition. - Use .log1p - Var could be val In some instances like Seq.empty, I avoided making the change even where valid in test code to keep the scope of the change smaller. Those issues are concerned with performance and it won't matter for tests. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #18635 from srowen/Scapegoat1.	2017-07-18 08:47:17 +01:00
Sean Owen	425c4ada4c	[SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 ## What changes were proposed in this pull request? - Remove Scala 2.10 build profiles and support - Replace some 2.10 support in scripts with commented placeholders for 2.12 later - Remove deprecated API calls from 2.10 support - Remove usages of deprecated context bounds where possible - Remove Scala 2.10 workarounds like ScalaReflectionLock - Other minor Scala warning fixes ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #17150 from srowen/SPARK-19810.	2017-07-13 17:06:24 +08:00
jinxing	58434acdd8	[SPARK-19937] Collect metrics for remote bytes read to disk during shuffle. In current code(https://github.com/apache/spark/pull/16989), big blocks are shuffled to disk. This pr proposes to collect metrics for remote bytes fetched to disk. Author: jinxing <jinxing6042@126.com> Closes #18249 from jinxing64/SPARK-19937.	2017-06-22 14:10:51 -07:00
Marcelo Vanzin	0cba495120	[SPARK-20641][CORE] Add key-value store abstraction and LevelDB implementation. This change adds an abstraction and LevelDB implementation for a key-value store that will be used to store UI and SHS data. The interface is described in KVStore.java (see javadoc). Specifics of the LevelDB implementation are discussed in the javadocs of both LevelDB.java and LevelDBTypeInfo.java. Included also are a few small benchmarks just to get some idea of latency. Because they're too slow for regular unit test runs, they're disabled by default. Tested with the included unit tests, and also as part of the overall feature implementation (including running SHS with hundreds of apps). Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #17902 from vanzin/shs-ng/M1.	2017-06-06 13:39:10 -05:00
Yanbo Liang	0698e6c88c	[SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for ML" This reverts commit `b8733e0ad9`. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17944 from yanboliang/spark-20606-revert.	2017-05-11 14:48:13 +08:00
Sanket	181261a81d	[SPARK-20355] Add per application spark version on the history server headerpage ## What changes were proposed in this pull request? Spark Version for a specific application is not displayed on the history page now. It should be nice to switch the spark version on the UI when we click on the specific application. Currently there seems to be way as SparkListenerLogStart records the application version. So, it should be trivial to listen to this event and provision this change on the UI. For Example <img width="1439" alt="screen shot 2017-04-06 at 3 23 41 pm" src="https://cloud.githubusercontent.com/assets/8295799/25092650/41f3970a-2354-11e7-9b0d-4646d0adeb61.png"> <img width="1399" alt="screen shot 2017-04-17 at 9 59 33 am" src="https://cloud.githubusercontent.com/assets/8295799/25092743/9f9e2f28-2354-11e7-9605-f2f1c63f21fe.png"> {"Event":"SparkListenerLogStart","Spark Version":"2.0.0"} (Please fill in changes proposed in this fix) Modified the SparkUI for History server to listen to SparkLogListenerStart event and extract the version and print it. ## How was this patch tested? Manual testing of UI page. Attaching the UI screenshot changes here (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Sanket <schintap@untilservice-lm> Closes #17658 from redsanket/SPARK-20355.	2017-05-09 09:30:09 -05:00
Yanbo Liang	b8733e0ad9	[SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML ## What changes were proposed in this pull request? Remove ML methods we deprecated in 2.1. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17867 from yanboliang/spark-20606.	2017-05-09 17:30:37 +08:00
Steve Loughran	2cf83c4783	[SPARK-7481][BUILD] Add spark-hadoop-cloud module to pull in object store access. ## What changes were proposed in this pull request? Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson. It restores `s3n://` access to S3, adds its `s3a://` replacement, OpenStack `swift://` and azure `wasb://`. There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector. (this is the successor to #12004; I can't re-open it) ## How was this patch tested? Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples) Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well. Manually clean build & verify that assembly contains the relevant aws-* hadoop-* artifacts on Hadoop 2.6; azure on a hadoop-2.7 profile. SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package` maven build `mvn install -Phadoop-cloud -Phadoop-2.7` This PR does not update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible. Author: Steve Loughran <stevel@apache.org> Author: Steve Loughran <stevel@hortonworks.com> Closes #17834 from steveloughran/cloud/SPARK-7481-current.	2017-05-07 10:15:31 +01:00
madhu	9064f1b044	[SPARK-20495][SQL][CORE] Add StorageLevel to cacheTable API ## What changes were proposed in this pull request? Currently cacheTable API only supports MEMORY_AND_DISK. This PR adds additional API to take different storage levels. ## How was this patch tested? unit tests Author: madhu <phatak.dev@gmail.com> Closes #17802 from phatak-dev/cacheTableAPI.	2017-05-05 22:44:03 +08:00
Josh Rosen	f44c8a843c	[SPARK-20453] Bump master branch version to 2.3.0-SNAPSHOT This patch bumps the master branch version to `2.3.0-SNAPSHOT`. Author: Josh Rosen <joshrosen@databricks.com> Closes #17753 from JoshRosen/SPARK-20453.	2017-04-24 21:48:04 -07:00

1 2 3 4 5 ...

1004 commits