ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Alexander Pivovarov	299eb04ba0	Fix hadoop.version in building-spark.md A couple of mvn build examples use `-Dhadoop.version=VERSION` instead of an actual version number. Author: Alexander Pivovarov <apivovarov@gmail.com> Closes #15440 from apivovarov/patch-1.	2016-10-11 22:31:21 -07:00
hyukjinkwon	0c0ad436ad	[SPARK-17719][SPARK-17776][SQL] Unify and tie up options in a single place in JDBC datasource package ## What changes were proposed in this pull request? This PR proposes to fix arbitrary usages among `Map[String, String]`, `Properties` and `JDBCOptions` instances for options in `execution/jdbc` package and make the connection properties exclude Spark-only options. This PR includes some changes as below: - Unify `Map[String, String]`, `Properties` and `JDBCOptions` in `execution/jdbc` package to `JDBCOptions`. - Move `batchsize`, `fetchsize`, `driver` and `isolationlevel` options into `JDBCOptions` instance. - Document `batchSize` and `isolationlevel`, marking which options are read-only and which are write-only. Also, this includes minor types and detailed explanation for some statements such as url. - Throw exceptions fast by checking arguments first rather than at execution time (e.g. for `fetchsize`). - Exclude Spark-only options in connection properties. ## How was this patch tested? Existing tests should cover this. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15292 from HyukjinKwon/SPARK-17719.	2016-10-10 22:22:41 -07:00
Timothy Chen	29f186bfdf	[SPARK-14082][MESOS] Enable GPU support with Mesos ## What changes were proposed in this pull request? Enable GPU resources to be used when running in coarse-grained mode with Mesos. ## How was this patch tested? Manual test with GPU. Author: Timothy Chen <tnachen@gmail.com> Closes #14644 from tnachen/gpu_mesos.	2016-10-10 23:20:15 +02:00
Wenchen Fan	23ddff4b2b	[SPARK-17338][SQL] add global temp view ## What changes were proposed in this pull request? Global temporary view is a cross-session temporary view, which means it's shared among all sessions. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system preserved database `global_temp` (configurable via SparkConf), and we must use the qualified name to refer to a global temp view, e.g. SELECT * FROM global_temp.view1. changes for `SessionCatalog`: 1. add a new field `gloabalTempViews: GlobalTempViewManager`, to access the shared global temp views, and the global temp db name. 2. `createDatabase` will fail if users want to create `global_temp`, which is system preserved. 3. `setCurrentDatabase` will fail if users want to set `global_temp`, which is system preserved. 4. add `createGlobalTempView`, which is used in `CreateViewCommand` to create global temp views. 5. add `dropGlobalTempView`, which is used in `CatalogImpl` to drop global temp view. 6. add `alterTempViewDefinition`, which is used in `AlterViewAsCommand` to update the view definition for local/global temp views. 7. `renameTable`/`dropTable`/`isTemporaryTable`/`lookupRelation`/`getTempViewOrPermanentTableMetadata`/`refreshTable` will handle global temp views. changes for SQL commands: 1. `CreateViewCommand`/`AlterViewAsCommand` is updated to support global temp views 2. `ShowTablesCommand` outputs a new column `database`, which is used to distinguish global and local temp views. 3. other commands can also handle global temp views if they call `SessionCatalog` APIs which accept global temp views, e.g. `DropTableCommand`, `AlterTableRenameCommand`, `ShowColumnsCommand`, etc. changes for other public API 1. add a new method `dropGlobalTempView` in `Catalog` 2. `Catalog.findTable` can find global temp view 3. add a new method `createGlobalTempView` in `Dataset` ## How was this patch tested? 
new tests in `SQLViewSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #14897 from cloud-fan/global-temp-view.	2016-10-10 15:48:57 +08:00
Sean Owen	cff5607552	[SPARK-17707][WEBUI] Web UI prevents spark-submit application to be finished ## What changes were proposed in this pull request? This expands calls to Jetty's simple `ServerConnector` constructor to explicitly specify a `ScheduledExecutorScheduler` that makes daemon threads. It should otherwise result in exactly the same configuration, because the other args are copied from the constructor that is currently called. (I'm not sure we should change the Hive Thriftserver impl, but I did anyway.) This also adds `sc.stop()` to the quick start guide example. ## How was this patch tested? Existing tests; _pending_ at least manual verification of the fix. Author: Sean Owen <sowen@cloudera.com> Closes #15381 from srowen/SPARK-17707.	2016-10-07 10:31:41 -07:00
Shixiong Zhu	9293734d35	[SPARK-17346][SQL] Add Kafka source for Structured Streaming ## What changes were proposed in this pull request? This PR adds a new project ` external/kafka-0-10-sql` for Structured Streaming Kafka source. It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing tdas did most of the work, and part of it was inspired by koeninger's work. ### Introduction The Kafka source is a structured streaming data source to poll data from Kafka. The schema of reading data is as follows: Column \| Type ---- \| ---- key \| binary value \| binary topic \| string partition \| int offset \| long timestamp \| long timestampType \| int The source can deal with deleting topics. However, the user should make sure there is no Spark job processing the data when deleting a topic. ### Configuration The user can use `DataStreamReader.option` to set the following configurations. Kafka Source's options \| value \| default \| meaning ------ \| ------- \| ------ \| ----- startingOffset \| ["earliest", "latest"] \| "latest" \| The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started; resuming will always pick up from where the query left off. failOnDataLost \| [true, false] \| true \| Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected. subscribe \| A comma-separated list of topics \| (none) \| The topic list to subscribe. Only one of "subscribe" and "subscribePattern" options can be specified for Kafka source. subscribePattern \| Java regex string \| (none) \| The pattern used to subscribe the topic. Only one of "subscribe" and "subscribePattern" options can be specified for Kafka source. 
kafka.consumer.poll.timeoutMs \| long \| 512 \| The timeout in milliseconds to poll data from Kafka in executors fetchOffset.numRetries \| int \| 3 \| Number of times to retry before giving up fetching the latest Kafka offsets. fetchOffset.retryIntervalMs \| long \| 10 \| milliseconds to wait before retrying to fetch Kafka offsets Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.` prefix, e.g., `stream.option("kafka.bootstrap.servers", "host:port")` ### Usage * Subscribe to 1 topic ```Scala spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "host:port") .option("subscribe", "topic1") .load() ``` * Subscribe to multiple topics ```Scala spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "host:port") .option("subscribe", "topic1,topic2") .load() ``` * Subscribe to a pattern ```Scala spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "host:port") .option("subscribePattern", "topic.*") .load() ``` ## How was this patch tested? The new unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Author: Tathagata Das <tathagata.das1565@gmail.com> Author: Shixiong Zhu <zsxwing@gmail.com> Author: cody koeninger <cody@koeninger.org> Closes #15102 from zsxwing/kafka-source.	2016-10-05 16:45:45 -07:00
sethah	9df54f5325	[SPARK-17239][ML][DOC] Update user guide for multiclass logistic regression ## What changes were proposed in this pull request? Updates user guide to reflect that LogisticRegression now supports multiclass. Also adds new examples to show multiclass training. ## How was this patch tested? Ran locally using spark-submit, run-example, and copy/paste from user guide into shells. Generated docs and verified correct output. Author: sethah <seth.hendrickson16@gmail.com> Closes #15349 from sethah/SPARK-17239.	2016-10-05 18:28:21 +00:00
Sean Owen	1dd68d3827	[SPARK-17718][DOCS][MLLIB] Make loss function formulation label note clearer in MLlib docs ## What changes were proposed in this pull request? Move note about labels being +1/-1 in formulation only to be just under the table of formulations. ## How was this patch tested? Doc build Author: Sean Owen <sowen@cloudera.com> Closes #15330 from srowen/SPARK-17718.	2016-10-03 18:09:36 +00:00
Jagadeesan	a27033c0bb	[SPARK-17736][DOCUMENTATION][SPARKR] Update R README for rmarkdown,… ## What changes were proposed in this pull request? To build R docs (which are built when R tests are run), users need to install pandoc and rmarkdown. This was done for Jenkins in ~~[SPARK-17420](https://issues.apache.org/jira/browse/SPARK-17420)~~ … pandoc] Author: Jagadeesan <as2@us.ibm.com> Closes #15309 from jagadeesanas2/SPARK-17736.	2016-10-03 10:46:38 +01:00
Dongjoon Hyun	15e9bbb49e	[MINOR][DOC] Add an up-to-date description for default serialization during shuffling ## What changes were proposed in this pull request? This PR aims to make the doc up-to-date. The documentation is generally correct, but after https://issues.apache.org/jira/browse/SPARK-13926, Spark chooses Kryo as the default serialization library during shuffling of simple types, arrays of simple types, or string type. ## How was this patch tested? This is a documentation update. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #15315 from dongjoon-hyun/SPARK-DOC-SERIALIZER.	2016-09-30 22:05:59 -07:00
Dongjoon Hyun	39eb3bb1ec	[SPARK-17412][DOC] All test should not be run by `root` or any admin user ## What changes were proposed in this pull request? `FsHistoryProviderSuite` fails if the `root` user runs it. The test case SPARK-3697: ignore directories that cannot be read depends on `setReadable(false, false)` to make test data files unreadable and expects the number of accessible files to be 1. But `root` can access all files, so it returns 2. This PR states this assumption explicitly in the doc `building-spark.md`. ## How was this patch tested? This is a documentation change. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #15291 from dongjoon-hyun/SPARK-17412.	2016-09-29 16:01:45 -07:00
José Hiram Soltren	958200497a	[DOCS] Reorganize explanation of Accumulators and Broadcast Variables ## What changes were proposed in this pull request? The discussion of the interaction of Accumulators and Broadcast Variables should logically follow the discussion on Checkpointing. As currently written, this section discusses Checkpointing before it is formally introduced. To remedy this: - Rename this section to "Accumulators, Broadcast Variables, and Checkpoints", and - Move this section after "Checkpointing". ## How was this patch tested? Testing: ran $ SKIP_API=1 jekyll build , and verified changes in a Web browser pointed at docs/_site/index.html. Author: José Hiram Soltren <jose@cloudera.com> Closes #15281 from jsoltren/doc-changes.	2016-09-29 10:18:56 -07:00
Takeshi YAMAMURO	b2e9731ca4	[MINOR][DOCS] Fix the doc. of spark-streaming with kinesis ## What changes were proposed in this pull request? This PR just fixes the documentation of `spark-kinesis-integration`. Since `SPARK-17418` prevented all the Kinesis artifacts (including the Kinesis example code) from being published, `bin/run-example streaming.KinesisWordCountASL` and `bin/run-example streaming.JavaKinesisWordCountASL` do not work. Instead, it fetches the kinesis jar from the Spark Package. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #15260 from maropu/DocFixKinesis.	2016-09-29 08:26:03 -04:00
Shuai Lin	b2a7eedcdd	[SPARK-17017][ML][MLLIB][ML][DOC] Updated the ml/mllib feature selection docs for ChiSqSelector ## What changes were proposed in this pull request? A follow up for #14597 to update feature selection docs about ChiSqSelector. ## How was this patch tested? Generated html docs. It can be previewed at: * ml: http://sparkdocs.lins05.pw/spark-17017/ml-features.html#chisqselector * mllib: http://sparkdocs.lins05.pw/spark-17017/mllib-feature-extraction.html#chisqselector Author: Shuai Lin <linshuai2012@gmail.com> Closes #15236 from lins05/spark-17017-update-docs-for-chisq-selector-fpr.	2016-09-28 06:12:48 -04:00
Andrew Mills	00be16df64	[Docs] Update spark-standalone.md to fix link Corrected a link to the configuration.html page, it was pointing to a page that does not exist (configurations.html). Documentation change, verified in preview. Author: Andrew Mills <ammills01@users.noreply.github.com> Closes #15244 from ammills01/master.	2016-09-26 16:41:14 -04:00
Liang-Chi Hsieh	8135e0e5eb	[SPARK-17153][SQL] Should read partition data when reading new files in filestream without globbing ## What changes were proposed in this pull request? When reading file stream with non-globbing path, the results return data with all `null`s for the partitioned columns. E.g., case class A(id: Int, value: Int) val data = spark.createDataset(Seq( A(1, 1), A(2, 2), A(2, 3)) ) val url = "/tmp/test" data.write.partitionBy("id").parquet(url) spark.read.parquet(url).show +-----+---+ \|value\| id\| +-----+---+ \| 2\| 2\| \| 3\| 2\| \| 1\| 1\| +-----+---+ val s = spark.readStream.schema(spark.read.load(url).schema).parquet(url) s.writeStream.queryName("test").format("memory").start() sql("SELECT * FROM test").show +-----+----+ \|value\| id\| +-----+----+ \| 2\|null\| \| 3\|null\| \| 1\|null\| +-----+----+ ## How was this patch tested? Jenkins tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #14803 from viirya/filestreamsource-option.	2016-09-26 13:07:11 -07:00
Justin Pihony	50b89d05b7	[SPARK-14525][SQL] Make DataFrameWriter.save work for jdbc ## What changes were proposed in this pull request? This change modifies the implementation of DataFrameWriter.save such that it works with jdbc, and the call to jdbc merely delegates to save. ## How was this patch tested? This was tested via unit tests in the JDBCWriteSuite, to which I added one new test to cover this scenario. ## Additional details rxin This seems to have been most recently touched by you and was also commented on in the JIRA. This contribution is my original work and I license the work to the project under the project's open source license. Author: Justin Pihony <justin.pihony@gmail.com> Author: Justin Pihony <justin.pihony@typesafe.com> Closes #12601 from JustinPihony/jdbc_reconciliation.	2016-09-26 09:54:22 +01:00
Jeff Zhang	f62ddc5983	[SPARK-17210][SPARKR] sparkr.zip is not distributed to executors when running sparkr in RStudio ## What changes were proposed in this pull request? Spark will add sparkr.zip to archive only when it is yarn mode (SparkSubmit.scala). ``` if (args.isR && clusterManager == YARN) { val sparkRPackagePath = RUtils.localSparkRPackagePath if (sparkRPackagePath.isEmpty) { printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.") } val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE) if (!sparkRPackageFile.exists()) { printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.") } val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString // Distribute the SparkR package. // Assigns a symbol link name "sparkr" to the shipped package. args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr") // Distribute the R package archive containing all the built R packages. if (!RUtils.rPackages.isEmpty) { val rPackageFile = RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE) if (!rPackageFile.exists()) { printErrorAndExit("Failed to zip all the built R packages.") } val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString // Assigns a symbol link name "rpkg" to the shipped package. args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg") } } ``` So it is necessary to pass spark.master from R process to JVM. Otherwise sparkr.zip won't be distributed to executor. Besides that I also pass spark.yarn.keytab/spark.yarn.principal to spark side, because JVM process need them to access secured cluster. ## How was this patch tested? Verify it manually in R Studio using the following code. 
``` Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark") .libPaths(c(file.path(Sys.getenv(), "R", "lib"), .libPaths())) library(SparkR) sparkR.session(master="yarn-client", sparkConfig = list(spark.executor.instances="1")) df <- as.DataFrame(mtcars) head(df) ``` … Author: Jeff Zhang <zjffdu@apache.org> Closes #14784 from zjffdu/SPARK-17210.	2016-09-23 11:37:43 -07:00
frreiss	646f383465	[SPARK-17421][DOCS] Documenting the current treatment of MAVEN_OPTS. ## What changes were proposed in this pull request? Modified the documentation to clarify that `build/mvn` and `pom.xml` always add Java 7-specific parameters to `MAVEN_OPTS`, and that developers can safely ignore warnings about `-XX:MaxPermSize` that may result from compiling or running tests with Java 8. ## How was this patch tested? Rebuilt HTML documentation, made sure that building-spark.html displays correctly in a browser. Author: frreiss <frreiss@us.ibm.com> Closes #15005 from frreiss/fred-17421a.	2016-09-22 10:31:15 +01:00
Marcelo Vanzin	2cd1bfa4f0	[SPARK-4563][CORE] Allow driver to advertise a different network address. The goal of this feature is to allow the Spark driver to run in an isolated environment, such as a docker container, and be able to use the host's port forwarding mechanism to be able to accept connections from the outside world. The change is restricted to the driver: there is no support for achieving the same thing on executors (or the YARN AM for that matter). Those still need full access to the outside world so that, for example, connections can be made to an executor's block manager. The core of the change is simple: add a new configuration that tells what's the address the driver should bind to, which can be different than the address it advertises to executors (spark.driver.host). Everything else is plumbing the new configuration where it's needed. To use the feature, the host starting the container needs to set up the driver's port range to fall into a range that is being forwarded; this required the block manager port to need a special configuration just for the driver, which falls back to the existing spark.blockManager.port when not set. This way, users can modify the driver settings without affecting the executors; it would theoretically be nice to also have different retry counts for driver and executors, but given that docker (at least) allows forwarding port ranges, we can probably live without that for now. Because of the nature of the feature it's kinda hard to add unit tests; I just added a simple one to make sure the configuration works. 
This was tested with a docker image running spark-shell with the following command: docker blah blah blah \ -p 38000-38100:38000-38100 \ [image] \ spark-shell \ --num-executors 3 \ --conf spark.shuffle.service.enabled=false \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.driver.host=[host's address] \ --conf spark.driver.port=38000 \ --conf spark.driver.blockManager.port=38020 \ --conf spark.ui.port=38040 Running on YARN; verified the driver works, executors start up and listen on ephemeral ports (instead of using the driver's config), and that caching and shuffling (without the shuffle service) works. Clicked through the UI to make sure all pages (including executor thread dumps) worked. Also tested apps without docker, and ran unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #15120 from vanzin/SPARK-4563.	2016-09-21 14:42:41 -07:00
VinceShieh	57dc326bd0	[SPARK-17219][ML] Add NaN value handling in Bucketizer ## What changes were proposed in this pull request? This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN values. Sometimes, null value might also be useful to users, so in these cases, Bucketizer should reserve one extra bucket for NaN values, instead of throwing an illegal exception. Before: ``` Bucketizer.transform on NaN value threw an illegal exception. ``` After: ``` NaN values will be grouped in an extra bucket. ``` ## How was this patch tested? New test cases added in `BucketizerSuite`. Signed-off-by: VinceShieh <vincent.xie@intel.com> Author: VinceShieh <vincent.xie@intel.com> Closes #14858 from VinceShieh/spark-17219.	2016-09-21 10:20:57 +01:00
sandy	bbe0b1d623	[SPARK-17575][DOCS] Remove extra table tags in configuration document ## What changes were proposed in this pull request? Remove extra table tags in the configuration document. ## How was this patch tested? Ran all test cases and generated the document. Before, with the extra tags, it looks like below ![config-wrong1](https://cloud.githubusercontent.com/assets/8075390/18608239/c602bb60-7d01-11e6-875e-f38558997dd3.png) ![config-wrong2](https://cloud.githubusercontent.com/assets/8075390/18608241/cf3b672c-7d01-11e6-935e-1e73f9e6e578.png) After removing the tags, it looks like below ![config](https://cloud.githubusercontent.com/assets/8075390/18608245/e156eb8e-7d01-11e6-98aa-3be68d4d1961.png) ![config2](https://cloud.githubusercontent.com/assets/8075390/18608247/e84eecd4-7d01-11e6-9738-a3f7ff8fe834.png) Author: sandy <phalodi@gmail.com> Closes #15130 from phalodi/SPARK-17575.	2016-09-17 16:25:03 +01:00
Daniel Darabos	69cb049697	Correct fetchsize property name in docs ## What changes were proposed in this pull request? Replace `fetchSize` with `fetchsize` in the docs. ## How was this patch tested? I manually tested `fetchSize` and `fetchsize`. The latter has an effect. See also [`JdbcUtils.scala#L38`](https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L38) for the definition of the property. Author: Daniel Darabos <darabos.daniel@gmail.com> Closes #14975 from darabos/patch-3.	2016-09-17 12:28:42 +01:00
Sean Owen	dc0a4c9161	[SPARK-17445][DOCS] Reference an ASF page as the main place to find third-party packages ## What changes were proposed in this pull request? Point references to spark-packages.org to https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects This will be accompanied by a parallel change to the spark-website repo, and additional changes to this wiki. ## How was this patch tested? Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #15075 from srowen/SPARK-17445.	2016-09-14 10:10:16 +01:00
Jagadeesan	def7c265f5	[SPARK-17449][DOCUMENTATION] Relation between heartbeatInterval and… ## What changes were proposed in this pull request? The relation between spark.network.timeout and spark.executor.heartbeatInterval should be mentioned in the document. … network timeout] Author: Jagadeesan <as2@us.ibm.com> Closes #15042 from jagadeesanas2/SPARK-17449.	2016-09-14 09:03:16 +01:00
Satendra Kumar	7098a12945	Streaming doc correction. ## What changes were proposed in this pull request? Streaming doc correction. Author: Satendra Kumar <satendra@knoldus.com> Closes #14996 from satendrakumar06/patch-1.	2016-09-09 19:15:06 +01:00
Gurvinder Singh	92ce8d4849	[SPARK-15487][WEB UI] Spark Master UI to reverse proxy Application and Workers UI ## What changes were proposed in this pull request? This pull request adds the functionality to access the worker and application UIs through the master UI itself. This helps in accessing the Spark UI when running a Spark cluster in closed networks, e.g. Kubernetes. The cluster admin needs to expose only the Spark master UI, and the rest of the UIs can stay in the private network; the master UI will reverse proxy the connection requests to the corresponding resource. It adds the path for workers/application UIs as WorkerUI: <http/https>://master-publicIP:<port>/target/workerID/ ApplicationUI: <http/https>://master-publicIP:<port>/target/appID/ This makes it easy for users to protect access to the Spark master cluster by putting a reverse proxy in front of it, e.g. https://github.com/bitly/oauth2_proxy ## How was this patch tested? The functionality has been tested manually and there is also a unit test for access to the worker UI with a reverse proxy address. pwendell bomeng BryanCutler can you please review it, thanks. Author: Gurvinder Singh <gurvinder.singh@uninett.no> Closes #13950 from gurvindersingh/rproxy.	2016-09-08 17:20:20 -07:00
Yangyang Liu	5bea8757cc	[SPARK-16619] Add shuffle service metrics entry in monitoring docs After change [SPARK-16405](https://github.com/apache/spark/pull/14080), we need to update the docs by adding a shuffle service metrics entry to the list of currently supported metrics. Author: Yangyang Liu <yangyangliu@fb.com> Closes #14254 from lovexi/yangyang-monitoring-doc.	2016-09-01 17:01:16 -07:00
Sean Owen	3893e8c576	[SPARK-17331][CORE][MLLIB] Avoid allocating 0-length arrays ## What changes were proposed in this pull request? Avoid allocating some 0-length arrays, esp. in UTF8String, and by using Array.empty in Scala over Array[T]() ## How was this patch tested? Jenkins Author: Sean Owen <sowen@cloudera.com> Closes #14895 from srowen/SPARK-17331.	2016-09-01 12:13:07 -07:00
Seigneurin, Alexis (CONT)	dd859f95c0	fixed typos fixed 2 typos Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com> Closes #14877 from aseigneurin/fix-typo-2.	2016-09-01 09:32:05 +01:00
Jeff Zhang	fa6347938f	[SPARK-17178][SPARKR][SPARKSUBMIT] Allow to set sparkr shell command through --conf ## What changes were proposed in this pull request? Allow user to set sparkr shell command through --conf spark.r.shell.command ## How was this patch tested? Unit test is added and also verify it manually through ``` bin/sparkr --master yarn-client --conf spark.r.shell.command=/usr/local/bin/R ``` Author: Jeff Zhang <zjffdu@apache.org> Closes #14744 from zjffdu/SPARK-17178.	2016-08-31 00:20:41 -07:00
Alex Bozarth	f7beae6da0	[SPARK-17243][WEB UI] Spark 2.0 History Server won't load with very large application history ## What changes were proposed in this pull request? With the new History Server, the summary page loads the application list via the REST API, which makes it very slow or impossible to load with a large (10K+) application history. This PR fixes this by adding the `spark.history.ui.maxApplications` conf to limit the number of applications the History Server displays. This is accomplished using a new optional `limit` param for the `applications` api. (Note this only applies to what the summary page displays; all the Application UIs are still accessible if the user knows the App ID and goes to the Application UI directly.) I've also added a new test for the `limit` param in `HistoryServerSuite.scala` ## How was this patch tested? Manual testing and dev/run-tests Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #14835 from ajbozarth/spark17243.	2016-08-30 16:33:54 -05:00
Ferdinand Xu	4b4e329e49	[SPARK-5682][CORE] Add encrypted shuffle in spark This patch is using Apache Commons Crypto library to enable shuffle encryption support. Author: Ferdinand Xu <cheng.a.xu@intel.com> Author: kellyzly <kellyzly@126.com> Closes #8880 from winningsix/SPARK-10771.	2016-08-30 09:15:31 -07:00
Dmitriy Sokolov	d4eee9932e	[MINOR][DOCS] Fix minor typos in python example code ## What changes were proposed in this pull request? Fix minor typos in the Python example code in the streaming programming guide ## How was this patch tested? N/A Author: Dmitriy Sokolov <silentsokolov@gmail.com> Closes #14805 from silentsokolov/fix-typos.	2016-08-30 11:23:37 +01:00
Seigneurin, Alexis (CONT)	08913ce000	fixed a typo idempotant -> idempotent Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com> Closes #14833 from aseigneurin/fix-typo.	2016-08-29 13:12:10 +01:00
Sean Owen	e07baf1412	[SPARK-17001][ML] Enable standardScaler to standardize sparse vectors when withMean=True ## What changes were proposed in this pull request? Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages. ## How was this patch tested? Jenkins tests, including new cases to reflect the new behavior. Author: Sean Owen <sowen@cloudera.com> Closes #14663 from srowen/SPARK-17001.	2016-08-27 08:48:56 +01:00
Michael Gummelt	8e5475be3c	[SPARK-16967] move mesos to module ## What changes were proposed in this pull request? Move Mesos code into a mvn module ## How was this patch tested? unit tests; manually submitting a client mode and cluster mode job; spark/mesos integration test suite Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14637 from mgummelt/mesos-module.	2016-08-26 12:25:22 -07:00
Shixiong Zhu	341e0e778d	[SPARK-17242][DOCUMENT] Update links of external dstream projects ## What changes were proposed in this pull request? Updated links of external dstream projects. ## How was this patch tested? Just document changes. Author: Shixiong Zhu <shixiong@databricks.com> Closes #14814 from zsxwing/dstream-link.	2016-08-25 21:08:42 -07:00
Alex Bozarth	891ac2b914	[SPARK-15083][WEB UI] History Server can OOM due to unlimited TaskUIData ## What changes were proposed in this pull request? Based on #12990 by tankkyo. Since the History Server currently loads all applications' data, it can OOM if too many applications have a significant task count. `spark.ui.trimTasks` (default: false) can be set to true to trim tasks by `spark.ui.retainedTasks` (default: 10000) (This is a "quick fix" to help those running into the problem until an update of how the history server loads app data can be done) ## How was this patch tested? Manual testing and dev/run-tests ![spark-15083](https://cloud.githubusercontent.com/assets/13952758/17713694/fe82d246-63b0-11e6-9697-b87ea75ff4ef.png) Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #14673 from ajbozarth/spark15083.	2016-08-24 14:39:41 -05:00
Yanbo Liang	45b786aca2	[MINOR][DOC] Fix wrong ml.feature.Normalizer document. ## What changes were proposed in this pull request? The ```ml.feature.Normalizer``` examples illustrate L1 norm rather than L2, we should correct corresponding document. ![image](https://cloud.githubusercontent.com/assets/1962026/17928637/85aec284-69b0-11e6-9b13-d465ee560581.png) ## How was this patch tested? Doc change, no test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14787 from yanboliang/normalizer.	2016-08-24 08:24:16 -07:00
hyukjinkwon	588559911d	[MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides ## What changes were proposed in this pull request? This PR changes curly quotes (`“` and `”` ) to standard quotes (`"`). Curly quotes are an actual problem when users copy and paste the examples, because the pasted code would not work. This seems to happen only in `structured-streaming-programming-guide.md`. ## How was this patch tested? Manually built. This will change some examples to be correctly marked down as below: ![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png) to ![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png) Author: hyukjinkwon <gurwls223@gmail.com> Closes #14770 from HyukjinKwon/minor-quotes.	2016-08-23 21:21:43 +01:00
Sean Owen	342278c09c	[SPARK-16320][DOC] Document G1 heap region's effect on spark 2.0 vs 1.6 ## What changes were proposed in this pull request? Collect GC discussion in one section, and documenting findings about G1 GC heap region size. ## How was this patch tested? Jekyll doc build Author: Sean Owen <sowen@cloudera.com> Closes #14732 from srowen/SPARK-16320.	2016-08-22 11:15:53 -07:00
Jagadeesan	bd9655063b	[SPARK-17085][STREAMING][DOCUMENTATION AND ACTUAL CODE DIFFERS - UNSUPPORTED OPERATIONS] Changes in Spark Structured Streaming doc in this link https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html#unsupported-operations Author: Jagadeesan <as2@us.ibm.com> Closes #14715 from jagadeesanas2/SPARK-17085.	2016-08-22 09:30:31 +01:00
GraceH	4b6c2cbcb1	[SPARK-16968] Document additional options in jdbc Writer ## What changes were proposed in this pull request? This documents the previously added JDBC Writer options. ## How was this patch tested? A unit test was added in the previous PR. Author: GraceH <jhuang1@paypal.com> Closes #14683 from GraceH/jdbc_options.	2016-08-22 09:03:46 +01:00
wm624@hotmail.com	e328f577e8	[SPARK-17002][CORE] Document that spark.ssl.protocol. is required for SSL ## What changes were proposed in this pull request? Setting `spark.ssl.enabled`=true without setting `spark.ssl.protocol` fails and throws a meaningless exception. `spark.ssl.protocol` is required when `spark.ssl.enabled` is set. Improvement: require `spark.ssl.protocol` when initializing SSLContext, otherwise throw an exception that says so. Remove the OrElse("default"). Document this requirement in configure.md ## How was this patch tested? Manual tests: Build document and check document Configure `spark.ssl.enabled` only, it throws exception below: 6/08/16 16:04:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mwang); groups with view permissions: Set(); users with modify permissions: Set(mwang); groups with modify permissions: Set() Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: spark.ssl.protocol is required when enabling SSL connections. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:285) at org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1026) at org.apache.spark.deploy.master.Master$.main(Master.scala:1011) at org.apache.spark.deploy.master.Master.main(Master.scala) Configure `spark.ssl.enabled` and `spark.ssl.protocol`: it works fine. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #14674 from wangmiao1981/ssl.	2016-08-21 11:51:46 +01:00
Stavros Kontopoulos	b81421afb0	[SPARK-17087][MESOS] Documentation for Making Spark on Mesos honor port restrictions ## What changes were proposed in this pull request? - adds documentation for https://issues.apache.org/jira/browse/SPARK-11714 ## How was this patch tested? Doc no test needed. Author: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> Closes #14667 from skonto/add_doc.	2016-08-18 12:19:19 +01:00
sandy	e28a8c5899	[SPARK-17089][DOCS] Remove api doc link for mapReduceTriplets operator ## What changes were proposed in this pull request? Remove the API doc link for the mapReduceTriplets operator. The operator has been removed from the latest API, so users who follow the link will not find mapReduceTriplets there; removing the link is better than confusing the user. ## How was this patch tested? Run all the test cases ![screenshot from 2016-08-16 23-08-25](https://cloud.githubusercontent.com/assets/8075390/17709393/8cfbf75a-6406-11e6-98e6-38f7b319d833.png) Author: sandy <phalodi@gmail.com> Closes #14669 from phalodi/SPARK-17089.	2016-08-16 12:50:55 -07:00
linbojin	6f0988b129	[MINOR][DOC] Correct code snippet results in quick start documentation ## What changes were proposed in this pull request? As the README.md file is updated over time, some code snippet outputs are no longer correct for the new README.md file. For example: ``` scala> textFile.count() res0: Long = 126 ``` should be ``` scala> textFile.count() res0: Long = 99 ``` This PR adds comments to point out this problem so that new Spark learners have a correct reference. It also fixes a small bug: in the current documentation, the outputs of linesWithSpark.count() without and with cache are different (one is 15 and the other is 19) ``` scala> val linesWithSpark = textFile.filter(line => line.contains("Spark")) linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:27 scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"? res3: Long = 15 ... scala> linesWithSpark.cache() res7: linesWithSpark.type = MapPartitionsRDD[2] at filter at <console>:27 scala> linesWithSpark.count() res8: Long = 19 ``` ## How was this patch tested? manual test: run `$ SKIP_API=1 jekyll serve --watch` Author: linbojin <linbojin203@gmail.com> Closes #14645 from linbojin/quick-start-documentation.	2016-08-16 11:37:54 +01:00
Jagadeesan	e46cb78b3b	[SPARK-12370][DOCUMENTATION] Documentation should link to examples … ## What changes were proposed in this pull request? When documentation is built is should reference examples from the same build. There are times when the docs have links that point to files in the GitHub head which may not be valid on the current release. Changed that in URLs to make them point to the right tag in git using ```SPARK_VERSION_SHORT``` …from its own release version] [Streaming programming guide] Author: Jagadeesan <as2@us.ibm.com> Closes #14596 from jagadeesanas2/SPARK-12370.	2016-08-13 11:25:03 +01:00
WeichenXu	91f2735a18	[DOC] add config option spark.ui.enabled into document ## What changes were proposed in this pull request? The configuration doc is missing the config option `spark.ui.enabled` (default value is `true`). I think this option is important because in many cases we would like to turn it off, so I added it. ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14604 from WeichenXu123/add_doc_param_spark_ui_enabled.	2016-08-12 20:10:09 +01:00
hyukjinkwon	f4482225c4	[MINOR][DOC] Fix style in examples across documentation ## What changes were proposed in this pull request? This PR fixes the documentation as below: - Python has 4 spaces and Java and Scala has 2 spaces (See https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide). - Avoid excessive parentheses and curly braces for anonymous functions. (See https://github.com/databricks/scala-style-guide#anonymous) ## How was this patch tested? N/A Author: hyukjinkwon <gurwls223@gmail.com> Closes #14593 from HyukjinKwon/minor-documentation.	2016-08-12 10:00:58 +01:00
Jeff Zhang	7a9e25c383	[SPARK-13081][PYSPARK][SPARK_SUBMIT] Allow set pythonExec of driver and executor through conf… Before this PR, users had to export environment variables to specify the Python executable of the driver and executors, which is inconvenient. This PR allows users to specify Python through the configuration "--pyspark-driver-python" & "--pyspark-executor-python". Manually tested in local & yarn mode for pyspark-shell and pyspark batch mode. Author: Jeff Zhang <zjffdu@apache.org> Closes #13146 from zjffdu/SPARK-13081.	2016-08-11 20:08:39 -07:00
hyukjinkwon	7186e8c318	[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation ## What changes were proposed in this pull request? Originally this PR was based on #14491 but I realised that fixing the examples is more sensible than fixing the comments. This PR fixes three things below: - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java. - Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have 2 spaces. These are inconsistent across the examples. - Fix `StructuredNetworkWordCountWindowed` and `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to the Python one (the Python one loads it as `DataFrame` initially). ## How was this patch tested? N/A Closes https://github.com/apache/spark/pull/14491 Author: hyukjinkwon <gurwls223@gmail.com> Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local> Closes #14564 from HyukjinKwon/SPARK-16886.	2016-08-11 11:31:52 +01:00
Andrew Ash	8a6b7037bb	Correct example value for spark.ssl.YYY.XXX settings Docs adjustment to: - link to other relevant section of docs - correct statement about the only value when actually other values are supported Author: Andrew Ash <andrew@andrewash.com> Closes #14581 from ash211/patch-10.	2016-08-11 11:26:57 +01:00
Tao Wang	7a6a3c3fbc	[SPARK-17010][MINOR][DOC] Wrong description in memory management document ## What changes were proposed in this pull request? Change the remaining percentage to the correct one. ## How was this patch tested? Manual review Author: Tao Wang <wangtao111@huawei.com> Closes #14591 from WangTaoTheTonic/patch-1.	2016-08-10 22:30:18 -07:00
jerryshao	ab648c0004	[SPARK-14743][YARN] Add a configurable credential manager for Spark running on YARN ## What changes were proposed in this pull request? Add a configurable token manager for Spark running on YARN. ### Current Problems ### 1. The supported token providers are hard-coded: currently only hdfs, hbase and hive are supported, and it is impossible for a user to add a new token provider without code changes. 2. The same problem exists in the timely token renewer and updater. ### Changes In This Proposal ### To address the problems mentioned above and make the current code cleaner and easier to understand, this proposal makes 3 main changes: 1. Abstract a `ServiceTokenProvider` as well as a `ServiceTokenRenewable` interface for token providers. Each service that wants to communicate with Spark through tokens needs to implement this interface. 2. Provide a `ConfigurableTokenManager` to manage all the registered token providers, as well as the token renewer and updater. This class also offers the API for other modules to obtain tokens, get the renewal interval and so on. 3. Implement 3 built-in token providers `HDFSTokenProvider`, `HiveTokenProvider` and `HBaseTokenProvider` to keep the same semantics as supported today. Whether to load these built-in token providers is controlled by the configuration "spark.yarn.security.tokens.${service}.enabled"; by default all the built-in token providers are loaded. ### Behavior Changes ### For the end user there's no behavior change: we still use the same configuration `spark.yarn.security.tokens.${service}.enabled` to decide which token provider is enabled (hbase or hive). A user-implemented token provider (assume the name of the token provider is "test") needs two configurations: 1. `spark.yarn.security.tokens.test.enabled` set to true 2. `spark.yarn.security.tokens.test.class` set to the fully qualified class name. So we still keep the same semantics as the current code while adding one new configuration. 
### Current Status ### - [x] token provider interface and management framework. - [x] implement built-in token providers (hdfs, hbase, hive). - [x] Coverage of unit test. - [x] Integrated test with security cluster. ## How was this patch tested? Unit test and integrated test. Please suggest and review, any comment is greatly appreciated. Author: jerryshao <sshao@hortonworks.com> Closes #14065 from jerryshao/SPARK-16342.	2016-08-10 15:39:30 -07:00
Timothy Chen	eca58755fb	[SPARK-16927][SPARK-16923] Override task properties at dispatcher. ## What changes were proposed in this pull request? - enable setting default properties for all jobs submitted through the dispatcher [SPARK-16927] - remove duplication of conf vars on cluster submitted jobs [SPARK-16923] (this is a small fix, so I'm including in the same PR) ## How was this patch tested? mesos/spark integration test suite manual testing Author: Timothy Chen <tnachen@gmail.com> Closes #14511 from mgummelt/override-props.	2016-08-10 10:11:03 +01:00
Josh Rosen	b89b3a5c8e	[SPARK-16956] Make ApplicationState.MAX_NUM_RETRY configurable ## What changes were proposed in this pull request? This patch introduces a new configuration, `spark.deploy.maxExecutorRetries`, to let users configure an obscure behavior in the standalone master where the master will kill Spark applications which have experienced too many back-to-back executor failures. The current setting is a hardcoded constant (10); this patch replaces that with a new cluster-wide configuration. Background: This application-killing was added in `6b5980da79` (from September 2012) and I believe that it was designed to prevent a faulty application whose executors could never launch from DOS'ing the Spark cluster via an infinite series of executor launch attempts. In a subsequent patch (#1360), this feature was refined to prevent applications which have running executors from being killed by this code path. Motivation for making this configurable: Previously, if a Spark Standalone application experienced more than `ApplicationState.MAX_NUM_RETRY` executor failures and was left with no executors running then the Spark master would kill that application, but this behavior is problematic in environments where the Spark executors run on unstable infrastructure and can all simultaneously die. For instance, if your Spark driver runs on an on-demand EC2 instance while all workers run on ephemeral spot instances then it's possible for all executors to die at the same time while the driver stays alive. In this case, it may be desirable to keep the Spark application alive so that it can recover once new workers and executors are available. In order to accommodate this use-case, this patch modifies the Master to never kill faulty applications if `spark.deploy.maxExecutorRetries` is negative. I'd like to merge this patch into master, branch-2.0, and branch-1.6. ## How was this patch tested? I tested this manually using `spark-shell` and `local-cluster` mode. 
This is a tricky feature to unit test and historically this code has not changed very often, so I'd prefer to skip the additional effort of adding a testing framework and would rather rely on manual tests and review for now. Author: Josh Rosen <joshrosen@databricks.com> Closes #14544 from JoshRosen/add-setting-for-max-executor-failures.	2016-08-09 11:21:45 -07:00
Michael Gummelt	62e6212441	[SPARK-16809] enable history server links in dispatcher UI ## What changes were proposed in this pull request? Links the Spark Mesos Dispatcher UI to the history server UI - adds spark.mesos.dispatcher.historyServer.url - explicitly generates frameworkIDs for the launched drivers, so the dispatcher knows how to correlate drivers and frameworkIDs ## How was this patch tested? manual testing Author: Michael Gummelt <mgummelt@mesosphere.io> Author: Sergiusz Urbaniak <sur@mesosphere.io> Closes #14414 from mgummelt/history-server.	2016-08-09 10:55:33 +01:00
Michael Gummelt	53d1c78779	Update docs to include SASL support for RPC ## What changes were proposed in this pull request? Update docs to include SASL support for RPC Evidence: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala#L63 ## How was this patch tested? Docs change only Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14549 from mgummelt/sasl.	2016-08-08 16:07:51 -07:00
Shivansh	6c1ecb191b	[SPARK-16911] Fix the links in the programming guide ## What changes were proposed in this pull request? Fix the broken links in the programming guide of the Graphx Migration and understanding closures ## How was this patch tested? By running the test cases and checking the links. Author: Shivansh <shiv4nsh@gmail.com> Closes #14503 from shiv4nsh/SPARK-16911.	2016-08-07 09:30:18 +01:00
keliang	1275f64696	[SPARK-16870][DOCS] Summary:add "spark.sql.broadcastTimeout" into docs/sql-programming-gu… ## What changes were proposed in this pull request? The default value for spark.sql.broadcastTimeout is 300s, and this property does not appear in any of the Spark docs. So add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help people fix this timeout error when it happens. ## How was this patch tested? Not needed. …ide.md JIRA_ID:SPARK-16870 Description: the default value for spark.sql.broadcastTimeout is 300s, and this property does not appear in any of the Spark docs. So add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help people fix this timeout error when it happens. Test:done Author: keliang <keliang@cmss.chinamobile.com> Closes #14477 from biglobster/keliang.	2016-08-07 09:28:32 +01:00
Bryan Cutler	b1ebe182ca	[SPARK-16932][DOCS] Changed programming guide to not reference old accumulator API in Scala ## What changes were proposed in this pull request? In the programming guide, the accumulator section mixes up both the old and new APIs causing it to be confusing. This is not necessary for Scala, so all references to the old API are removed. For Java, it is somewhat fixed up except for the example of a custom accumulator because I don't think an API exists yet. Python has not currently implemented the new API. ## How was this patch tested? built doc locally Author: Bryan Cutler <cutlerb@gmail.com> Closes #14516 from BryanCutler/fixup-accumulator-programming-guide-SPARK-15702.	2016-08-07 09:06:59 +01:00
Michael Gummelt	7aaa5a01c1	document that Mesos cluster mode supports python update docs to be consistent with SPARK-14645 https://issues.apache.org/jira/browse/SPARK-14645 Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14514 from mgummelt/fix-docs.	2016-08-07 08:59:04 +01:00
cody koeninger	c9f2501af2	[SPARK-16312][STREAMING][KAFKA][DOC] Doc for Kafka 0.10 integration ## What changes were proposed in this pull request? Doc for the Kafka 0.10 integration ## How was this patch tested? Scala code examples were taken from my example repo, so hopefully they compile. Author: cody koeninger <cody@koeninger.org> Closes #14385 from koeninger/SPARK-16312.	2016-08-05 10:13:32 +01:00
Sital Kedia	9c15d079df	[SPARK-15074][SHUFFLE] Cache shuffle index file to speedup shuffle fetch ## What changes were proposed in this pull request? Shuffle fetch on a large intermediate dataset is slow because the shuffle service opens and closes the index file for each shuffle fetch. This change introduces a cache for the index information so that we can avoid accessing the index files for each block fetch. ## How was this patch tested? Tested by running a job on the cluster and the shuffle read time was reduced by 50%. Author: Sital Kedia <skedia@fb.com> Closes #12944 from sitalkedia/shuffle_service.	2016-08-04 14:54:38 -07:00
Shuai Lin	36827ddafe	[SPARK-16822][DOC] Support latex in scaladoc. ## What changes were proposed in this pull request? Support using latex in scaladoc by adding MathJax javascript to the js template. ## How was this patch tested? Generated scaladoc. Preview: - LogisticGradient: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient) - MinMaxScaler: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) Author: Shuai Lin <linshuai2012@gmail.com> Closes #14438 from lins05/spark-16822-support-latex-in-scaladoc.	2016-08-02 09:14:08 -07:00
Cheng Lian	10e1c0e638	[SPARK-16734][EXAMPLES][SQL] Revise examples of all language bindings ## What changes were proposed in this pull request? This PR makes various minor updates to examples of all language bindings to make sure they are consistent with each other. Some typos and missing parts (JDBC example in Scala/Java/Python) are also fixed. ## How was this patch tested? Manually tested. Author: Cheng Lian <lian@databricks.com> Closes #14368 from liancheng/revise-examples.	2016-08-02 15:02:40 +08:00
Sun Dapeng	2c15323ad0	[SPARK-16761][DOC][ML] Fix doc link in docs/ml-guide.md ## What changes were proposed in this pull request? Fix the link at http://spark.apache.org/docs/latest/ml-guide.html. ## How was this patch tested? None Author: Sun Dapeng <sdp@apache.org> Closes #14386 from sundapeng/doclink.	2016-07-29 06:01:23 -07:00
Michael Gummelt	266b92faff	[SPARK-16637] Unified containerizer ## What changes were proposed in this pull request? New config var: spark.mesos.docker.containerizer={"mesos","docker" (default)} This adds support for running docker containers via the Mesos unified containerizer: http://mesos.apache.org/documentation/latest/container-image/ The benefit is losing the dependency on `dockerd`, and all the costs which it incurs. I've also updated the supported Mesos version to 0.28.2 for support of the required protobufs. This is blocked on: https://github.com/apache/spark/pull/14167 ## How was this patch tested? - manually testing jobs submitted with both "mesos" and "docker" settings for the new config var. - spark/mesos integration test suite Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14275 from mgummelt/unified-containerizer.	2016-07-29 05:50:47 -07:00
Bartek Wiśniewski	bc4851adeb	[MINOR][DOC] missing keyword new ## What changes were proposed in this pull request? added missing keyword for java example ## How was this patch tested? wasn't Author: Bartek Wiśniewski <wedi@Ava.local> Closes #14381 from wedi-dev/quickfix/missing_keyword.	2016-07-27 10:53:22 -07:00
Mark Grover	70f846a313	[SPARK-5847][CORE] Allow for configuring MetricsSystem's use of app ID to namespace all metrics ## What changes were proposed in this pull request? Adding a new property to SparkConf called spark.metrics.namespace that allows users to set a custom namespace for executor and driver metrics in the metrics systems. By default, the root namespace used for driver or executor metrics is the value of `spark.app.id`. However, users often want to be able to track the metrics across apps for driver and executor metrics, which is hard to do with application ID (i.e. `spark.app.id`) since it changes with every invocation of the app. For such use cases, users can set the `spark.metrics.namespace` property to another spark configuration key like `spark.app.name` which is then used to populate the root namespace of the metrics system (with the app name in our example). `spark.metrics.namespace` property can be set to any arbitrary spark property key, whose value would be used to set the root namespace of the metrics system. Non driver and executor metrics are never prefixed with `spark.app.id`, nor does the `spark.metrics.namespace` property have any such effect on such metrics. ## How was this patch tested? Added new unit tests, modified existing unit tests. Author: Mark Grover <mark@apache.org> Closes #14270 from markgrover/spark-5847.	2016-07-27 10:13:15 -07:00
Philipp Hoffmann	0869b3a5f0	[SPARK-15271][MESOS] Allow force pulling executor docker images ## What changes were proposed in this pull request? Mesos agents by default will not pull docker images which are cached locally already. In order to run Spark executors from mutable tags like `:latest` this commit introduces a Spark setting (`spark.mesos.executor.docker.forcePullImage`). Setting this flag to true will tell the Mesos agent to force pull the docker image (default is `false` which is consistent with the previous implementation and Mesos' default behaviour). Author: Philipp Hoffmann <mail@philipphoffmann.de> Closes #14348 from philipphoffmann/force-pull-image.	2016-07-26 16:09:10 +01:00
Nicholas Brown	ba0aade6d5	Fix description of spark.speculation.quantile ## What changes were proposed in this pull request? Minor doc fix regarding the spark.speculation.quantile configuration parameter. It incorrectly states it should be a percentage, when it should be a fraction. ## How was this patch tested? I tried building the documentation but got some unidoc errors. I also got them when building off origin/master, so I don't think I caused that problem. I did run the web app and saw the changes reflected as expected. Author: Nicholas Brown <nbrown@adroitdigital.com> Closes #14352 from nwbvt/master.	2016-07-25 19:18:27 -07:00
Takeshi YAMAMURO	cda4603de3	[SQL][DOC] Fix a default name for parquet compression ## What changes were proposed in this pull request? This pr is to fix a wrong description for parquet default compression. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #14351 from maropu/FixParquetDoc.	2016-07-25 15:08:58 -07:00
Josh Rosen	fc17121d59	Revert "[SPARK-15271][MESOS] Allow force pulling executor docker images" This reverts commit `978cd5f125`.	2016-07-25 12:43:44 -07:00
Shuai Lin	3b6e1d094e	[SPARK-16485][DOC][ML] Fixed several inline formatting in ml features doc ## What changes were proposed in this pull request? Fixed several inline formatting issues in the ml features doc. Before: <img width="475" alt="screen shot 2016-07-14 at 12 24 57 pm" src="https://cloud.githubusercontent.com/assets/717363/16827974/1e1b6e04-49be-11e6-8aa9-4a0cb6cd3b4e.png"> After: <img width="404" alt="screen shot 2016-07-14 at 12 25 48 pm" src="https://cloud.githubusercontent.com/assets/717363/16827976/2576510a-49be-11e6-96dd-92a1fa464d36.png"> ## How was this patch tested? Generate the docs locally by `SKIP_API=1 jekyll build` and view them in the browser. Author: Shuai Lin <linshuai2012@gmail.com> Closes #14194 from lins05/fix-docs-formatting.	2016-07-25 20:26:55 +01:00
Philipp Hoffmann	978cd5f125	[SPARK-15271][MESOS] Allow force pulling executor docker images ## What changes were proposed in this pull request? Mesos agents by default will not pull docker images which are cached locally already. In order to run Spark executors from mutable tags like `:latest` this commit introduces a Spark setting `spark.mesos.executor.docker.forcePullImage`. Setting this flag to true will tell the Mesos agent to force pull the docker image (default is `false` which is consistent with the previous implementation and Mesos' default behaviour). ## How was this patch tested? I ran a sample application including this change on a Mesos cluster and verified the correct behaviour for both, with and without, force pulling the executor image. As expected the image is being force pulled if the flag is set. Author: Philipp Hoffmann <mail@philipphoffmann.de> Closes #13051 from philipphoffmann/force-pull-image.	2016-07-25 20:14:47 +01:00
Felix Cheung	b73defdd79	[SPARKR][DOCS] fix broken url in doc ## What changes were proposed in this pull request? Fix broken url, also, sparkR.session.stop doc page should have it in the header, instead of saying "sparkR.stop" ![image](https://cloud.githubusercontent.com/assets/8969467/17080129/26d41308-50d9-11e6-8967-79d6c920313f.png) Data type section is in the middle of a list of gapply/gapplyCollect subsections: ![image](https://cloud.githubusercontent.com/assets/8969467/17080122/f992d00a-50d8-11e6-8f2c-fd5786213920.png) ## How was this patch tested? manual test Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14329 from felixcheung/rdoclinkfix.	2016-07-25 11:25:41 -07:00
Cheng Lian	53b2456d1d	[SPARK-16380][EXAMPLES] Update SQL examples and programming guide for Python language binding This PR is based on PR #14098 authored by wangmiao1981. ## What changes were proposed in this pull request? This PR replaces the original Python Spark SQL example file with the following three files: - `sql/basic.py` Demonstrates basic Spark SQL features. - `sql/datasource.py` Demonstrates various Spark SQL data sources. - `sql/hive.py` Demonstrates Spark SQL Hive interaction. This PR also removes hard-coded Python example snippets in the SQL programming guide by extracting snippets from the above files using the `include_example` Liquid template tag. ## How was this patch tested? Manually tested. Author: wm624@hotmail.com <wm624@hotmail.com> Author: Cheng Lian <lian@databricks.com> Closes #14317 from liancheng/py-examples-update.	2016-07-23 11:41:24 -07:00
Tom Graves	6c56fff118	[SPARK-16650] Improve documentation of spark.task.maxFailures Clarify documentation on spark.task.maxFailures. No tests run, as this is a documentation-only change. Author: Tom Graves <tgraves@yahoo-inc.com> Closes #14287 from tgravescs/SPARK-16650.	2016-07-22 12:41:38 +01:00
Michael Gummelt	235cb256d0	[SPARK-16194] Mesos Driver env vars ## What changes were proposed in this pull request? Added new configuration namespace: spark.mesos.driverEnv.* This allows a user submitting a job in cluster mode to set arbitrary environment variables on the driver. spark.mesos.driverEnv.KEY=VAL will result in the env var "KEY" being set to "VAL". I've also refactored the tests a bit so we can re-use code in MesosClusterScheduler. And I've refactored the command building logic in `buildDriverCommand`. Command builder values were very intertwined before, and now it's easier to determine exactly how each variable is set. ## How was this patch tested? unit tests Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14167 from mgummelt/driver-env-vars.	2016-07-21 18:29:00 +01:00
Holden Karau	1bf13ba3a2	[MINOR][DOCS][STREAMING] Minor docfix schema of csv rather than parquet in comments ## What changes were proposed in this pull request? Fix parquet to csv in a comment to match the input format being read. ## How was this patch tested? N/A (doc change only) Author: Holden Karau <holden@us.ibm.com> Closes #14274 from holdenk/minor-docfix-schema-of-csv-rather-than-parquet.	2016-07-21 09:17:38 +01:00
Kishor Patil	b9bab4dcf6	[SPARK-15951] Change Executors Page to use datatables to support sorting columns and searching 1. Create the executorspage-template.html for displaying application information in datatables. 2. Added REST API endpoint "allexecutors" to be able to see all executors created for a particular job. 3. The executorspage.js uses jQuery to access the data from the /api/v1/applications/appid/allexecutors REST API, and uses DataTable to display executors for the application. It also generates a summary of dead/live and total executors created during the life of the application. 4. Similar changes apply to the Executors Page on the history server for a given application. Snapshot of how it looks now: <img width="938" alt="screen shot 2016-06-14 at 2 45 44 pm" src="https://cloud.githubusercontent.com/assets/6090397/16060092/ad1de03a-324b-11e6-8469-9eaa3f2548b5.png"> The new Executors Page looks like this: <img width="1436" alt="screen shot 2016-06-15 at 10 12 01 am" src="https://cloud.githubusercontent.com/assets/6090397/16085514/ee7004f0-32e1-11e6-9340-33d91e407f2b.png"> Author: Kishor Patil <kpatil@yahoo-inc.com> Closes #13670 from kishorvpatil/execTemplates.	2016-07-20 12:22:43 -05:00
Weiqing Yang	95abbe5377	[SPARK-15923][YARN] Spark Application rest api returns 'no such app: … ## What changes were proposed in this pull request? Update monitoring.md. …<appId>' Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #14163 from Sherry302/master.	2016-07-20 14:26:26 +01:00
WeichenXu	9674af6f6f	[SPARK-16568][SQL][DOCUMENTATION] update sql programming guide refreshTable API in python code ## What changes were proposed in this pull request? update `refreshTable` API in python code of the sql-programming-guide. This API is added in SPARK-15820 ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14220 from WeichenXu123/update_sql_doc_catalog.	2016-07-19 18:48:41 -07:00
Ahmed Mahran	6caa22050e	[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar ## What changes were proposed in this pull request? Minor fixes correcting some typos, punctuations, grammar. Adding more anchors for easy navigation. Fixing minor issues with code snippets. ## How was this patch tested? `jekyll serve` Author: Ahmed Mahran <ahmed.mahran@mashin.io> Closes #14234 from ahmed-mahran/b-struct-streaming-docs.	2016-07-19 12:01:54 +01:00
Cheng Lian	1426a08052	[SPARK-16303][DOCS][EXAMPLES] Minor Scala/Java example update ## What changes were proposed in this pull request? This PR moves one and the last hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`. It also renames all Scala/Java example files so that all "Sql" in the file names are updated to "SQL". ## How was this patch tested? Manually verified the generated HTML page. Author: Cheng Lian <lian@databricks.com> Closes #14245 from liancheng/minor-scala-example-update.	2016-07-18 23:07:59 -07:00
Felix Cheung	75f0efe74d	[SPARKR][DOCS] minor code sample update in R programming guide ## What changes were proposed in this pull request? Fix code style from ad hoc review of RC4 doc ## How was this patch tested? manual shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14250 from felixcheung/rdocs2rc4.	2016-07-18 16:01:57 -07:00
Narine Kokhlikyan	4167304836	[SPARK-16112][SPARKR] Programming guide for gapply/gapplyCollect ## What changes were proposed in this pull request? Updates programming guide for spark.gapply/spark.gapplyCollect. Similar to other examples I used `faithful` dataset to demonstrate gapply's functionality. Please, let me know if you prefer another example. ## How was this patch tested? Existing test cases in R Author: Narine Kokhlikyan <narine@slice.com> Closes #14090 from NarineK/gapplyProgGuide.	2016-07-16 16:56:16 -07:00
Joseph K. Bradley	5ffd5d3838	[SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide ## What changes were proposed in this pull request? Made DataFrame-based API primary * Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html * mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html * ml-guide.html includes a "maintenance mode" announcement about the RDD-based API * Reviewers: please check this carefully * (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix * Moved migration guide to ml-guide from mllib-guide * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides * Reviewers: I did not change any of the content of the migration guides. Reorganized DataFrame-based guide: * ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc. * Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html * Reviewers: I did not change the content of these guides, except some intro text. * Sidebar remains the same, but with pipeline and tuning sections added Other: * ml-classification-regression.html: Moved text about linear methods to new section in page ## How was this patch tested? Generated docs locally Author: Joseph K. Bradley <joseph@databricks.com> Closes #14213 from jkbradley/ml-guide-2.0.	2016-07-15 13:38:23 -07:00
Josh Rosen	972673aca5	[SPARK-16555] Work around Jekyll error-handling bug which led to silent failures If a custom Jekyll template tag throws Ruby's equivalent of a "file not found" exception, then Jekyll will stop the doc building process but will exit with a successful status, causing our doc publishing jobs to silently fail. This is caused by https://github.com/jekyll/jekyll/issues/5104, a case of bad error-handling logic in Jekyll. This patch works around this by updating our `include_example.rb` plugin to catch the exception and exit rather than allowing it to bubble up and be ignored by Jekyll. I tested this manually with ``` rm ./examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala cd docs SKIP_API=1 jekyll build echo $? ``` Author: Josh Rosen <joshrosen@databricks.com> Closes #14209 from JoshRosen/fix-doc-building.	2016-07-14 15:55:36 -07:00
Shivaram Venkataraman	01c4c1fa53	[SPARK-16553][DOCS] Fix SQL example file name in docs ## What changes were proposed in this pull request? Fixes a typo in the sql programming guide ## How was this patch tested? Building docs locally (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #14208 from shivaram/spark-sql-doc-fix.	2016-07-14 14:19:30 -07:00
Marcelo Vanzin	b7b5e17876	[SPARK-16505][YARN] Optionally propagate error during shuffle service startup. This prevents the NM from starting when something is wrong, which would lead to later errors which are confusing and harder to debug. Added a unit test to verify startup fails if something is wrong. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #14162 from vanzin/SPARK-16505.	2016-07-14 09:42:32 -05:00
Felix Cheung	fb2e8eeb0b	[SPARKR][DOCS][MINOR] R programming guide to include csv data source example ## What changes were proposed in this pull request? Minor documentation update for code example, code style, and missed reference to "sparkR.init" ## How was this patch tested? manual shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14178 from felixcheung/rcsvprogrammingguide.	2016-07-13 15:09:23 -07:00
James Thomas	51a6706b13	[SPARK-16114][SQL] updated structured streaming guide ## What changes were proposed in this pull request? Updated structured streaming programming guide with new windowed example. ## How was this patch tested? Docs Author: James Thomas <jamesjoethomas@gmail.com> Closes #14183 from jjthomas/ss_docs_update.	2016-07-13 13:26:23 -07:00
sandy	bf107f1e65	[SPARK-16438] Add Asynchronous Actions documentation ## What changes were proposed in this pull request? Add Asynchronous Actions documentation inside action of programming guide ## How was this patch tested? check the documentation indentation and formatting with md preview. Author: sandy <phalodi@gmail.com> Closes #14104 from phalodi/SPARK-16438.	2016-07-13 11:33:46 +01:00
aokolnychyi	772c213ec7	[SPARK-16303][DOCS][EXAMPLES] Updated SQL programming guide and examples - Hard-coded Spark SQL sample snippets were moved into source files under examples sub-project. - Removed the inconsistency between Scala and Java Spark SQL examples - Scala and Java Spark SQL examples were updated The work is still in progress. All involved examples were tested manually. An additional round of testing will be done after the code review. ![image](https://cloud.githubusercontent.com/assets/6235869/16710314/51851606-462a-11e6-9fbe-0818daef65e4.png) Author: aokolnychyi <okolnychyyanton@gmail.com> Closes #14119 from aokolnychyi/spark_16303.	2016-07-13 16:12:11 +08:00
Lianhui Wang	5ad68ba5ce	[SPARK-15752][SQL] Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators. ## What changes were proposed in this pull request? When a query only uses metadata (for example, a partition key), it can return results based on metadata without scanning files. Hive did this in HIVE-1003. ## How was this patch tested? Added unit tests. Author: Lianhui Wang <lianhuiwang09@gmail.com> Author: Wenchen Fan <wenchen@databricks.com> Author: Lianhui Wang <lianhuiwang@users.noreply.github.com> Closes #13494 from lianhuiwang/metadata-only.	2016-07-12 18:52:15 +02:00
Xin Ren	05d7151ccb	[MINOR][STREAMING][DOCS] Minor changes on kinesis integration ## What changes were proposed in this pull request? Some minor changes for documentation page "Spark Streaming + Kinesis Integration". Moved "streaming-kinesis-arch.png" before the bullet list, not in between the bullets. ## How was this patch tested? Tested manually, on my local machine. Author: Xin Ren <iamshrek@126.com> Closes #14097 from keypointt/kinesisDoc.	2016-07-11 18:09:14 -07:00

1756 commits