Commit graph

1884 commits

Author SHA1 Message Date
zero323 ba7666274e [SPARK-20208][DOCS][FOLLOW-UP] Add FP-Growth to SparkR programming guide
## What changes were proposed in this pull request?

Add `spark.fpGrowth` to SparkR programming guide.

## How was this patch tested?

Manual tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17775 from zero323/SPARK-20208-FOLLOW-UP.
2017-04-27 00:34:20 -07:00
Mark Grover 66636ef0b0 [SPARK-20435][CORE] More thorough redaction of sensitive information
This change does a more thorough redaction of sensitive information from logs and the UI, and adds unit tests that ensure no regressions leak sensitive information to the logs.

The motivation for this change was the appearance of a password in `SparkListenerEnvironmentUpdate` in event logs under some JVM configurations, like so:
`"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ..."`

Previously, the redaction logic only checked whether the key matched the secret regex pattern and, if so, redacted its value. That worked for most cases. However, in the case above the key (sun.java.command) reveals little, so the value itself needs to be searched. This PR expands the check to cover values as well.
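
A minimal sketch of the value-aware redaction idea (the regex, names, and replacement string here are illustrative, not Spark's actual implementation):

```scala
import scala.util.matching.Regex

// Assumed secret-matching pattern, standing in for a configurable redaction regex.
val secretPattern: Regex = "(?i)secret|password|token|credential".r

def redact(kvs: Seq[(String, String)]): Seq[(String, String)] =
  kvs.map { case (key, value) =>
    if (secretPattern.findFirstIn(key).isDefined ||
        secretPattern.findFirstIn(value).isDefined) {
      // Checking the value as well catches keys like sun.java.command
      // whose value embeds a password.
      (key, "*********(redacted)")
    } else {
      (key, value)
    }
  }
```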

## How was this patch tested?

New unit tests were added to ensure that no sensitive information is present in the event logs or the YARN logs. An old unit test in UtilsSuite was modified because it asserted that a non-sensitive property's value would not be redacted; however, the non-sensitive value had the literal "secret" in it, which caused it to be redacted. Simply updating the non-sensitive property's value to another arbitrary value (one without "secret" in it) fixed it.

Author: Mark Grover <mark@apache.org>

Closes #17725 from markgrover/spark-20435.
2017-04-26 17:06:21 -07:00
anabranch 7a365257e9 [SPARK-20400][DOCS] Remove References to 3rd Party Vendor Tools
## What changes were proposed in this pull request?

Simple documentation change to remove explicit vendor references.

## How was this patch tested?

N/A

Author: anabranch <bill@databricks.com>

Closes #17695 from anabranch/remove-vendor.
2017-04-26 09:49:05 +01:00
zero323 df58a95a33 [SPARK-20437][R] R wrappers for rollup and cube
## What changes were proposed in this pull request?

- Add `rollup` and `cube` methods and corresponding generics.
- Add short description to the vignette.

## How was this patch tested?

- Existing unit tests.
- Additional unit tests covering new features.
- `check-cran.sh`.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17728 from zero323/SPARK-20437.
2017-04-25 22:00:45 -07:00
ding 0a7f5f2798 [SPARK-5484][GRAPHX] Periodically do checkpoint in Pregel
## What changes were proposed in this pull request?

Pregel-based iterative algorithms with more than ~50 iterations begin to slow down and eventually fail with a StackOverflowError due to Spark's lack of support for long lineage chains.

This PR causes Pregel to checkpoint the graph periodically if the checkpoint directory is set.
It also moves PeriodicGraphCheckpointer.scala from mllib to graphx, and moves PeriodicRDDCheckpointer.scala and PeriodicCheckpointer.scala from mllib to core.
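
The lineage-truncation idea, sketched with a plain RDD rather than the actual Pregel code (the interval, path, and toy update rule are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("periodic-checkpoint").getOrCreate()
val sc = spark.sparkContext
sc.setCheckpointDir("/tmp/checkpoints")  // checkpointing only happens when a directory is set

var ranks = sc.parallelize(1 to 1000).map(i => (i, 1.0))
val checkpointInterval = 25              // illustrative interval

for (iter <- 1 to 100) {
  ranks = ranks.mapValues(v => v * 0.85 + 0.15).cache()
  if (iter % checkpointInterval == 0) {
    ranks.checkpoint()  // persist to reliable storage
    ranks.count()       // materialize so the lineage chain is actually truncated
  }
}
```
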
## How was this patch tested?

unit tests, manual tests

Author: ding <ding@localhost.localdomain>
Author: dding3 <ding.ding@intel.com>
Author: Michael Allman <michael@videoamp.com>

Closes #15125 from dding3/cp2_pregel.
2017-04-25 11:20:32 -07:00
Armin Braun c8f1219510 [SPARK-20455][DOCS] Fix Broken Docker IT Docs
## What changes were proposed in this pull request?

Just added the Maven `test` goal.

## How was this patch tested?

No test needed, just a trivial documentation fix.

Author: Armin Braun <me@obrown.io>

Closes #17756 from original-brownbear/SPARK-20455.
2017-04-25 09:13:50 +01:00
Josh Rosen f44c8a843c [SPARK-20453] Bump master branch version to 2.3.0-SNAPSHOT
This patch bumps the master branch version to `2.3.0-SNAPSHOT`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #17753 from JoshRosen/SPARK-20453.
2017-04-24 21:48:04 -07:00
郭小龙 10207633 ad290402aa [SPARK-20401][DOC] In the spark official configuration document, the 'spark.driver.supervise' configuration parameter specification and default values are necessary.
## What changes were proposed in this pull request?
Submit the Spark job through the REST interface, e.g.:

```
curl -X POST http://10.43.183.120:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
    "action": "CreateSubmissionRequest",
    "appArgs": [
        "myAppArgument"
    ],
    "appResource": "/home/mr/gxl/test.jar",
    "clientSparkVersion": "2.2.0",
    "environmentVariables": {
        "SPARK_ENV_LOADED": "1"
    },
    "mainClass": "cn.zte.HdfsTest",
    "sparkProperties": {
        "spark.jars": "/home/mr/gxl/test.jar",
        "spark.driver.supervise": "true",
        "spark.app.name": "HdfsTest",
        "spark.eventLog.enabled": "false",
        "spark.submit.deployMode": "cluster",
        "spark.master": "spark://10.43.183.120:6066"
    }
}'
```

The goal is to make sure the driver is automatically restarted if it fails with a non-zero exit code (note `"spark.driver.supervise": "true"` above), but the `spark.driver.supervise` configuration parameter's specification and default value cannot be found in the official Spark configuration document.

## How was this patch tested?

Manual tests.

Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>

Closes #17696 from guoxiaolongzte/SPARK-20401.
2017-04-21 20:08:26 +01:00
Hervé 34767997e0 Small rewording about history server use case
Hello
PR #10991 removed the built-in history view from Spark Standalone, so the history server is no longer useful only to YARN or Mesos.

Author: Hervé <dud225@users.noreply.github.com>

Closes #17709 from dud225/patch-1.
2017-04-21 08:52:18 +01:00
ymahajan bdc6056919 Fixed typos in docs
## What changes were proposed in this pull request?

Fixed typos at a couple of places in the docs.

## How was this patch tested?

Built, including the docs.

Author: ymahajan <ymahajan@snappydata.io>

Closes #17690 from ymahajan/master.
2017-04-19 20:08:31 -07:00
cody koeninger 71a8e9df12 [SPARK-20036][DOC] Note incompatible dependencies on org.apache.kafka artifacts
## What changes were proposed in this pull request?

Note that you shouldn't manually add dependencies on org.apache.kafka artifacts; see the sketch below.
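
For instance, a hedged build.sbt sketch: depend on Spark's Kafka integration and let it pull a compatible kafka-clients, rather than pinning org.apache.kafka artifacts yourself (the version shown is illustrative):

```scala
// build.sbt
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0"
// Do NOT also add "org.apache.kafka" % "kafka-clients" — version skew causes breakage.
```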

## How was this patch tested?

Doc only change, did jekyll build and looked at the page.

Author: cody koeninger <cody@koeninger.org>

Closes #17675 from koeninger/SPARK-20036.
2017-04-19 18:58:58 +01:00
Ji Yan a888fed309 [SPARK-19740][MESOS] Add support in Spark to pass arbitrary parameters into docker when running on mesos with docker containerizer
## What changes were proposed in this pull request?

Allow passing arbitrary parameters to Docker when launching Spark executors on Mesos with the Docker containerizer. cc tnachen

## How was this patch tested?

Manually built and tested with passed-in parameters.

Author: Ji Yan <jiyan@Jis-MacBook-Air.local>

Closes #17109 from yanji84/ji/allow_set_docker_user.
2017-04-16 14:34:12 +01:00
hyukjinkwon bca4259f12 [MINOR][DOCS] JSON APIs related documentation fixes
## What changes were proposed in this pull request?

This PR proposes corrections related to JSON APIs as below:

- Rendering links in the Python documentation
- Replacing `RDD` with `Dataset` in the programming guide
- Adding a missing description about JSON Lines consistently in `DataFrameReader.json` in the Python API
- De-duplicating a bit of `DataFrameReader.json` in the Scala/Java API

## How was this patch tested?

Manually built the documentation via `jekyll build`. Corresponding screenshots will be left as comments on the code.

Note that currently there are Javadoc8 breaks in several places. These are proposed to be handled in https://github.com/apache/spark/pull/17477. So, this PR does not fix those.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17602 from HyukjinKwon/minor-json-documentation.
2017-04-12 09:16:39 +01:00
Lee Dongjin b938438248 [MINOR][DOCS] Fix spacings in Structured Streaming Programming Guide
## What changes were proposed in this pull request?

1. Added an omitted space between sentences: `... on static data.The Spark SQL engine will ...` -> `... on static data. The Spark SQL engine will ...`
2. Added an omitted colon in the Output Model section.

## How was this patch tested?

None.

Author: Lee Dongjin <dongjin@apache.org>

Closes #17564 from dongjinleekr/feature/fix-programming-guide.
2017-04-12 09:12:14 +01:00
Dongjoon Hyun cde9e32848 [MINOR][DOCS] Update supported versions for Hive Metastore
## What changes were proposed in this pull request?

Since SPARK-18112 and SPARK-13446, Apache Spark supports reading Hive metastore 2.0 through 2.1.1. This updates the docs.

## How was this patch tested?

N/A

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #17612 from dongjoon-hyun/metastore.
2017-04-11 19:30:34 -07:00
MirrorZ d11ef3d77e Document Master URL format in high availability set up
## What changes were proposed in this pull request?

Add documentation for specifying the master URL in multi-host:port format for a standalone cluster with ZooKeeper-based high availability, as sketched below.
See the documentation on [Standby Masters with ZooKeeper](http://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper).
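
A sketch of the URL format being documented (host names and ports are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// With ZooKeeper-based HA, list every master; the application registers
// with all of them and fails over to whichever is the active leader.
val spark = SparkSession.builder
  .master("spark://host1:7077,host2:7077")
  .appName("ha-example")
  .getOrCreate()
```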

## How was this patch tested?

Documenting the functionality already present.

Author: MirrorZ <chandrika3437@gmail.com>

Closes #17584 from MirrorZ/master.
2017-04-11 10:34:39 +01:00
郭小龙 10207633 9e0893b53d [SPARK-20218][DOC][APP-ID] applications//stages' in REST API,add description.
## What changes were proposed in this pull request?

1. '/applications/[app-id]/stages' in the REST API: the `status` parameter should gain the description '?status=[active|complete|pending|failed] list only stages in the given state.'

This description is currently missing, so users of this API do not know that the stage list can be filtered by status.

2. '/applications/[app-id]/stages/[stage-id]' in the REST API: remove the redundant description '?status=[active|complete|pending|failed] list only stages in the state.', because a single stage is already determined by its stage-id.

Code:

```scala
@GET
def stageList(@QueryParam("status") statuses: JList[StageStatus]): Seq[StageData] = {
  val listener = ui.jobProgressListener
  val stageAndStatus = AllStagesResource.stagesAndStatus(ui)
  val adjStatuses = {
    if (statuses.isEmpty()) {
      Arrays.asList(StageStatus.values(): _*)
    } else {
      statuses
    }
  }
  // ...
```

## How was this patch tested?

Manual tests.

Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>

Closes #17534 from guoxiaolongzte/SPARK-20218.
2017-04-07 13:03:07 +01:00
Kalvin Chau c8fc1f3bad [SPARK-20085][MESOS] Configurable mesos labels for executors
## What changes were proposed in this pull request?

Add a `spark.mesos.task.labels` configuration option that attaches Mesos key:value labels to the executor.

The format is `"k1:v1,k2:v2"`: a colon separates each key from its value, and commas list more than one pair.

Discussion of labels with mgummelt at #17404
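
A hedged sketch of the new option (label keys and values are illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // "k1:v1,k2:v2": colon separates key from value, comma separates pairs
  .set("spark.mesos.task.labels", "environment:prod,team:data")
```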

## How was this patch tested?

Added unit tests verifying that labels are added correctly and that malformed labels are ignored, plus a test for the executor name.

Tested with: `./build/sbt -Pmesos mesos/test`

Author: Kalvin Chau <kalvin.chau@viasat.com>

Closes #17413 from kalvinnchau/mesos-labels.
2017-04-06 09:14:31 +01:00
Tathagata Das 9543fc0e08 [SPARK-20224][SS] Updated docs for streaming dropDuplicates and mapGroupsWithState
## What changes were proposed in this pull request?

- Fixed bug in Java API not passing timeout conf to scala API
- Updated markdown docs
- Updated scala docs
- Added scala and Java example

## How was this patch tested?
Manually ran examples.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #17539 from tdas/SPARK-20224.
2017-04-05 16:03:04 -07:00
Anirudh Ramanathan 11238d4c62 [SPARK-18278][SCHEDULER] Documentation to point to Kubernetes cluster scheduler
## What changes were proposed in this pull request?

Adding documentation to point to Kubernetes cluster scheduler being developed out-of-repo in https://github.com/apache-spark-on-k8s/spark
cc rxin srowen tnachen ash211 mccheah erikerlandson

## How was this patch tested?

Docs only change

Author: Anirudh Ramanathan <foxish@users.noreply.github.com>
Author: foxish <ramanathana@google.com>

Closes #17522 from foxish/upstream-doc.
2017-04-04 10:46:44 -07:00
guoxiaolongzte c95fbea68e [SPARK-20190][APP-ID] applications//jobs' in rest api, status should be [running|succeeded|failed|unknown]

## What changes were proposed in this pull request?

For '/applications/[app-id]/jobs' in the REST API, `status` should be '[running|succeeded|failed|unknown]'.
The documented statuses are currently '[complete|succeeded|failed]', but '/applications/[app-id]/jobs?status=complete' makes the server return 'HTTP ERROR 404'.
Added '?status=running' and '?status=unknown'.
Code:

```java
public enum JobExecutionStatus {
  RUNNING,
  SUCCEEDED,
  FAILED,
  UNKNOWN;
}
```

## How was this patch tested?

Manual tests.

Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>

Closes #17507 from guoxiaolongzte/SPARK-20190.
2017-04-04 09:56:17 +01:00
Yuhao Yang 4d28e8430d [SPARK-19969][ML] Imputer doc and example
## What changes were proposed in this pull request?

Add docs and examples for spark.ml.feature.Imputer. Currently Scala and Java examples are included; a Python example will be added after https://github.com/apache/spark/pull/17316

## How was this patch tested?

local doc generation and example execution

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #17324 from hhbyyh/imputerdoc.
2017-04-03 11:42:33 +02:00
hyukjinkwon 364b0db753 [MINOR][DOCS] Replace non-breaking space to normal spaces that breaks rendering markdown
## What changes were proposed in this pull request?

It seems several non-breaking spaces were inserted into several `.md` files, and they appear to break the rendering of those markdown files.

The two characters are different. For example, this can be checked via `python` as below:

```python
>>> " "
'\xc2\xa0'
>>> " "
' '
```

_Note that it seems this PR description automatically replaces non-breaking spaces into normal spaces. Please open a `vi` and copy and paste it into `python` to verify this (do not copy the characters here)._

I checked the output below in Safari and Chrome on Mac OS, and Internet Explorer on Windows 10.

**Before**

![2017-04-03 12 37 17](https://cloud.githubusercontent.com/assets/6477701/24594655/50aaba02-186a-11e7-80bb-d34b17a3398a.png)
![2017-04-03 12 36 57](https://cloud.githubusercontent.com/assets/6477701/24594654/50a855e6-186a-11e7-94e2-661e56544b0f.png)

**After**

![2017-04-03 12 36 46](https://cloud.githubusercontent.com/assets/6477701/24594657/53c2545c-186a-11e7-9a73-00529afbfd75.png)
![2017-04-03 12 36 31](https://cloud.githubusercontent.com/assets/6477701/24594658/53c286c0-186a-11e7-99c9-e66b1f510fe7.png)

## How was this patch tested?

Manually checking.

These instances were found via

```
grep --include=*.scala --include=*.python --include=*.java --include=*.r --include=*.R --include=*.md --include=*.r -r -I " " .
```

in Mac OS.

It seems there are several more instances, as below:

```
./docs/sql-programming-guide.md:        │   ├── ...
./docs/sql-programming-guide.md:        │   │
./docs/sql-programming-guide.md:        │   ├── country=US
./docs/sql-programming-guide.md:        │   │   └── data.parquet
./docs/sql-programming-guide.md:        │   ├── country=CN
./docs/sql-programming-guide.md:        │   │   └── data.parquet
./docs/sql-programming-guide.md:        │   └── ...
./docs/sql-programming-guide.md:            ├── ...
./docs/sql-programming-guide.md:            │
./docs/sql-programming-guide.md:            ├── country=US
./docs/sql-programming-guide.md:            │   └── data.parquet
./docs/sql-programming-guide.md:            ├── country=CN
./docs/sql-programming-guide.md:            │   └── data.parquet
./docs/sql-programming-guide.md:            └── ...
./sql/core/src/test/README.md:│   ├── *.avdl                  # Testing Avro IDL(s)
./sql/core/src/test/README.md:│   └── *.avpr                  # !! NO TOUCH !! Protocol files generated from Avro IDL(s)
./sql/core/src/test/README.md:│   ├── gen-avro.sh             # Script used to generate Java code for Avro
./sql/core/src/test/README.md:│   └── gen-thrift.sh           # Script used to generate Java code for Thrift
```

These seem to have been generated via the `tree` command, which inserts non-breaking spaces. They do not appear to cause any problem for rendering within code blocks, and I did not fix them, to reduce the overhead of manually replacing them whenever they are regenerated via the `tree` command in the future.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17517 from HyukjinKwon/non-breaking-space.
2017-04-03 10:09:11 +01:00
郭小龙 10207633 cf5963c961 [SPARK-20177] Document about compression way has some little detail changes.

## What changes were proposed in this pull request?

Small detail changes to the documentation of compression:
1. `spark.eventLog.compress`: add 'Compression will use spark.io.compression.codec.'
2. `spark.broadcast.compress`: add 'Compression will use spark.io.compression.codec.'
3. `spark.rdd.compress`: add 'Compression will use spark.io.compression.codec.'
4. `spark.io.compression.codec`: add a description of event log compression.

For example, from the current documents one cannot tell which compression codec is used for the event log.

## How was this patch tested?

Manual tests.

Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>

Closes #17498 from guoxiaolongzte/SPARK-20177.
2017-04-01 11:48:58 +01:00
Seigneurin, Alexis (CONT) 669a11b61b [DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation
Fixed a few typos.

There is one more I'm not sure of:

```
        Append mode uses watermark to drop old aggregation state. But the output of a
        windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
        the modes semantics, rows can be added to the Result Table only once after they are
```

Not sure how to change `is delayed the late threshold`.

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #17443 from aseigneurin/typos.
2017-03-30 16:12:17 +01:00
Yuming Wang edc87d76ef [SPARK-20107][DOC] Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to configuration.md
## What changes were proposed in this pull request?

Add the `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version` option to `configuration.md`; a sketch of its use follows the references below.
Setting `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2` can speed up [HadoopMapReduceCommitProtocol.commitJob](https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121) for jobs with many output files.

Cloudera's Hadoop 2.6.0-cdh5.4.0 and higher (see: 1c12361823 and 16b2de2732) and Apache Hadoop 2.7.0 and higher support this improvement.

More see:

1. [MAPREDUCE-4815](https://issues.apache.org/jira/browse/MAPREDUCE-4815): Speed up FileOutputCommitter#commitJob for many output files.
2. [MAPREDUCE-6406](https://issues.apache.org/jira/browse/MAPREDUCE-6406): Update the default version for the property mapreduce.fileoutputcommitter.algorithm.version to 2.
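
A minimal sketch of enabling the algorithm (the `spark.hadoop.` prefix forwards the property to the underlying Hadoop configuration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("committer-v2")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()
```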

## How was this patch tested?

Manual test and existing tests.

Author: Yuming Wang <wgyumg@gmail.com>

Closes #17442 from wangyum/SPARK-20107.
2017-03-30 10:39:57 +01:00
Yanbo Liang 1d00761b91 [MINOR][SPARKR] Move 'Data type mapping between R and Spark' to right place in SparkR doc.
The section `Data type mapping between R and Spark` is currently in the wrong place in the SparkR doc; we should move it to a separate section.

## What changes were proposed in this pull request?
Before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/24340911/bc01a532-126a-11e7-9a08-0d60d13a547c.png)

After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/24340938/d9d32a9a-126a-11e7-8891-d2f5b46e0c71.png)

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17440 from yanboliang/sparkr-doc.
2017-03-27 17:37:24 -07:00
sureshthalamati c791180705 [SPARK-10849][SQL] Adds option to the JDBC data source write for user to specify database column type for the create table
## What changes were proposed in this pull request?
Currently the JDBC data source creates tables in the target database using the default type mapping and the JDBC dialect mechanism. If users want to specify a different database data type for only some of the columns, there is no option available. In scenarios where the default mapping does not work, users are forced to create tables in the target database before writing. This workaround is probably not acceptable from a usability point of view. This PR provides a user-defined type mapping for specific columns.

The solution is to allow users to specify the database column data type for the created table as a JDBC data source option (`createTableColumnTypes`) on write. Data type information can be specified in the same format as table schema DDL (e.g. `name CHAR(64), comments VARCHAR(1024)`).

Not every target database type can be specified; the data types must also be valid Spark SQL data types. For example, a user cannot specify the target database's CLOB data type; that will be supported in a follow-up PR.

Example:
```Scala
df.write
.option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
.jdbc(url, "TEST.DBCOLTYPETEST", properties)
```
## How was this patch tested?
Added new test cases to the JDBCWriteSuite

Author: sureshthalamati <suresh.thalamati@gmail.com>

Closes #16209 from sureshthalamati/jdbc_custom_dbtype_option_json-spark-10849.
2017-03-23 17:39:33 -07:00
uncleGen facfd60886 [SPARK-20021][PYSPARK] Miss backslash in python code
## What changes were proposed in this pull request?

Add backslashes for line continuation in the Python code examples.

## How was this patch tested?

Jenkins.

Author: uncleGen <hustyugm@gmail.com>
Author: dylon <hustyugm@gmail.com>

Closes #17352 from uncleGen/python-example-doc.
2017-03-22 11:10:08 +00:00
christopher snow 7620aed828 [SPARK-20011][ML][DOCS] Clarify documentation for ALS 'rank' parameter
## What changes were proposed in this pull request?

API documentation and collaborative-filtering documentation page changes to clarify the inconsistent description of the ALS `rank` parameter.

 - [DOCS] was previously: "rank is the number of latent factors in the model."
 - [API] was previously:  "rank - number of features to use"

This change describes rank in both places consistently as:

 - "Number of features to use (also referred to as the number of latent factors)"

Author: Chris Snow <chris.snow@uk.ibm.com>
Author: christopher snow <chsnow123@gmail.com>

Closes #17345 from snowch/SPARK-20011.
2017-03-21 13:23:59 +00:00
Tyson Condie c2d1761a57 [SPARK-19906][SS][DOCS] Documentation describing how to write queries to Kafka
## What changes were proposed in this pull request?

Add documentation that describes how to write streaming and batch queries to Kafka.
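
A hedged sketch of the kind of streaming-to-Kafka query the new docs cover (assumes `df` is a streaming DataFrame with key/value columns; the broker, topic, and checkpoint path are illustrative):

```scala
val query = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("topic", "updates")
  .option("checkpointLocation", "/tmp/checkpoints")
  .start()
```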

zsxwing tdas

Author: Tyson Condie <tcondie@gmail.com>

Closes #17246 from tcondie/kafka-write-docs.
2017-03-20 17:18:59 -07:00
Sital Kedia 7b5d873aef [SPARK-13369] Add config for number of consecutive fetch failures
The previously hardcoded maximum of 4 retries per stage is not suitable for all cluster configurations. Since Spark retries a stage at the first sign of a fetch failure, you can easily end up with many stage retries before discovering all the failures. In particular, two scenarios where this value should change are (1) if there are more than 4 executors per node, in which case it may take 4 retries to discover the problem with each executor on the node, and (2) during cluster maintenance on large clusters, where multiple machines are serviced at once but total cluster downtime is not affordable. By making this value configurable, cluster managers can tune it to something more appropriate for their cluster configuration.

Unit tests

Author: Sital Kedia <skedia@fb.com>

Closes #17307 from sitalkedia/SPARK-13369.
2017-03-17 09:33:58 -05:00
uncleGen e29a74d5b1 [DOCS][SS] fix structured streaming python example
## What changes were proposed in this pull request?

- Fix the Structured Streaming Python example: `TypeError: 'xxx' object is not callable`
- Fix some other doc issues.

## How was this patch tested?

Jenkins.

Author: uncleGen <hustyugm@gmail.com>

Closes #17257 from uncleGen/docs-ss-python.
2017-03-12 08:29:37 +00:00
Yong Tang 8f0490e22b [SPARK-17979][SPARK-14453] Remove deprecated SPARK_YARN_USER_ENV and SPARK_JAVA_OPTS
This fix removes deprecated support for config `SPARK_YARN_USER_ENV`, as is mentioned in SPARK-17979.
This fix also removes deprecated support for the following:
```
SPARK_YARN_USER_ENV
SPARK_JAVA_OPTS
SPARK_CLASSPATH
SPARK_WORKER_INSTANCES
```

Related JIRA:
[SPARK-14453]: https://issues.apache.org/jira/browse/SPARK-14453
[SPARK-12344]: https://issues.apache.org/jira/browse/SPARK-12344
[SPARK-15781]: https://issues.apache.org/jira/browse/SPARK-15781
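
For reference, a hedged sketch of the non-deprecated equivalents (the values are illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executorEnv.MY_VAR", "value")                  // instead of SPARK_YARN_USER_ENV
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")    // instead of SPARK_JAVA_OPTS
  .set("spark.executor.instances", "4")                      // instead of SPARK_WORKER_INSTANCES
```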

Existing tests should pass.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #17212 from yongtang/SPARK-17979.
2017-03-10 13:34:01 -08:00
Liwei Lin 40da4d181d [SPARK-19715][STRUCTURED STREAMING] Option to Strip Paths in FileSource
## What changes were proposed in this pull request?

Today, we compare the whole path when deciding if a file is new in the FileSource for Structured Streaming. However, this causes false negatives when the path has changed in a cosmetic way (e.g. changing `s3n` to `s3a`).

This patch adds an option `fileNameOnly` that causes the new file check to be based only on the filename (but still store the whole path in the log).

## Usage

```scala
spark
  .readStream
  .option("fileNameOnly", true)
  .text("s3n://bucket/dir1/dir2")
  .writeStream
  ...
```
## How was this patch tested?

Added a test case

Author: Liwei Lin <lwlin7@gmail.com>

Closes #17120 from lw-lin/filename-only.
2017-03-09 11:02:44 -08:00
Wenchen Fan d69aeeaff4 [SPARK-19516][DOC] update public doc to use SparkSession instead of SparkContext
## What changes were proposed in this pull request?

After Spark 2.0, `SparkSession` became the new entry point for Spark applications. We should update the public documents to reflect this.
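
A minimal illustration of the new entry point:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("MyApp")
  .getOrCreate()

val sc = spark.sparkContext  // the older entry point remains reachable when needed
```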

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16856 from cloud-fan/doc.
2017-03-07 11:32:36 -08:00
VinceShieh 4a9034b173 [SPARK-17498][ML] StringIndexer enhancement for handling unseen labels
## What changes were proposed in this pull request?
This PR is an enhancement to ML StringIndexer.
Before this PR, StringIndexer only supported the "skip"/"error" options for dealing with unseen records.
But those unseen records might still be useful, and users may want to keep the unseen labels in certain use cases. This PR enables StringIndexer to keep unseen labels, assigning them the index [numLabels].

Before:
```scala
StringIndexer().setHandleInvalid("skip")
StringIndexer().setHandleInvalid("error")
```
After (with the third option supported):
```scala
StringIndexer().setHandleInvalid("keep")
```

## How was this patch tested?
Test added in StringIndexerSuite

Signed-off-by: VinceShieh <vincent.xie@intel.com>

Author: VinceShieh <vincent.xie@intel.com>

Closes #16883 from VinceShieh/spark-17498.
2017-03-07 11:24:20 -08:00
jerryshao ba186a841f [MINOR][DOC] Fix doc for web UI https configuration
## What changes were proposed in this pull request?

The doc about enabling HTTPS for the web UI is not correct: "spark.ui.https.enabled" does not exist; enabling SSL is actually enough for HTTPS.

## How was this patch tested?

N/A

Author: jerryshao <sshao@hortonworks.com>

Closes #17147 from jerryshao/fix-doc-ssl.
2017-03-03 14:23:31 -08:00
Zhe Sun 0bac3e4cde [SPARK-19797][DOC] ML pipeline document correction
## What changes were proposed in this pull request?
The description of pipelines in this paragraph is incorrect: https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works

> If the Pipeline had more **stages**, it would call the LogisticRegressionModel's transform() method on the DataFrame before passing the DataFrame to the next stage.

Reason: a Transformer can also be a stage, but only another Estimator will trigger a transform call that passes the data to the next stage. The description in the document misleads ML pipeline users.

## How was this patch tested?
This is a tiny modification of **docs/ml-pipelines.md**. I ran `jekyll build` on the modification and checked the compiled document.

Author: Zhe Sun <ymwdalex@gmail.com>

Closes #17137 from ymwdalex/SPARK-19797-ML-pipeline-document-correction.
2017-03-03 11:55:57 +01:00
Nick Pentreath 9cca3dbf4a [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" usage in ALS
[SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489) added the ability to skip `NaN` predictions during `ALSModel.transform`. This PR adds documentation for the `coldStartStrategy` param to the ALS user guide, and adds code to the examples to illustrate its usage, as sketched below.
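
A hedged usage sketch (assumes `model` is a fitted ALSModel and `test` a DataFrame; the "drop" value skips `NaN` predictions so evaluation metrics stay defined):

```scala
model.setColdStartStrategy("drop")
val predictions = model.transform(test)
```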

## How was this patch tested?

Doc and example change only. Built the HTML doc locally and verified that the example code builds and runs in the shell for Scala/Python.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17102 from MLnick/SPARK-19345-coldstart-doc.
2017-03-02 15:51:16 +02:00
Felix Cheung 8d6ef895ee [SPARK-18352][DOCS] wholeFile JSON update doc and programming guide
## What changes were proposed in this pull request?

Update the R doc and the programming guide; clarify the default behavior for all languages. A sketch of the option follows.
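
A hedged sketch of reading multi-line JSON with the option this doc update covers (assumes a `spark` session; the path is illustrative):

```scala
val df = spark.read
  .option("wholeFile", true)
  .json("examples/src/main/resources/people.json")
```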

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17128 from felixcheung/jsonwholefiledoc.
2017-03-02 01:02:38 -08:00
Michael McCune bf5987cbe6 [SPARK-19769][DOCS] Update quickstart instructions
## What changes were proposed in this pull request?

This change addresses the renaming of the `simple.sbt` build file to
`build.sbt`. Newer versions of the sbt tool do not find the older file
name and look for `build.sbt` instead. The quickstart instructions for
self-contained applications are updated with this change.

## How was this patch tested?

As this is a relatively minor change of a few words, the markdown was checked for syntax and spelling. Site was built with `SKIP_API=1 jekyll serve` for testing purposes.

Author: Michael McCune <msm@redhat.com>

Closes #17101 from elmiko/spark-19769.
2017-03-01 00:07:16 +01:00
actuaryzhang d743ea4c76 [MINOR][DOC] Update GLM doc to include tweedie distribution
Update the GLM documentation to include the Tweedie distribution. #16344
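
A hedged sketch of a Tweedie GLM (the variance power 1.5 is illustrative):

```scala
import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("tweedie")
  .setVariancePower(1.5)  // 1 < p < 2 gives a compound Poisson-gamma model
```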

jkbradley yanboliang

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #17103 from actuaryzhang/doc.
2017-02-28 14:43:44 -08:00
Yuming Wang 9b8eca65dc [SPARK-19660][CORE][SQL] Replace the configuration property names that are deprecated in the version of Hadoop 2.6
## What changes were proposed in this pull request?

Replace all the Hadoop deprecated configuration property names according to [DeprecatedProperties](https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/DeprecatedProperties.html).

except:
https://github.com/apache/spark/blob/v2.1.0/python/pyspark/sql/tests.py#L1533
https://github.com/apache/spark/blob/v2.1.0/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L987
https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/SetCommand.scala#L45
https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L614

## How was this patch tested?

Existing tests

Author: Yuming Wang <wgyumg@gmail.com>

Closes #16990 from wangyum/HadoopDeprecatedProperties.
2017-02-28 10:13:42 +00:00
Boaz Mohar 061bcfb869 [MINOR][DOCS] Fixes two problems in the SQL programing guide page
## What changes were proposed in this pull request?

Removed duplicated lines in the SQL Python example and fixed a typo.

## How was this patch tested?

Searched for other typos on the page to minimize PRs.

Author: Boaz Mohar <boazmohar@gmail.com>

Closes #17066 from boazmohar/doc-fix.
2017-02-25 11:32:09 -08:00
Ramkumar Venkataraman 1b9ba258e0 [MINOR][DOCS] Fix few typos in structured streaming doc
## What changes were proposed in this pull request?

Fixed a minor typo (`even-time` is changed to `event-time`) and a couple of grammatical errors.

## How was this patch tested?

N/A - since this is a doc fix. I did a jekyll build locally though.

Author: Ramkumar Venkataraman <rvenkataraman@paypal.com>

Closes #17037 from ramkumarvenkat/doc-fix.
2017-02-25 02:18:22 +00:00
Shubham Chopra fa7c582e94 [SPARK-15355][CORE] Proactive block replication
## What changes were proposed in this pull request?

We propose the addition of proactive block replication in case of executor failures. BlockManagerMasterEndpoint does all the book-keeping to keep track of all the executors and the blocks they hold. It also keeps track of which executors are alive through heartbeats. When an executor is removed, all this book-keeping state is updated to reflect the lost executor. This step can be used to identify executors that are still in possession of a copy of the cached data, and a message can be sent to them to use the existing "replicate" function to find and place new replicas on other suitable hosts. Blocks replicated this way will let the master know of their existence.

This happens when an executor is lost, and is thus proactive, as opposed to replication being done at query time.

## How was this patch tested?

This patch was tested with existing unit tests along with new unit tests added to test the functionality.

Author: Shubham Chopra <schopra31@bloomberg.net>

Closes #14412 from shubhamchopra/ProactiveBlockReplication.
2017-02-24 15:40:01 -08:00
uncleGen d027624574 [SPARK-16122][DOCS] application environment rest api
## What changes were proposed in this pull request?

Follow-up PR of #16949.

## How was this patch tested?

jenkins

Author: uncleGen <hustyugm@gmail.com>

Closes #17033 from uncleGen/doc-restapi-environment.
2017-02-23 17:06:14 -08:00
Kay Ousterhout f87a6a59af [SPARK-19684][DOCS] Remove developer info from docs.
This commit moves developer-specific information from the release-
specific documentation in this repo to the developer tools page on
the main Spark website. This commit relies on this PR on the
Spark website: https://github.com/apache/spark-website/pull/33.

srowen

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #17018 from kayousterhout/SPARK-19684.
2017-02-23 13:27:47 -08:00
Marcelo Vanzin 4661d30b98 [SPARK-19554][UI,YARN] Allow SHS URL to be used for tracking in YARN RM.
Allow an application to use the History Server URL as the tracking
URL in the YARN RM, so there's still a link to the web UI somewhere
in YARN even if the driver's UI is disabled. This is useful, for
example, if an admin wants to disable the driver UI by default for
applications, since it's harder to secure it (since it involves non
trivial ssl certificate and auth management that admins may not want
to expose to user apps).

This needs to be opt-in, because of the way the YARN proxy works, so
a new configuration was added to enable the option.

The YARN RM will proxy requests to live AMs instead of redirecting
the client, so pages in the SHS UI will not render correctly since
they'll reference invalid paths in the RM UI. The proxy base support
in the SHS cannot be used since that would prevent direct access to
the SHS.

So, to solve this problem, for the feature to work end-to-end, a new
YARN-specific filter was added that detects whether the requests come
from the proxy and redirects the client appropriately. The SHS admin has
to add this filter manually if they want the feature to work.

Tested with new unit test, and by running with the documented configuration
set in a test cluster. Also verified the driver UI is used when it's
enabled.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16946 from vanzin/SPARK-19554.
2017-02-22 14:37:53 -08:00