ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
zhengruifeng	b29829e2ab	[SPARK-25584][ML][DOC] datasource for libsvm user guide ## What changes were proposed in this pull request? It seems that the doc for the libsvm datasource was not added in https://github.com/apache/spark/pull/22675. This PR adds it. ## How was this patch tested? Doc built locally ![image](https://user-images.githubusercontent.com/7322292/62044350-4ad51480-b235-11e9-8f09-cbcbe9d3b7f9.png) Closes #25286 from zhengruifeng/doc_libsvm_data_source. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-01 09:15:42 -05:00
gengjiaan	d03ec65f01	[SPARK-27924][SQL] Support ANSI SQL Boolean-Predicate syntax ## What changes were proposed in this pull request? This PR aims to support ANSI SQL `Boolean-Predicate` syntax. ```sql expression IS [NOT] TRUE expression IS [NOT] FALSE expression IS [NOT] UNKNOWN ``` Several mainstream databases support this syntax. - PostgreSQL: https://www.postgresql.org/docs/9.1/functions-comparison.html - Hive: https://issues.apache.org/jira/browse/HIVE-13583 - Redshift: https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html - Vertica: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/Boolean-predicate.htm For example: ```sql spark-sql> select null is true, null is not true; false true spark-sql> select false is true, false is not true; false true spark-sql> select true is true, true is not true; true false spark-sql> select null is false, null is not false; false true spark-sql> select false is false, false is not false; true false spark-sql> select true is false, true is not false; false true spark-sql> select null is unknown, null is not unknown; true false spark-sql> select false is unknown, false is not unknown; false true spark-sql> select true is unknown, true is not unknown; false true ``` Note: A null input is treated as the logical value "unknown". ## How was this patch tested? Pass the Jenkins with the newly added test cases. Closes #25074 from beliefer/ansi-sql-boolean-test. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-30 23:59:50 -07:00
gengjiaan	dba4375359	[MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress ## What changes were proposed in this pull request? The latest docs at http://spark.apache.org/docs/latest/configuration.html contain the following description: spark.ui.showConsoleProgress \| true \| Show the progress bar in the console. The progress bar shows the progress of stages that run for longer than 500ms. If multiple stages run at the same time, multiple progress bars will be displayed on the same line. -- \| -- \| -- But the class `org.apache.spark.internal.config.UI` defines the config `spark.ui.showConsoleProgress` as follows: ``` val UI_SHOW_CONSOLE_PROGRESS = ConfigBuilder("spark.ui.showConsoleProgress") .doc("When true, show the progress bar in the console.") .booleanConf .createWithDefault(false) ``` So there is a small inconsistency that may confuse readers. ## How was this patch tested? No UT needed. Closes #25297 from beliefer/inconsistent-desc-showConsoleProgress. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-31 12:17:44 +09:00
zhengruifeng	44c28d7515	[SPARK-28399][ML][PYTHON] implement RobustScaler ## What changes were proposed in this pull request? Implement `RobustScaler` Since the transformation is quite similar to `StandardScaler`, I refactor the transform function so that it can be reused in both scalers. ## How was this patch tested? existing and added tests Closes #25160 from zhengruifeng/robust_scaler. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-07-30 10:24:33 -05:00
Junjie Chen	780d176136	[SPARK-28042][K8S] Support using volume mount as local storage ## What changes were proposed in this pull request? This PR supports using hostPath/PV volume mounts as local storage. In KubernetesExecutorBuilder.scala, the LocalDirsFeatureStep is built before MountVolumesFeatureStep, which means it cannot use any volumes mounted later. This PR adjusts the order of the feature building steps, moving localDirsFeature to the end, so that we can check whether directories in SPARK_LOCAL_DIRS are set to mounted volumes such as hostPath, PV, or others. ## How was this patch tested? Unit tests Closes #24879 from chenjunjiedada/SPARK-28042. Lead-authored-by: Junjie Chen <jimmyjchen@tencent.com> Co-authored-by: Junjie Chen <cjjnjust@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-29 10:44:17 -07:00
Lee Dongjin	d98aa2a184	[MINOR] Trivial cleanups These are what I found during working on #22282. - Remove unused value: `UnsafeArraySuite#defaultTz` - Remove redundant new modifier to the case class, `KafkaSourceRDDPartition` - Remove unused variables from `RDD.scala` - Remove trailing space from `structured-streaming-kafka-integration.md` - Remove redundant parameter from `ArrowConvertersSuite`: `nullable` is `true` by default. - Remove leading empty line: `UnsafeRow` - Remove trailing empty line: `KafkaTestUtils` - Remove unthrown exception type: `UnsafeMapData` - Replace unused declarations: `expressions` - Remove duplicated default parameter: `AnalysisErrorSuite` - `ObjectExpressionsSuite`: remove duplicated parameters, conversions and unused variable Closes #25251 from dongjinleekr/cleanup/201907. Authored-by: Lee Dongjin <dongjin@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-29 23:38:02 +09:00
Luca Canali	f2a2d980ed	[SPARK-25285][CORE] Add startedTasks and finishedTasks to the metrics system in the executor instance ## What changes were proposed in this pull request? The motivation for these additional metrics is to help in troubleshooting and monitoring task execution workload when running on a cluster. Currently available metrics include executor threadpool metrics for completed tasks and for active tasks. The addition of the threadpool taskStarted metric will make it possible, for example, to collect info on the (approximate) number of failed tasks by computing the difference: threads started – (active threads + completed tasks and/or successfully finished tasks). The proposed metric finishedTasks is also intended for this type of troubleshooting. The difference between finishedTasks and threadpool.completeTasks is that the latter is a (dropwizard library) gauge taken from the threadpool, while the former is a (dropwizard) counter computed in the [[Executor]] class, when a task successfully finishes, together with several other task metrics counters. Note that there are similarities with some of the metrics introduced in SPARK-24398; however, there are key differences, coming from the fact that this PR concerns the executor source, therefore providing metric values per executor, and the metric values do not need to pass through the listener bus in this case. ## How was this patch tested? Manually tested on a YARN cluster Closes #22290 from LucaCanali/AddMetricExecutorStartedTasks. Lead-authored-by: Luca Canali <luca.canali@cern.ch> Co-authored-by: LucaCanali <luca.canali@cern.ch> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-26 14:03:57 -07:00
Douglas R Colkitt	8fc5cb6285	[SPARK-28473][DOC] Stylistic consistency of build command in README ## What changes were proposed in this pull request? Change the format of the build command in the README to start with a `./` prefix ./build/mvn -DskipTests clean package This increases stylistic consistency across the README- all the other commands have a `./` prefix. Having a visible `./` prefix also makes it clear to the user that the shell command requires the current working directory to be at the repository root. ## How was this patch tested? README.md was reviewed both in raw markdown and in the Github rendered landing page for stylistic consistency. Closes #25231 from Mister-Meeseeks/master. Lead-authored-by: Douglas R Colkitt <douglas.colkitt@gmail.com> Co-authored-by: Mister-Meeseeks <douglas.colkitt@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-23 16:29:46 -07:00
HyukjinKwon	e3f7ca37db	[SPARK-28321][DOCS][FOLLOW-UP] Update migration guide by 0-args Java UDF's internal behaviour change ## What changes were proposed in this pull request? This PR proposes to add a note in the migration guide. See https://github.com/apache/spark/pull/25108#issuecomment-513526585 ## How was this patch tested? N/A Closes #25224 from HyukjinKwon/SPARK-28321-doc. Lead-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-07-22 16:33:31 +08:00
Dongjoon Hyun	c97f06de94	[SPARK-25705][DOC][FOLLOWUP] Recover links to structured-streaming-kafka-integration ## What changes were proposed in this pull request? This PR is a follow-up PR to recover three links from [the previous commit](https://github.com/apache/spark/pull/22703/files#diff-21245da8f8dbfef6401c5500f559f0bc). Currently, those three are broken. ``` $ git grep structured-streaming-kafka-0-10-integration structured-streaming-programming-guide.md: - Kafka source - Reads data from Kafka. It's compatible with Kafka broker versions 0.10.0 or higher. See the [Kafka Integration Guide](structured-streaming-kafka-0-10-integration.html) for more details. structured-streaming-programming-guide.md: See the <a href="structured-streaming-kafka-0-10-integration.html">Kafka Integration Guide</a>. structured-streaming-programming-guide.md: <td>See the <a href="structured-streaming-kafka-0-10-integration.html">Kafka Integration Guide</a></td> ``` It's because we have `structured-streaming-kafka-integration.html` instead of `structured-streaming-kafka-0-10-integration.html`. ``` $ find . -name structured-streaming-kafka-0-10-integration.md $ find . -name structured-streaming-kafka-integration.md ./structured-streaming-kafka-integration.md ``` ## How was this patch tested? Manual. Closes #25221 from dongjoon-hyun/SPARK-25705. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-22 11:22:06 +09:00
Arun Pandian	a0a58cf2ef	[SPARK-28464][DOC][SS] Document Kafka source minPartitions option Adding doc for the kafka source minPartitions option to "Structured Streaming + Kafka Integration Guide" The text is based on the content in https://docs.databricks.com/spark/latest/structured-streaming/kafka.html#configuration Closes #25219 from arunpandianp/SPARK-28464. Authored-by: Arun Pandian <apandian@groupon.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-21 13:13:30 -07:00
HyukjinKwon	0512af1668	[SPARK-28389][SQL][FOLLOW-UP] Use one example in 'add_months' behavior change at migration guide ## What changes were proposed in this pull request? This PR proposes to add one example to describe 'add_months' behaviour change by https://github.com/apache/spark/pull/25153. Spark 2.4: ```sql select add_months(DATE'2019-02-28', 1) ``` ``` +--------------------------------+ \|add_months(DATE '2019-02-28', 1)\| +--------------------------------+ \| 2019-03-31\| +--------------------------------+ ``` Current master: ```sql select add_months(DATE'2019-02-28', 1) ``` ``` +--------------------------------+ \|add_months(DATE '2019-02-28', 1)\| +--------------------------------+ \| 2019-03-28\| +--------------------------------+ ``` ## How was this patch tested? Manually tested on Spark 2.4.1 and the current master. Closes #25199 from HyukjinKwon/SPARK-28389. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-19 14:29:16 +09:00
Marcelo Vanzin	2ddeff97d7	[SPARK-27963][CORE] Allow dynamic allocation without a shuffle service. This change adds a new option that enables dynamic allocation without the need for a shuffle service. This mode works by tracking which stages generate shuffle files, and keeping executors that generate data for those shuffles alive while the jobs that use them are active. A separate timeout is also added for shuffle data; so that executors that hold shuffle data can use a separate timeout before being removed because of being idle. This allows the shuffle data to be kept around in case it is needed by some new job, or allow users to be more aggressive in timing out executors that don't have shuffle data in active use. The code also hooks up to the context cleaner so that shuffles that are garbage collected are detected, and the respective executors not held unnecessarily. Testing done with added unit tests, and also with TPC-DS workloads on YARN without a shuffle service. Closes #24817 from vanzin/SPARK-27963. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-16 16:37:38 -07:00
Thomas Graves	43d68cd4ff	[SPARK-27959][YARN] Change YARN resource configs to use .amount ## What changes were proposed in this pull request? We are adding generic resource support to Spark, where the amount of a resource has a suffix so that other configs can be supported. Spark on YARN had already added configs to request resources via spark.yarn.{executor/driver/am}.resource=<some amount>, where <some amount> is the value and unit together. We should change those configs to have a `.amount` suffix to match the Spark configs and to allow future configs to be added more easily. YARN itself already supports tags and attributes, so if we want the user to be able to pass those from Spark at some point, having a suffix makes sense. It would allow for a spark.yarn.{executor/driver/am}.resource.{resource}.tag= type config. ## How was this patch tested? Tested via unit tests and manually on a YARN 3.x cluster with GPU resources configured. Closes #24989 from tgravescs/SPARK-27959-yarn-resourceconfigs. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-16 10:56:07 -07:00
Maxim Gekk	f241fc7776	[SPARK-28389][SQL] Use Java 8 API in add_months ## What changes were proposed in this pull request? In the PR, I propose to use the `plusMonths()` method of `LocalDate` to add months to a date. This method adds the specified amount to the months field of `LocalDate` in three steps: 1. Add the input months to the month-of-year field 2. Check if the resulting date would be invalid 3. Adjust the day-of-month to the last valid day if necessary The difference between the current behavior and the proposed one is in handling the last day of the month in the original date. For example, adding 1 month to `2019-02-28` will produce `2019-03-28`, compared to the current implementation where the result is `2019-03-31`. The proposed behavior is implemented in MySQL and PostgreSQL. ## How was this patch tested? By existing test suites `DateExpressionsSuite`, `DateFunctionsSuite` and `DateTimeUtilsSuite`. Closes #25153 from MaxGekk/add-months. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-07-15 20:49:39 +08:00
Peter Toth	1a26126d8c	[SPARK-28228][SQL] Fix substitution order of nested WITH clauses ## What changes were proposed in this pull request? This PR adds compatible handling of a `WITH` clause nested within another `WITH` clause. Before this PR these queries returned `1`, while after this PR they return `2`, as PostgreSQL does: ``` WITH t AS (SELECT 1), t2 AS ( WITH t AS (SELECT 2) SELECT * FROM t ) SELECT * FROM t2 ``` ``` WITH t AS (SELECT 1) SELECT ( WITH t AS (SELECT 2) SELECT * FROM t ) ``` As this is an incompatible change, the PR introduces the `spark.sql.legacy.cte.substitution.enabled` flag as an option to restore old behaviour. ## How was this patch tested? Added new UTs. Closes #25029 from peter-toth/SPARK-28228. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-12 07:17:33 -07:00
Gabor Somogyi	f83000597f	[SPARK-23472][CORE] Add defaultJavaOptions for driver and executor. ## What changes were proposed in this pull request? This PR adds two new config properties: `spark.driver.defaultJavaOptions` and `spark.executor.defaultJavaOptions`. These are intended to be set by administrators in a file of defaults for options like JVM garbage collection algorithm. Users will still set `extraJavaOptions` properties, and both sets of JVM options will be added to start a JVM (default options are prepended to extra options). ## How was this patch tested? Existing + additional unit tests. ``` cd docs/ SKIP_API=1 jekyll build ``` Manual webpage check. Closes #24804 from gaborgsomogyi/SPARK-23472. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-11 09:37:26 -07:00
Gabor Somogyi	d47c219f94	[SPARK-28055][SS][DSTREAMS] Add delegation token custom AdminClient configurations. ## What changes were proposed in this pull request? At the moment Kafka delegation tokens are fetched through `AdminClient` but there is no possibility to add custom configuration parameters. In [options](https://spark.apache.org/docs/2.4.3/structured-streaming-kafka-integration.html#kafka-specific-configurations) there is already a possibility to add custom configurations. In this PR I've added a similar possibility to `AdminClient`. ## How was this patch tested? Existing + added unit tests. ``` cd docs/ SKIP_API=1 jekyll build ``` Manual webpage check. Closes #24875 from gaborgsomogyi/SPARK-28055. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-11 09:36:24 -07:00
Zhu, Lipeng	b89c3de1a4	[SPARK-28310][SQL] Support (FIRST_VALUE\|LAST_VALUE)(expr[ (IGNORE\|RESPECT) NULLS]?) syntax ## What changes were proposed in this pull request? According to the ANSI SQL 2011 ![image](https://user-images.githubusercontent.com/698621/60855327-d01c6900-a235-11e9-9a1b-d438615a4673.png) Below are Teradata, Oracle, Redshift which already support this grammar. - Teradata - https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/SUwCpTupqmlBJvi2mipOaA - Oracle - https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/FIRST_VALUE.html#GUID-D454EC3F-370C-4C64-9B11-33FCB10D95EC - Redshift – https://docs.aws.amazon.com/redshift/latest/dg/r_WF_first_value.html - Postgresql didn't implement this grammar: https://www.postgresql.org/docs/devel/functions-window.html >The SQL standard defines a RESPECT NULLS or IGNORE NULLS option for lead, lag, first_value, last_value, and nth_value. This is not implemented in PostgreSQL: the behavior is always the same as the standard's default, namely RESPECT NULLS. ## How was this patch tested? UT. Closes #25082 from lipzhu/SPARK-28310. Authored-by: Zhu, Lipeng <lipzhu@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-10 07:41:05 -07:00
Dongjoon Hyun	bbc2be4f42	[SPARK-28294][CORE] Support `spark.history.fs.cleaner.maxNum` configuration ## What changes were proposed in this pull request? Up to now, Apache Spark maintains the given event log directory by time policy, `spark.history.fs.cleaner.maxAge`. However, there are two issues. 1. Some file systems have a limitation on the maximum number of files in a single directory. For example, HDFS `dfs.namenode.fs-limits.max-directory-items` is 1024 * 1024 by default. https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml 2. Spark is sometimes unable to clean up some old log files due to permission issues (mainly, security policy). To handle both (1) and (2), this PR aims to support an additional policy configuration for the maximum number of files in the event log directory, `spark.history.fs.cleaner.maxNum`. Spark will try to keep the number of files in the event log directory according to this policy. ## How was this patch tested? Pass the Jenkins with a newly added test case. Closes #25072 from dongjoon-hyun/SPARK-28294. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-10 07:19:47 -07:00
Yuming Wang	90c64ea419	[SPARK-28267][DOC] Update building-spark.md(support build with hadoop-3.2) ## What changes were proposed in this pull request? Since [SPARK-23710](https://issues.apache.org/jira/browse/SPARK-23710), Hadoop 3.x can support Hive. This PR adds _build with `hadoop-3.2`_ to building-spark.md. ## How was this patch tested? manual tests ``` cd docs SKIP_API=1 jekyll build ``` ![image](https://user-images.githubusercontent.com/5399861/60942057-cf5a0480-a313-11e9-9534-4765520e799f.png) Closes #25063 from wangyum/SPARK-28267. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-07-10 08:51:08 -05:00
HyukjinKwon	cdbc30213b	[SPARK-28226][PYTHON] Document Pandas UDF mapInPandas ## What changes were proposed in this pull request? This PR proposes to document `MAP_ITER` with `mapInPandas`. ## How was this patch tested? Manually checked the documentation. ![Screen Shot 2019-07-05 at 1 52 30 PM](https://user-images.githubusercontent.com/6477701/60698812-26cf2d80-9f2c-11e9-8295-9c00c28f5569.png) ![Screen Shot 2019-07-05 at 1 48 53 PM](https://user-images.githubusercontent.com/6477701/60698710-ac061280-9f2b-11e9-8521-a4f361207e06.png) Closes #25025 from HyukjinKwon/SPARK-28226. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-07 09:07:52 +09:00
Yuming Wang	4caf81a48f	[SPARK-28093][SQL][FOLLOW-UP] Update trim function behavior changes to migration guide ## What changes were proposed in this pull request? We changed our non-standard syntax for `trim` function in #24902 from `TRIM(trimStr, str)` to `TRIM(str, trimStr)` to be compatible with other databases. This PR updates the migration guide. I checked various databases (PostgreSQL, Teradata, Vertica, Oracle, DB2, SQL Server 2019, MySQL, Hive, Presto) and it seems that only PostgreSQL and Presto support this non-standard syntax. PostgreSQL: ```sql postgres=# select substr(version(), 0, 16), trim('yxTomxx', 'x'); substr \| btrim -----------------+------- PostgreSQL 11.3 \| yxTom (1 row) ``` Presto: ```sql presto> select trim('yxTomxx', 'x'); _col0 ------- yxTom (1 row) ``` ## How was this patch tested? manual tests Closes #24948 from wangyum/SPARK-28093-FOLLOW-UP-DOCS. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-05 17:55:54 -07:00
zhengruifeng	443b158182	[SPARK-26970][DOC][FOLLOWUP] link doc & example of Interaction ## What changes were proposed in this pull request? link doc & example of Interaction ## How was this patch tested? existing tests Closes #25027 from zhengruifeng/py_doc_interaction. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-07-02 17:30:57 -05:00
gengjiaan	832ff87918	[SPARK-28077][SQL] Support ANSI SQL OVERLAY function. ## What changes were proposed in this pull request? The `OVERLAY` function is part of `ANSI` `SQL`. For example: ``` SELECT OVERLAY('abcdef' PLACING '45' FROM 4); SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5); SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5 FOR 0); SELECT OVERLAY('babosa' PLACING 'ubb' FROM 2 FOR 4); ``` The results of the above four `SQL` statements are: ``` abc45f yabadaba yabadabadoo bubba ``` Note: If the input string is null, then the result is null too. Several mainstream databases support this syntax. PostgreSQL: https://www.postgresql.org/docs/11/functions-string.html Vertica: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/String/OVERLAY.htm?zoom_highlight=overlay Oracle: https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/UTL_RAW.html#GUID-342E37E7-FE43-4CE1-A0E9-7DAABD000369 DB2: https://www.ibm.com/support/knowledgecenter/SSGMCP_5.3.0/com.ibm.cics.rexx.doc/rexx/overlay.html Below is some output of this PR from my production environment. 
``` spark-sql> SELECT OVERLAY('abcdef' PLACING '45' FROM 4); abc45f Time taken: 6.385 seconds, Fetched 1 row(s) spark-sql> SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5); yabadaba Time taken: 0.191 seconds, Fetched 1 row(s) spark-sql> SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5 FOR 0); yabadabadoo Time taken: 0.186 seconds, Fetched 1 row(s) spark-sql> SELECT OVERLAY('babosa' PLACING 'ubb' FROM 2 FOR 4); bubba Time taken: 0.151 seconds, Fetched 1 row(s) spark-sql> SELECT OVERLAY(null PLACING '45' FROM 4); NULL Time taken: 0.22 seconds, Fetched 1 row(s) spark-sql> SELECT OVERLAY(null PLACING 'daba' FROM 5); NULL Time taken: 0.157 seconds, Fetched 1 row(s) spark-sql> SELECT OVERLAY(null PLACING 'daba' FROM 5 FOR 0); NULL Time taken: 0.254 seconds, Fetched 1 row(s) spark-sql> SELECT OVERLAY(null PLACING 'ubb' FROM 2 FOR 4); NULL Time taken: 0.159 seconds, Fetched 1 row(s) ``` ## How was this patch tested? Exists UT and new UT. Closes #24918 from beliefer/ansi-sql-overlay. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2019-06-28 19:13:08 +09:00
Josh Rosen	d83f84a122	[SPARK-27676][SQL][SS] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles ## What changes were proposed in this pull request? Spark's `InMemoryFileIndex` contains two places where `FileNotFound` exceptions are caught and logged as warnings (during [directory listing](`bcd3b61c4b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala (L274)`) and [block location lookup](`bcd3b61c4b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala (L333)`)). This logic was added in #15153 and #21408. I think that this is a dangerous default behavior because it can mask bugs caused by race conditions (e.g. overwriting a table while it's being read) or S3 consistency issues (there's more discussion on this in the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-27676)). Failing fast when we detect missing files is not sufficient to make concurrent table reads/writes or S3 listing safe (there are other classes of eventual consistency issues to worry about), but I think it's still beneficial to throw exceptions and fail-fast on the subset of inconsistencies / races that we _can_ detect because that increases the likelihood that an end user will notice the problem and investigate further. There may be some cases where users _do_ want to ignore missing files, but I think that should be an opt-in behavior via the existing `spark.sql.files.ignoreMissingFiles` flag (the current behavior is itself race-prone because a file might be deleted between catalog listing and query execution time, triggering FileNotFoundExceptions on executors (which are handled in a way that _does_ respect `ignoreMissingFiles`)). This PR updates `InMemoryFileIndex` to guard the log-and-ignore-FileNotFoundException behind the existing `spark.sql.files.ignoreMissingFiles` flag. Note: this is a change of default behavior, so I think it needs to be mentioned in release notes. 
## How was this patch tested? New unit tests to simulate file-deletion race conditions, tested with both values of the `ignoreMissingFiles` flag. Closes #24668 from JoshRosen/SPARK-27676. Lead-authored-by: Josh Rosen <rosenville@gmail.com> Co-authored-by: Josh Rosen <joshrosen@stripe.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-06-26 09:11:28 +09:00
Gabor Somogyi	1a915bf20f	[MINOR][SQL][DOCS] failOnDataLoss has effect on batch queries so fix the doc ## What changes were proposed in this pull request? According to the [Kafka integration document](https://spark.apache.org/docs/2.4.0/structured-streaming-kafka-integration.html) `failOnDataLoss` has effect only on streaming queries. While I was implementing the DSv2 Kafka batch sources I've realized it's not true. This feature is covered in [KafkaDontFailOnDataLossSuite](`54da3bbfb2/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaDontFailOnDataLossSuite.scala (L180)`). In this PR I've updated the doc to reflect this behavior. ## How was this patch tested? ``` cd docs/ SKIP_API=1 jekyll build ``` Manual webpage check. Closes #24932 from gaborgsomogyi/failOnDataLoss. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-06-23 19:23:57 -05:00
Dongjoon Hyun	47f54b1ec7	[SPARK-28118][CORE] Add `spark.eventLog.compression.codec` configuration ## What changes were proposed in this pull request? Event logs are different from the other data in terms of the lifetime. It would be great to have a new configuration for Spark event log compression like `spark.eventLog.compression.codec` . This PR adds this new configuration as an optional configuration. So, if `spark.eventLog.compression.codec` is not given, `spark.io.compression.codec` will be used. ## How was this patch tested? Pass the Jenkins with the newly added test case. Closes #24921 from dongjoon-hyun/SPARK-28118. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-06-21 00:43:38 +00:00
Yuming Wang	fe5145ede2	[SPARK-28109][SQL] Fix TRIM(type trimStr FROM str) returns incorrect value ## What changes were proposed in this pull request? [SPARK-28093](https://issues.apache.org/jira/browse/SPARK-28093) fixed `TRIM/LTRIM/RTRIM('str', 'trimStr')` returning an incorrect value, but that fix introduced a new bug: `TRIM(type trimStr FROM str)` returns an incorrect value. This PR fixes this issue. ## How was this patch tested? unit tests and manual tests: Before this PR: ```sql spark-sql> SELECT trim('yxTomxx', 'xyz'), trim(BOTH 'xyz' FROM 'yxTomxx'); Tom z spark-sql> SELECT trim('xxxbarxxx', 'x'), trim(BOTH 'x' FROM 'xxxbarxxx'); bar spark-sql> SELECT ltrim('zzzytest', 'xyz'), trim(LEADING 'xyz' FROM 'zzzytest'); test xyz spark-sql> SELECT ltrim('zzzytestxyz', 'xyz'), trim(LEADING 'xyz' FROM 'zzzytestxyz'); testxyz spark-sql> SELECT ltrim('xyxXxyLAST WORD', 'xy'), trim(LEADING 'xy' FROM 'xyxXxyLAST WORD'); XxyLAST WORD spark-sql> SELECT rtrim('testxxzx', 'xyz'), trim(TRAILING 'xyz' FROM 'testxxzx'); test xy spark-sql> SELECT rtrim('xyztestxxzx', 'xyz'), trim(TRAILING 'xyz' FROM 'xyztestxxzx'); xyztest spark-sql> SELECT rtrim('TURNERyxXxy', 'xy'), trim(TRAILING 'xy' FROM 'TURNERyxXxy'); TURNERyxX ``` After this PR: ```sql spark-sql> SELECT trim('yxTomxx', 'xyz'), trim(BOTH 'xyz' FROM 'yxTomxx'); Tom Tom spark-sql> SELECT trim('xxxbarxxx', 'x'), trim(BOTH 'x' FROM 'xxxbarxxx'); bar bar spark-sql> SELECT ltrim('zzzytest', 'xyz'), trim(LEADING 'xyz' FROM 'zzzytest'); test test spark-sql> SELECT ltrim('zzzytestxyz', 'xyz'), trim(LEADING 'xyz' FROM 'zzzytestxyz'); testxyz testxyz spark-sql> SELECT ltrim('xyxXxyLAST WORD', 'xy'), trim(LEADING 'xy' FROM 'xyxXxyLAST WORD'); XxyLAST WORD XxyLAST WORD spark-sql> SELECT rtrim('testxxzx', 'xyz'), trim(TRAILING 'xyz' FROM 'testxxzx'); test test spark-sql> SELECT rtrim('xyztestxxzx', 'xyz'), trim(TRAILING 'xyz' FROM 'xyztestxxzx'); xyztest xyztest spark-sql> SELECT rtrim('TURNERyxXxy', 'xy'), trim(TRAILING 'xy' FROM 
'TURNERyxXxy'); TURNERyxX TURNERyxX ``` And PostgreSQL: ```sql postgres=# SELECT trim('yxTomxx', 'xyz'), trim(BOTH 'xyz' FROM 'yxTomxx'); btrim \| btrim -------+------- Tom \| Tom (1 row) postgres=# SELECT trim('xxxbarxxx', 'x'), trim(BOTH 'x' FROM 'xxxbarxxx'); btrim \| btrim -------+------- bar \| bar (1 row) postgres=# SELECT ltrim('zzzytest', 'xyz'), trim(LEADING 'xyz' FROM 'zzzytest'); ltrim \| ltrim -------+------- test \| test (1 row) postgres=# SELECT ltrim('zzzytestxyz', 'xyz'), trim(LEADING 'xyz' FROM 'zzzytestxyz'); ltrim \| ltrim ---------+--------- testxyz \| testxyz (1 row) postgres=# SELECT ltrim('xyxXxyLAST WORD', 'xy'), trim(LEADING 'xy' FROM 'xyxXxyLAST WORD'); ltrim \| ltrim --------------+-------------- XxyLAST WORD \| XxyLAST WORD (1 row) postgres=# SELECT rtrim('testxxzx', 'xyz'), trim(TRAILING 'xyz' FROM 'testxxzx'); rtrim \| rtrim -------+------- test \| test (1 row) postgres=# SELECT rtrim('xyztestxxzx', 'xyz'), trim(TRAILING 'xyz' FROM 'xyztestxxzx'); rtrim \| rtrim ---------+--------- xyztest \| xyztest (1 row) postgres=# SELECT rtrim('TURNERyxXxy', 'xy'), trim(TRAILING 'xy' FROM 'TURNERyxXxy'); rtrim \| rtrim -----------+----------- TURNERyxX \| TURNERyxX (1 row) ``` Closes #24911 from wangyum/SPARK-28109. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-19 12:47:18 -07:00
Xiangrui Meng	1b2448bc10	[SPARK-28056][PYTHON] add doc for SCALAR_ITER Pandas UDF ## What changes were proposed in this pull request? Add docs for `SCALAR_ITER` Pandas UDF. cc: WeichenXu123 HyukjinKwon ## How was this patch tested? Tested example code manually. Closes #24897 from mengxr/SPARK-28056. Authored-by: Xiangrui Meng <meng@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-06-17 20:51:36 -07:00
Bryan Cutler	90f80395af	[SPARK-28041][PYTHON] Increase minimum supported Pandas to 0.23.2 ## What changes were proposed in this pull request? This increases the minimum supported version of Pandas to 0.23.2. Using a lower version will raise an error `Pandas >= 0.23.2 must be installed; however, your version was 0.XX`. Also, a workaround for using pyarrow with Pandas 0.19.2 was removed. ## How was this patch tested? Existing Tests Closes #24867 from BryanCutler/pyspark-increase-min-pandas-SPARK-28041. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-06-18 09:10:58 +09:00
Mellacheruvu Sandeep	b7b4452553	[SPARK-24898][DOC] Adding spark.checkpoint.compress to the docs ## What changes were proposed in this pull request? Adding the spark.checkpoint.compress configuration parameter to the documentation ![](https://user-images.githubusercontent.com/3538013/59580409-a7013080-90ee-11e9-9b2c-3d29015f597e.png) ## How was this patch tested? Checked the Jekyll HTML docs locally. Also validated the HTML for any issues. Closes #24883 from sandeepvja/SPARK-24898. Authored-by: Mellacheruvu Sandeep <mellacheruvu.sandeep@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-16 22:54:08 -07:00
Takuya UESHIN	5ae1a6bf0d	[SPARK-28052][SQL] Make `ArrayExists` follow the three-valued boolean logic. ## What changes were proposed in this pull request? Currently `ArrayExists` always returns boolean values (if the arguments are not `null`), but it should follow the three-valued boolean logic: - `true` if the predicate holds at least one `true` - otherwise, `null` if the predicate holds `null` - otherwise, `false` This behavior change is made to match Postgres' equivalent function `ANY/SOME (array)`'s behavior: https://www.postgresql.org/docs/9.6/functions-comparisons.html#AEN21174 ## How was this patch tested? Modified tests and existing tests. Closes #24873 from ueshin/issues/SPARK-28052/fix_exists. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-15 10:48:06 -07:00
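The three rules above can be modelled in a few lines of Python, with `None` standing in for SQL `null`. This is a sketch of the described semantics, not the Catalyst expression itself. `array_exists` and `is_even` are hypothetical names, and returning `None` for a `None` array is an assumption based on the "if the arguments are not `null`" remark.

```python
def array_exists(values, predicate):
    """Three-valued EXISTS: True, None (unknown) or False."""
    if values is None:
        return None  # assumption: a null array yields null
    saw_null = False
    for value in values:
        result = predicate(value)
        if result is True:
            return True  # one true result decides the outcome
        if result is None:
            saw_null = True  # remember that an unknown result occurred
    return None if saw_null else False


def is_even(x):
    """Predicate that propagates null, like a SQL comparison would."""
    return None if x is None else x % 2 == 0


print(array_exists([1, 2, None], is_even))  # True
print(array_exists([1, None, 3], is_even))  # None
print(array_exists([1, 3], is_even))        # False
```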
Sean Owen	15462e1a8f	[SPARK-28004][UI] Update jquery to 3.4.1 ## What changes were proposed in this pull request? We're using an old-ish jQuery, 1.12.4, and should probably update for Spark 3 to keep up in general, but also to keep up with CVEs. In fact, we know of at least one resolved in only 3.4.0+ (https://nvd.nist.gov/vuln/detail/CVE-2019-11358). They may not affect Spark, but, if the update isn't painful, maybe worthwhile in order to make future 3.x updates easier. jQuery 1 -> 2 doesn't sound like a breaking change, as 2.0 is supposed to maintain compatibility with 1.9+ (https://blog.jquery.com/2013/04/18/jquery-2-0-released/) 2 -> 3 has breaking changes: https://jquery.com/upgrade-guide/3.0/. It's hard to evaluate each one, but the most likely area for problems is in ajax(). However, our usage of jQuery (and plugins) is pretty simple. Update jquery to 3.4.1; update jquery blockUI and mustache to latest ## How was this patch tested? Manual testing of docs build (except R docs), worker/master UI, spark application UI. Note: this really doesn't guarantee it works, as our tests can't test javascript, and this is merely anecdotal testing, although I clicked about every link I could find. There's a risk this breaks a minor part of the UI; it does seem to work fine in the main. Closes #24843 from srowen/SPARK-28004. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-14 22:19:20 -07:00
Yesheng Ma	3ddc77d9ac	[SPARK-21136][SQL] Disallow FROM-only statements and show better warnings for Hive-style single-from statements Current Spark SQL parser can have pretty confusing error messages when parsing an incorrect SELECT SQL statement. The proposed fix has the following effect. BEFORE: ``` spark-sql> SELECT * FROM test WHERE x NOT NULL; Error in query: mismatched input 'FROM' expecting {<EOF>, 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'GROUP', 'HAVING', 'INTERSECT', 'LATERAL', 'LIMIT', 'ORDER', 'MINUS', 'SORT', 'UNION', 'WHERE', 'WINDOW'}(line 1, pos 9) == SQL == SELECT * FROM test WHERE x NOT NULL ---------^^^ ``` where in fact the error message should be hinted to be near `NOT NULL`. AFTER: ``` spark-sql> SELECT * FROM test WHERE x NOT NULL; Error in query: mismatched input 'NOT' expecting {<EOF>, 'AND', 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'GROUP', 'HAVING', 'INTERSECT', 'LIMIT', 'OR', 'ORDER', 'MINUS', 'SORT', 'UNION', 'WINDOW'}(line 1, pos 27) == SQL == SELECT * FROM test WHERE x NOT NULL ---------------------------^^^ ``` In fact, this problem is brought by some problematic Spark SQL grammar. There are two kinds of SELECT statements that are supported by Hive (and thereby supported in SparkSQL): * `FROM table SELECT blahblah SELECT blahblah` * `SELECT blah FROM table` Reference [HiveQL single-from stmt grammar](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g) It is fine when these two SELECT syntaxes are supported separately. However, since we are currently supporting these two kinds of syntaxes in a single ANTLR rule, this can be problematic and therefore leading to confusing parser errors. This is because when a SELECT clause was parsed, it can't tell whether the following FROM clause actually belongs to it or is just the beginning of a new `FROM table SELECT *` statement. ## What changes were proposed in this pull request? 1. Modify ANTLR grammar to fix the above-mentioned problem. 
This fix is important because the previous problematic grammar does affect a lot of real-world queries. Due to the previous problematic and messy grammar, we refactored the grammar related to `querySpecification`. 2. Modify `AstBuilder` to have separate visitors for `SELECT ... FROM ...` and `FROM ... SELECT ...` statements. 3. Drop the `FROM table` statement, which is supported by accident and is actually parsed in the wrong code path. Both Hive and Presto do not support this syntax. ## How was this patch tested? Existing UTs and new UTs. Closes #24809 from yeshengm/parser-refactor. Authored-by: Yesheng Ma <kimi.ysma@gmail.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>	2019-06-11 18:30:56 -07:00
Zhu, Lipeng	3b37bfde2a	[SPARK-27949][SQL] Support SUBSTRING(str FROM n1 [FOR n2]) syntax ## What changes were proposed in this pull request? Currently, the usage of `substr/substring` is `substring(string_expression, n1 [,n2])`, but ANSI SQL defines the pattern as `SUBSTRING(str FROM n1 [FOR n2])`. This gap is inconvenient when switching to Spark SQL. - ANSI SQL-92: http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt The following major DB engines support the ANSI standard for substring: - PostgreSQL https://www.postgresql.org/docs/9.1/functions-string.html - MySQL https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_substring - Redshift https://docs.aws.amazon.com/redshift/latest/dg/r_SUBSTRING.html - Teradata https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/XnePye0Cwexw6Pny_qnxVA Oracle, SQL Server, Hive, and Presto don't have this additional syntax. ## How was this patch tested? Pass the Jenkins with the updated test cases. Closes #24802 from lipzhu/SPARK-27949. Authored-by: Zhu, Lipeng <lipzhu@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-10 09:05:10 -07:00
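For readers unfamiliar with the ANSI form, the sketch below models `SUBSTRING(str FROM n1 [FOR n2])` in Python for the simple case of a 1-based, positive start position. The function name `substring` and the decision to reject non-positive positions and negative lengths are assumptions made to keep the example small; Spark's handling of those edge cases is not covered here.

```python
def substring(s, start, length=None):
    """Model SUBSTRING(s FROM start [FOR length]) with a 1-based start.

    Illustration only: non-positive start positions and negative lengths
    are rejected instead of emulating any engine's edge-case rules.
    """
    if start < 1:
        raise ValueError("this sketch only handles start >= 1")
    begin = start - 1  # SQL positions are 1-based, Python indices 0-based
    if length is None:
        return s[begin:]
    if length < 0:
        raise ValueError("this sketch only handles length >= 0")
    return s[begin:begin + length]


print(substring("Spark SQL", 5))     # SUBSTRING('Spark SQL' FROM 5)       -> k SQL
print(substring("Spark SQL", 7, 3))  # SUBSTRING('Spark SQL' FROM 7 FOR 3) -> SQL
```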
Yuming Wang	2926890ffb	[SPARK-27970][SQL] Support Hive 3.0 metastore ## What changes were proposed in this pull request? It seems that some users are using Hive 3.0.0. This pr makes it support Hive 3.0 metastore. ## How was this patch tested? unit tests Closes #24688 from wangyum/SPARK-26145. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-06-07 15:24:07 -07:00
Thomas Graves	d30284b5a5	[SPARK-27760][CORE] Spark resources - change user resource config from .count to .amount ## What changes were proposed in this pull request? Change the resource config spark.{executor/driver}.resource.{resourceName}.count to .amount to allow future usage of containing both a count and a unit. Right now we only support counts - # of gpus for instance, but in the future we may want to support units for things like memory - 25G. I think making the user only have to specify a single config .amount is better than making them specify 2 separate configs of a .count and then a .unit. Change it now since it's a user-facing config. Amount also matches how the Spark on YARN configs are set up. ## How was this patch tested? Unit tests and manually verified on yarn and local cluster mode Closes #24810 from tgravescs/SPARK-27760-amount. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-06-06 14:16:05 -05:00
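The rename affects any configuration that still uses the `.count` suffix. The sketch below shows one way such keys could be rewritten in a plain dictionary of settings. `migrate_resource_conf` is a hypothetical helper and not something Spark provides; only the key pattern `spark.{executor/driver}.resource.{resourceName}.count` comes from the commit message.

```python
import re

# Matches spark.executor.resource.<name>.count and the driver equivalent.
_COUNT_KEY = re.compile(r"^(spark\.(?:executor|driver)\.resource\.[^.]+)\.count$")


def migrate_resource_conf(conf):
    """Return a copy of conf with resource '.count' keys renamed to '.amount'."""
    migrated = {}
    for key, value in conf.items():
        match = _COUNT_KEY.match(key)
        new_key = match.group(1) + ".amount" if match else key
        migrated[new_key] = value
    return migrated


old_conf = {"spark.executor.resource.gpu.count": "2", "spark.driver.memory": "4g"}
print(migrate_resource_conf(old_conf))
# {'spark.executor.resource.gpu.amount': '2', 'spark.driver.memory': '4g'}
```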
Jules Damji	b71abd654d	[MINOR][DOC] Avro data source documentation change ## What changes were proposed in this pull request? This is a minor documentation change. The page https://spark.apache.org/docs/latest/sql-data-sources-avro.html states "The date type and naming of record fields should match the input Avro data or Catalyst data". The term "Catalyst data" is confusing; it should instead refer to Spark's internal data types, such as StringType or IntegerType. ## How was this patch tested? There are no code changes; only doc changes. Closes #24787 from dmatrix/br-orc-ds.doc.changes. Authored-by: Jules Damji <dmatrix@comcast.net> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-06-04 16:17:53 -07:00
Luca Canali	adf72e26d9	[SPARK-27773][FOLLOWUP][DOC] Add numCaughtExceptions metric to monitoring doc ## What changes were proposed in this pull request? SPARK-27773 has introduced a new metric (counter) numCaughtExceptions to the Spark Dropwizard monitoring system. This PR adds an entry in the monitoring documentation to document this. Closes #24790 from LucaCanali/addDocFollowingSPARK27773. Authored-by: Luca Canali <luca.canali@cern.ch> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-04 08:40:32 -07:00
HyukjinKwon	d1f3c994c7	[SPARK-27942][DOCS][PYTHON] Note that Python 2.7 is deprecated in Spark documentation ## What changes were proposed in this pull request? This PR adds deprecation notes in Spark documentation. ## How was this patch tested? git grep -r "python 2.6" git grep -r "python 2.6" git grep -r "python 2.7" git grep -r "python 2.7" Closes #24789 from HyukjinKwon/SPARK-27942. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-04 07:59:25 -07:00
HyukjinKwon	db48da87f0	[SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations ## What changes were proposed in this pull request? `spark.sql.execution.arrow.enabled` was added when we added the PySpark Arrow optimization. Later, in the current master, the SparkR Arrow optimization was added and it is controlled by the same configuration, `spark.sql.execution.arrow.enabled`. There are two issues with this: 1. `spark.sql.execution.arrow.enabled` in PySpark was added in 2.3.0, whereas the SparkR optimization was added in 3.0.0. The stability is different, so it is problematic to change the default value for only one of the two optimizations first. 2. Suppose users want to share a JVM between PySpark and SparkR. They are currently forced to use the optimization for both or for neither if the configuration is set globally. This PR proposes two separate configuration groups for PySpark and SparkR for the Arrow optimization: - Deprecate `spark.sql.execution.arrow.enabled` - Add `spark.sql.execution.arrow.pyspark.enabled` (fallback to `spark.sql.execution.arrow.enabled`) - Add `spark.sql.execution.arrow.sparkr.enabled` - Deprecate `spark.sql.execution.arrow.fallback.enabled` - Add `spark.sql.execution.arrow.pyspark.fallback.enabled` (fallback to `spark.sql.execution.arrow.fallback.enabled`) Note that `spark.sql.execution.arrow.maxRecordsPerBatch` is used on the JVM side for both. Note that `spark.sql.execution.arrow.fallback.enabled` was added due to a behaviour change. We don't need it in SparkR, since the SparkR side has automatic fallback. ## How was this patch tested? Manually tested and some unit tests were added. Closes #24700 from HyukjinKwon/separate-sparkr-arrow. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-06-03 10:01:37 +09:00
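The fallback relationship between the new and the deprecated keys can be sketched as a dictionary lookup. This is a simplified model, not Spark's `SQLConf` machinery: `FALLBACKS` and `resolve_conf` are hypothetical names, and the default is passed in by the caller because the commit message does not state default values.

```python
# New key -> deprecated key it falls back to, as listed in the commit message.
FALLBACKS = {
    "spark.sql.execution.arrow.pyspark.enabled":
        "spark.sql.execution.arrow.enabled",
    "spark.sql.execution.arrow.pyspark.fallback.enabled":
        "spark.sql.execution.arrow.fallback.enabled",
}


def resolve_conf(conf, key, default):
    """Look up key, then its deprecated fallback key, then the default."""
    if key in conf:
        return conf[key]
    old_key = FALLBACKS.get(key)
    if old_key is not None and old_key in conf:
        return conf[old_key]
    return default


# Only the deprecated key is set: PySpark picks it up, SparkR does not.
conf = {"spark.sql.execution.arrow.enabled": "true"}
print(resolve_conf(conf, "spark.sql.execution.arrow.pyspark.enabled", "false"))  # true
print(resolve_conf(conf, "spark.sql.execution.arrow.sparkr.enabled", "false"))   # false
```

Because `spark.sql.execution.arrow.sparkr.enabled` has no fallback entry, setting the deprecated key globally no longer forces the optimization on for SparkR, which addresses the second issue listed above.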
gengjiaan	8feb80ad86	[SPARK-27811][CORE][DOCS] Improve docs about spark.driver.memoryOverhead and spark.executor.memoryOverhead. ## What changes were proposed in this pull request? I found the docs of `spark.driver.memoryOverhead` and `spark.executor.memoryOverhead` a little ambiguous. For example, the original doc of `spark.driver.memoryOverhead` starts with `The amount of off-heap memory to be allocated per driver in cluster mode`. But `MemoryManager` also manages a memory area named off-heap, which is used to allocate memory in Tungsten mode, so the description of `spark.driver.memoryOverhead` is confusing. `spark.executor.memoryOverhead` has the same problem. ## How was this patch tested? Existing UTs. Closes #24671 from beliefer/improve-docs-of-overhead. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-06-01 08:19:50 -05:00
Thomas Graves	1277f8fa92	[SPARK-27362][K8S] Resource Scheduling support for k8s ## What changes were proposed in this pull request? Add the ability to map the Spark resource configs spark.{executor/driver}.resource.{resourceName} to the Kubernetes Container builder so that we request resources (gpus/fpgas/etc) from Kubernetes. Note that the Spark configs will overwrite any resource configs users put into a pod template. I added a generic vendor config which is only used by Kubernetes right now. I intentionally didn't put it into the Kubernetes config namespace, just to avoid adding more config prefixes. I will add more documentation for this under jira SPARK-27492. I think it will be easier to do it all at once to get a cohesive story. ## How was this patch tested? Unit tests and manual testing on a k8s cluster. Closes #24703 from tgravescs/SPARK-27362. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-05-31 15:26:14 -05:00
Marcelo Vanzin	09ed64d795	[SPARK-27868][CORE] Better default value and documentation for socket server backlog. First, there is currently no public documentation for this setting. So it's hard to even know that it could be a problem if your application starts failing with weird shuffle errors. Second, the javadoc attached to the code was incorrect; the default value just uses the default value from the JRE, which is 50, instead of having an unbounded queue as the comment implies. So use a default that is a "rounded" version of the JRE default, and provide documentation explaining that this value may need to be adjusted. Also added a log message that was very helpful in debugging an issue caused by this problem. Closes #24732 from vanzin/SPARK-27868. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-29 14:56:36 -07:00
Yuanjian Li	8949bc7a3c	[SPARK-27665][CORE] Split fetch shuffle blocks protocol from OpenBlocks ## What changes were proposed in this pull request? Currently, OneForOneBlockFetcher reuses the OpenBlocks protocol to describe the fetch request for shuffle blocks, which makes extension work for shuffle fetching, like #19788 and #24110, very awkward. This PR splits the fetch request for shuffle blocks out of OpenBlocks into a new protocol message named FetchShuffleBlocks. It is loosely bound to ShuffleBlockId and can easily be extended by adding new fields to the protocol. ## How was this patch tested? Existing and newly added UTs. Closes #24565 from xuanyuanking/SPARK-27665. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-05-27 22:19:31 +08:00
DB Tsai	a12de29c1a	[SPARK-27838][SQL] Support user provided non-nullable avro schema for nullable catalyst schema without any null record ## What changes were proposed in this pull request? When the data is read from the sources, the catalyst schema is always nullable. Since Avro uses Union type to represent nullable, when any non-nullable avro file is read and then written out, the schema will always be changed. This PR provides a solution for users to keep the Avro schema without being forced to use Union type. ## How was this patch tested? One test is added. Closes #24682 from dbtsai/avroNull. Authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-05-24 21:47:14 +00:00
HyukjinKwon	cc0b9d41cd	[MINOR][DOCS][R] Use actual version in SparkR Arrow guide for copy-and-paste ## What changes were proposed in this pull request? To address https://github.com/apache/spark/pull/24506#discussion_r280964509 ## How was this patch tested? N/A Closes #24701 from HyukjinKwon/minor-arrow-r-doc. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-24 10:38:26 -07:00
Gabor Somogyi	4e7908f2e7	[MINOR][DOC] ForeachBatch doc fix. ## What changes were proposed in this pull request? ForeachBatch doc is wrongly formatted. This PR formats it. ## How was this patch tested? ``` cd docs SKIP_API=1 jekyll build ``` Manual webpage check. Closes #24698 from gaborgsomogyi/foreachbatchdoc. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-05-25 00:03:59 +09:00
Thomas Graves	74e5e41eeb	[SPARK-27488][CORE] Driver interface to support GPU resources ## What changes were proposed in this pull request? Added the driver functionality to get the resources. The user interface is: SparkContext.resources - I called it this to match the TaskContext.resources api proposed in the other PR. Originally it was going to be called SparkContext.getResources but changed to be consistent, if people have strong feelings I can change it. There are 2 ways the driver can discover what resources it has. 1) user specifies a discoveryScript, this is similar to the executors and is meant for yarn and k8s where they don't tell you what you were allocated but you are running in isolated environment. 2) read the config spark.driver.resource.resourceName.addresses. The config is meant to be used with standalone mode where the Worker will have to assign what GPU addresses the Driver is allowed to use by setting that config. When the user runs a spark application, if they want the driver to have GPU's they would specify the conf spark.driver.resource.gpu.count=X where x is the number they want. If they are running on yarn or k8s they will also have to specify the discoveryScript as specified above, if they are on standalone mode and cluster is setup properly they wouldn't have to specify anything else. We could potentially get rid of the spark.driver.resources.gpu.addresses config which is really meant to be an internal config for worker to set if the standalone mode Worker wanted to write a discoveryScript out and set that for the user. I'll wait for the jira that implements that to decide if we can remove. - This PR also has changes to be consistent about using resourceName everywhere. - change the config names from POSTFIX to SUFFIX to be more consistent with other areas in Spark - Moved the config checks around a bit since now used by both executor and driver. 
Note those might overlap a bit with https://github.com/apache/spark/pull/24374 so we will have to figure out which one should go in first. ## How was this patch tested? Unit tests and manually test the interface. Closes #24615 from tgravescs/SPARK-27488. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-05-23 11:46:13 -07:00


2427 commits